IBM Software Group
TCP/IP Configuration and Diagnosis with WebSphere MQ: Part II
Justin Fries
WebSphere® Support Technical Exchange

Agenda
Introduction
Review of Part I
Seven Problems
Errors and Errnos
Debug FFST® Files
Packet Traces
KeepAlive
Questions and Answers

Review: Network Layers

Review: Internet Protocol
IP Basics
Data is transmitted as packets
Packet size is limited by the network
Packets are routed based on IP addresses
    • IPv4: 192.168.1.100
    • IPv6: 2001:0db8:0000:0000:0000:0000:c980:00b4
IP Characteristics
Primarily machine-to-machine communication
Delivery is unreliable as packets may be lost or delayed
The packet header controls its handling on the network
The payload contains ICMP control messages or data

Review: Transmission Control Protocol
TCP Basics
Data is transmitted as segments
Each segment fits within an IP packet: “TCP/IP packets”
All routing across the network is handled by the IP layer
TCP Characteristics
Communication between two connected endpoints
Builds reliability on IP using segment header fields
Data is automatically retransmitted and put in order
The state of each connection is managed by the system

Review: WebSphere MQ and Sockets
WebSphere MQ channels use the sockets API
The listener waits for new inbound connections
MQ channels and clients connect to the listener
Accepted connections run in a pooling process
Established channels send and receive TSHes
TSH: The Transmission Segment Header used for control data, messages, MQI calls

Problem 1: The Error
A Linux sender channel goes right to the RETRYING state
The queue manager error log message reads:
    08/07/2008 10:06:36 AM - Process(5440.1) User(justinf) Program(runmqchl)
    AMQ9202: Remote host 'bombarde (192.168.1.11) (1417)' not available, retry later.

    EXPLANATION:
    The attempt to allocate a conversation using TCP/IP to host 'bombarde (192.168.1.11) (1417)' was not successful. However the error may be a transitory one and it may be possible to successfully allocate a TCP/IP conversation later.
    ACTION:
    Try the connection again later. If the failure persists, record the error values and contact your systems administrator. The return code from TCP/IP is 111 (X'6F'). The reason for the failure may be that this host cannot reach the destination host. It may also be possible that the listening program at host 'bombarde (192.168.1.11) (1417)' was not running. If this is the case, perform the relevant operations to start the TCP/IP listening program, and try again.
    ----- amqccita.c : 1288 -------------------------------------------------------

Problem 1: Errno Values
The errno value helps to explain system errors
It is set when a function like connect() fails
Each errno value has a name beginning “E…”
    • ENOMEM, ECONNRESET, EADDRINUSE…
    • On Windows, network errors begin “WSAE…”
Every system has its own values, e.g. ECONNRESET:
    AIX        73       OpenVMS    54
    HP-UX      232      Solaris    131
    IBM i      3426     Windows    10054
    Linux      104      z/OS       1121
    NonStop    4120     z/VSE      1121

Problem 1: Errno Lookup
Where can you look for errno values?
The IBM® Support Assistant (ISA) tool
System documentation and manuals
System header files (errno.h)
UNIX and Linux man pages
Search the web
On Windows the “net” program can look these up
    C:\> net helpmsg 10054
    An existing connection was forcibly closed by the remote host.
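
Another way to look up errno values, not mentioned on the slides, is a scripting language that exposes the system's own table. The following is a minimal sketch in Python, assuming a Python 3 interpreter is available on the machine being diagnosed. The helper name describe_errno is hypothetical. Because every system has its own values, the numbers printed are those of the machine where the script runs, so run it on the system that logged the error.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, description) for an errno value on this platform.

    describe_errno is a hypothetical helper name used for illustration only.
    """
    # errno.errorcode maps numeric values to names such as 'ECONNREFUSED'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the system's text for the value.
    return name, os.strerror(code)


if __name__ == "__main__":
    # Use the symbolic constants so the script is portable; the numeric
    # values printed here differ from one operating system to another.
    for code in (errno.ECONNRESET, errno.ECONNREFUSED, errno.ETIMEDOUT):
        name, text = describe_errno(code)
        print(f"{name} {code} /* {text} */")
```

To look up a number taken from a WebSphere MQ message, pass it directly, for example describe_errno(111) on the Linux system that reported return code 111.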

Problem 1: Errno Utility
The errno utility is a new script for UNIX and Linux
    > errno 13 EINVAL 128
    Errno lookup on Solaris 10 (sparc):
    EACCES          13     /* Permission denied */
    EINVAL          22     /* Invalid argument */
    ENETUNREACH     128    /* Network is unreachable */
http://www.ibm.com/support/docview.wss?rs=171&uid=swg21321347
Let’s look up the errno from the WebSphere MQ message:
    > errno 111
    Errno lookup on SUSE LINUX Enterprise Server 9 (ppc):
    ECONNREFUSED    111    /* Connection refused */

Problem 1: Explanation
The errno explains what happened here
The channel tried to connect() using the CONNAME
The system sent a SYN packet (three-way handshake)
The remote system replied with a RST packet
The system set errno to ECONNREFUSED
The Linux manual page confirms this explanation
The output from man connect reads:
    ECONNREFUSED
        No one listening on the remote address.

Problem 1: Further Information
If no listener is running, channels cannot connect
Message channels will fail and print an error message
WebSphere® MQ Client programs will fail during MQCONN/X
Up through V6.0 the reason code was always:
    MQRC_Q_MGR_NOT_AVAILABLE      2059
WebSphere MQ V7.0 adds more specific reason codes:
    MQRC_CHANNEL_NOT_AVAILABLE    2537
    MQRC_HOST_NOT_AVAILABLE       2538

Problem 2: The Error
A Solaris sender channel goes to RETRYING after 75 seconds
The queue manager error log message reads:
    08/08/08 15:22:00 - Process(1405110.1) User(justinf) Program(runmqchl)
    AMQ9202: Remote host 'bombarde (192.168.1.11) (1427)' not available, retry later.

    EXPLANATION:
    The attempt to allocate a conversation using TCP/IP to host 'bombarde (192.168.1.11) (1427)' was not successful. However the error may be a transitory one and it may be possible to successfully allocate a TCP/IP conversation later.
    ACTION:
    Try the connection again later. If the failure persists, record the error values and contact your systems administrator. The return code from TCP/IP is 145 (X'91'). The reason for the failure may be that this host cannot reach the destination host. It may also be possible that the listening program at host 'bombarde (192.168.1.11) (1427)' was not running. If this is the case, perform the relevant operations to start the TCP/IP listening program, and try again.
    ----- amqccita.c : 1288 -------------------------------------------------------

Problem 2: The Timeout
The errno script explains this code:
    > errno 145
    Errno lookup on Solaris 10 (sparc):
    ETIMEDOUT       145    /* Connection timed out */
Why was there a 75 second delay?
The first SYN packet received no acknowledgement
The system retransmitted the SYN after 3 seconds
And again after delays of 6, 12 and 24 seconds
At 75 seconds the system set errno to ETIMEDOUT

Problem 2: Channel Status
During that 75 second period there are other clues
    DISPLAY CHSTATUS(SWELL.TO.GREAT)
    AMQ8417: Display Channel Status details.
       CHANNEL(SWELL.TO.GREAT)                 CHLTYPE(SDR)
       CONNAME(192.168.1.11(1427))             CURRENT
       RQMNAME( )                              STATUS(BINDING)
       SUBSTATE(NETCONNECT)                    XMITQ(GREAT)
The STATUS describes the general state of the channel
The SUBSTATE is specific about what it is doing

Problem 2: SUBSTATE
SUBSTATE may indicate socket activity for TCP channels
NAMESERVER
    • getaddrinfo(), gethostbyname()
NETCONNECT
    • connect()
SEND
    • write()/send(), poll()/select()
RECEIVE
    • read()/recv(), poll()/select()

Problem 2: The netstat Program
The netstat program shows the connection status
    netstat -an
    TCP: IPv4
       Local Address        Remote Address    Swind Send-Q Rwind Recv-Q  State
    -------------------- -------------------- ----- ------ ----- ------ -------
    192.168.1.23.32892   192.168.1.11.1427        0      0 49640      0 SYN_SENT
This connection is in a SYN_SENT state
It has sent a SYN but received no SYN+ACK
There is no data queued on the socket buffers
The same command on the other machine shows nothing

TCP State Diagram
Our TCP connection is stuck right here in the three-way handshake
Normal client path is shown in red
Normal server path is shown in blue
Our connection is about to time out rather than continue

Problem 2: Explanation
The absence of any reply packets is important
Usually it means the remote machine is offline
Here we know the machine and listener are up
In this case, a personal firewall is to blame
The firewall is blocking normal TCP replies
The firewalled system is in “stealth mode”
Solution: Open the port in the firewall

Problem 3: The Error
Every now and then a pair of unexplained errors appears
The queue manager error log message reads:
    08/09/08 19:09:51 - Process(28536.13) User(justinf) Program(amqrmppa.exe)
    AMQ9209: Connection to host 'foehammer (192.168.1.16)' closed.

    EXPLANATION:
    An error occurred receiving data from 'foehammer (192.168.1.16)' over TCP/IP. The connection to the remote host has unexpectedly terminated.
    ACTION:
    Tell the systems administrator.
    ----- amqccita.c : 3182 -------------------------------------------------------
    08/09/08 19:09:51 - Process(28536.13) User(justinf) Program(amqrmppa.exe)
    AMQ9492: The TCP/IP responder program encountered an error.

    EXPLANATION:
    The responder program was started but detected an error.
    ACTION:
    Look at previous error messages in the error files to determine the error encountered by the responder program.
    ----- amqrmrsa.c : 455 --------------------------------------------------------

Problem 3: The Responder
The TCP/IP responder program is the process in which an accepted channel connection runs
The listener only accepts the new connection before handing it off
The runmqlsr listener spreads incoming connections across MQ pooling processes: amqrmppa
The UNIX inetd listener runs each incoming connection in a separate unthreaded process: amqcrsta

Problem 3: Missing Information
Neither message contains an errno value
The connection closed with no errors at all
The other side had no more data to send
This means we received a FIN packet
There are no related channel messages in the log
No “Channel started” message
No “Channel ended (ab)normally” message
Therefore this connection was not a channel

Problem 3: Explanation
These errors were probably due to a port scanner
The scanner connects to any ports it can
After connecting the scanner closes its socket
Scanners are used by security teams (and hackers)
WebSphere MQ 7.0 no longer reports port scanners
It ignores new connections that close immediately
An environment variable allows the old behavior
    MQ_REPORT_NETWORK_PROBE=1
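
The connect-then-close behavior described in Problem 3 can be reproduced with plain sockets. The following is a minimal sketch in Python, not WebSphere MQ code: the function names probe, accept_one and run_demo are hypothetical, and the listener is an ordinary socket on the loopback address rather than runmqlsr. It shows that when the peer closes without sending anything, the accepting side sees an orderly end of stream (an empty read) and no errno, which matches the absence of an errno value in the AMQ9209 message.

```python
import socket
import threading


def probe(port):
    """Connect to the local port and close at once, as a port scanner would."""
    with socket.create_connection(("127.0.0.1", port), timeout=5):
        pass  # no data is sent; leaving the block closes the socket (FIN)


def accept_one(listener, results):
    """Accept a single connection and record what the first recv() returns."""
    conn, _addr = listener.accept()
    with conn:
        conn.settimeout(5)
        # An empty bytes object means the peer closed cleanly: no error is raised.
        results.append(conn.recv(4096))


def run_demo():
    """Start a throwaway listener, probe it once, and return the recv() results."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(5)
    listener.bind(("127.0.0.1", 0))  # port 0: let the system pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    results = []
    worker = threading.Thread(target=accept_one, args=(listener, results))
    worker.start()
    probe(port)
    worker.join(timeout=10)
    listener.close()
    return results


if __name__ == "__main__":
    print(run_demo())
```

A server that logs every empty first read will report each probe, which is the pre-V7.0 behavior; ignoring such connections is the alternative described above.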

Problem 4: The Error
An AIX sender channel hangs for five minutes when starting
The queue manager error log message reads:
    08/10/08 22:20:58 - Process(1396752.1) User(justinf) Program(runmqchl)
    AMQ9202: Remote host 'cortana' not available, retry later.

    EXPLANATION:
    The attempt to allocate a conversation using TCP/IP to host 'cortana' was not successful. However the error may be a transitory one and it may be possible to successfully allocate a TCP/IP conversation later.
    ACTION:
    Try the connection again later. If the failure persists, record the error values and contact your systems administrator. The return code from TCP/IP is 0 (X'0'). The reason for the failure may be that this host cannot reach the destination host. It may also be possible that the listening program at host 'cortana' was not running. If this is the case, perform the relevant operations to start the TCP/IP listening program, and try again.

Problem 4: Some Oddities
What is errno zero? It doesn’t exist: Zero means success
    > errno 0
    Errno lookup on AIX 5300-08-02-0822 (powerpc):
    * Errno 0 not found.
What does the system show during the five minutes?
The netstat program lists nothing on either side
The system does not know about the connection
    • Perhaps the connection does not yet exist

Problem 4: Channel Status
While the channel is waiting its status shows:
    DISPLAY CHSTATUS(POSITIV.TO.GREAT) JOBNAME
    AMQ8417: Display Channel Status details.
       CHANNEL(POSITIV.TO.GREAT)               CHLTYPE(SDR)
       CONNAME(192.168.1.11(1427))             CURRENT
       JOBNAME(0015501000000001)               RQMNAME( )
       STATUS(BINDING)                         SUBSTATE(NAMESERVER)
       XMITQ(GREAT)
The SUBSTATE shows we are waiting for a DNS lookup
The channel is process identifier 0x00155010: 1396752

Problem 4: Debug FFSTs
To request a debug FFST file for a channel
Use the JOBNAME to find its process identifier
On UNIX systems send SIGUSR2 to that pid:
    > kill -USR2 1396752
On Windows systems use amqldbgn instead:
    C:\> amqldbgn -p 1396752
These commands do not print any messages to the screen
Look in the top-level errors directory for the file

Problem 4: The Debug FFST
The FFST contains much more data than this (2500+ lines):
    WebSphere MQ First Failure Symptom Report
    =========================================
    Date/Time         :- Sun August 10 2008 22:18:25 EST
    LVLS              :- 7.0.0.0
    Product Long Name :- WebSphere MQ for AIX
    Probe Id          :- CO368255
    Component         :- cciTcpResolveHostname
    Program Name      :- runmqchl
    Process           :- 1396752
    Major Errorcode   :- OK

    MQM Function Stack
    rriCaller
    rriCallerEntry
    rriInitSess
    ccxAllocConv
    cciTcpAllocConv
    cciTcpResolveHostname
    xcsFFST

Problem 4: Operating System Tools
Some systems allow for even more detailed information
On AIX and Solaris the stackit script is useful
Usually must be run by root for MQ processes
http://www.ibm.com/support/docview.wss?rs=171&uid=swg21179404
It dumps by pid, by name, or by matching an argument:
    > stackit -p 1396752
    > stackit -n runmqchl
    > stackit -m POSITIV.TO.GREAT

Problem 4: Stackit Output
This output includes some very low-level system information:
    1396752: /usr/mqm/bin/runmqchl -c POSITIV.TO.GREAT -m POSITIV
    ---------- tid# 3363011 (pthread ID: 1) ----------
    0x0900000000111a34 __fd_poll(??, ??, ??) + 0x98
    0x09000000000a9b00 poll(??, ??, ??) + 0xc
    0x09000000000a8988 res_nsend(0x1100bdee8, 0xfffffffffffbdd0, 0x2d0000002d...
    0x09000000000fb26c res_nquery(??, ??, ??, ??, ??, ??) + 0x130
    0x09000000000fa8d0 res_nquerydomain(??, ??, ??, ??, ??, ??, ??) + 0x180
    0x09000000000fac94 res_nsearch(??, ??, ??, ??, ??, ??) + 0x320
    0x09000000000b3acc res_search(??, ??, ??, ??, ??) + 0xa8
    0x09000000001005e8 ho_byname2(??, ??, ??) + 0x13c
    0x090000000011b560 ho_byname2(??, ??, ??) + 0x1ac
    0x09000000000a6570 gethostbyname2(??, ??) + 0x190
    0x09000000000a9eb8 getaddrinfo2(??, ??, ??, ??) + 0x384
    0x09000000000ab410 getaddrinfo(??, ??, ??, ??) + 0x498
    0x090000000606bf90 cciTcpResolveHostname() + 0x2f0
    0x090000000605081c cciTcpAllocConv() + 0x2bc
    0x0900000003c04c48 ccxAllocConv() + 0xe8
    0x0900000003c8e894 rriInitSess() + 0xc34
    0x0900000003da3724 rriCallerEntry() + 0xa64
    0x00000001000007d4 main() + 0x374
    0x0000000100000288 __start() + 0x90

Problem 4: Explanation
In this case the DNS server was not responding at all
The nameserver address was wrong in the system
MQ could not convert the CONNAME to an IP address
The error message in this case is somewhat misleading
Without an IP address, MQ never created a socket
Therefore there was no TCP/IP connection at all
    • No netstat output, no errno either
WebSphere MQ development is looking at this message

Problem 5: The Issue
A channel between queue managers sometimes runs slowly
There are no explanatory messages in the logs
The channel remains in a RUNNING state
    DISPLAY CHSTATUS(SWELL.TO.GREAT) JOBNAME
    AMQ8417: Display Channel Status details.
       CHANNEL(SWELL.TO.GREAT)                 CHLTYPE(SDR)
       CONNAME(192.168.1.11(1427))             CURRENT
       JOBNAME(000AE02600000001)               RQMNAME(GREAT)
       STATUS(RUNNING)                         SUBSTATE(SEND)
       XMITQ(GREAT)

Problem 5: The Sender Debug FFST
    WebSphere MQ First Failure Symptom Report
    =========================================
    Date/Time         :- Mon August 11 2008 17:24:33 EST
    LVLS              :- 7.0.0.0
    Product Long Name :- WebSphere MQ for AIX
    Probe Id          :- XC464255
    Component         :- xcsWaitFd
    Program Name      :- runmqchl
    Process           :- 1396752
    Major Errorcode   :- OK

    MQM Function Stack
    rriCaller
    rriCallerEntry
    rriSendData
    ccxSend
    cciTcpSend
    xcsWaitFd
    xcsFFST
The xcsWaitFd function uses poll() or select() to wait until the socket can accept more data
It can also wait for data to arrive on an empty socket

Problem 5: The Receiver Status
The receiver channel also shows no errors at all
There are no messages to be found in the logs
The channel remains in a RUNNING state:
    DISPLAY CHSTATUS(SWELL.TO.GREAT) JOBNAME
    AMQ8417: Display Channel Status details.
       CHANNEL(SWELL.TO.GREAT)                 CHLTYPE(RCVR)
       CONNAME(192.168.1.23)                   CURRENT
       JOBNAME(0016603E0000000D)               RQMNAME(SWELL)
       STATUS(RUNNING)                         SUBSTATE(RECEIVE)
       XMITQ( )

Problem 5: The Receiver Debug FFST
    WebSphere MQ First Failure Symptom Report
    =========================================
    Date/Time         :- Mon August 11 2008 17:24:44 EST
    LVLS              :- 6.0.2.4
    Product Long Name :- WebSphere MQ for AIX
    Probe Id          :- CO052255
    Component         :- cciTcpReceive
    Program Name      :- amqrmppa
    Process           :- 1466430
    Major Errorcode   :- OK

    MQM Function Stack
    ccxResponder
    rrxResponder
    rriReceiveData
    ccxReceive
    cciTcpReceive
    xcsFFST
The receiver channel is busy reading data from the socket
As it reads each message the receiver will MQPUT the message to its destination

Problem 5: Packet Trace
Is there a communications problem here?
Only if the receiver FFST also showed xcsWaitFd()
In this case the receiver has data available on its socket
We can take a packet trace to be sure
The trace will show packet traffic for the channel
Missing ACKs, or SACKs, may indicate a network problem
Look for the receiver window to judge its performance
    • If its window is closed (zero), it isn’t keeping up
    tcpdump -w perf.cap host 192.168.1.11 and port 1427

Problem 5: Packet Trace Analysis

Problem 5: Explanation
The sender channel is running faster than the receiver
The socket buffers are full in one direction:
The receiving machine may lack the resources to keep up
Investigate the activity on the receiving machine

Problem 6: The Issue
A WebSphere MQ client program fails with reason code 2009
This code is MQRC_CONNECTION_BROKEN
The client is connected through a firewall
This happens after a period of inactivity
In Part I we discussed how firewalls can kill idle connections
Packet traces from both sides can confirm the root cause
The solution is to enable KeepAlive for the queue manager
    TCP:
       KeepAlive=Yes

Problem 6: AIX tcp_keepidle
For help on the timer:
    /usr/sbin/no -h tcp_keepidle
To see details about its limits and current values:
    /usr/sbin/no -L tcp_keepidle
    NAME            CUR    DEF    BOOT   MIN    MAX    UNIT        TYPE
    ---------------------------------------------------------------
    tcp_keepidle    14400  14400  14400  1      8E-1   halfsecond  C
To change the value temporarily, use a value in half-seconds:
    /usr/sbin/no -o tcp_keepidle=600
To change the value permanently, add the ‘p’ flag:
    /usr/sbin/no -po tcp_keepidle=600

Problem 6: HP-UX tcp_keepalive_interval
To display the timer value, which is in milliseconds:
    /usr/bin/ndd -get /dev/tcp tcp_keepalive_interval
To change the value temporarily:
    /usr/bin/ndd -set /dev/tcp tcp_keepalive_interval 300000
To change the value on reboot, edit /etc/rc.config.d/nddconf
Follow the examples in the file comments to add an entry:
    TRANSPORT_NAME[0]=tcp
    NDD_NAME[0]=tcp_keepalive_interval
    NDD_VALUE[0]=300000

Problem 6: Linux tcp_keepalive_time
To display the timer value, which is in seconds:
    /sbin/sysctl net.ipv4.tcp_keepalive_time
To change the value temporarily:
    /sbin/sysctl -w net.ipv4.tcp_keepalive_time=300
To change the value on reboot, add a line to /etc/sysctl.conf:
    net.ipv4.tcp_keepalive_time=300
To pick up the new setting immediately:
    /sbin/sysctl -p

Problem 6: Solaris tcp_keepalive_interval
To display the timer value, which is in milliseconds:
    /usr/sbin/ndd -get /dev/tcp tcp_keepalive_interval
To change the value temporarily:
    /usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 300000
To change the value on reboot, add this command to a startup script
One common way is to create a script with a name like nettune
Link the script under one of the /etc/rcX.d directories
Refer to the Solaris documentation for details

Problem 6: Windows KeepAliveTime
To work with the value run the regedit program and go to:
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
If the KeepAliveTime value does not exist:
Right-click on Parameters and choose New > DWORD Value
Name it KeepAliveTime and give a value in milliseconds
Otherwise double-click KeepAliveTime to enter a new value
Reboot the system to pick up any changes

Problem 6: Confirming the Solution

Problem 6: New V7.0 Client Feature
KeepAlive has always been necessary for MQ clients
When the client is between MQI calls it is silent
The client cannot respond to server heartbeats
WebSphere MQ V7.0 clients can now heartbeat at any time
MQ starts a thread which responds to heartbeats
This occurs only when SHARECNV is non-zero
Therefore a V7.0 client and V7.0 server do not need KeepAlive
MQ heartbeats will keep the connection alive
KeepAlive is still necessary for older clients

Problem 6: V7.0 Client Heartbeats

Problem 7: Bonus Round
Define a sender channel and its transmission queue
Be sure to pick an unused port number
    DEFINE CHANNEL(ECHO.SDR) CHLTYPE(SDR) TRPTYPE(TCP) +
           CONNAME('localhost(9999)') LOCLADDR('(9999)') XMITQ(ECHO.XQ)
    DEFINE QLOCAL(ECHO.XQ) USAGE(XMITQ)
    START CHANNEL(ECHO.SDR)
Why does this channel start successfully?
What happens if you try to send messages?

Summary
Reviewed IP, TCP and socket programming
WebSphere MQ error messages and errno values
MQ channel status and TCP connection status
Debug FFSTs, stack dumps and packet traces
KeepAlive configuration on several platforms
New features in WebSphere MQ V7.0

Additional WebSphere Product Resources
Discover the latest trends in WebSphere Technology and implementation, participate in technically-focused briefings, webcasts and podcasts at:
    http://www.ibm.com/developerworks/websphere/community/
Learn about other upcoming webcasts, conferences and events:
    http://www.ibm.com/software/websphere/events_1.html
Join the Global WebSphere User Group Community:
    http://www.websphere.org
Access key product show-me demos and tutorials by visiting IBM Education Assistant:
    http://www.ibm.com/software/info/education/assistant
View a Flash replay with step-by-step instructions for using the Electronic Service Request (ESR) tool for submitting problems electronically:
    http://www.ibm.com/software/websphere/support/d2w.html
Sign up to receive weekly technical My support emails:
    http://www.ibm.com/software/support/einfo.html

Questions and Answers