Troubleshooting Methods for UCS Customer POCs and Labs
August 2012
© 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential. Internal Only – Do not Distribute.

• Why would we need this presentation?
• Overview of some recurring items to address
    • Infrastructure items
• Adapter and IOM systems troubleshooting
• Server systems troubleshooting
• Operating systems troubleshooting
• Chassis systems troubleshooting
• Fabric Interconnect systems troubleshooting

• Often there are "lessons learned" that can be shared to simplify the POC process
• Some of the known bugs and operational details get lost in the depths of planning customer scenarios
• This is not a review of how to develop a testing plan, nor a script for running a POC
• The key goal is to share information so that we put our best foot forward
• All required UCS system training and real-world hands-on experience is assumed
• This is a living presentation: the goal is to keep it updated with these lessons learned and common bug issues, and to present it to the field on request

• Data plane and many control plane functions run Active/Active across the cluster
• User access to UCSM is Active/Standby, through the cluster Virtual IP (VIP); each FI also has its own address (FI-A IP, FI-B IP)
• These management connections are where the blade CIMC connections are reached, via the unified I/O of UCS
• Blade CIMCs are actually NAT entries on the mgmt port
• Use the UCSM client (Centrale) or the CLI to manage and troubleshoot

• The fundamental items on a UCS are the Managed Information Tree (MIT) and the Application Gateways (AGs) that do the work
• This applies to any UCS form factor device
[Diagram: the XML API sits in front of the MIT, which drives the Application Gateways]
    • Blade/Rack AG: BIOS, RAID, CPLD, boot method, BMC setup, alerting, etc.
    • Switch AG: Ethernet ports, networks, QoS policy, security policy, linkages to server NICs, network segments, etc.
    • Fabric AG: storage segments, VSAN mappings, F-port trunking, F-port channeling, zoning **, etc.
    • NIC AG: number of NICs and the networks to tie in, with QoS and security policy; number of HBAs and the VSANs to tie in, with QoS and security policy; etc.
    • Other AGs: VMM AG, etc.

• A given process runs through many stages (FSM stages)
• Some stages can be skipped if they are unneeded, or depending on the type of action (shallow vs. deep)
• Almost all actions contain a verification step that the action completed
• Logs are retained
• View and monitor

• These will feed into the normal fault policy of UCSM
    • FSM faults are just one type; refer to the link below for a listing of types
    • Highly recommend at least becoming familiar with the layout of the UCS faults and error message reference at the URL below
• Severity can change over the life of a fault
• For POC labs, recommend eliminating Critical, Major, and Minor faults
    • Others will be present in the normal course of actions waiting and being performed
http://www.cisco.com/en/US/partner/docs/unified_computing/ucs/ts/faults/reference/ErrMess.html

• Each UCS IOM in a UCS 5100 Blade Server Chassis is connected to a 6000 Series Fabric Interconnect, for redundancy or bandwidth aggregation
• The Fabric Extender provides 10GE ports to the Fabric Interconnect
• Link physical health checks and chassis discovery occur over these links
[Diagram: UCS 6000 Series FI A and FI B cabled to a UCS 5100 Series Blade Server Chassis]
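The fault guidance above (clear Critical, Major, and Minor faults before a POC) can be sketched as a small filter over fault records. This is a minimal sketch, not a definitive tool: the sample response is hand-written, its element and attribute names are assumed to follow the UCS XML API fault class (faultInst), and every dn, code, and description value is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumed to follow the
# UCS XML API fault class (faultInst); dn, code and descr values are made up.
SAMPLE_RESPONSE = """
<configResolveClass response="yes" classId="faultInst">
  <outConfigs>
    <faultInst dn="sys/example-1/fault-F0001" code="F0001" severity="critical" descr="example critical fault"/>
    <faultInst dn="sys/example-2/fault-F0002" code="F0002" severity="major" descr="example major fault"/>
    <faultInst dn="sys/example-3/fault-F0003" code="F0003" severity="warning" descr="example warning"/>
    <faultInst dn="sys/example-4/fault-F0004" code="F0004" severity="info" descr="example FSM progress note"/>
    <faultInst dn="sys/example-5/fault-F0005" code="F0005" severity="minor" descr="example minor fault"/>
  </outConfigs>
</configResolveClass>
"""

# Severities the slide recommends eliminating before a POC, most severe first.
BLOCKING_SEVERITIES = ("critical", "major", "minor")


def blocking_faults(xml_text):
    """Return (severity, dn, descr) tuples for faults to clear, most severe first."""
    root = ET.fromstring(xml_text.strip())
    found = [
        (fault.get("severity"), fault.get("dn"), fault.get("descr"))
        for fault in root.iter("faultInst")
        if fault.get("severity") in BLOCKING_SEVERITIES
    ]
    return sorted(found, key=lambda item: BLOCKING_SEVERITIES.index(item[0]))


for severity, dn, descr in blocking_faults(SAMPLE_RESPONSE):
    print(f"{severity:8} {dn}  {descr}")
```

Warning and info entries are deliberately left out of the result, matching the note that other faults appear in the normal course of actions.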
Various points of monitoring and visibility
[Diagram: monitoring points, including the UCS 2xxx Series IOM]

• Visibility into many counters within UCS, both live/now and history
• These are "count up" counters with raw numbers
    • Use the delta to monitor the changes over the collection intervals

• Setting a system baseline
    • Most of our initial issues in CPOC situations are due to firmware issues
    • All system components must be on the same firmware package version
    • Host and mgmt firmware policies are excellent tools for this, rather than going server by server
    • Viewing the components of a package is shown
    • When demo FIs arrive, individually set them to a common UCSM version and erase their configurations before attempting to join them in a cluster

• Setting a system baseline
    • First get the package onto the UCS
    • Create the right FW packages for the POC
    • Conformance to the package can be checked in a single screen

• Setting a system baseline
    • Can upgrade via the bundle mechanism at the start of the POC
    • The bundle option is there for both update and activate, and handles the whole upgrade
    • This is totally disruptive, so do not use this method during the POC (after staging)
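The baseline rule above (every component on the same firmware package version) amounts to a simple conformance check. The following is a minimal sketch under stated assumptions: the component names and the target version string are hypothetical, and real components report versions in differing formats, so plain string equality is a simplification of what the UCSM conformance screen shows.

```python
def nonconforming(running_versions, target_version):
    """Return the components whose running version differs from the target."""
    return {
        component: version
        for component, version in running_versions.items()
        if version != target_version
    }


# Hypothetical inventory: component names and version strings are made up.
inventory = {
    "fabric-interconnect-A": "2.0(2q)",
    "fabric-interconnect-B": "2.0(2q)",
    "chassis-1/iom-1": "2.0(2q)",
    "chassis-1/iom-2": "1.4(3s)",
    "chassis-1/blade-1/cimc": "2.0(2q)",
    "chassis-1/blade-1/adapter-1": "1.4(3s)",
}

drift = nonconforming(inventory, "2.0(2q)")
for component, version in sorted(drift.items()):
    print(f"{component}: running {version}, expected 2.0(2q)")
```

An empty result is the state to reach before the POC starts; anything listed is a candidate for a host or mgmt firmware policy.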
• Upgrade prep, checkpoints, cleanup: when uptime is key (not a POC)
    • Implement a management interface monitoring policy
    • Prior to upgrading one fabric, disable all of its upstream data and FC ports
    • Also disable its mgmt interface, so that KVM traffic moves to the fabric that will not be taken down
    • This forces traffic to the fabric that will stay up, and allows a quick recovery if there is an error
    • Upgrade the fabric, then restore its uplinks and mgmt interface
    • Repeat on the peer fabric, but only after the cluster state shows HA_READY when you connect to local management in the CLI and run "show cluster extended"

• Discovery policy vs. re-acknowledgement behavior
    • The discovery policy is just that: a floor on the number of links required before a chassis will be discovered
    • The link policy dictates bringing up port channels from the IOMs to the FIs, after discovery
    • You must then re-acknowledge the chassis (disruptive to blade connectivity) for all connections beyond those used for discovery to be put to use
    • Always re-acknowledge the chassis after it is discovered, and after any cabling changes

• Multicast behavior
    • In all current versions, IGMP snooping is enabled and cannot be turned off
    • Only 224.0.0.X is flooded within the UCS
    • This is fundamentally different from traditional switches, which flood
    • An upstream PIM router or IGMP snooping querier is needed for proper multicast flow beyond the new-flow timeout (~180 seconds)
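The multicast rule above can be checked per group address. This is a minimal sketch that reads "224.0.0.X" as the 224.0.0.0/24 block; the function names are hypothetical and exist only for this illustration.

```python
import ipaddress

# "224.0.0.X" from the slide, read here as the 224.0.0.0/24 block.
FLOODED_RANGE = ipaddress.ip_network("224.0.0.0/24")


def is_flooded_in_ucs(group):
    """True if the multicast group falls in the range the slide says is flooded."""
    address = ipaddress.ip_address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return address in FLOODED_RANGE


def needs_upstream_querier(group):
    """Groups outside the flooded range rely on IGMP snooping state, so they
    need an upstream PIM router or IGMP snooping querier to keep flowing."""
    return not is_flooded_in_ucs(group)


for group in ("224.0.0.5", "224.0.1.1", "239.1.1.1"):
    print(group, "flooded" if is_flooded_in_ucs(group) else "needs querier")
```

A test plan that uses application groups such as 239.x.x.x should therefore confirm the upstream querier before the roughly 180 second timeout is reached.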
• As preparation, it is always best to review the release notes
• This is the PRIMARY method by which we notify the field of issues to be aware of
• They can be large given the product breadth, but for a POC or install they are a great starting step

CiscoLive-A# connect nxos
CiscoLive-A(nxos)# show interface brief
CiscoLive-A(nxos)# show interface fex-fab

[Screenshot annotations, displayed from FI "A"]
    • Ports to blade adapters
    • Internal VLAN interface for management
    • Eth X/Y/Z, where X = chassis number, Y = mezz card number, Z = IOM port number
    • 10 Gig links to chassis 1
    • 10 Gig links to chassis 2

• How to locate the MAC of the interfaces
    • Find the adapter of interest in UCSM or from the NX-OS CLI
    • A MAC address found on Fabric Interconnect A should not be visible on Fabric Interconnect B. If it is, the customer is doing per-flow or per-packet load balancing at the host level, which is not allowed on UCS B-Series

[Screenshots: from the UCS CLI; from UCSM]

From FI NX-OS:
    show interface veth 752
    show int vfc 756
[Diagram: management link plus Path 1 and Path 2 to fabrics A and B over links 1/1 and 2/2; vCON-1 and vCON-2; Interface 1 and Interface 2; Slot 7 and Slot 8]

[Diagram: UCS 2104XP IOM in a chassis with UCS B200 M1 blades 1 to 6 (each vCON-2, Interface 1) and a UCS B250 M1 as blade 7 in slots 7 and 8 (vCON-2 Interface 1, vCON-1 Interface 2). IOM host interfaces HI7 down to HI0 face the blade slots (Slot 1 = HI7, Slot 2 = HI6, Slot 3 = HI5, Slot 4 = HI4, Slot 5 = HI3); network interfaces NI3 down to NI0 are fabric ports 1 to 4.]

FarNorth-A(nxos)# show vifs interface ethernet 2/1/8
Interface      VIFS
-------------- ---------------------------------------------------------
Eth2/1/8       veth1241, veth1243, veth9461, veth9463,

FarNorth-A(nxos)# sh int vethernet 9463
vethernet9463 is up
    Bound Interface is Ethernet2/1/8
    Hardware: VEthernet
    Encapsulation ARPA
    Port mode is access
    Last link flapped 1week(s) 1day(s)
    Last clearing of "show interface" counters never
    1 interface resets

FarNorth-A(nxos)# show int vfc1271
vfc1271 is up
    Bound interface is vethernet9463
    Hardware is Virtual Fibre Channel
    Port WWN is 24:f6:00:0d:ec:d0:7b:7f
    Admin port mode is F, trunk mode is off
    snmp link state traps are enabled
    Port mode is F, FCID is 0x710005
    Port vsan is 100

• All vifs associated with an EthX/Y/Z interface are pinned to the same fabric port that the EthX/Y/Z interface is pinned to.
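The Eth X/Y/Z naming rule and the "show vifs interface" output shown above lend themselves to small parsing helpers when working through saved CLI captures. This is a minimal sketch: the field names follow the slide's description (chassis, mezz card, IOM port), the helper names are hypothetical, and the sample line is copied from the capture above.

```python
import re
from typing import NamedTuple


class HostInterface(NamedTuple):
    chassis: int    # X in Eth X/Y/Z, per the slide
    mezz_card: int  # Y in Eth X/Y/Z, per the slide
    iom_port: int   # Z in Eth X/Y/Z, per the slide


_ETH_PATTERN = re.compile(r"^Eth(?:ernet)?\s*(\d+)/(\d+)/(\d+)$", re.IGNORECASE)


def parse_host_interface(name):
    """Split a name such as 'Eth2/1/8' or 'ethernet 2/1/8' into its three numbers."""
    match = _ETH_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not an Eth X/Y/Z interface name: {name!r}")
    return HostInterface(*(int(part) for part in match.groups()))


def parse_vifs_line(line):
    """Turn one data row of 'show vifs interface ...' into (interface, [vifs])."""
    parts = line.split(None, 1)
    interface = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    vifs = [vif.strip() for vif in rest.split(",") if vif.strip()]
    return interface, vifs


interface, vifs = parse_vifs_line("Eth2/1/8       veth1241, veth1243, veth9461, veth9463,")
print(parse_host_interface(interface), vifs)
```

Keeping the three numbers as named fields makes it harder to mix up the chassis number with the IOM port when comparing captures from FI A and FI B.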
• Check the VLAN to VSAN mapping (show vlan fcoe)

FarNorth-A(nxos)# show vifs interface vethernet 9463
Interface      VIFS
-------------- ---------------------------------------------------------
veth9463       vfc1271,

FarNorth-A(nxos)# show vlan fcoe
VLAN      VSAN      Status
--------  --------  --------
1         1         Operational
100       100       Operational

All baseline troubleshooting should be done from "connect nxos".

CiscoLive-A(nxos)# show flogi database vsan 100
INTERFACE  VSAN  FCID      PORT NAME                NODE NAME
vfc703     100   0xdc0002  20:00:00:25:b5:00:00:1b  20:00:00:25:b5:00:00:2a
vfc725     100   0xdc0000  20:00:00:25:b5:10:10:01  20:00:00:25:b5:00:00:0e
vfc731     100   0xdc0001  20:00:00:25:b5:10:20:10  20:00:00:25:b5:00:00:2c

CiscoLive-A(nxos)# show fcdomain domain-list vsan 100
Number of domains: 3
Domain ID    WWN
---------    -----------------------
0x24 (36)    20:64:00:0d:ec:20:97:c1 [Principal]
0x40 (64)    20:64:00:0d:ec:ee:ef:c1
0xdc (220)   20:64:00:0d:ec:d0:7b:41 [Local]

CiscoLive-A(nxos)# show fcns database vsan 100
VSAN 100:
FCID      TYPE  PWWN                     (VENDOR)    FC4-TYPE:FEATURE
0x2402ef  N     50:06:01:6d:44:60:4a:41  (Clariion)  scsi-fcp:target
0x2400d9  NL    21:00:00:20:37:42:4a:b2  (Seagate)   scsi-fcp:target
0x400002  N     50:0a:09:88:87:d9:6e:b7  (NetApp)    scsi-fcp:target
0x40000e  N     10:00:00:00:c9:9c:de:9f  (Emulex)    ipfc scsi-fcp:init
0xdc0000  N     20:00:00:25:b5:10:10:01              scsi-fcp:init fc-gs
0xdc0001  N     20:00:00:25:b5:10:20:10              scsi-fcp:init fc-gs
0xdc0002  N     20:00:00:25:b5:00:00:1b              scsi-fcp:init
Total number of entries = 6

CiscoLive-A(nxos)# show zoneset active
zoneset name ZS_mn_bootcamp_v100 vsan 100
  zone name Server-1-Palo vsan 100
  * fcid 0xdc0000 [pwwn 20:00:00:25:b5:10:10:01]
  * fcid 0x2400d9 [pwwn 21:00:00:20:37:42:4a:b2]

FarNorth-B(nxos)# show npv flogi-table
SERVER                                                                     EXTERNAL
INTERFACE  VSAN  FCID      PORT NAME                NODE NAME                INTERFACE
vfc1205    100   0x240007  20:00:00:25:b5:00:00:0a  20:00:00:25:b5:00:00:06  fc2/1
vfc1206    100   0x240006  20:00:00:25:b5:00:00:09  20:00:00:25:b5:00:00:06  fc2/1
vfc1210    100   0x240008  20:00:10:25:b5:00:00:09  20:00:00:10:b5:00:00:09  fc2/2
vfc1238    100   0x240002  20:00:00:25:b5:00:00:10  20:00:00:25:b5:00:00:0f  fc2/1
vfc1240    100   0x240003  20:00:00:25:b5:00:00:04  20:00:00:25:b5:00:00:0f  fc2/2
Total number of flogi = 5.

• No FC services run in NPV mode
• FCIDs are assigned from the core NPIV switch
• The NP port to the core switch must be up and assigned to the proper VSANs (check with show npv status and show int brief)

FarNorth-B(nxos)# show npv status
npiv is enabled
disruptive load balancing is disabled

External Interfaces:
====================
  Interface: fc2/1, VSAN: 100, FCID: 0x240000, State: Up
  Interface: fc2/2, VSAN: 100, FCID: 0x240001, State: Up
  Number of External Interfaces: 2

Server Interfaces:
==================
  Interface: vfc1205, VSAN: 100, State: Up
  Interface: vfc1206, VSAN: 100, State: Up
  Interface: vfc1210, VSAN: 100, State: Up
  Interface: vfc1238, VSAN: 100, State: Up
  Interface: vfc1240, VSAN: 100, State: Up
  Interface: vfc1270, VSAN: 100, State: Up
  Interface: vfc1272, VSAN: 100, State: Up
  Interface: vfc1280, VSAN: 100, State: Up
  Interface: vfc1284, VSAN: 100, State: Up
  Number of Server Interfaces: 9
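The "show npv flogi-table" capture above can be tallied per external interface to see how server logins are spread across the NP uplinks. This is a minimal sketch: the rows are copied from the capture, and the function name is hypothetical.

```python
from collections import Counter

# Data rows copied from the "show npv flogi-table" capture above.
FLOGI_ROWS = """\
vfc1205 100 0x240007 20:00:00:25:b5:00:00:0a 20:00:00:25:b5:00:00:06 fc2/1
vfc1206 100 0x240006 20:00:00:25:b5:00:00:09 20:00:00:25:b5:00:00:06 fc2/1
vfc1210 100 0x240008 20:00:10:25:b5:00:00:09 20:00:00:10:b5:00:00:09 fc2/2
vfc1238 100 0x240002 20:00:00:25:b5:00:00:10 20:00:00:25:b5:00:00:0f fc2/1
vfc1240 100 0x240003 20:00:00:25:b5:00:00:04 20:00:00:25:b5:00:00:0f fc2/2
"""


def logins_per_uplink(table_text):
    """Count server logins per external (NP uplink) interface.

    Only six-column rows that start with a vfc interface are counted, so
    header, separator and summary lines are ignored.
    """
    counts = Counter()
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0].startswith("vfc"):
            counts[fields[-1]] += 1
    return dict(counts)


uplink_counts = logins_per_uplink(FLOGI_ROWS)
for uplink, count in sorted(uplink_counts.items()):
    print(f"{uplink}: {count} logins")
```

An uplink that is missing from the tally while its NP port shows as up is worth checking against the VSAN assignment on the core switch.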
FarNorth-B(nxos)# show int brief
-------------------------------------------------------------------------------
Interface  Vsan   Admin  Admin   Status      SFP    Oper  Oper    Port
                  Mode   Trunk                      Mode  Speed   Channel
                         Mode                             (Gbps)
-------------------------------------------------------------------------------
fc2/1      100    NP     off     up          swl    NP    2       --
fc2/2      100    NP     off     up          swl    NP    2       --
fc2/3      1      NP     off     sfpAbsent   --     --            --
fc2/4      1      NP     off     sfpAbsent   --     --            --
fc2/5      1      NP     off     sfpAbsent   --     --            --
fc2/6      1      NP     off     sfpAbsent   --     --            --
fc2/7      1      NP     off     sfpAbsent   --     --            --
fc2/8      1      NP     off     sfpAbsent   --     --            --

• Server upgrade items
    • Do NOT use a BIOS recovery as a mechanism to upgrade the BIOS
    • Do this through the update method (M3 blades) or a Host FW package
    • In general, we want the CIMC version to be newer than the BIOS version, so that the CIMC properly understands the data returned to it by the BIOS (this is a delta from the documentation today)
    • All firmware components must come from the same B (blade components) and C (rack components) packages, matched to the A (infrastructure) package

• Corrupt CIMC firmware
    • POST failure
    • Not completing boot
• Connecting to the CIMC in band to test connectivity
• Manually reboot the CIMC
** Note: today there is a bug on the B230 and B440 where network performance can be negatively affected by a CIMC-only reboot on VMware hosts

• A quick test to verify the health
• This is a very low-level data point
• Source of blade issue reporting

CiscoLive-A# connect cimc 1/1
Trying 127.5.1.1...
Connected to 127.5.1.1.
Escape character is '^]'.

CIMC Debug Firmware Utility Shell
__________________________________________
 Debug Firmware Utility
__________________________________________
Command List
__________________________________________
alarms
cores
exit
help [COMMAND]
images
mctools
memory
messages
network
obfl
post
power
sensors
sel
fru
mezz1fru
mezz2fru
tasks
top
update
users
version
__________________________________________
 Notes:
"enter Key" will execute last command
"COMMAND ?" will execute help for that command
__________________________________________

• Non-disruptive to the data path **
** With the exception of the current bug in VMware environments

• KVM access
• Independent of Centrale
• UCS AAA login

• This will show errors detected and reported by the BIOS and the CIMC
• These are also stored in the System Event Log (SEL)
• Uncorrectable errors are an issue; correctable errors mean ECC is doing its job

CiscoLive-A /chassis/server # show sel 3/1 | include Memory
487 | 03/18/2011 00:16:49 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 0, DIMM Socket: 4, Channel: C, Socket: 0, DIMM: C4 | Asserted
5f1 | 04/16/2011 09:53:12 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 3, DIMM Socket: 7, Channel: A, Socket: 0, DIMM: A7 | Asserted
731 | 04/21/2011 01:59:28 | BIOS | Memory #0x02 | Correctable ECC/other correctable memory error | RUN, Rank: 1, DIMM Socket: 1, Channel: B, Socket: 0, DIMM: B1 | Asserted
732 | 04/21/2011 10:50:55 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 2, DIMM Socket: 6, Channel: A, Socket: 0, DIMM: A6 | Asserted
799 | 04/29/2011 02:50:31 | BIOS | Memory #0x02 | Correctable ECC/other correctable memory error | RUN, Rank: 0, DIMM Socket: 0, Channel: B, Socket: 0, DIMM: B0 | Asserted
79a | 04/29/2011 04:41:33 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 3, DIMM Socket: 3, Channel: B, Socket: 0, DIMM: B3 | Asserted

• We want to be notified of both correctable errors (for failure prediction) and uncorrectable errors, via a threshold policy

• Cycling through servers, performing testing on deployed hardware
    • Evacuate the VMs from a given server and put it in maintenance mode
    • Mount the e2e diagnostic .ISO and reboot the server to it
    • Run utilities to stress test the memory and CPU
        Test 1: ./burnin/bin/stress -c 8 -i 4 -m 2 --vm-bytes 128M -t 100s -v
        Test 2: ./burnin/bin/pmemtest -a -l 1000000000
        Test 3: ./burnin/bin/stream
        Test 4: ./burnin/bin/cachebench -rwbsp -x1 -m24 -d5 -e1
    • DO NOT RUN THE DISK STRESS (it will corrupt the existing RAID)
    • Record the results
    • Remove the .ISO, reboot into VMware, and exit maintenance mode
• Identify any suspect devices from the tests and plan for maintenance of those items

• Initial server deployment or suspected issues
• Example results from one customer POC (Test #1 to Test #4):
Server (model / CPU / memory)   Test #1   Test #2   Test #3   Test #4
B200-M2 / X5570 / 96G           1m 40s    50s       5m 4s     13m 45s
B230-M1 / X6550 / 256G          1m 40s    1m 20s    5m 11s    13m 46s

• Windows items
    • With the latest BIOS on B230 and B440 M1, the PCI devices are ordered correctly on a 1.4 to 2.0 upgrade, but interfaces can be renumbered regardless (fix coming)
    • We can define PCI order, but the adapter definitions presented to the OS depend on the order in which you map the VIC driver to them
• Red Hat items
    • We have very good control over these, using /etc/sysconfig/network-scripts to map the HW address to the eth number
    • There are kernel parameters that can affect performance; contact the TME teams directly
• ESX items
    • In-box drivers occasionally need to be updated
    • This is due to the time needed to sync drivers into in-box deployments (can be 6+ months)

• Intra-chassis component communications
    • Inter-Integrated Circuit (I2C) communications
    • System Management Bus was a later subset
    • Multi-master bus for simple communications between system elements
    • In use inside a standard industry server, and also between chassis components (inside a single chassis only)
• I2C bug cases, with some components coming too close to certain margins
    • Locking the I2C bus
    • Creating spurious noise on the bus
    • Manifests in unpredictable behavior
• What does this mean for POCs and initial customer deployments?
    • Be certain to be running software at or later than 1.4(3s), which includes SW fixes for these situations
    • For additional HW margin:
        Power supplies should be ordered as MFG_NEW if possible
        IO Modules that are 2104 should be ordered as MFG_NEW if possible

• 6100 top considerations
    • P*V count limit: 3k prior to UCS 1.4(1), then 6k from UCS 1.4(1), and 14k as of UCS 1.4(3q)
    • VIF limits can be very restrictive in C-Series implementations
• 6200 top considerations
    • 32k P*V count limit at UCS v2.x
    • Multicast when using port channels upstream (only do this on UCS v2.0(2) and later)

• Gathering tech support files
    • We can gather the tech support data from UCSM to your local host
    • Always recommend gathering it when asking questions on the various internal mailers

• Gathering any core dumps
    • Once the TFTP core exporter is configured, cores will be moved off the system
    • Move exported cores to the trash can

• Viewing data plane traffic within the UCS
    • We can SPAN from most sources within the UCS
    • Can SPAN the physical and virtual interfaces
[Diagram label: hardware or software analyzer]

Thank you.