Installation and troubleshooting overview 5.3

Unit objectives
After completing this unit, you should be able to:
• Identify the BladeCenter components used to provide problem determination (PD) information
• List the planning elements required for the BladeCenter management network
• Select the functions available to modify firmware settings
• List the blade server indicators and Light Path components
• Select the steps appropriate in diagnosing blade server hardware failures
• Identify the utility to use in displaying BladeCenter component health

Best practices
• Best practices
• Troubleshooting and problem determination
• BladeCenter management interfaces
• Firmware updates and settings
• Information gathering
• IBM BladeCenter support resources

BladeCenter chassis questions: Requirements
• Given your specific needs, what is the best BladeCenter solution (in terms of components) to meet your requirements?
• Define the networking and SAN requirements for your BladeCenter environment based on your existing infrastructure, including fault tolerance, throughput and interoperability.
• Do you plan on having a separate management LAN and production LAN? What is the advantage/disadvantage of this environment?
• Are all of the components being installed in the BladeCenter chassis on the ServerProven list?
• Is this BladeCenter chassis to be deployed locally or in a remote location?

Blade server considerations: Questions
• Is the blade server at the latest firmware level? If not, what method of applying the latest firmware updates are you going to implement?
• Besides the BIOS, what other firmware updates are needed for the blade server?
• What operating system are you going to put on the blade server? How do I find out if this OS is supported on the blade server?
• What are the different deployment methods for operating system installations, and which method makes the most sense in my environment?
• What performance requirements are needed out of my blade server? Based upon these requirements, which model best fits my business needs?

BladeCenter chassis questions: Power
• Do you understand the necessary power requirements for a given BladeCenter solution?
• Will your BladeCenter chassis be connected to either a front-end or high-density front-end rack PDU?
• How many blade servers are in the chassis and will that impact oversubscription of the power domains?
• Do you have the correct electrical connectors to power your new BladeCenters and their PDUs?

Cooling questions
• Are the systems on a raised floor?
• How many BTUs am I generating when my installation is complete?
• What are the power requirements for the new systems?
• Are there plans to grow in the future?

Troubleshooting and problem determination

Problem determination: Information gathering
• Due to the variety of hardware and software combinations that can be encountered, use the following information to assist you in problem determination. If possible, have this information available when requesting assistance from Service Support and Engineering functions.
  – Machine type and model
  – Microprocessor or hard disk upgrades
  – Failure symptom
    • Do diagnostics fail?
    • What, when, where, single, or multiple systems?
    • Is the failure repeatable?
    • Has this configuration ever worked?
    • If it has been working, what changes were made prior to it failing?
    • Is this the original reported failure?
  – Diagnostics version: type and version level
  – Hardware configuration
    • Print (print screen) configuration currently in use
    • BIOS level
  – Operating system software: type and version level

Blade servers: Diagnostics tools
• Light Path Diagnostics
• Standalone diagnostics
• Diagnostics by PC Doctor
  – Test results are stored in a test log
  – Management Module event logs contain system status messages from the blade server service processor and can be:
    • Viewed
    • Saved to diskette
    • Printed
    • Attached to e-mail alerts
  – Standard log is a summary of tests
  – Press <Tab> while viewing the test log
• Power On Self Test (POST) beep codes
• Unified Extensible Firmware Interface (UEFI)
  – Elimination of beep codes
  – Advanced logging and firmware control
• Command-line interface (CLI)

IBM Blade Server: Front panel LEDs HS22 example
[Figure: IBM HS22 Blade Server front panel indicators and controls]

IBM Blade Server: System board diagnostic indicators HS22 example
• IBM HS22 Blade server system board example
  – Memory, processor, and disk indicators
  – Light Path panel
[Figure: IBM Blade Server HS22 system board indicators; HS22 system board Light Path panel]

IBM Blade Server: Front panel LEDs LS22 example
[Figure: IBM LS22 Blade Server front panel controls and indicators]

IBM Blade Server: System board diagnostic indicators LS22 example
[Figure: IBM LS22 Blade Server system board; LS22 system board Light Path panel]

IBM Blade Server: Diagnostics tools
• Light Path Diagnostics
• Press F2 at POST to invoke standalone diagnostics
• Diagnostics by PC Doctor
  – Test results are stored in a test log
  – Management Module event logs contain system status messages from the blade server service processor and can be:
    • Viewed
    • Saved to diskette
    • Printed
    • Attached to e-mail alerts
  – Standard log is a summary of tests
  – Press <Tab> while viewing the test log
• Power On Self Test (POST) beep codes
• Real time diagnostics
• Command-line interface (CLI)

Blade server: Basic input/output system (BIOS)
• Blade server BIOS
  – Menu-driven setup
  – Settings for configuration and performance
  – Set, change, delete (IRQ, date and time, and passwords)
  – Advanced settings for specific needs (for example, memory, CPU, PCI bus and BMC)
  – BIOS defaults
• Flash diskette
  – BIOS updates for host and devices
• CD-ROM
  – BIOS/firmware updates and configuration for host and devices
• BIOS system board jumpers or switches
  – BIOS boot selection
  – Password override
  – Wake on LAN enablement

UEFI: Unified Extensible Firmware Interface (1 of 3)
• The next generation of BIOS
• Allows OSs to take full advantage of the hardware
  – Architecture independent
  – Modular
• 64-bit code architecture
• 16 TB of memory can be addressed
• More functionality
  – Adapter vendors can add more features in their options (for example, IPv6)
  – Design allows faster updates as new features are introduced
  – More adapters can be installed and used simultaneously
  – Fully backwards compatible with legacy BIOS
• Better user interface
  – Replaces Ctrl key sequences with a more intuitive human interface
  – Moves adapter and iSCSI configuration into F1 setup
  – Creates human-readable event logs
• Easier management
  – Eliminates "beep" codes; all errors can now be covered by Light Path
  – Reduces the number of error messages and eliminates outdated errors
  – Can be managed both in-band and out-of-band

UEFI: Unified Extensible Firmware Interface (2 of 3)
[Figure: today's versus tomorrow's update and configuration on systems. Labels: xFlash and ASU configuration, RSA II, Diags, BIOS, BMC, Pb DSA, IMM, UEFI]

UEFI: Unified Extensible Firmware Interface (3 of 3)
UEFI versus BIOS

  UEFI: 64-bit code architecture: 16 TB of memory can be addressed.
  BIOS: 16-bit code architecture: only 1 MB of memory can be addressed.

  UEFI: Eliminates code space constraints. Adapter option ROMs can be loaded anywhere in memory with no size restrictions.
  BIOS: Adapter vendors must fit all option code into a shared 128K. Limits the number of adapters that can be effectively installed.

  UEFI: Adapter vendors are free to add function (for example, IPv6).
  BIOS: Vendors are limited in the function they can provide in the option ROM.

  UEFI: UEFI defines a human interface that is being extended to adapter vendors.
  BIOS: Cryptic Ctrl key sequences required for configuring adapters.

  UEFI: iSCSI configuration is in F1 Setup and consolidated into ASU.
  BIOS: iSCSI configuration requires a separate tool.

  UEFI: Elimination of beep codes: all errors covered by Light Path. Reduction in number of error messages.
  BIOS: Multiple beep codes for fundamental failures.

  UEFI: Adapter configuration can move into F1 Setup. Eliminates Ctrl key sequences for configuring adapters.
  BIOS: Advanced Settings Utility (ASU) has partial coverage of F1 settings.

  UEFI: In-band and out-of-band UEFI updates. Settings accessed out of band via ASU and the IMM.
  BIOS: In-band only updates via DOS, wFlash, or lFlash.

  UEFI: UEFI event codes available out of band. Human-readable event logs in F1 Setup.
  BIOS: Numerous legacy POST errors.
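
The memory limits quoted in the UEFI versus BIOS comparison follow from the number of address bits available. As a rough sketch only: it assumes the 1 MB legacy figure corresponds to 20 address bits and the 16 TB UEFI figure to 44 address bits. Neither bit count is stated in the slides, and the function name is illustrative.

```python
def addressable_bytes(address_bits: int) -> int:
    """Return how many bytes can be addressed with the given number of address bits."""
    return 2 ** address_bits


MIB = 2 ** 20  # bytes in one mebibyte
TIB = 2 ** 40  # bytes in one tebibyte

# Assumption: the 1 MB legacy BIOS limit corresponds to 20 address bits.
legacy_limit = addressable_bytes(20)
# Assumption: the 16 TB figure quoted for UEFI corresponds to 44 address bits.
uefi_limit = addressable_bytes(44)

print(legacy_limit // MIB, "MiB addressable under the legacy assumption")
print(uefi_limit // TIB, "TiB addressable under the UEFI assumption")
print("ratio:", uefi_limit // legacy_limit)
```

Under these assumptions the ratio between the two limits is 2 to the power 24, which gives a sense of why option ROM space stops being a constraint.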
Blade server: Integrated Management Module (IMM)
• Integrated Management Module (IMM)
  – Replacement for BMC
  – LAN over USB
  – OS drivers included in Windows and Linux

Blade server six system states

  State 1: There is no AC power
    Data gathering: Visual
    Data analysis: PDSG

  State 2: There is AC power but no DC power
    Data gathering and analysis: Advanced Management Module (AMM) and IMM; Light Path; system event log

  State 3: There is AC and DC power but the system fails to complete POST
    Data gathering: Checkpoint codes; F1 and F2; beep codes (prior to UEFI); adapter BIOS messages
    Data analysis: PDSG; Retain tips; IBM Support Web site

  State 4: There is AC and DC power, the system completes POST but the NOS fails to start loading
    Data gathering: F2 diagnostics
    Data analysis: PDSG; Retain tips

  State 5: There is AC and DC power, the system completes POST but the NOS fails to complete
    Data gathering: NOS boot messages; 'Blue Screen'
    Data analysis: NOS vendor messages

Advanced Management Modules (AMM): Overview
• The Management Module stores all event and error information for the BladeCenter
• The Management Module configuration data is stored both in itself and on the midplane
  – To reset the IP address back to the default settings, press and hold the IP reset button for 3 seconds or less
[Figure: Advanced Management Module LEDs and connectors. Callouts: power-on LEDs, activity LEDs, error LEDs, serial console connector (RJ45), video connector, 10/100 Ethernet connector (RJ45), port link LED, port activity LED, dual-stack USB, pin-hole reset, MAC address, release handle]

Recovering Management Module TCP/IP address
• MM configuration data is stored in the midplane
  – To reset a TCP/IP address only:
    • Remove the cable from the MM Ethernet port
    • Press and hold the IP reset button for 3 seconds or less
  – TCP/IP address will reset to 192.168.70.125/255.255.255.0
  – Simply replacing the MM will cause the replacement MM to adopt the same values as the original MM
• PERFORM ALL RESET STEPS BEFORE REPLACING THE MM

Management Module full reset: Factory defaults
• MM configuration data is stored in the midplane
  – To force a complete MM reset (including password):
    • Remove the cable from the MM Ethernet port
    • Press and hold the IP reset button for 5 seconds
    • Release the IP reset button for 5 seconds
    • Press and hold the IP reset button for 10 seconds
  – TCP/IP address will be reset to 192.168.70.125/255.255.255.0
  – All IDs and passwords will be deleted (except USERID/PASSW0RD)
  – Simply replacing the MM will cause the replacement MM to adopt the same values as the original MM
• PERFORM ALL RESET STEPS BEFORE REPLACING THE MM

Advanced management event log

Problem determination: Blade server example
• Example of a memory DIMM problem
  – Display of BladeCenter front panel LEDs
    [Figure: Management Module web interface indicating error LEDs]
  – Display of the blade server front panel LEDs
    [Figure: Advanced Management Module blade server LEDs]
  – Display of the BladeCenter event log
    [Figure: Advanced Management Module event log]
• Using the IBM Problem Determination guide - IBM BladeCenter HS21
  – Locate the error symptom code in the log (in this example: 289)
  – Match the table entry to the code
    [Figure: table entry "Check POST error log for error message 289"]
• Consult the IBM Installation Guide for the HS21
  – Proper DIMM installation procedure
    [Figure: HS21 DIMM installation slot and order]
• Verifying fix and proper operation
    [Figure: AMM status display and event log]

Problem determination: Blade servers
• What do you do if:
  – Blade server powered down for no apparent reason
  – Blade server does not power on, the system-error LED on the BladeCenter system-LED panel is lit, the blade error LED on the blade server LED panel is lit, and the system-error log contains the following message: "CPUs Mismatched"
  – Some components do not report environmental status (temperature, voltage)
  – Switching KVM control between blade
    servers gives USB device error

Ethernet switch modules: Addressing issues
• What do you do if:
  – You have a duplicate IP address reported on the ESM
  – You have a duplicate IP address reported on the blade server
  – You have a native VLAN mismatch reported on the ESM
  – There are connection problems to the blade servers
  – The DHCP server uses up all IP addresses and the blade server still cannot get an address

Problem determination: Ethernet switch I/O modules
• Hardware failures
  – Not very common
  – On MM, look under I/O Module Tasks -> Power/Restart to see diagnostic code after reboot. Also look at fault LED on the Ethernet Switch Module
• Software failures
  – Not very common
  – As with all products, software bugs do exist
  – Reference the latest code readme file for a list of resolved bugs with each release of code
• Misconfiguration of Ethernet Switch Module or other component
  – This is the most common issue encountered
  – Often requires close cooperation between different administrative groups to resolve

Ethernet switch modules: Configuration issues
• Most common issue encountered
  – May be with the Ethernet Switch Module, a device upstream or the server within the BladeCenter
  – May also be misconfiguration on the Management Module
• Same tools used to troubleshoot configuration issues can also be used to help isolate broken hardware and software bugs
• Usually requires close cooperation between network administrators and server administrators
• Often helps to have special tools (for example, network sniffer) to understand and resolve the problem

Ethernet switch modules: Basic rules
• Do not attach cables to the ESM until both sides of the connection are configured
• Do not put the blade servers on the VLAN that the ESM uses for its management VLAN interface
• Make sure the ESM firmware (IOS) code is upgraded
• Decide the ESM management path (via Management Module or ESM uplinks) and configure for it

BladeCenter management interfaces

BladeCenter AMM: System status screen
[Figure: AMM system status screen. Callouts: navigation menu, main information window]

System Event Log (SEL) screen
• This screen shows event history of the BladeCenter

Hardware Vital Product Data (VPD)
• This screen shows information relating to the hardware in the BladeCenter

Rules for I/O module management
• In-band management
  – Use the AMM path to an I/O module
    • Provides centralized management of all I/O modules
  – All activities and reporting are through a single Ethernet port
  – Makes LAN configuration easier
    • Requires MM and all I/O modules to be on the same IP subnet
• Out-of-band management
  – Requires enablement of external management over all ports
    • May require management VLAN configuration
    • Access will involve many Ethernet ports
    • I/O module need not be on the same IP subnet as the MM
  – If subnets are different, AMM path to I/O module is unavailable

I/O module tasks: Close up

I/O module tasks: Advanced switch management

Ethernet switch I/O module Web interface

CIGESM Web interface

Nortel ESM Web interface

Fibre Channel switch module Web interface
• SAN Utility (QLogic)
  – Full function GUI
• SAN Browser (QLogic)
  – Limited functionality
• Switch Explorer (Brocade)
  – Limited functionality

Firmware updates and settings

UpdateXpress CD-ROM package
• UpdateXpress
  – Bootable CD-ROM
  – Supports maintenance of system firmware and Windows device drivers
    • Automatically detects current device-driver and firmware levels
    • Gives the option of selecting specific upgrades or allowing UpdateXpress to update all of the system levels it detected as needing upgrades
    • Can be installed using local DVD or over network using the AMM

UpdateXpress firmware update scripts
• UpdateXpress Firmware Update Scripts for BladeCenter (UXBC)
  – Process that enables firmware updates to be run in a remote, unattended fashion
• Requires a management station and supporting software
  – Windows or Linux OS
  – FTP and TFTP servers somewhere on the management LAN
  – UXBC discovery and deployment components
  – For more information, see http://www-03.ibm.com/systems/management/uxs.html

IBM preboot dynamic system analysis
• Provides problem isolation, configuration analysis, error log collection
  – Collects information about:
    • System configuration
    • Network interfaces and settings
    • Installed hardware
    • Light path diagnostics status
    • Service processor status and configuration
    • Vital product data, firmware, and UEFI configuration
    • Hard disk drive health

Advanced settings utility
• Enables the user to modify firmware settings from the command line
  – Supported on multiple operating system platforms
  – Enables remote changes to POST and BIOS settings
    • Does not require F1 access to a console session
  – Supports scripting through a batch processing mode
  – Does not update any of the firmware code
  – For more information, see http://www-304.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-55021

Information gathering

Data gathering
• Read the BladeCenter data collection guide
  – Contains details of what logs and information are needed for escalations
  – Contains a step-by-step guide on how the logs are collected
  – For more information, see http://www-304.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=SERV-BLADE&brandind=5000008

Gathering information from blade servers
• Blade server logs can be gathered within the operating system
  – Use
    the following table to determine what utility to use:

    Type of blade server   Operating system   Type of gathering utility
    HS Series              Windows            Dynamic System Analysis
    HS Series              Linux              Dynamic System Analysis
    LS Series              Windows            Dynamic System Analysis
    LS Series              Linux              Dynamic System Analysis
    JS Series              Linux              SNAP

    SNAP is built into AIX, and SNAP for Linux on Power can be found at: http://techsupport.services.ibm.com/server/lopdiags.

Gathering information from I/O switch modules
• Logs from a Brocade, Cisco, BNT or QLogic switch module can be captured within the switch interface
  – Enable capture text/console logging within the telnet application
  – Log in to the switch using telnet
  – Issue the command from the table below

    Type of switch   Command
    Brocade          showSupport
    Cisco            show tech-support
    Nortel           maint/tsdmp
    QLogic           support show

IBM BladeCenter support resources

IBM support Web site
• New central Web site for all server products: http://www-304.ibm.com/systems/support/
  – Select BladeCenter from the drop-down menu

Documentation
• Hardware Maintenance Manual
  – Available electronically (Adobe Acrobat .PDF format) from the IBM support Web site
    • Primary support document for diagnostics and troubleshooting
• User's Guide, Installation Guide
  – System documentation that ships with the BladeCenter and with options such as blade servers and switch modules
    • Useful for confirming shipping group contents (missing parts, and so on) and initial customer setup

IBM Blade Server references
• IBM BladeCenter Products and Technology
  – http://www.redbooks.ibm.com/cgi-bin/searchsite.cgi?query=bladecenter
• IBM ServerProven: Compatibility for BladeCenter Products
  – http://www-03.ibm.com/servers/eserver/serverproven/compat/us/
• System x Reference (xREF)
  – http://www.redbooks.ibm.com/xref/usxref.pdf
• Intel Products
  – http://www.intel.com/products/server/processors/index.htm
• AMD Products
  – http://www.amd.com/us/products/server/Pages/server.aspx

Key words
• Advanced Management Module (AMM)
• Alternating Current (AC)
• Basic Input/Output System (BIOS)
• British thermal unit (BTU)
• Central Processing Unit (CPU)
• Cisco Intelligent Gigabit Ethernet Switch Module (CIGESM)
• Command-line interface (CLI)
• Compact Disc Read-Only Memory (CD-ROM)
• Dynamic Host Configuration Protocol (DHCP)
• Ethernet switch modules (ESM)
• Fibre Channel Switch Module (FCSM)
• File Transfer Protocol (FTP)
• Graphical User Interface (GUI)
• IBM BladeCenter E (Enterprise)
• IBM BladeCenter H (High Performance)
• IBM BladeCenter HT (High Performance Telco)
• IBM BladeCenter S (Simplification)
• IBM BladeCenter T (Telco)
• Integrated Management Module (IMM)
• Input-output (I/O)
• Internet Protocol (IP)
• Interrupt Request (IRQ)
• Jumper (J)
• Keyboard, Video, and Mouse (KVM)
• Local-Area Network (LAN)
• Management Module (MM)
• Non-Maskable Interrupt (NMI)
• Operating System (OS)
• Peripheral Component Interconnect (PCI)
• Power Distribution Unit (PDU)
• Power On Self Test (POST)
• Remote Supervisor Adapter II (RSA II)
• Secure Sockets Layer (SSL)
• Serial over LAN (SoL)
• Service Pack (SP)
• Service Support Representative (SSR)
• Simple Mail Transfer Protocol (SMTP)
• Simple Network Management Protocol (SNMP)
• Storage Area Network (SAN)
• System Event Log (SEL)
• Transmission Control Protocol (TCP)
• Trivial File Transfer Protocol (TFTP)
• Unified Extensible Firmware Interface (UEFI)
• UpdateXpress Firmware Update Scripts for BladeCenter (UXBC)
• Virtual Local Area Network (VLAN)
• Vital Product Data (VPD)
• Volt (V)
• Watt (W)

Checkpoint (1 of 2)
1. The _______________________ stores all major event and error information for the BladeCenter and is the starting point for PD.
   a. Ethernet Switch Module (ESM)
   b. AMM
   c. BIOS
   d. Blade Server operating system log
2. True/False: In planning the BladeCenter management network, bandwidth is the primary consideration.
3. The __________ enables the user to modify firmware settings from the command line.
4. True/False: While AMM management can be done through a Web interface, all switch modules must be configured using command line.

Checkpoint solutions (1 of 2)
1. The _______________________ stores all major event and error information for the BladeCenter and is the starting point for PD.
   a. Ethernet Switch Module (ESM)
   b. AMM
   c. BIOS
   d. Blade Server operating system log
   Answer: b
2. True/False: In planning the BladeCenter management network, bandwidth is the primary consideration.
   Answer: False
3. The __________ enables the user to modify firmware settings from the command line.
   Answer: Advanced Settings Utility (ASU)
4. True/False: While AMM management can be done through a Web interface, all switch modules must be configured using command line.
   Answer: False

Checkpoint (2 of 2)
5. Select the correct statement regarding Blade Server status indicators.
   a. Memory and processor LEDs are on the Blade Server front panel
   b. All Blade Server status LEDs are on the Light Path diagnostics panel
   c. Blade Server status and error LEDs are on the Front Panel, Control Panel and adjacent to components on the system board
   d. Light Path status and error indicators require the Blade to be powered on
6. True/False: The UEFI is a functional replacement for legacy BIOS
7. True/False: To diagnose a Blade Server hardware problem, the first step to take would be to remove the Blade from the chassis and check the system board LEDs.
8. True/False: As a rule, power consumption is directly related to resultant heat output.
9. Which function should be used to view Service Processor configuration and hard disk drive health?
   a. AMM Event Log
   b. PreBoot DSA
   c. AMM Monitor status page

Checkpoint solutions (2 of 2)
5. Select the correct statement regarding Blade Server status indicators.
   a. Memory and processor LEDs are on the Blade Server front panel
   b. All Blade Server status LEDs are on the Light Path diagnostics panel
   c. Blade Server status and error LEDs are on the Front Panel, Control Panel and adjacent to components on the system board
   d. Light Path status and error indicators require the Blade to be powered on
   Answer: c
6. True/False: The UEFI is a functional replacement for legacy BIOS
   Answer: True
7. True/False: To diagnose a Blade Server hardware problem, the first step to take would be to remove the Blade from the chassis and check the system board LEDs.
   Answer: False
8. True/False: As a rule, power consumption is directly related to resultant heat output.
   Answer: True
9. Which function should be used to view Service Processor configuration and hard disk drive health?
   a. AMM Event Log
   b. PreBoot DSA
   c. AMM Monitor status page
   Answer: b

Unit summary
Having completed this unit, you should be able to:
• Identify the BladeCenter components used to provide PD information
• List the planning elements required for the BladeCenter management network
• Select the functions available to modify firmware settings
• List the blade server indicators and Light Path components
• Select the steps appropriate in diagnosing blade server hardware failures
• Identify the utility to use in displaying BladeCenter component health
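
Checkpoint question 8 and the cooling questions earlier in the unit both tie power draw to heat output. A minimal sketch of that conversion follows, assuming essentially all electrical power drawn ends up as heat and using the standard factor of about 3.412 BTU per hour per watt. The function name and the 4000 W chassis figure are illustrative only and are not BladeCenter specifications.

```python
BTU_PER_HOUR_PER_WATT = 3.412  # standard conversion: 1 W is about 3.412 BTU/h


def watts_to_btu_per_hour(watts: float) -> float:
    """Convert electrical power draw in watts to heat output in BTU per hour."""
    if watts < 0:
        raise ValueError("power draw cannot be negative")
    return watts * BTU_PER_HOUR_PER_WATT


# Hypothetical example: a chassis drawing 4000 W (illustrative figure only).
example_watts = 4000.0
heat = watts_to_btu_per_hour(example_watts)
print(f"{example_watts:.0f} W is about {heat:.0f} BTU/h")
```

Summing the measured or rated draw of every chassis in the room and passing the total through the same function gives a first estimate for the "How many BTUs am I generating" planning question.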