Top Support issues and how to solve them – Part II
Darren Burnett, Senior Technical Support Engineer

Unable to connect to the Service Console

Rebuild Networking

Network connection problem. Possible causes:
• Deleting the vSwitch that vswif0 is connected to
• Connecting the wrong NICs to vSwitch0
• Upgrade issues
• Incorrect IP address
• External network changes

At this stage you can no longer connect to your ESX server using VI Client or SSH.
• You can connect to the Service Console remotely if you have iLO, DRAC, an IP KVM or something similar.
• Otherwise, it's time to use some shoe leather and walk to the server room.

The following procedure will work. However, it is a quick and inelegant way to get your VI Client connected. Other options include:
• A crossover cable connected to a laptop
• Adding or removing NICs on a vSwitch

Use the esxcfg-vswitch -l command to list all of your vSwitches.

    [root@newross root]# esxcfg-vswitch -l
    Switch Name    Num Ports   Used Ports  Configured Ports  Uplinks
    vSwitch0       32          4           32                vmnic0

      PortGroup Name     Internal ID   VLAN ID  Used Ports  Uplinks
      VM Network         portgroup1    0        0           vmnic0
      Service Console    portgroup0    0        1           vmnic0
      VMkernel           portgroup7    0        1           vmnic0

    Switch Name    Num Ports   Used Ports  Configured Ports  Uplinks
    vSwitch1       64          2           64                vmnic1

      PortGroup Name     Internal ID   VLAN ID  Used Ports  Uplinks
      vlan100            portgroup10   100      0           vmnic1
      install VLAN 310   portgroup6    0        0           vmnic1

    Switch Name    Num Ports   Used Ports  Configured Ports  Uplinks
    vSwitch2       64          2           64                vmnic6

      PortGroup Name     Internal ID   VLAN ID  Used Ports  Uplinks
      crossovercable     portgroup9    0        0           vmnic6

Delete all vSwitches
    esxcfg-vswitch -d vSwitch0
    esxcfg-vswitch -d vSwitch1
    esxcfg-vswitch -d vSwitch2

Create a vSwitch
    esxcfg-vswitch -a vSwitch0
Create the Service Console portgroup
    esxcfg-vswitch -A "Service Console" vSwitch0
Add a NIC to the vSwitch
    esxcfg-vswitch -L vmnic0 vSwitch0
Add a vswif interface and configure it
    esxcfg-vswif -a vswif0 -p "Service Console" -i 10.10.10.3 -n 255.0.0.0

Check if you can connect.
• Use ping both to and from the ESX server
• Try SSH
• Try VI Client

If you still can't connect, use "esxcfg-nics -l" to list the available NICs.

    [root@newross root]# esxcfg-nics -l
    Name    PCI       Driver  Link  Speed     Duplex  Description
    vmnic0  03:0c.00  e1000   Up    1000Mbps  Full    Intel Corporation 82546EB Gigabit Ethernet Controller (Copper)
    vmnic1  03:0c.01  e1000   Down  0Mbps     Half    Intel Corporation 82546EB Gigabit Ethernet Controller (Copper)
    vmnic8  07:00.00  tg3     Up    1000Mbps  Full    Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet
    vmnic6  08:00.00  tg3     Down  0Mbps     Half    Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet

Remove and add NICs to vSwitch0
• Remove
    esxcfg-vswitch -U vmnic0 vSwitch0
• Add
    esxcfg-vswitch -L vmnic8 vSwitch0

It might also need a VLAN ID
    esxcfg-vswitch -v 101 -p "Service Console" vSwitch0

To avoid this issue, be careful when changing the Service Console virtual NIC, or any property of its parent virtual switch that can affect the Service Console's connectivity, for example the uplink. If possible, before updating the Service Console virtual NIC, create a second, independent, working Service Console NIC. If the change brings the first console NIC down, the second one is still available to repair the configuration.

As stated, depending on your environment it may be easier not to delete all your configurations. It may only require reassigning a NIC or changing a VLAN ID. Check the following guide.
http://www.rtfm-ed.eu/docs/vmwdocs/esx3.xvc2.x-serviceconsole-guide.pdf

Network Bond

NICs in a Bond not on the same Broadcast Domain

NICs in a Bond should be in the same broadcast domain. Determine what NICs are in each Bond.
    [root@cork root]# esxcfg-info | grep -i -A 2 VirtualSwitchImpl
    \==+VirtualSwitchImpl :
       |----Name.........................................vSwitch0
       |----Uplinks......................................vmnic0
    --
    \==+VirtualSwitchImpl :
       |----Name.........................................vSwitch1
       |----Uplinks......................................vmnic4
    --
    \==+VirtualSwitchImpl :
       |----Name.........................................vSwitch2
       |----Uplinks......................................vmnic1,vmnic3
    --

Next we need to determine which network each NIC is connected to.

    [root@cork root]# esxcfg-info | grep -i -B 5 hint
    \==+PnicImpl :
       |----_name..............................................vmnic3
       |----_bus...............................................6
       |----_slot..............................................3
       |----_function..........................................1
       |----Network Hint.......................................0 10.16.157.00/255.255.255.192
    --
    \==+PnicImpl :
       |----_name..............................................vmnic0
       |----_bus...............................................11
       |----_slot..............................................7
       |----_function..........................................0
       |----Network Hint.......................................0 10.16.156.00/255.255.255.00
    --
    \==+PnicImpl :
       |----_name..............................................vmnic1
       |----_bus...............................................12
       |----_slot..............................................8
       |----_function..........................................0
       |----Network Hint.......................................0 10.16.156.00/255.255.255.00

In this output vSwitch2 bonds vmnic1 and vmnic3, but vmnic3 reports 10.16.157.00/255.255.255.192 while vmnic1 reports 10.16.156.00/255.255.255.00. The differing hints suggest the two NICs are not in the same broadcast domain.

HA and STP

Spanning Tree Protocol
• When there is a network change, STP can cause a temporary outage on your network.

HA
• An ESX Server will determine that it is isolated after 15 seconds.
• Depending on the "Isolation Response" that you have set, all your VMs may power down.

It is therefore worth checking your network to determine whether STP can be configured to reduce the temporary network outage.

Expanding the size of a VMDK with an existing Snapshot

Expanding VM with a Snapshot

You can NOT expand a VM's VMDK file while it still has snapshots, e.g.

    # ls *
    important.vmdk  important-000001-delta.vmdk
    # vmkfstools -X 20G important.vmdk

If you do, you will have a VM that won't boot.

The fix is to trick ESX into seeing the expanded VMDK as the original size. In this example we have a test.vmdk that we expanded from 5GB to 6GB.

    # vmkfstools -X 6G test.vmdk

If we check test.vmdk we see

    # Disk DescriptorFile
    version=1
    CID=3f24a1b3
    parentCID=ffffffff
    createType="vmfs"

    # Extent description
    RW 12582912 VMFS "test-flat.vmdk"

    # The Disk Data Base
    #DDB

    ddb.virtualHWVersion = "4"
    ddb.geometry.cylinders = "783"
    ddb.geometry.heads = "255"
    ddb.geometry.sectors = "63"
    ddb.adapterType = "buslogic"

• Original - RW 10485760 VMFS "test-flat.vmdk"
• New - RW 12582912 VMFS "test-flat.vmdk"

If we have no "BACKUPS", how do we get the original value? The snapshot descriptor still records it.

    # grep -i rw test-000001.vmdk
    RW 10485760 VMFSSPARSE "test-000001-delta.vmdk"

We change the RW value in test.vmdk back to the original value.
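The RW values compared here are extent sizes counted in sectors. As a rough sanity check, the figures can be converted to gigabytes with a small shell sketch. It assumes 512-byte sectors and 1 GB = 1024 x 1024 x 1024 bytes (neither is stated on the slides), and the function names are hypothetical helpers, not part of any VMware tool.

```shell
#!/bin/sh
# Sketch only: assumes 512-byte sectors, so 1 GB = 2097152 sectors.
SECTORS_PER_GB=2097152

# Print the sector count from the "RW ..." extent line of a descriptor
# read on standard input.
rw_sectors() {
    awk '$1 == "RW" { print $2 }'
}

# Convert a sector count to whole gigabytes (integer division, rounds down).
sectors_to_gb() {
    echo $(( $1 / SECTORS_PER_GB ))
}

# Convert whole gigabytes to a sector count.
gb_to_sectors() {
    echo $(( $1 * SECTORS_PER_GB ))
}

sectors_to_gb 10485760   # original RW value of test.vmdk
sectors_to_gb 12582912   # RW value after vmkfstools -X 6G
```

With these assumptions, 10485760 sectors comes out as 5 and 12582912 as 6, which matches the 5GB and 6GB sizes in the example. On a real descriptor the value could be read with something like `rw_sectors < test-000001.vmdk`.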
    # Disk DescriptorFile
    version=1
    CID=3f24a1b3
    parentCID=ffffffff
    createType="vmfs"

    # Extent description
    RW 10485760 VMFS "test-flat.vmdk"

    # The Disk Data Base
    #DDB

    ddb.virtualHWVersion = "4"
    ddb.geometry.cylinders = "783"
    ddb.geometry.heads = "255"
    ddb.geometry.sectors = "63"
    ddb.adapterType = "buslogic"

Commit the snapshot(s)
    # vmware-cmd /pathtovmx/test.vmx removesnapshots

Grow the VMDK file
    # vmkfstools -X 6G test.vmdk

If needed, add a snapshot
    # vmware-cmd /pathtovmx/test.vmx createsnapshot <name> <description>

Corrupted .VMSD file

Corrupted Snapshots

In this example we will deal with a corrupt .VMSD file. Let's first look at a working .VMSD file, separated into 3 slides for the people at the back.

    snapshot.lastUID = "4"
    snapshot.numSnapshots = "3"
    snapshot.current = "4"
    snapshot0.uid = "2"
    snapshot0.filename = "VC1.3to201_standard-Snapshot2.vmsn"
    snapshot0.displayName = "myfirst"
    snapshot0.description = "My first test snapshot"
    snapshot0.createTimeHigh = "273684"
    snapshot0.createTimeLow = "942632403"
    snapshot0.numDisks = "1"
    snapshot0.disk0.fileName = "VC1.3to201_standard.vmdk"
    snapshot0.disk0.node = "scsi0:0"

    snapshot.needConsolidate = "FALSE"
    snapshot1.uid = "3"
    snapshot1.filename = "VC1.3to201_standard-Snapshot3.vmsn"
    snapshot1.parent = "2"
    snapshot1.displayName = "second"
    snapshot1.description = "My second test snapshot"
    snapshot1.createTimeHigh = "273684"
    snapshot1.createTimeLow = "980947483"
    snapshot1.numDisks = "1"
    snapshot1.disk0.fileName = "VC1.3to201_standard-000001.vmdk"
    snapshot1.disk0.node = "scsi0:0"

    snapshot2.uid = "4"
    snapshot2.filename = "VC1.3to201_standard-Snapshot4.vmsn"
    snapshot2.parent = "3"
    snapshot2.displayName = "third"
    snapshot2.description = "My third test snapshot"
    snapshot2.createTimeHigh = "273684"
    snapshot2.createTimeLow = "1088942286"
    snapshot2.numDisks = "1"
    snapshot2.disk0.fileName = "VC1.3to201_standard-000002.vmdk"
    snapshot2.disk0.node = "scsi0:0"

After corruption, the .VMSD file now looks like this.

    [Slide image: corrupted .VMSD file]

However, we see that the snapshots still exist.

    [root@newross VC1.3to201_standard]# ls
    VC1.3to201_standard-000001-delta.vmdk  VC1.3to201_standard-Snapshot4.vmsn
    VC1.3to201_standard-000001.vmdk        VC1.3to201_standard.vmdk
    VC1.3to201_standard-000002-delta.vmdk  VC1.3to201_standard.vmsd
    VC1.3to201_standard-000002.vmdk        VC1.3to201_standard.vmx
    VC1.3to201_standard-000003-delta.vmdk  VC1.3to201_standard.vmxf
    VC1.3to201_standard-000003.vmdk        vmware-1.log
    VC1.3to201_standard-flat.vmdk          vmware-2.log
    VC1.3to201_standard.nvram              vmware-3.log
    VC1.3to201_standard-Snapshot2.vmsn     vmware.log
    VC1.3to201_standard-Snapshot3.vmsn

At this stage, rename the .VMSD file to .VMSD.OLD.

We are going to create a new .VMSD file. What kind of magic is required to build a new VMSD file?

Create another snapshot to automatically recreate a .VMSD file.
    # vmware-cmd VC1.3to201_standard.vmx createsnapshot addedforrecovey "Hope it works"

You won't be able to selectively roll back to a particular snapshot. You will have to commit them all.

Commit the snapshots
    # vmware-cmd VC1.3to201_standard.vmx removesnapshots

All a bit too easy.

Corrupted Snapshot

What happens if the last snapshot is corrupt? This can be caused by the VMFS volume being full. Now there is data loss. We can try to limit the loss to the changes made since the last snapshot.

Move the last delta file to a temporary area (or delete it).
Edit the .VMX file and point it to the second-last 000xx.vmdk file.

    [root@newross VC1.3to201_standard]# ls
    VC1.3to201_standard-000001-delta.vmdk  VC1.3to201_standard-Snapshot4.vmsn
    VC1.3to201_standard-000001.vmdk        VC1.3to201_standard.vmdk
    VC1.3to201_standard-000002-delta.vmdk  VC1.3to201_standard.vmsd
    VC1.3to201_standard-000002.vmdk        VC1.3to201_standard.vmx
    VC1.3to201_standard-000003-delta.vmdk  VC1.3to201_standard.vmxf
    VC1.3to201_standard-000003.vmdk        vmware-1.log
    VC1.3to201_standard-flat.vmdk          vmware-2.log
    VC1.3to201_standard.nvram              vmware-3.log
    VC1.3to201_standard-Snapshot2.vmsn     vmware.log
    VC1.3to201_standard-Snapshot3.vmsn

Before:
    scsi0:0.present = "TRUE"
    scsi0:0.fileName = "VC1.3to201_standard-000003.vmdk"
After:
    scsi0:0.present = "TRUE"
    scsi0:0.fileName = "VC1.3to201_standard-000002.vmdk"

When the .VMX has been updated to point to the second-last snapshot, commit the snapshots.
    # vmware-cmd VC1.3to201_standard.vmx removesnapshots

Examining the Snapshots

The original file will contain something similar.
    [root@newross VC1.3to201_standard]# more VC1.3to201_standard.vmdk
    # Disk DescriptorFile
    version=1
    CID=9e6bfa08
    parentCID=ffffffff
    createType="vmfs"

    # Extent description
    RW 16777216 VMFS "VC1.3to201_standard-flat.vmdk"

    # The Disk Data Base
    #DDB

    ddb.virtualHWVersion = "4"
    ddb.geometry.cylinders = "1044"
    ddb.geometry.heads = "255"
    ddb.geometry.sectors = "63"
    ddb.adapterType = "lsilogic"
    ddb.toolsVersion = "7201"

A snapshot disk can look similar to this.

    [root@newross VC1.3to201_standard]# more VC1.3to201_standard-000001.vmdk
    # Disk DescriptorFile
    version=1
    CID=9e6bfa08
    parentCID=9e6bfa08
    createType="vmfsSparse"
    parentFileNameHint="VC1.3to201_standard.vmdk"

    # Extent description
    RW 16777216 VMFSSPARSE "VC1.3to201_standard-000001-delta.vmdk"

    # The Disk Data Base
    #DDB

    [root@newross VC1.3to201_standard]# more VC1.3to201_standard-000007.vmdk
    # Disk DescriptorFile
    version=1
    CID=678cf29b
    parentCID=9e6bfa08
    createType="vmfsSparse"
    parentFileNameHint="VC1.3to201_standard-000006.vmdk"

    # Extent description
    RW 16777216 VMFSSPARSE "VC1.3to201_standard-000007-delta.vmdk"

    # The Disk Data Base
    #DDB

VMFS Volumes and Extents

Avoiding Issues with Extents

When you add an extent to a VMFS volume, only the ESX server that made the change is aware of it. It is best practice to rescan VMFS volumes from all hosts. Otherwise it is possible to add another extent from another ESX server and cause issues with the VMFS volume.

    # esxcfg-rescan vmhba1
    # esxcfg-rescan vmhba2
    # service mgmt-vmware restart

Recover VMFS

How to recover after deleting a VMFS partition

Why would somebody delete their partition?
• fdisk
• dd over the beginning of the disk
• Unattended install of Linux with clearpart --all
• LUN corruption of partition information

What options are available?
• If the VMFS volume is corrupted or has been formatted over, the recovery procedure is less likely to work.
• If only the partition information is lost, recreating the partition will most likely bring the volume back.

Here we will use fdisk to recreate the partition information.

First we need to identify the correct device.
• ESX 2.X
    vmkpcidivy -q vmhba_devs
• ESX 3.X
    esxcfg-vmhbadevs -m

For this example we will assume it is /dev/sdf. Check that there are no partitions by using the command
    fdisk -l /dev/sdf

Checking the Volume Header of a VMFS3 Volume. What does it look like?

    f15e2fab000400002c15accef245465e293b00032304a5c50
    26c00006300616c69726f695f6e756c316e00000000000
    000000000000000000000000000000000000000000000
    000000000000000000000000000000000000000000000
    000000000000000000000000000000000000000000000
    000000000000000000000000000000000000000000000
    000000000000000000000000000000000000000000002
    00001000000000002c00acce014500002400accec64507
    f7449a00002304a5c5016c000000000000000000000000
    000000000100040000000000010000000000000000000
    000000000000000000000000000000000000000000000
    000000000000000000000000000000000000000000000
    000000000000000000000000000000000000000000000
    000

Extracting the Volume Header

    longstring=$(dd if=/dev/sdf bs=1k count=1 skip=19456 2>/dev/null | od -x -v | awk '{print $2, $3, $4, $5, $6, $7, $8, $9}' | tr -d " " | tr -d '\n')
    magicnumber=${longstring:4:4}${longstring:0:4}

Recreate the partition using the following commands.
• fdisk /dev/sdf
    n (to create a new partition)
    p (to create a primary partition)
    1 (to create the 1st partition)
    [enter] to keep the default value
    [enter] to keep the default value
    t (to change the type of partition)
    fb (to set the partition as VMFS)
    w (to save)
• vmkfstools -V (to discover the VMFS)

If the VMFS still isn't present, it may need to be realigned. To do this, use fdisk again.
• fdisk /dev/sdf
    x (to move to expert mode)
    b (to change the beginning of the partition)
    128 (to move the beginning of the partition to block 128)
    w (to save)
• vmkfstools -V (to discover the VMFS again)

What happens if the LUN was formatted? Do you have a BACKUP?

At this stage any data that you can get back is a bonus. If there are still ESX servers with running VMs on the VMFS volume, there are three options.
• Run VMware Converter and convert the VM with a V2V conversion.
• Run backup software in the VM.
• Copy the files from the VM.

If the VMs are powered down and an ESX server can still see the LUN:
• Copy all files (VMDK, VMX etc.) to another LUN

Best Practice

Best Practices
• Backups
• Run vm-support before and after changes. Also run it periodically and move the tar.gz to another location.
• Change control for your full environment
• If issues are intermittent, record the time and date when they happen.

Any Questions?
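Going back to the volume header check in the Recover VMFS section: od -x prints 16-bit words in host byte order, so on a little-endian host the two halves of the first 32-bit value come out swapped. That appears to be why the magicnumber line reorders the first eight hex digits. The following is a minimal sketch of that swap in portable shell, using the start of the dump shown earlier as sample input. swap_magic is a hypothetical helper, and 2fabf15e is simply the value the sample dump yields, not a figure checked against VMware documentation.

```shell
#!/bin/sh
# Sketch only: reorder the first two 16-bit words of an od -x style hex
# string, equivalent to ${longstring:4:4}${longstring:0:4} in bash.
swap_magic() {
    first=$(printf '%s' "$1" | cut -c1-4)    # hex digits 1-4
    second=$(printf '%s' "$1" | cut -c5-8)   # hex digits 5-8
    printf '%s%s\n' "$second" "$first"
}

# Sample input: the first characters of the header dump from the slides.
header="f15e2fab000400002c15accef245465e"
expected="2fabf15e"   # assumption: taken from the sample dump itself

magic=$(swap_magic "$header")
if [ "$magic" = "$expected" ]; then
    echo "header starts with the expected value: $magic"
else
    echo "unexpected value at this offset: $magic"
fi
```

On a real disk the input would be the longstring produced by the dd pipeline. A mismatch suggests that the header is not at the offset read, or that the volume has been overwritten.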