Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips
Caitie McCaffrey, Yemi Adesanya
August 2006

“The SLAC Computing Services Group is dedicated to providing leadership and support in computing and communications to the laboratory as a whole, and to physics research, in particular”

Major Concerns
• Power consumption
• Cooling
• Monitoring

What Is My Computer Doing???
• I/O rate
• CPU usage
• Memory usage
• Temperature
• Fan speed
• Load

Monitoring Software
- low overhead
- scalable
- low impact on individual machines

“Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids”
• Scalable: overhead increases with the number of clusters, not the number of nodes
• Works on multiple operating systems
• Uses a Round Robin Database (RRD)
• Measures metrics like CPU usage, load, I/O rate, and memory usage
GMOND, GMETAD, GMETRIC

Ganglia Architecture
http://www.slac.stanford.edu/comp/unix/ganglia/index.html
[Diagram]
• Updates RRD, polls clusters periodically
• Cluster One (machines A, B, C): all machines know the state of the entire cluster
• Cluster Two (machines 1, 2, 3, 4): machines 1 and 3 know the state of the entire cluster

GMETRIC
Allows users to monitor additional metrics beyond the core set monitored by the gmond daemon. Each metric has:
• Name
• Value
• Type
• Units

    gmetric --conf=/var/ganglia/gmond.conf -n CPUTemp1 -v 75 -t uint8 -u Celsius

Good because it lets us be more machine specific: we can monitor temperature and fan speed.

A little bit on hardware
Noma - batch machines
• Tyan Thunder LE-T motherboard
• Winbond w83782d (lm_sensors compatible)
• 2 Pentium III processors

Why is temperature important?
• Chip specifications give a temperature range
• Behavior is unpredictable outside that range
• Temperature gives clues to weird machine behavior
• Pentiums have a max temp of 77-82 °C

[Image: Tyan Thunder LE-T]

What’s a Noma?
NOMA
• Horse from Noma County, Japan
• Smallest native Japanese pony, 10.1-10.3 hands
• Super rare: 27 pure-blood Nomas left (1988)

Some more machines
DON, COB, TORI, ORLOV, MORAB

caitiem@noma0449 $ sensors

    w83782d-i2c-0-29
    Adapter: SMBus PIIX4 adapter at 0580
    Algorithm: Non-I2C SMBus adapter
    VCore 1:   +1.48 V  (min =  +4.08 V, max =  +4.08 V)
    VCore 2:   +1.26 V  (min =  +4.08 V, max =  +4.08 V)
    +3.3V:     +3.37 V  (min =  +2.97 V, max =  +3.63 V)
    +5V:       +4.97 V  (min =  +4.50 V, max =  +5.48 V)
    +12V:     +12.08 V  (min = +10.79 V, max = +13.11 V)
    -12V:      -1.03 V  (min = -13.21 V, max = -10.90 V)
    -5V:       +2.84 V  (min =  -5.51 V, max =  -4.51 V)
    V5SB:      +5.12 V  (min =  +4.50 V, max =  +5.48 V)
    VBat:      +3.34 V  (min =  +2.70 V, max =  +3.29 V)
    fan1:     8231 RPM  (min = 3000 RPM, div = 2)
    fan2:     8333 RPM  (min = 3000 RPM, div = 2)
    fan3:        0 RPM  (min = 3000 RPM, div = 2)
    temp1:       +77°C  (limit = +60°C)                     sensor = thermistor   ALARM
    temp2:     +65.0°C  (limit = +60°C, hysteresis = +50°C) sensor = thermistor   ALARM
    temp3:     +65.0°C  (limit = +60°C, hysteresis = +50°C) sensor = thermistor   ALARM
    vid:      +1.450 V
    alarms:   Chassis intrusion detection                   ALARM
    beep_enable: Sound alarm disabled

Perl
Fills the gap between low-level languages like C and C++ and high-level languages like shell.
- mostly fast
- basically unlimited
- good for working with text
- portable

Regular Expressions

    /^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)/

matches

    temp1: +77°C (limit = +60°C) sensor = thermistor
    temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor

Sample Time - Decreasing

    Time interval = 12.15 minutes
    Fri Aug 11 03:04:05 PDT 2006
    FanSpeed1 8035
    FanSpeed2 7941
    Temp 1: 77    Change: 0
    Temp 2: 64.0  Change: 0
    Temp 3: 64.0  Change: 1
    Time interval = 9.8415 minutes
    Fri Aug 11 03:16:15 PDT 2006

We want the sample time to decrease faster when temperatures are changing faster.

    NewTime = OldTime * Decrement ^ (Change / Trigger)
    if NewTime < MinTime then NewTime = MinTime

    NewTime = 12.15 * 0.9 ^ (1 / 0.5) = 12.15 * 0.81 = 9.8415

Parameters
• Trigger = 0.5 degrees
• Decrement = 0.9
• MaxTime = 15 minutes
• MinTime = 1 minute

Sample Time - Increasing

    Time interval = 12.15 minutes
    Fri Aug 11 08:25:18 PDT 2006
    Found FanSpeed1 8035
    Found FanSpeed2 7941
    Temp 1: 77    Change: 0
    Temp 2: 64.0  Change: 0
    Temp 3: 64.0  Change: 0
    Time interval = 13.5 minutes
    Fri Aug 11 08:37:28 PDT 2006

We want the sample time to increase when the temperature is changing slowly or not at all.
* If we increase by large amounts we could miss valuable data.

    NewTime = OldTime / Decrement
    NewTime = 12.15 / 0.9 = 13.5

Parameters
• Trigger = 0.5 degrees
• Decrement = 0.9
• MaxTime = 15 minutes
• MinTime = 1 minute

[Figures: noma0450, noma0449]

Up and running on two Nomas currently
• Noma0449
• Noma0450
Will be installed on all Nomas.
Can be used on any Ganglia-monitored machine with a compatible Winbond chip.

Many thanks to the DOE, the SCCS systems group, and especially Yemi Adesanya, John Goebel, and Karl Amrhein for all their help throughout the summer.
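
As a recap of the Regular Expressions and Sample Time slides, the parsing step and the interval rule can be sketched in Python. This is a minimal sketch under stated assumptions, not the Perl script described in the talk. The function names are hypothetical. Two behaviors are assumptions the slides do not spell out: the largest per-sensor change drives the interval, and the interval shrinks only once that change reaches Trigger. Clamping to MaxTime is inferred from the parameter list; only the MinTime clamp is stated on the slides.

```python
import re

# Pattern from the "Regular Expressions" slide, written in Python syntax.
TEMP_RE = re.compile(r"^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)")

# Values from the "Parameters" slides.
TRIGGER = 0.5     # degrees
DECREMENT = 0.9
MAX_TIME = 15.0   # minutes
MIN_TIME = 1.0    # minutes


def parse_temps(sensors_output):
    """Return {sensor number: temperature} for every tempN line."""
    temps = {}
    for line in sensors_output.splitlines():
        match = TEMP_RE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def max_change(previous, current):
    """Largest absolute change over sensors present in both samples.

    Assumption: the slides do not say how per-sensor changes are combined.
    """
    shared = set(previous) & set(current)
    return max((abs(current[k] - previous[k]) for k in shared), default=0.0)


def next_interval(old_time, change):
    """Apply the slide formulas and clamp to [MIN_TIME, MAX_TIME].

    Assumption: a change below TRIGGER counts as "changing slowly".
    """
    if change >= TRIGGER:
        new_time = old_time * DECREMENT ** (change / TRIGGER)
    else:
        new_time = old_time / DECREMENT
    return max(MIN_TIME, min(MAX_TIME, new_time))
```

With the values from the slides, `next_interval(12.15, 1)` reproduces the 9.8415-minute interval and `next_interval(12.15, 0)` reproduces 13.5 minutes, up to floating-point rounding.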
Smartmontools for SCSI devices
• Command: smartctl -l error /dev/sda

    Error counter log:
               Errors Corrected      Total       Total     Correction    Gigabytes     Total
                    delay:         [rereads/    errors     algorithm     processed     uncorrected
               minor  |  major     rewrites]   corrected   invocations   [10^9 bytes]  errors
    read:     234237        0          0        234237       234237        605.516         0
    write:         0        0          0             0            0       1457.589         0

    Non-medium error count: 0

http://smartmontools.sourceforge.net/smartmontools_scsi.html

Corrected Errors
• Minor / Fast
  • Correction algorithm works successfully
  • No delay to reading later sectors
  • These are OK
• Major / Slow
  • Correction algorithm works successfully
  • Delay in reading later sectors
  • Not so good

Uncorrected Errors
• Correction algorithm fails
• Very bad

Other Information
• Total [rereads/rewrites]: errors corrected by applying retries
• Total errors corrected: number of all correctable errors
• Correction algorithm invocations: number of times the algorithm is used
• Gigabytes processed: number of bytes successfully and unsuccessfully read or written

Callouts on the output
• This indicates there might be a problem
• This should be a flag as well
• This is OK; it's correcting the errors and not losing any time doing so

errorsWatch Monitors
• Read Uncorrected Errors
• Read Delayed Errors
• Read No Delay Errors
• Write Uncorrected Errors
• Write Delayed Errors
• Write No Delay Errors
• Total Uncorrected Errors
• Total Delayed Errors

Collects Data Once a Day
- Noma
- Don
- Tori
- Cob
- Morab
- Orlov
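
To illustrate how the metrics listed under errorsWatch could be derived from the error counter log, here is a minimal Python sketch. It is not the errorsWatch script itself. The metric names, the function name, and the sample text are hypothetical. It assumes each read/write row carries seven numeric fields in the column order shown on the slide, and that the "Total" metrics are sums over the read and write rows.

```python
import re

# Matches the "read:" and "write:" rows of the error counter log.
ROW_RE = re.compile(r"^(read|write):\s+(.+)$")

# Sample text shaped like the slide's output (data rows only).
SAMPLE_LOG = """\
Error counter log:
read:     234237        0          0        234237       234237        605.516         0
write:         0        0          0             0            0       1457.589         0

Non-medium error count: 0
"""


def parse_error_counters(log_text):
    """Return a dict of per-operation and total error metrics.

    Assumed column order per row: minor-delay errors, major-delay errors,
    rereads/rewrites, total corrected, algorithm invocations,
    gigabytes processed, uncorrected errors.
    """
    metrics = {"TotalUncorrectedErrors": 0, "TotalDelayedErrors": 0}
    for line in log_text.splitlines():
        match = ROW_RE.match(line.strip())
        if match is None:
            continue
        fields = match.group(2).split()
        if len(fields) != 7:
            continue  # unexpected layout: skip rather than guess
        op = match.group(1).capitalize()
        no_delay = int(fields[0])
        delayed = int(fields[1])
        uncorrected = int(fields[6])
        metrics[op + "NoDelayErrors"] = no_delay
        metrics[op + "DelayedErrors"] = delayed
        metrics[op + "UncorrectedErrors"] = uncorrected
        metrics["TotalDelayedErrors"] += delayed
        metrics["TotalUncorrectedErrors"] += uncorrected
    return metrics
```

Calling `parse_error_counters(SAMPLE_LOG)` yields one entry per monitored metric; each could then be published once a day, for example through gmetric.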