ppt - SLAC

Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips
Caitie McCaffrey, Yemi Adesanya
August 2006
“The SLAC Computing Services Group is dedicated to
providing leadership and support in computing and
communications to the laboratory as a whole, and to physics
research, in particular”
Major Concerns
• Power consumption
• Cooling
• Monitoring
What Is My Computer Doing???
• I/O rate
• CPU usage
• Memory usage
• Temperature
• Fan speed
• Load
Monitoring Software
-low overhead
-scalable
-low impact on individual machines
“Ganglia is a scalable distributed monitoring system for
high-performance computing systems such as clusters
and Grids”
• Scalable: overhead grows with the number of clusters, not the number of nodes
• Works on multiple operating systems
• Stores data in a Round Robin Database (RRD)
• Measures metrics like CPU usage, load, I/O rate, and memory usage
• Components: gmond, gmetad, gmetric
Ganglia Architecture
http://www.slac.stanford.edu/comp/unix/ganglia/index.html
[Diagram: gmetad updates the RRD and polls the clusters periodically.
Cluster One (machines A, B, C): all machines know the state of the entire cluster.
Cluster Two (machines 1, 2, 3, 4): machines 1 and 3 know the state of the entire cluster.]
GMETRIC
Lets users add their own metrics on top of the core set monitored by the gmond daemon
Each metric has a:
• Name
• Value
• Type
• Units
gmetric --conf=/var/ganglia/gmond.conf -n CPUTemp1 -v 75 -t uint8 -u Celsius
Good because it lets us be more machine specific: we can monitor temperature and fan speed
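The gmetric call above can be wrapped in a small helper so that a script publishes one metric per sensor reading. This is a minimal sketch, not the script used at SLAC: the function names `build_gmetric_args` and `publish` are hypothetical, and the default config path is simply the one shown on the slide.

```python
import subprocess

def build_gmetric_args(name, value, metric_type, units,
                       conf="/var/ganglia/gmond.conf"):
    """Return the argument list for one gmetric call (name, value, type, units)."""
    return [
        "gmetric",
        f"--conf={conf}",
        "-n", name,
        "-v", str(value),
        "-t", metric_type,
        "-u", units,
    ]

def publish(name, value, metric_type, units):
    """Run gmetric; assumes gmetric is on PATH and gmond is running."""
    subprocess.run(build_gmetric_args(name, value, metric_type, units), check=True)

# Show the command line without executing it.
print(" ".join(build_gmetric_args("CPUTemp1", 75, "uint8", "Celsius")))
```

Building the argument list separately from running it keeps the command easy to inspect on a machine where Ganglia is not installed.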
A little bit on hardware
Noma - batch machines
• Tyan Thunder LE-T motherboard
• Winbond w83782d (lm_sensor compatible)
• 2 Pentium III processors
Why is temperature important?
• Chip specifications give a temperature range
• Behavior is unpredictable outside the temperature range
• Clues to weird machine behavior
• Pentiums have a max temp of 77-82 °C
Tyan Thunder LE-T
What’s a Noma?
• Horse from Noma County, Japan
• Smallest native Japanese pony, 10.1-10.3 hands
• Super rare: 27 pure-blood Nomas left (1988)
Some more machines
DON
COB
TORI
ORLOV
MORAB
caitiem@noma0449 $ sensors
w83782d-i2c-0-29
Adapter: SMBus PIIX4 adapter at 0580
Algorithm: Non-I2C SMBus adapter
VCore 1:   +1.48 V  (min =  +4.08 V, max =  +4.08 V)
VCore 2:   +1.26 V  (min =  +4.08 V, max =  +4.08 V)
+3.3V:     +3.37 V  (min =  +2.97 V, max =  +3.63 V)
+5V:       +4.97 V  (min =  +4.50 V, max =  +5.48 V)
+12V:     +12.08 V  (min = +10.79 V, max = +13.11 V)
-12V:      -1.03 V  (min = -13.21 V, max = -10.90 V)
-5V:       +2.84 V  (min =  -5.51 V, max =  -4.51 V)
V5SB:      +5.12 V  (min =  +4.50 V, max =  +5.48 V)
VBat:      +3.34 V  (min =  +2.70 V, max =  +3.29 V)
fan1:     8231 RPM  (min = 3000 RPM, div = 2)
fan2:     8333 RPM  (min = 3000 RPM, div = 2)
fan3:        0 RPM  (min = 3000 RPM, div = 2)
temp1:       +77°C  (limit = +60°C)  sensor = thermistor  ALARM
temp2:     +65.0°C  (limit = +60°C, hysteresis = +50°C)  sensor = thermistor  ALARM
temp3:     +65.0°C  (limit = +60°C, hysteresis = +50°C)  sensor = thermistor  ALARM
vid:      +1.450 V
alarms:   Chassis intrusion detection  ALARM
beep_enable: Sound alarm disabled
Perl
Fills the gap between low-level languages like C and C++ and high-level
languages like shell.
-mostly fast
-basically unlimited
-good for working with text
-portable
Regular Expressions
/^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)/
matches
temp1: +77°C (limit = +60°C) sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
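The same pattern can be tried outside Perl. The sketch below applies the slide's temperature regular expression in Python to a few lines of `sensors` output. The fan pattern `FAN_RE` and the function name `parse_sensors` are assumptions added for illustration, and the sketch assumes each reading sits on a single line.

```python
import re

# Temperature pattern copied from the slide.
TEMP_RE = re.compile(r"^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)")
# Hypothetical companion pattern for fan lines (not from the slides).
FAN_RE = re.compile(r"^fan([0-9]):\s+([0-9]+) RPM")

def parse_sensors(text):
    """Return ({temp index: degrees C}, {fan index: RPM}) from sensors output."""
    temps, fans = {}, {}
    for line in text.splitlines():
        m = TEMP_RE.match(line)
        if m:
            temps[int(m.group(1))] = float(m.group(2))
            continue
        m = FAN_RE.match(line)
        if m:
            fans[int(m.group(1))] = int(m.group(2))
    return temps, fans

SAMPLE = """\
fan1: 8231 RPM (min = 3000 RPM, div = 2)
temp1: +77°C (limit = +60°C) sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
"""

temps, fans = parse_sensors(SAMPLE)
print(temps, fans)
```

The first capture group gives the sensor number and the second the reading, so both `+77` and `+65.0` are accepted.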
Sample Time - Decreasing
Want sample time to decrease faster when temperatures are changing faster
• Time interval = 12.15 minutes
  Fri Aug 11 03:04:05 PDT 2006
  FanSpeed1 8035
  FanSpeed2 7941
  Temp 1: 77    Change: 0
  Temp 2: 64.0  Change: 0
  Temp 3: 64.0  Change: 1
• Time interval = 9.8415 minutes
  Fri Aug 11 03:16:15 PDT 2006
New time = old time * Decrement ^ (Change / Trigger)
*if new time < MinTime then new time = MinTime
New time = 12.15 * 0.9 ^ (1 / 0.5) = 9.8415
Parameters
• Trigger = 0.5 degrees
• Decrement = 0.9
• MaxTime = 15 minutes
• MinTime = 1 minute
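The decrease rule can be written as a short function. This is a sketch of the formula on the slide using the listed parameters as defaults; the name `decreased_interval` is hypothetical, and how the script combines the changes of several sensors into one `change` value is not stated in the slides.

```python
def decreased_interval(old_minutes, change, trigger=0.5,
                       decrement=0.9, min_minutes=1.0):
    """New time = old time * Decrement ^ (Change / Trigger), floored at MinTime."""
    new_minutes = old_minutes * decrement ** (change / trigger)
    return max(new_minutes, min_minutes)

# Slide example: a 1 degree change shrinks a 12.15 minute interval.
print(round(decreased_interval(12.15, 1), 4))
```

Because the exponent is Change / Trigger, a 1 degree change with a 0.5 degree trigger applies the 0.9 factor twice, which gives the 9.8415 minutes shown on the slide.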
Sample Time – Increasing
Want sample time to increase when temperature is changing slowly or not at all
*If we increase by large amounts we could miss valuable data
• Time interval = 12.15 minutes
  Fri Aug 11 08:25:18 PDT 2006
  Found FanSpeed1 8035
  Found FanSpeed2 7941
  Temp 1: 77    Change: 0
  Temp 2: 64.0  Change: 0
  Temp 3: 64.0  Change: 0
• Time interval = 13.5 minutes
  Fri Aug 11 08:37:28 PDT 2006
NewTime = OldTime / Decrement
NewTime = 12.15 / 0.9 = 13.5
Parameters
• Trigger = 0.5 degrees
• Decrement = 0.9
• MaxTime = 15 minutes
• MinTime = 1 minute
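The increase rule is a single division. The sketch below adds a cap at MaxTime, which is an assumption: the slide lists MaxTime as a parameter but does not spell out the clamp. The name `increased_interval` is hypothetical.

```python
def increased_interval(old_minutes, decrement=0.9, max_minutes=15.0):
    """NewTime = OldTime / Decrement, capped at MaxTime (cap is assumed)."""
    return min(old_minutes / decrement, max_minutes)

# Three quiet samples in a row, starting from the slide's 12.15 minutes.
interval = 12.15
for _ in range(3):
    interval = increased_interval(interval)
    print(round(interval, 4))
```

Dividing by 0.9 lengthens the interval by about 11% per quiet sample, so the step up stays small and the interval settles at the 15 minute maximum.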
[Figures: noma0450 and noma0449]
Up and running on two Nomas currently
• Noma0449
• Noma0450
Will be installed on all Nomas
Can be used on any Ganglia monitored machine with a
compatible Winbond chip
Much thanks to the DOE, SCCS systems group and especially
Yemi Adesanya, John Goebel, & Karl Amrhein for all their help
throughout the summer.
Smartmontools for SCSI devices
• Command: smartctl -l error /dev/sda
Error counter log:
          Errors Corrected    Total      Total   Correction     Gigabytes    Total
              delay:       [rereads/    errors   algorithm      processed    uncorrected
            minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0

Non-medium error count:        0
http://smartmontools.sourceforge.net/smartmontools_scsi.html
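The read and write rows of the error counter log can be pulled into named fields for later checks. This is a minimal sketch that assumes the seven-column layout shown above; the column keys in `COLUMNS` and the function name `parse_error_counter_log` are hypothetical, and other smartctl versions may label or order the columns differently.

```python
# Hypothetical keys for the seven numeric columns, in slide order.
COLUMNS = ("minor", "major", "rereads_rewrites", "total_corrected",
           "algorithm_invocations", "gigabytes_processed", "total_uncorrected")

def parse_error_counter_log(text):
    """Return {'read': {...}, 'write': {...}} from smartctl -l error output."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[0] in ("read:", "write:"):
            values = [float(f) if "." in f else int(f) for f in fields[1:]]
            rows[fields[0].rstrip(":")] = dict(zip(COLUMNS, values))
    return rows

SAMPLE_LOG = """\
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0
"""

rows = parse_error_counter_log(SAMPLE_LOG)
print(rows["read"]["minor"], rows["write"]["gigabytes_processed"])
```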
Corrected Errors
• Minor / Fast
  - Correction algorithm works successfully
  - No delay to reading later sectors
  - These are OK
• Major / Slow
  - Correction algorithm works successfully
  - Delay in reading later sectors
  - Not so good
• Uncorrected Errors
  - Correction algorithm fails
  - Very bad
Other Information
• Total [rereads/rewrites] – errors corrected by applying retries
• Total errors corrected – number of all correctable errors
• Correction Algorithm Invocations – number of times the algorithm is used
• Gigabytes Processed – amount of data successfully and unsuccessfully read or written, in units of 10^9 bytes
This indicates there might be a problem
This should be a flag as well
This is OK, it's correcting the errors and not losing any time doing so
errorsWatch
Monitors
• Read Uncorrected Errors
• Read Delayed Errors
• Read No Delay Errors
• Write Uncorrected Errors
• Write Delayed Errors
• Write No Delay Errors
• Total Uncorrected Errors
• Total Delayed Errors
Collects Data Once a Day
-Noma
-Don
-Tori
-Cob
-Morab
-Orlov
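The eight monitored values can be derived from the read and write rows of the error counter log. The sketch below is an assumption about how that mapping might look, not the errorsWatch source: it treats "delayed" as the major column and "no delay" as the minor column, and the functions `errors_watch_metrics` and `flagged` are hypothetical.

```python
def errors_watch_metrics(read, write):
    """Map read/write counters to the eight values listed above (assumed mapping)."""
    return {
        "Read Uncorrected Errors": read["total_uncorrected"],
        "Read Delayed Errors": read["major"],
        "Read No Delay Errors": read["minor"],
        "Write Uncorrected Errors": write["total_uncorrected"],
        "Write Delayed Errors": write["major"],
        "Write No Delay Errors": write["minor"],
        "Total Uncorrected Errors": read["total_uncorrected"] + write["total_uncorrected"],
        "Total Delayed Errors": read["major"] + write["major"],
    }

def flagged(metrics):
    """Names of totals worth a look: any uncorrected or delayed errors."""
    watch = ("Total Uncorrected Errors", "Total Delayed Errors")
    return [name for name in watch if metrics[name] > 0]

# Counters taken from the smartctl example in the slides.
read = {"minor": 234237, "major": 0, "total_uncorrected": 0}
write = {"minor": 0, "major": 0, "total_uncorrected": 0}
metrics = errors_watch_metrics(read, write)
print(flagged(metrics))
```

With the example disk, all errors were corrected without delay, so nothing is flagged.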