Capacity Management

Capacity Management for Web Operations
John Allspaw
Operations Engineering
the book I’m writing
???
Rules of Thumb
Planning/Forecasting
Stupid Capacity Tricks
(with some Flickr statistics sprinkled in)
Things that can cause downtime
bugs (disguised as capacity problems)
edge cases (disguised as capacity problems)
security incidents
real capacity problems*
* (should be the last thing you need to worry about)
Capacity != Performance
Forget about performance for right now
Measure what you have right NOW
Don’t count on it getting any better
Thank You HPC Industry!
Automated Stuff
Scalable Metric Collection/Display
A lot of great deployment and management tricks come from them and were adopted by web ops
Good Measurement Tools
record and
store
metrics in/out
custom metrics
easily compare
lightweight-ish
Clouds need planning too
Makes deployment and procurement easy and quick
But clouds are still resources with costs and limits, just like your own stuff
Black-boxes: you may need to pay even more attention than before
Metrics
System Statistics
Metrics
“Application” Level
(photos processed per minute)
(average processing time per photo)
(apache requests)
(concurrent busy apache procs)
Metrics
App-level meets system-level
here, total CPU = ~1.12 * # busy apache procs
2400
photos per minute being uploaded right NOW (Tuesday
Ceilings
the most “work” your resources will allow before degradation or failure
Forget Benchmarking
Find your ceilings
what you have left
The End
Use real live production data
to find ceilings
Production: “it’s like a lab, but bigger!”
Like: database ceilings
replication lag: bad!
Ceilings
waiting on disk: sustained disk I/O wait above 40% creates too much slave lag*
*for us, YMMV
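A minimal sketch of how a "sustained" ceiling like this might be checked against collected samples. The 40% threshold is from the slide; the sample values and the `min_consecutive` window are hypothetical and would need tuning for a real database tier.

```python
def sustained_above(samples, threshold=40.0, min_consecutive=5):
    """Return True if `samples` has a run of at least `min_consecutive`
    consecutive values strictly above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it when a sample is at or below the threshold.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False

# Hypothetical I/O wait percentages, one per collection interval.
brief_spike = [12, 55, 60, 18, 20, 15, 14]
sustained = [35, 42, 45, 47, 44, 41, 30]

print(sustained_above(brief_spike))  # a short spike does not count
print(sustained_above(sustained))    # five samples in a row above 40%
```

Counting consecutive samples rather than single readings keeps a brief spike from being mistaken for a ceiling.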
35,000 photo requests per second on a Tuesday peak
Safety Factors
Safety Factors
Ceiling * Factor of Safety = UR LIMITZ
Safety Factors
webserver!
Safety Factors
what you have left
“safe”
ceiling
@85% CPU
85% total CPU = ~76 busy apache procs
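The arithmetic behind that per-server limit can be sketched as below. It uses only the two figures quoted on the slides (total CPU = ~1.12 * busy apache procs, and a "safe" ceiling at 85% CPU); the function and constant names are made up for illustration.

```python
# Observed relationship from the slides: total CPU % ~= 1.12 * busy Apache procs.
CPU_PERCENT_PER_BUSY_PROC = 1.12

def busy_proc_limit(safe_cpu_percent, cpu_per_proc=CPU_PERCENT_PER_BUSY_PROC):
    """Busy-proc count (rounded to the nearest whole proc) that corresponds
    to the chosen 'safe' CPU ceiling."""
    return round(safe_cpu_percent / cpu_per_proc)

per_server_limit = busy_proc_limit(85.0)
print(per_server_limit)
```

Tying the system-level ceiling (CPU) back to an app-level metric (busy procs) is what makes the limit something you can count servers against.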
Safety Factors
Yahoo Front Page link to Chinese New Year Photos (8% spike)
(photo requests/second)
Forecasting
Forecasting
Fictional Example:
webservers
Forecasting
peak of the week
Fictional example: 15 webservers. 1 week.
Forecasting
...bigger sample, 6 weeks....isolate the peaks...
Forecasting
not too shabby
now
...“Add a Trendline” with some decent correlation...
Forecasting
ceiling
this will tell you when it is
when is this?
what you have left
15 servers @76 busy apache proc limit = 1140 total procs
Forecasting
(1140-726) / 42.751 = 9.68
(week #10, duh)
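The same calculation as a small sketch. It assumes, since the slide does not say so explicitly, that 42.751 is the trendline slope (busy procs added per week) and 726 is its intercept; the function name is hypothetical.

```python
def week_ceiling_is_reached(ceiling, slope, intercept):
    """Solve slope * week + intercept = ceiling for week (linear trendline)."""
    if slope <= 0:
        raise ValueError("trend is flat or shrinking; the ceiling is never reached")
    return (ceiling - intercept) / slope

servers = 15
per_server_limit = 76                           # busy apache procs at 85% CPU
cluster_ceiling = servers * per_server_limit    # 1140 total procs

week = week_ceiling_is_reached(cluster_ceiling, slope=42.751, intercept=726)
print(round(week, 2))
```

Rounding the result up gives the week in which the fictional cluster runs out of headroom.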
Forecasting Automation
Writing excel macros is boring
All we want is “days remaining”, so all we need is the curve-fit
Use http://fityk.sf.net to automate the curve-fit
Forecasting
Fictional Example:
storage consumption
Forecasting Automation
this will tell you when this is
actual flickr storage consumption from early 2005, in GB
(ceiling is fictional)
Forecasting Automation
[jallspaw:~]$ cfityk ./fit-storage.fit
cmd line script
output
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...
Forecasting Automation
fityk gave:
y = 0.786854*x^2 + 146.657*x + 14147.4
(R^2 = 99.84)
Excel gave:
y = 0.7675*x^2 + 146.96*x + 14147.3
(R^2 = 99.84)
(SAME)
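Once the curve-fit exists, finding where it crosses a ceiling is just solving the quadratic. A sketch using the fityk coefficients above; the ceiling value is hypothetical (the slides only say the ceiling is fictional), and x is in whatever units the input file uses.

```python
import math

# Hypothetical ceiling in GB, chosen only for illustration.
HYPOTHETICAL_CEILING_GB = 20000.0

def x_at_ceiling(ceiling, c0=14147.4, c1=146.657, c2=0.786854):
    """Smallest non-negative x where c0 + c1*x + c2*x^2 equals `ceiling`,
    or None if the fitted curve never reaches it."""
    if c2 == 0:
        # Degenerate case: the fit is a straight line.
        if c1 == 0:
            return None
        x = (ceiling - c0) / c1
        return x if x >= 0 else None
    disc = c1 * c1 - 4 * c2 * (c0 - ceiling)
    if disc < 0:
        return None
    root = math.sqrt(disc)
    roots = [(-c1 - root) / (2 * c2), (-c1 + root) / (2 * c2)]
    candidates = sorted(x for x in roots if x >= 0)
    return candidates[0] if candidates else None

x_hit = x_at_ceiling(HYPOTHETICAL_CEILING_GB)
print(x_hit)
```

Subtracting the x value of the most recent data point from `x_hit` gives the "remaining" figure the slides are after.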
Capacity Health
12,629 nagios checks
1314 hosts
6 datacenters
4 photo “farms”
farm = 2 DCs (east/west)
High and Low Water Marks
alert if higher
alert if lower
Per server, squid requests per second
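A high/low water mark check is small enough to sketch in a few lines. The thresholds and readings below are hypothetical, not Flickr's real numbers; the function name is made up for illustration.

```python
def check_water_marks(value, low, high):
    """Classify a metric reading against low and high water marks."""
    if value > high:
        return "HIGH"   # approaching the ceiling: capacity problem
    if value < low:
        return "LOW"    # suspiciously idle: traffic may not be reaching the box
    return "OK"

# Hypothetical per-server squid requests/second readings.
for reading in (120, 640, 1010):
    print(reading, check_water_marks(reading, low=200, high=950))
```

The low mark matters as much as the high one: a server doing almost nothing is often broken, not healthy.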
A good dashboard looks something like...
type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak   est. days left
www       20  80         busy procs     1600           1000            62.50%   36
shard db  20  40         I/O wait       800            220             27.50%   120
squid     18  950        req/sec        17,100         11,400          66.67%   48
(yes, fictional numbers)
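The derived dashboard columns follow directly from the inputs: limit (total) is the number of boxes times the per-box limit, and % peak is current peak over that total. A sketch using the same fictional numbers; the `summarize` helper and tuple layout are assumptions for illustration.

```python
# (type, number of boxes, limit per box, ceiling units, current peak)
ROWS = [
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "I/O wait", 220),
    ("squid", 18, 950, "req/sec", 11400),
]

def summarize(row):
    """Return (type, total limit, percent of limit used at peak)."""
    name, boxes, limit_per_box, _units, current_peak = row
    total_limit = boxes * limit_per_box
    percent_peak = round(100 * current_peak / total_limit, 2)
    return name, total_limit, percent_peak

summary = [summarize(row) for row in ROWS]
for name, total_limit, percent_peak in summary:
    print(f"{name:<9} limit={total_limit:>6} peak={percent_peak:.2f}%")
```

The "est. days left" column is not computed here because it depends on the per-tier curve-fit, not on these inputs alone.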
Diagonal Scaling
vertically scaling your already horizontal nodes
Image processing machines
Replace Dell PE860s with HP DL140 G3s
Diagonal Scaling
example: image processing
4 cores
8 cores
(about the same CPU “usage” per box)
Diagonal Scaling
example: image processing throughput
~45 images/min @ peak
~140 images/min @ peak
(same CPU usage, but ~3x more work)
“processing” means making 4 sizes from originals
Diagonal Scaling
example: image processing
went from:  23 Dell PE860s    1035 photos/min  23U rack  3008.4 Watts
to:          8 HP DL140 G3s   1120 photos/min   8U rack  1036.8 Watts
!!!
(75% faster, even)
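To make the before/after comparison concrete, here is a sketch that derives throughput per watt and per rack unit from the figures on the slide. The dictionary layout and helper names are assumptions for illustration.

```python
before = {"boxes": 23, "photos_per_min": 1035, "rack_units": 23, "watts": 3008.4}
after = {"boxes": 8, "photos_per_min": 1120, "rack_units": 8, "watts": 1036.8}

def photos_per_watt(cluster):
    """Cluster throughput divided by total power draw."""
    return cluster["photos_per_min"] / cluster["watts"]

def photos_per_rack_unit(cluster):
    """Cluster throughput divided by rack space consumed."""
    return cluster["photos_per_min"] / cluster["rack_units"]

for label, cluster in (("Dell PE860", before), ("HP DL140 G3", after)):
    print(
        f"{label:<12} "
        f"{photos_per_watt(cluster):.2f} photos/min/W, "
        f"{photos_per_rack_unit(cluster):.0f} photos/min/U"
    )
```

Both ratios improve by roughly 3x, which is the point of diagonal scaling: more work from less power and less rack space.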
3.52
terabytes will be consumed today (on a
2nd Order Effects
(beware the wandering bottleneck)
running hot,
so add more
2nd Order Effects
(beware the wandering bottleneck)
now
these run
hot
running great now,
so more traffic!
Stupid Capacity Tricks
Stupid Capacity Tricks
quick and dirty management
DSH
http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2
Stupid Capacity Tricks
quick and dirty management
[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100:      Mon Jun 23 14:14:53 UTC 2008
www118:      Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1:      Mon Jun 23 14:14:53 UTC 2008
admin2:      Mon Jun 23 14:14:53 UTC 2008
dsh>
Stupid Capacity Tricks
Turn Stuff OFF
Disable heavy-ish features of the site (on/off switches)
We have 195 different things to disable in case of emergency.
Stupid Capacity Tricks
Turn Stuff OFF
uploads (photo)
uploads (video)
uploads by email
various API things
various mobile things
various search things
etc., etc.
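One way on/off switches like these could be modeled, as a rough sketch only. The slides do not describe how Flickr implements its switches; the class, method names, and feature names below are hypothetical.

```python
class FeatureSwitches:
    """In-memory registry of features that can be turned off in an emergency."""

    def __init__(self, names):
        # Every feature starts enabled.
        self._enabled = {name: True for name in names}

    def disable(self, name):
        self._set(name, False)

    def enable(self, name):
        self._set(name, True)

    def is_enabled(self, name):
        return self._enabled[name]

    def _set(self, name, state):
        if name not in self._enabled:
            raise KeyError(f"unknown feature: {name}")
        self._enabled[name] = state

# Hypothetical feature names, loosely following the list above.
switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
switches.disable("video_uploads")

if switches.is_enabled("video_uploads"):
    print("accepting video uploads")
else:
    print("video uploads are temporarily turned off")
```

In practice the switch state would live somewhere every web server can read quickly, so that flipping it does not need a deploy.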
Stupid Capacity Tricks
Outages Happen
Host your outage/status/blog page in more than one datacenter.
Tell your users WTF is going on; they’ll appreciate it.
Stupid Capacity Tricks
Hit the Pause Button
Bake the dynamic into static
Some Y! properties have a big red button to instantly bake (and unbake) at will
thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/
We’re Hiring!
flickr.com/jobs
Come see me!
questions?