DataGarage - Microsoft Research

Warehousing Massive Performance Data on Commodity Servers
Charles Loboz, Slawek Smyl, Suman Nath
Microsoft Corporation
Monitoring Large DataCenters
• Management tasks: monitoring, planning, historical analysis
• Performance data: CPU, memory, disk utilization, …; response time, queue length, …
Context  Performance Data  Design Goals  DataGarage  Query Processing  Experiments
Monitoring Data Management
100K servers = 1TB data per day!
• Storage challenge
– Store data over many months, years
– Petabytes of data
• Query challenge
– Hours to run simple queries
DataGarage
• Performance data warehousing system
• Performance data: CPU, memory, disk utilization, …; response time, queue length, …
• Storage, query processing
• Efficient, scalable, cheap
Outline
• Context
• Performance data characteristics
• Design goals
• DataGarage design
• Query Processing
• Evaluation
• Conclusion
Performance Data Collection
Time   CPU  Mem  Jobs  Disk  …
10:00  48   37   3     134   …
10:01  52   39   3     342   …
10:02  58   45   2     324   …
…      …    …    …     …     …
Our Deployment
Monitoring process
CPU utilization, memory usage,
disk space, SQL queue length,
app response time, cache hit
rate, network bandwidth, …
Sampling period 15 seconds
100-1000 counters/server
5-100 MB/server/day
0.01% CPU time
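As a rough check, the per-server numbers above can be tied back to the "100K servers = 1TB data per day" figure. The sketch below assumes 10 MB/server/day, a value picked from inside the quoted 5-100 MB range rather than a number from the talk.

```python
# Back-of-the-envelope daily volume for a monitored fleet.
SERVERS = 100_000             # fleet size quoted in the talk
MB_PER_SERVER_PER_DAY = 10    # assumption: a value inside the 5-100 MB range
SAMPLE_PERIOD_S = 15          # sampling period quoted in the talk


def daily_volume_tb(servers: int, mb_per_server: float) -> float:
    """Total data per day in TB (decimal units: 1 TB = 1,000,000 MB)."""
    return servers * mb_per_server / 1_000_000


def samples_per_day(period_s: int) -> int:
    """Number of samples one counter produces per day."""
    return 24 * 60 * 60 // period_s


print(daily_volume_tb(SERVERS, MB_PER_SERVER_PER_DAY))  # 1.0
print(samples_per_day(SAMPLE_PERIOD_S))                 # 5760
```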
Performance Data Characteristics
• Heterogeneous counter sets
– 30K different counters, 100-1000 per server
• Numeric, read-only, possibly-dirty
– Dirty data retained, may be ignored for query
• Hierarchical queries
– Selection, projection, aggregation, data mining
– Example: fraction of hotmail.com servers in a given rack with CPU utilization > 50%
– Example: average memory utilization trend of hotmail servers
DataGarage Design Goals
• Small storage footprint
– Reduces storage and communication cost
– Small pay-as-you-go cost for Cloud systems
• Cheap
– Commodity hardware and off-the-shelf software
• Fast and robust query processing
– Allows fast decisions
– Tolerates faulty and slow hardware
• Simple and flexible query interface (SQL + UDF)
– Fast query writing
Options
• TableStore: Relational table
– DB engine: single-node DBMS, parallel DBMS
– MapReduce: HadoopDB [Abouzeid et al. VLDB’09]
• FileStore: Files
– MapReduce: Hadoop, Dryad [Isard et al., EuroSys’07]
Trade-offs
Criteria: performance, fault tolerance, cost, storage footprint
Options compared:
• TableStore + parallel DB engine (DBMS-X)
• TableStore + MR + single-node DB (HadoopDB)
• FileStore + MapReduce (Hadoop, Dryad)
• TableStore in files + MapReduce (DataGarage)
Storage Inefficiency: TableStore
Key problem: heterogeneous counter sets
Total 30,000 unique counters, <1000/server
• Wide table (Machine id, Timestamps, Counter 1, Counter 2, …, Counter n): all possible counters
– Too many columns
– >95% sparse
• Narrow table (Machine id, Timestamps, Counter id, Value): key-value store
– Redundant keys (4x more expensive than raw data)
– Expensive joins needed
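The two inefficiencies can be illustrated with simple arithmetic. In this sketch the sparsity follows directly from the counter totals quoted on the slide. The explanation of the 4x figure, where each of the three key fields costs about as much as the value itself, is only one possible reading and is an assumption.

```python
# Rough storage comparison of the two TableStore layouts (cell counts only).
TOTAL_COUNTERS = 30_000      # unique counters across all servers (from the talk)
COUNTERS_PER_SERVER = 1_000  # upper bound per server (from the talk)


def wide_table_sparsity(total: int, per_server: int) -> float:
    """Fraction of counter cells left empty in one row of a global wide table."""
    return 1 - per_server / total


def narrow_table_overhead(key_fields: int = 3) -> int:
    """Cells stored per value in a narrow table.

    Assumption: machine id, timestamp and counter id each cost about as
    much as the value itself, so every value occupies 1 + key_fields cells.
    """
    return 1 + key_fields


print(f"{wide_table_sparsity(TOTAL_COUNTERS, COUNTERS_PER_SERVER):.1%}")  # 96.7%
print(narrow_table_overhead())  # 4
```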
Storage Inefficiency: FileStore
• Heterogeneous counter sets
– Files need to maintain schema for each server
• No structure in data
– Compression cannot exploit data correlation
Our Solution
• One wide-table per server
– Benefits of TableStore, without sparseness/redundancy
• Each wide-table in an embedded database file
– SQLite, MS SQL Server Compact Edition (.sdf file)
– Benefits of FileStore
[Figure: each server's table holds only its own counters, e.g. (c1, c2, c3), (c1, c4, c6, c7, c8), (c2, c4, c5, c8); the files are accessed through the Microsoft SQL Server Compact Edition library.]
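A minimal sketch of the one-wide-table-per-server idea, using Python's built-in sqlite3 module in place of SQL Server Compact Edition. The table name `perf`, the column names, and the helper functions are hypothetical, and in-memory databases stand in for the per-server files.

```python
import sqlite3


def create_server_table(conn, counters):
    """Create a wide table holding only the counters this server reports."""
    cols = ", ".join(f'"{c}" REAL' for c in counters)
    conn.execute(f"CREATE TABLE perf (ts TEXT PRIMARY KEY, {cols})")


def insert_sample(conn, ts, values):
    """Insert one row: a timestamp followed by one value per counter."""
    marks = ", ".join("?" for _ in range(len(values) + 1))
    conn.execute(f"INSERT INTO perf VALUES ({marks})", [ts, *values])


# Two servers with different counter sets. ":memory:" stands in for one
# database file per server (a real deployment would pass a file path).
server_a = sqlite3.connect(":memory:")
create_server_table(server_a, ["c1", "c2", "c3"])
insert_sample(server_a, "10:00", [48, 37, 3])

server_b = sqlite3.connect(":memory:")
create_server_table(server_b, ["c2", "c4", "c5", "c8"])
insert_sample(server_b, "10:00", [52, 39, 3, 342])

# Each table is dense: no NULL padding for counters the server never reports.
print(server_a.execute("SELECT * FROM perf").fetchall())
print(server_b.execute("SELECT * FROM perf").fetchall())
```

Because every file carries its own schema, no global list of 30,000 columns is needed and no per-value keys are repeated.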
DataGarage Architecture
[Figure: architecture diagram with these components: data collectors, embedded database files on a distributed file system, controller (query dissemination), query, summary database, data analysis tools.]
Data Compression
• Zipping files with PKZip is not effective
• Compress one column at a time
– Exploit strong correlation
– RLE, delta encoding not very effective
• Our idea: Bit-truncation + Byte-interleaving
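[Figure: four 4-byte values (42 AE 91 A0, 42 AE 83 E4, 42 AE 2B 38, 42 AE 39 C4). The low-order bytes (A0, E4, 38, C4) are marked "if lossy, <1%" and dropped by bit-truncation; byte-interleaving then stores the bytes column by column: 42 42 42 42 .. AE AE AE AE .. 91 83 …]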
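The two steps can be sketched as follows. This is an illustration under assumptions, not the DataGarage implementation: the bytes on the slide are read as four big-endian 32-bit values, and zlib stands in for whatever general-purpose compressor runs after interleaving.

```python
import struct
import zlib


def truncate_low_bits(values, bits):
    """Lossy step: zero the `bits` least significant bits of each 32-bit value."""
    mask = 0xFFFFFFFF ^ ((1 << bits) - 1)
    return [v & mask for v in values]


def interleave_bytes(values):
    """Store all first bytes, then all second bytes, and so on (big-endian)."""
    raw = [struct.pack(">I", v) for v in values]
    return bytes(r[i] for i in range(4) for r in raw)


def deinterleave_bytes(data):
    """Inverse of interleave_bytes."""
    n = len(data) // 4
    return [
        struct.unpack(">I", bytes(data[i * n + j] for i in range(4)))[0]
        for j in range(n)
    ]


values = [0x42AE91A0, 0x42AE83E4, 0x42AE2B38, 0x42AE39C4]  # bytes from the slide

packed = interleave_bytes(values)
print(packed.hex(" "))  # 42 42 42 42 ae ae ae ae 91 83 2b 39 a0 e4 38 c4
assert deinterleave_bytes(packed) == values  # interleaving alone is lossless

# Lossy variant: drop the low-order byte, then interleave and compress.
lossy = interleave_bytes(truncate_low_bits(values, 8))
compressed = zlib.compress(lossy)
```

Interleaving puts the slowly changing high-order bytes next to each other, which creates long runs of identical bytes for the compressor to exploit.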
Storage Efficiency
DataGarage Query
• DataGarage query: Three components
– On: filesystem path: /hotmail/dc1/*.10-.-2009.sdf
– Apply: a SQL query run on individual database files
– Combine: a SQL query to compute final result
• Enables map-reduce style execution
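The three components can be sketched in a few lines, again using sqlite3 files in place of .sdf files. The function `run_query`, the intermediate table name `partial`, the file names, and the sample values are all hypothetical. The example computes the fraction of servers whose average CPU utilization exceeds 50%.

```python
import glob
import os
import sqlite3
import tempfile
from contextlib import closing


def run_query(on_pattern, apply_sql, combine_sql):
    """Run apply_sql on every matching database file, then run combine_sql
    over the collected rows, exposed as a table named `partial`."""
    with closing(sqlite3.connect(":memory:")) as combined:
        created = False
        for path in sorted(glob.glob(on_pattern)):            # On
            with closing(sqlite3.connect(path)) as db:
                cur = db.execute(apply_sql)                    # Apply
                names = [f'"{d[0]}"' for d in cur.description]
                rows = cur.fetchall()
            if not created:
                combined.execute(f"CREATE TABLE partial ({', '.join(names)})")
                created = True
            marks = ", ".join("?" for _ in names)
            combined.executemany(f"INSERT INTO partial VALUES ({marks})", rows)
        return combined.execute(combine_sql).fetchall()       # Combine


def make_server_file(path, cpu_samples):
    """Write one per-server database file with a small `perf` table."""
    with closing(sqlite3.connect(path)) as db:
        db.execute("CREATE TABLE perf (ts INTEGER, cpu REAL)")
        db.executemany("INSERT INTO perf VALUES (?, ?)",
                       list(enumerate(cpu_samples)))
        db.commit()


with tempfile.TemporaryDirectory() as root:
    make_server_file(os.path.join(root, "srv1.db"), [48, 52, 58])
    make_server_file(os.path.join(root, "srv2.db"), [10, 20])
    result = run_query(
        os.path.join(root, "*.db"),
        "SELECT AVG(cpu) AS avg_cpu FROM perf",
        "SELECT AVG(avg_cpu > 50) FROM partial",
    )

print(result)  # [(0.5,)]
```

Each Apply call touches a single file, so the calls are independent and can run on different execution nodes; only the small partial results need to reach the Combine step.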
Query Execution
[Figure: the controller node disseminates the query to the execution nodes; each execution node runs Apply on the database files selected by On in the distributed file system; Combine steps merge the temporary results into the final result.]
Query Execution Time
Fault Tolerance
• DataGarage key technology:
– Decoupling of execution and storage
– Fine-grained data partitioning
• Data is replicated by the file system
• Slow execution nodes
– Assigned smaller jobs
– Faster nodes take on additional load after finishing their own
• Execution node failures
– New nodes take over the remaining jobs of failed nodes
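One way to get this behavior is pull-based scheduling over per-file jobs, simulated below. This is a swapped-in illustration, not the scheduler described in the talk: the function `schedule`, the node names, and the timing model are assumptions, and here the surviving nodes (rather than newly added ones) pick up a failed node's job.

```python
import heapq
from collections import deque


def schedule(files, seconds_per_file, fails_after=None):
    """Simulate nodes pulling one file-sized job at a time.

    seconds_per_file: node name -> time to process one file.
    fails_after: node name -> number of completed files after which
                 the node dies on its next job (hypothetical failure model).
    Returns node name -> list of files it completed.
    """
    fails_after = fails_after or {}
    pending = deque(files)
    done = {node: [] for node in seconds_per_file}
    idle = [(0.0, node) for node in sorted(seconds_per_file)]
    heapq.heapify(idle)
    while pending and idle:
        now, node = heapq.heappop(idle)      # next node to become free
        job = pending.popleft()
        if len(done[node]) >= fails_after.get(node, float("inf")):
            pending.append(job)              # node died: requeue its job
            continue                         # and never schedule it again
        done[node].append(job)
        heapq.heappush(idle, (now + seconds_per_file[node], node))
    return done


files = [f"f{i}" for i in range(6)]
speeds = {"fast": 1, "slow": 3}

print(schedule(files, speeds))
# {'fast': ['f0', 'f2', 'f3', 'f4'], 'slow': ['f1', 'f5']}
print(schedule(files, speeds, fails_after={"slow": 1}))
# {'fast': ['f0', 'f2', 'f3', 'f4', 'f5'], 'slow': ['f1']}
```

Because each job is one small database file, a slow or failed node delays at most one file's worth of work, which is where fine-grained partitioning pays off.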
Goals Revisited
• High performance: queries are pushed inside the embedded database
• Storage efficient: compression
• Fault tolerant: fine partitioning of data and query processing, aggressive restarting, speculative execution
• Hierarchical queries: file system paths
• Simple interface: SQL queries
• Cheap: off-the-shelf tools, commodity machines
Outline
• Context
• Performance data characteristics
• Design goals
• DataGarage design
• Query Processing
• Experience
• Conclusion
Operational Experience
• Has been in operation for more than 1 year
– Warehousing data from Microsoft data centers
• Partitioning with fine granularity + compression is the key to storing massive data
– Previous implementation with narrow table
• 30K server-days on a 1 TB disk
• Slow queries
– Current implementation:
• 1-3 million server-days/TB
• Orders of magnitude faster queries
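The storage figures above translate into per-server-day footprints as follows. This is plain arithmetic on the quoted numbers, assuming decimal units (1 TB = 1,000,000 MB).

```python
def mb_per_server_day(server_days_per_tb: int) -> float:
    """Average footprint of one server-day, given how many fit in 1 TB."""
    return 1_000_000 / server_days_per_tb


OLD = 30_000                        # narrow-table implementation
NEW_LOW, NEW_HIGH = 1_000_000, 3_000_000  # current implementation

print(round(mb_per_server_day(OLD), 1))       # 33.3
print(mb_per_server_day(NEW_LOW))             # 1.0
print(round(mb_per_server_day(NEW_HIGH), 2))  # 0.33
print(NEW_LOW // OLD, NEW_HIGH // OLD)        # 33 100
```

In other words, the current layout stores roughly 33 to 100 times more server-days per terabyte than the narrow-table version.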
Operational Experience
• Embedded database files give flexibility
– Placement, backup simplified
– Scavenge available storage on the fly
• Simple design helps
– Several thousand lines of C# code to glue together existing tools (FS, Embedded DB, R, …)
• Defer features until necessary: Parallel Combine
• Good fit with Cloud computing model
– Data and/or computation can be on the Cloud
– Cheap: only file storage needed, small footprint
Conclusion
• Existing solutions are not efficient for
warehousing performance data
• DataGarage: performance data warehouse
• Cheap, scalable, fault tolerant
– Combines benefits of DB, MapReduce, file systems
• Operational experience shows the benefits
Questions?
Compression Overhead
Related Work
• HadoopDB
– DataGarage has finer data partitioning
• Improves fault tolerance and storage efficiency
– DataGarage uses embedded databases
• Cheap, enables using hierarchical file system
– DataGarage uses data compression
Query Processing
[Figure: the controller (query dissemination) sends <apply_script> to each <target> embedded database in the distributed file system; the outputs go to a temporary table, and <combine_script> produces the result.]