Robin Schumacher
Director of Product Strategy
www.enterprisedb.com
Building a Logical Design
Transitioning to the Physical
Monitoring for Success
What is the key component for success?
In other words, what you do with your PostgreSQL server – in terms of physical design, schema design, and performance design – will be the biggest factor in whether a BI system hits the mark…
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
[Figure: Your Database. Source: Oracle Corporation.]
First Recommendation – Use a Modeling Tool
A Logical Design for OLTP Isn’t for a Data Warehouse
[Diagram: a simple reporting database – End Users → Application Servers → OLTP Database, copied via ETL or replication to a second box. “Just use the same design on a different box…”]
The logical design for analytics/data warehousing
• Datatypes are defined more generically and are not tied to a specific database engine, but still choose them carefully
• Entities aren’t necessarily designed for performance
• Redundancy is avoided, but simplicity is still a goal
• Bottom line: you want to make sure your data is correctly represented and easily understood (there is a new class of user today)
A modeling technique to overcome large data volumes
Pros/cons of manual partitioning
Pros
• Less I/O, if the design holds up
• Easy to prune obsolete data
• Possibly less object contention
Cons
• More tables to manage
• More referential integrity to manage
• More indexes to manage
• Joins are often needed to satisfy query requests
• A redesign is often needed, because the rows/columns you thought you’d be accessing together change; ad-hoc query traffic is hard to predict (see the sketch after this list)
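For illustration, a minimal sketch of the manual approach in plain SQL; the table and column names are hypothetical, not from the original deck. Each date range gets its own physical table, and a view stitches them back together for queries:

-- Hypothetical: manually partitioning a large fact table by year.
-- Each year lives in its own physical table.
CREATE TABLE sales_2010 (
    sale_id     bigint        NOT NULL,
    sale_date   date          NOT NULL CHECK (sale_date >= DATE '2010-01-01'
                                          AND sale_date <  DATE '2011-01-01'),
    customer_id integer       NOT NULL,
    amount      numeric(12,2)
);

CREATE TABLE sales_2011 (
    sale_id     bigint        NOT NULL,
    sale_date   date          NOT NULL CHECK (sale_date >= DATE '2011-01-01'
                                          AND sale_date <  DATE '2012-01-01'),
    customer_id integer       NOT NULL,
    amount      numeric(12,2)
);

-- Queries go through a view; loads and pruning target individual tables.
CREATE VIEW sales AS
    SELECT * FROM sales_2010
    UNION ALL
    SELECT * FROM sales_2011;

-- Pruning obsolete data: redefine the view without the old year, then
-- drop that table – far cheaper than DELETEing millions of rows.
CREATE OR REPLACE VIEW sales AS
    SELECT * FROM sales_2011;
DROP TABLE sales_2010;

Note how the cons show up directly: every new period means new DDL, new view maintenance, and new indexes.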
Use a modeling tool
Don’t use 3rd normal form
Manually partition, but…
Let the DB do the heavy lifting
How do I scale…?
Should I worry about high availability…?
How should I partition my data…?
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
• Whether you choose to go NoSQL, shard, use MPP databases, or something similar, divide & conquer is your best friend
• You can scale up and divide & conquer to a point, but you will hit disk, memory, or other limitations
• Scaling both up and out is the best future-proofing methodology
Divide & Conquer via Sharding
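A minimal, hypothetical sketch of hash-based shard routing; the table name and the shard count of four are illustrative assumptions, not from the original deck:

-- Route each customer to one of 4 shard databases by hashing the
-- distribution key. hashtext() is a built-in PostgreSQL hash function;
-- bitwise AND with 3 is modulo 4 for a power-of-two shard count.
SELECT customer_id,
       hashtext(customer_id::text) & 3 AS shard_id
FROM   staging_customers;

-- The application (or a routing layer) then sends each row to the
-- matching shard, e.g. shard_id 2 -> host dw-shard-2. Queries that
-- filter on customer_id touch exactly one shard; aggregates fan out
-- to all shards and merge the results.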
A column-oriented architecture looks the same on the surface, but stores data differently than legacy, row-based databases…
Row- or Column-Based Database…?
Yes, row-based database!
• You’ll need most columns in a table for a query
• You’ll be doing lots of single inserts/deletes
• Small-to-medium data
• You know exactly what to index, and it won’t change
Yes, column-based database!
• You only need a subset of columns for a query
• You need very fast loads and do little DML
• Medium-to-very-large data
• Very dynamic environment; query patterns change
Example: Column DB vs. “Leading” Row DB
• InfiniDB takes up 22% less space
• InfiniDB loaded data 22% faster
• InfiniDB total query times were 65% less
• InfiniDB average query times were 59% less
Notice that the queries are not only faster but also more predictable.
* Tests run on a standalone machine: 16 CPUs, 16GB RAM, CentOS 5.4, with 2TB of raw data
Some vendors now give you a choice of table format – row or column – based on expected access patterns.
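As a hedged illustration of that choice, assuming a PostgreSQL columnar table access method such as the one shipped by the Citus citus_columnar extension is installed (core PostgreSQL has no built-in columnar format):

-- Assumes the Citus columnar extension; not part of core PostgreSQL.
CREATE EXTENSION IF NOT EXISTS citus_columnar;

-- Narrow, frequently updated dimension: default row (heap) storage.
CREATE TABLE customers (
    customer_id integer PRIMARY KEY,
    name        text,
    region      text
);

-- Wide, append-mostly fact table scanned a few columns at a time:
-- columnar storage.
CREATE TABLE sales_fact (
    sale_id     bigint,
    sale_date   date,
    customer_id integer,
    amount      numeric(12,2)
) USING columnar;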
• The standard model is not relational
• Data is typically not accessed with SQL
• They take up more space than column databases
• They lack special optimizers/features to reduce I/O
• These really are row-oriented architectures that store data in ‘column families’, which are expected to be accessed together (remember logical vertical partitioning?)
• Individual columns cannot be accessed independently
• They will be faster for individual insert and delete operations
• They will normally be faster for single-row requests
• They will lag in typical analytic/data warehouse use cases
• Main goal: reduce I/O via partitioning
• Partitioning in PostgreSQL is somewhat more laborious than in other RDBMSs
• Consider it when table size exceeds memory capacity
• The partitioning key is ‘key’ for many reasons
• We have seen > 90% response time reductions when it is done right
• Partitioning also allows much more efficient data pruning than typical DELETE operations (see the sketch after this list)
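A minimal sketch of inheritance-based partitioning, the mechanism PostgreSQL offered at the time (newer releases add declarative PARTITION BY syntax); the table and column names are hypothetical:

-- The parent table holds no rows itself; each child carries a CHECK
-- constraint that lets the planner skip irrelevant partitions
-- (constraint exclusion).
CREATE TABLE measurements (
    logged_at timestamptz NOT NULL,
    device_id integer     NOT NULL,
    reading   double precision
);

CREATE TABLE measurements_2011_01 (
    CHECK (logged_at >= '2011-01-01' AND logged_at < '2011-02-01')
) INHERITS (measurements);

CREATE TABLE measurements_2011_02 (
    CHECK (logged_at >= '2011-02-01' AND logged_at < '2011-03-01')
) INHERITS (measurements);

-- Make sure the planner uses the CHECK constraints:
SET constraint_exclusion = partition;

-- This scans only measurements_2011_02:
SELECT avg(reading)
FROM   measurements
WHERE  logged_at >= '2011-02-05' AND logged_at < '2011-02-06';

-- Pruning a month is a cheap DROP, not an enormous DELETE:
DROP TABLE measurements_2011_01;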
• One option for a divide-and-conquer strategy
• Does have limitations with respect to PostgreSQL feature and syntax support
• One EnterpriseDB customer is running well with 8TB
• If query patterns are known and predictable, and the data is relatively static, then indexing isn’t that difficult
• In a very ad-hoc environment, indexing becomes more difficult: you must analyze the SQL traffic and index as best you can
• Over-indexing a table that is frequently loaded/refreshed/updated can severely impact load and DML performance. Test dropping and re-creating indexes vs. doing in-place loads and DML; realize, though, that any queries running in the meantime will be impacted by the dropped indexes
• Remember that a benefit of (most) column databases is that they do not need or use indexes (a sketch for spotting unused indexes follows this list)
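One way to “index as best you can” is to let the server tell you which indexes are never used; a minimal sketch against PostgreSQL’s standard statistics views:

-- Indexes never used since statistics were last reset: candidates
-- for dropping before heavy load windows.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan     AS scans
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0
ORDER  BY schemaname, relname;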
• The two biggest killers of load performance are (1) very wide tables, for row-based storage, and (2) many indexes and foreign keys on a table
• Column-based tables typically load faster than row-based tables with load utilities; however, they will see slower insert/delete rates than row-based tables
• Move the data as close to the database as possible; having applications on remote machines manipulate data and send it across the wire a row at a time is perhaps the worst way to load data
• It is often good to create staging tables, then use a procedural language to do data modifications and/or create flat files for high-speed loaders
• PostgreSQL COPY is much faster than INSERT, and EnterpriseDB’s EDB*Loader is faster than COPY (see the sketch after this list)
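A minimal sketch of the COPY path; the file path and table are hypothetical:

-- Load a flat file in one server-side COPY rather than row-at-a-time INSERTs.
COPY sales_fact (sale_id, sale_date, customer_id, amount)
FROM '/data/loads/sales_2011_02.csv'
WITH (FORMAT csv, HEADER true);

-- From a client machine, psql's \copy streams the same file over the
-- wire in a single COPY protocol stream:
--   \copy sales_fact FROM 'sales_2011_02.csv' WITH (FORMAT csv, HEADER true)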
• Increasing maintenance_work_mem to a large value (e.g. > 1GB) helps speed index and foreign key constraint creation
• Turning autovacuum off can help speed load operations
• Turning off synchronous commit (synchronous_commit) can help improve load efficiency, but generally use it only for all-or-nothing use cases
• Minimizing checkpoint I/O is a good idea (set checkpoint_segments to 100-200 and checkpoint_timeout to a higher value, such as 1 hour or so); see the sketch after this list
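A hedged sketch of what a bulk-load window might look like; the values are illustrative, and checkpoint_segments applies to the PostgreSQL releases of the time (newer releases use max_wal_size instead):

-- Session-level settings on the loading connection:
SET maintenance_work_mem = '2GB';  -- faster index / FK builds after the load
SET synchronous_commit   = off;    -- only for all-or-nothing load jobs

-- Per-table: suspend autovacuum on the table being loaded:
ALTER TABLE sales_fact SET (autovacuum_enabled = false);

-- postgresql.conf (requires a reload/restart; era-appropriate names):
--   checkpoint_segments = 150
--   checkpoint_timeout  = 1h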
• The focus of this methodology is answering the question “What am I waiting on?”
• With stock PostgreSQL, unfortunately, it can be difficult to determine latency inside the database server
• Lock contention is rarely an issue in data warehouses
• You can use EnterpriseDB’s wait interface
• Problems found in bottleneck analysis translate into better lock handling in the app, partitioning improvements, better indexing, or storage engine replacement (see the sketch after this list)
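For reference, newer PostgreSQL releases (9.6 and later) expose wait information directly in pg_stat_activity; a minimal sketch:

-- What is each active session currently waiting on (PostgreSQL 9.6+)?
SELECT pid,
       state,
       wait_event_type,  -- e.g. Lock, LWLock, IO, Client
       wait_event,       -- the specific wait point
       query
FROM   pg_stat_activity
WHERE  state <> 'idle';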
• The focus of this methodology is answering three questions: (1) Who’s logged on? (2) What are they doing? (3) How is my machine handling it?
• Monitor active and inactive sessions; keep in mind that idle connections do take up resources
• I/O and ‘hot objects’ are a key area of analysis
• The key focus should be SQL statement monitoring and collection, something that goes beyond standard pre-production EXPLAIN analysis (two sketches follow this list)
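Two minimal sketches against PostgreSQL’s standard statistics views:

-- Who's logged on, and what are they doing?
SELECT usename, state, count(*) AS sessions
FROM   pg_stat_activity
GROUP  BY usename, state
ORDER  BY sessions DESC;

-- 'Hot objects': the tables reading the most blocks from disk.
SELECT schemaname, relname, heap_blks_read, heap_blks_hit
FROM   pg_statio_user_tables
ORDER  BY heap_blks_read DESC
LIMIT  10;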
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
• SQL analysis basically becomes bottleneck analysis, because you’re asking where your SQL statement is spending its time
• Once you have collected and identified your ‘top SQL’, the next step is to trace and interrogate each SQL statement to understand its execution
• Historical analysis is important too; a query that ran fine with 5 million rows may tank with 50 million, or with more concurrent users
• Design changes usually involve data file striping, indexing, partitioning, or the addition of parallel processing (a top-SQL sketch follows this list)
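One way to collect ‘top SQL’ in stock PostgreSQL is the pg_stat_statements contrib module; a minimal sketch (the timing columns are named total_time/mean_time in older releases and total_exec_time/mean_exec_time from PostgreSQL 13 on):

-- Requires shared_preload_libraries = 'pg_stat_statements', then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by total execution time (PostgreSQL 13+ column names):
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       query
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  10;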
• The pgstatspack utility, available to the community, can provide some statistics for workload analysis
• The pgfouine log analyzer tool can help you identify bad SQL
• EnterpriseDB packages a utility that duplicates Oracle Automatic Workload Repository (AWR) reports, showing hot objects, top wait events, and much more
• A new SQL Profiler utility will be available from EnterpriseDB in the first half of 2011 to help trace and analyze SQL statement execution
• The least useful of all the performance analysis methods
• May be OK for getting a general rule of thumb as to how various resources are being used
• Do not be misled by ratios; for example, a high cache hit ratio is sometimes meaningless, since databases can be brought to their knees by excessive logical I/O
• Design changes driven by ratios typically include altering configuration parameters and sometimes indexing (see the sketch after this list)
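For completeness, a minimal sketch of the classic cache hit ratio from pg_stat_database, keeping the caveat above firmly in mind:

-- Buffer cache hit ratio per database. Treat with skepticism:
-- a high ratio can hide massive logical I/O.
SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
           AS cache_hit_pct
FROM   pg_stat_database
ORDER  BY cache_hit_pct;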
• Design is the #1 contributor to the overall performance and availability of a system
• With PostgreSQL, you have greater flexibility and opportunity than ever before to build well-designed data warehouses
• With PostgreSQL, you now have more options and features available than ever before
• All of the above translates into your being able to design data warehouses that are future-proofed: they can run as fast as you’d like (hopefully) and store as much data as you need (ditto)
Thanks…!
www.enterprisedb.com