Column Oriented Database Technologies

advertisement
Column Oriented Database Technologies
Monday, July 23, 2012
By: Dale T. Anderson
Principal Consultant
DB Best, Technologies, LLC
My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and market
vendors. Today let’s dive into column-oriented databases and why they should play an important role in
any data warehouse whose focus is on aggregations or metrics (and whose isn’t?).
So you are all probably familiar with row-oriented databases. Tables of data where rows of fields (also
called columns) represent the structural storage and the corresponding SQL queries that select, insert,
update, and delete that data. Most database vendors like Oracle, Microsoft, Sybase, Informix, and many
others all base their technology on this ANSI standard. Column-oriented databases are indeed what you
might surmise; tables of data where columns of data values represent the structural storage. What you
might not expect is that on the surface many column-oriented databases look and feel like row oriented
databases also using SQL queries in much the same way. Creating tables, storing data, querying them
are all pretty much identical. They may appear similar, but two principal things to understand is that the
significant differences under the hood, in particular, physical storage and query optimization.
As noted in my previous blogs on NoSQL, there is also a column-store technology out there. Let’s not
confuse that with column oriented databases. They are different. Since several NoSQL column-store
vendors were highlighted before, we will focus instead on the column oriented database vendors here.
First, some key benefits to column oriented databases:





High performance on aggregation queries (like COUNT, SUM, AVG, MIN, MAX)
Highly efficient data compression and/or partitioning
True scalability and fast data loading for Big Data
Accessible by many 3rd party BI analytic tools
Fairly simple systems administration
Due to their aggregation capabilities which compute large numbers of similar data items, column
oriented databases offer key advantages for certain types of systems, including:




Data Warehouses and Business Intelligence
Customer Relationship Management (CRM)
Library Card Catalogs
Ad hoc query systems
Column oriented database technology has actually been around for many years originating in 1969 with
an application called TAXIR which provided abstracts for mathematical biosciences. In 1976, Canada
implemented the RAPID system for processing and retrieval of population and housing census statistics.
Sybase IQ was the only commercially available column-oriented database for many years, yet that has
changed rapidly in the last few years. Let’s take a quick look at some of today’s key players:
 SAP Sybase IQ (www.sybase.com)
A highly optimized analytics server designed
specifically to deliver superior performance
for mission-critical business intelligence,
analytics and data warehousing solutions on any standard hardware and
operating system. Its column oriented grid-based architecture, patented data
compression, and advanced query optimizer delivers high performance,
flexibility, and economy in challenging reporting and analytics environments.
Essentially a data partitioned, index based storage technology, Sybase IQ’s
engine offers several key features which include:
 Web enabled analytics
 Communications & Security
 Fast Data Loading
 Query Engine supporting Full Text Search
 Column Indexing Sub System
 Column Storage Processor
 User Friendly CUI based Administration & Monitoring
 Multiplex Grid Architecture
 Information Live-cycle management
The Sybase IQ Very Large Data Base (VLDB) option provides partitioning and
placement where a table can have a specified column partition key with value
ranges. This partition allows data that should be grouped together to be
grouped together and separates data where they should be separated. The
drawback to this methodology is that it is not always known which is which.
 Infobright (www.infobright.com)
Offering both a commercial (IEE) and a free
community (ICE) edition, the combination
of a column oriented database with their
Knowledge Grid architecture delivers a self-managed, scalable, high
performance analytics query platform. Allowing 50Tb using a single server, their
industry-leading data compression (10:1 up to 40:1) significantly reduces
storage requirements and expensive hardware infrastructures. Delivered as a
MySQL engine, Infobright runs on multiple operating systems and processors
needing only a minimum of 4Gb of RAM (however 16Gb is a recommended
starting point).
Avoiding partition schemes, Infobright data is stored in data packs, each node
containing pre-aggregated statistics about the data stored within them. The
Knowledge Grid above provides related metadata providing a high level view of
the entire content of the database. Indexes, projections, partitioning or
aggregated tables are not needed as these metadata statistics are managed
automatically. The granular computing engine processes queries using the
Knowledge Grid information to optimize query processing eliminating or
significantly reducing the amount of data required for decompressing and
access to answer a query. Some queries may not need to access the data at all,
finding instead the answer in the Knowledge Grid itself.
The Infobright Data Loader is highly efficient so data inserts are very fast. This
performance gain does come at a price so avoid updates unless absolutely
necessary, design de-normalized tables, and don’t plan on any deletes. New
features to the data loader include a reject option which allows valid rows to
commit while invalid rows are logged. This is highly useful when loading
millions of rows and only having a few rows with bad data. Without this feature
the entire data load would be rolled back.
 Vertica (HP) (www.vertica.com)
Recently acquired by Hewlett Packard, this
platform was purpose built from the ground
up to enable data values having high
performance real-time analytics needs. With
extensive data loading, queries, columnar storage, MPP architecture, and data
compression features, diverse communities can develop and scale with a
seamless integration ecosystem.
Claiming elasticity, scale, performance, and simplicity the Vertica analytics
platform uses transformation partitioning to specify which rows belong together
and parallelism for speed. Several key features include:
 Columnar Storage & Execution
 Real-Time Query & Loading
 Scale-out MPP Architecture
 Automatic High Availability
 Aggressive Data Compression
 Extensible In-Database Analytics Framework
 In-Database Analytics Library
 Database Designer & Administration Tools
 Native BI & ETL support for MapReduce & Hadoop
The Vertica Optimizer is the brains of the analytics platform producing optimal
query execution plans where several choices exist. It does this through
traditional considerations like disk I/O and further incorporates CPU, memory,
network, concurrency, parallelism factors and the unique details of the
columnar operator and runtime environment.
 ParAccel (www.paraccel.com)
Analytic-driven companies need a platform,
not just a database where speed, agility, and
complexity drive the data ecosystem. The
ParAccel Analytic Platform streamlines the delivery of complex business
decisions through its high performance analytic database. Designed for speed,
its extensible framework supports on-demand integration and embedded
functions.
The ParAccel Database (PADB) present four main components: the ‘Leader’
node, the ‘Compute’ node, the Parallel Communications Fabric, and an optional
Storage Area Network (SAN). The ‘Leader’ controls the execution of the
‘Compute’ nodes and all nodes communicate with each other via the ‘Fabric’
running on standard x86 Linux servers. Each ‘Compute’ node is subdivided into
a set of parallel processes called ‘slices’ that include a CPU core, and thier
allocation of memory, and local disk storage. The ‘Communication Fabric’
provides a low-level MPP network protocol for increased performance.
Key PADB features include:
 High Performance & Scalability
 Columnar Orientation
 Extensible Analytics
 Query Compilation
 High Availability
 Solution Simplicity
ParAccel Integrated Analytics Library and Extensibility Framework incorporates
advanced functions along with an API to add your own functions to help address
complex business problems right in the core database enabling customers to
focus upon their specific data complexities.
 Microsoft SQL Server 2012 (www.microsoft.com)
Released this year, Microsoft has now
embraced the columnar database idea. The
latest SQL Server release 2012 includes
xVelocity, a column-store index feature that stores data similar to a columnoriented DBMS. While not a true column oriented database, this technique
allows for the creation of a memory optimized index that groups and stores data
for each column then and joins them together to complete the index. For
certain types of queries, like aggregations, the query processor can take
advantage of the column-store index to significantly improve execution times.
Column store indexes can be used with partitioned tables providing a new way
to think about how to design and process large datasets.
The column-store index can be very useful on large fact tables in a Star schema
improving overall performance, however the cost model approach utilized may
choose the column-store index for a table when a row based index would have
been better. Using the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query
hint will work around this if it occurs. When data is stored with a column-store index,
data can often be compressed more effectively over a row based index.
This is
accomplished as typically there is more redundancy within a column than within a
row. Higher compression means less IO is required to retrieve data into memory
which can significantly reduce response times.
There are several restrictions and limitation in using a column-store index.
For
example, which data types are supported or not and that you can only create one
column-store index on any table can be problematic. Become familiar with what it
can do and where best to use it. Currently the column-store index is not supported
on Microsoft Azure.
Column-oriented databases provide significant advantages over traditional row oriented system applied
correctly; In particular for data warehouse and business intelligence environments where aggregations
prevail. It would not be fair however to ignore the disadvantages. Let’s look at these two:

Column-Oriented Advantages
 Efficient storage and data compression
 Fast data loads
 Fast aggregation queries
 Simplified administration & configurations

‘Column-Oriented Disadvantages
 Transactions are to be avoided or just not supported
 Queries with table joins can reduce high performance
 Record updates and deletes reduce storage efficiency
 Effective partitioning/indexing schemes can be difficult to design
The real value in using column-oriented database technology
comes from high performance, scalable storage and retrieval of
large to massive datasets (Big Data) focused on aggregation
queries. Simply put: Reports! You can design Star schema’s or
Data Vaults (The Data Vault – What is it? – Why do we need it?)
incorporating these technologies and you will find that columnoriented databases provide a clear solution in data warehouse and
business intelligence.
Look for future blogs on Hadoop/Hive/HBase and Extract-Transform-Load (ETL) technologies, and don’t
be afraid to comment, question, or debate, there is always room to learn new things...
Download