Column Oriented Databases Vs Row Oriented Databases

advertisement
ITK - 478
Column Oriented Databases Vs Row
Oriented Databases
Special Interest Activity
Submitted by: Venkat. K & Rakesh. K
11/19/2007 Updated: 12/03/2007
Column Oriented Databases Vs Row Oriented Databases
Table of Contents
Introduction ......................................................................................................................... 3
MonetDB............................................................................................................................. 4
Monet Architecture ......................................................................................................... 5
VOC Data set: ................................................................................................................. 8
Queries in MonetDB: ........................................................................................................ 11
TPC-H Benchmark............................................................................................................ 13
TPC-H DDL .................................................................................................................. 14
For MonetDB ................................................................................................................ 14
For Oracle ..................................................................................................................... 17
TPC-H Data Insertion ....................................................................................................... 19
MonetDB/SQL .................................................................................................................. 21
Queries .............................................................................................................................. 21
LucidDB ............................................................................................................................ 26
Main features: ................................................................................................................... 26
Query Optimization and Execution .............................................................................. 29
Data insertion ................................................................................................................ 32
Advantages and Disadvantages: ....................................................................................... 34
Summary: ...................................................................................................................... 34
References: ........................................................................................................................ 35
Page |2
Column Oriented Databases Vs Row Oriented Databases
Introduction [1][2]
Database that we find extensively is a row oriented database which stores data in rows. It
has high performance for the OLTP i.e. online transaction processing. This document
talks about Column oriented database that stores data by column.
After leading the team that worked on the C-Stotre, an open source column oriented
database Mike Stonebraker along with 6 other have come up with the market product of
the column oriented database, Vertica. Mike Stonebraker is considered as one of the
major contributor to the column oriented database.
The document takes you through different column oriented databases that are in the
market and the installation procedures and details of two such databases.
The following example illustrates the difference between them
Consider the Project table which has three column pjno, status, pjtitle and pjno is the
primary key.
Pjno
5555
6666
7777
Status
A
C
A
Pjtitle
Marketing
Inventory
Order entry
The row store implementation of the table is stored in the file as following. Here values
of different attributes from the same tuple are stored consecutively.
5555,A,Marketing;6666,C,Inventory;7777,A,Order Entry;
The column store implementation of the table is stored in the file as following. Here the
values of same attributes are stored consecutively. The column oriented database stores
data in columns and are joined with the help of the IDs. It also has the feature of
compressing the data by column wise with the help of projections. It stores the repeated
data in the columns as one.
5555,6666,7777; A,C,A; Marketing, Inventory, Order entry;
The row store architecture is well suited for OLTP operations where as column store
architecture is suited for OLAP operations.
The following are the advantages [1]
Page |3
Column Oriented Databases Vs Row Oriented Databases

While row stores are extremely "write friendly", in that adding a row of data to a
table requires a simple file appending I/O, column stores perform better for
complex read queries. For tables with many columns and queries that use only
few of them, a column store can confine its reads to the columns required,
whereas a row store must read the entire table. In addition, the storage efficiency
properties of column stores can greatly reduce the number of actual disk reads
required to satisfy a query.

Column data is of uniform type. Therefore it is much easier to compress than row
data, and NULL values need never be stored. Row stores cannot omit columns
from any row and still achieve direct random access to a table, because random
access requires that the data for each row be of fixed width. In column stores, this
is trivially true because of type uniformity within a single column's storage,
allowing omission of NULL values and therefore efficient storage of wide,
sparsely populated tables. In practice, row stores can and do implement tables
with variable-width rows, but this require either some form of indirect access or
giving up random access in favor of some type of fast ordered access, e.g. B-trees
(used in the architecture of lucid DB). However, both storage efficiency and code
complexity of such approaches generally compare unfavorably to
implementations of sparse column stores.
The same above concept is used in all of the following databases and in the databases we
are going to talk about.
Open source  C-store
 MonetDB
 LucidDB
 Metakit
Proprietary
 BigTable
 Sybase IQ
 Xplain
 KDB
 DataProbe
We performed our special interest activity on MonetDB and LucidDB.
MonetDB [4]
MonetDB is an open source high-performance database management system developed at
the National Research Institute for Mathematics and Computer Science (CWI; Centrum
voor Wiskunde en Informatica) in the Netherlands. It was designed to provide high
performance on complex queries against large databases, e.g. combining tables with
hundreds of columns and multi-million rows. MonetDB has been successfully applied in
Page |4
Column Oriented Databases Vs Row Oriented Databases
high-performance applications for data mining, OLAP, GIS, XML Query, text and
multimedia retrieval. MonetDB internal data representation is memory-based, relying on
the huge memory addressing ranges of contemporary CPUs, and thus departing from
traditional DBMS designs involving complex management of large data stores in limited
memory.
MonetDB is one of the first database systems to focus its query optimization effort on
exploiting CPU caches.
The MonetDB family consists of:



MonetDB/SQL: the relational database solution
MonetDB/XQuery: the XML database solution
MonetDB Server: the multi-model database server
Monet Architecture
1. Query language parser – reads SQL (structured query language) from the user and
checks for the syntax.
2. Query rewriter – rewrites the query into some normal form.
3. Query optimizer – translates the logical description of the query into a query plan.
4. Query executor – executes the physical query and produces the result.
5. Access methods – system services to access data from the tables.
6. Buffer manager – handles caching of table data stored in the table.
7. Lock manager – system services for locking the transaction.
8. Recovery manager – when the transactions are commit it makes sures that data is
persistent and erases when there is not.
Page |5
Column Oriented Databases Vs Row Oriented Databases
The architecture emphasis that the multiple front ends can connect to the back end. We
can have relational, object oriented queries as front end and have MonetDB as back-end.
The intermediate language between the front end and back end is the MIL (Monet
Interpreter Language).
Query execution is divided into strategic and tactical phase. The strategic optimization is
done in front end while the tactical phase is done at run-time in the Monet Query
Executor.
Monet uses binary table model where are all the tables consists of exactly two columns.
These are called as Binary Association Tables. Each front end uses mapping rules to map
logical data model as seen as the end user onto binary tables in Monet. In case of
relational model relational tables are vertically fragmented, by storing each column from
a relational table in a separate BAT. The right column of the BATs holds the column
value and left column holds the row or object identifier.
Consider the following tables
A relational data model can be stored in Monet by splitting each relational table by
column. Each column becomes a BAT that holds the column values in the right column
(tail) and object identifier (OID) in left column (head). The relational tuples can be
reconstructed by taking all tail values of the column BATs with the same OID.
The tables are decomposed into following columns
Page |6
Column Oriented Databases Vs Row Oriented Databases
MIL is a procedural block-structured language with standard control structures like ifthen-else and while loops.
The following is the MIL translation of the SQL
Installation of MonetDB [5]
MonetDB is an open source database. It can be downloaded from this link
http://monetdb.cwi.nl/projects/monetdb//Download/index.html#SQL
The following are the tools for MonetDB
 DBVisualizer
 Squirrel
Page |7
Column Oriented Databases Vs Row Oriented Databases


Aqua Data Studio
iSQL
We used the DBVisualizer for our activity. The driver settings for it are
DBVisualizer
VOC Data set:
Exploring the wealth of functionality offered by MonetDB/SQL is best started using a toy
database. For this we use the VOC database which provides a peephole view into the
administrative system of an early multi-national company, the Vereenigde geoctrooieerde
Oostindische Compagnie (VOC for short - The (Dutch) East Indian Company)
established on March 20, 1602.
We used Oracle Server which is given access as part of the course with the following
details for connection using SqlDeveloper. We have also run the queries by running the
oracle in our system instead of college server using following credentials.
Connection Name: Test10G (You can give anything)
Username & Password (my computer ID and password)
Oracle is default
Connection type: Basic
Page |8
Column Oriented Databases Vs Row Oriented Databases
Scripts –
CREATE TABLE "voyages" (
"number"
integer NOT NULL,
"number_sup"
char(1) NOT NULL,
"trip"
integer,
"trip_sup"
char(1),
"boatname"
varchar(50),
"master"
varchar(50),
"tonnage"
integer,
"type_of_boat"
varchar(30),
"built"
varchar(15),
"bought"
varchar(15),
"hired"
varchar(15),
"yard"
char(1),
"chamber"
char(1),
"departure_date" date,
"departure_harbour" varchar(30),
"cape_arrival"
date,
"cape_departure" date,
"cape_call"
boolean,
"arrival_date"
date,
"arrival_harbour" varchar(30),
"next_voyage"
integer,
"particulars"
varchar(530)
);
CREATE TABLE "craftsmen" (
Page |9
Column Oriented Databases Vs Row Oriented Databases
"number"
integer NOT NULL,
"number_sup"
char(1)
NOT NULL,
"trip"
integer,
"trip_sup"
char(1),
"onboard_at_departure" integer,
"death_at_cape"
integer,
"left_at_cape"
integer,
"onboard_at_cape"
integer,
"death_during_voyage" integer,
"onboard_at_arrival" integer
);
CREATE TABLE "impotenten" (
"number"
integer NOT NULL,
"number_sup"
char(1)
NOT NULL,
"trip"
integer,
"trip_sup"
char(1),
"onboard_at_departure" integer,
"death_at_cape"
integer,
"left_at_cape"
integer,
"onboard_at_cape"
integer,
"death_during_voyage" integer,
"onboard_at_arrival" integer
);
CREATE TABLE "invoices" (
"number" integer,
"number_sup" char(1),
"trip"
integer,
"trip_sup" char(1),
"invoice" integer,
"chamber" char(1)
);
CREATE TABLE "passengers" (
"number"
integer NOT NULL,
"number_sup"
char(1)
NOT NULL,
"trip"
integer,
"trip_sup"
char(1),
"onboard_at_departure" integer,
"death_at_cape"
integer,
"left_at_cape"
integer,
"onboard_at_cape"
integer,
"death_during_voyage" integer,
"onboard_at_arrival" integer);
CREATE TABLE "seafarers" (
"number"
integer NOT NULL,
"number_sup"
char(1)
NOT NULL,
"trip"
integer,
"trip_sup"
char(1),
"onboard_at_departure" integer,
"death_at_cape"
integer,
"left_at_cape"
integer,
"onboard_at_cape"
integer,
"death_during_voyage" integer,
"onboard_at_arrival" integer
P a g e | 10
Column Oriented Databases Vs Row Oriented Databases
);
CREATE TABLE "soldiers" (
"number"
integer NOT NULL,
"number_sup"
char(1)
NOT NULL,
"trip"
integer,
"trip_sup"
char(1),
"onboard_at_departure" integer,
"death_at_cape"
integer,
"left_at_cape"
integer,
"onboard_at_cape"
integer,
"death_during_voyage" integer,
"onboard_at_arrival" integer
);
CREATE TABLE "total" (
"number"
integer NOT NULL,
"number_sup"
char(1)
NOT NULL,
"trip"
integer,
"trip_sup"
char(1),
"onboard_at_departure" integer,
"death_at_cape"
integer,
"left_at_cape"
integer,
"onboard_at_cape"
integer,
"death_during_voyage" integer,
"onboard_at_arrival" integer
);
ALTER TABLE "voyages" ADD PRIMARY KEY ("number", "number_sup");
ALTER TABLE "craftsmen" ADD PRIMARY KEY ("number", "number_sup");
ALTER TABLE "impotenten" ADD PRIMARY KEY ("number", "number_sup");
ALTER TABLE "passengers" ADD PRIMARY KEY ("number", "number_sup");
ALTER TABLE "seafarers" ADD PRIMARY KEY ("number", "number_sup");
ALTER TABLE "soldiers" ADD PRIMARY KEY ("number", "number_sup");
ALTER TABLE "total" ADD PRIMARY KEY ("number", "number_sup");
ALTER TABLE "craftsmen" ADD FOREIGN KEY ("number", "number_sup")
REFERENCES "voyages" ("number", "number_sup");
ALTER TABLE "impotenten" ADD FOREIGN KEY ("number", "number_sup")
REFERENCES "voyages" ("number", "number_sup");
ALTER TABLE "invoices" ADD FOREIGN KEY ("number", "number_sup")
REFERENCES "voyages" ("number", "number_sup");
ALTER TABLE "passengers" ADD FOREIGN KEY ("number", "number_sup")
REFERENCES "voyages" ("number", "number_sup");
ALTER TABLE "seafarers" ADD FOREIGN KEY ("number", "number_sup")
REFERENCES "voyages" ("number", "number_sup");
ALTER TABLE "soldiers" ADD FOREIGN KEY ("number", "number_sup")
REFERENCES "voyages" ("number", "number_sup");
ALTER TABLE "total" ADD FOREIGN KEY ("number", "number_sup")
REFERENCES "voyages" ("number", "number_sup");
Queries in MonetDB:
P a g e | 11
Column Oriented Databases Vs Row Oriented Databases
We performed the following ran the following queries on MonetDB, Oracle (SQL
Developer) in both college and one on our system. The time given in the box is the time
taken from their respective GUI clients.
The data used for the queries and schema is directly taken from the MonetDB website
under VOC copy.
Note:
1. Time provided for the each query varies each time we run the query.
2. We have left out the Query plan as DBvisualizer used for the MonetDB is an trail
pack it doesn’t support the query plan feature.
Query
Oracle:
Execution time
.819
0.505(from the system)
select count(*) from “passengers”;
MonetDB:
.0106
select count(*) from Passengers;
Query
Execution time
Oracle:
3.513
select * from “craftsmen”;
MonetDB:
1.007(from the system)
0.406
select * from craftsmen;
Query
Oracle:
Execution time
4.97
SELECT number from “impotenten”;
0.505(from the system)
MonetDB:
.093
SELECT "number" from impotenten;
Query
Oracle:
Execution time
1.0041
SELECT COUNT(*) FROM "voyages"
WHERE "particulars" LIKE '%_recked%';
0.153(from the system)
MonetDB:
0.39
SELECT COUNT(*) FROM voyages
WHERE particulars LIKE '%_recked%';
P a g e | 12
Column Oriented Databases Vs Row Oriented Databases
Query
Execution time
Oracle:
0.56
SELECT "chamber", CAST(AVG("invoice") AS integer) AS
average
FROM "invoices"
WHERE "invoice" IS NOT NULL
GROUP BY "chamber"
ORDER BY average desc;
MonetDB:
0.507(from the system)
0.313
SELECT chamber,CAST(AVG(invoice) AS integer) AS average
FROM invoices
WHERE invoice IS NOT NULL
GROUP BY chamber
ORDER BY average DESC;
Query
Execution time
Oracle:
SELECT voyages.number FROM voyages inner join craftsmen
on voyages.number = craftsmen.number and
voyages.number_sup = craftsmen.number_sup
WHERE voyages.particulars LIKE '%_recked%';
0.506
MonetDB:
SELECT voyages.number FROM voyages inner join craftsmen
on voyages.number = craftsmen.number and
voyages.number_sup = craftsmen.number_sup
WHERE voyages.particulars LIKE '%_recked%';
0.26
0.504(from the system)
TPC-H Benchmark
The TPC Benchmark (TPC-H) is a decision support benchmark. It consists of a suite of
business oriented ad-hoc queries and concurrent data modifications. The queries and the
data populating the database have been chosen to have broad industry-wide relevance.
This benchmark illustrates decision support systems that examine large volumes of data,
execute queries with a high degree of complexity, and give answers to critical business
questions. It is stated that the results of the TPC-H benchmark ran by the MonetDB
company is 10 times faster in Monetdb when compare to MySQL and PostgreSQL.
P a g e | 13
Column Oriented Databases Vs Row Oriented Databases
TPC-H
SCHEMA
TPC-H DDL
Because of the difference in data types support in MonetDB and Oracle we needed to
define DDL separately.
For MonetDB
create table part(
p_partkey int ,
p_name varchar(55),
p_mfgr char(25),
p_brand char(10),
p_type varchar(25),
p_size int,
p_container char(10),
p_retailprice decimal(10,5),
p_comment varchar(23));
P a g e | 14
Column Oriented Databases Vs Row Oriented Databases
alter table part add primary key(p_partkey);
create table supplier(
s_suppkey int ,
s_name varchar(25),
s_address varchar(100),
s_nationkey int,
s_phone varchar(20),
s_acctbal decimal(10,5),
s_comment varchar2(120));
alter table supplier add primary key(s_suppkey);
ALTER TABLE supplier ADD FOREIGN KEY (s_nationkey)
REFERENCES nation(n_nationkey );
create table partsupp(
ps_partkey int,
ps_suppkey int,
ps_availqty int,
ps_supplycost decimal(10,5),
ps_comment varchar(220));
alter table partsupp add primary key(ps_partkey, ps_suppkey);
ALTER TABLE partsupp ADD FOREIGN KEY (ps_partkey)
REFERENCES part (p_partkey );
ALTER TABLE partsupp ADD FOREIGN KEY (ps_suppkey)
REFERENCES supplier (s_suppkey );
create table customer(
c_custkey int,
c_name varchar(30),
c_address varchar(100),
c_nationkey int,
c_phone varchar(30),
c_acctbal decimal(10,5),
c_mktsegment varchar(20),
c_comment varchar(150));
alter table customer add primary key(c_custkey);
ALTER TABLE customer ADD FOREIGN KEY (c_nationkey)
REFERENCES nation(N_nationkey );
Create table nation(
n_nationkey int,
n_name varchar(50),
n_regionkey int,
n_comment varchar(200));
alter table nation add primary key(n_nationkey);
P a g e | 15
Column Oriented Databases Vs Row Oriented Databases
ALTER TABLE nation ADD FOREIGN KEY (n_regionkey)
REFERENCES region (r_regionkey );
Create table region(
R_regionkey int,
R_name varchar(25),
R_comment varchar(180));
alter table region add primary key(R_regionkey);
create table orders(
o_orderkey int,
o_custkey int,
o_orderstatus varchar(10),
o_totalprice decimal(10,5),
o_orderdate varchar(20),
o_orderpriority varchar(20),
o_clerk varchar(20),
o_shippriority int,
o_comment varchar(100));
alter table orders add primary key(o_orderkey);
ALTER TABLE orders ADD FOREIGN KEY (o_custkey)
REFERENCES customer (c_custkey );
Create table lineitem(
L_orderkey int,
L_partkey int,
L_suppkey int,
L_linenumber int,
L_quantity decimal(10,5),
L_extendedprice decimal(10,5),
L_discount decimal(10,5),
L_tax decimal(10,5),
L_returnflag varchar(10),
L_linestatus varchar(10),
L_shipdate varchar(20),
L_commitdate varchar(20),
L_receiptdate varchar(20),
L_shipinstruct varchar(30),
L_shipmode varchar(20),
L_comment varchar(50));
alter table lineitem add primary key(l_orderkey,l_linenumber);
ALTER TABLE lineitem ADD FOREIGN KEY (L_orderkey)
REFERENCES orders (o_orderkey );
P a g e | 16
Column Oriented Databases Vs Row Oriented Databases
ALTER TABLE lineitem ADD FOREIGN KEY (L_partkey)
REFERENCES part (p_partkey );
ALTER TABLE lineitem ADD FOREIGN KEY (L_partkey,l_suppkey)
REFERENCES partsupp (ps_partkey,ps_suppkey );
ALTER TABLE lineitem ADD FOREIGN KEY (L_suppkey)
REFERENCES supplier (s_suppkey );
For Oracle
create table part(
p_partkey integer ,
p_name varchar(55),
p_mfgr char(25),
p_brand char(10),
p_type varchar2(25),
p_size integer,
p_container char(10),
p_retailprice decimal(10,5),
p_comment varchar2(23));
alter table part add primary key(p_partkey);
create table supplier(
s_suppkey integer ,
s_name varchar(25),
s_address varchar(100),
s_nationkey integer,
s_phone varchar(20),
s_acctbal decimal(10,5),
s_comment varchar2(120));
alter table supplier add primary key(s_suppkey);
ALTER TABLE supplier ADD FOREIGN KEY (s_nationkey)
REFERENCES nation(n_nationkey );
create table partsupp(
ps_partkey integer,
ps_suppkey integer,
ps_availqty integer,
ps_supplycost decimal(10,5),
ps_comment varchar2(220));
alter table partsupp add primary key(ps_partkey, ps_suppkey);
ALTER TABLE partsupp ADD FOREIGN KEY (ps_partkey)
REFERENCES part (p_partkey );
ALTER TABLE partsupp ADD FOREIGN KEY (ps_suppkey)
REFERENCES supplier (s_suppkey );
create table customer(
c_custkey integer,
P a g e | 17
Column Oriented Databases Vs Row Oriented Databases
c_name varchar2(30),
c_address varchar2(100),
c_nationkey integer,
c_phone varchar2(30),
c_acctbal decimal(10,5),
c_mktsegment varchar2(20),
c_comment varchar2(150));
alter table customer add primary key(c_custkey);
ALTER TABLE customer ADD FOREIGN KEY (c_nationkey)
REFERENCES nation(N_nationkey );
Create table nation(
n_nationkey integer,
n_name varchar2(50),
n_regionkey integer,
n_comment varchar2(200));
alter table nation add primary key(n_nationkey);
ALTER TABLE nation ADD FOREIGN KEY (n_regionkey)
REFERENCES region (r_regionkey );
Create table region(
R_regionkey integer,
R_name varchar2(25),
R_comment varchar2(180));
alter table region add primary key(R_regionkey);
create table orders(
o_orderkey integer,
o_custkey integer,
o_orderstatus varchar2(10),
o_totalprice decimal(11,5),
o_orderdate varchar2(20),
o_orderpriority varchar2(20),
o_clerk varchar2(20),
o_shippriority integer,
o_comment varchar2(100));
alter table orders add primary key(o_orderkey);
ALTER TABLE orders ADD FOREIGN KEY (o_custkey)
REFERENCES customer (c_custkey );
Create table lineitem(
L_orderkey integer,
L_partkey integer,
P a g e | 18
Column Oriented Databases Vs Row Oriented Databases
L_suppkey integer,
L_linenumber integer,
L_quantity decimal(10,5),
L_extendedprice decimal(10,5),
L_discount decimal(10,5),
L_tax decimal(10,5),
L_returnflag varchar2(10),
L_linestatus varchar2(10),
L_shipdate varchar2(20),
L_commitdate varchar2(20),
L_receiptdate varchar2(20),
L_shipinstruct varchar2(30),
L_shipmode varchar2(20),
L_comment varchar2(50));
alter table lineitem add primary key(l_orderkey,l_linenumber);
ALTER TABLE lineitem ADD FOREIGN KEY (L_orderkey)
REFERENCES orders (o_orderkey );
ALTER TABLE lineitem ADD FOREIGN KEY (L_partkey)
REFERENCES part (p_partkey );
ALTER TABLE lineitem ADD FOREIGN KEY (L_partkey,l_suppkey)
REFERENCES partsupp (ps_partkey,ps_suppkey );
ALTER TABLE lineitem ADD FOREIGN KEY (L_suppkey)
REFERENCES supplier (s_suppkey );
TPC-H Data Insertion
We have downloaded the reference dataset for all the tables in the schema from
www.tpc.prg. The reference data which we got was delimited by bars. So we had to use
some program which takes each token and insert in the specified column.
We used a Java Program to insert rows into the MonetDB as the tool which we are using
DbVisualzer(Trial Edition) did not allow us to import data. The following is the java
program to insert data into customer table
import java.util.*;
import java.io.*;
import java.sql.*;
public class customerscanner {
public static void main(String[] args) throws FileNotFoundException {
Scanner in = new Scanner(System.in);
String insertSQL = " INSERT
INTO "
+ " sys.customer "
+ " ( c_custkey,c_name,c_address, c_nationkey, c_phone, c_acctbal, " +
" c_mktsegment,c_comment)"
+" VALUES ( ";
String url = "jdbc:monetdb://localhost/demo";
Connection con;
P a g e | 19
Column Oriented Databases Vs Row Oriented Databases
Statement stmt;
try {
Class.forName("nl.cwi.monetdb.jdbc.MonetDriver");
} catch(java.lang.ClassNotFoundException e) {
System.err.print("ClassNotFoundException: ");
System.err.println(e.getMessage());
}
while(in.hasNext())
{
String sql = "";
String line = in.nextLine();
StringTokenizer tokenizer = new StringTokenizer(line);
String token = tokenizer.nextToken("|");
sql = insertSQL + token + ",";
token = tokenizer.nextToken("|");
sql += "'"+ token + "',";
token = tokenizer.nextToken("|");
sql += "'"+ token + "',";
token = tokenizer.nextToken("|");
sql += token + ",";
token = tokenizer.nextToken("|");
sql += "'"+ token + "',";
token = tokenizer.nextToken("|");
sql += token + ",";
token = tokenizer.nextToken("|");
sql += "'"+ token + "',";
token = tokenizer.nextToken("|");
sql += "'"+ token + "');";
System.out.println(sql);
try {
con
"monetdb");
//
=
DriverManager.getConnection(url,"monetdb",
System.out.println("Connected to the DB.");
// Create a Statement
stmt = con.createStatement();
//
//
//
P a g e | 20
// Execute the Statement
System.out.println("\nExecuting statement ...");
if(stmt.executeUpdate(sql)<1)
{
System.err.println("An error occurred");
}
System.out.println("table created");
// clean up
stmt.close();
con.close();
sql = "";
} catch(SQLException ex) {
System.err.println("SQLException: " + ex.getMessage());
Column Oriented Databases Vs Row Oriented Databases
}
}
}
}
The following is the code which runs the class and inserts in the specified table.
java -cp .;"C:\Program Files\CWI\MonetDB5\share\MonetDB\lib\monetdb-1.6-jdbc.jar" customerscanner
<"C:\Documents and Settings\friend\Desktop\478\referenceDataSet_2.5\TPCH250_sf1\customer.tbl.1"
For insertion in oracle we used SQLDeveloper which has import data option from excel
sheet. We converted delimited data into excel sheet and named the columns same as the
columns in the table. We then imported it and matched with columns in the table.
MonetDB/SQL
Queries
We performed different TPC_H queries on both MonetDB and Oracle for performance
analysis.
Business Question
The Pricing Summary Report Query provides a summary pricing report for all lineitems
shipped as of a given date. The ship date is 1998-12-01. The query lists totals for
extended price, discounted extended price, discounted extended price plus tax, average
quantity, average extended price, and average discount. These aggregates are grouped by
RETURNFLAG and LINESTATUS, and listed in ascending order of RETURNFLAG
and LINESTATUS. A count of the number of lineitems in each group is included [5].
Query:
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= '1998-12-01'
group by
l_returnflag,
l_linestatus
P a g e | 21
Column Oriented Databases Vs Row Oriented Databases
order by
l_returnflag,
l_linestatus;
Response time :
Oracle – 0.177 sec /1.05897(from the system)
MonetDB – 0.063 sec
Query – 2
This query finds which supplier should be selected to place an order for a given part in a
given region.
Business question
The Minimum Cost Supplier Query finds, in a given region, for each part of a certain
type and size, the supplier who can supply it at minimum cost. If several suppliers in that
region offer the desired part type and size at the same (minimum) cost, the query lists the
parts from suppliers with the 100 highest account balances. For each supplier, the query
lists the supplier's account balance, name and nation; the part's number and manufacturer;
the supplier's address, phone number and comment information.
select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from part, supplier, partsupp, nation, region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size > 1
and p_type like '%STEEL'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'AFRICA'
and ps_supplycost = (
select
min(ps_supplycost)
from
partsupp, supplier,
nation, region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'AFRICA'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey;
P a g e | 22
Column Oriented Databases Vs Row Oriented Databases
Response time:
Oracle – 0.764 sec /0.568 (from the System)
MonetDB – 0.016 sec
Query – 3
This query seeks relationships between customers and the size of their orders.
Business Question
This query determines the distribution of customers by the number of orders they have
made, including customers who have no record of orders, past or present. It counts and
reports how many customers have no orders, how many have 1, 2, 3, etc. A check is
made to ensure that the orders counted do not fall into one of several special categories of
orders. Special categories are identified in the order comment column by looking for a
particular pattern.
select c_custkey, count(*) as custdist
from (select c_custkey,count(o_orderkey)
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%requests%'
group by c_custkey) c_counts
group by c_custkey
order by custdist desc,c_custkey desc;
Response Time
Oracle – 3.762 sec /0.5104(from the system)
MonetDB – 0.062 sec
Based on above queries it clearly shows that MonetDB is much faster than Oracle.
MonetDB More Funcrions
Profile Statement:
The profile statement gives out the execution time, optimize time, parser time for the
queries.
sql> set profile= true;
sql> select count(*)from tables;
sql> select * from profile;
P a g e | 23
Column Oriented Databases Vs Row Oriented Databases
Profile statement for the query 3 in the benchmark data, the optimize is 0 because we have not used any
indexes
Explain Statement:
The explain statement gives the code that is executed behind the query statements. As
said earlier the database is stored in bat files in the monetDB we can see the output of the
below statements involves different bat files.
sql>explain select count(*) from orders;
+------------------------------------------------------------------------------+
|function user.s2_3():void;
|
| _1:bat[:oid,:int]{rows=4000:lng,bid=23374} :=
:sql.bind("sys","orders","o_orderkey",0);
| _6:bat[:oid,:int]{rows=0:lng,bid=28316} :=
P a g e | 24
|
|
|
Column Oriented Databases Vs Row Oriented Databases
:sql.bind("sys","orders","o_orderkey",1);
|
| constraints.emptySet(_6);
|
| _6:bat[:oid,:int]{rows=0:lng,bid=28316} := nil;
|
| _8:bat[:oid,:int]{rows=0:lng,bid=27098} :=
|
:sql.bind("sys","orders","o_orderkey",2);
|
| constraints.emptySet(_8);
|
| _8:bat[:oid,:int]{rows=0:lng,bid=27098} := nil;
|
| _11{rows=4000:lng} := algebra.markT(_1,0@0);
|
| _1:bat[:oid,:int]{rows=4000:lng,bid=23374} := nil;
|
| _12{rows=4000:lng} := bat.reverse(_11);
|
| _11{rows=4000:lng} := nil;
|
| _13{rows=1:lng} := aggr.count(_12);
|
| _12{rows=4000:lng} := nil;
|
| sql.exportValue(1,"sys.","count_","int",32,0,6,_13,"");
|end s2_3;
|
|
Trace Statement:
This statement gives the time taken for each statement of the code behind file of the SQL
statement.
sql>trace select count(*) from orders;
+-------------+----------------------------------------------------------------+
|
0 usec |
| 78000 usec |
:
mdb.setTimer(_2=true)
|
_1:bat[:oid,:int] := sql.bind(_2="sys", _3="orders",
|_4="o_orderkey", _5=0)
|
| 94000 usec | _6:bat[:oid,:int] := sql.bind(_2="sys", _3="orders",
:
|_4="o_orderkey", _7=1)
|
|
|
| 31000 usec | constraints.emptySet(_6=<tmp_67234>bat[:oid,:int]{0})
| 31000 usec | _6:bat[:oid,:int] := nil;
|
| 16000 usec | _8:bat[:oid,:int] := sql.bind(_2="sys", _3="orders",
P a g e | 25
|
|
Column Oriented Databases Vs Row Oriented Databases
:
|_4="o_orderkey", _9=2)
|
| 62000 usec | constraints.emptySet(_8=<tmp_64732>bat[:oid,:int]{0})
| 32000 usec | _8:bat[:oid,:int] := nil;
|
|
| 31000 usec | _11 := algebra.markT(_1=<tmp_55516>bat[:oid,:int]{4000},
:
|_10=0@0)
|
|
| 31000 usec | _1:bat[:oid,:int] := nil;
|
| 31000 usec | _12 := bat.reverse(_11=<tmp_6231>bat[:oid,:oid]{4000})
| 63000 usec | _11 := nil;
|
|
| 31000 usec | _13 := aggr.count(_12=<~tmp_6231>bat[:oid,:oid]{4000})
| 31000 usec | _12 := nil;
|
|
+-------------+----------------------------------------------------------------+
| 4000
|
+-------------+----------------------------------------------------------------+
| 16000 usec | sql.exportValue(_7=1, _15="sys.", _16="count_", _17="int", |
:
|_18=32, _5=0, _19=6, _13=4000, _20="")
|640000 usec | user.s3_3()
|
|
+-------------+----------------------------------------------------------------+
Optimizer control:
We can also control on how we going to optimize using Optimizer control.
LucidDB [7]
LucidDB is the first and only open-source RDBMS purpose-built entirely for data
warehousing and business intelligence. Most database systems (both proprietary and
open-source) start life with a focus on transaction processing capabilities, then get
analytical capabilities bolted on as an afterthought (if at all). By contrast, every
component of LucidDB was designed with the requirements of flexible, highperformance data integration and sophisticated query processing in mind.
Main features:
Category
Storage
P a g e | 26
Feature
Benefits
Very high data compression rates for columns with
many repeated values; reduced I/O for queries which
Column-store tables
access only a subset of columns; greater cache
effectiveness
Column Oriented Databases Vs Row Oriented Databases
Automatically adapts to either bitmap or btree
representation depending on data distribution (even
using both in the same index for different portions of
Intelligent indexing the same table), yielding optimal data compression,
reduced I/O, and fast evaluation of boolean
expressions, without the need for a DBA to choose
index type
Supports read/write concurrency with snapshot
consistency, allowing readers to access a table while
Page-level
multi- data is being bulk loaded or updated; versioning at
versioning
page-level is much more efficient than transactional
multi-versioning schemes such as row-level
versioning or log-based page reconstruction
Avoids reading fact table rows which are not needed
Star join optimization
by query
Optimization Cost-based
join
ordering and index No hints required
selection
Can scale to number-crunch even the largest datasets
Hash
in limited RAM via skew-resistant disk-based
join/aggregation
partitioning
High performance and greater cache and disk
Coming
soon: effectiveness because LucidDB can almost always
Intelligent prefetch predict exactly which disk blocks are needed to
Execution
satisfy a query
Tables can be loaded directly from external sources
via SQL; no separate bulk loader utility is required
INSERT/UPSERT as (for performance, loads are never logged at the rowbulk load
level, yet are fully recoverable via page-level undo);
the SQL:2003 MERGE statement provides standard
upsert capability
Allows LucidDB to connect to heterogeneous
SQL/MED
external data sources via foreign data wrappers and
architecture
access their content as foreign tables
Allows foreign tables in any JDBC data source to be
JDBC foreign data
queried via LucidDB, with filters pushed down to the
wrapper
source where possible
Connectivity
Flat file foreign data Allows flat files (e.g. BCP or CSV format) to be
wrapper
queried as foreign tables via LucidDB
Allows new foreign data wrappers (e.g. for accessing
Pluggability
data from a web service) to be developed in Java and
hot-plugged into a running LucidDB instance
Allows new functions and transformations to be
Extensibility SQL/JRT architecture
developed in Java and hot-plugged into a running
P a g e | 27
Column Oriented Databases Vs Row Oriented Databases
User-defined
functions
User-defined
transformations
SQL:2003
JDBC
Standards
J2EE
LucidDB instance; LucidDB also comes with a
companion library of common ETL functions
(applib)
Allows the set of builtin functions to be extended
with custom user logic
Allows new table functions (such as custom logic for
data mining operators or CONNECT BY queries) to
be added to the system
Smooths migration of applications to and from other
DBMS products
Allows connectivity from popular front-ends such as
the Mondrian OLAP engine
Java architecture enables deployment of LucidDB
into a J2EE application server (just like hsqldb or
Derby); usage of Java as the primary extensibility
mechanism makes it a snap to integrate with the
many enterprise API's available
Architecture:
The core consists of a top-half implemented in Java and a bottom half implemented in
C++. This hybrid approach yields a number of advantages:

the Java portion provides ease of development, extensibility, and integration, with
managed memory reducing the likelihood of security exploits
P a g e | 28
Column Oriented Databases Vs Row Oriented Databases


the C++ portion provides high performance and direct access to low-level
operating system, network, and file system resources
the Java runtime system enables machine-code evaluation of SQL expressions via
a combination of Java code generation and just-in-time compilation (as part of
query execution)
The sections below provide high-level overviews of some of the most innovative
components.
In LucidDB, database tables are vertically partitioned and stored in a highly compressed
form. Vertical partitioning means that each page on disk stores values from only one
column rather than entire rows; as a result, compression algorithms are much more
effective because they can operate on homogeneous value domains, often with only a few
distinct values. For example, a column storing the state component of a US address only
has 50 possible values, so each value can be stored using only 6 bits instead of the 2-byte
character strings used in a traditional uncompressed representation.
Vertical partitioning also means that a query that only accesses a subset of the columns of
the referenced tables can avoid reading the other columns entirely. The net effect of
vertical partitioning is greatly improved performance due to reduced disk I/O and more
effective caching (data compression allows a greater logical dataset size to fit into a given
amount of physical memory). Compression also allows disk storage to be used more
effectively (e.g. for maintaining more indexes).
The companion to column store is bitmap indexing, which has well-known advantages
for data warehousing. LucidDB's bitmap index implementation takes advantage of
column store features; for example, bitmaps are built directly off of the compressed row
representation, and are themselves stored compressed, reducing load time significantly.
And at query time, they can be rapidly intersected to identify the exact portion of the
table which contributes to query results.
Although LucidDB is primarily intended as a read-only data warehouse, write operations
are required for loading data into the warehouse. To allow reads to continue during data
loads and updates, LucidDB uses page versioning. Data pages are read based on a
snapshot of the data at the start of the initiating transaction. When a page needs to be
updated, a new version of the page is created and chained from the original page. Each
subsequent write transaction will create a new version of the page and add it to the
existing page chain. Therefore, long-running, read-only transactions can continue to read
older snapshots while newer transactions will read more up-to-date snapshots. Pages that
are no longer in use can be reclaimed so the page chains don't grow forever.
Query Optimization and Execution
LucidDB's optimizer is designed with the assumptions of a data warehousing
environment in mind, so no hints are needed to get it to choose the best plan for typical
P a g e | 29
Column Oriented Databases Vs Row Oriented Databases
analytical query patterns. In particular, cost-based analysis is used to determine the order
in which joins are executed, as well as which bitmap indexes to use when applying tablelevel filters and star join optimizations. The analysis uses data statistics gathered and
stored as metadata in the system catalogs, allowing the optimizer to realistically compare
one option versus another even when many joins are involved. By using cost-based
analysis in a targeted fashion for these complex areas, LucidDB is able to consider a large
space of viable candidates for join and indexing combinations. By using heuristics in
other areas, LucidDB keeps optimization time to a minimum by avoiding an explosion in
the search space.
LucidDB is capable of executing extract/transform/load processes directly as pipelined
SQL statements, without any external ETL engine required.
Installation
It can be installed from the following website. We installed the windows version it is still
beta version. Lucid 0.7.2 is the latest http://www.luciddb.org/
First you need to set the environment variable for JAVA_HOME.
After that from the command prompt we installed the install.sh file from install folder.
Then we opened the luciddb server.bat file . Once the server is listening we starting
running the queries from sqlclient.
Jdbc url : jdbc:luciddb:rmi://localhost
Website – http://www.luciddb.org
Classname – com.lucidera.jdbc.luciddbrmidriver
lucidDbClient jar files are located at plugin folder.
Set Query Benchmark Results
This page provides the LucidDB-specific setup needed to run the classic Set Query
Benchmark.
Create Schema
First, create the schema and table which will hold the loaded data:
CREATE SCHEMA SQBM;
SET SCHEMA 'SQBM';
CREATE TABLE BENCH1M (
KSEQ INTEGER PRIMARY KEY,
K2 INTEGER,
K4 INTEGER,
K5 INTEGER,
K10 INTEGER,
K25 INTEGER,
K100 INTEGER,
K1K INTEGER,
K10K INTEGER,
P a g e | 30
Column Oriented Databases Vs Row Oriented Databases
K40K INTEGER,
K100K INTEGER,
K250K INTEGER,
K500K INTEGER,
S1 VARCHAR(8),
S2 VARCHAR(20),
S3 VARCHAR(20),
S4 VARCHAR(20),
S5 VARCHAR(20),
S6 VARCHAR(20),
S7 VARCHAR(20),
S8 VARCHAR(20)
);
Data Source
For the data source we downloaded the luciddb-sqbm-testdata from the website which
has the file name bench1m.csv. The following sql inserts the data into the table. The flat
file is created as a foreign server.
LucidDb as the plugin SQL/MED for accessing flat files. The data file may be associated
with a control file that colntains column descriptions.
The flat file wrapper is accessed by SQL/MED interface.
The data wrapper is defined as sys_file_wrapper. A foreign server is defined which
describes the directory of files corresponding to single schema. data files can be accessed
by creating a foreign table . The plugin is included as part of the LucidDB distribution; a
corresponding foreign data wrapper instance named SYS_FILE_WRAPPER is
predefined by LucidDB initialization scripts. A File foreign data wrapper provides the
capability to mediate access to file foreign servers. You can define a flat file foreign
server to access files in a particular directory.
CREATE SERVER FF_SERVER
FOREIGN DATA WRAPPER SYS_FILE_WRAPPER
OPTIONS(
DIRECTORY '../luciddb-sqbm-testdata/',
FILE_EXTENSION '.csv',
CTRL_FILE_EXTENSION '.bcp',
FIELD_DELIMITER ',',
LINE_DELIMITER '\n',
QUOTE_CHAR '"',
ESCAPE_CHAR '',
WITH_HEADER 'yes',
NUM_ROWS_SCAN '3'
);
CREATE FOREIGN TABLE BENCH_SOURCE (
C1 INTEGER,
C2 INTEGER,
C4 INTEGER,
C5 INTEGER,
C10 INTEGER,
C25 INTEGER,
C100 INTEGER,
P a g e | 31
Column Oriented Databases Vs Row Oriented Databases
C1K INTEGER,
C10K INTEGER,
C40K INTEGER,
C100K INTEGER,
C250K INTEGER,
C500K INTEGER
)
SERVER FF_SERVER
OPTIONS(
SCHEMA_NAME 'BCP',
FILENAME 'bench1M'
);
Data insertion
We executed the following SQL to load the data from the flat file into the LucidDB table.
This will load actual data for the columns whose names start with K, whereas per the
benchmark specs, it will synthesize constant data for the columns whose names starts
with S.
INSERT INTO BENCH1M (
KSEQ,K2,K4,K5,K10,K25,K100,K1K,K10K,K40K,K100K,K250K,K500K, S1, S2, S3, S4,
S5, S6, S7, S8)
SELECT C1,C2,C4,C5,C10,C25,C100,C1K,C10K,C40K,C100K,C250K,C500K,
'12345678', '12345678900987654321', '12345678900987654321',
'12345678900987654321', '12345678900987654321',
'12345678900987654321',
'12345678900987654321', '12345678900987654321'
FROM BENCH_SOURCE;
Creating indexes in LucidDB
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K2_IDX ON BENCH1M(K2);
No rows affected (2.453 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K4_IDX ON BENCH1M(K4);
No rows affected (1.75 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K5_IDX ON BENCH1M(K5);
No rows affected (1.109 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K10_IDX ON BENCH1M(K10);
No rows affected (1.125 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K25_IDX ON BENCH1M(K25);
No rows affected (6.719 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K1K_IDX ON BENCH1M(K1K);
No rows affected (9.422 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K10K_IDX ON BENCH1M(K10K) ;
No rows affected (20.625 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K40K_IDX ON BENCH1M(K40K);
;
No rows affected (19.625 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K100K_IDX ON BENCH1M(K100
K);
No rows affected (13.203 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K250K_IDX ON BENCH1M(K250
K);
No rows affected (15.75 seconds)
P a g e | 32
Column Oriented Databases Vs Row Oriented Databases
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K500K_IDX ON BENCH1M(K500
K);
No rows affected (14.875 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K2_K100_IDX ON BENCH1M(K2
,K100);
No rows affected (17.25 seconds)
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K4_K25_IDX ON BENCH1M(K4,
K25);
No rows affected
0: jdbc:luciddb:rmi://localhost:5434> CREATE INDEX B1M_K10_K25_IDX ON BENCH1M(K1
0,K25);
No rows affected (15.297 seconds)
Queries:
select * from bench1m;
lucidDB
339.563 sec - 6 mins
Oracle - 15 mins
SELECT COUNT(*) FROM BENCH1M
WHERE
(KSEQ BETWEEN 40000 AND 41000
OR KSEQ BETWEEN 42000 AND 43000
OR KSEQ BETWEEN 44000 AND 45000
OR KSEQ BETWEEN 46000 AND 47000
OR KSEQ BETWEEN 48000 AND 50000)
AND K10 = 3;
luciddb - 1.25 sec
Oracle -34.44 sec
SELECT KSEQ, K500K FROM BENCH1M
WHERE
K2 = 1
AND K100 > 80
AND K10K BETWEEN 2000 AND 3000;
lucidDB - 6.25 sec
Oracle - 7.289 sec
SELECT K2, K100, COUNT(*) FROM BENCH1M
GROUP BY K2, K100;
luciddb - 1.156
Oracle - 2.03 sec
P a g e | 33
Column Oriented Databases Vs Row Oriented Databases
Advantages and Disadvantages:
Pros:
1. Data compression: The repeating column values are represented by the single
column value and also different projections can be used for storing the column in
a format that is used mostly.
2. Improved Bandwidth Utilization: Only the required data is read from the disk, it
does not read any extra data or columns as in the case of the row oriented
database.
3. Improved Code Pipelining: CPU cycle performance is saved with the column
oriented as we use the performance only for the required attributes.
4. Improved cache locality: the cache in the column oriented contains only the
required data instead of the unnecessary data which is the case for the row
oriented database.
Cons:
1. Increased Disk Seek Time: As multiple columns are read in parallel increases the
time disk seek time.
2. Increased cost of Inserts: Small inserts will need more time because as column
oriented data is stored in columns, multiple places needs to be updated.
3. Increases tuple reconstruction costs: while interfacing with the Drivers and JDBC
the reconstruction of the row from these columns takes more time and offsets the
advantages of the column oriented database.
Summary:
Column architecture doesn’t read unnecessary columns
Avoids decompression costs and perform operations faster.
Use compression schemes allow us to lower our disk space requirements.
After the completion we felt that determining of the performance of a database
system we need to consider different factors which we have left out such as the
CPU cost, Plan etc.
5. From the above advantages and disadvantages we can say that column oriented
databases can be used only in the places where the requirements satisfies the
advantages of column oriented databases as in OLAP, Data mining etc..,.
1.
2.
3.
4.
P a g e | 34
Column Oriented Databases Vs Row Oriented Databases
References:
1. Wikipedia, http://en.wikipedia.org/wiki/Column-oriented_DBMS
Accessed – 14-sep-2007
2. http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf
Accessed – 14-sep-2007
3. http://marklogic.blogspot.com/2007/03/whats-column-oriented-dbms.html
Accessed – 14-sep-2007
4. http://en.wikipedia.org/wiki/MonetDB
Accessed – 14-sep-2007
5. http://monetdb.cwi.nl/projects/monetdb/SQL/QuickTour/index.html
Accessed – 14-sep-2007
6. Compression and Query Execution within Column Oriented Databases by
Miguel C. Ferreira , MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2005
7. http://www.luciddb.org/
Accessed by 30-nov-2007.
P a g e | 35
Download