WX2 Data Warehouse Appliance

Introduction to Kognitio
06 January 2012
Michael Hiskey
Vice President, Marketing and Business Development
Sachin Sangtani
Senior Technical Consultant
Kognitio is an analytical accelerator: a massively parallel, in-memory analytical data warehouse, built from the ground up for large and complex analytics on big data sets, that interoperates with your existing infrastructure.
Why rip out and replace systems? Kognitio lets you:
• Accelerate your analytical capability without disruption to existing systems
• Lower your hardware and software costs while increasing performance 10-100x
Agenda
• Company Overview
• WX2 Overview
– DW Origins and evolution
– In Memory: What, Why, Where it helps
• Technical Overview
– Software Features
– Performance
– Integration
– Appliance
– DR/Backup/Recovery
– Timescale
• Product Roadmap
• What To Look For In Potential Clients
• Q&A
About Kognitio
Software company founded in the UK
• 20+ year heritage focused on high
performance analytical solutions
– Solid reference client successes
• WX2 Analytical Data Warehouse
– In-memory processing
– Massively parallel (MPP)
– Cloud deployment model
– Mature product (v7)
– Open standards & linear scalability
– Multi-threaded high-volume data loads
– Ease of administration & low TCO:
• NO indexes, partitions, aggregates, etc.
• Flexible delivery model
– Software, Appliance
– DaaS™ (Data warehousing as a Service)
• Commodity x86/Linux servers
[Diagram: WX2 sits between BI/analytics tools (accessed directly or via 3rd-party services/apps) and the underlying systems: Hadoop clusters, legacy EDWs, and other enterprise data warehouses, with ETL feeds in between.]
Kognitio WX2 Overview
Traditional data warehouse/BI solution stack
[Diagram: operational systems (transactional, customer, and product data) feed a central data warehouse via ETL; dependent databases, data marts, and cubes in turn feed multiple BI reporting and BI analysis applications.]
“Single version of the truth”
However:
• Data duplication in marts and cubes
• Massive cube proliferation
• Complex and time-consuming data extract operations
• High admin cost (DBAs, sys admins, cube builders)
• Often specialized HW and storage required
Why do we need such complex solutions?
Poor performance:
• Data held on slow mechanical disk
• Queries run against disk-based data
(I/O Contention)
The in-memory answer:
• Data resident in high-speed memory (RAM) for fast access
• Queries run against data in memory, returning results without I/O contention
• Persistent data store kept on disk as well
Add MPP to in-memory
Industry-standard x86 servers, storage, and memory: RAM is merged together across servers into a shared fabric, and queries run against the combined memory, returning results in parallel.
Parallelism in every direction = linear scalability
Massively parallel processing (MPP) allows systems to be “scaled out” to accommodate any data size… and any analytical workload.
So what’s required?
• An array of industry-standard servers
• A standard operating system
• An in-memory-capable relational database management system (such as Kognitio WX2)
How do MPP systems differ from traditional databases?
(MPP = Massively Parallel Processing)
• Traditional relational database projects have to focus on making subsets of data available at acceptable performance levels, and must develop cubes and indices and tune for performance.
• High-performance MPP projects need no cubes and no indices: performance to burn. Focus on results, as sketched below.
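To make the contrast concrete, here is a minimal sketch (generic SQL; table, column, and index names are illustrative, not from any specific system) of the tuning DDL a traditional disk-based warehouse typically needs before ad-hoc queries perform acceptably, versus simply running the query on an in-memory MPP system:

-- Traditional disk-based warehouse: tune before you query
CREATE INDEX idx_trans_acct ON cc_trans (Account_ID);
CREATE INDEX idx_trans_date ON cc_trans (Effective_Date);
-- ...plus partitions, aggregate tables, and/or cube builds

-- In-memory MPP: run the ad-hoc query directly; every node
-- scans its share of the in-memory data in parallel
SELECT Account_ID, SUM(Transaction_Amount) AS Total_Spend
FROM cc_trans
WHERE Trans_Type = 'D'
GROUP BY Account_ID;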
How are cost savings achieved? In development.
Let’s compare a typical implementation using traditional resources with that of in-memory MPP platforms.
[Chart: development effort by phase (business & data understanding, data preparation) for traditional relational database projects (100%) versus target high-performance database projects (circa 50%), showing savings of circa >60% and circa >55% per phase.]
With no indices, the need to create sophisticated schemas that match current business needs while also anticipating change is eliminated to a large extent, saving around 60% of the pre-analytics effort.
With parallel high-performance loading and very high-speed processing, ETL becomes ELT for easier data manipulation (sketched below). Fast analytics saves users from developing queries based on what can be achieved rather than what they want to do.
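To make the ETL-to-ELT shift concrete, here is a minimal sketch (table and column names are illustrative; CREATE TABLE AS syntax varies slightly by platform): raw data is bulk-loaded as-is into a staging table, and the transformation then runs inside the database as ordinary SQL, where the MPP engine can parallelize it.

-- Stage the raw feed with no up-front transformation
CREATE TABLE stg_transactions (
    Account_ID         INTEGER,
    Trans_Type         CHAR(1),
    Effective_Date     DATE,
    Transaction_Amount DECIMAL(12,2)
);
-- (bulk-load stg_transactions with the parallel loader)

-- Transform in-database: the "T" now happens after the "L"
CREATE TABLE fct_daily_spend AS
SELECT Account_ID, Effective_Date,
       SUM(Transaction_Amount) AS Daily_Spend
FROM stg_transactions
WHERE Trans_Type = 'D'
GROUP BY Account_ID, Effective_Date;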
Why? Because traditional database management needs all this…
[Diagram: between the operational systems and the user community sit ETL, DB development, data load, and the analytical database. Around them: operational systems management, a new data extract request for every new data feed or new analytical scope, and heavyweight database administration covering index management, table space management, partitioning, aggregation, and temp table management. The result is significant DBA cost ($$$$$).]
We provide data agility… eliminating partitioning, indexing, space and aggregation management.
[Diagram: the same flow (operational systems, ETL, DB development, data load, analytical database, user community) with the index/partition/aggregation administration layer removed; new data feeds and new analytical scope go straight in.]
Simply Load and Go
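A minimal "load and go" sketch of the streamlined flow above (generic SQL; the bulk-load step is shown as a comment because loader tooling is product-specific): create the table, load it, and query it immediately, with no index, partition, or aggregate DDL in between.

-- Create the table; no tuning objects are required
CREATE TABLE sales (
    Sale_ID   BIGINT,
    Store_ID  INTEGER,
    Sale_Date DATE,
    Amount    DECIMAL(10,2)
);
-- (bulk-load 'sales' with the platform's parallel loader)

-- Query immediately: no CREATE INDEX, no partitioning, no aggregates
SELECT Store_ID, SUM(Amount) AS Revenue
FROM sales
WHERE Sale_Date >= DATE '2011-01-01'
GROUP BY Store_ID;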
Technical Overview
Typical Analysis/Reporting Query
-- Balance information of targeted accounts obtained from transaction table
select C.Client_ID,
D.Demog_Group, D.Demog_Desc, 1+avg(F.Credit_Limit_Changes) CL_Issued,
sum(case when T.Trans_Type='C' then T.Transaction_Amount else 0 end) - sum(case when T.Trans_Type='D' then T.Transaction_Amount else 0 end) Balance,
sum(case when T.Trans_Type='C' then T.Transaction_Amount else 0 end) Total_Credit,
sum(case when T.Trans_Type='D' then T.Transaction_Amount else 0 end) Total_Debit,
min(case when T.Trans_Type='C' then date '2009-11-15' - T.Effective_Date else 365*10 end) Days_Last_Credit,
min(case when T.Trans_Type='D' then date '2009-11-15' - T.Effective_Date else 365*10 end) Days_Last_Debit
from DEMO_FS.V_FIN_ACCOUNT F, DEMO_FS.V_FIN_CLIENT C, DEMO_FS.V_FIN_CLIENT_ACCOUNT_LINK L,
DEMO_FS.V_FIN_ADD_CLIENT A, DEMO_FS.V_FIN_DEMOG_DESCS D, DEMO_FS.V_FIN_CC_TRANS T,
--Query to produce campaign planning
(
select Account_ID, count(Trans_Year) Years_Present, sum(No_Trans) No_Trans, sum(Total_Spend) Total_Spend,
case count(Trans_Year) when 1 then 'One-off' else 'Repeat' end Behavior_Flag
from (
select * from
(
select Account_ID, Extract(Year from Effective_Date) Trans_Year, count(Transaction_ID) No_Trans,
sum(Transaction_Amount) Total_Spend, avg(Transaction_Amount) Avg_Spend
from DEMO_FS.V_FIN_CC_TRANS
where extract(year from Effective_Date)<2009 and Trans_Type='D' and Account_ID<>9025011
and actionid in (
select actionid
from
DEMO_FS.V_FIN_actions
where actionoriginid =1)
group by Account_ID, Extract(Year from Effective_Date)
) Acc_Summary
where No_Trans in (3,4,5,6) and Avg_Spend>1000 and Trans_Year between 2004 and 2008
) Target_Accs
group by Account_ID
) Campaign_Grouping
where Campaign_Grouping.Account_ID=L.Account_ID
and L.Client_ID=C.Client_ID
and C.Client_ID=A.Client_ID
and A.Demog_Code=D.Demog_Code
and D.Demog_code in (1,4,5,9,10,11,50,55)
and Campaign_Grouping.Account_ID=F.Account_ID
and Campaign_Grouping.Account_ID=T.Account_ID
and T.Effective_Date < date '2009-11-15'
group by C.Client_ID, Demog_Group, Demog_Desc
order by Days_Last_Debit;
Query features highlighted on the slide: CASE statements; 6 tables plus inline subqueries; NOT EQUAL TO; multiple passes through the fact table; BETWEEN; 4 nested subqueries; IN; aggregation; numerous predicates.
11 BILLION row fact table :: 10-30 seconds*
* on different sized machines / different data volumes
WX2 :: Building Block
[Diagram: a single x86/Linux server or blade is the building block, combining processing (CPU), RAM holding tables and views, and disk for data storage.]
Software stack on each node:
• Query processing
• Messaging
• Queue/resource management
• Database processes (compiler, optimizer, etc.)
• Tables/views pinned in memory
• Data files (persistent data)
All running as WX2 database software on a Linux operating system.
WX2 :: Appliance
• Commodity x86 Linux hardware
• Standard form factor for most data centers
• Redundant network and hardware components
• Standard appliance built on HP blades; delivered pre-configured
• 10GbE networking
• Heavy use of RAM and CPU
• Appliance can be:
– Carved into multiple instances
– Strung together with other appliances to scale horizontally
– Used together or separately for configuring resiliency
• Appliance tiers:
– Rapids: high performance (heavy use of RAM)
– Rivers: medium capacity (mostly RAM, some disk)
– Lakes: high capacity (simple reporting, lower performance)
WX2 :: Appliance
High-speed data loads:
• Into RAM @ ~8 TB/hour
• Onto disk @ ~1.5 TB/hour
• Linear file system
Create/refresh images in RAM (sketched below):
• High-speed access to hot data
• Complex/nested views/images
• ELT
• Manage massive amounts of data in RAM
Utilize RAM for query processing:
• Access RAM-based views/images
• Process in memory, no disk I/O
• All nodes in the appliance participate equally (MPP)
• Message-passing kernel optimizes communication
• Queries executed in machine code (jump offsets to access columns)
• Machine-code-level use of offsets to optimize access to RAM
• Mature RAM management techniques
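The image workflow above can be sketched in SQL. The view definition is standard; the "image" statement is an assumed WX2-style DDL form (exact syntax varies by product version), and the object names reuse the demo schema from the earlier query for illustration only:

-- Define a view over hot data (standard SQL)
CREATE VIEW v_recent_trans AS
SELECT Account_ID, Effective_Date, Transaction_Amount
FROM DEMO_FS.V_FIN_CC_TRANS
WHERE Effective_Date >= DATE '2009-01-01';

-- Pin the view's result set into RAM across all nodes so that
-- subsequent queries run with no disk I/O
-- (assumed WX2-style syntax; check product documentation)
CREATE VIEW IMAGE v_recent_trans;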
WX2 :: Software :: Performance
• Row-based scanning technology in common with other DWA technologies
• All server nodes participate equally and maximally in a query
• Enormous brute force processing from arrays of commodity
servers with lots of CPU cores
• In-memory data can feed CPU cores without I/O wait
• ~650 million rows per second per server
– 10 servers = 6.5 BILLION rows / second
– 100 servers = 65 BILLION rows /second
• Load Rates of over 8TB/Hour to RAM; 1.5TB/Hour to Disk
• Effective and mature memory and resource management
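As a rough back-of-envelope check (our arithmetic, not a vendor figure), these rates tie back to the 11-billion-row fact table example earlier: a 10-server system scanning at 6.5 billion rows/second covers 11 billion rows in about 11 ÷ 6.5 ≈ 1.7 seconds per pass, so a complex query making several passes through the fact table lands naturally in the quoted 10-30 second range, depending on machine size and data volume.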
Getting smaller, getting faster
• Retail analytics with ~24 Billion EPOS records
• A 2005 POC required a 125-blade server system
– Platform was physically located in Germany
– WX2 installation and data load was done remotely from
the UK – no Kognitio or customer employees on site
– Installation took one day; the data load took 4 hours
– System scanned all 24 billion records in 0.8s
– Complex basket analysis queries took 11-15 seconds
• Clients in 2008/9 purchased 64-blade appliances for similar production volumes
• Today the same workload can be demoed on 16 blade servers with better performance!
• On-going increase in CPU cores and RAM per server
• WX2 requires no tweaks or changes for different scale
systems – Kognitio benefits greatly by exploiting the
commodity computing development curve
WX2 :: Software :: Standards :: Integration
[Diagram: data sources feed WX2 through ETL/ELT tools (e.g., DataStage, SQDR) running on standard servers; client access is via standard SQL over ODBC/JDBC, plus MDX.]
WX2 :: Resiliency
Resiliency at every level:
• Disk
• Node
• Hardware
• Network
• Software
WX2 :: DR/Backup/Recovery
• Flexibility in system configurations: Instance B (the DR instance) need not have the same configuration as Instance A (the primary production instance)
• Parallel operations for bulk import and export of data
• Multi-versioning file system: row changes are kept until a reclaim/repack event, and historical changes can be queried
– Exploited by incremental backup (change-only backup)
– Incremental backup can lock transaction history via a transaction marker
• Queries against in-memory data are isolated from the disk I/O of backup operations
Approaches:
• Full + incremental
• Simultaneous load to two environments
– Dual feed
– Dual ETL instances
• Incremental backup to a smaller environment
• Hybrid
– Incremental backup of dimensions; dual feed of facts
• Snapshot/clone disk volumes (production has to be stopped)
• SAN-to-SAN mirror
Timescale
[Timeline, 1979–2010: company history, clients (post-2005), and product releases.]
Competition, by approximate year of founding:
• 1979: Teradata
• circa 2005: Greenplum, Vertica, Netezza, ParAccel
• circa 2010: SAP HANA, Oracle Exalytics
Q&A
Thank You!
connect
kognitio.com
kognitio.tel
kognitio.com/blog
contact
Michael Hiskey
Vice President,
Marketing & Business Development
michael.hiskey@kognitio.com
+1.917.375.8196
twitter.com/kognitio
linkedin.com/companies/kognitio
tinyurl.com/kognitio
youtube.com/user/kognitiowx2
Sach Sangtani
Senior Technical Consultant
sachin.sangtani@kognitio.com
+1.617.645.4073