Introduction to Kognitio
06 January 2012
Michael Hiskey, Vice President, Marketing and Business Development
Sachin Sangtani, Senior Technical Consultant

Kognitio is an analytical accelerator
• Built from the ground up to satisfy large and complex analytics on big data sets
• A massively parallel, in-memory analytical data warehouse that interoperates with your existing infrastructure
• Why rip out and replace systems? Kognitio accelerates your analytical ability without disruption to existing systems
• Lower your hardware and software costs while increasing performance 10-100x

Agenda
• Company Overview
• WX2 Overview
  – DW origins and evolution
  – In memory: what, why, where it helps
• Technical Overview
  – Software features
  – Performance
  – Integration
  – Appliance
  – DR/Backup/Recovery
  – Timescale
• Product Roadmap
• What To Look For In Potential Clients
• Q&A

About Kognitio
• Software company founded in the UK
• 20+ year heritage focused on high-performance analytical solutions
  – Solid reference client successes
• WX2 Analytical Data Warehouse
  – In-memory processing
  – Massively parallel (MPP)
  – Cloud deployment model
  – Mature product (v7)
  – Open standards and linear scalability
  – Multi-threaded high-volume data loads
  – Ease of administration and low TCO: NO indexes, partitions, aggregates, etc.
• Flexible delivery model
  – Software, appliance
  – DaaS(TM): Data warehousing as a Service
• Commodity x86/Linux servers
• Sits above underlying systems, Hadoop clusters, enterprise data warehouses, and legacy EDWs, serving direct BI/analytics and 3rd-party services/apps

Kognitio WX2 Overview

The traditional data warehouse/BI solution stack: operational systems (transactional, customer, and product data) feed a data warehouse via ETL, which in turn feeds dependent databases, data marts, and cubes sitting behind BI reporting and analysis applications. A "single version of the truth." However:
• Data duplication in marts and cubes
• Massive cube proliferation
• Complex and time-consuming data extract operations
• High administration cost (DBAs, sys admins, cube maintenance)
• Often specialized hardware and storage required

Why do we need such complex solutions? Poor performance:
• Data held on slow mechanical disk
• Queries run against disk-based data (I/O contention)

The in-memory alternative: data is resident in high-speed RAM for fast access, with a persistent data store on mechanical disk as well. Queries run against the data in memory, so results come back without disk I/O.

Add MPP to in-memory: industry-standard x86 servers, storage, and memory, with RAM merged together across servers into a shared fabric. Parallelism in every direction = linear scalability; massively parallel processing (MPP) allows systems to be "scaled out" to accommodate any data size... and any analytical workload.

So what's required?
• An array of industry-standard servers
• A standard operating system
• An in-memory-capable relational database management system (such as Kognitio WX2)

How do MPP systems differ from traditional databases? Traditional relational database projects have to focus on making subsets of data available at acceptable performance levels, developing cubes and indices and tuning for performance. Target high-performance MPP database projects need no cubes and no indices; with performance to burn, the focus stays on results.
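To make the contrast concrete, here is a minimal sketch of the kind of ad-hoc aggregate a traditional warehouse would typically answer from a pre-built cube or summary table, but which a brute-force in-memory MPP scan answers directly against the base data. The table and column names are hypothetical, not from any Kognitio demo schema:

    -- Hypothetical raw sales fact table: no indexes, no partitions,
    -- no pre-computed aggregates. An in-memory MPP engine answers
    -- this by scanning the fact table across all nodes in parallel.
    select   s.store_region,
             extract(year from s.sale_date) sale_year,
             count(*)           transactions,
             sum(s.sale_amount) total_revenue,
             avg(s.sale_amount) avg_basket
    from     sales_fact s
    where    s.sale_date between date '2010-01-01' and date '2011-12-31'
    group by s.store_region, extract(year from s.sale_date)
    order by total_revenue desc;

In a disk-based system this query would usually be redirected at a summary table maintained by ETL; here it simply runs against the raw rows.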
How are cost savings achieved?

Let's compare a typical implementation (in development) using traditional resources with that of an in-memory MPP platform, across the phases of business & data understanding and data preparation. Taking the traditional relational database project as the 100% baseline, target high-performance database projects come in at around 50%, with savings of circa >55% and >60% across the two phases:
• With parallel high-performance loading, no indices, and very high-speed processing, ETL becomes ELT for easier data manipulation.
• The need to create sophisticated schemas that match current business needs while also anticipating change is almost eliminated, and fast analytics spares users from developing queries based on what can be achieved rather than what they want to do, saving roughly 60% of pre-analytics effort.

Why? Because traditional database management needs all this ($$$$$): between the operational systems and the user community sit ETL, data load, and DB development for every new data feed, new data extract request, and new analytical scope, plus database administration covering index management, table space management, partitioning, aggregation, and temp table management. Significant DBA cost.

We provide data agility, eliminating partitioning, indexing, space, and aggregation management ($$): the same new data feeds, extract requests, and analytical scope flow from operational systems through ETL and data load straight to the user community. Simply load and go.

Technical Overview

Typical Analysis/Reporting Query

    -- Balance information of targeted accounts obtained from transaction table
    select C.Client_ID,
           D.Demog_Group,
           D.Demog_Desc,
           1 + avg(F.Credit_Limit_Changes) CL_Issued,
           sum(case when T.Trans_Type = 'C' then T.Transaction_Amount else 0 end)
             - sum(case when T.Trans_Type = 'D' then T.Transaction_Amount else 0 end) Balance,
           sum(case when T.Trans_Type = 'C' then T.Transaction_Amount else 0 end) Total_Credit,
           sum(case when T.Trans_Type = 'D' then T.Transaction_Amount else 0 end) Total_Debit,
           min(case when T.Trans_Type = 'C' then date '2009-11-15' - T.Effective_Date else 365*10 end) Days_Last_Credit,
           min(case when T.Trans_Type = 'D' then date '2009-11-15' - T.Effective_Date else 365*10 end) Days_Last_Debit
    from   DEMO_FS.V_FIN_ACCOUNT F,
           DEMO_FS.V_FIN_CLIENT C,
           DEMO_FS.V_FIN_CLIENT_ACCOUNT_LINK L,
           DEMO_FS.V_FIN_ADD_CLIENT A,
           DEMO_FS.V_FIN_DEMOG_DESCS D,
           DEMO_FS.V_FIN_CC_TRANS T,
           -- Query to produce campaign planning
           ( select Account_ID,
                    count(Trans_Year) Years_Present,
                    sum(No_Trans) No_Trans,
                    sum(Total_Spend) Total_Spend,
                    case count(Trans_Year) when 1 then 'One-off' else 'Repeat' end Behavior_Flag
             from ( select *
                    from ( select Account_ID,
                                  extract(year from Effective_Date) Trans_Year,
                                  count(Transaction_ID) No_Trans,
                                  sum(Transaction_Amount) Total_Spend,
                                  avg(Transaction_Amount) Avg_Spend
                           from   DEMO_FS.V_FIN_CC_TRANS
                           where  extract(year from Effective_Date) < 2009
                           and    Trans_Type = 'D'
                           and    Account_ID <> 9025011
                           and    actionid in ( select actionid
                                                from   DEMO_FS.V_FIN_actions
                                                where  actionoriginid = 1 )
                           group by Account_ID, extract(year from Effective_Date)
                         ) Acc_Summary
                    where No_Trans in (3,4,5,6)
                    and   Avg_Spend > 1000
                    and   Trans_Year between 2004 and 2008
                  ) Target_Accs
             group by Account_ID
           ) Campaign_Grouping
    where Campaign_Grouping.Account_ID = L.Account_ID
    and   L.Client_ID = C.Client_ID
    and   C.Client_ID = A.Client_ID
    and   A.Demog_Code = D.Demog_Code
    and   D.Demog_Code in (1,4,5,9,10,11,50,55)
    and   Campaign_Grouping.Account_ID = F.Account_ID
    and   Campaign_Grouping.Account_ID = T.Account_ID
    and   T.Effective_Date < date '2009-11-15'
    group by C.Client_ID, Demog_Group, Demog_Desc
    order by Days_Last_Debit;

The slide's callouts highlight what this query exercises:
• CASE statements
• 6 tables plus inline subqueries
• 4 nested subqueries
• NOT EQUAL TO, BETWEEN, and IN predicates
• Aggregation and numerous other predicates
• Multiple passes through the fact table
Against an 11 BILLION row fact table it completes in 10-30 seconds* (* on different sized machines / different volumes).
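The workhorse of this query is conditional aggregation: a CASE expression inside SUM or MIN pivots credit and debit transaction rows into side-by-side columns in a single pass. Stripped of the joins and subqueries, the pattern looks like this (reusing the DEMO_FS.V_FIN_CC_TRANS view from the example above):

    -- Conditional aggregation in minimal form: one scan of the
    -- transactions pivots credits and debits into separate columns
    -- and nets them into a balance, per account.
    select Account_ID,
           sum(case when Trans_Type = 'C' then Transaction_Amount else 0 end) Total_Credit,
           sum(case when Trans_Type = 'D' then Transaction_Amount else 0 end) Total_Debit,
           sum(case when Trans_Type = 'C' then Transaction_Amount else 0 end)
             - sum(case when Trans_Type = 'D' then Transaction_Amount else 0 end) Balance
    from   DEMO_FS.V_FIN_CC_TRANS
    group by Account_ID;

Because each CASE branch is evaluated row by row within the same scan, adding more pivoted columns costs no extra passes; on a scan-oriented MPP engine this is exactly the shape of work that parallelizes cleanly.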
WX2 :: Building Block

Each building block is a single x86 server or blade running Linux:
• In RAM: tables and views pinned in memory, plus query processing
• Database processes (compiler, optimizer, etc.)
• Messaging queue / resource management
• Data storage: data files (persistent data)
• WX2 database software on the Linux operating system

WX2 :: Appliance
• Commodity x86 Linux hardware
• Standard form factor for most data centers
• Redundant network and hardware components
• Standard appliance built on HP blades; delivered pre-configured
• 10GbE networking
• Heavy use of RAM and CPU

An appliance can be:
• Carved into multiple instances
• Strung together with other appliances to scale horizontally
• Used together or separately for configuring resiliency

Appliance configurations:
• Rapids – high performance (high use of RAM)
• Rivers – medium capacity (mostly RAM, some disk)
• Lakes – high capacity (simple reporting, lower performance)

WX2 :: Appliance

High-speed data loads
• Into RAM at ~8TB/hour
• Onto disk at ~1.5TB/hour
• Linear file system

Create/refresh images in RAM (see the sketch at the end of this section)
• High-speed access to hot data
• Complex/nested views/images
• ELT
• Manage massive amounts of data in RAM

Utilize RAM for query processing
• Access RAM-based views/images
• Process in memory, no disk I/O
• All nodes in the appliance participate equally
• MPP Message Passing Kernel optimizes communication
• Queries executed in machine code (jump offsets to access columns)
• Machine-code-level use of offsets to optimize access of RAM
• Mature RAM management techniques

WX2 :: Software :: Performance
• Row-based scanning technology in common with other DWA technologies
• All server nodes participate equally and maximally in a query
• Enormous brute-force processing from arrays of commodity servers with many CPU cores
• In-memory data can feed CPU cores without I/O wait
• ~650 million rows per second per server
  – 10 servers = 6.5 BILLION rows/second
  – 100 servers = 65 BILLION rows/second
• Load rates of over 8TB/hour to RAM; 1.5TB/hour to disk
• Effective and mature memory and resource management

Getting smaller, getting faster
• Retail analytics with ~24 billion EPOS records
• POC in 2005 required a 125-blade server system
  – Platform was physically located in Germany
  – WX2 installation and data load were done remotely from the UK; no Kognitio or customer employees on site
  – Installation took one day, data load 4 hours
  – System scanned all 24 billion records in 0.8s
  – Complex basket analysis queries took 11-15 seconds
• Clients in 2008/9 purchased 64-blade appliances for similar production volumes
• Today we can demo on 16 blade servers with better performance!
• On-going increase in CPU cores and RAM per server
• WX2 requires no tweaks or changes for different scale systems
  – Kognitio benefits greatly by exploiting the commodity computing development curve
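To make "create/refresh images in RAM" concrete, here is an illustrative sketch. WX2 materializes a view's result set into distributed RAM as a view image; the statement below follows that concept, but treat the exact syntax as an assumption that may vary by product version, and the view name and date filter are hypothetical:

    -- Define a view over the hot subset of the data (view name and
    -- date filter are hypothetical).
    create view recent_trans as
    select Account_ID, Trans_Type, Transaction_Amount, Effective_Date
    from   DEMO_FS.V_FIN_CC_TRANS
    where  Effective_Date >= date '2009-01-01';

    -- Pin the view's result set into RAM across all nodes.
    -- (Illustrative syntax based on WX2's view-image concept; check
    -- the version-specific documentation.)
    create view image recent_trans;

    -- Subsequent queries against recent_trans are satisfied from the
    -- RAM-resident image, with no disk I/O.
    select Trans_Type, sum(Transaction_Amount) Total_Amount
    from   recent_trans
    group by Trans_Type;

Because images can be built over complex or nested views, transformations can run as ELT inside the database rather than in an external ETL tool.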
WX2 :: Software :: Standards :: Integration

Data sources flow into WX2 through standard ETL/ELT tools (e.g. SQDR, DataStage) running on anything from a single server to large server farms; client applications connect over standard SQL (ODBC/JDBC) and MDX.

WX2 :: Resiliency
Redundancy at every layer:
• Disk
• Node
• Hardware
• Network
• Software

WX2 :: DR/Backup/Recovery
• Flexibility in system configurations
  – Instance B need not have the same configuration as Instance A (the primary production instance)
• Parallel operations for bulk import and export of data
• Multi-versioning file system: row changes are kept until a reclaim/repack event, and historical changes can be queried
  – Exploited by incremental backup (change-only backup)
  – Incremental backup can lock transaction history via a transaction marker
• Queries against in-memory data are isolated from the disk I/O of backup operations

DR approaches:
• Full + incremental backup
• Simultaneous load to two environments
  – Dual feed
  – Dual ETL instances
• Incremental backup to a smaller environment
• Hybrid: incremental backup of dimensions, dual feed of facts
• Snapshot/clone disk volumes (production has to be stopped)
• SAN-to-SAN mirror

Timescale

A timeline from 1979 to 2010 charts the company's history, product releases, and clients (post-2005), alongside the competition by approximate year of founding: Teradata (1979); Greenplum, Vertica, Netezza, and ParAccel (circa 2005); SAP HANA and Oracle Exalytics (circa 2010).

Thank You! Q&A

connect
kognitio.com | kognitio.tel | kognitio.com/blog
twitter.com/kognitio | linkedin.com/companies/kognitio | tinyurl.com/kognitio | youtube.com/user/kognitiowx2

contact
Michael Hiskey, Vice President, Marketing & Business Development
michael.hiskey@kognitio.com, +1.917.375.8196
Sachin Sangtani, Senior Technical Consultant
sachin.sangtani@kognitio.com, +1.617.645.4073