Data Packs

Infobright Meetup
Host: Avner Algom
May 28, 2012

Agenda
• Infobright
    Paul Desjardins, VP Bus Dev, Infobright Inc.
      • What part of the Big Data problem does Infobright solve?
      • Where does Infobright fit in the database landscape?
    Joon Kim, Sr. Sales Engineer
      • Technical overview
      • Use cases
• WebCollage
    Lior Haham, WebCollage
      • Case study: Using Analytics for Hosted Web Applications
• Zaponet
    Asaf Birenzvieg, CEO
      • Introduction / Experience with Infobright
• Q/A

Growing Customer Base across Use Cases and Verticals
• 1000 direct and OEM installations across North America, EMEA and Asia
• 8 of Top 10 Global Telecom Carriers using Infobright via OEM/ISVs
Verticals:
• Logistics, Manufacturing, Business Intelligence
• Online & Mobile Advertising/Web Analytics
• Government, Utilities, Research
• Financial Services
• Telecom & Security
• Gaming, Social Networks

The Machine-Generated Data Problem
"Machine-generated data is the future of data management." Curt Monash, DBMS2
Machine-generated/hybrid data
• Weblogs
• Computer, network events
• CDRs
• Financial trades
• Sensors, RFID etc.
• Online game data
Human-generated data: input from most conventional kinds of transactions
• Purchase/sale
• Inventory
• Manufacturing
• Employment status change
[Chart: Rate of Growth]

The Value in the Data
"Analytics drives insights; insights lead to greater understanding of customers and markets; that understanding yields innovative products, better customer targeting, improved pricing, and superior growth in both revenue and profits." Accenture Technology Vision, 2011
Network Analytics
• Network optimization
• Troubleshooting
• Capacity Planning
• Customer Assurance
• Fraud Detection
CDR Analytics
• Customer Behavior Analysis
• Marketing Campaigns/Services Analysis
• Optimize network capacity
• Fraud Detection
• Compliance and Audit
Mobile Advertising Analytics
• Need to capture web data, mobile data, network data
• Mobile ad campaign analytics
• Customer Behavior Analysis

Current Technology: Hitting the Wall
Today's database technology requires huge effort and massive hardware.

How Performance Issues are Typically Addressed, by Pace of Data Growth
[Bar chart; series: High Growth, Low Growth; the two values per item are listed in the order they were extracted]
• Tune or upgrade existing databases: 75%, 66%
• Upgrade server hardware/processors: 70%, 54%
• Upgrade/expand storage systems: 60%, 33%
• Archive older data on other systems: 44%, 30%
• Upgrade networking infrastructure: 32%, 21%
• Don't know / unsure: 4%, 7%
Source: Keeping Up with Ever-Expanding Enterprise Data, by Joseph McKendrick, Research Analyst, Unisphere Research, October 2010

Infobright Customer Performance Statistics
Fast query response with no tuning or indexes

    Workload                     Alternative                  Infobright
    Analytic Queries             2+ hours with MySQL          <10 seconds
    Mobile Data (15MM events)    43 min with SQL Server       23 seconds
    Oracle Query Set             10 seconds - 15 minutes      0.43 - 22 seconds
    BI Report                    7 hrs in Informix            17 seconds
    Data Load                    11 hours in MySQL ISAM       11 minutes

Save Time, Save Cost
Fastest time to value
• Download in minutes, install in minutes
• No indexes, no partitions, no projections
• No complex hardware to install
Minimal administration
• Self-tuning
• Self-managing
• Eliminate or reduce aggregate table creation
Outstanding performance
• Fast query response against large data volume
• Load speeds over 2TB/hour with DLP
• High data compression, 10:1 to 40:1+
Economical
• Low subscription cost
• Less data storage
• Industry-standard servers

Where does Infobright fit in the database landscape?
• One size DOESN'T fit all.
• Specialized databases deployed
    • Excellent at what they were designed for
    • More open source specialized databases than commercial
    • Cloud / SaaS use for specialty DBMS becomes popular
• Database virtualization
    • Significantly lowered DBA costs
[Diagram labels: Row, Column, Hadoop, NewSQL, NoSQL, Your Warehouse]

The Emerging Database Landscape

Row / NewSQL*
    Basic description: Structured data stored in rows on disk
    Common use cases: Transaction processing, interactive transactional applications
    Positives: Strong for capturing and inputting new records. Robust, proven technology.
    Negatives: Scale issues; less suitable for queries, especially against large databases
    Key players: MySQL, Oracle, SQL Server, Sybase ASE

Columnar
    Basic description: Structured data is vertically striped and stored in columns on disk
    Common use cases: Historical data analysis, data warehousing, business intelligence
    Positives: Fast query support, especially for ad hoc queries on large datasets; compression
    Negatives: Not suited for transactions; import and export speed; heavy computing resource utilization
    Key players: Infobright, Aster Data, Sybase IQ, Vertica, ParAccel

NoSQL - Key Value Store
    Basic description: Data stored usually in memory with some persistent backup
    Common use cases: Used as a cache for storing frequently requested data for a web app
    Positives: Scalability, very fast storage and retrieval of unstructured and partly structured data
    Negatives: Usually all data must fit into memory, no complex query capabilities
    Key players: MemCached, Amazon S3, Redis, Voldemort

NoSQL - Document Store
    Basic description: Persistent storage along with some SQL-like querying functionality
    Common use cases: Web apps or any app which needs better performance without having to define columns in an RDBMS
    Positives: Persistent store with scalability features such as sharding built in, with better query support than key-value stores
    Negatives: Lack of sophisticated query capabilities
    Key players: MongoDB, CouchDB, SimpleDB

NoSQL - Column Store
    Basic description: Very large data storage, MapReduce support
    Common use cases: Real-time data logging such as in finance or web analytics
    Positives: Very high throughput for Big Data, strong partitioning support, random read/write access
    Negatives: Low-level API, inability to perform complex queries, high latency of response for queries
    Key players: HBase, Big Table, Cassandra

Why use Infobright to deal with large volumes of machine-generated data?
EASY
• To install
• To use
AFFORDABLE
• Less HW
• Low SW cost

Technical Overview of Infobright
Joon Kim, Senior Sales Engineer
joon.kim@infobright.com

Key Components of Infobright
• Column-oriented
• Knowledge Grid: statistics and metadata "describing" the super-compressed data
• Data Packs: data stored in manageably sized, highly compressed data packs
• Data compressed using algorithms tailored to data type
Smarter architecture
• Load data and go
• No indices or partitions to build and maintain
• Knowledge Grid automatically updated as data packs are created or updated
• Super-compact data footprint can leverage off-the-shelf hardware

Infobright Architecture

1. Column Orientation
Incoming data:

    EMP_ID   FNAME   LNAME    SALARY
    1        Moe     Howard   10000
    2        Curly   Joe      12000
    3        Larry   Fine     9000

Column-oriented layout:
    (1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000;)
• Works well with aggregate results (sum, count, avg.)
• Only columns that are relevant need to be touched
• Consistent performance with any database design
• Allows for very efficient compression

2. Data Packs and Compression
Data Packs
• Each data pack contains 65,536 (64K) data values
• Compression is applied to each individual data pack
• The compression algorithm varies depending on data type and distribution
Compression
• Results vary depending on the distribution of data among data packs
• A typical overall compression ratio seen in the field is 10:1
• Some customers have seen results of 40:1 and higher
• For example, 1TB of raw data compressed 10 to 1 would only require 100GB of disk capacity
[Diagram: 64K data packs passing through "Patent Pending Compression Algorithms"]

3. The Knowledge Grid
• Knowledge Grid: applies to the whole table
• Knowledge Nodes: built for each Data Pack
• Information about the data
    • Calculated during load: Basic Statistics, Numerical Ranges, Character Maps
    • Calculated during query: Dynamic
[Diagram labels: Column A, Column B, …; DP1 to DP6]

Knowledge Grid Internals
• Data Pack Nodes (DPN): A separate DPN is created for every data pack in the database to store basic statistical information.
• Character Maps (CMAPs): Every Data Pack that contains text gets a matrix that records the occurrence of every possible ASCII character.
• Histograms: A histogram is created for every Data Pack that contains numeric data; each histogram records 1024 MIN-MAX intervals.
• Pack-to-Pack Nodes (PPN): PPNs track relationships between Data Packs when tables are joined. Query performance gets better as the database is used.
This metadata layer is about 1% of the compressed volume.

Optimizer / Granular Computing Engine
1. Query received
2. Engine iterates on Knowledge Grid
3. Each pass eliminates Data Packs
4. If any Data Packs are needed to resolve the query, only those are decompressed
[Diagram: Query -> Knowledge Grid (1%) -> Compressed Data -> Results. Example question: "How are my sales doing this year?"]

How the Optimizer Works

    SELECT count(*) FROM employees
    WHERE salary > 50000
      AND age < 65
      AND job = 'Shipping'
      AND city = 'TORONTO';

1. Find the Data Packs with salary > 50000
2. Find the Data Packs that contain age < 65
3. Find the Data Packs that have job = 'Shipping'
4. Find the Data Packs that have city = 'TORONTO'
5. Now we eliminate all rows that have been flagged as irrelevant.
6. Finally we have identified the data pack that needs to be decompressed.
[Diagram: pack map for the columns salary, age, job and city over rows 1 to 65,536, 65,537 to 131,072, 131,073 to …; labels "All packs ignored" and "Only this pack will be decompressed". Legend: Completely Irrelevant, Suspect, All values match]

Infobright Architected on MySQL
"The world's most popular open source database"

Sample Script (Create Table, Import, Export)

    USE Northwind;
    DROP TABLE IF EXISTS customers;
    CREATE TABLE customers (
        CustomerID varchar(5),
        CompanyName varchar(40),
        ContactName varchar(30),
        ContactTitle varchar(30),
        Address varchar(60),
        City varchar(15),
        Region char(15),
        PostalCode char(10),
        Country char(15),
        Phone char(24),
        Fax varchar(24),
        CreditCard float(17,1),
        FederalTaxes decimal(4,2)
    ) ENGINE=BRIGHTHOUSE;

    -- Import the text file.
    SET AUTOCOMMIT=0;
    SET @bh_dataformat = 'txt_variable';
    LOAD DATA INFILE "/tmp/Input/customers.txt" INTO TABLE customers
        FIELDS TERMINATED BY ';' ENCLOSED BY 'NULL'
        LINES TERMINATED BY '\r\n';
    COMMIT;

    -- Export the data into BINARY format.
    SET @bh_dataformat = 'binary';
    SELECT * INTO OUTFILE "/tmp/output/customers.dat" FROM customers;

    -- Export the data into TEXT format.
    SET @bh_dataformat = 'txt_variable';
    SELECT * INTO OUTFILE "/tmp/output/customers.text"
        FIELDS TERMINATED BY ';' ENCLOSED BY 'NULL'
        LINES TERMINATED BY '\r\n'
        FROM customers;

Infobright 4.0 - Additional Features
Built-in intelligence for machine-generated data: find the "needle in the haystack" faster
DomainExpert: intelligence about machine-generated data drives faster performance
• Enhanced Knowledge Grid with domain intelligence
• Automatically optimizes database, no fine tuning
• Users can directly add domain expertise to drive faster performance
Near-real time, ad-hoc analysis of Big Data
• DLP: Linear scalability of data load for very high performance
• Rough Query: Data mining "drill down" at RAM speed

Work with Data Even Faster
DomainExpert
• Intelligence to automatically optimize the database

DomainExpert: Breakthrough Analytics
• Enables users to add intelligence into the Knowledge Grid directly, with no schema changes
• Pre-defined/optimized for web data analysis
    • IP addresses
    • Email addresses
    • URL/URI
• Can cut query time in half when using this data definition

DomainExpert: Prebuilt plus DIY options
• Pattern recognition enables faster queries
    • Patterns defined and stored
    • Complex fields decomposed into more homogeneous parts
    • Database uses this information when processing a query
• Users can also easily add their own data patterns
    • Identify strings, numerics, or constants
    • Financial trading example: ticker feed "AAPL–350,354,347,349" encoded as "%s-%d,%d,%d,%d"
    • Will enable higher compression

Get Data In Faster: DLP
Near-real time ad-hoc analysis
• Linear scalability of data load for very high performance

Distributed Load Processor (DLP)
• Add-on product to IEE which linearly scales load performance
• Remote servers compress data and build Knowledge Grid elements on the fly…
• …which are appended to the data server running the main Infobright database
• It's all about speed: faster load and queries

Get Data In Faster: Hadoop
Near-real time ad-hoc analysis
• Hadoop connectivity
• Use the right tool for the job

Big Data - Hadoop Support
• DLP Hadoop connector
• Extracts data from HDFS and loads it into Infobright at high speeds
    • You load 100s of TBs or petabytes into Hadoop for bulk storage and batch processing
    • Then load TBs into Infobright for near-real time analytics using the Hadoop connector and DLP
Infobright / Hadoop: perfect complement to analyze Big Data

Rough Query: Speed Up Data Mining by 20x
Near-real time ad-hoc analysis
• Rough Query: Data mining "drill down" at RAM speed

Rough Query - Another Infobright Breakthrough
• Enables very fast iterative queries to quickly drill down into large volumes of data
• "Select roughly" to instantaneously see the interval range for relevant data; uses only the in-memory Knowledge Grid information
• Filtering can narrow results
• Need more detail? Drill down further with a rough query, or query for the exact answer

The Value Infobright Delivers
High performance with much less work and lower cost
Faster queries without extra work
• No indexes
• No projections or cubes
• No data partitioning
• Faster ad-hoc analytics
Fast load / High compression
• Multi-machine Distributed Load Processor (DLP)
• Query while load
• 10:1 to 40:1+ compression
Lower costs
• Less storage and servers
• Low-cost HW
• Low-cost subscriptions
• 90% less administration
Faster time to production
• Download in minutes
• Minimal configuration
• Implement in days

Q&A

Infobright Use Cases

Infobright and Hadoop in Video Advertising: LiveRail
LiveRail's Need
• LiveRail's platform enables publishers, advertisers, ad networks and media groups to manage, target, display and track advertising in online video.
• With a growing number of customers, LiveRail was faced with managing increasingly large data volumes.
• They also needed to provide near real-time access to their customers for reporting and ad hoc analysis.
Infobright's Solution
• LiveRail chose two complementary technologies to manage hundreds of millions of rows of data each day: Apache Hadoop and Infobright.
• Detail is loaded hourly into Hadoop and at the same time summarized and loaded into Infobright.
• Customers access Infobright 7x24 for ad-hoc reporting and analysis and can schedule time if needed to access cookie-level data stored in Hadoop.
"Infobright and Hadoop are complementary technologies that help us manage large amounts of data while meeting diverse customers needs to analyze the performance of video advertising investments." Andrei Dunca, CTO of LiveRail

Example in Mobile Analytics: Bango
Bango's Need
• A leader in mobile billing and analytics services utilizing a SaaS model
• Received a contract with a large media provider
    • 150 million rows per month
    • 450GB per month on SQL Server
• SQL Server could not support required query performance
• Needed a database that could
    • scale for much larger data sets
    • with fast query response
    • with fast implementation
    • and low maintenance
    • in a cost-effective solution
Infobright's Solution
• Reduced queries from minutes to seconds

    Query                           SQL Server   Infobright
    1 Month Report (5MM events)     11 min       10 secs
    1 Month Report (15MM events)    43 min       23 secs
    Complex Filter (10MM events)    29 min       8 secs

• Reduced size of one customer's database from 450 GB to 10 GB for one month of data

Online Analytics: Yahoo!
Customer's Need
• Pricing and Yield Management team responsible for pricing online display ads
• Requires sophisticated analysis of terabytes of ad impression data
• With prior database, could only store 30 days of summary data
• Needed a database that could:
    • Store 6 months+ of detailed data
    • Reduce hardware needed
    • Eliminate database admin work
    • Execute ad-hoc queries much faster
Infobright's Solution
• Loading over 30 million records per day
• Can now store all detailed data, retain 6 billion records
• 6TB of data is compressed to 600GB on disk
• Queries are very fast; Yahoo! can do ad hoc analysis without manual tuning
• Easy to maintain and support
"Using Infobright allows us to do pricing analyses that would not have been possible before. We now have access to all of our detailed Web impression data, and we can keep 6x the amount of data history we could previously." Sr. Director PYM, Yahoo!

Case Study: JDSU
• Annual revenues exceeded $1.3B in 2010
• 4700 employees are based in over 80 locations worldwide
• Communications sector offers instruments, systems, software, services, and integrated solutions that help communications service providers, equipment manufacturers, and major communications users maintain their competitive advantage

JDSU Service Assurance Solutions
• Ensure high quality of experience (QoE) for wireless voice, data, messaging, and billing.
• Used by many of the world's largest network operators

JDSU Project Goals
New version of the Session Trace solution that would:
• Support very fast load speeds to keep up with increasing call volume and the need for near real-time data access
• Reduce the amount of storage by 5x, while also keeping much longer data history
• Reduce overall database licensing costs 3x
• Eliminate customers' "DBA tax," meaning the solution should require zero maintenance or tuning while enabling flexible analysis
• Continue delivering the fast query response needed by Network Operations Center (NOC) personnel when troubleshooting issues, while supporting up to 200 simultaneous users

High Level View
[Diagram]

Session Trace Application
For deployment at Tier 1 network operators, each site will store between 6 and 45TB of data, and the total data volume will range from 700TB to 1PB of data.
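
The ratios implied by the Bango and Yahoo! figures above can be checked with a few lines of arithmetic. The sketch below only restates numbers quoted on those slides; the dictionary names are illustrative, and treating 6TB as 6000 GB (decimal units) is an assumption, since the slides do not say which convention they use.

```python
# Storage figures from the slides, as (before_gb, after_gb).
# Assumption: 1 TB = 1000 GB, so Yahoo!'s 6TB is written as 6000 GB.
storage_gb = {
    "Bango, one month of one customer's data": (450, 10),
    "Yahoo!, detailed ad impression data": (6000, 600),
}

# Bango query timings from the slides, as (sql_server_seconds, infobright_seconds).
query_seconds = {
    "1 Month Report (5MM events)": (11 * 60, 10),
    "1 Month Report (15MM events)": (43 * 60, 23),
    "Complex Filter (10MM events)": (29 * 60, 8),
}


def ratio(before: float, after: float) -> float:
    """How many times smaller (or faster) 'after' is than 'before'."""
    return before / after


compression = {name: ratio(*pair) for name, pair in storage_gb.items()}
speedup = {name: ratio(*pair) for name, pair in query_seconds.items()}

for name, value in compression.items():
    print(f"{name}: {value:.0f}:1 compression")
for name, value in speedup.items():
    print(f"{name}: about {value:.0f}x faster")
```

On these inputs the Yahoo! figure works out to the "typical" 10:1 ratio mentioned in the technical overview, and the Bango figure to 45:1, in line with the "40:1 and higher" claim.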
Infobright Implementation
[Diagram]

Save Time, Save Cost
Fastest time to value
• Download in minutes, install in minutes
• No indexes, no partitions, no projections
• No complex hardware to install
Minimal administration
• Self-tuning
• Self-managing
• Eliminate or reduce aggregate table creation
Outstanding performance
• Fast query response against large data volume
• Load speeds over 2TB/hour with DLP
• High data compression, 10:1 to 40:1+
Economical
• Low subscription cost
• Less data storage
• Industry-standard servers

What Our Customers Say
"Using Infobright allows us to do pricing analyses that would not have been possible before."
"With Infobright, [this customer] has access to data within minutes of transactions occurring, and can run ad-hoc queries with amazing performance."
"Infobright offered the only solution that could handle our current data load and scale to accommodate a projected growth rate of 70 percent, without incurring prohibitive hardware and licensing costs."
"Using Infobright allowed JDSU to meet the aggressive goals we set for our new product release: reducing storage and increasing data history retention by 5x, significantly reducing costs, and meeting the fast data load rate and query performance needed by the world's largest network operators."

Where does Infobright fit in the database landscape?
• One size DOESN'T fit all.
• Specialized databases deployed
    • Excellent at what they were designed for
    • More open source specialized databases than commercial
    • Cloud / SaaS use for specialty DBMS becomes popular
• Database virtualization
    • Significantly lowered DBA costs
[Diagram labels: Row, Column, Hadoop, NewSQL, NoSQL, Your Warehouse]

NoSQL: Unstructured Data Kings
Tame the Unstructured
• Store Anything
• Keep Everything
Characteristics
• Schema-less Designs
• Extreme Transaction Rates
• Massive Horizontal Scaling
• Heavy Data Redundancy
• Niche Players

Top NoSQL Offerings
[Image slide]

NoSQL: Breakout
• Key-Value
• Document Store
• Hybrid
• Column Store
• Graph
120+ variants: find more at nosql-databases.org

What do we see with NoSQL
Strengths
• Application Focused
• Programmatic API
• Capacity
• Lookup Speed
• Streaming data
Weaknesses
• Generally no SQL Interface
• Programmatic Interfaces
• Expensive Infrastructure
• Complex
• Limits with Analytics

Lest We Forget Hadoop
Scalable, fault-tolerant distributed system for data storage and processing
• Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
• MapReduce: fault-tolerant distributed processing
Value Add
• Flexible: store schema-less data and add as needed
• Affordable: low cost per terabyte
• Broadly adopted: Apache Project with a large, active ecosystem
• Proven at scale: petabyte+ implementations in production today

Hadoop Data Extraction
[Diagram]

NewSQL: Operational, Relational Powerhouses
Overclock Relational Performance
• Scale-Out
• Scale "Smart"
Characteristics
• New, Scalable SQL
• Extreme Transaction Rates
• Diverse Technologies
• ACID Compliance
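
As a closing technical note, the pack-elimination idea from the technical overview (Data Pack Nodes holding basic statistics, packs classified as Completely Irrelevant, Suspect, or All values match, and Rough Query answering from the in-memory Knowledge Grid alone) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Infobright's implementation: `PackNode`, `build_nodes`, `classify`, `rough_count` and `exact_count` are hypothetical names, only min/max/count statistics are modelled, and a pack size of 4 is used in the demo in place of the deck's 65,536 values per pack.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 65_536  # values per data pack, as stated in the deck


@dataclass
class PackNode:
    """Hypothetical stand-in for a Data Pack Node: statistics only, no data."""
    lo: int
    hi: int
    count: int


def build_nodes(values: List[int], pack_size: int = PACK_SIZE) -> List[PackNode]:
    """Split a column into packs and record min, max and count per pack."""
    nodes = []
    for start in range(0, len(values), pack_size):
        pack = values[start:start + pack_size]
        nodes.append(PackNode(min(pack), max(pack), len(pack)))
    return nodes


def classify(node: PackNode, threshold: int) -> str:
    """Classify one pack for the predicate `value > threshold`."""
    if node.hi <= threshold:
        return "irrelevant"  # no value in the pack can match
    if node.lo > threshold:
        return "all_match"   # every value in the pack matches
    return "suspect"         # the pack must be opened to decide


def rough_count(nodes: List[PackNode], threshold: int) -> Tuple[int, int]:
    """Lower and upper bounds on COUNT(*) using pack statistics only."""
    kinds = [classify(n, threshold) for n in nodes]
    lower = sum(n.count for n, k in zip(nodes, kinds) if k == "all_match")
    upper = lower + sum(n.count for n, k in zip(nodes, kinds) if k == "suspect")
    return lower, upper


def exact_count(values: List[int], nodes: List[PackNode], threshold: int,
                pack_size: int = PACK_SIZE) -> Tuple[int, int]:
    """Exact COUNT(*); returns (count, number of packs actually opened)."""
    total, opened = 0, 0
    for i, node in enumerate(nodes):
        kind = classify(node, threshold)
        if kind == "all_match":
            total += node.count          # answered from statistics
        elif kind == "suspect":
            opened += 1                  # only suspect packs are read
            pack = values[i * pack_size:(i + 1) * pack_size]
            total += sum(1 for v in pack if v > threshold)
    return total, opened


# Demo with a tiny pack size so the three pack classes are easy to see.
salaries = [10000, 12000, 9000, 11000,
            48000, 52000, 50000, 61000,
            70000, 82000, 75000, 90000]
nodes = build_nodes(salaries, pack_size=4)
print([classify(n, 50000) for n in nodes])
print(rough_count(nodes, 50000))
print(exact_count(salaries, nodes, 50000, pack_size=4))
```

In this toy run the rough answer is an interval obtained without touching any data, and the exact answer opens only the single suspect pack, which is the behaviour the optimizer slides describe.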
