Big Data, NoSQL . . . So What?
Iran Hutchinson

Me
• I work for InterSystems, which:
  – Drives the http://globalsdb.org NoSQL project
  – Has 20+ years of NoSQL production deployments
  – Has 20+ years of Big Data production deployments
  – Built a ~250 million Euro business on the above
• Email: iran.hutchinson@intersystems.com
• Twitter: #iranic

Big Data is …
• Important data in varying formats and volumes that is being generated across all areas affecting your business and that is generally not centrally correlated or managed.
• Examples include:
  – Word files, PowerPoint, PDFs
  – Emails, instant messaging, texts
  – Blogs and social media
  – Automated data from machine activities
  – Stream data from financial stock markets

Some Big Data Numbers
• Source: McKinsey Global Institute
• 5 Billion mobile phones used in 2010
• 30 Billion pieces of info shared on Facebook each month
• 40% projected growth in global data generated
• 235 Terabytes collected by US Library of Congress as of April 2011
  – 15 out of 17 sectors in US have more data stored per company than this.

Some Big Data Numbers …
• Source: McKinsey Global Institute
• $300 Billion in potential value in US Healthcare system
• €250 Billion in Europe’s public sector administration
• $600 Billion in annual consumer surplus using location data
• 60% Potential increase in retail operating margins
• 140,000 – 190,000 analytical talent positions in US
• 1.5 Million data-savvy managers needed in US

Case Study: Credit Suisse
• Key Challenges:
  – Revamp order routing architecture
  – Revamp order management architecture
  – Serve current demand and scale to new levels
  – Address downtime challenges

Case Study: Credit Suisse …
• Big Data in the form of volumes of transactions
• Leveraged Caché’s:
  – In-memory architecture for performance
  – On-disk resiliency for availability
  – Distributed architecture for data coherency
• Can easily process 1,000,000,000 transactions
  – During business hours

Case Study: European Space Agency (ESA)
• Key Challenges
  – Make the largest, most precise 3-D map of our Galaxy
  – Monitor 1,000,000,000 stars over 5 years, precisely charting position, movement, and brightness
  – Along the way discover hundreds of thousands of new celestial objects

Case Study: ESA Continued …
• Challenge Calculation:
  – Capture data for 1 Billion Celestial Objects
  – 1,000,000,000 objects
    × 100 observations per object
    × 600 bytes per observation
    = 60,000,000,000,000 bytes (60TB)
• Solution: Caché/XEP, delivering 100,000+ sustained inserts per second per server, stored as real objects with SQL access
• http://www.intersystems.com/cache/whitepapers/pdf/Charting_the_Galaxy.pdf

Enabling Technology
• Focus on Caché
• A quick look at the architecture

Enabling Technology …
• Java + C database kernel run in same process

Enabling Technology …
• ECP, Distributed Computing

Enabling Technology …
• Multiple, simultaneous data to disk writers

Who is this Guy?
• Edgar Frank “Ted” Codd
• Known for his 12 Rules (numbered 0 to 12) for Relational Data Systems

NoSQL … Breaking the Rules
• Rule 1: The Information Rule
  – All information is represented in one and only one way, namely by values in column positions within rows of tables
• Rule 12: The Nonsubversion Rule
  – If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, i.e. to bypass relational security or integrity constraints.

Why NoSQL?
• No to ACID transactions
• No to the impedance mismatch with SQL
• Dealing with Big Data and Web Scale
• High prices from RDBMS vendors
• Use commodity hardware
• Flexible data models
• It’s a cool movement …

Is NoSQL a new Concept?
• No
• Remember MUMPS?
  – SET ^Car("Door","Color")="BLUE"
• Remember Multi-value/PICK?
  – MATWRITE array.variable ON file.variable,id. ….
• Ever heard of the NoSQL RDB?
  – Carlo Strozzi
  – http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page

CAP Theorem
• Consistency
  – A service that is consistent operates fully or not at all.
• Availability
  – The service is available to operate (fully or not at all, as above).
• Partition Tolerance
  – Managing data across multiple nodes. A single node is a single partition, so it either works or does not when it comes to processing data.
• Significant because you can get only 2 of these …

CAP Theorem …
• Arguments and links
  – http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
  – http://ksat.me/a-plain-english-introduction-to-cap-theorem/
  – http://voltdb.com/company/blog/clarifications-cap-theorem-and-data-related-errors

CAP Theorem …: Consistency
Diagram: DB1, DB2, DB3, DB4, DB5, DB6, DB7

CAP Theorem …: Consistency
Diagram: Hub with Spoke DB1, Spoke DB2, Spoke DB3, Spoke DB4

CAP Theorem …: Consistency
Diagram: DB1, DB2, DB3

Distributed computing
• Fallacies (Peter Deutsch)
  – The network is reliable
  – Latency is zero
  – Bandwidth is infinite
  – The network is secure
  – Topology doesn’t change
  – There is one administrator
  – Transport cost is zero
  – The network is homogeneous
• Remember JINI? (See Apache River project)

NoSQL: Which Model to Use?
Diagram: Data with Key-Value, Graph, Column, Document

NoSQL: Which project?
• http://nosql-database.org/ lists 122 today.
• Depends on your model selection.
• Most likely you will choose a well-known project.
• Don’t forget about shared risk!

NoSQL: Querying
• Some solutions have no querying
• When available, query languages differ
• Lack of general ad hoc querying
  – “no” SQL
• Have you heard of UnQL?
  – http://www.unqlspec.org/display/UnQL/Home
• NOTE: Toad for Cloud

NoSQL: How to Succeed?
• Know your application
• Don’t forget the past lessons
• Consider a hybrid approach
• Fight the desire to Roll-Your-Own-DB
• Start small but significant

NoSQL: Hybrid Approach 1
• Two Systems
  – NoSQL System
  – SQL/RDBMS
Diagram: NoSQL, Data Mapper / Translator, SQL/RDBMS

NoSQL: Hybrid Approach 2
• One system does both NoSQL and SQL
Diagram: Relational, ?,
Key-Value, Data, Graph, Document, Column

GlobalsDB.org Project
• Name comes from the underlying data structure
  – Multi-dimensional array
  – Basis for commercial Caché data system
• Free for development and production deployment
• NoSQL DB with Java and Node.js APIs
• Code base is same as commercial product
• APIs are open sourced or being open sourced
• Database kernel is not open source

A “Global” Definition
• A Global is a persistent sparse multi-dimensional array, which consists of one or more storage elements or "nodes". Each node is identified by a node reference (which is, essentially, its logical address)
  – simple == "some data"
  – complex["subscript-1", "subscript-2"] == "some data"
• Example
  – product[item,type,os,processor] == quantity
  – product["computer","laptop","Mac","i7"] == 3

GlobalsDB Architecture
• Current Architecture

GlobalsDB, NoSQL, Big Data
• http://nosql.mypopescu.com/
• http://highscalability.com/
• http://nosqltapes.com/
• http://globalsdb.wordpress.com
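
Sketch: a “Global” as a sparse multi-dimensional array
• To make the “Global” definition concrete, here is a minimal Python sketch of a sparse multi-dimensional array whose nodes are addressed by a list of subscripts. It is only an illustration under stated assumptions: it lives in memory (a real Global is persistent), and the class name SparseGlobal, its methods, and the "desktop" entry are hypothetical. It is not the GlobalsDB Java or Node.js API.

```python
class SparseGlobal:
    """In-memory stand-in for a global (hypothetical, not the GlobalsDB API).

    Each node is addressed by a tuple of subscripts, and only nodes that
    were explicitly set take up storage, which is what makes it sparse.
    """

    def __init__(self, name):
        self.name = name
        self._nodes = {}  # maps a tuple of subscripts to the stored value

    def set(self, value, *subscripts):
        """Store value at the node identified by the given subscripts."""
        self._nodes[subscripts] = value

    def get(self, *subscripts, default=None):
        """Return the value at a node, or default if the node was never set."""
        return self._nodes.get(subscripts, default)

    def children(self, *prefix):
        """Return the sorted, distinct subscripts one level below prefix."""
        depth = len(prefix)
        return sorted({key[depth] for key in self._nodes
                       if len(key) > depth and key[:depth] == prefix})


# simple == "some data": a node with no subscripts at all
simple = SparseGlobal("simple")
simple.set("some data")

# product[item,type,os,processor] == quantity
product = SparseGlobal("product")
product.set(3, "computer", "laptop", "Mac", "i7")
product.set(5, "computer", "desktop", "PC", "i5")  # hypothetical extra entry

print(product.get("computer", "laptop", "Mac", "i7"))
print(product.children("computer"))
```

• Using tuples of subscripts as dictionary keys keeps the sketch short; intermediate levels such as product["computer"] hold no value of their own unless one is set, which mirrors the "sparse" part of the definition.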