“Big Data” - Technical Architecture
Roni Schuling - Enterprise Architecture
Tom Scroggins - IS Domain Architecture
Principal Financial Group

AGENDA
• Foundational Definitions & where these technologies came from
  o Big Data
  o NoSQL
  o Hadoop
• Business & Technical Drivers
• How they are being used in many companies
• Predictions for the future
• Challenges & Obstacles
• Questions

Foundational Definition - Big Data
• Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. Big data can be characterized by 3Vs: the extreme volume of data, the wide variety of types of data and the velocity at which the data must be processed. There are many other aspects as well, such as viscosity, complexity and ambiguity.
• Data in a corporation that cannot be processed using traditional data management techniques and technologies can be broadly classified as Big Data.

Big Data ≠ Hadoop
Big Data ≠ NoSQL
Hadoop ≠ NoSQL
Hadoop & NoSQL are key technologies for working with Big Data effectively.

Foundational Definition - NoSQL
• A NoSQL database, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data.
• NoSQL seeks to solve the scalability and big data performance issues that relational databases weren’t designed to address.
• NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data or data that's stored remotely on multiple virtual servers in the cloud.
• However, NoSQL is not just about Big Data.

Where this technology came from - NoSQL
[Timeline figure: 1970, 1980, 1990, 2000, 2005, 2007, 2010, 2014+]
• Document DB: inspired by Lotus Notes
• Key Value Store: replicate data during 24x7 availability
• Need to store tabular data in a distributed system
• Many innovators in the 2005 to 2010 timeframe
• Polyglot Persistence: the enterprise will have a variety of different data storage technologies for different kinds of data & application needs

Market view of what’s out there – we do NOT have all of these at PFG today. There are over 150 NoSQL databases in the market – these are just a few of the top ones.

“Big Data” - Data Architecture at PFG
Foundational Definition - Hadoop
• Hadoop is an open source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
• It is part of the Apache project sponsored by the Apache Software Foundation.
• Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
• Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.

Where this technology came from - Hadoop
[Timeline figure: 1995, 2004, 2005]
• 1995 – 2005: Yahoo! Search team builds 4+ generations of systems to crawl & index the WWW. 20 billion pages!
• Google publishes Google File System & MapReduce papers
• Yahoo! staffs ‘Juggernaut’, open source DFS & MapReduce
• Doug Cutting builds Nutch DFS & MapReduce, joins Yahoo!
[Timeline figure, continued: 2006, 2010, 2014+]
• Juggernaut & Nutch join forces – Hadoop is born!
• Other Internet companies add tools / frameworks to enhance Hadoop
• Service providers step into the market – provide training, support, & hosting

The Hadoop Vendor Landscape - 2014

Business Drivers
• Provide access to all data needed for analytics (internal or external)
• Provide the ability to realistically interact with greater ‘depths’ of data, i.e. tens of years instead of a couple of months
• Provide a greater “speed to insight” for all types of requests
• Lower the total cost of ownership across the enterprise for analytics
• Allow for exploration of our data in ways we never anticipated, to identify a differentiating understanding of customers and markets
There’s an imbalance today...

Technical Drivers
Current technical capabilities don’t align with changing expectations

How they are being used today
NoSQL (not focused on Big Data... yet)
• Many companies are using, or at least experimenting with, the MongoDB document store for web applications that only need to persist content for the lifespan of that interaction.
• Using NoSQL stores for user preferences to personalize what is presented on a web page for their interaction.
• Beginning to organize social streams of data
Hadoop
• Interrogating our web logs to better understand the behavior of people interacting with a website.
• Merging that semi-structured web activity with other structured legacy data.
• Massive storage of data for exploration and discovery – often using interoperability with analytic consumption tools.

Plans for the future
NoSQL
• Database for web applications that need that speed of development and nimbleness.
• Layering of NoSQL solutions on top of Hadoop to improve searchability and performance.
• Exploration of Graph NoSQL solutions for analytics on hierarchical-type data.
Hadoop
• Expansion of web activity data (more logs, more data in logs, more use cases).
• Speech-to-text translation of call recordings and text analysis / natural language processing to determine call topics and caller sentiment.
• Extraction of text from documents to aid in analysis.
• ‘Data Lake’ solutioning – both for ingestion and archive.

Lake of Data
Data Refinery

Many Kinds of data in our organization
Conceptually for illustration – not a vetted/approved picture of the PFG environment

Conceptual Workload Isolation Today…
Conceptually for illustration – not a vetted/approved picture of the PFG environment

Conceptual Workload Isolation in the Future…
Conceptually for illustration – not a vetted/approved picture of the PFG environment

Big Data technologies are broader than just Hadoop & NoSQL – but those are the key starting points for us.
Market view of what’s out there – we do NOT have all of these at PFG today.

Challenges and Obstacles to overcome
• Security
• Governance
• Clear Use Cases
• Integration Points
• Hosting models

Q&A
Kapur.Gurwinder@principal.com

NoSQL Data Architecture & Best Practices
Data View - Overview
We are in a Database Revolution
• Existing paradigms are being challenged
  o Models
  o Hardware
  o Software
  o Languages
• Will tweaking current data solutions be enough?
Data View – Five Data Paradigms

Relational Model
PROs
• Most flexible queries & updates
• Reuse data structures in any context
• Great DB-to-DB integration
• Mature tools
• Standard query language
• Easy to hire expertise
CONs
• Design-time, static relationships
• Design-time, static structures: design first then load data
• Hard to normalize model
• Requires code to integrate relational data with object-oriented code
• Cannot query for relevance

Dimensional Model
PROs
• Queries facts in context
• Self-service, ad hoc queries
• High-performance platforms
• Mature tools and integration
• Standard query language
• Turns data into information
CONs
• Expensive platforms
• Design-time, static relationships
• Design-time, static structures: design first then load data
• Cannot query for relevance
• Cannot query for answers that are not built into the model

What’s wrong (aka challenging) with SQL DBs?
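The relational trade-off listed above (a standard query language and flexible joins, against "design first then load data") can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `customer` and `account` tables, their columns, and the sample rows are hypothetical and chosen only for illustration; they are not taken from the presentation.

```python
import sqlite3


def build_customer_db() -> sqlite3.Connection:
    """Create a tiny in-memory database (hypothetical schema, illustration only)."""
    conn = sqlite3.connect(":memory:")
    # Design first: the structure is fixed before any data is loaded.
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE account ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(id),"
        " balance REAL NOT NULL)"
    )
    # Then load data that fits the design.
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")]
    )
    conn.executemany(
        "INSERT INTO account VALUES (?, ?, ?)",
        [(10, 1, 250.0), (11, 1, 50.0), (12, 2, 75.0)],
    )
    return conn


def balances_by_customer(conn: sqlite3.Connection) -> list:
    """PRO: one standard SQL statement joins and aggregates across tables."""
    return conn.execute(
        "SELECT c.name, SUM(a.balance)"
        " FROM customer c JOIN account a ON a.customer_id = c.id"
        " GROUP BY c.name ORDER BY c.name"
    ).fetchall()


def try_unplanned_column(conn: sqlite3.Connection) -> bool:
    """CON: data that does not match the design-time structure is rejected."""
    try:
        conn.execute(
            "INSERT INTO customer (id, name, twitter_handle)"
            " VALUES (3, 'Cy', '@cy')"
        )
        return True
    except sqlite3.OperationalError:
        # The table has no such column; the schema must be altered first.
        return False
```

Calling `balances_by_customer(build_customer_db())` returns one total per customer, while `try_unplanned_column` returns `False` until the schema is changed. That schema change is the design-time cost the CONs list describes.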
Key Value / Column Family Models
PROs
• Fast puts and gets
• Massive scalability
• Easy to shard & replicate
• Data colocation
• Simple to model
• Inexpensive
• Data in transactional context
• Developer in control
CONs
• Carefully design key
• Shred JSON into flat columns
• Secondary indexes required to query outside of hierarchical key
• No standard query API or language
• Hand code all joins in app
• Immature tools and platform
• Hard to integrate and hire

Document Model
PROs
• Fast development
• “Schemaless”, run-time designed, rich, JSON and/or XML data structures
• Queries everything in context
• Self-service, ad hoc queries
• Turns data into information
• Can query for relevance
CONs
• Defensive programming for unexpected data structures
• Expensive platforms, immature tools, and hard to integrate
• Non-standard query languages, and hard to hire expertise
• Not as fast as Column-Family / Key-Value databases

Graph Model
PROs
• Unlimited flexibility – model any structure
• Run-time definition of types & relationships
• Relate anything to anything in any way
• Query relationship patterns
• Standard query language (SPARQL)
• Creates maximum context around data
CONs
• Hard to model at such a low level
• Hard to integrate with other systems
• Immature tools
• Hard to hire expertise
• Cannot query for relevance because original document context is not preserved

What’s wrong (aka challenging) with NoSQL DBs?
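Two of the key-value CONs above, "carefully design key" and "secondary indexes required to query outside of hierarchical key", can be sketched with an in-memory stand-in for a key-value store. The class `TinyKeyValueStore`, the `customer:<id>:order:<id>` key layout, and the `city` field are all hypothetical and do not represent any particular product's API.

```python
from collections import defaultdict


class TinyKeyValueStore:
    """In-memory stand-in for a key-value store (hypothetical, illustration only)."""

    def __init__(self):
        self._data = {}                    # primary storage: key -> value
        self._by_city = defaultdict(set)   # secondary index kept by hand

    @staticmethod
    def make_key(customer_id, order_id):
        # Hierarchical key: all of a customer's orders share a prefix.
        return f"customer:{customer_id}:order:{order_id}"

    def put(self, customer_id, order_id, value):
        key = self.make_key(customer_id, order_id)
        old = self._data.get(key)
        if old is not None:
            # The application must keep the secondary index in sync itself.
            self._by_city[old["city"]].discard(key)
        self._data[key] = value
        self._by_city[value["city"]].add(key)
        return key

    def get(self, customer_id, order_id):
        # Fast path: a single lookup by the full key.
        return self._data.get(self.make_key(customer_id, order_id))

    def orders_for_customer(self, customer_id):
        # Works only because the key was designed with this query in mind.
        prefix = f"customer:{customer_id}:order:"
        return sorted(k for k in self._data if k.startswith(prefix))

    def keys_in_city(self, city):
        # A query outside the key hierarchy needs the secondary index.
        return sorted(self._by_city[city])
```

The point of the sketch is the division of labor: puts and gets by key are trivial, but any question the key was not designed for (here, "which orders shipped to a given city?") requires extra structures that the application itself has to maintain.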
Modeling Takeaways
Each model has a specialized purpose
• Dimensional: business intelligence reporting and analytics
• Relational: flexible queries, joins, updates, mature, standard
• Column / Key-Value: simple, fast puts and gets, massively scalable
• Document: JSON/XML, fast development, “schemaless”, searchable
• Graph / RDF: modeling anything at runtime, including relationships

Data View – How do you choose?
How much Durability do you need?
Durable data survives system failures & can be recovered after unwanted deletion.
How much Atomicity do you need?
An atomic transaction is all or nothing, whether it covers sets of data and/or sets of commands.
How much Isolation do you need?
Isolation prevents concurrent transactions from affecting each other.
How much Consistency do you need (or when do you need it)?
Consistency exists when data is committed and consistent with all data rules at a point in time.

Durability
• Can you live with writing advanced code to compensate?
  o Trusting all developers to properly check for partial transaction failures and the current physical layout of the data cluster, and to write code to propagate data across the cluster.
• Can you live with lost data?
  o No logs, archives, mirroring, etc.
• Can you live with accidental deletion of data?
  o No point-in-time recovery feature
• Can you live with scripting your own backup & recovery solutions?

Atomicity
• Can you live with modifying single documents at a time?
• Can you live with partially successful transactions?
  o You can achieve higher availability because transactions can partially succeed.
• Can you live with inconsistent and incomplete data?
  o Is it OK to not know whether data anomalies are caused by bugs in your code or are temporarily inconsistent because they haven’t been synchronized yet?
• Can you live with writing advanced code to compensate?
  o Custom solutions for atomic rollback, handling of transactions that fail, and finding & fixing inconsistent data.

Isolation
• Can you live with modifying single documents at a time?
• Can you live with inaccurate queries?
  o Without isolation, query results are inaccurate because concurrent transactions can change data while the query is processing it.
• Can you live with race conditions and deadlocks?
• Can you live with writing advanced code to compensate?
  o Your own versioning system, code to hide concurrent updates, inserts and deletes from queries, and handling of race conditions and deadlocks.

Consistency
Do you need complete consistency? Not necessarily – instead, you may prefer:
• Absolute fastest performance at lowest hardware cost
• Highest global data availability at lowest hardware cost
• Working with one document at a time
• Writing advanced code to create your own consistency model
• Eventually consistent data
• Some inconsistent data that can’t be reconciled
• Some missing data that can’t be recovered
• Some inconsistent query results

What do you need most?
• Highest performance for queries and transactions
• Highest data availability across multiple data centers
• Less data loss (i.e. Durability)
• More query accuracy & fewer deadlocks (i.e. Isolation)
• More data integrity (i.e. Atomicity)
• Less code to compensate for lack of ACID compliance

Key Points
RDBMSs will always have an important place in our architecture.
NoSQL implementations have a benefit to our future.
Once you have a list of NoSQL databases that meet your modeling needs, choose the one that best meets your need for velocity and volume. It is not a one-or-the-other ‘all in’ choice to make.
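
The Isolation slide asks whether you can live with "writing advanced code to compensate", including "your own versioning system". As a rough sketch of what such compensating code might look like, the following uses optimistic version checks on an in-memory store. The names `VersionedStore`, `VersionConflict`, and `add_to_balance` are hypothetical, and a real distributed store would need considerably more than this.

```python
import threading


class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class VersionedStore:
    """In-memory stand-in for a store without built-in isolation (hypothetical)."""

    def __init__(self):
        self._docs = {}                 # key -> (version, value)
        self._lock = threading.Lock()   # protects this single process only

    def read(self, key):
        # Returns (version, value); a missing key reads as version 0.
        with self._lock:
            return self._docs.get(key, (0, None))

    def write(self, key, expected_version, value):
        # Accept the write only if nobody else has written since our read.
        with self._lock:
            current_version, _ = self._docs.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key!r}: expected v{expected_version}, found v{current_version}"
                )
            self._docs[key] = (current_version + 1, value)
            return current_version + 1


def add_to_balance(store, key, amount, max_retries=5):
    """Read-modify-write with retry: the compensation the application must code."""
    for _ in range(max_retries):
        version, balance = store.read(key)
        try:
            store.write(key, version, (balance or 0) + amount)
            return
        except VersionConflict:
            continue  # someone else wrote first; re-read and try again
    raise VersionConflict(f"gave up on {key!r} after {max_retries} attempts")
```

Every read-modify-write path in the application would need to follow the same pattern, which is the cost behind "less code to compensate for lack of ACID compliance" in the list of needs.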