Big Data Analysis and Business Knowledge
Lesson 2: Fundamentals of Big Data Analytics
Dr. Le, Hai Ha

Content
• Review
• Concepts and Terminology
• Different Categories of Data
• Distributed Computing
• Functional Programming

0. Review

Big Data
Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. -- Wikipedia
• Big Data is a field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from disparate sources.
• Big Data solutions and practices are typically required when traditional data analysis, processing and storage technologies and techniques are insufficient.
• Big Data addresses distinct requirements, such as:
  • combining multiple unrelated datasets
  • processing large amounts of unstructured data
  • harvesting hidden information in a time-sensitive manner

Scaling Out vs. Scaling Up
(Figure: a cluster of computers, each node pairing compute with storage)

Storage – HDFS Example
(Figure)

Analysis, Processing – MapReduce
(Figure)

Typical Architecture
(Figure)

1. Concepts and Terminology

Datasets
• Collections or groups of related data are generally referred to as datasets.
• Each group or dataset member (datum) shares the same set of attributes or properties as the others in the same dataset.

Data Analysis
• Data analysis is the process of examining data to find facts, relationships, patterns, insights and/or trends.
• Carrying out data analysis helps establish patterns and relationships among the data being analyzed.

Data Analytics
• Data analytics is a discipline that includes the management of the complete data lifecycle, which encompasses collecting, cleansing, organizing, storing, analyzing and governing data.
• Data analytics is a broader term that encompasses data analysis.
• In Big Data environments, data analytics has developed methods that allow data analysis to occur through the use of highly scalable distributed technologies and frameworks that are capable of analyzing large volumes of data from different sources.

Categories of Analytics
• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics
The reality is that generating higher-value analytic results increases the complexity and cost of the analytic environment.

Descriptive Analytics
• Descriptive analytics are carried out to answer questions about events that have already occurred.
• Sample questions include:
  • What was the sales volume over the past 12 months?
  • What is the number of support calls received, categorized by severity and geographic location?
  • What is the monthly commission earned by each sales agent?

Diagnostic Analytics
• Diagnostic analytics aim to determine the cause of a phenomenon that occurred in the past, using questions that focus on the reason behind the event.
• Sample questions include:
  • Why were Q2 sales less than Q1 sales?
  • Why have there been more support calls originating from the Eastern region than from the Western region?
  • Why was there an increase in patient re-admission rates over the past three months?
Predictive Analytics
• Predictive analytics are carried out in an attempt to determine the outcome of an event that might occur in the future.
• Questions are usually formulated using a what-if rationale, such as the following:
  • What are the chances that a customer will default on a loan if they have missed a monthly payment?
  • What will be the patient survival rate if Drug B is administered instead of Drug A?
  • If a customer has purchased Products A and B, what are the chances that they will also purchase Product C?

Prescriptive Analytics
• Prescriptive analytics build upon the results of predictive analytics by prescribing actions that should be taken.
• Sample questions include:
  • Among three drugs, which one provides the best results?
  • When is the best time to trade a particular stock?
• Prescriptive analytics involve the use of business rules and large amounts of internal and external data to simulate outcomes and prescribe the best course of action.

Business Intelligence (BI)
• Business intelligence (BI) enables an organization to gain insight into the performance of an enterprise by analyzing data generated by its business processes and information systems.
• BI applies analytics to large amounts of data across the enterprise, which has typically been consolidated into an enterprise data warehouse in order to run analytical queries.

Key Performance Indicators (KPI)
• A key performance indicator (KPI) is a metric that can be used to gauge success within a particular business context.
• KPIs are linked with an enterprise's overall strategic goals and objectives.
• They are often used to identify business performance problems and demonstrate regulatory compliance.

2. Different Categories of Data

Different Categories of Data
• The data processed by Big Data solutions can be human-generated or machine-generated.
• The primary categories of data are:
  • structured data
  • unstructured data
  • semi-structured data

Structured Data
• Structured data conforms to a data model or schema and is often stored in tabular form.
• It is used to capture relationships between different entities and is therefore most often stored in a relational database.
• Structured data is frequently generated by enterprise applications and information systems such as ERP and CRM systems.

Types of Data
• Categorical (nominal) data
  • sorted into categories according to specified characteristics, e.g. gender: male, female
• Ordinal data
  • ordered or ranked according to some relationship to one another, e.g. rating a service as poor, average, good, very good, or excellent
• Interval data
  • ordinal, but with constant differences between observations and an arbitrary zero point, e.g. time and temperature
• Ratio data
  • continuous, with a natural zero, e.g. weight, revenue

Example
• Categorical data (= labels): nominal, ordinal [ordered], binary
• Quantitative data (= numbers): discrete [integer], continuous [real]
• TABLE ROWS = instances, examples, data points, observations, samples
• TABLE COLUMNS = attributes, features, variables

Unstructured Data
• Data that does not conform to a data model or data schema is known as unstructured data.
• It is estimated that unstructured data makes up 80% of the data within any given enterprise.
• Unstructured data has a faster growth rate than structured data.

Semi-structured Data
• Semi-structured data has a defined level of structure and consistency, but is not relational in nature.
• Instead, semi-structured data is hierarchical or graph-based.
• This kind of data is commonly stored in files that contain text.
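To make the distinction concrete, the short sketch below represents the same kind of records once as a structured table and once as semi-structured JSON; the column and field names are illustrative, not from any real system.

```python
import json
import pandas as pd

# Structured: rows and columns that conform to a fixed schema (hypothetical columns).
orders = pd.DataFrame(
    {"order_id": [1, 2], "customer": ["Anna", "Binh"], "amount": [120.5, 99.0]}
)
print(orders.dtypes)  # every column has a single, well-defined type

# Semi-structured: hierarchical JSON, where records may carry different fields.
raw = ('[{"order_id": 1, "customer": {"name": "Anna", "vip": true}},'
       ' {"order_id": 2, "items": ["book", "pen"]}]')
for record in json.loads(raw):
    print(record.keys())  # the set of keys varies from record to record
```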
Metadata
• Metadata provides information about a dataset's characteristics and structure.
• This type of data is mostly machine-generated and can be appended to data.
• Examples of metadata include:
  • XML tags providing the author and creation date of a document
  • attributes providing the file size and resolution of a digital photograph

3. Distributed Computing

Distributed Computing for Big Data
• Parallelism comes from computing clusters: large collections of commodity hardware, including conventional processors ("compute nodes") connected by Ethernet cables or inexpensive switches, rather than a single supercomputer.
• In these computing paradigms, we have a distributed file system (DFS):
  • it features much larger units than the disk blocks in a conventional operating system
  • a DFS also provides replication of data, or redundancy, to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes
• MapReduce is a programming style built on top of a DFS.
• There are many high-level languages, such as Hive, with a MapReduce foundation.
• Apache Spark is an extension of the MapReduce framework.

A Compute Cluster: Physical Setup
• Compute nodes are stored on racks, perhaps 8–64 to a rack.
• The nodes on a single rack are connected by a network, typically gigabit Ethernet.
• There can be many racks of compute nodes, and racks are connected by another level of network or a switch.
• The bandwidth of inter-rack communication is somewhat greater than that of the intra-rack Ethernet.

Some Challenges with Distributed Computing Systems
• Communication costs: as the cluster grows, you need more bandwidth between nodes across racks to compute effectively.
• Administration and maintenance: it takes effort to keep the software running across nodes synchronized, and inspecting and visualizing what is happening on each node is not easy.
• Partial failures: since there are many nodes, node failures are inevitable.
• Programming in this environment brings its own challenges.

Overview of the Main Solutions
How do DFS and MapReduce deal with the challenge of constant failures?
• DFS solution: files must be stored redundantly. If we did not duplicate a file at several compute nodes, then when one node failed, all its files would be unavailable until the node is replaced.
• MapReduce (or any programming system working with compute clusters): computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks.

DFS Implementations
HDFS is not the only DFS out there; there are others.
• Some DFS are open source (e.g., HDFS) while others are proprietary.
• They are also implemented in different programming languages; for example, HDFS is implemented in Java.
• There are also cloud-based/remote DFS such as:
  • AWS S3
  • Google Cloud Storage
  • Microsoft Azure
  • IBM Cloud Object Storage
• See Wikipedia for comparisons of distributed file systems.

MapReduce
A programming style which works well with data in a DFS. Hadoop MapReduce is just one implementation of this style; Google, for example, has its own implementation, called MapReduce.
1. Map tasks are each given one or more chunks of input data from the DFS. A Map task turns its chunk into a sequence of key-value pairs. How key-value pairs are produced from the input data (e.g., what the value should be) is determined by code written by the user.
2. The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time and combine all the values associated with that key. The manner of combining values is determined by code written by the user; for instance, the values for a single key can be combined by adding them.
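As a concrete illustration of this style, here is a minimal single-machine sketch of the classic word-count job expressed as map, shuffle (group by key) and reduce steps. It only mimics the logic on one process, not the distributed execution, and the input documents are made up.

```python
from collections import defaultdict
from functools import reduce

documents = ["big data is big", "data about data"]  # hypothetical input chunks

# Map: turn each element (a document) into (word, 1) key-value pairs.
def map_task(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group all values by key, as the master controller would.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_task(doc):
        groups[key].append(value)

# Reduce: combine the values for each key, here by adding them.
counts = {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```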
MapReduce
(Figure)

Map Tasks in MapReduce
• Input into a Map task
  • Input files for a Map task can be seen as consisting of elements, which can be of any type: a tuple or a document, for example. A chunk is a collection of elements, and no element is stored across two chunks.
  • For example, in the word-count example there can be multiple documents to count, such that the input file is a repository of documents and each document is an element.
• The Map function takes an input element as its argument and produces zero or more key-value pairs.
• Key-value pairs
  • Keys do not have to be unique. Rather, a Map task can produce several key-value pairs with the same key, even from the same element.

4. Functional Programming

Contents
1. Big Data problem solving
2. Big Data package stack in Python
3. Functional programming basics
   a) What is functional programming?
   b) Advantages of functional programming
   c) Functional programming and Big Data processing
   d) Lambda functions
   e) Higher-order functions
   f) Functional programming in Python
   g) Data structures for functional programming in Python
4. Further reading on functional programming
5. Functional programming tutorial in Python

Advice on Tackling a Big Data Problem
Some questions to ask yourself before you jump to the big guns:
1. Can I optimize pandas to solve the problem? If you are using pandas for data munging, you can optimize pandas to load large datasets, depending on the nature of your problem.
2. How about drawing a sample from the large dataset? Depending on your use case, drawing a sample out of a large dataset may or may not work. Just be careful that you sample correctly.
3. Can I use simple Python parallelism to solve the problem on my laptop? Sometimes the data isn't that big, but you need to run more intense computations on the smaller data; multiprocessing can help.
4. Can I use a big data framework on my laptop? For some tasks, even with a 25 GB dataset, frameworks like Spark and Dask can work on a single laptop.
5. Which package should I use?
6. Do I need to build a cluster? Take time to think about which distribution of Hadoop to use, which vendors to use, and whether you will put the cluster on the cloud or on-premises. You will need input from IT people for this one.

Big Data Package Ecosystem in Python
There are only four packages which are known to handle large datasets in Python. Of these, PySpark and Dask are the most stable options for enterprise-level data processing:
1. Apache Spark (PySpark)
2. Dask
3. Vaex
4. Datatable

Explore Dask on Your Own
In this course we use Dask, but I encourage you to explore PySpark as well.
• One of the best features of Dask is that it uses existing Python APIs and data structures, so it is easy to switch from NumPy, pandas and scikit-learn to their Dask-powered equivalents.
• At the same time, you can also run it on compute clusters, such as those powered by the Hadoop framework.
• Learn all about Dask here.
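As a small taste of that API reuse, the sketch below aggregates a set of CSV files with dask.dataframe using pandas-like syntax. The file pattern and column names are placeholders, and nothing is computed until .compute() is called.

```python
import dask.dataframe as dd

# Lazily read many CSV files as one logical DataFrame (hypothetical file pattern and columns).
df = dd.read_csv("sales-2023-*.csv")

# The same groupby/mean syntax as pandas, evaluated lazily and in parallel.
mean_by_region = df.groupby("region")["amount"].mean()

# Trigger the actual (parallel) execution and collect the result.
print(mean_by_region.compute())
```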
What is Functional Programming?
"In computer science, functional programming is a programming paradigm where programs are constructed by applying and composing functions. It is a declarative programming paradigm in which function definitions are trees of expressions that map values to other values, rather than a sequence of imperative statements which update the running state of the program." -- Wikipedia
• A pure function is a function whose output value follows solely from its input values and cannot be affected by any mutable state or other side effects.
• In functional programming, a program consists entirely of the evaluation of pure functions.

What is Functional Programming?
• A mathematical-function programming style.
• Follows the declarative programming model.
• Emphasizes the "what" of the solution instead of "how to" get to the solution.
• Uses expressions instead of statements.
• The LISt Processing language, known as LISP, was the first functional programming language, dating from the 1950s.
• Haskell and Scala are more recent representatives of this family of programming languages; Apache Spark is written mainly in Scala.
• Other languages (e.g., Python, R, Java) also provide rudimentary support for functional programming.

Functional Programming Style in Scala vs. Procedural Programming Style in Python
(Figure: side-by-side code comparison)

Traditional vs. Functional Program in JavaScript
(Figure: side-by-side code comparison; source: Wikipedia)

Advantages of Functional Programming
• Elegant code: code is elegant and concise because of higher-order function abstractions.
• High level: you describe the result you want rather than explicitly specifying the steps required to get there.
• Transparent: the behavior of a pure function depends only on its inputs and outputs, without intermediary values. That eliminates the possibility of side effects, which facilitates debugging and reduces the introduction of bugs.
• Parallelizable: turning FP code into parallel code requires no changes to the function definitions, unlike traditional procedural code.
• Programs are deterministic.

Disadvantages of Functional Programming
• Potential performance losses, because of the amount of garbage collection that happens when we keep creating new variables instead of mutating existing ones.
• File I/O is difficult, because it typically requires interaction with state.
• Programmers who are used to imperative programming can find this paradigm harder to grasp.

Advantages of Functional Programming: Parallelization
(Figure: input → function → output replicated across compute nodes; add compute nodes as the input size increases)

Lambda Functions
• Data types such as numbers, strings and booleans don't need to be bound to a variable; the same can be done for functions.
• In computer programming, an anonymous function (function literal, lambda abstraction, lambda function, lambda expression or block) is a function definition that is not bound to an identifier.
• Anonymous functions are often arguments being passed to higher-order functions, or are used for constructing the result of a higher-order function that needs to return a function.
• Anonymous functions are ubiquitous in functional programming languages and other languages with first-class functions, where they fulfil the same role for the function type as literals do for other data types. (Source: Wikipedia)

Lambda Functions in Scala vs. Python
(Figure: side-by-side code comparison)
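To illustrate the Python side, here is a minimal example of an anonymous function next to the equivalent named definition, and a lambda used as an argument to a higher-order function; the names and sample values are only illustrative.

```python
# A named function and its anonymous (lambda) equivalent.
def square(x):
    return x * x

square_lambda = lambda x: x * x  # defined without binding a name to the function itself

print(square(4), square_lambda(4))  # 16 16

# Lambdas are most useful as arguments to higher-order functions.
print(sorted(["pear", "fig", "banana"], key=lambda w: len(w)))  # ['fig', 'pear', 'banana']
```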
Functional Programming and Big Data Processing
• Functional programming lends itself to Big Data processing because of the ease of parallelization.
• For instance, Spark parallelizes computations using the lambda calculus.
• All functional Spark programs are inherently parallelizable, which means that when your input data grows from 1 MB to 1 PB during analysis, all you have to do is add more compute resources; there is no need to change the code.

Functional Programming in Python
• Functions in Python are first-class citizens: functions have the same characteristics as values like strings and numbers.
• Functions have two abilities which are crucial for functional programming:
  • they can take another function as an argument
  • they can return functions as values
• They can also be stored in variables, just like other data types.
• Anonymous functions are easy to define with lambda.
• Therefore, Python provides good support for functional programming.

Data Structures for Functional Programming in Python
• Mutable data structures such as dictionaries and lists are not ideal for functional programs, because they can be changed while the program is running.
• Instead, immutable data structures are better: you are forced to make a copy of an object before you change it.
• In Python, "namedtuple" and "tuple" can be used instead of lists and dictionaries.

Why Does Data Immutability Matter in FP?
• In pure functional languages, all data is immutable and the program state cannot change.
• What are the implications of this property?
  • Functions are deterministic: the same input will always yield the same output. This makes it easier to re-use functions elsewhere.
  • The order of execution of multiple functions does not affect the final outcome of the program.
  • Programs don't contain any side effects.
• We will see that in Apache Spark all data structures are immutable; you have to make a copy or perform some action/transformation to change them.

Defining Lambda Functions in Python
(Figure: code example)

How Lambda Functions Fit Together with Python Function Definitions and Lambda Calculus
(Figure)

Higher-Order Functions
A function is a higher-order function if it takes one or more functions as parameters and/or returns a function.

Higher-Order Functions in Action
(Figure: code example)

Common Higher-Order Functions
Similar in idea to the map and reduce functions in the MapReduce programming model.
• Map is a higher-order function with the following specification:
  • Inputs: a function f and a list of elements L
  • Output: a new list of elements, with f applied to each of the elements of L
• Reduce reduces a list of elements to one element, using a binary function to successively combine the elements.
  • Inputs: a function f, a list of elements L and an accumulator acc, the parameter that collects the return value. You can think of acc as the initial value.
  • Output: the value of f sequentially applied and tracked in acc
• Filter: keeps only the elements of L for which a predicate function returns true.

Map Example
(Figure: code example)

Reduce Example
(Figure: code example)

Further Reading on Functional Programming
1. Wikipedia has good content on the topic.
2. The lambda calculus background is also interesting to read.
3. These slides provide good introductory information on FP.

Exercises
• Practice with Python basics, NumPy and pandas.

Exercises
• Identify each of the variables in the Excel file Credit Approval Decisions as categorical, ordinal, interval, or ratio, and explain why.
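As a warm-up for the exercises, the short sketch below ties the earlier ideas together: map, filter and reduce used as higher-order functions, with an immutable namedtuple standing in for a mutable record. The record type and sample values are made up for illustration.

```python
from collections import namedtuple
from functools import reduce

# An immutable record type instead of a mutable dict.
Sale = namedtuple("Sale", ["region", "amount"])
sales = [Sale("East", 120.0), Sale("West", 80.0), Sale("East", 50.0)]

# map: apply a function to every element of the list.
amounts = list(map(lambda s: s.amount, sales))

# filter: keep only the elements that satisfy a predicate.
east_only = list(filter(lambda s: s.region == "East", sales))

# reduce: fold the list into one value with a binary function and an accumulator (0 here).
total = reduce(lambda acc, s: acc + s.amount, sales, 0)

print(amounts)                       # [120.0, 80.0, 50.0]
print([s.amount for s in east_only]) # [120.0, 50.0]
print(total)                         # 250.0
```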