DPS2017 Candidates @ PACE – Fall 2015
Team 3: Amir Ataee, Marvin Billings, Sharice Cannady, LLiver, José Egal Sanchez
Professor Chuck Tappert, Ph.D.
DCS860A EMERGING INFORMATION TECHNOLOGIES
Brief overview of Big Data Analytics with an emphasis on enabling algorithms

TABLE OF CONTENTS
• Big Data Overview (Egal)
  • Data Structures
  • Perspectives on Data Repository
• State of the Practice in Analytics (Marvin)
  • Current Analytical Architecture
    • Data Sources
    • Data Warehouse (DW)
    • EDW (Enterprise DW)
    • Data Science Users
  • Drivers of Big Data
• Example of Big Data Analytics – IoT (Sharice)
• Data Analytics Lifecycle (Amir)
  • Phase 1: Discovery
  • Phase 2: Data Preparation
  • Phase 3: Model Planning
  • Phase 4: Model Building
  • Phase 5: Communicate Results
  • Phase 6: Operationalize
• Algorithms (LLiver)
  • Clustering
    • K-means
  • Association Rules
    • Apriori Algorithm
  • Regression
    • Linear Regression
    • Logistic Regression
  • Classification
    • Decision Trees
    • Naïve Bayes
  • Time Series Analysis
  • Text Analysis

EGAL'S SECTION
BIG DATA OVERVIEW
Big Data Definition: Big Data can be defined as all data that is not a good fit for a traditional Relational Database Management System (RDBMS), whether it is used for online transaction processing (OLTP) or for analytic purposes. Big Data is associated not only with the size of the data but, most importantly, with the format of the data. However, there are ecosystems, like Hadoop, that can utilize both structured and unstructured data.
The collection of data and the size of the data have gone through an evolutionary process covering four stages:
• Engineers hard-wired computers to extract information (the introduction of the telegraph)
• Data input by computer operators (corporations and organizations logging corporate data)
• The introduction of the web (Web 1.0 and Web 2.0)
• The introduction of sensors (the Internet of Things)
The explosion of data collection has been exponential, growing from kilobytes to terabytes.

DATA STRUCTURES
In computer science, a data structure is a methodical way of organizing data in a computer so that it can be used efficiently:
• Structured databases use B-tree indexes for queries that retrieve a small percentage of the data
• Compiler symbol tables and databases use dynamic hash tables for look-ups
Efficient data structures are usually essential to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design. (A small illustration of these two access patterns appears at the end of this section.)

PERSPECTIVES ON DATA REPOSITORY
Popular Relational Database Management Systems (RDBMS): Microsoft SQL Server, Oracle, and MySQL. Structured data traditionally resides in an Enterprise Data Warehouse (EDW), which feeds traditional BI/analytics.
Popular NoSQL technologies: Cassandra, CouchDB, and MongoDB. Unstructured data resides in big-data repositories, such as Hadoop, which feed big-data applications.
Some big data ecosystems, like Hadoop, can leverage both relational (RDBMS) repositories and NoSQL big-data repositories.
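To make the contrast above concrete, here is a minimal, hypothetical Python sketch (the symbol table and salary data are invented for illustration): a dict provides the constant-time hash look-ups a compiler symbol table relies on, while a sorted sequence queried with bisect answers the kind of ordered range scan that a B-tree index serves in a database.

```python
import bisect

# Hash-table look-up: how a compiler symbol table resolves one key
# in O(1) expected time.
symbol_table = {"x": "int", "total": "float", "name": "str"}
print(symbol_table["total"])  # -> float

# Ordered index (the role a B-tree plays in a database): a sorted
# structure answers range queries that touch a small percentage of rows.
salaries = [31000, 42000, 55000, 61000, 78000, 90000]  # already sorted
lo = bisect.bisect_left(salaries, 40000)   # first salary >= 40,000
hi = bisect.bisect_right(salaries, 70000)  # first salary > 70,000
print(salaries[lo:hi])  # -> [42000, 55000, 61000]
```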
MARVIN'S SECTION
STATE OF THE PRACTICE IN ANALYTICS – WHAT IS DRIVING IT?
❖ [3] 2.5 quintillion bytes of data are created daily.
❖ [5] The data generated every 2 days equals all of the data created from the beginning of time until 2003.
❖ 90% of the data stored in the world was created in the last 2 years.
❖ All of the data in storage in the world doubles every 1.2 years.
❖ From 3.2 zettabytes today to 40 zettabytes in 2020.
❖ 570 websites are created every minute.
❖ Google processes over 40,000 searches per second.
❖ Every minute, Facebook processes 1.8 million likes and 200,000 photo uploads.
❖ YouTube receives 100 hours of video per minute.
❖ The NSA monitors 60 petabytes daily (1.8% of internet traffic).
❖ There are 1.8 billion smartphones among the 7.1 billion people on Earth.
❖ 25% of the people on this planet carry a smartphone full of sensors.
❖ Typical smartphone sensors: motion, air temperature, light, proximity, humidity, location, etc.

STATE OF THE PRACTICE IN ANALYTICS – WHERE IS THE DATA COMING FROM?
❖ Mobile device users
❖ Sensors
❖ Archives
❖ Social networks
❖ Internet of Things
❖ Enterprise applications
❖ Cameras
❖ Software logs
❖ Health data (e.g., Fitbit)
❖ Databases
❖ Etc.

STATE OF THE PRACTICE IN ANALYTICS – CURRENT GENERAL ARCHITECTURE
[Figure: current general big data architecture. Mysore, D., Khupat, S., & Jain, S. [Photograph]. Retrieved from http://www.ibm.com/developerworks/library/bd-archpatterns1/fig1.png]

SHARICE'S SECTION
BIG DATA & THE INTERNET OF THINGS
• Data management software such as NoSQL databases and Hadoop was initially used to analyze website traffic and social media data.
• It turns out that NoSQL and Hadoop are also capable of analyzing data streams coming from sensors and controllers.

BIG DATA & THE INTERNET OF THINGS
• Today, sensors and controllers are streaming data from:
1. Jet engines
2. Mobile devices
3. Health monitors
4. Items in shipment
• [Figure: a simplified view of key Internet of Things components]

BIG DATA & THE INTERNET OF THINGS
• The "killer use cases" for Big Data:
• Sensors and intelligent controllers increasingly provide data that is critical to running the business.
• A Big Data use case can be driven by the need to analyze data from the Internet of Things.

AMIR'S SECTION
DATA ANALYTICS LIFE CYCLE
Big Data is defined as "extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools."
Dealing with Big Data raises several challenges, such as the acquisition, storage, searching, sharing, analysis, and visualization of the data.
To overcome these challenges, we need a process that facilitates the analysis of Big Data. For this purpose, the data analytics life cycle was designed.

DATA ANALYTICS LIFE CYCLE (CONTINUED)
The life cycle has six phases: (1) Discovery, (2) Data Preparation, (3) Model Planning, (4) Model Building, (5) Communicate Results, and (6) Operationalize. The phases do not have to run in strict serial order; one or more can happen at the same time. At most of these phases, the team can move either forward or backward, depending on what new information becomes available.

PHASE 1: DISCOVERY
The team learns the business domain.
The team assesses the resources available to support the project.
The team formulates initial hypotheses (IHs) to test and begins learning the data.
PHASE 2: DATA PREPARATION
This phase requires the presence of an analytic sandbox in which the team can work with data.
The team needs to execute extract, load, and transform (ELT) processes.
The team needs to familiarize itself with the data thoroughly and take steps to condition the data.

PHASE 3: MODEL PLANNING
The team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.

PHASE 4: MODEL BUILDING
The team develops datasets for testing and production purposes.
The team builds and executes models based on the work done in the model planning phase.

PHASE 5: COMMUNICATE RESULTS
The team, in collaboration with major stakeholders, determines whether the results of the project are a success or a failure.
The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.

PHASE 6: OPERATIONALIZE
The team delivers final reports, briefings, code, and technical documents.
The team may run a pilot project to implement the models in a production environment.

LLIVER'S SECTION
Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000

WHAT IS AN ALGORITHM
• In mathematics and computer science, an algorithm is a self-contained, step-by-step set of operations to be performed. Algorithms exist that perform calculation, data processing, automated reasoning, and more.
• The concept of an algorithm has existed for centuries; however, a partial formalization of what would become the modern algorithm began with attempts to solve the Entscheidungsproblem ("decision problem": deciding whether a given statement is universally valid) posed by David Hilbert in 1928. Giving a formal definition of algorithms, corresponding to the intuitive notion, remains a challenging problem. [3]
• Algorithms normally behave as they are designed, performing a number of tasks. But when left unsupervised, they can and will do strange things. As we put more and more of our world under the control of algorithms, we can lose track of who (or what) is pulling the strings.
• Algorithms entered the general public's newscasts through the Flash Crash, but they did not leave. They soon showed up in stories about dating, shopping, entertainment, medicine, everything imaginable. Algorithms are taking over everything. [12]

WHAT IS AN ALGORITHM (CONT.)
• The bounds of algorithms get pushed every day. They have displaced humans in a growing number of industries, something they often do well. They are faster than us, and when things work as they should, they make far fewer mistakes than we do. But as algorithms acquire power and independence, there can be unexpected consequences. They observe, experiment, and learn, all independently of their human creators.
• Using advanced computer science techniques such as machine learning and neural networks, algorithms can even create new and improved algorithms based on observed results. Algorithms have already written symphonies, picked through legalese, diagnosed patients, written news articles, flown airplanes, and driven vehicles on urban highways with far better control than humans. [12]
• It is no coincidence that the most upwardly mobile people in society right now are those who can manipulate code to create algorithms that can sprint through oceans of data.
• In the history of thought, before the discovery of calculus, mathematics had been a discipline of great interest; afterward, it became a discipline of great power. Only the advent of the computer algorithm in the twentieth century has represented a mathematical idea of comparable influence. The calculus and the algorithm are the two leading ideas of Western science.

ALGORITHM: DISCUSSION
• What will become of our duties as humans, our employment, etc.?
• Since algorithms are here to stay, how can we be more in control, build trust, etc.?

CLASSIFICATION OF ALGORITHMS
By implementation means:
• Recursion or iteration
• Logical
• Serial, parallel, or distributed
• Deterministic or non-deterministic
• Exact or approximate
• Quantum
By design paradigm:
• Brute-force or exhaustive search
• Divide and conquer
• Search and enumeration
• Randomized
• Reduction of complexity
Optimization problems:
• Linear programming
• Dynamic programming
• The greedy method
• The heuristic method
Algorithms are also classified by complexity and by field of study.

MAJOR BIG DATA ANALYTICS ALGORITHMS: CLUSTERING
Clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, "unsupervised" refers to the problem of finding hidden structure within unlabeled data. Clustering techniques are unsupervised in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters. For large data sets, clustering is computationally expensive.
Clustering is primarily an exploratory technique to discover hidden structures in the data, possibly as a prelude to more focused analysis or decision processes. K-means is a simple and straightforward method for defining clusters. Once clusters and their associated centroids are identified, it is easy to assign new objects to a cluster based on the object's distance from the closest centroid.
Clustering use cases:
• Image processing: Video is one example of the growing volumes of unstructured data being collected. Within each frame of a video, k-means analysis can be used to identify objects in the video.
• Medical: Patient attributes such as age, height, weight, systolic and diastolic blood pressures, cholesterol level, and other attributes can identify naturally occurring clusters. These clusters could be used to target individuals for specific preventive measures or clinical trial participation. Clustering, in general, is useful in biology for the classification of plants and animals, as well as in the field of human genetics.
• Customer segmentation: Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.

MAJOR BIG DATA ANALYTICS ALGORITHMS: CLUSTERING (CONT.)
If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

begin initialize n, c, μ1, μ2, ..., μc (randomly selected)
    do classify the n samples according to the nearest μi
       recompute μi
    until no change in μi
    return μ1, μ2, ..., μc
end

(A runnable sketch of this loop appears below.)
The question is how to evaluate whether the samples in one cluster are more similar among themselves than to samples in other clusters. Two issues:
• How to measure the similarity between samples?
• How to evaluate a partitioning of a set into clusters?
The most obvious measure of similarity between two samples is the distance between them, i.e., we define a metric. Once this measure is defined, one would expect the distance between samples of the same cluster to be significantly less than the distance between samples in different clusters.
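Below is a minimal NumPy sketch of the k-means loop above. The random two-blob data and the choice of c = 2 are illustrative assumptions, not from the slides, and the sketch uses Euclidean distance as its similarity metric.

```python
import numpy as np

def kmeans(samples, c, iterations=100, seed=0):
    """Plain k-means: classify samples to the nearest centroid mu_i,
    recompute each mu_i, stop when the centroids no longer change."""
    rng = np.random.default_rng(seed)
    # initialize mu_1..mu_c as randomly selected samples
    mu = samples[rng.choice(len(samples), size=c, replace=False)]
    for _ in range(iterations):
        # classify: index of the nearest centroid for every sample
        dists = np.linalg.norm(samples[:, None, :] - mu[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned samples
        # (this minimal sketch does not handle empty clusters)
        new_mu = np.array([samples[labels == i].mean(axis=0) for i in range(c)])
        if np.allclose(new_mu, mu):  # no change in mu_i -> done
            break
        mu = new_mu
    return mu, labels

# Toy example: two well-separated 2-D blobs.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
centroids, labels = kmeans(np.vstack([blob_a, blob_b]), c=2)
print(centroids)  # approximately [0, 0] and [5, 5]
```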
MAJOR BIG DATA ANALYTICS ALGORITHMS: ASSOCIATION RULES (ARS)
This is a descriptive, not predictive, method often used to discover interesting relationships hidden in a large dataset. The relationships occur too frequently to be random and are meaningful from a business perspective, whether or not they are obvious. ARs are commonly used for mining transactions in databases.
Possible questions that ARs can answer:
• Which products tend to be purchased together?
• Of those customers who are similar to this person, what products do they tend to buy?
• Of those customers who have purchased this product, what other similar products do they tend to view or purchase?
Apriori [8] is one of the earliest and most fundamental algorithms for generating association rules. (A minimal Apriori sketch appears below, following the linear regression example.)

APPLICATIONS OF ASSOCIATION RULES
Market basket analysis refers to a specific implementation of association rules mining that many companies use for a variety of purposes, including these:
• Broad-scale approaches to better merchandising
• Cross-merchandising between products and high-margin or high-ticket items
• Physical or logical placement of products within related categories of products
• Promotional programs of multiple-product purchase incentives managed through a loyalty card program
ARs are commonly used for recommender systems [9] and clickstream analysis. Many online service providers, such as Amazon and Netflix, use recommender systems, which can discover related products or identify customers who have similar interests. This provides valuable insight into how to better personalize and recommend content to site visitors. The framework has expanded to web contexts, such as mining path traversal patterns and usage patterns [7] to facilitate the organization of web pages.

MAJOR BIG DATA ANALYTICS ALGORITHMS: REGRESSION: LINEAR REGRESSION
Regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. Often, the outcome variable is called a dependent variable because the outcome depends on the other variables. These additional variables are sometimes called the input variables or the independent variables. Typical questions:
• What is a person's expected income?
• What is the probability that an applicant will default on a loan?
Linear regression is an analytical technique used to model the relationship between several input variables and a continuous outcome variable. Linear regression models are useful in physical and social science applications where there may be considerable variation in a particular outcome based on a given set of input values.
Linear regression is often used in business, government, and other scenarios. Some common practical applications:
• Real estate: A simple linear regression analysis can be used to model residential home prices as a function of the home's living area. The model could be further improved by including other input variables such as number of bathrooms, number of bedrooms, lot size, school district rankings, crime statistics, and property taxes.
• Demand forecasting: Businesses and governments can use linear regression models to predict demand for goods and services. Similar models can be built to predict retail sales, emergency room visits, and ambulance dispatches.
• Medical: A linear regression model can be used to analyze the effect of a proposed radiation treatment on reducing tumor sizes.
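A minimal NumPy sketch of the real estate example above, fitting price as a linear function of living area by ordinary least squares; the areas and prices are made-up illustration data, not from the slides.

```python
import numpy as np

# Hypothetical data: living area (sq ft) and sale price ($) for six homes.
area = np.array([1100, 1400, 1700, 2000, 2300, 2600], dtype=float)
price = np.array([199000, 245000, 290000, 335000, 379000, 425000], dtype=float)

# Fit price = b0 + b1 * area by ordinary least squares.
X = np.column_stack([np.ones_like(area), area])  # design matrix with intercept
(b0, b1), *_ = np.linalg.lstsq(X, price, rcond=None)
print(f"price ~ {b0:.0f} + {b1:.2f} * area")

# Predict the price of an unseen 1,850 sq ft home.
print(f"predicted: ${b0 + b1 * 1850:,.0f}")
```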
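Returning to association rules: a stripped-down sketch of the Apriori idea [8], in which frequent itemsets are grown level by level and any candidate with an infrequent subset is pruned, since every subset of a frequent itemset must itself be frequent. The market baskets and support threshold are invented for illustration.

```python
from itertools import combinations

transactions = [  # hypothetical market baskets
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 3  # an itemset must appear in at least 3 baskets

def support(itemset):
    return sum(itemset <= t for t in transactions)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{i} for i in items if support({i}) >= min_support]

# Levels 2..n: join frequent (k-1)-itemsets into k-item candidates,
# prune any candidate with an infrequent (k-1)-subset, then check support.
level = frequent
while level:
    candidates = {frozenset(a | b) for a in level for b in level
                  if len(a | b) == len(a) + 1}
    level = [set(c) for c in candidates
             if all(support(set(s)) >= min_support
                    for s in combinations(c, len(c) - 1))
             and support(set(c)) >= min_support]
    frequent += level

print(frequent)  # e.g. {'bread'}, {'milk'}, ..., {'diapers', 'beer'}
```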
MAJOR BIG DATA ANALYTICS ALGORITHMS: REGRESSION: LOGISTIC REGRESSION
When the outcome variable is categorical, logistic regression is the better choice. Both models assume a linear additive function of the input variables; if that assumption does not hold, both regression techniques perform poorly.
Use cases: The logistic regression model is applied to a variety of situations in both the public and the private sector.
• Medical: Develop a model to determine the likelihood of a patient's successful response to a specific medical treatment or procedure.
• Finance: Using a loan applicant's credit history and the details of the loan, determine the probability that the applicant will default on the loan. Based on the prediction, the loan can be approved or denied, or the terms can be modified.
• Marketing: Determine a wireless customer's probability of switching carriers (known as churning) based on age, number of family members on the plan, months remaining on the existing contract, and social network contacts.
• Engineering: Based on operating conditions and various diagnostic measurements, determine the probability of a mechanical part experiencing a malfunction or failure. With this probability estimate, schedule the appropriate preventive maintenance activity.
(A minimal churn-style sketch appears below, following the classification overview.)

MAJOR BIG DATA ANALYTICS ALGORITHMS: CLASSIFICATION
Classification is another fundamental learning method that appears in applications related to data mining. The primary task performed by classifiers is to assign class labels to new observations. The set of labels for classifiers is predetermined, unlike in clustering, which discovers the structure without a training set and allows the data scientist optionally to create and assign labels to the clusters. Most classification methods are supervised, in that they start with a training set of pre-labeled observations to learn how the attributes of those observations may contribute to the classification of future unlabeled observations. Classification is widely used for prediction purposes. Two fundamental methods: decision trees and naïve Bayes.
Decision trees: A decision tree (also called a prediction tree) uses a tree structure to specify sequences of decisions and consequences.
• Classification trees usually apply to output variables that are categorical, often binary, in nature, such as yes or no.
• Regression trees, on the other hand, can apply to output variables that are numeric or continuous, such as the predicted price of a consumer good or the likelihood that a subscription will be purchased.
Naïve Bayes: Naïve Bayes is a probabilistic classification method based on Bayes' theorem (or Bayes' law) with a few tweaks. Bayes' theorem gives the relationship between the probabilities of two events and their conditional probabilities. Because Bayes classifiers are easy to implement and can execute efficiently even without prior knowledge of the data, they are among the most popular algorithms for classifying text documents. E-mail spam filtering is a classic use case of Bayes text classification. Naïve Bayes classifiers can also be used for fraud detection. [11]
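First, the churn sketch promised in the logistic regression overview above, a minimal example using scikit-learn (one common choice, not something the slides prescribe); the customer features and churn labels are invented illustration data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical customers: [age, months remaining on contract]
X = np.array([[23, 1], [31, 2], [44, 14], [52, 20],
              [28, 3], [39, 11], [61, 24], [26, 2]])
y = np.array([1, 1, 0, 0, 1, 0, 0, 1])  # 1 = churned, 0 = stayed

model = LogisticRegression().fit(X, y)

# Probability that a 30-year-old with 4 months left will churn.
print(model.predict_proba([[30, 4]])[0, 1])
```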
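Next, a short sketch contrasting the two classification methods just described, again with made-up loan data. GaussianNB is used here because the features are numeric; it is one of several naïve Bayes variants (text spam filtering would more typically use multinomial naïve Bayes over word counts).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Hypothetical loan applicants: [income ($k), existing debt ($k)]
X = np.array([[35, 20], [90, 10], [42, 30], [120, 15],
              [28, 25], [75, 5], [50, 40], [110, 8]])
y = np.array(["default", "repay", "default", "repay",
              "default", "repay", "default", "repay"])

new_applicant = [[65, 12]]

# Decision tree: learns a sequence of threshold tests on the features.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print("tree :", tree.predict(new_applicant)[0])

# Naive Bayes: applies Bayes' theorem, assuming the features are independent.
nb = GaussianNB().fit(X, y)
print("bayes:", nb.predict(new_applicant)[0])
```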
MAJOR BIG DATA ANALYTICS ALGORITHMS: TIME SERIES ANALYSIS
Time series analysis attempts to model the underlying structure of observations taken over time. It has many applications in finance, economics, biology, engineering, retail, and manufacturing. The goals of time series analysis are:
• Identify and model the structure of the time series.
• Forecast future values in the time series.
Use cases:
• Retail sales: For various product lines, a clothing retailer is looking to forecast future monthly sales.
• Spare parts planning: Companies' service organizations have to forecast future spare-part demand to ensure an adequate supply of parts to repair customer products.
• Stock trading: Some high-frequency stock traders utilize a technique called pairs trading.
(A small forecasting sketch appears below, following the text analysis overview.)

MAJOR BIG DATA ANALYTICS ALGORITHMS: TEXT ANALYSIS
Text analysis, sometimes called text analytics, refers to the representation, processing, and modeling of textual data to derive useful insights. An important component of text analysis is text mining, the process of discovering relationships and interesting patterns in large text collections. Text analysis suffers from the curse of high dimensionality.
A corpus (plural: corpora) is a large collection of texts used for various purposes in natural language processing (NLP). The table below lists a few example corpora that are commonly used in NLP research.

Corpus | Word Count | Domain | Website
Shakespeare | 0.88 million | Written | http://shakespeare.mit.edu/
Brown Corpus | 1 million | Written | http://icame.uib.no/brown/bcm.html
Penn Treebank | 1 million | Newswire | http://www.cis.upenn.edu/~treebank/
Switchboard Phone Conversations | 3 million | Spoken | http://catalog.ldc.upenn.edu/LDC97S62
British National Corpus | 100 million | Written and spoken | http://www.natcorp.ox.ac.uk/
NA News | 350 million | Newswire | http://catalog.ldc.upenn.edu/LDC95T21
European Parliament Proceedings Parallel | 600 million | Legal | http://www.statmt.org/europarl/
Google N-Grams | 1 trillion | Written | http://catalog.ldc.upenn.edu/LDC2006T13
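First, the forecasting sketch promised in the time series overview above: a minimal trend-plus-forecast example in plain NumPy. The monthly sales figures are invented, and real work would more likely use a dedicated model such as ARIMA or a full exponential smoothing implementation.

```python
import numpy as np

# Hypothetical monthly sales for a clothing retailer (units).
sales = np.array([120, 132, 128, 141, 150, 147, 158, 166, 162, 175, 183, 179])

# Identify structure: estimate the trend with a 3-month moving average.
trend = np.convolve(sales, np.ones(3) / 3, mode="valid")
print("smoothed trend:", np.round(trend, 1))

# Forecast: simple exponential smoothing, s_t = alpha*x_t + (1-alpha)*s_{t-1};
# the last smoothed value serves as the one-step-ahead forecast.
alpha, s = 0.4, sales[0]
for x in sales[1:]:
    s = alpha * x + (1 - alpha) * s
print("next-month forecast:", round(s, 1))
```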
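And for the text analysis overview: a small sketch turning a toy corpus into the high-dimensional term representation discussed above, using scikit-learn's TF-IDF vectorizer (scikit-learn and the three tiny documents are assumptions for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # a tiny hypothetical corpus
    "big data analytics enables better decisions",
    "clustering groups similar objects in big data",
    "text mining discovers patterns in large text collections",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # documents x vocabulary matrix

# Even three short documents yield a high-dimensional, sparse representation.
print(tfidf.shape)                         # (3, vocabulary size)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
```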
REFERENCES (LLIVER)
[1] P. Axt, "On a Sub-Recursive Hierarchy and Primitive Recursive Degrees," Transactions of the American Mathematical Society, vol. 92, pp. 85-105, 1959.
[2] C. G. Bell and A. Newell, Computer Structures: Readings and Examples, McGraw-Hill Book Company, New York, 1971. ISBN 0-07-004357-4.
[3] Y. N. Moschovakis, "What Is an Algorithm?" in B. Engquist and W. Schmid (eds.), Mathematics Unlimited - 2001 and Beyond, Springer, pp. 919-936 (Part II), 2001. ISBN 9783540669135.
[4] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, 1967.
[5] P. Hajek, I. Havel, and M. Chytil, "The GUHA Method of Automatic Hypotheses Determination," Computing, vol. 1, no. 4, pp. 293-308, 1966.
[6] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," in SIGMOD '93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207-216, 1993.
[7] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web," in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, pp. 558-567, 1997.
[8] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1994.
[9] W. Lin, S. A. Alvarez, and C. Ruiz, "Efficient Adaptive-Support Association Rule Mining for Recommender Systems," Data Mining and Knowledge Discovery, vol. 6, no. 1, pp. 83-105, 2002.
[10] W. Lin, S. A. Alvarez, and C. Ruiz, "Collaborative Recommendation via Adaptive Association Rule Mining," in Proceedings of the International Workshop on Web Mining for E-Commerce (WEBKDD), Boston, MA, 2000.
[11] C. Phua, V. C. S. Lee, K. Smith, and R. W. Gayler, "A Comprehensive Survey of Data Mining-Based Fraud Detection," CoRR, vol. abs/1009.6119, 2010.
[12] C. Steiner, Automate This: How Algorithms Took Over Our Markets, Our Jobs, and the World, Portfolio/Penguin, New York, NY, 2013. ISBN 9781591846529.
[13] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd ed.), John Wiley & Sons, 2000.

REFERENCES (MARVIN)
[1] Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big Data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79, 3-15.
[2][3] Mysore, D., Khupat, S., & Jain, S. (2013, September 17). Big data architecture and patterns, Part 1: Introduction to big data classification and architecture. Retrieved from http://www.ibm.com/developerworks/library/bd-archpatterns1/
[4] Wang, L., Ranjan, R., Kołodziej, J., Zomaya, A., & Alem, L. (2015). Software tools and techniques for Big Data computing in healthcare clouds. Future Generation Computer Systems, 43, 38-39.
[5] Marr, B. (2015). Big Data: Using SMART big data, analytics and metrics to make better decisions and improve performance.

REFERENCES (EGAL)
[1] https://en.wikipedia.org/wiki/Data_structure
[2] http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics
[3] http://searchstorage.techtarget.com/definition/file-storage
[4] A. Zarate Santovena, "Big Data: Evolution, Components, Challenges and Opportunities," Universidad Iberoamericana.
[5] http://www.adamadiouf.com/2013/03/22/bigdata-vs-enterprise-data-warehouse/

REFERENCES (SHARICE)
[1] Licht, A., Mantha, N., Nagode, L., & Stackowiak, R. (2015). Big Data and the Internet of Things: Enterprise Information Architecture for a New Age. New York: Apress.
[2] CIT Group. (2015). The Internet of Things. Retrieved December 17, 2015, from https://www.youtube.com/watch?v=ygfBzQ1RKts

REFERENCES (AMIR)
• F. Ohlhorst, Big Data Analytics: Turning Big Data into Big Money, John Wiley & Sons, 2013.
• http://www.brainyquote.com/quotes/authors/l/lao_tzu.html
• EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, John Wiley & Sons, 2015.

Questions?