Big Data: a small introduction
Prabhakar TV, IIT Kanpur, India
tvp@iitk.ac.in
Much of this content is generously borrowed from all over the Internet.

Let us start with a story
Can we predict that a customer is expecting a baby?

"As Pole's computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a 'pregnancy prediction' score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy."
(Pole: a statistician at Target)

What is Big Data? How big is big?
- A constantly moving target.
- More than 100 petabytes in 2012.

Big in what?
- Big in Volume
- Big in Velocity
- Big in Variety

Big Data Dimensions
Michael Schroeck et al., IBM Executive Report, "Analytics: The real-world use of big data"

Big Data Dimensions: add more Vs
Michael Schroeck et al., IBM Executive Report, "Analytics: The real-world use of big data"

Gartner's Definition
"Big data are high volume, high velocity, high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

Big data can be very small
- Thousands of sensors in planes, power stations, trains, ...
- These sensors have errors (tolerance).
- They monitor everything from engine efficiency to passenger safety.
- The size of the dataset is not very large (several gigabytes), but the number of permutations in the sources is very large.
http://mike2.openmethodology.org/wiki/Big_Data_Definition

Large datasets that ain't big
- Media streaming is generating very large volumes with increasing amounts of structured metadata.
- Telephone calls and internet connections: petabytes of data, but the content is extremely structured.
- Relational databases handle well-structured data very well.

Who coined the term Big Data?
Not clear. An economist, Prof. Francis Diebold of the University of Pennsylvania, has a claim to it. There is even a New York Times article:
http://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/

But generally speaking...
- The term originated as a tag for a class of technology with roots in high-performance computing, pioneered by Google in the early 2000s.
- It includes technologies such as distributed file and database management tools, led by the Apache Hadoop project; big data analytic platforms, also led by Apache; and integration technology for exposing data to other systems and services.

The Big Data toolkit
- A/B testing
- association rule learning
- classification
- cluster analysis
- genetic algorithms
- machine learning
- natural language processing
- neural networks
- pattern recognition
- anomaly detection
- predictive modeling
- regression
- sentiment analysis
- signal processing
- supervised and unsupervised learning
- simulation
- time series analysis
- visualisation

What is special about big data processing?

Big Volume, Little Analytics
- Well addressed by the data warehouse crowd, who are pretty good at SQL analytics on hundreds of nodes and petabytes of data.
(From Stonebraker)

Big Data, Big Analytics
- Complex math operations (machine learning, clustering, trend detection, ...).
- In the market, this is the world of the "quants".
- Mostly specified as linear algebra on array data.

Big Data, Big Analytics: an example
- Consider the closing price on all trading days for the last 5 years for two stocks A and B.
- What is the covariance between the two time series?

Now make it interesting...
- Do this for all pairs of 4000 stocks.
- The data is a 4000 x 1000 matrix: one row per stock (S1 ... S4000), one column per trading day (t1 ... t1000).
- Hourly data? All securities?

And now try it for companies headquartered in Switzerland!

Goal of Big Data
Good data management, integrated with complex analytics.

How to manage big data?
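The all-pairs covariance computation from the stock example a few slides back can be sketched in plain Python. This is a toy, single-machine sketch: the stock names and prices below are made up, standing in for the 4000 x 1000 matrix in the slides.

```python
# Toy sketch of the all-pairs covariance computation from the slides.
# The stock names and prices are invented illustration data.

def covariance(xs, ys):
    """Sample covariance of two equally long time series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

stocks = {
    "A": [10.0, 11.0, 12.0, 13.0],   # rising
    "B": [20.0, 19.0, 18.0, 17.0],   # falling
    "C": [5.0, 6.0, 5.0, 6.0],       # oscillating
}

# "Do this for all pairs": one covariance per unordered pair of stocks.
names = sorted(stocks)
pairs = {(a, b): covariance(stocks[a], stocks[b])
         for i, a in enumerate(names) for b in names[i + 1:]}

print(pairs[("A", "B")])   # negative: A rises while B falls
```

With 4000 stocks there are roughly 8 million pairs, which is exactly why the slides pose this as a big-analytics problem: at that scale one would compute a full covariance matrix with a linear-algebra library on a parallel platform, not a double loop.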
While big data technology may be quite advanced, everything else surrounding it (best practices, methodologies, organizational structures, etc.) is nascent.

What is wrong with Big Data? The end of theory
- Traditional statistics starts with a model, say a normal distribution, and computes its mean and variance.
- Here there is no a priori model; the model is discovered. For example: how many clusters are there?

How companies learn your secrets: privacy issues
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=1&_r=2&hp

Will now talk about
- MapReduce
- Hadoop
- Big data in India: the academic scene

MapReduce
- Inspired by the Lisp programming language.
- A programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Many problems can be phrased this way, and are then easy to distribute across nodes.
- Google has a patent! Will it hurt me?

The MapReduce Paradigm
- A platform for reliable, scalable parallel computing.
- Abstracts the issues of the distributed, parallel environment away from the programmer.
- Runs over distributed file systems: the Google File System and the Hadoop Distributed File System (HDFS).
(Adapted from S. Sudarshan, IIT Bombay)

MapReduce
- Consider the problem of counting the number of occurrences of each word in a large collection of documents.
- How would you do it in parallel?
Solution:
- Divide the documents among workers.
- Each worker parses its documents to find all words and outputs (word, count) pairs.
- Partition the (word, count) pairs across workers based on the word.
- For each word at a worker, locally add up the counts.

Map-Reduce
- Iterate over a large number of records.
- Map: extract something of interest from each record.
- Shuffle and sort the intermediate results.
- Reduce: aggregate the intermediate results.
- Generate the final output.

MapReduce Programming Model
Input: a set of key/value pairs. The user supplies two functions:
- map(k, v) -> list(k1, v1)
- reduce(k1, list(v1)) -> v2
(k1, v1) is an intermediate key/value pair. The output is the set of (k1, v2) pairs.

MapReduce: Execution overview
[figure: execution overview diagram]

MapReduce: The Map Step
[figure: map turns input key-value pairs, e.g. (doc-id, doc-content), into intermediate key-value pairs, e.g. (word, word-count-in-a-doc)]
(Adapted from Jeff Ullman's course slides)

MapReduce: The Reduce Step
[figure: intermediate key-value pairs, e.g. (word, word-count-in-a-doc), are grouped into key-value groups, e.g. (word, list-of-word-counts), roughly SQL GROUP BY; reduce then produces output pairs, e.g. (word, final-count), roughly SQL aggregation]
(Adapted from Jeff Ullman's course slides)

Pseudo-code

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    // The group-by step is done by the system on the key of the
    // intermediate Emit above, and reduce is called on the list of
    // values in each group.

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // intermediate_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));

Distributed Execution Overview
[figure: the user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits from the distributed file system and write intermediate data locally; reduce workers perform remote reads and sorts, then write the output files]
(From Jeff Ullman's course slides)

Map Reduce vs. Parallel Databases
- Map Reduce is widely used for parallel processing: Google, Yahoo, and hundreds of other companies.
- Example uses: computing PageRank, building keyword indices, analyzing web click logs, ...
- Database people say: parallel databases have been doing this for decades.
- Map Reduce people say: we operate at scales of thousands of machines, we handle failures seamlessly, and we allow procedural code in map and reduce over data of any type.

Implementations
- Google: not available outside Google.
- Hadoop: an open-source implementation in Java; uses HDFS for stable storage.
- Aster Data: a cluster-optimized SQL database that also implements MapReduce.
- And several others, such as Cassandra at Facebook.

Reading
- Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", http://labs.google.com/papers/mapreduce.html
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", http://labs.google.com/papers/gfs.html

MapReduce in English
Map: In this phase, a User Defined Function (UDF), also called Map, is executed on each record in a given file. The file is typically striped across many computers, and many processes (called Mappers) work on the file in parallel. The output of each call to Map is a list of <KEY, VALUE> pairs.
Shuffle: This is a phase that is hidden from the programmer.
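The map / shuffle / reduce phases described above, and the word-count pseudo-code, can be simulated in a few lines of plain Python. This is a single-process sketch, not Hadoop code; the document names and contents are made up.

```python
from collections import defaultdict

# Single-process sketch of the word-count MapReduce job.
# The documents below are invented illustration data.

def map_fn(doc_name, doc_contents):
    """Map: emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in doc_contents.split()]

def reduce_fn(word, counts):
    """Reduce: add up all the counts collected for one word."""
    return word, sum(counts)

docs = {"d1": "big data big analytics", "d2": "big data"}

# Map phase: run map_fn on every record.
intermediate = []
for name, contents in docs.items():
    intermediate.extend(map_fn(name, contents))

# Shuffle phase: group values by key (the framework does this for you).
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: one reduce_fn call per reduce record.
result = dict(reduce_fn(word, counts) for word, counts in groups.items())
print(result)   # {'big': 3, 'data': 2, 'analytics': 1}
```

Here the shuffle step is just a dictionary grouping; in a real cluster it is the network phase whose details are described next.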
All the <KEY, VALUE> pairs are sent to another group of computers, such that all pairs with the same KEY go to the same computer, chosen uniformly at random from this group and independently of all other keys. At each destination computer, <KEY, VALUE> pairs with the same KEY are aggregated together. So if <x, y1>, <x, y2>, ..., <x, yK> are all the key-value pairs produced by the Mappers with the same key x, then at the destination computer for key x these get aggregated into one large pair <x, {y1, y2, ..., yK}>; observe that there is no ordering guarantee on the values. The aggregated <KEY, VALUE> pair is typically called a Reduce Record, and its key is referred to as the Reduce Key.
Reduce: In this phase, a UDF, also called Reduce, is applied to each Reduce Record, often by many parallel processes. Each process is called a Reducer. For each invocation of Reduce, one or more records may get written to a local output file.

Hadoop

Why is Hadoop exciting?
- Blazing speed at low cost on commodity hardware.
- Linear scalability.
- A highly scalable data store with a good parallel programming model, MapReduce.
- It doesn't solve all problems, but it is a strong solution for many tasks.

What is Hadoop? For the executives:
- Hadoop is an Apache open source software project.
- It gives you value from the volume/velocity/variety of data you have.

What is Hadoop? For technical managers:
- An open source suite of software that mines your structured and unstructured big data.

What is Hadoop? For legal:
- An open source suite of software that is packaged and supported by multiple suppliers,
licensed under the Apache v2 license.

The Apache v2 License
A licensee of Apache License v2 software can:
- copy, modify and distribute the covered software in source and/or binary forms
- exercise patent rights that would normally extend only to the licensor
provided that:
- all copies, modified or unmodified, are accompanied by a copy of the license
- all modifications are clearly marked as being the work of the modifier
- all notices of copyright, trademark and patent rights are reproduced accurately in distributed copies
- the licensee does not use any trademarks that belong to the licensor
Furthermore, the grant of patent rights is specifically withdrawn if the licensee starts legal action against the licensor(s) over patent infringements within the covered software.

What is Hadoop? For engineering:
- A massively parallel, shared-nothing, Java-based map-reduce execution environment.
- Hundreds to thousands of computers working on the same problem, with built-in failure resilience.
- Projects in the Hadoop ecosystem provide data loading, higher-level languages, automated cloud deployment, and other capabilities.
- A Kerberos-secured software suite.

What are the components of Hadoop?
Two core components:
- a file store called the Hadoop Distributed File System (HDFS)
- a programming framework called MapReduce

The Hadoop ecosystem: HDFS, FlumeNG, MapReduce, Whirr, Hadoop Streaming, Mahout, Hive and Hue, Fuse, Pig, Zookeeper, Sqoop, HBase.

Hadoop Components
- HDFS spreads data over thousands of nodes. The Datanodes store your data, and the Namenode keeps track of where everything is stored.
- Pig: a higher-level programming environment for MapReduce coding.
- Sqoop: data transfer between Hadoop and relational databases.
- HBase: a highly scalable key-value store.
- Whirr: cloud provisioning for Hadoop.

[figure: a three-layer stack, with HDFS at the bottom, a Mapper layer reading 64+ MB blocks, and a Reduce layer consuming the shuffled/sorted mapper output]
HDFS, the bottom layer, sits on a cluster of commodity hardware. For a map-reduce job, the mapper layer reads from the disks at very high speed.
The mapper emits key-value pairs that are sorted and presented to the reducer, and the reducer layer summarizes the key-value pairs.

Hadoop and relational databases?
- Hadoop integrates very well with relational databases.
- Apache Sqoop is used for moving data between Hadoop and relational databases.

Some elementary references
- "Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL. How to get started with Hadoop and your favorite databases." Marty Lurie (marty@cloudera.com), Systems Engineer, Cloudera.
  http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

The Big Data India scene
- Will restrict myself to the academic scene.
- Almost every institute has courses and researchers in this space, but not with the label "Big Data".
- Found only one course with this title.

The Big Data toolkit, again: A/B testing, association rule learning, classification, cluster analysis, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, anomaly detection, predictive modeling, regression, sentiment analysis, simulation, time series analysis, visualisation.

Courses
- Machine Learning
- Natural Language Processing
- Data Mining
- Soft Computing
- Statistics

MOOC on Big Data
- Coursera, starting 24 March 2013, 10 weeks.
- Dr. Gautam Shroff, Department of Computer Science and Engineering, Indian Institute of Technology Delhi.

http://www.mu-sigma.com/
Mu Sigma, one of the world's largest decision sciences and analytics firms, helps companies institutionalize data-driven decision making and harness big data.

http://www.veooz.com/

What is Veooz?
- Pronounced as "views".
- Helps you get a quick overview of, understand, and gain insights from the views/opinions of users on different social media platforms:
- Facebook, Twitter, Google+, LinkedIn, news sites, blogs, ...
- Track views/opinions expressed by social media users on people, places, products, movies, events, brands, ...
- Billions of views on millions of topics in one place.

Goal: Organize thoughts and interactions in social media in real time.

Veooz: a real-time social media search and analytics engine.

Social media is a good proxy for the real world.

From social media monitoring, to social media listening, to social intelligence: a new social power.

Most social media data is noisy
Not easy, because of:
- spelling errors and variations (e.g. variant spellings of "messenger", "message" and "problem")
- short forms and abbreviations
- semantic equivalents
- hashtag mapping
- social media SPAM
- irony and sarcasm
- negation

Text analysis and processing
- Context/semantic-level vs. plain text-level processing.
- Topic-level aggregation, fine grained; detecting variations of the topic.
- Using prior global sentiment in computing current sentiment.
Why does noise matter? Because it makes sentiment computation and deeper text processing very hard.

[figure: the Sentiment Expression Axis, spanning literal expression (opinions vs. non-opinions, intensity/graded expressions), special symbols (emoticons, punctuation transgression, grapheme stretching, abbreviations), non-literal expression (metaphor, sarcasm, irony, oxymoron), SPAM (incorrect/ill-intentioned content), and user engagement (reputation/influence, content, user actions, user reactions, social relations)]

Conferences on Big Data
http://www.bda2013.net/
Important dates (research, tutorial, industry):
- Abstract submission deadline: June 30, 2013
- Paper submission deadline: July 7, 2013
- Notification to authors: August 23, 2013
- Camera-ready submission: September 4, 2013

Indian Institutes of Technology (IITs)
IITs are a group of fifteen autonomous engineering- and technology-oriented institutes of higher education, established and declared Institutes of National Importance by the Parliament of India.
IITs were created to train scientists and engineers, with the aim of developing a skilled workforce to support the economic and social development of India after independence in 1947.

Original IITs
1. As a step in this direction, the first IIT was established in 1951 in Kharagpur (near Kolkata) in the state of West Bengal.
2. IIT Bombay was founded in 1958 at Powai, Mumbai, with assistance from UNESCO and the Soviet Union, which provided technical expertise.
3. IIT Madras is located in the city of Chennai in Tamil Nadu. It was established in 1959 with technical assistance from the Government of West Germany.
4. IIT Kanpur was established in 1959 in the city of Kanpur, Uttar Pradesh. During its first 10 years, IIT Kanpur benefited from the Kanpur Indo-American Programme (KIAP), under which a consortium of nine US universities provided assistance.
5. IIT Delhi was established as the College of Engineering in 1961, located in Hauz Khas, and was later renamed IIT Delhi.
6. IIT Guwahati was established in 1994 near the city of Guwahati (Assam) on the bank of the Brahmaputra River.
7. IIT Roorkee, originally known as the University of Roorkee, was established in 1847 as the first engineering college of the British Empire. Located in Uttarakhand, the college was renamed The Thomson College of Civil Engineering in 1854. It became the first technical university of India in 1949, when it was renamed the University of Roorkee, and it was included in the IIT system in 2001.

New IITs
1. Patna (Bihar)
2. Jodhpur (Rajasthan)
3. Hyderabad (Andhra Pradesh)
4. Mandi (Himachal Pradesh)
5. Bhubaneshwar (Orissa)
6. Indore (Madhya Pradesh)
7. Gandhinagar (Gujarat)
8. Ropar (Punjab)

Admission
- Admission to undergraduate B.Tech., M.Sc., and dual degree (BT-MT) programs is through the Joint Entrance Examination (JEE).
- About 1 out of 100 applicants gets in.

Features
- IITs receive large grants compared to other engineering colleges in India: about Rs. 1,000 million per year for each IIT.

Features (cont.)
The availability of resources has translated into superior infrastructure and qualified faculty in the IITs, and consequently higher competition among students to gain admission to the IITs.

Features (cont.)
The government has no direct control over the internal policy decisions of IITs (such as faculty recruitment) but has representation on the IIT Council.

Features (cont.)
All over the world, IIT degrees are respected, largely due to the prestige created by very successful alumni.

Success story
Other factors contributing to the success of the IITs are stringent faculty recruitment procedures and industry collaboration. This combination of success factors has led to the concept of the "IIT Brand".

Success story (cont.)
The IIT brand was reaffirmed when the United States House of Representatives passed a resolution honouring Indian Americans, and especially graduates of IIT, for their contributions to American society. Similarly, China has also recognised the value of the IITs and has planned to replicate the model.

Indian Institute of Technology Kanpur
Indian Institute of Technology Kanpur is one of the premier institutions, established in 1959 by the Government of India.

IITK (cont.)
"to provide meaningful education, to conduct original research of the highest standard and to provide leadership in technological innovation for the industrial growth of the country"

IITK (cont.)
Under the guidance of the eminent economist John Kenneth Galbraith, IIT Kanpur was the first institute in India to start computer science education. The Institute now has its own residential campus spread over 420 hectares of land.

Statistics
- Undergraduate students: 3679
- Postgraduate students: 2039
- Ph.D. students: 1064
- Faculty: 351
- Research staff: 30
- Supporting staff: 900
- Alumni: 26900

Departments
- Sciences: Chemistry, Physics, Mathematics & Statistics
- Engineering: Aerospace, Bio-Sciences and Bioengineering, Chemical, Civil, Computer Science & Engineering, Electrical, Industrial & Management Engineering, Mechanical, Materials Science & Engineering
- Humanities and Social Sciences
- Interdisciplinary: Environmental Engineering & Management, Laser Technology, Master of Design, Materials Science Programme, Nuclear Engineering & Technology

Thank you