Big Data and Hadoop on Windows
.Net SIG Cleveland
Image credit: morguefile.com/creative/imelench

About Me
Serkan Ayvaz, Sr. Systems Analyst, Cleveland Clinic
PhD Candidate, Computer Science, Kent State University
LinkedIn: serkanayvaz@gmail.com
Email: ayvazs@ccf.org
Twitter: @sayvaz

Agenda
• Introduction to Big Data
• Hadoop Framework
• Hadoop on Windows
• Ecosystem
• Conclusions

What is Big Data? ("Hype?")
"Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization." -Wikipedia

What is new?
• Enterprise data grows rapidly
• An emerging market for vendors
• New data sources
• Competitive industries - the need for more insights
• Asking different questions
• Generating models instead of transforming data into models

What is the problem?
• Size of data: rapid growth; TBs to PBs are the norm for many organizations. As of 2012, data sets on the order of exabytes were at the limit of what was feasible to process in a reasonable amount of time.
• Variety of data: relational, device-generated, mobile, logs, web data, sensor networks, social networks, etc. - structured, unstructured, and semi-structured.
• Rate of data growth: as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created. -Wikipedia
• Particularly large datasets arise in meteorology, genomics, complex physics simulations, biological and environmental research, Internet search, finance, and business informatics.

Critique
• Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, "big data", no matter how comprehensive or well analyzed, needs to be complemented by "big judgment", according to an article in the Harvard Business Review.
• Consumer privacy concerns raised by the increasing storage and integration of personal information.

Things to consider
• Return on investment may differ
• Asking the wrong questions won't get you the right answers
• Finding experts who fit the organization
• Requires a leadership decision
• You might be fine with traditional systems (for now)

What is Hadoop?
• Scalability
  - Scales horizontally and seamlessly; vertical scaling has limits
  - Moves processing to the data, as opposed to traditional methods: network bandwidth is a limited resource
  - Processes data sequentially in chunks, avoiding random access: seeks are expensive, disk throughput is reasonable
• Fault tolerance
  - Data replication
• Hadoop Core
  - HDFS: storage
  - MapReduce: processing
• Economical
  - Commodity servers ("not low-end") vs. specialized servers
• Ecosystem
  - Integration with other tools
• Open source
  - Innovative, extensible

What can I do with Hadoop?
• Distributed programming (MapReduce)
• Storage, archiving legacy data
• Transforming data
• Analysis, ad hoc reporting
• Looking for patterns
• Monitoring / processing logs
• Abnormality detection
• Machine learning and advanced algorithms
• Many more

HDFS Blocks
• Large enough to minimize the cost of seeks - 64 MB by default
• A unit of abstraction that makes storage management simpler than whole files
• Fits well with the replication strategy and availability

NameNode
• Maintains the filesystem tree and metadata for all the files and directories
• Stores the namespace image and edit log

DataNode
• Stores and retrieves blocks
• Reports its blocks back to the NameNode periodically
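The NameNode also answers plain HTTP requests through WebHDFS, Hadoop's REST interface. As a taste of what talking to HDFS from .NET can look like, here is a minimal sketch (not from the slides); it assumes a NameNode reachable as "namenode" on the Hadoop 1.x default port 50070 with WebHDFS enabled, and the path /user/demo is a placeholder.

// Minimal WebHDFS sketch. Assumptions: NameNode host "namenode",
// default port 50070, dfs.webhdfs.enabled set to true on the cluster.
// LISTSTATUS returns a JSON array of FileStatus objects.
using System;
using System.Net;

class WebHdfsList
{
    static void Main()
    {
        // WebHDFS REST endpoints follow /webhdfs/v1/<path>?op=<OPERATION>
        var url = "http://namenode:50070/webhdfs/v1/user/demo?op=LISTSTATUS";

        using (var client = new WebClient())
        {
            // Raw JSON; a real client would deserialize it into objects.
            string json = client.DownloadString(url);
            Console.WriteLine(json);
        }
    }
}

The same endpoint style covers reads and writes (op=OPEN, op=CREATE, op=MKDIRS), which is roughly what the Microsoft.Hadoop.WebClient package mentioned later wraps for you.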
HDFS
Good:
• Designed for, and shines with, large files; Hadoop breaks data into smaller blocks
• Fault tolerance - data replication within and across racks
• Data locality
• Most efficient with a write-once, read-many-times pattern

Not so good:
• Low-latency data access: HDFS is optimized for high-throughput data, which may come at the expense of latency. Consider HBase for low latency.
• Lots of small files: the NameNode holds filesystem metadata in memory, which limits the number of files in a filesystem.
• Multiple writers and arbitrary file modifications: files in HDFS may be written to by only a single writer.

[Figure: HDFS write and read data flows. Source: Hadoop: The Definitive Guide]

MapReduce Programming
• Splits input files into blocks
• Operates on key-value pairs
• Mappers filter & transform the input data
• Reducers aggregate the mappers' output
• Handles parallel processing efficiently
• Moves code to the data (data locality); the same code runs on all machines
• Can be difficult to implement some algorithms
• Can be implemented in almost any language
  - Streaming MapReduce for Python, Ruby, Perl, PHP, etc.
  - Pig Latin as a data flow language
  - Hive for SQL users

MapReduce
Programmers write two functions:
  map (k, v) → <k', v'>*
  reduce (k', v') → <k', v'>*
All values with the same key are reduced together.

For efficiency, programmers typically also write:
  partition (k', number of partitions) → partition for k'
  - Often a simple hash of the key, e.g., hash(k') mod n; divides up the key space for parallel reduce operations
  combine (k', v') → <k', v'>*
  - Mini-reducers that run in memory after the map phase; used as an optimization to reduce network traffic
The framework takes care of the rest of the execution.

Simple example - Word Count

// MapReduce functions in JavaScript
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};

[Figure: Divide and conquer - map tasks emit key-value pairs, combiners pre-aggregate them locally, partitioners assign keys to reducers, the shuffle and sort phase aggregates values by key, and reduce tasks produce the output.]

How MapReduce Works

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);

Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);

Source: Hadoop: The Definitive Guide
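To make the shuffle-and-sort step concrete, here is a small, self-contained C# sketch (not from the slides) that simulates the word-count pipeline in memory with LINQ: the map step emits (word, 1) pairs, GroupBy plays the role of shuffle and sort, and the reduce step sums each group.

// In-memory simulation of the word-count MapReduce flow. Illustration
// only; real Hadoop distributes these steps across many machines.
using System;
using System.Linq;

class WordCountSimulation
{
    static void Main()
    {
        string[] docs = { "the quick brown fox", "the lazy dog", "the fox" };

        var counts = docs
            // Map: emit a (word, 1) pair for every word in every document.
            .SelectMany(doc => doc.Split(' ')
                .Select(word => new { Key = word, Value = 1 }))
            // Shuffle and sort: bring all values with the same key together.
            .GroupBy(pair => pair.Key)
            // Reduce: aggregate the values for each key.
            .Select(g => new { Word = g.Key, Count = g.Sum(p => p.Value) });

        foreach (var kv in counts)
            Console.WriteLine(kv.Word + "\t" + kv.Count);  // e.g. "the  3"
    }
}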
How is it different from other systems?
Parallel systems - Message Passing Interface (MPI):
• Good for compute-intensive jobs, but an issue with larger data volumes: network bandwidth becomes the bottleneck and compute nodes sit idle
• The challenge of coordinating the processes in a large-scale distributed computation
• Hard to implement: handling partial failure, managing checkpointing and recovery

Comparing MapReduce to RDBMSs

             Traditional RDBMS           MapReduce
Data size    Gigabytes                   Petabytes
Access       Interactive and batch       Batch
Updates      Read and write many times   Write once, read many times
Structure    Static schema               Dynamic schema
Integrity    High                        Low
Scaling      Nonlinear                   Linear

MapReduce
• MapReduce is complementary to RDBMSs, not competing
• MapReduce is a good fit for analyzing a whole dataset in batch, and suits applications where the data is written once and read many times
• An RDBMS is good for point queries or updates on a relatively small amount of data, indexed to deliver low-latency retrieval, and for datasets that are continually updated

Hadoop on Windows Overview
• HDInsight
  - HDInsight Server
  - HDInsight on the cloud
  - Familiar tools & functionality
• Hortonworks Data Platform
  - Windows platform
  - 100% open source
  - Contributions to the community
• Apache Hadoop Core
  - Common framework, shared by all distributions
  - Open source community

Hadoop on Windows
• Standard Hadoop modules: HDFS, MapReduce, Pig, Hive, monitoring pages
• Easy installation and configuration
• Integration with Microsoft systems: Active Directory, System Center, etc.

Why is Hadoop on Windows important?
• Windows Server's large market share
• Large developer and user community
• Existing enterprise tools; familiarity
• Simplicity of use and management
• Deployment options on both Windows Server and Windows Azure
• User self-service tools: data viewers, BI, visualization

[Diagram: HADOOP (server and cloud) - jobs written in Java, .NET, or other languages via Streaming, alongside SQL-style access through HiveQL, Pig Latin, and NoSQL interfaces, all over HDFS data (unstructured, semi-structured, structured): legacy data, RDBMSs, and external data from the web, mobile devices, and social media.]

Run Jobs
• Submit a JAR file (Java MapReduce)
• HiveQL
• Pig Latin
• .NET wrapper through Streaming
• .NET MapReduce
• LINQ to Hive
• JavaScript console
• Excel Hive Add-In

.NET MapReduce Example
NuGet packages:
  install-package Microsoft.Hadoop.MapReduce
  install-package Microsoft.Hadoop.Hive
  install-package Microsoft.Hadoop.WebClient

• Reference "Microsoft.Hadoop.MapReduce.dll"
• Create a class that implements "HadoopJob<YourMapper>"
• Create a mapper class (here, "SqrtMapper") that implements "MapperBase"
• Run the DLL using the MRRunner utility:
  > MRRunner -dll MyDll -class MyClass -- extraArg1 extraArg2
• Or submit the job from your own executable:
  var hadoop = Hadoop.Connect();
  hadoop.MapReduceJob.ExecuteJob<JobType>(arguments);

public class FirstJob : HadoopJob<SqrtMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "input/SqrtJob";
        config.OutputFolder = "output/SqrtJob";
        return config;
    }
}

public class SqrtMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        int inputValue = int.Parse(inputLine);

        // Perform the work.
        double sqrt = Math.Sqrt((double)inputValue);

        // Write the output data.
        context.EmitKeyValue(inputValue.ToString(), sqrt.ToString());
    }
}
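Putting the pieces together, a driver might look like the sketch below. The Hadoop.Connect() and ExecuteJob<T>() calls are the ones shown above; the Program scaffolding around them is assumed for illustration.

// Hypothetical driver for FirstJob. Connect() and ExecuteJob<T>() come
// from the slides; the surrounding scaffolding is assumed.
using Microsoft.Hadoop.MapReduce;

class Program
{
    static void Main(string[] args)
    {
        // Connect to the Hadoop cluster configured for this machine.
        var hadoop = Hadoop.Connect();

        // Submit FirstJob; its Configure() supplies the input and
        // output paths, and extra arguments pass through to the job.
        hadoop.MapReduceJob.ExecuteJob<FirstJob>(args);
    }
}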
Hadoop Ecosystem
• Hadoop: Common, MapReduce, HDFS
• HBase: column-oriented distributed database
• ZooKeeper: distributed coordination service
• Mahout: data mining algorithms
• Sqoop: tool for bulk import/export between HDFS, HBase, Hive, and relational databases
• Pig: data transformation language
• Hive: distributed data warehouse - SQL-like query platform
• Oozie: job running and scheduling workflow service

What's HBase?
• Column-oriented distributed database
• Inspired by Google BigTable
• Uses HDFS
• Interactive processing; can be used with or without MapReduce
• PUT, GET, SCAN commands

What's Hive?
• Translates HiveQL, a language similar to SQL, into MapReduce
• A distributed data warehouse
• HDFS table file format
• Integrates with BI products on tabular data via Hive ODBC and JDBC drivers

Hive
Good for:
• HiveQL - a familiar, high-level language
• Batch jobs and ad hoc queries
• Self-service BI tools via ODBC, JDBC
• Schemas, though not as strict as traditional RDBMSs
• Supports UDFs
• Easy access to Hadoop data

Not so good for:
• No updates or deletes - insert only
• Limited indexes and built-in optimizer, no caching
• Not OLTP
• Not as fast as hand-written MapReduce

Conclusion
Hadoop is great for its purposes and is here to stay:
• Easier to scale as needs grow
• Economical commodity hardware
• Parallelization
• Integration with the Windows platform - existing systems, tools, and expertise
• Relatively short training and application development time with Windows
BUT:
• It is not a common cure for every problem
• Developing standards and best practices is very important
• Users may abuse the resources and scalability

Resources & References
• Hadoop: The Definitive Guide, Tom White. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
• Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010.
• Apache Hadoop: http://hadoop.apache.org/
• Microsoft Big Data page: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/businessintelligence/big-data.aspx
• Hortonworks Data Platform: http://hortonworks.com/products/hortonworksdataplatform/
• Hadoop SDK: http://hadoopsdk.codeplex.com/

Thank you! Any questions?