Department of Computer Engineering
Lab Manual
Final Year Semester-VII
Subject: Computational Lab- I (Big Data Analytics)
Institutional Vision and Mission
Our Vision
To foster and permeate higher and quality education with value-added engineering and technology programs, providing all facilities in terms of technology and platforms for all-round development with societal awareness, and to nurture the youth with international competencies and an exemplary level of employability even under a highly competitive environment, so that they are innovative, adaptable and capable of handling problems faced by our country and the world at large.
Our Mission
The Institution is committed to mobilizing the resources and equipping itself with men and materials of excellence, thereby ensuring that the Institution becomes a pivotal center of service to industry, academia, and society with the latest technology. RAIT engages different platforms such as
technology enhancing Student Technical Societies, Cultural platforms, Sports excellence centers,
Entrepreneurial Development Center and Societal Interaction Cell. To develop the college to become
an autonomous Institution & deemed university at the earliest with facilities for advanced research
and development programs on par with international standards. To invite international and reputed
national Institutions and Universities to collaborate with our institution on the issues of common
interest of teaching and learning sophistication.
Our Quality Policy
It is our earnest endeavour to produce high quality engineering professionals who are
innovative and inspiring, thought and action leaders, competent to solve problems faced
by society, nation and world at large by striving towards very high standards in learning,
teaching and training methodologies.
Our Motto: If it is not of quality, it is NOT RAIT!
Dr. Vijay D. Patil, President, RAES
Departmental Vision and Mission
Vision
To impart higher and quality education in computer science with value added engineering and
technology programs to prepare technically sound, ethically strong engineers with social awareness.
To extend the facilities, to meet the fast changing requirements and nurture the youths with
international competencies and exemplary level of employability and research under highly
competitive environments.
Mission
To mobilize the resources and equip the institution with men and materials of excellence to provide
knowledge and develop technologies in the thrust areas of computer science and Engineering. To
provide the diverse platforms of sports, technical, curricular and extracurricular activities for the
overall development of student with ethical attitude. To prepare the students to sustain the impact of
computer education for social needs encompassing industry, educational institutions and public
service. To collaborate with IITs, reputed universities and industries for the technical and overall
upliftment of students for continuing learning and entrepreneurship.
Departmental Program Educational
Objectives (PEOs)
1. Learn and Integrate
To provide Computer Engineering students with a strong foundation in the mathematical,
scientific and engineering fundamentals necessary to formulate, solve and analyze
engineering problems and to prepare them for graduate studies.
2. Think and Create
To develop an ability to analyze the requirements of the software and hardware, understand
the technical specifications, create a model, design, implement and verify a computing system
to meet specified requirements while considering real-world constraints to solve real world
problems.
3. Broad Base
To provide broad education necessary to understand the science of computer engineering and
the impact of it in a global and social context.
4. Techno-leader
To provide exposure to emerging cutting edge technologies, adequate training &
opportunities to work as teams on multidisciplinary projects with effective communication
skills and leadership qualities.
5. Practice citizenship
To provide knowledge of professional and ethical responsibility and to contribute to society
through active engagement with professional societies, schools, civic organizations or other
community activities.
6. Clarify Purpose and Perspective
To provide strong in-depth education through electives and to promote student awareness on
the life-long learning to adapt to innovation and change, and to be successful in their
professional work or graduate studies.
Departmental Program Outcomes (POs)
PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO3 : Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate consideration
for the public health and safety, and the cultural, societal, and environmental considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for
sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
PO10 : Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive clear
instructions.
PO11 : Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and leader
in a team, to manage projects and in multidisciplinary environments.
PO12 : Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Program Specific Outcomes: PSO
PSO1: To build competencies towards problem solving with an ability to understand, identify,
analyze and design the problem, implement and validate the solution including both hardware and
software.
PSO2: To build an appreciation of, and acquire knowledge of, current computer techniques with an ability to use the skills and tools necessary for computing practice.
PSO3: To be able to match the industry requirements in the area of computer science and engineering, and to be equipped with the skills to adopt and imbibe new technologies.
Index
1. List of Experiments
2. Course Objective, Course Outcome & Experiment Plan
3. CO-PO, CO-PSO Mapping
4. Study and Evaluation Scheme
5. Experiment No. 1
6. Experiment No. 2
7. Experiment No. 3
8. Experiment No. 4
9. Experiment No. 5
10. Experiment No. 6
11. Experiment No. 7
12. Experiment No. 8
Mini Project
List of Experiments
1. Installation of Hadoop Framework, its components and study of the Hadoop ecosystem.
2. Write a program to implement the word count program using MapReduce.
3. Experiment on Hadoop Map-Reduce / PySpark: implementing simple algorithms in Map-Reduce, matrix multiplication.
4. Install and configure MongoDB / Cassandra / HBase / Hypertable to execute NoSQL commands.
5. Implement the DGIM algorithm using any programming language / implement a Bloom filter using any programming language.
6. Implement and perform streaming data analysis using Flume for data capture and PySpark / Hive for data analysis of Twitter data, chat data, weblog analysis, etc.
7. Implement any one clustering algorithm (K-Means / CURE) using Map-Reduce.
8. Implement the PageRank algorithm using Map-Reduce.
Mini Project: One real-life large data application to be implemented (use standard datasets available on the web).
Course Objective, Course Outcome &
Experiment Plan
Course Outcomes:
CO1: Identify the key issues in big data management and experiment with the Hadoop framework.
CO2: Develop problem solving and critical thinking skills in fundamental enabling techniques like Hadoop and MapReduce.
CO3: Construct and explain structured and unstructured data by using NoSQL commands.
CO4: Implement fundamental enabling techniques and scalable algorithms for data stream mining.
CO5: Implement scientific computing algorithms for finding similar items and clustering.
CO6: Analyse the algorithms of big data analytics in various applications like recommender systems and social media applications.
Experiment Plan:
Module 1, Week W1: Installation of Hadoop Framework, its components and study of the Hadoop ecosystem. (CO1, Weightage 10)
Module 2, Week W2: Write a program to implement the word count program using MapReduce. (CO2, Weightage 5)
Module 3, Week W3: Experiment on Hadoop Map-Reduce / PySpark: implementing simple algorithms in Map-Reduce, matrix multiplication. (CO2, Weightage 5)
Module 4, Week W4: Install and configure MongoDB / Cassandra / HBase / Hypertable to execute NoSQL commands. (CO3, Weightage 10)
Module 5, Week W5: Implement the DGIM algorithm using any programming language / implement a Bloom filter using any programming language. (CO4, Weightage 5)
Module 6, Week W6: Implement and perform streaming data analysis using Flume for data capture and PySpark / Hive for data analysis of Twitter data, chat data, weblog analysis, etc. (CO4, Weightage 5)
Module 7, Week W7: Implement any one clustering algorithm (K-Means / CURE) using Map-Reduce. (CO5, Weightage 10)
Module 8, Week W8: Implement the PageRank algorithm using Map-Reduce. (CO6, Weightage 4)
Week W9: Mini Project, one real-life large data application to be implemented (use standard datasets available on the web). (CO6, Weightage 6)
Mapping of Course Outcomes with Program Outcomes:
Sub. Weight: PR 40 %

Course Outcomes:
CO1: Identify the key issues in big data management and experiment with the Hadoop framework.
CO2: Develop problem solving and critical thinking skills in fundamental enabling techniques like Hadoop and MapReduce.
CO3: Construct and explain structured and unstructured data by using NoSQL commands.
CO4: Implement fundamental enabling techniques and scalable algorithms for data stream mining.
CO5: Implement scientific computing algorithms for finding similar items and clustering.
CO6: Analyse the algorithms of big data analytics in various applications like recommender systems and social media applications.

Contribution to Program Outcomes (PO1 to PO12), values 1 to 3:
2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 3, 2
Mapping of Course Outcomes with Program Specific Outcomes:
Course Outcomes and their contribution to Program Specific Outcomes (PSO1, PSO2, PSO3):
CO1: Identify the key issues in big data management and experiment with the Hadoop framework. (3, 3, 3)
CO2: Develop problem solving and critical thinking skills in fundamental enabling techniques like Hadoop and MapReduce. (2, 3, 3)
CO3: Construct and explain structured and unstructured data by using NoSQL commands. (2, 3, 3)
CO4: Implement fundamental enabling techniques and scalable algorithms for data stream mining. (2, 2, 3)
CO5: Implement scientific computing algorithms for finding similar items and clustering. (3, 3, 2)
CO6: Analyze the algorithms of big data analytics in various applications like recommender systems and social media applications. (2, 2, 3)
Study and Evaluation Scheme
Course Code: CSDLO7032    Course Name: Big Data Analytics
Teaching Scheme: Theory 04, Practical --, Tutorial --
Credits Assigned: Theory 04, Practical --, Tutorial --, Total 04
Examination Scheme: Theory 80, Internal Assessment 20, Term Work --, Practical & Oral --, Total 100

Course Code: CSL704    Course Name: Computational Lab - I
Teaching Scheme: Theory --, Practical 02, Tutorial --
Credits Assigned: Theory --, Practical 01, Tutorial --, Total 01
Examination Scheme: Theory --, Internal Assessment --, Term Work 25, Practical & Oral 25, Total 50
Term Work:
1. Term work assessment must be based on the overall performance of the student with
every experiment graded from time to time. The grades should be converted into
marks as per the Credit and Grading System manual and should be added and
averaged.
2. The final certification and acceptance of term work ensures satisfactory performance
of laboratory work and minimum passing marks in term work.
The distribution of marks for term work shall be as follows:
Laboratory work (experiments + mini project): 15
Report and documentation: 05
Attendance (theory & practical): 05
Total: 25
Practical & Oral:
Laboratory work shall consist of a minimum of 08 experiments, a mini project and 3 assignments based on the above theory syllabus.
The final certification and acceptance of term work ensures satisfactory performance of laboratory work and minimum passing marks in term work.
Practical and oral exam will be based on the above syllabus.
Big Data Analytics
Experiment No. : 1
Installation of Hadoop Framework, its components and study of the Hadoop ecosystem
Experiment No. 1
1. Aim: Installation of Hadoop Framework, its components and study of the Hadoop ecosystem.
2. Objectives:
 To introduce the tools required to manage and analyze big data.
• To be familiar with open-source frameworks like Hadoop and their features and tools.
3. Outcomes: The students will be able to
• Understand the key issues in big data management and get familiar with the Hadoop framework used for big data analytics.
4. Theory:
Hadoop is an open-source framework that allows storing and processing big data in a distributed environment across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage.
Hadoop Architecture:
The Apache Hadoop framework includes the following four modules:
Hadoop Common: Contains the Java libraries and utilities needed by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and comprise the essential Java files and scripts that are required to start Hadoop.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data on commodity machines, thus providing very high aggregate bandwidth across the cluster.
Hadoop YARN: A resource-management framework responsible for job scheduling and
cluster resource management.
Hadoop MapReduce: This is a YARN-based programming model for parallel processing of large data sets.
Hadoop Ecosystem:
Hadoop has gained its popularity due to its ability to store, analyze and access large amounts of data quickly and cost-effectively through clusters of commodity hardware. It won't be wrong to say that Apache Hadoop is actually a collection of several components and not just a single product.
Within the Hadoop ecosystem there are several commercial as well as open-source products which are broadly used to make Hadoop more accessible and usable for laymen.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. In terms of programming, there are two functions which are most common in MapReduce.
• The Map Task: The master computer or node takes the input, divides it into smaller parts and distributes them to other worker nodes. All worker nodes solve their own small problems and give their answers to the master node.
• The Reduce Task: The master node combines all the answers coming from the worker nodes and forms them into some form of output, which is the answer to our big distributed problem.
Generally both the input and the output are stored in a file system. The framework is responsible for scheduling tasks, monitoring them and re-executing the failed tasks.
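A minimal, framework-free Python sketch of this split / solve / combine pattern (purely illustrative; real Hadoop jobs use the MapReduce API, and the sample data below is made up):

def map_task(chunk):
    # each worker solves its own small problem: here, a partial sum
    return sum(chunk)

def reduce_task(partial_results):
    # the master combines all worker answers into the final answer
    return sum(partial_results)

if __name__ == "__main__":
    data = list(range(1, 101))                                   # the "big" input
    chunks = [data[i:i + 25] for i in range(0, len(data), 25)]   # split step
    partials = [map_task(c) for c in chunks]                     # map phase (would run in parallel)
    print(reduce_task(partials))                                 # 5050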
Hadoop Distributed File System (HDFS)
HDFS is a distributed file-system that provides high throughput access to data. When data is
pushed to HDFS, it automatically splits up into multiple blocks and stores/replicates the data
thus ensuring high availability and fault tolerance.
Note: A file consists of many blocks (large blocks of 64MB and above).
Here are the main components of HDFS:
• NameNode: It acts as the master of the system. It maintains the name system, i.e., directories and files, and manages the blocks which are present on the DataNodes.
• DataNodes: They are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests for the clients.
• Secondary NameNode: It is responsible for performing periodic checkpoints. In the event of NameNode failure, you can restart the NameNode using the checkpoint.
Hive
Hive is part of the Hadoop ecosystem and provides an SQL-like interface to Hadoop. It is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
HBase (Hadoop DataBase)
HBase is a distributed, column-oriented database that uses HDFS for the underlying storage. As said earlier, HDFS works on a write-once, read-many-times pattern, but this isn't always the case. We may require real-time read/write random access for a huge dataset; this is where HBase comes into the picture. HBase is built on top of HDFS as a distributed, column-oriented database.
Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization and providing group services which are very useful for
a variety of distributed systems. HBase is not operational without ZooKeeper.
Mahout
Mahout is a scalable machine learning library that implements various approaches to machine learning. At present Mahout contains four main groups of algorithms:
• Recommendations, also known as collective filtering
• Classifications, also known as categorization
• Clustering
• Frequent itemset mining, also known as parallel frequent pattern mining
Sqoop (SQL-to-Hadoop)
Sqoop is a tool designed for efficiently transferring structured data from SQL Server and SQL Azure to HDFS, where it can then be used in MapReduce and Hive jobs. One can even use Sqoop to move data from HDFS to SQL Server.
Apache Spark:
Apache Spark is a general compute engine that offers fast data analysis on a large scale. Spark is built on HDFS but bypasses MapReduce and instead uses its own data processing framework. Common use cases for Apache Spark include real-time queries, event stream processing, iterative algorithms, complex operations and machine learning.
Pig
Pig is a platform for analyzing and querying huge data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig's built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
Oozie
Apache Oozie is a workflow/coordination system to manage Hadoop jobs.
Flume
Flume is a framework for harvesting, aggregating and moving huge amounts of log data or text files into and out of Hadoop. Agents are populated throughout one's IT infrastructure, inside web servers, application servers and mobile devices. Flume itself has a query processing engine, so it is easy to transform each new batch of data before it is shuttled to the intended sink.
Ambari:
Ambari was created to help manage Hadoop. It offers support for many of the tools in the
Hadoop ecosystem including Hive, HBase, Pig, Sqoop and Zookeeper. The tool features a
management dashboard that keeps track of cluster health and can help diagnose performance
issues.
Installation Steps –
Following are the steps to install Apache Hadoop on Ubuntu 14.04.
Step1. Install Java (OpenJDK) - Since Hadoop is based on Java, make sure you have the Java JDK installed on the system. Please check the version of Java (it should be 1.7 or above):
$ java -version
If it returns "The program java can be found in the following packages", Java has not been installed yet, so execute the following command:
$ sudo apt-get install default-jdk
Step2: Configure Apache Hadoop
1. Open the .bashrc file in gedit:
$ sudo gedit ~/.bashrc
2. Set the Java environment variable:
export JAVA_HOME=/usr/jdk1.7.0_45/
3. Set the Hadoop environment variable:
export HADOOP_HOME=/usr/Hadoop 2.6/
4. Apply the environment variables to the session:
$ source ~/.bashrc
Step3: Install eclipse
Step4: Copy Hadoop plug-ins such as
• hadoop-eclipse-kepler-plugin-2.2.0.jar
• hadoop-eclipse-kepler-plugin-2.4.1.jar
• hadoop-eclipse-plugin-2.6.0.jar
from the release folder of hadoop2x-eclipse-plugin-master to the Eclipse plugins folder.
Step5: In eclipse, start new MapReduce project
File->new->other->MapReduce project
Step 6: Copy Hadoop packages such as commons-io-2.4.jar and commons-lang3-3.4.jar into the src folder of the MapReduce project.
Step 7: Create the Mapper, Reducer, and Driver:
Inside a project -> src -> File -> new -> other -> Mapper/Reducer/Driver
Step 8: Copy the log file log4j.properties from the src folder of Hadoop into the src folder of the MapReduce project.
5. Output Analysis –
(Different Test cases / Boundary Conditions)
Students should perform the experiment with all the different cases, analyse the results, and write the analysis in their own words.
6. Observations –
Write the observations here.
7. Additional Learning –
Note any additional information given by the faculty or understood by the student while performing the experiment, in the student's own words.
8. Conclusion
Hadoop is powerful because it is extensible and easy to integrate with any component. Its popularity is due in part to its ability to store, analyze and access large amounts of data, quickly and cost-effectively, across clusters of commodity hardware. Apache Hadoop is not actually a single product but a collection of several components. When all these components are merged, it makes Hadoop very user-friendly.
9. Viva Questions:
• What is Hadoop?
• What are the features of Hadoop?
• What are the different components of Hadoop?
10. References:
1. Alex Holmes, "Hadoop in Practice", Manning Press / Dreamtech Press.
2. Chuck Lam, "Hadoop in Action", Dreamtech Press.
Experiment No. : 2
Write a program to implement the word count program using MapReduce
Experiment No. 2
1. Aim: Implementing distinct word count problem using Map-Reduce
2. Objectives:
 To teach the fundamental techniques and principles in achieving big data
analytics with scalability and streaming capability.
 To understand importance of stream mining and techniques to implement it
3. Outcomes: The students will be able to
• Apply fundamental enabling techniques and scalable algorithms for the distinct word count problem.
4. Hardware / Software Required : Hadoop 2.x, eclipse
5. Theory:
This program finds the individual words present in a given input text document and counts how many times these words occur in it. For each map task, the key is the byte offset of the line and the value is one line of text; the map function is invoked once for each line of text.
The function of the mapper is as follows:
• Create an IntWritable variable 'one' with value 1.
• Convert the input line from Text type to a String.
• Use a tokenizer to split the line into words.
• Iterate through each word: assign each word from the tokenizer (of String type) to a Text 'word'.
• Form a key-value pair for each word as <word, one> and push it to the output collector.
The function of Sort and Group:
After this, "shuffling and sorting" is done by the framework, grouping together all values for the same key. The reducer tasks then take these grouped pairs and produce the output.
The function of the reducer is as follows:
• Initialize a variable 'sum' as 0.
• Iterate through all the values with respect to a key and sum them up.
• Push the key and the obtained sum as the value to the output collector.
For Example:
For the given sample input1 data file (input1.txt : Hello World Bye World) mapper emits:
<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
The second input2 data file (input2.txt : Hello Hadoop Goodbye Hadoop) mapper emits:
<Hello, 1>
<Hadoop, 1>
<Goodbye, 1>
<Hadoop, 1>
WordCount also specifies a combiner. Hence, the output of each map is passed through the
local combiner (which is same as the Reducer as per the job configuration) for local
aggregation, after being sorted on the keys.
The output of the first map:
<Bye, 1>
<Hello, 1>
<World, 2>
The output of the second map:
<Goodbye, 1>
<Hadoop, 2>
<Hello, 1>
The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
Thus the output of the job is:
<Bye, 1>
<Goodbye, 1>
<Hadoop, 2>
<Hello, 2>
<World, 2>
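The manual's experiment targets the Java MapReduce API in Eclipse; purely as an illustrative sketch, the same mapper/reducer logic can be written as a small Python script in the Hadoop Streaming style (the script name and the local test pipeline below are assumptions, not part of the original manual):

#!/usr/bin/env python3
# Word-count mapper/reducer sketch in the Hadoop Streaming style.
# Local test: cat input1.txt input2.txt | python3 wc.py map | sort | python3 wc.py reduce
import sys

def mapper():
    # emit <word, 1> for every token on every input line
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # input arrives sorted by key, so counts for a word are adjacent
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()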
Data Flow Diagram: File1.txt and File2.txt are each processed by their own mapper (step 1 and step 2); the framework sorts and groups the emitted key-value pairs, and the reducers combine them to produce the output distinct word count (DWC).
6. Output Analysis –
(Different Test cases / Boundary Conditions)
Students should perform the experiment with all the different cases, analyse the results, and write the analysis in their own words.
7. Observations –
Write the observations here.
8. Additional Learning –
Note any additional information given by the faculty or understood by the student while performing the experiment, in the student's own words.
9. Conclusion
Stream mining deals basically with the applications where real time decisions are required.
Distinct element finding is one such application. We have understood that map-reduce helps
finding the distinct elements through multiple mappers and reducers.
10. Viva Questions:
• What is stream mining?
• What are the problems in stream mining?
11. References:
1. Anand Rajaraman and Jeff Ullman, "Mining of Massive Datasets", Cambridge University Press.
2. Alex Holmes, "Hadoop in Practice", Manning Press / Dreamtech Press.
Big Data Analytics
Experiment No. : 3
Implementing simple algorithms in Map-Reduce: Matrix multiplication
Experiment No. 3
1. Aim: Implementation of Matrix Multiplication using MapReduce.
2. Objective:
• To learn the key issues in big data management and its tools and techniques, specifically the programming model of Hadoop.
• To understand the need for multiple mappers and reducers in analytics.
3. Outcomes: Students will be able to
• Apply various tools and techniques for big data analytics like Hadoop, MapReduce and NoSQL.
4. Hardware / Software Required : Hadoop 2.X version, eclipse
5. Theory:
Definitions:
M is a matrix with element m_ij in row i and column j.
N is a matrix with element n_jk in row j and column k.
P = MN is a matrix with element p_ik in row i and column k, where p_ik = Σ_j m_ij * n_jk.
The mapper function does not have access to the i, j, and k values directly; an extra MapReduce job has to be run initially in order to retrieve them.
The Map Function:
For each element m_ij of M, emit a key-value pair ((i, k), (M, j, m_ij)) for k = 1, 2, ..., number of columns of N.
For each element n_jk of N, emit a key-value pair ((i, k), (N, j, n_jk)) for i = 1, 2, ..., number of rows of M.
The Reduce Function:
For each key (i, k), emit the key-value pair ((i, k), p_ik), where p_ik = Σ_j m_ij * n_jk.
The product MN is almost a natural join followed by grouping and aggregation. That is, the natural join of M(I, J, V) and N(J, K, W), having only attribute J in common, would produce tuples (i, j, k, v, w) from each tuple (i, j, v) in M and tuple (j, k, w) in N. This five-component tuple represents the pair of matrix elements (m_ij, n_jk). What we want instead is the product of these elements, that is, the four-component tuple (i, j, k, v × w), because that represents the product m_ij * n_jk. Once we have this relation as the result of one MapReduce operation, we can perform grouping and aggregation, with I and K as the grouping attributes and the sum of V × W as the aggregation. That is, we can implement matrix multiplication as the cascade of two MapReduce operations, as follows.
Flow of the entire matrix multiplication using map-reduce:
The input file contains two matrices M and N. The entire logic is divided into two parts:
Step 1: Find the products.
Step 2: Find the sum of the products.
6. Algorithm
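A compact Python sketch of the two passes described above, purely as an illustration: it simulates the map and reduce phases with in-memory dictionaries instead of the Hadoop API, and the small matrices are made-up sample values.

from collections import defaultdict

# Matrices as sparse dicts: M[(i, j)] = m_ij, N[(j, k)] = n_jk
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
I, J, K = 2, 2, 2   # dimensions: M is I x J, N is J x K

# Map phase: emit ((i, k), (matrix, j, value)) as described above
groups = defaultdict(list)
for (i, j), v in M.items():
    for k in range(K):
        groups[(i, k)].append(("M", j, v))
for (j, k), w in N.items():
    for i in range(I):
        groups[(i, k)].append(("N", j, w))

# Reduce phase: for each key (i, k), pair up matching j values and sum the products
P = {}
for (i, k), values in groups.items():
    m_row = {j: v for tag, j, v in values if tag == "M"}
    n_col = {j: w for tag, j, w in values if tag == "N"}
    P[(i, k)] = sum(m_row[j] * n_col[j] for j in m_row if j in n_col)

print(P)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}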
7. Output Analysis –
(Different test cases / boundary conditions)
Students should perform the experiment with all the different cases, analyse the results, and write the analysis in their own words.
8. Observations –
Write the observations here.
9. Additional Learning –
Note any additional information given by the faculty or understood by the student while performing the experiment, in the student's own words.
10. Conclusion:
Thus, we have studied the use of multiple mappers and reducers for implementing
different tasks under one job.
11. Viva Questions:
• What is a driver class?
• What are the advantages of using multiple mappers and reducers for a single job?
12. References:
1. Alex Holmes, "Hadoop in Practice", Manning Press / Dreamtech Press.
2. Chuck Lam, "Hadoop in Action", Dreamtech Press.
Big Data Analytics
Experiment No. : 4
Install and configure MongoDB/
Cassandra/ HBase/ Hypertable to
execute NoSQL commands
Experiment No. 4
1. Aim: Install and Configure MongoDB to execute NoSQL Commands.
2. Objectives:
 To learn the key issues in big data management and its tools and techniques,
specifically programming module of Hadoop.
 To understand need of NoSql databases.
 To be familiar with NoSql databases.
3. Outcomes: The students will be able to
• Apply various tools and techniques for big data analytics like Hadoop, MapReduce and NoSQL.
4. Hardware / Software Required : MongoDB
5. Theory:
A NoSQL (originally referring to "non SQL" or "non relational") database provides a
mechanism for storage and retrieval of data which is modeled in means other than the tabular
relations used in relational databases. Such databases have existed since the late 1960s, but
did not obtain the "NoSQL" moniker until a surge of popularity in the early twenty-first
century, triggered by the needs of Web 2.0 companies such as Facebook, Google and
Amazon.com. NoSQL databases are increasingly used in big data and real-time web
applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that
they may support SQL-like query languages.
Relational databases were not designed to cope with the scale and agility challenges that face
modern applications, nor were they built to take advantage of the commodity storage and
processing power available today.
NoSQL Database Types
• Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
• Graph stores are used to store information about networks of data, such as social connections. Graph stores include Neo4J and Giraph.
• Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.
• Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
The Benefits of NoSQL
When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address:
• Large volumes of rapidly changing structured, semi-structured, and unstructured data.
• Agile sprints, quick schema iteration, and frequent code pushes.
• Object-oriented programming that is easy to use and flexible.
• Geographically distributed scale-out architecture instead of expensive, monolithic architecture.
MongoDB
It is an open-source document database, and a leading NoSQL database. MongoDB is written in C++. It is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB works on the concept of collections and documents.
MongoDB Installation steps (Ubuntu 18.04):
Step 1 — Update system
sudo apt-get update
Step 2 — Installing and Verifying MongoDB
Now we can install the MongoDB package itself.
sudo apt-get install -y mongodb
This command will install several packages containing latest stable version of MongoDB
along with helpful management tools for the MongoDB server.
After package installation MongoDB will be automatically started. You can check this by
running the following command.
sudo service mongod status
If MongoDB is running, you'll see an output like this (with a different process ID).
Output
mongod start/running, process 1611
You can also stop, start, and restart MongoDB using the service command (e.g. service
mongod stop, service mongod start).
Commands
In MongoDB, use DATABASE_NAME is used to create a database. The command creates a new database if it doesn't exist; otherwise, it returns the existing database.
Syntax:
Basic syntax of use DATABASE statement is as follows:
use DATABASE_NAME
Example
Suppose a client needs a database design for his blog website and see the differences between
RDBMS and MongoDB schema design. Website has the following requirements.
Every post has the unique title, description and url.
Every post can have one or more tags.
Every post has the name of its publisher and total number of likes.
Every post has comments given by users along with their name, message, date-time and likes.
On each post there can be zero or more comments.
The RDBMS schema design for the above requirements will have a minimum of three tables, while the MongoDB schema design will have one collection, post, with the following structure:
{
_id: POST_ID
title: TITLE_OF_POST,
description: POST_DESCRIPTION,
by: POST_BY,
url: URL_OF_POST,
tags: [TAG1, TAG2, TAG3],
likes: TOTAL_LIKES,
comments: [
{
user:'COMMENT_BY',
message: TEXT,
dateCreated: DATE_TIME,
like: LIKES
},
{
user:'COMMENT_BY',
message: TEXT,
dateCreated: DATE_TIME,
like: LIKES
}
]
}
So, while showing the data, in an RDBMS you need to join three tables, whereas in MongoDB the data will be shown from one collection only.
Basic syntax of use DATABASE statement is as follows:
use DATABASE_NAME
Example:
If you want to create a database with name <mydb>, then use DATABASE statement would
be as follows:
>use mydb
switched to db mydb
To check your currently selected database use the command db
>db
mydb
If you want to check your databases list, then use the command show dbs.
>show dbs
local   0.78125GB
test    0.23012GB
Your created database (mydb) is not present in the list. To display the database, you need to insert at least one document into it.
>db.movie.insert({"name":"tutorials point"})
>show dbs
local   0.78125GB
mydb    0.23012GB
test    0.23012GB
In MongoDB the default database is test. If you don't create any database, collections will be stored in the test database.
The createCollection() Method
MongoDB's db.createCollection(name, options) method is used to create a collection.
Syntax:
Basic syntax of createCollection() command is as follows
db.createCollection(name, options)
In the command, name is the name of the collection to be created, and options is a document used to specify the configuration of the collection.

Parameter: name - Type: String - Description: name of the collection to be created.
Parameter: options - Type: Document - Description: (optional) specify options about memory size and indexing.

The options parameter is optional, so you need to specify only the name of the collection. Following is the list of options you can use: max, size, etc.
While inserting a document, MongoDB first checks the size field of a capped collection, then it checks the max field.
You can check the created collection by using the command show collections
>show collections
The dropDatabase() Method
MongoDB's db.dropDatabase() command is used to drop an existing database.
Syntax:
Basic syntax of the dropDatabase() command is as follows:
db.dropDatabase()
This will delete the selected database. If you have not selected any database, it will delete the default 'test' database.
The insert() Method
To insert data into a MongoDB collection, you need to use MongoDB's insert() or save() method.
Syntax
Basic syntax of the insert() command is as follows:
>db.COLLECTION_NAME.insert(document)
If we don't specify the _id parameter in the inserted document, MongoDB assigns a unique ObjectId to the document.
_id is a 12-byte hexadecimal number that is unique for every document in a collection. The 12 bytes are divided as follows:
_id: ObjectId(4 bytes timestamp, 3 bytes machine id, 2 bytes process id, 3 bytes incrementer)
To insert multiple documents in a single query, you can pass an array of documents to the insert() command.
To insert a document you can also use db.post.save(document). If you don't specify _id in the document, the save() method will work the same as the insert() method; if you specify _id, it will replace the whole data of the document containing that _id.
The pretty() Method
To display the results in a formatted way, you can use pretty() method.
Syntax
>db.mycol.find().pretty()
The find() Method
To query data from a MongoDB collection, you need to use MongoDB's find() method.
Syntax
Basic syntax of the find() method is as follows:
>db.COLLECTION_NAME.find()
MongoDB's update() and save() methods are used to update documents in a collection. The update() method updates values in the existing document, while the save() method replaces the existing document with the document passed to save().
MongoDB update() Method
The update() method updates values in the existing document.
Syntax
Basic syntax of update() method is as follows
>db.COLLECTION_NAME.update(SELECTIOIN_CRITERIA, UPDATED_DATA)
MongoDB save() Method
The save() method replaces the existing document with the new document passed to save().
Syntax
Basic syntax of the save() method is shown below:
>db.COLLECTION_NAME.save({_id:ObjectId(),NEW_DATA})
The remove() Method
MongoDB's remove() method is used to remove documents from a collection. The remove() method accepts two parameters: one is the deletion criteria and the second is the justOne flag.
deletion criteria: (Optional) the criteria according to which documents will be removed.
justOne: (Optional) if set to true or 1, then only one document is removed.
Syntax:
Basic syntax of remove() method is as follows
>db.COLLECTION_NAME.remove(DELETION_CRITERIA)
Remove only one
If there are multiple records and you want to delete only the first record, set the justOne parameter in the remove() method:
>db.COLLECTION_NAME.remove(DELETION_CRITERIA,1)
Remove All Documents
If you don't specify deletion criteria, MongoDB will delete all documents from the collection. This is the equivalent of SQL's truncate command.
>db.mycol.remove()
>db.mycol.find()
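The same CRUD commands can also be driven from Python, assuming the pymongo driver is installed and mongod is running locally; this is only a sketch mirroring the shell commands above, with made-up collection and field names:

from pymongo import MongoClient

# Connect to a locally running mongod (assumed at the default port)
client = MongoClient("mongodb://localhost:27017/")
db = client["mydb"]            # equivalent of: use mydb
posts = db["post"]             # collection is created lazily on first insert

# insert
posts.insert_one({"title": "tutorials point", "tags": ["mongodb", "nosql"], "likes": 10})

# find
for doc in posts.find({"likes": {"$gte": 5}}):
    print(doc)

# update
posts.update_one({"title": "tutorials point"}, {"$set": {"likes": 11}})

# remove
posts.delete_one({"title": "tutorials point"})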
6. Output Analysis –
(Different Test cases / Boundary Conditions)
Students should perform the experiment with all the different cases, analyse the results, and write the analysis in their own words.
7. Observations –
Write the observations here.
8. Additional Learning –
Note any additional information given by the faculty or understood by the student while performing the experiment, in the student's own words.
9. Conclusion:
MongoDB is an open-source document database. Unlike SQL databases, it also works on unstructured data. We have understood the MongoDB architecture and performed the basic operations on a NoSQL database.
10. Viva Questions:
• What is the difference between SQL and NoSQL?
• What is a document store?
11. References:
Dan McCreary and Ann Kelly, "Making Sense of NoSQL: A Guide for Managers and the Rest of Us", Manning Press.
Big Data Analytics
Experiment No. : 5
Implement Bloom Filter using any
programming language.
Experiment No. : 5
1. Aim: Write a program to implement bloom filtering
2. Objectives:
• To teach the fundamental techniques and principles of filtering techniques.
• To understand the need for similarity measures for Bloom filtering.
3. Outcomes: The students will be able to
• Search for or filter a given word from a large collection of data.
4. Hardware / Software Required : Python, Hadoop 2.x, eclipse
5. Theory:
Suppose you are creating an account on Geekbook and you want to enter a cool username. You entered it and got the message "Username is already taken". You added your birth date to the username, still no luck. Now you added your university roll number as well, and still got "Username is already taken". It's really frustrating, isn't it?
But have you ever thought about how quickly Geekbook checks the availability of a username by searching through the millions of usernames registered with it? There are many ways to do this job:
• Linear search: Bad idea!
• Binary search: Store all usernames alphabetically and compare the entered username with the middle one in the list. If it matches, the username is taken; otherwise figure out whether the entered username would come before or after the middle one, and if it comes after, neglect all the usernames up to and including the middle one. Now search after the middle one and repeat this process until you get a match or the search ends with no match. This technique is better and promising, but it still requires multiple steps.
But there must be something better!
A Bloom filter is a data structure that can do this job.
To understand Bloom filters, you must know what hashing is. A hash function takes an input and outputs a fixed-length identifier which is used for identification of the input.
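As a quick illustration using Python's built-in hashlib (any hash function would do):

import hashlib

# The same input always yields the same fixed-length digest
print(hashlib.sha256(b"geeks").hexdigest())   # 64 hex characters
print(hashlib.sha256(b"nerd").hexdigest())    # a completely different digest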
What is Bloom Filter?
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. For example, checking the availability of a username is a set membership problem, where the set is the list of all registered usernames. The price we pay for efficiency is that it is probabilistic in nature, which means there might be some false positive results.
A false positive means it might tell you that a given username is already taken when actually it is not.
Interesting Properties of Bloom Filters
• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
• Adding an element never fails. However, the false positive rate increases steadily as elements are added, until all bits in the filter are set to 1, at which point all queries yield a positive result.
• Bloom filters never generate a false negative result, i.e., telling you that a username doesn't exist when it actually exists.
• Deleting elements from the filter is not possible, because if we delete a single element by clearing the bits at the indices generated by its k hash functions, we might cause deletion of a few other elements. For example, if we delete "geeks" (in the example below) by clearing bits 1, 4 and 7, we might end up deleting "nerd" also, because the bit at index 4 becomes 0 and the Bloom filter then claims that "nerd" is not present.
Working of a Bloom Filter
An empty Bloom filter is a bit array of m bits, all set to zero.
We need k hash functions to calculate the hashes for a given input. When we want to add an item to the filter, the bits at the k indices h1(x), h2(x), ..., hk(x) are set, where the indices are calculated using the hash functions.
Example – Suppose we want to enter "geeks" in the filter, we are using 3 hash functions and a bit array of length 10, all set to 0 initially. First we will calculate the hashes as follows:
h1("geeks") % 10 = 1
h2("geeks") % 10 = 4
h3("geeks") % 10 = 7
Note: these outputs are for explanation only.
Now we will set the bits at indices 1, 4 and 7 to 1.
Again we want to enter “nerd”, similarly we’ll calculate hashes
h1(“nerd”) % 10 = 3
h2(“nerd”) % 10 = 5
h3(“nerd”) % 10 = 4
Set the bits at indices 3, 5 and 4 to 1
Now if we want to check whether "geeks" is present in the filter or not, we do the same process, but this time we check the bits instead of setting them. We calculate the respective hashes using h1, h2 and h3 and check whether all these indices are set to 1 in the bit array. If all the bits are set, then we can say that "geeks" is probably present. If any of the bits at these indices is 0, then "geeks" is definitely not present.
False Positive in Bloom Filters
The question is why we said “probably present”, why this uncertainty. Let’s
understand this with an example. Suppose we want to check whether “cat” is present
or not. We’ll calculate hashes using h1, h2 and h3
h1(“cat”) % 10 = 1
h2(“cat”) % 10 = 3
h3(“cat”) % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know that "cat" was never added to the filter. The bits at indices 1 and 7 were set when we added "geeks", and the bit at index 3 was set when we added "nerd".
So, because the bits at the calculated indices are already set by some other items, the Bloom filter erroneously claims that "cat" is present, generating a false positive result. Depending on the application, this could be a huge downside or relatively okay.
We can control the probability of getting a false positive by controlling the size of the Bloom filter: more space means fewer false positives. If we want to decrease the probability of false positives, we have to use more hash functions and a larger bit array, which adds latency to adding items and checking membership.
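A minimal Bloom filter sketch in Python, deriving the k hash functions from hashlib with different salts; the sizes m and k below are arbitrary choices for illustration, not tuned values:

import hashlib

class BloomFilter:
    def __init__(self, m=1000, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m          # bit array of m bits, all zero

    def _indices(self, item):
        # derive k hash functions by salting a single base hash
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1       # set the k bits

    def might_contain(self, item):
        # True means "probably present"; False means "definitely not present"
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("geeks")
bf.add("nerd")
print(bf.might_contain("geeks"))   # True (probably present)
print(bf.might_contain("cat"))     # almost certainly False for this small filter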
6. Conclusion
We implemented and learned about Bloom filters; they are applicable in the following cases:
• Medium uses Bloom filters for recommending posts to users by filtering out posts that have already been seen by the user.
• Quora implemented a shared Bloom filter in the feed backend to filter out stories that people have seen before.
• The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
• Google BigTable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters to reduce disk lookups for non-existent rows or columns.
7. Viva Questions:
• What is the use of Bloom filtering?
• How does Bloom filtering work?
8. References:
1. Anand Rajaraman and Jeff Ullman, "Mining of Massive Datasets", Cambridge University Press.
2. https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/
Big Data Analytics
Experiment No. : 6
Implement and perform Streaming data
analysis using flume for data capture,
HIVE/PYSpark for data analysis of
twitter data, chat data, weblog analysis
etc.
Experiment No. 6
1. Aim: Write a program to implement a Twitter sentiment analysis system where we populate real-time sentiments for crisis management, service adjustment and target marketing using PySpark.
2. Objectives:
• To teach the fundamental techniques and principles of streaming data analysis using Apache Spark.
• To understand the need for streaming data analysis.
• To understand the techniques for analysis of Twitter data.
3. Outcomes: The students will be able to
• Use techniques of sentiment analysis for Twitter data, the success of a movie or political campaign success, and take appropriate decisions.
4. Hardware / Software Required : Python, Hadoop 2.x, eclipse
5. Theory:
What is Streaming?
Data Streaming is a technique for transferring data so that it can be processed as a steady and
continuous stream. Streaming technologies are becoming increasingly important with the
growth of the Internet.
Why Spark Streaming?
We can use Spark Streaming to stream real-time data from various sources like Twitter,
Stock Market and Geographical Systems and perform powerful analytics to help businesses.
Spark Streaming Overview
Spark Streaming is used for processing real-time streaming data. It is a useful addition to
the core Spark API. Spark Streaming enables high-throughput and fault-tolerant stream
processing of live data streams.
Spark Streaming Features
1. Scaling: Spark Streaming can easily scale to hundreds of nodes.
2. Speed: It achieves low latency.
3. Fault Tolerance: Spark has the ability to efficiently recover from failures.
4. Integration: Spark integrates with batch and real-time processing.
5. Business Analysis: Spark Streaming is used to track the behavior of customers, which can be used in business analysis.
Spark Streaming Fundamentals
1. Streaming Context
2. DStream
3. Caching
4. Accumulators, Broadcast Variables and Checkpoints
Streaming Context
Streaming Context consumes a stream of data in Spark. It registers an Input
DStream to produce a Receiver object. It is the main entry point for Spark
functionality. Spark provides a number of default implementations of sources like
Twitter, Akka Actor and ZeroMQ that are accessible from the context.
DStream
Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming. It
is a continuous stream of data. It is received from a data source or a processed data
stream generated by transforming the input stream.
Caching
DStreams allow developers to cache / persist the stream's data in memory. This is useful if the data in the DStream will be computed multiple times. This can be done using the persist() method on a DStream.
Accumulators, Broadcast Variables and Checkpoints
Accumulators: Accumulators are variables that are only added through an associative and
commutative operation. They are used to implement counters or sums. Tracking
accumulators in the UI can be useful for understanding the progress of running stages. Spark
natively supports numeric accumulators. We can create named or unnamed accumulators.
Broadcast Variables: Broadcast variables allow the programmer to keep a read-only
variable cached on each machine rather than shipping a copy of it with tasks. They can be
used to give every node a copy of a large input dataset in an efficient manner. Spark also
attempts to distribute broadcast variables using efficient broadcast algorithms to reduce
communication cost.
Checkpoints: Checkpoints are similar to checkpoints in gaming. They make the streaming application run 24/7 and make it resilient to failures unrelated to the application logic.
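A minimal PySpark Streaming sketch of this kind of analysis, assuming a naive keyword-based sentiment score; the socket source stands in for the Flume / Twitter feed, and the word lists are placeholders rather than a real sentiment lexicon:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Placeholder word lists; a real system would use a sentiment lexicon or model
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "sad", "hate", "angry"}

def score(tweet):
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

sc = SparkContext("local[2]", "TwitterSentimentSketch")
ssc = StreamingContext(sc, 10)                 # 10-second micro-batches

# Stand-in source: text lines on a local socket (e.g. fed by Flume or `nc -lk 9999`)
tweets = ssc.socketTextStream("localhost", 9999)

# Label each incoming line as positive / negative / neutral and count per batch
sentiments = tweets.map(lambda t: ("positive" if score(t) > 0
                                   else "negative" if score(t) < 0
                                   else "neutral", 1))
counts = sentiments.reduceByKey(lambda a, b: a + b)
counts.pprint()                                # print the counts for each batch

ssc.start()
ssc.awaitTermination()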
Applications of Sentiment Analysis:
• Predict the success of a movie
• Predict political campaign success
• Decide whether to invest in a certain company
• Targeted advertising
• Review products and services

6. Conclusion –
We have implemented and learned Twitter sentiment analysis using PySpark.
7. References –
https://www.edureka.co/blog/spark-streaming/#Use_Case_Twitter_Sentiment_Analysis
Big Data Analytics
Experiment No. : 7
Implement K-Means Clustering
algorithm using Map-Reduce
Experiment No. 7
1. Aim: Write a program to implement K-means Clustering algorithm using MapReduce
2. Objectives:
• To teach the fundamental techniques and principles in achieving big data analytics with scalability and streaming capability.
• To understand the need for similarity measures for clustering similar objects.
• To understand the techniques to cluster the data.
3. Outcomes: The students will be able to
• Analyze the similarities between objects and use this analysis for grouping these objects in a large dataset.
4. Hardware / Software Required : Hadoop 2.x, eclipse
5. Theory:
Data clustering is the partitioning of a data set or sets of data into similar subsets. During
the process of data clustering a method is often required to determine how similar one
object or groups of objects is to another. This method is usually encompassed by some
kind of distance measure. Data clustering is a common technique used in data analysis
and is used in many fields including statistics, data mining, and image analysis. There are
many types of clustering algorithms. Hierarchical algorithms build successive clusters
using previously defined clusters. Hierarchical algorithms can be agglomerative meaning
they build clusters by successively merging smaller ones, which is also known as
bottom-up. They can also be divisive meaning they build clusters by successively
splitting larger clusters, which is also known as top-down. Clustering algorithms can also
be partitional meaning they determine all clusters at once.
K-Means Algorithm
K-means clustering is a classical clustering algorithm that uses an expectation
maximization like technique to partition a number of data points into k clusters.
K-means clustering is commonly used for a number of classification applications.
Because k-means is run on such large data sets, and because of certain characteristics of
the algorithm, it is a good candidate for parallelization.
In this problem, we have considered as input a set of n one-dimensional points and a desired number of clusters equal to 3.
Once the k initial centers are chosen, the (Euclidean) distance is calculated from every point in the set to each of the 3 centers, and each point together with its nearest center is emitted by the mapper. The reducer collects all of the points of a particular centroid, calculates a new centroid and emits it.
Termination Condition:
When difference between old and new centroid is less than or equal to 0.1
6. Algorithm
1. Initially, centroids are selected randomly from the data (e.g. 3 centroids).
2. The input file contains the initial centroids and the data.
3. The mapper first opens the file, reads the centroids and stores them in a data structure (we used an ArrayList).
4. The mapper reads the data file and emits the nearest centroid along with the point to the reducer.
5. The reducer collects all this data, calculates the new corresponding centroids and emits them.
6. In the job configuration, we read both files and check:
   if the difference between the old and new centroids is less than 0.1, convergence is reached;
   else repeat from step 2 with the new centroids.
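A small plain-Python sketch of this map/reduce iteration on one-dimensional points (the points and initial centroids are made-up sample values; a real job would run the map and reduce steps as Hadoop tasks):

from collections import defaultdict

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 15.0, 16.0, 16.5]
centroids = [2.0, 9.0, 16.0]          # current centroids (initially random)

def kmeans_iteration(points, centroids):
    # Map: emit (nearest_centroid, point) for every point
    groups = defaultdict(list)
    for p in points:
        nearest = min(centroids, key=lambda c: abs(p - c))
        groups[nearest].append(p)
    # Reduce: for each centroid key, compute the new centroid (mean of its points)
    return [sum(ps) / len(ps) for ps in groups.values()]

# Driver: iterate until the centroids move by no more than 0.1
while True:
    new_centroids = kmeans_iteration(points, centroids)
    if (len(new_centroids) == len(centroids) and
            all(abs(a - b) <= 0.1 for a, b in zip(sorted(new_centroids), sorted(centroids)))):
        break
    centroids = new_centroids

print(sorted(centroids))   # final cluster centers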
7. Conclusion
As data clustering has attracted a significant amount of research attention, many clustering
algorithms have been proposed in the past decades. We have understood that K-means is the
simplest algorithm for clustering and works efficiently for numeric data.
8. Viva Questions:
• What is the use of cluster analysis?
• How does k-means work?
9. References:
1. Anand Rajaraman and Jeff Ullman, "Mining of Massive Datasets", Cambridge University Press.
2. Alex Holmes, "Hadoop in Practice", Manning Press / Dreamtech Press.
Big Data Analytics
Experiment No. : 8
Implementation of Page Rank
Algorithm
Experiment No. 8
1. Aim: Implement Page Rank algorithm using Map Reduce.
2. Objectives:
• To teach the fundamental techniques and principles in achieving big data analytics with scalability and streaming capability.
• To understand the importance of link analysis.
• To study the working of page ranking.
3. Outcomes: The students will be able to
• Analyze web links for relevant information retrieval.
4. Theory:
Link analysis is a data analysis technique used in network theory to evaluate the relationships or connections between network nodes. These relationships can be between various types of objects (nodes), including people, organizations and even transactions.
PageRank is a way of measuring the importance of website pages. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of measuring its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is referred to as the PageRank of E and denoted by PR(E). Other factors like Author Rank can contribute to the importance of an entity. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself.
Figure: Mathematical PageRank for a simple network.
Formula
Eqn.1 is the formula to calculate the rank value for each webpage. We will learn this
formula by applying it to the case in Fig.1. There are 11 webpages in Fig.1, which
include: {A, B, C, D, E, F, G1, G2, G3, G4, G5}. Assuming the probability distribution
for a web surfer accessing all these 11 pages in current iteration is {PR(A), PR(B), PR(C),
… PR(G5)}, then the probability for the surfer to access Page B in the next iteration is:
PR(B) = PR(D)/2 + PR(E)/3 + PR(F)/2 + PR(C) + PR(G1)/2 + PR(G2)/2 + PR(G3)/2
In a general case, the PageRank value for any page u can be expressed as:
PR(u) = Σ_{v ∈ B(u)} PR(v) / L(v)        ...(Eqn. 1)
The set B(u) on the right of the formula contains all the webpages that point to the target webpage u. L(v) refers to the out-degree of webpage v in that set. The initial rank value of each webpage, PR'(u), can be any double value. After several iterations, the rank values converge to the stationary distribution regardless of what their initial values are.
Damping factor
The PageRank theory holds that even an imaginary surfer who is randomly clicking on
links will eventually stop clicking. The probability, at any step, that the person will
continue is a damping factor d. Various studies have tested different damping factors, but
it is generally assumed that the damping factor will be around 0.85. The formula considering the damping factor is shown in Eqn. 2, where N refers to the total number of unique URLs:

PR(u) = (1 - d) / N + d * Σ_{v ∈ B(u)} PR(v) / L(v)        ...(Eqn. 2)
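A small Python sketch of iterative PageRank with damping (Eqn. 2) on a made-up four-page link graph; in a MapReduce implementation, the inner loop would be the map phase (each page emitting PR(v)/L(v) shares to its out-links) and the per-page summation the reduce phase:

# Toy link graph: page -> list of pages it links to (made-up example)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
d = 0.85                               # damping factor
N = len(links)
ranks = {p: 1.0 / N for p in links}    # initial distribution

for _ in range(50):                    # iterate until (approximate) convergence
    new_ranks = {p: (1 - d) / N for p in links}
    # "map" step: each page v contributes PR(v)/L(v) to every page it links to
    for v, outs in links.items():
        share = d * ranks[v] / len(outs)
        for u in outs:
            # "reduce" step: contributions for the same target page are summed
            new_ranks[u] += share
    ranks = new_ranks

print({p: round(r, 4) for p, r in ranks.items()})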
5. Conclusion
Thus we have understood that PageRank works by counting the number and quality of links to a
page to determine a rough estimate of how important the website is.
6. Viva Questions:
• What is link analysis?
• How does the PageRank algorithm work?
7. References:
1. Anand Rajaraman and Jeff Ullman, "Mining of Massive Datasets", Cambridge University Press.
Big Data Analytics
Mini Project
Case Study Implementation
Mini Project
1. Aim: Case Study: One real life large data application to be implemented (Use
standard Datasets available on the web)
a. Twitter data analysis
b. Fraud Detection
c. Text Mining etc.
2. Objectives:
• To provide an overview of an exciting, growing field of big data analytics.
• To enable students to have skills that will help them solve complex real-world problems for decision support.
3. Outcomes: The students will be able to
• Achieve an adequate perspective of big data analytics in various applications like recommender systems, social media applications, etc.
4. References:
1. Anand Rajaraman and Jeff Ullman, "Mining of Massive Datasets", Cambridge University Press.
2. Alex Holmes, "Hadoop in Practice", Manning Press / Dreamtech Press.
3. https://acadgild.com/blog