SIX WEEKS SUMMER TRAINING REPORT
on
Modern Big Data Analysis with SQL Specialization

Submitted by
Name: Bonigi Syam Prasad
Reg no: 11902565
Program name: Computer Science and Engineering

Under the guidance of
Glynn Durham, Ian Cook

School of Computer Science & Engineering
Lovely Professional University, Phagwara
(June-July, 2021)

DECLARATION

I hereby declare that I have completed my six weeks summer training at Coursera from 20-05-2021 to 27-08-2021 under the guidance of Glynn Durham and Ian Cook. I declare that I have worked with full dedication during these six weeks of training and that my learning outcomes fulfil the training requirements for the Modern Big Data Analysis with SQL Specialization, Lovely Professional University, Phagwara.

(Signature of student)
Name of Student: Bonigi Syam Prasad
Registration no: 11902565
Date: 30-07-21

Acknowledgement

I would like to express my special thanks of gratitude to Lovely Professional University for encouraging me to take this wonderful course, the Modern Big Data Analysis with SQL Specialization, as part of the six weeks summer training program; it also helped me understand a lot of things related to my minor subject, Data Science. Secondly, I would like to thank my instructors Glynn Durham and Ian Cook, who taught this entire course. The specialization gives an overview of database systems and the distinction between operational and analytical databases, shows how database and table design provides structures for working with data, and explains the features and benefits of SQL dialects designed to work with big data systems for storage and analysis, all through the common querying language, SQL. I also value the contributions of everyone else who shared their wealth of knowledge, wisdom and experience.

Summer training Certificate from Cloudera

Table of contents
1. Introduction
2. Technology Learnt
3. Reason for choosing this Technology
4. Foundations For Big Data Analysis With SQL
5. Analyzing Big Data With SQL
6. Managing Big Data In Clusters And Cloud Storage
7. Implementation
8. Learning Outcomes
9. Gantt chart
10. Project legacy

Introduction

This Specialization teaches the essential skills for working with large-scale data using SQL. For this we need to know about SQL and Big Data. SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system, or for stream processing in a relational data stream management system. It is a standard language for storing, manipulating and retrieving data in databases. Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it is not the amount of data that is important: big data can be analysed for insights that lead to better decisions and strategic business moves. Big data is a field that treats ways to analyse, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Today, more and more of the data being generated is too big to be stored in traditional operational databases, and it is growing too quickly to be stored efficiently in commercial data warehouses. Instead, it is increasingly stored in distributed clusters and cloud storage. To query these huge datasets in clusters and cloud storage, you need a newer breed of SQL engine such as Hive, Impala, Presto and Drill.
These are open-source SQL engines capable of querying enormous datasets. This specialization focuses on Hive and Impala, the most widely deployed of these query engines.

In this specialization there are three sub-courses. The first course is Foundations for Big Data Analysis with SQL. In this course we learn to distinguish operational from analytical databases, understand how these are applied in big data, and understand how database and table design provides structures for working with data. The second course is Analyzing Big Data with SQL. This course gives an in-depth look at the SQL SELECT statement and its main clauses. It focuses on the big data SQL engines Hive and Impala, but most of the information is applicable to SQL with traditional RDBMSs. By the end of the course, we are able to explore and navigate databases and tables using different tools, understand the basics of SELECT statements, and explore grouping and aggregation to answer analytic questions; we also work with sorting and limiting results and combining multiple tables in different ways. The third course is Managing Big Data in Clusters and Cloud Storage. In this course we learn how to manage big datasets, how to load them into clusters and cloud storage, and how to apply structure to the data so that we can run queries on it using distributed SQL engines like Hive and Impala; we also learn how to choose the right data types, storage systems and file formats. By the end of the course we are able to use different tools to browse existing databases and tables in big data systems and to explore files in distributed big data filesystems and cloud storage.

TECHNOLOGY LEARNT

In this specialization we used VMware Workstation Player. VMware Workstation is a hosted hypervisor that runs on x64 versions of Windows and Linux operating systems; it enables users to set up virtual machines on a single physical machine and use them simultaneously along with the host machine. Along with this software we need to download and install a virtual machine (supplied by Cloudera) and the software on which to run it. In the Cloudera virtual machine we used Impala and Hive for queries. Impala is an open-source massively parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop. Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

(Figure: the interface of Impala, where we can give different queries against the existing datasets; customers, employees, offices, orders and salary grades are the different tables of the default dataset.)
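For instance, once the VM is running, a simple query can be typed into the Impala query editor against one of these default tables. This is a minimal sketch; the column names used here are assumptions for illustration, not taken from the course material:

-- List the five most recent orders from the default dataset
-- (assumes the sample "orders" table has customer_id and orderdate columns)
SELECT customer_id, orderdate
FROM orders
ORDER BY orderdate DESC
LIMIT 5;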
Reason for choosing this Technology:

This specialization, Modern Big Data Analysis with SQL, covers big data and SQL, topics that are essential for Data Science. Course 1 describes big data, the types of data, and where data can be stored. Course 2 covers SQL: the ACID properties, the different clauses of SQL such as SELECT, FROM, WHERE, HAVING, GROUP BY and ORDER BY, and grouping and aggregation to answer analytic questions, along with sorting and limiting results and combining multiple tables in different ways. Course 3 describes managing big data in clusters and cloud storage and how to apply structure to data so that you can run queries on it. As a data science aspirant, these courses helped me a lot in understanding Big Data and database management systems, how data works in real life, and how we can manipulate data in different ways using SQL.

FOUNDATIONS FOR BIG DATA ANALYSIS WITH SQL

In big data, data means digital data: information that can be transmitted, stored, and processed using modern digital technologies like the internet, disk drives, and modern computers. Data itself can be divided into two kinds, analog data and digital data. Analog data is data that is represented in a physical way, whereas digital data is represented as numbers that computing machines can interpret. The difference between analog and digital is in how the information or data is measured. Analog data attempts to be continuous and identify every nuance of what is being measured, while digital data uses sampling to encode what is being measured.

A database management system (DBMS) is software for storing and retrieving users' data while considering appropriate security measures. In large systems, a DBMS helps users and other third-party software store and retrieve data, and it allows users to create their own databases as per their requirements.

Structured Query Language (SQL) is a programming language used to communicate with relational databases. Relational databases store data in tables consisting of columns and rows, similar to a spreadsheet. Spreadsheets allow only simple manipulation of stored data, while relational databases, with the help of SQL, allow complex manipulation of the data. Relational databases are the most used technology for accessing structured data.

In the context of SQL, data definition language (DDL) is a syntax for creating and modifying database objects such as tables, indices, and users. DDL statements are similar to a computer programming language for defining data structures, especially database schemas. DDL statements include:
• CREATE - creates the database or its objects.
• ALTER - alters the structure of the database.
• DROP - deletes objects from the database.

Data Query Language (DQL) statements are used for performing queries on the data within schema objects. The purpose of a DQL command is to get some schema relation based on the query passed to it. DQL statements include:
• SELECT - retrieves data from the database.

Data Manipulation Language (DML) comprises the SQL commands that deal with the manipulation of data present in the database, and this includes most of the SQL statements. DML statements include:
• INSERT - inserts data into a table.
• UPDATE - updates existing data within a table.
• DELETE - deletes records from a database table.

Data Control Language (DCL) includes commands that mainly deal with the rights, permissions and other controls of the database system. DCL includes:
• GRANT - gives users access privileges to the database.
• REVOKE - withdraws the access privileges previously given with GRANT.

Transaction Control Language (TCL) commands deal with transactions within the database. TCL includes:
• COMMIT - commits a transaction.
• ROLLBACK - rolls back a transaction in case an error occurs.
• SAVEPOINT - sets a savepoint within a transaction.
• SET TRANSACTION - specifies characteristics for the transaction.
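To make these categories concrete, the following minimal sketch shows one statement from each sublanguage (the table, column and user names are illustrative, not from the course material):

-- DDL: define a structure
CREATE TABLE accounts (id INT, owner VARCHAR(30), balance DECIMAL(10,2));
-- DML: change the data
INSERT INTO accounts VALUES (1, 'John', 500.00);
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- DQL: read the data
SELECT owner, balance FROM accounts;
-- DCL: control access
GRANT SELECT ON accounts TO analyst;
-- TCL: control the transaction
COMMIT;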
ACID Properties

A transaction is a very small unit of a program and it may contain several low-level tasks. A transaction in a database system must maintain Atomicity, Consistency, Isolation, and Durability − commonly known as the ACID properties − in order to ensure accuracy, completeness, and data integrity.

• Atomicity − This property states that a transaction must be treated as an atomic unit; that is, either all of its operations are executed or none. There must be no state in a database where a transaction is left partially completed. States should be defined either before the execution of the transaction or after the execution/abortion/failure of the transaction.
• Consistency − The database must remain in a consistent state after any transaction. No transaction should have any adverse effect on the data residing in the database. If the database was in a consistent state before the execution of a transaction, it must remain consistent after the execution of the transaction as well.
• Durability − The database should be durable enough to hold all its latest updates even if the system fails or restarts. If a transaction updates a chunk of data in a database and commits, then the database will hold the modified data. If a transaction commits but the system fails before the data can be written to disk, then the data will be updated once the system springs back into action.
• Isolation − In a database system where more than one transaction is being executed simultaneously and in parallel, the property of isolation states that all the transactions will be carried out and executed as if each were the only transaction in the system. No transaction will affect the existence of any other transaction.
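Returning to atomicity, the classic illustration is a money transfer: both updates must succeed together or not at all. This is a hedged sketch reusing the hypothetical accounts table from the earlier example:

-- Transfer 100 from account 1 to account 2 as one atomic unit
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- If either update failed, ROLLBACK would restore the original state
COMMIT;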
Operational Databases: An operational database is a database that is used to manage and store data in real time. An operational database is the source for a data warehouse. Elements in an operational database can be added and removed on the fly. These databases can be either SQL or NoSQL-based, where the latter is geared toward real-time operations.

Analytical Databases: An analytic database, also called an analytical database, is a read-only system that stores historical data on business metrics such as sales performance and inventory levels. Business analysts, corporate executives and other workers run queries and reports against an analytic database. The information is regularly updated to include recent transaction data from an organization's operational systems.

Datatypes In SQL: Data types represent the nature of the data that can be stored in a database table. For example, if we want to store string data in a particular column of a table, we have to declare a string data type for that column. For every database, data types are mainly classified into three categories:
• String data types
• Numeric data types
• Date and time data types

Some of the data types are:
• CHAR: specifies a fixed-length string that can contain numbers, letters, and special characters. Its size can be 0 to 255 characters. The default is 1.
• VARCHAR: specifies a variable-length string that can contain numbers, letters, and special characters. Its size can be from 0 to 65,535 characters.
• INT: used for integer values. Its signed range is -2147483648 to 2147483647 and its unsigned range is 0 to 4294967295. The size parameter specifies the maximum display width, which is 255.
• INTEGER: equal to INT(size).
• FLOAT: specifies a floating-point number. Its size parameter specifies the total number of digits; the number of digits after the decimal point is specified by a second parameter.
• FLOAT(p): specifies a floating-point number. MySQL uses the p parameter to determine whether to use FLOAT or DOUBLE: if p is between 0 and 24, the data type becomes FLOAT; if p is from 25 to 53, the data type becomes DOUBLE.
• BOOL: specifies the Boolean values true and false. Zero is considered false, and non-zero values are considered true.
• DATE: specifies a date in the format YYYY-MM-DD. Its supported range is from '1000-01-01' to '9999-12-31'.
• YEAR: specifies a year in four-digit format. The values allowed in four-digit format are 1901 to 2155, and 0000.
• TEXT(Size): holds a string with a maximum length of 65,535 characters.
• TINYTEXT: holds a string with a maximum length of 255 characters.
• MEDIUMTEXT: holds a string with a maximum length of 16,777,215 characters.
• LONGTEXT: holds a string with a maximum length of 4,294,967,295 characters.

Table 1: card_rank
Columns:
sno  Name   Type     Comment
1    rank   string   pk*
2    value  tinyint  same as rank if it is a number, otherwise NULL

Sample:
rank  value
Ace   NULL
2     2
3     3

Table 2: card_suit
Columns:
sno  Name   Type    Comment
1    suit   string  pk*
2    color  string  RED or BLACK suit color

Sample:
suit      color
Clubs     Black
Diamonds  Red
Hearts    Red

Analyzing Big Data With SQL

In this course, you get an in-depth look at the SQL SELECT statement and its main clauses. The course focuses on the big data SQL engines Apache Hive and Apache Impala, but most of the information is applicable to SQL with traditional RDBMSs.

SELECT statement: The SELECT statement is the most important part of the SQL language. The options for what you can do with a SELECT statement are so extensive that SELECT forms its own category of SQL statements called queries. It is the most common operation in SQL, called "the query". SELECT retrieves data from one or more tables or expressions. Standard SELECT statements have no persistent effects on the database; some non-standard variants can have persistent effects, such as the SELECT INTO syntax provided in some databases. The SELECT statement is used to select data from a database.

FROM Clause: The SQL FROM clause is the source of a rowset to be operated upon in a Data Manipulation Language statement. FROM clauses are very common, and provide the rowset to be exposed through a SELECT statement, the source of values in an UPDATE statement, and the target rows to be deleted in a DELETE statement. FROM is an SQL reserved word in the SQL standard. The FROM clause is used in conjunction with SQL statements, and takes the following general form:
SQL-DML-statement FROM table_name WHERE predicate

WHERE Clause: The SQL WHERE clause is used to specify a condition while fetching data from a single table or by joining multiple tables. Only the rows that satisfy the given condition are returned from the table. You should use the WHERE clause to filter the records and fetch only the necessary records. The WHERE clause is not only used in the SELECT statement; it is also used in the UPDATE and DELETE statements, among others. The basic syntax of the SELECT statement with the WHERE clause is shown below:
SELECT column1, column2 FROM table_name WHERE [condition]
You can specify a condition using comparison or logical operators like >, <, =, LIKE, NOT, etc.
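For example, a query like the following returns only the numbered cards from the card_rank table described above (an illustrative sketch; value is NULL for the face cards):

SELECT rank, value
FROM card_rank
WHERE value IS NOT NULL;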
GROUP BY Clause: The SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups. The GROUP BY clause must follow the conditions in the WHERE clause and must precede the ORDER BY clause if one is used. The basic syntax of a GROUP BY clause is shown in the following code block:
SELECT column1, column2 FROM table_name WHERE [ conditions ] GROUP BY column1, column2

HAVING Clause: The HAVING clause enables you to specify conditions that filter which group results appear in the results. The WHERE clause places conditions on the selected columns, whereas the HAVING clause places conditions on groups created by the GROUP BY clause. The HAVING clause must follow the GROUP BY clause in a query and must also precede the ORDER BY clause if used. The following code block shows the position of the HAVING clause in a query:
SELECT column1, column2 FROM table1, table2 WHERE [ conditions ] GROUP BY column1, column2 HAVING [ conditions ] ORDER BY column1, column2

ORDER BY Clause: The SQL ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. Some databases sort query results in ascending order by default. The basic syntax of the ORDER BY clause is as follows:
SELECT column-list FROM table_name [WHERE condition] [ORDER BY column1, column2, .. columnN] [ASC | DESC];
You can use more than one column in the ORDER BY clause. Make sure that whatever column you are using to sort is present in the column-list.

LIMIT Clause: If there are a large number of tuples satisfying the query conditions, it can be practical to view only a handful of them at a time. The LIMIT clause is used to set an upper limit on the number of tuples returned by SQL. The following is the syntax for a basic LIMIT clause:
SELECT column1, column2 FROM table_name WHERE [condition] HAVING [condition] LIMIT [number];

Aggregate functions in SQL: In database management, an aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning. Various aggregate functions:
1. COUNT() - returns the total number of records
2. SUM() - sums all non-null values of a column in a table
3. AVG() - returns the sum of values divided by the total number of values, i.e., sum(values)/count(values)
4. MIN() - returns the minimum value in a column of a table
5. MAX() - returns the maximum value in a column of a table
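Putting these clauses together, the following sketch groups and filters rows in one query. It is illustrative only, and assumes a CUSTOMERS table like the one shown in the next section, with AGE and SALARY columns:

-- Average salary per age, for well-paid customers, youngest first, top 3 rows
SELECT AGE, COUNT(*) AS num_customers, AVG(SALARY) AS avg_salary
FROM CUSTOMERS
WHERE SALARY > 1500
GROUP BY AGE
HAVING COUNT(*) >= 1
ORDER BY AGE ASC
LIMIT 3;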
JOINS Clause: The SQL JOINS clause is used to combine records from two or more tables in a database. A JOIN is a means for combining fields from two tables by using values common to each. Consider the following two tables −

Table 1 − CUSTOMERS Table
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
| 1  | Ramesh   | 32  | Ahmedabad | 2000.00  |
| 2  | Khilan   | 25  | Delhi     | 1500.00  |
| 3  | kaushik  | 23  | Kota      | 2000.00  |
| 4  | Chaitali | 25  | Mumbai    | 6500.00  |
| 5  | Hardik   | 27  | Bhopal    | 8500.00  |
| 6  | Komal    | 22  | MP        | 4500.00  |
| 7  | Muffy    | 24  | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+

Table 2 − ORDERS Table
+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3           | 3000   |
| 100 | 2009-10-08 00:00:00 | 3           | 1500   |
| 101 | 2009-11-20 00:00:00 | 2           | 1560   |
| 103 | 2008-05-20 00:00:00 | 4           | 2060   |
+-----+---------------------+-------------+--------+

Now, let us join these two tables in our SELECT statement as shown below.

SQL> SELECT ID, NAME, AGE, AMOUNT
     FROM CUSTOMERS, ORDERS
     WHERE CUSTOMERS.ID = ORDERS.CUSTOMER_ID;

This would produce the following result:

+----+----------+-----+--------+
| ID | NAME     | AGE | AMOUNT |
+----+----------+-----+--------+
| 3  | kaushik  | 23  | 3000   |
| 3  | kaushik  | 23  | 1500   |
| 2  | Khilan   | 25  | 1560   |
| 4  | Chaitali | 25  | 2060   |
+----+----------+-----+--------+

There are different types of joins available in SQL:
• INNER JOIN − returns rows when there is a match in both tables.
• LEFT JOIN − returns all rows from the left table, even if there are no matches in the right table.
• RIGHT JOIN − returns all rows from the right table, even if there are no matches in the left table.
• FULL JOIN − returns rows when there is a match in one of the tables.
• SELF JOIN − joins a table to itself as if the table were two tables, temporarily renaming at least one table in the SQL statement.
• CARTESIAN JOIN − returns the Cartesian product of the sets of records from the two or more joined tables.

EQUI JOIN: An EQUI JOIN creates a join for equality or matching column values of the related tables. An EQUI JOIN can also be written using JOIN with ON, providing the names of the columns with their respective tables to check equality using the equal sign (=).
Syntax:
SELECT column_list FROM table1, table2… WHERE table1.column_name = table2.column_name;

NON EQUI JOIN: A NON EQUI JOIN performs a join using comparison operators other than the equal (=) sign, such as >, <, >=, <=, together with conditions.
Syntax:
SELECT * FROM table_name1, table_name2 WHERE table_name1.column [> | < | >= | <= ] table_name2.column;
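For contrast with the inner join above, an outer join keeps unmatched rows too. This sketch uses the same CUSTOMERS and ORDERS tables; customers with no orders appear with a NULL amount:

SELECT CUSTOMERS.ID, CUSTOMERS.NAME, ORDERS.AMOUNT
FROM CUSTOMERS
LEFT JOIN ORDERS ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;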
MANAGING BIG DATA IN CLUSTERS AND CLOUD STORAGE

In this course, we learn how to manage big datasets, how to load them into clusters and cloud storage, and how to apply structure to the data so that we can run queries on it using distributed SQL engines like Apache Hive and Apache Impala. You learn how to choose the right data types, storage systems, and file formats based on which tools you will use and what performance you need.

Big data can be stored in different ways. In addition to using HDFS for storage, you can store your data using cloud services such as Amazon Web Services, Microsoft Azure, or Google Cloud Platform. Today, some companies store big data on-premises in HDFS, some store it in cloud storage, and some use a hybrid approach using both HDFS and cloud storage.

The major reasons why companies use cloud storage are cost and scalability. Usually, it costs less to store some amount of data in cloud storage than it would to store it in HDFS. As the amount of data you need to store grows larger and larger, it is often easier to pay incrementally larger amounts of money to a cloud storage provider than it is to purchase new hard disks and new servers and install them in a data center.

Amazon has many cloud services, but its storage service is called S3, which is short for Simple Storage Service. S3 is the most popular cloud storage platform, and it is the one you use in this course when you are using something other than HDFS. Hive and Impala can use S3 very much like they use HDFS, so most of the time when you are querying a table, you will not even notice whether the data is in S3 or in HDFS.

S3 organizes data into buckets. Buckets are like the folders at the top or highest level of a file system. Buckets in S3 must have globally unique names, so if anyone else in the world is using a specific name for a bucket, you must pick a different name. Within a bucket, you can store files and folders. Technically, S3 stores all the files in your bucket in a flat file system and simulates folder structures by using slashes in the filenames, but that is not something you need to be concerned with for this course. S3 is connected to the internet; the data you store in S3 can be accessed from anywhere. S3 provides ways to control who has access to the data: you can make it public or restrict access to certain users or networks. There is only one instance of S3, operated by Amazon, and it runs across Amazon's data centers globally. HDFS, on the other hand, is a file system that exists on a Hadoop cluster. There are many instances of HDFS: one on every Hadoop cluster. Data stored in HDFS is generally not accessible from everywhere; access is usually restricted to specific private networks.

The major way that S3 differs from HDFS is that S3 provides storage and nothing more. S3 cannot process your data; it can only store it and provide it when requested. HDFS, on the other hand, typically stores files on the same computers that also provide processing power to your big data system. So if you are using HDFS to store files, then the files reside on the same computers where data processing engines like Hive and Impala run. When you run a query in Hive or Impala, if the data for the table you are querying is stored in HDFS, then Hive or Impala can read that data directly off the hard disk of the computer where it is running. This is called data locality, or just locality for short: the processing happens in the same location where the data is stored. If you store your data in a cloud service like S3, there is no data locality; Hive or Impala must fetch the data from S3 over the network before it can process it. This makes queries run a little slower, but nowadays the networks that connect data centers are so fast that the difference is often insignificant.

The readings in this course show how to access S3 from the VM. If you are using S3, though, you will need a network connection. You also will not be able to browse S3 files using the Hue file browser: it currently requires write access to a bucket in order to browse it directly, and write access cannot be provided to all Coursera learners. You have read access only to the S3 bucket used in this course, but you can use Hue to work with Hive and Impala tables that use S3 for their storage.
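To sketch how a table can be backed by S3 rather than HDFS, Hive and Impala accept an S3 location in the table definition. The bucket name, path, and columns below are placeholders, not the course's actual bucket:

CREATE EXTERNAL TABLE orders_s3 (
  order_id INT,
  amount DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://example-bucket/data/orders/';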
Structured Data: The term structured data refers to data that resides in a fixed field within a file or record. Structured data is typically stored in a relational database (RDBMS). It can consist of numbers and text, and sourcing can happen automatically or manually, as long as it is within an RDBMS structure. It depends on the creation of a data model, defining what types of data to include and how to store and process it. The programming language used for structured data is SQL (Structured Query Language). Developed by IBM in the 1970s, SQL handles relational databases. Typical examples of structured data are names, addresses, credit card numbers, geolocation, and so on.

Unstructured Data: Unstructured data is more or less all the data that is not structured. Even though unstructured data may have a native, internal structure, it is not structured in a predefined way. There is no data model; the data is stored in its native format. Typical examples of unstructured data are rich media, text, social media activity, surveillance imagery, and so on. The amount of unstructured data is much larger than that of structured data; unstructured data makes up a whopping 80% or more of all enterprise data, and the percentage keeps growing. This means that companies not taking unstructured data into account are missing out on a lot of valuable business intelligence.

Creating a table: The CREATE TABLE statement creates a new table and specifies its characteristics. When you execute a CREATE TABLE command, Hive or Impala adds the table to the metastore and creates a new subdirectory in the warehouse directory in HDFS to store the table data. The location of this new subdirectory depends on the database in which the table is created. Tables created in the default database are stored in subdirectories directly under the warehouse directory; tables created in other databases are stored in subdirectories under those database directories. The basic syntax of the CREATE TABLE statement should be familiar to anyone who has created tables in a relational database. After CREATE TABLE, you optionally specify the database name, then give the name of the new table and a list of the columns and their data types. If you omit the database name, then the new table will be created in the current database.
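For example, a minimal CREATE TABLE statement in Hive or Impala might look like the following (the database, table, and column names are illustrative); dropping the "sales." prefix would create the table in the current database instead:

CREATE TABLE sales.products (
  product_id INT,
  name STRING,
  price DECIMAL(8,2)
);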
IMPLEMENTATION

In this specialization I have created a database about a bank, which includes bank details, customer details, and account info. I have created four tables, which contain customer personal info, customer reference info, customers' account info and bank info. The following is the source code of the tables:

CREATE DATABASE BMS_DB33;
USE BMS_DB33;
SHOW DATABASES;

-- CUSTOMER_PERSONAL_INFO
CREATE TABLE CUSTOMER_PERSONAL_INFO (
CUSTOMER_ID VARCHAR(5),
CUSTOMER_NAME VARCHAR(30),
DATE_OF_BIRTH DATE,
GUARDIAN_NAME VARCHAR(30),
ADDRESS VARCHAR(50),
CONTACT_NO BIGINT(10),
MAIL_ID VARCHAR(30),
GENDER CHAR(1),
MARITAL_STATUS VARCHAR(10),
IDENTIFICATION_DOC_TYPE VARCHAR(20),
ID_DOC_NO VARCHAR(20),
CITIZENSHIP VARCHAR(10),
CONSTRAINT CUST_PERS_INFO_PK PRIMARY KEY(CUSTOMER_ID)
);
SHOW TABLES;

-- CUSTOMER_REFERENCE_INFO
CREATE TABLE CUSTOMER_REFERENCE_INFO (
CUSTOMER_ID VARCHAR(5),
REFERENCE_ACC_NAME VARCHAR(20),
REFERENCE_ACC_NO BIGINT(16),
REFERENCE_ACC_ADDRESS VARCHAR(50),
RELATION VARCHAR(25),
CONSTRAINT CUST_REF_INFO_PK PRIMARY KEY(CUSTOMER_ID),
CONSTRAINT CUST_REF_INFO_FK FOREIGN KEY(CUSTOMER_ID) REFERENCES CUSTOMER_PERSONAL_INFO(CUSTOMER_ID)
);
SHOW TABLES;

-- BANK_INFO
CREATE TABLE BANK_INFO (
IFSC_CODE VARCHAR(15),
BANK_NAME VARCHAR(25),
BRANCH_NAME VARCHAR(25),
CONSTRAINT BANK_INFO_PK PRIMARY KEY(IFSC_CODE)
);

-- ACCOUNT_INFO
CREATE TABLE ACCOUNT_INFO (
ACCOUNT_NO BIGINT(16),
CUSTOMER_ID VARCHAR(5),
ACCOUNT_TYPE VARCHAR(10),
REGISTRATION_DATE DATE,
ACTIVATION_DATE DATE,
IFSC_CODE VARCHAR(10),
INTEREST DECIMAL(7,2),
INITIAL_DEPOSIT BIGINT(10),
CONSTRAINT ACC_INFO_PK PRIMARY KEY(ACCOUNT_NO),
CONSTRAINT ACC_INFO_PERS_FK FOREIGN KEY(CUSTOMER_ID) REFERENCES CUSTOMER_PERSONAL_INFO(CUSTOMER_ID),
CONSTRAINT ACC_INFO_BANK_FK FOREIGN KEY(IFSC_CODE) REFERENCES BANK_INFO(IFSC_CODE)
);
SHOW TABLES;

-- BANK_INFO
INSERT INTO BANK_INFO(IFSC_CODE,BANK_NAME,BRANCH_NAME) VALUES('HDVL0012','HDFC','VALASARAVAKKAM');
SELECT * FROM BANK_INFO;
INSERT INTO BANK_INFO(IFSC_CODE,BANK_NAME,BRANCH_NAME) VALUES('SBITN0123','SBI','TNAGAR');
INSERT INTO BANK_INFO(IFSC_CODE,BANK_NAME,BRANCH_NAME) VALUES('ICITN0232','ICICI','TNAGAR');
INSERT INTO BANK_INFO(IFSC_CODE,BANK_NAME,BRANCH_NAME) VALUES('ICIPG0242','ICICI','PERUNGUDI');
INSERT INTO BANK_INFO(IFSC_CODE,BANK_NAME,BRANCH_NAME) VALUES('SBISD0113','SBI','SAIDAPET');

-- CUSTOMER_PERSONAL_INFO
INSERT INTO CUSTOMER_PERSONAL_INFO(CUSTOMER_ID,CUSTOMER_NAME,DATE_OF_BIRTH,GUARDIAN_NAME,ADDRESS,CONTACT_NO,MAIL_ID) VALUES('C-001','JOHN','1994-05-03','PETER','NO-14 ST.MARKS ROAD ,BENGALORE','9948148628','JOHN_123@gmail.com');
INSERT INTO CUSTOMER_PERSONAL_INFO(CUSTOMER_ID,CUSTOMER_NAME,DATE_OF_BIRTH,GUARDIAN_NAME,ADDRESS,CONTACT_NO,MAIL_ID) VALUES('C-002','JAMES','1994-04-07','GEORGE','NO-18 MG. ROAD ,DELHI','9942137629','JAMES_213@gmail.com');
INSERT INTO CUSTOMER_PERSONAL_INFO(CUSTOMER_ID,CUSTOMER_NAME,DATE_OF_BIRTH,GUARDIAN_NAME,ADDRESS,CONTACT_NO,MAIL_ID) VALUES('C-003','SUNITHA','1994-09-04','VINOD','NO-21 GM ROAD ,CHENNAI','9942138029','SUNITHA_453@gmail.com');
INSERT INTO CUSTOMER_PERSONAL_INFO(CUSTOMER_ID,CUSTOMER_NAME,DATE_OF_BIRTH,GUARDIAN_NAME,ADDRESS,CONTACT_NO,MAIL_ID) VALUES('C-004','RAMESH','1995-06-08','KIRAN','NO-15 LB ROAD ,CHENNAI','9942438629','RAMESH_63@gmail.com');
INSERT INTO CUSTOMER_PERSONAL_INFO(CUSTOMER_ID,CUSTOMER_NAME,DATE_OF_BIRTH,GUARDIAN_NAME,ADDRESS,CONTACT_NO,MAIL_ID) VALUES('C-005','KUMAR','1994-07-02','PRASAD','NO-13 MM. ROAD ,BANGALORE','9949138629','KUMAR_3@gmail.com');
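To see how these tables relate, an illustrative join like the following would list each account with its owner's name and bank branch (a hypothetical check query; the table and column names follow the schema above):

SELECT A.ACCOUNT_NO, C.CUSTOMER_NAME, B.BANK_NAME, B.BRANCH_NAME
FROM ACCOUNT_INFO A
JOIN CUSTOMER_PERSONAL_INFO C ON A.CUSTOMER_ID = C.CUSTOMER_ID
JOIN BANK_INFO B ON A.IFSC_CODE = B.IFSC_CODE;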
After inserting the values into the tables, I gave the following query to obtain the desired results:

SELECT ADDRESS, COUNT(*) FROM CUSTOMER_PERSONAL_INFO GROUP BY ADDRESS;

Learning outcomes

By the end of this specialization, I have learned about big data and relational databases: to distinguish operational from analytic databases and understand how these are applied in big data; to understand how database and table design provides structures for working with data; to recognize the features and benefits of SQL dialects designed to work with big data systems for storage and analysis; and to explore databases and tables in a big data platform. I can explore and navigate databases and tables using different tools; understand the basics of SELECT statements; understand how and why to filter results; explore grouping and aggregation to answer analytic questions; work with sorting and limiting results; and combine multiple tables in different ways. I can also use different tools to browse existing databases and tables in big data systems; use different tools to explore files in distributed big data filesystems and cloud storage; create and manage big data databases and tables using Apache Hive and Apache Impala; and describe and choose among different data types and file formats for big data systems. In short, I have learned about the Structured Query Language and how data works in real time using SQL, with the help of the Modern Big Data Analysis with SQL Specialization.

Gantt chart

Task / Week                                          W1  W2  W3  W4  W5  W6
1. Foundations for big data analysis with SQL
2. Analysing big data with SQL
3. Managing big data in clusters and cloud storage
4. Report, Project, PPT

Bibliography
• Coursera
• Google
• YouTube