SIX WEEKS SUMMER TRAINING
REPORT
on
Modern Big Data Analysis with SQL Specialization
Submitted by
Name: Bonigi Syam Prasad
Reg. no: 11902565
Program name: Computer Science and Engineering
Under the guidance of
Glynn Durham, Ian Cook
School of Computer Science & Engineering
Lovely Professional University, Phagwara
(June-July, 2021)
DECLARATION
I hereby declare that I have completed my six weeks summer training at Coursera from 20-05-2021
to 27-08-2021 under the guidance of Glynn Durham and Ian Cook. I declare that I worked with full
dedication during these six weeks of training and that my learning outcomes fulfil the
requirements of training for the award of the degree, Modern Big Data Analysis with SQL
Specialization, Lovely Professional University, Phagwara.
(Signature of student)
Name of Student: Bonigi Syam Prasad
Registration no: 11902565
Date: 30-07-21
Acknowledgement
I would like to express my special thanks of gratitude to Lovely Professional University for
encouraging me to take this wonderful course, the Modern Big Data Analysis with SQL
Specialization, as part of the six weeks summer training program, which also helped me in
understanding a lot of things related to my minor subject, Data Science.
Secondly, I would like to thank my instructors Glynn Durham and Ian Cook, who taught this
entire course. In this course we get an overview of database systems and the distinction
between operational and analytical databases, understand how database and table design
provides structures for working with data, and recognize the features and benefits of SQL
dialects designed to work with big data systems for storage and analysis. We also learn
about the common querying language, SQL.
Finally, I value the contribution of all others who shared their wealth of knowledge, wisdom and experience.
Summer training Certificate from Cloudera
Table of contents
1. Introduction
2. Technology Learnt
3. Reason for Choosing this Technology
4. Foundations for Big Data Analysis with SQL
5. Analyzing Big Data with SQL
6. Managing Big Data in Clusters and Cloud Storage
7. Implementation
8. Learning Outcomes
9. Gantt Chart
10. Project Legacy
Introduction
This Specialization teaches the essential skills for working with large-scale data using SQL. For
this we have to know about SQL and big data. SQL is a domain-specific language used in
programming and designed for managing data held in a relational database management system,
or for stream processing in a relational data stream management system. It is a standard language
for storing, manipulating and retrieving data in databases.
Big data is a term that describes the large volume of data, both structured and unstructured, that
inundates a business on a day-to-day basis. But it's not the amount of data that's important… Big
data can be analysed for insights that lead to better decisions and strategic business moves. Big
data is a field that treats ways to analyse, systematically extract information from, or otherwise
deal with data sets that are too large or complex to be dealt with by traditional data-processing
application software. Today, more and more of the data being generated is too big to be stored in
traditional databases, and it's growing too quickly to be efficiently stored in commercial data
warehouses. Instead, it's increasingly stored in distributed clusters and cloud storage. To query
these huge datasets in clusters and cloud storage, you need a newer breed of SQL engine like Hive,
Impala, Presto and Drill. These are open-source SQL engines capable of querying enormous
datasets. This specialization focuses on Hive and Impala, the most widely deployed of these query
engines.
In this specialization, we have three sub-courses. The first course is Foundations for Big Data
Analysis with SQL. In this course we learn to distinguish operational from analytical databases,
understand how these are applied in big data, and understand how database and table design
provides structures for working with data.
The second course is Analyzing Big Data with SQL. In this course we get an in-depth look at the
SQL SELECT statement and its main clauses. The course focuses on the big data SQL engines
Hive and Impala, but most of the information is applicable to SQL with traditional RDBMSs. By
the end of the course, you will be able to explore and navigate databases and tables using different
tools, understand the basics of SELECT statements, and explore grouping and aggregation to
answer analytic questions. We also work with sorting and limiting results and combining multiple
tables in different ways.
The third course is Managing Big Data in Clusters and Cloud Storage. In this course we learn how
to manage big datasets, how to load them into clusters and cloud storage, and how to apply
structure to the data so that you can run queries on it using distributed SQL engines like Hive and
Impala. We also learn how to choose the right data types, storage systems and file formats. By the
end of the course we are able to use different tools to browse existing databases and tables in big
data systems and to explore files in distributed big data filesystems and cloud storage.
TECHNOLOGY LEARNT
In this specialization we used VMware Workstation Player. VMware Workstation is a hosted
hypervisor that runs on x64 versions of Windows and Linux operating systems; it enables
users to set up virtual machines on a single physical machine and use them simultaneously
along with the host machine. Along with this software, we need to download and install a
virtual machine (supplied by Cloudera) and the software on which to run it.
In the Cloudera virtual machine we used Impala and Hive for queries. Impala is an open
source massively parallel processing SQL query engine for data stored in a computer cluster
running Apache Hadoop. Hive is a data warehouse software project built on top of Apache Hadoop
for providing data query and analysis. Hive gives a SQL-like interface to query data stored
in various databases and file systems that integrate with Hadoop.
The above figure shows the interface of Impala, where we can issue different queries against the
existing datasets; the customers, employees, offices, orders and salary grades tables belong to the
default dataset.
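For example, a query like the following could be run in the Impala editor against one of those tables; the column names used here are assumptions for illustration only:

-- Count how many customers there are in each city (illustrative column names):
SELECT city, COUNT(*) AS num_customers
FROM customers
GROUP BY city;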
Reason for choosing this Technology:
In this specialization, Modern Big Data Analysis with SQL, the instructors discussed big data and
SQL; these topics are essential for Data Science. In course 1, they described big data, types of
data, and where data can be stored. In course 2, they discussed SQL, ACID properties, and the
different clauses of SQL such as SELECT, FROM, WHERE, HAVING, GROUP BY and ORDER
BY, and explored grouping and aggregation to answer analytic questions, along with sorting and
limiting results and combining multiple tables in different ways. In course 3, they described
managing big data on clusters and cloud storage and how to apply structure to the data so that you
can run queries on it. As a data science aspirant, these courses helped me a lot in understanding
big data and database management systems, how data works in real life, and how we can
manipulate data in different ways using SQL.
FOUNDATIONS FOR BIG DATA ANALYSIS WITH SQL
In big data, data means digital data: information that can be transmitted, stored, and processed
using modern digital technologies like the internet, disk drives, and modern computers. Data
itself can be divided into two kinds, analog data and digital data.
Analog data is data that is represented in a physical way, whereas digital data is represented as
numbers in computing machines and can be interpreted. The difference between analog and
digital is in how the information or data is measured: analog data attempts to be continuous and
identify every nuance of what is being measured, while digital data uses sampling to encode what
is being measured.
A database management system (DBMS) is software for storing and retrieving users' data while
considering appropriate security measures. In large systems, a DBMS helps users and other
third-party software to store and retrieve data. A DBMS allows users to create their own
databases as per their requirements.
Structured Query Language (SQL) is a programming language used to communicate with
relational databases.
Relational databases store data in tables consisting of columns and rows, similar to a spreadsheet.
A spreadsheet allows only simple manipulation of the stored data, while relational databases, with
the help of SQL, allow complex manipulation of the data. Relational databases are the most used
technology for accessing structured data.
In the context of SQL, data definition language (DDL) is a syntax for creating and modifying
database objects such as tables, indices, and users. DDL statements are similar to a computer
programming language for defining data structures, especially database schemas.
DDL statements include:
• CREATE - is used to create the database or its objects.
• ALTER - is used to alter the structure of the database.
• DROP - is used to delete objects from the database.
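A minimal sketch of the three DDL statements together, assuming MySQL-style syntax and a hypothetical employees table created only for illustration:

CREATE DATABASE company_db;                          -- create a new database
USE company_db;                                      -- switch to it
CREATE TABLE employees (id INT, name VARCHAR(30));   -- create a table in it
ALTER TABLE employees ADD COLUMN salary INT;         -- change the table's structure
DROP TABLE employees;                                -- delete the table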
Data query language (DQL) statements are used for performing queries on the data within
schema objects. The purpose of a DQL command is to get some schema relation based on the
query passed to it.
DQL statements include:
• SELECT - is used to retrieve data from the database.
Data manipulation language (DML) refers to the SQL commands that deal with the manipulation
of data present in the database; this includes most of the SQL statements.
DML statements include:
• INSERT - is used to insert data into a table.
• UPDATE - is used to update existing data within a table.
• DELETE - is used to delete records from a database table.
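Continuing with the hypothetical employees table from the DDL sketch above, the three DML statements might look like this:

INSERT INTO employees (id, name, salary) VALUES (1, 'ASHA', 40000);  -- add a row
UPDATE employees SET salary = 45000 WHERE id = 1;                    -- modify the row
DELETE FROM employees WHERE id = 1;                                  -- remove the row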
Data control language (DCL) includes commands which mainly deal with the rights,
permissions and other controls of the database system.
DCL includes:
• GRANT - gives users access privileges to the database.
• REVOKE - withdraws the access privileges given with the GRANT command.
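A minimal sketch of DCL, assuming MySQL-style syntax and a hypothetical user account 'analyst'@'localhost':

GRANT SELECT ON company_db.employees TO 'analyst'@'localhost';     -- give read access
REVOKE SELECT ON company_db.employees FROM 'analyst'@'localhost';  -- withdraw it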
Transaction control language (TCL) commands deal with transactions within the database.
TCL includes:
• COMMIT - commits a transaction.
• ROLLBACK - rolls back a transaction in case any error occurs.
• SAVEPOINT - sets a savepoint within a transaction.
• SET TRANSACTION - specifies characteristics for the transaction.
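A minimal sketch of TCL, continuing with the hypothetical employees table and MySQL-style syntax:

START TRANSACTION;                   -- begin a transaction
SAVEPOINT before_raise;              -- a point we can roll back to
UPDATE employees SET salary = salary + 1000 WHERE id = 1;
ROLLBACK TO SAVEPOINT before_raise;  -- undo the update but keep the transaction open
COMMIT;                              -- end the transaction, making any remaining changes permanent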
ACID Properties
A transaction is a very small unit of a program and it may contain several low-level tasks. A
transaction in a database system must maintain Atomicity, Consistency, Isolation, and Durability
− commonly known as the ACID properties − in order to ensure accuracy, completeness, and
data integrity.
• Atomicity − This property states that a transaction must be treated as an atomic unit; that is, either all of its operations are executed or none. There must be no state in a database where a transaction is left partially completed. States should be defined either before the execution of the transaction or after the execution/abortion/failure of the transaction.
• Consistency − The database must remain in a consistent state after any transaction. No transaction should have any adverse effect on the data residing in the database. If the database was in a consistent state before the execution of a transaction, it must remain consistent after the execution of the transaction as well.
• Durability − The database should be durable enough to hold all its latest updates even if the system fails or restarts. If a transaction updates a chunk of data in a database and commits, then the database will hold the modified data. If a transaction commits but the system fails before the data can be written to disk, the data will be updated once the system springs back into action.
• Isolation − In a database system where more than one transaction is being executed simultaneously and in parallel, the property of isolation states that all the transactions will be carried out and executed as if each were the only transaction in the system. No transaction will affect the existence of any other transaction.
Operational Databases: An operational database is a database that is used to manage and store
data in real time. An operational database is the source for a data warehouse. Elements in an
operational database can be added and removed on the fly. These databases can be either SQL or
NoSQL-based, where the latter is geared toward real-time operations.
Analytical Databases: An analytic database, also called an analytical database, is a read-only
system that stores historical data on business metrics such as sales performance and inventory
levels. Business analysts, corporate executives and other workers run queries and reports against
an analytic database. The information is regularly updated to include recent transaction data from
an organization's operational systems.
Data types in SQL:
Data types are used to represent the nature of the data that can be stored in a database table. For
example, in a particular column of a table, if we want to store a string type of data then we will
have to declare a string data type for this column.
Data types are mainly classified into three categories for every database:
• String data types
• Numeric data types
• Date and time data types
Some of the data types are:
• CHAR: used to specify a fixed-length string that can contain numbers, letters, and special characters. Its size can be 0 to 255 characters. The default is 1.
• VARCHAR: used to specify a variable-length string that can contain numbers, letters, and special characters. Its size can be from 0 to 65,535 characters.
• INT: used for integer values. Its signed range is from -2147483648 to 2147483647 and its unsigned range is from 0 to 4294967295. The size parameter specifies the maximum display width, which is 255.
• INTEGER: equal to INT(size).
• FLOAT: used to specify a floating-point number. Its size parameter specifies the total number of digits, and the number of digits after the decimal point is specified by the d parameter.
• FLOAT(p): used to specify a floating-point number. MySQL uses the p parameter to determine whether to use FLOAT or DOUBLE. If p is between 0 and 24, the data type becomes FLOAT(); if p is from 25 to 53, the data type becomes DOUBLE().
• BOOL: used to specify Boolean values true and false. Zero is considered false, and nonzero values are considered true.
• DATE: used to specify dates in the format YYYY-MM-DD. Its supported range is from '1000-01-01' to '9999-12-31'.
• YEAR: used to specify a year in four-digit format. Values allowed in four-digit format are 1901 to 2155, and 0000.
• TEXT(size): holds a string with a maximum length of 65,535 characters.
• TINYTEXT: holds a string with a maximum length of 255 characters.
• MEDIUMTEXT: holds a string with a maximum length of 16,777,215 characters.
• LONGTEXT: holds a string with a maximum length of 4,294,967,295 characters.
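To illustrate several of these data types together, here is a sketch of a hypothetical MySQL-style table definition (the table and column names are invented for illustration):

CREATE TABLE product
(
  product_id   INT,            -- integer value
  product_name VARCHAR(50),    -- variable-length string up to 50 characters
  price        FLOAT(10,2),    -- 10 total digits, 2 after the decimal point
  in_stock     BOOL,           -- zero is false, nonzero is true
  added_on     DATE,           -- YYYY-MM-DD
  description  TEXT            -- long free-form text
);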
Table 1: card_rank
Columns:
+-----+-------+---------+------------------------------------------------+
| sno | Name  | Type    | Comment                                        |
+-----+-------+---------+------------------------------------------------+
| 1   | rank  | string  | pk*                                            |
| 2   | value | tinyint | same as rank if it is a number, otherwise NULL |
+-----+-------+---------+------------------------------------------------+
Sample:
+------+-------+
| rank | value |
+------+-------+
| Ace  | NULL  |
| 2    | 2     |
| 3    | 3     |
+------+-------+
Table 2: card_suit
Columns:
+-----+-------+--------+--------------+
| sno | Name  | Type   | Comment      |
+-----+-------+--------+--------------+
| 1   | suit  | string | pk*          |
| 2   | color | string | RED or BLACK |
+-----+-------+--------+--------------+
Sample:
+----------+-------+
| suit     | color |
+----------+-------+
| Clubs    | Black |
| Diamonds | Red   |
| Hearts   | Red   |
+----------+-------+
Analyzing Big Data With SQL
In this course, you'll get an in-depth look at the SQL SELECT statement and its main clauses. The course
focuses on big data SQL engines Apache Hive and Apache Impala, but most of the information is applicable
to SQL with traditional RDBMS.
SELECT statement:
The SELECT statement is the most important part of the SQL language. The options for what you can do with a
SELECT statement are so extensive that SELECT forms its own category of SQL statements called queries. It is
the most common operation in SQL, called "the query".
SELECT retrieves data from one or more tables or expressions. Standard SELECT statements have no
persistent effects on the database. Some non-standard implementations of SELECT can have persistent
effects, such as the SELECT INTO syntax provided in some databases. The SELECT statement is used to
select data from a database.
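For example, assuming a hypothetical customers table, the simplest forms of the query look like this:

SELECT * FROM customers;              -- retrieve every column of every row
SELECT customer_name FROM customers;  -- retrieve a single column
SELECT 2 + 2;                         -- a SELECT on an expression, with no table at all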
FROM Clause:
The SQL FROM clause is the source of a rowset to be operated upon in a data manipulation language
statement. FROM clauses are very common, and provide the rowset to be exposed through a SELECT statement,
the source of values in an UPDATE statement, and the target rows to be deleted in a DELETE statement. FROM is
an SQL reserved word in the SQL standard. The FROM clause is used in conjunction with SQL statements, and
takes the following general form:
SQL-DML-statement FROM table_name WHERE predicate
WHERE Clause:
The SQL WHERE clause is used to specify a condition while fetching data from a single table or by
joining multiple tables. Only if the given condition is satisfied does the query return a specific value from the
table. You should use the WHERE clause to filter the records and fetch only the necessary records.
The WHERE clause is not only used in the SELECT statement; it is also used in the UPDATE and DELETE
statements, among others.
The basic syntax of the SELECT statement with the WHERE clause is shown below:
SELECT column1, column2 FROM table_name WHERE [condition]
You can specify a condition using comparison or logical operators like >, <, =, LIKE, NOT, etc.
GROUP BY Clause:
The SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data
into groups. The GROUP BY clause must follow the conditions in the WHERE clause and must precede the
ORDER BY clause if one is used. The basic syntax of a GROUP BY clause is shown in the following code
block.
SELECT column1, column2 FROM table_name WHERE [conditions] GROUP BY column1, column2
HAVING Clause:
The HAVING clause enables you to specify conditions that filter which group results appear in the results.
The WHERE clause places conditions on the selected columns, whereas the HAVING clause places
conditions on groups created by the GROUP BY clause. The HAVING clause must follow the GROUP BY
clause in a query and must also precede the ORDER BY clause if used. The following code block shows the
position of the HAVING clause in a query.
SELECT column1, column2 FROM table1, table2 WHERE [conditions] GROUP BY column1, column2
HAVING [conditions] ORDER BY column1, column2
ORDER BY Clause:
The SQL ORDER BY clause is used to sort the data in ascending or descending order, based on one or more
columns. Some databases sort the query results in ascending order by default. The basic syntax of the
ORDER BY clause is as follows:
SELECT column-list FROM table_name [WHERE condition] [ORDER BY column1, column2, .. columnN]
[ASC | DESC];
You can use more than one column in the ORDER BY clause. Make sure that whatever column you use for
sorting is present in the column-list.
LIMIT Clause:
If there are a large number of tuples satisfying the query conditions, it might be practical to view only a
handful of them at a time. The LIMIT clause is used to set an upper limit on the number of tuples returned by
SQL. The following is the syntax for a basic LIMIT clause.
SELECT column1, column2 FROM table_name WHERE [condition] HAVING [condition] LIMIT
[number];
Aggregate functions in SQL:
In database management, an aggregate function is a function where the values of multiple rows are grouped
together as input on certain criteria to form a single value of more significant meaning.
Various aggregate functions:
1. COUNT() - returns the total number of records.
2. SUM() - sums all non-NULL values of a column in a table.
3. AVG() - returns the sum of values divided by the total number of values, i.e. SUM(values)/COUNT(values).
4. MIN() - returns the minimum value in a column of a table.
5. MAX() - returns the maximum value in a column of a table.
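Putting the clauses and aggregate functions together, a query against the ORDERS table shown in the next section might look like the following sketch. It reports the number of orders and the total amount per customer, keeps only customers whose total exceeds 2000, and returns at most five rows:

SELECT CUSTOMER_ID,
       COUNT(*)    AS num_orders,    -- aggregate per group
       SUM(AMOUNT) AS total_amount
FROM ORDERS
GROUP BY CUSTOMER_ID                 -- one group per customer
HAVING SUM(AMOUNT) > 2000            -- filter on the aggregated value
ORDER BY total_amount DESC           -- sort groups by their total
LIMIT 5;                             -- cap the number of rows returned

With the sample ORDERS data below, this would return customer 3 (total 4500) and customer 4 (total 2060).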
JOINS Clause:
The SQL Joins clause is used to combine records from two or more tables in a database. A JOIN is a means
for combining fields from two tables by using values common to each.
Consider the following two tables.
Table 1 − CUSTOMERS table:
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
Table 2 − ORDERS table:
+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 |           3 |   3000 |
| 100 | 2009-10-08 00:00:00 |           3 |   1500 |
| 101 | 2009-11-20 00:00:00 |           2 |   1560 |
| 103 | 2008-05-20 00:00:00 |           4 |   2060 |
+-----+---------------------+-------------+--------+
Now, let us join these two tables in our SELECT statement as shown below.
SQL> SELECT ID, NAME, AGE, AMOUNT
     FROM CUSTOMERS, ORDERS
     WHERE CUSTOMERS.ID = ORDERS.CUSTOMER_ID;
This would produce the following result.
+----+----------+-----+--------+
| ID | NAME     | AGE | AMOUNT |
+----+----------+-----+--------+
|  3 | kaushik  |  23 |   3000 |
|  3 | kaushik  |  23 |   1500 |
|  2 | Khilan   |  25 |   1560 |
|  4 | Chaitali |  25 |   2060 |
+----+----------+-----+--------+
There are different types of joins available in SQL:
• INNER JOIN - returns rows when there is a match in both tables.
• LEFT JOIN - returns all rows from the left table, even if there are no matches in the right table.
• RIGHT JOIN - returns all rows from the right table, even if there are no matches in the left table.
• FULL JOIN - returns rows when there is a match in one of the tables.
• SELF JOIN - is used to join a table to itself as if the table were two tables, temporarily renaming at least one table in the SQL statement.
• CARTESIAN JOIN - returns the Cartesian product of the sets of records from the two or more joined tables.
EQUI JOIN:
An EQUI JOIN creates a join on equality, i.e. matching column values of the related tables. An EQUI JOIN
can also be created by using JOIN with ON and then providing the names of the columns, with their
respective tables, to check equality using the equal sign (=).
Syntax:
SELECT column_list FROM table1, table2… WHERE table1.column_name = table2.column_name;
NON EQUI JOIN:
A NON EQUI JOIN performs a join using comparison operators other than the equal (=) sign, like >, <, >=, <=,
with conditions.
Syntax:
SELECT * FROM table_name1, table_name2 WHERE table_name1.column [> | < | >= | <=]
table_name2.column;
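In Hive and Impala, and in modern SQL generally, the same joins are usually written with the explicit JOIN ... ON syntax. A sketch using the CUSTOMERS and ORDERS tables above:

-- Equi join: match customers to their orders on equal key values.
SELECT c.ID, c.NAME, o.AMOUNT
FROM CUSTOMERS c
JOIN ORDERS o ON c.ID = o.CUSTOMER_ID;

-- Non-equi join: pair each order with customers whose salary is below the order amount.
SELECT c.NAME, o.OID, o.AMOUNT
FROM CUSTOMERS c
JOIN ORDERS o ON o.AMOUNT > c.SALARY;

Note that older versions of Hive accept only equality conditions in the ON clause, so the non-equi condition may need to be written in a WHERE clause there.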
MANAGING BIG DATA IN CLUSTERS AND CLOUD STORAGE
In this course, we will learn how to manage big datasets, how to load them into clusters and cloud storage,
and how to apply structure to the data so that you can run queries on it using distributed SQL engines like
Apache Hive and Apache Impala. You’ll learn how to choose the right data types, storage systems, and file
formats based on which tools you’ll use and what performance you need.
Big data can be stored in different ways: some companies use cloud storage, some use on-premises data
warehouses, and some use a hybrid approach with both cloud storage and on-premises systems. In addition
to using HDFS for storage, you can store your data using cloud services such as Amazon Web Services,
Microsoft Azure, or Google Cloud Platform. Today, some companies store big data on-premises in HDFS,
some store it in cloud storage, and some use a hybrid approach using both HDFS and cloud storage. The
major reasons why companies use cloud storage are cost and scalability. Usually, it costs less to store some
amount of data in cloud storage than it would to store it in HDFS. As the amount of data you need to store
grows larger and larger, it's often easier to pay incrementally larger amounts of money to a cloud storage
provider than it is to purchase new hard disks and new servers and install them in a data center.
Amazon has many cloud services, but their storage service is called S3, which is short for Simple Storage
Service. S3 is the most popular cloud storage platform, and it's the one you will use in this course when
you're using something other than HDFS. Hive and Impala can use S3 very much like they use HDFS, so
most of the time when you're querying a table, you won't even notice whether the data is in S3 or in HDFS.
S3 organizes data into buckets. Buckets are like the folders at the top or highest level of a file system.
Buckets in S3 must have globally unique names, so if anyone else in the world is using a specific name for a
bucket, you must pick a different name. Within a bucket, you can store files and folders. Technically, S3
stores all the files in your bucket in a flat file system, and it simulates folder structures by using slashes in
the filenames, but that's not something you need to be concerned with for this course. S3 is connected to the
internet, and the data you store in S3 can be accessed from anywhere. S3 provides ways to control who has
access to the data, though: you can make it public or restrict access to certain users or networks. There is
only one instance of S3; it's operated by Amazon and runs across Amazon's data centers globally. HDFS, on
the other hand, is a file system that exists on a Hadoop cluster. There are many instances of HDFS; there's
one on every Hadoop cluster. Data stored in HDFS is generally not accessible from everywhere.
Access is usually restricted to specific private networks. The major way that S3 is different from HDFS is
that S3 provides storage and nothing more. S3 cannot process your data; it can only store it and provide it
when requested. HDFS, on the other hand, typically stores files on the same computers that also provide
processing power to your big data system. So if you're using HDFS to store files, then the files on HDFS
reside on the same computers where data processing engines like Hive and Impala run. When you run a
query in Hive or Impala, if the data for the table you're querying is stored in HDFS, then Hive or Impala can
read that data directly off the hard disk of the computer where it's running. This is called data locality, or
just locality for short: the processing happens in the same location where the data is stored. If you store
your data in a cloud service like S3, there is no data locality; Hive or Impala must fetch the data from S3
over the network before it can process it. This makes queries run a little bit slower, but nowadays the
networks that connect data centers together are so fast that the difference is often insignificant. The readings
in this course will show how to access S3 from the VM. If you are using S3, though, you will need a
network connection. You also won't be able to browse S3 files using the Hue file browser: Hue currently
requires you to have write access to a bucket if you want to browse it directly, and we are not able to provide
write access to all Coursera learners. You have read access only to the S3 bucket you'll use for this course,
but you can use Hue to work with Hive and Impala tables that use S3 for their storage.
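For example, applying structure to data that already sits in S3 is typically done with an external table whose LOCATION points at the bucket. A minimal sketch in Hive/Impala SQL, with a made-up bucket name and schema:

CREATE EXTERNAL TABLE orders_s3 (
  oid INT,
  order_date STRING,
  customer_id INT,
  amount INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','  -- comma-separated text files
STORED AS TEXTFILE
LOCATION 's3a://example-bucket/orders/';       -- hypothetical bucket and path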
Structured Data: The term structured data refers to data that resides in a fixed field within a file or record.
Structured data is typically stored in a Relational database (RDBMS). It can consist of numbers and text, and
sourcing can happen automatically or manually, as long as it's within an RDBMS structure. It depends on
the creation of a data model, defining what types of data to include and how to store and process it. The
programming language used for structured data is SQL (Structured Query Language). Developed by IBM in
the 1970s, SQL handles relational databases. Typical examples of structured data are names, addresses,
credit card numbers, geolocation, and so on.
Unstructured Data: Unstructured data is more or less all the data that is not structured. Even though
unstructured data may have a native, internal structure, it's not structured in a predefined way. There is no
data model; the data is stored in its native format. Typical examples of unstructured data are rich media, text,
social media activity, surveillance imagery, and so on. The amount of unstructured data is much larger than
that of structured data. Unstructured data makes up a whopping 80% or more of all enterprise data, and the
percentage keeps growing. This means that companies not taking unstructured data into account are missing
out on a lot of valuable business intelligence.
Creating a table: The create table statement creates a new table and specifies its characteristics. When you
execute a create table command, Hive or Impala adds the table to the metastore and creates a new
subdirectory in the warehouse directory in HDFS to store the table data. The location of this new
subdirectory depends on the database in which the table is created. Tables created in a default database are
stored in subdirectories directly under the warehouse directory. Tables created in other databases are stored
in subdirectories under those database directories. The basic syntax of the CREATE TABLE statement should
be familiar to anyone who has created tables in a relational database. After CREATE TABLE, you optionally
specify the database name, then give the name of the new table and a list of the columns and their data types.
If you omit the database name, then the new table will be created in the current database.
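A minimal sketch of this syntax, using invented database, table, and column names:

-- Created in the database sales_db (the 'sales_db.' prefix is optional):
CREATE TABLE sales_db.customers (
  id   INT,
  name STRING,   -- Hive and Impala commonly use STRING for text columns
  city STRING
);
-- With no database prefix, the table goes into the current database:
CREATE TABLE customers (id INT, name STRING, city STRING);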
IMPLEMENTATION
In this specialization I created a database for a bank, which includes bank details, customer details,
and account info. I created four tables, which contain customer personal info, customer reference
info, customer account info and bank info.
Below is the source code for the tables:
CREATE DATABASE BMS_DB33;
USE BMS_DB33;
SHOW DATABASES;
-- CUSTOMER_PERSONAL_INFO
CREATE TABLE CUSTOMER_PERSONAL_INFO
(CUSTOMER_ID VARCHAR(5),
CUSTOMER_NAME VARCHAR(30),
DATE_OF_BIRTH DATE,
GUARDIAN_NAME VARCHAR(30),
ADDRESS VARCHAR(50),
CONTACT_NO BIGINT(10),
MAIL_ID VARCHAR(30),
GENDER CHAR(1),
MARITAL_STATUS VARCHAR(10),
IDENTIFICATION_DOC_TYPE VARCHAR(20),
ID_DOC_NO VARCHAR(20),
CITIZENSHIP VARCHAR(10),
CONSTRAINT CUST_PERS_INFO_PK PRIMARY KEY(CUSTOMER_ID)
);
SHOW TABLES;
-- CUSTOMER_REFERENCE_INFO
CREATE TABLE CUSTOMER_REFERENCE_INFO
(
CUSTOMER_ID VARCHAR(5),
REFERENCE_ACC_NAME VARCHAR(20),
REFERENCE_ACC_NO BIGINT(16),
REFERENCE_ACC_ADDRESS VARCHAR(50),
RELATION VARCHAR(25),
CONSTRAINT CUST_REF_INFO_PK PRIMARY KEY(CUSTOMER_ID),
CONSTRAINT CUST_REF_INFO_FK FOREIGN KEY(CUSTOMER_ID) REFERENCES
CUSTOMER_PERSONAL_INFO(CUSTOMER_ID)
);
show tables;
-- BANK_INFO
CREATE TABLE BANK_INFO
(
IFSC_CODE VARCHAR(15),
BANK_NAME VARCHAR(25),
BRANCH_NAME VARCHAR(25),
CONSTRAINT BANK_INFO_PK PRIMARY KEY(IFSC_CODE)
);
-- ACCOUNT_INFO
CREATE TABLE ACCOUNT_INFO
(
ACCOUNT_NO BIGINT(16),
CUSTOMER_ID VARCHAR(5),
ACCOUNT_TYPE VARCHAR(10),
REGISTRATION_DATE DATE,
ACTIVATION_DATE DATE,
IFSC_CODE VARCHAR(10),
INTEREST DECIMAL(7,2),
INITIAL_DEPOSIT BIGINT(10),
CONSTRAINT ACC_INFO_PK PRIMARY KEY(ACCOUNT_NO),
CONSTRAINT ACC_INFO_PERS_FK FOREIGN KEY(CUSTOMER_ID) REFERENCES
CUSTOMER_PERSONAL_INFO(CUSTOMER_ID),
CONSTRAINT ACC_INFO_BANK_FK FOREIGN KEY(IFSC_CODE) REFERENCES
BANK_INFO(IFSC_CODE)
);
SHOW TABLES;
-- BANK_INFO
INSERT INTO BANK_INFO (IFSC_CODE, BANK_NAME, BRANCH_NAME)
VALUES ('HDVL0012', 'HDFC', 'VALASARAVAKKAM');
SELECT * FROM BANK_INFO;
INSERT INTO BANK_INFO (IFSC_CODE, BANK_NAME, BRANCH_NAME)
VALUES ('SBITN0123', 'SBI', 'TNAGAR');
INSERT INTO BANK_INFO (IFSC_CODE, BANK_NAME, BRANCH_NAME)
VALUES ('ICITN0232', 'ICICI', 'TNAGAR');
INSERT INTO BANK_INFO (IFSC_CODE, BANK_NAME, BRANCH_NAME)
VALUES ('ICIPG0242', 'ICICI', 'PERUNGUDI');
INSERT INTO BANK_INFO (IFSC_CODE, BANK_NAME, BRANCH_NAME)
VALUES ('SBISD0113', 'SBI', 'SAIDAPET');
-- CUSTOMER_PERSONAL_INFO
INSERT INTO CUSTOMER_PERSONAL_INFO
(CUSTOMER_ID, CUSTOMER_NAME, DATE_OF_BIRTH, GUARDIAN_NAME, ADDRESS, CONTACT_NO, MAIL_ID)
VALUES ('C-001', 'JOHN', '1994-05-03', 'PETER', 'NO-14 ST.MARKS ROAD, BENGALORE', '9948148628', 'JOHN_123@gmail.com');
INSERT INTO CUSTOMER_PERSONAL_INFO
(CUSTOMER_ID, CUSTOMER_NAME, DATE_OF_BIRTH, GUARDIAN_NAME, ADDRESS, CONTACT_NO, MAIL_ID)
VALUES ('C-002', 'JAMES', '1994-04-07', 'GEORGE', 'NO-18 MG. ROAD, DELHI', '9942137629', 'JAMES_213@gmail.com');
INSERT INTO CUSTOMER_PERSONAL_INFO
(CUSTOMER_ID, CUSTOMER_NAME, DATE_OF_BIRTH, GUARDIAN_NAME, ADDRESS, CONTACT_NO, MAIL_ID)
VALUES ('C-003', 'SUNITHA', '1994-09-04', 'VINOD', 'NO-21 GM ROAD, CHENNAI', '9942138029', 'SUNITHA_453@gmail.com');
INSERT INTO CUSTOMER_PERSONAL_INFO
(CUSTOMER_ID, CUSTOMER_NAME, DATE_OF_BIRTH, GUARDIAN_NAME, ADDRESS, CONTACT_NO, MAIL_ID)
VALUES ('C-004', 'RAMESH', '1995-06-08', 'KIRAN', 'NO-15 LB ROAD, CHENNAI', '9942438629', 'RAMESH_63@gmail.com');
INSERT INTO CUSTOMER_PERSONAL_INFO
(CUSTOMER_ID, CUSTOMER_NAME, DATE_OF_BIRTH, GUARDIAN_NAME, ADDRESS, CONTACT_NO, MAIL_ID)
VALUES ('C-005', 'KUMAR', '1994-07-02', 'PRASAD', 'NO-13 MM. ROAD, BANGALORE', '9949138629', 'KUMAR_3@gmail.com');
After inserting the values into the tables, I ran the following query to obtain the desired results:
SELECT ADDRESS, COUNT(*) FROM CUSTOMER_PERSONAL_INFO GROUP BY ADDRESS;
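Once the ACCOUNT_INFO table is also populated, queries can combine the tables. For example, the following sketch joins accounts to customers and banks; no ACCOUNT_INFO rows are inserted in the code above, so it is shown only to illustrate the join:

SELECT A.ACCOUNT_NO,
       C.CUSTOMER_NAME,
       B.BANK_NAME,
       A.INITIAL_DEPOSIT
FROM ACCOUNT_INFO A
JOIN CUSTOMER_PERSONAL_INFO C ON A.CUSTOMER_ID = C.CUSTOMER_ID
JOIN BANK_INFO B ON A.IFSC_CODE = B.IFSC_CODE
ORDER BY A.INITIAL_DEPOSIT DESC;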
Learning outcomes
By the end of this specialization, I have learned about big data and relational databases. I can distinguish
operational from analytic databases and understand how these are applied in big data; understand how
database and table design provides structures for working with data; recognize the features and benefits of
SQL dialects designed to work with big data systems for storage and analysis; and explore databases and
tables in a big data platform. I can explore and navigate databases and tables using different tools;
understand the basics of SELECT statements; understand how and why to filter results; explore grouping
and aggregation to answer analytic questions; work with sorting and limiting results; and combine multiple
tables in different ways. I can also use different tools to browse existing databases and tables in big data
systems; use different tools to explore files in distributed big data filesystems and cloud storage; create and
manage big data databases and tables using Apache Hive and Apache Impala; and describe and choose
among different data types and file formats for big data systems.
In short, I have learned about Structured Query Language and how data works in real time using SQL, with
the help of the Modern Big Data Analysis with SQL Specialization.
Gantt chart
Weeks W1-W6; courses and activities scheduled across the six weeks:
1. Foundations for Big Data Analysis with SQL
2. Analysing Big Data with SQL
3. Managing Big Data in Clusters and Cloud Storage
4. Report, Project, PPT
Bibliography
• Coursera
• Google
• YouTube