Uploaded by kbhargav252

Unit 1 - Introduction to Big Data

Overview of Big Data
Module 1
Evaluation Criteria - Theory
Criteria | Marks
Mid Marks (Best of Three) | 30M
Assignment | 5M
Quiz | 5M
Total | 40M
Evaluation Criteria - Lab
Criteria | Marks
Continuous Evaluation | 40M
Lab Exam 1 | 15M
Lab Exam 2 | 15M
Coursera | 20M
Case Study | 10M
Total | 100M
Coursera Course Link
Introduction to Big data with Spark and Hadoop
offered by IBM
Link: https://www.coursera.org/learn/introduction-to-big-data-with-spark-hadoop
Syllabus – Module 1
Getting an overview of Big Data
Big Data definition, History of Data Management, Structuring
Big Data, Elements of Big Data, Big Data Analytics.
Exploring use of Big Data in Business Context:
Use of Big Data in Social Networking, Use of Big Data in
preventing Fraudulent Activities in Insurance Sector & in
Retail Industry.
Why Big Data?
• Soaring Demand for Analytics Professionals
• Salary Aspects
• Big Data Analytics: A Top Priority in a lot of Organizations
• Big Data Analytics is Used Everywhere!
Soaring Demand for Analytics Professionals
Salary Aspects
Big Data – Job Titles
Big Data – Required skills
Big Data/Analytics Jobs (Toronto)
• Banks
– RBC, TD, CIBC, Scotiabank, AMEX, CapitalOne, ING Direct
• Telecommunications
– Rogers, Telus, Bell, etc.
• Technology
– BlackBerry, Huawei, CGI
• Manufacture/Services
– GM, Canada Post, Workopolis
• Insurance
– SunLife, Manulife
• Web/Mobile/Startup
– Google, Mozilla
• Digital Media/Agencies
– Globe and Mail, Kobo
• Consulting
– Accenture, IBM, Deloitte, SAS
• Retail/e-commerce
– Amazon, HR, Hudson Bay, Sears, Shoppers, Canadian Tire, Sobeys
• Pharmaceutical/Healthcare
– Hospitals, Clinical Research Companies etc.
Job Market
• Where are big data jobs?
– North America: Silicon Valley, Seattle, NYC, Toronto
– India/China
Big data jobs around the nation: http://www.tableausoftware.com/public/gallery/big-data-jobs
Big Data Salary: http://goo.gl/4et998
O'Reilly Media Data Science Salary Survey: http://www.oreilly.com/data/free/files/stratasurvey.pdf
KDNuggets 2014 Analytics/Data Science Salary Poll: http://goo.gl/VhO9IW
Why is Big Data important now?
'Big Data' has many valuable applications:
• Product recommendation
• Prediction
• Market Analysis
• Fraud detection
And many, many more ... Data must be processed to glean insights from it and derive the
value from it.
Big Data Made Possible
Hardware
‒ Big cluster of commodity machines at lower cost
• Faster processor
• Cheaper memory
• Bigger hard drive space
• Faster network bandwidth
Software
‒ Algorithms to allow parallel computing (map-reduce)
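The map-reduce idea named above can be illustrated with a minimal word-count sketch. This is not Hadoop code: the function names and the input text are assumptions made for illustration, and a thread pool stands in for a cluster of machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks):
    # Each chunk is mapped independently of the others, which is what lets
    # a real cluster hand different chunks to different machines.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


chunks = ["big data is big", "data is everywhere"]  # hypothetical input
counts = word_count(chunks)
print(counts)
```

The key design point is that the map step never looks at more than one chunk, so adding machines (here, threads) needs no change to the logic.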
What is Big Data?
Think of the following:
• Every second, there are around 8,22 tweets on Twitter.
• Every minute, nearly 510 comments are posted, 293,000 statuses are updated and
136,000 photos are uploaded on Facebook.
• Every hour, Walmart handles more than 1 million customer transactions.
• Every day, customers make around 11.5 million payments using PayPal.
- The digital world is producing data at a rapidly increasing rate because the use
of the internet and sensors is growing very fast.
- The sheer volume, variety, velocity and veracity of such data is signified by the
term ‘Big Data’.
What is Big Data?
• Big data is structured, unstructured and semi-structured in nature.
• It is difficult for computing systems to handle because of its high speed and volume.
• Traditional data management, warehousing and analysis systems fail to
analyze data that arrives at such high speed.
• Hadoop by Apache is widely used for storing and managing Big data.
• According to IBM, every day we create 2.5 quintillion bytes of data, so
much that 90% of the data in the world today has been created in the last two
years alone.
• Data: sensor data, climate data, GPS data and bank data, to name a
few. This data is Big data.
Big Data - Definition
• “Big data is high-volume, high-velocity, and high-variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making.”
• In simple words, Big data is a collection of data that is
huge in volume and still growing exponentially with time.
Its size and complexity are so large that none
of the traditional data management tools can store or
process it efficiently.
Data Expansion – Day by Day
• One of the main sources of data production: smart electronic devices
• Amount of data: expected to reach 175 zettabytes by 2025.
• Total volume of data: doubles every two years.
Tabular Representation of various Memory Sizes
NAME | EQUAL TO | SIZE (IN BYTES)
Bit | 1 bit | 1/8
Nibble | 4 bits | 1/2 (rare)
Byte | 8 bits | 1
Kilobyte (KB) | 1,024 bytes | 1,024
Megabyte (MB) | 1,024 kilobytes | 1,048,576
Gigabyte (GB) | 1,024 megabytes | 1,073,741,824
Terabyte (TB) | 1,024 gigabytes | 1,099,511,627,776
Petabyte (PB) | 1,024 terabytes | 1,125,899,906,842,624
Exabyte (EB) | 1,024 petabytes | 1,152,921,504,606,846,976
Zettabyte (ZB) | 1,024 exabytes | 1,180,591,620,717,411,303,424
Yottabyte (YB) | 1,024 zettabytes | 1,208,925,819,614,629,174,706,176
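The byte counts in the table follow one rule: each unit is 1,024 times the previous one. A short sketch that regenerates the values, assuming the same binary convention (1 KB = 1,024 bytes) as the table:

```python
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def size_in_bytes(unit):
    """Return the size of one unit in bytes, using 1 KB = 1,024 bytes."""
    return 1024 ** UNITS.index(unit)


for unit in UNITS:
    # The ',' format option inserts thousands separators.
    print(f"{unit:<10} {size_in_bytes(unit):,}")
```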
In simple words, various memory sizes
Sources of Big Data
• Social media
• Sensors placed in various cities
• Customer satisfaction feedback
• IoT appliances
• E-Commerce
• Global Positioning System(GPS)
Sources of Big Data
Social Media
• WhatsApp, Facebook, Instagram, Twitter, YouTube etc.
• Each activity (uploading a photo or video, making a comment, sending a
message, liking a post etc.) creates data.
Sensors
• Sensors in a city gather temperature, humidity etc.
• Cameras beside roads gather information
• Security cameras in airports/banks create a lot of data
Customer Satisfaction feedback
• Amazon, Flipkart, FirstCry, Licious, Swiggy, Blinkit, Zepto etc.
gather customer feedback on the quality of the product and the delivery time. It
creates a lot of data.
Sources of Big Data
IoT Appliance
• Electronic devices connected to the internet create data for their smart functionality. Example:
Samsung SmartThings.
E-Commerce
• Payments made by credit card, debit card, pay later, or any other electronic method are recorded as data.
Global Positioning System (GPS)
• Vehicle movement (directions, traffic congestion) creates a lot of data on vehicle position and
movement.
Features of Big data
Big Data:
• Is a new data challenge that requires leveraging existing systems differently
• Is classified in terms of 4 V's: Volume, Variety, Velocity, Veracity
• Is usually unstructured and qualitative in nature
Real world examples – Big data
• Social media analytics – Consumer product companies and retail
organizations observe data on social media websites to analyze
customer behaviour, preferences etc.
• Insurance companies use Big Data Analytics (BDA) to see which home insurance
applications can be processed immediately and which ones need a
validating in-person visit from an agent.
• Hospitals analyse medical data and patient records to predict
which patients are likely to be readmitted within a few months of
discharge.
• Relying on social networks and analytics, companies gather
volumes of data from the web to help musicians and music
companies better understand their audiences.
Types and Sources of data
Type | Description | Source
Social Data | Information collected from various social networking sites and online portals | Facebook, Twitter and LinkedIn
Machine Data | Information generated from RFID chips, bar code scanners and sensors | RFID chip readings, Global Positioning System (GPS) results
Transactional Data | Information generated from online shopping sites, retailers and Business to Business (B2B) transactions | Retail websites like eBay and Amazon
Caselet
History of Data Management – Evolution of Big Data
• Big data is the new term of data evolution directed by velocity, variety
and volume of data.
• Velocity implies the speed with which the data flows in an
organization.
• Variety refers to the varied forms of data, such as structured,
semi-structured or unstructured.
• Volume defines the amount or quantity of data an organization has to
deal with.
Challenges faced while handling the data over the past few decades
• In the early 60's, technology witnessed problems with velocity. This need inspired
the evolution of databases.
• In the 90's, technology witnessed issues with variety (emails, documents, videos),
leading to the emergence of non-SQL stores.
• Today, technology is facing issues related to huge volume, leading to new
storage and processing solutions.
Structuring Big Data
• In simple terms, structuring means arranging the available data
so that it becomes easy to study, analyse,
and derive conclusions from it.
• Information processing systems can
analyse on the basis of what you searched, what
you looked at, and how long you remained on
a particular page or website.
• When a user regularly visits or purchases
from Amazon, each time he/she logs in, the
system can present a recommended list of
products that may interest the user on the
basis of his/her purchases or searches. This
is the power of Big Data Analytics.
Types of Data
• Data that comes from multiple sources, such as
databases, ERP systems, weblogs, chat history and
GPS maps, varies in its format.
• Data is obtained primarily from the following types of
sources:
(a) Internal sources, such as organizational data
(b) External sources, such as social data
Types of Data
Data Source: Internal
• Definition: Provides structured data that originates within the enterprise and
helps run business
• Examples: Customer Relationship Management, Enterprise Resource Planning,
customer details, products and sales data
• Application: This data is used to support daily business operations of an
organization
Data Source: External
• Definition: Provides unstructured data that originates from the external
environment of an organization
• Examples: Business partners, Internet, market research organizations
• Application: This data is analyzed to understand entities that are mostly external
to the organization, such as customers, competitors, market and environment.
Types of Data
• Big data comprises
- Structured data
- Unstructured data
- Semi-structured data
Structured data
• Is organized data in a predefined format
• Is stored in tabular form
• Is the data that resides in fixed fields within a record or file
• Is formatted data that has entities and their attributes mapped
• Is used to query and report against predetermined datatypes
• Is managed and queried using SQL
• Represents only 5 to 10% of all the data
• When data grows beyond the capacity of an RDBMS, it can be
stored and analyzed in data warehouses, but only up to a
certain limit
Example – Sample of Structured data
Customer ID | Name | Product ID | City | State
123 | Jack | 4689 | Graz | Styria
321 | Sandy | 5688 | Wolfsberg | Carinthia
459 | Robert | 459 | Enns | Upper Austria
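Because structured data sits in fixed fields, it can be queried with SQL. The sketch below loads the sample rows into an in-memory SQLite database; the table name `customers` and the column names are assumptions chosen for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, name TEXT, product_id INTEGER, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (123, "Jack", 4689, "Graz", "Styria"),
        (321, "Sandy", 5688, "Wolfsberg", "Carinthia"),
        (459, "Robert", 459, "Enns", "Upper Austria"),
    ],
)


def customers_in_state(state):
    """Query against a predetermined field: names of customers in one state."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE state = ? ORDER BY name", (state,)
    )
    return [name for (name,) in rows]


print(customers_in_state("Styria"))
```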
Unstructured Data
• Data that lacks a predefined structure
• About 85% of total data is unstructured.
Ex:
• e-mail messages,
• word processing documents,
• videos, photos, audio files, presentations,
• web pages,
• other kinds of business documents.
Semi Structured Data
Also known as schema-less or self-describing data, semi-structured data
is a form of structured data that contains tags to separate elements and
to generate hierarchies of records and fields, as in the table below.
Sl No | Name | E-Mail
1 | Sam | smj@xyz.com
2 | First Name : David, Second Name : Brown | davidb@xyz.com
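One common way to store such self-describing records is JSON. The sketch below is an assumption about how the two rows might be encoded (the key names are hypothetical); it shows that the same field can have a different shape from record to record, and that the reading code must cope with both.

```python
import json

# Hypothetical JSON encoding of the two rows: the "name" field is a plain
# string in one record and a nested object in the other.
raw = """
[
  {"sl_no": 1, "name": "Sam", "email": "smj@xyz.com"},
  {"sl_no": 2, "name": {"first": "David", "second": "Brown"},
   "email": "davidb@xyz.com"}
]
"""


def full_name(record):
    """Return the name as one string, whichever shape the record uses."""
    name = record["name"]
    if isinstance(name, dict):
        return f'{name["first"]} {name["second"]}'
    return name


records = json.loads(raw)
names = [full_name(r) for r in records]
print(names)
```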
Elements of Big Data
• According to Gartner, data is growing at the rate of 59% every year.
This growth can be depicted in terms of the following four Vs:
(i) Volume
(ii) Velocity
(iii) Variety
(iv) Veracity
Department of CSE, GIT
8 May 2023
Course Code: EID449 Course Title: BIG DATA ANALYTICS
Volume
• Volume is the amount of data generated by organizations
or individuals.
• At present, the volume of data is measured in exabytes.
• In the coming years, the volume of data will be measured in zettabytes.
• Organizations are doing their best to handle this ever-increasing volume of data.
Example:
- Every minute, over 571 new websites are being created.
- A Boeing 737 will generate 240 terabytes of flight data during
a single flight across the US.
Velocity
• Velocity describes the rate at which data is generated, captured and shared.
• Information processing systems face a problem when data keeps adding up
faster than it can be processed.
Example: eBay analyses around 5 million transactions per day in real time to detect
and prevent frauds arising from the use of PayPal.
Sources of high velocity data:
- IT devices, including routers, firewalls and switches, generate valuable data.
- Social media, including Facebook posts and tweets, creates a huge amount of data that
must be analyzed quickly, as its value degrades quickly with time.
Variety
• Refers to structured, unstructured, and
semi-structured data that is gathered from
multiple sources and comes in different
formats, such as images, text, videos etc.
• While in the past data could only be
collected from spreadsheets and
databases, today data comes in an array of
forms such as emails, PDFs, photos, videos,
audio, social media posts, and so much more.
Veracity
• Refers to the uncertainty of data: the data that is
available can sometimes be messy, and its quality and
accuracy are difficult to control.
Example: Data in bulk could create confusion, whereas
too little data could convey half or incomplete
information.
In short, Simple 4V’s
Big Data Analytics
• Big Data analytics is a process used to extract meaningful
insights, such as hidden patterns, unknown correlations,
market trends, and customer preferences.
• Big Data analytics provides various advantages—it can be
used for better decision making, preventing fraudulent
activities, among other things.
• There are four main types of business/data analytics:
(a) Descriptive Analytics
(b) Diagnostic Analytics
(c) Predictive Analytics
(d) Prescriptive Analytics
Big Data Analytics - Descriptive analytics –
“What happened in the business?”
• Descriptive analytics analyses a database to provide
information on the trends of past or current business
events that can help managers, planners and leaders to
develop a roadmap for future actions.
• In short, it helps in identifying the root cause of a problem and
the underlying reason for failures.
Example: During the pandemic, a leading
pharmaceuticals company conducted data analysis on
its offices and research labs. Descriptive analytics
helped them identify unutilized spaces and departments
that were then consolidated, saving the company millions of
dollars.
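A minimal sketch of the descriptive step in an example like this one: summarize past records, then read off what stands out. The occupancy figures, site names and the 30% threshold are all hypothetical.

```python
from statistics import mean

# Hypothetical daily occupancy (fraction of desks in use) per site.
occupancy = {
    "Office A": [0.82, 0.79, 0.85, 0.80],
    "Office B": [0.21, 0.18, 0.25, 0.20],
    "Lab C": [0.55, 0.60, 0.58, 0.57],
}


def summarize(data):
    """Average past occupancy per site: a description of what happened."""
    return {site: mean(values) for site, values in data.items()}


def underused(summary, threshold=0.30):
    """Sites whose average occupancy is below the assumed threshold."""
    return [site for site, avg in summary.items() if avg < threshold]


summary = summarize(occupancy)
candidates = underused(summary)
print(candidates)
```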
Big Data Analytics - Diagnostic analytics
• Diagnostic analytics helps companies understand
why a problem occurred. Big data technologies and
tools allow users to mine and recover data that helps
dissect an issue and prevent it from happening in the
future.
Example: A clothing company's sales have decreased
even though customers continue to add items to their
shopping carts. Diagnostic analytics helped the company
understand that the payment page had not been working
properly for a few weeks.
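One way to sketch this kind of diagnosis is to compare step-by-step conversion rates between a normal period and the problem period and look for the step that fell the most. The counts and step names below are hypothetical.

```python
# Hypothetical weekly counts of shoppers reaching each checkout step.
funnel = {
    "normal week": {"cart": 1000, "payment": 700, "order": 650},
    "bad week": {"cart": 1020, "payment": 710, "order": 120},
}
STEPS = ["cart", "payment", "order"]


def conversion_rates(counts):
    """Share of shoppers who move from each step to the next."""
    return {
        f"{a}->{b}": counts[b] / counts[a]
        for a, b in zip(STEPS, STEPS[1:])
    }


def biggest_drop(baseline, current):
    """Return the transition whose conversion rate fell the most."""
    base, cur = conversion_rates(baseline), conversion_rates(current)
    return max(base, key=lambda step: base[step] - cur[step])


suspect = biggest_drop(funnel["normal week"], funnel["bad week"])
print(suspect)
```

Here shoppers still reach the payment step at the usual rate, but far fewer complete an order, which points at the payment page.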
Big Data Analytics - Predictive analytics
– “What could happen?”
Predictive analytics is about understanding and predicting the future by using
statistical models and different forecasting techniques.
Here, we use statistics, data mining techniques and
machine learning to forecast future outcomes.
Example: In the manufacturing sector, companies
can use algorithms based on historical data to
predict if or when a piece of equipment will
malfunction or break down.
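As a minimal sketch of the equipment example, a straight line can be fitted to historical sensor readings and extended forward to the point where a failure level is reached. The readings and the failure level are hypothetical, and a simple linear trend is an assumption; real systems typically use richer models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history: operating hours vs. measured vibration level.
hours = [100, 200, 300, 400, 500]
vibration = [1.0, 1.5, 2.0, 2.5, 3.0]
FAILURE_LEVEL = 5.0  # assumed level at which the machine is expected to fail

slope, intercept = fit_line(hours, vibration)
# Solve FAILURE_LEVEL = slope * h + intercept for h.
predicted_failure_hours = (FAILURE_LEVEL - intercept) / slope
print(round(predicted_failure_hours))
```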
Big Data Analytics - Prescriptive
analytics – “What should we do?”
• Prescriptive analytics builds on complex data
from descriptive and predictive analyses.
• Using optimization techniques, it
determines the best alternative to
minimize or maximize some objective,
in marketing and many other areas.
Example: If we have to find the best way of
shipping goods from a factory to a destination
so as to minimize costs, we will use prescriptive
analytics.
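The shipping example is an optimization problem. One simple way to sketch it is as a cheapest-path search over a network of locations using Dijkstra's algorithm; the locations and costs below are hypothetical, and real prescriptive systems may use other optimization methods.

```python
import heapq

# Hypothetical shipping network: cost of moving goods between locations.
ROUTES = {
    "factory": {"port": 4, "rail hub": 2},
    "rail hub": {"port": 1, "warehouse": 7},
    "port": {"warehouse": 3},
    "warehouse": {},
}


def cheapest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total cost, path) of the cheapest route."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, step_cost in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(
                    queue, (cost + step_cost, neighbour, path + [neighbour])
                )
    return None  # goal cannot be reached


best = cheapest_route(ROUTES, "factory", "warehouse")
print(best)
```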
Questions
• List the four elements of Big Data.
• As an HR manager of a company providing Big Data
solutions to clients, what characteristics would you look for
when recruiting a potential candidate for the position of data
analyst?
• You are planning the marketing strategy for a new product
in your company. Identify and list some limitations of
structured data related to the work.
Exploring the Use of Big Data in
Business Context
Use of Big Data in Social Networking
Use of Big Data in preventing fraudulent activities
Use of Big Data in preventing fraudulent activities in Insurance Sector
Use of Big Data in Retail Industry
Exploring the Use of Big Data in Business Context
• An organization generally has to spend huge amounts to collect data
and information.
• For example, the cost of customer surveys keeps escalating
as an organization keeps collecting more information.
The continuously increasing cost decreases the value of the collected
information.
• In other words, collecting and maintaining a pool of data and
information is just a waste of resources unless logical conclusions
and business insights can be derived from it.
• This is where Big Data analytics comes into the picture.
Use of Big Data in Social Networking
• Social network data refers to the data generated by people
socializing on social media.
• Some popular social networking sites are Twitter, Facebook, LinkedIn
etc.
• On a social networking site, different people constantly add and
update comments, statuses, likes, preferences etc. All these activities
generate large amounts of data.
• This data can be segregated on the basis of different age groups,
locations and genders for the purpose of analysis.
Use of Big Data in Social Networking
• Social Networking Analysis (SNA) – Analysis
performed on the data from social media.
Example: Mobile Network Operator (MNO)
• The data captured by an MNO each day in the form of
phone calls, text messages and other
record details of all its customers
is huge in volume.
• The company should study the data of
the people whom the customer called and
also of the people who called back. Such a
network is called a Social Network.
Use of Big Data in
Social Networking
• The data analysis process can go
deeper and deeper within the network
to get a complete picture of a social
network.
• As the analysis goes deeper, the
volume of data to be analyzed also
becomes massive.
• The same structure of SNA is followed
when it comes to social networking
sites.
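Going "deeper and deeper" into a network can be sketched as a breadth-first search over call records, where each extra level of depth pulls in more people. The call records and names below are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical call records: (caller, callee).
calls = [("Asha", "Ben"), ("Ben", "Asha"), ("Ben", "Chen"),
         ("Chen", "Dev"), ("Dev", "Esha")]


def build_network(records):
    """Treat every call as an undirected link between two people."""
    network = defaultdict(set)
    for caller, callee in records:
        network[caller].add(callee)
        network[callee].add(caller)
    return network


def contacts_within(network, person, depth):
    """Breadth-first search: everyone reachable in at most `depth` hops."""
    seen = {person: 0}
    queue = deque([person])
    while queue:
        current = queue.popleft()
        if seen[current] == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in network[current]:
            if neighbour not in seen:
                seen[neighbour] = seen[current] + 1
                queue.append(neighbour)
    return {p for p in seen if p != person}


network = build_network(calls)
for depth in (1, 2, 4):
    print(depth, sorted(contacts_within(network, "Asha", depth)))
```

Each increase in depth enlarges the set of people to analyse, which is why the data volume grows quickly as the analysis goes deeper.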
Use of Big Data in Social
Networking
• Following are the areas in which decision-making
processes are influenced by social network data:
(a) Business Intelligence
(b) Marketing
(c) Product design and development
Use of Big Data in Social Networking
– Business Intelligence(BI)
• Data analysis process to convert a raw dataset to
meaningful information.
• Allows a company to collect, store, access and
analyse the data for adding value to decision
making.
• The data generated from different social media is
analyzed using Social Customer Relationship
Management(CRM) which is used to describe the
data.
Use of Big Data in Social Networking
– Business Intelligence(BI)
Example:
• Consider a mobile service provider that has a low-value customer.
• If the low-value customer is not satisfied with the services and
wants to leave, the company generally has no problem
letting the customer go, as he brings in low revenue.
• With the help of SNA, the organization can identify that some
connections in the customer's network make a large number of
calls and text messages and have a large network of friends.
• With such an analysis, the organization might take an
altogether different decision and start valuing the
customer more, because the influence of a customer is very important to
the organization.
Use of Big Data in Social Networking – Marketing
• Today, customer preferences have changed because of busy
schedules: customers have no time to read newspapers, watch TV commercials or go
through marketing emails.
• Customers can now make their preferences clear and select the
marketing messages they wish to receive.
• In today's world, marketers aim to deliver what consumers want by
using interactive communication across digital channels such as e-mail, mobile, social and the Web, which in turn generates
social data.
Product Design and Development
• By listening to customers' needs, by understanding where the gap in the offering is, and
so on, organizations can make the right decisions about the direction of their product
development and offerings.
Example: YouTube – Rate a brand on a scale of 1-10, know a brand etc.
• Once the number of brand ratings crosses 300, the application sends out a report on
what customers feel about the product, along with a detailed analysis of
the brand's reputation.
• In this way, social networks can help organizations improve product development
by making sure of customer needs.
• Sentiment analysis analyses human emotions, attitudes and views across popular social
networks.
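A very small sketch of sentiment analysis is a word-list (lexicon) score: count positive words, subtract negative words. The word lists here are hypothetical and tiny; production systems use much larger lexicons or trained models.

```python
# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def sentiment_score(text):
    """Number of positive words minus number of negative words in a post."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text):
    """Map the score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("I love this phone, great camera!"))
```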
Use of Big Data in Preventing Fraudulent Activities
• Most common types of financial fraud:
(a) Credit card fraud
(b) Exchange or return policy fraud – Amazon/Flipkart
(c) Personal information fraud –
The fraudster obtains the login details of a customer, purchases a product online, and then
changes the delivery address to a different location. The actual customer keeps calling
the retailer to have the amount refunded, as he did not make the transaction.
Preventing Fraud using Big Data Analytics
Analyzing Big Data allows organizations to:
• Keep track of and process huge volumes of data.
• Differentiate between real and fraudulent entries.
• Identify new methods of fraud and add them to the list of
fraud-prevention checks.
• Verify whether a product has actually been delivered to a valid
recipient.
• Determine the location of the customer and the time when the
product was actually delivered.
Use of Big Data in Detecting Fraudulent Activities
in Insurance Sector
• An insurance company wants to improve its ability to take decisions while
processing claims.
• It decides to implement a Big Data analytical platform, which will use data
from social media to provide a real-time view of the case in hand.
• The information obtained will enable the insurance agent to diagnose
patterns in the customer's claims, behavior and other issues.
Example: In some cases, social media can also provide strong triggers for identifying
fraud. A customer might indicate that his car was destroyed in a flood, but the
documentation from the social media feed may show that the car was actually in
another city on the day the flood occurred.
Fraud Detection
• Traditionally, insurance companies identified fraudulent claims by using
statistical models.
• Social Networking Analysis(SNA) is an innovative way to identify and
detect frauds.
• SNA tool uses a mix of analytical methods which includes statistical
methods, pattern analysis and link analysis to identify any kinds of
relationships or patterns within large amounts of data collected from
different sources.
• When link analysis is used in fraud detection, one looks for clusters of
data and how these clusters are linked to other data clusters.
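Link analysis can be sketched as grouping records that share an attribute value, then following those links until connected clusters emerge. The claims and attributes below are hypothetical, and the grouping uses a simple union-find structure as one possible implementation.

```python
from collections import defaultdict

# Hypothetical claims with the phone number and address given by the claimant.
claims = {
    "C1": {"phone": "111", "address": "12 Lake Rd"},
    "C2": {"phone": "111", "address": "9 Hill St"},
    "C3": {"phone": "222", "address": "9 Hill St"},
    "C4": {"phone": "333", "address": "4 Park Ave"},
}


def linked_clusters(records):
    """Group claims that are connected through any shared attribute value."""
    by_value = defaultdict(list)
    for claim_id, attrs in records.items():
        for key, value in attrs.items():
            by_value[(key, value)].append(claim_id)

    # Union-find: every claim starts in its own cluster.
    parent = {claim_id: claim_id for claim_id in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Claims sharing a value are merged into one cluster.
    for ids in by_value.values():
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])

    clusters = defaultdict(set)
    for claim_id in records:
        clusters[find(claim_id)].add(claim_id)
    return sorted(clusters.values(), key=lambda c: sorted(c))


clusters = linked_clusters(claims)
print(clusters)
```

C1 and C3 share nothing directly, but both are linked to C2, so all three land in one cluster that an investigator might review together.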
Fraud detection using SNA method
Social Customer Relationship
Management(CRM)
• Social CRM enables effective fraud detection in the insurance sector.
• Social CRM is a process; it is not a platform or a technology.
• This makes it critical for insurance companies to link social media sites to
their CRM systems.
• If social media is integrated within an organization, it provides high
transparency on various customer-related issues.
Social CRM Process
• Data is collected from the organization's existing CRM and from different social
media platforms.
• Reference data obtained from the social media platforms and the data
stored in the CRM are loaded into the claim management system, which
compares and analyses the data and provides results.
• The response received from the claim management system is then
investigated.
Use of Big Data in Retail Industry
• Big data has huge potential for the retail industry, considering the immense
number of transactions and their correlation.
• A single retail location has a small customer database, and it is easy to answer
simple questions like:
(a) How many basic tees did we sell today?
(b) What time of the year do we sell the most leggings?
(c) What else has the customer bought, and what kind of coupons can we send to the
customer?
• However, with millions of transactions spread across multiple locations, it is
impossible to find answers to such questions.
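For a single store, questions (a) and (b) reduce to simple aggregations over the transaction list, as this sketch shows. The transactions are hypothetical; at the scale of millions of records across many locations the same logic would have to run on a distributed platform rather than in one Python process.

```python
from collections import Counter

# Hypothetical transactions: (store, month, product, quantity).
transactions = [
    ("Store 1", "Jan", "basic tee", 3),
    ("Store 2", "Jan", "leggings", 5),
    ("Store 1", "Nov", "leggings", 9),
    ("Store 3", "Nov", "leggings", 4),
    ("Store 2", "Nov", "basic tee", 2),
]


def units_sold(records, product):
    """Total units of a product sold across all stores."""
    return sum(qty for _, _, item, qty in records if item == product)


def best_month(records, product):
    """Month in which the product sold the most units."""
    per_month = Counter()
    for _, month, item, qty in records:
        if item == product:
            per_month[month] += qty
    return per_month.most_common(1)[0][0]


print(units_sold(transactions, "basic tee"), best_month(transactions, "leggings"))
```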
Use of Big Data in Retail Industry
Use of Big Data in Detecting Fraudulent
Activities in Retail Sector
Retail fraud:
It is an illegal transaction that a fraudster performs using stolen credit
card details or loopholes in the order placement and payment systems
and company policies. As technology grew, so did the fraudsters'
sophistication of executing frauds online.
Types of Retail fraud:
(a) Transaction fraud
(b) Return fraud
(c) Chargeback guarantee fraud
Types of Retail fraud
• Transaction fraud
It is also called card-not-present (CNP) fraud where the fraudster uses a stolen credit card
for online purchases. The company loses money when the original owner of the card
demands a chargeback.
• Return fraud
Example - e-commerce industry
• Chargeback guarantee fraud
Many online retail fraud prevention solutions guarantee that they will block all fraudulent transactions
and friendly frauds and even pay the admin fee out of their own pocket. The problem arises
when the solution also blocks legitimate customers. This is called a false positive, and it not
only damages the retailer's reputation but also results in loss of revenue.
Use of Big Data in Detecting Fraudulent Activities
in Retail Sector - Fraud Detection in Real time
• Big Data helps to detect frauds in real time.
Example:
(a) In an online transaction, a Big Data system can compare the incoming IP address with
the geotag received from the customer's smartphone apps. A valid match between
the two confirms the authenticity of the transaction.
(b) It also examines the entire historical data to track suspicious patterns in the
customer's orders.
Retailers perform Big Data analysis in real time to know the actual time at which
a product was delivered.
Costly products have sensors attached that transmit their location
information, thereby preventing frauds.
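The IP-versus-geotag check in (a) can be sketched as follows. The prefix-to-country table is a stand-in assumption (the addresses are reserved documentation ranges); a real system would query a geolocation database or service.

```python
# Hypothetical lookup from IP prefix to country code.
IP_COUNTRY = {"203.0.113.": "IN", "198.51.100.": "US"}


def country_of_ip(ip):
    """Return the country for a known IP prefix, or None if unknown."""
    for prefix, country in IP_COUNTRY.items():
        if ip.startswith(prefix):
            return country
    return None


def looks_authentic(ip, geotag_country):
    """Accept the transaction only if IP location and app geotag agree."""
    ip_country = country_of_ip(ip)
    return ip_country is not None and ip_country == geotag_country


print(looks_authentic("203.0.113.7", "IN"))
```

An unknown IP is treated as a mismatch here; that is a design choice of this sketch, and a real system might instead route such cases to manual review.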
Questions
• Discuss some areas in which decision-making processes are
influenced by social network data.
• List some common types of financial frauds prevalent in the current
business scenario.
• In what ways does analyzing Big Data help organizations prevent
fraud?
• List some methods used for verification of credit cards.
• List the steps that SNA follows to detect fraud.
• What is Social Customer Relationship Management (CRM)?