Uploaded by Sa'dia Abdulnasir

01-Introduction[1]

Chapter 01
Introduction
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
What is „Big Data“?!?
Is this really
about size?
What is Data?
• Data: the quantities, characters, or symbols on
which a computer performs operations, and
which may be stored and transmitted in the form
of electrical signals and recorded on magnetic,
optical, or mechanical recording media.
• Next, the definition of Big Data.
What is Big Data?
• Big Data is a collection of data that is huge in volume and
keeps growing exponentially over time. Its size and complexity
are so large that no traditional data management tool can
store or process it efficiently. In short, Big Data is still
data, only at a huge scale.
What is an Example of Big Data?
• The New York Stock Exchange is one example of Big Data:
it generates about one terabyte of new trade data per day.
Social Media
• Statistics show that more than 500 terabytes of new data are ingested
into the databases of the social media site Facebook every day. This
data mainly comes from photo and video uploads,
message exchanges, comments, etc.
Types Of Big Data
• The types of Big Data are:
1. Structured
2. Unstructured
3. Semi-structured
Structured
• Any data that can be stored, accessed, and processed in
a fixed format is termed ‘structured’ data.
Over time, computer science has become very successful
at developing techniques for working with this kind of
data (where the format is well known in advance) and at
deriving value from it. However, problems now arise when
the size of such data grows to a huge extent, with typical
sizes in the range of multiple zettabytes.
• Did you know? 10^21 bytes equal 1 zettabyte; in other words, one billion
terabytes form a zettabyte.
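The unit arithmetic above can be checked with a short Python sketch. It assumes decimal (SI) prefixes, where a terabyte is 10^12 bytes; the constant and function names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes (assumption: SI, not binary, prefixes).
TERABYTE = 10**12
ZETTABYTE = 10**21


def terabytes_per_zettabyte() -> int:
    """Return how many terabytes fit into one zettabyte."""
    return ZETTABYTE // TERABYTE


print(terabytes_per_zettabyte())  # 1000000000, i.e. one billion
```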
Examples Of Structured Data
Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000
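Because structured data has a format that is known in advance, each row can be read into a typed record and queried by column. The following is a minimal sketch, assuming the employee table is stored as comma separated values; the `Employee` class and the `load_employees` function are illustrative names, not part of the original slides.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Employee:
    employee_id: int
    name: str
    gender: str
    department: str
    salary: int


# The employee table from the slide, written as comma separated values.
CSV_TEXT = """Employee_ID,Employee_Name,Gender,Department,Salary_In_lacs
2365,Rajesh Kulkarni,Male,Finance,650000
3398,Pratibha Joshi,Female,Admin,650000
7465,Shushil Roy,Male,Admin,500000
7500,Shubhojit Das,Male,Finance,500000
7699,Priya Sane,Female,Finance,550000
"""


def load_employees(text: str) -> list[Employee]:
    """Parse each CSV row into a typed Employee record."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Employee(
            employee_id=int(row["Employee_ID"]),
            name=row["Employee_Name"],
            gender=row["Gender"],
            department=row["Department"],
            salary=int(row["Salary_In_lacs"]),
        )
        for row in reader
    ]


employees = load_employees(CSV_TEXT)
# With a fixed format, a query is a simple filter on a known column.
finance_total = sum(e.salary for e in employees if e.department == "Finance")
print(finance_total)  # 1700000
```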
Unstructured
• Any data whose form or structure is unknown is classified as
unstructured data. In addition to its huge size, unstructured data
poses multiple challenges when it is processed to derive value
from it. A typical example of unstructured data is a heterogeneous
data source containing a combination of simple text files, images,
videos, etc. Nowadays, organizations have a wealth of data available,
but unfortunately they do not know how to derive value from it because
the data is in its raw, unstructured form.
Examples Of Unstructured Data
• The output returned by ‘Google Search’
Semi-structured
• Semi-structured data can contain both forms of
data. It looks structured in form, but it is not
actually defined by, e.g., a table definition in a
relational DBMS. An example of semi-structured data
is data represented in an XML file.
• Example of semi-structured data: personal data stored in an XML file
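A minimal Python sketch of what working with such a file can look like, using the standard library XML parser. The XML content below is entirely hypothetical (made-up names and fields) and only illustrates that records share a tag structure without following a fixed table definition: the second person has an email but no age.

```python
import xml.etree.ElementTree as ET

# Hypothetical personal records; names and fields are invented for illustration.
XML_TEXT = """
<people>
  <person><name>Alice Example</name><age>35</age></person>
  <person><name>Bob Example</name><email>bob@example.org</email></person>
</people>
"""


def parse_people(text: str) -> list[dict[str, str]]:
    """Turn each <person> element into a dict of whatever fields it carries."""
    root = ET.fromstring(text.strip())
    return [
        {child.tag: (child.text or "") for child in person}
        for person in root.findall("person")
    ]


people = parse_people(XML_TEXT)
for record in people:
    print(record)
```

The tags make each record parseable, but nothing forces two records to carry the same fields, which is what separates this from a relational table.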
Characteristics Of Big Data
• Big Data can be described by the following
characteristics:
• Volume
• Variety
• Velocity
Volume
• Volume – The name Big Data itself refers to
an enormous size. The size of data plays a
crucial role in determining its value.
Whether a particular data set can be
considered Big Data at all also depends
on its volume.
Hence, ‘Volume’ is one characteristic that
needs to be considered when dealing with Big
Data solutions.
The 3 Vs: Volume
• Scale of the data must be „big“
• No clear definition
• „that demand […] innovative forms of information processing“ (Gartner)
[Figure: Data center storage worldwide. © Statista 2018]
Variety
• Variety – The next aspect of Big Data is its variety.
• Variety refers to heterogeneous sources and the nature
of data, both structured and unstructured. In
earlier days, spreadsheets and databases were the only
sources of data considered by most
applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc.
is also considered in analysis applications.
This variety of unstructured data poses certain issues
for storing, mining, and analyzing data.
The 3 Vs: Variety
• Diversity in data types and data sources
Structured
• Data with defined types and structure
• Example: comma separated values
Semi-Structured
• Textual data with parseable pattern
• Example: XML files with schema
Quasi-Structured
• Textual data with erratic formats that can be formatted with effort
• Example: Clickstream data
Unstructured
• Data that has no inherent structure, often with multiple formats
• Example: Web site, videos
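To illustrate what "can be formatted with effort" means for quasi-structured data, here is a small Python sketch that pulls fields out of clickstream-like text with a regular expression. The log line format, the field names, and the sample lines are assumptions made up for this example; real clickstream formats differ between systems.

```python
import re
from typing import Optional

# Hypothetical clickstream lines; the layout is an assumption for this sketch.
RAW_CLICKS = [
    "2018-04-01T10:15:00 user=42 GET /products/17?ref=home",
    "2018-04-01T10:15:07 user=42 GET /cart",
    "corrupted line without the expected fields",
]

CLICK_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+user=(?P<user>\d+)\s+(?P<method>[A-Z]+)\s+(?P<path>\S+)$"
)


def parse_click(line: str) -> Optional[dict]:
    """Return the fields of one click, or None if the line does not match."""
    match = CLICK_PATTERN.match(line)
    return match.groupdict() if match else None


parsed = [parse_click(line) for line in RAW_CLICKS]
clicks = [p for p in parsed if p is not None]
print(len(clicks), "of", len(RAW_CLICKS), "lines parsed")
```

Lines that do not fit the pattern are dropped here; in practice they would usually be logged and inspected, since erratic formats are the norm for this kind of data.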
Velocity
• Velocity – The term ‘velocity’ refers to the speed at which
data is generated. How fast data is generated and
processed to meet demand determines the real
potential of the data.
• Big Data velocity deals with the speed at which data
flows in from sources such as business processes,
application logs, networks, social media sites,
sensors, mobile devices, etc. The flow of data is
massive and continuous.
The 3 Vs: Velocity
• Speed at which new data is created
• Speed at which data must be processed and analyzed
• Often close to real-time
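One common way to handle data that arrives continuously is to keep only a recent window of events instead of storing everything before analyzing it. The following Python sketch counts events in a sliding time window; the class name, the window length, and the sample timestamps are illustrative assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> int:
        """Record one event and return how many events are still in the window."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        while self._timestamps[0] <= timestamp - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps)


counter = SlidingWindowCounter(window_seconds=10)
counts = [counter.add(t) for t in [0, 1, 2, 9, 12, 25]]
print(counts)  # [1, 2, 3, 4, 2, 1]
```

Each event is processed once when it arrives and discarded once it is too old, so memory use depends on the event rate within the window rather than on the total amount of data seen.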
Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Defining Data Science
• Unfortunately, there is no clear definition (yet?)
• Goal is the extraction of knowledge from data
• Combination of techniques from different disciplines
• Scientific principles guide the data analysis
What is data science?
• Data science combines math and statistics, specialized
programming, advanced analytics, artificial intelligence (AI), and
machine learning with specific subject matter expertise to
uncover actionable insights hidden in an organization’s data.
These insights can be used to guide decision making and
strategic planning.
A data science project undergoes the following stages:
• Data ingestion: The lifecycle begins with data
collection: gathering raw structured and unstructured
data from all relevant sources using a variety of
methods. These methods can include manual entry,
web scraping, and real-time streaming data from
systems and devices. Data sources can include
structured data, such as customer data, along with
unstructured data like log files, video, audio,
pictures, the Internet of Things (IoT), social media,
and more.
A data science project undergoes the following stages:
• Data storage and data processing: Since data can
have different formats and structures, companies need
to consider different storage systems based on the type
of data that needs to be captured. Data management
teams help to set standards around data storage and
structure, which facilitate workflows around analytics,
machine learning and deep learning models. This stage
includes cleaning data, deduplicating, transforming and
combining the data using ETL (extract, transform, load)
jobs or other data integration technologies. This data
preparation is essential for promoting data quality
before loading into a data warehouse, data lake, or
other repository.
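The cleaning, deduplicating, and transforming steps of an ETL job can be sketched in a few lines of Python. Everything in this example is hypothetical: the customer rows, the field names, and the plain list that stands in for a data warehouse are assumptions chosen to keep the sketch self-contained.

```python
# Hypothetical raw customer rows, as they might arrive from two source systems.
RAW_ROWS = [
    {"id": "1", "email": " Ann@Example.org ", "country": "de"},
    {"id": "2", "email": "bob@example.org", "country": "US"},
    {"id": "1", "email": "ann@example.org", "country": "DE"},  # duplicate of id 1
    {"id": "3", "email": "", "country": "fr"},  # missing email
]


def extract() -> list[dict]:
    """Extract: here simply return a copy of the in-memory rows."""
    return list(RAW_ROWS)


def transform(rows: list[dict]) -> list[dict]:
    """Transform: clean values, drop incomplete rows, deduplicate by id."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()
        if not email:
            continue  # cleaning: skip rows without an email
        record_id = int(row["id"])
        if record_id in seen_ids:
            continue  # deduplication: keep the first row per id
        seen_ids.add(record_id)
        cleaned.append(
            {"id": record_id, "email": email, "country": row["country"].upper()}
        )
    return cleaned


def load(rows: list[dict], warehouse: list[dict]) -> None:
    """Load: append the prepared rows to the target store (a plain list here)."""
    warehouse.extend(rows)


warehouse: list[dict] = []
load(transform(extract()), warehouse)
print(warehouse)
```

Keeping the three steps as separate functions mirrors the extract, transform, load split and makes each step testable on its own.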
A data science project undergoes the following stages:
• Data analysis: Here, data scientists conduct an
exploratory data analysis to examine biases, patterns,
ranges, and distributions of values within the data. This
exploration drives hypothesis generation
for A/B testing. It also allows analysts to determine the
data’s relevance for use within modeling efforts for
predictive analytics, machine learning, and/or deep
learning. Depending on a model’s accuracy,
organizations can come to rely on these insights for
business decision making, which in turn helps them
scale.
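A first exploratory look at ranges and distributions needs nothing beyond the Python standard library. The sketch below uses the five salary values from the structured-data example earlier in the deck; the choice of summary statistics is an assumption about what a first pass might include, not a prescribed procedure.

```python
import statistics
from collections import Counter

# Salary values taken from the structured-data example earlier in the deck.
salaries = [650000, 650000, 500000, 500000, 550000]

summary = {
    "min": min(salaries),
    "max": max(salaries),
    "mean": statistics.mean(salaries),
    "median": statistics.median(salaries),
    "stdev": statistics.stdev(salaries),  # sample standard deviation
}
distribution = Counter(salaries)  # how often each value occurs

print(summary)
print(distribution.most_common())
```

With real data, the same questions (what is the range, where is the center, which values dominate) are usually asked per column and per group before any model is built.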
Data science versus business intelligence
• It may be easy to confuse the terms “data science” and “business intelligence” (BI)
because they both relate to an organization’s data and analysis of that data, but they do
differ in focus.
• Business intelligence (BI) is typically an umbrella term for the technology that enables
data preparation, data mining, data management, and data visualization. Business
intelligence tools and processes allow end users to identify actionable information from
raw data, facilitating data-driven decision-making within organizations across various
industries.
Data science versus business intelligence
• While data science tools overlap with BI in much of this, business intelligence
focuses more on data from the past, and the insights from BI tools are more
descriptive in nature. BI uses data to understand what happened before in order to
inform a course of action, and it is geared toward static (unchanging) data that is
usually structured. Data science also uses descriptive data, but it typically
uses it to determine predictive variables, which are then used to categorize
data or to make forecasts.
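The difference between a descriptive and a predictive question can be shown on the same small data set. In this Python sketch the monthly sales figures are hypothetical, and a straight-line least-squares fit stands in for "predictive modelling"; real forecasting would normally use richer models and validation.

```python
# Hypothetical monthly sales figures (month index, units sold).
months = [1, 2, 3, 4, 5]
sales = [100, 110, 120, 130, 140]


def describe(values: list[float]) -> float:
    """BI-style descriptive answer: what was the average so far?"""
    return sum(values) / len(values)


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept

print(describe(sales))    # descriptive: average of the past months
print(forecast_month_6)   # predictive: estimate for the next month
```

The first number answers "what happened?", the second answers "what will happen?", using month as the predictive variable.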
Data Science vs. Business Intelligence
• Business Intelligence (Gartner IT Glossary)
• […] best practices that enable access to and analysis of information to
improve and optimize decisions and performance.
                  Business Intelligence         Data Science
Techniques        Dashboards, alerts, queries   Optimization, predictive modelling, forecasting
Data Types        Structured, data warehouses   Any kind, often unstructured
Common questions  What happened…?               What if…?
                  How much did…?                What will…?
                  When did…?                    How can we…?
[Figure: Business Intelligence and Data Science plotted by depth of insights (low to high) over time (past, present, future)]
Mathematical Aspects
• Computational Geometry
• Optimization
• Scientific Computing
• Stochastics
• Machine Learning
Computer Science Aspects
• Data Structures and Algorithms
• Software Engineering
• Databases
• Artificial Intelligence
• Distributed Computing
• Machine Learning
Statistical Aspects
• Linear Models
• Statistical Tests
• Time Series Analysis
• Inference
• Machine Learning
Applications
• Intelligent Systems
• Robotics
• Marketing
• Medicine
• Autonomous Driving
• Social Networks
More Data → More Opportunities
[Figure: volume of information, from small to large (terabytes, petabytes, exabytes), over time]
• 1990s: Relational Databases & Data Warehouses
• 2000s: Content Management
• 2010s: Key-Value Storages & Unstructured Data
Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
What are Data Scientists?
• Not computer scientists
  • But should know about databases, data structures, algorithms, etc.
• Not mathematicians
  • But should know about optimization, stochastics, etc.
• Not statisticians
  • But should know about regression, statistical tests, etc.
• Not domain experts
  • But must work together with them
Skills of Data Scientists
Quantitative
• Maths
• Algorithms
• Statistics
Collaborative
• Teamwork
• Communication skills
Technical
• Programming
• Infrastructures
Skeptical
• Create hypotheses, but be skeptical about them
Data Scientists: a bit of everything … but actually as much as possible of everything
Tips For Becoming a Data Scientist
• There is no single pathway to becoming a data scientist. Everyone follows a
different trajectory into the role, but here are some tips that
can make the journey easier.
Different types of Data Scientists
• According to Microsoft Research:
  • Polymath
    • „Do it all“
  • Data Evangelist
    • Data analysis, disseminating and acting on insights
  • Data Preparer
    • Querying existing data, preparing data for analysis
  • Data Shapers
    • Analyzing and preparing data
  • Data Analyzer
    • Analyzing data
  • Platform Builder
    • Collect data and create infrastructures
  • Moonlighters (50%/20%)
    • „Spare time“ data scientists
  • Insight Actors
    • Use the outcome and act on insights
Miryung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software
Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)
Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Summary
• Big data has a high volume, velocity, and variety
• Different data structures
  • Structured, semi-structured, quasi-structured, unstructured
• Data science is a very diverse discipline
  • Maths, computer science, statistics, applications
→ Data scientists require a diverse skillset
END
Thank you for listening.