Uploaded by Godwin Bada

Week3 - Data Prep

BADM-578 Data Visualization
Understanding the Data
Ghulam Nabi
Data and Metadata
• Data consists of raw, unorganized facts that have little value
until they are processed and organized into information
for future use.
• Data is a set of facts and statistics that can be operated on,
referred to, or analyzed.
• Metadata is data about data. It records basic information
about the data, which makes finding and working with
specific instances of data easier.
• Metadata provides a framework for the data and ensures
business users are able to better understand the data
available within the warehouse and transform it into
meaningful information.
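The difference between data and metadata can be sketched with a tiny example. The CSV content, the file name `orders.csv`, and the dictionary layout are all made up for illustration.

```python
import csv
import io

# Hypothetical raw data: a tiny CSV export (the values are invented).
raw = "order_id,amount,order_date\n1001,25.50,2024-01-15\n1002,40.00,2024-01-16\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Metadata: information about the data above, not the data itself.
metadata = {
    "source": "orders.csv",  # hypothetical file name
    "row_count": len(rows),
    "columns": list(rows[0].keys()),
    "column_types": {"order_id": "integer", "amount": "float", "order_date": "date"},
}

print(metadata["row_count"])
print(metadata["columns"])
```

A user who reads only `metadata` already knows what the file contains and how to interpret each column, without opening the data.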
Data Types
◼ Types of data: “Datatypes”:
◼ Fundamental or Primitive datatypes
◼ Number (Integer, Float, Double), e.g. 1, 2, 12324, 1.5
◼ Character: a single symbol such as ‘A’, ‘a’, ‘1’, ‘2’; typically
occupies 1 byte (8 bits)
◼ String (Group of characters – words!!) such as “1234”, “1234A”,
“Smith”
◼ Boolean (True/False)
◼ Derived datatypes: Mixed datatypes such as
DATE/DATETIME
◼ Abstract or User-Defined Data Types
© 2015 JOHN WILEY & SONS, INC. ALL RIGHTS RESERVED.
Data Types
◼ Fundamental or Primitive datatypes are built-in or
predefined data types and can be used directly by the
user to declare variables: Integer, Character, Boolean
(True/False), Floating Point, Double Floating Point.
◼ The Character datatype is used for storing single characters.
A character typically requires 1 byte of memory space.
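As a rough illustration of the datatypes listed above, here is how they look in Python. Python is an assumption here, since the slides do not name a language, and Python has no separate character type, so a one-character string stands in for it.

```python
from datetime import date

count = 12324            # integer
price = 1.5              # floating point (Python's float is double precision)
letter = "A"             # character, represented as a 1-character string
name = "Smith"           # string
digits = "1234"          # still a string, even though it looks like a number
is_active = True         # Boolean
due = date(2024, 1, 15)  # derived type built from year, month, and day

# The 1-byte rule holds for single-byte encodings such as ASCII.
ascii_size = len(letter.encode("ascii"))
# An accented letter (e with acute accent) needs 2 bytes in UTF-8.
utf8_size = len("\u00e9".encode("utf-8"))

print(ascii_size, utf8_size)
print(int(digits) + 1)   # conversion is needed before doing arithmetic
```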
Data Storage Formats
◼ Types of data storage formats:
◼ Files:
◼ Electronic lists of data, optimized to perform a
particular transaction.
◼ Database:
◼ A collection of groupings of information that are
related to each other in some way.
◼ Such collection of inter-related data helps in
efficient retrieval, insertion and deletion of data, and
organization of the data in the form of tables, views,
and schemas etc.
Files
◼ Data file: an electronic list of information that is
formatted for a particular transaction.
◼ Sequential organization is typical.
◼ Records are associated with other records by
pointers.
◼ A pointer is a variable that holds a memory
address pointing to a value, instead of holding
the actual value itself.
◼ Such files are also called linked lists because of the way the
records are linked together using pointers.
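A minimal sketch of records linked by pointers. Python object references play the role of pointers here, and the `Record` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    key: int
    payload: str
    next: Optional["Record"] = None  # the "pointer" to the following record


# Build three records and link them into a list.
third = Record(3, "C")
second = Record(2, "B", third)
first = Record(1, "A", second)


def traverse(head):
    """Follow the pointers from the head record and collect the keys."""
    keys = []
    node = head
    while node is not None:
        keys.append(node.key)
        node = node.next
    return keys


print(traverse(first))
```

Reading the file sequentially amounts to following `next` from one record to the next until it is `None`.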
Types of Files
◼ Master files – store core information that is
important to the application.
◼ Look-up files – contain static values.
◼ Transaction files – store information that can be
used to update a master file.
◼ Audit files – record “before” and “after” images
of data as the data is altered.
◼ History files (or archive files) – store past
transactions.
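How these file types interact can be sketched as follows. The account numbers and amounts are invented, and in-memory structures stand in for the actual files.

```python
# Master file: core information (hypothetical account balances).
master = {"A100": 500.0, "A200": 120.0}

# Transaction file: changes to apply to the master file.
transactions = [("A100", -50.0), ("A200", 30.0), ("A100", 25.0)]

audit = []    # "before" and "after" images of each change
history = []  # past transactions, kept after they are applied

for account, delta in transactions:
    before = master[account]
    master[account] = before + delta
    audit.append({"account": account, "before": before, "after": master[account]})
    history.append((account, delta))

print(master)
print(audit[0])
```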
Databases
◼ A database is an organized collection of structured
information, or data, typically stored electronically in a
computer system. A database is usually controlled by a
database management system (DBMS).
◼ Together, the data and the DBMS, along with the applications,
which are associated with them, are referred to as a database
system, often shortened to just database.
◼ In most of the common databases today, data is typically
modeled in rows and columns in a series of tables to make
processing and data querying efficient.
Databases
◼ The data in a database can be easily accessed, managed,
modified, updated, controlled, and organized.
◼ Most databases use structured query language (SQL) for
writing and retrieving the data from the database.
◼ SQL is a programming language used by nearly all relational
databases to query, manipulate, and define data, and to
provide access control.
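The define, manipulate, and query roles of SQL can be sketched with Python's built-in `sqlite3` module. The `customer` table and its rows are hypothetical. SQLite has no GRANT statement, so access control is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Define data.
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Manipulate data.
cur.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("Smith", "Chicago"), ("Lee", "Boston"), ("Patel", "Chicago")],
)
cur.execute("UPDATE customer SET city = ? WHERE name = ?", ("Denver", "Lee"))
conn.commit()

# Query data.
chicago = [
    row[0]
    for row in cur.execute(
        "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Chicago",)
    )
]
conn.close()

print(chicago)
```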
Databases
There are many types of databases:
◼ Legacy databases
◼ Relational databases (RDBMS) or SQL databases
◼ Object-oriented databases or object databases
◼ Distributed databases
◼ Data warehouses
◼ Multidimensional databases
◼ NoSQL databases
◼ Graph databases
Multidimensional Database Outline
NoSQL Databases
◼ Newest database approach; not based on the
relational model or SQL.
◼ Rapid processing on replicated database servers
in the cloud.
◼ Various types include:
◼ Document-oriented databases: manage collections of
documents of varying forms and structures (e.g.,
MongoDB)
◼ Wide column databases: store data in records holding
very large numbers of dynamic columns (potentially
billions of columns). E.g., Bigtable, Cassandra,
Dynamo
◼ Graph databases: a collection of nodes and edges
using graph theory to store, map, and query
relationships.
NoSQL Databases
◼ A NoSQL, or nonrelational database, allows unstructured
and semi-structured data to be stored and manipulated
(in contrast to a relational database, which defines how
all data inserted into the database must be composed).
◼ Data is typically denormalized.
◼ NoSQL databases grew popular as web applications
became more common and more complex.
◼ Example: MongoDB (a document-oriented database), where, in
place of tables, data is stored as documents in the form of JSON
objects. Also listed: Vertica, Netezza.
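The document-oriented idea can be sketched in plain Python with the standard `json` module. This is not the MongoDB API; the collection, field names, and values are hypothetical.

```python
import json

# A "collection" of documents; the fields differ from one document to the next.
collection = [
    {"_id": 1, "name": "Smith", "email": "smith@example.com"},
    {"_id": 2, "name": "Lee", "orders": [{"sku": "X1", "qty": 2}, {"sku": "Y9", "qty": 1}]},
]

# Denormalized: Lee's orders are embedded in the customer document
# instead of living in a separate orders table.
serialized = json.dumps(collection[1])
restored = json.loads(serialized)
total_qty = sum(order["qty"] for order in restored["orders"])

# No fixed schema, so a query has to tolerate missing fields.
with_orders = [doc["name"] for doc in collection if "orders" in doc]

print(total_qty, with_orders)
```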
Selecting a Storage Format
◼ Each file and database storage format has its own
strengths and weaknesses.
◼ Factors to consider in selecting a storage
format:
◼ Data Types
◼ Type of Application System
◼ Existing Storage Formats
◼ Future Needs
OLTP and OLAP Systems
◼ OLTP (online transaction processing) systems are designed to
handle large volumes of transactional data involving multiple users.
◼ Online transaction processing typically involves inserting,
updating, and/or deleting of data in a data store to collect,
manage, and secure those transactions.
◼ An OLAP system is designed to process large amounts of data
quickly, allowing users to analyze multiple data dimensions in
tandem.
◼ In an OLAP system, teams can use data for decision-making
and problem-solving.
OLTP and OLAP Systems
◼ OLTP is an online data modification system,
whereas OLAP is an online historical
multidimensional data store system that is used to
retrieve large amounts of data.
OLTP and OLAP Systems
◼ OLTP enables the real-time execution of large numbers
of transactions by large numbers of people, whereas
online analytical processing (OLAP) usually involves
querying these transactions (also referred to as records)
in a database for analytical purposes. OLAP helps
companies extract insights from their transaction data so
they can use it for making more informed decisions.
◼ OLTP also covers any kind of interaction or action, such
as downloading PDFs from a web page, viewing a specific
video, automatic maintenance triggers, or comments on
social channels, that may be critical for a business to
record in order to serve its customers better.
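The contrast between the two workloads can be sketched with `sqlite3`. One in-memory table stands in for both systems, which in practice are usually separate, and the `sales` table and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, product TEXT, amount REAL)"
)

# OLTP-style work: many small writes, each committed as its own transaction.
orders = [
    ("East", "Widget", 10.0),
    ("West", "Widget", 20.0),
    ("East", "Gadget", 5.0),
    ("East", "Widget", 15.0),
]
for order in orders:
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute(
            "INSERT INTO sales (region, product, amount) VALUES (?, ?, ?)", order
        )

# OLAP-style work: a read-only query that aggregates across two dimensions.
summary = conn.execute(
    "SELECT region, product, SUM(amount) FROM sales "
    "GROUP BY region, product ORDER BY region, product"
).fetchall()
conn.close()

print(summary)
```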
A Data Warehouse
◼ “A data warehouse is simply a single, complete, and
consistent store of data obtained from a variety of
sources and made available to end users in a way
they can understand and use it in a business context.”
◼ A subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s
decision-making process.
A Data Warehouse
Business Intelligence (BI) / Data Visualization
A Data Warehouse
[Figure: data warehouse architecture. Each Source has an Extractor/Monitor that feeds the Integration System, which loads the Data Warehouse and its Metadata; Clients query the warehouse. Information is integrated in advance and stored in the DWH for direct querying and analysis.]
A Data Mart
◼ A data mart is a simplified form of a data warehouse that
focuses on a single area of business.
◼ Data marts help teams access data quickly without the
complexities of a data warehouse because data marts
have fewer data sources than a data warehouse.
◼ Data marts provide a single source of truth and serve the
needs of specific business teams.
A Data Lake
◼ Data lakes collect raw data and events from diverse
source-based systems and support data preparation and
exploratory analysis, whereas a data warehouse uses
processed data.
◼ It helps organizations store large amounts of structured,
semi-structured, and unstructured data, and
organizations don’t need to know ahead of time how their
data will be used.
◼ A data warehouse is used for structured, filtered data,
which has an intended purpose.
BI and DM
◼ Business Intelligence (BI): Information that people use
to support their decision-making efforts.
◼ Data Mining (DM): The core of the KDD [Knowledge
Discovery in Databases] process, involving the inferring
of algorithms that explore the data, develop the model
and discover previously unknown patterns.
◼ DM: The discovery of new, non-obvious, valuable
information from a large collection of raw data.
◼ DM: Set of activities used to find new, hidden or
unexpected patterns in data.
Knowledge Discovery (KDD) Process
BI and DM
BI/DM: Data Visualization & Reporting
Data Prep
• Explain why data needs to be preprocessed.
• Describe the steps involved in data processing including:
• Data quality assessment
• Data cleaning
• Data transformation
• Data reduction
• Discuss the importance of analytical thinking in data
visualization.
• Discuss the importance of critical thinking in data
visualization.
• Examine data transformation techniques.
Data Preparation
◼ Data preparation is the act of manipulating raw data into a
form that can readily and accurately be analyzed, e.g. for
business purposes.
▪ Data preparation involves two essential steps: data
preprocessing and data wrangling.
Data Prep
◼ Data Wrangling
▪ Data wrangling occurs after data preprocessing.
▪ It is the process of transforming and mapping data from
one "raw" data form into another format with the intent of
making it more appropriate and valuable for a variety of
downstream purposes such as analytics.
▪ It is commonly used when building a machine learning model.
▪ It involves converting the raw dataset into a format
compatible with the machine learning models.
Data Prep
◼ Data Preprocessing
▪ Data preprocessing occurs first and helps convert raw,
unclean data into a usable format.
▪ It involves data cleaning, integration, transformation, and
reduction.
Data Prep & Wrangling
◼ Structured data
▪ For structured data, data preparation and wrangling involve
data cleansing and data preprocessing:
1. Data Cleansing
2. Transformations (also called “Wrangling”)
• Data Cleansing: The objective of data cleansing is to
remove inaccurate data from the dataset.
• Data Wrangling: The objective of wrangling is to
transform the data into a more usable format.
Data Prep & Wrangling
Data cleansing involves resolving:
o Incompleteness errors: Data is missing.
o Invalidity errors: Data is outside a meaningful range.
o Inaccuracy errors: Data is not a measure of true value.
o Inconsistency errors: Data conflicts with the corresponding data
points or reality.
o Non-uniformity errors: Data is not present in an identical format.
o Duplication errors: Duplicate observations are present.
Actions
• Find/remove duplicate tuples
• Detect inconsistent or wrong data, such as attribute values that do not match
• Patch missing or unreadable data
• Notify sources of errors found
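Several of the error types and actions above can be sketched in plain Python. The records, the 0 to 120 age range, and the rule that a repeated `id` marks a duplicate are all assumptions made for illustration. Missing and invalid values are flagged here rather than imputed.

```python
from datetime import datetime

records = [
    {"id": 1, "age": 34, "country": "US", "joined": "2023-01-05"},
    {"id": 2, "age": None, "country": "US", "joined": "2023-02-10"},  # incompleteness
    {"id": 3, "age": 240, "country": "US", "joined": "2023-03-01"},   # invalidity
    {"id": 4, "age": 51, "country": "us", "joined": "03/15/2023"},    # non-uniformity
    {"id": 1, "age": 34, "country": "US", "joined": "2023-01-05"},    # duplication
]


def normalize_date(text):
    """Return the date in ISO format, or None if no known format matches."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


seen = set()
clean, errors = [], []
for rec in records:
    if rec["id"] in seen:  # find and remove duplicate tuples
        errors.append((rec["id"], "duplicate"))
        continue
    seen.add(rec["id"])
    if rec["age"] is None:
        errors.append((rec["id"], "missing age"))
    elif not 0 <= rec["age"] <= 120:  # assumed meaningful range
        errors.append((rec["id"], "age out of range"))
        rec = {**rec, "age": None}
    clean.append(
        {**rec, "country": rec["country"].upper(), "joined": normalize_date(rec["joined"])}
    )

print(len(clean))
print(errors)  # this list is what would be reported back to the sources
```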
Data Prep & Wrangling
Data preprocessing involves performing the following
transformations:
o Extraction: A new variable is extracted from the current variable
for ease of analyzing and using for training the ML model.
o Aggregation: Two or more variables are consolidated into a single
variable.
o Filtration: Data rows not required for the project are removed.
o Selection: Data columns not required for the project are removed.
o Conversion: The variables in the dataset are converted into
appropriate types to further process and analyze them correctly.
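The five transformations above can be sketched on a small list of rows. The column names, values, and the decision to keep only the Sales department are invented for illustration.

```python
rows = [
    {"name": "Smith", "dob": "1990-04-12", "salary": "52000", "bonus": "3000", "dept": "Sales", "fax": ""},
    {"name": "Lee", "dob": "1985-11-30", "salary": "61000", "bonus": "1500", "dept": "HR", "fax": ""},
    {"name": "Patel", "dob": "1978-07-02", "salary": "70000", "bonus": "5000", "dept": "Sales", "fax": ""},
]

prepared = []
for row in rows:
    if row["dept"] != "Sales":  # Filtration: drop rows that are not needed
        continue
    salary = int(row["salary"])  # Conversion: text to integer
    bonus = int(row["bonus"])
    prepared.append(
        {
            "name": row["name"],                # Selection: "dept" and "fax" are dropped
            "birth_year": int(row["dob"][:4]),  # Extraction: new variable from "dob"
            "total_pay": salary + bonus,        # Aggregation: two variables become one
        }
    )

print(prepared)
```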
Data Prep & Wrangling
Data Transformation involves changing data:
o Convert data to uniform format
   o Byte ordering, string termination
   o Internal layout
o Remove, add & reorder attributes
   o Add key
   o Add data to get history
o Sort tuples
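A short sketch of these transformations using the standard `struct` module. The sample values are made up, and little-endian is chosen arbitrarily as the target format.

```python
import struct

# Byte ordering: the same integer, stored big-endian and little-endian.
big = struct.pack(">I", 1)     # b'\x00\x00\x00\x01'
little = struct.pack("<I", 1)  # b'\x01\x00\x00\x00'

# Convert to a uniform format: read the big-endian bytes, rewrite as little-endian.
(value,) = struct.unpack(">I", big)
uniform = struct.pack("<I", value)

# String termination: strip the trailing NUL byte of a C-style string.
c_string = b"Smith\x00"
text = c_string.rstrip(b"\x00").decode("ascii")

# Reorder attributes, sort the tuples, then add a surrogate key.
tuples = [("Smith", "Chicago"), ("Lee", "Boston")]
reordered = [(city, name) for name, city in tuples]
reordered.sort()
keyed = [(key, city, name) for key, (city, name) in enumerate(reordered, start=1)]

print(uniform == little, text)
print(keyed)
```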
Data Prep & Wrangling
Data Integration involves getting data from multiple sources:
o Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
o Rule-based
o Actions
   o Resolve inconsistencies
   o Eliminate duplicates
   o Integrate into warehouse
   o Summarize data
   o Fetch more data from sources
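Rule-based integration can be sketched as follows. The two sources, their records, and the rule that the most recently updated record wins are assumptions; a real integrator would apply rules chosen by the warehouse designers.

```python
source_a = [
    {"id": 1, "email": "smith@example.com", "updated": "2024-01-10"},
    {"id": 2, "email": "lee@example.com", "updated": "2024-01-12"},
]
source_b = [
    {"id": 2, "email": "lee@corp.example.com", "updated": "2024-02-01"},
    {"id": 3, "email": "patel@example.com", "updated": "2024-01-20"},
]

warehouse = {}
for record in source_a + source_b:
    current = warehouse.get(record["id"])
    # Rule: on a conflict, keep the most recently updated record.
    # ISO-formatted dates compare correctly as plain strings.
    if current is None or record["updated"] > current["updated"]:
        warehouse[record["id"]] = record  # one row per id, so duplicates collapse

# Summarize the integrated data.
summary = {
    "rows": len(warehouse),
    "latest_update": max(r["updated"] for r in warehouse.values()),
}

print(warehouse[2]["email"])
print(summary)
```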
Data Prep & Wrangling
◼ Unstructured data
▪ For unstructured data, data preparation and wrangling
involve a set of text-specific cleansing and preprocessing
tasks.
▪ Text cleansing involves removing the following
unnecessary elements from the raw text:
o HTML tags
o Most punctuation
o Most numbers
o Extra white space
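A minimal sketch of text cleansing with regular expressions. The input sentence is invented. This version removes all punctuation and digits, whereas the list above says "most", so a real project would keep whatever carries meaning for the task.

```python
import re

raw = "<p>Sales rose   12% in 2023, beating   forecasts!</p>"

text = re.sub(r"<[^>]+>", " ", raw)       # HTML tags
text = re.sub(r"\d+", " ", text)          # numbers
text = re.sub(r"[^\w\s]", " ", text)      # punctuation
text = re.sub(r"\s+", " ", text).strip()  # extra white space

print(text)
```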
Data Prep & Wrangling
▪ Text preprocessing involves performing transformations:
o Tokenization: The process of splitting a given text into separate tokens, where each token is
equivalent to a word.
o Normalization: The normalization process involves the following actions:
• Lowercasing - Removes differences among the same words due to upper and lower cases.
• Removing stop words - Stop words are commonly used words such as 'the', 'is' and 'a'.
• Stemming - Converting inflected forms of a word into its base word.
• Lemmatization - Converting inflected forms of a word into its morphological root (known as
the lemma).
• Lemmatization is a more sophisticated approach than stemming and is more difficult
and expensive to perform.
o Creating a bag-of-words (BOW): A collection of the distinct tokens that does not capture
the position or sequence of the words in the text.
o Organizing the BOW into a document term matrix (DTM): A table where each row
belongs to a document (or text file) and each column represents a token (or term). The
number of rows is equal to the number of documents in the sample dataset. The number of
columns is equal to the number of tokens in the final BOW. The cells contain the counts of the
number of times a token is present in each document.
o N-grams and N-grams BOW: In some cases, a sequence of words may convey more meaning
than individual words. N-grams are a representation of word sequences. The length of a sequence
varies from 1 to n. A one-word sequence is a unigram, a two-word sequence is a bigram, a
three-word sequence is a trigram, and so on.
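The pipeline above can be sketched end to end in plain Python. The two documents and the stop-word list are invented, and `stem` is a crude suffix stripper written for this illustration, not a real stemming algorithm.

```python
import re
from collections import Counter

docs = ["The market is falling", "Markets fell and the market recovered"]
STOP_WORDS = {"the", "is", "a", "and"}


def stem(token):
    """Crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercasing and tokenization
    return [stem(t) for t in tokens if t not in STOP_WORDS]


tokenized = [preprocess(d) for d in docs]

# Bag-of-words: the distinct tokens, with word order discarded.
bow = sorted({t for doc in tokenized for t in doc})

# Document term matrix: one row per document, one column per BOW token.
dtm = [[Counter(doc)[term] for term in bow] for doc in tokenized]


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(tokenized[1], 2)

print(bow)
print(dtm)
print(bigrams)
```

The stemmer leaves "fell" and "fall" as separate tokens. A lemmatizer would map "fell" to "fall", which is one reason lemmatization is the more sophisticated and more expensive step.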