Uploaded by Shawqi Sinokrot

CH2-Getting to Know Your Data

Chapter 2
Getting to Know Your Data
Data Analysis
Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi
Evolution of Data Analysis
What makes a good data analyst?
A good data analyst must be eager to learn and must keep asking questions throughout the process of working with data. The focus of those questions varies with the audience that consumes the results.
To become an expert in data analysis, you need excellent communication skills so that you can translate raw data into insights that drive positive change.
- Know Your Data (KYD)
- Voice of the Customer (VOC)
- Always Be Agile (ABA)
What makes a good data analyst?
Know Your Data (KYD)
• Understand how the data was originally sourced, including the technologies used, the transformations that occurred before, during, and afterward, and the business requirements and rules used to store it (data lineage).
• KYD is all about doing your homework ahead of time so that you are prepared before talking to experts.
Voice of the Customer (VOC)
• Understand customer needs by listening to customers before, during, and after they use a company's product or service.
• VOC is all about listening to what your business users or consumers need from the data.
Always Be Agile (ABA)
• Create an interactive line of communication between the business and technical teams to iteratively deliver business value through data and usable features.
• ABA is all about bringing developers and business sponsors together to capture requirements and then deliver incremental value.
Data Types: Structured Data
• The most common type, found in databases and in data created by applications (apps or software) and code.
• Records are consistent and of relatively high quality, especially when they are stored in the same database table.
• Typically stored in Database Management Systems (DBMS) and Relational Database Management Systems (RDBMS).
Entity-Relationship (ER) Diagram
• Each entity represents a physical table stored in the database; here the tables are named car, part, and car_part_bridge.
• The relationship between car and part is defined by the table called car_part_bridge.
• The pk label next to the car_id and part_id field names identifies the primary key of each table. When the primary key of one table appears as a field in another table, it is called a foreign key there.
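The three tables described above can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, not the schema from the slide: the table names and the car_id and part_id keys come from the diagram description, while the model, name, and quantity columns, the sample rows, and the composite key on the bridge table are assumptions added for illustration.

```python
import sqlite3

# In-memory database. The non-key columns (model, name, quantity) and the
# composite primary key on the bridge table are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE car  (car_id  INTEGER PRIMARY KEY, model TEXT NOT NULL);
CREATE TABLE part (part_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE car_part_bridge (
    car_id   INTEGER NOT NULL REFERENCES car(car_id),    -- foreign key to car
    part_id  INTEGER NOT NULL REFERENCES part(part_id),  -- foreign key to part
    quantity INTEGER NOT NULL,
    PRIMARY KEY (car_id, part_id)
);
""")

conn.executemany("INSERT INTO car VALUES (?, ?)", [(1, "sedan"), (2, "pickup")])
conn.executemany("INSERT INTO part VALUES (?, ?)", [(10, "wheel"), (20, "spare tire")])
conn.executemany(
    "INSERT INTO car_part_bridge VALUES (?, ?, ?)",
    [(1, 10, 4), (2, 10, 4), (2, 20, 1)],
)

# The bridge table resolves the many-to-many relationship between car and part.
rows = conn.execute("""
    SELECT c.model, p.name, b.quantity
    FROM car_part_bridge AS b
    JOIN car  AS c ON c.car_id  = b.car_id
    JOIN part AS p ON p.part_id = b.part_id
    ORDER BY c.car_id, p.part_id
""").fetchall()

# A bridge row that points at a car that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO car_part_bridge VALUES (99, 10, 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

In this sketch car_id is a primary key in car and a foreign key in car_part_bridge, which is the distinction the last bullet draws.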
Data Types: Unstructured Data and Semi-structured Data
⮚ Unstructured Data
• Mostly textual in nature, such as an email message's body, tweets, books, and health records; images are also unstructured.
• Free text is challenging for data analysis because of its inconsistent nature.
⮚ Semi-structured Data
• Free text + tags, which are keywords or any classification used to create a natural hierarchy. Examples of semi-structured data are XML and JSON files.
• Semi-structured data makes it easy to change the underlying schema of how the data is stored, but data values may be inconsistent depending on how the data was captured.
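The trade-off between a flexible schema and inconsistent values can be illustrated with a small JSON sample. This is a minimal sketch using Python's json module; the two records and their field names are hypothetical.

```python
import json

# Two hypothetical records: the tags (keys) give structure, but the second
# record adds a field and stores "price" as text instead of a number.
raw = """
[
  {"id": 1, "name": "wheel", "price": 120.5},
  {"id": 2, "name": "spare tire", "price": "95", "tags": ["optional"]}
]
"""

records = json.loads(raw)

# Flexible schema: collect every key that appears in any record.
all_keys = sorted({key for record in records for key in record})

# Inconsistent values: the same key holds different Python types.
price_types = sorted({type(record["price"]).__name__ for record in records})

# Normalize before analysis so that "price" is always a float.
prices = [float(record["price"]) for record in records]
```

The parser accepts both records without complaint, so the type check and the normalization step are left to the analyst.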
Common Data Types
• A data type describes the kind of data that is stored and its intended usage.
• A data type creates consistency for each data value as it is stored on disk or in memory.
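One way to see this consistency is to look at the bytes a typed value occupies. The sketch below uses Python's struct module as a stand-in for how a database or file format stores typed values; the specific format codes are an illustrative choice, not something the slides prescribe.

```python
import struct

# Format codes: "<i" = 4-byte little-endian signed integer,
# "<d" = 8-byte IEEE 754 double, "<?" = 1-byte boolean.
int_bytes = struct.pack("<i", 7)
float_bytes = struct.pack("<d", 7.0)
bool_bytes = struct.pack("<?", True)

sizes = (len(int_bytes), len(float_bytes), len(bool_bytes))

# Every value of the same type occupies the same number of bytes...
same_size = len(struct.pack("<i", 7)) == len(struct.pack("<i", 2_000_000_000))

# ...and the type tells the reader how to interpret those bytes.
(value,) = struct.unpack("<i", int_bytes)
```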
Common Data Types
• Varchar vs Varchar2
Aspect | Varchar | Varchar2
NULL vs empty string | Can identify NULL and an empty string separately. | Cannot identify them separately; both are treated as the same.
Size | Stores a minimum of 1 and a maximum of 2000 bytes of character data. | Stores a minimum of 1 and a maximum of 4000 bytes of character data.
Allocation | Allocates a fixed size irrespective of the input. Ex: if we define varchar(15) and enter only 10 characters, it still allocates space for all 15 characters. | Allocates a variable size based on the input. Ex: if we define varchar2(15) and enter only 10 characters, it allocates space for 10 characters only, not 15.
Extra spaces | Extra spaces are padded on the right side. | Extra spaces are truncated.
Standard | ANSI SQL standard | Oracle standard
Source: https://www.sitesbay.com/sql/sql-difference-between-varchar-and-varchar2
Note: the comparison above is as reported by the cited source. In current Oracle Database releases, VARCHAR is a synonym for VARCHAR2 and Oracle recommends using VARCHAR2, so check these differences against the documentation for the database version in use.
Data Classifications and Data Attributes
Continuous data is measurable, quantified with a numeric data type, and has a continuous range with infinitely many possible values, e.g., a stock price, weight in pounds, or time.
Categorical (descriptive) data has values stored as a string data type.
• It describes something specific (qualitative), such as a person, place, or thing.
Discrete data resembles continuous data because it is numeric (e.g., a count of employees per department), but it takes only a limited set of distinct values (similar to categorical data).
• If only two discrete values exist, such as yes/no, true/false, or 1/0, the data can also be classified as binary.
Data Classifications and Data Attributes
Categorical variable
Categorical variables contain a finite number of categories or
distinct groups.
Discrete variable
Discrete variables are numeric variables that have a countable
number of values between any two values. A discrete variable is
always numeric.
Continuous variable
Continuous variables are numeric variables that have an infinite
number of values between any two values. A continuous
variable can be numeric or date/time.
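The classifications above can be approximated with a simple rule of thumb over a column of values. This is a rough heuristic sketch, not a standard algorithm: the function name classify, the rules it applies, and the sample columns are all hypothetical, and date/time values are left out for brevity.

```python
def classify(values):
    """Heuristically label a column as continuous, binary, categorical, or discrete."""
    distinct = set(values)
    is_number = all(isinstance(v, (int, float)) for v in values)
    has_float = any(isinstance(v, float) for v in values)
    is_text = all(isinstance(v, str) for v in values)
    is_count = all(isinstance(v, int) for v in values)  # bool is a subclass of int

    if is_number and has_float:
        return "continuous"   # measured values with fractional parts
    if (is_text or is_count) and len(distinct) == 2:
        return "binary"       # only two possible values, e.g. yes/no or 1/0
    if is_text:
        return "categorical"  # descriptive labels
    if is_count:
        return "discrete"     # whole-number counts
    return "unknown"


stock_price = [101.25, 99.8, 100.0]
department = ["sales", "hr", "it"]
employees_per_department = [12, 7, 30]
is_active = ["yes", "no", "yes"]

labels = [
    classify(stock_price),
    classify(department),
    classify(employees_per_department),
    classify(is_active),
]
```

A real data set needs domain knowledge on top of such rules: a zip code stored as an integer, for example, is categorical even though this heuristic would call it discrete.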
Data Attributes
• Nominal data is data whose values can be distinguished from one another but not necessarily ordered. It is qualitative, and arithmetic cannot be performed on it because the values are labels (typically strings).
• Example: labels or names such as stocks or bonds.
• Ordinal data is ordered data where a ranking exists, but the distance or range between values cannot be defined. Ordinal data is qualitative but has a natural or defined sequence. It can be counted and ranked, but not every statistical method applies to it.
• Example: assign 1 = low, 2 = medium, and 3 = high.
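The difference between the two can be sketched in a few lines of Python. The stocks/bonds labels and the low/medium/high coding come from the examples above; the sample lists themselves are hypothetical.

```python
from collections import Counter

# Nominal: labels can be compared for equality and counted, nothing more.
holdings = ["stocks", "bonds", "stocks", "stocks", "bonds"]
counts = Counter(holdings)
most_common_label = counts.most_common(1)[0][0]

# Ordinal: an explicit ranking makes sorting meaningful.
RANK = {"low": 1, "medium": 2, "high": 3}
risk = ["high", "low", "medium", "low", "high"]
ordered = sorted(risk, key=RANK.get)

# The middle value (median) is defined because the values can be ranked,
# even though the distance between "low" and "medium" is not.
middle = ordered[len(ordered) // 2]
```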
Data Attributes
• Interval data is like ordinal data, but the distance between data points is uniform. Not every arithmetic operation is meaningful on interval data: differences can be compared, but ratios cannot, because the zero point is arbitrary. Understanding the context of the data and how it should be used is therefore essential.
• Example: temperature in Celsius or Fahrenheit (20 °C is not twice as warm as 10 °C).
• Ratio data allows all arithmetic operations, including multiplication and division, and supports summaries such as the sum, average, median, and mode. Ratio data is numeric (quantitative). Unlike interval data, ratio data has a true zero: zero is an absolute, below which there are no meaningful values.
• Example: speed, age, or weight.
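A short calculation shows why ratios are unreliable on an interval scale. This is a minimal sketch; the two temperatures are arbitrary sample values, and the Kelvin conversion is included only as a ratio-scale counterpart.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale: a 10 °C gap is an 18 °F gap.
gap_c = warm_c - cold_c
gap_f = celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cold_c)

# Ratios are not: "twice as warm" depends on where the scale puts its zero.
ratio_c = warm_c / cold_c
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)

# Kelvin has a true zero, so its ratio does not depend on an arbitrary offset.
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)
```

The same pair of temperatures gives a ratio of 2.0 in Celsius and 1.36 in Fahrenheit, while on the Kelvin scale the warmer value is only about 3.5% higher.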
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
Properties of Attribute Values
Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, gender: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
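Because each attribute type keeps the properties of the types before it, the statistics in the table accumulate from nominal to ratio. The sketch below encodes a subset of the Operations column and applies one statistic per level with Python's statistics module; the helper name allowed_statistics and the sample values are hypothetical.

```python
import statistics

# Each level inherits the statistics of the levels before it.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]
NEW_STATISTICS = {
    "nominal": ["mode"],
    "ordinal": ["median"],
    "interval": ["mean", "standard deviation"],
    "ratio": ["geometric mean", "harmonic mean"],
}


def allowed_statistics(level):
    """Return every statistic that is meaningful at the given level."""
    position = LEVELS.index(level)
    allowed = []
    for name in LEVELS[: position + 1]:
        allowed.extend(NEW_STATISTICS[name])
    return allowed


eye_color = ["brown", "blue", "brown", "green"]  # nominal
grades = [1, 2, 2, 3, 3]                         # ordinal, coded ranks
ages = [2.0, 8.0]                                # ratio

mode_color = statistics.mode(eye_color)
median_grade = statistics.median(grades)
geo_mean_age = statistics.geometric_mean(ages)
```

Calling allowed_statistics("ordinal") returns the mode and the median, while allowed_statistics("ratio") returns all six entries.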
Data Attributes
• Time data covers date, time, or any combination of the two, for example:
• The time as HH:MM AM/PM, such as 12:03 AM
• The year as YYYY, such as 1980
• A timestamp represented as YYYY-MM-DD hh:mm:ss, such as 2000-08-19 14:32:22
• A date as MM/DD/YY, such as 08/19/00
• When dealing with time data, it is essential to identify the interval between values so that the difference between them can be measured accurately.
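The formats listed above can be parsed with Python's datetime module, after which the interval between two values is a simple subtraction. This is a minimal sketch; the format strings are the standard strptime codes that correspond to the patterns on the slide.

```python
from datetime import datetime

# Each slide format mapped to its strptime pattern.
time_only = datetime.strptime("12:03 AM", "%I:%M %p").time()
year_only = datetime.strptime("1980", "%Y").year
timestamp = datetime.strptime("2000-08-19 14:32:22", "%Y-%m-%d %H:%M:%S")
short_date = datetime.strptime("08/19/00", "%m/%d/%y")  # two-digit year 00 parses as 2000

# Once both values are datetime objects, the interval is a timedelta.
elapsed = timestamp - short_date
elapsed_seconds = elapsed.total_seconds()
```

The date without a time component is read as midnight, so the interval here is the 14 hours, 32 minutes, and 22 seconds carried by the timestamp.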
Reading Structured, Semi, and Unstructured
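A minimal sketch of reading each kind of data with Python's standard library, assuming small in-memory samples in place of real files. All sample values are hypothetical; the car_id and part_id field names are reused from the ER diagram for continuity.

```python
import csv
import io
import json

# Structured: rows and columns with a fixed header, read as CSV.
structured = io.StringIO("car_id,model\n1,sedan\n2,pickup\n")
rows = list(csv.DictReader(structured))  # note: csv returns every value as a string

# Semi-structured: tagged, nested values, read as JSON.
semi_structured = '{"car_id": 1, "parts": [{"part_id": 10, "name": "wheel"}]}'
document = json.loads(semi_structured)
first_part_name = document["parts"][0]["name"]

# Unstructured: free text with no schema; structure has to be derived.
unstructured = "The customer said the spare tire was missing. The wheel was fine."
words = unstructured.lower().replace(".", "").split()
the_count = words.count("the")
```

The amount of work needed before analysis grows from one example to the next: the CSV only needs type conversion, the JSON needs navigation through its nesting, and the free text needs tokenization before anything can be counted.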
Any Questions?
www.psut.edu.jo
Call: (+962) 6-5359 949
Fax: (+962) 6-5347 295
Email: info@psut.edu.jo
Princess Sumaya University for Technology
Amman 11941 Jordan
P.o.Box 1438 Al-Jubaiha