Chapter 2
Getting to Know Your Data

Data Analysis
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

© Ibrahim Abu alhul, Bushra Alhijawi, Ahmad Alzghoul, Wael Etaiwi

Evolution of Data Analysis
[Figure: evolution of data analysis; image and source reference not recoverable]

What makes a good data analyst?
A good data analyst must be eager to learn and keep asking questions throughout the process of working with data. The focus of those questions will vary with the audience consuming the results. Becoming an expert in data analysis also requires excellent communication skills, so that you can translate raw data into insights that drive positive change.
• Know Your Data (KYD)
• Voice of the Customer (VOC)
• Always Be Agile (ABA)

Know Your Data (KYD)
• Understand how the data was originally sourced (its data lineage): the technologies used, the transformations that occurred before, during, and afterward, and the business requirements and rules used to store it.
• KYD is all about doing your homework ahead of time so that you are prepared before talking to experts.

Voice of the Customer (VOC)
• Understand customer needs by listening to customers before, during, and after they use a company's product or service.
• VOC is all about listening to the needs of your business or consumers regarding the data.

Always Be Agile (ABA)
• ABA creates an interactive line of communication between the business and technical teams to iteratively deliver business value through data and usable features.
• ABA is all about bringing developers and business sponsors together to capture requirements and then deliver incremental value.

Data Types: Structured Data
• The most common type, found in databases and in data created by applications (apps or software) and code.
• Records are consistent and of relatively high quality, especially when stored in the same database table.
• Typically managed by Database Management Systems (DBMS) and Relational Database Management Systems (RDBMS).

Entity-Relationship (ER) Diagram
• Each entity in the diagram represents a physical table stored in the database; here the tables are named car, part, and car_part_bridge.
• The relationship between car and part is defined by the table car_part_bridge.
• The pk label next to the car_id and part_id field names identifies the primary key of each table. When a primary key from one table also appears in another table, it is called a foreign key there.

Data Types: Unstructured Data and Semi-structured Data
⮚ Unstructured Data
• Textual in nature (email message bodies, tweets, books, health records), as well as images.
• Free text is challenging for data analysis because of its inconsistent nature.
⮚ Semi-structured Data
• Free text plus tags, which are keywords or any classification used to create a natural hierarchy. Examples of semi-structured data are XML and JSON files.
• Semi-structured data is flexible, because the underlying schema of how the data is stored can change, but its values may be inconsistent depending on how the data was captured.

Common Data Types
• A data type describes the details of the data that is stored and its intended usage.
• A data type creates consistency for each data value as it is stored on disk or in memory.

Common Data Types: Varchar vs. Varchar2
The comparison below is as given by the cited source.
• NULL and empty strings
– Varchar: can identify NULL and an empty string separately.
– Varchar2: cannot identify them separately; both are treated as the same.
• Size limits
– Varchar: stores a minimum of 1 and a maximum of 2000 bytes of character data.
– Varchar2: stores a minimum of 1 and a maximum of 4000 bytes of character data.
• Storage allocation
– Varchar: allocates a fixed size irrespective of the input. Example: if we define varchar(15) and enter only 10 characters, space is allocated for all 15 characters.
– Varchar2: allocates a variable size based on the input. Example: if we define varchar2(15) and enter only 10 characters, space is allocated for 10 characters only, not for 15.
• Extra spaces
– Varchar: extra spaces are padded on the right side.
– Varchar2: extra spaces are truncated.
• Standard
– Varchar: ANSI SQL standard.
– Varchar2: Oracle standard.
Source: https://www.sitesbay.com/sql/sql-difference-between-varchar-and-varchar2
Note: in current Oracle Database releases VARCHAR is a synonym for VARCHAR2, and fixed-size, blank-padded storage is the behavior of the CHAR type, so check the Oracle documentation before relying on the differences listed above.

Data Classifications and Data Attributes
• Continuous data is measurable, quantified with a numeric data type, and has a continuous range with infinite possibilities, e.g., a stock price, a weight in pounds, or time.
• Categorical (descriptive) data has values of a string data type.
– It describes something specific (qualitative), such as a person, place, or thing.
• Discrete data resembles continuous data because it is numeric (e.g., a count of employees per department), but it is limited to a fixed set of possible values (similar to categorical data).
– If only two discrete values exist, such as yes/no, true/false, or 1/0, the data can also be classified as binary.

Data Classifications and Data Attributes
Categorical variable: Categorical variables contain a finite number of categories or distinct groups.
Discrete variable: Discrete variables are numeric variables that have a countable number of values between any two values. A discrete variable is always numeric.
Continuous variable: Continuous variables are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time.

Data Attributes
• Nominal data is data whose values can be distinguished from one another but not necessarily ordered. It is qualitative; arithmetic cannot be performed on it because the values are labels.
– Example: labels or names, such as stocks or bonds.
• Ordinal data is ordered data in which a ranking exists, but the distance or range between values cannot be defined. Ordinal data is qualitative but has a natural or defined sequence. Ordinal values can be counted and ranked, but not every statistical method applies to them.
– Example: assign 1 = low, 2 = medium, and 3 = high.
• Interval data is like ordinal data, but the distance between data points is uniform. Not every arithmetic operation can be performed on interval data, so understanding the context of the data and how it should be used becomes essential.
– Example: temperature in Celsius or Fahrenheit.
• Ratio data is numeric (quantitative) data that supports all arithmetic operations, including addition, multiplication, and division, as well as summary statistics such as the sum, average, median, and mode. Unlike interval data, ratio data has a true zero: zero is an absolute, below which there are no meaningful values.
– Example: speed, age, or weight.

Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /
• Properties by attribute type:
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties

Properties of Attribute Values
Attribute Type: Nominal
– Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, gender: {male, female}
– Operations: mode, entropy, contingency correlation, χ² test
Attribute Type: Ordinal
– Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
– Examples: hardness of minerals, {good, better, best}, grades, street numbers
– Operations: median, percentiles, rank correlation, run tests, sign tests
Attribute Type: Interval
– Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
– Examples: temperature in Celsius or Fahrenheit
– Operations: mean, standard deviation, Pearson's correlation, t and F tests
Attribute Type: Ratio
– Description: For ratio variables, both differences and ratios are meaningful. (*, /)
– Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
– Operations: geometric mean, harmonic mean, percent variation

Data Attributes
• Time data covers a date, a time, or any combination of the two, for example:
– The time as HH:MM AM/PM, such as 12:03 AM
– The year as YYYY, such as 1980
– A timestamp as YYYY-MM-DD hh:mm:ss, such as 2000-08-19 14:32:22
– A date as MM/DD/YY, such as 08/19/00
• When dealing with time data, it is essential to identify the interval between values so that the difference between them can be measured accurately.
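
The point about intervals can be illustrated with a minimal sketch in Python, using only the standard library's datetime module. The FORMATS table and the helper names parse and interval_seconds are hypothetical, introduced here for illustration; the sample values are the ones listed above. It assumes the default English AM/PM markers and relies on Python's rule that the two-digit year 00 is read as 2000.

```python
from datetime import datetime

# strptime format strings for the four layouts listed above.
# The dictionary keys are illustrative names, not a standard.
FORMATS = {
    "time": "%I:%M %p",                # HH:MM AM/PM, e.g. 12:03 AM
    "year": "%Y",                      # YYYY, e.g. 1980
    "timestamp": "%Y-%m-%d %H:%M:%S",  # e.g. 2000-08-19 14:32:22
    "short_date": "%m/%d/%y",          # MM/DD/YY, e.g. 08/19/00
}


def parse(value: str, kind: str) -> datetime:
    """Convert a text value into a datetime using the named layout."""
    return datetime.strptime(value, FORMATS[kind])


def interval_seconds(start: datetime, end: datetime) -> float:
    """Return the length of the interval from start to end, in seconds."""
    return (end - start).total_seconds()


# A date without a time component is parsed as midnight of that day.
day = parse("08/19/00", "short_date")
stamp = parse("2000-08-19 14:32:22", "timestamp")

# Interval between midnight and the timestamp on the same day.
gap = interval_seconds(day, stamp)
print(gap)  # 52342.0 seconds, i.e. 14 h 32 min 22 s
```

Once every value has been converted to the same datetime type, the difference between two values is an exact interval regardless of how each one was originally written. Values that carry less information, such as a bare year or a bare time of day, are filled in with defaults by strptime (January 1, midnight, or the year 1900), so intervals computed from them should be interpreted with care.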

Reading Structured, Semi-structured, and Unstructured Data
[Figures: two slides on reading structured, semi-structured, and unstructured data; images not recoverable]

Any questions?
Princess Sumaya University for Technology
P.O. Box 1438 Al-Jubaiha, Amman 11941, Jordan
www.psut.edu.jo
Call: (+962) 6-5359 949
Fax: (+962) 6-5347 295
Email: info@psut.edu.jo