BADM-578 Data Visualization
Understanding the Data
Ghulam Nabi

Data and Metadata
• Data consists of raw, unorganized facts that have little value until they are processed and organized into information.
• Data is a set of facts and statistics that can be operated on, referred to, or analyzed.
• Metadata is data about data: it records basic information about the data, which makes finding and working with specific instances of data easier.
• Metadata provides a framework for the data and helps business users understand the data available within the warehouse and transform it into meaningful information.

Data Types
◼ Types of data (“datatypes”):
  ◼ Fundamental or primitive datatypes
    ◼ Number (integer, float, double), e.g., 1, 2, 12324, 1.5
    ◼ Character: a single symbol that typically occupies 1 byte (8 bits), such as ‘A’, ‘a’, ‘1’, ‘2’
    ◼ String: a group of characters (for example, a word), such as “1234”, “1234A”, “Smith”
    ◼ Boolean (True/False)
  ◼ Derived datatypes: mixed datatypes such as DATE/DATETIME
  ◼ Abstract or user-defined datatypes
© 2015 JOHN WILEY & SONS, INC. ALL RIGHTS RESERVED.

Data Types
◼ Fundamental or primitive datatypes are built-in (predefined) and can be used directly to declare variables: integer, character, Boolean (True/False), floating point, and double floating point.
◼ The character datatype stores a single character, which typically requires 1 byte of memory.

Data Storage Formats
◼ Types of data storage formats:
  ◼ Files: electronic lists of data, optimized to perform a particular transaction.
  ◼ Databases: collections of groupings of information that are related to each other in some way.
    ◼ Such a collection of interrelated data supports efficient retrieval, insertion, and deletion of data, and organizes the data into tables, views, schemas, and so on.
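The primitive and derived datatypes listed on the Data Types slides can be tried out in a few lines of Python. This is only a sketch: Python has no separate character or double type (a one-character string and the built-in float stand in for them), and the date values are made up for illustration.

```python
from datetime import date, datetime

# One illustrative value per datatype named on the Data Types slides.
samples = {
    "integer": 12324,
    "float": 1.5,
    "character": "A",        # Python has no char type; a 1-character string stands in
    "string": "1234A",
    "boolean": True,
    "date": date(2015, 1, 31),                    # hypothetical example date
    "datetime": datetime(2015, 1, 31, 9, 30, 0),  # hypothetical example timestamp
}

# Show which Python type holds each value.
for name, value in samples.items():
    print(f"{name:10} {type(value).__name__:9} {value!r}")

# A basic Latin character such as 'A' occupies 1 byte (8 bits) in ASCII/UTF-8.
char_size_bytes = len("A".encode("utf-8"))

# "1234" is a string, not a number: + concatenates strings but adds integers.
as_text = "1234" + "1"
as_number = int("1234") + 1
```

The last two lines show why the declared type matters before analysis: the same digits behave differently as text and as a number.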
Files
◼ Data file: an electronic list of information that is formatted for a particular transaction.
◼ Sequential organization is typical.
◼ Associations between records are created by pointers.
◼ A pointer is a variable that holds the memory address of a value rather than the value itself.
◼ Such files are also called linked lists because the records are linked together using pointers.

Types of Files
◼ Master files – store core information that is important to the application.
◼ Look-up files – contain static values.
◼ Transaction files – store information that can be used to update a master file.
◼ Audit files – record “before” and “after” images of data as the data is altered.
◼ History files (or archive files) – store past transactions.

Databases
◼ A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS).
◼ Together, the data, the DBMS, and the applications associated with them are referred to as a database system, often shortened to just “database”.
◼ In most common databases today, data is modeled in rows and columns in a series of tables to make processing and querying efficient.

Databases
◼ The data in a database can be easily accessed, managed, modified, updated, controlled, and organized.
◼ Most databases use Structured Query Language (SQL) for writing data to and retrieving data from the database.
◼ SQL is a programming language used by nearly all relational databases to query, manipulate, and define data, and to provide access control.
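The three SQL roles named above (defining, manipulating, and querying data) can be sketched with Python's built-in sqlite3 module. The `customers` table and its rows are hypothetical, and SQLite is used only because it needs no server. Access control (for example, GRANT statements) is not shown because SQLite does not support it.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Define data: create a table with typed columns (hypothetical schema).
cur.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)

# Manipulate data: insert, update, and delete rows.
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Smith", "Chicago"), ("Jones", "Boston"), ("Lee", "Chicago")],
)
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Denver", "Jones"))
cur.execute("DELETE FROM customers WHERE name = ?", ("Lee",))
conn.commit()

# Query data: retrieve the remaining rows in a defined order.
rows = cur.execute("SELECT name, city FROM customers ORDER BY name").fetchall()
print(rows)

conn.close()
```

The `?` placeholders pass values separately from the SQL text, which is the usual way to avoid malformed or injected statements.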
Databases
There are many types of databases:
◼ Legacy databases
◼ Relational databases (RDBMS), or SQL databases
◼ Object-oriented databases, or object databases
◼ Distributed databases
◼ Data warehouses
◼ Multidimensional databases
◼ NoSQL databases
◼ Graph databases

Multidimensional Database Outline

NoSQL Databases
◼ A newer database approach; not based on the relational model or SQL.
◼ Rapid processing on replicated database servers in the cloud.
◼ Various types include:
  ◼ Document-oriented databases: manage collections of documents of varying forms and structures (e.g., MongoDB).
  ◼ Wide-column databases: store data in records holding very large numbers of dynamic columns (potentially billions), e.g., Bigtable and Cassandra. Dynamo, often listed alongside them, is a key-value store.
  ◼ Graph databases: use graph theory to store, map, and query relationships as a collection of nodes and edges.

NoSQL Databases
◼ A NoSQL, or nonrelational, database allows unstructured and semi-structured data to be stored and manipulated (in contrast to a relational database, which defines how all data inserted into the database must be composed).
◼ Data is typically denormalized.
◼ NoSQL databases grew popular as web applications became more common and more complex.
◼ Example: MongoDB, a document-oriented database in which data that a relational database would hold in tables is stored as documents (JSON objects). Vertica and Netezza, in contrast, are SQL-based analytic databases.

Selecting a Storage Format
◼ Both file and database storage formats have strengths and weaknesses.
◼ Factors to consider in selecting a storage format:
  ◼ Data types
  ◼ Type of application system
  ◼ Existing storage formats
  ◼ Future needs

OLTP and OLAP Systems
◼ OLTP (online transaction processing) systems are designed to handle large volumes of transactional data involving multiple users.
◼ Online transaction processing typically involves inserting, updating, and/or deleting data in a data store to collect, manage, and secure those transactions.
◼ An OLAP (online analytical processing) system is designed to process large amounts of data quickly, allowing users to analyze multiple data dimensions in tandem.
◼ With an OLAP system, teams can use data for decision-making and problem-solving.

OLTP and OLAP Systems
◼ OLTP is an online data modification system, whereas OLAP is an online, historical, multidimensional data store used to retrieve large amounts of data for analysis.

OLTP and OLAP Systems
◼ OLTP enables the real-time execution of large numbers of transactions by large numbers of people, whereas OLAP usually involves querying these transactions (also referred to as records) in a database for analytical purposes. OLAP helps companies extract insights from their transaction data so they can make more informed decisions.
◼ OLTP also covers any kind of interaction or action, such as downloading PDFs from a web page, viewing a specific video, automatic maintenance triggers, or comments on social channels, that may be critical for a business to record in order to serve its customers better.

A Data Warehouse
◼ “A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.”
◼ A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.
A Data Warehouse
[Figure: Business Intelligence (BI) / Data Visualization]

A Data Warehouse
◼ Information is integrated in advance.
◼ It is stored in the data warehouse (DWH) for direct querying and analysis.
[Figure: labels are Source, Extractor/Monitor (one per source), Integration System, Metadata, Data Warehouse, and Clients]

A Data Mart
◼ A data mart is a simplified form of a data warehouse that focuses on a single area of the business.
◼ Because a data mart draws on fewer data sources than a data warehouse, teams can access data quickly without the complexity of a full warehouse.
◼ Data marts provide a single source of truth and serve the needs of specific business teams.

A Data Lake
◼ A data lake collects raw data and events from diverse source systems and supports data preparation and exploratory analysis, whereas a data warehouse holds processed data.
◼ A data lake lets organizations store large amounts of structured, semi-structured, and unstructured data without knowing ahead of time how the data will be used.
◼ A data warehouse is used for structured, filtered data that has an intended purpose.

BI and DM
◼ Business Intelligence (BI): information that people use to support their decision-making efforts.
◼ Data Mining (DM): the core of the KDD (Knowledge Discovery in Databases) process, involving algorithms that explore the data, develop the model, and discover previously unknown patterns.
◼ DM: the discovery of new, non-obvious, valuable information from a large collection of raw data.
◼ DM: the set of activities used to find new, hidden, or unexpected patterns in data.

Knowledge Discovery (KDD) Process
BI and DM

BI/DM: Data Visualization & Reporting

Data Prep
• Explain why data needs to be preprocessed.
• Describe the steps involved in data processing, including:
  • Data quality assessment
  • Data cleaning
  • Data transformation
  • Data reduction
• Discuss the importance of analytical thinking in data visualization.
• Discuss the importance of critical thinking in data visualization.
• Examine data transformation techniques.

Data Preparation
◼ Data preparation is the act of manipulating raw data into a form that can readily and accurately be analyzed, e.g., for business purposes.
▪ Data preparation involves two essential steps: data preprocessing and data wrangling.

Data Prep
◼ Data Wrangling
▪ Data wrangling occurs after data preprocessing.
▪ It is the process of transforming and mapping data from one “raw” form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics.
▪ It is typically used when building a machine learning model.
▪ It involves cleaning the raw dataset into a format compatible with the machine learning model.

Data Prep
◼ Data Preprocessing
▪ Data preprocessing occurs first and helps convert raw, unclean data into a usable format.
▪ It involves data cleaning, integration, transformation, and reduction.

Data Prep & Wrangling
◼ Structured data
▪ For structured data, data preparation and wrangling involve data cleansing and data preprocessing:
  1. Data cleansing
  2. Transformations (also called “wrangling”)
• Data Cleansing: The objective of data cleansing is to remove inaccurate data from the dataset.
• Data Wrangling: The objective of wrangling is to transform the data into a more usable format.

Data Prep & Wrangling
Data cleansing involves resolving:
o Incompleteness errors: Data is missing.
o Invalidity errors: Data is outside a meaningful range.
o Inaccuracy errors: Data is not a measure of the true value.
o Inconsistency errors: Data conflicts with the corresponding data points or with reality.
o Non-uniformity errors: Data is not present in an identical format.
o Duplication errors: Duplicate observations are present.
Actions:
• Find and remove duplicate tuples.
• Detect inconsistent or wrong data, such as attribute values that do not match.
• Patch missing or unreadable data.
• Notify sources of the errors found.

Data Prep & Wrangling
Data preprocessing involves performing the following transformations:
o Extraction: A new variable is extracted from an existing variable to make analysis and ML model training easier.
o Aggregation: Two or more variables are consolidated into a single variable.
o Filtration: Data rows not required for the project are removed.
o Selection: Data columns not required for the project are removed.
o Conversion: The variables in the dataset are converted into appropriate types so that they can be processed and analyzed correctly.

Data Prep & Wrangling
Data transformation involves changing data:
o Convert data to a uniform format
   o Byte ordering, string termination
   o Internal layout
o Remove, add, and reorder attributes
   o Add key
   o Add data to get history
o Sort tuples

Data Prep & Wrangling
Data integration involves getting data from multiple sources:
o Receive data (changes) from multiple wrappers/monitors and integrate them into the warehouse
o Rule-based
o Actions
   o Resolve inconsistencies
   o Eliminate duplicates
   o Integrate into warehouse
   o Summarize data
   o Fetch more data from sources

Data Prep & Wrangling
◼ Unstructured data
▪ For unstructured data, data preparation and wrangling involve a set of text-specific cleansing and preprocessing tasks.
▪ Text cleansing involves removing the following unnecessary elements from the raw text:
   o HTML tags
   o Most punctuation
   o Most numbers
   o White space

Data Prep & Wrangling
▪ Text preprocessing involves performing the following transformations:
   o Tokenization: Splitting a given text into separate tokens, where each token is equivalent to a word.
   o Normalization: The normalization process involves the following actions:
      • Lowercasing - Removes differences among the same words due to upper and lower case.
      • Removing stop words - Stop words are commonly used words such as 'the', 'is', and 'a'.
      • Stemming - Converting inflected forms of a word into its base word.
      • Lemmatization - Converting inflected forms of a word into its morphological root (known as the lemma).
      • Lemmatization is a more sophisticated approach than stemming and is more difficult and expensive to perform.
   o Creating a bag-of-words (BOW): A BOW is the collection of distinct tokens in the text; it does not capture the position or sequence of the words.
   o Organizing the BOW into a document term matrix (DTM): A DTM is a table in which each row belongs to a document (or text file) and each column represents a token (or term). The number of rows equals the number of documents in the sample dataset, and the number of columns equals the number of tokens in the final BOW. Each cell contains the number of times a token is present in a document.
   o N-grams and n-gram BOW: In some cases, a sequence of words conveys more meaning than individual words. An n-gram is a representation of a word sequence whose length varies from 1 to n. A one-word sequence is a unigram, a two-word sequence is a bigram, a three-word sequence is a trigram, and so on.
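The text cleansing and preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: the two documents, the tiny stop-word list, and the helper names (`clean`, `crude_stem`, `preprocess`, `ngrams`) are all made up for the example. The stemmer only strips a few suffixes, so its stems are not always real words. Lemmatization is omitted because it needs a dictionary-backed library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def clean(text):
    """Text cleansing: drop HTML tags, punctuation, numbers, extra white space."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # punctuation and numbers
    return re.sub(r"\s+", " ", text).strip()   # repeated white space

def crude_stem(token):
    """Naive suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, lowercase, remove stop words, then stem."""
    tokens = clean(text).lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical two-document sample.
docs = [
    "<p>The market is rising; 3 analysts raised targets.</p>",
    "Analysts expect the market to fall.",
]

token_lists = [preprocess(d) for d in docs]

# Bag-of-words: the distinct tokens across all documents (order is not kept).
bow = sorted({t for tokens in token_lists for t in tokens})

# Document term matrix: one row per document, one column per BOW token.
dtm = []
for tokens in token_lists:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in bow])

# Bigrams (n = 2) for the first document.
bigrams = ngrams(token_lists[0], 2)

print(bow)
for row in dtm:
    print(row)
print(bigrams)
```

The DTM has 2 rows (documents) and 7 columns (tokens in the BOW), matching the row and column rule described above.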