BHB 3302 Hospitality Data Analytics
Data Management

What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance

In the table below, each row is an object and each column is an attribute.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data - Attribute Values
► Attribute values are numbers or symbols assigned to an attribute
► Distinction between attributes and attribute values
  ▪ The same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  ▪ Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are integers
  ▪ But the properties of attribute values can be different
    Example: ID has no limit, but age has a maximum and minimum value

Data – Hierarchy of Data

Data - Entities, Attributes, and Keys

Data Types
Credit: towardsdatascience, 2019

Let's start with a video: “Types of Data”
https://www.youtube.com/watch?v=hZxnzfnt5v8

Categorical Data
► Categorical data represents characteristics.
► It can therefore represent things like a person’s gender, language, etc. Categorical data can also take on numerical values (Example: 1 for female and 0 for male).
► Note that those numbers don’t have mathematical meaning.
Credit: towardsdatascience, 2019

Categorical Data - Nominal Data
► Nominal values represent discrete units and are used to label variables that have no quantitative value.
  Just think of them as "labels".
► Note that nominal data has no order.
► Examples of nominal variables include: blood type, zip code, gender, race, eye color.
Credit: towardsdatascience, 2019

Categorical Data - Ordinal Data
► Ordinal values represent discrete and ordered units.
► Ordinal data is therefore nearly the same as nominal data, except that its ordering matters. Examples: survey rankings (Likert scale).
Credit: towardsdatascience, 2019

Numerical Data - Discrete Data
► We speak of discrete data if its values are distinct and separate. In other words, the data can only take on certain values.
► This type of data can’t be measured, but it can be counted. It basically represents information that can be categorized into a classification. An example is the number of heads in 100 coin flips.
► To check whether you are dealing with discrete data, ask two questions:
  → Can you count it?
  → Can it be divided up into smaller and smaller parts?
  Discrete data can be counted but cannot be divided into ever smaller parts.
Credit: towardsdatascience, 2019

Numerical Data - Continuous Data / Interval Data
► Continuous data represents measurements; its values can’t be counted, but they can be measured.
► An example is the height of a person, which you can describe by using intervals on the real number line.
► Interval values represent ordered units that have the same difference. We therefore speak of interval data when a variable contains numeric values that:
  → are ordered
  → and where we know the exact differences between the values.
► The problem with interval data is that it doesn’t have a "true zero".
  That means, for example, that there is no such thing as "no temperature": 0 °C does not mean the absence of temperature. With interval data, we can add and subtract, but we cannot multiply, divide, or calculate ratios. Because there is no true zero, many descriptive and inferential statistics can’t be applied.
Credit: towardsdatascience, 2019

Numerical Data - Ratio Data
► Ratio values are also ordered units that have the same difference.
► Ratio values are the same as interval values, with the difference that they do have an absolute zero. Good examples are height, weight, length, etc.
Credit: towardsdatascience, 2019

Recap – Difference between Discrete and Continuous Data

Data - Recap
– Nominal
  • Examples: ID numbers, eye color, zip codes
– Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
  • Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
  • Examples: temperature in Kelvin, length, time, counts

Data Management – the Basics
Databases are playing an increasingly important role in the Information Age. In the hospitality industry, their effective usage can help every department better manage:
► assets,
► expenses and
► sales.

A database is an organized, centralized collection of data serving applications.
– Databases are a key element of most mission-critical applications and represent the most common type of back-end software.
A property’s databases store data on such things as its transactions, products, employees, guests, and assets.
► Such databases must be efficiently organized and easy to access.
► They must also provide data integrity and ensure the reliability of stored data.
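
The data integrity requirement above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the tables `guests` and `stays`, their columns, and the helper names are hypothetical and chosen only for illustration, not taken from any real property system.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE guests ("
        " guest_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE stays ("
        " stay_id INTEGER PRIMARY KEY,"
        " guest_id INTEGER NOT NULL REFERENCES guests(guest_id),"
        " nights INTEGER NOT NULL CHECK (nights > 0))"
    )
    return conn


def try_insert(conn, sql, params):
    """Return True if the row was stored, False if a rule rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With these assumed rules, a stay that refers to an unknown guest or has zero nights is rejected by the database itself, so the stored data stays reliable even if an application sends bad input.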
Difference between Database and Data Warehouse
A database is a collection of related data that represents some elements of the real world, whereas a data warehouse is an information system that stores historical and cumulative data from single or multiple sources. A database is designed to record data, whereas a data warehouse is designed to analyze data.
Credit: panoply, 2018

[Figure: The DBMS serves as the link between departmental or user-specific requests and the database]

Data Models
► Underlying the structure of a database is the data model: a collection of conceptual tools for describing data, data relationships, data semantics [meaning], and consistency constraints.
► Two widely used data models are:
  - the relational model and
  - the entity-relationship model.
Credit: Taneja, 2011

Database Models - Atomicity problems
► A computer system, like any other mechanical or electrical device, is subject to failure.
► In many applications, it is crucial that, if a failure occurs, the data be restored to the consistent state that existed prior to the failure.
► Consider a program to transfer $50 from account A to account B. If a system failure occurs during the execution of the program, it is possible that the $50 was removed from account A but was not credited to account B, resulting in an inconsistent database state.
► Clearly, it is essential to database consistency that either both the credit and the debit occur, or that neither occurs.
► That is, the funds transfer must be atomic: it must happen in its entirety or not at all.
Credit: Taneja, 2011

Activity (wk. 2)
As a group, critically analyze the use of databases in the hospitality industry. Focus on the following questions:
1. Which ones are you aware of?
2. Who is in charge of the database (e.g., the POS)?
3. What are the challenges with these databases, if any?
4.
   If asked to create “the optimal” database, what would your requirements be (e.g., data rules, data inputs)?

Credit: https://www.slideshare.net/IITA-CO/introduction-to-data-management-terminologies-and-use-of-data-managementplatforms?utm_source=slideshow&utm_medium=ssemail&utm_campaign=download_notification

Data Quality
► What kinds of data quality problems might a business face?
► How can we detect problems with the data?
► What can we do about these problems?
► Examples of data quality problems:
  – Noise and outliers
  – Missing values
  – Duplicate data

Data Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.

Missing Values
► Reasons for missing values
  – Information is not collected (e.g., people decline to give their age and weight)
  – Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
► Handling missing values
  – Eliminate data objects
  – Estimate missing values
  – Ignore the missing value during analysis
  – Replace with all possible values (weighted by their probabilities)

Duplicate Data
► A data set may include data objects that are duplicates, or almost duplicates, of one another
  – Major issue when merging data from heterogeneous sources
► Examples:
  – Same person with multiple email addresses
► Data cleaning
  – Process of dealing with duplicate data issues

Data Preprocessing
► Aggregation
► Sampling
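
Two of the missing-value options listed earlier (eliminating objects and estimating values) and a simple duplicate check can be sketched in plain Python. This is a minimal sketch under assumptions: the records, the field names `guest_id` and `age`, and the choice of the mean as the estimate are all hypothetical.

```python
from statistics import mean

# Made-up guest records: one missing age and one exact duplicate.
guests = [
    {"guest_id": 1, "age": 30},
    {"guest_id": 2, "age": None},
    {"guest_id": 3, "age": 50},
    {"guest_id": 1, "age": 30},
]


def drop_missing(records, field):
    """Eliminate data objects whose value for `field` is missing."""
    return [r for r in records if r.get(field) is not None]


def impute_mean(records, field):
    """Estimate missing values with the mean of the known values (an assumption)."""
    known = [r[field] for r in records if r.get(field) is not None]
    fill = mean(known)
    return [
        {**r, field: fill if r.get(field) is None else r[field]}
        for r in records
    ]


def drop_duplicates(records, key):
    """Keep only the first record seen for each value of `key`."""
    seen = set()
    unique = []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            unique.append(r)
    return unique
```

For example, `impute_mean(drop_duplicates(guests, "guest_id"), "age")` first removes the repeated guest and then fills the missing age. Near-duplicates, such as the same person with several email addresses, need fuzzier matching than this exact-key check.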
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
► Purpose
  – Data reduction
    • Reduce the number of attributes or objects
  – Change of scale
    • Cities aggregated into regions, states, countries, etc.
  – More “stable” data
    • Aggregated data tends to have less variability

Sampling
Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.
► Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
► Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Normalization is the process of efficiently organizing data by eliminating redundant data and maintaining dependencies. In simple words, normalization is a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristics (insertion, update, and deletion anomalies) that could lead to a loss of data integrity.
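
As a rough illustration of the normalization idea above, the following sketch splits a flat booking list, in which the guest's email is repeated on every row, into two structures so that each fact is stored once. The data and all field names are made up for illustration.

```python
# Hypothetical flat table: guest details are repeated on every booking row.
flat_bookings = [
    {"booking_id": 1, "guest_id": 10, "guest_email": "ana@example.com", "room": "101"},
    {"booking_id": 2, "guest_id": 10, "guest_email": "ana@example.com", "room": "202"},
    {"booking_id": 3, "guest_id": 11, "guest_email": "ben@example.com", "room": "101"},
]


def normalize(rows):
    """Split the flat rows into a guests table and a bookings table."""
    guests = {}    # guest_id -> guest attributes, stored exactly once
    bookings = []  # each booking refers to a guest by guest_id only
    for row in rows:
        guests[row["guest_id"]] = {"guest_email": row["guest_email"]}
        bookings.append(
            {
                "booking_id": row["booking_id"],
                "guest_id": row["guest_id"],
                "room": row["room"],
            }
        )
    return guests, bookings


guests, bookings = normalize(flat_bookings)
```

After the split, changing a guest's email means updating one entry in `guests` instead of every booking row, which is the kind of update anomaly the definition refers to.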
Credit (normalization definition): Quora, 2019

Database Management - Practical Problems
► Combining different data sets such as AirDNA, STR, or STB
► Data cleansing issues
► Consistency in key attributes
► Double-check data entry and outcomes

The Concept of Extraction, Transformation, and Load (ETL)
► A data warehousing process that consists of:
  – extraction (i.e., reading data from a database),
  – transformation (i.e., converting the extracted data from its previous form into the form it needs to be in so that it can be placed into a data warehouse or simply another database), and
  – load (i.e., putting the data into the data warehouse)

The first phase of an ETL process focuses on retrieving the data from the storage source. Most data warehousing projects integrate data received from various source systems. Common data source structures are relational databases and plain data files (flat files).

The transform phase applies a series of rules or operations to the extracted data to deliver it in its final form for use at the receiving end. Some data sources need very little or even no data processing. Sometimes one or more transformations may be critical to match the business and technical requirements of the target database.

The load stage sends the data to the receiving end, which is usually a data warehouse. Depending on the needs of the application, this process may be very simple or very complicated. Some data warehouses overwrite old data with cumulative data. Extracted data is normally updated on a periodic basis.
Credit: Abuein, 2010
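
The three ETL phases described above can be sketched end to end with Python's standard library. This is a minimal sketch under assumptions: the CSV content, the table name `daily_revenue`, and the cleaning rules (skip rows without revenue, tidy property names) are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Made-up source data: one row has no revenue, one has inconsistent casing.
RAW_CSV = """date,property,revenue
2024-01-01,Hotel A,1200.50
2024-01-01,hotel b,
2024-01-02,Hotel A,980.00
"""


def extract(text):
    """Extraction: read rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: drop rows without revenue, tidy names, convert types."""
    cleaned = []
    for row in rows:
        if not row["revenue"]:
            continue  # assumed rule: skip records with missing revenue
        name = row["property"].strip().title()
        cleaned.append((row["date"], name, float(row["revenue"])))
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_revenue "
        "(date TEXT, property TEXT, revenue REAL)"
    )
    with conn:  # commit all rows together
        conn.executemany("INSERT INTO daily_revenue VALUES (?, ?, ?)", rows)


def run_etl(text):
    """Run the three phases in order and return the target connection."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn
```

Calling `run_etl(RAW_CSV)` returns a connection whose `daily_revenue` table holds only the cleaned rows. In practice each phase would read from and write to real systems on a schedule rather than a string and an in-memory database.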

Data Management – the Key Takeaways
► It is important to understand the structure and type of data
► Data cleansing is paramount, but takes most of the time
► Data transformation and processing take place in data warehouses, based on the ETL concept