
BHB3302 HDA - AY 2021-2022 - Topic 2, Data Management

BHB 3302
Hospitality Data Analytics
Data Management
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance
In the table, each column is an attribute and each row is an object:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
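As a minimal sketch (Python is an assumption; the slides name no language), the first three rows of the table above can be held as a list of records. Each dictionary is one object and each key is one attribute. The key names and the choice to store income in thousands are illustrative.

```python
# Each dict is one data object (record); each key is one attribute (field).
# Values are copied from the first three rows of the table.
records = [
    {"tid": 1, "refund": "Yes", "marital_status": "Single", "taxable_income_k": 125, "cheat": "No"},
    {"tid": 2, "refund": "No", "marital_status": "Married", "taxable_income_k": 100, "cheat": "No"},
    {"tid": 3, "refund": "No", "marital_status": "Single", "taxable_income_k": 70, "cheat": "No"},
]

# The attribute names are the keys shared by every object.
attributes = list(records[0].keys())

# One attribute read across all objects is one column of the table.
marital_status_column = [r["marital_status"] for r in records]

print(attributes)
print(marital_status_column)
```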
Data - Attribute Values
► Attribute values are numbers or symbols assigned to an Attribute
► Distinction between Attributes and Attribute values
▪ Same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters
▪ Different attributes can be mapped to the same set of values
Example: Attribute values for ID and age are integers
▪ But properties of attribute values can be different
Example: ID has no limit but age has a maximum and minimum value
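A small sketch of this distinction, using made-up values: the same height maps to two different attribute values depending on the unit, while ID and age share the integer type but differ in which values are valid. The age bounds of 0 to 120 and all names are assumptions for illustration.

```python
FEET_PER_METER = 3.28084  # conversion factor

def meters_to_feet(height_m: float) -> float:
    """Map the same attribute (height) to a different attribute value."""
    return height_m * FEET_PER_METER

def is_valid_age(age: int, minimum: int = 0, maximum: int = 120) -> bool:
    """Age is an integer with a minimum and a maximum (bounds assumed)."""
    return minimum <= age <= maximum

height_m = 1.80                        # hypothetical person
height_ft = meters_to_feet(height_m)   # same height, different value

guest_id = 1902   # an ID is also an integer, but it has no upper limit
guest_age = 34

print(round(height_ft, 2))
print(is_valid_age(guest_age), is_valid_age(guest_id))
```

The last line shows that a value acceptable as an ID is not necessarily acceptable as an age, even though both are integers.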
Data – Hierarchy of Data
Data - Entities, Attributes, and Keys
Data Types
Credit: towardsdatascience, 2019
Let's start with a video:
“Types of Data”
https://www.youtube.com/watch?v=hZxnzfnt5v8
Categorical Data
► Categorical data represents characteristics.
► It can therefore represent things like a person’s gender or language.
Categorical data can also take on numerical values (example: 1 for female
and 0 for male).
► Note that those numbers have no mathematical meaning.
Credit: towardsdatascience, 2019
Categorical Data - Nominal Data
► Nominal values represent discrete units and are used to label variables
that have no quantitative value.
Just think of them as “labels”.
► Note that nominal data has no order.
► Examples of nominal variables include: blood type, zip code, gender, race,
eye color.
Credit: towardsdatascience, 2019
Categorical Data - Ordinal Data
► Ordinal values represent discrete and ordered units.
► Ordinal data is therefore nearly the same as nominal data,
except that its ordering matters.
Examples: survey rankings (Likert scale).
Credit: towardsdatascience, 2019
Numerical Data - Discrete Data
► We speak of discrete data if its values are distinct and separate.
In other words, the data can only take on certain values.
► This type of data can’t be measured, but it can be counted.
It basically represents information
that can be categorized into a classification.
An example is the number of heads in 100 coin flips.
Credit: towardsdatascience, 2019
Numerical Data - Discrete Data
► To check whether you are dealing with discrete data, ask two questions:
→ Can you count it?
→ Can it be divided up into smaller and smaller parts?
If it can be counted but cannot be divided further, the data is discrete.
Credit: towardsdatascience, 2019
Numerical Data - Continuous Data / Interval Data
► Continuous data represents measurements; its values can’t be counted,
but they can be measured.
► An example is the height of a person, which you can describe using intervals
on the real number line.
Credit: towardsdatascience, 2019
Numerical Data - Continuous Data / Interval Data
► Interval values represent ordered units that have the same difference.
We therefore speak of interval data when a variable
contains numeric values that:
→ are ordered, and
→ have known, exact differences between them.
Credit: towardsdatascience, 2019
Numerical Data - Continuous Data / Interval Data
► The problem with interval data is that it has no “true zero”.
For example, a temperature of 0 degrees Celsius does not mean that there is no temperature.
With interval data, we can add and subtract, but we cannot multiply,
divide, or calculate ratios.
Because there is no true zero, many descriptive
and inferential statistics can’t be applied.
Credit: towardsdatascience, 2019
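A worked example may make the missing true zero concrete. The sketch below, with illustrative temperatures, compares the ratio of two Celsius readings with the ratio of the same readings on the Kelvin scale, which does have a true zero.

```python
def celsius_to_kelvin(temp_c: float) -> float:
    """Convert from the interval scale (Celsius) to the ratio scale (Kelvin)."""
    return temp_c + 273.15

cool_c, warm_c = 10.0, 20.0  # illustrative temperatures

# Differences are meaningful on an interval scale.
difference = warm_c - cool_c

# The ratio of the Celsius numbers suggests "twice as warm" ...
naive_ratio = warm_c / cool_c

# ... but on the Kelvin scale the same two temperatures differ by only a few percent.
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference, naive_ratio, round(true_ratio, 3))
```

The difference of 10 degrees is the same on both scales, but the ratio of 2 obtained from the Celsius numbers is an artifact of where the zero point was placed.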
Numerical Data - Ratio Data
► Ratio values are also ordered units that have the same difference.
► Ratio values are the same as interval values, except that they
do have an absolute zero.
Good examples are height, weight, and length.
Credit: towardsdatascience, 2019
Recap – Difference between Discrete and Continuous Data
Data - Recap
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
• Examples: temperature in Kelvin, length, time, counts
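The four scales differ in which summaries make sense. The sketch below is one possible illustration using Python's standard `statistics` module; the sample values and the ordering of the height labels are assumptions made for the example.

```python
from statistics import mean, median, mode

# Nominal: labels only, so the mode is the natural summary.
eye_colors = ["brown", "blue", "brown", "green"]
most_common_eye_color = mode(eye_colors)

# Ordinal: ordered labels, so a median of the ranks is meaningful.
SIZE_ORDER = ["short", "medium", "tall"]  # assumed ordering
heights = ["tall", "short", "medium", "medium", "tall"]
ranks = sorted(SIZE_ORDER.index(h) for h in heights)
median_height = SIZE_ORDER[int(median(ranks))]

# Interval: differences and means are meaningful, ratios are not.
temps_c = [18.0, 21.0, 24.0]
mean_temp_c = mean(temps_c)

# Ratio: a true zero exists, so "twice as long" is meaningful.
lengths_m = [2.0, 4.0]
length_ratio = lengths_m[1] / lengths_m[0]

print(most_common_eye_color, median_height, mean_temp_c, length_ratio)
```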
Data Management – the Basics
Databases play an increasingly important role in the
Information Age.
In the hospitality industry, their effective use can help every department
better manage:
► assets,
► expenses, and
► sales.
A Database is an organized, centralized collection of data serving applications.
– Databases are a key element of most mission-critical applications and represent the
most common type of back-end software.
A property’s databases store data on such things as its transactions,
products, employees, guests, and assets.
► Such databases must be efficiently organized
and easy to access.
► They must also provide data integrity
and ensure the reliability of stored data.
Credit: panoply, 2018
Difference between Database and Data Warehouse
A database is a collection of related data that represents some elements of the
real world,
whereas a data warehouse is an information system that stores historical and
cumulative data from single or multiple sources.
A database is designed to record data,
whereas a data warehouse is designed to analyze data.
Figure: The DBMS serves as the link between departmental or user-specific requests and the database.
Data Models
► Underlying the structure of a database is the data model:
a collection of conceptual tools for describing data, data relationships,
data semantics (meaning), and consistency constraints.
► Two data models are considered here:
- the relational model, and
- the entity-relationship model.
Credit: Taneja, 2011
Database Models - Atomicity problems
► A computer system, like any other mechanical or electrical device, is
subject to failure.
► In many applications, it is crucial that, if a failure occurs, the data be
restored to the consistent state that existed prior to the failure.
Credit: Taneja, 2011
Database Models - Atomicity problems
► Consider a program to transfer $50 from account A to account B.
If a system failure occurs during the execution of the program, it is
possible that the $50 was removed from account A but was not credited
to account B, resulting in an inconsistent database state.
► Clearly, it is essential to database consistency that either both the credit
and the debit occur, or that neither occurs.
► That is, the funds transfer must be atomic: it must happen in its entirety
or not at all.
Credit: Taneja, 2011
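The transfer example can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions, not the method from the cited source: the table name, the starting balances of 100, and the simulated failure are all made up. Using the connection as a context manager commits the transaction on success and rolls it back if an exception escapes.

```python
import sqlite3

def transfer(conn, source, target, amount, fail_midway=False):
    """Move `amount` between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
            if fail_midway:
                raise RuntimeError("simulated system failure")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
    except RuntimeError:
        pass  # the debit above has already been rolled back

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 100)])
conn.commit()

transfer(conn, "A", "B", 50, fail_midway=True)
after_failure = balances(conn)  # both balances are unchanged

transfer(conn, "A", "B", 50)
after_success = balances(conn)  # A was debited and B was credited
```

After the simulated failure neither account has changed, so the database never shows the $50 as removed from A without being credited to B.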
Activity (wk. 2)
As a group, critically analyze the use of databases in the hospitality industry.
Focus on the following questions:
1. Which databases are you aware of?
2. Who is in charge of the database, e.g. the POS?
3. What are the challenges with these databases, if any?
4. If you were asked to create “the optimal” database, what would your
requirements be (e.g. data rules, data inputs)?
Credit: https://www.slideshare.net/IITA-CO/introduction-to-data-management-terminologies-and-use-of-data-managementplatforms?utm_source=slideshow&utm_medium=ssemail&utm_campaign=download_notification
Data Quality
► What kinds of data quality problems might a business face?
► How can we detect problems with the data?
► What can we do about these problems?
► Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
Data Outliers
Outliers are data objects with characteristics that are considerably different
from those of most other data objects in the data set.
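One common way to flag candidate outliers is the 1.5 × IQR rule; the slides do not prescribe a method, so this is only one possible sketch. The room rates are hypothetical.

```python
from statistics import quantiles

# Hypothetical nightly room rates; one value is clearly unusual.
rates = [120, 125, 130, 128, 122, 127, 900, 124]

q1, _, q3 = quantiles(rates, n=4)  # first and third quartiles
iqr = q3 - q1                      # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged for review, not deleted.
outliers = [r for r in rates if r < lower or r > upper]
typical = [r for r in rates if lower <= r <= upper]

print(outliers)
```

With these values only the rate of 900 is flagged. Whether it is an error or a genuine but rare booking still has to be decided by someone who knows the data.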
Missing Values
► Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
► Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
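The first three handling strategies can be sketched as follows. The guest records are hypothetical, and mean imputation is only one of several ways to estimate a missing value; the fourth strategy (weighting all possible values by their probabilities) is not shown.

```python
from statistics import mean

# Hypothetical guest records; None marks a missing value.
guests = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},
    {"name": "Cho", "age": 46},
]

# 1. Eliminate data objects that have a missing value.
complete = [g for g in guests if g["age"] is not None]

# 2. Estimate the missing value, here with the mean of the known ages.
known_ages = [g["age"] for g in guests if g["age"] is not None]
estimate = mean(known_ages)
imputed = [{**g, "age": estimate if g["age"] is None else g["age"]} for g in guests]

# 3. Ignore the missing value during analysis: only known ages are averaged.
average_age = mean(known_ages)

print(len(complete), imputed[1]["age"], average_age)
```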
Duplicate Data
► Data set may include data objects that are duplicates, or almost duplicates
of one another
– Major issue when merging data from heterogeneous sources
► Examples:
– Same person with multiple email addresses
► Data cleaning
– Process of dealing with duplicate data issues
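A minimal sketch of merging near-duplicates, assuming that a normalised name plus phone number identifies a person. That matching rule, the field names, and the records are all hypothetical; real matching rules depend on the data.

```python
# Hypothetical guest records merged from two sources.
guest_records = [
    {"name": "Maria Lopez", "phone": "+65 5550 1234", "email": "maria@example.com"},
    {"name": "maria lopez ", "phone": "+6555501234", "email": "m.lopez@example.org"},
    {"name": "John Tan", "phone": "+65 5550 9876", "email": "john@example.com"},
]

def match_key(record):
    """Normalise name and phone so that near-duplicates share one key."""
    name = " ".join(record["name"].lower().split())
    phone = "".join(ch for ch in record["phone"] if ch.isdigit())
    return (name, phone)

merged = {}
for record in guest_records:
    key = match_key(record)
    entry = merged.setdefault(key, {"name": record["name"].strip(), "emails": []})
    entry["emails"].append(record["email"])

deduplicated = list(merged.values())
print(deduplicated)
```

The two Maria Lopez rows collapse into one object that keeps both email addresses, so no information is lost when the duplicate is removed.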
Data Preprocessing
► Aggregation
► Sampling
Aggregation
Combining two or more attributes (or objects) into a single attribute (or
object)
► Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
– More “stable” data
• Aggregated data tends to have less variability
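The change of scale from cities to regions can be sketched as below. The cities, regions, and room-night figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical monthly room nights sold per city.
city_sales = [
    {"city": "Singapore", "region": "Southeast Asia", "room_nights": 1200},
    {"city": "Bangkok", "region": "Southeast Asia", "room_nights": 900},
    {"city": "Paris", "region": "Europe", "room_nights": 1500},
    {"city": "Berlin", "region": "Europe", "room_nights": 700},
]

# Change of scale: combine city-level objects into one object per region.
totals = defaultdict(int)
for row in city_sales:
    totals[row["region"]] += row["room_nights"]

region_totals = dict(totals)
print(region_totals)
```

Four city-level objects are reduced to two region-level objects, which is the data reduction named in the list above.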
Sampling
Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data
analysis.
► Statisticians sample because obtaining the entire set of data of interest is too expensive
or time consuming.
► Sampling is used in data mining because processing the entire set of data
of interest is too expensive or time consuming.
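A short sketch of simple random sampling without replacement, which is one sampling scheme among several and is assumed here because the slides do not name one. The population of guest spend values is generated artificially.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: spend per guest for 10,000 stays.
population = [random.uniform(50, 500) for _ in range(10_000)]

# Simple random sample without replacement.
sample = random.sample(population, k=200)

population_mean = mean(population)
sample_mean = mean(sample)

print(round(population_mean, 2), round(sample_mean, 2))
```

The sample mean is computed from 200 values instead of 10,000 and is usually close to the population mean, which is why a sample is often good enough for a preliminary investigation.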
Normalization
Normalization is the process of efficiently organizing data by eliminating
redundant data and maintaining dependencies.
In simple words, normalization is a systematic way of ensuring
that a database structure is suitable for general-purpose querying
and free of certain undesirable characteristics
(insertion, update, and deletion anomalies)
that could lead to a loss of data integrity.
Credit: Quora, 2019
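To illustrate the update anomaly that normalization avoids, the sketch below splits a redundant booking table into a guest table and a booking table linked by a key. The tables are plain Python structures and every name and value is hypothetical.

```python
# Unnormalised: the guest's email is repeated on every booking row.
flat_bookings = [
    {"booking_id": 1, "guest_id": 10, "guest_email": "ana@example.com", "room": "101"},
    {"booking_id": 2, "guest_id": 10, "guest_email": "ana@example.com", "room": "205"},
    {"booking_id": 3, "guest_id": 11, "guest_email": "ben@example.com", "room": "101"},
]

# Normalised: guest details live in one place, keyed by guest_id.
guests = {row["guest_id"]: {"email": row["guest_email"]} for row in flat_bookings}
bookings = [
    {"booking_id": row["booking_id"], "guest_id": row["guest_id"], "room": row["room"]}
    for row in flat_bookings
]

# An email change is now a single update instead of one update per booking.
guests[10]["email"] = "ana.new@example.com"

# Looking the email up through the key gives one consistent value.
emails_for_guest_10 = {
    guests[b["guest_id"]]["email"] for b in bookings if b["guest_id"] == 10
}
print(emails_for_guest_10)
```

In the flat table the same change would have to be applied to two rows, and missing one of them would leave the data inconsistent.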
Database Management - Practical Problems
► Combining different data sets, such as AirDNA, STR, or STB
► Data Cleansing issues
► Consistency in Key attributes
► Double-check data entry and outcomes
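Key consistency is often the first obstacle when combining data sets. The sketch below uses two invented monthly data sets, which do not reflect the actual formats of any named provider, whose month keys are written differently. It converts both to a common key before joining and reports the months that do not match.

```python
from datetime import datetime

# Two hypothetical monthly data sets whose key attribute (the month)
# is written in different formats.
occupancy = {"2021-01": 0.62, "2021-02": 0.58}
listings = {"Jan 2021": 340, "Feb 2021": 355, "Mar 2021": 361}

def month_key(text):
    """Convert either format to a common 'YYYY-MM' key."""
    for fmt in ("%Y-%m", "%b %Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m")
        except ValueError:
            continue
    raise ValueError(f"unrecognised month: {text!r}")

occupancy_by_month = {month_key(k): v for k, v in occupancy.items()}
listings_by_month = {month_key(k): v for k, v in listings.items()}

# Combine only the months present in both sources, and report the rest.
shared = sorted(occupancy_by_month.keys() & listings_by_month.keys())
combined = {m: (occupancy_by_month[m], listings_by_month[m]) for m in shared}
unmatched = sorted(occupancy_by_month.keys() ^ listings_by_month.keys())

print(combined)
print(unmatched)
```

Reporting the unmatched keys supports the double-check on data entry and outcomes: a month that appears in only one source is visible instead of silently dropped.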
Credit: Abuein, 2010
The Concept of Extraction, Transformation, and Load (ETL)
► A data warehousing process that consists of:
– extraction (i.e., reading data from a database),
– transformation (i.e., converting the extracted data from its previous form
into the form in which it needs to be so that it can be placed into a data
warehouse or simply another database), and
– load (i.e., putting the data into the data warehouse)
Credit: Abuein, 2010
The Concept of Extraction, Transformation, and Load (ETL)
The first phase of an ETL process focuses on retrieving the data from the
source systems.
Most data warehousing projects integrate data received from several source
systems.
Common data source formats are relational databases and flat files.
Credit: Abuein, 2010
The Concept of Extraction, Transformation, and Load (ETL)
The transform phase applies a series of rules or operations to the extracted
data
to bring it into the form required at the receiving end.
Some data sources need very little or even no data processing.
In other cases, one or more transformations are needed to meet the business
and technical requirements of the target database.
Credit: Abuein, 2010
The Concept of Extraction, Transformation, and Load (ETL)
The load phase writes the data to the receiving end, which
is usually a data warehouse.
Depending on the needs of the application, this process may be very simple or
very complicated.
Some data warehouses overwrite existing data with cumulative data.
Extracted data is normally updated on a periodic basis.
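The three phases can be put together in a small end-to-end sketch. It assumes an in-memory CSV file as the source and an in-memory SQLite database standing in for the warehouse. The column names, the day/month/year date format, and the rule of dropping rows without revenue are illustrative choices, not taken from the cited source.

```python
import csv
import io
import sqlite3

# Hypothetical source file; the second row has no revenue value.
SOURCE_CSV = """date,room,revenue
01/03/2022,101,180.50
02/03/2022,101,
02/03/2022,205,220.00
"""

def extract(text):
    """Extract: read the rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: ISO dates, numeric revenue, incomplete rows dropped."""
    cleaned = []
    for row in rows:
        if not row["revenue"]:
            continue
        day, month, year = row["date"].split("/")
        cleaned.append((f"{year}-{month}-{day}", row["room"], float(row["revenue"])))
    return cleaned

def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS revenue (date TEXT, room TEXT, amount REAL)")
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))

row_count = warehouse.execute("SELECT COUNT(*) FROM revenue").fetchone()[0]
total = warehouse.execute("SELECT SUM(amount) FROM revenue").fetchone()[0]
print(row_count, total)
```

Keeping the three phases as separate functions mirrors the concept: each can be changed or rerun on its own, for example when a new source system is added.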
Data Management – the Key Takeaways
► It is important to understand the structure and type of data
► Data cleansing is paramount, but takes most of the time
► Data transformation and related processing take place in data
warehouses, based on the ETL concept