DSA5104 - L0 - overview

DSA5104
Principles of Data Management and Retrieval
Lecture 0: Overview
Data Is Ubiquitous
Social Media
E-Commerce
Internet of Things (IoT)
https://www.nist.gov/blogs/cybersecurity-insights/more-just-milestone-botnet-roadmap-towards-more-securable-iot-devices
Big Data - Data Flooding in Smart Cities
https://adtellintegration.com/smart-cities-infrastructure/
Telco + Metro Data → Crowd Forecasting
Call Detail Records (CDR) [1]
[1] Liang, Victor C., et al. "Mercury: Metro density prediction with recurrent neural network on streaming CDR data." ICDE 2016.
Keppel Bay Tower - Singapore’s first Green Mark Platinum (Zero Energy) commercial building
https://sleb.sg/UserFiles/Resource/GBIC/GBIC%20Demo%20Project/KBT%20GBIC%20Dashboard.pdf
Data Categorized by Data Types - Media Types
Data Categorized by Data Types - Intrinsic Structure
https://www.crowdstrike.com/blog/structured-unstructured-and-semi-structured-logging-explained/
Table & Record
Call Detail Records (CDR) [1]
Structured Data
§ Data conforms to a set schema
§ Numerical, categorical and text data with well-defined schema
Table name: instructor
Column       Data Type
ID           varchar(5)
name         varchar(20) not null
dept_name    varchar(20)
salary       numeric(8,2)
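As a minimal sketch of what "conforming to a set schema" means in practice, the instructor table can be declared in SQLite through Python's built-in sqlite3 module. The column names and types come from the slide; the sample row and the choice of SQLite are assumptions for illustration. SQLite accepts these type names but does not enforce the varchar lengths.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Column names and types follow the slide.
conn.execute(
    """
    CREATE TABLE instructor (
        ID        varchar(5),
        name      varchar(20) NOT NULL,
        dept_name varchar(20),
        salary    numeric(8, 2)
    )
    """
)

# Hypothetical sample row, for illustration only.
conn.execute(
    "INSERT INTO instructor VALUES (?, ?, ?, ?)",
    ("10101", "Alice", "Comp. Sci.", 65000.00),
)

# The NOT NULL constraint in the schema rejects a record without a name.
try:
    conn.execute(
        "INSERT INTO instructor VALUES (?, ?, ?, ?)",
        ("10102", None, "Physics", 70000.00),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute(
    "SELECT ID, name, dept_name, salary FROM instructor"
).fetchall()
print(rows)
print("row without a name rejected:", rejected)
```

Only the record that satisfies the schema is stored; the second insert fails before it reaches the table.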
Semi-Structured Data
§ Data with labels but no fixed schema
§ JSON, XML
§ CSV files with headers
§ Usage:
§ For data exchange
https://e.nodegoat.net/CMS/upload/guide-import_person_csv_notepad.png
Semi-Structured Data
§ XML - eXtensible Markup Language
§ Each element is delimited by an opening tag and a matching closing tag
§ Example: a note element with the child elements date, to, from and body
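The tag structure above can be explored with Python's standard xml.etree.ElementTree module. This is a sketch only: the element names come from the slide, while the text inside each element is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical note document; element names follow the slide,
# the text content is made up.
note_xml = """
<note>
  <date>2023-08-14</date>
  <to>Max</to>
  <from>Hilda</from>
  <body>See you in class.</body>
</note>
""".strip()

root = ET.fromstring(note_xml)

# Every opening tag (<to>) is paired with a closing tag (</to>);
# the parser exposes each pair as one element.
child_tags = [child.tag for child in root]
body_text = root.find("body").text

print(root.tag)
print(child_tags)
print(body_text)
```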
Semi-Structured Data
§ XML
§ A prolog defines the XML version and the character encoding.
§ Element package has three attributes: destination, origin and version.
§ Figure labels: ASR, XML, SU
http://www.ling.helsinki.fi/~kjokinen/ICSLP06-DoD/Programme/GriolDavid06.pdf
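A short sketch of reading the prolog and the attributes of a package element, again with xml.etree.ElementTree. The attribute names are taken from the slide; the attribute values, the child element, and the mapping of ASR and SU to origin and destination are assumptions, not content from the cited paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical document. The first line is the prolog (XML version and
# character encoding); the attribute values are invented.
package_xml = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<package destination="SU" origin="ASR" version="1.0">'
    b"<sentence>hello</sentence>"
    b"</package>"
)

root = ET.fromstring(package_xml)

# Attributes are exposed as a mapping from attribute name to string value.
attrs = dict(root.attrib)
print(root.tag)
print(attrs)
```

Passing bytes lets the parser honor the encoding declared in the prolog.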
Semi-Structured Data
§ JSON
§ Key/value pairs are separated by commas
§ Square brackets hold arrays
§ Curly brackets hold objects
§ The same data can be written in JSON or in XML
https://json.org/example.html
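The JSON building blocks can be seen directly by parsing a small document with Python's standard json module. The document below is hypothetical and exists only to show one array and one nested object.

```python
import json

# Hypothetical document: key/value pairs separated by commas,
# a square-bracket array, and a nested curly-bracket object.
text = """
{
  "course": "DSA5104",
  "weeks": [1, 2, 3],
  "lecture": {"number": 0, "title": "Overview"}
}
"""

data = json.loads(text)

# Objects become dicts and arrays become lists.
print(type(data).__name__)
print(type(data["weeks"]).__name__)
print(data["lecture"]["title"])
```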
Load JSON File Using Pandas
§ Input file: data.json
§ Environment: Jupyter Notebook
https://www.w3schools.com/python/pandas/pandas_csv.asp
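A minimal sketch of loading data.json with pandas.read_json. The slide's actual file contents are not available, so the records below are hypothetical, and the file is written to a temporary directory first so the example is self-contained.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical records standing in for the slide's data.json.
records = [
    {"name": "Hilda", "birth_year": 1990},
    {"name": "Max", "birth_year": 2003},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.json"
    path.write_text(json.dumps(records), encoding="utf-8")

    # read_json turns the list of objects into rows and the keys into columns.
    df = pd.read_json(path)

print(df)
```

In a Jupyter Notebook, evaluating `df` in a cell renders the same table without the print call.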
Unstructured Data
“I’m Hilda. I was born in 1990.”
“My name is Max. I’m turning 20 this year.”
Unstructured Data - Natural Language Text
Unstructured Data - Multimedia Content
Image2Caption
https://image2caption.pascalperle.de/
What is Big Data?
§ Big data sets are too large or complex to be processed by traditional methods.
§ Consider how much data is generated in a single minute (2022):
https://www.domo.com/data-never-sleeps
Summary
Structured
• Definition: Data with predefined schema
• Database Systems: Relational Database (MySQL, PostgreSQL)
• Query Language: SQL
Semi-Structured
• Definition: Data with flexible schema (e.g., XML)
• Database Systems: MongoDB/HBase, No-SQL Databases
• Query Language: XPath, XQuery
Unstructured
• Definition: Data without predefined schema
• Database Systems: Object Store (S3), Vector Database
• Query Language: ElasticSearch for text, Search Vector Embeddings
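To make the query-language row concrete, here is a small sketch that asks the same question twice: once in SQL over a relational table and once with an XPath expression over an XML document. The data is hypothetical, SQLite stands in for a relational database, and xml.etree.ElementTree supports only a limited subset of XPath.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows in a relational table, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (name TEXT, dept_name TEXT)")
conn.executemany(
    "INSERT INTO instructor VALUES (?, ?)",
    [("Alice", "Physics"), ("Bob", "Music")],  # hypothetical rows
)
sql_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM instructor WHERE dept_name = 'Physics'"
    )
]

# Semi-structured: the same facts as XML, queried with an XPath subset.
doc = ET.fromstring(
    "<instructors>"
    "<instructor dept_name='Physics'><name>Alice</name></instructor>"
    "<instructor dept_name='Music'><name>Bob</name></instructor>"
    "</instructors>"
)
xpath_names = [
    elem.text
    for elem in doc.findall("./instructor[@dept_name='Physics']/name")
]

print(sql_names)
print(xpath_names)
```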
Data Warehouse & ETL (Extract, Transform, Load)
https://rivery.io/wp-content/uploads/2020/05/ETL-Process-for-linkedin3.png
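The three ETL stages can be sketched in a few lines. Everything here is an assumption for illustration: the CSV text plays the role of a source system, the cleaning rules are invented, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system.
raw_csv = "order_id,amount,currency\n1, 10.50 ,sgd\n2,,sgd\n3, 4.00 ,SGD\n"

# Extract: read the raw records.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: drop records without an amount, convert types,
# and normalize the currency code.
clean = [
    (int(r["order_id"]), float(r["amount"]), r["currency"].strip().upper())
    for r in rows
    if r["amount"].strip()
]

# Load: write the cleaned records into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)

loaded = warehouse.execute(
    "SELECT order_id, amount, currency FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```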
Data Lake & Data Lakehouse
§ Late 1980's: data warehouse
§ 2011: data lake
§ 2020: data lakehouse
https://www.databricks.com/wp-content/uploads/2020/01/data-lakehouse-new-1024x538.png
OLTP
OLAP
https://www.healthcatalyst.com/insights/database-vs-data-warehouse-a-comparative-review
OLAP vs. OLTP
Criteria: Online Analytical Processing (OLAP) vs. Online Transaction Processing (OLTP)
Purpose
• OLAP helps you analyze large volumes of data to support decision-making.
• OLTP helps you manage and process real-time transactions.
Data source
• OLAP uses historical and aggregated data from multiple sources.
• OLTP uses real-time and transactional data from a single source.
Volume of data
• OLAP has large storage requirements. Think terabytes (TB) and petabytes (PB).
• OLTP has comparatively smaller storage requirements. Think gigabytes (GB).
Response time
• OLAP has longer response times, typically in seconds or minutes.
• OLTP has shorter response times, typically in milliseconds.
Availability
• OLAP systems can be backed up less frequently since they don’t modify current data.
• OLTP systems require frequent or concurrent backups to help maintain data integrity. This is because they modify data frequently, which is the nature of transactional processing.
Example applications
• OLAP is good for analyzing trends, predicting customer behavior, and identifying profitability.
• OLTP is good for processing payments, customer data management, and order processing.
https://aws.amazon.com/compare/the-difference-between-olap-and-oltp/#:~:text=OLTP,Online%20analytical%20processing%20(OLAP)%20and%20online%20transaction%20processing%20(OLTP,processing%20and%20real%2Dtime%20updates.
https://www.ibm.com/blog/olap-vs-oltp/
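The difference in access pattern can be sketched with a single SQLite table: OLTP-style work is many short transactions that each touch one row, while OLAP-style work is a read-only aggregate over many rows. The table and values are hypothetical, and a real deployment would typically run the analytical query on a separate warehouse rather than on the transactional database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (customer TEXT, amount REAL)")

# OLTP-style: each payment is its own short transaction on a single row.
for customer, amount in [("hilda", 25.0), ("max", 40.0), ("hilda", 15.0)]:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO payments VALUES (?, ?)", (customer, amount)
        )

# OLAP-style: one read-only query that aggregates across all rows.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM payments "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```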
Course Schedule
Topics are subject to change.
Week 1
• Overview
• Introduction to Database Systems
Week 2
• Relational Model
• SQL I
Week 3
• SQL II
• Note: Assignment
Week 4
• SQL III
Week 5
• Database Design Using the Entity-Relationship (ER) Model
Week 6
• Relational Database Design (Schema Refinement)
• Note: Assignment Due
Reading
Week 7
• Test 1
• Note: In person / Closed book
Week 8
• Semi-structured Data Management: XML, XPath & XQuery
• Note: Project
Week 9
• Complex Data Types: Object Orientation, Textual Data, Spatial Data
• Note: Project Proposal Due
Week 10
• Big Data: Key-value pairs, MapReduce, Spark SQL
Week 11
• Document-Database: MongoDB
Week 12
• Test 2
• Note: In person / Closed book
• Note: NUS Well-Being Day - 10 Nov 2023 (Fri)
Week 13
• Vector Database
• Note: Project Due
Reading
• Project Presentation
Workload
01 Homework (20%)
• Done individually
• Week 3 - 6
02 Project (30%)
• 4-5 persons
• Week 7 - 13
• Proposal 3%
• Final presentation 13%
• Final report 14%
03 Tests (40%)
• Week 7 Test 1 (20%)
• Week 12 Test 2 (20%)
• Closed book
• In person (NO online alternative)
04 Course Participation (10%)
• Class participation
• Bonus points for assignment / project
Late Policy
For project / assignment
§ You will lose 20% of the total points every 24 hours late.
§ No submission is allowed beyond 72 hours.
§ Please contact the instructor ASAP if something comes up.
§ My email address: xyyang@nus.edu.sg
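As a worked example of the late policy arithmetic, the sketch below computes the penalty for a given delay. It assumes that every started 24-hour period counts in full (so 1 hour late and 24 hours late both cost 20%); the slide does not state this, and the function name is hypothetical.

```python
import math
from typing import Optional


def late_penalty_percent(hours_late: float) -> Optional[int]:
    """Return the percentage of total points lost, or None if the
    submission is no longer accepted (more than 72 hours late).

    Assumption: each started 24-hour period costs a full 20%.
    """
    if hours_late <= 0:
        return 0
    if hours_late > 72:
        return None
    return 20 * math.ceil(hours_late / 24)


for hours in (0, 1, 24, 25, 72, 73):
    print(hours, late_penalty_percent(hours))
```

Under this reading, a submission 25 hours late loses 40% of the total points, and one 73 hours late is not accepted.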
Course Materials
§ Database System Concepts by Avi Silberschatz, Henry F. Korth and S. Sudarshan
§ Database Management Systems by Raghu Ramakrishnan and Johannes Gehrke
Thank You