Uploaded by Amjad Ali Baig

Tabular Data
Rahul Dave, Univ.AI
Data Science
[Figure: Venn diagram of Data Science at the intersection of Computer Science, Statistics, and Domain Science. Credit: Drew Conway.]
Data Science
[Figure: diagram spanning Machine and Human, with the labels Data Management, Data Mining, Machine Learning, Statistics, Human Cognition, Perception, Visualization, Business Intelligence, Story Telling, Decision Making, and Theory.]
From CS109. Inspired by Daniel Keim, “Visual Analytics: Definition, Process, and Challenges”
The Data Science Process
• Ask an interesting question.
◦ What is the scientific goal?
◦ What would you do if you had all the data?
◦ What do you want to predict or estimate?
• Get the data.
◦ How were the data sampled?
◦ Which data are relevant?
◦ Are there privacy issues?
• Explore the data.
◦ Plot the data.
◦ Are there anomalies?
◦ Are there patterns?
• Model the data.
◦ Build a model.
◦ Fit the model.
◦ Validate the model.
• Communicate and visualize the results.
◦ What did we learn?
◦ Do the results make sense?
◦ Can we tell a story?
From CS109
At Univ.AI
• We start out with short Python and machine learning courses
• ML1 (DS1) and ML2 (DS2) cover all of these aspects of data science
• AI1 through AI4 cover deep learning
• Today we will focus on Tabular Data, and I will try to give you a flavor of how we deal with it
Preparatory courses: PYPREP and MLPREP
Next weekend! Completely free!
[Image: xkcd comic]
Tabular Data
• Data that can be arranged in one or more spreadsheets, referring to each other
• Data may have missing values
• Cells may not be atomic

Items
item index  quality  type   price
0           High     Toy    20
1           High     Book   5
2           Medium   Craft  12
3           Low      Craft  10

Customers
customer index  item index  quantity  total
0               0           5         100
1               0           3         60
2               2           4         48
3               3           10        100
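As a minimal sketch, the two tables above could be built as pandas DataFrames (pandas is introduced later in this deck). The snake_case column names and the consistency check are illustrative assumptions, not part of the original slides.

```python
import pandas as pd

# Items table: one row per item, indexed by the item index.
items = pd.DataFrame(
    {
        "quality": ["High", "High", "Medium", "Low"],
        "type": ["Toy", "Book", "Craft", "Craft"],
        "price": [20, 5, 12, 10],
    },
    index=pd.Index([0, 1, 2, 3], name="item_index"),
)

# Customers table: each row refers to an item through item_index.
customers = pd.DataFrame(
    {
        "item_index": [0, 0, 2, 3],
        "quantity": [5, 3, 4, 10],
        "total": [100, 60, 48, 100],
    },
    index=pd.Index([0, 1, 2, 3], name="customer_index"),
)

# Look up each customer's unit price in Items and check total == quantity * price.
unit_price = customers["item_index"].map(items["price"])
totals_consistent = bool((customers["quantity"] * unit_price == customers["total"]).all())
print(totals_consistent)
```

The lookup through `item_index` is what "spreadsheets referring to each other" means in practice: the Customers table stores only the reference, and the price lives in Items.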
[Figure: the pipeline stages Get Data, Clean Data, Combine Data, Explore Data, Visualize Data, Model Data, Interrogate Model, and Make Decisions, mapped onto MLDS1 Parts 1 to 3 and MLDS2 Parts 1 to 2.]
Scales of Measurement
• Quantitative (Interval and Ratio)
• Ordinal
• Nominal
S. S. Stevens, Science, New Series, Vol. 103, No. 2684 (Jun. 7, 1946), pp. 677-680
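Tabular Data
Column types in the Items table: quality is Ordinal, type is Nominal/Categorical, and price is Quantitative. The item index column in the Customers table is a “Foreign Key” into the Items table.

Items
item index  quality  type   price
0           High     Toy    20
1           High     Book   5
2           Medium   Craft  12
3           Low      Craft  10

Customers
customer index  item index  quantity  total
0               0           5         100
1               0           3         60
2               2           4         48
3               3           10        100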
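The scales of measurement can be made explicit in pandas. The sketch below is one possible encoding, assuming an ordered `Categorical` for the ordinal column and a plain `category` dtype for the nominal one; the variable names are illustrative.

```python
import pandas as pd

items = pd.DataFrame(
    {
        "quality": ["High", "High", "Medium", "Low"],
        "type": ["Toy", "Book", "Craft", "Craft"],
        "price": [20, 5, 12, 10],
    }
)

# Ordinal: categories have an order, so comparisons and sorting are meaningful.
quality_levels = ["Low", "Medium", "High"]
items["quality"] = pd.Categorical(items["quality"], categories=quality_levels, ordered=True)

# Nominal: categories are labels with no order.
items["type"] = items["type"].astype("category")

# Ordinal comparison: which items are better than "Low" quality?
better_than_low = items[items["quality"] > "Low"]

# Quantitative: arithmetic such as a mean is meaningful.
mean_price = items["price"].mean()

# Foreign key check: every item index used by Customers must exist in Items.
customer_item_index = pd.Series([0, 0, 2, 3], name="item_index")
keys_valid = bool(customer_item_index.isin(items.index).all())
```

A mean of `quality` or `type` would not be meaningful, which is why it helps to record the scale in the column's dtype rather than leaving everything as strings.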
Rubric for Data Pre-Model
1. Build a Table from the data (ideally, put all data in this object)
2. Clean the Table. It should have the following properties:
◦ Each row describes a single object
◦ Each column describes a property of that object
◦ Columns are numeric whenever appropriate
◦ Columns contain atomic properties that cannot be further decomposed
3. Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
4. Explore group properties. Use groupby and small multiples to compare subsets of the data.
5. Explore global and group properties for combinations of data.
Modified from cs109 rubric by Chris Beaumont.
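The cleaning and exploring steps of the rubric might look like the following in pandas. The messy `raw` table is hypothetical: it packs two properties into one cell and stores prices as text, including one missing entry, purely to illustrate the rubric.

```python
import pandas as pd

# Hypothetical raw data: "price" is text, and "quality_type" packs two properties in one cell.
raw = pd.DataFrame(
    {
        "quality_type": ["High/Toy", "High/Book", "Medium/Craft", "Low/Craft"],
        "price": ["20", "5", "12", "n/a"],
    }
)

clean = raw.copy()
# Make cells atomic: split the packed column into two columns.
clean[["quality", "type"]] = clean["quality_type"].str.split("/", expand=True)
clean = clean.drop(columns="quality_type")
# Make columns numeric where appropriate; unparseable entries become missing values (NaN).
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")

# Global properties: summary statistics over the whole table.
global_summary = clean["price"].describe()
# Group properties: compare subsets of the data with groupby (NaN is skipped by mean).
mean_price_by_type = clean.groupby("type")["price"].mean()
```

Histograms and scatter plots would follow the same pattern, for example `clean["price"].hist()` when a plotting backend is available.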
Pandas
• Python in-memory table structure called a DataFrame
• Very fast library, used extensively in data science and machine learning
NOTEBOOK
https://colab.research.google.com/drive/1Mfl7SQ8YVkYbcU2PCue1FnPjDNTUAVLo?usp=sharing
What kind of data storage do you need?
• memory
• disk: what if we do not fit?
• cluster: what if we still do not fit?
• cluster: what if we need/can use parts?
• What if we MUST bring compute to disk?
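One common answer when a table does not fit in memory is to stream it from disk in pieces. The sketch below assumes pandas and uses an in-memory string as a stand-in for a large CSV file; the data and the chunk size are illustrative only.

```python
import io

import pandas as pd

# Stand-in for a file too large for memory (a real case would pass a file path instead).
csv_text = (
    "customer_index,item_index,quantity,total\n"
    "0,0,5,100\n"
    "1,0,3,60\n"
    "2,2,4,48\n"
    "3,3,10,100\n"
)

grand_total = 0
# chunksize makes read_csv yield DataFrames of at most 2 rows each
# instead of loading the whole table at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    grand_total += int(chunk["total"].sum())

print(grand_total)
```

This only works for computations that can be combined across chunks, such as sums and counts; beyond that, a database or a cluster becomes the more natural tool.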
Relational Database
Don't say: seek 20 bytes onto disk and pick up from there. The next row is 50 bytes hence.
Say: select data from a set. I don't care where it is, just get the row to me.
Relational Database (contd)
• A collection of tables related to each other through common data values.
• Rows represent attributes of something
• Everything in a column is a value of one attribute
• A cell is expected to be atomic
• Tables are related to each other if they have columns called keys which represent the same values
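A small sketch of these ideas using Python's built-in `sqlite3` module and an in-memory database. The table and column names follow the Items and Customers example from earlier slides; the schema itself is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "item_index INTEGER PRIMARY KEY, quality TEXT, type TEXT, price INTEGER)"
)
conn.execute(
    "CREATE TABLE customers ("
    "customer_index INTEGER PRIMARY KEY, "
    "item_index INTEGER REFERENCES items(item_index), "  # key relating the two tables
    "quantity INTEGER, total INTEGER)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [(0, "High", "Toy", 20), (1, "High", "Book", 5), (2, "Medium", "Craft", 12), (3, "Low", "Craft", 10)],
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(0, 0, 5, 100), (1, 0, 3, 60), (2, 2, 4, 48), (3, 3, 10, 100)],
)

# Declarative query: say which rows we want, not where they are stored.
craft_items = conn.execute(
    "SELECT item_index, price FROM items WHERE type = 'Craft' ORDER BY item_index"
).fetchall()

# Related tables: follow the key from customers to items.
purchases = conn.execute(
    "SELECT c.customer_index, i.type, c.total "
    "FROM customers AS c JOIN items AS i ON c.item_index = i.item_index "
    "ORDER BY c.customer_index"
).fetchall()
conn.close()
```

Neither query mentions byte offsets or file layout; the database decides how to locate the rows.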
SQL
• Lingua franca of relational databases, and a lot of other systems.
Grammar of Data
• Core data manipulation commands
• Universal across systems
• By identifying them, we know what functionality to search for when we encounter a new system
• First formalized by Hadley Wickham in dplyr: https://cran.r-project.org/web/packages/dplyr/vignettes/
Grammar of Data: Single Table Verbs

VERB                       dplyr                         pandas                          SQL
QUERY/SELECTION            filter() (and slice())        query() (and loc[], iloc[])     SELECT WHERE
SORT                       arrange()                     sort_values()                   ORDER BY
SELECT-COLUMNS/PROJECTION  select() (and rename())       [](__getitem__) (and rename())  SELECT COLUMN
SELECT-DISTINCT            distinct()                    unique(), drop_duplicates()     SELECT DISTINCT COLUMN
ASSIGN                     mutate() (and transmute())    assign()                        ALTER/UPDATE
AGGREGATE                  summarise()                   describe(), mean(), max()       None, AVG(), MAX()
SAMPLE                     sample_n() and sample_frac()  sample()                        implementation dependent, use RAND()
GROUP-AGG                  group_by/summarize            groupby/agg, count, mean        GROUP BY
DELETE                     ?                             drop/masking                    DELETE/WHERE
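The pandas column of the table can be exercised on the small Items table. This is a sketch rather than the linked notebook's code; the derived column `price_in_cents` is a hypothetical example for the ASSIGN verb.

```python
import pandas as pd

items = pd.DataFrame(
    {
        "quality": ["High", "High", "Medium", "Low"],
        "type": ["Toy", "Book", "Craft", "Craft"],
        "price": [20, 5, 12, 10],
    }
)

expensive = items.query("price > 8")                              # QUERY/SELECTION
by_price = items.sort_values("price")                             # SORT
type_and_price = items[["type", "price"]]                         # SELECT-COLUMNS/PROJECTION
distinct_types = items["type"].unique()                           # SELECT-DISTINCT
with_cents = items.assign(price_in_cents=items["price"] * 100)    # ASSIGN
max_price = items["price"].max()                                  # AGGREGATE
one_row = items.sample(n=1, random_state=0)                       # SAMPLE
mean_by_type = items.groupby("type")["price"].mean()              # GROUP-AGG
without_low = items[items["quality"] != "Low"]                    # DELETE (by masking)
```

Each of these returns a new object and leaves `items` unchanged, which makes the verbs easy to chain.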
Grammar of Data: 2 Table Verbs
• MUTATING JOINS: which add new variables to one table from matching rows in another: inner join, left (outer) join, right (outer) join, and full (outer) join
• FILTERING JOINS: which filter observations from one table based on whether or not they match an observation in another table (not covered today)
• SET OPERATIONS: which combine observations in the data sets as if they were set elements (not covered today)
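Mutating joins map onto `DataFrame.merge` in pandas. The sketch below uses the Items and Customers tables from earlier slides with assumed snake_case column names, and contrasts an inner join with a left join.

```python
import pandas as pd

items = pd.DataFrame(
    {
        "item_index": [0, 1, 2, 3],
        "type": ["Toy", "Book", "Craft", "Craft"],
        "price": [20, 5, 12, 10],
    }
)
customers = pd.DataFrame(
    {
        "customer_index": [0, 1, 2, 3],
        "item_index": [0, 0, 2, 3],
        "quantity": [5, 3, 4, 10],
    }
)

# Inner join: keep only rows whose key appears in both tables.
inner = customers.merge(items, on="item_index", how="inner")

# Left join from items: keep every item, even one nobody bought (item 1, the Book),
# whose customer columns are filled with missing values.
left = items.merge(customers, on="item_index", how="left")

# The join added price to each customer row, so the total can be recomputed.
inner["total"] = inner["quantity"] * inner["price"]
```

`how="right"` and `how="outer"` give the right and full outer joins in the same way.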
Modeling Tabular Data
• Two main methods: Trees and Neural Networks
• Trees: start with Decision Trees, create ensembles with Random Forests
• Gradient Boosting sequentially fits weak tree models, each one correcting the errors of the ensemble so far, to come up with a very strong model
• Tree ensembles are still the best, but Neural Networks (Fully Connected) are becoming competitive
• An amazing innovation there is the use of embeddings: learn continuous vectors for categorical and ordinal features jointly with the classification or regression task. This is how modern language NN models work as well
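To make "sequentially fits weak tree models" concrete, here is a toy gradient boosting sketch for squared error, written with NumPy only. It uses depth-one trees (stumps) on a single synthetic feature; the data, function names, and hyperparameters are all illustrative assumptions, and real work would use a library implementation.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the threshold split on x that minimizes squared error on residual."""
    best = None
    # Splitting at the largest value would leave the right side empty, so skip it.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def fit_boosting(x, y, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Synthetic data: a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

base, stumps, fitted = fit_boosting(x, y)
mse_before = float(((y - y.mean()) ** 2).mean())
mse_after = float(((y - fitted) ** 2).mean())
```

Each stump on its own is a weak model, but because every round targets what the ensemble still gets wrong, the training error shrinks round by round.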
After Modeling
• Interrogate the model, fit multiple models, look at predictions and interpret them
• This requires tables for comparing the output of models as well
• Then do decision theory: can the model output be used to reduce customer churn, for example? Optimize the cost of retaining a customer instead of just predicting whether a customer will churn
• Grapple with the uncertainty of your predictions
• All of these aspects (modeling and after) are covered in DS-1 and DS-2
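The churn example can be sketched as an expected-cost comparison. Everything here is a simplifying assumption for illustration: the cost model, the function names, and the numbers (offer cost 10, customer value 100, offer success rate 0.5) are hypothetical.

```python
def expected_cost_of_offer(p_churn, offer_cost, customer_value, offer_success_rate):
    """Pay for the offer, and still lose the churners the offer fails to retain."""
    return offer_cost + p_churn * (1 - offer_success_rate) * customer_value


def expected_cost_of_no_offer(p_churn, customer_value):
    """Do nothing and lose the customer's value with probability p_churn."""
    return p_churn * customer_value


def should_offer(p_churn, offer_cost, customer_value, offer_success_rate):
    """Make the retention offer only when it lowers the expected cost."""
    with_offer = expected_cost_of_offer(p_churn, offer_cost, customer_value, offer_success_rate)
    without_offer = expected_cost_of_no_offer(p_churn, customer_value)
    return with_offer < without_offer


# Model-predicted churn probabilities for three hypothetical customers.
churn_probabilities = [0.1, 0.5, 0.9]
decisions = [
    should_offer(p, offer_cost=10, customer_value=100, offer_success_rate=0.5)
    for p in churn_probabilities
]
print(decisions)
```

The decision depends on costs as well as on the predicted probability, so a customer with a low churn probability may not be worth an offer even though a classifier would happily score them.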