Tabular Data
Rahul Dave, Univ.AI

[Figure: Venn diagram showing Data Science at the intersection of Computer Science, Statistics, and Domain Science. After Drew Conway.]

[Figure: Data Science on a spectrum from Machine to Human, with labels Data Management, Data Mining, Machine Learning, Business Intelligence, Statistics, Visualization, Perception, Human Cognition, Story Telling, and Decision Making Theory. From CS109; inspired by Daniel Keim, “Visual Analytics: Definition, Process, and Challenges”.]

The Data Science Process
• Ask an interesting question.
   ◦ What is the scientific goal?
   ◦ What would you do if you had all the data?
   ◦ What do you want to predict or estimate?
• Get the data.
   ◦ How were the data sampled?
   ◦ Which data are relevant?
   ◦ Are there privacy issues?
• Explore the data.
   ◦ Plot the data.
   ◦ Are there anomalies?
   ◦ Are there patterns?
• Model the data.
   ◦ Build a model.
   ◦ Fit the model.
   ◦ Validate the model.
• Communicate and visualize the results.
   ◦ What did we learn?
   ◦ Do the results make sense?
   ◦ Can we tell a story?
From CS109

At Univ.AI
• We start out with short Python and machine learning courses
• ML1 (DS1) and ML2 (DS2) cover all of these aspects of data science
• AI1 through AI4 cover deep learning
• Today we will focus on Tabular Data, and I will try to give you a flavor of how we deal with it

PYPREP
Next weekend! Completely free!
[Figure: xkcd comic. Labels: Preparatory, PYPREP, MLPREP.]

Tabular Data
• Data that can be arranged in one or more spreadsheets, referring to each other
• Data may have missing values
• Cells may not be atomic

Items
item index   quality   type    price
0            High      Toy     20
1            High      Book    5
2            Medium    Craft   12
3            Low       Craft   10

Customers
customer index   item index   quantity   total
0                0            5          100
1                0            3          60
2                2            4          48
3                3            10         100

[Figure: the Data Science Process as a pipeline: Get Data, Clean Data, Combine Data, Explore Data, Visualize Data, Model Data, Interrogate Model, Make Decisions. Covered across MLDS1 Parts 1 to 3 and MLDS2 Parts 1 and 2. From CS109.]

Scales of Measurement
• Quantitative (Interval and Ratio)
• Ordinal
• Nominal
S. S. Stevens, Science, New Series, Vol. 103, No. 2684 (Jun. 7, 1946), pp. 677-680

Tabular Data: column types
• quality: ordinal
• type: nominal/categorical
• price: quantitative
• item index in Customers: a “foreign key” referring to Items

Rubric for Data Pre-Model
1. Build a Table from the data (ideally, put all data in this object)
2. Clean the Table. It should have the following properties:
   ◦ Each row describes a single object
   ◦ Each column describes a property of that object
   ◦ Columns are numeric whenever appropriate
   ◦ Columns contain atomic properties that cannot be further decomposed
3. Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
4. Explore group properties. Use groupby and small multiples to compare subsets of the data.
5. Explore global and group properties for combinations of data.
Modified from cs109 rubric by Chris Beaumont.

Pandas
• A Python library built around an in-memory table structure called a DataFrame
• Very fast, and used extensively in data science and machine learning

[Figure: Contributors and Candidates tables]

NOTEBOOK
https://colab.research.google.com/drive/1Mfl7SQ8YVkYbcU2PCue1FnPjDNTUAVLo?usp=sharing

What kind of data storage do you need?
• memory
• disk: what if we do not fit?
• cluster: what if we still do not fit?
• cluster: what if we need/can use parts?
• What if we MUST bring compute to disk?

Relational Database
Don't say: seek 20 bytes onto disk and pick up from there. The next row is 50 bytes hence.
Say: select data from a set. I don't care where it is, just get the row to me.

Relational Database (contd)
• A collection of tables related to each other through common data values
• Each row represents the attributes of one thing
• Every value in a column is a value of one attribute
• A cell is expected to be atomic
• Tables are related to each other if they have columns, called keys, which represent the same values

SQL
• Lingua franca of relational databases, and of a lot of other systems

NOTEBOOK (same Colab notebook as linked above)

Grammar of Data
• Core data manipulation commands
• Universal across systems
• By identifying them, we know what functionality to search for when we encounter a new system
• First formalized by Hadley Wickham in dplyr: https://cran.r-project.org/web/packages/dplyr/vignettes/

Grammar of Data: Single Table Verbs
VERB                       dplyr                         pandas                           SQL
QUERY/SELECTION            filter() (and slice())        query() (and loc[], iloc[])      SELECT WHERE
SORT                       arrange()                     sort_values()                    ORDER BY
SELECT-COLUMNS/PROJECTION  select() (and rename())       [] (__getitem__) (and rename())  SELECT COLUMN
SELECT-DISTINCT            distinct()                    unique(), drop_duplicates()      SELECT DISTINCT COLUMN
ASSIGN                     mutate() (and transmute())    assign()                         ALTER/UPDATE
AGGREGATE                  summarise()                   describe(), mean(), max()        None, AVG(), MAX()
SAMPLE                     sample_n() and sample_frac()  sample()                         implementation dependent; use RAND()
GROUP-AGG                  group_by()/summarise()        groupby/agg, count, mean         GROUP BY
DELETE                     ?                             drop/masking                     DELETE/WHERE

NOTEBOOK (same Colab notebook as linked above)

Grammar of Data: 2 Table Verbs
• MUTATING JOINS: add new variables to one table from matching rows in another: inner join, left (outer) join, right (outer) join, and full (outer) join
• FILTERING JOINS: filter observations from one table based on whether or not they match an observation in another table (not covered today)
• SET OPERATIONS: combine observations in the data sets as if they were set elements (not covered today)

NOTEBOOK (same Colab notebook as linked above)

Modeling Tabular Data
• Two main methods: Trees and Neural Networks
• Trees: start with Decision Trees, then create ensembles with Random Forests
• Gradient Boosting sequentially fits weak tree models, each improving on the last, to build a very strong model
• Tree-based methods are still the best, but fully connected Neural Networks are becoming competitive
• An amazing innovation there is the use of embeddings: learn continuous vectors for categorical and ordinal features jointly with the classification or regression task. This is how modern language NN models work as well

After Modeling
• Interrogate the model, fit multiple models, look at the predictions, and interpret them
• This requires tables as well, for comparing the output of models
• Then do decision theory: can the model output be used to reduce customer churn, for example? Optimize the cost of retaining a customer instead of only predicting whether a customer will churn
• Grapple with the uncertainty of your predictions
• All of these aspects (modeling and after) are covered in DS-1 and DS-2
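
Example: the Grammar of Data in pandas
As a minimal sketch of the ideas in this talk, the Items and Customers tables from the Tabular Data slide can be built as pandas DataFrames and run through a few single-table verbs and one mutating join. The variable names, the underscore-style column names, the price threshold, and the `recomputed_total` column are illustrative assumptions; they are not taken from the course notebook.

```python
import pandas as pd

# Items table from the Tabular Data slide; the item index is the row index.
items = pd.DataFrame(
    {
        "quality": ["High", "High", "Medium", "Low"],
        "type": ["Toy", "Book", "Craft", "Craft"],
        "price": [20, 5, 12, 10],
    },
    index=pd.Index([0, 1, 2, 3], name="item_index"),
)

# Customers table; item_index is the foreign key referring to items.
customers = pd.DataFrame(
    {
        "item_index": [0, 0, 2, 3],
        "quantity": [5, 3, 4, 10],
        "total": [100, 60, 48, 100],
    },
    index=pd.Index([0, 1, 2, 3], name="customer_index"),
)

# Ordinal scale: store quality as an ordered categorical (Low < Medium < High).
items["quality"] = pd.Categorical(
    items["quality"], categories=["Low", "Medium", "High"], ordered=True
)

# QUERY/SELECTION: keep rows whose price exceeds 8 (threshold is arbitrary).
pricey = items.query("price > 8")

# SORT: cheapest item first.
by_price = items.sort_values("price")

# GROUP-AGG: mean price within each item type.
mean_price_by_type = items.groupby("type")["price"].mean()

# Mutating join (left join): add the item columns to every purchase row
# by matching the foreign key against the index of items.
purchases = customers.merge(
    items, left_on="item_index", right_index=True, how="left"
)

# ASSIGN: derive a new column, here as a consistency check on total.
purchases = purchases.assign(
    recomputed_total=purchases["quantity"] * purchases["price"]
)

print(pricey)
print(by_price)
print(mean_price_by_type)
print(purchases)
```

Keeping the item index as the DataFrame index makes the join read like the foreign-key relationship on the slide: the key column on the left is looked up in the index on the right. Merging on two ordinary columns with `on="item_index"` would work equally well if both tables carried the key as a column.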