Midterm
BE202 / 2023 / Introduction to Data Science
• 4/18 (Tue) 13:00-14:30, E1 101 convention hall
• Scoring: 100 points max, by …
• 10 * Multiple-choice questions (2p each)
• 10 * short-answer questions (2p each)
• 6 * complex problems in each topic (10p each)

Week 7
Lecture 12: Data Manipulation
Data cleaning and preparation
2023-4-10
Lecturer: Sunjun Kim
Reference: Python for Data Analysis (2nd), Chapter 7

Recap: Dataframe
• Pandas Dataframe
• Spreadsheet (a.k.a. Excel)

Real-world Data
• Workaround: filtering-out VS. fill-in

Missing data treatment
Filtering-out, or Fill-in

Missing Data in Pandas
• NA: Not available
• NaN: Not a Number
• None: null in Python

Filtering-out NaN: Case: Series

Filtering-out NaN: Case: DataFrame
• thresh: minimum number of non-NA values a row needs in order to be kept

Filling-in NaN:
• Fill with a given (default) value
• dict with {col: value, …}
* in-place filling

Filling-in NaN:
• Fill with interpolation methods
• Useful in time-series data
• ffill: fill with the previous value
• bfill: fill with the next value
• nearest: fill with the nearest value
* Reindexing: filling

Filling-in NaN
• Application: fill by mean
Note: NA values are automatically ignored in mean()

Data Transformation
Removing duplicates, replacing values, function mapping, renaming, binning, etc.
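
Before moving on to the transformations, the missing-data treatments from the preceding slides (filtering out with dropna, filling in with fillna, forward fill, fill by mean) can be sketched as follows. This is a minimal illustration, not the lecture's own code: the toy DataFrame and its column names a, b, c are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the lecture); NaN marks missing entries.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan, np.nan],
        "b": [np.nan, np.nan, 6.0, 8.0, np.nan],
        "c": [10.0, np.nan, 30.0, 40.0, 50.0],
    }
)

# Filtering out: drop every row that contains at least one NaN (default how="any").
complete_rows = df.dropna()

# Drop only the rows in which all values are NaN.
not_all_nan = df.dropna(how="all")

# thresh: keep rows that have at least 2 non-NA values.
at_least_two = df.dropna(thresh=2)

# Filling in: a single default value for all columns ...
filled_zero = df.fillna(0)

# ... or a dict {column: value} for per-column defaults.
filled_by_col = df.fillna({"a": 0.0, "b": -1.0, "c": 99.0})

# Forward fill: carry the previous valid value down each column.
# Leading NaNs (nothing before them) stay NaN.
forward_filled = df.ffill()

# Fill by mean: mean() skips NaN, so each column mean uses observed values only.
filled_mean = df.fillna(df.mean())

print(at_least_two)
print(filled_mean)
```

All of these calls return new objects and leave df unchanged, which makes it easy to compare the filtering-out and fill-in strategies side by side on the same data.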

Duplicate removal
• duplicated(), drop_duplicates()

Data replacement
• -999 → NaN
• -1000 → 0

Binning: cut()
• (18,25] (25,35] (35,60] (60,100]

Binning: cut() / left-side cut
• Default (right-closed): 18 < … <= 25
• Left-side cut (left-closed, right=False): 18 <= … < 25

Binning: cut() / labeling
• Application: filtering by binned group

Binning: cut() / equal-size bins
• Four bins: (0.051, 0.28) (0.28, 0.5) (0.5, 0.73) (0.73, 0.95)
• Precision to 2 decimal places

Binning: qcut() / equal data size bins
• (-2.95,-0.67] #: 250
• (-0.67,-0.24] #: 250
• (-0.24, 0.61] #: 250
• (0.61, 3.93] #: 250

Outlier filtering
• Outlier := radically different datapoints from others
• Example) np.random.randn: random samples from standard normal distribution (mean=0.0, sd=1.0)
• np.random.randn(1000, 4): generating 1000*4 array
* Get >3 sigma outliers from column 2
* Get >3 sigma outliers, if any

Outlier removal
* Reassign -3 or 3 to the outliers

Lecture 13: Data Manipulation (cont'd)
Data cleaning and preparation
2021-4-10
Lecturer: Sunjun Kim
Reference: Python for Data Analysis (2nd), Chapter 7
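
Before continuing, the Lecture 12 transformations (duplicate removal, value replacement, binning with cut() and qcut(), and 3-sigma outlier handling) can be sketched in one place. This is a hedged sketch rather than the lecture's code: the toy data, the label names, and the random seed are assumptions for illustration, and clip() is used here as one possible way to reassign -3 or 3 to the outliers.

```python
import numpy as np
import pandas as pd

# --- Duplicate removal (hypothetical toy frame) ---
df = pd.DataFrame({"k1": ["one", "two", "one", "two"], "k2": [1, 2, 1, 3]})
dup_mask = df.duplicated()       # True where a row repeats an earlier row
deduped = df.drop_duplicates()   # keeps the first occurrence by default

# --- Data replacement: sentinel values -999 -> NaN, -1000 -> 0 ---
s = pd.Series([1.0, -999.0, 2.0, -1000.0])
replaced = s.replace({-999: np.nan, -1000: 0})

# --- Binning with cut() ---
ages = pd.Series([20, 25, 27, 35, 61])
bins = [18, 25, 35, 60, 100]
right_closed = pd.cut(ages, bins)              # (18, 25], (25, 35], ...
left_closed = pd.cut(ages, bins, right=False)  # [18, 25), [25, 35), ...

# Labeling; these label names are hypothetical.
labels = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
labeled = pd.cut(ages, bins, labels=labels)
youth = ages[labeled == "Youth"]               # filtering by binned group

# Equal-width bins: pass the number of bins instead of explicit edges.
equal_width = pd.cut(ages, 4, precision=2)

# --- qcut(): bins holding (roughly) equal numbers of data points ---
np.random.seed(0)                              # seed chosen arbitrarily
values = np.random.randn(1000)
quartiles = pd.qcut(values, 4)
counts = pd.Series(quartiles).value_counts()

# --- Outlier filtering and removal on a 1000*4 standard-normal array ---
data = pd.DataFrame(np.random.randn(1000, 4))
col2_outliers = data[2][data[2].abs() > 3]               # >3 sigma in column 2
rows_with_outliers = data[(data.abs() > 3).any(axis=1)]  # >3 sigma in any column
capped = data.clip(lower=-3, upper=3)                    # reassign -3 or 3

print(counts)
print(capped.describe())
```

Note how the age 25 lands in the first bin with the default right-closed intervals but in the second bin when right=False is passed, which is the difference the left-side cut slide points at.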

Sorting and Summarization

Sort by index

Sort by value

Summarizations

Pivot Table
• Pivot table is…
• A powerful data analysis tool
• Easily summarize and explore large datasets
• Customizable, interactive format
• Widely used in spreadsheet applications
cleaned_euckr.csv

Making pivot table
1. Data source preparation (cleanup, filtering, …)
2. Field identification: breaking down the data source into individual fields (columns)
3. Row/Column configuration: define rows and columns for the summarization
4. Define values: count, sum, average, max, min, …

Pivot table in MS Excel

Pivot table in Pandas
• pandas.pivot_table
https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

String manipulation

Python built-in string methods

Recap: regular expressions

Vectorized String functions in Pandas
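
The Lecture 13 topics above (sorting, summarization, pivot tables, and vectorized string functions) can be sketched on a small example. This is an illustrative sketch under stated assumptions: the sales table, its column names, and the e-mail addresses are made up for the example and are not taken from cleaned_euckr.csv or from the lecture.

```python
import pandas as pd

# Hypothetical records; all names and values are illustrative only.
df = pd.DataFrame(
    {
        "region": ["East", "West", "East", "West", "East"],
        "product": ["A", "A", "B", "B", "A"],
        "sales": [100, 150, 200, 50, 300],
        "contact": [
            "kim@example.com",
            "LEE@Example.org",
            None,                      # missing contact
            "park@example.net",
            "choi@example.com",
        ],
    },
    index=[3, 1, 4, 0, 2],
)

# Sort by index, then by value (largest sales first).
by_index = df.sort_index()
by_value = df.sort_values(by="sales", ascending=False)

# Summarizations: descriptive statistics and a single aggregate.
summary = df["sales"].describe()
total = df["sales"].sum()

# Pivot table: rows = region, columns = product, values = summed sales.
pivot = pd.pivot_table(
    df, index="region", columns="product", values="sales", aggfunc="sum"
)

# Vectorized string functions via the .str accessor; missing values are skipped.
lower = df["contact"].str.lower()
is_dot_com = df["contact"].str.contains(r"@example\.com$", case=False, na=False)

# Regular expression with one capture group: the domain after "@".
domain = df["contact"].str.extract(r"@([\w.]+)$", expand=False).str.lower()

print(pivot)
print(domain)
```

The pivot_table call mirrors the four steps listed under "Making pivot table": the prepared DataFrame is the data source, region and product are the identified fields used as rows and columns, and aggfunc defines how the values are summarized.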