Week 7: Data Manipulation

Midterm
BE202 / 2023 / Introduction to Data Science
• 4/18 (Tue) 13:00-14:30, E1 101 convention hall
• Scoring: 100 points max, by …
• 10 multiple-choice questions (2p each)
• 10 short-answer questions (2p each)
• 6 complex problems in each topic (10p each)

Week 7
Lecture 12: Data Manipulation
Data cleaning and preparation
2023-4-10
Lecturer: Sunjun Kim
Reference: Python for Data Analysis (2nd), Chapter 7

Recap: DataFrame
• Pandas DataFrame
• Spreadsheet (a.k.a. Excel)

Real-world Data
• Workaround: filtering out vs. filling in

Missing data treatment
Filtering out, or filling in

Missing Data in Pandas
• NA: Not available
• NaN: Not a Number
• None: null in Python

Filtering out NaN: Case: Series

Filtering out NaN: Case: DataFrame
• thresh: threshold for the number of non-NA values in a row

Filling in NaN
• Fill with a given (default) value
• dict with { col: value, … }
* in-place filling

Reindexing: filling / Filling in NaN
• Fill with interpolation methods
• Useful in time-series data
• ffill: fill with the previous value
• bfill: fill with the next value
• nearest: fill with the nearest value

Filling in NaN
• Application: fill by mean
Note: NA values are automatically ignored in mean()

Data Transformation
Removing duplicates, replacing values, function mapping, renaming, binning, etc.
Duplicate removal
• duplicated(), drop_duplicates()

Data replacement
• -999 → NaN
• -999 → NaN, -1000 → 0

Binning: cut()
• (18,25] (25,35] (35,60] (60,100]

Binning: cut() / left-side cut
• Right-closed: 18 < … <= 25
• Left-closed: 18 <= … < 25

Binning: cut() / labeling
• Application: filtering by binned group

Binning: cut() / equal-size bins
• Four bins, precision to 2 decimal places
• (0.051, 0.28) (0.28, 0.5) (0.5, 0.73) (0.73, 0.95)

Binning: qcut() / equal data size bins
• (-2.95,-0.67] #: 250
• (-0.67,-0.24] #: 250
• (-0.24, 0.61] #: 250
• (0.61, 3.93] #: 250

Outlier filtering
• Outlier := radically different datapoints from others
• Example) np.random.randn: random samples from standard normal distribution (mean=0.0, sd=1.0)
• np.random.randn(1000, 4): generating 1000*4 array
* Get >3 sigma outliers from column 2
* Get >3 sigma outliers, if any

Outlier removal
* Reassign -3 or 3 to the outliers

Lecture 13: Data Manipulation (cont’)
Data cleaning and preparation
2021-4-10
Lecturer: Sunjun Kim
Reference: Python for Data Analysis (2nd), Chapter 7
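The data transformation slides above (duplicate removal, replacement, binning, outlier handling) might look like this in code. It is a sketch under assumptions: the sample frames, the age list, the group labels, and the random seed are all invented for illustration, and `clip()` is used here as one possible way to reassign -3 or 3 to outliers.

```python
import numpy as np
import pandas as pd

# --- Duplicate removal (hypothetical data) ---
data = pd.DataFrame({"k1": ["one", "two", "one", "two", "one", "two", "two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})
is_dup = data.duplicated()               # True where a row repeats an earlier row
deduped = data.drop_duplicates()         # keeps the first occurrence
deduped_k1 = data.drop_duplicates(subset=["k1"], keep="last")

# --- Data replacement: sentinel values -> NaN / 0 ---
readings = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])
r_single = readings.replace(-999, np.nan)
r_list = readings.replace([-999, -1000], np.nan)
r_dict = readings.replace({-999: np.nan, -1000: 0})

# --- Binning with cut() ---
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
age_cats = pd.cut(ages, bins)                    # (18, 25], (25, 35], ...
age_cats_left = pd.cut(ages, bins, right=False)  # [18, 25), [25, 35), ...
labels = ["Youth", "YoungAdult", "MiddleAged", "Senior"]  # assumed names
people = pd.DataFrame({"age": ages})
people["group"] = pd.cut(people["age"], bins, labels=labels)
youth = people[people["group"] == "Youth"]       # filter by binned group

# --- Equal-width bins vs. equal-count bins ---
np.random.seed(0)                                # arbitrary seed
equal_width = pd.cut(np.random.rand(20), 4, precision=2)
quartiles = pd.qcut(np.random.randn(1000), 4)    # about 250 samples per bin

# --- Outlier filtering and removal ---
frame = pd.DataFrame(np.random.randn(1000, 4))
col = frame[2]
col_outliers = col[col.abs() > 3]                        # >3 sigma in column 2
rows_with_outliers = frame[(frame.abs() > 3).any(axis=1)]  # >3 sigma anywhere
capped = frame.clip(lower=-3, upper=3)                   # reassign to -3 or 3
```

With `right=False` the value 25 moves from the first bin to the second, which is the whole difference between the two interval conventions on the left-side cut slide.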
Sorting and Summarization

Sort by index

Sort by value

Summarizations

Pivot Table
• Pivot table is…
• A powerful data analysis tool
• Easily summarize and explore large datasets
• Customizable, interactive format
• Widely used in spreadsheet applications
cleaned_euckr.csv

Making pivot table
1. Data source preparation (cleanup, filtering, …)
2. Field identification: breaking down the data source into individual fields (columns)
3. Row/Column configuration: define rows and columns for the summarization
4. Define values: count, sum, average, max, min, …

Pivot table in MS Excel

Pivot table in Pandas
• pandas.pivot_table
https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

String manipulation

Python built-in string methods

Recap: regular expressions

Vectorized String functions in Pandas
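The Lecture 13 topics above (sorting, summarization, pivot tables, string handling) can be sketched as follows. The sales records, e-mail addresses, and regular expression are hypothetical stand-ins; the lecture's own dataset (cleaned_euckr.csv) is not reproduced here.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sales records, made up for illustration.
sales = pd.DataFrame(
    {"region": ["East", "West", "East", "West", "East", "West"],
     "product": ["A", "A", "B", "B", "A", "B"],
     "units": [10, 7, 3, 12, 5, 8]},
    index=[3, 1, 5, 0, 4, 2],
)

# Sorting
by_index = sales.sort_index()
by_value = sales.sort_values(by=["region", "units"], ascending=[True, False])

# Summarization
summary = sales["units"].describe()   # count, mean, std, min, quartiles, max
total_units = sales["units"].sum()

# Pivot table: rows = region, columns = product, values = summed units
pivot = pd.pivot_table(sales, index="region", columns="product",
                       values="units", aggfunc="sum")

# Built-in string methods
raw = "a,b,  guido"
pieces = [x.strip() for x in raw.split(",")]
joined = "::".join(pieces)

# Regular expressions: split on runs of whitespace
tokens = re.split(r"\s+", "foo    bar\t baz  \tqux")

# Vectorized string functions via the .str accessor (missing values allowed)
emails = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                    "Rob": "rob@gmail.com", "Wes": np.nan})
is_gmail = emails.str.contains("gmail", na=False)   # treat missing as False
pattern = r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+)\.([A-Za-z]{2,4})"
parts = emails.str.extract(pattern)   # one column per capture group
```

The pivot call mirrors the four steps listed for making a pivot table: the prepared frame is the data source, `index` and `columns` configure rows and columns, and `values` with `aggfunc` define what is summarized.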
