
Week 7: Data Manipulation

Midterm
BE202 / 2023 / Introduction to Data Science
• 4/18 (Tue) 13:00-14:30, E1 101 convention hall
• Scoring: 100 points max, by …
• 10 multiple-choice questions (2 points each)
• 10 short-answer questions (2 points each)
• 6 complex problems across the topics (10 points each)
Week 7
Lecture 12: Data Manipulation
Data cleaning and preparation
2023-4-10
Lecturer: Sunjun Kim
Reference: Python for Data Analysis (2nd ed.), Chapter 7
Recap: Dataframe
Pandas Dataframe
Spreadsheet (a.k.a. Excel)
Real-world Data
• Workaround: filtering out vs. filling in
Missing data treatment
Filtering-out, or Fill-in
Missing Data in Pandas
• NA: Not Available
• NaN: Not a Number
• None: the null object in Python
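A minimal sketch of how pandas treats these markers, assuming a small made-up Series (the values are illustrative, not from the lecture):

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
s = pd.Series([1.0, np.nan, None, 4.0])

# In a float Series, None is stored as NaN; isna() flags both
mask = s.isna()
n_valid = int(s.notna().sum())

print(mask.tolist())  # [False, True, True, False]
print(n_valid)        # 2
```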
Filtering-out NaN:
Case: Series
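The Series case can be sketched as follows, assuming an illustrative Series (not the one shown on the original slide). dropna() and boolean indexing with notna() give the same result:

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the lecture
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

dropped = s.dropna()   # keeps only non-NA entries
same = s[s.notna()]    # equivalent boolean-mask form

print(dropped.tolist())       # [1.0, 3.5, 7.0]
print(list(dropped.index))    # [0, 2, 4]  (original labels are preserved)
```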
Filtering-out NaN:
Case: DataFrame
• thresh: minimum number of non-NA values a row must have to be kept
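For the DataFrame case, a small sketch under the assumption of a made-up 4×3 frame. It contrasts the default behaviour (drop a row with any NA), how="all", and thresh:

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 0 is complete, row 2 is entirely NA
df = pd.DataFrame({
    "a": [1.0, 1.0, np.nan, np.nan],
    "b": [6.5, np.nan, np.nan, 6.5],
    "c": [3.0, np.nan, np.nan, 3.0],
})

any_na = df.dropna()              # drop rows containing any NA
all_na = df.dropna(how="all")     # drop rows where every value is NA
at_least_2 = df.dropna(thresh=2)  # keep rows with >= 2 non-NA values

print(list(any_na.index))      # [0]
print(list(all_na.index))      # [0, 1, 3]
print(list(at_least_2.index))  # [0, 3]
```

Passing axis=1 applies the same rules to columns instead of rows.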
Filling-in NaN:
• Fill with a given (default) value
• Per-column fill values: dict with { col:value, … }
* In-place filling: inplace=True
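A hedged sketch of the three variants, assuming a made-up two-column frame and arbitrary fill values:

```python
import numpy as np
import pandas as pd

# Illustrative data and fill values, not from the lecture
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

filled_zero = df.fillna(0)                        # one value everywhere
filled_per_col = df.fillna({"a": -1.0, "b": 0.5})  # dict: column -> value

# In-place variant: modifies the object and returns None
in_place = df.copy()
in_place.fillna(0, inplace=True)

print(filled_per_col["a"].tolist())  # [1.0, -1.0, 3.0]
print(filled_per_col["b"].tolist())  # [0.5, 5.0, 0.5]
```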
Filling-in NaN:
Reindexing: filling
• Fill with interpolation methods
• Useful in time-series data
• ffill: fill with the previous value
• bfill: fill with the next value
• nearest: fill with the nearest value
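The three methods can be sketched with reindex on a sparse, sorted index; the Series values and index below are assumptions chosen so that no position is equidistant from two neighbours. For an existing Series with gaps, the ffill() and bfill() methods do the same kind of filling:

```python
import numpy as np
import pandas as pd

# Illustrative Series defined only at positions 0, 3 and 6
obj = pd.Series(["blue", "purple", "yellow"], index=[0, 3, 6])
new_index = list(range(7))

forward = obj.reindex(new_index, method="ffill")    # carry previous value
backward = obj.reindex(new_index, method="bfill")   # take next value
nearest = obj.reindex(new_index, method="nearest")  # take closest label

# Filling gaps inside an existing Series
gaps = pd.Series([1.0, np.nan, np.nan, 4.0])
gaps_ffill = gaps.ffill()
gaps_bfill = gaps.bfill()

print(nearest.tolist())
# ['blue', 'blue', 'purple', 'purple', 'purple', 'yellow', 'yellow']
print(gaps_ffill.tolist())  # [1.0, 1.0, 1.0, 4.0]
```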
Filling-in NaN
• Application: fill by mean
Note: NA values are automatically ignored in mean()
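A minimal sketch of filling by column mean, assuming a made-up frame. Because mean() skips NA values, the means are computed from the observed entries only:

```python
import numpy as np
import pandas as pd

# Illustrative data
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, np.nan]})

col_means = df.mean()          # NA ignored: x -> 2.0, y -> 15.0
filled = df.fillna(col_means)  # each column filled with its own mean

print(filled["x"].tolist())  # [1.0, 2.0, 3.0]
print(filled["y"].tolist())  # [10.0, 20.0, 15.0]
```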
Data Transformation
Removing duplicates, replacing values, function mapping, renaming, binning, etc.
Duplicate removal
• duplicated(), drop_duplicates()
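A short sketch of both methods on an illustrative frame (the column names k1 and k2 are assumptions). duplicated() marks rows that repeat an earlier row; drop_duplicates() removes them, optionally judging duplicates on a subset of columns:

```python
import pandas as pd

# Illustrative data: the last row repeats the one before it
df = pd.DataFrame({
    "k1": ["one", "two", "one", "two", "one", "two", "two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
})

is_dup = df.duplicated()         # True where a row repeats an earlier one
deduped = df.drop_duplicates()   # keeps the first occurrence by default
by_k1 = df.drop_duplicates(subset=["k1"])                    # first per k1
by_k1_last = df.drop_duplicates(subset=["k1"], keep="last")  # last per k1

print(is_dup.tolist())         # [False, False, False, False, False, False, True]
print(list(by_k1.index))       # [0, 1]
print(list(by_k1_last.index))  # [4, 6]
```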
Data replacement
-999 → NaN
-1000 → 0
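The sentinel replacements noted above can be sketched with replace(); the Series below is an assumption, with -999 and -1000 standing in for missing-value codes:

```python
import numpy as np
import pandas as pd

# Illustrative data with sentinel codes
s = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

one = s.replace(-999, np.nan)                 # single value -> NaN
both_nan = s.replace([-999, -1000], np.nan)   # several values -> NaN
mapped = s.replace({-999: np.nan, -1000: 0})  # dict: old value -> new value

print(int(one.isna().sum()))       # 2
print(int(both_nan.isna().sum()))  # 3
print(mapped[4])                   # 0.0
```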
Binning: cut()
(18,25]
(25,35]
(35,60]
(60,100]
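A sketch of binning with these four intervals, assuming a made-up list of ages. pd.cut returns a Categorical whose codes give the bin number of each value:

```python
import pandas as pd

# Illustrative ages; the bin edges match the intervals listed above
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)
counts = pd.Series(cats).value_counts(sort=False)  # sizes per bin

print(list(cats.codes))  # [0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1]
print(cats.categories)   # the four intervals (18, 25] ... (60, 100]
print(counts)
```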
Binning: cut() / left-closed intervals
• Default (right=True): 18 < … <= 25
• right=False: 18 <= … < 25
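The difference shows up on values that sit exactly on an edge. A small sketch, with edge values chosen as an assumption to make the effect visible:

```python
import pandas as pd

# Values on the bin edges make the difference visible
ages = [18, 25, 30]
bins = [18, 25, 35]

right_closed = pd.cut(ages, bins)              # (18, 25], (25, 35]
left_closed = pd.cut(ages, bins, right=False)  # [18, 25), [25, 35)

# Code -1 means "not in any bin": 18 is excluded from (18, 25]
print(list(right_closed.codes))  # [-1, 0, 1]
print(list(left_closed.codes))   # [0, 1, 1]
```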
Binning: cut() / labeling
• Application: filtering by binned group
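A hedged sketch of labeling and then filtering by group. The names, ages and group labels below are hypothetical; only the labels= parameter of pd.cut is the point:

```python
import pandas as pd

# Hypothetical people and group names, for illustration only
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cho", "Dee"],
    "age": [22, 31, 47, 68],
})
bins = [18, 25, 35, 60, 100]
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

# labels= replaces the interval notation with readable names
df["age_group"] = pd.cut(df["age"], bins, labels=group_names)

# Filter rows by their binned group
middle_aged = df[df["age_group"] == "MiddleAged"]
print(middle_aged["name"].tolist())  # ['Cho']
```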
Binning: cut() / equal-width bins
• Four bins: (0.051, 0.28], (0.28, 0.5], (0.5, 0.73], (0.73, 0.95]
• Precision to 2 decimal places
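Passing an integer instead of explicit edges asks pd.cut for that many equal-width bins over the data range. A sketch with assumed random data, so the edges will differ from those quoted above:

```python
import numpy as np
import pandas as pd

# Assumed data: 20 uniform random values with a fixed seed
rng = np.random.default_rng(0)
data = rng.uniform(size=20)

# 4 equal-width bins; precision=2 limits the digits shown in the labels
binned = pd.cut(data, 4, precision=2)

print(binned.categories)
print(pd.Series(binned).value_counts(sort=False))
```

Equal width does not imply equal counts; the number of points per bin depends on how the data are distributed.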
Binning: qcut() / bins with equal numbers of data points
(-2.95, -0.67]  #: 250
(-0.67, -0.24]  #: 250
(-0.24, 0.61]   #: 250
(0.61, 3.93]    #: 250
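A sketch of quantile-based binning on 1000 assumed standard-normal samples (a different random draw from the one above, so the edges will not match). With 4 quantile bins each bin receives 250 points; custom quantiles give unequal but predictable sizes:

```python
import numpy as np
import pandas as pd

# Assumed data: 1000 standard-normal samples with a fixed seed
rng = np.random.default_rng(42)
data = rng.standard_normal(1000)

quartiles = pd.qcut(data, 4)                     # cut at 25%, 50%, 75%
custom = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])  # user-defined quantiles

quartile_counts = np.bincount(quartiles.codes).tolist()
custom_counts = np.bincount(custom.codes).tolist()

print(quartile_counts)  # [250, 250, 250, 250]
print(custom_counts)    # [100, 400, 400, 100]
```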
Outlier filtering
• Outlier := a data point that differs radically from the others
• Example: np.random.randn draws random samples from the standard normal distribution (mean=0.0, sd=1.0)
• np.random.randn(1000, 4): generates a 1000×4 array
Outlier filtering
* Get >3 sigma outliers from column 2
* Get rows containing a >3 sigma outlier in any column
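Both selections can be sketched as follows, assuming a seeded 1000×4 standard-normal frame. Since sd=1.0, ">3 sigma" reduces to an absolute value above 3:

```python
import numpy as np
import pandas as pd

# Assumed seed; the actual numbers on the slide will differ
np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))

# Outliers in column 2 only
col = data[2]
col_outliers = col[col.abs() > 3]

# Rows where at least one column exceeds 3 in absolute value
rows_with_outliers = data[(data.abs() > 3).any(axis=1)]

print(col_outliers)
print(rows_with_outliers)
```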
Outlier removal
* Reassign -3 or 3 to the outliers
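One way to sketch this capping, again on an assumed seeded frame: np.sign gives -1 or 1 per element, so multiplying by 3 yields the cap with the right sign. clip(-3, 3) is a shorter alternative that gives the same result:

```python
import numpy as np
import pandas as pd

# Assumed seed and data, for illustration only
np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))

capped = data.copy()
# Replace every |value| > 3 with -3 or 3, keeping the original sign
capped[capped.abs() > 3] = np.sign(capped) * 3

# Alternative with the same effect
clipped = data.clip(-3, 3)

print(capped.abs().max().max())  # never above 3.0
```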
Lecture 13: Data Manipulation (cont'd)
Data cleaning and preparation
2021-4-10
Lecturer: Sunjun Kim
Reference: Python for Data Analysis (2nd ed.), Chapter 7
Sorting and Summarization
Sort by index
Sort by value
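A minimal sketch of sorting by index and by value, assuming a small made-up frame with unordered row labels:

```python
import pandas as pd

# Illustrative frame with unordered index and columns
df = pd.DataFrame(
    {"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]},
    index=["d", "a", "c", "b"],
)

by_index = df.sort_index()          # sort rows by index label
by_columns = df.sort_index(axis=1)  # sort columns by name
by_value = df.sort_values("b")      # sort rows by one column
by_two = df.sort_values(["a", "b"], ascending=[True, False])

print(list(by_index.index))  # ['a', 'b', 'c', 'd']
print(list(by_value.index))  # ['c', 'b', 'd', 'a']
print(list(by_two.index))    # ['d', 'c', 'a', 'b']
```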
Summarizations
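A hedged sketch of common summary methods on an assumed frame with missing values. As with mean(), these reductions skip NA by default:

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values
df = pd.DataFrame(
    {"one": [1.4, 7.1, np.nan, 0.75], "two": [np.nan, -4.5, np.nan, -1.3]},
    index=["a", "b", "c", "d"],
)

col_sums = df.sum()           # per-column sum, NA skipped
row_means = df.mean(axis=1)   # per-row mean; all-NA row stays NaN
top_label = df["one"].idxmax()  # index label of the maximum
summary = df.describe()       # count, mean, std, min, quartiles, max

print(col_sums)
print(top_label)  # b
print(summary)
```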
Pivot Table
• Pivot table is…
  • A powerful data analysis tool
  • Easily summarize and explore large datasets
  • Customizable, interactive format
  • Widely used in spreadsheet applications
cleaned_euckr.csv
Making pivot table
1. Data source preparation (cleanup, filtering, …)
2. Field identification: breaking down the data source into individual fields (columns)
3. Row/column configuration: define rows and columns for the summarization
4. Define values: count, sum, average, max, min, …
Pivot table in MS Excel
Pivot table in Pandas
• pandas.pivot_table
https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
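A minimal sketch of pandas.pivot_table. The sales frame and its column names are hypothetical stand-ins (the lecture's own dataset is not reproduced here); index, columns, values and aggfunc correspond to the row, column and value choices of a spreadsheet pivot table:

```python
import pandas as pd

# Hypothetical data source, for illustration only
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "amount": [100, 150, 200, 50, 120, 80],
})

table = pd.pivot_table(
    sales,
    index="region",     # rows of the pivot table
    columns="product",  # columns of the pivot table
    values="amount",    # field to summarize
    aggfunc="sum",      # count, sum, mean, max, min, ...
    fill_value=0,       # shown for empty region/product combinations
)

print(table)
```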
String manipulation
Python built-in string methods
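A few commonly used built-in string methods, sketched on an assumed sample string (the slide's own examples are not recoverable here):

```python
# Illustrative input string
val = "a,b,  guido"

pieces = [x.strip() for x in val.split(",")]  # split, then trim whitespace
joined = "::".join(pieces)                    # join with a delimiter
has_guido = "guido" in val                    # substring test
first_comma = val.index(",")                  # position of first match
n_commas = val.count(",")                     # number of occurrences
replaced = val.replace(",", "::")             # substitute a substring

print(pieces)       # ['a', 'b', 'guido']
print(joined)       # a::b::guido
print(first_comma)  # 1
print(n_commas)     # 2
```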
Recap: regular expressions
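As a refresher, a short sketch of the main re operations. The sample text and the simplified email pattern are assumptions for illustration; the pattern is not a complete email validator:

```python
import re

# Split on runs of whitespace
text = "foo    bar\t baz  \tqux"
tokens = re.split(r"\s+", text)

# Simplified email pattern (an assumption, not a full validator)
emails_text = "Dave dave@google.com\nSteve steve@gmail.com"
pattern = re.compile(
    r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE
)

emails = pattern.findall(emails_text)          # all matches
first = pattern.search(emails_text)            # first match object
masked = pattern.sub("REDACTED", emails_text)  # replace matches

print(tokens)         # ['foo', 'bar', 'baz', 'qux']
print(emails)         # ['dave@google.com', 'steve@gmail.com']
print(first.group())  # dave@google.com
```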
Vectorized String functions in Pandas
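A hedged sketch of the .str accessor on an assumed Series of email addresses. The methods apply element-wise and pass missing values through instead of raising an error:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing entry
data = pd.Series({
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Wes": np.nan,
})

has_gmail = data.str.contains("gmail")    # NaN stays NaN
upper = data.str.upper()                  # element-wise upper-casing
domains = data.str.split("@").str[1]      # part after the @ sign

# Capture groups become columns 0, 1, 2 of a DataFrame
parts = data.str.extract(r"([a-z0-9._%+-]+)@([a-z0-9.-]+)\.([a-z]{2,4})")

print(has_gmail)
print(domains)
print(parts)
```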