Uploaded by techievishumac

Introduction to pandas in Python

🐼
Pandas REDI lecture
Tags
Productivity
Now, let's get into the zoo analogy
Dataset links:
1. Cars:
https://raw.githubusercontent.com/juliandnl/redi_ss20/master/cars.csv
2. Immobilienscout24: https://github.com/ReDI-School/nrw-dataanalytics/blob/main/8_berlin_housing_with_scraped - berlin_housing_with_scraped.csv
Colab Links:
1. Intro to pandas
Google Colaboratory
https://colab.research.google.com/drive/1BIy3rdXLyA1lRiIXFgzcxXuJhzMfNmzv?usp=sharing
2. Transformation
Google Colaboratory
https://colab.research.google.com/drive/1AIBIeB6VG4jax1nSIANDpQuL36cRYIF1?usp=sharing
Crash Notes:
1. What are CSVs?
CSV stands for Comma-Separated Values.
It is the most common file format for datasets.
CSV files carry comparatively little metadata (for example, no column data types).
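A minimal sketch of reading CSV data with pandas. The CSV text and its column names are made up for illustration; they are not taken from the lecture datasets. Because a CSV stores no type information, pandas has to infer each column's dtype while reading.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file (hypothetical columns)
csv_text = "model,year,price\nGolf,2015,9500\nPolo,2018,11000\n"

# read_csv accepts a file path, a URL, or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # dtypes are inferred, since the CSV itself carries none
```

The same call should work with a path or a raw URL in place of the StringIO object.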
2. What are common file formats?
💡
https://guides.library.oregonstate.edu/research-data-services/data-management-types-formats
3. What are data frames? (Yes, that's a question)
4. Dot vs bracket notation? (Which is best?)
In general, I prefer dot notation because:
Reason 1: Dot notation is easier to type
Dot notation is three fewer characters to type than bracket notation. And in terms of finger movement, typing
a single period is much more convenient than typing brackets and quotes.
This might sound like a trivial reason, but if you're selecting columns dozens (or hundreds) of times a day, it
makes a real difference!
Reason 2: Dot notation limits the usage of brackets
💡
Honestly, I was just kidding. It doesn't matter; it's your own preference.
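A small sketch of both notations side by side, using a made-up DataFrame (the column names are assumptions for illustration). One caveat worth knowing: dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so bracket notation is the fallback in those cases.

```python
import pandas as pd

# Hypothetical data; "living space" deliberately contains a space
df = pd.DataFrame({"rooms": [2, 3], "living space": [50.0, 80.0]})

by_dot = df.rooms          # dot notation
by_bracket = df["rooms"]   # bracket notation
same = by_dot.equals(by_bracket)  # both select the same column

# A name with a space cannot be reached with dot notation
space = df["living space"]
```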
5. What is the precheck for pandas query()?
Used with plain column names, this method only works if the column name doesn't contain any spaces (recent
pandas versions also accept such names wrapped in backticks). So before applying the method, replace spaces in
column names with '_'.
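The precheck could look roughly like this. The DataFrame and its column names are invented for the example.

```python
import pandas as pd

# Hypothetical data with a space in one column name
df = pd.DataFrame({"living space": [50.0, 80.0, 120.0], "rooms": [2, 3, 4]})

# Precheck: replace spaces in column names with underscores
df.columns = df.columns.str.replace(" ", "_")

# The renamed columns can now be used directly inside the query string
large = df.query("living_space > 60 and rooms >= 3")
```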
6. Single quotes vs double quotes?
In English writing, single quotes mark a quote within a quote or a direct quote in a news story headline.
Double quotation marks set off a direct (word-for-word) quotation. For example: "I hope you will be
here," he said. In Python, single and double quotes are interchangeable for string literals; pick one style and use it consistently.
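Use single quotes when you know your string may contain double quotes, and double quotes when it may contain single quotes.
sentence = "It's your own preference"  # example value; double quotes allow the apostrophe
print(sentence)
name = '"Hi" ABC'
print(name)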
💡
Bonus: What if you have to use strings that may include both single and double quotes? Use ''' ... '''
(triple quotes).
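A short sketch of the three quoting styles in plain Python; the example strings are made up.

```python
single = 'She said "Hi"'            # double quotes inside single quotes
double = "It's fine"                # an apostrophe inside double quotes
both = '''She said "it's fine"'''   # both kinds inside triple quotes

# The quote style does not change the resulting string
same = 'pandas' == "pandas"

print(single)
print(double)
print(both)
```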
7. What are the common data aggregation methods?
Aggregating functions reduce the dimension of the returned object: the output Series/DataFrame has fewer rows
than the original, or at most the same number. Common examples are sum(), mean(), min(), max() and count().
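A minimal sketch of aggregation on a made-up table. The region and rent columns are assumptions for illustration, not the actual column names of the housing dataset.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "region": ["Mitte", "Mitte", "Kreuzberg"],
    "rent": [1200, 1500, 900],
})

# Aggregating a whole column reduces it to a single value
total = df["rent"].sum()
average = df["rent"].mean()

# Aggregating per group returns one row per group
per_region = df.groupby("region")["rent"].agg(["min", "max", "mean", "count"])
print(per_region)
```

Here three input rows collapse into two output rows, one per region.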
💡
Useful links:
Aggregation in Pandas
How can I perform aggregation with Pandas? No DataFrame after aggregation! What
happened? How can I aggregate mainly strings columns (to lists, tuples, strings with
separator)? How can I aggregate ...
https://stackoverflow.com/questions/53781634/aggregation-in-pandas
Group by: split-apply-combine - pandas 1.4.1 documentation
By "group by" we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria. Applying a function to each group
independently. Combining the results into a data structure. Out of these, the split step is
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
8. What’s the role of a data analyst?
Like cleaning your rooms and racking your weights after lifting, cleaning and organising the data before
processing is important.
9. How do iloc and loc differ?
The main distinction between the two methods is:
loc gets rows (and/or columns) with particular labels.
iloc gets rows (and/or columns) at integer locations.
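The difference can be sketched on a tiny made-up DataFrame with string labels as its index. Note that label slices with loc include the end label, while position slices with iloc exclude the end position.

```python
import pandas as pd

# Hypothetical data with string labels as the index
df = pd.DataFrame({"rooms": [2, 3, 4]}, index=["a", "b", "c"])

by_label = df.loc["b", "rooms"]   # row labelled "b", column labelled "rooms"
by_position = df.iloc[1, 0]       # second row, first column

label_slice = df.loc["a":"b"]     # label slice: end label is included
position_slice = df.iloc[0:1]     # position slice: end position is excluded
```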
10. How do lambda and regular functions differ?
Lambda functions can only have one expression in their body.
Regular functions can have multiple expressions and statements in their body.
Lambdas do not have a name associated with them. That’s why they are also known as anonymous
functions.
Regular functions must have a name and signature.
Lambdas do not contain a return statement because the body is automatically returned.
Functions that need to return a value should include a return statement.
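The same calculation written both ways, as a small sketch in plain Python (the function names are made up):

```python
def square(x):
    """Regular function: has a name, can hold several statements, returns explicitly."""
    result = x * x
    return result

# Lambda: a single expression whose value is returned automatically
square_lambda = lambda x: x * x

print(square(4), square_lambda(4))
print(square.__name__)         # the name given in the def statement
print(square_lambda.__name__)  # lambdas are anonymous, so this is '<lambda>'
```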
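11. What are axis values for rows and columns?
0 = rows (the index): an operation with axis=0 runs down the rows and gives one result per column
1 = columns: an operation with axis=1 runs across the columns and gives one result per row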
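A quick sketch of how the axis argument behaves, on made-up data. For aggregations the axis names the direction that gets collapsed; for drop() it names where the label is looked up.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

down_rows = df.sum(axis=0)      # collapses the rows: one sum per column
across_cols = df.sum(axis=1)    # collapses the columns: one sum per row

without_b = df.drop("b", axis=1)    # drops the column labelled "b"
without_row0 = df.drop(0, axis=0)   # drops the row labelled 0
```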
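12. What's the fundamental issue with saving columns that have a datetime format?
When the data is saved to a CSV file and read back, the column changes to the object data type unless it is
parsed again (for example with parse_dates in read_csv, or with pd.to_datetime)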
☹
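A sketch of the round trip, assuming the data is saved as CSV. The posted column is a made-up example, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical column holding real datetime values
df = pd.DataFrame({"posted": pd.to_datetime(["2021-01-15", "2021-02-20"])})
print(df["posted"].dtype)

# Save to CSV (an in-memory buffer stands in for a file)
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading it back without extra options loses the datetime dtype
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["posted"].dtype)

# Asking read_csv to parse the column restores it
buffer.seek(0)
fixed = pd.read_csv(buffer, parse_dates=["posted"])
print(fixed["posted"].dtype)
```

Calling pd.to_datetime on the reloaded column is another way to convert it back.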
Projects :
Odd
Intro to Pandas:
1. What are the min, max and average number of rooms of the apartments in Köpenick?
2. How many apartments are there in Mitte in a 'well-kept' condition?
Transformations:
1. Is there a difference in average room space (Space divided by Rooms) for apartments built after 2000
versus apartments built before 2000?
2. How many apartments were posted each month?
Bonus:
1. Can you think about other KPIs for Immobilienscout? (It doesn't need to use only the data from this
dataset)
Even
Intro to Pandas:
1. How many apartments are there in Kreuzberg built before 1900?
2. Which is the most expensive region on average and what is the maximum rent for this region?
Transformations:
1. What's the percentage of the apartments in high-quality condition? In order to answer this question,
let's assume that high-quality conditions are: first_time_use, mint_condition, refurbished,
first_time_use_after_refurbishment and fully_renovated.
The rest we can categorize as low-quality conditions.
2. How many apartments were posted each month?
Bonus:
1. Can you think about other KPIs for Immobilienscout? (It doesn't need to use only the data from this
dataset)