🐼 Pandas REDI lecture

Tags: Productivity

Now, let's get into the zoo analogy

Dataset links:
1. Cars: https://raw.githubusercontent.com/juliandnl/redi_ss20/master/cars.csv
2. Immobilienscout24: https://github.com/ReDI-School/nrw-dataanalytics/blob/main/8_berlin_housing_with_scraped - berlin_housing_with_scraped.csv

Colab links:
1. Intro to pandas (Google Colaboratory): https://colab.research.google.com/drive/1BIy3rdXLyA1lRiIXFgzcxXuJhzMfNmzv?usp=sharing
2. Transformation (Google Colaboratory): https://colab.research.google.com/drive/1AIBIeB6VG4jax1nSIANDpQuL36cRYIF1?usp=sharing

Crash Notes:

1. What are CSVs?
   Comma-Separated Values.
   The most common file format for datasets.
   They carry comparatively little metadata.

2. What are common file formats?
   💡 https://guides.library.oregonstate.edu/research-data-services/data-management-types-formats

3. What are data frames? (Yes, that's a question.)

4. Dot vs bracket notation? (Which is the best?)
   In general, I prefer dot notation because:
   Reason 1: Dot notation is easier to type. It is three fewer characters to type than bracket notation, and in terms of finger movement, typing a single period is much more convenient than typing brackets and quotes. This might sound like a trivial reason, but if you're selecting columns dozens (or hundreds) of times a day, it makes a real difference!
   Reason 2: Dot notation limits the usage of brackets.
   💡 Honestly, I was just kidding. It doesn't matter; it's your own preference.
   One practical caveat: dot notation only works when the column name is a valid Python identifier (no spaces) and does not clash with an existing DataFrame attribute or method. Bracket notation always works.

5. What is the precheck for pandas query()?
   Column names that contain spaces can't be referenced directly inside query(). So before applying the method, replace the spaces in column names with '_' (pandas 0.25 and later also accept such names wrapped in backticks).

6. Single quotes vs double quotes?
   In English writing, single quotes mark a quote within a quote or a direct quote in a news story headline. Double quotation marks set off a direct (word-for-word) quotation.
   For example: “I hope you will be here,” he said.
   In Python, strings can be written with either single or double quotes; these notes use double quotes by default.
   Always use single quotes when you know your string may contain double quotes.

       sentence = "I hope you will be here"  # plain string in double quotes
       print(sentence)
       name = '"Hi" ABC'  # single quotes outside, so the inner double quotes need no escaping
       print(name)

   💡 Bonus: What if your string may include both single and double quotes? Use triple quotes (''' ''').

7. What are the common data aggregation methods?
   Aggregating functions reduce the dimension of the returned object: the output Series/DataFrame has fewer rows than the original, or at most the same number. Typical examples are sum(), mean(), min(), max() and count().
   💡 Useful links:
   Aggregation in Pandas (Stack Overflow; covers how to aggregate, why no DataFrame may come back after aggregation, and how to aggregate string columns to lists, tuples, or strings with a separator): https://stackoverflow.com/questions/53781634/aggregation-in-pandas
   Group by: split-apply-combine (pandas 1.4.1 documentation): https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
   By "group by" we are referring to a process involving one or more of the following steps:
     Splitting the data into groups based on some criteria.
     Applying a function to each group independently.
     Combining the results into a data structure.

8. What's the role of a data analyst?
   Like cleaning your room and racking your weights after lifting, cleaning and organising the data before processing is important.

9. How do loc and iloc differ?
   The main distinction between the two methods is:
   loc gets rows (and/or columns) with particular labels.
   iloc gets rows (and/or columns) at integer locations.

10. How do lambda and regular functions differ?
    Lambda functions can only have one expression in their body.
    Regular functions can have multiple expressions and statements in their body.
    Lambdas do not have a name associated with them.
    That’s why they are also known as anonymous functions.
    Regular functions must have a name and a signature.
    Lambdas do not contain a return statement because the value of their expression is returned automatically.
    Regular functions that need to return a value should include a return statement.

11. What are the axis values for rows and columns?
    0 = rows
    1 = columns

12. What's the fundamental issue with saving columns that have a DateTime format?
    When the data is saved to a text format such as CSV and loaded again, the column comes back as the object data type ☹ unless the dates are parsed again on loading (for example with parse_dates in read_csv).

Projects:

Odd
Intro to Pandas:
1. What are the min, max and average number of rooms of the apartments in Köpenick?
2. How many apartments are there in Mitte in a 'well-kept' condition?
Transformations:
1. Is there a difference in average room space (Space divided by Rooms) for apartments built after 2000 versus apartments built before 2000?
2. How many apartments were posted each month?
Bonus:
1. Can you think of other KPIs for Immobilienscout? (They don't need to use only the data from this dataset.)

Even
Intro to Pandas:
1. How many apartments are there in Kreuzberg built before 1900?
2. Which is the most expensive region on average, and what is the maximum rent for this region?
Transformations:
1. What's the percentage of apartments in high-quality condition? To answer this question, let's assume that the high-quality conditions are: first_time_use, mint_condition, refurbished, first_time_use_after_refurbishment and fully_renovated. The rest we can categorize as low-quality conditions.
2. How many apartments were posted each month?
Bonus:
1. Can you think of other KPIs for Immobilienscout? (They don't need to use only the data from this dataset.)
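
Crash notes recap in code: a minimal sketch that touches crash notes 5, 7, 9, 10, 11 and 12 on a tiny made-up DataFrame rather than the course datasets. The column names (Region, Rooms, Cold Rent, Posted), the values and the helper rent_per_room are hypothetical and only there for illustration, so this does not answer any of the project questions.

```python
import io

import pandas as pd

# Hypothetical toy data; names and values are invented for this sketch.
df = pd.DataFrame(
    {
        "Region": ["Mitte", "Mitte", "Kreuzberg", "Köpenick"],
        "Rooms": [2, 3, 1, 4],
        "Cold Rent": [900.0, 1500.0, 650.0, 1100.0],
        "Posted": pd.to_datetime(
            ["2020-01-15", "2020-02-03", "2020-02-20", "2020-03-01"]
        ),
    },
    index=["a", "b", "c", "d"],
)

# Note 5: replace spaces in column names so query() can refer to them directly.
df.columns = df.columns.str.replace(" ", "_")
cheap = df.query("Cold_Rent < 1000")

# Note 9: loc selects by label, iloc by integer position.
by_label = df.loc["b", "Rooms"]   # row labelled "b", column "Rooms"
by_position = df.iloc[1, 1]       # second row, second column (same cell)


# Note 10: a regular function has a name and an explicit return statement ...
def rent_per_room(row):
    return row["Cold_Rent"] / row["Rooms"]


# Note 11: axis=1 hands each row to the function (axis=0 would hand each column).
df["Rent_per_room"] = df.apply(rent_per_room, axis=1)
# ... while a lambda is a single anonymous expression whose value is returned.
df["Rent_per_room_lambda"] = df.apply(
    lambda row: row["Cold_Rent"] / row["Rooms"], axis=1
)

# Note 7: aggregation reduces many rows to one row per group.
rooms_by_region = df.groupby("Region")["Rooms"].agg(["min", "max", "mean"])

# Note 12: after a CSV round trip the datetime column comes back as plain
# strings unless we ask read_csv to parse the dates again.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Posted"])
```

Run it in a Colab cell and inspect cheap, rooms_by_region, plain.dtypes and parsed.dtypes to see each note in action.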