Data Analysis and Visualization in Python for Ecologists-gouda February 26, 2024 [2]: ### Pandas in Python import pandas as pd [3]: text = "Data Carpentry" [4]: text [4]: 'Data Carpentry' 1 Reading CSV Data Using Pandas We will begin by locating and reading our survey data which are in CSV format. CSV stands for Comma-Separated Values and is a common way to store formatted data. Other symbols may also be used, so you might see tab-separated, colon-separated or space separated files. pandas can work with each of these types of separators, as it allows you to specify the appropriate separator for your data. CSV files (and other -separated value file types) make it easy to share data, and can be imported and exported from many applications, including Microsoft Excel. For more details on CSV files, see the Data Organisation in Spreadsheets lesson. We can use Pandas’ read_csv function to pull the file directly into a DataFrame. So What’s a DataFrame? A DataFrame is a 2-dimensional data structure that can store data of different types (including strings, numbers, categories and more) in columns. It is similar to a spreadsheet or an SQL table or the data.frame in R. A DataFrame always has an index (0-based). An index refers to the position of an element in the data structure. [5]: # Note that pd.read_csv is used because we imported pandas as pd #pd.read_csv("data/surveys.csv") pd.read_csv("surveys.csv") [5]: 0 1 2 3 4 … 35544 35545 record_id month day 1 7 16 2 7 16 3 7 16 4 7 16 5 7 16 … … … … 35545 12 31 35546 12 31 year 1977 1977 1977 1977 1977 … 2002 2002 plot_id species_id 2 NL 3 NL 2 DM 7 DM 3 DM … … 15 AH 15 AH 1 sex M M F M M … NaN NaN hindfoot_length 32.0 33.0 37.0 36.0 35.0 NaN NaN \ 35546 35547 35548 0 1 2 3 4 … 35544 35545 35546 35547 35548 35547 35548 35549 12 12 12 31 31 31 2002 2002 2002 10 7 5 RM DO NaN F M NaN 15.0 36.0 NaN plot_id species_id 2 NL 3 NL 2 DM 7 DM 3 DM … … 15 AH 15 AH 10 RM 7 DO 5 NaN sex M M F M M … NaN NaN F M NaN hindfoot_length 32.0 33.0 37.0 36.0 35.0 weight NaN NaN NaN NaN NaN … NaN NaN 14.0 51.0 NaN [35549 rows x 9 columns] [6]: surveys_df = pd.read_csv("surveys.csv") [7]: surveys_df [7]: 0 1 2 3 4 … 35544 35545 35546 35547 35548 record_id month day 1 7 16 2 7 16 3 7 16 4 7 16 5 7 16 … … … … 35545 12 31 35546 12 31 35547 12 31 35548 12 31 35549 12 31 0 1 2 3 4 … 35544 35545 35546 35547 weight NaN NaN NaN NaN NaN … NaN NaN 14.0 51.0 year 1977 1977 1977 1977 1977 … 2002 2002 2002 2002 2002 2 NaN NaN 15.0 36.0 NaN \ 35548 NaN [35549 rows x 9 columns] [8]: surveys_df.head() # The head() method displays the first several lines of a␣ ↪file. It # is discussed below. [8]: 0 1 2 3 4 record_id 1 2 3 4 5 0 1 2 3 4 weight NaN NaN NaN NaN NaN month 7 7 7 7 7 day 16 16 16 16 16 year 1977 1977 1977 1977 1977 plot_id species_id sex 2 NL M 3 NL M 2 DM F 7 DM M 3 DM M hindfoot_length 32.0 33.0 37.0 36.0 35.0 \ [9]: type(surveys_df) [9]: pandas.core.frame.DataFrame [10]: surveys_df.dtypes [10]: record_id month day year plot_id species_id sex hindfoot_length weight dtype: object 2 int64 int64 int64 int64 int64 object object float64 float64 Calculating Statistics From Data In A Pandas DataFrame We’ve read our data into Python. Next, let’s perform some quick summary statistics to learn more about the data that we’re working with. We might want to know how many animals were collected in each site, or how many of each species were caught. We can perform summary stats quickly using groups. But first we need to figure out what we want to group by. Let’s begin by exploring our data: 3 [11]: # Look at the column names surveys_df.columns # which returns: [11]: Index(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex', 'hindfoot_length', 'weight'], dtype='object') Let’s get a list of all the species. The pd.unique function tells us all of the unique values in the species_id column. [12]: pd.unique(surveys_df['species_id']) [12]: array(['NL', 'OL', 'RF', 'SC', 'PB', 3 'DM', 'RM', 'PC', 'BA', 'PL', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX', 'PX', 'CT', 'US'], dtype=object) CHALLENGE - STATISTICS Create a list of unique site IDs (“plot_id”) found in the surveys data. Call it site_names. How many unique sites are there in the data? How many unique species are in the data? What is the difference between len(site_names) and surveys_df[‘plot_id’].nunique()? 4 The solution 1. site_names = pd.unique(surveys_df[“plot_id”]) -How many unique sites are in the data? site_names.size or len(site_names) provide the answer: 24 -How many unique species are in the data? len(pd.unique(surveys_df[“species_id”])) tells us there are 49 species 2. len(site_names) and surveys_df[‘plot_id’].nunique() both provide the same output: they are alternative ways of getting the unique values. The nunique method combines the count and unique value extraction, and can help avoid the creation of intermediate variables like site_names. 5 Groups in Pandas We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per site. We can calculate basic statistics for all records in a single column using the syntax below: [13]: surveys_df['weight'].describe() 4 [13]: count 32283.000000 mean 42.672428 std 36.631259 min 4.000000 25% 20.000000 50% 37.000000 75% 48.000000 max 280.000000 Name: weight, dtype: float64 We can also extract one specific metric if we wish: [14]: surveys_df['weight'].min() surveys_df['weight'].max() surveys_df['weight'].mean() surveys_df['weight'].std() surveys_df['weight'].count() [14]: 32283 [15]: surveys_df['weight'].min() [15]: 4.0 [16]: surveys_df['weight'].max() [16]: 280.0 [17]: surveys_df['weight'].mean() [17]: 42.672428212991356 [18]: surveys_df['weight'].std() [18]: 36.63125947458399 [19]: surveys_df['weight'].count() [19]: 32283 But if we want to summarize by one or more variables, for example sex, we can use Pandas’ .groupby method. Once we’ve created a groupby DataFrame, we can quickly calculate summary statistics by a group of our choice. [20]: # Group data by sex grouped_data = surveys_df.groupby('sex') [21]: grouped_data.mean(numeric_only=True) 5 [21]: sex F M sex F M record_id month day year plot_id 18036.412046 17754.835601 6.583047 6.392668 16.007138 16.184286 1990.644997 1990.480401 11.440854 11.098282 hindfoot_length weight 28.836780 29.709578 42.170555 42.995379 \ [22]: # Summary statistics for all numeric columns by sex grouped_data.describe() # Provide the mean for each numeric column by sex grouped_data.mean(numeric_only=True) [22]: sex F M sex F M record_id month day year plot_id 18036.412046 17754.835601 6.583047 6.392668 16.007138 16.184286 1990.644997 1990.480401 11.440854 11.098282 hindfoot_length weight 28.836780 29.709578 42.170555 42.995379 \ [23]: grouped_data.mean(numeric_only=True) [23]: sex F M sex F M record_id month day year plot_id 18036.412046 17754.835601 6.583047 6.392668 16.007138 16.184286 1990.644997 1990.480401 11.440854 11.098282 hindfoot_length weight 28.836780 29.709578 42.170555 42.995379 \ The groupby command is powerful in that it allows us to quickly generate summary stats. 6 CHALLENGE - SUMMARY DATA 1-How many recorded individuals are female F and how many male M? 2-What happens when you group by two columns using the following syntax and then calculate mean values? –grouped_data2 = surveys_df.groupby([‘plot_id’, ‘sex’]) –grouped_data2.mean(numeric_only=True) 6 3- Summarize weight values for each site in your data. HINT: you can use the following syntax to only create summary statistics for one column in your data. by_site[‘weight’].describe() 7 The solution 1-The first column of output from grouped_data.describe() (count) tells us that the data contains 15690 records for female individuals and 17348 records for male individuals. Note that these two numbers do not sum to 35549, the total number of rows we know to be in the surveys_df DataFrame. Why do you think some records were excluded from the grouping? 2-Calling the mean() method on data grouped by these two columns calculates and returns the mean value for each combination of plot and sex. Note that the mean is not meaningful for some variables, e.g. day, month, and year. You can specify particular columns and particular summary statistics using the agg() method (short for aggregate), e.g. to obtain the last survey year, median foot-length and mean weight for each plot/sex combination: [24]: surveys_df.groupby(['plot_id', 'sex']).agg({"year": 'max', "hindfoot_length": 'median', "weight": 'mean'}) [24]: plot_id sex 1 F M 2 F M 3 F M 4 F M 5 F M 6 F M 7 F M 8 F M 9 F M 10 F M 11 F M 12 F M 13 F M year hindfoot_length weight 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 34.0 36.0 33.0 33.0 22.0 22.0 36.0 36.0 34.0 34.0 25.0 25.0 20.0 20.0 35.0 36.0 36.0 36.0 17.0 17.0 35.0 36.0 35.0 35.0 25.0 26.0 46.311138 55.950560 52.561845 51.391382 31.215349 34.163241 46.818824 48.888119 40.974806 40.708551 36.352288 36.867388 20.006135 21.194719 45.623011 49.641372 53.618469 49.519309 17.094203 19.971223 43.515075 43.366197 49.831731 48.909710 40.524590 40.097754 7 14 15 16 17 18 19 20 21 22 23 24 F M F M F M F M F M F M F M F M F M F M F M 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 35.0 36.0 21.0 21.0 21.0 21.0 34.0 35.0 25.0 26.0 21.0 21.0 25.0 26.0 24.0 21.0 36.0 36.0 18.0 18.0 22.0 21.0 47.355491 45.159378 26.670236 27.523691 25.810427 23.811321 48.176201 47.558853 36.963514 43.546952 21.978599 20.306878 52.624406 44.197279 25.974832 22.772622 53.647059 54.572531 20.564417 18.941463 47.914405 39.321503 3-surveys_df.groupby([‘plot_id’])[‘weight’].describe() [25]: surveys_df.groupby(['plot_id'])['weight'].describe() [25]: plot_id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 count mean std min 25% 50% 75% max 1903.0 2074.0 1710.0 1866.0 1092.0 1463.0 638.0 1781.0 1811.0 279.0 1793.0 2219.0 1371.0 1728.0 869.0 480.0 1893.0 1351.0 1084.0 51.822911 52.251688 32.654386 47.928189 40.947802 36.738893 20.663009 47.758001 51.432358 18.541219 43.451757 49.496169 40.445660 46.277199 27.042578 24.585417 47.889593 40.005922 21.105166 38.176670 46.503602 35.641630 32.886598 34.086616 30.648310 21.315325 33.192194 33.724726 20.290806 28.975514 41.630035 34.042767 27.570389 35.178142 17.682334 35.802399 38.480856 13.269840 4.0 5.0 4.0 4.0 5.0 5.0 4.0 5.0 6.0 4.0 5.0 6.0 5.0 5.0 4.0 4.0 4.0 5.0 4.0 30.0 24.0 14.0 30.0 21.0 18.0 11.0 26.0 36.0 10.0 26.0 26.0 20.5 36.0 11.0 12.0 27.0 17.5 11.0 44.0 41.0 23.0 43.0 37.0 30.0 17.0 44.0 45.0 12.0 42.0 42.0 33.0 44.0 18.0 20.0 42.0 30.0 19.0 53.0 50.0 36.0 50.0 48.0 45.0 23.0 51.0 50.0 21.0 48.0 50.0 45.0 49.0 26.0 34.0 50.0 44.0 27.0 231.0 278.0 250.0 200.0 248.0 243.0 235.0 178.0 275.0 237.0 212.0 280.0 241.0 222.0 259.0 158.0 216.0 256.0 139.0 8 20 21 22 23 24 8 1222.0 1029.0 1298.0 369.0 960.0 48.665303 24.627794 54.146379 19.634146 43.679167 50.111539 21.199819 38.743967 18.382678 45.936588 5.0 4.0 5.0 4.0 4.0 17.0 10.0 29.0 10.0 19.0 31.0 22.0 42.0 14.0 27.5 47.0 31.0 54.0 23.0 45.0 223.0 190.0 212.0 199.0 251.0 Quickly Creating Summary Counts in Pandas Let’s next count the number of samples for each species. We can do this in a few ways, but we’ll use groupby combined with a count() method. [26]: # Count the number of samples by species species_counts = surveys_df.groupby('species_id')['record_id'].count() print(species_counts) species_id AB 303 AH 437 AS 2 BA 46 CB 50 CM 13 CQ 16 CS 1 CT 1 CU 1 CV 1 DM 10596 DO 3027 DS 2504 DX 40 NL 1252 OL 1006 OT 2249 OX 12 PB 2891 PC 39 PE 1299 PF 1597 PG 8 PH 32 PI 9 PL 36 PM 899 PP 3123 PU 5 PX 6 9 RF 75 RM 2609 RO 8 2 RX SA 75 SC 1 SF 43 SH 147 SO 43 SS 248 ST 1 SU 5 UL 4 UP 8 UR 10 US 4 ZL 2 Name: record_id, dtype: int64 Or, we can also count just the rows that have the species “DO”: [27]: surveys_df.groupby('species_id')['record_id'].count()['DO'] [27]: 3027 9 CHALLENGE - MAKE A LIST What’s another way to create a list of species and associated count of the records in the data? Hint: you can perform count, min, etc. functions on groupby DataFrames in the same way you can perform them on regular DataFrames. 10 The solution As well as calling count() on the record_id column of the grouped DataFrame as above, an equivalent result can be obtained by extracting record_id from the result of count() called directly on the grouped DataFrame: [28]: surveys_df.groupby('species_id').count()['record_id'] [28]: species_id AB 303 AH 437 AS 2 BA 46 CB 50 CM 13 CQ 16 CS 1 10 CT 1 CU 1 CV 1 DM 10596 DO 3027 DS 2504 DX 40 NL 1252 OL 1006 OT 2249 OX 12 PB 2891 PC 39 PE 1299 PF 1597 PG 8 PH 32 PI 9 PL 36 PM 899 PP 3123 PU 5 PX 6 RF 75 RM 2609 RO 8 RX 2 SA 75 SC 1 SF 43 SH 147 SO 43 SS 248 ST 1 SU 5 UL 4 UP 8 UR 10 US 4 ZL 2 Name: record_id, dtype: int64 11 Basic Math Functions If we wanted to, we could apply a mathematical operation like addition or division on an entire column of our data. For example, let’s multiply all weight values by 2. 11 [29]: # Multiply all weight values by 2 surveys_df['weight']*2 [29]: 0 1 2 3 4 NaN NaN NaN NaN NaN … 35544 NaN 35545 NaN 35546 28.0 35547 102.0 35548 NaN Name: weight, Length: 35549, dtype: float64 A more practical use of this might be to normalize the data according to a mean, area, or some other value calculated from our data. 12 Plotting Data Using Pandas We can plot our summary stats using Pandas, too. [30]: # Make sure figures appear inline in Ipython Notebook %matplotlib inline # Create a quick bar chart species_counts.plot(kind='bar'); 12 We can also look at how many animals were captured in each site: [31]: total_count = surveys_df.groupby('plot_id')['record_id'].nunique() # Let's plot that too total_count.plot(kind='bar'); 13 13 CHALLENGE - PLOTS 1-Create a plot of average weight across all species per site. 2-Create a plot of total males versus total females for the entire dataset. 1-surveys_df.groupby(‘plot_id’).mean()[“weight”].plot(kind=‘bar’) [32]: surveys_df.groupby('plot_id').mean()["weight"].plot(kind='bar'); 14 2-surveys_df.groupby(‘sex’).count()[“record_id”].plot(kind=‘bar’) [33]: surveys_df.groupby('sex').count()["record_id"].plot(kind='bar') [33]: <matplotlib.axes._subplots.AxesSubplot at 0xdec1d30> 15 14 SUMMARY PLOTTING CHALLENGE A simple example with some data where ‘a’, ‘b’, and ‘c’ are the groups, and ‘one’ and ‘two’ are the subgroups. [34]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two' : pd. ↪Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} pd.DataFrame(d) [34]: a b c d one 1.0 2.0 3.0 NaN two 1.0 2.0 3.0 4.0 We can plot the above with [35]: # Plot stacked data so columns 'one' and 'two' are stacked my_df = pd.DataFrame(d) my_df.plot(kind='bar', stacked=True, title="The title of my graph") [35]: <matplotlib.axes._subplots.AxesSubplot at 0xdf0b5d0> 16 [ ]: 17