Uploaded by mohammed eelsayed

Data Analysis and Visualization in Python-gouda

advertisement
Data Analysis and Visualization in Python for Ecologists-gouda
February 26, 2024
[2]: ### Pandas in Python
import pandas as pd
[3]: text = "Data Carpentry"
[4]: text
[4]: 'Data Carpentry'
1
Reading CSV Data Using Pandas
We will begin by locating and reading our survey data which are in CSV format. CSV stands for
Comma-Separated Values and is a common way to store formatted data. Other symbols may also
be used, so you might see tab-separated, colon-separated or space separated files. pandas can work
with each of these types of separators, as it allows you to specify the appropriate separator for
your data. CSV files (and other -separated value file types) make it easy to share data, and can
be imported and exported from many applications, including Microsoft Excel. For more details on
CSV files, see the Data Organisation in Spreadsheets lesson. We can use Pandas’ read_csv function
to pull the file directly into a DataFrame.
So What’s a DataFrame? A DataFrame is a 2-dimensional data structure that can store data of
different types (including strings, numbers, categories and more) in columns. It is similar to a
spreadsheet or an SQL table or the data.frame in R. A DataFrame always has an index (0-based).
An index refers to the position of an element in the data structure.
[5]: # Note that pd.read_csv is used because we imported pandas as pd
#pd.read_csv("data/surveys.csv")
pd.read_csv("surveys.csv")
[5]:
0
1
2
3
4
…
35544
35545
record_id month day
1
7
16
2
7
16
3
7
16
4
7
16
5
7
16
…
… …
…
35545
12
31
35546
12
31
year
1977
1977
1977
1977
1977
…
2002
2002
plot_id species_id
2
NL
3
NL
2
DM
7
DM
3
DM
… …
15
AH
15
AH
1
sex
M
M
F
M
M
…
NaN
NaN
hindfoot_length
32.0
33.0
37.0
36.0
35.0
NaN
NaN
\
35546
35547
35548
0
1
2
3
4
…
35544
35545
35546
35547
35548
35547
35548
35549
12
12
12
31
31
31
2002
2002
2002
10
7
5
RM
DO
NaN
F
M
NaN
15.0
36.0
NaN
plot_id species_id
2
NL
3
NL
2
DM
7
DM
3
DM
… …
15
AH
15
AH
10
RM
7
DO
5
NaN
sex
M
M
F
M
M
…
NaN
NaN
F
M
NaN
hindfoot_length
32.0
33.0
37.0
36.0
35.0
weight
NaN
NaN
NaN
NaN
NaN
…
NaN
NaN
14.0
51.0
NaN
[35549 rows x 9 columns]
[6]: surveys_df = pd.read_csv("surveys.csv")
[7]: surveys_df
[7]:
0
1
2
3
4
…
35544
35545
35546
35547
35548
record_id month day
1
7
16
2
7
16
3
7
16
4
7
16
5
7
16
…
… …
…
35545
12
31
35546
12
31
35547
12
31
35548
12
31
35549
12
31
0
1
2
3
4
…
35544
35545
35546
35547
weight
NaN
NaN
NaN
NaN
NaN
…
NaN
NaN
14.0
51.0
year
1977
1977
1977
1977
1977
…
2002
2002
2002
2002
2002
2
NaN
NaN
15.0
36.0
NaN
\
35548
NaN
[35549 rows x 9 columns]
[8]: surveys_df.head() # The head() method displays the first several lines of a␣
↪file. It
# is discussed below.
[8]:
0
1
2
3
4
record_id
1
2
3
4
5
0
1
2
3
4
weight
NaN
NaN
NaN
NaN
NaN
month
7
7
7
7
7
day
16
16
16
16
16
year
1977
1977
1977
1977
1977
plot_id species_id sex
2
NL
M
3
NL
M
2
DM
F
7
DM
M
3
DM
M
hindfoot_length
32.0
33.0
37.0
36.0
35.0
\
[9]: type(surveys_df)
[9]: pandas.core.frame.DataFrame
[10]: surveys_df.dtypes
[10]: record_id
month
day
year
plot_id
species_id
sex
hindfoot_length
weight
dtype: object
2
int64
int64
int64
int64
int64
object
object
float64
float64
Calculating Statistics From Data In A Pandas DataFrame
We’ve read our data into Python. Next, let’s perform some quick summary statistics to learn more
about the data that we’re working with. We might want to know how many animals were collected
in each site, or how many of each species were caught. We can perform summary stats quickly
using groups. But first we need to figure out what we want to group by.
Let’s begin by exploring our data:
3
[11]: # Look at the column names
surveys_df.columns
# which returns:
[11]: Index(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex',
'hindfoot_length', 'weight'],
dtype='object')
Let’s get a list of all the species. The pd.unique function tells us all of the unique values in the
species_id column.
[12]: pd.unique(surveys_df['species_id'])
[12]: array(['NL',
'OL',
'RF',
'SC',
'PB',
3
'DM',
'RM',
'PC',
'BA',
'PL',
'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
'PX', 'CT', 'US'], dtype=object)
CHALLENGE - STATISTICS
Create a list of unique site IDs (“plot_id”) found in the surveys data. Call it site_names. How
many unique sites are there in the data? How many unique species are in the data?
What is the difference between len(site_names) and surveys_df[‘plot_id’].nunique()?
4
The solution
1. site_names = pd.unique(surveys_df[“plot_id”])
-How many unique sites are in the data? site_names.size or len(site_names) provide the answer:
24
-How many unique species are in the data? len(pd.unique(surveys_df[“species_id”])) tells us there
are 49 species
2. len(site_names) and surveys_df[‘plot_id’].nunique() both provide the same output: they
are alternative ways of getting the unique values. The nunique method combines the count
and unique value extraction, and can help avoid the creation of intermediate variables like
site_names.
5
Groups in Pandas
We often want to calculate summary statistics grouped by subsets or attributes within fields of our
data. For example, we might want to calculate the average weight of all individuals per site.
We can calculate basic statistics for all records in a single column using the syntax below:
[13]: surveys_df['weight'].describe()
4
[13]: count
32283.000000
mean
42.672428
std
36.631259
min
4.000000
25%
20.000000
50%
37.000000
75%
48.000000
max
280.000000
Name: weight, dtype: float64
We can also extract one specific metric if we wish:
[14]: surveys_df['weight'].min()
surveys_df['weight'].max()
surveys_df['weight'].mean()
surveys_df['weight'].std()
surveys_df['weight'].count()
[14]: 32283
[15]: surveys_df['weight'].min()
[15]: 4.0
[16]: surveys_df['weight'].max()
[16]: 280.0
[17]: surveys_df['weight'].mean()
[17]: 42.672428212991356
[18]: surveys_df['weight'].std()
[18]: 36.63125947458399
[19]: surveys_df['weight'].count()
[19]: 32283
But if we want to summarize by one or more variables, for example sex, we can use Pandas’ .groupby
method. Once we’ve created a groupby DataFrame, we can quickly calculate summary statistics
by a group of our choice.
[20]: # Group data by sex
grouped_data = surveys_df.groupby('sex')
[21]: grouped_data.mean(numeric_only=True)
5
[21]:
sex
F
M
sex
F
M
record_id
month
day
year
plot_id
18036.412046
17754.835601
6.583047
6.392668
16.007138
16.184286
1990.644997
1990.480401
11.440854
11.098282
hindfoot_length
weight
28.836780
29.709578
42.170555
42.995379
\
[22]: # Summary statistics for all numeric columns by sex
grouped_data.describe()
# Provide the mean for each numeric column by sex
grouped_data.mean(numeric_only=True)
[22]:
sex
F
M
sex
F
M
record_id
month
day
year
plot_id
18036.412046
17754.835601
6.583047
6.392668
16.007138
16.184286
1990.644997
1990.480401
11.440854
11.098282
hindfoot_length
weight
28.836780
29.709578
42.170555
42.995379
\
[23]: grouped_data.mean(numeric_only=True)
[23]:
sex
F
M
sex
F
M
record_id
month
day
year
plot_id
18036.412046
17754.835601
6.583047
6.392668
16.007138
16.184286
1990.644997
1990.480401
11.440854
11.098282
hindfoot_length
weight
28.836780
29.709578
42.170555
42.995379
\
The groupby command is powerful in that it allows us to quickly generate summary stats.
6
CHALLENGE - SUMMARY DATA
1-How many recorded individuals are female F and how many male M?
2-What happens when you group by two columns using the following syntax and then calculate
mean values?
–grouped_data2 = surveys_df.groupby([‘plot_id’, ‘sex’])
–grouped_data2.mean(numeric_only=True)
6
3- Summarize weight values for each site in your data. HINT: you can use the following syntax to
only create summary statistics for one column in your data. by_site[‘weight’].describe()
7
The solution
1-The first column of output from grouped_data.describe() (count) tells us that the data contains
15690 records for female individuals and 17348 records for male individuals. Note that these two
numbers do not sum to 35549, the total number of rows we know to be in the surveys_df DataFrame.
Why do you think some records were excluded from the grouping?
2-Calling the mean() method on data grouped by these two columns calculates and returns the
mean value for each combination of plot and sex. Note that the mean is not meaningful for some
variables, e.g. day, month, and year. You can specify particular columns and particular summary
statistics using the agg() method (short for aggregate), e.g. to obtain the last survey year, median
foot-length and mean weight for each plot/sex combination:
[24]: surveys_df.groupby(['plot_id', 'sex']).agg({"year": 'max',
"hindfoot_length": 'median',
"weight": 'mean'})
[24]:
plot_id sex
1
F
M
2
F
M
3
F
M
4
F
M
5
F
M
6
F
M
7
F
M
8
F
M
9
F
M
10
F
M
11
F
M
12
F
M
13
F
M
year
hindfoot_length
weight
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
34.0
36.0
33.0
33.0
22.0
22.0
36.0
36.0
34.0
34.0
25.0
25.0
20.0
20.0
35.0
36.0
36.0
36.0
17.0
17.0
35.0
36.0
35.0
35.0
25.0
26.0
46.311138
55.950560
52.561845
51.391382
31.215349
34.163241
46.818824
48.888119
40.974806
40.708551
36.352288
36.867388
20.006135
21.194719
45.623011
49.641372
53.618469
49.519309
17.094203
19.971223
43.515075
43.366197
49.831731
48.909710
40.524590
40.097754
7
14
15
16
17
18
19
20
21
22
23
24
F
M
F
M
F
M
F
M
F
M
F
M
F
M
F
M
F
M
F
M
F
M
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
2002
35.0
36.0
21.0
21.0
21.0
21.0
34.0
35.0
25.0
26.0
21.0
21.0
25.0
26.0
24.0
21.0
36.0
36.0
18.0
18.0
22.0
21.0
47.355491
45.159378
26.670236
27.523691
25.810427
23.811321
48.176201
47.558853
36.963514
43.546952
21.978599
20.306878
52.624406
44.197279
25.974832
22.772622
53.647059
54.572531
20.564417
18.941463
47.914405
39.321503
3-surveys_df.groupby([‘plot_id’])[‘weight’].describe()
[25]: surveys_df.groupby(['plot_id'])['weight'].describe()
[25]:
plot_id
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
count
mean
std
min
25%
50%
75%
max
1903.0
2074.0
1710.0
1866.0
1092.0
1463.0
638.0
1781.0
1811.0
279.0
1793.0
2219.0
1371.0
1728.0
869.0
480.0
1893.0
1351.0
1084.0
51.822911
52.251688
32.654386
47.928189
40.947802
36.738893
20.663009
47.758001
51.432358
18.541219
43.451757
49.496169
40.445660
46.277199
27.042578
24.585417
47.889593
40.005922
21.105166
38.176670
46.503602
35.641630
32.886598
34.086616
30.648310
21.315325
33.192194
33.724726
20.290806
28.975514
41.630035
34.042767
27.570389
35.178142
17.682334
35.802399
38.480856
13.269840
4.0
5.0
4.0
4.0
5.0
5.0
4.0
5.0
6.0
4.0
5.0
6.0
5.0
5.0
4.0
4.0
4.0
5.0
4.0
30.0
24.0
14.0
30.0
21.0
18.0
11.0
26.0
36.0
10.0
26.0
26.0
20.5
36.0
11.0
12.0
27.0
17.5
11.0
44.0
41.0
23.0
43.0
37.0
30.0
17.0
44.0
45.0
12.0
42.0
42.0
33.0
44.0
18.0
20.0
42.0
30.0
19.0
53.0
50.0
36.0
50.0
48.0
45.0
23.0
51.0
50.0
21.0
48.0
50.0
45.0
49.0
26.0
34.0
50.0
44.0
27.0
231.0
278.0
250.0
200.0
248.0
243.0
235.0
178.0
275.0
237.0
212.0
280.0
241.0
222.0
259.0
158.0
216.0
256.0
139.0
8
20
21
22
23
24
8
1222.0
1029.0
1298.0
369.0
960.0
48.665303
24.627794
54.146379
19.634146
43.679167
50.111539
21.199819
38.743967
18.382678
45.936588
5.0
4.0
5.0
4.0
4.0
17.0
10.0
29.0
10.0
19.0
31.0
22.0
42.0
14.0
27.5
47.0
31.0
54.0
23.0
45.0
223.0
190.0
212.0
199.0
251.0
Quickly Creating Summary Counts in Pandas
Let’s next count the number of samples for each species. We can do this in a few ways, but we’ll
use groupby combined with a count() method.
[26]: # Count the number of samples by species
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)
species_id
AB
303
AH
437
AS
2
BA
46
CB
50
CM
13
CQ
16
CS
1
CT
1
CU
1
CV
1
DM
10596
DO
3027
DS
2504
DX
40
NL
1252
OL
1006
OT
2249
OX
12
PB
2891
PC
39
PE
1299
PF
1597
PG
8
PH
32
PI
9
PL
36
PM
899
PP
3123
PU
5
PX
6
9
RF
75
RM
2609
RO
8
2
RX
SA
75
SC
1
SF
43
SH
147
SO
43
SS
248
ST
1
SU
5
UL
4
UP
8
UR
10
US
4
ZL
2
Name: record_id, dtype: int64
Or, we can also count just the rows that have the species “DO”:
[27]: surveys_df.groupby('species_id')['record_id'].count()['DO']
[27]: 3027
9
CHALLENGE - MAKE A LIST
What’s another way to create a list of species and associated count of the records in the data?
Hint: you can perform count, min, etc. functions on groupby DataFrames in the same way you can
perform them on regular DataFrames.
10 The solution
As well as calling count() on the record_id column of the grouped DataFrame as above, an equivalent result can be obtained by extracting record_id from the result of count() called directly on
the grouped DataFrame:
[28]: surveys_df.groupby('species_id').count()['record_id']
[28]: species_id
AB
303
AH
437
AS
2
BA
46
CB
50
CM
13
CQ
16
CS
1
10
CT
1
CU
1
CV
1
DM
10596
DO
3027
DS
2504
DX
40
NL
1252
OL
1006
OT
2249
OX
12
PB
2891
PC
39
PE
1299
PF
1597
PG
8
PH
32
PI
9
PL
36
PM
899
PP
3123
PU
5
PX
6
RF
75
RM
2609
RO
8
RX
2
SA
75
SC
1
SF
43
SH
147
SO
43
SS
248
ST
1
SU
5
UL
4
UP
8
UR
10
US
4
ZL
2
Name: record_id, dtype: int64
11 Basic Math Functions
If we wanted to, we could apply a mathematical operation like addition or division on an entire
column of our data. For example, let’s multiply all weight values by 2.
11
[29]: # Multiply all weight values by 2
surveys_df['weight']*2
[29]: 0
1
2
3
4
NaN
NaN
NaN
NaN
NaN
…
35544
NaN
35545
NaN
35546
28.0
35547
102.0
35548
NaN
Name: weight, Length: 35549, dtype: float64
A more practical use of this might be to normalize the data according to a mean, area, or some
other value calculated from our data.
12 Plotting Data Using Pandas
We can plot our summary stats using Pandas, too.
[30]: # Make sure figures appear inline in Ipython Notebook
%matplotlib inline
# Create a quick bar chart
species_counts.plot(kind='bar');
12
We can also look at how many animals were captured in each site:
[31]: total_count = surveys_df.groupby('plot_id')['record_id'].nunique()
# Let's plot that too
total_count.plot(kind='bar');
13
13 CHALLENGE - PLOTS
1-Create a plot of average weight across all species per site.
2-Create a plot of total males versus total females for the entire dataset.
1-surveys_df.groupby(‘plot_id’).mean()[“weight”].plot(kind=‘bar’)
[32]: surveys_df.groupby('plot_id').mean()["weight"].plot(kind='bar');
14
2-surveys_df.groupby(‘sex’).count()[“record_id”].plot(kind=‘bar’)
[33]: surveys_df.groupby('sex').count()["record_id"].plot(kind='bar')
[33]: <matplotlib.axes._subplots.AxesSubplot at 0xdec1d30>
15
14 SUMMARY PLOTTING CHALLENGE
A simple example with some data where ‘a’, ‘b’, and ‘c’ are the groups, and ‘one’ and ‘two’ are the
subgroups.
[34]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two' : pd.
↪Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)
[34]:
a
b
c
d
one
1.0
2.0
3.0
NaN
two
1.0
2.0
3.0
4.0
We can plot the above with
[35]: # Plot stacked data so columns 'one' and 'two' are stacked
my_df = pd.DataFrame(d)
my_df.plot(kind='bar', stacked=True, title="The title of my graph")
[35]: <matplotlib.axes._subplots.AxesSubplot at 0xdf0b5d0>
16
[ ]:
17
Download