Uploaded by Haritha Machanuru

Assignment-8

Data Science Tools - Week 7 Assignment
Chenna Krishna Reddy Bhumi Reddy
2023-03-01
1. Introduction
library("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0      v purrr   1.0.1
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.1      v stringr 1.5.0
## v readr   2.1.3      v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
2. Tidy Data
2.1. Textbook Exercise 2
We need to divide cases by population for each country and year to compute the rate of cases per 10,000 people.
This is simplest in a data frame whose rows denote (country, year) combinations and in which cases and
population are separate columns.
For table2, first split the cases and population rows into two tables, and make sure both are sorted in the
same order.
t2_cases <- filter(table2, type == "cases") %>%
  rename(cases = count) %>%
  arrange(country, year)
t2_population <- filter(table2, type == "population") %>%
  rename(population = count) %>%
  arrange(country, year)
Next, make a new data frame with columns for the population and cases, and add a new column for the
cases per inhabitant calculation.
t2_cases_per_cap <- tibble(
  year = t2_cases$year,
  country = t2_cases$country,
  cases = t2_cases$cases,
  population = t2_population$population
) %>%
  mutate(cases_per_cap = (cases / population) * 10000) %>%
  select(country, year, cases_per_cap)
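The same rate can also be computed without assembling the columns by hand. The following is a minimal sketch of an alternative, not the approach used above: it assumes pivot_wider() is used to put cases and population side by side, and the name rates is hypothetical.

```r
library(tidyr)
library(dplyr)

# table2 ships with tidyr: each (country, year) has one "cases" row
# and one "population" row.
rates <- table2 %>%
  # Move the two types into their own columns.
  pivot_wider(names_from = type, values_from = count) %>%
  # Rate of cases per 10,000 people.
  mutate(cases_per_cap = cases / population * 10000) %>%
  select(country, year, cases_per_cap)

rates
```

Because both values end up in the same row, this version does not depend on the two tables being sorted identically.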
We will add new rows to table2 in order to store this new variable in the proper spot.
t2_cases_per_cap <- t2_cases_per_cap %>%
  mutate(type = "cases_per_cap") %>%
  rename(count = cases_per_cap)
bind_rows(table2, t2_cases_per_cap) %>%
  arrange(country, year, type, count)
## # A tibble: 18 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <dbl>
##  1 Afghanistan  1999 cases         7.45e+2
##  2 Afghanistan  1999 cases_per_cap 3.73e-1
##  3 Afghanistan  1999 population    2.00e+7
##  4 Afghanistan  2000 cases         2.67e+3
##  5 Afghanistan  2000 cases_per_cap 1.29e+0
##  6 Afghanistan  2000 population    2.06e+7
##  7 Brazil       1999 cases         3.77e+4
##  8 Brazil       1999 cases_per_cap 2.19e+0
##  9 Brazil       1999 population    1.72e+8
## 10 Brazil       2000 cases         8.05e+4
## 11 Brazil       2000 cases_per_cap 4.61e+0
## 12 Brazil       2000 population    1.75e+8
## 13 China        1999 cases         2.12e+5
## 14 China        1999 cases_per_cap 1.67e+0
## 15 China        1999 population    1.27e+9
## 16 China        2000 cases         2.14e+5
## 17 China        2000 cases_per_cap 1.67e+0
## 18 China        2000 population    1.28e+9
Keep in mind that since cases_per_cap is not an integer, the type of count is coerced to double once the
cases_per_cap rows are added.
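This coercion can be seen in isolation. The sketch below uses two hypothetical one-column tibbles (ints and dbls are illustrative names, and the values are made up for the example) to show how bind_rows() combines an integer column with a double column.

```r
library(dplyr)

# Hypothetical inputs: one integer column and one double column.
ints <- tibble(count = c(745L, 2666L))
dbls <- tibble(count = c(0.373, 1.29))

# bind_rows() finds a common type for columns with the same name.
combined <- bind_rows(ints, dbls)

typeof(ints$count)
typeof(combined$count)
```

The integer values are kept, but they are stored as doubles in the combined column.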
For table4a and table4b, create a new table of cases per 10,000 people, with one row per country and one
column per year, which we'll call table4c.
table4c <- tibble(
  country = table4a$country,
  `1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
  `2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
)
2.2. Textbook Exercise 3
We must first filter the table so that only the rows representing TB cases are included before we can create
the plot showing the shift in cases over time.
table2 %>%
  filter(type == "cases") %>%
  ggplot(aes(year, count)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country)) +
  scale_x_continuous(breaks = unique(table2$year)) +
  ylab("cases")
[Figure: TB cases (y-axis "cases", 0 to 200,000) by year (x-axis "year", 1999 and 2000), one grey line per country with points coloured by country: Afghanistan, Brazil, China.]
3. Spreading and Gathering
3.1. Textbook Exercise 1
gather() and spread() are not perfectly symmetrical because column type information is lost. When gather()
moves column names into the "key" column, they become character values, and gather() does not guess whether
they should be logical, numeric, or anything else. The workaround is to specify convert = TRUE, which attempts
to convert the "key" column to the appropriate class.
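A small round trip illustrates the point. This is a sketch on a hypothetical tibble (stocks_half and the result names are illustrative, not part of the assignment data).

```r
library(tidyr)

# Hypothetical data: year starts out as a double column.
stocks_half <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

# Spreading turns the year values into column names.
wide <- stocks_half %>% spread(year, return)

# Gathering back: the column names come back as character strings...
as_character <- wide %>% gather("year", "return", `2015`:`2016`)
# ...unless convert = TRUE asks gather() to guess a better type.
as_integer <- wide %>% gather("year", "return", `2015`:`2016`, convert = TRUE)

class(as_character$year)
class(as_integer$year)
```

Even with convert = TRUE the round trip is not exact: year started as a double and comes back as an integer.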
3.2. Textbook Exercise 2
The code fails because gather() cannot locate the columns: bare numbers are interpreted as column positions
rather than column names. In R, column names that are numbers must be quoted with backticks (`).
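A minimal sketch of the quoted form, assuming the exercise is applied to table4a from tidyr (the name tidy4a is hypothetical):

```r
library(tidyr)

# Backticks make `1999` and `2000` refer to columns by name,
# not by position.
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")

tidy4a
```

table4a has only three columns, so an unquoted 1999 would point at a column position that does not exist.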
3.3. Textbook Exercise 3
people <- tribble(
  ~name,             ~key,     ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",    45,
  "Phillip Woods",   "height", 186,
  "Phillip Woods",   "age",    50,
  "Jessica Cordero", "age",    37,
  "Jessica Cordero", "height", 156
)
Spreading this data frame fails because Phillip Woods has two different age values, so the combination of
name and key does not uniquely identify a row. The issue can be fixed by adding a unique id column that
distinguishes the observations.
people %>%
  mutate(unique_id = c(1, 2, 2, 3, 3)) %>%
  select(unique_id, everything()) %>%
  spread(key, value)
## # A tibble: 3 x 4
##   unique_id name              age height
##       <dbl> <chr>           <dbl>  <dbl>
## 1         1 Phillip Woods      45     NA
## 2         2 Phillip Woods      50    186
## 3         3 Jessica Cordero    37    156
4. Separating and Uniting
4.1. Textbook Exercise 2
The remove argument controls whether the input columns are dropped from the resulting data frame. Set it to
FALSE if you want to create the new variables while keeping the existing one.
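A short sketch of the difference, using a hypothetical one-column tibble (rates_chr and kept are illustrative names):

```r
library(tidyr)

# Hypothetical input: a rate stored as "cases/population" text.
rates_chr <- tibble(rate = c("745/19987071", "2666/20595360"))

# remove = FALSE keeps the original rate column next to the new ones.
kept <- rates_chr %>%
  separate(rate, into = c("cases", "population"), sep = "/", remove = FALSE)

names(kept)
```

With the default remove = TRUE, the result would contain only cases and population.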
4.2. Textbook Exercise 3
The function separate() splits one column into several. When sep is a character string, the split happens
wherever the separator matches; when sep is numeric, the split happens at the given character positions.
tibble(x = c("X_1", "X_2", "AA_1", "AA_2")) %>%
  separate(x, c("variable", "into"), sep = "_")
## # A tibble: 4 x 2
##   variable into
##   <chr>    <chr>
## 1 X        1
## 2 X        2
## 3 AA       1
## 4 AA       2
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
  separate(x, c("variable", "into"), sep = c(1))
## # A tibble: 4 x 2
##   variable into
##   <chr>    <chr>
## 1 X        1
## 2 X        2
## 3 Y        1
## 4 Y        2
The function extract() splits a single character column into several columns, using capture groups in a
regular expression to define the pieces. Because it does not require a common separator or fixed character
positions, it is more flexible than separate().
tibble(x = c("X_1", "X_2", "AA_1", "AA_2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z])_([0-9])")
## # A tibble: 4 x 2
##   variable id
##   <chr>    <chr>
## 1 X        1
## 2 X        2
## 3 A        1
## 4 A        2
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z])([0-9])")
## # A tibble: 4 x 2
##   variable id
##   <chr>    <chr>
## 1 X        1
## 2 X        2
## 3 Y        1
## 4 Y        2
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z]+)([0-9]+)")
## # A tibble: 4 x 2
##   variable id
##   <chr>    <chr>
## 1 X        1
## 2 X        20
## 3 AA       11
## 4 AA       2
Both separate() and extract() split a single column into multiple columns. unite(), on the other hand,
merges multiple columns into one, optionally placing a separator between the column values.
tibble(variable = c("X", "X", "Y", "Y"), id = c(1, 2, 1, 2)) %>%
  unite(x, variable, id, sep = "_")
## # A tibble: 4 x 1
##   x
##   <chr>
## 1 X_1
## 2 X_2
## 3 Y_1
## 4 Y_2
5. Missing Values
5.1. Textbook Exercise 1
Both the values_fill argument of pivot_wider() and the fill argument of complete() specify values to use in
place of missing values. values_fill accepts either a single value, which is applied to every value column,
or a named list giving a value for each column. The fill argument of complete() takes a named list, which
allows different fill values for different variables. The two differ in which missing values they replace:
in the output below, values_fill fills only the implicit missing value created by pivoting and leaves the
explicit NA for 2015, quarter 4 in place, whereas fill in complete() replaces both implicit and explicit
missing values.
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks %>%
  pivot_wider(names_from = year, values_from = return,
              values_fill = 0)
## # A tibble: 4 x 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88   0
## 2     2   0.59   0.92
## 3     3   0.35   0.17
## 4     4  NA      2.66
stocks %>%
  complete(year, qtr, fill = list(return = 0))
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4   0
## 5  2016     1   0
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66
5.2. Textbook Exercise 2
In fill(), the .direction argument decides whether NA values are replaced by the previous non-missing value
("down", the default) or by the next non-missing value ("up").
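A minimal sketch of both directions on a hypothetical tibble (scores, down, and up are illustrative names):

```r
library(tidyr)

# Hypothetical data with gaps at the start and in the middle.
scores <- tibble(id = 1:5, value = c(NA, 10, NA, NA, 20))

# "down": carry the previous non-missing value forward.
down <- scores %>% fill(value, .direction = "down")
# "up": carry the next non-missing value backward.
up <- scores %>% fill(value, .direction = "up")

down$value
up$value
```

Filling down leaves the leading NA in place because there is no earlier value to carry forward; filling up replaces it with 10.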
6. Case Study
6.1. Textbook Exercise 1
Whether na.rm = TRUE is acceptable depends on how missing values are represented in this dataset. The
key question is whether a missing value means that there were no TB cases or that the WHO has no
data on the number of TB cases. A few indicators help tell these situations apart. If the data contain no
0 values, missing values may be standing in for zero cases.
If there are both explicit and implicit missing values, they may be used in different ways. In that case,
explicit missing values would probably indicate no cases, and implicit missing values would indicate that
there are no data on the number of cases.
I'll start by checking whether there are any zeros in the data.
who1 <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  )
who1 %>%
  filter(cases == 0) %>%
  nrow()
## [1] 11080
Since the data contain zeros, zero TB cases appear to be recorded explicitly, and NA is used to denote
missing data.
Second, I should determine whether all values for a (country, year) are missing together, or whether only
some of the columns can be missing.
pivot_longer(who, c(new_sp_m014:newrel_f65), names_to = "key", values_to = "cases") %>%
  group_by(country, year) %>%
  mutate(prop_missing = sum(is.na(cases)) / n()) %>%
  filter(prop_missing > 0, prop_missing < 1)
## # A tibble: 195,104 x 7
## # Groups:   country, year [3,484]
##    country     iso2  iso3   year key          cases prop_missing
##    <chr>       <chr> <chr> <int> <chr>        <int>        <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0         0.75
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10         0.75
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6         0.75
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3         0.75
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5         0.75
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2         0.75
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0         0.75
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5         0.75
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38         0.75
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36         0.75
## # ... with 195,094 more rows
According to the results above, a (country, year) row can have missing values in some but not all of its
columns.
Finally, I'll look for implicit missing values. These are (country, year) combinations that do not appear
in the data.
nrow(who)
## [1] 7240
who %>%
  complete(country, year) %>%
  nrow()
## [1] 7446
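The difference between the two counts, 7446 - 7240 = 206, is the number of implicitly missing (country, year) combinations. A sketch of one way to list them, assuming expand() and anti_join() are used (the name implicit_missing is hypothetical):

```r
library(tidyr)
library(dplyr)

# All possible (country, year) pairs, minus the pairs that are
# actually present in who.
implicit_missing <- who %>%
  expand(country, year) %>%
  anti_join(who, by = c("country", "year"))

nrow(implicit_missing)
head(implicit_missing)
```

Looking at the listed rows helps judge whether a combination is absent for a structural reason or because data were simply not reported.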
6.2. Textbook Exercise 2
The separate() function issues a warning when a value splits into too few pieces. Looking at the rows whose
key starts with "newrel", we find that the key yields only two pieces, so type holds the sex and age code
(for example m014) and sexage is missing.
who3a <- who1 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].
filter(who3a, new == "newrel") %>% head()
## # A tibble: 6 x 8
##   country     iso2  iso3   year new    type  sexage cases
##   <chr>       <chr> <chr> <int> <chr>  <chr> <chr>  <int>
## 1 Afghanistan AF    AFG    2013 newrel m014  <NA>    1705
## 2 Afghanistan AF    AFG    2013 newrel f014  <NA>    1749
## 3 Albania     AL    ALB    2013 newrel m014  <NA>      14
## 4 Albania     AL    ALB    2013 newrel m1524 <NA>      60
## 5 Albania     AL    ALB    2013 newrel m2534 <NA>      61
## 6 Albania     AL    ALB    2013 newrel m3544 <NA>      32
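The fill argument of separate() controls which side receives the NA when there are too few pieces. The sketch below uses a hypothetical two-row tibble (keys, right, and left are illustrative names) rather than the full who data.

```r
library(tidyr)

# Hypothetical keys: one with three pieces, one with only two.
keys <- tibble(key = c("new_sp_m014", "newrel_m014"))

# fill = "right" pads with NA on the right, which matches what the
# warning reports for the default behaviour.
right <- keys %>%
  separate(key, c("new", "type", "sexage"), sep = "_", fill = "right")

# fill = "left" pads with NA on the left instead.
left <- keys %>%
  separate(key, c("new", "type", "sexage"), sep = "_", fill = "left")

right
left
```

Neither option puts every piece of the two-piece key in its intended column, which is why the key itself has to be repaired before separating.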
6.3. Textbook Exercise 4
who2 <- who1 %>%
  mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
who3 <- who2 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].
who3 %>%
  count(new)
## # A tibble: 2 x 2
##   new        n
##   <chr>  <int>
## 1 new    73466
## 2 newrel  2580
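The warning persists, and count(new) still reports 2580 newrel rows, because the mutate() call above writes the replaced string into a new column called names_from and leaves key untouched. The sketch below shows the intended repair on a hypothetical two-row tibble (keys, fixed, and the case values are made up for the example), overwriting key before separating.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for who1: one regular key, one "newrel" key.
keys <- tibble(key = c("new_sp_m014", "newrel_f65"), cases = c(0L, 5L))

fixed <- keys %>%
  # Overwrite key so that every value has three "_"-separated pieces.
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_")

fixed
```

With key repaired, every row splits into three pieces, so sexage is never NA.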
who4 <- who3 %>%
  select(-new, -iso2, -iso3)
who5 <- who4 %>%
  separate(sexage, c("sex", "age"), sep = 1)
who5 %>%
  group_by(country, year, sex) %>%
  filter(year > 1995) %>%
  summarise(cases = sum(cases)) %>%
  unite(country_sex, country, sex, remove = FALSE) %>%
  ggplot(aes(x = year, y = cases, group = country_sex, colour = sex)) +
  geom_line()
## `summarise()` has grouped output by 'country', 'year'. You can override using
## the `.groups` argument.
[Figure: total TB cases (y-axis "cases", 0 to 800,000) by year (x-axis "year", ticks at 2000, 2005, 2010), one line per country and sex, coloured by sex: f, m, NA.]
The large number of countries makes it impractical to facet a small multiples plot by each one. Given the
plot above, another option is to concentrate on the countries with the largest changes or the largest
absolute magnitudes.