Data Science Tools - Week 7 Assignment
Chenna Krishna Reddy Bhumi Reddy
2023-03-01

1. Introduction

library("tidyverse")

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0     v purrr   1.0.1
## v tibble  3.1.8     v dplyr   1.0.10
## v tidyr   1.2.1     v stringr 1.5.0
## v readr   2.1.3     v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

2. Tidy Data

2.1. Textbook Exercise 2

To compute cases per capita, we need to divide cases by population for each country and year. This is simplest when the data frame has one row per (country, year) combination and cases and population are stored in two separate columns.

Table 2: First, split cases and population into two tables, and make sure both are sorted in the same order.

t2_cases <- filter(table2, type == "cases") %>%
  rename(cases = count) %>%
  arrange(country, year)
t2_population <- filter(table2, type == "population") %>%
  rename(population = count) %>%
  arrange(country, year)

Next, build a new data frame with columns for cases and population, and add a column for cases per capita (expressed here per 10,000 people).

t2_cases_per_cap <- tibble(
  year = t2_cases$year,
  country = t2_cases$country,
  cases = t2_cases$cases,
  population = t2_population$population
) %>%
  mutate(cases_per_cap = (cases / population) * 10000) %>%
  select(country, year, cases_per_cap)

To store this new variable in the appropriate place, we add new rows to table2.

t2_cases_per_cap <- t2_cases_per_cap %>%
  mutate(type = "cases_per_cap") %>%
  rename(count = cases_per_cap)
bind_rows(table2, t2_cases_per_cap) %>%
  arrange(country, year, type, count)

## # A tibble: 18 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <dbl>
##  1 Afghanistan  1999 cases         7.45e+2
##  2 Afghanistan  1999 cases_per_cap 3.73e-1
##  3 Afghanistan  1999 population    2.00e+7
##  4 Afghanistan  2000 cases         2.67e+3
##  5 Afghanistan  2000 cases_per_cap 1.29e+0
##  6 Afghanistan  2000 population    2.06e+7
##  7 Brazil       1999 cases         3.77e+4
##  8 Brazil       1999 cases_per_cap 2.19e+0
##  9 Brazil       1999 population    1.72e+8
## 10 Brazil       2000 cases         8.05e+4
## 11 Brazil       2000 cases_per_cap 4.61e+0
## 12 Brazil       2000 population    1.75e+8
## 13 China        1999 cases         2.12e+5
## 14 China        1999 cases_per_cap 1.67e+0
## 15 China        1999 population    1.27e+9
## 16 China        2000 cases         2.14e+5
## 17 China        2000 cases_per_cap 1.67e+0
## 18 China        2000 population    1.28e+9

Note that because cases_per_cap is not an integer, the count column is coerced from integer to double once the cases_per_cap rows are added.

Tables 4a and 4b: Create a new table, table4c, with one row per country and one column per year, holding the cases per 10,000 people computed from table4a and table4b.

table4c <- tibble(
  country = table4a$country,
  `1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
  `2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
)

2.2. Textbook Exercise 3

Before we can plot the change in cases over time, we must filter the table so that it contains only the rows representing TB cases.

table2 %>%
  filter(type == "cases") %>%
  ggplot(aes(year, count)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country)) +
  scale_x_continuous(breaks = unique(table2$year)) +
  ylab("cases")

[Figure: TB cases (y-axis, 0 to 200000) by year (x-axis, 1999 and 2000), one grey line per country with coloured points; legend "country": Afghanistan, Brazil, China]

3. Spreading and Gathering

3.1. Textbook Exercise 1

gather() and spread() are not perfectly symmetrical because column type information is lost along the way. When gather() moves the column names into the "key" column, they are stored as character values; gather() has no way of knowing whether they were originally logical, numeric, or something else. A workaround is to specify convert = TRUE, which attempts to convert the "key" column to the appropriate class.

3.2. Textbook Exercise 2

The call fails because gather() cannot locate the columns: the bare numbers 1999 and 2000 are interpreted as column positions rather than column names. In R, non-syntactic column names, such as names that start with a digit, must be quoted with backticks (`1999`, `2000`).

3.3. Textbook Exercise 3

people <- tribble(
  ~name,             ~key,     ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

Spreading this tibble fails because Phillip Woods has two different age values, so the combination of name and key does not uniquely identify a row. The issue can be fixed by adding an id column that gives each observation its own id.

people %>%
  mutate(unique_id = c(1, 2, 2, 3, 3)) %>%
  select(unique_id, everything()) %>%
  spread(key, value)

## # A tibble: 3 x 4
##   unique_id name              age height
##       <dbl> <chr>           <dbl>  <dbl>
## 1         1 Phillip Woods      45     NA
## 2         2 Phillip Woods      50    186
## 3         3 Jessica Cordero    37    156

4. Separating and Uniting

4.1. Textbook Exercise 2

The remove argument controls whether the input columns are dropped from the resulting data frame. It defaults to TRUE; set it to FALSE to create the new variables while keeping the existing one.

4.2. Textbook Exercise 3

The function separate() splits a column into multiple columns, either at a separator if sep is a character vector, or at character positions if sep is numeric.

tibble(x = c("X_1", "X_2", "AA_1", "AA_2")) %>%
  separate(x, c("variable", "into"), sep = "_")

## # A tibble: 4 x 2
##   variable into
##   <chr>    <chr>
## 1 X        1
## 2 X        2
## 3 AA       1
## 4 AA       2

tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
  separate(x, c("variable", "into"), sep = c(1))

## # A tibble: 4 x 2
##   variable into
##   <chr>    <chr>
## 1 X        1
## 2 X        2
## 3 Y        1
## 4 Y        2

The function extract() splits a single character column into several columns by using a regular expression whose capture groups define the pieces. Because it requires neither a common separator nor fixed character positions, it is more flexible than separate().

tibble(x = c("X_1", "X_2", "AA_1", "AA_2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z])_([0-9])")

## # A tibble: 4 x 2
##   variable id
##   <chr>    <chr>
## 1 X        1
## 2 X        2
## 3 A        1
## 4 A        2

tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z])([0-9])")

## # A tibble: 4 x 2
##   variable id
##   <chr>    <chr>
## 1 X        1
## 2 X        2
## 3 Y        1
## 4 Y        2

tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z]+)([0-9]+)")

## # A tibble: 4 x 2
##   variable id
##   <chr>    <chr>
## 1 X        1
## 2 X        20
## 3 AA       11
## 4 AA       2

Both separate() and extract() split a single column into multiple columns. unite(), on the other hand, merges multiple columns into one, optionally placing a separator between the column values.

tibble(variable = c("X", "X", "Y", "Y"), id = c(1, 2, 1, 2)) %>%
  unite(x, variable, id, sep = "_")

## # A tibble: 4 x 1
##   x
##   <chr>
## 1 X_1
## 2 X_2
## 3 Y_1
## 4 Y_2

5. Missing Values

5.1. Textbook Exercise 1

Both the values_fill argument of pivot_wider() and the fill argument of complete() specify a value to use in place of NA. values_fill accepts a single value, or a named list to set a different value for each value column. The fill argument of complete() takes a named list, which allows different values for different variables. The two differ in which missing values they replace. As the output below shows, values_fill = 0 fills only the implicit missing value (2016, quarter 1) and leaves the explicit NA (2015, quarter 4) untouched, whereas fill in complete() replaces both the implicit and the explicit missing value.

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks %>%
  pivot_wider(names_from = year, values_from = return, values_fill = 0)

## # A tibble: 4 x 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88   0
## 2     2   0.59   0.92
## 3     3   0.35   0.17
## 4     4  NA      2.66

stocks %>%
  complete(year, qtr, fill = list(return = 0))

## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4   0
## 5  2016     1   0
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66

5.2. Textbook Exercise 2

In fill(), the .direction argument determines whether NA values are replaced by the previous non-missing value ("down") or by the next non-missing value ("up").

6. Case Study

6.1. Textbook Exercise 1

Whether using na.rm = TRUE is reasonable depends on how missing values are represented in this dataset. The main question is whether a missing value means that there were no TB cases or that the WHO has no information on the number of TB cases. A few indicators help tell these situations apart. If the data contain no zeros, missing values may be used to denote zero cases. If there are both explicit and implicit missing values, the two kinds may be used in different ways.
In that case, explicit missing values would probably indicate zero cases, and implicit missing values would indicate that no data on the number of cases are available.

I'll start by checking whether there are any zeros in the data.

who1 <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  )
who1 %>%
  filter(cases == 0) %>%
  nrow()

## [1] 11080

Since the data contain zeros, it appears that zero TB cases are recorded explicitly and that NA is used to denote missing data.

Second, I should determine whether all values for a (country, year) are missing, or whether it is possible for only some of the columns to be missing.

pivot_longer(who, c(new_sp_m014:newrel_f65),
             names_to = "key", values_to = "cases") %>%
  group_by(country, year) %>%
  mutate(prop_missing = sum(is.na(cases)) / n()) %>%
  filter(prop_missing > 0, prop_missing < 1)

## # A tibble: 195,104 x 7
## # Groups:   country, year [3,484]
##    country     iso2  iso3   year key          cases prop_missing
##    <chr>       <chr> <chr> <int> <chr>        <int>        <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0         0.75
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10         0.75
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6         0.75
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3         0.75
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5         0.75
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2         0.75
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0         0.75
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5         0.75
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38         0.75
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36         0.75
## # ... with 195,094 more rows

These results show that a (country, year) row can have missing values in some, but not all, of its columns.

Finally, I'll check for implicit missing values, that is, (country, year) combinations that do not appear in the data.

nrow(who)

## [1] 7240

who %>%
  complete(country, year) %>%
  nrow()

## [1] 7446

The completed table has 206 more rows than who, so some (country, year) combinations are implicitly missing.

6.2. Textbook Exercise 2

Without the mutate() step, separate() issues a warning about too few values. Looking at the rows where new is "newrel", we see that sexage is missing and that type holds the sex and age code instead (for example, m014).

who3a <- who1 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].

filter(who3a, new == "newrel") %>%
  head()

## # A tibble: 6 x 8
##   country     iso2  iso3   year new    type  sexage cases
##   <chr>       <chr> <chr> <int> <chr>  <chr> <chr>  <int>
## 1 Afghanistan AF    AFG    2013 newrel m014  <NA>    1705
## 2 Afghanistan AF    AFG    2013 newrel f014  <NA>    1749
## 3 Albania     AL    ALB    2013 newrel m014  <NA>      14
## 4 Albania     AL    ALB    2013 newrel m1524 <NA>      60
## 5 Albania     AL    ALB    2013 newrel m2534 <NA>      61
## 6 Albania     AL    ALB    2013 newrel m3544 <NA>      32

6.3. Textbook Exercise 4

who2 <- who1 %>%
  mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
who3 <- who2 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].

who3 %>%
  count(new)

## # A tibble: 2 x 2
##   new        n
##   <chr>  <int>
## 1 new    73466
## 2 newrel  2580

Note that the mutate() call stores the repaired names in a new column, names_from, instead of overwriting key. separate() therefore still sees the original newrel values, which is why the warning persists and count(new) still reports 2580 newrel rows. These rows end up with a missing sexage and appear as the NA sex group in the plot below. Assigning the result to key instead (mutate(key = ...)) would avoid this.

who4 <- who3 %>%
  select(-new, -iso2, -iso3)
who5 <- who4 %>%
  separate(sexage, c("sex", "age"), sep = 1)
who5 %>%
  group_by(country, year, sex) %>%
  filter(year > 1995) %>%
  summarise(cases = sum(cases)) %>%
  unite(country_sex, country, sex, remove = FALSE) %>%
  ggplot(aes(x = year, y = cases, group = country_sex, colour = sex)) +
  geom_line()

## `summarise()` has grouped output by 'country', 'year'. You can override using
## the `.groups` argument.

[Figure: cases (y-axis, 0e+00 to 8e+05) by year (x-axis, 2000 to 2010), one line per country and sex; legend "sex": f, m, NA]

Because there are so many countries, a small-multiples plot faceted by country is impractical. Given that, another option is to focus on the countries with the largest changes or the largest absolute numbers of cases.
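
The idea of concentrating on the countries with the largest absolute magnitudes can be sketched as follows. This is a minimal illustration rather than part of the original analysis: toy is a hypothetical stand-in for the summarised (country, year, sex) data with made-up numbers, and the cutoff of two countries is arbitrary.

```r
library(dplyr)

# Hypothetical stand-in for the summarised data (country, year, sex, cases).
# The numbers are invented for illustration and are not WHO figures.
toy <- tibble(
  country = rep(c("A", "B", "C", "D"), each = 4),
  year    = rep(c(2000L, 2000L, 2001L, 2001L), times = 4),
  sex     = rep(c("f", "m"), times = 8),
  cases   = c(5, 7, 6, 8,  50, 70, 60, 80,  1, 2, 1, 3,  500, 700, 600, 800)
)

# Rank countries by their total number of cases and keep the two largest
top_countries <- toy %>%
  group_by(country) %>%
  summarise(total = sum(cases), .groups = "drop") %>%
  slice_max(total, n = 2) %>%
  pull(country)

# Restrict the data to those countries before plotting
toy_top <- toy %>%
  filter(country %in% top_countries)
```

With only a handful of countries left, toy_top could be passed to the same ggplot() call as above, and faceting by country with facet_wrap(~ country) becomes practical.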