Final Revision
Ethan
2025-01-08

Contents

Introduction to Data Science Final Review
  For Part II: Functions, ggplot, Joins, stringr
  For Part I: Tidy Data, EDA, Graphical Excellence
Solutions to Handouts and Tutorial Exercises
  Week 1: Handout (Arithmetic, Variables), Tutorial Exercise
  Week 2: Handout, Tutorial Exercise
  Week 3: Handout, Tutorial Exercise
  Week 4: Handout, Tutorial
  Week 5: Handout, Tutorial Exercise
  Week 6: Handout, Tutorial Exercise
  Week 7: Handout
  Week 8: Handout, Tutorial Exercise
  Week 9: Exam Sample Questions
  Week 10: Revision Exercise
Sample Exam
  Instructions
  Question 1 (4 points), Question 2 (4 points), Question 3 (18 points), Question 4 (8 points), Question 5 (10 points), Question 6a (6 points), Question 6b (10 points)
  Comment

Introduction to Data Science Final Review

For Part II

Functions

1. Date type
• %Y for the 4-digit year
• %y for the 2-digit year
• %m for the 2-digit month
• %B for the full month name
• %b for the abbreviated month name
• %d for the day of the month as a number (01–31)
• %a for the abbreviated weekday name (e.g., Mon)
• %A for the full weekday name (e.g., Monday)
• %j for the day of the year as a number (001–366)
• %u for the day of the week as a number (1 = Monday, 7 = Sunday)
• %w for the day of the week as a number (0 = Sunday, 6 = Saturday)
Convert characters/numbers into date format using as.Date(x, format = ...), where the format string is built from the codes above.
2. Built-in numerical functions
• round(x, y), where y is the number of decimal places
• sqrt(), square root
• floor(), round down to the nearest integer
• ceiling(), round up to the nearest integer
• seq(x, y), create a sequence of numbers, where x is the start and y is the end; use by = n to specify the common difference
3. Vectors and data frames
• c() to create a vector
• data.frame() to create a data frame; build it from vectors, where each vector becomes one column
• use [] to specify a location in a vector or data frame: [x], [x, y], [x, ], or [, y]
• which() plus a logical condition to get positions
• append() to add elements
4. If and for
• for (variable in vector) {}
• if () {} else if () {} else {} to chain logical conditions
5. Logical Arguments
• NAs
– is.na() returns a logical vector marking missing values
– na.omit() drops any rows containing NAs from a data frame
• between(x, a, b), equivalent to x >= a & x <= b
6. Data Wrangling
• pull(x, y), where x is the data frame and y is the variable name; pulls out a single column and is equivalent to x$y
• cut(x, breaks, include.lowest = TRUE or FALSE, labels = ...)
– x: name of the vector
– breaks: number of breaks to make, or a vector of break points
– labels: labels for the resulting bins.
labels = NULL produces interval labels and labels = FALSE produces integer category labels starting at 1. Otherwise, you can assign your own labels by supplying your own vector of labels.
– include.lowest: Boolean value to include the lowest break value
• quantile(x, probs, na.rm = TRUE or FALSE, type = 7)
– x: numeric vector whose sample quantiles are wanted. NA and NaN values are not allowed unless na.rm is set to TRUE.
– probs: numeric vector of probabilities. E.g., if probs = seq(0, 1, 0.25), then you are asking for the quartiles (0%, 25%, 50%, 75% and 100%).
– na.rm = TRUE or FALSE: if TRUE, any NA and NaN values are removed from x before the quantiles are computed.
– type: an integer between 1 and 9 selecting one of the nine available quantile algorithms. The default is 7 and can usually be left alone.
• gather() reshapes wide data to long: the first argument gives a new column name for the original column labels, the second argument gives a new column name for the values stored under those columns, and -xxx excludes columns from being gathered
– can be replaced by pivot_longer()
• spread() is the opposite of gather(), reshaping long data to wide: the first argument selects the column whose values become the new column names, and the second argument selects the column that supplies the values
– can be replaced by pivot_wider()
• separate(): the first argument is the column to split, the second argument c("x", "y") names the new columns, and the third argument specifies the separator, for example a space, _, -, or .
– can use fill = "right" when some rows split into 3 pieces and others into only 2, so that the last new column is filled with NA for the shorter rows
– alternatively, can use extra = "merge" to merge any extra pieces into the second column
• unite() combines columns: the first argument is the new column name, the second argument is the first column to combine, and the third argument is the second column to combine
• length() returns the length of a vector (how many elements it contains)
• unique() returns only the unique values
7. Dot operator
• Used along with the pipe operator. The pipe passes the object on its left as the first argument of the next function; use the dot operator . when that object needs to go into a different argument.
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.4     v readr     2.1.5
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.3     v tidyr     1.3.1
## v purrr     1.0.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
exponent <- function(x, y) {
  x^y
}
x <- 2
y <- 3
exponent(x, y)
## [1] 8
y %>% exponent(x, .)
## [1] 8

ggplot

• Log scales
– + scale_x_continuous(trans = "log10") or + scale_x_log10()
• Labels
– can use labs() or xlab(), ylab(), ggtitle()
• Histogram
– geom_histogram(bins = x)
– adding a density curve to a histogram:
∗ in aes, specify y = after_stat(density)
∗ then add geom_density()

Joins

• Mutating joins
– left_join(x, y): keeps all observations in x, and matches as many rows in y as possible
– right_join(x, y): keeps all observations in y, and matches as many rows in x as possible
– full_join(x, y): keeps all observations that appear in x or y
– inner_join(x, y): retains only the rows whose keys appear in both x and y
• Filtering joins (these only filter the rows of x and do not add columns from y)
– semi_join(x, y): keeps all rows in x that have a match in y
– anti_join(x, y): returns all rows in x that don’t have a match in y

stringr

Functions
• str_detect() detects specific characters/patterns, returning TRUE/FALSE
• str_replace_all(x, "y", "z") replaces all occurrences of a pattern in each string
• parse_number() removes every non-numeric character
• str_subset() filters to the subset of strings that match the specified characters or patterns
• str_view() shows the matches only; with match = FALSE it shows the strings that don’t match
• str_match() extracts substrings based on capture groups in a regular expression
• str_extract() extracts the first matching substring for a regular expression
• str_replace() replaces the first occurrence of a pattern in each string; use | to replace multiple patterns
• str_trim() removes spaces at the beginning or the end
• str_to_lower() converts everything to lower case
Defining patterns
• \\d matches any digit
• \\s matches any whitespace
• [] matches any character listed inside the brackets
– e.g. [56] returns TRUE with str_detect() if the string contains 5 or 6
– use [4-7] to match digits from 4 to 7
– [^] means "not"
∗ [^a-zA-Z]\\d means a digit preceded by anything except a letter
• . matches any character
• \\D matches any non-digit character
• \\w matches any alphanumeric character
• \\W matches any non-alphanumeric character
• \\S matches any non-whitespace character
• Anchors: define the beginning and the end of a pattern
– ^ defines the beginning
– $ defines the end
• Quantifiers
– {}: the number inside the curly brackets is how many times the previous entry can be repeated
∗ if we want exactly one digit, we use ^\\d$
∗ if we want one or two, we use the quantifier {}: ^\\d{1,2}$ matches one or two digits
– * means zero or more instances of the previous character
– ? means zero or one
– + means one or more
• Groups split a pattern into parts without changing what it matches; specify a group using ()
– ^([4-7]),(\\d*)$ and ^[4-7],\\d*$ detect the same strings, but the first creates two capture groups
– refer back to the ith group using \\i (e.g., \\1 for the first group)

For Part I

Tidy Data

We say that a data table is in tidy format if:
1. each row represents one observation,
2. each column represents a variable available for each of these observations, and
3. each value has its own cell, with each cell containing a single value.
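To tie several of the reviewed ideas together, here is a minimal, self-contained sketch. The wide_wage and population tables and their numbers are made up purely for illustration (they are not module datasets); the code shows the tidy-data conditions in action with pivot_longer(), a mutating join with left_join(), and a stringr pattern with str_detect().

library(tidyverse)

# A small made-up table: the year is stored in the column names,
# so each row currently holds two observations (not tidy)
wide_wage <- data.frame(
  country = c("United Kingdom", "France"),
  `2021` = c(542, 510),
  `2022` = c(571, 525),
  check.names = FALSE
)

# Tidy it with pivot_longer() (gather() would work similarly):
# one row per country-year, one column per variable, one value per cell
long_wage <- wide_wage %>%
  pivot_longer(cols = c(`2021`, `2022`),
               names_to = "year", values_to = "weekly_wage")

# A mutating join from the Joins list: keep every row of long_wage
# and attach the matching population (made-up values, in millions)
population <- data.frame(country = c("United Kingdom", "France"),
                         pop_m = c(67, 68))
joined <- long_wage %>% left_join(population, by = "country")

# A stringr pattern from the list above: keep rows whose country name contains "United"
joined %>% filter(str_detect(country, "United"))

Each row of long_wage is a single country-year observation, which is exactly what the three tidy-data conditions above require.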
EDA

• Variation in one numerical variable
– histogram
– density
– boxplot
• Covariation between two numerical variables
– scatter plot
– bin one variable and use a boxplot
• Covariation between two categorical variables
– count the number of observations of each combination and use geom_count()
– heatmap with geom_tile()
• Covariation between a categorical and a numerical variable
– density lines coloured by category
– boxplot

Graphical Excellence

8 guiding principles for excellence in graphics (Tufte, 2001):
1. Substance: Induce the viewer to think about the substance rather than about other things (e.g., methodology, design, technology, etc.)
2. No distortion: Avoid distorting what the data have to say
3. Information-dense: Present many numbers in a small space
4. Coherence: Make large data sets coherent
5. Encourage comparisons: Encourage the eye to compare different pieces of data
6. Broad-to-fine details: Reveal the data at several levels of detail, from a broad overview to the fine structure
7. Clear purpose: Serve a reasonably clear purpose – why is it graphed in the first place?
8. Alignment with verbal descriptions: Be aligned with the verbal descriptions accompanying the graph

Solutions to Handouts and Tutorial Exercises

Week 1

Handout

NOTE This is an R Notebook, which combines text and code into a single document. The first few lines are the settings of this R Notebook document; please do not change anything there. The text in a Notebook can be given special formatting using something called Markdown; it is not important to understand right away, and we will take a closer look later in the module.
Before starting any coding practice, please follow the slides to create a project and copy the file handout01.Rmd into the project directory. Then open handout01.Rmd inside the project using the bottom-right "Files" panel.
Code in an R Notebook is kept in code chunks that appear in grey shaded boxes, like the one below.
2 + 3
1 + 1
Three backtick symbols are used to open and close a code chunk. The text inside the curly braces, such as {r Intro to Code Chunks}, gives it a name; think of this as the tag or ID of the code chunk. Note that each code chunk should have a unique name.
Run a code chunk by placing your cursor on a line of code inside it and clicking the green arrow on the top line of the code chunk. If you want to run only some of the lines in a code chunk, select the line(s) you want to run, then use the Run button's drop-down arrow and select "Run Selected Line(s)".
If everything is working as expected, you should see the number 5 displayed above, in a white area directly under the code chunk. Notice the [1] that precedes the result of the operation. For now, it is enough to know that this number in square brackets indicates output; we'll talk more about it when we start working with vectors.

Arithmetic

In the above code chunk, we used R as a calculator with two integer numbers. Use the code chunk below to try out the other common mathematical operators (-, *, /) with some floating-point numbers.
# Addition
3.5 + 2.4
# Subtraction
3.5 - 2.4
# Multiplication
3.5 * 2.4
# Division
3.5 / 0.5
In the code chunk above we introduced comments. A comment in R starts with a hash # symbol and is ignored during execution. Comments can exist on their own line, or can come immediately after some code.

Exercise 1: Order of Operations

R will follow the usual operator precedence that applies in mathematics. As expected, we can also apply BODMAS and BIDMAS in the code.
For example, parentheses (round brackets) to force a specific order of evaluation. 8 In the code chunk below, add or remove parentheses so that all three expressions evaluate to the same result: 20 # Fix the parentheses below (2 + 3 + 5)*2 4*(10 - 5) 2*8 + 4 NOTE: Use whitespace above to improve the readability our code for everyone (both your future selves and your tutors!) Best practice: • a space between numbers and operators • a space between operators and parentheses • no space between numbers and parentheses Variables Often, it is useful to store data, or the result of an operation. Like other programming languages, R supports the assignment of variables for this. For now, you can think of variables as having a name, and containing some data. To assign data to a variable in R, we use the arrow assignment operator. # Assign a numeric value to x x <- 3.14 # Assign the result of an operation to y y <- 5 + 1.86 # Sum the variables and assign the result to z z <- x + y # Print output print(z) NOTE: assigning data to variables does not produce any output. In the top- right pane of RStudio, the Environment window (top-right panel) will show any variables you created, so take a look there after running the above code chunk. Alternatively, we can always print a variable to see its value. Now using the idea of variables, add or remove parentheses and mathematical operators in the code chunk below so that all three expressions evaluate to the same result: 20 # define variables a <- 2 b <- 3 c <- 5 # Apply BIDMAS below to obtain 20 in each line (a + b + c)*a (a + a)*(c*a - c) We can re-assign variables, but note that this will overwrite the original value. # Assign a new numeric value to z z <- 22 # Print z print(z) 9 # Assign a new character value to z z <- "Hello!" # Print z print(z) NOTE: Variables hold different types of data. In the above example, z variable was used to first hold numeric data, then character data. Tutorial Exercise Introduction to programming in R Notebook This tutorial includes more exercises about R variables, data types and built-in functions. An important note to make about variables is naming. Variable names must start with a letter and should only contain: • letters • numbers • underscores (to separate words) When naming variables, consider readability. Names like x, y and z are not very descriptive and do not make for very readable code. Great variable names will give information about the context and maybe even units! Consider the difference between the following variable names: • GDP versus uk_GDP_2018 • GDP_growth versus uk_GDP_growth_rate_2019 # GDP unit : m (million) uk_GDP_2018 <- 2203005 uk_GDP_2019 <- 2238348 uk_GDP_2020 <- 1991439 uk_GDP_2021 <- 2142738 uk_GDP_2022 <- 2230625 Run the chunk above then use the variables to find the annual GDP (Gross Domestic Product) growth rate in the UK between 2018 and 2022. 
To calculate the annual GDP growth rate based on the provided GDP, use the following equation: GDPgrowth = (GDPcurrent − GDPprevious )/GDPprevious ∗ 100 # Use the above variables to complete the equation below uk_GDP_growth_rate_2019 <- (uk_GDP_2019 - uk_GDP_2018)/uk_GDP_2018 * 100 print(uk_GDP_growth_rate_2019) uk_GDP_growth_rate_2020 <- (uk_GDP_2020 - uk_GDP_2019)/uk_GDP_2019 * 100 print(uk_GDP_growth_rate_2020) uk_GDP_growth_rate_2021 <- (uk_GDP_2021 - uk_GDP_2020)/uk_GDP_2020 * 100 print(uk_GDP_growth_rate_2021) Numeric and character variables are two of the most common types in R, but as you will see over the course of the module, there are plenty of other types too. 10 Dates Because we use and rely on dates so often for data analyses, R has a special data type just for dates, called Date. Using the Date type instead of a numeric or character type allows for proper comparison and calculation of values that represent days, months and years. We will often find data where dates are given to us as a character type which need to be converted to the Date type before we can start working with them. To convert, we use the as.Date() function and pass in the character date and the format it is in (order of year, month and day). Use the following special notation to specify the format of the date: • %Y for the 4-digit year • %m for the 2-digit month • %d for the 2-digit day # Messy date data as characters (Dec 7 2022) good_day <- "20221207" # Convert with as.Date as.Date(good_day, format = "%Y%m%d") # Sometimes our date has separators (Aug 18 2023) great_day <- "18-08-2023" # as.Date() can handle that too as.Date(great_day, format = "%d-%m-%Y") Exercise 2: Dates Convert the following dates by filling in values for the format argument: # July 1 1867 as.Date("07/01/1867", format = "%m/%d/%Y") # May 22 1927 as.Date("22-05 1927", format = "%d-%m %Y") # September 5 2022 as.Date("2022 09 05", format = "%Y %m %d") # January 1 1970 (hint: use %y to specify a 2-digit year) as.Date("70-01-01", format = "%y-%m-%d") # December 31 1999 (hint: use %b to specify an abbreviated month name) as.Date("31 Dec 1999", format = "%d %b %Y") Built-in Functions as.Date used earlier is a kind of a built-in function in R. For many other mathematical operations in R, we can also use built-in functions in R. Functions are commonplace in most programming languages, and we can think of them as bundling some common functionality into an easy-to-reuse command. For example, let’s use the round() function to practice rounding numbers. Pi <- 3.1415926 # Rounding Pi to the nearest whole number round(Pi) Often (but not always) functions will operate on some input. We pass this information to the function as an 11 argument (A.K.A parameter), entered inside round brackets. When a function needs additional information to know how to behave, we can pass in additional arguments. # Rounding Pi to two decimal places round(Pi, digits = 2) Arguments are separated by commas, and the order in which they are passed is important (also note whitespace). Where possible, refer to arguments by name. A popular functionality across most programming languages is the ability to print text or numbers to the screen. In R, this is achieved with print() print(Pi) print("Hello World!") Notice how an argument to the print() function is enclosed in double-quotes, as is its respective output. Quotation marks indicate non-numeric characters. 
The print() function will accept either numeric types or character types as arguments, but not all functions are so flexible. Google can help with finding the right function for the job. Once you know the function name, we can ask RStudio to show us detailed documentation. Run the code chunk below. You won’t see any output in the notebook, but a help page should open in the bottom right panel of RStudio. ?print HINT It may be easier to ask for help in the Console (the bottom-left pane in RStudio) Exercise 3: Exploring other functions Use help to learn about these two functions: • sqrt() • floor() Then use the empty code chunk below to try the functions. Pass in arguments and check the output! I encourage you to practice commenting your code. a <- 4 b <- 3.8 sqrt(a) # Find the square root of a floor(b) # Round down to nearest integer of b It will become very useful to know what type of data your variables hold. For this, we can use the class() function. Use the class() function to determine the type (or class) of the three variables used above - x , y , z x <- 4 y <- 4.5 z <- "Imperial" class(x) class(y) class(z) z<- TRUE class(z) 12 Week 2 Handout Basic Programming in R Run the following code chunks to get started. # GDP unit : m (million) uk_GDP_2018 <- 2203005 uk_GDP_2019 <- 2238348 uk_GDP_2020 <- 1991439 uk_GDP_2021 <- 2142738 uk_GDP_2022 <- 2230625 uk_GDP_growth_rate_2019 <- (uk_GDP_2019 - uk_GDP_2018)/uk_GDP_2018 * 100 print(uk_GDP_growth_rate_2019) uk_GDP_growth_rate_2020 <- (uk_GDP_2020 - uk_GDP_2019)/uk_GDP_2019 * 100 print(uk_GDP_growth_rate_2020) uk_GDP_growth_rate_2021 <- (uk_GDP_2021 - uk_GDP_2020)/uk_GDP_2020 * 100 print(uk_GDP_growth_rate_2021) Comparison It is often necessary to compare values and variables, to determine if they are equal, or if one is greater or less than another. The comparison operators for equality are == and != # Equal to uk_GDP_growth_rate_2020 == uk_GDP_growth_rate_2020 # Not equal to uk_GDP_growth_rate_2020 != uk_GDP_growth_rate_2020 The operators for ordering (greater/less than) are > < >= <= # Less than uk_GDP_growth_rate_2020 < uk_GDP_growth_rate_2020 # Less than or equal to uk_GDP_growth_rate_2020 <= uk_GDP_growth_rate_2020 # Greater than uk_GDP_growth_rate_2020 > uk_GDP_growth_rate_2020 # Greater than or equal to uk_GDP_growth_rate_2020 >= uk_GDP_growth_rate_2020 Notice that the output is either TRUE or FALSE. These are Boolean values (also called logical values) - a data type, different to numeric or character. Vector A vector holds same type of data in an array. A vector can be created using an in-built function in R called c(). Elements must be comma-separated. All elements in a vector can be processed all at once; run and 13 study the following code chunk. # UK average weekly wage between 2018 and 2022, unit: £ uk_avg_weekly_wage <- c(488, 505, 516, 542, 571) uk_avg_weekly_wage > 500 Previously, we converted dates in character type to Date type directly one by one. But with dates in a numeric type vector, we need to convert them into character type before converting into Date type. The following example demonstrates this idea. # UK average weekly wage between 2018 and 2022, unit: £ uk_avg_weekly_wage <- c(488, 505, 516, 542, 571) # UK GDP between 2018 and 2022, unit: m£ uk_GDP <- c(2203005, 2238348, 1991439, 2142738, 2230625) yrs <- c(20010106, 20020406, NA, 20021203, 20031104, 20050908) yr <- as.Date(as.character(yrs), format = "%Y%m%d") print(yr) NA values We can check if there are NA values using the function is.na(). 
Run and study the following code chunk. # is.na() is a R built-in function for detecting NAs is.na(yrs) # which() is for returning the location of TRUE values which(is.na(yrs)) Data frame A data frame is a type of data that uses rows and columns to store tabular data. An important feature of a data frame is that different columns can store different types of data. For example, the following code chunk generates a data frame. After running the code chunk, a data frame is generated and listed in the Environment window (see top right hand side of your screen). Click on the variable name, and a view window will be opened to show the data. uk_gdp_weekly_wage <- data.frame("Year"=c(2018:2022), "weekly_wage"=c(488, 505, 516, 542, "GDP"= c(2203005, 2238348 ,1991439, 2142738, stringsAsFactors = F) 571), NA), print(uk_gdp_weekly_wage) Indexing Accessing the elements of a vector or data frame is done by indexing with [ ]. The index is a position in a vector or data frame, and in R it always starts at 1. # Extract the first element uk_avg_weekly_wage[1] # Extract the first three elements uk_avg_weekly_wage[1:3] 14 # Extract the second and fourth elements uk_avg_weekly_wage[c(2, 4)] # Extract the last element uk_avg_weekly_wage[length(uk_avg_weekly_wage)] # Change the first element uk_avg_weekly_wage[1] <- 4 # Add a new element with append() uk_avg_weekly_wage <- append(uk_avg_weekly_wage, 10) print(uk_avg_weekly_wage) Indices with data frame A data frame can be also accessed by the column names and indices. The following example code shows how to access the 2nd element in the Year column with 2 methods. uk_gdp_weekly_wage$Year[2] uk_gdp_weekly_wage[2,1] Exercise 1: Indices Exercise 1 Extract then print the UK GDP and weekly wage data from the year 2021. uk_gdp_weekly_wage2021 <- uk_gdp_weekly_wage[4,] print(uk_gdp_weekly_wage2021) Exercise 2 Change the weekly wage in 2021 to 550 and print out the latest result. uk_gdp_weekly_wage2021[1,2] <- 550 print(uk_gdp_weekly_wage2021) Exercise 3 Now correct the weekly wage in 2021 back to 542 and print out the latest result uk_gdp_weekly_wage2021[1,2] <- 542 print(uk_gdp_weekly_wage2021) load data Before starting loading, copy the data file “uk_gdp_and_weeklyWage.csv” to data directory in your R project directory (i.e., create a new folder called data in the current directory) and run the folllowing code chunk. uk_gdp_and_weeklyWage <- read.csv("./data/uk_gdp_and_weeklyWage.csv", stringsAsFactors = F) The variable uk_gdp_and_weeklyWage stores the data imported. It is a data frame. The argument stringsAsFactors=F is used for stopping read.csv from organizing character type of data in factor type (factors are used to work with categorical variables, variables that have a fixed and known set of possible values). The argument is used to avoid some potential troubles. To view the imported data, simply click on the variable name from the list in the top-right Environment panel. . . But, hold on for a second, before clicking, you need to observe the number of rows of data. For example, uk_gdp_and_weeklyWage has 23 obs. of 3 variables, that means, it has 23 rows and 3 columns. If there are 500,000 or more rows or columns, direct clicking will likely freeze your RStudio. In that case, instead of clicking on the variable, the following line is recommended. View(head(uk_gdp_and_weeklyWage)) 15 The line can be run in a code chunk and ideally run in the Console window (usually, placed at the bottom left). 
By default, head is a built-in function and it retrieves the first 6 rows of data. If more rows are expected, an argument is required. For example, View(head(uk_gdp_and_weeklyWage, n = 15)) Programming Structure Conditional Statements if. . . else • Before applying an operation we might verify if it can be executed. • We might need to apply different operations depending on some conditions. • We can use an if-else statement to determine what operations need to be executed if a condition is verified. For example, check if weekly wage in 2021 was more than £550 # check which row contains the data of 2021 row_num <- which(uk_gdp_and_weeklyWage$Year == 2021) if (uk_gdp_and_weeklyWage$Weekly_pay[row_num] > 550) { print("Weekly wage in 2021 was more than £550") } else { print("Weekly wage in 2021 was less than £550") } Exercise 4 By changing the year number, we can check if the weekly wage in 2018 exceeded £550. # check which row contains the data of 2018 row_num <- which(uk_gdp_and_weeklyWage$Year == 2018) if (uk_gdp_and_weeklyWage$Weekly_pay[row_num] > 550) { print("Weekly wage in 2018 was more than £550") } else { print("Weekly wage in 2018 was less than £550") } Exercise 5 We don’t want to change the relevant numbers every time. In the following exercise, use one variable to improve the above code. year_num <- 2018 row_num <- which(uk_gdp_and_weeklyWage$Year == year_num) if (uk_gdp_and_weeklyWage$Weekly_pay[row_num] > 550) { print(paste("Weekly wage in", year_num, "was more than £550")) } else { print(paste("Weekly wage in", year_num, "was less than £550")) } loop What if we want to repeat the same steps many times? One common way is to use the for loop. • Use for loop to repeat the same operations a specific number of times. • Use indices together with the loop to control element access. Run the code below and observe what happens. for (i in 1:nrow(uk_gdp_and_weeklyWage)) { # nrow returns the number of rows of the data frame # 1:nrow() specifies the number of times i varies 16 print(uk_gdp_and_weeklyWage$Year[i]) } Now based on Exercise 5, code the following: Exercise 6 Loop through every year in the dataset and check if the weekly wage exceeded £550. for (i in 1:nrow(uk_gdp_and_weeklyWage)) { year_num <- uk_gdp_and_weeklyWage$Year[i] row_num <- which(uk_gdp_and_weeklyWage$Year == year_num) if (uk_gdp_and_weeklyWage$Weekly_pay[row_num] > 550) { print(paste("Weekly wage in", year_num, "was more than £550")) } else { print(paste("Weekly wage in", year_num, "was less than £550")) } } Pipe operators The pipe operator allows you to send results of one function to another. The %>% is used as pipe function for carrying out a sequences of actions. A set of verbs for common data manipulation challenges. Let’s use tidyverse package to support %>% operators. NOTE: First install the tidyverse package by running install.packages("tidyverse") in the R Console. library(tidyverse) Before starting any practice and exercises, load the dataset “uk_gdp_and_weeklyWage.csv”. uk_gdp_and_weeklyWage <- read.csv("./data/uk_gdp_and_weeklyWage.csv", stringsAsFactors = F) With pipe operators, you can use the one or more of the following verbs: • select(): pick variables based on their names. • filter(): pick cases based on their values. • mutate(): add new variables that are functions of existing variables. • arrange() allows you to sort a data frame by a particular column. • summarise(): reduce multiple values down to a single summary. 
• These all combine naturally with group_by() which allows you to perform any operation “by group”. Note If you use group(), then ungroup() is always used after performing all the calculations. If you forget to ungroup() data, future data management will likely produce errors. Always ungroup() swhen you’ve finished with your calculations. select For example, the following code can select the Year column from the data frame. uk_gdp_and_weeklyWage %>% select(Year) %>% head() Exercise 7 Select the Year and weekly wage columns from “uk_gdp_and_weeklyWage”. uk_gdp_and_weeklyWage %>% select(Year, Weekly_pay) If want to store the pipe operations results for later use, assign the operations to a variable. For example: year_num <- uk_gdp_and_weeklyWage %>% select(Year) 17 filter To extract a subset of the dataset, filter can be used. For example, if want to select the GDP data between 2015 and 2020. We can do uk_gdp_and_weeklyWage %>% select(Year, GDP_m) %>% filter(Year>=2015, Year<=2020) Recall the exercise selecting the subset from leap years last week, it can be implemented using pipe operations. Exercise 8 Use select and filter with pipe operators to obtain a subset of weekly wage from leap years. uk_gdp_and_weeklyWage %>% select(Year, Weekly_pay) %>% filter(Year %% 4 == 0, Year %% 100 != 0 | Year %% 400 ==0 ) mutate Recall the practice for calculating GDP growth rate in the previous week(s), if we want to calculate all the yearly GDP growth rates from 2000 to 2022, we need to add a column to the data frame and repeat the equation to obtain GDP growth rate results. One method is to use loop to implement the calculation. Read the following code chunk, run it and observe the results. # add a colummn "GDP_growth_rate" to the data frame by assigning NA values uk_gdp_and_weeklyWage$GDP_growth_rate <- NA # use loop to calculate the GDP growth rate # start from the 2nd row because the equation looks backwards one period. for (i in 2:nrow(uk_gdp_and_weeklyWage)) { uk_gdp_and_weeklyWage$GDP_growth_rate[i] <(uk_gdp_and_weeklyWage$GDP_m[i] - uk_gdp_and_weeklyWage$GDP_m[i-1])/ uk_gdp_and_weeklyWage$GDP_m[i-1] * 100 } mutate() creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL). With mutate, lag() and pipe operator(s), the above code can be implemented in the following, much smaller, code chunk. uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_growth_rate = (GDP_m - lag(GDP_m)) / lag(GDP_m) * 100) Exercise 9 Use mutate to add a column “weeklyWageLevel” to check if weekly wage was more than 500 or not. uk_gdp_and_weeklyWage %>% mutate(weeklyWageLevel = Weekly_pay>500) arrange arrange() allows you to sort a data frame by a particular column. Try the following code chunk. uk_gdp_and_weeklyWage %>% arrange(GDP_growth_rate) The default is to sort in an ascending order; to sort by a descending order, you need to specify arrange(desc()). Try this for “GDP_growth_rate” in the “uk_gdp_and_weeklyWage” data frame. Exercise 10 uk_gdp_and_weeklyWage %>% arrange(desc(GDP_growth_rate)) 18 Tutorial Exercise 1. Deal with NAs • Load the data from “uk_gdp_and_weeklyWage_na.csv”. • Locate the NAs and remove the rows with missing values. • One way to do it is using the built-in function na.omit #Load data uk_gdp_and_weeklyWage_na <- read.csv("data/uk_gdp_and_weeklyWage_na.csv") uk_gdp_and_weeklyWage_no_NA<-na.omit(uk_gdp_and_weeklyWage_na) 2. Functions in R. 
Example Write a function compute_s_n that for any given n computes the sum Sn = 1ˆ2 + 2ˆ2 +3ˆ2 + . . . nˆ2. Report the value of the sum when n = 10. # Two approaches to solve this question. compute_s_n <- function(n) { sum_squares<-0 # initialize vector # using for loop for(i in 1:n){ sum_squares <- sum_squares +(iˆ2) } return(sum_squares) } result <- compute_s_n(10) print(result) compute_s_n <- function(n) { # using vectorized approach sum_squares <- sum((1:n)ˆ2) return(sum_squares) } # Compute S_n for n = 10 result <- compute_s_n(10) print(result) Exercise 1 • Load the data “uk_gdp_and_weeklyWage.csv” • Then, add two columns, “GDP_growth_rate” and “weeklyWageLevel”, using mutate to show yearly GDP growth rate and if weekly wage level was over £500 or not. uk_gdp_and_weeklyWage <- read.csv("data/uk_gdp_and_weeklyWage.csv") uk_gdp_and_weeklyWage %>% 19 mutate(GDP_growth_rate = (GDP_m - lag(GDP_m)) / lag(GDP_m) * 100, weeklyWageLevel = Weekly_pay>500) Exercise 2 (Challenging!) • From the dataframe uk_gdp_and_weeklyWage, extract a subset containing leap years only. • Hint: – you may want to use rbind together with loop and nested conditions to extract the subset holding leap years. – NULL value can be used to initialized an empty data frame. But there are different ways to implement that. leap_year_subset <- NULL for (i in 1:nrow(uk_gdp_and_weeklyWage)) { year_num <- uk_gdp_and_weeklyWage$Year[i] # check if it is a leap year if (year_num %% 100 == 0) { if (year_num %% 400 == 0) { leap_year_subset <- rbind(leap_year_subset, uk_gdp_and_weeklyWage[i, ]) } } else if (year_num %% 4 == 0) { leap_year_subset <- rbind(leap_year_subset, uk_gdp_and_weeklyWage[i, ]) } else { print(paste(year_num, "is not a leap year")) } } Week 3 Handout Tidyverse continued Let us start by calling the tidyverse library. NOTE: If you don’t have the tidyverse package on your RStudio, please first install it by running install.packages("tidyverse") in the R Console. library(tidyverse) Before starting any practice and exercises, load the dataset “uk_gdp_and_weeklyWage.csv”. Recall this gives us data on the average weekly pay (in £) and the total GDP (in millions of £) from 2000 to 2022. uk_gdp_and_weeklyWage <- read.csv("./data/uk_gdp_and_weeklyWage.csv", stringsAsFactors = F) head(uk_gdp_and_weeklyWage) With pipe operators, you may use one or more of the following verbs (note there are others): • select(): pick variables based on their names. • filter(): pick cases based on their values. • mutate(): add new variables that are functions of existing variables. • arrange() allows you to sort a data frame by a particular column. • summarise(): reduce multiple values down to a single summary. • These all combine naturally with group_by() which allows you to perform any operation “by group”. 20 Note If you use group(), then ungroup() is always used after performing all the calculations. If you forget to ungroup() data, future data management will likely produce errors. Always ungroup() swhen you’ve finished with your calculations. mutate Let us add GDP growth rates and categorize the growth rate into quartiles. Note: there will be NA(s) in the GDP growth rates data To do so, we need to make use of cut() and quantile() functions. The cut() function in R is used to cut a range of values into bins and specify labels for each bin. This function uses the following syntax: cut(x, breaks, include.lowest = TRUE or FALSE, labels = . . . 
) where: • x: Name of vector • breaks: Number of breaks to make or vector of break points • labels: Labels for the resulting bins. labels = NULL produces interval bins and labels = FALSE produces integer category labels starting at 1. Otherwise, you can assign your own labels with labels = your own vector of labels. • include.lowest: Boolean value to include lowest break value To see how cut() works, study and run the code below that puts a vector of numbers from 0 to 100 into 5 bins. vec <- c(0:100) cut(vec, c(0,20,40,60,80,100), include.lowest = TRUE, labels = FALSE) The quantile() function produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1. The syntax is: quantile(x, probs, na.rm = TRUE or FALSE, type = 7) where: • x: Numeric vector whose sample quantiles are wanted. NA and NaN values are not allowed in numeric vectors unless na.rm is set to TRUE. • probs: Numeric vector of probabilities. E.g., if probs = seq(0, 1, 0.25), then you are asking for the quartiles (0%, 25%, 50%, 75% and 100%). • na.rm = TRUE or FALSE: if TRUE, any NA and NaN’s are removed from x before the quantiles are computed. • type: An integer between 1 and 9 selecting one of the nine quantile algorithms available are to be used. Default is to keep it at 7. To see how quantile() works, study and run the code below that puts a vector of numbers from 0 to 100 into quartiles. vec <- c(0:100) quantile(vec, probs = seq(0, 1, 0.25), type = 7) Now we are ready to use mutate to add GDP growth rates and categorize the growth rate into quartiles. Study and run the code chunk below. 21 uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_growth_rate = (GDP_m - lag(GDP_m)) / lag(GDP_m) * 100, ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_growth_rate_Quartile = cut(GDP_growth_rate, quantile(GDP_growth_rate, probs = seq(0, 1, 0.25), type = 7, na.rm = T), include.lowest = T, labels = F) ) head(uk_gdp_and_weeklyWage) arrange Let’s sort the data by the order of GDP growth rates using arrange. uk_gdp_and_weeklyWage %>% arrange(GDP_growth_rate) %>% head() group_by Let us now investigate the mean, standard deviation of GDPs, mean and standard deviation of weekly wage, and mean and standard deviation of GDP growth rates in each GDP growth rate quartile group by summarise. Can you do this in the following code chunk? Exercise 1 uk_gdp_and_weeklyWage %>% group_by(GDP_growth_rate_Quartile) %>% summarise(mean_GDP = mean(GDP_m), std_GDP = sd(GDP_m), mean_wage = mean(Weekly_pay), std_wage = sd(Weekly_pay), mean_GDPrate = mean(GDP_growth_rate), std_GDPrate = sd(GDP_growth_rate)) %>% ungroup() After running the above code chunk, have you observed any issue(s)? To address the issue(s), adjustment in the above code chunk can be done by different ways. For example, filter can be added to implement that in the Exercise 2. Exercise 2 uk_gdp_and_weeklyWage %>% filter(!is.na(GDP_growth_rate_Quartile)) %>% group_by(GDP_growth_rate_Quartile) %>% summarise(mean_GDP = mean(GDP_m), std_GDP = sd(GDP_m), mean_wage = mean(Weekly_pay), std_wage = sd(Weekly_pay), mean_GDPrate = mean(GDP_growth_rate), std_GDPrate = sd(GDP_growth_rate)) %>% ungroup() Sometimes, viewing the first or last few rows of the data will be helpful to understand the data. Recall that we used head to display the first few rows of data.But we often need to sort the data before viewing. 
Let us try viewing the first 10 rows of “uk_gdp_and_weeklyWage” after sorting the “GDP_growth_rate” with arrange in Exercise 3. Exercise 3 22 uk_gdp_and_weeklyWage %>% arrange(GDP_growth_rate) %>% head(10) Now let us try viewing the last 10 rows of “uk_gdp_and_weeklyWage” after sorting the “GDP_growth_rate” with arrange and tail in Exercise 4. Exercise 4 uk_gdp_and_weeklyWage %>% arrange(GDP_growth_rate) %>% tail(10) summarise summarise() or summarize() function provides a summary of data, such as mean, standard deviation, median etc. The function is often used together with group_by if the data contain groups. Let’s start with a simply task by showing the mean of weekly wage, GDP and GDP growth rate. uk_gdp_and_weeklyWage %>% summarise(mean_wage = mean(Weekly_pay), mean_GDP = mean(GDP_m), mean_GDP_growth = mean(GDP_growth_rate)) Have you found any issue after running the code above? Why? What’s the solution? You can handle NA values by adding an argument na.rm=TRUE or na.rm=T (for shorter version) into the function mean. Try these below. Exercise 5 uk_gdp_and_weeklyWage %>% summarise(mean_wage = mean(Weekly_pay), mean_GDP = mean(GDP_m), mean_GDP_growth = mean(GDP_growth_rate, na.rm=T)) If we run summary() on the GDP growth rate column, we get 2.163% as the median rate. Let us extract this number from the string output using the parse_number() function, as below. To find out how to use parse_number(), you can search online or type ?parse_number in the Console. GDP_growth_rate_summary <- uk_gdp_and_weeklyWage %>% summary(GDP_growth_rate, na.rm=T) GDP_growth_rate_summary GDP_growth_rate_median <- parse_number(GDP_growth_rate_summary[3,4], locale = locale(grouping_mark = " ")) GDP_growth_rate_median Let’s group the GDP growth rate data by the benchmark at 2.163% and check the mean and standard deviation of the GDP growth rate. uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(lower_median = GDP_growth_rate < GDP_growth_rate_median) uk_gdp_and_weeklyWage %>% group_by(lower_median) %>% summarise(mean_GDP_growth = mean(GDP_growth_rate, na.rm=T), std_GDP_growth = sd(GDP_growth_rate, na.rm = T)) %>% ungroup() Any issue(s) found from the code chunk above? Please find a solution to address the issue in the following code chunk. 23 Exercise 6 uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(lower_median = GDP_growth_rate < GDP_growth_rate_median) uk_gdp_and_weeklyWage %>% group_by(lower_median) %>% filter(!is.na(GDP_growth_rate)) %>% summarise(mean_GDP_growth = mean(GDP_growth_rate, na.rm=T), std_GDP_growth = sd(GDP_growth_rate, na.rm = T)) %>% ungroup() Tutorial Exercise Exercise Load built-in EuStockMarkets dataset, which contains daily closing prices of major European stock indices. The goal is to practice using the summarise(), group_by(), and arrange() functions to analyse financial data. Dataset: The EuStockMarkets dataset contains data on four major European stock indices from 1991 to 1998: DAX: German stock index. SMI: Swiss stock index. CAC: French stock index. FTSE: British stock index. #Load the data data("EuStockMarkets") # Transform to data frame stock_data <- as.data.frame(EuStockMarkets) # Data processing. Adding the date - time series data. stock_data$Date <- seq(as.Date("1991-01-01"), by = "days", length.out = nrow(stock_data)) #Print the first 6 rows head(stock_data) Exercise 1 Extract the year from the Date column and group the data by Year. Then, calculate the average closing price of the four indices for each year. 
stock_data %>% mutate(Year = format(Date, "%Y")) %>% group_by(Year) %>% summarise( avg_dax = mean(DAX, na.rm = TRUE), avg_smi = mean(SMI, na.rm = TRUE), avg_cac = mean(CAC, na.rm = TRUE), avg_ftse = mean(FTSE, na.rm = TRUE) ) Exercise 2 Find the year with the highest average DAX index and sort the years in descending order based on the average DAX value. stock_data %>% mutate(Year = format(Date, "%Y")) %>% group_by(Year) %>% summarise(avg_dax = mean(DAX, na.rm = TRUE)) %>% arrange(desc(avg_dax)) Exercise 3 In this exercise, you will work with the LifeCycleSavings dataset and use functions like pull(), the dot operator (.), case_when(), and between() to analyse economic data. This dataset contains economic information on savings rates in different countries. 24 Dataset The LifeCycleSavings dataset contains the following variables: - sr: Aggregate personal savings as a percentage of disposable income. - pop15: Percentage of population under 15 years of age. - pop75: Percentage of population over 75 years of age. - dpi: Real per-capita disposable income. - ddpi: Percentage growth rate of dpi. You can load and convert the dataset to a data frame with the following command: data("LifeCycleSavings") savings_data <- as.data.frame(LifeCycleSavings) Create a new column savings_category to classify countries based on their savings rate (sr) as follows: “Low Savings” if sr is less than 10. “Moderate Savings” if sr is between 10 and 20. “High Savings” if sr is greater than 20. savings_data %>% mutate(savings_category = case_when( sr < 10 ~ "Low Savings", sr >= 10 & sr <= 20 ~ "Moderate Savings", sr > 20 ~ "High Savings" )) %>% select(sr, savings_category) %>% head(10) Example # Extract the 'sr' column using pull() savings_rate <- savings_data %>% pull(sr) # Display the first 10 values head(savings_rate, 10) # Access the 'sr' column using the dot operator savings_rate <- savings_data %>% .$sr # Display the first 10 values head(savings_rate, 10) Week 4 Handout Data Visualization This tutorial covers data visualization skills and pipe operators. Let’s use tidyverse package to support %>% operators and ggplot2 to support data visualization. NOTE: If you don’t have tidyverse and ggplot2 packages on your RStudio, please install it first by running install.packages("tidyverse") then install.packages("ggplot2") in the R Console. library(tidyverse) library(ggplot2) Before starting any practice and exercises, load the dataset “uk_gdp_and_weeklyWage.csv”. uk_gdp_and_weeklyWage <- read.csv("uk_gdp_and_weeklyWage2.csv", stringsAsFactors = F) Next, we add GDP growth rates and categorize the growth rate into quartiles. 25 Note: there will be NA(s) in the GDP growth rates data, so we use filter to remove any rows with NA data. uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_growth_rate = (GDP_m - lag(GDP_m)) / lag(GDP_m) * 100, weeklyWageQuantile = cut(Weekly_pay, quantile(Weekly_pay, probs = seq(0, 1, 0.25), type = 7), include.lowest = T, labels = F) ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (GDP_growth_rate_Quantile = cut(GDP_growth_rate, quantile(GDP_growth_rate, probs = seq(0, 1, 0.25), type = 7, na.rm = T), include.lowest = T, labels = F) ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% filter(!is.na(GDP_growth_rate)) Plot weekly wage To plot some data, we need to allocate the data for x-axis and y-axis and set them as the arguments for aes in the ggplot function, followed by the verbs provided by the ggplot2 package. For example, geom_point is used to create a scatterplot. 
After the use of each verb, + is required. An application of ggplot is shown in the following code chunk. uk_gdp_and_weeklyWage %>% ggplot(aes(x=Year, y=Weekly_pay)) + geom_point( ) Running the above code chunck, a scatterplot will appear because of the use of geom_point. Change geom_point to geom_line, and a line plot will be plotted. Try the following code chunk. uk_gdp_and_weeklyWage %>% ggplot(aes(x=Year, y=Weekly_pay)) + geom_line( ) To obtain scatter points and a line together for a dataset, geom_point and geom_line can be used together. Try the following code chunk. uk_gdp_and_weeklyWage %>% ggplot(aes(x=Year, y=Weekly_pay)) + geom_point() + geom_line() In the above code chunks, aes is set as an argument of ggplot. Alternatively, ggplot() can be used alone and then aes can be set as an argument later in the verbs of ggplot. For example, the previous code chunk can be rewritten as the following chunk to generate the same plot. p <- uk_gdp_and_weeklyWage %>% ggplot() p + geom_point(aes(x=Year, y=Weekly_pay)) + geom_line(aes(x=Year, y=Weekly_pay)) In the previous plots, the label for the y-axis was set using the data column name by default. To change the label of y-axis appropriately, y can be added in the labs verb. A title of the plot can be added as well, as shown in the following code chunk. p <- uk_gdp_and_weeklyWage %>% ggplot(aes(x=Year, y=Weekly_pay)) p + geom_point( ) + geom_line() + labs( title ="Weekly Wage between 2001 and 2022 (UK)", 26 y = "Weekly Wage (£)" ) To improve the presentation, more settings and verbs can be added, for example, size, theme(), as demonstrated in the next code chunk. p <- uk_gdp_and_weeklyWage %>% ggplot(aes(x=Year, y=Weekly_pay)) p + geom_point(size=2) + # size = 2 for changing scatter points size geom_line() + labs( title ="Weekly Wage between 2001 and 2022 (UK)", y = "Weekly Wage (£)" ) + theme( plot.title = element_text(hjust = 0.5, size = 12, face = "bold") ) Multiple lines in one plot To add additional lines and scatter points to a plot, you just need to add more verbs such as geom_point and geom_line. Don’t forget the + for connecting the verbs. In the following exercise, add multiple lines in a plot by adding scatter points and a line for “GDP_growth_rate”. You would need to first create the new column “GDP_growth_rate” (you can get the code from Week 3 in-class R exercises or try to code it from scratch). Try making the “GDP_growth_rate” plot red by adding colour='red' as options in geom_point and geom_line. Exercise 1 uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_growth_rate = (GDP_m - lag(GDP_m)) / lag(GDP_m) * 100, ) p <- uk_gdp_and_weeklyWage %>% ggplot(aes(x=Year, y=Weekly_pay)) p + geom_point( size=2) + geom_line() + geom_point(aes(y = GDP_growth_rate), size=2, colour = 'red') + geom_line(aes(y = GDP_growth_rate ), colour = 'red') + labs( title ="Weekly Wage between 2001 and 2022 (UK)", y = "Weekly Wage (£)" ) + theme( plot.title = element_text(hjust = 0.5, size = 12, face = "bold") ) After running the code above, have you noticed any issue(s)? The problem of plotting the weekly wage and the GDP growth rate on the same plot is that the two datasets are very different in their numeric values. This leads to the GDP growth rate being difficult to read on the plot. To address the issue(s), one of the methods is to provide a secondary y-axis, which has a different range from the primary y-axis. One limitation with ggplot however is that a new range of the secondary y-axis cannot be set with arbitrary numeric range. 
Instead, a scale factor needs to be estimated for the new range. The scale factor can be calculated or estimated using the numeric ranges of the two datasets as below. (Note that further adjustments on the scale factor may need to be made by trial-and-error). 27 scale_f actor = (max2nd dataset − min2nd dataset )/(max1st dataset − min1st dataset ) Tutorial Data visualisation This tutorial covers data visualisation skills and pipe operators. Let’s use tidyverse package to support %>% operators and ggplot2 to support data visualisation. NOTE: if you don’t have tidyverse and ggplot2 package on your RStudio. Please install it by running install.packages("tidyverse") then install.packages("ggplot2") in the R Console. library(tidyverse) library(ggplot2) library(dplyr) Before starting any practice and exercises, load the dataset “uk_gdp_and_weeklyWage_Pop.csv”. uk_gdp_and_weeklyWage <- read.csv("data/uk_gdp_and_weeklyWage_Pop.csv") Data processing. We first add GDP growth rates and categorize the growth rate into quartiles. uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_growth_rate = (GDP_m - lag(GDP_m)) / lag(GDP_m) * 100, weeklyWageQuantile = cut(Weekly_pay, quantile(Weekly_pay, probs = seq(0, 1, 0.25), type = 7), include.lowest = T, labels = F) ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (GDP_growth_rate_Quantile = cut(GDP_growth_rate, quantile(GDP_growth_rate, probs = seq(0, 1, 0.25), type = 7, na.rm = T), include.lowest = T, labels = F) ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_per_cap = GDP_m/Population*1000000) #convert to £ by x1000,000 Exercise 1 Start by creating a basic plot using geom_point() and geom_line() to visualise the Weekly Pay over Years. Choose size=2 for the scatterplot points. # Create the base plot p <- uk_gdp_and_weeklyWage %>% ggplot(aes(x = Year, y = Weekly_pay)) + geom_point(size = 2) + geom_line() + labs(x = "Year", y = "Weekly Wage (£)", title = "UK Weekly Wage 2000 to 2022") # Display the plot print(p) 28 Let us now use the LifeCycleSavings dataset, which contains economic data including savings rates, population percentages, and disposable income. Run the following code chunk to load this dataset. # Load the LifeCycleSavings dataset data("LifeCycleSavings") savings_data <- as.data.frame(LifeCycleSavings) Exercise 2 Create a scatter plot to visualise the relationship between disposable income (dpi) and savings rate (sr). Use different colors for different savings categories. You can use the default colour scale or try to define your own colour scale using scale_color_manual. Also, you can change the title of the legend using the layer labs(colour = "Your chosen title"). # Define savings categories savings_data <- savings_data %>% mutate(savings_category = case_when( sr < 10 ~ "Low Savings", sr >= 10 & sr <= 20 ~ "Moderate Savings", sr > 20 ~ "High Savings" )) # Create a scatter plot p1 <- savings_data %>% ggplot(aes(x = dpi, y = sr, color = savings_category)) + geom_point() + scale_color_manual(values = c("Low Savings" = "green", "Moderate Savings" = "skyblue", "High Savings" labs(color = "Savings Category", x = "Disposable Income", y = "Savings Rate", title = "Relationship Between Disposable Income and Savings Rate") p1 Exercise 3 Now try adding a trend line to the scatter plot to better visualise the relationship between disposable income and savings rate. You can do this by adding the layer geom_smooth(method = "lm", se = FALSE, colour = "some colour that's not used for the scatterplot"). 
Try also experimenting with further edits/improvements. # Scatter plot with trend line p1 <- p1+ geom_smooth(method = "lm", se = FALSE, colour = "red") + annotate("text", x = 3500, y = 10.5, label = "Trend Line", color = "red") p1 Week 5 Handout Data Visualization (continued) This tutorial covers data visualization skills and pipe operators. Let’s use tidyverse package to support %>% operators and ggplot2 to support data visualization. 29 NOTE: if you don’t have tidyverse and ggplot2 packages on your RStudio. Please install it by running install.packages("tidyverse") then install.packages("ggplot2") in the R Console. library(tidyverse) library(ggplot2) Before starting any practice and exercises, load the dataset “uk_gdp_and_weeklyWage.csv”. uk_gdp_and_weeklyWage <- read.csv("uk_gdp_and_weeklyWage2.csv", stringsAsFactors = F) Next, add GDP growth rates and categorize the growth rate into quartiles. NOTE: there will be NA(s) in the GDP growth rates data uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_growth_rate = (GDP_m - lag(GDP_m)) / lag(GDP_m) * 100, weeklyWageQuantile = cut(Weekly_pay, quantile(Weekly_pay, probs = seq(0, 1, 0.25), type = 7), include.lowest = T, labels = F) ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (GDP_growth_rate_Quantile = cut(GDP_growth_rate, quantile(GDP_growth_rate, probs = seq(0, 1, 0.25), type = 7, na.rm = T), include.lowest = T, labels = F) ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (GDP_per_cap = GDP_m/Population*1000000) # calculate GDP per Capita (£) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (wage_level = Weekly_pay > 500) print(uk_gdp_and_weeklyWage) Plot a histogram To plot some data, we need to allocate the data for x-axis and y-axis and set them as the arguments for aes in the ggplot function, followed by the verbs provided by ggplot2 package, for example, geom_histogram. To add each verb, + is required. An application of ggplot is shown in the following code chunk. geom_histogram is used in this example for illustrating the distribution of the average weekly wages in a histogram. bins is for setting number of vertical bars, each of which shows how many values from the data fall into this range. Run the following code chunk- a histogram will appear because of the use of geom_histogram. uk_gdp_and_weeklyWage %>% ggplot(aes(x=Weekly_pay)) + geom_histogram( bins = 6) Try the following code chunk with more settings in geom_histogram by changing the filling colour. The colours can be denoted by English words or HEX codes. For example, red, lightblue, green, #FF0000, #ADD8E6, #00FF00. uk_gdp_and_weeklyWage %>% ggplot(aes(x=Weekly_pay)) + geom_histogram( bins = 6, colour = "black", fill = "white") 30 The type of border lines of each histogram can be changed by an additional parameter linetype. In the following example, linetype is added and filling colour of each bar in the histogram is changed. Then a vertical line to represent mean of weekly wage is added by the verb geom_vline with the main parameter xintercept=mean(). Other parameters such as line colour, linetype and width can be used to define the vertical line as well. uk_gdp_and_weeklyWage %>% ggplot(aes(x=Weekly_pay)) + geom_histogram(bins = 6, colour = "black", fill = "lightblue", linetype = "dashed") + geom_vline(aes(xintercept=mean(Weekly_pay)), colour="blue", linetype="dashed", linewidth = 1) Here is an example of how to further customise this plot. Run and study the code below, and try your own variations. 
# Create the plot uk_gdp_and_weeklyWage %>% ggplot(aes(x = Weekly_pay)) + geom_histogram(bins = 6, colour = "black", fill = "lightblue", alpha = 0.7) + # Adjust fill colour and geom_vline(aes(xintercept = mean(Weekly_pay)), colour = "blue", linetype = "dashed", linewidth = 1) + # Adjust vline appearance labs( title = "Distribution of Weekly Wage", x = "Weekly Wage (£)", y = "Frequency" ) + theme_minimal() + # Use a minimal theme theme( plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), # Adjust title appearance axis.text = element_text(size = 10), # Adjust axis text size axis.title = element_text(size = 12), # Adjust axis title size panel.grid.major = element_line(colour = "gray90"), # Adjust grid line colour panel.grid.minor = element_blank(), # Remove minor grid lines legend.position = "none" # Remove legend ) Exercise 1 Create a histogram of the GDP per capita with light green and dashed border lines, and a red dashed mean line with width 1.5. uk_gdp_and_weeklyWage %>% ggplot(aes(x=GDP_per_cap)) + geom_histogram(bins = 6, colour = "black", fill = "lightgreen", linetype = "dashed") + geom_vline(aes(xintercept=mean(GDP_per_cap)), colour="blue", linetype="dashed", linewidth = 1) Exercise 2 Using a 7-bin histogram, plot the GDP in pink with a blue dashed mean line. uk_gdp_and_weeklyWage %>% ggplot(aes(x=GDP_m)) + geom_histogram(bins = 7, colour = "black", fill = "pink", linetype = "dashed") + geom_vline(aes(xintercept=mean(GDP_m)), colour="blue", linetype="dashed", linewidth = 1) 31 Tutorial Exercise Data Visualization (continued) This tutorial covers data visualization skills and pipe operators. Let’s use tidyverse package to support %>% operators and ggplot2 to support data visualization. NOTE: if you don’t have tidyverse and ggplot2 packages on your RStudio. Please install it by running install.packages("tidyverse") then install.packages("ggplot2") in the R Console. library(tidyverse) library(ggplot2) Before starting any practice and exercises, load the dataset “uk_gdp_and_weeklyWage.csv”. uk_gdp_and_weeklyWage <- read.csv("uk_gdp_and_weeklyWage.csv", stringsAsFactors = F) Next, add GDP growth rates and categorize the growth rate into quartiles. Hint: there will be NA(s) in the GDP growth rates data uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_growth_rate = (GDP_m - lag(GDP_m)) / lag(GDP_m) * 100, weeklyWageQuantile = cut(Weekly_pay, quantile(Weekly_pay, probs = seq(0, 1, 0.25), type = 7), include.lowest = T, labels = F) ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (GDP_growth_rate_Quantile = cut(GDP_growth_rate, quantile(GDP_growth_rate, probs = seq(0, 1, 0.25), type = 7, na.rm = T), include.lowest = T, labels = F) ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (GDP_per_cap = GDP_m/Population*1000000) # calculate GDP per Capita (£) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (wage_level = Weekly_pay > 500) To add a density line to the histogram, you need to add geom_density() after geom_histogram. Try in the following exercise. Exercise 1 uk_gdp_and_weeklyWage %>% ggplot(aes(x=Weekly_pay, y = after_stat(density))) + geom_histogram(bins = 6, color = "black", fill = "lightblue", linetype = "dashed") + geom_density() alpha and fill can be set in the geom_density. alpha is for transparency of density and fill is for filling colour of density (PDF, probability density function). In exercise 2, add alpha with 0.2 and fill with “lightpink” in the geom_density. 
Exercise 2 uk_gdp_and_weeklyWage %>% ggplot(aes(x=Weekly_pay, y = after_stat(density))) + geom_histogram(bins = 6, color = "black", fill = "lightblue", linetype = "dashed") + 32 geom_density(alpha = 0.3, fill = "lightpink") In the next exercise, add a solid blue mean line to the histogram plot based on Exercise 2. Exercise 3 uk_gdp_and_weeklyWage %>% ggplot(aes(x=Weekly_pay, y = after_stat(density))) + geom_histogram(bins = 6, color = "black", fill = "lightblue", linetype = "dashed") + geom_density(alpha = 0.2, fill = "lightpink") + geom_vline(aes(xintercept=mean(Weekly_pay)), color="blue", linetype="solid", linewidth = 1) You can make further modifications to the plot, e.g., (i) adjusting the appearance of the density plot (geom_density) by changing the alpha level and fill color for better visibility; (ii) adjusting the appearance of the vertical line (geom_vline) by changing its color, linetype, and size; (iii) adding labels for the title, x-axis, and y-axis to improve clarity and (iv) using a minimal theme for a clean appearance and making various theme adjustments for text size, grid lines, and legend. Run and study the code chunk below, then try your own variations. # Create the plot uk_gdp_and_weeklyWage %>% ggplot(aes(x = Weekly_pay, y = after_stat(density))) + geom_histogram(bins = 6, color = "black", fill = "lightblue", linetype = "dashed") + geom_density(alpha = 0.5, fill = "lightpink") + # Adjust density appearance geom_vline(aes(xintercept = mean(Weekly_pay)), color = "blue", linetype = "solid", size = 1) + # Adjust vline appearance labs( title = "Distribution of Weekly Wage", x = "Weekly Wage (£)", y = "Density" ) + theme_minimal() + # Use a minimal theme theme( plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), # Adjust title appearance axis.text = element_text(size = 10), # Adjust axis text size axis.title = element_text(size = 12), # Adjust axis title size panel.grid.major = element_line(colour = "gray90"), # Adjust grid line color panel.grid.minor = element_blank(), # Remove minor grid lines legend.position = "none" # Remove legend ) Week 6 Handout Data Wrangling NOTE: if you don’t have tidyverse and ggplot2 packages on your RStudio. Please install it by running install.packages("tidyverse") then install.packages("ggplot2") in the R Console. gather One of the most used functions in the tidyr package is gather, which is useful for converting wide data into tidy data. First, let us load appropriate packages and libraries. 33 library(tidyverse) library(ggplot2) library(dslabs) Run the following code chunk, which was covered in class. Make sure you understand each piece of code. # inspect the data path <- system.file("extdata", package="dslabs") filename <- file.path(path, "fertility-two-countries-example.csv") wide_data <- read_csv(filename) head(wide_data) # gather into a tidy format by calling column names in wide_data new_tidy_data <- gather(wide_data, year, fertility, `1960`:`2015`) head(new_tidy_data) # gather into a tidy format by calling the column not to be gathered (country) new_tidy_data2 <- wide_data %>% gather(year, fertility, -country) head(new_tidy_data2) Run the below code chunk, which converts the class of the year column to integer. # check the class of the year column class(new_tidy_data$year) # change the class of the year column to integer new_tidy_data <- wide_data %>% gather(year, fertility, -country, convert = TRUE) class(new_tidy_data$year) Try using mutate and as.integer instead to convert the class of the year column to integer. 
Exercise 1 new_tidy_data3 <- wide_data %>% gather(year, fertility, -country) %>% mutate(year = as.integer(year)) head(new_tidy_data3) Now that the data is tidy, use ggplot to plot the fertility of the two countries over time, using different colours for the different countries and the Economist theme. Exercise 2 library(ggthemes) new_tidy_data %>% ggplot(aes(year, fertility, color = country)) + geom_point() + xlab("Year") + ylab("Fertility") + ggtitle("Fertility in South Korea and Germany 1960-2015") + scale_color_discrete(name = " ") + theme_economist() spread The spread function is basically the inverse of gather. The first argument is for the data, but since we are using the pipe, we don’t show it. The second argument tells spread which variable will be used 34 as the column names. The third argument specifies which variable to use to fill out the cells: new_wide_data <- new_tidy_data %>% spread(year, fertility) select(new_wide_data, country, `1960`:`1967`) separate Let us obtain the raw data first in the code chunk below. path <- system.file("extdata", package = "dslabs") filename <- "life-expectancy-and-fertility-two-countries-example.csv" filename <- file.path(path, filename) raw_dat <- read_csv(filename) select(raw_dat, 1:5) First, note that the data is in wide format. Second, notice that this table includes values for two variables, fertility and life expectancy, with the column name encoding which column represents which variable. Encoding information in the column names is not recommended but, unfortunately, it is quite common. We will put our wrangling skills to work to extract this information and store it in a tidy fashion. We can start the data wrangling with the gather function, but we should no longer use the column name year for the new column since it also contains the variable type. We will call it “key”, the default, for now: dat <- raw_dat %>% gather(key, value, -country) head(dat) The separate function can separate the key column into two or more. dat %>% separate(key, c("year", "variable_name"), extra = "merge") Now, use the spread function to create a column for each variable as well. Exercise 3 dat %>% separate(key, c("year", "variable_name"), extra = "merge") %>% spread(variable_name, value) unite Study and run the code chunk below. # suppose we had the following data frame var_names <- c("year", "first_variable_name", "second_variable_name") dat %>% separate(key, var_names, fill = "right") # unite helps to fix the problem of separated column names above dat %>% separate(key, var_names, fill = "right") %>% unite(variable_name, first_variable_name, second_variable_name) %>% spread(variable_name, value) %>% rename(fertility = fertility_NA) Tutorial Exercise Data Wrangling and Joining Tables with Economic Data In this exercise, you’ll work with the economics dataset from the ggplot2 package, which contains U.S. economic time series data, and learn how to manipulate the data using gather, spread, separate, unite, and joins. Step 1: Load the Required Packages and Data # Load the necessary packages library(tidyverse) # Load the economics dataset from ggplot2 35 data("economics") # Take a look at the first few rows of the dataset head(economics) Exercise 1 : Data Wrangling Using gather and spread The gather function allows us to reshape the data from wide to long format, and spread does the opposite. Gather Data: Reshape the dataset to a long format, focusing on the columns pce (personal consumption expenditures) and psavert (personal savings rate). 
# Gather data to create a long format economics_long <- economics %>% gather(key = "indicator", value = "value", pce, psavert) # View the transformed data head(economics_long) Spread Data: Reverse the transformation and spread the data back to wide format. # Spread the data back to wide format economics_wide <- economics_long %>% spread(key = "indicator", value = "value") # View the transformed data head(economics_wide) Exercise 3: Data Wrangling Using separate and unite The separate function splits a column into multiple columns, and the unite function merges multiple columns into one. Separate Data: Create a new column that combines the year and month from the date column, then separate it back into two columns. # Add a new column for year and month economics_sep <- economics %>% mutate(year_month = format(date, "%Y-%m")) %>% separate(year_month, into = c("year", "month"), sep = "-") # View the modified data head(economics_sep) Unite Data: Combine the year and month columns back into a single year_month column. # Unite the year and month columns back into one column economics_unite <- economics_sep %>% unite(col = "year_month", year, month, sep = "-") # View the transformed data head(economics_unite) First, load the datasets “ONS-England_population.csv”, “ONS-Wales_population.csv” and “ONS-EnglandWales_GDP_GrowthRate.csv” Eng_population <- read.csv("./data/ONS-England_population.csv", stringsAsFactors = F) 36 Wales_population <- read.csv("./data/ONS-Wales_population.csv", stringsAsFactors = F) Eng_Wales_GDP_GR <- read.csv("./data/ONS-England-Wales_GDP_GrowthRate.csv", stringsAsFactors = F) inspect data Before starting any practice and exercises, inspect the data and detect any abnormalities. Some basic operations can use head(), tail() etc. In the following exercise, use head and tail to have a glimpse at the data Exercise 2 # preview the first and last 15 rows of data head(Eng_population, n = 15) tail(Eng_population, n = 15) head(Wales_population, n = 15) tail(Wales_population, n = 15) head(Eng_Wales_GDP_GR, n = 15) tail(Eng_Wales_GDP_GR, n = 15) correct data Select data in certain rows, rename columns and convert data types. Eng_population <- Eng_population[8:58, ] colnames(Eng_population) <- c("Year", "Population") Eng_population <- Eng_population %>% mutate(Year = as.numeric(Year), Population = as.numeric(Population), Region = "England") Wales_population <- Wales_population[8:58, ] colnames(Wales_population) <- c("Year", "Population") Wales_population <- Wales_population %>% mutate(Year = as.numeric(Year), Population = as.numeric(Population), Region = "Wales") Join data frames Combine the population datasets from England and Wales by rows, using rbind. This step aims to produce a dataframe holding mutliple regions’ information. Eng_Wales_population <- rbind(Eng_population, Wales_population) head(Eng_Wales_population) In the following, try joining dataframes as instructed. How are the resulting data frames different from each other? 
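Before comparing the joins on the real data below, it can help to see how the four join types differ on a tiny made-up pair of tables. The data frames x and y here are hypothetical and only for illustration; the real joins in Exercise 3 use join_by() to match key columns with different names.

# Two tiny, hypothetical tables: x covers 2019-2021, y covers 2020-2022
x <- data.frame(Year = c(2019, 2020, 2021), growth = c(1.5, -9.3, 7.5))
y <- data.frame(Year = c(2020, 2021, 2022), pop = c(56.6, 56.5, 57.1))

inner_join(x, y, by = "Year") # keeps only years present in both tables (2020, 2021)
left_join(x, y, by = "Year")  # keeps all rows of x; pop is NA for 2019
right_join(x, y, by = "Year") # keeps all rows of y; growth is NA for 2022
full_join(x, y, by = "Year")  # keeps every year from either table, with NAs where missing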
Exercise 3 # Use inner_join to join "Eng_Wales_GDP_GR" (x) and "Eng_Wales_population" (y), equating "calendar.years inner_join_data1 <- Eng_Wales_GDP_GR %>% inner_join(Eng_Wales_population, join_by("calendar.years" == "Year", "Geography" == "Region")) head(inner_join_data1) 37 # Use left_join to join "Eng_Wales_GDP_GR" (x) and "Eng_Wales_population" (y), equating "calendar.years" left_join_data1 <- Eng_Wales_GDP_GR %>% left_join(Eng_Wales_population, join_by("calendar.years" == "Year", "Geography" == "Region")) head(left_join_data1) # Use right_join to join "Eng_Wales_GDP_GR" (x) and "Eng_Wales_population" (y), equating "calendar.years right_join_data1 <- Eng_Wales_GDP_GR %>% right_join(Eng_Wales_population, join_by("calendar.years" == "Year", "Geography" == "Region")) head(right_join_data1) # Use full_join to join "Eng_Wales_GDP_GR" (x) and "Eng_Wales_population" (y), equating "calendar.years" full_join_data1 <- Eng_Wales_GDP_GR %>% full_join(Eng_Wales_population, join_by("calendar.years" == "Year", "Geography" == "Region")) head(full_join_data1) Week 7 Handout Data Processing NOTE: if you don’t have tidyverse and ggplot2 packages on your RStudio. Please install it by running install.packages("tidyverse") then install.packages("ggplot2") in the R Console. library(tidyverse) library(ggplot2) load libraries and data First, load the dataset “uk_gdp_and_weeklyWage.csv”. uk_gdp_and_weeklyWage <- read.csv("data/uk_gdp_and_weeklyWage.csv", stringsAsFactors = F) inspect data Before starting any practice and exercises, inspect the data and detect any abnormalities. Some basic operations can use head(), str(), which, is.na(), is.numeric etc. For more advanced operations, for example, to detect any outlier(s), IQR (inter quartile range) can be used. In the following, use head and str to have a glimpse at the data Exercise 1 # preview the first 15 rows of data head(uk_gdp_and_weeklyWage, n = 15) # print the structure of data str(uk_gdp_and_weeklyWage) Detect NAs In the following, use the pipe operator and apply to find the columns that contain NAs. Exercise 2 38 # use colSums and is.na to find the columns contain NA value(s) uk_gdp_and_weeklyWage %>% select(which(colSums(is.na(.)) > 0)) NAs_loc <- which(uk_gdp_and_weeklyWage %>% apply(., MARGIN = 2, is.na) , arr.ind = T) Detect outliers To check if there are any outliers in the data, IQR is one metric that can be used. IQR = 3rd Quartile − 1st Quartile As a rule of thumb, outliers are data that fall outside the range between 1st Quartile - (IQR x 1.5) and 3rd Quartile + (IQR x 1.5). Outliers > 3rd Quartile + (IQR ∗ 1.5) or Outliers < 1st Quartile − (IQR ∗ 1.5) In the following code chunk, we create a function for detecting outliers. # Use IQR # create detect outlier function detect_outlier <- function(x) { # calculate first quantile Q1 <- quantile(x, probs=.25, na.rm = T) # calculate third quantile Q3 <- quantile(x, probs=.75, na.rm = T) # calculate inter quartile range IQR = Q3-Q1 # return true or false result <- x > Q3 + (IQR*1.5) | x < Q1 - (IQR*1.5) return(result) } Use the function detect_outlier, and apply with pipe operators to complete the following code chunk in Exercise 3. 
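Two building blocks are worth recalling before the solution: apply(..., MARGIN = 2, f) applies f down each column, and which(..., arr.ind = TRUE) converts the resulting TRUE/FALSE matrix into row and column positions. A small self-contained illustration follows; the matrix m is made up for this sketch, and the simple threshold stands in for detect_outlier.

# A made-up 4 x 2 matrix with one extreme value in each column
m <- cbind(a = c(1, 2, 3, 100), b = c(5, 500, 6, 7))

flags <- apply(m, MARGIN = 2, function(x) x > 50) # TRUE/FALSE matrix, column by column
which(flags, arr.ind = TRUE)                      # row/column positions of the TRUE entries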
Exercise 3 # define an empty data frome for storing the matrix of locations of outliers outliers <- data.frame(rows=c(), cols=c()) outliers <- outliers %>% add_column(rows = NA, cols = NA) # obtain the outliers' locations outliers <- which(uk_gdp_and_weeklyWage %>% apply(., MARGIN = 2, detect_outlier) , arr.ind = T) %>% data.frame() outliers_NAs <- rbind(outliers, NAs_loc) Correct missing data load original datasets 39 We need to load the original datasets “ONS_UK_GDP.csv” and “ONS_UK_population.csv” because of the corrupted data in the the corresponding columns. Complete the following code chunk. Exercise 4 # load ONS_UK_GDP.csv from certain rows GDP <- read.csv("data/ONS_UK_GDP.csv", stringsAsFactors = F)[8:82,] # rename the columns colnames(GDP) <- c("Year", "GDP_m") # convert the columns to numeric types GDP <- GDP %>% mutate(Year = as.numeric(Year), GDP_m = as.numeric(GDP_m)) # load ONS_UK_population.csv from certain rows popul <- read.csv("data/ONS_UK_population.csv", stringsAsFactors = F)[8:58, ] # rename the columns colnames(popul) <- c("Year", "population") # convert the columns to numeric types popul <- popul %>% mutate(Year = as.numeric(Year), population = as.numeric(population)) Correct data with loops Use loops to correct data in the following exercise. Exercise 5 # correct GDP data for (row in unique(outliers_NAs[outliers_NAs$col==3, ]$row)) { year_num <- uk_gdp_and_weeklyWage$Year[row] uk_gdp_and_weeklyWage$GDP_m[row] <- GDP$GDP_m[GDP$Year==year_num] } # correct population data for (row in unique(outliers_NAs[outliers_NAs$col==4, ]$row)) { year_num <- uk_gdp_and_weeklyWage$Year[row] uk_gdp_and_weeklyWage$Population[row] <- popul$population[popul$Year==year_num] } data visualization Exercise 6 Next, add GDP growth rates and categorize the growth rate into quartiles. Note: there will be NA(s) in the GDP growth rates data; you need to remove them. uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate(GDP_growth_rate = (GDP_m - lag(GDP_m)) / lag(GDP_m) * 100, weeklyWageQuartile = cut(Weekly_pay, quantile(Weekly_pay, probs = seq(0, 1, 0.25), type = 7), include.lowest = T, labels = F) ) uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (GDP_growth_rate_Quartile = cut(GDP_growth_rate, quantile(GDP_growth_rate, probs = seq(0, 1, 0.25), type = 7, na.rm = T), include.lowest = T, labels = F) ) 40 uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (GDP_per_cap = GDP_m/Population*1000000) # calculate GDP per Capita uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% mutate (wage_level = Weekly_pay > 500) Exercise 7 Add a trend line to the plot of “GDP_per_cap” against “Year” using the verb geom_smooth(method = "lm"). uk_gdp_and_weeklyWage %>% ggplot(aes(x = Year, y = GDP_per_cap)) + geom_point() + geom_smooth(method = "lm") + ylab("GDP per capita") Box plot Box plot is often used to visualize the distribution of data. It can be used to a single group of data or to compare distribution of data from several groups. For example, plot the “weekly wage” in a box plot. uk_gdp_and_weeklyWage %>% ggplot(aes(y = Weekly_pay)) + geom_boxplot() + ylab("Weekly pay (£)") Running the above code chunk, a simple boxplot is presented to illustrate the distribution of weekly wage in a box, with a vertical line to indicate the whole range of data and a horizontal cross line as the median of data. The other edges of the box represent the 25% and 75% percentiles of data. 
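To see the numbers behind the box, the quartiles can also be computed directly with quantile(); this is just a quick check against the plot, not part of the exercise.

# The 25% and 75% values are the box edges; the 50% value is the median cross line
quantile(uk_gdp_and_weeklyWage$Weekly_pay, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)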
To demonstrate how a box plot compares distribution of data in several groups, one with weekly wage vs weekly wage quartiles can be used as an example in the following code chunk. NOTE: The column “weeklyWage Quartile” needs to be categorized into factors. uk_gdp_and_weeklyWage %>% mutate(weeklyWageQuartile = as.factor(weeklyWageQuartile)) %>% ggplot(aes(x=weeklyWageQuartile, y = Weekly_pay)) + geom_boxplot() + xlab("Weekly Wage Quartile") + ylab("Weekly pay (£)") The result shows how the weekly pay distributions differ by the quartile it is in. What do you observe? Using the same method above, implement a boxplot chart in the following exercise by plotting “GDP_growth_rate_Quartile” vs “Weekly_pay”. Note that any dots in a boxplot are outliers. Exercise 8 uk_gdp_and_weeklyWage <- uk_gdp_and_weeklyWage %>% filter(!is.na(GDP_growth_rate_Quartile)) %>% mutate(GDP_growth_rate_Quartile = as.factor(GDP_growth_rate_Quartile)) uk_gdp_and_weeklyWage %>% ggplot(aes(x=GDP_growth_rate_Quartile, y = Weekly_pay)) + geom_boxplot() + xlab("GDP Growth Rate Quartile") + ylab("Weekly pay (£)") Describe what you observe. ### Tutorial Exercise ##### Exercise: Detecting Outliers in airquality Dataset 41 This exercise guides you through detecting outliers in the Ozone column of the airquality dataset using a boxplot, IQR method, and a scatterplot for visualisation. • Step 1: Load and Explore the Dataset We will use the airquality dataset, which contains air quality measurements. Start by loading and inspecting the data. # Load tidyverse library(tidyverse) # Load the airquality dataset and inspect it data("airquality") head(airquality) airquality<-airquality %>% rowid_to_column("Index") %>% # Add an index column for plotting drop_na(Ozone) # Remove rows with missing Ozone values Exercise 1 Create a boxplot of the Ozone column to visually inspect for outliers. # Boxplot of Ozone levels boxplot(airquality$Ozone, main = "Boxplot of Ozone Levels", ylab = "Ozone Levels", col = "lightblue") Exercise 2 Follow these steps to calculate and use the IQR to detect outliers: Compute the quartiles (Q1, Q3) and the IQR. Calculate the lower and upper bounds for detecting outliers. Identify which data points fall outside these bounds. # Calculate IQR and classify points as outliers ozone_with_outliers <- airquality %>% mutate( Q1 = quantile(Ozone, 0.25), Q3 = quantile(Ozone, 0.75), IQR = Q3 - Q1, Lower_Bound = Q1 - 1.5 * IQR, Upper_Bound = Q3 + 1.5 * IQR, Is_Outlier = if_else(Ozone < Lower_Bound | Ozone > Upper_Bound, "Outlier", "Not outlier") ) # Display the bounds and outliers ozone_bounds <- ozone_with_outliers %>% select(Q1, Q3, IQR, Lower_Bound, Upper_Bound) %>% distinct() cat("Lower and Upper Bounds for Ozone Levels:\n") print(ozone_bounds) cat("\nOutliers in Ozone Levels:\n") 42 print(ozone_with_outliers %>% filter(Is_Outlier == "Outlier") %>% select(Index, Ozone)) Exercise 3 Create a scatterplot where outliers are highlighted in red. # Scatterplot of Ozone levels with outliers highlighted ggplot(ozone_with_outliers, aes(x = Index, y = Ozone, color = Is_Outlier)) + geom_point(size = 3) + scale_color_manual(values = c("Not outlier" = "blue", "Outlier" = "red")) + labs( title = "Scatterplot of Ozone Levels with Outliers Highlighted", x = "Index", y = "Ozone Levels", color = "Legend" ) + theme_minimal() Week 8 Handout Data Wrangling II We begin by loading appropriate packages and libraries. 
library(tidyverse) library(stringr) library(dslabs) data(reported_heights) Testing and improving Developing the right codes for string processing on the first try is often difficult. Trial and error is a common approach to finding the pattern that satisfies all desired conditions. In the above, we have developed a powerful string processing technique that can help us catch many of the problematic entries. Now we will test our approach, search for further problems, and tweak our approach for possible improvements. Let’s write a function that captures all the entries that can’t be converted into numbers remembering that some are in centimeters (we will deal with those later): # function to detect values that are not numerical values of height in inches or cm not_inches_or_cm <- function(x, smallest = 50, tallest = 84){ inches <- suppressWarnings(as.numeric(x)) ind <- !is.na(inches) & ((inches >= smallest & inches <= tallest) | (inches/2.54 >= smallest & inches/2.54 <= tallest)) !ind } Complete the code chunk below to figure out how many problematic entries we have after applying not_inches_or_cm. Exercise 1 problems <- reported_heights %>% filter(not_inches_or_cm(height)) %>% pull(height) length(problems) 43 Now compute what proportion of these fit our pattern after the processing steps we developed in class: Exercise 2 converted <- problems %>% str_replace("feet|foot|ft", "'") %>% # convert feet symbols to ' str_replace("inches|in|''|\"", "") %>% # remove inches symbols str_replace("ˆ([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")# change format pattern <- "ˆ[4-7]\\s*'\\s*\\d{1,2}$" index <- str_detect(converted, pattern) mean(index) This last piece of code should show that we have matched well over half of the strings. Let’s examine the remaining cases: converted[!index] Four clear patterns arise: 1. Many students measuring exactly 5 or 6 feet did not enter any inches, for example 6’, and our pattern requires that inches be included. 2. Some students measuring exactly 5 or 6 feet entered just that number. 3. Some of the inches were entered with decimal points. For example 5’7.5’ ‘. Our pattern only looks for two digits. 4. Some entries have spaces at the end, for example 5’ 9. Although not as common, we also see the following problems: 5. Some entries are in meters and some of these use European decimals: 1.6, 1,70. 6. Two students added cm. 7. A student spelled out the numbers: Five foot eight inches. It is not necessarily clear that it is worth writing code to handle these last three cases since they might be rare enough. However, some of them provide us with an opportunity to learn a few more techniques, so we will build a fix. For Case 1, if we add a '0 after the first digit, for example, convert all 6 to 6'0, then our previously defined pattern will match. This can be done using groups: yes <- c("5", "6", "5") no <- c("5'", "5''", "5'4") s <- c(yes, no) str_replace(s, "ˆ([4-7])$", "\\1'0") The pattern says it has to start (ˆ) with a digit between 4 and 7 and end there ($). The parenthesis defines the group that we pass as \\1 to generate the replacement string. We can adapt this code slightly to handle the Case 2 as well, which covers the entry 5'. Note 5' is left untouched. This is because the extra ' makes the pattern not match since we have to end with a 5 or 6. We want to permit the 5 or 6 to be followed by 0 or 1 feet sign. So we can simply add '{0,1} after the ' to do this. However, we can use the none or once special character ?. 
As we saw above, this is different from * which is none or more. Try this in the following code chunk to cover Case 2. Exercise 3 str_replace(s, "ˆ([56])'?$", "\\1'0") In the above, we only permit 5 and 6, but not 4 and 7. This is because 5 and 6 feet tall is quite common, so we assume those that typed 5 or 6 really meant 60 or 72 inches. However, 4 and 7 feet tall are so rare that, although we accept 84 as a valid entry, we assume 7 was entered in error. We can use quantifiers to deal with Case 3. These entries are not matched because the inches include decimals and our pattern does not permit this. We need to allow the second group to include decimals not just digits. This means we must permit zero or one period . then zero or more digits. So we will be using both ? and *. 44 Also remember that, for this particular case, the period needs to be escaped since it is a special character (it means any character except line break). Here is a simple example of how we can use *. Adapt our pattern, currently ˆ[4-7]\\s*'\\s*\\d{1,2}$ to permit a decimal at the end and re-compute the proportion of entries that fit the pattern. Is there an improvement? Exercise 4 pattern <- "ˆ[4-7]\\s*'\\s*(\\d+\\.?\\d*)$" index <- str_detect(converted, pattern) mean(index) Case 4, meters using commas, we can approach similarly to how we converted the x.y to x'y. A difference is that we require that the first digit be 1 or 2. Try this in the following code chunk to cover Case 4. Exercise 5 yes <- c("1,7", "1, 8", "2, " ) no <- c("5,8", "5,3,2", "1.7") s <- c(yes, no) str_replace(s, "ˆ([12])\\s*,\\s*(\\d*)$", "\\1\\.\\2") Tutorial Exercise Data Wrangling II continued We begin by loading appropriate packages and libraries. library(tidyverse) library(stringr) library(dslabs) library(english) data(reported_heights) Recall also we had defined the functions below: not_inches <- function(x, smallest = 50, tallest = 84){ inches <- suppressWarnings(as.numeric(x)) ind <- is.na(inches) | inches < smallest | inches > tallest ind } not_inches_or_cm <- function(x, smallest = 50, tallest = 84){ inches <- suppressWarnings(as.numeric(x)) ind <- !is.na(inches) & ((inches >= smallest & inches <= tallest) | (inches/2.54 >= smallest & inches/2.54 <= tallest)) !ind } problems <- reported_heights %>% filter(not_inches_or_cm(height)) %>% pull(height) And let us remember what remaining cases there are after the fixes we have developed so far: converted <- problems %>% str_replace("feet|foot|ft", "'") %>% # convert feet symbols to ' 45 str_replace("inches|in|''|\"", "") %>% # remove inches symbols str_replace("ˆ([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")# change format pattern <- "ˆ[4-7]\\s*'\\s*\\d{1,2}$" index <- str_detect(converted, pattern) converted[!index] Let us continue dealing with the remaining problematic cases. Trimming In general, spaces at the start or end of the string are uninformative. These can be particularly deceptive because sometimes they can be hard to see. This is a general enough problem that there is a function dedicated to removing them, str_trim: s <- "Hi " identical(s, "Hi") str_trim("5 ' 9 ") Changing lettercase Strings in R are case sensitive. Often we want to match a word regardless of case. One approach to doing this is to first change everything to lower case and then proceeding ignoring case. As an example, note that one of the entries writes out numbers as words “Five foot eight inches”. Although not efficient, we could add 13 extra str_replace calls to convert zero to 0, one to 1, and so on. 
To avoid having to write two separate operations for Zero and zero, One and one, etc., we can use the str_to_lower function to make all works lower case first: s <- c("Five feet eight inches") str_to_lower(s) Other related functions are str_to_upper and str_to_title. We are now ready to define a procedure that converts all the problematic cases to inches. We now put all of what we have learned together into a function that takes a string vector and tries to convert as many strings as possible to one format. We write a function that puts together what we have done so far. convert_format <- function(s){ s %>% str_replace("feet|foot|ft", "'") %>% str_replace_all("inches|in|''|\"|cm|and", "") %>% str_replace("ˆ([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2") %>% str_replace("ˆ([56])'?$", "\\1'0") %>% str_replace("ˆ([12])\\s*,\\s*(\\d*)$", "\\1\\.\\2") %>% str_trim() } We can also write a function that converts words to numbers. Run and study the code below. words_to_numbers <- function(s){ s <- str_to_lower(s) for(i in 0:11) s <- str_replace_all(s, words(i), as.character(i)) s } Let us see which problematic entries remain. Exercise 1 46 converted <- problems %>% words_to_numbers() %>% convert_format() remaining_problems <- converted[not_inches_or_cm(converted)] pattern <- "ˆ[4-7]\\s*'\\s*\\d+\\.?\\d*$" index <- str_detect(remaining_problems, pattern) remaining_problems[!index] Apart from the cases reported as meters, which we will fix below, the remaining cases all seem to be cases that are impossible to fix because they are nonsensical entries. extract The extract function is a useful tidyverse function for string processing that we will use in our final solution, so we introduce it here. In a previous section, we constructed a regex that lets us identify which elements of a character vector match the feet and inches pattern. However, we want to do more. We want to extract and save the feet and number values so that we can convert them to inches when appropriate. If we have a simpler case like this: s <- c("5'10", "6'1") tab <- data.frame(x = s) Use the separate() function we learnt in Week 6 to achieve our current goal below. Exercise 2 tab %>% separate(x, c("feet", "inches"), sep = "'") The extract function from the tidyr package lets us use regex groups to extract the desired values. Here is the equivalent to the code above using separate but using extract: library(tidyr) tab %>% extract(x, c("feet", "inches"), regex = "(\\d)'(\\d{1,2})") So why do we even need the new function extract? We have seen how small changes can throw off exact pattern matching. Groups in regex give us more flexibility. For example, if we define: s <- c("5'10", "6'1\"","5'8inches") tab <- data.frame(x = s) and we only want the numbers, separate fails: tab %>% separate(x, c("feet","inches"), sep = "'", fill = "right") However, we can use extract. The regex here is a bit more complicated since we have to permit ’ with spaces and feet. We also do not want the ” included in the value, so we do not include that in the group: tab %>% extract(x, c("feet", "inches"), regex = "(\\d)'(\\d{1,2})") Putting it all together We are now ready to put it all together and wrangle our reported heights data to try to recover as many heights as possible. The code is complex, but we will break it down into parts. We start by cleaning up the height column so that the heights are closer to a feet’inches format. We added an original heights column so we can compare before and after. 
Now we are ready to wrangle our reported heights dataset: pattern <- "ˆ([4-7])\\s*'\\s*(\\d+\\.?\\d*)$" smallest <- 50 tallest <- 84 47 new_heights <- reported_heights %>% mutate(original = height, height = words_to_numbers(height) %>% convert_format()) %>% extract(height, c("feet", "inches"), regex = pattern, remove = FALSE) %>% mutate_at(c("height", "feet", "inches"), as.numeric) %>% mutate(guess = 12 * feet + inches) %>% mutate(height = case_when(is.na(height) ~ as.numeric(NA), between(height, smallest, tallest) ~ height, #inches between(height/2.54, smallest, tallest) ~ height/2.54, #cm between(height*100/2.54, smallest, tallest) ~ height*100/2.54, #meters TRUE ~ as.numeric(NA))) %>% mutate(height = ifelse(is.na(height) & inches < 12 & between(guess, smallest, tallest), guess, height)) %>% select(-guess) head(new_heights) We can check all the entries we converted by: new_heights %>% filter(not_inches(original)) %>% select(original, height) %>% arrange(height) %>% View() In the following, show the shortest 10 students in the survey: Exercise 3 new_heights %>% arrange(height) %>% head(n=10) You should see heights of 53, 54, and 55. In the originals, we also have 51 and 52. These short heights are rare and it is likely that the students actually meant 5’1, 5’2, 5’3, 5’4, and 5’5. Because we are not completely sure, we will leave them as reported. The object new_heights contains our final solution for this case study. Week 9 Exam Sample Questions Question: Calculation of Daily Returns and Visualisation Using tidyverse and ggplot2 For this exercise, you will use stock price data from a publicly traded company (e.g., Apple Inc., ticker: AAPL). Use the quantmod package to obtain the data. If you have not installed quantmod, install it first by running install.packages("quantmod"). Data Source: Use the stock data for Apple Inc. (ticker: AAPL) from quantmod. 1. Load and Prepare Data • Use the quantmod package to download daily stock data for Apple Inc. (AAPL) for the year 2021. • Convert the data into a data.frame() and add a date column in Date format. Run the code below to load the data 48 # Install and load the required packages (if not already installed) # install.packages("quantmod") # install.packages("tidyverse") library(quantmod) library(tidyverse) # 1. Download the stock data for Apple Inc. for the year 2021 getSymbols("AAPL", src = "yahoo", from = "2021-01-01", to = "2021-12-31") # Convert the data to a data frame and add a date column # index() ; creates a vector of dates #coredate (); function extracts the numerical data (the actual stock prices) from the AAPL object. aapl_data <- data.frame(date = index(AAPL), coredata(AAPL)) # Display the first 6 rows of the resulting tibble head(aapl_data) 2. Calculate Daily Returns • Use the tidyverse package to calculate the daily returns of the stock. • Daily returns are defined as: Daily Return = Today’s Closing Price − Yesterday’s Closing Price Yesterday’s Closing Price • Store the results in a new column called daily_return. Provide the R code and display the first 6 rows that includes the daily_return column. # 2. Calculate daily returns using the tidyverse aapl_data <- aapl_data %>% arrange(date) %>% mutate(daily_return = (AAPL.Close - lag(AAPL.Close)) / lag(AAPL.Close)) # Display the first 6 rows of the resulting tibble with daily returns head(aapl_data) 3. Add Month Factor and Visualise Daily Returns • Create a new column named Month that extracts the month from the date (as a factor). 
• Use ggplot2 to create a line plot of daily returns over time, colored by the Month variable. • The x-axis should represent the date, the y-axis should represent the daily return, and the color should differentiate each month. • Add a title, labels for the axes, a legend, and customize the theme to make the plot clear and visually appealing. Ensure that the plot adheres to principles of graphical excellence Hint; type in your console ?format() Provide the R code and include the resulting plot. # Install and load ggplot2 if not already installed # install.packages("ggplot2") library(ggplot2) # 3. Create a new column 'Month' to extract the month as a factor 49 # Create a new column named 'Month' by extracting the month from the 'date' column aapl_data$Month <- format(aapl_data$date, "%b") # Extracts the abbreviated month name # Convert the 'Month' column to a factor aapl_data$Month <- factor(aapl_data$Month, levels = month.abb) # month.abb gives abbreviated month name # Check the first few rows of the modified aapl_data head(aapl_data) # Create a line plot of the daily returns using ggplot2, colored by Month ggplot(aapl_data, aes(x = date, y = daily_return, color = Month)) + geom_line(size = 0.7) + labs( title = "Daily Returns of Apple Inc. (AAPL) in 2021 by Month", x = "Date", y = "Daily Return", color = "Month" ) + theme_minimal() + theme( plot.title = element_text(hjust = 0.5, size = 14), axis.title = element_text(size = 12), axis.text = element_text(size = 10), legend.position = "right" ) 4. Analyse the Plot • Briefly describe any patterns or trends you observe in the daily returns over different months. • Discuss how the addition of the Month variable helps in analyzing seasonal trends or changes in volatility. • Explain how your plot adheres to the principles of graphical excellence. • Week 10 Revision Exercise Exercise: Weekly Average Closing Prices of Bitcoin In this exercise, you will analyse the weekly average closing prices of Bitcoin (BTC-USD) for the year 2022. Weekly averages provide a simplified view of trends compared to daily prices. • 1. Load and Prepare Data We will use the quantmod package to obtain daily cryptocurrency price data for Bitcoin (BTC-USD) from Yahoo Finance. The data will then be converted into a data.frame() with an added date column in Date format. # Install and load the required packages (if not already installed) # install.packages("quantmod") # install.packages("tidyverse") library(quantmod) library(tidyverse) # 1. Download the cryptocurrency data for Bitcoin for the year 2022 50 getSymbols("BTC-USD", src = "yahoo", from = "2022-01-01", to = "2022-12-31") # Convert the data to a data frame and add a date column btc_data <- data.frame(date = index(`BTC-USD`), coredata(`BTC-USD`)) # Display the first 6 rows of the resulting data frame head(btc_data) Exercise 1 # 2. Calculate weekly high, low, and average closing prices #We use as.Date(format(date, "%Y-%m-01")) + weeks(as.integer(format(date, "%U"))) to determine the corre #This ensures the week_start is calculated correctly and treated as a Date object, not just a string. 
btc_weekly <- btc_data %>%
  mutate(week_start = as.Date(format(date, "%Y-%m-01")) + weeks(as.integer(format(date, "%U")))) %>%
  group_by(week_start) %>%
  summarise(
    high = max(BTC.USD.Close),
    low = min(BTC.USD.Close),
    average_closing = mean(BTC.USD.Close)
  )
# Display the first 6 rows of the resulting data frame
head(btc_weekly)

Exercise 2

# Install and load ggplot2 if not already installed
# install.packages("ggplot2")
# Create a line plot of weekly average closing prices
library(ggplot2)
btc_weekly %>%
  ggplot(aes(x = week_start, y = average_closing)) +
  geom_line() +
  theme_minimal() +
  labs(x = "Week Start Date", y = "Weekly Average Closing Price ($)",
       title = "Trend of Bitcoin Weekly Average Closing Price in 2022")

4. Analyse the Plot

Describe any trends or patterns you observe in the weekly average closing prices. Discuss the advantages of using weekly averages compared to daily prices for analysis. Reflect on how the plot adheres to the principles of graphical excellence.

Sample Exam

Instructions: This part of the exam is partially open-book, meaning you can consult any module material that has already been downloaded onto your computer or any pre-prepared offline notes on your computer (i.e., you may NOT open an internet browser to access the wider internet). Please submit your .Rmd file and any compiled pdf file.

First, load the following packages and libraries.

library(tidyverse)
library(ggplot2)
library(stringr)

Question 1 (4 points)

Load the two data files, employment_rates_by_graduate_type_200723.csv and yearly_salaries_by_gender2_200723.csv, by completing the code below, and study the entries by checking the last six rows.

emp_rates <- read.csv("employment_rates_by_graduate_type_200723.csv", stringsAsFactors = F)
salaries_gender <- read.csv("yearly_salaries_by_gender2_200723.csv", stringsAsFactors = F)
tail(emp_rates)
tail(salaries_gender)

Question 2 (4 points)

In the salaries_gender data set, the column median_real contains the median salary for the particular year and group (Total for all, Male or Female) in 2007 values, to the nearest £500. We can correct for inflation since then and convert these values to 2024 values by multiplying through by 1.65. Create a new column, median2024, that does this, then remove the median_real column in salaries_gender.

salaries_gender <- salaries_gender %>%
  mutate(median2024 = median_real*1.65) %>%
  select(-median_real)
tail(salaries_gender)

Question 3 (18 points)

a. Compute average median salaries (in 2024 values) and the standard deviation of the median salaries over the years (in 2024 values; recall sd() is the function for calculating standard deviations) by graduate type and gender in the salaries_gender data set. (4 points)

summary_stats <- salaries_gender %>%
  group_by(graduate_type, gender) %>%
  summarise(avg_median_salary = mean(median2024),
            sd_median_salary = sd(median2024)) %>%
  ungroup()
print(summary_stats)

b. We can say the average median pay of Males is statistically significantly greater than the average median pay of Females over the years reported if, from a sample of data,

Sample avg of Male median pay > Sample avg of Female median pay + 1.96 × √( (SD of Male median pays)² + (SD of Female median pays)² ).

Can you find evidence of the above for each graduate type in the salaries_gender data set? Write code that prints out the result of the comparison (either there is or there isn't statistically significant evidence of greater Male pay) for each graduate type.
(14 points) 52 types<- c("Graduate", "Non-Graduate", "Postgraduate") for (y in types){ # Evidence of greater Male pay in graduate_type = y mean_M <- summary_stats %>% filter(graduate_type==y, gender == "Male") %>% pull(avg_median_salary) mean_F <- summary_stats %>% filter(graduate_type==y, gender == "Female") %>% pull(avg_median_salary) sd_M <- summary_stats %>% filter(graduate_type==y, gender == "Male") %>% pull(sd_median_salary) sd_F <- summary_stats %>% filter(graduate_type==y, gender == "Female") %>% pull(sd_median_salary) # } print(c(mean_M,mean_F,sd_M,sd_F)) if(mean_M > mean_F + 1.96*sqrt(sd_Mˆ2+sd_Fˆ2)){ print(paste("There is statistically significant evidence of greater Male pay in",y)) } else{ print(paste("There is NO statistically significant evidence of greater Male pay in",y)) } Question 4 (8 points) The file employment_rates_by_graduate_type_200723.csv contains data on employment rates over the years by graduate type. The column hs_employment_rate contains the rate of high-skilled employment, where high-skilled work is defined by some government guidelines. Plot a time series of high-skilled employment rate by graduate type and comment on what you observe. emp_rates %>% ggplot(aes(time_period, hs_employment_rate)) + geom_line(aes(col=graduate_type)) + xlab("Year") + ylab("Rate (%)") + ggtitle("HS Employment Rate by Graduate Type") + scale_color_discrete(name = "Graduate Type") Question 5 (10 points) Now explore the correlation between hs_employment_rate and the median salary (in 2024 values) by graduate type. You may first need to join the data sets. Comment on what you observe. full_data <- full_join(emp_rates,salaries_gender) full_data %>% filter(gender == "Total") %>% ggplot(aes(hs_employment_rate, median2024)) + geom_point(aes(col=graduate_type)) + xlab("Employment Rate") + ylab("Median Salary (£)") + 53 ggtitle("Median Salary vs. Employment Rate") + scale_color_discrete(name = "Graduate Type") Next, install and load the babynames package by running the code chunk below, which contains all baby names used by at least 5 children in the US from 1880 to 2017. The data set contains information on the year, name and the total number of babies with this name (“n”) and the proportion of babies of that gender with that name born in that year (“prop”). install.packages("babynames",repos = "http://cran.us.r-project.org") library(babynames) head(babynames) Question 6a (6 points) Produce a list of all the girls names that contain the string “eli” or “Eli” in 2017. How many such names are there? list<- babynames %>% filter(year == 2017, sex == "F") list$name %>% str_view("eli|Eli") list$name %>% str_view("eli|Eli") %>% length() Question 6b (10 points) What proportion of the girls contain the string “eli” or “Eli” through the years? Produce a plot to illustrate this and comment on what you observe. list2 <- babynames %>% filter(sex == "F") list2 %>% group_by(year) %>% summarize(prop_eli = mean(str_detect(name, "eli|Eli"))) %>% ggplot(aes(x = year, y = prop_eli)) + geom_line() Comment Having “eli/Eli” in a girl’s name has steeply declined in popularity from 1880 until around 2000, although there was a slight bump around 1960-1970. There seems to be a renewed interest since 2000, but it’s not yet clear if this is also a bump or a permanent trend upwards. 54
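A variant worth noting for Question 6b: mean(str_detect(...)) above gives the share of distinct girls' names containing the string, whereas weighting by the n column gives the share of girls born that year with such a name. A sketch of the weighted version (not part of the original solution):

# Weight by the number of babies given each name, rather than counting distinct names
babynames %>%
  filter(sex == "F") %>%
  group_by(year) %>%
  summarize(prop_eli_babies = sum(n[str_detect(name, "eli|Eli")]) / sum(n)) %>%
  ggplot(aes(x = year, y = prop_eli_babies)) +
  geom_line()

The two series need not coincide, since popular names carry more weight in the second version.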