Subsetting Data
Stat 480
Heike Hofmann

Outline
• More on qplot
• Subsetting Data
• Indexing (review)
• Logical Operations

Scatterplot
Load the fbi data into R, load the ggplot2 package
• Two continuous variables
• qplot(Burglary, Murder, data=fbi)
• qplot(log(Burglary), log(Murder), data=fbi)
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi)

Revision: Interpreting a scatterplot
• Big patterns
  • Form and direction
  • Strength
• Small patterns
  • Deviations from the pattern
  • Outliers

Interpreting Scatterplots
• Form
  • Is the plot linear? Is the plot curved? Is there a distinct pattern in the plot? Are there multiple groups?
• Strength
  • Does the plot follow the form very closely? Or is there a lot of variation?

Interpreting Scatterplots
• Direction
  • Is the pattern increasing? Is the plot decreasing?
  • Positively: Above (below) average in one variable tends to be associated with above (below) average in another variable.
  • Negatively: Above (below) average in one variable tends to be associated with below (above) average in another variable.

• Form: Linear
• Strength: Strong, very close to a straight line.
• Direction: Two variables are positively associated.
• No outliers.

• Form: Roughly linear, two distinct groups (more than 40% and less than 40%).
• Strength: Not strong. Data points are scattered.
• Direction: Negatively associated.
• Outliers: None

Aesthetics
• Can map other variables to size or colour
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, colour=State)
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, size=Population)
• other aesthetics: shape

Facetting
• Can facet to display plots for different subsets
• qplot(Year, Population, data=fbi, facets=~State)

Facets vs aesthetics
• Will need to experiment as to which one answers your question/tells the story best
• Remember, just like with pivot tables we want comparisons of interest to be close together

Univariate plots
qplot(x, data=dataset)
• produces histogram or barchart
• Categorical variable → bar chart geom="bar"
• Continuous variable → histogram geom="histogram"
• For the histogram, you should always vary the binwidth

Histograms and Barcharts
• What do we look for?
  • Symmetry/Skewness
  • Modes, Groups (big pattern: where is the bulk of the data?)
  • Gaps & Outliers (deviation from the big pattern: where are the other points?)

Aesthetics & facetting
• Use fill to color bars according to another variable, or use facetting to compare subsets
• Facetting (later) is generally more useful, as it is easier to compare different groups

Boxplots
• definition by J.W. Tukey (1960s, EDA 1977)
• qplot(x, y, geom="boxplot")
[Figure: annotated boxplot]
• Median
• quartiles: 25% and 75%
• IQR = inter-quartile range = upper quartile - lower quartile
• hinges: data point ≥ 25% - 1.5 * IQR (lower), data point ≤ 75% + 1.5 * IQR (upper)
• outliers: data points between hinges and quartile ± 3*IQR
• extreme outliers: data points beyond quartile ± 3*IQR

Boxplots
• Pros:
  • Symmetry vs Skewness
  • Outliers
  • Quick Summary
  • Comparisons across multiple Treatments (side by side boxplots)
• Cons:
  • Boxplots hide multiple modes and gaps in the data

Your turn
• Explore the distribution of Murders
• What can you see? What might explain that pattern?
• Make sure to experiment with bin width in histograms!
• Use facetting to explore the relationship between Murders and States

Subsets of Data
• Facetting & Zooming are visual ways of subsetting the data
• Next: use R to subset

Logical vectors
• Very important!
• Usually created with a logical comparison
• <, >, ==, !=, <=, >=
• x %in% c(1, 4, 3, 7)
• subset

Logical expressions
• & and | are the logical and and or
• ! is the logical negation
• use parentheses () when linking expressions to avoid mis-interpretation

Logical Operators
• A & B is the set of elements that is both in A and B
• A | B is the set of elements that is in A or in B or in both

Updating subsets
• You can take a subset and update the original data
• a <- 1:4
• a[2:3] <- 0
• a
• Very useful with logical subsetting

Practice
• a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)
• Get logical vector that is TRUE when number is:
  • less than 20
  • squared value is at least 100 or less than 10
  • equals 1 or 3
  • even (look at a %% 2)

Subsets
• subset(dataset, logical expression)
• subset(fbi, Year == 2011)
• subset(fbi, (Year == 2011) & (State == "Kansas"))

Useful Commands
• nrow(dataset) # number of records
• quantile(variable, probs=0.001, na.rm=T) # retrieves 0.1 percentile of variable
• which(logical variable) # retrieves all indices for which the variable is TRUE
• which.max(variable) which.min(variable) # retrieve index of highest (lowest) value in variable

Your Turn
FBI Data
• Get a subset of all crimes in Iowa, i.e.: ... iowa <- ... Plot incidences/rates for one type of crime over time.
• Get a subset of all crimes in 2009, and plot one aspect of it.
• Get a subset of the data that includes number of Homicides in the last five years. Find rate of homicides, extract all states that have a rate > 90% across the States, and plot.
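
Worked sketch
The logical comparisons, logical operators, subset updating and useful commands from the slides above can be tried out on small made-up data. This is only an illustrative sketch: the vector v and the data frame toy (with its columns Year, State and Count) are hypothetical stand-ins, not the practice vector or the fbi data.

```r
# Hypothetical vector, not the practice vector from the slides
v <- c(2, 7, 12, 5, 30)

# Logical comparisons return one TRUE/FALSE per element
small  <- v < 10                 # TRUE TRUE FALSE TRUE FALSE
in_set <- v %in% c(5, 30)        # FALSE FALSE FALSE TRUE TRUE

# Combine expressions with & (and), | (or), ! (not);
# parentheses keep the grouping explicit
either  <- (v < 5) | (v > 20)    # TRUE FALSE FALSE FALSE TRUE
neither <- !either

# Update a subset selected by a logical vector
w <- v
w[w > 10] <- 0                   # w is now 2 7 0 5 0

# which() turns a logical vector into indices
idx <- which(v > 10)             # 3 5
top <- which.max(v)              # 5

# Made-up data frame standing in for a data set such as fbi
toy <- data.frame(
  Year  = c(2010, 2011, 2011, 2011),
  State = c("Iowa", "Iowa", "Kansas", "Ohio"),
  Count = c(10, 12, 8, 20)
)

# subset() keeps the rows where the logical expression is TRUE
toy_2011 <- subset(toy, Year == 2011)
toy_ks   <- subset(toy, (Year == 2011) & (State == "Kansas"))
nrow(toy_2011)                   # 3
```

The same pattern should carry over to a real data set by swapping toy for the data frame and adjusting the column names in the logical expression.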