Maps and Time Series
Stat 579
Heike Hofmann

Outline
• Melting and Casting
• Maps: polygons, choropleth
• Time series

Warm-up
• Start R and load data ‘fbi’ from http://www.hofroe.net/stat579/crimes-2012.csv
• This data set contains the number of crimes by type for each state in the U.S.
• Investigate which states have the highest number of crimes (almost independently of type)
• Pick one state and crime type and plot a time series

Getting ready for loops
• Let’s concentrate on the years since 2000
• Pick a state and fit a model (use lm) in the number of Burglaries over time (i.e. lm(Burglary~Year))
• Save the resulting object. Investigate it with your poking and prodding functions.
• Extract the coefficients (intercept and slope) from the model
• Repeat for another state.
• How can we extract coefficients for all states?

Iterations
• Want to run the same block of code multiple times:

    for (i in allstates) {
      onestate <- subset(fbi, state==i & Year >= 2000)   # block of commands
      model <- lm(Burglary~Year, data=onestate)
      print(coef(model))                                 # output
    }

• Loop or iteration

Why should we avoid loops?
• speed of for-loops is still an issue
• main reason: lots of error-prone housekeeping chores before and after the ‘meat’

fbi exploration
• Plot a scatterplot of population size against number of violent crimes in 2012. What is your conclusion? How do things change in 2011?
• Plot population against number of burglaries in 2012. What is your conclusion there?
• What should we look at instead?

Reshaping Data
• Two step process:
  • get data into a “convenient” shape, i.e. one that is particularly flexible (melt)
  • cast data into new shape(s) that are better suited for analysis (cast)

    melt.data.frame(data, id.vars, measure.vars, na.rm = FALSE, ...)

• id.vars: all identifiers (keys) and qualitative variables
• measure.vars: all quantitative variables
• molten form: “long & skinny”
[Diagram: original data with columns key, X1, X2, X3, X4, X5 (id.vars and measure.vars marked) next to its molten form]

Casting
• Function cast:

    dcast(dataset, rows ~ columns, aggregate)

[Diagram: cast table laid out by rows and columns; each cell holds aggregate(data)]
• Data aggregation sometimes is just a transformation

    > fbi.melt <- melt(fbi, id.vars=c("State","Abbr","Population"), measure.vars=4:12)

    > head(fbi.melt)
           State Abbr Population      variable  value
    1    Alabama   AL    4708708 Violent.crime  21179
    2     Alaska   AK     698473 Violent.crime   4421
    3    Arizona   AZ    6595778 Violent.crime  26929
    4   Arkansas   AR    2889450 Violent.crime  14959
    5 California   CA   36961664 Violent.crime 174459
    6   Colorado   CO    5024748 Violent.crime  16976

    > tail(fbi.melt)
                State Abbr Population            variable value
    445       Vermont   VT     621760 Motor.vehicle.theft   448
    446      Virginia   VA    7882590 Motor.vehicle.theft 11419
    447    Washington   WA    6664195 Motor.vehicle.theft 23680
    448 West Virginia   WV    1819777 Motor.vehicle.theft  2741
    449     Wisconsin   WI    5654774 Motor.vehicle.theft  8926
    450       Wyoming   WY     544270 Motor.vehicle.theft   771

    > summary(fbi.melt)
            State          Abbr        Population
     Alabama   :  9   AK     :  9   Min.   :  544270
     Alaska    :  9   AL     :  9   1st Qu.: 1796619
     Arizona   :  9   AR     :  9   Median : 4403094
     Arkansas  :  9   AZ     :  9   Mean   : 6128138
     California:  9   CA     :  9   3rd Qu.: 6664195
     Colorado  :  9   CO     :  9   Max.   :36961664
     (Other)   :396   (Other):396
                                     variable       value
     Violent.crime                       : 50   Min.   :      7
     Murder.and.nonnegligent.manslaughter: 50   1st Qu.:   1536
     Forcible.rape                       : 50   Median :  11056
     Robbery                             : 50   Mean   :  47124
     Aggravated.assault                  : 50   3rd Qu.:  37964
     Property.crime                      : 50   Max.   :1009614
     (Other)                             :150

• Incidence rates are now easy to compute:

    fbi.melt$irate <- fbi.melt$value/fbi.melt$Population

• Recreate this chart of incidence rates
[Figure: incidence rates by state, one panel per crime type (Murder.and.nonnegligent.manslaughter, Forcible.rape, Robbery, Motor.vehicle.theft, Aggravated.assault, Violent.crime, Burglary, Larceny.theft, Property.crime); y axis: reorder(State, irate); x axis: count]

Then, cast
• Row variables, column variables, and a summary function (sum, mean, max, etc.)
• dcast(molten, row ~ col, summary)
• dcast(molten, row1 + row2 ~ col, summary)
• dcast(molten, row ~ . , summary)
• dcast(molten, . ~ col, summary)

Casting
• Using dcast:
  • find the number of all offenses in 2009
  • find the number of offenses by type of crime
  • find the number of all offenses by state

What is a map?
• Set of points specifying latitude and longitude
[Figure: points outlining a state; axes: long, lat]
• Polygon: connect dots in correct order
[Figure: the same points connected into an outline; axes: long, lat]

What is a map?
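The idea that a map polygon is just a set of points connected in the correct order can be sketched in base R. The data frame `square`, its `order` column and the helper `path_length` are made up for illustration and are not part of the course data. The sketch compares the closed path through the same four points in the right and in the wrong row order.

```r
# Toy "map": the four corners of a unit square, with the order in which
# they should be connected (all names here are illustrative assumptions).
square <- data.frame(
  long  = c(0, 1, 1, 0),
  lat   = c(0, 0, 1, 1),
  order = 1:4
)

# Same points, wrong row order: connecting them row by row cuts across
# the square along its diagonals instead of tracing the outline.
shuffled <- square[c(1, 3, 2, 4), ]

# Length of the closed path that visits the points row by row.
path_length <- function(d) {
  x <- c(d$long, d$long[1])
  y <- c(d$lat,  d$lat[1])
  sum(sqrt(diff(x)^2 + diff(y)^2))
}

path_length(square)    # outline of the square
path_length(shuffled)  # longer, because the path crosses itself

# Sorting by the order column restores the polygon.
restored <- shuffled[order(shuffled$order), ]
```

The data returned by map_data() in ggplot2 carries an order column of this kind, and the path and polygon geoms connect rows in the order in which they appear in the data.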
• Polygon: connect only the correct dots
[Figure: outlines of several states; axes: long, lat]

Grouping
• Use parameter group to connect the “right” dots (need to create grouping sometimes)

    qplot(long, lat, geom="point", data=states)
    qplot(long, lat, geom="path", data=states, group=group)
    qplot(long, lat, geom="polygon", data=states, group=group, fill=region)
    qplot(long, lat, geom="polygon", data=states.map, fill=lat, group=group)

[Figures: one map of the U.S. states per call above (points, paths, polygons filled by region, polygons filled by lat); axes: long, lat]

Practice
• Using the maps package, pull out map data for all US counties

    counties <- map_data("county")

• Draw a map of counties (polygons & path geom)
• Colour all counties called “story”
• Advanced: What county names are used often?

Merging Data
• Merging data from different datasets:

    merge(x, y, by = intersect(names(x), names(y)),
          by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
          sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)

  e.g.:

    states.fbi <- merge(states, fbi.cast, by.x="", by.y="Abbr")

Merging Data
[Diagram: two tables sharing the key column region (value alabama), one with column X1 and one with columns X2 and X3, merged into one table with columns region, X1, X2, X3]

Practice
• Merge the fbi crime data and the map of the States
• Plot choropleth maps of crimes.
• Describe the patterns that you see.
• Advanced: try to cluster the states according to crime rates (use hclust)

Time Series

NASA Meteorological Data
• 24 x 24 grid across Central America
• satellite captured data: temperature, near surface temperature (surftemp), pressure, ozone, cloud coverage: low (cloudlow), medium (cloudmid), high (cloudhigh)
• for each location monthly averages for Jan 1995 to Dec 2000
[Figure: the grid of locations; axes: Gridx 1 to 24, Gridy 1 to 24]

What is a Time Series?
• for each location multiple measurements

    qplot(time, temperature, geom="point", data=subset(nasa, (x==1) & (y==1)))

[Figure: scatterplot for one location; axes: TimeIndx, ts]
• connected by a line

    qplot(time, temperature, geom="line", data=subset(nasa, (x==1) & (y==1)))

[Figure: line plot for one location; axes: TimeIndx, ts]
• but only connect the right points

    qplot(time, temperature, geom="line", data=subset(nasa, (x==1) & (y %in% c(1,15))), group=y)

[Figure: one line per location; axes: TimeIndx, ts]

Practice
• For each location, draw a time series for pressure. What do you expect? Are there surprising values? Which are they?
• Plot near surface temperatures for each location. Which locations show the highest range in temperatures? Which locations show the highest overall increase in temperatures?
• Use ddply to get these summaries
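A minimal sketch of the kind of ddply summary the last exercise asks for. It runs on a small made-up data frame `toy` rather than on the nasa data; the column names x, y, time and surftemp follow the names used on the slides, but their layout in the real data is an assumption. The overall increase is measured here as the slope of a linear fit over time, which is one possible choice and not necessarily the intended one.

```r
library(plyr)

# Made-up data: two grid locations with four time points each.
toy <- data.frame(
  x        = rep(c(1, 2), each = 4),
  y        = 1,
  time     = rep(1:4, times = 2),
  surftemp = c(290, 298, 289, 297,   # location (1, 1): large swings
               280, 282, 284, 286)   # location (2, 1): steady increase
)

# Split by location (x, y), compute the summaries for each piece,
# and combine the pieces into one data frame with one row per location.
loc.summary <- ddply(toy, .(x, y), function(d) {
  data.frame(
    temp.range = max(d$surftemp) - min(d$surftemp),
    temp.slope = coef(lm(surftemp ~ time, data = d))[["time"]]
  )
})

# Locations sorted by temperature range, largest first.
loc.summary[order(-loc.summary$temp.range), ]
```

In the toy data the location with the largest range is not the one with the steepest increase, so the two questions in the exercise can have different answers.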