Data from the Web
stat 579
Heike Hofmann

Outline
• HTML & XML source code
• package XML

Data on Websites
• Midterm elections 2014, House: http://elections.nytimes.com/2014/results/house
• Health data from a government website: http://205.207.175.93/HDI/TableViewer/tableView.aspx?ReportId=74
• Weather updates, e.g. www.kcci.com
• Weather data, e.g. www.wunderground.com/history/airport/KAMW/2013/11/13/DailyHistory.html?
• Weather data, e.g. www.wunderground.com/history/airport/KAMW/2014/11/12/DailyHistory.html?
• Sports data, e.g. http://www.baseball-reference.com/teams/SFG/2014.shtml

Getting Data
• Single data file: copy & paste into spreadsheet software
• Multiple data files: use a parser, e.g. an R script

R Package “XML”
• allows parsing of HTML and XML
• install.packages("XML")
• e.g. team batting statistics for the San Francisco Giants:
    url <- "http://www.baseball-reference.com/teams/SFG/2014.shtml"
    tables <- readHTMLTable(url)

Working with lists
• lists are the most general data type in R; they can contain anything
• use double brackets to access the content of one list element: the content of list item i is [[i]]
• use single brackets to subset a list (the result is another list), i.e. [i:j]; [i] is therefore a list with just one element
• ldply(list, function) from the plyr package applies a function to each list element and combines the results into a data frame

Your turn
• Find your favorite baseball team (if you don’t have a favorite, use the one that matches your initials most closely)
• Use the readHTMLTable function to get the batting statistics for the last three seasons (you might need to find the correct URL first)
• Introduce a new variable called ‘Handedness’ into the data set, with values ‘L’, ‘R’, or ‘B’ for the batting handedness (see the website for how this is encoded).

package lubridate
• lubridate makes it easier to work with dates
• as.Date (base R) converts the most commonly used date formats in character variables to dates
• in case of ambiguities, e.g.
6/3/2012 could be read as the 6th of March (European convention) instead of June 3rd, use the parameter ‘format’
• accessor functions: month(), wday(), year(), …
• comparisons: date > as.Date("2012-10-05")

beyond readHTMLTable?
• go directly to the HTML source code

Viewing the Source
• http://www.public.iastate.edu/~hofmann/stat579/
• HTML is text with tags, i.e. structural elements
• headers, images, links, tables: e.g. <title>Statistics 579</title>

<html>
<head>
<title>stat579. intro statistical computing. iastate.</title>
<link rel="stylesheet" type="text/css" href="http://www.public.iastate.edu/~hofmann/style.css">
</head>
<body>
<div id="navbar">
<p><a href="/~stat579/index.html">&#8594;home</a></p>
</div>
<div id="wrap">
<h1>stat 579</h1>
<h2 class="subhead">Introduction to Statistical Computing</h2>
<p><strong>Fall 2009</strong><br />
<p>Section A Thursday 12:10&#8211;2:00. Carver 205</p>
<p>Section B Tuesday 12:10&#8211;2:00. Snedecor 3121</p>
<p>Heike Hofmann, <a href="mailto:hofmann@iastate.edu">hofmann@iastate.edu</a>.<br />
Office hours by arrangement. Snedecor 2413 </p>
<p>TA: Ai-Ling Teh, <a href="mailto:ailing@iastate.edu">ailing@iastate.edu</a>.<br />
Office hour Monday 2-3, Snedecor 3404. </p>
<h2>Syllabus</h2>
<p><a href="syllabus.html">Course syllabus</a> describing objectives, software used, topics and assessment.</p>
<h2>Lectures and timetable</h2>
<table width="100%">
<tr>
<th width="15%">Date</th>
<th>Lecture and Resources</th>
<th>Homework</th>
</tr>
<tr class="topic first"><td colspan="4">R Basics &amp; Setting up the Working Environment</td></tr>
<tr>
<td>Aug 25/27</td>
<td><a href="lectures/01-introduction.html">Numeric and Visual Summaries</a> (<a href="http://connect.extension.iastate.edu/p31588910/">movie</a> <a href="http://lasonline.iastate.edu/stat579/media/Stat_579_8-28-2009.mov">movie.mov</a>) </td>
</tr>
<tr>
<td>Sep 1/3</td>
<td><a href="lectures/02-working-directories.html">Working Directories, Data frames &amp; subsets, logical operators</a> ( <a href="http://connect.extension.iastate.edu/p45341752/">movie</a> <a href="http://lasonline.iastate.edu/stat579/media/Stat_579_9-3-2009.mov">movie.mov</a>)</td>
<td><a href="homework/week1.html">Week 2</a>, due Sep 8/10.</td>

HTML is a tree
[Diagram: an HTML document drawn as a tree. Node labels: root, head, body, footer, table, tr, tr, td, td, td. Function labels: xmlRoot (the root node), xmlChildren (the child nodes of a node), xmlValue (the content of a node).]

Your turn
• The website http://www.google.org/flutrends/us/#US shows the Google trends for flu cases across the US
• Read the data into R (without using any text editor help)
• Extract the data for the last month for all states.
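
The tree navigation named on the "HTML is a tree" slide (xmlRoot, xmlChildren, xmlValue) can be sketched on a small example. This is a minimal sketch, not the course's own solution: the page below is a hypothetical miniature of the timetable excerpt, parsed from a string so that no network access is needed, and the variable names (page, body_node, table_node, timetable) are illustrative assumptions.

```r
library(XML)  # assumption: the XML package is installed

# Hypothetical miniature page. The tags are written without whitespace
# between them so that the tree contains only element nodes.
page <- paste0(
  "<html><head><title>Statistics 579</title></head>",
  "<body><table>",
  "<tr><th>Date</th><th>Lecture</th></tr>",
  "<tr><td>Aug 25/27</td><td>Numeric and Visual Summaries</td></tr>",
  "<tr><td>Sep 1/3</td><td>Working Directories</td></tr>",
  "</table></body></html>"
)

doc  <- htmlParse(page, asText = TRUE)  # parse the HTML text into a tree
root <- xmlRoot(doc)                    # top node of the tree: <html>

# xmlChildren() returns a list of child nodes, named by their tag
top        <- xmlChildren(root)
body_node  <- top[["body"]]
table_node <- xmlChildren(body_node)[["table"]]
rows       <- xmlChildren(table_node)   # one list element per <tr>

# xmlValue() extracts the text of a node; apply it to every cell of every row
cells <- lapply(rows, function(tr) unname(sapply(xmlChildren(tr), xmlValue)))

# Stack the data rows (everything but the header row) into a data frame
timetable <- as.data.frame(do.call(rbind, unname(cells[-1])),
                           stringsAsFactors = FALSE)
names(timetable) <- cells[[1]]
timetable
```

On a real page the same steps apply after replacing the string with a URL, although real pages usually contain whitespace and nested elements, so the path from the root to the table has to be checked against the page source first.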