Data from the Web stat 579
 Heike Hofmann

Outline
• HTML & XML source code
• package XML
Data on Websites
• Midterm elections 2014, House
http://elections.nytimes.com/2014/results/house
• Health Data from Government website
http://205.207.175.93/HDI/TableViewer/tableView.aspx?ReportId=74
• Weather Updates, e.g. www.kcci.com
• Weather Data, e.g.
www.wunderground.com/history/airport/KAMW/2013/11/13/DailyHistory.html?
www.wunderground.com/history/airport/KAMW/2014/11/12/DailyHistory.html?
• Sports Data, e.g. http://www.baseball-reference.com/teams/SFG/2014.shtml
Getting Data
• Single data file: copy & paste into spreadsheet software
• Multiple data files: use a parser, e.g. an R script
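When the same kind of page exists for many dates, a script can generate the addresses instead of copying them by hand. The lines below are a minimal sketch that assumes the weather pages follow the date pattern seen in the two example addresses; the date range and the variable names are made up for illustration, and nothing is downloaded here.

```r
# Three consecutive days (an arbitrary range chosen for illustration)
dates <- seq(as.Date("2014-11-10"), as.Date("2014-11-12"), by = "day")

# Assumed URL pattern, taken from the example addresses: .../KAMW/YYYY/MM/DD/DailyHistory.html
urls <- sprintf(
  "http://www.wunderground.com/history/airport/KAMW/%s/DailyHistory.html",
  format(dates, "%Y/%m/%d")
)

print(urls)
# Each element of urls could then be handed to a parser inside a loop or lapply().
```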
R Package “XML”
• allows parsing of HTML and XML
• install.packages("XML")
• e.g. team batting statistics for San Francisco Giants:
library(XML)
url <- "http://www.baseball-reference.com/teams/SFG/2014.shtml"
tables <- readHTMLTable(url)
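To see what readHTMLTable returns without depending on a live website, the sketch below parses a small HTML string instead of a URL. The table contents (player names and numbers) are invented for illustration; only htmlParse and readHTMLTable come from the XML package.

```r
library(XML)

# A tiny stand-in for a web page; the data are made up
html <- paste0(
  "<html><body><table>",
  "<tr><th>Name</th><th>AB</th><th>H</th></tr>",
  "<tr><td>Player A</td><td>500</td><td>150</td></tr>",
  "<tr><td>Player B*</td><td>400</td><td>100</td></tr>",
  "</table></body></html>"
)

doc    <- htmlParse(html, asText = TRUE)                # parse the string as HTML
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)  # a list with one data frame per <table>

batting <- tables[[1]]                   # double brackets: the data frame itself
batting$AB <- as.numeric(batting$AB)     # cells arrive as text, so convert
batting$H  <- as.numeric(batting$H)
str(batting)
```

The result is a list even when the page holds a single table, so the table of interest still has to be picked out of it.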
Working with lists
• lists are the most general data type in R: they can contain anything
• Use double brackets to access the content of one list element:
content of list item i: [[i]]
• Use single brackets to subset a list (the result is another list), i.e. [i:j]
[i] is therefore a list with just one element
• ldply(list, function) from the plyr package applies a function to each list element and combines the results into a data frame
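The difference between the two kinds of brackets is easiest to see on a small list. This is a sketch with invented data: the list seasons and its contents are hypothetical, standing in for tables read from three season pages.

```r
library(plyr)

# A named list of data frames (made-up numbers)
seasons <- list(
  "2012" = data.frame(Name = c("Player A", "Player B"), H = c(150, 100)),
  "2013" = data.frame(Name = c("Player A", "Player C"), H = c(140, 90)),
  "2014" = data.frame(Name = c("Player B", "Player C"), H = c(120, 110))
)

one_season <- seasons[[1]]   # double brackets: the data frame stored in element 1
sub_list   <- seasons[2:3]   # single brackets: a list holding elements 2 and 3
single     <- seasons[1]     # still a list, with just one element

# ldply: apply a function to every element, stack the results into one data frame
summary_df <- ldply(seasons, function(d) {
  data.frame(players = nrow(d), mean_H = mean(d$H))
})
print(summary_df)
```

For a named list, ldply adds an .id column holding the element names, which here records the season each row came from.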
Your turn
• Find your favorite baseball team (if you don’t have a
favorite, use the one that matches your initials the
closest)
• Use the readHTMLTable function to get the
batting statistics for the last three seasons (you might
need to find the correct URL first)
• Introduce a new variable called ‘Handedness’ into the data set that has values ‘L’, ‘R’, or ‘B’ for the batting handedness (see the website for how this is encoded).
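One possible way to derive such a variable is sketched below. It assumes that handedness is marked by a symbol appended to the player name, with a trailing * for left-handed batters, a trailing # for switch hitters, and no mark for right-handed batters. That encoding is an assumption to be checked against the legend on the website, and the player names are made up.

```r
# Hypothetical names; the trailing-symbol encoding is an assumption
batting <- data.frame(
  Name = c("Player A", "Player B*", "Player C#"),
  stringsAsFactors = FALSE
)

# Nested ifelse: test for '*' first, then '#', otherwise right-handed
batting$Handedness <- ifelse(grepl("\\*$", batting$Name), "L",
                      ifelse(grepl("#$", batting$Name), "B", "R"))

# Remove the marker so the names are clean
batting$Name <- sub("[*#]$", "", batting$Name)
print(batting)
```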
package lubridate
• lubridate makes it easier to work with dates
• as.Date (base R) converts the most commonly used date formats in character variables to dates
• in case of ambiguity, e.g. 6/3/2012 could be interpreted (in Europe) as the 6th of March instead of June 3rd, use the parameter ‘format’
• accessor functions: month(), wday(), year(), …
• comparisons: date > as.Date("2012-10-05")
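The ambiguity in 6/3/2012 can be made concrete with a short sketch. The variable names us and eu are chosen here for illustration; the format codes %m, %d and %Y are the standard month, day and four-digit-year placeholders.

```r
library(lubridate)

us <- as.Date("6/3/2012", format = "%m/%d/%Y")  # read as month/day/year
eu <- as.Date("6/3/2012", format = "%d/%m/%Y")  # read as day/month/year

print(us)               # June 3rd
print(eu)               # 6th of March

# Accessor functions from lubridate
month(us)
month(eu)
year(us)
wday(us, label = TRUE)  # day of the week as a label

# Dates can be compared directly
us > as.Date("2012-10-05")
```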
beyond readHTMLTable?
• go directly to HTML source code
Viewing the Source
• http://www.public.iastate.edu/~hofmann/stat579/
• HTML is text with tags, i.e. structural elements
• headers, images, links, tables:
e.g. <title>Statistics 579</title>
<html>
<head>
<title>stat579. intro statistical computing. iastate.</title>
<link rel="stylesheet" type="text/css" href="http://www.public.iastate.edu/~hofmann/style.css">
</head>
<body>
<div id="navbar">
<p><a href="/~stat579/index.html">→home</a></p>
</div>
<div id="wrap">
<h1>stat 579</h1>
<h2 class="subhead">Introduction to Statistical Computing</h2>
<p><strong>Fall 2009</strong><br />
<p>Section A Thursday 12:10–2:00. Carver 205</p>
<p>Section B Tuesday 12:10–2:00. Snedecor 3121</p>
<p>Heike Hofmann, <a href="mailto:hofmann@iastate.edu">hofmann@iastate.edu</a>.<br />
Office hours by arrangement. Snedecor 2413
</p>
<p>TA: Ai-Ling Teh, <a href="mailto:ailing@iastate.edu">ailing@iastate.edu</a>.<br />
Office hour Monday 2-3, Snedecor 3404.
</p>
<h2>Syllabus</h2>
<p><a href="syllabus.html">Course syllabus</a> describing objectives, software used, topics and assessment.</p>
<h2>Lectures and timetable</h2>
<table width="100%">
<tr>
<th width="15%">Date</th>
<th>Lecture and Resources</th>
<th>Homework</th>
</tr>
<tr class="topic first"><td colspan="4">R Basics & Setting up the Working Environment</td></tr>
<tr>
<td>Aug 25/27</td>
<td><a href="lectures/01-introduction.html">Numeric and Visual Summaries</a> (<a href="http://connect.extension.iastate.edu/p31588910/">movie</a> <a
href="http://lasonline.iastate.edu/stat579/media/Stat_579_8-28-2009.mov">movie.mov</a>)
</td>
</tr>
<tr>
<td>Sep 1/3</td>
<td><a href="lectures/02-working-directories.html">Working Directories, Data frames & subsets, logical operators</a> (
<a href="http://connect.extension.iastate.edu/p45341752/">movie</a> <a href="http://lasonline.iastate.edu/stat579/media/Stat_579_9-3-2009.mov">movie.mov</a>)</td>
<td><a href="homework/week1.html">Week 2</a>, due Sep 8/10.</td>
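HTML is a tree
[Figure: tree diagram of an HTML document. The root sits at the top; below it are nodes such as head, body and footer; further down, a table node contains tr rows, and each tr contains td cells. Labels on the diagram: xmlRoot (the root node), xmlChildren (the nodes one level down), xmlValue (the text inside a node).]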
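The three functions named in the diagram can be tried on a small document. The sketch below parses an inline HTML string rather than a live page; the string is a cut-down, hypothetical version of the course page, with two table rows reusing text from the source shown earlier.

```r
library(XML)

# Written without whitespace between tags so that no blank text nodes appear
html <- paste0(
  "<html><head><title>demo</title></head>",
  "<body><table>",
  "<tr><td>Aug 25/27</td><td>Numeric and Visual Summaries</td></tr>",
  "<tr><td>Sep 1/3</td><td>Working Directories</td></tr>",
  "</table></body></html>"
)

doc  <- htmlParse(html, asText = TRUE)
root <- xmlRoot(doc)                    # top of the tree: the <html> node
print(names(xmlChildren(root)))         # names of the nodes one level down

tbl  <- root[["body"]][["table"]]       # walk down by element name
rows <- xmlChildren(tbl)                # one entry per <tr>

# For every row, collect the text of its <td> cells
cells <- lapply(rows, function(tr) unname(sapply(xmlChildren(tr), xmlValue)))
first_date <- cells[[1]][1]
print(cells)
```

Walking the tree by hand like this is what makes it possible to reach content that is not laid out as a regular table.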
Your turn
• The website http://www.google.org/flutrends/us/#US shows the Google trends for flu cases across the US
• Read the data into R (without the help of a text editor)
• Extract the data for the last month for all states.