Lecture 2

Data Collection Challenges
How to collect the data?
How to store the data?
◦ Database or files?
◦ Cost of storage and bandwidth
What is the right data format?
◦ Improve readability or optimize storage?
◦ Human readable or computer processed?
How to present the data?
◦ Visualization, machine/human readable?
Outline
Data formats:
◦ CSV
◦ XML
◦ JSON
Data collection:
◦ Web crawlers
◦ wget
◦ APIs
CSV – Comma-Separated Values
• Great for flat data
  • For example, log data from web servers or sensors
• Compact as text data
• Easily imported into spreadsheets
• Human readable
• Easy sequential access
John, 2008, 20.50, Detroit, Michigan
Michael, 2003, 55.00, San Francisco, California
Mary, 2014, 7.75, , Wisconsin
Kelli, Kyle and Kat, 2010, 35.00, Miami, Florida
CSV – Comma-Separated Values
• Escape strings: how are quotes and special characters represented?
• Field delimiter: what if a value contains the delimiter itself?
• Lacks metadata: users must provide field descriptions somewhere else
• What if the data does not fit into discrete rows?
• What if the rows do not have a fixed size?
CSV – Comma-Separated Values

Original (note the missing field and the embedded commas):
John, 2008, 20.50, Detroit, Michigan
Michael, 2003, 55.00, San Francisco, California
Mary, 2014, 7.75, Wisconsin
Kelli, Kyle and Kat, 2010, 35.00, Miami, Florida

Fixed (add metadata, add delimiters, add a placeholder for the missing field):
Name, Start Year, Hourly Pay, City, State
John, 2008, 20.50, Detroit, Michigan
John, 2008, 20.50, Ann Arbor, Michigan
Michael, 2003, 55.00, San Francisco, California
Mary, 2014, 7.75, , Wisconsin
'Kelli, Kyle & Kat', 2010, 35.00, Miami, Florida
CSV Library in Python
# Load the csv library
import csv
# Open the file and create a reader object
f = open('mydata.csv')
csv_f = csv.reader(f)
# Loop through each row and print it
for row in csv_f:
    print(row)
# Rewind the file: the reader was exhausted by the first loop
f.seek(0)
# Loop through each row again and print the first value
for row in csv_f:
    print(row[0])
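Because the fixed file above carries a header row, that metadata can drive the parsing. A minimal sketch using csv.DictReader (assuming the same hypothetical mydata.csv with the Name/Start Year/Hourly Pay/City/State header):

import csv
# DictReader maps each row to the names in the header row;
# skipinitialspace handles the blanks after each comma
with open('mydata.csv') as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        print(row['Name'], row['City'])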
XML – eXtensible Markup Language
• The data describes itself
• Widely supported
• Good for structured data
• A tree-like model:
  • One root element
  • Each element may contain other elements
• Verbose format => additional storage and bandwidth

<books>
  <book>
    <title>Data just right</title>
    <author>Michael Manoochehri</author>
  </book>
  <book>
    <title>Introduction to data mining</title>
    <author>Pang Ning Tan</author>
    <author>Michael Steinbach</author>
    <author>Vipin Kumar</author>
  </book>
</books>
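Parsing such a document is straightforward with Python's built-in ElementTree. A minimal sketch, assuming the XML above is saved as a hypothetical books.xml:

import xml.etree.ElementTree as ET
# Walk the tree from the root <books> element
root = ET.parse('books.xml').getroot()
for book in root.findall('book'):
    title = book.find('title').text
    authors = [a.text for a in book.findall('author')]
    print(title, authors)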
JSON – JavaScript Object Notation
• A valid JavaScript object
• Easy to use with JavaScript and other languages
• Lighter-weight syntax than XML, so generally faster to parse
• Still a verbose, text-based format
JSON – JavaScript Object Notation
• The file type is .json
• Data is in name/value pairs
• Data is separated by commas
• Curly braces designate objects
• Brackets designate arrays
JSON – JavaScript Object Notation
{
  "books": [
    {"title": "Data just right", "author": "Michael Manoochehri"},
    {"title": "Introduction to data mining",
     "authors": [
       {"name": "Pang Ning Tan"},
       {"name": "Michael Steinbach"},
       {"name": "Vipin Kumar"}
     ]}
  ]
}
JSON – JavaScript Object Notation
var library = {"books": [
  {"title": "Data just right", "author": "Michael Manoochehri"},
  {"title": "Introduction to data mining",
   "authors": [{"name": "Pang Ning Tan"}, {"name": "Michael Steinbach"},
               {"name": "Vipin Kumar"}]}
]};
• library.books[0].title // read: "Data just right"
• library.books[0].title = "Data 2.0" // overwrite the title
• library.books[1].authors[2] = library.books[0].author // works: books[1] has an authors array
• library.books[0].authors[2] = library.books[0].author // fails: books[0] has no authors array
JSON Library in Python
# Load the json library
import json
# Convert a Python object to a JSON string
var1 = ['x', {'y': ('Data Mining', 'C Programming')}]
print(json.dumps(var1))
# ["x", {"y": ["Data Mining", "C Programming"]}]
# Convert JSON data to a Python object
json_data = '["foo", {"bar":["baz", null, 1.0, 2]}]'
python_obj = json.loads(json_data)
print(python_obj)
# ['foo', {'bar': ['baz', None, 1.0, 2]}]
JSON Library in Python
json_data = '["foo", {"bar":["baz", null, 1.0, 2]}]'
# ['foo', {'bar': ['baz', None, 1.0, 2]}]

JSON         Python
object {}    dict
array []     list
string       str (unicode in Python 2)
int          int (int/long in Python 2)
real         float
true         True
false        False
null         None
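A quick round trip illustrating the mapping (output shown for Python 3):

import json
# Each JSON type lands in the corresponding Python type
print(json.loads('{"a": [1, 2.5, "x", true, null]}'))
# {'a': [1, 2.5, 'x', True, None]}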
Data on the Internet
www.yahoo.com:
◦ sports.yahoo.com
◦ finance.yahoo.com
◦ travel.yahoo.com
Data on the Internet
[Diagram: each page on the site is an HTML file linking to other HTML files]
HTML – Hypertext Markup Language
<html>
  <head>
    <title>Page Title</title>
    <script src="./scripts/js/utilities.js"></script>
    <script type="text/javascript">
      //javascript code goes here
    </script>
    <link rel="stylesheet" href="styles.css">
  </head>
  <body>
    <h1>My Data</h1>
    <table>
      <tr><td>John</td><td>1987</td></tr>
      <tr><td>Mary</td><td>2001</td></tr>
    </table>
  </body>
</html>
HTTP Requests
• HTTP: Hypertext Transfer Protocol
• A protocol to deliver data (files/images/query results) on the World Wide Web
• A browser is an HTTP client:
  • It sends requests to a web server
  • The server sends a response back to the client (user)

[Diagram: (1) the client sends a request to www.msu.edu, (2) the MSU web server handles it, (3) the server returns a response]
Request Header
[Screenshot: an HTTP request header]

Response Header
[Screenshot: an HTTP response header]

Response Content
[Screenshot: the body of an HTTP response]
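For reference, a typical request/response exchange looks roughly like this (illustrative values, not an actual capture):

GET /index.html HTTP/1.1
Host: www.msu.edu
User-Agent: Mozilla/5.0
Accept: text/html

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 5120

<html> ... </html>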
Outline
Data formats:
◦ CSV
◦ XML
◦ JSON
Data collection:
◦ Web crawlers
◦ wget
◦ APIs
Web crawlers (Spiders)
• An internet program (bot) that browses the World Wide Web to collect data:
  • Test whether a web page is available or has valid structure
  • Maintain mirrors of popular websites
  • Monitor changes in content
  • Build a special-purpose index
• The bot is given a list of web pages called seeds
• Each seed is collected/indexed/parsed
• All links found inside a seed are added to the list to be visited
• To visit/collect a page, the bot sends an HTTP request to the server
Issues
• Deal with a large number of pages
  • Cannot download all the pages
  • Selection policy: which pages to visit?
• Deal with changing content
  • Re-visit policy: when to visit a page again?
• Politeness policy:
  • How often to visit the same server?
  • How many requests per second? Do not overload the server
• Abide by robots.txt: a file on the server that states which pages are allowed/disallowed from being crawled (a checking sketch follows below)
• When to stop? How many levels deep?
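Python's standard library can consult robots.txt before a page is fetched. A minimal sketch (the 'mybot' user agent and the URLs are placeholders):

from urllib import robotparser

# Fetch and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url('http://www.msu.edu/robots.txt')
rp.read()

# Ask whether our bot may fetch a given page
print(rp.can_fetch('mybot', 'http://www.msu.edu/index.html'))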
Breadth First Search (BFS)
Finds pages along the shortest path from the root
[Diagram: BFS visiting an example web graph level by level; labels give the order of visits]
Depth First Search (DFS)
Tends to wander away from the root
[Diagram: DFS following one branch of the web graph deep before backtracking; labels give the order of visits]
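A minimal crawler sketch tying these pieces together (illustrative only; assumes the requests and bs4 packages). The frontier is a FIFO queue, which yields BFS; popping from the right end instead (a stack) would give DFS:

import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=50, delay=1.0):
    frontier = deque(seeds)      # pages waiting to be visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft() # FIFO -> breadth-first order
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue             # skip unreachable pages
        soup = BeautifulSoup(resp.text, 'html.parser')
        # Add every link found on this page to the frontier
        for link in soup.find_all('a'):
            href = link.get('href')
            if href:
                frontier.append(urljoin(url, href))
        time.sleep(delay)        # politeness: do not overload servers
    return visited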
WGET
• A utility to download files from the internet
• Supports the HTTP, HTTPS and FTP protocols
• Follows links in HTML, XHTML and CSS pages: recursive download
• Respects robots.txt
• Many configurable options:
  • Logging, download (speed, attempts, progress), directory (inclusion/exclusion), HTTP (username, password, user-agent, caching)
• Format:
  wget [option] [url]
• Help:
  wget -h
WGET
• wget http://www.msu.edu
  • Downloads index.html from msu.edu
• wget http://www.cse.msu.edu/~cse891/Sect001/exercises/ex_01.pdf
  • Downloads the pdf file
• wget -t 5 http://cse.msu.edu/
  • Retry 5 times when an attempt fails
• wget -r statenews.com
  • Recursively retrieves files under the hierarchy structure
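Options combine. For instance, a depth-limited, polite recursive download might look like this (illustrative; -l caps the recursion depth, --wait pauses between retrievals):

wget -r -l 2 --wait=1 http://statenews.com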
Python requests
• A library to send HTTP requests
• Load the library:
import requests
• Send a request:
req = requests.get('http://www.espn.com')
• Examine the response:
# Examine the content, status code, and headers
print(req.text)
print(req.status_code)
print(req.headers['content-type'])
Using the result
• Parse the result received for something useful
• Use libraries: json, ElementTree, minidom, lxml, HTMLParser (html.parser), BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Using the result
• Using BeautifulSoup:
from bs4 import BeautifulSoup
# Create a soup object from the html string
soupObj = BeautifulSoup(html_doc, 'html.parser')
print(soupObj.prettify())
# Prints with nice indentations
soupObj.title
# <title>The Dormouse's story</title>
soupObj.title.name
# 'title'
soupObj.title.string
# "The Dormouse's story"
soupObj.title.parent.name
# 'head'
soupObj.p
# <p class="title"><b>The Dormouse's story</b></p>
soupObj.p['class']
# ['title']   (class is a multi-valued attribute)
soupObj.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Using the result
soupObj.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soupObj.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soupObj.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Using an API
• API: Application Program Interface
• Uses the HTTP (HTTPS) protocol
• Uses XML/JSON to represent the response
• Provides a clean way to extract data
• Some APIs are free
• Usage limits: 5 calls/s, 1000 per day, …
Twitter
• Tweets: short posts of 140 characters or less
• Entities: users, hashtags, urls, media
• Places
• Streams: samples of public tweets flowing through twitter
• Timelines: chronologically sorted collections of tweets
  • Home timeline: tweets from people you follow
    https://twitter.com
  • User timeline: tweets from a specific user
    https://twitter.com/SocialWebMining
  • Home timeline of someone else
    https://twitter.com/SocialWebMining/following
Twitter Python API
• Create a twitter application account
  • Create an app that you authorize to access your account data
  • Obtain an app key
    • Instead of giving the password for your user account
• Install the twitter library if you don't have it (see the note below)
• Make calls to:
  • Retrieve trends
  • Search for tweets/retweets
  • Search for users
• Reference: Mining the Social Web, 2nd Edition
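The examples below assume the Python Twitter Tools package (imported as twitter), which can typically be installed with pip:

pip install twitter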
Authorize Twitter API
import twitter

# Obtain the values from your twitter app account
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a defined variable
print(twitter_api)
# <twitter.api.Twitter object at 0x39d9b50>
Get Trends
WORLD_WOE_ID = 1
# Look up IDs at http://woeid.rosselliot.co.nz/
US_WOE_ID = 23424977

world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)

print(world_trends)
print(us_trends)
# [{'created_at': '2013-03-27T11:50:40Z', 'trends': [{'url':
#  'http://twitter.com/search?q=%23MentionSomeoneImportantForYou'...

import json
print(json.dumps(world_trends, indent=1))
Get Trends
[
 {
  "created_at": "2013-03-27T11:50:40Z",
  "trends": [
   {
    "url": "http://twitter.com/search?q=%23MentionSomeoneImportantForYou",
    "query": "%23MentionSomeoneImportantForYou",
    "name": "#MentionSomeoneImportantForYou",
    "promoted_content": null,
    "events": null
   },
   ...
  ]
 }
]
Search for Tweets
person = '#MentionSomeoneImportantForYou'
numberOfTweets = 20
search_results = twitter_api.search.tweets(q=person, count=numberOfTweets)
statuses = search_results['statuses']
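Each status in the list is a plain Python dict. A quick way to inspect the results (using the 'text' field of each status):

# Print the text of each tweet returned by the search
for status in statuses:
    print(status['text'])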