Data Collection Challenges
How to collect the data?
How to store the data?
◦ Database or files?
◦ Cost of storage and bandwidth
What is the right data format?
◦ Improve readability or optimize storage?
◦ Human readable or computer processed?
How to present the data?
◦ Visualization, machine/human readable?

Outline
Data formats:
◦ CSV
◦ XML
◦ JSON
Data collection:
◦ Web crawlers
◦ wget
◦ APIs

CSV – Comma Separated Values
• Great for flat data, for example log data from web servers or sensors
• Compact as text data
• Easily imported into spreadsheets
• Human readable
• Easy sequential access

John, 2008, 20.50, Detroit, Michigan
Michael, 2003, 55.00, San Francisco, California
Mary, 2014, 7.75, , Wisconsin
Kelli, Kyle and Kat, 2010, 35.00, Miami, Florida

CSV – Comma Separated Values
Issues to consider:
• Escaping strings that contain the field delimiter
• Choice of field delimiter
• Lacks metadata: requires users to provide that information somewhere else
• What if the data does not fit into discrete rows?
• What if the row structure does not have a fixed size?

CSV – Comma Separated Values
Problematic file:

John, 2008, 20.50, Detroit, Michigan
Michael, 2003, 55.00, San Francisco, California
Mary, 2014, 7.75, Wisconsin
Kelli, Kyle and Kat, 2010, 35.00, Miami, Florida

Fixes: add metadata (a header row), add delimiters (quotes around fields that contain commas), and add a placeholder for missing values:

Name, Start Year, Hourly Pay, City, State
John, 2008, 20.50, Detroit, Michigan
John, 2008, 20.50, Ann Arbor, Michigan
Michael, 2003, 55.00, San Francisco, California
Mary, 2014, 7.75, , Wisconsin
"Kelli, Kyle & Kat", 2010, 35.00, Miami, Florida

CSV Library in Python

# Load the csv library
import csv

# Open the file and create a reader object
f = open('mydata.csv')
csv_f = csv.reader(f)

# Loop through each row and print it
for row in csv_f:
    print(row)

# Rewind the file before iterating again; otherwise the reader is already exhausted
f.seek(0)

# Loop through each row again and print the first value
for row in csv_f:
    print(row[0])

f.close()

XML – eXtensible Markup Language
• The data describes itself
• Widely supported
• Good for structured data
• A tree-like model:
  ◦ One root element
  ◦ Each element may contain other elements
• Verbose format => additional storage and bandwidth

<books>
  <book>
    <title>Data just right</title>
    <author>Michael Manoochehri</author>
  </book>
  <book>
    <title>Introduction to data mining</title>
    <author>Pang Ning Tan</author>
    <author>Michael Steinbach</author>
    <author>Vipin Kumar</author>
  </book>
</books>

JSON – JavaScript Object Notation
• A valid JavaScript object
• Easy to use with JavaScript and other languages
• Lighter-weight syntax than XML, so generally faster to parse
• Still a text-based, relatively verbose format

JSON – JavaScript Object Notation
• The file type is .json
• Data is in name/value pairs
• Data is separated by commas
• Curly braces designate objects
• Square brackets designate arrays

JSON – JavaScript Object Notation

{
  "books": [
    {"title": "Data just right", "author": "Michael Manoochehri"},
    {"title": "Introduction to data mining",
     "authors": [
       {"name": "Pang Ning Tan"},
       {"name": "Michael Steinbach"},
       {"name": "Vipin Kumar"}
     ]}
  ]
}

JSON – JavaScript Object Notation

var library = {"books": [
  {"title": "Data just right", "author": "Michael Manoochehri"},
  {"title": "Introduction to data mining",
   "authors": [
     {"name": "Pang Ning Tan"}, {"name": "Michael Steinbach"}, {"name": "Vipin Kumar"}
   ]}
]};

• library.books[0].title                                  // read a value: "Data just right"
• library.books[0].title = "Data 2.0"                     // update a value
• library.books[1].authors[2] = library.books[0].author   // works: books[1] has an authors array
• library.books[0].authors[2] = library.books[0].author   // error: books[0] has no authors array
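As a bridge to the Python json module introduced on the next slide, here is a minimal sketch of the same lookups done in Python rather than JavaScript. It assumes the books JSON above is stored in a string; the variable name books_json is an illustrative choice, not part of the original example.

# A minimal sketch: parse the books JSON from a string and access fields,
# mirroring the JavaScript dot-notation access above.
import json

books_json = '''
{"books": [
  {"title": "Data just right", "author": "Michael Manoochehri"},
  {"title": "Introduction to data mining",
   "authors": [{"name": "Pang Ning Tan"},
               {"name": "Michael Steinbach"},
               {"name": "Vipin Kumar"}]}
]}
'''

library = json.loads(books_json)           # parse the JSON text into nested dicts/lists
print(library['books'][0]['title'])        # Data just right
print(library['books'][1]['authors'][2])   # {'name': 'Vipin Kumar'}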
JSON Library in Python

# Load the json library
import json

# Convert a Python object to a JSON string
var1 = ['x', {'y': ('Data Mining', 'C Programming')}]
print(json.dumps(var1))
# ["x", {"y": ["Data Mining", "C Programming"]}]

# Convert JSON data to a Python object
# (json.loads parses a string; json.load reads from a file object)
json_data = '["foo", {"bar":["baz", null, 1.0, 2]}]'
python_obj = json.loads(json_data)
print(python_obj)
# [u'foo', {u'bar': [u'baz', None, 1.0, 2]}]

JSON Library in Python
How JSON types map to Python objects:

JSON            Python
object {}       dict
array []        list
string          unicode (u'...')
number (int)    int, long
number (real)   float
true            True
false           False
null            None

Data on the Internet
(Diagram: a web site such as www.yahoo.com with sub-sites sports.yahoo.com, finance.yahoo.com, travel.yahoo.com.)

Data on the Internet
(Diagram: each site is a collection of HTML files.)

HTML – Hypertext Markup Language

<html>
  <head>
    <title>Page Title</title>
    <script src="./scripts/js/utilities.js"></script>
    <script type="text/javascript">
      //javascript code goes here
    </script>
    <link rel="stylesheet" href="styles.css">
  </head>
  <body>
    <h1>My Data</h1>
    <table>
      <tr><td>John</td><td>1987</td></tr>
      <tr><td>Mary</td><td>2001</td></tr>
    </table>
  </body>
</html>

HTTP Requests
• HTTP: Hypertext Transfer Protocol
• A protocol to deliver data (files/images/query results) on the World Wide Web
• A browser is an HTTP client:
  ◦ The client sends requests to a web server
  ◦ The server sends the response back to the client (user)
(Diagram: a client exchanging an HTTP request and response with the MSU web server, www.msu.edu.)

Request Header
(screenshot of an example HTTP request header)

Response Header
(screenshot of an example HTTP response header)

Response Content
(screenshot of an example HTTP response body)

Outline
Data formats:
◦ CSV
◦ XML
◦ JSON
Data collection:
◦ Web crawlers
◦ wget
◦ APIs

Web Crawlers (Spiders)
• An internet program (bot) that browses the World Wide Web to collect data, for example to:
  ◦ Test whether a web page has a valid structure or is available
  ◦ Maintain mirrors of popular websites
  ◦ Monitor changes in content
  ◦ Build a special-purpose index
• The bot is given a list of web pages called seeds
• Each seed is collected/indexed/parsed
• All links found inside a seed are added to the list of pages to be visited (see the sketch below)
• To visit/collect a page, the bot sends an HTTP request to the server

Issues
• Dealing with a large number of pages
  ◦ Cannot download all the pages
  ◦ Selection policy: which pages to visit?
• Dealing with changing content
  ◦ Re-visit policy: when to visit a page again?
• Politeness policy:
  ◦ How often to visit the same server? How many requests per second? Do not overload the server
• Abide by robots.txt: a file on the server that states which pages are allowed/disallowed from being scraped
• When to stop? How many levels?
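To make the seed list, frontier, and level limit concrete, here is a minimal Python 3 crawler sketch. It assumes the requests and BeautifulSoup libraries introduced later in this section; the seed URL, depth limit, and politeness delay are illustrative choices, not a prescribed configuration. Visiting the frontier in FIFO order gives the breadth-first traversal described on the next slide.

# A minimal breadth-first crawler sketch (illustrative only).
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_levels=2, delay=1.0):
    frontier = deque((url, 0) for url in seeds)   # FIFO queue => breadth-first order
    visited = set(seeds)
    while frontier:
        url, level = frontier.popleft()
        try:
            resp = requests.get(url, timeout=5)   # send the HTTP request
        except requests.RequestException:
            continue                              # skip unavailable pages
        print(level, url, resp.status_code)
        time.sleep(delay)                         # politeness: pause between requests
        if level == max_levels:                   # level limit: "when to stop?"
            continue
        soup = BeautifulSoup(resp.text, 'html.parser')
        for link in soup.find_all('a'):           # add every link found in the page
            target = urljoin(url, link.get('href', ''))
            if target.startswith('http') and target not in visited:
                visited.add(target)
                frontier.append((target, level + 1))

# Example seed list (uncomment to run):
# crawl(['http://www.msu.edu'])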
Breadth First Search (BFS)
(Diagram: an example link graph with pages numbered in breadth-first visit order, level by level from the seed.)
• Finds pages along the shortest path from the root

Depth First Search (DFS)
(Diagram: the same link graph numbered in depth-first visit order, following each branch deeply before backtracking.)
• Tends to wander away from the root

WGET
• A utility to download files from the internet
• Supports the HTTP, HTTPS and FTP protocols
• Follows links in HTML, XHTML and CSS pages: recursive download
• Respects robots.txt
• Many configurable options:
  ◦ Logging, download (speed, attempts, progress), directory (inclusion/exclusion), HTTP (username, password, user-agent, caching)
• Format: wget [option] [url]
• Help: wget -h

WGET
• wget http://www.msu.edu
  ◦ Downloads index.html from msu.edu
• wget http://www.cse.msu.edu/~cse891/Sect001/exercises/ex_01.pdf
  ◦ Downloads the pdf file
• wget -t 5 http://cse.msu.edu/
  ◦ Retries 5 times when the attempt fails
• wget -r statenews.com
  ◦ Recursively retrieves files under the hierarchy structure

Python requests
• A library to send HTTP requests
• Load the library:
    import requests
• Send a request:
    req = requests.get('http://www.espn.com')
• Examine the response:
    print(req.text)                      # examine content
    print(req.status_code)
    print(req.headers['content-type'])

Using the result
• Parse the result received for something useful
• Use libraries: json, ElementTree, MiniDom, lxml, HTMLParser (html.parser), BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Using the result
• Using BeautifulSoup:

from bs4 import BeautifulSoup
soupObj = BeautifulSoup(html_doc, 'html.parser')  # create a soup object from the html string

print(soupObj.prettify())   # prints with nice indentation

soupObj.title               # <title>The Dormouse's story</title>
soupObj.title.name          # u'title'
soupObj.title.string        # u"The Dormouse's story"
soupObj.title.parent.name   # u'head'
soupObj.p                   # <p class="title"><b>The Dormouse's story</b></p>
soupObj.p['class']          # ['title']  (class is a multi-valued attribute)
soupObj.a                   # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Using the result

soupObj.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soupObj.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soupObj.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Using an API
• API: application program interface
• Uses the HTTP (HTTPS) protocol
• Uses XML/JSON to represent the response
• Provides a clean way to extract data
• Some APIs are free
• Usage limits, e.g. 5 calls per second, 1,000 calls per day, ...
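As a rough sketch of the typical pattern, the requests and json pieces shown earlier can be combined to call a JSON API. The endpoint URL, query parameters, and response fields below are hypothetical placeholders rather than a real service, and the pause stands in for respecting a usage limit.

# A sketch of calling a JSON-over-HTTP API with the tools shown earlier.
import time
import requests

def fetch_books(query, api_key, pause=0.2):
    url = 'https://api.example.com/v1/books'      # hypothetical endpoint
    params = {'q': query, 'key': api_key}         # hypothetical parameters
    resp = requests.get(url, params=params, timeout=5)
    resp.raise_for_status()                       # fail loudly on HTTP errors
    time.sleep(pause)                             # stay under the usage limit
    return resp.json()                            # parse the JSON response body

# Hypothetical usage; the structure of `result` depends entirely on the API:
# result = fetch_books('data mining', 'MY_KEY')
# for book in result.get('books', []):
#     print(book.get('title'))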
Twitter
• Tweets: short posts of 140 characters or less
• Entities: users, hashtags, URLs, media
• Places
• Streams: a sample of the public tweets flowing through Twitter
• Timelines: chronologically sorted collections of tweets
  ◦ Home timeline: tweets from people you follow
    https://twitter.com
  ◦ User timeline: tweets from a specific user
    https://twitter.com/SocialWebMining
  ◦ Home timeline of someone else
    https://twitter.com/SocialWebMining/following

Twitter Python API
• Create a Twitter application account
  ◦ Create an app that you authorize to access your account data
  ◦ Obtain an app key, instead of giving out the password for your user account
• Install the twitter library if you don't have it
• Make calls to:
  ◦ Retrieve trends
  ◦ Search for tweets/retweets
  ◦ Search for users
• Reference: Mining the Social Web, 2nd Edition

Authorize Twitter API

import twitter

# Obtain the values from your twitter app account
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a defined variable
print(twitter_api)
# <twitter.api.Twitter object at 0x39d9b50>

Get Trends

import json

WORLD_WOE_ID = 1           # look up IDs at http://woeid.rosselliot.co.nz/
US_WOE_ID = 23424977

world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)

print(world_trends)
print(us_trends)
# [{u'created_at': u'2013-03-27T11:50:40Z', u'trends': [{u'url': u'http://twitter.com/search?q=%23MentionSomeoneImportantForYou'...

print(json.dumps(world_trends, indent=1))

Get Trends

[
 {
  "created_at": "2013-03-27T11:50:40Z",
  "trends": [
   {
    "url": "http://twitter.com/search?q=%23MentionSomeoneImportantForYou",
    "query": "%23MentionSomeoneImportantForYou",
    "name": "#MentionSomeoneImportantForYou",
    "promoted_content": null,
    "events": null
   },
   ...
  ]
 }
]

Search for Tweets

person = '#MentionSomeoneImportantForYou'
numberOfTweets = 20

search_results = twitter_api.search.tweets(q=person, count=numberOfTweets)
statuses = search_results['statuses']
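As a brief follow-up sketch: the returned statuses are plain Python dictionaries parsed from the JSON response, so they can be inspected directly or dumped with the json module shown earlier. This assumes the standard Twitter v1.1 status fields such as 'text'.

for status in statuses:
    print(status['text'])                      # the tweet text

print(json.dumps(statuses[0], indent=1))       # dump one full status to see all available fields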