Data Sources

Data Size, Persistence and Access
Kevin Swingler
Introduction
• Big Data needs to be stored, searched,
accessed and moved in real time
• How it is stored affects how quickly it can be
searched
• Moving large amounts of data is not practical,
so you have to take the process to the data
• You may want to use local compute power on
remote data
Persistence
• Persistence is the name given to the
permanent storage of data
• This might be in a file, but is often more useful
in a database
Data Sizes
Value              Abbr.  Name       Download Time @45Mbps
1000 bytes         kB     Kilobyte   < 1 second
1000^2 = 1000 kB   MB     Megabyte   < 1 second
1000^3 = 1000 MB   GB     Gigabyte   4 mins
1000^4 = 1000 GB   TB     Terabyte   74 hours
1000^5 = 1000 TB   PB     Petabyte   8.5 years
1000^6 = 1000 PB   EB     Exabyte    8000 years
It would be quicker to put a Petabyte rack in a backpack and walk with it than to
download it!
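The download times above come from dividing the size in bits by the line speed. The following is a minimal sketch of that arithmetic, assuming decimal units (1 kB = 1000 bytes) and a link that sustains its full rated speed with no protocol overhead; `download_seconds` is a hypothetical helper, not part of any library. Under these ideal assumptions a terabyte takes about 49 hours at 45 Mbps; the larger figures quoted above match an effective rate of roughly 30 Mbps.

```python
def download_seconds(size_bytes: float, mbps: float) -> float:
    """Ideal transfer time: size in bits divided by line speed in bits per second."""
    bits = size_bytes * 8
    return bits / (mbps * 1_000_000)


TERABYTE = 1000 ** 4  # decimal units, as in the Data Sizes listing

# Compare the rated line speed with a lower effective throughput (assumed value).
hours_at_45 = download_seconds(TERABYTE, 45) / 3600
hours_at_30 = download_seconds(TERABYTE, 30) / 3600

print(f"1 TB at 45 Mbps: {hours_at_45:.1f} hours")
print(f"1 TB at 30 Mbps: {hours_at_30:.1f} hours")
```

Real transfers are usually slower than the ideal figure, so treat the result as a lower bound.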
Data Size Examples
Size            Example File                           Storage
100 kB          This presentation
1 Megabyte      Typical English book, 500 pages
470 Megabytes   NASDAQ Open, Close Since 1970
800 Megabytes   The human genome
1 Gigabyte      7 minutes HDTV                         £5 Memory stick
11 Gigabytes    35 million Amazon reviews              £20 Memory stick
300 Gigabytes   Million song dataset (not the music)
1 Terabyte      20,000 Filing cabinets of text
8 Terabytes     Twitter data generated per day
1 Petabyte      13.3 years of HDTV
2.5 Petabytes   Walmart customer database
30 Petabytes    Facebook database
Other storage examples: £50 disk drive, £300,000 Rack
Data Access
• Some data sets are small enough that you can
just download them and process them locally
however you want
• Others are too large for this and need either:
– To be processed remotely (where the data is
stored)
– To be queried so that only partial or aggregated
data is downloaded and processed
Partial Or Aggregated
• Partial data is just a subset of the whole
– The most recent
– All customers from the UK
– Etc.
• Aggregated is the result of a calculation
– Total spend per customer
– Number of customers per region
– Etc.
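The difference between partial and aggregated data can be sketched with two SQL queries. This is an illustration only: the `orders` table, its columns, and its rows are invented for the example, and an in-memory SQLite database stands in for a remote data store.

```python
import sqlite3

# In-memory database standing in for a remote store (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, country TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "UK", 20.0),
        ("bob", "FR", 15.0),
        ("alice", "UK", 5.0),
        ("carol", "UK", 12.5),
    ],
)

# Partial data: a subset of the rows (all orders from UK customers).
uk_rows = conn.execute(
    "SELECT customer, spend FROM orders WHERE country = ? ORDER BY rowid",
    ("UK",),
).fetchall()

# Aggregated data: the result of a calculation (total spend per customer).
totals = conn.execute(
    "SELECT customer, SUM(spend) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(uk_rows)
print(totals)
```

In both cases only the query result crosses the network, not the whole table.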
Remote Queries
• The remote data store needs to be able to
support such requests
• Data in a database is easier to search and
aggregate than lots of files
How is Access Given?
• There are a number of ways to grant access to
a database
– Application layer – e.g. eBay
– Database layer – e.g. phpMyAdmin
– API
• Language specific, e.g. Java, Python
• Language independent RESTful API
RESTful API
• Let’s say you want to analyse Twitter data for
some reason
• You can’t download the entire Twitter
database
• They won’t give you access to the database
directly
• The simple search facility on their website is
insufficient for your needs
RESTful API
• Representational State Transfer
• Applied to Web Services
• Provision of an API via HTTP
• Response returned as an internet media type
– XML
– JSON
– Images
RESTful API
• Request uses standard HTTP methods
– GET
– PUT
– POST
– DELETE
• URL example:
http://URI/servicename?param=val&param=val
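A request URL of this shape can be assembled in code rather than by hand, which takes care of escaping special characters. The sketch below uses Python's standard `urllib.parse.urlencode`; the base URL `https://api.example.com/search` and the helper `build_url` are hypothetical and stand in for whatever service you are calling.

```python
from urllib.parse import urlencode


def build_url(base: str, params: dict) -> str:
    """Join a service URL and its query parameters, percent-encoding the values."""
    return f"{base}?{urlencode(params)}"


url = build_url(
    "https://api.example.com/search",  # hypothetical service, not a real endpoint
    {"q": "#superbowl", "result_type": "recent"},
)
print(url)
```

Note that the `#` in the search term is encoded as `%23`, since a raw `#` would otherwise be read as the start of a URL fragment.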
Twitter Example
• https://api.twitter.com/1.1/search/tweets.json?q=%23superbowl&result_type=recent
• Returns JSON object
JSON Twitter Response
"text": "RT @Twittername: Can’t wait for the superbowl!",
"truncated": false,
"in_reply_to_user_id": null,
"in_reply_to_status_id": null,
"favorited": false,
"source": "<a href=\"http://twitter.com/\" rel=\"nofollow\">Twitter for iPhone</a>",
"in_reply_to_screen_name": null,
"in_reply_to_status_id_str": null,
"id_str": "54691802283900928",
"entities": {
  "user_mentions": [
    {
      "indices": [
        3,
        19
      ],
      "screen_name": " Twittername ",
      "id_str": "271572434",
      "name": " Twittername ",
      "id": 271572434
    }
  ],
  "urls": [ ],
  "hashtags": [ ]
},
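Once a JSON response arrives, it maps directly onto Python dictionaries and lists. The sketch below parses a cut-down copy of the fragment above with the standard `json` module; wrapping the fields in an outer `{ }` and trimming them is an assumption made so that the sample is a complete JSON document.

```python
import json

# Trimmed, self-contained copy of the response fragment (outer braces added).
sample = """
{
  "text": "RT @Twittername: Can't wait for the superbowl!",
  "truncated": false,
  "in_reply_to_user_id": null,
  "id_str": "54691802283900928",
  "entities": {
    "user_mentions": [
      {"screen_name": "Twittername", "id": 271572434}
    ],
    "urls": [],
    "hashtags": []
  }
}
"""

tweet = json.loads(sample)

# JSON false/null become Python False/None; nested objects become dicts and lists.
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]

print(tweet["text"])
print(mentions)
```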
Google Sheets Example
• https://spreadsheets.google.com/feeds/cells/key/worksheetId/private/full?min-row=2&min-col=4&max-col=4
• Returns XML
<entry gd:etag='"YD0PS1YXByp7Ig.."'>
<id>https://spreadsheets.google.com/feeds/cells/key/worksheetId/private/full/R1C2</id>
<updated>2006-11-17T18:27:32.543Z</updated>
<category scheme="http://schemas.google.com/spreadsheets/2006"
term="http://schemas.google.com/spreadsheets/2006#cell"/>
<title type="text">B1</title>
<content type="text">Hours</content>
<link rel="self" type="application/atom+xml"
href="https://spreadsheets.google.com/feeds/cells/key/worksheetId/private/full/R1C2"/>
<link rel="edit" type="application/atom+xml"
href="https://spreadsheets.google.com/feeds/cells/key/worksheetId/private/full/R1C2/1pn567"/>
<gs:cell row="1" col="2" inputValue="Hours">Hours</gs:cell>
</entry>
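An XML response like this one can be read with Python's standard `xml.etree.ElementTree`. The sketch below parses a shortened copy of the entry; the namespace declarations are not visible in the fragment above, so the URIs used here for the Atom and `gs` namespaces are assumptions added to make the sample well-formed.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used in the searches below (URIs are assumed, see lead-in).
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "gs": "http://schemas.google.com/spreadsheets/2006",
}

# Shortened copy of the entry with the namespace declarations made explicit.
sample = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:gs="http://schemas.google.com/spreadsheets/2006">
  <title type="text">B1</title>
  <content type="text">Hours</content>
  <gs:cell row="1" col="2" inputValue="Hours">Hours</gs:cell>
</entry>"""

entry = ET.fromstring(sample)
title = entry.find("atom:title", NS).text
cell = entry.find("gs:cell", NS)

# Attribute values arrive as strings, so convert the coordinates explicitly.
row, col, value = int(cell.get("row")), int(cell.get("col")), cell.text

print(title, row, col, value)
```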
Making Requests
• In some cases, you can type the HTTP request
directly into a browser to see what the result
would be
• More often than not, however, you need some
code to make the request and process the
result
• Authentication is often needed too, which
calls for some code
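As a rough illustration of why code is needed, the sketch below builds an authenticated request with Python's standard `urllib.request` and defines a helper that would send it and decode a JSON reply. The URL, the token, and the use of a bearer token in the `Authorization` header are all assumptions; each API documents its own authentication scheme.

```python
import json
import urllib.request


def build_request(url: str, token: str) -> urllib.request.Request:
    """Attach a bearer token to a GET request without sending it."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_json(req: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical endpoint and placeholder token; nothing is sent at this point.
req = build_request("https://api.example.com/search?q=test", "MY_TOKEN")
print(req.get_method(), req.full_url)
```

Calling `fetch_json(req)` would perform the actual request; it is left out here because the endpoint is made up.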
Python Requests
• Requests is a Python library for making HTTP
requests
• It is installed in the labs as part of the Python
install
• More on this later ...
Data Sources
• There are a great many sources of data useful
both for learning about Big Data and for doing
research
• Some is proprietary and must be paid for, but
a lot is free
• The course website lists a number of sources
• Here are a few of interest
Before You Download
• How big is the file?
• How long will it take to download?
• Do you have the space to store it?
• What will you be doing with it once you have it?
– Will a subset do?
• Does the owner provide an API so you needn’t
download it all?
• Can you process the format it comes in?
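The storage question in this checklist is easy to automate. The sketch below uses the standard `shutil.disk_usage` to compare a file size with the free space on the current drive; the helper `has_room` and its 10% safety margin are assumptions chosen for illustration.

```python
import shutil


def has_room(size_bytes: int, path: str = ".", headroom: float = 0.1) -> bool:
    """True if `path` has free space for the file plus a safety margin."""
    free = shutil.disk_usage(path).free
    return size_bytes * (1 + headroom) <= free


GIGABYTE = 1000 ** 3  # decimal units

# Example: would an 11 GB download fit on the current drive?
print("Room for an 11 GB download:", has_room(11 * GIGABYTE))
```

The file size itself usually comes from the data owner's download page, or from the `Content-Length` header of an HTTP response where the server provides one.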
Example – Transport Data
• National Public Transport Access Nodes
(NaPTAN)
• Download from
http://data.gov.uk/dataset/naptan
• Data on every public transport stop in the UK
• Comes in 16 files – CSV or XML
• Files related by a schema
• Open Government Licence (OGL)
Schema
Storing the Data
• Having downloaded the data, you could
– Keep it as separate files and let the application
manage the links between files
– Load it into a relational database and let the
RDBMS handle the joins
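The second option can be sketched with the standard `csv` and `sqlite3` modules. The two small CSV texts below are invented stand-ins for a pair of related files; their table and column names are hypothetical and are not taken from the NaPTAN schema.

```python
import csv
import io
import sqlite3

# Hypothetical stand-ins for two related CSV files.
stops_csv = "stop_id,name,area_id\nS1,High Street,A1\nS2,Station Road,A1\nS3,Harbour,A2\n"
areas_csv = "area_id,area_name\nA1,Town Centre\nA2,Waterfront\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE areas (area_id TEXT PRIMARY KEY, area_name TEXT)")
conn.execute(
    "CREATE TABLE stops (stop_id TEXT PRIMARY KEY, name TEXT, "
    "area_id TEXT REFERENCES areas(area_id))"
)


def load(table: str, text: str, columns: list) -> None:
    """Insert every CSV row into `table`, taking values in the given column order."""
    rows = [tuple(r[c] for c in columns) for r in csv.DictReader(io.StringIO(text))]
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)


load("areas", areas_csv, ["area_id", "area_name"])
load("stops", stops_csv, ["stop_id", "name", "area_id"])

# The database resolves the link between the files with a join.
joined = conn.execute(
    "SELECT s.name, a.area_name FROM stops s "
    "JOIN areas a ON s.area_id = a.area_id ORDER BY s.stop_id"
).fetchall()
print(joined)
```

With separate files, the application would have to do this matching itself, for example by building a dictionary keyed on `area_id`.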
Amazon Review Data
• The Stanford SNAP project has 35 million
product reviews from Amazon!
• The whole thing is 11 GB in size. Think before
you download that
• There are lots of category-specific ones that
are smaller
• https://snap.stanford.edu/data/webAmazon.html