Data Size, Persistence and Access
Kevin Swingler

Introduction
• Big Data needs to be stored, searched, accessed and moved in real time
• How it is stored affects how quickly it can be searched
• Moving large amounts of data is not practical, so you have to take the process to the data
• You may want to use local compute power on remote data

Persistence
• Persistence is the name given to the permanent storage of data
• This might be in a file, but is often more useful in a database

Data Sizes
Value              Abbr.  Name      Download time @ 45 Mbps
1000 bytes         kB     Kilobyte  < 1 second
1000² = 1000 kB    MB     Megabyte  < 1 second
1000³ = 1000 MB    GB     Gigabyte  4 mins
1000⁴ = 1000 GB    TB     Terabyte  74 hours
1000⁵ = 1000 TB    PB     Petabyte  8.5 years
1000⁶ = 1000 PB    EB     Exabyte   8000 years

It would be quicker to put a Petabyte rack in a backpack and walk with it than to download it!

Data Size Examples
Size            Example                                File storage
100 kB          This presentation
1 Megabyte      Typical English book, 500 pages
470 Megabytes   NASDAQ Open, Close since 1970
800 Megabytes   The human genome
1 Gigabyte      7 minutes HDTV                         £5 memory stick
11 Gigabytes    35 million Amazon reviews              £20 memory stick
300 Gigabytes   Million Song Dataset (not the music)
1 Terabyte      20,000 filing cabinets of text         £50 disk drive
8 Terabytes     Twitter data generated per day
1 Petabyte      13.3 years of HDTV                     £300,000 rack
2.5 Petabytes   Walmart customer database
30 Petabytes    Facebook database

Data Access
• Some data sets are small enough that you can just download them and process them locally however you want
• Others are too large for this and need either:
  – To be processed remotely (where the data is stored)
  – To be queried so that only partial or aggregated data is downloaded and processed

Partial Or Aggregated
• Partial data is just a subset of the whole
  – The most recent
  – All customers from the UK
  – Etc.
• Aggregated is the result of a calculation
  – Total spend per customer
  – Number of customers per region
  – Etc.
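
The difference between partial and aggregated data can be sketched with a small query example. This is only an illustration: an in-memory SQLite database stands in for the remote data store, and the table name, column names and rows are all invented for the sketch.

```python
import sqlite3

# Hypothetical purchase records: (customer, country, spend).
ROWS = [
    ("alice", "UK", 20.0),
    ("alice", "UK", 5.0),
    ("bob", "FR", 12.5),
    ("carol", "UK", 7.5),
]


def build_store():
    """Create an in-memory database standing in for the remote data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer TEXT, country TEXT, spend REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", ROWS)
    return conn


def partial_uk(conn):
    """Partial data: a subset of the rows (all purchases by UK customers)."""
    query = "SELECT customer, spend FROM purchases WHERE country = ? ORDER BY rowid"
    return conn.execute(query, ("UK",)).fetchall()


def total_spend_per_customer(conn):
    """Aggregated data: one calculated value per customer."""
    query = "SELECT customer, SUM(spend) FROM purchases GROUP BY customer"
    return dict(conn.execute(query).fetchall())


def customers_per_region(conn):
    """Aggregated data: number of distinct customers in each country."""
    query = "SELECT country, COUNT(DISTINCT customer) FROM purchases GROUP BY country"
    return dict(conn.execute(query).fetchall())


if __name__ == "__main__":
    store = build_store()
    print(partial_uk(store))
    print(total_spend_per_customer(store))
    print(customers_per_region(store))
```

In both cases only the query result crosses the network, not the whole table, which is the point of sending the query to the data.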

Remote Queries
• The remote data store needs to be able to support such requests
• Data in a database is easier to search and aggregate than lots of files

How is Access Given?
• There are a number of ways to grant access to a database
  – Application layer, e.g. eBay
  – Database layer, e.g. phpMyAdmin
  – API
    • Language specific, e.g. Java, Python
    • Language independent RESTful API

RESTful API
• Let’s say you want to analyse Twitter data for some reason
• You can’t download the entire Twitter database
• They won’t give you access to the database directly
• The simple search facility on their website is insufficient for your needs

RESTful API
• Representational State Transfer
• Applied to Web Services
• Provision of an API via HTTP
• Response returned as an internet media type
  – XML
  – JSON
  – Images

RESTful API
• Request uses standard HTTP methods
  – GET
  – PUT
  – POST
  – DELETE
• URL example: http://URI/servicename?param=val&param=val

Twitter Example
• https://api.twitter.com/1.1/search/tweets.json?q=%23superbowl&result_type=recent
• Returns JSON object

JSON Twitter Response
"text": "RT @Twittername: Can’t wait for the superbowl!",
"truncated": false,
"in_reply_to_user_id": null,
"in_reply_to_status_id": null,
"favorited": false,
"source": "<a href=\"http://twitter.com/\" rel=\"nofollow\">Twitter for iPhone</a>",
"in_reply_to_screen_name": null,
"in_reply_to_status_id_str": null,
"id_str": "54691802283900928",
"entities": {
  "user_mentions": [
    {
      "indices": [ 3, 19 ],
      "screen_name": "Twittername",
      "id_str": "271572434",
      "name": "Twittername",
      "id": 271572434
    }
  ],
  "urls": [ ],
  "hashtags": [ ]
},

Google Sheets Example
https://spreadsheets.google.com/feeds/cells/key/worksheetId/private/full?min-row=2&min-col=4&max-col=4
• Returns XML
<entry gd:etag='"YD0PS1YXByp7Ig.."'>
  <id>https://spreadsheets.google.com/feeds/cells/key/worksheetId/private/full/R1C2</id>
  <updated>2006-11-17T18:27:32.543Z</updated>
  <category scheme="http://schemas.google.com/spreadsheets/2006"
    term="http://schemas.google.com/spreadsheets/2006#cell"/>
  <title type="text">B1</title>
  <content type="text">Hours</content>
  <link rel="self" type="application/atom+xml"
    href="https://spreadsheets.google.com/feeds/cells/key/worksheetId/private/full/R1C2"/>
  <link rel="edit" type="application/atom+xml"
    href="https://spreadsheets.google.com/feeds/cells/key/worksheetId/private/full/R1C2/1pn567"/>
  <gs:cell row="1" col="2" inputValue="Hours">Hours</gs:cell>
</entry>

Making Requests
• In some cases, you can type the HTTP request directly into a browser to see what the result would be
• More often than not, however, you need some code to make the request and process the result
• Authentication is often needed too, which calls for some code

Python Requests
• Requests is a Python library for making HTTP requests
• It is installed in the labs as part of the Python install
• More on this later ...

Data Sources
• There are a great many sources of data useful both for learning about Big Data and for doing research
• Some is proprietary and must be paid for, but a lot is free
• The course website lists a number of sources
• Here are a few of interest

Before You Download
• How big is the file?
• How long will it take to download?
• Do you have the space to store it?
• What will you be doing with it once you have it?
  – Will a subset do?
• Does the owner provide an API so you needn’t download it all?
• Can you process the format it comes in?
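
Where the owner does provide an API, the request URL is usually built in code rather than typed by hand. The sketch below uses only the Python standard library to build the Twitter search URL from the earlier example and to pull one field out of a JSON response. The helper names `build_url` and `mentioned_users` are invented for this sketch, the sample response is a cut-down version of the slide's JSON wrapped in braces to make it valid, and no request is actually sent (a real call to this endpoint would also need authentication).

```python
import json
from urllib.parse import urlencode


def build_url(base, params):
    """Append URL-encoded query parameters to a base address."""
    return f"{base}?{urlencode(params)}"


# urlencode escapes the '#' in the hashtag as %23, as in the slide's URL.
SEARCH_URL = build_url(
    "https://api.twitter.com/1.1/search/tweets.json",
    {"q": "#superbowl", "result_type": "recent"},
)

# Cut-down sample response in the shape shown on the JSON slide.
SAMPLE_RESPONSE = """
{
  "text": "RT @Twittername: Can't wait for the superbowl!",
  "entities": {
    "user_mentions": [
      {"screen_name": "Twittername", "id": 271572434}
    ],
    "urls": [],
    "hashtags": []
  }
}
"""


def mentioned_users(tweet_json):
    """Return the screen names mentioned in one tweet's JSON text."""
    tweet = json.loads(tweet_json)
    return [m["screen_name"] for m in tweet["entities"]["user_mentions"]]


if __name__ == "__main__":
    print(SEARCH_URL)
    print(mentioned_users(SAMPLE_RESPONSE))
```

The Requests library mentioned above accepts the same dictionary through its `params` argument, as in `requests.get(url, params=...)`, and performs the encoding itself.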

Example – Transport Data
• National Public Transport Access Nodes (NaPTAN)
• Download from http://data.gov.uk/dataset/naptan
• Data on every public transport stop in the UK
• Comes in 16 files, CSV or XML
• Files related by a schema
• Open Government Licence (OGL)

Schema
[Figure: schema diagram]

Storing the Data
• Having downloaded the data, you could
  – Keep it as separate files and let the application manage the links between files
  – Load it into a relational database and let the RDBMS handle the joins

Amazon Review Data
• The Stanford SNAP project has 35 million product reviews from Amazon!
• The whole thing is 11 GB in size. Think before you download that
• There are lots of category-specific ones that are smaller
• https://snap.stanford.edu/data/webAmazon.html
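
"Think before you download" can be made concrete with a quick estimate of the transfer time. The sketch below is an idealised calculation: it assumes the link runs at its full nominal rate with no protocol overhead or contention, and the function names are invented for the example. The figures in the Data Sizes table are somewhat longer than this calculation gives, which would be consistent with an effective throughput below the nominal 45 Mbps.

```python
# Unit lengths in seconds, largest first.
UNITS = [
    ("years", 365 * 24 * 3600),
    ("days", 24 * 3600),
    ("hours", 3600),
    ("minutes", 60),
    ("seconds", 1),
]


def download_seconds(size_bytes, link_mbps):
    """Idealised time to move size_bytes over a link of link_mbps megabits/s."""
    bits = size_bytes * 8
    return bits / (link_mbps * 1_000_000)


def humanise(seconds):
    """Express a duration in the largest unit that gives a value of at least 1."""
    for name, length in UNITS:
        if seconds >= length:
            return f"{seconds / length:.1f} {name}"
    return "< 1 second"


if __name__ == "__main__":
    amazon_reviews = 11 * 10**9  # 11 GB, using 1 GB = 1000^3 bytes
    # About 32.6 minutes at the nominal 45 Mbps
    print(humanise(download_seconds(amazon_reviews, 45)))
    # The same file at a lower, assumed effective rate of 30 Mbps
    print(humanise(download_seconds(amazon_reviews, 30)))
```

Running the same function over a smaller category-specific file, or over a subset, shows quickly whether the full download is worth it.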