Extracting YouTube videos using SAS Dr Craig Hansen Craig Hansen has been using SAS for 15 years within the health research setting. He gained his doctorate in epidemiology at the University of the Sunshine Coast and since then has worked at • the University of Queensland School of Medicine (Australia) – working in cardiovascular research • the United States Environmental Protection Agency (USEPA) – working in air pollution research • the Centers for Disease Control and Prevention (CDC, USA) – working in birth defects research • Kaiser Permanente Center for Health Research (USA) – working on health studies using electronic medical records. He recently accepted a position at the South Australian Health and Medical Research Institute as Senior Epidemiologist within the Wardliparingga Aboriginal Health Unit. Dr Craig Hansen Extracting YouTube Videos using SAS By Craig Hansen, PhD South Australian Health & Medical Research Institute Overview YouTube API SAS Proc HTTP SAS XML Mapper Example – medications during pregnancy Questions Have you ever wanted to “scrape” all the information from YouTube videos? Length Uploader Date Comments Views You can……using YouTube Data API and SAS! YouTube Data API (Metadata for videos) Proc HTTP XML File XML Mapper SAS Datasets YouTube API – Video feed example • https://gdata.youtube.com/feeds/api/videos?q=surfing Google's APIs allows you to integrate YouTube videos and functionality into websites or applications. Examples: YouTube Analytics API YouTube Data API YouTube Player API XML File of all the videos found with the search term “surfing” Only allows 25 videos per XML file Only allows 500 videos in total per search term XML • “Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both humanreadable and machine-readable” • A structured (“tree-like”) text that can be used to store information in a hierarchical format See the “structure” based on tags < > …..text …..</> This is what SAS XML Mapper reads <?xml version='1.0' encoding='UTF-8'?> <feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearch/1.1/' xmlns:gml='http://www.opengis.net/gml' xmlns:georss='http://www.georss.org/georss' xmlns:media='http://search.yahoo.com/mrss/' xmlns:batch='http://schemas.google.com/gdata/batch' xmlns:yt='http://gdata.youtube.com/schemas/2007' xmlns:gd='http://schemas.google.com/g/2005' gd:etag='W/&quot;CE4EQH47eCp7ImA9WxRQGEQ.&quot;'> <id>tag:youtube.com,2008:standardfeed:global:most_popular</id> <updated>2008-07-18T05:00:49.000-07:00</updated> <category scheme='http://schemas.google.com/g/2005#kind' term='http://gdata.youtube.com/schemas/2007#video'/> <title>Most Popular</title> <id>tag:youtube,2008:video:ZTUVgYoeN_b</id> <published>2008-07-05T19:56:35.000-07:00</published> <updated>2008-07-18T07:21:59.000-07:00</updated> <category scheme='http://gdata.youtube.com/schemas/2007/categories.cat' term='People' label='People'/> <title>Shopping for Coats</title> <content type='application/x-shockwave-flash' src='http://www.youtube.com/v/ZTUVgYoeN_b?f=gdata_standard...'/> <link rel='alternate' type='text/html' href='https://www.youtube.com/watch?v=ZTUVgYoeN_b'/> <author> <name>GoogleDevelopers</name> <uri>https://gdata.youtube.com/feeds/api/users/GoogleDevelopers</uri> <yt:userId>_x5XG1OV2P6uZZ5FSM9Ttw<</yt:userId> <yt:aspectRatio>widescreen</yt:aspectRatio> <yt:duration seconds='79'/> <yt:uploaded>2008-07-05T19:56:35.000-07:00</yt:uploaded> <yt:uploaderId>UC_x5XG1OV2P6uZZ5FSM9Ttw</yt:uploaderId> <yt:videoid>ZTUVgYoeN_b</yt:videoid> </media:group> <gd:rating min='1' max='5' numRaters='14763' average='4.93'/> <yt:recorded>2008-07-04</yt:recorded> <yt:statistics viewCount='383290' favoriteCount='7022'/> <yt:rating numDislikes='19257' numLikes='106000'/> </entry> </feed> YouTube XML SCHEMA (only a snippet) SAS PROC HTTP • Issues Hypertext Transfer Protocol (HTTP) requests • YouTube API search (https://developers.google.com/youtube/v3/) FILENAME myxml "C:\Current Work\YT.xml" encoding="UTF-8"; PROC HTTP out=myxml url="https://gdata.youtube.com/feeds/api/videos?q=surfing" method="get"; RUN; SAS XML Mapper • We now need to parse all the text in the XML into a dataset Perfect, we can use this to “map” all the structured text in the XML into a dataset – it does all the work for you! SAS XML Mapper reads in the XML or Schema file and interprets the “structure” based on the tags <> ….text….</> **Download SAS XML Mapper from SAS website SAS XML Mapper – Create a map XML window XML Map window Source window SAS XML Mapper – Create a map Open XML or Schema XML file XML map generated – save this map The datasets it will create SAS XML Mapper – Creating Datasets See how the datasets are linked by ID variables Information available about Datasets • Feed information • Title • Description • Author • Date published • Category • Duration • Number of views • Ratings • Comments (links to another API) • Restrictions • URL SASSAS Code Code FILENAME test1 " C:\Current Work\Surf.xml " encoding="UTF-8"; PROC HTTP out=test1 url="https://gdata.youtube.com/feeds/api/videos?q=surfing" method="get"; RUN; FILENAME FILENAME LIBNAME YOUTUBET 'C:\Current Work\Surf.xml' ; SXLEMAP ' C:\Current Work\YouTubeGetData.map'; YOUTUBET xmlv2 xmlmap=SXLEMAP ACCESS=READONLY; This will create all the linked datasets generated from the map Merge the datasets based on the linked IDs My project – Medications during Pregnancy • Assess the following: • How many videos have information on medications during pregnancy • The source of these videos • The popularity of these videos • The information in the videos (manual review) • Enter additional information in a database upon review Challenges • 25 videos per XML • You can pull in videos based on start index = SAS macro • 500 per search term • Pull in the 500 videos using a SAS macro • Repeat using variations on each search term = 500 per each variation • Duplicates = that’s ok because you cast a wide net and clean these later 2023 search terms %MACRO YOUTUBE; %DO k = 1 %TO &MAXID. %BY 1; 2023 search terms PROC SQL NOPRINT; SELECT DISTINCT LEFT(TRIM(MEDSEARCH)) INTO :MEDSEARCH FROM SEARCH_TERMS WHERE ID=&k.; QUIT; %DO i = 1 %TO 500 %BY 20; Increments of 20 videos for each xml %LET URL = "https://gdata.youtube.com/feeds/api/videos?q=&MEDSEARCH%nrstr(&max)-results=20%nrstr(&start)-index=&i"; [insert the proc http I showed before] PROC SQL; CREATE TABLE VIDEOS&i. AS SELECT DISTINCT [insert some SQL joins] ; QUIT; Outer Macro Loops through Search terms PROC SQL NOPRINT; SELECT COUNT(*) INTO :NOBS FROM VIDEOS&i.; QUIT; PROC APPEND DATA=VIDEOS&i. BASE=videos FORCE; RUN; PROC DATASETS LIB=WORK NOLIST; DELETE VIDEOS&i.; QUIT; RUN; %IF &NOBS.=0 %THEN %GOTO EXIT; %END; %EXIT: %END; %MEND YOUTUBE; %YOUTUBE; Inner Macro Increments through the start index getting 20 videos until 500 are retrieved Append the videos from each iteration of the inner/outer macros Import into ACCESS database Results – What I learned? • A lot of non-relevant videos • Duplicates within each iteration of videos extracted • Same video but different title • Need to have a ‘selection criteria’ to identify relevant videos (e.g. some additional data cleaning) • YouTube ‘category’ variable is not reliable • Lag time between metadata and actual time (data on YouTube) Medication+Pregnancy Search Terms, n= 2023 YouTube API Data Feed n=97,480 records Excluded Duplicate Records n = 3812 Distinct Videos n=41,438 (93,668 records) Distinct Videos n=10,462 Medication+Pregnancy Search Term in Title n=651 videos Manual Review Final Analyses n=315 videos Medication or Pregnancy Search Term not in Title or Description n=30,976 videos Not Relevant Videos Total = 336 No mention of med = 243 Duplicate = 70 Not avail./no sound = 13 Not English = 5 Legal Process = 5 Useful Resources Jason Secosky: Executing a PROC from a DATA Step http://support.sas.com/resources/papers/proceedings12/227-2012.pdf George Zhu: Accessing and Extracting Data from the Internet Using SAS http://support.sas.com/resources/papers/proceedings12/121-2012.pdf Eric Lewerenz: An Example of Website “Screen Scraping” http://www.mwsug.org/proceedings/2009/appdev/MWSUG-2009-A09.pdf SAS PROC HTTP http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n197g47i7j66x9n15xi0gaha8o v6.htm Google APIs https://developers.google.com/apis-explorer/#p/ YouTube APIs https://developers.google.com/youtube/ Thank you – I love SAS! • Questions • Feedback • Thumbs up? • Thumbs down? WRAP UP • Survey – Please complete & hand back for Lucky Draw • IAPA Chapter Meeting • SANZOC – SAS Australia & New Zealand Online Community www.communities.sas.com • Questions / Suggestions Hanlie.Myburgh@sas.com • Thank you • Lucky Draw Cop yrig ht © 2012, SAS Institute Inc. All rig hts reserv ed. ADELAIDE CHAPTER MEETING: ENSEMBLES OF 20,000 MODELS IN THE ATO Guest Speaker: Dr Graham Williams, Head of Corporate Analytics, ATO www.iapa.org.au/Events Cop yrig ht © 2012, SAS Institute Inc. All rig hts reserv ed.