Extracting YouTube Videos Using SAS

Dr Craig Hansen
Craig Hansen has been using SAS for 15 years within the health research setting. He gained his doctorate in
epidemiology at the University of the Sunshine Coast and since then has worked at
• the University of Queensland School of Medicine (Australia) – working in cardiovascular research
• the United States Environmental Protection Agency (USEPA) – working in air pollution research
• the Centers for Disease Control and Prevention (CDC, USA) – working in birth defects research
• Kaiser Permanente Center for Health Research (USA) – working on health studies using electronic medical
records.
He recently accepted a position at the South Australian Health and Medical Research Institute as Senior
Epidemiologist within the Wardliparingga Aboriginal Health Unit.
By
Craig Hansen, PhD
South Australian Health & Medical Research Institute
Overview
YouTube API
SAS Proc HTTP
SAS XML Mapper
Example – medications during pregnancy
Questions
Have you ever wanted to “scrape” all the
information from YouTube videos?
Length
Uploader
Date
Comments
Views
You can... using the YouTube Data API and SAS!
YouTube Data API (metadata for videos) → Proc HTTP → XML file → XML Mapper → SAS datasets
YouTube API – Video feed example
• https://gdata.youtube.com/feeds/api/videos?q=surfing
Google's APIs allow you to integrate
YouTube videos and functionality into
websites or applications.
Examples:
YouTube Analytics API
YouTube Data API
YouTube Player API
XML File of all the videos found
with the search term “surfing”
Only allows 25 videos per XML file
Only allows 500 videos in total
per search term
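The two limits above imply paging: 25 videos per request, up to 500 per search term. A minimal sketch of how the paged URLs could be listed in a DATA step follows. The dataset name feed_urls and the variable names are illustrative assumptions; max-results and start-index are the feed parameters used in the macro later in this talk.

```sas
/* Sketch: list the paged feed URLs for the search term "surfing". */
/* feed_urls, page_size, max_total and start_index are made-up     */
/* names for illustration only.                                    */
DATA feed_urls;
  LENGTH url $200;
  page_size = 25;    /* videos returned per XML file       */
  max_total = 500;   /* videos available per search term   */
  DO start_index = 1 TO max_total BY page_size;
    /* Single quotes stop SAS treating &max and &start as macro variables */
    url = CATS('https://gdata.youtube.com/feeds/api/videos?q=surfing',
               '&max-results=', page_size,
               '&start-index=', start_index);
    OUTPUT;
  END;
RUN;
```

With these settings the loop writes one row per request, starting at index 1, 26, 51 and so on.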
XML
• “Extensible Markup Language (XML) is a markup language that defines a
set of rules for encoding documents in a format which is both human-readable and machine-readable”
• A structured (“tree-like”) text that can be used to store information in a
hierarchical format
See the “structure” based on tags
< > …..text …..</>
This is what SAS XML Mapper reads
<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns='http://www.w3.org/2005/Atom'
xmlns:openSearch='http://a9.com/-/spec/opensearch/1.1/'
xmlns:gml='http://www.opengis.net/gml'
xmlns:georss='http://www.georss.org/georss'
xmlns:media='http://search.yahoo.com/mrss/'
xmlns:batch='http://schemas.google.com/gdata/batch'
xmlns:yt='http://gdata.youtube.com/schemas/2007'
xmlns:gd='http://schemas.google.com/g/2005'
gd:etag='W/"CE4EQH47eCp7ImA9WxRQGEQ."'>
<id>tag:youtube.com,2008:standardfeed:global:most_popular</id>
<updated>2008-07-18T05:00:49.000-07:00</updated>
<category scheme='http://schemas.google.com/g/2005#kind'
term='http://gdata.youtube.com/schemas/2007#video'/>
<title>Most Popular</title>
<id>tag:youtube,2008:video:ZTUVgYoeN_b</id>
<published>2008-07-05T19:56:35.000-07:00</published>
<updated>2008-07-18T07:21:59.000-07:00</updated>
<category scheme='http://gdata.youtube.com/schemas/2007/categories.cat'
term='People' label='People'/>
<title>Shopping for Coats</title>
<content type='application/x-shockwave-flash'
src='http://www.youtube.com/v/ZTUVgYoeN_b?f=gdata_standard...'/>
<link rel='alternate' type='text/html'
href='https://www.youtube.com/watch?v=ZTUVgYoeN_b'/>
<author>
<name>GoogleDevelopers</name>
<uri>https://gdata.youtube.com/feeds/api/users/GoogleDevelopers</uri>
<yt:userId>_x5XG1OV2P6uZZ5FSM9Ttw</yt:userId>
<yt:aspectRatio>widescreen</yt:aspectRatio>
<yt:duration seconds='79'/>
<yt:uploaded>2008-07-05T19:56:35.000-07:00</yt:uploaded>
<yt:uploaderId>UC_x5XG1OV2P6uZZ5FSM9Ttw</yt:uploaderId>
<yt:videoid>ZTUVgYoeN_b</yt:videoid>
</media:group>
<gd:rating min='1' max='5' numRaters='14763' average='4.93'/>
<yt:recorded>2008-07-04</yt:recorded>
<yt:statistics viewCount='383290' favoriteCount='7022'/>
<yt:rating numDislikes='19257' numLikes='106000'/>
</entry>
</feed>
YouTube XML SCHEMA
(only a snippet)
SAS PROC HTTP
• Issues Hypertext Transfer Protocol (HTTP) requests
• YouTube API search
(https://developers.google.com/youtube/v3/)
FILENAME myxml "C:\Current Work\YT.xml" encoding="UTF-8";
PROC HTTP
out=myxml
url="https://gdata.youtube.com/feeds/api/videos?q=surfing"
method="get";
RUN;
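Before mapping the XML it can help to confirm that the request actually wrote something to disk. The sketch below assumes the same fileref and path as the PROC HTTP step above and simply echoes the start of the file to the log.

```sas
/* Sketch: check the downloaded file and print its first lines to the log. */
FILENAME myxml "C:\Current Work\YT.xml" encoding="UTF-8";

/* FEXIST returns 1 if the file behind the fileref exists, 0 otherwise */
%PUT NOTE: YT.xml exists = %SYSFUNC(FEXIST(myxml));

DATA _NULL_;
  /* The feed may arrive as a few very long lines, so read at most */
  /* 200 characters from each of the first 5 lines                 */
  INFILE myxml LRECL=32767 TRUNCOVER OBS=5;
  INPUT line $CHAR200.;
  PUT line;
RUN;
```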
SAS XML Mapper
• We now need to parse all the text in the XML into a dataset
SAS XML Mapper is ideal for this: it “maps” all the structured text in the XML into
a dataset and does all the work for you!
SAS XML Mapper reads in the XML or Schema file and interprets the
“structure” based on the tags <> ….text….</>
**Download SAS XML Mapper from the SAS website
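As an alternative to building the map by hand in the XML Mapper application, the XMLV2 LIBNAME engine can generate a map itself through its AUTOMAP= option, in SAS releases that support that option. The sketch below is not the approach used in this talk; the map file name YT_auto.map is a hypothetical name chosen for illustration.

```sas
/* Sketch: let the XMLV2 engine generate the XML map automatically. */
/* Assumes a SAS release whose XMLV2 engine supports AUTOMAP=.      */
FILENAME ytxml 'C:\Current Work\YT.xml';
FILENAME ytmap 'C:\Current Work\YT_auto.map';   /* hypothetical map file */

/* The libref matches the fileref that points to the XML document */
LIBNAME ytxml XMLV2 XMLMAP=ytmap AUTOMAP=REPLACE;

/* List the datasets that the generated map exposes */
PROC CONTENTS DATA=ytxml._ALL_ NODS;
RUN;
```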
SAS XML Mapper – Create a map
XML window
XML Map window
Source window
SAS XML Mapper – Create a map
Open XML
or Schema
XML file
XML map generated – save this map
The datasets it will create
SAS XML Mapper – Creating Datasets
See how the datasets are linked by ID variables
Information available about Datasets
• Feed information
• Title
• Description
• Author
• Date published
• Category
• Duration
• Number of views
• Ratings
• Comments (links to another API)
• Restrictions
• URL
SAS Code
FILENAME test1 "C:\Current Work\Surf.xml" encoding="UTF-8";
PROC HTTP
out=test1
url="https://gdata.youtube.com/feeds/api/videos?q=surfing"
method="get";
RUN;
FILENAME YOUTUBET 'C:\Current Work\Surf.xml';
FILENAME SXLEMAP 'C:\Current Work\YouTubeGetData.map';
LIBNAME YOUTUBET xmlv2 xmlmap=SXLEMAP ACCESS=READONLY;
This will create all the linked datasets generated from the map
Merge the datasets based on the linked IDs
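A minimal sketch of that merge step follows. The dataset names (entry, author) and the key variable entry_ORDINAL are assumptions for illustration; the real names depend on the map that was saved, so list the library contents first and substitute what you find.

```sas
/* Sketch: inspect the mapped datasets, then join two of them on a linking ID. */
FILENAME YOUTUBET 'C:\Current Work\Surf.xml';
FILENAME SXLEMAP 'C:\Current Work\YouTubeGetData.map';
LIBNAME YOUTUBET xmlv2 xmlmap=SXLEMAP ACCESS=READONLY;

/* See which datasets and ID variables the map created */
PROC CONTENTS DATA=YOUTUBET._ALL_ NODS;
RUN;

PROC SQL;
  CREATE TABLE video_authors AS
  SELECT e.*, a.name AS author_name          /* hypothetical column */
  FROM YOUTUBET.entry AS e                   /* hypothetical dataset */
  LEFT JOIN YOUTUBET.author AS a             /* hypothetical dataset */
    ON e.entry_ORDINAL = a.entry_ORDINAL;    /* hypothetical linking ID */
QUIT;
```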
My project – Medications during Pregnancy
• Assess the following:
• How many videos have information on medications during pregnancy
• The source of these videos
• The popularity of these videos
• The information in the videos (manual review)
• Enter additional information in a database upon review
Challenges
• 25 videos per XML
   • You can pull in videos based on start index = SAS macro
• 500 per search term
   • Pull in the 500 videos using a SAS macro
• Repeat using variations on each search term = 500 per variation
• Duplicates = that’s OK because you cast a wide net and clean these later
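The macro that follows reads its search terms from a dataset called SEARCH_TERMS (variables ID and MEDSEARCH) and loops up to &MAXID. How those were built is not shown in the talk, so the sketch below is one assumed way to set them up. The three terms are made-up placeholders, not the study's actual search terms, and URL-encoding the terms with URLENCODE is an added precaution rather than something the talk describes.

```sas
/* Sketch: build SEARCH_TERMS and MAXID for the looping macro.     */
/* The terms below are illustrative placeholders only.             */
DATA search_terms;
  LENGTH medsearch $100;
  INFILE DATALINES TRUNCOVER;
  INPUT raw_term $CHAR60.;
  id = _N_;                                   /* 1, 2, 3, ... */
  medsearch = URLENCODE(STRIP(raw_term));     /* spaces become %20 */
  DATALINES;
paracetamol pregnancy
paracetamol pregnant
paracetamol during pregnancy
;
RUN;

/* Store the highest ID so the outer loop knows where to stop */
PROC SQL NOPRINT;
  SELECT MAX(id) INTO :MAXID FROM search_terms;
QUIT;
%PUT NOTE: &MAXID search terms loaded;
```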
%MACRO YOUTUBE;
  /* Outer loop: loops through the 2023 search terms */
  %DO k = 1 %TO &MAXID. %BY 1;
    PROC SQL NOPRINT;
      SELECT DISTINCT LEFT(TRIM(MEDSEARCH)) INTO :MEDSEARCH
      FROM SEARCH_TERMS
      WHERE ID=&k.;
    QUIT;
    /* Inner loop: increments through the start index, getting 20 videos */
    /* for each XML until 500 are retrieved                              */
    %DO i = 1 %TO 500 %BY 20;
      %LET URL = "https://gdata.youtube.com/feeds/api/videos?q=&MEDSEARCH%nrstr(&max)-results=20%nrstr(&start)-index=&i";
      [insert the proc http I showed before]
      PROC SQL;
        CREATE TABLE VIDEOS&i. AS
        SELECT DISTINCT
        [insert some SQL joins]
        ;
      QUIT;
      PROC SQL NOPRINT;
        SELECT COUNT(*) INTO :NOBS FROM VIDEOS&i.;
      QUIT;
      /* Append the videos from each iteration of the inner/outer loops */
      PROC APPEND DATA=VIDEOS&i. BASE=videos FORCE;
      RUN;
      PROC DATASETS LIB=WORK NOLIST;
        DELETE VIDEOS&i.;
      QUIT;
      RUN;
      /* Stop paging this search term once a request returns no videos */
      %IF &NOBS.=0 %THEN %GOTO EXIT;
    %END;
    %EXIT:
  %END;
%MEND YOUTUBE;
%YOUTUBE;
Import into ACCESS database
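One possible way to move the appended videos dataset into Access is PROC EXPORT, sketched below. This assumes SAS/ACCESS Interface to PC Files is licensed, and the database path is a hypothetical name; the talk does not say which method was actually used.

```sas
/* Sketch: export the combined dataset to a Microsoft Access table. */
/* Requires SAS/ACCESS Interface to PC Files.                       */
PROC EXPORT DATA=work.videos
    OUTTABLE="videos"
    DBMS=ACCESS
    REPLACE;
  DATABASE="C:\Current Work\YouTube.accdb";   /* hypothetical database file */
RUN;
```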
Results – What I learned
• A lot of non-relevant videos
• Duplicates within each iteration of videos extracted
• Same video but different title
• Need to have a ‘selection criteria’ to identify relevant videos (e.g. some additional data cleaning)
• YouTube ‘category’ variable is not reliable
• Lag time between metadata and actual time (data on YouTube)
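The first cleaning steps suggested by these lessons could be sketched as follows: drop repeated records, then apply a simple selection criterion on the title. The variable names videoid and title and the search stem 'pregnan' are assumptions for illustration; the study's actual criteria are not given in the talk.

```sas
/* Sketch: remove duplicate records, then keep likely relevant videos. */
/* videoid and title are hypothetical variable names.                  */
PROC SORT DATA=videos OUT=videos_distinct NODUPKEY;
  BY videoid;                 /* keep one record per video */
RUN;

DATA videos_relevant;
  SET videos_distinct;
  /* The 'i' modifier makes FIND ignore case; the subsetting IF */
  /* keeps only rows where the stem appears in the title        */
  IF FIND(title, 'pregnan', 'i') > 0;
RUN;
```

Deduplicating on the video ID does not catch the same video uploaded under a different title, so a manual review step is still needed.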
• Medication+Pregnancy search terms, n=2023
• YouTube API data feed, n=97,480 records
   Excluded: duplicate records, n=3,812
• Distinct videos, n=41,438 (93,668 records)
   Excluded: Medication or Pregnancy search term not in title or description, n=30,976 videos
• Distinct videos, n=10,462
• Medication+Pregnancy search term in title, n=651 videos
• Manual review
   Excluded: not relevant videos, total = 336
      No mention of med = 243
      Duplicate = 70
      Not avail./no sound = 13
      Not English = 5
      Legal process = 5
• Final analyses, n=315 videos
Useful Resources
Jason Secosky: Executing a PROC from a DATA Step
http://support.sas.com/resources/papers/proceedings12/227-2012.pdf
George Zhu: Accessing and Extracting Data from the Internet Using SAS
http://support.sas.com/resources/papers/proceedings12/121-2012.pdf
Eric Lewerenz: An Example of Website “Screen Scraping”
http://www.mwsug.org/proceedings/2009/appdev/MWSUG-2009-A09.pdf
SAS PROC HTTP
http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n197g47i7j66x9n15xi0gaha8ov6.htm
Google APIs
https://developers.google.com/apis-explorer/#p/
YouTube APIs
https://developers.google.com/youtube/
Thank you – I love SAS!
• Questions
• Feedback
• Thumbs up?
• Thumbs down?
Copyright © 2012, SAS Institute Inc. All rights reserved.