Recast: An Interactive Platform for
Personal Media Curation and Distribution
by
Dan Sawada
B.A., Keio University (2011)
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of
Master of Science in Media Arts and Sciences
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2014
© Massachusetts Institute of Technology 2014. All rights reserved.
Signature redacted

Author
Dan Sawada
Program in Media Arts and Sciences
May 9th, 2014
Signature redacted
Certified by
.
Andrew B. Lippman
Senior Research Scientist / Associate Director, MIT Media Lab
Thesis Supervisor
Signature redacted
Accepted by
Patricia Maes
Associate Academic Head, Program in Media Arts and Sciences
Recast: An Interactive Platform for
Personal Media Curation and Distribution
by
Dan Sawada
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
on May 9th, 2014, in partial fulfillment of the
requirements for the degree of
Master of Science in Media Arts and Sciences
Abstract
This thesis focuses on the design and implementation of Recast, an interactive media system that enables users to dynamically aggregate, curate, reconstruct, and distribute visual stories of real-world events from various perspectives. Visual media have long been a medium for passive information consumption. However, advances in communication networks and consumer devices have turned visual media into a powerful tool for user expression.
Given this background, Recast aims to present an intuitive platform for proactive citizens to create visual storyboards that represent the world from their own perspective. To fulfill this need, Recast proposes a media analysis platform, as well as a block-based user interface for semi-automating the workflow of video production. An operation test and a user study verified that Recast achieves its initial goals.
Thesis Supervisor: Andrew B. Lippman
Title: Senior Research Scientist / Associate Director, MIT Media Lab
Recast: An Interactive Platform for
Personal Media Curation and Distribution
by
Dan Sawada
Signature redacted
Thesis Advisor
Andrew B. Lippman
Senior Research Scientist / Associate Director, MIT Media Lab

Signature redacted
Thesis Reader
V. Michael Bove, Jr.
Principal Research Scientist, MIT Media Lab

Signature redacted
Thesis Reader
Ethan Zuckerman
Principal Research Scientist, MIT Media Lab
Acknowledgments
With the utmost gratitude, I thank my advisor, Andy Lippman, for offering me the opportunity to come to the MIT Media Lab and guiding my way through research. Your guidance, both academic and personal, was extremely meaningful.
I thank my thesis readers, Mike Bove and Ethan Zuckerman, for their wonderful
insights and comments toward completing this thesis.
I thank the members of the Digital Life Consortium and the Ultimate Media
Initiative of the MIT Media Lab, especially Comcast, DirecTV, and Cisco, for financially supporting my research.
I thank the members and alums of the Viral Spaces group (Travis Rich, Rob
Hemsley, Jonathan Speiser, Savannah Niles, Vivian Diep, Amir Lazarovich, Grace
Woo, and Eyal Toledano) for their friendship.
Amongst the members of my group, I owe a special thanks to Rob Hemsley
and Jonathan Speiser, my fellow mates of The Office (E14-348C), for all the fun
and inspiration.
I thank all the Japanese students and researchers in the lab, especially Hayato
Ikoma, for their support and encouragement.
I thank all of my classmates, and everyone at the lab, for providing such a unique culture and environment for pursuing the concepts of the future.
I thank my parents, Shuichi and Teruko Sawada, for their understanding and
empathy.
Last but not least, I thank Shiori Suzuki, my beloved fiancée, for always staying close to my heart.
Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Overview of Thesis

2 Background and Purpose
  2.1 Utilizing Visual Media Toward Self Expression
  2.2 Video Production and Editing
      2.2.1 History of Video Production and Editing
      2.2.2 Workflow of Video Production and Editing
      2.2.3 Issues Around Video Production and Self Expression
  2.3 Purpose

3 Related Work
  3.1 Media Analysis
  3.2 Media Aggregation and Curation
  3.3 Internet-based Media Distribution

4 Approach
  4.1 Media Analysis System
      4.1.1 System Concept
      4.1.2 System Overview
  4.2 Recast UI
      4.2.1 Concept of Interaction
      4.2.2 Overview of UI

5 Media Analysis System
  5.1 Media Acquisition Framework
      5.1.1 Cloud-based DVR
      5.1.2 Web Video Crawler
      5.1.3 User Upload Receiver
  5.2 GLUE
      5.2.1 Download Module
      5.2.2 Transcript Module
      5.2.3 Thumbnail Module
      5.2.4 Scene Module
      5.2.5 Face Module
      5.2.6 Emotion Module
  5.3 Media Database
      5.3.1 Data Store
      5.3.2 Supplemental Indexing for Text-based Metadata
      5.3.3 Media Database API
  5.4 Constellation
      5.4.1 System Setup
      5.4.2 User Interaction
      5.4.3 Visualization Interfaces

6 Recast UI
  6.1 Overview of Implementation
  6.2 Scratch Pad and Main Menu
  6.3 Content Blocks
      6.3.1 Video Assets
      6.3.2 Image Assets
      6.3.3 Text Assets
  6.4 Filter Blocks
  6.5 Overlay Blocks
  6.6 Asset Management Service
  6.7 Recast EDL
      6.7.1 Timeline
      6.7.2 Recast Video Player
      6.7.3 Publishing Recast EDLs

7 Evaluation
  7.1 Operation Tests
      7.1.1 Method
      7.1.2 Results
      7.1.3 Considerations
  7.2 User Studies
      7.2.1 Method
      7.2.2 Results
      7.2.3 Considerations

8 Conclusion
  8.1 Overall Achievements
  8.2 Future Work
      8.2.1 Improvement of Metadata Extraction
      8.2.2 Enhancement of Recast UI
      8.2.3 Large-scale Deployment

A List of Video Samples

B User Study Handout
List of Figures

1-1 Live Coverage of the 9/11 Terror Attacks
1-2 The Announcement of Pope Francis's Election
2-1 Vine
2-2 List of Supercuts
2-3 CMX 600
2-4 Popcorn Maker
2-5 NBC News Studio in New York City
3-1 Visualization of Trending Topics on Broadcast Television
3-2 Word Cloud of Trending Topics on Mainstream Media
3-3 Ustream
4-1 Concept of Tagging Scenes With Metadata
4-2 High-level Design of The Media Analysis System
4-3 Overview Design of the Recast UI
5-1 Example of a GLUE Process Request
5-2 Snippet of an SRT File
5-3 Example of an Object Representing a Phrase Mention
5-4 Example of Thumbnail Images
5-5 Example of Scene Cuts
5-6 Example of a Face Detected Within a Video Frame
5-7 Waveform and Emotion State of a Speech Segment
5-8 Setup of Constellation
5-9 Constellation in Metadata Mode
5-10 Constellation Transitioning to Metadata Mode
5-11 Visualization Screens in Constellation
6-1 Scratch Pad and Main Menu of the Recast UI
6-2 Examples of Content Blocks
6-3 Specifying Search Keywords for Retrieving Video Assets
6-4 Preview of Video Assets
6-5 Chrome Extension for Capturing Screen Shots of Web Pages
6-6 List of Image Annotation Tags in Menu
6-7 Preview of Image Assets
6-8 Creation of Text Content Blocks
6-9 Examples of Filter Blocks
6-10 List of Creators
6-11 Recording Voice
6-12 Example of a Recast EDL
6-13 Timeline With Content Blocks
6-14 Preview of Final Video
6-15 Media Matrix
7-1 Processing Times of Analysis Modules in GLUE
7-2 Comparison of the Task Completion Time
7-3 Comparison of User Ratings
List of Tables

5.1 List of Recorded Channels
5.2 List of Analysis Modules in GLUE
5.3 Schema of Solr Core
5.4 List of Endpoints Within the Query API
6.1 List of Filters
Chapter 1
Introduction
This chapter discusses the motivation and the contributions of this thesis, as well as its overall structure.
1.1
Motivation
Ever since commercial television broadcasting started in the mid-20th century, visual media has been one of the main sources for us to acquire information
and understand various events that occur throughout the globe. Many historical
events such as the collapse of the World Trade Center have been visually documented with motion pictures and disseminated simultaneously to the entire world,
as shown in Figure 1-1. There is no doubt about how powerful and influential visual media is, as it has the ability to deliver reality and immersive experiences to a mass audience in remote locations.
In the late-20th century, the Internet was introduced and revolutionized how
we interact with visual information and media in general. The Internet has enabled
visual media to become the means for expression, rather than simple consumption.
Thanks to the Internet and its related technologies, we can easily make use of
visual media to report what we see, create stories, and present our thoughts and
perspectives to the entire world in real-time. For instance, Figure 1-2 indicates
a scene where a huge crowd is documenting and sharing the announcement of
Pope Francis's election at St. Peter's Square in March 2013.
Figure 1-1: Live Coverage of the 9/11 Terror Attacks
(Source: BBC)
Figure 1-2: The Announcement of Pope Francis's Election
(Source: AP)
Given the context of visual media and self-expression, the motivation of this
thesis is to rethink the way we can use visual media to express ourselves. Although
there are many existing technologies and services we can utilize to express ourselves to the general public, there is still room for making the process simpler, more intuitive, and more accessible.
1.2
Contributions
To help fill these gaps in the area of visual media and self-expression, this thesis focuses on the design of Recast, a system that enables anyone to become a newsroom of their own. The main contributions of this thesis include the design and implementation of the following frameworks:
" A novel system for collectively gathering and indexing visual media content
* A new block-based visual programming language for producing remixed
video storyboards in a semi-automated manner
1.3
Overview of Thesis
Chapter 2 first discusses the background, covering the cultural history of visual
media and its utilization toward self-expression. It then covers the purpose and
the goals of this thesis.
Chapter 3 covers some of the prior work in the field of media analysis, media
curation, and media distribution. Given the discussion of prior work, Chapter 4
presents the approach toward creating Recast, and discusses its system design.
Chapter 5 mainly discusses the details of the Media Analysis System. It covers
the technical details and implementations of the back-end analysis engine that
drives Recast, as well as other applications.
Chapter 6 describes the details of the Recast UI, which is the front-end user
interface of Recast. It covers the user interactions around Recast, as well as the
process of how remixed news storyboards are constructed from original materials.
Finally, Chapter 7 discusses the evaluation of Recast and Chapter 8 states the
conclusion of this thesis.
Chapter 2
Background and Purpose
This chapter describes the background of this thesis, mainly covering the culture and history of visual media and self-expression. Following the background,
this chapter also describes the purpose and goal of this thesis.
2.1
Utilizing Visual Media Toward Self Expression
The majority of visual media that we interact with are produced and distributed
by major content creators such as CNN or BBC. In general, the facts that we see,
hear, and learn from video content are likely to be biased by the voices and
thoughts of the content creators. On the other hand, the act of creating and disseminating personal media content has become extremely easy, thanks to the advancement of Internet-based technologies and media platforms. For instance,
one can easily create a video and post it on YouTube [1], or share live moments
in real-time with Vine [2], as shown in Figure 2-1. However, despite the fact that
Internet-based media platforms for sharing user-generated content are increasing
in importance, the influence of old-school content creators and broadcasters remains very strong.
At the same time, there are considerably large groups of proactive citizens
who develop and enjoy the culture of vidding [3]. Vidding refers to the practice of
collecting raw materials and footage from the existing corpus of visual media, and creating remixed content based on the users' thoughts, perspectives, and creativity.

Figure 2-1: Vine
(Source: Mashable)
The official history of vidding dates back to 1975, but its general popularity has
expanded drastically with the introduction of Internet-based video sharing platforms.
The process of vidding is extremely meaningful in terms of fostering creativity
and awareness. For instance, some may attempt to increase political awareness
with Political Remix Videos [4]. A Political Remix Video refers to a genre of transformative do-it-yourself media production, whereby creators critique power structures, deconstruct social myths, and challenge dominant media messages through re-cutting and re-framing fragments of mainstream media and popular culture.
As an example, "Moms & Tiaras" [5] remix by Angelisa Candler combines selected
clips from the show, "Toddlers & Tiaras", to shift the focus away from the children
and onto their parents. By completely removing footage of the toddlers, Candler presents a re-imagined reality show called "Moms & Tiaras" that is critical of
the questionable and sometimes deeply troubling behavior of the adults behind
prepubescent beauty pageant contestants.
Not all remixed videos are related to politics. For instance, some enjoy creating
Supercuts [6], as shown in Figure 2-2. A Supercut refers to a fast-paced montage of
short video clips that obsessively isolates a single element from its source, usually
a word, phrase, or cliche from film and television. As an example, "My Little Pony:
Friendship is Magic in a nutshell" [7] aggregates the segments from the show, "My
Little Pony", which mention the word "friend", in a comical way.
Figure 2-2: List of Supercuts
(Source: Supercut.org)
2.2 Video Production and Editing
In the field of visual media, video production and editing is one aspect that
cannot be missed, regardless of its type or genre. This section describes the history
of video editing, as well as its high-level workflow and its issues.
2.2.1
History of Video Production and Editing
The history of video editing dates back to 1903, when Edwin S. Porter first introduced the crosscutting technique [8]. The first motion film to utilize the crosscutting technique was "The Great Train Robbery", where 20 individually-shot scenes
were combined to create one single sequence. Since then, linear video editing,
which refers to the act of cutting, selecting, arranging and modifying images and
sound in a predetermined, ordered sequence, became popular.
The production of visual media evolved drastically in 1971 when CMX Systems
introduced the CMX 600, which was the first non-linear video editing system [9].
Unlike linear video editing techniques that require the process of going through
films or videotapes in a sequential fashion for finding the correct segment, nonlinear video editing allows random access within a stored, electronic copy of the
material. The CMX 600, as shown in Figure 2-3, had the capability of digitally
storing 30 minutes worth of black-and-white video content. It was also equipped
with a light pen interface for making cuts and reordering scenes. Although the
CMX 600 had many limitations, it was the first-ever computerized non-linear video
editing system that became commercially available.
Figure 2-3: CMX 600
(Source: The Motion Picture Editors Guild)
Thanks to the advancement of technology, there are now many video editing and production platforms that run on personal consumer devices. For instance,
Apple bundles iMovie [10] with its Mac OS X operating system, and even provides
mobile versions for iPads and iPhones. Apart from native applications, the Popcorn
Maker [11], as shown in Figure 2-4, presents a video editing application that runs
completely within a web browser.
Figure 2-4: Popcorn Maker

2.2.2 Workflow of Video Production and Editing

Although the tools for video production and editing have evolved, the overall workflow has remained conceptually the same for a number of decades. The following steps outline the process at a high level.
1. Raw footage is collected, recorded, or filmed
2. Footage is cut into segments
3. An edit decision list (EDL) is constructed
4. Based on the EDL, segments are reordered and connected into a sequence
5. The final sequence is compiled and published
An EDL refers to a list that defines which cuts or assets from the original repository are to be used in the final sequence. It may also define how sound tracks or
other graphical elements are overlaid on the main sequence. Regardless of the
scope or scale of video production, this process remains more or less the same.
In personal video production scenarios, the process is simplified to a certain
extent. However, in professional scenarios, the process is divided into several layers to maintain the quality and speed of its outcome. For example, in the case of
a television newsroom, there are teams of highly trained experts working together
behind the scenes to deliver high-quality content in a timely fashion. Figure 2-5
shows a few photographs taken during a visit to the NBC News studio in New York
City. Every step towards producing a news program is handled by a team
of professionals around the globe, working side-by-side with a suite of complex
equipment.
Figure 2-5: NBC News Studio in New York City
2.2.3
Issues Around Video Production and Self Expression
Although the process of video production and editing has become less tedious,
the required workflow toward expressing oneself remains non-trivial. Most of the
user interfaces and interactions around current video production tools inherit the
idea of segmenting, sorting, and reordering the original assets on a linear timeline. While such user interfaces may be intuitive for skilled, experienced users, they are still cumbersome for ordinary users. In terms of vidding, users do not have a
sophisticated way to explore, sort, and understand all of the media content that is
available.
2.3
Purpose
Given this background, this thesis proposes a platform called Recast. As stated in the previous chapter, the purpose of Recast is to design an intuitive framework for proactive citizens who want to become a newsroom of their own. The framework of Recast can also be utilized for other use cases, such as creating a personalized version of "The Daily Show", where users assemble clips to make a political
point or parody.
For the scope of this thesis, Recast aims to simplify and personalize the functions of traditional newsrooms and production studios. Its key goals are to design
systems and interfaces that enable users to effectively navigate through the universe of visual media that represent real-world events, create storyboards based
on their own perspectives and contexts, and distribute their views to the world.
To achieve the goals of Recast, this thesis focuses on the design and implementation of a system for real-time visual media analysis (Media Analysis System), and
a user interface for semi-automated media aggregation and curation (Recast UI).
The Media Analysis System gathers visual media content from multiple sources, and annotates it based on various contexts. The Recast UI features an easy-to-use visual scripting language to semi-automate the process with data-driven
intelligence.
The visual scripting interface for media aggregation and curation can also be utilized for other means of self-expression. For instance, one can use it to create personalized versions of real-time sports broadcasts. Others may use it for remixing personal archives of family vacations. Although the utilization of Recast in various domains is an interesting area to explore, this thesis focuses primarily on news and other synchronous real-world events.
Chapter 3
Related Work
One of the key elements toward achieving the goals of Recast is extracting and indexing metadata, which is essential for navigating through the world of media. However, unlike text-based content, which is comparatively easy and low-cost to analyze and index, visual media is difficult to understand, process, and distribute, since it fundamentally requires intensive computation and bandwidth.
This chapter discusses some of the prior work in terms of extracting metadata,
aggregating and curating materials, and distributing content over the Internet.
3.1
Media Analysis
Toward gaining a better understanding of visual media, computational methods for extracting metadata have long been an active field of research. In this field, many researchers have proposed various algorithms and systems that contribute to the goal of extracting metadata.
One of the trends amongst existing work is the analysis of text-based metadata
that is associated with visual media. For instance, the Weighted Voting Method
[12] proposes a statistical approach of automatically extracting feature vectors from
the transcript (closed captioning), and categorizing clips based on their topics.
The ContextController [13] presents a method to extract entities from video transcripts, and display contextual information associated with the entities in real-time.
Apart from research projects, commercial services such as Beamly [14] and Boxfish [15] provide APIs to access analyzed transcripts of broadcast television. Figure
3-1 indicates an example of how Boxfish visualizes trending topics extracted from
television programs.
Figure 3-1: Visualization of Trending Topics on Broadcast Television
(Source: Boxfish)
Other trends in the field include the analysis of pixels, frames, and audio tracks within video content. For instance, Expanding Window [16] proposes an
effective approach to extract scenes from video clips with computer vision and
pseudo-object-based shot correlation analysis. Audio Keywords [17] proposes a
method, which extracts features from the audio track to detect events within a live
soccer game.
There are also methods that look into user-centric approaches to analyze the
content of videos. For instance, MoVi [18] presents a method to apply the concept
of collaborative sensing for detecting highlight scenes of an event. Ma et al.
(2002) [19] focus on building a model of the users' attention level to determine important scenes within video clips, without a semantic understanding of the clip.
3.2
Media Aggregation and Curation
After extracting metadata, some of the challenges are to properly categorize,
segment, annotate, index, and curate video content. To address these challenges, researchers have proposed various methods that combine different media
analysis techniques.
For instance, systems such as Personalcasting [20], Informedia [21], Pictorial
Transcripts [22], and Video Scout [23] present frameworks to analyze video clips
and create content-based video databases. These systems allow applications to
query scenes or clips based on topic keywords or categories. As a unique scheme
for intuitively retrieving and annotating content, Media Stream [24] proposes a visual scripting interface that utilizes iconic figures. CLUENET [25] proposes a framework to aggregate and cluster semantically similar video clips that flow through
the universe of social networks. Media Cloud [26], on the other hand, provides
an open platform to index and aggregate text-based metadata of media content
from arbitrary sources. Figure 3-2 shows a word cloud visualization of trending
topics that appeared in mainstream media through the week of May 20th, 2013.
In terms of media curation, there are mainly two concepts that underlie
existing projects and services. The first concept is to personalize the media consumption experience based on the users' interest or perspective. For instance,
NewsPeek [27] has looked into ways of examining the users' viewing habits of
nightly television news, and providing a personalized news consumption experience. More recent services such as Flipboard [28] provide platforms that aggregate and personalize the way we enjoy Internet-based news media. The second
concept is to filter out irrelevant, false, or low-quality content from a large corpus. For example, CrisisTracker [29] focuses on crowd-sourced media curation, and presents a method to automatically generate relevant and meaningful situation awareness reports from raw social network feeds that contain false or misleading posts.

Figure 3-2: Word Cloud of Trending Topics on Mainstream Media
(Source: Media Cloud)
3.3
Internet-based Media Distribution
The distribution of visual media content has been radically transformed by the
rise of the Internet, due to the rapid growth of platforms like YouTube. Netflix
[30] and other related platforms have overtaken the roles of DVD rental shops,
and provide users with on-demand access to movies and TV programs over the
Internet. Apart from static media distribution, there are platforms such as Ustream
[31] that enable any user to become a live media broadcaster and present themselves
to the world in real-time. Figure 3-3 shows a list of publicly available live broadcasts
on Ustream.
As for the technologies for Internet-based media distribution and broadcasting,
many standard protocols, coding schemes, and communication technologies have
been introduced ever since the dawn of the Internet. For instance, RealPlayer [32]
was a huge contributor in terms of video streaming. Currently, there is a strong
trend toward embedding the functions of visual media distribution within web
pages and web browsers using HTML5 [33] and the HTTP Live Streaming protocol [34]. While these methods may not be the optimal solution in terms of technical efficiency, the commoditization of high-bandwidth Internet connectivity and the extensive capabilities of web browsers make it possible to implement such functionality within simple web pages that do not rely on any third-party software.

Figure 3-3: Ustream
Chapter 4
Approach
Given the review of existing methods and technologies, this chapter describes the approach toward building Recast, which consists of the design and implementation of the Media Analysis System and the Recast UI. The Media Analysis System aims to build an archived corpus of visual media, and extract metadata that gives a better understanding of the frames that underlie each piece of content. The Recast UI, on the other hand, aims to provide a simple visual scripting
language for semi-automating the process of media aggregation and curation.
4.1
Media Analysis System
The Media Analysis System aims to build a corpus of visual media that is indexed based on frame-level metadata. Frame-level metadata refers to information
that provides deep insights into the meaning of the content.
4.1.1
System Concept
On the World Wide Web, there are basic semantic markers that make it possible to extract information and order from hypertext documents. The "a" tag,
for example, explicitly maps out connections between various web pages, and
becomes a useful tool for analysis in applications such as Google's PageRank [35].
Other HTML tags such as the header tags and emphasis tags (e.g. "bold ", "italic",
etc.) provide markers for differentiating importance of content within a document.
These structures embedded within the hypertext facilitate the creation of useful
information retrieval applications.
Video content, however, is typically a sequence of pixels and sound waves that
do not have a unified structure for easily extracting semantic data and relationships. Therefore, the Media Analysis System attempts to decompose the video container and run various analyses over each individual component to computationally extract information that defines the context of the original content. Figure 4-1 indicates the concept of extracting metadata and annotating scenes with tags that give insight into their context.
Figure 4-1: Concept of Tagging Scenes With Metadata
4.1.2
System Overview
Figure 4-2 indicates a high-level design of the system. The system is composed
of three individual parts: the Media Acquisition Framework, the extensible media
analysis framework called GLUE, and the Media Database.
Figure 4-2: High-level Design of The Media Analysis System
The Media Acquisition Framework is a family of content retrievers that include a
cloud-based DVR, web video crawlers, and a user upload receiver. Each recorder,
crawler, or receiver is a self-contained process. Its role is to collect and store
raw video clips from the specified source. As of April 2014, the framework is simultaneously capturing news content from 10 nation-level TV channels, YouTube,
and the TV News Archive [36], as well as accepting user-uploaded media.
After video clips are collected and stored, each recorder or crawler passes a message to GLUE to initiate the analysis. GLUE is an extensible, modular media analysis engine that was designed and implemented in collaboration with Robert Hemsley and Jonathan Speiser. It has the responsibility of ingesting video files and extracting the following types of metadata by analyzing the video frames, sound track, and transcript:
" Scenes cuts within the clip
" Human faces
* Static thumbnail images
" Phrases (mentions) in the transcript
" Named entities (people, organizations, locations, etc.)
* Emotional status of the speakers
Upon completion of the analysis, GLUE passes the metadata onto the Media
Database. The Media Database stores and indexes the metadata for video clips
that ran through GLUE. The Media Database also exposes an extensible RESTful
API that allows any application to issue queries. By using this API, applications
can access video clips based on various factors. For instance, applications can
query the system for all the scenes that mention "President Obama", where two
human individuals are having a conversation in an angry manner.
4.2
Recast UI
To provide an intuitive interface for users to curate content in a semi-automated fashion, the Recast UI inherits the paradigm of visual programming
languages, such as Scratch.
4.2.1
Concept of Interaction
As previously mentioned, the goal of Recast is to allow novice users to become a newsroom of their own. However, the process of curating raw materials and creating meaningful storyboards is a non-trivial task that involves a large team of professional directors, producers, reporters, designers, and technicians. Therefore, the main challenge of Recast is to design an intuitive, data-driven user interface
the main challenge of Recast is to design an intuitive, data-driven user interface
that can semi-automate the workflow of content curation and media production.
Typically, automating workflows involves some kind of computer programming
or scripting. For the scope of the Recast UI, it must have some form of scripting language that users can use to define and automate the selection of relevant
scenes from desired sources. It should also have smart ways to assist the user in
narrowing the scope or topic. Scripting languages may be full-featured programming languages, but such languages tend to be counterintuitive, especially for
individuals who do not have prior experience with coding.
In order to make the process of scripting intuitive and easy to use for everyone,
the Recast UI adopts the concept of visual programming languages, similar to
those found in Scratch [37] or VisionBlocks [38]. In the same way kids can combine
blocks in Scratch to create action scripts for animating game characters, users of
Recast can create their own news storyboards by manipulating blocks. In terms
of using visual elements for querying content, Recast UI also inherits some of the
dynamics proposed in Media Stream [24].
4.2.2
Overview of UI
Figure 4-3 indicates how content blocks (primitives that define bundles of content assets) are combined with filter blocks (primitives that define the scope of
curation). Based on the given scope, each content block automatically retrieves
content that is relevant to the context. After the curation process, users may drag
the blocks into the timeline. This timeline represents a personalized EDL, which includes all the original assets that were selected. Based on the EDL, Recast renders
all the content into a single video sequence.
The Recast UI is designed to be used on devices that have touch-enabled displays. In terms of manipulating virtual blocks, touch-based interfaces can increase
Figure 4-3: Overview Design of the Recast UI
the tangibility of elements and improve usability. The Recast UI is also intended
to be a web application that runs within a standard web browser, and a test bed
for experimenting with new web-based technologies. Amongst various browsers,
the desktop version of Google Chrome [39] was selected as the platform for running the Recast UI, since it is the most sophisticated browser in terms of supporting new technologies and standards in a stable manner.
Chapter 5
Media Analysis System
The Media Analysis System was designed and implemented in collaboration
with Jonathan Speiser and Robert Hemsley. This system serves as a unified backend data engine that drives not only Recast, but also several other applications
and demonstrations.
This chapter mainly covers the technical design and implementation of the
three main components that comprise the Media Analysis System. It also discusses the details of Constellation, which is an interactive installation that visualizes
the analysis results.
5.1
Media Acquisition Framework
As described in the previous chapter, the Media Acquisition Framework is a
family of content retrievers, which acquire the original content and pass it on to GLUE for analysis. Each piece of content acquired by this framework is given
a unique global identifier known as the UMID (Unique Media ID). As of April 2014,
the Media Acquisition Framework consists of a cloud-based DVR, two web video
crawlers, and a user upload receiver.
This framework is designed to be modular, extensible, and scalable. Every
single recorder or crawler is a stand-alone process that can run independently on
any node across the network, and all of the retrieved assets are served externally
41
using nginx [40], which is a sophisticated web server. Therefore, it is extremely easy
to add new types of retrievers, and distribute processes across multiple machines
on the network. The remainder of this section describes the details of the three
retrievers that were initially implemented.
5.1.1
Cloud-based DVR
At the MIT Media Lab, there is an in-house satellite head end that has the
ability of translating DirecTV [41] channel feeds into multicast UDP streams. The
cloud-based DVR is a process that records television programs from DirecTV by
capturing and transcoding the UDP/IP streams served by the head end.
The cloud-based DVR is a stand-alone daemon written in Python that periodically retrieves the program guide and simultaneously records individual programs
from the 10 nation-level channels listed in Table 5.1. The program guide used by the DVR is provided by Tribune Media Services (TMS) [42]. TMS exposes web-based APIs that third-party applications can use to access the television program guide. Based on the program guide, the DVR forks FFmpeg processes for recording the actual programs. After each program is successfully recorded, the DVR
sends a process request to GLUE via HTTP.
Table 5.1: List of Recorded Channels

Channel Name         Channel Number
PBS (WGBH)           002
ABC (WCVB)           005
FOX (WFXT)           025
BBC America          264
Discovery Channel    278
FFmpeg [43] is a cross-platform tool that can capture UDP streams and transcode videos to various formats. Each FFmpeg instance forked by the DVR has the responsibility of recording a single program and transcoding it into an HTML5-compatible format. As of April 2014, video format support in HTML5 is still fragmented. For the scope of this thesis, FFmpeg was configured to use H.264 [44]
for the video, and AAC [45] for the audio. These are the codecs compatible with
Chrome. FFmpeg is also configured to create low-resolution (640 x 360 pixels,
15 FPS) versions of the video for analysis purposes. Apart from FFmpeg, a tool
called CCExtractor was also used to extract the transcript (closed captioning) as a
SubRip [46] subtitle (SRT) file from the original video source.
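As a rough sketch of how the DVR might fork such an FFmpeg process from Python, the snippet below records a multicast UDP stream into an H.264/AAC file and a 640 x 360, 15 FPS copy for analysis. The stream address, output paths, and exact encoder flags are illustrative assumptions rather than the DVR's actual configuration.

import subprocess

def record_program(udp_url, out_path, low_path, duration_sec):
    # Full-resolution, HTML5-compatible recording (H.264 video, AAC audio)
    subprocess.Popen([
        "ffmpeg", "-i", udp_url, "-t", str(duration_sec),
        "-c:v", "libx264", "-c:a", "aac", out_path,
    ])
    # Low-resolution (640x360, 15 FPS) copy used by the analysis modules
    subprocess.Popen([
        "ffmpeg", "-i", udp_url, "-t", str(duration_sec),
        "-vf", "scale=640:360", "-r", "15",
        "-c:v", "libx264", "-c:a", "aac", low_path,
    ])

# Hypothetical invocation for a one-hour program
record_program("udp://239.0.0.1:1234", "SH000000210000.mp4",
               "SH000000210000_low.mp4", 3600)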
The DVR is also equipped with a feature for dynamically configuring the context
of programs to record. For the scope of this thesis, the DVR was configured to only
record programs that are related to news, based on the TMS program guide.
5.1.2 Web Video Crawler
In addition to the DVR, the Media Acquisition Framework also includes web
video crawlers that can scour and retrieve videos found on the World Wide Web.
The crawlers are stand-alone scripts based on a Python framework called Scrapy
[47]. The scripts run periodically and recursively follow web links to find videos. For the scope of this thesis, two types of crawlers were implemented: one being a YouTube crawler, and the other a TV News Archive [36] crawler. Crawlers look for video content related to news within their service domain. After a crawler discovers video content, it passes the URL to a message queue
server based on RabbitMQ [48]. The message queue server then dispatches a process, which downloads the content, transcodes the video to an HTML5-compatible
format, and generates a low-resolution version. After all the processing is completed, it sends a process request to GLUE via HTTP. If a transcript is available
within the video, it is extracted and stored as an SRT file.
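A minimal sketch of the hand-off from a crawler to the message queue is shown below, using the pika client for RabbitMQ; the queue name and message fields are illustrative assumptions, not the framework's actual names.

import json
import pika

# Connect to the RabbitMQ broker (host and queue name are illustrative)
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="discovered_videos", durable=True)

def enqueue_video(url, source):
    # Publish a discovered video URL so a worker can download and transcode it
    channel.basic_publish(
        exchange="",
        routing_key="discovered_videos",
        body=json.dumps({"url": url, "source": source}),
    )

# Example: a YouTube crawler handing off a video it found
enqueue_video("https://www.youtube.com/watch?v=VIDEO_ID", "youtube")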
5.1.3
User Upload Receiver
The user upload receiver is a simple web server based on Node.js [49] and
Express [50]. After it accepts file uploads from users, it saves the file onto the disk,
and examines the file format.
If the file is a typical video file, the receiver transcodes the video into an HTML5-compatible format, and creates a low-resolution version. If the file is a disk image of
a video DVD, the receiver extracts the video and applies the necessary processing.
If a subtitle track is present within the uploaded video, the receiver also extracts
the transcript into an SRT file. After everything is completed, the receiver sends a
process request to GLUE via HTTP.
The receiver also accepts IDs of YouTube videos as an input. When YouTube
video IDs are received, they are passed to the same message queue server that
handles input from the YouTube crawler and processed accordingly.
5.2
GLUE
GLUE is the extensible media analysis framework that extracts frame-level metadata from arbitrary video sources. Once GLUE receives a process request from the
Media Acquisition Framework through its web-based API, it retrieves the raw content from the file server, and conducts various analyses on the transcript, the video
frames, and the sound track. As shown in Figure 5-1, GLUE expects process requests to be in JSON format, and have fields such as an unique ID, a title, and URLs
of the video content. As previously indicated, process requests are sent from the
Media Acquisition Framework to GLUE as part of HTTP POST requests.
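As a rough illustration, a retriever could submit such a request with a simple HTTP POST from Python; the endpoint URL below is an assumption, and the payload mirrors the fields of the example request in Figure 5-1.

import requests

process_request = {
    "id": "SH000000210000",
    "title": "20/20",
    "cc": "http://url/to/SH000000210000.srt",
    "media": {"low": "http://url/to/SH000000210000_low.mp4"},
    "length": 3540.0,
    "startTime": "2014-03-22T02:01Z",
    "endTime": "2014-03-22T03:00Z",
    "channel": "005",
}

# The endpoint path is hypothetical; GLUE exposes its API through Twisted.
response = requests.post("http://glue.example.com/process", json=process_request)
response.raise_for_status()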
{
  "id": "SH000000210000",
  "title": "20/20",
  "cc": "http://url/to/SH000000210000.srt",
  "media": {
    "low": "http://url/to/SH000000210000_low.mp4"
  },
  "length": 3540.0,
  "startTime": "2014-03-22T02:01Z",
  "endTime": "2014-03-22T03:00Z",
  "channel": "005"
}

Figure 5-1: Example of a GLUE Process Request

From a technical viewpoint, GLUE is a message queue manager written in Python, which assigns processing tasks to stand-alone analysis modules, and aggregates the results into a unified JSON data structure. It makes use of the standard multiprocessing framework included within Python for implementing the message queue. For exposing the web-based API, it makes use of Twisted [51], which is an event-driven networking framework for Python. The analysis modules are
designed to run in parallel, and extract the metadata from the given transcript
or video. Since analysis modules are implemented as independent Python modules that are dynamically imported, it is extremely easy to add new modules that
conduct new types of analysis. After the analysis results are returned from all of
the modules, GLUE stores the aggregated metadata into the Media Database by
utilizing a driver called PyMongo [52].
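The plug-in structure can be pictured with the sketch below: each analysis module exposes a common entry point, and the manager imports whatever modules are configured. The package layout, module names, and analyze() interface are assumptions for illustration, not GLUE's actual internal API.

import importlib

# Names of analysis modules to load; the real set is configurable in GLUE.
MODULE_NAMES = ["download", "transcript", "thumbnail", "scene", "face", "emotion"]

def run_analysis(request):
    # Dynamically import each analysis module and collect its results
    metadata = {}
    for name in MODULE_NAMES:
        module = importlib.import_module("modules." + name)  # assumed package layout
        metadata[name] = module.analyze(request)
    return metadata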
Table 5.2 indicates a list of all the modules that are currently implemented. The
remainder of this section covers the details of each analysis module.
Table 5.2: List of Analysis Modules in GLUE

Module Name    Module Description
Download       Downloads the low-resolution version of the original video
Transcript     Parses the SRT transcript and extracts text-based metadata with NLP
Thumbnail      Generates thumbnail images and an animated GIF summary
Scene          Detects scene cuts within the video
Face           Detects human faces within video frames
Emotion        Detects speech segments and recognizes the speakers' emotional status
5.2.1
Download Module
The download module handles the responsibility of downloading the low-resolution
version of the original video content. All the analysis modules, except for the transcript module, rely on the video file downloaded by this module.
GLUE downloads the video file into a temporary directory of its own, instead
of having the analysis modules directly access the video through a shared file system. After the file is successfully downloaded, this module sends a notification
to the central message queue manager, which then dispatches analysis requests
to other modules that rely on the video file. From a performance perspective,
there is a huge overhead of retrieving files through a web server. However, this architecture enables multiple instances of GLUE to exist and operate independently
anywhere on the Internet, without the need of a shared file system that could cause
constraints in terms of deployment.
5.2.2
Transcript Module
The transcript module handles the role of parsing the content of the SRT file for
any process request that contains a transcript. It also conducts natural language
processing (NLP) over the transcript to extract any useful metadata that describes
the context of the video.
Since the process request only contains a URL to the actual SRT file, the transcript module starts the process by downloading the content of the SRT into a
string variable. As shown in Figure 5-2, an SRT file lists the spoken phrases along
with time-offset values. Time-offset values indicate the segment of the video,
where the corresponding phrase was mentioned. These values are relative to the
beginning of the video, and indicate the start time and the end time of a segment.
Given an SRT file, this module parses each line and creates an array of nested
objects, which contain the actual phrase, its start time, and its end time. Figure 5-3
indicates an example of an object that represent a phrase mention. It also creates
a separate field that contains the entire transcript as a concatenated string.
46
Figure 5-2: Snippet of an SRT File
{
  "start": 0,
  "end": 8.073,
  "result": "THEN WE RIDE ALONG WITH A GROUP OF BOUNTY HUNTERS THAT ARE SURE TO CATCH YOU BY SURPRISE."
}
Figure 5-3: Example of an Object Representing a Phrase Mention
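A simplified sketch of this parsing step is shown below; it converts SRT time codes into seconds and produces objects shaped like the one in Figure 5-3. The helper names are illustrative, not the module's actual code.

import re

def srt_time_to_seconds(timestamp):
    # Convert an SRT time code such as "00:00:08,073" into seconds
    hours, minutes, rest = timestamp.split(":")
    seconds, millis = rest.split(",")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0

def parse_srt(srt_text):
    # Parse SRT text into a list of {start, end, result} phrase mentions
    mentions = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # each block holds an index, a timing line, and the phrase
        start, end = [t.strip() for t in lines[1].split("-->")]
        mentions.append({
            "start": srt_time_to_seconds(start),
            "end": srt_time_to_seconds(end),
            "result": " ".join(lines[2:]),
        })
    return mentions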
After the module parses the content of the SRT file, it sends the transcript to
AlchemyAPI [53] and openCalais [54], which are cloud-based NLP toolkits. This
module uses these toolkits to extract named entities and social tags that appear
within the transcript. Named entities refer to phrases that indicate specific entities
such as geographic locations, organizations, or individual people. Social tags, on
the other hand, refer to phrases that indicate specific real-world events or contexts.
Both AlchemyAPI and openCalais provide web-based APIs for receiving text data
and returning analysis results. The results from these NLP toolkits are combined
with the original parse results, and sent back to the central message queue.
5.2.3
Thumbnail Module
The thumbnail module handles the task of generating thumbnail images of the
video. Thumbnail assets include: a poster image (dump of a randomly selected
frame) of the video, minute-by-minute frame dumps of the video, and an animated
GIF that summarizes the video.
Both the poster image and the minute-by-minute frame dumps are generated
by utilizing OpenCV [55], which is an open-source computer vision library that
has the capabilities of scrubbing through a video and dumping a frame into an
image file. Figure 5-4 indicates an example of minute-by-minute frame dumps. For
videos that are less than two minutes (120 seconds), frame dumps are generated
every 6 seconds. The GIF summarization is created by concatenating the frame dumps into a single animated sequence, a process handled by a feature within FFmpeg.
Figure 5-4: Example of Thumbnail Images
All of the assets created by this module are stored within a directory that is exposed externally by nginx. After the entire thumbnail generation process is completed, this module returns a list of URLs that point to the assets.
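As a rough sketch of how a minute-by-minute frame dump could be produced with OpenCV, assuming the file paths and interval below for illustration:

import cv2

def dump_frames(video_path, out_prefix, interval_sec=60):
    # Dump one frame every interval_sec seconds from the video
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    paths = []
    for frame_index in range(0, total_frames, int(fps * interval_sec)):
        capture.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
        ok, frame = capture.read()
        if not ok:
            break
        path = "%s_%06d.jpg" % (out_prefix, frame_index)
        cv2.imwrite(path, frame)
        paths.append(path)
    capture.release()
    return paths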
5.2.4
Scene Module
The scene module handles the responsibility of detecting scene cuts within a
video. It generates and returns a list of nested objects, which includes the start
time, the end time, and URLs that point to thumbnail images of the scene segment.
The start time and end time are timestamps relative to the beginning of the video.
The thumbnails are frame dumps of the first and last frames of the scene.
48
For detecting the scenes, this module uses a method similar to the one proposed in Expanding Window [16]. As shown in Figure 5-5, scene cuts often lead
to a drastic change in the color distribution of consecutive video frames. As a result, scene cuts may be computationally identified by thresholding the Euclidean
distance between the color histograms of two consecutive frames. In order to generate color histograms and calculate Euclidean distances, this module makes use
of OpenCV. In the same way the thumbnail module generates images of frame
dumps, this module also creates images from the first and last frames of each
scene, and saves them under a folder that is served by nginx.
Figure 5-5: Example of Scene Cuts
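A minimal sketch of this histogram-distance approach follows; the threshold value and per-frame comparison are simplifying assumptions for illustration.

import cv2
import numpy as np

def detect_scene_cuts(video_path, threshold=0.4):
    # Return frame indices where the color-histogram distance exceeds the threshold
    capture = cv2.VideoCapture(video_path)
    cuts, prev_hist, index = [], None, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        # 3-D color histogram over the B, G, and R channels, normalized to unit length
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256]).flatten()
        hist = hist / (np.linalg.norm(hist) + 1e-9)
        if prev_hist is not None and np.linalg.norm(hist - prev_hist) > threshold:
            cuts.append(index)  # Euclidean distance spike marks a scene cut
        prev_hist, index = hist, index + 1
    capture.release()
    return cuts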
5.2.5
Face Module
The face module handles the role of detecting faces that appear within a video.
This module identifies the segments within a video that contain human faces, and
returns a list of nested objects that indicate the start time and the end time of each
segment, along with the number of faces it detected.
For detecting the faces, this module also utilizes OpenCV, which has the ability
of applying cascade classification [56] over the frames. Figure 5-6 indicates an
example of a face detected within a video frame. Although this module can detect
faces that appear, it does not have the feature to recognize the identity of the
individuals. That being said, it is technically feasible to apply machine-learning
techniques for cross-referencing the detected faces against a training data set of
known individuals to recognize their identity.
Figure 5-6: Example of a Face Detected Within a Video Frame
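A sketch of the cascade-classification step with OpenCV is shown below; the cascade file is the standard frontal-face model bundled with OpenCV, and the per-frame interface is simplified for illustration.

import cv2

# Pre-trained Haar cascade for frontal faces, bundled with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def count_faces(frame):
    # Return the number of frontal faces detected in a single video frame
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces)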
5.2.6
Emotion Module
The emotion module handles the task of identifying segments with speech activity, and recognizing the emotional status of the speakers. This module returns a
list of nested objects that indicate the start time and end time of a segment, along
with a state that best describes the emotional status of the speakers.
Before any analysis is conducted, the emotion module extracts the audio track
from the video using FFmpeg. The audio track is converted into a non-compressed
linear PCM sequence, and saved into a temporary working directory. For identifying the segments with speech activity, this module uses an audio processing
toolkit called SoX [57], which normalizes the audio and segments the original file
based on speech activity. For detecting the emotional status, this module uses
an open-source audio-based emotion analysis toolkit called OpenEAR [58]. OpenEAR extracts feature points from an audio stream, and cross-references it with a
pre-trained data set that maps features with seven different emotion states. The
seven emotion states include anger, boredom, disgust, fear, happiness, neutral,
and sadness. Figure 5-7 indicates an example of a speech segment and its emotional status.
Figure 5-7: Waveform and Emotion State of a Speech Segment
5.3
Media Database
The Media Database is a self-contained centralized framework that stores and
indexes the metadata extracted by GLUE. It uses MongoDB [59] for storing data,
Apache Solr [60] for indexing data, and Tornado [61] for exposing a RESTful API.
This section covers the technical details of its design and implementation.
5.3.1
Data Store
The entire Media Analysis System is designed to use JSON data structures to
pass information from component to component. Due to this nature, the Media
Database utilizes MongoDB, which is a schema-less document store that can handle JSON data structures seamlessly. In MongoDB, individual data entries (JSON objects) that contain the metadata are referred to as documents, and groups of multiple documents are referred to as collections.
MongoDB integrates well with Python scripts, and its schema-less architecture
provides high levels of flexibility and extensibility. Although data schemas are useful when querying data, the flexibility of schema-less data stores enables developers to rapidly modify the features of the Media Acquisition Framework and
GLUE, without being aware of the underlying database.
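As a minimal illustration of how GLUE's aggregated metadata could be stored with the PyMongo driver, assuming the database name, collection name, and document fields shown below:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["media"]["assets"]  # database and collection names are illustrative

def store_metadata(document):
    # Insert one aggregated metadata document produced by GLUE
    return collection.insert_one(document).inserted_id

store_metadata({
    "umid": "SH000000210000",
    "title": "20/20",
    "scenes": [{"start": 0.0, "end": 12.4}],
    "faces": [{"start": 3.2, "end": 7.9, "count": 2}],
})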
5.3.2
Supplemental Indexing for Text-based Metadata
In general, MongoDB has a sophisticated way of indexing data for expediting
queries. However, it is not optimized for free text search. Therefore, the Media Database utilizes Solr as a supplement. Solr is an open-source search engine
framework that is optimized for enabling fast text-based search queries. Its main
role is to index the metadata generated by the transcript module within GLUE, and
allow free text search over the spoken phrases.
Upon indexing data, Solr requires a schema to be defined for each core (index database). Table 5.3 indicates a list of field names and data types that were
specified in the schema.
Table 5.3: Schema of Solr Core

Field Name    Field Description                          Data Type
umid          UMID of content                            String
timestamp     Date of content creation                   Date
title         Title of content                           Text
transcript    Phrases mentioned within video segment     Text
startTime     Start time of video segment                Double
creator       Creator of content                         String
videoLow      URL of low-resolution video file           String
image         URL of thumbnail image                     String
For each object that contains a field generated by the transcript module, three types of Solr entries are created. First, an entry categorized as type "0" is created for the entire content. Second, entries categorized as type "1" are created for each individual scene segment. Finally, entries categorized as type "2" are created for each phrase mention. For each subdivided video segment that corresponds to a scene or a phrase mention, the actual words spoken within the segment are pulled and matched accordingly from the analysis results of the transcript module. Creating multiple entries based on video segments allows queries to be processed in a non-hierarchical manner for retrieving specific segments within videos.
The actual creation process of Solr entries is handled by a custom data migration daemon written in Python. This daemon monitors the MongoDB collection, intercepts new documents, and generates Solr entries accordingly. After all of the entries are created, they are sent to Solr through a web-based API for indexing.
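A sketch of how the migration daemon might push such entries to Solr's JSON update endpoint is shown below; the host, core name, and field values are assumptions for illustration.

import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/media/update?commit=true"

def index_entries(entries):
    # Send a batch of documents to Solr for indexing
    response = requests.post(SOLR_UPDATE_URL, json=entries)
    response.raise_for_status()

# One illustrative entry for a phrase mention, mirroring the schema in Table 5.3
index_entries([{
    "umid": "SH000000210000",
    "title": "20/20",
    "transcript": "THEN WE RIDE ALONG WITH A GROUP OF BOUNTY HUNTERS",
    "startTime": 0.0,
    "creator": "ABC (WCVB)",
    "videoLow": "http://url/to/SH000000210000_low.mp4",
}])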
5.3.3
Media Database API
The Media Database API is a RESTful interface for querying and retrieving the
metadata from the Media Database. This API is exposed by a Python script, which
makes use of a lightweight web framework called Tornado. Each API endpoint was
implemented as a Python class, and exposed by Tornado. Table 5.4 indicates a list
of endpoints that are available within the query API.
Table 5.4: List of Endpoints Within the Query API

Endpoint Name           Endpoint Description
GET /search/:keyword    Returns a list of assets based on the keyword
GET /creators           Returns a list of all content creators
GET /umid/:umid         Returns a single asset based on the UMID
GET /recent             Returns a list of assets processed within the past 24 hours
GET /trends             Returns a list of trending keywords within the past 24 hours
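As a rough illustration of this structure, the following sketch exposes two of these endpoints with Tornado; the handler bodies are stubs and the port number is an assumption.

import tornado.ioloop
import tornado.web

class TrendsHandler(tornado.web.RequestHandler):
    def get(self):
        # The real handler counts named entities and social tags from the
        # past 24 hours; a static list stands in for that here.
        self.write({"trends": ["Ukraine", "election"]})

class UmidHandler(tornado.web.RequestHandler):
    def get(self, umid):
        # The real handler returns the full MongoDB document for this UMID.
        self.write({"umid": umid})

def make_app():
    return tornado.web.Application([
        (r"/trends", TrendsHandler),
        (r"/umid/([^/]+)", UmidHandler),
    ])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()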
The "search" endpoint interfaces directly with Solr, and returns the results of free text queries based on the given keyword. By default, Solr returns a list of the top 20 assets that match the search criteria. Optional URI parameters may be added to specify the maximum number of results, or to narrow the criteria based on various aspects such as the content creator, the duration, the timestamp, and the type of video segment.
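For example, a client might request a handful of phrase-level matches from one creator roughly as follows; the parameter names are assumptions, since the thesis does not enumerate them.

import requests

# Hypothetical optional parameters narrowing a free text search.
params = {"rows": 5, "creator": "BBC World News", "type": 2}
response = requests.get("http://localhost:8888/search/ukraine", params=params)

for asset in response.json().get("results", []):
    print(asset.get("umid"), asset.get("startTime"))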
The "creator" endpoint returns a list of all the unique content creators. Content
creators include entities such as television broadcasters and creators of YouTube
videos. It utilizes MongoDB's distinct query feature to extract unique values within a given field.
53
The "umid" endpoint returns the metadata associated with the specified content. The "recent" endpoint, on the other hand, returns a list of assets that were processed within the past 24 hours. Both of these endpoints are designed to return entire MongoDB documents, which contain all of the metadata.
The "trends" endpoint returns a list of trending topics. This list is created by counting the appearance frequencies of named entities and social tags that were extracted from videos created within the past 24 hours.
5.4
Constellation
Constellation is an interactive installation, which was implemented to visualize
the metadata of visual media that was acquired and processed by the Media Analysis System. This section discusses the details of its system architecture, as well as
its user interaction.
5.4.1
System Setup
Constellation consists of a wall with touch-sensitive displays (iPads), a projector, a control panel, and a message bus server. The displays are primarily meant
for playing videos and visualizing metadata. The projector augments the wall with
projection-mapped animations. The message bus server is the key component
for synchronizing all the devices and managing their states. It also periodically queries the Media Analysis System and retrieves the metadata of video assets. Figure 5-8 indicates the actual setup of Constellation.
All of the user interface elements on the displays, the projector, and the control panel were implemented as HTML5-based web applications, which make use of canvases and CSS transforms. The communication between all of the components is handled by Socket.IO [62], which is a real-time messaging framework that utilizes
the WebSocket protocol [63].
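Regardless of the language the actual message bus server is written in, its polling half can be pictured with the short Python sketch below; the host, port, and polling interval are assumptions.

import time
import requests

MEDIA_DB_API = "http://localhost:8888"  # assumed location of the Media Database API

def fetch_recent_assets():
    # The "recent" endpoint returns the assets processed in the past 24 hours.
    return requests.get(MEDIA_DB_API + "/recent").json()

while True:
    recent = fetch_recent_assets()
    # The real server would now push these assets to the displays, projector,
    # and control panel over Socket.IO; printing stands in for that here.
    print("retrieved", len(recent), "recent assets")
    time.sleep(60)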
54
Figure 5-8: Setup of Constellation
5.4.2
User Interaction
In its initial stage, Constellation displays the most recent videos that were acquired and processed by the Media Analysis System. Users may use the control panel
to adjust the scope of time, and display videos that were processed within a given
time frame in the past.
When a display is touched by a user, the state of the system changes to metadata mode. In this mode, the display that was touched will continue to play the
video as shown in Figure 5-9. However, all other displays will make a transition
and visualize the metadata associated with the main video.
As shown in Figure 5-10, the projector overlays animations of fireballs and
comets on the wall while Constellation transitions from its normal state to metadata
mode. The animation demonstrates the metaphor of metadata emerging from the
55
Figure 5-9: Constellation in Metadata Mode
original video, building up energy, and releasing small bits of information (shooting stars) as it explodes.
5.4.3
Visualization Interfaces
The current version of Constellation is designed to handle visualizations based
on the transcript, named entities, and scene cuts. Figure 5-11 shows screen shots
from the visualizations. As previously mentioned, all of the visualization interfaces
were implemented as HTML5-based web applications. Therefore, it is easy to
extend future versions of Constellation with new visualizations.
56
Figure 5-10: Constellation Transitioning to Metadata Mode
Figure 5-11: Visualization Screens in Constellation (Named Entities, Transcript, Scenes)
57
58
Chapter 6
Recast UI
The Recast UI is the key component that allows users to easily aggregate, curate, and present their views to the world. This user interface adopts a block-based visual scripting paradigm for semi-automating the process of video
production. This chapter describes the interface design, as well as its technical
implementations.
6.1
Overview of Implementation
As previously mentioned in Chapter 4, the Recast UI is implemented as an
HTML5-based web application that is optimized to run on desktop versions of
Google Chrome. It also makes use of touch-based interaction techniques for increasing the tangibility of visual blocks and improving the usability of the interface.
Since desktop versions of Chrome cannot translate OS-level touch inputs into
Javascript events, Caress [64] was utilized as middleware to fill this role. For enabling programmatic manipulation of DOM elements within the HTML document, the Recast UI makes heavy use of jQuery [65], which is a Javascript library for simplifying client-side scripting. It also makes use of jQuery UI [66] and jQuery
UI Touch Punch [67] for implementing the draggable block elements. In terms of
intercepting touch events and recognizing gestures, toe.js [68] was used.
59
6.2
Scratch Pad and Main Menu
As shown in Figure 6-1, the Recast UI is initialized with a blank scratch pad,
which is the area where users can drag and combine content blocks with filter
blocks. Users can also tap anywhere on the scratch pad to access the main menu,
which provides several options. The main menu element utilizes a Javascript library
called Isotope [69]. Isotope enables developers to easily create hierarchical layouts
and animated transitions. Users can tap and navigate through the main menu to
add new blocks, as well as to preview and publish the final video.
6.3
Content Blocks
Content blocks refer to primitives that define bundles of content assets that
can be used within a video storyboard. A content block can represent a group of
video, image, or text assets. Each content block only contains assets that match
the criteria. For example, one specific video content block may contain scene
segments that mention something about Ukraine.
Figure 6-2 indicates examples of content blocks. A basic block has placeholders for a thumbnail image and a label. The thumbnail image corresponds to one of
the assets that match the user-specified criteria, and the label indicates the initial
scope, such as keywords or tags. Existing content blocks may be removed easily
from the scratch pad by tap-holding the block element.
6.3.1
Video Assets
Video content blocks provide one of the key features of the interface. These
blocks enable users to easily aggregate, curate, and import video content. These
blocks directly query the Media Analysis System, and automatically retrieve video
assets that match the criteria.
Figure 6-3 indicates two different ways of specifying the search keyword upon
creating new blocks. Users can either input a custom search keyword, or select one
60
Figure 6-1: Scratch Pad and Main Menu of the Recast UI
61
Figure 6-2: Examples of Content Blocks
from the list of trending topics. This list is dynamically populated with up-to-date keywords that represent current trends. It was originally designed to use keywords from the Media Analysis System. However, after experimentation, it was modified to use data from WorldLens [70], which provided more relevant keywords in terms of narrowing the scope.
Figure 6-3: Specifying Search Keywords for Retrieving Video Assets
After the keyword is specified, a block that bundles the relevant videos appears on the scratch pad. By default, it uses the "search" endpoint of the Media
Database API, and fetches a maximum of 20 relevant assets from the Media Analysis System. Figure 6-4 shows a preview screen where users can view the assets
that are included within the block. This preview screen can be invoked by double-tapping the block. In the preview screen, users can preview the videos and discard any irrelevant assets from the bundle. A library called Video.js [71], which provides
helper methods around HTML5 video elements, handles the video playback. The
62
cover flow of thumbnails, on the other hand, is dynamically generated using a
library called coverflowjs [72].
Figure 6-4: Preview of Video Assets
Users may add filter blocks to narrow the scope of search. For instance, users
who wish to create a news commentary regarding the recent developments in
Ukraine may add a timestamp filter and a creator filter for getting the most recent
videos created by a specific organization. Other users who wish to create supercuts may apply filters to retrieve, for example, 30 phrase segments that mention the given
keyword.
6.3.2
Image Assets
Image content blocks enable users to import arbitrary images into their storyboard. While there are many ways to import images, the Recast UI focuses on
methods for importing screen shots of web pages without hassle. This feature
enables users to easily import and remix assets they see as they are browsing the
Internet.
For capturing screen shots, a Chrome extension was implemented. As shown
in Figure 6-5, users can capture a screen shot of the browser window with a click of a button and see its preview. Once a user saves the screen shot, annotated with an arbitrary tag, it is immediately uploaded to the Asset Management Service, which indexes image and audio assets.
Figure 6-5: Chrome Extension for Capturing Screen Shots of Web Pages
Figure 6-6 indicates the list of annotation tags displayed within the main menu.
This list is dynamically populated with image annotation tags that exist within the
Asset Management Service. Image content blocks use these tags as a basis for
bundling multiple images. When a new block is added, it retrieves all the images that are associated with the given tag.
Upon retrieving content, users can apply filters to narrow the scope. Users
can also double-tap the block to invoke the preview screen shown in Figure 6-7.
Through the preview screen, users may review the assets, and discard irrelevant
items from the bundle.
6.3.3
Text Assets
Text content blocks refer to blank screens with text overlays. These blocks can
be used for adding title screens and scene separators throughout the storyboard.
As shown in Figure 6-8, users must enter a text string that they wish to display. By
64
Figure 6-6: List of Image Annotation Tags in Menu
Figure 6-7: Preview of Image Assets
default, the duration of the screen is set to three seconds. This may be adjusted
to any given length by applying a duration filter.
6.4
Filter Blocks
Filter blocks refer to primitives that define the scope of curation. Figure 6-9
shows an example of filter blocks on the scratch pad.
65
Figure 6-8: Creation of Text Content Blocks
Figure 6-9: Examples of Filter Blocks
As previously noted, filter blocks may be combined with content blocks for
narrowing their scope. Table 6.1 indicates the seven types of predefined filters that
may be applied to content blocks. Video content blocks accept all types of filters. Image content blocks, however, only accept filters related to the timestamp, the
duration, and the number of items. As for the text content blocks, none of the
filters can be added except for the duration filter.
Table 6.1: List of Filters
When filter blocks are added to video content blocks, optional parameters are added to the URI for querying the Media Analysis System. When filter blocks are added to image content blocks, optional parameters are added to the URI for querying the Asset Management Service. As for text content blocks, filter blocks change the duration parameter of the screen when duration filters are added. Filters may be removed at any time by tap-holding the block. Content blocks reload the assets within the bundle whenever new filters are added or existing filters are removed.
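This behavior can be pictured with the schematic Python sketch below; the actual Recast UI implements the equivalent logic in Javascript, and the parameter names are illustrative.

class ContentBlock:
    """Schematic model of a content block that tracks its attached filter blocks."""

    def __init__(self, keyword):
        self.keyword = keyword
        self.filters = []          # each filter block contributes a dict of parameters

    def add_filter(self, filter_params):
        self.filters.append(filter_params)
        return self.reload()

    def remove_filter(self, filter_params):
        self.filters.remove(filter_params)
        return self.reload()

    def reload(self):
        # Rebuild the optional URI parameters and (in the real UI) re-query
        # the Media Analysis System or the Asset Management Service.
        params = {}
        for filter_params in self.filters:
            params.update(filter_params)
        return params

block = ContentBlock("ukraine")
print(block.add_filter({"creator": "BBC World News"}))   # {'creator': 'BBC World News'}
print(block.add_filter({"duration": 30}))                # both parameters merged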
Most of the filters have values that are statically predefined, or have values that
are definable by the user. However, filters related to creators have values that
are dynamically retrieved from the Media Analysis System, using the "creators"
endpoint within the Media Database API. Figure 6-10 indicates a list of creators
shown within the main menu.
Figure 6-10: List of Creators
6.5
Overlay Blocks
Overlay blocks represent supplemental assets that can be overlaid on top of videos, images, and text. The current version of the Recast UI supports a feature where users can create overlay blocks that contain recordings of their voices.
67
As shown in Figure 6-11, users can speak directly into the browser and record their voices. The microphone and its input stream are accessed through the Web Audio API [73], which is one of the newly proposed standards for browsers to access sound interfaces and conduct basic acoustic signal processing. For saving the sound stream into a standard WAV file blob, the Recast UI makes use of Recorder.js [74]. For visualizing the waveform, it makes use of wavesurfer.js [75]. Apart from
recording voices, users can also create overlay blocks based on previously recorded
assets listed in the menu.
Figure 6-11: Recording Voice
After a voice recording is made, the Recast UI uploads it to the Asset Management
Service and creates an overlay block that contains the audio asset. These overlay
blocks may be added to content blocks, in the same way filter blocks are added.
When content blocks already have existing audio tracks, all of the audio tracks are
simply mixed.
6.6
Asset Management Service
The Asset Management Service is a simple web-based service built on Node.js and Express for indexing image and audio assets. As previously mentioned, the Recast UI has features through which users can import images and voice recordings. The Asset Management Service is responsible for receiving images and voice recordings from the Recast UI and saving them into a directory that is exposed by nginx. It also indexes the assets within MongoDB, and exposes RESTful API endpoints for the Recast UI to upload files, as well as endpoints for retrieving URLs that point to assets.
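From the client's point of view, an upload can be pictured as a single multipart POST; the endpoint path and form field names below are assumptions, not the service's documented API.

import requests

ASSET_SERVICE = "http://localhost:3000/assets"  # hypothetical upload endpoint

# "screenshot.png" stands in for an image captured by the Chrome extension.
with open("screenshot.png", "rb") as image:
    response = requests.post(
        ASSET_SERVICE,
        files={"file": image},
        data={"tag": "ukraine"},  # annotation tag later used to bundle images
    )
print(response.json())  # e.g. a URL pointing at the stored asset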
6.7
Recast EDL
The Recast UI presents a unique way of rendering and sharing final videos that represent news storyboards. Instead of creating and exchanging full video files, it adopts the notion of rendering Recast EDLs locally within each browser. A Recast EDL (edit decision list) refers to a data structure where finalized assets are listed in a sequence, along with the parameters that are required to produce a video. Figure 6-12 indicates an example of a Recast EDL. This section describes how Recast EDLs are composed, rendered, and shared.
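To make the structure concrete, the following is a hedged sketch of an EDL with two entries; the field names follow those visible in Figure 6-12, while the URL and the second entry are illustrative assumptions.

import json

recast_edl = [
    {
        "type": "video",
        "title": "Video About Obama",
        "url": "http://example.org/static/obama-video.mp4",  # illustrative URL
        "startTime": 40.969,
        "endTime": 41.7,
    },
    {
        "type": "text",
        "title": "My Commentary",
        "duration": 3.0,
    },
]

# The EDL itself, rather than a rendered video file, is what gets shared.
print(json.dumps(recast_edl, indent=2))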
6.7.1
Timeline
The timeline is one of the main components of the Recast UI in terms of compiling
personalized EDLs. As shown in Figure 6-13, users may drag content blocks into
the timeline, after bundles are curated, polished, and finalized.
When content blocks are added to the timeline, the Recast UI examines the
assets included within the block, and appends them to the Recast EDL. In the final
video, assets included within the timeline are played in a sequential manner. Users
may remove blocks from the timeline by swiping the block upwards.
6.7.2
Recast Video Player
Compared to traditional methods, the process of rendering Recast EDLs is
unique. Instead of rendering the EDL into a single video stream at the file
system level, the Recast UI handles the rendering at the playback level using the
69
Figure 6-12: Example of a Recast EDL
Figure 6-13: Timeline With Content Blocks
Recast Video Player, which is a Javascript-based video player specifically designed
to handle Recast EDLs.
70
The Recast Video Player iterates over the assets included within the Recast EDL, and dynamically creates HTML5 video elements that are loaded with the original video files associated with the assets. Given the start time and the end time of each asset, only the designated segment of each video file is played. Assets are played sequentially: when the designated segment of one asset reaches its end, playback moves on to the next asset. Non-video assets, such as images and text, are overlaid on blank video screens.
Buzz [76], a library that wraps the HTML5 audio element, plays the voice overlays.
While this method may not be the most efficient method in terms of content distribution, it massively reduces the computational burden on the infrastructure, since
it does not require additional video files to be created and stored.
Figure 6-14 shows a preview screen of the final video. This screen can be invoked from the main menu, after blocks are added to the timeline.
Figure 6-14: Preview of Final Video
71
6.7.3
Publishing Recast EDLs
The features for publishing and sharing final videos are vital, in terms of enabling self-expression. As previously mentioned, the Recast UI adopts a unique
way of rendering Recast EDLs at the playback level. For this reason, final videos
can be easily shared and played by exchanging the lightweight Recast EDL, as
long as the Recast Video Player is utilized.
As a platform for sharing final videos, the Media Matrix [77] was utilized. The Media Matrix, developed mainly by Savannah Niles and Vivian Diep, is a social, subscription-oriented platform that provides a unified browser for visual media.
The Recast EDLs of final videos may be submitted to the Recast track within the
Media Matrix. In the same way users can subscribe to television channels, users
can subscribe to the Recast track for viewing remixed video content created by
other users. Figure 6-15 shows the current prototype of the Media Matrix with the
Recast track.
Figure 6-15: Media Matrix
72
Chapter 7
Evaluation
This chapter describes the evaluation of Recast, which includes an operation
test and a user study.
In this thesis, Recast is evaluated from two different angles. First, an operation
test was conducted for verifying the basic functionalities of the Media Analysis
System. Second, a user study was conducted to verify the usability of the Recast
UI.
7.1
Operation Tests
This section describes the operation test that was conducted to verify the functionalities of the Media Analysis System. As previously mentioned, the main role
of the Media Analysis System is to extract frame-level metadata from video files. In
this regard, the operation test was specifically designed to examine the capabilities
of GLUE.
7.1.1
Method
Prior to the test, 10 video samples with transcripts were manually downloaded
from YouTube. The durations of these videos range from 70 seconds to 5,890 seconds (1 hour and 38 minutes). The details of the video samples are indicated in
73
Appendix A. After preparation, the video samples were uploaded to the Media
Analysis System to conduct the following verifications and observations:
* Verify proper operations throughout the system
* Observe the processing time for modules that require intensive computation
* Observe the overall quality of metadata
As for the test platform, the Media Analysis System was set up on a virtual machine running Ubuntu 12.04 LTS. The specifications of the virtual machine include
18 virtual CPU cores and 64 GB of RAM.
7.1.2
Results
In terms of checking proper operations throughout the system, the test verified
that the videos were properly received by the user upload receiver, and successfully
passed on to GLUE. All the analysis modules within GLUE were also successful in
processing all the videos. Finally, the test verified that the metadata was properly
stored and indexed on the Media Database as expected.
Figure 7-1 shows graphs that indicate the relationship between the video duration and processing time for the four modules that require intensive processing.
Given the fact that all modules run in parallel, the data reveals that it takes approximately 5 to 6 minutes for GLUE to complete the analysis of a 30-minute video.
In general, all of the graphs indicate signs of relatively strong linear correlation.
Therefore, the processing time is roughly proportional to the video length. At the same time, these processing times suggest that there is room for further optimization to increase performance.
In terms of quality, there was a considerable amount of noise, and false positives were identified in the results from the transcript module and the emotion module. For instance, quite a few of the named entities extracted from the transcript were not relevant to the context of the video. This also applies to the social tags. Natural language
processing is an active domain of research that has many unsolved questions. Although GLUE currently relies on AlchemyAPI and OpenCalais for extracting topic
74
Figure 7-1: Processing Times of Analysis Modules in GLUE (processing time plotted against video duration for the Scene, Face, Emotion, and Thumbnail modules, each with a linear fit; R² values range from approximately 0.69 to 0.82)
keywords, these results suggest that further exploration of the field is necessary to
fully understand the context of videos from transcripts. As for the emotion module, the emotional states of most speech segments were misclassified. This is primarily due to the fact that the original training dataset is not optimized for video content spoken in American English. Background noise is another factor that negatively affects the accuracy. Feasible solutions for increasing the accuracy of the emotion module include reconstructing the training dataset and implementing sophisticated background noise reduction schemes.
7.1.3
Considerations
Overall, the current version of the Media Analysis System is capable of retrieving content, extracting metadata, and indexing assets to a certain extent. This can
75
be considered an initial prototype that serves as a proof of concept. However, there are many ways the system can be improved, both in terms of adding
new features and optimizing existing ones.
7.2
User Studies
This section describes the user study that was conducted to verify the usability
of the Recast UI. The study was designed to specifically quantify the user experience of creating personal news commentaries.
7.2.1
Method
In the study, four users, aged between 20 and 25, were first asked to answer
questions regarding their background in video production and editing. After revealing their prior experiences, users were given a task to create short videos that
represent news commentaries. In order to narrow the scope of the study, the following requirements were also given:
* Commentaries must cover something about the recent developments in Ukraine
* Commentaries must be short, preferably around 30 seconds
* Commentaries must include at least one video clip, one image, one voice overlay, and one text overlay
To define a baseline, each user was first asked to conduct the task on iMovie, which is a well-known consumer-grade video editing tool. Before starting the task, users were seeded with five video files. These videos contained
television programs aired on April 23rd, 2014, which covered some news about
Ukraine. They were also introduced to a tool called youtube-dl [78], which can be
used for downloading YouTube videos.
After users completed the task using iMovie, they were asked to conduct the
task again using Recast. Before starting the task, users were asked to install the
76
Chrome extension for importing browser screen shots into Recast. They were also given a short tutorial of the Recast UI.
Both trials were timed and recorded per user. Users were also asked to rate
five aspects of their experience on a scale of 1 to 5, after completion of each
trial. After everything was done, there was a free discussion period where users
were given a chance to mention specific opinions and give feedback. Appendix B
lists the survey document that indicates the instructions and questions regarding
this study.
7.2.2
Results
In terms of prior experiences in video production, there was a good mix amongst
the users. One of the users was an expert, who had the skills to add complex effects and cuts. On the other hand, one of the users was a complete beginner who had no prior experience.
Figure 7-2 shows the range and average of the task completion time in each of the two environments. The average time it took users to complete the task on iMovie was 15 minutes and 47 seconds (947 seconds), whereas the average time to complete the same task on Recast was 5 minutes and 18 seconds (318 seconds). Based on these results, Recast reduced the average task completion time by 66.4 percent ((947 - 318) / 947 ≈ 0.664).
Figure 7-3 shows the average user ratings in each of the two environments. Within the scope of creating a news commentary that
contains video clips, images, text overlays, and voice overlays, Recast was rated
to be far easier and more intuitive than iMovie.
As for personal comments and feedback regarding iMovie, many of the users
pointed out that the user interface of iMovie is too cumbersome and complex for
conducting simple tasks. For instance, one user mentioned that iMovie does not
have an intuitive workflow of importing assets into the workspace. Others pointed
out that there was no easy way of quickly finding the relevant video segments from the corpus.
Figure 7-2: Comparison of the Task Completion Time (elapsed time in minutes; iMovie average 15:47, Recast average 5:18)
Figure 7-3: Comparison of User Ratings (average ratings for iMovie and Recast on Q1: Was the process of inserting video segments easy and intuitive? Q2: Was the process of inserting images easy and intuitive? Q3: Was the process of applying voice overlays easy and intuitive? Q4: Was the process of applying text overlays easy and intuitive? Q5: How would you rate the overall usability of the entire experience?)
However, as a general comment, many users pointed out that iMovie lets users control the assets at a granular level, and provides features that let users create polished, high-quality video content.
In terms of comments and feedback regarding Recast, many of the users mentioned that the block-based querying scheme in Recast demonstrates an extremely
simple way of automatically finding and retrieving relevant assets. However, as a
major downside, many users stated that the current version of Recast lacks the
ability to adjust video segments at a granular level. To be more specific, one user pointed out that Recast should have a slider for manually adjusting the start time and the end time of a video segment, which would allow users to manually correct
78
minor mistakes made by the Media Analysis System.
7.2.3
Considerations
Based on the user studies, it was determined that Recast presents an intuitive
and easy way of constructing video storyboards. However, it was also determined
that traditional video production schemes are still required for creating high-quality
output, and cannot be entirely replaced by other tools. Apart from high-level
discussions on the concept of user interaction, this user study was extremely useful
in terms of determining next steps for improving the interface and user experience.
79
80
Chapter 8
Conclusion
This chapter states the conclusion of this thesis, as well as potential milestones
for building on its achievements.
8.1
Overall Achievements
In this thesis, a platform called Recast was proposed, designed, and implemented. The goal of Recast was to provide an intuitive interface for proactive
citizens to easily create remixed video storyboards that represent the views of the
world from their perspective. In terms of addressing the complexity of video production and editing, Recast features the following frameworks:
* A system for collectively gathering and indexing visual media content
* A block-based visual programming language for producing remixed video
storyboards in a semi-automated manner
As a result of operation tests and user studies, it was verified that the frameworks included within Recast are effective in terms of achieving its initial goals.
However, it was also revealed that there are a number of ways Recast can be improved and optimized, both in terms of system performance and user interactions.
81
8.2
Future Work
This section describes some examples of future milestones and action items
for improving Recast.
8.2.1
Improvement of Metadata Extraction
As previously stated, the current version of the Media Analysis System can do
its job to a certain extent. However, in order to reduce the amount of noise and
false positives, several approaches can be considered. For instance, refining the training datasets and tuning the algorithms for natural language processing
and emotion extraction would improve the quality. Another means of enhancing
the metadata is to add facial identification features. Apart from these, additional
tests and evaluations of the system are necessary in terms of determining other
weaknesses in the system and finding solutions to address them.
8.2.2
Enhancement of Recast UI
As previously noted, many suggestions for improving the Recast UI were brought up during the user study. For instance, adding features
for enabling manual adjustments of video segments is one possibility. Also, adding
simple features for allowing users to reorder the blocks within the timeline is another feasible enhancement. Apart from iteratively improving the Recast UI based
on user comments and suggestions, further user studies must be conducted to
evaluate its user experience at a deeper level.
8.2.3
Large-scale Deployment
Apart from improving the Recast UI iteratively, large-scale deployments are worth considering in order to test the social impact of Recast with the general public. However, the assets automatically retrieved by the Media Analysis
System cannot be used outside of the MIT Media Lab due to copyright policies.
82
Therefore, new strategies for overcoming copyright issues must be considered before conducting large-scale deployments and field tests.
83
84
Appendix A
List of Video Samples
This appendix indicates the video samples that were used for the operation
test. These 10 videos with transcripts were randomly selected from YouTube, and
manually uploaded to the Media Analysis System.
1. Introducing TPG Maximizer
   * URL: https://www.youtube.com/watch?v=WSXdSKIAn9o
   * Duration: 0:01:10 (70 Seconds)
2. How to See Without Glasses
   * URL: https://www.youtube.com/watch?v=OydqR_7_Djl
   * Duration: 0:03:11 (191 Seconds)
3. A Funny Montage
   * URL: https://www.youtube.com/watch?v=gRyPjRrjS34
   * Duration: 0:10:53 (653 Seconds)
4. 402 LECTURE SAMBA FUNDAMENTALS ALAN & HAZEL FLETCHER
   * URL: https://www.youtube.com/watch?v=h9wBYGN699E
   * Duration: 0:17:47 (1066 Seconds)
5. Maintaining the Principles | Edita Daniute
   * URL: https://www.youtube.com/watch?v=dhPH5DoDUVo
   * Duration: 0:19:16 (1156 Seconds)
6. Dancing in Jaffa: A Film and Discussion
   * URL: https://www.youtube.com/watch?v=pOLeDqIA9Jo
   * Duration: 0:25:09 (1509 Seconds)
7. Unifying the Inflationary & Quantum Multiverses (Max Tegmark)
   * URL: https://www.youtube.com/watch?v=PCOzHlf2Gkw
   * Duration: 1:00:03 (3603 Seconds)
8. ER=EPR | Leonard Susskind
   * URL: https://www.youtube.com/watch?v=jZDt-j3wZQ
   * Duration: 1:15:01 (4501 Seconds)
9. Leonard Susskind Lecture 1 Topics in String Theory Stanford University Continuing Studies Program
   * URL: https://www.youtube.com/watch?v=jn4gXOWxAPO
   * Duration: 1:34:28 (5668 Seconds)
10. David Gross: The Coming Revolutions in Theoretical Physics
    * URL: https://www.youtube.com/watch?v=AM7SnUlwDU
    * Duration: 1:38:10 (5890 Seconds)
86
Appendix B
User Study Handout
This appendix lists the survey document that was handed out to the participants
of the user study. It indicates the instructions and questions regarding the study.
87
Recast - User Study

Please indicate the best matching choice.

* User Background

1. Have you ever expressed yourself through visual media?
   No / Yes
   If yes, please list a few examples: (ex: created a documentary video)

2. Have you produced and/or edited video content on your own?
   No / Yes
   If yes, please list a few tools you have used: (ex: iMovie)

3. Would you consider yourself an expert in video production and editing?
   1 - No    2 - Less Likely    3 - Neutral    4 - More Likely    5 - Yes
* Creating a News Commentary on iMovie
Your first task is to create a 30-second news commentary video regarding the situation in Ukraine using iMovie. Upon creating a commentary, please make sure you include at least one video clip, one image, one voice overlay, and one text overlay. For the video content, you can always find your own, but you can also use the preloaded assets. You can take your time if necessary, but note that there is no need to polish the quality.
The preloaded assets include 5 video clips that were recorded on April 23rd, 2014. All of these videos are morning news programs that reported something about Ukraine.

> BBC World News
  * Channel: BBC America / DirecTV - 264
  * Broadcast: April 23rd, 2014, 7:00 am - 8:00 am
> CBS This Morning
  * Channel: CBS (WBZ) / DirecTV - 005
  * Broadcast: April 23rd, 2014, 7:00 am - 8:00 am
> Fox 25 Morning News
  * Channel: FOX (WFXT) / DirecTV - 025
  * Broadcast: April 23rd, 2014, 7:00 am - 8:00 am
> Good Morning America
  * Channel: ABC (WCVB) / DirecTV - 396
  * Broadcast: April 23rd, 2014, 7:00 am - 8:00 am
> Today
  * Channel: NBC (WHDH) / DirecTV - 007
  * Broadcast: April 23rd, 2014, 7:00 am - 8:00 am
1. Was the process of inserting video segments easy and intuitive?
   1 - No    2 - Less Likely    3 - Neutral    4 - More Likely    5 - Yes

2. Was the process of inserting images easy and intuitive?
   1 - No    2 - Less Likely    3 - Neutral    4 - More Likely    5 - Yes

3. Was the process of applying voice overlays easy and intuitive?
   1 - No    2 - Less Likely    3 - Neutral    4 - More Likely    5 - Yes

4. Was the process of applying text overlays easy and intuitive?
   1 - No    2 - Less Likely    3 - Neutral    4 - More Likely    5 - Yes

5. How would you rate the overall usability of the entire experience?
   1 - Frustrating    2 - Bad    3 - Neutral    4 - Good    5 - Excellent

6. Please indicate general comments, if any:
* Creating a News Commentary on Recast

Your next task is to create a 30-second news commentary video regarding the situation in Ukraine using Recast. Upon creating a commentary, please make sure you include
at least one video clip, one image, one voice overlay, and one text overlay.
1. Was the process of inserting video segments easy and intuitive?
   1 - No    2 - Less Likely    3 - Neutral    4 - More Likely    5 - Yes

2. Was the process of inserting images easy and intuitive?
   1 - No    2 - Less Likely    3 - Neutral    4 - More Likely    5 - Yes

3. Was the process of applying voice overlays easy and intuitive?
   1 - No    2 - Less Likely    3 - Neutral    4 - More Likely    5 - Yes

4. Was the process of applying text overlays easy and intuitive?
   1 - No    2 - Less Likely    3 - Neutral    4 - More Likely    5 - Yes

5. How would you rate the overall usability of the entire experience?
   1 - Frustrating    2 - Bad    3 - Neutral    4 - Good    5 - Excellent

6. Please indicate general comments, if any:
92
Bibliography
[1] YouTube. http://www.youtube.com.
[2] Vine. https://vine.co.
[3] F. Coppa. Women, "Star Trek," and the early development of fannish vidding.
Transformative Works and Cultures, 1, 2008.
[4] Political Remix Video. http://www.politicalremixvideo.com.
[5] Moms & Tiaras. https://www.youtube.com/watch?v=DdoszN4Yeio.
[6] Supercut.org. http://www.supercut.org.
[7] My Little Pony: Friendship is Magic in a Nutshell. https://www.youtube.com/watch?v=yEypIV12Dew.
[8] Ken Dancyger. The Technique of Film and Video Editing: History, Theory, and
Practice. Focal Press, 2011.
[9] John Purcell. Dialogue Editing for Motion Pictures: A Guide to the Invisible
Art. Focal Press, 2007.
[10] iMovie. http://www.apple.com/mac/imovie.
[11] Popcorn Maker. http://popcorn.webmaker.org.
[12] W. Zhu, C. Toklu, and S. P. Liou. Automatic news video segmentation and categorization based on closed-captioned text. In IEEE International Conference
on Multimedia and Expo (ICME 2001), pages 829--832, 2001.
[13] R. Hemsley, A. Ducao, E. Toledano, and H. Holtzman. ContextController:
Augmenting broadcast TV with realtime contextual information. In IEEE
10th Consumer Communications and Networking Conference (CCNC 2013),
pages 833--836, 2013.
[14] Beamly. http://beamly.com.
[15] Boxfish. http://boxfish.com.
93
[16] T. Lin and H. J. Zhang. Automatic video scene extraction by shot grouping.
In IEEE 15th International Conference on Pattern Recognition (ICPR 2000),
volume 4, pages 39--42. IEEE, 2000.
[17] M. Xu, N. C. Maddage, C. Xu, M. Kankanhalli, and Q. Tian. Creating audio
keywords for event detection in soccer video. In IEEE International Conference on Multimedia and Expo (ICME 2003), volume 2, pages 281--284, 2003.
[18] X. Bao and R. R. Choudhury. MoVi: mobile phone based video highlights
via collaborative sensing. In ACM 8th International Conference on Mobile
Systems, Applications, and Services (MobiSys 2010), pages 357--370, 2010.
[19] Y. F. Ma, L. Lu, H. J. Zhang, and M. Li. A user attention model for video
summarization. In ACM 10th International Conference on Multimedia (MULTIMEDIA 2002), pages 533--542, 2002.
[20] V. M. Bove. Personalcasting: Interactive local augmentation of television programming. Master's thesis, Massachusetts Institute of Technology. Department of Architecture., 1985.
[21] H.D. Wactlar, T. Kanade, M.A. Smith, and S.M. Stevens. Intelligent access to
digital video: Informedia project. Computer, 29(5):46--52, 1996.
[22] B. Shahraray and D. C. Gibbon. Efficient archiving and content-based retrieval
of video information on the web. In AAAI Symposium on Intelligent Integration and Use of Text, Image, Video, and Audio Corpora, pages 133--136,
1997.
[23] R.S. Jasinschi, N. Dimitrova, T. McGee, L. Agnihotri, J. Zimmerman, and D. Li.
Integrated multimedia processing for topic segmentation and classification.
In IEEE International Conference on Image Processing (ICIP 2001), volume 3,
pages 366--369, 2001.
[24] M. E. Davis. Media streams: representing video for retrieval and repurposing. Master's thesis, Massachusetts Institute of Technology. Department of
Architecture. Program in Media Arts and Sciences., 1995.
[25] Z. Liao, J. Yang, C. Fu, and G. Zhang. CLUENET: Enabling automatic video
aggregation in social media networks. In Advances in Multimedia Modeling, volume 6524 of Lecture Notes in Computer Science, pages 274--284.
Springer Berlin Heidelberg, 2011.
[26] Media Cloud. http://www.mediacloud.org.
[27] A. Lippman and W. Bender. News and movies in the 50 megabit living room.
In IEEE/IECE Global Telecommunications Conference (GLOBECOM 1987),
volume 3, pages 1976--1981. IEEE/IECE, 1987.
[28] Flipboard. https://flipboard.com.
94
[29] J. Rogstadius, M. Vukovic, C. A. Teixeira, V. Kostakos, E. Karapanos, and J. A.
Laredo. CrisisTracker: Crowdsourced social media curation for disaster awareness. IBM Journal of Research and Development, 57(5):4:1--4:13, 2013.
[30] Netflix. https://netflix.com.
[31] Ustream. http://www.ustream.tv.
[32] RealPlayer. http://www.real.com.
[33] E. D. Navara, R. Berjon, T. Leithead, E. O'Connor, S. Pfeiffer, and S. Faulkner.
HTML5. Candidate recommendation, W3C, 2014.
[34] R. P. Pantos and W. M. May. HTTP Live Streaming. Internet-Draft draft-pantos-http-live-streaming-13, Internet Engineering Task Force, 2014. Work
in progress.
[35] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.
The
PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.
[36] TV News Archive. https://archive.org/details/tv.
[37] M. Resnick, J. Maloney, A. Monroy-Hernandez, N. Rusk, E. Eastmond, and
K. Brennan. Scratch: programming for all. Communications of the ACM,
52(11):60--67, 2009.
[38] A. Bendale, K. Chiu, K. Marwah, and R. Raskar. VisionBlocks: A social computer
vision framework. In IEEE 3rd International Conference on Social Computing
(SocialCom 2011), pages 521--526, 2011.
[39] Google Chrome. https://www.google.com/chrome.
[40] nginx. http://nginx.org.
[41] DirecTV. http://www.directv.com.
[42] Tribune Media Services. http://tribunemediaservices.com.
[43] FFmpeg. http://www.ffmpeg.org.
[44] ITU-T. Advanced video coding for generic audiovisual services. Recommendation H.264, International Telecommunication Union, 2014.
[45] ISO. Information technology -- Generic coding of moving pictures and associated audio information -- Part 7: Advanced Audio Coding (AAC). ISO
13818-7:2006, International Organization for Standardization, 2006.
[46] SubRip. http://zuggy.wz.cz.
[47] Scrapy. http://scrapy.org.
95
[48] RabbitMQ. https://www.rabbitmq.com.
[49] Node.js. http://nodejs.org.
[50] Express. http://expressjs.com.
[51] Twisted. https://twistedmatrix.com.
[52] PyMongo. http://api.mongodb.org/python/current.
[53] AlchemyAPI. http://www.alchemyapi.com.
[54] OpenCalais. http://www.opencalais.com.
[55] OpenCV. http://opencv.org.
[56] R. Lienhart and J. Maydt. An extended set of Haar-like features for rapid
object detection. In IEEE International Conference on Image Processing (ICIP
2002), volume 1, pages 900--903, 2002.
[57] SoX. http://sox.sourceforge.net.
[58] F. Eyben, M. Wollmer, and B. Schuller. OpenEAR---introducing the Munich
open-source emotion and affect recognition toolkit. In IEEE 3rd International
Conference on Affective Computing and Intelligent Interaction and Work-
shops (ACII 2009), pages 1--6, 2009.
[59] MongoDB. https://www.mongodb.org.
[60] Apache Solr. https://lucene.apache.org/solr.
[61] Tornado. http://www.tornadoweb.org.
[62] Socket.IO. http://socket.io.
[63] I. Fette and A. Melnikov. The WebSocket Protocol. Request for Comments
RFC 6455, Internet Engineering Task Force, 2011.
[64] Caress. http://caressjs.com.
[65] jQuery. http://jquery.com.
[66] jQuery UI. http://jqueryui.com.
[67] jQuery UI Touch Punch. http://touchpunch.furf.com.
[68] toe.js. https://github.com/visiongeist/toe.js.
[69] Isotope. http://isotope.metafizzy.co.
96
[70] J. Speiser. Worldlens: Exploring world events through media. Master's thesis,
Massachusetts Institute of Technology. Department of Architecture. Program
in Media Arts and Sciences., 2014.
[71] Video.js. http://www.videojs.com.
[72] coverflowjs. http://coverflowjs.github.io/coverflow.
[73] P. Adenot, C. Wilson, and C. Rogers. Web Audio API. Working draft, W3C,
2013.
[74] Recorder.js. https://github.com/mattdiamond/Recorderjs.
[75] wavesurfer.js. http://www.wavesurfer.fm.
[76] Buzz. http://buzz.jaysalvat.com.
[77] Media Matrix. http://viral.media.mit.edu/projects/mediamatrix.
[78] youtube-dl. http://rg3.github.io/youtube-dl.
97