1 CWI: Centrum Wiskunde & Informatica
Science Park 123, 1098 XG
Amsterdam, The Netherlands
+31 20 592 9333
2 Department of Computing
Goldsmiths, University of London
London, SE14 6NW, UK
+44 (0)20 7078 5152
3 BT Research & Technology
Adastral Park, IP5 3RE
Ipswich, UK
+44 (0)1473 642916
The wide availability of relatively high-quality cameras makes it easy for many users to capture video fragments of social events such as concerts, sports events or community gatherings. The wide availability of simple sharing tools makes it nearly as easy to upload individual fragments to on-line video sites. Current work on video mashups focuses on the creation of a video summary based on the characteristics of individual media fragments, but it fails to address the interpersonal relationships and time-variant social context within a community of diverse (but related) users.
The aim of this paper is to reformulate the research problem of video authoring by investigating the social relationships of the media ‘authors’ relative to the performers. Based on a 10-month evaluation process, we specify a set of guidelines for the design and implementation of socially-aware video editing and sharing tools. Our contributions have been realized and evaluated in a software prototype that enables community-based users to navigate through a large common content space and to generate highly personalized video compilations of targeted interest within a social circle. According to the results, a system like ours provides a valid alternative for social interaction when people are apart. We hope that our insights can stimulate future research on socially-aware multimedia tools and applications.
The place: the Exhibition Hall in Prague. The date: August 23,
2009. Radiohead is about to start their concert. The band invites fans to capture personal videos, distributing 50 Flip cameras. The cameras are then collected, and the videos are post-processed along with Radiohead’s audio masters. The resulting DVD captures the concert from the viewpoint of the fans, making it more immersive and proximal than typical concert productions
(see: radiohead-prague.nataly.fr).
The concert of Radiohead typifies a shift in the way music concerts – and other social events – are being captured, edited, and remembered. In the past, professionals created a full-featured video, often structured according to a generic and anonymous narrative. Today, advances in non-professional devices are making each attendee a potential cameraperson who can easily upload personalized material to the Web, mostly as collections of raw-cut or semi-edited fragments. From the multimedia research perspective, this shift makes us reflect and reconsider traditional models for content analysis, authoring, and sharing.
H.1.2 [Models and Principles]: User/Machine Systems – Human factors. H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems – Audio, Video.
Design, Experimentation, Human Factors, Management.
Web-Mediated Communication, Privacy, Video Authoring, Video
Sharing and Information Management.
The existence of many content fragments of common social events has resulted in innovative work on individual fragment retrieval. These include tools for the automatic extraction of
instrumental solos and applause sections [11], and crowd-sourcing
techniques that use the direct or indirect behavior of participants
for improving retrieval results [16]. Once a collection of candidate
fragments has been identified, other researchers have explored
community video remixing [14] and automatic algorithmic
solutions for synchronizing and organizing user-generated content
(UGC) from popular music events [8][15]. All these solutions
have focused on various properties of the video/audio content.
Nevertheless, none of them has considered the social relationships between performers, users, and recipients of the media. They thus missed an opportunity for creating videos exploiting the social relationships between people participating in the same event.
*Area Chair: Hari Sundaram.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MM’11 , November 28–December 1, 2011, Scottsdale, Arizona, USA.
Copyright 2011 ACM 978-1-4503-0616-4/11/11...$10.00.
This paper considers the case in which performers and the audience belong to the same social circle (e.g., parents, siblings and classmates at a typical school concert). Each participating member of the audience records content for personal use, but they also capture content of potential group interest. This content may be interesting to the group for several reasons: it may break the monotony of a single camera viewpoint, it may provide alternative
(and better) content for events of interest during the concert
(solos, introductions, bloopers), or it may provide additional views of events that were not captured by a person’s own camera.
It is important to understand that the decision to use substitute or additional content will be made in the particular context of each user separately: the father of the trombone player is not necessarily interested in the content made by the mother of the bass player unless that content is directly relevant for the father’s needs. Put another way, by integrating knowledge of the structure of the social relationships within the group, content classification can be improved, and content searching and selection by individual users can be made more effective.

Table 1. Handling Multi-Camera Recordings of Concerts.
- “Professional” DVD: scripted, professional capture; human director (planning); anonymous; complete coverage.
- UGC mashup: ad-hoc capture; similar likings, idols; video search, video analysis; complete coverage.
- Socially-aware community: ad-hoc capture; family & friends; video analysis, video authoring; memories, bonds.
In order to understand the role of the social network among group members in a multi-camera setting, consider the comparison presented in Table 1. This table compares the use of multi-camera content in three situations: by a (professional) video crew creating an archival production, by a collection of anonymous users contributing to a conventional user-generated content mashup, and within a defined social circle as input for differentiated personal videos. (Semi-)Professional DVD-style productions often follow a well-defined narrative model implemented by a human director, and are created to capture the essence of a total event. Anonymous
UGC mashups are created from ad-hoc content collections, often based on the primary and secondary content classification methods discussed above. In socially-aware communities, friends and family members capture, edit and share videos of small-scale social events with the main purpose of creating personal (and not group) memories. Interested readers can view a video that illustrates the general concept of personalized community videos 1, and an example of a personal video generated by our system 2.
One of the major differentiators of our work is that its primary purpose is not the creation of an appealing video summary of the event or the creation of a collective collaborative community work. Instead, our system is intended to be a resource that facilitates the reuse of collective content for individual needs. Rather than using personal fragments to strengthen a common group video, our work takes group fragments to increase the value of a personal video. Each of the videos created by our system is tailored to the needs of particular members of the group – the video created for the father of the trombone player will be different from the one for the mother of the bass player, even though they may share some common content.
This paper considers the following three research questions in the context of a video editing system for school concerts:
1. Can a socially-aware group editing system be defined in terms of existing social science theories and human-centered processes, and if so, which?
2. Does the functionality provided by a socially-aware editing system provide an identifiable improvement over other approaches for creating personalized videos?
3. Can these improvements be validated in end-user studies?

1 http://www.youtube.com/user/TA2Project#p/u/6/re-uEyHszgM
2 http://www.youtube.com/user/TA2Project#p/u/4/Ho1p_zcipyA
Our work has concentrated on the modeling of parents of students participating in a high school concert. In this scenario, parents capture recordings of their children for later viewing and possible sharing with friends and relatives. Working with a test group at a local high school, we investigate how focused content can be extracted from a broad repository, and how content can be enhanced and tailored to form the basis of a micro-personal multimedia message packet that can eventually be transferred and shared with family and friends (each with different degrees of connectedness with the performer and his/her parents). Results from a ten-month evaluation process provide useful insights into how a socially-aware authoring and sharing system should be designed and architected to help users nurture their close circle. The evaluation process with a core group of parents included preliminary focus groups, the recording of a high school concert, an evaluation of the prototype application, and reflection on the overall process and experience.
This paper is structured as follows. Section 2 discusses related
work with the intention of contextualizing our contributions.
Then, Section 3 identifies key requirements for socially-aware
video authoring and sharing systems. These requirements are gathered from interviews and focus groups, while social theories
motivate our work. Section 4 answers the first research question,
providing guidelines to realize systems that meet those
requirements. Then, Section 5 reports on results and findings
regarding utility and usefulness, thus directly responding to the second and third research questions. Lastly, Section 6 is dedicated to the discussion of open-ended questions and concluding remarks.
A number of research efforts have addressed the use of media for content analysis, semantic annotation, and video mashup generation for concert videos. Naci and Hanjalic [11] report on a video interaction environment for browsing recordings of music concerts, in which the underlying automatic analyzer extracts the instrumental solos and applause sections in the concert videos, as well as the level of excitement along the performances. Fans of a band can be useful for improving content retrieval mechanisms [16]: a video search engine allows user-provided feedback to improve, extend, and share automatically detected concepts in video footage recorded during a rock ’n’ roll festival. Kennedy and Naaman [8] describe a system for
synchronization and organization of user-contributed content from live music events, creating an improved representation of the event that builds on the automatic content matching. While the previous examples provide useful mechanisms for handling multi-camera video recordings, more recent papers focus on particular applications. Martin and Holtzman [10] describe an application that can combine multiple streams for news coverage in a cross-device and context-aware manner. Shrestha et al. [15] report on an application for creating mashup videos from YouTube 3 recordings of concerts. They present a number of content management mechanisms (e.g., temporal alignment and content quality assessment) that are then used for creating a multi-camera video. Results in that paper indicate that the final videos are perceived almost as if created by a professional video editor.

3 The services and technologies mentioned in this paper, if unknown, can easily be identified via a simple Internet search; therefore they will not be Web-referenced.
While a system designed following our guidelines applies many of the technologies and mechanisms presented in those papers – content alignment tools, automatic semantic annotation – we extend the problem space to small-scale social events, in which the social bonds between people play a major role: performers and the people capturing the event are socially related, and the recordings will later act as common memories. Our work complements past research by providing an experimental methodology to design tools that allow people to remember and share video experiences with the aim of fostering strong ties.
A further differentiator of our work lies in the fact that we do not aim at providing a complete description of the shared event, but rather a better understanding of how community media can serve individual needs. Closely related to our work are topics such as home video management and navigation [1], and photo album creation [12]. However, we go one step further by allowing users to navigate, annotate, and utilize multi-camera recordings: that is, material contributed by other participants of a shared event.
Recognizing the importance of looking at common media assets as a cornerstone for the sharing of family experiences [19], the guidelines introduced in this paper provide the basis for systems aiming to help smaller groups of people (such as a family, a school class or a sporting club) view, create, and share personalized videos. From the technical perspective, a system should combine the benefits of personal focus – knowing whom you are talking with – within the context of temporally asynchronous and spatially separated social meeting spaces. The motivation of our work is rooted in the inherent need of people to socialize and to nurture relationships. One important aspect of our work is that we decided to follow an interdisciplinary approach in which both technological and social issues were addressed. At the core of this approach was establishing a long-term relationship with a group of parents within the target school as a basis for requirements analysis, concept evaluation and system validation.
The experimental methodology presented in this paper is based on two social science theories: the strength of interpersonal ties and social connectedness. Further argumentation about the social theories guiding our work can be found in [7].
Based on a number of interviews (see Section 3.2), we can
conclude that current social media approaches are not adequate for a family or a small social group for storing and sharing collections of media that is personal and important to them. Our
assumption is validated by other studies [5], which concluded that
social media applications like Facebook do not take into account
the interpersonal tie strength of the users. Granovetter [6] defines
interpersonal ties as:
“…a combination of the amount of time, the emotional intensity, the intimacy (mutual confiding), and the reciprocal services which characterize the tie.”
If we were to design a video sharing system intended for family and friends, we would exploit the social bonds between people by taking into account their personal relationships (intimacy). The system would provide mechanisms for personalizing the resulting videos (adding personal intensity) with some effort (amount of time), and would allow the recipient to acknowledge and reply to the creator (reciprocity).

Figure 1. A schematic view of the perceived strength of a social bond over time in relation to our scenario.
Social connectedness theory helps us to understand how social bonds are developed over time, and how existing relationships are maintained. Social connectedness happens when one person is
aware of another person, and confirms his awareness [13]. Such
awareness between people can be natural and intense or
lightweight. As reported elsewhere, even photos [19] and sounds
[3] can be good vehicles for creating the feeling of connectedness.
Figure 1 illustrates a schematic view of the perceived strength of a
social bond over time, showing recurring shared events
(“interaction rituals” in the Durkheim sense [4]), with a fading
strength of the social bond in between. The peaks in the figure correspond to intense and natural shared moments, when people participate in a joint activity (e.g., a music concert) re-affirming their relationships and extending their common pool of shared memories. The smaller peaks correspond to social connectedness actions, such as sending a text message or sharing a personalized video of the shared event, that build on such shared memories. If we were to follow the social connectedness theory, we would design a system that mediates the smaller peaks and thus helps in fostering relationships over time.
In order to better understand the problem space, we involved a representative group of users at the beginning of the evaluation process. The first evaluation, in 2008, consisted of interviews with sixteen families across four countries (UK, Sweden, Netherlands, and Germany). The second evaluation, in-depth focus groups – with three parents each – was run in the summer of 2009 in the UK and in December 2009 in the Netherlands. Further information about these two preliminary user-centered evaluations can be found elsewhere [20]. Since our requirements for socially-aware video editing and sharing systems are partially derived from the results, we summarize here the key findings about current practices and challenges in media sharing.
As social connectedness theory suggests, many participants engaged in varied forms of media sharing, as they felt that reliving memories and sharing experiences helps bring them (and other households) together. Parents e-mailed pictures of the kids playing football to the grandparents, and shared holiday pictures via Picasa, on disk, or using Facebook, enabling friends and families to stay in touch with each other’s lives. Nevertheless, the interviewed people said that if they shared media, they would do so via communication methods they perceived as private, and then only with trusted contacts. There was a general reticence among the parents towards existing social networking sites. From the focus groups it became clear that current video sharing models do not fit the needs of family and friends; much richer systems are needed, and such systems could become an essential part of how families maintain relationships.
A number of parents reported photography as a hobby and would routinely edit their shared images. Their children, on the other hand, even if interested in photography, seemed less keen to manually edit the pictures, and declared a strong preference for automatic edits or relied on their parents. The participants would then discuss the incidents relating to the pictures later on with friends and family, on the phone or at the next reunion. Home videos tended to be watched far less frequently, although the young pre-teen participants appreciated them and were described by their parents as having “worn the tape[s] down” from constant viewing when much younger.
In general the participants’ responses converged to:
- A willingness to engage in varied forms of recollection through recorded videos;
- A clear requirement for systems that could be trusted as ensuring privacy;
- A positive reaction to the suggestion of automatic and intelligent mechanisms for managing home videos.
In each case, creating personalized presentations (tailored for family use) remained a core issue.
The general user needs, social theories and initial interviews with focus groups have led to a number of requirements for community support for multi-camera recordings of shared events:
- Connectedness: the system should provide tools and mechanisms for maintaining relationships over time. The goal is not so much to support the high-intensity moments – the event – but the small peaks of awareness (recollection of the event);
- Privacy and control: current video sharing models do not fit the needs of family and friends. New systems should address their concerns, and target social community-based use cases;
- Effortlessness: people are reluctant to invest time in processes they believe could be done automatically. Future systems should include automatic processes for analyzing, annotating, and creating videos of the shared event;
- Effort, intimacy and reciprocity: while such automatic processes lower the burden for the user, they do not conform to existing social theories. Since we do not want to limit the joy of handcrafting videos for others, systems should offer optional manual interfaces for personalization purposes.
We used these requirements as the basis for specifying the guidelines discussed in the next section.
This section specifies the general guidelines for a socially-aware video authoring and sharing system. Such guidelines have been realized in a system called MyVideos . MyVideos was developed as a collection of configurable processes, each of which allowed us to study one or more aspects of the development of socially-aware video systems. Note that the purpose of this paper is not to describe any one component in detail, but to illustrate the interplay of facilities required to support social media sharing.
The design and architecture are direct results from the evaluation process reported in this paper.
Figure 2. High-level workflow of MyVideos.
The high-level workflow of socially-aware authoring systems is
sketched in Figure 2. The input material includes the video clips
that the parents agreed to upload, together with a master track recorded by the school. All the video clips are stored in a shared video repository that also serves as a media clip browser in which parents, students, and authorized family members can view (and selectively annotate) the videos. Privacy and a protected scope for sharing are key components of the system. Each media item is automatically associated with the person who uploaded it, and there are mechanisms for participants to restrict sharing of certain clips. Participants can use their credentials for navigating the repository – those parts allowed to them – and for creating and sharing different stories intended for different people.
At capture time, there are no specific filming requirements for users. They can record what they wish using their own camera equipment. The goal is to recreate a realistic situation, in which friends and families are recording at a school concert. This flexibility comes at a cost, however, since most existing solutions that work well in analyzing input datasets are not that useful for our use case. As indicated in the related work [15], handling user-generated content is challenging, since it is recorded using a variety of devices (e.g., mobile phones), the quality and lighting are not optimal, and the length of the clips is not standard. As described below, each clip is automatically analyzed and tagged with information on performer, instrument, song, and special events. In practice, this automatic annotation must be fine-tuned manually, either by parents/users or by a school curator.
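To make this concrete, the per-clip metadata could look like the following minimal sketch (the schema and field names are our own illustration, not the system’s actual data model):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClipAnnotation:
    """One uploaded clip and its (fine-tunable) automatic annotations."""
    clip_id: str
    owner: str                      # participant who uploaded the clip
    start: float                    # offset (seconds) on the common concert timeline
    duration: float                 # length of the clip in seconds
    performers: List[str] = field(default_factory=list)   # e.g. ["Julia"]
    instruments: List[str] = field(default_factory=list)  # e.g. ["trombone"]
    song: str = ""                                        # e.g. "Cry Me a River"
    events: List[str] = field(default_factory=list)       # e.g. ["solo", "applause"]
    shared: bool = True             # uploader may withhold a clip from the group
```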
In order to support the social theories described in Section 3.1, MyVideos incorporates a number of automatic, semi-automatic and manual processes that assist in the creation of personal memories of an event. These processes balance convenience and personal effort when making targeted, personalized videos. Emotional intensity is provided thanks to a recommendation algorithm that searches for people and moments that might bring back memories to the user. For mediating intimacy, MyVideos provides users the means to enrich videos for others by including highly personalized content such as audio, video, and textual comments. With these approaches we intend to increase the feeling of connectedness, particularly among family members and friends who could not attend the social event.
Figure 3. Comparison between automatic and manual generation of video compilations. Automatic methods are not sufficient for creating personal and intimate memories.
4.2.1 Reflecting Personal Effort
Users of MyVideos can automatically assemble a story based on a number of parameters such as people to be featured, songs to be included, and duration of the compilation. Such selection triggers a narrative engine that, in less than three minutes, creates an initial video using the multi-camera recordings. The narrative engine selects the most appropriate fragments of videos in the repository based on the user preferences and assembles them following basic narrative constructs.
Given an example query:
SelectedPersons=[Julia, Jan Hans];
SelectedSong=[Cry Me a River];
SelectedDuration=[3 minutes].
The algorithm extracts the chosen song from the master audio track, and uses its structure as the backbone for the narration. It then selects all the video content aligned with the selected audio fragment; the master video track provides a good foundation and a possible fallback for the event fragments that are not well covered by individual recordings. The audio object is the leading layer and, in turn, is made of AudioClips. This structure generates a sequence of all the songs that relate to the query. As soon as the length of the song sequence extends beyond the SelectedDuration, the compilation is terminated. The video object has the role of selecting appropriate video content in sync with the audio. An example of the selection criteria is the following:
1. Select a video clip that is in sync with the audio;
2. Ensure time continuity of the video;
3. If there are multiple potential clips that ensure continuity, select those with Person annotations matching the user choices stored in SelectedPersons;
4. If the result consists of multiple clips, select those whose Instruments annotation matches the instruments that are active in the audio layer;
5. If the result still consists of multiple clips, select those whose Person annotation matches the persons currently playing a solo.
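A minimal sketch of this selection cascade is given below, reusing the hypothetical ClipAnnotation schema introduced earlier; the actual narrative engine [21] is considerably more elaborate:

```python
def pick_clip(candidates, t, prev_clip, selected_persons, active_instruments, soloists):
    """Narrow the candidate clips for instant t using the five criteria in order."""
    # 1. Keep only clips that are in sync, i.e. cover instant t of the master audio.
    pool = [c for c in candidates if c.start <= t < c.start + c.duration]
    # 2. Prefer staying on the previous clip (one reading of time continuity).
    cont = [c for c in pool if prev_clip and c.clip_id == prev_clip.clip_id]
    pool = cont or pool
    # 3. Prefer clips featuring the persons in SelectedPersons.
    featured = [c for c in pool if set(c.performers) & set(selected_persons)]
    pool = featured or pool
    # 4. Prefer clips showing instruments active in the audio layer.
    active = [c for c in pool if set(c.instruments) & set(active_instruments)]
    pool = active or pool
    # 5. Prefer clips featuring the current soloist(s).
    solo = [c for c in pool if set(c.performers) & set(soloists)]
    pool = solo or pool
    return pool[0] if pool else None   # None signals a fallback to the master track
```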
Once the automatic authoring process is complete, a new video compilation is created in which the selected song and people are featured. As reported elsewhere [21], such narrative constructs have been developed and tested together with professional video editors. Our assumption, based on the social theories, was that automatic methods – while useful – were not sufficient for creating personal memories of an event. This assumption is validated by our results (Section 5). Figure 3 shows a comparison between automatic and manual generation of mashups. Automatic techniques are better suited for group needs such as complete coverage or a summary of the event, but are not capable of capturing subtle personal and affective bonds. We argue instead for hybrid solutions, in which manual processes allow users to add their personal view to automatically assembled videos.
MyVideos provides such interfaces for manually fine-tuning video compilations. Users can improve and personalize existing productions by including other video clips from the shared repository. For example, a parent can add more clips in which his daughter is featured for sharing with grandma, or he can instead add a particularly funny moment from the event when creating a version for his brother. Results of the evaluation indicate that users liked such functionality, which automatic processes are not able to provide – it is quite complicated to design an algorithm that detects “funny” events based on personal stories!
4.2.2 Supporting Emotional Intensity
Another assumption guiding the design of MyVideos was that users are particularly interested in finding video clips in which people close to them are featured. Results of the evaluation indicate that emotional searches were the most popular. Our system incorporates a recommender algorithm that is optimized based on the profile of the logged-in user. This optimization takes into account the semantic annotations associated with the user’s media and the subjects that most frequently appear in his recordings.
The core of the navigation interface for visualizing the multi-camera recordings is this optimized recommender algorithm, which returns relevant fragments of video clips based on given parameters. MyVideos implements four different filters for emotional searches: instruments, songs, people, and cameras. The first filter returns video clips in which a particular instrument appears. The second returns those belonging to a specific song of the concert. The third filter queries for the close friends and family members of the logged-in user. And the last one returns video clips taken by a given camera (e.g., the camera of the logged-in user or the camera of his wife). These filters can be combined into complex queries, thus simplifying the searching process. For example, a father can ask for his son and daughter playing “Cry Me a River”, since he remembers this was a very emotive moment of the concert.
Given an example query:
SelectedPersons=[Julia, Jan Hans];
SelectedSong=[Cry Me a River].
The result will be:
QueryPersons(Julia, Jan Hans)
QueryEvents(Cry Me a River)
The query algorithm works as follows:
1. Select fragments of the video clips matching the query; in the case of complex queries, select by intersecting the sets;
2. If the result consists of one fragment, return it;
3. If the result consists of more than one fragment, order the resulting list based on:
- the requested person;
- the annotations associated with the video clips uploaded by the logged-in user (affection parameter);
- the subjects that appear most frequently in the video clips uploaded by the logged-in user (affection parameter).
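The following sketch illustrates one plausible implementation of this filter-and-rank step (again over the hypothetical ClipAnnotation schema; the affection parameters are reduced here to simple frequency counts):

```python
def emotional_search(clips, user, persons=None, song=None, instrument=None, camera=None):
    """Apply the four emotional filters, then order results by affection."""
    results = [c for c in clips
               if (not persons or set(persons) & set(c.performers))
               and (not song or c.song == song)
               and (not instrument or instrument in c.instruments)
               and (not camera or c.owner == camera)]
    if len(results) <= 1:
        return results                          # steps 1 and 2 of the algorithm
    # Affection proxy: how often each subject appears in the user's own uploads.
    freq = {}
    for c in clips:
        if c.owner == user:
            for p in c.performers:
                freq[p] = freq.get(p, 0) + 1
    def affection(c):
        requested = len(set(persons or []) & set(c.performers))
        familiar = sum(freq.get(p, 0) for p in c.performers)
        return (requested, familiar)
    return sorted(results, key=affection, reverse=True)   # step 3: ordering
```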
Figure 4. Elements of an authored video composition: the parent has included an introductory image and a video for making it more personal and intimate.
In addition to the query interface that allows users to find moments of the videos that they particularly remember, MyVideos offers optional manual interfaces for improving semantic annotations. When users are searching for specific memories, it might happen that results are not accurate due to mistakes in the annotations. Our system allows users to correct such annotations while previewing individual clips. For example, they can change / add / remove the name of the performer and the title of the song.
Moreover, they can add extra annotations (e.g., solo). According to our results, such functionality was appreciated, even though some participants were reluctant to make changes they felt the system should have made correctly in the first place.
4.2.3 Supporting Intimacy and Enabling Reciprocity
Apart from allowing fine-tuning of assembled videos, our system enables users to perform enrichments. MyVideos provides mechanisms for including personal audio, video, and textual commentaries. For example, these could be recordings of oneself (the author of the production) or subtitles aligned with the video clips, commenting on the event for others. Users can also record an introductory audio or video, leading to more personalized stories.
Figure 4 illustrates some elements of an authored video, where a
parent has created his own version of a concert. Moreover, he has added some personal assets, such as an introductory image and a video recording of himself that acts as the message envelope.
According to our results, this functionality (we call it “Capture
Me”) would be used by all but one of our participants.
Our application addresses reciprocity by enabling life-long editing and enriching of compiled videos. As indicated before, videos created with our tool can be manually improved and enriched using other assets from the repository, adding personal video and audio recordings, and including subtitles or textual comments.
This is possible because of the underlying mechanism used for creating compilations. A high-level description language, in our case SMIL, allows us to temporally describe the contents of the compilation. The actual video clips are included by reference, and never modified or re-encoded. In other words, we do not burn a new compilation, but just describe it in a SMIL file that is then shared with other people. As shown in a previous publication [2], there are benefits to such an approach. In the context of this paper, the main benefit is that it enables people to comment on, enrich, and improve existing compilations. Thus, users receiving an assembled video can easily include further timed comments as a reciprocity action towards the original sender. A grandmother might add an “Isn’t Julia cute?!” reply in a reciprocal message to her son.
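To make the compilation-by-reference mechanism concrete, the sketch below emits a schematic SMIL description from Python; the clip names, timings and overall structure are invented for illustration, and a real MyVideos description is richer:

```python
def write_smil(path, audio_src, clips, comments):
    """Describe a compilation by reference: a SMIL playlist, no re-encoding."""
    lines = ['<smil><body><par>',
             f'  <audio src="{audio_src}"/>']        # master audio: the leading layer
    lines.append('  <seq>')
    for src, begin, end in clips:                    # video fragments, by reference
        lines.append(f'    <video src="{src}" clipBegin="{begin}s" clipEnd="{end}s"/>')
    lines.append('  </seq>')
    for start, text in comments:                     # timed textual enrichments
        lines.append(f'  <smilText begin="{start}s">{text}</smilText>')
    lines.append('</par></body></smil>')
    with open(path, 'w') as f:
        f.write('\n'.join(lines))

# A recipient could append (12, "Isn't Julia cute?!") to comments and re-share
# the description without touching any of the referenced media files.
write_smil('for_grandma.smil', 'master_track.mp3',
           [('cam2/julia_solo.mp4', 0, 20), ('master.mp4', 20, 45)],
           [(5, 'Watch the trombone entrance!')])
```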
4.2.4 Guidelines relative to Requirements
In addition to reflecting effort and supporting emotional intensity, intimacy, and reciprocity, MyVideos also addresses other requirements identified in Section 3.3, as discussed below.
Using a trusted media storage server (provided by the school) fulfills the requirement of privacy. Parents can upload the material from the concerts to a common media repository. The repository is a controlled environment, since it is provided and maintained by the school, instead of being an external resource controlled by a third-party company. Moreover, all the media material is tagged and associated with the parent who uploaded it, and there are mechanisms so parents can decide not to share certain clips in which their children appear. Users can use their credentials for navigating the repository – those parts allowed to them – and for creating different stories for different people.

Figure 5. Temporal alignment of a real-life data set from a concert, where a community of users recorded video clips.
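A minimal sketch of the resulting per-clip visibility rule, reusing the hypothetical shared flag from the earlier schema (our reading of the policy, not the actual implementation):

```python
def visible_clips(clips, viewer, community):
    """Return the clips a logged-in user may browse in the repository."""
    if viewer not in community:
        return []                                  # credentials required to enter
    return [c for c in clips
            if c.owner == viewer or c.shared]      # withheld clips stay private
```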
The effortlessness requirement is met by the provision of a number of automatic processes that analyze and annotate the videos, and that help users to navigate media assets and to create memories.
First, our system uses two automatic tools for analyzing and annotating the ingested video clips: a temporal alignment tool and a semantic video annotation tool. The temporal alignment tool is used to align all of the individual video clips to a common time base. The core of the temporal alignment algorithm is based on perceptual time-frequency analysis with a precision of 10 ms.
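The alignment algorithm itself is described in [9]; as a rough illustration of the general idea only, clips can be placed on a common time base by cross-correlating an audio feature of each clip against the master track (a toy NumPy sketch, not the authors’ time-frequency method):

```python
import numpy as np

def align_offset(master_env, clip_env, sr=100):
    """Estimate where a clip starts within the master track.

    master_env and clip_env are audio energy envelopes sampled at sr Hz;
    at sr=100 the resolution is 10 ms, matching the precision quoted above.
    """
    m = (master_env - master_env.mean()) / (master_env.std() + 1e-9)
    c = (clip_env - clip_env.mean()) / (clip_env.std() + 1e-9)
    corr = np.correlate(m, c, mode='valid')   # slide the clip along the master
    return np.argmax(corr) / sr               # best-matching offset, in seconds
```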
Figure 5 shows the results of the temporal alignment of a recorded
real life dataset (more information on the datasets will be provided
in Section 5.1). The level of accuracy of our tool is around 99%, improving on other state-of-the-art solutions [8][15]. Since the
focus of this paper is not on content analysis, we will not further detail this part of the system. The interested reader can find the
algorithm and its evaluation elsewhere [9]. The Semantic Video
Annotation Suite provides basic analysis functions, similar to the
ones reported in [15]. The tool is capable of automatically
detecting potential shot boundaries, of fragmenting the clips into coherent units, and of annotating the resulting video sub-clips.
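Shot-boundary detection of this kind is commonly implemented by thresholding frame-to-frame histogram differences; the toy sketch below shows the general technique (not the Annotation Suite’s actual method):

```python
import numpy as np

def shot_boundaries(frames, bins=32, threshold=0.4):
    """Flag frame indices where the grey-level histogram changes abruptly."""
    cuts, prev = [], None
    for i, frame in enumerate(frames):        # frames: iterable of 2-D grey arrays
        hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
        hist = hist / max(hist.sum(), 1)      # normalise to a distribution
        if prev is not None and 0.5 * np.abs(hist - prev).sum() > threshold:
            cuts.append(i)                    # large total-variation jump => cut
        prev = hist
    return cuts
```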
Second, our system uses other automatic processes for visualizing media assets and for effortless creation of videos. As introduced in the previous subsections, users can navigate the video clips using a recommender algorithm, and they can automatically generate a video compilation from the multi-camera recordings.
These automatic tools provide a baseline personalized video. Our personalization extension tools allow further transformation into highly targeted presentations from a common media resource. In order to promote easy user sharing, the client aspects of MyVideos have been implemented as a full-fledged Web application. The system is hosted on a dedicated research testbed with a high-bandwidth symmetrical Internet connection and virtualized processor clusters dedicated to hosting Web applications and serving video. In our architecture, each school would rent space and functionality on the testbed, in order to make systems like ours available to their community. In our current implementation, the testbed space was provided to members of the community.
The evaluation reported in this paper is part of an extended study to better understand the intersection of social multimedia and social interactions. The final intention is to provide technological solutions for improving social communication between families living apart. Potential users have been involved in the process since the beginning, starting with interviews and focus groups, leading up to an evaluation of the prototype. A set of parents from a high school in Amsterdam has been actively collaborating with this research. Starting in December 2009, the parents were invited to a focus group that took place in Amsterdam; in April 2010 they recorded (together with some researchers) a concert of their children; in July 2010 they used our prototype application with the video material recorded at that concert. This section reports on the evaluations carried out during this journey.

Figure 6. Interface for automatically assembling stories.
In particular, we are interested in whether the functionality provided by a socially-aware editing system provides an identifiable improvement over other approaches for creating personalized videos. Ultimately, we are not that interested in physiological issues related to specific interfaces. Instead we want to better understand the social interactions around our system and how well social connectedness is supported. In order to validate our assumptions we ran a number of experiments: some testing the functionality and others involving users. The next subsections detail the experiments, report on the results and analyze them.
5.1.1 Video Recordings
The first set of experiments consisted of the recording of three different concerts: a school rehearsal in Woodbridge in the UK, a jazz concert by an Amsterdam local band called the Jazz Warriors, and a school concert at the St. Ignatius Gymnasium in Amsterdam. The intention behind these concert recordings was twofold: to better understand the problem space and to gather real-life datasets that could be used for testing the functionality.
In the first concert, Woodbridge, a total of five cameras were used to capture the rehearsals. The master camera was placed in a fixed location, front and centre to the rehearsal, set to capture the entire scene (a ‘wide’ shot) with no camera movement and an external stereo microphone in a static location near to the rehearsal performance. This experiment was very useful for testing the automatic processes provided by our system: the temporal alignment algorithm, the semantic video annotation suite, the automatic authoring tool, and the recommender system for visualizing multi-camera recordings.
Then, at the end of November 2009 a concert of the Jazz Warriors was captured as part of an asset collection process. The goal of the capture session was to gain experience with an end-user setup that would be similar to that expected for the Amsterdam school trial.
The concert took place on 27 November 2009 at the Kompaszaal, a public restaurant and performance location in Amsterdam. The Jazz Warriors is a traditional big band with approximately 20 members. In total eight (8) cameras were used to capture the concert, where two cameras were considered as ‘masters’ and were placed at fixed locations at stage left and stage right. In total, we collected more than 200 video clips and approximately 80 images. The longest video clip was 50 min, the shortest 5 s.

Figure 7. Interface for navigating video clips (timeline).
The first two concerts were primarily experimental, providing us enough material for fine-tuning the automatic and manual processes. On 16 April 2010 the concert of the Big Band (St. Ignatius Gymnasium) was recorded. In this case parents took part in the recordings and provided the research team with all the material. In total around 210 media objects were collected for a concert lasting about 1 hour and 35 minutes. Twelve (12) cameras were used, two of them as master cameras. These recordings were used for evaluating the prototype.
5.1.2 Prototype Evaluation
In cooperation with a local high school, a core group of seven people was recruited, from among relatives and friends of the kids who performed in the third concert, for evaluating the prototype. The rationale used for recruitment was to gather as many roles as possible, for a better understanding of the social needs of our potential users. The participants were three high school students, a social scientist, a software engineer, an art designer and a visual artist, resulting in a variety of needs that may influence video capturing, editing and sharing behaviors. All the participants were Dutch. The average age of the participants was 37.1 years (SD = 20.6 years); 3 participants (42.8%) were female. Among our participants, 3 had children (ranging in age from 14 to 17 years). All participants were currently living in the Netherlands, but an uncle of a performer who lives in the US was recruited to serve as an external participant (the only one who was not present at the concert recordings). The prototype evaluation was conducted over a two-month span in the summer of 2010 (Jul-Sep).
The review group was kept small so that we could establish directed and long-term relationships. The qualitative nature of our interactions provided us with a deep understanding of the ways in which people currently share experiences to foster strong ties. The participants represent a realistic sample for the intended use case: all the parents have kids going to the same high school, all of them tend to record their kids, and some of them have experience with multimedia editing tools. Moreover, the parents were involved in previous focus groups dating from December 2009, and they recorded their kids playing in the Big Band concert (April 2010). The goals of the study make it impossible to do crowdsourced testing, since users of the system should be people who care about the actual content of the videos.
We used multiple methods for data collection, including in-person interviews, questionnaires, and interaction with the system. Figure 6 shows the interface our participants used for automatically assembling stories. Users only have to select the subject matter (people, songs, instruments) and two parameters (style and duration). Then, by pressing the “GO” button, a story is automatically assembled. Figure 7 shows how a user can browse the multi-camera video material temporally – there are other navigation interfaces. In the figure, the video clips in which a performer connected to the user appears are highlighted in orange.
The experiment consisted of 2 sessions. The initial session was used to collect background information about video recording habits, their intentions behind these habits and their social relations around media. We also used this session as an opportunity to understand how the participants conceptualize the concert. The second (in-depth) session was dedicated to capture video editing practices and media sharing routines of the participants based on their interactions with the system.
5.1.3 London Student Study
Another smaller and more incidental evaluation was performed to validate and contrast the results obtained when evaluating the prototype. In this case, we asked a broader group of British students about their social media habits. We then introduced the system to them, and let them play with it for a full day. A total of 20 kids completed the questionnaires. The students were from London high schools and were between 14 and 18 years old.
Figure 8 shows the results of some of the questionnaires filled in by the participants in Amsterdam. In all cases, the participants interacted with the system, visualizing media assets and creating stories. The top part (a) shows their responses to questions related to media capturing, editing, and sharing. The bottom part (b) shows their answers to the questions related to utility and usefulness, including comparisons to other existing solutions.
Regarding common practices, all participants reported they record videos in social events (e.g., family gatherings and vacation trips).
However, most of them said that they barely looked at the
recorded material afterwards (see Figure 8 (a)). For most of them,
video editing is time consuming and way too complicated.
Nevertheless, some are familiar with video editing tools. When prompted if and how they share their videos, our participants repeatedly said that in general they do not post personal videos on the Web. While the youngest participants argued their personal videos were not interesting enough to share on social networking services, for our older respondents privacy was the main reason not to share personal videos on the Web. On the other hand, all participants do appreciate watching videos on YouTube, and most of them have a Facebook account. However, only a few do collaborative tagging of videos and/or photos.
The responses from the London high school questionnaires are captured in Figure 9. In general, we can talk about a generation that has grown up watching and sharing digital videos. Regarding MyVideos, the kids graded it similarly to the evaluators of the prototype, indicating that our results can be generalized further than initially expected.
Based on our observations, responses to the questionnaires, and analysis of the collected audio/video, we present here our findings regarding effort, emotional intensity, and intimacy. We end this subsection with a discussion of how our experimental method relates to the strong ties and social connectedness theories.

Figure 8. Results of the questionnaire in Amsterdam: (a) social media habits; (b) utility and usefulness of MyVideos. The (b) items, answered Yes/No, were:
Q01. Did you like MyVideos?
Q02. Does MyVideos help you recall memories of social events?
Q03. Would you like to see what other parents and friends have recorded at the same event?
Q04. Is MyVideos better than traditional systems to browse videos recorded by other people?
Q05. Is MyVideos better than traditional systems to find people you care about?
Q06. Would you add/correct more metadata by using MyVideos?
Q07. Is the material you capture enough to create video productions?
Q08. Would you create more video stories with MyVideos (compared to existing tools)?
Q09. Would you create video stories faster using MyVideos (compared to existing tools)?
Q10. Would you create video stories more easily using MyVideos (compared to existing tools)?
Q11. Would you fine-tune automatically generated video stories by using manual tools?
Q12. Would you personalize videos by adding yourself to the story? (“capture me”)
Q13. Which approach did you like the most (automatic or manual)?
Q14. Would you share more videos with MyVideos?
Q15. Is MyVideos safer than other video sharing systems?
Q16. Would you pay for MyVideos?
5.3.1 Effort
In general the participants liked MyVideos and considered its functionality useful, since it allowed them to use recordings made by others (see Q01-05 in Figure 8 (b)). A surprising outcome during the prototype evaluation was that participants were not really into editing before sharing videos. Participants rarely edited the video clips they had captured. We believe this behavior is motivated by the immediacy paradigm offered by successful commercial video sharing systems such as YouTube. However, the automatic authoring process was generally appreciated, making the users feel that they could create stories faster and more easily (Q08-10 and Q13 in Figure 8 (b)). Overall, the assembled videos were considered visually compelling. As an example, Figure 10 presents the comments of a user while interacting with the system.

Figure 9. Results of the questionnaire to high school students in London (14-18 years old).
Although participants found the assembled videos appealing, they wanted (manual) tools to further personalize the stories (see Q11-13 in Figure 8 (b)). Indeed, the participants considered that creating memories requires using the optional manual processes included in the system. Optional manual processes for personalizing and fine-tuning were welcomed and requested.
“I want more portraits of my daughter (in this automatic generated compilation)... is it possible to edit an existing movie (in the Editor)?” (Father of a performer)
5.3.2 Emotional Intensity
The participants appreciated the filter (recommendation) functions offered by MyVideos, because such features skip all the rest and focus on the topics of interest (Q04). Interestingly, for the users we spoke to, a person was considered featured in a video (and suitable for search) only if it was a prominent shot (e.g., close-up or solo). Most of the participants reported they would not be interested in a shot in which the subject of their interest barely appears. We believe the relatively low score on “Is MyVideos better than traditional systems to find people you care about?” (Q05) is because this version of the system did not have quality filters.
“If he is in the background but he is on the shadow it is OK but I would like to see a video in which he really shows up… My mother (performer’s grandma) would not enjoy seeing this video of him because there is not much to see” (Uncle of a performer)
When we prompted our participants about how much time they were willing to spend adding/correcting annotations, they were quite resistant, arguing that this process demands too much effort.
Nevertheless, the questionnaires show an improvement over
existing solutions (see Q06 in Figure 8 (b)).
5.3.3 Intimacy
Regarding the optional manual processes, for most of our participants the ‘capture me’ function – functionality for including personal assets (as shown in Figure 4) in a video compilation – was seen as a way to personalize videos for a target audience. As shown in the results (see Q12), such functionality was generally appreciated. Some of the participants told us they would use this functionality when creating a birthday present video, for instance.

Figure 10. A participant using the automatic authoring tool.
5.3.4 Social Connectedness
Based on the participants’ comments, reactions, and answers to the questionnaires, we can conclude that participants appreciated the benefits of such a system and considered it as a valuable vehicle for remembering events, thus improving social connectedness. In particular, participants largely agreed that
MyVideos would help them in recalling memories from social events (see Q02). Moreover, they agreed that MyVideos would allow them to create more videos and to share them with others
(see Q08, Q12 and Q14 in Figure 8 (b) and Figure 9). In terms of Figure 1 and social connectedness theory, all these actions represent the small peaks of awareness we intended to mediate with socially-aware video authoring and sharing tools.
In general, participants were proud of their own productions, but in most cases they were actively looking for video clips of their close friends and relatives. They complained – and despaired – when the quality of the video was not good enough or when the metadata of the videos was wrong. And most importantly, they wanted to share video clips featuring members of their close circle. “Can I send it now?” was the reaction of one of the participants after seeing a video clip he especially liked.
While these results are highly valuable and relevant, we are aware that they are not complete. It is our intention to develop technological solutions for togetherness, but there is no simple solution for such a complex problem. Doing this requires a number of steps, each of which needs to be developed and tested in turn. Our next step in this direction is to run long field trials, to obtain stronger evidence on how productions and relationships evolve over time.
This paper reports on a 10-month evaluation process that investigates mechanisms and principles for togetherness and social connectivity around media. The contribution is a set of guidelines for a new paradigm of multimedia authoring tools: socially-aware video editing and sharing. Our experimental methodology has led to the specification of a set of guidelines for nurturing strong ties and improving social connectedness. These guidelines have been realized and evaluated through a prototype tool that allows users to navigate, create and share videos from a social event – a concert – they took part in. Our user studies included a short-term evaluation in the UK and a long-term process in the Netherlands, in which people actively participated and recorded a concert of their relatives. As shown in this paper, our experimental methodology aims at meeting requirements of social communities that are not addressed by existing social media Web applications. It aims at improving connectedness through effortless interaction, providing support for emotional intensity, intimacy and reciprocity.
Our results and findings lead to some open questions that we expect to answer in future studies. For example, it seems surprising that users liked MyVideos (e.g., Q01-04 in Figure 8 (b)) but did not find it much “safer”. Based on previous interviews and other studies, we believe this is because more senior users are by default concerned about uploading material outside their reach (even though it is a controlled environment). More specific studies and trials are needed to gain better insight into this issue.
Another interesting question is whether people actually need authoring tools. One might believe that people are often satisfied with capturing an event in some primitive form, rather than producing something more sophisticated. We argue that this belief is rooted in the models offered by current authoring and sharing tools, such as YouTube. Our findings illustrate that, if offered the right paradigm, users will create and share more sophisticated productions. Apart from the results of our preliminary interviews, we can observe that users want others’ content to create their own productions (Q03 and Q07 in Figure 8 (b)) and that users feel they would create more videos using our prototype (Q08). After all, storytelling has been (and will be) a primary way of communicating between people.
During this process we did not intend to focus on a specific piece of software, but to take a broader look at the process and its implications. Our final intention is to reformulate the research problem of multimedia authoring by emphasizing the importance of the social relationships between the media ‘authors’ relative to the performers. We believe our findings are relevant for future research in this new direction: socially-aware authoring and sharing tools. Essentially, we hope these insights can stimulate others, even as we use the insights to further our own work.
The research leading to these results has received funding from the European Community's Seventh Framework Programme
(FP7/2007-2013) under grant agreement no. ICT-2007-214793.
Special thanks to partners involved in the MyVideos demonstrator.
[1] Abowd, G.D., Gauger, M., and Lachenmann, A. 2003. The
Family Video Archive: an annotation and browsing environment for home movies. In Proceedings of the ACM
SIGMM International Workshop on Multimedia Information
Retrieval , pp. 1-8.
[2] Cesar, P., Bulterman, D.C.A., Jansen, J., Geerts, D., Knoche, H., and Seager, W. 2009. Fragment, Tag, Enrich, and Send: Enhancing the Social Sharing of Videos. ACM TOMCCAP, 5(3): article 19.
[3] Dib, L., Petrelli, D., and Whittaker, S. 2010. Sonic souvenirs: exploring the paradoxes of recorded sound for family remembering. In Proceedings of the ACM Conference on
Computer Supported Cooperative Work , pp. 391-400.
[4] Durkheim, E. 1971. The elementary forms of the religious life. Allen and Unwin.
[5] Gilbert, E. and Karahalios, K. 2009. Predicting tie strength with social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems , pp. 211-220.
[6] Granovetter, M.S. 1973. The Strength of Weak Ties.
American Journal of Sociology , 78(6): 1360-1380.
[7] Guimarães, R.L., Cesar, P., Bulterman, D.C.A., Kegel, I., and Ljungstrand, P. 2011. Social Practices around Personal
Videos using the Web. In Proceedings of the ACM
International Conference on Web Science .
[8] Kennedy, L.S. and Naaman, M. 2009. Less talk, more rock: automated organization of community-contributed collections of concert videos. In Proceedings of the
International Conference on World Wide Web , pp. 311-320.
[9] Korchagin, D., Garner, P.N. and Dines, J. 2010. Automatic
Temporal Alignment of AV Data with Confidence
Estimation. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing .
[10] Martin, R. and Holtzman, H. 2010. Newstream: a multidevice, cross-medium, and socially aware approach to news content. In Proceedings of EuroITV , pp. 83-90.
[11] Naci, S.U. and Hanjalic, A. 2007. Intelligent browsing of concert videos. In Proceedings of the ACM International
Conference on Multimedia , pp. 150-151.
[12] Obrador, P., Oliveira, R., and Oliver, N. 2010. Supporting personal photo storytelling for social albums. In Proceedings of the ACM International Conference on Multimedia , pp.
561-570.
[13] Rettie, R. 2003. Connectedness, awareness and social presence. In Proceedings of the Annual International
Workshop on Presence .
[14] Shaw, R. and Schmitz, P. 2006. Community annotation and remix: a research platform and pilot deployment. In
Proceedings of the ACM international Workshop on Human-
Centered Multimedia , pp. 89-98.
[15] Shrestha, P., de With, P.H.N., Weda, H., Barbieri, M., and
Aarts, E.H.L. 2010. Automatic mashup generation from multiple-camera concert recordings. In Proceedings of the
ACM International Conference on Multimedia , pp. 541-550.
[16] Snoek, C., Freiburg, B., Oomen, J., and Ordelman, R. 2010.
Crowdsourcing Rock N’ Roll Multimedia Retrieval. In
Proceedings of the ACM International Conference on
Multimedia , pp. 1535-1538.
[17] Truong, B.T. and Venkatesh, S. 2007. Video abstraction: A systematic review and classification. ACM TOMCCAP , 3(1).
[18] Westermann, U. and Jain, R. 2007. Toward a Common Event
Model for Multimedia Applications. IEEE Multimedia ,
14(1): pp. 19-29.
[19] Whittaker, S., Bergman, O., and Clough, P. 2010. Easy on that trigger dad: a study of long term family photo retrieval.
Personal Ubiquitous Computing , 14(1): pp. 31-43.
[20] Williams, D., Ursu, M.F., Cesar, P., Bergstrom, K., Kegel, I. and Meenowa, J. 2009. An emergent role for TV in social communication. In Proceedings EuroITV , pp. 19-28.
[21] Zsombori, V., Frantzis, M., Guimarães, R.L., Ursu, M.F., Cesar, P., Kegel, I., Craigie, R. and Bulterman, D. 2011. Automatic Generation of Video Narratives from Shared UGC. In Proceedings of the ACM Conference on Hypertext and Hypermedia, pp. 325-334.