Your Reactions Suggest You Liked the Movie
Automatic Content Rating via
Reaction Sensing
Xuan Bao, Songchun Fan, Romit Roy Choudhury,
Alexander Varshavsky, Kevin A. Li
Rating Online Content (Movies)
Manual rating is not incentivized, not easy … and does not reflect the actual experience
Our Vision
Automatically generate, from sensed reactions to ratings:
• Overall star rating
• Reaction tags
• Reaction-based highlights
Key Intuition
Multi-modal sensing / learning
02:43 - Action
09:21 - Hilarious
12:01 - Suspense
……
Overall – 5 Stars
Specific Opportunities
• Visual
– Facial expressions, eye movements, lip movements …
• Audio
– Laughter, talking
• Motion
– Device stability
• Touch screen activities
– Fast forward, rewind, checking emails and IM chats …
• Cloud
– Aggregate knowledge from others’ reactions
– Labeled scores from some users
Pulse: System Sketch
Applications (Beyond Movie Ratings)
• Annotated movie timeline
– Slide forward to the action scenes
• Platform for ad analytics
– Assess which ads grab attention …
– Customize ads based on scenes the user reacts to
• Personalized replays and automatic highlights
– User reacts to a specific tennis shot; TV shows a personalized replay
– Highlights of all exciting moments in the Super Bowl game
• Online video courses (MOOCs)
– May indicate which parts of a lecture need clarification
• Early disease symptom identification
– ADHD among young children, and other syndromes
First Step:
A Sensor-Assisted Video Player
Pulse Media Player
• Developed on a Samsung Galaxy tablet (Android)
– Sensor meta-data layered on the video as output
[Architecture figure: sensing threads control the player, media-player control functions are monitored, and the user is observed through the front camera]
Basic Design
Data distillation process (bottom-up):
• Raw sensor readings: microphone, camera, accelerometer, gyroscope, touch, clicks
• Signals to Reactions (S2R): features extracted from the raw sensor readings
• Reactions: laugh, giggle, doze, still, music …
• Reaction to Rating & Adjective (R2RA), assisted by the cloud
• Output: tag cloud (English adjectives) and final rating (numeric)
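To make the stack concrete, here is a toy end-to-end pass through these stages in Python; every function name, feature, and threshold below is an illustrative assumption, not Pulse's actual code.

```python
"""Toy sketch of the slide's distillation pipeline (all names/thresholds invented)."""
import statistics

def extract_features(window):
    # Features from raw sensor readings (here: just audio energy and motion variance).
    return {
        "audio_energy": statistics.mean(abs(s) for s in window["mic"]),
        "motion_var": statistics.pvariance(window["acc"]),
    }

def signals_to_reactions(feat):
    # S2R: map features to a coarse reaction label.
    if feat["audio_energy"] > 0.5:
        return "laugh"
    if feat["motion_var"] < 0.01:
        return "still"
    return "fidget"

def reactions_to_rating(reactions):
    # R2RA: map the reaction sequence to a numeric rating plus adjective tags.
    score = {"laugh": 5, "still": 4, "fidget": 2}   # made-up mapping
    rating = round(statistics.mean(score[r] for r in reactions))
    tags = sorted(set(r for r in reactions if r != "fidget"))
    return rating, tags

windows = [{"mic": [0.9, 0.8], "acc": [0.0, 0.0, 0.1]},
           {"mic": [0.1, 0.2], "acc": [0.0, 0.0, 0.0]}]
reactions = [signals_to_reactions(extract_features(w)) for w in windows]
print(reactions_to_rating(reactions))   # e.g., (4, ['laugh', 'still'])
```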
Visual Reactions
• Facial expressions (face size, eye size, blink, etc.)
– Track viewers' faces through the front camera
– Track eye position and size (challenging with spectacles)
– Track partial faces (via SURF point matching)
– Detect blinks and lip size by looking for differences between frames
[Figure: face tracking, eye tracking (green), blink detection (red), and partial-face tracking]
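As a rough illustration of this kind of sensing, the sketch below uses OpenCV's stock Haar cascades as a stand-in for Pulse's tracker; the SURF-based partial-face matching is omitted, and the eyes-disappear-then-reappear blink heuristic is an assumption.

```python
# Illustrative visual-reaction loop; cascades and thresholds are generic
# OpenCV defaults, not Pulse's actual tracker.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

cap = cv2.VideoCapture(0)                  # front camera
prev_eye_count = None
for _ in range(300):                       # a few seconds of frames
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        face = gray[y:y + h, x:x + w]      # face size (w, h) is itself a feature
        eyes = eye_cascade.detectMultiScale(face)
        if prev_eye_count == 0 and len(eyes) > 0:
            print("possible blink: eyes reappeared after a frame without eyes")
        prev_eye_count = len(eyes)
cap.release()
```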
Acoustic Reactions
• Laughter, conversation, shout-outs …
– Cancel out the (known) movie sound from the recorded sound
– Laughter detection, conversation detection
Even with knowledge of the original movie audio (blue), it is hard to identify user conversation (i.e., to distinguish the red and green signals).
Acoustic Reactions
• Separating the movie's audio from the user's audio
– Spectral energy density comparison is not adequate
– Different techniques are needed for different volume regimes (low vs. high volume)
Early results demonstrate the promise of detecting acoustic reactions.
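One standard way to cancel a known soundtrack is spectral subtraction; the numpy/scipy sketch below is a minimal illustration that assumes the two signals are already time-aligned. It is not the deck's exact method, which (per the slide) switches techniques by volume regime.

```python
# Generic spectral-subtraction sketch for removing the known movie audio
# from the microphone recording (assumes time-aligned signals).
import numpy as np
from scipy.signal import stft, istft

def subtract_movie(mic, movie, fs=16000, alpha=1.0):
    _, _, M = stft(mic, fs)      # mic = movie playback + user sounds
    _, _, S = stft(movie, fs)    # known movie soundtrack
    mag = np.maximum(np.abs(M) - alpha * np.abs(S), 0.0)  # subtract magnitudes
    residual = mag * np.exp(1j * np.angle(M))             # keep the mic's phase
    _, user = istft(residual, fs)
    return user                  # residual is (roughly) the user's own audio

fs = 16000
t = np.arange(fs) / fs
movie = np.sin(2 * np.pi * 440 * t)              # stand-in soundtrack
mic = movie + 0.3 * np.sin(2 * np.pi * 97 * t)   # soundtrack + "user" sound
print(np.abs(subtract_movie(mic, movie, fs)).max())  # ~0.3: user audio remains
```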
Motion Reactions
• Reactions also leave a footprint in the motion dimension
– Motionless during intense scenes
– Fidgeting during boredom
[Figure: device motion across an intense scene, a calm scene, and a "time to stretch" moment]
Motion readings correlate with changes in ratings … and the timing of motions also correlates with the timing of scenes.
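A minimal sketch of the stillness-vs-fidgeting intuition, classifying accelerometer windows by total variance; the window length and threshold are illustrative guesses, not Pulse's parameters.

```python
# Toy classifier: "still" vs. "fidget" per accelerometer window.
import numpy as np

def motion_reaction(acc, fs=50, window_s=10, thresh=1e-3):
    """acc: (N, 3) accelerometer samples; returns one label per window."""
    n = fs * window_s
    labels = []
    for start in range(0, len(acc) - n + 1, n):
        var = np.var(acc[start:start + n], axis=0).sum()  # total motion energy
        labels.append("still" if var < thresh else "fidget")
    return labels

rng = np.random.default_rng(0)
calm = rng.normal(0, 0.001, size=(500, 3))     # barely moving device
restless = rng.normal(0, 0.05, size=(500, 3))  # device shaking in hand
print(motion_reaction(np.vstack([calm, restless])))   # ['still', 'fidget']
```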
Extract Reaction Features – Player Control
• Collect users' player-control operations
– Pause, fast forward, jump, roll back, …
– All seek-bar (slider) movements
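A small sketch of how such player-control events might be logged as reaction features; the event names, fields, and the rewatch heuristic are hypothetical, not Pulse's actual schema.

```python
# Hypothetical player-control event log used as a reaction-feature source.
import time
from dataclasses import dataclass, field

@dataclass
class ControlLog:
    events: list = field(default_factory=list)

    def record(self, kind, position_s):
        # kind: "pause", "ffwd", "rewind", "seek", ...
        self.events.append((time.time(), kind, position_s))

    def rewatched_positions(self):
        # A backward seek suggests the user replayed (and perhaps liked) a scene.
        return [pos for _, kind, pos in self.events if kind == "rewind"]

log = ControlLog()
log.record("pause", 163)    # user pauses at 02:43 ...
log.record("rewind", 155)   # ... then rolls back to rewatch the scene
print(log.rewatched_positions())   # [155]
```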
Challenges in Learning
Problem – A Generalized Model Does Not Work
• A directly trained model does not capture the rating trend
Why?
The Reason it Does Not Work is …
• Human behaviors are heterogeneous
– Users are different
– Environments are different, even for the same user (home vs. commute)
[Figure: gyroscope readings from the same user at home and at the office — sensed motion patterns are very different when the same movie is watched during a bus commute vs. in bed at home]
• Naïve solution → build specific models one by one (office, home, commute, …)
– Impossible to acquire data for all <User, Context, Movie> tuples
Challenges in Learning
Approach:
Bootstrap from Reaction Agreements
Approach: Bootstrap from Agreement
• Questions
– What behavior means positive/negative in a particular setting?
– How do we acquire data without explicitly asking the user every time?
• Approach: utilize reactions that most people agree on
[Figure: sensor readings over time, aligned with cloud knowledge (other users' ratings) marking consensus "climax" and "boring" moments]
Approach: Bootstrap from Agreement
• Solution: spawn from consensus (see the sketch below)
– Learn user reactions during the consensus "climax" and "boring" moments
– Generalize this knowledge of positive/negative reactions
– Gaussian process regression (GPR) for ratings, SVM for labels
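A minimal scikit-learn sketch of this bootstrap step: fit a GPR and an RBF SVM on windows where other viewers' opinions agree, then apply them to windows without consensus. The two-dimensional features, ratings, and labels are invented for illustration.

```python
# Bootstrap-from-agreement sketch; features/labels are made up.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.svm import SVC

# Feature windows where the cloud shows strong agreement (consensus moments).
X_consensus = np.array([[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.2, 0.8]])
ratings = np.array([5.0, 5.0, 1.0, 1.0])          # climax ~5 stars, boring ~1
labels = np.array(["intense", "intense", "dull", "dull"])

gpr = GaussianProcessRegressor().fit(X_consensus, ratings)   # numeric ratings
svm = SVC(kernel="rbf").fit(X_consensus, labels)             # adjective tags

# Generalize the learned reaction model to windows without consensus.
X_new = np.array([[0.7, 0.2], [0.3, 0.7]])
print(gpr.predict(X_new), svm.predict(X_new))
```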
Evaluation
User Experiment Setting
• 11 participants watched preloaded movies (46 movie viewings in total)
• 2 comedies, 2 dramas, 1 horror movie, 1 action movie
• Users provide manual ratings and labels
– For ground truth
• We compare Pulse's ratings with the manual ratings
Preliminary Results – Final (5-Star) Rating
Confusion matrix: Pulse's rating (columns) vs. true manual rating (rows), in counts of viewings.

Truth \ Pulse   1   2   3   4   5
1               0   1   0   0   0
2               0   4   2   0   1
3               0   1  17   0   1
4               0   0   2   5   2
5               0   0   2   1   7
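The exact-match rate and mean rating error follow from the counts above by simple arithmetic; the numpy sketch below derives them from the table (computed here, not quoted from the deck).

```python
# Summary statistics derived from the confusion matrix above.
import numpy as np

conf = np.array([[0, 1,  0, 0, 0],    # truth = 1
                 [0, 4,  2, 0, 1],    # truth = 2
                 [0, 1, 17, 0, 1],    # truth = 3
                 [0, 0,  2, 5, 2],    # truth = 4
                 [0, 0,  2, 1, 7]])   # truth = 5
total = conf.sum()                     # 46 viewings
exact = np.trace(conf) / total         # fraction predicted exactly right
truth, pulse = np.indices(conf.shape)
mae = (np.abs(truth - pulse) * conf).sum() / total
print(f"{total} viewings, exact-match {exact:.0%}, MAE {mae:.2f} stars")
```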
Preliminary Results – Final (5-Star) Rating
[Histogram: difference between Pulse's rating and the true 5-star manual rating]
Preliminary Results – Myth behind the Error
• Final ratings can deviate significantly from the average of the segment ratings
• User-given scores may not be linearly related to quality
Preliminary Results – Lower Segment Rating Error
• Final ratings come from averaging per-segment ratings (e.g., a movie's segments rated 3, 4, 4, 2, 2, 2, 5)
• Our system outperforms the other methods
[Bar chart: mean segment-rating error on a 5-point scale for random ratings, collaborative filtering, and our system]
Preliminary Results – Better Tag Quality
• Generated tags capture users' feelings better than using SVM alone
[Figure: tag clouds (happy, intense, warm) produced by SVM alone vs. by our system]
Preliminary Results – Reasonable Energy Overhead
• Reasonable energy overhead compared to playback without sensing
More tolerable on tablets; duty-cycling may be needed on smartphones.
Closing Thoughts
• Human reactions are in the mind
– However, they manifest as bodily gestures and activities
• Rich, multi-modal sensors on mobile devices
– A wider net for "catching" these reactions
– Multi-modal capability: the whole is greater than the sum of the parts
• Pulse is an attempt to realize this opportunity
– Distilling semantic meaning from sensor streams
– Rating movies … tagging any content with reaction meta-data
• Enabler for
– Recommendation engines
– Content/video search
– Information retrieval, summarization
Thoughts?
Backup – Potential Questions
• Privacy concerns
– Like every new technology, Pulse may initially attract early adopters
– If only final ratings are uploaded, the privacy level is similar to that of current ratings
• Why not just emotion sensing / just laughter detection?
– Emotion sensing is a broad and challenging problem … but its goal is different from ours (rating)
– Explicit signs like laughter usually account for only a small fraction of movie viewing; we need to explore other opportunities (e.g., motion)
– Our approach takes advantage of the specific task: (1) we know the user is watching a movie, (2) we can observe the user for a longer duration than most emotion-sensing work, and (3) we know other users' opinions
• How is this possible … isn't the human mind too complex?
– Human thoughts are complicated … but they may produce footprints in behavior
– Collaborative filtering explicitly uses knowledge of other users' thoughts to bootstrap our algorithm
• The sample size is small … only 11 users
– The sample size is limited, but each user watched multiple movies (46 viewings in total), and segment ratings are for 1-minute segments (thousands of data points)
– Collaborative filtering shows that even within this data set the ratings can diverge, and the naïve solution does not work as well as ours
Preliminary Results – Better Retrieval Accuracy
• Viewers care more about the highlights of a movie
• We quantify the contribution of sensing to retrieving them
[Figure: retrieval performance — gain from sensing and additional error, shown as the overall achieved performance relative to the total goal]
Approach: Bootstrap from Agreement
[Figure: a simple example of GPR]
Approach: Bootstrap from Agreement
• On GPR and SVM – SVM
– SVM is a supervised learning method for classification
– It identifies hyperplanes in a high-dimensional space that can best separate the observed samples
– For our purpose, we used a non-linear SVM with an RBF kernel for its wide applicability