White Paper
Exploring the Emotion Classifiers Behind Affdex Facial Coding
Table of Contents
Background
Affdex Emotion Classifiers
    How Affdex Classifiers Work in the Cloud
    Detect & Extract Features
    Classify Emotional States
    Assess & Report Emotion Response
Creating an Affdex Emotion Classifier
    Creating and Training Classifiers
    Face Video Labeling Infrastructure
    Iterative Training & Testing Platform
    Classifier Accuracy
    Classifier Cross Cultural Validation
    Industry Thought Leadership
    In Our Labs
Conclusion
Background
At the core of the Affdex service lies the Affdex
Emotion Classifiers. These classifiers consist of
patent-pending, emotion-sensing algorithms that
take face videos as inputs and provide frame-by-frame emotion metrics as outputs. Faced with
demanding conditions in natural, uncontrolled
settings, these state-of-the-art algorithms have been
optimized to provide highly accurate results – having
been put to the test through thousands of studies
worldwide.
This paper explores the development and validation
of the Affdex Emotion Classifiers.
Affdex Emotion Classifiers
Affdex provides two categories of emotion metrics:
dimensions of emotion, which are used to
characterize the emotional response, and discrete
emotions, which are used to describe the specific
emotional states.
The dimensions of emotion that Affdex measures
include:
• Valence – A measure of the positive (or negative)
nature of the participant’s experience with the
content.
• Attention – A measure of the participant’s
attention to the screen, using the orientation of
the face to assess if they are looking directly at
the screen or if they are distracted (turning away)
while viewing content.
• Expressiveness – A measure of how
emotionally engaging content is, computed by
accumulating the frequency and intensity of the
discrete emotions, including smile, dislike,
surprise and concentration (see the sketch after
this list). Unlike valence, expressiveness is
independent of the positive or negative aspect of
the facial expressions.
The discrete emotion measures include:
• Smile – The degree to which the participant is
displaying a natural, positive smile. The smile
classifier looks at the full face rather than just the
mouth/lip area, incorporating other facial cues,
like the eyes, to accurately indicate a true smile.
• Concentration – The degree to which the
participant is frowning (displaying a brow furrow)
that is not induced by a dislike response and thus
more likely the result of focus, mental effort or
even confusion.
• Surprise – The degree to which the participant is
showing a face of surprise, indicated by raised
eyebrows.
• Dislike – The degree to which the participant is
showing expressions of dislike or even disgust.
These expressions include nose wrinkles, frowns
and grimaces.
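To make the relationship between the discrete emotions and expressiveness concrete, the sketch below shows one way such an accumulation could be computed for a single frame. It is a minimal illustration, not the actual Affdex formula: the equal weighting, the metric names and the cap at the top of the 0-100 scale are our assumptions.

    # Illustrative sketch only: accumulate discrete-emotion scores into one
    # expressiveness value per frame. Equal weights and the 0-100 cap are
    # assumptions, not the actual Affdex computation.

    DISCRETE_METRICS = ["smile", "dislike", "surprise", "concentration"]

    def expressiveness(frame_scores):
        """frame_scores: dict mapping metric name -> 0-100 intensity for one frame.

        Expressiveness ignores valence (positive vs. negative), so every
        discrete emotion contributes with the same sign.
        """
        total = sum(frame_scores.get(m, 0.0) for m in DISCRETE_METRICS)
        return min(100.0, total)  # cap at the top of the 0-100 reporting range

    # Example: a frame with a strong smile and mild surprise.
    print(expressiveness({"smile": 80.0, "surprise": 15.0}))  # -> 95.0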
In developing Affdex, we prioritized these emotion
measures based on their relevance for evaluating
media and advertising. With our extensive experience
in testing this type of content, we have already
accumulated over 300 million facial frames of
representative data into an Affdex Facial Video
Repository.
Figure 1 - Affdex Facial Video Repository
This repository is used to prioritize the development
of new classifiers, as well as to improve the
performance and accuracy of new and existing
classifiers.
How Affdex Classifiers Work in the Cloud
With face videos gathered either online (streamed) or
offline in a lab (asynchronous), the Affdex cloud
service presents them to the sophisticated computer
vision processes outlined in this section.
Figure 2 - Affdex Emotion Measures
The Affdex cloud-based face video process includes three distinct procedures:
• Detect & Extract Features
• Classify Emotional States
• Assess & Report Emotion Response
Figure 3 - Cloud-Based Emotion Processing
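As a mental model of these three procedures, the sketch below chains them as Python functions. The function names, data shapes and placeholder return values are ours, purely for illustration; the real pipeline is a distributed cloud service, not three local calls.

    # Hypothetical outline of the three-stage pipeline; names and types are
    # illustrative, not the Affdex API.

    def detect_and_extract(frame):
        """Locate the face and return its feature descriptors (or None)."""
        return frame.get("face")  # stand-in: pretend features were extracted

    def classify(features):
        """Map feature descriptors to per-emotion scores on a 0-100 scale."""
        return {"smile": 0.0, "surprise": 0.0}  # placeholder scores

    def process_video(frames):
        """Run each frame through detection and classification, then report."""
        timeline = []
        for frame in frames:
            features = detect_and_extract(frame)
            if features is not None:  # skip frames where no face was found
                timeline.append(classify(features))
        return timeline  # handed off to assessment & reporting

    print(process_video([{"face": "features-0"}, {"face": None}]))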
Detect & Extract Features
The first step is to find a face and, from the face, extract the key regions ("landmarks") needed as inputs to the classifiers.
Once the human face is located, Affdex uses 24 key feature points on the face (e.g., the corners of the eyes) that identify 3 key regions of interest: the mouth region, the nose region and the upper half of the face (eyes with eyebrows). For the Smile and Dislike classifiers, the entire face is defined as a region for enhanced context, leading to improved accuracy.
Figure 4 - Face Detection
Figure 5 - Analyze Key Regions
Once the region of interest has been isolated, Affdex analyzes each pixel in the region to describe the color, texture, edges and gradients of the face.
At this point, Affdex has not evaluated the nature of the facial expression — only that the facial regions exhibit certain characteristics. Evaluating these characteristics is the job of the Affdex classifiers.
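The sketch below illustrates the general idea of turning landmark points into regions of interest. The landmark names and the padding factor are invented for the example; the paper does not publish Affdex's actual 24-point scheme.

    # Illustrative only: derive rectangular regions of interest from named
    # facial landmarks. Landmark names and padding are assumptions.

    def bounding_box(points, pad=0.15):
        """Axis-aligned box around (x, y) points, expanded by `pad` per side."""
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        w, h = max(xs) - min(xs), max(ys) - min(ys)
        return (min(xs) - pad * w, min(ys) - pad * h,
                max(xs) + pad * w, max(ys) + pad * h)

    def regions_of_interest(landmarks):
        """landmarks: dict of landmark name -> (x, y) pixel coordinates."""
        return {
            "mouth": bounding_box([landmarks["mouth_left"], landmarks["mouth_right"],
                                   landmarks["lip_top"], landmarks["lip_bottom"]]),
            "nose": bounding_box([landmarks["nose_tip"], landmarks["nose_left"],
                                  landmarks["nose_right"]]),
            # Upper half of the face: eyes together with the eyebrows.
            "upper_face": bounding_box([landmarks["eye_left_outer"],
                                        landmarks["eye_right_outer"],
                                        landmarks["brow_left"],
                                        landmarks["brow_right"]]),
        }

    example = {"mouth_left": (40, 80), "mouth_right": (60, 80),
               "lip_top": (50, 75), "lip_bottom": (50, 85),
               "nose_tip": (50, 60), "nose_left": (45, 62), "nose_right": (55, 62),
               "eye_left_outer": (35, 45), "eye_right_outer": (65, 45),
               "brow_left": (33, 38), "brow_right": (67, 38)}
    print(regions_of_interest(example)["mouth"])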
Classify Emotional States
Affdex classifiers take the extracted facial features and classify them into emotional states. Given that most Affdex data is received at 14 frames per second, Affdex is capable of capturing both subtle and fleeting facial expressions — even those lasting only a split second.
There are two classification techniques employed by Affdex classifiers:
1) Frame-by-frame analysis: the classification is made on a single frame;
2) Dynamic analysis: based on a sequence of frames, features are analyzed temporally.
Combining these two techniques significantly
improves the accuracy and robustness of the
classifiers. For example, these combined techniques
are used to accurately assess a person’s baseline
state (thus eliminating the need for calibration).
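As a rough illustration of how per-frame outputs and temporal analysis can work together, consider the sketch below. The median-as-baseline heuristic and the window size are our assumptions, not the Affdex algorithm; they simply show how a sequence of frames supports baseline correction without explicit calibration.

    import statistics

    # Illustrative combination of the two techniques: per-frame scores
    # (static) plus a temporal pass over the sequence (dynamic). The
    # baseline heuristic is an assumption, not the Affdex method.

    def baseline_correct(frame_scores, window=5):
        """frame_scores: list of per-frame values for one emotion metric."""
        # Estimate the person's neutral baseline as the median score over
        # the full recording.
        baseline = statistics.median(frame_scores)
        # Smooth over a sliding window so single-frame noise is damped while
        # fleeting (few-frame) expressions still register.
        corrected = []
        for i in range(len(frame_scores)):
            chunk = frame_scores[max(0, i - window + 1):i + 1]
            corrected.append(max(0.0, statistics.mean(chunk) - baseline))
        return corrected

    print(baseline_correct([5, 5, 6, 40, 45, 42, 6, 5], window=3))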
Once the emotion classifiers have categorized the
facial features, the resulting emotions are assigned
numeric values for each frame of video and for each
emotion classifier. Depending on the classifier, the
classifier value corresponds to increased likelihood of
occurrence and may also indicate higher intensity.¹
For example, an Attention score approaching 100
signifies an increased likelihood of the viewer being
on task (i.e., “face on camera”). Lower Attention
values, on the other hand, indicate the viewer is
looking away from the camera—usually an indication
of boredom or fatigue (i.e., he or she is “inattentive”).
The respondents’ emotion metrics are then made
available for processing by the Affdex reporting
processes.
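For instance, a downstream consumer of these per-frame values might summarize Attention as the share of frames on which the viewer was on task. The 50-point cutoff in this sketch is an arbitrary illustration, not an Affdex threshold.

    # Toy summary of a per-frame Attention track (values on a 0-100 scale).
    # The 50.0 cutoff is an illustrative assumption only.

    def fraction_attentive(attention_scores, cutoff=50.0):
        """Share of frames where the viewer appears to face the screen."""
        on_task = sum(1 for s in attention_scores if s >= cutoff)
        return on_task / len(attention_scores)

    print(fraction_attentive([95, 92, 88, 30, 20, 91]))  # -> 0.666...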
¹ For some classifiers, like smile, the classifier’s numeric values are also correlated with the intensity of the response.
Assess & Report Emotion Response
The accumulated classifier results for all respondents
participating in a study are visualized in the Affdex
dashboard. The Affdex dashboard displays a time
series curve for each emotion metric that aggregates
respondents’ emotional experiences. Each time
series is further segmented by survey self-report
responses collected as part of the overall study. For
example, Affdex makes it easy to highlight
differences in smiles by gender, age, buying intent,
and more.
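A minimal sketch of that kind of aggregation, assuming per-frame smile scores in a pandas DataFrame with respondent-level self-report columns (the column names and values are invented for the example):

    import pandas as pd

    # Hypothetical tidy data: one row per respondent per frame, with a smile
    # score and a self-reported gender. Column names are invented.
    df = pd.DataFrame({
        "respondent": [1, 1, 2, 2, 3, 3],
        "frame":      [0, 1, 0, 1, 0, 1],
        "smile":      [10.0, 60.0, 5.0, 40.0, 0.0, 20.0],
        "gender":     ["f", "f", "m", "m", "f", "f"],
    })

    # Aggregate emotional experience: mean smile score per frame...
    overall_curve = df.groupby("frame")["smile"].mean()
    print(overall_curve)

    # ...and the same time series segmented by a self-report response.
    by_gender = df.groupby(["gender", "frame"])["smile"].mean().unstack("gender")
    print(by_gender)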
Figure 7 - Affdex Dashboard
The classifier results are also delivered to the Affdex
analytic platform, where they form the basis for
normative benchmarks. This normative data is
exposed in the summary metrics area of the
dashboard, where study results are compared to the
Affdex norms that have been compiled across
thousands of studies. By offering regionally specific
norms, Affdex provides important context for
interpreting study results.
Creating an Affdex Emotion Classifier
To this point, we’ve discussed how Affdex classifiers process face videos to yield emotion metrics and present respondents’ emotion journeys via the Affdex dashboard.
The science behind the classification process warrants further discussion, especially as it relates to accuracy and real-world robustness. The face videos obtained in natural settings, where lighting, camera position and head pose can be highly variable, offer unique challenges that Affdex classifiers have been tuned to address.
Creating and Training Classifiers
Before an Affdex classifier can automatically identify facial expressions at scale, the machine classifier’s algorithm has to be trained to recognize that expression. Classifier training uses a plethora of videos from the Affdex repository that have been labeled (or “tagged”) to identify that particular expression. To enhance accuracy in natural settings, training videos must represent a diverse population of ages, genders and cultures.
With face videos obtained from thousands of studies, Affectiva has amassed the world’s largest and most robust repository of facial frames. This repository is essential to categorizing and labeling the data needed to train the machine classifiers used to automatically process studies at scale.
Face Video Labeling Infrastructure
To train Affdex classifiers, a steady supply of labeled face video data is needed. To fulfill this data requirement, we developed infrastructure to support the systematic and ongoing ground-truth labeling of face videos. This investment includes the training of over 20 human labelers, as well as the development of an online facial video-labeling platform that manages the labeling process.
Our labeling infrastructure allows face video data from the Affdex repository to be assigned to a team of expert human labelers. The labelers systematically code facial videos at the frame level, identifying facial expressions of interest – including fleeting and subtle expressions.
Figure 6 - Video Labeling Process
Accurate classifiers need accurate labeling to
establish a solid “ground truth” on which to train. To
ensure accurate labeling, each video is labeled by at
least 3 labelers and a quality assurance process
ensures that there is majority agreement on the label.
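A minimal sketch of that majority-agreement check, assuming binary per-frame labels from three (or more) labelers; the data layout is ours:

    from collections import Counter

    # Illustrative quality-assurance step: a frame label becomes ground
    # truth only when a majority of the labelers agree on it.

    def majority_label(labels):
        """labels: one label per human labeler for a single frame, e.g. [1, 1, 0]."""
        value, count = Counter(labels).most_common(1)[0]
        if count * 2 > len(labels):   # strict majority
            return value
        return None                   # no agreement: hold out for review

    print(majority_label([1, 1, 0]))  # -> 1 (two of three saw the expression)
    print(majority_label([1, 0]))     # -> None (tie, no majority)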
The labeled video frames are then fed into the
machine learning algorithms to train the classifiers.
To create and refine our production-ready Affdex
classifiers, we continually add new, more challenging
data to our training set as it is encountered in our
production studies. As a result of this classifier
training infrastructure investment, Affdex produces
highly accurate emotion classifiers that are capable of
automatically processing thousands of face videos at
speeds constrained only by machine resources.
Iterative Training & Testing Platform
In addition to the labeling infrastructure that creates
testing and training data, we have also built a cloud-based, plug-and-play framework to automate and streamline the process of training and testing classifiers. This framework supports the iterative development and refinement of classifier algorithms by allowing us to combine a variety of feature extraction methods, such as Local Binary Patterns (LBP), Histograms of Oriented Gradients (HOG) and Gabor filters, with different classifiers, such as Support Vector Machines (SVM) and Random Forests.
different combinations of features and classifiers to
optimize the performance of our algorithms.
Technology is improving at a rapid pace and this
rapid development framework allows us to take
advantage of improvements as they become
available.
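As a sketch of what such a plug-and-play evaluation loop can look like, the snippet below pairs off-the-shelf implementations of these feature extractors and classifiers (scikit-image and scikit-learn here, with random stand-in data; the Affdex framework itself is proprietary and cloud-based):

    import numpy as np
    from skimage.feature import hog, local_binary_pattern
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def lbp_features(img):
        # Uniform LBP histogram: a compact texture descriptor.
        lbp = local_binary_pattern(img, P=8, R=1.0, method="uniform")
        hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
        return hist

    def hog_features(img):
        # Histogram of Oriented Gradients over the whole patch.
        return hog(img, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

    extractors = {"LBP": lbp_features, "HOG": hog_features}
    models = {"SVM": SVC(), "RandomForest": RandomForestClassifier()}

    # Toy stand-in data: 40 random 32x32 face patches with binary labels.
    rng = np.random.default_rng(0)
    patches = rng.random((40, 32, 32))
    labels = rng.integers(0, 2, size=40)

    for fname, extract in extractors.items():
        X = np.array([extract(p) for p in patches])
        for mname, model in models.items():
            score = cross_val_score(model, X, labels, cv=5).mean()
            print(f"{fname} + {mname}: mean CV accuracy = {score:.2f}")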
Classifier Accuracy
We’ve discussed above how we train Affdex classifiers using human-labeled training data. Assessing classifier accuracy also requires a set of human-labeled test data. This data acts as “ground truth” —
the correct classification outcome against which the
classifier performance is measured.
In order to generate robust accuracy measures, it is
essential that this test data reflect the type of data
that is found in real-world studies. It’s also essential
that the videos used in training the classifiers are not
also used to test them. Here again, we leverage our
face video repository to provide a robust test data set
that is representative of real-world conditions and
demographics but that is also distinct from the
training dataset. Our classifiers are routinely tested
against tens of thousands of frames, where most
other industry solutions are limited to testing against
several hundred video frames. Larger and more
representative testing data sets yield more accurate
results.
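One standard way to enforce that separation is to split at the video (or respondent) level rather than the frame level, so no video contributes frames to both sets. A sketch with scikit-learn's GroupShuffleSplit, using stand-in data:

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    # Stand-in data: 1,000 frames drawn from 100 face videos. Splitting by
    # video id keeps every video entirely in one set or the other.
    frames = np.arange(1000)
    video_ids = np.repeat(np.arange(100), 10)  # 10 frames per video

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(splitter.split(frames, groups=video_ids))

    # No video appears on both sides of the split.
    assert not set(video_ids[train_idx]) & set(video_ids[test_idx])
    print(len(train_idx), "training frames,", len(test_idx), "test frames")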
As part of the testing process, an Affdex classifier is
validated by a number of different accuracy
measures to meet rigorous accuracy thresholds
before it is made available for production use. These
measures are derived from assessment at two
different levels: individual face frames and aggregate
video responses. Combining the two approaches
yields a comprehensive assessment of classifier
accuracy.
Frame-level measures are used to compare classifier
output to ground-truth from labelers, frame-by-frame.
Accuracy measures for frame-level assessment include area
under the curve (AUC) scores, receiver operating
characteristic (ROC) curves and precision/recall
curves. These measures depict how often a classifier
will correctly identify an expression (referred to as
True Positive), miss an expression altogether (False
Negative) or incorrectly identify an expression as
present when it’s not (False Positive). We
continuously run tests on new algorithms to improve
classifier “scores” on these measures and to tune
classifier sensitivity.
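These frame-level measures are all standard and can be computed, for example, with scikit-learn; the label and score arrays below are placeholders:

    from sklearn.metrics import (precision_recall_curve, roc_auc_score,
                                 roc_curve)

    # Placeholder frame-level data: 1 = labelers saw the expression,
    # 0 = absent; scores are classifier outputs rescaled to [0, 1].
    y_true = [0, 0, 1, 1, 0, 1, 0, 1]
    y_score = [0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7]

    print("AUC:", roc_auc_score(y_true, y_score))
    fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)  # ROC points
    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

    # Sweeping the thresholds trades false positives (wrongly flagged
    # expressions) against false negatives (missed expressions), which is
    # how classifier sensitivity gets tuned.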
At the ad/media level, we compare the classifier
output from aggregate video responses to human
labeled ground truth results across a sample of
viewers. For example, consider a study where 100
viewers watched an ad; the 100 face videos would
be classified by Affdex and an aggregate curve would
be produced per metric to be presented on the
Affdex dashboard. To assess how close that
aggregated curve is to a human-labeled curve, the
same facial expressions of all 100 participants are
coded by human labelers, and the results
aggregated. We compare the two aggregated curves
to assess how closely the Affdex results match
human labelers. Here we use correlation and Mean
Squared Error (MSE) to assess the difference
between the two aggregated curves. We also
compare classifier output to human-labeled ground-truth results in raw curves. In figure 9, below, this comparison is illustrated, showing the human-labeled curve in red and the Affdex classifier output in blue. For this curve, the correlation is 85% and the MSE is 3.7 between the label value and the detector output value on a 0-100 range.
Figure 9 - Aggregate Curve Accuracy
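Both aggregate-level measures are easy to state precisely; a sketch, with two toy curves standing in for the human-labeled and Affdex aggregates:

    import numpy as np

    # Toy aggregate curves on the 0-100 scale: one human-labeled, one from
    # the classifier. Real curves would have one point per video frame.
    human = np.array([2.0, 5.0, 20.0, 55.0, 60.0, 30.0, 10.0])
    affdex = np.array([3.0, 6.0, 18.0, 50.0, 62.0, 33.0, 12.0])

    correlation = np.corrcoef(human, affdex)[0, 1]   # shape similarity
    mse = np.mean((human - affdex) ** 2)             # average squared gap

    print(f"correlation = {correlation:.0%}, MSE = {mse:.1f}")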
Through continuous reviews of Affdex classifier
accuracy, we’ve improved the robustness of the
classifiers to head-pose and movement, as well as
for lighting conditions—key considerations with face
videos gathered in natural settings. Affdex works well with head rotations of 5-10 degrees up/down and 20 degrees left/right. Head tilting is also handled robustly.
With regard to lighting conditions, we quantify the
level of lighting on a face on an RGB range of 0 (pitch
black) to 255 (very bright). Our threshold for accurate
classifier performance is 30, as shown on the far right
of the example below:
Figure 10 - Lighting Robustness
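Both robustness checks reduce to simple per-frame gates; a sketch, using the operating limits stated above as constants and a hypothetical brightness helper (the grayscale weights are the usual Rec. 601 values):

    import numpy as np

    # Per-frame quality gates using the operating ranges stated above. The
    # helper is hypothetical; only the numeric limits come from the text.
    MAX_PITCH_DEG = 10        # up/down rotation
    MAX_YAW_DEG = 20          # left/right rotation
    MIN_FACE_BRIGHTNESS = 30  # 0 (pitch black) to 255 (very bright)

    def face_brightness(face_rgb):
        """Mean luma of the face region (Rec. 601 weights), 0-255."""
        return float(np.mean(face_rgb @ np.array([0.299, 0.587, 0.114])))

    def frame_usable(pitch_deg, yaw_deg, face_rgb):
        return (abs(pitch_deg) <= MAX_PITCH_DEG
                and abs(yaw_deg) <= MAX_YAW_DEG
                and face_brightness(face_rgb) >= MIN_FACE_BRIGHTNESS)

    # Example: a well-lit face patch, slightly turned.
    patch = np.full((64, 64, 3), 120.0)
    print(frame_usable(pitch_deg=4, yaw_deg=12, face_rgb=patch))  # -> True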
Classifier Cross Cultural Validation
Affdex has been used in thousands of studies covering over 52 countries, with well over 60% of these studies taking place in emerging markets like China and India. Our success in emerging markets was preceded by several validation studies that confirmed the accuracy of our classifiers cross-culturally. These studies were jointly conducted with our leading partners, who applied stringent success criteria to Affdex results.
Figure 8 - Affdex Studies Worldwide
To validate our classifiers cross-culturally, we carefully constructed a series of studies that followed this basic methodology:
• Select culturally specific stimulus videos that would elicit each of the intended emotional responses.
• Field a panel of local participants to watch these videos while Affdex records their expressions.
• Manually label selected videos, using our team of certified FACS coders, for the occurrence of facial expressions.
Our classifiers were then tested on this dataset of
facial frames.
The outcome of these tests indicates that Affdex
classifiers perform within the required accuracy range
across cultures.
These studies did, however, highlight that people in Asian cultures tend to be less expressive than those in other regions, especially in the presence of a moderator in the room. This is not surprising: the literature notes that, in Asian cultures, a venue-based setup with a moderator may dampen emotional expression. This finding has led to the development of several market-specific features, such as custom dashboard scaling and market-level norms.
Industry Thought Leadership
The Affdex scientists are pioneers in applying
machine learning and computer vision techniques to
the field of affective computing. We are committed to
developing and delivering world-class science that
uses cutting edge techniques.
We encourage continued research in the space and, as such, have published the first comprehensively labeled dataset of ecologically valid, spontaneous facial responses recorded in natural settings over the Internet.
This data is available for distribution to researchers online, and the EULA can be found at http://www.affdex.com/facial-expression-dataset-am-fed/. For more details regarding this dataset, please refer to the following publication: http://www.affdex.com/assets/13.McDuff-etal-AMFED.pdf
We are also transparent about our methods and
accuracy. As such, our scientific advancements are
regularly published in top, peer-reviewed journals and
publications. These publications are subject to the
scrutiny of top researchers (both emotion researchers and computer scientists).
Our publications cover core accuracy, reach and scalability, as well as applications to advertising testing, media testing, political polling and more.
Our current research explores the relationships between Affdex emotion metrics and their ability to predict consumer behavior. We are publishing results of this recent research, covering predictions of consumer behavior such as short-term sales, likability, desire to view again, box office scores and more.
For a detailed list of our scientific publications, please visit our website at http://www.affdex.com/clients/affdex-resources/.
In Our Labs
The Affdex science team continually invests in the advancement of the Affdex portfolio of webcam-based measures. These advancements have most recently focused on the following three areas:
• New Metrics: We take a data-driven approach to prioritizing new emotion classifiers, based on the expressions we observe most frequently and that are meaningful in the media and market research context.
• Predictive Measures: There is ongoing research into how Affdex measures tie to media effectiveness and consumer behavior.
• Accuracy & Robustness for Existing Measures: To ensure our classifiers work well in real-world conditions, we are always refining the algorithms. Most recent efforts have focused on fine-tuning performance for use on mobile devices such as mobile phones and tablets.
Conclusion
Affdex Automated Facial Coding relies on robust and accurate emotion classifiers. With a strong commitment to industry thought leadership and scientific rigor, Affectiva continually invests in research and development. New tools and techniques continue to improve existing classifiers, as well as to create new emotion classifiers. Affdex classifiers are now in widespread commercial use by Fortune 1000 companies, adding valuable emotion insights to their evaluation of media effectiveness.