White Paper: Exploring the Emotion Classifiers Behind Affdex Facial Coding

Table of Contents

Background
Affdex Emotion Classifiers
    How Affdex Classifiers Work in the Cloud
    Detect & Extract Features
    Classify Emotional States
    Assess & Report Emotion Response
Creating an Affdex Emotion Classifier
    Creating and Training Classifiers
    Face Video Labeling Infrastructure
    Iterative Training & Testing Platform
    Classifier Accuracy
    Classifier Cross Cultural Validation
Industry Thought Leadership
In Our Labs
Conclusion

Background

At the core of the Affdex service lie the Affdex Emotion Classifiers: patent-pending, emotion-sensing algorithms that take face videos as input and produce frame-by-frame emotion metrics as output. Faced with the demanding conditions of natural, uncontrolled settings, these state-of-the-art algorithms have been optimized to deliver highly accurate results, having been put to the test in thousands of studies worldwide. This paper explores the development and validation of the Affdex Emotion Classifiers.

Affdex Emotion Classifiers

Affdex provides two categories of emotion metrics: dimensions of emotion, which characterize the overall emotional response, and discrete emotions, which describe specific emotional states.

The dimensions of emotion that Affdex measures include:

• Valence – A measure of the positive (or negative) nature of the participant's experience with the content.
• Attention – A measure of the participant's attention to the screen, using the orientation of the face to assess whether they are looking directly at the screen or are distracted (turning away) while viewing content.
• Expressiveness – A measure of how emotionally engaging content is, computed by accumulating the frequency and intensity of the discrete emotions (smile, dislike, surprise and concentration); a sketch of one such accumulation follows these lists. Unlike valence, expressiveness is independent of the positive or negative character of the facial expressions.

The discrete emotion measures include:

• Smile – The degree to which the participant is displaying a natural, positive smile. The smile classifier looks at the full face rather than just the mouth/lip area, incorporating other facial cues, such as the eyes, to accurately identify a true smile.
• Concentration – The degree to which the participant is frowning (displaying a brow furrow) in a way not induced by a dislike response, and thus more likely the result of focus, mental effort or even confusion.
• Surprise – The degree to which the participant is showing a face of surprise, indicated by raised eyebrows.
• Dislike – The degree to which the participant is showing expressions of dislike or even disgust, including nose wrinkles, frowns and grimaces.

[Figure 2 – Affdex Emotion Measures]
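The paper does not spell out the exact accumulation formula for expressiveness, so the following is only a minimal sketch: it assumes per-frame 0-100 scores for each discrete emotion and a hypothetical activation threshold, and illustrates the general idea of combining frequency and intensity rather than the production algorithm.

```python
import numpy as np

def expressiveness(frame_scores: dict[str, np.ndarray]) -> float:
    """Hypothetical expressiveness measure: accumulate the frequency and
    intensity of the discrete emotion scores, ignoring their valence.

    frame_scores maps each discrete emotion name ("smile", "dislike",
    "surprise", "concentration") to an array of per-frame scores (0-100).
    """
    total = 0.0
    for emotion, scores in frame_scores.items():
        active = scores > 10  # assumed activation threshold, for illustration
        # Frequency term: fraction of frames where the expression is present.
        frequency = active.mean()
        # Intensity term: average score over the frames where it is present.
        intensity = scores[active].mean() if active.any() else 0.0
        total += frequency * intensity  # accumulate across emotions
    # Clip to the 0-100 range used by the other Affdex metrics.
    return float(min(total, 100.0))

# Example: five frames of scores for two of the discrete emotions.
scores = {
    "smile": np.array([0, 20, 80, 90, 10], dtype=float),
    "surprise": np.array([0, 0, 30, 0, 0], dtype=float),
}
print(expressiveness(scores))
```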
In developing Affdex, we prioritized these emotion measures based on their relevance for evaluating media and advertising. Through our extensive experience testing this type of content, we have accumulated over 300 million facial frames of representative data in the Affdex Facial Video Repository.

[Figure 1 – Affdex Facial Video Repository]

This repository is used to prioritize the development of new classifiers, as well as to improve the performance and accuracy of new and existing classifiers.

How Affdex Classifiers Work in the Cloud

With face videos gathered either online (streamed) or offline in a lab (asynchronous), the Affdex cloud service feeds them to the sophisticated computer vision processes outlined in this section. The Affdex cloud-based face video pipeline consists of three distinct procedures:

• Detect & Extract Features
• Classify Emotional States
• Assess & Report Emotion Response

[Figure 3 – Cloud-Based Emotion Processing]

Detect & Extract Features

The first step is to find a face and, from the face, extract the key regions ("landmarks") needed as inputs to the classifiers. Once a human face is located, Affdex uses 24 key feature points on the face (e.g., the corners of the eyes) to identify three key regions of interest: the mouth region, the nose region and the upper half of the face (eyes and eyebrows). For the Smile and Dislike classifiers, the entire face is defined as a region for enhanced context, leading to improved accuracy.

[Figure 4 – Face Detection]
[Figure 5 – Analyze Key Regions]

Once a region of interest has been isolated, Affdex analyzes each pixel in the region to describe the color, texture, edges and gradients of the face. At this point, Affdex has not evaluated the nature of the facial expression; it has only established that the facial regions exhibit certain characteristics. Evaluating those characteristics is the job of the Affdex classifiers.
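For illustration, here is a minimal sketch of this detect-and-extract step using off-the-shelf tools: OpenCV face detection plus a HOG texture/gradient descriptor from scikit-image. The fixed region fractions and descriptor settings are assumptions made for the sketch; the production system delineates the regions from its 24 tracked landmarks.

```python
import cv2
import numpy as np
from skimage.feature import hog

# Haar-cascade face detector shipped with OpenCV (stand-in for the
# production face detector).
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_region_features(frame: np.ndarray) -> dict[str, np.ndarray]:
    """Detect a face, crop key regions, and describe each region's texture
    and gradients with HOG, one of several possible descriptors."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return {}
    x, y, w, h = faces[0]
    # Approximate the three regions with fixed fractions of the face box.
    regions = {
        "upper_face": gray[y:y + h // 2, x:x + w],  # eyes and eyebrows
        "nose": gray[y + h // 3:y + 2 * h // 3, x + w // 4:x + 3 * w // 4],
        "mouth": gray[y + 2 * h // 3:y + h, x:x + w],
    }
    features = {}
    for name, crop in regions.items():
        crop = cv2.resize(crop, (64, 32))  # normalize size before describing
        features[name] = hog(crop, orientations=8,
                             pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    return features
```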
Classify Emotional States

Affdex classifiers take the extracted facial features and classify them into emotional states. Because most Affdex data is received at 14 frames per second, Affdex can capture both subtle and fleeting facial expressions, even those lasting only a split second. Affdex classifiers employ two classification techniques:

1) Frame-by-frame analysis: classification is made on a single frame.
2) Dynamic analysis: features are analyzed temporally across a sequence of frames.

Combining these two techniques significantly improves the accuracy and robustness of the classifiers. For example, the combined techniques are used to accurately assess a person's baseline state, eliminating the need for calibration.

Once the emotion classifiers have categorized the facial features, the resulting emotions are assigned numeric values for each frame of video and for each emotion classifier. Depending on the classifier, the value corresponds to an increased likelihood of occurrence and may also indicate higher intensity.¹ For example, an Attention score approaching 100 signifies an increased likelihood that the viewer is on task (i.e., "face on camera"). Lower Attention values, on the other hand, indicate the viewer is looking away from the camera, usually an indication of boredom or fatigue (i.e., he or she is "inattentive"). The respondents' emotion metrics are then made available to the Affdex reporting processes.

¹ For some classifiers, like Smile, the classifier's numeric values are also correlated with the intensity of the response.
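As a toy illustration of how frame-by-frame scores and dynamic (temporal) analysis can be combined, the sketch below smooths per-frame scores over a short window and subtracts a per-respondent baseline. The window length and percentile are illustrative assumptions, not Affdex's published parameters.

```python
import numpy as np

def smooth_metric(frame_scores: np.ndarray, fps: float = 14.0,
                  window_s: float = 0.5) -> np.ndarray:
    """Hypothetical dynamic analysis: smooth per-frame classifier scores
    over a short temporal window so isolated noise is suppressed while
    fleeting expressions (several frames at 14 fps) survive."""
    window = max(1, int(round(fps * window_s)))
    kernel = np.ones(window) / window
    return np.convolve(frame_scores, kernel, mode="same")

def rebaseline(scores: np.ndarray) -> np.ndarray:
    """Hypothetical baseline correction: treat a low percentile of the
    respondent's own scores as their neutral state, so no explicit
    calibration step is needed."""
    baseline = np.percentile(scores, 10)
    return np.clip(scores - baseline, 0.0, 100.0)

raw = np.array([0, 0, 65, 70, 68, 5, 0, 0], dtype=float)  # one fleeting smile
print(rebaseline(smooth_metric(raw)))
```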
Assess & Report Emotion Response

The accumulated classifier results for all respondents participating in a study are visualized in the Affdex dashboard, which displays a time series curve for each emotion metric, aggregating respondents' emotional experiences. Each time series can be further segmented by survey self-report responses collected as part of the overall study. For example, Affdex makes it easy to highlight differences in smiles by gender, age, buying intent and more.

[Figure 7 – Affdex Dashboard]

The classifier results are also delivered to the Affdex analytics platform, where they form the basis for normative benchmarks. This normative data is exposed in the summary metrics area of the dashboard, where study results are compared to Affdex norms compiled across thousands of studies. By offering regionally specific norms, Affdex provides important context for interpreting study results.

Creating an Affdex Emotion Classifier

To this point, we have discussed how Affdex classifiers process face videos to yield emotion metrics, and how respondents' emotion journeys are presented in the Affdex dashboard. The science behind the classification process warrants further discussion, especially as it relates to accuracy and real-world robustness. Face videos obtained in natural settings, where lighting, camera position and head pose can be highly variable, pose unique challenges that Affdex classifiers have been tuned to address.

Creating and Training Classifiers

Before an Affdex classifier can automatically identify facial expressions at scale, its algorithm has to be trained to recognize each expression. Classifier training uses a large set of videos from the Affdex repository that have been labeled (or "tagged") to identify that particular expression. To enhance accuracy in natural settings, training videos must represent a diverse population of ages, genders and cultures. With face videos obtained from thousands of studies, Affectiva has amassed the world's largest and most robust repository of facial frames. This repository is essential to categorizing and labeling the data needed to train the machine classifiers used to automatically process studies at scale.

Face Video Labeling Infrastructure

To train Affdex classifiers, a steady supply of labeled face video data is needed. To meet this requirement, we developed infrastructure to support the systematic, ongoing ground-truth labeling of face videos. This investment includes the training of over 20 human labelers, as well as the development of an online facial video labeling platform that manages the labeling process. Our labeling infrastructure allows face video data from the Affdex repository to be assigned to a team of expert human labelers, who systematically code the videos at the frame level, identifying facial expressions of interest, including fleeting and subtle expressions.

[Figure 6 – Video Labeling Process]

Accurate classifiers need accurate labeling to establish the solid "ground truth" on which they are trained. To ensure labeling accuracy, each video is labeled by at least three labelers, and a quality assurance process verifies that there is majority agreement on the label. The labeled video frames are then fed into the machine learning algorithms that train the classifiers. To create and refine our production-ready Affdex classifiers, we continually add new, more challenging data to our training set as it is encountered in production studies. As a result of this investment in classifier training infrastructure, Affdex produces highly accurate emotion classifiers capable of automatically processing thousands of face videos at speeds constrained only by machine resources.
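Here is a minimal sketch of the majority-agreement check described above, assuming each labeler produces one categorical label per frame; the label taxonomy itself is hypothetical.

```python
from collections import Counter

def consolidate_labels(frame_labels: list[list[str]]) -> list[str | None]:
    """Majority-vote consolidation of per-frame labels from (at least)
    three human labelers. Frames without majority agreement are flagged
    as None so a QA reviewer can resolve them before training."""
    consolidated = []
    for votes in zip(*frame_labels):  # one tuple of votes per frame
        label, count = Counter(votes).most_common(1)[0]
        consolidated.append(label if count > len(votes) // 2 else None)
    return consolidated

# Three labelers, four frames:
labeler_a = ["neutral", "smile", "smile", "smile"]
labeler_b = ["neutral", "smile", "smile", "neutral"]
labeler_c = ["smile", "smile", "neutral", "dislike"]
print(consolidate_labels([labeler_a, labeler_b, labeler_c]))
# -> ['neutral', 'smile', 'smile', None]
```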
Iterative Training & Testing Platform

In addition to the labeling infrastructure that creates training and testing data, we have built a cloud-based, plug-and-play framework to automate and streamline the process of training and testing classifiers. This framework supports the iterative development and refinement of classifier algorithms by allowing us to combine a variety of feature extraction methods, such as Local Binary Patterns, Histogram of Oriented Gradients and Gabor filters, with different classifiers, such as Support Vector Machines (SVM) and Random Forests. We routinely evaluate different combinations of features and classifiers to optimize the performance of our algorithms. The underlying technology is improving at a rapid pace, and this rapid-development framework allows us to take advantage of improvements as they become available.
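In that spirit, the following is a minimal sketch of such a plug-and-play evaluation harness, pairing two of the feature extractors named above (LBP and HOG; Gabor filters are omitted for brevity) with SVM and Random Forest classifiers via scikit-learn cross-validation. All settings are illustrative assumptions, not production configuration.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def hog_features(img):
    # Histogram of Oriented Gradients descriptor for one grayscale crop.
    return hog(img, orientations=8, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def lbp_features(img):
    # Uniform Local Binary Patterns, summarized as a 10-bin histogram.
    lbp = local_binary_pattern(img, P=8, R=1.0, method="uniform")
    hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return hist

extractors = {"hog": hog_features, "lbp": lbp_features}
classifiers = {"svm": SVC(kernel="rbf"),
               "random_forest": RandomForestClassifier(n_estimators=200)}

def evaluate(images: np.ndarray, labels: np.ndarray) -> None:
    """images: (n, h, w) grayscale face-region crops; labels: (n,) tags.
    Prints cross-validated accuracy for every feature/classifier pairing."""
    for feat_name, extract in extractors.items():
        X = np.array([extract(img) for img in images])
        for clf_name, clf in classifiers.items():
            scores = cross_val_score(clf, X, labels, cv=5)
            print(f"{feat_name} + {clf_name}: {scores.mean():.3f}")
```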
Classifier Accuracy

We have discussed above how Affdex classifiers are trained on human-labeled data. Assessing classifier accuracy also requires a set of human-labeled test data. This data acts as the "ground truth": the correct classification outcome against which classifier performance is measured. To generate robust accuracy measures, it is essential that this test data reflect the kind of data found in real-world studies. It is equally essential that the videos used to train the classifiers are not also used to test them. Here again we leverage our face video repository to provide a test data set that is representative of real-world conditions and demographics, yet distinct from the training data set. Our classifiers are routinely tested against tens of thousands of frames, whereas most other industry solutions are limited to testing against several hundred video frames. Larger and more representative test data sets yield more accurate results.

As part of the testing process, an Affdex classifier is validated against a number of different accuracy measures and must meet rigorous accuracy thresholds before it is made available for production use. These measures are derived from assessment at two levels: individual face frames and aggregate video responses. Combining the two approaches yields a comprehensive assessment of classifier accuracy.

Frame-level measures compare classifier output to labeler ground truth, frame by frame. Frame-level accuracy measures include area under the curve (AUC) scores, receiver operating characteristic (ROC) curves and precision/recall curves. These measures capture how often a classifier correctly identifies an expression (a true positive), misses an expression altogether (a false negative) or identifies an expression as present when it is not (a false positive). We continuously test new algorithms to improve classifier scores on these measures and to tune classifier sensitivity.

At the ad/media level, we compare aggregate classifier output to human-labeled ground truth across a sample of viewers. For example, consider a study in which 100 viewers watched an ad: the 100 face videos would be classified by Affdex and an aggregate curve produced per metric for presentation on the Affdex dashboard. To assess how close that aggregated curve is to a human-labeled curve, the same facial expressions of all 100 participants are coded by human labelers and the results aggregated. We then compare the two aggregated curves, using correlation and mean squared error (MSE) to quantify the difference between them. Figure 9 illustrates such a comparison of raw curves, with the human-labeled curve in red and the Affdex classifier output in blue; for this curve, the correlation is 85% and the MSE is 3.7 between the label value and the detector output value on a 0-100 range.

[Figure 9 – Aggregate Curve Accuracy]

Through continuous reviews of Affdex classifier accuracy, we have improved the robustness of the classifiers to head pose and movement, as well as to lighting conditions, key considerations for face videos gathered in natural settings. Affdex works well with head rotations of 5-10 degrees up/down and 20 degrees left/right, and head tilting is also handled robustly. With regard to lighting, we quantify the level of lighting on a face on an RGB range from 0 (pitch black) to 255 (very bright); our threshold for accurate classifier performance is 30, as shown at the far right of Figure 10.

[Figure 10 – Lighting Robustness]
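As a rough illustration of such a lighting check, the sketch below computes the mean face-region luminance on the 0-255 scale and applies the threshold of 30 quoted above; the function and face-box format are hypothetical.

```python
import cv2
import numpy as np

MIN_FACE_BRIGHTNESS = 30  # assumed cut-off, from the threshold quoted above

def face_brightness(frame: np.ndarray,
                    face_box: tuple[int, int, int, int]) -> float:
    """Mean luminance of the face region on the 0 (pitch black) to
    255 (very bright) scale described above."""
    x, y, w, h = face_box
    gray = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return float(gray.mean())

def is_classifiable(frame: np.ndarray,
                    face_box: tuple[int, int, int, int]) -> bool:
    # Frames darker than the threshold are flagged as too dark to classify.
    return face_brightness(frame, face_box) >= MIN_FACE_BRIGHTNESS
```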
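The frame-level and aggregate measures described in this section map directly onto standard tools. Below is a minimal sketch using scikit-learn and NumPy, with made-up numbers purely to show the computations.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Frame-level assessment: classifier scores vs. binary ground-truth labels.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])         # labeler ground truth
y_score = np.array([5, 12, 80, 66, 90, 30, 55, 8])  # classifier output, 0-100

print("AUC:", roc_auc_score(y_true, y_score))
fpr, tpr, _ = roc_curve(y_true, y_score)                   # ROC curve points
precision, recall, _ = precision_recall_curve(y_true, y_score)

# Aggregate (ad/media) assessment: compare the aggregated classifier curve
# to the aggregated human-labeled curve with correlation and MSE.
affdex_curve = np.array([10., 15., 40., 70., 55., 20.])
human_curve = np.array([12., 14., 38., 75., 50., 25.])
correlation = np.corrcoef(affdex_curve, human_curve)[0, 1]
mse = np.mean((affdex_curve - human_curve) ** 2)
print(f"correlation={correlation:.2f}, MSE={mse:.2f}")
```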
Classifier Cross Cultural Validation

Affdex has been used in thousands of studies covering over 52 countries, with well over 60% of these studies taking place in emerging markets such as China and India. Our success in emerging markets was preceded by several validation studies that confirmed the accuracy of our classifiers cross-culturally. These studies were jointly conducted with our leading partners, who applied stringent success criteria to the Affdex results.

[Figure 8 – Affdex Studies Worldwide]

To validate our classifiers cross-culturally, we carefully constructed a series of studies that followed this basic methodology:

• Select culturally specific stimulus videos designed to elicit each of the intended emotional responses.
• Field a panel of local participants to watch these videos while Affdex records their expressions.
• Manually label selected videos, using our team of certified FACS coders, for the occurrence of facial expressions.

Our classifiers were then tested on this data set of facial frames. The outcome of these tests indicates that Affdex classifiers perform within the required accuracy range across cultures. The studies did, however, highlight that people in Asian cultures tend to be less expressive than those in other regions, especially in the presence of a moderator in the room. Based on the literature, this is not surprising: it is well known that in Asian cultures a venue-based setup with a moderator may dampen emotional expression. This finding has led to the development of several market-specific features, such as custom dashboard scaling and market-level norms.

Industry Thought Leadership

The Affdex scientists are pioneers in applying machine learning and computer vision techniques to the field of affective computing. We are committed to developing and delivering world-class science that uses cutting-edge techniques. To encourage continued research in the space, we have published the first comprehensively labeled dataset of ecologically valid, spontaneous facial responses recorded in natural settings over the Internet. This data is available for distribution to researchers online, and the EULA can be found at: http://www.affdex.com/facial-expression-dataset-am-fed/. For more details regarding this dataset, please refer to the following publication: http://www.affdex.com/assets/13.McDuff-etalAMFED.pdf

We are also transparent about our methods and accuracy. Our scientific advancements are regularly published in top, peer-reviewed journals and publications, where they are subject to the scrutiny of leading researchers, both emotion researchers and computer scientists. Our publications cover core accuracy, reach and scalability, as well as applications to advertising testing, media testing, political polling and more. Our current research explores the relationship between Affdex emotion metrics and their ability to predict consumer behavior; we are publishing results of this recent research, covering predictions of consumer behavior such as short-term sales, likability, desire to view again, box office scores and more. For a detailed list of our scientific publications, please visit our website at http://www.affdex.com/clients/affdex-resources/.

In Our Labs

The Affdex science team continually invests in advancing the Affdex portfolio of webcam-based measures. These advancements have most recently focused on three areas:

• New Metrics: We take a data-driven approach, prioritizing new emotion classifiers based on the expressions we observe most frequently and that are meaningful in the media and market research context.
• Predictive Measures: Ongoing research examines how Affdex measures tie to media effectiveness and consumer behavior.
• Accuracy & Robustness for Existing Measures: To ensure our classifiers work well in real-world conditions, we are always refining the algorithms. The most recent efforts have focused on fine-tuning performance for mobile devices such as phones and tablets.

Conclusion

Affdex Automated Facial Coding relies on robust and accurate emotion classifiers. With a strong commitment to industry thought leadership and scientific rigor, Affectiva continually invests in research and development. New tools and techniques continue to improve existing classifiers, as well as to create new emotion classifiers. Affdex classifiers are now in widespread commercial use by Fortune 1000 companies, adding valuable emotion insights to their evaluation of media effectiveness.