>> Daniel Povey: I think most people here know Florian, but just for those who
might be on video, Florian finished his Ph.D. in 2005 at Karlsruhe. He was
working with Alex Waibel. He then worked for a few years at Deutsche
Telecom on various speechy things and core analytics. Now he's faculty at CMU
doing lots of interesting research, including this stuff.
So I'm going to let you go for it, Florian.
>> Florian Metze: Thank you. So thanks everyone for having me. And thanks
for allowing me to be here. So this is really work that started while I was still a
post-doc at Berlin.
I started work with Tim, who is now a Ph.D. student at Deutsche Telecom
Laboratories and Technical University of Berlin. So you were asking, there's no
CMU affiliation on the title. That's really because Deutsche Telecom is funding
Tim and I'm doing it in my spare time, because I like the topic
that we're working on.
So we're trying to find out how much personality there is in speech, and what
added value speech can bring to an assessment of personality if you look
into it.
So, brief overview. I want to start by explaining how you can assess personality
in the first place, and verify that you can assess personality in speech, because
it's not at all clear that this is something you can reliably do; we're going to
make sure that's the case.
We collected a database on which we then ran a couple of experiments, on
both how humans assess personality in speech and how machines can mimic
that process: we first do a human baseline experiment on personality
assessment from speech and then see how well a machine can do the same.
And feel free to interrupt me at any time during the talk. I see doubtful faces
already.
>>: I'd like the various presidential candidates' speeches.
>> Daniel Povey: Oh, yeah, [laughter].
>>: They're all possessed. [laughter].
>>: Whatever you'd like them to be.
>> Florian Metze: So that would probably be a cross-cultural perception of
speech thing then because the work we're doing is in German. I'm going to play
you some German speech and let you assess the personality and that's always
good for interesting discussion as well. If not laughs.
So now, why do we want to do this? Well, you all know about emotion detection
from speech, and that's kind of useful, but not really, because the only emotion
you ever find is anger. And that's maybe 5 percent of the time; 95 percent of the
time, at least in call centers, it's neutral speech.
So maybe if you knew what personality a person has, it would be easier and
more reliable to detect emotions. A neurotic person is maybe going to express
emotion differently than a very extroverted person. So if you knew about a
person's personality you could normalize emotion detection.
>>: How many classes do you have?
>> Daniel Povey: We work with ten classes, but of course the scheme that we're
using is sort of a profile. So it's continuous scales. So we could have any
number of classes.
>>: Ten emotions or --
>> Daniel Povey: Ten personalities.
>>: [inaudible].
>> Florian Metze: Five scales. Openness, conscientiousness, agreeableness,
extroversion and neuroticism. I remember them by now.
>>: They're separate from emotion, or are they related?
>> Daniel Povey: Of course they're related, but they're different. So let me -- I'll
come to that in a minute. Of course, it would be interesting, if we do speech
transcription, if we could annotate speech with properties saying this was this
type of person saying this. And another thing -- this is really also why Deutsche
Telecom is interested in this type of work -- they spend a lot of time on their
corporate brand. Whenever they do ads, whenever they have voice or call
centers, they spend time to make sure that the voice they use in the call center
transports their values. So they do brand monitoring and quality assurance: if
you hear a voice in a Deutsche Telecom call center, at least an automated one,
they want to make sure that the voice instills in you the values that Deutsche
Telecom stands for. It's a former state-run company, so they're reliable; they're
your grandfather's company; they don't have to be young and hip, but they
want to be solid and reliable.
So any voice they want to use in the system, and any voice they want to use
in a TTS or speech synthesis system, should also transport these qualities that
they stand for -- so to speak, personality. So if you had an automatic system to
assess that, that would be a useful instrument to make sure that the dialogue
systems, for example, transport the brand values.
I don't think they're quite there yet. But that's something this could be used
for. And, of course, in any kind of human-computer interaction it would be
useful if you had personality information, both for recognition and synthesis, if
you could synthesize voices with different emotions or personalities.
So the work we're doing could also be used to assess the personality, the
impression, that a synthetic voice leaves on the hearer.
So that's what we are working towards. Now, what are personalities? I can
read you this definition, which I like: it's a dynamic and organized set of
characteristics possessed by a person that uniquely influences his or her
cognitions, motivations and behaviors in various situations.
Now, that's one of many definitions of personality; it tells you a little bit, but it's
not very precise. And that's part of the problem here. I think what it really boils
down to is an analysis of how people and their perceived behavior differ. So
you're looking for a property in a person that others can detect, not directly, but
because people do things in certain ways.
So you only observe personality by proxy, if you want: this person speaks like
that, this person does things like this, and that other person does things like
that, and that's because they have different personalities, for lack of another
observable property that differs between people.
The important thing about personality and also the differentiator between
personality and emotion is that personality is meant to be stable over time.
Emotions can change very quickly. So your personality wouldn't change over
days, weeks or even months while emotions can change within minutes.
Between the two, though, I wouldn't argue that there is a clear-cut separation
between personality and emotions. They are speaker states, if you want, or
traits, and at least at this point we can't say with certainty, well, this is an
influence of personality and this is an influence of emotion. You always have
certain qualities. But here we're looking for something that's stable over time.
And in psychology there is what's called the Five-Factor Inventory, FFI, that
uses five factors to describe a person's personality profile, which you can use
to measure and predict the sort of habitual patterns that a person has in their
thoughts, behaviors and the way they express their emotions.
>>: So in the corporate world we normally have the perspective of trying to be
dominant; is that true? Is that the kind of analysis?
>> Florian Metze: It's similar. In marketing they have these personas or
prototypical --
>>: There's four.
>> Florian Metze: There's several -- several different -- I think Germany has a
different set of personalities, or what do they call it, Sinus-Milieus, where they
have different types of people to do marketing or observational studies. And I
think they're different from what people use in the U.S.
But I think this way of representing personality is kind of an academic way of
describing it, where you have a space in which you can then distribute your
personas or your prototypes, if you want.
But in theory you should be able to map other schemes into this scheme; how
well that works, and whether one scheme is more suitable to your purpose than
another, of course depends on what exactly you want to do.
But this NEO-FFI framework that we're using has been developed over the
course of several decades of research in psychology. It's essentially measured
using a questionnaire of 60 questions that people answer either about
themselves -- then it's called self-assessment -- or about somebody else whom
they know.
Then it's a third-person assessment. And this third-person assessment is
generally considered more reliable because people tend to see themselves
differently than other people see them and in the end what matters or what's
more consistent, more reliable is how other people see a person.
So with this type of questionnaire, the NEO-FFI, about 110 studies have been
performed in English with 24,000 participants, and in German -- which is what
we're using; this work is in German -- about 50 studies have been done with
2,100 participants. So this is the population against which we can compare the
personalities and the profiles that we find. So if we have a certain profile, we
know where in the space of all these personality profiles we are: whether 20
percent of the population is more neurotic than this person, or 80 percent is
less neurotic, and so on.
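Locating one person's factor score within that norm population, as he describes, is just a percentile rank. A minimal sketch; the norm scores here are synthetic stand-ins, since the real norm distributions come from the published NEO-FFI manual:

```python
import random

# Synthetic stand-in for a norm sample of ~2,100 respondents; the mean
# and spread here are invented, not the published NEO-FFI norms.
random.seed(0)
norm_neuroticism = [random.gauss(21.0, 8.0) for _ in range(2100)]

def percentile_rank(score, norms):
    """Fraction of the norm population scoring strictly below `score`."""
    return sum(1 for x in norms if x < score) / len(norms)

# A speaker scoring 30 sits above the synthetic norm mean of 21:
p = percentile_rank(30.0, norm_neuroticism)
print(f"about {p:.0%} of the population is less neurotic than this speaker")
```

With the full published norm tables, the same lookup tells you exactly where a measured profile sits relative to everyone who has filled out the questionnaire.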
>>: There's another framework that actually I think companies use also, which I
don't remember the name of it. But it's got judgmental, sensing, introverted,
extroverted and there's a couple more. Do you know what I'm talking about.
>> Daniel Povey: I know there's a six-factor inventory. There's purely
categorical schemes and continuous --
>>: Intuitive versus something. Intuitive was another axis.
>> Florian Metze: There's a -- I may have come across the one you're talking
about, but we've only really worked with this one. Also, you have to pay for this
questionnaire if you use it -- you pay a dollar or a Euro or something per
questionnaire.
But probably any questionnaire has its advantages and its disadvantages, and
there's a lot of debate about which one's better. But from what we could find,
this one is the most established, and also one that exists in essentially the same
form in English and German and other languages.
So you could do it across languages or cultures if you want. So what are the five
factors that I've been talking about? Yeah, I found this image on the Internet so I
just had to use it.
It's neuroticism, extroversion, openness to experience, agreeableness and
conscientiousness. The abbreviation is OCEAN, so you can remember them.
Each of these five factors is expressed as a value between 0 and 48,
corresponding to low and high.
So if you have a low score on the neuroticism axis or neuroticism scale, that
means that you're not very neurotic. If you have a high value on the neuroticism
scale, that means that there's a lot of neuroticism in your personality profile. This
profile is determined using a questionnaire with 60 items, which you answer on a
5-point Likert scale. So you answer questions like: I like to have a lot of people
around me. Strongly disagree, disagree, neutral, agree, strongly agree.
Or I often feel inferior to others, strongly disagree, disagree, neutral, and so on.
I laugh easily. You either do that about yourself or about somebody else. And
from these answers there is a set of rules, an algorithm, that computes this
personality profile. It's not that any single question directly maps to one of
these personality factors, but it is essentially a linear combination of your
answers to these questions which determines your personality profile as a set
of five numbers between 0 and 48 that describe how you're supposed to be.
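The questionnaire-to-profile mapping he describes can be sketched as a keyed sum. This is a toy illustration only: the real NEO-FFI item-to-factor key is proprietary, so the item assignments and reverse-keying below are invented.

```python
# Toy NEO-FFI-style scoring. The real item key is proprietary; this
# invented key just assigns 12 items to each factor, some reverse-keyed.
FACTORS = ["N", "E", "O", "A", "C"]

# item index -> (factor, reverse-keyed?); 60 items, 12 per factor
ITEM_KEY = {i: (FACTORS[i % 5], i % 3 == 0) for i in range(60)}

def score_profile(answers):
    """answers: 60 Likert responses coded 0..4
    (strongly disagree .. strongly agree).
    Returns factor -> raw score in 0..48 (12 items x 0..4 points)."""
    profile = {f: 0 for f in FACTORS}
    for i, a in enumerate(answers):
        factor, is_reversed = ITEM_KEY[i]
        # Reverse-keyed items count disagreement toward the factor.
        profile[factor] += (4 - a) if is_reversed else a
    return profile

# An all-"neutral" respondent lands exactly mid-scale on every factor:
print(score_profile([2] * 60))  # every factor scores 24
```

Each item contributes linearly to exactly one factor, matching the "linear combination of your answers" description; the printed profile is the set of five numbers between 0 and 48.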
Now --
>>: So these are five different axes that correlate?
>> Florian Metze: I'm going to show some statistics on that later. Of course,
they're not completely independent; they're independent to some extent. But
yes, if you want to see them as a coordinate system, a five-dimensional space,
then that's ideally what it would be.
Now, what does this personality profile mean? If, for example, somebody has
a low value on the openness scale, he would be described by words like
conventional, simple, unimaginative, literal-minded. If a person had a high value
on the openness scale, he would be described with words like curious, original,
creative, artistic, intellectual, and the same for all the other axes.
So the way to interpret such a profile is to say: the more towards one end of
a scale you are, the more likely people would be to use any of these words to
describe your personality.
And now what you can do: if you take a certain sample of a population -- 20
people, which is what we've done here -- and average their personality profiles,
you get a graph like this.
You see the neuroticism axis and so on; here's the median value and the
interquartile ranges. And because you know the total distribution of personality
profiles, from all the 2,000 questionnaires that have been filled out in German or
the 12,000 filled out in English, you know what the rest of the population is
doing, so you know where you are with respect to the total population.
So this is what a personality profile of a single person or group might look like.
Now, of course, the question is how good, how independent, are these five
axes. Here, for the German NEO-FFI, the manual -- the book that describes it --
contains a table that shows what's been computed, and you see that the
off-diagonal correlations between these axes are, I would say, relatively well
behaved. Maybe in physics or in statistics you would get even lower
correlations, but for a table on psychological data, I think the argument is that
these numbers are relatively low, relatively close to 0.
But what you notice, for example, is that there is a strong negative correlation
between neuroticism and extroversion. So a person who has a high value on
neuroticism tends to have a lower value on the extroversion axis, and the other
way around. That's kind of intuitive: you would imagine it would be difficult to be
neurotic and extroverted at the same time.
So that's what this strong negative correlation means. But all the other numbers
are relatively close to 0, if you want. So to some extent these axes are really
independent, but of course they're not completely independent in the sense that
you could vary them totally at will.
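The off-diagonal entries in such a table are Pearson correlations between two scales' scores across respondents. A toy sketch with made-up scores that reproduces the neuroticism/extroversion pattern:

```python
# Pearson correlation between two lists of factor scores (toy data).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Invented scores: speakers high on neuroticism are low on extroversion.
neuroticism = [40, 35, 30, 12, 10, 8]
extroversion = [10, 14, 20, 33, 38, 44]
print(round(pearson(neuroticism, extroversion), 2))  # strongly negative
```

Values near 0 for the other pairs are what supports treating the five scales as roughly independent axes.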
>>: So early on you gave the example of a company that might want to project a
mature, stable, reliable image. And you mentioned a grandfather voice. What
would the grandfather voice be in this --
>> Daniel Povey: We do have -- so we do have our sort of Deutsche Telecom
corporate voice, which is a guy recording for our database; he recorded a
number of sentences in his Deutsche Telecom voice, and this is the profile that
we computed for this voice. That's exactly these bars.
We computed his average profile over a number of samples that he had given.
He scored high on the conscientiousness scale and relatively low on the
neuroticism scale, and average values, if you want, on extroversion and
openness, and maybe slightly towards the positive side on the agreeableness
scale.
And that would be maybe your grandfather voice. But that's the personality
profile --
>>: Neurotic.
>> Florian Metze: Yeah. I guess that's what you want to be, right? So that's
the -- that's the voice that Deutsche Telecom is happy with. Whether the
marketing folks could also be happy with a voice that has a different profile, we
don't know yet; we haven't done a lot of studies on that. But that's the profile of
a voice that they're happy with.
So, of course, there has been some other work on assessing personality from
speech, mostly on analysis of psychiatric patients. Klaus Scherer [phonetic] did
a lot of work on that, and also Apple -- Apple the person, not Apple the
company.
They found that extroversion can be estimated from speech reliably; extroverted
speakers, for example, speak louder and with fewer hesitations. So they
established that you can take speech and derive from it cues about the
personality of a speaker, or, if you know what the personality is, make
predictions about how this person is going to speak.
Recently, François Mairesse was doing work on textual information, also
intensity and pitch, and he also found that extroversion could be modeled well,
followed by what he calls emotional stability, which is essentially another term
for neuroticism. He was using a different scheme than we are, and this also, of
course, tells us that personality doesn't only influence the acoustics of how
things are being said: if you describe an image or a situation, an extroverted
person is going to use different words than an introverted person, for example.
That's also, of course, something that we need to control. And then of course
there's Clifford Nass, who did a lot of work on human-computer interaction.
He claims that if you talk to a dialogue system, for example, we assign a
personality to this dialogue system, to this computer, and we tend to have
positive feelings towards a personality we encounter if it has characteristics
similar to our own personality.
So he did studies in which people liked a dialogue system better if it was
extroverted and they were extroverts themselves.
So this is sort of -- I'm not sure if this is a completely thorough understanding of
the problem. But it gives you an indication that personality is important to take
into account in human computer interaction and he's been exploring some of
these ideas.
>>: [inaudible].
>> Florian Metze: No, it's a person. It's a researcher. Not the company. He's a
psychologist.
Okay. So with this, we started collecting a database, in three parts. In the first
part, we record a professional speaker: we give him a fixed text and ask him to
record this single fixed text in various personalities, so that we have a baseline
understanding, for acted personalities, of what changes and whether people
and machines can pick it up. The text is always kept fixed. In the second part of
the database we relax this fixed-text constraint and allow the speaker to use
different words when he is in different personalities; so somebody listening
might be able to pick up on the personality from the vocabulary, from the lexical
information.
That's the difference between the first and the second part.
And in the third part we put people from the street, our subjects, into the same
situation as the professional speaker in the second part: we ask them to
describe various images, and then we get other people who know them well to
do their personality profiles, and we see if we can pick up, speaker-independently
and in a non-acted situation, cues from the acoustic speech towards the
personality of a person.
So we really have three conditions: two acted conditions, with fixed text and
free text, and then non-actors, real persons if you want, doing the same free-text
exercise. Our goal is to take these three parts of the database, see what
changes from condition to condition, and try to understand how good humans
are at picking up personalities and how well machines do.
Now, let me give you an example of the data. I think I briefly tested the audio.
So this is a standard text like one that could appear in a Deutsche Telecom call
center, and in fact the professional actor is recording it in a Deutsche Telecom
voice: sort of a neutral message, something that's a bit positive, a bit negative,
that's friendly but not overly friendly. So something that we hope could be said
in any personality.
[German].
>> Florian Metze: That's your grandfather voice. And here's the English
translation of the text that this guy has been reading and acting. Now, what
we're going to do is, we have the personality profile, and we ask this
professional speaker to record variations of this text with low neuroticism and
high neuroticism, low extroversion and high extroversion, and so on: the ten
extremes on these scales. To allow him to do that, we gave him the
descriptions of the personality profiles that we find in the NEO-FFI framework.
So he gets a piece of text that essentially contains these descriptions: persons
with high neuroticism are expected to be controlled, anxious and so on.
And he gets some time to put himself in the mood, if you want, and then he
produces various readings of always the same text in different personalities,
which we then recorded. So we have about 75 minutes of this 20-second text in
various personalities. Now, I know that there are German speakers in the room.
>>: How can you decide on the numerical value of this?
>> Florian Metze: How can we decide on the numerical value? What we do
when we have these recordings is we have people come in and listen to the
text, to the audio recording, and fill out the questionnaire. So they listen to it
and they say, oh, I think this person laughs easily, or this person likes to have
other people around him.
And by filling out this questionnaire, we get the numerical value. We get the
personality profile. And the question really is, is it reliable? If people only have
about 20 seconds of speech, or maybe a minute, can they reliably fill out this
questionnaire and therefore assign a personality profile?
But that's what we did. So we had a speaker produce many examples of
different personality, different personality prototypes and then have people listen
to that and fill out the questionnaire. They were free to listen to it as often as
they liked.
So they could listen to it many times. There was no time pressure or anything,
and then they fill out the profile. So --
>>: Is this the way your evaluation data is generated or the way your training
data is generated?
>> Daniel Povey: Both.
>>: If humans aren't any good at evaluating personality through speech but
computers could be, you're limited by the humans.
>> Florian Metze: Well, it depends. So here -- this part of the database we have
active data. So it could be that machines are better than humans at picking out
the intended variation that the actor produced, because we know what the actor
was supposed to do. So we could train a system that has super -- sorry for using
the term -- super human performance.
And in fact we did two different experiments. So we did the classification
experiment, where we tried to reproduce or where we tried to figure out what the
actor was doing, and we did the regression experiment where we tried to
reproduce the ratings that the humans would assign to the speech, no matter
what the actor was trying to do.
>>: So have you just tried using ranking [inaudible].
>>: Ranking.
>>: Yeah, ranking. Basically, this one is more something like that? The other
person, just compared to the --
>> Daniel Povey: That's an interesting point. If you listen to two samples, can
you, would it be easier to describe a difference between two speakers and that's
work that I would like to do, that I would like to get funding for.
But I don't have any funding for that yet. And I don't know what the answer is
going to be. We've played a little bit with that idea: what do you get when you
play two samples and tell somebody, don't tell us what you think this person is,
but tell us how different they are.
If we really have a metric space, if you want, in which different people are
located, like these personality profiles: if I take a profile that's low here and a
profile that's high here, but otherwise the same, would people then say, oh, this
voice sounds a lot more agreeable than the other voice?
So I don't know. But I would like to do that experiment, yes.
>>: Another one: does that mean -- can the actor easily put on a different
personality?
>> Daniel Povey: Yes, I told you that personalities don't change and now I have
an actor producing personalities at will. Clearly there's a problem here.
The actor -- what this guy does is he dubs movies. So he's been doing a lot of
movies, playing different characters in different movies. What he gets in that
context is a short description of what the character is. He reads the script and
all that, and then he produces different voices. So somehow this actor has a
superhuman capability of producing different voices even though his personality
doesn't really change.
But, of course, we're open to criticism in the same sense that emotion detection
on acted emotions detects something that's very different from real emotions.
But it tells you something about the space of the problem you're in, and, for
example, for synthesis experiments and for analyzing synthesized personalities
or emotions, this data is very useful.
But, yes, the basic fact is that here we have one person producing speech for
different personalities.
>>: The phrases you used are kind of very formal and polished. So I'm
wondering how much you can use -- if a person talks freely, say in an extreme
emotional state, I would expect that the speech stumbles at some critical
moments and you would also use different words to express something, some
specific thing. Can you estimate how much information is encoded in the
wording?
>> Daniel Povey: Yes, so this is the first part of the database where we have a
fixed text. The only thing that changes is the acoustics. And the second part of
the database, we allowed a speaker to use different words and then the third part
of the database we allow different speakers to do different, to use different words
in the same condition. So that's exactly why we designed these three steps.
>>: So roughly --
>>: What does the voice sound like?
>> Daniel Povey: Let me play a German neurotic voice. The typical way I do
this, because it's so much fun, is that I ask you to close your eyes for a second
and I'm going to play you one of low agreeableness, high neuroticism and high
extroversion and I'm going to have you guess what it is.
>>: [inaudible].
>> Florian Metze: Say again.
>>: [inaudible].
[German].
>> Florian Metze: So what's the guess for what that is.
>>: Extrovert.
>> Florian Metze: A is agreeableness.
>>: The big E.
>> Florian Metze: That's high extroversion. The other choices are low
agreeableness and high neuroticism. I'll play those now.
[German] [laughter].
>> Florian Metze: So is that low agreeableness or high neuroticism? Anybody
else.
>>: That would be neuroticism.
>>: I imagine it would be more like --
>>: Different --
>>: Woody Allen.
>>: I think this is how he interpreted neuroticism, like being really sad and kind
of --
>> Daniel Povey: You guys are good. This in fact is high neuroticism: I'm going
to go away and kill myself now. If you have low agreeableness, that's going to
be: I'm going to kill you if you ever call me again, pretty much. At least that's --
[German].
>>: That's like what I hear on the phone. Very passive aggressive.
>> Daniel Povey: Must be something about the German in there, I don't know.
>>: This is how German sounds like.
>>: Can you play the --
>> Daniel Povey: The middle one.
>>: The neutral.
[German].
>> Florian Metze: So those are a sample of the ten different variations and the
neutral middle version that we have. And even though it's cross-lingual or
cross-cultural, there seem to be a few universal things in there that allow you to
pick it up. Clearly it's something that's dependent on language, but it's also
clearly something that you want to be able to understand in a human-computer
interaction system.
I'm sounding like [inaudible]. If you build speech synthesis, you want to be able
to do these kinds of things. And an automatic system should also be able to
pick up on these types of differences and do something with it.
Now, here's one fun fact for you. Do you recognize this?
>>: Wait. Wait.
>>: Yes. It looks like that guy [inaudible].
>>: The guy --
>>: Is that -- Jason -- what is his name?
>> Florian Metze: I don't know what his name is. The question is who are these
guys.
>>: "American Pie".
>> Florian Metze: That's the movie "American Pie" and our speaker, he's the
German voice of Kevin. I don't know if Deutsche Telecom knows.
>>: That's not really good for their brand. [laughter].
>> Florian Metze: That's what this guy does for a living. He dubs movies.
Okay. So what we've done is we've recorded a total of 75 minutes of this
speaker reading this one sentence in various personalities, and more than three
hours of this speaker describing a selection of these images in various
personality styles. And the idea here is that we put him in a task that he has to
do that doesn't involve any other human who could influence the way he
speaks.
These images are a selection of romantic, abstract, scary, peaceful images.
Depending on what image he's looking at and describing, and on what
personality he's doing, that should influence the way he's speaking.
Most of these images are also taken from what's known as the Thematic
Apperception Test, which is a standard psychology test where people look at
images, and the words they use to describe these images give some clues
about what personality or mental state a person is in.
And that's all been done in a studio, with high-quality recordings, no noise and
so on. And, again, also for the non-acted personalities -- the people from the
street that we brought in -- we did the same image tests and also a few
human-interaction domains, where people talk to each other, to get data in a
situation comparable to the recordings of the actor. I would like to report on the
results of these, but we're not done yet. Tim is a young father and work is going
slower than we've been hoping.
But now, what do we get? We have this database with ten different types of
recordings of always the same text. The first thing we did is we selected from
these 75 minutes -- from all these samples; I think we have 20 or 30 repetitions
of each utterance and each personality profile -- and gave two students the task
of rating every sample individually. So we ranked them according to
naturalness.
And we threw away the ones that sounded the least natural, so we don't have
any obviously faked or obviously acted samples in our database. Then we had
in total 87 test persons come in, listen to the samples, and fill out the personality
profile of the voice that they were listening to, and they could listen to the data
as often as they wanted.
And this is what we get. On the five scales, for the variations towards low
values and the variations towards high values, we get medians and interquartile
ranges, and these are clearly separable. For example, extroversion seems to be
separable reasonably well. Neuroticism works better. Agreeableness works about
the same as neuroticism. Conscientiousness works quite well. And openness is
something that either our speaker isn't able to separate very well, or this
text isn't good for distinguishing, or it's just not something that you can
pick up from speech very well. But the key point is that, in this setting,
this is something that you can pick up and that humans at least can tell apart.
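The median-and-IQR readout just described can be sketched in a few lines. The rating arrays below are invented stand-ins for listener scores, and the non-overlapping-box criterion is only one plausible way to operationalize "clearly separable":

```python
# Sketch of the separability check: for each Big Five scale, compare listener
# ratings of the "low" vs. "high" acted variants via median and interquartile
# range (IQR). The ratings here are invented, on a 1-5 scale.
import numpy as np

def median_iqr(ratings):
    """Median and IQR of a 1-D array of listener ratings."""
    q1, med, q3 = np.percentile(ratings, [25, 50, 75])
    return med, q3 - q1

def separable(low_ratings, high_ratings):
    """Crude separability criterion: the two IQR boxes do not overlap."""
    return np.percentile(low_ratings, 75) < np.percentile(high_ratings, 25)

# Hypothetical ratings for the low- vs. high-extroversion stimuli.
low = np.array([1.5, 2.0, 2.2, 1.8, 2.1, 2.4, 1.9])
high = np.array([3.8, 4.2, 4.0, 3.9, 4.5, 4.1, 3.7])
print(median_iqr(low), median_iqr(high), separable(low, high))
```

With real data one would run this per scale and per variant, which is essentially what the medians and interquartile ranges in the plot summarize.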
Now, when they do the same with the free speech -- the descriptions of these
images, where the textual information being said also varies -- originally we
were expecting that, since we're adding another dimension of information, this
should be easier to pick up. If I have a description of an image in a neurotic
personality versus a non-neurotic personality, and I can also vary the text,
that should make the distinction between the two even easier, because the words
would also be different. But it turns out that this is not the case. So here
is the old image for comparison.
The overall structure of the result is quite similar. Openness again can't be
distinguished, but even the other factors are closer together. So our
understanding for the moment, although we don't have any way of proving it, is
that the actor is sort of hyperarticulating, if you want, in the read speech,
the fixed-text case: because he's always producing the same text, he has
learned very well how to do that in the various personalities. What we could
do, for example, is have him respeak things he has said about an image, over
and over again, to see whether that is an effect of entrainment for the actor,
or whether it's really the length of the utterance, or anything else that
differs between the two, that makes it harder for him to produce or for people
to pick up.
At this point we don't know. But it's interesting to note that it's easier to
pick up these personalities from a short sample that's been read than from a
longer sample where we also vary the lexical content. We do have
transcriptions of the data now, but we haven't done any lexical analysis to see
which words were varied, if any, and what's going on there. That's on the list
of things to do.
Now, the next thing that we looked at: if this is the table of correlations
between the different personality factors, how does it look for speech?
The top one, the NEO-FFI, is the one from psychology. How does that look for
us, for the fixed-text case and the free-text case? The first thing, of
course, is that overall our correlations are much noisier, because we have a
lot fewer samples.
But if you look at the free-text case, for example, where we do have the most
data, we see that overall the absolute values of the off-diagonal correlations
tend to be smaller than in the fixed-text case, so they tend to be better.
And the biggest value, the biggest negative correlation, is again between
neuroticism and extroversion -- sort of the same structure as we see in the
NEO-FFI case, the psychology case. So, given the number of samples that we
have, I wouldn't call it a significant result yet, but it seems plausible that
this is going in the right direction: all the other values tend to be smaller
than here, and in many cases they have at least the same sign as the
correlations that we find in the --
>>: They're coming from the real FFI test.
>> Florian Metze: Yes.
>>: So you use that as --
>> Daniel Povey: Yes.
>>: And these are human.
>> Florian Metze: Yes, these are all human listening tests.
>>: [inaudible].
>> Florian Metze: Yes. What we're trying to understand here, I mean, is
whether this is a good protocol to use: can people pick up personality from 20
seconds of speech, or from longer samples, or are we just measuring --
>>: Are you sort of seeing if people can catch what an actor is saying?
>> Daniel Povey: Yes.
>>: Which I think is different from personality. It's like, can you read
[inaudible]. I mean, why not do something like go to NPR, where they have a
lot of interviews -- they interview lots of people a day. You get all these
samples of people they interview, and then you go to other people and say,
look, does this person sound neurotic, does this person sound agreeable? Then
you're working with data that just naturally occurs, but it's still going to
come from a big range of personalities, probably, because of whom they're
interviewing.
>> Florian Metze: Yeah, we could do that, and other people have done that.
Why we didn't: we started off with the single speaker because that's also what
the company was interested in, and because we can also use it for synthesis.
So there are things we can do with this data that you can't do with that type
of data.
>>: But the natural interview, you still need to provide the test to be able to know
whether that person actually [inaudible].
>>: Real behavior, not acted behavior.
>>: And you could tell if people agree at least or don't agree. But if they're
politicians...[laughter].
>>: Interview authors, movie directors.
>> Daniel Povey: There would be a couple of very interesting things to do.
One would be to take not the original soundtrack of movies, but the sound
recordings that these guys make for movies, because then you have it without
background music, without noises. And they do the same thing.
Lots of material in different personalities. Or, as you say, take interviews.
There is work -- a group in Trento, in Italy -- they tried to figure out the
personalities of call center operators, people in the call center, and in
their first look at the data, both the human assessment and the automatic
recognition rates were very, very poor.
So they added the telephony channel; I don't know if that's the difference.
But, yeah, it's clear that with an actor we're making the task a lot easier,
and we have less data, while when you go to the telephone and take samples,
you make the task harder but you have a lot more data.
>>: I don't know. So I had to interact with American Express to make a
reservation today for ICASSP. And I'll say, as usual, I could tell within like
ten seconds of the person saying hello whether the person was going to be able
to solve the problem or not.
Like competent versus not competent. I claim you can determine that on the
telephone with an operator within ten seconds, five seconds.
>>: I agree. You have an estimate of a personality. Whether it's accurate is
maybe one thing, or whether it corresponds to what people who know this person
would also assign to this person -- you don't know. But it's clear that in
your application domain, that's what you're eventually after, that's what
you're eventually interested in. You don't need a personality; you want to
know, is this person going to solve my problem or not.
And for that you may not need this whole five-factor-inventory shenanigans.
>>: It tends to be one of the personalities -- still a target somehow.
>>: With the machine.
>> Daniel Povey: So the next step that Tim is currently working on is taking
the personality labels that people who know our subjects well assign to them --
to the people who have been giving us speech -- and seeing how the personality
assigned on the audio corresponds to the personality assigned by people who
know them well. But we don't have those results yet.
But clearly the big-data approach is always: collect it somewhere, label it,
and see how reliable it is. And that's just as valid, I think, as doing what
we are doing here.
Another thing that we analyzed: if we do the recordings at different points in
time, and also if we do the annotations, the labeling, at different points in
time, does that influence the results? Of course, the statistics get poorer if
we divide our set into three subsets and have each one labeled at one-month
intervals by the same people. But to the extent that we have reasonable
statistics, in 36 out of 40 tests we are confident that the time of the
recording and the time of the labeling doesn't influence our results --
something we wanted to make sure of, so that this is a reproducible experiment.
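A sketch of how such a session-effect check might look. The talk does not name the statistical test used, so a Kruskal-Wallis test on hypothetical ratings stands in here; it is a plausible choice for small, ordinal rating samples:

```python
# Sketch of a stability check: split the ratings of one stimulus into the
# three labelling sessions (one-month intervals) and test whether the session
# has a significant effect on the ratings. Kruskal-Wallis is an assumption;
# the actual test used in the study is not named in the talk.
from scipy.stats import kruskal

def session_effect(session_a, session_b, session_c, alpha=0.05):
    """True if the three labelling sessions differ significantly."""
    stat, p = kruskal(session_a, session_b, session_c)
    return p < alpha

# Hypothetical ratings of the same stimulus from three sessions.
a = [3.0, 3.5, 3.2, 3.1]
b = [3.3, 3.0, 3.4, 3.2]
c = [3.1, 3.6, 3.0, 3.3]
print(session_effect(a, b, c))  # no session effect -> result is reproducible
```

Running one such test per scale and per recording batch would give the "36 out of 40 tests" style of summary mentioned above.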
And we looked briefly -- let me skip over that a little bit -- we also looked
at, if the actor varies one axis, how does it influence the perception of the
other axes. For example, if he increases neuroticism, we find a decreased
perception of extroversion; if he increases extroversion, we find a significant
decrease in perceived neuroticism.
And that's also consistent: you would expect this negative correlation between
the values assigned on the neuroticism and extroversion scales. Other than
that, there don't seem to be too many other interplays, as we call them.
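This cross-axis effect can be illustrated as a signed correlation between the intended level on one scale and the perceived level on another; the numbers below are hypothetical:

```python
# Cross-axis interplay sketch: when the actor raises one scale, does the
# perceived value of another scale drop? A negative correlation between
# intended neuroticism and perceived extroversion is what the talk reports.
# The per-stimulus data here is made up for illustration.
from scipy.stats import pearsonr

intended_neuroticism =   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
perceived_extroversion = [4.1, 3.9, 3.6, 3.8, 3.0, 3.2, 2.5, 2.7, 2.1, 1.9]
r, p = pearsonr(intended_neuroticism, perceived_extroversion)
print(round(r, 2))  # strongly negative in this toy data
```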
Now, the last part is the signal-based analysis: if humans can do it, how well
can machines do it? We're doing two different things: predicting the human
ratings, and classifying the intended personality. And, of course, we're
taking the standard approach. We extract a large number of audio descriptors
-- MFCC features, harmonics-to-noise ratio, zero-crossing rate, intensity,
pitch, loudness -- for voiced, unvoiced, and silent segments.
We extract these with and without silence, for a total of about 1200 features.
We use an information gain ratio filter to rank these features according to how
well they could work in a classifier, use tenfold cross-validation with an SVM
regression and an SVM classifier, and build a classifier -- in this case on
the acted, fixed-text data. And we see that after about 20 features put into
the classifier, we already get relatively stable performance. In the
regression experiment, we can predict the extroversion rating that a certain
speech sample will receive from our humans with a correlation coefficient of
almost .7. Neuroticism works less well, about .5 to .6, and agreeableness,
conscientiousness, and openness are even below that. But these are also the
factors that humans aren't good at estimating from speech, so it's not
surprising: you would expect that the correlation of a human assessment with an
automatic assessment wouldn't be too high either.
What's interesting is that this seems to be saturating after maybe 30 features
put into the predictor. If you analyze what the salient features are that the
machine picks out to predict personality, you find -- here for the example of
extroversion versus neuroticism -- that a lot of the extroversion features are
pitch-related, and that's consistent with the psychology literature, where
variations in pitch are a good predictor of extroversion in a person. Of
course, there are the MFCC features that pop up everywhere, although they
shouldn't.
For neuroticism, there are several intensity and loudness features -- features
capturing dynamics -- and also pitch and MFCC features. That's also
consistent: the voice of a neurotic person is very controlled, very flat, very
leveled, with very few changes in dynamics.
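A toy version of such dynamics-capturing descriptors. Real front ends compute many more features (including pitch, which is nontrivial to extract); the frame length and feature names here are simplified assumptions:

```python
# Frame-level intensity (RMS) and zero-crossing rate, plus their standard
# deviations as crude "dynamics" features. A flat, controlled voice -- the
# neurotic profile described above -- would show a low rms_std.
import numpy as np

def frame_features(signal, sr=16000, frame_ms=25):
    """Per-frame RMS and zero-crossing rate, summarized by mean and std."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0) for f in frames])
    return {"rms_mean": rms.mean(), "rms_std": rms.std(),
            "zcr_mean": zcr.mean(), "zcr_std": zcr.std()}

# Toy signals: a 440 Hz tone with a loudness swell vs. a constant-level tone.
sr = 16000
t = np.arange(sr) / sr
swell = np.linspace(0.1, 1.0, sr) * np.sin(2 * np.pi * 440 * t)
flat = 0.5 * np.sin(2 * np.pi * 440 * t)
print(frame_features(swell, sr)["rms_std"] > frame_features(flat, sr)["rms_std"])
```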
And, yeah, we did the classification experiment. For the intended variation of
our actor, with tenfold cross-validation and an SVM classifier, set up as a
ten-class problem, we're getting 60 percent accuracy on the 75-minute dataset,
which is six times chance. And we're working on repeating the experiment on
the larger, text-independent dataset, with a larger number of speakers, not
only the single actor.
>>: Where does the ground truth come from?
>> Florian Metze: Here the ground truth is the intended personality: the
instruction to the actor when we told him, make this extrovert, make this
introvert. That's the ground truth for training and test here; that's why
this is different from the regression.
For the classification, we see that, for example, neuroticism and
conscientiousness work quite well, while extroversion works slightly less well,
measured by the F measure per class, and openness and agreeableness seem to be
hard to classify.
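The classification setup can be sketched the same way. Synthetic clusters stand in for the acted samples, and the accuracy-versus-chance and per-class F comparisons mirror the ones reported:

```python
# Sketch of the ten-class problem (low/high variants of the five factors):
# SVM classifier, 10-fold cross-validation, accuracy compared against chance,
# and a per-class F score. Features are synthetic stand-ins.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(1)
n_classes, per_class = 10, 30
# Each intended personality gets its own cluster in feature space.
centers = rng.normal(scale=3.0, size=(n_classes, 12))
X = np.vstack([c + rng.normal(size=(per_class, 12)) for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

pred = cross_val_predict(SVC(kernel="rbf"), X, y, cv=10)
acc = accuracy_score(y, pred)
chance = 1.0 / n_classes
print(f"accuracy {acc:.2f} vs chance {chance:.2f} ({acc / chance:.1f}x)")
print("per-class F1:", np.round(f1_score(y, pred, average=None), 2))
```

In the real experiment the "60 percent accuracy, six times chance" figure and the per-class F measures come out of exactly this kind of readout.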
>>: So what would be a typical example?
>>: Conscientiousness.
>> Daniel Povey: Conscientiousness is sort of how --
>>: [inaudible].
>> Florian Metze: A conscientious person is somebody who is reliable, careful
-- I think planned, in some sense. I could see that if I had samples of
conscientious versus non-conscientious speech. It's interesting that
conscientiousness works well in the automatic classification, while in the
human assessment conscientiousness doesn't work quite as well.
So, yeah, that pretty much sums up the state of the results that we have right
now. We think that the protocol we're following allows us to measure the
personality of a person quite reliably from the voice -- definitely in the
actor case, and, from the results that Tim is getting now, also for the actor
in the free-text case and for the database that we have recorded under the
same conditions as the actor, with people from the street.
We haven't run the statistics on that yet; we've collected it, so we hopefully
have results later in the year. What we find is that neuroticism is clearly
distinct, while very open and very extrovert are acoustically very similar --
I skipped over that in the details. Neuroticism and extroversion are, in human
assessment, sort of reciprocal. The perception of openness tends to follow
agreeableness, if anything, but openness is the most challenging factor.
On the automatic assessment side, it works quite well for neuroticism and
extroversion. Conscientiousness classifies well, but we have a harder time
predicting the human perception of conscientiousness; agreeableness correlates
moderately with the human assessment, but we have a harder time classifying it,
if you want to rank these factors relative to each other by how well they do;
and openness is sort of an unsolved case, because humans don't assess it well
and automatic assessment doesn't work well either.
So it's certainly the weakest of the five factors. As for ongoing work: we've
done some work at Johns Hopkins on emotional speech synthesis, using
articulatory features as parameters that you can change and use in a speech
synthesizer, and using this database and this classification approach to
measure how well a synthesizer produces speech with emotional characteristics
-- that doesn't mean that the speech is understandable, but that it is
emotional. And, of course, we want to continue on the text-independent and
speaker-independent case here to see where we are, and we're very interested to
see how we do with nonacted speakers and nonacted results, versus this sort of
schizophrenic setup where we have one single person producing speech in
different personalities. Tim wants to look at what personality factors, what
independent factors, you can extract from the data, if any, and whether they
are different from the ones you find in psychology -- depending on how much
data we have, we might be able to do that. And we want to merge the acoustic
information we have here with the textual information and see which is more
salient, which is harder or easier to extract from. What does it mean, for
example, if the message that you're getting from the text channel doesn't match
the acoustic channel? So if you take the text that the actor produced as a
neurotic person and synthesize it in a non-neurotic voice, or have him speak it
in a non-neurotic voice, does that send mixed signals and confuse the user, or
does one beat the other?
So there's a lot of things that you can play with there. And that concludes my
talk. [applause].
Any questions?
>> Daniel Povey: So I think -- did the West Coast meet all of your expectations.