>> Daniel Povey: I think most people here know Florian, but just for those who
might be on video, Florian finished his Ph.D. in 2005 at Karlsruhe. He was
working with Alex Waibel. He then worked for a few years at Deutsche
Telecom on various speechy things and core analytics. Now he's faculty at CMU
doing lots of interesting research, including this stuff.
So I'm going to let you go for it, Florian.
>> Florian Metze: Thank you. So thanks everyone for having me. And thanks
for allowing me to be here. So this is really work that started while I was still a
post-doc at Berlin.
I started work with Tim, who is now a Ph.D. student at Deutsche Telecom
Laboratories and Technical University of Berlin. So you were asking, there's no
CMU affiliation on the title. That's really because Deutsche Telecom is funding
Tim and I'm doing it in my spare time, because I like the topic
that we're working on.
So we're trying to find out how much personality there is in speech, and what
added value speech can bring to an assessment of personality if you look
into it.
So, brief overview. I want to start by explaining how you can assess personality
in the first place, and verify that you can assess personality in speech, because
it's not at all clear that this is something you can reliably do; we're going to
make sure that's the case.
We collected a database on which we then ran a couple of experiments, on
both how humans assess personality in speech and how machines can mimic
that process: we first do a human baseline experiment on personality
assessment from speech and then see how well a machine can do the same.
And feel free to interrupt me at any time during the talk. I see doubtful faces
already.
>>: I'd like the various presidential candidates' speeches.
>> Daniel Povey: Oh, yeah, [laughter].
>>: They're all possessed. [laughter].
>>: Whatever you'd like them to be.
>> Florian Metze: So that would probably be a cross-cultural perception of
speech thing then because the work we're doing is in German. I'm going to play
you some German speech and let you assess the personality and that's always
good for interesting discussion as well. If not laughs.
So now, why do we want to do this? Well, you all know about emotion detection
from speech, and that's kind of useful, but not really, because the only emotion
you ever find is anger. And that's maybe 5 percent of the time; 95 percent of the
time, at least in call centers, it's neutral speech.
So maybe if you knew what personality a person has, it would be easier and
more reliable to detect emotions. A neurotic person is maybe going to express
emotion differently than a very extroverted person. So if you knew about a
person's personality you could normalize emotion detection.
>>: How many classes do you have?
>> Daniel Povey: We work with ten classes, but of course the scheme that we're
using is sort of a profile. So it's continuous scales. So we could have any
number of classes.
>>: Ten emotions or --
>> Daniel Povey: Ten personalities.
>>: [inaudible].
>> Florian Metze: Five scales. Openness, conscientiousness, agreeableness,
extroversion and neuroticism. I remember them by now.
>>: They're separate from emotion, or are they related?
>> Daniel Povey: Of course they're related, but they're different. So let me -- I'll
come to that in a minute. Of course, it would be interesting, if we do speech
transcription, if we could annotate speech with properties saying this was this
type of person saying this. And another thing -- this is really also why Deutsche
Telecom is interested in this type of work -- they spend a lot of time on their
corporate brand. Whenever they do ads, whenever they have voice or call
centers, they spend time to make sure that the voice they use in the call center
transports their values. So they do brand monitoring and quality assurance: if
you hear a voice in a Deutsche Telecom call center, at least an automated one,
they want to make sure that the voice instills in you the values that Deutsche
Telecom stands for. It's a former state-run company, so they're reliable; they're
your grandfather's company; they don't have to be young and hip, but they
want to be solid and reliable.
So any voice they want to use in the system, and any voice they want to use
in a TTS or speech synthesis system, should also transport these qualities that
they stand for -- so to speak, personality. So if you had an automatic system to
assess that, that would be a useful instrument to make sure that the dialogue
systems, for example, transport the brand values.
I don't think they're quite there yet. But that's something this could be used
for. And, of course, in any kind of human-computer interaction it would be
useful if you had personality information, both for recognition and synthesis, if
you could synthesize voices with different emotions or personalities.
So the work we're doing could also be used to assess the personality, the
impression, that a synthetic voice leaves on the hearer.
So that's what we are working towards. Now, what are personalities? I can
read you this definition, which I like: it's a dynamic and organized set of
characteristics possessed by a person that uniquely influences his or her
cognitions, motivations and behaviors in various situations.
Now, that's one of many definitions of personality; it tells you a little bit, but it's
not very precise. And that's part of the problem here. I think what it really boils
down to is an analysis of how people and their perceived behavior differ. So
you're looking for a property in a person that others can detect, not directly, but
because people do things in certain ways.
So you only observe personality by proxy, if you want: this person speaks like
that, this person does things like this, and that other person does things like
that, and that's because they have different personalities, for lack of another
observable property that differs between people.
The important thing about personality and also the differentiator between
personality and emotion is that personality is meant to be stable over time.
Emotions can change very quickly. So your personality wouldn't change over
days, weeks or even months while emotions can change within minutes.
Between the two, though, I wouldn't argue that there is a clear-cut separation
between personality and emotions. They are speaker states, if you want, or
traits, and at least at this point we can't say with certainty, well, this is an
influence of personality and this is an influence of emotion. You always have
certain qualities. But here we're looking for something that's stable over time.
And in psychology there is what's called the Five-Factor Inventory, FFI, that
uses five factors to describe a person's personality profile, which you can use
to measure and predict the sort of habitual patterns that a person has in their
thoughts, behaviors and the way they express their emotions.
>>: So in the corporate world we normally have the perspective of trying to be
dominant; is that true? Is that the kind of analysis?
>> Florian Metze: It's similar. In marketing they have these personas or
prototypical --
>>: There's four.
>> Florian Metze: There's several -- several different -- I think Germany has a
different set of personalities, or what do they call it, Sinus-Milieus, where they
have different types of people to do marketing or observational studies. And I
think they're different from what people use in the U.S.
But I think this way of representing personality is kind of an academic way of
describing it, where you have a space in which you can then distribute your
personas or your prototypes, if you want.
But in theory you should be able to map other schemes into this scheme; how
well that works, and whether one scheme is more suitable to your purpose than
another, of course depends on what exactly you want to do.
But this NEO-FFI framework that we're using has been developed over the
course of several decades of research in psychology. It's essentially measured
using a questionnaire of 60 questions that people answer either about
themselves -- then it's called self-assessment -- or about somebody else whom
they know.
Then it's a third-person assessment. And this third-person assessment is
generally considered more reliable because people tend to see themselves
differently than other people see them and in the end what matters or what's
more consistent, more reliable is how other people see a person.
So with this type of questionnaire, the NEO-FFI, about 110 studies have been
performed in English with 24,000 participants, and in German -- which is what
we're using; this work is in German -- about 50 studies have been done with
2,100 participants. So this is the population against which we can compare the
personalities and the profiles that we find. So if we have a certain profile, we
know where in the space of all these personality profiles we are: whether 20
percent of the population is more neurotic than this person, or 80 percent is
less neurotic, and so on.
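Locating one person's factor score within that norm population, as he describes, is just a percentile rank. A minimal sketch; the norm scores here are synthetic stand-ins, since the real norm distributions come from the published NEO-FFI manual:

```python
import random

# Synthetic stand-in for a norm sample of ~2,100 respondents; the mean
# and spread here are invented, not the published NEO-FFI norms.
random.seed(0)
norm_neuroticism = [random.gauss(21.0, 8.0) for _ in range(2100)]

def percentile_rank(score, norms):
    """Fraction of the norm population scoring strictly below `score`."""
    return sum(1 for x in norms if x < score) / len(norms)

# A speaker scoring 30 sits above the synthetic norm mean of 21:
p = percentile_rank(30.0, norm_neuroticism)
print(f"about {p:.0%} of the population is less neurotic than this speaker")
```

With the full published norm tables, the same lookup tells you exactly where a measured profile sits relative to everyone who has filled out the questionnaire.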
>>: There's another framework that actually I think companies use also, which I
don't remember the name of it. But it's got judgmental, sensing, introverted,
extroverted and there's a couple more. Do you know what I'm talking about.
>> Daniel Povey: I know there's a six-factor inventory. There's purely
categorical schemes and continuous --
>>: Intuitive versus something. Intuitive was another axis.
>> Florian Metze: There's a -- I may have come across the one you're talking
about, but we've only really worked with this one. Also, you have to pay for this
questionnaire if you use it -- you pay a dollar or a Euro or something per
questionnaire.
But probably any questionnaire has its advantages and its disadvantages, and
there's a lot of debate about which one's better. But from what we could find,
this one is the most established, and also one that exists in essentially the same
form in English and German and other languages.
So you could do it across languages or cultures if you want. So what are the five
factors that I've been talking about? Yeah, I found this image on the Internet so I
just had to use it.
It's neuroticism, extroversion, openness to experience, agreeableness and
conscientiousness. The abbreviation is OCEAN, so you can remember them.
Each of these five factors is expressed as a value between 0 and 48,
corresponding to low and high.
So if you have a low score on the neuroticism axis or neuroticism scale, that
means that you're not very neurotic. If you have a high value on the neuroticism
scale, that means that there's a lot of neuroticism in your personality profile. This
profile is determined using a questionnaire with 60 items, which you answer on a
5-point Likert scale. So you answer questions like: I like to have a lot of people
around me. Strongly disagree, disagree, neutral, agree, strongly agree.
Or I often feel inferior to others, strongly disagree, disagree, neutral, and so on.
I laugh easily. You either do that about yourself or about somebody else. And
from these answers there is a set of rules, an algorithm, that computes this
personality profile. It's not that any single question directly maps to one of
these personality factors, but it is essentially a linear combination of your
answers to these questions which determines your personality profile as a set
of five numbers between 0 and 48 that describe how you're supposed to be.
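The questionnaire-to-profile mapping he describes can be sketched as a keyed sum. This is a toy illustration only: the real NEO-FFI item-to-factor key is proprietary, so the item assignments and reverse-keying below are invented.

```python
# Toy NEO-FFI-style scoring. The real item key is proprietary; this
# invented key just assigns 12 items to each factor, some reverse-keyed.
FACTORS = ["N", "E", "O", "A", "C"]

# item index -> (factor, reverse-keyed?); 60 items, 12 per factor
ITEM_KEY = {i: (FACTORS[i % 5], i % 3 == 0) for i in range(60)}

def score_profile(answers):
    """answers: 60 Likert responses coded 0..4
    (strongly disagree .. strongly agree).
    Returns factor -> raw score in 0..48 (12 items x 0..4 points)."""
    profile = {f: 0 for f in FACTORS}
    for i, a in enumerate(answers):
        factor, is_reversed = ITEM_KEY[i]
        # Reverse-keyed items count disagreement toward the factor.
        profile[factor] += (4 - a) if is_reversed else a
    return profile

# An all-"neutral" respondent lands exactly mid-scale on every factor:
print(score_profile([2] * 60))  # every factor scores 24
```

Each item contributes linearly to exactly one factor, matching the "linear combination of your answers" description; the printed profile is the set of five numbers between 0 and 48.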
Now --
>>: So these are five different axes that correlate?
>> Florian Metze: I'm going to show some statistics on that later. Of course,
they're not completely independent; they're independent to some extent. But
yes, if you want to see them as a coordinate system, a five-dimensional space,
then that's ideally what it would be.
Now, what does this personality profile mean? If, for example, somebody has
a low value on the openness scale, he would be described by words like
conventional, simple, unimaginative, literal-minded. If a person had a high value
on the openness scale, he would be described with words like curious, original,
creative, artistic, intellectual, and the same for all the other axes.
So the way to interpret such a profile is to say: the more towards one end of
a scale you are, the more likely people would be to use any of these words to
describe your personality.
And now what you can do: if you take a certain sample of a population -- 20
people, which is what we've done here -- and average their personality profiles,
you get a graph like this.
You see the neuroticism axis and so on; here's the median value and the
interquartile ranges. And because you know the total distribution of personality
profiles, from all the 2,000 questionnaires that have been filled out in German or
the 12,000 filled out in English, you know what the rest of the population is
doing, so you know where you are with respect to the total population.
So this is what a personality profile of a single person or group might look like.
Now, of course, the question is how good, how independent, are these five
axes. Here, for the German NEO-FFI, the manual -- the book that describes it --
contains a table that shows what's been computed, and you see that the
off-diagonal correlations between these axes are, I would say, relatively well
behaved. Maybe in physics or in statistics you would get even lower
correlations, but for a table on psychological data, I think the argument is that
these numbers are relatively low, relatively close to 0.
But what you notice, for example, is that there is a strong negative correlation
between neuroticism and extroversion. So a person who has a high value on
neuroticism tends to have a lower value on the extroversion axis, and the other
way around. That's kind of intuitive: you would imagine it would be difficult to be
neurotic and extroverted at the same time.
So that's what this strong negative correlation means. But all the other numbers
are relatively close to 0, if you want. So to some extent these axes are really
independent, but of course they're not completely independent in the sense that
you could vary them totally at will.
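The off-diagonal entries in such a table are Pearson correlations between two scales' scores across respondents. A toy sketch with made-up scores that reproduces the neuroticism/extroversion pattern:

```python
# Pearson correlation between two lists of factor scores (toy data).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Invented scores: speakers high on neuroticism are low on extroversion.
neuroticism = [40, 35, 30, 12, 10, 8]
extroversion = [10, 14, 20, 33, 38, 44]
print(round(pearson(neuroticism, extroversion), 2))  # strongly negative
```

Values near 0 for the other pairs are what supports treating the five scales as roughly independent axes.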
>>: So early on you gave the example of a company that might want to project a
mature, stable, reliable image. And you mentioned a grandfather voice. What
would the grandfather voice be in this --
>> Daniel Povey: We do have -- so we do have our sort of Deutsche Telecom
corporate voice, which is a guy recording for our database; he recorded a
number of sentences in his Deutsche Telecom voice, and this is the profile that
we computed for this voice. That's exactly these bars.
We computed his average profile over a number of samples that he had given.
He scored high on the conscientiousness scale and relatively low on the
neuroticism scale, and average values, if you want, on extroversion and
openness, and maybe slightly towards the positive side on the agreeableness
scale.
And that would be maybe your grandfather voice. But that's the personality
profile --
>>: Neurotic.
>> Florian Metze: Yeah. I guess that's what you want to be, right? So that's
the -- that's the voice that Deutsche Telecom is happy with. Whether the
marketing folks could also be happy with a voice that has a different profile, we
don't know yet; we haven't done a lot of studies on that. But that's the profile of
a voice that they're happy with.
So, of course, there has been some other work on assessing personality from
speech, mostly on analysis of psychiatric patients. Klaus Scherer [phonetic] did
a lot of work on that, and also Apple -- Apple the person, not Apple the
company.
They found that extroversion can be estimated from speech reliably; extroverted
speakers, for example, speak louder and with fewer hesitations. So they
established that you can take speech and derive from it cues about the
personality of a speaker, or, if you know what the personality is, make
predictions about how this person is going to speak.
Recently, François Mairesse was doing work on textual information, also
intensity and pitch, and he also found that extroversion could be modeled well,
followed by what he calls emotional stability, which is essentially another term
for neuroticism. He was using a different scheme than we are, and this also, of
course, tells us that personality doesn't only influence the acoustics of how
things are being said: if you describe an image or a situation, an extroverted
person is going to use different words than an introverted person, for example.
That's also, of course, something that we need to control. And then of course
there's Clifford Nass, who did a lot of work on human-computer interaction.
He claims that if you talk to a dialogue system, for example, we assign a
personality to this dialogue system, to this computer, and we tend to have
positive feelings towards a personality we encounter if it has characteristics
similar to our own personality.
So he did studies in which people liked a dialogue system better if it was
extroverted and they were extroverts themselves.
So this is sort of -- I'm not sure if this is a completely thorough understanding of
the problem. But it gives you an indication that personality is important to take
into account in human computer interaction and he's been exploring some of
these ideas.
>>: [inaudible].
>> Florian Metze: No, it's a person. It's a researcher. Not the company. He's a
psychologist.
Okay. So with this, we started collecting a database, in three parts. In the first
part, we record a professional speaker: we give him a fixed text and ask him to
record this single fixed text in various personalities, so that we have a baseline
understanding, for acted personalities, of what changes and whether people
and machines can pick it up. The text is always kept fixed. In the second part of
the database we relax this fixed-text constraint and allow the speaker to use
different words when he is in different personalities; so somebody listening
might be able to pick up on the personality from the vocabulary, from the lexical
information.
That's the difference between the first and the second part.
And in the third part we put people from the street, our subjects, into the same
situation as the professional speaker in the second part: we ask them to
describe various images, and then we get other people who know them well to
do their personality profiles, and we see if we can pick up, speaker-independently
and in a non-acted situation, cues from the acoustic speech towards the
personality of a person.
So we really have three conditions: two acted conditions, with fixed text and
free text, and then non-actors, real persons if you want, doing the same free-text
exercise. Our goal is to take these three parts of the database, see what
changes from condition to condition, and try to understand how good humans
are at picking up personalities and how well machines do.
Now, let me give you an example of the data. I think I briefly tested the audio.
So this is a standard text like one that could appear in a Deutsche Telecom call
center, and in fact the professional actor is recording it in a Deutsche Telecom
voice: sort of a neutral message, something that's a bit positive, a bit negative,
that's friendly but not overly friendly. So something that we hope could be said
in any personality.
[German].
>> Florian Metze: That's your grandfather voice. And here's the English
translation of the text that this guy has been reading and acting. Now, what
we're going to do is, we have the personality profile, and we ask this
professional speaker to record variations of this text with low neuroticism and
high neuroticism, low extroversion and high extroversion, and so on: the ten
extremes on these scales. To allow him to do that, we gave him the
descriptions of the personality profiles that we find in the NEO-FFI framework.
So he gets a piece of text that essentially contains these descriptions: persons
with high neuroticism are expected to be controlled, anxious and so on.
And he gets some time to put himself in the mood, if you want, and then he
produces various readings of always the same text in different personalities,
which we then recorded. So we have about 75 minutes of this 20-second text in
various personalities. Now, I know that there are German speakers in the room.
>>: How can you decide on the numerical value of this?
>> Florian Metze: How can we decide on the numerical value? What we do
when we have these recordings is we have people come in and listen to the
text, to the audio recording, and fill out the questionnaire. So they listen to it
and they say, oh, I think this person laughs easily, or this person likes to have
other people around him.
And by filling out this questionnaire, we get the numerical value. We get the
personality profile. And the question really is, is it reliable? If people only have
about 20 seconds of speech, or maybe a minute, can they reliably fill out this
questionnaire and therefore assign a personality profile?
But that's what we did. So we had a speaker produce many examples of
different personality, different personality prototypes and then have people listen
to that and fill out the questionnaire. They were free to listen to it as often as
they liked.
So they could listen to it many times. There was no time pressure or anything,
and then they fill out the profile. So --
>>: Is this the way your evaluation data is generated or the way your training
data is generated?
>> Daniel Povey: Both.
>>: If humans aren't any good at evaluating personality through speech but
computers could be, you're limited by the humans.
>> Florian Metze: Well, it depends. So here -- this part of the database we have
active data. So it could be that machines are better than humans at picking out
the intended variation that the actor produced, because we know what the actor
was supposed to do. So we could train a system that has super -- sorry for using
the term -- super human performance.
And in fact we did two different experiments. So we did the classification
experiment, where we tried to reproduce or where we tried to figure out what the
actor was doing, and we did the regression experiment where we tried to
reproduce the ratings that the humans would assign to the speech, no matter
what the actor was trying to do.
>>: So have you just tried using ranking [inaudible].
>>: Ranking.
>>: Yeah, ranking. Basically, this one is more something like that? The other
person, just compared to the --
>> Daniel Povey: That's an interesting point. If you listen to two samples, can
you, would it be easier to describe a difference between two speakers and that's
work that I would like to do, that I would like to get funding for.
But I don't have any funding for that yet. And I don't know what the answer is
going to be. We've played a little bit with that idea: what do you get when you
play two samples and tell somebody, don't tell us what you think this person is,
but tell us how different they are.
If we really have a metric space, if you want, in which different people are
located, like these personality profiles: if I take a profile that's low here and a
profile that's high here, but otherwise the same, would people then say, oh, this
voice sounds a lot more agreeable than the other voice?
So I don't know. But I would like to do that experiment, yes.
>>: Another one: does that mean -- can the actor easily put on a different
personality?
>> Daniel Povey: Yes, I told you that personalities don't change and now I have
an actor producing personalities at will. Clearly there's a problem here.
The actor -- what this guy does is he dubs movies. So he's been doing a lot of
movies, playing different characters in different movies. What he gets in that
context is a short description of what the character is. He reads the script and
all that, and then he produces different voices. So somehow this actor has a
superhuman capability of producing different voices even though his personality
doesn't really change.
But, of course, we're open to criticism in the same sense that emotion detection
on acted emotions detects something that's very different from real emotions.
But it tells you something about the space of the problem you're in, and, for
example, for synthesis experiments and for analyzing synthesized personalities
or emotions, this data is very useful.
But, yes, the basic fact is that here we have one person producing speech for
different personalities.
>>: The phrases you used are kind of very formal and polished. So I'm
wondering how much you can use -- if a person talks freely, say in an extreme
emotional state, I would expect that the speech stumbles at some critical
moments and you would also use different words to express something, some
specific thing. Can you estimate how much information is encoded in the
wording?
>> Daniel Povey: Yes, so this is the first part of the database where we have a
fixed text. The only thing that changes is the acoustics. And the second part of
the database, we allowed a speaker to use different words and then the third part
of the database we allow different speakers to do different, to use different words
in the same condition. So that's exactly why we designed these three steps.
>>: So roughly --
>>: What does the voice sound like?
>> Daniel Povey: Let me play a German neurotic voice. The typical way I do
this, because it's so much fun, is that I ask you to close your eyes for a second
and I'm going to play you one of low agreeableness, high neuroticism and high
extroversion and I'm going to have you guess what it is.
>>: [inaudible].
>> Florian Metze: Say again.
>>: [inaudible].
[German].
>> Florian Metze: So what's the guess for what that is.
>>: Extrovert.
>> Florian Metze: A is agreeableness.
>>: The big E.
>> Florian Metze: That's high extroversion. The other choices are low
agreeableness and high neuroticism. I'll play those now.
[German] [laughter].
>> Florian Metze: So is that low agreeableness or high neuroticism? Anybody
else.
>>: That would be neuroticism.
>>: I imagine it would be more like --
>>: Different --
>>: Woody Allen.
>>: I think this is how he interpreted neuroticism, like being really sad and kind
of --
>> Daniel Povey: You guys are good. This in fact is high neuroticism: I'm going
to go away and kill myself now. If you have low agreeableness, that's going to
be: I'm going to kill you if you ever call me again, pretty much. At least that's --
[German].
>>: That's like what I hear on the phone. Very passive aggressive.
>> Daniel Povey: Must be something about the German in there, I don't know.
>>: This is how German sounds like.
>>: Can you play the --
>> Daniel Povey: The middle one.
>>: The neutral.
[German].
>> Florian Metze: So those are a sample of the ten different variations and the
neutral middle version that we have. And even though it's cross-lingual or
cross-cultural, there seem to be a few universal things in there that allow you to
pick it up. Clearly it's something that's dependent on language, but it's also
clearly something that you want to be able to understand in a human-computer
interaction system.
I'm sounding like [inaudible]. If you build speech synthesis, you want to be able
to do these kinds of things. And an automatic system should also be able to
pick up on these types of differences and do something with it.
Now, here's one fun fact for you. Do you recognize this?
>>: Wait. Wait.
>>: Yes. It looks like that guy [inaudible].
>>: The guy --
>>: Is that -- Jason -- what is his name?
>> Florian Metze: I don't know what his name is. The question is who are these
guys.
>>: "American Pie".
>> Florian Metze: That's the movie "American Pie" and our speaker, he's the
German voice of Kevin. I don't know if Deutsche Telecom knows.
>>: That's not really good for their brand. [laughter].
>> Florian Metze: That's what this guy does for a living. He dubs movies.
Okay. So what we've done is we've recorded a total of 75 minutes of this
speaker reading this one sentence in various personalities, and more than three
hours of this speaker describing a selection of these images in various
personality styles. And the idea here is that we put him in a task that he has to
do that doesn't involve any other human who could influence the way he
speaks.
These images are a selection of romantic, abstract, scary, peaceful images.
Depending on what image he's looking at and describing, and on what
personality he's doing, that should influence the way he's speaking.
Most of these images are also taken from what's known as the Thematic
Apperception Test, which is a standard psychology test where people look at
images, and the words they use to describe these images give some clues
about what personality or mental state a person is in.
And that's all been done in a studio, with high-quality recordings, no noise and
so on. And, again, also for the non-acted personalities -- the people from the
street that we brought in -- we did the same image tests and also a few
human-interaction domains, where people talk to each other, to get data in a
situation comparable to the recordings of the actor. I would like to report on the
results of these, but we're not done yet. Tim is a young father and work is going
slower than we've been hoping.
But now, what do we get? We have this database with ten different types of
recordings of always the same text. The first thing we did is we selected from
these 75 minutes -- from all these samples; I think we have 20 or 30 repetitions
of each utterance and each personality profile -- and gave two students the task
of rating every sample individually. So we ranked them according to
naturalness.
And we threw away the ones that sounded the least natural, so we don't have
any obviously faked or obviously acted samples in our database. Then we had
in total 87 test persons come in, listen to the samples, and fill out the personality
profile of the voice that they were listening to, and they could listen to the data
as often as they wanted.
And this is what we get. On the five scales, for the variations towards low
values and the variations towards high values, we get medians and interquartile
ranges, and these are clearly separable. For example, extroversion seems to be
separable reasonably well. Neuroticism works better. Agreeableness works about
the same as neuroticism. Conscientiousness works quite well. And openness is
something that either our speaker isn't able to separate very well, or this
text isn't good for distinguishing, or it's just not something that you can
pick up from speech very well. But the key point is that, in this setting,
this is something that you can pick up and that humans at least can tell apart.
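The median-and-IQR readout just described can be sketched in a few lines. The rating arrays below are invented stand-ins for listener scores, and the non-overlapping-box criterion is only one plausible way to operationalize "clearly separable":

```python
# Sketch of the separability check: for each Big Five scale, compare listener
# ratings of the "low" vs. "high" acted variants via median and interquartile
# range (IQR). The ratings here are invented, on a 1-5 scale.
import numpy as np

def median_iqr(ratings):
    """Median and IQR of a 1-D array of listener ratings."""
    q1, med, q3 = np.percentile(ratings, [25, 50, 75])
    return med, q3 - q1

def separable(low_ratings, high_ratings):
    """Crude separability criterion: the two IQR boxes do not overlap."""
    return np.percentile(low_ratings, 75) < np.percentile(high_ratings, 25)

# Hypothetical ratings for the low- vs. high-extroversion stimuli.
low = np.array([1.5, 2.0, 2.2, 1.8, 2.1, 2.4, 1.9])
high = np.array([3.8, 4.2, 4.0, 3.9, 4.5, 4.1, 3.7])
print(median_iqr(low), median_iqr(high), separable(low, high))
```

With real data one would run this per scale and per variant, which is essentially what the medians and interquartile ranges in the plot summarize.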
Now, when they do the same with the free speech -- the descriptions of these
images, where the textual information being said also varies -- originally we
were expecting that, since we're adding another dimension of information, this
should be easier to pick up. If I have a description of an image in a neurotic
personality versus a non-neurotic personality, and I can also vary the text,
that should make the distinction between the two even easier, because the words
would also be different. But it turns out that this is not the case. So here
is the old image for comparison.
The overall structure of the result is quite similar. Openness again can't be
distinguished, but even the other factors are closer together. So our
understanding for the moment, although we don't have any way of proving it, is
that the actor is sort of hyperarticulating, if you want, in the read speech,
the fixed-text case: because he's always producing the same text, he has
learned very well how to do that in the various personalities. What we could
do, for example, is have him respeak things he has said about an image, over
and over again, to see whether that is an effect of entrainment for the actor,
or whether it's really the length of the utterance, or anything else that
differs between the two, that makes it harder for him to produce or for people
to pick up.
At this point we don't know. But it's interesting to note that it's easier to
pick up these personalities from a short sample that's been read than from a
longer sample where we also vary the lexical content. We do have
transcriptions of the data now, but we haven't done any lexical analysis to see
which words were varied, if any, and what's going on there. That's on the list
of things to do.
Now, the next thing that we looked at: if this is the table of correlations
between the different personality factors, how does it look for speech?
The top one, the NEO-FFI, is the one from psychology. How does that look for
us, for the fixed-text case and the free-text case? The first thing, of
course, is that overall our correlations are much noisier, because we have a
lot fewer samples.
But if you look at the free-text case, for example, where we do have the most
data, we see that overall the absolute values of the off-diagonal correlations
tend to be smaller than in the fixed-text case, so they tend to be better.
And the biggest value, the biggest negative correlation, is again between
neuroticism and extroversion -- sort of the same structure as we see in the
NEO-FFI case, the psychology case. So, given the number of samples that we
have, I wouldn't call it a significant result yet, but it seems plausible that
this is going in the right direction: all the other values tend to be smaller
than here, and in many cases they have at least the same sign as the
correlations that we find in the --
>>: They're coming from the real FFI test.
>> Florian Metze: Yes.
>>: So you use that as --
>> Daniel Povey: Yes.
>>: And these are human.
>> Florian Metze: Yes, these are all human listening tests.
>>: [inaudible].
>> Florian Metze: Yes. What we're trying to understand here, I mean, is
whether this is a good protocol to use: can people pick up personality from 20
seconds of speech, or from longer samples, or are we just measuring --
>>: Are you sort of seeing if people can catch what an actor is saying?
>> Daniel Povey: Yes.
>>: Which I think is different from personality. It's like, can you read
[inaudible]. I mean, why not do something like go to NPR, where they have a
lot of interviews -- they interview lots of people a day. You get all these
samples of people they interview, and then you go to other people and say,
look, does this person sound neurotic, does this person sound agreeable? Then
you're working with data that just naturally occurs, but it's still going to
come from a big range of personalities, probably, because of whom they're
interviewing.
>> Florian Metze: Yeah, we could do that, and other people have done that.
Why we didn't: we started off with the single speaker because that's also what
the company was interested in, and because we can also use it for synthesis.
So there are things we can do with this data that you can't do with that type
of data.
>>: But the natural interview, you still need to provide the test to be able to know
whether that person actually [inaudible].
>>: Real behavior, not acted behavior.
>>: And you could tell if people agree at least or don't agree. But if they're
politicians...[laughter].
>>: Interview authors, movie directors.
>> Daniel Povey: There would be a couple of very interesting things to do.
One would be to take not the original soundtrack of movies, but the sound
recordings that these guys make for movies, because then you have it without
background music, without noises. And they do the same thing.
Lots of material in different personalities. Or, as you say, take interviews.
There is work -- a group in Trento, in Italy -- they tried to figure out the
personalities of call center operators, people in the call center, and in
their first look at the data, both the human assessment and the automatic
recognition rates were very, very poor.
So they added the telephony channel; I don't know if that's the difference.
But, yeah, it's clear that with an actor we're making the task a lot easier,
and we have less data, while when you go to the telephone and take samples,
you make the task harder but you have a lot more data.
>>: I don't know. So I had to interact with American Express to make a
reservation today for ICASSP. And I'll say, as usual, I could tell within like
ten seconds of the person saying hello whether the person was going to be able
to solve the problem or not.
Like competent versus not competent. I claim you can determine that on the
telephone with an operator within ten seconds, five seconds.
>>: I agree. You have an estimate of a personality. Whether it's accurate is
maybe one thing, or whether it corresponds to what people who know this person
would also assign to this person -- you don't know. But it's clear that in
your application domain, that's what you're eventually after, that's what
you're eventually interested in. You don't need a personality; you want to
know, is this person going to solve my problem or not.
And for that you may not need this whole five-factor-inventory shenanigans.
>>: It tends to be one of the personalities -- still a target somehow.
>>: With the machine.
>> Daniel Povey: So the next step that Tim is currently working on is taking
the personality labels that people who know our subjects well assign to them --
to the people who have been giving us speech -- and seeing how the personality
assigned on the audio corresponds to the personality assigned by people who
know them well. But we don't have those results yet.
But clearly the big-data approach is always: collect it somewhere, label it,
and see how reliable it is. And that's just as valid, I think, as doing what
we are doing here.
Another thing that we analyzed: if we do the recordings at different points in
time, and also if we do the annotations, the labeling, at different points in
time, does that influence the results? Of course, the statistics get poorer if
we divide our set into three subsets and have each one labeled at one-month
intervals by the same people. But to the extent that we have reasonable
statistics, in 36 out of 40 tests we are confident that the time of the
recording and the time of the labeling doesn't influence our results --
something we wanted to make sure of, so that this is a reproducible experiment.
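A sketch of how such a session-effect check might look. The talk does not name the statistical test used, so a Kruskal-Wallis test on hypothetical ratings stands in here; it is a plausible choice for small, ordinal rating samples:

```python
# Sketch of a stability check: split the ratings of one stimulus into the
# three labelling sessions (one-month intervals) and test whether the session
# has a significant effect on the ratings. Kruskal-Wallis is an assumption;
# the actual test used in the study is not named in the talk.
from scipy.stats import kruskal

def session_effect(session_a, session_b, session_c, alpha=0.05):
    """True if the three labelling sessions differ significantly."""
    stat, p = kruskal(session_a, session_b, session_c)
    return p < alpha

# Hypothetical ratings of the same stimulus from three sessions.
a = [3.0, 3.5, 3.2, 3.1]
b = [3.3, 3.0, 3.4, 3.2]
c = [3.1, 3.6, 3.0, 3.3]
print(session_effect(a, b, c))  # no session effect -> result is reproducible
```

Running one such test per scale and per recording batch would give the "36 out of 40 tests" style of summary mentioned above.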
And we looked briefly -- let me skip over that a little bit -- we also looked
at, if the actor varies one axis, how does it influence the perception of the
other axes. For example, if he increases neuroticism, we find a decreased
perception of extroversion; if he increases extroversion, we find a significant
decrease in perceived neuroticism.
And that's also consistent: you would expect this negative correlation between
the values assigned on the neuroticism and extroversion scales. Other than
that, there don't seem to be too many other interplays, as we call them.
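This cross-axis effect can be illustrated as a signed correlation between the intended level on one scale and the perceived level on another; the numbers below are hypothetical:

```python
# Cross-axis interplay sketch: when the actor raises one scale, does the
# perceived value of another scale drop? A negative correlation between
# intended neuroticism and perceived extroversion is what the talk reports.
# The per-stimulus data here is made up for illustration.
from scipy.stats import pearsonr

intended_neuroticism =   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
perceived_extroversion = [4.1, 3.9, 3.6, 3.8, 3.0, 3.2, 2.5, 2.7, 2.1, 1.9]
r, p = pearsonr(intended_neuroticism, perceived_extroversion)
print(round(r, 2))  # strongly negative in this toy data
```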
Now, the last part is the signal-based analysis: if humans can do it, how well
can machines do it? We're doing two different things: predicting the human
ratings, and classifying the intended personality. And, of course, we're
taking the standard approach. We extract a large number of audio descriptors
-- MFCC features, harmonics-to-noise ratio, zero-crossing rate, intensity,
pitch, loudness -- for voiced, unvoiced, and silent segments.
We extract these with and without silence, for a total of about 1200 features.
We use an information gain ratio filter to rank these features according to how
well they could work in a classifier, use tenfold cross-validation with an SVM
regression and an SVM classifier, and build a classifier -- in this case on
the acted, fixed-text data. And we see that after about 20 features put into
the classifier, we already get relatively stable performance. In the
regression experiment, we can predict the extroversion rating that a certain
speech sample will receive from our humans with a correlation coefficient of
almost .7. Neuroticism works less well, about .5 to .6, and agreeableness,
conscientiousness, and openness are even below that. But these are also the
factors that humans aren't good at estimating from speech, so it's not
surprising: you would expect that the correlation of a human assessment with an
automatic assessment wouldn't be too high either.
What's interesting is that this seems to be saturating after maybe 30 features
put into the predictor. If you analyze what the salient features are that the
machine picks out to predict personality, you find -- here for the example of
extroversion versus neuroticism -- that a lot of the extroversion features are
pitch-related, and that's consistent with the psychology literature, where
variations in pitch are a good predictor of extroversion in a person. Of
course, there are the MFCC features that pop up everywhere, although they
shouldn't.
For neuroticism, there are several intensity and loudness features -- features
capturing dynamics -- and also pitch and MFCC features. That's also
consistent: the voice of a neurotic person is very controlled, very flat, very
leveled, with very few changes in dynamics.
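A toy version of such dynamics-capturing descriptors. Real front ends compute many more features (including pitch, which is nontrivial to extract); the frame length and feature names here are simplified assumptions:

```python
# Frame-level intensity (RMS) and zero-crossing rate, plus their standard
# deviations as crude "dynamics" features. A flat, controlled voice -- the
# neurotic profile described above -- would show a low rms_std.
import numpy as np

def frame_features(signal, sr=16000, frame_ms=25):
    """Per-frame RMS and zero-crossing rate, summarized by mean and std."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0) for f in frames])
    return {"rms_mean": rms.mean(), "rms_std": rms.std(),
            "zcr_mean": zcr.mean(), "zcr_std": zcr.std()}

# Toy signals: a 440 Hz tone with a loudness swell vs. a constant-level tone.
sr = 16000
t = np.arange(sr) / sr
swell = np.linspace(0.1, 1.0, sr) * np.sin(2 * np.pi * 440 * t)
flat = 0.5 * np.sin(2 * np.pi * 440 * t)
print(frame_features(swell, sr)["rms_std"] > frame_features(flat, sr)["rms_std"])
```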
And, yeah, we did the classification experiment. For the intended variation of
our actor, with tenfold cross-validation and an SVM classifier, set up as a
ten-class problem, we're getting 60 percent accuracy on the 75-minute dataset,
which is six times chance. And we're working on repeating the experiment on
the larger, text-independent dataset, with a larger number of speakers, not
only the single actor.
>>: Where does the ground truth come from?
>> Florian Metze: Here the ground truth is the intended personality: the
instruction to the actor when we told him, make this extrovert, make this
introvert. That's the ground truth for training and test here; that's why
this is different from the regression.
For the classification, we see that, for example, neuroticism and
conscientiousness work quite well, while extroversion works slightly less well,
measured by the F measure per class, and openness and agreeableness seem to be
hard to classify.
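The classification setup can be sketched the same way. Synthetic clusters stand in for the acted samples, and the accuracy-versus-chance and per-class F comparisons mirror the ones reported:

```python
# Sketch of the ten-class problem (low/high variants of the five factors):
# SVM classifier, 10-fold cross-validation, accuracy compared against chance,
# and a per-class F score. Features are synthetic stand-ins.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(1)
n_classes, per_class = 10, 30
# Each intended personality gets its own cluster in feature space.
centers = rng.normal(scale=3.0, size=(n_classes, 12))
X = np.vstack([c + rng.normal(size=(per_class, 12)) for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

pred = cross_val_predict(SVC(kernel="rbf"), X, y, cv=10)
acc = accuracy_score(y, pred)
chance = 1.0 / n_classes
print(f"accuracy {acc:.2f} vs chance {chance:.2f} ({acc / chance:.1f}x)")
print("per-class F1:", np.round(f1_score(y, pred, average=None), 2))
```

In the real experiment the "60 percent accuracy, six times chance" figure and the per-class F measures come out of exactly this kind of readout.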
>>: So what would be a typical example?
>>: Conscientiousness.
>> Daniel Povey: Conscientiousness is sort of how --
>>: [inaudible].
>> Florian Metze: A conscientious person is somebody who is reliable, careful
-- I think planned, in some sense. I could see that if I had samples of
conscientious versus non-conscientious speech. It's interesting that
conscientiousness works well in the automatic classification, while in the
human assessment conscientiousness doesn't work quite as well.
So, yeah, that pretty much sums up the state of the results that we have right
now. We think that the protocol we're following allows us to measure the
personality of a person quite reliably from the voice -- definitely in the
actor case, and, from the results that Tim is getting now, also for the actor
in the free-text case and for the database that we have recorded under the
same conditions as the actor, with people from the street.
We haven't run the statistics on that yet; we've collected it, so we hopefully
have results later in the year. What we find is that neuroticism is clearly
distinct, while very open and very extrovert are acoustically very similar --
I skipped over that in the details. Neuroticism and extroversion are, in human
assessment, sort of reciprocal. The perception of openness tends to follow
agreeableness, if anything, but openness is the most challenging factor.
On the automatic assessment side, it works quite well for neuroticism and
extroversion. Conscientiousness classifies well, but we have a harder time
predicting the human perception of conscientiousness; agreeableness correlates
moderately with the human assessment, but we have a harder time classifying it,
if you want to rank these factors relative to each other by how well they do;
and openness is sort of an unsolved case, because humans don't assess it well
and automatic assessment doesn't work well either.
So it's certainly the weakest of the five factors. As for ongoing work: we've
done some work at Johns Hopkins on emotional speech synthesis, using
articulatory features as parameters that you can change and use in a speech
synthesizer, and using this database and this classification approach to
measure how well a synthesizer produces speech with emotional characteristics
-- that doesn't mean that the speech is understandable, but that it is
emotional. And, of course, we want to continue on the text-independent and
speaker-independent case here to see where we are, and we're very interested to
see how we do with nonacted speakers and nonacted results, versus this sort of
schizophrenic setup where we have one single person producing speech in
different personalities. Tim wants to look at what personality factors, what
independent factors, you can extract from the data, if any, and whether they
are different from the ones you find in psychology -- depending on how much
data we have, we might be able to do that. And we want to merge the acoustic
information we have here with the textual information and see which is more
salient, which is harder or easier to extract from. What does it mean, for
example, if the message that you're getting from the text channel doesn't match
the acoustic channel? So if you take the text that the actor produced as a
neurotic person and synthesize it in a non-neurotic voice, or have him speak it
in a non-neurotic voice, does that send mixed signals and confuse the user, or
does one beat the other?
So there's a lot of things that you can play with there. And that concludes my
talk. [applause].
Any questions?
>> Daniel Povey: So I think -- did the West Coast meet all of your expectations.