
>> Cha Zhang: Good morning, everyone. It's my great pleasure to have Professor Qi Tian here. He is currently an associate professor in the Department of Computer Science at the University of Texas at San Antonio. He received his PhD degree from the University of Illinois at Urbana-Champaign in 2002. He received a best student paper award together with his student in 2006 and was a best paper candidate in the retrieval track at PCM 2007. So without further ado, I'll let Dr. Tian take over.
>> Qi Tian: Thanks, Cha, for the introduction.
I'm really glad to be here to give a talk about my research projects. During my talk, please feel free to ask questions or interrupt me at any time.
First, I would like to give a brief introduction to multimedia information retrieval and the focus of my talk, covering concept-based image retrieval and affective analysis. Then I will discuss our proposed approaches. The first is constructing a lexicon of concepts with small semantic gaps for photos, used for concept learning; the second is i.MTV, an integrated system for MTV affective analysis, visualization, retrieval and management. Finally, I will give individual conclusions for each part.
So first of all, I would like to give acknowledgment to my collaborators. The first part of my talk, the construction of the lexicon, called LCSS, is joint work with my students and with researchers at Microsoft Research (inaudible). The second part of my talk, i.MTV, is joint work with my student Yung (phonetic), a graduate student at the Chinese Academy of Sciences, and two faculty members there, Ching Lee and Sue Chung (phonetic).
So first, the motivation for multimedia information retrieval: with the explosive growth of digital media data, there is a huge demand for tools that enable average users to more efficiently and more effectively search, access, process, manage and also share this digital content.
Early information retrieval was purely text based, and it has seen giant successes such as Google. However, to retrieve multimedia objects such as images, there are too many images to manually annotate, and the cost of human participation is high. And due to the subjectivity of describing visual content, a picture is worth a thousand words.
An alternative approach to text-based information retrieval is content-based retrieval. That means automatically retrieving images, video and audio based on their visual and audio content.
Content-based media retrieval became more active since 1997, when the internet and web browsers came forth.
Today, multimedia information retrieval is a truly diverse field. It involves different data types, from text and hypertext to image, audio, graphics, video, movie, and combinations of those. It addresses many research aspects, such as systems, content analysis, services, users, social and business issues, and many applications. It is also a truly multidisciplinary field, drawing people from databases, information retrieval, image processing, graphics, machine learning, and so on.
One of the biggest challenges in multimedia information retrieval is the so-called semantic gap. So what is the semantic gap? It is the gap between the descriptive power of low-level features and the high-level semantic content.
For example, given a picture like this, a human user can tell it contains sky, mountains and trees, and that maybe it's a picture of summer.
Or given a picture like this, a human can tell it contains a face, that it's a female, and maybe that it's a pretty face.
However, what the computer can get is pixel values in terms of three channels, red, green and blue, or low-level features extracted from these images, for example a color histogram and other visual features. Therefore, there exists a big gap between the semantic concepts in the human mind and the visual features which the computer can get.
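As an illustration of the kind of low-level feature just described, here is a minimal sketch of an RGB color histogram; this is an illustrative reconstruction, not code from the talk, and the random image stands in for real data.

```python
# Minimal sketch: the kind of low-level color feature a computer extracts
# from raw pixel values (not the speaker's code).
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """image: H x W x 3 uint8 array (R, G, B).
    Returns a normalized histogram of length bins_per_channel**3."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / hist.sum()  # normalize so the feature does not depend on image size

# Example with a placeholder "image"
img = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)
feature = color_histogram(img)   # a 512-dimensional visual feature
print(feature.shape)             # (512,)
```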
This slide gives the research overview and motivation. There are two kinds of research: one is theory-driven research, and the other is application-driven research.
For the theory-driven research, we can consider research in this field at hierarchical levels, from the low level and middle level to the high level. Low-level research includes feature extraction, feature selection and dimension reduction. The middle level is mostly machine learning techniques, such as active learning and (inaudible). The high level is hybrid, such as using both textual and visual information, integrating context and collective intelligence.
From the low level to the middle level, we address challenging problems such as the curse of dimensionality of the features and the small sample size problem.
From the middle level to the high level, learning techniques and high-level semantic computing are used to address the semantic gap problem.
The other line of research is application driven: mostly search, and what kind of search we can provide, as well as going beyond search, with applications such as image retrieval, video search, music sharing, (inaudible) and advertising.
The research objective of this work is to develop new theory-driven approaches that address the challenge of the semantic gap, and to investigate new applications beyond search. This talk has two parts: the first part is a concept lexicon for photos, and the second one is affective analysis for MTV.
This is the first part of my talk.
So far, many content-based image retrieval systems have been built. To show you a few examples: the first commercial system, QBIC from IBM, Query By Image Content; and other domain-specific retrieval systems such as trademark retrieval systems, (inaudible) retrieval systems, and protein retrieval systems. More than 100 systems have been developed since the 1990s, and this field is still growing.
We have seen a trend from generic content-based image retrieval to concept-based image retrieval. For concept-based image retrieval, the first task is to build a concept-based image database, and for that, the first step is to have a concept lexicon for the desired collection.
Given the images, we perform feature extraction, and each image is represented as a point in a high-dimensional feature space; the dimensionality can be very high, from hundreds to even thousands.
The next step is data modeling. Usually, because of the high dimensionality, the first step is to perform dimension reduction, and then use statistical methods to model each class, for example modeling trees or mountains using a Gaussian or a mixture of Gaussians.
The next step is classification or retrieval. When a new image comes in, we perform a classification task. There are two types of metrics: one is classification accuracy, and another is similarity-based ranking.
So we can decide which images, or which regions of an image, are classified to be, say, a tree or a lawn, and the retrieval results can be refined using relevance feedback.
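To make this generic pipeline concrete, here is a minimal sketch of dimension reduction plus one statistical model per concept class, followed by classification of a new image. This is my reconstruction of the standard approach described here, not the speaker's implementation; the feature matrices are toy placeholders.

```python
# Sketch of the generic concept-modeling pipeline: dimension reduction,
# then a mixture-of-Gaussians model per concept class, then classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-ins for extracted visual features of labeled training images.
features = {                      # concept -> (n_images, raw_dim) feature matrix
    "tree":     rng.normal(0.0, 1.0, (200, 300)),
    "mountain": rng.normal(0.5, 1.0, (200, 300)),
}

# Step 1: dimension reduction shared by all classes.
pca = PCA(n_components=32).fit(np.vstack(list(features.values())))

# Step 2: model each concept class with a mixture of Gaussians.
models = {
    c: GaussianMixture(n_components=3, random_state=0).fit(pca.transform(X))
    for c, X in features.items()
}

# Step 3: classify a new image by the class whose model gives the highest likelihood.
def classify(x):
    z = pca.transform(x.reshape(1, -1))
    return max(models, key=lambda c: models[c].score_samples(z)[0])

print(classify(rng.normal(0.0, 1.0, 300)))   # likely "tree" for this toy data
```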
So for concept-based image retrieval, this is prior work on concept lexicons. There is the famous Caltech 256 object database, and there are some popular concepts in (inaudible), like red, (inaudible), Japan.
This slide shows some concepts from LSCOM-Lite, the Large-Scale Concept Ontology for Multimedia, which is for broadcast news, with concepts such as sky, car, and events.
And there are some video concepts called MediaMill 101, with 101 video concepts obtained from the TRECVID database.
>> Question: (Inaudible) keywords, so are they the same thing, or --
>> Qi Tian: Basically the same thing. (Inaudible) an image can contain multiple keywords, and as we'll show, one image can be associated with multiple concepts.
The largest concept lexicon is called the Large-Scale Concept Ontology for Multimedia, a joint work between CMU, Columbia University and IBM. They developed about a thousand concepts for broadcast news.
However, the big problem with these approaches is that concept lexicon selection is simply manual, or totally missing, in previous works.
All the previous lexicons ignore the differences in semantic gaps among concepts, and there is no automatic way to select the concepts.
The motivation for this work is that, we believe, concepts with small semantic gaps are likely to be better modeled and retrieved than concepts with larger ones. So far, very little research has been done on the quantitative analysis of the semantic gap: what are the good semantic concepts for learning, and how do you measure the semantic gap?
As some initial work in this direction, our objective is to automatically construct a lexicon of high-level concepts with small semantic gaps, which we call LCSS.
Ideally, there are two properties for this LCSS: the concepts should be commonly used, and the concepts are expected to be visually and semantically consistent.
To work on this project, we started by looking at web images, because web images have rich textual features: they can have a title and surrounding text. Here are just some examples: for this picture of a tiger, the title and the description; that one is a water lily with descriptions. We believe the title and comments are good semantic descriptions of images, and those text features are much closer to the semantics of images than the visual features.
This is the framework of our lexicon construction. The first step is to crawl images from the web; we have crawled over 2.4 million photos from internet photo sites.
We built a visual index and a textual index for these images.
Next, we build a confidence map, which we call the nearest neighbor confidence score; I will describe its meaning in the next slides.
Then we re-rank those images based on this nearest neighbor confidence score.
The next step is to construct a candidate set based on this confidence score. Then we apply a clustering algorithm called Affinity Propagation, published in Science in 2007.
Finally, we extract keywords from those cluster centers to get our concept lexicon.
Step one is data collection. So far we have collected 2.4 million web images from photo forums. The reason we chose those photo forums is that the photos have high quality and rich textual information. These are just a few examples: this image of a sunset with some descriptions, and (inaudible) with some descriptions.
We extract 64-dimensional global visual features, which at this time are mostly color features, including color moments, color (inaudible), spatial information and color texture moments.
The second step is, once we have the visual features, for each image in the database we find its K nearest neighbors in the visual space. Then, between this image and each of its nearest neighbors, we compute the similarity in terms of the textual features. This gives the confidence score, and we rank by this score to get the top candidate images.
Here is an illustration. This is an image, and these are, say, 5 or 10 nearest neighbors of this image in the visual space. We compute the text similarity between each neighbor and this image, and take the average of them as the nearest neighbor confidence score. The higher the confidence value, the smaller the semantic gap, which means those images are both visually and textually consistent.
We compute the confidence scores for all 2.4 million images, choosing the number of nearest neighbors K to be 100. Among the 2.4 million images, we select (inaudible) candidate images with the top confidence scores. The reason we chose this number is to have a relatively large candidate set while respecting the memory needed to construct the similarity matrix used for clustering.
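The confidence score just described can be sketched as follows; this is my reconstruction of the idea (visual nearest neighbors, then average textual similarity), not the authors' code, and the feature matrices are assumed inputs.

```python
# Sketch of the nearest neighbor confidence score: for each image, average the
# textual similarity between the image and its K nearest neighbors in visual space.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def confidence_scores(visual, textual, k=100):
    """visual:  (n, d_v) visual features (e.g. 64-D color features)
       textual: (n, d_t) textual features (e.g. keyword vectors)
       Returns an (n,) array; a higher score means a smaller assumed semantic gap."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(visual)
    _, idx = nn.kneighbors(visual)          # idx[:, 0] is the image itself
    # cosine similarity between each image's text vector and its neighbors'
    t = textual / (np.linalg.norm(textual, axis=1, keepdims=True) + 1e-12)
    scores = np.array([t[i] @ t[idx[i, 1:]].T for i in range(len(t))])
    return scores.mean(axis=1)

# Ranking all images by this score and keeping the top ones gives the candidate set.
```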
The next step is, after we have the confidence scores for the top candidate images, we construct a content and context similarity matrix and apply Affinity Propagation.
Affinity Propagation is a new clustering algorithm. It is fast for large-scale data and requires no prior information, such as the number of clusters. So we have a content and context similarity matrix over all the candidate images.
Here is the idea. For a pair of images I_i and I_j, first we find the K nearest neighbors of I_i in the visual space, and also its K nearest neighbors in the textual space. If, as in this case, image I_j is among the nearest neighbors of I_i in both the visual space and the text space, we consider I_j and I_i to be neighbors in terms of both content and context.
The same holds for this case with I_k, which is also a neighbor of image I_i in both spaces. We then set the similarity matrix element S(i, j) to be the overall similarity value, which combines two parts, the visual similarity and the textual similarity, weighted equally: the textual similarity weight and the visual similarity weight are both 0.5.
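A minimal sketch of this combined similarity, as I read the slide: an entry is nonzero only when the two images are neighbors in both spaces, and the two similarities are mixed with equal weights of 0.5. The precomputed similarity matrices are assumed inputs, not part of the talk.

```python
# Sketch of the content + context similarity matrix with equal 0.5 weights.
import numpy as np

def combined_similarity(vis_sim, txt_sim, k=100, w=0.5):
    """vis_sim, txt_sim: (n, n) precomputed visual and textual similarity matrices."""
    n = vis_sim.shape[0]
    # top-k neighbor sets in each space (excluding the image itself)
    vis_nn = np.argsort(-vis_sim, axis=1)[:, 1:k + 1]
    txt_nn = np.argsort(-txt_sim, axis=1)[:, 1:k + 1]
    S = np.zeros((n, n))
    for i in range(n):
        common = np.intersect1d(vis_nn[i], txt_nn[i])   # neighbors in BOTH spaces
        S[i, common] = w * vis_sim[i, common] + (1 - w) * txt_sim[i, common]
    return np.maximum(S, S.T)    # keep the matrix symmetric for clustering
```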
>> Question: (Inaudible).
>> Qi Tian: Oh, we use a standard method. For each image you have the surrounding text and the keywords, so you have a text feature vector for each image. We compute the cosine distance between the two vectors, and that is used as the textual similarity.
>> Question: (Inaudible) how do you go from the (inaudible) keywords to the semantic space?
>> Qi Tian: Basically you have a long list of keywords, and you look at the occurrence of each keyword in the surrounding text of the image. If it occurs, the entry is one; if not, it is zero.
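This textual similarity, as described, amounts to cosine similarity between binary keyword vectors. A minimal sketch follows; the vocabulary and example texts are illustrative placeholders.

```python
# Sketch of the textual similarity: binary keyword vectors compared with cosine similarity.
import numpy as np

vocab = ["sunset", "beach", "flower", "tiger", "sky"]   # illustrative keyword list

def keyword_vector(surrounding_text):
    words = surrounding_text.lower().split()
    return np.array([1.0 if kw in words else 0.0 for kw in vocab])

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b / denom)

v1 = keyword_vector("a beautiful sunset over the beach")
v2 = keyword_vector("sunset at the beach with orange sky")
print(cosine(v1, v2))   # textual similarity between the two images
```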
>> Question: In this case how does it take (inaudible).
>> Qi Tian: Yeah.
So I'm not sure whether this is (inaudible) in the current algorithm, but these are just the standard techniques we use to compute the textual similarity.
Once we have this content and context similarity matrix, we apply the clustering, and use standard text-based keyword extraction to extract keywords from the surrounding text of those cluster-center images, and get our concept lexicon.
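A minimal sketch of this clustering step, assuming scikit-learn's Affinity Propagation rather than the authors' original implementation: with a precomputed similarity matrix, no number of clusters needs to be specified, and the exemplar images supply the keywords that form the lexicon. The input file name is hypothetical.

```python
# Sketch: Affinity Propagation over the precomputed candidate similarity matrix.
import numpy as np
from sklearn.cluster import AffinityPropagation

S = np.load("candidate_similarity.npy")      # hypothetical (n, n) matrix from the previous step

ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(S)                   # cluster label for each candidate image
exemplars = ap.cluster_centers_indices_      # indices of the cluster-center images

# The keywords extracted from the surrounding text of these exemplar images
# would then be ranked to form the concept lexicon.
```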
This is the result. It shows the top 50 keywords in our constructed concept lexicon. Roughly, we put them into five categories: scene, object, color, season and others.
The number one concept is (inaudible), because we apply this to photo forums, followed by sky, beach, garden and lake. And since we used 64-dimensional color features, colors such as blue, red, yellow, green and pink are within the top 50 keywords, as well as some other objects, such as flower, ghost and butterfly.
>> Question: Categories (inaudible). Do you actually (inaudible).
>> Qi Tian: (Inaudible).
>> Question: Do you end up with a (inaudible).
>> Qi Tian: Yes, it is automatic.
>> Question: How many (inaudible).
>> Qi Tian: For this, it's about 36 (inaudible).
>> Question: (Inaudible).
>> Qi Tian: Yes. We have -- I don't remember exactly, a few hundred clusters.
>> Question: (Inaudible). Sort of like a (inaudible).
>> Qi Tian: Yes.
>> Question: (Inaudible).
>> Qi Tian: The first experiment tests the property that images of the top concepts should have a smaller semantic gap. We look at the top 15 keywords in our lexicon, such as sunset, flower, blue, red and ghost, and compute their nearest neighbor confidence scores.
For each keyword, we randomly select 500 photos of that keyword and compute their nearest neighbor confidence scores, and we can see the confidence score decreases with the rank of the concept, which means that images of the top keywords have higher confidence values.
The second experiment tests the property that the concepts are commonly good semantic descriptions and can be used for annotation.
We apply our constructed lexicon to the UW (University of Washington) dataset. It contains 1,109 images; each image has ground-truth annotations based on the visual content, plus keywords extracted from the surrounding text of each image.
We then compute annotations by applying the search-based image annotation (SBIA) algorithm developed by a group at Microsoft Research Asia, and we apply our constructed lexicon to refine the annotations produced by SBIA. We compare two annotation sets, the original SBIA output and the refined annotation, against the user ground truth, and compute precision and recall.
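For reference, the per-image precision and recall used in this kind of annotation evaluation can be sketched as follows; the example keyword sets are illustrative, not from the experiment.

```python
# Sketch of the evaluation metric: precision and recall of a predicted annotation
# set against the ground-truth keywords of one image.
def precision_recall(predicted, ground_truth):
    predicted, ground_truth = set(predicted), set(ground_truth)
    hits = len(predicted & ground_truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# e.g. comparing one image's annotation against its ground truth
print(precision_recall(["sky", "beach", "dog"], ["sky", "beach", "sunset"]))
# (0.666..., 0.666...)
```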
This is the result. This curve is precision, and this one is recall. The X axis is the number of concepts from the constructed lexicon used for annotation refinement, so zero means we do not introduce any of the concepts.
You can see that as the number of concepts added increases, the performance saturates when the number of concepts used reaches about 100.
You can also see the refinement distinctly improves over the original annotation in terms of both precision and recall. And this is the result. This is (inaudible).
>> Question: (Inaudible).
>> Qi Tian: Yes, we have not gone through that. What's the (inaudible)?
>> Question: (Inaudible). Do you have the state of this (inaudible) each image has (inaudible).
>> Qi Tian: Yes.
>> Question: (Inaudible).
>> Qi Tian: This is the ground truth. But the keywords are extracted from the web.
>> Question: Oh (inaudible).
>> Qi Tian: Yes.
In summary, we quantitatively study and formulate the semantic gap problem, and we propose an algorithm to automatically select visually and semantically consistent concepts.
This is the first lexicon in the field for concepts with small semantic gaps, with applications in data collection, concept modeling, image annotation refinement, (inaudible) recommendation and (inaudible) retrieval. Any questions? This completes the first part of my talk.
Okay.
Okay. Let's talk about the second part of my talk.
>> Question: So I want to ask about the initial visual features. The images that you group together, that you declare to be close visually, are these very, very close because you use (inaudible) pictures?
>> Qi Tian: Global (inaudible).
>> Question: Yes. So for example, if I have two images, they are (inaudible) a bit larger than normal, they are not (inaudible). Okay. So on one of them I have a rose on the left part of the image, and on the other one I have a rose on the -- okay. So you will mention --
>> Qi Tian: When --
>> Question: -- in this case here, because you use -- no, I need a better example. Well, I'm trying to say that your choice of features gives a very particular definition of visual similarity, which can be (inaudible); that's what happened.
>> Qi Tian: Right. The constructed lexicon is dependent on the visual features used. Right now, to test the idea, we use color features, so what we get are the good concepts with respect to these color features.
Of course, if you change the features, the lexicon will change. So one application is, if you are looking for a concept, you can decide which features are good for retrieving that concept.
So one piece of future work is to --
>> Question: (Inaudible). And also, a final question.
>> Qi Tian: Sure.
>> Question: The precision in the (inaudible) on the (inaudible) databases is still less than 0.2, so it is still quite small.
>> Qi Tian: Yes.
The second part of my talk considers MTV affective analysis. We have seen an explosively increasing amount of MTV data, and MTV is becoming increasingly popular, because it combines music and video and can give the audience both an audio and a visual experience.
The shortcomings of current MTV classification schemes are that they are manually done, difficult to (inaudible), and not intelligent enough when trying to retrieve MTVs by abstract concepts.
In the meantime, affective content analysis is getting popular in multimedia information retrieval. So what is affect? Affect is a feeling or emotion, as distinguished from cognition, thought or action.
Consider some real multimedia retrieval scenarios. For example, a user wants to find the most (inaudible) set of pictures to show to friends, or search for the most appropriate (inaudible) music for a given situation, or search for the most impressive video clips that I like most.
Of course these are very challenging and difficult tasks. In order to address them, a major approach is to search by mood: search for the items that match the user's profile and interest.
So we propose the affective analysis for MTV retrieval and management.
For MTV affective analysis, we expect it to be more consistent with human thinking, more intelligent and effective at retrieving and managing MTVs according to their affective states, and to offer personalized applications based on the user's preference. For example, it can be used for learning the user's preference for video filtering, video recommendation, sharing, or even friend making, because I believe people with similar preferences are likely to become friends.
Regarding affective content analysis, there are two categories of approaches. The first category is categorical affective content analysis. It uses discrete affective states, such as fear, anger, sadness, disgust and upset, and classification is done into these affective states.
It has limited flexibility, but is easy to build and suitable for some specific problems.
The other category is dimensional affective analysis. It uses a dimensional affective model for affective state representation and modeling, and it can represent infinitely many and more complicated affective states.
This is the dimensional affective model. There are two dimensions: arousal and valence. What is arousal? Arousal represents the intensity of the affective experience, from calm and peaceful to energized, excited or horrified.
Valence represents the level of pleasure, from highly pleasant to extremely unpleasant. Together they form the affective space: this axis is valence, from extremely unpleasant to highly pleasant, and this axis is arousal, from peaceful to energized. For example, we can quantitatively partition the space into regions such as excited, energetic, anxious, (inaudible), angry, (inaudible), peaceful and relaxed.
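To make the idea of quantizing the arousal-valence plane concrete, here is a minimal sketch. The quadrant boundaries and the low-valence names are my assumptions for illustration; the talk partitions the space more finely.

```python
# Sketch of one way to quantize the arousal-valence plane into named regions.
def affective_region(arousal, valence):
    """arousal, valence assumed normalized to [-1, 1]."""
    high_arousal = arousal >= 0
    if valence >= 0:
        return "excited/energetic" if high_arousal else "peaceful/relaxed"
    else:
        return "anxious/angry" if high_arousal else "sad/bored"  # low-valence names are illustrative

print(affective_region(0.7, 0.6))    # excited/energetic
print(affective_region(-0.4, 0.8))   # peaceful/relaxed
```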
This is an overview of our proposed integrated MTV affective system. It consists of three modules. The first module is affective analysis: given the MTV database, we first extract affective features and then do affective modeling. Considering that the user plays a very important role in affective analysis, we keep the user in the loop: the user gives feedback through (inaudible).
The second module is MTV retrieval and management, where we first propose an affective visualization and, based on that, propose MTV retrieval and management. The third module is user profile analysis. From a user's profile, for example the play history, we can identify the user's affective preference, which can be used for MTV recommendation, for user affective clustering, and for social network construction.
These are screenshots of the real system. First you can register an account. For a first-time user, the MTVs in the collection are displayed alphabetically here, and after you have used the system for a while, the recommended MTVs will be displayed here.
So these are the modules. The first one is feature extraction and interactive weight adjustment; I can show them (inaudible). And the last modules are MTV affective retrieval and MTV affective management.
For the first module, MTV affective analysis: from the MTV data we do arousal and valence feature extraction and affective modeling. The other part is for the users: through interactive weight adjustment, they give personalized feedback, which is used to update the model.
For the arousal features, we use both video and audio features, including motion intensity, shot switch rate, zero crossing rate, (inaudible) features, and camera and video stress.
For the valence features, we use lighting, considering that for the purpose of affecting the emotions of the viewers, colors such as blue and white can be used to establish (inaudible); and also (inaudible), color energy, pitch and (inaudible).
After all the features are extracted, they are normalized between zero and one.
>> Question: (Inaudible).
>> Qi Tian: Oh, wait, wait. For lighting, we compute, like, the gray, blue and white levels. Yeah, I forgot the details (inaudible); it should be in the paper.
For affective modeling, we use a simple linear combination. These are the arousal features and the valence features, these are the weights, and these are the normalized features. The weights can be initialized based on a user study.
Then we can do interactive weight adjustment through this interface. A video is played; the user can watch it and give one of four scores for arousal, from peaceful and a little peaceful to a little intense and very intense, and scores for valence, such as (inaudible), happy and very happy. These scores are recorded and used to update the weights in the model.
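A minimal sketch of this linear model and its interactive update follows. The talk only says the model is a weighted linear combination of normalized features whose weights are updated from user scores; the gradient-style update rule and the score-to-target mapping here are my assumptions.

```python
# Sketch of a linear affective model with a simple feedback-driven weight update.
import numpy as np

class LinearAffectModel:
    def __init__(self, n_features, init_weights=None):
        self.w = (np.ones(n_features) / n_features if init_weights is None
                  else np.asarray(init_weights, dtype=float))

    def predict(self, features):          # features normalized to [0, 1]
        return float(self.w @ features)

    def update(self, features, user_score, lr=0.1):
        """Nudge the weights so the prediction moves toward the user's score."""
        error = user_score - self.predict(features)
        self.w += lr * error * features
        self.w = np.clip(self.w, 0.0, None)   # keep the weights non-negative

arousal_model = LinearAffectModel(n_features=5)
x = np.array([0.8, 0.6, 0.3, 0.9, 0.4])    # e.g. motion intensity, shot switch rate, ...
arousal_model.update(x, user_score=0.75)   # "a little intense" mapped to 0.75 (assumed mapping)
print(arousal_model.predict(x))
```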
The second module is MTV retrieval and management. First, we propose an affective visualization: human emotion is visualized in a two-dimensional continuous affective space, and each MTV can be visualized as a point in the arousal-valence space. This is an example: each black dot is an MTV, and the MTVs of a (inaudible) category are shown in blue.
A good affective visualization should provide several functionalities: it should provide an overview of all MTVs, preserve the affective structure, and make it possible to browse all the MTVs.
For MTV retrieval and management, we propose this affective collage, a combination of a video collage and the affective space.
This is the affective space with a selection window. In this window we can select a coarse region, and the affective collage of the coarse region is shown here, divided into 10 by 10 blocks. Then we can select a finer region, and this is the affective collage of the fine region.
You can choose (inaudible) per view.
Then for retrieval, we can start by drawing a rectangular area in this affective space, and get the affective collage of the coarse region. For example, in this region, you get this affective collage. If I am interested in this part, I zoom in here, and the MTVs that fall in this region will be retrieved and returned here; I can click to play them.
This visualization can also be used for MTV management. For example, you can choose one of the categories, such as heavy (inaudible) or crazy, and the MTVs in that category will be displayed here.
The affective collage of this category will be shown here; this is the affective collage of this region, and an empty block means there are no MTVs in that region.
You can also customize a category: you select a region in the affective space and give it a name.
The third module is user profile analysis.
Affective preference means the user's preference for certain affective states. For example, if I like to play a certain kind of music a lot, (inaudible). The affective preference can be learned from the user's play history. We identify it by first constructing a user preference matrix, which records the accumulated affective preference reflected by the play history and is updated each time an MTV is played.
For example, this is the affective space; when this MTV is played, we update its value in the user preference matrix.
Each element in the user preference matrix accumulates the user's affective preference. We then identify preference points from this matrix. The preference points should be representative elements of the matrix that describe the user's affective preference: they should be located at relatively central positions and have high values, indicating a high preference degree. We identify those preference points by a (inaudible) algorithm. In this example, two preference points are identified, and these are the zero elements in the user preference matrix.
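A minimal sketch of the preference matrix idea follows. The grid resolution, the normalization of arousal and valence to [0, 1], and the simple pick-the-highest-cells selection are my assumptions; the talk uses a dedicated algorithm to pick the preference points.

```python
# Sketch of a user preference matrix over the arousal-valence plane,
# updated on every play, with the highest-valued cells as preference points.
import numpy as np

GRID = 10   # assumed resolution of the preference matrix
pref = np.zeros((GRID, GRID))          # the user preference matrix

def to_cell(arousal, valence):
    """Map arousal/valence in [0, 1] to a grid cell."""
    i = min(int(arousal * GRID), GRID - 1)
    j = min(int(valence * GRID), GRID - 1)
    return i, j

def record_play(arousal, valence):
    pref[to_cell(arousal, valence)] += 1   # updated every time an MTV is played

def preference_points(n_points=2):
    """Pick the n highest-accumulation cells as the user's preference points."""
    flat = np.argsort(pref, axis=None)[::-1][:n_points]
    return [np.unravel_index(i, pref.shape) for i in flat]

for a, v in [(0.8, 0.7), (0.82, 0.72), (0.81, 0.71), (0.2, 0.6), (0.22, 0.61)]:
    record_play(a, v)
print(preference_points())   # the (8, 7) cell ranks first, then (2, 6), for this toy history
```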
One application of the user preference is MTV recommendation. Compared to traditional recommendation, affective-preference-based recommendation is expected to be more consistent with the user's personal taste, and more intelligent and affective. After the user logs in, the MTVs are displayed in order based on the recommendation rate learned from the user's play history.
For example, given the identified preferred affective states, MTVs in this region are ranked with a higher recommendation rate than MTVs far from the user's preference.
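One simple way to realize this ranking, sketched under my own assumption that the recommendation rate is driven by distance to the nearest preference point in the arousal-valence space:

```python
# Sketch: rank MTVs by distance to the nearest preference point, closest first.
import numpy as np

def recommend(mtv_points, preference_points, top_n=5):
    """mtv_points: list of (title, arousal, valence); preference_points: list of (a, v)."""
    prefs = np.array(preference_points, dtype=float)
    def score(item):
        _, a, v = item
        return np.min(np.linalg.norm(prefs - np.array([a, v]), axis=1))
    return sorted(mtv_points, key=score)[:top_n]

catalog = [("song A", 0.8, 0.7), ("song B", 0.1, 0.2), ("song C", 0.75, 0.65)]
print(recommend(catalog, preference_points=[(0.8, 0.7)], top_n=2))
# [('song A', 0.8, 0.7), ('song C', 0.75, 0.65)]
```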
Application two is using the affective preference to build a social network, clustering multiple users with similar personalities and characters. It can be used for product recommendation beyond MTV, for example for movies, music, electronics or sports. For social networking, you can make friends and do photo and video sharing. Another potential application is to identify a user, or to (inaudible) personalities or character, like a fortune teller.
Now some experiments. First, the MTV dataset: we collected 552 MTVs in MPEG format.
We downloaded these MTVs from the internet or from DVDs. The collection is representative because it has different resolutions and visual qualities, different languages, English, Chinese, French, (phonetic), Japanese, different eras, from the earliest ones to recent ones, and different styles, (inaudible), and so on.
>> Question: (Inaudible). So is a piece of MTV just one piece of music, or with some pictures?
>> Qi Tian: It is a video.
>> Question: A video (inaudible) a few minutes long, just like --
>> Qi Tian: Each MTV program is about four to five minutes.
>> Question: Basically one song. One --
>> Qi Tian: Yes. (Inaudible) we can show some examples.
Secondly, we performed a user study for the ground truth. We had ten participants, one female and nine males, aged 28 to 31. Each scored 150 selected MTVs: each participant watched the MTV and gave two scores, an arousal score and a valence score, to each MTV, with scores of one, two, three or four quantization levels. The corresponding feature values are, for example, score one corresponds to 0.25, (inaudible) a little intense, and similarly for the valence features.
The next experiment is the validation of the affective features. This is arousal versus the average ratings, shown for each feature independently. These are the arousal features, the shot switch rate, (inaudible) and (inaudible) strength; and these are the valence features, lighting, color energy, pitch and (inaudible) regularity. You can see that all of them are (inaudible) at the 5% and 10% (inaudible) levels.
Based on the feature validation results, the weights in the affective modeling can be initialized: a higher weight for the shot switch rate and lighting, because of their higher consistency rates, 1.2 for (inaudible) and pitch, and 1 for the other features.
The next experiment is on affective modeling. For each participant in the user study, the 150 scored MTVs are divided into two groups: group A contains 80 MTVs for interactive weight adjustment, and group B contains 70 MTVs for computing the performance. Performance is measured in terms of three metrics: the arousal precision rate (APR), the valence precision rate (VPR), and the overall precision rate (OPR), which requires both to be correct and is a stricter requirement.
This is the result for user one up to user ten, and this is the overall average rate for OPR, APR and VPR. The X axis is the number of interactive adjustments, and the Y axis is the precision rate.
You can see that for all users, OPR, APR and VPR improve with the interactive weight adjustments. The improvement for OPR ranges from 7% to 31% (inaudible).
This shows that we have valid features and modeling.
The next experiment tests the user's satisfaction with MTV retrieval and management.
The affective space is divided into four by four, which is 16 regions. Accordingly, we test the user's degree of satisfaction in each affective region, computing the regional precision rate and the regional recall rate. This is the result for regions 1, 2, up to 16.
This is the overall average regional precision rate, and this is the overall average regional recall rate. You can see they are well above 50%. I don't think this is a low number, because the worst case, random assignment over 16 regions, is about 6.25%. The performance is better in some regions; for example, this region has a high precision rate because it has high arousal and is highly pleasant.
The last experiment is on user preference analysis. First, we try to learn the users' preferences: five users participated in the study, one female and four males. Each was asked to play MTVs according to their own choice, and the black dots are the MTVs played by the users.
This shows the identified preference point for user one, the preference point of user two, and so on up to user five.
The next step is MTV recommendation based on the learned preference points. Each user watched the top recommended MTVs and rated them with the scores shown on this slide.
Okay: neutral feeling, enjoy, and enjoy a lot. So this is the final result. Overall, it can be seen that above 80 percent of the recommended MTVs are enjoyed by the users.
Now I can show you a quick demo of this.
This is the user interface. Originally you register, then I can log in.
When you are logged in, the recommended MTVs will be displayed here.
Those are just the Chinese names for the songs; the first one is a Chinese song. Okay? So you choose one; I choose the third one and play.
[ ]
I'll find my way
>> Qi Tian: Okay. Then this is the module for feature extraction and interactive weight adjustment. This is the interface. Now I can start. So to start --
[ ]
And I swear it's true
>> Qi Tian: So the users will watch the MTV and give scores to this MTV. You can give one, two, three, four, and minus one to two. After you finish, you can submit. Okay, this is the score for this one; this is the preference from the user. Then you can go to the next MTV, and you can, like, establish your own preference.
The next part is affective management and retrieval. For management, now you click on this one: these are the 552 MTVs displayed in the affective space.
You can choose one of the categories, for example happy and pleasant. So the MTVs in that category will be displayed here, and I can play one.
So the mood of this one is (inaudible) very happy, but kind of peaceful.
[ ]
>> Qi Tian: So this shows the affective collage of this category. Now I change to this category, angry, and choose one to play.
It's (inaudible) in this region. It's (inaudible).
[ ]
>> Qi Tian: Now choose energetic and happy. And I want --
[ ]
>> Qi Tian: So that's an MTV demonstration.
And you can also customize your own category. You choose, say, a window in this region, and then give it a name; because it covers the center part, it is a mixed region, so I name this category "mixed". Then I choose this category, mixed, and all the MTVs in this region would be here, and you can choose one to play.
[ ]
>> Qi Tian: So that was for management. For the retrieval part, to start, this is the affective space. You can choose, say, if I'm interested in this part. Okay.
(Inaudible).
Show this (inaudible) order (inaudible). (Inaudible) instruction.
Okay, this is how to start. This is the collage of the coarse region. Now if I say I'm interested in this region, this is the collage of that coarse region. Then I say I'm particularly interested in here.
I choose a finer region, and the collage of the fine region is here. I say I'm interested in this one, and play. The MTVs in this region will be displayed here; in this case, this region has only one MTV.
[ ]
>> Qi Tian: So you can see this MTV is actually the one in this region; it corresponds to (inaudible) and energized.
This is (inaudible) one.
The two MTVs (inaudible).
[ ]
>> Qi Tian: Okay. So that's the idea of the system. Finally, I come to the conclusion.
For future work, with the help of music (inaudible) and psychological studies, we are trying to develop more powerful affective features and better affective modeling, to develop the MTV system for user affective clustering, and to build a social network of multiple users based on their affective preferences.
So that's the end of my talk. If there are any questions.
[APPLAUSE]
>> Question: (Inaudible). Music you (inaudible). So I wonder if that's really (inaudible). Some music, you know, (inaudible). Some may also like some (inaudible) inside here.
But even if two pieces of music are close in your affective space, they may not both be preferred: somebody may like one of them and not like the other, even though they are close in your space. So (inaudible) that affective space is (inaudible).
>> Qi Tian: Oh, eventually a good visualization should preserve the affective relationships in 2-D. Users could still have different preferences; those can be identified through the preference points (inaudible).
And hopefully, if two MTVs are close but one is not preferred by the user, the system can provide a way to change that, like with additional feedback.
If they are close in the 2-D space but in fact belong to different preferences, the user feedback can be given to (inaudible).
>> Question: (Inaudible). Even more, by projecting into a two-dimensional space you are losing some information; maybe the real underlying structure is not two-dimensional. So it would probably be better to infer the number of dimensions from the data from the user study.
>> Qi Tian: Yes, (inaudible) but we generally use two dimensions so it can be visualized by the users, or this could be extended to three dimensions.
Of course, the true structure may not be (inaudible).
>> Question: (Inaudible).
>> Question: (Inaudible). Music from (inaudible) label, is sad, happy. I guess for some music, some parts of it are happy and some parts of it are sad --
>> Qi Tian: Yes, that's (inaudible). Right now, for mood changes within an MTV, you could do a dynamic update of its position in the affective space.
But for now we just assume each MTV has one mood (inaudible): for a given roughly five-minute MTV, it is counted as one mood, one piece of music.
>> Question: (Inaudible).
>> Question: (Inaudible).
>> Qi Tian: Yes, of --
>> Question: (Inaudible).
>> Qi Tian: Actually (inaudible), because we (inaudible) from the MTVs and then test the algorithm.
[APPLAUSE]