>> Emre Kiciman: Hi. Welcome to today's talk by Derek Ruths. He's visiting us from McGill
University, where he's an assistant professor, looking at communities and social media. Today,
he's going to be talking to us about demographic inference on social media and how he can learn,
I guess, who groups of people are from the text that they write. So thank you very much, Derek,
for coming today.
>> Derek Ruths: Thank you, Emre. Well, first off, thank you very much for the opportunity to
be here. This is my first time at Microsoft Research. Actually, this is my first time on the
Microsoft campus, so I feel very privileged to be visiting. In my lab, we've been spending a lot
of time looking at latent attribute inference, or demographic inference, in social media. And so
today I wanted to talk a bit about the motivation for that and then to just really give a flavor for
what's been done in the field. It's a fast-moving field. There's a lot of work being done in it, and
so I thought I would take the time to sort of orient everybody to some of the key results that have
been obtained and some of the big problems that we're still working on, and there are plenty of
those.
So as a way of motivating this, for those of you who know about social media literature and
social media research, when I came to the field, this is a couple of years back, when I really
started investing myself in working in social media analysis, there were lots of papers coming
out about studying human behavior, social networks and social media, but they all tended to
focus on the content, irrespective of who was generating it, so people were trying to forecast
what the next big blockbuster movie was going to be, and they were doing so with mentions of
movies, but they weren't worrying about who was mentioning it. And it's quite remarkable to
consider the fact that we've been spending a lot of time talking about the content that's been
generated on social media, years and years, without really knowing a great deal about who is
generating all this content. And certainly, social scientists are very interested in understanding
who is actually on social media, and who is in a group really matters in terms of what the
behavior of that group is going to be.
And so we decided to really focus for some time on really getting a handle on this problem and
coming up with ways of figuring out who is actually on social media, and, furthermore, ways of
looking at human populations in general. So I wanted to start by sort of contrasting Twitter,
which is what we're using, a form of social media, with sort of the established technology for
figuring out who is in a community, and that is survey technology. So survey has been around
for a long time. They are very effective techniques for getting information about populations,
but in our fast-paced world, they've become outmoded in a number of regards. Surveys, many of
us have taken, they give structured information. You have multiple-choice questions, they can
pose the questions very clearly, very crisply. You get very definitive responses. You know who
you're asking when you take a survey, but they're very artificial constructs in the sense that
somebody's got to come to you, or a webpage has to come to you and basically pose this
question, and usually the questions are coming out of context, in the sense that I may ask you
about what you ate this morning or last week, and suffice to say, you're not eating it at that
moment. So you actually have -- there's memory and there's judgment involved in how you
actually answer that survey, and then not to mention the fact that surveys can be fantastically
expensive. So measuring online populations, or even physical populations, using surveys can actually be quite tricky to do with this technology.
Now, Twitter on the other hand, it's got a host of issues, but in its favor, and in the favor of
general social media, we have a number of features that make it very attractive. It's in the
moment. To me, this is the most important aspect of it. It's in the moment. So people generate
content when they're generally having the experience. So if I'm standing in Starbucks and
somebody spills coffee on my shoes, I tweet about it right then and there. I don't wait until I get
home. I don't wait until the weekend. I talk about how I feel at that moment. And, as a result, at
least the conjecture is that it's going to be a much more candid representation of the way that
people are actually interacting with the world.
Twitter and many other social media platforms actually have this continuous feed of information
that you can connect up to and get at least some portion of it effectively for free, so you're
effectively paying the -- the cost of your electricity and your Internet connection is the cost to
actually access a lot of this data. And, of course, that's not universally the case, but certainly,
getting digital information, information that people are already putting online, is much cheaper
than having to send people out or run large surveys and aggregate the information. And then,
finally, Twitter and many other social media platforms give us social context. So that means that
we not only see the individual, we see sort of the world that they live in, at least the digital world
that they live in, and that's quite a bit different than the way that survey technology typically
works. When you survey someone, if you stop them on the street, you may be able to see what
they look like, what they're wearing, maybe where they're coming from, what shopping bags
they're carrying, but you're not necessarily seeing anything about the kind of social context that
they have around them. And so Twitter, Facebook, even platforms like Reddit and Slashdot,
give us some social context within which to understand the user, and that can actually be a rich
source of information, as well. So, of course, the challenge with Twitter is that we don't actually
-- we're not given a great deal of metadata about users, and in general, online, we don't have a
great deal of metadata about users explicitly coded by the individual. So on Twitter, for
example, literally, the only field that a user can specify is their location, and usually that location
field is used to specify something that obviously is not location, like the moon, or various places.
But they don't have many fields, actually, to specify things, so something as simple as gender is
not an obvious feature. So if we're studying users on Twitter and we wanted to actually look at
male versus female responses to things or movie reviews or different things, that is not an
obvious feature to actually try to classify by, and so it goes with age and politics or geography.
All of these things are actually not explicitly coded out. And even in the richer platforms, like
Facebook -- well, Facebook is a classic example. Even that information is not always given. In
fact, people do not complete their profiles, and so that information isn't necessarily always
available, either.
So we have a sizable task ahead of us, which is to figure out, using this real-time feed, what we
can learn about individuals. How do we actually create the equivalent of a survey using only
social media data and the metadata associated with it? So to be very concrete, this is the problem
that we have. So here's Starbucks. It wants to learn about the people that are following it on
Twitter, and here's this one user, and these are real tweets, by the way.
Here's what this user has generated. I don't even know how to pronounce some of what they've written: "Imma bring all of my sexy professors an apple on the first day." It doesn't even have proper grammar. Most of Twitter, actually, is effectively nonsense.
It's very personal communications that are coded sort of in deep slang or nuance. So we want to
go from this feed, this textual feed, to some understanding about what the gender of that
individual is, what their political orientation might be, what their ethnicity is, where they are
from. We could want to answer any host of questions. The question is, how do we actually go
about assigning a label to this user? And so I'm going to talk about a couple of aspects of this in
the remainder of the talk. First, I want to just give an overview of what are state-of-the-art
approaches to demographic inference. Given this problem, how do we actually solve the
problem?
I'm going to talk a little bit about sort of the general idea, and then I'll show some of the work
that we've done with using social context and handling different languages and how we can
actually accommodate variance and variations in the way that users actually use social media.
I'm going to talk a bit about why attributes are harder to code. Some attributes are really hard to
code, and I'm going to give a sense of that from work that actually was published just earlier this
year. And then, I want to talk a little bit about sort of the promise of what I see as being the
major promise of mining this kind of information on social media, which is measuring -- taking
measurements and moving them back into the real world, into the physical world, learning about
physical populations from online measurements. And so I'm going to talk about some of the
preliminary work that we've been doing that's gotten at that.
Okay, so latent attribute inference, or demographic inference as it's also known, is cast entirely as a machine-learning exercise, of course. And as a machine-learning exercise, typically what's done is we take a group of users who we can
assign high-confidence labels to. So here, I'm looking at Democrats and Republicans. And we
take some set of Twitter users, Twitter users depicted as these gummy, green characters, for
which we can assign high-confidence labels. And for each of these users, we encode all of their unstructured text and their user account profile using a variety of different features. So these features could be -- and I'll give you an example of this,
but these features could be everything from what the most common word is that they use to how
many friends they have and so on and so forth. So we obtain this feature vector for each of the
individuals, and then we feed it into a classifier or a system that's going to build a classifier, and
a wide variety of techniques have been used for this in the literature. Probably
the most successful -- in fact, definitively, the most successful ones have been support vector
machines and latent Dirichlet allocation approaches. SVMs seem to sort of be the reigning
paradigm. We use them extensively in our lab, largely because they can accommodate more
than just language. So if you're interested in capturing things like what the social context is of
that user, what their neighbors are like, how many neighbors they have, what the social graph is
around them, it's easier to actually encode that in an SVM than in some sort of language model,
which is what LDAs do.
So SVMs have become very much a mainstay in this field, and then, of course, you get your
classifier, and now you have some user that you don't know the label for, and you're going to
construct that same feature vector as you did for these individuals, run it through the
classifier and get your label out. Yes, a question.
>>: So I'm not an expert in this domain, so maybe it's a really [indiscernible]. You said SVM
showed the best performance. Can you tell me a little bit about the precision and recall of the
state of the art? So is it like 90%?
>> Derek Ruths: So I'm going to give a sense for some of that for different features. It differs
depending upon the feature that you're looking at, but in general, as you'll see, the SVM accuracy
is going to vary between about 70% and 90%-something, depending upon the feature. So with
gender and age, we can do better. With political orientation, it turns out that we don't do as well,
so that's in the 70s, but as you can imagine, as you get more complicated features, it can be more
and more difficult. I think that it's a very big open question right now as to whether the SVM is
best suited for this. I'm continually trying to figure out what the better machine-learning system
would be that would take this into account. It's just that I don't feel as though we've exhausted
what the SVM can do. And as you'll see, what really becomes important is what features you're
picking here, because if I'm interested in classifying -- oh, gosh, I don't know, whether people
like to ski, but I'm not putting any features in there that are even relevant to that, then the SVM is
going to do terribly. And so it really does come down to a feature selection question, and a lot of
the investment that my group has made and continues to make is to devise and come up with better measures to feed into the SVM.
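To make the setup concrete, here is a minimal sketch of the train-and-classify loop described above, assuming Python with scikit-learn; the two toy labeled users and the bag-of-words featurization are illustrative stand-ins, not the much richer feature set discussed in the talk.

```python
# Sketch of the latent-attribute inference pipeline described above.
# Assumes scikit-learn; the feature extraction and data are illustrative stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical high-confidence labeled users: concatenated tweet text plus a label.
labeled_texts = [
    "proud to vote blue today, healthcare for everyone ...",
    "smaller government, lower taxes, strong defense ...",
]
labels = ["Democrat", "Republican"]

# Encode each user's unstructured text as a feature vector and train an SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(labeled_texts, labels)

# For an unlabeled user, build the same feature representation and predict a label.
unknown_user_text = "just watched the debate, can't believe what I heard ..."
print(model.predict([unknown_user_text])[0])
```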
And so, per our discussion about features, here's a set that are typically used, you'll find in
literature. So, k-top words -- whenever I say k-top anything, what I mean is the
k-top discriminating words or characters or hashtags. So what that means is that we're going to
take the two or three classes that we're interested in, and we're going to look at the k-top features that are most strongly associated with each particular class. So
k-top words for Democrats and Republicans would presumably look like the political language
that is most polarizing toward Democrats and most polarizing toward Republicans. And then we can of course do that for hashtags and mentions. Stems and co-stems are ways of breaking words into the root word and then how that word is being modified, which ends up actually being very informative when you look at age. And then using character n-grams can actually be very powerful, just looking at the three- and four-character combinations that people use. And
then you can look at more meta features like how often people are tweeting, how often they're
retweeting, how often they're using links, URLs and emoticons and all these different features.
And then you can start talking about things in the network, so the friend-to-follower ratio has been widely used, and in a moment I'll talk about using the actual network itself.
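As a rough illustration of the k-top discriminating words idea in that list, here is one way to rank words by how strongly they separate two classes; the smoothed log-odds scoring and the toy inputs are assumptions for the sketch, not necessarily the scoring used in the published work.

```python
# Illustrative ranking of k-top discriminating words between two classes.
# The log-odds scoring and the toy corpora are assumptions for the sake of the sketch.
from collections import Counter
import math

def top_discriminating_words(class_a_texts, class_b_texts, k=10):
    counts_a = Counter(w for t in class_a_texts for w in t.lower().split())
    counts_b = Counter(w for t in class_b_texts for w in t.lower().split())
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)
    # Smoothed log-odds: positive scores lean toward class A, negative toward class B.
    scores = {
        w: math.log((counts_a[w] + 1) / (total_a + len(vocab)))
           - math.log((counts_b[w] + 1) / (total_b + len(vocab)))
        for w in vocab
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:]   # k-top for class A, k-top for class B

dem_top, rep_top = top_discriminating_words(["hypothetical tweets from democrats ..."],
                                            ["hypothetical tweets from republicans ..."])
```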
It's worth mentioning that one of the nice things about SVMs is that they don't suffer from
attribute dilution, so as you add more features, an SVM is not going to perform worse. So one of
the nice things about an SVM is you can just continue dumping these features in, literally, and it
will just pick and choose the ones that are going to be the most effective for any particular
exercise. So the framework that we have in my lab literally will just compute hundreds of
features and construct these feature vectors, and as we add more, and as we come up with more,
we just continue adding them. And so we can actually see which features are actually going to
be the most useful for any particular classification exercise.
And so just to give you a sense for performance, this is -- from the literature, if you use only user
information, all those features that I talked about, and plug them into an SVM, this is what you
get. So if you're classifying age as a binary category, college age, and then in the five years after
college, you get about 75%. Gender, you can do about 80% accuracy. I'm going to revise this,
so in the literature, up until this year, political orientation was believed to be upwards of 90%
accuracy. It turns out that's not true. And these have been confirmed in a number of studies, so
there's a number of papers that have established that, no matter how you dice it, this is the
performance that you get using the features that are available.
So one of the questions that we investigated early on was how to use social context in order to
actually improve the classification process. And so just to go back to our original setup, you
have Starbucks. It's interested in this particular user. Let's say it's interested in the political
orientation of this particular user, but it's important to realize that, actually, that user is embedded
in a social context where it has other individuals that it's actually related to. And the question is,
can we use the neighbors to learn something about the political orientation or the label of this
particular individual. So, in this case, homophily -- the social phenomenon that we're talking
about is homophily, which is the tendency for like individuals to cluster and to form links with
one another. In this case, if this worked, then homophily would tell us that a Republican would
likely have -- tend towards having more Republican friends, and certainly the literature in social
science suggests that homophily is heavily active in many, many, many various attributes that we
have -- not in all, but in many. And so the foregoing -- or the assumption that we're making is
that by applying the principle of homophily here, we can actually improve our ability to learn the
attribute of this individual, him or herself. And so we can go and we can build these features.
We can take each individual in the neighborhood and actually learn their features, as well, so
now what we've done is we've taken that one individual, we've looked at all of their neighbors,
and we've learned their feature vector.
Now, it's worth pointing out, in Twitter, there are two different types of neighborhoods that we
can take. There are the friends of that user, the people that I'm following, or there are the
followers, people who are following me. It's entirely possible that both of them are informative.
We looked at friends, the people that an individual chooses to follow, because we considered
those to represent the active selection by that individual of people whose content that they're
interested in. And so everything that we're going to look at network-wise is going to
focus on the use of the friends. That's what I'm representing here. The information is flowing
from this user to this user. And so we have these feature vectors, but of course the question is,
how do we actually combine this user feature with all of these different feature vectors. We can't
just put them all down, because SVMs operate on a fixed set of features. You can't just -- well,
you can indefinitely increase it, but you need exactly that number of features for every single
user. It wouldn't be obvious how to -- in fact, I don't even think it would be possible to try to
take an arbitrary-sized neighborhood and just concatenate features in order to produce a
meaningful classification. So instead, what we do is we take the entire neighborhood and we
simply average those features out. So this is sort of an average representation of the features in
that user's neighborhood. And now we're -- but we're still presented with the question of how we
actually want to handle combining these user features and these aggregate features.
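A small sketch of the neighborhood-averaging step just described: build a feature vector per friend, take the element-wise mean so the result has a fixed length regardless of how many friends there are, and concatenate it with the ego user's own vector. The placeholder featurize function below is hypothetical, standing in for the real feature extraction.

```python
# Sketch of combining a user's features with the averaged features of their friends.
# numpy is assumed; `featurize` is an illustrative stand-in for the real feature set.
import numpy as np

def featurize(tweets):
    # Placeholder features: tweet count, mean tweet length, link rate, hashtag rate.
    n = max(len(tweets), 1)
    return np.array([
        len(tweets),
        sum(len(t) for t in tweets) / n,
        sum("http" in t for t in tweets) / n,
        sum("#" in t for t in tweets) / n,
    ], dtype=float)

def user_and_neighborhood_vector(user_tweets, neighbor_tweet_sets):
    user_vec = featurize(user_tweets)
    if neighbor_tweet_sets:
        neighbor_vecs = np.vstack([featurize(t) for t in neighbor_tweet_sets])
        neighborhood_vec = neighbor_vecs.mean(axis=0)   # one averaged vector, any degree
    else:
        neighborhood_vec = np.zeros_like(user_vec)
    # "Joined" policy: concatenate, giving twice the number of features, fixed length.
    return np.concatenate([user_vec, neighborhood_vec])
```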
And so what we have now is we have the same classification problem. We're still using the
SVM. We're still running it through and we're still building this classifier, this feature-based
classifier. We just now have two times the number of features. There's a number of immediate
questions that come up when we try to do this. The first is, which neighbors are we going to
use? So people follow many thousands of people on Twitter. It's not clear, necessarily, who
we're supposed to select, and so you can imagine many different policies by which we could
select individuals. Some of the ones that we considered are everybody -- seems pretty
reasonable -- most popular, meaning my neighbors with the most friends; least popular, my neighbors with the fewest friends; and then basically the people that I mention the most. Now, these proxy for different things. If you think about your own attributes, and about which of the friends or people you associate with tell you the most about yourself, you can imagine that
in different contexts different neighborhoods would be more relevant than others. So, in the case
of what you like to eat, it might be best to actually sample your immediate family or the people
who you actually hang out with recreationally. If I was interested in your actual vocational
interests, it would be far more important for me to sample people that you worked with, in order
to learn about what you are actually interested in doing vocationally. And so sub-sampling a
network can actually be a very important part of determining which features that we should be
using. And so that's the motivation here. What we've selected for, effectively, here are the
extremely popular individuals. The least-popular individuals actually proxy for individuals that
that person may know, because it's incredibly unlikely that they would have simply found them
by chance, so these are people that you must have gone out looking for and actually selected.
These are not people you heard about through media or through other people.
And then N-closeness is this idea that maybe these are actually your closest friends, the people
that you talk about or the people that you're the most connected to topically or relationally. And
then there's the question of, we have these two feature vectors. Whatever your neighborhood is,
we have some average feature, feature vector for that neighborhood and we have the user's
feature vector. How do we actually treat that? So we could join these -- actually put them together and have twice the number of features -- or we could use one or the other on its own. It's unclear which policy would be the most reasonable one to use, so we tried them all. So the data that we used in
order to look at this, we looked at three features. We looked at age, we looked at gender and we
looked at politics. These were selected for a couple of reasons. First off, they're features that are
of significant interest to organizations, to researchers, to a variety of different stakeholders in this
field. Also, they can be made relatively binary. Gender is itself binary. Politics in the United
States is a fairly binary distinction of Democrats and Republicans, although that's subject to a
great deal of debate, but certainly people can actually orient themselves that way. And then age,
while age is not discrete, we can certainly break it into meaningful time regions and life
experience, so in particular here we were looking at 18 to 23 and 25 to 30 as sort of a college age
and then leaving college, early adulthood. And so we went out and collected a whole bunch of
data, so these are the labeled users that we obtained information for, but in order to go grab all of
their friends, we eventually ended up grabbing on the order of 500,000 users, and then we had to
go, in addition to that, grab all of their tweets. So each data set ended up being over a gigabyte
in size, so running this study was not a trivial endeavor.
>>: Is this all US based?
>> Derek Ruths: They are all Anglophone. They are not necessarily all in the US.
>>: So I don't know if this is really relevant, but how do you handle regional differences like dialect, or, say, someone pretending to be a gang banger -- I'm Miley Cyrus, but I think I'm black, so I speak using African American vernacular. Do you filter those out? Is that just noise?
>> Derek Ruths: That's just noise. I would love to filter that out. Or, rather, I wouldn't like to
filter it out. I'd love to actually be able to identify those discrepancies or those differences so that
we could sort of more finely separate out our population. But you can imagine that actually
detecting that is a latent attribute inference exercise itself, so in some sense, what this represents
is a first -- what I'm going to show you and where we are with the field is a first foray into this.
So trying to actually learn those kinds of features are going to be just fantastically interesting, but
we have to hit a bunch of lower-lying fruit in order to get there.
>>: Something that you might want to check out is that one of our researchers in New England,
named Kate Crawford, she's doing a lot of work looking at big data. This would be considered
big data. Big data and I guess due process and looking at discrimination based on inferences
made by Facebook posts or your tweets or other blog posts. I haven't had the chance to read the
paper, but I've heard it's pretty interesting.
>> Derek Ruths: That does sound fascinating. I'll follow up with you afterwards to get the
paper. Yes?
>>: Could you say a little bit more about who your labeled users are?
>> Derek Ruths: How we labeled them?
>>: How you labeled them and who they are. Are they people with a certain number of tweets,
themselves?
>> Derek Ruths: So each user, each labeled user, had to have 1,000 tweets, so these are active
Twitter users, and as I mentioned, they're Anglophone, 1,000 tweets, at least, and they have to
have at least 10 friends. So these are people who actually have some social context, are active
Twitter users, and they should be using at least roughly the same sort of language structure.
>>: Were these the same 400 for each?
>> Derek Ruths: No, no, they're all different. And the reason is because, in order to actually get
good ground truth for these, we had to go and sort of mine information differently. So in the
case of age, we looked for people basically declaring their birthday, wishing themselves a happy
birthday, so potentially we enriched for extremely narcissistic people, but that aside, what
we looked for was people credibly saying, happy birthday to me, I'm whatever. Gender, what we
did is we've had a number of different passes. In this version, what we did is we looked for
strongly gender-identified names, which has been standard, although now we have a much better
technique for actually doing gender labeling, which uses profile pictures, which I can talk about
later. And then politics was done -- in this study, politics was done using self-declarations in
profiles, and I'm actually going to talk about that, how we improve on that, as well. You know
about that work. Yes.
>>: And it looks like -- I mean, you mentioned the labeled users had at least 1,000
tweets, and it looks like you're going back several hundred tweets, on average, per user.
>> Derek Ruths: That's right.
>>: What timeframe? Is there any worry about them changing age?
>> Derek Ruths: That's a good point. We did not take that into account in terms of age. These
are five-year windows.
>>: I'm thinking my first tweet from 2007 probably doesn't sound the same way my tweets do
now.
>> Derek Ruths: That's true. We did not account for that in the case of age, and that would have
been a good thing to do. In the other circumstances, gender and politics, I'm a little less
concerned about that, because there's potentially going to be standard language that we can pick
up that will actually span time. I would be very concerned if we were trying to look at video
gamers, though. If we were looking at things where there was extremely -- where the language
of the kind of thing we were trying to classify was clearly going to shift over time, I'd be really
concerned.
>>: For the sake of gender -- politics is sort of that in between.
>> Derek Ruths: Yes. Politics is borderline. Politics is a bit borderline, as well, so certainly
time can become an issue, and actually I think time in general is a huge challenge, and it's
something that I'll touch on a little bit later on in terms of things we need to do to address it. But
certainly, taking time into account is important, and your point is taken. I think age would be an
interesting thing to unpack where that's concerned. So if we take these users, and we actually
run our classification system on them, we do some sort of k-fold cross-validation, this is what we
get. And so I'm going to point out some particular numbers. You don't need to read them all -- what's important to see in the table first is that this is the baseline for users only, so this is if we only
use the user vector. Everything down here is simply merging in that average neighbor
information in one way or another. And as you can see, the numbers are not the same. The
numbers to particularly pay attention to where we saw dramatic improvement I've circled in red,
so using just neighborhood context, we've been able to go from 75% to 80%, a five-point improvement in age. Gender we actually don't see much of an improvement on. And of course,
that is somewhat expected because gender is not a terribly homophilic property, meaning that
typically we'll all associate with men and women, so it's not as though your gender is going to be
a strong indicator of seeing an enrichment in one gender -- in one label or another. And then
political orientation, again, we see a significant boost in.
Now, what's interesting is that we see this boost for different neighborhoods, which is
interesting, as well. So here, we see that for age you get the most significant boost by looking at the friends who are least popular. And so what this is suggesting is that you're
getting a lot of information from people who are most selected by you, most clearly selected by
you, as opposed to I saw them in the news or these are very popular individuals that I simply
want to follow, or they're news sources. These are user accounts that you knew about and you
linked to, so least here is presumably proxying for people that you know very well. And other
studies have established that age is a very homophilic property, particularly in the close circle
that you keep, and so it's not surprising that we would find that we get such a dramatic increase
using this information.
In political orientation, we get an improvement, or the most significant improvement, using all
information. Presumably, this is because individuals -- in work that I'm not going to be able to
show, we've gone on to show that individuals enrich their neighborhood uniformly for people
who are politically similar to them online. And so you get a lot of signal from people you know,
people that you don't know but are popular and organizations that you follow. And then, of
course, gender -- gender, we're not getting a great deal of information from, so really, I'm not sure
how much we can say about the particular quality of that inference.
Now, the other thing to mention is that -- well, so neighborhood can certainly increase the
performance, so here we see the actual increases that I pointed out, but what's particularly
interesting to me is the fact that using only neighbor information -- so here we dropped all the
user information. We didn't even use the user's feature vector. In the case of age and political
orientation, we actually do -- well, in political orientation, we actually do better. In the case of
age, we effectively do as well as if we had the user's information itself. And so what this means
is that -- this has some very practical implications. Users can make their accounts private, but if
I can actually find the constellation of users that are around them, I can learn a great deal about
them, even without looking at the content that they're generating.
But, of course, in the case of gender, it's not a homophilic property, so we're not getting a great
deal of signal. I think it's worth mentioning. I think that gender is still an interesting place to
look for social signal, but I think that it may be necessary to actually mine out different features and learn a different machine in order to actually use one's neighbors to learn a
gender label for the ego. All right, so I'm going to move on and I'm going to talk a bit about how
we can handle language. Are there any questions before I move on to that?
So it turns out that, if you look at the literature, if you look at what's been done on social media,
you would believe that Twitter was almost an English-only platform, because there's just so
much stuff done on basically Twitter users in English, and in our early work, we were certainly
among this community that was looking only at English users, and there's a reason, because I
don't know other languages. But if you look at the statistics, Twitter is only 28% English, which
means the majority of content being generated on Twitter is actually not being generated by
English speakers. That doesn't mean that they don't know English. It just means that their
preferred language of tweeting and of communication is something besides English. And so what we've done, over the past couple of years, is really leave a big black hole in terms of
latent attribute inference concerning other languages. And so what we looked at in this study,
and actually, this is what brought me to Seattle recently, is the extent to which we can use gender
inference machinery that's been developed in the past on other languages, or what we have to do
in order to actually handle other languages. I consider this to be both a targeted investigation
into a particular feature and also a broader way of thinking about what it is going to mean to do
latent attribute inference in a multilingual environment? So the first thing to mention, we went
out and collected data sets for a diverse set of linguistic families, so French, Indonesian, Turkish
and Japanese are just about as different as you can get. And so these are selected so that we can
get a very broad spread to determine how well this machinery would work. And I think the first
takeaway is the out-of-the-box machinery -- so this is just using the SVM that I had talked about.
You have to remove the stems and co-stems and so forth, because these are language specific,
but take all of the features that we've been using in the past, and the machinery works. So we are
seeing performance as good -- so this is 76% for French. English was about 75%. We're seeing
performance as good or better than what's been published for English. And so we can actually
use existing machinery quite effectively on other languages. The only outlier here is Japanese,
and the complex orthography of Japanese makes it very hard to fit into many of the features that
we've defined for languages that have a much more limited alphabet. So I think this is actually
sort of a big open question, which is how do you actually handle Japanese, Chinese, these
languages that actually have this complex orthography. We have a couple ideas, but it's worth
mentioning that Twitter has a lot of Japanese content on it. So this is not just an academic
exercise. This is a lot of information that we actually can't mine.
>>: I'm not much of a language expert, so this might be a dumb question, but do any of these
languages or languages you looked at have gendered pronouns for referring to other people?
Like, so people saying you?
>> Derek Ruths: Let's see. Turkish is actually a genderless language. Japanese, it's surprising that
we did this poorly, because Japanese has a tremendous amount of gender encoding in the
language and in the usage. So this is clearly not getting the features right. And so I think that,
actually, of all the languages that are up here, I think what you're pointing out would work best
for Japanese. Indonesian is also, to my knowledge, a genderless language. And French is not
genderless. However, the pronouns are not coded to gender. Other words are coded to gender,
but not pronouns. However, there is something really cool that we can do with French. French
has a really nice construction that actually encodes the speaker's gender. So when you say, I am
X, when you make the statement, I am X, you actually have to decline, change the ending, of the
adjective or participle that follows this construction. And so we considered this to be a
potentially really rich source of information about gender. So we simply went in and we looked
for -- je suis written this way happens rarely in Twitter, because people mangle it and have all
sorts of slang ways of writing it. But suffice to say, you can find a lot of instances of people
using it. And so the assumption is, if people are using proper French grammar, males would
only be using the male constructs, so they'd say, je suis petit, and the females would be saying, je
suis petite, and they would be actually spelling this out differently.
Of course, social media being social media, that's not guaranteed, and so one interesting
discovery that we had is that, in French, just like in all other languages, grammar breaks down.
Women on Twitter who tweet in French use many, many, many French male constructs, so when
they say, je suis blank, they will often leave off the feminine ending. And it's worth mentioning
that it's a little extra work, because the feminine form always adds a character or modifies the ending in some way, and so it's not surprising that the shorthand is to go to the
male construct.
But if you take users and you simply apply a very basic threshold, which is if in the history of all
of the tweets that they generated, they've used a female construction even once, then you can
actually classify gender with extremely high accuracy. So if you use just this threshold-based
je suis construct, then you can get overall accuracy of 90%, which is up from 76%. So that is a
huge, huge improvement. Now, the only catch is that not everybody uses this je suis construct,
or at least they don't use it in a form that we could recognize. And so this covers about three-fourths of all the users. That's what you're seeing here. Out of the 1,000 users we were looking
at, we got about 750 of those, we found them using a je suis construct. And so those could be
very accurately classified. Everybody else, once you looked at what was left over, we couldn't
classify very well at all. In fact, we classified them much worse than what we had in the base
classifier. So what this suggests is that the je suis construct is, A, a very reliable classifier, but
it's also selecting out individuals that have strong gender-indicated language in one way or
another.
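A rough sketch of the threshold rule being described here, assuming a hand-written regular expression for "je suis" followed by an adjective and a crude feminine-ending check; the real study's matching rules were presumably more careful than this illustration.

```python
# Sketch of the "je suis" threshold rule: if a user has ever used a feminine
# construction, label them female; if they use the construct but only masculine
# forms, label them male; otherwise abstain. The regex and the feminine-ending
# heuristic are assumptions for illustration only.
import re

JE_SUIS = re.compile(r"\bje suis\s+(\w+)", re.IGNORECASE)
FEMININE_ENDINGS = ("ée", "euse", "ive", "elle", "te")  # crude, illustrative list

def je_suis_gender(tweets):
    """Return 'female', 'male', or None (construct never observed)."""
    saw_construct = False
    for tweet in tweets:
        for match in JE_SUIS.finditer(tweet):
            saw_construct = True
            word = match.group(1).lower()
            if word.endswith(FEMININE_ENDINGS):
                return "female"     # even one feminine form is enough (threshold rule)
    return "male" if saw_construct else None
```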
Now, I think that there's something very exciting here, though, and that's not been looked at in
the latent inference literature, and that is triaging users. So traditionally, the way that all work
that I've seen has gone about this problem is saying, well, we've got to classify everybody. You
give me a user, I'll just give you a label. But this suggests that there's another alternative, and
that is what if we could simply identify users that we could get the gender for well, and then we
toss everybody else out. If that number was high enough, that would actually be a pretty nice
step in the right direction, because we would be able to generate high-quality classifications for a
large portion of the population, and then we would segment out a different part of the population
that would need to be treated differently, and this could be treated differently from any number
of different angles. Maybe what we need to do is just build a different classifier. Maybe we just
need to Amazon Mechanical Turk the identities. If the numbers are right, then there's any number
of admittedly more manual ways that you could go about handling a large classification problem.
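A short sketch of that triage idea: apply a high-precision rule wherever it fires, and route everyone else to a fallback path such as a different classifier or manual coding. The function names here are hypothetical.

```python
# Sketch of triaging: high-confidence rule first, everything else set aside
# for a fallback path (another classifier, manual coding, etc.).
def triage(users_with_tweets, rule, fallback=None):
    labeled, deferred = {}, []
    for user, tweets in users_with_tweets.items():
        label = rule(tweets)                     # e.g., the je suis rule sketched above
        if label is not None:
            labeled[user] = label                # high-confidence portion of the population
        elif fallback is not None:
            labeled[user] = fallback(tweets)     # e.g., the base SVM, or Mechanical Turk
        else:
            deferred.append(user)                # explicitly left unclassified
    return labeled, deferred
```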
>>: So this might be too early to ask this question, but have you started looking at cross-referencing additional data to pull other characteristics, like music or movies? Other people in different communities -- especially in China -- are cross-referencing all the different social
networks. They find people's favorite movie profiles or whatever, and then they've taken to dividing people into male, female, gay, non-gay, that kind of thing. I know that Netflix has been
doing that for a while. They can figure out that you like certain types of movies.
>> Derek Ruths: Yes. So I'm very interested in that. Doing that in, A, the academic context
and, B, the closed, private social network model that currently exists is hard. I'm not familiar
with very many efforts that have been really successful at cross-linking accounts between
different social media platforms in aggregate.
>>: I can send you records.
>> Derek Ruths: Okay, excellent. But suffice it to say, that would be fantastic. I think before
we even do cross-referencing, I think that there's a lot to be done in terms of topic-based features,
and so that's something that we're looking at currently, which is maybe rather than looking at
words and these different attributes that we've been looking at, maybe what we need to do is look
at the kinds of things that people talk about and actually try to encode that. That could be gotten
by looking at other sorts of social media platforms, but it could also be done by looking at
different aspects of the language that they're using, so maybe embedding the stronger and more
sophisticated language models.
All right. There, I'm circling the best number that we had in the paper. Okay, so next, I'm going
to talk a bit about the challenges that we have. If we look at where demographic inference is
right now, I think that there are a number, a host of different challenges that I think are really
exciting open problems for us to work on. There's -- first off, there are just standard technical
challenges that some of you already alluded to, so temporal variation is a huge problem. So in
unpublished work in my lab, we've looked at how bad things get if you train at a particular time
and then try to classify users later. And we've shown that, depending upon the feature, even a
week will cost you about 10% in accuracy. And so that's serious. That is really serious, which
means that after a month, your classifier is almost useless. It's probably doing about as good as
random or worse. And so we need to come up with some way of handling temporal variation,
and this could either be by continually updating the models or by learning more meta
features that we would be using.
Performance is of course a huge issue, so if we have to actually grab all the neighbors for an
individual every time we need to classify them, that's a lot of data. If we need to grab a gigabyte
of data just to classify 400 users, that's pretty expensive, and so one question would be can we be
smarter about the kind of data that we're using and subsampling. And then finally, the literature
is rife with examples of using binary features. They're easy. They're nice to work with, but, of
course, most features that we're interested in are not binary at all. If we just take age, for
example, what we'd really like is we'd like much finer bins for age, and the machine
classification literature doesn't offer us many out-of-the-box ways of working with non-binary
data, and so a big question is how do we take these problems that we've posed and come up with
better ways of handling these richer features? There's attribute-specific challenges, so I think
that a lot of the reason people have looked at the attributes that they have is because the attributes
are fairly easy to get access to and ground truth on. However, there's a lot of interesting features
out there that we really do need to actually get some purchase on, like education, location,
activity profiles, interest, these sorts of things, and these are much harder to actually get ground
truth for. And so one of the big questions moving forward is how do we get better ground truth
for more nuanced features, and how do we actually encode that in a way that we can get at? And
then, finally, there's -- what was alluded to earlier was this idea of regional variation, so we know
that regional variation introduces linguistic differences, introduces different practices, and so if
we're studying populations, what we would really like is segmented models that actually address
different communities in the total population. Not every person can be treated the same, so
personalized models would also be pretty important to have moving forward. So these are some
of the important dimensions that I think really need additional work. Yes.
>>: I have a question about a different challenge, that maybe it's some combination of these. So
once you infer the demographics of some group of users, you want to use it for something, you
want it to help you interpret something, so you might care about who are the -- what's the
population look like for the folks who are supporting one political party or another or the people
who are talking about Starbucks all the time. But then, when you are trying to learn those
demographics, you're also using potentially some of the features that are tied up with the
question you're asking. So you might have some training data where the people who -- all the
men happen to talk about Starbucks, and so you might learn that mentioning Starbucks means
that you're a male. I don't know why that would be the case. And then you go and then you say,
okay, I'm going to apply this to everyone who's following Starbucks, and your classifier spits
out, inaccurately perhaps, that everyone is male because they all mention Starbucks, but that was
your filtering condition. So there's this question of endogeneity and stuff like that around how
you're classifying people and what you're trying to learn. Have you thought about that?
>> Derek Ruths: In some ways, what I'm going to talk about next, which is the political
orientation and the problems that we've had with political orientation, is going to maybe get at a
little bit of this, in terms of the way that poor assumptions about the way we should sample
populations can influence the results that we get. But unfortunately, I don't have a very good
answer to that. I feel as though, as with a lot of large big data problems, the solution is to get as
random a sample of the population as one can. In many ways, I think that what you're talking
about underscores why it's important that the computationalists that are working on this problem
also have close ties to social science, because I think that social science actually has a great deal
to tell us about properly designing data set sampling techniques or being aware of correlations
and biases that we may be introducing. But even there, I don't think that it's a systematic thing. I
think it's something we just have to be continually aware of. Unfortunately, I don't have
anything really, really strong to say, but I think it's something that we just need to be on guard
about. Actually, on that note, in terms of being on guard about things, I want to talk about really
the story of how I think the latent attribute inference community, my work included, really
became confused in approaching political orientation, the problem of political orientation.
All right, so let's do a quick exercise, right? I just want to do a very quick game. The game is,
I'm going to put up a picture and you're going to tell me what the political orientation of this
person is, right? You can already tell who it's going to be. Democrat? Very good, all right.
Republican, all right. Let's see, how about this person? You guys were doing so well. All right,
let's try another one. Republican, all right? How about this one? You seem less certain. What's
going on? Republican. Okay, how about this one? Here we go, last one. Don't worry.
Democrat? Okay, he's Canadian. So what happened? What happened to all the certainty? We
were all in unison, crying these names out, and then all of a sudden we hit these characters and
we're not able to do it nearly as well. So I'm trying to illustrate a point, and that is, when it's
obvious what a person is, it's very easy to classify them. They’ve been labeled for us, they've
self-declared things for us. It makes it very easy to actually assign a label to these individuals,
less so for people that we don't know. And so this really gets to the heart of the problem that
we've been having in the literature, which is that a lot of the data sets we've collected and
reported political orientation results on have to do with people who are easy to identify. And so
in this study, we looked at, in some sense, a very simple question, and that was, what happens
when you weaken that condition, when you don't look at people who are easy to identify? How
bad does political orientation inference get?
And so what we did is we went out and we somewhat arduously built three different data sets. I
mean, different levels of ardor were involved. Getting political figures was easy, because they
just have these Twitter accounts; we go out and grab them, we get all the senators and representatives and so forth. Active users were also pretty easy to find. These are people who
simply declare their orientation. I love being a Democrat, or Republicans rule, these sorts of
things, just stating that sort of thing in their profile. We included them in the active data set.
Modest users were nontrivial to get. These are individuals who use political language but do not
self-declare in any way.
>>: Is there a sarcasm detector?
>> Derek Ruths: We actually manually coded all active users, so the active users were put through
an AMT coding exercise, so I expect that cleaned out any sarcasm. Let me tell you, there's other
language that goes into profiles that contain that, but yes, so hopefully the manual coding
handled the sarcasm detection. At least, when we went back through and eyeballed what we had,
it made a lot of sense. But the modest users, this was really where the interesting part of the
study took place, and that was figuring out how to actually measure these modest users. So to
give you some idea of how we went about doing this, let's look at some of the features for the
political figures. These are the -- the top hashtags are generated for the active political figures,
so you can see that they are very strongly associated with, not surprisingly, the Democrat and
Republican platforms. What we did in order to identify modest users -- let me tell you what we
did not do first. What we did not do to identify the modest users was take these hashtags and
then go look for other users using them in Twitter. And the reason -- it's subtle, but the reason
we couldn't use these is because these are highly discriminative Democrat and Republican
hashtags. If we had gone out and found users that used these or selected on users that used these,
we would have effectively been selecting people who had a strong valence or a high likelihood
of being Democrat or Republican. That's not what we wanted. What we wanted was an
unbiased sample of people who spoke about political things, and so what we did is we took the
least discriminating political hashtags that were used by political figures and politically active
individuals. So these would be things like #jobs or #taxes or things that carry no political
valence but still talk about a political topic. So we took those hashtags, and we identified users
that used those and that had no mention of political parties in their profiles. So these were
people who really were not giving much signal. And then what we did is we took those
individuals, we pulled out all the tweets that contained political language, and then we Amazon
Mechanical Turked that. We basically asked people to code the political valence of these
individuals. Now, this is not a foolproof method. We took individuals that received majority
vote for a particular orientation, and so you can imagine that there would be some uncertainty,
even there. But what it gave us was a corpus of individuals for whom we had a fairly certain
valence assigned, but it was coded in a more nuanced way. It wasn't necessarily in the explicit
words that they were using. It may have been actually in the semantic construct itself, which is
much, much harder to get at, computationally. And so given this modest set of users, along with
this active and figure-based set of users, we could start to look at how well the classifiers that we
had traditionally been reporting as really, really good actually performed. So these are the
somewhat disturbing results. Figures, not surprisingly, we do very well on, 91%. This is the
number that had been always reported in the literature. If you move down to individuals who are
still self-declaring their orientation, you already lose about seven percentage points of accuracy.
You're already down at 85%. And if you take individuals who do have some sort of political
valence and express it on Twitter but simply are not overt about it, you end up with 70%. Here,
we're barely doing better than random, practically. And so what we can see here is that the SVM
performance seriously degrades when we actually want to look at normal people. And I would
argue that, in terms of inferring political orientation and any other feature, it's most important for
our machines to work on normal people. So hopefully this is getting back to your point a little
bit, which is how you collect your users really can influence how well your machine is going to
do or how well you think you're doing at assigning this classification.
Now, this afforded another cool opportunity, and that was we could look at -- for the first time,
we could look at what happens when you take a machine that's classified on one set of users and
use it to classify a different set of users. And so -- this is a pretty important question to ask,
because what this means is -- this is always going to happen in the wild. You'll take some set of
users, you will classify on them, and then you'll pick up your machine and you'll run it on a
bunch of other people.
Now, what the literature was telling us is that you train your classifier on these political figures, and
then you could pick it up and you could run it anywhere, but this is the performance that you
would expect to see. You train on political figures and you classify either active or modest users,
and you see a dramatic decrease from even what you could have gotten if you had trained on the
original users themselves. So we can't even do cross-classifier or cross-data-set usage of these
classifiers. And in some sense, that's not surprising. These things are using different features,
but what's profound is just how much of a price we pay in order to actually do that cross-classification. And so I think that this is another very important question to be asking as we
move forward in this research direction, which is how do we build machines that are robust
across these populations?
Okay, and so finally, in closing, because that was potentially a bit of a negative note. Now I
want to just switch things around a bit and talk a little bit about the promise of what we can do
with social media in terms of measuring physical populations, and this is something that I'm
deeply interested in. I really am hoping that what we can do is use online measurements,
effectively social media sensors, to talk about physical populations. And so this is going back to
some preliminary work that we published two years ago, or a year ago, but have since made
significant progress on. I'm going to talk about the results that we published then, but we've
made a lot of progress, which I'll allude to. So here's the setup. We have the population of the
world, and we would like to actually estimate the makeup of that population, how many men
versus how many women, how many skiers, how many skateboarders, what have you. We'd like
to actually get the composition of this population, how many people were riding the bus, how
many taking the car. These people are generating some social media footprint, and we're going
to run that through a classifier and get some estimate of the Twitter population. Now, the
question is, if we can correct for this bias, how close can we get back to this estimate of the real
population? So what I'm going to talk about is this, right to here. We are still working on
figuring out how to correct for Twitter biases. But what I'm going to show you is that even
without correcting for the biases, we can get a fairly rough estimate of real-world populations, at
least in some cases. So we looked at census data, which is a nice, stable measure of populations.
It was also done in a very systematic way, so we have some confidence that the results being
reported are accurate. And so what we looked at was gender inference applied on commuter
populations, so in this case, what we'd be looking at are individuals who are choosing to
commute by car, by bike or by public transport bus. And the census figures -- census figures
typically, and we looked at Toronto, but census figures sort of systematically give you gender
breakdown in each of those categories, how many men are commuting by car, how many women are commuting by car, same thing for bikes and buses. And so what we wanted to
do was determine whether or not taking measurements of just Twitter populations would allow
us to reconstruct any gender bias present in those commuting populations.
These are the census figures that we have for Toronto. By the way, I live in Montreal. We didn't
do Montreal because it's intensely bilingual, and we can handle English and we can handle
French, but handling mixed languages is another open topic in terms of how you actually
accommodate for that. And obviously, it's one that we're very incentivized to solve, but at this
point, we were just interested in looking at a monolingual context. In online data, what we did
was we took user accounts, Twitter feeds that were effectively oriented around giving news
about a mode of transport as proxies for that transport. So we took -- for instance, this is a
commuter traffic feed. This is about public transport, and this is a biking community, and we
took all of the followers of those accounts as a proxy for the Twitter population that uses each of
these different modes of transport in Toronto. And then, we gathered those individuals, we ran
our classifier, our gender-based classifier on them, and we determined how well or what
agreement we saw between the measurements that we took, the gender biases we saw, and the
biases that were reported in the census. This is more information on the classifier. Again, we
used the same sorts of features that we've been talking about, and then we ensured that our
classifier was working as well as it usually does, which you can see here. So it's achieving
effectively the same performance that we've been reporting previously between about 80% and
85%.
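Purely as an illustration, here is a small sketch of that measurement step: classify the followers of one proxy feed per commute mode and report the resulting gender split. The account names, the gender_classifier object and the get_follower_features helper below are hypothetical stand-ins, not artifacts of the actual study.

```python
# Illustrative sketch: gender split among the followers of per-mode proxy feeds.
def estimate_female_fraction(follower_features, gender_classifier):
    """Return the fraction of followers the classifier labels female ('F')."""
    predictions = gender_classifier.predict(follower_features)
    return sum(1 for p in predictions if p == "F") / len(predictions)

# Hypothetical usage, one proxy feed per commute mode in a single city:
# proxy_feeds = {"car": "@some_traffic_feed",
#                "bus": "@some_transit_feed",
#                "bike": "@some_cycling_feed"}
# twitter_split = {mode: estimate_female_fraction(get_follower_features(feed),
#                                                 gender_classifier)
#                  for mode, feed in proxy_feeds.items()}
```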
So if we look at the actual physical population, this is the gender bias in each commuting
population. This is the gender bias that we obtained. Now, these numbers are not equal.
They're not even remotely equal, but if you really, really squint your eyes, what you will observe
is that we actually reconstructed the valence of the bias. And that's the point that I actually want
to make about what we're finding right now, and that is that in each of these commuting
populations, we were able to determine the gender leaning of each mode of transport. Now, it's
important to observe that there were three modes of transport, and there were different biases in
each one. In particular, in public transport, there are more females than males that ride public
transport. This would be a much less interesting result if they were all male dominated and we
observed that these were all male dominated. And the reason would be that that wouldn't
necessarily prove that our classifier was doing anything except just discovering the Twitter prior,
some sort of prior on male presence in Twitter. But the fact that we actually find the valence and
the leaning, the bias, to be the same as was observed in the physical population, and the fact that
it actually changes suggests that we're actually finding signal that represents the populations
underneath these.
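The valence check being described can be written down very simply; the sketch below uses placeholder numbers, not the Toronto figures from the talk, and just asks whether each Twitter estimate leans the same way as the census relative to an even 50/50 split.

```python
# Does the Twitter-estimated gender bias lean the same way as the census?
def same_valence(census_female_frac, twitter_female_frac):
    """True if both fractions sit on the same side of 0.5."""
    return (census_female_frac - 0.5) * (twitter_female_frac - 0.5) > 0

census  = {"car": 0.47, "bus": 0.56, "bike": 0.38}   # placeholder fractions female
twitter = {"car": 0.41, "bus": 0.52, "bike": 0.33}   # placeholder classifier output

for mode in census:
    print(mode, "valence agrees:", same_valence(census[mode], twitter[mode]))
```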
In ongoing work, what we've done is we've effectively gone and obtained the census data for
now nine different cities, all Anglophone cities at present, and looked at our ability to reconstruct
these same figures, these same census figures, for each of those cities. These are international
cities, so we've got Sydney, we've got cities in the States, we have London, cities in Canada.
And in the majority of cases, we can actually reconstruct the gender bias correctly. Now, making that correction remains an open question that we're working on, and what I'm hoping we find is that there's some sort of systematic bias that we can use to actually correct across different populations. That's entirely conjecture -- it's not clear exactly how we need to correct for
these biases, but suffice to say, that's the next step, because clearly we're getting signal. And I
should say, in other populations, in other cities, we also see differences -- women being dominant in one mode of transport, men being dominant in others -- and we still see our ability to reconstruct or discover those biases. So we're getting signal. The question is, how do we actually tune it so we're getting closer to the actual measure of the population rather than simply getting leanings?
All right, and so just in closing, hopefully I've given a sense for demographic inference, where we've come from and the problems that we face. Where we stand, there's
tremendous opportunity for using this. I mean, it's clear that we can make quite high-resolution
measurements under relatively controlled conditions. And so the question now is really how do
we start to take away those controls and permit our machines to continue doing well? How do
we introduce time, how do we introduce population variance? How do we actually account for
mapping into physical populations? But all these present what I consider to be very manageable,
although big, open problems that are exciting directions for future work.
I'd like to recognize -- in closing, I'd like to recognize some of the students who did all of the
work that we've seen. Faiyaz is my PhD student; he'll be graduating soon. Wendy is a very
talented undergrad. Also, not listed here is Morgan, another undergraduate, who did the most
recent work on multiple languages on Twitter, and this was all supported by funding from the
Canadian government, as well as from the Kanishka Project. Thank you very much. Questions,
yes?
>>: In the example that you showed before, where you looked at the particular accounts of certain news organizations, some mechanism to determine the fixed --
>> Derek Ruths: Yes, yes.
>>: Why did you pick that instead of what you've been doing before? It seemed like what you were doing before, in some cases, you were identifying people based on what they say, like "I am a Democrat" or whatever. Why not look for "I'm taking the bus," or --
>> Derek Ruths: We looked at that, but there just wasn't enough. So we looked for people
declaring how they were commuting, and it turns out that it's actually too boring for people to
even mention on Twitter, which is a pretty low bar. But people don't talk about the mode of
transport that they're taking, at least often enough. Oh, another thing to mention is we needed
geographically local individuals, and so that can get dicey, because we didn't have much signal
for people talking about the mode of transport. We had even less signal once you had to have the
mode of transport and some confidence that this person was in Toronto. And so somebody
following a news feed from Toronto, pretty good indicator that that's what you're looking at.
>>: Did you look at multiple Twitter groups? Certainly, there's more than one Twitter feed
about traffic.
>> Derek Ruths: We were surprised. Different cities have different numbers, and that's one of
the interesting things that we're grappling with right now, because -- I could pull the data, but it's
probably easier for me to describe. If you take different feeds, you can actually get different
valence in terms of the gender composition that you get. A classic example: bikes are real problems,
actually, and the reason is because bike commuting is different than bike interest groups, and so
you can get Twitter handles that are oriented around female bicyclists. And of course, those are
going to look extremely dominated by female bicyclists, for good reason. There's a very big
question as to how do you select the proxies to actually get the most information, and I would be over the moon if I actually had an answer to that question. Bikes seem to be the most problematic.
Public transport and traffic, they seem to agree a bit more, but bikes are quite an issue. Yes.
>>: So in the first study, you showed that the people someone follows -- their neighbors -- send a strong signal as to what that person is. Can you use that for the second study? You talked about people for whom you can basically figure out whether they're a Democrat or Republican, and then there are the active users. So rather than just putting everything into one pool and trying to classify them, could you first look at the ones with a very clear signal, figure out who and what they are, and then for the second tier use that first tier as neighbors -- look at the people they are following -- so that you can have better labels for the neighbors?
>> Derek Ruths: That's a great insight, because what you could do is, I guess what I had been
saying before, is that you look for the obvious wins, and then once you're done with the obvious
wins, hopefully you have a smaller population that you can spend more effort on. And maybe
the effort is go and collect all their neighbors. Certainly, that is a possibility. We haven't looked
at it, but that would make sense.
My only concern would be that, in the case of political orientation -- well, there has been some
work that showed that even -- well, no. Actually, most of the work on this has shown that people
who have very strong political valence follow accounts that have political valence. I'm not sure
that it would be true of people who just occasionally talk a little bit about politics, but it's an
open question, so I think it's worth looking at, for sure.
>>: I guess my question was basically going the other direction and also something to
something Emre mentioned earlier, which was here you were looking at particular accounts that
were being followed, and it seemed like, for the demographic inference, looking at people who
follow Miley Cyrus are probably demographically shifted in some way, so using those as
indicators instead of using local neighbors. How do you think that would do?
>> Derek Ruths: Well, those are your neighbors, as well, because if you follow Miley Cyrus, then --
>>: You're not using Miley Cyrus as a feature. Like, I could use Ann Coulter as a feature, and I
feel like that would do pretty well.
>> Derek Ruths: Oh, I see. Are you talking about for age?
>>: For political orientation, so whether or not I follow Ann Coulter?
>> Derek Ruths: That's right. So you could look at the accounts that people follow, so you could imagine that Glenn Beck or these individuals could potentially actually signal --
>>: Like in the same way they signal bike riding?
>> Derek Ruths: Yes, so are you suggesting that that's how we would label the users, or are you
suggesting that that's another feature that we would add?
>>: So, in this case, you're using it to label users and you felt pretty confident. What I meant
was, in the first case, to infer demographics, you could use it as actual features, like use a bunch
of high-profile accounts as features.
>> Derek Ruths: I see, I see. Yes, yes, I see what you mean. So we had actually looked at that,
the k-top followers, the k-top people who they followed, and that does give signal. Once you get
down to the modest users, you're not getting much of anything, though. So they don't seem to
have a core set of people that they follow, I think is the problem.
>>: You were looking at similarity to the top users, not necessarily --
>> Derek Ruths: Well, so one of the features we added was the k-top, the k most followed individuals.
>>: Like the vector of their language and all that stuff?
>> Derek Ruths: No, sorry. No, that was a different study. Yes, this was potentially a little
confusing. Sorry. What I meant was, in the case of where we were looking at the modest users,
we looked at the individuals that were most mentioned by those users, and they were most
followed. I'm not sure about the following. Let's take it offline, because I think that they would
be interesting to look at, and I'm trying to remember if that was a feature that we had tried,
because we tried a whole bunch, and I'm not sure whether we actually had that. But certainly,
there has been some work that shows that looking at these anchor accounts can help to get
purchase on it. I just don't know how well it would work for normal users, but possible.
>>: That would be a case against using whatever I brought up, which is then, if Starbucks is
your feature, then Starbucks asks you to use it --
>> Derek Ruths: Right, right. Yes, exactly. Exactly. I don't know. Yes?
>>: So thinking about the questions that are being asked about the bike experiment, the
commuter experiment and your earlier ones, it seems like there's several places where potentially
biases are creeping in, from how you're labeling the individuals to how you're measuring, how
you're sampling the users, how you're labeling them, how you're actually running the analysis
and what data sets those analyses were trained on. Have you done any experiments or are you
aware of other work that tries to put a framework around these different kinds of biases and starts
to reason about --
>> Derek Ruths: I'm not familiar with any, but that's what we're trying to put together, and that's
one of the reasons why we have measured so many cities. It's basically solving for multiple
variables. In the simplest case, if we ignore interactions for a moment, then you're looking for
the bias that's being introduced in step one and step two and step three and step four. So what
we're hoping is we can do some sort of meta cross-fold validation across cities and actually solve
for these different components in the framework. No results on that yet, though. I've been
surprised, because I expected to see something in the literature about this, but I haven't, and I
don't know if it's because it's something that's happening in another community, more in the hard
statistics community, or if it's just that it's not a problem that we've gotten to yet, because it's
entirely possible that we just don't have the data sets for it yet. It took us a long time to create
the data sets that we have that could look at it. But I think that's the next step, for sure.
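One way to read that "meta cross-fold validation across cities" idea is a leave-one-city-out fit of a simple correction from Twitter-estimated fractions to census fractions. The sketch below, with placeholder data and a plain linear correction, is only an interpretation of the idea, not a described implementation.

```python
# Leave-one-city-out fit of a linear correction from Twitter estimates to census values.
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per (city, commute mode); all values are placeholders.
twitter_est = np.array([[0.41], [0.52], [0.33], [0.44], [0.55], [0.36]])
census_val  = np.array([0.47, 0.56, 0.38, 0.49, 0.58, 0.40])
city_ids    = np.array([0, 0, 0, 1, 1, 1])  # which city each row came from

errors = []
for held_out in np.unique(city_ids):
    train, test = city_ids != held_out, city_ids == held_out
    correction = LinearRegression().fit(twitter_est[train], census_val[train])
    errors.append(np.abs(correction.predict(twitter_est[test]) - census_val[test]).mean())

print("mean absolute error on held-out cities:", np.mean(errors))
```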
All right, thank you very much.