>> Emre Kiciman: Hi. Welcome to today's talk by Derek Ruths. He's visiting us from McGill
University, where he's an assistant professor, looking at communities and social media. Today,
he's going to be talking to us about demographic inference on social media and how he can learn,
I guess, who groups of people are from the text that they write. So thank you very much, Derek,
for coming today.
>> Derek Ruths: Thank you, Emre. Well, first off, thank you very much for the opportunity to
be here. This is my first time at Microsoft Research. Actually, this is my first time on the
Microsoft campus, so I feel very privileged to be visiting. In my lab, we've been spending a lot
of time looking at latent attribute inference, or demographic inference, in social media. And so
today I wanted to talk a bit about the motivation for that and then to just really give a flavor for
what's been done in the field. It's a fast-moving field. There's a lot of work being done in it, and
so I thought I would take the time to sort of orient everybody to some of the key results that have
been obtained and some of the big problems that we're still working on, and there are plenty of
those.
So as a way of motivating this, for those of you who know about social media literature and
social media research, when I came to the field, this is a couple of years back, when I really
started investing myself in working in social media analysis, there were lots of papers coming
out about studying human behavior, social networks and social media, but they all tended to
focus on the content, irrespective of who was generating it, so people were trying to forecast
what the next big blockbuster movie was going to be, and they were doing so with mentions of
movies, but they weren't worrying about who was mentioning it. And it's quite remarkable to
consider the fact that we've been spending a lot of time talking about the content that's been
generated on social media, years and years, without really knowing a great deal about who is
generating all this content. And certainly, social scientists are very interested in understanding
who is actually on social media, and who is in a group really matters in terms of what the
behavior of that group is going to be.
And so we decided to really focus for some time on really getting a handle on this problem and
coming up with ways of figuring out who is actually on social media, and, furthermore, ways of
looking at human populations in general. So I wanted to start by sort of contrasting Twitter,
which is what we're using, a form of social media, with sort of the established technology for
figuring out who is in a community, and that is survey technology. So survey has been around
for a long time. They are very effective techniques for getting information about populations,
but in our fast-paced world, they've become outmoded in a number of regards. Surveys, many of
us have taken, they give structured information. You have multiple-choice questions, they can
pose the questions very clearly, very crisply. You get very definitive responses. You know who
you're asking when you take a survey, but they're very artificial constructs in the sense that
somebody's got to come to you, or a webpage has to come to you and basically pose this
question, and usually the questions are coming out of context, in the sense that I may ask you
about what you ate this morning or last week, and suffice to say, you're not eating it at that
moment. So you actually have -- there's memory and there's judgment involved in how you
actually answer that survey, and then not to mention the fact that surveys can be fantastically
expensive. So measuring online populations, or even physical populations, using surveys can actually be quite tricky to do with this technology.
Now, Twitter on the other hand, it's got a host of issues, but in its favor, and in the favor of
general social media, we have a number of features that make it very attractive. It's in the
moment. To me, this is the most important aspect of it. It's in the moment. So people generate
content when they're generally having the experience. So if I'm standing in Starbucks and
somebody spills coffee on my shoes, I tweet about it right then and there. I don't wait until I get
home. I don't wait until the weekend. I talk about how I feel at that moment. And, as a result, at
least the conjecture is that it's going to be a much more candid representation of the way that
people are actually interacting with the world.
Twitter and many other social media platforms actually have this continuous feed of information
that you can connect up to and get at least some portion of it effectively for free, so you're
effectively paying the -- the cost of your electricity and your Internet connection is the cost to
actually access a lot of this data. And, of course, that's not universally the case, but certainly,
getting digital information, information that people are already putting online, is much cheaper
than having to send people out or run large surveys and aggregate the information. And then,
finally, Twitter and many other social media platforms give us social context. So that means that
we not only see the individual, we see sort of the world that they live in, at least the digital world
that they live in, and that's quite a bit different than the way that survey technology typically
works. When you survey someone, if you stop them on the street, you may be able to see what
they look like, what they're wearing, maybe where they're coming from, what shopping bags
they're carrying, but you're not necessarily seeing anything about the kind of social context that
they have around them. And so Twitter, Facebook, even platforms like Reddit and Slashdot,
give us some social context within which to understand the user, and that can actually be a rich
source of information, as well. So, of course, the challenge with Twitter is that we don't actually
-- we're not given a great deal of metadata about users, and in general, online, we don't have a
great deal of metadata about users explicitly coded by the individual. So on Twitter, for
example, literally, the only field that a user can specify is their location, and usually that location
field is used to specify something that obviously is not location, like the moon, or various places.
But they don't have many fields, actually, to specify things, so something as simple as gender is
not an obvious feature. So if we're studying users on Twitter and we wanted to actually look at
male versus female responses to things or movie reviews or different things, that is not an
obvious feature to actually try to classify by, and so it goes with age and politics or geography.
All of these things are actually not explicitly coded out. And even in the richer platforms, like
Facebook -- well, Facebook is a classic example. Even that information is not always given. In
fact, people do not complete their profiles, and so that information isn't necessarily always
available, either.
So we have a sizable task ahead of us, which is to figure out, using this real-time feed, what we
can learn about individuals. How do we actually create the equivalent of a survey using only
social media data and the metadata associated with it? So to be very concrete, this is the problem
that we have. So here's Starbucks. It wants to learn about the people that are following it on
Twitter, and here's this one user, and these are real tweets, by the way.
Here's what this user has generated. I don't even know how to pronounce some of what they've written: "Imma bring all of my sexy professors an apple on the first day." It doesn't even have proper grammar. Most of Twitter, actually, is effectively nonsense.
It's very personal communications that are coded sort of in deep slang or nuance. So we want to
go from this feed, this textual feed, to some understanding about what the gender of that
individual is, what their political orientation might be, what their ethnicity is, where they are
from. We could want to answer any host of questions. The question is, how do we actually go
about assigning a label to this user? And so I'm going to talk about a couple of aspects of this in
the remainder of the talk. First, I want to just give an overview of what are state-of-the-art
approaches to demographic inference. Given this problem, how do we actually solve the
problem?
I'm going to talk a little bit about sort of the general idea, and then I'll show some of the work
that we've done with using social context and handling different languages and how we can
actually accommodate variance and variations in the way that users actually use social media.
I'm going to talk a bit about why attributes are harder to code. Some attributes are really hard to
code, and I'm going to give a sense of that from work that actually was published just earlier this
year. And then, I want to talk a little bit about sort of the promise of what I see as being the
major promise of mining this kind of information on social media, which is measuring -- taking
measurements and moving them back into the real world, into the physical world, learning about
physical populations from online measurements. And so I'm going to talk about some of the
preliminary work that we've been doing that's gotten at that.
Okay, so latent attribute inference, or demographic inference as it's also known, is cast entirely as a machine-learning exercise, of course. And as a machine-learning exercise, typically what's done is we take a group of users who we can
assign high-confidence labels to. So here, I'm looking at Democrats and Republicans. And we
take some set of Twitter users, Twitter users depicted as these gummy, green characters, for
which we can assign high-confidence labels. And for each of these users, we encode all of their unstructured text and their user account profile using a variety of different features. So these features could be -- and I'll give you an example of this,
but these features could be everything from what the most common word is that they use to how
many friends they have and so on and so forth. So we obtain this feature vector for each of the
individuals, and then we feed it into a classifier or a system that's going to build a classifier, and
a wide variety of techniques have been used for this in the literature. Probably
the most successful -- in fact, definitively, the most successful ones have been support vector
machines and latent Dirichlet allocation approaches. SVMs seem to sort of be the reigning
paradigm. We use them extensively in our lab, largely because they can accommodate more
than just language. So if you're interested in capturing things like what the social context is of
that user, what their neighbors are like, how many neighbors they have, what the social graph is
around them, it's easier to actually encode that in an SVM than in some sort of language model,
which is what LDAs do.
So SVMs have become very much a mainstay in this field, and then, of course, you get your
classifier, and now you have some user that you don't know the label for, and you're going to
construct that same feature vector as you did for these individuals, run it through the
classifier and get your label out. Yes, a question.
>>: So I'm not an expert in this domain, so maybe it's a really [indiscernible]. You said SVM
showed the best performance. Can you tell me a little bit about the precision and recall of the
state of the art? So is it like 90%?
>> Derek Ruths: So I'm going to give a sense for some of that for different features. It differs
depending upon the feature that you're looking at, but in general, as you'll see, the SVM accuracy
is going to vary between about 70% and 90%-something, depending upon the feature. So with
gender and age, we can do better. With political orientation, it turns out that we don't do as well,
so that's in the 70s, but as you can imagine, as you get more complicated features, it can be more
and more difficult. I think that it's a very big open question right now as to whether the SVM is
best suited for this. I'm continually trying to figure out what the better machine-learning system
would be that would take this into account. It's just that I don't feel as though we've exhausted
what the SVM can do. And as you'll see, what really becomes important is what features you're
picking here, because if I'm interested in classifying -- oh, gosh, I don't know, whether people
like to ski, but I'm not putting any features in there that are even relevant to that, then the SVM is
going to do terribly. And so it really does come down to a feature selection question, and a lot of
the investment that my group has made and continues to make is to devise and come up with better measures to feed into the SVM.
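To make the setup concrete, here is a minimal sketch of the train-and-classify loop described above, assuming Python with scikit-learn; the two toy labeled users and the bag-of-words featurization are illustrative stand-ins, not the much richer feature set discussed in the talk.

```python
# Sketch of the latent-attribute inference pipeline described above.
# Assumes scikit-learn; the feature extraction and data are illustrative stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical high-confidence labeled users: concatenated tweet text plus a label.
labeled_texts = [
    "proud to vote blue today, healthcare for everyone ...",
    "smaller government, lower taxes, strong defense ...",
]
labels = ["Democrat", "Republican"]

# Encode each user's unstructured text as a feature vector and train an SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(labeled_texts, labels)

# For an unlabeled user, build the same feature representation and predict a label.
unknown_user_text = "just watched the debate, can't believe what I heard ..."
print(model.predict([unknown_user_text])[0])
```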
And so, per our discussion about features, here's a set that are typically used, you'll find in
literature. So, k-top words -- whenever I say k-top anything, what I mean is the
k-top discriminating words or characters or hashtags. So what that means is that we're going to
take the two or three classes that we're interested in, and we're going to look at the k-top features that are most strongly associated with each particular class. So
k-top words for Democrats and Republicans would presumably look like the political language
that is most polarizing toward Democrats and most polarizing toward Republicans. And then we can of course do that for hashtags and mentions. Stems and co-stems are ways of breaking words into the root word and then how that word is being modified, which ends up actually being very informative when you look at age. And then using character n-grams can actually be very powerful, just looking at the three- and four-character combinations that people use. And
then you can look at more meta features like how often people are tweeting, how often they're
retweeting, how often they're using links, URLs and emoticons and all these different features.
And then you can start talking about things in the network, so the friend-to-follower ratio has been widely used, and in a moment I'll talk about using the actual network itself.
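As a rough illustration of the k-top discriminating words idea in that list, here is one way to rank words by how strongly they separate two classes; the smoothed log-odds scoring and the toy inputs are assumptions for the sketch, not necessarily the scoring used in the published work.

```python
# Illustrative ranking of k-top discriminating words between two classes.
# The log-odds scoring and the toy corpora are assumptions for the sake of the sketch.
from collections import Counter
import math

def top_discriminating_words(class_a_texts, class_b_texts, k=10):
    counts_a = Counter(w for t in class_a_texts for w in t.lower().split())
    counts_b = Counter(w for t in class_b_texts for w in t.lower().split())
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)
    # Smoothed log-odds: positive scores lean toward class A, negative toward class B.
    scores = {
        w: math.log((counts_a[w] + 1) / (total_a + len(vocab)))
           - math.log((counts_b[w] + 1) / (total_b + len(vocab)))
        for w in vocab
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:]   # k-top for class A, k-top for class B

dem_top, rep_top = top_discriminating_words(["hypothetical tweets from democrats ..."],
                                            ["hypothetical tweets from republicans ..."])
```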
It's worth mentioning that one of the nice things about SVMs is that they don't suffer from
attribute dilution, so as you add more features, an SVM is not going to perform worse. So one of
the nice things about an SVM is you can just continue dumping these features in, literally, and it
will just pick and choose the ones that are going to be the most effective for any particular
exercise. So the framework that we have in my lab literally will just compute hundreds of
features and construct these feature vectors, and as we add more, and as we come up with more,
we just continue adding them. And so we can actually see which features are actually going to
be the most useful for any particular classification exercise.
And so just to give you a sense for performance, this is -- from the literature, if you use only user
information, all those features that I talked about, and plug them into an SVM, this is what you
get. So if you're classifying age as a binary category, college age, and then in the five years after
college, you get about 75%. Gender, you can do about 80% accuracy. I'm going to revise this,
so in the literature, up until this year, political orientation was believed to be upwards of 90%
accuracy. It turns out that's not true. And these have been confirmed in a number of studies, so
there's a number of papers that have established that, no matter how you dice it, this is the
performance that you get using the features that are available.
So one of the questions that we investigated early on was how to use social context in order to
actually improve the classification process. And so just to go back to our original setup, you
have Starbucks. It's interested in this particular user. Let's say it's interested in the political
orientation of this particular user, but it's important to realize that, actually, that user is embedded
in a social context where it has other individuals that it's actually related to. And the question is,
can we use the neighbors to learn something about the political orientation or the label of this
particular individual. So, in this case, homophily -- the social phenomenon that we're talking
about is homophily, which is the tendency for like individuals to cluster and to form links with
one another. In this case, if this worked, then homophily would tell us that a Republican would
likely have -- tend towards having more Republican friends, and certainly the literature in social
science suggests that homophily is heavily active in many, many, many various attributes that we
have -- not in all, but in many. And so the foregoing -- or the assumption that we're making is
that by applying the principle of homophily here, we can actually improve our ability to learn the
attribute of this individual, him or herself. And so we can go and we can build these features.
We can take each individual in the neighborhood and actually learn their features, as well, so
now what we've done is we've taken that one individual, we've looked at all of their neighbors,
and we've learned their feature vector.
Now, it's worth pointing out, in Twitter, there are two different types of neighborhoods that we
can take. There are the friends of that user, the people that I'm following, or there are the
followers, people who are following me. It's entirely possible that both of them are informative.
We looked at friends, the people that an individual chooses to follow, because we considered
those to represent the active selection by that individual of people whose content that they're
interested in. And so everything that we're going to look at network-wise is going to
focus on the use of the friends. That's what I'm representing here. The information is flowing
from this user to this user. And so we have these feature vectors, but of course the question is,
how do we actually combine this user feature with all of these different feature vectors. We can't
just put them all down, because SVMs operate on a fixed set of features. You can't just -- well,
you can indefinitely increase it, but you need exactly that number of features for every single
user. It wouldn't be obvious how to -- in fact, I don't even think it would be possible to try to
take an arbitrary-sized neighborhood and just concatenate features in order to produce a
meaningful classification. So instead, what we do is we take the entire neighborhood and we
simply average those features out. So this is sort of an average representation of the features in
that user's neighborhood. And now we're -- but we're still presented with the question of how we
actually want to handle combining these user features and these aggregate features.
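A small sketch of the neighborhood-averaging step just described: build a feature vector per friend, take the element-wise mean so the result has a fixed length regardless of how many friends there are, and concatenate it with the ego user's own vector. The placeholder featurize function below is hypothetical, standing in for the real feature extraction.

```python
# Sketch of combining a user's features with the averaged features of their friends.
# numpy is assumed; `featurize` is an illustrative stand-in for the real feature set.
import numpy as np

def featurize(tweets):
    # Placeholder features: tweet count, mean tweet length, link rate, hashtag rate.
    n = max(len(tweets), 1)
    return np.array([
        len(tweets),
        sum(len(t) for t in tweets) / n,
        sum("http" in t for t in tweets) / n,
        sum("#" in t for t in tweets) / n,
    ], dtype=float)

def user_and_neighborhood_vector(user_tweets, neighbor_tweet_sets):
    user_vec = featurize(user_tweets)
    if neighbor_tweet_sets:
        neighbor_vecs = np.vstack([featurize(t) for t in neighbor_tweet_sets])
        neighborhood_vec = neighbor_vecs.mean(axis=0)   # one averaged vector, any degree
    else:
        neighborhood_vec = np.zeros_like(user_vec)
    # "Joined" policy: concatenate, giving twice the number of features, fixed length.
    return np.concatenate([user_vec, neighborhood_vec])
```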
And so what we have now is we have the same classification problem. We're still using the
SVM. We're still running it through and we're still building this classifier, this feature-based
classifier. We just now have two times the number of features. There's a number of immediate
questions that come up when we try to do this. The first is, which neighbors are we going to
use? So people follow many thousands of people on Twitter. It's not clear, necessarily, who
we're supposed to select, and so you can imagine many different policies by which we could
select individuals. Some of the ones that we considered are everybody -- seems pretty
reasonable -- most popular, meaning my neighbors with the most friends; least popular, my neighbors with the fewest friends; and then basically the people that I mention the most. Now, these proxy for different things. If you think about your own attributes, and about which of the friends or people you associate with tell you the most about yourself, you can imagine that
in different contexts different neighborhoods would be more relevant than others. So, in the case
of what you like to eat, it might be best to actually sample your immediate family or the people
who you actually hang out with recreationally. If I was interested in your actual vocational
interests, it would be far more important for me to sample people that you worked with, in order
to learn about what you are actually interested in doing vocationally. And so sub-sampling a
network can actually be a very important part of determining which features that we should be
using. And so that's the motivation here. What we've selected for, effectively, here are the
extremely popular individuals. The least-popular individuals actually proxy for individuals that
that person may know, because it's incredibly unlikely that they would have simply found them
by chance, so these are people that you must have gone out looking for and actually selected.
These are not people you heard about through media or through other people.
And then N-closeness is this idea that maybe these are actually your closest friends, the people
that you talk about or the people that you're the most connected to topically or relationally. And
then there's the question of, we have these two feature vectors. Whatever your neighborhood is,
we have some average feature, feature vector for that neighborhood and we have the user's
feature vector. How do we actually treat that? So we could join these -- actually put them together and have twice the number of features -- or we could use one or the other on its own. It's unclear which policy would be the most reasonable one to use, so we tried them all. So the data that we used in
order to look at this, we looked at three features. We looked at age, we looked at gender and we
looked at politics. These were selected for a couple of reasons. First off, they're features that are
of significant interest to organizations, to researchers, to a variety of different stakeholders in this
field. Also, they can be made relatively binary. Gender is itself binary. Politics in the United
States is a fairly binary distinction of Democrats and Republicans, although that's subject to a
great deal of debate, but certainly people can actually orient themselves that way. And then age,
while age is not discrete, we can certainly break it into meaningful time regions and life
experience, so in particular here we were looking at 18 to 23 and 25 to 30 as sort of a college age
and then leaving college, early adulthood. And so we went out and collected a whole bunch of
data, so these are the labeled users that we obtained information for, but in order to go grab all of
their friends, we eventually ended up grabbing on the order of 500,000 users, and then we had to
go, in addition to that, grab all of their tweets. So each data set ended up being over a gigabyte
in size, so running this study was not a trivial endeavor.
>>: Is this all US based?
>> Derek Ruths: They are all Anglophone. They are not necessarily all in the US.
>>: So I don't know if this is really relevant, but how do you handle regional differences like dialect, or, say, someone pretending to be a gang banger -- I'm Miley Cyrus, but I think I'm black, so I speak using African American vernacular. Do you filter those out? Is that just noise?
>> Derek Ruths: That's just noise. I would love to filter that out. Or, rather, I wouldn't like to
filter it out. I'd love to actually be able to identify those discrepancies or those differences so that
we could sort of more finely separate out our population. But you can imagine that actually
detecting that is a latent attribute inference exercise itself, so in some sense, what this represents
is a first -- what I'm going to show you and where we are with the field is a first foray into this.
So trying to actually learn those kinds of features are going to be just fantastically interesting, but
we have to hit a bunch of lower-lying fruit in order to get there.
>>: Something that you might want to check out is that one of our researchers in New England,
named Kate Crawford, she's doing a lot of work looking at big data. This would be considered
big data. Big data and I guess due process and looking at discrimination based on inferences
made by Facebook posts or your tweets or other blog posts. I haven't had the chance to read the
paper, but I've heard it's pretty interesting.
>> Derek Ruths: That does sound fascinating. I'll follow up with you afterwards to get the
paper. Yes?
>>: Could you say a little bit more about who your labeled users are?
>> Derek Ruths: How we labeled them?
>>: How you labeled them and who they are. Are they people with a certain number of tweets,
themselves?
>> Derek Ruths: So each user, each labeled user, had to have 1,000 tweets, so these are active
Twitter users, and as I mentioned, they're Anglophone, 1,000 tweets, at least, and they have to
have at least 10 friends. So these are people who actually have some social context, are active
Twitter users, and they should be using at least roughly the same sort of language structure.
>>: Were these the same 400 for each?
>> Derek Ruths: No, no, they're all different. And the reason is because, in order to actually get
good ground truth for these, we had to go and sort of mine information differently. So in the
case of age, we looked for people basically declaring their birthday, wishing themselves a happy
birthday, so potentially we enriched for extremely narcissistic people, but that aside, what
we looked for was people credibly saying, happy birthday to me, I'm whatever. Gender, what we
did is we've had a number of different passes. In this version, what we did is we looked for
strongly gender-identified names, which has been standard, although now we have a much better
technique for actually doing gender labeling, which uses profile pictures, which I can talk about
later. And then politics was done -- in this study, politics was done using self-declarations in
profiles, and I'm actually going to talk about that, how we improve on that, as well. You know
about that work. Yes.
>>: And it looks like -- I mean, you mentioned the labeled users had at least 1,000
tweets, and it looks like you're going back several hundred tweets, on average, per user.
>> Derek Ruths: That's right.
>>: What timeframe? Is there any worry about them changing age?
>> Derek Ruths: That's a good point. We did not take that into account in terms of age. These
are five-year windows.
>>: I'm thinking my first tweet from 2007 probably doesn't sound the same way my tweets do
now.
>> Derek Ruths: That's true. We did not account for that in the case of age, and that would have
been a good thing to do. In the other circumstances, gender and politics, I'm a little less
concerned about that, because there's potentially going to be standard language that we can pick
up that will actually span time. I would be very concerned if we were trying to look at video
gamers, though. If we were looking at things where there was extremely -- where the language
of the kind of thing we were trying to classify was clearly going to shift over time, I'd be really
concerned.
>>: For the sake of gender -- politics is sort of that in between.
>> Derek Ruths: Yes. Politics is borderline. Politics is a bit borderline, as well, so certainly
time can become an issue, and actually I think time in general is a huge challenge, and it's
something that I'll touch on a little bit later on in terms of things we need to do to address it. But
certainly, taking time into account is important, and your point is taken. I think age would be an
interesting thing to unpack where that's concerned. So if we take these users, and we actually
run our classification system on them, we do some sort of k-fold cross-validation, this is what we
get. And so I'm going to point out some particular numbers. You don't need to read them all -- what's important to see in the table first is that this is the baseline for users only, so this is if we only
use the user vector. Everything down here is simply merging in that average neighbor
information in one way or another. And as you can see, the numbers are not the same. The
numbers to particularly pay attention to where we saw dramatic improvement I've circled in red,
so using just neighborhood context, we've been able to go from 75% to 80%, a five-point improvement in age. Gender we actually don't see much of an improvement on. And of course,
that is somewhat expected because gender is not a terribly homophilic property, meaning that
typically we'll all associate with men and women, so it's not as though your gender is going to be
a strong indicator of seeing an enrichment in one gender -- in one label or another. And then
political orientation, again, we see a significant boost in.
Now, what's interesting is that we see this boost for different neighborhoods, which is
interesting, as well. So here, we see that for age you get the most significant boost by looking at the friends who are least popular. And so what this is suggesting is that you're
getting a lot of information from people who are most selected by you, most clearly selected by
you, as opposed to I saw them in the news or these are very popular individuals that I simply
want to follow, or they're news sources. These are user accounts that you knew about and you
linked to, so least here is presumably proxying for people that you know very well. And other
studies have established that age is a very homophilic property, particularly in the close circle
that you keep, and so it's not surprising that we would find that we get such a dramatic increase
using this information.
In political orientation, we get an improvement, or the most significant improvement, using all
information. Presumably, this is because individuals -- in work that I'm not going to be able to
show, we've gone on to show that individuals enrich their neighborhood uniformly for people
who are politically similar to them online. And so you get a lot of signal from people you know,
people that you don't know but are popular and organizations that you follow. And then, of
course, gender -- gender, we're not getting a great deal of information from, so really, I'm not sure
how much we can say about the particular quality of that inference.
Now, the other thing to mention is that -- well, so neighborhood can certainly increase the
performance, so here we see the actual increases that I pointed out, but what's particularly
interesting to me is the fact that using only neighbor information -- so here we dropped all the
user information. We didn't even use the user's feature vector. In the case of age and political
orientation, we actually do -- well, in political orientation, we actually do better. In the case of
age, we effectively do as well as if we had the user's information itself. And so what this means
is that -- this has some very practical implications. Users can make their accounts private, but if
I can actually find the constellation of users that are around them, I can learn a great deal about
them, even without looking at the content that they're generating.
But, of course, in the case of gender, it's not a homophilic property, so we're not getting a great
deal of signal. I think it's worth mentioning. I think that gender is still an interesting place to
look for social signal, but I think that it may be necessary to actually mine out different features and learn a different machine in order to actually use one's neighbors to learn a
gender label for the ego. All right, so I'm going to move on and I'm going to talk a bit about how
we can handle language. Are there any questions before I move on to that?
So it turns out that, if you look at the literature, if you look at what's been done on social media,
you would believe that Twitter was almost an English-only platform, because there's just so
much stuff done on basically Twitter users in English, and in our early work, we were certainly
among this community that was looking only at English users, and there's a reason, because I
don't know other languages. But if you look at the statistics, Twitter is only 28% English, which
means the majority of content being generated on Twitter is actually not being generated by
English speakers. That doesn't mean that they don't know English. It just means that their
preferred language of tweeting and of communication is something besides English. And so what we've done, over the past couple of years, is really leave a big black hole in terms of
latent attribute inference concerning other languages. And so what we looked at in this study,
and actually, this is what brought me to Seattle recently, is the extent to which we can use gender
inference machinery that's been developed in the past on other languages, or what we have to do
in order to actually handle other languages. I consider this to be both a targeted investigation
into a particular feature and also a broader way of thinking about what it is going to mean to do
latent attribute inference in a multilingual environment? So the first thing to mention, we went
out and collected data sets for a diverse set of linguistic families, so French, Indonesian, Turkish
and Japanese are just about as different as you can get. And so these are selected so that we can
get a very broad spread to determine how well this machinery would work. And I think the first
takeaway is the out-of-the-box machinery -- so this is just using the SVM that I had talked about.
You have to remove the stems and co-stems and so forth, because these are language specific,
but take all of the features that we've been using in the past, and the machinery works. So we are
seeing performance as good -- so this is 76% for French. English was about 75%. We're seeing
performance as good or better than what's been published for English. And so we can actually
use existing machinery quite effectively on other languages. The only outlier here is Japanese,
and the complex orthography of Japanese makes it very hard to fit into many of the features that
we've defined for languages that have a much more limited alphabet. So I think this is actually
sort of a big open question, which is how do you actually handle Japanese, Chinese, these
languages that actually have this complex orthography. We have a couple ideas, but it's worth
mentioning that Twitter has a lot of Japanese content on it. So this is not just an academic
exercise. This is a lot of information that we actually can't mine.
>>: I'm not much of a language expert, so this might be a dumb question, but do any of these
languages or languages you looked at have gendered pronouns for referring to other people?
Like, so people saying you?
>> Derek Ruths: Let's see. Turkish is actually a genderless language. Japanese, it's surprising that
we did this poorly, because Japanese has a tremendous amount of gender encoding in the
language and in the usage. So this is clearly not getting the features right. And so I think that,
actually, of all the languages that are up here, I think what you're pointing out would work best
for Japanese. Indonesian is also, to my knowledge, a genderless language. And French is not
genderless. However, the pronouns are not coded to gender. Other words are coded to gender,
but not pronouns. However, there is something really cool that we can do with French. French
has a really nice construction that actually encodes the speaker's gender. So when you say, I am
X, when you make the statement, I am X, you actually have to decline, change the ending, of the
adjective or participle that follows this construction. And so we considered this to be a
potentially really rich source of information about gender. So we simply went in and we looked
for -- je suis written this way happens rarely in Twitter, because people mangle it and have all
sorts of slang ways of writing it. But suffice to say, you can find a lot of instances of people
using it. And so the assumption is, if people are using proper French grammar, males would
only be using the male constructs, so they'd say, je suis petit, and the females would be saying, je
suis petite, and they would be actually spelling this out differently.
Of course, social media being social media, that's not guaranteed, and so one interesting
discovery that we had is that, in French, just like in all other languages, grammar breaks down.
Women on Twitter who tweet in French use many, many, many French male constructs, so when
they say, je suis blank, they will often leave off the feminine ending. And it's worth mentioning
that it's a little extra work, because the feminine form always adds a character or modifies the ending in some way, and so it's not surprising that the shorthand is to go to the
male construct.
But if you take users and you simply apply a very basic threshold, which is if in the history of all
of the tweets that they generated, they've used a female construction even once, then you can
actually classify gender with extremely high accuracy. So if you use just this threshold-based
je suis construct, then you can get overall accuracy of 90%, which is up from 76%. So that is a
huge, huge improvement. Now, the only catch is that not everybody uses this je suis construct,
or at least they don't use it in a form that we could recognize. And so this covers about three-fourths of all the users. That's what you're seeing here. Out of the 1,000 users we were looking
at, we got about 750 of those, we found them using a je suis construct. And so those could be
very accurately classified. Everybody else, once you looked at what was left over, we couldn't
classify very well at all. In fact, we classified them much worse than what we had in the base
classifier. So what this suggests is that the je suis construct is, A, a very reliable classifier, but
it's also selecting out individuals that have strong gender-indicated language in one way or
another.
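A rough sketch of the threshold rule being described here, assuming a hand-written regular expression for "je suis" followed by an adjective and a crude feminine-ending check; the real study's matching rules were presumably more careful than this illustration.

```python
# Sketch of the "je suis" threshold rule: if a user has ever used a feminine
# construction, label them female; if they use the construct but only masculine
# forms, label them male; otherwise abstain. The regex and the feminine-ending
# heuristic are assumptions for illustration only.
import re

JE_SUIS = re.compile(r"\bje suis\s+(\w+)", re.IGNORECASE)
FEMININE_ENDINGS = ("ée", "euse", "ive", "elle", "te")  # crude, illustrative list

def je_suis_gender(tweets):
    """Return 'female', 'male', or None (construct never observed)."""
    saw_construct = False
    for tweet in tweets:
        for match in JE_SUIS.finditer(tweet):
            saw_construct = True
            word = match.group(1).lower()
            if word.endswith(FEMININE_ENDINGS):
                return "female"     # even one feminine form is enough (threshold rule)
    return "male" if saw_construct else None
```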
Now, I think that there's something very exciting here, though, and that's not been looked at in
the latent inference literature, and that is triaging users. So traditionally, the way that all work
that I've seen has gone about this problem is saying, well, we've got to classify everybody. You
give me a user, I'll just give you a label. But this suggests that there's another alternative, and
that is what if we could simply identify users that we could get the gender for well, and then we
toss everybody else out. If that number was high enough, that would actually be a pretty nice
step in the right direction, because we would be able to generate high-quality classifications for a
large portion of the population, and then we would segment out a different part of the population
that would need to be treated differently, and this could be treated differently from any number
of different angles. Maybe what we need to do is just build a different classifier. Maybe we just
need to Amazon Mechanical Turk the identities. If the numbers are right, then there's any number
of admittedly more manual ways that you could go about handling a large classification problem.
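A short sketch of that triage idea: apply a high-precision rule wherever it fires, and route everyone else to a fallback path such as a different classifier or manual coding. The function names here are hypothetical.

```python
# Sketch of triaging: high-confidence rule first, everything else set aside
# for a fallback path (another classifier, manual coding, etc.).
def triage(users_with_tweets, rule, fallback=None):
    labeled, deferred = {}, []
    for user, tweets in users_with_tweets.items():
        label = rule(tweets)                     # e.g., the je suis rule sketched above
        if label is not None:
            labeled[user] = label                # high-confidence portion of the population
        elif fallback is not None:
            labeled[user] = fallback(tweets)     # e.g., the base SVM, or Mechanical Turk
        else:
            deferred.append(user)                # explicitly left unclassified
    return labeled, deferred
```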
>>: So this might be too early to ask this question, but have you started looking at cross-referencing additional data to pull other characteristics, like music or movies? Other people in different communities -- especially in China -- are cross-referencing all the different social
networks. They find people's favorite movie profiles or whatever, and then they've taken to dividing people into male, female, gay, non-gay, that kind of thing. I know that Netflix has been
doing that for a while. They can figure out that you like certain types of movies.
>> Derek Ruths: Yes. So I'm very interested in that. Doing that in, A, the academic context
and, B, the closed, private social network model that currently exists is hard. I'm not familiar
with very many efforts that have been really successful at cross-linking accounts between
different social media platforms in aggregate.
>>: I can send you records.
>> Derek Ruths: Okay, excellent. But suffice it to say, that would be fantastic. I think before
we even do cross-referencing, I think that there's a lot to be done in terms of topic-based features,
and so that's something that we're looking at currently, which is maybe rather than looking at
words and these different attributes that we've been looking at, maybe what we need to do is look
at the kinds of things that people talk about and actually try to encode that. That could be gotten
by looking at other sorts of social media platforms, but it could also be done by looking at
different aspects of the language that they're using, so maybe embedding the stronger and more
sophisticated language models.
All right. There, I'm circling the best number that we had in the paper. Okay, so next, I'm going
to talk a bit about the challenges that we have. If we look at where demographic inference is
right now, I think that there are a number, a host of different challenges that I think are really
exciting open problems for us to work on. There's -- first off, there are just standard technical
challenges that some of you already alluded to, so temporal variation is a huge problem. So in
unpublished work in my lab, we've looked at how bad things get if you train at a particular time
and then try to classify users later. And we've shown that, depending upon the feature, even a
week will cost you about 10% in accuracy. And so that's serious. That is really serious, which
means that after a month, your classifier is almost useless. It's probably doing about as good as
random or worse. And so we need to come up with some way of handling temporal variation,
and this could either be by continually updating the models or by learning more meta
features that we would be using.
Performance is of course a huge issue, so if we have to actually grab all the neighbors for an
individual every time we need to classify them, that's a lot of data. If we need to grab a gigabyte
of data just to classify 400 users, that's pretty expensive, and so one question would be can we be
smarter about the kind of data that we're using and subsampling. And then finally, the literature
is rife with examples of using binary features. They're easy. They're nice to work with, but, of
course, most features that we're interested in are not binary at all. If we just take age, for
example, what we'd really like is we'd like much finer bins for age, and the machine
classification literature doesn't offer us many out-of-the-box ways of working with non-binary
data, and so a big question is how do we take these problems that we've posed and come up with
better ways of handling these richer features? There's attribute-specific challenges, so I think
that a lot of the reason people have looked at the attributes that they have is because the attributes
are fairly easy to get access to and ground truth on. However, there's a lot of interesting features
out there that we really do need to actually get some purchase on, like education, location,
activity profiles, interest, these sorts of things, and these are much harder to actually get ground
truth for. And so one of the big questions moving forward is how do we get better ground truth
for more nuanced features, and how do we actually encode that in a way that we can get at? And
then, finally, there's -- what was alluded to earlier was this idea of regional variation, so we know
that regional variation introduces linguistic differences, introduces different practices, and so if
we're studying populations, what we would really like is segmented models that actually address
different communities in the total population. Not every person can be treated the same, so
personalized models would also be pretty important to have moving forward. So these are some
of the important dimensions that I think really need additional work. Yes.
>>: I have a question about a different challenge, that maybe it's some combination of these. So
once you infer the demographics of some group of users, you want to use it for something, you
want it to help you interpret something, so you might care about who are the -- what's the
population look like for the folks who are supporting one political party or another or the people
who are talking about Starbucks all the time. But then, when you are trying to learn those
demographics, you're also using potentially some of the features that are tied up with the
question you're asking. So you might have some training data where the people who -- all the
men happen to talk about Starbucks, and so you might learn that mentioning Starbucks means
that you're a male. I don't know why that would be the case. And then you go and then you say,
okay, I'm going to apply this to everyone who's following Starbucks, and your classifier spits
out, inaccurately perhaps, that everyone is male because they all mention Starbucks, but that was
your filtering condition. So there's this question of endogeneity and stuff like that around how
you're classifying people and what you're trying to learn. Have you thought about that?
>> Derek Ruths: In some ways, what I'm going to talk about next, which is the political
orientation and the problems that we've had with political orientation, is going to maybe get at a
little bit of this, in terms of the way that poor assumptions about the way we should sample
populations can influence the results that we get. But unfortunately, I don't have a very good
answer to that. I feel as though, as with a lot of large big data problems, the solution is to get as
random a sample of the population as one can. In many ways, I think that what you're talking
about underscores why it's important that the computationalists that are working on this problem
also have close ties to social science, because I think that social science actually has a great deal
to tell us about properly designing data set sampling techniques or being aware of correlations
and biases that we may be introducing. But even there, I don't think that it's a systematic thing. I
think it's something we just have to be continually aware of. Unfortunately, I don't have
anything really, really strong to say, but I think it's something that we just need to be on guard
about. Actually, on that note, in terms of being on guard about things, I want to talk about really
the story of how I think the latent attribute inference community, my work included, really
became confused in approaching political orientation, the problem of political orientation.
All right, so let's do a quick exercise, right? I just want to do a very quick game. The game is,
I'm going to put up a picture and you're going to tell me what the political orientation of this
person is, right? You can already tell who it's going to be. Democrat? Very good, all right.
Republican, all right. Let's see, how about this person? You guys were doing so well. All right,
let's try another one. Republican, all right? How about this one? You seem less certain. What's
going on? Republican. Okay, how about this one? Here we go, last one. Don't worry.
Democrat? Okay, he's Canadian. So what happened? What happened to all the certainty? We
were all in unison, crying these names out, and then all of a sudden we hit these characters and
we're not able to do it nearly as well. So I'm trying to illustrate a point, and that is, when it's
obvious what a person is, it's very easy to classify them. They’ve been labeled for us, they've
self-declared things for us. It makes it very easy to actually assign a label to these individuals,
less so for people that we don't know. And so this really gets to the heart of the problem that
we've been having in the literature, which is that a lot of the data sets we've collected and
reported political orientation results on have to do with people who are easy to identify. And so
in this study, we looked at, in some sense, a very simple question, and that was, what happens
when you weaken that condition, when you don't look at people who are easy to identify? How
bad does political orientation inference get?
And so what we did is we went out and we somewhat arduously built three different data sets. I
mean, different levels of ardor were involved. Getting political figures was easy, because they
just have these Twitter accounts; we go out and grab them, we get all the senators and representatives and so forth. Active users were also pretty easy to find. These are people who
simply declare their orientation. I love being a Democrat, or Republicans rule, these sorts of
things, just stating that sort of thing in their profile. We included them in the active data set.
Modest users were nontrivial to get. These are individuals who use political language but do not
self-declare in any way.
>>: Is there a sarcasm detector?
>> Derek Ruths: We actually manually coded all active users, so the active users were put through
an AMT coding exercise, so I expect that cleaned out any sarcasm. Let me tell you, there's other
language that goes into profiles that contain that, but yes, so hopefully the manual coding
handled the sarcasm detection. At least, when we went back through and eyeballed what we had,
it made a lot of sense. But the modest users, this was really where the interesting part of the
study took place, and that was figuring out how to actually measure these modest users. So to
give you some idea of how we went about doing this, let's look at some of the features for the
political figures. These are the -- the top hashtags are generated for the active political figures,
so you can see that they are very strongly associated with, not surprisingly, the Democrat and
Republican platforms. What we did in order to identify modest users -- let me tell you what we
did not do first. What we did not do to identify the modest users was take these hashtags and
then go look for other users using them in Twitter. And the reason -- it's subtle, but the reason
we couldn't use these is because these are highly discriminative Democrat and Republican
hashtags. If we had gone out and found users that used these or selected on users that used these,
we would have effectively been selecting people who had a strong valence or a high likelihood
of being Democrat or Republican. That's not what we wanted. What we wanted was an
unbiased sample of people who spoke about political things, and so what we did is we took the
least discriminating political hashtags that were used by political figures and politically active
individuals. So these would be things like #jobs or #taxes or things that carry no political
valence but still talk about a political topic. So we took those hashtags, and we identified users
that used those and that had no mention of political parties in their profiles. So these were
people who really were not giving much signal. And then what we did is we took those
individuals, we pulled out all the tweets that contained political language, and then we Amazon
Mechanical Turked that. We basically asked people to code the political valence of these
individuals. Now, this is not a foolproof method. We took individuals that received majority
vote for a particular orientation, and so you can imagine that there would be some uncertainty,
even there. But what it gave us was a corpus of individuals for whom we had a fairly certain
valence assigned, but it was coded in a more nuanced way. It wasn't necessarily in the explicit
words that they were using. It may have been actually in the semantic construct itself, which is
much, much harder to get at, computationally. And so given this modest set of users, along with
this active and figure-based set of users, we could start to look at how well the classifiers that we
had traditionally been reporting as really, really good actually performed. So these are the
somewhat disturbing results. Figures, not surprisingly, we do very well on, 91%. This is the
number that had been always reported in the literature. If you move down to individuals who are
still self-declaring their orientation, you already lose about seven percentage points of accuracy.
You're already down at 85%. And if you take individuals who do have some sort of political
valence and express it on Twitter but simply are not overt about it, you end up with 70%. Here,
we're barely doing better than random, practically. And so what we can see here is that the SVM
performance seriously degrades when we actually want to look at normal people. And I would
argue that, in terms of inferring political orientation and any other feature, it's most important for
our machines to work on normal people. So hopefully this is getting back to your point a little
bit, which is how you collect your users really can influence how well your machine is going to
do or how well you think you're doing at assigning this classification.
Now, this afforded another cool opportunity, and that was we could look at -- for the first time,
we could look at what happens when you take a machine that's classified on one set of users and
use it to classify a different set of users. And so -- this is a pretty important question to ask,
because what this means is -- this is always going to happen in the wild. You'll take some set of
users, you will classify on them, and then you'll pick up your machine and you'll run it on a
bunch of other people.
Now, what the literature was telling us is that you train your classifier on these political figures, and
then you could pick it up and you could run it anywhere, but this is the performance that you
would expect to see. You train on political figures and you classify either active or modest users,
and you see a dramatic decrease from even what you could have gotten if you had trained on the
original users themselves. So we can't even do cross-classifier or cross-data-set usage of these
classifiers. And in some sense, that's not surprising. These things are using different features,
but what's profound is just how much of a price we pay in order to actually do that cross-classification. And so I think that this is another very important question to be asking as we
move forward in this research direction, which is how do we build machines that are robust
across these populations?
Okay, and so finally, in closing, because that was potentially a bit of a negative note. Now I
want to just switch things around a bit and talk a little bit about the promise of what we can do
with social media in terms of measuring physical populations, and this is something that I'm
deeply interested in. I really am hoping that what we can do is use online measurements,
effectively social media sensors, to talk about physical populations. And so this is going back to
some preliminary work that we published two years ago, or a year ago, but have since made
significant progress on. I'm going to talk about the results that we published then, but we've
made a lot of progress, which I'll allude to. So here's the setup. We have the population of the
world, and we would like to actually estimate the makeup of that population, how many men
versus how many women, how many skiers, how many skateboarders, what have you. We'd like
to actually get the composition of this population, how many people were riding the bus, how
many taking the car. These people are generating some social media footprint, and we're going
to run that through a classifier and get some estimate of the Twitter population. Now, the
question is, if we can correct for this bias, how close can we get back to this estimate of the real
population? So what I'm going to talk about is this, right to here. We are still working on
figuring out how to correct for Twitter biases. But what I'm going to show you is that even
without correcting for the biases, we can get a fairly rough estimate of real-world populations, at
least in some cases. So we looked at census data, which is a nice, stable measure of populations.
It was also done in a very systematic way, so we have some confidence that the results being
reported are accurate. And so what we looked at was gender inference applied on commuter
populations, so in this case, what we'd be looking at are individuals who are choosing to
commute by car, by bike or by public transport bus. And the census figures -- census figures
typically, and we looked at Toronto, but census figures sort of systematically give you gender
breakdown in each of those categories, how many men are commuting by car, how many women are commuting by car, same thing for bikes and buses. And so what we wanted to
do was determine whether or not taking measurements of just Twitter populations would allow
us to reconstruct any gender bias present in those commuting populations.
These are the census figures that we have for Toronto. By the way, I live in Montreal. We didn't
do Montreal because it's intensely bilingual, and we can handle English and we can handle
French, but handling mixed languages is another open topic in terms of how you actually
accommodate for that. And obviously, it's one that we're very incentivized to solve, but at this
point, we were just interested in looking at a monolingual context. In online data, what we did
was we took user accounts, Twitter feeds that were effectively oriented around giving news
about a mode of transport as proxies for that transport. So we took -- for instance, this is a
commuter traffic feed. This is about public transport, and this is a biking community, and we
took all of the followers of those accounts as a proxy for the Twitter population that uses each of
these different modes of transport in Toronto. And then, we gathered those individuals, we ran
our classifier, our gender-based classifier on them, and we determined how well or what
agreement we saw between the measurements that we took, the gender biases we saw, and the
biases that were reported in the census. This is more information on the classifier. Again, we
used the same sorts of features that we've been talking about, and then we ensured that our
classifier was working as well as it usually does, which you can see here. So it's achieving
effectively the same performance that we've been reporting previously between about 80% and
85%.
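Purely as an illustration, here is a small sketch of that measurement step: classify the followers of one proxy feed per commute mode and report the resulting gender split. The account names, the gender_classifier object and the get_follower_features helper below are hypothetical stand-ins, not artifacts of the actual study.

```python
# Illustrative sketch: gender split among the followers of per-mode proxy feeds.
def estimate_female_fraction(follower_features, gender_classifier):
    """Return the fraction of followers the classifier labels female ('F')."""
    predictions = gender_classifier.predict(follower_features)
    return sum(1 for p in predictions if p == "F") / len(predictions)

# Hypothetical usage, one proxy feed per commute mode in a single city:
# proxy_feeds = {"car": "@some_traffic_feed",
#                "bus": "@some_transit_feed",
#                "bike": "@some_cycling_feed"}
# twitter_split = {mode: estimate_female_fraction(get_follower_features(feed),
#                                                 gender_classifier)
#                  for mode, feed in proxy_feeds.items()}
```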
So if we look at the actual physical population, this is the gender bias in each commuting
population. This is the gender bias that we obtained. Now, these numbers are not equal.
They're not even remotely equal, but if you really, really squint your eyes, what you will observe
is that we actually reconstructed the valence of the bias. And that's the point that I actually want
to make about what we're finding right now, and that is that in each of these commuting
populations, we were able to determine the gender leaning of each mode of transport. Now, it's
important to observe that there were three modes of transport, and there were different biases in
each one. In particular, in public transport, there are more females than males that ride public
transport. This would be a much less interesting result if they were all male dominated and we
observed that these were all male dominated. And the reason would be that that wouldn't
necessarily prove that our classifier was doing anything except just discovering the Twitter prior,
some sort of prior on male presence in Twitter. But the fact that we actually find the valence and
the leaning, the bias, to be the same as was observed in the physical population, and the fact that
it actually changes suggests that we're actually finding signal that represents the populations
underneath these.
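The valence check being described can be written down very simply; the sketch below uses placeholder numbers, not the Toronto figures from the talk, and just asks whether each Twitter estimate leans the same way as the census relative to an even 50/50 split.

```python
# Does the Twitter-estimated gender bias lean the same way as the census?
def same_valence(census_female_frac, twitter_female_frac):
    """True if both fractions sit on the same side of 0.5."""
    return (census_female_frac - 0.5) * (twitter_female_frac - 0.5) > 0

census  = {"car": 0.47, "bus": 0.56, "bike": 0.38}   # placeholder fractions female
twitter = {"car": 0.41, "bus": 0.52, "bike": 0.33}   # placeholder classifier output

for mode in census:
    print(mode, "valence agrees:", same_valence(census[mode], twitter[mode]))
```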
In ongoing work, what we've done is we've effectively gone and obtained the census data for
now nine different cities, all Anglophone cities at present, and looked at our ability to reconstruct
these same figures, these same census figures, for each of those cities. These are international
cities, so we've got Sydney, we've got cities in the States, we have London, cities in Canada.
And in the majority of cases, we can actually reconstruct the gender bias correctly. Now, making that correction remains an open question that we're working on, and what I'm hoping we find is that there's some sort of systematic bias that we can use to actually correct across different populations. That's entirely conjecture -- it's not clear exactly how we need to correct for
these biases, but suffice to say, that's the next step, because clearly we're getting signal. And I
should say, in other populations, in other cities, we also see differences -- women being dominant in one mode of transport, men being dominant in others -- and we still see our ability to reconstruct or discover those biases. So we're getting signal. The question is, how do we actually tune it so we're getting closer to the actual measure of the population rather than simply getting leanings?
All right, and so just in closing, hopefully I've given a sense for demographic inference, where we've come from and the problems that we face. Where we stand, there's
tremendous opportunity for using this. I mean, it's clear that we can make quite high-resolution
measurements under relatively controlled conditions. And so the question now is really how do
we start to take away those controls and permit our machines to continue doing well? How do
we introduce time, how do we introduce population variance? How do we actually account for
mapping into physical populations? But all these present what I consider to be very manageable,
although big, open problems that are exciting directions for future work.
I'd like to recognize -- in closing, I'd like to recognize some of the students who did all of the
work that we've seen. Faiyaz is my PhD student; he'll be graduating soon. Wendy is a very
talented undergrad. Also, not listed here is Morgan, another undergraduate, who did the most
recent work on multiple languages on Twitter, and this was all supported by funding from the
Canadian government, as well as from the Kanishka Project. Thank you very much. Questions,
yes?
>>: In the example that you showed before, where you looked at the particular accounts of certain news organizations, some mechanism to determine the fixed --
>> Derek Ruths: Yes, yes.
>>: Why did you pick that instead of what you've been doing before? It seemed like what you were doing before, in some cases, you were identifying people based on what they say, like "I am a Democrat" or whatever. Why not look for "I'm taking the bus," or --
>> Derek Ruths: We looked at that, but there just wasn't enough. So we looked for people
declaring how they were commuting, and it turns out that it's actually too boring for people to
even mention on Twitter, which is a pretty low bar. But people don't talk about the mode of
transport that they're taking, at least often enough. Oh, another thing to mention is we needed
geographically local individuals, and so that can get dicey, because we didn't have much signal
for people talking about the mode of transport. We had even less signal once you had to have the
mode of transport and some confidence that this person was in Toronto. And so somebody
following a news feed from Toronto, pretty good indicator that that's what you're looking at.
>>: Did you look at multiple Twitter groups? Certainly, there's more than one Twitter feed
about traffic.
>> Derek Ruths: We were surprised. Different cities have different numbers, and that's one of
the interesting things that we're grappling with right now, because -- I could pull the data, but it's
probably easier for me to describe. If you take different feeds, you can actually get different
valence in terms of the gender composition that you get. A classic example: bikes are real problems,
actually, and the reason is because bike commuting is different than bike interest groups, and so
you can get Twitter handles that are oriented around female bicyclists. And of course, those are
going to look extremely dominated by female bicyclists, for good reason. There's a very big
question as to how do you select the proxies to actually get the most information, and I would be over the moon if I actually had an answer to that question. Bikes seem to be the most problematic.
Public transport and traffic, they seem to agree a bit more, but bikes are quite an issue. Yes.
>>: So in the first study, you showed that the people someone follows -- their neighbors -- send a strong signal as to what that person is. Can you use that for the second study? You talked about people for whom you can basically figure out whether they're a Democrat or Republican, and then there are the active users. So rather than just putting everything into one pool and trying to classify them, could you first look at the ones with a very clear signal, figure out who and what they are, and then for the second tier use that first tier as neighbors -- look at the people they are following -- so that you can have better labels for the neighbors?
>> Derek Ruths: That's a great insight, because what you could do is, I guess what I had been
saying before, is that you look for the obvious wins, and then once you're done with the obvious
wins, hopefully you have a smaller population that you can spend more effort on. And maybe
the effort is go and collect all their neighbors. Certainly, that is a possibility. We haven't looked
at it, but that would make sense.
My only concern would be that, in the case of political orientation -- well, there has been some
work that showed that even -- well, no. Actually, most of the work on this has shown that people
who have very strong political valence follow accounts that have political valence. I'm not sure
that it would be true of people who just occasionally talk a little bit about politics, but it's an
open question, so I think it's worth looking at, for sure.
>>: I guess my question was basically going the other direction and also something to
something Emre mentioned earlier, which was here you were looking at particular accounts that
were being followed, and it seemed like, for the demographic inference, looking at people who
follow Miley Cyrus are probably demographically shifted in some way, so using those as
indicators instead of using local neighbors. How do you think that would do?
>> Derek Ruths: Well, those are your neighbors, as well, because if you follow Miley Cyrus, then --
>>: You're not using Miley Cyrus as a feature. Like, I could use Ann Coulter as a feature, and I
feel like that would do pretty well.
>> Derek Ruths: Oh, I see. Are you talking about for age?
>>: For political orientation, so whether or not I follow Ann Coulter?
>> Derek Ruths: That's right. So you could look at the accounts that people follow, so you could imagine that Glenn Beck or these individuals could potentially actually signal --
>>: Like in the same way they signal bike riding?
>> Derek Ruths: Yes, so are you suggesting that that's how we would label the users, or are you
suggesting that that's another feature that we would add?
>>: So, in this case, you're using it to label users and you felt pretty confident. What I meant
was, in the first case, to infer demographics, you could use it as actual features, like use a bunch
of high-profile accounts as features.
>> Derek Ruths: I see, I see. Yes, yes, I see what you mean. So we had actually looked at that,
the k-top followers, the k-top people who they followed, and that does give signal. Once you get
down to the modest users, you're not getting much of anything, though. So they don't seem to
have a core set of people that they follow, I think is the problem.
>>: You were looking at similarity to the top users, not necessarily --
>> Derek Ruths: Well, so one of the features we added was the k-top, the k most followed individuals.
>>: Like the vector of their language and all that stuff?
>> Derek Ruths: No, sorry. No, that was a different study. Yes, this was potentially a little
confusing. Sorry. What I meant was, in the case of where we were looking at the modest users,
we looked at the individuals that were most mentioned by those users, and they were most
followed. I'm not sure about the following. Let's take it offline, because I think that they would
be interesting to look at, and I'm trying to remember if that was a feature that we had tried,
because we tried a whole bunch, and I'm not sure whether we actually had that. But certainly,
there has been some work that shows that looking at these anchor accounts can help to get
purchase on it. I just don't know how well it would work for normal users, but possible.
>>: That would be a case against using whatever I brought up, which is then, if Starbucks is
your feature, then Starbucks asks you to use it --
>> Derek Ruths: Right, right. Yes, exactly. Exactly. I don't know. Yes?
>>: So thinking about the questions that are being asked about the bike experiment, the
commuter experiment and your earlier ones, it seems like there's several places where potentially
biases are creeping in, from how you're labeling the individuals to how you're measuring, how
you're sampling the users, how you're labeling them, how you're actually running the analysis
and what data sets those analyses were trained on. Have you done any experiments or are you
aware of other work that tries to put a framework around these different kinds of biases and starts
to reason about --
>> Derek Ruths: I'm not familiar with any, but that's what we're trying to put together, and that's
one of the reasons why we have measured so many cities. It's basically solving for multiple
variables. In the simplest case, if we ignore interactions for a moment, then you're looking for
the bias that's being introduced in step one and step two and step three and step four. So what
we're hoping is we can do some sort of meta cross-fold validation across cities and actually solve
for these different components in the framework. No results on that yet, though. I've been
surprised, because I expected to see something in the literature about this, but I haven't, and I
don't know if it's because it's something that's happening in another community, more in the hard
statistics community, or if it's just that it's not a problem that we've gotten to yet, because it's
entirely possible that we just don't have the data sets for it yet. It took us a long time to create
the data sets that we have that could look at it. But I think that's the next step, for sure.
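One way to read that "meta cross-fold validation across cities" idea is a leave-one-city-out fit of a simple correction from Twitter-estimated fractions to census fractions. The sketch below, with placeholder data and a plain linear correction, is only an interpretation of the idea, not a described implementation.

```python
# Leave-one-city-out fit of a linear correction from Twitter estimates to census values.
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per (city, commute mode); all values are placeholders.
twitter_est = np.array([[0.41], [0.52], [0.33], [0.44], [0.55], [0.36]])
census_val  = np.array([0.47, 0.56, 0.38, 0.49, 0.58, 0.40])
city_ids    = np.array([0, 0, 0, 1, 1, 1])  # which city each row came from

errors = []
for held_out in np.unique(city_ids):
    train, test = city_ids != held_out, city_ids == held_out
    correction = LinearRegression().fit(twitter_est[train], census_val[train])
    errors.append(np.abs(correction.predict(twitter_est[test]) - census_val[test]).mean())

print("mean absolute error on held-out cities:", np.mean(errors))
```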
All right, thank you very much.