>> Helen Wang: Good morning, everyone. It's my great pleasure to introduce Joe Bonneau. Joe is a PhD candidate from the University of Cambridge, and he has been working with Ross Anderson for the past few years. And today he is going to give a talk on guessing human-chosen secrets. >> Joseph Bonneau: Great. Thank you. Right. So, just see if I have my slides advancing. Right. So, I'll try and pre-empt the question I normally get at the end, which is why are we doing password research at all? This is a picture of what I identified as basically the first password deployment, in 1961 at MIT. They actually wrote a retrospective last year for the fiftieth anniversary of it, and one of the graduate students who worked on it admitted that he was the first person to actually compromise password security by guessing people's passwords. What was actually the most interesting thing to me about reading it is that the threat model was completely different then. The main reason that they deployed passwords in the first place was to try and segment the very limited resources of computing time. And this guy, Allan Scherr, who admitted doing it said that the only reason he guessed people's passwords was so that he could allocate more time to his jobs and try and finish his research sooner. There's also a lot of other really good tidbits in the retrospective. They had a race condition where the print daemon would print whichever file was most recently accessed, and if somebody logged in, in between the time when somebody sent the print command and when the printer actually printed it, which was a very small time window, it would actually print the password file, which wasn't encrypted then. So this happened a couple of times where they accidentally sent the entire master password file to the printer, and they had to reset everybody's passwords. Anyway, so that was fifty years ago, and sadly we basically are dealing with a lot of the same problems today. >> : [Inaudible]. 
>> Joseph Bonneau: I guess we have lost the ties and we've lost some of the hairstyles that you can see here, but basically passwords have stayed with us. So this is a small snapshot of some work I've done in the past year with Cormac and two others looking at basically why has everything that has been proposed to replace passwords failed? So we tried to really zoom back and say what are passwords actually good for? So we made a massive list -- Oops, not a laser pointer. We made a big list of every property that we would like an authentication system to have, and then we tried to score passwords objectively and say passwords do have some nice properties. You don't have to carry anything. At this point they're easy to learn. Everybody knows how to use them. We know how to reset them. They're compatible with everything in the world. They're basically free to deploy, and there's nothing to be stolen. You can do them without trusted third parties, so they have some security properties that are okay. So then we tried to look at the best examples of everything that's been proposed with the claim that, "This can replace passwords and solve all of our problems." And when we scored those systems using the same evaluation criteria, we found universally that every replacement makes a couple of incremental improvements. So this is the Firefox Password Manager. It makes things more scalable for users, say, eliminates some errors due to typing passwords, but it's now not as easy to recover if you lose your password; it's not compatible with other browsers, and it sort of no longer has this nothing-to-carry universality of passwords. So we said, okay, there's a few marginal gains and a few marginal losses from switching to something like that. And we repeated the exercise for lots of other things: graphical passwords, doing authentication by sending SMS messages to a phone, trying some biometric like iris recognition, doing some remote single sign-on scheme like OpenID. 
And it was always the same story, that you can't replace passwords in a strictly win-win situation; you have to give up some of the properties that we've become accustomed to with passwords. Which all goes to say, a little bit, that passwords are basically the only show in town right now for the immediate future. We don't really know how to replace them in a way that will keep everybody happy. And particularly since somebody has to lose or something has to become harder for somebody if we replace passwords, it's very hard to break the status quo. So I do think that passwords in their current role will be with us conservatively for at least five years; I think it would be very, very hard to imagine passwords disappearing on the web. So if you're interested in reading more and seeing the entire table, which doesn't fit on one slide, you can check out this paper with Cormac and Frank Stajano and Paul van Oorschot, which will be at Oakland in a couple weeks. And longer term, taking another step back: in the current situation we have, say, N users and M servers that some users might want to talk to, and for every single connection that gets made users have to register another password. So we have this big messy bipartite graph. And this causes all sorts of problems: users end up reusing passwords; compromises at one server can affect another server; there's too many passwords being demanded of users. All the familiar laundry list of complaints about passwords. So what would be really great is if we could switch to having one intermediary in the middle. This is the classic computer science trick of adding a layer of indirection. Now everything looks really good. Every user just has to memorize one thing, and every server just has to maintain a connection with this one trusted intermediary. Of course there's a couple problems with this. This is basically the Microsoft Passport proposal from about ten years ago. It's also Kerberos. 
Today, reincarnated it's Facebook Connect. And of course there's problems that some users aren't going to want to trust this middle server, that server becomes a point of failure and all these things. So in the best case maybe five, ten years down the road we'll end up in some situation like this where users have their choice. They might use some different remote server of their choosing. They might trust their mobile phone to intermediate everything for them, or they might use some weird gadget that we haven't invented yet. And there still is going to be a really messy N-to-M bipartite graph down here, but we more or less can solve that problem. The protocols that underlie things like OpenID, which would enable the entire world of web servers to rely on any number of different identity providers, that's relatively a solvable problem. But we still have the problem of authenticating users just to the one trusted device so whether it's their OpenID provider or their phone or basically anything else. I've yet to hear a really credible proposal that doesn't push this to having one last password that people remember. Even in cases of trusted hardware and even if there's some biometric capability, there are still problems of devices getting stolen and biometrics getting faked. So people say, "Well, we should add a password to the system to take care of that one situation." So in a way long term that's the thing I'm most interested in is how we solve this last mile of authentication from the user to one device which then hopefully can unlock the world for them in a sane way. So this gets rid of, hopefully, a lot of the current generation of problems that we have with malware and phishing and people reusing passwords across sites. But we still have the fundamental problem that people have to memorize something that they then spit back to some trusted device that it's possible for adversaries to guess. 
So most of the bulk of my talk will be about this problem, abstractly, of how hard it is to guess secrets that people have remembered. I came into this a couple years ago, and it's turned into the topic of my PhD work. So these are a couple of questions that hopefully at the end of this talk I will have convinced you of an answer to; I don't think there was a good answer, or even really a sound way of answering these questions, when I got into this line of research. So for fun we can do a show of hands. If we're talking just about passwords, how many people think it's easier to guess passwords chosen by teenagers compared to the elderly? So teenagers first and the elderly. Okay. So pretty split room. We'll have the answer later. And how about a four-digit PIN that a human has chosen versus a random three-digit PIN? Who thinks the human-chosen four-digit PIN is better? >> : Are you talking about the weakest or the average? >> Joseph Bonneau: I'm leaving it vague for now. >> : [Inaudible]... >> Joseph Bonneau: Well, you can think if you were an adversary and you were doing some guessing attack, which situation would you rather be in? So everybody likes the random three-digit PIN? >> : [Inaudible] I'll take [inaudible]. >> Joseph Bonneau: You'll take the human-chosen? >> : Yeah. >> Joseph Bonneau: Okay. So one brave person. And how about a password versus a random surname chosen from the population? So unknown person and you have to guess their last name; who thinks that's harder than a password? One person. Most everybody else thinks the password's better. >> : It depends here on the culture, right? I mean, certain >> Joseph Bonneau: Right. >> : cultures are easier to guess surnames >> Joseph Bonneau: Definitely. So, yeah, actually in my thesis there's a whole table on different cultures, and there's some real outliers. Like in South Korea there's three surnames that cover more than half of the people. 
So if you're dealing with Korean people, the surname is, yeah, quite easy to guess. So if you want I can add information and say we'll take the population of users on Facebook, who come from around the world, a lot of Americans. But still everybody likes the password, or we had one or two for the... >> : I pick the surnames. >> Joseph Bonneau: Okay. Well, we're not keeping score, but we'll get to that at the end of the talk. All right, so a little bit of historical background on how people have tackled this problem before. There's the approach of looking at the content of passwords and looking at things like, well, how long are they? What different types of characters do they have? Do they have things outside of letters and digits? And a lot of people have done surveys of this by hand from passwords chosen in user studies in the lab. And there's a guideline that was put out by NIST in 2005, I think, which they never really sold as a bulletproof rule. This was kind of just a rough rule of thumb, but it's been used in a lot of usability papers since then. So it's become important at least in the research context. It tries to answer the question of, "If you have a password of a certain length and with certain types of characters, how strong is this?" And on the right side over here, these white cells are if the choices are truly random, so for one character from a 94-character alphabet this is how many bits you get. And then these are the estimates for human-chosen passwords under a couple of different restrictions. So if they can only choose numbers, if there's a dictionary check in place, if they're required to use non-alphabetic characters, stuff like that. So not very exciting, just a big table without a lot of motivation behind it. The other approach to evaluating passwords is just to run some cracking library against them, and there's tons of cracking libraries out there now. 
John the Ripper is the most popular in research studies because it's free; you can get it on the web. There's also a whole ecosystem that exists of proprietary password cracking software, which you can spend 14,000 pounds on if you're interested -- that's at the discounted price -- which I've never really looked at. But a lot of this is out there. What's interesting I think is that they have basically two business cases for selling this stuff, which are employers trying to recover the passwords of their own employees and sort of jaded spouses or parents who are really interested in one person who they share a computer with. So a very strange world, but it does actually show up in -- Sadly it shows up in research papers sometimes that they say, "We used a certain proprietary scheme," which then makes the entire result kind of useless because you have no idea how good or bad this scheme was. So here is a graph of some password cracking results that have been published over the years. So down here this is the proportion of passwords that were broken in the cracking attack, and this is the size of the dictionary plotted logarithmically. So that's actually masking a pretty massive loss of efficiency as you go here because this is basically a linear scale, and this is -- you're doing logarithmically more work. So if this were a linear axis too it would obviously be a huge, huge ramp up in the amount of work you have to do. But the interesting bits here I think are that these cracking attacks very rarely ever get past the border of breaking about half of the accounts that the researchers had access to. The one exception is this result from 1979, which was with users from the original UNIX days, by Robert Morris and Ken Thompson, who actually got to 85%. Well, that was in the days of crypt and pretty simple passwords. 
Okay, so in comparing these approaches and thinking about what we would like in a way to evaluate passwords, I would say that the semantic evaluation formulas, like the NIST formula, are lacking in validity a little bit because it's not really clear that the number that you get out of that formula corresponds to how hard passwords are to guess in real life. Whereas the cracking evaluations are pretty good here because the attackers use probably roughly the same software that researchers can get their hands on. The problem with cracking attacks is that they have -- they're biased toward how skillful the researchers are in using them because these things are extremely difficult to set up. You have a lot of choices to make, which word list you use, which mangling rules you use, and there's also some demographic bias because if the passwords that you're studying were chosen by a different population than the word list that you feed into your cracking algorithm, you'll do a lot worse. And it doesn't necessarily mean that the passwords are weaker. So if you run the stock John the Ripper password cracking against passwords chosen by users from China, they might look like the passwords are much stronger than users from an English-speaking country. But, it's not clear if that's actually true or if it's just a worse fit for the cracking software. And the semantic evaluations I think at least there's no bias for how skilled the operator is because it's a deterministic formula, but the demographic problem creeps in once again. And the further problems with the cracking evaluation in terms of good science and doing long term longitudinal comparison of results, it's very hard to repeat old password cracking experiments because the cracking software gets updated all of the time. And people rarely report enough information about exactly how they compiled the software, which word list they used, which mangling rules they use. And it's not free to run the cracking tools either. 
Some of the research that's been done in the past year or two at Carnegie Mellon, where they've done a lot of cracking against passwords that they've collected in user studies, they've actually had to devote a pretty significant amount of computational resources to doing the cracking evaluations, which is doable but it's not free like doing the NIST entropy formula is. So my goals going into this space were to try and fix some of these problems with basically the secret weapon of having a lot of data. So in the past couple of years there've been a couple of big leaks of password data sets from different websites on the web. The biggest one was from RockYou in 2009, which had thirty-two million users, and the passwords were leaked in plain text. So this was kind of Christmas morning for people who are interested in doing password research because you could look and say exactly how many people chose "123456" and on down the line. So this data set has shown up in dozens of papers over the years. I've used it quite a bit, and it is really very useful for doing these things. So we'll come back to the RockYou data in a second. My goal for the rest of this talk will be to say, if we have a big data set like the RockYou set, can we develop some purely statistical metrics that don't assume anything about what the passwords are or what they mean; they just look at this column on the left here which says, "The most common password was chosen this many times and the second most common password this many times," and so on? So this is the histogram of popularity of passwords. Can we develop some metrics that rely only on those statistics that will be useful to us? And if it's possible to do this then we can collect data in a privacy-preserving way, which is nice, which will let us collect a really, really big data set. All right. 
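The popularity histogram just described can be sketched in a few lines. This is my own illustration, not anyone's actual collection pipeline, and the function name is made up; the point is only that the identities of the passwords are discarded and just the frequency counts remain.

```python
from collections import Counter

def popularity_histogram(passwords):
    """Collapse a list of passwords into an anonymous histogram:
    the count of the most common value, then the second most common,
    and so on, with the password strings themselves discarded."""
    return sorted(Counter(passwords).values(), reverse=True)

sample = ["123456", "123456", "123456", "password", "password", "iloveyou"]
print(popularity_histogram(sample))  # [3, 2, 1]
```

Every metric discussed in the rest of the talk can be computed from this list of counts alone, which is what makes privacy-preserving collection possible.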
So I'll briefly descend into some more theoretical framing of the problem, which is that we'll assume passwords are just a probability distribution that we know completely. So when a user chooses a password they're just doing a random draw from some distribution, and we know exactly what all of the probabilities are. And in this model we can develop some metrics for a given distribution: how hard is it for an attacker in different circumstances? So the first port of call is to compute the entropy of this distribution. So this has been around for decades and it's been -- Excuse me. So entropy can be used to measure the uncertainty that we have about some unknown value drawn from a distribution, but it doesn't actually have to do with sequential guessing in the sense that we're interested in, where the attacker has to guess, "Is the value this? Is the value that?" What entropy would measure is if an attacker could say, "Does the value come from this set? No. Does it come from this small set?" and so on. Entropy actually tells you exactly how many guesses you need on average to win that guessing game where you can guess whole sets. Which is basically, if you ever played the board game "Guess Who," you may have figured out the optimal strategy, which of course is just to do a binary search. And if the distribution's not even, then you develop a Huffman code and you search over the bits one at a time that way. Unfortunately in the case of passwords or basically anything else chosen by humans you very rarely get to guess whole sets at a time. You have to guess one element. So an alternate metric that has been proposed is called either guesswork or guessing entropy. And this is really simple. It just says if you guess the passwords in order of probability, what's the expected number of guesses that you'll need before you succeed? So just a simple summation here. So this looks pretty good; this looks like what we want. 
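The guesswork summation just described can be written out directly; a minimal sketch with a uniform distribution as a sanity check (the function name is my own):

```python
import math

def guesswork(probs):
    """Guesswork / guessing entropy: expected number of sequential
    guesses when values are tried in decreasing order of probability,
    G = sum over i of i * p_i (1-indexed)."""
    p = sorted(probs, reverse=True)
    return sum(i * pi for i, pi in enumerate(p, start=1))

# Sanity check: a uniform distribution over N values takes
# (N + 1) / 2 guesses on average.
N = 1000
assert math.isclose(guesswork([1.0 / N] * N), (N + 1) / 2)
```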
The problem is that when you dig into the RockYou data set, for example, you find a lot of stuff in the tail of it which certainly looks like random 128-bit hex strings. So about one in a million people within the RockYou data set chose something that looks like it was drawn randomly from a 128-bit space. Okay? So the implication of that, via a relatively easy-to-prove lemma that I'll skip over here: if our distribution is a mixture distribution of two different things, the guesswork of the mixture has to be at least -- or it's bounded by the guesswork of the two component distributions weighted by what weight they have in the mixture. So the punchline here is that if you're guessing against the RockYou set, with probability one in a million you end up guessing against one of the people who's chosen a password from a really big space. And when that happens it takes you on average 2 to the 127 guesses to succeed. And when you multiply those together you can see that even if you guessed everybody else's password instantaneously, it would take you on average more than 2 to the 107 guesses to guess a RockYou password. And this is without assuming anything about how everybody else chose their passwords. And I think we'd probably all say that 107 bits is not really a useful number about the RockYou data. There's no way that people's passwords are actually that strong. So what's going on here really is that the outliers in the distribution completely overwhelm the statistic and make it meaningless for us. I see a couple of slightly perplexed looks. >> : So this is assuming that you pick one username and you just devote all of your energy to going after that password? >> Joseph Bonneau: Yeah, exactly. Yeah, so what Cormac said is exactly the right observation to transition to a metric that makes a little bit more sense. >> : This is just taking the average of a very heavy tail distribution. >> Joseph Bonneau: Exactly. Yeah. 
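The bound can be checked with back-of-the-envelope arithmetic: even if the other 999,999 in a million accounts cost zero guesses, the one-in-a-million users with uniform 128-bit passwords dominate the average. A small sketch of that calculation (the weights are the illustrative figures from the talk, not measured values):

```python
import math

weight = 1e-6            # fraction of users with random 128-bit passwords
g_strong = 2.0 ** 127    # average guesses for a uniform 128-bit value

# Lower bound on the guesswork of the mixture, assuming every other
# account is guessed instantly (zero cost).
lower_bound = weight * g_strong
print(math.log2(lower_bound))  # roughly 107 bits
```

This is why the mean of a very heavy-tailed distribution says almost nothing about typical accounts.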
>> : So I guess what's perplexing me is a lot of people study min-entropy, which is like the right measure for password [inaudible]... >> Joseph Bonneau: Exactly. Yeah. Okay, so yeah. We're getting there in a few slides. Perfect. Okay. So as Cormac said, the guesswork assumes that we're going to guess forever. But in reality what's more likely is that we have some big set of values, and we'll be happy if we can guess correctly on some of them but not all of them. So if we're guessing surnames, we have M things to try. A dumb strategy would be to exhaust all of your effort on the first name before you go to the next one, which may lead you to guessing some really weird surnames that only a couple of people in the world have. Obviously we want to guess the most likely thing for everybody first and then the second most likely thing and so on. And we'll be able to hopefully succeed without ever getting into the heavy tail of the distribution, the region where people have picked the really strongly random stuff. So it's fairly simple to come up with a few metrics that actually were proposed in the information theory literature in the nineties. We basically just have to model when our attacker's going to give up, and we can do that either by saying he'll give up after a constant number of guesses -- so this is the beta success rate -- or we can say that the attacker has some desired success rate that they want to get to and they'll give up after that. So basically if the attacker wants to have a 25% chance of success, they'll know exactly the number of guesses they need to get to that. And they'll try that number of guesses on every account. So these metrics I think are quite good. Min-entropy was proposed earlier; min-entropy would basically be if you're limited to one guess per account. But this is an extension of min-entropy if you do more than one guess per account. 
The only thing that's not quite right about this metric is that this measures the size of the dictionary you need, but you won't actually have to use your whole dictionary on every account. So if I told you that there is a set of one million passwords that collectively cover 25% of the users in the RockYou case, then you could compute this metric exactly. And you could say, "Great, if my desired success rate, alpha, is 25%, I'll need a dictionary of size a million." But you won't actually have to do a million guesses per account because you'll hit early a lot of the time. So a new metric proposed by me is to scale down this required dictionary size a little bit by saying that in the cases where you succeed, you'll require less than the full dictionary size. And you just do a partial summation, which is like the original guesswork metric. So I've basically taken this guesswork or guessing entropy idea and scaled it down by saying, "The attacker's only going to guess until they get to a certain probability of success." Yes? >> : So you said you'll guess early a lot of the time? >> Joseph Bonneau: Yeah, you'll succeed >> : Which is... >> Joseph Bonneau: early. >> : A lot of the times. That's counter to the fact that it's a hundred and something bits of entropy because a lot... >> Joseph Bonneau: So, yeah. I guess... >> : You would guess early rarely. >> Joseph Bonneau: So you -- I think we may be thinking of two different things for guess early. I'm saying that within the accounts that you successfully guess, you'll almost always guess them not with the last guess from your dictionary but, you know, one of the earlier guesses. And when that happens then you get to quit. 
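The scaled-down metric, which I'll call alpha-guesswork here (the naming and code are my own sketch of the partial summation just described), charges successful accounts only the guess that hits, while accounts outside the dictionary still cost the full dictionary size:

```python
import math

def alpha_guesswork(probs, alpha):
    """Expected guesses per account for an attacker who tries the most
    likely values until reaching cumulative success probability alpha,
    stopping early on each account as soon as a guess succeeds."""
    p = sorted(probs, reverse=True)
    expected, covered, mu = 0.0, 0.0, len(p)
    for i, pi in enumerate(p, start=1):
        expected += i * pi   # accounts that fall to guess number i
        covered += pi
        if covered >= alpha:
            mu = i
            break
    # Accounts outside the dictionary cost the full mu guesses each.
    return expected + (1.0 - covered) * mu

# Uniform over 4 values, alpha = 0.5: dictionary of 2; expected cost
# 1 * 0.25 + 2 * 0.25 for the hits plus 0.5 * 2 for the misses = 1.75.
assert math.isclose(alpha_guesswork([0.25] * 4, 0.5), 1.75)
```

With alpha = 1 this reduces to the plain guesswork summation, which is exactly the degenerate case the talk warned about.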
Right, so this is basically a generalization of the previous guesswork metric where if you set alpha equal to one -- say that the attacker's desired success rate is 100% and they're going to guess forever until they succeed every time -- you get the old guesswork metric, which I argued was kind of useless because it gets skewed by the uncommon stuff. All right. So this is a quick plot of some PIN data, and so we have either the dictionary size or the required number of guesses as you increase your desired success rate. And the dark black line here, this is a random four-digit PIN. The dotted line is a random three-digit PIN. And the blue is what people actually chose. And you can see that for lower success rates the required dictionary size and the expected number of guesses are roughly the same. But for high success rates the expected number of guesses is much lower than the dictionary size. This is because the ability to quit early becomes, you know, increasingly useful. So I think this is in a sense the right plot and the right way to think about it. The problems are that it's impossible to see what's going on here. All of the information here is basically crushed, and it's a little bit hard to reason about the difference between these two uniform distributions. They have a different slope, but it's not easy to visually pick out exactly how strong those are. Oh, and if you're interested in seeing where I got the PIN data, I had a publication at Financial Crypto this year that was all about trying to figure out how people pick PINs. Right. So back to what I was saying, to make this graph a lot more useful what we want to do is convert everything into bits. I think bits are the scale that security people and crypto people are used to reasoning about, and they're also logarithmic, which I'll show you in a second why that's so nice. So I think that this is probably one of my less interesting slides so I'll just show you the formulas and say that they're there. 
They're not super complicated. You move things around a little bit and you take a log. And you can convert these metrics into bits. With the nice property that for uniform distributions every one of these metrics will give you the same strength as measured in bits. Right. So visually back to this picture, if we convert to bits we get this. So now two uniform distributions are flat lines which is nice. So we can see that the three-digit pin is about ten bits; or if we use a scale of dits, it's equal to exactly three. And the random four digits is up here also a flat line. And our expected number of guesses is here, and you can see with a lower desired success rate it's almost the same as the required dictionary size. And then it starts to peel away and for alpha equals one, we get the traditional guesswork metric. And the Min-entropy is down here. It's basically an attacker with a desired success rate of zero or any epsilon just slightly greater than zero. For them the difficulty is the Min-entropy. And the Min-entropy is extremely simple. It's just the log of the probability of the most common event in the distribution, and it serves as a lower-bound. Everything from this point on has to be higher than the Min-entropy. So, yeah? >> : Actually, it's quick. First, so this the data of the PIN number summation. >> Joseph Bonneau: Yeah. >> : Okay. Why is the curve going down? Like in the left-hand side there. Like... >> Joseph Bonneau: Here? >> : Yeah. >> Joseph Bonneau: Why is the curve going down? >> : I feel like as it's going to the right there there's a part where it's actually sloping down. [Inaudible]... >> Joseph Bonneau: I think it's flat right here. >> : Oh, it is flat? Okay. Just -- Okay. >> Joseph Bonneau: Okay. >> : It's never going down on us. >> Joseph Bonneau: Yeah, it's never going down, but it will be flat over a small region which is -- I think in this data set the most common PIN had a probability of like 1.5%. 
So if your desired success rate is anything up to 1.5% then it's flat in that region because you'll have to do one guess. And it's sort of -- You can see a couple of steps and then it becomes smooth as the events have, you know, low probability. All right. >> : There's sort of an uptick there at, you know, .35 or something. Do you have any idea what that is? >> Joseph Bonneau: I don't. I mean it's probably -- So the PIN distribution, and when we studied PINs -- I mean, people choose PINs by a lot of different strategies. So there's, you know, people who choose something really -- the easiest thing they can remember, if it's 1234, or I think 0258 is like a straight line down the PIN pad. And then it transitions into people picking birthdays, so you have sort of, you know, a uniform distribution over a smaller space, and then you get to other stuff. So if I had to guess, this is probably roughly the really common PINs, maybe the range of dates, because I think it was about 20% of people who chose dates. And then you drift into people who start actually doing pretty sensible stuff like picking an old phone number, which may yield some other attacks but is basically random. >> : What was the population? >> Joseph Bonneau: This population was actually iPhone application users. >> : In the U.S. or worldwide? >> Joseph Bonneau: Yeah, U.S. >> : Okay. The four digits might correspond to a telephone number. >> Joseph Bonneau: Right. Yeah. Okay, so this generally makes sense. The goal is to recreate this diagram for passwords and in particular to be able to do it for different groups of users to see how they compare. Right. There is some theoretical stuff that I think I'll skip in the interest of time. Okay. And I guess a quick summary: I argued that Shannon entropy doesn't apply to the guessing we're interested in, the guesswork measure is skewed, so we want to use this parameterized guesswork metric. But the other metrics I mentioned can be useful in some cases. 
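The conversion to bits described a couple of slides back can be sketched as follows. Each metric is normalized so that a uniform distribution over N values scores log2(N) bits at every alpha, which is what makes uniform distributions plot as flat lines; the exact normalization constants here are my reconstruction of the formulas and should be checked against the published paper.

```python
import math

def min_entropy_bits(probs):
    """Lower bound for every attacker model: -log2 of the probability
    of the single most common value."""
    return -math.log2(max(probs))

def alpha_guesswork_bits(probs, alpha):
    """Effective key length for alpha-guesswork: partial summation of
    expected guesses, converted so a uniform distribution over N
    values gives exactly log2(N) at any alpha."""
    p = sorted(probs, reverse=True)
    g, lam, mu = 0.0, 0.0, len(p)
    for i, pi in enumerate(p, start=1):
        g += i * pi
        lam += pi
        if lam >= alpha:
            mu = i
            break
    g += (1.0 - lam) * mu  # failed accounts cost the full dictionary
    return math.log2(2.0 * g / lam - 1.0) - math.log2(2.0 - lam)

# A uniform distribution plots as a flat line: log2(N) bits at any alpha.
N = 1024
for a in (0.1, 0.5, 1.0):
    assert abs(alpha_guesswork_bits([1.0 / N] * N, a) - 10.0) < 1e-9
```

On this scale a random three-digit PIN sits at a flat log2(1000), about ten bits, matching the plot in the talk.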
It's not possible to compute this guesswork metric for old results that are published, so a lot of old results will basically be reported as, "We tried a dictionary of size N, and we broke this percentage of accounts," which can be converted into the required dictionary size metric. But we can't figure out the expected number of guesses per account because they haven't reported how difficult it was at each step along the way. And for reasoning about really limited online attacks, like if somebody only has three guesses, then obviously we'll want to use the metric that just says, "How much success can you have with three guesses?" or whatever the rate limit is. And in that case the gain from switching to the expected number of guesses is very, very low. Okay. So given that I just want to look at the histogram of how popular things are, this let me set up an experiment to collect passwords in a privacy-preserving way. So why not just use the RockYou data or other leaked data sets? Well, there's no demographic data, which doesn't let me do all the experiments I would like to do. There's some question about whether the RockYou passwords represent passwords that people choose for more secure sites, because RockYou just develops photo-sharing applications and some Facebook games. So maybe these aren't the best passwords. And I do think that there are legitimate ethical concerns with basing the whole field of password research on data that has been leaked after being stolen. There was a panel we did at FC with Stuart and some others where we got into this in more detail. I will say actually I went into the panel, you know, being more convinced that we should use data that gets leaked if it's available and it can advance science. And I came away from it with actually a lot more reservations, that there are good reasons that we shouldn't be basing what we're doing on waiting for the next website to get hacked. All right. 
So I went to Yahoo last spring as an intern and I came up with this proposal. So the normal login process sees a stream of usernames and passwords in clear text before checking them against some database of hashed passwords. So I put a proxy server, for the experiment, sitting in between the two that would be able to see the stream of clear text passwords. So the simplest design, which is deficient but would work, is to just log every password that gets seen in plain text and throw away the usernames. So this is basically how the RockYou data was done: the passwords were leaked by a hacker who was trying to be somewhat ethical, who said I'm going to leak the passwords and not the usernames, so you get a big data set but it's hard to tie these to individual accounts. I think that this is certainly a bad model for an experiment you're setting up on purpose, for a couple of reasons. If this data ever leaked, it would let you train a cracking dictionary better. And also for people who've gone out of their way to memorize some really strong password that's never been seen before, once it's been seen once anywhere it can get added to a cracking dictionary and you basically hurt that person's security. So we definitely don't want to store the plain text passwords. We could store the hash of the passwords, which is okay. It sort of eliminates the problem of adding some password that's never been seen before to a cracking list, but it's still possible to go and run a cracking attack and try and figure out which passwords people at Yahoo are most likely to use. So the insight that basically let me do the experiment, and that won over the management team, is to say the proxy server at the time the experiment starts is going to generate a key randomly which it keeps in memory during the experiment. Every password that gets hashed will be hashed along with the secret key. 
And then when the experiment is over this key gets destroyed, and for the log of hashes of passwords with this key it's then impossible to do the dictionary attack because the key is gone. So it's different from salting in the traditional sense when passwords are stored, because the point of salt is to make it hard to tell if two users have the same password, whereas being able to tell if two users have chosen the same password was a requirement for doing the statistics. So this is the same key for everybody, but it was 128 bits long so impractical to brute force. Yeah? >> : What if the user mistypes the password? You're capturing here what the user's typing when they're trying to log in, right? >> Joseph Bonneau: Right. Yeah, so obviously I wouldn't want to log the incorrect passwords. But I actually noted the success or failure of the login and only kept the successes. So there are a couple more steps to this. In particular since I wanted to be able to look at different demographic groups of users, for every user that's seen I would do a database query so that I could get a few bits of information about the user. So this is sort of pass one of doing that. You see that "joe" is logging in, and then you log his password along with, say, a gender and a language and an age group. >> : If you could include whether they're pregnant or not, we could get one of the questions we're not allowed to ask you [inaudible]. >> Joseph Bonneau: Right. Right. So the problem with doing this is that if you can re-identify users based on their demographic data then it would be possible to pair together users and see that they had the same password. So I want to collect really detailed demographic data. And of course as it gets more detailed, you'll start to have people who are unique in the experiment. And if it's possible for me to re-identify myself and re-identify somebody else and see that we have the same password then I've basically figured out that person's password. 
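The keyed-hashing design just described can be sketched as follows. This is a reconstruction under stated assumptions, not the actual Yahoo code: one ephemeral key shared by all users (unlike per-user salt), HMAC-SHA256 as the keyed hash, and only successful logins recorded:

```python
import hashlib
import hmac
import os
from collections import Counter

# Ephemeral key generated when the experiment starts, held only in
# memory, and destroyed when the experiment ends.
SESSION_KEY = os.urandom(16)  # 128 bits, impractical to brute force

def blind(password: str) -> str:
    """Keyed hash of a password. The same key is used for every user,
    so equal passwords map to equal tokens (enabling frequency
    statistics), but without the key no dictionary attack is possible."""
    return hmac.new(SESSION_KEY, password.encode("utf-8"),
                    hashlib.sha256).hexdigest()

histogram = Counter()

def record_login(password: str, success: bool) -> None:
    """Only successful logins are kept, so typos are never logged."""
    if success:
        histogram[blind(password)] += 1
```

Equal passwords still collide, so popularity statistics survive, but once `SESSION_KEY` is destroyed the log of tokens can no longer be dictionary-attacked.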
So instead of storing all of the demographic data directly in the log next to the password, the solution is to just store a bunch of different logs that are separate for each demographic detail. So basically, in the case of me, it will get written to the log of passwords of users who self-identify as males; it may not be accurate. And, you know, every other piece of demographic data that I have. And one final thing was to just keep a Bloom filter of users that you've seen so that you don't double count people and, of course, to check the login status. Yeah? >> : I'm curious, you said you did this at Yahoo. I'm curious if you were worried about the collection proxy, K, being subpoenaed? >> Joseph Bonneau: Being subpoenaed? >> : Yeah. So let's say... >> Joseph Bonneau: Subpoenaed? >> : that somewhere says, you know, one of the users is Osama Bin Laden and here is a warrant to go subpoena [inaudible]? >> Joseph Bonneau: Right. So I wasn't worried about that for a couple of reasons. For one thing K was destroyed, so the actual key was... >> : So you destroyed it at the end of the collection? >> Joseph Bonneau: Yeah, yeah. Right, which was over in 48 hours. And it was -- Yeah, I mean the machine generated its own key. It was hashed with a key that -- You know, a manager generated a key separately to add in, which maybe he didn't destroy his personal copy of. But in any case, I mean, Yahoo has a pile of hashed passwords anyway, so if a subpoena came in they have hashes of every user's password sitting around to begin with. >> : But those are salted. >> Joseph Bonneau: Yeah, but the salt doesn't matter. I mean, they have the salt. >> : They just subpoena the login server. >> : Yeah. >> Joseph Bonneau: Yeah. Yeah. >> : So why a hash instead of a pseudorandom function? >> Joseph Bonneau: I mean I'm using a hash as a pseudorandom function. I mean, yeah, like what -- I mean, I'm using a hash with a key as a pseudorandom function, but, I mean, what... 
>> : Like some hash functions potentially let you check whether two passwords -- Like one password is the prefix of another password. >> Joseph Bonneau: Hmm. Talking about like a length extension thing? I guess I wasn't concerned about that because every password was shorter than the block size. So we did SHA-256, so all of the passwords fit in a single block. But, yeah, I mean point taken. Yeah. All right. So generally the provenance of this data makes sense? Okay, great. So afterward, yeah, data went to me. So the experiment ran for 48 hours. The goal was to get a hundred million users, and we miscalculated a little bit. We got seventy million. 42.5% of the passwords were unique within the survey, and we computed a whole bunch of predicate functions, some of which turned out to not be very interesting and a lot of which didn't get to a minimum sample size to do interesting statistics on. >> : So what was the uniqueness one more time? >> Joseph Bonneau: That means that -- So of the seventy million passwords, 42% didn't choose the same password as anybody else. >> : Unique usernames but not unique passwords. >> Joseph Bonneau: Well, every username is unique. >> : Okay. >> Joseph Bonneau: But a lot of the passwords, so yeah, 58% of people chose the same password as somebody else in the sample. >> : That's very significant [inaudible] RockYou. RockYou was about a third unique across thirty million, so this is twice the sample size. You'd expect far less uniqueness. >> Joseph Bonneau: Yeah, it was. I mean it'll show up on the graph later, but these passwords were stronger by a couple bits. >> : [Inaudible] is that the RockYou passwords were weak. I mean these people were choosing the passwords for RockYou [inaudible]. >> : Well, I'm saying the distribution shows a lot more dedication from the users to choosing [inaudible]. >> Joseph Bonneau: Yeah. And it's significantly better in a statistical sense. In terms of real-world impact on guessing attacks they're not dramatically better. Okay. 
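The de-duplication step mentioned earlier -- keeping a Bloom filter of users already seen so nobody is double-counted, without storing the usernames themselves -- could look roughly like this (a simplified sketch; the filter size and hash count are illustrative, not the experiment's actual parameters):

```python
import hashlib

class BloomFilter:
    """Space-efficient approximate set: never misses a user it has
    seen, with a small false-positive rate for users it hasn't."""

    def __init__(self, m=1 << 20, k=4):
        self.bits = bytearray(m // 8)  # m bits of storage
        self.m, self.k = m, k          # k independent hash positions

    def _positions(self, item):
        # Derive k positions by hashing the item with k different prefixes.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A returning user hits the filter and is skipped, so each account contributes at most one password to the statistics.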
So I should probably go quickly over this so that I can show you some of the results. But there's some question about how we can approximate these metrics given what I got from Yahoo, which is a random sample. So we don't have perfect knowledge of the distribution like I assumed when I presented all of the metrics. So it turns out that even sixty-nine million people sounds like a really huge sample, but it's inadequate to compute a lot of things. So if I take random sub-samples of what I had and compute different metrics naively, this is the entropy, the guesswork and this is the total number of passwords. And you can see that these are still growing pretty significantly even as we add this, you know, sixty-nine-millionth user to the sample. So basically projecting we can tell that if we added many more users to the sample, we would get a higher estimate for the entropy and the guesswork and all the rest of it. So we can sort of see just from this graph -- And I'll show you the theory in a second -- that it's not possible to compute these metrics even with a sample of this size. For the scaled-down guesswork metric, if we take it at a 25% desired success rate we can see it's growing pretty steadily. And then at a certain point, which was smaller than our actual sample size, it does stabilize and the naïve estimate becomes correct. And for the simpler stuff, like the Min-entropy -- this is equivalent to Min-entropy, or the Min-entropy extended to ten guesses -- we can basically estimate these correctly very easily with a small sample size. So even with just about a thousand users, we have a pretty good idea of what the Min-entropy is. So some results from theory: we know we can't estimate -- Well, there's no known way of estimating the total number of passwords that are possible by just looking at a sample of this size, and we can't estimate the entropy. 
And there are some general results about computing properties of distributions, which can be extended to this guesswork metric, that say it won't be possible to compute the guesswork. And the intuition behind this theorem is that you can't compute any metric that depends on things that occur only a small constant number of times in the sample. So particularly if things that occur only once in the sample would affect the metric, then you won't be able to compute it accurately. And obviously in the case of guesswork, some things that only occur once do make a big difference. We can escape that result for some of the partial guessing statistics just because they only depend on things that all occurred, you know, many times in the sample. And the other theoretical result, which is interesting, is that asymptotically you can't do any better to estimate any of these metrics than just throw away all of the uncommon stuff and use the common stuff, which is relatively well estimated from the sample. So that's going to be the approach. So if I'm just trying to compute this required dictionary-size metric at different sample sizes and I take different sub-samples of the distribution, so I say reduce to ten million or one million samples, you can see that for a low desired success rate, even sampled down to five hundred thousand people, we're estimating the exact same values. But as you get higher, one by one, the sub-samples flatten off. This is the region where this is all based on things that we're only seeing once in the sample. So you get these huge systematic underestimates if you try and compute these things directly. So question number one is just if it's possible to figure out automatically what your limit of confidence is. And I developed a technique in my thesis to do this by bootstrapping the smaller sample to figure out what the confidence limit is, which I'm afraid I won't be able to go into in any more detail than that. 
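The underestimation problem just described is easy to reproduce. This hypothetical helper (not from the talk) computes the required dictionary size naively from a sample and flags when the estimate relied on passwords observed only once -- exactly the region where the metric stops being reliable:

```python
from collections import Counter

def naive_work_factor(sample, alpha):
    """Naive required-dictionary-size estimate from a sample: count
    distinct passwords, most common first, until a fraction alpha of
    the sampled accounts is covered. Returns (size, used_singleton);
    used_singleton is True if the estimate needed passwords seen only
    once, where it systematically underestimates the true value."""
    counts = Counter(sample).most_common()
    n = len(sample)
    covered, size, used_singleton = 0, 0, False
    for password, c in counts:
        covered += c
        size += 1
        if c == 1:
            used_singleton = True
        if covered / n >= alpha:
            break
    return size, used_singleton
```

At low success rates the naive estimate is exact; once singletons are needed to reach alpha, the reported dictionary size is a lower bound at best.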
But it works pretty well and can automatically pick out, marked in this diagram by switching from a solid line to a dashed line, exactly the point where the metric is no longer reliable. And a second question is, is it possible to actually project, if you fit the data to some model, so that we can estimate the value for higher desired success rates of the attacker. So it's not possible to do this without assuming some model for what passwords look like, but if we do that -- And I stole a model from the natural language processing community that's been used to model word frequencies, which has been an area of research for about 15 years and has been quite challenging. This is basically one of the best performing models for doing projection of word frequencies from subsamples, and, fit onto passwords, it gave pretty good results. So this is the naïve estimates, and by projection it gets converted into this. So it's certainly not perfect. It leads to a big overestimate for the sample of five hundred thousand, but I mean compared to this picture, fitting to an assumed distribution does allow you to make estimates empirically that are reasonably accurate. I would probably caution against using the estimates out here, partly because we really have no idea what passwords look like in that space. This is the space where the cracking evaluations usually don't go. And from empirical data -- it's almost always true for every data set we have -- this is the space where you're reasoning about passwords that have only been observed once, and it's basically impossible to do that in a meaningful way. But I do think for up to, say, a 50% success rate, where we have somewhat of a good idea of what the distribution of the passwords looks like, this projection can be useful for taking a smaller sub-sample and still estimating this with some accuracy. All right. So some data. 
So -- And I've used the same projection technique in all of these graphs that I'll show you, and I've marked the point where I'm switching from an accurate estimate with no need for projection to a projection. So this is a comparison of Yahoo on the top here to RockYou in purple and to other big leaks that I was able to get. >> : Does Yahoo have any password composition policy? Or is it just... >> Joseph Bonneau: Good question. Currently the only requirement is six characters, and some of these passwords were actually collected prior to that being instated. >> : So 123456 is a valid Yahoo >> Joseph Bonneau: Yeah. >> : password [inaudible]? >> Joseph Bonneau: Yeah. And I actually don't know what the most common Yahoo password is because of the way the experiment was set up. But 123456 has been the most common in everything I've ever looked at, so it's a pretty safe bet. >> : They do have the password [inaudible] to try to get people to use [inaudible]. >> Joseph Bonneau: Yeah. So actually, yeah, I'll come back to that in a second. But basically all of these websites that got leaked -- Battlefield Heroes was a game. Gawker is a blog. RockYou is Facebook apps. So the Yahoo passwords, there is a gap of about two bits here, but it is harder. Basically across the range of different guessing attacks the Yahoo passwords are a little bit better, but to an attacker two bits isn't a huge slow-down. >> : [Inaudible] putting in a lot of work, that one's actually easier than some of the other ones? >> Joseph Bonneau: Yeah, exactly. So on the tail here the most common thing at Yahoo was actually more common than at some of the other sites. Yeah, that's exactly the way to interpret this. Okay, and if I add in a bunch more data from all those cracking experiments earlier, and now we have to switch to the expected dictionary size and not the expected number of guesses, it's pretty sensible. 
Basically all of the cracking results that have been published are overestimates of security, in the sense that all the guessing metrics I've proposed model an attacker who's perfect. So it makes sense that the real cracking would be a few bits worse than that consistently. Although, it's also interesting how the variation between different leaked data sets is relatively small compared to the huge variation between different cracking attacks, which to me is a justification for using the statistical metrics, because I think that there's a lot of uncertainty introduced by cracking. Yeah? >> : The various cracking studies, were they real password collections or lab studies or [inaudible]? >> Joseph Bonneau: So all of the ones here were real data. So I think I switched from circles to hexagons here to indicate these were -- Well, actually so these were a lab study, this purple dot here. And these were all system passwords, and these were all real web passwords that got leaked. Okay. And another comparison I thought was interesting: how do passwords compare to actual natural language? So here we have the Yahoo passwords, and I took data from the Google Ngram Corpus which has frequency counts for words, pairs of words, triplets of words. And basically the conclusion is that a password is worth between two and three words of natural language. All right. And as for some of the demographic things, there's a massive table in the paper with every different group. At a high level, though, people from different countries varied -- One of the biggest variations was splitting people up by country. I've heard various theories, I guess, of why certain countries were stronger or weaker. Informally there wasn't really any great correlation with region or GDP of the country or anything like that. I think the U.S. and China were about the two strongest. I don't have a great justification for why that is, but I think it's interesting that the variation between countries was so large. Yeah? 
>> : Demographic mix? >> Joseph Bonneau: Yeah, I think that it could be that... >> : [Inaudible] the United States is. >> Joseph Bonneau: Yeah. >> : China is [inaudible]. >> : Aren't there still [inaudible]? >> : [Inaudible]. >> Joseph Bonneau: So I think that the... >> : [Inaudible] factor here. >> Joseph Bonneau: Yeah, although, I'll come back to that in a second. The U.S. benefits because people from a lot of different countries that don't have -- So Yahoo has different versions in a lot of countries, but not every country. And a lot of people come to the U.S. site who are actually from a different country. >> : Is Hong Kong part of China here? >> Joseph Bonneau: Well, this isn't actually where users come from by IP address. This is users who go to Yahoo U.S. versus Yahoo China. >> : Okay. >> Joseph Bonneau: So people from Hong Kong can go where they want. Okay. The age-group question that I threw out: those of you who picked the older users being better are actually right. This green down here is age 13 to 24, so teenagers and early twenties, and they were actually the weakest. The variation is much lower by age than by country but, yeah, age 45 to 54 I think was the strongest, and then 55 plus was good too. >> : Does that surprise you, how close they are? >> Joseph Bonneau: Compared to -- Well, a lot of the sub-distributions were very close. In fact the biggest high-level conclusion is that the demographic groups really don't change very much. There's sort of this weird sort of universal distribution that everybody's not too far from. I definitely thought going into it that the younger users would be better because of this sort of digital native, you know, mythos. But I think it's also possible that the older users are just more conservative. >> : You pointed out that these are very small differences. Are they statistically significant? >> Joseph Bonneau: Yeah, they are statistically significant. 
They're statistically significant at least up to here, to within .1 bits, which is smaller than the gap here. But in terms of real world significance, if it's like .4 bits then probably not a big story for attackers. More interesting I think was a couple of different groups of users who should be, in quotes I guess, "more motivated" to choose a secure password. So the people who have an account with Yahoo's retail service, these people have a payment card registered. So, one might expect that people who have a payment card registered will be much more motivated. There is actually a very big gap here in that the users with a payment card are much less likely to choose one of the most common passwords globally. But up here, comparing to general dictionary attacks, they don't really do very much better. >> : But isn't that precisely where you want -- I mean if I put a payment card -- I mean if Yahoo is allowing somebody to get away with a 30% success rate, I mean it's like, well, it's all over anyway. Right? It's... >> Joseph Bonneau: Yeah. >> : ...[inaudible]. >> Joseph Bonneau: Yeah. So I mean you could definitely argue that users are making a rational choice here which is that, "I have a payment card so I'm not going to choose 123456, but I'm also not going to take the time to memorize say like eight random digits," or something like that, which is exactly what was observed. >> : Why do you think people care more about, or would care more about [inaudible]... >> Joseph Bonneau: Well because, I mean, if their account gets hacked then.... >> : The credit card company. >> Joseph Bonneau: What? >> : If your credit card gets stolen, your credit card company [inaudible]... >> Joseph Bonneau: I think most people don't... >> : ...[inaudible]. >> Joseph Bonneau: Most people don't know that and it's a hassle, right. I mean I think it seems more serious if you have a credit card registered. 
And I guess the last one I should show you because I'm running a little bit over: Yahoo changed registration forms about two years ago. They switched from no requirements and basically no feedback to a six-character minimum and a graphical indicator to try and nudge users to pick better passwords. >> : But by no requirement you mean "A" would work as a password? >> Joseph Bonneau: Yeah. So they -- I believe so, yeah. But I should check. But I know that they didn't have -- They might've had like a three-character minimum, but they upgraded it to six. So the gap -- So this was version two, the old one. They upgraded to having this graphical indicator. This made a gap of about a bit, and on the low end it basically did nothing because most of the most common passwords on the web are sort of prescreened to satisfy six characters. So 123456 is more popular than 12345. So the conclusion here might be that it's possible that Yahoo didn't do a good job designing the graphical nudge. I mean it's also possible that the gain you can get from nudging users just isn't that high. One more interesting result, I think. I took the sets of passwords registered by users coming from different languages and I said, "Suppose I trained a guessing attack on speakers of one language and ran it against passwords registered by people who speak another language, how efficient would that be?" So if you are attacking German speakers' passwords and you've trained on German speakers -- this table is for a thousand guesses. If you've trained correctly on the German users you would get 6.5%, and if you trained on all these other languages your success rate declines a lot. I mean if you trained on Korean speakers, it goes down to 1.6%. But it's not a huge gap. The biggest gap I ever saw was actually only a factor of five. So the conclusion from this was really surprising. I thought that if you trained on the wrong language group that this would really make life difficult as an attacker. 
But it turns out you can train on even the totally wrong language group, like Vietnamese, and go to the French users and do okay. And if you come up with a global dictionary that works against everybody, you can do this quite successfully. If you do that you never vary in efficiency by more than a factor of two. This is for like sort of a limited online attack, and it gets a lot worse if you do a more extended dictionary attack. But basically the same weak passwords are used by people everywhere no matter what language they speak. And I have a paper that's all about passwords in different languages and character encodings where we looked at some leaked data where we actually had the plain text. And we confirmed a really interesting trend, which is that Chinese-speakers very strongly pick passwords with only numbers in them. So over half of Chinese-speakers, their password is only numbers. And for English-speakers it's below 20%. A lot of it has to do with the fact that it's very difficult to enter a password in Chinese. So... >> : [Inaudible] so that's surprising actually. [Inaudible]. >> Joseph Bonneau: Right. So the -- We should take that offline maybe. Anyway if you're interested in... >> : That's what I observed. >> Joseph Bonneau: What? >> : That's what I observed. >> Joseph Bonneau: Right. >> : Yeah, how the Chinese-speakers tend to use numbers not letters. >> Joseph Bonneau: Oh, we should catch up after. So, okay, I can probably pause there and if there are any wrap-up questions. I guess I would -- The wrap-up slide I had has the reference to the paper if you want to go read in more detail. Also my thesis should be available soon, which has a lot more detail in it. Did I achieve the goals I set out for? I think I've certainly introduced metrics that aren't demographically biased. They're repeatable. They're easy to calculate. As for how ecologically valid these metrics are, I think that they're pretty good. 
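The cross-language experiment just described can be sketched with toy data. Everything here is hypothetical (made-up populations and a two-guess budget instead of the thousand guesses in the talk), but it shows the mechanics: train an optimal dictionary on one group, then measure its success rate against another:

```python
from collections import Counter

def train_dictionary(passwords, budget):
    """Optimal fixed-budget dictionary for a training population:
    simply its `budget` most common passwords."""
    return [pw for pw, _ in Counter(passwords).most_common(budget)]

def success_rate(dictionary, targets):
    """Fraction of target accounts broken by guessing the dictionary."""
    guesses = set(dictionary)
    hits = sum(1 for pw in targets if pw in guesses)
    return hits / len(targets)

# Hypothetical toy populations; real training sets would be the
# per-language password histograms from the experiment.
german = ["passwort"] * 5 + ["123456"] * 4 + ["schalke04"] * 2 + ["gx7!"]
korean = ["123456"] * 6 + ["iloveyou"] * 3 + ["kpw1", "kpw2", "kpw3"]

trained_on_german = train_dictionary(german, budget=2)
```

Because globally popular passwords like 123456 show up in every population, a dictionary trained on the "wrong" language loses efficiency but does not collapse, which matches the factor-of-five worst case reported in the talk.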
The only problem is that I've assumed an optimal attacker everywhere, and that may not be true. I mean when we looked at the password cracking it was significantly worse than my metrics would indicate. So I think that that's probably one of the bigger open questions going forward: how much worse real guessing attacks are than optimal, and collecting some data on what people actually do. And of course I've also given up the property that these metrics can be computed with small data. So for people doing lab studies where they want to give two different conditions to two sets of users, they'll never get a sample that's big enough to compute these metrics. >> : What would be big enough? >> Joseph Bonneau: About a hundred thousand would be a rule of thumb. >> : Yeah, so I guess that goes more to not just [inaudible] validity but external validity. All of this assumes a single way of choosing passwords. >> Joseph Bonneau: Right. >> : So you can't use this methodology if you want to estimate how good a system is for nudging users to make better passwords unless you're going to expose millions of users >> Joseph Bonneau: Yeah, exactly. >> : to it. >> Joseph Bonneau: Yeah? >> : Did you collect any data on password length chosen by users? >> Joseph Bonneau: No, I didn't -- Yeah, I mean, by the design of the experiment that information was thrown away. But we basically have that from -- like we have the distribution from looking at the RockYou data. >> : Only within the parameters of the RockYou data set. >> Joseph Bonneau: Yeah, exactly. I guess I sort of made a conscious decision to go purely statistical, which made an easier pitch. And I was afraid if I started saying, "Well, this semantic information would be good," then, you know, where would I draw the line? Because it would be very interesting to see -- Like the thing about what percentage of people use numbers? 
It would've been great within the Yahoo data to see how that rate changed, especially when I found, by looking at leaked Chinese data, how amazingly high it was, but I decided to not go down that road. Yeah, do you have a question? Oh. I thought there was one over here. Sure. >> : So all of this is kind of, you know, there's one attacker and he's trying to do the best he can against the whole corpus. And so if I manage to steal the hashed password list and no one else has got it and I'm competing with no one, this is definitely the right metric. But if my guessing is I'm showing up at the web portal and I'm [inaudible] limited and all that kind of thing, this is still good. But if I'm showing up at the web portal and doing all of this, I also don't have the field to myself, right? There's ten thousand guys out there also trying similar strategies that they've learned from. So you could argue that, okay, the optimal strategy is to go after the most common stuff, 123456. And congratulations, when you get into an inbox you're going to find a thousand other guys in there who've also been using the same strategy. >> Joseph Bonneau: Right. >> : [Inaudible] Given that I'm competing against an unknown number of other attackers who are trying similar strategies, what's the right thing? I don't want to get into something that's been bore-holed into the ground, you know, where... >> Joseph Bonneau: Interesting. >> : The strategy is to be fast. >> Joseph Bonneau: Yeah, I mean it would be easy to -- Like you could come up with some solutions to that and say like, "I don't want to guess the actual ten most common things. I want to guess like things, you know?" >> : Just choose a random distance out in the distribution? >> Joseph Bonneau: Yeah. And then you can reason about how much your efficiency goes down. >> : Yeah. >> Joseph Bonneau: That would actually be something really interesting to go back and compute. 
>> : The second question was, you know, so what was your plan if, you know, you went to Yahoo and they said, "You want me to put a proxy where?" I mean what were you going to do? Did you have a backup? >> Joseph Bonneau: I mean, so we'd worked this out. This was like the proposal when I applied for the internship. So, yeah. One thing I could add is that I have thought about the model where, if you're trying to evade an intrusion detection system, that means that you want your pattern of guesses to look like the population distribution of passwords, and there actually is a pre-existing metric that already captures that perfectly, which is collision entropy, or Rényi entropy of order two. And I marked it actually in the PIN thing. But, yeah, that basically tells you, if you draw your guesses randomly from the population and then guess against random people, how likely you are to succeed. Yeah? >> : Did you plot the uniqueness of the passwords' counts? Like top -- Like where that tail starts, the long tail? >> Joseph Bonneau: I'm not sure I understand your question. >> : Or how common passwords are? So like the first few hundred are super common; they're not unique. And then there's that long tail. >> : Distribution. >> : Do you have the [inaudible] distribution? >> Joseph Bonneau: Yeah, I mean I have the histogram. I don't think I have a plot.... >> : Is it in the paper? >> Joseph Bonneau: It's not in the paper. I could send you the plot. It's a pretty simple plot. >> : Like for instance how many distinct passwords would you have to have to cover 10% of the password [inaudible]? >> Joseph Bonneau: Right. Well that I have. I mean that's basically here. Just if you move up from 10% to here it's, you know, 2 to the 14. But, yeah, I mean I could put in uniqueness and it'd be an easy plot to do. >> : It would just be interesting to compare like Yahoo and RockYou and stuff, where that [inaudible]. Where is a spike versus a tail? >> Joseph Bonneau: Right. 
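The collision entropy (Rényi entropy of order two) mentioned above has a one-line definition; as a sketch:

```python
from math import log2

def collision_entropy(probs):
    """Renyi entropy of order two: -log2 of the probability that two
    independent draws from the distribution collide. This models an
    attacker who draws guesses from the population distribution
    itself, e.g. to blend in with an intrusion detection system."""
    return -log2(sum(p * p for p in probs))
```

For a uniform distribution it equals the Shannon and min-entropy; any skew toward popular passwords pulls it down.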
>> : [Inaudible] >> Joseph Bonneau: Sure. >> : So you told us we're stuck with passwords. And people choose bad passwords. Do you have any thoughts on what we should be doing? >> Joseph Bonneau: I mean like I think priority number one -- I think I had a slide about this from -- So I did a survey a couple of years ago where I tried to actually look at the state of what websites that collect passwords actually do. And the numbers are like -- I mean, every time I show this slide there's like a few mouths that drop because like websites aren't at the level of actually hashing. Most websites still don't use TLS correctly. Rate-limiting is like really rare, actually. Like most websites just don't do it at all. So -- What? >> : How would you know? >> Joseph Bonneau: Well, I tested by trying a bunch of incorrect passwords and then trying a correct one. So it's possible that I just didn't hit the rate limit, but I figured -- I think I did a thousand guesses and then the correct one and I got in and everything was okay. So I figured that... >> : 84% of websites don't [inaudible]? >> Joseph Bonneau: No. So it's possible that -- I think a lot of websites will eventually rate-limit because they are worried about denial of service, but I didn't see any specific rate-limiting up to that point. And I think it would be hard -- There's a lot of argument about what the right rate-limit should be, but I've never heard anybody say it should be higher than a thousand. >> : For what? Per IP? >> Joseph Bonneau: Well, this is -- I just did one guess per second from the same IP, so it wasn't even cloaked or anything. So, yeah. I mean anyway, I think the point is that the world is so broken right now that having a really good understanding of the statistics of passwords and how hard they are to guess is more actually setting up for future work, I think, for when hopefully we come to some agreement where we end every website collecting passwords and doing it wrong. 
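The rate-limiting the survey found to be missing at most sites can be sketched server-side. This is an illustrative minimal design, not any site's actual implementation; the limit and window values are arbitrary, and a real deployment would also have to handle distributed attackers and denial-of-service concerns:

```python
import time
from collections import defaultdict

class RateLimiter:
    """Per-IP failed-login limiter: refuse further attempts once an
    IP has accumulated `limit` failures inside a sliding `window`."""

    def __init__(self, limit=10, window=3600):
        self.limit, self.window = limit, window
        self.failures = defaultdict(list)  # ip -> failure timestamps

    def allow(self, ip, now=None):
        """Should this IP be allowed another login attempt right now?"""
        now = time.time() if now is None else now
        # Drop failures that have aged out of the window.
        recent = [t for t in self.failures[ip] if now - t < self.window]
        self.failures[ip] = recent
        return len(recent) < self.limit

    def record_failure(self, ip, now=None):
        now = time.time() if now is None else now
        self.failures[ip].append(now)
```

Even a crude limiter like this would have stopped the one-guess-per-second, thousand-guess probe described above long before it finished.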
And there's a lot of work on that. I mean there's people working on different trusted hardware devices that could potentially serve as, you know, points for single sign-on or proposals like OpenID. Hopefully something like that succeeds eventually, and I think in the nearer term that's like more interesting work. But I do think like when we get to the -We'll always have the one password, so having a good idea of the strength will be very useful when it becomes more important. Maybe that's a good note to end on. Thanks everybody for coming. >> : Thank you. [ Applause ]