Andreas Weigend (www.weigend.com)
Social Data Revolution (SDR), INFO 290A-03
UC Berkeley, School of Information, Fall 2014
Class3 – October 7, 2014

Andreas Weigend (www.weigend.com)
Social Data Revolution, INFO 290A-3 (http://www.ischool.berkeley.edu/courses/i290a-sdr)
UC Berkeley, School of Information, Fall 2014 (http://ischool2014.wikispaces.com)

This transcript:
http://weigend.com/files/teaching/ischool/2014/recordings/weigend_ischool2014_3.docx
https://www.dropbox.com/s/srjx55dk5zw0yjj/weigend_ischool2014_3.docx?dl=0
Corresponding audio file:
https://www.dropbox.com/s/my9tx891wl1x8hg/weigend_ischool2014_3.mp3
http://weigend.com/files/teaching/ischool/2014/recordings/weigend_ischool2014_3.mp3
Corresponding video file:
http://youtu.be/Oa264BdgANU
Containing folder of the whole series:
http://weigend.com/files/teaching/ischool/2014/recordings/
http://weigend.com/files/teaching/ischool/2014/

Andreas: Welcome to the Social Data Revolution, Fall 2014, at the University of California at Berkeley. It is an extreme pleasure for me to be able to introduce a friend of mine, Max Levchin. When I was thinking about what could we do as an application for the social graph, I thought about a number of different verticals. Where would it make a difference to understand what's happening on the edges, not just in the nodes of the network. Where does identity matter? Identity being defined by who you're connected with, not only what your attributes are. And I think when you start thinking about this you realize finance, payments is one of the key pillars of any economy. In the olden days it was easy because when we knew who a person was we could go after the relatives, break their kneecaps or whatever the curse was if somebody didn't pay back money we lent to them.
Then as we saw in the last class, privacy existed before they had Google Glass, before Facebook and the like. That made it quite hard and made companies like Fair Isaac come up where based on whether you paid your electricity bill five years ago your credit rating was determined. Now we say we live in the illusion of privacy. Another question is what do we do with it? Without wasting much time, I want to introduce Max Levchin. If you'd take a few minutes to tell us what you're doing now, and maybe after five minutes or so we'll have a conversation, and we'll invite you to chime in.

Max: Great. Thanks for having me. For those of you who are not familiar with my checkered past, I had a bit of a hand recently in the company called PayPal. It's definitely squarely in my past. It seems to somehow be good at not escaping being mentioned, but it's relevant in this conversation because to be self-critical, it's a veneer on top of an existing system. If you look at PayPal at a systems level, you can see that beneath it there's this notion of credit, credit worthiness, credit issuance through the mechanics of the global financial environment, basically entirely unchanged. Above it, when we came to the scene, there was sort of the physical terminals and card slicing through a dirty overly handled machine at a grocery store. We supposedly cleaned all that up and made it beautiful and web-based, but if you look underneath it hasn't changed at all. In fact, it's actually very interesting if old and decrepit (indiscernible) and so I came back after 15-ish years' absence to try and fix that space. What happens underneath the plastic, which by the way a credit card is basically a very nice user experience, it's a great API to your money. If you want to spend money, you may have it, you may not have it, it's a time-shifting device. Maybe you don't have the money now but you'll have it later, so it's a fantastic way of managing cash and cash flow.
What it's built on is this idea of pricing risk. It's an extraordinarily poorly built model for pricing risk because if you look at the credit card systems you literally go through an application. It's an onerous application, you feel you've been frisked if you've ever applied for an in-store credit card. It's a particularly nasty model. You come up to the checkout person at the Gap and they say how would you like to save 10%. You say that sounds great. Great, you should open a store credit card. Come with us to the back room. What's going to happen to me there? You fill out an application. Hang on, we need to contact your credit bureau. It progressively gets more awful. Then you know by the end of it that somewhere in the back of your mind your mom told you never apply for an in-store credit card because you're going to get saddled with a horrible APR. You know you're being screwed.

A brief detour into a good example of how screwed up consumer finance is, and why it's worthwhile fixing it, so first of all does anybody know what APR stands for? It's on the back of your credit card statement. What does APR stand for? I asked the same question to a room full of hedge fund managers, most of whom buy and sell consumer debt, which is what APRs are the metric for. No one knew. I asked the same thing, at a luncheon -- many of them had their wives present, and all of them made the stupid joke that my wife surely knows, she shops all the time. I know it has to do with a credit card. But none of them knew either. It's fascinating that people in charge of our financial system have no idea how it works. APR stands for either annual percentage rate or average percentage rate, depending on who you ask. The most fascinating thing about the APR is no one knows how to calculate it.
It's the rate you pay for the right to use credit. The math for it is so complicated. There's no closed form. So if you ever try to figure out the true APR you pay, it's basically impossible to compute. Part of the reason is the government controls the calculation to such a degree that they dictate it in legalese, and it has to be translated by data scientists to math and it's essentially impossible. There are two official calculations the government stands behind. They result in slightly different numbers. One I cannot understand even though I've been cooking in this industry for a good portion of 20 years. The second I can understand, but it's so complex there is no typeset version of this. If you search for it hard enough you'll find a w.gov (ph.) site that will give you the APR calculation, and you can see that it's handwritten. Someone had to write down the numerical method to solve for the APR. This gives you a glimpse into how messed up the whole thing is.

Now that I've sufficiently painted the picture of the decrepitude of the industry, what happens is a data science problem. You go in, or virtually go into a bank, and you say I'd like to borrow some money to buy this house. The process for that is relatively intense. By the time you're done applying for that kind of a loan, you have basically been fully frisked. Just shy of a personal reference from your middle-school teacher, how much money do you make, how much debt do you have, what the house is worth. Somebody's done an independent assessment of what the house will probably be worth. It's adjusted for a 30-year typical outlook on the economy. It's a very detailed analysis. The rate you pay for that kind of a loan today is somewhere between 3 and 5%.
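Editor's sketch, not part of the talk: the remark that the APR has "no closed form" can be illustrated with a simplified case. For a loan repaid in equal installments, the periodic rate that makes the present value of the payments equal to the amount borrowed generally has to be found numerically. The code below does that with bisection. The function names, the equal-payment assumption, and the convention of annualizing by multiplying the periodic rate by the number of periods per year are illustrative assumptions; this is not the regulatory calculation Max refers to.

```python
def present_value(payment, periodic_rate, n_payments):
    """Present value of n equal payments, discounted at periodic_rate."""
    if periodic_rate == 0:
        return payment * n_payments
    return payment * (1 - (1 + periodic_rate) ** -n_payments) / periodic_rate


def solve_apr(principal, payment, n_payments, periods_per_year=12, tol=1e-12):
    """Find the nominal annual rate implied by a fixed-payment loan.

    Assumption: equal payments at the end of each period, no fees.
    The periodic rate is located by bisection, since the equation
    present_value(rate) == principal cannot be inverted algebraically
    for a general number of payments.
    """
    if payment * n_payments < principal:
        raise ValueError("payments do not even cover the principal")
    lo, hi = 0.0, 1.0  # bracket: 0% to 100% per period
    for _ in range(200):
        mid = (lo + hi) / 2
        if present_value(payment, mid, n_payments) > principal:
            lo = mid  # payments are worth too much: the rate must be higher
        else:
            hi = mid
        if hi - lo < tol:
            break
    return periods_per_year * (lo + hi) / 2


# A hypothetical 2,000-dollar loan repaid in 12 equal monthly payments.
monthly_payment = 2000 * 0.01 / (1 - 1.01 ** -12)  # built from 1% per month
print(round(solve_apr(2000, monthly_payment, 12), 6))
```

Bisection is used here only because it is easy to verify by hand; any root finder would serve, and real loans with fees or irregular payment schedules need a more general cash-flow equation.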
The reason it's so inexpensive compared to your 29% APR for your credit card for your age and income category is because it's a secured loan. There's a house to be repossessed if you don't make your payments. The cost of underwriting per mortgage loan is about 5,000 dollars. It's pretty outrageous but if you're buying a multi-million-dollar house in San Francisco, that's 2% of the overall price. 6% goes to the agent and broker, so peanuts. You can't really do the same -- this process is completely manual, by the way. This is literally a file created. Terminology in the lending industry is fascinating because it all goes right back to you go into the corner store and say I don't have my wallet with me, Bob, and Bob behind the counter says I'll get you later, sort of pay me off at the end of the month. A lot of that carried right through into today's lending, even though the file is electronic, supposedly. Mortgage files are not. A lot are still required by the government to be paper, which is fascinating.

If you try to borrow money to buy a mattress, that's at most 3,000 dollars, so doing the full underwriting process that would cost 5,000 dollars is inherently unprofitable. If you're trying to apply for a credit card, the cost of underwriting is actually quite high. But the time to underwrite is once. At that time you have a revolving credit account known as a credit card, and you are essentially ushered by the industry to be forever in debt. You're basically going to pay off the 20% from now on. I'm going as fast as I can through the background stuff because it's not relevant to the science, but it gives you a good context, a sense of how many opportunities there are to revolutionize this thing, and also how messed up things are.

There are two fundamental problems, and the second one is split into two when you're doing underwriting. Underwriting is a data problem.
If I knew everything there is to know about you, I could probably lend you money at a fixed rate, for a very low amount of money because I know where you live and I know who you are, and I could get a sense for exactly how likely you are to repay. But I don't, so I'm pricing risk. I'm trying to figure out how likely you are to say I wasn't really borrowing money with the intent to pay you; I was going to run away. That's known as the fraud mitigation or pretreatment. Somebody comes in, we have no idea who that person is. We can ask any question we like. We can look at their credit bureau data, assuming they're not lying about their Social Security number. Assuming they haven't purchased a stolen identity. All these layers of possibilities which we can cover in a bit, but the goal is to figure out are you who you say you are. So there's an identity resolution which presumably you've talked about when you talked about nodes.

Then there is the fraud determination or fraud-prediction problem, which is squarely an edge social-graph problem, because if I knew who that person transacted with, their friends, who they associate with, what else they've done in their past, I could pretty quickly pin it down that even though they say they're in Berkeley, California, they're really in the middle of Bulgaria, and they spend most of their time online going through places called carterplanet.ru. They're probably not who they say they are. That activity, social and anti, could give me a sense for how likely that person is to be fraudulent. It's not a solved problem but it's a containable problem. I've solved this problem at PayPal with largely the same team that's solving this problem with me again at Affirm. It's a lot of fun.
There's a thousand (indiscernible) stories, some of which are still too early to tell. But at some point I'll publish my memoir about chasing bad guys out of various financial transactions.

The more interesting problem and more open-ended one is now I know you're not a bad guy, not a fraudster. Once I know you're okay and really do mean to borrow money with the intent of repaying it, including the finance charge that comes with it, there are three possibilities. There's a first possibility of you're a responsible person, going to borrow money, going to pay it back. That's a profitable transaction but it's not really exciting for the purposes of this conversation.

The second possibility is you mean to pay it back but you actually are not able to. You're borrowing for a 2,000-dollar mattress, even though the best you can afford is 500 dollars. You're not doing this out of malice. You're just foolish, or irresponsible, or you think you're going to get that big raise, or you convince yourself you will. There's a million reasons why, but for whatever reason, you're applying for too much money in the form of debt. Figuring out where you fit into that bucket, how to solve that problem is a really interesting question. And that's basically what credit underwriting is about. Once we've gotten rid of the fraudsters, you're now squarely in the territory of the credit underwriting.

The other side of this is almost an unsolvable problem of you're a good person, you mean to borrow money. You should be borrowing about this much money for this purchase. Debt gets a lot of bad rap because we've been told by a lot of late-night TV commercials that you should get rid of debt and get to zero. I don't subscribe to that. Debt is a good thing because it allows you to borrow against your future self, or to invest in your more intelligent self. One way to prove to yourself in a thought experiment that it's a good thing is your student loans.
If you couldn't borrow money for your education you would forever be stuck in your income brackets, and you wouldn't be able to get further out. Similarly true for your purposes of procreation, which is our number-one job as humans. If you can't get that nicer apartment, a better mattress, nicer set of dishes, you might not find your perfect mate, and your genetic material will be wasted. The idea of borrowing money responsibly is a great concept. But it's very easy to go astray because there are lots of people who will push you into slightly more debt than you can possibly handle, and the whole system is designed to take full advantage of that and never let you out.

The third group is a fascinating group where you borrow money. You have full intent of repaying it, and in fact we can ascertain at whatever cost that you are capable of paying it back, except something terrible happens to you. The classic bad things to good people, you're walking down the street, and a car hits you. Hopefully it doesn't happen to anyone in this room at all, but it happens with unending regularity to lots of people. Sometimes, they say I'm going to be so straight laced, so convicted in my desire to be good to society I'm going to pay off my debt, even if it means I'm going to have to sell my possessions and put myself into financial strain more than I must. For most people it means I tried my best. I tried, and I now have to declare bankruptcy, or something in between, where I'm just not going to pay for a while. This is basically considered today by the underwriting industry as an unpredictable data problem. You can't predict someone is going to get hit by a car. How can you predict that?

What I do today in my company Affirm is we do point-of-sale credit.
We try to create a counterweight to the evil Macy's, Gap credit card, where you go into the back of the store, get frisked, get a terrible rate, and hate yourself for it, but you saved 10%. What we do is point-of-sale financing, which we make completely transparent: consumers understand exactly how much money they're borrowing and what the cost will be. It's non-revolving, and it's a one-time product. You essentially take out the kind of loan that looks more like a mortgage loan, where it's fairly inexpensive. It's straight-line, no complexity to understand it. The end of it, you're done. You can't spend money again on something else. But the timing to do it and actual cost of the underwriting process is about as low as it possibly can be. We approve a loan in less than half a second. I will stop making a commercial out of my talk here, but that's what we do. That's why this is interesting to me as much to tell you, as I go back to my office to keep working on it.

We solve three problems, among other problems for scalability, marketing, product development, etc. We solve three data problems. Number one, are you a fraudster, going to steal money from me? Andreas will guide me on how much to tell you versus how much to let you participate in this. To me, these are fascinating intellectual problems to contemplate. The first thing you do is I wonder what correlates to badness, and I'll give you one classic example I'm fond of: people who stole identities tend to not care about those identities much. They tend to not capitalize their last names. There's a slight uptick in probability of being bad. It's all probability. I don't capitalize my first name a lot of times, because my iPhone has a screwed up -- like everybody else's iPhone -- autocorrect, and at some point or other I spelled my name with a lowercase it learned, and now I'm max with a lowercase m. To my very simple fraud system it looks like Max is not Max, he's someone else.
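Editor's sketch, not part of the talk: the point that a single cue such as an uncapitalized last name is only "a slight uptick in probability" can be made concrete with a log-odds update. Each weak signal multiplies the odds of fraud by a likelihood ratio, so one cue nudges the estimate without deciding it. The prior, the likelihood ratios, and the function names below are hypothetical, and treating signals as independent is a simplifying assumption, not a description of any production fraud model.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))


def sigmoid(x):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


def update_fraud_probability(prior, likelihood_ratios):
    """Combine a prior with independent weak signals.

    Each likelihood ratio is P(signal | fraud) / P(signal | not fraud);
    a value above 1 raises the estimate and a value below 1 lowers it.
    """
    log_odds = logit(prior)
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return sigmoid(log_odds)


# Hypothetical numbers: a 1% base rate, and an uncapitalized last name
# assumed to be 1.5 times as common among fraudulent signups.
one_signal = update_fraud_probability(0.01, [1.5])

# A second, reassuring signal (ratio below 1) pulls the estimate back down.
two_signals = update_fraud_probability(0.01, [1.5, 0.5])
print(one_signal, two_signals)
```

With these made-up numbers the lowercase name moves the estimate from 1% to about 1.5%, which is why a legitimate user who types "max" is not condemned by that one variable.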
Fortunately we have lots of other variables. There are lots of really cool heuristics, but ultimately, it has to be modeled. We'll cover that in a second.

The more interesting problem to me is how do you identify the amount of money you're willing to risk with someone? This is probably the coolest sort of think-about-this-in-real-time-and-come-up-with-ideas type problem I can throw you because it naturally lends itself to the question of could you use something other than FICO. FICO stands for Fair Isaac Corporation, which is actually out here in the Bay Area, invented the notion of credit scoring mathematically 30-ish years ago. The FICO score became a standard. Basically 99% of American banks were in full use of FICO by about 1989. It's a remarkably simple model. There was no data available about American humans, compared to what we know today, 20-odd years ago when this really happened. If I remember correctly, FICO 8 takes into account maybe 20 different classes of data, but the original FICO score was literally an 8-variable model. If you wanted to describe FICO in a simple set of terms, it's essentially a bunch of little things and your debt-to-income ratio, how much money have you borrowed, and how much money are you earning to pay that off. That's about as trivial as it gets.

That turns out to be quite effective. You could declare you're going to strike FICO down, do away with it, and you'd probably find yourself climbing up a tree that's way too tall. In fact, the amount of money you borrowed and how much money you're earning on any given day is a pretty good starting point. The problem is FICO updates every quarter, at most. And most times, it updates every half-year. We live in a world that is way more dynamic.
When you graduate college, get a new job, go through a job interview, buy a car, sell a car, change roommates, get a DUI, these things don't wind up on your credit report, and therefore, they don't get figured into FICO quickly enough. There's a lot of opportunity to do what's known as "front-running" FICO. You can get in front of it, calculate a better score. Lots of companies, Affirm included, try to do better than FICO by considering other interesting kinds of data. This is to figure out how much money can be lent to you or what kind of risk can be taken. Incidentally, FICO score is one of the best correlants to your health cost, health risk, so your financial measurement reappears in your health insurance. It's an interesting commentary on society, where we're such a financial species here -- maybe it's the other way around; our health is more important to us than our money. I'll let you decide which way it works. It's a pretty interesting detail.

Finally, the third question we try to answer, although this is really a big open-ended problem -- I don't have a solution for this, so if bad things happen to good people, what do you do? How do you predict that? The nefarious-sounding answer is there's a reward. For example if I said let me plant a camera on your forehead facing in. I want to film every moment of your life, know everything you eat, breathe, sneeze, dream. I want to know all there is to know about you. I will be the NSA in your forehead. I could price the risk in your life, presumably very precisely, especially if I have lots of views. At that point if you want to borrow money I could say I know what you'll do tomorrow morning, the same thing you'll do every morning. I know the standard risk of getting hit by a car doesn't apply to you. You live on campus.
There are no cars allowed. That's a lower rate. You eat too much red meat -- but you don’t, because you're a vegan, so that should be a lower rate too. You can imagine the connection between the FICO score for finance and FICO score for health. Your safety as a human being correlates very nicely to your safety as a health risk. So this level of scrutiny should bring someone's cost of debt very low, except the cost of debt for most people isn't that high in the first place. The impracticality of a camera in your forehead prevents all of us from being scrutinized from some sort of private NSA. But what can be done to lower the cost?

There are people -- back to the insurance topic for a second -- one of the things that's fascinating about finance and large risk-mitigation systems is that you ultimately always have the best least-risky people pay for the worst performers. The guys that say I'm going to borrow money, sure I'm going to pay it back; I'm going to keep this job I don't really have. The ill-intentioned Eastern European -- I can only say this because I'm Eastern European myself -- those bad guys are thieves. Factoring those guys out, there are a bunch of people that are irresponsible and know it and don't care. People that by hook or by crook make their monthly payments, no matter what interest included, that interest pays for the losses the first irresponsible category causes.

If that doesn't seem like a moral problem to you, translate this to health. It's the can't be denied based on prior condition clause which is now a very controversial topic of conversation. The reason it's so important is because all of you young, healthy, barely legal American adults need to pay for the 65-plus contingent that's just about to deal with their various colorectal cancers. It sounds awful but that's how society generally works. You have the highest-risk people covered by the low-risk people. It's a huge problem in healthcare.
I have my own set of opinions of how to address that. In finance it's a bit easier, and we're not talking about life and death and the moral hazard is a bit more contained. But it's still a moral hazard. You have people that are basically saying but I'm living hand to mouth, being gouged by payday lenders and loan sharks, and I really want a break. Please give me a loan. I don't want to buy a nicer house. I just want to go to school. Those people can't get loans, and yet people that are excellent risks, especially after 2008 it's really tilted heavily in favor of excellent risks, those people are being pushed money, given credit cards that pay incredible rewards and very low rates at the same time. They don't really need any of those resources. The moral hazard there is somewhat less blatant but still there.

Andreas: Have a seat and we'll have a chat where first I want to bring out some topics that were latent in here, and we'll open up to questions. You know we have the whiteboard in the background which is bit.ly/ischool2014whiteboard, so if you have questions, I'm looking at this.

I loved your remark that when you started PayPal there basically was no data compared to the world we live in now. How was that? Did it feel like a world without data or did it feel like wow, we had all those data compared to the other people?

Max: I think only in retrospect you realize how dire or low the information density was. One of the interesting things you'll learn -- in the case of Affirm we do lending which means we take direct risk. In the case of PayPal we take indirect risk. If Affirm screws up a 100-dollar loan we're on the hook for 100 dollars.
If PayPal screws up a 100-dollar transaction, there's typically somewhat less money at risk, but it's a similar risk profile in many ways. What happens in that world which is really scary, but to me also exciting and real compared to perhaps photo sharing startups; you only learn one way. You can get data from other sources, but ultimately you have to lose money to get the truth set of what it feels like to lose money and the correlating factors to the losses.

In many ways, at PayPal, we went through this period of where we were basically going holy crap, we're losing 10 million dollars a month. And that was terrifying. We raised 255 million dollars, all in, before IPO. And a large fraction of that was basically paid out to Eastern European fraudsters as a form of tuition payment, where money went through our hands, into theirs. We recovered some of it in time, but the data we gathered from it is priceless. PayPal is going to get spun off finally, and probably going to be worth 30 to 40 billion dollars. That valuation largely rests on the data gathered.

By data, as you visualize this, think of these giant vectors of this user signup at this time, here are the keystroke timings between their signups. Here is the IP that came in, here's the IP they tried their first time. Here's who they transacted with. One of the best-known narratives from the PayPal history is this product called Igor that we built. It was one of the first network visualization tools. It was the precursor, both intellectually and practically, because the people went on to start this company called Palantir, which is a very large network visualization tool and then some these days. It was the ability to say we know that this dot is suspicious, and this node is suspicious. What do the edges look like? Because we had no models, and just barely had enough data, instead of building models we said let's visualize the data.
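Editor's sketch, not part of the talk: the encoding Max describes for Igor at this point in the conversation (edge thickness for transaction size, edge color for how far back the transaction happened) can be expressed as a small mapping from transactions to drawing attributes. Nothing below reflects how Igor was actually built; the `Transaction` record, the width range, the gray scale, and the one-year fade window are all hypothetical choices.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Transaction:
    sender: str
    receiver: str
    amount: float
    day: date


def edge_style(tx, today, max_amount, max_age_days=365):
    """Map one transaction to drawing attributes for its edge.

    Width grows with amount (1.0 to 10.0); color fades from black
    toward light gray as the transaction gets older.
    """
    size = min(tx.amount / max_amount, 1.0)
    width = 1.0 + 9.0 * size
    age_days = max((today - tx.day).days, 0)
    fade = min(age_days / max_age_days, 1.0)
    gray = round(200 * fade)
    return {"width": width, "color": f"#{gray:02x}{gray:02x}{gray:02x}"}


def edges_touching(transactions, suspicious):
    """Keep only transactions with a suspicious account at either end."""
    return [
        tx for tx in transactions
        if tx.sender in suspicious or tx.receiver in suspicious
    ]


today = date(2014, 10, 7)
history = [
    Transaction("alice", "bob", 500.0, today),
    Transaction("bob", "mallory", 2000.0, date(2013, 10, 7)),
]
for tx in edges_touching(history, {"mallory"}):
    print(tx.sender, "->", tx.receiver, edge_style(tx, today, max_amount=1000.0))
```

The resulting attributes could be handed to any graph-drawing tool; the point of the sketch is only that two visual channels let a human reviewer see size and recency at a glance.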
Let's have thickness of the edge be the transaction size, and the color be how far back it happens. These were very primitive tools. They improved our productivity by a factor of 100. We would reduce the number of minutes necessary to make a decision whether something was bad or good, by a human operator, let alone a computer, by two orders of magnitude.

The data when we started was nil. Today you can download a tremendous amount of supporting information, so when you're coming in you don't have to rely strictly on what you've seen and the money you've lost. But reality is as you're building your models you still need a truth set, and a truth set isn't about data, data, data, identity map. It's data, data, data, identity map, your own data, loss or not loss. And that is the tuition you pay. Any financial services company ultimately winds up being built through the "fortitude" of the people willing to lose tons of money as they're learning what kind of users they have. In that sense, it doesn't feel that different.

Having said that, today we can look at literally thousands of additional columns of data, compared to PayPal. I remember having to explain to our board what it is that I do every day with all this data. You keep on buying more hard drives and you need solid-state hard drives, and this is incredibly expensive. What do you do? I said do you have access to a database? If I give you an email address you would give me a yes or no answer of good person, bad person? They'd say of course not it doesn't do so. Exactly -- that's what I'm building here. So today such databases exist. In fact, I saw a startup yesterday that literally said if you give me an email address I'll give you a yes or no, good or bad. Where were you 20 years ago when I needed one of those things?
The startup didn't actually have the technology they thought they did, but it sounded really good.

Andreas: Shall we probe the students a bit and see what kind of data sources do you think a company like Affirm could use to make better decisions about the conditions under which you give money to somebody? Shout out things Max might not have heard about.

Student: Insurance company statistics.

Andreas: He probably has heard about those.

Max: One of the interesting things about -- it's a good thing to frame your thinking. There's an incredible amount of regulation in lending in particular, and in insurance in general, but lending is kind of an insurance system, where the government says if you can correlate your decision to lend or not lend to a single prohibited variable you can't use it. It’s not okay to be racist or sexist or whatever. All the obvious things, yes, you can't really decide girls don't get to borrow money or boys don't get to borrow money. Even if you have something that you've cleverly concealed and said do you have really long hair or not long hair, if that correlates sufficiently well to being a boy or girl, even if it's not a complete correlation, even if there's overlap, even if you make a claim that it's not really the same way, it doesn't matter. A government regulator will eventually catch up to you and say that looks pretty much like you're discriminating against women. So you're shut down.

Insurance company statistics would be a fascinating thing to mine, except it is basically a minefield of prohibited information. As much as someone like me would like to get their hands on a lot of that data, only a fraction of it is usable. The other part is there are a ton of privacy laws, and with my general libertarian leanings I find it annoying to constantly research what is okay, what is not okay. Privacy law is typically written by people that care for the right reasons.
Following that is difficult but a good thing. Having said that, that makes underwriting even more difficult than it already is.

************************

I had a random idea. I'll plant something into your minds and you can think about this, another think-on-the-fly question. Think of a game, a game mechanic or design you could use to measure someone's responsibility, credit worthiness, response to being in debt and strained financially.

One thing that you brought up which has come up in conversations before is very controversial but as far as I'm concerned is a super-cool idea is a benign self-inflicted spyware. You basically say I'm going to install this piece of spyware on my computer. And you can monitor -- I understand exactly how it works. This is not like capture my keystrokes and get my password, but watch my response times, watch what I type to some extent, but no more than I'm okay with privacy-wise. Let that be what speaks about my digital persona, which is roughly what you're talking about. Right now, there are very few ways to estimate that except for how quickly you respond to public tweets, or how quickly your friends see your email responses. Which is interesting, because if you do this you should probably be thinking in terms of not how you behave, because you're motivated to fake things and tweak the spyware. But use you to observe your friends. Therein lies a whole vein of interesting ideas.

************************

The data source question isn't about what rule would you create. In general in underwriting, it is never the question of -- never is a little strong -- when you talk to an old-school banker, it frequently literally is you look like a trustworthy, white, fat, balding man like me. Let's shake hands, and that's a loan.
This is true 1950s-style lending, and it is still practiced in a lot of parts of the country. But for a reasonable 21st-century model it's more about I don't care what the data is. All I care about is feeding it into a model. If I could capture your responsiveness to emails or your ability to capitalize things correctly, it could be an influencer. By that I mean it could influence no more than 0.5% of the outcome of the model. I wouldn't presume to know whether your responsiveness at a certain time of the day to a certain person would make the outcome more likely or less likely -- whether I hang up my board call to pick up the call from my child or not may say something about my responsiveness. But the same behavior under different correspondence might mean something entirely different.

Andreas: I think there are many dimensions to the gaming. For instance, if you're in an online game and just did really well, will that mean that you might now be spending money you shouldn't be spending because you feel you have a good streak? Or vice versa, you were doing poorly. Does it mean you're ready to double down? I think what Max is saying is let's look at many variables. Let's not judge in which way they will enter the model. That's what machine learning is very good at. The human creativity is to come up with other ideas, to come up with other variables. I think even typing, we all have good typing days or hours and some miserable ones. I think if we know how our motor coordination is, maybe that tells us how good the person is. I actually heard that Goldman Sachs has as one of the key variables on whether they're going to hire someone or not how fast a person can type. That's interesting on the granularity. Then there's of course also time scales.
Their schedules used to be packed, but for some reason now, it's pretty empty. Does it mean that they are on their route at work? Are they about to get fired? Or the use of the dinners, one after another -- now it's just one person, so schedule. Geolocation, are they walking faster today than they usually walk? Are they depressed? How long does it take them to answer the phone? Or on Facebook, are they the ones who initiate conversations? Are they the ones who organize events? Are they the ones who say maybe I'll go, but never show up? I never worry about what's prohibited. I'm always about the possible, not the permissible.

Max: That's the difference between academia and industry. I get sued and go to jail. You get to be known as the rebel professor.

************************

One interesting topic is that a huge percent of people, the first couple of days of January, want to lose weight. Tracking weight as it correlates to responsibility is a really interesting topic. I've long been fascinated -- I don't know what the answer is, but I'd love to know if there's a good way of charting that or predicting that.

Andreas: Now it's time for me to ask Max our question. Max, if you had all the data of the world readily available at your fingertips, and no worries about getting put in jail for using them, what would you do with it?

Max: The problem is I'm a data scientist, so I couldn't possibly stop with one thing. I have a stock answer to a specific kind of data I want, but I think it's a good cause and actually doable. The data I want, which is a horrible, privacy-ridden thing that would be pretty much impossible to get: I want every tumor, cancer, suspicion, CAT scan and X-ray with an outcome. I want to look at every single X-ray ever taken, every single suspicious-looking tissue examined through a radiological photography, and the subsequent biopsy result.
Then I want to set up a service where I would have professional radiologists read those slides, and have 10 or 20 read the same slide and decide whether that's a tumor or not. Then I would correlate that to my truth set, and figure out who is a really good radiologist. Concurrently, I would run a game that says pretend to be a radiologist. No one with a medical degree allowed. I'll show you a picture. I don't care if you have any idea what it is. All you need to do is decide whether it's a tumor or not. At some point, having a large enough data set, I will have a sufficient number of people that with zero education would be able to spot tumors at least as well as a highly educated radiologist. At which point, I would destroy the entire radiological industry because there would be no need for them. I would have a computer assist humans that are playing a video game, that probably don't even know that they're diagnosing tumors and doing well. Then I'm going to put those people out of work too, because once I know that this is (indiscernible) by people that don't know what the hell they're doing, I'm going to have a computer model pretend that they're people. At any given time I will be able to say I feel like I have the sniffles, it could be a tumor, get under the X-ray, throw it into my giant machine-learning model, and find out it is not a tumor. That's what I would do. I think it would be almost a worthwhile cause.

Andreas: Reminds me of a story I read when I was at Amazon of an illegal immigrant in New Mexico. He lived in a trailer with his parents, but he had a computer and modem. He posted on groups for legal advice and had some pseudonym. It was one of those groups where the group votes up good answers. Over time, that person got a higher and higher rank.
At some stage he was a top-ten lawyer in terms of legal advice. You want to know whether you should beat your wife, he knew the answer; whether you should pay that bill, he knew the answer. Then he came out and said by the way, I'm just a kid. I don't even have a high school degree. I live in a trailer in New Mexico. The whole legal profession went after him and said how is it possible that a person without a legal degree actually gives advice.

Max: Sounds very similar. That was basically a machine-learning model. In the same vein, to throw back to the points I made earlier, how would you estimate someone's credit worthiness? One way to do it, which I haven't done but I encourage you all -- those of you that write code on the side for fun or profit, here's a project I wish I had time to do. Surely, your friends know you best. The obvious way of getting rid of the FICO score, much like this person in New Mexico or my semi-crazy radiologists without humans, is to have a game that says ask your friends to guess your FICO score. Post on Facebook; it takes a bit more than posting, but you could probably build a Facebook app using their API, maybe a mobile app. I will pay any number of my friends a penny to guess what my FICO score is. FICO score goes from 300 to 800, where 800 is unachievably high, and 300 is basically several bankruptcies in a row. If you're young and have one credit card, and are paying between 19 and 25% APR, you're probably at 720, and it floats by 10 to 12 points every time you apply for a loan and goes up with payments, every quarter. That's like a baseline. It doesn't take a lot to figure out. Once you have 100 data points, you know the story about guessing the weight of the bull. You should be able to have your friends estimate your FICO score down to a single point of precision. It's an incredibly coarse system. In fact, it doesn't really matter if your FICO is 720 or 722. It basically goes in 25-point increments.
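The friends-guess-your-score game is a form of crowd aggregation, and a minimal sketch may help make it concrete. Everything named here is an assumption for illustration: the function `crowd_estimate`, the choice of the median as the aggregate, and the sample guesses are hypothetical and not something described as built by anyone in the talk. The 300 to 800 range and the 25-point buckets are taken from Max's remarks as spoken.

```python
from statistics import median


def crowd_estimate(guesses, low=300, high=800, bucket=25):
    """Aggregate friends' guesses into one score estimate.

    Assumptions (not from the talk): out-of-range guesses are dropped,
    and the median is used because it resists a few wild guesses.
    """
    valid = [g for g in guesses if low <= g <= high]
    if not valid:
        raise ValueError("no guesses inside the allowed range")
    point = median(valid)
    # Snap to the coarse increments Max mentions, since a couple of
    # points either way supposedly does not matter.
    coarse = bucket * round(point / bucket)
    return point, coarse


# Hypothetical guesses from six friends; one is clearly a joke answer.
friend_guesses = [700, 710, 720, 730, 740, 9999]
point, coarse = crowd_estimate(friend_guesses)
print(point, coarse)
```

With more guesses the median should settle down, which is the intuition behind the weight-guessing story; whether it would track a real score is an open question the sketch does not answer.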
We can dispense with the entire hegemony of the three government-sponsored monopolies known as the credit bureaus and the Fair Isaac Corporation that props them up, even though they all hate each other, simply by having your friends answer that question. That's one game to set up. I think there's probably a dozen games like this you can think of.

Andreas: What is the feedback function, the gain the friends have by getting it right versus getting it wrong?

Max: The friends should bet money. This is more of a startup idea, but you could probably imagine 100 friends pooling together ten bucks each, and saying I bet Andreas is so responsible, I'm going to give him ten dollars. I'm going to guess his FICO is 800, perfect. All I want back is $10.01. Is that okay? You say yeah, that sounds good to me, if you vouch for me and say I'm so good. I will give you back $10.01. The other guy says Andreas is a shifty-eyed, strange dude, German accent. If I give him ten dollars I want twelve back, because the risk is too high. At some point you have a loan sourced by people you know. You're not going to screw them. They're your Facebook friends. They're going to start tracking you online. The risk is quite low. And they would have established the ceiling of your permanent credit record that is entirely public and out in the open. That is the real way to disrupt the entire credit system and create an entirely level lending mechanism. Because it's not regulated and (indiscernible) you would also not be subject to any real restrictions. And so the government could say you can't lend money to this guy with interest, because you're not regulated. But that happens all the time anyway. Andreas comes to my dorm room and says I need ten dollars.
Great, I'll give you ten but you need to bring me ten and a can of beer. That's essentially interest, but how can you charge that (indiscernible)? If you went to subversify -- get your friends to put up your rating and a number of dollars they'd be willing to put behind that.

Andreas: If I then default, then it is much more embarrassing than if I default against a bank. And in the unlikely case that I only pretend to be defaulting, I claim I lost something I didn't really lose, think about the effect that has on the system. I would not tell Max that I lost the money if I didn't, because my reputation is more important than that. Whereas with some anonymous insurance company or credit card somewhere, that is a different story. This veneer of anonymity in many cases makes claims probably an order of magnitude more likely.

Max: Identity resolution in that case is critical. In fact, Facebook is practically one of the best platforms to set this up because they insist on real names. One interesting factoid on the credit industry: one of the places where credit cards took a very long time to get adoption is Japan. Until as recently as ten years ago, you could go into a store in Japan where you didn't know the shopkeeper, and they didn't know you, and you were in a random town, and leave your name card instead of a credit card. They would essentially give you a store loan because your name was so important to you. Japan is I believe the second oldest monarchy, so it's one of the oldest civilizations in order. The reason that's relevant is their last name matters. You could say my last name is Fujiwara and that meant something to a lot of people many generations back. So the idea of leaving your name card essentially said I'm staking my good name on this loan.
If I don't pay you back, my name and my ancestors will be embarrassed. As a result, the need for the credit card, the faceless loan company that says sure, you might default but I'll spread the risk across many of you, was far less relevant. In that sense, we can see the reemergence of the on-your-good-name sir or ma'am relationship of the corner store.

Andreas: We have a few minutes left. What other topics do you have that you would like to ask one of the most thoughtful data scientists on this planet?

Student: (Indiscernible) I would like to know do you get data from (indiscernible).

Max: I'm not sure I fully understand the question.

Student: (Indiscernible).

Andreas: Just to be clear, Bitcoin. I want to talk about Bitcoin, but I haven't talked about it yet. Since not everybody is involved with Bitcoin, let's talk for a few minutes about what Bitcoin is, what trust means in Bitcoin.

Max: Bitcoin is probably the single best applied cryptography result in the last ten years. Before I was a data scientist, I was a cryptographer, and those are basically the only two ways to make money if you do number theory, which is what I studied. I hit them both. In applied cryptography, it's very rare that you see a really good result, and the application of a Merkle tree, which is what Bitcoin is based on, is a fascinating piece of math that for the first time has seen a practical use. It means you have a notion of an unfalsifiable, fully distributed ledger. So Bitcoin, despite the name suggesting a currency, is actually nothing of the sort. It's a giant record, a list of records, that say I gave Andreas a promissory note. Andreas gives one of you a promissory note, and then the promissory notes are denominated in this thing called Bitcoin, but it's just a measure of value. Like any other currency, it floats with the dollar or against the dollar, against any other currency. The reason it's so fascinating is it doesn't require central authority. You don't have to trust anyone.
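To give a rough feel for why a hash-based ledger is hard to falsify, here is a simplified Merkle root computation. It is a sketch under stated assumptions, not Bitcoin's actual protocol: it hashes plain text records with a single SHA-256 pass and pairs hex digests as strings, and the record contents and the function names `sha256_hex` and `merkle_root` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(records):
    """Fold a list of text records into one root hash.

    Simplification: when a level has an odd number of hashes, the last
    one is paired with itself so every node has two children.
    """
    if not records:
        raise ValueError("need at least one record")
    level = [sha256_hex(r.encode("utf-8")) for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("ascii"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical promissory-note records in the spirit of the example.
ledger = [
    "Max -> Andreas: 1 BTC",
    "Andreas -> Student: 1 BTC",
    "Student -> Max: 1 BTC",
]
root = merkle_root(ledger)
tampered = merkle_root(["Max -> Andreas: 2 BTC"] + ledger[1:])
print(root != tampered)  # any edit to any record changes the root
```

Anyone holding the same records can recompute the root and compare it with the published one, which is the sense in which no single party needs to be trusted.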
The idea of the fully distributed ledger is brilliant because anyone in the system can verify that Andreas did in fact receive one bitcoin from me. And even if one of you says I'll never touch the thing again, I don't want to do it, there's enough of a critical mass where that record lives on. No one can destroy it. It's a permanent, essentially irreversible -- the only way to reverse the transaction is to have the transaction be undone by way of a second transaction, which is very nice because it creates all kinds of really interesting opportunities in trustworthy accounting. For example, I could send you a bitcoin. You could send it back to me. It's a net-zero system, but we have now established trust. You have now seen that I can take money and seen that I can send it back to you. You can imagine interesting applications like notary public as a whole system being eliminated, because the idea of somebody came over to my house and watched me put my finger -- that's pretty much medieval. All I really wanted to do is have somebody witness me promise that I do swear that this is true. I could also say this is true, sign it cryptographically, and put it into the block chain, which is the fancy name for the distributed ledger. And now it's true and irreversibly so. It can never be taken away. So bitcoin is super cool mathematically. It is marginally cool -- these are now my opinions. Everything I've said so far is a factual truism. Now we're in opinion land: brilliant math, incredibly volatile currency, and that's also practically true. The last six weeks have seen it go from roughly 600 dollars per bitcoin to about 300 the last I checked. I don't have too many bitcoins to my name. I had none except someone just gave me some. I used to say I have no bitcoins.
It's too volatile, but somebody just sent me a couple. So I am now in possession of bitcoins. They sent it to me precisely so I would stop saying that I have no bitcoins. It was an expensive proposition when he did send them to me (indiscernible) cheaper proposition. As an investment, at one point it was a great idea because it was going through the roof. It's now a terrible idea because it's going the opposite direction. Some people that are big believers in cryptocurrency say that's the best time to buy, as people would say in such circumstances. Those that believe it's not the best investment opportunity say oh no, I bought at the height of the market. As a commodity, it's a terrible store of value, because it floats against the dollar (indiscernible). In the long term, I'm a big believer in cryptocurrency surviving. The distributed ledger model is clearly the future. The idea of generally accepted accounting principles, which is GAAP accounting, that's clearly where the world is going to go. Doing it cryptographically versus by way of PricewaterhouseCoopers coming in to frisk through your books to say that is in fact real; that's very archaic. Somebody that can verify a couple of digital signatures is a great way to do it. All that points in the right direction. We're probably in for another ten large time periods, whatever it's denominated in, of high volatility. And it's not clear to me whether it's Bitcoin or bitcoin, as in the same network, but a different currency or different protocol that winds up winning. Is that a reasonable summary?

Andreas: Do you think the principle of the block chain, having the history embedded in the object, goes way beyond money?

Max: Yes, it's going to fundamentally modify the notion of truth.

Andreas: The notion of truth is a big one. Slightly less big but also important is the notion of trust, being distributed in bitcoin. Some thoughts on trust, particularly today since we're talking about the social graph?
Max: I think the idea of a society that is less libelous, that has more responsibility, more consequences for saying things about people or entities that are designed to be hurtful is going to be -- similar to the on-your-good-name sir, credit issuance or promissory note issuance, something like the block chain allows you to say I really mean it when I say that was a terrible experience in this restaurant. My name is on this and it's permanently embedded in the narrative of the block chain. It's a pretty trivial example, but it has a lot of consequences if you think about it. Initially it might mean people will use it frivolously, but after a while they'll understand how permanent it is. The notion of responding angrily at something that is just not worth the permanence of you putting a black mark out there, because it might in fact come back to you, is going to make for an interesting society. Having said that, I haven't thought through the society implications for bitcoin yet, because my assumption is it's going to be largely a domain of the nerds for another dozen years. The one interesting thing about bitcoin, the reason I'm bullish on it overall, is that the total market capitalization of bitcoin as it exists today is a few billion, somewhere between two and four billion dollars. I don't think it's more than ten and it's definitely more than one. The combined market value of the brain cycles of the people that are thinking very deeply about bitcoin is significantly more. There are definitely more than a few billionaires that are spending lots of time, effort, and money investing, thinking, participating, trading, collecting bitcoin. There's a direct arbitrage opportunity in the number of intelligence cycles versus pure dollars that are available to bitcoin.
For quite some time, even with the bitcoin collapse to 300 dollars, it is still a cheap buy if you're thinking about the minds of (indiscernible) thinking feverishly about what bitcoin could be. In that sense, I'm very bullish. It will come to something. It certainly will be a good buy until the two prices equal out.

Andreas: I love how you think like a physicist about comparing different things in order of magnitude. It's great. The last question is a question I know people have, and also people online have, which is what would be your advice to somebody who wants to get started doing something in the field of data science, with the goal of becoming something like a Max Levchin maybe?

Max: Is this the part where I say kids, don't be like me? Hopefully I'm not that much older than you guys, so I can still qualify for the kids status somewhat. I did tour a kindergarten this morning for my son, so I am officially in the parenting category now. There's two or three things that are painfully obvious but have to be said. Number one, if you're going to start a company and you know you're going to start a company, do it now. You have nothing to lose today. With every passing year, you're graduating from a top-notch university, in a good program, you're going to be very employable and marketable. That's a good thing but also a bad thing. With every passing year of being highly employable, you will amass trappings that are going to be painful to lose. Once you drive a nice car you don't want to be in a beater. Once you live in a nice apartment it's kind of crappy to sleep on the floor. If you're going to start a company and feel ready, do it now. It's worth it.
If you are not ready, which is a very legitimate place to be -- if you think I'm an entrepreneur, I just don't have all the skills necessary, either join a startup or get an internship. An internship is a very easy way to dip your toes in because it doesn't require you to stay very long. The only difference between an internship and full-time employment is that full-time employment implies essentially a four-year contract. An internship implies a summer-long contract. You could always extend one and shorten the other, but the defaults are very powerful. You don't feel bad when you say my internship is over, it's been 12 weeks. I loved this year. I'll come back next year, or I won't. Versus I've been here for 12 weeks as a full-time employee and I hate myself for it. I want to leave. The two have the same outcome but it sounds much better if you're an intern. An internship at a startup, a well-funded startup, that you were inspired by, that looks kind of like what you would be working on on your own, is a great place to break in. I'll tell you a tragic and not tragic story I alluded to. Igor, the visualization software that we built at PayPal to address our fraud problem, was built in a direct response -- it was summer of 2000. July of 2000 I believe we lost 13 million dollars net to primarily scalable fraud. Largely Russian and Ukrainian organized crime was -- I think they were signing up 20,000 accounts a day on PayPal. Using incredibly large -- to the tune of 20,000 to 30,000 accounts per network, they would bring in between 10 and 100 dollars from stolen credit cards. They would funnel them immediately to the next ring of accounts, again and again, kind of move them around the system programmatically. This is all done by script. Eventually they'd pull it out in 100,000 to 200,000-dollar installments over the course of a month. By the time people whose credit cards had been stolen would wake up and charge them back, the money would be out of the system.
We would be on the hook for the money. Two million dollars left the building. There's nothing we can do. Even the process of tracking down what cards were involved was impossible because you literally have 20,000 credit card numbers involved. Today it gets a lot of celebration when a Target gets breached, and a couple million credit cards get stolen, but that's been going on for a very long time. The underground marketplace where these cards get traded has been around forever. It's just now getting more publicity because it's such a widespread problem. Back then it was a problem in the payment systems that people knew and the rest of the world kind of ignored. For the right reasons, you're not really on the hook for the money. It's the merchant or merchant processor, or someone like PayPal, that ultimately has to pay the price. We were basically facing extinction. We had about six weeks of cash, and were about to die. Primarily because of this. I had an intern -- this was (indiscernible). Peter Thiel came back. He took a sabbatical basically after we merged with (indiscernible). Elon ultimately left to start his crazy (indiscernible) of space. Between now and then Peter came back to run the company. Peter and I sat down and he said I'm going to go figure out how to raise money so we can survive. You need to figure out how to stop fraud. I had a full engineering team working very hard to scale the system and deal with everything else, so I had exactly one guy that was going to work on this. His name was Bob, and he was an intern. He was probably younger than most people in this room, at Stanford. He and I brainstormed for several days, like what the hell do we do.
We have this problem, and we went to investigators who were tracking down these stolen accounts, and literally walked into this guy's room, and he was a first Iraq campaign vet. He was investigating this case, and (indiscernible), still someone I keep in touch with on occasion. He had literally thousands of pages printed out of Excel spreadsheets, tracking where the money went. He had one big binder and said "This is the 20,000 credit cards. I think I got them all." I said "Wow, so we're not going to lose 16 million dollars next month as I predicted." "No, you will, this is from three months ago." "What do you do for this month?" "I haven't gotten through two months ago yet." "Why don't you skip forward?" "How am I supposed to skip forward? All I know is when somebody reports the money is gone. That's when I start tracking this stuff." "Basically what would you do?" "What I would really like is something that shows me a network of these accounts as they get created and the nodes in the graph that are connected, because the edges contain all the information. The nodes themselves are worthless to me. I see somebody signed up, they have a credit card, indistinguishable from a normal user. People sign up, buy stuff on eBay. People sign up and steal money. I can't tell the difference. What I can tell is once they start funneling the money to one person in the center, that's the bad guy." I said why don't we build a system that visualizes this stuff. So Bob and I spent the next seven days fiercely constructing all these visualization systems, very simple as it turns out, and basic data science, trying to filter out things, trying to predict what would happen next. This was before even things like random forests were a concept. This is all very Stone Age of data science. When we finally had the first alpha we showed it to the fraud investigator squad at PayPal, and one of the women there named Michelle literally started crying. She said this will make all the difference in the world.
You basically took something that takes us like 100 hours to investigate and boiled it down to about a minute. This is something we can act on, as soon as we see it. One of the first visualizations was this notion of a Christmas tree. The top of the Christmas tree is the bad account that's withdrawing the money. And going downwards is the reverse timeline of money coming in, filtering up, being withdrawn. Once we had this Christmas tree, we trained all these analysts to look for Christmas trees, and as soon as you see one, freeze every one of those accounts. One of the crazy tools we had to build was highlight everything you see in this tree, one button, freeze all money. The next month's losses were I think two million dollars. So we basically nipped that problem almost instantaneously. So the moral of the story -- there's unfortunately a very tragic end to the story. Bob died a few days after his graduation. He had an acute case of diabetes and it ended tragically. But I'm still close with his family and established a scholarship in his name. His work was extremely important. If you ever dig through PayPal patents, most of the interesting ones have his name on them. He was really instrumental. His unfortunate and untimely death notwithstanding, the moral of the story is it's not like you have to wait until you are Max Levchin to make your mark. This kid would have been incredible had he lived longer, but even at 21, he made an incredible difference and was literally responsible for saving a company that is worth 30 billion dollars. Taking an internship at a company in something as nerdy or seemingly impractical as data science could lead to very real results.
For what it's worth, if that's going to be your startup prep, if that's what you're going to do as either your job preparation for a data science role or a startup role, that's a good one.

Andreas: Nothing much I can add to this, but let's give a round of applause to Max.
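The "Christmas tree" from Max's PayPal story, a withdrawing account at the top with money filtering up to it through rings of feeder accounts, amounts to walking the transfer graph backwards along its edges. A minimal sketch, assuming transfers are available as `(sender, receiver, amount)` tuples: the function name `christmas_tree`, the account names, and the amounts are all hypothetical, and this is not a description of how Igor was actually implemented.

```python
from collections import defaultdict, deque


def christmas_tree(transfers, top_account):
    """Collect every account that directly or indirectly fed money
    into top_account, by breadth-first search over reversed edges."""
    senders_of = defaultdict(set)
    for sender, receiver, _amount in transfers:
        senders_of[receiver].add(sender)

    seen = {top_account}
    queue = deque([top_account])
    while queue:
        account = queue.popleft()
        for sender in senders_of.get(account, ()):
            if sender not in seen:
                seen.add(sender)
                queue.append(sender)
    return seen


# Hypothetical transfers: small amounts funnel up through two middle
# accounts into "top", plus one unrelated pair of ordinary users.
transfers = [
    ("a1", "m1", 50), ("a2", "m1", 80), ("a3", "m2", 20),
    ("m1", "top", 130), ("m2", "top", 20),
    ("x", "y", 10),
]
to_freeze = christmas_tree(transfers, "top")
print(sorted(to_freeze))
```

The returned set plays the role of the "highlight everything in this tree" selection; deciding which account counts as the top of a tree is the judgment the analysts supplied, and it is left outside this sketch.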