Andreas Weigend (www.weigend.com)
Social Data Revolution (SDR), INFO 290A-03
UC Berkeley, School of Information, Fall 2014
Class3 – October 7, 2014
Andreas Weigend (www.weigend.com)
Social Data Revolution, INFO 290A-3 (http://www.ischool.berkeley.edu/courses/i290a-sdr)
UC Berkeley, School of Information, Fall 2014 (http://ischool2014.wikispaces.com)
Welcome to the Social Data Revolution, Fall 2014, at the University of California
at Berkeley. It is an extreme pleasure for me to be able to introduce a friend of
mine, Max Levchin. When I was thinking about what could we do as an
application for the social graph, I thought about a number of different verticals.
Where would it make a difference to understand what's happening on the edges,
not just in the nodes of the network.
Where does identity matter? Identity being defined by who you're connected with,
not only what your attributes are. And I think when you start thinking about this
you realize finance, payments is one of the key pillars of any economy. In the
olden days it was easy because when we knew who a person was we could go
after the relatives, break their kneecaps or whatever the curse was if somebody
didn't pay back money we lent to them.
Then as we saw in the last class, privacy existed before they had Google Glass,
before Facebook and the like. That made it quite hard and made companies like
Fair Isaac come up where based on whether you paid your electricity bill five
years ago your credit rating was determined. Now we say we live in the illusion of
Another question is what do we do with it? Without wasting much time, I want to
introduce Max Levchin. If you'd take a few minutes to tell us what you're doing
now, and maybe after five minutes or so we'll have a conversation, and we'll
invite you to chime in.
Great. Thanks for having me. For those of you who are not familiar with my
checkered past, I had a bit of a hand recently in a company called PayPal. It's
definitely squarely in my past. It seems to somehow be good at not escaping
being mentioned, but it's relevant in this conversation because to be self-critical,
it's a veneer on top of an existing system. If you look at PayPal at a systems
level, you can see that beneath it there's this notion of credit, credit worthiness,
credit issuance through the mechanics of the global financial environment,
basically entirely unchanged.
Above it, when we came to the scene, there was sort of the physical terminals
and card slicing through a dirty overly handled machine at a grocery store. We
supposedly cleaned all that up and made it beautiful and web-based, but if you
look underneath it hasn't changed at all.
In fact, it's actually very interesting if old and decrepit (indiscernible) and so I
came back after 15-ish years absence to try and fix that space. What happens
underneath the plastic, which by the way a credit card is basically a very nice
user experience, it's a great API to your money. If you want to spend money, you
may have it, you may not have it, it's a time-shifting device. Maybe you don't
have the money now but you'll have it later, so it's a fantastic way of managing
cash and cash flow.
What it's built on is this idea of pricing risk. It's an extraordinarily poorly built
model for pricing risk because if you look at the credit card systems you literally
go through an application. It's an onerous application, you feel you've been
frisked if you've ever applied for an in-store credit card. It's a particularly nasty
model. You come up to the checkout person at the Gap and they say how would
you like to save 10%. You say that sounds great. Great, you should open a store
credit card. Come with us to the back room. What's going to happen to me there?
You fill out an application. Hang on, we need to contact your credit bureau. It
progressively gets more awful.
Then you know by the end of it that somewhere in the back of your mind your
mom told you never apply for an in-store credit card because you're going to get
saddled with a horrible APR. You know you're being screwed.
A brief detour into a good example of how screwed up consumer finance is, and
why it's worthwhile fixing it, so first of all does anybody know what APR stands
for? It's on the back of your credit card statement. What does APR stand for? I
asked the same question to a room full of hedge fund managers, most of whom
buy and sell consumer debt, which is what APRs are the metric for. No one
I asked the same thing, at a luncheon -- many of them had their wives present,
and all of them made the stupid joke that my wife surely knows, she shops all the
time. I know it has to do with a credit card. But none of them knew either.
It's fascinating that people in charge of our financial system have no idea how it
works. APR stands for either annual percentage rate or average percentage rate,
depending on who you ask. The most fascinating thing about the APR is no one
knows how to calculate it. It's the rate you pay for the right to use credit. The
math for it is so complicated. There's no closed form. So if you ever try to figure
out how to figure out the true APR you pay, it's basically impossible to compute.
Part of the reason is the government controls the calculation to such a degree
that they dictate it in legalese, and it has to be translated by data scientists to
math and it's essentially impossible. There are two official calculations the
government stands behind. They result in slightly different numbers.
One I cannot understand even though I've been cooking in this industry for a
good portion of 20 years. The second I can understand, but it's so complex there
is no typeset version of this. If you search for it hard enough you'll find a w.gov
(ph.) site that will give you the APR calculation, and you can see that it's
handwritten. Someone had to write down the numerical method to solve for the
APR. This gives you a glimpse into how messed up the whole thing is.
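A minimal sketch of the kind of numerical method described here, assuming the simplest possible case: a fixed-payment installment loan with equal monthly periods and no fees. The function names and the example figures are illustrative and are not the regulatory calculation itself. In this simplified case the periodic rate r is the root of amount = payment * (1 - (1 + r)^-n) / r, which cannot be rearranged to give r directly, so the sketch finds it by bisection and annualizes by multiplying by the number of periods per year.

```python
def present_value(payment, rate, n_periods):
    """Present value of n equal payments at a periodic interest rate."""
    if rate == 0:
        return payment * n_periods
    return payment * (1 - (1 + rate) ** -n_periods) / rate


def solve_apr(amount, payment, n_periods, periods_per_year=12, tol=1e-10):
    """Find the nominal annual rate whose payment stream is worth `amount`.

    Bisection on the periodic rate: present value falls as the rate rises,
    so we narrow [lo, hi] until the bracket is tighter than `tol`.
    """
    if payment * n_periods < amount:
        raise ValueError("payments do not even repay the principal")
    lo, hi = 0.0, 1.0  # assumed bracket: 0% to 100% per period
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if present_value(payment, mid, n_periods) > amount:
            lo = mid  # payments are worth too much, so the true rate is higher
        else:
            hi = mid
    return periods_per_year * (lo + hi) / 2


# Hypothetical loan: borrow 1,000 and repay 12 monthly payments of 90.
apr = solve_apr(1000.0, 90.0, 12)
print(f"APR is roughly {apr:.2%}")
```

Bisection is slower than Newton's method but cannot diverge, which makes it a reasonable choice for a sketch. Irregular payment dates or fees would change the equation being solved, not the approach.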
Now that I've sufficiently painted the picture of the decrepitude of the industry,
what happens is a data science problem. You go in, or virtually go into a bank,
and you say I'd like to borrow some money to buy this house. The process for
that is relatively intense. By the time you're done applying for that kind of a loan,
you have basically been fully frisked. Just shy of a personal reference from your
middle-school teacher, how much money do you make, how much debt do you
have, what the house is worth. Somebody's done an independent assessment of
what the house will probably be worth. It's adjusted for a 30-year typical outlook
on the economy. It's a very detailed analysis.
The rate you pay for that kind of a loan today is somewhere between 3 and 5%.
The reason it's so inexpensive compared to your 29% APR for your credit card
for your age and income category is because it's a secured loan. There's a house
to be repossessed if you don't make your payments.
The cost of underwriting per mortgage loan is about 5,000 dollars. It's pretty
outrageous but if you're buying a multi-million-dollar house in San Francisco,
that's 2% of the overall price. 6% goes to the agent and broker, so peanuts. You
can't really do the same -- this process is completely manual, by the way. This is
literally a file created. Terminology in the lending industry is fascinating because
it all goes right back to you go into the corner store and say I don't have my
wallet with me, Bob, and Bob behind the counter says I'll get you later, sort of pay
me off at the end of the month. A lot of that carried right through into today's
lending, even though the file is electronic, supposedly. Mortgage files are not. A
lot are still required by the government to be paper, which is fascinating.
If you try to borrow money to buy a mattress, that's at most 3,000 dollars, so
doing the full underwriting process that would cost 5,000 dollars is inherently
unprofitable. If you're trying to apply for a credit card, the cost of underwriting is
actually quite high. But the time to underwrite is once. At that time you have a
revolving credit account known as a credit card, and you are essentially ushered
by the industry to be forever in debt. You're basically going to pay off the 20% from
now on.
I'm going as fast as I can through the background stuff because it's not relevant
to the science, but it gives you a good context, a sense of how many
opportunities there are to revolutionize this thing, and also how messed up things
There are two fundamental problems, and the second one is split into two when
you're doing underwriting. Underwriting is a data problem. If I knew everything
there is to know about you, I could probably lend you money at a fixed rate, for a
very low amount of money because I know where you live and I know who you
are, and I could get a sense for exactly how likely you are to repay. But I don't, so
I'm pricing risk. I'm trying to figure out how likely you are to say I wasn't really
borrowing money with the intent to pay you; I was going to run away.
That's known as the fraud mitigation or pretreatment. Somebody comes in, we
have no idea who that person is. We can ask any question we like. We can look
at their credit bureau data, assuming they're not lying about their Social Security
number. Assuming they haven't purchased a stolen identity. All these layers of
possibilities which we can cover in a bit, but the goal is to figure out are you who
you say you are. So there's an identity resolution which presumably you've talked
about when you talked about nodes.
Then there is the fraud determination or fraud-prediction problem, which is
squarely an edge social-graph problem, because if I knew who that person
transacted with, their friends, who they associate with, what else they've done in
their past, I could pretty quickly pin it down that even though they say they're in
Berkeley, California, they're really in the middle of Bulgaria, and they spend most
of their time online going through places called carterplanet.ru. They're probably
not who they say they are. The activity, social and anti, could give me a
sense for how likely that person is to be fraudulent.
It's not a solved problem but it's a containable problem. I've solved this problem
at PayPal with largely the same team that's solving this problem with me again at
Affirm. It's a lot of fun. There's a thousand (indiscernible) stories, some of which
are still too early to tell. But at some point I'll publish my memoir about chasing
bad guys out of various financial transactions.
The more interesting problem and more open-ended one is now I know you're
not a bad guy, not a fraudster. Once I know you're okay and really do mean to
borrow money with the intent of repaying it, including the finance charge that
comes with it, there are three possibilities. There's a first possibility of you're a
responsible person, going to borrow money, going to pay it back. That's a
profitable transaction but it's not really exciting for the purposes of this
The second possibility is you mean to pay it back but you actually are not able to.
You're borrowing for a 2,000-dollar mattress, even though the best you can afford
is 500 dollars. You're not doing this out of malice. You're just foolish, or
irresponsible, or you think you're going to get that big raise, or you convince
yourself you will. There's a million reasons why, but for whatever reason, you're
applying for too much money in the form of debt.
Figuring out where you fit into that bucket, how to solve that problem is a really
interesting question. And that's basically what credit underwriting is about. Once
we've gotten rid of the fraudsters, you're now squarely in the territory of the credit
The other side of this is almost an unsolvable problem of you're a good person,
you mean to borrow money. You should be borrowing about this much money for
this purchase. Debt gets a bad rap because we've been told by a lot of late-night TV commercials that you should get rid of debt and get to zero. I don't
subscribe to that. Debt is a good thing because it allows you to borrow against
your future self, or to invest in your more intelligent self.
One way to prove to yourself in a thought experiment that it's a good thing is your
student loans. If you couldn't borrow money for your education you would forever
be stuck in your income brackets, and you wouldn't be able to get further out.
Similarly true for your purposes of procreation, which is our number-one job as
humans. If you can't get that nicer apartment, a better mattress, nicer set of
dishes, you might not find your perfect mate, and your genetic material will be
wasted. The idea of borrowing money responsibly is a great concept. But it's very
easy to go astray because there are lots of people who will push you into slightly
more debt than you can possibly handle, and the whole system is designed to
take full advantage of that and never let you out.
The third group is a fascinating group where you borrow money. You have full
intent of repaying it, and in fact we can ascertain at whatever cost that you are
capable of paying it back, except something terrible happens to you. The classic
bad things to good people, you're walking down the street, and a car hits you.
Hopefully it doesn't happen to anyone in this room at all, but it happens with
unending regularity to lots of people.
Sometimes, they say I'm going to be so straight laced, so convicted in my desire
to be good to society I'm going to pay off my debt, even if it means I'm going to
have to sell my possessions and put myself into financial strain more than I must.
For most people it means I tried my best. I tried, and I now have to declare
bankruptcy, or something in between, where I'm just not going to pay for a while.
This is basically considered today by the underwriting industry as an
unpredictable data problem. You can't predict someone is going to get hit by a
car. How can you predict that?
What I do today in my company Affirm is we do point-of-sale credit. We try to
create a counterweight to the evil Macy's, Gap credit card, where you go into the
back of the store, get frisked, get a terrible rate, and hate yourself for it, but you
saved 10%.
What we do is point-of-sale financing, which if I make it completely transparent,
consumers understand exactly how much money they're borrowing and what the
cost will be. It's non-revolving, and it's a one-time product. You essentially take
out the kind of loan that looks more like a mortgage loan, where it's fairly
inexpensive. It's straight-line, no complexity to understand it. The end of it, you're
done. You can't spend money again on something else. But the timing to do it
and actual cost of the underwriting process is about as low as it possibly can be.
We approve a loan in less than half a second.
I will stop making a commercial out of my talk here, but that's what we do. That's
why this is interesting to me as much to tell you, as I go back to my office to keep
working on it. We solve three problems, among other problems for scalability,
marketing, product development, etc. We solve three data problems. Number
one, are you a fraudster, going to steal money from me? Andreas will guide me
on how much to tell you versus how much to let you participate in this. To me,
these are fascinating intellectual problems to contemplate.
The first thing you do is I wonder what correlates to badness, and I'll give you
one classic example I'm fond of: people who stole identities tend to not care
about those identities much. They tend to not capitalize their last names. There's
a slight uptick in probability of being bad. It's all probability. I don't capitalize my
first name a lot of times, because my iPhone has a screwed up -- like everybody
else's iPhone -- autocorrect, and at some point or other I spelled my name with a
lowercase and it learned, and now I'm max with a lowercase m. To my very simple
fraud system it looks like Max is not Max, he's someone else.
Fortunately we have lots of other variables. There are lots of really cool
heuristics, but ultimately, it has to be modeled. We'll cover that in a second.
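A hedged sketch of how a weak signal such as an uncapitalized last name can nudge a probability without deciding anything on its own. The signal names, the weights, and the base rate below are invented for illustration. In practice the weights would be fitted from labeled outcomes and would not be chosen by hand.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))


def sigmoid(x):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


# Hypothetical log-odds adjustments; real values would come from a fitted model.
SIGNAL_WEIGHTS = {
    "last_name_not_capitalized": 0.4,
    "ip_country_mismatch": 1.5,
    "known_device": -1.0,
}


def fraud_probability(signals, base_rate=0.01):
    """Combine binary signals additively in log-odds space."""
    score = logit(base_rate)
    for name, present in signals.items():
        if present:
            score += SIGNAL_WEIGHTS[name]
    return sigmoid(score)


typo_only = fraud_probability({"last_name_not_capitalized": True})
several = fraud_probability(
    {"last_name_not_capitalized": True, "ip_country_mismatch": True}
)
print(f"lowercase name only: {typo_only:.3%}")
print(f"lowercase name and IP mismatch: {several:.3%}")
```

With these made-up weights a lowercase name alone moves the estimate only slightly above the base rate, which matches the point that it is a slight uptick and that other variables have to carry the decision.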
The more interesting problem to me is how do you identify the amount of money
you're willing to risk with someone? This is probably the coolest sort of think
about this in realtime and come up with ideas type problem I can throw you
because it naturally lends itself to the question of could you use something other
than FICO.
FICO stands for Fair Isaac Corporation, which is actually out here in the Bay
Area and invented the notion of credit scoring mathematically 30-ish years ago. The
FICO score became a standard. Basically 99% of American banks were in full
use of FICO by about 1989. It's a remarkably simple model.
There was no data available about American humans, compared to what we
know today, 20-odd years ago when this really happened. If I remember
correctly, FICO 8 takes into account maybe 20 different classes of data, but the
original FICO score was literally an 8-variable model.
If you wanted to describe FICO in a simple set of terms, it's essentially a bunch of
little things and your debt-to-income ratio, how much money have you borrowed,
and how much money are you earning to pay that off. That's about as trivial as it
That turns out to be quite effective. You could declare you're going to strike FICO
down, do away with it, and you'd probably find yourself climbing up a tree that's
way too tall. In fact, the amount of money you borrowed and how much money
you're earning on any given day is a pretty good starting point.
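The debt-to-income idea singled out above reduces to one ratio. A minimal sketch with hypothetical figures and a hypothetical cutoff follows. It is not the FICO formula, which is proprietary, and the function names and the 0.36 threshold are assumptions chosen only to make the example concrete.

```python
def debt_to_income(monthly_debt_payments, gross_monthly_income):
    """Share of monthly income already committed to servicing debt."""
    if gross_monthly_income <= 0:
        raise ValueError("income must be positive")
    return monthly_debt_payments / gross_monthly_income


def can_take_on(new_payment, monthly_debt_payments, gross_monthly_income,
                max_ratio=0.36):  # hypothetical cutoff, not an industry rule
    """Would the ratio stay at or under the cutoff after adding a new payment?"""
    ratio = debt_to_income(monthly_debt_payments + new_payment,
                           gross_monthly_income)
    return ratio <= max_ratio


# Hypothetical borrower: 900 a month in existing payments on 3,000 of income.
print(debt_to_income(900, 3000))
print(can_take_on(100, 900, 3000))
print(can_take_on(300, 900, 3000))
```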
The problem is FICO updates every quarter, at most. And most times, it updates
every half-year. We live in a world that is way more dynamic. When you graduate
college, get a new job, go through a job interview, buy a car, sell a car, change
roommates, get a DUI, these things don't wind up on your credit report, and
therefore, they don't get figured into FICO quickly enough. There's a lot of
opportunity to do what's known as "front-running" FICO. You can get in front of it,
calculate a better score. Lots of companies, Affirm included, that try to do better
than FICO by considering other interesting kinds of data.
This is to figure out how much money can be lent to you or what kind of risk can
be taken. Incidentally, FICO score is one of the best correlates of your health
cost, health risk, so your financial measurement reappears in your health
insurance. It's an interesting commentary on society, where we're such a financial
species here -- maybe it's the other way around; our health is more important to
us than our money. I'll let you decide which way it works. It's a pretty interesting
Finally, the third question we try to answer, although this is really a big open-ended problem -- I don't have a solution for this, so if bad things happen to good
people, what do you do? How do you predict that? The nefarious-sounding
answer is there's a reward. For example if I said let me plant a camera on your
forehead facing in. I want to film every moment of your life, know everything you
eat, breathe, sneeze, dream. I want to know all there is to know about you. I will
be the NSA in your forehead.
I could price the risk of your life, presumably very precisely, especially if I have lots of
views. At that point if you want to borrow money I could say I know what you'll do
tomorrow morning, the same thing you'll do every morning. I know the standard
risk of getting hit by a car doesn't apply to you. You live on campus. There are no
cars allowed. That's a lower rate. You eat too much red meat -- but you don’t,
because you're a vegan, so that should be a lower rate too. You can imagine the
connection between the FICO score for finance and FICO score for health. Your
safety as a human being correlates very nicely to your safety as a health risk.
So this level of scrutiny should bring someone's cost of debt very low, except the
cost of debt for most people isn't that high in the first place. The impracticality of
a camera in your forehead prevents all of us from being scrutinized by some
sort of private NSA.
But what can be done to lower the cost? There are people -- back to the
insurance topic for a second -- one of the things that's fascinating about finance
and large risk-mitigation systems is that you ultimately always have the best
least-risky people pay for the worst performers. The guys that say I'm going to
borrow money, sure I'm going to pay it back; I'm going to keep this job I don't
really have. The ill-intentioned Eastern European -- I can only say this because
I'm Eastern European myself -- those bad guys are thieves. Factoring those guys
out, there are a bunch of people that are irresponsible and know it and don't care.
People that by hook or by crook make their monthly payments, no matter what
interest included, that interest pays for the losses the first irresponsible category
does. If that doesn't seem like a moral problem to you, translate this to health. It's
the can't be denied based on prior condition clause which is now a very
controversial topic of conversation. The reason it's so important is because all of
you young, healthy, barely legal American adults need to pay for the 65-plus
contingent that's just about to deal with their various colorectal cancers.
It sounds awful but that's how society generally works. You have the highest-risk
people covered by the low-risk people. It's a huge problem in healthcare. I have
my own set of opinions of how to address that. In finance it's a bit easier, and
we're not talking about life and death and the moral hazard is a bit more
contained. But it's still a moral hazard. You have people that are basically saying
but I'm living hand to mouth, being gouged by payday lenders and loan sharks,
and I really want to break. Please give me a loan. I don't want to buy a nicer
house. I just want to go to school.
Those people can't get loans, and yet people that are excellent risks, especially
after 2008 it's really tilted heavily in favor of excellent risks, those people are
being pushed money, given credit cards that pay incredible rewards and very low
rates at the same time. They don't really need any of those resources. The moral
hazard there is somewhat less blatant but still there.
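The cross-subsidy described above, where the least risky borrowers pay for the losses of the riskiest, can be made concrete with a little arithmetic. This is a sketch under strong assumptions that do not come from the talk: a one-period loan, equal loan sizes, total loss on default, and zero cost of funds. Under those assumptions the lender breaks even when (1 - d) * (1 + r) = 1, so r = d / (1 - d). The default rates and the pool mix are hypothetical.

```python
def break_even_rate(default_rate):
    """Interest rate at which repayments from survivors cover defaults.

    Assumes a one-period loan, total loss on default, and zero cost of funds:
    (1 - d) * (1 + r) = 1, so r = d / (1 - d).
    """
    if not 0 <= default_rate < 1:
        raise ValueError("default_rate must be in [0, 1)")
    return default_rate / (1 - default_rate)


def pooled_default_rate(groups):
    """Default rate of a pool given (share_of_borrowers, default_rate) pairs."""
    return sum(share * rate for share, rate in groups)


# Hypothetical pool: 80% careful borrowers, 20% risky ones.
careful, risky = 0.01, 0.20
pooled = pooled_default_rate([(0.8, careful), (0.2, risky)])
print(f"careful borrowers priced alone: {break_even_rate(careful):.2%}")
print(f"everyone priced as one pool:   {break_even_rate(pooled):.2%}")
```

The gap between the two printed rates is what the careful borrowers pay on behalf of the risky ones when the lender cannot tell them apart.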
Have a seat and we'll have a chat where first I want to bring out some topics that
were latent in here, and we'll open up to questions. You know we have the
whiteboard in the background which is bit.ly/ischool2014whiteboard, so if you
have questions, I'm looking at this.
I loved your remark that when you started PayPal there basically was no data
compared to the world we live in now. How was that? Did it feel like a world
without data or did it feel like wow, we had all those data compared to the other
I think only in retrospect you realize how dire or low the information density was.
One of the interesting things you'll learn -- in the case of Affirm we do lending
which means we take direct risk. In the case of PayPal we take indirect risk. If
Affirm screws up a 100-dollar loan we're on the hook for 100 dollars. If PayPal
screws up a 100-dollar transaction, there's typically somewhat less money at risk,
but it's a similar risk profile in many ways.
What happens in that world which is really scary, but to me also exciting and real
compared to perhaps photo sharing startups; you only learn one way. You can
get data from other sources, but ultimately you have to lose money to get the
truth set of what it feels like to lose money and the correlating factors to the
In many ways, at PayPal, we went through this period of where we were basically
going holy crap, we're losing 10-million dollars a month. And that was terrifying.
We raised 255-million dollars, all in, before IPO. And a large fraction of that was
basically paid out to Eastern European fraudsters as a form of tuition payment,
where money went through our hands, into theirs. We recovered some of it in
time, but the data we gathered from it is priceless.
PayPal is going to get spun off finally, and probably going to be worth 30 to 40 billion dollars. That valuation largely rests on the data gathered. By data, as you
visualize this, think of these giant vectors of this user signup at this time, here are
the keystroke timings between their signups. Here is the IP that came in, here's
the IP they tried their first time. Here's who they transacted with.
One of the best-known narratives from the PayPal history is this product called
Igor that we built. It was one of the first network visualization tools. It was the
precursor, both intellectually and practically, because the people went on to start
this company called Palantir, which is a very large network visualization tool and
then some these days.
It was the ability to say we know that this dot is suspicious, and this node is
suspicious. What do the edges look like? Because we had no models, and just
barely had enough data, instead of building models we said let's visualize the
data. Let's have thickness of the edge be the transaction size, and the color be
how far back it happened. These were very primitive tools. They improved our
productivity by a factor of 100. We would reduce the number of minutes
necessary to make a decision whether something was bad or good, by a human
operator, let alone a computer, by two orders of magnitude.
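The mapping described here, edge thickness for transaction size and color for recency, can be sketched without any graphing library. The Transaction fields, the scaling constants, and the grayscale encoding are assumptions for illustration. The talk does not say how Igor computed them.

```python
from dataclasses import dataclass
from datetime import date
import math


@dataclass
class Transaction:
    sender: str
    receiver: str
    amount: float
    when: date


def edge_style(tx, today, max_age_days=365):
    """Map one transaction to visual attributes for a graph edge."""
    # Log scale keeps a 10,000-dollar transfer from dwarfing a 10-dollar one.
    width = 1.0 + math.log10(1.0 + tx.amount)
    # Recent edges are dark (0); old edges fade toward light gray (200).
    age = min(max((today - tx.when).days, 0), max_age_days)
    gray = round(200 * age / max_age_days)
    return {
        "source": tx.sender,
        "target": tx.receiver,
        "width": round(width, 2),
        "color": f"#{gray:02x}{gray:02x}{gray:02x}",
    }


# Hypothetical accounts and dates.
today = date(2014, 10, 7)
recent = Transaction("acct_a", "acct_b", 99.0, today)
old = Transaction("acct_b", "acct_c", 9.0, date(2013, 10, 7))
for tx in (recent, old):
    print(edge_style(tx, today))
```

The resulting dictionaries could be handed to whatever drawing tool is available. The point of the encoding is that a human operator can spot large, recent flows between suspicious nodes at a glance.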
The data when we started was nil. Today you can download a tremendous
amount of supporting information, so when you're coming in you don't have to
rely strictly on what you've seen and the money you've lost. But reality is as
you're building your models you still need a truth set, and a truth set isn't about
data, data, data, identity map. It's data, data, data, identity map, your own data,
loss or not loss. And that is the tuition you pay. Any financial services company
ultimately winds up being built through the "fortitude" of the people willing to lose
tons of money as they're learning what kind of users they have.
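A small sketch of why the truth set needs the lender's own outcome column. Each row pairs observed features with whether that account ended in a loss, and only then can a feature be tied to an observed loss rate. The field names and the six rows are invented for illustration.

```python
from collections import defaultdict


def loss_rate_by(feature, truth_set):
    """Observed loss rate for each value of one feature."""
    counts = defaultdict(lambda: [0, 0])  # value -> [losses, total]
    for row in truth_set:
        bucket = counts[row[feature]]
        bucket[0] += row["loss"]
        bucket[1] += 1
    return {value: losses / total for value, (losses, total) in counts.items()}


# Hypothetical truth set: features plus the label that cost money to learn.
truth_set = [
    {"ip_matches_address": True, "loss": 0},
    {"ip_matches_address": True, "loss": 0},
    {"ip_matches_address": True, "loss": 1},
    {"ip_matches_address": True, "loss": 0},
    {"ip_matches_address": False, "loss": 1},
    {"ip_matches_address": False, "loss": 0},
]

print(loss_rate_by("ip_matches_address", truth_set))
```

Purchased data can add feature columns, but the "loss" column only fills in after loans or transactions have actually gone bad, which is the tuition referred to above.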
In that sense, it doesn't feel that different. Having said that, today we can look at
literally thousands of additional columns of data, compared to PayPal. I
remember having to explain to our board what it is that I do every day with all this
data. You keep on buying more hard drives and you need solid-state hard drives,
and this is incredibly expensive. What do you do?
I said do you have access to a database? If I give you an email address you
would give me a yes or no answer of good person, bad person? They'd say of
course not it doesn't do so. Exactly -- that's what I'm building here. So today such
databases exist. In fact, I saw a startup yesterday that literally said if you give me
an email address I'll give you a yes or no, good or bad. Where were you 20 years
ago when I needed one of those things? The startup didn't actually have the
technology they thought they did, but it sounded really good.
Shall we probe the students a bit and see what kind of data sources do you think
a company like Affirm could use to make better decisions about the conditions
under which you give money to somebody? Shout out things Max might not have
heard about.
Insurance company statistics.
He probably has heard about those.
One of the interesting things about -- it's a good thing to frame your thinking.
There's an incredible amount of regulation in lending in particular, and in
insurance in general, but lending is kind of an insurance system, where the
government says if you can correlate your decision to lend or not lend to a single
prohibited variable you can't use it. It’s not okay to be racist or sexist or whatever.
All the obvious things, yes, you can't really decide girls don't get to borrow money
or boys don't get to borrow money.
Even if you have something that you've cleverly concealed and said do you have
really long hair or not long hair, if that correlates sufficiently well to being a boy or
girl, even if it's not a complete correlation, even if there's overlap, even if you
make a claim that it's not really the same way, it doesn't matter. A government
regulator will eventually catch up to you and say that looks pretty much like
you're discriminating against women. So you're shut down.
Insurance company statistics would be a fascinating thing to mine, except it is
basically a minefield of prohibited information. As much as someone like me
would like to get their hands on a lot of that data, only a fraction of it is usable.
The other part is there are a ton of privacy laws, which with my general libertarian
leanings I find it annoying to constantly research what is okay, what is not okay.
Privacy law is typically written by people that care for the right reasons. Following
that is difficult but a good thing. Having said that, that makes underwriting even
more difficult than it already is.
I had a random idea. I'll plant something into your minds and you can think about
this, another think-on-the-fly question. Think of a game, a game mechanic or
design you could use to measure someone's responsibility, credit worthiness,
response to being in debt and strained financially. One thing that you brought up,
which has come up in conversations before, is very controversial, but as far as I'm
concerned it's a super-cool idea: a benign self-infected spyware.
You basically say I'm going to install this piece of spyware on my computer. And
you can monitor -- I understand exactly how it works. This is not like capture my
keystrokes and get my password, but watch my response times, watch what I
type to some extent, but no more than I'm okay with privacy wise. Let that be
what speaks about my digital persona, which is roughly what you're talking about.
Right now, there's very few ways to estimate that except for how quickly you
respond to public tweets, or how quickly your friends see your email responses.
Which is interesting, because if you do this you should probably be thinking in
terms of not how you behave, because you're motivated to fake things and tweak
the spyware. But use you to observe your friends. Therein lies a whole vein of
interesting ideas.
The data source question isn't about what rule would you create. In general in
underwriting, it is never the question of -- never is a little strong -- when you talk
to an old-school banker, it frequently literally is you look like a trustworthy, white,
fat, balding man like me. Let's shake hands, and that's a loan. This is true
1950s-style lending, and still practiced in a lot of parts of the country. But for
a reasonable 21st-century model it's more about I don't care what the data is. All I
care about is feeding it into a model. If I could capture your responsiveness to
emails or your ability to capitalize things correctly, it could be an influencer. By
that I mean it could influence no more than 0.5% of the outcome of the model. I
wouldn't presume to know whether your responsiveness at a certain time of the
day to a certain person would make it more likely or less likely -- do
I hang up my board call to pick up the call from my child or not may say
something about my responsiveness. But the same behavior under different
correspondence might mean something entirely different.
I think there are many dimensions to the gaming. For instance, if you're in an online
game and just did really well, will that mean that you might now be
spending money you shouldn't be spending because you feel you have a good
streak? Or vice versa, you were doing poorly. Does it mean you're ready to
double down?
I think what Max is saying is let's look at many variables. Let's not judge in which
way they will enter the model. That's what machine learning is very good at. The
human creativity is to come up with other ideas, to come up with other variables.
I think even typing, we all have good typing days or hours and some miserable
ones. I think if we know how our motor coordination is, maybe that tells us how
good the person is.
I actually heard that Goldman Sachs has as one of the key
variables on whether they're going to hire someone or not how fast a person can
type. That's interesting on the granularity. Then there's of course also time
scales. Their schedules used to be packed, but for some reason now, it's pretty
empty. Does it mean that they are on their route at work? Are they about to get
fired? Or the use of the dinners, one after another -- now it's just one person, so
Geolocation, are they walking faster today than they usually walk? Are they
depressed? How long does it take them to answer the phone? Or on Facebook,
are they the ones who initiate conversations? Are they the ones who organize
events? Are they the ones who say maybe I'll go, but never show up? I never
worry about what's prohibited. I'm always about the possible, not the permissible.
That's the difference between academia and industry. I get sued and go to jail.
You get to be known as the rebel professor.
One interesting topic is that a huge percentage of people, in the first couple of days of
January, want to lose weight. Tracking weight as it correlates to responsibility is a
really interesting topic. I've long been fascinated -- I don't know what the answer
is, but I'd love to know if there's a good way of charting that or predicting that.
Now it's time for me to ask Max our question. Max, if you had all the data of the
world readily available at your fingertips, and no worries about getting put in jail
for using them, what would you do with it?
The problem is I'm a data scientist, so I wouldn't possibly stop with one thing. I
have a stock answer to a specific kind of data I want, but I think it's a good cause
and actually doable. The data I want, which is a horrible, privacy-ridden thing that
would be pretty much impossible to get, I want every tumor, cancer, suspicion,
CAT scan and X-ray with an outcome. I want to look at every single X-ray ever
taken, every single suspicious-looking tissue examined through
radiological photography, and the subsequent biopsy result. Then I want to set up a service
where I would have professional radiologists read those slides, and have 10 or
20 read the same slide and decide whether that's a tumor or not.
Then I would correlate that to my truth set, and figure out who is a really good
radiologist. Concurrently, I would run a game that says: pretend to be a radiologist.
No one with a medical degree allowed. I'll show you a picture. I don't care if you
have any idea what it is. All you need to do is decide whether it's a tumor or not.
At some point, having a large enough data set, I will have a sufficient number of
people that with zero education would be able to spot tumors at least as well as a
highly educated radiologist.
At which point, I would destroy the entire radiological industry because there
would be no need for them. I would have a computer assist humans that are
playing a video game, that probably don't even know that they're diagnosing
tumors and doing well. Then I'm going to put those people out of work too,
because once I know that this is (indiscernible) by people that don't know what
the hell they're doing, I'm going to have a computer model pretend that they're
people. At any given time I will be able to say I feel like I have sniffles, it could be a
tumor, get under the X-ray, throw it into my giant machine-learning model, and
find out it is not a tumor. That's what I would do. I think it would be almost a
worthwhile cause.
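A minimal sketch of the scoring step described above: compare each reader's calls with the biopsy outcomes (the "truth set") and see who reads well, whether trained or not. The reader names, scan IDs, and function names are hypothetical, and the three-scan data set is made up purely for illustration.

```python
from collections import defaultdict

def reader_accuracy(reads, truth):
    """Fraction of correct calls per reader.

    reads: iterable of (reader_id, scan_id, said_tumor)
    truth: dict mapping scan_id -> True/False biopsy outcome
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for reader, scan, said_tumor in reads:
        if scan not in truth:
            continue  # no biopsy outcome, so the read cannot be scored
        totals[reader] += 1
        hits[reader] += int(said_tumor == truth[scan])
    return {reader: hits[reader] / totals[reader] for reader in totals}

def majority_vote(reads):
    """Consensus call per scan: tumor if more than half of the readers say so."""
    votes = defaultdict(list)
    for _, scan, said_tumor in reads:
        votes[scan].append(said_tumor)
    return {scan: sum(v) * 2 > len(v) for scan, v in votes.items()}

# Hypothetical toy data: one radiologist and two untrained game players.
truth = {"scan1": True, "scan2": False, "scan3": True}
reads = [
    ("radiologist_a", "scan1", True),
    ("radiologist_a", "scan2", False),
    ("radiologist_a", "scan3", False),
    ("gamer_b", "scan1", True),
    ("gamer_b", "scan2", False),
    ("gamer_b", "scan3", True),
    ("gamer_c", "scan1", False),
    ("gamer_c", "scan2", False),
    ("gamer_c", "scan3", True),
]

accuracy = reader_accuracy(reads, truth)
consensus = majority_vote(reads)
print(accuracy)
print(consensus)
```

In this toy data one of the untrained players matches every biopsy result, which is the situation the idea depends on finding at scale. A real system would need far more reads per person before trusting any ranking.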
Reminds me of a story I read when I was at Amazon of an illegal immigrant in
New Mexico. He lived in a trailer with his parents, but he had a computer and modem.
He posted on groups for legal advice and had some pseudonym. It was one of
those groups where the group votes up good answers. Over time, that person got
a higher and higher rank.
At some stage he was a top-ten lawyer in terms of legal advice. You want to
know whether you should beat your wife, he knew the answer; whether you
should pay that bill, he knew the answer. Then he came out and said by the way,
I'm just a kid. I don't even have a high school degree. I live in a trailer in New
Mexico. The whole legal profession went after him and said how is it possible
that a person without a legal degree actually gives advice.
Sounds very similar. That was basically a machine-learning model. In the same
vein, to throw back to the points I made earlier, how would you estimate
someone's credit worthiness? One way to do it, which I haven't done but I
encourage you all -- those of you that write code on the side for fun or profit,
here's a project I wish I had time to do.
Surely, your friends know you best. The obvious way of getting rid of the FICO score,
much like this person in New Mexico or my semi-crazy radiologists without
humans is to have a game that says ask your friends to guess your FICO score.
Post on Facebook, it takes a bit more than posting, but you could probably build
a Facebook app using their API, maybe a mobile app. I will pay any number of
my friends a penny to guess what my FICO score is. FICO score goes from 300 to
800; 800 is unachievably high, and 300 is basically several bankruptcies in
a row. If you're young and have one credit card, and paying between 19 and 25%
APR, you're probably at 720, and it floats 10 to 12 points every time you
apply for a loan and goes up with payments, every quarter. That's like a baseline.
It doesn't take a lot to figure out.
Once you have 100 data points, you know the story about guessing the weight of
the bull. You should be able to have your friends estimate your FICO score
down to a single point of precision. It's an incredibly coarse system. In fact, it
doesn't really matter if your FICO is 720 or 722. It basically goes in 25-point
We can dispense with the entire hegemony of the three government-sponsored
monopolies known as the credit bureaus and the Fair Isaac Corporation that
props them up, even though they all hate each other, simply by having your
friends answer that question. That's one game to set up. I think there's probably
a dozen games like this you can think of.
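A minimal sketch of the guessing game's aggregation step, assuming each friend submits one number and that guesses outside the 300 to 800 range quoted above are simply discarded. The function name and the sample guesses are hypothetical; this is plain averaging of a crowd, not a real scoring model.

```python
from statistics import mean, median

FICO_MIN, FICO_MAX = 300, 800  # range as quoted in the talk

def crowd_estimate(guesses):
    """Aggregate friends' guesses, ignoring values outside the valid range."""
    valid = [g for g in guesses if FICO_MIN <= g <= FICO_MAX]
    if not valid:
        raise ValueError("no usable guesses")
    return {"n": len(valid), "mean": mean(valid), "median": median(valid)}

# Hypothetical guesses from eight friends; 1200 is out of range and is dropped.
guesses = [700, 740, 720, 690, 760, 710, 730, 1200]
estimate = crowd_estimate(guesses)
print(estimate)
```

The median is less sensitive than the mean to a few friends who guess wildly, which is one reason to report both.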
What is the feedback function, the gain the friends have by getting it right versus
getting it wrong?
The friends should bet money. This is more of a startup idea, but you could
probably imagine 100 friends pooling together ten bucks, and saying I bet
Andreas is so responsible, I'm going to give him ten dollars. I'm going to guess
his FICO is 800, perfect. All I want back is $10.01. Is that okay? You say yeah,
that sounds good to me, if you vouch for me and say I'm so good. I will give you
back $10.01.
The other guy says Andreas is a shifty-eyed, strange dude, German accent. If I
give him ten dollars I want twelve back, because the risk is too high. At some
point you have a loan sourced by people you know. You're not going to screw
them. They're your Facebook friends. They're going to start tracking you online.
The risk is quite low. And they would have established the ceiling of your
permanent credit record that is entirely public and out in the open. That is the real
way to disrupt the entire credit system and create an entirely level lending
mechanism. Because it's not regulated and (indiscernible) you would also not be
subject to any real restrictions.
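A minimal sketch of the arithmetic behind the friend-sourced loan, using the two offers from the example above (lend ten dollars and want $10.01 back, or lend ten and want twelve). Amounts are kept in integer cents to avoid floating-point rounding; the friend labels and function name are hypothetical.

```python
def summarize_offers(offers):
    """offers: list of (friend, lent_cents, repay_cents)."""
    principal = sum(lent for _, lent, _ in offers)
    owed = sum(repay for _, _, repay in offers)
    # Each friend's implied interest rate is their private estimate of the risk.
    per_friend = {friend: (repay - lent) / lent for friend, lent, repay in offers}
    return {
        "principal_cents": principal,
        "owed_cents": owed,
        "blended_rate": (owed - principal) / principal,
        "per_friend_rate": per_friend,
    }

offers = [
    ("trusting_friend", 1000, 1001),   # lends $10.00, wants $10.01 back
    ("skeptical_friend", 1000, 1200),  # lends $10.00, wants $12.00 back
]
summary = summarize_offers(offers)
print(summary)
```

The spread between the per-friend rates is itself information: it shows how much the people who know the borrower disagree about the risk.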
And so the government could say you can't lend money to this guy with interest,
because you're not regulated. But that happens all the time anyway. Andreas
comes to my dorm room and says I need ten dollars. Great, I'll give you ten but
you need to bring me ten and a can of beer. That's essentially interest, but how
can you charge that (indiscernible)? If you went to subversify -- get your friends
to put up your rating and a number of dollars they'd be willing to put behind that.
If I then default, then it is much more embarrassing than if I default against a
bank. And in the unlikely case that I only pretend to be defaulting, I claim I lost
something I didn't really lose, think about the effect that has on the system. I
would not tell Max that I lost the money if I didn't, because my reputation is more
important than that. Whereas, some anonymous insurance company or credit
card somewhere, that is a different story. This veneer of anonymity which makes
in many cases the probability of claims probably an order of magnitude more
Identity resolution in that case is critical. In fact, Facebook is practically one of
the best platforms to set this up because they insist on real names. One
interesting factoid on the credit industry, one of the places where credit cards
took a very long time to get adoption is Japan. Until as recently as ten years ago,
you could go into a store in Japan where you didn't know the shopkeeper, and
they didn't know you, and you were in a random town, and leave your name card
instead of a credit card. They would essentially give you a store loan because
your name was so important to you.
Japan is I believe the second oldest monarchy, so it's one of the oldest
civilizations in order. The reason that's relevant is their last name matters. You
could say my last name is Fujiwara and that meant something to a lot of people
many generations back. So the idea of leaving your name card essentially said
I'm staking my good name on this loan. If I don't pay you back, my name and my
ancestors will be embarrassed.
As a result, the need for the credit card, the faceless loan company that says
sure, you might default but I'll spread the risk across many of you, was far less
relevant. In that sense, we can see the reemergence of the on-your-good-name
sir or ma'am relationship of the corner store.
We have a few minutes left. What other topics do you have, would you like to ask
one of the most thoughtful data scientists on this planet?
(Indiscernible) I would like to know do you get data from (indiscernible).
I'm not sure I fully understand the question.
Just to be clear, Bitcoin. I want to talk about Bitcoin, but I haven't talked about it
yet. Since not everybody is involved with Bitcoin, let's talk for a few minutes
about what Bitcoin is, what trust means in Bitcoin.
Bitcoin is probably the single best-applied cryptography result in the last ten
years. Before I was a data scientist, I was a cryptographer, and that's basically
the only two ways to make money if you do number theory, which is what I
studied. I hit them both.
In applied cryptography, it's very rare that you see a really good result, and the
application of the Merkle tree, which is what Bitcoin is based on, is a fascinating
piece of math that for the first time has seen a practical use.
It means you have a notion of an unfalsifiable, fully distributed ledger. So Bitcoin,
despite the name suggesting a currency, is actually nothing of the sort. It's a giant
record, a list of records, that say I gave Andreas a promissory note. Andreas gives
one of you a promissory note, and then the promissory notes are denominated in
this thing called Bitcoin, but it's just a measure of value. Like any other currency,
it floats with the dollar or against the dollar, against any other currency.
The reason it's so fascinating is it doesn't require central authority. You don't
have to trust anyone. The idea of the fully distributed ledger is brilliant because
anyone in the system can verify that Andreas did in fact receive one bitcoin from
me. And even if one of you says I'll never touch the thing again, I don't want to do
it, there's enough of a critical mass where that record lives on. No one can
destroy it. It's a permanent, essentially irreversible -- the only way to reverse the
transaction is to have the transaction be undone by way of a second transaction,
which is very nice because it creates all kinds of really interesting opportunities in
trustworthy accounting.
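A toy sketch of why such a ledger is hard to falsify, assuming a single in-memory list in which each record stores the SHA-256 hash of the record before it. This is a plain hash chain, not Bitcoin's actual block format, and it leaves out signatures, Merkle trees, mining, and distribution across many machines. All field and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record

def entry_hash(body):
    """Deterministic SHA-256 over the record's fields."""
    payload = json.dumps(body, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append(ledger, sender, receiver, amount):
    """Add a record that commits to the hash of the previous record."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    body = {"sender": sender, "receiver": receiver, "amount": amount, "prev": prev}
    ledger.append({**body, "hash": entry_hash(body)})

def verify(ledger):
    """Anyone holding a copy can recompute every hash and check the links."""
    prev = GENESIS
    for record in ledger:
        body = {k: record[k] for k in ("sender", "receiver", "amount", "prev")}
        if record["prev"] != prev or record["hash"] != entry_hash(body):
            return False
        prev = record["hash"]
    return True

ledger = []
append(ledger, "max", "andreas", 1)
append(ledger, "andreas", "max", 1)  # undoing a payment is a second transaction
ok_before = verify(ledger)

ledger[0]["amount"] = 100            # try to rewrite history
ok_after = verify(ledger)
print(ok_before, ok_after)
```

Changing an old record breaks its hash, so the edit is visible to anyone who re-runs the check. The only clean way to undo a payment is to append a new record.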
For example, I could send you a bitcoin. You could send it back to me. It's a net-zero
system, but we have now established trust. You have now seen that I can
take money and see that I can send it back to you. You can imagine interesting
applications like notary public as a whole system being eliminated because the
idea of somebody came over to my house and watched me put my finger -- that's
pretty much medieval. All I really wanted to do is have somebody witness me
promise that I do swear that this is true. I could also say this is true, sign it
cryptographically, put it into the block chain, which is the fancy name for the
distributed ledger. And now it's true and irreversibly so. Can never be taken
So bitcoin is super cool mathematically. It is marginally cool -- these are now my
opinions. Everything I've said so far is a factual truism. Now we're in opinion land:
brilliant math, incredibly volatile currency; that's also practically true. The last six
weeks have seen it go from roughly 600 dollars per bitcoin to about 300 the last I
checked. I don't have too many bitcoins to my name. I had none except someone
just gave me some. I used to say I have no bitcoins. It's too volatile, but
somebody just sent me a couple. So I am now in possession of bitcoins. They
sent it to me precisely so I would stop saying that I have no bitcoins. It was an
expensive proposition when he did send them to me (indiscernible) cheaper
As an investment, at one point it was a great idea because it was going through
the roof. It's now a terrible idea because it's going the opposite direction. Some
people that are big believers in cryptocurrency say that's the best time to buy, as
people would say in such circumstances. Those that believe it's not the best
investment opportunity say oh no, I bought at the height of the market.
As a commodity, it's a terrible store of value, because it floats against the dollar
(indiscernible). In the long term, I'm a big believer in cryptocurrency surviving. The
distributed ledger model is clearly the future. The idea of generally accepted
accounting principles, which is GAAP accounting, that's clearly where the world
is going to go. Doing it cryptographically versus by way of PricewaterhouseCoopers
coming in to frisk through your books to say that is in fact real; that's very
archaic. Somebody that can verify a couple digital signatures is a great way to do
it. All that points to the right direction. We're probably in for another ten large time
periods, whatever it's denominated in, of high volatility. And it's not clear to me
whether it's Bitcoin or bitcoin, as in the same network, but a different currency or
different protocol that winds up winning. Is that a reasonable summary?
Do you think the principle of the block chain, having the history embedded in
the object goes way beyond money?
Yes, it's going to fundamentally modify the notion of truth.
The notion of truth is a big one. Slightly less big but also important is the notion of
trust, being distributed in bitcoin. Some thoughts on trust, particularly today since
we're talking about social graph?
I think the idea of a society that is less libelous, that has more responsibility, more
consequences for saying things about people or entities that are designed to be
hurtful is going to be -- similar to the honor-your-good-name sir, credit issuance
or promissory note issuance, something like the block chain allows you to say I
really mean it when I say that was a terrible experience in this restaurant. My
name is on this and it's permanently embedded in the narrative of the block
It's a pretty trivial example, but it has a lot of consequences if you think about it.
Initially it might mean people will use it frivolously, but after a while they'll
understand how permanent it is. The notion of responding angrily at something
that is just not worth the permanence of you putting a black mark out there,
because it might in fact come back to you, is going to be an interesting society.
Having said that, I haven't thought through the society implications for bitcoin yet,
because my assumption is it's going to be largely a domain of the nerds for
another dozen years. The one interesting thing about bitcoin, the reason I'm
bullish on it overall is because the total market capitalization of bitcoin as it exists today
is a few billion, somewhere between two and four billion dollars. I don't think it's
more than ten and it's definitely more than one. The combined market value of
the brain cycles of the people that are thinking very deeply about bitcoin is
significantly more.
There are definitely more than a few billionaires that are spending lots of time,
effort, money investing, thinking, participating, trading, collecting bitcoin. There's
a direct arbitrage opportunity in the number of intelligence cycles versus pure
dollars that are available to bitcoin. For quite some time, even with the bitcoin
collapse to 300 dollars, it is still a cheap buy if you're thinking about the minds of
(indiscernible) thinking feverishly about what bitcoin could be. In that sense, I'm
very bullish. It will come up to something. It certainly will be a good buy until the
two prices equal out.
I love how you think like a physicist about comparing different things in order of
magnitude. It's great. The last question is a question I know people have, and
also people online have, which is what would be your advice to somebody who
wants to get started doing something in the field of data science, with the goal of
becoming something like a Max Levchin maybe?
Is this the part where I say kids, don't be like me? Hopefully I'm not that much
older than you guys, so I can still qualify for the kids status somewhat. I did tour a
kindergarten this morning for my son, so I am officially in the parenting category
There's two or three things that are painfully obvious but have to be said. Number
one, if you're going to start a company and you know you're going to start a
company, do it now. You have nothing to lose today. With every passing year,
you're graduating from a top-notch university, in a good program, you're going to
be very employable and marketable. That's a good thing but also a bad thing.
With every passing year of being highly employable, you will amass trappings
that are going to be painful to lose.
Once you drive a nice car you don't want to be in a beater. Once you live in a
nice apartment it's kind of crappy to sleep on the floor. If you're going to start a
company and feel ready, do it now. It's worth it.
If you are not ready, which is a very legitimate place to be -- if you think I'm an
entrepreneur, I just don't have all the skills necessary, either join a startup or get
an internship. An internship is a very easy way to dip your toes in because it
doesn't require you to stay very long. The only difference between an internship
and full-time employment is that full-time employment implies essentially a four-year
contract. An internship implies a summer-long contract. You could always
extend one and shorten the other, but the defaults are very powerful. You don't
feel bad when you say my internship is over, it's been 12 weeks. I loved this year.
I'll come back next year, or I won't. Versus I've been here for 12 weeks as a
full-time employee and I hate myself for it. I want to leave.
The two have the same outcome but it sounds much better if you're an intern. An
internship at a startup, a well-funded startup, that you were inspired by, that looks
kind of like what you would be working on your own is a great place to break in.
I'll tell you a tragic and not tragic story I alluded to. Igor, the visualization software
that we built at PayPal to address our fraud problem was built in a direct
response -- it was summer of 2000. In July of 2000 I believe we lost 13 million
dollars net, primarily to scalable fraud. Largely Russian and Ukrainian organized
crime -- I think they were signing up 20,000 accounts a day on PayPal. Using
incredibly large networks -- to the tune of 20,000 to 30,000 accounts per network --
they would bring in between 10 and 100 dollars from stolen credit cards, funnel them
immediately to the next ring of accounts, again and again, and move them
around the system programmatically. This is all done by script.
Eventually they'd pull it out in 100,000 to 200,000-dollar installments over the
course of a month. By the time people whose credit cards had been stolen would
wake up and charge them back, the money would be out of the system. We
would be on the hook for the money. Two million dollars left the building.
There's nothing we can do. Even the process of tracking down what cards were
involved was impossible because you literally have 20,000 credit card numbers
involved. Today it gets a lot of celebration when Target gets breached, and a
couple million credit cards get stolen, but that's been going on for a very long
time. The underground marketplace where these cards get traded has been
around forever. It's just now getting more publicity because it's such a
widespread problem.
Back then it was a problem in the payment systems people knew and the rest of
the world kind of ignored it. For the right reasons, you're not really on the hook for
the money. It's the merchant or merchant processor, or like PayPal, that ultimately
has to pay the price.
We were basically facing extinction. We had about six weeks of cash, and were
about to die. Primarily because of this. I had an intern -- this was (indiscernible).
Peter Thiel came back. He took a sabbatical basically after we merged with
(indiscernible). Elon ultimately left to start his crazy (indiscernible) of space.
Between now and then Peter came back to run the company.
Peter and I sat down and he said I'm going to go figure out how to raise money
so we can survive. You need to figure out how to stop fraud. I had a full
engineering team working very hard to scale the system and deal with everything
else, so I had exactly one guy that was going to work on this. His name was Bob,
and he was an intern. He was probably younger than most people in this room, at
Stanford. He and I brainstormed for several days, like what the hell do we do. We
have this problem, and we went to investigators who were tracking down these
stolen accounts, and literally walked into this guy's room, and he was a first Iraq
campaign vet. He was investigating this case, and (indiscernible), still someone I
keep in touch with on occasion. He had literally thousands of pages printed out of
Excel spreadsheets, tracking where the money went.
He had one big binder and said this is the 20,000 credit cards. I think I got them
all. I said wow, so we're not going to lose 16 million dollars next month as I
predicted. No, you will, this is from three months ago. What do you do for this
month? I haven't gotten through two months ago yet. Why don't you skip
forward? How am I supposed to skip forward? All I know is when somebody
reports the money is gone. That's when I start tracking this stuff.
Basically what would you do? What I would really like is something that shows
me a network of these accounts as they get created and the nodes in the graph
that are connected, because the edges contain all the information. The nodes
themselves are worthless to me. I see somebody signed up, they have a credit
card, indistinguishable from a normal user. People sign up, buy stuff on eBay.
People sign up and steal money. I can't tell the difference. What I can tell is once
they start funneling the money to one person in the center, that's the bad guy.
I said why don't we build a system that visualizes this stuff. So Bob and I spent
the next seven days fiercely constructing all these visualization systems, very
simple as it turns out, and basic data science, trying to filter out things, trying to
predict what would happen next. This was before even things like random forests
were a concept. This is all very Stone Age of data science.
When we finally had the first alpha we showed to the fraud investigator squad at
PayPal, one of the women there named Michelle literally started crying. She said
this will make all the difference in the world. You basically took something that
takes us like 100 hours to investigate and boiled it down to about a minute. This
is something we can act on, as soon as we see. One of the first visualizations
was this notion of a Christmas tree. The top of the Christmas tree is the bad
account that's withdrawing the money. And going downwards is the reverse
timeline of money coming in, filtering up, being withdrawn. Once we had this
Christmas tree, we trained all these analysts to look for Christmas trees, and as
soon as you see one, freeze every one of those accounts.
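A minimal sketch of the "Christmas tree" idea, assuming transfers are available as (source, destination) pairs and that an analyst has already flagged the withdrawing account at the top. Walking the edges backwards from that account collects every account that fed money into it, which is the set to freeze. This illustrates the graph walk only, not PayPal's actual tool; the account names and the function name are hypothetical.

```python
from collections import defaultdict, deque

def upstream_accounts(transfers, root):
    """Return root plus every account whose money can reach root."""
    senders = defaultdict(set)  # destination -> accounts that paid into it
    for src, dst in transfers:
        senders[dst].add(src)

    seen = {root}
    queue = deque([root])
    while queue:  # breadth-first walk against the direction of the money
        account = queue.popleft()
        for src in senders.get(account, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen

# Hypothetical transfers: stolen-card accounts feed mules, mules feed "top".
transfers = [
    ("a1", "m1"), ("a2", "m1"), ("a3", "m2"),
    ("m1", "top"), ("m2", "top"),
    ("shopper", "ebay_seller"),  # unrelated, ordinary payment
]
to_freeze = upstream_accounts(transfers, "top")
print(sorted(to_freeze))
```

The ordinary shopper never appears in the result because no path leads from that account to the flagged one, which matches the point that the edges carry the information and the nodes alone do not.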
One of the crazy tools we had to build was highlight everything you see in this
tree, one button, freeze all money. The next month losses were I think two
million dollars. So we basically nipped that problem almost instantaneously. So
the moral of the story -- there's unfortunately a very tragic end to the story. Bob
died a few days after his graduation. He had an acute case of diabetes and it
ended tragically. But I'm still close with his family and established a scholarship in
his name. His work was extremely important. If you ever dig through PayPal
patents, most of the interesting ones have his name on it. He was really
instrumental. His unfortunate and untimely death notwithstanding, the moral of
the story is it's not like you have to wait until you are Max Levchin to make your
mark. This kid would have been incredible had he lived longer, but even at 21, he
made an incredible difference and was literally responsible for saving a company
that is worth 30 billion dollars. Taking an internship at a company in something as
nerdy or seemingly impractical as data science could lead to very real results.
For what it's worth, if that's going to be your startup prep, that's what you're going
to do as either your job preparation for a data science role or startup role, that's a
good one.
Nothing much I can add to this, but let's give a round of applause to Max.