Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics
Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
April 6, 2009
Class 1 Overview: (Part 2 of 2)
This transcript:
Corresponding audio file:
Previous Transcript: (Part 1 of 2):
To see the whole series: Containing folder:
Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/
All right, are we ready for part two? Let me start with this: is there anything you would briefly share
from your discussion during the break with others, any comments? Thanks for some of
you coming up and introducing yourselves. Are there any comments, questions?
Logistics we’ll do at the end.
I was wondering, is it…Google knows what is contained…?
The question is about legality of collecting stuff. Is anybody worried about it? You are
always free to not use a website. The only security is just to unplug your computer. The
question about privacy we will do later. I mentioned earlier today the question about who
owns the data. It’s a question people used to ask in the 1990’s when owning means
having something. I own that coffee, thanks for getting me coffee.
With digital data it’s different. I’m not a legal guy. Believe me, I’ve had many discussions
with the Amazon main lawyer about what it means to collect data, particularly in the
European [0:01:13.6 unclear], that’s different. Last year, Harry who works at Facebook
who was a student here last year, we had some discussions with the Facebook people
about what should be accessible. The point I made is I think it makes no sense that by
clicking I can access more information than I can when I do it programmatically.
My background was; I’m not sure if anybody told you from last year, we had this Friday
afternoon app. It is Friday afternoon. You are in a dorm room. You have nothing
planned for the evening. You want to know which of your friends’ friends are the
gender you are interested in, are interested in your gender, and may be single.
It turns out that you could click through that, but they didn’t allow us to do that
programmatically. So there will be questions coming up like this, but they are not really
good questions. They are just questions about the state of the world and it changes all
the time. Is there anything else?
One question I have is you were talking about how to make more relevant decisions, to
help people make their decisions or to give them more relevant data. Do you think, over
time, this sort of homogenizes that social culture….?
That is a super great question. After my PhD, I lived in Bangkok, in Thailand. I taught in
the university there. That was before the web actually was available. I know that we
constantly debated at [0:03:00.7 unclear], whether or not if something like the Web
some of us heard might be happening at some stage in the future, would it
increase the digital divide or whether it would smooth it over. To date, I don’t have
the answer.
One thing is clear that by now, the largest number of Internet users of any country is
China. In 1994, when I was first teaching at [0:03:22.7 unclear], there was one terminal
at the Academy of Sciences, across the street, where I could go to try to read my email.
The world has dramatically changed. I have no opinion. I have no idea how it would
work. It’s a good question. I know that many people worry about it. Ultimately, it’s
up to people like us, whether we increase the divide, or whether we help bridge the
[divide].
I promised you another lecture of about half an hour. Then, I will tell you about the first
problem set, and I will introduce you to the people who are helping with this class. The
last sort of ten minutes is always Q&A. From 5:55, I’m available, so if you have questions
in private, we can do it then.
We will have half an hour of content. That is actually half an hour I gave a week ago in
San Francisco at the Facebook Developer Garage. The company I’m advising, called
Contagent put together an evening with eight speakers, good people. It’s all on the
web. We talked about metrics, frameworks and how to measure stuff. I believe
that if you measure something, you can make progress. If you don’t measure it,
you don’t know whether you make progress. It’s just hard.
The first problem that there will be, and this is due a week from Sunday, is to think about
metrics in a particular context.
PHAME is the framework that I will really try to structure each class around. That
means that first of all let’s be clear about what the problem is. The problem could be
to get people to join a certain Facebook page. If that is your problem, you could have
certain hypotheses: how do you attract people? You could put flyers under the peep hole
in your dorm. You could hit up all your friends in email, or whatever it is.
If the problem is a different one, not to attract people but to retain people, then you do
very different things. You try to build a good product. A product that people want to use.
The hypotheses might be that if you let them have persistent data about themselves,
and more importantly, the idea behind Facebook newsfeed, if you give them new stuff,
interesting stuff their friends create all the time, that will create a reason for them to come
[back].
I’m actually looking forward to reading your answers here, tonight. One of the
questions I’m very curious about is the question about what makes people share
information. What makes people come back again, and again, and yet again? I think
that one of the drivers of addiction, whether it is on Wall Street, where people have
something at stake and the world changes and affects them, whether they’re there or not;
whether it’s on eBay and you have an item out there and maybe you are bidding for
something and the price goes up; whether it’s an MMORPG, massively multiplayer online
role-playing game, where you have that role and if you go to sleep something might
happen to it.
I have friends in China who actually set their alarm every hour so they can go back to
their MMORPG and check on how it’s going. They all have in common that you do
something and as you are away, things happen to it and you are curious about what
happened. That would be the difference between building something which is for
acquisition, where we know the rules of virality, how you build this through viral
routes, and retention, which ultimately is, in the long term, the important element,
namely, how do you actually create something that is useful to people so they will
come back. These would be different problems. The hypotheses we gave.
Actions – what could be actions? Let me give you an example for actions, a much
easier example. That’s the example of Amazon.com. It’s a very simple question;
should this shopping cart, where you just add that Miles Davis CD, be on the right
hand side or should it be on the left hand side? Who thinks it should be on the right
hand side? Who thinks it should be on the left hand side?
I will tell you. We don’t know, but we can measure it. You are right, because you are
all trained to go to Amazon.com and you know how it goes. Believe me, over the last ten
years, I had a lot of conversations. But, in countries where people read from left to right
or right to left or from top to bottom or bottom to top, it should… we don’t know but we
can measure it.
Amazon.com of course, measured it. The key question is what are the metrics – the
M in PHAME. In this case, we looked at conversion rate. I looked at order size.
Typically, these are 1% effects.
The point I want to make with this is that ideas we have can and indeed should
help us come up with hypotheses. That’s not the end of it. These hypotheses
should enable us to come up with actions we take, differential actions, left versus
right. Then, we just simply measure what looks better, given a set of metrics we
decide on.
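The left-versus-right comparison can be sketched in a few lines of Python. This is only an illustration: the visitor and order counts are invented (the lecture gives no numbers beyond "typically 1% effects"), and the two-proportion z-test is one common way to compare conversion rates, not necessarily what Amazon used.

```python
import math

def two_proportion_z(orders_a, visitors_a, orders_b, visitors_b):
    """z statistic for the difference in conversion rate between B and A."""
    p_a = orders_a / visitors_a
    p_b = orders_b / visitors_b
    # Pooled rate under the hypothesis that placement makes no difference.
    pooled = (orders_a + orders_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return (p_b - p_a) / se

# Hypothetical counts: cart on the left (A) versus cart on the right (B).
left = {"visitors": 1_000_000, "orders": 50_000}
right = {"visitors": 1_000_000, "orders": 50_500}

rate_left = left["orders"] / left["visitors"]
rate_right = right["orders"] / right["visitors"]
relative_lift = (rate_right - rate_left) / rate_left
z = two_proportion_z(left["orders"], left["visitors"],
                     right["orders"], right["visitors"])

print(f"relative lift: {relative_lift:.2%}, z = {z:.2f}")
```

With these made-up counts the relative lift is 1% and z comes out at roughly 1.6, which suggests why 1% effects need a lot of traffic before the measurement settles the question.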
The metrics are super important. If you don’t get the metrics right, you just have
the company do the wrong thing. For example, at Amazon, Jeff Bezos and I spent
maybe fifty hours going through the various groups at Amazon, discussing and deciding
with them what should the metrics be for the group that does cross-selling, or
recommendations, or search and so on.
If you have a group of very smart people then believe me; if you make a mistake in the
metrics, they will do precisely the right thing for those metrics and the wrong thing for the
company, as a whole. I have examples, but we don’t know each other well enough yet; I
don’t want to share them now, with the cameras running.
I want to give you a couple of examples here to show you one thing that I, as a physicist,
deeply believe in. That is that you want to look at distributions of data as opposed to
just sort of a more computer science approach of just getting the mean or
something. Actually, Jeff and I had a game going where I told him, “You tell me what
number you want, and I’ll tell you how we achieve that number. Do you want every
session of seventy clicks; I will give you every session that has seventy clicks.”
What am I talking about here? This graph has a distribution of visit lengths, how many
clicks per visit. A visit is defined as between midnight Seattle time and midnight Seattle
time – twenty-four hours. Of course, to be future compatible, I have Chinese here
already. These are recognized purchasers. This means we know who the person is
when they buy something. These are recognized non-purchasers; we know who the
person is, but they don’t buy in the twenty-four hour period. These are
unrecognized non-purchasers, so we don’t know who they are and they don’t purchase,
and at the bottom you have internal stuff.
Look at this graph. Do you see anything surprising?
There are three or four spikes…
There are a bunch of spikes. There are these spikes over there and where are they?
They are on that black curve, which are the unrecognized non-purchasers there: non-purchasing, not recognized. Do you have any idea what that could be? It’s unlikely that
more people click two hundred times than one hundred ninety-nine, or two hundred and
one times. Maybe it’s not people but spiders.
What else do you see? I am actually very curious about this one here. Things that
are not monotonically falling are really weird. Why on Earth would more people, these
are recognized, non-purchasing people, click eleven times than ten times? That’s
where you need to have some domain knowledge. You need to know what we set
up. What was set up there was this funny thing called Gold Box and there are too
many stories about Gold Box.
People come every day to see whether these ten (random) items that are put in the
Gold Box for them, whether they will find a deal there. People like to do deals.
People like to get a good deal.
The most interesting one, a third of all sessions are single hit sessions. They are
sessions that are bouncers, only one hit within twenty-four hours. Is that a good
or bad thing? We have no idea. It could be somebody going and checking their
sales rank, their Amazon rank every day. It could be robots that change identity
for each click. It could be people who are at that porn site and there is the link to
Amazon and they don’t want to be at Amazon. By mistake, they click on that.
Whatever it is, we don’t know what it is because they have no persistent history.
You see, if you just report the average number of clicks per session, or pokes or
posts or messages, it’s not as rich as trying to understand what it is that’s going
on. I want you to look at distributions of data.
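A small Python sketch of why the distribution says more than the mean. The click counts are invented to mimic the features described in the lecture (many single-hit sessions, a bump at eleven clicks, a spike at two hundred); they are not Amazon data, and `non_monotonic_points` is a hypothetical helper name.

```python
from collections import Counter
from statistics import mean

# Hypothetical clicks-per-session values for 100 sessions.
sessions = ([1] * 33 + [2] * 12 + [3] * 9 + [5] * 8 +
            [10] * 4 + [11] * 9 + [20] * 5 + [200] * 20)

histogram = Counter(sessions)          # clicks per session -> number of sessions
average = mean(sessions)               # the single number a report would show
single_hit_share = histogram[1] / len(sessions)

def non_monotonic_points(hist):
    """Click counts that occur more often than the next smaller observed count."""
    ks = sorted(hist)
    return [k for prev, k in zip(ks, ks[1:]) if hist[k] > hist[prev]]

print(f"mean clicks per session: {average}")
print(f"share of single-hit sessions: {single_hit_share:.0%}")
print(f"places where the curve goes up instead of down: {non_monotonic_points(histogram)}")
```

In this toy data the mean lands on a value that no session actually has, while the histogram immediately points at the bump and the spike that need domain knowledge to explain.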
I have two more examples. This one uses persistent history and says I’m looking at
everybody who bought something today. When did they first look at that item at
Amazon.com? Indeed, 80% looked at it today, for the first time. Then, there is a weekly
periodicity. Does it make sense?
On the weekends, or maybe you play volleyball on Tuesdays and your volleyball friends
say, “Did you buy that book,” and you say, “Gee, I forgot. I looked at it last week,” and
then you go buy it. It could be weekends or any other periodicity.
That made me quite happy when I pulled those data. It’s unlikely that I make a mistake if
something that makes eminent sense like a weekly periodicity shows up. On the
flipside, I don’t have the graph here, but I looked at data, not on the weekly scale, but on
the hourly scale.
I found that an enormous number of people, a huge peak, buy eight hours after
they first looked at the item. I don’t have the graph, so I’m drawing it for you in thin
air: you expect that it drops, but it doesn’t. You have another big peak. We were
scratching our heads and asking, “How is this possible?” How is that possible?
Yes, but you wouldn’t get exactly eight hours that way. It took me two days and a good
dinner I promised whoever solved the problem.
The click log is in one place where … and another place.
That’s perfect. You would have gotten the dinner had we met five years ago.
These are small things that you spend your time on. You only learn this the hard way,
and I was stupid: I first removed all negative times. Had I kept the negative [0:15:19.6
unclear] times in, I would have realized there is a peak at minus hours, so of
course. This is what happens when you are trying to do things too quickly.
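The transcript does not spell out the cause, so the following Python sketch rests on an assumption: that the view log and the order log were stamped by clocks eight hours apart. The records, the offset, and the helper name `hours_between` are all hypothetical. The sketch only shows how filtering out negative time differences hides the clue.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumption for illustration: the two logs use clocks eight hours apart.
ASSUMED_OFFSET = timedelta(hours=8)

def hours_between(first_view, purchase):
    """Signed whole hours from first view to purchase; negative values are kept."""
    return int((purchase - first_view).total_seconds() // 3600)

base = datetime(2009, 4, 6, 12, 0)
records = []  # (first_view, purchase) as recorded; all are really immediate purchases
for i in range(5):
    t = base + timedelta(minutes=i)
    records.append((t, t + ASSUMED_OFFSET))   # order clock ahead of view clock
    records.append((t + ASSUMED_OFFSET, t))   # view clock ahead of order clock

all_deltas = Counter(hours_between(v, p) for v, p in records)
filtered = Counter({h: n for h, n in all_deltas.items() if h >= 0})

print("keeping negatives:", dict(all_deltas))
print("after dropping negatives:", dict(filtered))
```

With the negatives kept, the symmetric pair of peaks makes a clock mismatch an obvious suspect; with them dropped, the remaining peak looks like real customer behavior.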
I’ve many stories like this. I have a whole talk about all the mistakes I made. Here is
another one I think is a beautiful curve. I’m interested in the pricing of virtual items like
virtual gifts and stuff like this. These are real items and these are real data. What is on
the X axis is the gross merchandise sales, so how much was the order size in dollars.
The Y axis is the counts; this is a histogram. [0:15:52.9 unclear] means straight line is a
power law. I hope you know that stuff.
What turns out is it has to be a straight line. Jerry Friedman quotes his PhD by it.
Jerry was my PhD advisor. He had a PhD advisor and his PhD advisor’s PhD
advisor said, “You know, if you take the log often enough, everything will be a
straight line.” In this case, it’s enough to take an X log and a Y log, straight line, with an
exception that under $25 you have a mixed distribution, under $25 something else is
happening. It makes sense because at that stage when I did this graph, Amazon had a
$25 and above free shipping. Of course, nobody would pay something like $24 plus
shipping if they can just add another item and get the shipping for free.
These are all the things you need to look at. If you just take an average of this plot, there
is no meaning in averages of this plot. Looking at data is something that is very close to
my heart. I’m telling you these things today because I said the first homework is I
want you to come up with metrics. These metrics should be something robust. By
the way, what would be a robust metric to describe this [0:17:04.3 log/log] plot here? I’m
curious. I’ve never asked this in class? I did it with one company but I’ve never asked a
class. What would be a way of describing it?
What’s the vertical line between when the …?
It’s clearly a mixture distribution, so one question is where the cut is between the
distributions. If we ignore the thing on the left hand side, how do we describe the
[power law]? What parameter do we have left? Its slope, and then of course the
question is how do you compute it. Do you weight each point evenly or stuff like
this; some details here, but what we want to know is what’s the slope. That’s the
same for Pareto Distribution if you are in economics, or Zipf’s Law if you are in
linguistics. Basically, if you have these scale free distributions, all you want to
know is what’s the slope.
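One way to get that slope is an ordinary least-squares fit in log-log space, sketched here in Python. The order sizes and counts are synthetic (generated from an exact power law with exponent -2, starting at $25 so the mixed region on the left is ignored), and weighting every point equally is just one answer to the weighting question raised above.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x), every point weighted equally."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var

# Synthetic histogram: order size in dollars -> number of orders (count = 1e6 * x^-2).
order_sizes = [25, 50, 100, 200, 400, 800]
counts = [1e6 * x ** -2.0 for x in order_sizes]

slope = loglog_slope(order_sizes, counts)
print(f"estimated slope: {slope:.3f}")
```

On real, noisy counts the sparse right-hand tail can dominate an equally weighted fit, so the choice of binning and weighting is part of defining the metric.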
Don’t come and say you are interested in the mean of this. Tell me we can
characterize the distribution by slope. Here is another graph. These were retail data
from an undisclosed company, around 9/11/2001. The reason I’m showing you this
graph and the next graph is these are the graphs I have from [0:18:19.7 Ronny Kohavi].
You don’t see all that much here. If you plot on the X axis the weeks and then in a
circular way, the days, you clearly see that something happened here on 9/11. That
Tuesday was not a very busy day in terms of undisclosed retailer selling stuff. In this
graph you don’t see much. In that graph it’s obvious; this was a holiday, Labor Day. This
was 9/11.
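The idea of lining up weeks against days of the week can be sketched as a small grid in Python. The sales index values are invented (100 on a normal day, lower on Labor Day and on 9/11), and `week_by_weekday` is a hypothetical helper; the original plot's exact layout is not recoverable from the transcript.

```python
from datetime import date, timedelta

def week_by_weekday(daily):
    """Turn {date: value} into {monday_of_week: [Mon..Sun values, None if missing]}."""
    grid = {}
    for day, value in daily.items():
        monday = day - timedelta(days=day.weekday())
        grid.setdefault(monday, [None] * 7)[day.weekday()] = value
    return grid

# Hypothetical daily sales index for four weeks starting Monday, 27 August 2001.
start = date(2001, 8, 27)
daily = {start + timedelta(days=i): 100 for i in range(28)}
daily[date(2001, 9, 3)] = 60    # Labor Day, a Monday (value invented)
daily[date(2001, 9, 11)] = 40   # 9/11, a Tuesday (value invented)

grid = week_by_weekday(daily)
for monday in sorted(grid):
    print(monday, grid[monday])
```

Reading down a column compares each Tuesday with other Tuesdays, which is what makes an unusual day stand out where a plain time series buries it in the weekly rhythm.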
The other point I want to make is people are social beings. That had been pretty
much ignored by traditional marketing. They sent people trees back home, paper
catalogs, and they never listened to what people actually had to say.
Another experiment was done by a friend of mine at NYU. It was to compare
recommendations that use the social structure, who is connected to whom, with
recommendations that ignored the social structure but used the most
sophisticated statistical models available, [0:19:35.3 unclear]
There is a lot of experience, a lot of traditional data, demographic data, geographic data,
and loyalty data. The product was a new communication service, like a new service
on your phone. The comparison I make now is in green, you have these sophisticated
traditional models. In orange, you have something very simple. You just market to the
people who make phone calls with the people who bought the item.
If we call each other, and you bought that service, they would hit me up and say,
“Hey, do you want that?” It turns out that the response rate here is 0.28%, and here
it is 1.35%. It’s almost five times as large.
The lesson to be learned from here is people are social beings and knowing who
hangs out with whom is very powerful, often more powerful than building
complicated models. Do you remember, the 1990’s was about given a set of data, what
insights can we get? Some people are still waiting for those actual insights. This decade
is what is the problem? What data can we go and get.
In this case, we look at calling data and there we go. Your neighbors are in the graph, so
it’s people you are connected with.
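A minimal Python sketch of the "market to the neighbors of buyers" rule, plus the arithmetic behind "almost five times". The call records and names are made up, and `network_neighbors` is a hypothetical helper; only the two response rates come from the lecture.

```python
def network_neighbors(calls, buyers):
    """People who exchanged a call with at least one buyer and have not bought yet."""
    targets = set()
    for a, b in calls:
        if a in buyers and b not in buyers:
            targets.add(b)
        if b in buyers and a not in buyers:
            targets.add(a)
    return targets

# Hypothetical call records (caller, callee) and current buyers of the service.
calls = [("ann", "bob"), ("bob", "cat"), ("dan", "eve"), ("cat", "dan")]
buyers = {"bob"}

targets = network_neighbors(calls, buyers)
print("market to:", sorted(targets))

# Response rates quoted in the lecture, in percent.
traditional_rate = 0.28
neighbor_rate = 1.35
lift = neighbor_rate / traditional_rate
print(f"lift: {lift:.2f}x")
```

The targeting rule needs nothing but the call graph and the list of buyers, which is the contrast with the demographic, geographic, and loyalty models it was compared against.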
Metrics have a big impact on what people try to optimize.
Let’s pause for a moment and think about what really is important. It’s very
important to understand what is scarce, as opposed to what’s abundant. If you
want to build cool apps, you have to help people with their scarcities. For
instance, one of the scarcities we have is time, or attention.
If we help people do more with their time, direct their attention to more interesting
things, we are doing a good job. One of the things that used to be scarce is
information, whatever that is. Now, we know most answers are somewhere but in our
finite attention, how can we get to them? That is one of the frameworks that is
quite important.
What are the real costs? Many people only think about dollars. The real costs now
are very different ones. What I want to always point out is there is this thing called
social capital. Does anybody know what social capital refers to? Who knows what
social capital is?
Reputations? Yes, so for instance I have a certain social capital. I’m happy to
introduce people to each other. I’m happy to help students with projects and with
startups. So, I have built social capital. If I need a favor, they are happy to help me.
On the other hand, if I was spamming all of you all day long, I would pretty much
have spent my social capital by now. You would probably say, “Oh, another message
from Andreas,” delete it before reading. Social capital or social cost is often more
important than monetary costs for people.
Another one we have seen is the cost of interruption being vastly underestimated. We
all have situations every day where we are in the middle of something and the phone
rings. Then we quickly answer the phone call and, “Gee, what was I doing before,” and
swapping all these processes back in. It’s something people way underestimate in
terms of how expensive it is.
The interesting thing is that we might cognitively know that we shouldn’t be answering
that phone call, but we’ll still do it. A friend of mine received a Nobel Prize in 2002,
Danny Kahneman. I usually visit him in New York, we’ll talk for a few hours about what’s
cool and new. The last time it was Twitter. The time before that it was Facebook. Last
time he asked me, “Andreas can you do me a favor?” I said, “Sure”. He said, “I have this
PC and they always have these alerts coming up even when I get a new message. I
know I shouldn’t be looking at it but I can’t help it. Can you please turn that off?”
I think if you are not strong enough to resist the temptation of shiny objects; don’t worry,
you’re not the only ones.
Sometimes, what we have learned, which was good for us when we lived in caves – not
most of us, but our ancestors, might not be the best thing anymore. I think the fact that
my mobile phone doesn’t work on campus is probably a good thing.
How can we help manage peoples’ attention? The first insight is giving them
insight in how they are spending their attention. On my blog,
www.weigend.com/blog, I had a very interesting conversation in July of last year,
with Esther Dyson and Shoshana Zuboff. We talked about why people do these
things, and basically boiled it down to people want to spread their memes and
people want to spread their genes. From that, all the rest follows. If you want to
download an hour and hear us rambling at 7:30 in the morning, it’s on
Short term versus long term – It’s easy to measure short term stuff. The metrics I
want you to come up with, some of them should be short term metrics. Those you
can do with A/B tests, which I showed you before; left/right is an example of what’s
called an A/B test.
Long term things are very hard to measure. Ultimately however, the long term
things are the ones that matter. If you just do short term things, it is not that you’ll have
a big impact. It’s like many games; people play the game and people play another game
and people play yet another game.
How can you actually come up with metrics that measure long term user
satisfaction? At the beginning of class today, we talked about how we go from a
transaction economy to a relationship economy. Relationships are long term
relationships like two-night stands, three-night stands, at least.
What do we have on the [0:26:24.5 unclear]? It’s a little bit difficult with this class
because this class is only nine weeks. We can’t really have good, long term metrics.
One of the awards of the class will be that in half a year from now, the end of
October, we’ll have a dinner with a group, whoever wins the contest I will describe
in a moment, here. We have some long term component and we will show you the
long term metrics that you devise throughout the next year.
I want to give you one example here from this field that is called behavioral
economics. There is classical economics, neoclassical economics, which tells you
how people ought to behave. For instance, want to experiment in this? I don’t know
your name, Blake? And Matt? Matt is just checking his SMS.
Matt is done checking SMS right now. If I give Blake $10 and say, “You can now play a
game,” and you offer some amount to Matt, and if Matt thinks it sounds like a deal and
out of these $10 you both keep whatever your respective amounts are; if he doesn’t think
it’s fair, I get my $10 back.
I give him $10 and he says to Matt, “Hey, I’ll give you a penny.” What would Matt say?
Fuck you, probably. That is the end of that game. If he says $3, Matt would say, “Okay”.
It’s very culturally dependent, by the way. Russia is totally different from China, which is
totally different from Europe which is totally different from the U.S.
These are things where a neoclassical economist would say, “One cent; we’re both
better off thank you,” and there we go. That’s not how people do things.
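The game just described (usually called the ultimatum game) is easy to write down in Python. The two responder rules are illustrative assumptions: `payoff_maximizer` stands in for the textbook rational agent, and `fairness_minded` uses a hypothetical 30% threshold chosen only to match the "$3 would be okay" remark.

```python
def ultimatum(total_cents, offer_cents, accept):
    """Return (proposer_payoff, responder_payoff) in cents for one round."""
    if not 0 <= offer_cents <= total_cents:
        raise ValueError("offer must be between 0 and the total")
    if accept(offer_cents, total_cents):
        return total_cents - offer_cents, offer_cents
    return 0, 0  # rejected: the money goes back and nobody gets anything

def payoff_maximizer(offer_cents, total_cents):
    """Accept anything above zero, since one cent beats nothing."""
    return offer_cents > 0

def fairness_minded(offer_cents, total_cents, threshold_percent=30):
    """Accept only offers of at least threshold_percent of the total (assumed value)."""
    return offer_cents * 100 >= threshold_percent * total_cents

print(ultimatum(1000, 1, payoff_maximizer))    # textbook prediction for a one-cent offer
print(ultimatum(1000, 1, fairness_minded))     # what tends to happen in the classroom
print(ultimatum(1000, 300, fairness_minded))   # a $3 offer out of $10
```

Since the acceptance threshold is described as culturally dependent, it is left as a parameter rather than a constant.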
The other example I want to give you is from Draeger’s, you know the store in Menlo Park?
Draeger’s has – I didn’t believe it until I saw it – 360 kinds of jams. Here was the
experiment; Stanford Psych 101 students, dressed as if they were normal shoppers,
all were there as counters and counted who was coming in. In one experiment,
there was one table that had six jams on it. They counted how many people were
actually poking there and trying some of these jams. It turned out, and I actually
didn’t look up the numbers right now, it was something like 30%. I know that 30%
Now, in the other case, they didn’t have six jams, but they had 24 jams and not like
weird stuff but the 6 there before and another one, up to 24, of reasonable stuff.
Honestly, I’ve never seen 24 jams on a table. Indeed, the ratio of people who went
and checked out those jams was much higher, however, the purchasing rate
dropped to only 3%. People said, “That’s interesting, I will worry about this next
time,” and they actually didn’t buy. That’s the baseline for jam purchasing.
The point here is that what we evolved for in dealing with options, where you
always want more options, might not be the right thing online. When you develop
an app, try to keep that in mind; the classic approach of economics of people
being rational agents with unbounded rationality isn’t how things work.
The third and last example I’m going to give you here is from The Economist, after
all, we’re talking about economics. It’s a weekly paper coming from the U.K. It’s a good
paper. Here is what they did; they had, for their subscriptions, [0:30:06.7 unclear] the
possibility of getting only the online for $59, for $125 print, and also for $125 print and
online. I’ve given you the numbers and this is from Dan Ariely, about all these
three options. It turns out that people are good at comparing stuff but people are
not good at absolute things. People are good at saying, “If print is $125, and
print and web is $125, this is a better deal than that; after all, this is The
Economist, right?” It turns out that nobody takes this one here, 84% take that one.
Then The Economist thought, “If nobody is choosing the middle option, we might as well
remove it.” What do you think happens? The preference which used to be that the
majority of people prefer this over that, flips. By adding a choice nobody picks, or
by removing a choice nobody has picked, people’s preferences reverse.
It’s just like the one-cent story, or many other stories I have.
Why am I telling you all of that? I don’t know these things and I’m not the only one. I
always argue that people don’t know what they want. People are good at making
up stories as they go through the world. That’s why I have this PHAME framework.
We think about the Problem; we think about Hypotheses; we think about Actions
we can take, that’s the key thing; we think about Metrics, metrics need to be
actionable, accessible, and auditable; and then finally we do Experiments. Are we
done? Of course not. We are always in business because experiments lead to
new problems, new hypotheses, actions, metrics, experiments, and so on.
What’s missing in this picture? Did you miss anything? Where is data mining?
Where is data? The point is there is no data in here. This doesn’t mean we don’t
collect data, but it means that the game here is not having a set of data and
doing analysis; the data we collect here in the experiment are driven by the
problems we are trying to solve.
That is really the mindset I have for you here.
What I am personally interested in is virtual gifts. If any of you are interested in that, I am
probably doing some fun stuff with Facebook. We have a film maker who wants to do a
movie – 99 red roses; whoever gives some people real roses, other people give virtual
roses. Then we will see what the differences are. What are the metrics; getting laid,
probably, I don’t know what the metrics are. [Laughter] That will be an experiment. Any
volunteers, talk to me about it.
Now, we need to get to the serious stuff because I told you at 4:30 it’s time to talk about
serious things. First of all, in the hierarchy of who can write - www.weigend.com, only I
write to. www.weigend.com/teaching/stanford has information about the class. If there is
something wrong, do email me. I don’t pay $1 for a bug, but maybe I should do this. If
there is a bug on the site I will pay you $1.
All of you should have access to the class wiki, which is
www.stanford2009.wikispaces.co. You can also look at
www.stanford2008.wikispaces.com to get a feeling of what people did last year,
www.stanford2007.wikispaces.com and so on. There, all the people who have written
their email addresses on this union of the email addresses we have will get an email from
us when I get home tonight, giving you access to that page. If you want in between, hit
“give me access to that page,” and one of the TAs will grant you access.
The main thing on it is homework 1. For homework 1, I had an idea which is I went to
my friends at Facebook and said you did this thing with a Facebook class two years ago.
It was great getting ten million users in ten weeks. Unfortunately, when I asked people in
my class half a year later how many were still using it, one group had seven. The other
group had nine people, another group was just one guy and his friend.
In some ways it was great for acquisition, but it really failed for retention. The task
we came up with is to create an active, meaningful, and persistent community of
Facebook pages for some cause that promotes the social data revolution.
We first wanted to have some more real world examples, but again, the lawyer said you
can’t pretend to be somebody else. We can only pretend to be here about the social data
revolution. When I actually thought about it, it’s actually not that limiting.
It can be pretty much anything where you have some idea about sharing data. It
could be about health data, about whatever you want to think about. This is a
group project. Before you leave today, before we shut class down at 5:05, I want
everybody to have found partners for a group, three to six is a reasonable size. Try to
make it diverse. You need some people, you have a lot of people here from the Business
School who have some ideas about marketing. You want some people who can actually
get stuff done. [Laughter] Did I say something wrong? You know what I mean; you want
people who can write, who can make pretty graphs, because today we are assigning this.
On Sunday afternoon, I want you to come with a proposal for metrics. The reason I
want this on Sunday is so I can give you feedback on Monday in class. Then, for
the subsequent Thursday, I want a brief progress report of how you are doing on
these metrics. Between Sunday that week, I will be at Facebook and we will see what
metrics they might not have publicly, but I think we have great ideas they should
implement for our classes.
The first report will be just what we have. By the 23rd, we will have a stable set of
metrics. We will compare on a web page and if somebody wants to help out a little bit in
creating some of the infrastructure, it’s not that much work. I would love for you to come
up to me afterwards. On that page, we will have all of your groups, on a daily basis,
competing against each other. We will see whether the metrics I decide on, after I get
your metrics examples, are the good ones, and we will see how well you are doing.
This element that you learn from each other is actually key. It is not just [0:37:31.5
Enrique] who actually wrote the homework and said to come up, but it’s not only
from each other. There is another twist that is a fun twist to this experiment.
If you stand close enough to me and talk loud enough, you should be picked up by the
microphone. This is Enrique here; he took the class last year. If you want to hear all the
complaints, talk to him and his friends.
I’m Enrique. I hope all of you stay in the class. While I was writing this assignment, I
actually wrote it first for the Haas Business School class. In the actual writing of the
course, I put in parenthesis that this will actually be a competition between our class and
the Haas students. I said that undergrads here will beat them, until Andreas found that
on the assignment and made me delete it. I firmly believe that you guys will create ten
times better pages than they do and eventually win and go out to dinner with some cool,
distinguished guests, and actually present your pages as case studies for Facebook. I
think you have some incentives and we’ll talk more about how [0:38:37.1 cough] those
incentives to make you guys actually get creative and try to explore and push the
boundaries to the fringes of social media.
I figured it’s a fun thing and since I actually am in this situation that this quarter I teach
this class called Marketing 2.x at Berkeley. Berkeley is the public school on the other
side of the Bay, as well as having you guys here. We carefully timed it so I will also tell
them today about the assignment, and all these deadlines are identical to their deadlines.
We will also have on the summary page the comparison between your average and their
average. We will know. I don’t know the answer but we can measure it. [Laughter]
I created two Gmail addresses. This one is if you have questions for the TAs. If you are
still here, do you mind standing up? We have Enrique, our social media TA. We have
[0:39:34.0 Xian Thung], the guy over there, who is from [0:39:39.2 unclear] Province,
which on the way if you take the train from Shanghai. I also have a house in Shanghai,
by the way. Take the train from Shanghai to Beijing and you go to …. He did his
graduate degree at [0:39:48.4 unclear], at Peking University in maths. He’s much
smarter than I am. His undergrad program was People’s University in Beijing.
We also have [0:39:56.8 Xian Yushuei], who went to [0:39:59.2 unclear] in …. Province
and they’re grad students in the STATS department. They will be helping you with
homework, I hope.
We are also still looking for graders. If you have any friends, although they shouldn’t pay
favors to you, if you have any friends who will actually help out for whatever the school
pays graders, let me know.
Also, last year I had Eric Sun, who was the most wonderful writer for my blog, but Eric is
so busy with the concerto competition so he told me I should ask the class whether any
of you is willing, for pay, to help me write a blog entry every now and then. It’s actually
for me. English is a second language and all of that stuff so it’s not that easy to write
well. That’s a personal request I have for you.
Then for the script for the metrics collection and display for class, if somebody is willing
to put in a couple of hours and is good at Perl or Python, that would be great, so we
can get this done.
For more contact information, I mentioned my page, the wiki, www.socialrevolution.com
thanks to Enrique who is a guy who gets stuff done. It points to the Facebook page. It’s
the same as www.facebook.com/socialdatarevolution.
One thing I will do tonight is I will see what questions you had here in the questions I
asked you. If you could ask a student in the class one question related to the social data
revolution, what would it be? From the seventy or so copies here, I will see which ten
questions or so are really cool questions. I will create a Google form and I will mail out
the URL of that Google form and ask you to spend ten or fifteen minutes tomorrow, by
tomorrow evening, to give us the answers. We will make the answers public
anonymously, for the rest of the class.
I want to know where your heads are. You remember the story with the fish in the water?
I need you to explain the water to me.
We already talked about email. I’m easily reachable. I am at
www.aweigend.stanford.edu. That is it from my perspective, as far as logistics is
concerned. The homepage, if you will, on www.stanford2009.wikispaces.com, by
tomorrow morning, will have all the logistics questions answered. Right now, I couldn’t
access it from the department so it’s not up yet. I apologize for this.
These are slides. I will put the slides and everything up on this page, which is linked
from www.weigend.com/files/teaching/stanford/2009/. You will get it by clicking on the
links. Are there any questions? We have about twenty minutes left.
I was thinking about metrics collection.
There is a certain set of metrics that are available on Facebook. Do you want to show…
I think the best thing is to create a page. It will take a couple of minutes to create a page
and they’ll see over these next couple of days they’ll start to get…. There are a number of
web 1.0 metrics available…
These are not the most advanced metrics in the universe. It is a good starting point. I
believe I have a screen shot of them here. While I’m looking for this, do you have any
other questions?
Let me show you… you never know what comes up on the screen, like Danny
Konnerman said. This is not the best screenshot possible. What we have is page
views, total interactions, unique views, fans, removed fans. That’s very important:
removed fans. Actually, when we talk about the costs, you really
should consider the huge cost if somebody says, “I don’t want to hear from you
anymore.” They have friends and tell their friends, and stuff like this. Wall posts,
discussion topics, reviews, audio plays, video plays. These are the standard metrics.
I want you to use your creativity in coming up with metrics that you think make
sense. I’ll be at Facebook next week and will try to convince them to implement
the ones that I think make sense from the ones you think make sense. It really is
an exercise for us to change the world, in order to move from these web 1.0 metrics
of unique users to true metrics of engagement.
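As one possible starting point for moving from raw counts toward engagement, the counters listed above could be combined into ratios. This is a sketch under stated assumptions: the field names, the `engagement_summary` helper, and the numbers are all hypothetical, and "fans" is treated here as new fans gained in the reporting period, which the lecture does not specify.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    # Hypothetical field names modelled on the counters mentioned in class.
    page_views: int
    unique_views: int
    total_interactions: int
    new_fans: int       # assumption: fans gained in the reporting period
    removed_fans: int   # fans who removed themselves in the same period


def engagement_summary(stats: PageStats) -> dict:
    """Turn raw page counters into a few simple engagement ratios."""
    interactions_per_viewer = (
        stats.total_interactions / stats.unique_views if stats.unique_views else 0.0
    )
    removal_ratio = stats.removed_fans / stats.new_fans if stats.new_fans else 0.0
    return {
        "interactions_per_viewer": interactions_per_viewer,
        "net_fans": stats.new_fans - stats.removed_fans,
        "removal_ratio": removal_ratio,
    }


# Made-up numbers, for illustration only.
summary = engagement_summary(
    PageStats(page_views=1200, unique_views=400, total_interactions=100,
              new_fans=50, removed_fans=5)
)
print(summary)
```

Ratios like these make pages of different sizes comparable, and the removal ratio keeps the cost of people opting out visible instead of hiding it inside a growing fan count.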
What are your other questions? Say your name and department.
I’m Alex, CS… what progression?
What do you mean by progression?
Course outline
Look at last year. I think the best example for this year is last year, although the
performance last year is no guarantee for performance this year. Take the time, click
through last year’s wikispaces, www.stanford2008.wikispaces.com; it gives you a good
feeling about what’s happening. I have not put the individual classes together. Partly
because I want to learn from you what it is you are interested in. Last year is a good
example of what will be happening.
If you guys are working on cool projects or want some… share that feedback right now
because it’s the best time to share it. If you didn’t put it all up there, drop an email or get
in touch.
I plan to really do a good job in putting it out in the next couple of days. The reason I
told you this more general stuff today, like the PHAME framework, or the
importance of data, was that’s the way I think about the world. It’s not about
tuning that algorithm, given a set of data, but it’s coming up with cool problems. I
want to see where you are right now. I told you the story of the water and the fish
twice, now. You are the fish in the water. Tell me about the water and I’ll tell you what I
will pick from that.
Look at last year’s 2008 and that will give you a good feeling about what we did last year.
My name is Chris. I’m from … systems. For the projects, do we have to use Facebook
pages, or what’s the… Facebook pages because as far as I can tell, Facebook pages
are actually kind of pretty horrible…
We did think about this. I also talked to other people about it. We felt that the question
about which platform you are on is almost more important than how smart you are in
putting something up. If you are on a different platform, it’s very difficult to compare what
different people are doing, although having a whole bunch of former students working in
the data group at Facebook it is relatively easy for me to get them to do stuff. What were
you thinking about?
For me, community often is fostered in places where there are more open discussion tools,
so I’ve seen it happen more in forums or real time chat rooms… whereas Facebook, it’s
what you get when you have any sort of discussion on the wall… linear
thread…chronological, terrible way… pretty much no way of sorting…
I hear you.
… people reading individual post rather than actual persistent…
I am with you. We did talk about it but we didn’t have a better suggestion. If you know of
a better platform, I should say I have [0:48:16.3 unclear], who is the CEO of Ning coming
to one of the classes. I invited those of you whose email address I had last week to have
it picked up on Ning so I can learn pictures of people and their names, as opposed to
having to go to this awful tool Stanford has for instructors where I can look at one picture
at a time. I don’t think it is all that much better a platform.
Do try to drive stuff to Facebook pages if you can, but you could use something else.
You could use other acquisition channels if you wanted, or do some type of porting the
data to somewhere else. You can be creative. Right now, we’re working off of this one
constraint… platform constraint. Don’t close your box.
Enrique is right. It has to be very manageable for the class. That is a platform where we
think we can manage things okay. I am also talking to [0:49:16.6 unclear] who runs the
entire ad group at Facebook, to get money to run ads. This is sort of more or less real
money. You could also sell whatever else you are selling for the money. We can’t check
what you’re driving your traffic to. That’s a little bit of a worry right now.
Let’s talk about this after class. If you have good ideas, I think we are super open to
listening to them. Thank you.
Are all assignments through…
No, this is one group assignment. There are individual assignments and group
assignments. It’s not primarily about the grading. It’s about you learning stuff. The way
I structure the assignments is that I have a spreadsheet or a Google Doc here that
the TAs and I have access to. At the end of the quarter, I’ll just push some button
and I will make reasonable cuts for A+, A, A-, and so on.
Everybody in the group gets the same points. I’m not in the business of asking how
did you do compared to him. I think for the group projects, that assignment, homework 1,
will carry on for five weeks. That is algorithmically graded. It’s not that my feeling about
those metrics goes in. That’s the first step. The second is we will see how the group is…
I should mention one thing. Wikispaces, we have for each class, a class wiki. I think the
total number of classes is nine. That means roughly 1/9 of the people should volunteer
today to bring up the class wiki for the first class, in a reasonable way. I think what we typically
say is by Wednesday evening or Thursday. Whatever it says on the first page, I think it’s
Thursday evening.
How does it work? You find people you may have been too shy to talk to at the
beginning of class, but really would like to talk to and say, “Do you want to do the wiki
with me?” Then you have to get together and figure out what was it we did in class, what
was really important. I don’t want anything resembling the transcript. We have the
transcript already. I want you to really condense this to enrich it with hyperlinks, to come
up with good insights, and have a relatively short page for each class.
You have protected under Thursday. I will look at this on Thursday evening, every week.
I have full discretion; if I like it, it gets 10 points. If it’s sort of okay, it gets 7 points. If it’s
no good, 0 points. At Berkeley, they didn’t understand this notion of the wiki. There was
nothing after three days – 0 points. It’s 30% of your class grade. 30% comes from that
one week when you take the time with maybe 5-7 other students and produce the wiki.
It’s an important part of the class. Pick a topic. I think an important one is some of you
who might be busy later in the quarter may want to volunteer today. Find people. Give
me your names and I will send my notes to somebody. I will send my PowerPoint to
somebody. I do whatever I can do to support you. But, on Thursday evening, some wiki
thing is up and that is what you will get for 30% of your grade. It’s the same next week,
the week after and so on.
The TAs keep track of who did what wiki and I will just enter the grade for that wiki.
Look at last year’s wiki. There are a couple of absolutely awesome pages. I remember
the three hours on recommender systems is a super great wiki page. I remember we had
one class on instrumenting the world, and I was shocked at how rich that wiki page was
afterwards. Pick a topic. Once we have the topics up in a few days, pick a topic you’re
interested in, rally some people around you and say, “This is the wiki”. That’s 30%.
Homework is 60% of the grade, and 5% is for class participation. I don’t mean
coughing. I mean contributing in class. Another 5% is for contributing elsewhere:
commenting on Enrique’s blog, using Facebook Connect, or putting stuff on the
Facebook Group and stuff like this, good citizenship.
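The weights stated in class (wiki 30%, homework 60%, class participation 5%, contributing elsewhere 5%) add up to 100%, so a total score can be sketched as a weighted sum. This is an illustration only: the `course_score` helper and the example student are hypothetical, and the actual letter-grade cuts are not specified in the lecture.

```python
# Weights as stated in class; the key names are chosen here for illustration.
WEIGHTS = {
    "wiki": 0.30,
    "homework": 0.60,
    "participation": 0.05,
    "citizenship": 0.05,  # contributing elsewhere (blog comments, Facebook Group, ...)
}


def course_score(fractions):
    """Weighted total on a 0-100 scale; each input is a fraction of full marks in [0, 1]."""
    return 100 * sum(WEIGHTS[part] * fractions[part] for part in WEIGHTS)


# Hypothetical student: a "sort of okay" wiki (7 of 10 points) and made-up other marks.
example = {
    "wiki": 7 / 10,
    "homework": 0.85,
    "participation": 1.0,
    "citizenship": 0.5,
}
score = course_score(example)
print(f"total: {score:.1f} / 100")
```

Because the wiki carries 30% on its own, a wiki that earns 0 points caps the total at 70 regardless of everything else.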
Are there any other questions? Did I forget anything, Enrique?
… higher level should be thinking of this as a movement together… really work
together… community even within this classroom. I know we all have motivation… it
comes to this classroom…
In that spirit, let’s close since this was the first week of class here, today. Let’s close ten
minutes early and give people ten minutes to actually form the groups for the wiki and
see whether there is anybody else interested to give me feedback. If you forgot
something in your forms, find your form and fill it in. I’ll see you next Monday at 2:15.
Thank you.