weigend_stanford2010_2data_2010.04.01

MS&E237
Spring 2010
Stanford University
Andreas S. Weigend, Ph.D.
The Social Data Revolution:
Data Mining and Electronic Business
Andreas Weigend (www.weigend.com)
The Social Data Revolution: Data Mining and Electronic Business
MS&E 237, Stanford University
Spring 2010
April 1, 2010
Class 2: Data
This transcript:
http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_2data_2010.04.01.doc
Corresponding audio file:
http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_2data_2010.04.01.mp3
To see the whole series: Containing folder:
http://weigend.com/files/teaching/stanford/2010/recordings/audio/
Course Wiki:
http://stanford2010.wikispaces.com
Transcript - Tamara Bentzur - Testimonials – www.tbentzur.wordpress.com, www.outsourcestranscriptionservices.com
http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_2data_2010.04.01.doc
Andreas:
Welcome to the second class of MS&E 237 this spring. The agenda for today is the
following. We will start with logistics. I’ll tell you what’s coming up through
the quarter and how you’ll be evaluated, and I’ll introduce one of the TAs. We’ll then form some
groups and I’ll tell you why we will do that. That will be kind of a break. In the second half
of class I’ll do some content: I will tell you about different data sources. That
is the business perspective, before we dive more deeply into the technical stuff in the
next class.
First of all, thank you for sending in all of your - almost all of your surveys and for sending
in your bio information and interests. I haven’t managed to get to all of them yet. I
promise I will be done with everything you’ve sent in, all the 88 forms I’ve received so far,
by Tuesday. That’s a great treat because I get to know all the things you think are cool
so I have the intelligence and the attention of about 100 people sitting on the web and
getting it socially filtered through you.
Let me talk about assignments. Assignments come in four flavors in the class. We have
a group project which makes up 40% of your grade. We have online contributions, things
you do on socialdatarevolution on Facebook. We have individual homework; those are
slightly more technical. If you want to get a feeling about what’s coming down the pike
there, look at what we had last year, but this year, being in MS&E, it’s less technical than
last year, which was in STATS. Then we’ll talk about dog food in a moment. After that I
will talk about something I decided to do, which is sort of a class rep advisory board.
Then we’ll take a break and I’ll do about a 40 minute lecture on data sources.
Filling in details: next Tuesday after class we will have a mixer. We’ll get some pizza,
some beer, and the purpose of that is so you can get to know each other and so you can
find out who has complementary skills that you need for the groups. We will have maybe
15 or so virtual tables where, if somebody feels strongly about a certain area they want to
do a project in, they can recruit 3 other people - the total group size is 4 - to work
with them on their project.
We have very different skill sets in class so what we’ll have is a Google online
spreadsheet where each of us rates themselves along 4 dimensions. The first dimension
is what I call a “happy hacker,” people who are good at hacking stuff. The second one
we could call “producer,” so those are people who are good in product management
because a product needs to be managed and if you don’t do this then all the work is
happening at the end and people are very unhappy. The third one is the “secret sauce”
guy, the algorithms guy who knows more theory, who knows what you can
extract from data with machine learning. The fourth dimension is “strategy.” We
have 16 people from the GSB here, and I would expect them to rate themselves pretty highly on
strategy.
For me, strategy also means data strategy. Since these projects are all projects about
social data, one key element there is how do you get to the data; what is the data
strategy?
0:04:29
As I was driving down, I had a conversation with a company in the financial space.
They first got data by screen scraping. That means you get the user to give you a user
name and password, you pretend to be logging on as them, and you get all their
financial data and present it back to the user as a dashboard. If you were the financial
institution, what would you say? Would you say no way, those guys are actually going to
grab the data from our user and show their ads, or would you say great they’re doing
work for us for free? What would be your perspective there?
The first perspective is: we don’t want them to do this. So the financial institution
switched things around, with the result that lots of the scraped data was actually
wrong (for a day, your amount of dollars was your zip code), and the users were
ultimately very unhappy. The company went to the financial institution and said, “Do
you want unhappy users or happy users?” “We want happy users.” “Why don’t
we do an API, an interface where you can actually suck the data out right away, as
opposed to doing the screen scraping? By the way, since we’re doing work for
you, why don’t you pay us for that?” That would be an example of strategy: how
do we get to the data we need. Incentives fall under that.
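A minimal sketch of why screen scraping can break where an API does not, assuming a made-up page layout. The function names, the pipe-separated row, and the JSON fields are all hypothetical; nothing here reflects the actual company or institution in the story.

```python
import json

def parse_scraped_row(row):
    """Scraper that assumes fixed column positions: name, balance, zip."""
    cells = row.split("|")
    return {"name": cells[0], "balance": float(cells[1]), "zip": cells[2]}

def parse_api_payload(payload):
    """API client that reads named fields, so field order does not matter."""
    record = json.loads(payload)
    return {
        "name": record["name"],
        "balance": float(record["balance"]),
        "zip": record["zip"],
    }

old_layout = "Alice|1250.75|94305"
new_layout = "Alice|94305|1250.75"  # the institution swapped two columns

# Hypothetical API response carrying the same data with explicit field names.
api_payload = '{"zip": "94305", "name": "Alice", "balance": "1250.75"}'

print(parse_scraped_row(old_layout)["balance"])   # 1250.75
print(parse_scraped_row(new_layout)["balance"])   # 94305.0, the zip code read as dollars
print(parse_api_payload(api_payload)["balance"])  # 1250.75
```

The scraper fails silently when the layout moves, which is the "dollars became zip code" symptom; the named fields of an API make that kind of mix-up much harder.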
Go to this link, bit.ly/mse237 projects. There should be your name and email address.
Rate yourself there and that way people can add some descriptions so you have prior
information when you go to the mixer on Tuesday after class, regarding who you might
want to talk to. We’ll have name tags and they will have the phone numbers prominently
displayed. Of course, as a data mining guy, at the end of the quarter I’ll be curious
as to what the correlation is between the raw grades and those numbers.
That is what you need to do so we can form groups. Groups need to be final by
Thursday of next week, a week from now. If there are problems, tell us early so we can
basically announce them in class and see whether we can still fix things. Don’t wait until
3 weeks into the quarter.
The projects are created by you. Last week we saw that Intuit is willing to act as tutors
for some of the projects. We have another couple of companies which will do similar
things, but ultimately, coming up with good questions to ask really is an
important part. It’s not that I will say project one, project two, project three; an
important part of the project is actually defining the project and figuring out how to
go about it, particularly the data.
We will have a number of milestones through the quarter so it’s not all pushed to the end.
It is relatively frontloaded. We’ll tell you next week what the deadlines are. Any
questions about that 40% here, the project? It’s different from last year. It’s much more
important. It’s a group project for a maximum of 5 people.
Student:
Is this the project we’re going to pitch to the VCs?
Andreas:
That’s the project which at the end of the quarter you will have a few minutes to
pitch. Primarily, you’ll first pitch it to your friends. You will see how they debug it.
When I was at Xerox PARC and people said, “I can’t really talk about it,” my feeling
always was that there was nothing to talk about. Good people are so rich in ideas that
they’re very happy to share their ideas and get them debugged by their friends.
The VCs are at the very end of the last class. I want you to talk to everybody about
it and make it better before you even start coding. It’s one project throughout the
quarter in your group. Any other questions?
0:08:55
Student:
Are you going to have a grading rubric or something like that so we can know …?
Andreas:
Yes, we’ll have a timeline which is more important than the grading. Why do you think
we want to do a project?
Student:
So we can respond creatively to the material.
Andreas:
I want you to figure out what to do, what would be an interesting application of the stuff I
feed you. Personally, I always learn more in projects than in just doing problem sets,
which gets us to the other three ingredients - online contributions. One of the potential
projects is to come up with a very good social structure for
socialdatarevolution.com, but we haven’t set it up yet.
The homework I’m assigning now is I’ve put two papers up from The Economist, two
special reports. One is from January this year, about social networks, and the other one
is from February this year about big data. So you can convince yourself that this actually
works, this is the link. If you click on this you can either have it as a PDF scan which is
pretty big, or as a Word document. This is illegal. I should probably pay royalties for
that, but I never understand how those rules really work so if you want to buy The
Economist, you can do that as well. If you want to just read, you can do it here.
I expect you to go on the Facebook group, facebook.com/socialdatarevolution, and
share one related idea you get by reading these 32 pages. Shared there means it’s
shared with your friends as well. Facebook.com/socialdatarevolution is the right transient
place where things come and go, and disappear afterwards, but that’s where ideas get
shared related to the class. It’s not just that the class watches it; past students see that
as well. I think both those articles are pretty good and, irrespective of your
background, The Economist is a decent way of introducing the ideas that have relevance
here for business. Have any of you actually seen any of these articles already? How did
you come across them? You subscribed? What did you think about them?
Student:
It’s been a little while. I think the big data one was a little disappointing because by the
time you read the cover of The Economist, the trend is already a little too late to jump on.
That was my reaction to it.
Andreas:
I remember they called me in January so it actually gives an interesting feeling about how
long it takes them to actually produce something. I know we talked for hours in January
when I was in Shanghai, so it’s about 2 months between talking and it finally appearing.
Any questions about what I expect you to do between now and Tuesday?
Student:
That’s due by Tuesday?
Andreas:
You should post something on Facebook, and the great thing about it is that if it’s a great
idea, many people will like it. If it’s lame, then it probably will fall by the
wayside. It makes it easier for us to get a feeling for the quality of the idea if
lots of people like it. Or they might like you or your picture; we’re never sure about that.
0:13:05
Individual homework is more …. If you want to get a feeling about that, look at last
year. For instance, Google Analytics is just a good one. We still need to figure out
how we trim down last year’s assignments so we don’t overwork you, but you’ll still get
the insights which we want you to get. You probably will have to build a
recommender system but the building blocks are there. It’s in Python and for that
one I allow people to potentially collaborate in pairs so if one person really doesn’t know
Python, then we’ll find ways of making this work.
Why do we do that? I personally think that for building mental models of what to do
with the data, actually doing it is the best way of getting there. I can talk until the
cows come home about how these things ought to work. If you don’t do it yourself,
you will never actually get there. That’s my personal belief.
There are people who talk about it and never do it. Immediately when you talk to them,
you can feel they don’t know what they’re talking about. That’s why I actually want you to
do it, to run it, to build a mental model.
I did my PhD here at Stanford, doing Neural Networks. I think I was super lucky. My
advisor was Dave Rumelhart who invented Neural Networks and he had an amazing
intuition coming from cognitive psychology, about how you learn from patterns, how you
extract patterns by learning from data. That’s why I have a very strong bias towards
running stuff to see what it does. You change a parameter and see how it actually
changes the quality of the recommendations. That’s my intuition and my intention and
why I give you homework that is actually hands-on. Are there any questions about that?
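To illustrate "change a parameter and see how it changes the recommendations," here is a toy sketch. It is not the course's homework code: the data, the Jaccard similarity, and the neighbourhood size `k` are all assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical toy data: which songs each user liked.
likes = {
    "ann": {"s1", "s2", "s3"},
    "bob": {"s1", "s2", "s4"},
    "cat": {"s2", "s3", "s5"},
    "dan": {"s6"},
}

def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)

def recommend(user, k=2):
    """Rank unseen items by the summed similarity of the k most similar users."""
    mine = likes[user]
    neighbours = sorted(
        (u for u in likes if u != user),
        key=lambda u: (-jaccard(mine, likes[u]), u),  # ties broken by name
    )[:k]
    scores = Counter()
    for u in neighbours:
        sim = jaccard(mine, likes[u])
        for item in likes[u] - mine:
            scores[item] += sim
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked if score > 0]

print(recommend("ann", k=1))  # ['s4']
print(recommend("ann", k=2))  # ['s4', 's5']
```

Moving `k` from 1 to 2 brings a second neighbour into play and a second song into the list, which is the kind of hands-on experiment described above.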
Dog food - I think in a class like this, there is for sure an experience component to it. I
want people to live in the space we’re talking about here. As we said last time, this is a
revolution, the Social Data Revolution, and the way to do this well is to eat your own dog
food. I would like to use that as a bridge to introduce Jeremy.
Jeremy is a graduate student of MS&E and is our head TA. We haven’t figured out the
entire structure yet, but we figured out that he’s leading the team. Dog food is something
he’s going to be producing. Why don’t you take 5-10 minutes and tell people more about
it.
Jeremy:
The idea with dog food is there are a lot of interesting things we’re exploring with the
Social Data Revolution, but to really understand what’s going on and to have meaningful
insights in what’s going on, we need to study the fundamentals of what’s going on. To do
that, we need to live what’s going on. This is taking it a little bit further than just our
Facebook accounts, which I’m sure we’re all very familiar with, or some of us have
Twitter accounts or something like that, but really exploring a lot of the different tools that
are growing out of the open APIs that Twitter has and other things like that.
The idea is maybe one-week or two-week projects. Each time we’ll choose a
different tool that’s out there, and it will be across a variety of different things.
There is certainly the stream stuff that’s going on with Twitter. Maybe it will be the news
discovery sort of things like StumbleUpon or Digg. We’ll finalize what that list will be but
over the course of that week, we’ll really call on you to create an account and start
posting and start being social and using that service. There is really no way to become
familiar with these things, other than actually using them.
0:17:27
The structure of that project will be two components. There will be the actual
use of the tool, and there will be a reflective period of thinking about it. The
first layer is the user layer: you’re using it, communicating on Facebook, and it
accomplishes your purpose.
Maybe the time stamps are on there from when the server is being posted. Then, for the
purpose of this class, you think about the second and third layers that are beneath
that: what meta data is associated with the post that you’re posting on there?
Maybe you have some location data, if you’re using Foursquare or something like that.
That’s not immediately apparent to you as the user, but on the other side, from a
company perspective; you really can use that data for recommendations or other things
like that.
Maybe a third layer is this idea of the semantic web, where you’re tagging different
things. An obvious example of that is Delicious, where you’re tagging websites
that you might enjoy with different tags, and on the back end Delicious can correlate them:
here are the most popular sites based around design, or something like that. It’s really
starting to dig beneath the surface, moving from being a user of all of these
services to looking at the data and the underlying functionality that you can
actually extract from that data, and figuring out what you can possibly do with it.
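A small sketch of the three layers just described, assuming a made-up post record and made-up bookmarks. The field names and the `popular_by_tag` helper are hypothetical and do not reflect how Facebook, Foursquare, or Delicious actually store data.

```python
from collections import Counter, defaultdict

# One post, seen as three layers (all values are invented for illustration).
post = {
    "text": "Great coffee here",            # layer 1: what the user sees and writes
    "meta": {                               # layer 2: metadata attached to the post
        "timestamp": "2010-04-01T16:30:00",
        "lat": 37.4275,
        "lon": -122.1697,
    },
    "tags": ["coffee", "stanford"],         # layer 3: user-supplied tags
}

# Hypothetical bookmarks in the style of a tagging service: (user, site, tags).
bookmarks = [
    ("u1", "site-a.example", ["design", "css"]),
    ("u2", "site-a.example", ["design"]),
    ("u3", "site-b.example", ["design"]),
    ("u3", "site-c.example", ["python"]),
]

def popular_by_tag(rows):
    """Count, for every tag, how many times each site was bookmarked with it."""
    counts = defaultdict(Counter)
    for _user, url, tags in rows:
        for tag in tags:
            counts[tag][url] += 1
    return counts

top_design = popular_by_tag(bookmarks)["design"].most_common(1)
print(top_design)  # [('site-a.example', 2)]
```

The user only ever interacts with layer 1; the aggregation over tags is the "back end" view that turns individual bookmarks into a ranking.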
That part of the course will really allow you to have a great foundation for a lot of
the great things that are going on in Web 2.0 and social data, and it will give you
some great ideas and a lot more ability to be creative and innovative as you go into
the project or homework assignments more deeply. Does anybody have any
questions?
Andreas:
Do you understand why we are doing the dog food part?
Student:
…
Andreas:
For instance, take Twitter as an example. A surprising number of you
told me you actually don’t have a Twitter account, which genuinely surprised
me. Maybe there is no need in a college environment where everybody has
Facebook. Unless you do it and live in that space, or actually dive into that space
for a week or two, you probably don’t know what it really is about. It’s just like with
the algorithms. You can very quickly tell whether somebody actually knows what
they’re talking about or whether they just saw an interview with … or
something like this, where he talks about what Twitter is about. There is no
shortcut to actually doing it.
The homework for those is basically simple feedback: stuff you would do differently, what
it is that they missed, what surprised you. Or, just to stick with the Twitter homework
example that was used successfully last year: find someone in a company you know,
work with them, and engage with any one individual who is saying something about a
product of that company. That’s a super interesting exercise, and for those of you from
the GSB it’s also very handy when you interview afterwards and say, “I know how to
engage with Twitter users. We have our Twitter strategy in place.”
0:21:13
That is the thinking behind it. It’s not super time intensive, but the point is to engage and give us
feedback, no more than one page saying what it did for you, what you were surprised
about, how you would be using it. On that note, we actually do have a Twitter account
which is called @socialdata. If you follow @socialdata, that’s where I tweet stuff about
the class and logistics issues. The classroom change should have shown up on
@socialdata, but because at that stage I hadn’t told you about it, it didn’t.
If you want to subscribe to that, that’s where you learn, in a push way, what’s happening.
Student:
I’m not a typical social network user, but I’ve been in some of the social networking
classes and I’m a Facebook user of course. I was wondering why they always give
heavy weight to Twitter versus Facebook? Can you tell me why they think Twitter is
somehow - not better, but gives you more insight about what you can do with social
media?
Andreas:
The question is basically Twitter vs. Facebook. I can give you a few facts here.
The most important one is that connections on Facebook are bi-directional and mutually
confirmed, whereas Twitter is just a broadcasting medium. That is a very different
element. While in both cases you can extract a social graph out of it, a social graph
where people actually took the physical world and mapped it into the virtual world,
as on Facebook, is very different from having a radio station.
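The difference between a mutually confirmed tie and a broadcast tie can be sketched as undirected versus directed edges. The names and edges below are invented, and the two helper functions are hypothetical illustrations, not either service's API.

```python
# Directed "follows" edges in the broadcast style: (follower, followed).
follows = {
    ("ann", "bob"),
    ("bob", "ann"),
    ("cat", "ann"),
    ("dan", "ann"),
}

def followers(user, edges):
    """Everyone who points at `user`; the user need not point back."""
    return {src for src, dst in edges if dst == user}

def mutual_ties(edges):
    """Keep only pairs where both directions exist, i.e. confirmed friendships."""
    return {frozenset((a, b)) for a, b in edges if (b, a) in edges}

print(sorted(followers("ann", follows)))  # ['bob', 'cat', 'dan']
print(len(mutual_ties(follows)))          # 1
```

Three people "listen" to ann in the directed graph, but only one relationship survives once confirmation in both directions is required, which is why the two graphs say different things about the physical world.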
In a college environment I think Twitter is actually not all that interesting. Companies
however jump on it because they feel this is yet another outlet for pushing their
messages down the throat of millions of people. Why is Twitter so popular? Because
there is the illusion of an audience, “I have 1,000 followers,” so if I tell people
whatever, I think 1,000 people are listening. Bad news - nobody is really listening.
But that illusion of the audience is what people haven’t experienced. I would be having
breakfast at … this morning and nobody else showed up. I tried many things. I tried an
experiment. I had two extra tickets for a play or symphony orchestra or something.
I thought nobody wanted them. Then my TA said, “I can go with my girlfriend.” We did a
number of things, so I’m pretty aware now of the illusion of an audience.
The reason I have the chief product guy from Bit.ly coming to class is because
Bit.ly actually allows you to measure how many people actually really give you
their attention. I think for some of us it might be a bit disappointing to see how little we
get.
Facebook is of course set up between people who may know each other. We have to figure out
how to present the survey; I might simply put up the results, anonymously of
course, on the web. People’s responses were interesting on what would happen if Facebook
went away. If Twitter went away, they said, probably nothing; if Facebook went
away - people are worried about the address book and their photos. They’re worried
about the personal relationship with others and their past.
There are many other things to be said. One is the question of identity; I think
Facebook is much more important for identity than Twitter. Think about it;
Facebook is probably more important for us, for our identity, than our passport is.
Your passport you can fake. We all know friends who organize these things.
On the other hand, if somebody cracks your password on Facebook - the same
happened to me on Twitter. Somebody cracked my password and within a minute or two
people were saying, “Andreas, your account has been hacked.” If I lose my passport
nobody would hit me up and say, “Your passport has been stolen.”
0:25:53
Are there any other questions? It’s a big topic and we had one very good class about this
last year where we had people from both companies talk about it.
Guest speakers. On average, every second or third class we’ll have a guest
for about 45 minutes, and we’ll reflect during the last 30 minutes about what it was that
we learned in the presentation. I will look through your recommendations. If somebody
really feels strongly about it and thinks they didn’t express it strongly enough in the initial
survey, then drop me another email. If you heard somebody is awesome in another class
or in a talk around here, please let me know. I don’t know everybody, but usually when
you invite people and give them a choice of dates, and they’re somewhat in the area, they
will come.
I haven’t fixed them all. A couple are set for April 20th, with Bit.ly but there are spaces
and I want to find people you are interested in. As I said in the survey, don’t say …
because everybody knows them already. Think about people you actually genuinely
would like to discover something about, which not everybody knows about yet.
The last logistics issue, although I tried to insert some content with it already: given the
diversity of this class, which is pretty diverse, I decided I will form an advisory board. I
want to have 4 class reps, 4 people whom I meet with initially once a week on Thursday
after class. I’ll take you out for pizza or we can go to one of your dorms. I want to learn
what the 4 groups are thinking because some groups might be more vocal than others
here. I don’t want to fool myself into thinking that just by looking at your smiles, that I
know what you’re thinking. Some people might not have the courage to actually email
me directly to tell me what they think. This is a very good thing which has worked really
well in other places in the past.
Student:
I’m Matt Osborne. I’m the GSB representative, not an MBA. I’m part of the [Sloan]
program, which is the business school students who have been 10-15 years into their
career. I’ve actually been working in this business for that long.
Student:
I’m Jess. I’m … senior and… major.
Student:
I’m Dan Goodwin. I’m a first year masters in electrical engineering. I’m an extroverted
engineer because I … when I talk.
Andreas:
Our first meeting is next Thursday. Just recorded among the five of us and I’m also
inviting the TA. It’s an open discussion, 6:15 dinner and maybe you can pick a place or
otherwise we can do a coffee house or something simple.
There is one email address to reach whoever is on the teaching team, other than me:
mse237@gmail.com. It’s very easy. You don’t have to worry about what goes to whom.
Any questions about that? If the four of you could just send
your names to that address, that way we have it on record.
0:29:50
For current info, before we actually get socialdatarevolution.com up, everything is on
stanford2010.wikispaces.com. Tonight I will invite all of the students
whose email addresses I have. I will take all the email addresses I got from you and
invite everybody to be an editor of the Wikispaces wiki. By tomorrow you can all edit
this. It might be that we transition this to something more interesting, something that
allows us to do more annotations. That’s what a few of us are meeting at dinner for tonight.
For right now, that’s the email address to reach the TAs. That is where the information
sits. If you want to email me, it’s aweigend@stanford.edu. Any questions about that?
We should have about half an hour left. In this half hour, I want to talk about data. I
want to think about the Nile River first. What does that have to do with it? In
history, those people who managed to build long, unclear feedback loops, or
maybe omit them, used to be the ones who actually became rich and famous.
Let me give you some examples. It was the high priests in those days. What is the
feedback loop for somebody who tells you something that might be happening after
death? It’s pretty long and pretty broken. Jumping a few hundred years further,
there was a time in the ‘90s when it was very popular to be on Wall Street.
You probably don’t remember those days, but from my
graduating PhD class, from Harvard and Stanford combined, there were two of us
who did not end up as quants on Wall Street. It was the normal way to go.
Why did this work? It’s not so much for the quants but for all the other people: trust
me, ride this thing out and someday it will be all good. There were long
feedback loops, not a rich set of metrics.
Another example is religion. Take infrastructure investments: Notre Dame in
Paris is an infrastructure investment. It was not clear what the ROI really was.
Let’s compare and contrast this era of faith to an era of data. The reason I have
Chinese here is I thought it would be forward compatible.
In the era of faith there were massive investments into cathedrals, etc. In the era of
data there were massive investments in measuring, networking, communicating,
and storing. We move from an unclear ROI to a very clear ROI. We have a short
feedback cycle and what’s very important for me is we can do experiments in real
time.
This means that a lot of data is being created; that is The Economist article from February, “Big
Data.” What do we do with all of that? We gather it, explore, publish, and archive
it, a lot of stuff. What should people do? What do marketers in the business school do
about all of that stuff?
The paradigm shift here, and this is an important one, is that we are moving away
from the ‘90s question (given a set of data, what insights can we get?) to the
2000s question (given a problem, what data can I get?). In other words, given a
business problem, how can I incentivize people to actually tell me something about
themselves so I can serve them better? And from there to what I think is the 2010 question: not
what is the insight given the data, nor what is the data given the problem, but what
is the business model.
A lot of it here when I talk about incentive design is figuring out how we get people to do
stuff. This is an old table here in case someone is not really sure what the unit of
measurement for data is. What I like about it is that it’s pretty shocking to see how
many orders of magnitude the data we live with actually spans. It’s roughly 18 orders
of magnitude here. If you transfer this, not on the digital scale but onto the
physical scale, into distances, then you get this.
0:35:02
If you say the unit of analysis here is an atom, roughly this, then you have Mt.
Everest and you go the distance to the Sun. The amount of data is so difficult to
grasp for people because so many orders of magnitude are being spanned when
going from an individual bit or byte to what the Internet is actually carrying about us.
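The table on the slide is not reproduced in this transcript. As a sketch, assuming it runs over the standard decimal prefixes from byte to exabyte, the "roughly 18 orders of magnitude" can be checked like this:

```python
import math

# Standard decimal (SI) prefixes; each step is a factor of 1,000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa"]

units = {"byte": 1}
for step, prefix in enumerate(PREFIXES, start=1):
    units[prefix + "byte"] = 10 ** (3 * step)

for name, size in units.items():
    print(f"{name:>9}: 10^{round(math.log10(size))} bytes")

# Orders of magnitude between the smallest and largest unit in the table.
span = round(math.log10(units["exabyte"] / units["byte"]))
print(span)  # 18
```

Whether the slide stopped at exabytes is an assumption here; the point is only that six prefix steps of three orders each already give the 18 mentioned above.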
I mentioned last time that what matters is not only the static overall size but also the
doubling rate: the amount of data each of us creates roughly doubles every one
and a half years. One of you emailed me asking what I actually meant by saying you
can just describe where you are, and that’s a few bytes, or give geolocation, or you can
take a picture and describe it, but that estimate is essentially all the data that you are
creating. Of course, video and photos do play an important role in this exponential
growth of data.
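A short worked example of what doubling every one and a half years implies, taking the 1.5-year figure from the lecture at face value. The function name is just for illustration.

```python
DOUBLING_TIME_YEARS = 1.5  # figure quoted in the lecture

def growth_factor(years, doubling_time=DOUBLING_TIME_YEARS):
    """How many times larger the data volume is after `years` of steady doubling."""
    return 2 ** (years / doubling_time)

print(growth_factor(3))   # 4.0
print(growth_factor(15))  # 1024.0
```

So under this assumption, three years means four times as much data, and fifteen years means roughly a thousand times as much.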
Sometimes we talk about the surface web, something like 10 billion pages, roughly
a page per person on Earth, and compare it to email. The storage cost in 2008, when
I last did this exercise, was roughly $400 thousand; you could store everything that is out
there in the size of your garage. My garage in San Francisco costs about $100
thousand, so think about this: all the data mankind has created and put up on the
web fits into my garage for $400 thousand.
The deep web of course is the underlying databases, and I would say it’s roughly ten
times bigger. There’s no point arguing about this much. The point I want to make is you can
store everything people have created in your garage.
We want to turn behavior into data. MoodLogic is a company I co-founded in 1999 with
Chris Pirkner, a former grad student of mine at NYU. The belief we had there was that if
we just give people tools to trivially, easily annotate things they do - in this case it
was in the music domain - to characterize songs they like, then we can build a space
where nearby points in that space have nearby perceptual impact on the person.
Music is probably the best legal drug and if you want to be happier and you have a song
that makes you happier, then what are songs in the vicinity of that? One way of getting to
this was asking people why they like to listen to this song. “I like to listen to this song
while I’m in the shower or running.”
It was pretty amazing that within 2 years we had 1 billion explicit ratings. We would have
never thought that. Chris Pirkner really understood how to get people motivated to do
stuff. How much do you think we had to pay for those ratings? Nothing. Monetary
incentives don’t really work because you get people who are doing it for $3 an
hour, clicking and rating songs, but you won’t want those people. You want
people who actually care for the music.
That was one of the deepest things we learned there. If you align the incentives so
they get something back which is useful for them, namely discovering new songs,
that was the ‘90s when mp3 was easily available and later on people got sued for
having them on their computer. In those days it was really a discovery problem which
we solved. The company was sold to All Media Guide and then All Media Guide was
sold to Macrovision and Chris is their Chief Strategist.
0:39:11
It’s a data play, a meta data play if you will. The point I want to make is that behavior
gets turned into data. Music is one example. Search is another example where
you share your secret desires with Google. Who would be willing to have every single
search you did in the last week displayed, with your name, on the screen? We do share
our secret desires - with Google. Why? Because we get something back. We might get
the answer back.
Another example is online trading, where peoples’ behavior can be measured, and
online dating. That’s very interesting. There is a theory of addiction, and it’s very
simple. It is that if you have a stake somewhere and the world changes, whether
you watch it or whether you don’t watch it, and it affects your stake in that world,
then you are drawn to go back and check it out again and again.
Online trading is a perfect example where you have some positions; while you are
sleeping, the price may go up and down, and people are addicted to it. Online dating,
you have your profile out there and you can’t wait to see whether that person you really
want to contact you actually did. That driver, having something which changes with
you having a stake in the game is a very powerful driver for whatever app you’re
creating. Online role-playing games, Zynga as an example which everybody knows, both
those other things, particularly trading and dating have the same underlying structure.
My belief is everything can and will become data. Movement data, mobile [0:41:20],
brain activity - we will have Brian Knutson come, the guy I mentioned the last class who
does fMRI in Jordan Hall, to actually talk to us about what’s happening in the brain, what
data can we get, how can we make predictions, given those data, onto future behavior.
Privacy is one issue that will come up and will be topicalized again and again in
class. There are lots of different kinds of privacy, information like what your hobbies
are, what you’re good at; communication privacy, who you talk to. Skydeck is a
company between here and San Francisco. Their user model - you give them your
user name, your password, or your mobile number and your password to the site.
They go to the site - T-Mobile in my case, or AT&T, and look at your calling
patterns. They present back to you how those calling patterns change over time.
I’m sure you can predict breakups with your friend way ahead of time, before
Facebook knows about it, by understanding how things shift. Another example is
how long does it take me to respond to somebody’s voicemail, versus how long
does it take them to respond? Nokia Research has a product which tries to come up
with an organizational chart, very different from the official organization chart, which takes
such response delays into account.
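A minimal sketch of the response-delay idea, assuming a log of who took how long to answer whom. This is not Nokia Research's method; the names, the data, and the choice of the median are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical log: (sender, recipient, hours until the recipient responded)
responses = [
    ("alice", "bob", 0.5),
    ("alice", "bob", 1.0),
    ("bob", "alice", 8.0),
    ("bob", "alice", 12.0),
    ("carol", "alice", 2.0),
    ("alice", "carol", 2.5),
]

def median_delays(log):
    """Median hours each responder takes to answer each sender, keyed by (responder, sender)."""
    delays = defaultdict(list)
    for sender, recipient, hours in log:
        delays[(recipient, sender)].append(hours)
    return {pair: median(values) for pair, values in delays.items()}

def asymmetry(log, a, b):
    """Ratio > 1 means `a` keeps `b` waiting longer than `b` keeps `a` waiting."""
    d = median_delays(log)
    return d[(a, b)] / d[(b, a)]

print(asymmetry(responses, "alice", "bob"))    # well above 1: bob answers alice much faster
print(asymmetry(responses, "alice", "carol"))  # close to 1: roughly symmetric
```

A large ratio in one direction is the kind of signal that could feed an informal organization chart; a real system would need many more observations per pair.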
I did some consulting work at Morgan Stanley. I needed MATLAB for that so I called up
the MathWorks, people near Boston, and said, “Can you help me out here and shoot over
a license for everything for the next week or so?” They said, “It’s funny that you called,
Andreas. We just sold Morgan Stanley a very expensive license.” “Who did you sell it
to?” I don’t know the guy. I called him up and he said, “What are you doing?” I told him
what I’m supposed to do and he said, “But that’s what I’m doing.”
By accessing similar tool boxes, for instance MATLAB, or in other ways by accessing
similar data sources if you actually buy data, you get a very good idea, in yet another
space, of who is working on similar products from whom. If you look at the same data
with the same tools, chances are that you actually are trying to solve a similar problem.
0:43:54
Thomson, which recently merged to become Thomson Reuters, has a company in the Midwest,
in Minnesota, called Westlaw. Westlaw goes back 100 years, taking all these public
records from court cases. Somebody enters them and then for lots of money you can
actually access them, which is an interesting conversation, I think, about the value of
data, because the data is already public. The fact that you can access it from
your desktop and search it is where the value lies that people are paying
for, rather than taking the bus to the courthouse and finding the proceedings
somewhere.
The interesting question there was who should have access to which cases I’m
actually looking at? Certainly not the counterparty because that way they could
reverse engineer the argument I’m going to make in court. That’s not what we want, but
within the law firm, should people have access to that? Probably yes. Then what
should they be able to aggregate up in order to have a better data product, and
what should they not be able to aggregate up?
It’s a similar question for Intuit data. What can you learn across companies and
what would companies be very unhappy to have shared? The general rule is that
the big companies are the ones who have more to lose than the small ones.
Amazon.com pretty much knows what’s going on in the world by what they see. Some
little Internet retailer cannot lose much if their data is public, but they can gain a lot by
benchmarking themselves to data from other companies.
One of the things we’ll do at the mixer next week is that we will have Angus summarize
the data sources we’re looking at. Is that right?
Geolocation, territorial privacy, what happens in your office, home, bedroom? I just
learned that in Singapore, a hotel room is a public space. Watch out what you’re doing in
hotel rooms in Singapore. Of course, bodily privacy, strip searches, drug tests, these are
at least some of those interesting dimensions of privacy.
Out of those privacy concerns here - collection and storing, unauthorized
secondary use, improper access, and combining data - what’s missing is my
main concern, which is: what if something wrong is out there? What do you do
about it? If somebody says that some Stanford professor molested a child, now what do
you do about it, assuming it’s not right? What do they do about it? It’s not easy.
This year, what would be processes for people to fix errors in the database which
are pretty foolproof, that is, tamper-proof? If I want to take something
that is right and change it to something wrong, I can’t do it. But the person who
wants to fix something that’s wrong and make it right actually can do it.
0:47:16
Let’s take geolocation; let’s say there was a murder in L.A. last week. I might not be
willing to have - I would be but some of you might not be willing to have your geolocation
being recorded every minute, which would make it very easy to say this is where I was.
Then if I get accused of having committed that murder in Los Angeles last week, I would
be more than happy to explain that it can’t have been me because I was hanging out in
San Francisco. That’s an example where you hash the data into some space, so you can
come back and say the probability that Andreas did this is 10 to the minus 17, versus the
probability that the other guy did is 10 to the minus 1 or something like this.
How do you handle the data? How you process them so that you can get your question
answered without revealing any more than you actually need to reveal is a very
interesting area of research.
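One way to read “hash the data” is as a commit-and-reveal scheme: you hand out only a hash of each location record, and reveal the record itself later if you ever need an alibi. The sketch below is an assumption about what such a scheme could look like, not something described in the lecture; the record format and function names are hypothetical.

```python
import hashlib
import secrets

def commit(record: str) -> tuple[str, str]:
    """Return (digest, nonce). Only the digest is shared; the nonce stays private."""
    nonce = secrets.token_hex(16)  # random salt so the record cannot be guessed from the digest
    digest = hashlib.sha256(f"{nonce}|{record}".encode()).hexdigest()
    return digest, nonce

def verify(digest: str, nonce: str, record: str) -> bool:
    """Check that a revealed record matches the digest committed earlier."""
    expected = hashlib.sha256(f"{nonce}|{record}".encode()).hexdigest()
    return secrets.compare_digest(digest, expected)

# Hypothetical record: a timestamp and a coarse location.
digest, nonce = commit("2010-03-25T21:00|San Francisco")
print(verify(digest, nonce, "2010-03-25T21:00|San Francisco"))  # matches the commitment
print(verify(digest, nonce, "2010-03-25T21:00|Los Angeles"))    # does not match
```

The digest alone reveals nothing about where you were. On its own it also does not prove when the commitment was made, so a real system would need a trusted party to timestamp the digests.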
Accessing your own data is a very interesting thing. I missed a plane connection
and went to the woman at Lufthansa in the lounge, and I said, “The plane was late. How
do I get to where I was going?” She said, “Just a moment.” I leaned over to look at her
screen. She pushed the screen away from me and said, “No, you can’t look at this.” I
said, “What are you talking about? I paid a lot of money for that ticket. I’m stranded here
because your plane was late. How come I can’t look at my own data?”
The same thing happened with iPhone. There was an issue with the AT&T billing so I go
there and I didn’t know what the problem was. I said, “Let me help you debug it. Let me
see whether you have the address wrong. What’s this?” How can people think about my
data as something I can’t look at?
23andMe is another example, versus financial data. 23andMe, for a couple of
hundred bucks, does a DNA sequence and tells you which diseases you’re likely to have
and which diseases you’re less likely to have. Initially it was you can only get this data if
a doctor is next to you. On the other hand, you can get your financial data without a guy
who has a degree in financial mathematics being with you. How people think about
their data, whether it’s my DNA or my money, is very much worth reflecting on.
The Federal Trade Commission has a bunch of dimensions here. What they came up
with is actually quite reasonable. A Pentagon article from 4 years ago, a New Scientist
article talks about how the spooks are trying to figure out what is going on, on social
networks. Anybody, any comments on that? I think we should assume that they’re
everywhere. I’m not the person who is worried about these things, but assume that
everything you ever say, write, email, text, they listen to.
Cost of storage has dramatically dropped. It’s absolutely amazing what [slack] had
when I was a grad student, now I have on my mobile phone, in terms of storage. Of
course, the hard drive capacity has shifted. There is a good paper by Hal Varian, who
is Chief Economist at Google. It really explores the law that storage prices drop
by a factor of 2 every 1-1.5 years, and what that means to the world.
I see it this way, that there is implicit and explicit data. Implicit data are data which
are clicks, my geolocation, stuff that I produce and can’t do much about. Even if I
don’t have a phone with geolocation, just the fact that our phones are in this room,
somebody has to know this because otherwise how would they know your phone should
ring? That’s one of these strong data strategy things.
0:51:32
I can turn my phone off, but then I won’t get phone calls, so I need to transmit where I
am. Otherwise, my phone can’t ring. There is a company, I forgot the name, maybe
some of you know it; it puts little receivers in conferences and then they trace out the
trajectories people take through the booth at the conference, who hangs out where, for
how long, what is the more or less random walk people take at a conference, and who is
hanging out with whom. Does anybody know the name of that company? There are very
interesting insights you can get about where people go, how does this change when you
rearrange the conference.
Many stores had video cameras where people sat in the back trying to figure out if you
move that product from that aisle to this aisle, how does it affect sales. Walmart actually
has an entire model which says if I move the toothpaste from this rack to that rack, sales
will go up by 0.03%. They knew this 10 years ago already. Now with people moving
through the stores, having their mobile phones on, you can understand and analyze the
implicit data much more than we could ever do by manually looking at video tapes.
That is growing exponentially, no question about it. Surveys on the other hand, as
an example of explicit data, are not going up exponentially because we actually
have to do work for that. So, from a machine-learning perspective, or … for instance,
you could say that implicit data is unlabeled data and explicit data is labeled data.
How much worth is there to labeling a piece of data?
What I’ve seen happening is that great progress has been made by making it increasingly
easy to share whatever you want to share about geolocation or about a restaurant dish.
One of my students last year has a company in San Francisco where the unit of analysis
is not the restaurant, a la Yelp, but the dish in the restaurant. If you want to have really
good … then this might be a good Chinese restaurant but the … might be really lousy. It
will tell you where the best… is in a 5-mile radius, or 10-mile radius.
As long as you make it trivially easy for people to annotate it, they will do it. Any
increase in the barrier of doing it will have an exponential decrease on what people
actually do.
Voice over IP is a great example. This is a great graph. It’s an old graph, from 1980 or
something, but it’s a beautiful way of representing and [ensuring] people communication
costs. On the x axis, we have the cost of transmitting 1,000 words, inflation
adjusted. On the y axis we have the number of trillions of words made available,
transmitted. For each of these things, like fax machines that were invented at
some stage, you see how the price goes down here. Telegram, price went down.
I wanted to show this to you to show you that the economics of data is important. We
live in a different era. Here is an example from Amazon.com. What is the amount of
data Amazon is collecting in a given year? The new data Amazon had, and this graph is
a few years old. I don’t know what the date is exactly now. If you have individual clicks,
such as an access log, it’s an order of 100 TB, maybe a factor of 10 more, but that’s
roughly the order of magnitude.
If you look at session aggregates, where we summarize what the person has been doing,
did they buy something, how many clicks were there, when did it start, when did it end;
it’s about 2 orders less, a TB, so it fits on one of these little devices. If you aggregated it
further, 2 orders, it’s 10 GB. All the customer data of Amazon is just 100 MB. If you blow
it up here, the more explicit data; up here versus the implicit data, you have way more in
implicit data.
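To make the step from individual clicks to session aggregates concrete, here is a small sketch that rolls a click log up into one summary row per session, with the fields mentioned above (start, end, number of clicks, whether something was bought). The log format and field names are assumptions for illustration, not Amazon's actual schema.

```python
from collections import defaultdict

# Hypothetical click log rows: (session_id, unix_timestamp, action)
clicks = [
    ("s1", 100, "view"),
    ("s1", 130, "view"),
    ("s1", 190, "purchase"),
    ("s2", 200, "view"),
    ("s2", 215, "view"),
]

def summarize_sessions(rows):
    """Collapse many click rows into one summary record per session."""
    sessions = defaultdict(list)
    for session_id, ts, action in rows:
        sessions[session_id].append((ts, action))
    summary = {}
    for session_id, events in sessions.items():
        times = [ts for ts, _ in events]
        summary[session_id] = {
            "start": min(times),
            "end": max(times),
            "clicks": len(events),
            "bought": any(action == "purchase" for _, action in events),
        }
    return summary

for session_id, row in summarize_sessions(clicks).items():
    print(session_id, row)
```

Each aggregation step throws away detail in exchange for size, which is why every level is a couple of orders of magnitude smaller than the one below it.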
0:56:23
To ramp down here, there is an iterative process of modeling and of making
decisions. That works just as well if you build trading models on Wall Street, as if you
work with an ecommerce company. The first problem here is that you define what
the problem is, and that includes that you have the baselines defined. When you do
work with a Wall Street firm for instance, make sure the problem definition includes what
it means to do well and what it means to not do well. We measure, and then we
describe: exploratory data analysis, something computer scientists tend to not be that
good at. It’s more what EEs and physicists do - let’s plot it on a [0:57:14] scale, hold it
against the light and see if it’s a straight line.
Then, I deeply believe that the only way I would trust a model is if it has
predictive accuracy out of sample, on new data. That’s
a big rift between those in the class who come from social sciences, where they come
and have p-values and tables and say, “This is the right model,” and those who come from
natural sciences, who say, “Can we make predictions out of sample?” Of course, how we
decide and how we evaluate is an iterative process, because once you actually made the
decision, after you evaluated it, you realize that what you had as your initial problem
definition probably was wrong.
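A minimal sketch of what “predictive accuracy out of sample” means in practice: fit the model on one part of the data and measure its error on a part it has never seen. The data here is synthetic and the straight-line model is just the simplest possible choice; both are assumptions for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared prediction error of the fitted line on the given points."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 0.5 * x + random.gauss(0, 0.3) for x in xs]  # synthetic data with noise

# Hold out the last 30 points; the model never sees them during fitting.
train_x, test_x = xs[:70], xs[70:]
train_y, test_y = ys[:70], ys[70:]

model = fit_line(train_x, train_y)
print("in-sample MSE:    ", mse(model, train_x, train_y))
print("out-of-sample MSE:", mse(model, test_x, test_y))
```

The second number is the one to trust. A model that only looks good on the data it was fitted to has not yet shown that it can predict anything.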
The last point here is to say I can’t emphasize enough, and I made this point in the last
class and I make it again here, that the M in PHAME, the Metrics, is something where
many people actually can benefit greatly from spending some time making them
explicit.
Here are examples of what would Amazon want to drive. Amazon might want to drive the
stock price, profit - number of items sold, and then when you write those down, which
might be 100 or so, then you begin to see what the necessary tradeoffs are. Here is an
example between profit and number of items sold. No problem, I can make sure we sell
10 times as many items by dropping the price to 90% of what my competitors have. That
means the tradeoff is profit goes. Or here, conversion rate - no problem, we can buy the
cheapest key words in the world, get everybody to the site, but they want to have free
porn and that’s not what Amazon is about, so the conversion rate of people actually
buying something is pretty small.
What I wanted to do in this relatively short time was to show you that with data here, it is
hard because we have pretty much moved from a world of reporting, of printing out stuff
to a world of doing behavior analysis, data mining, making predictions, but the positive of
this is that we move from a world where this was very much a cost center, where these people
are not well liked, to it being a profit center.
Amazon fired its marketing department in 2002. Amazon is probably one of the most
successful marketing departments in the world because of its recommendations. Why do
recommendations work? They observe what people do and smart algorithms manage to
play that back. For me, that’s what really good data strategy is about.
To summarize what we just did, we looked into the importance of data. We all buy it
and know that lots of data gets created. The reason I did this today is I want you to
be aware that I expect your products to be somewhat data intensive, so you should
think about data. I don’t want philosophy papers here. What has changed is that
we have moved from data being sniffed behind people, “sniffing the digital exhaust,” people
coming and saying, “Give me some [1:00:49] insights,” to creating incentives
where people knowingly and willingly share data if you give something back in
return.
1:01:00
To know whether you’re doing the right thing or not, you have to have a good clear
set of metrics you agree on. It’s much easier, and I can tell you this from the bottom of
my heart, much easier to agree on metrics than to agree on outcomes, before you have
done the experiment. It’s much easier in a meeting to say “These are the metrics. Any
other metrics? Okay, let’s throw this in. Let’s write the test [1:01:26] and now let’s do the
experiment” than to make a priori arguments that this is better for the following reasons.
For me, that’s the driver that has moved us from an era of faith, and I’m not saying
anything against religion here, just from an era of the “trust me” kind of thing, to an era of
data.
That’s what I wanted to tell you today. Any questions? If not, expect an email from us. I
remind you of the two paper locations. Think about which four groups you fall into, how
you want to rate yourself there. See you on Tuesday. We will have a mixer on Tuesday,
right after class, bye.