Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics

Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
STATS 252
April 6, 2009
Class 1 Overview (Part 2 of 2)

This transcript: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_1overview-2_2009.04.06.doc
Corresponding audio file: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_1overview-2_2009.04.06.mp3
Previous transcript (Part 1 of 2): http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_1overview-1_2009.04.06.doc
To see the whole series, containing folder: http://weigend.com/files/teaching/stanford/2009/recordings/audio/
Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/

Andreas: All right, are we ready for part two? Let me start with this: is there anything you would briefly share from your discussion with others during the break, any comments? Thanks to those of you who came up and introduced yourselves. Are there any comments, questions? Logistics we’ll do at the end.

Student: I was wondering, is it…Google knows what is contained…?

Andreas: The question is about the legality of collecting stuff. Is anybody worried about it? You are always free to not use a website. The only security is just to unplug your computer. The question about privacy we will do later. I mentioned earlier today the question about who owns the data. It’s a question people used to ask in the 1990s, when owning meant having something. I own that coffee; thanks for getting me coffee. With digital data it’s different. I’m not a legal guy.
Believe me, I’ve had many discussions with the Amazon main lawyer about what it means to collect data, particularly in the European [0:01:13.6 unclear], that’s different. Last year, with Harry, who works at Facebook and was a student here last year, we had some discussions with the Facebook people about what should be accessible. The point I made is that it makes no sense that by clicking I can access more information than when I do it programmatically. My background was this; I’m not sure if anybody told you from last year, we had this Friday afternoon app. It is Friday afternoon. You are in a dorm room. You have nothing planned for the evening. You want to know which of your friends’ friends are of the gender you are interested in, are interested in your gender, and may be single. It turns out that you could click through that, but they didn’t allow us to do that programmatically. So there will be questions coming up like this, but they are not really good questions. They are just questions about the state of the world, and it changes all the time. Is there anything else?

Student: One question I have is, you were talking about how to make more relevant decisions, to help people make their decisions or to give them more relevant data. Do you think, over time, this sort of homogenizes that social culture….?

Andreas: That is a super great question. After my PhD, I lived in Bangkok, in Thailand. I taught in the university there. That was before the web actually was available. I know that we constantly debated at [0:03:00.7 unclear] whether something like the Web, which some of us heard might be happening at some stage in the future, would increase the digital divide or smooth it over. To date, I don’t have the answer. 0:03:12.2 One thing is clear: by now, the country with the largest number of Internet users is China.
In 1994, when I was first teaching at [0:03:22.7 unclear], there was one terminal at the Academy of Sciences, across the street, where I could go to try to read my email. The world has dramatically changed. I have no opinion. I have no idea how it would work. It’s a good question. I know that many people worry about it. Ultimately, it’s up to people like us whether we increase the divide or whether we help bridge the divide.

I promised you another lecture of about half an hour. Then I will tell you about the first problem set, and I will introduce you to the people who are helping with this class. The last ten minutes or so is always Q&A. From 5:55, I’m available, so if you have questions in private, we can do it then.

We will have half an hour of content. That is actually half an hour I gave a week ago in San Francisco at the Facebook Developer Garage. The company I’m advising, called Contagent, put together an evening with eight speakers, good people. It’s all on the web. We talked about metrics, frameworks, and how to measure stuff. I believe that if you measure something, you can make progress. If you don’t measure it, you don’t know whether you make progress. It’s just hard. The first problem there will be, and this is due a week from Sunday, is to think about metrics in a particular context.

PHAME is the framework that I will really try to structure each class around. That means, first of all, let’s be clear about what the problem is. The problem could be to get people to join a certain Facebook page. If that is your problem, you could have certain hypotheses: how do you attract people?
You could put flyers under the peephole in your dorm. You could hit up all your friends by email, or whatever it is. If the problem is a different one, not to attract people but to retain people, then you do very different things. You try to build a good product, a product that people want to use. The hypothesis might be that if you let them have persistent data about themselves and, more importantly, the idea behind the Facebook newsfeed, if you give them new stuff, interesting stuff their friends create all the time, that will create a reason for them to come back.

0:06:00.4 I’m actually looking forward to reading your answers here, tonight. One of the questions I’m very curious about is what makes people share information. What makes people come back again, and again, and yet again? I think this is one of the drivers of addiction, whether it is on Wall Street, where people have something at stake and the world changes and affects them, whether they’re there or not; whether it’s on eBay, where you have an item out there, or maybe you are bidding for something and the price goes up; or whether it’s an MMORPG, a massively multiplayer online role-playing game, where you have that role and if you go to sleep something might happen to it. I have friends in China who actually set their alarm every hour so they can go back to their MMORPG and check on how it’s going. They all have in common that you do something and, while you are away, things happen to it and you are curious about what happened.
That would be the difference between building something for acquisition, where we know the rules of virality and how you build this through viral routes, and building for retention, which ultimately, in the long term, is the important element: how do you actually create something that is useful to people so they will come back? These would be different problems. Those are the hypotheses we gave.

Actions – what could be actions? Let me give you an example for actions, a much easier example. That’s the example of Amazon.com. It’s a very simple question: should this shopping cart, where you just added that Miles Davis CD, be on the right hand side or should it be on the left hand side? Who thinks it should be on the right hand side? Who thinks it should be on the left hand side? I will tell you. We don’t know, but we can measure it. You are right, because you are all trained to go to Amazon.com and you know how it goes. Believe me, over the last ten years, I had a lot of conversations. But in countries where people read from left to right or right to left or from top to bottom or bottom to top, it should… we don’t know, but we can measure it. Amazon.com, of course, measured it.

The key question is what are the metrics – the M in PHAME. In this case, we looked at conversion rate. I looked at order size. Typically, these are 1% effects. The point I want to make with this is that ideas we have can and indeed should help us come up with hypotheses. That’s not the end of it. These hypotheses should enable us to come up with actions we take, differential actions, left versus right. Then we simply measure what looks better, given a set of metrics we decide on. The metrics are super important. If you don’t get the metrics right, you just have the company do the wrong thing.
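The left-versus-right comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not Amazon's actual analysis: the visitor and order counts are made up, the function name is hypothetical, and the two-proportion z-test is simply one common way to compare conversion rates, used here because the lecture does not say which test was applied.

```python
import math


def two_proportion_z(orders_a, visitors_a, orders_b, visitors_b):
    """z-score for the difference in conversion rate between variants A and B."""
    p_a = orders_a / visitors_a
    p_b = orders_b / visitors_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (orders_a + orders_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return (p_b - p_a) / std_err


# Hypothetical counts (assumption, not from the lecture):
# variant A = cart on the left, variant B = cart on the right.
orders_left, visitors_left = 5000, 100_000
orders_right, visitors_right = 5050, 100_000

rate_left = orders_left / visitors_left
rate_right = orders_right / visitors_right
relative_lift = (rate_right - rate_left) / rate_left
z = two_proportion_z(orders_left, visitors_left, orders_right, visitors_right)

print(f"conversion left={rate_left:.4f} right={rate_right:.4f}")
print(f"relative lift={relative_lift:.2%} z={z:.2f}")
```

With these made-up numbers, a 1% relative lift on 100,000 visitors per variant gives a z-score well below the conventional 1.96 threshold, which hints at why effects of this size need very large samples before they can be measured reliably.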
For example, at Amazon, Jeff Bezos and I spent maybe fifty hours going through the various groups at Amazon, discussing and deciding with them what the metrics should be for the group that does cross-selling, or recommendations, or search, and so on. 0:09:23.1 If you have a group of very smart people then, believe me, if you make a mistake in the metrics, they will do precisely the right thing for those metrics and the wrong thing for the company as a whole. I have examples, but we don’t know each other well enough yet; I don’t want to share them now, with the cameras running.

I want to give you a couple of examples here to show you one thing that I, as a physicist, deeply believe in. That is that you want to look at distributions of data, as opposed to a more computer science approach of just getting the mean or something. Actually, Jeff and I had a game going where I told him, “You tell me what number you want, and I’ll tell you how we achieve that number. Do you want every session of seventy clicks? I will give you every session that has seventy clicks.”

What am I talking about here? This graph has a distribution of visit lengths, how many clicks per visit. A visit is defined as between midnight Seattle time and midnight Seattle time – twenty-four hours. Of course, to be future compatible, I have Chinese here already. These are recognized purchasers. This means we know who the person is when they buy something. These are recognized non-purchasers: we know who the person is, but they don’t buy in the twenty-four hour period.
These are unrecognized non-purchasers, so we don’t know who they are and they don’t purchase, and at the bottom you have internal stuff. Look at this graph. Do you see anything surprising?

Student: There are three or four spikes…

Andreas: There are a bunch of spikes. There are these spikes over there, and where are they? They are on that black curve, which is the unrecognized non-purchasers: not purchasing, not recognized. Do you have any idea what that could be? It’s unlikely that more people click two hundred times than one hundred ninety-nine or two hundred and one times. Maybe it’s not people but spiders.

What else do you see? I am actually very curious about this one here. Things that are not monotonically falling are really weird. Why on Earth would more people, these are recognized, non-purchasing people, click eleven times than ten times? That’s where you need to have some domain knowledge. You need to know what we set up. What was set up there was this funny thing called Gold Box, and there are too many stories about Gold Box. People come every day to see whether, among these ten (random) items that are put in the Gold Box for them, they will find a deal. People like to do deals. People like to get a good deal.

0:12:13.6 The most interesting one: a third of all sessions are single hit sessions. They are sessions that are bouncers, only one hit within twenty-four hours. Is that a good or bad thing? We have no idea. It could be somebody going and checking their sales rank, their Amazon rank, every day. It could be robots that change identity for each click. It could be people who are at that porn site, and there is the link to Amazon, and they don’t want to be at Amazon; by mistake, they click on that. Whatever it is, we don’t know what it is because they have no persistent history. You see, if you just report an average number of clicks per session, or pokes or posts or messages, it’s not as rich as trying to understand what it is that’s going on.
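The point about spikes and averages can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the session data are synthetic (a smoothly falling curve plus an injected block of 200-click sessions standing in for a spider), and `find_spikes` with its threshold is a hypothetical helper, not something described in the lecture.

```python
from collections import Counter
import statistics


def find_spikes(hist, factor=3.0):
    """Return click counts whose frequency exceeds `factor` times the
    average frequency of their two immediate neighbours."""
    spikes = []
    for k in sorted(hist):
        # Only test values that have both neighbours present in the histogram.
        if (k - 1) not in hist or (k + 1) not in hist:
            continue
        baseline = (hist[k - 1] + hist[k + 1]) / 2
        if hist[k] > factor * baseline:
            spikes.append(k)
    return spikes


# Synthetic sessions (assumption): frequency falls off roughly as 1000 / k.
sessions = []
for k in range(1, 301):
    sessions.extend([k] * max(1, round(1000 / k)))
# Injected anomaly: 200 extra sessions with exactly 200 clicks each.
sessions.extend([200] * 200)

hist = Counter(sessions)
mean_clicks = statistics.mean(sessions)
median_clicks = statistics.median(sessions)
spikes = find_spikes(hist)

print(f"mean={mean_clicks:.1f} median={median_clicks}")
print("spikes at click counts:", spikes)
```

The mean and the median alone give no hint that anything odd is going on at 200 clicks, while the histogram makes the spike obvious. The neighbour-ratio rule is a deliberately crude design choice; a real log would likely need a more robust baseline.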
I want you to look at distributions of data.

I have two more examples. This one uses persistent history and says: I’m looking at everybody who bought something today. When did they first look at that item at Amazon.com? Indeed, 80% looked at it today for the first time. Then there is a weekly periodicity. Does it make sense? On the weekends, or maybe you play volleyball on Tuesdays and your volleyball friends say, “Did you buy that book?” and you say, “Gee, I forgot. I looked at it last week,” and then you go buy it. It could be weekends or any other periodicity. That made me quite happy when I pulled those data. It’s unlikely that I made a mistake if something that makes eminent sense, like a weekly periodicity, shows up.

On the flipside, I don’t have the graph here, but I looked at data not on the weekly scale but on the hourly scale. I found that an enormous number of people, a huge peak, buy eight hours after they first looked at the item. I’m drawing the graph for you in thin air: you expect it to drop, but it doesn’t. You have another big peak. We were scratching our heads and asking, “How is this possible?” How is that possible?

Student: …

Andreas: Yes, but you wouldn’t get exactly eight hours that way. It took me two days, and I promised a good dinner to whoever solved the problem.

Student: The click log is in one place where … and another place.

Andreas: That’s perfect. You would have gotten the dinner had we met five years ago. These are small things that you spend your time on. You only learn this, and I was stupid, by first removing all negative times.
Had I kept the negative [0:15:19.6 unclear] times in, I would have realized there is a peak at minus hours, so of course. This is what happens when you are trying to do things too quickly. I’ve many stories like this. I have a whole talk about all the mistakes I made.

Here is another one I think is a beautiful curve. I’m interested in the pricing of virtual items like virtual gifts and stuff like this. These are real items and these are real data. What is on the X axis is the gross merchandise sales, so how much the order size was in dollars. The Y axis is the counts; this is a histogram. [0:15:52.9 unclear] means straight line is a power law. I hope you know that stuff.

0:15:57.6 What turns out is it has to be a straight line. Jerry Friedman quotes his PhD by it. Jerry was my PhD advisor. He had a PhD advisor, and his PhD advisor’s PhD advisor said, “You know, if you take the log often enough, everything will be a straight line.” In this case, it’s enough to take an X log and a Y log: straight line, with the exception that under $25 you have a mixed distribution; under $25 something else is happening. It makes sense because at the time I did this graph, Amazon had free shipping for orders of $25 and above. Of course, nobody would pay something like $24 plus shipping if they can just add another item and get the shipping for free. These are all the things you need to look at. If you just take an average of this plot, there is no meaning in averages of this plot.

Looking at data is something that is very close to my heart. I’m telling you these things today because, as I said, the first homework is that I want you to come up with metrics.
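A straight line on log-log axes, as described above, can be summarized by the slope of that line. The Python sketch below estimates it with an ordinary least-squares fit on the logged values, weighting every bin equally. The data are synthetic, the function name and the `min_x` cutoff parameter are assumptions for illustration, and least squares on binned counts is only one simple estimator among several.

```python
import math


def loglog_slope(xs, counts, min_x=0.0):
    """Least-squares slope of log(count) against log(x), using only
    bins with x >= min_x and positive counts."""
    points = [
        (math.log(x), math.log(c))
        for x, c in zip(xs, counts)
        if x > 0 and x >= min_x and c > 0
    ]
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    return sxy / sxx


# Synthetic histogram (assumption): order sizes in $5 bins up to $1000.
# From $25 upward the counts follow an exact power law with exponent -2;
# below $25 a flat count stands in for "something else is happening".
order_sizes = [5 * i for i in range(1, 201)]
counts = [1e6 * x ** -2.0 if x >= 25 else 500.0 for x in order_sizes]

slope_tail = loglog_slope(order_sizes, counts, min_x=25)
slope_all = loglog_slope(order_sizes, counts)

print(f"slope for orders >= $25: {slope_tail:.3f}")
print(f"slope over all bins:     {slope_all:.3f}")
```

Fitting only the region at or above the cutoff recovers the exponent built into the synthetic data, while including the sub-$25 bins distorts it. That is one reason to separate the two regimes before reporting a single number.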
These metrics should be something robust. By the way, what would be a robust metric to describe this [0:17:04.3 log/log] plot here? I’m curious. I’ve never asked this in class. I did it with one company, but I’ve never asked a class. What would be a way of describing it?

Student: What’s the vertical line between when the …?

Andreas: It’s clearly a mixed distribution, so one question is where the cut between the distributions is. If we ignore the thing on the left hand side, how do we describe the [power law]? What parameter do we have left? Its slope, and then of course the question is how you compute it. Do you weight each point evenly, or stuff like this; there are some details here, but what we want to know is the slope. That’s the same for the Pareto distribution if you are in economics, or Zipf’s law if you are in linguistics. Basically, if you have these scale-free distributions, all you want to know is the slope. Don’t come and say you are interested in the mean of this. Tell me we can characterize the distribution by its slope.

Here is another graph. These were retail data from an undisclosed company, around 9/11/2001. The reason I’m showing you this graph and the next graph is that these are the graphs I have from [0:18:19.7 Ronnie Cohavi]. You don’t see all that much here. If you plot the weeks on the X axis and then, in a circular way, the days, you clearly see that something happened here on 9/11. That Tuesday was not a very busy day in terms of the undisclosed retailer selling stuff. In this graph you don’t see much. In that graph it’s obvious: this was a holiday, Labor Day. This was 9/11.

The other point I want to make is that people are social beings. That had been pretty much ignored by traditional marketing. They sent people trees back home, paper catalogs, and they never listened to what people actually had to say. 0:19:17.1 Another experiment was done by a friend of mine at NYU.
It was to compare recommendations that use the social structure, who is connected to whom, with recommendations that ignored the social structure but used the most sophisticated statistical models available, [0:19:35.3 unclear] There is a lot of experience, a lot of traditional data, demographic data, geographic data, and loyalty data. The product was a new communication service, like a new service on your phone. The comparison I make now is this: in green, you have these sophisticated traditional models. In orange, you have something very simple. You just market to the people who make phone calls with the people who bought the item. If we call each other, and you bought that service, they would hit me up and say, “Hey, do you want that?” It turns out that the response rate here is 0.28%, and here it is 1.35%. It’s almost five times as large. The lesson to be learned from here is that people are social beings, and knowing who hangs out with whom is very powerful, often more powerful than building complicated models.

Do you remember, the 1990s were about: given a set of data, what insights can we get? Some people are still waiting for those actual insights. This decade is about: what is the problem, and what data can we go and get? In this case, we look at calling data and there we go. Your neighbors are in the graph, so it’s people you are connected with.

Metrics have a big impact on what people try to optimize. Let’s pause for a moment and think about what really is important. It’s very important to understand what is scarce, as opposed to what’s abundant. If you want to build cool apps, you have to help people with their scarcities.
For instance, one of the scarcities we have is time, or attention. If we help people do more with their time, direct their attention to more interesting things, we are doing a good job. One of the things that used to be scarce is information, whatever that is. Now we know most answers are somewhere, but with our finite attention, how can we get to them? That is one of the frameworks that is quite important.

What are the real costs? Many people only think about dollars. The real costs now are very different ones. What I always want to point out is that there is this thing called social capital. Does anybody know what social capital refers to? Who knows what social capital is?

Student: Reputation

0:22:51.0 Andreas: Reputation? Yes, so for instance I have a certain social capital. I’m happy to introduce people to each other. I’m happy to help students with projects and with startups. So, I have built social capital. If I need a favor, they are happy to help me back. On the other hand, if I were spamming all of you all day long, I would pretty much have spent my social capital by now. You would probably say, “Oh, another message from Andreas,” and delete it before reading. Social capital or social cost is often more important than monetary costs for people.

Another one we have seen is the cost of interruption being vastly underestimated. We all have situations every day where we are in the middle of something and the phone rings. Then we quickly answer the phone call and say, “Gee, what was I doing before?” and swap all these processes back in. It’s something people way underestimate in terms of how expensive it is.
The interesting thing is that we might cognitively know that we shouldn’t be answering that phone call, but we’ll still do it. A friend of mine, Danny Kahneman, received a Nobel Prize in 2002. I usually visit him in New York, and we’ll talk for a few hours about what’s cool and new. The last time it was Twitter. The time before that it was Facebook. Last time he asked me, “Andreas, can you do me a favor?” I said, “Sure.” He said, “I have this PC and they always have these alerts coming up whenever I get a new message. I know I shouldn’t be looking at it but I can’t help it. Can you please turn that off?” If you are not strong enough to resist the temptation of shiny objects, don’t worry; you’re not the only ones. Sometimes what we have learned, which was good for us when we lived in caves, not most of us but our ancestors, might not be the best thing anymore. I think the fact that my mobile phone doesn’t work on campus is probably a good thing.

How can we help manage people’s attention? The first insight is giving them insight into how they are spending their attention. On my blog, www.weigend.com/blog, I had a very interesting conversation in July of last year with Esther Dyson and Shoshana Zuboff. We talked about why people do these things, and basically boiled it down to this: people want to spread their memes and people want to spread their genes. From that, all the rest follows. If you want to download an hour and hear us rambling at 7:30 in the morning, it’s on www.weigend.com/blog.

Short term versus long term – it’s easy to measure short term stuff. Of the metrics I want you to come up with, some should be short term metrics. Those you can do with A/B tests; left/right, which I showed you before, is an example of what’s called an A/B test. 0:25:48.3 Long term things are very hard to measure. Ultimately, however, the long term things are the ones that matter. If you just do short term things, you won’t have a big impact.
It’s like many games; people play the game and people play another game and people play yet another game. How can you actually come up with metrics that measure long term user satisfaction? At the beginning of class today, we talked about how we go from a transaction economy to a relationship economy. Relationships are long term relationships, like two-night stands, three-night stands, at least. What do we have on the [0:26:24.5 unclear]? It’s a little bit difficult with this class because this class is only nine weeks. We can’t really have good long term metrics. One of the awards of the class will be that half a year from now, at the end of October, we’ll have a dinner with a group, whoever wins the contest I will describe in a moment. So we have some long term component, and we will show you the long term metrics that you devise throughout the next year.

I want to give you one example here from the field that is called behavioral economics. There is classical economics, neoclassical economics, which tells you how people ought to behave. For instance, want to experiment with this? I don’t know your name. Blake? And Matt? Matt is just checking his SMS. Matt is done checking SMS right now. I give Blake $10 and say, “You can now play a game.” You offer some amount to Matt. If Matt thinks it sounds like a deal, then out of these $10 you both keep whatever your respective amounts are; if he doesn’t think it’s fair, I get my $10 back. I give him $10 and he says to Matt, “Hey, I’ll give you a penny.” What would Matt say? Fuck you, probably. That is the end of that game. If he says $3, Matt would say, “Okay.” It’s very culturally dependent, by the way.
Russia is totally different from China, which is totally different from Europe, which is totally different from the U.S. These are things where a neoclassical economist would say, “One cent; we’re both better off, thank you,” and there we go. That’s not how people do things.

The other example I want to give you is from Draeger’s; you know the store in Menlo Park? Draeger’s has – I didn’t believe it until I saw it – 360 kinds of jams. Here was the experiment: Stanford Psych 101 students, dressed as if they were normal shoppers, were there as counters and counted who was coming in. In one experiment, there was one table that had six jams on it. They counted how many people were actually poking there and trying some of these jams. It turned out, and I actually didn’t look up the numbers right now, it was something like 30%. I know that 30% bought.

0:28:57.1 Now, in the other case, they didn’t have six jams, but they had 24 jams, and not weird stuff, but the 6 from before and more, up to 24, of reasonable stuff. Honestly, I’ve never seen 24 jams on a table. Indeed, the ratio of people who went and checked out those jams was much higher; however, the purchasing rate dropped to only 3%. People said, “That’s interesting, I will worry about this next time,” and they actually didn’t buy. That’s the baseline for jam purchasing.

The point here is that what we’ve evolved with in dealing with options, where you always want more options, might not be the right thing online. When you develop an app, try to keep that in mind; the classic approach of economics, of people being rational agents with unbounded rationality, isn’t how things work.
The third and last example I’m going to give you here is from The Economist; after all, we’re talking about economics. It’s a weekly paper coming from the U.K. It’s a good paper. Here is what they did: they had, for their subscriptions, [0:30:06.7 unclear] the possibility of getting online only for $59, print for $125, and print and online also for $125. I’ve given you the numbers, and this is from Dan Ariely, for all three options. It turns out that people are good at comparing stuff but people are not good at absolute things. People are good at saying, “If print is $125, and print and web is $125, this is a better deal than that; after all, this is The Economist, right?” It turns out that nobody takes this one here; 84% take that one. Then The Economist thought, “If nobody is choosing the middle option, we might as well remove it.” What do you think happens? The preference, which used to be that the majority of people prefer this over that, flips. By adding a choice nobody picks, or by removing a choice nobody has picked, people’s preferences reverse. It’s just like the one-cent story, or many other stories I have.

Why am I telling you all of that? I don’t know these things, and I’m not the only one. I always argue that people don’t know what they want. People are good at making up stories as they go through the world. That’s why I have this PHAME framework. We think about the Problem; we think about Hypotheses; we think about Actions we can take, that’s the key thing; we think about Metrics, and metrics need to be actionable, accessible, and auditable; and then finally we do Experiments. Are we done? Of course not. We are always in business, because experiments lead to new problems, new hypotheses, actions, metrics, experiments, and so on.

What’s missing in this picture? Did you miss anything? Where is data mining? Where is data? The point is there is no data in here.
This doesn’t mean we don’t collect data, but it means that the game here is not having a set of data and doing analysis; the data we collect here in the experiment are driven by the problems we are trying to solve. That is really the mindset I have for you here.

0:32:44.6 What I am personally interested in is virtual gifts. If any of you are interested in that, I am probably doing some fun stuff with Facebook. We have a filmmaker who wants to do a movie, 99 red roses; some people give real roses, other people give virtual roses. Then we will see what the differences are. What are the metrics? Getting laid, probably; I don’t know what the metrics are. [Laughter] That will be an experiment. Any volunteers, talk to me about it.

Now we need to get to the serious stuff, because I told you at 4:30 it’s time to talk about serious things. First of all, in the hierarchy of who can write: www.weigend.com, only I write to. www.weigend.com/teaching/stanford has information about the class. If there is something wrong, do email me. I don’t pay $1 for a bug, but maybe I should do this. If there is a bug on the site I will pay you $1. All of you should have access to the class wiki, which is www.stanford2009.wikispaces.co. You can also look at www.stanford2008.wikispaces.com to get a feeling for what people did last year, www.stanford2007.wikispaces.com, and so on. All the people who have written their email addresses, the union of the email addresses we have, will get an email from us when I get home tonight, giving you access to that page.
If you want access in the meantime, hit “give me access to that page,” and one of the TAs will grant you access. The main thing on it is homework 1. For homework 1, I had an idea: I went to my friends at Facebook and said, you did this thing with a Facebook class two years ago. It was great, getting ten million users in ten weeks. Unfortunately, when I asked people in my class half a year later how many were still using it, one group had seven. The other group had nine people; another group was just one guy and his friend. In some ways it was great for acquisition, but it really failed for retention.

The task we came up with is to create an active, meaningful, and persistent community of Facebook pages for some cause that promotes the social data revolution. We first wanted to have some more real-world examples, but again, the lawyer said you can’t pretend to be somebody else. We can only pretend to be here about the social data revolution. When I actually thought about it, it’s actually not that limiting. It can be pretty much anything where you have some idea about sharing data. It could be about health data, about whatever you want to think about.

This is a group project. Before you leave today, before we shut class down at 5:05, I want everybody to have found partners for a group; three to six is a reasonable size. Try to make it diverse. You need some people who have some ideas about marketing; you have a lot of people here from the Business School. You want some people who can actually get stuff done. [Laughter] Did I say something wrong? You know what I mean; you want people who can write, who can make pretty graphs, because today we are assigning this.

0:36:18.9 On Sunday afternoon, I want you to come with a proposal for metrics. The reason I want this on Sunday is so I can give you feedback on Monday in class. Then, for the subsequent Thursday, I want a brief progress report of how you are doing on these metrics.
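Illustrative sketch (not from the lecture): the acquisition-versus-retention distinction in the homework description can be made concrete with a small calculation. The record fields (`joined`, `last_active`), the 30-day window, and the sample data are all assumptions chosen for illustration; they are not Facebook’s metrics or API.

```python
from datetime import date, timedelta


def acquisition_count(users):
    """Number of users who ever joined (the acquisition side)."""
    return len(users)


def retention_rate(users, as_of, window_days=30):
    """Share of acquired users who were active in the last `window_days` days."""
    if not users:
        return 0.0
    cutoff = as_of - timedelta(days=window_days)
    retained = sum(1 for u in users if u["last_active"] >= cutoff)
    return retained / len(users)


# Hypothetical records: four users joined in April, one is still active in October.
users = [
    {"joined": date(2009, 4, 6), "last_active": date(2009, 4, 7)},
    {"joined": date(2009, 4, 8), "last_active": date(2009, 4, 20)},
    {"joined": date(2009, 4, 9), "last_active": date(2009, 10, 1)},
    {"joined": date(2009, 4, 12), "last_active": date(2009, 5, 15)},
]

half_year_later = date(2009, 10, 6)
print(acquisition_count(users), retention_rate(users, half_year_later))
```

A page can look strong on the first number and weak on the second, which is the failure mode described for the earlier Facebook class.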
That week, after Sunday, I will be at Facebook, and we will see what metrics they might have that are not public; I think we have great ideas they should implement for our classes. The first report will be just what we have. By the 23rd, we will have a stable set of metrics. We will compare on a web page, and if somebody wants to help out a little bit in creating some of the infrastructure, it’s not that much work; I would love for you to come up to me afterwards. On that page, we will have all of your groups, on a daily basis, competing against each other. We will see whether the metrics I decide on, after I get your metrics examples, are the good ones, and we will see how well you are doing. This element that you learn from each other is actually key. It is not just [0:37:31.5 Enrique], who actually wrote the homework and said to come up; but it’s not only from each other. There is another twist that is a fun twist to this experiment. If you stand close enough to me and talk loud enough, you should be picked up by the microphone. This is Enrique here; he took the class last year. If you want to hear all the complaints, talk to him and his friends.

Enrique: I’m Enrique. I hope all of you stay in the class. While I was writing this assignment, I actually wrote it first for the Haas Business School class. In the actual writing of the course, I put in parentheses that this will actually be a competition between our class and the Haas students. I said that undergrads here will beat them, until Andreas found that on the assignment and made me delete it.
I firmly believe that you guys will create ten times better pages than they do and eventually win and go out to dinner with some cool, distinguished guests, and actually present your pages as case studies for Facebook. I think you have some incentives and we’ll talk more about how [0:38:37.1 cough] those incentives to make you guys actually get creative and try to explore and push the boundaries to the fringes of social media.

Andreas: I figured it’s a fun thing, since I am in the situation that this quarter I teach a class called Marketing 2.x at Berkeley, as well as having you guys here. Berkeley is the public school on the other side of the Bay. We carefully timed it so I will also tell them today about the assignment, and all these deadlines are identical to their deadlines. We will also have on the summary page the comparison between your average and their average. We will know. I don’t know the answer but we can measure it. [Laughter]

0:39:17.4 I created two Gmail addresses. This one is if you have questions for the TAs. If you are still here, do you mind standing up? We have Enrique, our social media TA. We have [0:39:34.0 Xian Thung], the guy over there, who is from [0:39:39.2 unclear] Province, which is on the way if you take the train from Shanghai. I also have a house in Shanghai, by the way. Take the train from Shanghai to Beijing and you go to …. He did his graduate degree at [0:39:48.4 unclear], at Peking University in maths. He’s much smarter than I am. His undergrad program was People’s University in Beijing. We also have [0:39:56.8 Xian Yushuei], who went to [0:39:59.2 unclear] in ….
Province, and they’re grad students in the STATS department. They will be helping you with homework, I hope. We are also still looking for graders. If you have any friends who will actually help out for whatever the school pays graders, although they shouldn’t do you any favors, let me know. Also, last year I had Eric Sun, who was the most wonderful writer for my blog, but Eric is so busy with the concerto competition that he told me I should ask the class whether any of you is willing, for pay, to help me write a blog entry every now and then. For me, English is a second language and all of that stuff, so it’s not that easy to write well. That’s a personal request I have for you. Then, for the script for the metrics collection and display for the class, if somebody is willing to put in a couple of hours and is good at Perl or Python, that would be great, so we can get this done.

For more contact information, I mentioned my page, the wiki, and www.socialrevolution.com, thanks to Enrique, who is a guy who gets stuff done. It points to the Facebook page. It’s the same as www.facebook.com/socialdatarevolution.

One thing I will do tonight is look at what you wrote in response to the question I asked you: if you could ask a student in the class one question related to the social data revolution, what would it be? From the seventy or so copies here, I will see which ten questions or so are really cool questions. I will create a Google form and I will mail out the URL of that Google form and ask you to spend ten or fifteen minutes, by tomorrow evening, to give us the answers. We will make the answers public, anonymously, for the rest of the class. I want to know where your heads are. You remember the story with the fish in the water? I need you to explain the water to me.

We already talked about email. I’m easily reachable. I am at www.aweigend.stanford.edu. That is it from my perspective, as far as logistics is concerned.
The homepage, if you will, on www.stanford2009.wikispaces.com, by tomorrow morning, will have all the logistics questions answered. Right now, I couldn’t access it from the department so it’s not up yet. I apologize for this.

0:42:42.3 These are slides. I will put the slides and everything up on this page, which is linked from www.weigend.com/files/teaching/stanford/2009/. You will get it by clicking on the links. Are there any questions? We have about twenty minutes left.

Student: I was thinking about metrics collection.

Andreas: There is a certain set of metrics that are available on Facebook. Do you want to show them?

Enrique: I think the best thing is to create a page. It will take a couple of minutes to create a page and they’ll see over these next couple of days they’ll start to get…. There are a number of web 1.0 metrics available…

Andreas: These are not the most advanced metrics in the universe. It is a good starting point. I believe I have a screenshot of them here. While I’m looking for this, do you have any other questions? Let me show you… you never know what comes up on the screen, like Danny Kahneman said. This is not the best screenshot possible. What we have is page views, total interactions, unique views, fans, removed fans. That’s very important: removed fans. Actually, when we talk about the costs, you really should consider the huge cost if somebody says, “I don’t want to hear from you anymore.” They have friends and tell their friends, and stuff like this. Wall posts, discussion topics, reviews, audio plays, video plays. These are the standard metrics.
I want you to use your creativity in coming up with metrics that you think make sense. I’ll be at Facebook next week and will try to convince them to implement the ones that I think make sense from the ones you think make sense. It really is an exercise for us to change the world, in order to move from these web 1.0 metrics of unique users to true metrics of engagement. What are your other questions? Say your name and department.

Alex: I’m Alex, CS… what progression?

Andreas: What do you mean by progression?

Alex: Course outline.

0:45:30.1 Andreas: Look at last year. I think the best example for this year is last year, although the performance last year is no guarantee for performance this year. Take the time, click through last year’s wikispaces, www.stanford2008.wikispaces.com; it gives you a good feeling about what’s happening. I have not put the individual classes together, partly because I want to learn from you what it is you are interested in. Last year is a good example of what will be happening.

Enrique: If you guys are working on cool projects or want some… share that feedback right now because it’s the best time to share it. If you didn’t put it all up there, drop an email or get in touch.

Andreas: I plan to really do a good job in putting it out in the next couple of days. The reason I told you this more general stuff today, like the PHAME framework, or the importance of data, is that that’s the way I think about the world. It’s not about tuning that algorithm, given a set of data, but it’s coming up with cool problems. I want to see where you are right now. I told you the story of the water and the fish twice, now.
You are the fish in the water. Tell me about the water and I’ll tell you what I will pick from that. Look at last year’s 2008 and that will give you a good feeling about what we did last year.

Chris: My name is Chris. I’m from … systems. For the projects, do we have to use Facebook pages, or what’s the… Facebook pages because as far as I can tell, Facebook pages are actually kind of pretty horrible…

Andreas: We did think about this. I also talked to other people about it. We felt that the question about which platform you are on is almost more important than how smart you are in putting something up. If you are on a different platform, it’s very difficult to compare what different people are doing. Also, with a whole bunch of former students working in the data group at Facebook, it is relatively easy for me to get them to do stuff. What were you thinking about?

Chris: For me, community often is fostered in places where there are more open discussion tools, so I’ve seen it happen more in forums or real-time chat rooms… whereas Facebook, it’s what you get when you have any sort of discussion on the wall… linear thread… chronological, terrible way… pretty much no way of sorting…

Andreas: I hear you.

Chris: … people reading individual posts rather than actual persistent…

Andreas: I am with you. We did talk about it but we didn’t have a better suggestion. If you know of a better platform… I should say I have [0:48:16.3 unclear], who is the CEO of Ning, coming to one of the classes. I invited those of you whose email address I had last week to put a picture up on Ning so I can match pictures of people with their names, as opposed to having to go to this awful tool Stanford has for instructors where I can look at one picture at a time. I don’t think it is all that much better a platform.

0:48:39.1 Enrique: Do try to drive stuff to Facebook pages if you can, but you could use something else.
You could use other acquisition channels if you wanted, or do some type of porting the data to somewhere else. You can be creative. Right now, we’re working off of this one constraint… platform constraint. Don’t close your box.

Andreas: Enrique is right. It has to be very manageable for the class. That is a platform where we think we can manage things okay. I am also talking to [0:49:16.6 unclear], who runs the entire ad group at Facebook, to get money to run ads. This is sort of more or less real money. You could also sell whatever else you are selling for the money. We can’t check what you’re driving your traffic to. That’s a little bit of a worry right now. Let’s talk about this after class. If you have good ideas, I think we are super open to listening to them. Thank you.

Student: Are all assignments through…

Andreas: No, this is one group assignment. There are individual assignments and group assignments. It’s not primarily about the grading. It’s about you learning stuff. The way I structure the assignments is that I have a spreadsheet or a Google Doc here that the TAs and I have access to. At the end of the quarter, I’ll just push some button and I will make reasonable cuts for A+, A, A-, and so on. Everybody in the group gets the same points. I’m not in the business of asking how did you do compared to him. For the group projects, that assignment, homework 1, will run for five weeks. That is algorithmically graded. It’s not my feeling that goes in, but those metrics. That’s the first step. The second is we will see how the group is progressing.

I should mention one thing. On Wikispaces, we have a class wiki for each class.
I think the total number of classes is nine. That means roughly 1/9 of the people should volunteer today to bring up the class wiki for the first class, in a reasonable way. What we typically say is by Wednesday evening or Thursday. Whatever it says on the first page; I think it’s Thursday evening. How does it work? You find people you may have been too shy to talk to at the beginning of class, but really would like to talk to, and say, “Do you want to do the wiki with me?” Then you have to get together and figure out what it was we did in class, what was really important. I don’t want anything resembling the transcript. We have the transcript already. I want you to really condense this, to enrich it with hyperlinks, to come up with good insights, and to have a relatively short page for each class.

0:51:40.7 You have until Thursday. I will look at this on Thursday evening, every week. I have full discretion: if I like it, it gets 10 points. If it’s sort of okay, it gets 7 points. If it’s no good, 0 points. At Berkeley, they didn’t understand this notion of the wiki. There was nothing after three days: 0 points. It’s 30% of your class grade. 30% comes from that one week when you take the time with maybe 5-7 other students and produce the wiki. It’s an important part of the class. Pick a topic. An important point is that some of you who might be busy later in the quarter may want to volunteer today. Find people. Give me your names and I will send my notes to somebody. I will send my PowerPoint to somebody. I do whatever I can do to support you. But on Thursday evening, whatever wiki is up is what you will get for 30% of your grade.
It’s the same next week, the week after and so on. The TAs keep track of who did what wiki and I will just enter the grade for that wiki. Look at last year’s wiki. There are a couple of absolutely awesome pages. I remember the three hours on recommender systems is a super great wiki page. I remember we had one class on instrumenting the world, and I was shocked at how rich that wiki page was afterwards. Pick a topic. Once we have the topics up in a few days, pick a topic you’re interested in, rally some people around you and say, “This is the wiki.” That’s 30%. Homework is 60% of the grade, and 5% is for class participation. I don’t mean coughing. I mean contributing in class. Another 5% is for contributing elsewhere: commenting on Enrique’s blog, using Facebook Connect, or putting stuff to the Facebook Group and stuff like this, good citizenship. Are there any other questions? Did I forget anything, Enrique?

Enrique: … higher level should be thinking of this as a movement together… really work together… community even within this classroom. I know we all have motivation… it comes to this classroom…

Andreas: In that spirit, let’s close, since this was the first week of class here, today. Let’s close ten minutes early and give people ten minutes to actually form the groups for the wiki and see whether there is anybody else interested to give me feedback. If you forgot something in your forms, find your form and fill it in. I’ll see you next Monday at 2:15. Thank you.
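Illustrative sketch (not from the lecture): the grade breakdown stated above (wiki 30%, homework 60%, class participation 5%, contributing elsewhere 5%, with the wiki scored 10, 7, or 0 points) can be written as a weighted sum. The function name, the verdict labels, and the assumption that homework and participation are expressed as fractions between 0 and 1 are hypothetical; the lecture does not say how those components are scaled or how letter-grade cuts are made.

```python
# Weights as stated in the lecture; they sum to 100%.
WEIGHTS = {
    "wiki": 0.30,
    "homework": 0.60,
    "participation": 0.05,
    "citizenship": 0.05,  # contributing elsewhere (blog comments, Facebook group)
}

# Wiki points as stated in the lecture, out of a maximum of 10.
WIKI_POINTS = {"liked": 10, "okay": 7, "no_good": 0}


def course_score(wiki_verdict, homework, participation, citizenship):
    """Weighted course score in [0, 1].

    Assumption: homework, participation and citizenship are fractions in [0, 1].
    """
    parts = {
        "wiki": WIKI_POINTS[wiki_verdict] / 10,
        "homework": homework,
        "participation": participation,
        "citizenship": citizenship,
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


# Hypothetical student: an "okay" wiki, 90% on homework, full participation,
# half credit for contributing elsewhere.
print(round(course_score("okay", 0.9, 1.0, 0.5), 3))
```

Under these assumptions, a group that posts nothing by Thursday evening loses the full 30% wiki share regardless of its homework.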