weigend_stanford2010_2data_2010.04.01

MS&E237 Spring 2010 Stanford University Andreas S. Weigend, Ph.D. The Social Data Revolution: Data Mining and Electronic Business

Andreas Weigend (www.weigend.com)
The Social Data Revolution: Data Mining and Electronic Business
MS&E 237, Stanford University, Spring 2010
April 1, 2010
Class 2: Data

This transcript: http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_2data_2010.04.01.doc
Corresponding audio file: http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_2data_2010.04.01.mp3
To see the whole series:
Containing folder: http://weigend.com/files/teaching/stanford/2010/recordings/audio/
Course Wiki: http://stanford2010.wikispaces.com
Transcript - Tamara Bentzur - Testimonials – www.tbentzur.wordpress.com, www.outsourcestranscriptionservices.com

Andreas: Welcome to the second class of MS&E 237 this spring. The agenda for today is the following. We will start with logistics. I’ll tell you what’s coming up through the quarter and how you’ll be evaluated, and I’ll introduce one of the TAs. We’ll then form some groups and I’ll tell you why we will do that. That will be kind of a break. In the second half of class I’ll do some content, where I will tell you about different data sources. That is the business perspective, before we dive more deeply into the technical stuff in the next class.

First of all, thank you for sending in almost all of your surveys and for sending in your bio information and interests. I haven’t managed to get to all of them yet. I promise I will be done with everything you’ve sent in, all the 88 forms I’ve received so far, by Tuesday.
That’s a great treat because I get to know all the things you think are cool, so I have the intelligence and the attention of about 100 people sitting on the web and getting it socially filtered through you.

Let me talk about assignments. Assignments come in four flavors in the class. We have a group project, which makes up 40% of your grade. We have online contributions, things you do on socialdatarevolution on Facebook. We have individual homework; those are slightly more technical. If you want to get a feeling for what’s coming down the pike there, look at what we had last year, but this year, being in MS&E, it’s less technical than last year, which was in STATS. Then we’ll talk about dog food in a moment. After that I will talk about something I decided to do, which is sort of a class rep advisory board. Then we’ll take a break and I’ll do about a 40 minute lecture on data sources.

Filling in details: next Tuesday after class we will have a mixer. We’ll get some pizza, some beer, and the purpose of that is so you can get to know each other and find out who has the complementary skills that you need for the groups. We will have maybe 15 or so virtual tables, where if somebody feels strongly about a certain area they want to do a project in, they’re going to recruit 3 other people - the total group size is 4 - to work with them on their project. We have very different skill sets in class, so what we’ll have is a Google online spreadsheet where each of you rates yourself along 4 dimensions. The first dimension is what I call a “happy hacker,” people who are good at hacking stuff. The second one we could call “producer,” so those are people who are good at product management, because a product needs to be managed and if you don’t do this then all the work happens at the end and people are very unhappy.
The third one is the “secret sauce” guy, the algorithms guy who knows maybe more theory, who knows what can be extracted from data: machine learning. The fourth dimension is “strategy.” We have 16 people from the GSB here. I would expect them to rate themselves pretty highly on strategy. For me, strategy also means data strategy. Since these projects are all projects about social data, one key element there is how do you get to the data; what is the data strategy?

0:04:29 As I was driving down, I had a conversation with a company in the financial space. They first got data by screen scraping. That means you get the user to give you a user name and password and you pretend to be logging on as them, and you get all their financial data and present it back to the user as a dashboard. If you were the financial institution, what would you say? Would you say no way, those guys are actually going to grab the data from our user and show their ads, or would you say great, they’re doing work for us for free? What would be your perspective there? The first perspective is we don’t want them to do this. So, the financial institution switched things around, with the result that lots of the data that was scraped was actually wrong, like your amount of dollars was your zip code, for a day, and the users were ultimately very unhappy. The company went to the financial institution and said, “Do you want unhappy users or happy users?” “We want happy users.” “Why don’t we do an API, an interface where you can actually suck the data out right away, as opposed to doing the screen scraping?
By the way, since we’re doing work for you, why don’t you pay us for that?” That would be an example of strategy: how do we get to the data we need? Incentives fall under that.

Go to this link, bit.ly/mse237 projects. There should be your name and email address. Rate yourself there, and people can add some descriptions, so you have prior information when you go to the mixer on Tuesday after class regarding who you might want to talk to. We’ll have name tags and they will have the phone numbers prominently displayed. Of course, as a data mining guy, at the end of the quarter I’ll be curious as to what the correlation is between the raw grades and those numbers. That is what you need to do so we can form groups. Groups need to be final by Thursday of next week, a week from now. If there are problems, tell us early so we can announce them in class and see whether we can still fix things. Don’t wait until 3 weeks into the quarter.

The projects are created by you. Last week we saw that Intuit is willing to act as tutors for some of the projects. We have another couple of companies which will do similar things, but ultimately, coming up with good questions to ask really is an important part. It’s not that I will say project one, project two, project three; an important part of the project is actually defining the project and figuring out how to go about it, particularly the data. We will have a number of milestones through the quarter so it’s not all pushed to the end. It is relatively frontloaded. We’ll tell you next week what the deadlines are. Any questions about that 40% here, the project? It’s different from last year. It’s much more important. It’s a group project for a maximum of 5 people.

Student: Is this the project we’re going to pitch to the VCs?

Andreas: That’s the project which at the end of the quarter you will have a few minutes to pitch. Primarily, you’ll first pitch it to your friends. You will see how they debug it.
When I was at Xerox PARC and people said, “I can’t really talk about it,” my feeling was always that there was nothing to talk about. Good people are so rich in ideas that they’re very happy to share their ideas and get them debugged by their friends. The VCs are at the very end of the last class. I want you to talk to everybody about it and make it better before you even start coding. It’s one project throughout the quarter in your group. Any other questions?

0:08:55

Student: Are you going to have a grading rubric or something like that so we can know …?

Andreas: Yes, we’ll have a timeline, which is more important than the grading. Why do you think we want to do a project?

Student: So we can respond creatively to the material.

Andreas: I want you to figure out what to do, what would be an interesting application of the stuff I feed you. Personally, I always learn more in projects than in just doing problem sets, which gets us to the other three ingredients - online contributions. One of the potential projects is to come up with a very good social structure for socialdatarevolution.com, but we haven’t set it up yet.

The homework I’m assigning now: I’ve put up two papers from The Economist, two special reports. One is from January this year, about social networks, and the other one is from February this year, about big data. So you can convince yourself that this actually works, this is the link. If you click on this you can either have it as a PDF scan, which is pretty big, or as a Word document. This is illegal.
I should probably pay royalties for that, but I never understand how those rules really work so if you want to buy The Economist, you can do that as well. If you want to just read, you can do it here. I expect you to go on the Facebook group, facebook.com/socialdatarevolution, and share one related idea you get by reading these 32 pages. Shared there means it’s shared with your friends as well. Facebook.com/socialdatarevolution is the right transient place where things come and go, and disappear afterwards, but that’s where ideas get shared related to the class. It’s not just that the class watches it; past students see that as well. I think both those articles are pretty good, and it is irrespective of your background, The Economist is a decent way of introducing the ideas that have relevance here for business. Has any of you actually seen any of these articles already? How did you come across them? You subscribed? What did you think about them? Student: It’s been a little while. I think the big data one was a little disappointing because by the time you read the cover of The Economist, the trend is already a little too late to jump on. That was my reaction to it. Andreas: I remember they called me in January so it actually gives an interesting feeling about how long it takes them to actually produce something. I know we talked for hours in January when I was in Shanghai, so it’s about 2 months between talking and it finally appearing. Any questions about what I expect you to do between now and Tuesday? Student: That’s due by Tuesday? Andreas: You should post something on Facebook, and the great thing about it is that if it’s a great idea you have many people who will like it. If it’s lame, then you probably will go by the wayside. It makes it easier for us to get a feeling about what the quality of the idea is, if lots of people like it. Or, they might like you or your picture; we’re never sure about that. 0:13:05 Individual homework is more …. 
If you want to get a feeling about that, look at last year. For instance, Google Analytics is just a good one. We still need to figure out how we trim down last year’s assignments so we don’t overwork you, but you’ll still get the insights which we want you to get. You probably will have to build a recommender system, but the building blocks are there. It’s in Python, and for that one I allow people to potentially collaborate in pairs, so if one person really doesn’t know Python, then we’ll find ways of making this work.

Why do we do that? I personally think that the best way of building mental models of what to do with the data is actually doing it. I can talk until the cows come home about how these things ought to work. If you don’t do it yourself, you will never actually get there. That’s my personal belief. There are people who talk about it and never do it. Immediately when you talk to them, you can feel they don’t know what they’re talking about. That’s why I actually want you to do it, to run it, to build a mental model. I did my PhD here at Stanford, doing neural networks. I think I was super lucky. My advisor was Dave Rumelhart, who invented neural networks, and he had an amazing intuition, coming from cognitive psychology, about how you learn from patterns, how you extract patterns by learning from data. That’s why I have a very strong bias towards running stuff to see what it does. You change a parameter and see how it actually changes the quality of the recommendations. That’s my intuition and my intention and why I give you homework that is actually hands-on. Are there any questions about that?
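To make the “change a parameter and see what it does” idea concrete, here is a minimal sketch of a user-based recommender in Python. It is not the course’s building blocks or assignment code; the toy ratings, the function names `cosine` and `recommend`, and the neighbor-count parameter `k` are all assumptions made up for illustration.

```python
from math import sqrt

# Hypothetical toy data: user -> {song: rating on a 1-5 scale}.
ratings = {
    "ann": {"song_a": 5, "song_b": 4},
    "bob": {"song_a": 5, "song_b": 4, "song_d": 3},
    "cat": {"song_a": 4, "song_c": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, k=1):
    """Rank items the user has not rated, using the k most similar users."""
    mine = ratings[user]
    # Score every other user by similarity and keep the k closest.
    neighbors = sorted(
        ((cosine(mine, theirs), name)
         for name, theirs in ratings.items() if name != user),
        reverse=True,
    )[:k]
    # Each unseen item gets a similarity-weighted sum of neighbor ratings.
    scores = {}
    for similarity, name in neighbors:
        for item, rating in ratings[name].items():
            if item not in mine:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ann", ratings, k=1))
print(recommend("ann", ratings, k=2))
```

With this toy data, `k=1` recommends only `song_d`. Raising `k` to 2 brings in a second, less similar neighbor, and `song_c` shows up as well, which is the kind of parameter effect you can only see by running it.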
Dog food - I think in a class like this, there is for sure an experience component to it. I want people to live in the space we’re talking about here. As we said last time, this is a revolution, the Social Data Revolution, and the way to do this well is to eat your own dog food. I would like to use that as a bridge to introduce Jeremy. Jeremy is a graduate student of MS&E and is our head TA. We haven’t figured out the entire structure yet, but we figured out that he’s leading the team. Dog food is something he’s going to be producing. Why don’t you take 5-10 minutes and tell people more about it. Jeremy: The idea with dog food is there are a lot of interesting things we’re exploring with the Social Data Revolution, but to really understand what’s going on and to have meaningful insights in what’s going on, we need to study the fundamentals of what’s going on. To do that, we need to live what’s going on. This is taking it a little bit further than just our Facebook accounts, which I’m sure we’re all very familiar with, or some of us have Twitter accounts or something like that, but really exploring a lot of the different tools that are growing out of the open APIs that Twitter has and other things like that. The idea of this project is maybe one week or two week projects. We’ll choose a different tool that’s out there and it will be across a variety of different things. There is certainly the stream stuff that’s going on with Twitter. Maybe it will be the news discovery sort of things like StumbleUpon or Digg. We’ll finalize what that list will be but over the course of that week, we’ll really call on you to create an account and start posting and start being social and using that service. There is really no way to become familiar with these things, other than actually using them. 0:17:27 The structure of that project will be two components. There will be the actual using of the tool and there will be a reflective kind of period of thinking about. 
The first layer is like the user layer, so you’re using it, communicating on Facebook. It accomplished your purpose. Maybe the time stamps are on there from when it was posted to the server. Then, for the purpose of this class, think about the second and third layers that are beneath that: what metadata is associated with the post that you’re posting on there? Maybe you have some location data, if you’re using Foursquare or something like that. That’s not immediately apparent to you as the user, but on the other side, from a company perspective, you really can use that data for recommendations or other things like that. Maybe a third layer is this idea of the semantic web, where you’re tagging different things. An obvious example of that is Delicious, where you’re tagging different websites that you might enjoy with different tags, and on the back end Delicious can work out the most popular sites based around design or something like that. It’s really starting to dig beneath the surface, moving from being a user of all of these services into looking at the data and the underlying functionality that you can actually extract from that data, and figuring out what you can possibly do with it. That part of the course will really allow you to have a great foundation for a lot of the great things that are going on in Web 2.0 and social data, and it will give you some great ideas and a lot more ability to be creative and innovative as you go into the project or homework assignments more deeply. Does anybody have any questions?

Andreas: Do you understand why we are doing the dog food part?
Student: …

Andreas: For instance, take Twitter as an example. I noticed that a surprising number of you told me you actually don’t have a Twitter account, which I was genuinely surprised about. Maybe there is no need in a college environment where everybody has Facebook. Unless you do it and live in that space, or actually dive into that space for a week or two, you probably don’t know what it really is about. It’s just like with the algorithms. You can very quickly tell whether somebody actually knows what they’re talking about or whether they just saw an interview with … or something like this, where he talks about what Twitter is about. There is no shortcut to actually doing it. The homework for those is basically simple feedback: stuff you would do differently, what is it that they missed, what surprised you. Or, just to stick with the Twitter homework example that was used successfully last year: find someone in a company you know and work with them and engage with any one individual who is saying something about a product of that company. That’s a super interesting exercise, and for those of you GSB people here it’s also very handy when you interview afterwards and say, “I know how to engage with Twitter users. We have our Twitter strategy in place.”

0:21:13 That is the thinking behind it. It’s not super time intensive, but it is to engage and give us feedback, no more than one page saying what it did for you, what you were surprised about, how you would be using it. On that note, we actually do have a Twitter account which is called @socialdata.
If you follow @socialdata, that’s where I tweet stuff about the class and logistics issues. The classroom change should have shown up on @socialdata, but because at that stage I hadn’t told you about it, it didn’t. If you want to subscribe to that, that’s where you learn, in a push way, what’s happening.

Student: I’m not a typical social network user, but I’ve been in some of the social networking classes and I’m a Facebook user of course. I was wondering why they always give heavy weight to Twitter versus Facebook? Can you tell me why they think Twitter is somehow - not better, but gives you more insight about what you can do with social media?

Andreas: The question is basically Twitter vs. Facebook. I can give you a few facts here. The most important one is the bi-directional, mutually confirmed nature of Facebook, whereas Twitter is just a broadcasting medium. That is a very different element. While in both cases you can extract a social graph out of it, the social graph where people actually took the physical world and mapped it into the virtual world, Facebook, versus having a radio station, is very different. In a college environment I think Twitter is actually not all that interesting. Companies, however, jump on it because they feel this is yet another outlet for pushing their messages down the throats of millions of people. Why is Twitter so popular? Because there is the illusion of an audience, “I have 1,000 followers,” so if I tell people whatever, I think 1,000 people are listening. Bad news - nobody is really listening. But that illusion of the audience is what people haven’t experienced. I would be having breakfast at … this morning and nobody else showed up. I tried many things. I tried an experiment. I had an extra two tickets for a play and symphony orchestra or something. I thought nobody wanted them. Then my TA said, “I can go with my girlfriend.” We did a number of things, so I’m pretty aware now of the illusion of an audience.
The reason I have the chief product guy from Bit.ly coming to class is because Bit.ly actually allows you to measure how many people really give you their attention. I think for some of us it might be a bit disappointing to see how little we get. Facebook is of course set up between people who may know each other. We have to figure out how to present the survey. I might simply put up the results, anonymously of course, on the web. People’s responses were interesting on what would happen if Facebook went away. If Twitter went away they said probably nothing; if Facebook went away - people are worried about the address book and their photos. They’re worried about the personal relationship with others and their past. There are many other things to be said. One is the question of identity; I think Facebook is much more important for identity than Twitter. Think about it; Facebook is probably more important for us, for our identity, than our passport is. Your passport you can fake. We all know friends who organize these things. On the other hand, if somebody cracks your password on Facebook - the same happened to me on Twitter. Somebody cracked my password and within a minute or two people were saying, “Andreas, your account has been hacked.” If I lose my passport nobody would hit me up and say, “Your passport has been stolen.”

0:25:53 Are there any other questions? It’s a big topic and we had one very good class about this last year where we had people from both companies talk about it.

Guest speakers.
On average, every second or third class we’ll have a guest for about 45 minutes, and we’ll reflect during the last 30 minutes about what it was that we learned in the presentation. I will look through your recommendations. If somebody really feels strongly about it and thinks they didn’t express it strongly enough in the initial survey, then drop me another email. If you heard somebody is awesome in another class or in a talk around here, please let me know. I don’t know everybody, but usually when you invite people and give them a choice of dates, and they’re somewhat in the area, they will come. I haven’t fixed them all. A couple are set for April 20th, with Bit.ly, but there are spaces and I want to find people you are interested in. As I said in the survey, don’t say … because everybody knows them already. Think about people you actually genuinely would like to discover something about, whom not everybody knows about yet.

The last logistics issue, although I tried to insert it with some content already, is that given the diversity of this class, which is pretty diverse, I decided I will form an advisory board. I want to have 4 class reps, 4 people whom I meet with, initially once a week on Thursday after class. I’ll take you out for pizza or we can go to one of your dorms. I want to learn what the 4 groups are thinking because some groups might be more vocal than others here. I don’t want to fool myself into thinking that just by looking at your smiles I know what you’re thinking. Some people might not have the courage to actually email me directly to tell me what they think. This is a very good thing which has worked really well in other places in the past.

Student: I’m Matt Osborne. I’m the GSB representative, not an MBA. I’m part of the [Sloan] program, which is for business school students who are 10-15 years into their careers. I’ve actually been working in this business for that long.

Student: I’m Jess. I’m … senior and… major.
Student: I’m Dan Goodwin. I’m a first year masters in electrical engineering. I’m an extroverted engineer because I … when I talk.

Andreas: Our first meeting is next Thursday. Just recorded among the five of us and I’m also inviting the TA. It’s an open discussion, 6:15 dinner, and maybe you can pick a place or otherwise we can do a coffee house or something simple. There is one email address to reach whoever the teaching team is, other than me, which is mse237@gmail.com. It’s very easy. You don’t have to worry about what goes to whom. There is one email address. Any questions about that? If the four of you could just send your names to that address, that way we have it on record.

0:29:50 For current info, until we actually get socialdatarevolution.com up, everything is on stanford2010.wikispaces.com. Tonight I will take all the email addresses I got from you and will invite everybody to be an editor of the wiki. By tomorrow you can all edit this. It might be that we transition this to something more interesting, something that allows more annotations. That’s what a few of us are meeting at dinner for tonight. For right now, that’s the email address to reach the TAs. That is where the information sits. If you want to email me, it’s aweigend@stanford.edu. Any questions about that?

We should have about half an hour left. In this half hour, I want to talk about data. I want to think about the Nile River first. What does that have to do with it?
In history, those people who managed to build long, unclear feedback loops, or maybe omit them, used to be the ones who actually became rich and famous. Let me give you some examples. It was the high priests in those days. What is the feedback loop for somebody who tells you something that might be happening after death? It’s pretty long and pretty broken. Jumping a few hundred years further, there was something in the ‘90s where it was very popular to be on Wall Street. You probably don’t remember those days, but there were days - from my graduating PhD class, from Harvard and Stanford combined, there were two of us who did not end up as quants on Wall Street. It was the normal way to go. Why did this work? It’s not so much for the quants but for all other people: trust me. Write this thing out and someday it will be all good. They were long feedback loops, not a rich set of metrics. Another example is religion. These infrastructure investments; Notre Dame in Paris is an infrastructure investment. It was not clear what the ROI really was.

Let’s compare and contrast this era of faith to an era of data. The reason I have Chinese here is I thought it would be forward compatible. In the era of faith there were massive investments into cathedrals, etc. In the era of data there are massive investments in measuring, networking, communicating, and storing. We move from an unclear ROI to a very clear ROI. We have a short feedback cycle, and what’s very important for me is we can do experiments in real time. This means that a lot of data is being created, as in The Economist article from February, “Big Data,” and what do we do with all of that? We gather them, explore, publish, and archive them, a lot of stuff; what should people do? What do marketers in the business school do about all of that stuff?
The paradigm shift here, and this is an important one, is that we are moving away from the ‘90s question, “Given a set of data, what insights can we get?”, to the 2000s question, “Given a problem, what data can I get?” In other words, given a business problem, how can I incentivize people to actually tell me something about themselves so I can serve them better? And from there we move to what I think is the 2010 question, which is neither what the insight is given the data nor what the data is given the problem, but what the business model is. A lot of it, when I talk about incentive design, is figuring out how we get people to do stuff.

This is an old table here, in case someone is not really sure what the unit of measurement for data is. What I like about it is that it’s pretty shocking to see how many orders of magnitude the data we live with actually spans. It’s roughly 18 orders of magnitude here. If you transfer this from the digital scale into the physical scale, into distances, then you get this.

0:35:02 If you say the unit of analysis here is an atom, roughly this, then you have Mt. Everest and you go the distance to the Sun. The amount of data is so difficult for people to grasp because so many orders of magnitude are spanned when going from an individual bit or byte to what the Internet is actually carrying about us. I mentioned last time that what matters is not only the static overall size but also the doubling rate of data: the amount of data each of us creates roughly doubles every one and a half years.
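The doubling claim is easy to turn into numbers. The following is a small arithmetic sketch in Python, taking the 1.5-year doubling period and the 18 orders of magnitude from the lecture as given; the function names are made up for illustration.

```python
import math

# Doubling period quoted in the lecture, in years.
DOUBLING_PERIOD_YEARS = 1.5

def growth_factor(years, doubling_period=DOUBLING_PERIOD_YEARS):
    """How many times larger the data volume is after `years`."""
    return 2 ** (years / doubling_period)

def years_to_grow(factor, doubling_period=DOUBLING_PERIOD_YEARS):
    """How many years it takes to grow by the given factor."""
    return doubling_period * math.log2(factor)

def orders_of_magnitude(smallest, largest):
    """Number of powers of ten between two quantities in the same unit."""
    return math.log10(largest / smallest)

print(growth_factor(3))                   # two doublings
print(growth_factor(10))                  # growth over a decade
print(years_to_grow(1000))                # years to reach a thousandfold increase
print(orders_of_magnitude(1, 10 ** 18))   # one byte up to 10^18 bytes
```

Under this assumption, data grows fourfold in 3 years and roughly a hundredfold in a decade, and a thousandfold increase takes about 15 years.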
One of you emailed me asking what I actually meant. You can just describe where you are, and that’s a few bytes, or give geolocation, or you can take a picture and describe it, but that estimate is essentially all the data that you are creating. Of course, video and photo do play an important role in this exponential growth of data.

Sometimes we talk about the surface web, something like 10 billion pages, roughly a page per person on Earth, and compare it to email. The storage cost in 2008, when I last did this exercise, was roughly $400 thousand; you could store everything that is out there in the size of your garage. My garage in San Francisco costs about $100 thousand, so think about this: all the data mankind has created and put up on the web fits into my garage for $400 thousand. The deep web of course is the underlying databases, and I would say it’s roughly ten times bigger. No point arguing about this much. The point I want to make is you can store everything people have created in your garage.

We want to turn behavior into data. MoodLogic is a company I co-founded in 1999 with Chris Pirkner, a former grad student of mine at NYU. The belief we had there was that if we just give people tools to trivially, easily annotate things they do - in this case it was in the music domain - to characterize songs they like, then we can build a space where nearby points in that space have nearby perceptual impact on the person. Music is probably the best legal drug, and if you want to be happier and you have a song that makes you happier, then what are the songs in the vicinity of that? One way of getting to this was asking people why they like to listen to this song. “I like to listen to this song while I’m in the shower or running.” It was pretty amazing that within 2 years we had 1 billion explicit ratings. We would have never thought that. Chris Pirkner really understood how to get people motivated to do stuff. How much do you think we had to pay for those ratings? Nothing.
Monetary incentives don’t really work because you get people who are doing it for $3 an hour, clicking and rating songs, but you don’t want those people. You want people who actually care for the music. That was one of the deepest things we learned there. If you align the incentives so they get something back which is useful for them, namely discovering new songs, they will do it. That was the ‘90s, when mp3s were easily available, and later on people got sued for having them on their computer. In those days it was really a discovery problem which we solved. The company was sold to All Media Guide and then All Media Guide was sold to Macrovision, and Chris is their Chief Strategist there. 0:39:11 It’s a data play, a metadata play if you will. The point I want to make is that behavior gets turned into data. Music is one example. Search is another example, where you share your secret desires with Google. Who would be willing to have every single search you did in the last week displayed, with your name, on the screen? We do share our secret desires - with Google. Why? Because we get something back. We might get the answer back. Another example is online trading, where people’s behavior can be measured, and online dating. That’s very interesting. There is a theory of addiction, and it’s very simple. It is that if you have a stake somewhere and the world changes, whether you watch it or whether you don’t watch it, and it affects your stake in that world, then you are drawn to go back and check it out again and again.
Online trading is a perfect example where you have some positions; while you are sleeping, the price may go up and down, and people are addicted to it. Online dating, you have your profile out there and you can’t wait to see whether that person you really want to contact you actually did. That driver, having something which changes while you have a stake in the game, is a very powerful driver for whatever app you’re creating. Online role-playing games, Zynga as an example which everybody knows, and those other things, particularly trading and dating, have the same underlying structure. My belief is everything can and will become data. Movement data, mobile [0:41:20], brain activity - we will have Brian Knutson come, the guy I mentioned in the last class who does fMRI in Jordan Hall, to actually talk to us about what’s happening in the brain, what data we can get, and how we can make predictions, given those data, about future behavior. Privacy is one issue that will come up and will be a topic again and again in class. There are lots of different kinds of privacy: information privacy, like what your hobbies are, what you’re good at; communication privacy, who you talk to. Skydeck is a company between here and San Francisco. Their user model - you give them your user name and your password, or your mobile number and your password to the site. They go to the site - T-Mobile in my case, or AT&T - and look at your calling patterns. They present back to you how those calling patterns change over time. I’m sure you can predict breakups with your friend way ahead of time, before Facebook knows about it, by understanding how things shift. Another example is how long does it take me to respond to somebody’s voicemail, versus how long does it take them to respond? Nokia Research has a product which tries to come up with an organizational chart, very different from the official organization chart, which takes such response delays into account. I did some consulting work at Morgan Stanley.
I needed MATLAB for that so I called up the MathWorks, people near Boston, and said, “Can you help me out here and shoot over a license for everything for the next week or so?” They said, “It’s funny that you called, Andreas. We just sold Morgan Stanley a very expensive license.” “Who did you sell it to?” I didn’t know the guy. I called him up and he said, “What are you doing?” I told him what I was supposed to do and he said, “But that’s what I’m doing.” By accessing similar toolboxes, for instance MATLAB, or in other ways by accessing similar data sources if you actually buy data, you get a very good idea, in yet another space, of who is working on similar problems. If you look at the same data with the same tools, chances are that you actually are trying to solve a similar problem. 0:43:54 Thomson, which recently merged into Thomson Reuters, has a company in the Midwest, in Minnesota, called Westlaw. Westlaw goes back 100 years, taking all these public records from court cases. Somebody enters them and then for lots of money you can actually access them, which is an interesting conversation, I think, about the value of data, because the data is already public. The fact that you can access it from your desktop and search it is where the value lies that people are paying for, rather than taking the bus to the courthouse and finding the proceedings somewhere. The interesting question there was who should have access to which cases I’m actually looking at? Certainly not the counterparty, because that way they could reverse engineer the argument I’m going to make in court.
That’s not what we want, but within the law firm, should people have access to that? Probably yes. Then what should they be able to aggregate up in order to have a better data product, and what should they not be able to aggregate up? It’s a similar question for Intuit data. What can you learn across companies and what would companies be very unhappy to have shared? The general rule is that the big companies are the ones who have more to lose than the small ones. Amazon.com pretty much knows what’s going on in the world by what they see. Some little Internet retailer cannot lose much if their data is public, but they can gain a lot by benchmarking themselves against data from other companies. One of the things we’ll do at the mixer next week is that we will have Angus summarize the data sources we’re looking at. Is that right? Geolocation, territorial privacy: what happens in your office, home, bedroom? I just learned that in Singapore, a hotel room is a public space. Watch out what you’re doing in hotel rooms in Singapore. Of course, bodily privacy, strip searches, drug tests: these are at least some of those interesting dimensions of privacy. Out of those privacy concerns here, collection and storing, unauthorized secondary use, improper access, and combining data, what’s missing is my main concern, which is: what if something wrong is out there? What do you do about it? If somebody says that some Stanford professor molested a child, now what do you do about it, assuming it’s not right? What do they do about it? It’s not easy. This year, what would be processes for people to fix errors in the database which are pretty foolproof, that is, tamper-proof? If I actually want to take something that is right and change it to something wrong, I can’t do it. But the person who actually wants to fix something that’s wrong and make it right actually can do it.
0:47:16 Let’s take geolocation; let’s say there was a murder in L.A. last week. I might not be willing to have - I would be, but some of you might not be willing to have your geolocation recorded every minute, which would make it very easy to say this is where I was. Then if I get accused of having committed that murder in Los Angeles last week, I would be more than happy to explain that it can’t have been me because I was hanging out in San Francisco. That’s an example where you hash the data into some space, so you can come back and say the probability that Andreas did this is 10 to the minus 17, versus the probability that the other guy did it is 10 to the minus 1 or something like this. How do you handle the data? How you process them so you can get your question answered without revealing any more than you actually need to reveal is a very interesting area of research. Accessing your own data is a very interesting thing. I missed a plane connection and went to the woman at Lufthansa in the lounge, and I said, “The plane was late. How do I get to where I was going?” She said, “Just a moment.” I leaned over to look at her screen. She pushed the screen away from me and said, “No, you can’t look at this.” I said, “What are you talking about? I paid a lot of money for that ticket. I’m stranded here because your plane was late. How come I can’t look at my own data?” The same thing happened with the iPhone. There was an issue with the AT&T billing so I went there and I didn’t know what the problem was. I said, “Let me help you debug it. Let me see whether you have the address wrong.
What’s this?” How can people think about my data as something I can’t look at? 23andMe is another example, versus financial data, where 23andMe for a couple of hundred bucks does a DNA sequence and tells you which diseases you’re likely to have and which diseases you’re less likely to have. Initially it was that you can only get this data if a doctor is next to you. On the other hand, you can get your financial data without a guy who has a degree in financial mathematics being with you. How people think about their data, whether it’s my DNA or my money, is very much worth reflecting on. The Federal Trade Commission has a bunch of dimensions here. What they came up with is actually quite reasonable. A Pentagon article from 4 years ago, a New Scientist article, talks about how the spooks are trying to figure out what is going on, on social networks. Anybody, any comments on that? I think we should assume that they’re everywhere. I’m not the person who is worried about these things, but assume that everything you ever say, write, email, text, they listen to. The cost of storage has dropped dramatically. It’s absolutely amazing: what [slack] had when I was a grad student, I now have on my mobile phone, in terms of storage. Of course, the hard drive capacity has shifted. There is a good paper by Hal Varian, who is Chief Economist at Google. It really explores the law that storage prices drop by a factor of 2 every 1-1.5 years, and what that means to the world. I see it this way, that there is implicit and explicit data. Implicit data are data which are clicks, my geolocation, stuff that I produce and can’t do much about. Even if I don’t have a phone with geolocation, just the fact that our phones are in this room, somebody has to know this because otherwise how would they know your phone should ring? That’s one of these strong data strategy things.
0:51:32 I can turn my phone off, but then I won’t get phone calls, so I need to transmit where I am. Otherwise, my phone can’t ring. There is a company, I forgot the name, maybe some of you know it; it puts little receivers in conferences and then they trace out the trajectories people take through the booths at the conference: who hangs out where, for how long, what is the more or less random walk people take at a conference, and who is hanging out with whom. Does anybody know the name of that company? There are very interesting insights you can get about where people go, and how this changes when you rearrange the conference. Many stores had video cameras where people sat in the back trying to figure out, if you move that product from that aisle to this aisle, how does it affect sales. Walmart actually has an entire model which says if I move the toothpaste from this rack to that rack, sales will go up by .03%. They knew this 10 years ago already. Now with people moving through the stores, having their mobile phones on, you can understand and analyze the implicit data much more than we could ever do by manually looking at video tapes. That is growing exponentially, no question about it. Surveys on the other hand, as an example of explicit data, are not going up exponentially because we actually have to do work for that. So, from a machine-learning perspective, or … for instance, you could say that implicit data is unlabeled data and explicit data is labeled data. How much is it worth to label a piece of data?
What I’ve seen happening is that great progress has been made by making it increasingly easy to share whatever you want to share about geolocation or about a restaurant dish. One of my students last year has a company in San Francisco where the unit of analysis is not the restaurant, a la Yelp, but the dish in the restaurant. If you want to have really good … then this might be a good Chinese restaurant but the … might be really lousy. It will tell you where the best… is in a 5-mile radius, or 10-mile radius. As long as you make it trivially easy for people to annotate it, they will do it. Any increase in the barrier to doing it will cause an exponential decrease in what people actually do. Voice over IP is a great example. This is a great graph. It’s an old graph, from 1980 or something, but it’s a beautiful way of representing and [ensuring] people communication costs. On the x axis, we have the cost of transmitting 1,000 words, inflation adjusted. On the y axis we have the number of trillions of words made available, transmitted. For each of these things, like fax machines that were invented at some stage, you see how the price goes down here. Telegram, price went down. I wanted to show this to you to show you that the economics of data is important. We live in a different era. Here is an example from Amazon.com. What is the amount of data Amazon is collecting in a given year? This is the new data Amazon had, and this graph is a few years old. I don’t know what the number is exactly now. If you have individual clicks, such as an access log, it’s on the order of 100 TB, maybe a factor of 10 more, but that’s roughly the order of magnitude. If you look at session aggregates, where we summarize what the person has been doing, did they buy something, how many clicks were there, when did it start, when did it end, it’s about 2 orders less, a TB, so it fits on one of these little devices.
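As an illustration of what going from individual clicks to session aggregates can look like, here is a minimal sketch. The record layout (session id, timestamp in seconds, action name) and the function name are assumptions made for this example; they are not Amazon's actual log format.

```python
def summarize_sessions(clicks):
    """Collapse click records into one summary per session.

    `clicks` is an iterable of (session_id, timestamp_seconds, action)
    tuples, a hypothetical layout chosen for this sketch.
    """
    sessions = {}
    for session_id, timestamp, action in clicks:
        summary = sessions.setdefault(
            session_id,
            {"clicks": 0, "start": timestamp, "end": timestamp, "purchased": False},
        )
        summary["clicks"] += 1  # how many clicks were there
        summary["start"] = min(summary["start"], timestamp)  # when did it start
        summary["end"] = max(summary["end"], timestamp)  # when did it end
        if action == "purchase":  # did they buy something
            summary["purchased"] = True
    return sessions


if __name__ == "__main__":
    log = [
        ("a", 100, "view"),
        ("a", 130, "view"),
        ("b", 140, "view"),
        ("a", 190, "purchase"),
    ]
    for session_id, summary in summarize_sessions(log).items():
        print(session_id, summary)
```

Each session keeps a fixed handful of fields however many clicks it contained, which is why the aggregate is so much smaller than the raw log.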
If you aggregate it further, 2 orders, it’s 10 GB. All the customer data of Amazon is just 100 MB. If you blow it up here, the more explicit data up here versus the implicit data, you have way more in implicit data. 0:56:23 To ramp down here, there is an iterative process of modeling and of making decisions. That works just as well if you build trading models on Wall Street as if you work with an ecommerce company. The first step here is that you define what the problem is, and that includes having the baselines defined. When you do work with a Wall Street firm for instance, make sure the problem definition includes what it means to do well and what it means to not do well. We measure, and then we describe: exploratory data analysis, something computer scientists tend to not be that good at. It’s more what EEs and physicists do - let’s plot it on a [0:57:14] scale, hold it against the light and see if it’s a straight line. Then, I deeply believe that the only way I would trust a model is if it has predictive accuracy out of sample, on new data. That’s a big rift between those in the class who come from the social sciences, where they come with p-values and tables and say, “This is the right model,” and those who come from the natural sciences, who say, “Can we make predictions out of sample?” Of course, how we decide and how we evaluate is an iterative process, because once you have actually made the decision, after you have evaluated it, you realize that what you had as your initial problem definition probably was wrong.
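The out-of-sample idea can be sketched in a few lines. This is a minimal illustration, assuming a simple straight-line model and a random train/test split; the function names are made up for the example. For time-ordered data such as trading prices, the held-out set would normally be the later period rather than a random subset.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared prediction error of the line on (xs, ys)."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def out_of_sample_error(xs, ys, train_fraction=0.7, seed=0):
    """Fit on a random training subset, report the error on the held-out rest."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train, test = indices[:cut], indices[cut:]
    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mean_squared_error(
        [xs[i] for i in test], [ys[i] for i in test], slope, intercept
    )


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [float(i) for i in range(50)]
    ys = [3.0 * x + 1.0 + rng.gauss(0.0, 2.0) for x in xs]  # synthetic noisy data
    print(out_of_sample_error(xs, ys))
```

The number that matters is the error on points the fit never saw; a model that only looks good on its own training data has not yet shown predictive accuracy.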
The last point here is to say I can’t emphasize enough, and I made this point in the last class and I make it again here, that the M in PHAME, the Metrics, is something where many people actually can benefit greatly from spending some time making them explicit. Here are examples of what Amazon would want to drive. Amazon might want to drive the stock price, profit, number of items sold, and then when you write those down, which might be 100 or so, you begin to see what the necessary tradeoffs are. Here is an example between profit and number of items sold. No problem, I can make sure we sell 10 times as many items by dropping the price to 90% of what my competitors have. That means the tradeoff is profit goes. Or here, conversion rate - no problem, we can buy the cheapest key words in the world, get everybody to the site, but they want to have free porn and that’s not what Amazon is about, so the conversion rate of people actually buying something is pretty small. What I wanted to do in this relatively short time was to show you that with data here, it is hard because we have pretty much moved from a world of reporting, of printing out stuff, to a world of doing behavior analysis, data mining, making predictions. But the positive of this is that we move from a world where this was very much a cost center, where these people are not well liked, to it being a profit center. Amazon fired its marketing department in 2002. Amazon is probably one of the most successful marketing departments in the world because of its recommendations. Why do recommendations work? They observe what people do and smart algorithms manage to play that back. For me, that’s what really good data strategy is about.
To summarize what we just did, we looked into the importance of data. We all buy it and know that lots of data gets created. The reason I did this today is I want you to be aware that I expect your products to be somewhat data intensive, so you should think about data. I don’t want philosophy papers here. What has changed is that the data used to be sniffed behind people, “sniffing the digital exhaust,” people coming and saying, “Give me some [1:00:49] insights,” and we have moved to creating incentives where people knowingly and willingly share data if you give something back in return. 1:01:00 To know whether you’re doing the right thing or not, you have to have a good clear set of metrics you agree on. It’s much easier, and I can tell you this from the bottom of my heart, much easier to agree on metrics than to agree on outcomes, before you have done the experiment. It’s much easier in a meeting to say “These are the metrics. Any other metrics? Okay, let’s throw this in. Let’s write the test [1:01:26] and now let’s do the experiment” than to argue a priori that this is better for the following reasons. For me, that’s the driver that has moved us from an era of faith, and I’m not saying anything against religion here, just from an era of the “trust me” kind of thing, to an era of data. That’s what I wanted to tell you today. Any questions? If not, expect an email from us. I remind you of the two paper locations. Think about which of the four groups you fall into, how you want to rate yourself there. See you on Tuesday. We will have a mixer on Tuesday, right after class, bye.
