>> Eric Horvitz: It's an honor today to have Jennifer Widom with us. She's
the Fletcher Jones professor of computer science and electrical engineering
at Stanford University. She's now also the senior associate dean for faculty
and academic affairs in Stanford school of engineering. Jennifer served as
chair of the CS department at Stanford from 2009 to 2014. I was surprised
that she wasn't doing that when I visited about 9 or 10 months ago and I
popped in her office and I saw she was doing dean-like things. She was
looking over pie charts of -- the gender breakdown of CS majors, incoming and
by year. And she was very much engrossed in doing this
kind of interesting demographic analysis. So it's part of her new role at
Stanford. Jennifer received her bachelor's degree from Indiana University,
Jacobs School of Music. I think I'm going to ask you what instrument you
played, but we'll talk about that later.
>> Jennifer Widom: Trumpet.
>> Eric Horvitz: Trumpet. Wow. That's fabulous. So did I.
[Laughter]
>> Eric Horvitz: My high school instrument. And her Ph.D. was from Cornell
University. And she was a research staff member at IBM Almaden before
joining Stanford in 1993. We probably just basically missed each other. I
left Stanford in '93 to come to Microsoft Research. Jennifer is an ACM
fellow, a member of the National Academy of Engineering, and the American
Academy of Arts and Sciences. She won the ACM SIGMOD Edgar F. Codd
innovations award a few years back and also the Guggenheim fellowship. So
we're here today to continue this year Jennifer's celebration of the ACM-W
Athena Lecturer Award. Each year, ACM-W honors a preeminent woman computer
scientist as the Athena lecturer. They have that title for the year. And
they -- and the honor celebrates women researchers who have made fundamental
contributions to computer science. And going back into some of the citations
on Jennifer's work, her research interests have spanned many aspects of
non-traditional data management. She was cited as introducing fundamental
concepts and architectures of active database systems, which is a major area
of research in the database field today. Active database systems allow
application developers to embed logic into the database that allow actions to
be executed when certain conditions are met. Active database systems have
had major impact on commercial database management systems and most modern
relational databases include these active database features. She was also
cited for fundamental contributions to the study of semistructured data
management. Semistructured data management systems are key in supporting
many applications that are coming forward today such as genomic databases,
multimedia applications, and digital libraries. The lecture that Jennifer
will give today was originally presented last June at SIGMOD. People that
get this award get to choose the meeting they'll give their main lecture at
and she chose SIGMOD, which was in Melbourne, Australia last June. And so we
invited her to come today to MSR to give us a reprise of this lecture and
she's going to share her three favorite results and great strategies
for giving a talk, I think. Let's welcome Jennifer.
[Applause]
>> Jennifer Widom: Thank you. Thank you for that very nice introduction.
10:40. What time should I plan to stop?
>> Eric Horvitz: Noon.
>> Jennifer Widom: Noon. Okay. I won't talk until noon, I promise, but --
>> Eric Horvitz: Promote discussion.
>> Jennifer Widom: Okay. But given that we do -- it seems like we have
flexible time, I'd be absolutely happy to take questions during the talk.
That's my preference. So if you want, please, if you want to bring something
up during the talk, do that and we can adjust time as appropriate. So thank
you again. So, when I won the Athena Lecturer Award, I was presented with
the sort of daunting task of giving a retrospective like talk on my research.
And that's -- it is -- it can be a hard thing to do but it's also quite
valuable. And I think maybe I had just been to a leadership program at
Stanford where maybe the only thing I learned -- well, maybe a couple things,
but one thing I learned is that all good things come in threes. And so, the
combination of that leadership skill and needing to do a retrospective talk,
I said, let me just pick my three favorite results over the history of my
research. And so that's what I did and that's what I'm going to do today is
tell you about those three favorite results and I'll have a particular way of
telling you about them as you'll see. Before I start, though, I think it's
extremely important to say what favorite means because favorite can mean a
whole bunch of things. So first of all, the favorite results are, it turns
out, not going to be the ones that have won best paper awards or test of time
awards even, although the latter would probably be more likely to be a fair
result. They're not necessarily going to be the results that have the most
influence. Although one of them I think falls in that category. So they're
really personal philosophical favorites. And part of what I'm going to try
to get across today in addition to explaining the results themselves is why
they are my favorite results. So I'm not going to spring any surprises on
you. I'm going to tell you right now what the three results are and then
we'll go into talking about them. So the first result is DataGuides. And
that is in the area of semistructured data, as Eric just talked about, and
that was around 1997. So almost 20 years ago now. Second favorite result is
in the area of data streams and it's the CQL, continuous query language,
around 2002. And the third result I'm going to tell you about is ULDBs or
uncertain lineage databases, which is really sort of a data model or
representation scheme in the area of uncertain data. And that was ten years
ago. So maybe, in the future, if I went back for some threes, there would be
something after that period, but at this point when I look back, those are my
favorites. Okay. Now, let me digress momentarily and tell you about the
Stanford InfoLab patented five-part paper introduction. Arvind, do you
remember the five parts? I won't put you on the spot.
[Laughter]
>> Jennifer Widom: But at Stanford, we actually hammer home to our students
a way of thinking about introducing a topic that they're writing about or
even talking about and we even force them to structure the introduction to
their papers this way initially, five paragraphs. After that first draft,
things can get mushed around. But we found it very valuable. And in fact,
guaranteed paper acceptance if you follow this five-part, patented paper
introduction. Okay. So the first thing when explaining a result, and I'm
saying this of course because I'm going to explain my results this way. The
first thing you have to answer is what is the problem. Amazing to me how
many people tell you about their work without actually telling you what
problem they're trying to solve. Okay. Second, why is it an important
problem? Third, why is it a hard problem? Really, you want all these things
to be true or it's not going to be that interesting. Why has it not been
solved already, or at least, what's the landscape of the previous work? And
finally, what is our solution? Okay. And for today, I'm going to add a
number six which is why is it a favorite? All right. So, we're going to
launch right now into the first favorite result which is DataGuides and I'm
going to start by giving you the context before I can go into the five parts.
So it's around 1997 and we have a project called Lore. Lore stands for
lightweight object repository. We were working on a project on data
integration where we were -- who hasn't worked on data integration? Where
we're trying to bring together data from multiple sources and we defined a
lightweight data model to use for exchanging data and then I decided that it
would be interesting to separately build a traditional database system to
manage that particular data. You don't need to read any of this. The
student who was involved in DataGuides is Roy Goldman. And I'm going to for
each result identify the people who were involved. Okay. So we're building
the system for semistructured data. In 1995, when we started the data
integration project, we invented or I wouldn't even say invented,
crystallized this idea of what we were calling the object exchange model,
which was this lightweight semistructured data model and we used directed
labeled graphs. And here's a picture of an example database in this
lightweight data exchange model. So, this is a directed labeled graph, by
the way. I grabbed this picture from the actual papers at the time. All of
the figures are going to come from the papers at the time. So this is a tiny
database of restaurants and bars. We can see that this restaurant has name
entree phone. So on. This one has not quite the same data. This is a bar
that only has a name. And you can see that this data is self-describing and
that the labels are in with the data down here. We have values
[indiscernible]. There should be nothing too exciting or surprising about
this. This just happened to be what we were using to have a very flexible
semistructured model. Now, shortly after that XML came out, so I don't want
to claim there's anything unique about our model, here's exactly the same
data in XML. And since then, JSON has become more popular. There is exactly
the same data in JSON. Everything I'm saying about DataGuides could apply to
XML and JSON. And we actually converted the project to XML at some point
along the way. But because I want to be true to history, I'm going to use
the object exchange model for this talk. Okay. So now let's go into the
five parts. First of all, what is the problem that DataGuides was solving?
It was the problem that semistructured data does not have a fixed schema.
Well, I would say that's pretty obvious. That's the whole point of
semistructured data is that you don't have a fixed schema. In fact, the data
at that time was called schema-less or self-describing. So that's the
problem that we were trying to solve, the fact that we had no schema. Now,
why is that an important problem? Because database management systems rely
on a schema for all kinds of things. So we started building this database
system for this self-describing semistructured data and we immediately saw
that lacking a schema was a big problem. So what do database systems rely on
a schema for? They rely on the schema to store statistics. You need to know
what kind of data you have to store statistics about the data. To build
indexes, you need a schema. You need a schema to check whether its portions
of the data, the attributes in the query are actually in the data. So to
check if you're working on SQL and you want to check the validity of a query,
you have to check that everything that's mentioned in the query is actually
in the data and you do that using the schema. Even a simple thing like
taking a query that says select star where star means pick all the
attributes, you need the schema to understand what those attributes are. If
you want to build an interface to browse a database, what do you do? You
build that based on the schema so that you know what the pieces are and many
other things. All right. So I hope I've convinced you schemas are very
useful in databases. So we need a schema or something like a schema in this
world of schema-less data or self-describing data. So, why is it hard do
that? Well, first of all, we have to define what a schema means. Second of
all, it turns out that what we really need to do in this case is infer the
schema from the data. So we need algorithms to do that. And furthermore, in
traditional databases, schemas can change but that's sort of a big hiccup
when the schema changes, whereas in this world of semistructured
self-describing data, the schema may change as rapidly as the data changes.
So you need some way of incrementally updating this schema regularly and not
too expensively. And finally, the schema can be as large as the data. So if
you think about it, in semistructured data, if the data is completely
irregular, if there's nothing uniform across the data, then the data is the
schema. On the other end of a spectrum, in a relational database, for
example, the schema is just typically like the width of the tables. Okay.
So lastly, or second to lastly, before our solution, why has it not been
solved already? Now, we're winding back of course to 1997: why had it not
been solved at that time? Actually, basically because nobody else had tried to
build a traditional database system for semistructured data. People were
really using it for data exchange. And for data exchange, it was useful but
not so necessary to have a schema. Okay. So that's where we were. Now,
let's talk about our solution. So our solution was to take this
semistructured database and provide what we call a structural summary, which
is what we call DataGuides. So we made a formal definition of this
structural summary. We have algorithms for inferring it from the data and
updating it. We have the way we use it for indexing, for statistics, and for
query processing. And also for the user interface. Now I'm just going to
give you -- and obviously I could give a whole talk on this by itself. I'm
going to give you some flavor though of each of these components. And again,
questions anytime if you would like. All right. So, as a reminder, here is
the database that we're working with. And now let me give you the formal
definition of the DataGuide and then I'll show you the one for that database.
It actually turns out to be relatively simple which is I think, you know,
good ideas in the end usually are relatively simple in retrospect. So we
have three requirements for the DataGuide or the structural summary. One is
that it needs to be represented in the same database -- in the same data
model, this object exchange model. So in databases typically, you want to
represent the schema in the data model; that turns out to be very helpful in all
kinds of ways. So we want to represent that DataGuide in the same data
model. Second, every label path in the database, so every path that you can
traverse in that labeled graph, has to appear exactly one time in the
DataGuide. So if we have restaurant followed by name, then we have to have a
restaurant name path in the DataGuide. And furthermore, there are no extraneous
paths, so every path in the DataGuide corresponds to a path in
the database. Okay? Pretty straightforward as it turns out. So here's our
example. And here is the DataGuide for that example. Okay? And you can
confirm this obviously is in the same data model. Every unique path in the
database appears exactly once in the DataGuide and there's no extraneous
paths. Every path in the DataGuide appears in the database. Now, someone
might ask about cycles. Is that what you're going to ask about? Not cycles.
>> [Indiscernible] in the example we have multiple [indiscernible].
>> Jennifer Widom: Right.
>> So be sure you have ones?
>> Jennifer Widom: Correct. And that's by definition. So by definition, we
want every path that is -- every path in the database to appear exactly one
time in the DataGuide and every path in the DataGuide has to be in the
database. So the definition is really easy. Dealing with it is not as easy.
The other thing I want to mention is about cycles. So we did allow cycles in
our data model. They weren't used that commonly. But that did make things
fairly tricky but it still worked. So when you have cycles in your database,
then you have infinitely many paths, and to capture this definition, a cycle
in the database will turn into a cycle in the DataGuide. So you'll have
infinitely many paths in the database, infinitely many paths in the
DataGuide. And in the DataGuide, you'll have each of those infinitely many
paths appearing exactly once. Okay. All good? All right.
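A minimal sketch, in Python, of the definition just given, assuming a toy graph (this is not the Lore implementation; the object IDs and labels are illustrative). Each DataGuide node is the set of database objects reachable by one label path, which is essentially the NFA-to-DFA subset construction mentioned later in the talk:

    from collections import defaultdict

    # A toy OEM database as a directed labeled graph: oid -> [(label, child oid)].
    db = {
        1: [("restaurant", 2), ("restaurant", 3), ("bar", 4)],
        2: [("name", 5), ("entree", 6), ("phone", 7)],
        3: [("name", 8), ("entree", 9)],
        4: [("name", 10)],
        5: [], 6: [], 7: [], 8: [], 9: [], 10: [],
    }

    def build_dataguide(db, root):
        """Each DataGuide node is the set of objects reachable by one label path;
        grouping targets this way is the NFA-to-DFA subset construction."""
        guide = {}                     # node (frozenset of oids) -> {label: node}
        start = frozenset([root])
        work = [start]
        while work:
            node = work.pop()
            if node in guide:          # already expanded (this also stops cycles)
                continue
            targets = defaultdict(set)
            for oid in node:
                for label, child in db[oid]:
                    targets[label].add(child)
            guide[node] = {lab: frozenset(kids) for lab, kids in targets.items()}
            work.extend(guide[node].values())
        return start, guide

    root, guide = build_dataguide(db, 1)
    # Every label path in the database appears exactly once in the guide, and each
    # guide node doubles as a path index: restaurant.name reaches objects 5 and 8.
    node = guide[root]["restaurant"]
    print(sorted(guide[node]["name"]))     # [5, 8]

Because already-seen target sets are not expanded again, the same construction terminates on a cyclic database too, and a cycle in the data shows up as a cycle in the guide.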
>> Could I ask a follow-up question?
>> Jennifer Widom: Yeah.
>> So you --
>> So you're not going to represent that it's unique or it's countable or
anything like that in the DataGuide. You're going to use extra information
represent stuff like that?
>> Jennifer Widom: That's correct, yeah. Going to show that in a moment.
>> Great.
>> Jennifer Widom: Yeah. Anything else? All right. Okay. So I said that
the DataGuide is our schema and it's used for the type of things that schemas
are used for in databases and now I'm going to show you a few of those
things. First of all, we use it for indexing and for statistics. Okay?
So to do that, what we do is in the DataGuide, we store at every node the
object IDs of the corresponding objects in the database. This is effectively
a path index for every path you have in the DataGuide. In the database,
sorry. So for example, in our original database, there were three elements
that were restaurant entrees and they were object ID 6, 10, and 11. So if we
have a query that asks for restaurant dot entree, that's how we did dots for
our path, then what we do is we don't explore the whole database. We go
straight to the DataGuide. They go down here, this gives us our objects and
then we can fetch the objects. So this is a traditional index. Of course
you have to mix index accesses with other types of evaluation in a typical
query processing sense but this is how we used it as a path index. We also
kept the object IDs at the interior nodes as well so here are the objects
that are the restaurants. Okay? So the other thing that we stored in the
DataGuide was we stored sample values and this is really for the user
interface. This was to give users a sense of the type of values that were in
the database. So for example, here, we store a couple of names of
restaurants. Okay? Now, we use the DataGuide quite a bit for query
processing. What we decided to do for our query language and I'm not going
to talk about the query language at all today, we decided not to have the
query language generate errors when it mentioned things that didn't exist
but rather generate warnings because we found in semistructured data, people
preferred to have exploration or things would change over time so that's sort
of beside the point. In order to do our warning system, before we actually
executed a query, we would take the query and we would
check it against the DataGuide and if we knew there was nothing that was
going to match, then we would return a warning for that query and we wouldn't
bother to explore the whole database. Okay? Much more interesting was that
we used the DataGuide to do expansion of the path expressions that formed
the core of the query language. And again, I'm not going to go into the
query language in detail but you can imagine there were like regular
expressions that would be matched to the paths in the database. So as an
example here, if we wrote the query select star followed by phone or address,
this star would match a path of any length, any labels. And so what we could
do is instead of exploring the entire database, looking for any path that
eventually had a phone or address, we would use the DataGuide. We would find
the paths that had a phone or address and we would change the star to the
actual paths in the database, so we put in the actual labels. So here
particularly, we would know only restaurants have phones. Obviously, you
know, this doesn't -- it's not a big deep theorem, but in a large or deep database
doing this could save a huge amount of time. So we use the DataGuide for
that purpose also. Again, that's sort of analogous to the select star and
relational queries but much more complicated here.
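A sketch of the two uses just described, assuming the DataGuide's nodes are stored with their object IDs (the oids 6, 10, and 11 for restaurant.entree echo the talk; the rest of the data is made up):

    # The DataGuide as a path index, keyed by label path.
    guide_index = {
        ("restaurant",): {1, 2, 3},
        ("restaurant", "name"): {4, 5, 7},
        ("restaurant", "entree"): {6, 10, 11},
        ("restaurant", "phone"): {8},
        ("bar",): {9},
        ("bar", "name"): {12},
    }

    def lookup(path):
        """Path-index access: go straight to the DataGuide node and fetch its
        object IDs, instead of exploring the whole database."""
        return guide_index.get(tuple(path), set())

    def expand(final_labels):
        """Rewrite a 'select * ending in phone or address' style pattern into the
        concrete label paths that actually exist, using only the guide."""
        return [p for p in guide_index if p[-1] in final_labels]

    print(lookup(["restaurant", "entree"]))   # {6, 10, 11} (set order may vary)
    print(expand({"phone", "address"}))       # [('restaurant', 'phone')]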
>> [Indiscernible].
>> Jennifer Widom:
Pardon?
>> [Indiscernible] to the structure.
>> Jennifer Widom: That's correct, yes. Yeah, we actually didn't save plans
anyway. So right, yeah. Yeah. Okay. All right. The next thing I'm going
to do is go through a few slides showing our DataGuide browser. Now, this
was a big deal at the time and people really liked this. But you have to
remember it was 1997. Okay? Well, a lot of you don't remember 1997.
>> [Indiscernible].
>> Jennifer Widom: Okay. Not HCI people. Okay.
>> [Indiscernible] in the back.
>> Jennifer Widom: Okay. So it's 1997, database people, this was cool stuff.
[Laughter]
>> Jennifer Widom: So we're going to switch now to the database we used for
demos was a database about our database group. Okay. That seemed to be a
good size and understandable. So again, all of these I captured from the
actual papers. It doesn't still run. So this was the browser that you got
when you opened one of these lower databases and the browser was the
DataGuide. So this is actually the DataGuide for that particular database.
So we had group members, projects and publications, the group members had all
of these. The projects actually pointed to group members. So this one did
have cycles in it. That was one of the things we liked about it that members
pointed to projects, projects to members, publication and so forth. So it
was a fairly interconnected database and it worked pretty well. So you go
here and you can open and close and you’re basically exploring this directed
label graph. So if you chose to look at a particular path, so if you clicked
here, this is dbgroup.groupmember.originalhome, it would pop open this window
which would give you these sample values that were stored in the DataGuide
and then it also allowed you to start constructing queries through this. So
through that DataGuide, you could actually form queries. Again, this was cool at
the time. You could add conditions on that path. You could select that path
for the result. So here, we're blowing up. Now we're forming a query
through the DataGuide where we have added a condition on the original home
and on the years at Stanford and on the positions so these are all predicates
we added and then this yellow says we're selecting that and that would launch
a query and give you the result. Okay. All right. So that's what it
looked like at the time. So now let's talk about why is it hard. Okay. So
I didn't do these quite in order. So why was this a hard problem? Well, first
of all, the DataGuide isn't unique. I don't know if anybody thought of that.
This is actually one of the most interesting things. It's not unique.
That's fine. There's a definition of a minimal DataGuide but it turns out
that wasn't the best one. The best one was something we defined called the
strong DataGuide, which wasn't minimal but turned out to be the best for the
indexing purpose. And I'm not going to go into the formal definition but
that was kind of interesting. Okay. Second of all, the DataGuide isn't
small. So if you think about it, if there's no common structure, as I said
before, in the graph, then the database is the DataGuide. All it's doing is
compressing common structure. And so the DataGuide could be pretty big. We
introduced what was called an approximate DataGuide and for that, we relaxed
the third condition. We allowed there to be paths in the DataGuide that
weren't in the database. Not tragic. You don't want to miss paths in the
DataGuide but it was okay to kind of over shoot. Okay? And third, it turns
out that constructing the DataGuide from the database is pretty similar to
NFA to DFA. So it can be expensive. Trees were easy, DAGs were harder.
Cyclic graphs even harder. And similarly for incremental maintenance. It could
actually be an exponential algorithm. Okay. So last question. Why was it a
favorite? I -- this is also sort of the fun part of preparing this talk was
to think about why this was the favorite. And for this one, I think I can
articulate it pretty clearly. For the work that we did, we had to solve
challenges of every type. So we had to develop the foundations. We had to
develop algorithms. We implemented it all the way to the user interface.
Second, it had applications of every type. So we had to worry about storage,
we had to use it for storage structures, we used it for query processing we
used it for the user interface. So it really cut through the whole system.
And lastly, I do think the name. So I remember sitting around -- actually
still, you know how you remember certain things? I still remember sitting
around with that student Roy Goldman thinking about what name we were going
to use. And we were calling it representative objects. And I kind of wonder
if we'd called it representative objects if it would be as popular as it is
today. But he said no, we need something snazzy, let's go with DataGuides.
[Laughter]
>> Jennifer Widom: So I'm going to say actually, I'm going to say this is
really the result that, for me, has had the most tenacity and longevity. So
I have a habit with Roy, whenever I hear -- so he's graduated ages and ages
ago. Whenever I come across somebody referencing DataGuides or using
DataGuides, I send him an e-mail. And it's still pretty common. People are
still using DataGuides. I can't believe it. So it's still -- it's great.
So really, a big favorite. And again, I wouldn't completely discount the
name. How many of you have used DataGuides? Anybody use -- all right.
Well, we got one.
[Laughter]
>> Jennifer Widom: So that's the end of this one, so we can --
>> Historically, this reminded me of the X spot on everything. So X spot sort of was done --
>> Yeah, X spot can use DataGuides. Yeah, absolutely. So we, I mean, we
actually converted the project to XML and, yeah, that's where DataGuides are
still being used is in -- yeah. You can make a DataGuide for JSON also. No
problem. Anything that -- yes, exactly. Anything that's self-describing
semistructured data needs a DataGuide really, I think. Or something like it.
>> [Indiscernible].
>> Jennifer Widom: Yeah. Yeah. Okay. Yes?
>> I was just curious if you could put this in context with the state of the
Internet in '97. I mean, this predates Internet search like.
>> Probably right around the time that Internet search --
>> Jennifer Widom: Wait. I thought that came -- I thought 1993 was sort of
when the browsers first came out. I remember --
>> [Indiscernible].
>> [Indiscernible].
>> Netscape was just coming out.
>> Yeah. Google [indiscernible]. Google -- [indiscernible].
>> LightPost was '94.
>> Jennifer Widom: LightPost was '94. We're all aging ourselves.
[Laughter]
>> Jennifer Widom: How many people were still in high school in '94?
>> [Indiscernible]. [Indiscernible] but at that point, [indiscernible]
because, you know a lot of optimized index, you know, [indiscernible] index
[indiscernible] optimization based on structure.
>> Jennifer Widom: So, but this, I would say DataGuides don't have too much
to do with the Internet actually. I mean, to tell you the truth, they're
really about semistructured data, data exchange. Though another thing I
remember very well is when I got a phone call from a random person, I still
don't know who it was, in my office, who said I saw your work on Lore. Have
you heard of this thing called XML? I actually hadn't heard of it at the
time, and I still don't know who he was and why he called. But I looked into
it and --
>> [Indiscernible].
>> [Indiscernible].
>> Jennifer Widom: Pardon?
>> He called.
>> Jennifer Widom: He called, but I mean, I get people who call and say that
they have solved P equals NP, so I don't even --
[Laughter]
>> Jennifer Widom: Well, he called and didn't -- I see, yes.
>> That was around the time when persistent object databases were in vogue.
>> Jennifer Widom: That's also true.
>> People were kind of going, oh, maybe it's not just relational databases as
we know them from transactional processing and there's all this debate
between the programming language community about persistent objects versus the
database community about [indiscernible].
>> Jennifer Widom: That's true.
>> We [indiscernible] persistence but we don't know what -- how to do
objects.
>> Jennifer Widom: Right.
>> So this debate going on, because I was on [indiscernible].
>> Jennifer Widom: Programming languages side, yeah.
>> [Indiscernible].
[Laughter]
>> We did lose.
>> [Indiscernible].
>> [Indiscernible].
[Laughter]
>> The game's not over. In fact, the languages stuff for persistence on
objects, now that we're going to have persistent memory essentially in a
year, these people don't seem to have read the papers.
>> Jennifer Widom: Well, okay.
[Laughter]
>> Jennifer Widom: You are energizing debate. Whether it's relevant to
the -- well, people were open to new database models. That I -- and
understanding that they needed them. That's true. This one was not too
related to object database because any way you looked at that, that was
usually strongly typed and this is like the opposite.
>> [Indiscernible].
>> Jennifer Widom: Right. Yeah. But I think that was a time when people
were realizing relational databases weren't going to solve everything.
People are still grappling with whether that's true or not 20 years later.
But anyway. Okay. Anything else on DataGuides? All right. Number two.
CQL, the continuous query language. So now we're going to wind forward five
years and it's 2002 and we're working on a project called the Stanford stream
data manager, which we called STREAM. And the project was --
[Laughter]
>> You can always find an acronym.
>> Jennifer Widom: You can always find an acronym. Yes, you can. So in
this project, we were again building a data base system for a new type of
data which is data stream. So instead of your data sitting on disk and
you're asking queries about it, your data is streaming in rapidly and you're
queries tend to sit there and watch the data stream and it stream out their
answers and the students who were working on this, Arvind made the slides
here. I don't know if you knew you were there. Where's Arvind? And Arvind
and Shivnath Babu who is on the faculty at Duke. So they're the two who
worked on the query language. Okay. So now, let me start with what is the
problem and so on. So what is the problem? We're building a database system
for data streams and we need a declarative query language. Okay. So why is
that important? Well, I would argue that a declarative query language is a
key component of any database system. I still think declarative query
languages and transaction processing are the two really key things about a
database system and there's lots of other stuff around it but you better have
both of those things I think to have a good database system. Okay. So I'm
going to claim that's fairly obvious. So why was this a hard problem? Well,
it turns out if we want to make a SQL like query language for data streams,
the semantics, what those queries actually mean is surprisingly tricky and I
actually think it has nothing to do with SQL. Whether or not you reuse SQL,
I think the semantics of queries over data streams is hard and I'm going to
give you examples for that. And secondly, the semantics can actually have a
significant effect on the implementation. So I have a pretty firm belief on
figuring out semantics first and then implementing later but there is some
interplay and in the data stream world, small changes in semantics can
make the difference between being able to process your query as each element
comes in and throw the element away versus having to keep all history of all
data. Even a small change. So that's important but I'm not going to cover
that particular aspect of it today. So I'm going to give you an example to
explain why it's hard. Here, we have a -- this is going to be a query that
has one stream and one traditional table. And the stream is just a stream of
page views. So this is going to be a view of a URL and the user ID who
viewed it and now we're going to separately have a table that has the age of
users so obviously this is extremely simple but will serve my purpose. And
what I want to do is find as these page views stream in, the average age of
the viewers for each URL in the last five minutes. Okay? So this is a
standard -- I'm going to show you SQL now. I think even if you don't know
SQL you'll be fine, but this is a pretty standard group by aggregation query,
except we have a stream. All right. So here's a SQL-like query that answers
that question. It says I'm going to take in the from clause -- you always
read the from first -- I'm going to take -- this is the one thing I've added
here, a five-minute window on that views stream. So views is a stream.
Going to look at the last five minutes, okay? And then I have my users table
and I'm going to join on the user ID. Very standard here. Grouped by the
URL and give me the URL and the average age. So I think that should be
readable for everybody. Pretty straightforward. Okay. What's the result of
this query? Is the result a stream? Is it a relation? Is it something
else? I would claim it's not actually obvious what the result of that query
should be. Okay? Though people would write it and not worry about it too
much. And here's a more really specific question about that query. So what
happens if someone's age changes while they're in the five-minute window? So
they already viewed the page and then their age changed. They're still in
the five-minute window. Does that change the result of the query or not?
Okay? So that's just a very specific point I wanted to make here. I'm not
going to tell you the answer just yet. I'm going to just point out that this
is pretty subtle. Okay. So now, let's go into why it hasn't been solved
already and then we'll go into our solution. At the time there were a few
groups building database systems for data streams, I got the sense that the
others didn't seem to worry too much about query semantics. Let me just put
it that way. I have a nit about the database community in general that
there's a lack of worrying about query semantics and I have a whole another
talk on that but I'll spare you that today. So that's where things stood and
we decided to worry about it ourselves. So what is our solution? So we
started to step back and figure out what the best way would be to define --
to make a very precise semantics for streams. And what we decided to do was
rely as much as possible on relational semantics because that, everybody
understands. People understand relational databases. So people know what
relations are and people know what it means to ask a query. If you ask a SQL
query on relations, you get a relation back or relational algebra, all well
understood. So what we decided to do is rely on that and then we have
streams and we have a very well defined way of going from streams to
relations and relations back to streams. We go from streams to relations
based on these window specifications, so when you put a window on a stream,
it turns into a table effectively. And then we have operators just a couple
operators that turn relations into streams. And what was the basis for our
definition. Okay. So let's go back to our query now with that in mind. So
this query now with our new semantics says that this here, this views with
the range is going to turn into a table. Okay? So that's just going to be
the last five minutes as a table. So then this result according to our new
semantics, is a relation. It's a relation because this is a table. Now
we're just doing the join. That relation is updated potentially when time
passes because this table here will change its value when time passes when
new page views occur. Okay? Or when ages change. So when anything changes
that contributes to this, that would be -- that relational result will get
updated. Okay? So clear, maybe not what we want, but clear. Okay. If we
want the result to be a stream, then that was pretty easy too. That says
that we're going to just add this operator we have called stream and what
that operator did would just stream out a new element whenever the result
changed so you can just think of it as a table but whenever there's a change
to the table, we stream out a new element. So all of that is good. The kind
of bad thing was this business with the age. So this, the way our semantics
worked, if someone's age changed after they viewed the page but while they
were still in the window, the result of the query changed. Probably not what
you wanted. Probably you actually wanted to use their age at the time they
viewed the page. Presumably that's wanted. Here's the query to do that.
I'm not going to claim it's wonderful but it works. What do we do? Well, we
take our views stream and we have this window called now which makes just the
latest element into a table. We join that so we are joining, basically we're
joining with the user table at the time the element appears and turning that
into a new stream. So now we're streaming out the views with the ages. Then
it's that stream that we take the five-minute window on and everything works
from there. I'm not going to argue that it's beautiful but at least we have
a well-defined semantics.
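The queries being described, reconstructed here in CQL-style syntax as a sketch: the keyword spellings follow the published CQL papers as best recalled, the CREATE STREAM form for the named intermediate stream is illustrative rather than whatever the slide actually used, and Views(url, userId) and Users(userId, age) are the schemas implied by the example.

    -- Windowed join: [Range 5 Minutes] turns the Views stream into a relation,
    -- so the result is a relation that changes as time passes, as new views
    -- arrive, or as ages change.
    SELECT   V.url, AVG(U.age)
    FROM     Views [Range 5 Minutes] V, Users U
    WHERE    V.userId = U.userId
    GROUP BY V.url;

    -- Same query, streamed: Istream emits an element whenever the relation
    -- above changes.
    SELECT   Istream(V.url, AVG(U.age))
    FROM     Views [Range 5 Minutes] V, Users U
    WHERE    V.userId = U.userId
    GROUP BY V.url;

    -- Freezing each viewer's age at viewing time: join the [Now] window with
    -- Users, stream that out, and only then apply the five-minute window.
    CREATE STREAM ViewAges AS
        SELECT Istream(V.url, U.age)
        FROM   Views [Now] V, Users U
        WHERE  V.userId = U.userId;

    SELECT   url, AVG(age)
    FROM     ViewAges [Range 5 Minutes]
    GROUP BY url;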
>> Eric Horvitz: And it's probably an Einsteinian relativistic version of
this where space and time is part of now.
>> Jennifer Widom: Well, sure. Yeah. Something like that.
[Laughter]
>> Jennifer Widom: Okay. So, just going to summarize now. Summarize what
I've said. So we have a -- what we defined is a precise semantics, what we
call an abstract semantics was that diagram I showed based on the fact that
you have a relational semantics, the fact that you have these specific
operators that go from streams to relations and relations back to streams.
We had a concrete implementation based on SQL with the windowing constructs.
We also added a sampling construct which turned out to be very -- the stream,
data stream query languages or data stream applications often like to do
sampling so we threw that into the query language. Some of the most
interesting work actually was in query equivalences. So it was pretty
interesting, you can actually analyze a query for example that would use an
infinite, arbitrarily growing window in the query and you could analyze the
query and see that you could change it to one of these now windows and there
were a whole bunch of other optimizations. I thought that was one of the
most enjoyable parts of the work. Okay. We had a guiding principle for the
work that drove what we did. Easy queries should be easy to write. Simple
queries should do what you expect. And I think we achieved that. What it
didn't say anything about was the hard or the complex queries. So the hard
queries were not always easy to write and the complex ones were not always easy to
understand, I would say. I also wanted to mention briefly about time and
ordering. You brought this up slightly. This was the issue of streams
coming in out of order or there being large gaps in the timestamp or time
passing and not knowing if you might get a stream element from a long time in
the past was a big problem in data stream systems. Some of the other
projects chose to deal with that problem in the query language itself. We
chose to not do that, which helped. We chose to assume there was a lower
layer that was buffering the streams and delivering well-behaved streams to
the query processor. So we would assume there was a bounded window beyond
which you would never get elements coming in late, right, and things like
that. We assume that they would be within a bounded amount of orderedness
and so on. And that was quite important, I think, to the work. Okay. So,
why is it a favorite? Well, first of all, I think that query language design
as a field is highly underrated. It's difficult to publish in. I have my
favorite story. Some of you might have heard before. We're going to go back
to the Lore project and the query language that we developed for that
project, which was called Lorel. And we could not publish our Lorel paper for
the life of us. We tried everywhere, nobody wanted it. Finally, one of my
coauthors, Serge Abiteboul, who was visiting Stanford for a couple years at
the time, said, well, I was invited to be the -- to contribute a paper to a
new journal called the Journal of Digital Libraries, Volume I, No. 1. Maybe
we should just put it in there. And we said all right. It was the only
volume ever, number ever of that journal.
[Laughter].
>> Jennifer Widom: But, I was very happy that for a rather significant
length of time, like a couple of years, that paper was in the top 100 cited
papers of computer science in that really defunct journal. So that tells you
not to worry if your things keep getting rejected. They can still have
impact. We had some difficulty publishing this work, but -- as Arvind is
nodding. But it did get some attention. And we had a little easier time.
So I think people were recognizing that. The need for semantics, as I said,
is often ignored by the database people. There are some really sorry stories
about the early days of SQL, simple queries where two different systems would
get different answers on -- I mean, it's really amazing.
>> It's not actually the early days.
[Laughter]
[Indiscernible].
>> [Indiscernible] SQL 7.0. Jim Gray and his lab, a San Francisco lab, have
[indiscernible] DB2 and SQL Server. Don Slutz was running the project to
see whether they answered the same, gave the same results. [Indiscernible]
scheme, right?
>> Jennifer Widom: Right.
>> And there are serious [indiscernible].
>> Jennifer Widom: Okay. I could go off on this tangent here. There are
still queries today where different systems will give different answers.
Even worse, there is a type of query where some systems can give you a
different answer on different days without you changing the data. It has --
very briefly, if you do a group by query, and you add to the select clause an
attribute that's not in your group by clause and that's not an aggregation,
some systems will choose a random value from your group to put in the result
of that query. And that random value could change if the database gets
reorganized. Yes, I teach introduction to databases so I like to point this
out to the students. It's pretty shocking actually. It's -- right.
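A minimal illustration of the pitfall just described; the table and column names are made up. The query groups by dept but also selects city, which is neither grouped nor aggregated:

    -- Strict systems reject this query outright.  Permissive ones pick some
    -- value of city from each dept group, and that pick can change if the
    -- data is physically reorganized, even though nothing logical changed.
    SELECT dept, city, COUNT(*)
    FROM   employees
    GROUP  BY dept;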
>> Select star is [indiscernible].
>> No, select star is a [indiscernible].
[Laughter]
>> It's not bound. So you could have [indiscernible] assumes 15 attributes
right, [indiscernible]?
>> Jennifer Widom: Yes.
>> Then somebody adds three columns.
>> Jennifer Widom: That's true. Okay. But if you change the schema, I'm
slight -- I mean, that's not good. But this -- this is an example where you
don't change anything. Right? You don't change the schema, you don't change
the data. One system gives -- good systems say you can't write that query.
The bad systems give you a random answer and that random answer can be
different at different times. Yeah. It's bad.
>> [Indiscernible] query.
>> Jennifer Widom: Something like -- well, yes.
>> How do you explain it away?
>> Jennifer Widom: Sure. Right. Yes. Anyway, so I think in this case,
people at least appreciated that there were some challenges and subtleties in
the semantics. Lastly, I would, I guess, say not the name for sure. Although it
was a fine name, it didn't quite have the oomph of DataGuides. Okay. So
that's number two. Any more discussion on that one?
>> Eric Horvitz: Yeah. What's your reflection on where stream processing has
gone over the years since the result?
>> Jennifer Widom: People are still working on it and they haven't like
solidified it. It's surprising to me that there's no standard. Right.
And -- yeah. It's still ongoing.
>> Eric Horvitz: It actually can be quite important even for these AI
systems that --
>> Jennifer Widom: Absolutely.
>> Eric Horvitz: -- multisensory streams, very fast paced.
>> Jennifer Widom: Right. I mean, yes. And people keep building new
systems and they keep doing different things. I mean, I guess if there was a
real need for a standard, it would have emerged, but yeah.
>> I was a little surprised that you mapped from the stream world to the
relational world just using time windows. A number of other things having to
do with order of events.
>> Jennifer Widom: So we had time windows and we had number of events. So
you could either have number of rows or number of tuples or you could have
time. You could -- and there has been a different line of work on very rich
windowing constructs. And so, our -- in fact, our party line was any window
in construct is fine, the abstract semantics would take any. Our concrete
implementation just had those two types. Yeah.
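For reference, the two concrete window flavors just mentioned, in the same CQL-style syntax (spellings approximate):

    -- Tuple-based window: the last 100 elements instead of the last 5 minutes.
    SELECT   Istream(V.url, AVG(U.age))
    FROM     Views [Rows 100] V, Users U
    WHERE    V.userId = U.userId
    GROUP BY V.url;
    -- Other windows that come up in the talk: [Now] (just the newest elements)
    -- and [Unbounded] (the entire history of the stream so far), which is the
    -- arbitrarily growing window mentioned above.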
>> Well, analogous to that question is you have a pretty fixed semantics for
relational tables but it seems to me you implicitly chose a semantics or
streams so tight you could do this mapping on it because there are many
semantics that you also associate with streams.
>> Jennifer Widom: That's correct. Yeah. And there's one actually
significant reduction of expressiveness that nobody -- that happened which is
that when we switched from streams to relations using windows, we lost the
ordering.
>> [Indiscernible].
>> Jennifer Widom: Yes. And we knew we were doing that. But we did it anyway.
>> And that seems a little against the whole behavioral property that one
would associate with streams. Just to be honest, right?
>> Jennifer Widom: Yes. Yes. Though -- right. So I agree with that --
>> Unless you took this abstract view it's just about windows elements.
>> Jennifer Widom: Yeah. That's right. Yeah. It was a conscious decision
partly to keep things simple. And there were ways to overcome it, but yeah.
Okay. Number three, ULDBs, uncertain lineage databases. So now, we're
winding forward to 2006 and it's a project called Trio. Trio was a system
for integrated management of data uncertainty and lineage. So that was why
it was called Trio, for three things. This was our logo. Anybody see
anything unusual about the logo?
>> [Indiscernible].
>> Jennifer Widom: The wheels cannot actually turn.
[Laughter]
>> Jennifer Widom: You have to separate one of them to make them turn. But I
think we more or less made them turn anyway. Okay. So --
>> Did you realize that problem after the logo was created?
>> Jennifer Widom: Yes. Yeah, we did. But I liked it anyway. It tells a
good story. Right? Okay. So the people who were involved specifically in
the -- so what I'm going to talk about is ULDBs, which is the data model
or representation scheme for the Trio project and the people who were
involved in that particular part of it were Omar Benjelloun, who was a post
doc at the time, my Ph.D. student Anish Das Sarma, and Alon Halevy who was
visiting Stanford at that time. By the way, this was very briefly the iPod
slogan. That was back when people -- they introduced the shuffle I think and
people didn't like it and there was a big billboard in San Francisco that
said enjoy uncertainty. So we grabbed it. Okay. All right. So what's the
problem? Well, once again, we're building a new kind of database system.
This is what I like to do actually. And now it's for uncertain data and I'll
explain what I mean by that. And we need a data model. Okay? So why is it
important? Well I argue a well chosen data model is important for anything
you're doing in data management at all and I would say anything you're doing
in data at all, you better understand what your data looks like or what the
possibilities are for your data. I do want to be very clear. I don't know
that I need to with this audience. I'm not talking about an AI model or
anything like that. I am talking about how you represent your data. What
it's structured like. So the first part of the talk, I was talking about
those directed labeled graphs, the second part I was talking about data
streams. Now I'm talking about uncertain data but not in the AI sense.
Okay. So why is it hard? What we're going to see is that developing this
data model or representation scheme for uncertain data, we come quickly to a
tension between having an understandable model, one you can look at and know
what it's talking about, and one that's expressive enough and I'm going to
give a very concrete example for that. So here comes the example. This is a
database for solving crimes. So we're going to have -- we're going to have
witnesses and drivers so there was a crime -- there was a crime committed.
There were people driving cars near the crime. People who owned cars and
witnesses who might have seen cars. So specifically, we're going to have two
relations, the saw relation where a witness might have seen a car at the
scene of the crime, okay? I'll get to some real data in a moment. And
people who might drive particular cars. Okay? So these will look like
regular tuples but we'll see what I mean here with the uncertainty. So if we
want to generate suspects, we just do a relational join. If we wanted to
generate a suspect for the crime, we find people who might drive a car that
might have been seen by a witness at the crime and again, I'm going to
explain all this in detail. All right. So let me just back up.
>> Are there no pedestrians in this?
>> Jennifer Widom: No, this was all about driving. There's no pedestrians.
[Laughter]
>> Jennifer Widom: I don't know what the crime was, but it was committed in
a -- well, maybe they jumped out and robbed a bank or something like that.
Yes. Okay. Again, contrived to be the simplest possible example that brings
out the important points. Okay. So let me back up and talk about what
people agreed about for uncertain databases. So pretty much everyone agreed
that abstractly, an uncertain database is a representation of a set of
possible certain databases. Maybe arbitrarily large set. Okay? Those are
often called possible instances. So I'll get to this in a moment but in our
example, we could have that Kathy saw a Honda or a Mazda, so there were two
possibilities. Kathy saw a Honda, Kathy saw a Mazda. Amy might have seen an
Acura or maybe she didn't see one. Okay? We have a Honda that's driven
by Billy or Frank. Concretely, we're going to represent these as alternative
values like Kathy saw a Honda or a Mazda and then we're going to have these
question marks that say that values can be either present or absent. Now, in
the Trio project, we also had confidence values or probability so that would
be more in the probabilistic data sense but I don't need those to get my
point across in this talk. So we're not going to have them today. Okay. So
here's the very concrete representation of what I described. So these are
two tables in the uncertain database world. The first table, the saw table
says Kathy saw a Honda or Kathy saw a Mazda. So this tuple has one of two
possible values. This says Amy might have seen an Acura, but that question
mark says present or absent. So this table has four possible instances. Two
for the first tuple and two for the second independently. Okay? Over here,
we have a Honda that's driven by either Billy or Frank so two possible
instances there. So this uncertain database has a total of eight possible
certain databases. All right? Yes?
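As a sketch, the two tables just described might be drawn like this, with '||' separating a tuple's alternatives and '?' marking a maybe-tuple (tuple IDs 11 and 21 reappear in the lineage discussion below; 12 is just an illustrative label):

    Saw(witness, car)                        Drives(person, car)
    11: (Kathy, Honda) || (Kathy, Mazda)     21: (Billy, Honda) || (Frank, Honda)
    12: (Amy, Acura) ?

Two alternatives for tuple 11, times present-or-absent for 12, times two alternatives for 21, gives the eight possible certain databases.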
>> Is the semantics that Amy didn't see a Mazda?
>> Jennifer Widom: Yes. Well, no. This -- and it doesn't say anything about
what people didn't see.
>> So then that statement has no -- what does it say?
>> Jennifer Widom: This says that Amy may have seen an Acura. One of the
possible instances -- in one of the possible -- well, in half the possible
instances --
>> [Indiscernible] maybe probable models of the [indiscernible].
>> Jennifer Widom: So in half of the possible databases, Amy saw an Acura.
Right. Yeah. Doesn't say anything about the absence, though. Okay. So why
is -- so why is our problem hard? What's wrong with this model? Well, it
turns out that the simple model is not closed and what does closure mean?
Closure means that I have a model or representation scheme and that when I
run a query on it, the answer can be represented in the same scheme. And
that's considered a no-brainer for databases. You want that to be true.
When I run a query on these uncertain databases, I want to be able to give
you the answer as one of these uncertain databases. Pretty important. This
model is not closed and I'll show you that now and I'm going to have a quiz
for you so everybody get ready. Okay. So I've expanded my database now. In
addition to these, I have a guy, Jimmy, who drives a Toyota or a Mazda and I
have definitely that Hank drives a Honda. Anybody notice anything about my
choice of data? The men are the criminals?
>> Oh, I was just going to say, the women are on the saw and the --
>> Jennifer Widom: The women are the witnesses, the men are the criminals,
just like real life.
[Laughter]
>> Jennifer Widom: It helped us keep all our data straight.
[Laughter]
>> Jennifer Widom: Okay. All right.
>> You don't have interesting cars.
>> Jennifer Widom: Don't have interesting cars, okay.
[Laughter]
>> Jennifer Widom: That's also true.
>> They all have Japanese cars.
>> Jennifer Widom: That's also true. I should change these to Tesla.
>> [Indiscernible].
>> Jennifer Widom: All right. I'll put a Tesla in here. Okay. Let's
run our relational join on these two tables to get the answer to our query.
All right. When we do it, here's what we get. Okay? And this is where your
quiz is coming in. We did the join on these and we -- what we get is that
Billy or Frank may be suspects. Jimmy might be, Hank might be. All right.
But, this doesn't capture the correct possible instances in the result and
I'm going to ask you why. And I'll just tell you that I gave this example in
a talk at the ACM India conference about three weeks ago to a thousand eager
undergraduates and one of them was so excited when he jumped up and -- I just
sat there and waited, and then one of them jumped up and was so excited that
he saw the answer, and he got it right. Does anybody see why this doesn't
capture the right instances in the result? Now you're under pressure. Yes?
>> Couldn't the suspect also be none of the above because if Amy was right
that she saw the Acura and so someone -- no one is driving -- so there could
be a suspect who is not yet in your database.
>> Jennifer Widom: That's sort of coming to the same issue of absence of
data. We're kind of using this closed world. So that's not the problem.
Yes?
>> Why is there still Billy or Frank? Shouldn't there be four?
>> Jennifer Widom: Well, it's Billy or Frank because one of them drove that
Honda that might have been seen by Kathy. Right? So if Kathy saw a Honda,
then Billy or Frank could be a suspect.
>> So it's [indiscernible].
>> Jennifer Widom: Yeah.
>> You can't have both row 1 and 2?
>> Jennifer Widom: Yes. You can't have both rows 1 and 2 at the same time.
If Billy -- and there's other examples of the same thing. If Billy or Frank
is in the answer, so if they're actually there, that means that Kathy saw a
Honda. If Kathy saw a Honda, she didn't see a Mazda. If she didn't see a
Mazda, then Jimmy can't be in the answer. And by the way, if Billy or Frank
is in the answer, then Hank has to be in the answer, another example. So
effectively, there are correlations, relationships between things in the
answer that depend on what you choose in the original data. Okay? So we
actually proved that our model cannot answer -- cannot represent the answer
to this query. Just can't do it. So this model is not expressive enough.
So what happened next? Oh, sorry. Just a moment. Why hadn't it been solved
already? Well, there were other people working in the area. Most of them
were theorists, I would say, at the time. So they actually were not too
concerned about this understandability. So there were other models that had
sort of complex constraints and put variables in there and so on. We were
trying to get something that people could actually look at and know what the
data meant. Whether we achieved that, you'll see -- I'll see what you think,
but that was our goal. Okay. So what happened? Actually lineage, believe
it or not, came to the rescue. Lineage is -- lineage can mean a lot of
different things but it's effectively the concept of tracing where data comes
from. So, what we did is we added to our model and it's a little ugly here
but what we added to our model is effectively pointers or capturing in the
answer where the data came from. I tried to do this with arrows but it got a
little too complicated. But effectively, this says that this first
alternative of tuple 31 came from the first alternative of 11 and this first
alternative of 21. That's what this little thing here says and the second
one came from the first alternative here and the second there. So these are
effectively pointers to where the data came from. Okay? And then
the interpretation of this data is that the possible instances that this
represents, the set of possible databases are only those databases where you
have consistent lineage. Okay. You can't grab at the same time two things
that come from two different choices in your base data. Okay? And this,
with the lineage, correctly captures the possible instances in the result.
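
A hedged sketch of the lineage idea just described, with an assumed encoding rather
than Trio's actual representation; tuple ids 11, 21, and 31 echo the ones mentioned
above, while ids 22 and 23 and the values are made up for illustration.

    # Lineage as pointers from result alternatives back to base alternatives.
    from itertools import product

    # Base x-tuples: id -> list of alternatives.
    base = {
        11: ["Saw(Kathy,Honda)", "Saw(Kathy,Mazda)"],
        21: ["Drives(Billy,Honda)", "Drives(Frank,Honda)"],
        22: ["Drives(Jimmy,Mazda)"],
        23: ["Drives(Hank,Honda)"],
    }

    # Result x-tuples: each alternative carries lineage {base_id: alternative_index}.
    result = {
        31: [("Billy", {11: 0, 21: 0}), ("Frank", {11: 0, 21: 1})],
        32: [("Jimmy", {11: 1, 22: 0})],
        33: [("Hank",  {11: 0, 23: 0})],
    }

    # A possible instance picks one alternative per base x-tuple, then keeps only
    # the result alternatives whose lineage is consistent with that choice.
    possible_instances = set()
    for choice in product(*(range(len(alts)) for alts in base.values())):
        base_choice = dict(zip(base.keys(), choice))
        instance = frozenset(
            person
            for alts in result.values()
            for (person, lineage) in alts
            if all(base_choice[i] == a for i, a in lineage.items())
        )
        possible_instances.add(instance)

    print(sorted(map(sorted, possible_instances)))
    # [['Billy', 'Hank'], ['Frank', 'Hank'], ['Jimmy']] -- exactly the instances the
    # base data allows; without the lineage check, independent choices would admit
    # impossible combinations such as {Billy, Jimmy}.
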
But it's even stronger. First of all, this model with the lineage, these
constructs and lineage, is closed under all the relational operations. We
proved that. But furthermore -- and the second actually implies the first -- it's complete. So any uncertain database you give me, and there I mean any
set of databases, any set of possible databases can be represented in this
model. All right. So why is it a favorite? Well, the Trio project itself
was conceived before we actually built the data model. So we started a
project. We wanted to do data uncertainty and lineage and the combination
was really motivated entirely by applications. So scientific data applications
often seemed to need both uncertainty and lineage. We were looking at the
entity resolution problem, which is also one where you have uncertainty and
lineage, and a whole bunch of applications seemed to need both, and that was
why we developed the project. Never
imagining that lineage would turn out to be the key to representing uncertain
data. Seriously never imagined that. So, in retrospect, you know, why is it
that -- maybe there was an implicit connection somehow in the applications?
That's probably the most likely. Maybe an unconscious hunch. Maybe less
likely. Divine intervention, pure luck. Hard to know. But anyway, that's
one of the reasons I really like this one is that it sort of fit together
later on. Definitely not the name. Okay. So, just going to wind up. Is
there anything in common among these favorites? You know, let me just do my
best and try to make something common. In the area of developing data
models, developing query languages, we worry a lot about expressiveness. We
worry a lot about simplicity and we just saw that in the last one, and then
efficiency. I didn't talk about efficiency today, but as I hinted through
all this work, we were thinking about efficiency. So I would say that
DataGuides did pretty well on the expressiveness and the simplicity side.
Not so well on efficiency, as I explained, at least the pure DataGuides. If
you look at CQL, expressiveness is pretty good. Efficiency, pretty good actually.
Maybe not so good on simplicity. I like to say that ULDBs maybe did manage
to hit that center point of all three of those, but the one thing I would say
completely in retrospect is that balancing -- trying to balance these
conflicting goals I would say has been a theme across a lot of my work, and
that's something I learned preparing this talk. So thank you.
[Applause].
>> Eric Horvitz: Do you have questions? I guess we've been going -- a session as long as we've gone.
>> Jennifer Widom: Right. We could have some more debates.
>> So for ULDBs, with respect to the efficiency, like for relational
[indiscernible] operations, [indiscernible], but I'm talking about the
execution of the queries.
>> Jennifer Widom: Right. So the only really problematic case is
aggregation, actually. Other than that, it was fairly efficient. So
aggregation has this problem. If you imagine a relation of 100 tuples, each
of which is present or absent and you ask for the sum of them, you have two
to the 100 sums. So other than that, it's efficient. There were also some
complicated things I didn't get into where when you have negation in your
queries, it gets complicated with this and you start having boolean
expressions as your lineage and so on. But for standard relational
operations, it's not a problem.
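
A small illustration of the aggregation blow-up just mentioned, using assumed values;
only the two-to-the-n point comes from the talk.

    # With n maybe-present tuples, a SUM query can have up to 2^n distinct answers.
    from itertools import combinations

    values = [1, 2, 4, 8, 16]   # 5 maybe-present tuples; powers of 2 keep all sums distinct
    possible_sums = {sum(subset)
                     for r in range(len(values) + 1)
                     for subset in combinations(values, r)}

    print(len(possible_sums))   # 32 == 2**5; with 100 tuples this would be 2**100
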
>> So [indiscernible], but this is one area where there is obviously a lot of
uncertainty and continues to expand as we get into IoT sensing.
>> Jennifer Widom: Right.
>> But as --
>> Jennifer Widom: And by the way, I didn't talk about probabilities at all, but that's a huge thing. I just didn't need it today. Yeah.
>> So [indiscernible], what is your reflection on whether it has taken root or not, and why not?
>> Jennifer Widom: You mean in terms of a generic platform for uncertain data?
>> Yeah.
>> Jennifer Widom: I don't know if there's going to be one that serves
everybody's purposes, to tell you the truth. But it's not -- verdict is
still out. Like you said, there's more things coming out that need it. I
think there's been a lot of one-offs for particular application centers for
example. Yeah.
>> Do you have any results that you were very excited about at the time, you
know, when you submitted the paper and now, looking back --
>> Jennifer Widom: Oh, man.
>> -- and producing this talk, you're kind of surprised at the fact that
actually, it's not even worthy of consideration for the top three?
>> You might as well have three least favorite results.
>> Jennifer Widom: I didn't talk about active databases at all. You know, I
need some time to think about that. It's a great question. I should think
about that. I did have to pick and choose a little, but you mean, something
I really loved that nobody else liked? Well, I mean --
>> Well, well, either that or that you changed your mind about the work based on the way technology has unfolded.
>> Jennifer Widom: Oh, I see.
>> You know, turns out to be only a theoretical [indiscernible] but at the time you really thought it was --
>> Jennifer Widom: I have plenty of work that probably falls in that category.
[Laughter]
>> Jennifer Widom: We all do, don't we? It's a good question. Right. That would be an interesting talk, three least favorite results.
[Laughter]
>> Jennifer Widom: Three bombs of --
[Laughter].
>> Eric Horvitz: Or tenures --
>> Jennifer Widom: Yeah. So the anti-tenure paper award, right?
[Laughter]
>> Jennifer Widom: It's a good question. I don't have an answer right off the bat. It's a good question.
>> As you said, these are your three favorite results from your own work. Do
you have a couple favorite results from other people's work in the field that
were the most kind of inspirational or --
>> Jennifer Widom: Oh, no, that's a hard one too. Oh, boy. I'm really
getting put on the spot here. I'm going to have to put that one off too.
I'm sure, I mean, I absolutely -- and probably in the -- if you want to take
these three areas, then probably it's in the probabilistic database where
there are some beautiful theoretical results. So like I said, we wanted
something that you could look at -- we were looking at user-facing data
models, but behind the scenes, people like Dan Suciu of Washington -- oh,
we're near Washington, aren't we?
[Laughter]
>> Jennifer Widom: Had beautiful -- we're in Washington, yes. There was
some really nice theory behind there unlike, I would say, the other two
areas, truthfully. What I would appreciate mostly would be really nice
theoretical results that back doing something. I've always liked to build
prototypes, but having something that's backing that is important.
>> Lucy: I wonder, so you have the uncertainty. So if you have that -- I
can't remember who the people were, that Sally saw a Honda or Sally saw an
Acura. Those appear to be uncertain, but if you attached a time modifier to
those, maybe they're less uncertain.
>> Jennifer Widom: Right.
>> Lucy: And I'm thinking about kind of the databases that Hoifung and I are
kind of constructing where, say, this particular protein up regulates that
protein, or it doesn't up regulate that protein. Those appear to be
contradictory, but they're not, because it up regulates this protein in the
mouse genome but it might down regulate the protein in something else.
>> Jennifer Widom: Right.
>> Lucy: So there are apparent -- like have you handled apparent
contradictions but they're not contradictions because they just require
further modification in order to understand?
>> Jennifer Widom: Yeah. My answer to that would be in our model, no,
because our model was pretty cut and dried. We have these alternative
values. It's one or the other and not both. I mean, you can construct, so
it's both. What I would say for that type of thing, just a shot in the dark
here, but the uncertain databases have this interpretation where it's a set
of possible certain databases. And it sounds like maybe there's some
layering of additional information that would constrain or even expand what the
set of possible certain databases is. That's what it sounds like to me.
Almost like our lineage, right? So we had our representation and then we
said, okay, but with lineage that changes what the possible certain databases
are and it sounds like you might have something like that where you have sort
of the data and then additional information that constrains or expands, maybe
in your case, what the possible databases are. Does that make any sense?
>> Lucy: I think that the lineage is very important. I guess the thing is
I'm coming from a text processing point of view where, you know, kind of we
don't just say triples. There's a lot of meaning that gets layered on.
>> Jennifer Widom: Right.
>> Lucy: And so, kind of how do we meaningfully layer those additional
constraints.
>> Jennifer Widom: Although I've always argued in the database world, we
don't layer on any meaning: We just give you the data and you can do what
you want with it.
[Laughter]
>> Jennifer Widom: This is what I always have to explain to people when they
think this is some kind of AI system, the Trio. I keep saying it's not.
It's just the data there. If you want to layer a Bayesian network on it, you
can. So probably we don't capture what you kind of --
>> So, Lucy, why have you not explicitly represented that? Because --
>> Lucy: The context?
>> Yeah, because then you have the context, right? And so if you use Z3 or
theorem provers with ULDBs, you can reason about it, but if you have it
embedded in your inference system, implicitly, then you're developing a very
specialized inference [indiscernible], which makes sense, is a lot more
efficient.
>> Lucy: I think it would be really interesting to talk about because the
thing that -- when people think that way, you end up with kind of
non-predicates and then it's very hard. Then you can't see inside the
predicate anymore. But that's a longer discussion. But it would be really
interesting to have that.
>> Eric Horvitz: [Indiscernible] question I'll ask you on career. It seems
like, looking at your bio and so on, you spent about five years at Almaden
after you finished your dissertation work at Cornell. And then moved to
Stanford. And so you made this decision to go to an industry research lab.
And then [indiscernible] academia. And it's a decision that many research
scientists who are in this room have pondered, made at different points in
their career, and then recurrently revisit.
>> Jennifer Widom: Of course most of the ones in this room didn't go, right?
Or they wouldn't be in the room. Didn't leave a research lab and go to
academia, but many who are in the room went the other direction.
>> Eric Horvitz: No, no, but the decision was entertained at the time of
like Ph.D. completion.
>> Jennifer Widom: Oh, I see.
>> Eric Horvitz: So I'm just curious to reflect on the -- and of course, you
know, MSR and over time, it's quite different probably than Almaden
[indiscernible] research labs, they're all quite different, but you could
share some of your experiences of what it was like to make the decision, to
be in Almaden, then Stanford and just reflect a little bit for this group.
>> Jennifer Widom: I would be happy to. Okay.
>> Talk to us in private, if you want.
[Laughter]
>> [Indiscernible].
>> Jennifer Widom: Okay. I got my Ph.D. in program verification. My
thesis was a negative result in using temporal logic to prove properties of
concurrent programs. It's pretty much a dead end, I thought. Anyway.
Well, I mean it was a great thesis of course.
[Laughter]
>> Jennifer Widom: Okay. So let's start with that --
>> Times have changed, by the way, since then.
>> Eric Horvitz: Yeah, we have a --
>> Jennifer Widom: That's true.
>> Eric Horvitz: We have a place for you right now in our [indiscernible].
>> Jennifer Widom: I shouldn't put down the area. I enjoyed my thesis work
and it was appropriate for Cornell where I was. I had a two-body problem
when I left. We interviewed at many universities and a couple of research
labs and in the end, the optimal decision for the two of us was to go to IBM
Almaden. Now, that was a great turn of events for me because at IBM, I was
given the chance to join the database group based on pretty flimsy evidence
that I might know something about data. I really didn't know anything. I
had done a summer internship at Xerox PARC. It had something vaguely to do with
databases. So at IBM Almaden, they offered me the chance to join the group
and who wouldn't join that group? It is an amazing group. So I went there
and I became a database person, which was great. And then so that's where I
learned -- that's where I switched research fields at that point in the
context of an amazingly good group, in the context of having 95 percent of my
time to just focus on research. At that time at Almaden, that was also the
glory days where you could just publish in the context of a big software
project. All good. Five years later, I had the chance to go to Stanford.
Who would turn that down? And that's more or less the story. So not exactly
a conscious decision. More a sort of meandering of events.
>> You at least credit that because your thesis work was in verification and languages, that your statement about semantics, you know, it's --
>> Jennifer Widom: Absolutely.
>> -- from your lineage.
>> Jennifer Widom: Absolutely. I have no question in my mind that my
programming languages Ph.D. influenced everything I did in databases. No
question. I should have said that. Absolutely. Yeah.
>> Eric Horvitz: And I guess, just to complete the discussion, so the
environment and lifestyle in academics for you, academia with the students
and so on at Stanford versus the focused time you had at Almaden, how would
you kind of compare or contrast those kind of experiences?
>> Jennifer Widom: Well, I mean, I always like working with students. I had
students when I was at Almaden. I'm sure most people here like working with
students also. But at that time, I literally spent 95 percent of my time on
research. Obviously, then after going to the university, it was never even
50 percent, I don't think. But well, the other thing, again, this is very
personal, but establishing my research career at IBM Almaden, by the time I
left there and went to Stanford, I had already established myself as a
researcher, so that really eased being a junior faculty member. I would not
change anything about what I did. The fact that I went to Stanford not with
that immediate pressure of just having finished my Ph.D. and having to get
grants and all that and ramp up and worry about tenure and all that, it did
ease the transition, I would say.
>> Eric Horvitz: Okay. Any other questions? AJ?
>> So you pretty much separated your work from what happens in AI through the talk.
>> Jennifer Widom: Yes.
>> You said I don't want to do inference. I just want to represent the data.
But the [indiscernible] foundations in AI are also about logic and the
relationship and the representation of the data. So in terms of these
favorite results in your career, how would you rank the results with respect
to the influence on other disciplines, for example, AI, or is that just the
hard thing that hasn't really happened that much?
>> Jennifer Widom: Boy. I don't know the answer to that question. How
would I rank them in terms of influence on other fields in computer science?
Boy. Well, there's information retrieval that used DataGuides, but I don't have a
good answer to that. But, by the way, I will say on the record here that
getting database and machine learning communities together is like the number
one priority. I think that is really important right now, and I think people
are working on that. So I'm guilty of being one of these people who has
really separated them and will just make it very clear, okay, this isn't
going to be AI. I'm just building the substrate. But I do think it's very
important for the fields to get together.
>> Eric Horvitz: Why don't we stop there and thank Jennifer.
[Applause]