>> Eric Horvitz: It's an honor today to have Jennifer Widom with us. She's
the Fletcher Jones professor of computer science and electrical engineering
at Stanford University. She's now also the senior associate dean for faculty
and academic affairs in Stanford school of engineering. Jennifer served as
chair of the CS department at Stanford from 2009 to 2014. I was surprised
that she wasn't doing that when I visited about 9 or 10 months ago and I
popped in her office and I saw she was doing dean-like things. She was
looking over pie charts of -- the gender breakdown of CS majors, incoming and
by year. And she was very much engrossed in doing this
kind of interesting demographic analysis. So it's part of her new role at
Stanford. Jennifer received her bachelor's degree from Indiana University,
Jacobs School of Music. I think I'm going to ask you what instrument you
played, but we'll talk about that later.
>> Jennifer Widom: Trumpet.
>> Eric Horvitz: Trumpet. Wow. That's fabulous. So did I.
[Laughter]
>> Eric Horvitz: My high school instrument. And her Ph.D. was from Cornell
University. And she was a research staff member at IBM Almaden before
joining Stanford in 1993. We probably just basically missed each other. I
left Stanford in '93 to come to Microsoft Research. Jennifer is an ACM
fellow, a member of the National Academy of Engineering, and the American
Academy of Arts and Sciences. She won the ACM SIGMOD Edgar F. Codd
innovations award a few years back and also the Guggenheim fellowship. So
we're here today to continue this year Jennifer's celebration of the ACM-W
Athena Lecturer Award. Each year, ACM-W honors a preeminent woman computer
scientist as the Athena lecturer. They have that title for the year. And
they -- and the honor celebrates women researchers who have made fundamental
contributions to computer science. And going back into some of the citations
on Jennifer's work, her research interests have spanned many aspects of
non-traditional data management. She was cited as introducing fundamental
concepts and architectures of active database systems, which is a major area
of research in the database field today. Active database systems allow
application developers to embed logic into the database that allow actions to
be executed when certain conditions are met. Active database systems have
had major impact on commercial database management systems and most modern
relational databases include these active database features. She was also
cited for fundamental contributions to the study of semistructured data
management. Semistructured data management systems are key in supporting
many applications that are coming forward today such as genomic databases,
multimedia applications, and digital libraries. The lecture that Jennifer
will give today was originally presented last June at SIGMOD. People that
get this award get to choose the meeting they'll give their main lecture at
and she chose SIGMOD, which was in Melbourne, Australia last June. And so we
invited her to come today to MSR to give us a reprise of this lecture and
she's going to share her three favorite results and great strategies
for giving a talk, I think. Let's welcome Jennifer.
[Applause]
>> Jennifer Widom: Thank you. Thank you for that very nice introduction.
10:40. What time should I plan to stop?
>> Eric Horvitz: Noon.
>> Jennifer Widom: Noon. Okay. I won't talk until noon, I promise, but --
>> Eric Horvitz: Promote discussion.
>> Jennifer Widom: Okay. But given that we do -- it seems like we have
flexible time, I'd be absolutely happy to take questions during the talk.
That's my preference. So if you want, please, if you want to bring something
up during the talk, do that and we can adjust time as appropriate. So thank
you again. So, when I won the Athena Lecturer Award, I was presented with
the sort of daunting task of giving a retrospective like talk on my research.
And that's -- it is -- it can be a hard thing to do but it's also quite
valuable. And I think maybe I had just been to a leadership program at
Stanford where maybe the only thing I learned -- well, maybe a couple things,
but one thing I learned is that all good things come in threes. And so, the
combination of that leadership skill and needing to do a retrospective talk,
I said, let me just pick my three favorite results over the history of my
research. And so that's what I did and that's what I'm going to do today is
tell you about those three favorite results and I'll have a particular way of
telling you about them as you'll see. Before I start, though, I think it's
extremely important to say what favorite means because favorite can mean a
whole bunch of things. So first of all, the favorite results are, it turns
out, not going to be the ones that have won best paper awards or test of time
awards even, although the latter would probably be more likely to be a fair
result. They're not necessarily going to be the results that have the most
influence. Although one of them I think falls in that category. So they're
really personal philosophical favorites. And part of what I'm going to try
to get across today in addition to explaining the results themselves is why
they are my favorite results. So I'm not going to spring any surprises on
you. I'm going to tell you right now what the three results are and then
we'll go into talking about them. So the first result is DataGuides. And
that is in the area of semistructured data, as Eric just talked about, and
that was around 1997. So almost 20 years ago now. Second favorite result is
in the area of data streams and it's the CQL, continuous query language,
around 2002. And the third result I'm going to tell you about is ULDBs or
uncertain lineage databases, which is really sort of a data model or
representation scheme in the area of uncertain data. And that was ten years
ago. So maybe, in the future, if I went back for some threes, there would be
something after that period, but at this point when I look back, those are my
favorites. Okay. Now, let me digress momentarily and tell you about the
Stanford InfoLab patented five-part paper introduction. Arvind, do you
remember the five parts? I won't put you on the spot.
[Laughter]
>> Jennifer Widom: But at Stanford, we actually hammer home to our students
a way of thinking about introducing a topic that they're writing about or
even talking about and we even force them to structure the introduction to
their papers this way initially, five paragraphs. After that first draft,
things can get mushed around. But we found it very valuable. And in fact,
guaranteed paper acceptance if you follow this five-part, patented paper
introduction. Okay. So the first thing when explaining a result, and I'm
saying this of course because I'm going to explain my results this way. The
first thing you have to answer is what is the problem. Amazing to me how
many people tell you about their work without actually telling you what
problem they're trying to solve. Okay. Second, why is it an important
problem? Third, why is it a hard problem? Really, you want all these things
to be true or it's not going to be that interesting. Why has it not been
solved already, or at least, what's the landscape of the previous work? And
finally, what is our solution? Okay. And for today, I'm going to add a
number six which is why is it a favorite? All right. So, we're going to
launch right now into the first favorite result which is DataGuides and I'm
going to start by giving you the context before I can go into the five parts.
So it's around 1997 and we have a project called Lore. Lore stands for
lightweight object repository. We were working on a project on data
integration where we were -- who hasn't worked on data integration? Where
we're trying to bring together data from multiple sources and we defined a
lightweight data model to use for exchanging data and then I decided that it
would be interesting to separately build a traditional database system to
manage that particular data. You don't need to read any of this. The
student who was involved in DataGuides is Roy Goldman. And I'm going to for
each result identify the people who were involved. Okay. So we're building
the system for semistructured data. In 1995, when we started the data
integration project, we invented or I wouldn't even say invented,
crystallized this idea of what we were calling the object exchange model,
which was this lightweight semistructured data model and we used directed
labeled graphs. And here's a picture of an example database in this
lightweight data exchange model. So, this is a directed labeled graph, by
the way. I grabbed this picture from the actual papers at the time. All of
the figures are going to come from the papers at the time. So this is a tiny
database of restaurants and bars. We can see that this restaurant has name
entree phone. So on. This one has not quite the same data. This is a bar
that only has a name. And you can see that this data is self-describing and
that the labels are in with the data down here. We have values
[indiscernible]. There should be nothing too exciting or surprising about
this. This just happened to be what we were using to have a very flexible
semistructured model. Now, shortly after that XML came out, so I don't want
to claim there's anything unique about our model, here's exactly the same
data in XML. And since then, JSON has become more popular. There is exactly
the same data in JSON. Everything I'm saying about DataGuides could apply to
XML and JSON. And we actually converted the project to XML at some point
along the way. But because I want to be true to history, I'm going to use
the object exchange model for this talk. Okay. So now let's go into the
five parts. First of all, what is the problem that DataGuides was solving?
It was the problem that semistructured data does not have a fixed schema.
Well, I would say that's pretty obvious. That's the whole point of
semistructured data is that you don't have a fixed schema. In fact, the data
at that time was called schema-less or self-describing. So that's the
problem that we were trying to solve, the fact that we had no schema. Now,
why is that an important problem? Because database management systems rely
on a schema for all kinds of things. So we started building this database
system for this self-describing semistructured data and we immediately saw
that lacking a schema was a big problem. So what do database systems rely on
a schema for? They rely on the schema to store statistics. You need to know
what kind of data you have to store statistics about the data. To build
indexes, you need a schema. You need a schema to check whether its portions
of the data, the attributes in the query are actually in the data. So to
check if you're working on SQL and you want to check the validity of a query,
you have to check that everything that's mentioned in the query is actually
in the data and you do that using the schema. Even a simple thing like
taking a query that says select star where star means pick all the
attributes, you need the schema to understand what those attributes are. If
you want to build an interface to browse a database, what do you do? You
build that based on the schema so that you know what the pieces are and many
other things. All right. So I hope I've convinced you schemas are very
useful in databases. So we need a schema or something like a schema in this
world of schema-less data or self-describing data. So, why is it hard do
that? Well, first of all, we have to define what a schema means. Second of
all, it turns out that what we really need to do in this case is infer the
schema from the data. So we need algorithms to do that. And furthermore, in
traditional databases, schemas can change but that's sort of a big hiccup
when the schema changes, whereas in this world of semistructured
self-describing data, the schema may change as rapidly as the data changes.
So you need some way of incrementally updating this schema regularly and not
too expensively. And finally, the schema can be as large as the data. So if
you think about it, in semistructured data, if the data is completely
irregular, if there's nothing uniform across the data, then the data is the
schema. On the other end of a spectrum, in a relational database, for
example, the schema is just typically like the width of the tables. Okay.
So lastly, or second to lastly, before our solution, why has it not been
solved already? Now, we're winding back of course to 1997: why had it not
been solved at that time? Actually, basically because nobody else had tried to
build a traditional database system for semistructured data. People were
really using it for data exchange. And for data exchange, it was useful but
not so necessary to have a schema. Okay. So that's where we were. Now,
let's talk about our solution. So our solution was to take this
semistructured database and provide what we call a structural summary, which
is what we call DataGuides. So we made a formal definition of this
structural summary. We have algorithms for inferring it from the data and
updating it. We have the way we use it for indexing, for statistics, and for
query processing. And also for the user interface. Now I'm just going to
give you -- and obviously I could give a whole talk on this by itself. I'm
going to give you some flavor though of each of these components. And again,
questions anytime if you would like. All right. So, as a reminder, here is
the database that we're working with. And now let me give you the formal
definition of the DataGuide and then I'll show you the one for that database.
It actually turns out to be relatively simple which is I think, you know,
good ideas in the end usually are relatively simple in retrospect. So we
have three requirements for the DataGuide or the structural summary. One is
that it needs to be represented in the same database -- in the same data
model, this object exchange model. So in databases typically, you want to
represent the schema in the data model; that turns out to be very helpful in all
kinds of ways. So we want to represent that DataGuide in the same data
model. Second, every label path in the database, so every path that you can
traverse in that labeled graph, has to appear exactly one time in the
DataGuide. So if we have restaurant followed by name, then we have to have a
restaurant name path in the DataGuide. And furthermore, there are no extraneous
paths, so every path in the DataGuide corresponds to a path in
the database. Okay? Pretty straightforward as it turns out. So here's our
example. And here is the DataGuide for that example. Okay? And you can
confirm this obviously is in the same data model. Every unique path in the
database appears exactly once in the DataGuide and there's no extraneous
paths. Every path in the DataGuide appears in the database. Now, someone
might ask about cycles. Is that what you're going to ask about? Not cycles.
>> [Indiscernible] in the example we have multiple [indiscernible].
>> Jennifer Widom: Right.
>> So be sure you have ones?
>> Jennifer Widom: Correct. And that's by definition. So by definition, we
want every path that is -- every path in the database to appear exactly one
time in the DataGuide and every path in the DataGuide has to be in the
database. So the definition is really easy. Dealing with it is not as easy.
The other thing I want to mention is about cycles. So we did allow cycles in
our data model. They weren't used that commonly. But that did make things
fairly tricky but it still worked. So when you have cycles in your database,
then you have infinitely many paths, and to capture this definition, a cycle
in the database will turn into a cycle in the DataGuide. So you'll have
infinitely many paths in the database, infinitely many paths in the
DataGuide. And in the DataGuide, you'll have each of those infinitely many
paths appearing exactly once. Okay. All good? All right.
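A minimal sketch, in Python, of the definition just given, assuming a toy graph (this is not the Lore implementation; the object IDs and labels are illustrative). Each DataGuide node is the set of database objects reachable by one label path, which is essentially the NFA-to-DFA subset construction mentioned later in the talk:

    from collections import defaultdict

    # A toy OEM database as a directed labeled graph: oid -> [(label, child oid)].
    db = {
        1: [("restaurant", 2), ("restaurant", 3), ("bar", 4)],
        2: [("name", 5), ("entree", 6), ("phone", 7)],
        3: [("name", 8), ("entree", 9)],
        4: [("name", 10)],
        5: [], 6: [], 7: [], 8: [], 9: [], 10: [],
    }

    def build_dataguide(db, root):
        """Each DataGuide node is the set of objects reachable by one label path;
        grouping targets this way is the NFA-to-DFA subset construction."""
        guide = {}                     # node (frozenset of oids) -> {label: node}
        start = frozenset([root])
        work = [start]
        while work:
            node = work.pop()
            if node in guide:          # already expanded (this also stops cycles)
                continue
            targets = defaultdict(set)
            for oid in node:
                for label, child in db[oid]:
                    targets[label].add(child)
            guide[node] = {lab: frozenset(kids) for lab, kids in targets.items()}
            work.extend(guide[node].values())
        return start, guide

    root, guide = build_dataguide(db, 1)
    # Every label path in the database appears exactly once in the guide, and each
    # guide node doubles as a path index: restaurant.name reaches objects 5 and 8.
    node = guide[root]["restaurant"]
    print(sorted(guide[node]["name"]))     # [5, 8]

Because already-seen target sets are not expanded again, the same construction terminates on a cyclic database too, and a cycle in the data shows up as a cycle in the guide.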
>> Could I ask a follow-up question?
>> Jennifer Widom: Yeah.
>> So you --
>> So you're not going to represent that it's unique or it's countable or
anything like that in the DataGuide. You're going to use extra information
represent stuff like that?
>> Jennifer Widom: That's correct, yeah. Going to show that in a moment.
>> Great.
>> Jennifer Widom: Yeah. Anything else? All right. Okay. So I said that
the DataGuide is our schema and it's used for the type of things that schemas
are used for in databases and now I'm going to show you a few of those
things. First of all, we use it for indexing and for statistics. Okay?
So to do that, what we do is in the DataGuide, we store at every node the
object IDs of the corresponding objects in the database. This is effectively
a path index for every path you have in the DataGuide. In the database,
sorry. So for example, in our original database, there were three elements
that were restaurant entrees and they were object ID 6, 10, and 11. So if we
have a query that asks for restaurant dot entree, that's how we did dots for
our path, then what we do is we don't explore the whole database. We go
straight to the DataGuide. They go down here, this gives us our objects and
then we can fetch the objects. So this is a traditional index. Of course
you have to mix index accesses with other types of evaluation in a typical
query processing sense but this is how we used it as a path index. We also
kept the object IDs at the interior nodes as well so here are the objects
that are the restaurants. Okay? So the other thing that we stored in the
DataGuide was we stored sample values and this is really for the user
interface. This was to give users a sense of the type of values that were in
the database. So for example, here, we store a couple of names of
restaurants. Okay? Now, we use the DataGuide quite a bit for query
processing. What we decided to do for our query language and I'm not going
to talk about the query language at all today, we decided not to have the
query language generate errors when it mentioned things that didn't exist
but rather generate warnings because we found in semistructured data, people
preferred to have exploration or things would change over time so that's sort
of beside the point. In order to do our warning system, before we actually
executed a query, we would take the query and we would
check it against the DataGuide and if we knew there was nothing that was
going to match, then we would return a warning for that query and we wouldn't
bother to explore the whole database. Okay? Much more interesting was that
we used the DataGuide to do expansion of the path expressions that formed
the core of the query language. And again, I'm not going to go into the
query language in detail but you can imagine there were like regular
expressions that would be matched to the paths in the database. So as an
example here, if we wrote the query select star followed by phone or address,
this star would match a path of any length, any labels. And so what we could
do is instead of exploring the entire database, looking for any path that
eventually had a phone or address, we would use the DataGuide. We would find
the paths that had a phone or address and we would change the star to the
actual paths in the database, so we put in the actual labels. So here
particularly, we would know only restaurants have phones. Obviously, you
know, this doesn't -- it's not a big deep theorem, but in a large or deep database
doing this could save a huge amount of time. So we use the DataGuide for
that purpose also. Again, that's sort of analogous to the select star and
relational queries but much more complicated here.
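A sketch of the two uses just described, assuming the DataGuide's nodes are stored with their object IDs (the oids 6, 10, and 11 for restaurant.entree echo the talk; the rest of the data is made up):

    # The DataGuide as a path index, keyed by label path.
    guide_index = {
        ("restaurant",): {1, 2, 3},
        ("restaurant", "name"): {4, 5, 7},
        ("restaurant", "entree"): {6, 10, 11},
        ("restaurant", "phone"): {8},
        ("bar",): {9},
        ("bar", "name"): {12},
    }

    def lookup(path):
        """Path-index access: go straight to the DataGuide node and fetch its
        object IDs, instead of exploring the whole database."""
        return guide_index.get(tuple(path), set())

    def expand(final_labels):
        """Rewrite a 'select * ending in phone or address' style pattern into the
        concrete label paths that actually exist, using only the guide."""
        return [p for p in guide_index if p[-1] in final_labels]

    print(lookup(["restaurant", "entree"]))   # {6, 10, 11} (set order may vary)
    print(expand({"phone", "address"}))       # [('restaurant', 'phone')]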
>> [Indiscernible].
>> Jennifer Widom:
Pardon?
>> [Indiscernible] to the structure.
>> Jennifer Widom: That's correct, yes. Yeah, we actually didn't save plans
anyway. So right, yeah. Yeah. Okay. All right. The next thing I'm going
to do is go through a few slides showing our DataGuide browser. Now, this
was a big deal at the time and people really liked this. But you have to
remember it was 1997. Okay? Well, a lot of you don't remember 1997.
>> [Indiscernible].
>> Jennifer Widom: Okay. Not HCI people. Okay.
>> [Indiscernible] in the back.
>> Jennifer Widom: Okay. So it's 1997, database people, this was cool stuff.
[Laughter]
>> Jennifer Widom: So we're going to switch now to the database we used for
demos was a database about our database group. Okay. That seemed to be a
good size and understandable. So again, all of these I captured from the
actual papers. It doesn't still run. So this was the browser that you got
when you opened one of these lower databases and the browser was the
DataGuide. So this is actually the DataGuide for that particular database.
So we had group members, projects and publications, the group members had all
of these. The projects actually pointed to group members. So this one did
have cycles in it. That was one of the things we liked about it that members
pointed to projects, projects to members, publication and so forth. So it
was a fairly interconnected database and it worked pretty well. So you go
here and you can open and close and you’re basically exploring this directed
label graph. So if you chose to look at a particular path, so if you clicked
here, this is dbgroup.groupmember.originalhome, it would pop open this window
which would give you these sample values that were stored in the DataGuide
and then it also allowed you to start constructing queries through this. So
through that DataGuide, you could actually form queries. Again, this was cool at
the time. You could add conditions on that path. You could select that path
for the result. So here, we're blowing up. Now we're forming a query
through the DataGuide where we have added a condition on the original home
and on the years at Stanford and on the positions so these are all predicates
we added and then this yellow says we're selecting that and that would launch
a query and give you the result. Okay. All right. So that's what it
looked like at the time. So now let's talk about why is it hard. Okay. So
I didn't do these quite in order. So why was this a hard problem? Well, first
of all, the DataGuide isn't unique. I don't know if anybody thought of that.
This is actually one of the most interesting things. It's not unique.
That's fine. There's a definition of a minimal DataGuide but it turns out
that wasn't the best one. The best one was something we defined called the
strong DataGuide, which wasn't minimal but turned out to be the best for the
indexing purpose. And I'm not going to go into the formal definition but
that was kind of interesting. Okay. Second of all, the DataGuide isn't
small. So if you think about it, if there's no common structure, as I said
before, in the graph, then the database is the DataGuide. All it's doing is
compressing common structure. And so the DataGuide could be pretty big. We
introduced what was called an approximate DataGuide and for that, we relaxed
the third condition. We allowed there to be paths in the DataGuide that
weren't in the database. Not tragic. You don't want to miss paths in the
DataGuide but it was okay to kind of over shoot. Okay? And third, it turns
out that constructing the DataGuide from the database is pretty similar to
NFA to DFA. So it can be expensive. Trees were easy, DAGs were harder.
Cyclic graphs even harder. And similarly for incremental maintenance. It could
actually be an exponential algorithm. Okay. So last question. Why was it a
favorite? I -- this is also sort of the fun part of preparing this talk was
to think about why this was the favorite. And for this one, I think I can
articulate it pretty clearly. For the work that we did, we had to solve
challenges of every type. So we had to develop the foundations. We had to
develop algorithms. We implemented it all the way to the user interface.
Second, it had applications of every type. So we had to worry about storage,
we had to use it for storage structures, we used it for query processing we
used it for the user interface. So it really cut through the whole system.
And lastly, I do think the name. So I remember sitting around -- actually
still, you know how you remember certain things? I still remember sitting
around with that student Roy Goldman thinking about what name we were going
to use. And we were calling it representative objects. And I kind of wonder
if we'd called it representative objects if it would be as popular as it is
today. But he said no, we need something snazzy, let's go with DataGuides.
[Laughter]
>> Jennifer Widom: So I'm going to say actually, I'm going to say this is
really the result that, for me, has had the most tenacity and longevity. So
I have a habit with Roy, whenever I hear -- so he's graduated ages and ages
ago. Whenever I come across somebody referencing DataGuides or using
DataGuides, I send him an e-mail. And it's still pretty common. People are
still using DataGuides. I can't believe it. So it's still -- it's great.
So really, a big favorite. And again, I wouldn't completely discount the
name. How many of you have used DataGuides? Anybody use -- all right.
Well, we got one.
[Laughter]
>> Jennifer Widom: So that's the end of this one, so we can --
>> Historically, this reminded me of the X spot on everything. So X spot sort of was done --
>> Yeah, X spot can use DataGuides. Yeah, absolutely. So we, I mean, we
actually converted the project to XML and, yeah, that's where DataGuides are
still being used is in -- yeah. You can make a DataGuide for JSON also. No
problem. Anything that -- yes, exactly. Anything that's self-describing
semistructured data needs a DataGuide really, I think. Or something like it.
>> [Indiscernible].
>> Jennifer Widom: Yeah. Yeah. Okay. Yes?
>> I was just curious if you could put this in context with the state of the
Internet in '97. I mean, this predates Internet search like.
>> Probably right around the time that Internet search --
>> Jennifer Widom: Wait. I thought that came -- I thought 1993 was sort of
when the browsers first came out. I remember --
>> [Indiscernible].
>> [Indiscernible].
>> Netscape was just coming out.
>> Yeah. Google [indiscernible]. Google -- [indiscernible].
>> LightPost was '94.
>> Jennifer Widom: LightPost was '94. We're all aging ourselves.
[Laughter]
>> Jennifer Widom: How many people were still in high school in '94?
>> [Indiscernible]. [Indiscernible] but at that point, [indiscernible]
because, you know a lot of optimized index, you know, [indiscernible] index
[indiscernible] optimization based on structure.
>> Jennifer Widom: So, but this, I would say DataGuides don't have too much
to do with the Internet actually. I mean, to tell you the truth, they're
really about semistructured data, data exchange. Though another thing I
remember very well is when I got a phone call from a random person, I still
don't know who it was, in my office, who said I saw your work on Lore. Have
you heard of this thing called XML? I actually hadn't heard of it at the
time, and I still don't know who he was and why he called. But I looked into
it and --
>> [Indiscernible].
>> [Indiscernible].
>> Jennifer Widom: Pardon?
>> He called.
>> Jennifer Widom: He called, but I mean, I get people who call and say that
they have solved P equals NP, so I don't even --
[Laughter]
>> Jennifer Widom: Well, he called and didn't -- I see, yes.
>> That was around the time when persistent object databases were in vogue.
>> Jennifer Widom: That's also true.
>> People were kind of going, oh, maybe it's not just relational databases as
we know them from transactional processing and there's all this debate
between the programming language community about persistent objects versus the
database community about [indiscernible].
>> Jennifer Widom: That's true.
>> We [indiscernible] persistence but we don't know what -- how to do
objects.
>> Jennifer Widom: Right.
>> So this debate going on, because I was on [indiscernible].
>> Jennifer Widom: Programming languages side, yeah.
>> [Indiscernible].
[Laughter]
>> We did lose.
>> [Indiscernible].
>> [Indiscernible].
[Laughter]
>> The game's not over. In fact, the languages stuff for persistence on
objects, now that we're going to have persistent memory essentially in a
year, these people don't seem to have read the papers.
>> Jennifer Widom: Well, okay.
[Laughter]
>> Jennifer Widom: You are energizing debate. Whether it's relevant to
the -- well, people were open to new database models. That I -- and
understanding that they needed them. That's true. This one was not too
related to object database because any way you looked at that, that was
usually strongly typed and this is like the opposite.
>> [Indiscernible].
>> Jennifer Widom: Right. Yeah. But I think that was a time when people
were realizing relational databases weren't going to solve everything.
People are still grappling with whether that's true or not 20 years later.
But anyway. Okay. Anything else on DataGuides? All right. Number two.
CQL, the continuous query language. So now we're going to wind forward five
years and it's 2002 and we're working on a project called the Stanford stream
data manager, which we called STREAM. And the project was --
[Laughter]
>> You can always find an acronym.
>> Jennifer Widom: You can always find an acronym. Yes, you can. So in
this project, we were again building a data base system for a new type of
data which is data stream. So instead of your data sitting on disk and
you're asking queries about it, your data is streaming in rapidly and you're
queries tend to sit there and watch the data stream and it stream out their
answers and the students who were working on this, Arvind made the slides
here. I don't know if you knew you were there. Where's Arvind? And Arvind
and Shivnath Babu who is on the faculty at Duke. So they're the two who
worked on the query language. Okay. So now, let me start with what is the
problem and so on. So what is the problem? We're building a database system
for data streams and we need a declarative query language. Okay. So why is
that important? Well, I would argue that a declarative query language is a
key component of any database system. I still think declarative query
languages and transaction processing are the two really key things about a
database system and there's lots of other stuff around it but you better have
both of those things I think to have a good database system. Okay. So I'm
going to claim that's fairly obvious. So why was this a hard problem? Well,
it turns out if we want to make a SQL like query language for data streams,
the semantics, what those queries actually mean is surprisingly tricky and I
actually think it has nothing to do with SQL. Whether or not you reuse SQL,
I think the semantics of queries over data streams is hard and I'm going to
give you examples for that. And secondly, the semantics can actually have a
significant effect on the implementation. So I have a pretty firm belief on
figuring out semantics first and then implementing later but there is some
interplay and in the data stream world, small changes in semantics can
make the difference between being able to process your query as each element
comes in and throw the element away versus having to keep all history of all
data. Even a small change. So that's important but I'm not going to cover
that particular aspect of it today. So I'm going to give you an example to
explain why it's hard. Here, we have a -- this is going to be a query that
has one stream and one traditional table. And the stream is just a stream of
page views. So this is going to be a view of a URL and the user ID who
viewed it and now we're going to separately have a table that has the age of
users so obviously this is extremely simple but will serve my purpose. And
what I want to do is find as these page views stream in, the average age of
the viewers for each URL in the last five minutes. Okay? So this is a
standard -- I'm going to show you SQL now. I think even if you don't know
SQL you'll be fine, but this is a pretty standard group by aggregation query,
except we have a stream. All right. So here's a SQL-like query that answers
that question. It says I'm going to take in the from clause -- you always
read the from first -- I'm going to take -- this is the one thing I've added
here, a five-minute window on that views stream. So views is a stream.
Going to look at the last five minutes, okay? And then I have my users table
and I'm going to join on the user ID. Very standard here. Grouped by the
URL and give me the URL and the average age. So I think that should be
readable for everybody. Pretty straightforward. Okay. What's the result of
this query? Is the result a stream? Is it a relation? Is it something
else? I would claim it's not actually obvious what the result of that query
should be. Okay? Though people would write it and not worry about it too
much. And here's a more really specific question about that query. So what
happens if someone's age changes while they're in the five-minute window? So
they already viewed the page and then their age changed. They're still in
the five-minute window. Does that change the result of the query or not?
Okay? So that's just a very specific point I wanted to make here. I'm not
going to tell you the answer just yet. I'm going to just point out that this
is pretty subtle. Okay. So now, let's go into why it hasn't been solved
already and then we'll go into our solution. At the time there were a few
groups building database systems for data streams, I got the sense that the
others didn't seem to worry too much about query semantics. Let me just put
it that way. I have a nit about the database community in general that
there's a lack of worrying about query semantics and I have a whole another
talk on that but I'll spare you that today. So that's where things stood and
we decided to worry about it ourselves. So what is our solution? So we
started to step back and figure out what the best way would be to define --
to make a very precise semantics for streams. And what we decided to do was
rely as much as possible on relational semantics because that, everybody
understands. People understand relational databases. So people know what
relations are and people know what it means to ask a query. If you ask a SQL
query on relations, you get a relation back or relational algebra, all well
understood. So what we decided to do is rely on that and then we have
streams and we have a very well defined way of going from streams to
relations and relations back to streams. We go from streams to relations
based on these window specifications, so when you put a window on a stream,
it turns into a table effectively. And then we have operators just a couple
operators that turn relations into streams. And what was the basis for our
definition. Okay. So let's go back to our query now with that in mind. So
this query now with our new semantics says that this here, this views with
the range is going to turn into a table. Okay? So that's just going to be
the last five minutes as a table. So then this result according to our new
semantics, is a relation. It's a relation because this is a table. Now
we're just doing the join. That relation is updated potentially when time
passes because this table here will change its value when time passes when
new page views occur. Okay? Or when ages change. So when anything changes
that contributes to this, that would be -- that relational result will get
updated. Okay? So clear, maybe not what we want, but clear. Okay. If we
want the result to be a stream, then that was pretty easy too. That says
that we're going to just add this operator we have called stream and what
that operator did would just stream out a new element whenever the result
changed so you can just think of it as a table but whenever there's a change
to the table, we stream out a new element. So all of that is good. The kind
of bad thing was this business with the age. So this, the way our semantics
worked, if someone's age changed after they viewed the page but while they
were still in the window, the result of the query changed. Probably not what
you wanted. Probably you actually wanted to use their age at the time they
viewed the page. Presumably that's wanted. Here's the query to do that.
I'm not going to claim it's wonderful but it works. What do we do? Well, we
take our views stream and we have this window called now which makes just the
latest element into a table. We join that so we are joining, basically we're
joining with the user table at the time the element appears and turning that
into a new stream. So now we're streaming out the views with the ages. Then
it's that stream that we take the five-minute window on and everything works
from there. I'm not going to argue that it's beautiful but at least we have
a well-defined semantics.
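The queries being described, reconstructed here in CQL-style syntax as a sketch: the keyword spellings follow the published CQL papers as best recalled, the CREATE STREAM form for the named intermediate stream is illustrative rather than whatever the slide actually used, and Views(url, userId) and Users(userId, age) are the schemas implied by the example.

    -- Windowed join: [Range 5 Minutes] turns the Views stream into a relation,
    -- so the result is a relation that changes as time passes, as new views
    -- arrive, or as ages change.
    SELECT   V.url, AVG(U.age)
    FROM     Views [Range 5 Minutes] V, Users U
    WHERE    V.userId = U.userId
    GROUP BY V.url;

    -- Same query, streamed: Istream emits an element whenever the relation
    -- above changes.
    SELECT   Istream(V.url, AVG(U.age))
    FROM     Views [Range 5 Minutes] V, Users U
    WHERE    V.userId = U.userId
    GROUP BY V.url;

    -- Freezing each viewer's age at viewing time: join the [Now] window with
    -- Users, stream that out, and only then apply the five-minute window.
    CREATE STREAM ViewAges AS
        SELECT Istream(V.url, U.age)
        FROM   Views [Now] V, Users U
        WHERE  V.userId = U.userId;

    SELECT   url, AVG(age)
    FROM     ViewAges [Range 5 Minutes]
    GROUP BY url;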
>> Eric Horvitz: And it's probably an Einsteinian relativistic version of
this where space and time is part of now.
>> Jennifer Widom: Well, sure. Yeah. Something like that.
[Laughter]
>> Jennifer Widom: Okay. So, just going to summarize now. Summarize what
I've said. So we have a -- what we defined is a precise semantics, what we
call an abstract semantics was that diagram I showed based on the fact that
you have a relational semantics, the fact that you have these specific
operators that go from streams to relations and relations back to streams.
We had a concrete implementation based on SQL with the windowing constructs.
We also added a sampling construct which turned out to be very -- the stream,
data stream query languages or data stream applications often like to do
sampling so we threw that into the query language. Some of the most
interesting work actually was in query equivalences. So it was pretty
interesting, you can actually analyze a query for example that would use an
infinite, arbitrarily growing window in the query and you could analyze the
query and see that you could change it to one of these now windows and there
were a whole bunch of other optimizations. I thought that was one of the
most enjoyable parts of the work. Okay. We had a guiding principle for the
work that drove what we did. Easy queries should be easy to write. Simple
queries should do what you expect. And I think we achieved that. What it
didn't say anything about was the hard or the complex queries. So the hard
queries were not always easy to write and the complex ones were not always easy to
understand, I would say. I also wanted to mention briefly about time and
ordering. You brought this up slightly. This was the issue of streams
coming in out of order or there being large gaps in the timestamp or time
passing and not knowing if you might get a stream element from a long time in
the past was a big problem in data stream systems. Some of the other
projects chose to deal with that problem in the query language itself. We
chose to not do that, which helped. We chose to assume there was a lower
layer that was buffering the streams and delivering well-behaved streams to
the query processor. So we would assume there was a bounded window beyond
which you would never get elements coming in late, right, and things like
that. We assume that they would be within a bounded amount of orderedness
and so on. And that was quite important, I think, to the work. Okay. So,
why is it a favorite? Well, first of all, I think that query language design
as a field is highly underrated. It's difficult to publish in. I have my
favorite story. Some of you might have heard before. We're going to go back
to the Lore project and the query language that we developed for that
project, which was called Lorel. And we could not publish our Lorel paper for
the life of us. We tried everywhere, nobody wanted it. Finally, one of my
coauthors, Serge Abiteboul, who was visiting Stanford for a couple years at
the time, said, well, I was invited to be the -- to contribute a paper to a
new journal called the Journal of Digital Libraries, Volume I, No. 1. Maybe
we should just put it in there. And we said all right. It was the only
volume ever, number ever of that journal.
[Laughter].
>> Jennifer Widom: But, I was very happy that for a rather significant
length of time, like a couple of years, that paper was in the top 100 cited
papers of computer science in that really defunct journal. So that tells you
not to worry if your things keep getting rejected. They can still have
impact. We had some difficulty publishing this work, but -- as Arvind is
nodding. But it did get some attention. And we had a little easier time.
So I think people were recognizing that. The need for semantics, as I said,
is often ignored by the database people. There are some really sorry stories
about the early days of SQL, simple queries where two different systems would
get different answers on -- I mean, it's really amazing.
>> It's not actually the early days.
[Laughter]
[Indiscernible].
>> [Indiscernible] SQL 7.0. Jim Gray and his lab, a San Francisco lab, have
[indiscernible] DB2 and SQL Server. Don Slutz was running the project to
see whether they answered the same, gave the same results. [Indiscernible]
scheme, right?
>> Jennifer Widom: Right.
>> And there are serious [indiscernible].
>> Jennifer Widom: Okay. I could go off on this tangent here. There are
still queries today where different systems will give different answers.
Even worse, there is a type of query where some systems can give you a
different answer on different days without you changing the data. It has --
very briefly, if you do a group by query, and you add to the select clause an
attribute that's not in your group by clause and that's not an aggregation,
some systems will choose a random value from your group to put in the result
of that query. And that random value could change if the database gets
reorganized. Yes, I teach introduction to databases so I like to point this
out to the students. It's pretty shocking actually. It's -- right.
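A minimal illustration of the pitfall just described; the table and column names are made up. The query groups by dept but also selects city, which is neither grouped nor aggregated:

    -- Strict systems reject this query outright.  Permissive ones pick some
    -- value of city from each dept group, and that pick can change if the
    -- data is physically reorganized, even though nothing logical changed.
    SELECT dept, city, COUNT(*)
    FROM   employees
    GROUP  BY dept;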
>> Select star is [indiscernible].
>> No, select star is a [indiscernible].
[Laughter]
>> It's not bound. So you could have [indiscernible] assumes 15 attributes
right, [indiscernible]?
>> Jennifer Widom: Yes.
>> Then somebody adds three columns.
>> Jennifer Widom: That's true. Okay. But if you change the schema, I'm
slight -- I mean, that's not good. But this -- this is an example where you
don't change anything. Right? You don't change the schema, you don't change
the data. One system gives -- good systems say you can't write that query.
The bad systems give you a random answer and that random answer can be
different at different times. Yeah. It's bad.
>> [Indiscernible] query.
>> Jennifer Widom: Something like -- well, yes.
>> How do you explain it away?
>> Jennifer Widom: Sure. Right. Yes. Anyway, so I think in this case,
people at least appreciated that there were some challenges and subtleties in
the semantics. Lastly, I would, I guess, say not the name for sure. Although it
was a fine name, it didn't quite have the oomph of DataGuides. Okay. So
that's number two. Any more discussion on that one?
>> Eric Horvitz: Yeah. What's your reflection on where stream processing has
gone over the years since the result?
>> Jennifer Widom: People are still working on it and they haven't like
solidified it. It's surprising to me that there's no standard. Right.
And -- yeah. It's still ongoing.
>> Eric Horvitz: It actually can be quite important even for these AI
systems that --
>> Jennifer Widom: Absolutely.
>> Eric Horvitz: -- multisensory streams, very fast paced.
>> Jennifer Widom: Right. I mean, yes. And people keep building new
systems and they keep doing different things. I mean, I guess if there was a
real need for a standard, it would have emerged, but yeah.
>> I was a little surprised that you mapped from the stream world to the
relational world just using time windows. A number of other things having to
do with order of events.
>> Jennifer Widom: So we had time windows and we had number of events. So
you could either have number of rows or number of tuples or you could have
time. You could -- and there has been a different line of work on very rich
windowing constructs. And so, our -- in fact, our party line was any window
in construct is fine, the abstract semantics would take any. Our concrete
implementation just had those two types. Yeah.
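For reference, the two concrete window flavors just mentioned, in the same CQL-style syntax (spellings approximate):

    -- Tuple-based window: the last 100 elements instead of the last 5 minutes.
    SELECT   Istream(V.url, AVG(U.age))
    FROM     Views [Rows 100] V, Users U
    WHERE    V.userId = U.userId
    GROUP BY V.url;
    -- Other windows that come up in the talk: [Now] (just the newest elements)
    -- and [Unbounded] (the entire history of the stream so far), which is the
    -- arbitrarily growing window mentioned above.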
>> Well, analogous to that question is you have a pretty fixed semantics for
relational tables but it seems to me you implicitly chose a semantics or
streams so tight you could do this mapping on it because there are many
semantics that you also associate with streams.
>> Jennifer Widom: That's correct. Yeah. And there's one actually
significant reduction of expressiveness that nobody -- that happened which is
that when we switched from streams to relations using windows, we lost the
ordering.
>> [Indiscernible].
>> Jennifer Widom: Yes. And we knew we were doing that. But we did it anyway.
>> And that seems a little against the whole behavioral property that one
would associate with streams. Just to be honest, right?
>> Jennifer Widom: Yes. Yes. Though -- right. So I agree with that --
>> Unless you took this abstract view it's just about windows elements.
>> Jennifer Widom: Yeah. That's right. Yeah. It was a conscious decision
partly to keep things simple. And there were ways to overcome it, but yeah.
Okay. Number three, ULDBs, uncertain lineage databases. So now, we're
winding forward to 2006 and it's a project called Trio. Trio was a system
for integrated management of data uncertainty and lineage. So that was why
it was called Trio, for three things. This was our logo. Anybody see
anything unusual about the logo?
>> [Indiscernible].
>> Jennifer Widom: The wheels cannot actually turn.
[Laughter]
>> Jennifer Widom: You have to separate one of them to make them turn. But I
think we more or less made them turn anyway. Okay. So --
>> Did you realize that problem after the logo was created?
>> Jennifer Widom: Yes. Yeah, we did. But I liked it anyway. It tells a
good story. Right? Okay. So the people who were involved specifically in
the -- so what I'm going to talk about is ULDBs, which is the data model
or representation scheme for the Trio project and the people who were
involved in that particular part of it were Omar Benjelloun, who was a post
doc at the time, my Ph.D. student Anish Das Sarma, and Alon Halevy who was
visiting Stanford at that time. By the way, this was very briefly the iPod
slogan. That was back when people -- they introduced the shuffle I think and
people didn't like it and there was a big billboard in San Francisco that
said enjoy uncertainty. So we grabbed it. Okay. All right. So what's the
problem? Well, once again, we're building a new kind of database system.
This is what I like to do actually. And now it's for uncertain data and I'll
explain what I mean by that. And we need a data model. Okay? So why is it
important? Well I argue a well chosen data model is important for anything
you're doing in data management at all and I would say anything you're doing
in data at all, you better understand what your data looks like or what the
possibilities are for your data. I do want to be very clear. I don't know
that I need to with this audience. I'm not talking about an AI model or
anything like that. I am talking about how you represent your data. What
it's structured like. So the first part of the talk, I was talking about
those directed labeled graphs, the second part I was talking about data
streams. Now I'm talking about uncertain data but not in the AI sense.
Okay. So why is it hard? What we're going to see is that developing this
data model or representation scheme for uncertain data, we come quickly to a
tension between having an understandable model, one you can look at and know
what it's talking about, and one that's expressive enough and I'm going to
give a very concrete example for that. So here comes the example. This is a
database for solving crimes. So we're going to have -- we're going to have
witnesses and drivers so there was a crime -- there was a crime committed.
There were people driving cars near the crime. People who owned cars and
witnesses who might have seen cars. So specifically, we're going to have two
relations, the saw relation where a witness might have seen a car at the
scene of the crime, okay? I'll get to some real data in a moment. And
people who might drive particular cars. Okay? So these will look like
regular tuples but we'll see what I mean here with the uncertainty. So if we
want to generate suspects, we just do a relational join. If we wanted to
generate a suspect for the crime, we find people who might drive a car that
might have been seen by a witness at the crime and again, I'm going to
explain all this in detail. All right. So let me just back up.
>> Are there no pedestrians in this?
>> Jennifer Widom: No, this was all about driving. There's no pedestrians.
[Laughter]
>> Jennifer Widom: I don't know what the crime was, but it was committed in
a -- well, maybe they jumped out and robbed a bank or something like that.
Yes. Okay. Again, contrived to be the simplest possible example that brings
out the important points. Okay. So let me back up and talk about what
people agreed about for uncertain databases. So pretty much everyone agreed
that abstractly, an uncertain database is a representation of a set of
possible certain databases. Maybe arbitrarily large set. Okay? Those are
often called possible instances. So I'll get to this in a moment but in our
example, we could have that Kathy saw a Honda or a Mazda, so there were two
possibilities. Kathy saw a Honda, Kathy saw a Mazda. Amy might have seen an
Acura or maybe she didn't see one. Okay? We have a Honda that's driven
by Billy or Frank. Concretely, we're going to represent these as alternative
values like Kathy saw a Honda or a Mazda and then we're going to have these
question marks that say that values can be either present or absent. Now, in
the Trio project, we also had confidence values or probability so that would
be more in the probabilistic data sense but I don't need those to get my
point across in this talk. So we're not going to have them today. Okay. So
here's the very concrete representation of what I described. So these are
two tables in the uncertain database world. The first table, the saw table
says Kathy saw a Honda or Kathy saw a Mazda. So this tuple has one of two
possible values. This says Amy might have seen an Acura, but that question
mark says present or absent. So this table has four possible instances. Two
for the first tuple and two for the second independently. Okay? Over here,
we have a Honda that's driven by either Billy or Frank so two possible
instances there. So this uncertain database has a total of eight possible
certain databases. All right? Yes?
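As a sketch, the two tables just described might be drawn like this, with '||' separating a tuple's alternatives and '?' marking a maybe-tuple (tuple IDs 11 and 21 reappear in the lineage discussion below; 12 is just an illustrative label):

    Saw(witness, car)                        Drives(person, car)
    11: (Kathy, Honda) || (Kathy, Mazda)     21: (Billy, Honda) || (Frank, Honda)
    12: (Amy, Acura) ?

Two alternatives for tuple 11, times present-or-absent for 12, times two alternatives for 21, gives the eight possible certain databases.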
>> Is the semantics that Amy didn't see a Mazda?
>> Jennifer Widom: Yes. Well, no. This -- and it doesn't say anything about
what people didn't see.
>> So then that statement has no -- what does it say?
>> Jennifer Widom: This says that Amy may have seen an Acura. One of the
possible instances -- in one of the possible -- well, in half the possible
instances --
>> [Indiscernible] maybe probable models of the [indiscernible].
>> Jennifer Widom: So in half of the possible databases, Amy saw an Acura.
Right. Yeah. Doesn't say anything about the absence, though. Okay. So why
is -- so why is our problem hard? What's wrong with this model? Well, it
turns out that the simple model is not closed and what does closure mean?
Closure means that I have a model or representation scheme and that when I
run a query on it, the answer can be represented in the same scheme. And
that's considered a no-brainer for databases. You want that to be true.
When I run a query on these uncertain databases, I want to be able to give
you the answer as one of these uncertain databases. Pretty important. This
model is not closed and I'll show you that now and I'm going to have a quiz
for you so everybody get ready. Okay. So I've expanded my database now. In
addition to these, I have a guy, Jimmy, who drives a Toyota or a Mazda and I
have definitely that Hank drives a Honda. Anybody notice anything about my
choice of data? The men are the criminals?
>> Oh, I was just going to say, the women are on the saw and the --
>> Jennifer Widom: The women are the witnesses, the men are the criminals,
just like real life.
[Laughter]
>> Jennifer Widom: It helped us keep all our data straight.
[Laughter]
>> Jennifer Widom: Okay. All right.
>> You don't have interesting cars.
>> Jennifer Widom: Don't have interesting cars, okay.
[Laughter]
>> Jennifer Widom: That's also true.
>> They all have Japanese cars.
>> Jennifer Widom: That's also true. I should change these to Tesla.
>> [Indiscernible].
>> Jennifer Widom: All right. I'll put a Tesla in here. Okay. Let's
run our relational join on these two tables to get the answer to our query.
All right. When we do it, here's what we get. Okay? And this is where your
quiz is coming in. We did the join on these and we -- what we get is that
Billy or Frank may be suspects. Jimmy might be, Hank might be. All right.
But, this doesn't capture the correct possible instances in the result and
I'm going to ask you why. And I'll just tell you that I gave this example in
a talk at the ACM India conference about three weeks ago to a thousand eager
undergraduates and one of them was so excited when he jumped up and -- I just
sat there and waited, and then one of them jumped up and was so excited that
he saw the answer, and he got it right. Does anybody see why this doesn't
capture the right instances in the result? Now you're under pressure. Yes?
>> Couldn't the suspect also be none of the above because if Amy was right
that she saw the Acura and so someone -- no one is driving -- so there could
be a suspect who is not yet in your database.
>> Jennifer Widom: That's sort of coming to the same issue of absence of
data. We're kind of using this closed world. So that's not the problem.
Yes?
>> Why is there still Billy or Frank? Shouldn't there be four?
>> Jennifer Widom: Well, it's Billy or Frank because one of them drove that
Honda that might have been seen by Kathy. Right? So if Kathy saw a Honda,
then Billy or Frank could be a suspect.
>> So it's [indiscernible].
>> Jennifer Widom: Yeah.
>> You can't have both row 1 and 2?
>> Jennifer Widom: Yes. You can't have both rows 1 and 2 at the same time.
If Billy -- and there's other examples of the same thing. If Billy or Frank
is in the answer, so if they're actually there, that means that Kathy saw a
Honda. If Kathy saw a Honda, she didn't see a Mazda. If she didn't see a
Mazda, then Jimmy can't be in the answer. And by the way, if Billy or Frank
is in the answer, then Hank has to be in the answer, another example. So
effectively, there are correlations, relationships between things in the
answer that depend on what you choose in the original data. Okay? So we
actually proved that our model cannot answer -- cannot represent the answer
to this query. Just can't do it. So this model is not expressive enough.
So what happened next? Oh, sorry. Just a moment. Why hadn't it been solved
already? Well, there were other people working in the area. Most of them
were theorists, I would say, at the time. So they actually were not too
concerned about this understandability. So there were other models that had
sort of complex constraints and put variables in there and so on. We were
trying to get something that people could actually look at and know what the
data meant. Whether we achieved that, you'll see -- I'll see what you think,
but that was our goal. Okay. So what happened? Actually lineage, believe
it or not, came to the rescue. Lineage is -- lineage can mean a lot of
different things but it's effectively the concept of tracing where data comes
from. So, what we did is we added to our model and it's a little ugly here
but what we added to our model is effectively pointers or capturing in the
answer where the data came from. I tried to do this with arrows but it got a
little too complicated. But effectively, this says that this first
alternative of tuple 31 came from the first alternative of 11 and this first
alternative of 21. That's what this little thing here says and the second
one came from the first alternative here and the second there. So these are
effectively pointers to where the data came from. Okay? And then
the interpretation of this data is that the possible instances that this
represents, the set of possible databases are only those databases where you
have consistent lineage. Okay. You can't grab at the same time two things
that come from two different choices in your base data. Okay? And this,
with the lineage, correctly captures the possible instances in the result.
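
A hedged sketch of the lineage idea just described, with an assumed encoding rather
than Trio's actual representation; tuple ids 11, 21, and 31 echo the ones mentioned
above, while ids 22 and 23 and the values are made up for illustration.

    # Lineage as pointers from result alternatives back to base alternatives.
    from itertools import product

    # Base x-tuples: id -> list of alternatives.
    base = {
        11: ["Saw(Kathy,Honda)", "Saw(Kathy,Mazda)"],
        21: ["Drives(Billy,Honda)", "Drives(Frank,Honda)"],
        22: ["Drives(Jimmy,Mazda)"],
        23: ["Drives(Hank,Honda)"],
    }

    # Result x-tuples: each alternative carries lineage {base_id: alternative_index}.
    result = {
        31: [("Billy", {11: 0, 21: 0}), ("Frank", {11: 0, 21: 1})],
        32: [("Jimmy", {11: 1, 22: 0})],
        33: [("Hank",  {11: 0, 23: 0})],
    }

    # A possible instance picks one alternative per base x-tuple, then keeps only
    # the result alternatives whose lineage is consistent with that choice.
    possible_instances = set()
    for choice in product(*(range(len(alts)) for alts in base.values())):
        base_choice = dict(zip(base.keys(), choice))
        instance = frozenset(
            person
            for alts in result.values()
            for (person, lineage) in alts
            if all(base_choice[i] == a for i, a in lineage.items())
        )
        possible_instances.add(instance)

    print(sorted(map(sorted, possible_instances)))
    # [['Billy', 'Hank'], ['Frank', 'Hank'], ['Jimmy']] -- exactly the instances the
    # base data allows; without the lineage check, independent choices would admit
    # impossible combinations such as {Billy, Jimmy}.
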
But it's even stronger. First of all, this model with the lineage, these
constructs and lineage, is closed under all the relational operations. We
proved that. But furthermore -- and the second actually implies the first -- it's complete. So any uncertain database you give me, and there I mean any
set of databases, any set of possible databases can be represented in this
model. All right. So why is it a favorite? Well, the Trio project itself
was conceived before we actually built the data model. So we started a
project. We wanted to do data uncertainty and lineage and the combination
was really motivated entirely by applications. So scientific data applications
often seemed to need both uncertainty and lineage. We were looking at the
entity resolution problem, which is also one where you have uncertainty and
lineage, and a whole bunch of applications seemed to need both, and that was
why we developed the project. Never
imagining that lineage would turn out to be the key to representing uncertain
data. Seriously never imagined that. So, in retrospect, you know, why is it
that -- maybe there was an implicit connection somehow in the applications?
That's probably the most likely. Maybe an unconscious hunch. Maybe less
likely. Divine intervention, pure luck. Hard to know. But anyway, that's
one of the reasons I really like this one is that it sort of fit together
later on. Definitely not the name. Okay. So, just going to wind up. Is
there anything in common among these favorites? You know, let me just do my
best and try to make something common. In the area of developing data
models, developing query languages, we worry a lot about expressiveness. We
worry a lot about simplicity and we just saw that in the last one, and then
efficiency. I didn't talk about efficiency today, but as I hinted through
all this work, we were thinking about efficiency. So I would say that
DataGuides did pretty well on the expressiveness and the simplicity side.
Not so well on efficiency, as I explained, at least the pure DataGuides. If
you look at CQL, expressiveness is pretty good. Efficiency, pretty good actually.
Maybe not so good on simplicity. I like to say that ULDBs maybe did manage
to hit that center point of all three of those, but the one thing I would say
completely in retrospect is that balancing -- trying to balance these
conflicting goals I would say has been a theme across a lot of my work, and
that's something I learned preparing this talk. So thank you.
[Applause].
>> Eric Horvitz: Do you have questions? I guess we've been going -- a session as long as we've gone.
>> Jennifer Widom: Right. We could have some more debates.
>> So for ULDBs, with respect to the efficiency, like for relational
[indiscernible] operations, [indiscernible], but I'm talking about the
execution of the queries.
>> Jennifer Widom: Right. So the only really problematic case is
aggregation, actually. Other than that, it was fairly efficient. So
aggregation has this problem. If you imagine a relation of 100 tuples, each
of which is present or absent and you ask for the sum of them, you have two
to the 100 sums. So other than that, it's efficient. There were also some
complicated things I didn't get into where when you have negation in your
queries, it gets complicated with this and you start having boolean
expressions as your lineage and so on. But for standard relational
operations, it's not a problem.
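
A small illustration of the aggregation blow-up just mentioned, using assumed values;
only the two-to-the-n point comes from the talk.

    # With n maybe-present tuples, a SUM query can have up to 2^n distinct answers.
    from itertools import combinations

    values = [1, 2, 4, 8, 16]   # 5 maybe-present tuples; powers of 2 keep all sums distinct
    possible_sums = {sum(subset)
                     for r in range(len(values) + 1)
                     for subset in combinations(values, r)}

    print(len(possible_sums))   # 32 == 2**5; with 100 tuples this would be 2**100
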
>> So [indiscernible], but this is one area where there is obviously a lot of
uncertainty and continues to expand as we get into IoT sensing.
>> Jennifer Widom: Right.
>> But as --
>> Jennifer Widom: And by the way, I didn't talk about probabilities at all, but that's a huge thing. I just didn't need it today. Yeah.
>> So [indiscernible], what is your reflection on whether it has taken root or not, and why not?
>> Jennifer Widom: You mean in terms of a generic platform for uncertain data?
>> Yeah.
>> Jennifer Widom: I don't know if there's going to be one that serves
everybody's purposes, to tell you the truth. But it's not -- verdict is
still out. Like you said, there's more things coming out that need it. I
think there's been a lot of one-offs for particular application centers for
example. Yeah.
>> Do you have any results that you were very excited about at the time, you
know, when you submitted the paper and now, looking back --
>> Jennifer Widom: Oh, man.
>> -- and producing this talk, you're kind of surprised at the fact that
actually, it's not even worthy of consideration for the top three?
>> You might as well have three least favorite results.
>> Jennifer Widom: I didn't talk about active databases at all. You know, I
need some time to think about that. It's a great question. I should think
about that. I did have to pick and choose a little, but you mean, something
I really loved that nobody else liked? Well, I mean --
>> Well, well, either that or that you changed your mind about the work based on the way technology has unfolded.
>> Jennifer Widom: Oh, I see.
>> You know, turns out to be only a theoretical [indiscernible] but at the time you really thought it was --
>> Jennifer Widom: I have plenty of work that probably falls in that category.
[Laughter]
>> Jennifer Widom: We all do, don't we? It's a good question. Right. That would be an interesting talk, three least favorite results.
[Laughter]
>> Jennifer Widom: Three bombs of --
[Laughter].
>> Eric Horvitz: Or tenures --
>> Jennifer Widom: Yeah. So the anti-tenure paper award, right?
[Laughter]
>> Jennifer Widom: It's a good question. I don't have an answer right off the bat. It's a good question.
>> As you said, these are your three favorite results from your own work. Do
you have a couple favorite results from other people's work in the field that
were the most kind of inspirational or --
>> Jennifer Widom: Oh, no, that's a hard one too. Oh, boy. I'm really
getting put on the spot here. I'm going to have to put that one off too.
I'm sure, I mean, I absolutely -- and probably in the -- if you want to take
these three areas, then probably it's in the probabilistic database where
there are some beautiful theoretical results. So like I said, we wanted
something that you could look at -- we were looking at user-facing data
models, but behind the scenes, people like Dan Suciu of Washington -- oh,
we're near Washington, aren't we?
[Laughter]
>> Jennifer Widom: Had beautiful -- we're in Washington, yes. There was
some really nice theory behind there unlike, I would say, the other two
areas, truthfully. What I would appreciate mostly would be really nice
theoretical results that back doing something. I've always liked to build
prototypes, but having something that's backing that is important.
>> Lucy: I wonder, so you have the uncertainty. So if you have that -- I
can't remember who the people were, that Sally saw a Honda or Sally saw an
Acura. Those appear to be uncertain, but if you attached a time modifier to
those, maybe they're less uncertain.
>> Jennifer Widom: Right.
>> Lucy: And I'm thinking about kind of the databases that Hoifung and I are
kind of constructing where, say, this particular protein up regulates that
protein, or it doesn't up regulate that protein. Those appear to be
contradictory, but they're not, because it up regulates this protein in the
mouse genome but it might down regulate the protein in something else.
>> Jennifer Widom: Right.
>> Lucy: So there are apparent -- like have you handled apparent
contradictions but they're not contradictions because they just require
further modification in order to understand?
>> Jennifer Widom: Yeah. My answer to that would be in our model, no,
because our model was pretty cut and dried. We have these alternative
values. It's one or the other and not both. I mean, you can construct, so
it's both. What I would say for that type of thing, just a shot in the dark
here, but the uncertain databases have this interpretation where it's a set
of possible certain databases. And it sounds like maybe there's some
layering of additional information that would constrain or even expand what the
set of possible certain databases is. That's what it sounds like to me.
Almost like our lineage, right? So we had our representation and then we
said, okay, but with lineage that changes what the possible certain databases
are and it sounds like you might have something like that where you have sort
of the data and then additional information that constrains or expands, maybe
in your case, what the possible databases are. Does that make any sense?
>> Lucy: I think that the lineage is very important. I guess the thing is
I'm coming from a text processing point of view where, you know, kind of we
don't just say triples. There's a lot of meaning that gets layered on.
>> Jennifer Widom: Right.
>> Lucy: And so, kind of how do we meaningfully layer those additional
constraints.
>> Jennifer Widom: Although I've always argued in the database world, we
don't layer on any meaning: We just give you the data and you can do what
you want with it.
[Laughter]
>> Jennifer Widom: This is what I always have to explain to people when they
think this is some kind of AI system, the Trio. I keep saying it's not.
It's just the data there. If you want to layer a Bayesian network on it, you
can. So probably we don't capture what you kind of --
>> So, Lucy, why have you not explicitly represented that? Because --
>> Lucy: The context?
>> Yeah, because then you have the context, right? And so if you use Z3 or
theorem provers with ULDBs, you can reason about it, but if you have it
embedded in your inference system, implicitly, then you're developing a very
specialized inference [indiscernible], which makes sense, is a lot more
efficient.
>> Lucy: I think it would be really interesting to talk about because the
thing that -- when people think that way, you end up with kind of
non-predicates and then it's very hard. Then you can't see inside the
predicate anymore. But that's a longer discussion. But it would be really
interesting to have that.
>> Eric Horvitz: [Indiscernible] question I'll ask you on career. It seems
like, looking at your bio and so on, you spent about five years at Almaden
after you finished your dissertation work at Cornell. And then moved to
Stanford. And so you made this decision to go to an industry research lab.
And then [indiscernible] academia. And it's a decision that many research
scientists who are in this room have pondered, made at different points in
their career, and then recurrently revisit.
>> Jennifer Widom: Of course most of the ones in this room didn't go, right?
Or they wouldn't be in the room. Didn't leave a research lab and go to
academia, but many who are in the room went the other direction.
>> Eric Horvitz: No, no, but the decision was entertained at the time of
like Ph.D. completion.
>> Jennifer Widom: Oh, I see.
>> Eric Horvitz: So I'm just curious to reflect on the -- and of course, you
know, MSR and over time, it's quite different probably than Almaden
[indiscernible] research labs, they're all quite different, but you could
share some of your experiences of what it was like to make the decision, to
be in Almaden, then Stanford and just reflect a little bit for this group.
>> Jennifer Widom: I would be happy to. Okay.
>> Talk to us in private, if you want.
[Laughter]
>> [Indiscernible].
>> Jennifer Widom: Okay. I got my Ph.D. in program verification. My
thesis was a negative result in using temporal logic to prove properties of
concurrent programs. It's pretty much a dead end, I thought. Anyway.
Well, I mean it was a great thesis of course.
[Laughter]
>> Jennifer Widom: Okay. So let's start with that --
>> Times have changed, by the way, since then.
>> Eric Horvitz: Yeah, we have a --
>> Jennifer Widom: That's true.
>> Eric Horvitz: We have a place for you right now in our [indiscernible].
>> Jennifer Widom: I shouldn't put down the area. I enjoyed my thesis work
and it was appropriate for Cornell where I was. I had a two-body problem
when I left. We interviewed at many universities and a couple of research
labs and in the end, the optimal decision for the two of us was to go to IBM
Almaden. Now, that was a great turn of events for me because at IBM, I was
given the chance to join the database group based on pretty flimsy evidence
that I might know something about data. I really didn't know anything. I
had done a summer internship at Xerox PARC. It had something vaguely to do with
databases. So at IBM Almaden, they offered me the chance to join the group
and who wouldn't join that group? It is an amazing group. So I went there
and I became a database person, which was great. And then so that's where I
learned -- that's where I switched research fields at that point in the
context of an amazingly good group, in the context of having 95 percent of my
time to just focus on research. At that time at Almaden, that was also the
glory days where you could just publish in the context of a big software
project. All good. Five years later, I had the chance to go to Stanford.
Who would turn that down? And that's more or less the story. So not exactly
a conscious decision. More a sort of meandering of events.
>> You at least credit that because your thesis work was in verification and languages, that your statement about semantics, you know, it's --
>> Jennifer Widom: Absolutely.
>> -- from your lineage.
>> Jennifer Widom: Absolutely. I have no question in my mind that my
programming languages Ph.D. influenced everything I did in databases. No
question. I should have said that. Absolutely. Yeah.
>> Eric Horvitz: And I guess, just to complete the discussion, so the
environment and lifestyle in academics for you, academia with the students
and so on at Stanford versus the focused time you had at Almaden, how would
you kind of compare or contrast those kind of experiences?
>> Jennifer Widom: Well, I mean, I always like working with students. I had
students when I was at Almaden. I'm sure most people here like working with
students also. But at that time, I literally spent 95 percent of my time on
research. Obviously, then after going to the university, it was never even
50 percent, I don't think. But well, the other thing, again, this is very
personal, but establishing my research career at IBM Almaden, by the time I
left there and went to Stanford, I had already established myself as a
researcher, so that really eased being a junior faculty member. I would not
change anything about what I did. The fact that I went to Stanford not with
that immediate pressure of just having finished my Ph.D. and having to get
grants and all that and ramp up and worry about tenure and all that, it did
ease the transition, I would say.
>> Eric Horvitz: Okay. Any other questions? AJ?
>> So you pretty much separated your work from what happens in AI through the talk.
>> Jennifer Widom: Yes.
>> You said I don't want to do inference. I just want to represent the data.
But the [indiscernible] foundations in AI are also about logic and the
relationship and the representation of the data. So in terms of these
favorite results in your career, how would you rank the results with respect
to the influence on other disciplines, for example, AI, or is that just the
hard thing that hasn't really happened that much?
>> Jennifer Widom: Boy. I don't know the answer to that question. How
would I rank them in terms of influence on other fields in computer science?
Boy. Well, there's information retrieval that used DataGuides, but I don't have a
good answer to that. But, by the way, I will say on the record here that
getting database and machine learning communities together is like the number
one priority. I think that is really important right now, and I think people
are working on that. So I'm guilty of being one of these people who has
really separated them and will just make it very clear, okay, this isn't
going to be AI. I'm just building the substrate. But I do think it's very
important for the fields to get together.
>> Eric Horvitz: Why don't we stop there and thank Jennifer.
[Applause]