>> Jim Larus: Why don't we get started. So it's my pleasure to introduce David
Walker from Princeton University. David is now a member of the secret society
of the tenured professors. So he's passed through the difficult stage and now he
gets to sit in retirement for the rest of his career. Actually David has done lots of
great stuff. Many of the people here have connections with David and I think he's
going to be talking to us about some new work, at least work that I haven't seen
before.
So, David...
>> David Walker: Thanks. All right. Yeah. So thanks for having me. Today
I'm going to talk about some work that I've been doing over the last couple of
years with some collaborators at AT&T, primarily Kathleen Fisher and Yitzhak
Mandelbaum. Yitzhak is one of my graduate students from Princeton who's graduated and
taken a research position at AT&T. And Qian is a new student of mine who's
started working on this stuff and Kenny is a post-doc. Right. So I normally have
to say that all the credit is theirs and all the problems are mine. So there we go.
So the overall starting point for talking about this stuff is that there's lots and lots
of data all over the place, and much of it is what we will call semi-structured data,
but it isn't necessarily in a standard format like XML or HTML. So systems are
producing all kinds of web logs. There are statistics that show up all the time on
my hockey discussion boards.
AT&T has tons of information pertaining to phone calls, and its billing and financial
transactions are in their own strange formats. Scientists, like biologists, have
microarray data and genomics data. So there's all kinds of different bits of data
from all over the place.
And there are a lot of problems with this data. Sometimes it has no
documentation. The formats can be evolving, and with little documentation you
can run into errors all the time when formats are changing and people don't know
exactly which way is what. And many of these data sources also have huge
volumes that we have to deal with when we're building tools.
So just to give you a little bit of an idea what I'm talking about: web servers
generate logs to tell you all the requests that have come in, and they have
information like IP addresses. They have the dates and structured requests and
information about errors and responses.
AT&T has similar kinds of log files that catalog the process of assigning a phone
number to somebody new. So these are huge logs that tell you what states the
process of somebody getting a new phone number is in. And again, it's
combinations of phone numbers, other kinds of numbers, various different names
of states in the process.
Biologists, like Olga Troyanskaya at Princeton, are developing new systems
for integrating lots of different experiments that researchers do and then post on
the web. And here is a little bit of data from the Gene Ontology website, which
tells you connections between different processes that go on inside cells. So
there's identifiers for the different processes and names and there's definitions
and various identifiers that forge connections between the different parts of the
document.
This is my last one. I'm a big fat hockey fan and I've always wanted to be able to
give a talk about -- and include some sort of hockey data and now since I have
tenure, I finally can. So on my various hockey boards people post all kinds of
different statistics and the beginning of the hockey season, what you want to do
is you want to go and grab a whole bunch of these statistics from different places
and perform some analysis of them so you can pick your hockey players for your
hockey pool in a way such as to defeat all the noncomputer scientists in the
crowd, which over the last couple of years I've been pretty good at. I have the
number one hockey team in my pool right now.
But anyway, the main point is there is lots and lots of data. And what I want to
do is help people build the arrow in this diagram that, from all this kind of
ad hoc data, provides you with a bunch of end user tools. Tools that will take that
data and convert it into the right format to feed directly into your database. Tools
that will take the data and convert it into a more standard form such as XML so
that standard programming tools and other kinds of tools can work directly on
your ad hoc data.
And also build tools that will take a look at the data and determine where there
are problems. So all of the different kinds of errors in the data can indicate
where other processes that are generating the errors are going wrong. So from
raw data we'd like to create a bunch of tools for processing that data. Oftentimes that
means converting it into more standard formats that other existing databases or
tools can deal with.
All right. So halfway there. This is the first version of the PADS system. The
idea is you have an ad hoc data source, and the first thing you do as a
programmer is write a PADS description in a little domain-specific
language that describes the syntax of your data and some of its semantic
properties. Then you feed that description through the
compiler, and it generates a bunch of libraries for parsing, printing, traversal of
the data structure and validation that the data meets the various properties that
you specified.
Then you can link those generated libraries to a run time system that deals with
IO and a bunch of error handling. And you can also link the generated libraries
and the runtime system to a bunch of tools, such as an XML converter or
data profiler or even -- the grandest tool of all -- an XQuery engine.
So each of these tools is defined in some sense by induction over the structure
of the description, or is driven by the structure of the description -- yeah, I do have a
cell phone in my pocket. I'll just -- okay. Anyway. So there are a bunch of these
tools here that are driven by the structure of the
description that gets generated. So then you can go ahead and take your ad hoc
data source, run it through the generated system, generate XML data instead of
your ad hoc data. Generate error reports that tell you where the errors are.
Generate a graph or use a query engine to generate some results.
And you can also link your data source against a -- or also generate the
programming interface so you can write your own custom applications and do
whatever you want with the data.
Okay. So what does the PADS language look like? Well, it's a language in which
one specifies data sources by a group of extended type declarations. Okay. So
there's a rich library of base types for things like integers of different sizes. There
are things like strings that are terminated by a particular character, like the vertical
bar. There are fixed-width strings. There are strings that match regular
expressions. And then, because we're really focused a lot on systems data and the log files
that systems generate, there are a lot of systems-style base types like dates
and times and IP addresses and URLs.
In addition to these base types, there are a number of constructors that allow you
to build larger descriptions. So there are constructors that deal with sequences,
such as structs or record types, and array types. There are constructors that deal with
choices, like unions, enumerations, and switch statements. There are constraints, and
then there is parameterization and dependency. And we'll see some examples of
these things in a second.
So the basic idea is that the reason we use types is because these types can
have more than one different interpretation. On the one hand, they can be
interpreted as a grammar for the data that you're interested in parsing.
On the other hand, they can also be interpreted as a type for the data structure
that is resulting from the parser. So a single definition has these multiple
different interpretations that are important for programming these kinds of
applications.
It's also the case that along with an internal representation, you get an auxiliary
structure that we call a parse descriptor that describes where the errors in the
data were found. Okay. And the shape of that parse descriptor mirrors the
shape of the data representation you get, and it is also generated from the type
that the programmer has specified.
And there are other tools as well that go in the opposite direction. So there is the
printer that takes a representation and generates data in the right format. Okay.
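To make this dual reading concrete, here is a minimal OCaml sketch of the idea: one declaration gives both a data type and a parser, and the parser also produces a parse descriptor recording where errors occurred. This is an editor's illustration, not the real PADS/ML API; all names here are invented.

    type pd = { nerr : int; offset : int }   (* parse descriptor: error count, position *)

    type 'a parser_ = string -> int -> ('a * pd * int) option
    (* input string and start position -> value, descriptor, next position *)

    (* A base type: an unsigned integer. Read as a type it is just int;
       read as a grammar it consumes a maximal run of digits. *)
    let p_uint : int parser_ = fun s i ->
      let j = ref i in
      while !j < String.length s && s.[!j] >= '0' && s.[!j] <= '9' do incr j done;
      if !j = i then None
      else Some (int_of_string (String.sub s i (!j - i)),
                 { nerr = 0; offset = i }, !j)

    (* A literal: consumes one expected character, e.g. a punctuation separator. *)
    let p_char (c : char) : char parser_ = fun s i ->
      if i < String.length s && s.[i] = c
      then Some (c, { nerr = 0; offset = i }, i + 1)
      else None

A printer would be a third interpretation of the same declaration, mapping the representation back to the raw format.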
Okay. So as an example of how this PADS language works -- you know, one thing I
should do actually, I should keep track of how much time I have and when I
should stop and --
>> Jim Larus: So we have the room until noon. I wouldn't recommend --
>> David Walker: No.
>> Jim Larus: -- talking that long, but if you run over past an hour don't feel like you
have to stop.
>> David Walker: Okay. Hopefully I'm aiming for about an hour or a little less.
Okay. So that would be about 11:30 on the clock at the back. Okay, excellent.
So here is -- here's an example of one of the lines in our log file for one of our
web servers. Okay. It's got a bunch of different components to it. The first
component's an IP address. Then there's a couple spaces. Then there's some
dashes and the dashes are usually a dash, but sometimes they're an identifier
that identifies a remote entity or an authenticated user. Then there's a time at
which the request to the web server was made. And then there's a string
surrounded by quotes that has some structure to it that explains what kind of
request was made. So in this case a get-request for that URL was made. And
also the version of the protocol that was used to make that request.
And at the end there was a response code and the number of bytes that were
returned from the request. Okay. So here is what it looks like to specify that kind
of line of data using a PADS description. So there are a couple of versions of
PADS; there is actually one for the functional language OCaml. There's another
for C. I'm going with the C version, although probably the OCaml one would be
just as familiar for most people in the audience.
So the idea is we have a struct definition that, like in C, declares a record,
okay. And the names of the fields of the record are highlighted in blue. And
each of those fields also has a type associated with it, like the field named client has
the type Host. The field remoteID has the type Auth ID. The field auth has the type
Auth ID. The field date has the type P Date. Okay. And in between each field there are some
punctuation specifiers. So I could read this in terms of a type as a record with
one, two, three, four, five, six, seven different named fields. And I could also
read it as a grammar that says the first thing I parse is something with type Host,
where Host is defined elsewhere. Then I parse a space. Then I parse something
that's an Auth ID. Then I parse another space, then I parse something else that is
an Auth ID. Then I parse a space plus a bracket. Then I parse a date, etcetera,
etcetera.
So it has these two different views: one as a type in C and another as a parser,
okay. So we can dig down a little and see how these different fields -- the types
of different fields are defined.
So here is the definition of the type called Auth ID. Okay. It's a union, so
there are two different possibilities. Okay. The first possibility is it's just a character,
and it's going to be called unavailable. And it has to be just the singleton dash.
The other possibility is that it's a string ending with a space. Okay.
So the semantics of such a union is that the first branch in the union is tried first
and the second branch is tried second, etcetera, etcetera, etcetera. Okay. So
that describes each of these little dashes in here. Okay.
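As a sketch of those first-match union semantics, in the same toy OCaml style as before (not real PADS syntax), the Auth ID field might be modeled like this:

    type auth_id = Unavailable | Id of string

    let p_auth_id : auth_id parser_ = fun s i ->
      (* Branch 1, tried first: the singleton dash. *)
      if i < String.length s && s.[i] = '-' then
        Some (Unavailable, { nerr = 0; offset = i }, i + 1)
      else
        (* Branch 2, tried only if branch 1 fails: a string ending with a space. *)
        match String.index_from_opt s i ' ' with
        | Some j when j > i ->
            Some (Id (String.sub s i (j - i)), { nerr = 0; offset = i }, j)
        | _ -> None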
Then there are also array types. Okay. So we might describe an IP address as
an array of unsigned integers 8 bits wide; there are going to be four of them, and
there's going to be a separator in between each of the array elements, that's a
period, and it's going to be terminated by a space.
Okay. So arrays, again: internally they just become an array. Externally
you can view them as a grammar for some data, and they're much like an extended
Kleene star from regular expressions. Okay.
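Here is a minimal sketch of that array type, with '.' as the separator and ' ' as the terminator, again in the toy OCaml style rather than real PADS syntax:

    let p_ip : int array parser_ = fun s i ->
      let rec loop acc n pos =
        match p_uint s pos with
        | Some (v, _, pos') when v < 256 ->
            if n = 4 then
              (* after the fourth element we expect the terminator *)
              if pos' < String.length s && s.[pos'] = ' '
              then Some (Array.of_list (List.rev (v :: acc)),
                         { nerr = 0; offset = i }, pos' + 1)
              else None
            else if pos' < String.length s && s.[pos'] = '.'
            then loop (v :: acc) (n + 1) (pos' + 1)   (* consume the separator *)
            else None
        | _ -> None
      in
      loop [] 1 i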
Okay, so...
>> Question: Are you going to talk about sort of the (inaudible) community to
that?
>> David Walker: Okay. I'm not -- I'm not actually going to talk about that really,
except to say that -- so here you can see there is some ambiguity going on. So
PADS' disambiguation is very simple. It simply takes the first match,
say in the union, and --
>> Question: Once you have matched a kind of field and structure you're not
going to go back --
>> David Walker: We're not -- there's not going to be any backtracking, ever.
>> Question: Okay.
>> David Walker: There are pros and cons to that.
>> Question: Yeah.
>> David Walker: Okay. So here is another struct that describes a request, and
this has three fields. It has a method, a requested URL and a version. And the
interesting thing about this example is the version field -- not only does it have to
have this type, so it has to satisfy the requirements of HTTP underscore V,
whatever that type is. It also has to satisfy the constraint that's described by this
function, check version. Check version is a function defined up here, and it
takes a pair of arguments, and basically it checks that if you have the link or
unlink method you'd better be in version 1.0 and not a higher version, because
those apparently were deprecated after version 1.0.
So the point here is that in order to check deeper properties of your data you
can write any arbitrary function that you'd like, and checking those properties can
depend upon earlier elements in your structure. So there's some dependency
that's going on here and the checking of some predicates. Okay.
>> Question: Does the checking of the predicate integrate with the parsing?
>> David Walker: So the parsing generates data representations, and each of
those representations has a data structure that is given by the types. And so,
like, in this version -- well, that's supposed to be -- so version just comes from
here. So the data structure that you pass to this function as an argument will
have the type given by HTTP.
>> Question: Could I use that just to roll back the parsing to find an alternative?
>> David Walker: Yes. No. No. You can't. It -- these should be functional.
>> Question: None of it's functional. It's more of a question of whether you're
constraint solving more than backtracking.
>> David Walker: Oh, I see. So if this field doesn't match then it would try the
next one. So sort of related to Tom's question earlier. So within a union it tries
the first branch, tries the second branch, tries the third branch, and if it fails --
>> Question: In the struct it's going to fail.
>> David Walker: In the struct it's going to fail? Oh, sorry, yeah. Right. So --
yeah, my bad. So it is going to -- right. There is no -- yeah, here we'll just fail
that constraint.
>> Question: And you have the choice to save it and try the other. So if you have
different criteria you would put this in a union instead of putting it in a structure, is
what I should say. So you can introduce backtracking -- limited backtracking --
where you want.
>> Question: And you could do it with a predicate?
>> David Walker: And you could do it with a predicate.
>> Question: Okay.
>> David Walker: The next example's a little bit like that, but not exactly.
So here is a case where you have a union, and the idea is that there's some
header that we've previously parsed. Okay. We pass in, through this parameter,
some integer that tells us, for instance, what the payload of our packet is going
to look like. And we can switch on this argument to say, oh, in this case try this
one. In that case, try that one. In this case, try that one. And if none of those
match, try that one.
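A sketch of such a switched union, parameterized by the previously parsed header value; the tag values and payload shapes here are made up for illustration, reusing the toy parser type from the earlier sketches:

    type payload = Ping | Data of string | Ack of int | Unknown of string

    (* The parser takes the earlier header field as a parameter and
       switches on it to pick a branch; the last case is the default. *)
    let p_payload (tag : int) : payload parser_ = fun s i ->
      let pd = { nerr = 0; offset = i } in
      let rest () = String.sub s i (String.length s - i) in
      match tag with
      | 0 -> Some (Ping, pd, i)                          (* empty payload *)
      | 1 -> Some (Data (rest ()), pd, String.length s)  (* free-form text *)
      | 2 -> (match p_uint s i with                      (* numeric ack *)
              | Some (n, pd', j) -> Some (Ack n, pd', j)
              | None -> None)
      | _ -> Some (Unknown (rest ()), pd, String.length s)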
So here is one further construct that allows you to define descriptions in which
later parts of the format depend upon earlier parts of the format.
Okay. So that's a basic summary of a bunch of the features that you can use to
describe the formats and the key ideas are that in addition to a grammar you also
get a description of what the type of the internal representation of the data is
going to be. And so as a consequence of that you can write functions that are
well typed that do things like analyze constraints.
So there are a number of advantages to this approach, I think. The syntax is
familiar and it's relatively intuitive to people. Struct means a sequence of
things. Array means a sequence of the same kind of thing. Union means a variety of
choices. It provides some readable, executable documentation of a format, like a grammar.
We've also developed a formal semantics of the system that I won't get into, but
you can ask me later if you want. And it's been used internally at AT&T for doing
things such as taking log files, cleaning them, and putting the results into formats
that can easily be written into databases.
But it still takes a long time to write these descriptions by hand. And there is
some investment that you need to make in order to learn this little
domain-specific language. It takes expertise, and it's much quicker for experts to use
than for novices.
So hence in comes PADS Version 2.0, where we'd like to investigate skipping the
handwritten description step and simply going directly from raw data to
tools such as a converter to XML, or a query engine, or a tool that will graph the
results based on some specification.
So here we have just a bunch of data files containing our ad hoc data. We send
those into our format inference engine. Out the other side pops a data description.
We can pass that through our Compiler, go ahead and generate tools such as
the Profiler or XML Converter and take our raw data and spin those through the
other boxes and get out an error report, or a profile of what the likely values are,
or the data in XML format.
So next section of the talk then is about how we can go about -- how we've gone
about implementing this format inference engine that takes the raw data and
spits out the PADS description automatically. Okay.
And this inference engine is designed as a series of phases. The first phase
breaks up the raw data into a bunch of parts for analyzing the repeated structure.
The second phase, which we call structure discovery, generates a simple candidate
format that we hope is fairly close to being a good format. And then there is a
cycle that we go around, scoring the candidate that we've generated and applying
a bunch of rewriting rules that attempt to optimize that score. Okay.
When we've done that and we've come up with a format that we can't optimize
any further we spit it out and we'll go through the rest of the cycle.
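The overall shape of that loop might look like the following OCaml sketch; the description type and the rule representation are placeholders, not the system's actual data structures:

    type descr = Base of string | Struct of descr list
               | Union of descr list | Array of descr

    let refine (score : descr -> float)
               (rules : (descr -> descr option) list)
               (d0 : descr) : descr =
      let rec loop d =
        (* try every rule; keep the first rewrite that lowers the score *)
        let better =
          List.find_map
            (fun rule ->
               match rule d with
               | Some d' when score d' < score d -> Some d'
               | _ -> None)
            rules
        in
        match better with
        | Some d' -> loop d'       (* keep optimizing *)
        | None -> d                (* local optimum: stop and emit *)
      in
      loop d0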
Okay. So I'm going to talk about the phases of these -- yep?
>> Question: Can you guarantee that the format that you spit out is -- will be
guaranteed to be a representation of the input data, or can --
>> David Walker: Yeah. It will parse all of the data that it was given as a -- that it
has as a sample. But if there's more data, it may not correctly generalize. Okay.
So step one, what's the chunking and tokenization process? Okay. So breaking
up initial data into chunks is simple. There's just one chunk either per file if you
give it a collection of files, or there's one chunk per line. Okay. So I'm going to
have a simple running example here, where we have some file with quote, number,
comma, number, quote, or some names and commas, and we'll first off break that
up into a series of lines or chunks.
Okay. Next step. We run a tokenizer or lexer over our chunks, and right
now our system has a configuration file that you can manage yourself that allows
you to express whatever tokens you want in terms of a set of regular
expressions.
Our default tokenizer that you get with the system again is skewed towards
systems data. So it contains things like integers, white space, punctuation, strings,
and a bunch of things like IP addresses, certain date formats, times, MAC
addresses. Okay. So right. So we convert our chunked data into a series of tokens.
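As a sketch of what such a configurable tokenizer looks like, here is a first-longest-match lexer in OCaml using the Str library; the token classes are illustrative, not the system's actual defaults:

    type token = Int of string | Word of string | Punct of char

    (* The "configuration file": regular expressions paired with constructors. *)
    let classes : (string * (string -> token)) list =
      [ "[0-9]+",    (fun s -> Int s);
        "[a-zA-Z]+", (fun s -> Word s) ]

    let tokenize (line : string) : token list =
      let n = String.length line in
      let rec go i acc =
        if i >= n then List.rev acc
        else
          (* take the longest match among all classes at position i;
             ties go to the earlier class in the list *)
          let best =
            List.fold_left
              (fun best (re, mk) ->
                 if Str.string_match (Str.regexp re) line i then
                   let m = Str.matched_string line in
                   match best with
                   | Some (m', _) when String.length m' >= String.length m -> best
                   | _ -> Some (m, mk)
                 else best)
              None classes
          in
          match best with
          | Some (m, mk) -> go (i + String.length m) (mk m :: acc)
          | None -> go (i + 1) (Punct line.[i] :: acc)  (* fall back to punctuation *)
      in
      go 0 []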
>> Question: Well, already there is some ambiguity in the classes of data.
>> David Walker: Absolutely. Yeah.
>> Question: And the -- again it's a first match.
>> David Walker: It's the same principle as lex right now.
>> Question: Okay.
>> David Walker: Although --
>> Question: Oh, so not --
>> David Walker: So it's first longest match -- or longest first match -- given the set
of rules. And we actually have -- I'm not going to talk about it here, but we've
actually gone and looked at using some machine learning techniques to try to
learn characteristics of what's likely to be an IP address, what's likely to be a
domain name. And at this point in the process, not rigorously deciding upon a
single tokenization, but basically collecting up the DAG of all
possible parses.
>> Question: Uh-huh.
>> David Walker: And then running that DAG of all possible parses through the
system. And one reason that I'm not talking about it, actually, is because I think that
the results have been really mixed in that it takes a lot longer to use these
machine learning techniques and the results are only better some of the time.
But it's the kind of thing where I'd love to work with someone who really knows
how to do machine learning well. Perhaps they know a lot more than I do about
how to resolve some of the problems. Yeah.
>> Question: So are there defaults or some knobs or something that you use to
control where (inaudible). In other words here you've got quotes around things
so one usage of quotes might be you know these are just comments and I
don't -- I just want a string. But in other cases the quotes are supposed to be
significant delimiters and they are supposed to keep on analyzing.
>> David Walker: Um, uh, so we don't really have -- we don't have any such
knobs. So right now we don't treat quotes as opaque and we analyze the internal
structure. Although at the end we -- based on the rewriting that I'll show you,
we'll talk about in a second, we make certain decisions that a type that we've
discovered is too complicated for the data it describes. And we back off on doing
things like having a complicated type inside a pair of quotes, for instance. So in
general we have a process for backing off when the types are too complicated for
the data they describe and we've come up with what we consider to be an
overly verbose description. Yes?
>> Question: You do understand things like matching quotes, I take it,
though. In other words -- you must have certain assumptions about beginning
and ending --
>> David Walker: We do. So, like, two slides from now I'll tell you about our
algorithm, and it's a top-down algorithm. And as part of the top-down algorithm, at
certain points we view what's in between quotes as opaque at one level of the
algorithm, but then we'll break it open at the next level of recursion. But I'll show
you in a second.
Okay. All right. I said all that stuff. Okay. So that's basically just the
setup, and tokenization is basically the hardest thing to do, and the results of
the rest depend heavily on how well you do in this space. So we'd love to
improve it, but I don't exactly know how at the moment, but that's okay.
Okay. So how does this structure discovery algorithm work? Okay. So it's a
top-down, sort of divide-and-conquer algorithm, and at each level it will compute
various statistics from the chunked and tokenized data. Then it's going to guess
from those statistics a top-level description, like: this is a struct, or this is a union,
or this is an array. Then it will partition the tokenized data into smaller chunks,
and it's going to then recursively reanalyze the data in each individual chunk,
compute new statistics and recursively apply the algorithm. And it bottoms
out when we identify that we just have a base type in our set of chunks.
So to give you a little picture of how this works. Okay. So we start out with nothing
so far. We have these initial sets of chunks. Now what's going to happen is
we're going to notice that every single one of these chunks has two quotes and a
comma in it, and that's going to cause us to guess that the current data is best
described as a struct with a quote, something else, a comma, something else, and a
quote.
We're going to run through each of these lists of tokens and partition them, such that
we get this set of chunks here matching this guy and this set of chunks
here matching that guy, and then we recursively go and analyze those two other
sets of chunks. So here we go. This is from the last stage. Now we recursively
recompute a bunch of statistics, and this time we guess that our description is a
union of two things: either something over here or something else over here.
We partition the elements of the union based on the first token that we see in the two
sets of chunks that we have.
So in this case all the strings go on one side, all the (inaudible) go in the other. We
do the same thing with the other question mark on the right-hand side of the tree,
and then at this point the set of chunks in each of these boxes is all just a set of
base types, so we can bottom out and say, uh-huh, this is an int, this is a string,
this is a string, and this is an int.
Okay. So the description that we first discover here is going to be a quote, a
union, a comma, a union and a quote. Okay. So the interesting part is how we
do this guessing to try to come up with a decision about: oh, this is a struct, this is
a union, this is an array. So the primary thing that we do is we take each of our
chunks and we compute a histogram, and what's in the histogram, for each
different base type, is the percentage of the records
that have a certain number of occurrences of each token. So, for instance, the
quote token appears twice in 100% of the records, and 0% of the time it appears
once, and 0% of the time it appears zero times. The comma token appears
once in 100% of the records. The integer token appears once 33% of the time,
twice 33% of the time, and zero times 33% of the time. The string token
appears once 33% of the time, and twice 33% of the time.
Okay. So once we have these histograms what we go and do is we cluster sets
of tokens that have similar histograms. And before we do any of the clustering
we actually normalize the histograms so they are in terms of descending size of
their columns. So that means that the quote histogram is similar to the comma
histogram. Because if you normalize by descending size the histograms look
exactly the same.
Okay. So it turns out that this is a group and that's a group. Okay. So once we
have our groups --
>> Question: (Inaudible) the exact equality? Like if in one or two tuples there was
some comma missing or one field less, you might nonetheless say --
>> David Walker: That's true. So, yeah, we use this symmetric relative
entropy function, which I got from one of the machine learning guys in our
department, which is a good way of comparing two histograms. And two
histograms are considered the same if they have the same symmetric relative entropy,
modulo some small delta. Yeah.
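Concretely, the comparison might look like this OCaml sketch; the smoothing constant and the delta are invented, and the histograms are assumed to be frequency arrays of equal length:

    (* Normalize a histogram by sorting its columns in descending order,
       so histograms with the same column shape compare as equal. *)
    let normalize (h : float array) : float array =
      let h = Array.copy h in
      Array.sort (fun a b -> compare b a) h;
      h

    let sym_rel_entropy (p : float array) (q : float array) : float =
      let eps = 1e-9 in                      (* smoothing so log never sees 0 *)
      let kl a b =
        let acc = ref 0.0 in
        Array.iteri
          (fun i ai ->
             let ai = ai +. eps and bi = b.(i) +. eps in
             acc := !acc +. ai *. log (ai /. bi))
          a;
        !acc
      in
      kl p q +. kl q p                       (* symmetrize the divergence *)

    let similar p q =
      sym_rel_entropy (normalize p) (normalize q) < 0.01   (* "small delta" *)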
So if there were a few errors in the file, that will not prevent you from classifying
things as being in the same category. Okay. So once we have these groups,
what we then do is we try to find a group that we have high confidence in, that has
a particular struct-like or array-like characteristic. So things are struct-like if they
have high coverage, meaning they're in a lot of the records, and if they have
narrow distributions, meaning that there aren't very many columns in the
distribution. So quote and comma very much satisfy the struct-like criterion,
because they're in all of the records and they have the same number of
occurrences in all of those records. Okay.
So that's a good indicator of a struct, and in fact at this level of the recursion
that's the strongest signal of anything that we have. So we'll pick that, then we'll
go and split up the data based on that decision and recompute histograms at the
next level, which, because we split up the data, are going to be a lot cleaner
than the histograms we had prior to doing that. Okay. So structs have high
coverage and narrow distributions; arrays have wide distributions, as opposed to
structs. And unions are the fallback for groups where, well, we didn't find anything with
these criteria, so we're going to subdivide the data into a union and hopefully on
the next iteration we'll find some better indicator of one of the other structure
types.
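Put as pseudo-OCaml, the decision rule is roughly the following; the thresholds are invented placeholders, not the system's tuned values:

    type guess = StructLike | ArrayLike | UnionLike

    (* coverage: fraction of records containing the token group;
       width: number of distinct occurrence counts in its histogram. *)
    let classify ~coverage ~width : guess =
      if coverage > 0.9 && width = 1 then StructLike      (* same count everywhere *)
      else if coverage > 0.9 && width > 3 then ArrayLike  (* many different counts *)
      else UnionLike                                      (* no strong signal: subdivide *)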
Okay. So once we decided that we have a quote, a comma and a -- or two
quotes and a comma, what we do is we run over all of the chunks and we look at
how many orderings of the two quotes and commas are there. For each ordering
we'll have one element of a union with that ordering. So in this case there's only
one ordering of quote, comma, quote, so the union part is degenerate and we
just generate a struct, and then we recursively have two more subproblems
that we have to learn using the same algorithm.
Okay. So that's how division works for a struct. For a union what we do is we
just look at the first token of each line of data and put everything with the same
first token into the same bucket for redoing things as a union. And when we
have -- we decide we have an array we scan for a particular separator token that
had the array-like qualities. Okay. So that's how we generate an initial candidate
structure.
The next step is to score that structure and to refine it using a bunch of rewriting
rules. Okay. So we have a large collection of rewriting rules that we've come up
with on a relatively ad hoc basis. What these rewriting rules do is they merge
structures, and they identify overly complex elements and eliminate them.
Sometimes they add constraints, which add precision --
like, isn't it the case that only a certain number of tokens appear? Like in our
example for the HTTP requests, you might see "get" and "post" and "link" in the
data, and so instead of leaving this as a general string, we say it's actually an
enumeration. It's also the case that we fill in some missing details, such as what
are the sizes of arrays, how do arrays terminate, and what are the separators for
arrays.
So all this rewriting is guided by a scoring function. And what that scoring function
does is it balances two ideas. So one idea is that a description must be
concise, and another idea is that a description must be precise. Taken to the limit,
either one of these alone is unreasonable, but as a combination it seems to work fairly
well. So you want your description to be concise because people can't read
enormous descriptions -- and, in the limit, the data is its own description. And you
want a description to be precise because imprecise descriptions don't give you
much information. So if we just say, aha, this is a string of characters, you
haven't done much work in understanding the data.
Okay. So we use an idea that is very prevalent in the natural language learning
community, which is this idea of minimum description length. The way
you balance these two concerns is you look at what the cost is for transmitting
the data that is described by your description. Okay. And so the cost is broken
down into two factors. One is the number of bits that you need to transmit the
syntax of the description. And the second component is: given that you've
transmitted the syntax of the description to somebody, what's the amount of
information you need to communicate the data given the description?
All right. So here, if you have decided, for instance, that the description says
there is exactly this one character in this place, then you don't need to transmit any
information to say that, yes, this is exactly the character that should be there.
Okay. So these two factors help you balance conciseness and precision in a
reasonable way. And so what we do is we apply this function to our description
that we generate and we just iteratively apply rules that decrease the score that
we get. And so this can lead us to a local optimum instead of a global optimum.
But it still seems to work quite nicely.
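In OCaml-flavored pseudocode, reusing the descr type from the earlier sketch, the score might be structured like this; the per-construct costs are invented for illustration:

    (* Bits to transmit the description's syntax itself. *)
    let rec type_cost (d : descr) : float =
      match d with
      | Base _ -> 1.0
      | Struct ds | Union ds ->
          1.0 +. List.fold_left (fun a d -> a +. type_cost d) 0.0 ds
      | Array d -> 1.0 +. type_cost d

    (* data_cost d would measure the bits to encode the records given d:
       a fixed literal costs 0 bits, a free string costs bits proportional
       to its length, a union choice costs log2 of the number of branches. *)
    let score (data_cost : descr -> float) (d : descr) : float =
      type_cost d +. data_cost d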
Okay. So we've done some evaluation of these ideas on a whole collection of
benchmarks. Most of the benchmarks again are drawn from this domain of
systems logs and things like that, that we're most interested in. And all the files,
so there's server logs and various logs that we found on our machines, as well as
a few things from AT&T and they all range, you know, from a couple hundred
lines usually to a couple thousand lines. Okay.
And so one of the things that we did was we tried to measure the correctness of
our analysis in terms of how much data we needed to generate descriptions that
were appropriately general. In other words, that given some small amount of
data would parse 90% or 95% of the rest of the data that we had from that
source.
>> Question: For all these data sources, do you know a priori that there
exists some PADS description that will describe it?
>> David Walker: Yeah, yeah. So for all those data sources we wrote
our own PADS description by hand first. And we also tried to do some
qualitative -- we have a qualitative measure of how well the machine-learned
descriptions do versus the handwritten ones. It's hard to eliminate our own
personal bias from such an analysis.
Right. So what I really should have here is another line that says
how much data we need to generate descriptions that correctly parse 100%
of the data. But I don't have that, so sorry. Okay. So -- well, anyway, I think the
bottom line is that a lot of these data files are sort of thousands of lines long, or
1,000 or 2,000 lines long, and it takes about 5 or 10% of the data for the
algorithm to generate a description that works almost all the time.
In terms of execution time, it takes under a minute for our algorithm to work on
any of these data files, which are about 1,000 to 3,000 lines long. It depends a
little bit on the complexity of the format, though. So the most time-consuming
ones are these, like asl.log and ai.3000. And this one's actually 3,000 lines long
and this one's only 1,000 lines long and, you know, the latter takes quite a
bit longer. So a lot of it depends upon the complexity of the format.
But generally in terms of -- if you're a programmer and you say, okay, I want to
go and, you know, manipulate this data, you could press a button, wait 30
seconds, wait a minute and you have a format in PADS that you can run your
Compiler on and continue your work.
So it's fully in the normal, you know, compile time cycle in terms of a programmer
getting work done. And just as another sort of reference point, the last column
here suggests how long it took us to write a version of the format by hand and
debug it. Okay. So and one thing is my post-doc did this when he first got here
and so this one here is the time that includes downloading and installing the
system and getting it all to work and reading some of the manual and doing that
stuff. So we can say it takes a couple days for a first-time user to learn the
system. And after that, you know, writing a description in PADS for some data
source that's not too complicated takes -- maybe it takes half an hour, maybe it
takes two hours, something in that range. Versus one minute for the automated
analysis.
>> Question: How long are the handwritten descriptions typically?
>> David Walker: How long are the handwritten descriptions typically? Um, I
don't know, 100 lines of code, that's my guess. It can vary. It can vary a fair
amount. Okay.
One other thing we did was we took a look, at least for small files, at what the
scaling is like. And roughly speaking, as long as we don't run out of memory --
and we don't on any of these small files, but we would on files that were much
larger -- the process seems to scale about linearly with the number of lines in the
file. And more important is the constant factor that is associated with the
complexity of each line in the description, and then the number of lines in the file,
for instance.
Okay. So that's what I've been doing for the last couple of years. What's next?
Well, there are a couple of things that I really am excited about doing next. So
one thing is I'm working with a couple of my friends on scaling the inference
system up so that it can handle sources that are millions of lines of data long, like
they have at AT&T. So in some sense the automated analysis is much more
useful potentially for the data sources that are really, really huge. Kathleen
wanted to create a description for some source that was tens of millions of lines
long for use internally at AT&T, and one thing she found was that after about one and
a half million lines the format completely changed. So it was a complete
nightmare for her to go ahead and attempt to write a description for this file by
hand, which partly motivates this work.
>> Question: Is the scaling about optimizing the algorithms, or do the descriptions just
tend to be really complicated?
>> David Walker: It's the algorithms -- the amount of data that they use. What
we're doing to scale them up is simply to apply the algorithm in batches. So run
the learning algorithm over the first bit of data, generate a candidate description,
then run it over the next bit of data, and wherever there are errors, basically
populate a data structure in a way that accumulates data at particular nodes, then
solve more learning problems in the incorrect spots and add that back into the
overall description to get a new description, and then apply some other rewriting rules
in ways that don't completely change the entire description that you have.
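A sketch of that batch-wise strategy, again reusing the descr type from the earlier sketch; all the function names are placeholders for the in-progress system:

    let learn_incrementally
        (learn : string list -> descr)       (* base inference over records *)
        (parses : descr -> string -> bool)   (* does this record parse? *)
        (merge : descr -> descr -> descr)    (* graft relearned nodes back in *)
        (batches : string list list) : descr =
      match batches with
      | [] -> Base "empty"
      | first :: rest ->
          List.fold_left
            (fun d batch ->
               match List.filter (fun r -> not (parses d r)) batch with
               | [] -> d                        (* batch fits: keep d *)
               | errs -> merge d (learn errs))  (* relearn only the bad spots *)
            (learn first) rest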
>> Question: It seems like it would be perhaps a bit (inaudible) where you have
all these workers working on this huge file, each coming up with a description,
and then the reduction somehow has to deal with --
>> David Walker: How do you merge?
>> Question: Merging these together and merging might involve making a new
union for a new struct in different phases or something like that.
>> David Walker: Yeah. That's sort of a slightly different way of doing things.
Our current way is given a starting description, which could be written by hand or
could be machine generated, what we've tried to do is preserve as much of the
existing structure as we can while making the necessary changes. And I think it's
easier for us to preserve the existing structure -- well, you could still do it with a
map or still formulate that existing structure to many, many different nodes. But
it's a slightly different algorithm than two things independently coming up with
potentially completely different things and trying to merge.
>> Question: Uh-huh.
>> David Walker: As opposed to localizing in the description where the errors
occur.
>> Question: Yes.
>> David Walker: And relearning only that part of the description. Two slightly
different techniques and we're currently working on that first one, but we did
actually consider the second one. We just thought it would be technically trickier,
but whatever. Yeah, you had a...
>> Question: Yes. I'm still trying to understand the difficulty in scaling the
inference to millions of lines -- so the issue is not the millions of lines. Supposing
there's a gazillion-line data source.
>> David Walker: Uh-huh.
>> Question: (Inaudible) include that you have alternating (inaudible) --
>> David Walker: Okay.
>> Question: Maybe to do the inference you just need, like, maybe 10 or 20 of
those records, right? And you will be able to infer it faster. You don't really have
to look at the entire thing.
>> David Walker: Um, yeah, if you're -- if you're lucky, right, you can look at the
first, you know, few lines, infer the right description and then just validate that
the rest of the data matches.
>> Question: So probably this happens when -- if you put these logs in terms of
lines, the individual record (inaudible) you have to recognize is very
large, probably.
>> David Walker: Yeah. Or -- or you just pick the -- yeah, or you just pick the
wrong set of lines to look at initially, and there is some change in
the protocol that happens halfway through. Something that Kathleen has
found happens sometimes in these logs.
>> Question: Just pick the first (inaudible) to do sample.
>> David Walker: Right now we're just going to -- we could easily sample
randomly, but we're just going to pick the first set of lines and then go through
and validate the rest of the description and yeah, if all the rest of the description
is exactly the same, then there's no change that needs to be done. But there are
definitely cases where that just doesn't happen.
I mean, these descriptions can grow to be quite complicated. I mean, one part
that I didn't mention is in the -- one of the things that the rewriting phase does is it
makes a table out of all the data and it looks for functional dependencies
between -- so is this column functionally dependent upon this other column
basically in the data? And if it is, then we'll put that into the description. So that
kind of algorithm is using all the data at once in order to do this and it's
comparing all columns against all other columns. And what we want to do is
basically summarize that information in a chunk, and then on the next chunk, you
know, check: did we overspecify because of the specifics of the previous
chunk, or does that relationship that we've inferred still hold? We can't -- it
would be too slow to -- yeah.
>> Question: So when you say lines -- in some sense, because you're doing the
lexing, you're not really thinking in a line structure, right?
>> David Walker: No.
>> Question: I mean, if I'm outputting messages which are multi -- if I'm
outputting sort of information about messages being sent and received and the
message format is multi-line text --
>> David Walker: Yeah.
>> Question: Right. That's not a problem.
>> David Walker: That's not a problem, no. Yeah. So one thing that's --
>> Question: There are hundreds of messages and the orders of those
messages differ and the fields of the messages change. I mean then that is a
little bit more complex.
>> David Walker: Yeah, on our to-do list is being able to specify the paragraphs --
the chunks that you want your learning system to process the data in. You basically
just -- at the beginning of the algorithm you need some way to get started
with a set of candidate chunks that you have some reason to believe
will have some commonality.
>> Question: So are you saying that the way it works right now if you did have
the case with some multi-line things then it might chunk just at the lines and not
notice the correspondence of sort of togetherness of the groups of lines?
>> David Walker: Yeah, that's right. It won't --
>> Question: So it won't measure -- or it won't find structure across lines?
>> David Walker: It won't currently find structure across lines. What we would
need to do is give it some reason to break things up.
>> Question: I'm assuming that people often have to go in and at least give
meaningful names. This would find structure --
>> David Walker: Yes, that's right.
>> Question: But obviously it's going to have, you know, just made-up names
for everything, and in an XML file it would just look like gibberish.
>> David Walker: Yeah. Oh, yeah, I kind of stopped here in the middle. But the
next thing I was going to -- I forgot, I don't know, I guess I was -- whatever, what
a great talk. No, but actually one of the next things I'm really interested in doing,
and this is what my student Qian is working on, is -- so there's this problem with
the handwritten descriptions, you know: it's slower. With the automated ones, you
definitely have a problem with these machine-generated names, which makes the
resulting description look more complex. And oftentimes, you know, just generating
the description isn't the end result. If you are cleaning some data that you want
to put into a database, you might have to refer to different fields, do some small
transformations on those fields, normalize dates or times or any variety of other
things. And so you need to be able to get some hooks on the different bits of the
data that you're interested in. And so what Qian is looking at is sort of the fusion of
the two ideas, partly handwritten, partly automated, in which you take your data
and you start editing it to insert various bits of description inside it. So here is an
example. Here are my hockey stats, and what I'm interested in is the name
of my player, the age and the salary, for instance. And what this line
here with the star says is: after it I expect to see a repeated set of records that
match this format, and I'd like to tag the name field, the age field and the salary
field.
So I do this almost as though you are -- I mean, one way is sort of like you are
working in XML, but you have raw text as a starting point and you are placing
your own little tags with some additional information in terms of how to read the
rest of or parts of the description.
>> Question: Formatting by example.
>> David Walker: Formatting by example, that's right. It's formatting by
example. So, yeah, you're using the structure and the data that's already there to
allow you to elide tons of stuff you would otherwise have to do.
And then the other -- the next part of this is to couple this with a new scripting
language that can refer to the various tags that you've inserted into the different
places. So I sort of have some vague ideas of basing this more or less around
XQuery, and here is a little script -- this is my way-out ideas part of the talk.
I don't know what LINQ is.
>>: Oh, gosh.
>> David Walker: Sorry.
>>: We'll have to talk.
>> David Walker: Okay.
>>: You really should know what LINQ is.
>> David Walker: Okay. Tell me what LINQ is. All right. Anyway, so this is a
little script -- maybe it does what LINQ does -- but it selects from the stats field,
and it's ordered by the name field, and it prints out data that is a name followed by
some spaces, the age and the salary. So, there you go.
So there's that idea. And then another set of things that I'm thinking about also is
moving towards not describing individual files, but describing collections of files.
So at Princeton and AT&T everyone who builds a distributed system also
generates their own little monitoring system which goes and collects up log files
from all over the -- you know, all over the place. And they often have these
complex directory structures. Like Mike Freedman, as an example: he has some
complex directory structure that has, you know, one directory for each machine
on PlanetLab that he's interested in monitoring, and then subdirectories for times,
and then more directories for different kinds of information. And what I'd like to
do is extend PADS such that it doesn't specify just one file, but it specifies
entire parts of the file system and generates for you an interface with which
you can query the files that are in the file system, manipulate them, transform
them, get information about them.
Another thing that we can possibly do is generate the monitoring infrastructure
that goes and fetches all the data from various different places and archives it for
you centrally. So, yeah, um, those are some things I'm interested in doing next.
There's a number of bits of related work. In terms of just languages, there are a
number of languages that were made up in the networking community some time
ago; DataScript and PacketTypes are two of them. They allow you to describe binary
data using this sort of type-based metaphor and generate internal
representations for you. So PADS is inspired by a lot of those efforts. I think
some of the new things are in terms of the way we do error processing with the
parse descriptors. We also have a paper on the semantics that describes the
semantics of a number of these different languages.
And in terms of the learning stuff, we have borrowed ideas from a number of
different people in the machine learning community and people who are learning
the structure of XML. So the idea for our top-down recursive structure discovery
algorithm is based on ideas by Arasu and Garcia-Molina, who
looked at how to extract data from XML or from web pages. And we coupled that
with other ideas that other people have had in terms of using the minimum
description-length principle to optimize the descriptions that you get.
Okay. So summary is PADS 1.0 is a language where you can write down
descriptions by hand. PADS 2.0 improves productivity by automatically
generating descriptions. PADS 3.0 is going to be better yet, though we're not
exactly sure what it's going to do. But no, hopefully it will combine some of the
ideas from each of those and give you the control of the handwritten with the
efficiency of the automated version.
All right. Thanks.
(applause)
>>: Any questions?
>> Question: Is there a download? Where do I --
>> David Walker: There is a download. Do we -- oh, here we go. Look at that.
Yeah. So you can go to www.padsproj.org, and we have a couple of demos
online that you can take a look at if you want. So there is a demo of the basic
PADS infrastructure. And that will show you a handwritten description and some
of the tools that we can generate, like an XML converter and then there's this
learning demo where there's a bunch of different files that you can select from
and look at and then you can click the button and you can see the descriptions
that you get out and then you can see the -- what happens when you apply one
of the tools using those descriptions.
>> Question: So when you combined the hand (inaudible) and description of
the inference so (inaudible) description or say it's just (inaudible) that description
doesn't actually (inaudible) --
>> David Walker: So right now we're building a new system which will do
this incremental processing, given a starting point that's a description, whether
it's been learned or it's been written by hand. Apply it to segment after
segment of a large data source, and if it parses the large data source, great. If it
doesn't, then it finds the places where the description is wrong, and it will relearn
those nodes and rewrite them and include them.
So does that answer your question? Yeah?
>> Question: If for example, the description is generated by the kind of like the
inference of (inaudible) data ->> David Walker: Okay.
>> Question: Then the possibility of that description is (inaudible) higher than
the (inaudible) seems to be kind of like a (inaudible) description being the
(inaudible) so how do you ->> David Walker: So we do have a tool also that if you have a handwritten
description, we have this like accumulator tool. You can apply it to a data set
and then it will give you a list of all the errors that it found. You can look at is that
an error in description or is it an error in the data?
>>: Thank you.
>> David Walker: Okay, good, yeah.