>> Jim Larus: Why don't we get started. So it's my pleasure to introduce David Walker from Princeton University. David is now a member of the secret society of the 10-year professors. So he's passed through the difficult stage and now he gets to sit in retirement for the rest of his career. Actually, David has done lots of great stuff. Many of the people here have connections with David, and I think he's going to be talking to us about some new work, at least work that I haven't seen before. So, David...

>> David Walker: Thanks. All right. Yeah. So thanks for having me. Today I'm going to talk about some work that I've been doing over the last couple of years with some collaborators at AT&T, primarily Kathleen Fisher and Yitzhak Mandelbaum, one of my graduate students at Princeton who's graduated and taken a research position at AT&T. And Qian is a new student of mine who's started working on this stuff, and Kenny is a post-doc. Right. So I normally have to say that all the credit is theirs and all the problems are mine. So there we go.

So the overall starting point for talking about this stuff is that there's lots and lots of data all over the place, and much of it is what we will call semi-structured data, but it isn't necessarily in a standard format like XML or HTML. So systems are producing all kinds of web logs. There are statistics that show up all the time on my hockey discussion boards. AT&T has tons of information pertaining to phone calls, and its billing and financial transactions are in their own strange formats. Scientists, like biologists, have microarray data and genomics data. So there are all kinds of different bits of data from all over the place. And there are a lot of problems with this data. Sometimes it has no documentation. The formats can be evolving, and with little documentation you can run into errors all the time when formats are changing and people don't know exactly which way is what. And many of these data sources also have huge volumes that we have to deal with when we're building tools.

So just to give you a little bit of an idea of what I'm talking about: web servers generate logs to tell you all the requests that have come in, and they have information like IP addresses. They have the dates and structured requests and information about errors and responses. AT&T has similar kinds of log files that catalog the process of assigning a phone number to somebody new. So these are huge logs that tell you what state the process of somebody getting a new phone number is in. And again, it's combinations of phone numbers, other kinds of numbers, various different names of states in the process. Biologists, like Olga Troyanskaya at Princeton, are developing new systems for integrating lots of different experiments that researchers do and then post on the web. And here is a little bit of data from a website, a Gene Ontology website, that tells you connections between different processes that go on inside cells. So there are identifiers for the different processes, and names, and there are definitions and various identifiers that forge connections between the different parts of the document. This is my last one. I'm a big hockey fan, and I've always wanted to be able to give a talk about -- and include some sort of hockey data, and now, since I have tenure, I finally can.
So on my various hockey boards people post all kinds of different statistics, and at the beginning of the hockey season what you want to do is go and grab a whole bunch of these statistics from different places and perform some analysis of them, so you can pick your hockey players for your hockey pool in a way such as to defeat all the non-computer-scientists in the crowd, which over the last couple of years I've been pretty good at. I have the number one hockey team in my pool right now. But anyway, the main point is there's lots and lots of data.

And what I want to do is help people build the arrow in this diagram: from all this kind of ad hoc data, provide you with a bunch of end-user tools. Tools that will take that data and convert it into the right format to feed directly into your database. Tools that will take the data and convert it into a more standard form, such as XML, so that standard programming tools and other kinds of tools can work directly on your ad hoc data. And also tools that will take a look at the data and determine where there are problems. All of the different kinds of errors in the data can indicate where the other processes that are generating the errors are going wrong. So from raw data we'd like to create a bunch of tools for processing that data. Oftentimes that means converting it into more standard formats that other existing databases or tools can deal with.

All right. So, halfway there. This is the first version of the PADS system. The idea is you have an ad hoc data source, and the first thing you do as a programmer is write a PADS description in a little domain-specific language that describes the syntax of your data and some of its semantic properties. Then you feed that description through the compiler, and it generates a bunch of libraries for parsing, printing, traversal of the data structure, and validation that the data meets the various properties you specified. Then you can link those generated libraries to a runtime system that deals with IO and a bunch of error handling. And you can also link the generated libraries and the runtime system to a bunch of tool suites, such as an XML converter or a data profiler or even -- the grandest tool of all -- an XQuery engine. So each of these tools is defined in some sense by induction over the structure of the description -- they're driven by the structure of the description that gets generated. Yeah, I do have a cell phone in my pocket. I'll just -- okay. Anyway. So then you can go ahead and take your ad hoc data source, run it through the generated system, and generate XML data instead of your ad hoc data. Generate error reports that tell you where the errors are. Generate a graph, or use a query engine to generate some results. And you can also generate the programming interface, so you can write your own custom applications and do whatever you want with the data.

Okay. So what does the PADS language look like? Well, it's a language in which one specifies data sources by a group of extended type declarations. So there's a rich library of base types for things like integers of different sizes. There are things like strings that are terminated by a particular character, like the vertical bar. There are fixed-width strings.
There are strings that match regular expressions, and then, because we've really focused a lot on systems data and the log files it generates, there are a lot of specifically systems-style base types like dates and times and IP addresses and URLs. In addition to these base types, there are a number of constructors that allow you to build larger descriptions. So there are constructors that deal with sequences, such as structs or record types, and array types. There are constructors that deal with choices, like unions, enumerations, switch statements. There are constraints, and then there is parameterization and dependency. And we'll see some examples of these things in a second.

So the basic idea is that the reason we use types is that these types can have more than one interpretation. On the one hand, they can be interpreted as a grammar for the data that you're interested in parsing. On the other hand, they can also be interpreted as a type for the data structure that results from the parser. So a single definition has these multiple different interpretations that are important for programming these kinds of applications. It's also the case that along with an internal representation, you get an auxiliary structure that we call a parse descriptor, which describes where the errors in the data were found. And the shape of that parse descriptor mirrors the shape of the data representation you get, and it is also generated from the type that the programmer has specified. And there are other tools as well that go in the opposite direction. So there is the printer, which takes a representation and generates data in the right format.

Okay. So as an example of how this PADS language works -- you know, one thing I should do actually is keep track of how much time I have and when I should stop and --

>> Jim Larus: So we have the room until noon. I wouldn't recommend --

>> David Walker: No.

>> Jim Larus: -- talking that long, but if you run over past an hour, don't feel like you have to stop.

>> David Walker: Okay. Hopefully I'm aiming for about an hour or a little less. Okay. So that would be about 11:30 on the clock at the back. Okay, excellent. So here's an example of one of the lines in the log file for one of our web servers. It's got a bunch of different components to it. The first component's an IP address. Then there are a couple of spaces. Then there are some dashes, and the dashes are usually a dash, but sometimes they're an identifier that identifies a remote entity or an authenticated user. Then there's a time at which the request to the web server was made. And then there's a string surrounded by quotes that has some structure to it that explains what kind of request was made. So in this case a GET request for that URL was made, and also the version of the protocol that was used to make that request. And at the end there's a response code and the number of bytes that were returned from the request.

Okay. So here is what it looks like to specify that kind of line and that data using a PADS description. There are a couple of versions of PADS; there's actually one for the functional language OCaml, and another for C. I'm going with the C version, although probably the OCaml one would be just as familiar for most people in the audience. So the idea is we have a struct definition that, like in C, declares a record. And the names of the fields of the record are highlighted in blue.
And each of those fields also has a type associated with it: the field named client has the type host. The field remoteID has the type auth ID. The field auth also has type auth ID. The field date has the type Pdate. Okay. And in between each field there are some punctuation specifiers. So I can read this in terms of a type, as a record with one, two, three, four, five, six, seven different named fields. And I can also read it as a grammar that says: the first thing I parse is something with type host, where host is defined elsewhere. Then I parse a space. Then I parse something that's an auth ID. Then I parse another space, then I parse something else that's an auth ID. Then I parse a space plus a bracket. Then I parse a date, etcetera, etcetera. So it has these two different views, one as a type in C and another as a parser. Okay.

So we can dig down a little and see how the types of these different fields are defined. So here is the definition of a type called ID. It's a union, so there are two different possibilities. The first possibility is that it's just a character, it's going to be called unavailable, and it has to be just the singleton dash. The other possibility is that it's a string ending with a space. So the semantics of such a union is that the first branch in the union is tried first and the second branch is tried second, etcetera, etcetera. Okay. So that describes each of these little dashes in here.

Then there are also array types. So we might describe an IP address as an array of unsigned integers 8 bits wide; there are going to be four of them, and there's going to be a separator in between each of the array elements, which is a period, and it's going to be terminated by a space. So arrays, internally, just become an array. Externally you can view them as a grammar for some data, and they're much like an extended Kleene star from regular expressions. Okay. Okay, so...

>> Question: Are you going to talk about sort of the (inaudible) community to that?

>> David Walker: Okay. I'm not actually going to talk about that really, except to say that -- so here you can see there is some ambiguity going on. PADS disambiguation is very simple. It simply takes the first match, say in the union, and --

>> Question: Once you have matched a kind of field in a structure you're not going to go back --

>> David Walker: We're not -- there's not going to be any backtracking, ever.

>> Question: Okay.

>> David Walker: There are pros and cons to that.

>> Question: Yeah.

>> David Walker: Okay. So here is another struct that describes a request, and this has three fields. It has a method, a request URL and a version. And the interesting thing about this example is the version -- not only does it have to have this type, so it has to satisfy the requirement of HTTP_v, whatever that type is, it also has to satisfy the constraint that's described by this function, check version. Check version is a function defined up here, and it takes a pair of arguments, and basically it checks that if you have the LINK or UNLINK method you'd better be in version 1.0 and not a higher version, because those apparently were deprecated after version 1.0. So the point here is that in order to check deeper properties of your data, you can write any arbitrary function that you'd like, and checking those properties can depend upon earlier elements in your structure.
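[For reference: a rough PADS/C sketch of the kind of description being discussed, modeled on the published web-log example from the PADS papers. The exact base-type names and keyword spellings (Pstruct, Punion, Parray, and so on) are recalled from those papers and may differ in detail from the released system.]

    Punion auth_id_t {
      Pchar unavailable : unavailable == '-'; /* just a dash */
      Pstring(:' ':) id;                      /* otherwise, a space-terminated string */
    };

    Parray ip_t {
      Puint8[4] : Psep('.') && Pterm(' ');    /* four 8-bit ints, '.'-separated */
    };

    bool chkVersion(http_method_t meth, http_v_t version) {
      /* LINK and UNLINK were deprecated after HTTP 1.0 */
      if (meth == LINK || meth == UNLINK)
        return version.major == 1 && version.minor == 0;
      return true;
    }

    Pstruct http_request_t {
      http_method_t  meth;      ' ';
      Pstring(:' ':) req_uri;   ' ';
      http_v_t       version : chkVersion(meth, version);
    };

    Precord Pstruct entry_t {
      ip_t            client;   ' ';
      auth_id_t       remoteID; ' ';
      auth_id_t       auth;     " [";
      Pdate           date;     "] \"";
      http_request_t  request;  "\" ";
      Puint16         response; ' ';
      Puint32         length;
    };

Read as a C type, entry_t is a record with seven named fields; read as a grammar, it parses an IP address, a space, two auth IDs, a bracketed date, a quoted request, a response code and a byte count, just as in the example line.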
So there's some dependency that's going on here, and the checking of some predicates. Okay.

>> Question: Does the checking of the predicate integrate with the parsing?

>> David Walker: So the parsing generates data representations, and each of those representations has a data structure that is given by the types. So like in this version -- well, that's supposed to be -- so version just comes from here. So the data structure that you pass to this function as an argument will have the type given by HTTP_v.

>> Question: Could I use that just to roll back the parsing to find an alternative?

>> David Walker: Yes. No. No. You can't. These should be functional.

>> Question: Even if it's functional -- it's more of a question of whether your parser is doing something more than backtracking.

>> David Walker: Oh, I see. So if this field doesn't match, then it would try the next one. So this is sort of related to Tom's question earlier. Within a union it tries the first branch, tries the second branch, tries the third branch, and if it fails --

>> Question: In the struct it's going to fail.

>> David Walker: In the struct it's going to fail? Oh, sorry, yeah. Right. So -- yeah, my bad. So it is going to -- right. There is no -- yeah, here we'll just fail that constraint.

>> Question: And you have the choice to save it and try the other. So if you have different criteria you would put this in a union instead of putting it in a struct, is what I should say. So you can introduce backtracking -- limited backtracking -- where you want.

>> Question: And you could do it with a predicate?

>> David Walker: And you could do it with a predicate.

>> Question: Okay.

>> David Walker: The next example's a little bit like that, but not exactly. So here is a case where you have a union, and the idea is that there's some header that we've previously parsed. We pass in through this parameter some integer that tells us, for instance, what the payload of our packet is going to look like. And we can switch on this argument to say: oh, in this case try this one. In this case, try that one. In this case, try that one. And if none of those match, try that one.
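[For reference: a rough sketch of what such a switched union might look like in PADS/C. The Pswitch/Pcase spelling follows the published PADS examples but may differ from the released system, and the tag values and payload types (call_t, bill_t) here are hypothetical.]

    Punion payload_t(:Puint16 tag:) {    /* tag was parsed earlier, in a header */
      Pswitch (tag) {
        Pcase 1:  call_t call;           /* tag 1: expect a call record */
        Pcase 2:  bill_t bill;           /* tag 2: expect a billing record */
        Pdefault: Pstring(:'|':) other;  /* anything else: leave uninterpreted */
      }
    };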
So here is one further construct that allows you to define descriptions in which later parts of the format depend upon earlier parts of the format. Okay.

So that's a basic summary of a bunch of the features that you can use to describe these formats, and the key ideas are that in addition to a grammar, you also get a description of what the type of the internal representation of the data is going to be. And as a consequence, you can write well-typed functions that do things like analyze constraints.

So there are a number of advantages to this approach, I think. The syntax is familiar and relatively intuitive to people: struct means a sequence of things, array means a sequence of things, union means a variety of choices. It provides some readable, executable documentation of a format, like a grammar. We've also developed a formal semantics of the system that I won't get into, but you can ask me later if you want. And it's been used internally at AT&T for doing things such as taking log files, cleaning them, and putting the results into formats that can easily be written into databases.

But it still takes a long time to write these descriptions by hand. And there is some investment that you need to make in order to learn this little domain-specific language. It takes experts -- it's much quicker for experts to use than for novices. Hence in comes PADS version 2.0, where we'd like to investigate skipping the handwritten description step and simply going directly from raw data to tools, such as a converter to XML, or a query engine, or a tool that will graph the results based on some specification. So here we have just a bunch of data files containing our ad hoc data. We send those into our format inference engine. Out the other side pops a data description. We can pass that through our compiler, go ahead and generate tools such as the profiler or the XML converter, and take our raw data and spin it through the other boxes and get out an error report, or a profile of what the likely values are, or an XML format.

So the next section of the talk is about how we've gone about implementing this format inference engine that takes the raw data and spits out the PADS description automatically. Okay. And this inference engine is designed as a series of phases. The first phase breaks up the raw data into a bunch of parts for analyzing the repeated structure. The second phase, which we call structure discovery, generates a simple candidate format that we hope is fairly close to being a good format. And then there is a cycle that we go around, scoring the candidate that we've generated and applying a bunch of rewriting rules that attempt to optimize that score. Okay.
When we've done that and we've come up with a format that we can't optimize any further, we spit it out, and we'll go through the rest of the cycle. Okay. So I'm going to talk about each of these phases -- yep?

>> Question: Can you guarantee that the format that you spit out will be a guaranteed representation of the input data, or can --

>> David Walker: Yeah. It will parse all of the data that it was given as a sample. But if there's more data, it may not correctly generalize.

Okay. So step one, what's the chunking and tokenization process? Breaking up the initial data into chunks is simple: there's just one chunk per file, if you give it a collection of files, or there's one chunk per line. Okay. So I'm going to have a simple running example here, where we have some file with quote, number, comma, number, quote, or some names and commas, and we'll first break that up into a series of lines, or chunks. Okay. Next step: we run a tokenizer, or lexer, over our chunks, and right now our system has a configuration file that you can manage yourself that allows you to express whatever tokens you want in terms of a set of regular expressions. The default tokenizer that you get with the system, again, is skewed towards systems data. So it contains things like integers, white space, punctuation, strings, and a bunch of things like IP addresses, certain date formats, times, MAC addresses. Okay. So, right, we convert our chunked data into a series of tokens.

>> Question: Well, already there is some ambiguity in how you classify the data.

>> David Walker: Absolutely. Yeah.

>> Question: And the -- again, it's a first match.

>> David Walker: It's the same principle as lex right now.

>> Question: Okay.

>> David Walker: Although --

>> Question: Oh, so not --

>> David Walker: So it's longest first match, given the set of rules. And we actually have -- I'm not going to talk about it here, but we've actually gone and looked at using some machine learning techniques to try to learn characteristics of what's likely to be an IP address, what's likely to be a domain name, and at this point in the process, not rigorously deciding upon a single tokenization, but basically collecting up the DAG of all possible parses.

>> Question: Uh-huh.

>> David Walker: And then running that DAG of all possible parses through the system. And one reason that I'm not talking about it, actually, is that I think the results have been really mixed, in that it takes a lot longer to use these machine learning techniques and the results are only better some of the time. But it's the kind of thing where I'd love to work with someone who really knows how to do machine learning well. Perhaps they know a lot more than I do about how to resolve some of the problems. Yeah.

>> Question: So are there defaults or some knobs or something that you use to control where (inaudible)? In other words, here you've got quotes around things, so one usage of quotes might be, you know, these are just comments and I just want a string. But in other cases the quotes are supposed to be significant delimiters and you're supposed to keep on analyzing.

>> David Walker: Um, so we don't have any such knobs. Right now we don't treat quotes as opaque, and we analyze the internal structure.
Although at the end -- based on the rewriting that I'll talk about in a second -- we make certain decisions that a type we've discovered is too complicated for the data it describes, and we back off on doing things like having a complicated type inside a pair of quotes, for instance. So in general we have a process for backing off when the types are too complicated for the data they describe and we've come up with what we consider to be an overly verbose description. Yes?

>> Question: You do understand things like matching quotes, I take it, though. In other words, you must have certain assumptions about beginning and ending --

>> David Walker: We do. So, like, two slides from now I'll tell you about our algorithm, and it's a top-down algorithm. And as part of the top-down algorithm, at certain points we view what's in between quotes as opaque at one level of the algorithm, but then we'll break it open at the next level of recursion. But I'll show you in a second. Okay. All right, I said all that stuff. Okay. So that's basically just the setup, and tokenization is basically the hardest thing to do -- the results of everything else depend heavily on how well you do in this space. So we'd love to improve it, but I don't exactly know how at the moment, but that's okay.

Okay. So how does this structure discovery algorithm work? It's a top-down, sort of divide-and-conquer algorithm, and at each level it will compute various statistics from the chunked and tokenized data. Then it's going to guess, from those statistics, a top-level description: this is a struct, or this is a union, or this is an array. Then it will partition the tokenized data into smaller chunks, recursively reanalyze the data in each individual chunk, compute new statistics, and recursively apply the algorithm. And it's going to bottom out when we identify that we just have a base type in our set of chunks.

So to give you a little picture of how this works: we start out with nothing so far; we have these initial sets of chunks. Now what's going to happen is we're going to notice that every single one of these chunks has two quotes and a comma in it, and that's going to cause us to guess that the current data is best described as a struct with a quote, something else, a comma, something else, and a quote. We're going to run through each of these lists of tokens and partition it such that we get this set of chunks here matching this guy and this set of chunks here matching that guy, and then we recursively go and analyze those two other sets of chunks. So here we go. This is from the last stage. Now we recursively recompute a bunch of statistics, and this time we guess that our description is a union of two things: either something over here or something over here. We partition the elements of the union based on the first token that we see in the two sets of chunks that we have. So in this case all the strings go on one side and all the (inaudible) go in the other. We do the same thing with the other question mark on the right-hand side of the tree, and at this point the set of chunks in each of these boxes is all just a set of base types, so we can bottom out and say: aha, this is an int, this is a string, this is a string, and this is an int. Okay. So the description that we first discover here is going to be a quote, a union, a comma, a union, and a quote. Okay.
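[For reference: the recursion just described, sketched as toy C. This is a simplified reconstruction rather than the real system: tokens are single characters, the struct-like test is simply "same nonzero count in every chunk", and the union and array cases are elided.]

    /* Toy top-down structure discovery over tokenized chunks. */
    #include <stdio.h>
    #include <string.h>

    enum { MAXCHUNKS = 16, MAXLEN = 64 };

    static int count(const char *s, char t) {       /* occurrences of token t */
        int n = 0;
        for (; *s; s++) if (*s == t) n++;
        return n;
    }

    /* "Struct-like": token t occurs the same nonzero number of times in
       every chunk (full coverage, narrow histogram). */
    static int struct_like(char t, const char **chunks, int n) {
        int c0 = count(chunks[0], t);
        if (c0 == 0) return 0;
        for (int i = 1; i < n; i++)
            if (count(chunks[i], t) != c0) return 0;
        return 1;
    }

    static void discover(const char **chunks, int n, int depth) {
        if (n == 0 || chunks[0][0] == '\0') return;
        /* Look for a struct-like token; if found, split every chunk at its
           first occurrence and recurse on the two halves. */
        for (const char *p = chunks[0]; *p; p++) {
            if (!struct_like(*p, chunks, n)) continue;
            char left[MAXCHUNKS][MAXLEN], right[MAXCHUNKS][MAXLEN];
            const char *l[MAXCHUNKS], *r[MAXCHUNKS];
            for (int i = 0; i < n; i++) {
                const char *cut = strchr(chunks[i], *p);
                size_t k = (size_t)(cut - chunks[i]);
                memcpy(left[i], chunks[i], k); left[i][k] = '\0';
                strcpy(right[i], cut + 1);
                l[i] = left[i]; r[i] = right[i];
            }
            printf("%*sstruct { ... '%c' ... }\n", 2 * depth, "", *p);
            discover(l, n, depth + 1);
            discover(r, n, depth + 1);
            return;
        }
        /* Otherwise: a base type, or (elided here) partition into a union
           by first token and recurse on each bucket. */
        printf("%*sbase/union over %d chunks, e.g. \"%s\"\n",
               2 * depth, "", n, chunks[0]);
    }

    int main(void) {
        const char *chunks[] = { "\"0,24\"", "\"bar,end\"", "\"foo,16\"" };
        discover(chunks, 3, 0);  /* finds the quotes and the comma first */
        return 0;
    }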
So the interesting part is how we do this guessing, to come up with a decision that this is a struct, this is a union, this is an array. The primary thing that we do is take our chunks and compute a histogram for each different base type: we compute the percentage of the records that have a certain number of occurrences of that token. So, for instance, the quote token appears twice in 100% of the records; it appears once 0% of the time and zero times 0% of the time. The comma token appears once in 100% of the records. The integer token appears once 33% of the time, twice 33% of the time, and zero times 33% of the time. The string appears once 33% and twice 33%. Okay. So once we have these histograms, what we do is cluster sets of tokens that have similar histograms. And before we do any of the clustering, we actually normalize the histograms so that they're in terms of descending size of their columns. So that means the quote histogram is similar to the comma histogram, because if you normalize by descending size the histograms look exactly the same. Okay. So it turns out that this is a group and that's a group. Okay. So once we have our groups --

>> Question: (Inaudible) the exact equality -- like if in one or two tuples there was some comma missing, or one field less, you might nonetheless say --

>> David Walker: That's true. So, yeah, we use this symmetric relative entropy function, which I got from one of the machine learning guys in our department, which is a good way of comparing two histograms. Two histograms are considered the same if their symmetric relative entropy is within some small delta. Yeah. So if there are a few errors in the file, that will not prevent you from classifying things as being in the same category.

Okay. So once we have these groups, what we then do is try to find a group that we have high confidence in, that has a particular struct-like or array-like characteristic. So things are struct-like if they have high coverage, meaning they're in a lot of the records, and if they have narrow distributions, meaning there aren't very many columns in the distribution. So quote and comma very much satisfy the struct-like criterion, because they're in all of the records and they have the same number of occurrences in all of those records. Okay. So that's a good indicator of a struct, and in fact at this level of the recursion that's the strongest signal of anything that we have. So we'll pick that, then we'll go and split up the data based on that decision and recompute histograms at the next level -- which, because we split up the data, are going to be a lot cleaner than the histograms we had prior to doing that. Okay. So structs have high coverage and narrow distributions; arrays, as opposed to structs, have wide distributions. And unions are the groups where, well, we didn't find anything with these criteria, so we're going to subdivide the data into a union, and hopefully on the next iteration we'll find some better indicator of one of the other structure types.
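[For reference: a toy C version of the histogram comparison. The column-sorting normalization and the two-way Kullback-Leibler sum follow the description above; the smoothing constant and the histogram width are my own assumptions.]

    /* Toy version of the histogram-clustering step: two token histograms are
       grouped together when the symmetric relative entropy of their
       normalized (column-sorted) histograms is below a small delta. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define COLS 8  /* column i = fraction of records with i occurrences */

    static int desc(const void *a, const void *b) {
        double d = *(const double *)b - *(const double *)a;
        return (d > 0) - (d < 0);
    }

    /* Normalize by sorting columns in descending size, so that "always twice"
       (quote) and "always once" (comma) end up with the same shape. */
    static void normalize(double h[COLS]) {
        qsort(h, COLS, sizeof h[0], desc);
    }

    static double kl(const double p[COLS], const double q[COLS]) {
        double s = 0.0, eps = 1e-9;  /* smoothing to avoid log(0) */
        for (int i = 0; i < COLS; i++)
            s += (p[i] + eps) * log((p[i] + eps) / (q[i] + eps));
        return s;
    }

    static double sym_rel_entropy(double p[COLS], double q[COLS]) {
        normalize(p);
        normalize(q);
        return kl(p, q) + kl(q, p);  /* KL divergence in both directions */
    }

    int main(void) {
        /* From the running example: quote appears twice in 100% of records,
           comma once in 100%, the integer 0/1/2 times in a third each. */
        double quote[COLS] = { 0.00, 0.00, 1.00 };
        double comma[COLS] = { 0.00, 1.00, 0.00 };
        double intg [COLS] = { 0.34, 0.33, 0.33 };
        printf("quote vs comma: %g\n", sym_rel_entropy(quote, comma)); /* ~0: same group */
        printf("quote vs int:   %g\n", sym_rel_entropy(quote, intg));  /* large: distinct */
        return 0;
    }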
Okay. So once we've decided that we have two quotes and a comma, what we do is run over all of the chunks and look at how many orderings of the two quotes and the comma there are. For each ordering we'll have one element of a union with that ordering. So in this case there's only one ordering of quote, comma, quote, so the union part is degenerate and we just generate a struct, and we recursively have two more problems that we have to learn using the same algorithm. Okay. So that's how division works for a struct. For a union, what we do is just look at the first token of each line of data and put everything with the same first token into the same bucket, for re-analysis as a union. And when we decide we have an array, we scan for the particular separator token that had the array-like qualities.

Okay. So that's how we generate an initial candidate structure. The next step is to score that structure and to refine it using a bunch of rewriting rules. So we have a large collection of rewriting rules that we've come up with on a relatively ad hoc basis. What these rewriting rules do is merge structures, and identify overly complex elements and eliminate them. Sometimes they add constraints, which add precision, like: isn't it the case that only a certain set of tokens appears? Like in our example of the HTTP requests, you might see "get" and "post" and "link" in the data, and so instead of treating this as a general string, we say it's actually an enumeration. It's also the case that we fill in some missing details, such as what the sizes of arrays are, how arrays terminate, and what the separators for arrays are.

All this rewriting is guided by a scoring function. And what that scoring function does is balance two ideas: one idea is that a description must be concise, and the other is that a description must be precise. Taken to the limit, either one of these alone is unreasonable, but as a combination it seems to work fairly well. You want your description to be concise because people can't read enormous descriptions -- and obviously, in the limit, the data is its own description. And you want a description to be precise because imprecise descriptions don't give you much information: if we just say, aha, this is a string of characters, we haven't done much work in understanding the data.

So we use an idea that's very prevalent in the natural language learning community, which is this idea of minimum description length. The way you balance these two concerns is you look at what the cost is for transmitting the data that is described by your description. And that cost breaks down into two factors. One is the number of bits you need to transmit the syntax of the description. And the second is, given that you've transmitted the syntax of the description to somebody, the amount of information you need to communicate the data given the description. So here, if the description says, for instance, that in this place there is exactly this one character, then you don't need to transmit any information to say that, yes, this is exactly the character that should be there. These two factors help you balance conciseness and precision in a reasonable way. And so what we do is apply this function to the description that we generate, and we just iteratively apply rules that decrease the score that we get. This can lead us to a local optimum instead of a global optimum, but it still seems to work quite nicely.
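[For reference: a toy C rendering of the minimum-description-length trade-off, using the get/post/link example above. All of the bit costs here are invented for illustration; they are not the system's actual cost model.]

    /* Toy minimum-description-length score: cost = bits to transmit the
       description's syntax + bits to transmit the data given the description. */
    #include <math.h>
    #include <stdio.h>

    /* Hypothetical per-record data costs: an enum over k known values costs
       log2(k) bits; a free-form string costs about 8 bits per character.
       (A constant character would cost nothing once the description is known.) */
    static double bits_enum(int k)            { return log2((double)k); }
    static double bits_string(double avg_len) { return 8.0 * avg_len; }

    int main(void) {
        int    nrecords  = 1000;
        double node_cost = 32.0;  /* invented per-node syntax cost, in bits */

        /* Candidate A: the method field is a free-form string, ~4 chars. */
        double a = 1 * node_cost + nrecords * bits_string(4.0);

        /* Candidate B: a rewriting rule noticed only get/post/link occur, so
           the string becomes a 3-way enum: bigger syntax, far cheaper data. */
        double b = 3 * node_cost + nrecords * bits_enum(3);

        printf("string candidate: %.0f bits\n", a);  /* 32 + 32000 */
        printf("enum candidate:   %.0f bits\n", b);  /* 96 + ~1585 */
        return 0;  /* the rewriting engine keeps the lower-scoring candidate */
    }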
Okay. So we've done some evaluation of these ideas on a whole collection of benchmarks. Most of the benchmarks, again, are drawn from this domain of systems logs and things like that, that we're most interested in. And the files -- there are web server logs and various logs that we found on our machines, as well as a few things from AT&T -- all range from a couple hundred lines, usually, to a couple thousand lines. Okay. And so one of the things that we did was try to measure the correctness of our analysis in terms of how much data we needed to generate descriptions that were appropriately general -- in other words, that given some small amount of data would parse 90% or 95% of the rest of the data that we had from that source.

>> Question: For all these data sources, do you know a priori that there exists some PADS description that will describe them?

>> David Walker: Yeah, yeah. So for all those data sources we wrote our own PADS description by hand first. And we also tried to do some qualitative -- we have a qualitative measure of how well the machine-generated descriptions do versus the handwritten ones. It's hard to eliminate our own personal bias from such an analysis. Right. So what I really should have here is another line that says how much data we need to generate descriptions that correctly parse 100% of the data. But I don't have that. So, sorry. Okay. Well, anyway, I think the bottom line is that a lot of these data files are 1,000 or 2,000 lines long, and it takes about 5 or 10% of the data for the algorithm to generate a description that works almost all the time.

In terms of execution time, it takes under a minute for our algorithm to work on any of these data files, which are about 1,000 to 3,000 lines long. It depends a little bit on the complexity of the format, though. So the most time-consuming ones are ones like ASL.log and AI3000. This one's actually 3,000 lines long and this one's only 1,000 lines long, and, you know, this other one takes quite a bit longer. So a lot of it depends upon the complexity of the format. But generally, if you're a programmer and you say, okay, I want to go and manipulate this data, you can press a button, wait 30 seconds, wait a minute, and you have a format in PADS that you can run your compiler on and continue your work. So it's fully within the normal, you know, compile cycle in terms of a programmer getting work done. And just as another reference point, the last column here suggests how long it took us to write a version of the format by hand and debug it. Okay. So one thing is, my post-doc did this when he first got here, and so this one here is the time that includes downloading and installing the system, getting it all to work, and reading some of the manual. So we can say it takes a couple of days for a first-time user to learn the system. And after that, writing a description in PADS for some data source that's not too complicated takes maybe half an hour, maybe two hours, something in that range -- versus one minute for the automated analysis.

>> Question: How long are the handwritten descriptions typically?

>> David Walker: How long are the handwritten descriptions typically? Um, I don't know -- 100 lines of code, that's my guess. It can vary. It can vary a fair amount. Okay.
One other thing we did was take a look, at least for small files, at what the scaling is like. Roughly speaking, as long as we don't run out of memory -- and we don't on any of these small files, but we would on files that were much larger -- the process seems to scale roughly linearly with the number of lines. And more important is the constant factor associated with the complexity of each line in the description, and then the number of lines in the file, for instance.

Okay. So that's what I've been doing for the last couple of years. What's next? Well, there are a couple of things that I'm really excited about doing next. One thing is I'm working with a couple of my friends on scaling the inference system up so that it can handle sources that are millions of lines long, like they have at AT&T. So in some sense the automated analysis is potentially much more useful for the data sources that are really, really huge. Kathleen wanted to create a description for some source that was tens of millions of lines long for use internally at AT&T, and one thing she found was that after about one and a half million lines the format completely changed. So it was a complete nightmare for her to go ahead and attempt to write that description, which partly motivates this work.

>> Question: Scaling meaning optimizing the algorithms, or do the descriptions just tend to be really complicated?

>> David Walker: It's the algorithms -- the amount of data that they use. What we're doing to scale them up is simply to apply the algorithm in batches: run the learning algorithm over the first bit of data, generate a candidate description, then run it over the next bit of data, and wherever there are errors -- basically populating a data structure in a way that accumulates data at particular nodes -- solve more learning problems in the incorrect spots, add the new descriptions back into the overall description, and then apply some other rewriting rules in ways that don't completely change the entire description that you have.

>> Question: It seems like it would be perhaps a bit (inaudible), where you have all these workers working on this huge file, each coming up with a description, and then the reduction somehow has to deal with --

>> David Walker: How do you merge.

>> Question: Merging these together, and merging might involve making a new union or a new struct in different places or something like that.

>> David Walker: Yeah. That's a slightly different way of doing things. Our current way is: given a starting description, which could be written by hand or could be machine generated, what we try to do is preserve as much of the existing structure as we can while making the necessary changes. And I think it's easier for us to preserve the existing structure -- well, you could still do it with a map, or still farm that existing structure out to many, many different nodes. But it's a slightly different algorithm from two things independently coming up with potentially completely different descriptions and trying to merge them.

>> Question: Uh-huh.

>> David Walker: As opposed to localizing where in the description the errors occur.

>> Question: Yes.

>> David Walker: And relearning only that part of the description. Two slightly different techniques, and we're currently working on the first one, but we did actually consider the second one. We just thought it would be technically trickier, but whatever. Yeah, you had a...

>> Question: Yes.
I'm still trying to understand the difficulty in scaling the inference. So the issue is not the inference itself. Suppose there's a gazillion-line data source.

>> David Walker: Uh-huh.

>> Question: (Inaudible) include that you have alternating (inaudible) --

>> David Walker: Okay.

>> Question: Maybe to do the inference you just need, like, maybe 10 or 20 of those records, right? And you would be able to do it faster. You don't really have to look at the entire thing.

>> David Walker: Um, yeah, if you're lucky, right, you can look at the first few lines, learn the right description, and then just validate that the rest of the data matches.

>> Question: So probably this happens when -- if you put these logs in terms of lines, the individual record (inaudible) you have to recognize is very large, probably.

>> David Walker: Yeah. Or you just pick the wrong set of lines to look at initially -- you know, there is some change in some protocol that happens halfway through. Something that Kathleen has found happens sometimes in these logs.

>> Question: Just pick the first (inaudible) to sample.

>> David Walker: Right now we could easily sample randomly, but we're just going to pick the first set of lines and then go through and validate the rest of the data against the description, and yeah, if all the rest of the data is exactly the same, then there's no change that needs to be done. But there are definitely cases where that just doesn't happen. I mean, these descriptions can grow to be quite complicated. One part that I didn't mention is that one of the things the rewriting phase does is make a table out of all the data and look for functional dependencies: is this column functionally dependent upon this other column in the data, basically? And if it is, then we'll put that into the description. So that kind of algorithm is using all the data at once, and it's comparing all columns against all other columns. And what we want to do is basically summarize that information in a chunk and then, in the next chunk, you know, check: did we overspecify because of the specifics of the previous chunk, or does the relationship that we've inferred still hold? It would be too slow to -- yeah.

>> Question: So when you say lines -- in some sense, because you're doing the lexing, you're not really thinking in a line structure, right?

>> David Walker: No.

>> Question: I mean, if I'm outputting sort of information about messages being sent and received, and the message format is multi-line text --

>> David Walker: Yeah.

>> Question: Right. That's not a problem.

>> David Walker: That's not a problem, no. Yeah. So one thing that's --

>> Question: There are hundreds of messages, and the orders of those messages differ, and the fields of the messages change. I mean, then that is a little bit more complex.

>> David Walker: Yeah. On our to-do list is being able to specify the paragraphs -- the chunks -- that you want your learning system to process the data in. At the beginning of the algorithm you just need some way to get started with a set of candidate chunks that you have some reason to believe will have some commonality.
>> Question: So are you saying that, the way it works right now, if you did have a case with some multi-line things, then it might chunk just at the lines and not notice the correspondence -- the sort of togetherness -- of the groups of lines?

>> David Walker: Yeah, that's right. It won't --

>> Question: So it won't measure -- or it won't find structure across lines?

>> David Walker: It won't currently find structure across lines. What we would need to do is give it some reason to break things up differently.

>> Question: I'm assuming that people often have to go in and at least give meaningful names. This would find structure --

>> David Walker: Yes, that's right.

>> Question: But obviously it's going to have, you know, just made-up names for everything, and the XML file would just look like gibberish.

>> David Walker: Yeah. Oh, yeah, I kind of stopped here in the middle. But the next thing I was going -- I forgot, I don't know, I guess I was -- whatever, what a great talk. No, but actually one of the next things I'm really interested in doing -- and this is what my student Qian is working on -- is this. There's this problem with the handwritten descriptions, you know, that it's slower. With the automated ones, you definitely have a problem with these machine-generated names, which makes the resulting description look more complex, and oftentimes, you know, just generating the description isn't the end result. If you are cleaning some data that you want to put into a database, you might have to refer to different fields, do some small transformations on those fields, normalize dates or times, or any variety of other things. And so you need to be able to get some hooks on the different bits of the data that you're interested in. So what Qian is looking at is sort of a fusion of the two ideas -- partly handwritten, partly automated -- in which you take your data and you start editing it to insert various bits of description inside it. So here is an example. Here are my hockey stats, and what I'm interested in is the name of my player, the age, and the salary, for instance. And what this line here says -- the star says that I expect to see after this line a repeated set of records that match this format, and I'd like to tag the name field, the age field, and the salary field. So you do this almost as though -- I mean, one way to think of it is sort of like you're working in XML, but you have raw text as a starting point, and you're placing your own little tags with some additional information in terms of how to read the rest of, or parts of, the description.

>> Question: Formatting by example.

>> David Walker: Formatting by example, that's right. It's formatting by example. So, yeah, you're using the structure and the data that's already there to allow you to elide tons of stuff you would otherwise have to write. And then the next part of this is to couple it with a new scripting language that can refer to the various tags that you've inserted into the different places. So I sort of have some vague ideas of basing this more or less around XQuery, and here is a little script -- this is the way-out-ideas part of the talk. I don't know what LINQ is.

>>: Oh, gosh.

>> David Walker: Sorry.

>>: We'll have to talk.

>> David Walker: Okay.

>>: You really should know what LINQ is.

>> David Walker: Okay. Tell me what LINQ is. All right.
Anyway, so this is a little script -- maybe it does what LINQ does -- but it selects from the stats field, it's ordered by the name field, and it prints out data that is a name followed by some spaces, the age, and the salary. So, there you go. So there's that idea.

And then another set of things that I'm thinking about is moving towards describing not individual files, but collections of files. At Princeton and AT&T, everyone who builds a distributed system also builds their own little monitoring system, which goes and collects up log files from all over the place. And they often have these complex directory structures. Mike Freedman, as an example, has some complex directory structure that has one directory for each machine on PlanetLab that he's interested in monitoring, and then subdirectories for times, and then more directories for different kinds of information. And what I'd like to do is extend PADS such that it doesn't specify just one file, but entire parts of the file system, and generates for you an interface against which you can query the files that are in the file system, manipulate them, transform them, get information about them. Another thing we could possibly do is generate the monitoring infrastructure that goes and fetches all the data from the various different places and archives it for you centrally. So, yeah, those are some things I'm interested in doing next.

There are a number of bits of related work. In terms of languages, there are a number of languages that were created in the networking community some time ago -- DataScript and PacketTypes are two of them -- that allow you to describe binary data, use this sort of type-based metaphor, and generate internal representations for you. So PADS is inspired by a lot of those efforts. I think some of the new things are the way we do error processing with the parse descriptors. We also have a paper that describes the semantics of a number of these different languages. And in terms of the learning work, we have borrowed ideas from a number of different people in the machine learning community and people who are learning the structure of XML. So the idea for our top-down recursive structure discovery algorithm is based on ideas by Arasu and Garcia-Molina, who looked at how to extract data from web pages. And we coupled that with ideas other people have had about using the minimum-description-length principle to optimize the descriptions that you get.

Okay. So, in summary: PADS 1.0 is a language where you can write down descriptions by hand. PADS 2.0 improves productivity by automatically generating descriptions. PADS 3.0 is going to be better yet, though we're not exactly sure what it's going to do -- hopefully it will combine some of the ideas from each of those and give you the control of the handwritten version with the efficiency of the automated one. All right. Thanks.

(applause)

>>: Any questions?

>> Question: Is there a download? Where do I --

>> David Walker: There is a download. Do we -- oh, here we go. Look at that. Yeah. So you can go to www.padsproj.org, and we have a couple of demos online that you can take a look at if you want. So there is a demo of the basic PADS infrastructure.
And that will show you a handwritten description and some of the tools that we can generate, like an XML converter. And then there's this learning demo, where there's a bunch of different files that you can select from and look at, and then you can click the button and see the descriptions that you get out, and then see what happens when you apply one of the tools using those descriptions.

>> Question: So when you combine the handwritten (inaudible) and the description from the inference, so (inaudible) description, or say it's just (inaudible), that description doesn't actually (inaudible) --

>> David Walker: So right now we're building a new system which will do this incremental processing. Given a starting point that's a description -- either one that's been learned or one that's been written by hand -- apply it to segment after segment of a large data source, and if it parses the large data source, great. If it doesn't, then it finds the places where the description is wrong, and it will relearn those nodes and rewrite them and include them. So does that answer your question? Yeah?

>> Question: If, for example, the description is generated by the kind of, like, the inference of (inaudible) data --

>> David Walker: Okay.

>> Question: Then the possibility of that description is (inaudible) higher than the (inaudible) seems to be kind of like a (inaudible) description being the (inaudible), so how do you --

>> David Walker: So we do have a tool, this accumulator tool, for when you have a handwritten description. You can apply it to a data set, and it will give you a list of all the errors that it found, and you can look at whether each one is an error in the description or an error in the data.

>>: Thank you.

>> David Walker: Okay, good, yeah.