>> Ken Eugro: Hello everyone. I am Ken Eugro from the Embedded and Reconfigurable Computing Group here in Research Redmond, and today it's my pleasure to introduce Louis Woods. Louis is a third-year graduate student at ETH Zürich, and he also used to be one of our interns here. His research interests include FPGAs in the context of modern databases, parallel algorithms, stream processing, and pattern matching. With that, Louis.

>> Louis Woods: Thanks, Ken, for the introduction, and thanks everyone for coming. The work that I am going to present is joint work with Jens Teubner and Chongling Nie from the Systems Group at ETH Zürich. The title of this work is Skeleton Automata for FPGAs: Reconfiguring without Reconstructing.

Let me start with a very brief introduction to FPGAs. I know that many of you know what they are, but I like to think of them as soft hardware: soft in the sense that you can change them after manufacturing, but once they are programmed, they behave like real hardware. Conceptually they consist of two layers: a logic layer, which is just an uncommitted pool of hardware resources, and a configuration layer on top, which lets you define how these hardware resources interact and together construct a real circuit. By a software update, the behavior of the circuit, or really the layout of the circuit, can be changed.

In my research I am particularly interested in how this type of hardware can be used in database systems. How can data processing systems benefit from it? One question is how you integrate it into an existing system. Do you use it as a coprocessor, where you load data over to it to do a lot of number crunching, or do you do something like I'm proposing here, where you integrate it into an existing data path? This data path could be the network, or it could be a direct link to storage, and you can take the chip, put it into this path, and have it process the data in a streaming manner as it flows through to the CPU anyway. For instance, as I show in this example, you could use it to pre-filter a lot of data, or you could use it to do precomputation or auxiliary computation on the data as it flows by.

How is the work between the CPU and the FPGA divided? The goal is to extract the parts of your program which the FPGA can handle efficiently, but these are probably rather simple things. You still need the CPU because it's much more flexible; in a full-blown query engine, you cannot offload everything to the FPGA. So the key challenge is how you make this separation so that it works well.

There has been some research on using FPGAs for databases, and there are also some commercial systems that do this. I want to present two extremes, in a sense. On the one hand there is something like the Glacier compiler. Glacier was developed in our group a while back, and it is a query-to-hardware compiler: you give it an algebraic query plan and it translates it into a circuit defined in a hardware description language such as Verilog or VHDL. However, you cannot put this directly on the FPGA; you have to do a number of steps, such as synthesis and place and route, to map the circuit you defined to the particular FPGA device. And this takes some time. It is not something that you can do on the fly.
So this approach works well in, for instance, a streaming scenario where you have long-running standing queries which you want implemented in hardware. Say you are in a network intrusion detection system and you have certain patterns that you want to detect on a data stream; you might get new patterns as time evolves which you want to put on there on a nightly or weekly basis, but not very frequently.

At the other end there are systems like Netezza, a data warehouse appliance which also uses FPGAs inside its architecture. Here you don't compile full queries into hardware; you rather put a fixed set of operations onto the FPGA, such as decompressing data efficiently to speed up transfer times, or simple filtering of data, such as projection-based or selection-based filtering. So you don't fully compile queries; you just do a little bit of it, and the more complex stuff is done on the CPU.

The work I'm going to present here is about this type of filtering done on XML, using a technique called XML projection. This technique extracts filtering expressions from a query and then pre-filters the data before the final XQuery engine runs on a reduced set of data. The goal of this work is that we don't want the compilation overhead which we saw in Glacier, and yet we want to make this filtering as expressive as possible, so we don't want to just filter on [inaudible], so we have here a more complex problem where we really want to have…

>>: [inaudible] confirmation on it? It sounds like [inaudible] joint something benchmark [inaudible] benchmark? How long does [inaudible] to generate [inaudible]?

>> Louis Woods: If you do the full compilation, this will definitely take several minutes, and then it depends, of course, on how complex your circuit is; it can even take hours, but it will definitely take a number of minutes. Yes?

>>: So you have [inaudible] optimizer data plan and [inaudible] takes [inaudible] plan [inaudible] and then there's a cost of [inaudible] and that is one [inaudible] and the second is [inaudible].

>> Louis Woods: So the compilation, that first part, is not an issue, but the second part is the issue, because the complicated thing is that you are now taking an abstract circuit description and you have to translate it to the hardware which the FPGA provides you. You have to do the routing, and this really is a complicated process. This is not something you can do efficiently. So from the [inaudible] to the bitstream and loading the bitstream onto the FPGA, that's a process which takes quite some time. The alternative is to not reprogram the FPGA but to just have a generic circuit which you can modify the way you like.

Next I am going to give an introduction to what XML projection is. I already highlighted the idea a little bit, but let's run through an example. Here I have an XML document, and on the right-hand side I have an XQuery. What this XQuery does is, from this document, it selects all of the item elements which are descendants of the region elements, and then for each of those items it returns a new item. You can do this in XQuery like this: you generate a new item which contains the name element that matched, so the name is the whole subtree of name.
Here it is just a text [inaudible], but it could be an entire subtree, and it generates a new element that contains an aggregate, which is just a count of these two incategory elements. So this is a complex query; XQuery is a very complex language, and we cannot run this full query on an FPGA. However, in 2003 there was a paper by Marian and Simeon about a technique for projecting XML documents, and they showed that from any query you can statically infer a set of so-called projection paths. These paths are expressed in a subset of XPath, and they define exactly which parts of the document will be touched by the query. So before you run the query, you can quickly extract these paths and then filter out all of the irrelevant parts of the document. Then you run the XQuery on the filtered resulting document and it will give you the same result.

If we take this idea to the hybrid architecture which I'm proposing, then the server running a full-blown XQuery engine, such as Saxon or MXQuery or you name it, extracts these paths before it runs the query and puts them on the FPGA, and this has to happen quickly enough for the approach to make sense. Then the data is streamed not directly to the server but through the FPGA, and the idea is that the FPGA really operates as a stream processing network which reads the original document on one end and produces the filtered document on the other end. That's the key idea.

With that, the next question is how we put these filtering expressions on an FPGA. How does the corresponding hardware look? Here is such an expression, where I say: from the root I want to match a descendant A which has a child B which has a child C, and that shall have a descendant, meaning somewhere in its subtree, an element D. This is a similar problem to regular expression matching. From a regular expression I can construct a nondeterministic finite state machine or a deterministic one, and the example here is a nondeterministic one where I say: if I match A, I go into the next state; if I match B, I go into the next state; if I match C, then I have a Kleene closure, meaning I will stay in this state. I've omitted one thing here on purpose, namely that this is not quite the full story. What is missing is that we are not keeping track of closing tags. In XML we actually have to keep track: once I have matched A, B, C, those tags are open, and when tag C closes again, I have to take care of this somehow. This is typically done with a stack, and I will later show how we added the stack as an implementation detail in our system.

Anyway, for now, for simplicity, if I just take an NFA like this, how can I translate it into hardware? Here I give you an example. What you have on the FPGA is flip-flops and logic, and you have a lot of both. We can store the state of this nondeterministic finite state machine in one flip-flop per state, telling me whether or not this state matched. So if I go from Q0 into Q1, when I match the predicate A, I can put a one into this flip-flop, and with the gates I basically implement the transitions between all of these states.
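To make this concrete, here is a minimal software sketch (purely illustrative, in Python) of the NFA just described for a path like //a/b/c//d, ignoring closing tags exactly as the slide does at this point. Each boolean plays the role of one flip-flop, and the update expressions play the role of the gates between the flip-flops.

```python
# Minimal sketch of the NFA for //a/b/c//d, ignoring closing tags for now.
# state[i] corresponds to one flip-flop; the update logic corresponds to the gates.

def run_nfa(open_tags):
    # state[i] == True means "state Qi is currently active"
    state = [True, False, False, False, False]       # Q0 is the start state
    for tag in open_tags:
        q0, q1, q2, q3, q4 = state
        state = [
            True,                         # Q0: start state, self-loop (descendant axis)
            q0 and tag == "a",            # Q1: matched //a
            q1 and tag == "b",            # Q2: matched /b
            (q2 and tag == "c") or q3,    # Q3: matched /c, Kleene self-loop before //d
            (q3 and tag == "d") or q4,    # Q4: matched //d (accepting, stays set)
        ]
    return state[4]

print(run_nfa(["x", "a", "b", "c", "x", "d"]))   # True
print(run_nfa(["a", "c", "d"]))                  # False (no b between a and c)
```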
Here I am assuming an external tag decoder, some black box which reads tags and outputs them. In our implementation we actually have local replicas of the tag decoders because that works more efficiently. The gates, anyway, just take the previous state from the flip-flop and the current tag and then decide whether to go into the active state or not.

>>: Will you need a parser?

>> Louis Woods: Yes, you will need a parser, and I will come to that in a second. Here I am just looking at this path expression, knowing that I already have a parser in front of it. And again, as I already said several times, we could of course compile these path expressions into particular circuits and load them onto the FPGA, but we don't want to do this because it would probably take too long. What we observed is that these paths are a restricted subset of XPath; we have, for example, only downward navigation, so the automata, these NFAs which we construct, are always of the same structure. The semantics might change, in the sense of whether I'm matching a tag foo or a tag bar, and what also changes is whether I'm doing a descendant navigation step or just a child step, but other than that the hardware structure really stays the same. We see that we always have pairs of a node test and a navigation step: it's always descendant-and-A or child-and-B. That's the repetitive pattern, and this is exactly what we want to exploit for our idea of a skeleton automaton.

This is really the key idea behind this approach: you define this skeleton, which is the same for all XPath expressions, and load it onto the FPGA once, and you leave out those parts which are specific to a particular XPath expression; you put those on there later. So you load the skeleton onto the FPGA when you boot the engine, and later, at run time, all you have to do with your XPath expressions is extract the semantics and load them into the skeleton. This can be done quickly, because all you're doing is updating a little bit of memory. Again, this should be quick so that you can have highly dynamic workloads; in numbers, you can do this in two or three microseconds versus several minutes of compilation time.

That's the key idea, and from here, in the next couple of slides before I go into the evaluation, I will show a bit more detail of the hardware, because there are some things I haven't yet talked about. In terms of architecture, you asked about the parser: yes, we have a parser in front of this. The XML stream runs through this parser, or actually more of a lexer, which annotates the data stream with lexical information that is then used by the XPath circuits. This parser reads one byte, one character, per clock cycle, and it's just a large state machine which says: okay, this is a tag start, or this is the end of a tag, or the end of a closing tag. This so-called cooked XML then flows through the segment matchers, and this is actually how we exploit the parallelism of the FPGA: it is pipeline parallelism that we are using here. We are streaming the data through this pipeline of segment matchers, which are daisy-chained.
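As a rough illustration of the "cooked XML" idea, here is a minimal lexer sketch (hypothetical Python, nothing like the full byte-level state machine on the FPGA): it only recognizes opening, closing, and self-closing tags and turns the raw text into a stream of open/close events.

```python
# Minimal lexer sketch (far from a real XML parser): turn the raw stream into
# ("open", name) / ("close",) events, i.e. the "cooked XML" mentioned above.
# It ignores text content and attributes and assumes well-formed input.

def cook(xml):
    events, i = [], 0
    while i < len(xml):
        if xml[i] != "<":                       # skip text content
            i += 1
            continue
        j = xml.index(">", i)                   # end of this tag
        body = xml[i + 1:j]
        if body.startswith("/"):                # closing tag </name>
            events.append(("close",))
        else:
            name = body.rstrip("/").split()[0]  # tag name, ignoring attributes
            events.append(("open", name))
            if body.endswith("/"):              # self-closing tag <name/>
                events.append(("close",))
        i = j + 1
    return events

print(cook("<a><b x='1'><c/></b></a>"))
# [('open', 'a'), ('open', 'b'), ('open', 'c'), ('close',), ('close',), ('close',)]
```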
A small detail here at the end is the serializer. I don't want to say too much about it, but it makes sure that what comes out of this engine is valid XML, because we want the XQuery engine to be completely oblivious to the fact that an FPGA is filtering out stuff in front of it. We cannot just give it the matched parts; we cannot just rip everything out. We have to embed the parts which the XPath expressions matched in a valid XML document, so we have to keep the full path of ancestor nodes, and that's what the serializer does.

The skeleton segment, and this is now about as technical as it will get, is at the core of this. Here is a diagram of the architecture of this skeleton segment. Again, it represents a navigation step plus a node test: am I doing a descendant or a child step, and which tag am I matching? The configurable parts are the tag predicate, so whether I'm matching a tag foo or a tag bar, I have to store that information in a RAM, and a configuration parameter which says which kind of axis I'm doing. Each segment matcher can do both, in fact all four, navigation steps, and we configure which one to do by writing into this thing. Then you see these matchers have data-in and match-in inputs and data-out and match-out outputs, so they are all daisy-chained: the data is passed from one matcher to the next to the next, and the same holds for the match-in and match-out signals. The segment's core logic takes all of this information and decides whether it has a match or not.

The final missing part is this thing at the bottom, the history unit, and this is the stack which I mentioned, the part we are not taking into account if we just look at the regular expression. It is essentially a one-bit stack: if we have a match, we write a one into it, and as we go down in the tree we shift this information to the left; as we go up in the tree we shift the other way around. If it is a child navigation step, we just shift that one up and down, and if it's a descendant navigation step, once I put in a one, I keep shifting in a one. Here we use a 16-bit shift register, which means we can go up to a depth of 16. This is reasonable; you typically don't have extremely deep documents. If you do, you could change this, or on the other hand you could say: if I have an overflow here, I will just stop filtering. That is always my backup; in the worst case I just send you all of the data.

A detail about the configuration and how it is done: for configuring at runtime we use processing instructions, which are a feature of the XML specification. The key is that we embed this configuration into the byte stream as well, so it can be recognized and the segments are actually configured as it flows through. In terms of how fast this is, it can process one byte per clock cycle, so if your XPath consists of 50 bytes, for instance, then the reconfiguration takes 50 cycles, which at my 125 MHz clock translates to 400 ns.

This was all about how to build a single path, with all of the details of how we build it. Now we want to support more than one path; when we analyze such a query we typically have about 15 paths. We don't have to change this architecture a lot to do that.
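To make the segment matcher and its history unit a bit more concrete, here is a small functional model in Python (hypothetical names and structure, not the actual RTL). Each segment keeps one match bit per open element, playing the role of the 16-bit history shift register; the hardware answers the descendant-axis "did any ancestor match?" question with the sticky shift-in described above, while this sketch simply scans the list. The union over several paths, which is discussed next, is modeled by OR-ing the end-of-chain matches of the individual paths.

```python
# Functional model (hypothetical Python, not the actual hardware) of a chain of
# skeleton segments. Each segment = navigation axis + node test; `history` holds
# one match bit per currently open element, the role played on the FPGA by the
# 16-bit shift register in the history unit (which bounds the depth to 16).

class Segment:
    def __init__(self, axis, tag):
        self.axis = axis      # "child" or "descendant"
        self.tag = tag
        self.history = []     # history[k]: this segment matched the open element at depth k+1

def matches(segments, events):
    """events are ("open", tag) or ("close",); returns (depth, tag) of full-path matches."""
    hits, depth = [], 0
    for ev in events:
        if ev[0] == "open":
            tag = ev[1]
            depth += 1
            for i, seg in enumerate(segments):
                if i == 0:
                    # first segment: its predecessor is the virtual document root (depth 0)
                    ctx = True if seg.axis == "descendant" else depth == 1
                else:
                    pred = segments[i - 1].history[:-1]    # predecessor bits for proper ancestors
                    if seg.axis == "child":
                        ctx = pred[-1] if pred else False  # did the parent element match?
                    else:
                        ctx = any(pred)                    # did any ancestor match? (sticky bit in hardware)
                seg.history.append(ctx and tag == seg.tag)
            if segments[-1].history[-1]:                   # local match of the end-of-chain segment
                hits.append((depth, tag))
        else:                                              # closing tag: shift back / pop one level
            depth -= 1
            for seg in segments:
                seg.history.pop()
    return hits

def global_matches(paths, events):
    """Union over several paths: OR of the end-of-chain matches, like the global match signal."""
    hits = set()
    for segments in paths:
        hits.update(matches(segments, events))
    return sorted(hits)

# The path //a/b/c//d plus a second path //x over the same event stream:
paths = [
    [Segment("descendant", "a"), Segment("child", "b"),
     Segment("child", "c"), Segment("descendant", "d")],
    [Segment("descendant", "x")],
]
events = [("open", "a"), ("open", "b"), ("open", "c"), ("open", "x"),
          ("open", "d"), ("close",), ("close",), ("close",), ("close",), ("close",)]
print(global_matches(paths, events))   # [(4, 'x'), (5, 'd')]
```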
We still keep this chain of segment matchers, and all we have to do is pay a little bit of attention at the beginning and at the end of each path. At the beginning we have to make sure that the first segment behaves as if its predecessor were the root node. That's not difficult to do; the details are that you just initialize its history with a one and make sure it is not dependent on a predecessor. The other question is what multiple paths mean for the end result. Here we are saying that if any one of these paths matches, then we have to keep that part of the document; what we want is the union of all of the paths. We just have an additional global match signal, and we can configure the segment at the end of a path to be an end-of-chain segment. At those segments, what happens is what you see here at the bottom: the previous global match signal is merged with the local match, so if any one of the paths has a match at its end, it will propagate all the way to the serializer, which is responsible for outputting the data. Yes?

>>: Path [inaudible] one, two [inaudible].

>> Louis Woods: Sorry, say again?

>>: In your picture you have them in sequence. Are they really in sequence or are they [inaudible] XML file?

>> Louis Woods: They are laid out in sequence. They operate in parallel, but path one will match before path two. If path one matched, that match will propagate along together with the data stream, so the data just flows through this pipeline and the matching information is merged into the stream. So it's all in sequence but operating in parallel; that's what it is.

This is the big picture. We have the parser and the serializer at the two ends, and then we have this chain of segment matchers, and we put as many on there as we can. The FPGA is a 2-D array of resources, and a pipeline like this tends to map nicely to the hardware because there is only local neighbor-to-neighbor communication between the elements, so the tools are very good at figuring out how to place and route this design. You can construct other, more complex designs with connections all over the place, which don't work as well, but this one works very well.

So, a little bit of evaluation; first some performance. We measured Saxon EE; Saxon is a SAX-based XML and XQuery engine, and EE is the commercial version. We measured the speedup in parsing time on this engine with and without this projection, and what you can see is that on a 100 MB XML instance the speedup was around 6x to 8x, somewhere around there. The speedup in parse time from reducing the document was significant, and parse time really is an issue for XML applications. In particular, it is an inherently serial process which is hard to parallelize, so if you can just make the document which has to be parsed smaller, then the parsing, and the XQuery engine which we didn't change at all, will obviously run faster. The benefits in execution time of the query which we measured were not significant. This is known about projection. I think the reason is that once you have your optimized data structures in main memory, the queries run very efficiently over these data structures anyway; they won't touch the stuff you don't need, so there is no benefit there.
In many of the queries the execution time was the much smaller part compared to the parsing time, but in this XMark benchmark, which is a standard benchmark for XML, you also have a few expensive join queries where the execution time dominates. In most of the queries, though, the execution time was less dominant than the parsing. Finally, and this was the original reason for projection, the improvements in memory consumption are again quite significant, since you have a smaller document and the data structures in memory are a multiple of the original document size, so this often really blows up. Yes?

>>: [inaudible] you compared the corresponding [inaudible] condition or [inaudible]?

>> Louis Woods: Yes, I have a backup slide on that; I might as well show it now. This is unfortunately a graph which is a little bit hard to parse, but the commercial version of Saxon has a switch for software-based projection, so it implements this, and we ran this as well; that's what you see with these striped bars here. This is not a stacked graph or anything. The message here is that software projection has no effect on parse time, because the document you parse is the same size. On execution time it doesn't have much of an effect anyway, and for memory consumption we were a little bit surprised that the improvements from hardware projection were much larger than those from software projection.

I have a few more results before I conclude this talk. One question is how many of these segments you can put on the FPGA. Here we used a not very large FPGA, a Xilinx Virtex-5, really on the [inaudible] side, so a medium-size FPGA. I didn't talk about how we used BRAM at all; that doesn't matter, but you have basically two types of resources on the FPGA and we used them roughly in equilibrium. The message is that we could put 750 of these segments on the FPGA, and this is sufficient to run the first 10 XMark queries; to put this in numbers, that means something like 10 to 15 XPaths, where each is using something like 15 to 30 segments. This was sufficient.

The other result I have is scalability. I said at one point in the talk that this maps well to FPGAs, and one way to measure this is to look at how the clock frequency behaves. Does it degrade when I start filling up the chip? At 750 segments I'm really saturating the chip quite a lot, and what you sometimes see is that the clock frequency then really degrades. Here it stays more or less stable, which lets us assume that this would also scale very nicely to a larger chip and that we wouldn't have any problems there. Also, the clock frequency I am showing here is well above our target, which was just 125 MHz for gigabit throughput.

This brings me to my conclusion. What I talked about today was a hybrid XQuery processing engine with the architectural approach of putting the FPGA into the data path. Rather than using it as a coprocessor which does heavy number crunching on the side, where you have to load data onto it and get a result back, here we put it into the path which exists between the data source and the server anyway, and we let it, transparently to the XQuery engine, preprocess the data.
The problem that we encountered was that reprogramming the FPGA in such a setting is not an option, because we want to change what we put on the FPGA very frequently and very fast. The solution that we came up with is to put a skeleton, the part that doesn't change across queries, on the FPGA once, and at runtime you only change the semantics, those parts which are specific to a certain query. Finally, this work is part of our Avalanche research project, which you can find at this URL. In this project we are aiming for hybrid CPU-FPGA co-designs for data processing; that is our ultimate goal, and we are trying to figure out what the right abstractions and the right interfaces are for this to work well. With that, I am happy to take any questions. [applause]. Yes?

>>: Two questions. First, this is entirely about [inaudible] programmability for data, so how much [inaudible] overhead, how much overhead [inaudible], how much clocking overhead do you expect? Is there a cost [inaudible] for making this [inaudible]?

>> Louis Woods: Okay, I see what you mean. I think in this particular case the cost, though I don't have an accurate number, was fairly small. For the navigation steps, on the one hand, you have a lot of resources there anyway; that's the way the FPGA is built, so whether you use an additional multiplexer to decide whether this is going to be a child step or a descendant step, I don't think that will have a lot of impact. And then the BRAMs: for instance, we store the predicate information, which tag you're matching, in the BRAM blocks, and again, these blocks are there; the question is what else you would do with them. Here we are assuming that we can really use the chip as we like; there's nothing else we have to do on it. So if you built a solution without this online reconfigurability, I don't think you could say you could put 10 times more paths on it. I'm not even sure if it would be more, but I think it wouldn't be much more. Was that your question?

>>: I actually had several questions. If you go back to your area graph.

>> Louis Woods: This one?

>>: Yeah. It's not linear. I understand why the zero-intersection point is not zero, because you did some things just to get the system up. [inaudible] control [inaudible]…

>> Louis Woods: Yes, right, the parser, the serializer…

>>: Well, what is the less-than-one slope caused by? The fact that it sort of drops off as you get farther and farther?

>> Louis Woods: Okay, so I think you are right. It is not linear, but it's…

>>: But it's sublinear. [laughter].

>> Louis Woods: Well, I think this has to do with the tools. I think the more you fill this up, the better a job the tools start doing at using the resources. That's my guess here.

>>: So you could actually in these media plates [inaudible] 400, and what it could be using is considerably lower than that, but, well, I have this space, why not expand and fill the space?

>> Louis Woods: That would be my first guess.
>>: I mean, it's a visually discernible [inaudible], which then would mean that when people report logic utilization numbers and they are on a chip that is not even [inaudible] used, in fact they are [inaudible], the results could be quite a bit better [inaudible], and who knows whether or not [inaudible] maybe the circuit they are trying [inaudible] but that…

>>: [inaudible] point posting census estimates as well.

>>: And so that's the question. Is this blowup in synthesis or is it [inaudible]?

>>: I would guess it would be in place and route, because you start off with synthesis and since there is a lot of negative [inaudible] translate and then…

>>: Well no, the synthesis step sees that you're at 2% utilization, so you're going to double all of these pieces to make it faster because you have plenty of room.

>>: Well, the synthesis option is to say [inaudible] area regardless, right? I guess you could try running the graphs [inaudible] options…

>> Louis Woods: That's true, yeah. I think for this one we used just the out-of-the-box options.

>>: [inaudible] other [inaudible] 80% and then keep adding stuff and you get to 90% of [inaudible] and then definitely finding a quick solution and that's good enough as far as that's concerned.

>> Louis Woods: Yeah, I mean the routing, all of these steps: the time it takes to produce a final design also, I guess, isn't linear.

>>: But in general you see these superlinear graphs, not a sublinear graph, and that's the thing that caught my eye.

>> Ken Eugro: Okay, is there anything else?

>>: [inaudible] if I understand right, your programmability [inaudible] special twist to the data path?

>> Louis Woods: Right.

>>: So now you're going to have a programming function and something special [inaudible] and everything else is certain [inaudible], so the overhead in the core, in the core segment [inaudible] programmability has to go ahead [inaudible] so you get 8% [inaudible] point wise that's the same blowup. You don't have a complicated [inaudible] and use the regular streaming network, so that's your [inaudible] overhead, is that right? Then you can have your [inaudible] logic and fixed pockets and [inaudible].

>> Louis Woods: Uh-huh.

>>: I have a question about your benchmarks [inaudible] also. There are a couple that really stand out, the five or six I think it was, or [inaudible], and the shapes of the first and second are [inaudible]. Did you look into [inaudible] why some have better speedups than others?

>> Louis Woods: Not so much. I know that in Q6, I think, really a lot of data is just filtered out. For the parse time, I think the speedup is really due to how much is filtered out, and that depends on the query.

>>: So the queries are progressively harder, evidently, in this graph? Is that right?

>> Louis Woods: Well, I don't know…

>>: [inaudible] vectors up there in that area.

>> Louis Woods: Actually, I don't remember exactly what every query is doing, and the execution times I don't have here. Some queries are join queries, which are hard on the execution time, and there are also some queries where you just can't filter out as much. Yes?

>>: Do you have a graph showing how many segments per [inaudible]?

>> Louis Woods: No, I don't have this graph, but half a second; I have this information with me. I just have to look it up real quick. I hope I marked it so I can find it.
I know it's in this thing, but maybe we have to take it offline because I'm not sure I can spot it right away. I think it was something like between 15 and 73, and the median, I don't remember anymore what we had for the median. There are a lot of numbers in here, but I think it was something like that.

>>: I think [inaudible] this approach is the [inaudible] for complexity reasons. So the system will never say, for complexity reasons I can't support your [inaudible]; it always falls back on: whatever you bring, I am just going to write [inaudible].

>> Louis Woods: Right. Right, as far as…

>>: So I know nothing about, certainly, the [inaudible] and what have you, so is there anything like that in the language that you certainly can't support?

>> Louis Woods: In the XQuery language?

>>: Yes.

>> Louis Woods: Oh, absolutely. There's a lot of stuff which we would have trouble with because…

>>: And so, you know, what fraction of the elements…

>> Louis Woods: I'm sorry, I just want to make sure I got the question right. You mean support on the FPGA?

>>: Right. And so what fraction of those types of things are actually in the benchmarks? I assume that, however large, this is one suite and probably some other suites, so how much of that actually gets expressed and how much of that is actually built into the runtime of the performance of the…

>> Louis Woods: I think fairly a lot gets expressed. They are doing joins, for instance, which would be difficult to do on an FPGA. There are back references, you know, from a child back to a parent; here, with our XPath expressions, this works well because we only go forward, we only go down the tree in the paths. We never have, when we are in a child, a reference up to a parent or a parent of the parent or anything like that. XQuery is a language which allows you to basically express anything, and it has function calls and all sorts of stuff, so there is a variety of things which we could not put on an FPGA.

>>: [inaudible] based on the performance [inaudible]. They are there, but they may not be a significant part of the overall [inaudible], so if you switched [inaudible] execution [inaudible], I mean the six and seven that barely make it to two, at best you're making the job half as big, right?

>> Louis Woods: Right. I really think that the reason for this is that once you build up your main memory data structures, the effect of filtering on query execution is just not that great.

>>: Have you tried isolating the lexing [inaudible], in other words [inaudible] one, two, three instead of in strings?

>> Louis Woods: Ah, so doing this before the FPGA, like binary XML or something like that? No, we haven't looked into that.

>>: [inaudible] 90% of the [inaudible].

>> Louis Woods: Yeah, then I don't know how the parsing speedup would change with binary XML. That's actually an interesting thing to look at as an extension to this. Uh-huh.

>>: Part of this also is that you are doing this [inaudible] and computation in a pipelined fashion, so it's all going on sort of in parallel. I guess there's also nothing that says that you couldn't run the entire system in parallel and just say, okay, I have one parser and one series of segment [inaudible] and then just [inaudible] the whole set of them.

>> Louis Woods: Sure.

>>: There's no real way of sort of [inaudible], because the segment matchers are already sort of parallel in the sense that they are all looking for different things.
I mean, I guess the other thing: this is interesting because the problem statement allows you to basically make one long pipe and cut it off wherever you want; it's like taking a garden hose and cutting it to length. But if you needed to produce certain [inaudible] outputs, here this is interesting because I don't care which one matched; all I know is that something in this thing matched, and so [inaudible] this entire [inaudible] to solve it and put it on the output. But if you had to figure out which one it came from, do you have some idea of how you would maintain that with this kind of cut-to-length…

>> Louis Woods: Well, which one it came from, I think that could be done by adding this information into the stream. I guess you would have to stop the previous one and add in yours, so all of the paths that matched would add this kind of information; then you know which one it came from, if you want to know this. The problem is if you want to route: if your problem is such that you say, okay, if path one matched, then I want to send that data over there, and if path two matched, then I want to send it to somebody else, then you have the problem that if all of them matched, you have to send it to all of them, and how do you do that? But that's a different problem. Go ahead?

>>: So what is the [inaudible]?

>> Louis Woods: The query engine is Saxon here and, what?

>>: Is it based on like [inaudible] just like regular sequence or [inaudible]? I'll rephrase my question: so currently the [inaudible] sort of thing [inaudible] input is XML [inaudible].

>> Louis Woods: Uh-huh.

>>: The output is again [inaudible], so in principle you can store it and then [inaudible], but if the query engine is and what you are doing, maybe you can feed the output of this straight to the operator that you [inaudible] filter, rather than making it [inaudible] document [inaudible].

>> Louis Woods: Sure. I mean, yeah, if you are willing to modify your engine, you can do a lot more. Also, we are parsing twice, basically: we are parsing on the FPGA and we're parsing again in the engine. You could use this lexical information, which we already have, for the final parser, but then you have to modify the engine.

>>: [inaudible] modify the [inaudible]?

>>: And then it gets back to the binary XML suggestion earlier.

>>: I was wondering if you could say something more about the automaton.

>> Louis Woods: The automaton?

>>: Yes, yes, from the example [inaudible], and I was sort of thinking it always looks like this has some sort of [inaudible] when you check some [inaudible].

>> Louis Woods: Uh-huh.

>>: Or is it more complicated? And then, in relation to that, when you say semantics, what you mean by semantics is different labels, so the language there [inaudible] plugging in actual labels and you get different semantics [inaudible].

>> Louis Woods: Uh-huh, exactly.

>>: But of course if you choose different labels you might actually get a nondeterminism [inaudible] which perhaps you didn't have on the [inaudible] the other way around. I'm just wondering, once you fix the skeleton, does how you choose the semantic labels have consequences, or are there certain constraints on that sort of stuff?

>> Louis Woods: Okay. So first of all, yes, this is a specific type of nondeterministic finite state machine, one where you have this forward-only path.
The predicates that we are supporting here are always just tag names. We also looked at, though we didn't evaluate it thoroughly, how you would do more complex predicates, where you say, okay, I'm also looking at attributes: I only want to look at those items which have an age attribute larger than 50. I think the concept is the same, but it makes this predicate matcher more complex; you then have to convert strings to binary and do the comparisons. As long as you don't support things like back references from a child back to the parent, I think you won't have anything that you cannot handle, and in particular the nondeterminism that you have here is not a problem, because you do all of this in parallel. What I want to say is: if I have a nondeterministic state machine in software and I have a lot of active states, then suddenly this can become a performance bottleneck, because I have to update all of these parallel states on one single input item, while in hardware I don't care how many states are active; I operate on them all in parallel anyway, so…

>>: Essentially, for example, here you would get A, but you might [inaudible] Q0 and it might be Q1, so the choices are open.

>> Louis Woods: Exactly, and that doesn't…

>>: [inaudible] state is like a substantive state [inaudible] somehow it doesn't go up.

>> Louis Woods: Yes, and because each state is an individual circuit, that's fine. Each state just looks out for itself, and the overall result is correct.

>>: One more question. Go back to your [inaudible] performance [inaudible]. I'm wondering about basically the [inaudible] here. You have these things broken out as being the speedup of the [inaudible], the speedup of the [inaudible]; how much time do I spend in each? What is the wall clock? If I put this graph together with the other graph, what are my parse and execute times?

>> Louis Woods: This is very query dependent. Like I said, for a lot of these queries you can maybe just stack them on top of each other, but the benchmark also has a few queries, like long joins, where the query execution time really is dominant; so there are a few, Q10 or whatever, where you would be joining for 40 seconds, so you would get a very different picture.

>>: Could you give me something [inaudible] the average?

>> Louis Woods: Yes, okay. [laughter]. I should have used…

>>: [inaudible] [laughter].

>> Louis Woods: Yes, I can do this.

>>: If it's in the paper, I'll read it. It's just that…

>> Louis Woods: No. Here you go. Actual query execution time varies between 68 milliseconds and 41 seconds, and the median over these was 390 milliseconds. That is versus a parse time of 2.5 seconds, just to put this into proportion. And the speedup is what you…

>>: Got ya.

>>: Can you run them in parallel? Because you have 750 segments and you use at most about 100, so you could connect seven of these engines on the FPGA and run seven of those queries in parallel…

>> Louis Woods: Seven queries in parallel?

>>: Provided two things: A, that the engine [inaudible], and B, that the ratio between input data and output data is [inaudible] seven [inaudible].

>> Louis Woods: Uh-huh. So with this, yes, I think yes, but there is one issue, namely: when we load the XPaths, all of the XPaths for one query, onto the FPGA, we basically wipe it and then append all of the paths.
Now, if you have multiple queries in parallel, you could do this, but then one query terminates, you take it out, you leave the other ones there, and you have a hole. So if you did batches of queries in parallel, I'd say yes; but if you say, okay, six queries are still running, I finished one and I'm going to send the next one, there would be an issue of allocating this thing in a proper manner. The other thing is that you then have to decide: you could always union all of these queries, but then the filtering effect of course goes down, because the queries might not look at the same data, so that's another issue.

>>: [inaudible] design, so difference, you wouldn't be running one query and the next [inaudible].

>>: I mean, I'm sure [inaudible] dependent were [inaudible].

>>: Right.

>>: This [inaudible] we would have first, second and third. You have [inaudible] space [inaudible] use one chip [inaudible].

>>: How many queries do we have running on the same document?

>>: But then I would be [inaudible] partitioning it up [inaudible] parallel; how would you do that?

>>: And you also need, for multiple parsing, multiple parsing engines.

>>: Yes.

>>: The whole thing [inaudible].

>>: Yes.

>>: So maybe it's not [inaudible].

>>: [multiple speakers] [inaudible].

>>: [inaudible] different FPGA [inaudible].

>>: Right, right, but being able to [inaudible], that would mean that you would have to partition the system, have insertion points where you could sort of break up the cards into [inaudible], one card into two cards, or take two cards and make one card.

>>: [inaudible] problem [inaudible] one, problem [inaudible] two.

>>: Yeah, yeah, there's a certain granularity, right?

>>: Yes.

>>: You probably also would have to increase your I/O as well. I'm assuming if you ran 750 at a time then the input [inaudible] bottlenecked pretty quickly.

>>: So that's a different [inaudible]. That's a [inaudible].

>>: So you're saying it's the same input?

>>: The same [inaudible], but yeah.

>>: Oh, so assuming that the speedup is roughly proportional to [inaudible] compressing it, then you could just feed all of the seven outputs on the same [inaudible].

>>: Even [inaudible] the fact that you [inaudible].

>>: Yeah, that's true, and that's interesting.

>> Ken Eugro: Let's thank our guest. Louis, thank you very much.

>> Louis Woods: Thank you. [applause].