>> Ken Eguro: Hello everyone. I am Ken Eguro from the Embedded and Reconfigurable
Computing Group here in Research Redmond, and today it's my pleasure to introduce Louis
Woods. Louis is a third-year graduate student at ETH Zürich and he also used to be one of our
interns here. His research interests include FPGAs in the context of modern databases, parallel
algorithms, stream processing and pattern matching. And with that, Louis.
>> Louis Woods: Thanks, Ken, for the introduction, and thanks everyone for coming. The work
that I am going to present is joint work with Jens Teubner and Chongling Nie from the
Systems Group at ETH Zürich. The title of this work is Skeleton Automata for FPGAs:
Reconfiguring without Reconstructing. Let me start with a very brief introduction to FPGAs.
I know that many of you know what they are, but I like to think of them as soft hardware: soft
in the sense that you can change them after manufacturing, but once they are programmed,
they behave like real hardware. Conceptually they consist of two layers: a logic layer, which is
just an uncommitted pool of hardware resources, and a configuration layer on top which allows
you to define how these hardware resources interact and together construct a real circuit. By a
software update, the behavior of a circuit, or really the layout of the circuit, can be changed. In
my research I am particularly interested in how this type of hardware can be used in database
systems. How can data processing systems benefit from it, and one question is, you know, how
do you integrate it into an existing system?
Do you use it as a coprocessor, where you ship data over to it to do a lot of number crunching,
or do you do something like I'm proposing here, where you integrate it into an existing data
path? This data path could be the network, or it could be a direct link to storage, and you can
take the chip, put it into this path, and have it process the data in a streaming manner as it
flows through to the CPU anyway. For instance, as I show in this example, you could use it to
pre-filter a lot of data, or you could use it to do pre-computation or auxiliary computation on
the data as it flows by. How is the work between the CPU and the FPGA divided? The goal is to
extract the parts of your program which the FPGA can handle efficiently. But these are probably
rather simple things. You still need the CPU because it's much more flexible; in a full-blown
query engine you cannot offload everything to the FPGA. So the key challenge is how you make
this separation so that it works well. There has been some research on using FPGAs for
databases, and there are also some commercial systems that do this. I want to present two
extremes, in a sense. On the one hand there is something like the Glacier compiler. Glacier was
developed in our group a while back, and it is a query-to-hardware compiler: you give it an
algebraic query plan and it translates this into a circuit defined in a hardware description
language such as Verilog or VHDL. However, you cannot put this directly on the FPGA; you
have to do a number of steps, such as synthesis and place-and-route, to map the circuit you
defined onto the particular FPGA device. And this takes some time; it is not something that you
can do on the fly. So this approach works well, for instance, in a streaming scenario where you
have long-running standing queries which you want to have implemented in hardware. Let's
say you are in a network intrusion detection system and you have certain patterns that you
want to detect on a data stream; you might get new patterns as time evolves, which you want
to put on there on a nightly or weekly basis, but not very frequently.
At the other end there are systems like Netezza. Netezza is a data warehouse appliance which
also uses FPGAs inside its architecture. Here you don't compile full queries into hardware;
rather, you put a fixed set of operations onto the FPGA, things such as decompressing data
efficiently to speed up transfer times between this and the final system, or simple filtering of
data such as projection-based or selection-based filtering. You can do that type of thing, so you
don't fully compile queries; you just do a little bit of it, and the more complex stuff is done on
the CPU. The work I'm going to present here is about this type of filtering, done on XML using a
technique called XML projection. This technique extracts filtering expressions from a query and
then pre-filters the data before the final XQuery engine runs on a reduced set of data. The goal
of this work is that we don't want the compilation overhead which we saw in Glacier, and yet
we want to make this filtering as expressive as possible, so we don't want to just filter on
[inaudible], so we have here a more complex problem where we really won't have…
>>: [inaudible] confirmation on it? It sounds like [inaudible] joint something benchmark
[inaudible] benchmark? How long does [inaudible] to generate [inaudible]?
>> Louis Woods: If you do the full compilation, this will definitely take several minutes, and
then it depends, of course, on how complex your circuit is; it can even take hours, but it will
definitely take a number of minutes. Yes?
>>: So you have [inaudible] optimizer data plan and [inaudible] takes [inaudible] plan
[inaudible] and then there's a cost of [inaudible] and that is one [inaudible] and the second is
[inaudible].
>> Louis Woods: The compilation, that first part, is not an issue, but the second part is the
issue, because the complicated thing is that you are taking an abstract circuit description and
you have to translate it to the hardware which the FPGA provides you. You have to do the
routing, and this really is a complicated process. This is not something you can do efficiently.
So from the very [inaudible] to the bitstream, and loading the bitstream onto the FPGA, that's a
process which takes quite some time. The alternative is to not reprogram the FPGA but to have
a generic circuit which you can modify the way you like. Next I am going to give an introduction
to what XML projection is. I already highlighted the idea a little bit, but let's run through an
example. Here I have an XML document, and on the right-hand side I have an XQuery. What
this XQuery does is, from this document, it selects all of the item elements which are
descendants of the region elements, and for each of those items it returns a new item. You can
do this in XQuery like this. You generate a new item which contains the name tag that matches,
so the name is the whole subtree of name. Here it is just a text [inaudible], but it could be an
entire subtree. And it generates a new element that contains an aggregate, which is just a
count of these two incategory elements. So we say, okay, this is a complex query; I mean,
XQuery is a very complex language, and we cannot run this full query on an FPGA. However, in
2003 there was a paper by Marian and Siméon, who talked about a technique for projecting
XML documents, and they showed that from any query you can statically infer a set of so-called
projection paths. These paths are expressed in a subset of XPath, and they define exactly which
parts of the document will be touched by the query. So before you run the query you can
quickly extract these paths and then filter out all of the irrelevant parts from the document.
Then you can run the XQuery on the filtered resulting document and it will give you the same
result.
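[Editor's note: the following is a minimal, non-streaming Python sketch of the projection idea described above, added for illustration; it is not the FPGA implementation. The path syntax is restricted to '/' and '//' steps with tag names, and the document, the paths and the function names are made up for the example rather than taken from the talk.]

    import xml.etree.ElementTree as ET

    def parse_path(expr):
        # Parse a restricted XPath such as "//regions//item/name" into
        # (axis, tag) steps, where axis is "child" or "descendant".
        steps, i = [], 0
        while i < len(expr):
            if expr.startswith('//', i):
                axis, i = 'descendant', i + 2
            else:
                axis, i = 'child', i + 1          # a single '/'
            j = expr.find('/', i)
            if j == -1:
                j = len(expr)
            steps.append((axis, expr[i:j]))
            i = j
        return steps

    def matches(steps, tags):
        # Does the root-to-node tag sequence satisfy the whole path?
        if not steps:
            return not tags
        axis, name = steps[0]
        if axis == 'child':
            return bool(tags) and tags[0] == name and matches(steps[1:], tags[1:])
        return any(tags[k] == name and matches(steps[1:], tags[k + 1:])
                   for k in range(len(tags)))

    def project(root, paths):
        # Keep nodes selected by some path (whole subtree) plus their ancestors.
        compiled = [parse_path(p) for p in paths]

        def walk(elem, tags):
            tags = tags + [elem.tag]
            if any(matches(s, tags) for s in compiled):
                return elem                           # selected: keep the whole subtree
            kept = [c for c in (walk(child, tags) for child in elem) if c is not None]
            if not kept:
                return None                           # nothing relevant below: drop
            spine = ET.Element(elem.tag, elem.attrib) # ancestor of a match: keep the spine
            spine.extend(kept)
            return spine

        return walk(root, [])

    doc = ET.fromstring(
        "<site><regions><namerica>"
        "<item id='i1'><name>Canon camera</name>"
        "<incategory category='c1'/><incategory category='c2'/>"
        "<description>lots of text the query never touches</description>"
        "</item></namerica></regions>"
        "<people><person><name>Alice</name></person></people></site>")

    paths = ['//regions//item/name', '//regions//item/incategory']  # illustrative only
    print(ET.tostring(project(doc, paths), encoding='unicode'))

Running the XQuery engine on the projected output instead of the original document would give the same query result, which is exactly the property the talk relies on.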
So if we take this idea to the hybrid architecture which I'm proposing, then the server runs a
full-blown XQuery engine such as Saxon or MXQuery or you name it; before it runs the XQuery
engine, it extracts these paths and puts them on the FPGA, and this has to happen quickly
enough for the approach to make sense. Then the data is not streamed directly to the server,
but is streamed through the FPGA, and the idea is that the FPGA really operates as a stream
processing network which reads the original document on one end and produces the filtered
document on the other end. That's the key idea. With that, the next question is: how do we put
these filtering expressions on an FPGA? How does the corresponding hardware look? Here is
such an expression, where I say from the root I want to match a descendant A, which has a child
B, which has a child C, and that in turn shall have a descendant, meaning somewhere in its
subtree, an element D. This is a similar problem to regular expression matching. From a regular
expression I can derive a nondeterministic finite state machine or a deterministic one, and the
example here is a nondeterministic finite state machine where I say, okay, if I match A, I go into
the next state; if I match B, I go into the next state; if I match C, then I have a Kleene closure,
meaning I will stay in this state. I've omitted one thing here on purpose, namely that this is not
quite the full story. What is missing is that we are not keeping track of closing tags here. In
XML we actually have to keep track of the fact that once I match A, B, C, those tags are open;
when tag C closes again, I have to take care of this somehow. This is typically done with a stack,
and I will show later how we added the stack as an implementation detail into our system.
Anyway, for simplicity, if I just want to take an NFA like this, how can I translate it into
hardware? Here I give you an example. What you have on the FPGA is flip-flops and logic, and
you have a lot of both. So we can store the state of this nondeterministic finite state machine in
one flip-flop per state, telling me whether or not this state matched. So if I go from Q0 into Q1,
that is, when I match this predicate A, I can put a one into this flip-flop, and with the gates I
basically implement the transitions between all of these states. Here I am assuming an external
tag decoder, some black box which reads tags and outputs them; in our implementation we will
actually have local replicas of tag decoders, because that works more efficiently. The gates,
anyway, just take the previous state from the flip-flop and the current tag and then define
whether the state should go active or not.
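[Editor's note: below is a small software analogue, added for illustration, of the flip-flop-per-state structure just described, for the example path //a/b/c//d. The tag decoder is abstracted away as a plain string comparison, and closing tags are deliberately ignored, exactly as on the slide; the stack comes later. The function name and test inputs are made up.]

    def run_nfa(open_tags):
        # States q0..q4 for the path //a/b/c//d; q0 is the start state,
        # q4 means the path has matched.  One boolean per flip-flop.
        q0, q1, q2, q3, q4 = True, False, False, False, False

        for tag in open_tags:                    # one opening tag per clock cycle
            # "gates": next-state logic, computed from the *previous* values
            n1 = q0 and tag == 'a'               # descendant step //a (q0 self-loops)
            n2 = q1 and tag == 'b'               # child step /b
            n3 = q3 or (q2 and tag == 'c')       # child step /c, then Kleene-style self loop
            n4 = q4 or (q3 and tag == 'd')       # descendant step //d, match is sticky
            # "clock edge": all flip-flops update together
            q1, q2, q3, q4 = n1, n2, n3, n4

        return q4

    print(run_nfa(['site', 'a', 'b', 'c', 'x', 'd']))   # True
    print(run_nfa(['site', 'a', 'c', 'd']))              # False (no b in between)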
>>: Will you need a parser?
>> Louis Woods: Yes, you will need a parser, and I will come to that in a second. Here I am just
looking at this path expression, knowing that I already have a parser in front of it. And again, as
I already said several times, we could of course compile these path expressions into particular
circuits and load them onto the FPGA, but we don't want to do this, because it would probably
take too long. What we observed is that these paths are a restricted set of XPath; we have, for
example, only downward navigation, so the automata, these NFAs which we construct, are
always of the same structure. The semantics might change, in the sense of whether I'm
matching this tag or that tag, and what also changes is whether I'm doing a descendant
navigation step or just a child step, but other than that, the hardware structure really stays the
same. We see that we always have pairs of a node test and a navigation step. It's always
descendant-and-A or child-and-B; that's the repetitive pattern, and this is exactly what we want
to exploit for our idea of a skeleton automaton. This is really the key idea behind this approach:
you define this skeleton, which is the same for all XPath expressions, and load it onto the FPGA
once, and you leave out those parts which are specific to a particular XPath expression; those
you put on there later. So you just have this skeleton that you load onto the FPGA when you
boot the engine, and later, at run time, all you have to do is extract the semantics from your
XPath expressions and load them into the skeleton. This can be done quickly, because all you're
doing is updating a little bit of memory. Again, this should be quick so you can have highly
dynamic workloads; in numbers, you can do this in, you know, two or three microseconds
versus several minutes of compilation time. So that's the key idea.
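[Editor's note: a small Python sketch, added for illustration, of the configure-instead-of-recompile idea. The chain of generic segments stands for the skeleton that is loaded once; installing a new XPath only writes an axis and a tag into each segment. The matching logic itself is omitted here, the class and function names are invented, and 750 is the segment count reported later in the talk.]

    class Segment:
        # One skeleton segment: a navigation step plus a node test.  The
        # structure is fixed; only the fields below are written at run time
        # (the hardware analogue is a small RAM / configuration register).
        def __init__(self):
            self.axis = None      # 'child' or 'descendant'
            self.tag = None       # node test, e.g. 'item'
            self.active = False   # segment currently unused

        def configure(self, axis, tag):
            self.axis, self.tag, self.active = axis, tag, True

    # The skeleton: a fixed chain, loaded onto the FPGA once at boot time.
    chain = [Segment() for _ in range(750)]

    def load_query(path_steps):
        # "Reconfiguring without reconstructing": installing a new path is
        # just a handful of memory writes into the existing segments.
        for seg in chain:                          # wipe the old configuration
            seg.active = False
        for seg, (axis, tag) in zip(chain, path_steps):
            seg.configure(axis, tag)

    # e.g. the path //regions//item/name
    load_query([('descendant', 'regions'), ('descendant', 'item'), ('child', 'name')])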
From here, in the next couple of slides, before I go into the evaluation, I will show a little more
detail of the hardware, because there are some things missing which I haven't yet talked about.
In terms of the architecture, you asked about the parser: yes, we have a parser in front of this,
so the XML stream runs through this parser, or really more of a lexer, which annotates the data
stream with lexical information which is then used by the XPath circuits. This parser reads one
byte, one character, per clock cycle, and it's just a large state machine which says, okay, this is a
tag start, or here is the end of the tag, or the end of the closing tag. And then this so-called
cooked XML flows through the segment matchers, and this is actually how we are exploiting the
parallelism of the FPGA: it is pipeline parallelism that we are doing here. We are streaming this
data through this pipeline of segment matchers, which are daisy-chained. A small detail here at
the end is a serializer; I don't want to say too much about it, but it just makes sure that what
comes out of this engine is valid XML, because we want the XQuery engine to be completely
oblivious to the fact that an FPGA in front of it is filtering out stuff, so we cannot just give it the
matched parts. We cannot just rip everything out; we cannot just give it the parts which the
XPath expressions matched. We have to embed them in a valid XML document, so we have to
keep the full paths to the relevant nodes, and that's what this serializer does.
The skeleton segment, and this is now as technical as it will get, is at the core of this. Here is a
diagram of the architecture of this skeleton segment. Again, it represents a navigation step plus
a node test: am I doing a descendant or a child step, and which tag am I matching? The
configurable parts of this are the tag predicate, which is written into a RAM (if I'm matching tag
[inaudible], I have to store that in there; if I'm matching tag bar, I have to store that information
in there), and a configuration parameter which says what kind of an axis I am doing. Each
segment matcher can do all of the navigation steps, and we configure which one to do by
writing into this thing. Then you see these matchers have data-in and match-in ports and
data-out and match-out outputs, so they are all daisy-chained. The data is passed from one
matcher to the next to the next, and the same holds for the match-in and match-out signals.
The segment core takes all of this information and decides whether it has a match or not. Then
the final missing part here is this thing at the bottom, this history unit, and what this is, is the
stack which I mentioned, the part we were not taking into account when we just looked at the
regular expression. If we have a match, we put that into the stack. It is essentially a one-bit
stack: we write a one into it, and as we go down in the tree we shift this information to the left;
as we go up in the tree we shift the other way around. If it is a child navigation step, we just
shift that one up and down, and if it's a descendant navigation step then, once I put in a one, I
keep shifting in a one. Here we use a 16-bit shift register, which means we can go up to a depth
of 16. This is reasonable; you typically don't have extremely deep documents. If you do, you
could change this. On the other hand, you could also say, well, if I have an overflow here, I will
just stop filtering; that is always my backup: in the worst case, just send all of the data.
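[Editor's note: a minimal Python model, added for illustration, of the history unit as described: a 16-bit shift register acting as a one-bit stack, shifting left on opening tags and right on closing tags, with the descendant axis keeping a pushed one alive at deeper levels. The class name and its methods are invented; overflow handling is only indicated in a comment.]

    DEPTH = 16   # 16-bit shift register, so document depth up to 16

    class History:
        # Per-segment one-bit stack, implemented as a shift register.
        def __init__(self, descendant):
            self.descendant = descendant   # descendant vs. child navigation step
            self.bits = 0

        def open_tag(self, matched_here):
            top = self.bits & 1
            # child step: push the current match bit; descendant step: once a
            # one has been pushed, keep shifting ones in as we go deeper
            bit = int(matched_here or top) if self.descendant else int(matched_here)
            self.bits = ((self.bits << 1) | bit) & ((1 << DEPTH) - 1)
            # on overflow (depth > 16) the real design just stops filtering and
            # forwards all data, which is the safe fallback mentioned in the talk

        def close_tag(self):
            self.bits >>= 1

        def matched(self):
            return bool(self.bits & 1)

    h = History(descendant=True)
    h.open_tag(True)     # the step matched at this level
    h.open_tag(False)    # one level deeper: the descendant step keeps the one alive
    print(h.matched())   # True
    h.close_tag(); h.close_tag()
    print(h.matched())   # False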
Here is a detail about the configuration, how it is done. For configuring at run time we use
processing instructions. That is a feature of the XML specification, and the key is that we embed
this configuration into the byte stream as well, so it can be recognized, and we actually
configure as it flows through. In terms of how fast this is, well, it can process one byte per clock
cycle, so if your XPath consists of 50 bytes, for instance, then the reconfiguration takes 50
cycles, which at a 125 MHz clock translates to 400 ns.
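[Editor's note: a quick back-of-the-envelope check, added for illustration, of the figure just quoted, assuming one configuration byte per clock cycle at 125 MHz and the roughly 50-byte path from the example.]

    clock_hz = 125e6          # 125 MHz, one configuration byte per clock cycle
    path_bytes = 50           # the ~50-byte XPath from the example
    reconfig_seconds = path_bytes / clock_hz
    print(reconfig_seconds * 1e9, "ns")   # 400.0 ns, versus minutes for a full recompile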
That was all of the detail of how we build a single path. Now we want to support more than one
path; when we analyze such a query, we typically have about 15 paths. We don't have to
change this architecture a lot to do that. We still keep this chain of segment matchers, and all
we have to do is pay a little bit of attention at the beginning and at the end of each path. At the
beginning we have to make sure that the first segment behaves as a root node, basically. That's
not difficult to do; the details are that you can just initialize this history thing with a one and
make sure that you are not dependent on the predecessor. The other question is what multiple
paths mean for the end result. Here we are saying that if any one of these paths matches, then
we have to keep that part of the document; what we want is the union of all of the paths. We
just have an additional global match signal, and we can configure the segment at the end of a
path to say: you are an end-of-chain segment. At those segments, what happens is what you
see here at the bottom: the previous global match signal is merged with the local match, so if
any one of them has a match at the end, it will propagate all the way to the serializer, which
again is responsible for outputting the data. Yes?
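[Editor's note: a toy Python sketch, added for illustration, of only the merging step just described: end-of-chain segments OR their local match into the global match signal that travels with the data towards the serializer. The per-segment matching logic is abstracted away as precomputed booleans, and the layout of the example chain is invented.]

    # Each entry stands for one segment matcher in the daisy chain; 'match' is
    # its local match output for the current document node, and 'end' marks the
    # last segment of a path (configured at run time).
    chain = [
        {'match': False, 'end': False},   # path 1, segment //regions
        {'match': False, 'end': False},   # path 1, segment //item
        {'match': True,  'end': True},    # path 1, segment /name       -> matched
        {'match': False, 'end': False},   # path 2, segment //people
        {'match': False, 'end': True},    # path 2, segment /person     -> no match
    ]

    def global_match(chain):
        # The union of all paths: the global match signal is passed from segment
        # to segment and, at every end-of-chain segment, ORed with the local match.
        signal = False
        for seg in chain:
            if seg['end']:
                signal = signal or seg['match']
        return signal

    print(global_match(chain))   # True -> the serializer keeps this part of the document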
>>: Path [inaudible] one, two [inaudible].
>> Louis Woods: Sorry, say again.
>>: In your picture you have them in sequence. Are they really in sequence or are they
[inaudible] XML file?
>> Louis Woods: They are laid out in sequence. They operate in parallel, but path one will
match before path two. If path one matched, that match will propagate along together with the
data stream; the data just flows through this pipeline and matching information is merged into
the stream, basically. So it is laid out in sequence, operating in parallel; that's what it is. This is
sort of the big picture. We have the parser and the serializer at the two ends, and then we have
this chain of segment matchers, and we put as many on there as we can. The FPGA is a 2-D
array of resources, and a pipeline like this tends to map nicely to the hardware, because there is
only local neighbor-to-neighbor communication between the elements, so the tools are very
good at figuring out how to lay this out. You can construct other, more complex designs with
transitions all over the place which don't work as well, but this one works very well.
So, a little bit of evaluation; first, some performance. We measured Saxon EE (Saxon is an
XQuery engine, and EE is the commercial edition). We measured the speedup in parsing time
on this engine with and without this projection, and what you can see is that on a 100 MB XML
instance the speedup was around 6x to 8x, somewhere around there. The speedup in parse
time was actually significant, just from reducing the documents, and parse time really is an
issue for XML applications. In particular, it is an inherently serial process which is hard to
parallelize, so if you can just make the document which has to be parsed smaller, then the
parsing, and the XQuery engine, which we didn't change at all, will obviously run faster. The
benefits in query execution time which we measured were not significant. This is known about
projection. I think the reason is that once you have your optimized data structures in main
memory, the queries run very efficiently over these data structures anyway; they won't touch
the stuff which you don't need anyway, so there is no benefit there. In many of the queries the
execution time was a much smaller part than the parsing time, but in this XMark benchmark,
which is a standard benchmark for XML, you also have a few expensive join queries where the
execution time then dominates. Still, in most of the queries the execution time was less
dominant than the parsing. Finally, memory consumption, which was the original reason for
projection: the improvements there are again quite significant, since you have a smaller
document and the data structures in memory are a multiple of the original document size, so
this often really blows up. Yes?
>>: [inaudible] you compared the corresponding [inaudible] condition or [inaudible]?
>> Louis Woods: Yes, I have a backup slide on that. I might as well show it now. That's true;
Saxon, and this is unfortunately a graph which is a little bit hard to parse, but Saxon, the
commercial version, has a switch for software-based projection, so it implements this, and we
ran that as well; that's what you see with these striped bars. This is not a stacked graph or
anything. The message here is that software projection has no effect on parse time, because
the document you parse is the same size. On execution time it doesn't have much of an effect
anyway, and for memory consumption we were a little bit surprised that the improvements
from the hardware projection were much larger than those from the software projection.
I have a few more results before I conclude this talk. One question is, well, how many of these
segments can you put on the FPGA? Here we used a not very large FPGA; this was a Xilinx
Virtex-5, it really was on the [inaudible] five [inaudible], so a medium-size FPGA. I didn't talk
about how we used BRAM at all; that doesn't matter, but you have basically two types of
resources on the FPGA and we used them roughly in equilibrium, and the message is that we
could put 750 of these segments on the FPGA, and this is sufficient to do the first 10 XMark
queries. To put this in numbers, that is maybe running 10 to 15 XPaths, where each is using
something like 15 to 30 segments, something like that. This was sufficient. The other result that
I have is scalability. I said at one point in the talk that this maps well to FPGAs, and one way to
measure this is to ask how the clock frequency behaves. Does it degrade when I start filling up
the chip? At 750 segments I'm really saturating the chip quite a lot, and what you sometimes
see is that the clock frequency then really degrades. Here, it stays more or less stable, which
lets us assume that this would also scale very nicely to a larger chip, and that we wouldn't have
any problems there. Also, the clock frequency I am showing here is well above our target, which
was just 125 MHz for one gigabit per second.
So this brings me to my conclusion. What I talked about today was a hybrid XQuery processing
engine with the architectural approach of putting the FPGA into the data path: rather than
using it as a coprocessor which does heavy number crunching on the side, where you have to
load data onto it and get a result back, here we put it into the path which exists between the
data source and the server anyway, and we let it, transparently to the XQuery engine,
preprocess the data. The problem that we encountered is that reprogramming the FPGA in such
a setting is not an option, because we want to change what we put on the FPGA very frequently
and very fast. And the solution that we came up with is to put a skeleton, the part that doesn't
change across all of the queries, onto the FPGA once, and at run time you only change its
semantics, those parts which are specific to a certain query. Finally, this work is part of our
Avalanche research project, which you can find at this URL. In this project we are aiming for
hybrid CPU-FPGA co-designs for data processing; that is our ultimate goal, and we are trying to
figure out what the right instructions and the right interfaces are for this to work well. With
that, I am happy to take any questions. [applause]. Yes?
>>: Two questions. First this is entirely about [inaudible] programmability for data, so how
much [inaudible] overhead, how much overhead [inaudible] how much clock ring overhead do
you expect, is there a cost [inaudible] for making this [inaudible]?
>> Louis Woods: Okay, I see what you mean. I think in this particular case the cost, well, I don't
have an accurate number, but I think it was fairly small. For the navigation steps, on the one
hand you have a lot of resources there anyway; that's the way the FPGA is built, so whether you
use an additional multiplexer to decide whether this is going to be a child step or a descendant
step, I don't think that will have a lot of impact. And then the BRAMs: for instance, we store the
predicate information, which tag you're matching, in the BRAM blocks, and again these blocks
are there; the question is what else you would do with them. Here we are assuming that we can
really use the chip as we like; there's nothing else that we have to do on it. They are there, and
I don't think that if you had a [inaudible] solution without this online reconfigurability you could
then say, I can put 10 times more paths on it. It would rather be something like, well, I'm not
even sure if it would be more, but I don't think it would be much more. Was that your question?
>>: I actually had several questions. If you go back to your area graph.
>> Louis Woods: This one?
>>: Yeah. It's not linear. I understand why the zero-intercept point is not zero, because you did
some things just to sort of get the system up. [inaudible] control [inaudible]…
>> Louis Woods: Yes, right in the parser, the serializer…
>>: Well, what is the less-than-one slope caused by? The fact that it sort of drops off as you get
farther and farther?
>> Louis Woods: Okay, so I think you are right. It is not linear but it's…
>>: But it's sub linear, [laughter].
>> Louis Woods: Well, I think this has to do with the tools. I think the more you are filling up
this stuff, the tools start doing a better job in using the resources. That's my guess here.
>>: So you could actually in these media plates [inaudible] 400, and what it could be using is
considerably lower than that, but well, I have this space why not expand and fill the space?
>> Louis Woods: That would be my first guess.
>>: I mean it's a visually discernible [inaudible] which then would mean when people report
logic utilization numbers, and they are on a chip that is not even [inaudible] used in fact they
are [inaudible] the results could be quite a bit better [inaudible] and who knows whether or not
[inaudible] maybe the circuit they are trying [inaudible] but that…
>>: [inaudible] point posting census estimates as well.
>>: And so that's the question. So is this the blowup the synthesis or is it [inaudible]?
>>: I would guess it would be in place and route, I would guess because you start off with
synthesis and since there is a lot of negative [inaudible] translate and then…
>>: Well no, the synthesis step sees that you're at 2% utilization so you're going to double all of
these pieces to make it faster because you have plenty of room.
>>: Well, the synthesis option is to say [inaudible] area regardless, right? I guess you could try
running the graphs [inaudible] options…
>> Louis Woods: That's true, yeah. I think this one here we used just out-of-the-box options.
>>: [inaudible] other [inaudible] 80% and then keep adding stuff and you get to 90% of
[inaudible] and then definitely finding a quick solution and that's good enough as far as that's
concerned.
>> Louis Woods: Yeah, I mean the routing, all of these steps take considerable time; the time it
takes to produce a final design also, I guess, isn't linear.
>>: But in general you see these super linear graphs, not a sub linear graph and that's the thing
that caught my eye.
>> Ken Eguro: Okay, is there anything else?
>>: [inaudible] if I understand right, your programmability [inaudible] special twist to the data
path?
>> Louis Woods: Right.
>>: So now you're going to have a programming function and something special [inaudible] and
everything else is certain [inaudible], so the overhead in the core, in the core segment
[inaudible] programmability has to go ahead [inaudible] so you get 8% [inaudible] point wise
that's the same blowup. You don't have a complicated [inaudible] and use the regular
streaming network so that's your [inaudible] overhead, is that right? Then you can have your
[inaudible] logic and fixed pockets and [inaudible].
>> Louis Woods: Uh-huh.
>>: I have a question about your benchmarks [inaudible] also. There are a couple that really
stand out, the five or six I think it was or [inaudible] next once be a very [inaudible] five or six
and the shapes of the first and second are soaring. Did you look into [inaudible] why some
have better speedups than others?
>> Louis Woods: Not so much. I know, like, I think for Q6 really a lot of data just gets filtered
out. I mean, for the parse time I think the speedup is really due to how much is filtered out, and
that depends on the query.
>>: So the queries are progressively harder evidently in this graph? Is that right?
>> Louis Woods: Well, I don't know…
>>: [inaudible] vectors up there in that area.
>> Louis Woods: Actually, I don't remember exactly what every query is doing, and the
execution times I don't have here. Some queries are join queries, which are hard on the
execution time, and there are also some queries where you just can't filter out as much. Yes?
>>: You have a graph showing how many segments per [inaudible]?
>> Louis Woods: No, I don't have this graph, but give me half a second; I have this information
with me. I just have to look it up real quick. I hope I marked it so I can find it. I know it's in this
thing, but maybe we have to take it offline because I'm not sure if I can spot it right away. I
think it was something like between 15 and 73, and the median, I don't remember anymore
what we had for the median. There are a lot of numbers in here, but I think it was something
like that.
>>: I think [inaudible] this approach is the [inaudible] force complexity reasons. So the system
will never say, for complexity reasons I can't support your [inaudible] always fall back on
whatever you bring I am just going to write [inaudible].
>> Louis Woods: Right. Right as far as…
>>: So I know nothing about, certainly, the [inaudible] and what have you, so is there anything
like that in the language that you certainly can't support?
>> Louis Woods: In XQuery language?
>>: Yes.
>> Louis Woods: Oh, absolutely. There's a lot of stuff which we would have trouble with because…
>>: And so, you know, what fraction of the elements…
>> Louis Woods: I'm sorry I just want to make sure I got the question right. You mean support
on the FPGA?
>>: Right. And so what fraction of those types of things are actually in the benchmarks? I
assume that, however large, this is one suite and probably there are some other suites, so how
much of that actually gets expressed, and how much of that actually figures into the runtime,
the performance of the…
>> Louis Woods: I think fairly a lot gets expressed. They are doing joins, for instance, which
would be difficult to do on an FPGA. There are back references, you know, from a child back to
a parent; here, for our XPath expressions, this works well because we only go forward, we only
go down the tree in the paths. We never have, when we are at a child, a reference back up to a
parent, or a parent of the parent, or stuff like that. XQuery is a language which allows you to
express basically anything, and it has function calls and all sorts of stuff, and so there is a variety
of things which we could not put on an FPGA.
>>: [inaudible] based on the performance [inaudible]. They are there, but they may not be a
significant part of the overall [inaudible] so if you switched [inaudible] execution [inaudible] I
mean the six and seven that barely make it to two, at best you're doing, you're making the job
half as thick, right?
>> Louis Woods: Right. I really think that the reason for this is that once you build up your
main-memory data structures, the effect of filtering on query execution is just not that great.
>>: Have you tried isolating the lexing [inaudible] in other words [inaudible] one, two, three
instead of in strings?
>> Louis Woods: Ah, so doing this before the FPGA, like binary XML or something like that?
No, we haven't looked into that.
>>: [inaudible] 90% of the [inaudible].
>> Louis Woods: Yeah, so, yeah, I don't know how the parsing speedup would change with
binary XML. That's actually an interesting thing to look at as an extension to this. Uh-huh.
>>: Part of this also when you are doing this [inaudible] and computation in a pipeline fashion
so it's all going on in sort of parallel. I guess there's also nothing that says that you couldn't run
the entire system in four string parallel and just say okay, but I have one parser and one series
of sequence [inaudible] and then just [inaudible] the whole set of them.
>> Louis Woods: Sure.
>>: There's no real way of sort of [inaudible] because the segment matchers are already sort of
parallel in the sense that they are all looking for different things. I mean, I guess the other thing,
this is interesting because the problem statement allows you to basically make one long pipe
and cut it off wherever you want; it's like taking a garden hose and cutting it to length, if you
want. But if you needed to produce certain [inaudible] outputs, where this is interesting
because I don't care which one you matched, all I know is that something in this thing matched,
and so [inaudible] this entire [inaudible] to solve it and to put it on the output. But if you
couldn't, if you had to figure out which one it came from, do you have some idea of how you
would maintain that in some kind of cut-to-length…
>> Louis Woods: Well, which one it came from, I think that could be done: you would have to
add this information into the stream, but then, I guess, you would have to stop the previous
one and add in yours, and all of the paths that matched would add this kind of information, so
then you know which one it came from, if you want to know that. The problem is if you want to
route: if your problem is now such that you say, okay, if path one matched, then I want to send
that data over there, and if path two matched, then I want to send it to somebody else, then
you have the problem that if all of them matched, you have to send it to all of them, and how
do you do that? But then that's a different problem. Go ahead?
>>: So what is the [inaudible]?
>> Louis Woods: The query engine is Saxon here, and... what?
>>: Is it based on like [inaudible] just like regular sequence or [inaudible]? I'll rephrase my
question so currently the [inaudible] sort of thing [inaudible] input is XML [inaudible].
>> Louis Woods: Uh-huh.
>>: Output is again [inaudible] so in principle you can store it and then [inaudible] but if the
query engine is and what you are doing maybe you can feed the output of this straight to the
operator that you [inaudible] filter rather than making it [inaudible] document [inaudible].
>> Louis Woods: Sure. I mean, yeah, if you are willing to modify your engine you can do a lot
more. Also, we are parsing twice, basically: we are parsing on the FPGA and we're parsing again
on the engine. You could maybe use this lexical information, which we already have, for the
final parser, but then you have to modify the engine.
>>: [inaudible] modify the [inaudible]?
>>: And then it gets back to the binary XML suggestion earlier.
>>: I was wondering if you could say something more about the automaton.
>> Louis Woods: The automaton?
>>: Yes, yes from the example [inaudible] and I was sort of thinking it always look like this had
some sort of [inaudible] when you check some [inaudible].
>> Louis Woods: Uh-huh.
>>: Or is it more complicated? And then, in relation to that, when you say semantics, what you
mean by semantics is different labels, so the language there [inaudible] plugging in actual labels
and you get different semantics [inaudible].
>> Louis Woods: Uh-huh exactly.
>>: But of course if you choose different labels you might actually get a nondeterminism
[inaudible] which perhaps you didn't have with the [inaudible], or the other way around. I'm
just wondering, once you fix the skeleton, does the way you choose the semantic labels have
consequences, or are there certain constraints on that sort of stuff?
>> Louis Woods: Okay. So first of all, yes, this is a specific type of nondeterministic finite state
machine, one where you have this forward-only path. The predicates that we are supporting
here are always just tag names. We also looked at, though we didn't evaluate it thoroughly,
how you would do more complex predicates, where you say, okay, I'm also looking at
attributes; I only want to look at those items which have an age attribute that is larger than 50.
I think the concept is the same, but it makes this predicate matcher more complex; you then
have to convert strings to binary and do the comparisons. I don't think, I mean, as long as you
don't support things like back references from a child back to the parent, I think you won't have
anything that you cannot handle, and in particular, the nondeterminism that you have here is
not a problem, because you do all of this in parallel. What I want to say is, if I have a
nondeterministic state machine in software and I have a lot of active states, then suddenly this
can become a performance bottleneck, right, because I have to update all of these parallel
states on one single input item, while in hardware I don't care how many states are active; I
operate on them all in parallel anyway, so…
>>: Essentially, for example here you would get A, but you might [inaudible] Q0 and it might
be Q1, so the choices are open.
>> Louis Woods: Exactly, and that doesn't…
>>: [inaudible] state is like a substantive state [inaudible] somehow it doesn't go up.
>> Louis Woods: Yes, and because each state is an individual circuit, that's fine. Each state just
looks out for itself and the overall result is correct.
>>: One more question. Go back to your [inaudible] performance [inaudible]. I'm wondering
about basically the [inaudible] here. You have these things broken out as being the speedup of
[inaudible] and the speedup of the [inaudible]; how much time do I spend in each? What is the
wall clock? If I put this graph together with the other graph, what are my parse and execute
times?
>> Louis Woods: This is very query dependent. Like I said, for a lot of these queries you can
maybe just stack them on top of each other, but there are also a few queries in the benchmark,
like the long joins, where the query execution time is really dominant, so there are a few, I
don't know, Q10 or whatever, where you would then be joining for 40 seconds, so you would
get a very different picture.
>>: Could you give me something [inaudible] the average?
>> Louis Woods: Yes, okay. [laughter]. I should have used…
>>: [inaudible] [laughter].
>> Louis Woods: Yes, I can do this.
>>: If it's in the paper, I'll read it. It's just that…
>> Louis Woods: No. Here you go. Actual query execution time varies between 68 milliseconds
and 41 seconds, and the median over these was 390 milliseconds. That is versus a parse time of
2.5 seconds, just to put this into proportion. And the speedup is what you…
>>: Got ya.
>>: Can you run them in parallel? Because you have 700 segments and you use at most half of
100, so you could connect seven of these engines on the FPGA and run seven of those queries
in parallel…
>> Louis Woods: Seven queries in parallel?
>>: Provided two things, A to the engine [inaudible] and B that the ratio between input data
and output data is what seven [inaudible].
>> Louis Woods: Uh-huh. So with this, yes, I think yes, but there is one issue, namely: when we
load the XPaths, all of the XPaths for one query, onto the FPGA, we basically wipe the
configuration and then append all of the paths. Now if you have multiple queries in parallel,
you could do this, but now one query terminates; you take it out, you leave the other ones
there, and you have a hole. So if you do batches of queries in parallel, I'd say yes, but if you just
want to say, okay, six queries are still running, but here I finished one and I'm going to send the
next one, then there would be an issue of, you know, allocating this thing in a proper manner.
And the other thing is, of course, you have to decide: you could always union all of these
queries, but then the filtering effect of course goes down, because the queries might not look
at the same data, so that's another issue, you know.
>>: [inaudible] design, so difference, you wouldn't be running one query in the next [inaudible].
>>: I mean I'm sure [inaudible] dependent were [inaudible].
>>: Right.
>>: This [inaudible] we would have first, second and third. You have [inaudible] space
[inaudible] use one chip [inaudible].
>>: How many queries do we have running on the same document?
>>: But that I would be [inaudible] partitioning it up [inaudible] parallel, how would you do
that?
>>: And you would also need multiple parsing engines.
>>: Yes.
>>: The whole thing [inaudible].
>>: Yes.
>>: So maybe it's not [inaudible].
>>: [multiple speakers] [inaudible].
>>: [inaudible] different FPGA [inaudible].
>>: Right, right but being able to [inaudible] that would mean that you would have to partition
the system, have insertion points where you could sort of break up the cards into [inaudible]
one card into two cards, or take two cards and make one card.
>>: [inaudible] problem [inaudible] one, problem [inaudible] two.
>>: Yeah, yeah, there's a certain granularity, right?
>>: Yes.
>>: You probably also would have to increase your I/O as well. I'm assuming if you ran seven at
a time, then the input [inaudible] bottlenecked pretty quickly.
>>: So that's a different [inaudible]. That's a [inaudible].
>>: So you're saying it's the same input?
>>: The same [inaudible], but yeah.
>>: Oh, so assuming that the speedup is roughly proportional to the [inaudible] compressing it
then you could just feed all of the seven outputs on the same [inaudible].
>>: Even [inaudible] the fact that you [inaudible].
>>: Yeah, that's true and that's interesting.
>> Ken Eguro: Thank our guest, and Louis, thank you very much.
>> Louis Woods: Thank you. [applause].