>> Ganesh Ananthanarayanan: We're happy to host Rachit Agarwal today. Rachit got his Ph.D. from UIUC and did a postdoc with Ion at Berkeley in the AMPLab. Rachit is one of the kind that truly, you know, lives up to his systems-plus-theory billing, in that he's actually published at both SIGCOMM and NSDI as well as SODA. And yeah, today he will be talking about work that he has been doing in his postdoc on Succinct, which has been getting a lot of attention in the media, in the Spark open source community, as well as at a lot of companies that are beginning to experiment with it.
>> Rachit Agarwal: Okay. Great. Thanks, Ganesh. Very happy to be here. Thank you all for coming. So, as Ganesh said, I am a postdoc, and I've been at Berkeley for two years. Mostly I've been working on the system Succinct; I'll tell you towards the end some of the things that I've been thinking about surrounding Succinct. And I did my Ph.D. more on graphs, graph queries; I'm going to talk a little less about that. But the common theme has always been interactive queries. What do I mean by that? Some user sitting right in front of the system, or services, or somebody interacting with a system, that wants to do queries on large data sets. And the challenges there are generally twofold. One is you want to get low latency, [indiscernible] latency of less than a few hundred milliseconds, and you also want to get high throughput. So that's what I'm going to talk about: how to design interactive systems, and why we started rethinking interactive systems. So the first thing is that I think achieving query interactivity is becoming increasingly harder today, and there are three main reasons for this. The first one is scale. In the last few years, or at least in the last decade, we have seen all these new social media systems that have given rise to massive amounts of data. Today, standing here, at least I can say hundreds of terabytes of data has just [indiscernible]. Right. More than that, the data growth has been reported to be very, very fast on these user-facing systems. At least on the conservative side, people have shown that a 70 percent growth rate in data size is just normal. And while the data sizes have increased significantly, people still want to do interesting queries on this data. For example, just three months ago, Twitter released their search system, so now they have indexed all their tweets [indiscernible] everybody. Now you can search for simple things like, okay, tell me all the tweets that mention a certain person. Okay. You can do more interesting things. For example, there are these new log analytics companies that allow you to do interesting queries. Here I have a simple query, a regular expression query, which says: find me all the logs that have error 404 or error 505. Right? So we would want to do these queries. At a high level, you can translate these queries into so-called search queries or regular expression queries.
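As a concrete sketch of the kind of query just described (the pattern and the log lines below are illustrative, not from any real product):

```python
import re

# A regular-expression query over log lines: find entries with error
# 404 or error 505. Pattern and data are made up for illustration.
pattern = re.compile(r"error (404|505)")
logs = ["GET /a error 404", "GET /b ok 200", "GET /c error 505"]
print([line for line in logs if pattern.search(line)])
# ['GET /a error 404', 'GET /c error 505']
```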
And we would want to do very interesting queries in graph systems like Facebook. Every time you go to a Facebook page, there are range queries happening in the backend; I'll talk a little bit about graph queries towards the end, but people want to do very interesting graph queries on these massive amounts of data as well. So: scale has increased, and people want to do complex queries. What has really not changed is the definition of interactivity. If anything, it has only gotten more and more stringent, because users don't want to wait too much. So the latency and throughput constraints have not changed. You want to do queries on massive, ever larger amounts of data, you want to do complex queries, and you still want to do that within milliseconds. Okay.
So what is it that makes this problem challenging? If you look at these interactive big data problems, let me run a very simple experiment and then show you some results. Okay. I'm going to take a massive number of records from a company called Conviva. Think of these records as a collection of attribute-value pairs, say a row, or a key value pair with multiple attributes in the value. And I'm going to run certain search queries. I'm going to use a single Amazon EC2 server with a certain amount of RAM, but show you results for a single core. Okay. One of the state-of-the-art systems for search queries is called Elasticsearch. I'm going to [indiscernible]. So let's see what happens. On the X axis, I'm going to increase the amount of data that I am doing the queries on. Okay. On the Y axis, I have the throughput, which is the number of queries you can answer per second. Let's start with Elasticsearch. So this is the throughput you see. Okay? Until roughly 16 gigabytes of data, you see really good performance; you can do roughly 200 queries per second per core with Elasticsearch. As soon as you go from 16 to 32, the throughput drops down to just, you know, one or two queries per second. And this is where the problem lies. And this is with 60 gigabytes of RAM. I'll tell you why this is the case. But it's not just Elasticsearch: you see similar results for MongoDB and similar results for Cassandra. Okay. So the question is, why is it that we see such a huge performance drop just beyond certain data sizes? Any guesses?
>> RAM?
>> Rachit Agarwal: RAM. Right? Like most of you guessed, at some point, you cannot execute queries in memory. Okay. So you have to go to secondary storage. And the problem is that secondary storage today is still 100X slower than main memory.
>> So you don't index in any way?
>> Rachit Agarwal: Yes. So this was -- I'll talk about indexes later on. This was indexed data. That's why you drop out at 16 gigabytes: the rest of the memory is taken up by index data. I'll show you exactly. But yes, we're using indexes here; Elasticsearch uses secondary indexes. Okay. So secondary storage is still 100X slower. Which means, if you do a simple calculation, you will see that even if ten percent of your queries go out of memory, right, throughput reduces by an order of magnitude. In fact, if just one percent of your queries go out of memory, your system throughput, the number of queries you can answer out of your system, reduces by 2X. Okay. And this is one of the problems that has been known in the systems [indiscernible] for a long while, which is increasing the cache hit rate. Here, it just becomes more prominent. Okay.
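A back-of-the-envelope check of those two claims, assuming in-memory queries run 100X faster than queries that spill to secondary storage (the speaker's figure; the 200 queries/second baseline is from the earlier Elasticsearch example):

```python
# Effective throughput when some fraction of queries miss memory and hit
# storage that is 100X slower. Numbers are illustrative.
def effective_throughput(mem_qps: float, miss_rate: float, slowdown: float = 100.0) -> float:
    mem_latency = 1.0 / mem_qps                 # seconds per in-memory query
    storage_latency = mem_latency * slowdown    # seconds per spilled query
    avg = (1 - miss_rate) * mem_latency + miss_rate * storage_latency
    return 1.0 / avg

print(effective_throughput(200, 0.00))   # 200.0 qps, everything in memory
print(effective_throughput(200, 0.01))   # ~100.5 qps: 1% misses, ~2X drop
print(effective_throughput(200, 0.10))   # ~18.3 qps: 10% misses, ~11X drop
```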
So this is the problem we want to understand. But it's not just today's data sizes that create the problem. If you look at what has happened over the last ten years: in 2006, the amount of memory we had was large enough to keep most data sets in memory, right? Over the last ten years, if you look at Moore's law, we have seen that memory capacity has been growing much, much more slowly than Moore's law. On the other hand, all these systems have given rise to data sets which are larger. Right? And this gap has been increasing. So it's not just that today's data you can scale out and do in memory; it's that this gap has been increasing very rapidly, and this is where the problem becomes much more challenging: how do we sustain this over time? Right? So at scale, basically, all existing systems today expose a very hard choice to users with interactive queries. Either you do very simple point lookups, where you do reads and writes of data, and you get interactivity, where you can do millions of queries per second; there are systems, both from industry and from academia, where you can do those queries very fast. On the other hand, if you want to do even slightly more complicated queries, like simple search, you really have to lose interactivity, because you go out of main memory. Right? And given the data growth rates, this problem is only getting worse.
>> Just one question. On the previous graph, is that data size all of the data or just hot data?
>> Rachit Agarwal: The hot. So this was for active data. The number that I have here, you mean?
>> Right, yeah.
>> Rachit Agarwal: Yeah. Yeah. So this number is from this paper by Partha from HP Labs. He had this survey done where they showed that for some applications, the active data, they call it active data, which is the hot, user-facing data, is increasing at this rate.
>> And how did they define hot data?
>> Rachit Agarwal: So this was one of the -- that's a good question. Okay. I don't remember the exact time range, but they just said, okay, data used within exactly this time range was called the hot data. Yes.
>> [Indiscernible] through to the next slide? So I don't know what your notion of a powerful [indiscernible] is, but Google and Bing both have interactive bounds, and the data is way bigger than [indiscernible] we're talking about. So why is that? I mean --
>> Rachit Agarwal: Right.
>> -- it seems like [indiscernible] is giving interactive performance.
>> Rachit Agarwal: Absolutely. Yes. So Google and Bing and some other companies, right? Yes. That's one of the points I was going to stress later on as well. If you could scale out today, right, and if you could continue to increase your scale-out at the 70 percent data growth rate every year, you'd be able to sustain this. The problem is that if your data continues to grow every year, you have to scale out by that particular rate every year. So if your data is growing at 50 percent every year, you have to keep adding 50 percent extra servers every year just to stay at that performance. Again, because of the same problem: your queries going out of memory. Now, the rate --
>> That's [indiscernible], right? Because it depends upon what you want -- again, it depends on what you [indiscernible]. If you're willing to live with [indiscernible], then no. Right? For example, web data. If I don't want the latest and the greatest [indiscernible], and I'm willing to live with data that is a little bit older, I can put older data in cold storage or something like that and not query it.
>> Rachit Agarwal: Exactly. So yes, I was ignoring that space of queries. For search and regular expressions, it's slightly unclear how we would define approximation there. Right? The examples that I showed you so far were for search and regular expression queries. Right? For those queries, people have defined notions of approximation, but it's slightly tricky to define approximation in the context of Google and Bing. Whereas there, there are other ways; Google has, like -- you know, Google has worked on ways where they actually optimize their systems to be able to do these search queries really, really quickly in memory. Right? And their main solution is scale-out plus optimization per server. But if you look at these more recent startups, like the log analytics companies and companies that are using Elasticsearch, they really don't want to scale out at that rate every year. They don't want to have tens of thousands of servers. And they don't have the optimizations that Google has internally, which are not open sourced. So the question is whether you can get that performance without doing Google-[indiscernible] scale-out. [Indiscernible]? Okay. Any questions? Okay. Great.
So, what is my research about? My research focuses on bridging this gap between these large data sizes and memory capacity. Okay. What I want to achieve is the functionality of so-called NoSQL stores, where you can do these powerful queries on the values, and I want to be able to do in-memory query execution for much larger data sizes than is possible today. The question is whether it's possible to achieve this new line that is going to come up on the [indiscernible]. Right? Is it really possible, you know, with 60 gigabytes of RAM, to do queries on 64, 128 gigabytes of data without losing performance? At some point the performance will drop. If it's possible, you gain in two respects. Right? The first thing is you can scale up your systems much, much larger, so you can execute more queries in memory. On the other hand, if you look at scale, you can get much better performance than today's systems. Yes?
>> Based on the trends you're describing, it seems like you're going to buy only two years. Right? Like, why -- so that doesn't seem like it should be your idea. It seems like the idea has to be something else, something that has better graceful [indiscernible] properties than what you say you're shooting for, right?
>> Rachit Agarwal: Yeah. That's a good question. So I'm going to show you [indiscernible]. If you can keep that question for a while, I'm going to show you that it is not just this point; you can actually extend this point much, much further by doing clever things. But this is not just two years. If you can get 10X more data in memory, right, and if the data is growing at a rate of 70 percent, you get roughly six years. Right? So you get roughly six years.
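A quick check of that arithmetic (the 50 and 70 percent yearly growth rates are the ones quoted earlier in the talk):

```python
import math

# Years until data growth eats a 10X gain in in-memory capacity:
# solve (1 + growth)^years = 10 for years.
for growth in (0.50, 0.70):
    years = math.log(10) / math.log(1 + growth)
    print(f"{growth:.0%} yearly growth: 10X lasts about {years:.1f} years")
# Roughly 5.7 years at 50% and 4.3 years at 70%, on the order of the
# "six years" figure above.
```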
>> [Indiscernible].
>> Rachit Agarwal: Absolutely.
>> [Indiscernible].
>> Rachit Agarwal: Absolutely. See, that's --
>> Unless you invent a compression algorithm that can do better than constant.
[Laughter]
>> And then it gets better and better with the years we put into making it optimized.
>> Why don't we do that?
>> Rachit Agarwal: I promise you that's hard. I promise you that's hard.
[Laughter]
>> Rachit Agarwal: But here's another thing. I think new memory technologies are going to arrive. Right? For example, Intel recently announced this 3D XPoint technology, right, which is going to present a very different tradeoff than what we have [indiscernible]. Right? We're going to have much larger capacities, but much lower latencies than SSDs. And for them --
>> Is that related to what you [indiscernible]?
>> If memory, if hardware is going to solve that issue, then we don't need any --
>> Rachit Agarwal: Oh, they're still going to be, you know, 5X slower than [indiscernible], but they're going to have more capacity. So for that point, you want to design solutions that can adapt to slightly higher latency but more capacity as well. Okay. So why don't I answer your question towards the end of the talk? But yes, the calculation says that you do get six years here, but not more than that. At some point, you are going to run out of memory when data sizes become large. Okay.
So to resolve this problem, I've attacked it from two perspectives. The idea is to take the problems that come out of this data, and the constraints that come out of systems, and design scalable [indiscernible] and techniques while taking these constraints and data sizes into account. And then, once you have [indiscernible] and techniques that tend to solve this problem, you want to build systems that implement these techniques. What I have focused on during the last few years of my career is the idea of having these two combined views to attack the problems. And I really think it's important, because if you just focus on designing scalable systems, then you're going to fail to leverage the structure in these interactive problems. On the other hand, if you just focus on designing scalable algorithms, then you're going to ignore the advances in scalable design. So I'm going to focus both on the algorithm side and a little bit on the system side in this talk. But interestingly, what I want to show you is that once you start looking at a problem from these two perspectives, sometimes you're able to look ahead a few years, look at how these new evolving technologies are going to change the problem space, and design solutions that will work when these technologies evolve. So I'm going to focus a little bit on that too. Yes?
>> Can you just explain what you mean by scalable systems -- what aspects of the systems other than the algorithms and techniques they're using? Those two things look very similar.
>> Rachit Agarwal: So, for example, you know, if you look at -- so, okay, let me understand your question correctly. Are you asking how scalable systems are different from scalable algorithms?
>> Scalable systems are both using scalable algorithms and techniques [indiscernible], so what are these two boxes? If I'm the only one confused by the two boxes, then I will [indiscernible].
>> [Indiscernible].
>> [Indiscernible] that's supposed to mean?
>> Rachit Agarwal: Yes, you are right. By and large, scalable systems just employ scalable techniques or algorithms running on top of them. But there's also the lower layer, where you don't want to have systems that crash if you are going to transfer 50 megabytes of data between two queries, for example. Right? So you want to have a scalable RPC layer where, for example, you can touch multiple, tens of thousands of servers and yet not have very high overheads from aggregating the data across them, and this might be out of your algorithmic space, where you [indiscernible] solve problems.
>> [Indiscernible].
>> Hardware versus software.
>> Rachit Agarwal: Okay. So okay. So this is the focus of my research. What I want to start with is understanding why it is that systems perform badly at scale. Okay. So let me give you an example of what we started with. Here I have a file, which I've just color-coded. Each of these blocks in the file could be terms or characters, and what I want to find is all the green blocks in the file. Okay? Suppose I want to find this green block. The first technique that people use in the literature is so-called data scans. Okay. The nice thing is you store your input file in memory, right? Every time a query comes in, you start scanning the file. Okay. And believe me, it's not just the animation; data scans are actually that slow. Okay? But they have something nice, right: you don't have to store anything in addition to your input data, so you just store the input data itself. So the storage overhead is not so high. The problem is that since every query has to scan the [indiscernible] data set, you have very high latency and very low throughput. At the other end of the spectrum, people have designed these absolutely great indexing techniques. You store the input file for random access, but then, in addition, you preprocess the input file to generate additional data structures, what the literature calls various kinds of secondary indexes. So what's the nice thing about indexes? Indexing works in a way that when your query comes in, you just do a binary search on your index and you get the query response. So it's super-duper fast. On the other hand, since you have to store additional data structures on top of your input file, your storage goes up. Okay. So you have high storage, but low latency and high throughput.
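A minimal sketch of the two extremes just described (the layout and names are mine, not Succinct's): a scan stores nothing extra but touches every record per query, while a secondary (inverted) index answers in near-constant time at extra storage cost.

```python
from collections import defaultdict

records = ["error 404", "ok 200", "error 505", "ok 200"]

def scan_search(term: str) -> list[int]:
    # O(data size) work per query, zero extra storage.
    return [i for i, rec in enumerate(records) if term in rec]

# Preprocessing for the index: map each token to the records containing it.
index: dict[str, list[int]] = defaultdict(list)
for i, rec in enumerate(records):
    for token in rec.split():
        index[token].append(i)

def indexed_search(term: str) -> list[int]:
    # Fast lookup per query, but the index is storage on top of the input.
    return index.get(term, [])

assert scan_search("error") == indexed_search("error") == [0, 2]
```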
So if I look at these two techniques at scale: if I increase the data size, and I plot one of the state-of-the-art systems with scans, what does the performance look like? This is what it looks like, and it's exactly what you would expect: as the data size increases, your scan latency increases roughly linearly. So if you're doing a scan in faster storage, you have a certain number of queries, not good enough, and with larger data sizes, scans [indiscernible]. And now comes the point that you asked earlier, whether these systems are indexed. Yes. So when you index these systems, Elasticsearch, right, there you have 16 gigabytes of data, and the index overhead that I showed in the last slide makes the [indiscernible] execution [indiscernible] memory. Yes.
>> Sorry, just so I can understand: [indiscernible], or is this something [indiscernible]?
>> Rachit Agarwal: So for this one, we had the Conviva data set, from the first plot that I showed earlier. In the Conviva data set, each row contains 98 different attributes: user IDs, what time they logged in on Conviva, what they watched, and so on. And for each of the queries, you search along one particular attribute.
>> [Indiscernible]?
>> Rachit Agarwal: Yes. Yes.
>> So sorry, going back to [indiscernible], I guess it's the same question. Is there some notion here that there is a key [indiscernible] Conviva column? Is that the setup? Do you find keys? Because I don't understand why you need [indiscernible] to be in memory. You can just have a place in memory and kick out the blocks from whatever [indiscernible] you want to pick up; you will probably get something in the middle, right? Low storage, because you're only storing the index, and middling throughput, because you go out and access [indiscernible] storage.
>> Rachit Agarwal: So you have to store the data in memory. I'll give you an example of why people do that. Mostly because you also want to do random access on that data. Here's an example: Twitter. There's a Redis table there; something is running on top of Redis. The way they did this, they have to store the tweets in memory. Right? At least the last few hours' worth of tweets, or the last few days' worth. But not only that: when people do a search, it's not user IDs that they're returning. They're returning the tweets as the search results.
>> [Indiscernible] the same as [indiscernible].
>> Rachit Agarwal: Compactly.
>> It's not interactive.
>> Rachit Agarwal: Yes. Does that make sense?
>> Yeah.
>> Rachit Agarwal: So you have to store the data and -- yes.
>> So for the data scans, the fact [indiscernible] today basically scan only the columns that are relevant to the query, so they can actually get much higher throughput than what you are indicating, because you are only accessing the data that's relevant. So how would you [indiscernible] that?
>> Rachit Agarwal: The result I showed you was actually for a column store. Okay. Yes, you're right. But if you have a one terabyte data set, that's a small data set, and you have ten columns, each column is 100 gigabytes. You still have to scan a hundred gigabytes of data. Even with today's memory speeds, that will take you one and a half, two seconds, and we are talking about hundreds of queries per second here.
>> And the other [indiscernible] in terms of the data scans was indexes, right? So the other part of the problem is [indiscernible] handling freshness of data: how quickly can you ingest new data and make it available for query? Indexing can be a more expensive operation than just scanning the data in [indiscernible] fashion, so you can [indiscernible]. So will you be talking about this problem?
>> Rachit Agarwal: Yeah. You're absolutely right, and no, I will not be talking about it. Data freshness is a very important problem, and personally I think it's going to become more and more important. All these techniques that preprocess the data have this problem, that you have to preprocess the data. What today's systems do, and what Succinct does as well, is have an append-only data model: when new data comes in, you have a log store where all the new data lands, and you have to find really fast ways to update this data as well as execute queries on it. Right? If you want in-place updates, I personally think this is a problem that both the systems community and the algorithms community have not resolved yet. Right? Updating indexes, we all know, is a very complex problem, and it's not that we have a fundamental reason to know why it's a complex problem. Everybody just says it's complex, but nobody has resolved that problem. So I do think that there's space there to solve a problem. But in terms of freshness of data, I think in most systems the technique to handle this is to have a separate write store and a separate read store, and that's what Succinct does as well. Any other questions? Okay. Good. So okay.
So this is the cost you're paying by executing queries off slower storage. So what does Succinct do? Okay. Here is a two-slide description of Succinct. On the first slide, I'm going to tell you what Succinct does, and on the second slide I'm going to [indiscernible] answer what powerful queries mean and what Succinct can do. Okay. So Succinct takes your input file, okay, and preprocesses the file to compress it, to store a suite of data structures. The interesting thing here, which I would like you to note, is that Succinct does not have to store the input data. All it stores is this compressed representation of the input data. Right. Okay. Succinct takes the input file, generates this compressed representation, and now you can execute a very wide range of queries directly on this compressed representation. Okay. So what I want to convince you of over the next few minutes is that you can get low storage and high throughput for a larger range of input sizes than what is possible today. Okay. So why is this interesting? It's interesting because Succinct doesn't have to store any additional indexes, right? But more than that, even though it's not storing indexes, it avoids data scans, and hence it's providing you the functionality of indexes. The interesting thing is that this compressed representation of the data actually contains within itself the bits, the information bits, required to get the functionality of indexes. Okay. So you don't store indexes, but you get the functionality of indexes. You avoid data scans. And unless you want to access the data itself, you don't have to decompress; in Succinct, all the queries are executed directly on the compressed representation. Yes?
>> So you just said indexes are hard to update, right? How do you update your compressed representations?
>> Rachit Agarwal: Oh, like I said, when I say they're hard to update, we mean in terms of in-place updates. When you're doing appends, most people use a write-optimized store and then a read-optimized store. The read-optimized store is going to contain the compressed representations in Succinct. The write-optimized store is a log store where you keep uncompressed data, and then you periodically transfer the uncompressed data into the compressed representation. Okay. Now, people have done a lot of work in systems on designing very efficient log stores. But they did not have to solve the problem of executing search queries on them, and we did not want the log store to become a latency or throughput bottleneck, so you can't scan the data in the log store. So what we did was use some very simple techniques for speeding up the queries so we can avoid scanning the entire file. These techniques have been worked on in the database community; we just use simple n-gram indexes. And those are very fast to update. They're just two hash table lookups.
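A hedged sketch of that write/read split (the structure below is my reading of the description; the zlib blobs stand in for Succinct's queryable compressed shards, which need no decompression to search):

```python
import zlib

class Store:
    def __init__(self):
        self.read_store: list[bytes] = []   # compressed, read-optimized shards
        self.log_store = bytearray()        # uncompressed, write-optimized log

    def append(self, data: bytes) -> None:
        self.log_store += data              # cheap append, no re-indexing

    def compact(self) -> None:
        # Periodically fold the log into a compressed, read-optimized shard.
        if self.log_store:
            self.read_store.append(zlib.compress(bytes(self.log_store)))
            self.log_store.clear()

    def search(self, q: bytes) -> bool:
        # The log is small, so scanning it (or a light n-gram index over
        # it) keeps it from becoming the bottleneck.
        if q in self.log_store:
            return True
        # Stand-in only: decompress and scan. Succinct instead executes
        # the query directly on the compressed representation.
        return any(q in zlib.decompress(s) for s in self.read_store)

s = Store()
s.append(b"error 404 ")
s.compact()
s.append(b"error 505 ")
assert s.search(b"404") and s.search(b"505")
```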
>> It seems like there's also the question of whether every compressed representation can be incrementally updated.
>> Rachit Agarwal: Even Succinct --
>> Speaking of compressed representations [indiscernible].
>> Rachit Agarwal: Absolutely. Even Succinct. So this is one of the problems. Yes. So Anurag, the student who has been working on this project, is now working on some techniques to be able to update not just the compressed representations; there's also this new BlowFish project that I'll talk about later on, where we show that you can vary the compression factor, so you get the performance of indexes as well. And there we are really trying to understand why, for the last five years, ten years, everybody has been saying that it's really hard to update indexes, because there's no easy reason to understand it. There's no lower bound on updating indexes in the theory community. The best lower bound is still, I think, 25 years old, which is log N update and log N query time, which you can do in tens of microseconds today. Yes?
>> [Indiscernible].
>> Rachit Agarwal: It's the third point, that you won't have to do data scans. The advantage Succinct gets you by combining the first two points is that we are not storing indexes, but we are still providing the functionality of indexes. So what you get is [indiscernible]. Yes.
>> It's very interesting that you don't have any [indiscernible]. It looks at having very, very exact index representations. Right? And having a log store, or things like [indiscernible] and [indiscernible]. They might not be looking at search, but I'm assuming once you have an index, you can do something interesting, like, you know, combining that with a column store. So this is just sort of the comparison with systems like Elasticsearch, [indiscernible], which is what's popular in industry today; but later on, will you show us some comparison with state-of-the-art research systems?
>> Rachit Agarwal: So I do think that Elasticsearch is state of the art, at least for search queries. MICA is a simple key value store; it does not support any query beyond simple reads and writes, so it does not have any notion of indexes. Same thing with RAMCloud as well. SILT is a slightly different system. SILT was supposed to get memory efficiency, just like MICA did, but again, for simple key value pair lookups. Right? They did not have any functionality beyond simple key value pair lookups. Now, there can be tradeoffs. The tradeoff is that they can achieve much better performance than Succinct, for random access queries only. Right? Because now you can push much more data into memory; you don't have to worry about search, and your data structure is not optimized for that. But that comparison did not make sense, because they have much weaker functionality. Okay. But I can tell you numbers: there are going to be 10X, 10X more random access queries that you can do with MICA, with the specialized hardware they have.
Okay. So coming back to this plot: Succinct. For Succinct to work exactly like this, this part is not fundamental. Okay. I think the Succinct system does not have the overheads that these very [indiscernible] systems have. This part is not fundamental. The fundamental part is this flat line there: you can maintain your performance for a much larger range of input sizes. So this is essentially what you get in terms of bridging the gap; you can handle roughly 10X larger data sizes today and get interactivity at scale. But once I show you the [indiscernible], I should also tell you -- oh, okay. I'm going to tell you what Succinct can do and then show you what Succinct cannot do. So what is Succinct's data model, and what kind of queries can Succinct do today? When we started designing the system, we decided to go with queries on flat, unstructured files. Okay. And this may sound as boring as the file itself, unstructured files, but what I'm going to show you later on is that we as system designers should really be thinking about flat files much more than we do. I'm going to show you that using this simple interface, you can implement many, many powerful models on top of flat files, including key value stores, document stores, tables, and even graphs. So here's the original input. What Succinct does is store this compressed representation, and now you can execute a lot of interesting queries. What do I mean by interesting queries? The first one is that you can execute search. Right? The search query will return either the keys, if you're thinking about key value pairs, or, if you're thinking about flat files, the offsets where the results lie in the original input, but it will execute on this compressed representation. Right? You can do random access starting at any arbitrary offset; you can extract as much data as you want. The third one, which we use for many, many optimizations, is that you can do very, very fast counts in Succinct. If you want to count occurrences of a certain string, you can execute that really, really fast. You can append new data. And now the interesting things: you can do range queries. The interesting thing in Succinct is that range queries have the same complexity as the original search query, so you can get really fast range queries in Succinct. And finally, a project that we recently finished, and which I'm very excited about, is that now you can execute these very powerful regular expression queries directly on compressed data. So really, we have nailed the space of search: you can execute as powerful search queries as one can, directly on this compressed data. Okay.
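To make that flat-file interface concrete, here is a toy, uncompressed stand-in (the class and method names are mine, not the released API); Succinct supports the same operations, but on the compressed representation and without storing the raw bytes:

```python
class FlatFile:
    """Toy reference for the flat-file query interface described above."""

    def __init__(self, data: bytes):
        self.data = data                    # Succinct would not store this

    def search(self, q: bytes) -> list[int]:
        hits, i = [], self.data.find(q)
        while i != -1:
            hits.append(i)                  # offsets into the flat file
            i = self.data.find(q, i + 1)
        return hits

    def count(self, q: bytes) -> int:
        return len(self.search(q))          # Succinct counts much faster

    def extract(self, offset: int, length: int) -> bytes:
        return self.data[offset:offset + length]   # random access

    def append(self, more: bytes) -> None:
        self.data += more                   # Succinct appends to a log store

f = FlatFile(b"happy puppy")
assert f.search(b"ppy") == [2, 8]
assert f.count(b"p") == 5
assert f.extract(6, 5) == b"puppy"
```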
>> Does the search on this give you back an ordered list?
>> Rachit Agarwal: An ordered list? No. No. It doesn't have to be a sorted, ordered list. But if you mean that -- okay, search is taking here a couple of milliseconds. Right? So sorting, even if you have tens of thousands of results, is not going to be the latency bottleneck. Somebody else had a question? Yes?
>> Range query within the file?
>> Rachit Agarwal: Yes. Range queries within a file, or within a column if you think about columnar stores, or within a value, you know, a particular attribute, if you're thinking key value pairs.
>> [Indiscernible]?
>> Rachit Agarwal: Yes. Yes. Yes?
>> I'm a little bit struggling. The answer is a single attribute answer, right? As in, if you are asking for "show me column Y of all of the green logs," your compressed representation is not going to get you the column Y that is associated with the green logs.
>> Rachit Agarwal: It is, actually. So, okay, right now we're talking about flat files. If you're thinking about key value stores or key value pairs or tables, let me tell you what this would mean. If your query is along a column, the results you get would be the primary keys in your table. Okay? Once you have the primary key, you can also do random access in Succinct, so you can extract any of the other columns that you want.
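A small sketch of that two-step pattern (the table layout here is mine): a search along one column yields primary keys, and random access then extracts any other column for those keys.

```python
# Primary key -> row; imagine the rows serialized into one flat file.
table = {
    "k1": {"status": "error 404", "user": "a"},
    "k2": {"status": "ok 200",    "user": "b"},
    "k3": {"status": "error 505", "user": "c"},
}

# Step 1: search along one column; the hits are primary keys.
keys = [k for k, row in table.items() if "error" in row["status"]]
# Step 2: random access pulls any other column for those keys.
users = [table[k]["user"] for k in keys]
assert keys == ["k1", "k3"] and users == ["a", "c"]
```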
>> But that seems to require a [indiscernible] data structure.
>> Rachit Agarwal: No, no, no. All these queries execute on a single data structure that Succinct stores: the Succinct compressed representation. Search, random access, regular expressions: just using this single compressed representation, you can execute all the queries that are listed here. Okay? So we are not going to change the data structures based on the queries. It's the same data structure that allows you to execute all these queries.
>> So we can take this offline, and just to be [indiscernible], among those queries, are multi-attribute --
>> Rachit Agarwal: Yes, this is for flat files only. I'm going to show you later on how I generalize these to tables or columns. Okay. Yes? I'm going to show you how we implement key value stores and tables on top of that. Sorry. Yeah. This was only the API for flat files. Okay? So okay.
We can do queries on compressed data, and we don't need additional indexes. We don't do data scans, and we don't do data decompression. When I gave this talk at Google the first time, they told me, okay, you must be joking. And in fact, it became a very interesting question for us to understand, you know, what tradeoffs we're making in this system. Right? And we do make very strange tradeoffs. Strong tradeoffs. The first one is, like somebody asked here, we spend time preprocessing all the data. Right? So if you have a system where you want to execute queries only, you know, thousands of times, then this is probably not very interesting. This is interesting where you want to do millions of queries. The second thing is, if you want to access the data, compared to systems like MICA we have to decompress; we have to extract a certain number of bytes, and that takes extra CPU cycles. Okay? So throughput will be lower if you just focus on random access itself. The third one is that Succinct has focused on point queries. We do not really care about sequential scan throughput, and hence it's not very useful for systems like Hadoop or MapReduce, right, where you really care about how much data you can read per second. And finally, we do not support in-place updates very efficiently. The way they are supported in Succinct right now is deletes followed by appends, you know, delete plus append [indiscernible]. Okay. So how much time do I have?
I'm running very late. But okay. I do want to give you some idea about Succinct, because I really think the data structures are simple. Okay. So I can convince you that this is not a complicated technique; it was a simple idea, in hindsight. So I thought I'd give you some idea about the data structures, okay? It builds upon a lot of theory work which was done in the late '90s and early 2000s. And there are two main ideas, in terms of search, that people use when querying [indiscernible]: something called the Burrows-Wheeler Transform, BWT, which was viewed as an [indiscernible] today, and then something else, suffix arrays, which are much less appreciated than they should be. Succinct builds upon the latter, suffix arrays. And we have some new data structures which make it very efficient. In particular, these new data structures impose some new structure on the data; by exploiting this structure, we can execute queries much more quickly than we could earlier. So I want to tell you about these data structures, and about how these queries execute. But let me start with the suffix arrays, so all of us have the same background. So what do suffix arrays do? Suppose I have this file, okay. The file is just "happy puppy," and the numbers at the top are just the indexes into the file. Okay. I don't have to rush, right? Okay. So the numbers over here are just the indexes into the file. The way suffix arrays work is you first construct all the suffixes of your input file. Since these are suffixes, the entire file becomes the first suffix. You remove the first character, and the remaining string becomes the second suffix. And so on.
>> [Indiscernible].
>> Rachit Agarwal: Right. And the suffix array stores these suffixes in sorted order. Okay. So I have all the suffixes sorted lexicographically. Now, for each suffix, you store its location in the input file. So the first suffix, which starts with A, starts here, at location one. Okay. The second one at location zero, and so on. Okay. So this integer array is what is called a suffix array. Okay. And then we have the suffixes in sorted order. So what is the problem? Actually, the nice thing is that if you want to do a substring search, this is just plain simple binary search. Once you have done the binary search, the corresponding numbers here give you the search results. Right? Okay.
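A minimal version of that substring search over a suffix array, using the talk's "happy puppy" example (this naive construction is for illustration only; real systems build the array far more efficiently, and the bisect key argument needs Python 3.10+):

```python
from bisect import bisect_left, bisect_right

text = "happy puppy"
# Suffix array: start positions of all suffixes, in lexicographic order.
sa = sorted(range(len(text)), key=lambda i: text[i:])

def sa_search(pattern: str) -> list[int]:
    # Plain binary search over the sorted suffixes: find the block of
    # suffixes whose prefix equals the pattern.
    m = len(pattern)
    lo = bisect_left(sa, pattern, key=lambda i: text[i:i + m])
    hi = bisect_right(sa, pattern, key=lambda i: text[i:i + m])
    return sorted(sa[lo:hi])        # start offsets of every match

assert sa_search("pp") == [2, 8]
assert sa_search("happy") == [0]
```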
The problem is, if you have a file with N characters, the top array has roughly N squared characters [indiscernible], right? The sum of one to N. The second one stores a pointer into the input file, and each pointer requires log N bits, so it has size N log N. Right? Okay. And if you look at the input file, if you think about ASCII files, you have only 8N bits in the input file. So these two [indiscernible] are much, much larger than the input file. Okay. I want to reduce that space.
So here's the idea. Let's focus on the first two entries in the suffix array. Okay. This slide is a slightly complicated one, but I'll try to make it simple. What I want to do is first see whether there is some structure in these first two entries. Right? What is the structure? I got the second suffix by removing the first character from the first suffix. Right? Which means if I store this pointer, which says where my next suffix is stored, right, then I can forget about this entire second suffix and reconstruct it by following the pointer. Okay? Stop me if this is confusing. But more importantly, this array is storing locations in the input file. So if I remove the first character, my value is only going to increase by one, because the next suffix starts at the next location. Which means this pointer also tells me where the next larger value is stored in this array. Okay? Good. So I have removed this entire suffix, and I could now remove this value too and compute these values on the fly, if I could store these pointers. Make sense? Okay. Then I just [indiscernible] over the entire array, do the same thing over and over again, and I have a collection of pointers that allow me to compute the unsampled values, and I did not even store the entire array. But since this array was sorted, I don't even have to store one character per entry. I just have to store the first occurrence. Right? So I have taken these two massive arrays and reduced them down to one sampled array plus a few bytes. This is what Succinct stores as the first step. Okay.
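A sketch of that sampling idea (variable names are mine; the pointer array is the one the talk describes, often written as Psi in the literature): keep only suffix-array values that are multiples of a sampling rate, and recover the rest by walking the pointers.

```python
text = "happy puppy$"                   # "$" sentinel marks the end
n = len(text)
sa = sorted(range(n), key=lambda i: text[i:])      # full array, setup only
rank = {pos: idx for idx, pos in enumerate(sa)}    # inverse permutation
# psi[idx]: where the one-character-shorter suffix sits in sorted order.
psi = [rank[(sa[idx] + 1) % n] for idx in range(n)]

alpha = 4                                          # sampling rate
sampled = {idx: sa[idx] for idx in range(n) if sa[idx] % alpha == 0}

def lookup(idx: int) -> int:
    # Walk pointers until a sampled entry; each hop raises the value by 1,
    # so the answer is the sampled value minus the number of hops.
    steps = 0
    while idx not in sampled:
        idx, steps = psi[idx], steps + 1
    return (sampled[idx] - steps) % n   # modulo handles the sentinel wrap

assert all(lookup(i) == sa[i] for i in range(n))
```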
Now, what were these pointers storing? This was a pointer into this array, right? Which means this array is N long, so each pointer is also going to take log N bits. So that's a problem. But yes, like I said, you can compute the unsampled values by following these pointers: once you hit a sampled value, the sampled value minus the number of pointers looked up gives you the desired result. Right? So what do I do about this array? Because it has log N bits per entry. So what I did was I took these two large arrays and stored a set of pointers that allow me to reconstruct this large array. Right? But this array will also [indiscernible] log N bits per entry. But can anybody see structure there? If you look at all the values that start with the same character, they form an increasing sequence of integers. So although I did not have interesting structure in the suffix arrays, these pointers, which allow me to reconstruct the suffixes on the fly, have a very interesting structure. Right? And we know how to compress these increasing integer sequences very efficiently, using delta encoding, for example. Okay.
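A sketch of why those pointer runs compress well (the exact encoding Succinct uses may differ): an increasing run stores compactly as a start value plus small deltas, and a fully contiguous run needs only a start and a length.

```python
def delta_encode(seq: list[int]) -> tuple[int, list[int]]:
    # Store the first value, then successive differences.
    return seq[0], [b - a for a, b in zip(seq, seq[1:])]

run = [17, 18, 19, 20, 23, 24]      # pointers for one starting character
start, deltas = delta_encode(run)
assert (start, deltas) == (17, [1, 1, 1, 3, 1])
# A contiguous run (all deltas equal to 1) needs only (start, length).
```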
So I can store a very compressed representation of this data structure. What Succinct does is something more: it takes this data structure, the third one, and transforms it into a representation where each row is not just an increasing sequence of integers but actually a contiguous sequence of integers. Okay. And since contiguous sequences of integers can be compressed heavily, you get a lot of benefits there. More than that, it also allows us to do some queries very efficiently: rather than doing a binary search over the entire array, which is not very cache efficient, we can reduce the binary search to a very small part of the array. And then Succinct [indiscernible] finally this part of the data structures, which allows you to do random access queries; these were mainly for search, and then random access queries. Okay. So that's one way to think about it. Yes?
>> I'm just kind of wondering: do you ever worry about the number of memory accesses as you are compressing data [indiscernible], like, the times you were talking about memory access [indiscernible]?
>> Rachit Agarwal: So I missed the last part.
>> So I'm guessing, with all this compression that you're doing here, it's like every query will take a lot of memory accesses for you to reconstruct and follow pointers and whatnot. So do you not worry about --
>> Rachit Agarwal: We do, actually. We do worry about it.
So what I did not talk about is that Succinct gives you the following guarantee: if you're doing random access for B bits, then you have to do B plus log N pointer lookups. Okay? Which means if you're extracting 1,000 bytes, you have to do just 16 extra pointer lookups. Okay? So you have 1,000 plus 16 pointer lookups. Now, when you're doing search, you have to do exactly log N pointer lookups; actually, 2 log N in the current implementation. Okay. Now, if I take a 100 gigabyte file, this boils down to, roughly speaking, 38, 39 pointer lookups on a single core, at 62 nanoseconds each. You can still do queries, you know, much, much faster than going to SSDs. Right? So unless you are going to do thousands or even hundreds of pointer lookups, it's not a problem. But yes, you have some overheads from doing queries on compressed data. Right? As long as those overheads are less than the overheads of going to secondary storage, you win.
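A rough comparison using only the numbers quoted in the talk (the 25-microsecond SSD figure comes up again a little later):

```python
lookups = 39        # speaker's pointer-lookup count for a 100 GB file
dram_ns = 62        # nanoseconds per pointer lookup, as quoted
ssd_ns = 25_000     # one 25-microsecond SSD random lookup
print(lookups * dram_ns, "ns of pointer chasing vs", ssd_ns, "ns per SSD read")
# ~2.4 microseconds of in-memory pointer chasing still wins by ~10X
# over a single SSD access.
```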
>> Secondary storage, you said, was [indiscernible]?
>> Rachit Agarwal: Yes. SSDs, for example. Today if you look at SS --
>> [Indiscernible] the number of memory accesses is fewer than whatever that amounts to.
>> Rachit Agarwal: But when you're doing an index lookup, unless you have very smart data structures like Google, if you're doing an index lookup, you have to do binary search over SSD-stored data. So even for search, you have to go to SSDs multiple times. There are techniques to avoid that. I'm saying there are techniques to avoid that, but then that means multiple indexes, three indexes, and that just makes it very complicated. Yes?
>> Just to follow up on what you just said: you said the comparison point was going to secondary storage. But isn't there sort of an intermediate point where you keep everything compressed in memory but you just continually scan it? You don't worry about doing an index. Depending on your query workload, that could be much faster than secondary storage, and depending on how many pointer lookups you have to trade off and the latency you're willing to tolerate, that can also be a bit of a point.
>> Rachit Agarwal: So if I understand your question correctly, you're asking why we don't just do a scan.
>> Well, I'm just saying that -- no, I understand why you don't want to do scans. But I'm saying that maybe that should be your fallback position rather than secondary storage. Saying we can tolerate so many pointer lookups; well, actually, at some point you're better off just doing scans.
>> Rachit Agarwal: Actually, no. I'll tell you why. I think scans are really, really slow. Here's the reason. Right? Ten gigabytes of data still takes a hundred milliseconds to scan, even if you're not doing any kind of computation today. Right? Now, if you --
>> But you're doing it from all the cores in a --
>> Rachit Agarwal: No, single core. I'm only talking about single core performance. Yes, you can parallelize it. So if you have 100 terabytes of data, and one single core gets, say, ten gigabytes of data, you still have to spend 100 milliseconds, right?
>> And you are -- how many milliseconds does that take?
>> Rachit Agarwal: Say 100 milliseconds, even if you are [indiscernible] hundred megabytes per second memory --
>> [Indiscernible].
>> Rachit Agarwal: Sorry?
>> Seems way too fast. Is that good for you?
>> Rachit Agarwal: Yeah. I'm saying that I'm doing a conservative analysis, right? I'm doing a conservative analysis that even ten --
>> [Indiscernible].
>> You can compress it. [Indiscernible].
>> Rachit Agarwal: Believe me, I think [indiscernible] people have worked on that problem. But what I'm saying is that if you are spending 100 milliseconds scanning the data, going to SSD today you can get state-of-the-art [indiscernible] 25-microsecond-latency random lookups.
>> [Indiscernible].
>> Rachit Agarwal: Right.
>> Are you talking about 4000X extra time? I'll shut up.
>> Rachit Agarwal: Right? Yes?
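The arithmetic behind that exchange, using the speaker's numbers:

```python
scan_ms = 100             # scanning 10 GB, per the conservative estimate
ssd_lookup_ms = 0.025     # one 25-microsecond SSD random lookup
print(scan_ms / ssd_lookup_ms)   # 4000.0, the "4000X" in the question
```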
>> So [indiscernible] the whole file as a huge stream, right?
>> Rachit Agarwal: I'm going to get to that in the next slide. You mean the key value pairs and stuff, right?
>> No, no, no, the flat file. You are treating the whole flat file as a --
>> Rachit Agarwal: As a stream right now. In this example that I showed, yes.
>> Oh, so I'm wondering: if we want to append [indiscernible], just append to that sort of file, right? So how do you update the structure? Do you need to recompute the suffix array again? I'm just wondering -- and also, as [indiscernible] keeps increasing, do we need to increase the [indiscernible] for every item, every integer [indiscernible]?
>> Rachit Agarwal: So we use a conventional solution, at least in the current version, right: you have a write-optimized log store where you append the new data, and, since the data is sharded, a read-optimized store where the Succinct data structures are not updated unless you're deleting something. Okay? Now, when new data arrives, you collect the data for a while, right, and on this data we have some optimizations for search queries, so it does not become a bottleneck. And then you periodically transfer this into the compressed representation. Okay? So this is the standard log store approach. In fact, the SILT paper that somebody mentioned does this multi-store approach. Yes. Yes. Okay.
So, actually: indexes and scans. On this curve, we've now got one new point, which is Succinct. Okay? Most of this I've talked about earlier: the Conviva data set, 1.5 kilobyte records, 98 attributes, and systems like Elasticsearch, MongoDB, and Cassandra. I want to show you some quick numbers on storage. So this is the system's amount of memory. Okay? I'm going to plot just the storage for [indiscernible], which means I have a certain data size, the raw input data size, and what is the storage footprint for these systems? Okay? If you look at Cassandra, it is roughly around eight, somewhere between 8 and 16, and it runs out of memory. Okay. I don't think this is fundamentally wrong with Cassandra; it's just that that system was not optimized for memory. But a lot of it is coming out of index overhead. If you look at MongoDB, it has similar performance. Elasticsearch is actually much better. And oh, one thing I should say is that even after days of experiments, we couldn't get beyond where the curves stop. That's the last point each system can work at with 60 gigabytes of RAM.
>> What is the data here?
>> Rachit Agarwal: The Conviva data set. It's a collection of records with 98 columns or attributes. Right? And it's 1.5 KB per record. This is one of the companies that collects video data and has video logs.
>> So it's video data?
>> Rachit Agarwal: No, these are log files, user-generated log files: when did you start watching a video, when did you stop, what video were you watching, what was the length of the video, where did the video come from, and all those things. So different --
>> [Indiscernible] the Y axis.
>> Rachit Agarwal: The Y axis, oh, yeah. This is the system storage footprint, which means, for a given raw data size, right, what is the amount of data that these systems hold after creating indexes and everything? Okay.
>> Real memory that they use?
>> Rachit Agarwal: Yes. So essentially, thinking about it: how much memory would you need if you wanted to put all this data in memory? Right? And this is the line that I drew, which is the system memory. And here's what --
>> Are you using -- all right.
>> Rachit Agarwal: Yes, please.
>> To interfere with the bottom line here: are you using huge pages in all these systems?
>> Rachit Agarwal: No. So I think Cassandra does not support huge pages. In Succinct, we disabled huge pages for fair comparison. But I'll show you some numbers towards the end where we actually get a lot of gains by using huge pages, and then I'm going to compare against different systems there. Yes?
>> Some questions here. I mean, the [indiscernible] transform was developed quite a long time ago, and the capability to execute search efficiently on that data structure is also known [indiscernible]. Can you comment on what your exact contributions are?
>> Rachit Agarwal: Yes.
>> To view the [indiscernible] previous [indiscernible]? And you're describing this performance; the efficiency of the [indiscernible] depends on the data characteristics. But, I mean, I assume you worked a lot on the Conviva data set. That might be more compressible. If you are searching over other things, like [indiscernible] web pages, the compression [indiscernible] may be different. Can you comment on those?
>> Rachit Agarwal: Okay. So, two questions. The first one is: what were our contributions, given that BWTs have been known for a while? So there are two directions on BWTs. One is the transform itself. Right? That does not have -- and you can do very efficient search queries on that, and Google actually does that. The problem is that it does not provide you compression. People have worked on reducing the storage overhead of BWTs, okay, using some structure there. Those kinds of techniques are also used in bzip2, which is a standard compression technique. But then you lose your ability to do queries on the compressed representations; you have to decompress the data, which has its own latency overheads.
>> [Indiscernible].
>> Rachit Agarwal: Exactly. Less compression.
>> [Indiscernible].
>> Rachit Agarwal: Yes, for compression. Okay? Now, your second question was that these numbers, especially the storage numbers, depend on the data itself.
>> So I asked the question: are you transforming the whole data set? Meaning, are you trying to [indiscernible] for the entire data set, or are you trying to segment the data set [indiscernible]?
>> Rachit Agarwal: Yeah. Succinct allows you to do flexible data sharding, which means you can shard your data along columns, if you know that your queries are only going to be along columns, or you can shard basically like Elasticsearch and MongoDB do, which is row sharding. Right? And then for each of the shards, you construct these data structures. Right? Does that answer your question?
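An illustrative-only sketch of the two sharding layouts just mentioned (data and shapes are made up); each shard would then get its own compressed representation:

```python
rows = [
    {"user": "a", "video": "x"},
    {"user": "b", "video": "y"},
    {"user": "c", "video": "z"},
]

row_shards = [rows[0::2], rows[1::2]]                     # row sharding
col_shards = {c: [r[c] for r in rows] for c in rows[0]}   # column sharding
print(row_shards)
print(col_shards)
```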
Sorry, I still have not answered one of the questions, which was the dependency of these numbers on the data itself. So yes, you are right, and any compression [indiscernible] would have that property. Now, there are two things. One is that even if your data is completely incompressible, right, and this is slightly non-intuitive, and I didn't want to say it earlier: Succinct still allows you to do search on it. Maybe you won't get data compression, which means your compression factor would be one, but you will still get the functionality of indexes. Right?
>> Okay. But I'm not sure [indiscernible] index efficiently [indiscernible].
>> Rachit Agarwal: Absolutely. Yes. So, performance-wise, let me settle the storage numbers first. What we did was experiment with many, many different data sets, I think 20 data sets, and we have some numbers in our paper. What we showed was the following: if you took your data set and compressed it using Gzip, right, and then you took it and compressed it with Succinct, for most, actually for all, of the data sets, what we have seen is that the numbers lie between 1.4X and 1.7X of Gzip.
>> The size is one point --
>> Rachit Agarwal: 1.4 to 1.7X of Gzip, which means you're paying 40 to 70 percent extra compared to Gzip. But you get all these functionalities. Okay?
>> Are you going to show us experiments where you vary how much you -- the granularity of compression?
>> Rachit Agarwal: I have not told you yet that Succinct can change its compression factor, but I'm going to come to that towards the end. Right now, this is a fixed compression factor that Succinct allows you. I'm going to show you some numbers, I think towards the end, but not with a varying compression factor.
>> I meant the sharding of the data for the compressed indexes. Because then you can search two indexes at once, so you have more memory accesses, but it's in parallel and --
>> Rachit Agarwal: Yeah, but then you get --
>> -- all on the same machine, right?
>> Rachit Agarwal: See, in terms of throughput, it's easier if you just parallelize; you're going to get that many improvements. Linear improvements, right?
>> Right. But if you're up against the interactivity deadline, there's the user at the end, and a hundred memory references might not work for them, but 50 might. So being able to configure that appropriately to what's happening is important.
>> Rachit Agarwal: Yes. If you give me not more than five minutes, I'll show you a cool result. Okay? So, search results. Like I said, these are the three systems, and Succinct runs something like that. After Succinct [indiscernible] performance [indiscernible], and again, you know, this is when Succinct stops [indiscernible] in memory. It doesn't [indiscernible] 256 on 60 gigabytes of RAM. But see, this is something which I think is the power of Succinct. You take a machine with 60 gigabytes of RAM, right, you are putting 128 gigabytes of data on that machine, and you are still getting millisecond or sub-second queries. Okay. So you're putting in more data than the RAM itself. And then we have random access throughput, where the performance numbers look very similar. And you know, the only thing is that the degradation is much more graceful, not just flat drops in throughput. Yes?
>> So how -- what would the gains be, say you ran on dB CBS?
>> Rachit Agarwal: dB CBS?
>> 16 gigabytes.
>> Rachit Agarwal: 16 gigabytes?
>> Yeah. Without one of the things [indiscernible]. The question is about query generality or query [indiscernible], as opposed to -- I don't know what [indiscernible] you're using here.
>> Rachit Agarwal: So I showed you the set of queries that Succinct supports, right? Search, regular expressions, range queries, and then counts and random access. What kind of queries do we not support? The aggregate queries, which means if you want to do an average -- in general, [indiscernible] queries I don't yet know how to do --
>> But you had a counter.
>> Rachit Agarwal: Oh, counters -- okay, yes. One can say count is [indiscernible] query, but what I meant is, suppose you have a column that is the salary of people and then you say, okay, find me the average salary. I don't know how to do that query yet on compressed data directly.
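To make that query surface concrete, here is a minimal Python sketch of the kind of interface being described; the class and method names are hypothetical stand-ins, not Succinct's actual API:

    # Hypothetical sketch of the Succinct-style query surface from the talk.
    # All names are illustrative; Succinct's real API differs.
    class SuccinctLikeStore:
        def search(self, term: str) -> list[int]:
            """Offsets of all occurrences of term (supported)."""
        def regex(self, pattern: str) -> list[int]:
            """Regular expression search (supported)."""
        def range_query(self, lo: str, hi: str) -> list[int]:
            """All values between lo and hi (supported)."""
        def count(self, term: str) -> int:
            """Number of occurrences, cheaper than full search (supported)."""
        def extract(self, offset: int, length: int) -> bytes:
            """Random access into the compressed data (supported)."""
        # Aggregates such as AVG(salary) are *not* supported: per the talk,
        # they are not yet known how to answer on the compressed data directly.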
>> What is your number? What [indiscernible]?
>> Rachit Agarwal: So we were looking at NoSQL stores, right? So yes, I'm not answering your question, but I don't know the answer exactly. We are looking more on the NoSQL side, where people look at systems like [indiscernible] Cassandra and these, right? And I'm not sure if people run dB CBS on those systems. But I'll be happy to run some numbers and see what kind of queries. Those are more -- actually, I haven't looked at it -- those are more SQL queries, right? The problem there is that even if a SQL query has an aggregate part, right, then I'll have to say I cannot support that query.
>> [Indiscernible]?
>> Rachit Agarwal: Yes, I think -- ah, in terms of SQL queries, we have to understand the coverage, right? But our focus was not on SQL at all. Even when we released Succinct on top of Spark, we have support for Spark SQL, where you can implement the [indiscernible] fast, right, but we did not release Succinct as part of SQL because --
>> [Indiscernible].
>> Rachit Agarwal: Yeah. So filters is one thing that people use, and LIKE, which you can implement as a regular expression. So I think Succinct would be a part of your SQL execution rather than your SQL execution engine, [indiscernible] speaking, because I think a scan is just too hard for Succinct, and you have to do some scans in Spark.
>> [Indiscernible].
>> Rachit Agarwal: Okay. I want to tell you some more cool things, but I think I'm going to start skipping a few things. So now, okay, I told you about what Succinct can do, what Succinct cannot do, and how we do it. So now, how do we take this technique and build a distributed store out of it, right? And you have to think about multiple things in this context. First is: what is the data model that you're going to support? The second one -- and this problem was told to us by LinkedIn people -- is that if you have these skewed workloads, where certain data is very hot and certain data is cold, you do not want to have the same data representation for all the data. Right? So how do you handle skewed workloads, where queries are distributed non-uniformly across different shards? How do you handle [indiscernible] failures, where once a machine fails, the load on the other remaining replicas increases? How do you handle data recovery during failures and data concerns? I'm going to go through each one of them and give you some ideas of what Succinct does. So, the data model. Like I said, we have this flat file interface
which allows you to implement many, many data models right on top of it, using a simple serializer interface. This means you can run your queries on unstructured data, or on key value stores like Voldemort or Dynamo, or on document stores, all using one single interface of flat files. And I know I might be losing some people, but I want to show you this; it's a really cool, simple thing. So, this is the Succinct interface: the user submits a file. Say the user is going to give the system a key value store or a table. So here I'm going to show you a table, and this table has four columns. Okay? Now, how does Succinct execute queries along the columns while working on flat files? Right? It takes this table and assigns each column a unique delimiter. Okay? Then it takes the values in each column, appends the column's delimiter to each value, and writes them down as a flat file, which means the first value here, the green value, combined with its delimiter, is written first, then the second value, and so on. Okay? Now I have a flat file that I'm going to create Succinct data structures on. Make sense?
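A minimal Python sketch of this delimiter encoding -- and of the column-restricted search trick described next -- with hypothetical names; real Succinct queries its compressed suffix-array structures rather than scanning the flat string as done here:

    # Hypothetical sketch: flatten a table into one string using a unique
    # delimiter per column, then pin a search to one column by appending
    # that column's delimiter to the query.
    table = [
        ["green", "10", "x", "a"],
        ["blue",  "20", "green", "b"],
    ]
    delims = ["\x01", "\x02", "\x03", "\x04"]  # one unused byte per column

    # Write each value followed by its column's delimiter.
    flat = "".join(val + delims[c]
                   for row in table
                   for c, val in enumerate(row))

    def search_column(value: str, col: int) -> list[int]:
        """Find offsets of `value` appearing in column `col` only."""
        needle = value + delims[col]  # delimiter restricts matches to one column
        hits, start = [], 0
        while (pos := flat.find(needle, start)) != -1:
            hits.append(pos)
            start = pos + 1
        return hits

    # "green" in column 0 matches row 0 only; the "green" in column 2 is
    # skipped because it is followed by a different delimiter.
    print(search_column("green", 0))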
>> You can go a little bit faster.
>> Rachit Agarwal: I can go a bit faster? Okay. So when a search query comes in -- say I want to find all the green blocks in one particular column -- I execute this search query by appending that column's delimiter to the green value. Okay? And now it's easy to see that you only get the results in the first column because of the [indiscernible]. And this is happening all inside Succinct. And since I've shown you this for tables, I can now do it for documents, for key value stores, and everything. Such a simple, powerful interface. And then I was going to show you that you can actually do very powerful graph queries using the same interface. But okay. How do we handle skewed workloads? This answers part of your question, which I'm going to
jump into. What do we do today? So something that [indiscernible] a while ago, at least in MapReduce, is that if you have a query distribution, or a load distribution across shards, of this order -- so this is a [indiscernible] distribution where some shards are very lightly loaded and other shards are hot -- what you do is create additional replicas for the shards that are hot and fewer replicas for the others. This is something I'm going to call selective replication. The problem is that selective replication is kind of coarse grained. Right? If you want it to increase the throughput, then if you do 2X replicas, your throughput increases by 2X. In memory, you want to see if you could do something more efficient.
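As a rough illustration, selective replication just sizes each shard's replica count to its load -- a minimal sketch with hypothetical numbers, showing why it only moves in whole-replica steps:

    # Hypothetical sketch of selective replication: hot shards get more
    # replicas. Throughput scales only in whole-replica increments
    # (2x replicas ~ 2x throughput), which is why it is coarse grained.
    import math

    def replicas_per_shard(loads, base_capacity):
        """loads: queries/sec per shard; base_capacity: what one replica serves."""
        return [max(1, math.ceil(load / base_capacity)) for load in loads]

    loads = [50, 900, 120, 2400]           # skewed: the last shard is hot
    print(replicas_per_shard(loads, 500))  # -> [1, 2, 1, 5]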
So here's what we do. I had the Succinct techniques, right, where I had this input file and I stored these data structures. Two of these data structures, as I mentioned, have very small storage overhead. The other two were sampled data structures, so their storage depends on the sampling rate. And for sampling rate x, the tradeoff is very simple: the space becomes roughly (n log n) / x for the top one, which was the sampled suffix array, and the same for the one for random access -- so 2 (n log n) / x in total -- and every time I have to do a query, I spend roughly x extra time computing the unsampled values.
So how can I take this and build something more interesting? I'm going to take this sampled array and use one of the techniques that people have used long, long ago in coding techniques. Okay? What I'm going to do is take the sampled array and store it along multiple layers. So the top array has sampling rate two, so I'm storing every second sampled value. What I'm going to do is store it along multiple layers: 2, 4, 8, and so on. Why is that interesting? The interesting thing is that I can change the sampling rate easily now. That's what I wanted to convince you of. If I want to do a layer deletion, I can simply deallocate the space for a layer. Right? And my sampling rate suddenly changes from 2 to 4. Yes? That is nice because I have reduced my storage, though I have increased my query latency, right? Adding layers is different. Suppose I want to add a new layer. Right? The problem is that these are highly interactive systems. We don't want to dedicate extra resources to compute the unsampled values that you need to fill in this layer, to populate this layer. Right? But the sampling rate goes down, so you have higher storage but lower query latency. The interesting thing is this:
Succinct is already computing the unsampled values on the fly, right, during query execution. So we can use that to [indiscernible] fill these layers. So basically, you get very low overhead ways to fill and delete layers here. Right? To populate the layers [indiscernible]. And this is very nice for skewed workloads in particular, because once you have computed a value -- and the queries are being executed more and more on that shard -- the values [indiscernible] for skewed workloads. Okay? So you can basically add and delete layers dynamically, and by adding a few small bits here and there, I'll just show that you can add and delete layers independent of existing layers, and even do query execution independent of which layers exist and do not exist.
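Here is a minimal Python sketch of the layered idea, under my own simplifying assumptions: exclusive layers holding a plain array's sampled values, where real Succinct layers sampled suffix-array entries and fills them opportunistically during query execution:

    # Hypothetical layered sampled array. Layer k holds indices that are
    # multiples of k but not of 2k, so layers 2, 4, 8 together sample at
    # rate 2; dropping layer 2 leaves an effective rate of 4.
    class LayeredSampledArray:
        def __init__(self, values, rates=(2, 4, 8)):
            self.n = len(values)
            deepest = max(rates)
            self.layers = {}
            for k in sorted(rates):
                self.layers[k] = {i: values[i] for i in range(0, self.n, k)
                                  if i % (2 * k) != 0 or k == deepest}

        def delete_layer(self, k):
            """Coarsen: free a layer's space; its indices get slower to read."""
            self.layers.pop(k, None)

        def add_layer(self, k):
            """Refine: allocate an empty layer, filled lazily by queries."""
            self.layers.setdefault(k, {})

        def lookup(self, i, recompute):
            """Return values[i]; on a miss, recompute it on the fly (as
            Succinct does during query execution) and cache it for free."""
            for layer in self.layers.values():
                if i in layer:
                    return layer[i]
            val = recompute(i)  # the expensive pointer-chasing path
            for k, layer in sorted(self.layers.items()):
                if i % k == 0:
                    layer[i] = val  # opportunistically populate the layer
                    break
            return val

    # Example: coarsen from rate 2 to rate 4, then lazily refine back.
    arr = list(range(100))
    lsa = LayeredSampledArray(arr)
    lsa.delete_layer(2)                     # rate 2 -> 4: less storage, slower
    lsa.add_layer(2)                        # empty layer, refilled by queries
    print(lsa.lookup(6, lambda i: arr[i]))  # miss -> recompute -> cached

Deleting the rate-2 layer instantly reclaims its storage at the cost of slower lookups; re-adding it costs nothing up front because queries repopulate it.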
So what I have done is take the system, Succinct, and allow you to achieve any point on this smooth storage-performance tradeoff curve, where you can choose whatever storage you want, going all the way down to bzip-like compression and all the way up to indexes, and you can pick any operating point depending on what sampling rate you chose. Right? And not only that: by adding and deleting layers, I have given you a way to move along the tradeoff curve very, very efficiently. Right? This is something that we have never been able to do in distributed systems. Given a particular data set, you always had a fixed data size. This really allows you to achieve a very flexible way of designing distributed systems for interactive queries. Okay. So now, since I have the solution, I can tell
you how I can apply this particular solution to solve many, many systems problems. The first one is, like I said, you only have coarse-grained control for these skewed query distributions; Succinct, by using this flexible tradeoff, can give much, much finer-grained control over the throughput and latency. In particular, for shards that are heavily loaded, you can move along the tradeoff curve and get much finer control: if you want 1.5X throughput, you only increase to that level of storage.
Okay? So I'm going to mostly skip [indiscernible] failures. Again, the same thing: today, like I said, Facebook can delay replica re-creation for 15 minutes in case of [indiscernible] failures, and this creates a very high load on the remaining replicas. Suppose you have three replicas and one of those fails, but I don't want to create a new replica yet, so all the queries that were going to the failed replica are now going to the remaining replicas. So if only one [indiscernible] failure happens, your load increases by 50 percent on the remaining replicas. What Succinct allows you to do is navigate along the tradeoff curve upon load spikes, and here's one simple result. What we did was we ran the system, and at time equal to 30, we increased the load on the system by 3X, okay? Once the load on the system increases, at this point Succinct says, okay, I have to create another layer of samples. And it creates the layer, and within five minutes it has started to meet the new load of the system, because the values in the new layer were filled up in five minutes. Right? And by the time 15 minutes had passed, my entire system was stable again. Right? And this can be seen here, where -- this is the queue length as time elapses -- the queue starts building up as soon as I increase the load, and then you can see the queue length drop. And this is the worst case result for Succinct, because this is for uniform workloads. For a skewed workload, [indiscernible] than a minute or so for very high 3X load increases.
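As a rough sketch of that adaptation loop -- my own hypothetical policy, reusing the LayeredSampledArray above, not Succinct's actual controller: when the queue builds up, add a finer layer and let queries fill it; when load subsides, drop it to reclaim memory:

    # Hypothetical load-adaptation policy (illustrative thresholds).
    def adapt(store, queue_len, high_water=1000, low_water=50):
        """Add a finer sampling layer under load spikes; drop it when idle."""
        rates = sorted(store.layers)
        if queue_len > high_water and rates and rates[0] > 2:
            # Load spike (e.g., a failed replica's traffic): refine. The new
            # layer starts empty and is populated by queries on the fly.
            store.add_layer(rates[0] // 2)
        elif queue_len < low_water and len(rates) > 1:
            # Load subsided: coarsen to reclaim memory for other shards.
            store.delete_layer(rates[0])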
Okay. So I'm going to skip this part. One thing that I didn't say is that Succinct doesn't yet have transactional support -- we don't provide [indiscernible] -- and that is something I'm very happy to talk about later on. What I want to talk about is three quick projects in the future work that I have planned. One is more short term, the other one is medium term, and the final one is long term.
So I think in terms of future work, there's one problem that we haven't really resolved yet, which is how to do these interactive queries on graphs. An increasingly large number of services today answer your queries by exploiting your information on social networks, right? And so people have been thinking about integrating these interactive queries on graphs with these distributed stores. There's been a lot of work in graph processing, which means you want to run shard [indiscernible], which is again the same as batch processing. But interactive queries on graphs are far from efficient. There are two systems that are the best that exist, [indiscernible] and Titan, and believe me, they're not even close to good. [Indiscernible] does not have any sharding, so you cannot scale it up. Titan has poor performance. It's not that these systems are poorly built. It's that these graph queries are actually very complex; even in my thesis I showed that some of the graph queries are just impossible to do without scanning the entire graph. And here's one simple query: find the friends of Ratul who live in Berkeley, right? Now the problem is that to execute --
>> [Indiscernible]?
[Laughter]
>> Rachit Agarwal: So there are two ways to execute this query. One is you look at Ratul's friends, which is a large number of people, and then you look at people in Berkeley, which is again a large number of people, and you do a very complex join. Right? And as we all know, that is going to crash the system, right? If it were for me, maybe fewer friends. Another way to do this is you look at Ratul's friends, and for each of the friends, you go and random access that person's location, right? And then filter out the results while you are doing this random access.
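A minimal sketch of those two plans in Python, with hypothetical in-memory maps standing in for the graph store:

    # Hypothetical sketch of the two execution plans for
    # "friends of Ratul who live in Berkeley".
    friends = {"Ratul": ["A", "B", "C"]}            # adjacency lists
    city = {"A": "Berkeley", "B": "Seattle", "C": "Berkeley"}
    people_in = {"Berkeley": ["A", "C", "D"]}       # reverse index on city

    # Plan 1: intersect two potentially huge sets (the "very complex join").
    plan1 = set(friends["Ratul"]) & set(people_in["Berkeley"])

    # Plan 2: enumerate the friend list and random-access each friend's
    # location, filtering as you go -- this is where hot/cold data matters,
    # because a hot node's friends may themselves be cold.
    plan2 = {f for f in friends["Ratul"] if city[f] == "Berkeley"}

    assert plan1 == plan2 == {"A", "C"}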
What is the problem? The problem is that Ratul is hot data, right? Everybody wants to look up Ratul. And then some of his friends might be cold data -- they never access their Facebook, so essentially they're cold data. But since I'm executing this query on a hot node, I may be touching cold nodes because of the relationships between people. Which means caching becomes much, much more important than in standard systems. Right? Because I might not be querying cold data directly, but the queries might be going to the cold data indirectly, and I think compression should help there, right, if you can cache much more data. So I think this is one thing which I'm super excited about these days. We have done some very preliminary evaluation of existing systems, and I think they're far, far from what we can achieve today. So building a distributed graph
store is a short-term problem which I think will integrate very well with
today's [indiscernible] services. The second problem, which, again, I'm very excited about and where we have made some progress, is all this work about achieving data confidentiality. Okay? I don't want you to access my data; please don't access my data. And then I tell you, okay, take my data, and allow me to query this data in an interesting manner, right? So people want to achieve some kind of data confidentiality, and they use encryption today. On the other hand, what I have shown you today is that you can get a lot of performance benefits using compression. Right? The question is whether it's possible to get the benefits of compression while getting data confidentiality. And it's a very challenging problem, because compression is all about removing redundancy, while encryption, if you think about it, is about adding redundancy. Right? And these two things seem very, very much against each other. And we have realized, at least over the last six months, that we need fundamentally new techniques. Ion and I got interested in this problem, and then we started thinking about why it is that we should be able to solve this problem -- because I think we can focus on some limited functionality. But more than that, what can we do today? So we built this very simple system called MiniCrypt, which can do queries on compressed, encrypted data, but no search, no regular expressions, nothing -- just plain, simple key value store lookups. I don't want to say much here, but it still turns out to be a very non-trivial problem, and you can get a lot there compared to existing systems. Okay?
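A minimal sketch of the pack-based idea such a system might use: compress a group of key-value pairs together, then encrypt the pack, so a lookup fetches and decrypts one pack rather than everything. This is my illustrative reconstruction, not necessarily MiniCrypt's actual design; it uses zlib plus the third-party cryptography package's Fernet as a stand-in cipher:

    # Hypothetical pack-based encrypted key-value lookup (illustrative only).
    import json, zlib
    from cryptography.fernet import Fernet  # pip install cryptography

    key = Fernet.generate_key()
    f = Fernet(key)

    def build_packs(items, pack_size=2):
        """Group items into packs, compress each pack, then encrypt it.
        Compressing before encrypting matters: ciphertext won't compress."""
        kvs = sorted(items.items())
        packs, index = [], {}
        for i in range(0, len(kvs), pack_size):
            pack = dict(kvs[i:i + pack_size])
            for k in pack:
                index[k] = len(packs)  # key -> pack id
            packs.append(f.encrypt(zlib.compress(json.dumps(pack).encode())))
        return packs, index

    def get(packs, index, k):
        """Fetch, decrypt, and decompress only the one pack holding k."""
        blob = zlib.decompress(f.decrypt(packs[index[k]]))
        return json.loads(blob)[k]

    packs, index = build_packs({"a": 1, "b": 2, "c": 3, "d": 4})
    print(get(packs, index, "c"))  # -> 3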
And Ganesh is telling me that I have only a few seconds left, so I want to say the third thing, which I think is something that is happening more and more as well: this whole direction of resource disaggregation, which some people call rack scale computing. So if
you look at Facebook, Google, all these big companies, they have recently moved to this resource [indiscernible] datacenter architecture. Today, each server has some amount of RAM, some number of CPUs, some amount of disk, and they're tightly integrated. Right? And because of the capacity scaling challenges, people [indiscernible] like Intel have realized that this model is no longer sustainable. So what they say is, okay, we are going to disaggregate all of these resources. We're going to build CPU blades separate from memory blades, separate from IO blades and disk blades. Okay? And all these pools of resources will be connected by a network fabric. So this is called rack scale computing, or some people call it [indiscernible] datacenters. And by the way, these rack scale computers are already being used in clusters and datacenters; Facebook actually uses them in one of their Oregon clusters. In terms of systems, I believe these new architectures are going to fundamentally change the way we design systems, because the systems we built over the last ten or 20 years were built on some fundamental assumptions -- like having very high CPU-memory bandwidth -- which no longer hold in this setting, right? The failure models are going to change, how we exploit data locality is going to change, and even how CPUs interact with disks is going to change. And I think there are going to be numerous new challenges in how we design and build systems. So that's all I have to say, but I'm going to take
one extra minute. What I talked to you about today was mostly queries on compressed data. I want to give you a very quick picture of some of the other projects that I didn't get to talk about. I did my Ph.D. on queries on graphs, and as Ganesh said initially, I did some of the theory work there; what I was really trying to establish is fundamentally why queries on graphs are a hard problem. Earlier in my career, I did some work on coding and information theory, which I have recently gotten interested in again because of the problem of queries on compressed, encrypted data, and there we had some very interesting results. More recently, I have had work on [indiscernible] disaggregation, and we recently had some successes understanding the network challenges for resource disaggregation. I did some work on network debugging a while ago, and more recently I started thinking about some interesting network debugging problems again. And finally, a lot of people still think of me as a routing guy. And believe me, the only routing I do today is driving to my office from home. But a lot of people still think of me as the scalable routing guy. So I think, you know, having been exposed to these different areas or subareas has given me the right platform to solve the problems that I want to solve. And none of this would have been possible without all the awesome collaborators I have, and all the people that listen to my talks. Thanks all for coming, and I can take more questions now.
[Applause]
>> The question I have is: because you're using a really complex compression algorithm, does that make your data more vulnerable to corruption? Because now every [indiscernible], every entry, is actually related to all of the original data, right? So, for example, if you corrupt one part of raw data, just that one part is gone. But now, if you corrupt one part of your compressed data, then all the data could be gone, right? You cannot recover it. So that would be a bigger problem. And also, you had mentioned you were using [indiscernible]. I'm wondering whether you are thinking about basically adding another layer on top of that batch layer instead of replacing that batch layer.
>> Rachit Agarwal: So, the first question, on data corruption: it's an interesting question. Well, I haven't thought about it, to be frank. But let me see. When you said data corruption, do you mean when data fails? Or are you talking about low-level data corruption, where --
>> Yes.
>> Rachit Agarwal: Okay.
>> [Indiscernible].
>> Rachit Agarwal: Yes, definitely. Memory, I'm thinking, because we are
doing in memory computations, right? So I'm thinking -- I'm trying to put
data corruption in place. But let's see.
>> With all the alternate memory technologies, they're more likely to fail,
so you might actually have failures.
>> Rachit Agarwal: Right. So the persistence is easy, right? The way we persist data is just by keeping it in secondary storage [indiscernible]. But an interesting thing might be -- again, I'm speculating, since I already told you I haven't thought about this problem -- since we are computing these unsampled values on the fly, if we have some data corruption, then it might be interesting to understand whether we can identify data corruption, because there will be an inconsistency between the sampled values, the unsampled values, and the number of pointer lookups that we do.
>> But if you identify that, then how do you [indiscernible]? Because if you cannot recover it, the damage [indiscernible] -- how much damage that will cause.
>> Rachit Agarwal: Absolutely. So the way we do it today is we persist the data. Right? We do data replication on secondary storage with every write. Right? So once the data is persisted and replicated, then if something fails, we can check where it has failed.
>> What I'm saying is, if you cannot recover that, you might use the replication. Then it matters how much damage that will cause, right? [Indiscernible] flat file, the damage will just be the corrupted data. But now you are compressing the data, so if it's corrupted, then it is very possible that the failure will be amplified because of the compression.
>> You still store the data in persistent storage. You don't store the compressed form in the persistent storage.
>> So [indiscernible] you need to --
>> Well, you're doing in-memory computation on the compressed data, and then you're storing the real data on persistent storage.
>> Okay. So the -- that's --
>> So if you need more, you replicate.
>> Okay.
>> You need more resilience, you replicate.
>> [Indiscernible].
>> Rachit Agarwal: No, no, no. It's not replenishing the batch layer with in -- just like you do any --
>> Ganesh Ananthanarayanan: I think we're out of time, but [indiscernible].
>> Rachit Agarwal: Thanks all, for coming.
[Applause]