>> Christian Konig: Okay. Thank you, everyone, for coming. It is my great pleasure
to introduce Arun Kumar from the University of Wisconsin where he is co-advised by
Jeff Naughton and Jignesh Patel. Arun works at the intersection of data management
and machine learning with the stress on data management. And the research he has
produced should be interesting for various reasons. For one, it's highly decorated,
where we are talking about a SIGMOD best paper award in '14. It has also been
highly controversial. So his paper submissions to database conferences have caused
all sorts of very strong reactions from the reviewers. And it is also practical. So
various pieces of code that are based on his ideas are actually now shipping with a
number of different products. Make sure to ask him about all of this, but the most
important tidbit that I want to leave you with, which few people know, is that Arun is
actually a member of the American Screen Actors Guild and has modeled and starred in
commercials. Again, feel free --
>> Arun Kumar: One commercial!
>> Christian Konig: One commercial. I am exaggerating slightly. But he is a man of
various talents. If this database stuff doesn't work out, Hollywood is option B.
>> Christian Konig: With that I'll hand it over to Arun.
>> Arun Kumar: Well, Nissan and Amazon, car commercial.
Okay, great. Thanks, Christian and thanks to everyone for coming to the talk and all
the people who are viewing it remotely and as he mentioned my name is Arun Kumar.
I'm from the University of Wisconsin. Today I will talk to you about my research on
accelerating Advanced Analytics. I can talk to you offline about the commercial.
>> Arun Kumar: So Advanced Analytics is becoming ubiquitous. For our purposes, I
define Advanced Analytics as the analysis of potentially large and complex data sets
using statistical machine learning or ML techniques. So basically, it is the coming
together of these worlds of data management and machine learning. I approach this
intersection from a data management standpoint.
We all see Advanced Analytics in action practically every day. I don't need to preach
to the choir here. So everyone here already knows about this. Every time you get
ranked results from Google's search engine, the output is because of Advanced
Analytics. Every time Netflix recommends a movie to you, that is Advanced Analytics
in action. And the whole world saw IBM's Watson system defeat human champions in
the Jeopardy answering contest. Watson is powered by Advanced Analytics.
The high-profile, highly visible success of these applications of Advanced
Analytics has led to an enormous demand in the enterprise domains, and again most of
you are probably already familiar with this: healthcare, retail, insurance, finance,
even the academic domains, the sciences and the humanities, for products that make
it easier for these users to integrate Advanced Analytics into their applications. Market
research firms estimate that the market size for Advanced Analytics products is set to
grow to anywhere between $6 billion and $29 billion per year over the next few years. And no
wonder then that practically all major data management and analysis companies want
a slice of this pie. There are also a slew of startups in this space, and open source
toolkits to integrate Advanced Analytics into their applications.
In spite of all this, as I'm going to explain today, there still remain numerous
bottlenecks in the end-to-end process of building and deploying Advanced Analytics
applications. At the end of the day, my research is about abstractions, algorithms, and
systems that mitigate these bottlenecks in the end-to-end process from a data
management standpoint, thus accelerating Advanced Analytics. And by acceleration, I
mean both system efficiency, the running time of the systems and the algorithms
involved, and human efficiency. The productivity of the data scientist and other
humans that are involved in the process.
I'll start with a high-level overview of my research, which has largely been in the context
of two groups of humans in the Advanced Analytics lifecycle. Data scientists who are
tasked with building and deploying machine learning models that analyze data. And
behind the scenes we have the software engineers at companies such as Google,
Microsoft, IBM, Oracle, who are tasked with building implementations of these
machine learning techniques on top of data processing systems such as Relational
Database Management Systems, Spark, Hadoop, and so on.
From conversations with data scientists, engineers, and others at various settings, we
found that there are numerous bottlenecks in their work and throughout the lifecycle of
Advanced Analytics. We wrote about some of these bottlenecks in an ACM Queue
magazine article that was invited to the Communications of the ACM in 2013. My work
has largely focused on three classes of bottlenecks.
The first one arises when data scientists have to build ML models and that is the
process of feature engineering: transforming the raw data into an ML-ready feature
vector. There's a lot of work that goes into this space, and some of my work has
explored different bottlenecks in the feature engineering space.
The second class of bottlenecks arises when ML models that have been built have to be
integrated into production software and production systems. And the third class of
bottlenecks that I have looked at is from the perspective of software engineers who
have to build ML toolkits that scale these algorithms to larger amounts of data. The
sheer diversity of these ML techniques makes it a tedious and daunting process for
them to build these toolkits and there are some bottlenecks that are addressed in that
To be more precise in terms of the publications that are focused on these bottlenecks,
in the context of feature engineering I have worked on two main bottlenecks. The first
one is when features arrive from multiple tables, which requires joins of tables before
ML can be applied. This has been in the context of project Orion, which appeared in
SIGMOD 2015, and project Hamlet, which is set to appear in SIGMOD later this year.
The other bottleneck that I looked at was the process of exploratory subset selection
where data scientists are often involved in the loop in picking which features they want
to use for the ML task. Often they use environments such as R for this process, and
we focused on elevating this process to a declarative level by introducing a domain-specific
language for feature selection in the R environment and applying database-style
and ML optimizations to improve the runtime performance of this process and
increase the productivity of the data scientists. That is project Columbus, which
appeared in SIGMOD 2014. In the context of integrating ML models into production, I
have looked at probabilistic graphical models for optical character recognition, or OCR,
data that data scientists want to query in conjunction with structured data in RDBMSs.
That is project Staccato, which appeared in VLDB 2012.
Finally in the context of helping software engineers build scalable ML implementations,
I took a step towards devising a unified software abstraction and architecture for
in-RDBMS implementations of ML techniques. That is project Bismarck, which appeared in
SIGMOD 2012.
To speak of the impact of some of my research, as Christian already mentioned briefly,
for project Bismarck, the code and/or the ideas have been shipped by numerous
companies including Greenplum, which is now EMC, Oracle, and Cloudera. And we
also contributed the code of the Bismarck project to the MADlib open source analytics
library. Staccato is being used by projects in digital humanities and sciences.
Columbus won the best paper award at SIGMOD, as he mentioned. And for Orion and
Hamlet, I am currently speaking to a product group within Microsoft, the online
security group. They keep changing their names. They were called the universal safety
platform. Robert McCann is the person. They are integrating some of these ideas into the
SCOPE ML environment on top of Cosmos, and they are exploring integrating that with
their ML pipelines. And also with LogicBlox, which is a database company that
wants to deploy some of these ideas on top of their production analytics platforms.
For today's talk I am going to focus on the first bottleneck of applying ML over joins of
multi-table data and I'll dive deeper into projects Orion and Hamlet. I pick this
particular bottleneck primarily because I think it best illustrates my
earlier point about the coming together of the worlds of data management and
machine learning, how it gives rise to new problems and opportunities. And also the
sheer number of open research questions that this work has led to. Yes, Ravi?
>>: I'm -- [speaker away from microphone.] you keep saying, I hear at least three or
four projects.
>> Arun Kumar: Right.
>>: Is this in context with one system? Is there a reason why everything is in a
>> Arun Kumar: Well, Orion and Hamlet have been in the context of one system
environment. The Bismarck stuff has been in the context of another system
environment, but in terms of data platforms, a lot of this has been prototyped on top of,
say, PostgreSQL or even Hadoop. From a systems perspective, a lot of these ideas
are on top of the same data platforms, but I consider them systems because I look at
the end-to-end stack, from the interaction all the way to the execution perspective.
Okay? So here is the outline for the rest of the talk today. I'll dive into ML over joins
and then I'll talk about my future research ideas. A bit of deeper outline for ML over
joins. I'll motivate the problem and set up an example that I'll use as a running
example for the rest of the talk. And I will go into projects Orion and Hamlet. Let's
start with a little bit of background, normalized data. All of us here probably know
normalized data, but just to set up the example and introduce the terms. Many
structured data sets in the real world are multi-table. For example, consider a data
scientist that's dealing with data at an insurance company, like American Family. They
have data about customers which is a table that contains attributes about customers.
The gender, age, income, employer, and so on.
They also have data about employers, companies, universities, and other
organizations, containing data about companies like the state of the headquarters, what is
the revenue, and so on. Notice that the employer ID of the employer is an attribute in
the customers table that connects these two tables. In database parlance it's also
called the foreign key, and the employer ID is what is called a primary key, also called
just a key, in the employers table.
If we want to get some information about the employer of a customer, we need to do
this fundamental relational operation called a join that basically stitches these two
records together based on what the employer ID is.
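A minimal sketch of the key-foreign key join just described, in plain Python. The table contents and column names are made up for illustration and are not the data shown in the talk; the helper name `kfk_join` is likewise hypothetical.

```python
# Hypothetical toy tables; column names and values are illustrative only.
customers = [
    {"customer_id": 1, "age": 34, "income": 90000, "employer_id": "MSFT"},
    {"customer_id": 2, "age": 51, "income": 120000, "employer_id": "MSFT"},
    {"customer_id": 3, "age": 28, "income": 60000, "employer_id": "UW"},
]
employers = [
    {"employer_id": "MSFT", "state": "WA", "revenue": 90e9},
    {"employer_id": "UW", "state": "WI", "revenue": 3e9},
]

def kfk_join(fact_rows, dim_rows, key):
    """Stitch each fact row to the single dimension row its foreign key refers to."""
    # The key is a primary key in dim_rows, so each value maps to exactly one row.
    index = {row[key]: row for row in dim_rows}
    return [{**fact, **index[fact[key]]} for fact in fact_rows]

joined = kfk_join(customers, employers, "employer_id")
for row in joined:
    print(row)
```

Note that the output has one row per customer, and the employer attributes are copied into every row that references that employer.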
Data of this form and key-foreign key joins are not specific to insurance. They are
ubiquitous, they arise in practically all domains where there is structured data. They
arise in recommendation systems where you have data about ratings being joined with users and
movies. They arise in online security. They arise in the hospitality industry. They arise
even in bioinformatics.
So now that you know everything about databases, let's introduce some examples and
set up the terms for machine learning. Here is the same customers data set. A very
common machine learning task that is used in enterprise settings is customer churn
prediction. Basically, data scientists want to use machine learning to model the
customers to help prevent customers from moving to a competitor. And they start with
these attributes about customers which become the features for the machine learning
algorithm. The age, the employer, and income and so on.
Now, they start with the training data set to train the ML algorithm and they have to
predict the churn attribute, which is based on past customers that have churned or not.
That is also known as the target or the class label, and the job of the data scientist is to
build a machine learning model using this training data set, say a logistic regression
model or SVM or neural network and so on.
Here is the twist that causes trouble in this paradise. The employer ID is a foreign key
that refers to a separate table about employers. ML meets the world of normalized
data. Given a setting of this kind, data scientists view these attributes about the
employers as potentially more features for the ML algorithm. And the intuition could be
that maybe if your customer is employed by a rich corporation based in Washington,
they are less likely to churn. And so they basically want to get all these features which
basically forces them to do key-foreign key joins of these two tables in order to gather
all these features for the ML algorithm.
Unfortunately, almost all major ML toolkits today expect single table inputs for their
data training process. This forces these data scientists to do what I call ML after joins.
So what is this ML after joins and what is the problem? Here is an illustrative
example for the customers and employers tables with some representative numbers of
customers and employers, 100 million customers, 100,000 employers, ten features,
fully features. The feature vectors of the employers are shown with different patterns
over here. You have this key-foreign key join. The input could be, say, hundreds of
But after the join the input blows up to several terabytes. This is because the join has
introduced redundancy in the data. Notice that these vectors about Microsoft are
repeated for every customer that is employed by Microsoft. Now, in many cases this
blowup in storage is actually a major issue. In fact, in one example with one of the
product groups within Microsoft, they actually encountered this problem when they
were doing Web security-related ML applications. It even occurred to a customer of
LogicBlox. They mentioned that they were joining five tables and the data blew up
by over an order of magnitude. So storage is one of the key bottlenecks that ML after
joins encounters. You have to have extra storage space because of these ML toolkits.
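A back-of-the-envelope sketch of where the blowup comes from. The row counts are the ones quoted in the example above; the per-table feature counts are assumptions chosen only for illustration, so the resulting factor is not a number from the talk.

```python
# Row counts quoted in the talk's example; feature counts are assumed.
n_customers = 100_000_000
n_employers = 100_000
d_customer = 10   # assumed number of customer features
d_employer = 40   # assumed number of employer features

# Feature values stored when the two tables are kept separate (normalized).
values_before_join = n_customers * d_customer + n_employers * d_employer

# After the key-foreign key join, every customer row carries a copy of its
# employer's feature vector, so employer features are stored once per customer.
values_after_join = n_customers * (d_customer + d_employer)

blowup = values_after_join / values_before_join
print(f"before: {values_before_join:,}  after: {values_after_join:,}  blowup: {blowup:.2f}x")
```

The ratio grows with the number of features in the referenced table and with the number of customers per employer, which is why joins over several wide tables can inflate the data far more than this toy setting does.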
There's also redundancy in the data, which causes redundancy in analytics
computations, and that wastes runtime. In the real world data is rarely static. As you get
more examples, more features, the data scientist now has this additional headache of
maintaining the output of the join and keeping the models fresh, which impedes the
Ultimately all of this impedes the exploratory analysis process that is often required for
building machine learning models. Extra storage and wasted runtime are what I call
system efficiency issues. Maintenance headaches and impeding the exploratory
analysis process are what I call human efficiency issues because the data scientist is
spending time on grunt work and thus productivity goes down. And nowadays a lot of
these analytics computations are moving to the Cloud. There every bit of extra storage
and extra computation could incur extra monetary costs.
To summarize, I introduced the problem of ML after joins, which is a fundamental
bottleneck in the end-to-end process of building ML models. Data in the real world is
often multi-table, connected by key-foreign key relationships. Since most ML toolkits
expect single table inputs, the data scientists are forced to materialize the
[indiscernible] join that concatenates feature vectors from all these tables, which leads to
human and system efficiency issues. Yes, Ravi?
>>: Quick question. When you build an ML model, is it built on the real set or a training set?
Aren't training sets much smaller? First question.
And second question, the patterns you explain, right, redundancy. They seem to be
very amenable to compression techniques like [indiscernible] input.
>> Arun Kumar: Right.
>>: Why don't I take a denormalized relation and express it.
>> Arun Kumar: Right. Great questions. Coming to the first question, is the training
sample, does it have to be smaller than the actual data set? Yes, in many applications
sampling could actually work. But for many of the other ML tasks they actually want to
get, say, the effects that are less likely to occur in a sample. Therefore, they want to
use the whole data set. So there is a spectrum of applications over there. Even inside
a sample, if you have a sampled example set, there might be redundancy in the data and
as I'm going to explain there will still be a tradeoff space for some of the techniques I'm
going to talk about where it could still benefit the performance, runtime performance.
Coming to the second question with respect to compression, yes, in many of these cases
these data sets that have redundancy are amenable to columnar compression.
However, that introduces an orthogonal tradeoff space where you now have to trade off
operational efficiency for compression, or operate directly on compressed data. One
of the techniques that I'm going to talk about today is sort of a stepping stone to that.
Looking directly at operating on compressed data, how do we integrate these ML
techniques. So that's a nice idea for future work and some of the ideas that I'm going
to explain today are amenable to that as well.
>>: I have a quick question on the maintenance. Can you explain that one again?
Because it seems like the maintenance is just like a view of the [indiscernible]. That's
all you need. Is there something more than that?
>> Arun Kumar: Well, yes. Yes and no. It depends on the application and the
environment. In some of these cases they actually produce copies of the data and
they move to different toolkits. So there the infrastructure that we have built in say the
database environment for view maintenance, that gets lost. And, therefore, they have
to manually keep track of where they have the data and how to update the data and so
on. In the database context, we can construct a view of these joins and do the joins
virtually on the fly. I'm going to talk about that as well. Turns out that there is still
computational redundancy if you do the view approach. There's a tradeoff space that
I'll talk about.
>>: Right.
>> Arun Kumar: Okay? To give an overview of my work, in my thesis work I proposed
this paradigm of ML over joins to combat this problem of ML after joins. And in project
Orion, which appeared in SIGMOD 2015, we show how we can avoid joins physically.
Essentially we do not construct a table of the join, we do not need the joins
themselves. We can operate on data exactly as it resides in the multi-table format.
We show how this can help improve runtime performance because the data in the
input of the join could be much smaller, but it produces exactly the same model and
that gives the same accuracy.
This could potentially improve human and system efficiency. Going further, in project
Hamlet, SIGMOD 2016, we show that in some cases you can actually avoid joins
logically as well. What do I mean by that? We can actually get rid of entire input
tables without even looking at them. This obviously runs faster, but I'm going to
explain why it can give very similar accuracy. This could potentially improve human
and system efficiency even further. Moving on, I'll dive into project Orion. In project
Orion our idea to combat the problem of ML after joins is this technique of factorized
learning. And the intuition is simple: we factorize, or decompose, these ML
computations and push them down through the joins to the base tables, much like how
database systems push selections, projections and aggregates down through joins.
The benefits: we avoid redundancy in I/O because we are not constructing this output
data set. We avoid redundancy in computations. We don't need extra storage.
Runtime could improve, and it could be potentially more data scientist friendly because
they operate on data exactly how it resides rather than on a materialized intermediate
data set.
The question is: Is factorized learning even possible? And in this paper we show that
yes, it is possible at least for a large class of ML models known as Generalized Linear
Models, also known as GLMs, that are solved using batch gradient methods.
The challenge is how do we factorize these GLMs without sacrificing the ML accuracy,
the scalability of the implementations, and the ease of development of these ideas on
top of existing data management platforms?
I'll start with some brief background about these GLMs, for those of you who are not
familiar with them. Recall the classification example that I'm using as a running example.
Classify customers as to whether they are likely to churn or not. In ML we start by
mapping each customer to a D-dimensional numeric feature vector. In this example
here you have two dimensions. So basically customers are just points. And now
GLMs classify these customers based on a hyperplane W over there. On one side
are the points, basically the customers, that churn. On the other side are those who do not.
The question is how do we choose a good hyperplane W? In GLMs, recall that we
start with a training data set. That is the input. We have a set of labeled customers.
Those are the points labeled yes or no, plus or minus. The idea to compute a
hyperplane is to minimize a certain score of the hyperplane on the training data set,
say the misclassification rate. That is essentially a sum of N terms, where N is the
number of labeled customers, and each term is a function of the inner product W transpose Xi,
that's the distance of the point to the hyperplane, and the class label Yi, where Xi is
the feature vector of the i-th customer and Yi is the class label of the customer.
It turns out that the misclassification rate is hard to optimize and thus GLMs instead
use a smooth function of the inner products, W transpose Xi. And depending on what
this function is, we get a different ML model. If it is the squared distance, it is the popular
least [indiscernible] linear regression model that is used for trend analysis. If it is
the log loss, it is the popular logistic regression classifier. If it is hinge loss, it is
the linear SVM, and so on.
So basically, the takeaway here is all of these ML models have the same common
mathematical structure: there is this minimization problem with the sum of N
terms and the function of the inner product W transpose Xi.
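Written out, the common structure described above looks roughly as follows. The symbols W, Xi, Yi and N follow the talk; the name f for the per-example function and the specific standard forms of the three losses (with labels encoded as +1 and -1 for the classifiers) are textbook conventions assumed here, not taken from the slides.

```latex
\min_{W} \; L(W) = \sum_{i=1}^{N} f\!\left(W^{\top} X_i,\; Y_i\right)
\qquad\text{with, for example,}\qquad
f(a, y) =
\begin{cases}
(a - y)^2 & \text{squared loss (linear regression)} \\
\log\!\left(1 + e^{-y a}\right) & \text{log loss (logistic regression), } y \in \{-1, +1\} \\
\max(0,\; 1 - y a) & \text{hinge loss (linear SVM), } y \in \{-1, +1\}
\end{cases}
```

In every case the data enter only through the inner product of W and Xi, which is the property the rest of the talk relies on.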
>>: So there are SVMs with other kernel functions?
>> Arun Kumar: Right.
>>: Then that will not work, right? Your whole approach will not --
>> Arun Kumar: That's a good question. It depends upon the structure of the kernel.
If the kernels operate on the inner products, then some of these techniques that I talk
about can actually apply there as well. But in this particular work we focused on GLMs
and therefore we only deal with linear SVMs here.
Okay. So how do we do this minimization? There is this technique called gradient
descent that I'll give some background about. We need to get to the optimal hyperplane
W that minimizes this function, which is also known as the loss function or L of W. And
it turns out that the loss function for GLMs is a bowl-shaped curve, formally known as a
convex function. And the optimal is essentially the bottom of the bowl. That is some
separating hyperplane in the point space. In general, the optimal does not have a closed
form. And thus, people use an iterative solution known as gradient descent, where the
idea is simple. You start with some value W, compute the gradient at W, and the
gradient is essentially the generalization of the slope to multiple dimensions. And you
move in the opposite direction of the gradient, the descent. And alpha is the step-size
parameter that controls how much you move in the opposite direction. To give an
illustration, we start with some W0. That is some separating hyperplane over there.
We do one pass over the data set. Basically look at all the points and compute the
gradient. Take a step in the opposite direction. We get to a new model, W1. Do one
pass over the data. Compute the gradient. Move in the opposite direction. And keep
doing this iteratively until you get closer and closer to the optimal. So this is basically
the batch gradient descent, or BGD, technique. What does it do today in the context of -- yes?
>>: [speaker away from microphone.] single pass over the entire table?
>> Arun Kumar: Yes.
>>: And how many passes typically does it take to converge?
>> Arun Kumar: It depends. It depends on the data set and the accuracy desired by
the data scientists. In practice I have seen them run like 20 to 40, 50 iterations.
>>: So then the question of the cost of just joining the key-foreign key, is that a
bottleneck in the larger process? Have you ever heard of 40 passes?
>> Arun Kumar: I'll explain some of the experimental results, with breakdown of what
is the runtime of the join. It turns out that the redundancy introduced by the join is
often the dominating factor for that. Okay?
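The batch gradient descent loop described before the questions can be sketched as follows. This is a minimal illustration, assuming the log loss with labels encoded as +1 and -1; the toy data, the step size `alpha`, and the function names are all made up for the example and do not come from the talk's experiments.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def loss(w, X, y):
    """Log loss summed over all N examples (labels are +1 / -1)."""
    return sum(math.log(1.0 + math.exp(-yi * dot(w, xi))) for xi, yi in zip(X, y))

def gradient(w, X, y):
    """Full-batch gradient: one pass over the data set."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        a = dot(w, xi)                          # inner product W^T Xi
        g = -yi / (1.0 + math.exp(yi * a))      # derivative of the log loss w.r.t. a
        for j, xij in enumerate(xi):
            grad[j] += g * xij                  # scale the feature vector by g
    return grad

def bgd(X, y, alpha=0.1, iterations=20):
    """Start at W0 = 0 and repeatedly step against the gradient."""
    w = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = gradient(w, X, y)
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical toy training set: first feature is a constant bias term.
X = [[1.0, 2.0], [1.0, -1.5], [1.0, 3.0], [1.0, -2.0]]
y = [1, -1, 1, -1]
w = bgd(X, y)
print(w, loss(w, X, y))
```

Each iteration touches every training example once, which is why the size of the table being scanned matters so much for the runtime.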
Okay. So recall that we simply run BGD on the concatenated feature vector if you
want to do ML after joins. You have features from customers tacked next to features from
employers. This is the expression for the loss. Take the first derivative. That's the
expression for the gradient. It is also a sum of N terms with the scalar function G, and
you scale the feature vectors Xi.
Notice that the gradient Nabla L, the model or the hyperplane W, and the feature
vector Xi are all D-dimensional vectors, where D is the total number of features.
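In symbols, the gradient and the update step described above can be written as below. The names L, W, Xi, Yi, G and alpha follow the talk; writing g as the derivative of a per-example function f with respect to the inner product, and the iteration superscript t, are notational assumptions made here.

```latex
\nabla L(W) = \sum_{i=1}^{N} g\!\left(W^{\top} X_i,\; Y_i\right) X_i,
\qquad
g(a, y) = \frac{\partial f(a, y)}{\partial a},
\qquad
W^{(t+1)} = W^{(t)} - \alpha \, \nabla L\!\left(W^{(t)}\right)
```

For the log loss with labels in {-1, +1}, for instance, g(a, y) = -y / (1 + e^{ya}).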
So how does this work? We start with the input of the join, do the join, and get the output of
the join. Don't forget about the input. Start with the initial model W0. One pass over
the data. You get the first gradient. You update the model. Next pass over the data,
get the next gradient. Update the model. And proceed iteratively.
So basically BGD after joins, you physically write the output of the join and then you do
one scan of the output table per iteration. What does factorized learning do? Recall
that our goal was to push computations to the base tables and our intuition is simple.
You split up the computations on the two feature vectors from the two tables, XC and
This is the expression for the gradient. The inner product can be rewritten as a sum of
inner products over the customer and employer features, and the insight is that a sum of
sums is still a sum. Thus we can do one pass over the employers table to precompute
the partial inner products over the employer features, and use that when we do a pass over
the customers table to reconstruct the full inner products. However, we run into this
challenge of how do we scale the entire feature vector for every
customer? Turns out that one pass over each base table may not be enough. We
need to save some statistics and exchange them across these tables and reconstruct
the full gradient. So how does this work? So here is the output of the join. Get rid of
the output table. Get rid of the joins. You start with the initial model. You chop up the
coefficients into coefficients over customer features and coefficients over employer
features. One pass over the employers table. You get some statistics and if you're
curious, these are basically the partial inner products per employer ID. Use those
statistics to do one pass over the customers table and we get a portion of the gradient.
We also get some statistics over the customers table. If you are curious, these are
basically the group-by sums of the scalar function G per employer ID. Use those
statistics, do a second pass over the employers table and we get the remainder of the
gradient. Stick these two vectors together. That's the full gradient. Proceed and
update the model and go ahead to the next iteration.
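The three passes just described can be sketched in plain Python as follows. This is an illustrative reconstruction from the description in the talk, not the Orion implementation: it assumes the log loss with +1/-1 labels, keeps the employers table as an in-memory dictionary keyed by employer ID, and uses made-up data. A reference gradient computed on the materialized join is included only to check that both give the same result.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def g(a, y):
    """Scalar factor of the log-loss gradient (assumed loss; labels +1 / -1)."""
    return -y / (1.0 + math.exp(y * a))

def factorized_gradient(w_c, w_e, customers, employers):
    """customers: list of (x_c, employer_id, y); employers: dict id -> x_e."""
    # Pass 1 over employers: partial inner products per employer ID.
    partial = {eid: dot(w_e, x_e) for eid, x_e in employers.items()}

    # Pass over customers: rebuild the full inner product, accumulate the
    # customer part of the gradient and the group-by sums of g per employer ID.
    grad_c = [0.0] * len(w_c)
    g_sum = {eid: 0.0 for eid in employers}
    for x_c, eid, y in customers:
        gi = g(dot(w_c, x_c) + partial[eid], y)
        for j, xj in enumerate(x_c):
            grad_c[j] += gi * xj
        g_sum[eid] += gi

    # Pass 2 over employers: scale each employer vector by its summed g.
    grad_e = [0.0] * len(w_e)
    for eid, x_e in employers.items():
        for j, xj in enumerate(x_e):
            grad_e[j] += g_sum[eid] * xj
    return grad_c + grad_e   # stick the two pieces together

def joined_gradient(w_c, w_e, customers, employers):
    """Reference: the same gradient computed over the materialized join."""
    w = w_c + w_e
    grad = [0.0] * len(w)
    for x_c, eid, y in customers:
        x = x_c + employers[eid]          # concatenated feature vector
        gi = g(dot(w, x), y)
        for j, xj in enumerate(x):
            grad[j] += gi * xj
    return grad

# Hypothetical toy data.
employers = {"MSFT": [1.0, 0.5], "UW": [0.2, -1.0]}
customers = [([0.3, 1.0], "MSFT", 1), ([1.2, -0.7], "MSFT", -1), ([-0.4, 0.9], "UW", 1)]
fact = factorized_gradient([0.1, -0.2], [0.05, 0.3], customers, employers)
ref = joined_gradient([0.1, -0.2], [0.05, 0.3], customers, employers)
print(fact)
print(ref)
```

The employer vectors are read twice but never copied per customer, which is where the savings over the joined table come from when many customers share an employer.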
So basically factorized learning requires one pass over customers and two passes over
employers per iteration. And it completely avoids the need to join these two base
tables. Now, yes, Vic?
>>: Can you have sort of data features that require combining some columns from the
factor [indiscernible] dimension.
>> Arun Kumar: Yes.
>>: Like a more complex [indiscernible] in that case, are the technique still
>> Arun Kumar: That's a great question. You are basically talking about feature-feature
interactions. Now, there are methods that are called second-order methods,
like Newton's method, where you actually construct the Hessian matrix that has pairwise
interactions among all features.
Turns out that the factorized learning technique that I talk about here for linearity extends
to that technique as well. However, there is a tradeoff in terms of the runtime
performance. That tradeoff space looks a little bit different than what it looks like for
here. So in terms of feasibility, yes, it's possible. But in terms of efficiency, the
tradeoff space is a bit different.
Okay? So now I have talked about the algorithm. When we want it to work in practice, there
are some challenges that arise when you want to implement it on a real system. What
if the statistics that I talked about per employer ID do not fit in the aggregation
context memory? And how do we implement this on top of existing data processing
systems without having to change the internals, for ease of deployability?
For the first one we go into the details of a partitioning-based approach in the paper
where we stage the computations of different portions of these statistics and stitch
them together to reconstruct the full gradient. And for the second one we use the
abstraction of user-defined functions, specifically user-defined aggregate functions
that are available in practically all major database systems, as well as on Hive
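To make the user-defined aggregate idea concrete, here is a rough Python stand-in for the initialize / transition / merge / finalize pattern that such aggregates typically follow. The class and method names are hypothetical and do not correspond to any particular database's API; the gradient computed is the log-loss gradient with +1/-1 labels, assumed for illustration.

```python
import math

class GradientAggregate:
    """Hypothetical aggregate that accumulates a log-loss gradient over rows."""

    def __init__(self, w):
        self.w = w
        self.state = [0.0] * len(w)        # initialize: zero gradient

    def transition(self, x, y):
        """Fold one row (feature vector x, label y) into the running state."""
        a = sum(wj * xj for wj, xj in zip(self.w, x))
        gi = -y / (1.0 + math.exp(y * a))
        for j, xj in enumerate(x):
            self.state[j] += gi * xj

    def merge(self, other):
        """Combine partial states, e.g. from two workers or partitions."""
        self.state = [s + o for s, o in zip(self.state, other.state)]

    def finalize(self):
        return list(self.state)

# Hypothetical rows and model.
rows = [([1.0, 2.0], 1), ([1.0, -1.5], -1), ([1.0, 3.0], 1), ([1.0, -2.0], -1)]
w = [0.0, 0.5]

# Single scan over all rows.
single = GradientAggregate(w)
for x, y in rows:
    single.transition(x, y)

# Two partial scans combined with merge.
left, right = GradientAggregate(w), GradientAggregate(w)
for x, y in rows[:2]:
    left.transition(x, y)
for x, y in rows[2:]:
    right.transition(x, y)
left.merge(right)
print(single.finalize(), left.finalize())
```

Because the gradient is a sum over rows, partial states can be merged in any order, which is what lets the same logic run inside a database scan or across partitions without changing system internals.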
So how does this work in practice in terms of efficiency? Oh, yes, Donald?
>>: It only works for key-foreign key joins, right, [indiscernible] need to join between
the two tables?
>> Arun Kumar: There are two aspects to it. In this particular paper we focused on
key-foreign key joins because they are very common and we thought that would be
nice. And in terms of applicability of this technique to general joins where you can
have the full cross-product appearing, it is technically feasible, but you need an
identifier, [indiscernible] identifier that needs to be added to the statistic because the
attribute, joining attribute is no longer the primary key in one of the tables. So we
haven't looked at it in this paper, but conceptually there is nothing that prevents us
from extending to it.
Okay? So what about system efficiency, is it actually faster? We've done an extensive
empirical analysis in the paper and I'm going to give a snapshot of one experimental
result here. We implemented all these techniques as user-defined functions written in
C on top of the PostgreSQL open source database system, and thus our experiments
were single node with 24 gigs of RAM. We synthesized data sets that resembled some of
the numbers of tuples and the numbers of features we saw in practice. Here we have
customers with 150 million tuples, employers 10 million tuples. We ran logistic
regression with batch gradient descent for 20 iterations and I'm going to show two
things. The storage space consumed throughout the process of learning and the
runtime on the Y axis. The input data set, the customers and employers together, is
about 52 gigabytes. BGD after joins is over there. It requires about 20 gigabytes of
storage space. Factorized learning in contrast does not require any extra storage
space. Basically, there's a gap of 217 percent in terms of
storage space. And in terms of runtime the gap is 334 percent.
And to break down the runtimes, as Ravi asked earlier, basically the after joins approach spends
19 percent of its time in constructing the [indiscernible] of the join and
13 minutes per iteration of BGD. In contrast, with factorized learning you avoid the joining
time and the runtime per iteration is about five minutes because it operates on the
smaller data, the input of the join.
Now, of course, all of these relative runtime numbers depend on several factors and
these are the number of tuples, the number of features, how much memory the system
has, the number of iterations and the degree of parallelism and so on. And we have
done an extensive sensitivity analysis to all of these factors in the paper. Overall, it
turns out that factorized learning is usually the fastest approach but not always. There
are some cases where the BGD after joins approach could be slightly faster and we
devised database-style cost models for the I/O costs and the CPU costs of these
techniques that automatically pick the fastest approach on a given instance of the
There are more details in the paper and I would be happy to chat with you offline about
it. We give a proof for why the accuracy of the BGD technique is not affected and why
we get exactly the same model with the factorized learning. There are other
approaches to handle larger-than-memory scale that we discuss, and the tradeoff
space there. And on the question about views, it turns out that I/O redundancy can be
eliminated if you use views, but computational redundancy still exists and thus it could
be slower than factorized learning in many cases. And we also extended all of our
techniques to the shared-nothing parallel and distributed environment and
prototyped our ideas on Hive on Hadoop and on Spark. We extended this to
multi-table joins, specifically star joins, but the extension to snowflake joins is trivial as
well.
>>: Are you going to talk about SGD?
>> Arun Kumar: Great point.
>>: Okay.
>> Arun Kumar: So, we also extended our ideas to GLMs using SGD,
stochastic gradient descent, as well as coordinate descent methods. We also
extended it to probabilistic classifiers like naive Bayes and decision trees and also to
clustering techniques such as K-means. A lot of this is work done with Masters and
undergrad students. Basically I could offload some of this technical work to them
and focus on newer ideas, because I didn't want to focus on extensions of this directly
myself, and I focused on the newer idea, project Hamlet, that I'll talk about next.
I'm also working on extending the idea of factorized learning to linear algebra so that
we can generalize it to any ML computation that is expressible in linear algebra.
Basically, given a linear algebra script that
operates on a matrix that is the output of the join, we will produce a system that will
automatically rewrite it to a script that operates on the input of the join.
>>: Just quickly, for these extensions are you seeing similar runtime [indiscernible]
>> Arun Kumar: It depends. It depends on the technique. It depends on the data
parameters. It turns out that for stochastic gradient descent there is no computational
redundancy in the in-memory setting because the model changes after every example
and for coordinate descent there could be certain overheads imposed by the columnar
access patterns. We looked at it in the context of column stores and the tradeoff
space looks a little bit different for each of these techniques. For some of the probabilistic
classifiers and clustering, it's very similar to BGD.
>>: I guess as part of that SGD, it seems like you don't get the computational gain.
Then if you're doing it on a view you wouldn't have I/O gain either. I was curious, did
you find a gain at all?
>> Arun Kumar: Yes. It turns out for SGD, since we don't have computational redundancy, the only
thing that matters is the data redundancy, the I/O redundancy. So in order to avoid
constructing the [indiscernible] join, we say you can do a hash table and you can do a
view, basically an index hash join. And the challenge there is: what if the hash table
doesn't fit in memory? Then you need to partition the data, but that scrambles the
ordering that you need for SGD, which is very sensitive to the order in which you
access the data. Now it introduces a new runtime-accuracy tradeoff. We are looking
at some of the issues in this tradeoff, things like what if the foreign key is highly
correlated with the class label, and those sorts of issues. So you can expect a paper
about that pretty soon. But I'm happy to chat with you offline about more
details if you're interested.
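A minimal sketch of the index-hash-join view for SGD described in this answer. It is an illustration rather than the Orion implementation: the tuple layout and the names `joined_view` and `sgd_logistic` are hypothetical, and it assumes the employers hash table fits in memory, so the customers can be streamed in their original order.

```python
import math

def joined_view(customers, employers):
    """Lazily yield joined examples by probing an in-memory hash table.

    customers: iterable of (label, customer_features, employer_id)
    employers: dict employer_id -> employer_features (the hash table)
    The joined table is never materialized and the customers' order is kept.
    """
    for label, cust_feats, emp_id in customers:
        yield label, cust_feats + employers[emp_id]

def sgd_logistic(examples, dim, lr=0.1):
    """One SGD pass for logistic regression with labels in {0, 1}."""
    w = [0.0] * dim
    for y, x in examples:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        # The model changes after every example, so there is no repeated
        # computation to share across customers of the same employer.
        w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical toy data.
employers = {1: [1.0, 0.0], 2: [0.0, 1.0]}
customers = [(1, [0.5], 1), (0, [1.5], 2), (1, [0.2], 1)]

w_view = sgd_logistic(joined_view(customers, employers), dim=3)
materialized = [(y, c + employers[e]) for y, c, e in customers]
w_join = sgd_logistic(materialized, dim=3)
```

Both runs see the same examples in the same order, so they produce the same weights; only the I/O for writing out the joined table is saved. Once the hash table has to be partitioned, that ordering guarantee is what gets lost.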
In short: a sum of sums is the same sum; factorizing avoids repeating some. So learn
after joins no more; to learn over joins [indiscernible]. That is project Orion in one stanza.
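The "sum of sums" idea can be sketched for one batch gradient descent step of logistic regression. This is a toy illustration under assumptions, not the authors' UDF code: the function names and the data are hypothetical, labels are taken to be in {0, 1}, and the weights are split into a customer part `w_s` and an employer part `w_r`.

```python
import math
from collections import defaultdict

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_after_join(customers, employers, w_s, w_r):
    """Gradient of the logistic loss, reading each joined tuple in full."""
    g_s, g_r = [0.0] * len(w_s), [0.0] * len(w_r)
    for y, x_s, fk in customers:
        x_r = employers[fk]  # employer features are repeated per customer
        err = sigmoid(dot(w_s, x_s) + dot(w_r, x_r)) - y
        g_s = [g + err * v for g, v in zip(g_s, x_s)]
        g_r = [g + err * v for g, v in zip(g_r, x_r)]
    return g_s, g_r

def gradient_factorized(customers, employers, w_s, w_r):
    """Same gradient, computed on the base tables without joining them."""
    # One partial inner product per employer, reused by all its customers.
    partial = {fk: dot(w_r, x_r) for fk, x_r in employers.items()}
    g_s = [0.0] * len(w_s)
    err_by_fk = defaultdict(float)
    for y, x_s, fk in customers:
        err = sigmoid(dot(w_s, x_s) + partial[fk]) - y
        g_s = [g + err * v for g, v in zip(g_s, x_s)]
        err_by_fk[fk] += err  # inner sum, grouped by foreign key
    # Outer sum: each employer's features are scaled once by its grouped error.
    g_r = [0.0] * len(w_r)
    for fk, total in err_by_fk.items():
        g_r = [g + total * v for g, v in zip(g_r, employers[fk])]
    return g_s, g_r

# Hypothetical toy data: employer_id -> features, and (label, features, employer_id).
employers = {1: [1.0, 2.0], 2: [0.5, -1.0]}
customers = [(1, [0.3], 1), (0, [1.2], 1), (1, [-0.7], 2), (0, [0.4], 2)]
w_s, w_r = [0.1], [0.2, -0.3]
joined = gradient_after_join(customers, employers, w_s, w_r)
factorized = gradient_factorized(customers, employers, w_s, w_r)
```

The two results agree up to floating-point rounding, which is the sense in which the factorized version yields the same model while touching each employer's features once per iteration instead of once per customer.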
Moving on. I'll dive into project Hamlet.
Recall the same running example. Customers referring to employers. Here is an
observation. Given the employer ID, all the features of the employers are fixed. So do
we really need the state and revenue features if you know the employer already, which is
already present in the customers table? There is this notion of feature redundancy in
the information theory literature. And we can show formally that given the employer
ID, all the employer features are redundant. This motivates us to basically consider
the idea of omitting all those features. That is, avoiding the join logically. In short,
avoiding the join. We use the foreign key employer ID as a representative for all the
employer features. Thus, we get rid of that table. However, there's also this notion of
feature relevancy. Certain features of the employers could be highly predictive of the
target, in this case the churn, in which case we might actually want to bring it back and
let the algorithm decide. So oops, we need the join.
You might be wondering what is the big deal? Why do we need to think of avoiding the
join? Why not just use the feature selection technique? These have been studied for
decades that manage this redundancy relevancy tradeoff. We can just apply that in
this context. Well, in one word, the answer is runtime. But here is a brief background
about what feature selection is, if you are not familiar with it. It is essentially a
method to search the space of subsets of features and obtain a subset that is
probably more accurate or more interpretable. And there are various algorithms for
feature selection: wrappers, filters, embedded methods. One of the most popular
ones is forward selection, where you start with a single feature in your set and you
compute the prediction or the generalization error, the test error. And then you keep
adding one feature at a time depending on whether the test error goes down or not.
Backward selection is the reverse. You start with your entire set of features and you
keep dropping one at a time based on whether your test error goes down or not.
There are filter techniques that rank features. There are embedded methods like
regularization with L1 or L2 norms.
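Forward selection as described here can be sketched as a greedy loop. This is a generic illustration, not the specific implementation used in the experiments: `error_fn` stands in for training a classifier on a feature subset and measuring its held-out error, the toy error function and feature names are made up, and the sketch starts from the empty set rather than a single feature.

```python
def forward_selection(features, error_fn):
    """Greedily add the feature that lowers the error most; stop when none does.

    features: list of feature names
    error_fn: callable mapping a tuple of feature names to an error estimate
    """
    selected = []
    best_err = error_fn(tuple(selected))
    improved = True
    while improved:
        improved = False
        candidates = [f for f in features if f not in selected]
        scored = [(error_fn(tuple(selected + [f])), f) for f in candidates]
        if scored:
            err, feat = min(scored)
            if err < best_err:
                selected.append(feat)
                best_err = err
                improved = True
    return selected, best_err

def toy_error(subset):
    """Hypothetical error: useful features lower it, others add a small penalty."""
    gain = {"state": 0.2, "age": 0.1}
    return 0.5 - sum(gain.get(f, -0.01) for f in subset)

chosen, err = forward_selection(["zip", "age", "state"], toy_error)
```

Each round calls `error_fn` once per remaining feature, so the number of models trained grows roughly quadratically with the number of input features. That is why handing the algorithm fewer features can shorten its runtime.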
Coming back to the question of why bother avoiding the join? Well, if you avoid the
join you reduce the number of features that you give as input to these algorithms which
reduces their [indiscernible] space potentially improving their runtime.
Basically we are short circuiting the feature selection process using database schema
information, the key-foreign key relationship. Million-dollar question: What happens
to accuracy? To understand the effects of avoiding the join on accuracy, we need to
understand the bias variance tradeoff. I'm going to give a little bit of background here.
Learning theory tells us that the test error of a classifier can be decomposed to three
components: The bias, the variance and the noise. The bias, also known as the
approximation error, again these are informal explanations, is sort of a quantification of
the complexity of the classifier hypothesis space. Models that are more complex tend
to have lower bias.
The variance is a characterization of the instability of a classifier with respect to a particular training
data set. Models that are more complex tend to have higher variance, fixing the
training data set. And fixing the model, if you give fewer and fewer training examples,
the variance tends to go up.
The noise is a component that no model can mitigate because of unobservable data.
And traditionally the bias-variance tradeoff is illustrated as follows in ML. As the
model complexity keeps going up for a fixed training data set, the training error keeps
going down. But the test error goes up beyond a point. On the left is a situation of
high bias and low variance due to low model complexity. On the right is a situation of
low bias, high variance because of high model complexity. Situations with high
variance are also colloquially called over-fitting.
So the key question for our work now becomes: What is the effect of avoiding the
join? That is, omitting the feature vector about the employers on the bias and the
variance. We did an applied learning theoretical analysis of the effect of avoiding joins on bias and
variance. I'm going to give a brief summary here. The effect on bias. So here is the
full feature vector. We renamed the employer ID as FK to be generic and without loss
of generality assume for now that the customer feature set is empty. So we have this
reduced feature vector. The classifier that we learn is essentially a prediction function.
It is a function of the feature vector. And H_X is the hypothesis space of all possible
classifiers that you can build based on the feature vector X.
We show that this hypothesis space does not change if you omit the employer features
in this case. Basically, H_X equals H_FK. In a sense, this learning theory result is
equivalent to the earlier information theoretic result that I mentioned about the
employer features being redundant.
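A toy illustration of why the employer features add nothing to the hypothesis space once the foreign key is present. The employers table, the feature values, and the function names below are hypothetical, and this is an intuition aid rather than the formal proof.

```python
# Hypothetical employers table: foreign key -> (state, revenue bucket).
employers = {
    "e1": ("WI", "low"),
    "e2": ("CA", "high"),
    "e3": ("WI", "low"),  # same features as e1, different key
}

def classify_on_features(state, revenue):
    """Any classifier over employer features; 1 means churn."""
    return 0 if state == "WI" else 1

# The same classifier expressed over the foreign key alone: a lookup table.
fk_table = {fk: classify_on_features(*feats) for fk, feats in employers.items()}

def classify_on_fk(fk):
    return fk_table[fk]

# The converse fails: this FK-based classifier separates e1 from e3, which no
# function of (state, revenue) can do because their features are identical.
fk_only = {"e1": 0, "e2": 1, "e3": 1}
```

Because the foreign key determines every employer feature, any feature-based classifier can be rewritten as a table indexed by the key. The last dictionary shows the opposite direction breaking down, which matches the remark that dropping the foreign key can shrink the hypothesis space.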
And what it basically leads to is that if you avoid the join -- oh, and if you actually avoid
the foreign key and use the employer features instead, it turns out that the hypothesis
space can actually shrink. So basically what this leads to: if you avoid the joins, the bias
is unlikely to go up. If it's not likely to go up, then we don't need to worry about the
bias. However, it actually turns out that for some popular classifiers like logistic
regression and naive Bayes, the bias can actually go down. So what happens to
variance? So if the hypothesis space is unlikely to change, does that mean the
variance is unlikely to change? The short answer is no. And the key insight here is
that feature selection may omit certain features in the feature vector. To understand
this, here is an example. Here is your full feature vector and again without loss of
generality, assume the customer feature set is empty. We have this reduced feature
vector. And suppose I give the following true concept that generates the data that is
sort of the worst case scenario for avoiding the join. Suppose the true concept is as
follows: If the state of the employer is Wisconsin, then the customer is unlikely to
churn. Otherwise they are likely to churn.
If I generate thousands of examples based on this true concept and give that as input
to a feature selection method, what is the likely output feature subset? It is highly likely
to be state. And the key insight here is that in general, the domain of the foreign key
could be far larger than the domain of the features it refers to. There are only 50
states. There could be millions of employers. To understand this, here is an
illustration. This is the true concept. And the hypothesis space of classifiers built
using the state feature, say, encompasses the true concept. But if you use the
foreign key, the hypothesis space is far larger. Notice that we already showed that
H_FK equals H_X. Thus if you use the foreign key you would end up with higher
variance compared to the state. However, avoiding the join has forced us to use the
foreign key as a representative of the state feature, which leads to the result that avoiding
joins could increase the variance.
In short, we ask the question: What happens to the bias and variance if you avoid the
join? And our [indiscernible] analysis suggests that avoiding the joins is not going to
increase the bias but it could increase the variance. This is a new runtime
performance-accuracy tradeoff in the ML setting using database schema information
and we asked how can we help data scientists exploit our theory of this tradeoff in
practice? Our idea is to devise practical heuristics that help the data scientist bound
the potential increase in variance. And they can now apply a threshold and say: if it is
above a certain threshold, it is too unsafe, I'm not going to avoid the join. I call this
process avoiding the join safely.
The challenge is how do you even obtain a bound to avoid joins safely? There is no
precedent for this. Well, it turns out there are certain bounds in learning theory that we
can use. These bounds are bounds on the absolute variance based on the
Vapnik-Chervonenkis, or VC, dimension of a classifier, and essentially you can think of
it as a quantification of the complexity of a classifier's hypothesis space. Models that
have higher VC dimension tend to have higher variance, fixing the training set.
There are several bounds in learning theory using the VC dimension on the variance.
We apply a standard bound from Shai and Shai's popular learning theory textbook
combined with [indiscernible] about the growth function. The difference between the expected test error and
the expected training error is bounded by this function of the VC
dimension v, the number of training examples n, and the failure probability delta.
And this is for training data sets of size n whose examples are IID,
independent and identically distributed. Now, here is the catch. These bounds are
bounds on absolute variance. What we need is a bound on the increase in variance.
And this leads us to the heuristic, the ROR rule, where our intuition is that the increase
in the bound caused by the increase in VC dimension is an indicator of the risk of avoiding
the join, and we define this quantity Risk of Representation, ROR, which compares the
hypothetical best feature selection output that you can get after avoiding the join
versus what you can get without avoiding the join. That is this quantity over here. There
are three new terms. V_Yes is the VC dimension of the best classifier that you can get
after avoiding the join. V_No is the other one. And delta bias encodes the difference in
the bias. But now here is the problem. V_Yes, V_No and delta bias are impossible to
know a priori in general because they require prior knowledge of the true concept. If
you already know the true concept, you probably don't even need machine learning.
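The slide with the VC-dimension bound is not reproduced in the transcript. The sketch below uses one textbook form of such a bound, (4 + sqrt(v ln(2en/v))) / (delta sqrt(2n)), and that exact expression, including its constants, should be read as an assumption here. The function name is hypothetical.

```python
import math

def vc_generalization_bound(v, n, delta):
    """Assumed bound on |test error - training error|.

    v: VC dimension, n: number of IID training examples,
    delta: failure probability. Form assumed for illustration:
    (4 + sqrt(v * ln(2en/v))) / (delta * sqrt(2n)).
    """
    if not (0 < v <= n) or not (0 < delta <= 1):
        raise ValueError("need 0 < v <= n and 0 < delta <= 1")
    return (4.0 + math.sqrt(v * math.log(2 * math.e * n / v))) / (delta * math.sqrt(2 * n))

small_v = vc_generalization_bound(v=10, n=1000, delta=0.1)
large_v = vc_generalization_bound(v=100, n=1000, delta=0.1)
more_data = vc_generalization_bound(v=10, n=100000, delta=0.1)
```

The bound grows with the VC dimension and shrinks with more training examples, which is the qualitative behavior the talk relies on. It says nothing by itself about the increase in variance from swapping one feature set for another.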
How do we overcome this conundrum? Our idea is to upper bound this quantity ROR
to eliminate these terms that require prior knowledge of the true concept, and by upper
bounding it, we make it more conservative, in the sense that it's unlikely to avoid joins
where if you avoid them the error shoots up. But it is likely to miss certain
opportunities to avoid joins.
So we upper bound the ROR. The details are available in the paper. Essentially for
VC dimensions that are linear in the number of features, and this includes popular
classifiers like naive Bayes and logistic regression, the ROR can be upper bounded as
follows. We have these two new terms over here. D_FK is the domain size of the
foreign key, the number of employers. And q_E star is the minimum domain size of the
features in the employers table. And the ROR rule now essentially says if the RHS over
here is less than some threshold epsilon, it is safe to avoid the join.
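The formula on the slide is not captured in the transcript. The sketch below assumes the upper bound is the difference between a VC-style complexity term evaluated at the foreign key's domain size and the same term evaluated at the smallest employer feature's domain size, divided by delta sqrt(2n). That form, the function names, and the values chosen for delta and epsilon are all assumptions for illustration.

```python
import math

def complexity_term(v, n):
    """sqrt(v * ln(2en/v)); assumes 0 < v <= n."""
    return math.sqrt(v * math.log(2 * math.e * n / v))

def ror_upper_bound(n, d_fk, q_min, delta):
    """Assumed worst-case ROR bound.

    n: number of training examples
    d_fk: domain size of the foreign key (number of employers)
    q_min: minimum domain size among the employer features
    delta: failure probability
    """
    gap = complexity_term(d_fk, n) - complexity_term(q_min, n)
    return gap / (delta * math.sqrt(2 * n))

def ror_rule(n, d_fk, q_min, delta, epsilon):
    """True when the bound is below the threshold, i.e. the join is deemed safe to avoid."""
    return ror_upper_bound(n, d_fk, q_min, delta) < epsilon

# Hypothetical sizes: 1,000 employers, smallest employer feature has 50 values.
plenty = ror_upper_bound(n=1_000_000, d_fk=1000, q_min=50, delta=0.1)
scarce = ror_upper_bound(n=10_000, d_fk=1000, q_min=50, delta=0.1)
```

With the number of employers fixed, more training examples per employer shrink the bound, so a threshold test passes more easily. In this assumed form the second term only ever lowers the bound.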
Now, we looked at this and we thought -- oh, yes?
>>: I want to make sure I understand. That's the single feature with minimum [indiscernible]
>> Arun Kumar: Exactly, it's the domain size of the smallest --
>>: The smallest domain feature, okay.
>> Arun Kumar: Okay? We looked at this and thought it is still so complex. Can we
simplify it further to help data scientists? That's where we came up with the tuple ratio
rule, or the TR rule, where the idea is to eliminate the need to even look at the employer
features. Basically we want to get rid of that second term over there.
Define the tuple ratio as the ratio of the number of examples n to the domain
size of the foreign key, D_FK. That's that quantity over there. In short, it's just the
number of training examples, the number of tuples in the customers table, to the
number of employers. That's the domain size of the foreign key. That's why I call it the
tuple ratio. Now, if the domain size of the foreign key feature is much larger than the
minimum domain size of the features in the employers table, which is almost always the
case in practice, then for reasonably large tuple ratios it turns out that this bound
becomes linear in one over the square root of the tuple ratio. Flip it over and take the square,
the upper bound becomes a lower bound, and we end up with the tuple ratio rule, which
essentially says if the tuple ratio is greater than some threshold tau, it is safe to avoid
the join. Notice that this is even more conservative than the ROR rule. So after this
long journey through the wonderful world of learning theory, we come back to this
stunningly simple tuple ratio rule that only uses the number of tuples in these two
tables to help us avoid joins safely in this ML context.
So even if you have a million features about the employers, it is possible to safely
discard them without even looking at them. So does this work in practice? We ran
extensive experiments with real data sets. I'm going to give a snapshot here. We
tuned these thresholds for these rules using simulation study. And it turns out that the
tuple ratio of 20 is good enough to decide whether a join can be avoided or not. Notice
that this simulation needs to be done once per VC dimension expression, not once per
data set as you need to do for hyperparameters in machine learning. And the
tolerance that we set for the test error was .001.
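The tuple ratio rule is simple enough to state as code. This is a minimal sketch with hypothetical function names and table sizes; the default threshold of 20 is the value reported in the talk.

```python
def tuple_ratio(n_examples, n_fk_values):
    """Number of training examples divided by the foreign key's domain size."""
    if n_fk_values <= 0:
        raise ValueError("the foreign key domain must be non-empty")
    return n_examples / n_fk_values

def safe_to_avoid_join(n_examples, n_fk_values, tau=20.0):
    """True when the tuple ratio exceeds the threshold tau."""
    return tuple_ratio(n_examples, n_fk_values) > tau

# Hypothetical table sizes, evaluated independently per join.
joins = {
    "employers": (1_000_000, 10_000),   # tuple ratio 100
    "branches": (1_000_000, 100_000),   # tuple ratio 10
}
decisions = {name: safe_to_avoid_join(n, d) for name, (n, d) in joins.items()}
```

The rule needs only two row counts per join, so it can be evaluated from catalog statistics without reading any feature of the referenced table.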
We use real data sets. There are three here. There are more in the paper. These are
from Kaggle. All of these are classification tasks. In the Walmart data set we want to
predict the number of sales per department in the store by joining data about sales
with stores and weather indicators. In the Yelp data set we want to predict the ratings
of businesses by users by joining data about ratings with users and businesses.
Flights is a binary classification task where we have data about flight routes and we
want to predict if it's code share or not by joining data about other routes with airlines
and airport information.
We applied a standard off-the-shelf classifier, naive Bayes, which is very popular, and
combined it with popular feature selection methods like forward, backward, filters, and
so on. And we used the standard holdout validation methodology to measure
accuracy. 50 percent of the labeled data set was set aside for training, 25 percent for
validation during feature selection, and 25 percent was set aside as the final holdout
test set, whose error is the final indicator of accuracy. Yes?
>>: Do you have domain [indiscernible] in every foreign key in your training set?
>> Arun Kumar: That's a good question. So in this particular example the domain of
the foreign key is defined by these tables over here. Each individual example of
foreign key instance may or may not occur in a particular training sample. So we use
smoothing to basically figure out what value should be assigned to keys that do not
arise in the training example. So the domain is known. Most of the foreign key values
can occur in a training sample, but for those that do not arise we use Laplacian
smoothing.
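A minimal sketch of the smoothing idea, assuming it refers to standard additive (Laplace) smoothing of the per-class foreign key counts in naive Bayes. The function name, the `alpha` parameter, and the toy data are hypothetical.

```python
from collections import Counter

def smoothed_conditional(fk_values_for_class, domain, alpha=1.0):
    """Estimate P(FK = v | class) over a known domain with additive smoothing.

    fk_values_for_class: foreign key values seen in training for one class
    domain: all foreign key values, known from the referenced table
    alpha: pseudo-count added to every value (1.0 is add-one smoothing)
    """
    counts = Counter(fk_values_for_class)
    total = len(fk_values_for_class) + alpha * len(domain)
    return {v: (counts[v] + alpha) / total for v in domain}

# "e3" is in the domain but never appears in this training sample.
probs = smoothed_conditional(["e1", "e1", "e2"], domain=["e1", "e2", "e3"])
```

Because the domain comes from the referenced table rather than from the sample, a key that never appears in training still receives a small nonzero probability instead of zeroing out the whole product.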
So what are we comparing? Input all the features to the feature selection method.
Versus avoid those features that the TR rule says are safe to avoid. So does the TR
rule predict correctly? That's what we want to check. Notice that this is orthogonal to
the earlier work on project Orion, where we worried about physical joins. Here, all the
data is joined a priori. We just worry about which features are eliminated. So the
results are in the paper, but here is the snapshot about the results for backward
selection with naive Bayes. The tuple ratio rule is applied on a per-join basis. So the joins are
decided independently.
If you use all the features as input on the Walmart data set, the error for predicting
the department-wide sales is .8961. This is the root mean square error with one to
seven levels. Does the tuple ratio rule avoid the join with stores and with weather? It
says that both joins are safe to avoid. What is the error if you avoid the stores? What
is the error if you avoid weather? It turns out that the error is .8910. Notice that lower
error is better, but in this case it is not significantly higher. Therefore the TR rule got it
right on both these cases. It is safe to avoid. Thus, the overall error by avoiding both
of these two tables is .8910. However, on the Yelp data set, the error with all features
as input, the RMSE for levels one to five, is 1.1330. Two tables, the TR rule says
neither of them is safe to avoid. If you avoid the users, the error shoots up to 1.21. If
you avoid the businesses, the error shoots up to 1.24. Therefore, the TR rule got it
right in both cases.
They are not safe to avoid.
And the final error is, of course, the same as the original error because all features are
being used.
On the flights data set the error is discrete 0-1 loss because it is a binary classification.
It is .1930. Three tables, TR rule says airlines are safe to avoid but not the airport
tables. But if you avoid any of these tables, it turns out the error doesn't shoot up. So
basically the TR rule got it right on airlines, but missed the opportunity on airport
tables. The overall error is basically the error that you get by just avoiding the airlines
table. Notice that these are the examples of the missed opportunities that I mentioned
because these rules are conservative. And thus, there is scope for devising less
conservative rules here.
What happens to the runtime? On the Yelp data set there is obviously no speedup
because no joins are avoided. On the flights data set, by avoiding this particular table's
features, the speedup was a modest two times. But on the Walmart data set the
speed up was 82 times.
>>: As compared to Orion?
>> Arun Kumar: Sorry?
>>: As compared to Orion or --
>> Arun Kumar: This is speedup over -- so it is speedup over using all features.
>>: [speaker away from microphone.]
>> Arun Kumar: So everything is physically joined but we now give only the features
from the sales tables and omit the features from these two tables.
>>: But the speedup, I mean, there's a difference if you apply Orion; with the physical
join it would be faster?
>> Arun Kumar: It could be faster, yes.
>>: So is this the 82 over Orion or over ML after join?
>> Arun Kumar: It is ML after join. So you construct the full table and omit the
features that the TR rule says it is safe to omit. Yes, integrating those two
could give more speedup, yes.
Okay? So overall we had seven data sets that we measured in the paper. Fourteen
joins. Turns out that the TR rule correctly avoided seven out of the 11 joins that are
safe to avoid, but missed the opportunity on four joins, and it correctly did not
avoid three joins that are not safe to avoid. And the speedups range from no speedup,
because no joins were avoided, all the way to 186 times.
>>: Does this change significantly if you don't use the one simplification that
actually leads to the simple TR rule, that is, if you use the more complex formula?
>> Arun Kumar: Great point. Next slide.
>> Arun Kumar: So turns out that the results are the same for logistic regression with
L2 and L1 as well, but it's also the same with the ROR rule that you asked about. Basically,
it turns out that the way we simplified the ROR rule is in a way that is rarely reflected in
the real world data set. TR and ROR somehow give the same results even though
they are vastly different in terms of how conservative they are. It turns out that the
accuracy can actually increase by avoiding the joins in some cases and dropping the
foreign key causes the accuracy to drop significantly in many cases because the bias
shoots up.
And we also show the details of the simulation study that goes into how we tune these
thresholds. And finally we also handle cases where there are skews in the foreign key
distribution. Foreign key skews are a classical problem in the database world for
parallel hash joins and so on and we present the first results of the implications of
foreign key skews for ML accuracy.
In short: to join or not to join, that is the question. Tuple ratio, we did coin; use at your
discretion. That is project Hamlet in short.
To summarize, I presented to you the problem of ML after joins. The data scientists
are often forced to construct the output of the join because data in the real world is
often multi-table connected by key-foreign key relationships, but most ML toolkits
expect single table inputs. Materializing the [indiscernible] of the join leads to human
and system efficiency issues. In project Orion, I showed how we can avoid joins
physically, only to get the output of the join. No need for the joins.
ML can operate directly on data exactly as it resides. Makes it run faster in many
cases and yields exactly the same accuracy, improves human and system efficiency.
In project Hamlet we went one step further to show how we can avoid joins logically in
some cases. Basically get rid of entire input tables without even looking at them,
which makes these techniques run faster, and explained why they would yield similar
accuracy. Potentially improving human and system efficiency even further.
Moving on, I'm going to comment briefly about my future research ideas. To
bring some perspective into my current work and future work and how they are related:
I talked about learning over joins in project Orion. And in the context of feature selection
on joins, in project Hamlet, several models are being learned over joins.
It turns out that is part of this larger exploratory feature selection process where data
scientists compare multiple subsets. We talked about that in project Columbus. That
is part of this larger process of feature engineering that I mentioned earlier, where the
-- excuse me -- where the data scientists convert the raw data into the feature vectors we
need for ML. That is part of an even larger process, and I promise to stop with that.
It is the exploratory model selection process where basically the task is to obtain a
precise prediction function based on the data set, the raw data set.
For my short-term future work in the next couple of years, basically I want to focus on
new extensions and generalizations both on the system side and on the applied
learning theory side for Hamlet, Orion, and Columbus. And for the longer term future
work I want to focus on helping improve these processes of feature engineering and
model selection more generally.
I'll speak briefly about my thoughts on this model selection process. From
conversations with data scientists in various settings, we observed that model selection
is often an iterative exploratory process. Data scientists explore combinations of three
things, which I call the Model Selection Triple. What are the features? That involves
feature engineering. What is the algorithm? Is it naive Bayes or logistic regression and so
on? And what are the hyperparameters for the algorithm?
The data scientist starts this process by steering and deciding which
combination they want to try out first and then they evaluate the accuracy of this
combination. They execute it on a system, on the data. They get the results, consume
these results manually, often basically figuring out maybe I need to change the
features or maybe I need to change the parameters.
And they proceed with the execution and then consume the results and do this
iteratively.
The bottleneck we observe is most systems today force these data scientists to be on
one of two extremes. Either they force them to do only one combination per iteration,
that is one MST per iteration. Or they tend to cut the data scientist out of the loop and
automate the whole process. Doing one MST per iteration turns out to be too tedious
and painful for these data scientists. Automation works in several applications, but in
many other applications the data scientists do not want to be cut out of the loop. They
want their expertise to be used and they do not want that to be ignored.
And so the question I ask: Is there a more flexible data scientist friendly middle way
possible between these two extremes? We wrote up our thoughts on some of these ideas,
which appear in a vision paper in ACM SIGMOD Record just a few weeks
ago. This is work in collaboration with Robert McCann, who is part of the security
team I mentioned within Microsoft. And our idea is as follows: Enable these data
scientists to specify several logically related MSTs together using, say, a declarative
interface, a higher-level interface that does not need them to enumerate these MSTs.
They could link several logically related subsets together or parameters together. And
under the covers, we have a system that generates code and evaluates multiple
models. And we could now apply database style optimization ideas that eliminate
redundancy in computations, materialize intermediate data and so on and thus bring
together techniques from both the ML world and the data management world to speed
up this process.
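One way to picture a declarative specification that expands into many Model Selection Triples is sketched below. The specification format and the name `expand_msts` are hypothetical and are not the interface proposed in the vision paper; the point is only that the system, not the data scientist, enumerates the combinations.

```python
from itertools import product

def expand_msts(feature_subsets, algorithms, hyperparams):
    """Yield (features, algorithm, hyperparameter setting) triples.

    feature_subsets: list of tuples of feature names
    algorithms: list of algorithm names
    hyperparams: dict algorithm -> dict parameter name -> list of values
    """
    for feats, algo in product(feature_subsets, algorithms):
        grid = hyperparams.get(algo, {})
        names = sorted(grid)
        # With no parameters, product() yields one empty setting.
        for values in product(*(grid[k] for k in names)):
            yield tuple(feats), algo, dict(zip(names, values))

# Hypothetical specification: two feature subsets, two algorithms, one grid.
spec = {
    "feature_subsets": [("age", "state"), ("age",)],
    "algorithms": ["naive_bayes", "logistic_regression"],
    "hyperparams": {"logistic_regression": {"l2": [0.01, 0.1, 1.0]}},
}
msts = list(expand_msts(**spec))
```

Once the whole batch is visible to the system, related triples (for example the three `l2` settings over the same feature subset) can share scans and intermediate results instead of being run one at a time.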
And ultimately since the system is now more aware of what combinations are being
explored and how they are related, we could apply ideas from the provenance
management literature in the database world to help these data scientists debug these
models and process this information about the process even more effectively, helping
them steer the next iteration better. And so overall, we explain how we can combine
ideas from the data management literature, ML literature, and also human-computer
interaction to improve the system efficiency of this process and also the productivity of
these data scientists.
And that's the segue to my concluding remarks about the importance of
intersectionality. I am a data management person and that is my background. But
over the last several years I have worked closely with people in the machine learning world
and interacted and worked with a lot of ML ideas. As I see it, advanced analytics is really
the coming together of these worlds of data management and machine learning.
Moving forward I would like to continue working in the space of advanced analytics,
building bridges between these two communities because of the sheer number of open
research problems and new opportunities that arise when you take a joint view of
these two worlds and I also would like to explore the interactions with human computer
interaction angles because in many of these tasks the data scientist is often in the
loop. And I would also like to look more closely at application domains where these
advanced analytics techniques are used. I work closely with enterprise data scientists.
Also with Web data scientists. I would like to work with other application domains as
well, to look at what the impact of these advanced analytics techniques are on their
That brings me to the end of my talk. I would like to thank my advisors, Jeff Naughton and
Jignesh Patel, and my other mentors, Chris Ré and David DeWitt at the Gray Lab, as
well as my coauthors and collaborators. The code for all the systems and techniques that I
talked about, for all my projects, is available as open source toolkits on my
Web page. Even the data sets are available. Feel free to shoot me an email if you
would like to talk to me about it offline. I'll stop now and I will be happy to take more
questions. Both here and from the remote audience.
>> Arun Kumar: Thank you. Yes?
>>: Yes. The model selection reminds me a little bit of MLbase. You didn't mention it.
So one question I have is, so first of all, the optimizations you talk about, are those
similar to the ones that MLbase talks about, the multi-query optimization running?
That's one question.
The other question is once you are at that level, might that kind of, is that where the
huge cost is? And then maybe some of the things that you were talking about in Orion
and Hamlet become less relevant? Or kind of are they just as relevant in the bigger
context as they are in the smaller?
>> Arun Kumar: Right. So to answer the second question first, Orion and Hamlet and
even Columbus the way I think about them are building blocks for this larger vision.
So all of these techniques will contribute to this optimization framework that I talk
about. And when we take a joint view of these Model Selection Triples, new
opportunities arise. Like can we reuse results across models, can we materialize,
what sorts of intermediate data sets we need to materialize? And also interactions
between provenance and optimization become important. Can we basically reuse models
across iterations? Can we avoid recomputations if they want to do what-if debugging,
those sorts of questions. So the way I look at it, all of these are building blocks in the
context of the model selection optimization.
Now, coming to your first question about MLbase, we actually talk about the relationship
with MLbase in the paper as well. The way we view it, MLbase is one way to specify an
interface. They do not go into the feature engineering details like the joins, for
example. They talk about automating algorithm selection and hyperparameter tuning,
which is nice. And so it's closer to one end of the spectrum where you want full
automation. Whereas here we talk about this entire spectrum, full automation all the
way to individual iterations. And bridging the gap between these two extremes and
coming up with frameworks, maybe not one framework but multiple frameworks.
MLbase could be one. Columbus is another. There are automated hyperparameter
tuning libraries that can be viewed as basically new interfaces for the model selection
management process. And the term we use in the paper is this could be a narrow
waist for a new class of systems that optimize the model selection process. And
thinking about it in this way enables new interfaces like combining multiple feature
subsets and parameter combinations which MLbase doesn't do, for instance.
>>: Would the optimizations be similar?
>> Arun Kumar: Some of them could be similar. Some of them could be different.
Like, for example, in the automated model search they talk about doing batching.
They also do computation sharing. So some of that could be similar, but some of them
could be different. Like when we look at the feature engineering context, the join stuff,
for example, they do not do anything like that.
So looking at the end-to-end process of model selection, some of the optimizations
that we are talking about in this context, some of the MLbase ideas could be relevant
here. Then it also opens up a whole new space of new optimization ideas, and
we go into the details in the paper. We categorize it into several categories of
optimization. There we also look at introducing new interfaces for accuracy-runtime
tradeoffs like the Hamlet stuff. Or some of the other warm starting and model caching
and reusing, and those sorts of optimizations.
Okay? Any questions? Any other questions? Any questions from the remote audience?
>>: Nothing remote.
The one thing that surprised me, though, sort of [indiscernible] provenance
management there. Because ultimately what we are doing -- this may be just a lack of
intuition on my part. One thing that we are doing is, we are selecting models.
>> Arun Kumar: Right.
>>: Or combinations of models, features, and hyperparameters.
>> Arun Kumar: Right.
>>: Now, can you give us some idea how provenance management plays into that?
>> Arun Kumar: Yes, certainly. In the context of the database world for SQL query
processing, people looked at where provenance, how provenance, why provenance, all
of these sorts of models, to debug why certain things are in the output or why things are
not in the output. In the ML context, the debugging we are talking about is why certain
features matter or why certain subsets of examples matter. So these sorts of
debugging questions as to where they need to invest their effort for the next iteration,
do they need to get more examples? Do they need to get more features of certain kinds?
Providing system support for these sorts of questions could be helpful.
Currently often what they do is just through pure intuition or through manual notes,
they track the changes and they try to figure it out. So there's a low-hanging fruit there
which is basically defining provenance for machine learning and then providing
querying support for the process information. The next stage is things like what-if
debugging, things like basically recommendations for features like can you use some
past information to make recommendations based on the way it behaved in the past?
Those sorts of questions.
>>: But turn this around. You're saying that you want to have the sense of
provenance management that we have in databases in the context of machine
learning, but you are not saying that the database provenance management
techniques give you like a ML --
>> Arun Kumar: Oh, it's the former. Applying the philosophy of the provenance
management work that you see in the database community, bring that to machine
learning. The same techniques, some of them might work. Some of the techniques
can be reused. Some of the techniques need to be devised from scratch. Looking at
this connection, looking at this context I think is a very interesting area.
>>: Okay, perfect. That's where my hang-up is.
>> Arun Kumar: Okay, great. Yes, Vivek?
>>: [indiscernible] and joins. So in many cases, if you want to build a model, you may
not build it on the entire fact data, even though one subset is for customers in
>> Arun Kumar: Right, right.
>>: So for that purpose, I mean [indiscernible] would join anyhow. And I prefer that
the conditions are on the dimension tables, right?
>> Arun Kumar: So here in the example, you are referring to employers. You
are saying employers based in Washington, something like that.
>>: Correct. Right?
>> Arun Kumar: Uh-huh.
>>: So in that case do you still have -- I have to join at least some of the dimension,
not all of them.
>> Arun Kumar: Sure, sure.
>>: Do you still see a benefit in performance in those cases?
>> Arun Kumar: Right. So is it Orion or Hamlet? Both?
>>: Either.
>> Arun Kumar: Okay. So there are two aspects to this. One is even if we have a
completely denormalized table where everything has been physically written out as a
single table, what we found -- and this is in the context of the Santoku system that I
demonstrated at VLDB last year -- it might be worth normalizing that table back to the
multi-table form under the covers without the data scientists having to know, and then
apply the factorized learning technique because the per-iteration runtime can actually be
significantly lower if you do factorized learning and therefore this renormalization can
actually turn out to be beneficial. So that's what we found in the context of Santoku
and that's true both in memory, out of memory, whatever.
But it depends on the model. It depends on the data dimensions and all of those things.
And so the same cost model and the tradeoffs matter.
What is the other question? For the Hamlet stuff, even if the data is denormalized, the
runtime comes from the feature selection computations that you are reducing. So it
doesn't matter if the data is physically joined or not. Hamlet will still be able to give you
speed-up because you are avoiding the computations.
But if you have data access times that matter, then the physical join optimization might
also help.
>> Christian Konig: Any last questions?
Nothing online, right?
>> Arun Kumar: I don't see any. No questions yet.
>> Christian Konig: Yeah, it typically doesn't happen. Let's thank the speaker one
more time.
>> Arun Kumar: Thank you for coming, everyone. Thanks, Christian.
[recording concluded.]