
>> Christian Konig: Okay. Thank you, everyone, for coming. It is my great pleasure
to introduce Arun Kumar from the University of Wisconsin where he is co-advised by
Jeff Naughton and Jignesh Patel. Arun works on the intersection of data management
and machine learning with the stress on data management. And the research he has
produced should be interesting for various reasons. For one, it's highly decorated,
where we are talking about a SIGMOD best paper award in '14. It has also been
highly controversial. So his paper submissions to database conferences have caused
all sorts of very strong reactions from the reviewers. And it is also practical. So
various pieces of code that are based on his ideas are actually now shipping with a
number of different products. Make sure that you ask him about all of this, but the most
important tidbit that I want to leave you with is what few people know is that Arun is
actually a member of the American screen actors guild and has modeled and starred in
commercials. Again, feel free --
>> Arun Kumar: One commercial!
[laughter.]
>> Christian Konig: One commercial. I am exaggerating slightly. But he is a man of
various talents. If this database stuff doesn't work out, Hollywood is option B.
[laughter.]
>> Christian Konig: With that I'll hand it over to Arun.
>> Arun Kumar: Well, Nissan and Amazon, car commercial.
Okay, great. Thanks, Christian and thanks to everyone for coming to the talk and all
the people who are viewing it remotely and as he mentioned my name is Arun Kumar.
I'm from the University of Wisconsin. Today I will talk to you about my research on
accelerating Advanced Analytics. I can talk to you offline about the commercial.
[chuckling.]
>> Arun Kumar: So Advanced Analytics is becoming ubiquitous. For our purposes, I
define Advanced Analytics as the analysis of potentially large and complex data sets
using statistical machine learning or ML techniques. So basically, it is the coming
together of these worlds of data management and machine learning. I approach this
intersection from a data management standpoint.
We all see Advanced Analytics in action practically every day. I don't need to preach
to the choir here. So everyone here already knows about this. Every time you get
rank results from Google's search engine, the output is because of Advanced
Analytics. Every time Netflix recommends a movie to you, that is Advanced Analytics
in action. And the whole world saw IBM's Watson system defeat human champions in
the Jeopardy answering contest. Watson is powered by Advanced Analytics.
The highly visible success of these applications of Advanced
Analytics has led to an enormous demand in the enterprise domains, and again most of
you probably already are familiar with this. Healthcare, retail, insurance, finance.
Even the academic domains, the sciences and the humanities, for products that make
it easier for these users to integrate Advanced Analytics into their applications. Market
research firms estimate that the market size for Advanced Analytics products is set to
grow to anywhere between $6 billion and $29 billion per year over the next few years. And no
wonder then that practically all major data management and analysis companies want
a slice of this pie. There is also a slew of startups in this space, and open source
toolkits, to integrate Advanced Analytics into their applications.
In spite of all this, as I'm going to explain today, there still remain numerous
bottlenecks in the end-to-end process of building and deploying Advanced Analytics
applications. At the end of the day, my research is about abstractions, algorithms, and
systems that mitigate these bottlenecks in the end-to-end process from a data
management standpoint, thus accelerating Advanced Analytics. And by acceleration, I
mean both system efficiency, the running time of the systems and the algorithms
involved, and human efficiency. The productivity of the data scientist and other
humans that are involved in the process.
I'll start with a high-level overview of my research, which has largely been in the context
of two groups of humans in the Advanced Analytics lifecycle. Data scientists who are
tasked with building and deploying machine learning models that analyze data. And
behind the scenes we have the software engineers at companies such as Google,
Microsoft, IBM, Oracle, who are tasked with building implementations of these
ML techniques on top of data processing systems such as Relational
Database Management Systems, Spark, Hadoop, and so on.
From conversations with data scientists, engineers, and others at various settings, we
found that there are numerous bottlenecks in their work and throughout the lifecycle of
Advanced Analytics. We wrote about some of these bottlenecks in an ACM Queue
magazine article that was invited to the Communications of the ACM in 2013. My work
has largely focused on three classes of bottlenecks.
The first one arises when data scientists have to build ML models and that is the
process of feature engineering. Transforming the raw data into an ML-ready feature
vector: there's a lot of work that goes into this, and some of my work has
explored different bottlenecks in the feature engineering space.
The second class of bottlenecks arises when ML models that have been built have to be
integrated into production software and production systems. And the third class of
bottlenecks that I have looked at are from the perspective of software engineers who
have to build ML toolkits that scale these algorithms to larger amounts of data, the
sheer diversity of these ML techniques makes it a tedious and daunting process for
them to build these toolkits and there are some bottlenecks that are addressed in that
space.
To be more precise in terms of the publications that are focused on these bottlenecks,
in the context of feature engineering I have worked on two main bottlenecks. The first
one is when features arrive from multiple tables which requires joins of tables before
ML can be applied. This has been in the context of project Orion, which appeared in
SIGMOD 2015, and project Hamlet, which is set to appear in SIGMOD later this year.
The other bottleneck that I looked at was the process of exploratory subset selection
where data scientists are often involved in the loop in picking which features they want
to use for the ML task. Often they use environments such as R for this process, and
we focused on elevating this process to a declarative level by introducing a domain-specific
language for feature selection in the R environment and applying database-style
and ML optimizations to improve the runtime performance of this process and
increase the productivity of the data scientists. That is project Columbus that
appeared in SIGMOD 2014. In the context of integrating ML models into production, I
have looked at probabilistic graphical models for optical character recognition, or OCR,
data, which data scientists want to query in conjunction with structured data in RDBMSs.
That is project Staccato, which appeared in VLDB 2012.
Finally in the context of helping software engineers build scalable ML implementations,
I took a step towards devising a unified software abstraction and architecture for
in-RDBMS implementations of ML techniques. That is project Bismarck that appeared in
SIGMOD 2012.
To speak of the impact of some of my research, as Christian already mentioned briefly,
project Bismarck, the code and/or the ideas have been shipped by numerous
companies including Greenplum, which is now part of EMC, Oracle, and Cloudera. And we
also contributed the code of the Bismarck project to the MADlib open source analytics
library. Staccato is being used by projects in digital humanities and sciences.
Columbus won the best paper award at SIGMOD, as he mentioned. And for Orion and
Hamlet, I am currently speaking to a product group within Microsoft, the online
security group -- they keep changing their names; they were called the Universal Safety
Platform; Robert McCann is the person -- about integrating some of these ideas into the
Scope ML environment on top of Cosmos, and they are exploring integrating that with
their ML pipelines. And also with LogicBlox, which is a database company that
wants to deploy some of these ideas on top of their production analytics platforms.
For today's talk I am going to focus on the first bottleneck of applying ML over joins of
multi-table data, and I'll dive deeper into projects Orion and Hamlet. I picked this
particular bottleneck primarily because I think it best illustrates my
earlier point about the coming together of the worlds of data management and
machine learning, how it gives rise to new problems and opportunities. And also the
sheer number of open research questions that this work has led to. Yes, Ravi?
>>: I'm -- [speaker away from microphone.] you keep saying, I hear at least three or
four projects.
>> Arun Kumar: Right.
>>: Is this in context with one system? Is there a reason why everything is in a
project?
>> Arun Kumar: Well, Orion and Hamlet have been in the context of one system
environment. The Bismarck stuff has been in the context of another system
environment, but in terms of data platforms, a lot of this has been prototyped on top of,
say, PostgreSQL or even Hadoop. From a systems perspective, a lot of these ideas
are on top of the same data platforms, but I consider them systems because I look at
the end-to-end stack, from the interaction all the way to the execution perspective.
Okay? So here is the outline for the rest of the talk today. I'll dive into ML over joins
and then I'll talk about my future research ideas. A bit of a deeper outline for ML over
joins: I'll motivate the problem and set up an example that I'll use as a running
example for the rest of the talk. And I will go into projects Orion and Hamlet. Let's
start with a little bit of background, normalized data. All of us here probably know
normalized data, but just to set up the example and introduce the terms. Many
structured data sets in the real world are multi-table. For example, consider a data
scientist that's dealing with data at an insurance company, like American Family. They
have data about customers which is a table that contains attributes about customers.
The gender, age, income, employer, and so on.
They also have data about employers -- companies, universities, and other
organizations -- containing attributes about companies like the state of their headquarters,
what the revenue is, and so on. Notice that the employer ID of the employer is an attribute in
the customer table that connects these two tables. In database parlance it's also
called a foreign key, and the employer ID is what is called a primary key, also called
just a key, in the employers table.
If we want to get some information about the employer of a customer, we need to do
this fundamental relational operation called a join that basically stitches these two
records together based on what the employer ID is.
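As a toy illustration of that join, here is a minimal pandas sketch with made-up miniature tables (the column names and values are invented for illustration, not the actual schema):

    import pandas as pd

    # Made-up miniature versions of the two tables in the running example.
    customers = pd.DataFrame({
        "CustomerID": [1, 2, 3],
        "Age":        [34, 51, 28],
        "Income":     [60000, 95000, 48000],
        "EmployerID": [10, 20, 10],      # foreign key referring to employers
    })
    employers = pd.DataFrame({
        "EmployerID": [10, 20],          # primary key
        "State":      ["WA", "WI"],
        "Revenue":    [1000000, 250000],
    })

    # The key-foreign key join that stitches each customer to its employer's record.
    # Note how employer 10's attributes now appear twice in the joined table.
    joined = customers.merge(employers, on="EmployerID", how="left")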
Data of this form and key-foreign key joins are not specific to insurance. They are
ubiquitous, they arise in practically all domains where there is structured data. They
arise in recommendation systems where you have data about ratings being joined with users and
movies. They arise in online security. They arise in the hospitality industry. They arise
even in bioinformatics.
So now that you know everything about databases, let's introduce some examples and
set up the terms for machine learning. Here is the same customers data set. A very
common machine learning task that is used in enterprise settings is customer churn
prediction. Basically, data scientists want to use machine learning to model the
customers to help prevent customers from moving to a competitor. And they start with
these attributes about customers which become the features for the machine learning
algorithm. The age, the employer, and income and so on.
Now, they start with the training data set to train the ML algorithm and they have to
predict the churn attribute, which is based on past customers that have churned or not.
That is also known as the target or the class label, and the job of the data scientist is to
build a machine learning model using this training data set, so logistic regression
model or SVM or neural network and so on.
Here is the twist that causes trouble in this paradise. The employer ID is a foreign key
that refers to a separate table about employers, ML meets the world of normalized
data. Given the setting of this kind, data scientists view these attributes about the
employers as potentially more features for the ML algorithm. And the intuition could be
that maybe if your customer is employed by a rich corporation based in Washington,
they are less likely to churn. And so they basically want to get all these features which
basically forces them to do key-foreign key joins of these two tables in order to gather
all these features for the ML algorithm.
Unfortunately, almost all major ML toolkits today expect single table inputs for their
data training process. This forces these data scientists to do what I call ML after joins.
So what is this ML after joins and what is the problem? Here is an illustrative
example for the customers and employers tables with some representative numbers of
customers and employers: 100 million customers, 100,000 employers, and tens of
features. The feature vectors of the employers are shown with different patterns
over here. You have this key-foreign key join. The input could be, say, hundreds of
gigabytes.
But after the join the input blows up to several terabytes. This is because the join has
introduced redundancy in the data. Notice that these vectors about Microsoft are
repeated for every customer that is employed by Microsoft. Now, in many cases this
blow-up in storage is actually a major issue. In fact, in one example, one of the
product groups within Microsoft actually encountered this problem when they
were doing Web security-related ML applications. It even occurred to a customer of
LogicBlox. They mentioned that they were joining five tables and the data blew up by
over an order of magnitude. So storage is one of the key bottlenecks that ML after
joins encounters. You have to have extra storage space because of these ML toolkits.
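A quick back-of-the-envelope from the running example's numbers: with 100 million customers referring to 100,000 employers, each employer's feature vector appears in the joined table about a thousand times on average,

    $\frac{10^{8}\ \text{customers}}{10^{5}\ \text{employers}} \;\approx\; 10^{3}\ \text{copies of each employer feature vector after the join,}$

so the employer features are stored roughly a thousand-fold redundantly, which is the kind of replication that turns hundreds of gigabytes into terabytes.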
There's also redundancy in the data, which causes redundancy in the analytics
computations, and that wastes runtime. In the real world data is rarely static. As you get
more examples, more features, the data scientist now has this additional headache of
maintaining the output of the join and keeping the models fresh, which impedes the
productivity.
Ultimately all of this impedes the exploratory analysis process that is often required for
building machine learning models. Extra storage and wasted runtime are what I call
system efficiency issues. Maintenance headaches and impeding the exploratory
analysis process are what I call human efficiency issues, because the data scientist is
spending time on grunt work and thus productivity goes down. And nowadays a lot of
these analytics computations are moving to the Cloud. There every bit of extra storage
and extra computation could incur extra monetary costs.
To summarize, I introduced the problem of ML after joins, which is a fundamental
bottleneck in the end-to-end process of building ML models. Data in the real world is
often multi-table, connected by key-foreign key relationships. Since most ML toolkits
expect single table inputs, the data scientists are forced to materialize the output of the
join that concatenates feature vectors from all these tables, which leads to
human and system efficiency issues. Yes, Ravi?
>>: Quick question. When you build an ML model, is it built on the real set or a training set?
Aren't training sets much smaller? First question.
And second question, the patterns you explain, right, the redundancy -- they seem to be
very amenable to compression techniques like [indiscernible] input.
>> Arun Kumar: Right.
>>: Why don't I take the denormalized relation and compress it?
>> Arun Kumar: Right. Great questions. Coming to the first question: does the training
sample have to be smaller than the actual data set? Yes, in many applications
sampling could actually work. But for many of the other ML tasks they actually want to
get, say, the effects that are less likely to occur in a sample. Therefore, they want to
use the whole data set. So there is a spectrum of applications over there. Even inside
a sample, if you have a sampled example set, there might be redundancy in the data and
as I'm going to explain there will still be a tradeoff space for some of the techniques I'm
going to talk about where it could still benefit the performance, runtime performance.
Coming to the second question with respect to compression: yes, in many of these cases
the data sets that have redundancy are amenable to columnar compression.
However, that introduces an orthogonal tradeoff space where you now have to trade off
operational efficiency for compression, or operate directly on compressed data. One
of the techniques that I'm going to talk about today is sort of a stepping stone to that.
Looking directly at operating on compressed data, how do we integrate these ML
techniques. So that's a nice idea for future work and some of the ideas that I'm going
to explain today are amenable to that as well.
Okay?
>>: I have a quick question on the maintenance. Can you explain that one again?
Because it seems like the maintenance is just like a view of the [indiscernible]. That's
all you need. Is there something more than that?
>> Arun Kumar: Well, yes. Yes and no. It depends on the application and the
environment. In some of these cases they actually produce copies of the data and
they move to different toolkits. So there the infrastructure that we have built in say the
database environment for view maintenance, that gets lost. And, therefore, they have
to manually keep track of where they have the data and how to update the data and so
on. In the database context, we can construct a view of these joins and do the joins
virtually on the fly. I'm going to talk about that as well. Turns out that there is still
computational redundancy if you do the view approach. There's a tradeoff space that
I'll talk about.
>>: Right.
>> Arun Kumar: Okay? To give an overview of my work: in my thesis work I proposed
this paradigm of ML over joins to combat this problem of ML after joins. And in project
Orion which appears in SIGMOD 2015 we show how we can avoid joins physically.
Essentially we do not construct a table of the join, we do not need the joins
themselves. We can operate on data exactly as it resides in the multi-table format.
We show how this can help improve runtime performance, because the inputs
of the join could be much smaller, and it produces exactly the same model and
thus gives the same accuracy.
This could potentially improve human and system efficiency. Going further, in project
Hamlet, SIGMOD 2016, we show that in some cases you can actually avoid joins
logically as well. What do I mean by that? We can actually get rid of entire input
tables without even looking at them. This obviously runs faster, but I'm going to
explain why it can give very similar accuracy. This could potentially improve human
and system efficiency even further. Moving on, I'll dive into project Orion. In project
Orion, our idea to combat the problem of ML after joins is this technique of factorized
learning. And the intuition is simple: we factorize, or decompose, these ML
computations and push them down through the joins to the base tables, much like how
database systems push selections, projections, and aggregates down through joins.
The benefits: we avoid redundancy in I/O because we are not constructing this output
data set. We avoid redundancy in computations. We don't need extra storage, runtime
could improve, and it could be potentially more data scientist friendly because they
operate on data exactly as it resides rather than on a materialized intermediate
data set.
The question is: Is factorized learning even possible? And in this paper we show that
yes, it is possible at least for a large class of ML models known as Generalized Linear
Models, also known as GLMs, that are solved using batch gradient methods.
The challenge is how do we factorize these GLMs without sacrificing the ML accuracy,
the scalability of the implementations, and the ease of development of these ideas on
top of existing data management platforms?
I'll start with some brief background about these GLMs, for those of you who are not
familiar with them. Recall the classification example that I'm using as the running example:
classify customers as to whether they are likely to churn or not. In ML we start by
mapping each customer to a d-dimensional numeric feature vector. In this example
here you have two dimensions, so basically customers are just points. And now
GLMs classify these customers based on a hyperplane W over there. On one side
are points, basically customers, that churn. On the other side are those who do not.
The question is how do we choose a good hyperplane W? In GLMs, recall that we
start with a training data set. That is the input. We have a set of labeled customers.
Those are labeled points, labeled yes or no, plus or minus. The idea to compute a
hyperplane is to minimize a certain score of the hyperplane on the training data set,
say the misclassification rate. That is essentially a sum of n terms, where n is the
number of labeled customers, of a function of the inner product w transpose x_i,
which is the distance of the point to the hyperplane, and the class label y_i, where x_i is
the feature vector of the i-th customer and y_i is the class label of the customer.
It turns out that the misclassification rate is hard to optimize, and thus GLMs instead
use a smooth function of the inner products w transpose x_i. And depending on what
this function is, we get a different ML model. If it is the squared distance, it is the popular
least squares linear regression model that is used for trend analysis. If it is
the log loss, it is the popular logistic regression classifier. If it is the hinge loss, it is
the linear SVM, and so on.
So basically, the takeaway here is that all of these ML models have the same common
mathematical structure: there is this minimization problem with a sum of n
terms and a function of the inner product w transpose x_i.
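Written out, the common structure just described is this minimization problem (standard GLM notation matching the talk):

    $\min_{w \in \mathbb{R}^{d}} \; L(w) \;=\; \sum_{i=1}^{n} f\!\left(y_i,\; w^{\top} x_i\right)$

with f the squared distance for least squares regression, the log loss for logistic regression, and the hinge loss for the linear SVM.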
>>: So there are SVMs with other kernel functions?
>> Arun Kumar: Right.
>>: Then that will not work, right? Your whole approach will not --
>> Arun Kumar: That's a good question. It depends upon the structure of the kernel.
If the kernels operate on the inner products, then some of these techniques that I talk
about can actually apply there as well. But in this particular work we focused on GLMs
and therefore we only deal with linear SVMs here.
Okay. So how do we do this minimization? There is this technique called gradient
descent that I'll give some background about. We need to get to the optimal hyperplane
W that minimizes this function, which is also known as the loss function, or L of W. And
it turns out that the loss function for GLMs is a ball-shaped curve, formally known as a
convex function. And the optimal is essentially the bottom of the ball. That is some
separating hyperplane in the point space. In general, the minimization has no closed-form
solution, and thus people use an iterative solution known as gradient descent, where the
idea is simple. You start with some value W, compute the gradient at W -- the
gradient is essentially the generalization of the slope to multiple dimensions -- and you
move in the opposite direction of the gradient, the descent. And alpha is the step size
parameter that controls how much you move in the opposite direction. To give an
illustration, we start with some W0. That is some separating hyperplane over there.
We do one pass over the data set. Basically look at all the points and compute the
gradient. Take a step in the opposite direction. We get to a new model, W1. Do one
pass over the data. Compute the gradient. Move in the opposite direction. And keep
doing this iteratively until you get closer and closer to the optimal. So this is basically
the BGD technique. What does it do today in the context of -- yes?
>>: [speaker away from microphone.] single pass over the entire table?
>> Arun Kumar: Yes.
>>: And how many passes typically does it take to converge?
>> Arun Kumar: It depends. It depends on the data set and the accuracy desired by
the data scientists. In practice I have seen them run like 20 to 40, 50 iterations.
>>: So then the question is, the cost of just doing the key-foreign key join, is that a
bottleneck in the larger process, given that you have, say, 40 passes?
>> Arun Kumar: I'll explain some of the experimental results, with a breakdown of
the runtime of the join. It turns out that the redundancy introduced by the join is
often the dominating factor there. Okay?
Okay. So recall that we simply run BGD on the concatenated feature vector if we
want to do ML after joins. You have features from customers stacked next to features from
employers. This is the expression for the loss. Take the first derivative; that's the
expression for the gradient. It is also a sum of n terms, with a scalar function g that
scales the feature vectors x_i.
Notice that the gradient, nabla L, the model or the hyperplane W, and the feature
vector x_i are all d-dimensional vectors, where d is the total number of features.
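In symbols, that gradient is (with g the derivative of f with respect to the inner product):

    $\nabla L(w) \;=\; \sum_{i=1}^{n} g\!\left(y_i,\; w^{\top} x_i\right)\, x_i, \qquad w,\; x_i,\; \nabla L(w) \in \mathbb{R}^{d}$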
So how does this work? We start with the inputs of the join, do the join, and get the output
of the join. Now forget about the inputs. Start with the initial model W0. One pass over
the data, you get the first gradient. You update the model. Next pass over the data,
get the next gradient. Update the model. And proceed iteratively.
So basically BGD after joins, you physically write the output of the join and then you do
one scan of the output table per iteration. What does factorized learning do? Recall
that our goal was to push computations to the base tables and our intuition is simple.
You split up the computations on the two feature vectors from the two tables, XC and
XE.
This is the expression for the gradient. The inner product can be rewritten as a sum of
inner products over the customer and employer features, and the insight is that a sum of
sums is still a sum. Thus we can do one pass over the employers table to precompute
the partial inner products over the employer features, and use that when we do a pass over
the customers table to reconstruct the full inner products. However, we run into this
challenge of how do we scale the entire feature vector for every
customer? Turns out that one pass over each base table may not be enough. We
need to save some statistics and exchange them across these tables and reconstruct
the full gradient. So how does this work? So here is the output of the join. Get rid of
the output table. Get rid of the joins. You start with the initial model. You chop up the
coefficients into coefficients over customer features and coefficients over employer
features. One pass over the employers table, and you get some statistics; if you're
curious, these are basically the partial inner products per employer ID. Use those
statistics to do one pass over the customers table and we get a portion of the gradient.
We also get some statistics over the customers table. If you are curious, these are
basically the group-by sums of the scalar function g per employer ID. Use those
statistics, do a second pass over the employers table and we get the remainder of the
gradient. Stick these two vectors together. That's the full gradient. Proceed and
update the model and go ahead to the next iteration.
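To make the passes concrete, here is a minimal NumPy sketch of one factorized BGD iteration, assuming logistic regression, integer employer IDs, and in-memory arrays (the names and the single-node setup are illustrative; the actual Orion implementation runs inside the data platform):

    import numpy as np

    def lr_grad_scalar(y, a):
        # Derivative of the logistic loss log(1 + exp(-y*a)) with respect to a.
        return -y / (1.0 + np.exp(y * a))

    def factorized_bgd_step(w_S, w_R, XS, Y, FK, XR, alpha=0.1):
        """One BGD iteration without materializing the join.
        XS: (n, dS) customer features; Y: (n,) labels in {-1, +1};
        FK: (n,) integer employer ids indexing rows of XR: (m, dR)."""
        # Pass 1 over employers: partial inner products per employer id.
        pip = XR @ w_R                                  # shape (m,)
        # Pass over customers: reconstruct full inner products and the customer-side gradient.
        a = XS @ w_S + pip[FK]                          # w^T x_i for every customer i
        g = lr_grad_scalar(Y, a)                        # scalar g per customer
        grad_S = XS.T @ g
        # Group-by sums of g per employer id: the statistics shipped back to employers.
        hs = np.bincount(FK, weights=g, minlength=XR.shape[0])
        # Pass 2 over employers: the remainder of the gradient.
        grad_R = XR.T @ hs
        # Gradient descent update on both halves of the model.
        return w_S - alpha * grad_S, w_R - alpha * grad_R

Looping this step produces exactly the gradients that BGD over the materialized join would produce, without ever constructing the joined table.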
So basically factorized learning requires one pass of customers and two passes over
employers per iteration. And it completely avoids the need to join these two base
tables. Now, yes, Vic?
>>: Can you have sort of data features that require combining some columns from the
factor [indiscernible] dimension.
>> Arun Kumar: Yes.
>>: Like a more complex [indiscernible] in that case, are the technique still
[indiscernible]?
>> Arun Kumar: That's a great question. You are basically talking about feature-feature
interactions. Now, there are methods that are called second-order methods,
like Newton's method, where you actually construct the Hessian matrix that has pair-wise
interactions among all features.
It turns out that the factorized learning technique that I talk about here extends
to that technique as well. However, there is a tradeoff in terms of the runtime
performance. That tradeoff space looks a little bit different than what it looks like
here. So in terms of feasibility, yes, it's possible. But in terms of efficiency, the
tradeoff space is a bit different.
Okay? So now I've talked about the algorithm; when we want it to work in practice, there
are some challenges that arise when you want to implement it on a real system. What
if the statistics that I talked about per employer ID do not fit in the aggregation
context's memory? And how do we implement this on top of existing data processing
systems without having to change the internals, for ease of deployability?
For the first one we go into the details of a partitioning based approach in the paper
where we stage the computations of different portions of these statistics and stitch
them together to reconstruct the full gradient. And for the second one we use the
abstraction of user-defined functions, specifically user-defined aggregate functions,
which are available in practically all major database systems, as well as on Hive.
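As a rough sketch of how one such pass fits the user-defined aggregate abstraction, here is Python pseudocode following the common initialize/transition/merge/finalize contract (the method names mirror that generic pattern, not any particular engine's API, and logistic regression is assumed):

    import numpy as np

    class BGDPassUDA:
        """One pass of batch gradient descent expressed as a user-defined aggregate."""

        def __init__(self, w):
            self.w = np.asarray(w, dtype=float)   # current model, fixed during the pass
            self.grad = np.zeros_like(self.w)     # aggregation state: running gradient

        def transition(self, x, y):
            # Called once per tuple scanned; logistic-regression gradient contribution.
            a = float(self.w @ np.asarray(x, dtype=float))
            self.grad += (-y / (1.0 + np.exp(y * a))) * np.asarray(x, dtype=float)

        def merge(self, other):
            # Combine partial states from parallel scans (shared-nothing parallelism).
            self.grad += other.grad
            return self

        def finalize(self, alpha=0.1):
            # One gradient descent step using the aggregated gradient.
            return self.w - alpha * self.grad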
So how does this work in practice in terms of efficiency? Oh, yes, Donald?
>>: It only works for key-foreign key joins, right, [indiscernible] need to join between
the two tables?
>> Arun Kumar: There are two aspects to it. In this particular paper we focused on
key-foreign key joins because they are very common and we thought that would be
nice. And in terms of applicability of this technique to general joins, where you can
have the full cross-product appearing, it is technically feasible, but you need an
identifier, a [indiscernible] identifier, that needs to be added to the statistics because the
joining attribute is no longer the primary key in one of the tables. So we
haven't looked at it in this paper, but conceptually there is nothing that prevents us
from extending to it.
Okay? So what about system efficiency, is it actually faster? We've done an extensive
empirical analysis in the paper and I'm going to give a snapshot of one experimental
result here. We implemented all these techniques as user defined functions written in
C on top of the Postgre SQL open source database system and thus our experiments
were single node with 24 gig RAM. We synthesized data sets that resembled of some
the number of tuples, the number of features we saw in practice. Here we have
customers with 150 million tuples, employers 10 million tuples. We ran logistic
regression with batch gradient descent for 20 iterations and I'm going to show two
things. The storage space consumed throughout the process of learning and the
runtime on the Y axis. The input data set, the customers and employers together, is
about 52 gigabytes. BGD after join is over there. They require about 20 gigabytes of
storage space. Factorized learning in contrast does not require any extra storage
space. It basically, there's a gap of 217 percent in terms of runtime -- in terms of
storage space. And in terms of runtime the gap is 334 percent.
And to break down the runtimes, as Ravi asked earlier: basically the after-joins approach spends
19 percent of its runtime in constructing the output of the join, and
13 minutes per iteration of BGD. In contrast, with factorized learning you avoid the join
time, and the runtime per iteration is about five minutes because it operates on the
smaller data, the inputs of the join.
Now, of course, all of these relative runtime numbers depend on several factors and
these are the number of tuples, the number of features, how much memory the system
has, the number of iterations and the degree of parallelism and so on. And we have
done an extensive sensitivity analysis to all of these factors in the paper. Overall, it
turns out that factorized learning is usually the fastest approach but not always. There
are some cases where the BGD after joins approach could be slightly faster, and we
devised database-style cost models for the I/O costs and the CPU costs of these
techniques that automatically pick the fastest approach on a given instance of the
input.
There are more details in the paper and I would be happy to chat with you offline about
it. We give a proof for why the accuracy of the BGD technique is not affected and why
we get exactly the same model with factorized learning. There are other
approaches for larger-than-memory scale that we discuss, and the tradeoffs
there. And on the question about views: it turns out that the I/O redundancy can be
eliminated if you use views, but computational redundancy still exists and thus it could
be slower than factorized learning in many cases. And we also extend all of our
techniques to the parallel shared-nothing setting in the distributed environment and
prototyped our ideas on Hive on Hadoop as well as on Spark. We extended this to
multi-table joins, specifically star joins, but the extension is trivial to snowflake joins as
well.
Yes?
>>: Are you going to talk about SGD?
>> Arun Kumar: Great point.
>>: Okay.
>> Arun Kumar: Right. We also extended our ideas for GLMs to SGD,
stochastic gradient descent, as well as coordinate descent methods. We also
extended them to probabilistic classifiers like naive Bayes and decision trees, and also to
clustering techniques such as k-means. A lot of this is work done with Masters and
undergrad students. Basically I could offload some of this technical work to them
and focus on newer ideas, because I didn't want to focus on extensions of this directly
myself; I focused on the newer idea, project Hamlet, that I'll talk about next.
I'm also working on extending the idea of factorized learning to linear algebra so that
we can generalize it to any ML computation that is expressible in linear algebra:
given a linear algebra script that operates on a matrix that is the output of the join,
the system will automatically rewrite it into a script that operates on the inputs of the join.
Yes?
>>: Just quickly, for these extensions are you seeing similar runtime [indiscernible]
>> Arun Kumar: It depends. It depends on the technique. It depends on the data
parameters. It turns out that for stochastic gradient descent there is no computational
redundancy in the in-memory setting, because the model changes after every example,
and for coordinate descent there could be certain overheads imposed by the columnar
access patterns. We looked at it in the context of column stores, and the tradeoff
space looks a little bit different for each of these techniques. For some of the probabilistic
classifiers and clustering, it's very similar to BGD.
Yes?
>>: I guess as part of that SGD, it seems like you don't get the computational gain.
Then if you're doing it on a view you wouldn't have I/O gain either. I was curious, did
you find a gain at all?
>> Arun Kumar: Yes. It turns out for SGD, since we don't have computational redundancy, the only
thing that matters is the data redundancy, the I/O redundancy. So in order to avoid
constructing the output of the join, we say you can do a hash table and you can do a
view, basically an indexed hash join, and the challenge there is: what if the hash table
doesn't fit in memory? Then you need to partition the data, but that scrambles the
ordering that you need for SGD, which is very sensitive to the order in which you
access the data. Now it introduces a new runtime-accuracy tradeoff. We are looking
at some of the issues in this tradeoff, things like what if the foreign key is highly
correlated with the class label, and those sorts of issues. So you can expect a paper
about that pretty soon. But I'm happy to chat offline with you about more
details if you're interested.
Okay?
In short: a sum of sums is the same sum; factorizing avoids repeating some. So learn
after joins no more; do learn over joins and soar. That is project Orion in one stanza.
Moving on. I'll dive into project Hamlet.
Recall the same running example. Customers referring to employers. Here is an
observation. Given the employer ID, all the features of the employers are fixed. So do
we really need the state and revenue features if we know the employer ID, which is
already present in the customers table? There is this notion of feature redundancy in
the information theory literature, and we can show formally that given the employer
ID, all the employer features are redundant. This motivates us to basically consider
the idea of omitting all those features. That is, avoiding the join logically. In short,
avoiding the join. We use the foreign key, employer ID, as a representative for all the
employer features. Thus, we get rid of that table. However, there's also this notion of
feature relevancy. Certain features of the employers could be highly predictive of the
target, in this case the churn, in which case we might actually want to bring it back and
let the algorithm decide. So oops, we need the join.
You might be wondering what is the big deal? Why do we need to think of avoiding the
join? Why not just use a feature selection technique? These have been studied for
decades and they manage this redundancy-relevancy tradeoff. We can just apply that in
this context. Well, in one word, the answer is runtime. But here is a brief background
about what feature selection does, if you are not familiar with it. It is essentially a
method to search the space of subsets of features and obtain a subset that is
probably more accurate or more interpretable. And there are various algorithms for
feature selection: wrappers, filters, embedded methods. One of the most popular
ones is forward selection, where you start with a single feature in your set and you
compute the prediction or generalization error, the test error. And then you keep
adding one feature at a time depending on whether the test error goes down or not.
Backward selection is the reverse. You start with your entire set of features and you
keep dropping one at a time based on whether your test error goes down or not.
There are filter techniques that rank features. There are embedded methods like
regularization with L1 or L2 norms.
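As a concrete illustration of the forward selection loop just described, here is a minimal Python sketch, where train_eval is an assumed helper that trains a model on a feature subset and returns its validation error:

    def forward_selection(train_eval, all_features):
        """Greedy forward selection over a list of feature names."""
        selected, best_err = [], float("inf")
        remaining = list(all_features)
        while remaining:
            # Evaluate adding each remaining feature to the current subset.
            errs = {f: train_eval(selected + [f]) for f in remaining}
            f_best = min(errs, key=errs.get)
            if errs[f_best] >= best_err:
                break                 # no remaining feature lowers the error; stop
            selected.append(f_best)
            remaining.remove(f_best)
            best_err = errs[f_best]
        return selected, best_err

Backward selection is the mirror image: start from the full set and greedily drop one feature at a time while the validation error does not go up.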
Coming back to the question of why bother avoiding the join? Well, if you avoid the
join you reduce the number of features that you give as input to these algorithms which
reduces their search space, potentially improving their runtime.
Basically we are short-circuiting the feature selection process using database schema
information, the key-foreign key relationship. The million-dollar question: what happens
to accuracy? To understand the effect of avoiding the join on accuracy, we need to
understand the bias-variance tradeoff. I'm going to give a little bit of background here.
Learning theory tells us that the test error of a classifier can be decomposed into three
components: the bias, the variance, and the noise. The bias, also known as the
approximation error (again, these are informal explanations), is sort of a quantification of
the complexity of the classifier's hypothesis space. Models that are more complex tend
to have lower bias.
The variance is a characterization of the instability of a classifier with respect to a particular
training data set. Models that are more complex tend to have higher variance, fixing the
training data set. And fixing the model, if you give fewer and fewer training examples,
the variance tends to go up.
The noise is a component that no model can mitigate because of unobservable data.
And traditionally the bias-variance tradeoff is illustrated as follows in ML. As the
model complexity keeps going up for a fixed training data set, the training error keeps
going down. But the test error goes up beyond a point. On the left is a situation of
high bias and low variance due to low model complexity. On the right is a situation of
low bias, high variance because of high model complexity. Situations with high
variance are also colloquially called overfitting.
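Schematically, the decomposition just described can be written as follows (an informal statement; the precise form depends on the loss function):

    $\mathbb{E}\big[\text{test error}\big] \;=\; \underbrace{\text{bias}}_{\text{approximation error}} \;+\; \underbrace{\text{variance}}_{\text{instability w.r.t. the training data}} \;+\; \underbrace{\text{noise}}_{\text{irreducible}}$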
So the key question for our work now becomes: What is the effect of avoiding the
join? That is, omitting the feature vector about the employers on the bias and the
variance. We did an applied learning theory analysis of the effect of avoiding joins on bias and
variance. I'm going to give a brief summary here. First, the effect on bias. So here is the
full feature vector. We renamed the employer ID as FK to be generic and without loss
of generality assume for now that the customer feature set is empty. So we have this
reduced feature vector. The classifier that we learn is essentially a prediction function;
it is a function of the feature vector. And HX is the hypothesis space of all possible
classifiers that you can build based on the feature vector X.
We show that this hypothesis space does not change if you omit the employer features
in this case. Basically, HX equals HFK. In a sense, this learning theory result is
equivalent to the earlier information theoretic result that I mentioned about the
employer features being redundant.
And what this basically leads to is that if you avoid the join -- oh, and if you actually avoid
the foreign key and use the employer features instead, it turns out that the hypothesis
space can actually shrink -- so basically what this leads to is: if you avoid the join, the bias
is not likely to go up. If it's not likely to go up, then we don't need to worry about the
bias. However, it actually turns out that for some popular classifiers like logistic
regression and naive Bayes, the bias can actually go down. So what happens to
variance? If the hypothesis space is unlikely to change, does that mean the
variance is unlikely to change? The short answer is no. And the key insight here is
that feature selection may retain or omit certain features in the feature vector. To understand
this, here is an example. Here is your full feature vector and again without loss of
generality, assume the customer feature set is empty. We have this reduced feature
vector. And suppose I give the following true concept that generates the data that is
sort of the worst case scenario for avoiding the join. Suppose the true concept is as
follows: If the state of the employer is Wisconsin, then the customer is unlikely to
churn. Otherwise they are likely to churn.
If I generate thousands of examples based on this true concept and give that as input
to a feature selection method, what is the likely output feature subset? It is highly likely
to be state. And the key insight here is that in general, the domain of the foreign key
could be far larger than the domain of the features it refers to. There are only 50
states. There could be millions of employers. To understand this, here is an
illustration. This is the true concept. And the hypothesis space of classifiers built
using the State feature, say, encompasses the true concept. But if you use the
foreign key, the hypothesis space is far larger. Notice that we already showed that
HFK equals HX. Thus if you use the foreign key you could end up with higher
variance compared to the State feature. However, avoiding the join has forced us to use the
foreign key as a representative of the State feature, which leads to the result that
avoiding joins could increase the variance.
In short, we asked the question: what happens to the bias and variance if you avoid the
join? And our analysis suggests that avoiding the join is not going to
increase the bias, but it could increase the variance. This is a new runtime
performance-accuracy tradeoff in the ML setting using database schema information,
and we asked how we can help data scientists exploit our theory of this tradeoff in
practice. Our idea is to devise practical heuristics that help the data scientist bound
the potential increase in variance. And they can now apply a threshold: if it is
above a certain threshold, it is too unsafe, and I'm not going to avoid the join. I call this
process avoiding the join safely.
The challenge is how do you even obtain a bound to avoid joins safely? There is no
precedent for this. Well, it turns out there are certain bounds in learning theory that we
can use. These bounds are bounds on the absolute variance based on the
Vapnik-Chervonenkis, or VC, dimension of a classifier, and essentially you can think of
it as a quantification of the complexity of a classifier's hypothesis space. Models that
have higher VC dimension tend to have higher variance, fixing the training set.
There are several bounds in learning theory using the VC dimension on the variance.
We apply a standard bound from Shai and Shai's popular learning theory textbook,
combined with a result about the growth function: the difference between the expected test error
and the expected training error is bounded by this function of the VC
dimension v, the number of training examples n, and the failure probability delta.
And this is for training data sets of size n whose examples are IID,
independent and identically distributed. Now, here is the catch. These bounds are
bounds on absolute variance. What we need is a bound on the increase in variance.
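One standard shape of such a VC-based bound, shown only to indicate how it grows with the VC dimension v (constants and log factors vary by textbook, and this is a generic form rather than necessarily the exact bound used in the paper): with probability at least 1 - delta over an i.i.d. training sample of size n,

    $\text{test error} - \text{training error} \;\lesssim\; \sqrt{\frac{v\,\log n + \log(1/\delta)}{n}}$

so a larger VC dimension, such as the one induced by a foreign key with a huge domain, loosens the bound unless n grows along with it.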
And this leads us to the heuristic, the ROR rule where our intuition is that the increase
in bound caused by the increase in VC dimension is an indicator of the risk of avoiding
the join and we define this quantity Risk of Representation, ROR, which compares the
hypothetical best feature selection output that you can get with the join
versus what you get by avoiding the join. That is this quantity over here. There
are three new terms. V_S is the VC dimension of the best classifier that you can get
after feature selection; V_No is the VC dimension if you avoid the join. And delta-bias
encodes the difference in the bias. But now here is the problem. V_S, V_No, and delta-bias are impossible to
know a priori in general because they require prior knowledge of the true concept. If
you already know the true concept, you probably don't even need machine learning.
How do we overcome this conundrum? Our idea is to upper bound this quantity ROR
to eliminate these terms that require prior knowledge of the true concept, and by upper
bounding it, we make it more conservative, in the sense that it's unlikely to avoid joins
where, if you avoid them, the error shoots up. But it is likely to miss certain
opportunities to avoid joins.
So we upper bound the ROR. The details are available in the paper. Essentially for
VC dimensions that are linear in the number of features, and this includes popular
classifiers like naive Bayes and logistic regression, the ROR can be upper bounded as
follows. We have these two new terms over here. D_FK is the domain size of the
foreign key, the number of employers. And q_E* is the minimum domain size of the
features in the employers table. And the ROR rule now essentially says: if the RHS over
here is less than some threshold epsilon, it is safe to avoid the join.
Now, we looked at this and we thought -- oh, yes?
>>: I want to make sure I understand. That's the single feature with minimum [indiscernible]?
>> Arun Kumar: Exactly, it's the domain size of the smallest --
>>: The smallest domain feature, okay.
>> Arun Kumar: Okay? We looked at this and thought it is still so complex; can we
simplify it further to help data scientists? That's where we came up with the tuple ratio
rule, or the TR rule, where the idea is to eliminate the need to even look at the employer
features. Basically we want to get rid of that second term over there.
Define the tuple ratio as the ratio of the number of examples to the domain
size of the foreign key, D_FK. That's that quantity over there. Intuitively, it's just the
number of training examples, the number of tuples in the customers table, divided by the
number of employers, which is the domain size of the foreign key. That's why I call it the
tuple ratio. Now, if the domain size of the foreign key is much larger than the
minimum domain size of the features in the employers table, which is almost always the
case in practice, then for reasonably large tuple ratios it turns out that this bound
becomes linear in one over the square root of the tuple ratio. Flip it over and square it; the
upper bound becomes a lower bound, and we end up with the tuple ratio rule, which
essentially says: if the tuple ratio is greater than some threshold tau, it is safe to avoid
the join. Notice that this is even more conservative than the ROR rule. So after this
long journey through the wonderful world of learning theory, we come back to this
stunningly simple tuple ratio rule that only uses the number of tuples in these two
tables to help us avoid joins safely in this ML context.
So even if you have a million features about the employers, it is possible to safely
discard them without even looking at them. So does this work in practice? We ran
extensive experiments with real data sets. I'm going to give a snapshot here. We
tuned these thresholds for these rules using a simulation study. And it turns out that a
tuple ratio of 20 is good enough to decide whether a join can be avoided or not. Notice
that this simulation needs to be done once per VC dimension expression, not once per
data set as you need to do for hyper-parameters in machine learning. And the
tolerance that we set for the test error was .001.
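A minimal sketch of applying the TR rule per join with that threshold of 20 (the function and argument names are made up for illustration):

    def safe_to_avoid_join(n_train, fk_domain_size, tau=20.0):
        # Tuple ratio = number of training examples / domain size of the foreign key.
        tuple_ratio = n_train / float(fk_domain_size)
        return tuple_ratio >= tau

    # Running example: 100 million customers referring to 100,000 employers gives a
    # tuple ratio of 1000, so the employers table would be flagged as safe to discard.
    safe_to_avoid_join(100_000_000, 100_000)   # True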
We used real data sets. There are three here; there are more in the paper. These are
from Kaggle. All of these are classification tasks. In the Walmart data set we want to
predict the number of sales per department per store by joining data about sales
with stores and weather indicators. In the Yelp data set we want to predict the ratings
of businesses by users by joining data about ratings with users and businesses.
Flights is a binary classification task where we have data about flight routes and we
want to predict if a route is codeshare or not by joining data about the routes with airlines
and airport information.
We applied a standard off-the-shelf classifier, naive Bayes, which is very popular, and
combined it with popular feature selection methods like forward, backward, filters, and
so on. And we used the standard holdout validation methodology to measure
accuracy: 50 percent of the labeled set was set aside for training, 25 percent for
validation during feature selection, and 25 percent was set aside as the final holdout
test set, which is the final indicator of accuracy. Yes?
>>: Do you have domain [indiscernible] in every foreign key in your training set?
>> Arun Kumar: That's a good question. So in this particular example the domain of
the foreign key is defined by these tables over here. Each individual foreign key value
may or may not occur in a particular training sample. So we use
smoothing to basically figure out what value should be assigned to keys that do not
arise in the training sample. So the domain is known. Most of the foreign key values
will occur in a training sample, but for those that do not arise we use Laplacian
smoothing.
So what are we comparing? Input all the features to the feature selection method,
versus avoid those features that the TR rule says are safe to avoid. Does the TR
rule predict correctly? That's what we want to check. Notice that this is orthogonal to
the earlier work in project Orion, where we worried about physical joins; here all the
data is joined a priori, and we just worry about which features are eliminated. The
results are in the paper, but here is a snapshot of the results for backward
selection with naive Bayes. The tuple ratio rule is applied on a per-join basis, so the joins are
decided independently.
If you use all the features as input on the Walmart data set, the error for predicting
the department-wise sales is .8961. This is the root mean squared error with one to
seven levels. Does the tuple ratio rule avoid the join with stores and with weather? It
says that both joins are safe to avoid. What is the error if you avoid stores? What
is the error if you avoid weather? It turns out that the error is .8910. Notice that lower
error is better, and in this case it is not significantly higher. Therefore the TR rule got it
right on both these cases; it is safe to avoid. Thus, the overall error by avoiding both
of these two tables is .8910. However, on the Yelp data set, the error with all features
as input, the RMSE for levels one to five, is 1.1330. Two tables; the TR rule says
neither of them is safe to avoid. If you avoid the users, the error shoots up to 1.21. If
you avoid the businesses, the error shoots up to 1.24. Therefore, the TR rule got it
right in both cases.
They are not safe to avoid.
And the final error is, of course, the same as the original error because all features are
being used.
On the flights data set the error is the discrete 0-1 loss because it is a binary classification
task. It is .1930. Three tables; the TR rule says airlines is safe to avoid but not the airport
tables. But if you avoid any of these tables, it turns out the error doesn't shoot up. So
basically the TR rule got it right on airlines, but missed the opportunity on the airport
tables. The overall error is basically the error that you get by just avoiding the airlines
table. Notice that these are examples of the missed opportunities that I mentioned,
because these rules are conservative. And thus, there is scope for devising less
conservative rules here.
What happens to the runtime? On the Yelp data set there is obviously no speedup
because no joins are avoided. On the flights data set, by avoiding that particular table's
features, the speedup was a modest two times. But on the Walmart data set the
speedup was 82 times.
>>: As compared to Orion?
>> Arun Kumar: Sorry?
>>: As compared to Orion, or --
>> Arun Kumar: This is the speedup over using all the features.
>>: [speaker away from microphone.]
>> Arun Kumar: So everything is physically joined, but we now give only the features
from the sales table and omit the features from these two tables.
>>: But the speedup, I mean, there's a difference if you apply Orion; the physical
join would be faster?
>> Arun Kumar: It could be faster, yes.
>>: So is this the 82x over Orion or over ML after joins?
>> Arun Kumar: It is over ML after joins. So you construct the full table and omit the
features that the TR rule says it is safe to omit. Yes, integrating those two
could give even more speedup, yes.
Okay? So overall we had seven data sets that we measured in the paper, with fourteen
joins. It turns out that the TR rule correctly avoided seven out of the 11 joins that are
safe to avoid, but missed the opportunity on four joins, and it correctly did not
avoid the three joins that are not safe to avoid. And the speedups range from no speedup,
because no joins were avoided, all the way to 186 times.
Yes?
>>: Does this change significantly if you don't use the simplification that
actually leads to the simple TR rule, that is, if you use the more complex formula?
>> Arun Kumar: Great point. Next slide.
[chuckling.]
>> Arun Kumar: So it turns out that the results are the same for logistic regression with
L2 and L1 as well, and it's also the same with the ROR rule that you asked about. Basically,
it turns out that the cases where our simplification of the ROR rule would matter are rarely
reflected in real-world data sets; TR and ROR somehow give the same results even though
they are vastly different in terms of how conservative they are. It also turns out that the
accuracy can actually increase by avoiding the joins in some cases, and that dropping the
foreign key causes the accuracy to drop significantly in many cases because the bias
shoots up.
And we also show the details of the simulation study that goes into how we tune these
thresholds. And finally we also handle cases where there are skews in the foreign key
distribution. Foreign key skews are a classical problem in the database world for
parallel hash joins and so on and we present the first results of the implications of
foreign key skews for ML accuracy.
In short, to join or not to join, that is the question. The tuple ratio is our coin; use it at
your discretion. That is project Hamlet in short.
To summarize, I presented to you the problem of ML after joins. The data scientists
are often forced to construct the output of the join because data in the real world is
often multi-table connected by key-foreign key relationships, but most ML toolkits
expect single-table inputs. Materializing the output of the join leads to human and
system efficiency issues. In project Orion, I showed how we can avoid joins physically;
there is no need to materialize the output of the join. ML can operate directly on the
data exactly as it resides, which makes it run faster in many cases and yields exactly
the same accuracy, improving human and system efficiency.
In project Hamlet, we went one step further to show how we can avoid joins logically in
some cases. Basically, we get rid of entire input tables without even looking at them,
which makes these techniques run even faster, and we explain why they would yield
similar accuracy, potentially improving human and system efficiency even further.
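To give a flavor of how operating on the data as it resides can work, here is a minimal sketch of computing a linear model's scores over a key-foreign key join without materializing the joined table; the names and shapes are my own assumptions for illustration, and the actual factorized learning technique covers full training, not just scoring.

import numpy as np

def factorized_scores(X_S, fk, X_R, w_S, w_R):
    # X_S: (n_S, d_S) features of the entity (fact) table
    # fk:  (n_S,)     foreign key of each fact tuple, as a 0-based row index into R
    # X_R: (n_R, d_R) features of the attribute (dimension) table
    partial_R = X_R @ w_R             # one partial inner product per dimension tuple
    partial_S = X_S @ w_S             # one partial inner product per fact tuple
    return partial_S + partial_R[fk]  # combine via the foreign key lookup

# Tiny hypothetical example; the result equals the inner products over the joined table.
X_S = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
X_R = np.array([[3.0], [5.0]])
fk = np.array([0, 1, 0])
print(factorized_scores(X_S, fk, X_R, np.array([0.5, 0.5]), np.array([2.0])))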
Moving on, I'm going to comment briefly on my future research ideas. To bring some
perspective to how my current work and future work are related: I talked about
learning over joins in project Orion. And in the context of feature selection over joins,
which is project Hamlet, several models are being learned over joins. That, in turn, is
part of a larger exploratory feature selection process in which data scientists compare
multiple feature subsets; we talked about that in project Columbus. That is part of the
larger process of feature engineering that I mentioned earlier, where the data
scientists convert the raw data into the feature vector they need. And that is part of an
even larger process, and I promise to stop with this one: the exploratory model
selection process, where basically the task is to obtain a precise prediction function
based on the raw data set.
For my short-term future work, in the next couple of years I basically want to focus on
new extensions and generalizations, both on the systems side and on the applied
learning theory side, for Hamlet, Orion, and Columbus. And for the longer-term future
work, I want to focus on helping improve these processes of feature engineering and
model selection more generally.
I'll speak briefly about my thoughts on this model selection process. From
conversations with data scientists in various settings, we observe that model selection
is often an iterative, exploratory process. Data scientists explore combinations of three
things, which I call the Model Selection Triple, or MST. What are the features? That
involves feature engineering. What is the algorithm? Is it, say, logistic regression or
something else? And what are the hyperparameters for the algorithm?
The data scientist starts this process by steering: deciding which combination they
want to try out first. Then they evaluate the accuracy of this combination; they execute
it on a system, on the data. They get the results and consume these results manually,
often figuring out, maybe I need to change the features, or maybe I need to change
the hyperparameters. Then they proceed with the next execution, consume the
results, and do this iteratively.
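For concreteness, here is a minimal sketch of what one such triple and the one-MST-per-iteration loop could look like; the class and field names are mine, purely for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class MST:
    features: tuple     # the feature subset, e.g. ("age", "income")
    algorithm: str      # e.g. "logistic_regression"
    hyperparams: tuple  # e.g. (("l2", 0.1),), kept hashable for caching and provenance

# One MST per iteration: the data scientist picks a triple, the system evaluates it,
# and the results steer the next pick.
current = MST(features=("age", "income"), algorithm="logistic_regression",
              hyperparams=(("l2", 0.1),))
print(current)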
The bottleneck we observe is that most systems today force these data scientists into
one of two extremes. Either they force them to do only one combination per iteration,
that is, one MST per iteration, or they tend to cut the data scientist out of the loop and
automate the whole process. Doing one MST per iteration turns out to be too tedious
automate the whole process. Doing one MST per iteration turns out to be too tedious
and painful for these data scientists. Automation works in several applications, but in
many other applications the data scientists do not want to be cut out of the loop. They
want their expertise to be used and they do not want that to be ignored.
And so the question I ask is: is there a more flexible, data scientist-friendly middle way
possible between these two extremes? We wrote up our thoughts on some of these
ideas in a vision paper that appeared in ACM SIGMOD Record just a few weeks ago.
This is work in collaboration with Robert McCann, who is part of the security team I
mentioned within Microsoft. And our idea is as follows: enable these data scientists to
specify several logically related MSTs together using, say, a declarative interface, a
higher-level interface that does not require them to enumerate these MSTs one by one.
They could link several logically related feature subsets together, or parameter
settings together, and under the covers we have a system that generates code and
evaluates multiple models. We could then apply database-style optimization ideas that
eliminate redundancy in computation, materialize intermediate data, and so on, and
thus bring together techniques from both the ML world and the data management
world to speed up this process.
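As a rough sketch of that idea, here is what a compact declarative specification of several logically related MSTs could look like, expanded under the covers; the interface and names here are invented for illustration and are not the actual system from the vision paper.

from itertools import product

def enumerate_msts(feature_subsets, algorithms, hyperparam_grids):
    # Expand a compact, declarative spec into the individual MSTs it implies.
    for feats, algo in product(feature_subsets, algorithms):
        for hp in hyperparam_grids[algo]:
            yield (tuple(feats), algo, tuple(sorted(hp.items())))

spec = {
    "feature_subsets": [["age", "income"], ["age", "income", "state"]],
    "algorithms": ["logistic_regression"],
    "hyperparam_grids": {"logistic_regression": [{"l2": 0.1}, {"l2": 1.0}]},
}

# Under the covers, a system could materialize the feature matrix for the largest
# subset once and hand out column slices per MST, instead of rebuilding it per run.
for mst in enumerate_msts(**spec):
    print(mst)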
And ultimately, since the system is now more aware of which combinations are being
explored and how they are related, we could apply ideas from the provenance
management literature in the database world to help these data scientists debug these
models and consume this information about the process even more effectively,
helping them steer the next iteration better. And so overall, we explain how we can
combine ideas from the data management literature, the ML literature, and also
human-computer interaction to improve the system efficiency of this process and also
the productivity of these data scientists.
And that's the segue to my concluding remarks about the importance of
intersectionality. I am a data management person and that is my background. But
over the last several years I have worked closely with people in the machine learning
world and worked with a lot of ML ideas. As I see it, advanced analytics is really
the coming together of these worlds of data management and machine learning.
Moving forward I would like to continue working in the space of advanced analytics,
building bridges between these two communities because of the sheer number of open
research problems and new opportunities that arise when you take a joint view of
these two worlds and I also would like to explore the interactions with human computer
interaction angles because in many of these tasks the data scientist is often in the
loop. And I would also like to look more closely at application domains where these
advanced analytics techniques are used. I have worked closely with enterprise data
scientists and also with Web data scientists, and I would like to work with other
application domains as well, to look at what the impact of these advanced analytics
techniques is on their applications.
That brings me to the end of my talk. I would like to thank my advisors, Jeff Naughton
and Jignesh Patel, and my other mentors, Chris Ré and David DeWitt at the Gray
Systems Lab, as well as my coauthors and collaborators. The code for all the systems
and techniques that I talked about is available as open-source toolkits on my Web
page. Even the data sets are available. Feel free to shoot me an email if you would
like to talk to me about any of it offline. I'll stop now and I will be happy to take more
questions, both here and from the remote audience.
[applause.]
>> Arun Kumar: Thank you. Yes?
>>: Yes. The model selection work reminds me a little bit of MLbase, which you didn't
mention. So one question I have is: first of all, the optimizations you talk about, are
those similar to the ones that MLbase talks about, the multi-query optimization?
That's one question.
The other question is, once you are at that level, is that where the huge cost is? And
then maybe some of the things that you were talking about in Orion and Hamlet
become less relevant? Or are they just as relevant in the bigger context as they are in
the smaller?
>> Arun Kumar: Right. So to answer the second question first: Orion and Hamlet, and
even Columbus, the way I think about them, are building blocks for this larger vision.
So all of these techniques will contribute to the optimization framework that I talked
about. And when we take a joint view of these Model Selection Triples, new
opportunities arise. Like, can we reuse results across models? What sorts of
intermediate data sets do we need to materialize? Also, interactions between
provenance and optimization become important. Can we basically reuse results
across iterations? Can we avoid recomputation if they want to do what-if debugging?
Those sorts of questions. So the way I look at it, all of these are building blocks in the
context of the model selection optimization.
Now, coming to your first question about MLbase, we actually talk about the
relationship with MLbase in the paper as well. The way we view it, MLbase is one way
to specify an interface. They do not go into the feature engineering details, like the
joins, for example. They talk about automating algorithm selection and hyperparameter
tuning, which is nice. And so it's closer to one end of the spectrum, where you want full
automation, whereas here we talk about this entire spectrum, from full automation all
the way to individual iterations, and about bridging the gap between these two
extremes and coming up with frameworks, maybe not one framework but multiple
frameworks. MLbase could be one. Columbus is another. There are automated
hyperparameter tuning libraries that can be viewed as basically new interfaces for the
model selection management process. And the term we use in the paper is that this
could be a narrow waist for a new class of systems that optimize the model selection
process. And thinking about it in this way enables new interfaces, like combining
multiple feature subsets and parameter combinations, which MLbase doesn't do, for
instance.
>>: Would the optimizations be similar?
>> Arun Kumar: Some of them could be similar, some of them could be different. For
example, in the automated model search they talk about doing batching, and they also
do computation sharing. So some of that could be similar, but some of it could be
different. When we look at the feature engineering context, the join stuff, for example,
they do not do anything like that.
So looking at the end-to-end process of model selection, some of the optimizations
that we are talking about in this context, some of the MLbase ideas could be relevant
here. But it also opens up a whole new space of optimization ideas, and we go into the
details in the paper. We categorize them into several categories of optimization.
There we also look at introducing new interfaces for accuracy-runtime tradeoffs, like
the Hamlet stuff, or some of the others like warm starting and model caching and
reuse, and those sorts of optimizations.
Okay? Any questions? Any other questions? Any questions from the remote
audience?
>>: Nothing remote.
The one thing that surprised me, though, sort of [indiscernible] provenance
management there. Because ultimately what we are doing -- this may be just a lack of
intuition on my part. One thing that we are doing is, we are selecting models.
>> Arun Kumar: Right.
>>: Or combinations of models, features, and hyper parameters.
>> Arun Kumar: Right.
>>: Now, can you give us some idea how provenance management plays into that?
>> Arun Kumar: Yes, certainly. In the context of the database world, for SQL query
processing, people have looked at where-provenance, how-provenance,
why-provenance, all of these sorts of models, to debug why certain things are in the
output or why things are not in the output. In the ML context, the debugging we are
talking about is why certain features matter or why certain subsets of examples
matter. So there are these sorts of debugging questions as to where they need to
invest their effort for the next iteration: do they need to get more examples? Do they
need to get more features of certain kinds? Providing system support for these sorts
of questions could be helpful.
Currently, what they often do is, through pure intuition or through manual notes, they
track the changes and try to figure it out. So there's low-hanging fruit there, which is
basically defining provenance for machine learning and then providing querying
support for this process information. And the next stages are things like what-if
debugging, and things like recommendations for features: can you use some past
information to make recommendations based on the way things behaved in the past?
Those sorts of questions.
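As a small illustration of this kind of process provenance, here is a sketch that logs each iteration's feature subset and accuracy and answers a simple "what changed between two iterations" query; the schema and function names are assumptions for illustration, not the design from the paper.

iterations = []  # each entry: (iteration id, frozenset of feature names, accuracy)

def log_iteration(it_id, features, accuracy):
    iterations.append((it_id, frozenset(features), accuracy))

def diff_features(i, j):
    # Which features were added or dropped between two logged iterations.
    _, f_i, _ = iterations[i]
    _, f_j, _ = iterations[j]
    return {"added": sorted(f_j - f_i), "dropped": sorted(f_i - f_j)}

log_iteration(0, ["age", "income"], 0.81)
log_iteration(1, ["age", "income", "state"], 0.84)
print(diff_features(0, 1))  # e.g. which change coincided with the accuracy gain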
>>: But to turn this around: you're saying that you want to have the sense of
provenance management that we have in databases in the context of machine
learning, but you are not saying that the database provenance management
techniques give you, like, an ML --
>> Arun Kumar: Oh, it's the former. Applying the philosophy of the provenance
management work that you see in the database community, bring that to machine
learning. As for the specific techniques, some of them might work and can be reused;
some of them will need to be devised from scratch. Looking at this connection, at this
context, is I think a very interesting area.
>>: Okay, perfect. That's where my hang-up is.
>> Arun Kumar: Okay, great. Yes, Vivek?
>>: [indiscernible] and joins. So in many cases, if you want to build a model, you may
not build it on the entire fact data, but only on a subset, say for customers in
[indiscernible]
>> Arun Kumar: Right, right.
>>: So for that purpose, I mean, [indiscernible] would join anyhow. And suppose the
conditions are on the dimension tables, right?
>> Arun Kumar: So here, in the example, you are referring to the employers. You are
saying employers based in Washington, something like that.
>>: Correct. Right?
>> Arun Kumar: Uh-huh.
>>: So in that case do you still have -- I have to join at least some of the dimension
tables, not all of them.
>> Arun Kumar: Sure, sure.
>>: Do you still see a benefit in performance in those cases?
>> Arun Kumar: Right. So is it Orion or Hamlet? Both?
>>: Either.
>> Arun Kumar: Okay. So there are two aspects to this. One is, even if we have a
completely denormalized table, where everything has been physically written out as a
single table, what we found -- and this is in the context of the Santoku system that I
demonstrated at VLDB last year -- is that it might be worth normalizing that table back
to the multi-table form under the covers, without the data scientists having to know,
and then applying the factorized learning technique, because the per-iteration runtime
can actually be significantly lower if you do factorized learning, and therefore this
renormalization can actually turn out to be beneficial. So that's what we found in the
context of Santoku, and that's true both in memory and out of memory, whatever the
setting.
But it depends on the model. It depends on the data dimensions and all of those
things. And so the same cost model and the tradeoffs matter.
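To give a flavor of that tradeoff, here is a back-of-the-envelope sketch comparing per-iteration work over the denormalized table against work over the base tables plus a foreign key lookup; the cost expressions and numbers are simplified assumptions, not the actual cost model from Orion or Santoku.

def prefer_factorized(n_S, d_S, n_R, d_R):
    # Rough per-iteration operation counts; ignores I/O, caching, and constant factors.
    cost_materialized = n_S * (d_S + d_R)          # work over the denormalized table
    cost_factorized = n_S * (d_S + 1) + n_R * d_R  # base-table work plus one lookup/add per fact tuple
    return cost_factorized < cost_materialized

# Hypothetical shapes: many fact tuples per dimension tuple favors factorized learning.
print(prefer_factorized(n_S=10_000_000, d_S=5, n_R=50_000, d_R=200))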
What was the other question? For the Hamlet stuff, even if the data is denormalized,
the runtime savings come from the feature selection computations that you are
reducing. So it doesn't matter whether the data is physically joined or not; Hamlet will
still be able to give you a speed-up because you are avoiding those computations.
But if you have data access times that matter, then the physical join optimization might
also help.
Okay?
>> Christian Konig: Any last questions?
Nothing online, right?
>> Arun Kumar: I don't see any. No questions yet.
>> Christian Konig: Yeah, it typically doesn't happen. Let's thank the speaker one
more time.
>> Arun Kumar: Thank you for coming, everyone. Thanks, Christian.
[recording concluded.]