The Mathematical Foundations of Machine Learning
Version 1, 5/9/2020

This is a map of subjects and the corresponding textbooks that one should study in order to build a very solid mathematical foundation for machine learning work. This is a guide for learning the math that you will use, not for learning the machine learning algorithms themselves. Obviously not all of this is necessary, and you can find work in the machine learning field without knowing all of it. However, if your goal is to have a deep understanding, on both an applied and theoretical level, of the algorithms that are typically used in this field, then following this guide will enable you to do so.

None of the topics here covers the software aspect of machine learning. There are many applied machine learning courses available online that cover that side of the field; websites like Coursera or Udemy are a good place to start.

The only prerequisite knowledge this guide assumes is a year of calculus. There are many resources for learning calculus, so they will not be covered here. Either the book by Stewart or Khan Academy is a perfectly fine way to learn the subject.

My goal in writing this guide is to give someone who wants to read ESL (The Elements of Statistical Learning) and the Deep Learning book enough mathematical maturity to do so. Covering the required material will put you in good shape for that. These subjects are difficult and require a serious level of dedication. I've done my best to pick textbooks that have solutions available. I will not link to the solution manuals; they can be found with a little bit of searching. In fact, all required texts have solution manuals available aside from the linear models text by Christensen. Also, my goal isn't to provide resources that are free. I've picked what I believe are the best-quality texts for each subject, the ones that will give you the deepest understanding. With that said, it shouldn't be hard to find electronic copies of them.
You will need to be very comfortable with proving things; this is non-negotiable. So step 0 would be to go through a book like How to Prove It by Velleman, or Discrete Mathematics by Rosen.

From there we start with linear algebra and introductory analysis. You need a good grasp of these in order to understand statistics at the level required for machine learning. It is impossible to understand things like linear models and the central limit theorem without a good grasp of linear algebra and analysis. For these topics I recommend Linear Algebra by Friedberg, Insel, and Spence and Understanding Analysis by Abbott. These are both great textbooks. The linear algebra one is great because it blends applied and theoretical understanding, and Understanding Analysis helps build intuition for doing further work in analysis.

From here we move on to statistics, and then things branch out. As a machine learning practitioner, having a working knowledge of probability can be very helpful, but a rigorous understanding of probability cannot be had unless we first learn some measure theory. Thus I've included that in our path as well.

Below are the subjects to study and their corresponding texts. (In the original flowchart, a blue node was a required subject/text and an orange node was an optional one.)

- Proofs: How to Prove It (Velleman)
- Linear Algebra: Linear Algebra (Friedberg, Insel, Spence)
- Introductory Analysis: Understanding Analysis (Abbott)
- Statistics: Introduction to Mathematical Statistics (Hogg, McKean, Craig)
- Advanced Statistics: Statistical Inference (Casella, Berger)
- Optimization: Convex Optimization (Boyd, Vandenberghe)
- Introductory Linear Models: Applied Linear Statistical Models (Kutner et al.)
- Linear Models: Plane Answers to Complex Questions (Christensen)
- GLMs: Generalized, Linear, and Mixed Models (McCulloch, Searle, Neuhaus)
- Advanced Linear Models: Advanced Linear Modeling (Christensen)
- Analysis: Principles of Mathematical Analysis (Rudin)
- Functional Analysis: Introductory Functional Analysis With Applications (Kreyszig)
- Topology: Topology (Munkres)
- Measure Theory: Measures, Integrals, and Martingales (Schilling)
- Introductory Probability: A First Look at Rigorous Probability Theory (Rosenthal)
- Probability: Probability and Measure Theory (Ash, Doléans-Dade)
- Further Advanced Statistics: Theoretical Statistics: Topics for a Core Course (Keener)
- Asymptotics: Asymptotic Statistics (van der Vaart)

As stated before, if you are more comfortable with a standard discrete math text, you can replace Velleman with the textbook by Rosen. But Velleman is great, and it has many solutions in the back of the book. If you are finding linear algebra difficult, then backtrack a bit and try working through Introduction to Linear Algebra by Strang (with its corresponding MIT OpenCourseWare videos) or Linear Algebra and Its Applications by Lay.

For introductory analysis, Abbott is as good as it gets. If you want more references, though, Introduction to Real Analysis by Bartle and Sherbert and The Way of Analysis by Strichartz are also good. The latter is very wordy, but the author focuses heavily on building intuition, so it's great if you're not getting that from Abbott's text.

Once you've worked your way through linear algebra and analysis, you should have enough maturity to work through Hogg's intro statistics textbook. It is a great text when paired with Casella and Berger. I learned from both of these texts and I still reference them from time to time. If you need some supplemental texts to go along with Hogg, try Mathematical Statistics with Applications by Wackerly, Mendenhall, and Scheaffer; All of Statistics by Wasserman; and Mathematical Statistics and Data Analysis by Rice.
If you can get through the problems in Casella and Berger (you really only need through chapter 10), then, in terms of probability and statistics, you are more than prepared for work as a data scientist or machine learning engineer.

Applied Linear Statistical Models by Kutner et al. is a great text for learning linear models. It is incredibly long, clocking in at a little over 1400 pages. However, you can skip the second half of the book if you are pressed for time, because it covers basic design and analysis of experiments (ANOVA and the like). Once you've finished that, you can move on to the more theoretical aspects of linear models, like distributions of quadratic forms. This is covered in the book by Christensen. It's a great textbook, but unfortunately there is no solution manual available. If you are self-studying and need to be able to check your solutions, then I would recommend replacing this text with Linear Models in Statistics by Rencher and Schaalje; the solutions are in the back. There are also many supplemental texts here. Some notable ones are: A Primer on Linear Models by Monahan, Linear Models by Searle (solutions are available on the text's website), and Linear Statistical Models by Stapleton (solutions are in the back of the book).

Do not forget Convex Optimization! Knowing your optimization algorithms is incredibly important as a machine learning practitioner, and the text by Boyd and Vandenberghe is considered the bible. It can be a difficult text, though. You should have a very solid foundation in linear algebra, calculus, and introductory analysis, and even some topology, when working your way through it. Supplement with Munkres (the first few chapters on point-set topology) if needed, because I'm not sure Abbott does any topology outside of R. Once you've covered all that, you are in good shape!

If you are interested in really understanding probability, then you will need a much better understanding of analysis.
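(An aside on the "distributions of quadratic forms" mentioned above: here is one standard result of the kind Christensen develops, stated from memory rather than from any particular text, to give a flavor of what the theory looks like.)

```latex
% If x ~ N(0, I_n) and A is symmetric and idempotent of rank r,
% then the quadratic form x'Ax has a chi-squared distribution with r
% degrees of freedom:
\[
  x \sim N(0, I_n), \qquad
  A = A^{\top},\; A^{2} = A,\; \operatorname{rank}(A) = r
  \;\Longrightarrow\;
  x^{\top} A x \sim \chi^{2}_{r}.
\]
```

Results like this are what sit underneath familiar facts such as the distribution of the residual sum of squares in a linear model.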
To do this you should start by working through the first 7 chapters of Rudin (ignore the rest; they're not great). From here you can skip to introductory functional analysis with Kreyszig if you want. This is functional analysis without measure theory, so you're still taking baby steps here. To do proper functional analysis we need a working knowledge of measure theory, so that we have more to work with than sequence spaces. After you've completed the book by Schilling, you can look into the functional analysis texts by Rudin, Conway, or even Stein and Shakarchi.

To do proper probability we need our measure theory. The text by Schilling is a great introduction, and the author provides a full solution manual on his website. It's a very thorough textbook with great proofs. It covers a few bits and pieces of probability, but not enough for our purposes. If you want a supplement here, or maybe even a gentler introduction, try Measure, Integration, and Real Analysis by Axler.

From here we move on to the study of rigorous probability. Start with the gentle introduction by Rosenthal. It's a very short text, but it has lots of great problems to work through. The introductory chapter, which explains why we need measure theory to properly define probability, gives good motivation.

Finally, a full-blown probability textbook. Probability and Measure Theory by Ash and Doléans-Dade was chosen over Probability and Measure by Billingsley because it covers roughly the same material and has many solutions in the back of the text. Both are equally good textbooks, though, so they can be interchanged. If you'd like further references, the texts by Chung, Resnick, Durrett, Athreya, and Pollard are good. If you still can't get enough probability, your next steps would be Convergence of Probability Measures by Billingsley, Real Analysis and Probability by Dudley, and Uniform Central Limit Theorems by Dudley.
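(To give a taste of the motivation Rosenthal's introductory chapter provides — this is my own summary of the standard Vitali argument, not necessarily the exact one the book uses — a "uniform probability" on $[0,1]$ simply cannot be defined on all subsets.)

```latex
% Claim: there is no probability measure P defined on ALL subsets of
% [0,1) that is countably additive and invariant under translation mod 1.
% Sketch: declare x ~ y iff x - y is rational, pick one representative
% from each equivalence class to form a set V, and for each rational
% q in [0,1) let V_q = V + q (mod 1). The V_q are disjoint and
\[
  [0,1) \;=\; \bigsqcup_{q \,\in\, \mathbb{Q} \cap [0,1)} V_q,
  \qquad
  P(V_q) \;=\; P(V) \quad \text{for every } q,
\]
% so countable additivity forces 1 to equal a countably infinite sum of
% identical terms -- impossible whether P(V) = 0 or P(V) > 0.
% Hence P must be restricted to a sigma-algebra of measurable sets,
% which is exactly what measure theory supplies.
```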
Once we have a solid foundation in rigorous probability theory, we can work on even more advanced statistics. The recommended text by Keener is a great book with some solutions provided in the back. I believe it is the text used for Stanford's PhD-level theoretical statistics class. A good supplement here is Mathematical Statistics by Jun Shao (and its accompanying solutions manual), and if you're looking for a more Bayesian viewpoint, I would recommend Theory of Statistics by Schervish. Finally, the most widely used text for asymptotic statistics is the one by van der Vaart. Knowledge of measure theory might not be needed for this text, but it can never hurt. A supplementary text here would be Elements of Large-Sample Theory by Lehmann.

Lastly, there is the topic of GLMs and advanced linear models. The mentioned text for GLMs is good, but it is heavily theoretical. If you'd prefer something more applied, look into Foundations of Linear and Generalized Linear Models by Agresti and An Introduction to Generalized Linear Models by Dobson and Barnett. The classic text Generalized Linear Models by McCullagh and Nelder is recommended as well. If you are just interested in categorical data, then Categorical Data Analysis (not the introduction) by Agresti cannot be beat. Advanced Linear Modeling by Christensen is really a survey text covering many advanced techniques, like penalized estimation and reproducing kernel Hilbert spaces. If you are interested in any of the specific topics covered in it, there are references provided at the end of each chapter.

A few recommended textbooks that did not fit into the flowchart fall under econometrics and time series analysis. For econometrics, Econometrics by Hayashi, Econometric Analysis by Greene, Econometric Analysis of Cross Section and Panel Data by Wooldridge, and Econometric Theory and Methods by Davidson and MacKinnon are great texts.
For time series analysis I would recommend Time Series Analysis by Hamilton (very dense) and Time Series Analysis and Its Applications by Shumway and Stoffer.