SECOND EDITION To really learn data science, you should not only master the tools—data science libraries, frameworks, modules, and toolkits—but also understand the ideas and principles underlying them. Updated for Python 3.6, this second edition of Data Science from Scratch shows you how these tools and algorithms work by implementing them from scratch. If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with the hacking skills you need to get started as a data scientist. Packed with new material on deep learning, statistics, and natural language processing, this updated book shows you how to find the gems in today’s messy glut of data. • Get a crash course in Python • Learn the basics of linear algebra, statistics, and probability— and how and when they’re used in data science • Collect, explore, clean, munge, and manipulate data • Dive into the fundamentals of machine learning • Implement models such as k-nearest neighbors, Naive Bayes, linear and logistic regression, decision trees, neural networks, and clustering “Joel takes you on a journey from being datacurious to getting a thorough understanding of the bread-and-butter algorithms that every data scientist should know.” —Rohit Sivaprasad Engineer, Facebook “I’ve recommended Data Science from Scratch to analysts and engineers wanting to make the jump into machine learning. It’s the best tool for understanding the fundamentals of the discipline.” —Tom Marthaler Engineering Manager, Amazon • Explore recommender systems, natural language processing, network analysis, MapReduce, and databases —William Cox Machine Learning Engineer, Grubhub For Sale in the Indian Subcontinent & Select Countries Only* *Refer Back Cover Data Science from Scratch First Principles with Python Grus Joel Grus is a research engineer at the Allen Institute for Artificial Intelligence. Previously he worked as a software engineer at Google and as a data scientist at several startups. He lives in Seattle, where he regularly attends data science happy hours. He blogs infrequently at joelgrus.com and tweets all day long at @joelgrus. “Translating data science concepts into code is hard. Joel’s book makes it much easier.” Data Science from Scratch Data Science from Scratch a l e E di t i o n nd co ion Se dit E Gr a y s c DATA / DATA SCIENCE For sale in the Indian Subcontinent (India, Pakistan, Bangladesh, Nepal, Sri Lanka, Bhutan, Maldives) and African Continent (excluding Morocco, Algeria, Tunisia, Libya, Egypt, and the Republic of South Africa) only. Illegal for sale outside of these countries. MRP: ` 1,000 .00 ISBN : 978-93-5213-832-6 Twitter: @oreillymedia facebook.com/oreilly SHROFF PUBLISHERS & DISTRIBUTORS PVT. LTD. Second Edition/2019/Paperback/English Joel Grus SECOND EDITION Data Science from Scratch First Principles with Python Joel Grus Beijing Boston Farnham Sebastopol Tokyo SHROFF PUBLISHERS & DISTRIBUTORS PVT. LTD. Mumbai Bangalore Kolkata New Delhi Data Science from Scratch by Joel Grus Copyright © 2019 Joel Grus. All rights reserved. ISBN: 978-1-492-04113-9 Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safari.oreilly.com). For more information, contact our corporate/ institutional sales department: (800) 998-9938 or corporate@oreilly.com. Editor: Michele Cronin Production Editor: Deborah Baker Copyeditor: Rachel Monaghan Proofreader: Rachel Head Indexer: Judy McConville Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Rebecca Demarest Printing History: April 2015: First Edition May 2019: Second Edition Revision History for the Second Edition 2019-04-10: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492041139 for release details. First Indian Reprint: May 2019 ISBN: 978-93-5213-832-6 The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science from Scratch, Second Edition, the cover image of a rock ptarmigan, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. For sale in the Indian Subcontinent (India, Pakistan, Bangladesh, Sri Lanka, Nepal, Bhutan, Maldives) and African Continent (excluding Morocco, Algeria, Tunisia, Libya, Egypt, and the Republic of South Africa) only. Illegal for sale outside of these countries. Authorized reprint of the original work published by O’Reilly Media, Inc. All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, nor exported to any countries other than ones mentioned above without the written permission of the copyright owner. Published by Shroff Publishers & Distributors Pvt. Ltd. B-103, Railway Commercial Complex, Sector 3, Sanpada (E), Navi Mumbai 400705 • TEL: (91 22) 4158 4158 • FAX: (91 22) 4158 4141 E-mail:spdorders@ shroffpublishers.com•Web:www.shroffpublishers.com Printed at Jasmine Art Printers Pvt. Ltd., Navi Mumbai. Table of Contents Preface to the Second Edition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Preface to the First Edition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv 1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 The Ascendance of Data What Is Data Science? Motivating Hypothetical: DataSciencester Finding Key Connectors Data Scientists You May Know Salaries and Experience Paid Accounts Topics of Interest Onward 1 1 2 3 6 8 11 11 13 2. A Crash Course in Python. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 The Zen of Python Getting Python Virtual Environments Whitespace Formatting Modules Functions Strings Exceptions Lists Tuples Dictionaries defaultdict 15 16 16 17 19 20 21 21 22 23 24 25 iii Counters Sets Control Flow Truthiness Sorting List Comprehensions Automated Testing and assert Object-Oriented Programming Iterables and Generators Randomness Regular Expressions Functional Programming zip and Argument Unpacking args and kwargs Type Annotations How to Write Type Annotations Welcome to DataSciencester! For Further Exploration 26 26 27 28 29 30 30 31 33 35 36 36 36 37 38 41 42 42 3. Visualizing Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 matplotlib Bar Charts Line Charts Scatterplots For Further Exploration 43 45 49 50 52 4. Linear Algebra. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Vectors Matrices For Further Exploration 55 59 62 5. Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Describing a Single Set of Data Central Tendencies Dispersion Correlation Simpson’s Paradox Some Other Correlational Caveats Correlation and Causation For Further Exploration iv | Table of Contents 63 65 67 68 71 72 73 73 6. Probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Dependence and Independence Conditional Probability Bayes’s Theorem Random Variables Continuous Distributions The Normal Distribution The Central Limit Theorem For Further Exploration 75 76 78 79 80 81 84 86 7. Hypothesis and Inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Statistical Hypothesis Testing Example: Flipping a Coin p-Values Confidence Intervals p-Hacking Example: Running an A/B Test Bayesian Inference For Further Exploration 87 87 90 92 93 94 95 99 8. Gradient Descent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 The Idea Behind Gradient Descent Estimating the Gradient Using the Gradient Choosing the Right Step Size Using Gradient Descent to Fit Models Minibatch and Stochastic Gradient Descent For Further Exploration 101 102 105 106 106 108 109 9. Getting Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 stdin and stdout Reading Files The Basics of Text Files Delimited Files Scraping the Web HTML and the Parsing Thereof Example: Keeping Tabs on Congress Using APIs JSON and XML Using an Unauthenticated API Finding APIs Example: Using the Twitter APIs 111 113 113 115 116 117 118 121 121 122 123 124 Table of Contents | v Getting Credentials For Further Exploration 124 128 10. Working with Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 Exploring Your Data Exploring One-Dimensional Data Two Dimensions Many Dimensions Using NamedTuples Dataclasses Cleaning and Munging Manipulating Data Rescaling An Aside: tqdm Dimensionality Reduction For Further Exploration 129 129 131 133 135 137 138 140 142 144 146 151 11. Machine Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 Modeling What Is Machine Learning? Overfitting and Underfitting Correctness The Bias-Variance Tradeoff Feature Extraction and Selection For Further Exploration 153 154 155 157 160 161 163 12. k-Nearest Neighbors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 The Model Example: The Iris Dataset The Curse of Dimensionality For Further Exploration 165 167 170 174 13. Naive Bayes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 A Really Dumb Spam Filter A More Sophisticated Spam Filter Implementation Testing Our Model Using Our Model For Further Exploration 175 176 178 180 181 183 14. Simple Linear Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 The Model vi | Table of Contents 185 Using Gradient Descent Maximum Likelihood Estimation For Further Exploration 189 190 190 15. Multiple Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 The Model Further Assumptions of the Least Squares Model Fitting the Model Interpreting the Model Goodness of Fit Digression: The Bootstrap Standard Errors of Regression Coefficients Regularization For Further Exploration 191 192 193 195 196 196 198 200 202 16. Logistic Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 The Problem The Logistic Function Applying the Model Goodness of Fit Support Vector Machines For Further Investigation 203 206 208 209 210 214 17. Decision Trees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 What Is a Decision Tree? Entropy The Entropy of a Partition Creating a Decision Tree Putting It All Together Random Forests For Further Exploration 215 217 219 220 223 225 226 18. Neural Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 Perceptrons Feed-Forward Neural Networks Backpropagation Example: Fizz Buzz For Further Exploration 227 230 233 235 238 19. Deep Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 The Tensor The Layer Abstraction 239 242 Table of Contents | vii The Linear Layer Neural Networks as a Sequence of Layers Loss and Optimization Example: XOR Revisited Other Activation Functions Example: FizzBuzz Revisited Softmaxes and Cross-Entropy Dropout Example: MNIST Saving and Loading Models For Further Exploration 243 246 247 250 251 252 253 256 257 261 262 20. Clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263 The Idea The Model Example: Meetups Choosing k Example: Clustering Colors Bottom-Up Hierarchical Clustering For Further Exploration 263 264 266 268 269 271 277 21. Natural Language Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 Word Clouds n-Gram Language Models Grammars An Aside: Gibbs Sampling Topic Modeling Word Vectors Recurrent Neural Networks Example: Using a Character-Level RNN For Further Exploration 279 281 284 286 288 293 301 304 308 22. Network Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309 Betweenness Centrality Eigenvector Centrality Matrix Multiplication Centrality Directed Graphs and PageRank For Further Exploration 309 314 314 316 318 320 23. Recommender Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321 Manual Curation viii | Table of Contents 322 Recommending What’s Popular User-Based Collaborative Filtering Item-Based Collaborative Filtering Matrix Factorization For Further Exploration 322 323 326 328 333 24. Databases and SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 CREATE TABLE and INSERT UPDATE DELETE SELECT GROUP BY ORDER BY JOIN Subqueries Indexes Query Optimization NoSQL For Further Exploration 335 338 339 340 342 345 346 348 349 349 350 350 25. MapReduce. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 Example: Word Count Why MapReduce? MapReduce More Generally Example: Analyzing Status Updates Example: Matrix Multiplication An Aside: Combiners For Further Exploration 352 353 354 355 357 359 359 26. Data Ethics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361 What Is Data Ethics? No, Really, What Is Data Ethics? Should I Care About Data Ethics? Building Bad Data Products Trading Off Accuracy and Fairness Collaboration Interpretability Recommendations Biased Data Data Protection In Summary For Further Exploration 361 362 362 363 364 365 365 366 367 368 368 369 Table of Contents | ix 27. Go Forth and Do Data Science. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371 IPython Mathematics Not from Scratch NumPy pandas scikit-learn Visualization R Deep Learning Find Data Do Data Science Hacker News Fire Trucks T-Shirts Tweets on a Globe And You? 371 372 372 372 372 373 373 373 374 374 375 375 375 375 376 376 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 x | Table of Contents Preface to the Second Edition I am exceptionally proud of the first edition of Data Science from Scratch. It turned out very much the book I wanted it to be. But several years of developments in data science, of progress in the Python ecosystem, and of personal growth as a developer and educator have changed what I think a first book in data science should look like. In life, there are no do-overs. In writing, however, there are second editions. Accordingly, I’ve rewritten all the code and examples using Python 3.6 (and many of its newly introduced features, like type annotations). I’ve woven into the book an emphasis on writing clean code. I’ve replaced some of the first edition’s toy examples with more realistic ones using “real” datasets. I’ve added new material on topics such as deep learning, statistics, and natural language processing, corresponding to things that today’s data scientists are likely to be working with. (I’ve also removed some material that seems less relevant.) And I’ve gone over the book with a fine-toothed comb, fixing bugs, rewriting explanations that are less clear than they could be, and freshening up some of the jokes. The first edition was a great book, and this edition is even better. Enjoy! Joel Grus Seattle, WA 2019 Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. xi Constant width Used for program listings, as well as within paragraphs to refer to program ele‐ ments such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values deter‐ mined by context. This element signifies a tip or suggestion. This element signifies a general note. This element indicates a warning or caution. Using Code Examples Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/joelgrus/data-science-from-scratch. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a signifi‐ cant amount of example code from this book into your product’s documentation does require permission. xii | Preface to the Second Edition We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Science from Scratch, Second Edition, by Joel Grus (O’Reilly). Copyright 2019 Joel Grus, 978-1-492-04113-9.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com. O’Reilly Online Learning For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help compa‐ nies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, indepth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com. How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/data-science-from-scratch-2e. To comment or ask technical questions about this book, send email to bookques‐ tions@oreilly.com. For more information about our books, courses, conferences, and news, see our web‐ site at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia Preface to the Second Edition | xiii Acknowledgments First, I would like to thank Mike Loukides for accepting my proposal for this book (and for insisting that I pare it down to a reasonable size). It would have been very easy for him to say, “Who’s this person who keeps emailing me sample chapters, and how do I get him to go away?” I’m grateful he didn’t. I’d also like to thank my editors, Michele Cronin and Marie Beaugureau, for guiding me through the publishing pro‐ cess and getting the book in a much better state than I ever would have gotten it on my own. I couldn’t have written this book if I’d never learned data science, and I probably wouldn’t have learned data science if not for the influence of Dave Hsu, Igor Tatari‐ nov, John Rauser, and the rest of the Farecast gang. (So long ago that it wasn’t even called data science at the time!) The good folks at Coursera and DataTau deserve a lot of credit, too. I am also grateful to my beta readers and reviewers. Jay Fundling found a ton of mis‐ takes and pointed out many unclear explanations, and the book is much better (and much more correct) thanks to him. Debashis Ghosh is a hero for sanity-checking all of my statistics. Andrew Musselman suggested toning down the “people who prefer R to Python are moral reprobates” aspect of the book, which I think ended up being pretty good advice. Trey Causey, Ryan Matthew Balfanz, Loris Mularoni, Núria Pujol, Rob Jefferson, Mary Pat Campbell, Zach Geary, Denise Mauldin, Jimmy O’Donnell, and Wendy Grus also provided invaluable feedback. Thanks to everyone who read the first edition and helped make this a better book. Any errors remaining are of course my responsibility. I owe a lot to the Twitter #datascience commmunity, for exposing me to a ton of new concepts, introducing me to a lot of great people, and making me feel like enough of an underachiever that I went out and wrote a book to compensate. Special thanks to Trey Causey (again), for (inadvertently) reminding me to include a chapter on linear algebra, and to Sean J. Taylor, for (inadvertently) pointing out a couple of huge gaps in the “Working with Data” chapter. Above all, I owe immense thanks to Ganga and Madeline. The only thing harder than writing a book is living with someone who’s writing a book, and I couldn’t have pulled it off without their support. xiv | Preface to the Second Edition If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with the hacking skills you need to get started as a data scientist. Packed with new material on deep learning, statistics, and natural language processing, this updated book shows you how to find the gems in today’s messy glut of data. • Get a crash course in Python • Learn the basics of linear algebra, statistics, and probability— and how and when they’re used in data science • Collect, explore, clean, munge, and manipulate data • Dive into the fundamentals of machine learning • Implement models such as k-nearest neighbors, Naive Bayes, linear and logistic regression, decision trees, neural networks, and clustering “Joel takes you on a journey from being datacurious to getting a thorough understanding of the bread-and-butter algorithms that every data scientist should know.” —Rohit Sivaprasad Engineer, Facebook “I’ve recommended Data Science from Scratch to analysts and engineers wanting to make the jump into machine learning. It’s the best tool for understanding the fundamentals of the discipline.” —Tom Marthaler Engineering Manager, Amazon • Explore recommender systems, natural language processing, network analysis, MapReduce, and databases —William Cox Machine Learning Engineer, Grubhub DATA / DATA SCIENCE For sale in the Indian Subcontinent (India, Pakistan, Bangladesh, Nepal, Sri Lanka, Bhutan, Maldives) and African Continent (excluding Morocco, Algeria, Tunisia, Libya, Egypt, and the Republic of South Africa) only. Illegal for sale outside of these countries. MRP: ` 1,000.00 ISBN : 978-93-5213-832-6 Grus Joel Grus is a research engineer at the Allen Institute for Artificial Intelligence. Previously he worked as a software engineer at Google and as a data scientist at several startups. He lives in Seattle, where he regularly attends data science happy hours. He blogs infrequently at joelgrus.com and tweets all day long at @joelgrus. “Translating data science concepts into code is hard. Joel’s book makes it much easier.” Data Science from Scratch Data Science from Scratch To really learn data science, you should not only master the tools—data science libraries, frameworks, modules, and toolkits—but also understand the ideas and principles underlying them. Updated for Python 3.6, this second edition of Data Science from Scratch shows you how these tools and algorithms work by implementing them from scratch. nd co ion Se dit E SECOND EDITION Data Science from Scratch First Principles with Python ay Gr s c a le E di ti o n For Sale in the Indian Subcontinent & Select Countries Only* *Refer Back Cover Twitter: @oreillymedia facebook.com/oreilly SHROFF PUBLISHERS & DISTRIBUTORS PVT. LTD. Second Edition/2019/Paperback/English Joel Grus
0
You can add this document to your study collection(s)
Sign in Available only to authorized usersYou can add this document to your saved list
Sign in Available only to authorized users(For complaints, use another form )