thedataincubator.com

Data Science Fellowship Curriculum

THANK YOU FOR YOUR INTEREST IN THE DATA INCUBATOR!

We are the innovative fellowship program for up-and-coming data professionals. Below you’ll find our Data Science Fellowship Curriculum and more about what sets us apart from other fellowship programs. We offer practical and actionable training that provides immediate impact to candidates by focusing on what works. Based on decades of hands-on experience, our immersive programs set the benchmark for professional data education.

Our alumni are the pillar of our brand: they’ve trusted our programs and elevated their skills with top-tier career training to get hired with stand-out partners. Our alumni have exclusive access to our network of hiring partners, including thousands of companies around the world, from startups to Fortune 500, who apply our models to drive their business and power their data teams. We don’t just do training; we provide proven methodologies, adaptable resources, experienced instructors, a robust hiring program and world-class support.

Our Curriculum

MODULE 1: DATA WRANGLING
Students learn how to acquire and manipulate data in Python with the foundational tools of data science.
Prerequisites: Basic Python

The first step of data science is mastering the computational foundations on which data science is built. We cover the programming topics most relevant to data science, including pandas, NumPy, SciPy, Matplotlib, regular expressions, SQL, JSON, XML, checkpointing, and web scraping, which together form the core toolkit for handling structured and unstructured data in Python. Students gain practical experience manipulating messy, real-world data using these libraries.
They also walk away with a firm understanding of tools like pip, git, Python, Jupyter notebooks, pdb, and unit testing that leverage existing open-source packages to accelerate data exploration, development, debugging, and collaboration.

Associated project work
Students will scrape picture captions off of a website that tracks the goings-on of New York’s socially well-to-do. By extracting names from these captions, they will assemble a graph of friendships amongst this crowd. Analysis of this graph will produce insights about the most connected New Yorkers.

SKILLS AND TOOLS ADDRESSED:
Consuming APIs (and JSON) – Handling URL parameters – Authenticated APIs – API request limitations
Iterators, Generators, and Coroutines – Iterables and Iterators – Generators – Generator “pipelines” – Generator comprehensions – Time complexity – Itertools in Python – Coroutines – Coroutine “pipelines” – Broadcasting – Coroutines as classes – Unifying generators and coroutines
Overview of Scraping and Munging Technologies – Concepts, languages, and tools – Concrete tasks in Python – Python library cheat sheet
How to (Software) Engineer Real Good – Writing functional code – Version control and other tools – Testing – Testing the web in Flask – Linting – Writing “good code” – Self-documenting code – Code review – Time management
Pandas – Pandas series – Pandas DataFrame – Loading data into pandas – Pandas indices and selecting and slicing data – Using pandas for data analysis – Filtering data – String operations and transformations – Merging data sets – Dealing with missing values – Adding and dropping columns – Aggregating by groups – Automating and repeating the analysis – Exporting data frames to CSV or Excel file – Pandas best practices – Conclusion
Scraping – HTTP requests and responses – Understanding URLs – HTML and the DOM – Parsing HTML – CSS selectors – Fetching subsequent pages – Scrapy in Python
Dealing with Strings in Python – The string data structure – Unicode and Byte Strings – Basic string processing – StringIO in Python – Regular expressions
NumPy and SciPy – NumPy – Data types (the nouns) – Operations (the verbs) – Persisting NumPy objects – SciPy
Matplotlib – Matplotlib and Pyplot – Matplotlib plots from Pandas – Seaborn
Functions – Functions as first-class objects – Closures – Variable arguments and keywords – Decorators
Exceptions – Catching general exceptions – Handling success – Doing something with the error – Raising errors – Exceptions and the call stack – Reading traceback
Debugging – NameError – TypeError – AttributeError – KeyError – Reading code critically
Python – Jupyter notebooks and the kernel – Variables – Functions – Logic and program flow – Iteration – Whitespace matters – Putting it all together
Object-oriented programming – Everything is an object – Defining a Python class – Adding attributes and methods – Inheritance – Putting it all together again

MODULE 2: INTRODUCTION TO MACHINE LEARNING
Students learn the basics of machine learning, and building and training different types of models.
Prerequisites: Basic Python, Basic to intermediate statistics, Basic linear algebra

In a world with abundant data, leveraging machines to learn valuable patterns from structured data can be extremely powerful. We explore the basics of machine learning, discussing concepts like regression, classification, model evaluation metrics, overfitting, variance versus bias, linear regression, ensemble methods, model selection, and hyperparameter optimization. Through powerful packages such as Scikit-learn, students come away with a strong understanding of core concepts in machine learning as well as the ability to efficiently train and benchmark accurate predictive models.
They gain hands-on experience building complex ETL pipelines to handle data in a variety of formats, developing models with tools like feature unions and pipelines to reduce duplicate work, and practicing tricks like parallelization to speed up prototyping and development.

Associated project work
Students will develop a series of models to predict a venue’s star rating from various features. Working from 100MB of real-world data, they will start with location-based models before building models based on other attributes of the venues. Finally, an ensemble model will blend the individual models into a final prediction of the venue’s popularity.

SKILLS AND TOOLS ADDRESSED:
Introduction to Machine Learning – Statistics vs machine learning – Data as a matrix – Models as functions – Types of machine learning – Parameters and learning
Regression – Linear regression – Regression metrics – Optimization – Stochastic gradient descent – Adding features – Regularization – Example: California housing data set – Reference: statistical motivation – Scikit-Learn API – Classes vs objects – Estimators – Transformers – Pipelines
Classification – Precision and recall – Other classification metrics – Probabilistic models – Logistic regression – Multiclass classification problems
Bias, Variance, and Overfitting – Decision trees – In-sample error – Out-of-sample error – Variance-bias tradeoff – Cross-validation strategies – Grid search for tuning hyperparameters
Scikit-learn Workflow – Writing custom estimators and transformers – Pipelines – Feature unions – Data types – Validating your implementations
Transformers and Preprocessing – Feature scaling – Encoding categorical variables – Imputation – Dimensionality reduction – Natural language processing – Custom Transformers – Answers to questions
K Nearest Neighbors – Tuning k and other hyperparameters – Normalizing features
Unsupervised Learning – Metrics for clustering –
K-Means clustering – Elongated clusters – Gaussian mixture models – Dimensionality reduction – Random projections – Matrix factorization – Principal Components Analysis (PCA) – Non-negative Matrix Factorization (NMF) – Comparison of PCA and NMF

MODULE 3: SQL AND PRODUCTION TOPICS
Students learn how to access databases using SQL interfaces and topics related to programming in a professional environment.
Prerequisites: Basic Python, Basic SQL

Most data is stored in databases, which have to be accessed through interfaces. The most common interface is SQL, a declarative language that lets us tell the database which data we want and how to present it. We cover the basics of the language itself and some of the Python tools related to it.

Associated project work
Students will assemble a SQL database of four years’ worth of NYC restaurant inspection data. They will write and execute queries against this database to understand the variations in scores and violations across the city and between different types of restaurants.

SKILLS AND TOOLS ADDRESSED:
Advanced SQL – Creating Tables – Database connectors – Temporary tables and views – SQLAlchemy – A note on SQL flavors
SQL - Structured Query Language – SELECT - Getting information from the tables – COUNT, SUM, and DISTINCT - Let SQL do work for you! – WHERE, LIKE, and IN - Filtering the data – ORDER BY - Sorting your outputs – GROUP BY - Aggregating data – HAVING - The “WHERE” clause for grouped data – JOIN - Putting tables together – Creating and using subqueries – CASE - Returning values based on conditional statements

MODULE 4: VISUALIZATIONS
Students learn how to present data visually for both technical and non-technical audiences.
Prerequisites: Basic Python

Data science is about helping humans understand the story behind the data, and visualizations provide a powerful tool for helping the analyst understand and communicate that story. We discuss the biases and limitations of both visual and statistical analysis to promote a more holistic approach.

Associated project work
Students will build an interactive website giving information on NYC’s bus system. They will process historical data and develop plots to illustrate trends. Using a live feed of bus information, they will compare the current state to this historical average. All of the visualization will be deployed as a Flask app running on Heroku.

SKILLS AND TOOLS ADDRESSED:
Explanatory Visualization – Multiple interactive plots – Data transformations: filtering and aggregating – Layout and design – Using Altair with large data sets – Exporting an Altair chart as HTML – Embedding a chart in an HTML document – Plotting geographic data
Exploratory Visualization – Python visualization tools – Describing a distribution – Histograms – Box plots and Violin Plots – Relationships between variables – Non-obvious patterns in the data – Interactivity in visualizations
Overview of Data Visualization – Introduction – Pandas plots – Altair plots
Visualization Theory – Different types of data for visualization purposes – Seven categories of visual cues – Generic algorithm for creating a visualization – Portability & accessibility – Perception and visual response – Attention and memory – Visual storytelling
Layout and Design – Design Elements & Principles – Examples (mostly bad, sometimes good) – Axes (use them!) – Choosing the right mark – Data-Ink ratio – Dealing with multiple scales – Small multiples

MODULE 5: ADVANCED MACHINE LEARNING
Students learn more advanced machine learning topics and techniques, including dealing with unstructured data and time series.
Prerequisites: Intermediate to advanced statistics, Intermediate linear algebra, Basic programming

While machine learning on structured data lays an important foundation, a larger world of analytical opportunities becomes available through understanding advanced machine learning techniques and how to handle unstructured data. We explore techniques such as support vector machines, decision trees, random forests, neural nets, clustering, K-Means, expectation-maximization, time series, and signal processing. Students come away with intuition about the suitability of different techniques for different problems. In addition to handling structured data, students directly apply these techniques to large volumes of real-world unstructured data, solving problems in natural language processing using Word2Vec, bag of words, feature hashing, and topic modeling.

Associated project work
Students will use NLP techniques to extract sentiment from English text. Working with 300MB of venue reviews, they will build a series of models to predict the star rating associated with a given review. They will also examine statistically improbable phrases that appear in the text corpus.

Students will examine methods of dealing with seasonality, as they build models to predict temperatures in several cities. The training data come from National Weather Service observations and must be cleaned before use.
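
As a rough illustration of the seasonality modeling mentioned in the temperature project, the sketch below fits a single annual cycle to a daily temperature series. It is not the course's actual method: the data are synthetic, the function names `fit_annual_cycle` and `predict_temperature` are hypothetical, and the fit assumes the series spans a whole number of years with no gaps, so each coefficient reduces to one term of a discrete Fourier series.

```python
import math


def fit_annual_cycle(temps, period=365):
    """Fit temps[t] ~ mean + a*cos(2*pi*t/period) + b*sin(2*pi*t/period).

    Assumption: len(temps) is a whole number of periods. The constant,
    cosine, and sine terms are then orthogonal over the sample, so each
    coefficient is a simple projection of the data onto that term.
    """
    n = len(temps)
    if n == 0 or n % period != 0:
        raise ValueError("series must cover a whole number of periods")
    mean = sum(temps) / n
    angles = [2 * math.pi * t / period for t in range(n)]
    a = 2 / n * sum(y * math.cos(w) for y, w in zip(temps, angles))
    b = 2 / n * sum(y * math.sin(w) for y, w in zip(temps, angles))
    return mean, a, b


def predict_temperature(t, mean, a, b, period=365):
    """Evaluate the fitted seasonal curve on day index t."""
    w = 2 * math.pi * t / period
    return mean + a * math.cos(w) + b * math.sin(w)


# Synthetic two-year series (made-up numbers, not real weather data):
# coldest on day 0, warmest about half a year later.
temps = [55 - 20 * math.cos(2 * math.pi * t / 365) for t in range(730)]

mean, a, b = fit_annual_cycle(temps)
print(round(mean, 3), round(a, 3), round(b, 3))
print(round(predict_temperature(0, mean, a, b), 3))
```

Real observations would first need the cleaning the project description mentions (for example, filling or dropping missing days), and a general least-squares fit would replace the projection shortcut once the series no longer covers whole years.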
SKILLS AND TOOLS ADDRESSED:
Support Vector Machines (SVM) – Maximal margin classifier – Linear SVM – Non-linear SVM – Multi-class SVM – Approximating kernels – Support vector regression – Outlier detection using SVM
Decision Trees and Random Forests – Decision trees – Ensemble methods – Determining feature importance
Natural Language Processing – Text as a “bag of words” – Word importance: Term frequency–inverse document frequency (TF-IDF) – Document similarity metrics – Engineering your features – Building the classifier – Additional NLP topics and resources
Sentiment Analysis – Bag of words model – Interpreting the model – Grammar and other tools
Time Series – Trends in time series data – Cross-validation for time series – Modeling drift – Modeling seasonality – Modeling “noise” – Using external data sources as features – More advanced time series modeling frameworks
Naive Bayes – Predictive modeling using Naive Bayes – Classifying mushrooms (an example)
Outlier Detection – Motivation – Concepts – Scikit-learn Implementation – One-class SVM – Isolation forest – Case Study: Anomaly Detection in Time Series – Modeling the background – Detecting seasonality with Fourier Transforms – Detrending – z-Score – Moving-window averages – Including windowed data in model – Bayesian change points – Online learning – References
Recommendation Engine – Problem definition and data format – Feature engineering – Nearest neighbors – Tag data – Dimensional reduction – Recommendation for a user – Cooperative learning – Regression of ratings – Baseline model overfitting and cross-validation – Modeling interaction – Surpriselib – References
Unbalanced Classes – Introduction: cancer detection case study – Definition and common scenarios for unbalanced data – Simple techniques to deal with unbalanced data – Undersampling – Oversampling –
Synthetic data augmentation – Additional approaches – Train/Test split with unbalanced data – Probabilities with unbalanced data – The Python imbalanced-learn package
Digital Signals – Sampling – Noise & filters – Audio files – Filters – Frequency domain
Choosing the Correct Machine Learning Algorithm – Few features – Many features – Few observations – Many observations – Underfitting – Overfitting – Explicability – Prediction speed – Parallelization – Online learning – Feature scaling – Outlier detection/novelty detection – Comparing ML algorithms

MODULE 6: THINKING OUTSIDE THE DATA
Students learn how data science interacts with business and business concerns.
Prerequisites: Intermediate to advanced statistics, Basic to intermediate programming

Sometimes the most important question to ask in data science comes from thinking beyond the data itself. We explore a myriad of topics that affect data science decision making as a whole, and affect the implementation of data-driven business policies. Important topics include data fidelity, relevance, and the value of additional data. Bias is a major theme, and students think about how their conclusions are influenced by data collection, external factors, internal structuring, procedural artifacts, and more. Students gain a broader understanding of how to balance tradeoffs to suit the business problem, such as when to favor accuracy over interpretability and vice versa. We also discuss more practical engineering considerations like building for prediction speed or robustness, and deploying to different environments. Students apply this knowledge to case studies that simulate what they would be expected to contribute as part of a real-world team faced with a business problem.
SKILLS AND TOOLS ADDRESSED:
Hypothesis Testing – False positives versus false negatives – Z-score – CDF and the uniform distribution – T-test – Standard error for a rate – Standard error for a counting process – Power calculations – Mnemonic summary – A/B testing – Causality versus correlation – Distributional tests – Multiple tests – How trustworthy are your data?
Personal Interview Questions
Algorithms and Data Structures – Sorting – Searching – Dynamic programming – Graph theory
What the data really says – Fallacies – Data fidelity – Data relevance – Modeling tradeoffs – Protecting privacy
Statistics – Linearity of expectation – Bayes Theorem – Combinatorics – Continuous probability – Hypothesis testing – Memoryless processes
Data Management – Differing needs – Data warehouses – Data lakes – Self-service
Managing data science projects – Strategies in software development – Minimum viable products – Agile and data science
Metrics and Levers – Metrics in business: KPIs – Improving KPIs - Pulling Levers – Translating metrics – Real world considerations – Systematic bias
Case Studies – The prompt – The process – The product – Know your audience – Exercises
Data Science Case Studies – What is a case study? – How to Ace a Case Study – Analysis – General Advice – Practice Case Studies

MODULE 7: DISTRIBUTED COMPUTING WITH SPARK
Students learn how to distribute computations across multiple computers, such as a cluster or the cloud, using PySpark.
Prerequisites: Basic Python, Basic to intermediate programming

Spark is a technology at the forefront of distributed computing that offers a more abstract but more powerful API. This module is taught using the Python API. We cover core concepts of Spark like resilient distributed data sets, memory caching, actions, transformations, tuning, and optimization. Students get to build functioning applications from end to end.
They apply that knowledge to directly developing, building, and deploying Spark jobs to run on large, real-world data sets in the cloud (AWS and Google Cloud Platform).

Associated project work
Students will use Spark to parse and process 10GB of data on posts and users at a popular Q&A website. They will extract insights on the posting habits of users and develop predictors of users’ behavior from their posts. Spark’s machine-learning capabilities will be used to discover meaning in unstructured text data.

SKILLS AND TOOLS ADDRESSED:
Introduction to Distributed Computing – Big data – Distributed computing – MapReduce: A simple distributed-computing framework – Word Count: The “Hello World” of distributed computing – Word Count in Spark – Other Spark Features
Introduction to Functional Programming Style – Stateful vs. stateless code – Decorators – Map, filter, and reduce – Anonymous functions
Tweet mini case study – Spark SQL and DataFrames - a convenient abstraction – Caching and persistence - the key to Spark’s speed
Streaming Technologies – Apache Kafka – Apache Storm – Spark streaming – Building a Spark streaming application – Keeping track of state – Windowed state – Streaming tweets demo
PySpark Intro – The Spark API – Word count example – ETL example – Computing statistics – Translating from SQL – Joins in Spark
Creating Spark Applications – REPLs – Building Spark applications – Spark on Amazon Web Services – Spark on Google Cloud Platform
PySpark ML – Algorithms – ML vs.
MLlib packages – Spark ML – Pipeline – Cross-validation and grid search – Feature processing
PySpark DataFrames – Motivation and Spark SQL – Exploring the Catalyst Optimizer – SQL and DataFrames – Adding columns and functions – Type safety and DataSets – DataFrame Optimization – Joins
Advanced Topics in Spark – Key terminology – Relation to Hadoop and MapReduce – Understanding the shuffle – Data partitioning – Shared variables – Best practices and optimization – Resource tuning – Spark UI

MODULE 8: DEEP LEARNING IN TENSORFLOW
Students learn to build and train neural networks using TensorFlow. Both theoretical understanding and practical applications and concerns are addressed.
Prerequisites: Basic Python

TensorFlow is taking the world of deep learning by storm. We demonstrate its capabilities through its Python and Keras interfaces and build some simple machine learning models. We give a brief overview of the theory of neural networks, including convolutional and recurrent layers. Students will practice building and testing these networks in TensorFlow and Keras, using real-world data. They will come away with both a theoretical and a practical understanding of the algorithms behind deep learning.

Associated project work
Students will build a series of models to classify images from the CIFAR-10 data set. These models will include basic image analysis, convolutional neural networks, and transfer-learned deep neural networks.
SKILLS AND TOOLS ADDRESSED:
Introduction to TensorFlow – Linear models – Error metrics – Gradient descent – Gradient Descent in TensorFlow – Tensors and operations – Automatic differentiation and tf.GradientTape – Built-in optimization – TensorFlow API overview
Optimization with the Computation Graph – Computation graph – Using accelerators (GPUs/TPUs)
Basic Neural Networks – The XOR problem – Logistic regression – Neural networks and hidden layers – Activation functions – Initial weights
Adversarial Noise – Fooling neural networks – Attacking networks – How do you find adversarial noise? – Putting it all together – Exercise: extending immunity
Convolutional Neural Networks – Convolutions – Convolutional neural networks – Pre-trained CNNs (applications)
The Inception Model and the Deep Dream Algorithm – Inception model – The deep dream algorithm
Deep Neural Networks – What is deep learning? – Keras API – TensorBoard
Variational Autoencoders – Autoencoders – Building an Autoencoder – Adam optimizer – Application: noise removal – Generating new images – Variational Autoencoders (VAEs) – KL-Divergence – Exercise: new numbers – Exercise: different compression
Optimization – Stochastic Gradient Descent – Exploring the loss surface and learning curves – Overfitting – Regularization – Dropout – Batch normalization
Recurrent Neural Networks – Backpropagation through time – Applications – Example: name classification – Exercise: introduce an embedding layer – Long short-term memory – Example: generating strata abstracts

Our Instructors

ROBERT SCHROLL
Robert studied squishy physics in Chicago, Amherst, and Santiago, Chile, before uniting his love of computers, teaching, and making pretty graphs at The Data Incubator. In his free time, he plays tuba and right field, usually not simultaneously.

DON FOX
Born and raised in deep South Texas, Don studied chemical engineering at MIT and Cornell, where he researched renewable energy systems. Don was attracted to data science because it is an interdisciplinary field that combines math, statistics, and computer science to derive insights about processes using data. He enjoys puns, wearing ties, cardigans, and everything fall. He is a Data Scientist in Residence.

ANA HOCEVAR
Ana obtained her PhD in Physics before becoming a postdoctoral fellow at the Rockefeller University, where she worked on developing and implementing an underwater touchscreen for dolphins. Now she combines her love for coding and teaching as a Data Scientist in Residence. She spends her free time doing pottery, sometimes climbing, and every now and then scuba diving.

RUSSELL MARTIN
Russ was born in TN, grew up in NY, and got his PhD in Applied Mathematics from Georgia Tech. After that he lived and worked for seventeen years in the United Kingdom, including at Warwick University and the University of Liverpool. In his spare time, Russ reads all sorts of science-y things he probably doesn’t really understand and plays board games.

RICHARD OTT
Rich moved from particle physics to data science when he left academia, and is excited to be joining his interests in data and programming with his love of teaching. In his spare time, he’s a fan of science, speculative fiction, board games, and hiking.