Fellowship Curriculum Outline 22.03

thedataincubator.com
Data Science
Fellowship Curriculum
DATA SCIENCE CURRICULUM FELLOWSHIP
THANK YOU FOR YOUR INTEREST IN THE
DATA INCUBATOR! We are the innovative
fellowship program for up-and-coming
data professionals. Below you’ll find our
Data Science Fellowship Curriculum and
more about what sets us apart from other fellowship
programs.
We offer practical and actionable training that provides
immediate impact to candidates by focusing on what
works. Based on decades of hands-on experience, our
immersive programs set the benchmark for professional
data education.
Our alumni are the pillar of our brand—they’ve trusted
our programs and elevated their skills with top-tier career
training to get hired with stand-out partners. Our alumni
have exclusive access to our network of hiring partners,
including thousands of companies around the world,
from startups to Fortune 500, who apply our models to
drive their business and power their data teams.
We don’t just do training—we provide proven
methodologies, adaptable resources, experienced
instructors, a robust hiring program and world-class
support.
INTRODUCTION
Our Curriculum
MODULE 1: DATA WRANGLING
Students learn how to acquire and manipulate data in Python with the foundational tools of data science.
Prerequisites: Basic Python
The first step of data science is mastering the computational foundations on which data science is built. We cover the
fundamental topics of programming relevant for data science - including pandas, NumPy, SciPy, Matplotlib, regular
expressions, SQL, JSON, XML, checkpointing, and web scraping - that form the core libraries around handling structured
and unstructured data in Python. Students gain practical experience manipulating messy, real-world data using these
libraries. They also walk away with a firm understanding of tools like pip, git, Python, Jupyter notebooks, pdb, and unit
testing that leverage existing open-source packages to accelerate data exploration, development, debugging, and
collaboration.
Associated Project Work
Students will scrape picture captions off of a website that tracks the goings-on of New York’s socially well-to-do. By
extracting names from these captions, they will assemble a graph of friendships amongst this crowd. Analysis of this
graph will produce insights about the most connected New Yorkers.
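As a rough illustration of the idea behind this project, the sketch below builds a small co-occurrence graph from a few captions. The captions, the `extract_names` helper, and its splitting heuristic are all hypothetical; real captions are messier and need more careful parsing.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical captions; the actual project scrapes them from a website.
captions = [
    "Alice Smith, Bob Jones, and Carol White",
    "Bob Jones and Carol White",
    "Alice Smith with Dan Brown",
]

def extract_names(caption):
    """Split a caption on commas, 'and', and 'with' (a crude heuristic)."""
    parts = re.split(r",|\band\b|\bwith\b", caption)
    return [p.strip() for p in parts if p.strip()]

# Edge weights: the number of captions in which each pair appears together.
edges = Counter()
for caption in captions:
    names = sorted(set(extract_names(caption)))
    edges.update(combinations(names, 2))

# Degree: the number of distinct people each person appears with.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

most_connected = degree.most_common(1)[0][0]
```

Sorting the names before pairing them keeps each edge in a single canonical order, so the same pair is never counted under two different keys.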
SKILLS AND TOOLS ADDRESSED:
■ Consuming APIs (and JSON)
– Handling URL parameters
– Authenticated APIs
– API Request Limitations
■ Iterators, Generators, and Coroutines
– Iterables and Iterators
– Generators
– Generator “pipelines”
– Generator comprehensions
– Time complexity
– Itertools in Python
– Coroutines
– Coroutine “pipelines”
– Broadcasting
– Coroutines as classes
– Unifying generators and coroutines
■ Overview of Scraping and Munging Technologies
– Concepts, languages, and tools
– Concrete tasks in Python
– Python library cheat sheet
■ How to (Software) Engineer Real Good
– Writing functional code
– Version control and other tools
– Testing
– Testing the web in Flask
– Linting
– Writing “good code”
– Self-documenting code
– Code review
– Time management
■ Pandas
– Pandas series
– Pandas DataFrame
– Loading data into pandas
– Pandas indices and selecting and slicing data
– Using pandas for data analysis
– Filtering data
– String operations and transformations
– Merging data sets
– Dealing with missing values
– Adding and dropping columns
– Aggregating by groups
– Automating and repeating the analysis
– Exporting DataFrames to CSV or Excel files
– Pandas best practices
– Conclusion
■ Scraping
– HTTP requests and responses
– Understanding URLs
– HTML and the DOM
– Parsing HTML
– CSS selectors
– Fetching subsequent pages
– Scrapy in Python
■ Dealing with Strings in Python
– The string data structure
– Unicode and Byte Strings
– Basic string processing
– StringIO in Python
– Regular expressions
■ NumPy and SciPy
– NumPy
– Data types (the nouns)
– Operations (the verbs)
– Persisting NumPy objects
– SciPy
■ Matplotlib
– Matplotlib and Pyplot
– Matplotlib plots from Pandas
– Seaborn
■ Functions
– Functions as first-class objects
– Closures
– Variable arguments and keywords
– Decorators
■ Exceptions
– Catching general exceptions
– Handling success
– Doing something with the error
– Raising errors
– Exceptions and the call stack
– Reading tracebacks
■ Debugging
– NameError
– TypeError
– AttributeError
– KeyError
– Reading code critically
■ Python
– Jupyter notebooks and the kernel
– Variables
– Functions
– Logic and program flow
– Iteration
– Whitespace matters
– Putting it all together
■ Object-oriented programming
– Everything is an object
– Defining a Python class
– Adding attributes and methods
– Inheritance
– Putting it all together again
MODULE 2: INTRODUCTION TO MACHINE LEARNING
Students learn the basics of machine learning, including how to build and train different types of models.
Prerequisites: Basic Python, Basic to intermediate statistics, Basic linear algebra
In a world with abundant data, leveraging machines to learn valuable patterns from structured data can be extremely
powerful. We explore the basics of machine learning, discussing concepts like regression, classification, model
evaluation metrics, overfitting, variance versus bias, linear regression, ensemble methods, model selection, and
hyperparameter optimization. Through powerful packages such as Scikit-learn, students come away with a strong
understanding of core concepts in machine learning as well as the ability to efficiently train and benchmark accurate
predictive models. They gain hands-on experience building complex ETL pipelines to handle data in a variety of formats,
developing models with tools like feature unions and pipelines to reduce duplicate work, and practicing tricks like
parallelization to speed up prototyping and development.
Associated project work
Students will develop a series of models to predict a venue’s star rating from various features. Working from 100MB
of real-world data, they will start with location-based models before building models based on other attributes of the
venues. Finally, an ensemble model will blend the individual models into a final prediction of the venue’s popularity.
SKILLS AND TOOLS ADDRESSED:
■ Introduction to Machine Learning
– Statistics vs machine learning
– Data as a matrix
– Models as functions
– Types of machine learning
– Parameters and learning
■ Regression
– Linear regression
– Regression metrics
– Optimization
– Stochastic gradient descent
– Adding features
– Regularization
– Example: California housing data set
– Reference: statistical motivation
– Scikit-Learn API
– Classes vs objects
– Estimators
– Transformers
– Pipelines
■ Classification
– Precision and recall
– Other classification metrics
– Probabilistic models
– Logistic regression
– Multiclass classification problems
■ Bias, Variance, and Overfitting
– Decision trees
– In-sample error
– Out-of-sample error
– Variance-bias tradeoff
– Cross-validation strategies
– Grid search for tuning hyperparameters
■ Scikit-learn Workflow
– Writing custom estimators and transformers
– Pipelines
– Feature unions
– Data types
– Validating your implementations
■ Transformers and Preprocessing
– Feature scaling
– Encoding categorical variables
– Imputation
– Dimensionality reduction
– Natural language processing
– Custom Transformers
– Answers to questions
■ K Nearest Neighbors
– Tuning k and other hyperparameters
– Normalizing features
■ Unsupervised Learning
– Metrics for clustering
– K-Means clustering
– Elongated clusters
– Gaussian mixture models
– Dimensionality reduction
– Random projections
– Matrix factorization
– Principal Components Analysis (PCA)
– Non-negative Matrix Factorization (NMF)
– Comparison of PCA and NMF
MODULE 3: SQL AND PRODUCTION TOPICS
Students learn how to access databases through SQL interfaces, along with topics related to programming in a
professional environment.
Prerequisites: Basic Python, Basic SQL
Most data is stored in databases, which have to be accessed through interfaces. The most common interface is SQL, a
declarative language that lets us tell the database which data we want and how to present it. We cover the basics of the
language itself and some of the Python tools related to it.
Associated project work
Students will assemble a SQL database of four years’ worth of NYC restaurant inspection data. They will write and execute
queries against this database to understand the variations in scores and violations across the city and between
different types of restaurants.
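To make "declarative" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `inspections` table, its columns, and its rows are invented for illustration and are not the actual project data. The query states which result is wanted (average score per borough, for boroughs with at least two inspections) and leaves the "how" to the database.

```python
import sqlite3

# Made-up rows for illustration only; not real inspection data.
rows = [
    ("Manhattan", "Pizza", 12),
    ("Manhattan", "Bakery", 20),
    ("Brooklyn", "Pizza", 9),
    ("Brooklyn", "Pizza", 15),
    ("Queens", "Bakery", 30),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE inspections (borough TEXT, cuisine TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO inspections VALUES (?, ?, ?)", rows)

# Describe the desired result; the database decides how to compute it.
query = """
    SELECT borough, COUNT(*) AS n, AVG(score) AS mean_score
    FROM inspections
    GROUP BY borough
    HAVING COUNT(*) >= 2
    ORDER BY mean_score DESC
"""
result = conn.execute(query).fetchall()
conn.close()
```

Here `HAVING` filters the grouped rows, which is why the borough with a single inspection does not appear in `result`.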
SKILLS AND TOOLS ADDRESSED:
■ Advanced SQL
– Creating Tables
– Database connectors
– Temporary tables and views
– SQLAlchemy
– A note on SQL flavors
■ SQL - Structured Query Language
– SELECT - Getting information from the tables
– COUNT, SUM, and DISTINCT - Let SQL do work for you!
– WHERE, LIKE, and IN - Filtering the data
– ORDER BY - Sorting your outputs
– GROUP BY - Aggregating data
– HAVING - The “WHERE” clause for grouped data
– JOIN - Putting tables together
– Creating and using subqueries
– CASE - Returning values based on conditional statements
MODULE 4: VISUALIZATIONS
Students learn how to present data visually for both technical and non-technical audiences.
Prerequisites: Basic Python
Data science is about helping humans understand the story behind the data, and visualizations provide a powerful tool
for helping the analyst understand and communicate that story. We discuss the biases and limitations of both visual
and statistical analysis to promote a more holistic approach.
Associated project work
Students will build an interactive website that gives information on NYC’s bus system. They will process historical data
and develop plots to illustrate trends. Using a live feed of bus information, they will compare the current state to the
historical average. All of the visualizations will be deployed as a Flask app running on Heroku.
SKILLS AND TOOLS ADDRESSED:
■ Explanatory Visualization
– Multiple interactive plots
– Data transformations: filtering and aggregating
– Layout and design
– Using Altair with large data sets
– Exporting an Altair chart as HTML
– Embedding a chart in an HTML document
– Plotting geographic data
■ Exploratory Visualization
– Python visualization tools
– Describing a distribution
– Histograms
– Box plots and Violin Plots
– Relationships between variables
– Non-obvious patterns in the data
– Interactivity in visualizations
■ Overview of Data Visualization
– Introduction
– Pandas plots
– Altair plots
■ Visualization Theory
– Different types of data for visualization purposes
– Seven categories of visual cues
– Generic algorithm for creating a visualization
– Portability & accessibility
– Perception and visual response
– Attention and memory
– Visual storytelling
■ Layout and Design
– Design Elements & Principles
– Examples (mostly bad, sometimes good)
– Axes (use them!)
– Choosing the right mark
– Data-Ink ratio
– Dealing with multiple scales
– Small multiples
MODULE 5: ADVANCED MACHINE LEARNING
Students learn more advanced machine learning topics and techniques, including dealing with
unstructured data and time series.
Prerequisites: Intermediate to advanced statistics, Intermediate linear algebra, Basic programming
While machine learning on structured data lays an important foundation, a larger world of analytical opportunities
becomes available through understanding advanced machine learning techniques and how to handle unstructured
data. We explore techniques such as support vector machines, decision trees, random forests, neural nets, clustering,
K-Means, expectation-maximization, time series, and signal processing. Students come away with intuition about the
suitability of different techniques for different problems. In addition to handling structured data, students directly apply
these techniques to large volumes of real-world unstructured data, solving problems in natural language processing
using Word2Vec, bag of words, feature hashing, and topic modeling.
Associated project work
Students will use NLP techniques to extract sentiment from English text. Working with 300MB of venue reviews, they
will build a series of models to predict the star rating associated with a given review. They will also examine statistically
improbable phrases that appear in the text corpus.
Students will examine methods of dealing with seasonality, as they build models to predict temperatures in several
cities. The training data come from National Weather Service observations and must be cleaned before use.
SKILLS AND TOOLS ADDRESSED:
■ Support Vector Machines (SVM)
– Maximal margin classifier
– Linear SVM
– Non-linear SVM
– Multi-class SVM
– Approximating kernels
– Support vector regression
– Outlier detection using SVM
■ Decision Trees and Random Forests
– Decision trees
– Ensemble methods
– Determining feature importance
■ Natural Language Processing
– Text as a “bag of words”
– Word importance: Term frequency–inverse document frequency (TF-IDF)
– Document similarity metrics
– Engineering your features
– Building the classifier
– Additional NLP topics and resources
■ Sentiment Analysis
– Bag of words model
– Interpreting the model
– Grammar and other tools
■ Time Series
– Trends in time series data
– Cross-validation for time series
– Modeling drift
– Modeling seasonality
– Modeling “noise”
– Using external data sources as features
– More advanced time series modeling frameworks
■ Naive Bayes
– Predictive modeling using Naive Bayes
– Classifying mushrooms (an example)
■ Outlier Detection
– Motivation
– Concepts
– Scikit-learn Implementation
– One-class SVM
– Isolation forest
– Case Study: Anomaly Detection in Time Series
– Modeling the background
– Detecting seasonality with Fourier Transforms
– Detrending
– z-Score
– Moving-window averages
– Including windowed data in model
– Bayesian change points
– Online learning
– References
■ Recommendation Engine
– Problem definition and data format
– Feature engineering
– Nearest neighbors
– Tag data
– Dimensionality reduction
– Recommendation for a user
– Cooperative learning
– Regression of ratings
– Baseline model overfitting and cross-validation
– Modeling interaction
– Surpriselib
– References
■ Unbalanced Classes
– Introduction: cancer detection case study
– Definition and common scenarios for unbalanced data
– Simple techniques to deal with unbalanced data
– Undersampling
– Oversampling
– Synthetic data augmentation
– Additional approaches
– Train/Test split with unbalanced data
– Probabilities with unbalanced data
– The Python imbalanced-learn package
■ Digital Signals
– Sampling
– Noise & filters
– Audio files
– Filters
– Frequency domain
■ Choosing the Correct Machine Learning Algorithm
– Few features
– Many features
– Few observations
– Many observations
– Underfitting
– Overfitting
– Explicability
– Prediction speed
– Parallelization
– Online learning
– Feature scaling
– Outlier detection/novelty detection
– Comparing ML algorithms
MODULE 6: THINKING OUTSIDE THE DATA
Students learn how data science interacts with business and business concerns.
Prerequisites: Intermediate to advanced statistics, Basic to intermediate programming
Sometimes the most important question to ask in data science comes from thinking beyond the data itself. We
explore a myriad of topics that affect data science decision making as a whole, and affect the implementation of
data-driven business policies. Important topics include data fidelity, relevance, and the value of additional data. Bias
is a major theme, and students think about how their conclusions are influenced by data collection, external factors,
internal structuring, procedural artifacts, and more. Students gain a broader understanding of how to balance tradeoffs to suit the business problem, such as when to favor accuracy over interpretability and vice versa. We also discuss
more practical engineering considerations like building for prediction speed or robustness, and deploying to different
environments. Students apply this knowledge to case studies that simulate what they would be expected to contribute
as part of a real-world team faced with a business problem.
SKILLS AND TOOLS ADDRESSED:
■ Hypothesis Testing
– False positives versus false negatives
– Z-score
– CDF and the uniform distribution
– T-test
– Standard error for a rate
– Standard error for a counting process
– Power calculations
– Mnemonic summary
– A/B testing
– Causality versus correlation
– Distributional tests
– Multiple tests
– How trustworthy are your data?
■ Personal Interview Questions
■ Algorithms and Data Structures
– Sorting
– Searching
– Dynamic programming
– Graph theory
■ What the data really says
– Fallacies
– Data fidelity
– Data relevance
– Modeling tradeoffs
– Protecting privacy
■ Statistics
– Linearity of expectation
– Bayes’ Theorem
– Combinatorics
– Continuous probability
– Hypothesis testing
– Memoryless processes
■ Data Management
– Differing needs
– Data warehouses
– Data lakes
– Self-service
■ Managing data science projects
– Strategies in software development
– Minimum viable products
– Agile and data science
■ Metrics and Levers
– Metrics in business: KPIs
– Improving KPIs - Pulling Levers
– Translating metrics
– Real world considerations
– Systematic bias
■ Case Studies
– The prompt
– The process
– The product
– Know your audience
– Exercises
■ Data Science Case Studies
– What is a case study?
– How to Ace a Case Study
– Analysis
– General Advice
– Practice Case Studies
MODULE 7: DISTRIBUTED COMPUTING WITH SPARK
Students learn how to distribute computations across multiple computers, such as a cluster or the cloud,
using PySpark.
Prerequisites: Basic Python, Basic to intermediate programming
Spark is a technology at the forefront of distributed computing that offers a more abstract but more powerful API than
lower-level frameworks such as MapReduce. This module is taught using the Python API. We cover core concepts of Spark
like resilient distributed datasets (RDDs), memory
caching, actions, transformations, tuning, and optimization. Students get to build functioning applications from end to
end. They apply that knowledge to directly developing, building, and deploying Spark jobs to run on large, real-world data
sets in the cloud (AWS and Google Cloud Platform).
Associated project work
Students will use Spark to parse and process 10GB of data on posts and users at a popular Q&A website. They will
extract insights on the posting habits of users and develop predictors of users’ behavior from their posts. Spark’s
machine-learning capabilities will be used to discover meaning in unstructured text data.
SKILLS AND TOOLS ADDRESSED:
■ Introduction to Distributed Computing
– Big data
– Distributed computing
– MapReduce: A simple distributed-computing framework
– Word Count: The “Hello World” of distributed computing
– Word Count in Spark
– Other Spark Features
■ Introduction to Functional Programming Style
– Stateful vs. stateless code
– Decorators
– Map, filter, and reduce
– Anonymous functions
■ Tweet mini case study
– Spark SQL and DataFrames - a convenient abstraction
– Caching and persistence - the key to Spark’s speed
■ Streaming Technologies
– Apache Kafka
– Apache Storm
– Spark streaming
– Building a Spark streaming application
– Keeping track of state
– Windowed state
– Streaming tweets demo
■ PySpark Intro
– The Spark API
– Word count example
– ETL example
– Computing statistics
– Translating from SQL
– Joins in Spark
■ Creating Spark Applications
– REPLs
– Building Spark applications
– Spark on Amazon Web Services
– Spark on Google Cloud Platform
■ PySpark ML
– Algorithms
– ML vs. MLlib packages
– Spark ML
– Pipeline
– Cross-validation and grid search
– Feature processing
■ PySpark DataFrames
– Motivation and Spark SQL
– Exploring the Catalyst Optimizer
– SQL and DataFrames
– Adding columns and functions
– Type safety and DataSets
– DataFrame Optimization
– Joins
■ Advanced Topics in Spark
– Key terminology
– Relation to Hadoop and MapReduce
– Understanding the shuffle
– Data partitioning
– Shared variables
– Best practices and optimization
– Resource tuning
– Spark UI
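For readers new to the "Word Count" example named in the list above, the following is a single-machine sketch of the map, shuffle, and reduce steps in plain Python. It does not use Spark, and the input sentences are made up; in a distributed framework each step would run across many machines rather than over an in-memory list.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input; a real job would read many files from distributed storage.
documents = ["to be or not to be", "to do is to be"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key (the word).
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: combine each word's values into a single count.
counts = {
    word: reduce(lambda a, b: a + b, ones)
    for word, ones in groups.items()
}
```

The map and reduce steps are stateless functions, which is what lets a framework run them in parallel on separate pieces of the data.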
MODULE 8: DEEP LEARNING IN TENSORFLOW
Students learn to build and train neural networks using TensorFlow. Both theoretical understanding and
practical applications and concerns are addressed.
Prerequisites: Basic Python
TensorFlow is taking the world of deep learning by storm. We demonstrate its capabilities through its Python and Keras
interfaces and build some simple machine learning models. We give a brief overview of the theory of neural networks,
including convolutional and recurrent layers. Students will practice building and testing these networks in TensorFlow
and Keras, using real-world data. They will come away with both theoretical and practical understanding of the
algorithms behind deep learning.
Associated project work
Students will build a series of models to classify images from the CIFAR-10 data set. These models will include basic
image analysis, convolutional neural networks, and transfer-learned deep neural networks.
SKILLS AND TOOLS ADDRESSED:
■ Introduction to TensorFlow
– Linear models
– Error metrics
– Gradient descent
– Gradient Descent in TensorFlow
– Tensors and operations
– Automatic differentiation and tf.GradientTape
– Built-in optimization
– TensorFlow API overview
■ Optimization with the Computation Graph
– Computation graph
– Using accelerators (GPUs/TPUs)
■ Basic Neural Networks
– The XOR problem
– Logistic regression
– Neural networks and hidden layers
– Activation functions
– Initial weights
■ Adversarial Noise
– Fooling neural networks
– Attacking networks
– How do you find adversarial noise?
– Putting it all together
– Exercise: extending immunity
■ Convolutional Neural Networks
– Convolutions
– Convolutional neural networks
– Pre-trained CNNs (applications)
■ The Inception Model and the Deep Dream Algorithm
– Inception model
– The deep dream algorithm
■ Deep Neural Networks
– What is deep learning?
– Keras API
– TensorBoard
■ Variational Autoencoders
– Autoencoders
– Building an Autoencoder
– Adam optimizer
– Application: noise removal
– Generating new images
– Variational Autoencoders (VAEs)
– KL-Divergence
– Exercise: new numbers
– Exercise: different compression
■ Optimization
– Stochastic Gradient Descent
– Exploring the loss surface and learning curves
– Overfitting
– Regularization
– Dropout
– Batch normalization
■ Recurrent Neural Networks
– Backpropagation through time
– Applications
– Example: name classification
– Exercise: introduce an embedding layer
– Long short-term memory (LSTM)
– Example: generating strata abstracts
Our Instructors
ROBERT SCHROLL
Robert studied squishy physics in Chicago, Amherst, and Santiago, Chile, before uniting his
love of computers, teaching, and making pretty graphs at The Data Incubator. In his free
time, he plays tuba and right field, usually not simultaneously.
DON FOX
Born and raised in deep South Texas, Don studied chemical engineering at MIT and
Cornell where he researched renewable energy systems. Don was attracted to data
science because it is an interdisciplinary field that combines math, statistics, and
computer science to derive insights about processes using data. He enjoys puns, wearing ties,
cardigans, and everything fall. He is a Data Scientist in Residence.
ANA HOCEVAR
Ana obtained her PhD in Physics before becoming a postdoctoral fellow at the Rockefeller
University where she worked on developing and implementing an underwater touchscreen
for dolphins. Now she combines her love for coding and teaching as a Data Scientist in
Residence. She spends her free time doing pottery, sometimes climbing, and every now
and then scuba diving.
RUSSELL MARTIN
Russ was born in TN, grew up in NY, and got his PhD in Applied Mathematics from Georgia
Tech. After that he lived and worked for seventeen years in the United Kingdom, including at
Warwick University and the University of Liverpool. In his spare time, Russ reads all sorts
of science-y things he probably doesn’t really understand and plays board games.
RICHARD OTT
Rich moved from particle physics to data science when he left academia, and is excited
to be joining his interests in data and programming with his love of teaching. In his spare
time, he’s a fan of science, speculative fiction, board games, and hiking.