Data Science & Deep Learning Cheat Sheet

L2, Fundamentals:
Data science pipeline: Frame problem, collect data, preprocess, explore,
model, deploy
Quantization: transforms a continuous set of values (e.g., real numbers) into a
discrete set (e.g., categories).
Scaling: transforms variables to a common range or distribution (e.g., min-max
scaling or standardization), which puts variables on the same scale and makes
the data work better with many models.
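Sketch of two common scaling choices (the function names are made up for this sheet; min-max and standardization are only two of several options):

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


scaled = min_max_scale([1, 2, 3])
z_scores = standardize([1, 2, 3])
```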
MCAR (Missing Completely At Random): the missingness of data is unrelated to
any observed or unobserved variables in the dataset. Essentially, missing data
occurs randomly, without any systematic pattern.
MAR (Missing At Random): the missingness of data is related to other observed
variables in the dataset but not to the missing values themselves. This means
that once other variables are taken into account, the missingness is random.
MNAR (Missing Not At Random): the missingness of data is related to the
variable that is missing. This suggests that there is a systematic pattern to
the missing data, which is dependent on the unobserved values of the variable
itself or other unobserved factors. (Imputation from the observed data alone is
not reliable.)
Ways to model data:
- image classification
  1. optical character recognition, e.g., hand-written digits
  2. fine-grained categorization, e.g., categorizing types of birds
- text classification
  1. sentiment analysis
  2. annotating paragraphs (for scientific papers)
Cross-validation: partitioning the available data into subsets, training the
model on a portion of the data, and evaluating it on the remaining data.
Cross-validation is a good technique to detect overfitting and to estimate how
well the model generalizes.
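Sketch of the partitioning step for k-fold cross-validation (the helper name is hypothetical; data is split into contiguous folds without shuffling, which is a simplifying assumption):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
```

Each sample lands in the validation set exactly once; the model would be trained k times, once per (train, validation) pair.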
Use gradient descent (an optimization algorithm) to minimize the error when
training the model (function).
If the dataset is imbalanced (i.e., some classes have far less data), the
overall accuracy is a bad evaluation metric. (Fix this by computing the
accuracy for each class and averaging over the classes.)
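Sketch of why overall accuracy misleads on imbalanced data, using a made-up 9-vs-1 example and hypothetical helper names:

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all points that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_average_accuracy(y_true, y_pred):
    """Accuracy computed separately per class, then averaged over classes."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data: class 1 is rare and the model always predicts class 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
overall = overall_accuracy(y_true, y_pred)
balanced = per_class_average_accuracy(y_true, y_pred)
```

The always-predict-0 model looks good on overall accuracy but the per-class average exposes that it never finds the rare class.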
For time-series data, it is better to do the split for cross-validation based on
the order of time intervals, so that the training data always precedes the
validation data.
For linear regression: to find the optimal coefficients, we need to minimize the
error (the sum of squared errors) using gradient descent, or by setting the
derivative of its matrix form to zero and solving in closed form.
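Sketch of the gradient descent route for a single-feature model y = w*x + b (learning rate, step count and the toy data are assumptions picked so this small example converges):

```python
def fit_line_gd(xs, ys, lr=0.05, steps=2000):
    """Minimize the mean squared error of y = w*x + b by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of mean((w*x + b - y)^2) w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
w, b = fit_line_gd([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```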
We can model a non-linear relationship using polynomial functions with degree
greater than one.
To evaluate regression models, one common metric is the coefficient of
determination (R-squared). R-squared can be greatly affected by outliers.
R-squared = 1 – (SSres/SStot)
SSres = ∑ (y – yhat)^2
SStot = ∑ (y – ymean)^2
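The three formulas above written out as a small sketch (function name is made up; assumes the targets are not all identical, otherwise SStot is zero):

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSres / SStot."""
    y_mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot


perfect = r_squared([1, 2, 3], [1, 2, 3])
mean_only = r_squared([1, 2, 3], [2, 2, 2])
```

A perfect fit gives 1, and always predicting the mean gives 0.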
L3-4, Structured data
Splitting Tree Nodes:
- To split a node, the decision tree algorithm considers all possible splits
on each feature and calculates the impurity reduction resulting from each
split. The impurity reduction is measured using entropy or misclassification
error.
- The algorithm selects the split that maximizes the impurity reduction,
resulting in the most homogeneous child nodes.
- This process is repeated recursively for each child node until certain
stopping criteria are met, such as reaching a maximum depth, minimum
number of samples per node, or when no further impurity reduction is
possible.
Entropy: measures the impurity or randomness of a dataset. It is calculated
for each node of the decision tree and represents the uncertainty in class
labels. A node with low entropy means that the classes are more
homogeneous, while a node with high entropy indicates a mix of classes.
Entropy = ∑ P * log2(1/P), i.e., the sum over classes of probability P times
surprise log2(1/P).
For two classes: entropy=1 when probability is 50/50, entropy=0 when
probability is 0/100.
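Sketch of the entropy formula with a base-2 logarithm (the function name is made up; terms with probability 0 are skipped because they contribute nothing):

```python
import math


def entropy(probs):
    """Sum over classes of P * log2(1/P), skipping zero-probability classes."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


even_split = entropy([0.5, 0.5])
pure_node = entropy([1.0, 0.0])
```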
Misclassification Error: measures the proportion of misclassified instances at a
node. The ratio of the number of instances that are not in the majority class
to the total number of instances at the node. (Less sensitive to changes in the
class probabilities than entropy, so a useful split can show no gain.)
Likelihood function: quantifies how well our model fits the data, allowing us
to make predictions and draw conclusions about the underlying process
generating the data. (It is used to find the best parameters; it is not itself
a loss function.)
Loss function:
Purpose of a loss function: Evaluates the performance of a machine learning
model by quantifying the difference between predicted and actual values.
What do they measure: Loss functions measure the discrepancy or error
between predicted and actual values, providing a measure of how well the
model is performing.
Loss functions for regression: MSE, MAE, RMSE
Bagging (Bootstrap Aggregating) is an ensemble learning technique that aims
to improve the stability and accuracy of machine learning models by training
multiple models on different subsets of the training data, obtained through
bootstrap sampling, and then combining their predictions through averaging or
voting, reducing variance and overfitting.
Bootstrap is a resampling technique used in statistics to estimate the sampling
distribution of a statistic by repeatedly sampling with replacement from the
observed data. This method allows for the estimation of the variability of a
statistic without making strong parametric assumptions about the underlying
distribution of the data.
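Sketch of the bootstrap for the sample mean (data, number of resamples and seed are made-up values; the helper name is hypothetical):

```python
import random


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Means of resamples drawn with replacement, each the size of the data."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in range(n)]
        means.append(sum(resample) / n)
    return means


data = [2, 4, 4, 5, 7, 9]
means = bootstrap_means(data)
# The spread of these means estimates the variability of the sample mean.
spread = (sum((m - sum(means) / len(means)) ** 2 for m in means) / len(means)) ** 0.5
```

In bagging, each resample would instead be used to train one model of the ensemble.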
Information gain: stop splitting when the information gain is too small for the
best feature, which means splitting the node does not give a reasonable
reduction of error.
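Sketch comparing impurity reduction under entropy and under misclassification error on a made-up split (class counts and helper names are assumptions). It shows a split where misclassification error reports no gain while entropy still does:

```python
import math


def entropy_of_counts(counts):
    """Entropy (base 2) of a node given its per-class counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def misclassification_error(counts):
    """Fraction of instances that are not in the majority class."""
    return 1 - max(counts) / sum(counts)


def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = [80, 20]
children = [[40, 20], [40, 0]]
gain_entropy = impurity_reduction(parent, children, entropy_of_counts)
gain_error = impurity_reduction(parent, children, misclassification_error)
```

A stopping rule would compare the gain of the best split against a small threshold and stop splitting when it falls below it.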
L5, Deep learning
Deep learning models, particularly deep neural networks, have the ability to
automatically learn hierarchical representations of features from raw data.
SegNet is a deep learning model for scene segmentation.
A deep neural network (DNN) is composed of interconnected layers of
artificial neurons, also known as nodes or units. Next is an overview of how a
deep neural network works, starting from the artificial neuron:
1. Artificial Neuron: The basic building block of a neural network is the artificial
neuron, which takes multiple inputs, applies weights to each input, sums them
up, and then applies an activation function to produce an output.
2. Activation Function: The activation function introduces non-linearity into the
neural network, allowing it to learn complex patterns in the data. (tanh/ReLU)
3. Layers: Neurons are organized into layers within the neural network. The
input layer receives the raw data, while the output layer produces the final
predictions or outputs. Between the input and output layers, there can be one
or more hidden layers where computations are performed.
4. Weighted Sum and Activation: In each neuron, the inputs are multiplied by
corresponding weights, and the weighted sum is calculated. This sum is then
passed through the activation function to produce the neuron's output.
5. Forward Propagation: The process of passing input data through the network
to generate predictions is called forward propagation. Each layer in the
network performs calculations as described above, passing the output to the
next layer until the final output is generated.
6. Loss Function: The output of the neural network is compared to the actual
target values using a loss function, which quantifies the difference between
the predicted and actual values. Common loss functions include mean squared
error, cross-entropy, and hinge loss.
7. Backpropagation: Backpropagation is the algorithm used to compute the
gradient of the loss function with respect to the weights of the network, by
propagating the error calculated by the loss function backwards through the
layers. An optimizer such as gradient descent then uses this gradient to update
the weights in a direction that minimizes the loss.
8. Training: During the training process, the neural network learns to minimize
the loss function by adjusting its weights through repeated forward
propagation and backpropagation cycles on the training data. This process
allows the network to learn to make accurate predictions on unseen data.
9. Evaluation: Once trained, the performance of the neural network is evaluated
on a separate validation or test dataset to assess its generalization ability and
effectiveness in making predictions on new, unseen data.
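The steps above can be sketched for a single sigmoid neuron with a squared-error loss. The OR-gate data, learning rate and epoch count are assumptions chosen for a tiny example; a real network has many neurons and layers and would normally be built with a framework:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, bias, inputs):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def train_step(weights, bias, inputs, target, lr):
    """One forward pass, one gradient computation, one weight update."""
    out = forward(weights, bias, inputs)
    # Chain rule: d(out - target)^2 / dz = 2 * (out - target) * out * (1 - out)
    grad_z = 2.0 * (out - target) * out * (1.0 - out)
    new_weights = [w - lr * grad_z * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * grad_z
    return new_weights, new_bias


def mean_loss(weights, bias, data):
    return sum((forward(weights, bias, x) - t) ** 2 for x, t in data) / len(data)


# Toy training set: the logical OR of two inputs.
OR_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights, bias = [0.0, 0.0], 0.0
loss_before = mean_loss(weights, bias, OR_DATA)
for _ in range(2000):  # repeated forward/backward cycles over the data
    for inputs, target in OR_DATA:
        weights, bias = train_step(weights, bias, inputs, target, lr=0.5)
loss_after = mean_loss(weights, bias, OR_DATA)
```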
Autoencoder: NN designed to learn efficient representations of input data by
compressing it into a lower-dimensional latent space and then reconstructing
the original data from this representation.
Convolutions: refer to the mathematical operation of combining two functions
to produce a third function, particularly used in convolutional neural networks
(CNNs) for extracting features.
Deep feedforward neural network: multilayer perceptron, this type of NN
consists of multiple layers of interconnected neurons, with information flowing
in one direction from input to output without feedback connections.
Deep learning: automatically learn hierarchical representations of data,
enabling the modeling of complex patterns and relationships.
Full connection: NN architecture where each neuron in one layer is
connected to every neuron in the subsequent layer.
Generative Adversarial Network structure: A framework in ML where two
neural networks, a generator and a discriminator, are trained simultaneously
in a competitive manner to generate realistic synthetic data.
Pooling: A technique used in convolutional neural networks to reduce the
spatial dimensions of feature maps by aggregating neighboring values,
typically using operations like max pooling or average pooling.
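Sketch of 2x2 max pooling with stride 2 on one feature map stored as a list of rows (window size, stride and the toy values are assumptions; odd trailing rows/columns are simply dropped):

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, width - 1, 2)
        ]
        for i in range(0, height - 1, 2)
    ]


pooled = max_pool_2x2([
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 8],
    [1, 1, 7, 5],
])
```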
Recurrent Neural Network (RNN): A type of neural network designed to
process sequential data by maintaining an internal state or memory, allowing
it to capture temporal dependencies and context in the data.
Stable Diffusion: A text-to-image generative model based on latent diffusion.
It generates an image by starting from random noise in a compressed latent
space and gradually denoising it, guided by a text prompt.
Subsampling: Also known as downsampling or pooling, subsampling is a
technique used in convolutional neural networks to reduce the spatial
dimensions of feature maps while retaining important information.
Upsampling: The opposite of subsampling, increases the spatial dimensions of
feature maps, typically performed using operations like transposed
convolutions or interpolation methods. (Reason: maintaining spatial
information, generating high-resolution output, enhancing feature
representation)
Transformer: NN architecture for natural language processing tasks, designed
to capture long-range dependencies in sequential data.
Combat overfitting for deep learning by randomly dropping out neurons with a
pre-defined probability (i.e., the dropout technique), which forces the model
to avoid paying too much attention to a particular set of features.
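Sketch of dropout on a list of activations. It uses the common "inverted" variant that rescales the kept activations during training, which is an assumption beyond the note above; the function name and values are made up:

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop; rescale the survivors."""
    if not training or p_drop == 0:
        # At evaluation time nothing is dropped.
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
dropped = dropout([1.0] * 10000, p_drop=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
```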
L1 Regularization (Lasso): Adds a penalty term to the loss function based on
the absolute value of the coefficients, promoting sparsity and feature
selection.
L2 Regularization (Ridge): Adds a penalty term based on the square of the
coefficients, penalizing large weights without enforcing sparsity as strongly
as L1 regularization.
Elastic Net Regularization: Combines L1 and L2 penalties, offering a mix of
feature selection and coefficient shrinkage with two hyperparameters
controlling the strength of each regularization type.
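Sketch of the three penalty terms as they would be added to a loss (lam, lam1 and lam2 are hypothetical names for the strength hyperparameters; libraries differ in scaling conventions such as an extra factor 1/2):

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def elastic_net_penalty(weights, lam1, lam2):
    """Sum of an L1 and an L2 penalty, each with its own strength."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)


weights = [1.0, -2.0, 3.0]
total_loss = 0.25 + elastic_net_penalty(weights, lam1=0.1, lam2=0.1)  # 0.25 = made-up data loss
```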
Using dropout, regularization, or data augmentation can help us combat
overfitting.
L7, Text Data Processing:
Before deep learning: tokenization (separating words) and normalization
(standardizing word format). Remove unwanted tokens (punc, digits, symbols).
Normalization: Stemming (chops/removes word tails – removing the ‘s’ at the
end), Lemmatization (get lemma of word from dictionary). For lemmatization
we need POS tagging: label role of each word in sentence (Noun, Adv, Adj)
After cleaning tokens we transform data to high-dimensional space (BOW).
Data is now vectors (array of numbers that encode direction/length info).
BOW can be problematic bc it weighs all words equally.
Use TF-IDF to transform into vectors, this is weighted BOW. TF: how frequent
word appears in doc. IDF: weighs word by frequency in other docs, is higher
when term appears in fewer docs.
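Sketch of TF-IDF on already tokenized documents. The plain variant tf * log(N / df) is assumed here; libraries often use smoothed variants, so numbers will differ. Documents and the function name are made up:

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vectors = tf_idf(docs)
```

A word that appears in every document ("the") gets weight 0, while a word unique to one document ("ran") gets the largest weight in that document.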
Topic modeling: encode doc into distribution of topics (LDA).
One-hot encoding is inefficient bc it creates long vectors with many zeros
(uses a lot of memory) and doesn’t encode similarity (cos-simil between the
vectors of two different words is always zero).
Dot-product of vectors can be used to measure similarity, considers both angle
and vector length (cos-sim is normalized dot-product).
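Sketch of dot product vs. cosine similarity on plain lists (helper names and vectors are made up):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both vector lengths, so only the angle matters."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


one_hot_sim = cosine_similarity([1, 0, 0], [0, 1, 0])  # two different words
same_direction = cosine_similarity([1, 2], [2, 4])     # same angle, other length
raw_dot = dot([1, 2], [2, 4])
```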
Word embeddings: efficiently represent text as vectors, similar words have
similar encoding in HD space. Position in word embedding vector space
encodes semantic relations.
Word2Vec is method to train word embeddings by context, use center word to
predict nearby words. Use dot product similarity to calculate probabilities
using softmax function (maps value to probability).
To represent sentences: stack word vectors into matrix.
For CNN inputs need to be same size, but sentences can have different length.
To fix: drop parts that are too long and pad parts too short with zeros. After
input data is same size, put into neural networks.
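Sketch of the truncate-or-pad step for a sentence stored as a list of word vectors (max_len, dim and the function name are assumptions):

```python
def pad_or_truncate(sentence_vectors, max_len, dim):
    """Cut sentences longer than max_len; pad shorter ones with zero vectors."""
    clipped = sentence_vectors[:max_len]
    padding = [[0.0] * dim for _ in range(max_len - len(clipped))]
    return clipped + padding


short = pad_or_truncate([[1.0, 2.0]], max_len=3, dim=2)
long = pad_or_truncate([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]],
                       max_len=3, dim=2)
```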
For RNN, inputs can vary in length. Combine RNNs into a Seq2Seq model (e.g.,
for classification/sentiment analysis), which is flexible in input/output sizes.
Generalize the Seq2Seq model to an encoder-decoder, where the encoder produces
an encoded representation of the entire input. Problem: it is hard for the
model to remember earlier information in a long sequence. Instead, we have the
model consider all inputs (e.g., take the mean/max of the encoder outputs as
the encoded representation). Use an attention mechanism (a weighted average) to
change the weights according to different inputs.
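Sketch of attention as a weighted average. Dot-product scores followed by a softmax are assumed as the weighting scheme; the names query/keys/values and the numbers are illustrative only:

```python
import math


def softmax(scores):
    """Map scores to positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weighted average of values; weights depend on query-key similarity."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


# Equal scores give a plain mean of the values.
mean_context, mean_weights = attend([1.0, 1.0], [[1.0, 0.0], [1.0, 0.0]],
                                    [[2.0, 0.0], [4.0, 2.0]])
# A query matching the first key puts almost all weight on the first value.
_, focused_weights = attend([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]],
                            [[2.0, 0.0], [4.0, 2.0]])
```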
L8, Image Data Processing:
We see images directly, computers read only numbers, they store images as
pixels with RGB channels ranging from 0 to 255. Images can be seen as 3D
tensor (W*H*Channels), RGB is 3 channels (R/G/B).
Before DL: CV used hand-crafted features, developed different image
filters/kernels to extract features using convolution.
Coordinates: origin is top-left. F[x,y] means value of center of the pixel in
image F at location [x,y].
Blur images: Gaussian/box filter, Detect edges: Sobel filter.
Convolve the image with a set of kernels (a filter bank) and aggregate the
responses into a feature vector (with N convolutional kernels and an A x B
image: a feature vector of size A x B x N).
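Sketch of filtering a tiny grayscale image with the horizontal Sobel kernel. No padding is used, so the output is smaller than the input, and the kernel is not flipped (cross-correlation, as is common in deep learning code); both are assumptions of this sketch, as is the toy image:

```python
def filter2d_valid(image, kernel):
    """Slide the kernel over the image; each position gives one dot product."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# 4x5 image with a vertical edge between the 2nd and 3rd column.
image = [[0, 0, 1, 1, 1] for _ in range(4)]
edges = filter2d_valid(image, SOBEL_X)
```

The response is large where the window covers the edge and zero in the flat region on the right.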
DL allows us to train the model end-to-end (inputs raw pixel values, outputs
predictions).
2012: AlexNet showed we can use multiple GPUs to run a CNN with significantly
better performance. Convolutional parts of the architecture are used to learn
kernels to extract features, last layer(s) are linear classifiers where data
should be linearly separable.
Components of CNN: Convolutional layers, Activation Functions, Pooling
layers, Fully Connected layers, Normalization. There are many ways of
combining these.
The feature map (image) is flattened to a 1D vector before the fully connected
layer, whose output is then passed to an activation function.
The convolutional layers perform convolution (filtering): each step produces
1 number (dot product of 2 tensors), and sliding the kernel over the input
gives a matrix of numbers. Each kernel produces one feature map after
convolution; repeat this for every other kernel to obtain the other feature
maps. Convolutional layers can be seen as a way to reduce the number of
trainable parameters by only looking at a local region. Operations consider
stride (# of steps when moving the filter) and padding (adding zeros around the
input). In practice: pick 1 combination of padding/stride to keep input and
output the same size.
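Sketch of the usual output-size formula along one spatial dimension, out = floor((n + 2*padding - kernel) / stride) + 1 (the function name is made up):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length for input length n along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1


same_size = conv_output_size(32, kernel=3, stride=1, padding=1)
no_padding = conv_output_size(32, kernel=3, stride=1, padding=0)
strided = conv_output_size(7, kernel=3, stride=2, padding=0)
```

With stride 1 and an odd kernel size k, padding (k - 1) / 2 keeps input and output the same size.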
Max pooling layer: have NN pay attention to most important info, this reduces
size of each feature map independently.
Activation function: introduces non-linearity. 3 problems with Sigmoid:
saturated neurons kill gradients, sigmoid outputs are not zero-centered, exp()
is computationally expensive. ReLU avoids saturation for positive inputs but
can suffer from the dying ReLU problem; use Leaky ReLU to mitigate it, where
negative values still have a slight slope.
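Sketch of the activation functions mentioned above (the slope 0.01 for Leaky ReLU is a common default, assumed here):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; close to 0 when |z| is large (saturation)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of 0."""
    return z if z > 0 else slope * z


saturated_grad = sigmoid_grad(10.0)
center_grad = sigmoid_grad(0.0)
```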
Normalization layer normalizes a certain region of the feature maps to 0 mean
and unit variance (e.g., batch norm, inserted after convolutional layers and
before the activation function).
Deep models are harder to train: they overfit easily and can perform worse in
both training and testing than shallow models. So use Transfer Learning (using
pre-trained weights to reuse prior knowledge).
Images are 3D tensors, videos are 4D tensors. Combine CNN with RNN or use
3D convolutional layers for classification.
L6, PyTorch:
Hidden Unit Size: The number of neurons in the hidden layers of the neural
network, affecting the model's capacity to learn complex patterns.
Learning Rate: The step size in gradient descent during training, influencing
the speed of convergence and stability of the model.
Weight Decay: Also known as L2 regularization, it adds a penalty term to the
loss function based on the squared magnitude of the weights, helping prevent
overfitting.
Batch Size: The number of samples processed in each iteration of training,
impacting training speed, memory usage, and stochasticity in the optimization.
PCA advantages:
Visualization: enables the exploration of data by reducing its dimensionality
while preserving most of its variance, facilitating visualization of patterns,
clusters, and trends.
Feature Selection: helps identify the most informative dimensions in the
data, aiding in feature selection by highlighting variables that contribute most
to its variance.
Noise Reduction: PCA filters out noise and irrelevant variation in the data,
uncovering underlying structure by capturing only meaningful information.
Data Compression: PCA compresses data into a lower-dimensional space,
reducing storage and computational requirements while retaining essential
information.
Identifying Correlations: PCA reveals correlations between variables through
their loadings on principal components, providing insights into relationships
within the data.
PCA disadvantages:
- Loss of interpretability
- Assumption of linear relationships
- Sensitivity to outliers
- Computational intensity
Permutation feature importance: technique used to assess the importance of
individual features in an ML model by shuffling the values of one feature and
measuring how much the model's performance drops.
Minibatch stochastic gradient descent (SGD): computes the gradient on a small
random subset of the dataset called a minibatch.
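Sketch of minibatch SGD for a one-feature linear model y = w*x + b (batch size, learning rate, epoch count, seed and the noise-free toy data are all assumptions):

```python
import random


def minibatch_sgd(xs, ys, batch_size=2, lr=0.05, epochs=500, seed=0):
    """Fit y = w*x + b, updating on one random minibatch at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random minibatches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error on this minibatch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
w, b = minibatch_sgd([0, 1, 2, 3], [1, 3, 5, 7])
```

Each update sees only batch_size samples, so updates are cheaper but noisier than full-batch gradient descent.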
L10, Multimodal Data Processing:
Modality means how a natural phenomenon is perceived or expressed.
Different modalities can share info with different levels of connections:
Association = correlation between A and B (A is next to B)
Dependency = causal/temporal (A is caused by B)
Multiple modalities can exist in different parts of the ML pipeline:
Image Captioning: images as input, sentences as output (vision → language).
We can also generate images from text (language → vision).
Visual Question Answering: takes images+sentences as input and outputs
label (vision+language → label).
Video Classification: use vid/audio signals to predict categories (vision+audio
→ label).
Transformers use self-attention: way of encoding sequences to tell how much
attention each input should give other inputs (including self). Attention is just
weighted avg.
Convolution layers use fixed weights (kernels) to filter info. Self-attention
layers dynamically compute attention filters to show how well a pixel matches
other pixels.
We can also learn a good representation (embedding) so linear classifiers can
separate data easily.
CLIP model learns text-image representation (vision+language →
representation), use zero-shot prediction by taking label with largest
similarity score between label-text and image.
Contrastive Learning brings positive pairs closer and pushes negative pairs far
apart.
torch.utils.data.Dataset: Provides a consistent interface for accessing data,
enabling custom dataset creation and compatibility with PyTorch data loading
utilities.
torch.utils.data.DataLoader: Simplifies and automates the process of iterating
over batches of data, offering batching, shuffling, and parallel data loading.
torch.nn.BCEWithLogitsLoss: combines a Sigmoid layer and the BCELoss in one
class for numerical stability.
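Sketch of why fusing the sigmoid with the binary cross-entropy helps, in plain Python for a single logit. This illustrates the idea with a standard stable rewrite of the formula; it is not PyTorch's actual implementation, and the function names are made up:

```python
import math


def naive_bce(logit, target):
    """Sigmoid first, then cross-entropy: breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def stable_bce_with_logits(logit, target):
    """Same loss, rewritten as max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


moderate_naive = naive_bce(2.0, 1.0)
moderate_stable = stable_bce_with_logits(2.0, 1.0)
# For a very large logit the sigmoid rounds to 1.0 and the naive version would
# need log(0); the fused form stays finite.
extreme_stable = stable_bce_with_logits(1000.0, 0.0)
```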
Evaluation metrics: Accuracy = #correctly classified points / #all points,
Precision = TP/(TP+FP), Recall = TP/(TP+FN),
F-score = 2*((Prec*Rec)/(Prec+Rec)). Use cross-validation to tune
hyper-parameters.
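Sketch of precision, recall and F-score from confusion counts (the counts are made up; assumes the denominators are non-zero):

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    prec, rec = precision(tp, fp), recall(tp, fn)
    return 2 * (prec * rec) / (prec + rec)


# Toy counts: 8 true positives, 2 false positives, 4 false negatives.
prec = precision(8, 2)
rec = recall(8, 4)
f1 = f_score(8, 2, 4)
```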