L2, Fundamentals:
Data science pipeline: frame the problem, collect data, preprocess, explore, model, deploy.
Quantization: transforms a continuous set of values (e.g., real-valued measurements) into a discrete set (e.g., categories or bins).
Scaling: transforms variables so that they follow another distribution, which puts variables on the same scale and makes the data work better with many models.
Missing data mechanisms:
- MCAR: the missingness of the data is unrelated to any observed or unobserved variables in the dataset. Essentially, missing data occurs randomly, without any systematic pattern.
- MAR: the missingness of the data is related to other observed variables in the dataset but not to the missing variable itself. Once the other variables are taken into account, the missingness is random.
- MNAR: the missingness of the data is related to the variable that is missing. There is a systematic pattern to the missing data, which depends on the unobserved values of the variable itself or on other unobserved factors. (Imputation is not possible.)
Ways to model data:
- Image classification: 1. optical character recognition (hand-written digits); 2. fine-grained categorization (e.g., categorizing types of birds).
- Text classification: 1. sentiment analysis; 2. annotating paragraphs (e.g., of a scientific paper).
Cross-validation: partition the available data into subsets, train the model on a portion of the data, and evaluate it on the remaining data. Cross-validation is a good technique to prevent overfitting.
Classification:
Use gradient descent (an optimization algorithm) to minimize the error and train the model (a function). If the dataset is imbalanced (i.e., some classes have far less data), accuracy over all data points is a bad evaluation metric; fix this by computing the average accuracy per class. For time-series data, it is better to split the data for cross-validation based on the order of the time intervals.
Regression:
For linear regression, to find the optimal coefficients we minimize the error (the sum of squared errors) using gradient descent or by taking the derivative of its matrix form. We can model a non-linear relationship using polynomial functions of a chosen degree. To evaluate regression models, one common metric is the coefficient of determination (R-squared). R-squared can be greatly affected by outliers.
R-squared = 1 − SSres/SStot, where SSres = ∑ (y − yhat)^2 and SStot = ∑ (y − ymean)^2. (A short NumPy sketch of this formula appears below, after the entropy definition.)
L3-4, Structured data:
Splitting tree nodes:
- To split a node, the decision tree algorithm considers all possible splits on each feature and calculates the impurity reduction resulting from each split. The impurity reduction is measured using entropy or the misclassification error.
- The algorithm selects the split that maximizes the impurity reduction, resulting in the most homogeneous child nodes.
- This process is repeated recursively for each child node until certain stopping criteria are met, such as reaching a maximum depth, a minimum number of samples per node, or the point where no further impurity reduction is possible.
Entropy: measures the impurity or randomness of a dataset. It is calculated for each node of the decision tree and represents the uncertainty in the class labels. A node with low entropy means that the classes are more homogeneous, while a node with high entropy indicates a mix of classes. (Entropy = ∑ probability P * surprise log(1/P).) Entropy = 1 when the class probabilities are 50/50; entropy = 0 when they are 0/100.
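A minimal NumPy sketch of the two formulas above (the function names and toy arrays are invented for illustration): R-squared is computed from SSres and SStot, and entropy uses log base 2, so a 50/50 class split gives entropy 1 and a pure node gives 0.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SSres / SStot."""
    ss_res = np.sum((y - y_hat) ** 2)        # sum of squared residuals
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares around the mean
    return 1 - ss_res / ss_tot

def entropy(p):
    """Entropy of a class distribution p: sum over classes of P * log2(1/P)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                             # skip zero-probability classes (0 * log(1/0) -> 0)
    return float(np.sum(p * np.log2(1.0 / p)))

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.1, 7.2, 8.7])
print(r_squared(y, y_hat))    # close to 1 for a good fit
print(entropy([0.5, 0.5]))    # 1.0: maximum uncertainty for two classes
print(entropy([1.0, 0.0]))    # 0.0: a pure node
```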
Misclassification error: measures the proportion of misclassified instances at a node, i.e., the ratio of the number of instances that are not in the majority class to the total number of instances at the node. (It is not sensitive to the class probabilities, so some splits show no information gain.)
Likelihood function: quantifies how well our model fits the data, allowing us to make predictions and draw conclusions about the underlying process generating the data. (It is used to find the best parameters; it is not a loss function like entropy.)
Loss function: evaluates the performance of a machine learning model by quantifying the difference between predicted and actual values. Loss functions measure the discrepancy or error between predicted and actual values, providing a measure of how well the model is performing. Loss functions for regression: MSE, MAE, RMSE.
Bagging (bootstrap aggregating): an ensemble learning technique that aims to improve the stability and accuracy of machine learning models by training multiple models on different subsets of the training data, obtained through bootstrap sampling, and then combining their predictions through averaging or voting, reducing variance and overfitting.
Bootstrap: a resampling technique used in statistics to estimate the sampling distribution of a statistic by repeatedly sampling with replacement from the observed data. This allows the variability of a statistic to be estimated without making strong parametric assumptions about the underlying distribution of the data.
Information gain: stop splitting when the information gain is too small for the best feature, which means splitting the node does not give a reasonable reduction of the error.
L5, Deep learning:
Deep learning models, particularly deep neural networks, can automatically learn hierarchical representations of features from raw data. SegNet is a deep learning model for scene segmentation.
A deep neural network (DNN) is composed of interconnected layers of artificial neurons, also known as nodes or units. Next is an overview of how a deep neural network works, starting from the artificial neuron (a minimal PyTorch training-loop sketch follows the list):
1. Artificial neuron: the basic building block of a neural network; it takes multiple inputs, applies a weight to each input, sums them up, and then applies an activation function to produce an output.
2. Activation function: introduces non-linearity into the neural network, allowing it to learn complex patterns in the data (e.g., tanh or ReLU).
3. Layers: neurons are organized into layers. The input layer receives the raw data, while the output layer produces the final predictions or outputs. Between the input and output layers there can be one or more hidden layers where computations are performed.
4. Weighted sum and activation: in each neuron, the inputs are multiplied by the corresponding weights and the weighted sum is calculated. This sum is then passed through the activation function to produce the neuron's output.
5. Forward propagation: the process of passing input data through the network to generate predictions. Each layer performs the calculations described above, passing its output to the next layer until the final output is generated.
6. Loss function: the output of the neural network is compared to the actual target values using a loss function, which quantifies the difference between the predicted and actual values. Common loss functions include mean squared error, cross-entropy, and hinge loss.
7. Backpropagation: the algorithm used to update the weights of the neural network based on the error calculated by the loss function. It works by computing the gradient of the loss function with respect to the weights of the network and then using this gradient to update the weights in a direction that minimizes the loss.
8. Training: during training, the neural network learns to minimize the loss function by adjusting its weights through repeated forward propagation and backpropagation cycles on the training data. This allows the network to learn to make accurate predictions on unseen data.
9. Evaluation: once trained, the performance of the neural network is evaluated on a separate validation or test dataset to assess its generalization ability and effectiveness on new, unseen data.
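The nine steps above map onto a short PyTorch training loop. Below is a hedged sketch with made-up toy data and an arbitrary hidden width of 16 (not code from the course); it only shows where forward propagation, the loss function, backpropagation, and the weight update happen.

```python
import torch
import torch.nn as nn

# Toy data: 100 samples, 4 features, 3 classes (invented for illustration).
X = torch.randn(100, 4)
y = torch.randint(0, 3, (100,))

# Steps 1-4: neurons organized into layers; weighted sums followed by a non-linear activation.
model = nn.Sequential(
    nn.Linear(4, 16),   # input layer -> hidden layer (weighted sums)
    nn.ReLU(),          # activation function introduces non-linearity
    nn.Linear(16, 3),   # hidden layer -> output layer (one score per class)
)

loss_fn = nn.CrossEntropyLoss()                            # step 6: loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):                                    # step 8: repeated training cycles
    logits = model(X)                                      # step 5: forward propagation
    loss = loss_fn(logits, y)                              # step 6: compare predictions to targets
    optimizer.zero_grad()
    loss.backward()                                        # step 7: backpropagation (gradients)
    optimizer.step()                                       #         gradient-based weight update

# Step 9: evaluation should use a separate validation/test set rather than X, y.
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean()
```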
Autoencoder: a NN designed to learn efficient representations of input data by compressing it into a lower-dimensional latent space and then reconstructing the original data from this representation.
Convolution: the mathematical operation of combining two functions to produce a third function, used in particular in CNNs for extracting features.
Deep feedforward neural network (multilayer perceptron): a NN consisting of multiple layers of interconnected neurons, with information flowing in one direction from input to output, without feedback connections.
Deep learning: automatically learns hierarchical representations of data, enabling the modeling of complex patterns and relationships.
Full connection: a NN architecture in which all neurons in one layer are connected to every neuron in the subsequent layer.
Generative Adversarial Network (GAN): a framework in ML where two neural networks, a generator and a discriminator, are trained simultaneously in a competitive manner to generate realistic synthetic data.
Pooling: a technique used in convolutional neural networks to reduce the spatial dimensions of feature maps by aggregating neighboring values, typically using operations like max pooling or average pooling.
Recurrent Neural Network (RNN): a type of neural network designed to process sequential data by maintaining an internal state or memory, allowing it to capture temporal dependencies and context in the data.
Stable Diffusion: a text-to-image generative model based on latent diffusion: it learns to gradually denoise a latent representation, so new images can be generated by starting from random noise and iteratively removing the noise, guided by a text prompt.
Subsampling: also known as downsampling or pooling, a technique used in convolutional neural networks to reduce the spatial dimensions of feature maps while retaining important information.
Upsampling: the opposite of subsampling; increases the spatial dimensions of feature maps, typically using operations like transposed convolutions or interpolation. (Reasons: maintaining spatial information, generating high-resolution output, enhancing the feature representation.)
Transformer: a NN architecture for natural language processing tasks, designed to capture long-range dependencies in sequential data.
Combat overfitting in deep learning by randomly dropping out neurons with a pre-defined probability (the dropout technique), which forces the model to avoid paying too much attention to a particular set of features.
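A minimal PyTorch sketch of dropout (the layer sizes and input are arbitrary, chosen for illustration): in training mode each hidden activation is zeroed with the pre-defined probability p, and in evaluation mode dropout is switched off.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden activation is dropped with probability 0.5 during training
    nn.Linear(64, 2),
)

x = torch.randn(8, 20)

model.train()            # dropout active: a random subset of neurons is zeroed on every forward pass
out_train = model(x)

model.eval()             # dropout is a no-op at evaluation time (PyTorch rescales during training instead)
out_eval = model(x)
```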
L1 regularization (Lasso): adds a penalty term to the loss function based on the absolute value of the coefficients, promoting sparsity and feature selection.
L2 regularization (Ridge): adds a penalty term based on the square of the coefficients, penalizing large weights without enforcing sparsity as strongly as L1 regularization.
Elastic Net regularization: combines the L1 and L2 penalties, offering a mix of feature selection and coefficient shrinkage, with two hyperparameters controlling the strength of each regularization type.
Using dropout, regularization, or data augmentation can help us combat overfitting.
L7, Text Data Processing:
Before deep learning: tokenization (separating words) and normalization (standardizing the word format). Remove unwanted tokens (punctuation, digits, symbols). Normalization: stemming (chops/removes word endings, e.g., removing the 's' at the end) and lemmatization (getting the lemma of a word from a dictionary). For lemmatization we need POS tagging: labeling the role of each word in the sentence (noun, adverb, adjective, ...).
After cleaning the tokens, we transform the data into a high-dimensional space (bag of words, BOW). The data is now vectors (arrays of numbers that encode direction/length information). BOW can be problematic because it weighs all words equally. Use TF-IDF to transform documents into vectors; this is a weighted BOW. TF: how frequently a word appears in the document. IDF: weighs a word by its frequency in the other documents; it is higher when the term appears in fewer documents. Topic modeling: encode a document as a distribution over topics (e.g., LDA).
One-hot encoding is inefficient because it creates long vectors with many zeros (using a lot of GPU memory) and does not encode similarity (the cosine similarity between two one-hot vectors is always zero). The dot product of vectors can be used to measure similarity; it considers both the angle and the vector lengths (cosine similarity is the normalized dot product). Word embeddings efficiently represent text as vectors: similar words have similar encodings in the high-dimensional space, and the position in the word-embedding vector space encodes semantic relations. Word2Vec is a method to train word embeddings from context, using the center word to predict nearby words. Dot-product similarities are turned into probabilities using the softmax function (which maps values to probabilities).
To represent sentences, stack the word vectors into a matrix. For CNNs, the inputs need to be the same size, but sentences can have different lengths; to fix this, truncate sentences that are too long and pad those that are too short with zeros. After the input data has the same size, feed it into the neural network. For RNNs, the inputs can vary in length. Combine RNNs into a Seq2Seq model for classification/sentiment analysis; it is flexible in its input/output sizes. Generalize the Seq2Seq model to an encoder-decoder, where the encoder produces an encoded representation of the entire input. Problem: it is hard for the model to remember earlier information; instead, we let the model consider all inputs (e.g., take the mean/max of the encoder inputs to form the encoder output). Use the attention mechanism (a weighted average) to change the weights according to the different inputs.
L8, Image Data Processing:
We see images directly; computers read only numbers: they store images as pixels with RGB channels ranging from 0 to 255. Images can be seen as 3D tensors (width × height × channels); RGB has 3 channels (R/G/B). Before deep learning, computer vision used hand-crafted features: different image filters/kernels were developed to extract features using convolution. Coordinates: the origin is the top-left corner; F[x, y] means the value of the pixel centered at location [x, y] in image F. To blur images: Gaussian or box filter. To detect edges: Sobel filter. Use a set of kernels (a filter bank) and aggregate the responses into a feature vector (with N convolutional kernels and an A x B image: a feature vector of A x B x N).
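A small sketch of hand-crafted filtering by convolution, using PyTorch's conv2d purely as a convenient convolution routine (the random image and its size are made up): one kernel produces one feature map of the same spatial size when padding preserves the size, so a bank of N kernels would give an A x B x N feature volume.

```python
import torch
import torch.nn.functional as F

# Hypothetical grayscale image, shape (batch=1, channels=1, H=64, W=64), values in [0, 255].
image = torch.rand(1, 1, 64, 64) * 255

# Hand-crafted Sobel kernel (horizontal gradient, i.e., it responds to vertical edges).
sobel_x = torch.tensor([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]]).reshape(1, 1, 3, 3)

# Convolution slides the kernel over the image; padding=1 keeps the output the same spatial size.
edges = F.conv2d(image, sobel_x, padding=1)
print(edges.shape)   # torch.Size([1, 1, 64, 64]) -- one feature map per kernel
```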
Deep learning allows us to train the model end-to-end (inputs: raw pixel values; outputs: categories/heatmaps). In 2012, AlexNet showed that we can use multiple GPUs to run a CNN with significantly better performance. The convolutional part of the architecture learns kernels to extract features; the last layer(s) are linear classifiers, where the data should be linearly separable.
Components of a CNN: convolutional layers, activation functions, pooling layers, fully connected layers, normalization. There are many ways of combining these components. The fully connected layer flattens the feature map (image) to a 1D vector, which is then passed to an activation function. The convolutional layers perform convolution (filtering): each step produces one number (the dot product of two tensors); doing this iteratively yields a matrix of numbers. For each kernel we obtain a feature map after convolution; we obtain another feature map for each other kernel by sliding it over the original map, and repeat this. Convolutional layers can be seen as a way to reduce the number of trainable parameters by only looking at a local region. The operation depends on the stride (the number of steps taken when moving the filter) and the padding (adding zeros around the input). In practice, pick a combination of padding and stride that keeps the input and output the same size.
Max pooling layer: makes the NN pay attention to the most important information; it reduces the size of each feature map independently.
Activation function: introduces non-linearity. Three problems with the sigmoid: saturated neurons kill gradients, sigmoid outputs are not zero-centered, and exp() is computationally expensive. Use Leaky ReLU to mitigate the dying-ReLU problem: negative values still have a slight slope.
Normalization layer: normalizes a certain region of the feature maps to zero mean and unit variance (e.g., batch norm, inserted after convolutional layers and before the activation function).
Deep models are harder to train: they overfit easily and can perform worse in both training and testing than shallow models. So use transfer learning (using pre-trained weights to reuse prior knowledge). Images are 3D tensors; videos are 4D tensors. Combine a CNN with an RNN, or use 3D convolutional layers, for video classification.
L6, Pytorch:
Hidden unit size: the number of neurons in the hidden layers of the neural network, affecting the model's capacity to learn complex patterns.
Learning rate: the step size in gradient descent during training, influencing the speed of convergence and the stability of the model.
Weight decay: also known as L2 regularization, it adds a penalty term to the loss function based on the squared magnitude of the weights, helping prevent overfitting.
Batch size: the number of samples processed in each iteration of training, impacting the training speed, memory usage, and stochasticity of the optimization process.
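A hedged sketch (all values and the toy data are hypothetical) showing where these four hyperparameters enter a typical PyTorch training setup: the hidden unit size sets the layer width, the learning rate and weight decay are passed to the optimizer, and the batch size is handled by the DataLoader.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

hidden_size = 128      # hidden unit size: width of the hidden layer
learning_rate = 1e-3   # step size for gradient descent
weight_decay = 1e-4    # L2 penalty on the weights
batch_size = 32        # samples processed per training iteration

# Toy dataset: 256 samples, 10 features, binary labels (invented for illustration).
X = torch.randn(256, 10)
y = torch.randint(0, 2, (256,)).float()
loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

model = nn.Sequential(
    nn.Linear(10, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
loss_fn = nn.BCEWithLogitsLoss()       # sigmoid + binary cross-entropy in one numerically stable step

for xb, yb in loader:                  # each iteration processes one minibatch of `batch_size` samples
    loss = loss_fn(model(xb).squeeze(1), yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```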
PCA advantages:
- Visualization: enables exploration of the data by reducing its dimensionality while preserving most of its variance, facilitating the visualization of patterns, clusters, and trends.
- Feature selection: helps identify the most informative dimensions in the data, aiding feature selection by highlighting the variables that contribute most to its variance.
- Noise reduction: PCA filters out noise and irrelevant variation in the data, uncovering the underlying structure by capturing only the meaningful information.
- Data compression: PCA compresses the data into a lower-dimensional space, reducing storage and computational requirements while retaining the essential information.
- Identifying correlations: PCA reveals correlations between variables through their loadings on the principal components, providing insights into relationships within the data.
PCA disadvantages:
- Loss of interpretability
- Assumption of linear relationships
- Sensitivity to outliers
- Computational intensity
Permutation feature importance: a technique used to assess the importance of individual features in an ML model.
Minibatch stochastic gradient descent (SGD): computes the gradient on a small random subset of the dataset, called a minibatch. (Standard supervised training on many labeled examples per class is sometimes called many-shot learning, in contrast to few-shot or zero-shot learning.)
L10, Multimodal Data Processing:
Modality: how a natural phenomenon is perceived or expressed. Different modalities can share information with different levels of connection (they are connected). Association = correlation between A and B (A appears next to B). Dependency = a causal/temporal relation (A is caused by B).
Multiple modalities can appear in different parts of the ML pipeline:
- Image captioning: images as input, sentences as output (vision → language). We can also generate images from text (language → vision).
- Visual question answering: takes images + sentences as input and outputs a label (vision + language → label).
- Video classification: uses video/audio signals to predict categories (vision + audio → label).
Transformers use self-attention: a way of encoding sequences that tells how much attention each input should give to the other inputs (including itself). Attention is just a weighted average. Convolution layers use fixed weights (kernels) to filter information; self-attention layers dynamically compute attention filters that indicate how well each pixel matches its neighbors. We can also learn a good representation (embedding) so that linear classifiers can separate the data easily. The CLIP model learns a joint text-image representation (vision + language → representation); it enables zero-shot prediction by taking the label with the largest similarity score between the label text and the image. Contrastive learning brings positive pairs closer together and pushes negative pairs far apart.
torch.utils.data.Dataset: provides a consistent interface for accessing data, enabling custom dataset creation and compatibility with PyTorch data-loading utilities.
torch.utils.data.DataLoader: simplifies and automates iterating over batches of data, offering batching, shuffling, and parallel data loading.
torch.nn.BCEWithLogitsLoss: combines a Sigmoid layer and the BCELoss for numerical stability.
Evaluation metrics: Accuracy = #correctly classified points / #all points; Precision = TP / (TP + FP); Recall = TP / (TP + FN); F-score = 2 * (Precision * Recall) / (Precision + Recall). Use cross-validation to tune hyper-parameters.
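To make the metric definitions concrete, here is a minimal pure-Python sketch (the function name and toy labels are invented) that counts TP/FP/FN and applies the formulas above to binary predictions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-score for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f_score

print(classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
# (0.6, 0.666..., 0.666..., 0.666...)
```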