Deep Learning Algorithms
BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly
Optimized BERT Pretraining Approach) are pretrained language models that are widely used
in deep learning, especially in Natural Language Processing (NLP). RoBERTa keeps BERT's
architecture and improves on it through changes to the pretraining procedure and
training setup.
BERT's key features:
Transformer Architecture:
BERT is based on the Transformer architecture, which allows it to effectively process
sequential data like text.
Bidirectional Encoding:
BERT can understand the context of words in both directions, left and right, within a
sentence.
Masked Language Modeling (MLM):
BERT is trained with the MLM objective: a random subset of input tokens (15% in the
original setup) is masked, and the model is trained to predict the masked tokens.
Next Sentence Prediction (NSP):
BERT is also trained on the NSP objective: given two sentences, it predicts whether the
second one follows the first in the original text.
Pretraining:
BERT is pre-trained on a large corpus of text data, which allows it to learn a rich
representation of language.
RoBERTa's key features:
Improvements over BERT:
RoBERTa builds upon BERT and incorporates several improvements, including longer
training, larger batches, and a dynamic masking strategy.
Dynamic Masking:
RoBERTa uses dynamic masking: a new masking pattern is generated each time a sequence is
fed to the model, instead of one fixed pattern created during preprocessing, so the model
sees different masked versions of the same text.
No NSP Objective:
RoBERTa removes the NSP objective from the pretraining procedure, focusing solely on the
MLM objective.
Larger Vocabulary:
RoBERTa uses a larger byte-level BPE vocabulary (about 50K subword units, compared with
BERT's roughly 30K WordPiece vocabulary), so any input string can be encoded without
unknown tokens.
Improved Performance:
RoBERTa was shown to achieve state-of-the-art results on several NLP benchmarks at the
time of its release, which supports the effectiveness of its changes to BERT's pretraining.
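The difference between static masking and dynamic masking can be illustrated with a small, dependency-free sketch. The function name `mask_tokens`, the toy sentence, and the use of Python's `random` module are assumptions made for illustration. Real pretraining code also replaces some selected tokens with random tokens or leaves them unchanged, which is left out here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return a copy of tokens with a random subset replaced by [MASK].

    At least one token is masked so that short inputs still yield a target.
    """
    if not tokens:
        return []
    rng = rng or random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "the quick brown fox jumps over the lazy dog again".split()

# Static masking: the pattern is drawn once and reused in every epoch.
static = mask_tokens(tokens, rng=random.Random(0))
static_epochs = [static for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is used.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng=rng) for _ in range(3)]

for epoch, masked in enumerate(dynamic_epochs):
    print(epoch, " ".join(masked))
```

With dynamic masking the model can be asked to predict different tokens of the same sentence in different epochs, whereas static masking always hides the same positions.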
In summary:
BERT and RoBERTa are both widely used pretrained language models that have had a large
impact on NLP. RoBERTa builds upon BERT by changing the pretraining procedure (longer
training, larger batches, dynamic masking, no NSP), which results in a more robust model.
Plan to Implement:
1. Load GoEmotions dataset
(GoEmotions = multi-label dataset: 58k English Reddit comments annotated with 27
emotion categories + neutral, i.e. 28 labels in total.)
2. Prepare resources:
   o Pre-trained language model embeddings (e.g., BERT, RoBERTa)
   o Linguistic features (POS tags, dependency parse, named entity recognition, syntax, tense)
   o Emotion lexicon (you can use NRC Emotion Lexicon or GoEmotions lexicon)
   o Stylistic features (punctuation counts, all caps, repeated characters, interjections, emojis)
3. For each text:
   o Compute contextual embedding (ϕ_e(d))
   o Apply attention pooling or SVD reduction if needed
   o Extract linguistic features (ϕ_l(d))
   o Compute lexicon-based features (ϕ_x(d))
   o Extract stylistic features (ϕ_s(d))
4. Normalize feature vectors
5. Concatenate all features to form X ∈ ℝⁿˣᵏ
Tools we will use:
transformers (for BERT embeddings)
spaCy (for POS, NER, DEP parsing)
emoji (for emoji detection)
nltk (for interjections, stylistic features)
scikit-learn (for SVD and normalization)
Step 0: Load the GoEmotions Dataset
First, we need to load the GoEmotions dataset.
There are two easy ways:
Load from HuggingFace Datasets (datasets library)
OR load a local CSV if you have it.
We'll use HuggingFace because it’s fast and clean.
Install libraries first if you don't have them:
pip install datasets
Python Code to load GoEmotions:
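from datasets import load_dataset

# Load GoEmotions. The "simplified" configuration gives each example a `labels` list
# (the "raw" configuration stores one column per emotion instead).
dataset = load_dataset("go_emotions", "simplified")
train_data = dataset['train']

# Quick check
print(train_data[0])
👉 This prints a record of the form {'text': '...', 'labels': [...], 'id': '...'},
where labels is a list of emotion indices (one comment can carry several labels).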
✅ Step 0 Summary:
We successfully loaded the dataset and accessed the text and labels!
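Because GoEmotions is multi-label, the list of label indices usually has to be turned into a fixed-length 0/1 target vector before training. A minimal sketch, assuming 28 label slots (27 emotions + neutral); the helper name `to_multi_hot` and the example record are made up for illustration:

```python
NUM_LABELS = 28  # 27 emotions + neutral in GoEmotions

def to_multi_hot(label_ids, num_labels=NUM_LABELS):
    """Turn a list of label indices into a 0/1 vector of length num_labels."""
    vec = [0] * num_labels
    for idx in label_ids:
        if not 0 <= idx < num_labels:
            raise ValueError(f"label index {idx} out of range")
        vec[idx] = 1
    return vec

# Hypothetical record in the same shape as the dataset rows
example = {"text": "hypothetical comment", "labels": [3, 27]}
y = to_multi_hot(example["labels"])
print(sum(y), y[3], y[27])  # 2 1 1
```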
Step 1: Initialize Feature Matrix X
🔹 In the algorithm:
Step 1: Initialize feature matrix X ∈ ℝⁿˣᵏ
where:
n = number of documents (texts)
k = total feature dimension (after combining all features)
Meaning:
Before we start feature extraction for each document, we must create an empty matrix
of size (number_of_documents × number_of_features).
Step 2: Compute Contextual Embeddings (ϕₑ(d))
Now, for each document (text):
We extract BERT or RoBERTa embeddings.
We'll use HuggingFace Transformers!
Install transformers if needed:
pip install transformers
Python Code to get BERT embeddings:
from transformers import AutoTokenizer, AutoModel
import torch

# Load BERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()  # inference mode: disables dropout

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():  # no gradients needed for feature extraction
        outputs = model(**inputs)
    # Mean pooling over all token embeddings (single text, so there is no padding)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.numpy()

# Example
print(get_bert_embedding("I feel accomplished.").shape)
➡️ This prints (1, 768): one 768-dimensional vector for each text.
✅ Step 2 Summary:
We now have contextual embeddings for each text!
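Mean pooling over all positions is fine for a single text, but when several texts are padded to the same length in a batch, the padded positions should not count toward the average. The plain-Python sketch below shows only the arithmetic; the function name `masked_mean_pool` and the toy vectors are assumptions, and in practice the same thing would be done with tensors and the tokenizer's attention mask.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    if count == 0:
        return total  # nothing to average: return the zero vector
    return [t / count for t in total]

tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row stands in for padding
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padding row would pull the average toward zero.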
Step 3: Attention Pooling or SVD Reduction (optional)
After getting embeddings, the algorithm says:
If embedding pooling = attention, apply an attention layer.
If embedding reduction = True, apply SVD (Singular Value Decomposition) to reduce
dimensions.
3.1: Attention Pooling
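We'll create a simple learned attention-pooling layer: it scores each token, turns the
scores into weights with softmax, and returns the weighted sum of the token embeddings.
Its weights start out random, so it has to be trained together with the downstream
classifier to be useful.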
Python Code for Attention Pooling:
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attention = nn.Linear(hidden_dim, 1)  # one score per token

    def forward(self, hidden_states):
        # hidden_states: [batch_size, sequence_length, hidden_dim]
        scores = self.attention(hidden_states).squeeze(-1)  # [batch_size, sequence_length]
        weights = torch.softmax(scores, dim=1)  # attention weights
        context = torch.bmm(weights.unsqueeze(1), hidden_states).squeeze(1)  # weighted sum
        return context

# Example usage (tokenizer and model are the ones loaded in Step 2)
attention_pooling = AttentionPooling(hidden_dim=768)
text = "I feel accomplished."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
outputs = model(**inputs)
pooled_embedding = attention_pooling(outputs.last_hidden_state)
print(pooled_embedding.shape) # torch.Size([1, 768])
3.2: SVD Reduction (if needed)
If you want to reduce embedding size, use TruncatedSVD from sklearn.
Install if needed:
pip install scikit-learn
Python Code for SVD:
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Suppose you have a batch of embeddings (random stand-in for 1000 samples)
batch_embeddings = np.random.randn(1000, 768)

# Reduce dimension from 768 -> 128
svd = TruncatedSVD(n_components=128)
reduced_embeddings = svd.fit_transform(batch_embeddings)
print(reduced_embeddings.shape) # (1000, 128)
✅ Step 3 Summary:
Attention pooling = focusing on important parts of a sentence.
SVD = making feature dimension smaller (faster training).
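To make the attention pooling arithmetic concrete, here is a small sketch without any deep learning library. The scores are hand-picked numbers standing in for the output of the learned linear layer, and the names `softmax` and `attention_pool` are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(token_vectors, scores):
    """Weighted sum of token vectors, with weights = softmax(scores)."""
    weights = softmax(scores)
    dim = len(token_vectors[0])
    return [sum(w * vec[j] for w, vec in zip(weights, token_vectors))
            for j in range(dim)]

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = attention_pool(vectors, [0.0, 0.0, 0.0])  # equal scores: same as the plain mean
peaked = attention_pool(vectors, [0.0, 0.0, 10.0])  # high score: third token dominates
print(uniform, peaked)
```

Equal scores reduce attention pooling to mean pooling, which is why the learned scores are what make the layer "focus".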
Step 4 Preview:
Next, we will extract linguistic features like:
POS tags (Parts of Speech)
Dependency parsing
Named entities
Syntactic structures
Tense
Step 4: Extract Linguistic Features
The algorithm says:
Extract linguistic features:
lᵢ = ϕ_l(dᵢ, Pₗ) where lᵢ ∈ ℝᵈˡ
lᵢ = {POS, DEP, NER, SYN, TENSE}
✅ Meaning we need to extract:
POS = Part-of-Speech tags
DEP = Dependency relations
NER = Named Entities
SYN = Syntax info
TENSE = Verb tense (past, present, etc.)
4.1: Install and Load spaCy
First, install spaCy and a language model:
pip install spacy
python -m spacy download en_core_web_sm
Then load it:
import spacy
nlp = spacy.load("en_core_web_sm")
4.2: Linguistic Feature Extraction Code
Here’s full Python code to extract all 5 features:
def extract_linguistic_features(text):
    doc = nlp(text)
    pos_tags = [token.pos_ for token in doc]      # Part-of-Speech
    dep_tags = [token.dep_ for token in doc]      # Dependency relations (not counted below)
    ner_tags = [ent.label_ for ent in doc.ents]   # Named entities
    # Syntax: a token's depth is its number of ancestors in the dependency tree
    depths = [len(list(token.ancestors)) for token in doc]
    avg_depth = sum(depths) / len(depths) if depths else 0.0
    # Verb tenses: morph.get returns a list such as ["Past"] (empty if unmarked)
    tenses = [token.morph.get("Tense") for token in doc if token.pos_ == "VERB"]
    # Convert into fixed-length feature counts (bag of tags style)
    feature_vector = {
        "num_nouns": pos_tags.count("NOUN"),
        "num_verbs": pos_tags.count("VERB"),
        "num_adjectives": pos_tags.count("ADJ"),
        "num_adverbs": pos_tags.count("ADV"),
        "num_entities": len(ner_tags),
        "avg_syntactic_depth": avg_depth,
        "num_past_tense": sum(1 for t in tenses if "Past" in t),
        "num_present_tense": sum(1 for t in tenses if "Pres" in t),
    }
    return feature_vector

# Example usage
text = "I felt very happy when I completed my project!"
features = extract_linguistic_features(text)
print(features)
🔵 Output (exact values depend on the spaCy model and version; a typical parse gives):
{
'num_nouns': 1,
'num_verbs': 2,
'num_adjectives': 1,
'num_adverbs': 1,
'num_entities': 0,
'avg_syntactic_depth': 1.5,
'num_past_tense': 2,
'num_present_tense': 0
}
✅ Step 4 Summary:
We turn linguistic patterns into numeric features (vectors) for ML models.
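The "bag of tags" idea can be shown on its own: count how often each tag occurs and read the counts off in a fixed order, so every document maps to the same columns. This is a sketch under assumptions: the tag inventory is a small subset chosen for illustration, and the tag list is written by hand rather than produced by spaCy.

```python
from collections import Counter

# Fixed tag inventory so every document maps to the same columns (assumed subset)
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON"]

def bag_of_tags(tags, inventory=POS_TAGS, normalize=False):
    """Count tags in a fixed order; optionally divide by the number of tokens."""
    counts = Counter(tags)
    vec = [counts.get(tag, 0) for tag in inventory]
    if normalize and tags:
        vec = [c / len(tags) for c in vec]
    return vec

# Hand-written tag sequence standing in for a tagger's output
tags = ["PRON", "VERB", "ADV", "ADJ", "SCONJ", "PRON", "VERB", "PRON", "NOUN", "PUNCT"]
print(bag_of_tags(tags))  # [1, 2, 1, 1, 3]
```

Tags outside the inventory (here SCONJ and PUNCT) are simply ignored; normalizing by length makes short and long texts comparable.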
Step 5: Lexicon-Based Features
The algorithm says:
Xᵢ = ϕₓ(dᵢ, L)
where
X_i[j] = \frac{1}{|d_i|} \sum_{w \in d_i} v(w, e_j)
(for each emotion e_j in the emotion taxonomy E)
✅ Meaning:
For each word in the text,
Check if it exists in an emotion lexicon.
Sum up emotion scores across the words.
Normalize by document length.
5.1: What is an Emotion Lexicon?
🔹 It’s a dictionary like:
{
"happy": {"joy": 1, "anger": 0, "sadness": 0, ...},
"angry": {"joy": 0, "anger": 1, "sadness": 0, ...},
...
}
Each word is linked to emotions.
You can:
Use NRC Emotion Lexicon
Or GoEmotions Lexicon (from original dataset)
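A lexicon file has to be turned into the nested dictionary shown above before it can be used. The sketch below assumes a tab-separated layout with one `word`, `emotion`, `flag` triple per row, which is how the word-level NRC file is commonly distributed; check this against the copy you download. The sample rows are made up, and `parse_lexicon` is an illustrative name.

```python
from collections import defaultdict

def parse_lexicon(lines):
    """Build {word: {emotion: 0/1}} from tab-separated word/emotion/flag rows."""
    lexicon = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        word, emotion, flag = parts
        lexicon[word][emotion] = int(flag)
    return dict(lexicon)

# Made-up sample rows in the assumed layout
sample = [
    "happy\tjoy\t1",
    "happy\tanger\t0",
    "angry\tjoy\t0",
    "angry\tanger\t1",
    "",
]
lexicon = parse_lexicon(sample)
print(lexicon["happy"])  # {'joy': 1, 'anger': 0}
```

For a real file, the same function can be called on an open file handle, since iterating over a file yields its lines.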
5.2: Lexicon-based Feature Extraction Code
Here’s Python code:
import numpy as np

# Example mini lexicon (you should use the full one!)
emotion_lexicon = {
    "happy": {"joy": 1, "anger": 0, "sadness": 0},
    "sad": {"joy": 0, "anger": 0, "sadness": 1},
    "angry": {"joy": 0, "anger": 1, "sadness": 0},
    # etc...
}
emotions = ["joy", "anger", "sadness"]

def lexicon_features(text, lexicon, emotions):
    tokens = text.lower().split()
    feature_vec = np.zeros(len(emotions))
    for token in tokens:
        if token in lexicon:
            for idx, emotion in enumerate(emotions):
                feature_vec[idx] += lexicon[token].get(emotion, 0)
    # Normalize by document length
    if len(tokens) > 0:
        feature_vec = feature_vec / len(tokens)
    return feature_vec

# Example usage
text = "I am very happy but sometimes angry"
features = lexicon_features(text, emotion_lexicon, emotions)
print(features)
🔵 Output:
[0.14285714 0.14285714 0.        ] # Normalized joy and anger scores (1 hit / 7 tokens each)
✅ Step 5 Summary:
Lexicon-based features capture emotional tendencies from word-level analysis!
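One practical detail: splitting on whitespace leaves punctuation attached to words, so a token like "happy!!!" would not be found in the lexicon. A small sketch of a more forgiving tokenizer, assuming a simple regular expression is good enough for English text; `simple_tokens`, `lexicon_score`, and the toy lexicon are illustrative names and data.

```python
import re

def simple_tokens(text):
    """Lowercase and keep alphabetic runs (with inner apostrophes); drops punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def lexicon_score(text, lexicon, emotion):
    """Average lexicon value of one emotion over the tokens of the text."""
    tokens = simple_tokens(text)
    if not tokens:
        return 0.0
    hits = sum(lexicon.get(tok, {}).get(emotion, 0) for tok in tokens)
    return hits / len(tokens)

toy_lexicon = {"happy": {"joy": 1}, "angry": {"anger": 1}}
print(simple_tokens("So happy!!!"))                      # ['so', 'happy']
print(lexicon_score("So happy!!!", toy_lexicon, "joy"))  # 0.5
```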
Step 6: Extract Stylistic Features
The algorithm says:
Extract stylistic features:
Sᵢ = ϕ_s(dᵢ, Sₗ) where Sᵢ ∈ ℝᵈˢ
Features to extract:
PUNCT: Punctuation marks (!, ?, etc.)
CAPS: Words written in all uppercase
REP: Repeated letters (like "soooo", "noooo")
INT: Interjections ("oh!", "wow!", "oops!")
EMOJI: Emojis 😊 😡 etc.
6.1: Required Libraries
Install these if needed:
pip install nltk emoji
6.2: Stylistic Feature Extraction Code
import re
import emoji
import nltk
from nltk.tokenize import word_tokenize

# Tokenizer data for word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Small interjection list (you can expand)
interjections = {"oh", "wow", "oops", "yay", "hey", "ouch", "hmm", "ah", "uh", "uhoh"}

def extract_stylistic_features(text):
    tokens = word_tokenize(text)
    punct_count = sum(1 for token in tokens if token in ['!', '?', '.', ',', ';', ':'])
    caps_count = sum(1 for token in tokens if token.isupper() and len(token) > 1)
    repeated_chars = sum(1 for token in tokens if re.search(r'(.)\1{2,}', token))  # e.g., sooo
    interjection_count = sum(1 for token in tokens if token.lower() in interjections)
    # emoji.emoji_count replaces the UNICODE_EMOJI_ENGLISH lookup, which emoji 2.x removed
    emoji_count = emoji.emoji_count(text)
    feature_vector = {
        "punctuation_count": punct_count,
        "all_caps_count": caps_count,
        "repeated_char_count": repeated_chars,
        "interjection_count": interjection_count,
        "emoji_count": emoji_count
    }
    return feature_vector

# Example usage
text = "OH WOW!!! That's sooo amazing 😍😍"
features = extract_stylistic_features(text)
print(features)
🔵 Output:
{
'punctuation_count': 3,
'all_caps_count': 2,
'repeated_char_count': 1,
'interjection_count': 2,
'emoji_count': 2
}
✅ Step 6 Summary:
We capture how people express emotions through their writing style!
Step 7: Normalize the Feature Vectors
The algorithm says:
Normalize li, Xi, and Si using StandardScaler.
✅ Meaning:
Linguistic features (li)
Lexicon-based features (Xi)
Stylistic features (Si)
must be scaled so that:
Mean = 0
Standard Deviation = 1
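📈 Scaling helps models that are sensitive to feature magnitude (e.g., SVMs and neural
networks) train faster and often perform better; tree-based models such as Random
Forest are largely unaffected by it.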
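The arithmetic behind this kind of scaling is short enough to write out by hand. The sketch below uses only the standard library and made-up numbers; it uses the population standard deviation, which is also what StandardScaler uses, and it takes the statistics from the training values only before applying them to new values. The helper names `fit_scaler` and `transform` are illustrative.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return (mean, std) of one feature column, using the population std."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        std = 1.0  # constant column: avoid division by zero
    return mean, std

def transform(column, mean, std):
    """Apply z-score scaling with previously fitted statistics."""
    return [(x - mean) / std for x in column]

train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 11.0]

mean, std = fit_scaler(train)  # statistics come from the training values only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)
print(mean, std, test_scaled)  # 5.0 2.0 [0.0, 3.0]
```

Reusing the training statistics on new data keeps the two on the same scale and avoids leaking information from the test set.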
7.1: How to Normalize in Python?
We use StandardScaler from sklearn:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Suppose you have:
# linguistic_features_list = [dicts of features for each document]
# lexicon_features_list = [numpy arrays]
# stylistic_features_list = [dicts of features for each document]

# 1. Convert dictionaries to arrays
linguistic_df = pd.DataFrame(linguistic_features_list)
stylistic_df = pd.DataFrame(stylistic_features_list)

# 2. Stack everything
X_linguistic = linguistic_df.values
X_lexicon = np.stack(lexicon_features_list)
X_stylistic = stylistic_df.values

# 3. Normalize each part
# (with a train/test split: fit_transform on the training rows, transform on the rest)
scaler_ling = StandardScaler()
scaler_lex = StandardScaler()
scaler_style = StandardScaler()
X_linguistic_scaled = scaler_ling.fit_transform(X_linguistic)
X_lexicon_scaled = scaler_lex.fit_transform(X_lexicon)
X_stylistic_scaled = scaler_style.fit_transform(X_stylistic)
✅ After normalization:
All feature vectors are ready to be combined!
Final feature for each document = [embedding, linguistic, lexicon, stylistic]
🔹 You just concatenate them horizontally:
# X_embeddings: (n, 768) array of contextual embeddings from Step 2
X_final = np.hstack([X_embeddings, X_linguistic_scaled, X_lexicon_scaled, X_stylistic_scaled])
Now X_final is the feature matrix you can feed into classifiers like:
SVM
Random Forest
Deep Learning models (MLP, LSTM, Transformer...)
But wait — how many features will there be?
👉 It depends on:
Size of contextual embeddings (say 768 for BERT).
Number of linguistic features (e.g., 5–10 features like POS counts, NER counts, etc.)
Size of lexicon features (equal to the number of emotion labels, e.g., 28 for GoEmotions: 27 emotions + neutral).
Number of stylistic features (5 features: punctuations, caps, repeated chars,
interjections, emoji).
Example dimension calculation:
Feature type              Feature count
Contextual embedding      768
Linguistic features       10
Lexicon-based features    28
Stylistic features        5
Total k                   811
So k = 768 + 10 + 28 + 5 = 811.
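The same calculation can be done in code, which also gives the column range each feature block occupies in X. A minimal sketch using the counts from the example above; the dictionary keys and the helper name `column_slices` are illustrative.

```python
feature_dims = {
    "embedding": 768,   # BERT-base hidden size
    "linguistic": 10,
    "lexicon": 28,
    "stylistic": 5,
}

def column_slices(dims):
    """Map each feature block to the slice of columns it occupies in X."""
    slices, start = {}, 0
    for name, width in dims.items():
        slices[name] = slice(start, start + width)
        start += width
    return slices, start

slices, k = column_slices(feature_dims)
print(k)                  # 811
print(slices["lexicon"])  # slice(778, 806, None)
```

Keeping the slices around makes it easy to check later which columns belong to which feature type, for example when inspecting feature importance.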
Simple Code for Initialization:
import numpy as np
# Suppose:
n = len(documents)  # number of texts
k = 811             # total number of features
# Step 1: Initialize
X = np.zeros((n, k))
✅ Now you have an empty feature matrix X with zeros.
Then for each document, you:
Compute embeddings
Compute linguistic features
Compute lexicon-based features
Compute stylistic features
Normalize
Combine
Fill the corresponding row in X
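Important:
X is filled only after the features of each document have been extracted. Note that
StandardScaler computes its mean and standard deviation across all documents, so
normalization is applied to whole feature columns rather than to one document at a time.
That’s what Steps 2–7 are doing!
So Step 1 is just setting up the space for features.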
✅ Full view:
Step         Action
Step 1       Create empty feature matrix X
Steps 2–7    Fill each row of X with extracted + normalized features
Summary for Step 1:
X = np.zeros((n, k)) # initialize feature matrix
then after each document's feature extraction,
insert into X[i, :] — the i-th row.
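Filling the matrix row by row can be sketched with plain lists. The feature sizes here are deliberately tiny and hypothetical (3 + 2 + 2 + 1 = 8 instead of 811) so that the example stays readable, and `build_row` is an illustrative helper name.

```python
def build_row(embedding, linguistic, lexicon, stylistic):
    """Concatenate the four per-document feature blocks into one flat row."""
    return [*embedding, *linguistic, *lexicon, *stylistic]

# Toy documents with already-extracted features (made-up values)
docs = [
    {"embedding": [0.1, 0.2, 0.3], "linguistic": [1, 2], "lexicon": [0.5, 0.0], "stylistic": [3]},
    {"embedding": [0.4, 0.5, 0.6], "linguistic": [0, 1], "lexicon": [0.0, 0.25], "stylistic": [0]},
]
n, k = len(docs), 8
X = [[0.0] * k for _ in range(n)]  # Step 1: zero-initialized n x k matrix

for i, d in enumerate(docs):
    row = build_row(d["embedding"], d["linguistic"], d["lexicon"], d["stylistic"])
    if len(row) != k:
        raise ValueError(f"row {i} has {len(row)} features, expected {k}")
    X[i] = row  # fill the i-th row
print(X[1])
```

The length check catches the most common mistake in this kind of pipeline: a feature extractor that returns a different number of values for some documents.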
End-to-End Pipeline: Emotion Detection using Feature Extraction
Step 0: Plan
We'll do:
Step    What
0       Load sample GoEmotions data
1       Initialize feature matrix X
2       Extract contextual embeddings (with DistilBERT)
3       Extract linguistic features
4       Extract lexicon-based features
5       Extract stylistic features
6       Normalize all features
7       Train a small classifier