Generative AI Essentials
Lecture Slides
These lecture slides are provided for personal and non-commercial use only
Please do not redistribute or upload these lecture slides elsewhere.
Good luck on your exam!
Updated Jan 16, 2025
What is the ExamPro GenAI Essentials?
Cheat sheets, Practice Exams and Flash cards
www.exampro.co/exp-genai-001
The ExamPro GenAI Essentials is a practical GenAI certification teaching:
• Fundamental concepts of ML, AI, and GenAI
• All modalities of GenAI, with a strong focus on LLMs
• Programmatically working with GenAI workloads
• Both cloud and local LLM workloads
• A cloud/vendor-agnostic approach
The course code for the ExamPro GenAI Essentials is EXP-GENAI-001.
The initial release of this course will have gaps, which will be addressed in minor
updates and as we receive feedback. Pay attention to the upcoming roadmap.
Who is this certification for?
Consider the ExamPro GenAI Essentials if…
• You are preparing prerequisite knowledge for the free GenAI Bootcamp so you can be successful in its completion and grading
• You need broad and practical knowledge of GenAI solutions so you have the technical flexibility to move in any technical direction
• You need to focus on implementation and deliver GenAI workloads that are both secure and within budget
The GenAI Roadmap
The ExamPro GenAI Maturity Learning Model:
• Certification course: confidence in understanding and talking about GenAI, and having the tools to get started building.
• Project-based Bootcamp: proof that you can build GenAI workloads through hands-on projects.
https://genai.cloudprojectbootcamp.com/register
How Long to Study to Pass?
Beginner (basic knowledge of cloud and programming): 30 hours
Experienced (have passed multiple GenAI certifications, have practical experience working in cloud and local development): 15 hours
??? hours (average study time cannot be determined)
• 60% lecture and labs
• 40% practice exams
Recommended to study 1-2 hours a day for 15 days.
What does it take to pass the exam?
1. Watch video lecture and memorize key information
2. Do hands-on labs and follow along within your own account
3. Do paid online practice exams that simulate the real exam.
Sign up and redeem your FREE practice exam
No credit card required
https://www.exampro.co/exp-genai-001
Exam Guide – Grading
Passing Grade is *750/1000
You need to get “around” 75% to pass
ExamPro uses Scaled Scoring
Exam Guide – Response Types
There are 65 Questions
• 65 Scored Questions
You can afford to get 16 scored questions wrong
There is no penalty for wrong answers
Format of Questions
• Multiple Choice
• Multiple Answer
• Case Studies
Exam Guide – Duration
Duration of ~2 hours
You get ~2mins per question
Exam Time is: 130mins
Seat Time is: 160mins
Seat time refers to the amount of time
that you should allocate for the exam.
It includes:
• Time to review instructions
• Show online proctor your workspace
• Read and accept NDA
• Complete the exam
• Provide feedback at the end.
Where do you take the exam?
Online from the convenience of your own home.
ExamPro delivers exams via:
• TeacherSeat Anchor (online proctored exam system)
A “proctor” is a supervisor, or person who monitors students during an examination.
Exam Guide – Valid Until
Valid forever
If there is a major version update you may want to consider recertifying to update your knowledge.
What is AI?
What is Artificial Intelligence (AI)?
Machines that perform jobs that mimic human behavior
What is Machine Learning (ML)?
Machines that get better at a task without explicit programming
What is Deep Learning (DL)?
Machines that have an artificial neural network inspired by
the human brain to solve complex problems.
Generative AI
What is Generative AI (Gen AI)?
GenAI is a type of artificial intelligence capable of generating new
content, such as text, images, music, or other forms of media.
Example of Midjourney generating a graphic
GenAI is often conflated with LLMs, but GenAI and LLMs are two different things.
AI vs Generative AI
What is Artificial Intelligence (AI)?
AI is computer systems that perform tasks typically requiring human intelligence.
These include:
• problem-solving
• decision-making
• understanding natural language
• recognizing speech and images
AI's goal is to interpret, analyze, and respond to human actions, and to simulate human intelligence in machines.
• Simulate: mimics aspects, resembles behaviour
• Emulate: replicates exact processes and mechanisms
AI applications are vast and include areas like:
• expert systems
• natural language processing
• speech recognition
• robotics
AI is used in various industries for tasks such as:
• B2C : customer service chatbots
• Ecommerce: recommendation systems
• Auto: autonomous vehicles
• Medical: medical diagnosis.
AI vs Generative AI
What is Generative AI?
Generative AI (GenAI) is a subset of AI that focuses on creating new content or data that is novel
and realistic. It can interpret or analyze data but also generates new data itself.
It includes generating text, images, music, speech, and other forms of media.
It often involves advanced machine learning techniques:
• Generative Adversarial Networks (GANs)
• Variational Autoencoders (VAEs)
• Transformer models eg GPT
Generative AI has multiple modalities:
• Vision: realistic images and videos
• Text: generating human-like text
• Audio: composing music
• Molecular: Drug discovery via genomic data
Large Language Models (LLMs), which generate human-like text, are a subset of GenAI but are often
conflated with AI as a whole because they are the most popular and developed form of it.
AI vs Generative AI
Functionality
• AI focuses on understanding and decision-making.
• Generative AI is about creating new, original outputs.
Data Handling
• AI analyzes and makes decisions based on existing data.
• Generative AI uses existing data to generate new, unseen outputs.
Applications
• AI spans various sectors, including data analysis, automation, natural language processing, and healthcare.
• Generative AI is creative and innovative, focusing on content creation, synthetic data generation, deepfakes, and design.
LLM Landscape
What is an Agent?
Responsible AI Practices
1. Transparency
• Clear communication about AI capabilities and limitations
• Explainable AI systems where possible
2. Fairness and non-discrimination
• Mitigating bias in data and algorithms
• Ensuring equal treatment across different demographics
3. Privacy and data protection
• Safeguarding user data
• Compliance with data protection regulations
4. Safety and security
• Robust testing for potential risks and vulnerabilities
• Implementing safeguards against misuse or attacks
5. Accountability
• Clear ownership and responsibility for AI systems
• Mechanisms for redress in case of harm
6. Human oversight
• Maintaining human control over critical decisions
• Avoiding over-reliance on AI for important judgments
7. Sustainability
• Considering environmental impact of AI systems
• Efficient use of computational resources
8. Ethical design
• Incorporating ethical considerations from the outset
• Regular ethical audits of AI systems
9. Collaboration and knowledge sharing
• Engaging with diverse stakeholders
• Contributing to AI safety research
10. Continuous monitoring and improvement
• Regular assessment of AI system performance and impact
• Updating systems to address emerging issues
There exist multiple authoritative sources on Responsible AI Practices:
• The IEEE's Ethically Aligned Design
• The OECD AI Principles
• The European Commission's Ethics Guidelines for Trustworthy AI
Bot vs Agent
• Bot: simple tasks. Agent: complex tasks.
• Bot: specific tasks. Agent: broad tasks.
• Bot: follows pre-defined scripts or decision trees. Agent: more agency in decision making.
• Bot: limited context understanding. Agent: better at using context.
• Bot: customer service, simple queries, repetitive tasks. Agent: complex problem-solving, analysis, multi-step tasks.
• Bot: easily scalable for specific tasks. Agent: scalability depends on the complexity of tasks.
• Bot: basic NLP, not often powered by LLMs. Agent: advanced NLP, often powered by LLMs.
What is a Foundational Model?
A Foundational Model (FM) is a general-purpose model that is trained on vast amounts of data.
We say that an FM is pretrained because it can be fine-tuned for specific tasks.
[Diagram: training data (text, images, videos, structured data) -> FM -> uses such as prediction, classification, and text, video, image, and audio generation]
LLMs are a specialized subset of FMs that use the transformer architecture.
What is a Large Language Model (LLM)?
A Large Language Model (LLM) is a Foundational Model that implements the transformer architecture.
During this training phase, the model learns semantics (patterns) of language,
such as grammar, word usage, sentence structure, style and tone.
It is easy to say that an LLM just predicts the next word in a sequence, but
researchers do not fully understand how LLMs generate their outputs.
BERT
BERT (Bidirectional Encoder Representations from Transformers)
model released in 2018 by Google researchers.
https://huggingface.co/google-bert
If you were to take the Transformer architecture, keep just the encoder, and
stack the encoders, you would get BERT.
BERT is Bi-directional meaning it can read text both left to
right and right to left to understand the context of text.
BERT
BERT is pre-trained on the following tasks:
• Masked language model (MLM)
• Provided input where tokens are masked.
• Think of asking BERT to fill in the blanks for sentences
• Next sentence prediction (NSP)
• BERT is provided two sentences A and B
• BERT has to predict if B would follow A
BERT can then be fine-tuned to then perform the following tasks:
• Named Entity Recognition (NER)
• Question Answering (QA)
• Sentence Pair Tasks
• Summarization
• Feature Extraction / Embeddings
• and more….
There are multiple BERT model sizes
• BERT Base ~110M parameters
• BERT Large ~340M parameters
• BERT Tiny 4M Parameters
• and ~24 other model sizes…
BERT Variants:
• roBERTa
• DistilBERT
• ALBERT
• ELECTRA
• DeBERTa
While BERT is an older model, it is still widely used and remains a ubiquitous
baseline in natural language processing (NLP) experiments.
BERT
Example of using BERT to perform sentiment analysis via the Hugging Face "sentiment-analysis" pipeline.
Here Hugging Face is downloading a BERT Base Uncased model that has been fine-tuned for sentiment analysis.
BERT needs to be fine-tuned for specific tasks beyond pretraining.
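A minimal sketch of that pipeline usage (the model id is illustrative, any BERT checkpoint fine-tuned for sentiment classification will do, and the transformers library is assumed to be installed):

from transformers import pipeline

# Download a BERT model fine-tuned for sentiment classification and wrap it in a pipeline
classifier = pipeline("sentiment-analysis", model="textattack/bert-base-uncased-SST-2")
print(classifier("I really enjoyed this course!"))  # prints a label and a confidence score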
Sentence Transformers
Sentence Transformers (aka SBERT) is built on top of BERT. It create a single vector
for an entire sentence. When comparing similar sentence this much more
performance than simply using BERT which looks at every single word.
Example of using a pre-trained model to generate embeddings and compare the similarity of two sentences.
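A minimal sentence-transformers sketch (the model name is illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["The cat sits on the mat.", "A cat is resting on a rug."])
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity near 1.0 means similar meaning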
Use cases:
• Embeddings
• Semantic Search
• Retrieve and Rerank
• Clustering
• Image Search
• And more…
https://sbert.net/
CNN and RNN
Convolutional Neural Networks (CNNs) are used for processing and analyzing visual data, such as images and videos.
They detect spatial features like edges and textures through filters,
followed by pooling layers to reduce the dimensionality, and fully
connected layers to perform final classification or regression tasks.
CNNs do not feed outputs back into their inputs; they feed them forward to the next layer.
Recurrent Neural Networks (RNNs) handle sequential data, where order of data matters, eg. Natural language processing (NLP).
They handle sequential data by maintaining a hidden state
that captures information from previous inputs, enabling the
network to understand context and temporal dependencies,
with each time step's input influencing the subsequent states.
RNNs feed their outputs back into their inputs through the same layer.
One thing that CNNs and RNNs have in common is that they handle data sequentially.
Transformer Architecture
The Transformer architecture was developed by researchers at Google and is effective at
Natural Language Processing (NLP) due to multi-head attention and positional encoding.
The Transformer model architecture consists of two parts:
1. Encoder: reads and understands the input text. It's like a smart system that goes through everything it's been taught (which is a lot of text) and picks up on the meanings of words and how they're used in different contexts.
2. Decoder: based on what the encoder has learned, this part generates new pieces of text. It's like a skilled writer that can make up sentences that flow well and make sense.
Attention Is All You Need Whitepaper
Transformers are a type of neural network architecture invented by researchers at Google and the University of Toronto.
Google’s Whitepaper on Transformers
The transformer architecture is the
basis for Large Language Models
(LLMs)
Transformers rely on two things
• Positional Encoding
• Self-Attention
Tokenization
Tokenization is the process of breaking data input (text) into smaller parts.
Tokenization algorithms:
• Byte Pair Encoding (BPE), used by GPT-3
• WordPiece, used by BERT
• SentencePiece, used by Google T5
Each token is mapped to a unique ID in the model's vocabulary.
When working with LLMs, the input text must be converted (or "tokenized")
into a sequence of tokens that match the model's internal vocabulary.
When an LLM is trained it will build an internal vocabulary of tokens,
which could be between 30,000 and 100,000 tokens.
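A minimal sketch using a Hugging Face tokenizer (the model name is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece tokenizer
tokens = tokenizer.tokenize("Tokenization breaks text into smaller parts.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)  # sub-word tokens, eg. ['token', '##ization', 'breaks', ...]
print(ids)     # each token mapped to a unique ID in the model's vocabulary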
Tokens and Capacity
When using transformers, the decoder continuously feeds the tokens it has generated so far back in as input to help predict the next token.
[Diagram: "The quick" -> Encoder/Decoder -> "The quick brown" -> "The quick brown fox" -> "The quick brown fox jumps" -> "The quick brown fox jumps over"]
Memory
• Each token in a sequence requires memory.
• As the token count increases, memory usage increases.
• The memory eventually becomes exhausted.
Compute
• The model performs more operations for each additional token.
• Longer sequences require more compute.
AI services that offer Models-as-a-Service will often have a limit on the combined input and output tokens.
What are Embeddings?
What is a Vector?
An arrow with a length and a direction.
What is a Vector Space Model?
Represents text documents, or other types of data, as vectors in a high-dimensional space.
What are embeddings?
They are vectors of data used by ML models to find relationships between data. ML models can also create embeddings.
Different embedding algorithms capture different kinds of relationships.
You can think of embeddings as external memory that ML models use when performing a task.
Embeddings can be shared across models (multi-model pattern) to help coordinate a task between models.
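A minimal sketch of comparing two embedding vectors with cosine similarity (the toy vectors stand in for real embeddings, which have hundreds of dimensions):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc_a = np.array([0.12, 0.88, 0.35])  # toy 3-dimensional "embeddings"
doc_b = np.array([0.10, 0.80, 0.40])
print(cosine_similarity(doc_a, doc_b))  # values near 1.0 mean the documents are similar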
Positional encoding
Positional encoding is a technique used to preserve the order of words when processing natural language.
Transformers need positional encoding because they do not process data sequentially and would otherwise lose
the order of words when analyzing large bodies of text.
Positional encoding adds a positional vector to each word to keep track of the positions of the words.
[Example: each word in "I heard a dog bark loudly at a cat" is tagged with its position, 0 through 8.]
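A sketch of the sinusoidal positional encoding described in "Attention Is All You Need" (numpy only):

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]   # word positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]     # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions use cosine
    return pe

# One positional vector per word in "I heard a dog bark loudly at a cat" (9 words)
print(positional_encoding(9, 16).shape)  # (9, 16), added element-wise to the word embeddings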
Attention
Attention figures out how important each word (or token) in a sequence is
to the other words in that sequence by assigning the words weights.
Self-Attention
Computes attention weights within the same input sequence,
where each element attends to all other elements.
Used in transformers to model relationships in
sequences (e.g., words in a sentence).
Cross-Attention
Computes attention weights between two different sequences,
allowing one sequence to attend to another sequence.
Used in tasks like translation where the output
sequence (decoder) needs to focus on the input
sequence (encoder).
Multi-head Attention
Combines multiple self-attention (or cross-attention) heads in
parallel, each focusing on different aspects of the input.
Used in transformers to improve performance and
capture various dependencies simultaneously.
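A minimal numpy sketch of scaled dot-product self-attention over a toy sequence (random weights stand in for learned projections):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how strongly each token attends to every other token
    weights = softmax(scores)                  # attention weights sum to 1 for each token
    return weights @ V                         # weighted combination of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (4, 8)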
The Journey to Large Language Models
1950s – 1980s: Early NLP
• Rule-Based Systems
• Statistical Methods
1990s – 2000s
• Bag of Words and N-Grams
• Support Vector Machines (SVMs)
• Latent Semantic Analysis (LSA)
2010s
• Word Embeddings
• Recurrent Neural Networks (RNNs)
• Long Short-Term Memory (LSTM)
• Convolutional Neural Networks (CNNs)
• Gated Recurrent Units (GRUs)
• Sequence-to-Sequence (Seq2Seq)
• Attention Mechanisms
2017 – Now
• Transformer Architecture (the basis for LLMs)
• BERT
• GPT
• RoBERTa
• T5
• LLMs
Order of Solution
Always start with prompt engineering. As complexity and time/cost grow, move from Prompt Engineering to RAG to Supervised Fine-Tuning (SFT).
Use RAG when you need to search and retrieve data:
• Specific data from your data stores
• Up-to-date data from your data stores
Use SFT:
• When you are training for a specific task, eg. classification
• When a large prompt document causes latency issues
• When you have a large amount of prompt/document examples for better evaluations
Supervised Fine Tuning
Supervised Fine-Tuning (SFT) is when you feed a large amount of
training data to adjust the LLM weights for a specific task.
• SFT is more cost-effective than training a model from scratch
• SFT can cause overfitting, which could result in poor predictions when handling unseen or different data.
RAG
Retrieval-Augmented Generation (RAG) is used to access an external source of data before an LLM provides output.
RAG helps with accessing a specific corpus to improve the accuracy of responses.
RAG helps with accessing an up-to-date corpus to improve the freshness of responses.
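A minimal RAG sketch: the documents are made up, the embedding model name is illustrative, and generate() is a placeholder for whichever LLM you call.

from sentence_transformers import SentenceTransformer, util

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs)

question = "How long do I have to return a product?"
scores = util.cos_sim(embedder.encode(question), doc_embeddings)[0]
context = docs[int(scores.argmax())]  # retrieve the most relevant document

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = generate(prompt)  # placeholder for a call to whichever LLM you are using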
Zero Shot
Zero-Shot prompting is when the model can perform the expected task
with no prior knowledge or examples provided during prompting
An LLM is utilizing its built-in knowledge during its training to be able to complete this task.
Few Shot
Few-shot prompting is when the model can perform the expected task
with examples provided during prompting
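An illustrative few-shot prompt (the review examples are made up):

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day." -> Positive
Review: "The screen cracked after a week." -> Negative
Review: "Setup took five minutes and everything just worked." ->"""
# The examples in the prompt show the model the task and the expected output format;
# the model is expected to continue with " Positive".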
Prompt Chaining
Prompt chaining is where the output of one LLM prompt is used
as the input for the next prompt in a sequence.
The reason is that when the context window is too small to complete
a single large or complex task, we break the task up into smaller tasks.
Complex Prompt
Given this Japanese subtitle movie file, translate the text from Japanese to English, identify the grammar being used, and then convert the file into a .jsonlines file where each line has the original Japanese, the English translation, and the referenced grammar rules with grammar examples.
Simple Prompts (chained)
• (iterate over each line) Translate the text from Japanese to English
• Give a list of grammar rules being used -> (List)
• (iterate over list) Generate an explanation of this grammar rule and provide 5 examples
• Format the original line, translated line, and grammar rule examples into a JSON file
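A sketch of how the chained version could be scripted; generate() is a placeholder for whichever LLM you call.

def generate(prompt):
    # Placeholder: replace with a call to whichever LLM you are using
    raise NotImplementedError

def process(subtitle_lines):
    records = []
    for line in subtitle_lines:  # iterate over each subtitle line
        english = generate(f"Translate this Japanese subtitle line to English: {line}")
        rules = generate(f"List the grammar rules used in this sentence: {line}")
        explanations = [
            generate(f"Explain this grammar rule and provide 5 examples: {rule}")
            for rule in rules.splitlines()  # one follow-up prompt per rule
        ]
        records.append({"japanese": line, "english": english, "grammar": explanations})
    return records  # ready to be written out as a .jsonlines file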
Chain of Thought
Chain-of-Thought (CoT) Prompting is when you tell the model to do
step-by-step reasoning so it produces a more accurate result
Without Chain of Thought
Which two prime numbers sum to 10?
With Chain of Thought
Please solve this problem step-by-step. First outline your
reasoning in a clear, logical chain of thought. Then provide the
final concise answer.
Tree of Thought
Tree-of-Thought (ToT) Prompting is where the prompt instructs the
model to explore possible “states” and “transitions” rather than trying to
produce an answer immediately in a single chain of thought.
CO-STAR
CO-STAR is a prompting framework created by Sheila Teo.
• Context: provide background information on the task
• Objective: define the task that you want the LLM to perform
• Style: specify the writing style you want the LLM to use
• Tone: set the attitude of the response
• Audience: identify who the response is intended for
• Response: provide the response format
Example prompt:
## CONTEXT
A free, online introductory Japanese language workshop for beginners.
## OBJECTIVE
Encourage people to register by highlighting the quick, fun, and interactive nature of the session.
## STYLE
Engaging, inviting, clear educational tone.
## TONE
Persuasive yet friendly.
## AUDIENCE
Busy adults curious about learning Japanese, with no prior experience.
## RESPONSE
A concise, impactful Facebook post.
GenAI Modalities
Modalities in the context of AI are the ways something happens, is experienced, or is captured.
Think of the Modalities of human experience
which is based around the 5 senses (inputs).
The primitive GenAI modalities are:
1. Text: LLMs
2. Vision
• Images: Stable Diffusion
• Videos: Open-Sora
3. Audio: AudioCraft, Whisper
You could consider the following to be modalities, but they could also easily fall under an existing primitive GenAI modality:
• Code
• Touch/Haptic eg. Think AR/VR
• Biological Data eg. Genomics
• Sensor Data eg. IoT
• Math Expressions
• 3d Models
It's important to understand that GenAI does not mean exclusively LLMs, because LLMs
focus only on the text modality. LLMs are talked about the most due to their maturity.
Vector Databases
A vector database is a datastore for embeddings.
MongoDB Atlas
Redis
PGvector
Deep Lake
ElasticSearch
Chroma
Weaviate
Pinecone
Vector Search vs Vector Databases
When looking into Vector Databases, you’ll also come across Vector Search products.
Vector Search products may be limited in their capabilities compared to a Vector Database
Capabilities compared (Vector Search vs Vector Database):
• Exact vector search
• Approximate Nearest Neighbors (ANN)
• Persistence
• CRUD
• Storing objects
• Sharding
• Replication
A Vector Database supports all of the above. Vector Search products typically support exact and ANN search, but offer only partial or no support for persistence, CRUD, object storage, sharding, and replication.
Ollama
Ollama is an LLM model manager that makes it easy to
download, install, and run LLMs on your laptop/desktop.
• Ollama can directly run models in its own runtime environment
• Each model is self-contained
• Download an installer and get going on Windows and Mac
• Ollama can easily download updates for models
A short Python example follows the link below.
https://github.com/ollama/ollama
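A minimal sketch using the ollama Python client (assumes Ollama is installed and the model has already been pulled; the model name is illustrative):

import ollama

response = ollama.chat(
    model="llama3.2",  # any model you have pulled with `ollama pull`
    messages=[{"role": "user", "content": "Explain what a foundation model is in one sentence."}],
)
print(response["message"]["content"])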
Llamafile
Llamafile is a single binary format that contains both the model's weights and a way to serve the model.
https://github.com/Mozilla-Ocho/llamafile
• Works by combining llama.cpp with Cosmopolitan Libc
• Can run on multiple CPU microarchitectures
• Supports GPU use
• Can run on multiple operating systems
• Weights are embedded as part of the file
Models that are packaged as Llamafiles:
• LLaMA 3.2 1B/3B Instruct
• Gemma 2 2B/9B/27B Instruct
• LLaVA 1.5
• TinyLlama-1.1B
• Mistral-7B-Instruct
• Phi-3-mini-4k-instruct
• Mixtral-8x7B-Instruct
• WizardCoder-Python-13B/34B
• LLaMA-3-Instruct-8B/70B
• Rocket-3B
• OLMo-7B
• Text Embedding Models
• E5-Mistral-7B-Instruct
• mxbai-embed-large-v1
LangChain
LangChain is an open-source framework to rapidly prototype agents or LLM workloads
(a minimal usage sketch follows the feature list below).
Features include:
• Prompt templates — eg. formatting user input
• Messages — eg. standardized input and output for chat models
• Output parsers – eg. outputting structured XML, JSON, YAML
• Document loaders (cracking) — eg. PDFs, Markdown, CSV, HTML
• Text splitters (chunking) — eg. By HTML heading, characters, JSON
• Adapters to various Vector stores
• Adapters to various Embeddings
• Various Retrievers to Implement RAG
• Indexing
• Tool Use
• Support to work with Multimodals models
• Agent workflows
• Example Selectors
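A minimal LangChain sketch, assuming the langchain-openai package and an OpenAI API key; any supported chat model can be swapped in:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that answers briefly."),
    ("user", "{question}"),
])
llm = ChatOpenAI(model="gpt-4o-mini")  # illustrative model name
chain = prompt | llm  # LangChain Expression Language (LCEL) chaining
print(chain.invoke({"question": "What is a vector database?"}).content)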
LangChain
LangChain is *not well-suited for production use cases.
• LangChain (the company) has built a suite of tools around the LangChain framework, so they may argue otherwise.
You could ship an LLM workload as a Minimum Viable Product (MVP), but you'd eventually need to rebuild your
workload orchestrating multiple containers, eg. with OPEA.
LangChain was one of the earliest GenAI frameworks that made it easy to swap out
LLMs with a unified API. When LLMs had small context windows, LangChain worked
around this limitation by providing multiple chaining strategies like summarization.
LlamaIndex
LlamaIndex is an open-source framework to rapidly prototype agents or LLM workloads.
LlamaIndex has wider support for data connectors and advanced RAG techniques:
• Data connectors, eg. Azure AI Search
• Data indexes
• Engines
• Agents
• Observability/Evaluation
• Workflows
LlamaIndex has its own platform called LlamaCloud for enterprise production, similar to LangChain having its own platform offering.
Consider that, just like LangChain, LlamaIndex isn't designed for production use cases.
GGUF and GGML
GGUF and GGML are file formats used for storing models for inference,
often models like GPT (Generative Pre-trained Transformer).
GGUF is the successor of the GGML format, so you will most often see GGUF.
• GGUF is a binary format that is designed explicitly for the fast loading and saving of models.
• GGUF is specially designed to store inference models and perform well on consumer-grade computer hardware.
• GGUF supports fine-tuning.
• Models initially developed in frameworks like PyTorch can be converted to GGUF format for use with these engines.
GGUF models can be executed using:
• Ollama
• GPT4All
• Llama.cpp
Here you can see that, on Hugging Face, Google Gemma 7B has a .gguf file.
Guard Rails
Guard rails are the process of adding additional checks and filters on inputs and outputs.
Prompt Injections
Prompt injection is a technique to manipulate AI models like LLMs by inserting hidden instructions into the input.
• Exploits how models process input
• Attempts to override initial instructions
• Uses natural language the model understands
• Targets weak boundaries between user input and system instructions
• Leverages context sensitivity of LLMs
Mitigation strategies include input sanitization, clear separation of instructions and user input, and model fine-tuning.
OWASP Top 10 LLMs Apps
The OWASP Top 10 for LLM Applications is a list of the top security concerns when
building applications that utilize LLMs.
LLM01: Prompt Injection
Manipulating LLMs via crafted inputs can lead to unauthorized access, data breaches, and compromised decision-making.
LLM02: Insecure Output Handling
Neglecting to validate LLM outputs may lead to downstream security exploits, including code execution that compromises systems.
LLM03: Training Data Poisoning
Tampered training data can impair LLM models leading to responses that may compromise security, accuracy, or ethical behavior.
LLM04: Model Denial of Service
Overloading LLMs with resource-heavy operations can cause service disruptions and increased costs.
LLM05: Supply Chain Vulnerabilities
Depending upon compromised components, services, or datasets undermines system integrity, causing data breaches and system failures.
OWASP Top 10 LLMs Apps
LLM06: Sensitive Information Disclosure
Failure to protect against disclosure of sensitive information in LLM outputs can result in legal consequences or a loss of
competitive advantage.
LLM07: Insecure Plugin Design
LLM plugins processing untrusted inputs and having insufficient access control risk severe exploits like remote
code execution.
LLM08: Excessive Agency
Granting LLMs unchecked autonomy to take action can lead to unintended consequences, jeopardizing reliability,
privacy, and trust.
LLM09: Overreliance
Failing to critically assess LLM outputs can lead to compromised decision making, security vulnerabilities,
and legal liabilities.
LLM10: Model Theft
Unauthorized access to proprietary large language models risks theft, loss of competitive advantage, and dissemination of
sensitive information.
LitGPT
LitGPT is a CLI tool to pretrain, finetune, and deploy LLMs at scale.
https://github.com/Lightning-AI/litgpt
• Scratch implementations
• No abstractions
• Flash Attention
• Utilizes Lightning Fabric
• Fully Sharded Data Parallel (FSDP)
• 1-1000+ GPUs/TPUs
• Parameter-efficient finetuning (PEFT)
• LoRA
• QLoRA
• Adapter and Adapter v2
• Recipes for 20+ LLMs
LitGPT has a Python API in the works, so in the future you will be able to train easily within Jupyter Notebooks.
LitGPT — Training Example
An example of LoRA fine-tuning:
Finetune and Deploy Llama 3B Example
Using Lightning AI with the following:
• 1 Nvidia L4 Tensor Core GPU
• 16 vCPUs
• 64 GB RAM
• 24 GB VRAM
With a dataset of 6,471 examples, this trained within 766.72 seconds (~12 mins).
We can then chat with our fine-tuned model.
An equivalent AWS EC2 instance would be a g6.4xlarge:
• $1.323 / hour
• 16 vCPU, 64 GB RAM, 1x L4
Flash Attention
Flash Attention is a memory-efficient, faster variant of the traditional attention mechanism,
optimized for GPUs to handle longer sequences with reduced computation and memory overhead.
https://github.com/Dao-AILab/flash-attention
There are multiple versions:
• Flash Attention
• Flash Attention 2: a refinement of Flash Attention
• Flash Attention 3: optimized for Hopper GPUs, eg. H100
Flash Attention achieves efficiency by computing attention scores in small chunks, fusing operations like softmax
and matrix multiplications to minimize memory use and speed up computation.
Llama.cpp
Llama.cpp is an inference server for LLMs.
Llama.cpp implements the underlying architecture of Meta's LLaMA in C/C++, intended to
allow improved performance when running models on CPUs.
The main goal of llama.cpp is to enable LLM inference with minimal setup and
state-of-the-art performance on a wide range of hardware - locally and in the cloud.
Llama.cpp has:
• SDKs for many different programming languages eg. yoshoku/llama_cpp.rb
• CLI tool eg. llama-cli
• Lightweight server eg. llama-server
• Integrations with many UI tools
Llama.cpp works only with GGUF format for
model weights.
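A minimal sketch using the llama-cpp-python bindings to run a GGUF model locally (the model path is illustrative):

from llama_cpp import Llama

llm = Llama(model_path="./models/gemma-7b.gguf", n_ctx=2048)  # load GGUF weights
output = llm("Q: What is quantization? A:", max_tokens=64)
print(output["choices"][0]["text"])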
Bitnet.cpp
1-bit LLMs represent the most extreme form of quantization, using a single bit (0 or 1) for each parameter.
This greatly reduces the model size and computational needs compared to conventional LLMs
Microsoft Bitnet.cpp is an inference framework for 1-bit LLMs
Here's a comparison of Llama.cpp versus Bitnet.cpp for inference.
1-bit LLMs:
• bitnet_b1_58-large
• bitnet_b1_58-3B
• Llama3-8B-1.58-100B-tokens
• Falcon3 Family
KServe
KServe provides a Kubernetes Custom Resource Definition (CRD)
for serving predictive and generative ML models on K8s via KNative.
KServe can serve the following kinds of models: TensorRT, TensorFlow, PyTorch, SkLearn, XGBoost, ONNX.
Here, KServe is using Hugging Face Transformers to serve the model.
TGI and TEI
Text Generation Inference (TGI) is a Hugging Face library for serving LLMs.
Text Embeddings Inference (TEI) is a Hugging Face library for serving models that output embeddings.
vLLM
vLLM is an open-source library to serve LLMs.
vLLM can be deployed in various ways; one of the easiest is via Docker (a Python library sketch follows below).
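vLLM can also be used directly as a Python library; a minimal offline-inference sketch (model name illustrative, assumes a GPU):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
outputs = llm.generate(["What is speculative decoding?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)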
Ray Serve
Ray is a collection of libraries for AI workloads.
Ray Serve can be used to serve AI models, such as LLMs, with vLLM.
https://github.com/ray-project/ray
Ray is often positioned as a competitor against Apache Spark
Ray Serve
Using vLLM on its own scales to a single server.
Using Ray Serve with vLLM, vLLM workloads can be distributed across multiple servers.
TensorRT LLM
NVIDIA® TensorRT is an ecosystem of APIs for high-performance deep learning inference.
TensorRT optimizes the model for the target hardware eg. Nvidia GPUs
TensorRT LLM allows you to serve LLM models using the TensorRT engine using Python code.
1. Convert the checkpoint to the TensorRT LLM checkpoint format
2. Build the engine
3. Run for inference
Context Window Caching
Context Window Caching, also known as Prompt Caching or Context Caching is when
the computed context is stored in memory to help improve response times for LLMs
Use cases for Prompt Caching
• Chatbots with extensive system instructions
• Repetitive analysis of lengthy video files
• Recurring queries against large document sets
• Frequent code repository analysis or bug fixing
Prompt Caching is offered by some providers
for very specific model versions eg:
• Google Gemini
• Anthropic Claude Sonnet
Cached tokens might be billed at a reduced rate, saving
cost along with gaining improved response times.
Structured JSON
Structured JSON is when we want to force an LLM to produce structured json as its output.
There are multiple techniques to force structured JSON:
• Context-free grammars
• Finite state machines (FSMs)
• Regexes for constrained decoding
Structured output can be:
• provided by a third-party library
• built into the API of an LLM
As each token is generated, the LLM's possible next tokens are limited based on the schema.
[Diagram: the input and schema are fed to the LLM, which generates the JSON token by token, eg. { " a " : …]
Schema is represented either as
• Pydantic
• JSON Schema
Some LLM providers will ask you to also tell the LLM to generate JSON in the prompt, but often that is not necessary.
Structured JSON
When using OpenAI, they have an API for structured output and it requires the use of Pydantic.
When using Cohere, they have an API for structured output and it requires a JSON schema based on json-schema.org.
Getting back ideal Structured JSON is challenging even with built-in APIs with specific providers.
Instructor
Instructor is a library that can produce structured JSON output.
https://python.useinstructor.com/
Instructor integrates with the following APIs (a usage sketch follows the list):
• Anthropic
• Azure OpenAI
• Cerebras
• Cohere
• Cortex
• Fireworks
• Gemini
• Groq
• LiteLLM
• llama-cpp-python
• Mistral
• Ollama
• OpenAI
• Together
• Vertex AI
• Writer
• DeepSeek
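A minimal Instructor sketch with the OpenAI client (any supported provider works; the model name is illustrative):

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

client = instructor.from_openai(OpenAI())
person = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Person,  # Instructor validates the output against this Pydantic schema
    messages=[{"role": "user", "content": "Extract: Hiro is 29 years old."}],
)
print(person.name, person.age)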
Sandboxing LLMs
Sandboxing is when we run a workload within its own isolated environment, such as a container.
LLMs are more likely to exhaust all resources, hang, or crash, so sandboxing your
LLMs or other AI models will allow for better management of LLM disasters.
OPEA
Open Platform for Enterprise AI (OPEA) is a collection of open-source Linux Foundation
projects that provides blueprints to deploy AI workloads using containers.
https://github.com/opea-project
OPEA Projects:
• GenAIExamples — A collection of templates/blueprints for common AI workloads
• GenAIComps — A collection of microservices so you can build your own AI workloads
• GenAIEval — Evaluation, benchmarking, and scorecards, targeting performance (throughput and latency),
accuracy on popular evaluation harnesses, safety, and hallucination
• GenAIStudio — low code platform to enable users to construct, evaluate, and benchmark GenAI applications.
• and more….
OPEA projects are unopinionated templates that allow you to deploy in
various formats eg. Docker, K8s, onto various hardware eg. Intel, AMD, Nvidia
OPEA — GenAIExamples
GenAIExamples is a collection of "mega-services" for specific AI workloads:
• AgentQnA
• AudioQnA
• AvatarChatbot
• ChatQnA
• CodeGen
• CodeTrans
• DBQnA
• DocIndexRetriever
• DocSum
• EdgeCraftRAG
• FaqGen
• GraphRAG
• InstructionTuning
• MultimodalQnA
• ProductivitySuite
• RerankFinetuning
• SearchQnA
• Text2Image
• Translation
• VideoQnA
• VisualQnA
• WorkflowExecAgent
If you want to deploy a chatbot that uses RAG, you can modify and deploy the ChatQnA example.
Each mega-service is made up of the microservices found in GenAIComps.
OPEA — GenAIComps
GenAIComps are microservices that you can use as the building blocks for your AI workloads.
A microservice will be configured to work in various ways with various technologies.
• agent
• animation
• asr/whisper
• chathistory
• cores
• dataprep
• embeddings
• feedback_management
• finetuning
• guardrails
• image2image
• image2video
• intent_detection
• llms
• lvms
• nginx
• prompt_registry
• ragas
• reranks
• retrievers
• text2image
• texttosql
• tts
• vectorstores
• web_retrievers
Quantization
Quantization is a compression technique that converts the weights and
activations within an LLM to a lower-precision data type.
Eg. Convert from FP32 to INT8
Benefits
• Smaller models
• Faster inference
• Reduce consumed resources eg. Less
RAM usage
Disadvantages
• Potential loss in quality
Not a perfect analogy, but think of a wave sampled with fewer data points.
Examples of Quantization:
• qLORA
• GGML/GGUF
Quantization can be a complicated process; often you will see code examples with complex mathematical conversions.
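A minimal sketch of loading a model with 8-bit quantization via transformers and bitsandbytes (model name illustrative; assumes a GPU):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # convert weights to INT8 at load time
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=quant_config,
    device_map="auto",
)
print(model.get_memory_footprint())  # roughly half the FP16 memory footprint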
Knowledge Distillation
Knowledge Distillation is when you transfer knowledge from a large model to a smaller model
so that the smaller model performs the same task faster and at a lower resource cost.
The goal of Knowledge Distillation is to produce a Small Language Model (SLM).
• Soft targets: predictions from the large model
• Hard targets: labels from the ground-truth data
Minitron is a family of SLMs achieved through Knowledge Distillation and Pruning.
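A minimal sketch of a distillation loss in PyTorch: the student is trained against the teacher's soft targets and the ground-truth hard targets at the same time.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between the softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard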
Medusa
Medusa adds extra "heads" to LLMs to predict
multiple future tokens simultaneously.
• Trains multiple decoding heads on the same model
• Training is parameter-efficient, so "GPU-poor" machines can train
• Delivers approximately a 2x speed increase across a range of Vicuna models
TPU
Tensor Processing Unit (TPU) is an AI accelerator application-specific
integrated circuit (ASIC) developed by Google for neural network machine
learning, using Google's own TensorFlow software.
TPUs are designed for a high volume of low precision computation
2016: TPU v1 - ASIC for neural network inference
2017: TPU v2 - Added training capability, liquid cooling
2018: TPU v3 - 2x performance of v2
2020: TPU v4 - 2-3x faster than v3
2023: TPU v5p - Liquid-cooled pods with ~9 exaflops performance
iGPU
iGPU (integrated GPU) is when a CPU contains the capability of performing tasks similar to a dedicated GPU.
Intel Lunar Lake chip contains
multiple systems including an iGPU
dGPU could be used to explicitly refer to a dedicated GPU
VPUs
VPU (Visual Processing Unit) is an AI accelerator specialized in
machine vision tasks eg. CNN (convolutional neural networks)
Intel Movidius is an example of a VPU, which comes in the form of a USB peripheral that can be plugged into a workstation.