Generative AI Essentials Lecture Slides

These lecture slides are provided for personal and non-commercial use only. Please do not redistribute or upload these lecture slides elsewhere. Good luck on your exam!
Updated Jan 16, 2025
Cheat sheets, practice exams and flashcards: www.exampro.co/exp-genai-001

What is the ExamPro GenAI Essentials?
The ExamPro GenAI Essentials is a practical GenAI certification teaching:
• Fundamental concepts of ML, AI and GenAI
• All modalities of GenAI, with a strong focus on LLMs
• Programmatically working with GenAI workloads
• Both cloud and local LLM workloads
• A cloud/vendor-agnostic approach
The course code for the ExamPro GenAI Essentials is EXP-GENAI-001. The initial release of this course will have gaps, which will be addressed in minor updates and as we receive feedback. Pay attention to the upcoming roadmap.

Who is this certification for?
Consider the ExamPro GenAI Essentials if…
• You are preparing prerequisite knowledge to take the free GenAI Bootcamp so you can be successful in its completion and grading.
• You need broad and practical knowledge of GenAI solutions so you have the technical flexibility to move in any technical direction.
• You need to focus on implementation and deliver GenAI workloads that are both secure and in-budget.

The GenAI Roadmap (ExamPro GenAI maturity learning model)
• Certification course: confidence in understanding and talking about GenAI, and having the tools to get started building.
• Project-based Bootcamp: the proof that you can build GenAI workloads through hands-on projects. https://genai.cloudprojectbootcamp.com/register

How Long to Study to Pass?
• Beginner (basic knowledge of cloud and programming): ~30 hours
• Experienced (have passed multiple GenAI certifications; have practical experience working in cloud and local development): ~15 hours
• ??? hours (an average study time cannot be determined)
Suggested split: 60% lecture and labs, 40% practice exams. Recommended: study 1–2 hours a day for 15 days.

What does it take to pass the exam?
1. Watch the video lectures and memorize key information.
2. Do the hands-on labs and follow along within your own account.
3. Do the paid online practice exams that simulate the real exam.
Sign up and redeem your FREE practice exam (no credit card required): https://www.exampro.co/exp-genai-001

Exam Guide – Grading
• Passing grade is *750/1000; you need to get "around" 75% to pass.
• ExamPro uses scaled scoring.

Exam Guide – Response Types
• There are 65 questions (65 scored questions).
• You can afford to get 16 scored questions wrong.
• There is no penalty for wrong answers.
Format of questions:
• Multiple choice
• Multiple answer
• Case studies

Exam Guide – Duration
• Duration of ~2 hours; you get ~2 minutes per question.
• Exam time is 130 minutes; seat time is 160 minutes.
Seat time refers to the amount of time you should allocate for the exam. It includes:
• Time to review instructions
• Showing the online proctor your workspace
• Reading and accepting the NDA
• Completing the exam
• Providing feedback at the end

Where do you take the exam?
Online, from the convenience of your own home. ExamPro delivers exams via:
• TeacherSeat Anchor (online proctored exam system)
A "proctor" is a supervisor, or person who monitors students during an examination.

Exam Guide – Valid Until
Valid forever. If there is a major version update, you may want to consider recertifying to update your knowledge.

What is AI?
• What is Artificial Intelligence (AI)? Machines that perform jobs that mimic human behavior.
• What is Machine Learning (ML)? Machines that get better at a task without explicit programming.
• What is Deep Learning (DL)? Machines that use an artificial neural network, inspired by the human brain, to solve complex problems.

Generative AI
What is Generative AI (GenAI)? GenAI is a type of artificial intelligence capable of generating new content, such as text, images, music, or other forms of media. (Example: Midjourney generating a graphic.)
GenAI is often conflated with LLMs, but GenAI and LLMs are two different things.

AI vs Generative AI
What is Artificial Intelligence (AI)?
AI is computer systems that perform tasks typically requiring human intelligence. These include:
• problem-solving
• decision-making
• understanding natural language
• recognizing speech and images
AI's goal is to interpret, analyze, and respond to human actions — to simulate human intelligence in machines.
• Simulate: mimic aspects, resemble behaviour
• Emulate: replicate exact processes and mechanisms
AI applications are vast and include areas like:
• expert systems
• natural language processing
• speech recognition
• robotics
AI is used in various industries for tasks such as:
• B2C: customer service chatbots
• Ecommerce: recommendation systems
• Auto: autonomous vehicles
• Medical: medical diagnosis

AI vs Generative AI
What is Generative AI?
Generative AI (GenAI) is a subset of AI that focuses on creating new content or data that is novel and realistic. It can not only interpret or analyze data but also generate new data itself. It includes generating text, images, music, speech, and other forms of media. It often involves advanced machine learning techniques:
• Generative Adversarial Networks (GANs)
• Variational Autoencoders (VAEs)
• Transformer models, e.g. GPT
Generative AI has multiple modalities:
• Vision: realistic images and videos
• Text: generating human-like text
• Audio: composing music
• Molecular: drug discovery via genomic data
Large Language Models (LLMs), which generate human-like text, are a subset of GenAI, but they are often conflated with GenAI as a whole because they are its most popular and most developed form.

AI vs Generative AI
• Functionality — AI focuses on understanding and decision-making; Generative AI is about creating new, original outputs.
• Data handling — AI analyzes and makes decisions based on existing data; Generative AI uses existing data to generate new, unseen outputs.
• Applications — AI spans various sectors, including data analysis, automation, natural language processing, and healthcare; Generative AI is creative and innovative, focusing on content creation, synthetic data generation, deepfakes, and design.

LLM Landscape

What is an Agent?
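The agent slide above is a title without accompanying text in this export; the Bot vs Agent comparison later in the deck explains the idea. As a rough, hypothetical illustration only, here is a minimal tool-using agent loop in Python. The `search_web` tool and the `llm_decide` helper are stand-ins invented for this sketch; a real agent would replace them with an actual tool and an LLM call.

```python
# Hypothetical agent loop: the model repeatedly decides whether to call a tool
# or to answer, observes the tool result, and continues until done.

def search_web(query: str) -> str:
    """Stand-in tool: a real agent would call a search API here."""
    return f"Top result for '{query}' (placeholder)"

def llm_decide(history: list[str]) -> dict:
    """Stand-in for an LLM call that returns either a tool call or a final answer."""
    if not any(line.startswith("OBSERVATION:") for line in history):
        return {"action": "tool", "tool": "search_web", "input": "GenAI certification"}
    return {"action": "answer", "text": "Here is a summary based on what I found."}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        decision = llm_decide(history)
        if decision["action"] == "answer":            # the agent decides it is done
            return decision["text"]
        observation = search_web(decision["input"])    # call the chosen tool
        history.append(f"OBSERVATION: {observation}")  # feed the result back in
    return "Gave up after max_steps."

print(run_agent("Find out what the GenAI Essentials course covers"))
```

The key difference from a scripted bot is that the loop lets the model choose the next action based on what it has observed so far, rather than following a fixed decision tree.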
Responsible AI Practices
1. Transparency
• Clear communication about AI capabilities and limitations
• Explainable AI systems where possible
2. Fairness and non-discrimination
• Mitigating bias in data and algorithms
• Ensuring equal treatment across different demographics
3. Privacy and data protection
• Safeguarding user data
• Compliance with data protection regulations
4. Safety and security
• Robust testing for potential risks and vulnerabilities
• Implementing safeguards against misuse or attacks
5. Accountability
• Clear ownership and responsibility for AI systems
• Mechanisms for redress in case of harm
6. Human oversight
• Maintaining human control over critical decisions
• Avoiding over-reliance on AI for important judgments
7. Sustainability
• Considering environmental impact of AI systems
• Efficient use of computational resources
8. Ethical design
• Incorporating ethical considerations from the outset
• Regular ethical audits of AI systems
9. Collaboration and knowledge sharing
• Engaging with diverse stakeholders
• Contributing to AI safety research
10. Continuous monitoring and improvement
• Regular assessment of AI system performance and impact
• Updating systems to address emerging issues
There exist multiple authoritative sources on Responsible AI practices:
• The IEEE's Ethically Aligned Design
• The OECD AI Principles
• The European Commission's Ethics Guidelines for Trustworthy AI

Bot vs Agent
• Tasks: bots handle simple, specific tasks; agents handle complex, broad tasks.
• Behaviour: bots follow pre-defined scripts or decision trees; agents have more agency in decision making.
• Context: bots have limited context understanding; agents are better at using context.
• Use cases: bots suit customer service, simple queries, and repetitive tasks; agents suit complex problem-solving, analysis, and multi-step tasks.
• Scalability: bots are easily scalable for specific tasks; agent scalability depends on the complexity of the tasks.
• NLP: bots use basic NLP and are not often powered by LLMs; agents use advanced NLP and are often powered by LLMs.

What is a Foundational Model?
A Foundational Model (FM) is a general-purpose model that is trained on vast amounts of data. We say that an FM is pretrained: it is trained once on broad data and can then be fine-tuned for specific tasks.
(Diagram: training data — text, images, videos, structured data — is used to train the FM, which is then used for prediction, classification, text generation, video generation, image generation and audio generation.)
LLMs are a specialized subset of FMs that use the transformer architecture.

What is a Large Language Model (LLM)?
A Large Language Model (LLM) is a Foundational Model that implements the transformer architecture. During the training phase, the model learns the semantics (patterns) of language, such as grammar, word usage, sentence structure, style and tone. It is a simplification to say that an LLM just predicts the next token in a sequence; researchers still do not fully understand how LLMs generate their outputs.

What are Embeddings?
What is a vector? An arrow with a length and a direction.
What is a vector space model? It represents text documents (or other types of data) as vectors in a high-dimensional space.
What are embeddings? They are vectors of data used by ML models to find relationships between data. ML models can also create embeddings.
You can think of embeddings as external memory that ML models use when performing a task. Embeddings can be shared across models (multi-model pattern) to help coordinate a task between models.
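To make the vector-space idea concrete, here is a small sketch using NumPy with made-up 4-dimensional vectors (real embeddings typically have hundreds or thousands of dimensions). Cosine similarity is a common way to measure how "close" two embeddings are.

```python
import numpy as np

# Toy "embeddings": the values and dimensionality here are invented for illustration.
doc_a = np.array([0.9, 0.1, 0.3, 0.0])   # e.g. a document about dogs
doc_b = np.array([0.8, 0.2, 0.4, 0.1])   # e.g. a document about puppies
doc_c = np.array([0.0, 0.9, 0.1, 0.7])   # e.g. a document about tax law

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """1.0 = same direction, ~0.0 = unrelated, -1.0 = opposite."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(doc_a, doc_b))  # high: related documents sit close together
print(cosine_similarity(doc_a, doc_c))  # lower: unrelated documents are further apart
```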
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a model released in 2018 by Google researchers. https://huggingface.co/google-bert
If you take the Transformer architecture, keep only the encoder, and stack those encoders, you get BERT. BERT is bidirectional, meaning it reads text both left-to-right and right-to-left to understand the context of the text.

BERT
BERT is pre-trained on the following tasks:
• Masked language modeling (MLM)
  • The model is given input where some tokens are masked.
  • Think of asking BERT to fill in the blanks of a sentence.
• Next sentence prediction (NSP)
  • BERT is given two sentences, A and B.
  • BERT has to predict whether B would follow A.
BERT can then be fine-tuned to perform tasks such as:
• Named Entity Recognition (NER)
• Question Answering (QA)
• Sentence pair tasks
• Summarization
• Feature extraction / embeddings
• and more…
There are multiple BERT model sizes:
• BERT Base: ~110M parameters
• BERT Large: ~340M parameters
• BERT Tiny: ~4M parameters
• and ~24 other model sizes…
BERT variants:
• RoBERTa
• DistilBERT
• ALBERT
• ELECTRA
• DeBERTa
While BERT is an older model, it is still widely used and remains a ubiquitous baseline in natural language processing (NLP) experiments.

BERT
Example of using BERT to perform sentiment analysis via the Hugging Face "sentiment-analysis" pipeline: Hugging Face downloads a BERT-family uncased model that has been fine-tuned for sentiment analysis (a code sketch follows below). BERT needs to be fine-tuned for specific tasks beyond its pretraining.

Sentence Transformers
Sentence Transformers (aka SBERT) is built on top of BERT. It creates a single vector for an entire sentence. When comparing sentences for similarity, this performs much better than using BERT directly, which produces a vector for every single token. Example: using a pre-trained model to generate embeddings and compare the similarity of two sentences (see the sketch below). https://sbert.net/
Use cases:
• Embeddings
• Semantic search
• Retrieve and rerank
• Clustering
• Image search
• and more…
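The sentiment-analysis slide above refers to a screenshot of the Hugging Face pipeline. A minimal sketch of that example might look like the following; the default model the pipeline downloads can change between library versions, so pinning an explicit model name is safer.

```python
from transformers import pipeline

# The "sentiment-analysis" task downloads a BERT-family model that has been
# fine-tuned for sentiment classification. Pass model="..." to pin a specific one.
classifier = pipeline("sentiment-analysis")

print(classifier("I love studying for the GenAI Essentials exam!"))
# -> [{'label': 'POSITIVE', 'score': 0.99...}]
print(classifier("This traffic jam is making me miserable."))
# -> [{'label': 'NEGATIVE', 'score': 0.99...}]
```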
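Similarly, the Sentence Transformers slide describes comparing two sentences. A small sketch with the sentence-transformers library could look like this; `all-MiniLM-L6-v2` is assumed here as a commonly used small pre-trained model.

```python
from sentence_transformers import SentenceTransformer, util

# Load a small pre-trained SBERT model (assumed model name for illustration).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The weather is lovely today.",
    "It is sunny and warm outside.",
]

embeddings = model.encode(sentences)                     # one vector per sentence
similarity = util.cos_sim(embeddings[0], embeddings[1])  # cosine similarity (1x1 tensor)
print(float(similarity))                                 # closer to 1.0 = more similar
```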
CNN and RNN
Convolutional Neural Networks (CNNs) are used for processing and analyzing visual data, such as images and videos. They detect spatial features like edges and textures through filters, followed by pooling layers that reduce dimensionality and fully connected layers that perform the final classification or regression task. CNNs do not feed outputs back into their inputs; they feed them forward to the next layer.
Recurrent Neural Networks (RNNs) handle sequential data, where the order of the data matters, e.g. natural language processing (NLP). They maintain a hidden state that captures information from previous inputs, enabling the network to understand context and temporal dependencies, with each time step's input influencing subsequent states. RNNs feed their outputs back into their inputs through the same layer.
One thing CNNs and RNNs have in common is that they process data piece by piece — CNNs through local filters, RNNs one time step at a time — rather than attending to a whole sequence at once.

Transformer Architecture
The Transformer architecture was developed by researchers at Google and is effective at natural language processing (NLP) thanks to multi-head attention and positional encoding. The transformer model architecture consists of two parts:
1. Encoder: reads and understands the input text. It's like a smart system that goes through everything it's been taught (which is a lot of text) and picks up on the meanings of words and how they're used in different contexts.
2. Decoder: based on what the encoder has learned, this part generates new pieces of text. It's like a skilled writer that can make up sentences that flow well and make sense.

Attention Is All You Need Whitepaper
The transformer is a type of neural network architecture invented by researchers at Google and the University of Toronto (Google's whitepaper on transformers, "Attention Is All You Need"). The transformer architecture is the basis for Large Language Models (LLMs). Transformers rely on two things:
• Positional encoding
• Self-attention

Tokenization
Tokenization is the process of breaking input data (text) into smaller parts. Each token is mapped to a unique ID in the model's vocabulary.
Tokenization algorithms:
• Byte Pair Encoding (BPE), used by GPT-3
• WordPiece, used by BERT
• SentencePiece, used by Google T5 (and models such as LLaMA)
When working with an LLM, the input text must be converted (or "tokenized") into a sequence of tokens that match the model's internal vocabulary. When an LLM is trained, it builds an internal vocabulary of tokens, typically between 30,000 and 100,000 tokens.

Tokens and Capacity
When using transformers, the decoder continuously feeds the tokens generated so far back in as input to help predict the next token. (Diagram: "The quick" → "The quick brown" → "The quick brown fox" → "The quick brown fox jumps" → "The quick brown fox jumps over".)
• Memory: each token in a sequence requires memory. As the token count increases, memory usage increases and can eventually be exhausted.
• Compute: the model performs more operations for each additional token, so longer sequences require more compute.
AI services that offer Models-as-a-Service will often impose a limit on the combined input and output tokens.
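As a small illustration of tokenization and of counting the tokens that capacity limits are measured in, here is a sketch using BERT's WordPiece tokenizer from Hugging Face; the exact splits differ by model and tokenization algorithm.

```python
from transformers import AutoTokenizer

# BERT uses a WordPiece tokenizer with a vocabulary of roughly 30,000 tokens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization breaks text into smaller pieces."
tokens = tokenizer.tokenize(text)   # subword strings
ids = tokenizer.encode(text)        # ids in the model's vocabulary (with special tokens added)

print(tokens)    # e.g. ['token', '##ization', 'breaks', 'text', 'into', 'smaller', 'pieces', '.']
print(ids)       # each token mapped to a unique vocabulary id
print(len(ids))  # token count: this is what context-window limits are measured in
```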
What are Embeddings?
A vector is an arrow with a length and a direction. A vector space model represents text documents or other types of data as vectors in a high-dimensional space. Embeddings are vectors of data used by ML models to find relationships between data; ML models can also create embeddings. Different embedding algorithms capture different kinds of relationships. You can think of embeddings as external memory for performing a task, and embeddings can be shared across models (multi-model pattern) to help coordinate a task between models.

Positional Encoding
Positional encoding is a technique used to preserve the order of words when processing natural language. Transformers need positional encoding because they do not process data sequentially and would otherwise lose the order of the words when analyzing large bodies of text. Positional encoding adds a positional vector to each word to keep track of the positions of the words. (Example: "I heard a dog bark loudly at a cat", with positions 0–8 assigned to the words.)

Attention
Attention figures out how important each word (or token) in a sequence is to the other words in that sequence by assigning weights.
• Self-attention: computes attention weights within the same input sequence, where each element attends to all other elements. Used in transformers to model relationships within a sequence (e.g., words in a sentence).
• Cross-attention: computes attention weights between two different sequences, allowing one sequence to attend to another. Used in tasks like translation, where the output sequence (decoder) needs to focus on the input sequence (encoder).
• Multi-head attention: combines multiple self-attention (or cross-attention) heads in parallel, each focusing on different aspects of the input. Used in transformers to improve performance and capture various dependencies simultaneously.

The Journey to Large Language Models
• 1950s–1980s: early NLP — rule-based systems, statistical methods
• 1990s–2000s: bag of words and n-grams, Support Vector Machines (SVMs), Latent Semantic Analysis (LSA)
• 2010s: word embeddings, Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNNs), Gated Recurrent Units (GRUs), sequence-to-sequence (Seq2Seq), attention mechanisms
• 2017–now: the Transformer architecture, the basis for LLMs — BERT, GPT, RoBERTa, T5, and today's LLMs

Order of Solution
Always start with prompt engineering; RAG and supervised fine-tuning (SFT) sit further along the complexity and time/cost axes:
• Prompt engineering — the starting point for every solution.
• RAG — when you need to search and retrieve data: specific data from your data stores, or up-to-date data from your data stores.
• SFT — when you are training for a specific task (e.g. classification), when a large prompt document causes latency issues, or when you have a large number of prompt/document examples for better evaluations.

Supervised Fine-Tuning
Supervised Fine-Tuning (SFT) is when you feed a large amount of training data to adjust the LLM's weights for a specific task.
• SFT is more cost-effective than training a model from scratch.
• SFT can cause overfitting, which could result in poor predictions when handling unseen or different data.

RAG
Retrieval-Augmented Generation (RAG) is used to access an external source of data before an LLM produces its output.
• RAG helps with accessing a specific corpus to improve the accuracy of responses.
• RAG helps with accessing an up-to-date corpus to improve the freshness of responses.
(A minimal retrieval sketch follows below.)
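To tie RAG back to the embeddings material above, here is a minimal retrieval sketch in plain NumPy. The corpus and the `embed` function are stand-ins invented for this sketch; a real workload would call an embedding model, store the vectors in a vector database, and then send the augmented prompt to an LLM.

```python
import numpy as np

corpus = [
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
    "Support is available by email 24/7.",
]

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: hashes characters into a small normalized vector.
    A real system would call an embedding model here instead."""
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

doc_vectors = np.stack([embed(doc) for doc in corpus])

def retrieve(question: str, k: int = 1) -> list[str]:
    scores = doc_vectors @ embed(question)       # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]           # indices of the k most similar documents
    return [corpus[i] for i in top]

question = "How long do I have to return a product?"
context = " ".join(retrieve(question, k=1))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this augmented prompt would then be sent to the LLM
```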
Zero Shot
Zero-shot prompting is when the model can perform the expected task with no prior examples provided in the prompt. The LLM relies on the knowledge built in during its training to complete the task.

Few Shot
Few-shot prompting is when the model performs the expected task with examples provided in the prompt.

Prompt Chaining
Prompt chaining is where the output of one LLM prompt is used as the input for the next prompt in a sequence. This is useful when the context window is too small to complete a single large or complex task, so we break the task up into smaller tasks.
Complex prompt:
"Given this Japanese-subtitle movie file, translate the text from Japanese to English, identify the grammar being used, and then convert the file into a .jsonlines file where each line has the original Japanese, the English translation, and the referenced grammar rules with grammar examples."
Simple prompts (chained):
• (iterate over each line) Translate the text from Japanese to English.
• Give a list of the grammar rules being used → (list)
• (iterate over the list) Generate an explanation of this grammar rule and provide 5 examples.
• Format the original line, translated line and grammar rule examples into a JSON file.

Chain of Thought
Chain-of-Thought (CoT) prompting is when you tell the model to do step-by-step reasoning so it produces a more accurate result.
• Without chain of thought: "Which two prime numbers sum to 10?"
• With chain of thought: "Please solve this problem step-by-step. First outline your reasoning in a clear, logical chain of thought. Then provide the final concise answer."

Tree of Thought
Tree-of-Thought (ToT) prompting is where the prompt instructs the model to explore possible "states" and "transitions" rather than trying to produce an answer immediately in a single chain of thought.

CO-STAR
CO-STAR is a prompting framework created by Sheila Teo.
• Context — provide background information on the task.
  ## CONTEXT
  A free, online introductory Japanese language workshop for beginners.
• Objective — define the task you want the LLM to perform.
  ## OBJECTIVE
  Encourage people to register by highlighting the quick, fun, and interactive nature of the session.
• Style — specify the writing style you want the LLM to use.
  ## STYLE
  Engaging, inviting, clear educational tone.
• Tone — set the attitude of the response.
  ## TONE
  Persuasive yet friendly.
• Audience — identify who the response is intended for.
  ## AUDIENCE
  Busy adults curious about learning Japanese, with no prior experience.
• Response — provide the response format.
  ## RESPONSE
  A concise, impactful Facebook post.

GenAI Modalities
A modality, in the context of AI, is the way something happens, is experienced, or is captured. Think of the modalities of human experience, which are based around the five senses (inputs). The primitive GenAI modalities are:
1. Text — LLMs
2. Vision — images (e.g. Stable Diffusion) and videos (e.g. OpenSora)
3. Audio — e.g. AudioCraft, Whisper
You could consider the following modalities as well, but they can also easily fall under an existing primitive GenAI modality:
• Code
• Touch/haptics, e.g. AR/VR
• Biological data, e.g. genomics
• Sensor data, e.g. IoT
• Math expressions
• 3D models
It's important to understand that GenAI does not mean exclusively LLMs, because LLMs focus only on the text modality. LLMs are talked about the most due to their maturity.

Vector Databases
A vector database is a datastore for embeddings. Examples: MongoDB Atlas, Redis, pgvector, Deep Lake, Elasticsearch, Chroma, Weaviate, Pinecone.

Vector Search vs Vector Databases
When looking into vector databases, you'll also come across vector search products. Vector search products may be limited in their capabilities compared to a vector database. The comparison covers exact vector search, approximate nearest neighbors (ANN), persistence, CRUD, storing objects, sharding, and replication: a full vector database supports all of these, while a vector search product may offer only partial support, or none, for some of them.

Ollama
Ollama is an LLM model manager that makes it easy to download, install and run LLMs on your laptop/desktop.
• Ollama runs models directly in its own runtime environment.
• Each model is self-contained.
• Download an installer and get going on Windows and Mac.
• Ollama can easily download updates for models.
(A Python usage sketch follows below.)
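Once a model has been pulled with Ollama, you can also call it from Python. Here is a minimal sketch using the ollama Python package; it assumes the Ollama server is running locally and that a model such as llama3.2 has already been downloaded (e.g. with `ollama pull llama3.2`).

```python
import ollama  # pip install ollama; talks to the local Ollama server (default port 11434)

response = ollama.chat(
    model="llama3.2",  # assumes this model has been pulled beforehand
    messages=[
        {"role": "user", "content": "In one sentence, what is a large language model?"},
    ],
)

# Dict-style access shown here; newer versions also expose response.message.content.
print(response["message"]["content"])
```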
https://github.com/ollama/ollama

Llamafile
Llamafile is a single binary format that contains both the model's weights and a way to serve the model. https://github.com/Mozilla-Ocho/llamafile
• Works by combining llama.cpp with Cosmopolitan Libc.
• Can run on multiple CPU microarchitectures.
• Supports GPU use.
• Can run on multiple operating systems.
• Weights are embedded as part of the file.
Models that are packaged as llamafiles include:
• LLaMA 3.2 1B/3B Instruct
• Gemma 2 2B/9B/27B Instruct
• LLaVA 1.5
• TinyLlama-1.1B
• Mistral-7B-Instruct
• Phi-3-mini-4k-instruct
• Mixtral-8x7B-Instruct
• WizardCoder-Python-13B/34B
• LLaMA-3-Instruct-8B/70B
• Rocket-3B
• OLMo-7B
• Text embedding models: E5-Mistral-7B-Instruct, mxbai-embed-large-v1

LangChain
LangChain is an open-source framework for rapidly prototyping agents or LLM workloads. Features include (a short sketch follows this list):
• Prompt templates — e.g. formatting user input
• Messages — e.g. standardized input and output for chat models
• Output parsers — e.g. outputting structured XML, JSON, YAML
• Document loaders (cracking) — e.g. PDFs, Markdown, CSV, HTML
• Text splitters (chunking) — e.g. by HTML heading, characters, JSON
• Adapters to various vector stores
• Adapters to various embeddings
• Various retrievers to implement RAG
• Indexing
• Tool use
• Support for working with multimodal models
• Agent workflows
• Example selectors
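As a small illustration of the prompt template and output parser features listed above, here is a sketch that pipes a LangChain prompt template into a chat model and parses the result to a string. It is shown with the langchain-openai integration and an assumed model name; any supported chat model integration (for example the Ollama one) could be swapped in.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI  # any supported chat model integration works here

# Prompt template: formats user input into messages for the chat model.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant for a GenAI study group."),
    ("human", "Explain {topic} in two sentences."),
])

llm = ChatOpenAI(model="gpt-4o-mini")     # assumes OPENAI_API_KEY is set in the environment
chain = prompt | llm | StrOutputParser()  # prompt -> model -> plain string

print(chain.invoke({"topic": "retrieval-augmented generation"}))
```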
LangChain
LangChain is *not well-suited for production use cases.
• LangChain (the company) has built a suite of tools around the LangChain framework, so they may think otherwise.
You could ship an LLM workload as a Minimum Viable Product (MVP), but you'd eventually need to rebuild your workload by orchestrating multiple containers, e.g. with OPEA. LangChain was one of the earliest GenAI frameworks and made it easy to swap out LLMs behind a unified API. When LLMs had small context windows, LangChain worked around this limitation by providing multiple chaining strategies, such as summarization.

LlamaIndex
LlamaIndex is an open-source framework for rapidly prototyping agents or LLM workloads. LlamaIndex has wider support for data connectors and advanced RAG techniques.
• Data connectors — e.g. Azure AI Search
• Data indexes
• Engines
• Agents
• Observability/evaluation
• Workflows
LlamaIndex has its own platform, LlamaCloud, for enterprise production, similar to LangChain having its own platform offering. Consider that, just like LangChain, LlamaIndex isn't designed for production use cases.

GGUF and GGML
GGUF and GGML are file formats used for storing models for inference, often models like GPT (Generative Pre-trained Transformer). GGUF is the successor of the GGML format, so you will most often see GGUF.
• GGUF is a binary format designed explicitly for fast loading and saving of models.
• GGUF is specifically designed to store inference models and perform well on consumer-grade computer hardware.
• GGUF supports fine-tuning.
• Models initially developed in frameworks like PyTorch can be converted to GGUF format for use with GGUF-compatible engines.
GGUF models can be executed using:
• Ollama
• GPT4All
• llama.cpp
(Example: on Hugging Face, Google Gemma 7B has a .gguf file.)

Guardrails
Guardrails are additional checks and filters applied to a model's input and output.

Prompt Injections
Prompt injection is a technique for manipulating AI models like LLMs by inserting hidden instructions into the input.
• Exploits how models process input
• Attempts to override initial instructions
• Uses natural language the model understands
• Targets weak boundaries between user input and system instructions
• Leverages the context sensitivity of LLMs
Mitigation strategies include input sanitization, clear separation of instructions and user input, and model fine-tuning.

OWASP Top 10 for LLM Apps
The OWASP Top 10 for LLM Applications is a list of the top security concerns when building applications that utilize LLMs.
• LLM01: Prompt Injection — manipulating LLMs via crafted inputs can lead to unauthorized access, data breaches, and compromised decision-making.
• LLM02: Insecure Output Handling — neglecting to validate LLM outputs may lead to downstream security exploits, including code execution that compromises systems.
• LLM03: Training Data Poisoning — tampered training data can impair LLM models, leading to responses that may compromise security, accuracy, or ethical behavior.
• LLM04: Model Denial of Service — overloading LLMs with resource-heavy operations can cause service disruptions and increased costs.
• LLM05: Supply Chain Vulnerabilities — depending upon compromised components, services or datasets undermines system integrity, causing data breaches and system failures.
• LLM06: Sensitive Information Disclosure — failure to protect against disclosure of sensitive information in LLM outputs can result in legal consequences or a loss of competitive advantage.
• LLM07: Insecure Plugin Design — LLM plugins processing untrusted inputs and having insufficient access control risk severe exploits like remote code execution.
• LLM08: Excessive Agency — granting LLMs unchecked autonomy to take action can lead to unintended consequences, jeopardizing reliability, privacy, and trust.
• LLM09: Overreliance — failing to critically assess LLM outputs can lead to compromised decision making, security vulnerabilities, and legal liabilities.
• LLM10: Model Theft — unauthorized access to proprietary large language models risks theft, loss of competitive advantage, and dissemination of sensitive information.

LitGPT
LitGPT is a CLI tool to pretrain, finetune and deploy LLMs at scale. https://github.com/Lightning-AI/litgpt
• Scratch implementations, no abstractions
• Flash Attention
• Utilizes Lightning Fabric
• Fully Sharded Data Parallel (FSDP), 1–1000+ GPUs/TPUs
• Parameter-efficient finetuning (PEFT): LoRA, QLoRA, Adapter and Adapter v2
• Recipes for 20+ LLMs
A Python API is in the works for LitGPT so that, in the future, you can easily train within Jupyter notebooks.

LitGPT — Training Example
An example of LoRA fine-tuning: finetune and deploy Llama 3B using Lightning.AI with the following:
• 1 Nvidia L4 Tensor Core GPU
• 16 vCPUs
• 64 GB RAM
• 24 GB VRAM
With a dataset of 6,471 examples, this trained in 766.72 seconds (a little under 13 minutes). We can then chat with our fine-tuned model. An equivalent AWS EC2 instance would be a g6.4xlarge: $1.323/hour, 16 vCPUs, 64 GB RAM, 1x L4.

Flash Attention
Flash Attention is a memory-efficient, faster variant of the traditional attention mechanism, optimized for GPUs to handle longer sequences with reduced computation and memory overhead. https://github.com/Dao-AILab/flash-attention
There are multiple versions:
• FlashAttention
• FlashAttention-2 — a refinement of FlashAttention
• FlashAttention-3 — optimized for Hopper GPUs, e.g. H100
Flash Attention achieves efficiency by computing attention scores in small chunks, fusing operations like softmax and matrix multiplications to minimize memory use and speed up computation.
Llama.cpp
Llama.cpp is an inference server for LLMs. It implements the underlying architecture of Meta's LLaMA in C/C++, intended to improve performance when running models on CPUs. The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, locally and in the cloud.
Llama.cpp has:
• SDKs for many different programming languages, e.g. yoshoku/llama_cpp.rb
• A CLI tool, e.g. llama-cli
• A lightweight server, e.g. llama-server
• Integrations with many UI tools
Llama.cpp works only with the GGUF format for model weights.

Bitnet.cpp
1-bit LLMs represent the most extreme form of quantization, using a single bit (0 or 1) for each parameter. This greatly reduces model size and computational needs compared to conventional LLMs. Microsoft's bitnet.cpp is an inference framework for 1-bit LLMs. (The slide shows a comparison of llama.cpp versus bitnet.cpp for inferencing 1-bit LLMs.)
Example 1-bit LLMs:
• bitnet_b1_58-large
• bitnet_b1_58-3B
• Llama3-8B-1.58-100B-tokens
• Falcon3 family

KServe
KServe provides a Kubernetes Custom Resource Definition (CRD) for serving predictive and generative ML models on Kubernetes via KNative. KServe can serve the following kinds of models: TensorRT, TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX. (Example: KServe using Hugging Face Transformers to serve the model.)

TGI and TEI
Text Generation Inference (TGI) is a Hugging Face library for serving LLMs. Text Embeddings Inference (TEI) is a Hugging Face library for serving models that output embeddings.

vLLM
vLLM is an open-source library to serve LLMs. vLLM can be served in various ways, but one of the easiest is via Docker.

Ray Serve
Ray is a collection of libraries for AI workloads. Ray Serve can be used to serve AI models, such as LLMs, with vLLM. https://github.com/ray-project/ray Ray is often positioned as a competitor to Apache Spark. Using vLLM on its own scales to a single server; using Ray Serve with vLLM, vLLM workloads can be distributed across multiple servers.

TensorRT LLM
NVIDIA TensorRT is an ecosystem of APIs for high-performance deep learning inference. TensorRT optimizes the model for the target hardware, e.g. Nvidia GPUs. TensorRT-LLM allows you to serve LLMs using the TensorRT engine from Python code. Workflow: convert the checkpoint to the TensorRT-LLM checkpoint format → build the engine → run inference.

Context Window Caching
Context Window Caching, also known as Prompt Caching or Context Caching, is when the computed context is stored in memory to help improve response times for LLMs.
Use cases for prompt caching:
• Chatbots with extensive system instructions
• Repetitive analysis of lengthy video files
• Recurring queries against large document sets
• Frequent code repository analysis or bug fixing
Prompt caching is offered by some providers for specific model versions, e.g. Google Gemini and Anthropic Claude Sonnet. Cached tokens might be billed at a reduced rate, saving cost along with gaining improved response times.
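Many of the serving options covered above (llama.cpp's llama-server, vLLM, and similar engines) can expose an OpenAI-compatible HTTP API. As a hedged sketch of what calling such a local endpoint looks like, here is the openai Python client pointed at an assumed local URL; the port and model name depend entirely on how you started the server.

```python
from openai import OpenAI

# Point the OpenAI client at a local, OpenAI-compatible server
# (e.g. one started with llama-server or vLLM); URL, port and model name are examples only.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

completion = client.chat.completions.create(
    model="my-local-model",  # whatever model name the local server registered
    messages=[{"role": "user", "content": "Give me one tip for passing a GenAI exam."}],
    max_tokens=100,
)

print(completion.choices[0].message.content)
```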
Structured JSON
Structured JSON is when we want to force an LLM to produce structured JSON as its output. There are multiple techniques to force structured JSON:
• Context-free grammars
• Finite state machines (FSMs)
• Regexes for constrained decoding
Structured output can be:
• a third-party library
• built into the API of an LLM
As each token is generated, the LLM's next possible tokens are limited based on the schema. (Diagram: the input and the schema go into the LLM, which emits output token by token, e.g. `{`, `"a"`, `:` …)
The schema is represented either as:
• Pydantic
• JSON Schema
Some providers ask you to also tell the LLM to generate JSON in the prompt, but often that is not necessary.

Structured JSON
When using OpenAI, there is an API for structured output and it requires the use of Pydantic. When using Cohere, there is an API for structured output and it requires a JSON schema based on json-schema.org. Getting back ideal structured JSON is challenging even with the built-in APIs of specific providers.

Instructor
Instructor is a library that can produce structured JSON output. https://python.useinstructor.com/
Instructor integrates with the following APIs (a sketch follows this list):
• Anthropic
• Azure OpenAI
• Cerebras
• Cohere
• Cortex
• Fireworks
• Gemini
• Groq
• LiteLLM
• llama-cpp-python
• Mistral
• Ollama
• OpenAI
• Together
• Vertex AI
• Writer
• DeepSeek
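A minimal sketch of structured output with Instructor and Pydantic follows. The model name and schema fields are illustrative; the idea is that Instructor wraps an OpenAI-style client so the response is parsed and validated against the Pydantic model.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class ExamQuestion(BaseModel):
    question: str
    choices: list[str]
    correct_choice: int

# Wrap an OpenAI-compatible client so responses are validated into the schema above.
client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o-mini",          # illustrative model name
    response_model=ExamQuestion,  # Instructor constrains/validates against this schema
    messages=[{"role": "user", "content": "Write one multiple-choice question about RAG."}],
)

print(result.model_dump_json(indent=2))  # output conforms to the ExamQuestion schema
```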
Sandboxing LLMs
Sandboxing is when we put a workload in its own isolated environment, such as a container. LLMs are more likely to exhaust resources, hang or crash, so sandboxing your LLMs or other AI models allows for better management when things go wrong.

OPEA
Open Platform for Enterprise AI (OPEA) is a collection of open-source Linux Foundation projects that provide blueprints for deploying AI workloads using containers. https://github.com/opea-project
OPEA projects:
• GenAIExamples — a collection of templates/blueprints for common AI workloads
• GenAIComps — a collection of microservices so you can build your own AI workloads
• GenAIEval — evaluation, benchmarks and scorecards, targeting performance on throughput and latency, accuracy on popular evaluation harnesses, safety, and hallucination
• GenAIStudio — a low-code platform to enable users to construct, evaluate, and benchmark GenAI applications
• and more…
OPEA projects are unopinionated templates that allow you to deploy in various formats (e.g. Docker, Kubernetes) onto various hardware (e.g. Intel, AMD, Nvidia).

OPEA — GenAIExamples
GenAIExamples is a collection of "mega-services" for specific AI workloads. If you want to deploy a chatbot that uses RAG, you can modify and deploy the ChatQnA example. These mega-services are composed of the microservices found in GenAIComps. Examples include:
AgentQnA, AudioQnA, AvatarChatbot, ChatQnA, CodeGen, CodeTrans, DBQnA, DocIndexRetriever, DocSum, EdgeCraftRAG, FaqGen, GraphRAG, InstructionTuning, MultimodalQnA, ProductivitySuite, RerankFinetuning, SearchQnA, Text2Image, Translation, VideoQnA, VisualQnA, WorkflowExecAgent

OPEA — GenAIComps
GenAIComps are microservices that you can use as the building blocks for your AI workloads. A microservice can be configured to work in various ways with various technologies. Components include:
agent, animation, asr/whisper, chathistory, cores, dataprep, embeddings, feedback_management, finetuning, guardrails, image2image, image2video, intent_detection, llms, lvms, nginx, prompt_registry, ragas, reranks, retrievers, text2image, texttosql, tts, vectorstores, web_retrievers

Quantization
Quantization is a compression technique that converts the weights and activations within an LLM to a lower-precision data type, e.g. converting from FP32 to INT8.
Benefits:
• Smaller models
• Faster inference
• Reduced resource consumption, e.g. less RAM usage
Disadvantages:
• Potential loss in quality (not a perfect analogy, but think of a wave sampled with fewer data points)
Examples of quantization: QLoRA, GGML/GGUF. Quantization can be a complicated process; you will often see code examples with complex mathematical conversions.

Knowledge Distillation
Knowledge distillation is when you transfer knowledge from a large model to a smaller model so that the smaller model performs the same task faster and at a lower resource cost. The goal of knowledge distillation is to produce a Small Language Model (SLM).
• Soft targets — predictions from the large model
• Hard targets — labels from the ground-truth data
Minitron is a family of SLMs produced through knowledge distillation and pruning.

Medusa
Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously.
• Trains multiple decoding heads on the same model
• Training is parameter-efficient, so "GPU-poor" machines can train
• Delivers approximately a 2x speed increase across a range of Vicuna models

TPU
A Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google for neural network machine learning, using Google's own TensorFlow software. TPUs are designed for a high volume of low-precision computation.
• 2016: TPU v1 — ASIC for neural network inference
• 2017: TPU v2 — added training capability, liquid cooling
• 2018: TPU v3 — 2x the performance of v2
• 2020: TPU v4 — 2-3x faster than v3
• 2023: TPU v5p — liquid-cooled pods with ~9 exaflops of performance

iGPU
An iGPU (integrated GPU) is when a CPU package includes graphics hardware capable of performing tasks similar to a dedicated GPU. The Intel Lunar Lake chip, for example, contains multiple systems including an iGPU. dGPU can be used to explicitly refer to a dedicated GPU.

VPUs
A VPU (Vision Processing Unit) is an AI accelerator specialized in machine vision tasks, e.g. CNNs (convolutional neural networks). The Intel Movidius is an example of a VPU; it comes in the form of a USB peripheral that can be plugged into a workstation.