Retrieval Augmented Generation
Week 5 - L1

Agenda
1. Recap on Large Language Models
2. Retrieval Augmented Generation (RAG)
3. RAG in a conversational setting

Recap on Large Language Models

Language modeling
A language model (LM) is a statistical distribution over sequences of words. A common way to learn this distribution is by predicting the next word given the previous context.

Large Language Models
Recently, large language models have taken this training paradigm to the next level by scaling both model size and training data. For example, GPT-3 (OpenAI) has 175 billion parameters and is trained on 45 terabytes of data.

Are LLMs a memory store?
It has been pointed out that LLMs can act as knowledge bases [1]. However, the size of the model is far smaller than the size of the training set; viewing a large language model as a compression of its training corpora is a common analogy. Will increasing the size of the LLM also improve its capacity to memorize?
(Figure from [2].)

How to make LLMs memorize more knowledge
Why is this necessary? There are use cases where it is necessary to memorize customized (internal) data:
● Personal assistant (healthcare, scheduling, or just chit-chat)
● Sales AI
● Concept definition generation
Two approaches:
● Finetuning
● Retrieval Augmented Generation

Pros and cons of finetuning
Pros
● The quality of the finetuned models is (likely) guaranteed
● We can change things such as the tone of the LLM (which may be important for some use cases)
Cons
● The cost of building an instruction dataset
● The computational cost of (continued) pretraining and finetuning models
In most cases, the cons outweigh the pros.

Retrieval Augmented Generation

A RAG pipeline
(Figure from retrieval-augmented-generation-notes.)

A basic retrieval component for RAG
It includes only a vector DB containing the documents and their vectorized representations. Retrieve the top-K documents most related to the query by measuring cosine similarity.

Text embedding - which model to choose?
There are a lot of models! See the MTEB leaderboard [3].

Text embedding - Sentence-BERT [4]
Want something more recent? Got it!
● INSTRUCTOR [5] - hkunlp/instructor-large
● E5 [6] - intfloat/e5-large-v2
Want a cheaper model? Got it!
● Sent2Vec [7] is a CBOW sentence embedding model that can operate comfortably on CPU (GitHub).

Populating the vector DB
Steps:
1. Collect the texts
2. Chunk the texts to your desired size
3. Vectorize the chunks
4. Index them

Cascaded Text Retrieval
It contains two stages:
1. Candidate retrieval: narrow the scope of the search by choosing the top-N candidates, thereby removing totally unrelated documents
2. Candidate re-ranking: rank the candidates and choose the top-k, with k <<< N

Why is an additional step a good idea? There are several reasons:
1. A re-ranking step may offer better precision in ranking
2. Several post-processing techniques may be applied, such as diversification of the results
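As a minimal sketch of the basic retrieval component described above (candidate retrieval by cosine similarity over embeddings): the sentence-transformers package, the choice of the E5 checkpoint from the embedding slides, and the toy documents are illustrative assumptions, not something the lecture prescribes.

```python
# A minimal sketch of the basic retrieval component: embed the documents once,
# embed the query, score by cosine similarity, and return the top-K documents.
# The model choice and example documents are illustrative, not prescribed here.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")  # E5 model mentioned on the embedding slides

documents = [
    "Albert Einstein proposed the theory of general relativity in 1915.",
    "The MTEB benchmark compares text embedding models across many tasks.",
    "Chunk internal documents before indexing them in a vector DB.",
]

# E5 models expect "passage: " / "query: " prefixes (per the model card).
doc_emb = model.encode([f"passage: {d}" for d in documents], normalize_embeddings=True)

def retrieve(query, k=2):
    """Return the top-k documents ranked by cosine similarity to the query."""
    q_emb = model.encode(f"query: {query}", normalize_embeddings=True)
    scores = doc_emb @ q_emb          # cosine similarity, since embeddings are L2-normalized
    top_k = np.argsort(-scores)[:k]   # indices of the k highest-scoring documents
    return [(documents[i], float(scores[i])) for i in top_k]

print(retrieve("Who proposed general relativity?"))
```

In practice the in-memory embedding matrix would be replaced by a vector DB index built in the "Populating the vector DB" steps, but the scoring logic stays the same.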
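The slides do not prescribe a particular model for the candidate re-ranking stage; a cross-encoder re-ranker is one common choice, sketched below under that assumption (the specific checkpoint is also an assumption).

```python
# A sketch of the candidate re-ranking stage of cascaded retrieval. The lecture
# does not name a specific re-ranker; a cross-encoder (here a widely used
# ms-marco MiniLM checkpoint, an assumption on my part) is one common choice.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, k=3):
    """Re-score the top-N candidates jointly with the query and keep the top-k (k <<< N)."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Usage: feed in the top-N candidates returned by the first-stage retriever, e.g.
# candidates = [doc for doc, _ in retrieve("Who proposed general relativity?", k=50)]
# print(rerank("Who proposed general relativity?", candidates, k=3))
```

The design rationale matches the list above: a cross-encoder reads the query and a candidate together, so it ranks more precisely than the bi-encoder, but it is too slow to score the whole collection, which is why it is applied only to the top-N candidates.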
RAG in a conversational setting

Why is the conversational setting more difficult?
Human conversations are heavily contextualized, i.e. dependent on previous conversation turns [8, 9]:
[User]: What is the name of the scientist who proposed general relativity?
[Assistant]: It is Albert Einstein.
[User]: When was he born, and how did that theory impact physics research at the time?

Query formulation
Rewriting the user query based on the context provided by the conversation history:
[User]: When was he born, and how did that theory impact physics research at the time?
→ [User]: When was Albert Einstein born, and how did that theory of general relativity impact physics research at the time?

Query formulation - A simple approach
Just prompt the LLM to do the rewriting for you.

Query formulation - pros and cons
Pros
● No further fine-tuning of any component in the RAG pipeline
Cons
● Additional cost in LLM inference
Open discussion: what happens when the conversation history gets too long?

References
[1] Language Models as Knowledge Bases?
[2] PaLM: Scaling Language Modeling with Pathways
[3] MTEB: Massive Text Embedding Benchmark
[4] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
[5] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
[6] Text Embeddings by Weakly-Supervised Contrastive Pre-training
[7] Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features
[8] Contextualized Query Embeddings for Conversational Search
[9] Few-Shot Conversational Dense Retrieval
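To make the "simple approach" to query formulation above concrete: a minimal sketch of prompting an LLM to rewrite the latest user question given the conversation history. The OpenAI client, the model name, and the prompt wording are all illustrative assumptions, not something the slides specify.

```python
# A hypothetical sketch of prompt-based query rewriting for conversational RAG.
# The OpenAI client, model name, and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_INSTRUCTION = (
    "Rewrite the last user question so that it is fully self-contained, "
    "resolving pronouns and references using the conversation history. "
    "Return only the rewritten question."
)

def rewrite_query(history, question, model="gpt-4o-mini"):
    """Rewrite `question` into a standalone retrieval query using the conversation history."""
    conversation = "\n".join(history)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": f"Conversation:\n{conversation}\n\nLast question: {question}"},
        ],
    )
    return response.choices[0].message.content.strip()

history = [
    "[User]: What is the name of the scientist who proposed general relativity?",
    "[Assistant]: It is Albert Einstein.",
]
print(rewrite_query(history, "When was he born, and how did that theory impact physics research at the time?"))
```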