PoorGPU guy RAG
How to Improve a Retrieval-Augmented Generation Pipeline in Text Generation with Preparation and Cleaning
When it comes to building a complex Retrieval-Augmented Generation application we meet a few hurdles: as soon as we have a long document, or multiple documents, we see a sharp drop in the accuracy of the answers.
Do we have to work on the chunk length? Is it a problem of metadata? Are we simply asking the wrong questions? Is the document clear enough?
How can we overcome these issues? Can we leverage the increased context lengths to bind together relevance and meaning?
In this newsletter we are going to answer a few of these questions: we all face them when working with generative AI.
Data collection and preprocessing are the first step in any data science project, usually called data ingestion. Unfortunately, this step tends to be underrated… a lot! That is a big mistake, because, remember: Garbage In, Garbage Out.
Data Collection and Preprocessing
The first step in any text generation pipeline is collecting the necessary data. This can involve gathering information about the topic at hand and identifying relevant resources.
Once we have collected the data, we can preprocess it to remove errors or inconsistencies that would otherwise surface during generation. This involves cleaning up the data, removing irrelevant information, formatting the text for better readability, and other quality-assurance measures.
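As a minimal, hypothetical sketch of what this cleanup can look like (the `clean_text` helper and its regexes are illustrative, not from any library):

```python
import re

def clean_text(raw: str) -> str:
    """Normalize whitespace and strip common extraction debris before chunking."""
    text = raw.replace("\x0c", "\n")        # form feeds left over from PDF extraction
    text = re.sub(r"-\n(\w)", r"\1", text)  # re-join words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()

docs = [clean_text(d) for d in raw_documents]  # raw_documents: your loaded strings
```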
I suggest two main methods that I inherited from Machine Learning. The first one is EDA, which in our scenario stands for Exploratory Document Analysis.
The second one is the introduction of curated metadata. Here I suggest using KeyBERT to quickly extract, at the summary level and at the chunk level, all the related keywords, and to include them in the chunks before sending everything to the vector store database.
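A minimal sketch of this enrichment step, assuming a multilingual sentence-transformers backbone (the model name, the `enrich_chunk` helper, and the metadata format are my own choices, not prescriptions):

```python
from keybert import KeyBERT

# Multilingual embedding model, shared with the similarity search later on
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")

def enrich_chunk(chunk: str, top_n: int = 5) -> str:
    """Extract keywords for one chunk and prepend them as curated metadata."""
    keywords = kw_model.extract_keywords(
        chunk,
        keyphrase_ngram_range=(1, 2),  # single words and bigrams
        stop_words="english",
        top_n=top_n,
    )
    tags = ", ".join(kw for kw, _score in keywords)
    return f"[keywords: {tags}]\n{chunk}"

enriched_chunks = [enrich_chunk(c) for c in chunks]  # chunks: your split documents
```

The same call works on the document summary to get summary-level keywords.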
Pipeline Preparation
The next step is to create a pipeline for generating text. This includes designing an algorithm, building a model, or using pre-trained models to generate text from input data. The pipeline can involve various components such as tokenization (splitting the input into tokens), language models (GPT-3 or BERT), preprocessing, and quality assurance.
An emerging best practice, still under open discussion in the community, is to work with an orchestration of models. It is quite clear that one model cannot do it all: the right model depends on the task to be completed.
Here too I would suggest two main changes:
The first one is to use a T5 model to extract a summary of the document (also enriched with the metadata we discussed before). This model, really slim, is also good for suggesting an initial QnA as a first exploration of the document.
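A minimal sketch using the slim LaMini-Flan-T5-77M listed in the stack below, via the Hugging Face transformers pipeline; the prompts and lengths are assumptions to tune on your own documents:

```python
from transformers import pipeline

# LaMini-Flan-T5-77M is a tiny instruction-tuned T5: it runs fine on CPU
t5 = pipeline("text2text-generation", model="MBZUAI/LaMini-Flan-T5-77M")

# document_text: the (metadata-enriched) document to explore
summary = t5("Summarize: " + document_text, max_new_tokens=200)[0]["generated_text"]

# The same model can propose an initial QnA to start exploring the document
questions = t5(
    "Write three questions a reader could ask about this text: " + summary,
    max_new_tokens=128,
)[0]["generated_text"]
```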
The second one is to use a 7B-parameter model, such as Mistral or Zephyr, to perform the RAG. The RAG pipeline will leverage a re-ranking strategy: LangChain recently introduced a function to overcome the issues highlighted in the paper Lost in the Middle, as sketched below.
While Large Language Models (LLMs) are incredibly powerful, they do have some limitations. They can struggle to process large amounts of text at once and to reference specific information. Recent research has even shown that LLM performance tends to be highest when the relevant information is located at the beginning or end of the input context. When models have to access important information buried in the middle of lengthy contexts, their performance can degrade significantly.
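The LangChain function in question is, to the best of my knowledge, the LongContextReorder document transformer, which pushes the most relevant documents to the beginning and end of the context. A minimal sketch (the import path depends on your LangChain version):

```python
from langchain_community.document_transformers import LongContextReorder

# docs: the documents returned by the vector store similarity search,
# ordered from most to least relevant
reordered_docs = LongContextReorder().transform_documents(docs)

# The best hits now sit at the edges of the prompt, and only the least
# relevant ones end up "lost in the middle"
context = "\n\n".join(d.page_content for d in reordered_docs)
```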
So, as of now, the entire pipeline stack has (a wiring sketch follows the list):
LaMini-Flan-T5-77M for summarization and the initial QnA
Multilingual embeddings for KeyBERT and vector store similarity search
KeyBERT for keyword extraction
LangChain re-ranking for the prompt context and ground truth
Mistral-7B/Zephyr-7B to complete the RAG
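To make the stack concrete, here is a hedged sketch of how the retrieval side could be wired together; the FAISS choice and the model names are my assumptions, and the import paths depend on your LangChain version:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_transformers import LongContextReorder

# Same multilingual model used for KeyBERT above
embeddings = HuggingFaceEmbeddings(model_name="paraphrase-multilingual-MiniLM-L12-v2")

# enriched_chunks: the keyword-enriched strings built earlier
db = FAISS.from_texts(enriched_chunks, embeddings)

hits = db.similarity_search("What are the main findings?", k=6)
context_docs = LongContextReorder().transform_documents(hits)

# context_docs then become the prompt context for Mistral-7B/Zephyr-7B
```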
Quality Assurance
Evaluation of RAG is a live topic with an open debate around it. Some very smart people created a dedicated library for it, called RAGAS. Honestly, I tried to use it with an open-source LLM (I don’t want to rely on ChatGPT…) and the process failed every time.
So I decided to try another brand-new approach: using a third-party language model to act as a judge (see JudgeLM: Fine-tuned Large Language Models are Scalable Judges). With my limited computational resources I used a similar approach, inspired by the official paper (a sketch of the judging prompt follows these steps):
use the GPTQ version of OpenChat-3.5 (a really promising model) in Google Colab
extract the generations from Zephyr-7B on the same questions, using simple re-ranking and re-ranking with summary injection
ask OpenChat to act as the judge and score the two models, with reasoning.
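Here is a minimal sketch of that judging step; the prompt wording is mine, not taken from the JudgeLM paper, and `generate` is a hypothetical stand-in for however you call the GPTQ OpenChat-3.5 model:

```python
JUDGE_PROMPT = """You are an impartial judge. Two assistants answered the same question
from the same source document.

Question: {question}

Answer A (simple re-ranking): {answer_a}

Answer B (re-ranking with summary injection): {answer_b}

Explain your reasoning, then score each answer from 1 to 10 for faithfulness
to the source and for completeness."""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
    return generate(prompt)  # generate(): your wrapper around OpenChat-3.5 GPTQ
```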
You can try it yourself. Believe me, I managed to do all of this running every model on CPU only, since I am indeed a Poor GPU guy.
Conclusion
I hope you will find all of this useful. Feel free to contact me on Medium.
I am using Substack only for the newsletter.