Stop paying, Start creating: your free ticket to Production-Ready AI
The PoorGPU guy Newsletter 2024 week 6 - Open-source LLMs meet user-friendly pipelines, unlocking unlimited possibilities.
O’Reilly Media recently ran a survey on how companies use generative AI, what bottlenecks they see in adoption, and what skills gaps need to be addressed. Guess what?
Two-thirds (67%) of the survey respondents report that their companies are using generative AI.
Many AI adopters are still in the early stages. 26% have been working with AI for under a year. But 18% already have applications in production.
Difficulty finding appropriate use cases is the biggest bar to adoption for both users and nonusers.
16% of respondents working with AI are using open source models.
Unexpected outcomes, security, safety, fairness and bias, and privacy are the biggest risks that adopters are testing for.
What if I told you that there are many groups and organizations trying to fill the gap between open-source models and user-friendly, ready-to-go pipelines that make them useful in business?
The Amazon flop
A few months ago Amazon introduced Q, an advanced AI assistant designed specifically for large companies. The tool lets employees query documents and corporate systems in natural language, powered by artificial intelligence (AI). However, even before the full launch of this innovative product, internal tests conducted by the company revealed potential issues that may impact its future performance.
According to confidential documents obtained by the tech newsletter Platformer, Q has been found to generate false information and leak sensitive data belonging to Amazon, including private details about internal discount programs, unreleased features, and the locations of AWS (Amazon Web Services) data centers. While Amazon’s spokespeople have dismissed these scenarios as hypothetical, the potential consequences could be severe for a company whose reputation relies heavily on maintaining confidentiality and privacy among its stakeholders.
And for a business, or an industrial-grade enterprise, an unreliable tool, or even worse a tool that gives false information, is like a cancer.
Meet LLMware
LLMWare is an open-source project inspired by all of the rapid development in generative AI. The team’s vision for LLMWare is to be the unified, open, extensible framework for retrieval-augmented generation (RAG) and related LLM-based patterns. They aspire for it to become an indispensable set of tools that anyone can use, from beginners to the most sophisticated AI developers, to rapidly build industrial-grade enterprise LLM-based applications.
As you can imagine, the first goal is RAG with an accuracy score high enough for business use.
They also provide several user-friendly tutorials on YouTube that teach you how to start from scratch with Large Language Models and build an application ready for production.
Here is a super cool example from their videos.
llmware is an integrated framework comprising four major components:
Retrieval: Assemble fact-sets
A comprehensive set of querying methods: semantic, text, and hybrid retrieval with integrated metadata.
Ranking and filtering strategies to enable semantic search and rapid retrieval of information.
Web scrapers, Wikipedia integration, and Yahoo Finance API integration as additional tools to assemble fact-sets for generation.
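To make this concrete, here is a minimal retrieval sketch based on llmware’s public quickstart examples; the library name and folder path are placeholders, and exact method signatures may vary between versions:

```python
# Minimal llmware retrieval sketch (adapted from the project's quickstart
# examples; names and signatures may vary slightly between versions).
from llmware.library import Library
from llmware.retrieval import Query

# Create a library and ingest a folder of documents: parsing, chunking
# and indexing happen automatically inside add_files().
library = Library().create_new_library("my_docs")               # placeholder name
library.add_files(input_folder_path="/path/to/your/documents")  # placeholder path

# Run a basic text query against the indexed blocks.
results = Query(library).text_query("invoice total", result_count=10)

for r in results:
    print(r["file_source"], "->", r["text"][:100])
```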
Prompt: Tools for sophisticated generative scenarios
Connect Models: Open interface designed to support AI21, Ai Bloks READ-GPT, Anthropic, Cohere, HuggingFace Generative models, llmware BLING and DRAGON models, OpenAI.
Prepare Sources: Tools for packaging and tracking a wide range of materials into model context window sizes. Sources include files, websites, audio, AWS Transcribe transcripts, Wikipedia and Yahoo Finance.
Prompt Catalog: Dynamically configurable prompts to experiment with multiple models without any change in the code.
Post Processing: a full set of metadata and tools for evidence verification, classification of a response, and fact-checking.
Human in the Loop: Ability to enable user ratings, feedback, and corrections of AI responses.
Auditability: A flexible state mechanism to capture, track, analyze and audit the LLM prompt lifecycle.
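A sketch of how these pieces chain together, adapted from llmware’s examples; the model name, file, and question are placeholders, and the post-processing call names should be treated as an approximation of the current API:

```python
# Prompt-with-sources sketch (adapted from llmware's examples; treat the
# specific method names as an approximation of the current API).
from llmware.prompts import Prompt

# Load any supported model by name (placeholder choice here).
prompter = Prompt().load_model("llmware/bling-1b-0.1")

# Prepare Sources: package a document into the model's context window,
# optionally filtered by a query (file path and query are placeholders).
prompter.add_source_document("/path/to/folder", "contract.pdf",
                             query="termination clause")

# Generate with the packaged source attached.
response = prompter.prompt_with_source("What is the notice period for termination?")

# Post Processing: check the response against the source evidence.
checked = prompter.evidence_check_sources(response)
print(response[0]["llm_response"])
```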
Vector Embeddings: swappable embedding models and vector databases
Custom trained sentence transformer embedding models and support for embedding models from Cohere, Google, HuggingFace Embedding models, and OpenAI.
Mix-and-match among multiple options to find the right solution for any particular application.
Out-of-the-box support for 3 vector databases — Milvus, FAISS, and Pinecone.
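In code, the mix-and-match is essentially a one-liner, as in this sketch based on llmware’s examples (the embedding model and vector database picked here are just illustrative choices):

```python
# Swappable embeddings sketch (based on llmware's examples).
from llmware.library import Library
from llmware.retrieval import Query

# Reopen the library created earlier ("my_docs" is a placeholder).
library = Library().load_library("my_docs")

# Install an embedding: swap the model or the vector DB independently.
library.install_new_embedding(embedding_model_name="mini-lm-sbert",
                              vector_db="faiss")

# Semantic queries now run against the chosen vector store.
results = Query(library).semantic_query("payment terms", result_count=5)
```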
Parsing and Text Chunking: Prepare your data for RAG
Parsers for: PDF, PowerPoint, Word, Excel, HTML, Text, WAV, AWS Transcribe transcripts.
A complete set of text-chunking tools to separate information and associated metadata to a consistent block format.
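To make the idea of a consistent block format concrete, here is a generic illustration of chunking text into blocks with attached metadata; this shows the concept only, not llmware’s internal implementation:

```python
# Generic text-chunking illustration (the concept, NOT llmware internals).
def chunk_text(text: str, source: str, max_chars: int = 400) -> list[dict]:
    """Split text into blocks of up to max_chars, each carrying metadata."""
    blocks, start, block_id = [], 0, 0
    while start < len(text):
        blocks.append({"block_id": block_id,
                       "source": source,
                       "text": text[start:start + max_chars]})
        start += max_chars
        block_id += 1
    return blocks

sample = "llmware parses documents into text blocks with metadata. " * 20
print(chunk_text(sample, source="sample.txt")[0])
```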
Up and running in 10 minutes
The beauty of the LLMware framework is that you can set it up in 10 minutes, and it runs on any laptop.
They created their own pipelines to make all the magic happen behind the curtain. And the amazing thing is that they support basically all the major operating systems:
MacOS
Linux
Windows
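Installation is a single `pip install llmware` from PyPI. As a quick smoke test that everything loads (no GPU needed; the library name below is a placeholder):

```python
# Quick smoke test after "pip install llmware" (no GPU required).
from llmware.library import Library

lib = Library().create_new_library("hello_llmware")  # placeholder name
print("library created:", lib.library_name)
```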
The BLING models
On their Hugging Face organization card, the llmware team states their vision:
We believe that the ascendence of LLMs creates a major new application pattern and data pipelines that will be transformative in the enterprise, especially in knowledge-intensive industries. Our open source research efforts are focused both on the new “ware” (“middleware” and “software” that will wrap and integrate LLMs), as well as building high-quality automation-focused enterprise RAG models.
There are several biases against using open-source LLMs in production-ready applications. The main trends in the LLM landscape have changed a lot over the last two years, moving in the same direction as the smartphone.
If you know anything about the history of mobile phones, they started out normal-sized, then went small (to the point of being smaller than your thumb), and now they go wide (the bigger the screen, the better). Large Language Models are going through a similar process:
The bigger the better: last year we saw models going above 100B parameters, as if more parameters meant better performance (and at what computational footprint cost?).
Instruction fine-tuning: lately the focus has shifted to very good 7B-parameter models, fine-tuned with new techniques to achieve the same accuracy and performance as the big shots.
Shearing and pruning techniques are now used to distill models of less than 2B parameters that excel at specific tasks without losing fluency.
At this point a few questions arise, as posed by Darren Oberst himself in his beautiful article:
What are the smallest models that can demonstrate meaningful instruction following behavior?
To what extent can targeted high-quality instruction training offset the obvious benefit of larger model size?
Do “hallucinations” and other aberrant behavior occur in lower, higher or same frequencies on instruct-trained smaller LLMs?
How do practical constraints such as applying smaller context windows, narrower domain scope, and focused instruction set improve the performance of smaller LLMs in instruct-following?
Do smaller models have a viable role to play in retrieval augmented generation (RAG) scenarios, or will larger models always be the right answer for production use cases?
The llmware team’s goal has two sides: first, learning from small models will help improve larger ones by sharpening training objectives and improving datasets; second, CPU-based models are valuable for local testing because they can handle confidential enterprise information.
To explore these questions further, the llmware team created the BLING (Best Little Instruct-following No-GPU) model series on HuggingFace, using Apache 2.0-licensed, high-quality decoder models that can be easily deployed on standard laptops without special quantization techniques.
They are currently training on three base GPT model families: Pythia, Falcon, and Cerebras, focusing primarily on finding the smallest possible decoder model with consistent question-answering behavior.
So far, the llmware team has extensively trained these models at sizes ranging from 100M to 3B parameters, concentrating mainly on the 1.0–1.5B parameter range.
They are experimenting with different filtered, bespoke training datasets that combine fact-based question-answering, key-value extraction, Boolean (yes/no) question-answering, recognition of “not found”, long-form summarization, and short-form x-summarization to further improve the models.
Over the past weeks, the team has launched four initial BLING models on HuggingFace, with more coming soon!
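If you want to try one yourself, here is a minimal sketch using plain transformers on a laptop CPU; the model id llmware/bling-1b-0.1 and the <human>/<bot> prompt wrapper are taken from the model card, so double-check them there:

```python
# Trying a BLING model on CPU with plain transformers.
# Assumption: model id and prompt wrapper come from the HF model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llmware/bling-1b-0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

context = "The invoice total is $12,450 and is due on March 15."
question = "What is the invoice total?"
prompt = f"<human>: {context}\n{question}\n<bot>:"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens (skip the echoed prompt).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```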
Conclusions
If you are interested in learning about instruct-training for smaller open source models or sharing ideas and best practices, please don’t hesitate to contact the llmware team.
They have an amazing YouTube channel with plenty of tutorials, easy to follow and quick to implement on your laptop.
PS: They are on Medium too. Follow them llmWare@Medium
This is only the start!
Hope you found all of this useful. Feel free to contact me on Medium.
I use Substack only for the newsletter: here, every week, I give away free links to my paid articles on Medium.
Follow me and read my latest articles: https://medium.com/@fabio.matricardi
Here are a few articles I wrote on Medium about them. They are free for you, as a reader of this newsletter!