I did an interesting (at least for me) exercise on Hugging Face: I looked for all the models with fewer than 3 billion parameters, either in a quantized version or small enough to run on any CPU.
- Mini models: usually models with fewer than 1B parameters
- Tiny/Small models: models below 1.5B parameters
- Sheared models: models pruned from 7B or larger down to between 1.3B and 2.7B parameters
- Quantized models: from 3B parameters onward…
Why do we need such models? Let’s have a look together!
What is a Model and what is Quantization
LLMs are large neural networks with high-precision weight tensors. The entire model is loaded into memory (and this is why you need RAM!!); the computer turns words into numbers, runs them through the neural network, and produces results. To overcome hardware limitations, smart individuals quantize (reduce) the model weights, sacrificing some accuracy but enabling modest computers to run large language models. There are 2 main formats for quantized models: GGML (now called GGUF) and GPTQ.
GGML/GGUF is a C library for machine learning (ML) — the “GG” refers to the initials of its originator (Georgi Gerganov). This format is good for people who do not have a GPU, or who have a really weak one. It runs on CPU only.
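For example, here is a minimal sketch of running a GGUF model on CPU with the llama-cpp-python bindings (the file name and prompt are placeholders; point it at any of the quantized models below):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load a GGUF file from disk: everything runs on the CPU, so RAM is the only real limit.
llm = Llama(
    model_path="./sheared-llama-1.3b-sharegpt.Q4_K_M.gguf",  # placeholder file name
    n_ctx=2048,    # context window
    n_threads=4,   # CPU threads to use
)

output = llm("Q: Why do we quantize language models? A:", max_tokens=128)
print(output["choices"][0]["text"])
```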
GPTQ is also a library; it uses the GPU and quantizes (reduces) the precision of the model weights. GPTQ (post-training quantization for generative pre-trained transformers) files can be roughly 4 times smaller than the original model. If you have a GPU, this format is the right one.
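The GPU path looks like this with 🤗 transformers, which can load GPTQ checkpoints when the optimum and auto-gptq packages are installed (the repository name is just an example):

```python
# pip install transformers optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/stablelm-zephyr-3b-GPTQ"  # example repo; swap in any GPTQ model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # weights land on the GPU

inputs = tokenizer("Why do we quantize language models?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```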
What’s in it for us?
If you don’t have a dedicated GPU (AMD, NVIDIA etc…) you are limited by your available RAM. And even a 3 billion parameter model is already too much for you (I know this for a fact: I am the PoorGPUguy, and with my 16 GB of memory I cannot load and run an unquantized 3B model).
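A rough back-of-the-envelope calculation shows why: every weight has to sit in memory, so the bytes per weight decide whether a model fits at all (ignoring activations and overhead):

```python
# Approximate memory needed just to hold the weights of a 3B-parameter model
params = 3e9
for precision, bytes_per_weight in [("fp32", 4), ("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_weight / 1e9:.1f} GB")
# fp32: ~12.0 GB, fp16: ~6.0 GB, 8-bit: ~3.0 GB, 4-bit: ~1.5 GB
```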
So a long time ago I started looking for an LLM big enough to be able to follow instructions, yet small enough to fit on normal consumer hardware.
I tested around 100 of them, but here is the list of the ones that can really do something.
I will publish my results soon. Stay tuned!
A herd of Tiny models
Sheared-LLaMA-1.3B-ShareGPT
This is the instruction-tuned version of princeton-nlp/Sheared-LLaMA-1.3B. The base model was trained on 10,000 instruction-response pairs sampled from the ShareGPT dataset (first turns only). Sheared-LLaMA comes from a pruning and continued pre-training recipe 😁 The researchers found that pruning strong base models is an extremely cost-effective way to get strong small-scale language models compared to pre-training them from scratch. Starting from the Llama-2-7B model (pre-trained on 2T tokens), pruning produces a model as strong as an OpenLLaMA model at 3% of its pre-training cost.
Paper: https://arxiv.org/pdf/2310.06694.pdf
Code: https://github.com/princeton-nlp/LLM-Shearing
Models: Sheared-LLaMA-1.3B, Sheared-LLaMA-2.7B
Tinyllama-2–1b-Miniguanaco
The TinyLlama project, led by a research assistant at the Singapore University of Technology and Design, is trying to pre-train a 1.1 billion parameter Llama model on three trillion tokens. This model takes up only 550 MB of RAM. With some proper optimization, it was possible to achieve this within a span of “just” 90 days using 16 A100-40G GPUs.
The researchers adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged into many open-source projects built upon Llama. Besides, TinyLlama is compact, with only 1.1B parameters, so it can serve a multitude of applications that demand a restricted computation and memory footprint. The fine-tuning dataset used here is openassistant-guanaco, and the training for this model is focused on question answering!
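Because the architecture and tokenizer match Llama 2, loading it is the usual transformers boilerplate; a minimal sketch on CPU (the model id is a placeholder for whichever TinyLlama fine-tune you downloaded, and the prompt follows the Guanaco-style format):

```python
# pip install transformers torch
from transformers import pipeline

# Any Llama-architecture checkpoint loads the same way; a 1.1B model fits in a few GB of RAM.
pipe = pipeline(
    "text-generation",
    model="path/or/repo-of-tinyllama-2-1b-miniguanaco",  # placeholder id
    device=-1,  # -1 = CPU
)
print(pipe("### Human: What is quantization?\n### Assistant:", max_new_tokens=100)[0]["generated_text"])
```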
TinyLlama 1.1B 🐋 OpenOrca
This is one of the versions in the TinyLlama family, trained on 1 trillion tokens with the following datasets:
- Open-Orca/OpenOrca
- bigcode/starcoderdata
- cerebras/SlimPajama-627B
Image from the official LaMini-LM paper: https://mbzuai-nlp.github.io/LaMini-LM/
🦙 LaMini-Flan-T5–77M
LaMini-LM is a collection of small, efficient language models distilled from ChatGPT and trained on a large-scale dataset of 2.58M instructions. The authors explored different model architectures, sizes, and checkpoints, and extensively evaluated their performance across various NLP benchmarks and through human evaluation.
According to the paper, if you train a small language model with distilled knowledge you can achieve amazing performance. The proposed models achieve performance comparable to Alpaca while being nearly 10 times smaller, demonstrating the potential of training efficient yet effective language models. This is the smallest one: only 77 million parameters, an encoder-decoder model based on Flan-T5.
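Since these are T5-style encoder-decoder models, they go through the text2text-generation pipeline rather than plain text generation; a minimal sketch, assuming the MBZUAI/LaMini-Flan-T5-77M repo id on Hugging Face:

```python
# pip install transformers torch sentencepiece
from transformers import pipeline

# Encoder-decoder checkpoint: use the text2text-generation pipeline; it runs fine on CPU.
pipe = pipeline("text2text-generation", model="MBZUAI/LaMini-Flan-T5-77M", device=-1)
print(pipe("Explain in one sentence why small language models are useful.", max_new_tokens=80)[0]["generated_text"])
```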
🦙 LaMini-Flan-T5–248M
Same LaMini-LM family and training recipe as above. This is the middle one: 248 million parameters, an encoder-decoder model based on Flan-T5.
💎🦜 StableLM-Zephyr-3B — 4K context window
Stable LM Zephyr 3B is a 3 billion parameter Large Language Model (LLM), 60% smaller than 7B models, allowing accurate and responsive output on a variety of devices without requiring high-end hardware.
Stable LM Zephyr 3B is a new chat model representing the latest iteration in Stability AI’s series of lightweight LLMs, preference-tuned for instruction following and Q&A-type tasks. It is an extension of the pre-existing Stable LM 3B-4e1t model and is inspired by the Zephyr 7B model from Hugging Face. With its 3 billion parameters, this model efficiently caters to a wide range of text-generation needs, from simple queries to complex instructional contexts on edge devices.
Stability AI released this model under a license that permits non-commercial use only.
🦙 Shearedplats-2.7b-v1
An experimental fine-tune of Sheared-LLaMA 2.7B with Alpaca-QLoRA. The dataset used for the fine-tuning is an Alpaca-style dataset, and the model also uses the Alpaca-style prompt template (sketched below, after the links).
Original model: vihangd/shearedplats-2.7b-v1
Quantized version: Aryanne/Shearedplats-2.7B-v1-gguf
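For reference, this is the standard Alpaca-style template (the Stanford Alpaca wording), wrapped in a small Python helper; feed the resulting prompt to the model, for instance via llama-cpp-python as shown earlier:

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(instruction="Summarize what model pruning is in two sentences.")
# The model's answer is whatever it generates after "### Response:".
```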
🦙🧙♂️ open-llama-3b-v2-wizard-evol-instuct-v2–196K
This model is derived from OpenLLaMA 3B v2, further trained and fine-tuned for 1 epoch on the WizardLM_evol_instruct_v2_196k dataset.
It is amazing to see how the WizardLM dataset and the Open Orca dataset change the performance of the models in terms of instruction understanding and following.
Hugging Face Repo: https://huggingface.co/TheBloke/open-llama-3b-v2-wizard-evol-instuct-v2-196k-GGUF
🧙♂️🐋 Wizard-Orca-3B — 4K context window
This model is also an OpenLLaMA derivative, further trained and fine-tuned for 2 epochs on pankajmathur’s WizardLM_orca dataset.
Hugging Face Repository: https://huggingface.co/Aryanne/Wizard-Orca-3B-gguf
🛠️ Potential Use-Cases
I am listing 4 ideas here. The last one comes from a brilliant idea by Andrej Karpathy: I did not understand HOW to use it, but I understood for WHAT purpose we can use it.
If you already tried it please drop a message here for everyone!
- Edge-device deployment, such as real-time machine translation on the device without needing the internet.
- Personal assistants or small-business chatbots.
- Real-time dialogue generation for video games (since the developer needs to reserve GPU RAM for the game itself, the LM has to be small).
- Assisting bigger models: TinyLlama can also assist in the speculative decoding of larger models. For a closer look, check out this tutorial by Andrej Karpathy and see the Twitter post here (https://twitter.com/karpathy/status/1697318534555336961). A rough sketch follows below.
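I have not tried it myself, but 🤗 transformers exposes the same idea as assisted generation: the small model drafts tokens and the big model only verifies them, which speeds up decoding without (in principle) changing the output. A rough sketch, assuming both models share the Llama 2 tokenizer (model ids are illustrative, and the 7B model really wants a GPU):

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

big_id = "meta-llama/Llama-2-7b-hf"              # the large model whose output you want
small_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # the tiny drafter

tokenizer = AutoTokenizer.from_pretrained(big_id)
big = AutoModelForCausalLM.from_pretrained(big_id, device_map="auto")
small = AutoModelForCausalLM.from_pretrained(small_id, device_map="auto")

inputs = tokenizer("Write a haiku about small language models.", return_tensors="pt").to(big.device)
# The assistant drafts a few tokens per step; the big model accepts or rejects them.
outputs = big.generate(**inputs, assistant_model=small, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```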
Good reads (all friendly links)
Here is a little list of interesting related articles on Medium. For the subscribers of this newsletter (you 😉) they are free.