The LLM Revolution: Dethroning GPUs and Redefining AI's Future
From Efficiency Concerns to Architectural Bottlenecks, Why the LLM Landscape is Shifting
The quest to dethrone GPUs in the large language model (LLM) landscape is in full swing! Several factors contribute to this trend:
Efficiency Concerns: GPUs, while powerful, are known for their high energy consumption and cost, which limits their scalability for large-scale LLM training and deployment. Moreover, only a handful of tech giants can afford this kind of infrastructure, giving them a de facto monopoly over the future of AI.
Architectural Bottlenecks: Transformer-based architectures, the current standard for LLMs, have inherent limitations. Their reliance on self-attention leads to compute and memory costs that grow quadratically with sequence length, creating bottlenecks as model size and context length increase (see the short sketch below).
Emerging Alternatives: New hardware and software technologies are emerging that offer promising alternatives to GPUs for LLM training and inference.
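To make the quadratic-complexity point concrete, here is a minimal NumPy sketch (illustrative only, with arbitrary dimensions) showing how the self-attention score matrix grows with the square of the sequence length:

```python
import numpy as np

def attention_scores(seq_len: int, d_model: int = 64) -> np.ndarray:
    """Compute a toy self-attention score matrix for a random sequence."""
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((seq_len, d_model))
    K = rng.standard_normal((seq_len, d_model))
    # The (seq_len x seq_len) score matrix is the quadratic bottleneck:
    # doubling the sequence length quadruples its memory and compute.
    return Q @ K.T / np.sqrt(d_model)

for n in (1024, 2048, 4096):
    scores = attention_scores(n)
    print(f"seq_len={n:5d} -> score matrix holds {scores.size:,} entries")
```

Doubling the context from 2,048 to 4,096 tokens quadruples the score matrix; this is exactly the bottleneck that the alternative architectures below try to sidestep.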
A Look into the Future: A More Diverse and Accessible LLM Ecosystem
Here is the best of this week in the quest for an AI for all:
⏩ Fast Feedforward Networks: in a typical large model, some 98% of the neurons are not involved at all in generating any given output. A new study from ETH Zurich exploits this with a binary tree decision mapping that routes each input to a small, relevant slice of the layer (see the Focus of the Week below).
🐅 StripedHyena-7B: the new Together AI model family goes beyond the Transformer architecture. Together Research presents new architectures offering long context and improved training and inference performance relative to Transformers. Spinning out of a research program with academic collaborators, with roots in signal-processing-inspired sequence models, the release includes StripedHyena-Hessian-7B (SH 7B), a base model, and StripedHyena-Nous-7B (SH-N 7B), a chat model.
🍏 Apple released MLX: finally, an open-source library dedicated to the Apple Silicon M-series chips. MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research. It uses familiar APIs: MLX has a Python API that closely follows NumPy, plus a fully featured C++ API that closely mirrors the Python one. Higher-level packages like mlx.nn and mlx.optimizers follow PyTorch's APIs to simplify building more complex models. A notable difference between MLX and other frameworks is the unified memory model: arrays in MLX live in shared memory, and operations on them can be performed on any of the supported device types without transferring data (see the MLX sketch after this list).
👾 Mamba and SSMs: models based on selective state space model (SSM) architectures are emerging. They skip transformers/attention for faster and more efficient processing of long sequences. Mamba-Chat (https://github.com/havenhq/mamba-chat) is the first chat language model based on a state-space model architecture rather than a transformer.
🧱 BLING: llmware and other organizations on Hugging Face are releasing tiny/slim models under 2B parameters. Among them, the BLING family (Best Little Instruct-following No-GPU) is a real breakthrough: Apache 2.0-licensed, high-quality decoder models that can be easily deployed on standard laptops without special quantization techniques.
🚀 QuIP# quantization: Quantization with Incoherence Processing (QuIP) is built on the insight that quantization benefits from incoherent weight and Hessian matrices; it improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. QuIP# ("QuIP sharp") is a weights-only method that combines lattice codebooks with incoherence processing to deliver state-of-the-art 2-bit models with near-fp16 performance (a toy 2-bit illustration follows the MLX sketch below).
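As a taste of the MLX API described above, here is a minimal sketch (assuming an Apple silicon machine with MLX installed via pip; the shapes are arbitrary) of unified-memory arrays and the PyTorch-like mlx.nn module:

```python
# Requires Apple silicon: pip install mlx
import mlx.core as mx
import mlx.nn as nn

# Arrays live in unified memory, so the same arrays can be consumed
# by operations running on the CPU or the GPU without copies.
a = mx.random.normal((4, 32))
b = mx.random.normal((32, 8))

c_cpu = mx.matmul(a, b, stream=mx.cpu)  # run on the CPU
c_gpu = mx.matmul(a, b, stream=mx.gpu)  # run on the GPU, same arrays

# mlx.nn mirrors PyTorch's module API.
layer = nn.Linear(32, 8)
out = layer(a)

# MLX is lazy: force evaluation to materialize the results.
mx.eval(c_cpu, c_gpu, out)
print(out.shape)
```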
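QuIP#'s lattice codebooks and incoherence processing go well beyond a newsletter snippet, so the sketch below is not QuIP# itself: it is a naive round-to-nearest 2-bit quantizer (all helper names are hypothetical) that simply illustrates what "two bits per weight" means, namely that every weight collapses to one of four representable values:

```python
import numpy as np

def quantize_2bit(w: np.ndarray):
    """Naive per-tensor 2-bit round-to-nearest quantization (4 levels)."""
    levels = 2 ** 2                       # two bits -> 4 codebook entries
    scale = (w.max() - w.min()) / (levels - 1)
    codes = np.round((w - w.min()) / scale).astype(np.uint8)  # values 0..3
    return codes, scale, w.min()

def dequantize(codes, scale, zero):
    """Map the stored 2-bit codes back to approximate float weights."""
    return codes.astype(np.float32) * scale + zero

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
codes, scale, zero = quantize_2bit(w)
w_hat = dequantize(codes, scale, zero)
print("unique levels:", np.unique(codes).size)   # at most 4
print("mean abs error:", np.abs(w - w_hat).mean())
```

QuIP#'s contribution is precisely to shrink the reconstruction error this naive uniform grid incurs, by rotating the weights into an incoherent basis and snapping them to lattice codebooks instead.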
Focus of the Week from ThePoorGPUGuy
⏩ Fast Feedforward Networks: Revolutionizing Neural Network Layers in GPT Architectures and Beyond
Peter Belcak and Roger Wattenhofer, researchers at ETH Zurich, have presented an algorithmic breakthrough that could put an end to the infamous feedforward-layer bottleneck (https://arxiv.org/pdf/2308.14711.pdf).
Their Fast Feedforward Networks are one of the most elegant innovations I've seen, yielding models that are up to 78 times faster than the originals overall, and up to 220 times faster at the layer level.
In the realm of artificial intelligence and machine learning, neural networks have come a long way since their inception, evolving from simple rule-based systems into complex deep learning architectures capable of processing vast amounts of data with remarkable accuracy. Within sophisticated structures such as Transformer blocks lie the key components that enable this advanced analysis: the feedforward (FF) layers, which carry out much of the feature extraction.
These linear transformations allow neurons to communicate, enabling critical pattern-recognition tasks at every layer of a neural network. Despite their simplicity, these foundational elements are present even in cutting-edge models like OpenAI's ChatGPT, where they sit alongside the attention mechanism inside every Transformer block.
However, as models grow larger (as seen with GPT-3), feedforward layers consume an increasingly significant portion of computing power, accounting for 98% of the Floating Point Operations (FLOPs) in such behemoths. This comes at a cost: the majority of neurons in the hidden layers are effectively inactive, participating in the computation without meaningfully influencing the output.
To address this issue, the researchers propose a novel approach to neural network design: the fast feedforward network (FFF). Its core concept is a binary tree that selectively engages only the most essential neurons for a given input, in effect hardwiring a decision path through the layer. As training progresses, the tree's nodes become increasingly adept at distinguishing between inputs and discarding inactive connections, leading to a far more streamlined decision-making process during inference, when models serve real-time predictions.
By employing depth-six binary trees (which split the layer in half six times), the researchers retain 94% of the original performance while cutting costs and speeding up prediction by orders of magnitude compared to conventional feedforward layers, even though only about 1% of neurons participate actively in each FFF layer. This demonstrates that optimizing computational efficiency does not require sacrificing model accuracy or predictive power, a significant breakthrough for the future development of advanced AI systems.
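Here is a minimal NumPy sketch of that mechanism (a toy illustration with hypothetical shapes and random weights, not the authors' implementation): each internal node of the binary tree makes a hard left/right decision on the input, and only the small feedforward block at the reached leaf is evaluated:

```python
import numpy as np

class FastFeedforwardSketch:
    """Toy fast feedforward layer: a depth-d binary tree routes each
    input to one of 2**d small leaf blocks, so only that fraction of
    the layer's neurons is evaluated at inference time."""

    def __init__(self, d_in, d_out, depth=6, leaf_width=8, seed=0):
        rng = np.random.default_rng(seed)
        n_nodes = 2 ** depth - 1                  # internal decision nodes
        n_leaves = 2 ** depth
        self.depth = depth
        self.node_w = rng.standard_normal((n_nodes, d_in))  # decision hyperplanes
        self.W1 = rng.standard_normal((n_leaves, leaf_width, d_in))
        self.W2 = rng.standard_normal((n_leaves, d_out, leaf_width))

    def forward(self, x):
        node = 0
        for _ in range(self.depth):               # descend the binary tree
            go_right = self.node_w[node] @ x > 0  # hard decision at inference
            node = 2 * node + (2 if go_right else 1)
        leaf = node - (2 ** self.depth - 1)       # index among the leaves
        h = np.maximum(self.W1[leaf] @ x, 0.0)    # only this leaf's neurons fire
        return self.W2[leaf] @ h

fff = FastFeedforwardSketch(d_in=64, d_out=64)
y = fff.forward(np.random.default_rng(1).standard_normal(64))
print(y.shape)  # (64,)
```

With depth six, each input touches only one of 64 leaf blocks instead of the whole layer, which is where the order-of-magnitude savings come from.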
In conclusion, by selectively engaging only the most critical neurons within complex models such as ChatGPT's Transformer blocks, these tree-based hardwiring techniques pave the way for more streamlined and powerful AI systems to come.
Future Outlook
The quest to dethrone GPUs from the LLM landscape is still in its early stages. However, the rapid progress in new hardware and software technologies suggests that alternative solutions are on the horizon. While GPUs will likely remain dominant in the short term, we can expect to see a shift towards more efficient and scalable solutions in the coming years. This will allow for the development and deployment of even larger and more powerful LLMs, unlocking new possibilities for AI research and applications.
It’s important to note that there is no single solution that will completely replace GPUs. Different approaches will likely be used for different aspects of LLM training and deployment, depending on specific needs and resources. The key takeaway is that the landscape is rapidly evolving, and we can expect to see exciting innovations in the near future.