Introduction
The year 2023 was a landmark year for the Large Language Models (LLMS) ecosystem which saw a large number of proprietary models and open-source models becoming accessible to the public at large. It also saw LLMs getting published for CPUs in GGML format thus making LLMs accessible to people who lacked powerful GPUs. Soon in August 2023, another format called GGUF was released for CPU-based LLMs. With such rapid changes in the ecosystem many people especially beginners are confused as to what is the difference between GGUF and GGML. In this article, we will do a comparison of GGUF vs GGML and understand their differences and similarities to help clear the confusion among beginners.
What is GGML
1) Library
GGML is a tensor library written in C language created by Georgi Gerganov which lets you quantize Large Language Models and enable it to run on CPU-powered commodity hardware. Since not everyone has powerful GPUs to load and run Large Language Models, GGML has been a great step towards democratizing LLM and making it accessible to anyone having a mere CPU in their system.
By default the parameters in Large Language Models are represented with 32-bit floating numbers. However, the GGML library can convert it into a 16-bit floating point representation thus reducing the memory requirement by 50% to load & run the LLM. This process is known as quantization and although it reduces the quality of LLM inference but it is a tradeoff between having GPU compute & high precision vs CPU compute and low precision.
GGML library also supports integer quantization (e.g. 4-bit, 5-bit, 8-bit, etc.) that can further reduce the memory and compute power required to run LLMs locally on the end user’s system or edge devices.
2) Format
GGML is not just a tensor library but also a file format for LLMs quantized with GGML. In fact, there were 3 file formats used by the GGML library earlier – GGML. GGMF, and GGJT. However, this GGML format had several shortcomings and has been completely depreciated and replaced by the GGUF format which we are going to discuss next.
What is GGUF
GGUF is a new file format for the LLMs created with GGML library which was announced in August 2023. GGUF is a highly efficient improvement over the GGML format that offers better tokenization, support for special tokens, and better metadata storage. Since it addressed several shortcomings of the GGML format, the GGUF format has got positively accepted by the community. GGML format has since become obsolete and is not even supported officially by the GGML library or llama.cpp
GGUF vs GGML
Now to understand the difference between GGUF and GGML, we should be clear that this comparison will be done keeping GGML format in mind and not the GGML library. (Let us do Apple to Apple comparison!)
Speed
GGML format LLMs were known for slower load time and inference performance. However, GGUF LLMs have mmap compatibility that enhances load time and faster inference speed.
Special Tokens
In LLMs, the special tokens act as delimiters to signify the end of the user prompt or system prompt or any special instructions. Special tokens are generally defined in prompt templates of a given LLM and help create more effective prompts & are also useful during LLM fine-tuning.
However, these special tokens were not supported in GGML, but are now supported in GGUF format, thus adding a much needed missing feature for users.
Support for Non-Llama Models
GGML format was originally designed keeping llama architecture in mind. That is why Georgi Gerganov created llama.cpp for inferencing the quantized GGML models. GGUF format is generic design and has extended compatibility with non-llama architecture models like Falcon, Bloom, Phi, Mistral, etc.
Extensibility & Flexibility
In GGML format all the metadata, data, and hyperparameters were saved in a single file which resulted in tight coupling internally. Hence whenever there used to be a change in LLM hyperparameters the GGML model used to break or had backward compatibility issues. On the other hand, the GGUF format has been designed to be more extensible & flexible allowing the addition of new features without breaking anything.
Ease of Use
GGUF format can be loaded and saved easily without using external libraries. Also, the setup requires minimal input from the user. However this was not the case with the GGML format which required external libraries and extensive configuration from the user,
Summary of GGUF vs GGML
Let us wrap up with the summarized view of our comparison between GGUF vs GGML formats –
Aspects | GGML | GGUF |
Basic | GGML is an obsolete format for creating quantized LLMs using the GGML tensor library. | GGUF is the successor of the GGML format that has better efficiency. It is also created with the GGML tensor library |
Speed | Compared to GGUF, the load time of the model and inference speed is on the slower side. | GGUF LLMs have mmap compatibility that enhances load time and faster inference speed. |
Special Tokens | Special tokens are not supported by GGML. | GGUF supports special tokens which are useful for creating effective prompts and also in llm fine-tuning. |
Support for Non-Llama Models | Non-llama models are not supported. | GGUF format has extended compatibility with non-llama architecture models like Falcon, Bloom, Phi, Mistral, etc. |
Extensibility & Flexibility | GGML had extensibility issues where small changes in the base model used to result in breaking changes. | GGUF format has been designed to be more extensible & flexible allowing the addition of new features without breaking anything. |
Ease of Use | Compared to GGUF, the setup for GGML required more inputs from user and also had dependency on external libraries. | GGUF is much more user-friendly for setup, with not much dependency on external libraries. |