Tags: ggml, llm, weight files, local llms, deep learning, quantization, llama

Understanding GGML Files: A Deep Dive into Quantization and Visualization of File Structure

A detailed visualization of the file structure of GGML files, including the mapping of blocks to their corresponding positions in the file.

What is GGML?

GGML (Gerganov's General Machine Learning) is a C library designed for efficient machine learning, with a particular emphasis on running large language models (LLMs) locally. Created by Georgi Gerganov (hence the name), it provides a way to run transformer-model inference on a variety of hardware, including CPUs and GPUs. At its core, GGML was an early and successful attempt to establish a file format for LLMs that facilitated easy sharing and local execution.

This article delves into the structure of GGML files, examining how they store and load models. Our focus will be primarily on the file structure and its role in model storage and retrieval, rather than the specifics of model implementation or the inner workings of the GGML library.

For a broader introduction to GGML, the HuggingFace article on Introduction to GGML provides a solid foundation.

Quantization: Compressing Models for Efficient Deployment

Running massive language models on devices with limited memory is a significant challenge. Imagine trying to squeeze a giant inflatable structure into a tiny backpack – it requires clever deflation and folding. Similarly, quantization is the key to making these large models manageable for resource-constrained environments.

Quantization reduces a model's memory footprint by decreasing the precision of its weights. Think of it like compressing an image: you sacrifice some detail to make the file smaller. This is especially crucial for LLMs, which can easily balloon to gigabytes in size.

The primary bottleneck during inference often isn't raw processing power but memory bandwidth: the processor spends much of its time waiting for weights to arrive from memory. Quantization tackles this by using lower-precision weights, resulting in smaller, faster-loading models.

GGML employs a range of quantization methods, each offering a different balance between size reduction and accuracy. Let's take a visual tour of these techniques:

Q8_0 Block Structure (34 bytes)

Basic block with the highest precision of the quantized formats.

  • Block header: d (scale), FP16, 2 bytes
  • Data: 32 quantized values, 8 bits each (32 bytes)
  • Reconstruction: w = q × d
  • Range of q: [-128, 127], signed integer
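
To make the reconstruction rule concrete, here is a minimal Python sketch of Q8_0-style quantization and dequantization for a single 32-value block. It assumes the common convention of choosing the scale as max(|w|) / 127; it illustrates the scheme described above rather than GGML's actual implementation.

import numpy as np

def quantize_q8_0(weights: np.ndarray):
    """Quantize one block of 32 float32 weights to Q8_0-style values."""
    assert weights.size == 32
    # Scale choice (max/127) is a common convention, used here for illustration.
    d = float(np.abs(weights).max()) / 127.0 or 1.0
    q = np.clip(np.round(weights / d), -128, 127).astype(np.int8)
    return d, q                                   # on disk: 2-byte FP16 scale + 32 int8 values

def dequantize_q8_0(d: float, q: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights: w = q * d."""
    return q.astype(np.float32) * d

block = np.random.randn(32).astype(np.float32)
d, q = quantize_q8_0(block)
print("max abs error:", np.abs(block - dequantize_q8_0(d, q)).max())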

Q4_K_M Superblock Structure (160 bytes)

Advanced structure with separate scales and minimums for better accuracy.

  • 8 block scales, FP16 each (16 bytes): scale_0 through scale_7
  • 8 block minimums, FP16 each (16 bytes): min_0 through min_7
  • Data: 256 quantized values, 4 bits each (128 bytes)
  • Reconstruction: w = q × scale_i + min_i, where each group of 32 values uses its own scale and minimum
  • Range of q: [0, 15], unsigned integer
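
The scale-plus-minimum idea behind these groups can be sketched in a few lines of Python. This is a simplified illustration of asymmetric 4-bit quantization for one 32-value group, mirroring the scale_i / min_i fields above; it is not the exact Q4_K_M bit layout.

import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Asymmetric 4-bit quantization of one group: codes in [0, 15]."""
    w_min = float(weights.min())
    scale = (float(weights.max()) - w_min) / 15.0 or 1.0
    q = np.clip(np.round((weights - w_min) / scale), 0, 15).astype(np.uint8)
    return scale, w_min, q

def dequantize_4bit(scale: float, w_min: float, q: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights: w = q * scale + min."""
    return q.astype(np.float32) * scale + w_min

group = np.random.randn(32).astype(np.float32)
scale, w_min, q = quantize_4bit(group)
print("max abs error:", np.abs(group - dequantize_4bit(scale, w_min, q)).max())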

Q3_K_S Block Structure (14 bytes)

Simple 3-bit quantization with scaling.

  • Block header: scale (FP16), 2 bytes
  • Data: 32 quantized values, 3 bits each, bit-packed (12 bytes)
  • Reconstruction: w = q × scale
  • Range of q: [-4, 3], signed integer

Q3_K_L Block Structure (16 bytes)

3-bit quantization with lookup table optimization.

  • Block headers: scale (FP16) and lookup (FP16), 4 bytes total
  • Data: 32 quantized values, 3 bits each, bit-packed (12 bytes)
  • Reconstruction: w = q × scale, refined via the lookup table
  • Range of q: [-4, 3], mapped through the lookup table
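
The lookup-table variant can be pictured as dequantization through a small table rather than a plain multiply. The sketch below is an assumption about how such a table might be applied, with illustrative table contents; the description above only states that a scale and a lookup value are stored per block.

import numpy as np

# Hypothetical 8-entry table for 3-bit codes; the contents are illustrative only.
LOOKUP = np.array([-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0], dtype=np.float32)

def dequantize_3bit_lut(scale: float, codes: np.ndarray) -> np.ndarray:
    """w = LOOKUP[code] * scale, one entry per unpacked 3-bit code in [0, 7]."""
    return LOOKUP[codes] * scale

codes = np.array([0, 7, 3, 4], dtype=np.uint8)   # already unpacked, one code per byte
print(dequantize_3bit_lut(0.05, codes))          # approximately [-0.2, 0.15, -0.05, 0.0]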

Q5_K_M Block Structure (22 bytes)

5-bit quantization with a block minimum.

  • Block headers: scale (FP16) and min (FP16), 4 bytes total
  • Data: 32 quantized values, 5 bits each (20 bytes)
  • Reconstruction: w = q × scale + min
  • Range of q: [0, 31], unsigned integer offset by the block minimum

Q5_1 Block Structure (22 bytes)

5-bit quantization with zero-point optimization.

  • Block headers: scale (FP16) and zero-point (FP16), 4 bytes total
  • Data: 32 quantized values, 5 bits each (20 bytes)
  • Reconstruction: w = q × scale, adjusted by the zero-point
  • Range of q: [0, 31], with zero-point adjustment

Memory Layout Summary

8-bit Format
  • Q8_0: 34 bytes total
  • 2B header + 32B data
  • [-128, 127] range
  • w = q × d
4/5-bit Formats
  • Q4_K_M: 160B superblock
  • Q5_K_M: 22B block
  • Q5_1: 22B block
  • w = q × scale + min/zero
3-bit Formats
  • Q3_K_S: 14B simple
  • Q3_K_L: 16B with LUT
  • [-4, 3] range
  • w = q × scale (or LUT)

Decoding the GGML File Structure

GGML files are binary files that house a model's essential components: weights, biases, and other parameters vital for its operation. Here's a breakdown of the key sections:

The Header: The Blueprint of the Model

The header is the crucial first part of a GGML file. It acts as a blueprint, containing essential metadata that describes the model's architecture and how it's stored. Here's a closer look at the information typically found in the header:

  • Magic Number: A specific sequence of bytes that identifies the file as a GGML file.
  • Version Number: Indicates the version of the GGML format used.
  • Tensor Count: The number of tensors (weights, biases) stored in the file.
  • Hyperparameters: These describe the model's architecture and training process. Common hyperparameters found in the header include:
    • Number of Layers: The depth of the neural network.
    • Embedding Dimension: The size of the vector representations for each word or token.
    • Number of Attention Heads: (For transformer models) The number of parallel attention mechanisms used.
    • Feedforward Dimension: The size of the hidden layers in the feedforward network within each transformer block.
    • Vocabulary Size: The total number of unique words or tokens the model can handle.
    • Optimizer State (if applicable): Details about the optimizer used during training (e.g., Adam, SGD).
  • Quantization Type: This is a critical piece of information. It specifies the method used to quantize the model's weights (e.g., Q8_0, Q4_K_M). This dictates how the weights are interpreted and dequantized during loading.

The header provides all the necessary context for correctly loading and interpreting the rest of the data in the GGML file. Without it, the stored weights and biases would be meaningless.
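
To make this concrete, here is a hedged Python sketch of reading such a header. The layout, field order, and field widths below are assumptions chosen for illustration; the real GGML header format is defined by the library and differs in detail.

import struct

def read_simplified_header(path: str) -> dict:
    """Parse a hypothetical, simplified GGML-style header (illustrative only)."""
    with open(path, "rb") as f:
        # Assumed layout: 4-byte magic, uint32 version, uint32 tensor count,
        # then a handful of uint32 hyperparameters and the quantization type.
        magic, version, tensor_count = struct.unpack("<4sII", f.read(12))
        n_layers, n_embd, n_heads, n_ff, n_vocab, quant_type = struct.unpack("<6I", f.read(24))
    return {
        "magic": magic,
        "version": version,
        "tensor_count": tensor_count,
        "n_layers": n_layers,
        "embedding_dim": n_embd,
        "n_heads": n_heads,
        "feedforward_dim": n_ff,
        "vocab_size": n_vocab,
        "quant_type": quant_type,   # dictates how the weights are dequantized later
    }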

Weights: The Core of the Model

The weights represent the learned parameters of the model. They are typically stored as a dense matrix in row-major order, meaning the elements of each row are stored contiguously in memory. Each element in this matrix is a quantized value, representing the weight's strength.

Biases: Fine-Tuning the Model

Biases are additional parameters that are added to the weighted sum of inputs in each neuron. They are stored as a vector, with each element corresponding to a neuron and stored in a quantized format.

Other Parameters: Model-Specific Data

Depending on the specific model, there might be other parameters stored in the GGML file. These can include things like layer normalization parameters, attention masks, or other model-specific data. They are often stored in a dictionary-like format, with keys identifying each parameter and their corresponding quantized values.

Loading a GGML Model: Bringing it to Life

To load a GGML model and prepare it for use, you'll generally follow these steps:

  1. Read the Header: This is the first and most crucial step. The header provides the model's dimensions, quantization type, and other essential metadata required to interpret the rest of the file.
  2. Interpret Quantization Type: The header will specify which quantization method was used (e.g., Q8_0, Q4_K_M). This information is vital because it dictates how the stored weights and biases should be dequantized.
  3. Read the Data: Based on the information from the header, the weights, biases, and any other parameters are read from the file. The quantization type determines how many bytes are read for each weight and how they are interpreted.
  4. Dequantize (if necessary): If the model was quantized (which is usually the case), the weights and biases need to be dequantized to convert them back to floating-point values that can be used for computation.

Once loaded and dequantized, the model is ready for inference tasks.
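
As a rough sketch of how these steps fit together, the Python below outlines the flow. The callables passed in (read_header, read_tensor_bytes, and the dequantizers mapping) are hypothetical placeholders for the format-specific logic, not real GGML APIs.

def load_model(path, read_header, read_tensor_bytes, dequantizers):
    """Illustrative loading flow: header -> quantization type -> raw data -> float32."""
    header = read_header(path)                          # step 1: read the header
    dequantize = dequantizers[header["quant_type"]]     # step 2: pick the matching decoder
    tensors = {}
    for name, raw_bytes in read_tensor_bytes(path, header):   # step 3: read raw tensor data
        tensors[name] = dequantize(raw_bytes)           # step 4: dequantize to float32
    return tensors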

Demystifying GGML Quantization

This was the most confusing part of GGML for me. When I went to Ollama to download models, I always saw the quantization options (Q4_K_M, Q8_0, and so on) and wondered what they meant. I had a basic idea of their impact, but I didn't know how to interpret the naming system.

So, I decided to write this article to help others understand the naming system and the impact of different quantization types.

Naming Convention: Q{N}_{Type}_{Variant}

  • Q: A simple prefix indicating that we're dealing with quantization.
  • N: Represents the number of bits used to store each weight (e.g., 2, 3, 4, 5, 8).
  • Type:
    • K: Indicates the use of K-means clustering for optimization.
    • 0: Denotes a basic linear quantization scheme.
    • 1: Represents an alternative linear quantization scheme.
  • Variant (optional): This part further specifies the block size used in certain quantization methods:
    • S: Small block size
    • M: Medium block size
    • L: Large block size
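
A small helper makes the convention concrete: it simply splits a name of the form Q{N}_{Type}_{Variant} into its parts. This parser is an illustration of the naming scheme described above, not part of GGML itself.

def parse_quant_name(name: str) -> dict:
    """Split a quantization name like 'Q4_K_M' into bits, type, and block variant."""
    parts = name.upper().split("_")
    bits = int(parts[0].lstrip("Q"))
    qtype = {"K": "k-means", "0": "basic linear", "1": "alternative linear"}[parts[1]]
    variant = {"S": "small", "M": "medium", "L": "large"}.get(parts[2]) if len(parts) > 2 else None
    return {"bits": bits, "type": qtype, "block_variant": variant}

print(parse_quant_name("Q4_K_M"))   # {'bits': 4, 'type': 'k-means', 'block_variant': 'medium'}
print(parse_quant_name("Q8_0"))     # {'bits': 8, 'type': 'basic linear', 'block_variant': None}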

Block Size Variants

  • Small (S):

    • Granularity: Finer, allowing for more precise adaptation to local variations in weight distributions.
    • Memory overhead: Higher due to the need to store more metadata (e.g., scaling factors) for each small block.
    • Block size: Typically 32 weights
  • Medium (M):

    • Approach: Balanced, providing a good compromise between granularity and overhead.
    • Usage: Most commonly used in practice.
    • Block size: Typically 64 weights
  • Large (L):

    • Metadata overhead: Lower, as fewer blocks require separate metadata.
    • Processing speed: Faster because larger chunks of data can be processed at once.
    • Drawback: May miss subtle variations in local weight distributions.
    • Block size: Typically 128 weights

Understanding "Superblock"

In the context of GGML quantization, particularly the K-means variants (Q4_K_M, Q3_K_S, etc.), the term "superblock" refers to a group of blocks processed together. Each block contains a set of weights (e.g., 32, 64, or 128), and a superblock typically comprises multiple such blocks. Using a superblock structure allows for shared scaling factors or other metadata that can be applied to all weights within the superblock. This can improve compression efficiency and potentially speed up processing.
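
One way to picture this grouping in code is a simple data structure where each block keeps its own scale and minimum while the superblock holds metadata shared by all of them. This is a conceptual sketch of the idea described above, not the actual K-quant memory layout.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class Block:
    scale: float        # per-block scale (stored as FP16 in the formats above)
    minimum: float      # per-block minimum / offset
    q: np.ndarray       # quantized values for this block

@dataclass
class SuperBlock:
    shared_scale: float                           # metadata shared by every block
    blocks: list = field(default_factory=list)    # e.g. 8 blocks of 32 weights = 256 weights

    def dequantize(self) -> np.ndarray:
        # Per block: w = q * (block scale * shared scale) + minimum
        return np.concatenate([
            b.q.astype(np.float32) * (b.scale * self.shared_scale) + b.minimum
            for b in self.blocks
        ])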

Common Quantization Types

Q8_0 (8-bit, Basic Linear)

  • Bits per weight: 8
  • Quantization type: Basic linear quantization (no clustering)
  • Block structure: None. Each weight is quantized individually.
  • Compression ratio: 1:4 compared to the original 32-bit floating-point (f32) representation.
  • Accuracy loss: Minimal. Q8_0 is known for preserving accuracy well.
  • Use cases: Ideal for smaller models or when maintaining high accuracy is paramount.

Q4_K_M (4-bit, K-means, Medium blocks)

  • Bits per weight: 4
  • Quantization type: K-means clustering
  • Block size: Medium (typically 64 weights per block, grouped into superblocks)
  • Compression ratio: 1:8 compared to f32
  • Accuracy: Good balance between size reduction and accuracy.
  • Use cases: A popular choice for many LLM deployments due to its balance.
  • Overhead: Medium processing overhead because of the K-means clustering.

Q3_K_S (3-bit, K-means, Small blocks)

  • Bits per weight: 3
  • Quantization type: K-means clustering for value distribution
  • Block size: Small (typically 32 weights per block, grouped into superblocks)
  • Compression ratio: 1:10.67 compared to f32
  • Accuracy: Good balance between compression and accuracy.
  • Use cases: Commonly used in medium-sized models where a smaller footprint is desired.

Q3_K_L (3-bit, K-means, Large blocks)

  • Bits per weight: 3
  • Quantization type: K-means clustering
  • Block size: Large (typically 128 weights per block, grouped into superblocks)
  • Compression ratio: 1:10.67 compared to f32
  • Accuracy: Slightly lower than Q3_K_S but still reasonable.
  • Processing: Faster due to larger blocks, which can be processed more efficiently.

Q5_K_M (5-bit, K-means, Medium blocks)

  • Bits per weight: 5
  • Quantization type: K-means clustering
  • Block size: Medium (typically 64 weights per block, grouped into superblocks)
  • Compression ratio: 1:6.4 compared to f32
  • Accuracy: Higher than Q3 and Q4 variants, offering better precision.
  • Use cases: Suitable for applications where accuracy is more critical.

Quantization Impact: LLaMA-3 8B Model Example

Here's how different quantization types affect the size and quality of the LLaMA-3 8B model:

Quantization Type | Size (GB) | Compression Ratio | Relative Quality | Use Case
Original (F32)    | 32.0      | 1:1               | Baseline         | Research/Development
Q8_0              | 8.0       | 1:4               | Excellent        | High-accuracy inference
Q5_K_M            | 5.12      | 1:6.4             | Very Good        | Balanced performance
Q5_1              | 5.12      | 1:6.4             | Good             | Simple deployment
Q4_K_M            | 4.0       | 1:8               | Good             | Common deployment
Q3_K_L            | 3.0       | 1:10.67           | Fair             | Size-constrained
Q3_K_S            | 3.0       | 1:10.67           | Fair             | Size-constrained

Memory Usage Comparison (per weight)

  • Original F32: 32 bits
  • Q8_0: 8 bits (25% of original)
  • Q5_K_M/Q5_1: 5 bits (15.6% of original)
  • Q4_K_M: 4 bits (12.5% of original)
  • Q3_K_S/Q3_K_L: 3 bits (9.4% of original)

Deep Dive: Q5_K_M Quantization

Overview

Q5_K_M is a quantization scheme designed to reduce the memory footprint of large language models while retaining a reasonable level of accuracy. It achieves this by representing model weights using 5 bits instead of the standard 32-bit floating-point representation.

Memory Layout and Structure

Memory Layout

Each block in Q5_K_M contains:

  • Original Weight (32-bit float): Each weight starts as a 32-bit (4-byte) floating-point number. For example, a weight like 0.235772 occupies 4 bytes of memory.

  • Quantized Weight (5-bit int): In Q5_K_M, each weight is compressed down to a 5-bit integer. For instance, the weight 0.235772 might be quantized to 17.

  • Block Size: A block in Q5_K_M quantization consists of 64 weights.

  • Total Block Memory: Each block, including its metadata, occupies 64 bytes of memory.

Block Structure

A Q5_K_M block consists of:

Block Header (8 bytes)

  • scale (float32): A 4-byte floating-point scaling factor applied during dequantization to restore the original range of weights.

  • min (float32): A 4-byte floating-point value representing the minimum value in the block, serving as an offset during dequantization.

Data Section (40 bytes)

  • 64 weights * 5 bits/weight = 320 bits = 40 bytes containing the quantized weights.

Padding (16 bytes)

  • Alignment padding to make the total block size 64 bytes, crucial for efficient memory access with SIMD instructions.
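
The byte budget of this layout is easy to verify in Python; struct.calcsize is used only to confirm that two float32 header fields, the 5-bit payload, and the alignment padding add up to 64 bytes (the packing order implied by the format string is an assumption for illustration).

import struct

WEIGHTS_PER_BLOCK = 64
BITS_PER_WEIGHT = 5

header_bytes = struct.calcsize("<ff")                    # scale (float32) + min (float32) = 8
data_bytes = WEIGHTS_PER_BLOCK * BITS_PER_WEIGHT // 8    # 64 * 5 bits = 320 bits = 40 bytes
padding_bytes = 64 - (header_bytes + data_bytes)         # pad up to the 64-byte boundary

print(header_bytes, data_bytes, padding_bytes)                          # 8 40 16
print("total block size:", header_bytes + data_bytes + padding_bytes)   # 64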

Quantization Process

Original Weights

Example set of weights (simplified to 8 weights instead of the usual 64):

original_weights = [
    0.235772, -0.124511, 0.447823, -0.558901,
    0.112233, -0.332211, 0.225566, -0.445566
]

K-means Clustering

With 5 bits, we can represent 2^5 = 32 distinct values. K-means clustering is used to find the 32 cluster centers that best represent the distribution of the original weights within the block.

Example cluster centers (after K-means converges):

clusters = [
    -0.558901,  # Cluster 0
    -0.445566,  # Cluster 1
    -0.332211,  # Cluster 2
    -0.124511,  # Cluster 3
    0.112233,   # Cluster 4
    0.225566,   # Cluster 5
    0.235772,   # Cluster 6
    0.447823,   # Cluster 7
    # ... (and so on, up to cluster index 31, i.e. 32 clusters total)
]

Implementation Steps

Step 1: Determine min and scale

min_val = min(clusters) = -0.558901
max_val = max(clusters) = 0.447823
scale = (max_val - min_val) / (2**5 - 1)
      = (0.447823 - (-0.558901)) / 31
      = 1.006724 / 31
      = 0.032475

Step 2: Map weights to cluster centers

Each original weight is assigned to the nearest cluster center. The index of this cluster center is the quantized representation of the weight.

Example mapping (using simplified linear mapping for illustration):

  • 0.235772 → q = round((0.235772 - (-0.558901)) / 0.032475) = 24 (maps to cluster at index 24)
  • -0.124511 → q = round((-0.124511 - (-0.558901)) / 0.032475) = 13 (maps to cluster at index 13)

Note: In a true K-means implementation, the mapping would be based on the actual distances to cluster centers, not a simple linear calculation.
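
The simplified linear mapping can be reproduced in a few lines of Python, which also produces the error figure discussed in the next step. This mirrors the illustrative linear scheme above, not a full K-means quantizer.

min_val = -0.558901
scale = 0.032475

def quantize_linear(w: float) -> int:
    """Map a weight to its 5-bit index under the simplified linear scheme."""
    return round((w - min_val) / scale)

def dequantize_linear(q: int) -> float:
    """Reconstruct an approximate weight: w = q * scale + min."""
    return q * scale + min_val

for w in (0.235772, -0.124511):
    q = quantize_linear(w)
    w_hat = dequantize_linear(q)
    print(f"{w:+.6f} -> q={q:2d} -> {w_hat:+.6f} (abs error {abs(w - w_hat):.6f})")
# First line: +0.235772 -> q=24 -> +0.220499 (abs error 0.015273)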

Step 3: Error Analysis

Quantization inevitably introduces some error. Let's consider our example:

  • Original: 0.235772
  • Quantized (using linear mapping example): 0.220499 (assuming cluster center at index 24 is 0.220499)
  • Error: 0.015273 (relative error of about 6.48%)

Across a block of weights, typical error metrics are:

  • Mean Absolute Error (MAE): Usually between 2-5%.
  • Root Mean Square Error (RMSE): Typically in the range of 3-7%.

Step 4: Memory Savings

For a block of 64 weights:

  • Original Size: 64 weights * 32 bits/weight = 2048 bits = 256 bytes
  • Quantized Size:
    • Data: 64 weights * 5 bits/weight = 320 bits = 40 bytes
    • Metadata: 8 bytes (scale + min)
    • Padding: 16 bytes
    • Total: 64 bytes
  • Compression Ratio (per block): 256 bytes / 64 bytes = 4:1
  • Compression Ratio (overall): Approximately 6.25x when considering the whole model, as metadata is a smaller portion of the total size.

Step 5: Implementation Considerations

5.1 Block Size Impact

Medium block size (64 weights) provides:

  • Good balance between granularity and overhead
  • Efficient SIMD processing potential
  • Reasonable memory alignment (64 bytes)

5.2 Optimization Opportunities

SIMD Processing:

// Example SIMD dequantization pseudo-code (AVX2-style intrinsics).
// load_5bit_integers() is a placeholder that unpacks eight 5-bit values into a
// vector of 32-bit integers; scale_vector and min_vector hold the block's scale
// and minimum broadcast across all lanes.
for (int i = 0; i < 64; i += 8) {
    __m256 q = _mm256_cvtepi32_ps(load_5bit_integers(i));          // int -> float
    __m256 result = _mm256_fmadd_ps(q, scale_vector, min_vector);  // q * scale + min
    _mm256_store_ps(&output[i], result);                           // store 8 dequantized weights
}

Memory Access Patterns:

  • Align blocks to 64-byte boundaries
  • Sequential access for better cache utilization

Step 6: Trade-offs Summary

Advantages:

  • Achieves a compression ratio of 6.25x
  • Preserves good accuracy, typically with less than 5% degradation
  • Offers efficient computational characteristics

Limitations:

  • Introduces overhead due to block-wise quantization
  • May experience some precision loss in extreme values
  • Requires additional computation for dequantization

This quantization scheme strikes an excellent balance between reducing model size and preserving accuracy, making it particularly well-suited for production deployments where both aspects are critical.

Conclusion: Navigating the GGML Landscape

GGML's quantization methods are a testament to the delicate balance between model size, accuracy, and computational efficiency. By understanding the nuances of different quantization types, you can tailor your model deployments to meet specific requirements, whether it's maximizing accuracy, minimizing memory usage, or optimizing for speed.
