Quantized Matrix Multiplication
Performs matrix multiplication with a quantized weight matrix. This operation is essential for efficient inference with quantized models, significantly reducing memory usage and computation time while maintaining reasonable accuracy.
Usage
mlx_quantized_matmul(
  x,
  w,
  scales = NULL,
  biases = NULL,
  transpose = TRUE,
  group_size = 64L,
  bits = 4L,
  mode = "affine",
  device = mlx_default_device()
)
Arguments
- x
- An mlx array. The input to the matrix multiplication.
- w
- An mlx array. Either:
  - A quantized weight matrix (uint32) from mlx_quantize(), or
  - An unquantized weight matrix that will be quantized automatically
- scales
- An optional mlx array (the quantization scales). If NULL and w is unquantized, w will be quantized automatically. Default: NULL 
- biases
- An optional mlx array (the quantization biases). For affine quantization, supply the biases returned by mlx_quantize() when w is pre-quantized. Default: NULL
- transpose
- Whether to transpose the weight matrix. Default: TRUE 
- group_size
- The group size for quantization. Default: 64 
- bits
- The number of bits for quantization (typically 4 or 8). Default: 4 
- mode
- The quantization mode, either "affine" or "mxfp4". Default: "affine" 
- device
- Execution target: supply "gpu", "cpu", or an mlx_stream created via mlx_new_stream(). Default: mlx_default_device().
Details
Quantized matrix multiplication uses low-precision representations (typically 4-bit or 8-bit integers) for weights, which reduces memory footprint by up to 8x compared to float32. The scales parameter contains the dequantization factors needed to reconstruct approximate float values during computation.
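To make the reconstruction concrete, here is a minimal plain-R sketch of the affine scheme described above (illustrative arithmetic only, not a call into the package; the per-group layout shown is an assumption about the standard affine format):

q     <- c(3L, 0L, 15L, 7L)   # 4-bit codes for one group (integers in 0..15)
scale <- 0.12                 # dequantization scale stored for this group
bias  <- -0.9                 # offset stored for this group
w_hat <- scale * q + bias     # approximate float weights recovered during the matmul
w_hat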
The group_size parameter controls the granularity of quantization: smaller groups provide better accuracy but slightly higher memory usage, since a separate scale (and bias, in affine mode) is stored for each group.
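As a rough back-of-envelope sketch of that trade-off (plain R arithmetic, assuming scales and biases are stored in float16, one pair per group):

bits_per_weight <- function(bits, group_size, scale_bits = 16, bias_bits = 16) {
  bits + (scale_bits + bias_bits) / group_size
}
bits_per_weight(4, 64)   # ~4.5 bits per weight, roughly 7x smaller than float32
bits_per_weight(4, 32)   # ~5 bits per weight: finer groups cost a little more memory
bits_per_weight(8, 64)   # ~8.5 bits per weight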
Automatic Quantization: If only w is provided (without scales), the function will
automatically quantize w using mlx_quantize() before performing the multiplication.
For repeated operations, it's more efficient to pre-quantize weights once using
mlx_quantize() and reuse them.
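For example, a typical inference loop would quantize once up front and reuse the result (a sketch using only the functions documented on this page):

w <- mlx_random_normal(c(512, 256))
quant <- mlx_quantize(w, group_size = 64, bits = 4)   # quantize once
for (i in 1:100) {
  x <- mlx_random_normal(c(10, 256))                  # stand-in for each new batch
  y <- mlx_quantized_matmul(x, quant$w_q, quant$scales, quant$biases)
}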
Examples
if (FALSE) { # \dontrun{
# Automatic quantization (convenient but slower for repeated use)
x <- mlx_random_normal(c(10, 256))
w <- mlx_random_normal(c(512, 256))
result <- mlx_quantized_matmul(x, w)
# Pre-quantized weights (faster for repeated operations)
quant <- mlx_quantize(w, group_size = 64, bits = 4)
result <- mlx_quantized_matmul(x, quant$w_q, quant$scales, quant$biases)
} # }
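A further sketch (not part of the package's shipped examples) comparing precision settings. It assumes group_size = 32 is supported by mlx_quantize() and that the group_size and bits arguments passed to mlx_quantized_matmul() must match those used when quantizing:

w <- mlx_random_normal(c(512, 256))
x <- mlx_random_normal(c(10, 256))
q8 <- mlx_quantize(w, group_size = 32, bits = 8)   # finer groups, more bits: closer to float
q4 <- mlx_quantize(w, group_size = 64, bits = 4)   # defaults: most compact
y8 <- mlx_quantized_matmul(x, q8$w_q, q8$scales, q8$biases, group_size = 32, bits = 8)
y4 <- mlx_quantized_matmul(x, q4$w_q, q4$scales, q4$biases)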