Quantize a Matrix
mlx_quantize.Rd
Quantizes a weight matrix to a low-precision representation (typically 4-bit or 8-bit). This reduces memory usage and enables faster computation during inference.
Usage
mlx_quantize(
  w,
  group_size = 64L,
  bits = 4L,
  mode = "affine",
  device = mlx_default_device()
)

Arguments
- w
- An mlx array (the weight matrix to quantize) 
- group_size
- The group size for quantization. Smaller groups give better accuracy but use slightly more memory. Default: 64 
- bits
- The number of bits for quantization (typically 4 or 8). Default: 4 
- mode
- The quantization mode: "affine" (with scales and biases) or "mxfp4" (4-bit floating point with group_size = 32). Default: "affine". See the sketch after this list for both modes. 
- device
- Execution target: supply "gpu", "cpu", or an mlx_stream created via mlx_new_stream(). Default: mlx_default_device().
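A minimal sketch of the main parameterizations, assuming only the functions shown on this page (mlx_random_normal() from the example below and mlx_quantize() itself):

w <- mlx_random_normal(c(512, 256))

# Default: affine 4-bit quantization with 64-element groups
q4 <- mlx_quantize(w, group_size = 64, bits = 4, mode = "affine")

# 8-bit affine quantization: roughly twice the memory of 4-bit, higher fidelity
q8 <- mlx_quantize(w, group_size = 64, bits = 8, mode = "affine")

# 4-bit floating-point mode; "mxfp4" uses group_size = 32 as noted above
qfp4 <- mlx_quantize(w, group_size = 32, bits = 4, mode = "mxfp4")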
Value
A list containing:
- w_q
- The quantized weight matrix (packed as uint32) 
- scales
- The quantization scales for dequantization 
- biases
- The quantization biases (NULL for modes that do not use biases, such as "mxfp4") 
Details
Quantization converts floating-point weights to low-precision integers, reducing memory by up to 8x for 4-bit quantization. The scales (and optionally biases) are stored to enable approximate reconstruction of the original values.
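For intuition, here is the memory arithmetic plus an approximate round trip as a hedged sketch; mlx_dequantize() and its signature are assumptions used for illustration and are not documented on this page.

# Memory arithmetic for a 512 x 256 float32 matrix:
#   original: 512 * 256 * 4 bytes   = 524,288 bytes
#   4-bit:    512 * 256 * 0.5 bytes =  65,536 bytes packed (~8x smaller,
#             plus a small per-group overhead for scales and biases)

# Hypothetical round trip; mlx_dequantize() is assumed, not confirmed here.
w <- mlx_random_normal(c(512, 256))
quant <- mlx_quantize(w, group_size = 64, bits = 4)
w_hat <- mlx_dequantize(quant$w_q, quant$scales, quant$biases,
                        group_size = 64, bits = 4)
# w_hat only approximates w; reconstruction error grows as bits decreases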
Examples
if (FALSE) { # \dontrun{
w <- mlx_random_normal(c(512, 256))
quant <- mlx_quantize(w, group_size = 64, bits = 4)
# Use quant$w_q, quant$scales, quant$biases with mlx_quantized_matmul()
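# Hypothetical follow-up: the argument names and order assumed for
# mlx_quantized_matmul() below are not confirmed by this page; check its
# own reference for the actual signature and whether it transposes w.
x <- mlx_random_normal(c(1, 512))   # shape assumes y is roughly x %*% w
y <- mlx_quantized_matmul(x, quant$w_q, quant$scales, quant$biases,
                          group_size = 64, bits = 4)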
} # }