Quantization

Letter Q

Shrinking a model by lowering the numerical precision of its weights. Same model, fraction of the memory, almost the same accuracy.

Updated:

Working definition

Quantization shrinks a model by lowering the numerical precision of its weights. FP16 to INT8 to INT4. Same model, fraction of the memory, almost the same accuracy. It’s why the gap between frontier and open source keeps closing.