Sampling, Quantization, and Deployment: Concepts

Sampler Chain

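A sampler chain is a pipeline: the logits pass through a series of samplers in order, and each sampler filters or reshapes the candidate distribution before a token is drawn.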

Common Samplers

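Sampler	Effect
Temperature	Divides the logits by the temperature (below 1 sharpens the distribution, above 1 flattens it)
Top-K	Keeps only the K highest-probability tokens
Top-P	Keeps the smallest set of top tokens whose cumulative probability reaches P
Min-P	Drops tokens whose probability is below (maximum probability × min_p)
Typical	Entropy-based sampling: keeps tokens whose information content is close to the distribution's entropy
Mirostat	Adaptive control that steers generation toward a target surprise level
Grammar	Constrains the output format with a grammar
Penalties	Repetition and frequency penalties
Logit Bias	Manually adjusts the probability of specific tokens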
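The pipeline idea can be illustrated with a minimal Python sketch. It is not llama.cpp's implementation: the function names, the toy vocabulary, and the sampler order are assumptions chosen for illustration, and only four of the samplers from the table are modeled.

```python
import math
import random

def softmax(logits):
    """Convert a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def temperature(t):
    """Divide every logit by t."""
    return lambda logits: {tok: v / t for tok, v in logits.items()}

def top_k(k):
    """Keep the k tokens with the highest logits."""
    def apply(logits):
        keep = sorted(logits, key=logits.get, reverse=True)[:k]
        return {tok: logits[tok] for tok in keep}
    return apply

def top_p(p):
    """Keep the smallest prefix of top tokens whose cumulative probability reaches p."""
    def apply(logits):
        probs = softmax(logits)
        kept, cumulative = {}, 0.0
        for tok in sorted(probs, key=probs.get, reverse=True):
            kept[tok] = logits[tok]
            cumulative += probs[tok]
            if cumulative >= p:
                break
        return kept
    return apply

def min_p(ratio):
    """Drop tokens whose probability is below (max probability * ratio)."""
    def apply(logits):
        probs = softmax(logits)
        threshold = max(probs.values()) * ratio
        return {tok: v for tok, v in logits.items() if probs[tok] >= threshold}
    return apply

def run_chain(logits, samplers):
    """Pass the logits through each sampler in order, then normalize."""
    for sampler in samplers:
        logits = sampler(logits)
    return softmax(logits)

def sample_token(probs, rng=random):
    """Draw one token from the final distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]

# Toy logits over a five-token vocabulary (made-up values).
logits = {"the": 4.0, "a": 3.0, "cat": 1.0, "dog": 0.5, "zzz": -5.0}
chain = [top_k(4), top_p(0.95), min_p(0.1), temperature(0.8)]
probs = run_chain(logits, chain)
token = sample_token(probs, random.Random(0))
```

With these toy values, Top-K removes "zzz", Top-P removes "dog", and Min-P removes "cat", so the final draw is between "the" and "a". Reordering the chain changes which tokens survive, which is why the order of samplers matters.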

Backend sampling: for MTP (multi-token prediction) speculative decoding, Top-K can run on the GPU (--spec-draft-backend-sampling), which avoids copying the full logits back to the CPU.

Quantization

Quantization compresses the model weights from F16 down to low-bit integers, trading some quality for a much smaller file:

Quantization Type Comparison

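Type	Bits/weight	Approx. model size (7B)	Quality loss
F16	16	13.5 GB	Baseline
Q8_0	8.5	7.2 GB	Negligible
Q5_1	6.0	4.7 GB	Small
Q5_0	5.5	4.3 GB	Small
Q4_1	5.0	3.9 GB	Moderate
Q4_0	4.5	3.8 GB	Moderate
IQ4_XS	4.25	3.6 GB	Moderate
NVFP4	4	~3.4 GB	Moderate (NVIDIA-specific floating-point format; scales are attached automatically for models such as Mistral3)
Q3_K_S	3.5	3.0 GB	Significant
Q2_K	2.75	2.4 GB	Large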
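The relationship between bits per weight and file size can be checked with a short calculation. This is a rough sketch under stated assumptions: the parameter count of 6.74 billion for a "7B" model is an assumption, the function name is hypothetical, and the estimate ignores metadata and the fact that real files may mix tensor types.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Approximate size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: a "7B" LLaMA-style model has about 6.74 billion parameters.
N_PARAMS_7B = 6.74e9

for name, bpw in [("F16", 16), ("Q8_0", 8.5), ("Q4_0", 4.5)]:
    print(f"{name}: {estimate_size_gb(N_PARAMS_7B, bpw):.1f} GB")
```

Under these assumptions the estimate gives 13.5 GB, 7.2 GB, and 3.8 GB for F16, Q8_0, and Q4_0. Other rows may deviate from this simple formula.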

FP8 → Q8 conversion: convert_hf_to_gguf.py now supports converting FP8 source weights directly to Q8_0, with no intermediate F16 step.

Block Quantization

Quantization operates on blocks of weights (typically 32 weights per block):

Block Q4_0 (18 bytes for 32 weights):
┌──────────┬─────────────────────────┐
│ d (F16)  │ 16 × uint8 (4-bit × 32) │
│ scale    │ quantized values        │
└──────────┴─────────────────────────┘

dequantize: weight = (quantized - 8) × d
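The block layout and the dequantization formula can be sketched in Python as follows. This is a simplified illustration rather than llama.cpp's code: the function names are hypothetical, and the scale choice (largest-magnitude value divided by -8) and the nibble order (weight j in the low nibble and weight j+16 in the high nibble of byte j) are assumptions modeled on the reference Q4_0 quantizer.

```python
import struct

QK = 32  # weights per block

def quantize_q4_0(block):
    """Pack 32 floats into 18 bytes: one F16 scale followed by 16 bytes of 4-bit values."""
    assert len(block) == QK
    # Use the largest-magnitude value (sign preserved) so it maps to the code 0, i.e. -8 * d.
    vmax = max(block, key=abs)
    d = vmax / -8.0
    # Round the scale through F16, since that is how it is stored.
    d = struct.unpack("<e", struct.pack("<e", d))[0]
    inv = 1.0 / d if d != 0.0 else 0.0
    # Map each weight to an unsigned 4-bit code in [0, 15].
    q = [min(15, max(0, int(x * inv + 8.5))) for x in block]
    packed = bytes(q[j] | (q[j + 16] << 4) for j in range(QK // 2))
    return struct.pack("<e", d) + packed

def dequantize_q4_0(data):
    """Recover 32 approximate weights: weight = (quantized - 8) * d."""
    assert len(data) == 18
    d = struct.unpack("<e", data[:2])[0]
    qs = data[2:]
    low = [((b & 0x0F) - 8) * d for b in qs]   # weights 0..15
    high = [((b >> 4) - 8) * d for b in qs]    # weights 16..31
    return low + high

# Round-trip a block of evenly spaced values in [-1, 1).
weights = [i / 16 - 1 for i in range(QK)]
blob = quantize_q4_0(weights)
restored = dequantize_q4_0(blob)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

For this input the scale d is 0.125, and each restored weight is within d / 2 of the original, which shows how the quantization error is bounded by the block's scale.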

Importance Matrix (imatrix)

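An importance matrix (imatrix) is built by running the model on calibration data and recording how important each tensor's weights are; the quantizer then uses these statistics to preserve the critical weights: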

bash

# Generate the imatrix
./llama-imatrix -m model.gguf -f calibration.txt -o imatrix.dat

# Quantize using the imatrix
./llama-quantize --imatrix imatrix.dat model.gguf model-Q4_K_M.gguf Q4_K_M

llama-server

llama-server exposes an OpenAI-compatible API. It can also be started through the unified entry point, llama serve:

bash

# Option 1: run llama-server directly
./llama-server -m model.gguf --port 8080

# Option 2: use the unified entry point
./llama serve -m model.gguf --port 8080

# Call the chat completion endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"model","messages":[{"role":"user","content":"Hello"}]}'

Supported endpoints:

/v1/chat/completions: chat completion
/v1/completions: text completion
/v1/embeddings: embeddings
/v1/models: model list
/health: health check
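The chat completion call can also be made from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, the server address matches the --port 8080 example, and the response shape (choices[0].message.content) is assumed from the OpenAI chat completion format that the server is described as compatible with.

```python
import json
import urllib.request

def build_chat_request(base_url, user_text, model="model"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(base_url, user_text):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, user_text)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]

# Building the request needs no running server; chat() does.
req = build_chat_request("http://localhost:8080", "Hello")
```

With a server running locally, chat("http://localhost:8080", "Hello") would send the same body as the curl example.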

New Server Features

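Token counting API: the */input_tokens endpoints count tokens without running full inference
Real-time inference interruption: POST /v1/chat/completions/control can interrupt thinking (reasoning) while generation is in progress
SSE ping interval: a configurable heartbeat keeps the connection from dropping during long generations
HTTP ETags: FNV-hash-based caching support that reduces redundant transfers
Thinking mode: the UI supports toggling reasoning mode and choosing a reasoning effort level
API key file: the LLAMA_ARG_API_KEY_FILE environment variable allows the API key to be read from a file
Longer timeout: the default timeout is raised to 3600 s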
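To illustrate how hash-based ETags reduce repeated transfers, here is a small Python sketch. It does not reproduce the server's code: the source only says the ETags are FNV-based, so the choice of 64-bit FNV-1a, the hex formatting, and the function names are all assumptions.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def make_etag(body: bytes) -> str:
    """Format the hash as a quoted ETag value (hypothetical format)."""
    return f'"{fnv1a_64(body):016x}"'

def respond(body: bytes, if_none_match=None):
    """Return (status, etag, payload); send 304 with no payload when the client's copy is current."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, etag, b""
    return 200, etag, body

# First request transfers the body; the repeat request with If-None-Match does not.
status1, etag, payload1 = respond(b"static asset contents")
status2, _, payload2 = respond(b"static asset contents", if_none_match=etag)
```

The second call returns status 304 with an empty payload, so the client reuses its cached copy instead of downloading the same bytes again.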

Deprecated API

llama_set_warmup(): marked as deprecated. For MoE models it causes extra graph reallocation, so running the warmup manually is recommended instead.
