
Model benchmarks

Head-to-head LLM comparison on a 5-PDF pentest corpus with 44 manually curated ground-truth values. Numbers drive the curated preset list in config/server_profiles.yml and the recommended-file ordering in anonymize/hf_models.py.

Score

Quality score = F1 × 100, rounded. F1 is the harmonic mean of precision and recall: one number from 0 to 100, higher is better. Bands:

Score Meaning
80 to 100 Excellent. Catches almost every leak with few false alarms.
65 to 79 Good. Usable in production with a light Review pass.
50 to 64 Usable. Expect to spend real time in Review.
40 to 49 Poor. Misses too many leaks or floods Review.
0 to 39 Not recommended. No better than the regex baseline.

Definitions: precision = TP / (TP + FP) (of flagged values, the share that were real leaks). recall = TP / (TP + FN) (of real leaks, the share the model caught). Accuracy is not reported because the denominator (around 50 000 character offsets per PDF, mostly non-leaks) dominates and produces deceptively high numbers (above 99 %).
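
For reference, the arithmetic in one place, as a minimal sketch (the function name is illustrative, not from the codebase):

def quality_score(tp: int, fp: int, fn: int) -> int:
    """Quality = F1 x 100, rounded; 0 when nothing was flagged or present."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return round(f1 * 100)

# e.g. quality_score(40, 13, 4) -> 82 (precision 75.5 %, recall 90.9 %)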

Methodology

  • Corpus. 5 PDFs from a real pentest report. The benchmark harness reads the corpus location from the BENCH_CORPUS_ROOT env var so the repo never embeds a filesystem path. The corpus itself is not redistributed.
  • Ground truth. 44 distinct customer-identifying values were manually curated and cross-checked three times. The list is kept alongside the corpus (under $BENCH_CORPUS_ROOT/groundtruth.yml), not in this repository, for the same privacy reasons as the corpus itself.
  • Pipeline. Every profile is started fresh, then each PDF is processed via anonymize-dossier all --force-rescan.
  • Chunk strategy. structured, the Markdown-aware splitter that never breaks tables, code fences or headings.
  • Metric. The set of distinct from values each profile applies is compared (case-insensitively) against the ground-truth set per PDF; the comparison is sketched after this list. The aggregate Quality score (and the underlying precision, recall and F1 it rolls up) is reported below.
  • Hardware. Same machine for all runs (NVIDIA RTX 5090, 32 GB VRAM, Linux 6.17).
  • Memory. Peak VRAM sampled from nvidia-smi --query-gpu=memory.used after each PDF. Captures the steady-state KV-cache plus weights footprint while serving requests.
  • Runner: bench/run_precision_benchmark.py.
  • Catalog patcher: bench/apply_bench_to_catalog.py. Reads the JSON sidecar produced by the runner and rewrites CuratedRepo.benchmark_* fields and preset-description VRAM numbers in place.
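
How a single PDF is scored, and how the VRAM column is sampled, as a minimal sketch under the methodology above (helper names are illustrative; the real logic lives in bench/run_precision_benchmark.py):

import subprocess

def pdf_counts(applied: set[str], truth: set[str]) -> tuple[int, int, int]:
    """Compare the distinct from-values applied to one PDF against its
    ground-truth set, case-insensitively. Returns (TP, FP, FN)."""
    applied_ci = {v.lower() for v in applied}
    truth_ci = {v.lower() for v in truth}
    return (len(applied_ci & truth_ci),   # TP: real leaks caught
            len(applied_ci - truth_ci),   # FP: over-detections
            len(truth_ci - applied_ci))   # FN: leaks missed

def vram_used_mb() -> int:
    """GPU memory in use right now, the way the harness samples peak VRAM."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True)
    return max(int(v) for v in out.split())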

Aggregate leaderboard (93 models)

Column What it means
Quality F1 × 100 rounded. Sort key.
TP Real leaks correctly anonymised.
FP Wrongly flagged strings (over-detection).
FN Real leaks missed.
Precision TP / (TP + FP). Higher = fewer false alarms.
Recall TP / (TP + FN). Higher = fewer missed leaks.
F1 Quality / 100; harmonic mean of precision and recall.
Disk size GGUF size on disk.
Peak VRAM Max GPU memory during serving (nvidia-smi).
Total Wall-clock to anonymise the 5-PDF corpus end-to-end.

Curated set: usable models (Quality ≥ 50)

The catalog shows every model that scored in the Usable band or better as a curated download with full benchmark numbers on the card. The top 5 ship as built-in presets out of the box; the others are reachable via the Model Manager free-text search or the Curated downloads tab.

# Profile Quality Precision Recall F1 Peak VRAM Total
🥇 ministral-3-8b-reasoning-bf16 83 75.5 % 90.9 % 82.5 % 18 940 MB 244 s
🥈 rtila-qwen3.5-9b-q4 82 74.1 % 90.9 % 82.0 % 7 135 MB 79 s
🥉 ★ jackrong-qwen3.5-4b-distill-q4 78 79.1 % 77.3 % 78.0 % 4 820 MB 185 s
4 qwen3.5-9b-bf16 78 80.5 % 75.0 % 77.7 % 18 024 MB 210 s
5 ministral-3-8b-reasoning-q5 (Q5_K_M) 76 65.6 % 90.9 % 76.2 % 9 171 MB 112 s
6 opus4.7-godsghost-codex-4b-q4 72 78.4 % 65.9 % 71.6 % 4 798 MB 155 s
7 qwopus3.5-4b-v3-q4 71 76.3 % 65.9 % 70.7 % 4 765 MB 456 s
8 opus4.7-godsghost-codex-4b-q4-mirror (WithinUsAI) 69 69.8 % 68.2 % 69.0 % 4 785 MB 155 s
9 omnicoder-9b-q4 69 69.8 % 68.2 % 69.0 % 6 673 MB 113 s
10 ministral-3-14b-reasoning-q4 67 57.0 % 82.0 % 67.0 % 10 965 MB 174 s
11 omniclaw-qwen3.5-9b-uncensored-v2-q4 67 53.4 % 88.6 % 66.7 % 7 069 MB 260 s
12 jackrong-qwen3.5-4b-distill-v2-q4 67 73.0 % 61.4 % 66.7 % 4 833 MB 198 s
13 qwen3-4b-claude-sonnet-x-gemini-reasoning-iq4 65 56.9 % 75.0 % 64.7 % 5 584 MB 241 s
14 ministral-3-8b-bf16 (Instruct) 65 51.4 % 86.4 % 64.5 % 18 768 MB 214 s
15 default (Jackrong 4B distill Q4 CPU) 64 — — — RAM —
16 qwen3-4b-thinking-minimax-m2.1-coder-q4 64 48.2 % 93.2 % 63.6 % 5 708 MB 382 s
17 granite-4.1-8b-bf16 63 62.2 % 63.6 % 62.9 % 19 908 MB 100 s
18 qwen3-space-agent-claude-uncensored-4b-q4 62 47.6 % 90.9 % 62.5 % 5 662 MB 177 s
19 openthinker2-7b-q4 62 72.7 % 54.5 % 62.3 % 6 412 MB 437 s
20 qwen3-4b-thinking-2507-minimax-m2.1-distill-q4 60 62.5 % 56.8 % 59.5 % 5 783 MB 396 s
21 qwen3-4b-thinking-2507-gemini-3-pro-distill-q4 58 63.2 % 54.5 % 58.5 % 5 861 MB 188 s
22 opensonnet-lite-q8 57 55.0 % 59.0 % 57.0 % 7 416 MB 393 s
23 mistral-nemo-instruct-q4 56 48.0 % 66.0 % 56.0 % 10 239 MB 146 s
24 jackrong-qwen3.5-0.8b-distill-q8 56 67.7 % 47.7 % 56.0 % 2 903 MB 164 s
25 qwen3.5-9b-deepseek-v4-flash-q4 55 82.0 % 41.0 % 55.0 % 7 088 MB 624 s
26 qwen3-4b-2507-geminized-v1-q4 54 78.3 % 40.9 % 53.7 % 5 846 MB 424 s
27 deepthink-reasoning-7b-q4 53 54.0 % 52.0 % 53.0 % 6 449 MB 240 s

Benchmarked but below the Usable cut (Quality < 50)

Reachable via the Model Manager free-text search; each entry shows a ⚠️ badge with the benchmark numbers and the reason it didn't make the curated cut.

# Profile Quality Precision Recall F1 Peak VRAM Total
28 unsloth/qwen3.5-2b-ud-q4-k-xl 49 65.4 % 38.6 % 48.6 % 3 265 MB 120 s
29 liontix/qwen3-4b-sonnet-4-gpt-5-distill-q4 48 36.0 % 70.5 % 47.7 % 5 684 MB 147 s
30 meta-llama-3-8b-instruct-q4 47 39.0 % 61.0 % 47.0 % 7 437 MB 124 s
31 wavecoder-ultra-6.7b-iq4 47 54.5 % 40.9 % 46.8 % 11 113 MB 515 s
32 within-us-coder-4b-q4 47 54.5 % 40.9 % 46.8 % 4 762 MB 157 s
33 jackrong-qwen3.5-2b-distill-q4 46 82.4 % 31.8 % 45.9 % 3 367 MB 124 s
34 unsloth/qwen3.5-0.8b-ud-q8-k-xl 44 47.4 % 40.9 % 43.9 % 3 067 MB 160 s
35 opensonnet-lite-q4 42 35.0 % 52.0 % 42.0 % 5 694 MB 436 s
36 darwin-2b-opus-q4 42 38.5 % 45.5 % 41.7 % 3 128 MB 106 s
37 evelyn67/qwen3.5-2b-uncensored-q6 42 31.1 % 63.6 % 41.8 % 3 400 MB 186 s
38 unsloth/qwen3.5-0.8b-ud-q4-k-xl 41 39.6 % 43.2 % 41.3 % 2 530 MB 144 s
39 glm-4.6v-flash-q5 40 40.0 % 41.0 % 40.0 % 8 532 MB 80 s
40 ministral-3-3b-reasoning-bf16 40 28.0 % 68.2 % 39.7 % 9 610 MB 141 s
41 lfm-2.5-1.2b-f16 38 63.2 % 27.3 % 38.1 % 3 880 MB 119 s
42 nvidia-agentic-coder-4b-q4 37 60.0 % 27.3 % 37.5 % 4 253 MB 33 s
43 seed-coder-8b-reasoning-q4 36 55.0 % 27.0 % 36.0 % 7 694 MB 660 s
44 agent-nano-coder-2b-q4 35 29.7 % 43.2 % 35.2 % 4 102 MB 543 s
45 deepseek-coder-6.7b-f16 34 44.4 % 27.3 % 33.8 % 22 221 MB 339 s
46 nemotron-3-nano-4b-q4 34 44.4 % 27.3 % 33.8 % 4 436 MB 40 s
47 qwen3-4b-reasoning-slerp-q8 34 66.7 % 22.7 % 33.8 % 8 172 MB 387 s
48 olympiccoder-7b-q4 33 81.8 % 20.5 % 32.7 % 6 412 MB 453 s
49 ibm-opus4.7-obscure-reasoner-3b-q4 32 75.0 % 20.5 % 32.1 % 4 452 MB 112 s
50 skywork-or1-7b-preview-q4 32 38.7 % 27.3 % 32.0 % 6 412 MB 219 s
51 rikunarita-2-qwen3.5-2b-claude-opus-v2-q5-imat 31 24.7 % 43.2 % 31.4 % 3 261 MB 146 s
52 llama-3.2-3b-instruct-q5 31 23.0 % 46.0 % 31.0 % 5 172 MB 196 s
53 llama-3.2-3b-reason-reflect-lite-q4 31 60.0 % 20.0 % 31.0 % 4 797 MB 35 s
54 magistral-small-2507-q4 (24B) 30 80.0 % 18.0 % 30.0 % 16 782 MB 698 s
55 mythoseek-q4 30 29.2 % 31.8 % 30.4 % 7 139 MB 533 s
56 llada-moe-7b-q4 30 80.0 % 18.2 % 29.6 % 5 687 MB 22 s
57 smollm2-135m-instruct-q4 30 80.0 % 18.2 % 29.6 % 1 690 MB 70 s
58 zeta-q4 30 80.0 % 18.2 % 29.6 % 6 410 MB 185 s
59 smollm3-3b-q4 30 22.9 % 43.2 % 29.9 % 4 241 MB 272 s
60 opencoder-1.5b-instruct-q4 30 80.0 % 18.2 % 29.6 % 5 134 MB 22 s
61 openhands-lm-1.5b-q4 30 80.0 % 18.2 % 29.6 % 2 881 MB 168 s
62 openreasoning-nemotron-1.5b-q4 30 80.0 % 18.2 % 29.6 % 2 882 MB 281 s
63 llama-coyote-coder-4b-q4 30 80.0 % 18.2 % 29.6 % 8 140 MB 218 s
64 qwenseek-2b-bf16 30 23.9 % 38.6 % 29.5 % 5 597 MB 175 s
65 opus-1.5-q4 30 80.0 % 18.2 % 29.6 % 2 417 MB 22 s
66 deecon-securityanalyst-1.5b-q8 30 80.0 % 18.2 % 29.6 % 3 692 MB 56 s
67 deepseek-r1-opus-q8 30 80.0 % 18.2 % 29.6 % 3 914 MB 208 s
68 cicikus-v3-1.4b-opus4.6-q8 30 80.0 % 18.2 % 29.6 % 3 778 MB 99 s
69 qwen3-zero-coder-reasoning-v2-0.8b-f16 30 43.5 % 22.7 % 29.9 % 5 338 MB 135 s
70 qwen-researcher-f16 29 72.7 % 18.2 % 29.1 % 2 784 MB 148 s
71 qwen3-4b-thinking-2507-q4 (MaziyarPanahi) 29 66.7 % 18.2 % 28.6 % 5 676 MB 512 s
72 wizardlm-2-7b-q4 29 25.0 % 34.1 % 28.8 % 7 028 MB 280 s
73 deepseek-r1-distill-qwen-1.5b-ud-q4 27 33.3 % 22.7 % 27.0 % 2 885 MB 85 s
74 security-slm-unsloth-1.5b-f16 27 50.0 % 18.2 % 26.7 % 3 172 MB 217 s
75 falcon3-3b-instruct-q4 26 20.5 % 34.1 % 25.6 % 4 332 MB 130 s
76 zr1-1.5b-q4 26 31.2 % 22.7 % 26.3 % 2 866 MB 130 s
77 lfm2.5-1.2b-thinking-pony-alpha-distill-q4 26 37.5 % 20.5 % 26.5 % 2 302 MB 142 s
78 bonsai-8b-q1_0 (experimental) 23 16.8 % 38.6 % 23.4 % 4 373 MB 393 s
79 cogito-v1-preview-llama-3b-q4 23 14.6 % 52.3 % 22.8 % 4 834 MB 311 s
80 gemma-3-4b-opus-reasoning-distill-q4 21 17.0 % 25.0 % 21.0 % 4 618 MB 522 s
81 ernie-4.5-0.3b-q4 21 22.0 % 20.5 % 21.2 % 1 917 MB 14 s
82 exaone-4.0-1.2b-q4 21 17.5 % 25.0 % 20.6 % 2 970 MB 76 s
83 jairodanielmt/qwen3-1.7b-opus-finetune-q4 21 19.2 % 22.7 % 20.8 % 3 962 MB 240 s
84 qwen2.5-coder-3b-q4 19 18.0 % 20.5 % 19.1 % 3 864 MB 261 s
85 jackrong-qwen3-1.7b-gemini-3-pro-distill-q4 19 16.4 % 22.7 % 19.0 % 4 029 MB 319 s
86 teichai-qwen3-1.7b-gemini-2.5-flash-lite-distill-f16 19 12.6 % 36.4 % 18.7 % 6 391 MB 287 s
87 aya-23-8b-iq4 15 9.2 % 43.2 % 15.2 % 7 907 MB 428 s
88 ibm-grok4-ultra-fast-coder-1b-q4 14 10.3 % 22.7 % 14.2 % 3 410 MB 231 s
89 mradermacher-qwen3-0.6b-claude-opus-distill-q4 14 9.6 % 25.0 % 13.8 % 3 280 MB 159 s
90 llama3.2-agent-hermes-coder-3b-q4 4 2.6 % 13.6 % 4.4 % 4 820 MB 402 s

Architecturally or behaviourally incompatible

These collapse to the Tier-0 deterministic regex baseline (around 30 out of 100) because their LLM never produces usable JSON candidates. Marked ❌ in the Model Manager. The root causes differ:

Profile Quality score Root cause
gemma-4-e4b-it-bf16 30 / 100 Gemma 4 SWA-1024 ¹
gemma-4-e2b-it-bf16 30 / 100 Same SWA-1024 ¹
qwen3guard-gen-4b-f16 30 / 100 Safety-tuned model refuses arbitrary-JSON tasks
qwen3guard-gen-8b-f16 30 / 100 Same as 4B
hy-mt1.5-1.8b-2bit n/a 2-bit GGUF quantisation tensor type 2 not supported by the bundled llama.cpp build (load_model: failed to load model); model never starts, so quality is undefined. Recompile llama.cpp with the appropriate -DGGML_… flags or pick a Q4_K_M / Q8_0 quant of the same model when one is published.

¹ Gemma 4 architecture uses Sliding Window Attention (SWA, 1024-token window on 20 of 24 layers, visible in llama-server's creating SWA KV cache, size = 1024 cells, 20 layers). Our system_detector.txt is ~3700 tokens; SWA layers only see the last 1024, which drops the JSON-output instructions. Manual short-prompt tests succeed; the long detector prompt does not. Switching the chat template (peg-native vs. peg-gemma4) does not help: the limitation is structural, not template-related.

How to pick

Pick the row that matches your hardware. The "Why" column is the trade-off you are accepting.

Hardware or goal Pick Why
18 GB VRAM or more, top quality ministral-3-8b-reasoning-bf16 Quality 83, catches around 9 out of every 10 leaks.
18 GB VRAM or more, fewest false alarms qwen3.5-9b-bf16 Precision 80.5 %. Lower recall, so a few more leaks reach Review.
Around 7 GB VRAM, near-leader quality rtila-qwen3.5-9b-q4 Quality 82 at one-third the VRAM of the BF16 leader and the fastest run on the corpus (79 s).
Around 6 GB VRAM, smallest "good" model ★ recommended jackrong-qwen3.5-4b-distill-q4 Quality 78 at 2.5 GB on disk. Best small-but-good pick on the curated set, with the second-best precision among the built-in presets (79.1 %, behind qwen3.5-9b-bf16 at 80.5 %, which needs almost 4x the VRAM). Recommended starting point on any GPU.
Around 10 GB VRAM, reasoning quality ministral-3-8b-reasoning-q5 Quality 76 at half the VRAM of the BF16 leader. Recall matches BF16; precision drops about 10 points.
No GPU default The shipped CPU profile. Same Jackrong Qwen 3.5 4B Q4_K_M weights as jackrong-qwen3.5-4b-distill-q4, just configured with n_gpu_layers: 0 (Quality 78 at around 2.5 GB on disk, expect roughly 10x slower than the GPU run).
Smallest GGUF, quality is not a priority jackrong-qwen3.5-0.8b-distill-q8 0.8 GB on disk, Quality 56 (still in the Usable band and the best of the sub-1 GB tier; Peak VRAM 2.9 GB).

The reasoning models are fed enable_thinking: false so they emit JSON directly, without burning the token budget on <think> blocks.
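
On the wire that amounts to one extra field in the chat request. A sketch against an OpenAI-compatible llama-server endpoint (the URL, the chat_template_kwargs plumbing and the prompt path are assumptions about the local setup; the real payload is built in LLMClient):

import json, urllib.request

chunk_text = "<one ~5000-character chunk from the structure-aware chunker>"
payload = {
    "messages": [
        {"role": "system", "content": open("system_detector.txt").read()},
        {"role": "user", "content": chunk_text},
    ],
    "max_tokens": 2048,                                  # output JSON budget
    "chat_template_kwargs": {"enable_thinking": False},  # no <think> blocks
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
response = json.load(urllib.request.urlopen(req))        # JSON candidates back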

Why these context sizes

The pipeline is chunked: the detector splits each input segment into ~5000-character chunks via the structure-aware chunker (anonymize/structure_chunker.py) and the LLM only ever sees one chunk per request. Measured request size (calibrated against the real prompts):

Component Tokens (measured)
system_detector.txt (12 748 chars at ~3.5 char/tok) ~3 700
Few-shot examples (top 8 from decisions_history.jsonl) ~250
Chunk body (5000 chars worst case) ~1 430
Output JSON budget (max_tokens in LLMClient) 2 048
Worst-case per request ~7 430

That sets a floor: slot ≥ 7 430 tokens (slot = ctx_size / parallel). The pre-flight check in anonymize/budget.py refuses any preset that violates it; the unit test in tests/test_preset_budget.py keeps the curated catalog honest.
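
The check itself is simple arithmetic. A minimal sketch of the floor (constant and function names are illustrative, not the actual anonymize/budget.py API):

SYSTEM_TOK  = 3_700   # system_detector.txt at ~3.5 chars/token
FEWSHOT_TOK = 250     # top-8 examples from decisions_history.jsonl
CHUNK_TOK   = 1_430   # 5000-char worst-case chunk body
OUTPUT_TOK  = 2_048   # max_tokens in LLMClient
REQUIRED = SYSTEM_TOK + FEWSHOT_TOK + CHUNK_TOK + OUTPUT_TOK  # ~7 430

def check_preset(ctx_size: int, parallel: int) -> int:
    """Reject a preset whose per-request slot is below the floor; return headroom."""
    slot = ctx_size // parallel
    if slot < REQUIRED:
        raise ValueError(f"slot {slot} < {REQUIRED}: raise ctx_size or lower parallel")
    return slot - REQUIRED

check_preset(16_384, 2)   # ministral presets: slot 8 192, ~760 tokens of headroom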

Slot status of the shipped presets (matches config/server_profiles.yml):

Preset ctx parallel slot required headroom fits
default (Jackrong Qwen 3.5 4B Q4_K_M, CPU) 12 288 1 12 288 ~7 430 +4 858 ✓
jackrong-qwen3.5-4b-distill-q4 12 288 1 12 288 ~7 430 +4 858 ✓
rtila-qwen3.5-9b-q4 12 288 1 12 288 ~7 430 +4 858 ✓
ministral-3-8b-reasoning-bf16 16 384 2 8 192 ~7 430 +762 ✓
ministral-3-8b-reasoning-q5 (Q5_K_M) 16 384 2 8 192 ~7 430 +762 ✓
qwen3.5-9b-bf16 16 384 1 16 384 ~7 430 +8 954 ✓

Same-model quant comparison

Different quants of the same base model, same corpus.

Model Quant F1 Δ vs. reference
Qwen 3.5 4B BF16 63.8 % reference
Qwen 3.5 4B Q4_K_XL 58.2 % -5.6 pts
Ministral 3 8B Instruct BF16 64.5 % reference
Ministral 3 8B Instruct Q8_K_XL 64.4 % within noise
Ministral 3 8B Reasoning BF16 82.5 % reference
Ministral 3 8B Reasoning Q5_K_M 76.2 % -6.3 pts (recall identical, precision -10 pts, around half the VRAM)
Granite 4.1 8B BF16 62.9 % reference
Granite 4.1 8B Q8_K_XL 65.3 % +2.4 pts (recall lower, 70.5 % vs 86.4 %)
OpenSonnet Lite Q8_0 57.0 % reference
OpenSonnet Lite Q4_K_M 42.0 % -15 pts (heavy precision drop)

Rule of thumb: BF16 usually wins same-model comparisons, and the gap widens on smaller quants (the Granite Q8_K_XL row above is the one exception here). KV cache stays at f16 across the board.

Distill vs. base model

Some Q4_K_M distills outscore the BF16 build of their base model. Different training objective, not just a different quant.

Base model Build Quant F1 Δ vs. base BF16
Qwen 3.5 9B base (unsloth) BF16 77.7 % reference
Qwen 3.5 9B rtila Assistant Lite Q4_K_M 82.0 % +4.3 pts at around 7 GB VRAM (vs 18 GB for the base)
Qwen 3.5 9B Jackrong Claude-Opus distill Q4_K_M 76.0 % -1.7 pts (within noise) at around 7 GB
Qwen 3.5 4B base (unsloth) BF16 63.8 % reference
Qwen 3.5 4B Jackrong Claude-Opus distill Q4_K_M 78.0 % +14.2 pts at 2.5 GB on disk

Practical takeaway: when picking a small model, prefer a purpose-trained distill of a strong base over the base in BF16. The distill captures the JSON-output discipline the anonymizer needs at a fraction of the VRAM.

Reproducing

You need a corpus folder of your own (the 5-PDF corpus used here is private). Point BENCH_CORPUS_ROOT at any folder of PDFs you have ground truth for, then pick a folder for the run output (anywhere on disk; the snippet below uses a bench_runs/ directory next to the repo):

export BENCH_CORPUS_ROOT=/path/to/your/pdfs
OUT=./bench_runs/precision_top5

# Run the benchmark for the curated presets.
PYTHONPATH=$(pwd) QT_QPA_PLATFORM=offscreen \
    .venv/bin/python bench/run_precision_benchmark.py \
    --profiles default \
               jackrong-qwen3.5-4b-distill-q4 \
               rtila-qwen3.5-9b-q4 \
               ministral-3-8b-reasoning-bf16 \
               ministral-3-8b-reasoning-q5 \
               qwen3.5-9b-bf16 \
    --out-root "$OUT"

# Patch the catalog + presets with the measured numbers.
PYTHONPATH=$(pwd) .venv/bin/python bench/apply_bench_to_catalog.py \
    "$OUT/report.json"

The runner emits both report.md (human-readable, with per-PDF breakdown plus miss and extra lists) and report.json (machine-readable, consumed by apply_bench_to_catalog.py).
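
To post-process a run yourself, something like the following works; only the file name comes from the runner, and the top-level layout of report.json is an assumption here (inspect a real report first):

import json, sys

# Hypothetical reader: assumes report.json maps profile names to their stats.
report = json.load(open(sys.argv[1]))   # e.g. bench_runs/precision_top5/report.json
for profile, stats in sorted(report.items()):
    print(profile, stats)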

Per-PDF breakdown

The runner writes a per-profile, per-PDF report into <out-root>/<profile>/, with miss and extra lists for every PDF and a top-level report.md summarising the run.