# Model benchmarks
Head-to-head LLM comparison on a 5-PDF pentest corpus with 44 manually curated ground-truth values. The numbers drive the curated preset list in `config/server_profiles.yml` and the recommended-file ordering in `anonymize/hf_models.py`.
## Score
Quality score = F1 × 100, rounded. F1 is the harmonic mean of
precision and recall: a single number from 0 to 100, higher is better.
Bands:
| Score | Meaning |
|---|---|
| 80 to 100 | Excellent. Catches almost every leak with few false alarms. |
| 65 to 79 | Good. Usable in production with a light Review pass. |
| 50 to 64 | Usable. Expect to spend real time in Review. |
| 40 to 49 | Poor. Misses too many leaks or floods Review. |
| 0 to 39 | Not recommended. No better than the regex baseline. |
Definitions: precision = TP / (TP + FP) (of the flagged values, the
share that were real leaks); recall = TP / (TP + FN) (of the real
leaks, the share the model caught). Accuracy is not reported because
true negatives dominate the denominator (around 50 000 character
offsets per PDF, almost all non-leaks), which produces deceptively
high numbers (above 99 %).
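The whole score reduces to a few lines of arithmetic. A minimal sketch of how Quality falls out of the three counts (illustrative, not the runner's actual code):

```python
def quality_score(tp: int, fp: int, fn: int) -> int:
    """Quality = F1 x 100, rounded. Zero when nothing real was flagged."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0
    f1 = 2 * precision * recall / (precision + recall)
    return round(f1 * 100)

quality_score(30, 10, 14)  # 30 of 44 leaks caught, 10 false alarms -> 71
```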
## Methodology
- **Corpus.** 5 PDFs from a real pentest report. The benchmark harness reads the corpus location from the `BENCH_CORPUS_ROOT` env var, so the repo never embeds a filesystem path. The corpus itself is not redistributed.
- **Ground truth.** 44 distinct customer-identifying values, manually curated and cross-checked three times. The list is kept alongside the corpus (under `$BENCH_CORPUS_ROOT/groundtruth.yml`), not in this repository, for the same privacy reasons as the corpus itself.
- **Pipeline.** Every profile is started fresh, then each PDF is processed via `anonymize-dossier all --force-rescan`.
- **Chunk strategy.** `structured`, the Markdown-aware splitter that never breaks tables, code fences or headings.
- **Metric.** The set of distinct `from` values applied (case-insensitive) is compared against the ground-truth set per PDF (see the sketch after this list). The aggregate Quality score (and the underlying precision, recall and F1 it rolls up) is reported below.
- **Hardware.** Same machine for all runs (NVIDIA RTX 5090, 32 GB VRAM, Linux 6.17).
- **Memory.** Peak VRAM sampled from `nvidia-smi --query-gpu=memory.used` after each PDF. Captures the steady-state KV-cache-plus-weights footprint while serving requests.
- **Runner.** `bench/run_precision_benchmark.py`.
- **Catalog patcher.** `bench/apply_bench_to_catalog.py`. Reads the JSON sidecar produced by the runner and rewrites `CuratedRepo.benchmark_*` fields and preset-description VRAM numbers in place.
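The per-PDF comparison behind the Metric bullet is a plain set operation. An illustrative sketch (function and argument names are ours, not the runner's):

```python
def score_pdf(applied: set[str], ground_truth: set[str]) -> tuple[int, int, int]:
    """Case-insensitively compare the distinct `from` values a profile applied
    against the curated ground-truth set for one PDF. Returns (TP, FP, FN);
    precision, recall and F1 then aggregate over all five PDFs."""
    flagged = {v.casefold() for v in applied}
    truth = {v.casefold() for v in ground_truth}
    return (
        len(flagged & truth),   # TP: real leaks correctly anonymised
        len(flagged - truth),   # FP: wrongly flagged strings
        len(truth - flagged),   # FN: real leaks missed
    )
```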
## Aggregate leaderboard (93 models)
| Column | What it means |
|---|---|
| Quality | F1 × 100, rounded. Sort key. |
| TP | Real leaks correctly anonymised. |
| FP | Wrongly flagged strings (over-detection). |
| FN | Real leaks missed. |
| Precision | TP / (TP + FP). Higher = fewer false alarms. |
| Recall | TP / (TP + FN). Higher = fewer missed leaks. |
| F1 | Quality / 100; harmonic mean of precision and recall. |
| Disk size | GGUF size on disk. |
| Peak VRAM | Max GPU memory during serving (nvidia-smi). |
| Total | Wall-clock to anonymise the 5-PDF corpus end-to-end. |
### Curated set: usable models (Quality >= 50)
The catalog shows every model that scored in the Usable band or better as a curated download with full benchmark numbers on the card. The top 5 ship as built-in presets out of the box; the others are reachable via the Model Manager free-text search or the Curated downloads tab.
| # | Profile | Quality | Precision | Recall | F1 | Peak VRAM | Total |
|---|---|---|---|---|---|---|---|
| 🥇 | `ministral-3-8b-reasoning-bf16` | 83 | 75.5 % | 90.9 % | 82.5 % | 18 940 MB | 244 s |
| 🥈 | `rtila-qwen3.5-9b-q4` | 82 | 74.1 % | 90.9 % | 82.0 % | 7 135 MB | 79 s |
| 🥉 ⭐ | `jackrong-qwen3.5-4b-distill-q4` | 78 | 79.1 % | 77.3 % | 78.0 % | 4 820 MB | 185 s |
| 4 | `qwen3.5-9b-bf16` | 78 | 80.5 % | 75.0 % | 77.7 % | 18 024 MB | 210 s |
| 5 | `ministral-3-8b-reasoning-q5` (Q5_K_M) | 76 | 65.6 % | 90.9 % | 76.2 % | 9 171 MB | 112 s |
| 6 | `opus4.7-godsghost-codex-4b-q4` | 72 | 78.4 % | 65.9 % | 71.6 % | 4 798 MB | 155 s |
| 7 | `qwopus3.5-4b-v3-q4` | 71 | 76.3 % | 65.9 % | 70.7 % | 4 765 MB | 456 s |
| 8 | `opus4.7-godsghost-codex-4b-q4-mirror` (WithinUsAI) | 69 | 69.8 % | 68.2 % | 69.0 % | 4 785 MB | 155 s |
| 9 | `omnicoder-9b-q4` | 69 | 69.8 % | 68.2 % | 69.0 % | 6 673 MB | 113 s |
| 10 | `ministral-3-14b-reasoning-q4` | 67 | 57.0 % | 82.0 % | 67.0 % | 10 965 MB | 174 s |
| 11 | `omniclaw-qwen3.5-9b-uncensored-v2-q4` | 67 | 53.4 % | 88.6 % | 66.7 % | 7 069 MB | 260 s |
| 12 | `jackrong-qwen3.5-4b-distill-v2-q4` | 67 | 73.0 % | 61.4 % | 66.7 % | 4 833 MB | 198 s |
| 13 | `qwen3-4b-claude-sonnet-x-gemini-reasoning-iq4` | 65 | 56.9 % | 75.0 % | 64.7 % | 5 584 MB | 241 s |
| 14 | `ministral-3-8b-bf16` (Instruct) | 65 | 51.4 % | 86.4 % | 64.5 % | 18 768 MB | 214 s |
| 15 | `default` (Jackrong 4B distill Q4 CPU) | 64 | — | — | — | RAM | — |
| 16 | `qwen3-4b-thinking-minimax-m2.1-coder-q4` | 64 | 48.2 % | 93.2 % | 63.6 % | 5 708 MB | 382 s |
| 17 | `granite-4.1-8b-bf16` | 63 | 62.2 % | 63.6 % | 62.9 % | 19 908 MB | 100 s |
| 18 | `qwen3-space-agent-claude-uncensored-4b-q4` | 62 | 47.6 % | 90.9 % | 62.5 % | 5 662 MB | 177 s |
| 19 | `openthinker2-7b-q4` | 62 | 72.7 % | 54.5 % | 62.3 % | 6 412 MB | 437 s |
| 20 | `qwen3-4b-thinking-2507-minimax-m2.1-distill-q4` | 60 | 62.5 % | 56.8 % | 59.5 % | 5 783 MB | 396 s |
| 21 | `qwen3-4b-thinking-2507-gemini-3-pro-distill-q4` | 58 | 63.2 % | 54.5 % | 58.5 % | 5 861 MB | 188 s |
| 22 | `opensonnet-lite-q8` | 57 | 55.0 % | 59.0 % | 57.0 % | 7 416 MB | 393 s |
| 23 | `mistral-nemo-instruct-q4` | 56 | 48.0 % | 66.0 % | 56.0 % | 10 239 MB | 146 s |
| 24 | `jackrong-qwen3.5-0.8b-distill-q8` | 56 | 67.7 % | 47.7 % | 56.0 % | 2 903 MB | 164 s |
| 25 | `qwen3.5-9b-deepseek-v4-flash-q4` | 55 | 82.0 % | 41.0 % | 55.0 % | 7 088 MB | 624 s |
| 26 | `qwen3-4b-2507-geminized-v1-q4` | 54 | 78.3 % | 40.9 % | 53.7 % | 5 846 MB | 424 s |
| 27 | `deepthink-reasoning-7b-q4` | 53 | 54.0 % | 52.0 % | 53.0 % | 6 449 MB | 240 s |
### Benchmarked but below the Usable cut (Quality < 50)
Reachable via the Model Manager free-text search; each entry shows a ⚠️ badge with the benchmark numbers and the reason it didn't make the curated cut.
| # | Profile | Quality | Precision | Recall | F1 | Peak VRAM | Total |
|---|---|---|---|---|---|---|---|
| 28 | `unsloth/qwen3.5-2b-ud-q4-k-xl` | 49 | 65.4 % | 38.6 % | 48.6 % | 3 265 MB | 120 s |
| 29 | `liontix/qwen3-4b-sonnet-4-gpt-5-distill-q4` | 48 | 36.0 % | 70.5 % | 47.7 % | 5 684 MB | 147 s |
| 30 | `meta-llama-3-8b-instruct-q4` | 47 | 39.0 % | 61.0 % | 47.0 % | 7 437 MB | 124 s |
| 31 | `wavecoder-ultra-6.7b-iq4` | 47 | 54.5 % | 40.9 % | 46.8 % | 11 113 MB | 515 s |
| 32 | `within-us-coder-4b-q4` | 47 | 54.5 % | 40.9 % | 46.8 % | 4 762 MB | 157 s |
| 33 | `jackrong-qwen3.5-2b-distill-q4` | 46 | 82.4 % | 31.8 % | 45.9 % | 3 367 MB | 124 s |
| 34 | `unsloth/qwen3.5-0.8b-ud-q8-k-xl` | 44 | 47.4 % | 40.9 % | 43.9 % | 3 067 MB | 160 s |
| 35 | `opensonnet-lite-q4` | 42 | 35.0 % | 52.0 % | 42.0 % | 5 694 MB | 436 s |
| 36 | `darwin-2b-opus-q4` | 42 | 38.5 % | 45.5 % | 41.7 % | 3 128 MB | 106 s |
| 37 | `evelyn67/qwen3.5-2b-uncensored-q6` | 42 | 31.1 % | 63.6 % | 41.8 % | 3 400 MB | 186 s |
| 38 | `unsloth/qwen3.5-0.8b-ud-q4-k-xl` | 41 | 39.6 % | 43.2 % | 41.3 % | 2 530 MB | 144 s |
| 39 | `glm-4.6v-flash-q5` | 40 | 40.0 % | 41.0 % | 40.0 % | 8 532 MB | 80 s |
| 40 | `ministral-3-3b-reasoning-bf16` | 40 | 28.0 % | 68.2 % | 39.7 % | 9 610 MB | 141 s |
| 41 | `lfm-2.5-1.2b-f16` | 38 | 63.2 % | 27.3 % | 38.1 % | 3 880 MB | 119 s |
| 42 | `nvidia-agentic-coder-4b-q4` | 37 | 60.0 % | 27.3 % | 37.5 % | 4 253 MB | 33 s |
| 43 | `seed-coder-8b-reasoning-q4` | 36 | 55.0 % | 27.0 % | 36.0 % | 7 694 MB | 660 s |
| 44 | `agent-nano-coder-2b-q4` | 35 | 29.7 % | 43.2 % | 35.2 % | 4 102 MB | 543 s |
| 45 | `deepseek-coder-6.7b-f16` | 34 | 44.4 % | 27.3 % | 33.8 % | 22 221 MB | 339 s |
| 46 | `nemotron-3-nano-4b-q4` | 34 | 44.4 % | 27.3 % | 33.8 % | 4 436 MB | 40 s |
| 47 | `qwen3-4b-reasoning-slerp-q8` | 34 | 66.7 % | 22.7 % | 33.8 % | 8 172 MB | 387 s |
| 48 | `olympiccoder-7b-q4` | 33 | 81.8 % | 20.5 % | 32.7 % | 6 412 MB | 453 s |
| 49 | `ibm-opus4.7-obscure-reasoner-3b-q4` | 32 | 75.0 % | 20.5 % | 32.1 % | 4 452 MB | 112 s |
| 50 | `skywork-or1-7b-preview-q4` | 32 | 38.7 % | 27.3 % | 32.0 % | 6 412 MB | 219 s |
| 51 | `rikunarita-2-qwen3.5-2b-claude-opus-v2-q5-imat` | 31 | 24.7 % | 43.2 % | 31.4 % | 3 261 MB | 146 s |
| 52 | `llama-3.2-3b-instruct-q5` | 31 | 23.0 % | 46.0 % | 31.0 % | 5 172 MB | 196 s |
| 53 | `llama-3.2-3b-reason-reflect-lite-q4` | 31 | 60.0 % | 20.0 % | 31.0 % | 4 797 MB | 35 s |
| 54 | `magistral-small-2507-q4` (24B) | 30 | 80.0 % | 18.0 % | 30.0 % | 16 782 MB | 698 s |
| 55 | `mythoseek-q4` | 30 | 29.2 % | 31.8 % | 30.4 % | 7 139 MB | 533 s |
| 56 | `llada-moe-7b-q4` | 30 | 80.0 % | 18.2 % | 29.6 % | 5 687 MB | 22 s |
| 57 | `smollm2-135m-instruct-q4` | 30 | 80.0 % | 18.2 % | 29.6 % | 1 690 MB | 70 s |
| 58 | `zeta-q4` | 30 | 80.0 % | 18.2 % | 29.6 % | 6 410 MB | 185 s |
| 59 | `smollm3-3b-q4` | 30 | 22.9 % | 43.2 % | 29.9 % | 4 241 MB | 272 s |
| 60 | `opencoder-1.5b-instruct-q4` | 30 | 80.0 % | 18.2 % | 29.6 % | 5 134 MB | 22 s |
| 61 | `openhands-lm-1.5b-q4` | 30 | 80.0 % | 18.2 % | 29.6 % | 2 881 MB | 168 s |
| 62 | `openreasoning-nemotron-1.5b-q4` | 30 | 80.0 % | 18.2 % | 29.6 % | 2 882 MB | 281 s |
| 63 | `llama-coyote-coder-4b-q4` | 30 | 80.0 % | 18.2 % | 29.6 % | 8 140 MB | 218 s |
| 64 | `qwenseek-2b-bf16` | 30 | 23.9 % | 38.6 % | 29.5 % | 5 597 MB | 175 s |
| 65 | `opus-1.5-q4` | 30 | 80.0 % | 18.2 % | 29.6 % | 2 417 MB | 22 s |
| 66 | `deecon-securityanalyst-1.5b-q8` | 30 | 80.0 % | 18.2 % | 29.6 % | 3 692 MB | 56 s |
| 67 | `deepseek-r1-opus-q8` | 30 | 80.0 % | 18.2 % | 29.6 % | 3 914 MB | 208 s |
| 68 | `cicikus-v3-1.4b-opus4.6-q8` | 30 | 80.0 % | 18.2 % | 29.6 % | 3 778 MB | 99 s |
| 69 | `qwen3-zero-coder-reasoning-v2-0.8b-f16` | 30 | 43.5 % | 22.7 % | 29.9 % | 5 338 MB | 135 s |
| 70 | `qwen-researcher-f16` | 29 | 72.7 % | 18.2 % | 29.1 % | 2 784 MB | 148 s |
| 71 | `qwen3-4b-thinking-2507-q4` (MaziyarPanahi) | 29 | 66.7 % | 18.2 % | 28.6 % | 5 676 MB | 512 s |
| 72 | `wizardlm-2-7b-q4` | 29 | 25.0 % | 34.1 % | 28.8 % | 7 028 MB | 280 s |
| 73 | `deepseek-r1-distill-qwen-1.5b-ud-q4` | 27 | 33.3 % | 22.7 % | 27.0 % | 2 885 MB | 85 s |
| 74 | `security-slm-unsloth-1.5b-f16` | 27 | 50.0 % | 18.2 % | 26.7 % | 3 172 MB | 217 s |
| 75 | `falcon3-3b-instruct-q4` | 26 | 20.5 % | 34.1 % | 25.6 % | 4 332 MB | 130 s |
| 76 | `zr1-1.5b-q4` | 26 | 31.2 % | 22.7 % | 26.3 % | 2 866 MB | 130 s |
| 77 | `lfm2.5-1.2b-thinking-pony-alpha-distill-q4` | 26 | 37.5 % | 20.5 % | 26.5 % | 2 302 MB | 142 s |
| 78 | `bonsai-8b-q1_0` (experimental) | 23 | 16.8 % | 38.6 % | 23.4 % | 4 373 MB | 393 s |
| 79 | `cogito-v1-preview-llama-3b-q4` | 23 | 14.6 % | 52.3 % | 22.8 % | 4 834 MB | 311 s |
| 80 | `gemma-3-4b-opus-reasoning-distill-q4` | 21 | 17.0 % | 25.0 % | 21.0 % | 4 618 MB | 522 s |
| 81 | `ernie-4.5-0.3b-q4` | 21 | 22.0 % | 20.5 % | 21.2 % | 1 917 MB | 14 s |
| 82 | `exaone-4.0-1.2b-q4` | 21 | 17.5 % | 25.0 % | 20.6 % | 2 970 MB | 76 s |
| 83 | `jairodanielmt/qwen3-1.7b-opus-finetune-q4` | 21 | 19.2 % | 22.7 % | 20.8 % | 3 962 MB | 240 s |
| 84 | `qwen2.5-coder-3b-q4` | 19 | 18.0 % | 20.5 % | 19.1 % | 3 864 MB | 261 s |
| 85 | `jackrong-qwen3-1.7b-gemini-3-pro-distill-q4` | 19 | 16.4 % | 22.7 % | 19.0 % | 4 029 MB | 319 s |
| 86 | `teichai-qwen3-1.7b-gemini-2.5-flash-lite-distill-f16` | 19 | 12.6 % | 36.4 % | 18.7 % | 6 391 MB | 287 s |
| 87 | `aya-23-8b-iq4` | 15 | 9.2 % | 43.2 % | 15.2 % | 7 907 MB | 428 s |
| 88 | `ibm-grok4-ultra-fast-coder-1b-q4` | 14 | 10.3 % | 22.7 % | 14.2 % | 3 410 MB | 231 s |
| 89 | `mradermacher-qwen3-0.6b-claude-opus-distill-q4` | 14 | 9.6 % | 25.0 % | 13.8 % | 3 280 MB | 159 s |
| 90 | `llama3.2-agent-hermes-coder-3b-q4` | 4 | 2.6 % | 13.6 % | 4.4 % | 4 820 MB | 402 s |
### Architecturally or behaviourally incompatible
These collapse to the Tier-0 deterministic regex baseline (around 30 out of 100) because their LLM never produces usable JSON candidates. Marked ❌ in the Model Manager. Different root causes:
| Profile | Quality score | Root cause |
|---|---|---|
| `gemma-4-e4b-it-bf16` | 30 / 100 | Gemma 4 SWA-1024 ¹ |
| `gemma-4-e2b-it-bf16` | 30 / 100 | Same SWA-1024 ¹ |
| `qwen3guard-gen-4b-f16` | 30 / 100 | Safety-tuned model refuses arbitrary-JSON tasks |
| `qwen3guard-gen-8b-f16` | 30 / 100 | Same as 4B |
| `hy-mt1.5-1.8b-2bit` | n/a | 2-bit GGUF quantisation (tensor type 2) not supported by the bundled llama.cpp build (`load_model: failed to load model`); the model never starts, so quality is undefined. Recompile llama.cpp with the appropriate `-DGGML_…` flags or pick a Q4_K_M / Q8_0 quant of the same model when one is published. |
¹ The Gemma 4 architecture uses Sliding Window Attention (SWA, 1024-token window on 20 of 24 layers, visible in llama-server's `creating SWA KV cache, size = 1024 cells, 20 layers`). Our `system_detector.txt` is ~3 700 tokens; SWA layers only see the last 1 024, which drops the JSON-output instructions. Manual short-prompt tests succeed; the long detector prompt does not. Switching the chat template (`peg-native` vs. `peg-gemma4`) does not help: the limitation is structural, not template-related.
## How to pick
Pick the row that matches your hardware. The "Why" column is the trade-off you are accepting.
| Hardware or goal | Pick | Why |
|---|---|---|
| 18 GB VRAM or more, top quality | `ministral-3-8b-reasoning-bf16` | Quality 83, catches around 9 out of every 10 leaks. |
| 18 GB VRAM or more, fewest false alarms | `qwen3.5-9b-bf16` | Precision 80.5 %. Lower recall, so a few more leaks reach Review. |
| Around 7 GB VRAM, near-leader quality | `rtila-qwen3.5-9b-q4` | Quality 82 at one-third the VRAM of the BF16 leader, and the fastest run on the corpus (79 s). |
| Around 6 GB VRAM, smallest "good" model (⭐ recommended) | `jackrong-qwen3.5-4b-distill-q4` | Quality 78 at 2.5 GB on disk. Best small-plus-good pick in the curated set, and the second-best precision (79.1 %, behind `qwen3.5-9b-bf16` at 80.5 %, which needs almost 4x the VRAM). Recommended starting point on any GPU. |
| Around 10 GB VRAM, reasoning quality | `ministral-3-8b-reasoning-q5` | Quality 76 at half the VRAM of the BF16 leader. Recall matches BF16; precision drops about 10 points. |
| No GPU | `default` | The shipped CPU profile. Same Jackrong Qwen 3.5 4B Q4_K_M weights as `jackrong-qwen3.5-4b-distill-q4`, just configured with `n_gpu_layers: 0` (Quality 78 at around 2.5 GB on disk; expect roughly 10x slower than the GPU run). |
| Smallest GGUF, quality is not a priority | `jackrong-qwen3.5-0.8b-distill-q8` | 0.8 GB on disk, Quality 56 (below the curated cut but the best of the sub-1-GB tier; peak VRAM 2.9 GB). |
The reasoning models are fed `enable_thinking: false` so they emit
JSON directly, without burning the token budget on `<think>` blocks.
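For illustration, this is roughly what disabling thinking looks like against a llama-server OpenAI-compatible endpoint (a sketch, not `LLMClient` itself; the URL, the prompt-file path and routing the flag through `chat_template_kwargs` are assumptions about the setup):

```python
import requests

SYSTEM_PROMPT = open("system_detector.txt").read()  # hypothetical location
CHUNK = "...one ~5000-character chunk from the structure-aware chunker..."

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": CHUNK},
        ],
        "max_tokens": 2048,  # the output JSON budget from the next section
        # Suppress <think> blocks so the budget goes to the JSON answer.
        "chat_template_kwargs": {"enable_thinking": False},
    },
    timeout=600,
)
candidates_json = resp.json()["choices"][0]["message"]["content"]
```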
## Why these context sizes
The pipeline is chunked: the detector splits each input segment into ~5 000-character chunks via the structure-aware chunker (`anonymize/structure_chunker.py`), and the LLM only ever sees one chunk per request. Measured request size (calibrated against the real prompts):
| Component | Tokens (measured) |
|---|---|
| `system_detector.txt` (12 748 chars at ~3.5 chars/token) | ~3 700 |
| Few-shot examples (top 8 from `decisions_history.jsonl`) | ~250 |
| Chunk body (5 000 chars worst case) | ~1 430 |
| Output JSON budget (`max_tokens` in `LLMClient`) | 2 048 |
| Worst-case per request | ~7 430 |
That sets a floor: slot ≥ 7 430 tokens (slot = ctx_size /
parallel). The pre-flight check in `anonymize/budget.py` refuses any
preset that violates it; the unit test in `tests/test_preset_budget.py`
keeps the curated catalog honest.
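The invariant is small enough to sketch in full (assumed shape only; the real check lives in `anonymize/budget.py`):

```python
REQUIRED_SLOT_TOKENS = 7_430  # worst-case request from the table above

def preset_fits(ctx_size: int, parallel: int) -> bool:
    """Each llama-server slot gets ctx_size / parallel tokens of context;
    that slot must cover the worst-case prompt plus output budget."""
    return ctx_size // parallel >= REQUIRED_SLOT_TOKENS

assert preset_fits(16_384, 2)      # slot 8 192, headroom +762 -> ships
assert not preset_fits(12_288, 2)  # slot 6 144 -> the pre-flight check refuses it
```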
Slot status of the shipped presets (matches `config/server_profiles.yml`):
| Preset | ctx | parallel | slot | required | headroom | fits |
|---|---|---|---|---|---|---|
| `default` (Jackrong Qwen 3.5 4B Q4_K_M, CPU) | 12 288 | 1 | 12 288 | ~7 430 | +4 858 | ✅ |
| `jackrong-qwen3.5-4b-distill-q4` | 12 288 | 1 | 12 288 | ~7 430 | +4 858 | ✅ |
| `rtila-qwen3.5-9b-q4` | 12 288 | 1 | 12 288 | ~7 430 | +4 858 | ✅ |
| `ministral-3-8b-reasoning-bf16` | 16 384 | 2 | 8 192 | ~7 430 | +762 | ✅ |
| `ministral-3-8b-reasoning-q5` (Q5_K_M) | 16 384 | 2 | 8 192 | ~7 430 | +762 | ✅ |
| `qwen3.5-9b-bf16` | 16 384 | 1 | 16 384 | ~7 430 | +8 954 | ✅ |
## Same-model quant comparison
Different quants of the same base model, same corpus.
| Model | Quant | F1 | Δ vs. reference |
|---|---|---|---|
| Qwen 3.5 4B | BF16 | 63.8 % | reference |
| Qwen 3.5 4B | Q4_K_XL | 58.2 % | -5.6 pts |
| Ministral 3 8B Instruct | BF16 | 64.5 % | reference |
| Ministral 3 8B Instruct | Q8_K_XL | 64.4 % | within noise |
| Ministral 3 8B Reasoning | BF16 | 82.5 % | reference |
| Ministral 3 8B Reasoning | Q5_K_M | 76.2 % | -6.3 pts (recall identical, precision -10 pts, around half the VRAM) |
| Granite 4.1 8B | BF16 | 62.9 % | reference |
| Granite 4.1 8B | Q8_K_XL | 65.3 % | +2.4 pts (recall lower, 70.5 % vs 86.4 %) |
| OpenSonnet Lite | Q8_0 | 57.0 % | reference |
| OpenSonnet Lite | Q4_K_M | 42.0 % | -15 pts (heavy precision drop) |
Rule of thumb: BF16 wins most same-model comparisons (Granite Q8_K_XL is the one exception above), and the gap widens at smaller quants. The KV cache stays at f16 across the board.
## Distill vs. base model
Some Q4_K_M distills outscore the BF16 build of their base model. Different training objective, not just a different quant.
| Base model | Build | Quant | F1 | Δ vs. base BF16 |
|---|---|---|---|---|
| Qwen 3.5 9B | base (unsloth) | BF16 | 77.7 % | reference |
| Qwen 3.5 9B | rtila Assistant Lite | Q4_K_M | 82.0 % | +4.3 pts at around 7 GB VRAM (vs 18 GB for the base) |
| Qwen 3.5 9B | Jackrong Claude-Opus distill | Q4_K_M | 76.0 % | -1.7 pts (within noise) at around 7 GB |
| Qwen 3.5 4B | base (unsloth) | BF16 | 63.8 % | reference |
| Qwen 3.5 4B | Jackrong Claude-Opus distill | Q4_K_M | 78.0 % | +14.2 pts at 2.5 GB on disk |
Practical takeaway: when picking a small model, prefer a purpose-trained distill of a strong base over the base in BF16. The distill captures the JSON-output discipline the anonymizer needs at a fraction of the VRAM.
## Reproducing
You need a corpus folder of your own (the 5-PDF corpus used here is
private). Point `BENCH_CORPUS_ROOT` at any folder of PDFs you have
ground truth for, then pick a folder for the run output (anywhere
on disk; the commands below use a `bench_runs/` directory next to the
repo):
```bash
export BENCH_CORPUS_ROOT=/path/to/your/pdfs
OUT=./bench_runs/precision_top5

# Run the benchmark for the curated presets.
PYTHONPATH=$(pwd) QT_QPA_PLATFORM=offscreen \
  .venv/bin/python bench/run_precision_benchmark.py \
    --profiles default \
               jackrong-qwen3.5-4b-distill-q4 \
               rtila-qwen3.5-9b-q4 \
               ministral-3-8b-reasoning-bf16 \
               ministral-3-8b-reasoning-q5 \
               qwen3.5-9b-bf16 \
    --out-root "$OUT"

# Patch the catalog + presets with the measured numbers.
PYTHONPATH=$(pwd) .venv/bin/python bench/apply_bench_to_catalog.py \
  "$OUT/report.json"
```
The runner emits both `report.md` (human-readable, with a per-PDF
breakdown plus miss and extra lists) and `report.json`
(machine-readable, consumed by `apply_bench_to_catalog.py`).
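If you want to post-process the sidecar yourself, it is plain JSON. A sketch of reading it (the key names below are hypothetical; inspect your own `report.json` for the actual schema):

```python
import json
from pathlib import Path

report = json.loads(Path("bench_runs/precision_top5/report.json").read_text())
# "profiles", "f1" and "peak_vram_mb" are guesses at the schema, not documented keys.
for profile, stats in report.get("profiles", {}).items():
    print(profile, stats.get("f1"), stats.get("peak_vram_mb"))
```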
### Per-PDF breakdown
The runner writes a per-profile, per-PDF report into
`<out-root>/<profile>/`, with miss and extra lists for every PDF
and a top-level `report.md` summarising the run.