
Architecture

This page describes the data flow and on-disk schema of the anonymizer. For a high-level overview see the README; for benchmark numbers see BENCHMARKS.md.

Data flow

```mermaid
flowchart LR
    subgraph llama["llama-server (local, BF16 preset)"]
        srv[("chat completions API")]
    end

    scan["Scan<br/><sub>file inventory<br/>+ image inventory<br/>(PDF/DOCX/PPTX)</sub>"]
    detector["Detector<br/><sub>Tier-1 LLM<br/>structure-aware chunks</sub>"]
    critic["Critic<br/><sub>self-consistency vote<br/>placeholder_safe gate</sub>"]
    triage["Triage<br/><sub>auto_t0 / auto_t1<br/>+ pending</sub>"]
    promote["Promote<br/><sub>merge into<br/>substitution_map.yml</sub>"]
    images["Images Review<br/><sub>per-image editor<br/>writes image_redactions.yml</sub>"]
    bpreview["Build preview<br/><sub>PDF.js, native selection<br/>baked highlights</sub>"]
    apply["Apply<br/><sub>text adapters +<br/>image redactor (in-place bytes),<br/>atomic writes</sub>"]
    build["Build<br/><sub>pandoc / WeasyPrint<br/>extra exports</sub>"]
    verify["Verify<br/><sub>residual-leak sweep<br/>+ image inventory check</sub>"]
    autores["Auto-resolve<br/><sub>re-apply on affected files</sub>"]

    scan --> detector --> critic --> triage --> promote --> images --> bpreview --> apply --> build --> verify --> autores
    autores -. residuals .-> apply

    detector -- chunked HTTP<br/>~5 800 tok/req --> srv
    critic -- chunked HTTP --> srv

    classDef llamaBox fill:#1f6feb,stroke:#1f6feb,color:#fff;
    class srv llamaBox;
```

Each stage is cooperatively cancellable via stop_event / Stop button. State is persisted between stages so a crash can resume from the last successful checkpoint.

On-disk schema

Per project (<output_dir>/):

| Path | Format | Purpose |
| --- | --- | --- |
| substitution_map.yml | YAML | Canonical from → to mappings; the source of truth for Apply. |
| auto_promoted_t0.yml | YAML | Tier-0 deterministic-rule candidates queued for the next Promote. |
| auto_promoted_t1.yml | YAML | Tier-1 LLM-confident candidates queued for the next Promote. |
| needs_review.yml | YAML | Pending candidates the operator must Approve / Skip / edit. Each row carries decision, user_edited, original_value. |
| applied_substitutions.json | JSON | Per-event log of substitutions performed by Apply (page + rect for PDFs). |
| decisions_history.jsonl | JSONL | Append-only log of operator decisions + Tier-0 stable-index assignments. |
| image_inventory.yml | YAML | Per-file embedded-image catalog (image_id, format, dims, location). Auto-generated by scan_images, never hand-edited. |
| image_redactions.yml | YAML | Operator decisions per image_id: redact / skip / defer + the rect list (tool, intensity, text/font/colour). Survives re-scans. |
| .anon/img_thumbs/<image_id>.jpg | JPEG | Idempotent 256×256 thumbnails for the gallery; cache key is image_id. |
| verifier_report.md | Markdown | Last verifier run output. |
| .anon/state.json | JSON | Per-stage checkpoints (resume after crash). |
| .anon/run_manifest.json | JSON | Full provenance: project, profile, timestamps. |
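For illustration, a substitution_map.yml entry might look like this (the field names are assumptions for the sketch, not the actual schema):

```yaml
# substitution_map.yml — illustrative entries (keys are hypothetical)
- from: "Acme Corp"
  to: "COMPANY_1"
  category: brand
- from: "+393331111111"
  to: "PHONE_1"
  category: phone
```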

Per user (resolved via anonymize._paths):

| OS | Config root | Data / cache root |
| --- | --- | --- |
| Linux | ~/.config/document-anonymizer/ | ~/.local/share/document-anonymizer/ |
| Windows | %APPDATA%\report-anonymizer\ | %LOCALAPPDATA%\report-anonymizer\ |
| macOS | ~/Library/Application Support/report-anonymizer/ | ~/Library/Application Support/report-anonymizer/ |

Files under the config root:

| Path | Purpose |
| --- | --- |
| server.yml | User-level llama-server presets. |
| preferences.yml | UI preferences (default preset, etc.). |
| app_settings.yml | GUI toggles persisted across launches (detector mode, autostart, etc.). |
| hf.token | HuggingFace API token (mode 0600 on POSIX). |
| downloads.yml | Persistent download queue. |
| .installer_choice.json | Windows-only: variant chosen at install time + path to bundled llama-server.exe. |

Caches (tempfile.gettempdir() namespace, OS-specific):

| Path | Purpose |
| --- | --- |
| <tmp>/anondiff/office/ | LibreOffice → PDF cache for office-doc preview. |
| <tmp>/anondiff/preview/ | Apply output cache for the live Review preview. |

Pipeline stages

Scan

Walks the input tree honouring .anonignore / .gitignore, symlink-safe, capped by max_depth / max_file_size_mb. Emits a ScanResult with adapter-resolved text_like files vs binary_copy files (binaries are copied verbatim to the output).

Detect (Tier-0 + Tier-1)

  • Tier-0 (anonymize/rules_pass.py): deterministic regex from config/leak_patterns.yml with stable-index assignment. +393331111111 always resolves to the same placeholder thanks to decisions_history.jsonl.
  • Tier-1 (anonymize/detector.py): structure-aware chunker (structure_chunker.py) feeds the LLM 5K-char chunks. The LLM is prompted via prompts/system_detector.txt + prompts/detect_user.txt.j2 with the top 8 operator decisions as few-shot examples.
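The stable-index behaviour can be sketched like this (`stable_placeholder` and the in-memory `index_map` are hypothetical stand-ins for rules_pass.py plus the decisions_history.jsonl lookup):

```python
def stable_placeholder(value: str, category: str, index_map: dict) -> str:
    """Deterministic placeholder assignment: the same value always maps to
    the same index; a new value gets the next free index for its category."""
    key = (category, value)
    if key not in index_map:
        used = [i for (cat, _), i in index_map.items() if cat == category]
        index_map[key] = max(used, default=0) + 1
    return f"{category.upper()}_{index_map[key]}"
```

Persisting `index_map` across runs (as decisions_history.jsonl does) is what keeps +393331111111 resolving to the same placeholder on every pass.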

Detection mode (single vs multipass)

Tier-1 ships with two interchangeable run strategies, picked per-run from a combo box on the Pipeline tab (or --detector-mode on the CLI) and persisted in app_settings.yml under the user-config root (see the Per user table above for the per-OS location):

  • single (default, fast): one LLM call per chunk against the monolithic prompts/system_detector.txt, ~3500 tokens of instructions covering all 12 categories. Roughly 30 s / typical PDF on the shipped 4B preset.
  • multipass (high accuracy, ~5× slower): the same chunk is sent to the detector 11 times in a row, each call with a tight ~800-token category-scoped prompt from prompts/detector_multipass/ (system_detector_brand.txt, system_detector_network.txt, …, system_detector_infra_ids.txt). The candidate lists are merged by value (highest confidence wins) before the critic stage, so the rest of the pipeline is unchanged. On the local 5-PDF bench multipass lifted F1 from 0.836 to 0.919 (precision +0.12, recall +0.05); the trade-off is roughly 5× more detector time. Recommended for messy or multi-customer reports and for the 4B preset; the 9B/27B presets are less prompt-sensitive and gain less.

The dispatch happens in anonymize/pipeline.py:_resolve_detector_prompt_paths(), which reads project.detector_mode and returns the ordered list of prompt files; stage_detect_and_critic then loops the detector once per prompt and merges the results.
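The merge-by-value step reduces to a few lines; this sketch (`merge_by_value` is a hypothetical name) keeps the highest-confidence duplicate per value:

```python
def merge_by_value(passes: list[list[dict]]) -> list[dict]:
    """Merge the per-category candidate lists from a multipass run:
    one entry per distinct value, highest confidence wins."""
    best: dict[str, dict] = {}
    for candidates in passes:
        for cand in candidates:
            cur = best.get(cand["value"])
            if cur is None or cand["confidence"] > cur["confidence"]:
                best[cand["value"]] = cand
    return list(best.values())
```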

Critic

anonymize/critic.py runs every Tier-1 candidate through a second LLM pass with prompts/system_critic.txt. Self-consistency voting is configurable via n_vote. Candidates with is_real_leak: yes AND placeholder_safe: yes AND confidence ≥ t_high auto-promote (auto_t1). The rest go to needs_review.
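A minimal sketch of the voting gate (field names mirror the critic output described above; the function name is hypothetical, and the real aggregation may differ):

```python
from collections import Counter

def vote_gate(votes: list[dict], t_high: float) -> str:
    """Self-consistency vote over the critic samples (one per n_vote):
    majority answer per gate, mean confidence against t_high."""
    real = Counter(v["is_real_leak"] for v in votes).most_common(1)[0][0]
    safe = Counter(v["placeholder_safe"] for v in votes).most_common(1)[0][0]
    conf = sum(v["confidence"] for v in votes) / len(votes)
    if real == "yes" and safe == "yes" and conf >= t_high:
        return "auto_t1"
    return "needs_review"
```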

Triage

anonymize/triage.py partitions every candidate:

```mermaid
flowchart TD
    cand[Candidate] --> tier{Tier?}
    tier -- "Tier-0<br/>(regex)" --> at0["auto_t0<br/>queue → Promote"]
    tier -- "Tier-1<br/>(LLM)" --> safe{"placeholder_safe<br/>= yes?"}
    safe -- no --> rej["rejected<br/>(silently dropped)"]
    safe -- yes --> conf{"confidence<br/>≥ t_high?"}
    conf -- yes --> at1["auto_t1<br/>queue → Promote"]
    conf -- no --> needs["needs_review<br/>(human decides)"]
```

  • auto_t0, every Tier-0 hit (already deterministic).
  • auto_t1, high-confidence Tier-1 hits.
  • needs_review, everything else; the operator must decide.
  • rejected, confident critic "no", silently dropped.
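The partition logic amounts to a small decision function (a sketch; the real code is anonymize/triage.py and its field names may differ):

```python
def triage(cand: dict, t_high: float) -> str:
    """Route one candidate into the four buckets listed above."""
    if cand["tier"] == 0:
        return "auto_t0"          # deterministic regex hit
    if cand["placeholder_safe"] != "yes":
        return "rejected"         # critic says unsafe to substitute
    if cand["confidence"] >= t_high:
        return "auto_t1"          # high-confidence LLM hit
    return "needs_review"         # human decides
```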

Review (GUI only)

gui/review_view.py unifies three row types in one tree:

  • In map, already in substitution_map.yml.
  • Auto T0/T1, queued for the next Promote.
  • Pending, needs operator decision.

Inline edits to value (col 1) or placeholder (col 2) persist immediately to the relevant YAML. Decisions (Approve / Skip) live on Candidate.decision and round-trip through needs_review.yml.

Promote

stage_promote reads the three YAMLs (filtered by decision != "skip"), merges them into substitution_map.yml via smap.merge_candidates, then prunes the merged entries from the auto / pending YAMLs so the Review tree never shows duplicates.

Apply

anonymize/applier.py runs each format adapter's write():

  • PDF in-place (pdf_inplace_adapter.py): redacts rects via PyMuPDF and stamps the placeholder text in-place, preserves layout, font, and byte length when possible.
  • PDF rederive (pdf_rederive_adapter.py): re-derives the PDF from the extracted text + applied substitutions (use when the source PDF is highly stylised and in-place would shrink fonts).
  • Office (docx/pptx/odt/rtf/xlsx): per-run modification via the official Python libs (python-docx, python-pptx, odfpy, openpyxl).
  • Text (text_adapter.py): byte-stream substitution for ~80 text/code/markup extensions.

Every output is written atomically through *.tmp + os.replace.
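The atomic-write discipline is simple to illustrate (a sketch, not the adapter code itself):

```python
import os

def atomic_write(path: str, data: bytes) -> None:
    """Write via a sibling *.tmp then os.replace, so a crash can never
    leave a half-written file at the destination path."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # make sure bytes hit disk before the swap
    os.replace(tmp, path)      # atomic on POSIX and Windows (same volume)
```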

Build

anonymize/builder.py runs Pandoc and WeasyPrint to produce the extra-export formats (pdf/html/md) requested via Project.extra_export_formats; it is a no-op when no extra format is set. WeasyPrint replaced the legacy wkhtmltopdf subprocess, so the engine no longer needs a Qt5-linked binary on the host; that is what makes AppImage packaging feasible (no Qt5/Qt6 collision with PySide6). The Project.export_template_id chosen at import time is forwarded through render_pdf, so Build / Re-derive output uses the same template the Export dialog would.

Verify

anonymize/verifier.py sweeps the output for residual leaks. It loads the active substitution map, NFKC-normalises both sides, decodes HTML entities, strips zero-width characters, then checks that no from value is still present in any output file.
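The normalisation chain can be sketched as follows (a simplified illustration; the lowercase matching here is an assumption, the real sweep has its own case rules):

```python
import html
import unicodedata

# zero-width space/joiners + BOM, mapped to None for str.translate deletion
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def normalise(text: str) -> str:
    """Decode HTML entities, NFKC-normalise, strip zero-width characters."""
    text = html.unescape(text)
    text = unicodedata.normalize("NFKC", text)
    return text.translate(ZERO_WIDTH)

def residual_leaks(from_values: list[str], output_text: str) -> list[str]:
    """Return every map value that still appears in the output after
    both sides have been canonicalised the same way."""
    hay = normalise(output_text).lower()
    return [v for v in from_values if normalise(v).lower() in hay]
```

Normalising both sides identically is what catches evasions like `Ac​me&nbsp;Corp` (zero-width space plus HTML entity) that a plain substring search would miss.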

Auto-resolve

anonymize/triage.py:auto_resolve_residuals picks up the verifier hits and re-derives placeholders from the existing map (case-match, stable-index lookup), then re-runs Apply only on the affected files. Closes the loop without round-tripping through human Review.

Token budget per request

A typical detector request looks like:

| Component | Tokens (approx) |
| --- | --- |
| System prompt | ~1 500 |
| Few-shot (top 8 from decisions log) | ~250 |
| Chunk body (5 000 chars at ≈3 chars/token) | ~1 700 |
| Output JSON budget (max_tokens) | 2 048 |
| Total per request | ~5 500 |

A 16 K context with parallel: 2 gives 8 192 tokens per slot, roughly 2 700 tokens of headroom over a typical ~5 500-token request. The pre-flight check (anonymize/budget.py) refuses to start the pipeline if the active preset's slot is too small.
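The pre-flight arithmetic reduces to a one-line comparison (a sketch using the budget figures from the table; `preflight_budget` is a hypothetical name):

```python
def preflight_budget(ctx_size: int, parallel: int,
                     system: int = 1500, few_shot: int = 250,
                     chunk: int = 1700, max_tokens: int = 2048) -> bool:
    """True when one slot of the active preset can hold a full request
    (the real check lives in anonymize/budget.py)."""
    per_slot = ctx_size // parallel
    need = system + few_shot + chunk + max_tokens
    return per_slot >= need
```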

Cancellation contract

Every long stage accepts an optional stop_event: threading.Event. The GUI's global Stop button sets the event; workers check it between chunks / pages and return early. chat_many uses ThreadPoolExecutor.shutdown(cancel_futures=True) so queued requests are cancelled too, Stop is genuinely instant.
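The contract can be illustrated with a toy worker (a sketch; `run_chunks` is a hypothetical name and `str.upper` stands in for a real chunked HTTP request):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_chunks(chunks: list[str], stop_event: threading.Event) -> list[str]:
    """Check the event between units of work and cancel queued futures
    on shutdown, so Stop takes effect immediately."""
    results: list[str] = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(str.upper, c) for c in chunks]
        for fut in futures:
            if stop_event.is_set():
                pool.shutdown(cancel_futures=True)  # drop queued requests too
                break
            results.append(fut.result())
    return results
```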