Contributing
Thanks for your interest! This is the short guide for getting a PR through the door.
Setup

```
git clone https://github.com/nemmusu/report-anonymizer
cd report-anonymizer
python3.12 -m venv .venv && . .venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
```
Run the test suite

The CI workflow runs the same suite on Python 3.10, 3.11, and 3.12; a PR is only mergeable once all 295 tests pass. Locally, `make test` runs the same suite.
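If you prefer invoking pytest directly (e.g. from an IDE), here is a minimal sketch; it assumes the suite lives under `tests/` and that `make test` boils down to a plain pytest run:

```python
# Rough equivalent of `make test`, assuming the target just runs pytest.
import os
import sys

import pytest

# The GUI tests run headless via the offscreen Qt platform (see the PR
# checklist below), so no display is required.
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")
sys.exit(pytest.main(["tests", "-q"]))
```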
Code style

- Type hints on every public function. Internal helpers can skip them.
- Docstrings on classes and on any function whose behaviour isn't obvious from its name (one paragraph max).
- Comments only when the WHY is non-obvious. No "this calls X" comments; readers can see that.
- Don't add abstractions for a hypothetical future requirement. Three similar lines are better than a premature abstraction.
ruff and black are wired up via pyproject.toml; the editor of
your choice should pick them up automatically.
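To make those rules concrete, a small invented example (these names don't exist in the codebase):

```python
import re
from pathlib import Path

# Internal helper; the name says it all, so no docstring required.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_emails(path: Path, *, dry_run: bool = False) -> int:
    """Replace e-mail addresses in a text file and return the match count.

    With dry_run=True the file is left untouched; only the count is returned.
    """
    text = path.read_text(encoding="utf-8")
    redacted, count = _EMAIL.subn("[EMAIL]", text)
    # WHY, not WHAT: skipping the write on zero matches keeps mtimes stable,
    # so incremental tooling doesn't see spurious changes.
    if count and not dry_run:
        path.write_text(redacted, encoding="utf-8")
    return count
```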
What we look for in a PR

- Tests. Every new behaviour gets a test. The pytest suite has GUI tests that work offscreen via `QT_QPA_PLATFORM=offscreen`; no display required.
- No silent regressions. `make test` must stay green; if you break a test, fix the test or revisit the change.
- Concise description. Title + one paragraph of why. PRs that reproduce the README in the body get sent back :-)
- Screenshots for UI changes (drop them in `docs/screenshots/`).
What we don't accept

- Calls to remote LLM APIs as default behaviour. The project is local-first by design; cloud support is on the roadmap, but it will be opt-in only.
- Telemetry / analytics, even anonymous.
- Auto-updates / self-modifying code paths.
Reporting bugs
Open an issue with:
- Steps to reproduce (smallest possible repro).
- Expected vs actual behaviour.
- `python --version`, `pip list`, plus `uname -a` (Linux/macOS) or `systeminfo` / `ver` (Windows).
- The relevant contents of the user-config root (redact any HF token!): `~/.config/document-anonymizer/` on Linux, `%APPDATA%\report-anonymizer\` on Windows, `~/Library/Application Support/report-anonymizer/` on macOS.
- Output of `python bin/anonymize-dossier selftest`.
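If you're not sure which directory that is on your machine, a quick sketch that simply mirrors the paths listed above (the app itself may resolve them differently):

```python
# Mirrors the per-OS paths from the list above; not the app's own logic.
import os
import sys
from pathlib import Path

def user_config_root() -> Path:
    if sys.platform == "win32":
        return Path(os.environ["APPDATA"]) / "report-anonymizer"
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Application Support" / "report-anonymizer"
    return Path.home() / ".config" / "document-anonymizer"

print(user_config_root())
```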
Adding a format adapter

- Subclass `FormatAdapter` from `anonymize/format_adapters/base.py`.
- Implement `extract()` and `write()`.
- Register in `format_adapters/__init__.py:get_adapter`.
- Add a round-trip test in `tests/test_<format>_adapter.py`.
- Optional but recommended for binary formats: implement `inventory_images()` (return `list[InventoryImageRaw]`) and `apply_image_redactions()` so the new format participates in the Review » Images flow. The default implementations are no-ops, so an adapter that doesn't override them simply skips image redaction. A minimal skeleton covering these steps is sketched after this list.
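A skeleton for a hypothetical plain-text adapter. The method signatures here are assumptions for illustration; check `anonymize/format_adapters/base.py` for the real ones.

```python
from pathlib import Path

from anonymize.format_adapters.base import FormatAdapter

class TxtAdapter(FormatAdapter):
    """Hypothetical plain-text adapter; exists only to illustrate the steps."""

    def extract(self, path: Path) -> str:
        # Assumption: extract() hands the pipeline the raw document text.
        return path.read_text(encoding="utf-8")

    def write(self, path: Path, text: str, out_path: Path) -> None:
        # Assumption: write() persists the anonymized text to out_path.
        out_path.write_text(text, encoding="utf-8")

    # inventory_images() / apply_image_redactions() are deliberately not
    # overridden: the no-op defaults mean this adapter skips image redaction.
```

Then register it in `format_adapters/__init__.py:get_adapter` and add a round-trip test in `tests/test_txt_adapter.py`.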
The existing adapters (`docx_adapter.py`, `xlsx_adapter.py`, …) are the cleanest reference; start from one of those. For image-pass parity, look at `pdf_inplace_adapter.py`, `docx_adapter.py` and `pptx_adapter.py`.
Benchmarking your changes

```
# Headline benchmark on the full curated corpus.
BENCH_CORPUS_ROOT=/path/to/pdfs make bench-precision

# Spot-check a custom set of GGUFs without touching the autogen manifest.
BENCH_CORPUS_ROOT=/path/to/pdfs python bench/run_extra_models_benchmark.py \
    --manifest "$(python -c 'from anonymize._paths import models_dir; print(models_dir())')/_bench_user_models.json" \
    --out-root /tmp/anonbench/user_models
```
The user-manifest path lets you bench a hand-picked subset of GGUFs (e.g. when validating a new top-5 candidate) without rerunning the whole catalog. Each entry is `{repo, file, status, bytes}`; the runner builds an in-memory `ServerProfile`, starts `llama-server`, runs the precision harness against the corpus, scores the results against the ground truth, and emits `report.json` + `report.md`.