Anonymization scope

This page is the deep dive behind the README's What it anonymizes table. It describes, category by category, what the detector flags on a penetration-test report, what placeholders it produces, and the classes of strings the pipeline deliberately leaves alone so the report's technical content keeps working.

The source of truth for these rules is prompts/system_detector.txt in the default Fast detection mode; the deterministic regex layer that runs before the LLM lives in config/leak_patterns.yml. When the High accuracy (multi-pass) mode is selected from the Pipeline tab toggle, the same rules live category-by-category under prompts/detector_multipass/ (one focused prompt per category, ~800 tokens each). The placeholder substitutions and category-specific format rules are in prompts/system_critic.txt and anonymize/placeholders.py.

Design principles

The technical content stays. Attack chains, exploit code, shell commands, payloads, tool output, library names, RFC ranges, generic versions: these are what makes the report useful as a teaching artefact and must remain untouched. A pentest report should remain readable by another pentester after anonymization.
Only customer-identifying values move. The detector's precision checklist is a single question: "if I removed this value, would the attack logic still be understandable?" If yes, it is a leak; if no, it is technical description.
Placeholders are length- and shape-preserving. PDFs are redacted in place, so a 17-character phone number must be replaced by a 17-character placeholder; a 32-hex token by a 32-hex placeholder; a brand name by a brand-shaped neutral string. This avoids reflow.
Same value → same placeholder, every time. A real value that appears twice (in different cases, in different formats, inside a URL or a payload) gets the same neutral substitute on every occurrence. The mapping persists in substitution_map.yml.
Multilingual. Reports come in any language; the detector analyses the input in its native language and emits language-neutral placeholders.
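The "same value, same placeholder" principle can be illustrated with a small in-memory map. This is a minimal sketch, not the project's code: the class name `SubstitutionMap`, the `casefold()` normalization and the `vendorNN` format are assumptions made for illustration, and persistence to substitution_map.yml is not modelled.

```python
class SubstitutionMap:
    """Hypothetical in-memory stand-in for the persisted mapping."""

    def __init__(self):
        self._by_key = {}

    def placeholder_for(self, real_value, make_placeholder):
        # Case variants of one value share a key, so they share a placeholder.
        key = real_value.casefold()
        if key not in self._by_key:
            # The index grows only when a genuinely new value is seen.
            self._by_key[key] = make_placeholder(len(self._by_key) + 1)
        return self._by_key[key]


def numbered(n):
    # Assumed placeholder format, for illustration only.
    return f"vendor{n:02d}"


smap = SubstitutionMap()
first = smap.placeholder_for("AcmeBank", numbered)
again = smap.placeholder_for("ACMEBANK", numbered)   # same value, other case
other = smap.placeholder_for("Contoso", numbered)    # different value
```

Here `first` and `again` resolve to the same string, while `other` receives the next index. A real mapping would also have to re-case the placeholder to match each occurrence, which this sketch leaves out.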

The 12 categories

1. `brand`: customer / product names

The customer's company name, product line, suite, vendor, app name, and any of their case variants when they appear inside URLs, package names, header names, advisory IDs, etc.

Original	Placeholder
`AcmeBank Pro v25.1.2135`	`VendorApp v25.1.2135`
`acmebank` (lowercase, in a URL)	`vendorapp`
`ContosoVoice`	`VendorVoice`
`NimbusGSM`	`VendorGSM`

The detector flags every form of the brand. If the same word appears as a domain (acmebank.com), as a package (com.acmebank), as a header (X-AcmeBank-Auth) and as a deeplink (acmebank://), each occurrence gets its own candidate so the placeholder rewrites the full token.
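As a rough illustration of rewriting every case variant of a brand inside larger tokens, the sketch below swaps the brand substring and mirrors the casing of each match. The function name `rewrite_brand` and the single neutral string are assumptions; the real pipeline picks category-specific placeholders (for example `X-Vendor-Auth` for headers and `vendor.example` for domains), which this sketch does not attempt.

```python
import re


def rewrite_brand(text, brand="acmebank", neutral="VendorApp"):
    """Replace every case variant of `brand`, mirroring the variant's casing."""

    def swap(match):
        found = match.group(0)
        if found.islower():
            return neutral.lower()
        if found.isupper():
            return neutral.upper()
        return neutral  # mixed case, e.g. "AcmeBank"

    # re.escape keeps brands containing regex metacharacters safe.
    return re.sub(re.escape(brand), swap, text, flags=re.IGNORECASE)


samples = ["https://acmebank.com/login", "com.acmebank.app", "X-AcmeBank-Auth"]
rewritten = [rewrite_brand(s) for s in samples]
```

Because the substitution works on the substring, the brand is caught whether it sits in a URL, a package name or a header name.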

2. `network`: IPs and hostnames

Real public IPv4 addresses of the customer, the customer's owned domains, and proprietary hostnames under those domains.

Original	Placeholder
`203.0.113.42` (real public IP)	`203.0.113.NN`
`api.acme.com`	`api.vendor.example`
`*.prod.acme.io`	`*.prod.vendor.example`
`keyserver.acmebank.local`	`keyserver.vendor.local`

Placeholders use the RFC 5737 documentation ranges (203.0.113.0/24, 198.51.100.0/24, 192.0.2.0/24) and the RFC 2606 .example TLD so the placeholder is itself valid demo data and will not collide with anyone's real assets.

Not flagged: RFC 5737 ranges, RFC 1918 (10.x, 172.16-31.x, 192.168.x), loopback (127.0.0.1), well-known public DNS (8.8.8.8, 1.1.1.1), generic descriptive endpoints (/api/v1/login, /healthz, keyserver/v1/publish).
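The IP part of the exclusion list above can be sketched with the standard-library `ipaddress` module. The function name `is_candidate_ip` is hypothetical, and the real detector may apply further context checks; this only shows the range test.

```python
import ipaddress

# Ranges the page lists as "not flagged".
SAFE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "203.0.113.0/24", "198.51.100.0/24", "192.0.2.0/24",  # RFC 5737
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",      # RFC 1918
        "127.0.0.0/8",                                         # loopback
    )
]
WELL_KNOWN_DNS = {ipaddress.ip_address("8.8.8.8"), ipaddress.ip_address("1.1.1.1")}


def is_candidate_ip(text):
    """Return True when `text` is an IPv4 address outside the exempt ranges."""
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        return False  # not an IP address at all
    if addr.version != 4 or addr in WELL_KNOWN_DNS:
        return False
    return not any(addr in net for net in SAFE_NETWORKS)
```

Under these assumptions, `is_candidate_ip("10.1.2.3")` and `is_candidate_ip("8.8.8.8")` are false, while a routable address outside the listed ranges is reported as a candidate.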

3. `phones`: E.164 numbers

Phone numbers from any country, in any format. The placeholder keeps the country code and the carrier prefix, then replaces the rest with a zero-padded sequential index of equal length.

Original	Placeholder
`+39 344 1234567`	`+39 344 0000001`
`+1 (415) 867-5309`	`+1 (415) 555-0001`
`+44 7700 900123`	`+44 7700 000001`

Not flagged: reserved test ranges (+393440000001, +1-555-0100), already-anonymized numbers.
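For space-separated numbers, the phone placeholder rule can be sketched as "keep every group except the last, then fill the last group with a zero-padded index". The helper name `phone_placeholder` is hypothetical, and the sketch deliberately ignores formats such as `+1 (415) 867-5309`, whose table row follows a different convention.

```python
def phone_placeholder(number, index):
    """Keep the leading groups; replace the last group with a padded index."""
    head, _, tail = number.rpartition(" ")
    if not head:
        raise ValueError("expected a space-separated number")
    # zfill keeps the group at its original width, so the PDF does not reflow.
    return f"{head} {str(index).zfill(len(tail))}"


italian = phone_placeholder("+39 344 1234567", 1)
british = phone_placeholder("+44 7700 900123", 1)
```

With these inputs the results match the first and third rows of the table, and each placeholder has the same length as its original.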

4. `emails`: customer-domain emails

Addresses on the customer's domain or addresses of real people involved in the engagement.

Original	Placeholder
`j.doe@acmebank.com`	`user01@vendor.example`
`pentest@contoso.local`	`user02@vendor.example`

Not flagged: @example.com, @test.local, generic documentation addresses.

5. `credentials`: plaintext user / password / cookie pairs

Human-typed credentials and live session tokens taken from real dumps. Each identifier is emitted as a separate candidate so the username and password get independent placeholders.

Original	Placeholder
`j.doe` (username)	`u.demo`
`svc-backup`	`svc-demo01`
`Welcome01!` (password)	`Aaaaaaa00!`
`Hunter2!`	`Aaaaaa0!`
`Authorization: Basic dXNlcjpwYXNz`	new base64 of equal length
`JSESSIONID=8b9c0d1e2f3a4b5c6d7e8f90a1b2c3d4`	`JSESSIONID=8b9c0d1e000000000000000000000001`

The placeholder keeps the length and the character classes (letter / digit / special) of the original so the redacted dump still parses.
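One way to read the length-and-character-class rule for passwords is a per-character mapping: uppercase letters become `A`, lowercase letters become `a`, digits become `0`, and everything else is kept. The sketch below is an assumption about how such a rule could look, not the code in anonymize/placeholders.py, and it covers only the password rows of the table (usernames and session tokens follow other rules).

```python
def shape_placeholder(secret):
    """Map each character to a neutral one of the same class."""
    out = []
    for ch in secret:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)  # keep specials so the dump still parses
    return "".join(out)


welcome = shape_placeholder("Welcome01!")
hunter = shape_placeholder("Hunter2!")
```

Both results reproduce the placeholders shown in the table and have the same length as their originals.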

Not flagged: documentation placeholders (user/pass, alice/bob in protocol diagrams, foo/bar in code samples) and variable names (DB_USER, DB_PASS_2024); only their values are credentials.

6. `keys`: hardcoded tokens, hashes, cryptographic material

Hex tokens, base64-encoded keys, JWTs, SAML assertions, OAuth bearer tokens that come from the real environment.

Original	Placeholder
`nfdddf80a3b1c4e5f6079a8b9c0d1e2f` (32-hex)	`nfdddf80000000000000000000000001`
`eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.…` (JWT)	new JWT-shaped string of equal length
public key blob in PEM	length-preserving placeholder

For hex values the placeholder copies the first 8 source characters so two related credentials in the report stay visibly related (nfdddf80…0001, nfdddf80…0002), useful when the report compares two derivations of the same key material.
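The prefix-preserving rule for hex values can be sketched in a few lines. The function name `hex_placeholder` and the `keep` parameter are hypothetical; the sketch only assumes what the paragraph above states, namely that the first 8 source characters are copied and the remainder becomes a zero-padded index.

```python
def hex_placeholder(source, index, keep=8):
    """Copy the first `keep` characters, then pad the index to full length."""
    if len(source) <= keep:
        raise ValueError("token too short to keep a prefix and an index")
    return source[:keep] + str(index).zfill(len(source) - keep)


token = "nfdddf80a3b1c4e5f6079a8b9c0d1e2f"   # example value from the table
first = hex_placeholder(token, 1)
second = hex_placeholder(token, 2)
```

`first` and `second` share the `nfdddf80` prefix and differ only in the trailing index, so two derivations of the same key material stay visibly related after redaction.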

Not flagged: well-known constants (all-zero IV, RFC 4231 test vectors, Curve25519 public examples), library names (NaCl, libsodium, OpenSSL, Ed25519, AES, SHA-256), code variable names (sodium_key, OWN_KEY, DEV_PUB).

7. `headers`: proprietary HTTP headers

Any HTTP header whose name encodes the customer's brand or vendor. The detector rewrites the full header name; the value gets its own placeholder according to its own category (key, cookie, …).

Original	Placeholder
`X-AcmeBank-Auth`	`X-Vendor-Auth`
`X-ContosoServer-Token`	`X-VendorServer-Token`

Not flagged: standard headers (Authorization, Content-Type, X-Forwarded-For, X-Frame-Options, WWW-Authenticate).

8. `app_packages`: App package and bundle identifiers

Reverse-domain identifiers whose suffix encodes the customer's brand. The same shape covers Android packages, iOS bundle ids, and desktop-app reverse-domain identifiers (Snap, MSIX, Electron), so they share one category and one placeholder strategy.

Original	Placeholder
`com.acmebank.app` (Android)	`com.vendor.app`
`com.contoso.voice.beta` (Android)	`com.vendor.app.beta`
`it.acmebank.mobile` (iOS bundle id)	`com.vendor.app`
`com.acmebank.app.watchkit-extension` (iOS WatchKit)	`com.vendor.app000NNN`

Not flagged: SDK / library packages and OS frameworks (com.google.firebase.*, com.android.*, androidx.*, org.bouncycastle.*, com.apple.*, system bundles like com.apple.security.codesigning, com.google.GooglePlus, io.flutter.plugins.*, com.facebook.react.*).

9. `user_agents`: customer-app UA strings

Client User-Agent strings that name the customer's mobile or desktop app, including iOS-flavoured CFNetwork forms and custom-stack desktop user-agents.

Original	Placeholder
`AcmeApp/3.4 (Android)`	`VendorApp/1.0-android`
`CustomerApp/25.1.2135-android`	`VendorApp/1.0-android`
`AcmeApp/3.4 CFNetwork/1220.1 Darwin/22.5.0` (iOS)	`VendorApp/1.0 CFNetwork/0000.0 Darwin/0.0.0`
`AcmeApp/2.1 (Macintosh; Intel Mac OS X 14_0)`	`VendorApp/1.0 (Macintosh; Intel Mac OS X 0_0)`

Not flagged: standard browser UAs (Mozilla/5.0 (…) Chrome/…), curl/7.x, Wget/1.x, generic SDK UAs (okhttp/4.x, python-requests/2.x).

10. `ids`: internal tracking and advisory IDs

Identifier strings whose prefix encodes the customer (ACME-…, CONTOSO-…, CUST-…). The placeholder swaps the prefix while preserving the suffix so cross-references inside the report still point at the right finding.

Original	Placeholder
`ACME-CHAIN-A`	`VENDOR-CHAIN-A`
`CONTOSO-VULN-12`	`VENDOR-VULN-12`
`CUST-INC-9001`	`VENDOR-INC-9001`

Not flagged: CVE identifiers (CVE-2025-1234), CWE identifiers, OWASP references (A03:2021), CVSS strings.
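The prefix swap can be sketched as a single regular-expression substitution. The prefix list is taken from the examples above; the pattern itself and the name `rewrite_ids` are assumptions, and a real deployment would load the customer prefixes per project.

```python
import re

CUSTOMER_PREFIXES = ("ACME", "CONTOSO", "CUST")   # from the examples above
_ID_RE = re.compile(
    r"\b(?:%s)-([A-Z0-9][A-Z0-9-]*)" % "|".join(CUSTOMER_PREFIXES)
)


def rewrite_ids(text):
    """Swap the customer prefix for VENDOR and keep the suffix untouched."""
    return _ID_RE.sub(r"VENDOR-\1", text)


line = "See CONTOSO-VULN-12 (related to CVE-2025-1234) and ACME-CHAIN-A."
rewritten = rewrite_ids(line)
```

The suffix survives unchanged, so cross-references still resolve, and `CVE-2025-1234` is left alone because `CVE` is not in the prefix list.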

11. `other`: proprietary URI schemes and deeplinks

Anything that doesn't fit the previous categories: proprietary URI schemes that are not in the IANA standard list, custom deeplinks, vendor-tied tokens that are still real but don't have a more specific home.

Original	Placeholder
`acme-app://chat?room=42`	`vapp://chat?room=42`
`customerapp://services/provision?token=any`	`app://services/provision?token=any`

The IANA standard scheme list (kept in sync with the prompt) is http, https, ftp, sftp, ssh, file, mailto, tel, data, blob, ws, wss, sip, sips, urn, about, javascript. Anything else is treated as a customer-proprietary deeplink and the whole URL is rewritten so scheme + host + path are anonymized together.
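The scheme test described above can be approximated with `urllib.parse.urlsplit`. The scheme set is copied from the list in the previous paragraph; the function name `is_proprietary_url` is hypothetical, and bare `host:port` strings may need separate handling before this check.

```python
from urllib.parse import urlsplit

STANDARD_SCHEMES = frozenset({
    "http", "https", "ftp", "sftp", "ssh", "file", "mailto", "tel", "data",
    "blob", "ws", "wss", "sip", "sips", "urn", "about", "javascript",
})


def is_proprietary_url(url):
    """True when the URL has a scheme that is not on the standard list."""
    scheme = urlsplit(url).scheme   # urlsplit lowercases the scheme
    return bool(scheme) and scheme not in STANDARD_SCHEMES


deeplink = is_proprietary_url("acme-app://chat?room=42")
endpoint = is_proprietary_url("/api/v1/login")
```

Under these assumptions `deeplink` is true and `endpoint` is false, because a path without a scheme is a generic descriptive endpoint rather than a deeplink.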

12. `infra_ids`: cloud / Active-Directory / infrastructure resource identifiers

Customer-tied resource identifiers that show up in cloud, Active Directory, network and on-prem infrastructure pentests. The pipeline keeps the structural prefix (so the placeholder still parses as the same kind of identifier) and rewrites the customer-tied tail with a deterministic sequential index.

The Tier-0 regex layer in config/leak_patterns.yml catches the four most common deterministic shapes: AWS ARN, EC2 instance id, UUID (Azure tenant / subscription / AD ObjectGUID) and Active-Directory SID. The LLM detector handles the looser shapes (GCP project ids, branded DC=… distinguished-name fragments, branded Kubernetes namespaces).

Original	Placeholder
`arn:aws:iam::123456789012:role/AdminRole` (AWS ARN)	`arn:aws:iam::000000000001:role/vendor-1`
`i-0a1b2c3d4e5f6789a` (EC2 instance id)	`i-0a1b2c3d000000001`
`12345678-1234-5678-1234-567812345678` (Azure tenant UUID, AD ObjectGUID)	`12345678-0000-0000-0000-000000000001`
`S-1-5-21-1234567890-987654321-111222333-1001` (AD SID)	`S-1-5-21-0000000001`
`acme-prod-12345` (GCP project id encoding the customer)	`vendor-prod-0000001`
`CN=John Doe,OU=IT,DC=acme,DC=local` (AD distinguished name)	`CN=user01,OU=Sales,DC=vendor,DC=local`
`MSSQLSvc/sql01.acme.local:1433` (SPN)	rewritten as `network` (host part) plus `infra_ids` for the service prefix when branded

The placeholder strategy lives in anonymize/placeholders.py:infra_id_placeholder; it dispatches by shape so AWS ARNs keep the partition prefix, EC2 IDs keep the i- prefix, UUIDs keep the first 8 hex of the source, and SIDs keep the well-known authority block.

Not flagged: AWS service ARNs that don't carry an account id (arn:aws:iam::aws:role/AWSServiceRoleFor…), built-in Windows SIDs (S-1-5-32-… matches the SID pattern, but the placeholder reuses the canonical S-1-5-32-… prefix), Kubernetes namespaces that don't encode the customer (default, kube-system, monitoring), and generic AD groups that ship with Windows (Domain Users, Enterprise Admins); these are role names, not customer identifiers.
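Dispatching by shape, as described for the four deterministic Tier-0 shapes, might look like the sketch below. The regular expressions are illustrative guesses, not the contents of config/leak_patterns.yml, and the names `INFRA_SHAPES` and `classify_infra_id` are hypothetical. As a simplification, the SID pattern only covers domain SIDs under S-1-5-21.

```python
import re

# Illustrative patterns only; the real ones live in config/leak_patterns.yml.
INFRA_SHAPES = [
    # ARN with a 12-digit account id (service ARNs without one do not match).
    ("aws_arn", re.compile(r"arn:aws[a-z-]*:[a-z0-9-]+:[a-z0-9-]*:\d{12}:\S+")),
    # EC2 instance ids use 17 hex characters (8 in the legacy format).
    ("ec2_instance", re.compile(r"i-(?:[0-9a-f]{17}|[0-9a-f]{8})")),
    ("uuid", re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")),
    # Domain SID, optionally followed by a relative id.
    ("ad_sid", re.compile(r"S-1-5-21(?:-\d+){3,4}")),
]


def classify_infra_id(value):
    """Return the name of the first shape that matches the whole value."""
    for name, pattern in INFRA_SHAPES:
        if pattern.fullmatch(value):
            return name
    return None


kind = classify_infra_id("arn:aws:iam::123456789012:role/AdminRole")
```

A placeholder function could then branch on the returned shape name to decide which structural prefix to keep.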

Embedded images

Text rules cover the prose. Image content is handled by a parallel pass that surfaces every embedded image in the input as a thumbnail in the Review » Images tab. Each image is identified by image_id = "sha256:" + sha256(raw_image_bytes), so the same logo across 12 pages produces a single decision.
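The image identifier formula maps directly onto `hashlib`. In this sketch the hex encoding of the digest is an assumption (the formula above does not say how the hash is serialized), and the `decisions` dictionary is a hypothetical stand-in for however the review tab stores its choices.

```python
import hashlib


def image_id(raw_image_bytes):
    """Content-addressed id: identical bytes always yield the same id."""
    return "sha256:" + hashlib.sha256(raw_image_bytes).hexdigest()


# The same logo embedded on several pages collapses to one decision.
logo = b"\x89PNG...fake logo bytes for illustration"
pages = [logo, logo, logo]
decisions = {image_id(img): "blackout" for img in pages}
```

Because the key depends only on the bytes, `decisions` ends up with a single entry no matter how many pages carry the logo.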

Four tools are available in the per-image editor, all rendered into actual baked pixels (the canvas re-renders on every change so the operator sees the real result, not a translucent overlay):

Tool	Renders	Use case
Blackout	Solid black rectangle	Customer logo, sensitive name in a screenshot
Blur	Gaussian blur (configurable radius)	Faces, screenshots whose context matters but identifying details don't
Pixelate	NEAREST-resampled mosaic	Same as blur but with stronger irreversibility cues
Text overlay	Coloured background rectangle + centred text	"REDACTED" badges, custom labels with custom font / background colour

Figure: per-image editor with the four redaction tools (thumbnail strip on top, blackout rectangle baked over a Burp request). Live bake means what you draw is what Apply will write.

Identity guarantee. Apply replaces image bytes IN PLACE at the same xref (PDF) / shape position (DOCX, PPTX), so the output keeps the same number of images, in the same files, in the same positions, with the same dimensions. The verifier post-stage asserts this via an inventory cross-check; any mismatch is logged in verifier_report.md.

Out of scope (intentionally). OCR-assist (auto-detect text regions) is not implemented; vector-graphics inside PDF pages are flagged with a warning but not editable from the GUI; ODT and XLSX images surface a "no editor support yet" notice. None of these affect the text pipeline, only the image-redaction surface.

What the pipeline never touches

The following classes of strings are deliberately preserved because removing them would break the report's technical narrative or because they are not customer-identifying.

Technical content of the report: descriptions, payloads, exploit code, shell commands, tool output snippets, request / response bodies (only the values they contain may be flagged, not the surrounding code).
Standards and well-known libraries: NaCl, libsodium, OpenSSL, OAuth, OAuth2, JWT, SAML, OIDC, WebRTC, FCM, APNS, OneSignal, Firebase, Google Play Services, Apple Push, libsignal, Curve25519, Ed25519, AES, SHA-256, HMAC, PBKDF2, Argon2.
Standard constants and reserved ranges: RFC 5737, RFC 1918, loopback, well-known public DNS, RFC 2606 example domains.
Generic OS / SDK / library versions: Android 10, iOS 17, Java 17, Python 3.12, OpenSSL 3.0.
Generic hardware models: Samsung SM-A920F, Xiaomi Redmi, Pixel 7, iPhone 14. These are lab-test details, not customer-identifying.
Dates in any format: 7 May 2026, 2026-05-07, 7 maggio 2026.
Generic file names and project paths: ADVISORY.md, README.md, exploit_usage.md, debug_server.py, data/dev_keypair.json, src/main/java/..., proof/screenshot.png, and similar technical artefact names.
Generic descriptive endpoints: /api/v1/login, /healthz, /metrics, keyserver/v1/publish.
Variable / function / class names found in code, including R8 / ProGuard obfuscated names: pi.a.n, zi/a.java, License.WebServiceURI, sodium_key, key_id, encoded_key, OWN_KEY, DEV_PUB, VICTIM_PUB, crypto_box.
Test / placeholder identifiers that already document an attack: VITTIMA, VICTIM, Lab1, Lab2, attaccante, mitm.
Generic security terms: MITM, CSRF, XSS, RCE, SSRF, CVE-*, CVSS, CWE, OWASP.
Already-anonymized values: previous placeholders (+39NNN0000NNN, 203.0.113.NN, vendor.example, X-Vendor-*, VENDOR-CHAIN-*).

How the two tiers cooperate

Tier-0 (anonymize/rules_pass.py) is a deterministic regex pass. It catches phone numbers, IP addresses and 32 / 64-hex tokens without touching the LLM, and assigns a stable index (+393331111111 always resolves to the same placeholder via decisions_history.jsonl). Tier-0 hits auto-promote.
Tier-1 (anonymize/detector.py) is the LLM detector with the prompt described above. It walks the document chunk by chunk via the structure-aware splitter, emits candidates with category + suggested placeholder + confidence, then a critic pass checks each candidate against the "is this really a customer-identifying value?" question. High confidence + critic-approved candidates auto-promote; the rest go to the human Review queue.
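The routing between auto-promotion and the human Review queue can be summarized as a small decision function. Everything named here is hypothetical: the `Candidate` fields, the `route` function and the 0.9 threshold are assumptions for illustration, since the page does not state the actual confidence cut-off.

```python
from dataclasses import dataclass

HIGH_CONFIDENCE = 0.9   # assumed threshold, not taken from the project


@dataclass
class Candidate:
    value: str
    category: str
    tier: int                     # 0 = regex pass, 1 = LLM detector
    confidence: float = 1.0
    critic_approved: bool = False


def route(candidate):
    """Return "promote" or "review" following the two-tier rules."""
    if candidate.tier == 0:
        return "promote"          # deterministic hits auto-promote
    if candidate.confidence >= HIGH_CONFIDENCE and candidate.critic_approved:
        return "promote"
    return "review"               # everything else goes to a human


phone = Candidate("+393331111111", "phones", tier=0)
brand = Candidate("AcmeBank", "brand", tier=1, confidence=0.95)
```

In this sketch `phone` is promoted immediately, while `brand` lands in the review queue until the critic approves it.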

Why the categories matter

The category drives:

Placeholder format: phones get +CC<carrier>0000NNN, IPs get RFC 5737, hex tokens get the 8-char-prefix-preserving rule, etc. Code in anonymize/placeholders.py.
Auto-promotion threshold: some categories (Tier-0 phones, Tier-0 IPs) auto-promote on the first hit; others (brand, credentials, other) require critic agreement.
Per-project review: in the GUI's Review pane, candidates are grouped by category so the operator can blast through homogeneous sets quickly (approve all phones, edit questionable brand variants by hand).
