How to Redact a PDF: Permanent Black Bars vs Removable Highlight

A federal prosecutor in 2011 filed a "redacted" court document with sensitive cooperator names hidden behind black rectangles. A reporter copy-pasted the redacted text into a notepad and got the unredacted version verbatim. The black rectangles were drawing-overlay annotations sitting on top of the original text, not actual text removal: exactly the kind of failure the NSA's "Redacting with Confidence" guide had been warning about for years. The fundamental problem is that a PDF stores page text and overlaid graphics as separate objects; drawing a black box on a page doesn't touch the underlying text at all. Anyone who copy-pastes, searches, or opens the file in a text-extraction tool gets the supposedly secret content back unchanged. Real redaction has to delete the text from the page's content stream itself, and many "redaction" tools, including some commercial ones, don't actually do that.

This guide covers the real technical difference between a visual cover-up and a true redaction, the legal contexts where the distinction has blown up in court, the per-element vs block-area redaction methods, and how to apply true redaction via a client-side PDF redaction tool where the operation runs in your browser and the original file never uploads to a server. Get this right and your redaction is irreversible; get it wrong and you've just published the secret content with a black rectangle decoration.

Why Black-Box Annotation Isn't Redaction

PDFs structure content in two largely independent layers: the content stream (the text characters, font references, and exact positions of every glyph) and the graphics overlay (annotations, watermarks, drawn shapes, signatures). When you draw a black rectangle on top of a paragraph using a basic PDF editor, you're adding a graphics-layer annotation, and the underlying content stream is untouched. The text remains in the page's content stream, fully searchable, fully copy-pasteable, fully extractable.
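
To make the overlay failure concrete, here is a toy model in Python. It is a sketch, not a PDF parser: PAGE is a hand-written, uncompressed content stream, and extract_text is a hypothetical helper that only understands literal strings shown with the Tj operator. Real files usually compress their streams and encode text in more ways, and an overlay is often stored as a separate annotation object rather than appended to the page, but the effect on the text is the same.

```python
import re

# A simplified, uncompressed page content stream (illustrative only).
# BT/ET bracket a text object; Tj shows a literal string.
PAGE = "BT /F1 12 Tf 72 720 Td (Witness: Jane Roe) Tj ET\n"

# "Redacting" by drawing: set the fill color to black (rg), define a
# rectangle (re), and fill it (f). Nothing in the text object changes.
BLACK_BOX = "0 0 0 rg 70 715 160 16 re f\n"


def extract_text(content_stream: str) -> list[str]:
    """Return every literal string shown with Tj (toy extractor)."""
    return re.findall(r"\(([^)]*)\)\s*Tj", content_stream)


covered = PAGE + BLACK_BOX
print(extract_text(covered))
```

The box changes what a viewer paints, not what the extractor reads: the name is still returned in full.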

This is not theoretical. The 2011 prosecutor failure was widely reported: copy-paste of the visible black bars revealed dozens of names. The 2014 NYT investigation of TSA "redacted" documents showed the same failure pattern across dozens of FOIA productions. The FTC's "redacted" 2023 Amazon antitrust filing was reverse-engineered by Bloomberg reporters within hours. Every one of these incidents shared the same root cause: someone drew a black box instead of removing the underlying text.

Real redaction has to do three things: remove the targeted text from the content stream (so PDF text extraction returns nothing for that region), remove the corresponding entries in any embedded font or text-position metadata, and replace the visual area with an opaque block of the appropriate dimensions so the page layout doesn't reflow. Adobe Acrobat Pro's "Redact" tool does this, as do dedicated redaction libraries that implement the redaction annotations (annotation subtype Redact) defined in the PDF specification. The browser-based PDF redaction tool implements the same technical operation client-side, using pdf-lib for low-level content-stream editing.

The distinction matters most in legal, regulatory, and journalism contexts where the "redacted" document is a public record and recovering the redacted content has real-world consequences. For internal-only redaction (showing a client only the relevant lines of a contract draft, hiding personal info on a screenshot before posting to Slack), visual cover-up may be acceptable, but never confuse "good enough internally" with "safe to publish."

How True Redaction Actually Works

The technical operation has three steps. First, identify the target, either by per-element selection (select specific text spans like names, dollar amounts, dates) or by block-area selection (a rectangle drawn over a region of the page, redacting everything within). Per-element is more precise but tedious for long documents. Block-area is faster but risks leaving unredacted text just outside the box that might still reveal context.
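
The two selection modes can be sketched as two filters over positioned text spans. This is an illustrative model only: Span, select_per_element, and select_block_area are hypothetical names, and the coordinates are made up. A real tool would get spans and their bounding boxes from a PDF parser.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    x: float       # left edge, in page units
    y: float       # bottom edge
    width: float
    height: float


def select_per_element(spans: list[Span], pattern: str) -> list[Span]:
    """Per-element: pick spans whose text matches a pattern."""
    rx = re.compile(pattern)
    return [s for s in spans if rx.search(s.text)]


def select_block_area(
    spans: list[Span], rect: tuple[float, float, float, float]
) -> list[Span]:
    """Block-area: pick every span that overlaps the rectangle at all."""
    x0, y0, x1, y1 = rect
    return [
        s for s in spans
        if s.x < x1 and s.x + s.width > x0
        and s.y < y1 and s.y + s.height > y0
    ]


spans = [
    Span("Reported by", 72, 700, 60, 12),
    Span("John Smith", 136, 700, 55, 12),
    Span("Sales manager, Region 4", 72, 684, 120, 12),
]

by_name = select_per_element(spans, r"John Smith")
by_box = select_block_area(spans, (134, 698, 195, 714))
print([s.text for s in by_name], [s.text for s in by_box])
```

Both modes pick up the name here, and both leave the job-title line on the row below untouched, which is the context leak the paragraph above warns about.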

Second, mark the targets with redaction annotations and apply. Most redaction tools have a two-phase workflow: "mark for redaction" (which adds a visual indicator showing what will be removed but doesn't yet alter the content) and "apply redactions" (which performs the destructive operation, removing text from the content stream and inserting opaque blocks). The two-phase model lets you review before destroying; once applied, redaction is irreversible without the original file.
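
The mark-then-apply workflow can be modeled in a few lines. The Page class below is a hypothetical toy, not any tool's actual data structure: text runs are plain tuples with an anchor point, and a run counts as "inside" a region if its anchor point falls within the rectangle.

```python
from dataclasses import dataclass, field

Rect = tuple[float, float, float, float]  # x0, y0, x1, y1


@dataclass
class Page:
    # Each run is (text, x, y): a toy stand-in for content-stream text.
    runs: list[tuple[str, float, float]]
    boxes: list[Rect] = field(default_factory=list)    # opaque fills
    pending: list[Rect] = field(default_factory=list)  # marks only

    def mark(self, rect: Rect) -> None:
        """Phase 1: record the region; page content is unchanged."""
        self.pending.append(rect)

    def apply(self) -> None:
        """Phase 2: delete runs in marked regions, then add opaque boxes."""
        def hit(x: float, y: float) -> bool:
            return any(
                x0 <= x <= x1 and y0 <= y <= y1
                for x0, y0, x1, y1 in self.pending
            )
        self.runs = [r for r in self.runs if not hit(r[1], r[2])]
        self.boxes.extend(self.pending)
        self.pending.clear()

    def text(self) -> str:
        return " ".join(r[0] for r in self.runs)


page = Page(runs=[("Informant:", 72, 700), ("Jane Roe", 140, 700)])
page.mark((135, 695, 200, 715))
after_mark = page.text()    # still contains the name
page.apply()
after_apply = page.text()   # name is gone from the model
print(after_mark, "|", after_apply)
```

Marking is reversible because it only touches the pending list; apply is the step that discards data, which is why the review happens between the two.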

Third, flatten the result and strip metadata. PDF metadata (author, title, creation date, comments, version history, embedded thumbnails) can leak the very information you redacted. The DOJ's FOIA technical guidance is explicit that metadata stripping is a required step in any federal-records redaction workflow. A redacted document that still contains an XMP metadata stream listing the original author and revision history hasn't been fully redacted.
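
As a rough post-redaction check, you can scan the raw bytes of the output file for leftover metadata. The sketch below is a heuristic under stated assumptions: find_metadata is a hypothetical helper, it only sees uncompressed objects with literal-string values, and the sample bytes are invented. Hex or UTF-16 strings and compressed object streams will be missed, so a clean result is not proof that the file is clean.

```python
import re

# Common keys of the document information dictionary.
INFO_KEYS = (b"Author", b"Title", b"Subject", b"Keywords", b"Creator", b"Producer")


def find_metadata(pdf_bytes: bytes) -> dict[str, str]:
    """Heuristic scan of raw bytes for leftover metadata."""
    found: dict[str, str] = {}
    for key in INFO_KEYS:
        # Matches entries such as: /Author (J. Smith)
        m = re.search(rb"/" + key + rb"\s*\(([^)]*)\)", pdf_bytes)
        if m:
            found[key.decode()] = m.group(1).decode("latin-1")
    # XMP packets are XML and normally start with one of these markers.
    if b"<x:xmpmeta" in pdf_bytes or b"<?xpacket" in pdf_bytes:
        found["XMP"] = "packet present"
    return found


# Invented sample bytes standing in for a file read with open(path, "rb").
sample = (
    b"1 0 obj << /Author (J. Smith) /Producer (ExampleWriter) >> endobj\n"
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
)
report = find_metadata(sample)
print(report)
```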

For court filings specifically, the ABA Model Rule 1.6 on confidentiality creates a professional-responsibility duty for attorneys to verify that redaction was actually applied; submitting a "black box" PDF that still contains the underlying text can constitute professional misconduct.

Step-by-Step Using ScoutMyTool

The PDF redaction tool runs entirely in your browser using pdf-lib's content-stream editing primitives. Drop a PDF on the page, switch to per-element or block-area mode, mark each region, then click "Apply Redactions" to perform the destructive removal. The result is a new PDF with text actually deleted from the content stream, so copy-paste, search, and text-extraction tools all return nothing for the redacted regions. After redaction, run the compress tool if needed to optimize for file size, and use extract pages to share only the redacted pages of a larger document.

For documents with sensitive content where the file itself shouldn't leave the device, client-side redaction, whether in the browser or in desktop software, is the privacy-safe path. Server-based redaction tools (most "free online" options) require uploading the original unredacted document, creating an exposure window where the third-party server has the unredacted content. For privileged or regulated documents, this exposure alone is the privacy failure, even if the resulting redacted file is published correctly.

Worked Examples

Example 1: FOIA production with personal information. A federal agency FOIA officer is preparing 47 pages of inter-agency emails for public release. Names of non-senior employees, personal email addresses, and phone numbers must be redacted under FOIA Exemption 6 (privacy). Method: per-element redaction, marking each name and contact field. After applying, the agency officer extracts text from the redacted PDF to verify nothing remains in the content stream, a standard quality check the DOJ FOIA process recommends before public release.

Example 2: Litigation discovery production. A defending firm is producing 200 pages of email correspondence in response to a discovery request. Privileged communications between the client and outside counsel must be redacted. Method: block-area redaction over each privileged email body, leaving the metadata header (sender, recipient, date) visible. After applying, the redacted file is text-extracted and compared against a non-redacted reference copy to confirm only the privileged body text was removed. Bates numbers (added separately via the page-number tool) preserve the production-page identifier across redactions.
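
The comparison step in Example 2 can be sketched with Python's standard difflib module. This assumes you have already extracted plain text from both the reference copy and the redacted file with a tool of your choice; removed_lines is a hypothetical helper, and the sample email is invented.

```python
import difflib


def removed_lines(reference: str, redacted: str) -> list[str]:
    """Lines present in the reference text but absent after redaction."""
    diff = difflib.ndiff(reference.splitlines(), redacted.splitlines())
    # ndiff prefixes lines unique to the first input with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]


reference_text = "\n".join([
    "From: a@example.com",
    "Date: 2024-01-05",
    "Counsel advises settling.",
    "Regards",
])
redacted_text = "\n".join([
    "From: a@example.com",
    "Date: 2024-01-05",
    "Regards",
])

gone = removed_lines(reference_text, redacted_text)
print(gone)
```

Reviewing the list of removed lines shows both kinds of mistake at once: privileged text that survived is missing from the list, and header lines that should have stayed visible appear in it.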

Example 3: Public-records release with financial details. A municipal government is releasing a contract under public-records request. Vendor pricing and bid totals are public; vendor employee SSNs and bank routing and account numbers in the payment schedule are not. Method: per-element redaction targeting the specific 9-digit SSN format, 9-digit routing numbers, and the account numbers listed beside them. After applying, the released file is searched for any remaining 9-digit numeric strings as a final-pass safety check.
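
The final-pass search in Example 3 is a pattern scan over the text extracted from the released file. The sketch below is one possible pattern, not a complete PII detector: it flags SSN-style groupings and bare runs of exactly nine digits, and it will miss numbers split across lines or written in other formats. The sample values are deliberately fake.

```python
import re

# Either 3-2-4 digits separated by hyphens or spaces, or exactly nine
# consecutive digits. The lookarounds reject longer digit runs.
NINE_DIGITS = re.compile(r"(?<!\d)(?:\d{3}[- ]\d{2}[- ]\d{4}|\d{9})(?!\d)")


def leftover_numbers(extracted_text: str) -> list[str]:
    """Return every 9-digit candidate still present in the text."""
    return NINE_DIGITS.findall(extracted_text)


sample_text = (
    "Bid total: $1,250,000. SSN 000-12-3456. "
    "Routing 123456789. PO 1234567890."
)
hits = leftover_numbers(sample_text)
print(hits)
```

An empty result on the real file is the pass condition; any hit should be traced back to its page and redacted in the original before re-releasing.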

Example 4: Journalism source-protection. A reporter has received a 30-page leaked memo and wants to publish portions but redact source-identifying details (specific dates, named individuals known only to a small group, internal project codenames). Method: per-element redaction on each unique identifier, plus block-area redaction on a 4-paragraph section that contains too many small identifiers for per-element to be safe. After applying, the redacted file's metadata is stripped (XMP stream, author, comments) before publication. The original memo is kept on an air-gapped device.

Common Pitfalls

The biggest pitfall is using a black-box overlay and calling it redaction. The text underneath remains fully extractable. This is the failure mode behind every "DOJ accidentally publishes redacted info" headline. If you're not using a tool that explicitly says it removes content from the PDF stream (not just covers it visually), you're not redacting.

The second is forgetting metadata. PDFs typically contain XMP metadata streams with author, title, creation date, revision history, and sometimes track-changes-style edit history. The redacted content stream might be clean while the metadata stream still contains the original draft. Strip metadata as a separate explicit step.

The third is per-element redaction missing context. Redacting "John Smith" five times throughout a document but leaving in "the manager who reported to the VP of Sales" defeats the redaction when the org chart makes that description uniquely identifiable. For high-stakes redaction (law enforcement informants, intelligence sources, witness protection), the NSA Redacting with Confidence guide recommends adversarial review by a second reader specifically looking for context-based re-identification.

The fourth is forgetting embedded objects and attachments. PDFs can contain embedded files, attached images, JavaScript, and forms, all of which can carry data that survives a content-stream redaction. The Wikipedia entry on redaction failures catalogs cases where redaction missed embedded objects and PII surfaced via the embedded layer.
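
A quick triage for this pitfall is to count PDF name tokens that usually signal content outside the page text. The list of names and the count_suspect_names helper below are illustrative assumptions, and the sample bytes are invented. Names can also be hex-escaped or hidden inside compressed object streams, so treat a hit as a reason to inspect the file, not a miss as a clean bill of health.

```python
import re

# Name tokens that often point at attachments, scripts, or form data.
SUSPECT_NAMES = (
    "EmbeddedFiles", "Filespec", "JavaScript", "JS", "AcroForm", "OpenAction",
)


def count_suspect_names(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each suspect name token in raw PDF bytes."""
    counts: dict[str, int] = {}
    for name in SUSPECT_NAMES:
        # The lookahead stops /JS from matching a longer name like /JSX.
        pattern = rb"/" + name.encode() + rb"(?![A-Za-z0-9])"
        hits = re.findall(pattern, pdf_bytes)
        if hits:
            counts[name] = len(hits)
    return counts


# Invented sample bytes standing in for a file read with open(path, "rb").
sample_pdf = (
    b"<< /Names << /EmbeddedFiles 5 0 R >> /AcroForm 7 0 R >>\n"
    b"<< /S /JavaScript /JS (app.alert(1)) >>"
)
found_names = count_suspect_names(sample_pdf)
print(found_names)
```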

The fifth is uploading the unredacted file to a third-party server to perform redaction. The whole point of redaction is preventing exposure; uploading first creates the exposure you were trying to prevent. Client-side redaction, in the browser or in desktop software, avoids introducing a third-party copy of the unredacted source.

Frequently Asked Questions

Q: Why isn't a black rectangle on a PDF actually redacted? A: PDF separates text content (the content stream with character codes, font references, and positions) from graphics annotations (drawn shapes, watermarks). A black rectangle is a graphics-layer drawing on top of the page; the text underneath remains in the content stream and is fully recoverable via copy-paste, search, or text extraction. True redaction has to delete the text from the content stream itself.

Q: Will Adobe Acrobat redaction work for legal filings? A: Yes. Adobe Acrobat Pro's "Redact" tool performs true content-stream removal. The two-phase workflow (mark, then apply) actually edits the PDF objects when you click Apply. Verify by copying text from the redacted region; you should get nothing. Acrobat Reader (free version) does NOT include redaction; that's Acrobat Pro only.

Q: Can OCR recover text from a redacted PDF? A: Not from a properly redacted one. OCR reads the rendered page image, so an opaque block hides the text from OCR whether or not the underlying text was removed. The real risk with an overlay is that the box can be deleted or the text extracted directly, with no OCR needed. If any tool recovers text from a redacted region, the redaction was a black-box overlay rather than real redaction.

Q: What's the difference between flattening and redacting? A: Flattening makes annotations (comments, watermarks, drawn shapes) part of the page content so they can't be separately edited or removed. Flattening a black-box annotation makes the box part of the visible page, but the original text underneath the box is still in the content stream. Flattening alone is NOT redaction. Real redaction removes the underlying text in addition to flattening.

Q: How do I redact metadata from a PDF? A: Most redaction tools have a "remove document metadata" option as a separate step. Strip the XMP metadata stream, document properties (author, title, creation date), comments, and revision history. The DOJ FOIA guidance explicitly lists metadata removal as a required step in federal-records redaction.

Q: Can I undo a redaction? A: Real redaction is irreversible: the original text has been deleted from the content stream and is not recoverable from the redacted file. This is the point. Always keep a copy of the original (offline, secured) before applying redaction in case you need to re-redact differently later.

Q: Is browser-based redaction as secure as desktop software? A: For the redaction operation itself, yes, provided the tool really edits the content stream. pdf-lib is a JavaScript library, and it and similar JavaScript or WebAssembly libraries can perform the same kind of content-stream edits as desktop tools. Like desktop software, client-side browser redaction never uploads the original file to a server, which eliminates the third-party-exposure risk of server-based tools. For privileged or regulated documents, that makes it a far safer path than any upload-based service.

Wrapping Up

Redaction is not "drawing a black rectangle"; it is removing text from the PDF content stream so the redacted regions contain no recoverable data. Use a tool that actually performs content-stream removal (Adobe Acrobat Pro, the browser-based redaction tool, or a dedicated redaction library), strip metadata as a separate step, and verify by attempting text extraction on the redacted file. Get this right and the redaction is permanent; get it wrong and you've published the supposedly secret content with a black-rectangle decoration on top.
