The problem with drawing boxes over text
PDF is a page-description language, not an image format. When you place a black rectangle over existing text, the file ends up with two overlapping elements: the original text objects and, painted after them, the black rectangle. A viewer draws page content in order, so the rectangle covers the text on screen. But the underlying characters are still in the file's content stream, along with their exact position data.
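To make the layering concrete, here is a minimal sketch of what such a page can look like internally. The content stream is hand-written and hypothetical (the SSN is a placeholder, and real streams are usually compressed and may encode text differently), and `extract_shown_strings` is an illustrative helper, not part of any PDF library. It reads only the `Tj` text-showing operators and ignores the rectangle entirely.

```python
import re

# Hypothetical, hand-written page content stream (uncompressed for readability).
# The text is drawn first; a filled black rectangle is then painted over it.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 700 Td
(SSN: 000-00-0000) Tj
ET
0 0 0 rg
70 695 130 16 re
f
"""

def extract_shown_strings(stream: bytes) -> list[str]:
    """Return the literal strings shown with the Tj operator.

    Drawing operators such as `re` (rectangle) and `f` (fill) are ignored,
    which is roughly what a text extractor does.
    """
    pattern = rb"\(((?:[^()\\]|\\.)*)\)\s*Tj"
    return [m.decode("latin-1") for m in re.findall(pattern, stream)]

print(extract_shown_strings(CONTENT_STREAM))
```

The rectangle changes what is painted on screen, but the extractor never looks at it, so the string comes back unchanged.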
A 2019 study by the University of Illinois found that redaction failures affected over 60% of produced documents in eDiscovery reviews at major law firms. In 2021, a medical insurer accidentally exposed patient SSNs in public court filings because a paralegal had used a highlight tool rather than a proper redaction tool. The redaction had been "applied" in the visual sense, but the underlying data was intact.
- Black rectangle overlaid on text: original characters still in content stream
- Copy-paste from the PDF still extracts the hidden text
- Text search scripts can find and extract all redacted values
- PDF repair tools and parsers can strip the overlay and reveal the original content
- Metadata fields and other parts of the file may still contain the original data
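The search-script failure mode in the list above can be sketched in a few lines. This example assumes a page stream stored with the common `/FlateDecode` filter (zlib compression) and text written as plain literal strings; the page content, the placeholder SSN, and the helper name `find_ssn_like_values` are all hypothetical. Real files may use font-specific encodings that need a proper PDF parser, so treat this as an illustration rather than an audit tool.

```python
import re
import zlib

# Matches SSN-shaped values such as 000-00-0000 in raw bytes.
SSN_PATTERN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")

def find_ssn_like_values(raw_stream: bytes, compressed: bool = True) -> list[str]:
    """Decompress a stream if needed and return every SSN-shaped value in it."""
    data = zlib.decompress(raw_stream) if compressed else raw_stream
    return [m.decode("ascii") for m in SSN_PATTERN.findall(data)]

# Hypothetical page content: the text first, then a black rectangle on top.
page_ops = (
    b"BT /F1 12 Tf 72 700 Td (SSN: 000-00-0000) Tj ET\n"
    b"0 0 0 rg 70 695 130 16 re f\n"
)
stored = zlib.compress(page_ops)  # how the stream would sit in the file

leaks = find_ssn_like_values(stored)
print(leaks)
```

Run against the output of a redaction step, a check like this gives a quick signal: any value it returns was covered visually but never removed.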