Redaction of Sensitive Information

Information hidden in a document can be the unintentional consequence of using complex software. From a forensics point of view, it represents a bonus—something we didn’t expect to find. But information can also be intentionally hidden, disguised, or removed. Uncovering what someone does not want you to know represents a challenge.

In a variety of circumstances, government agencies, the courts, and others need to publish documents that contain sensitive information. Within these documents, they may need to remove or obscure the names of individuals, identification numbers such as a social security IDs, or colorful expletives that are deemed inappropriate for publication. A good example would be the publication of a government intelligence briefing as part of a congressional hearing, in which a foreign informant is named. The name for this selective editing is redaction . It is a polite name for censorship.

In the past, redaction has meant obscuring the relevant text on a piece of paper with a black marker. Any subsequent photocopies of the paper would retain the blacked-out region and there would be no way for anyone to read the underlying words. That has proven to be a simple, cheap, and extremely effective way of hiding information. But these days, most of the documents that we deal with are in electronic form. The PDF file format, in particular, is a very convenient way to distribute documents, including those scanned from handwritten or other non-electronic sources.

In redacting documents of this kind, the approach taken by some has been to use the metaphor of the black marker on the contents of the PDF file. You can import the document into Adobe Illustrator, Macromedia Freehand, or something similar, cover the offending text with a black box, and then export the file as a PDF. Publish this on the Web, and anyone viewing it with Adobe Acrobat Reader, Apple Preview, or other simple PDF viewers, will see the text blacked out and will have no way to see what lies beneath. It is simple and appears to be effective, but you can see the problem, right?

Instead of obliterating the text, what you have done is, in effect, placed a piece of black tape over it. Peel back the tape and everything will be revealed. If you have the full version of Adobe Acrobat, as opposed to the ubiquitous Reader, you can use the TouchUp Object Tool to simply move the blacked-out region to one side, revealing the secrets beneath. Similarly, you can open PDF files in Adobe Illustrator or Macromedia Freehand and select and move the regions. It couldn’t be simpler.

This is clearly a really bad way to do redaction. Nobody would actually use that with real sensitive documents, would they?

In October 2002, the Washington D.C. area witnessed a number of fatal shootings that all appeared to be carried out by the same sniper. Two men were eventually arrested and convicted for the string of murders. In the course of that investigation, the police found a handwritten letter at the scene of one of the shootings in Ashland, VA, on October 19, apparently from the sniper, demanding a 10-million-dollar ransom. The note explained that the money was to be deposited to a specific credit card account.

As part of its coverage of the investigation, the Washington Post obtained a copy of the letter and decided to publish its contents. Somewhere along the line, either the Post, the police, or whoever passed the letter to the Post, chose to redact certain text in the letter including the credit card account number, the name on the card, a contact phone number, and so on. The letter was scanned and then a program such as Adobe Illustrator was used to black out the text, and the composite document was exported as a PDF file. The modified image was printed in the Washington Post and nobody reading the paper was able to read the redacted text. But then they put the PDF on their web site.

Anyone with the full version of Acrobat was able to click on the black boxes, move them to one side and reveal the personal credit card information that was covered. As it turned out, the snipers were arrested shortly thereafter and, in due course, convicted. The credit card had been stolen in California and has, undoubtedly, been cancelled.

None of the disclosed information appears to have disrupted the investigation or subsequent trials. Things could have been very different, however. Keeping evidence confidential prior to an arrest or conviction is a central tenet of police work. Failing to do so can make finding the criminal more difficult and it can seriously jeopardize any court proceedings that follow. This oversight in preparing the PDF files could have had very serious implications.

The PDF files are no longer on the Washington Post site, but you can find copies of the redacted and un-redacted versions here:

The New York Times made a similar mistake a couple of years prior to this in April 2000. At that time they published a detailed analysis of the involvement of the CIA in the overthrow of the government in Iran in 1953, which returned Shah Mohammed Reza Pahlevi to power. The articles, along with some of the supporting documentation, are available on the Times web site at http://www.nytimes.com/library/world/mideast/041600iran-cia-index.html.

The report was based on leaked documents, most notably a classified CIA history of the entire episode. The Times initially decided to withhold certain documents to avoid disclosing the identities of certain agents and informants, and they placed the documents they chose to release on their web site as PDF files. In mid-June 2000, however, they decided to make all the documents available. But prior to doing this, they reviewed the documents and redacted any names or information they thought might jeopardize any intelligence agents. Just as with the D.C. sniper case, they simply blacked out the offending text in Illustrator and exported the PDF file.

How the snafu was uncovered was a little different in this case. John Young of New York just had a copy of the standard Acrobat Reader with which to view the file. He also had a slow computer, which turned out be the critical element. Because of it, the PDF files loaded very slowly and, to his surprise, he got to see the original text for several seconds before the black redaction was overlaid on top of it. By stopping the page load at the right time, he was able to view all the redacted information. More details are available at http://cryptome.org/cia-iran.htm.

This feature was discovered using Version 3.01 of Acrobat Reader and has been fixed in newer releases. If you download the D.C. sniper letter using the current version of Reader with a slow PC or Internet connection, you will see the black boxes appear before the underlying image. All the same, it remains a clear example of the surprises that occur with software when enough people use the same product in different circumstances.

Both of the previous examples involved printed documents that had been scanned and then redacted. A more recent example from May 2005 involves text that was converted from a Microsoft Word document into PDF format. The document was a report by the U.S. Army on their investigation into the tragic shooting death of an Italian intelligence agent at an army checkpoint in Baghdad in March 2005.

The original report had been heavily redacted to conceal classified information prior to its release to the public. Figure 8-3 shows a sample of this. The blacked out text was intended to conceal the names of military personnel, current operational procedures, and recommendations for their improvement.

But within a few hours of its publication, an uncensored version of the report was available. Salvatore Schifani, an IT worker in Italy, had spotted the PDF document on a news site and quickly realized that he could reveal the hidden text simply by cutting and pasting it into another application. On Mac OS X, for example, with the document loaded into the Preview application, the keyboard sequence -2, -A, -C is all it takes to copy the entire uncensored text to the Clipboard.

As with the D.C. sniper example, the creator of this document simply laid black boxes over the sensitive text and did not take the additional steps necessary to fix them in place. This is perhaps the largest and most serious disclosure from a badly redacted PDF document thus far. To add insult to injury, the creator of the PDF file is clearly identified in the document summary.

A BBC news story on the disclosure and an Italian report from the newspaper Corriere Della Sera, which includes links to both versions of the document, can be accessed via these links:

Even if a redacted document has been prepared correctly, there may still be a way to uncover the text in certain cases, or at least to make an educated guess about it. In April 2004, David Naccache of Luxembourg and Clare Whelan of Dublin figured out a clever way to reveal the blacked out words in a U.S. intelligence briefing released to the public in redacted form as part of the inquiry into the September 11 attacks (http://www.theregister.com/2004/05/13/student_unlocks_military_secrets/). Their solution used a combination of font measurements, dictionary searches, and human intuition. One example they studied was this sentence:

An Egyptian Islamic Jihad (EIJ) operative told an ######## service at the time that Bin Ladin was planning to exploit the operative’s access to the US to mount a terrorist strike.

Starting with a slightly rotated copy of the original printed document, they aligned the text and figured out that it was in the font Arial. Then they counted up the number of pixels that were blacked out in the sentence. They looked through all the words in a relevant dictionary and selected those which, when rendered in Arial at the right size, would cover that number of pixels, give or take a few. They whittled down the list of those candidate words from 1,530 to 346 using semantic rules of some sort. Presumably this took into account the word an immediately before the redaction, indicating it began with a vowel. Visual analysis of that subset reduced that further to just seven candidates. Eventually human intuition led them to choose between the words Ukranian, Ugandan, and Egyptian, and they chose the latter. This is not such a remarkable choice given the rest of the sentence, but it’s still a very clever way of figuring out the secret. They applied the same technique to a Defense Department memo and identified South Korea as the redacted words in a sentence concerning the transfer of information about helicopters to Iraq.

How should you redact sensitive information? PDF files are still a great way to distribute documents over the Web. What you need is a way to completely remove the redacted text before the PDF file is created.

If the text is already in the form of a Word document or plain text file, then we can easily replace the problematic words with a standard string such as [redaction], or a string of dashes or other symbols, one for each character that is replaced. The benefit of the latter is that it shows you how much text has been modified. In assessing a document of this sort, especially a redacted government report, I really want to know if the censors removed only a few words or whether there are several pages that I am not being shown. Replacement characters let you see that, but if they cover only a word or two, then they leave useful clues about the hidden text. Using the fixed string hides that information effectively.

With scanned images, the best way to modify them is to use an image editor and erase the offending section of the image. It is important to do this in the same layer as the original image to avoid the aforementioned overlay problems. Save the image out as a simple image format file such as JPEG or PNG and check that it is properly redacted before including it in a PDF file. This does not get around the pixel counting approach, however. In that case, it would be necessary to expand, shrink, or move surrounding sub-images so as to confound this technique.

If the document is already in PDF format, there are two approaches that can protect the content. You can export the document as an image, such as a PNG format file, in which case all the PDF objects are projected onto a single layer. It is impossible to recover the hidden text from this format because the pixels of the overlaid black boxes have replaced it. The problem here is that you lose the flexibility that comes with PDF in terms of viewing individual pages, convenient printing, and so forth.

Alternatively, you can apply one of several document security options that are available in the full version of Adobe Acrobat. These options can allow document printing and viewing but prevent anyone from selecting or changing the text. While this is a convenient approach, the hidden text is still contained within the document, and a successful attack on the security mechanism could potentially reveal it. The best approach is to prevent the sensitive text from ever appearing in the file.

Redaction and censorship are two sides to the same coin. Sometimes the reason for redaction is political, suppressing damaging revelations or simply avoiding embarrassment. Discovering the hidden text might be seen as an act of good investigative journalism. But in many instances, redaction is used legitimately to protect the identity of a victim of crime, a child, or someone whose life may be placed in danger by the revelation. Just because you can reveal a secret does not mean that you should. Think before you act.