Building a successful attack chain with Google Advanced Operators and pdf Metadata

In this article I will talk about the importance of properly sanitizing documents before publishing. In particular, I'll focus on pdf files, which are probably the most common format for documents shared on the web. Starting from the considerations of a recent paper, we will see how an attacker could exploit the information inside the pdf metadata in the reconnaisance phase of an attack to build a successful attack chain against the pdf creator.


pdf metadata leakage
Some days ago I was reading a recent paper by Supriya Adhatarao and Cédric Lauradoux of INRIA called Exploitation and Sanitization of Hidden Data in PDF Files. In their study, the authors gathered a wide corpus of pdf files from security agencies all over the world and parsed them to look for sensitive information hidden in metadata. Quite surprisingly (at least to me), the result was that only 7 security agencies out of 75 were sanitizing their pdf files before publishing them on the Internet, thus exposing a wide variety of confidential information that spans from the name of the pdf author, the tool used to export the pdf file (possibly including the version), details on the information system and so on.

This paper is a nice reading, I leave the link in the References section. In my opinion, all this information that may be leaked with metadata could easily drive an attacker into building a successfull attack chain that could compromise (at least) the machine of the pdf creator. It is probable that attackers are already exploiting this weakness. In the following part of the article, I outline a possible attack chain that could be allowed by the information leakage in the metadata of an improperly sanitized pdf file.

Building an attack chain: step 1 - pdf collection

pdf metadata leakage
So now I am an attacker and I am guessing whether it is possible to footprint an organization using publicly available information. In particular, I want to try to use pdf documents, which is probably the type of document most widely shared by organizations.

For instance, let us suppose that I am interested into the financial aspects of an organizations. In this case I am simulating a targeted attack, an attack where both the company and the goal of the attacker is clearly defined.

I can take advantage of Google Advanced Operators to write a query on the Google search bar that only retrieves pdf files dealing with financial operators. My attempt is to find and download these specific documents in order to see if these pdf files are not properly sanitized. Indeed, if they are not properly sanitized (as usual), I can find information like the Author Name, the tool used to create the document and possibly I can also retrieve information about the OS or the path of the file.

In order to get financial pdf files of an organization I could type, for instance:

filetype:pdf financial report

where filetype is a Google advanced operator that only selects pdf files from the search, and Organization should be substituted with the name of the organization to be targeted.

Building an attack chain: step 2 - exfiltrating information from metadata and building the attack

I can guarantee that almost nobody fully sanitize their pdf before publishing them on the Internet, and if you download some pdf, open it with a text editor like Notepad++ and search for "Author" and "Creator" you will very likely manage to find the author of the pdf file and the tool used to generate it.

At this point an attacker could try to retrieve the email address of the Author, which should not be very difficult because she already knows the name and surname of the subject and organization emails always have a regular shape like name.surname@organization.com.

So at this point if the attacker also discovered the tool used to create the pdf (just look at the "Creator" string inside the raw pdf file), she is able to search in public vulnerability databases to see if by chance a tool with well-known vulnerability has been used. If this is the case, the attacker can craft an email, with a subject that deals with the topic of the downloaded pdf (so that the victim will likely open it), and inserting a malicious attachment that exploits a vulnerability in the software used by the victim. Also in cases the attacker did not manage to find a vulnerability for the tool, she has surely gained more information (probably the author name and there are chances that also the OS in use has been discovered) that can be used to try other attack vectors.

As an example, I have downloaded a pdf that was made publicly available on the Internet from the University to announce the assessment results for the candidates of a given degree programme were I was enrolled some years ago.
I have removed the personal information of the author but as you can see, it was possible to discover both the author and the tool used. This is information that should be kept confidential and should not be disclosed on the Internet. The fact that this information may also by used by an attacker is just another reason to sanitize these documents. pdf metadata leakage
Killing the chain

The best way to cut this chain is to properly sanitize pdf files before publishing them on the Internet or making them publicly available. The wide variety of freely available tools to gather Open Source Intelligence means that the risk of unintentionally leaking confidential information is very high in this days. We should take care of sanitizing all kind of documents before, for instance following guidelines available on NSA or NIST websites. This is the first and best way to kill this attack chain. If this is not possible for past documents, we should at least ensure that proper email security is enforced, using vendors that provide email inspection of the email attachments. The third and last way to kill this chain is having a decent antivirus on hosts, that in case users receive and open a malicious attachment and run the associated script, it is able to detect and block the threat before it is too late.

References

Supriya Adhatarao and Cédric Lauradoux, Exploitation and Sanitization of Hidden Data in PDF Files