The PDF file format is very popular. This page-description language and the PDF reader applications that support it are designed to prevent arbitrary code execution. However, numerous vulnerabilities have been found in popular PDF readers and exploited by countless malicious PDF documents. In this article, I explain how malicious PDF documents can execute arbitrary code, as well as what you can do as an administrator to protect your users. Many of the mitigation techniques that I discuss also apply to other document-handling applications, such as Microsoft Office.

Figure 1 shows the page-description language (PDL) code for a very simple one-page PDF document with the text “Hello World.” I designed it to contain only the most essential elements that make up a PDF document and to use only ASCII characters, so that you can read the internals of the document.

A PDF document contains a tree structure of objects with all the instructions needed by the PDF reader to render the document’s pages. In our example, the root object is object 1 (1 0 obj), which starts at byte offset 12 from the beginning of the file, as recorded in the cross-reference (xref) table. The root object refers to the collection of pages found in the PDF document (i.e., object 3). Our example document contains only one page, defined in object 4. The content of the page is defined in object 5; you can find the text Hello World between parentheses. (The other keywords define text properties, such as the font to be used and the text’s position on the page.)
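To make the relationship between object headers and xref offsets concrete, here is a minimal Python sketch. It rebuilds the header and the first two objects of the Hello World document as a byte string and locates each object header. The CRLF line endings and the single leading space on dictionary entries are assumptions on my part (they happen to reproduce the offsets 12 and 89 listed in the xref table of Figure 1), and the names SAMPLE and object_offsets are purely illustrative.

```python
import re

# Assumed byte layout of the start of the Figure 1 document:
# CRLF line endings and one blank line between objects.
SAMPLE = (
    b"%PDF-1.1\r\n"
    b"\r\n"
    b"1 0 obj\r\n"
    b"<<\r\n"
    b" /Type /Catalog\r\n"
    b" /Outlines 2 0 R\r\n"
    b" /Pages 3 0 R\r\n"
    b">>\r\n"
    b"endobj\r\n"
    b"\r\n"
    b"2 0 obj\r\n"
    b"<<\r\n"
    b" /Type /Outlines\r\n"
    b" /Count 0\r\n"
    b">>\r\n"
    b"endobj\r\n"
    b"\r\n"
)

# An object header is "<number> <generation> obj" at the start of a line.
OBJ_HEADER = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)


def object_offsets(pdf_bytes):
    """Map each object number to the byte offset of its header.

    This is a naive scan: it could be fooled by similar text inside a
    stream, so treat it as an illustration rather than a parser.
    """
    return {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(pdf_bytes)}


if __name__ == "__main__":
    for number, offset in sorted(object_offsets(SAMPLE).items()):
        # Same 10-digit, zero-padded form that the xref table uses.
        print(f"object {number}: {offset:010d}")
```

A PDF reader works the other way around: it reads the offsets from the xref table and seeks directly to each object, which is why the offsets must match the actual byte positions.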

This PDF example is easy to understand, because it uses uncompressed text. Typically, PDF documents use compressed text and can’t be easily read without appropriate tools.
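As a rough illustration of why compressed documents are hard to read, the following Python sketch compresses the page content from object 5 with zlib and then restores it. It assumes the FlateDecode filter (zlib/deflate), which is the compression filter most commonly seen in PDF streams; the CRLF line endings are again my assumption, chosen because they give the 48 bytes declared in /Length.

```python
import zlib

# Page content stream from object 5 (CRLF line endings assumed).
content = b"BT\r\n/F1 24 Tf\r\n100 700 Td\r\n(Hello World)Tj\r\nET\r\n"

# What a PDF producer would store when the stream uses /Filter /FlateDecode.
compressed = zlib.compress(content)

# What an analysis tool does to get the readable operators back.
restored = zlib.decompress(compressed)

if __name__ == "__main__":
    print("uncompressed length:", len(content))
    print("compressed bytes   :", compressed.hex())
    print("restored text      :", restored.decode("ascii"))
```

The hexadecimal dump of the compressed bytes is what you would see in a text editor, which is why analysis tools decompress streams before inspecting them.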

The PDF language and most PDF readers support JavaScript. Scripts can be embedded inside a PDF document and executed by the JavaScript engine of the PDF reader. This engine is restricted in its interaction with the OS. For example, there are no JavaScript statements or functions that allow arbitrary files to be read from or written to. JavaScript in PDF documents is often used in form processing, such as in order forms to calculate totals and sales tax.

So, how can malware authors create PDF documents that will infect systems? They do so by exploiting bugs (vulnerabilities) that they actively research in popular PDF reader applications, such as Adobe Reader. These vulnerabilities are often found in the PDF engine or in the JavaScript engine. Back in 2008, one such vulnerability was found in Adobe Reader in the JavaScript util.printf function. (Adobe patched this vulnerability, and it doesn’t exist in recent versions of Adobe Reader.)

The util.printf function takes arguments and produces a string formatted according to those arguments. But when util.printf is passed some very specific arguments, a bug in its internal code is triggered, and the code doesn’t behave as the programmers intended. Instead of formatting text and returning, execution jumps outside the program, to an address where no code exists. When a Windows program tries to execute code that doesn’t exist, an error is generated. This error terminates the Adobe Reader process.

Passing program control to an arbitrary address in memory is the holy grail of malware authors and exploit writers. This is what they need to make vulnerable applications execute their own code. Very skilled exploit writers can achieve total control of the address to which execution jumps. (This is called Extended Instruction Pointer, or EIP, control; EIP is the CPU’s instruction pointer, that is, the register that holds the address of the next instruction to execute.) Exploit writers first place their own code at this address, then exploit the vulnerability so that program execution passes to this address.

However, it’s rare to find such exploits with total EIP control in malicious PDF documents in the wild. (Malware found “in the wild” is malware that’s spreading unrestricted on the Internet—not including proof-of-concept malware that isn’t spreading, or malware used in very targeted attacks.) What’s often found in the wild is PDF malware with exploits that achieve partial EIP control. Malware authors can build an exploit to jump to a particular address in memory, outside the normal program execution, but they can’t build an exploit to jump to an arbitrary address in memory. They use a heap spray technique in JavaScript to plant their malicious code in memory: They fill the vulnerable program’s dynamic memory (the heap) with malicious shellcode. Shellcode is a small program written in machine language that can execute correctly anywhere in memory.

Shellcode used in common malicious PDF documents is very small and typically does the following: It downloads an executable file from a web server on the Internet with an HTTP request, writes this file to the disk in the system32 folder, and executes the downloaded file. The shellcode has no real malicious payload; it’s simply a downloader program that downloads and executes the real Trojan from the Internet. (Downloading a Trojan from the Internet provides malware authors with more flexibility; they can change the Trojan on the web server after they release their malicious PDF document in the wild.) This Trojan is what ultimately infects your machine—for example, by making it a member of a botnet.

In summary, here’s how a typical malicious PDF document performs its nefarious actions. When the document is viewed with a PDF reader, a JavaScript script automatically executes. This script fills the heap with shellcode and then triggers a bug (in the PDF engine or in the JavaScript engine of the PDF reader). This action leads to the execution of the shellcode and finally to the download and execution of a Trojan.
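Because this chain starts with JavaScript that runs automatically when the document is opened, a first-pass triage can simply count the PDF names associated with scripts and automatic actions. The Python sketch below does this for /JavaScript, /JS, /OpenAction, and /AA (additional actions). The function names are hypothetical, and this is only a heuristic: a clean count doesn’t prove that a document is safe, because objects can also be hidden inside compressed streams, and a nonzero count doesn’t prove that a document is malicious.

```python
import re

# PDF names tied to scripts and to actions that run without user interaction.
SUSPICIOUS_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA")


def decode_name_escapes(data):
    """Undo #xx hex escapes, which PDF allows inside names.

    For example, /J#61vaScript is another spelling of /JavaScript.
    """
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def count_suspicious_names(pdf_bytes):
    """Return how often each suspicious name occurs in the raw bytes."""
    normalized = decode_name_escapes(pdf_bytes)
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # Require a non-alphanumeric character (or end of data) after the
        # name so that a longer name with the same prefix is not counted.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, normalized))
    return counts


if __name__ == "__main__":
    # Harmless, made-up fragment used only to exercise the function.
    fragment = (
        b"1 0 obj << /Type /Catalog /OpenAction 2 0 R >> endobj "
        b"2 0 obj << /S /J#61vaScript /JS (app.alert(1)) >> endobj"
    )
    print(count_suspicious_names(fragment))
```

A document that combines /OpenAction or /AA with /JavaScript deserves a closer look in an isolated environment before it reaches users.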

Mitigation Techniques

What can IT professionals do to prevent malware authors from exploiting bugs? One solution that PDF software vendors often recommend is to disable JavaScript. This action is useful for newly discovered vulnerabilities because it prevents the heap spray from executing.

Another good mitigation technique is to use Least-Privileged User Accounts. Because shellcode in many malicious PDF documents writes Trojans to the system32 folder, it requires administrative access. Removing administrative access prevents the downloader shellcode from operating. It can download the Trojan, but it can’t write the Trojan to the system32 folder and therefore can’t execute the Trojan. Furthermore, many Trojans require administrative rights to insert themselves into the OS.
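To check what a given account can actually do, you can test whether it is able to create a file in a protected folder. The Python sketch below attempts to create (and immediately discard) a temporary file rather than relying on os.access, which on Windows doesn’t evaluate the folder’s access control list. The function name can_write_to and the way the system32 path is derived from the SystemRoot environment variable are assumptions for illustration.

```python
import os
import tempfile


def can_write_to(directory):
    """Return True if the current account can create a file in directory.

    The temporary file is deleted as soon as the with-block ends, so a
    successful probe leaves nothing behind.
    """
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers permission errors as well as missing folders.
        return False


if __name__ == "__main__":
    # Assumed location of the system32 folder on a typical installation.
    system_root = os.environ.get("SystemRoot", r"C:\Windows")
    system32 = os.path.join(system_root, "System32")
    print("system32 writable:", can_write_to(system32))
    print("temp dir writable:", can_write_to(tempfile.gettempdir()))
```

Run under a least-privileged account, the first check should report that system32 isn’t writable, which is exactly the condition that stops the downloader shellcode described above.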

Data Execution Prevention (DEP) is another important mitigation technique. In Windows, memory is marked as data or as executable. The heap is data; it’s not meant to contain executable machine code. But until the introduction of DEP in Windows XP SP2, microprocessors executed instructions stored in memory designated as data without any problem. DEP changes this behavior: The microprocessor no longer executes code (including shellcode) stored in data memory (such as the heap).

To prevent exploitation of bugs, PDF reader vendors must designate memory, such as the heap, as data and activate DEP for their programs. If vendors fail to do so, administrators can still use the Microsoft Enhanced Mitigation Experience Toolkit (EMET) to force DEP on specific programs.

However, exploit researchers have found ways around DEP. Instead of writing their custom shellcode to the heap, they build custom code by borrowing existing instructions from code that’s already loaded into the process address space via executable files (.exe and .dll files). This technique is called return-oriented programming (ROP), and the parcels of borrowed code are called ROP gadgets. When skilled malware authors can predict where the executable files are loaded into memory, they can borrow code from these executable files for their ROP gadgets and thereby exploit vulnerabilities in DEP-protected applications. To address this issue, Windows Vista introduced Address Space Layout Randomization (ASLR). ASLR ensures that executable files are loaded at a (semi-)random address in memory, which prevents malware authors from predicting where their ROP gadgets will be found in memory.

To benefit from ASLR, you must use a recent Windows version that supports ASLR (XP doesn’t). In addition, the application authors must have marked the executable files for ASLR support. If a software vendor doesn’t include ASLR support in an application, you can still use EMET to force it.

Even software applications that do support ASLR can become vulnerable to ROP attacks when they include DLLs that don’t support ASLR. For example, this is the case with some shell-extension DLLs. Shell extensions provide extra functionality to Windows (e.g., in the right-click Windows Explorer context menu). When you install an application such as WinZip, the setup program also installs a shell extension that provides WinZip integration with the right-click context menu in Windows Explorer, as well as in all other applications that use the common Open and Save dialog boxes. Fortunately, WinZip’s shell-extension DLL supports ASLR, so it doesn’t expose the hosting applications to ROP attacks. But not all software providers are as security minded as WinZip; some software providers install shell-extension DLLs that don’t support ASLR, and these DLLs expose hosting applications to ROP attacks. Applications such as Windows Explorer and Adobe Reader host shell-extension DLLs.
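Whether an executable file opts in to DEP and ASLR is recorded in the DllCharacteristics field of its PE header, so you can audit EXE and DLL files yourself. The following Python sketch reads that field and reports the two flags. The function name mitigation_flags is hypothetical; the flag values and header offsets come from the published PE format, and the sketch performs only minimal validation.

```python
import struct

IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040  # image opts in to ASLR
IMAGE_DLLCHARACTERISTICS_NX_COMPAT = 0x0100     # image opts in to DEP


def mitigation_flags(pe_bytes):
    """Report the ASLR and DEP opt-in flags of an EXE or DLL image."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # Skip the 4-byte signature and the 20-byte COFF file header.
    optional_header = pe_offset + 4 + 20
    # DllCharacteristics sits at offset 70 of the optional header in both
    # the 32-bit (PE32) and 64-bit (PE32+) layouts.
    (flags,) = struct.unpack_from("<H", pe_bytes, optional_header + 70)
    return {
        "aslr": bool(flags & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE),
        "dep": bool(flags & IMAGE_DLLCHARACTERISTICS_NX_COMPAT),
    }
```

To audit a real file, read it in binary mode and pass the bytes to the function. A shell-extension DLL that reports aslr as False is a candidate for the kind of exposure described above.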

Another mitigation technique that’s becoming popular is sandboxing. With sandboxing, the vulnerable application is more or less isolated from the resources of the underlying OS. As an administrator, you can use special sandboxing applications to isolate your vulnerable applications. In addition, vendors are starting to include sandboxing in their own products (e.g., Internet Explorer—IE—7.0, Microsoft Office 2010, Adobe Reader X). Sandboxing relies on Windows security features, such as integrity levels and restricted tokens, to contain exploits and malware inside the sandbox. Running inside the sandbox, the attacking shellcode is restricted. Depending on the type of sandbox, it has read-only access to the file system and registry (e.g., Adobe Reader X) or is completely isolated (e.g., Google Chrome).

Multi-Layered Protection

The best solution for mitigating PDF vulnerabilities is to use the most recent applications and OSs. If your application vendors fall short, you should implement mitigating actions yourself with EMET and sandboxing software. Although I focus on PDFs and Adobe Reader in this article, the solutions I present apply to all types of applications that produce and manage documents, including Microsoft Office.

Note that this article focuses on mitigation of malware found in the wild. Mitigation of highly targeted attacks can be much more difficult, depending on your opponent. In this type of attack, your opponent knows your environment and tailors the malware to operate successfully in that environment without detection. If you’re a financially interesting target and your opponent is skilled and resourced, you’ll need to go beyond the measures that I describe here, for example by implementing application whitelisting.

Figure 1: Code for the PDF document “Hello World”

%PDF-1.1

1 0 obj
<<
 /Type /Catalog
 /Outlines 2 0 R
 /Pages 3 0 R
>>
endobj

2 0 obj
<<
 /Type /Outlines
 /Count 0
>>
endobj

3 0 obj
<<
 /Type /Pages
 /Kids [4 0 R]
 /Count 1
>>
endobj

4 0 obj
<<
 /Type /Page
 /Parent 3 0 R
 /MediaBox [0 0 612 792]
 /Contents 5 0 R
 /Resources
 << /ProcSet 6 0 R
    /Font << /F1 7 0 R >>
 >>
>>
endobj

5 0 obj
<< /Length 48 >>
stream
BT
/F1 24 Tf
100 700 Td
(Hello World)Tj
ET
endstream
endobj

6 0 obj
[/PDF /Text]
endobj

7 0 obj
<<
 /Type /Font
 /Subtype /Type1
 /Name /F1
 /BaseFont /Helvetica
 /Encoding /MacRomanEncoding
>>
endobj

xref
0 8
0000000000 65535 f
0000000012 00000 n
0000000089 00000 n
0000000145 00000 n
0000000214 00000 n
0000000381 00000 n
0000000485 00000 n
0000000518 00000 n
trailer
<<
 /Size 8
 /Root 1 0 R
>>
startxref
642
%%EOF