Tool Time: Export PDF Text with Pdftotext

If you occasionally need to export text from PDF files, pdftotext might be a handy addition to your personal toolbox. Part of Foo Labs' free Xpdf package, pdftotext is a command-line tool that automates the export process.

Using pdftotext is straightforward. To export the text from a file named vmware.pdf, run the following command:

pdftotext vmware.pdf

This command automatically creates a new file named vmware.txt in the same folder as vmware.pdf. Where possible, pdftotext will remove embedded hyphenation and line breaks. By default, the output marks each page boundary with a page break (a form feed character). If you want to leave these page breaks out as well, you can add the -nopgbrk option:

pdftotext vmware.pdf -nopgbrk

To send the text output to the screen (standard output) instead of a file, specify - as the output file name at the end of the command:

pdftotext vmware.pdf -
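Because this form writes to standard output, the text can be piped straight into other command-line tools. The following is a minimal sketch rather than anything that ships with Xpdf: the function name search_pdf is hypothetical, and it assumes that pdftotext and grep are both on your PATH.

```shell
# Hypothetical helper: search the text of a PDF file for a pattern.
# Usage: search_pdf PATTERN FILE.pdf
search_pdf() {
  pattern=$1
  pdf=$2
  # "-" sends the extracted text to standard output instead of a .txt file.
  # grep -i ignores case; -n prefixes each match with its line number.
  pdftotext "$pdf" - | grep -i -n -e "$pattern"
}
```

For example, search_pdf snapshot vmware.pdf would print every line of the extracted text that contains "snapshot", each prefixed with its line number in that text.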

You can use multiple parameters together as well:

pdftotext vmware.pdf -nopgbrk -
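If you have a whole folder of PDF files to export, the same command can be wrapped in a shell loop. The sketch below is one possible approach under a few assumptions: the function name convert_all is hypothetical, pdftotext is assumed to be on your PATH, and the options are placed before the file name, as in the tool's usage synopsis.

```shell
# Hypothetical helper: export the text of every PDF file in a folder.
# Usage: convert_all [FOLDER]   (defaults to the current folder)
convert_all() {
  dir=${1:-.}
  for pdf in "$dir"/*.pdf; do
    # Skip the unexpanded pattern when the folder contains no PDF files.
    [ -e "$pdf" ] || continue
    # Each call writes a .txt file next to its PDF, without page breaks.
    pdftotext -nopgbrk "$pdf" || echo "failed: $pdf" >&2
  done
}
```

Running convert_all reports would leave a matching .txt file beside each PDF file in a folder named reports, and print a message for any file that pdftotext could not process.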

Pdftotext works only with actual text, so you won't be able to export images or scanned text that hasn't had optical character recognition (OCR) performed on it. However, it works extremely well in its specific niche.
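One rough way to spot a PDF file that has no extractable text, such as a scan that has not been through OCR, is to check whether pdftotext produces anything other than white space. This is only a sketch under stated assumptions: the function name has_text is hypothetical, pdftotext and grep are assumed to be on your PATH, and a file that pdftotext cannot open is reported the same way as a file with no text.

```shell
# Hypothetical helper: succeed if a PDF file yields any non-blank text.
# Usage: has_text FILE.pdf
has_text() {
  # Form feeds, spaces, and newlines all count as white space, so output
  # that consists only of page breaks is treated as "no text".
  pdftotext "$1" - | grep -q '[^[:space:]]'
}
```

A typical use would be a test such as: if has_text vmware.pdf; then echo "text found"; else echo "no text, OCR may be needed"; fi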

The Xpdf package contains several other tools that can be useful for manipulating PDF files. Pdftoppm and pdftops convert PDF files to the Portable Pixmap (PPM) and PostScript formats, respectively. Pdfimages extracts all images from a PDF file, pdfinfo returns general PDF metadata, and pdffonts lists the fonts used in a PDF file, which helps you diagnose font-related problems. If you work with PDF files and like command-line tools, Xpdf is well worth checking out.
