PDF file format: Internal Document Structure Explained

The PDF file format has a basic structure that consists of four parts: a header, a body, a cross-reference (xref) table, and a trailer. The header identifies the version of the PDF specification that the file uses. The body contains the objects that make up the document's content, such as text, images, and other media. The cross-reference table records the byte offset of each object in the file. The trailer tells a reader where the cross-reference table starts and which objects are the document's special objects, such as the catalog.

PDF has more functions than just text: it can include images and other multimedia elements, be password protected, execute JavaScript and so on. The basic structure of a PDF file is presented in the picture below:

PDF Header

The header specifies the version number of the PDF specification used in the document.

This can be found by using a hex editor or the xxd command:

$ xxd temp.pdf | head -n 1
0000000: 2550 4446 2d31 2e33 0a25 c4e5 f2e5 eba7  %PDF-1.3.%......

The temp.pdf document uses PDF specification 1.3. The ‘%’ character starts a comment in PDF, so the header line and the line after it are both comments. The bytes 2550 4446 2d31 2e33 0a25 from the output above correspond to the ASCII text “%PDF-1.3”, a newline (0a, shown as ‘.’), and a second ‘%’. The bytes that follow (c4e5 f2e5 eba7) are non-printable, which xxd also shows as ‘.’ dots; they are usually there to tell software that the file contains binary data and shouldn’t be treated as 7-bit ASCII text. Version numbers are of the form 1.N, where N ranges from 0 to 7; PDF 2.0 has since been standardized as well.
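The same check can be done programmatically. The Python sketch below assumes the header sits at the very start of the data; the function name pdf_header_version is made up for this example, and the sample bytes are the ones from the xxd output above.

```python
import re

def pdf_header_version(data: bytes):
    """Return the version from a '%PDF-M.N' header, or None if absent.

    Assumption: the header is at offset 0 of the data.
    """
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None

# The 16 bytes shown by xxd for temp.pdf
sample = bytes.fromhex("255044462d312e330a25c4e5f2e5eba7")
print(pdf_header_version(sample))  # 1.3
```

To check a real file, read its first few bytes in binary mode, for example with open("temp.pdf", "rb").read(16), and pass them to the function.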

Body of PDF document

The Body section is used to hold all the document’s data that is being shown to the user.
In other words, the body of the PDF document contains objects such as text streams, images, other multimedia elements, etc.

xref table

This is the cross-reference table. Each object in the document has an entry in this table, which allows for quick and random access to objects in the file. There is no need to read through the entire PDF document just to locate a specific object. Every entry in the table is 20 bytes long.

Here is an example:

xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 4
0000025518 00002 n
0000025632 00000 n
0000000024 00001 f
0000000000 00001 f
36 1
0000026900 00000 n

The cross-reference table of a PDF document is located at the bottom of the file. You can investigate it by opening the PDF in a text editor, such as vi. You will need to scroll to the bottom of the document to see it.

In the example above, there are four subsections (note the four lines that contain only two numbers). The first number in each of those lines is the object number of the first object in the subsection, while the second number states how many objects the subsection contains. Each object is represented by one entry, 20 bytes in total (including the CRLF).

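The first 10 bytes are a zero-padded number. For an object in use, it is the byte offset from the start of the PDF document to the beginning of that object. For a free object, it is the object number of the next free object.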

What follows is a space separator and a second, 5-digit number, called the object’s generation number. After that, there is another space separator, followed by the letter “f” or “n” to indicate whether the object is free or in use.
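The fixed 20-byte layout makes an entry easy to slice apart. The following Python sketch is one way to do it; XrefEntry and parse_xref_entry are names invented for this example, not part of any library.

```python
from typing import NamedTuple

class XrefEntry(NamedTuple):
    field: int        # byte offset if in use, next free object number if free
    generation: int   # the object's generation number
    in_use: bool      # True for "n", False for "f"

def parse_xref_entry(entry: bytes) -> XrefEntry:
    """Split one 20-byte cross-reference entry into its three fields."""
    if len(entry) != 20:
        raise ValueError("an xref entry must be exactly 20 bytes long")
    field = int(entry[0:10].decode("ascii"))        # bytes 0-9: 10 digits
    generation = int(entry[11:16].decode("ascii"))  # bytes 11-15: 5 digits
    flag = entry[17:18]                             # byte 17: "n" or "f"
    if flag not in (b"n", b"f"):
        raise ValueError("flag must be 'n' or 'f'")
    return XrefEntry(field, generation, flag == b"n")

print(parse_xref_entry(b"0000025324 00000 n\r\n"))
```

Slicing by position, rather than splitting on whitespace, mirrors the point that every entry has the same length, which is what allows a reader to jump straight to any entry.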

The first entry in the table, for object 0, is always free and has generation number 65535. It represents the head of the linked list of free objects (the letter “f” means free).

The last free entry in that list (its tail) links back to object 0.

Subsection 2 starts at object ID 3 and contains one element: object 3, which starts at offset 25324 bytes from the beginning of the document. Subsection 3 has four objects; the first, with ID 21, starts at offset 25518 from the beginning of the file. The remaining objects have IDs 22, 23, and 24 respectively.
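To make the numbering concrete, here is a small Python sketch that expands the example table into a mapping from object number to entry. It works on text lines rather than raw 20-byte records, which is a simplification, and parse_xref_table is a hypothetical helper written for this article.

```python
EXAMPLE = """xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 4
0000025518 00002 n
0000025632 00000 n
0000000024 00001 f
0000000000 00001 f
36 1
0000026900 00000 n
"""

def parse_xref_table(text: str) -> dict:
    """Map object number -> (offset or next free object, generation, flag)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("table must start with the 'xref' keyword")
    table = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number, then number of entries
        first, count = (int(x) for x in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            field, generation, flag = lines[i].split()
            table[obj_num] = (int(field), int(generation), flag)
            i += 1
    return table

table = parse_xref_table(EXAMPLE)
print(sorted(table))  # [0, 3, 21, 22, 23, 24, 36]
print(table[22])      # (25632, 0, 'n')
```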

Every object in a file is assigned a flag that indicates whether the object is currently being used (“n” for “valid and used”) or not (“f” for “free”). Free objects contain references to the next free object, as well as the generation number that should be applied if the object becomes valid again. This helps to ensure that every part of the file is accounted for.

Object zero points to the next free object in the table, object 23. Object 23 is also free and points to the next free object, object 24. Object 24 is the last free object, so it points back to object zero.
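The free list can be followed with a short loop. The Python sketch below hard-codes the entries of the example table as (first field, generation number, flag) tuples; free_object_chain is an illustrative name, and the error handling is only a guess at what a robust reader might want.

```python
def free_object_chain(table: dict) -> list:
    """Follow the linked list of free objects, starting from object 0."""
    chain = []
    current = table[0][0]  # object 0 is the head; its first field is the link
    while current != 0:    # the tail links back to object 0
        if current in chain or current not in table:
            raise ValueError("broken free list")
        field, generation, flag = table[current]
        if flag != "f":
            raise ValueError(f"object {current} is linked but not free")
        chain.append(current)
        current = field
    return chain

# Entries from the example table: object number -> (field, generation, flag)
example = {
    0: (23, 65535, "f"),
    3: (25324, 0, "n"),
    21: (25518, 2, "n"),
    22: (25632, 0, "n"),
    23: (24, 1, "f"),
    24: (0, 1, "f"),
    36: (26900, 0, "n"),
}
print(free_object_chain(example))  # [23, 24]
```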

The cross-reference table would look like this if every object were listed in its own subsection:

xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 1
0000025518 00002 n
22 1
0000025632 00000 n
23 1
0000000024 00001 f
24 1
0000000000 00001 f
36 1
0000026900 00000 n

An object’s generation number is incremented when the object is deleted, that is, when its flag changes from ‘n’ to ‘f’. The generation number stored in a free entry is the one that will be used if that object number is reused. So, if object 23 becomes valid again, its generation number will still be 1. However, if it is removed again, the generation number would increase to 2.

If a PDF document has been incrementally updated, it will usually contain multiple subsections. Otherwise, it should only contain one subsection starting with the number zero.

Trailer

All PDF readers should start reading a PDF from the end of the file. This is because the PDF trailer contains the location of the cross-reference table and of other special objects that the reading application needs.

Here is an example of a trailer:

trailer
<< /Size 22 /Root 2 0 R /Info 1 0 R >>
startxref
24212
%%EOF

The last line of the PDF document contains the end-of-file marker “%%EOF”. Before it comes the “startxref” keyword, followed by the byte offset from the beginning of the file to the cross-reference table. Our cross-reference table starts at offset 24212 bytes. This is preceded by the “trailer” keyword, which marks the start of the trailer section. The contents of the trailer section are enclosed in << and >> characters (i.e., key-value pairs in dictionary format).
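Reading from the end can be sketched in a few lines of Python. The function name find_startxref and the choice to inspect only the last 1024 bytes are assumptions made for this example, not requirements taken from the text above.

```python
import re

def find_startxref(data: bytes) -> int:
    """Return the offset given after the last 'startxref' keyword."""
    tail = data[-1024:]  # assumption: the trailer fits in the last 1024 bytes
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found near the end")
    return int(matches[-1])  # the last match belongs to the newest trailer

# The example trailer from above, as it would appear at the end of a file
sample_tail = (
    b"trailer\n"
    b"<< /Size 22 /Root 2 0 R /Info 1 0 R >>\n"
    b"startxref\n"
    b"24212\n"
    b"%%EOF\n"
)
print(find_startxref(sample_tail))  # 24212
```

A reader would then seek to the returned offset and expect to find the “xref” keyword there.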

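The trailer dictionary defines several keys: “/Size” is the total number of entries in the cross-reference table, “/Root” is a reference to the document catalog, and “/Info” is a reference to the document information dictionary. Other keys may be present as well.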
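For a trailer as simple as the example, the keys can be pulled out with a regular expression. This Python sketch only understands integer values and indirect references of the form “N G R”; it is not a general PDF dictionary parser, and parse_simple_trailer is a hypothetical helper.

```python
import re

def parse_simple_trailer(dict_text: str) -> dict:
    """Extract /Key values that are integers or 'N G R' references."""
    result = {}
    pattern = r"/(\w+)\s+(\d+(?:\s+\d+\s+R)?)"
    for key, value in re.findall(pattern, dict_text):
        parts = value.split()
        if len(parts) == 3:
            # Indirect reference: (object number, generation number)
            result[key] = (int(parts[0]), int(parts[1]))
        else:
            result[key] = int(parts[0])
    return result

trailer = parse_simple_trailer("<< /Size 22 /Root 2 0 R /Info 1 0 R >>")
print(trailer)  # {'Size': 22, 'Root': (2, 0), 'Info': (1, 0)}
```

Here “/Root 2 0 R” comes back as (2, 0): object number 2 with generation number 0, which is the object a reader would look up in the cross-reference table next.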

Incremental updates

PDFs are designed for incremental updates, meaning we can append new objects to the end of the file without rewriting the whole document. This makes saving changes to a PDF quick and easy. The new structure of the PDF document is illustrated below:

The PDF document still contains the original header, body, cross-reference table and trailer. However, there are also additional body, cross-reference and trailer sections present which contain information on objects that have been changed, replaced or deleted. Deleted objects remain in the file but are marked with an “f” flag. Each trailer is terminated by the “%%EOF” tag, and each trailer added by an update includes a /Prev entry pointing to the previous cross-reference section.
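The /Prev entries form a chain from the newest cross-reference section back to the original one. The Python sketch below walks that chain under simplifying assumptions: classic “xref” tables only, a trailer dictionary without nested dictionaries, and a trailer that fits in the last 1024 bytes. The name xref_section_offsets is invented for this example.

```python
import re

def xref_section_offsets(data: bytes) -> list:
    """Return cross-reference section offsets, newest first."""
    found = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    if not found:
        raise ValueError("no startxref found near the end of the data")
    offset = int(found[-1])
    offsets = []
    while offset not in offsets:  # guard against a looping /Prev chain
        offsets.append(offset)
        # The trailer dictionary follows the xref section at this offset.
        # Assumption: it contains no nested << >> dictionaries.
        trailer = re.search(rb"trailer\s*<<(.*?)>>", data[offset:], re.DOTALL)
        if trailer is None:
            break
        prev = re.search(rb"/Prev\s+(\d+)", trailer.group(1))
        if prev is None:
            break  # reached the original cross-reference section
        offset = int(prev.group(1))
    return offsets
```

For a file that has been updated once, the result has two offsets: the section added by the update first, then the original section.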

NOTE: In PDF 1.4 and higher, the document’s catalog dictionary can contain a /Version entry. If it names a later version than the one in the PDF header, it overrides the header. This lets an incremental update raise the document’s version, and so use newer features, without rewriting the header at the start of the file.