Complete Guide to PDF Compression: Formats, Quality, and File Size Explained

A digital archivist preparing 10,000 scanned legal documents for cloud storage runs the standard compression pass and recovers 25% file-size savings. They know from experience that scanned bitonal PDFs should compress closer to 90% with the right algorithm. The tool they're using applies DCT/JPEG to all image content, but the documents are mostly black-and-white text scans, where JPEG's lossy quantization is the wrong fit: JBIG2's bitonal pattern-matching would compress these files dramatically better. After helping hundreds of users work out why their PDFs compress poorly, we've found the answer is almost always this algorithm-document mismatch. PDFs aren't compressed as a whole; each image, each text stream, and each metadata block can use a different compression filter, and getting good ratios depends on matching the filter to the content.

The scoutmytool PDF compressor and aggressive compressor handle the common cases automatically. For batch operations across many files, the PDF batch compress tool processes them in parallel. The technical depth below helps when those defaults don't get you the size you need.

How a PDF Stores Its Bytes

The PDF specification — formally ISO 32000-2 published by the International Organization for Standardization — defines a PDF as a tree of indirect objects with a cross-reference table. Each object is a stream of data wrapped in a dictionary that describes how to decode it. The decoders ("filters" in PDF spec terminology) include:

  • FlateDecode: zlib/deflate compression — same algorithm used by gzip and PNG. Lossless, general-purpose. Used for text, vector graphics, and metadata.
  • DCTDecode: JPEG compression. Lossy. Used for color and grayscale photos.
  • JBIG2Decode: JBIG2 compression for bitonal (1-bit) images. Lossless or lossy modes. Used for black-and-white scans of text documents.
  • JPXDecode: JPEG 2000 (wavelet-based). Lossless or lossy. Used in some archival workflows.
  • CCITTFaxDecode (Group 3 / Group 4): older lossless bitonal compression. Predates JBIG2; superseded for new content but common in older PDFs.
  • LZWDecode: legacy LZW compression. Replaced by FlateDecode in modern PDFs because LZW had patent issues.
  • RunLengthDecode: trivial run-length encoding. Rarely effective for real-world content.
  • ASCII85Decode / ASCIIHexDecode: not compressors; expand binary data to ASCII for systems that need text-only PDFs. Always increase file size.

Each stream object's /Filter entry names which decoder applies. A single PDF might contain text streams encoded with FlateDecode, photo objects encoded with DCTDecode, and a logo encoded with JBIG2Decode all in one file. Compression strategy means choosing the best filter for each stream, not picking a single filter for the whole file.
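
You can inspect this yourself. The sketch below tallies the /Filter entry of every stream object in a file; it's a minimal example using the open-source pikepdf library (pip install pikepdf), and "input.pdf" is a placeholder path.

```python
# Tally the /Filter entry of every stream object in a PDF.
from collections import Counter

import pikepdf

filters = Counter()
with pikepdf.open("input.pdf") as pdf:
    for obj in pdf.objects:
        if isinstance(obj, pikepdf.Stream):  # only stream objects carry /Filter
            f = obj.get("/Filter")
            if f is None:
                filters["(uncompressed)"] += 1
            elif isinstance(f, pikepdf.Array):
                # filters can chain, e.g. [/ASCII85Decode /FlateDecode]
                filters[" + ".join(str(x) for x in f)] += 1
            else:
                filters[str(f)] += 1

for name, count in filters.most_common():
    print(f"{count:5d}  {name}")
```

Running this against a mixed-content PDF typically shows FlateDecode dominating by count, with a handful of DCTDecode or JBIG2Decode streams carrying most of the bytes.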

DCT/JPEG — The Right Tool for Photos

DCTDecode (JPEG) is the standard filter for color and grayscale continuous-tone images — photos, screenshots of color content, scanned color documents. JPEG works by:

  1. Converting the image to YCbCr color space (luminance + two chrominance channels)
  2. Subsampling chrominance (humans have less color resolution than luminance)
  3. Splitting into 8×8 pixel blocks
  4. Applying Discrete Cosine Transform (DCT) to each block
  5. Quantizing the DCT coefficients (this is the lossy step — quantization tables determine quality)
  6. Entropy-coding the quantized coefficients (typically Huffman coding)

The standard is ITU-T Recommendation T.81, published by the Joint Photographic Experts Group. The quality factor (Q) controls quantization aggressiveness: Q=95 is "near-original," Q=75 is "good," Q=60 is "acceptable for body text + photos," Q=40 is "noticeable artifacting." For email-bound proposals, Q=70-75 is typical.

Effective JPEG compression on a 6×4 inch 300dpi photo (1800×1200 pixels, about 2.2 megapixels): can compress from ~6.5MB raw (24-bit RGB) to ~400KB at Q=75, a 16:1 ratio. The Wikipedia article on JPEG covers the algorithm in depth.
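
To reproduce the Q-versus-size curve on your own images, here's a minimal sketch using the Pillow imaging library (pip install Pillow); "photo.png" is a placeholder for any continuous-tone source image.

```python
# Measure how JPEG quality factor (Q) trades file size for fidelity.
import io

from PIL import Image

img = Image.open("photo.png").convert("RGB")

for q in (95, 75, 60, 40):
    buf = io.BytesIO()
    # quality= selects the quantization tables: lower Q means coarser
    # quantization, smaller output, and more visible artifacts.
    img.save(buf, format="JPEG", quality=q)
    print(f"Q={q}: {buf.tell() / 1024:.0f} KB")
```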

When DCT is the right choice: photos, scanned color documents, screenshots that include photographic content.

When DCT is the wrong choice: black-and-white text scans (JBIG2 is dramatically better), line art (Flate is better), text-only content (Flate is better and lossless).

JBIG2 — The Right Tool for Bitonal Text Scans

JBIG2 (the second standard from the Joint Bi-level Image Experts Group) is purpose-built for compressing 1-bit (black-and-white) images, especially scanned text documents. It works in three modes:

  1. Symbol-based — finds repeating glyph patterns (the same "e" appears many times in a page) and encodes each unique symbol once, then references it. This is what makes JBIG2 dramatically better than DCT for text.
  2. Pattern-based — for non-text bitonal content with repeating patterns.
  3. Generic — pixel-level fallback when the above don't apply.

JBIG2 has both lossy and lossless modes. Lossless preserves every pixel exactly. Lossy substitutes "similar enough" symbols for slightly different ones; this is controversial because it can silently change individual characters in scanned documents (a widely publicized Xerox scanner bug, discovered in 2013, substituted digits in scanned pages). For archival, always use JBIG2 lossless. For non-archival scan compression where a small character-substitution risk is acceptable, lossy JBIG2 compresses more aggressively.

Effective compression: a 300dpi black-and-white scan of a page of typed text (about 1MB as raw 1-bit data) can compress to ~40KB with JBIG2 lossless, roughly a 25:1 ratio. JBIG2 dominates DCT and CCITT Group 4 for this content type. The Wikipedia article on JBIG2 covers the algorithm.
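
Pure-Python JBIG2 encoders are rare, so the sketch below shows only the bitonal preparation step with Pillow and writes lossless CCITT Group 4 as a comparable baseline; in practice an external encoder such as jbig2enc produces the JBIG2 stream itself. "scan.png" is a placeholder, and the fixed threshold is illustrative (production pipelines use adaptive thresholding).

```python
# Prepare a scan for bitonal compression: threshold to 1-bit, then write
# lossless CCITT Group 4 as a baseline.
from PIL import Image

gray = Image.open("scan.png").convert("L")               # grayscale first
bitonal = gray.point(lambda p: 255 if p > 160 else 0).convert("1")
bitonal.save("scan_g4.tif", compression="group4")        # lossless CCITT G4
```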

When JBIG2 is the right choice: scanned text documents (legal filings, books, archival material), bitonal logos.

When JBIG2 is the wrong choice: color or grayscale content (it's bitonal-only), text-rendered-from-source PDFs (the underlying text is already lossless via Flate; converting to JBIG2 would mean rasterizing first, which is wasteful).

FlateDecode — The Right Tool for Text and Vectors

Flate (zlib/deflate) is the workhorse of PDF compression. It's lossless, general-purpose, and used for:

  • Text streams (the actual text content of typed PDFs)
  • Vector graphics (line art, shapes, mathematical illustrations)
  • Metadata (XMP, document info dictionary)
  • Embedded fonts (TrueType, Type 1)
  • General object streams (PDF 1.5+ object stream feature)

For typical text-heavy PDFs, Flate compression on the text streams achieves 60-80% reduction. Vector graphics often achieve 90%+ reduction because they're highly redundant.

The Flate algorithm uses LZ77 (a sliding-window dictionary algorithm) followed by Huffman coding. The same algorithm powers gzip, PNG, and ZIP. The Wikipedia article on Deflate covers the technical details.
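
Because FlateDecode is plain zlib deflate, Python's standard library can demonstrate the ratios directly. The repeated sample text below stands in for a PDF content stream; real text is less repetitive, which is why typical reductions land in the 60-80% range rather than the near-total reduction this toy input achieves.

```python
# Flate in action: the same zlib deflate that PDF's FlateDecode names.
import zlib

stream = ("BT /F1 11 Tf 72 720 Td (Quarterly results by region) Tj ET\n" * 500).encode()
packed = zlib.compress(stream, level=9)
print(f"raw: {len(stream):,} bytes  flate: {len(packed):,} bytes  "
      f"({100 * (1 - len(packed) / len(stream)):.0f}% reduction)")

# Lossless round trip: decompression restores the exact original bytes.
assert zlib.decompress(packed) == stream
```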

When Flate is the right choice: text, vectors, metadata, anything not photographic.

Compression Strategy by Document Type

Document type | Primary content | Optimal filter mix
Born-digital text (Word→PDF) | Text + minor images | Flate (text), DCT for any photos
Born-digital report | Text + charts + photos | Flate (text/charts), DCT (photos)
Color scan | Bitmap pages | DCT with Q=70-80
Black-and-white text scan | Bitonal page images | JBIG2 lossless
Mixed scan | Color + B&W pages | Per-page filter choice (DCT for color, JBIG2 for B&W)
Engineering drawing | Vector lines + text | Flate (lossless)
Map / graphic | Vector + raster | Flate (vectors), DCT or JBIG2 (rasters)

The scoutmytool PDF compressor auto-detects content type and applies appropriate filters. For specific scenarios where you need explicit control, aggressive compress targets maximum size reduction.

Lossless vs Lossy Tradeoffs

Lossless compression preserves every bit of the original. Lossless filters: Flate, JBIG2-lossless, JPX-lossless, CCITT. After decompression, the data is byte-identical to what went in. Use for: archival material, legal documents where exact preservation matters, text-based content.

Lossy compression discards data the decoder considers "perceptually unimportant." Lossy filters: DCT (always lossy), JBIG2-lossy, JPX-lossy. After decompression, the file is similar but not identical to before. Use for: photo content where 5-10% perceptual difference is acceptable, screen-only PDFs where print fidelity isn't required.

Most PDF compression in practice mixes both: lossless on text streams, lossy on photo content. The right balance depends on the use case.

File-Size Recovery: Optimization Without Re-Encoding

Beyond filter choice, several techniques recover file size without changing perceptual quality:

Deduplication. PDFs sometimes embed the same image or font multiple times. The PDF batch compress tool detects duplicates and references a single copy.
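
As an illustration of the detection half, this pikepdf sketch hashes each image stream's raw (still-encoded) bytes to find duplicates; rewriting references so pages share one copy is the other half of the job. "input.pdf" is a placeholder.

```python
# Find duplicate image streams by hashing their raw encoded bytes.
import hashlib

import pikepdf

seen = {}
with pikepdf.open("input.pdf") as pdf:
    for i, obj in enumerate(pdf.objects):
        if isinstance(obj, pikepdf.Stream) and obj.get("/Subtype") == pikepdf.Name("/Image"):
            digest = hashlib.sha256(obj.read_raw_bytes()).hexdigest()
            if digest in seen:
                print(f"object #{i} duplicates object #{seen[digest]}")
            else:
                seen[digest] = i
```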

Font subsetting. Embedding a full font (every glyph) is wasteful when the document only uses 30 characters. Subsetting reduces the embedded font to only the glyphs actually used.
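
PDF generators normally subset internally, but the fontTools library (pip install fonttools) shows what the operation looks like in isolation. A sketch; the font path and sample text are placeholders.

```python
# Shrink a font to just the glyphs a document uses.
from fontTools import subset

options = subset.Options()
font = subset.load_font("FullFont.ttf", options)

subsetter = subset.Subsetter(options)
subsetter.populate(text="The fifty-odd characters this document actually uses.")
subsetter.subset(font)  # drop every glyph not reachable from that text

subset.save_font(font, "FullFont-subset.ttf", options)
```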

Object stream compression (PDF 1.5+). Modern PDFs combine multiple small objects into compressed object streams, reducing per-object overhead. Older PDFs (pre-2003) without this feature can recover 5-15% just by switching to PDF 1.5+ object streams.

Cross-reference stream compression. PDF 1.5+ supports compressed xref streams.
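
Both structural upgrades can be applied in one rewrite pass. A minimal pikepdf sketch, with placeholder paths:

```python
# Rewrite a file with PDF 1.5+ object streams; generating object streams
# also produces a compressed cross-reference stream in the output.
import pikepdf

with pikepdf.open("input.pdf") as pdf:
    pdf.save(
        "optimized.pdf",
        compress_streams=True,  # Flate-compress streams stored raw
        object_stream_mode=pikepdf.ObjectStreamMode.generate,  # pack small objects
        recompress_flate=True,  # re-deflate existing Flate streams
    )
```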

Removing unused objects. PDFs can carry orphaned objects (e.g., from edits that removed pages without cleaning up the underlying objects). Removing orphans recovers space.

Image downsampling. Reducing 300dpi to 150dpi for screen-only viewing halves each dimension, leaving a quarter of the pixels; that typically recovers about 75% of image data with no perceptual loss on screens.
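
A minimal downsampling sketch with Pillow, assuming a placeholder 300dpi source image:

```python
# Halve a 300dpi image to 150dpi-equivalent for screen-only use.
from PIL import Image

img = Image.open("photo_300dpi.jpg")
half = img.resize((img.width // 2, img.height // 2), Image.LANCZOS)
half.save("photo_150dpi.jpg", format="JPEG", quality=75)
```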

The PDF metadata strip tool handles metadata removal as a pre-compression step.

Worked Examples

Example 1 — 60-page scanned legal filing. Original: 300dpi color scan even though document is black-and-white text (scanner default), 95MB. Compression strategy: convert color pages to grayscale (no information loss for B&W content), then to bitonal at 300dpi, apply JBIG2 lossless. Result: 95MB → 4.2MB, 96% reduction. JBIG2 alone produced the bulk of the savings.

Example 2 — Marketing brochure with photos. Original: born-digital from InDesign with embedded 300dpi photos at JPEG Q=95, 25MB. Compression strategy: re-quantize JPEG photos to Q=75 (no perceptual loss on screen), maintain 300dpi for print quality. Result: 25MB → 8MB, 68% reduction. The scoutmytool compressor at "standard" setting hits this case well.

Example 3 — Annual report with mixed content. Original: text + charts + photos + scanned signatures, 18MB. Compression strategy: Flate for text/charts (lossless), DCT Q=80 for photos (slight quality reduction acceptable), JBIG2 for bitonal signatures. Result: 18MB → 4.5MB, 75% reduction. Mixed-content PDFs benefit most from per-stream filter optimization.

Example 4 — Engineering CAD-export PDF. Original: vector-only output from AutoCAD, 65MB (large despite being vector-only because of dense line work). Compression strategy: Flate compression on vectors (already applied by AutoCAD), object stream consolidation, font subsetting. Result: 65MB → 38MB, 42% reduction. Vector-heavy content has limited compressibility because there's no redundancy to remove.

Common Pitfalls

Applying JPEG to text scans. Compressing a black-and-white text scan as JPEG produces visible artifacts and far worse ratios than JBIG2. Always check whether content is bitonal before choosing a filter.

Aggressive lossy on documents requiring fidelity. Legal documents, medical records, and forensic scans should NOT use lossy compression. The perceptual difference might be invisible, but any re-encoding changes the underlying bytes, which invalidates digital signatures and breaks forensic chain-of-custody.

Re-compressing already-compressed PDFs repeatedly. Each lossy re-compression introduces additional artifacts. After 3-4 passes, a JPEG photo embedded in a PDF develops obvious blocking. Compress once with appropriate settings; don't iteratively re-compress.

Ignoring orphaned objects. PDFs that have been edited many times accumulate orphan objects that no longer reference anything. A "linearize and clean" pass can recover substantial size.

Mistaking format change for compression. Converting a PDF to a different PDF version doesn't compress it; it just rewrites the structure. Real compression requires actual filter changes.

Trusting "compression ratio" claims without quality check. A 90% reduction sounds great until you discover the OCR text layer was destroyed or signatures were invalidated. Verify perceptual quality and forensic integrity after compression.

Forgetting font subsetting. Embedding every glyph of a 200KB font when the document uses 50 glyphs is a 195KB waste. Modern PDF generators subset by default; older ones don't.

Frequently Asked Questions

Q: Why do some PDFs compress 80% and others only 5%? A: Because PDFs that haven't been compressed (or have been compressed with the wrong filters) have more compressible data. Already-well-compressed PDFs (modern Word→PDF exports, optimized PDFs from Adobe) have minimal additional compression possible.

Q: What's the best PDF compression algorithm? A: There's no single best algorithm — it depends on content. JBIG2 is best for bitonal text, DCT for color photos, Flate for text and vectors. The scoutmytool compressor auto-selects per-stream.

Q: Will compression break OCR text in my scanned PDF? A: Heavy lossy compression that re-encodes the image layer can corrupt OCR text. Light compression on the image layer with the OCR text layer preserved separately is safe. After compression, verify by selecting text in the result.

Q: Does PDF/A archival format use specific compression? A: PDF/A-1 (the early archival standard) restricts filters to JPEG, CCITT, Flate, and RunLengthDecode; LZW is explicitly forbidden. PDF/A-2 added JBIG2 and JPEG 2000 (JPXDecode). PDF/A-3 added support for embedded files. Choose the right PDF/A level based on your archival requirements.

Q: Can I compress encrypted PDFs? A: Encrypted PDF content is essentially uncompressible (encrypted bytes look random and don't compress well). To compress encrypted PDFs effectively, decrypt first via unlock-PDF, compress, then re-encrypt via protect-PDF.
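
A sketch of that round trip with pikepdf; the passwords and paths are placeholders, and the compression step in the middle is whichever tool you prefer.

```python
# Decrypt so compression can see real bytes, then re-encrypt afterwards.
import pikepdf

with pikepdf.open("locked.pdf", password="current-password") as pdf:
    # pikepdf writes the output without encryption unless encryption= is given
    pdf.save("plain.pdf")

# ... compress plain.pdf with your tool of choice, producing plain-compressed.pdf ...

with pikepdf.open("plain-compressed.pdf") as pdf:
    pdf.save("relocked.pdf",
             encryption=pikepdf.Encryption(owner="owner-pw", user="user-pw", R=6))
```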

Q: Does compression degrade image quality every time? A: Lossless compression doesn't. Lossy compression does — re-compressing a JPEG with the same settings still re-quantizes, introducing additional artifacts. For sensitive content, compress once or use lossless.

Q: What's PDF/X used for and how does it affect compression? A: PDF/X is a print-industry exchange standard, distinct from PDF/A's archival focus, with requirements that are stricter in different ways. PDF/X mandates specific color spaces and embedded fonts; compression choices should preserve print fidelity (Flate for text and line work, DCT at high Q for color photos).

Wrapping Up

PDF compression is filter-by-filter, not file-by-file. JBIG2 for bitonal text scans, DCT for photos, Flate for text and vectors — apply the right algorithm to the right content and 80%+ ratios are routine. The scoutmytool PDF compressor handles auto-selection for typical cases; the aggressive compressor and batch compressor cover specific use patterns. For broader PDF workflow, see the scoutmytool PDF tools index.
