
5 Methods to Reduce PDF File Size

Step-by-step techniques for compressing PDF files to overcome email attachment limits (25MB), web upload restrictions, and slow mobile downloads. From browser-based processing to CLI tools, all methods are immediately applicable.

PDF compression, reduce PDF file size, pdf-lib, Ghostscript PDF, PDF optimization, font subsetting, PDF image compression, pypdf Python, compress PDF online, remove PDF metadata

Problem

Your report PDF is 45MB, exceeding Gmail's 25MB attachment limit and making it impossible to send. Web upload forms reject it due to 10MB file size restrictions, and on mobile devices it takes tens of seconds to download. The PDF contains high-resolution images, fully embedded fonts, and editing history metadata that unnecessarily bloat the file. You need effective methods to reduce file size while minimizing quality degradation.

Required Tools

pdf-lib (JavaScript)

Library for creating and manipulating PDFs in the browser or Node.js. Compresses by regenerating PDF structure on the client without a server.

Ghostscript (CLI)

PostScript/PDF interpreter. Powerful CLI tool that provides fine-grained control over image resolution and quality via the -dPDFSETTINGS option.

liminfo PDF Compress

Online tool that compresses PDFs directly in the browser without file uploads. No risk of data leakage.

pypdf (Python)

Python-based PDF processing library. Automates metadata removal, page splitting/merging, and structural optimization via scripts.

Solution Steps

1. Regenerate PDF Structure in the Browser (pdf-lib)

Using pdf-lib, you can optimize PDFs directly in the browser without a server. When you copy pages from the original PDF to a new document, unused objects (deleted images, previous revisions, duplicate resources) are automatically stripped out. This method alone can achieve 30-70% file size reduction on PDFs that have been edited multiple times. Documents that have undergone many revisions accumulate unnecessary internal objects, making structural regeneration especially effective.

import { PDFDocument } from 'pdf-lib';

async function compressPdf(inputBytes: Uint8Array): Promise<Uint8Array> {
  // Load the original PDF
  const srcDoc = await PDFDocument.load(inputBytes);

  // Create a new empty PDF document
  const newDoc = await PDFDocument.create();

  // Copy all pages to the new document (unused objects are stripped)
  const pages = await newDoc.copyPages(srcDoc, srcDoc.getPageIndices());
  pages.forEach((page) => newDoc.addPage(page));

  // Save - control memory usage with objectsPerTick
  const compressedBytes = await newDoc.save({
    useObjectStreams: true,   // Additional compression via object streams
    addDefaultPage: false,
    objectsPerTick: 50,       // Memory optimization for large files
  });

  console.log(`Original: ${(inputBytes.length / 1024 / 1024).toFixed(1)}MB`);
  console.log(`Compressed: ${(compressedBytes.length / 1024 / 1024).toFixed(1)}MB`);
  console.log(`Reduction: ${((1 - compressedBytes.length / inputBytes.length) * 100).toFixed(1)}%`);

  return compressedBytes;
}

// Browser usage example
const fileInput = document.querySelector<HTMLInputElement>('input[type="file"]');
fileInput?.addEventListener('change', async () => {
  const file = fileInput.files?.[0];
  if (!file) return;
  const buffer = await file.arrayBuffer();
  const compressed = await compressPdf(new Uint8Array(buffer));

  // Create a download link, then release the object URL
  const blob = new Blob([compressed], { type: 'application/pdf' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = file.name.replace(/\.pdf$/i, '_compressed.pdf');
  a.click();
  URL.revokeObjectURL(url);
});

2. Image Downsampling (Ghostscript)

Embedded images account for the bulk of PDF file size. Ghostscript's -dPDFSETTINGS option lets you control image resolution and compression quality.

Preset characteristics:

- /screen: 72 DPI, screen viewing only (maximum compression, unsuitable for printing)
- /ebook: 150 DPI, e-books and tablets (balanced compression and quality)
- /printer: 300 DPI, general printing (high quality preserved)
- /prepress: 300 DPI, press-ready output (color profiles preserved)

For typical email attachments and web uploads, the /ebook setting offers the best balance.

# Basic Ghostscript PDF compression (/ebook = 150 DPI, general purpose)
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.5 \
   -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=output_compressed.pdf \
   input.pdf

# Maximum compression (web/screen viewing only, 72 DPI)
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.5 \
   -dPDFSETTINGS=/screen \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=output_screen.pdf \
   input.pdf

# Advanced: manually specify image resolution (120 DPI)
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.5 \
   -dDownsampleColorImages=true \
   -dColorImageResolution=120 \
   -dDownsampleGrayImages=true \
   -dGrayImageResolution=120 \
   -dDownsampleMonoImages=true \
   -dMonoImageResolution=120 \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=output_120dpi.pdf \
   input.pdf

# Batch compress all PDFs in a folder (Bash)
for f in *.pdf; do
  gs -sDEVICE=pdfwrite \
     -dPDFSETTINGS=/ebook \
     -dNOPAUSE -dBATCH -dQUIET \
     -sOutputFile="compressed_${f}" "${f}"
  echo "${f}: $(du -h "${f}" | cut -f1) -> $(du -h "compressed_${f}" | cut -f1)"
done

3. Font Subsetting to Reduce Size

When a PDF embeds an entire font file (several MB), the file size increases dramatically. Font subsetting extracts and embeds only the glyphs (characters) actually used in the document, which is especially effective for CJK fonts. A Korean font may contain over 11,172 glyphs, but a typical document uses only 500-2,000 characters. Subsetting can reduce font size by over 90%. Python's fonttools library can create subsets, and pypdf can analyze fonts in existing PDFs.

# pip install fonttools pypdf

# 1. Create a font subset (fonttools)
from fontTools.subset import Subsetter
from fontTools.ttLib import TTFont

def create_font_subset(font_path: str, text: str, output_path: str):
    """Create a subset font containing only used characters"""
    font = TTFont(font_path)
    subsetter = Subsetter()

    # Specify only the Unicode codepoints that are used
    codepoints = {ord(c) for c in text}
    subsetter.populate(unicodes=codepoints)
    subsetter.subset(font)

    font.save(output_path)

    import os
    original = os.path.getsize(font_path) / 1024
    subset = os.path.getsize(output_path) / 1024
    print(f"Original font: {original:.0f}KB")
    print(f"Subset:        {subset:.0f}KB")
    print(f"Reduction:     {(1 - subset/original)*100:.1f}%")

# Usage example - pass text used in the document
document_text = "All the text actually used in this document"
create_font_subset(
    "NotoSans-Regular.ttf",
    document_text,
    "NotoSans-Subset.ttf"
)

# 2. Analyze fonts embedded in a PDF (pypdf)
from pypdf import PdfReader

def analyze_fonts(pdf_path: str):
    """Analyze embedded font list and sizes in a PDF"""
    reader = PdfReader(pdf_path)
    fonts = {}

    for page in reader.pages:
        if "/Resources" in page and "/Font" in page["/Resources"]:
            for font_name, font_obj in page["/Resources"]["/Font"].items():
                font_ref = font_obj.get_object()
                base_font = font_ref.get("/BaseFont", "Unknown")
                is_subset = "+" in str(base_font)  # Check if subset
                fonts[str(base_font)] = {
                    "subset": is_subset,
                    "encoding": str(font_ref.get("/Encoding", "N/A")),
                }

    print(f"Embedded fonts ({len(fonts)}):")
    for name, info in fonts.items():
        status = "subset" if info["subset"] else "full font"
        print(f"  {name} [{status}]")

analyze_fonts("report.pdf")

4. Remove Unnecessary Elements (Metadata, Annotations, JavaScript)

Beyond the main content, PDFs can contain various auxiliary data: metadata (author, creation tool, edit history), annotations, bookmarks, attachments, and embedded JavaScript. These elements unnecessarily increase file size and can also pose security risks (information leakage via metadata, malicious JavaScript). Using pypdf, you can selectively remove these elements. Ghostscript also automatically cleans up many of them by default.

from pypdf import PdfReader, PdfWriter

def strip_pdf_extras(input_path: str, output_path: str):
    """Remove unnecessary metadata, annotations, JavaScript, etc. from PDF"""
    reader = PdfReader(input_path)
    writer = PdfWriter()

    # 1. Copy pages (remove annotations)
    for page in reader.pages:
        # Remove annotations
        if "/Annots" in page:
            del page["/Annots"]
        writer.add_page(page)

    # 2. Clear metadata
    writer.add_metadata({
        "/Producer": "",
        "/Creator": "",
        "/Author": "",
        "/Subject": "",
        "/Keywords": "",
        "/Title": "",
    })

    # 3. Remove JavaScript
    if "/Names" in writer._root_object:
        names = writer._root_object["/Names"]
        if "/JavaScript" in names:
            del names["/JavaScript"]

    # 4. Remove embedded files (EmbeddedFiles)
    if "/Names" in writer._root_object:
        names = writer._root_object["/Names"]
        if "/EmbeddedFiles" in names:
            del names["/EmbeddedFiles"]

    # 5. Save with compression
    writer.compress_identical_objects(remove_identicals=True)

    with open(output_path, "wb") as f:
        writer.write(f)

    import os
    orig = os.path.getsize(input_path) / 1024 / 1024
    new = os.path.getsize(output_path) / 1024 / 1024
    print(f"Original: {orig:.1f}MB -> Cleaned: {new:.1f}MB ({(1-new/orig)*100:.1f}% reduction)")

strip_pdf_extras("report_full.pdf", "report_clean.pdf")

# One-step cleanup + compression with Ghostscript
# gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook \
#    -dFastWebView=true \
#    -dPrinted=false \
#    -dNOPAUSE -dBATCH \
#    -sOutputFile=clean_output.pdf input.pdf

5. Optimize Large PDFs with Split, Compress, and Merge

Very large PDFs (100MB+) can cause memory exhaustion when processed all at once. In such cases, splitting the PDF into parts, compressing each part individually, then merging them back is an effective strategy.

Advantages of this approach:

- Memory usage is limited to the size of each part
- Different compression levels can be applied per part (stronger compression for image-heavy sections)
- Parallel processing can reduce total processing time

from pypdf import PdfReader, PdfWriter
import subprocess
import os

def split_compress_merge(input_path: str, output_path: str, pages_per_chunk: int = 20):
    """Split large PDF -> compress each chunk -> merge back"""
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)
    chunk_files = []
    compressed_files = []

    print(f"Total {total_pages} pages, processing in chunks of {pages_per_chunk}")

    # Step 1: Split
    for start in range(0, total_pages, pages_per_chunk):
        end = min(start + pages_per_chunk, total_pages)
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])

        chunk_path = f"/tmp/chunk_{start:04d}.pdf"
        with open(chunk_path, "wb") as f:
            writer.write(f)
        chunk_files.append(chunk_path)
        print(f"  Split: pages {start+1}-{end} -> {chunk_path}")

    # Step 2: Compress each chunk with Ghostscript
    for chunk_path in chunk_files:
        compressed_path = chunk_path.replace(".pdf", "_compressed.pdf")
        subprocess.run([
            "gs", "-sDEVICE=pdfwrite",
            "-dCompatibilityLevel=1.5",
            "-dPDFSETTINGS=/ebook",
            "-dNOPAUSE", "-dBATCH", "-dQUIET",
            f"-sOutputFile={compressed_path}",
            chunk_path,
        ], check=True)
        compressed_files.append(compressed_path)

        orig_size = os.path.getsize(chunk_path) / 1024
        comp_size = os.path.getsize(compressed_path) / 1024
        print(f"  Compressed: {orig_size:.0f}KB -> {comp_size:.0f}KB")

    # Step 3: Merge (PdfMerger was removed in pypdf 5.x; use PdfWriter.append instead)
    merger = PdfWriter()
    for compressed_path in compressed_files:
        merger.append(compressed_path)
    with open(output_path, "wb") as f:
        merger.write(f)

    # Clean up temporary files
    for f in chunk_files + compressed_files:
        os.remove(f)

    orig = os.path.getsize(input_path) / 1024 / 1024
    final = os.path.getsize(output_path) / 1024 / 1024
    print(f"\nFinal result: {orig:.1f}MB -> {final:.1f}MB ({(1-final/orig)*100:.1f}% reduction)")

# Execute
split_compress_merge("large_report.pdf", "optimized_report.pdf", pages_per_chunk=20)

Common Mistakes

Compressed PDF fails to open or renders incorrectly

Always create a backup of the original before compressing. pdf-lib's useObjectStreams option relies on object streams, a feature introduced in PDF 1.5. If older viewers have trouble opening the result, disable that option, or set Ghostscript's -dCompatibilityLevel to 1.4.

Using Ghostscript /screen setting for print-intended PDFs

/screen downsamples to 72 DPI, making images extremely blurry when printed. Use /printer (300 DPI) or higher for print purposes, and only use /ebook (150 DPI) for web/email delivery.

Attempting to compress an encrypted PDF and getting errors

Password-protected PDFs must be decrypted first. Use pypdf's reader.decrypt("password") before compression, or run qpdf --decrypt beforehand. Unauthorized decryption of protected PDFs may violate copyright laws.

Stripping accessibility tags when removing metadata

Indiscriminate removal of elements from PDF/UA (accessibility) tagged documents breaks screen reader compatibility. For official documents requiring accessibility, preserve the /StructTreeRoot (structure tree) when cleaning metadata.

Processing a very large PDF in the browser causes out-of-memory errors

Browser memory limits are typically 1-4GB. For PDFs over 50MB, use chunk-based processing or CLI tools (Ghostscript). Processing in a Web Worker prevents main thread blocking.
