5 Methods to Reduce PDF File Size
Step-by-step techniques for compressing PDF files to overcome email attachment limits (25MB), web upload restrictions, and slow mobile downloads. From browser-based processing to CLI tools, all methods are immediately applicable.
Required Tools
pdf-lib: Library for creating and manipulating PDFs in the browser or Node.js. Compresses by regenerating the PDF structure on the client, with no server round-trip.
Ghostscript: PostScript/PDF interpreter. A powerful CLI tool that provides fine-grained control over image resolution and quality via the -dPDFSETTINGS option.
Online tool that compresses PDFs directly in the browser without file uploads. No risk of data leakage.
pypdf: Python-based PDF processing library. Automates metadata removal, page splitting/merging, and structural optimization via scripts.
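Before starting, it can be worth checking which of these tools are actually installed. A minimal stdlib-only sketch (the executable and module names below are the standard ones, but your installation may differ):

```python
import shutil
from importlib import util

def check_pdf_tools() -> dict:
    """Report which PDF compression tools are available locally."""
    return {
        "ghostscript": shutil.which("gs") is not None,     # CLI executable
        "pypdf": util.find_spec("pypdf") is not None,      # Python library
        "fonttools": util.find_spec("fontTools") is not None,  # pip package "fonttools"
    }

print(check_pdf_tools())
```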
Solution Steps
Regenerate PDF Structure in the Browser (pdf-lib)
Using pdf-lib, you can optimize PDFs directly in the browser without a server. When you copy pages from the original PDF to a new document, unused objects (deleted images, previous revisions, duplicate resources) are automatically stripped out. This method alone can achieve 30-70% file size reduction on PDFs that have been edited multiple times. Documents that have undergone many revisions accumulate unnecessary internal objects, making structural regeneration especially effective.
import { PDFDocument } from 'pdf-lib';

async function compressPdf(inputBytes: Uint8Array): Promise<Uint8Array> {
  // Load the original PDF
  const srcDoc = await PDFDocument.load(inputBytes);

  // Create a new empty PDF document
  const newDoc = await PDFDocument.create();

  // Copy all pages to the new document (unused objects are stripped)
  const pages = await newDoc.copyPages(srcDoc, srcDoc.getPageIndices());
  pages.forEach((page) => newDoc.addPage(page));

  // Save - control memory usage with objectsPerTick
  const compressedBytes = await newDoc.save({
    useObjectStreams: true, // Additional compression via object streams
    addDefaultPage: false,
    objectsPerTick: 50, // Memory optimization for large files
  });

  console.log(`Original: ${(inputBytes.length / 1024 / 1024).toFixed(1)}MB`);
  console.log(`Compressed: ${(compressedBytes.length / 1024 / 1024).toFixed(1)}MB`);
  console.log(`Reduction: ${((1 - compressedBytes.length / inputBytes.length) * 100).toFixed(1)}%`);

  return compressedBytes;
}
// Browser usage example
const fileInput = document.querySelector<HTMLInputElement>('input[type="file"]')!;
fileInput.addEventListener('change', async () => {
  const file = fileInput.files?.[0];
  if (!file) return;
  const buffer = await file.arrayBuffer();
  const compressed = await compressPdf(new Uint8Array(buffer));

  // Create download link
  const blob = new Blob([compressed], { type: 'application/pdf' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = file.name.replace('.pdf', '_compressed.pdf');
  a.click();
});

Image Downsampling (Ghostscript)
Embedded images account for the bulk of PDF file size. Ghostscript's -dPDFSETTINGS option lets you control image resolution and compression quality. Preset characteristics:
- /screen: 72 DPI, screen viewing only (maximum compression, unsuitable for printing)
- /ebook: 150 DPI, e-books and tablets (balanced compression and quality)
- /printer: 300 DPI, general printing (high quality preserved)
- /prepress: 300 DPI, press-ready output (color profiles preserved)
For typical email attachments and web uploads, the /ebook setting offers the best balance.
# Basic Ghostscript PDF compression (/ebook = 150 DPI, general purpose)
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.5 \
   -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=output_compressed.pdf \
   input.pdf

# Maximum compression (web/screen viewing only, 72 DPI)
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.5 \
   -dPDFSETTINGS=/screen \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=output_screen.pdf \
   input.pdf

# Advanced: manually specify image resolution (120 DPI)
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.5 \
   -dDownsampleColorImages=true \
   -dColorImageResolution=120 \
   -dDownsampleGrayImages=true \
   -dGrayImageResolution=120 \
   -dDownsampleMonoImages=true \
   -dMonoImageResolution=120 \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=output_120dpi.pdf \
   input.pdf

# Batch compress all PDFs in a folder (Bash)
for f in *.pdf; do
  gs -sDEVICE=pdfwrite \
     -dPDFSETTINGS=/ebook \
     -dNOPAUSE -dBATCH -dQUIET \
     -sOutputFile="compressed_${f}" "${f}"
  echo "${f}: $(du -h "${f}" | cut -f1) -> $(du -h "compressed_${f}" | cut -f1)"
done

Font Subsetting to Reduce Size
When a PDF embeds an entire font file (several MB), the file size increases dramatically. Font subsetting extracts and embeds only the glyphs (characters) actually used in the document, which is especially effective for CJK fonts. A Korean font may contain over 11,172 glyphs, but a typical document uses only 500-2,000 characters. Subsetting can reduce font size by over 90%. Python's fonttools library can create subsets, and pypdf can analyze fonts in existing PDFs.
# pip install fonttools pypdf

# 1. Create a font subset (fonttools)
import os
from fontTools.subset import Subsetter
from fontTools.ttLib import TTFont

def create_font_subset(font_path: str, text: str, output_path: str):
    """Create a subset font containing only used characters"""
    font = TTFont(font_path)
    subsetter = Subsetter()
    # Specify only the Unicode codepoints that are used
    codepoints = {ord(c) for c in text}
    subsetter.populate(unicodes=codepoints)
    subsetter.subset(font)
    font.save(output_path)

    original = os.path.getsize(font_path) / 1024
    subset = os.path.getsize(output_path) / 1024
    print(f"Original font: {original:.0f}KB")
    print(f"Subset: {subset:.0f}KB")
    print(f"Reduction: {(1 - subset/original)*100:.1f}%")

# Usage example - pass text used in the document
document_text = "All the text actually used in this document"
create_font_subset(
    "NotoSans-Regular.ttf",
    document_text,
    "NotoSans-Subset.ttf"
)
# 2. Analyze fonts embedded in a PDF (pypdf)
from pypdf import PdfReader

def analyze_fonts(pdf_path: str):
    """List fonts embedded in a PDF and whether each is a subset"""
    reader = PdfReader(pdf_path)
    fonts = {}
    for page in reader.pages:
        if "/Resources" in page and "/Font" in page["/Resources"]:
            for font_name, font_obj in page["/Resources"]["/Font"].items():
                font_ref = font_obj.get_object()
                base_font = font_ref.get("/BaseFont", "Unknown")
                is_subset = "+" in str(base_font)  # Subset fonts carry an "ABCDEF+" prefix
                fonts[str(base_font)] = {
                    "subset": is_subset,
                    "encoding": str(font_ref.get("/Encoding", "N/A")),
                }
    print(f"Embedded fonts ({len(fonts)}):")
    for name, info in fonts.items():
        status = "subset" if info["subset"] else "full font"
        print(f"  {name} [{status}]")

analyze_fonts("report.pdf")

Remove Unnecessary Elements (Metadata, Annotations, JavaScript)
Beyond the main content, PDFs can contain various auxiliary data: metadata (author, creation tool, edit history), annotations, bookmarks, attachments, and embedded JavaScript. These elements unnecessarily increase file size and can also pose security risks (information leakage via metadata, malicious JavaScript). Using pypdf, you can selectively remove these elements. Ghostscript also automatically cleans up many of them by default.
from pypdf import PdfReader, PdfWriter
import os

def strip_pdf_extras(input_path: str, output_path: str):
    """Remove unnecessary metadata, annotations, JavaScript, etc. from a PDF"""
    reader = PdfReader(input_path)
    writer = PdfWriter()

    # 1. Copy pages, removing annotations
    for page in reader.pages:
        if "/Annots" in page:
            del page["/Annots"]
        writer.add_page(page)

    # 2. Clear metadata
    writer.add_metadata({
        "/Producer": "",
        "/Creator": "",
        "/Author": "",
        "/Subject": "",
        "/Keywords": "",
        "/Title": "",
    })

    # 3. Remove JavaScript and 4. embedded files (EmbeddedFiles)
    #    (note: _root_object is a private pypdf attribute)
    if "/Names" in writer._root_object:
        names = writer._root_object["/Names"]
        if "/JavaScript" in names:
            del names["/JavaScript"]
        if "/EmbeddedFiles" in names:
            del names["/EmbeddedFiles"]

    # 5. Deduplicate identical objects, then save
    writer.compress_identical_objects(remove_identicals=True)
    with open(output_path, "wb") as f:
        writer.write(f)

    orig = os.path.getsize(input_path) / 1024 / 1024
    new = os.path.getsize(output_path) / 1024 / 1024
    print(f"Original: {orig:.1f}MB -> Cleaned: {new:.1f}MB ({(1-new/orig)*100:.1f}% reduction)")

strip_pdf_extras("report_full.pdf", "report_clean.pdf")

# One-step cleanup + compression with Ghostscript:
# gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook \
#    -dFastWebView=true \
#    -dPrinted=false \
#    -dNOPAUSE -dBATCH \
#    -sOutputFile=clean_output.pdf input.pdf

Optimize Large PDFs with Split, Compress, and Merge
Very large PDFs (100MB+) can cause memory exhaustion when processed all at once. In such cases, splitting the PDF into parts, compressing each part individually, then merging them back is an effective strategy. Advantages of this approach:
- Memory usage is limited to the size of each part
- Different compression levels can be applied per part (stronger compression for image-heavy sections)
- Parallel processing can reduce total processing time
from pypdf import PdfReader, PdfWriter
import subprocess
import os

def split_compress_merge(input_path: str, output_path: str, pages_per_chunk: int = 20):
    """Split large PDF -> compress each chunk -> merge back"""
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)
    chunk_files = []
    compressed_files = []
    print(f"Total {total_pages} pages, processing in chunks of {pages_per_chunk}")

    # Step 1: Split
    for start in range(0, total_pages, pages_per_chunk):
        end = min(start + pages_per_chunk, total_pages)
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        chunk_path = f"/tmp/chunk_{start:04d}.pdf"
        with open(chunk_path, "wb") as f:
            writer.write(f)
        chunk_files.append(chunk_path)
        print(f"  Split: pages {start+1}-{end} -> {chunk_path}")

    # Step 2: Compress each chunk with Ghostscript
    for chunk_path in chunk_files:
        compressed_path = chunk_path.replace(".pdf", "_compressed.pdf")
        subprocess.run([
            "gs", "-sDEVICE=pdfwrite",
            "-dCompatibilityLevel=1.5",
            "-dPDFSETTINGS=/ebook",
            "-dNOPAUSE", "-dBATCH", "-dQUIET",
            f"-sOutputFile={compressed_path}",
            chunk_path,
        ], check=True)
        compressed_files.append(compressed_path)
        orig_size = os.path.getsize(chunk_path) / 1024
        comp_size = os.path.getsize(compressed_path) / 1024
        print(f"  Compressed: {orig_size:.0f}KB -> {comp_size:.0f}KB")

    # Step 3: Merge (PdfMerger was removed in pypdf 5.x; PdfWriter.append replaces it)
    merger = PdfWriter()
    for compressed_path in compressed_files:
        merger.append(compressed_path)
    with open(output_path, "wb") as f:
        merger.write(f)

    # Clean up temporary files
    for tmp in chunk_files + compressed_files:
        os.remove(tmp)

    orig = os.path.getsize(input_path) / 1024 / 1024
    final = os.path.getsize(output_path) / 1024 / 1024
    print(f"\nFinal result: {orig:.1f}MB -> {final:.1f}MB ({(1-final/orig)*100:.1f}% reduction)")

# Execute
split_compress_merge("large_report.pdf", "optimized_report.pdf", pages_per_chunk=20)

Common Mistakes
Compressed PDF fails to open or renders incorrectly
Always create a backup of the original before compressing. The useObjectStreams option in pdf-lib relies on object streams, which are only supported in PDF 1.5 and above. If older viewers have issues, disable this option, or pass -dCompatibilityLevel=1.4 to Ghostscript.
Using Ghostscript /screen setting for print-intended PDFs
/screen downsamples to 72 DPI, making images extremely blurry when printed. Use /printer (300 DPI) or higher for print purposes, and only use /ebook (150 DPI) for web/email delivery.
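To avoid picking the wrong preset, the delivery target can be mapped to a -dPDFSETTINGS value explicitly. A hypothetical helper (`pdfsettings_for` is not part of any library; the mapping follows the preset table above):

```python
def pdfsettings_for(purpose: str) -> str:
    """Map a delivery target to a Ghostscript -dPDFSETTINGS preset."""
    presets = {
        "screen": "/screen",   # 72 DPI, on-screen viewing only
        "email": "/ebook",     # 150 DPI, balanced
        "web": "/ebook",
        "print": "/printer",   # 300 DPI
        "press": "/prepress",  # 300 DPI, color profiles preserved
    }
    if purpose not in presets:
        raise ValueError(f"Unknown purpose: {purpose}")
    return presets[purpose]

print(pdfsettings_for("email"))  # -> /ebook
```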
Attempting to compress an encrypted PDF and getting errors
Password-protected PDFs must be decrypted first. Use pypdf's reader.decrypt("password") before compression, or run qpdf --decrypt beforehand. Unauthorized decryption of protected PDFs may violate copyright laws.
Stripping accessibility tags when removing metadata
Indiscriminate removal of elements from PDF/UA (accessibility) tagged documents breaks screen reader compatibility. For official documents requiring accessibility, preserve the /StructTreeRoot (structure tree) when cleaning metadata.
Processing a very large PDF in the browser causes out-of-memory errors
Browser memory limits are typically 1-4GB. For PDFs over 50MB, use chunk-based processing or CLI tools (Ghostscript). Processing in a Web Worker prevents main thread blocking.
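These rules of thumb can be encoded directly. A hypothetical dispatcher using the thresholds discussed in this article (50MB for browser processing, 100MB+ for the split-compress-merge pipeline):

```python
def pick_compression_route(size_mb: float) -> str:
    """Suggest a processing route based on input file size."""
    if size_mb > 100:
        return "split-compress-merge"  # chunked CLI pipeline from the last section
    if size_mb > 50:
        return "ghostscript"           # CLI avoids browser memory limits
    return "browser"                   # pdf-lib, ideally in a Web Worker
```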