
Python OCR Image Text Extraction

Covers the entire process of extracting text from scanned documents, receipts, business cards, and other images, and converting it into searchable digital data.

Python OCR, tesseract tutorial, image text extraction, pytesseract, document digitization, multilingual OCR, image preprocessing OCR, Pillow image processing

Problem

You need to digitize years of accumulated paper documents (contracts, receipts, meeting notes, etc.) at your company. You want to extract text from scanned images to build a searchable database, but manually typing tens of thousands of documents is impossible. The system must accurately recognize documents containing mixed languages and handle documents with inconsistent scan quality.

Required Tools

Python 3.x

Programming language for building the OCR pipeline

Tesseract OCR

Open-source OCR engine maintained by Google. Supports over 100 languages

pytesseract

Python wrapper library for Tesseract

Pillow (PIL)

Python image library for image loading, preprocessing, and conversion

OpenCV (cv2)

Used for advanced image preprocessing (binarization, noise removal, etc.)

Solution Steps

1. Install Tesseract OCR Engine and Python Packages

Tesseract must be installed at the system level, and pytesseract is a wrapper that calls this engine from Python. For recognizing additional languages, you must install the corresponding language training data (tessdata). Instructions are provided for Ubuntu/Debian; use brew for macOS or the installer for Windows.

# Ubuntu/Debian - Install Tesseract engine
sudo apt update
sudo apt install -y tesseract-ocr

# Install additional language packs (e.g., for Korean)
sudo apt install -y tesseract-ocr-kor

# Check installed language list
tesseract --list-langs

# Install Python packages
pip install pytesseract Pillow opencv-python-headless

# For macOS
# brew install tesseract tesseract-lang

# Verify installation
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"

2. Image Preprocessing (Grayscale, Binarization, Noise Removal)

80% of OCR accuracy depends on preprocessing quality; feeding raw images straight to the engine results in significantly lower recognition rates. Key preprocessing steps:

1. Grayscale conversion - removes color information to simplify computation
2. Noise removal - apply a Gaussian blur or median filter
3. Binarization - cleanly separate text and background into black and white
4. Deskew - correct the tilt introduced during scanning

Otsu's binarization and adaptive thresholding are particularly effective for documents with uneven lighting.

import cv2
import numpy as np
from PIL import Image

def preprocess_image(image_path):
    """Image preprocessing pipeline to improve OCR accuracy"""

    # Load image
    img = cv2.imread(image_path)

    # 1. Grayscale conversion
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # 2. Noise removal (median filter - effective for salt-and-pepper noise)
    denoised = cv2.medianBlur(gray, 3)

    # 3. Adaptive binarization (suitable for documents with uneven lighting)
    binary = cv2.adaptiveThreshold(
        denoised,
        255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        blockSize=11,  # Block size (must be odd)
        C=2             # Correction constant
    )

    # 4. Morphological operation to fix broken characters
    kernel = np.ones((1, 1), np.uint8)
    processed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    return processed

# Run preprocessing and save result
processed = preprocess_image('scanned_document.png')
cv2.imwrite('preprocessed.png', processed)
print("Preprocessing complete: saved as preprocessed.png")

3. Extract Text with pytesseract

Pass the preprocessed image to pytesseract to extract text. Specify the recognition language with the lang parameter; for mixed-language documents use formats like 'eng+kor'. Control Tesseract's detailed behavior with the config parameter:

- --psm (Page Segmentation Mode): how to analyze the page layout
- --oem (OCR Engine Mode): which OCR engine to use (LSTM, legacy, or both)

Frequently used PSM modes:

- 3: automatic page segmentation (default; general documents)
- 6: a single uniform block of text (receipts, etc.)
- 7: a single line of text (license plates, etc.)
- 11: sparse text (scattered text, e.g., inside tables)

import pytesseract
from PIL import Image

def extract_text(image_path, lang='eng', psm=3):
    """Extract text from an image.

    Args:
        image_path: Path to the image file
        lang: Recognition language (e.g., 'eng', 'kor', 'eng+kor')
        psm: Page Segmentation Mode (3=auto, 6=block, 7=single line)

    Returns:
        str: Extracted text
    """
    img = Image.open(image_path)

    custom_config = f'--oem 3 --psm {psm}'
    text = pytesseract.image_to_string(img, lang=lang, config=custom_config)

    return text

# Basic text extraction
text = extract_text('preprocessed.png', lang='eng')
print("--- Extracted Text ---")
print(text)

# Detailed extraction (per-word coordinates, confidence)
data = pytesseract.image_to_data(
    Image.open('preprocessed.png'),
    lang='eng',
    config='--oem 3 --psm 3',
    output_type=pytesseract.Output.DICT
)

# Filter only words with confidence above 60%
for i, word in enumerate(data['text']):
    # conf may be an int, a float, or a numeric string depending on the
    # pytesseract version, so coerce via float()
    conf = float(data['conf'][i])
    if conf > 60 and word.strip():
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"Word: '{word}' | Confidence: {conf:.0f}% | Position: ({x},{y}) Size: {w}x{h}")

4. Post-processing and Cleanup

OCR results often contain unnecessary whitespace, line breaks, and misrecognized special characters. Post-processing is important for improving result quality. Use regular expressions to clean up patterns, remove unnecessary characters, and save results in a structured format. For bulk processing, results can be saved as JSON or CSV for loading into a database.

import re
import json

def postprocess_text(raw_text):
    """Clean up OCR result text."""

    text = raw_text

    # 1. Collapse consecutive spaces into one
    text = re.sub(r'[ \t]+', ' ', text)

    # 2. Collapse consecutive blank lines into one
    text = re.sub(r'\n{3,}', '\n\n', text)

    # 3. Strip leading/trailing whitespace from each line
    text = '\n'.join(line.strip() for line in text.split('\n'))

    # 4. Common OCR misrecognition corrections (defined but deliberately not
    #    applied wholesale - substitutions like '0' vs 'O' are context-
    #    dependent, so customize and apply rules per document type)
    replacements = {
        '|': 'l',    # Pipe -> lowercase L
        '0': 'O',    # Digit 0 -> uppercase O (only in word/title contexts)
    }

    # 5. Strip leading/trailing whitespace
    text = text.strip()

    return text

# Apply post-processing
raw_text = extract_text('preprocessed.png')
clean_text = postprocess_text(raw_text)

# Save results
output = {
    'source_file': 'scanned_document.png',
    'language': 'eng',
    'raw_length': len(raw_text),
    'clean_length': len(clean_text),
    'text': clean_text,
}

with open('ocr_result.json', 'w', encoding='utf-8') as f:
    json.dump(output, f, ensure_ascii=False, indent=2)

print(f"Original {len(raw_text)} chars -> Cleaned {len(clean_text)} chars")
print(f"Result saved: ocr_result.json")
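The `replacements` dict above is intentionally left unapplied, since corrections like '0' vs 'O' depend on context. One hedged way to apply such rules only where they are likely correct is to restrict them to numeric-looking tokens; `fix_numeric_tokens` below is a hypothetical helper, not part of the original pipeline.

```python
import re

# Letter -> digit confusions to apply only inside numeric-looking tokens
DIGIT_FIXES = str.maketrans({'O': '0', 'o': '0', 'l': '1', 'I': '1', 'S': '5'})

def fix_numeric_tokens(text):
    """Apply letter->digit OCR fixes only to tokens that are mostly digits,
    leaving ordinary words ('Olive', 'Ill') untouched."""
    def fix(match):
        token = match.group(0)
        digits = sum(ch.isdigit() for ch in token)
        # Treat a token as numeric if at least half its characters are digits
        if digits >= len(token) / 2:
            return token.translate(DIGIT_FIXES)
        return token
    return re.sub(r'\b\w+\b', fix, text)

print(fix_numeric_tokens('Invoice 2O23-l04 total S0 items'))
# Invoice 2023-104 total 50 items
```

The half-digits threshold is a heuristic; tune it (and the confusion table) to your document type.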

5. Batch Processing for Large Document Sets

When processing thousands of documents, you need a batch processing script: it iterates over all images in a folder, performs OCR, reports progress, and logs failed files separately. Multiprocessing can speed this up roughly in proportion to the number of CPU cores.

import os
import glob
import json
import cv2
from concurrent.futures import ProcessPoolExecutor, as_completed

# preprocess_image, extract_text, and postprocess_text are the functions
# defined in Steps 2-4

def process_single_file(image_path):
    """OCR processing for a single file"""
    try:
        processed = preprocess_image(image_path)
        temp_path = image_path + '_temp.png'
        cv2.imwrite(temp_path, processed)
        text = extract_text(temp_path, lang='eng')
        clean = postprocess_text(text)
        os.remove(temp_path)
        return {'file': image_path, 'status': 'success', 'text': clean}
    except Exception as e:
        return {'file': image_path, 'status': 'error', 'error': str(e)}

def batch_ocr(input_dir, output_path, max_workers=4):
    """Batch OCR processing for all images in a folder."""
    patterns = ['*.png', '*.jpg', '*.jpeg', '*.tiff', '*.bmp']
    files = []
    for pat in patterns:
        files.extend(glob.glob(os.path.join(input_dir, pat)))

    print(f"Files to process: {len(files)}, Workers: {max_workers}")

    results = []
    errors = []

    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_single_file, f): f for f in files}
        for i, future in enumerate(as_completed(futures), 1):
            result = future.result()
            if result['status'] == 'success':
                results.append(result)
            else:
                errors.append(result)
            print(f"[{i}/{len(files)}] {result['file']} - {result['status']}")

    # Save results
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    print(f"\nComplete: {len(results)} succeeded, {len(errors)} failed")
    if errors:
        print("Failed files:")
        for e in errors:
            print(f"  - {e['file']}: {e['error']}")

# Run (the __main__ guard matters for ProcessPoolExecutor: worker processes
# re-import this module, and on Windows/macOS the spawn start method would
# otherwise re-run the batch job recursively)
if __name__ == '__main__':
    batch_ocr('./scanned_docs/', './ocr_results.json', max_workers=4)

Core Code

A complete OCR pipeline function covering preprocessing through post-processing, with multilingual support.

import pytesseract
import cv2
from PIL import Image

def ocr_pipeline(image_path, lang='eng', psm=3):
    """Pipeline that performs image preprocessing + OCR extraction + post-processing in one step

    Args:
        image_path: Path to the image file
        lang: OCR recognition language
        psm: Page Segmentation Mode

    Returns:
        dict: Extraction result (text, word count, confidence, etc.)
    """
    # 1. Image preprocessing
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    denoised = cv2.medianBlur(gray, 3)
    binary = cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    # 2. Run OCR
    pil_img = Image.fromarray(binary)
    config = f'--oem 3 --psm {psm}'
    text = pytesseract.image_to_string(pil_img, lang=lang, config=config)

    # 3. Calculate confidence
    data = pytesseract.image_to_data(pil_img, lang=lang, config=config, output_type=pytesseract.Output.DICT)
    # conf values may be ints, floats, or numeric strings depending on the
    # pytesseract version; -1 marks non-word entries
    confidences = [float(c) for c in data['conf'] if float(c) > 0]
    avg_confidence = sum(confidences) / len(confidences) if confidences else 0

    # 4. Post-processing
    import re
    clean = re.sub(r'[ \t]+', ' ', text)
    clean = re.sub(r'\n{3,}', '\n\n', clean).strip()

    return {
        'text': clean,
        'word_count': len(clean.split()),
        'avg_confidence': round(avg_confidence, 1),
        'total_words_detected': len(confidences),
    }

# Usage example
result = ocr_pipeline('scanned_document.png', lang='eng')
print(f"Extracted text ({result['word_count']} words, avg confidence: {result['avg_confidence']}%):")
print(result['text'])

Common Mistakes

Not installing the language pack (tessdata) for the target language

An error occurs saying pytesseract cannot find the specified language. On Ubuntu, run sudo apt install tesseract-ocr-<lang>, or download the .traineddata file from the tessdata GitHub repository and place it in the tessdata directory.

Image resolution is too low (DPI < 200)

Tesseract performs optimally on images with at least 300 DPI. For low-resolution images, upscale 2-3x using cv2.resize before running OCR to significantly improve recognition rates.

Running OCR on the original image without preprocessing

Feeding a raw color image without preprocessing can drop recognition rates below 50%. Always perform grayscale conversion, binarization, and noise removal. Adaptive binarization is especially essential for scanned documents with uneven lighting.

Not setting the PSM mode appropriate for the document type

Use PSM 3 (automatic) for general documents, PSM 6 (single block) for receipts, and PSM 7 for single-line text like license plates. An incorrect PSM setting leads to layout analysis errors, causing text to be mixed up or missing.
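As a sketch, the PSM guidance above can be encoded in a small lookup so batch jobs pick the right mode per document type; the mapping and type names below are illustrative, not from the original.

```python
# Hypothetical mapping from document type to Page Segmentation Mode,
# following the PSM guidance above
PSM_BY_DOC_TYPE = {
    'general': 3,   # automatic page segmentation (default)
    'receipt': 6,   # single uniform block of text
    'plate': 7,     # single text line
    'sparse': 11,   # sparse text in no particular order
}

def tesseract_config(doc_type):
    # Fall back to automatic segmentation for unknown document types
    return f"--oem 3 --psm {PSM_BY_DOC_TYPE.get(doc_type, 3)}"

print(tesseract_config('receipt'))  # --oem 3 --psm 6
```

The resulting string can be passed as the `config` argument to `pytesseract.image_to_string`.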
