Python OCR Image Text Extraction
Problem
Covers the entire process of extracting text from scanned documents, receipts, business cards, and other images, converting them into searchable digital data.
Required Tools
- Python: programming language for building the OCR pipeline
- Tesseract: open-source OCR engine maintained by Google; supports over 100 languages
- pytesseract: Python wrapper library for Tesseract
- Pillow: Python imaging library for image loading, preprocessing, and conversion
- OpenCV: used for advanced image preprocessing (binarization, noise removal, etc.)
Solution Steps
Install Tesseract OCR Engine and Python Packages
Tesseract must be installed at the system level, and pytesseract is a wrapper that calls this engine from Python. For recognizing additional languages, you must install the corresponding language training data (tessdata). Instructions are provided for Ubuntu/Debian; use brew for macOS or the installer for Windows.
# Ubuntu/Debian - Install Tesseract engine
sudo apt update
sudo apt install -y tesseract-ocr
# Install additional language packs (e.g., for Korean)
sudo apt install -y tesseract-ocr-kor
# Check installed language list
tesseract --list-langs
# Install Python packages
pip install pytesseract Pillow opencv-python-headless
# For macOS
# brew install tesseract tesseract-lang
# Verify installation
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
Image Preprocessing (Grayscale, Binarization, Noise Removal)
Preprocessing quality is the single biggest factor in OCR accuracy; feeding raw images directly results in significantly lower recognition rates. Key preprocessing steps:
1. Grayscale conversion - simplifies computation by removing color information
2. Noise removal - apply a Gaussian blur or median filter
3. Binarization - cleanly separate text and background into black and white
4. Deskew - correct tilt introduced during scanning
Adaptive thresholding is particularly effective for documents with uneven lighting; Otsu's binarization works well when lighting is uniform.
import cv2
import numpy as np
from PIL import Image
def preprocess_image(image_path):
    """Image preprocessing pipeline to improve OCR accuracy."""
    # Load image (cv2.imread returns None on failure instead of raising)
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")
    # 1. Grayscale conversion
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # 2. Noise removal (median filter - effective for salt-and-pepper noise)
    denoised = cv2.medianBlur(gray, 3)
    # 3. Adaptive binarization (suitable for documents with uneven lighting)
    binary = cv2.adaptiveThreshold(
        denoised,
        255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        blockSize=11,  # block size (must be odd)
        C=2            # correction constant
    )
    # 4. Morphological closing to reconnect broken strokes
    #    (a 1x1 kernel is effectively a no-op; increase to (2, 2) if characters are fragmented)
    kernel = np.ones((1, 1), np.uint8)
    processed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    return processed
# Run preprocessing and save result
processed = preprocess_image('scanned_document.png')
cv2.imwrite('preprocessed.png', processed)
print("Preprocessing complete: saved as preprocessed.png")
Extract Text with pytesseract
Pass the preprocessed image to pytesseract to extract text. Specify the recognition language with the lang parameter; for mixed-language documents use formats like 'eng+kor'. Control Tesseract's detailed behavior with the config parameter:
- --psm (Page Segmentation Mode): how to analyze the page layout
- --oem (OCR Engine Mode): which OCR engine to use (LSTM, Legacy, or combined)
Frequently used PSM modes:
- 3: automatic page segmentation (default; general documents)
- 6: a single uniform block of text (receipts, etc.)
- 7: a single line of text (license plates, etc.)
- 11: sparse text (scattered text, e.g. inside tables)
import pytesseract
from PIL import Image
def extract_text(image_path, lang='eng', psm=3):
    """Extract text from an image.

    Args:
        image_path: Path to the image file
        lang: Recognition language (e.g., 'eng', 'kor', 'eng+kor')
        psm: Page Segmentation Mode (3=auto, 6=block, 7=single line)

    Returns:
        str: Extracted text
    """
    img = Image.open(image_path)
    custom_config = f'--oem 3 --psm {psm}'
    text = pytesseract.image_to_string(img, lang=lang, config=custom_config)
    return text
# Basic text extraction
text = extract_text('preprocessed.png', lang='eng')
print("--- Extracted Text ---")
print(text)
# Detailed extraction (per-word coordinates, confidence)
data = pytesseract.image_to_data(
    Image.open('preprocessed.png'),
    lang='eng',
    config='--oem 3 --psm 3',
    output_type=pytesseract.Output.DICT
)
# Filter only words with confidence above 60%
for i, word in enumerate(data['text']):
    conf = float(data['conf'][i])  # -1 marks non-word boxes; some versions return strings
    if conf > 60 and word.strip():
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"Word: '{word}' | Confidence: {conf:.0f}% | Position: ({x},{y}) Size: {w}x{h}")
Post-processing and Cleanup
OCR results often contain unnecessary whitespace, line breaks, and misrecognized special characters. Post-processing is important for improving result quality. Use regular expressions to clean up patterns, remove unnecessary characters, and save results in a structured format. For bulk processing, results can be saved as JSON or CSV for loading into a database.
import re
import json
def postprocess_text(raw_text):
    """Clean up OCR result text."""
    text = raw_text
    # 1. Collapse consecutive spaces/tabs into one space
    text = re.sub(r'[ \t]+', ' ', text)
    # 2. Collapse runs of blank lines into a single blank line
    text = re.sub(r'\n{3,}', '\n\n', text)
    # 3. Strip leading/trailing whitespace from each line
    text = '\n'.join(line.strip() for line in text.split('\n'))
    # 4. Common OCR misrecognition corrections (illustrative only - deliberately
    #    NOT applied here, since blanket substitution would corrupt correct text;
    #    customize and apply these per document type)
    replacements = {
        '|': 'l',  # pipe misread for lowercase L
        '0': 'O',  # digit 0 misread for uppercase O in title areas (context-dependent)
    }
    # 5. Strip leading/trailing whitespace
    return text.strip()
# Apply post-processing
raw_text = extract_text('preprocessed.png')
clean_text = postprocess_text(raw_text)
# Save results
output = {
    'source_file': 'scanned_document.png',
    'language': 'eng',
    'raw_length': len(raw_text),
    'clean_length': len(clean_text),
    'text': clean_text,
}
with open('ocr_result.json', 'w', encoding='utf-8') as f:
    json.dump(output, f, ensure_ascii=False, indent=2)
print(f"Original {len(raw_text)} chars -> Cleaned {len(clean_text)} chars")
print("Result saved: ocr_result.json")
Batch Processing for Large Document Sets
When processing thousands of documents, a batch processing script is needed: iterate over every image in a folder, run OCR, report progress, and log failed files separately. Because OCR is CPU-bound, multiprocessing can improve throughput roughly in proportion to the number of CPU cores.
import glob
import json
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

import cv2

# Reuses preprocess_image, extract_text, and postprocess_text defined above
def process_single_file(image_path):
    """OCR processing for a single file."""
    try:
        processed = preprocess_image(image_path)
        temp_path = image_path + '_temp.png'
        cv2.imwrite(temp_path, processed)
        text = extract_text(temp_path, lang='eng')
        clean = postprocess_text(text)
        os.remove(temp_path)
        return {'file': image_path, 'status': 'success', 'text': clean}
    except Exception as e:
        return {'file': image_path, 'status': 'error', 'error': str(e)}
def batch_ocr(input_dir, output_path, max_workers=4):
    """Batch OCR processing for all images in a folder."""
    patterns = ['*.png', '*.jpg', '*.jpeg', '*.tiff', '*.bmp']
    files = []
    for pat in patterns:
        files.extend(glob.glob(os.path.join(input_dir, pat)))
    print(f"Files to process: {len(files)}, Workers: {max_workers}")
    results = []
    errors = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_single_file, f): f for f in files}
        for i, future in enumerate(as_completed(futures), 1):
            result = future.result()
            if result['status'] == 'success':
                results.append(result)
            else:
                errors.append(result)
            print(f"[{i}/{len(files)}] {result['file']} - {result['status']}")
    # Save results
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    print(f"\nComplete: {len(results)} succeeded, {len(errors)} failed")
    if errors:
        print("Failed files:")
        for e in errors:
            print(f"  - {e['file']}: {e['error']}")
# Run (the __main__ guard is required for ProcessPoolExecutor on platforms
# that spawn worker processes, such as Windows and macOS)
if __name__ == '__main__':
    batch_ocr('./scanned_docs/', './ocr_results.json', max_workers=4)
Core Code
Complete OCR pipeline function covering preprocessing through post-processing; supports multilingual documents.
import re

import cv2
import pytesseract
from PIL import Image

def ocr_pipeline(image_path, lang='eng', psm=3):
    """Pipeline that performs image preprocessing + OCR extraction + post-processing in one step.

    Args:
        image_path: Path to the image file
        lang: OCR recognition language
        psm: Page Segmentation Mode

    Returns:
        dict: Extraction result (text, word count, confidence, etc.)
    """
    # 1. Image preprocessing
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    denoised = cv2.medianBlur(gray, 3)
    binary = cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )
    # 2. Run OCR
    pil_img = Image.fromarray(binary)
    config = f'--oem 3 --psm {psm}'
    text = pytesseract.image_to_string(pil_img, lang=lang, config=config)
    # 3. Calculate average word confidence (-1 entries are non-word boxes)
    data = pytesseract.image_to_data(pil_img, lang=lang, config=config,
                                     output_type=pytesseract.Output.DICT)
    confidences = [float(c) for c in data['conf'] if float(c) > 0]
    avg_confidence = sum(confidences) / len(confidences) if confidences else 0
    # 4. Post-processing
    clean = re.sub(r'[ \t]+', ' ', text)
    clean = re.sub(r'\n{3,}', '\n\n', clean).strip()
    return {
        'text': clean,
        'word_count': len(clean.split()),
        'avg_confidence': round(avg_confidence, 1),
        'total_words_detected': len(confidences),
    }
# Usage example
result = ocr_pipeline('scanned_document.png', lang='eng')
print(f"Extracted text ({result['word_count']} words, avg confidence: {result['avg_confidence']}%):")
print(result['text'])
Common Mistakes
Not installing the language pack (tessdata) for the target language
An error occurs saying pytesseract cannot find the specified language. On Ubuntu, run sudo apt install tesseract-ocr-<lang>, or download the .traineddata file from the tessdata GitHub repository and place it in the tessdata directory.
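Checking the request against the installed language list before running OCR turns this into a clear error message. A minimal sketch; `missing_languages` is a hypothetical helper, while `pytesseract.get_languages` is a real call (Tesseract 4.1+ / pytesseract 0.3.3+):

```python
def missing_languages(lang_spec, installed):
    """Return languages from an 'eng+kor' style spec that are not in `installed`."""
    return [lang for lang in lang_spec.split('+') if lang not in installed]

# Usage against the running engine:
#   import pytesseract
#   missing = missing_languages('eng+kor', pytesseract.get_languages(config=''))
#   if missing:
#       raise RuntimeError(f"Install tessdata for: {missing}")
```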
Image resolution is too low (DPI < 200)
Tesseract performs optimally on images with at least 300 DPI. For low-resolution images, upscale 2-3x using cv2.resize before running OCR to significantly improve recognition rates.
Running OCR on the original image without preprocessing
Feeding a raw color image without preprocessing can drop recognition rates below 50%. Always perform grayscale conversion, binarization, and noise removal. Adaptive binarization is especially essential for scanned documents with uneven lighting.
Not setting the PSM mode appropriate for the document type
Use PSM 3 (automatic) for general documents, PSM 6 (single block) for receipts, and PSM 7 for single-line text like license plates. An incorrect PSM setting leads to layout analysis errors, causing text to be mixed up or missing.