Data Extraction with Regular Expressions
A practical guide to automatically extracting emails, phone numbers, URLs, and more from text using Python's re module and regex patterns
Problem
Required Tools
re: The regular expression module in Python's standard library. Performs pattern matching with functions such as findall, search, sub, and compile.
An online regex tester. Tests patterns in real time and visually explains the meaning of each part. Select Python mode.
grep: Searches for regex patterns within files from the terminal. Useful for quick checks; ripgrep (rg) is a faster alternative.
pandas: Performs column-level regex extraction using the str.extract() and str.findall() methods on DataFrame columns.
Solution Steps
Understand basic regex syntax
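To use regular expressions effectively, you need to understand the core metacharacters and quantifiers. Here is a summary of the most commonly used patterns. In Python, write patterns as raw strings (r"...") so that backslashes reach the regex engine unchanged. Pre-compiling a pattern with re.compile() gives you a reusable pattern object and avoids looking the pattern up again on every call; because the re module also caches recently used patterns, the speed gain is usually modest.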
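As a minimal sketch of how a compiled pattern and named groups can work together, the example below pulls dates out of a short string. The names date_pattern and log are assumptions made for illustration and do not appear elsewhere in this guide.

```python
import re

# Compile once and reuse; the raw string keeps the backslashes literal.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Hypothetical sample text for illustration.
log = "Created 2024-03-15, updated 2024-04-02."

# finditer yields match objects, so each named group can be read by name.
dates = [m.groupdict() for m in date_pattern.finditer(log)]
print(dates)
# [{'year': '2024', 'month': '03', 'day': '15'},
#  {'year': '2024', 'month': '04', 'day': '02'}]
```

Named groups make the extracted parts self-describing, which tends to help once a pattern has more than two or three groups.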
import re
# ========================================
# Core Regex Syntax Reference
# ========================================
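# . : Any single character (except newline, unless re.DOTALL is set)
# \d : Digit ([0-9]; also other Unicode digits unless re.ASCII is set)
# \w : Word character ([a-zA-Z0-9_]; also Unicode letters and digits unless re.ASCII is set)
# \s : Whitespace character (space, tab, newline)
# ^ : Start of string (or of each line with re.MULTILINE)
# $ : End of string (or of each line with re.MULTILINE)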
# * : 0 or more repetitions
# + : 1 or more repetitions
# ? : 0 or 1 occurrence
# {n,m} : n to m repetitions
# [abc] : One of a, b, c
# [^abc] : Any character except a, b, c
# (...) : Capture group
# (?:...) : Non-capturing group
# (?P<name>...) : Named group
# | : OR (alternation)
# \b : Word boundary
# Key functions of the re module
text = "Contact: contact@example.com or 555-123-4567"
re.findall(r"\d+", text) # ['555', '123', '4567'] - list of all matches
re.search(r"\d+", text) # First match object
re.sub(r"\d", "X", text) # Replace digits with X
re.split(r"\s+", text) # Split by whitespace
# Pattern compilation (improves performance for repeated use)
pattern = re.compile(r"\d{3}-\d{3}-\d{4}")
pattern.findall(text) # ['555-123-4567']
Write an email address pattern
An email address follows the "localpart@domain" format. The local part can contain letters, numbers, dots, hyphens, and underscores (the pattern below also allows plus and percent signs), and the domain consists of letters, numbers, and hyphens with at least one dot (.). A fully RFC 5322 compliant pattern is extremely complex, but a practical pattern extracts most email addresses found in ordinary text.
import re
# Practical email extraction pattern
email_pattern = re.compile(
    r"[a-zA-Z0-9._%+-]+" # Local part: letters, numbers, special chars
    r"@" # @ separator
    r"[a-zA-Z0-9.-]+" # Domain: letters, numbers, dots, hyphens
    r"\.[a-zA-Z]{2,}" # Final label: .com, .uk, etc. (2+ letters)
)
text = """
Hello, this is John Smith.
For work inquiries, please email john.smith@company.co.uk.
Personal email: jsmith_123@gmail.com
Marketing team: marketing+promo@startup.io
Invalid formats: @invalid, user@, user@.com
"""
emails = email_pattern.findall(text)
print("Extracted emails:")
for email in emails:
    print(f" - {email}")
# Results:
# - john.smith@company.co.uk
# - jsmith_123@gmail.com
# - marketing+promo@startup.io
# Classify by domain
from collections import Counter
domains = [e.split("@")[1] for e in emails]
print(Counter(domains)) # Counter({'company.co.uk': 1, 'gmail.com': 1, 'startup.io': 1})
Write a phone number pattern
Phone numbers can be written in various formats. US phone numbers include formats like (555) 123-4567, 555-123-4567, 555.123.4567, and 5551234567. The pattern must handle different separators: hyphens (-), dots (.), spaces, or no separator at all.
import re
# US phone number pattern (with optional country code)
phone_pattern = re.compile(
r"(?:"
r"(?:\+?1[-. ]?)?" # Optional country code (+1)
r"(?:\(\d{3}\)|\d{3})" # Area code: (555) or 555
r"[-. ]?" # Separator (hyphen, dot, space, or none)
r"\d{3}" # Middle digits (3 digits)
r"[-. ]?" # Separator
r"\d{4}" # Last digits (4 digits)
r")"
)
text = """
Customer service: (555) 123-4567
Contact mobile: 555.987.6543
Fax: 555 456 7890
Toll-free: 1-800-555-1234
Direct line: 5551234567
International: +1-555-123-4567
"""
phones = phone_pattern.findall(text)
print("Extracted phone numbers:")
for phone in phones:
    print(f" - {phone}")
# Normalize phone numbers (standardize to the (XXX) XXX-XXXX format)
def normalize_phone(phone):
    digits = re.sub(r"[^\d]", "", phone) # Keep only digits
    if digits.startswith("1") and len(digits) == 11:
        digits = digits[1:] # Remove country code
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return phone # Leave unrecognized formats unchanged
for phone in phones:
    print(f" {phone} -> {normalize_phone(phone)}")
Write a URL pattern
A URL consists of a protocol (http/https), domain, path, query string, and fragment. A perfect URL regex is extremely complex, but a practical pattern can extract most web URLs. The pattern can also be extended to find bare domains (www.example.com) without a protocol.
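One possible sketch of that bare-domain extension is shown here. The name bare_url_pattern and the sample text are assumptions for illustration; the idea is to accept either an http(s):// prefix or a leading www. in its place.

```python
import re

# Accept either a protocol or a leading "www." so bare domains are found too.
bare_url_pattern = re.compile(
    r"(?:https?://|www\.)"        # Protocol, or www. when no protocol is given
    r"[a-zA-Z0-9][a-zA-Z0-9.-]*"  # Domain labels
    r"\.[a-zA-Z]{2,}"             # Final label such as .com
    r"(?:/[^\s<>\"')\]]*)?",      # Optional path and query string
    re.IGNORECASE,
)

# Hypothetical sample text for illustration.
sample = "Docs at www.example.com/guide and https://api.example.com/v2 today; see notes.txt"
found = bare_url_pattern.findall(sample)
print(found)  # ['www.example.com/guide', 'https://api.example.com/v2']
```

Requiring www. when the protocol is missing keeps the pattern from matching every dotted word, such as the file name notes.txt in the sample.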
import re
# URL extraction pattern
url_pattern = re.compile(
    r"https?://" # Protocol (http:// or https://)
    r"(?:www\.)?" # www. (optional)
    r"[a-zA-Z0-9]" # Domain start (letter/digit)
    r"[a-zA-Z0-9.-]*" # Domain body
    r"\.[a-zA-Z]{2,}" # TLD
    r"(?:/[^\s<>\"'\)\]]*)?", # Path + query string (optional); \" keeps the string literal valid
    re.IGNORECASE
)
text = """
Official site: https://www.example.com
API docs: https://api.example.com/v2/docs?lang=en&format=json
GitHub: https://github.com/user/repo/issues/42
Google search: https://www.google.com/search?q=python+regex
Blog: http://blog.example.co.uk/post/2024/01/hello-world
"""
urls = url_pattern.findall(text)
print("Extracted URLs:")
for url in urls:
    print(f" - {url}")
# Extract only domains from URLs
domain_pattern = re.compile(r"https?://(?:www\.)?([a-zA-Z0-9.-]+)")
for url in urls:
    match = domain_pattern.search(url)
    if match:
        print(f" Domain: {match.group(1)}")
Write a batch extraction script in Python
Create a unified script that extracts emails, phone numbers, and URLs from multiple files at once. Use glob to get the target file list, apply patterns to each file, and organize extraction results into a pandas DataFrame. Recording the source filename and line number alongside each extracted result makes it easy to verify against the original later.
import re
import glob
import pandas as pd
from pathlib import Path
# Define patterns (compiled)
PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(
        r"(?:\+?1[-. ]?)?(?:\(?\d{3}\)?|\d{3})"
        r"[-. ]?\d{3}[-. ]?\d{4}"
    ),
    # \" keeps the double quote inside the character class without ending the string
    "URL": re.compile(r"https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(?:/[^\s<>\"'\)\]]*)?"),
    "zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"), # US zip code (5 or 5+4 digits)
}
def extract_from_file(filepath):
    """Extract all patterns from a file"""
    results = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            for pattern_name, pattern in PATTERNS.items():
                matches = pattern.findall(line)
                for match in matches:
                    results.append({
                        "file": Path(filepath).name,
                        "line": line_no,
                        "type": pattern_name,
                        "value": match,
                    })
    return results
def batch_extract(directory, file_pattern="*.txt"):
    """Batch extract from all files in a directory"""
    all_results = []
    files = glob.glob(f"{directory}/{file_pattern}")
    print(f"Target files: {len(files)}")
    for filepath in sorted(files):
        results = extract_from_file(filepath)
        all_results.extend(results)
        print(f" {Path(filepath).name}: {len(results)} items extracted")
    # Name the columns explicitly so an empty result does not raise a KeyError below
    df = pd.DataFrame(all_results, columns=["file", "line", "type", "value"])
    # Statistics by type
    print("\n=== Extraction Results Summary ===")
    print(df["type"].value_counts())
    # Remove duplicates
    df_unique = df.drop_duplicates(subset=["type", "value"])
    print(f"\nUnique values: {len(df_unique)} ({len(df) - len(df_unique)} duplicates removed)")
    # Save
    df_unique.to_csv("extraction_results.csv", index=False, encoding="utf-8-sig")
    print("extraction_results.csv saved!")
    return df_unique
# Run
df = batch_extract("./documents", "*.txt")
Core Code
This collection covers seven kinds of data frequently extracted in practice: email, phone number, URL, zip code, IP address, date, and monetary amount. re.findall() returns all non-overlapping matches of a pattern as a list in a single call.
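One detail of re.findall() is worth keeping in mind when adapting these patterns: if a pattern contains a capturing group, findall returns the group's text instead of the whole match. The short sketch below, using a hypothetical sample string, shows why the patterns in this guide use non-capturing groups (?:...).

```python
import re

# Hypothetical sample text for illustration.
sample = "Phone: 555-123-4567 / 212-555-7890"

# With a capturing group, findall returns only the captured part.
with_capture = re.findall(r"(\d{3})-\d{3}-\d{4}", sample)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r"(?:\d{3})-\d{3}-\d{4}", sample)

print(with_capture)     # ['555', '212']
print(without_capture)  # ['555-123-4567', '212-555-7890']
```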
import re
# ========================================
# Core: Practical Data Extraction Pattern Collection
# ========================================
text = """
John Smith, Manager (john.smith@company.co.uk)
Phone: 555-123-4567 / (212) 555-7890
Website: https://www.company.co.uk/about
Zip code: 10001 New York, NY
SSN: 123-45-6789
IP: 192.168.1.100
Date: 2024-03-15
Amount: $1,234,567
"""
patterns = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "phone": r"(?:\(?\d{3}\)?|\d{3})[-. ]?\d{3}[-. ]?\d{4}",
    "URL": r"https?://[^\s<>\"']+", # \" keeps the string literal valid
    "zip_code": r"\b\d{5}\b",
    "IP_addr": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",
    "date": r"\d{4}[-/.]\d{2}[-/.]\d{2}",
    "amount": r"\$[\d,]+",
}
for name, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{name}: {matches}")
Common Mistakes
Greedy matching extracts a wider range than expected
.* is a greedy quantifier that matches as much as possible. Using .*? for non-greedy (lazy) matching will match only the shortest range. Example: <b>.*</b> matches across multiple tags at once, but <b>.*?</b> matches each tag individually.
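The difference is easy to see on a small input. This is a minimal sketch with a made-up HTML fragment:

```python
import re

# Hypothetical fragment with two bold tags on one line.
html = "<b>first</b> and <b>second</b>"

greedy = re.findall(r"<b>.*</b>", html)   # .* runs to the LAST </b>
lazy = re.findall(r"<b>.*?</b>", html)    # .*? stops at the FIRST </b>

print(greedy)  # ['<b>first</b> and <b>second</b>']
print(lazy)    # ['<b>first</b>', '<b>second</b>']
```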
International phone number formats (+1-555-...) not handled
Consider international formats as well as domestic ones. Adding (?:\+?1[-. ]?)? at the beginning of the pattern optionally matches the US country code. In countries that use a trunk prefix (such as a leading 0), that digit is dropped after the country code, so patterns for those numbers must handle both forms.
Metacharacters interpreted with special meaning due to missing escape characters
To match metacharacters such as dots (.), parentheses (()), and brackets ([]) literally, escape them with a backslash (\). In Python, raw strings (r"...") avoid double escaping: r"\d\.\d" is the same string as "\\d\\.\\d".
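A short sketch of the effect follows, with hypothetical sample strings. It also uses re.escape(), a standard-library helper that escapes a literal string for use inside a pattern.

```python
import re

# Hypothetical sample text for illustration.
sample = "v1.5 and v1x5"

unescaped = re.findall(r"\d.\d", sample)   # . matches any character
escaped = re.findall(r"\d\.\d", sample)    # \. matches only a literal dot

print(unescaped)  # ['1.5', '1x5']
print(escaped)    # ['1.5']

# re.escape builds a pattern that matches a literal string exactly.
literal = re.escape("(555) 123-4567")
print(bool(re.fullmatch(literal, "(555) 123-4567")))  # True
```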
Attempting to parse HTML/XML with regular expressions
Regular expressions cannot correctly parse arbitrarily nested HTML tags; nested structures are beyond what regular languages can describe. Use a dedicated parser such as BeautifulSoup or lxml. Simple tag stripping, for example re.sub(r"<[^>]+>", "", html), is still feasible.
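The sketch below illustrates where simple tag stripping breaks down. To stay dependency-free it uses the standard library's html.parser instead of BeautifulSoup or lxml; the TextCollector class and the sample markup are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# An attribute value containing ">" is enough to confuse the regex.
html = '<p>See <a title="a > b">link</a></p>'

# The regex treats the ">" inside the attribute as the end of the tag.
stripped = re.sub(r"<[^>]+>", "", html)

collector = TextCollector()
collector.feed(html)
collector.close()
parsed = "".join(collector.parts)

print(repr(stripped))  # attribute residue is left behind in the text
print(repr(parsed))    # 'See link'
```

For markup you control and know to be simple, the one-line re.sub() can be sufficient; for arbitrary pages a parser is the safer choice.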