Data Extraction with Regular Expressions

A practical guide to automatically extracting emails, phone numbers, URLs, and more from text using Python's re module and regex patterns

Problem

You need to automatically extract email addresses, phone numbers, URLs, zip codes, and more from large volumes of text documents (thousands of files) such as customer support records, contracts, and email bodies. Manually searching and copying each one takes far too much time and leads to omissions. By defining patterns with regular expressions and running a batch extraction Python script, you can process them accurately and quickly.

Required Tools

Python re module

The regular expression module in Python's standard library. Performs pattern matching with functions such as findall, search, sub, and compile.

regex101.com

An online regex tester. Tests patterns in real-time and visually explains the meaning of each part. Select Python mode.

grep / ripgrep

Searches for regex patterns within files from the terminal. Useful for quick checks, and ripgrep (rg) is faster.

pandas

Performs column-level regex extraction on DataFrame columns using the Series.str.extract() and Series.str.findall() methods.

Solution Steps

1

Understand basic regex syntax

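To use regular expressions effectively, you need to understand the core metacharacters and quantifiers. Here is a summary of the most commonly used patterns. In Python, write patterns as raw strings (r"...") so that backslashes reach the regex engine unchanged. The re module caches recently used patterns, so re.compile() is not required for correctness, but pre-compiling gives the pattern a name and skips the cache lookup on every call, which helps in tight loops.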

import re

# ========================================
# Core Regex Syntax Reference
# ========================================
# .       : Any single character (except newline)
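# \d      : Digit ([0-9]; also other Unicode digits unless re.ASCII is set)
# \w      : Word character ([a-zA-Z0-9_]; also Unicode letters/digits unless re.ASCII is set)
# \s      : Whitespace character (space, tab, newline)
# ^       : Start of string (start of each line with re.MULTILINE)
# $       : End of string (end of each line with re.MULTILINE)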
# *       : 0 or more repetitions
# +       : 1 or more repetitions
# ?       : 0 or 1 occurrence
# {n,m}   : n to m repetitions
# [abc]   : One of a, b, c
# [^abc]  : Any character except a, b, c
# (...)   : Capture group
# (?:...) : Non-capturing group
# (?P<name>...) : Named group
# |       : OR (alternation)
# \b      : Word boundary

# Key functions of the re module
text = "Contact: contact@example.com or 555-123-4567"

re.findall(r"\d+", text)      # ['555', '123', '4567'] - list of all matches
re.search(r"\d+", text)       # First match object
re.sub(r"\d", "X", text)      # Replace digits with X
re.split(r"\s+", text)        # Split by whitespace

# Pattern compilation (improves performance for repeated use)
pattern = re.compile(r"\d{3}-\d{3}-\d{4}")
pattern.findall(text)           # ['555-123-4567']
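
The reference table lists named groups and re.search(), but the snippet above never inspects a match object. A minimal sketch that reuses the same sample string; the group names area, prefix, and line are illustrative choices, not part of the original:

```python
import re

text = "Contact: contact@example.com or 555-123-4567"

# Named groups label each part of the phone number (names are illustrative).
phone_re = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")

match = phone_re.search(text)
if match:
    print(match.group(0))        # whole match: 555-123-4567
    print(match.group("area"))   # 555
    print(match.span())          # (start, end) offsets within text
    print(match.groupdict())     # all named groups as a dict

# With capture groups present, findall returns tuples of groups,
# not the full matched text.
print(phone_re.findall(text))    # [('555', '123', '4567')]
```

Note the last line: as soon as a pattern contains capture groups, findall() stops returning the whole match. Use non-capturing groups (?:...) or finditer() when the full text is needed.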
2

Write an email address pattern

An email address follows the "localpart@domain" format. In practice the local part contains letters, numbers, dots, hyphens, and underscores, and occasionally % or +. The domain consists of letters, numbers, and hyphens, with at least one dot (.). A fully RFC 5322 compliant pattern is extremely complex, but a practical pattern extracts most addresses found in real text.

import re

# Practical email extraction pattern
email_pattern = re.compile(
    r"[a-zA-Z0-9._%+-]+"   # Local part: letters, numbers, special chars
    r"@"                    # @ separator
    r"[a-zA-Z0-9.-]+"      # Domain labels: letters, numbers, dots, hyphens
    r"\.[a-zA-Z]{2,}"      # Final label (TLD): .com, .uk, .io (2+ letters)
)

text = """
Hello, this is John Smith.
For work inquiries, please email john.smith@company.co.uk.
Personal email: jsmith_123@gmail.com
Marketing team: marketing+promo@startup.io
Invalid formats: @invalid, user@, user@.com
"""

emails = email_pattern.findall(text)
print("Extracted emails:")
for email in emails:
    print(f"  - {email}")

# Results:
# - john.smith@company.co.uk
# - jsmith_123@gmail.com
# - marketing+promo@startup.io

# Classify by domain
from collections import Counter
domains = [e.split("@")[1] for e in emails]
print(Counter(domains))  # Counter({'company.co.uk': 1, 'gmail.com': 1, 'startup.io': 1})
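
Extraction and validation are different jobs: findall() finds addresses inside larger text, while fullmatch() checks that an entire string fits the pattern. The sketch below shows both, plus case-insensitive de-duplication. The helper names is_plausible_email and unique_emails are hypothetical, and lowercasing the whole address is an assumption that suits most mail systems but is not guaranteed by the standard:

```python
import re

email_re = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_plausible_email(candidate):
    """True if the whole string (not just part of it) fits the pattern."""
    return email_re.fullmatch(candidate) is not None

def unique_emails(text):
    """Extract addresses, lowercase them, and drop repeats in first-seen order."""
    seen = {}
    for found in email_re.findall(text):
        seen.setdefault(found.lower(), None)
    return list(seen)

sample = "Write to Sales@Example.com or sales@example.com; cc ops@example.org."
print(unique_emails(sample))
print(is_plausible_email("user@.com"))
print(is_plausible_email("john.smith@company.co.uk"))
```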
3

Write a phone number pattern

Phone numbers can be written in various formats. US phone numbers include formats like (555) 123-4567, 555-123-4567, 555.123.4567, and 5551234567. The pattern must handle different separators: hyphens (-), dots (.), spaces, or no separator at all.

import re

# US phone number pattern (with optional country code)
phone_pattern = re.compile(
    r"(?:"
    r"(?:\+?1[-. ]?)?"           # Optional country code (+1)
    r"(?:\(\d{3}\)|\d{3})"     # Area code: (555) or 555
    r"[-. ]?"                     # Separator (hyphen, dot, space, or none)
    r"\d{3}"                      # Middle digits (3 digits)
    r"[-. ]?"                     # Separator
    r"\d{4}"                      # Last digits (4 digits)
    r")"
)

text = """
Customer service: (555) 123-4567
Contact mobile: 555.987.6543
Fax: 555 456 7890
Toll-free: 1-800-555-1234
Direct line: 5551234567
International: +1-555-123-4567
"""

phones = phone_pattern.findall(text)
print("Extracted phone numbers:")
for phone in phones:
    print(f"  - {phone}")

# Normalize phone numbers (standardize to hyphen format)
def normalize_phone(phone):
    digits = re.sub(r"[^\d]", "", phone)  # Extract only digits
    if digits.startswith("1") and len(digits) == 11:
        digits = digits[1:]  # Remove country code
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return phone

for phone in phones:
    print(f"  {phone} -> {normalize_phone(phone)}")
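
One caveat with the pattern above is that it has no boundaries, so it also matches inside longer digit runs such as order or tracking numbers. A hedged sketch of one way to tighten it with lookarounds; the sample text is made up for illustration:

```python
import re

CORE = r"(?:\+?1[-. ]?)?(?:\(\d{3}\)|\d{3})[-. ]?\d{3}[-. ]?\d{4}"

loose = re.compile(CORE)
# (?<!\d) and (?!\d) require that no digit touches either end of the match.
strict = re.compile(r"(?<!\d)" + CORE + r"(?!\d)")

sample = "Order 12345678901234 shipped. Call (555) 123-4567 or 555.987.6543."

print(loose.findall(sample))   # includes a fragment of the order number
print(strict.findall(sample))  # only the two real phone numbers
```

Lookarounds are used instead of \b because \b does not behave as expected next to "(" or "+", which are not word characters.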
4

Write a URL pattern

A URL consists of a protocol (http/https), domain, path, query string, and fragment. A perfect URL regex is extremely complex, but a practical pattern can extract most web URLs. The pattern can also be extended to find bare domains (www.example.com) without a protocol.

import re

# URL extraction pattern
url_pattern = re.compile(
    r"https?://"                    # Protocol (http:// or https://)
    r"(?:www\.)?"                   # www. (optional)
    r"[a-zA-Z0-9]"                  # Domain start (letter/digit)
    r"[a-zA-Z0-9.-]*"               # Domain body
    r"\.[a-zA-Z]{2,}"               # TLD
    # \" escapes the double quote so it does not end the raw string
    r"(?:/[^\s<>\"')\]]*)?",        # Path + query string (optional)
    re.IGNORECASE
)

text = """
Official site: https://www.example.com
API docs: https://api.example.com/v2/docs?lang=en&format=json
GitHub: https://github.com/user/repo/issues/42
Google search: https://www.google.com/search?q=python+regex
Blog: http://blog.example.co.uk/post/2024/01/hello-world
"""

urls = url_pattern.findall(text)
print("Extracted URLs:")
for url in urls:
    print(f"  - {url}")

# Extract only domains from URLs
domain_pattern = re.compile(r"https?://(?:www\.)?([a-zA-Z0-9.-]+)")
for url in urls:
    match = domain_pattern.search(url)
    if match:
        print(f"  Domain: {match.group(1)}")
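
The step description mentions extending the pattern to bare domains such as www.example.com. A minimal sketch of that extension follows. It assumes a bare URL must start with "www." (so ordinary words like "file.txt" are not reported), trims trailing sentence punctuation, and uses urllib.parse from the standard library instead of a second regex to read the host. The helper names extract_urls and hostname are hypothetical:

```python
import re
from urllib.parse import urlsplit

url_re = re.compile(
    r"(?:https?://|www\.)"          # scheme, or a bare www. prefix
    r"[a-zA-Z0-9][a-zA-Z0-9.-]*"    # host labels
    r"\.[a-zA-Z]{2,}"               # TLD
    r"(?:/[^\s<>\"')\]]*)?",        # optional path and query
    re.IGNORECASE,
)

def extract_urls(text):
    """Return URLs with sentence punctuation trimmed from the end."""
    return [m.group(0).rstrip(".,;:!?") for m in url_re.finditer(text)]

def hostname(url):
    """Parse the host with urllib instead of a second regex."""
    if "://" not in url:
        url = "http://" + url       # urlsplit needs a scheme to find the host
    return urlsplit(url).hostname

sample = "See https://example.com/docs/intro. Mirror: www.example.org/files, thanks."
for url in extract_urls(sample):
    print(url, "->", hostname(url))
```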
5

Write a batch extraction script in Python

Create a unified script that extracts emails, phone numbers, and URLs from multiple files at once. Use glob to get the target file list, apply patterns to each file, and organize extraction results into a pandas DataFrame. Recording the source filename and line number alongside each extracted result makes it easy to verify against the original later.

import re
import glob
import pandas as pd
from pathlib import Path

# Define patterns (compiled)
PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(
        r"(?:\+?1[-. ]?)?(?:\(\d{3}\)|\d{3})"   # Same area-code rule as step 3
        r"[-. ]?\d{3}[-. ]?\d{4}"
    ),
    # \" escapes the double quote so it does not end the raw string
    "URL": re.compile(r"https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(?:/[^\s<>\"')\]]*)?"),
    "zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),  # US zip code (5 or 5+4 digits)
}

def extract_from_file(filepath):
    """Extract all patterns from a file"""
    results = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            for pattern_name, pattern in PATTERNS.items():
                matches = pattern.findall(line)
                for match in matches:
                    results.append({
                        "file": Path(filepath).name,
                        "line": line_no,
                        "type": pattern_name,
                        "value": match,
                    })
    return results

def batch_extract(directory, file_pattern="*.txt"):
    """Batch extract from all files in a directory"""
    all_results = []
    files = glob.glob(f"{directory}/{file_pattern}")
    print(f"Target files: {len(files)}")

    for filepath in sorted(files):
        results = extract_from_file(filepath)
        all_results.extend(results)
        print(f"  {Path(filepath).name}: {len(results)} items extracted")

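    # Name the columns explicitly so df["type"] exists even when nothing matched
    df = pd.DataFrame(all_results, columns=["file", "line", "type", "value"])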

    # Statistics by type
    print("\n=== Extraction Results Summary ===")
    print(df["type"].value_counts())

    # Remove duplicates
    df_unique = df.drop_duplicates(subset=["type", "value"])
    print(f"\nUnique values: {len(df_unique)} ({len(df) - len(df_unique)} duplicates removed)")

    # Save
    df_unique.to_csv("extraction_results.csv", index=False, encoding="utf-8-sig")
    print("extraction_results.csv saved!")

    return df_unique

# Run
df = batch_extract("./documents", "*.txt")
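
With thousands of real-world files, some will not be valid UTF-8, and a single decode error would stop the whole run. The sketch below is a standard-library-only variation of the same loop that tolerates bad bytes and yields rows lazily. The function name iter_matches and the two demo files are hypothetical, and only two of the patterns are repeated to keep it short:

```python
import re
import tempfile
from pathlib import Path

PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def iter_matches(directory, file_pattern="*.txt"):
    """Yield (file name, line number, type, value) for every match found."""
    for path in sorted(Path(directory).glob(file_pattern)):
        # errors="replace" keeps one badly encoded file from aborting the run.
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle, 1):
                for name, pattern in PATTERNS.items():
                    for match in pattern.finditer(line):
                        yield (path.name, line_no, name, match.group(0))

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Ship to 10001\nmail bob@example.com\n", encoding="utf-8")
    # \xe9 is not valid UTF-8 here; it is replaced rather than raising an error.
    Path(tmp, "b.txt").write_bytes(b"caf\xe9 owner: amy@example.org, 94105-1234\n")
    rows = list(iter_matches(tmp))

for row in rows:
    print(row)
```

The resulting tuples can be passed to pd.DataFrame(rows, columns=["file", "line", "type", "value"]) if the pandas summary and CSV export are still wanted.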

Core Code

The seven kinds of data most frequently extracted in practice: email, phone number, URL, zip code, IP address, date, and monetary amount. Because none of these patterns contains a capture group, re.findall() returns the full text of every match as a list.

import re

# ========================================
# Core: Practical Data Extraction Pattern Collection
# ========================================

text = """
John Smith, Manager (john.smith@company.co.uk)
Phone: 555-123-4567 / (212) 555-7890
Website: https://www.company.co.uk/about
Zip code: 10001 New York, NY
SSN: 123-45-6789
IP: 192.168.1.100
Date: 2024-03-15
Amount: $1,234,567
"""

patterns = {
    "email":    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "phone":    r"(?:\(?\d{3}\)?|\d{3})[-. ]?\d{3}[-. ]?\d{4}",
    # \" escapes the double quote so it does not end the raw string
    "URL":      r"https?://[^\s<>\"']+",
    "zip_code": r"\b\d{5}\b",
    "IP_addr":  r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",
    "date":     r"\d{4}[-/.]\d{2}[-/.]\d{2}",
    "amount":   r"\$[\d,]+",
}

for name, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{name}: {matches}")
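
The IP and date patterns above check shape only: 999.10.10.10 and 2024-13-45 both match. One possible follow-up, sketched here with the standard ipaddress and datetime modules, is to treat regex hits as candidates and validate them afterwards. The helper names valid_ips and valid_dates are hypothetical, and the date check assumes the hyphenated YYYY-MM-DD form only:

```python
import re
import ipaddress
from datetime import datetime

ip_re = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")
date_re = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def valid_ips(text):
    """Keep only candidates that ipaddress accepts as IPv4 addresses."""
    kept = []
    for candidate in ip_re.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        kept.append(candidate)
    return kept

def valid_dates(text):
    """Keep only candidates that are real calendar dates."""
    kept = []
    for candidate in date_re.findall(text):
        try:
            datetime.strptime(candidate, "%Y-%m-%d")
        except ValueError:
            continue
        kept.append(candidate)
    return kept

sample = "Hosts: 192.168.1.100 and 999.10.10.10. Dates: 2024-03-15, 2024-13-45."
print(valid_ips(sample))
print(valid_dates(sample))
```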

Common Mistakes

Greedy matching extracts a wider range than expected

.* is a greedy quantifier that matches as much as possible. Using .*? for non-greedy (lazy) matching will match only the shortest range. Example: <b>.*</b> matches across multiple tags at once, but <b>.*?</b> matches each tag individually.
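
The difference is easy to see on a one-line sample. This sketch also shows a negated character class, a common alternative to the lazy quantifier:

```python
import re

html = "<b>first</b> and <b>second</b>"

greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)
# A negated character class is a common alternative to the lazy quantifier.
negated = re.findall(r"<b>[^<]*</b>", html)

print(greedy)   # ['<b>first</b> and <b>second</b>']
print(lazy)     # ['<b>first</b>', '<b>second</b>']
print(negated)  # ['<b>first</b>', '<b>second</b>']
```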

International phone number formats (+1-555-...) not handled

Handle international formats as well as domestic ones. Prepend (?:\+?1[-. ]?)? to the pattern to make the US country code optional. In countries that use a trunk prefix (a leading 0 on domestic numbers), that digit is usually dropped after the country code, so the pattern must accept the number with or without it.

Metacharacters interpreted with special meaning due to missing escape characters

To match metacharacters like dots (.), parentheses (()), and brackets ([]) literally, escape them with \. In Python, raw strings (r"...") avoid having to double every backslash: r"\d\.\d" is the same pattern as "\\d\\.\\d".
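
When the text to match comes from data rather than from a hand-written pattern, re.escape() adds the backslashes for you. A small sketch with a made-up version string:

```python
import re

version = "3.10"
text = "Python 3.10 and build 3x10"

unescaped = re.findall(version, text)            # "." matches any character
escaped = re.findall(re.escape(version), text)   # "." matched literally

print(unescaped)  # ['3.10', '3x10']
print(escaped)    # ['3.10']

# A raw string and its doubled-backslash equivalent are the same pattern.
print(r"\d\.\d" == "\\d\\.\\d")  # True
```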

Attempting to parse HTML/XML with regular expressions

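Regular expressions cannot reliably match arbitrarily nested HTML tags, because balanced nesting is beyond what regular languages can describe. Use a dedicated parser such as BeautifulSoup or lxml. Simple tag stripping (re.sub(r"<[^>]+>", "", html)) works on clean markup but breaks on cases such as a > inside an attribute value.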
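
As an illustration of the parser-first approach, the sketch below uses html.parser from the standard library as a stand-in for BeautifulSoup or lxml, so it runs without extra installs. The class name TextExtractor and the helper html_to_text are hypothetical. The idea is to reduce HTML to visible text first and only then apply the extraction patterns:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML fragment as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

html = (
    '<div><p>Mail <a href="mailto:a@example.com">a@example.com</a></p>'
    '<script>var x = "b@example.com";</script></div>'
)
print(html_to_text(html))
```

Running the email pattern on the returned text then finds only the address a reader would see, not the one buried in the script.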
