liminfo

Python Web Crawling (BeautifulSoup)

A practical web crawling guide for sending HTTP requests with requests and parsing HTML with BeautifulSoup4 to automatically collect desired data

Python web crawling, beautifulsoup tutorial, requests scraping, web data collection, HTML parsing, CSS selectors, pagination crawling, pandas data export, robots.txt, User-Agent configuration

Problem

You need to automatically collect property listing information (price, area, location, floor, etc.) from a real estate website. Manually copying data from hundreds of pages containing thousands of listings is impractical, so you want to automate crawling with a Python script, save the results as structured data (CSV/Excel), and use it for analysis. The site's robots.txt policy must be followed, and proper crawling etiquette must be observed to avoid being blocked.

Required Tools

Python 3.x

The language for writing the crawling script. Version 3.8+ is recommended. Use venv to create a virtual environment for dependency management.

requests

An HTTP request library. Conveniently handles GET/POST requests, session management, cookie handling, and timeout configuration.

BeautifulSoup4 (bs4)

An HTML/XML parser. Extracts desired data using CSS selectors or tag traversal. Using it with the lxml parser provides faster performance.

pandas

Organizes collected data into DataFrames and exports to various formats including CSV, Excel, and JSON.

Solution Steps

1

Set up environment and check robots.txt

Before starting to crawl, you must check the target site's robots.txt. Paths that are Disallowed in robots.txt must not be crawled. Ignoring this can lead to legal issues. Install the required packages and set up the virtual environment.

# Create and activate virtual environment
python3 -m venv crawler-env
source crawler-env/bin/activate

# Install required packages
pip install requests beautifulsoup4 lxml pandas openpyxl

# Check robots.txt (Python)
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example-realestate.com/robots.txt")
rp.read()

# Check whether crawling is allowed for a specific path
print(rp.can_fetch("*", "/listings/"))       # True means crawling is allowed
print(rp.can_fetch("*", "/admin/"))          # False means access is forbidden
print(f"Crawl-delay: {rp.crawl_delay('*')}")  # Recommended request interval in seconds (None if not specified)
2

Request web pages with requests

Use the requests library to send HTTP GET requests and retrieve HTML. If the User-Agent header is not set, the request may be identified as a bot and blocked. Using a session object reuses cookies and connections for efficiency, and also handles sites that require login. Timeout and retry logic must also be added.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Session setup (retry + timeout)
session = requests.Session()

# Automatic retry configuration (on 5xx errors, connection failures)
retry_strategy = Retry(
    total=3,
    backoff_factor=1,          # Retry at 1s, 2s, 4s intervals
    status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Header configuration (to prevent bot blocking)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
}
session.headers.update(headers)

# Request a page
url = "https://example-realestate.com/listings?page=1"
response = session.get(url, timeout=10)
response.raise_for_status()  # Raise exception on 4xx, 5xx errors

print(f"Status: {response.status_code}")
print(f"Encoding: {response.encoding}")
html = response.text
3

Parse HTML and extract data with BeautifulSoup

Parse the HTML string into a BeautifulSoup object, then extract desired data using CSS selectors (select) or find/find_all methods. It's convenient to identify CSS selectors for target elements beforehand using the browser's developer tools (F12). Exception handling for missing elements (None) is important when extracting data.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # Use lxml parser (fast)

# Extract listing card list
listings = soup.select("div.listing-card")  # CSS selector

results = []
for item in listings:
    # Extract details from each listing
    title = item.select_one("h3.listing-title")
    price = item.select_one("span.price")
    area = item.select_one("span.area")
    location = item.select_one("span.location")
    floor_info = item.select_one("span.floor")

    # Check for None before extracting and cleaning text
    result = {
        "title": title.get_text(strip=True) if title else "",
        "price": price.get_text(strip=True) if price else "",
        "area": area.get_text(strip=True) if area else "",
        "location": location.get_text(strip=True) if location else "",
        "floor": floor_info.get_text(strip=True) if floor_info else "",
        "link": item.select_one("a")["href"] if item.select_one("a") else "",
    }
    results.append(result)

print(f"Extracted {len(results)} items from this page")
for r in results[:3]:
    print(r)
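The same fields can also be pulled with the find/find_all methods instead of CSS selectors. The markup below is a toy sample, not the real site's HTML:

```python
from bs4 import BeautifulSoup

html_sample = """
<div class="listing-card">
  <h3 class="listing-title">2BR Apartment</h3>
  <span class="price">$350,000</span>
</div>
"""

soup = BeautifulSoup(html_sample, "html.parser")
# find() returns the first matching tag; find_all() returns a list
card = soup.find("div", class_="listing-card")
title = card.find("h3", class_="listing-title").get_text(strip=True)
price = card.find("span", class_="price").get_text(strip=True)
print(title, price)
```

select/select_one and find/find_all are interchangeable for simple cases; CSS selectors are usually shorter for nested structures.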
4

Handle pagination (iterate through multiple pages)

Most sites display data across multiple pages. URL parameter patterns (?page=1, ?page=2) are common, and tracking the "Next" button link is another approach. Add an appropriate delay (1-3 seconds) between each request to avoid overloading the server. Also add intermediate save logic so that data collected so far isn't lost if crawling gets blocked.

import time
import random

def crawl_all_pages(base_url, max_pages=100):
    all_results = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        print(f"[{page}/{max_pages}] Crawling: {url}")

        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(f"Request failed (page {page}): {e}")
            break

        soup = BeautifulSoup(response.text, "lxml")
        listings = soup.select("div.listing-card")

        # Stop if no more listings
        if not listings:
            print(f"Last page: {page - 1}")
            break

        for item in listings:
            result = {
                "title": safe_text(item, "h3.listing-title"),
                "price": safe_text(item, "span.price"),
                "area": safe_text(item, "span.area"),
                "location": safe_text(item, "span.location"),
                "page": page,
            }
            all_results.append(result)

        # Intermediate save (every 10 pages)
        if page % 10 == 0:
            save_to_csv(all_results, "partial_results.csv")
            print(f"  -> Intermediate save complete ({len(all_results)} items)")

        # Server load prevention: random delay of 1-3 seconds
        delay = random.uniform(1.0, 3.0)
        time.sleep(delay)

    return all_results

def safe_text(element, selector):
    """Find element by selector and return text, or empty string if not found"""
    tag = element.select_one(selector)
    return tag.get_text(strip=True) if tag else ""

import csv

def save_to_csv(rows, filename):
    """Write the rows collected so far to CSV (used for intermediate saves)"""
    if not rows:
        return
    with open(filename, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
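For sites without a predictable ?page= pattern, the "Next" button approach mentioned above can be sketched as a small helper. The a.next selector here is an assumption; check the real markup with the developer tools (F12):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def find_next_url(html, current_url, selector="a.next"):
    """Return the absolute URL of the 'Next' page link, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")  # "lxml" also works and is faster
    link = soup.select_one(selector)
    if link and link.get("href"):
        # href is often relative; urljoin resolves it against the current page
        return urljoin(current_url, link["href"])
    return None
```

In the crawl loop, replace the page counter with `url = find_next_url(response.text, url)` and stop when it returns None.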
5

Organize and save with pandas DataFrame

Converting collected data into a pandas DataFrame makes filtering, sorting, and statistical analysis easy. Data cleaning such as extracting numbers from price strings or standardizing area units is performed at this stage. Save the final results in your desired format: CSV, Excel, JSON, etc.

import pandas as pd
import re

# Run the crawler
base_url = "https://example-realestate.com/listings"
data = crawl_all_pages(base_url, max_pages=50)

# Create DataFrame
df = pd.DataFrame(data)
print(f"Total {len(df)} items collected")
print(df.head())

# Data cleaning
# Extract numbers from price (e.g., "$350,000" -> 350000)
def parse_price(price_str):
    """Convert price string to numeric value"""
    if not price_str:
        return None
    price_str = price_str.replace(",", "").replace("$", "")
    match = re.search(r"(\d+)", price_str)
    return int(match.group(1)) if match else None

df["price_numeric"] = df["price"].apply(parse_price)

# Extract numbers from area (e.g., "84.95 sqm" -> 84.95)
df["area_sqm"] = df["area"].str.extract(r"([\d.]+)").astype(float)

# Remove duplicates and sort
df = df.drop_duplicates(subset=["title", "location"])
df = df.sort_values("price_numeric", ascending=True)

# Save
df.to_csv("property_listings.csv", index=False, encoding="utf-8-sig")
df.to_excel("property_listings.xlsx", index=False)
df.to_json("property_listings.json", orient="records", force_ascii=False, indent=2)

print("Save complete! (CSV, Excel, JSON)")
print(f"Average price: {df['price_numeric'].mean():,.0f}")
print(f"Area range: {df['area_sqm'].min()} ~ {df['area_sqm'].max()} sqm")
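Once the numeric columns exist, aggregate analysis is a one-liner with groupby. The rows below are made-up samples in the same shape as the crawled DataFrame:

```python
import pandas as pd

# Hypothetical sample rows matching the cleaned DataFrame's columns
df = pd.DataFrame({
    "location": ["Downtown", "Downtown", "Suburb"],
    "price_numeric": [350000, 410000, 220000],
    "area_sqm": [84.95, 101.2, 76.0],
})

# Average price and listing count per location
summary = df.groupby("location").agg(
    avg_price=("price_numeric", "mean"),
    listings=("price_numeric", "count"),
)
print(summary)
```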

Core Code

A complete crawler combining requests + BeautifulSoup + pandas. Includes session reuse, pagination, random delays, and CSV export.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time, random

# ========================================
# Core: Complete Web Crawler (Request -> Parse -> Save)
# ========================================

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
})

def get_text(card, selector):
    """Return the stripped text of the first match, or "" if the element is missing."""
    tag = card.select_one(selector)
    return tag.get_text(strip=True) if tag else ""

def crawl(base_url, max_pages=50):
    results = []
    for page in range(1, max_pages + 1):
        resp = session.get(f"{base_url}?page={page}", timeout=10)
        resp.raise_for_status()

        soup = BeautifulSoup(resp.text, "lxml")
        cards = soup.select("div.listing-card")
        if not cards:
            break

        for card in cards:
            results.append({
                "title": get_text(card, "h3"),
                "price": get_text(card, ".price"),
                "area": get_text(card, ".area"),
                "location": get_text(card, ".location"),
            })
        time.sleep(random.uniform(1.0, 3.0))   # Polite crawling
        time.sleep(random.uniform(1.0, 3.0))   # Polite crawling

    return pd.DataFrame(results)

df = crawl("https://example-realestate.com/listings")
df.to_csv("listings.csv", index=False, encoding="utf-8-sig")
print(f"{len(df)} items saved")

Common Mistakes

Accessing crawl-restricted paths without checking robots.txt

Always check robots.txt before crawling. You can automate this check with the urllib.robotparser module. Crawling Disallowed paths can lead to legal issues (copyright law, computer fraud, etc.).

IP blocked due to rapid consecutive crawling without request intervals

Add a random delay between requests with time.sleep(random.uniform(1.0, 3.0)). If robots.txt specifies a Crawl-delay directive, honor that value as the minimum interval. If you receive a 429 Too Many Requests response, stop immediately and increase the interval.
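One way to honor 429 responses is a small wrapper that reads the Retry-After response header. This is a sketch; the session argument is anything with a requests-style get method:

```python
import time

def get_with_backoff(session, url, max_attempts=4):
    """Retry on 429 Too Many Requests, honoring Retry-After when present."""
    for attempt in range(max_attempts):
        resp = session.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Retry-After gives seconds to wait; otherwise back off exponentially
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return resp  # Give up and return the last 429 response
```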

Blocked because the default User-Agent (python-requests/x.x.x) exposes the request as a script

Set a User-Agent identical to a regular browser with session.headers.update({"User-Agent": "..."}). Also setting Accept and Accept-Language headers reduces the chance of being blocked.

Attempting to fetch dynamically rendered JavaScript content with requests

requests only retrieves static HTML. Content loaded via SPA or Ajax must be collected using Selenium, Playwright, or direct API calls. Look for the actual data API in the browser developer tools Network tab.
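If the Network tab reveals a JSON endpoint, you can skip HTML parsing entirely. The endpoint URL and field names here are hypothetical:

```python
import json

# In practice the payload comes from the API directly, e.g.:
# payload = session.get("https://example-realestate.com/api/listings",
#                       params={"page": 1}, timeout=10).json()
raw = '{"listings": [{"title": "2BR Apartment", "price": 350000}]}'

payload = json.loads(raw)
# Already structured data -- no BeautifulSoup needed
rows = [{"title": it["title"], "price": it["price"]} for it in payload["listings"]]
print(rows)
```

API responses are usually more stable than HTML class names, so prefer this route when it exists.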
