
Convert Excel Data to JSON for API Integration

A hands-on guide to building a pipeline that parses weekly Excel reports with Python pandas, cleanses the data, converts it to JSON, and uploads it to a REST API server.

excel to json conversion, pandas excel processing, Excel API integration, openpyxl, data conversion automation, read_excel, to_json, requests, REST API, data pipeline, ETL automation, excel parsing

Problem

Every Monday, the sales team sends an Excel report (.xlsx) via email, which needs to be manually processed and uploaded in JSON format to an internal API server. The Excel file contains mixed data formats including dates, amounts, and customer names, and empty cells or formatting errors frequently occur. This repetitive task needs to be automated with a Python script so that dropping in an Excel file handles everything from data cleansing to API transmission in one step.

Required Tools

Python 3.x

The core runtime for the data processing script. Version 3.9 or above is recommended, since the examples use built-in generic type hints such as list[dict].

pandas

A Python data analysis library. Optimized for reading, transforming, and cleansing Excel data using DataFrames.

openpyxl

A Python library for reading and writing .xlsx files. Used as the Excel reading engine for pandas.

requests

A Python HTTP library. Used for sending JSON data to REST APIs via POST/PUT requests.

Solution Steps

1

Environment setup and reading Excel files

First, install the required libraries and load the Excel file into a DataFrame with pandas' read_excel(). Use sheet_name to select a specific sheet and header to set the header row position. To understand the file's structure first, inspect the data types and sample rows with df.info() and df.head().

# Install libraries
pip install pandas openpyxl requests

# === excel_reader.py ===
import pandas as pd
from pathlib import Path

# Read Excel file
excel_path = Path("weekly_report.xlsx")
df = pd.read_excel(
    excel_path,
    sheet_name="Sales Summary",   # Specify a particular sheet
    header=0,                      # First row is the header
    dtype={
        'customer_code': str,      # Keep as str to prevent leading zeros from being trimmed
        'phone_number': str,
    }
)

# Check data structure
print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
print(df.info())
print(df.head())

# Read all sheets at once
all_sheets = pd.read_excel(excel_path, sheet_name=None)
for sheet_name, sheet_df in all_sheets.items():
    print(f"Sheet '{sheet_name}': {len(sheet_df)} rows")
2

Data cleansing and validation

Real-world Excel files commonly have issues such as empty cells, incorrect formats, and duplicate data. Perform cleansing tasks like NaN value handling, date format standardization, string whitespace removal, and duplicate row removal. Validation catches missing required fields and out-of-range values in advance to prevent API submission failures.

import pandas as pd
import numpy as np

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Excel data cleansing function"""

    # 1. Remove completely empty rows
    df = df.dropna(how='all')

    # 2. Trim whitespace from string columns
    str_columns = df.select_dtypes(include='object').columns
    for col in str_columns:
        df[col] = df[col].astype(str).str.strip()
        df[col] = df[col].replace('nan', np.nan)

    # 3. Convert date columns (Excel date -> ISO format)
    date_columns = ['order_date', 'delivery_date']
    for col in date_columns:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')
            # ISO string conversion for API is handled during JSON conversion

    # 4. Convert amount columns to numeric (remove commas and currency symbols)
    money_columns = ['order_amount', 'discount_amount', 'payment_amount']
    for col in money_columns:
        if col in df.columns:
            df[col] = (df[col].astype(str)
                       .str.replace(',', '', regex=False)
                       .str.replace('$', '', regex=False)  # '$' is a regex anchor; disable regex
                       .str.strip())
            df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)

    # 5. Remove duplicate rows (based on order number)
    if 'order_id' in df.columns:
        dup_count = df.duplicated(subset='order_id').sum()
        if dup_count > 0:
            print(f"  Removed {dup_count} duplicates")
        df = df.drop_duplicates(subset='order_id', keep='last')

    # 6. Required field validation
    required = ['order_id', 'customer_name', 'order_amount']
    for col in required:
        if col in df.columns:
            null_count = df[col].isna().sum()
            if null_count > 0:
                print(f"  Warning: {null_count} missing values in '{col}' field")

    return df.reset_index(drop=True)

# Usage
df_clean = clean_dataframe(df)
print(f"After cleansing: {len(df_clean)} rows")
3

Convert DataFrame to JSON format

Convert the cleansed DataFrame to the JSON format the API server expects. Use pandas' to_dict() (or to_json()), but pay attention to date formatting and character encoding. Map key names according to the API specification, and build nested structures separately if needed.

import json
from datetime import datetime

def dataframe_to_api_json(df: pd.DataFrame) -> list[dict]:
    """Convert DataFrame to JSON list for API submission"""

    # Column name mapping
    column_mapping = {
        'order_id': 'order_id',
        'order_date': 'order_date',
        'customer_code': 'customer_code',
        'customer_name': 'customer_name',
        'product_name': 'product_name',
        'quantity': 'quantity',
        'order_amount': 'amount',
        'discount_amount': 'discount',
        'payment_amount': 'payment',
        'delivery_date': 'delivery_date',
        'notes': 'note',
    }

    df_mapped = df.rename(columns=column_mapping)

    # Convert date columns to ISO format strings
    for col in ['order_date', 'delivery_date']:
        if col in df_mapped.columns:
            df_mapped[col] = df_mapped[col].apply(
                lambda x: x.isoformat() if pd.notna(x) else None
            )

    # Convert NaN to None and generate list of dictionaries
    records = df_mapped.where(pd.notna(df_mapped), None).to_dict(orient='records')

    return records

# Convert
json_data = dataframe_to_api_json(df_clean)

# Verify (ensure_ascii=False to preserve non-ASCII characters)
print(json.dumps(json_data[:2], indent=2, ensure_ascii=False, default=str))

# Example output:
# [
#   {
#     "order_id": "ORD-2024-001",
#     "order_date": "2024-12-02T00:00:00",
#     "customer_name": "John Doe",
#     "amount": 150000,
#     "payment": 135000
#   },
#   ...
# ]
4

Send JSON data via REST API

Use the requests library to POST the converted JSON data to the API server. Sending large amounts of data all at once can cause timeouts, so split it into batches. The code also handles the authentication token and verifies response status codes.

import requests
import time
from typing import Any

API_BASE_URL = "https://api.example.com/v1"
API_TOKEN = "your-api-token-here"  # Recommended to read from environment variables
BATCH_SIZE = 50  # Number of records to send at once

def send_to_api(records: list[dict], endpoint: str = "/orders") -> dict[str, Any]:
    """Send JSON records to API in batches"""

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_TOKEN}",
    }

    results = {"success": 0, "failed": 0, "errors": []}
    total_batches = (len(records) + BATCH_SIZE - 1) // BATCH_SIZE

    for i in range(0, len(records), BATCH_SIZE):
        batch = records[i:i + BATCH_SIZE]
        batch_num = i // BATCH_SIZE + 1

        try:
            response = requests.post(
                f"{API_BASE_URL}{endpoint}",
                json={"data": batch},
                headers=headers,
                timeout=30,
            )

            if response.status_code in (200, 201):
                results["success"] += len(batch)
                print(f"  Batch {batch_num}/{total_batches}: "
                      f"{len(batch)} records succeeded")
            else:
                results["failed"] += len(batch)
                results["errors"].append({
                    "batch": batch_num,
                    "status": response.status_code,
                    "message": response.text[:200],
                })
                print(f"  Batch {batch_num}/{total_batches}: "
                      f"Failed (HTTP {response.status_code})")

        except requests.exceptions.Timeout:
            results["failed"] += len(batch)
            results["errors"].append({
                "batch": batch_num,
                "status": "TIMEOUT",
                "message": "Request timed out (30 seconds)",
            })
        except requests.exceptions.ConnectionError as e:
            results["failed"] += len(batch)
            results["errors"].append({
                "batch": batch_num,
                "status": "CONNECTION_ERROR",
                "message": str(e)[:200],
            })

        # Delay to prevent API overload
        time.sleep(0.5)

    return results

# Execute send
print(f"Sending {len(json_data)} records (batch size: {BATCH_SIZE})")
result = send_to_api(json_data)
print(f"\nComplete: {result['success']} succeeded, {result['failed']} failed")
5

Error handling, retry logic, and logging

For reliable automation, add retry logic for failed batches, detailed logging, and result report generation. Use an exponential backoff strategy for retries to absorb transient network errors, and save the complete processing result to a JSON file for later review.

import logging
from functools import wraps

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s',
    handlers=[
        logging.FileHandler('excel_to_api.log', encoding='utf-8'),
        logging.StreamHandler(),
    ]
)
logger = logging.getLogger(__name__)

def retry_with_backoff(max_retries=3, base_delay=1.0):
    """Exponential backoff retry decorator"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except (requests.exceptions.Timeout,
                        requests.exceptions.ConnectionError) as e:
                    if attempt == max_retries:
                        logger.error(f"Max retries exceeded: {e}")
                        raise
                    delay = base_delay * (2 ** attempt)
                    logger.warning(
                        f"Retry {attempt + 1}/{max_retries} "
                        f"(after {delay}s): {e}"
                    )
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def send_batch(batch: list[dict], headers: dict) -> requests.Response:
    """Send a single batch (with retries)"""
    return requests.post(
        f"{API_BASE_URL}/orders",
        json={"data": batch},
        headers=headers,
        timeout=30,
    )

# === Run full pipeline ===
def run_pipeline(excel_path: str):
    """Excel -> JSON -> API full pipeline"""
    logger.info(f"Pipeline started: {excel_path}")

    # Step 1: Read Excel
    df = pd.read_excel(excel_path, sheet_name="Sales Summary")
    logger.info(f"Excel loaded: {len(df)} rows")

    # Step 2: Cleanse data
    df_clean = clean_dataframe(df)
    logger.info(f"Data cleansed: {len(df_clean)} rows")

    # Step 3: Convert to JSON
    records = dataframe_to_api_json(df_clean)
    logger.info(f"JSON conversion complete: {len(records)} records")

    # Step 4: Send to API
    result = send_to_api(records)
    logger.info(
        f"API send complete: "
        f"{result['success']} succeeded, {result['failed']} failed"
    )

    # Step 5: Save results
    report_path = f"report_{datetime.now():%Y%m%d_%H%M%S}.json"
    with open(report_path, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    logger.info(f"Report saved: {report_path}")

    return result

if __name__ == '__main__':
    import sys
    excel_file = sys.argv[1] if len(sys.argv) > 1 else "weekly_report.xlsx"
    run_pipeline(excel_file)

Core Code

Core code showing the entire flow from reading Excel to sending to the API. For production use, add error handling, retries, and logging.

#!/usr/bin/env python3
"""Excel -> JSON -> API conversion pipeline core code"""

import pandas as pd
import requests
import json

# 1. Read Excel
df = pd.read_excel("report.xlsx", sheet_name="Sales Summary",
                    dtype={'customer_code': str})

# 2. Cleanse data
df = df.dropna(how='all')
df['order_amount'] = pd.to_numeric(
    df['order_amount'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
).fillna(0).astype(int)
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')

# 3. Convert to JSON (column name mapping + date conversion)
column_map = {'order_id': 'order_id', 'customer_name': 'customer_name',
              'order_amount': 'amount', 'order_date': 'order_date'}
df = df.rename(columns=column_map)
df['order_date'] = df['order_date'].apply(
    lambda x: x.isoformat() if pd.notna(x) else None
)
records = df.where(pd.notna(df), None).to_dict(orient='records')

# 4. Send to API (in batches)
headers = {"Content-Type": "application/json",
           "Authorization": "Bearer YOUR_TOKEN"}
BATCH = 50
for i in range(0, len(records), BATCH):
    batch = records[i:i+BATCH]
    resp = requests.post("https://api.example.com/v1/orders",
                         json={"data": batch}, headers=headers, timeout=30)
    print(f"Batch {i//BATCH+1}: {resp.status_code}")

Common Mistakes

Excel dates being read as integers (serial numbers)

Excel internally stores dates as serial numbers, where serial 1 corresponds to 1900-01-01. pandas read_excel() usually auto-converts, but if the cell format is "General", the values are read as plain numbers like 44927. Manually convert with pd.to_datetime(df["date"], unit="D", origin="1899-12-30") (the 1899-12-30 origin compensates for the one-day offset and Excel's phantom 1900 leap day), or correctly set the date format in the original Excel file.
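A minimal sketch of the serial-to-date conversion, assuming hypothetical serial values as they might appear when the cell format is "General":

```python
import pandas as pd

# Hypothetical raw serial numbers read from a "General"-formatted column
serials = pd.Series([44927, 44928])

# '1899-12-30' origin accounts for the off-by-one and Excel's 1900 leap-day bug
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates.dt.strftime("%Y-%m-%d").tolist())  # → ['2023-01-01', '2023-01-02']
```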

NaN values being transmitted as "NaN" strings in JSON

Python's float("nan") is not a valid value in the JSON standard. json.dumps() outputs NaN as-is, causing API parsing errors. Convert NaN to None in the DataFrame with .where(pd.notna(df), None), or separately remove NaN values after to_dict().
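The NaN-to-None conversion can be sketched as follows, using a hypothetical two-row DataFrame with a missing note field:

```python
import json
import numpy as np
import pandas as pd

# Hypothetical data: the second row has an empty 'note' cell (NaN)
df = pd.DataFrame({"order_id": ["A1", "A2"], "note": ["ok", np.nan]})

# Without this step, json.dumps would emit bare NaN, which most API servers reject
records = df.where(pd.notna(df), None).to_dict(orient="records")
print(json.dumps(records))  # NaN becomes null in the serialized output
```

Passing allow_nan=False to json.dumps makes the serializer fail fast instead of silently emitting invalid JSON, which can help catch a missed conversion.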

Non-ASCII characters in JSON getting corrupted at the API server

The json= parameter in requests.post() automatically encodes to UTF-8. However, when saving to a file, not using json.dumps(data, ensure_ascii=False) will escape non-ASCII characters as \uXXXX. Explicitly adding charset=utf-8 to the API Content-Type header is also safer.
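The difference is easy to see side by side; the payload below is a hypothetical non-ASCII customer name:

```python
import json

data = {"customer_name": "山田太郎"}  # hypothetical non-ASCII payload

print(json.dumps(data))                      # escaped: {"customer_name": "\u5c71..."}
print(json.dumps(data, ensure_ascii=False))  # readable UTF-8: {"customer_name": "山田太郎"}

# When saving to a file, pair ensure_ascii=False with an explicit encoding
with open("payload.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
```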

Timeouts when POSTing large amounts of data all at once

Sending thousands of records at once can cause memory issues on both the server and client. Split into batches of 50-100 records, with 0.5-1 second intervals between each batch to distribute the load on the API server. Retry logic for failed batches only is also essential.
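The batch splitting used throughout this guide can be isolated into a small helper; this is a sketch, with `chunked` being a hypothetical name rather than a library function:

```python
from typing import Iterator


def chunked(records: list, size: int) -> Iterator[list]:
    """Yield successive fixed-size slices of a record list."""
    for i in range(0, len(records), size):
        yield records[i:i + size]


# 230 records in batches of 50 -> four full batches plus a remainder
batches = list(chunked(list(range(230)), 50))
print([len(b) for b in batches])  # → [50, 50, 50, 50, 30]
```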
