Convert Excel Data to JSON for API Integration
A hands-on guide to building an automation pipeline that parses weekly Excel reports with Python pandas, cleanses the data, converts it to JSON, and uploads it to a REST API server.
Problem
Weekly sales reports arrive as Excel files, but the downstream system accepts data only as JSON through a REST API. Converting and uploading the data by hand each week is slow and error-prone, so the whole flow needs to be automated.
Required Tools
Python — the core runtime for the data processing script. Version 3.8 or above is recommended.
pandas — a Python data analysis library, used here to read, transform, and cleanse Excel data through DataFrames.
openpyxl — a Python library for reading and writing .xlsx files. Serves as the Excel reading engine for pandas.
requests — a Python HTTP library. Used to send the JSON data to the REST API via POST/PUT requests.
Solution Steps
Environment setup and reading Excel files
First, install the required libraries and load the Excel file into a DataFrame with pandas read_excel(). Use sheet_name to pick a specific sheet and header to set the header row position. Before doing anything else, inspect the file's structure: df.info() shows column data types and non-null counts, and df.head() shows sample rows.
# Install libraries
pip install pandas openpyxl requests

# === excel_reader.py ===
import pandas as pd
from pathlib import Path

# Read Excel file
excel_path = Path("weekly_report.xlsx")
df = pd.read_excel(
    excel_path,
    sheet_name="Sales Summary",  # Specify a particular sheet
    header=0,                    # First row is the header
    dtype={
        'customer_code': str,  # Keep as str to prevent leading zeros from being trimmed
        'phone_number': str,
    }
)

# Check data structure
print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
df.info()  # prints dtypes and non-null counts itself; no need to wrap in print()
print(df.head())

# Read all sheets at once
all_sheets = pd.read_excel(excel_path, sheet_name=None)
for sheet_name, sheet_df in all_sheets.items():
    print(f"Sheet '{sheet_name}': {len(sheet_df)} rows")

Data cleansing and validation
Real-world Excel files commonly have issues such as empty cells, incorrect formats, and duplicate data. Perform cleansing tasks like NaN value handling, date format standardization, string whitespace removal, and duplicate row removal. Validation catches missing required fields and out-of-range values in advance to prevent API submission failures.
import pandas as pd
import numpy as np

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Excel data cleansing function"""
    # 1. Remove completely empty rows
    df = df.dropna(how='all')

    # 2. Trim whitespace from string columns
    str_columns = df.select_dtypes(include='object').columns
    for col in str_columns:
        df[col] = df[col].astype(str).str.strip()
        df[col] = df[col].replace('nan', np.nan)

    # 3. Convert date columns (Excel date -> ISO format)
    date_columns = ['order_date', 'delivery_date']
    for col in date_columns:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')
            # ISO string conversion for the API is handled during JSON conversion

    # 4. Convert amount columns to numeric (remove commas and currency symbols)
    #    regex=False treats ',' and '$' as plain characters ('$' is a regex anchor otherwise)
    money_columns = ['order_amount', 'discount_amount', 'payment_amount']
    for col in money_columns:
        if col in df.columns:
            df[col] = (df[col].astype(str)
                       .str.replace(',', '', regex=False)
                       .str.replace('$', '', regex=False)
                       .str.strip())
            df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)

    # 5. Remove duplicate rows (based on order number)
    if 'order_id' in df.columns:
        dup_count = df.duplicated(subset='order_id').sum()
        if dup_count > 0:
            print(f"Removed {dup_count} duplicates")
            df = df.drop_duplicates(subset='order_id', keep='last')

    # 6. Required field validation
    required = ['order_id', 'customer_name', 'order_amount']
    for col in required:
        if col in df.columns:
            null_count = df[col].isna().sum()
            if null_count > 0:
                print(f"Warning: {null_count} missing values in '{col}' field")

    return df.reset_index(drop=True)

# Usage
df_clean = clean_dataframe(df)
print(f"After cleansing: {len(df_clean)} rows")

Convert DataFrame to JSON format
Convert the cleansed DataFrame to the JSON format the API server expects. pandas to_dict() and to_json() do the heavy lifting, but pay attention to date formatting and character encoding. Map the column names to the field names in the API specification, and build nested structures separately if the API requires them.
import json
from datetime import datetime

def dataframe_to_api_json(df: pd.DataFrame) -> list[dict]:
    """Convert DataFrame to JSON list for API submission"""
    # Column name mapping (adjust to match the API specification)
    column_mapping = {
        'order_id': 'order_id',
        'order_date': 'order_date',
        'customer_code': 'customer_code',
        'customer_name': 'customer_name',
        'product_name': 'product_name',
        'quantity': 'quantity',
        'order_amount': 'amount',
        'discount_amount': 'discount',
        'payment_amount': 'payment',
        'delivery_date': 'delivery_date',
        'notes': 'note',
    }
    df_mapped = df.rename(columns=column_mapping)

    # Convert date columns to ISO format strings
    for col in ['order_date', 'delivery_date']:
        if col in df_mapped.columns:
            df_mapped[col] = df_mapped[col].apply(
                lambda x: x.isoformat() if pd.notna(x) else None
            )

    # Convert NaN to None and generate list of dictionaries
    # (cast to object first so None is not coerced back to NaN in numeric columns)
    records = (df_mapped.astype(object)
               .where(pd.notna(df_mapped), None)
               .to_dict(orient='records'))
    return records

# Convert
json_data = dataframe_to_api_json(df_clean)

# Verify (ensure_ascii=False to preserve non-ASCII characters)
print(json.dumps(json_data[:2], indent=2, ensure_ascii=False, default=str))
# Example output:
# [
#   {
#     "order_id": "ORD-2024-001",
#     "order_date": "2024-12-02T00:00:00",
#     "customer_name": "John Doe",
#     "amount": 150000,
#     "payment": 135000
#   },
#   ...
# ]

Send JSON data via REST API
Use the requests library to POST the converted JSON data to the API server. Sending large amounts of data all at once can cause timeouts, so split it into batches for transmission. This also includes authentication token handling and response status code verification.
import requests
import time
from typing import Any

API_BASE_URL = "https://api.example.com/v1"
API_TOKEN = "your-api-token-here"  # Recommended to read from environment variables
BATCH_SIZE = 50  # Number of records to send at once

def send_to_api(records: list[dict], endpoint: str = "/orders") -> dict[str, Any]:
    """Send JSON records to the API in batches"""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_TOKEN}",
    }
    results = {"success": 0, "failed": 0, "errors": []}
    total_batches = (len(records) + BATCH_SIZE - 1) // BATCH_SIZE

    for i in range(0, len(records), BATCH_SIZE):
        batch = records[i:i + BATCH_SIZE]
        batch_num = i // BATCH_SIZE + 1
        try:
            response = requests.post(
                f"{API_BASE_URL}{endpoint}",
                json={"data": batch},
                headers=headers,
                timeout=30,
            )
            if response.status_code in (200, 201):
                results["success"] += len(batch)
                print(f"Batch {batch_num}/{total_batches}: "
                      f"{len(batch)} records succeeded")
            else:
                results["failed"] += len(batch)
                results["errors"].append({
                    "batch": batch_num,
                    "status": response.status_code,
                    "message": response.text[:200],
                })
                print(f"Batch {batch_num}/{total_batches}: "
                      f"failed (HTTP {response.status_code})")
        except requests.exceptions.Timeout:
            results["failed"] += len(batch)
            results["errors"].append({
                "batch": batch_num,
                "status": "TIMEOUT",
                "message": "Request timed out (30 seconds)",
            })
        except requests.exceptions.ConnectionError as e:
            results["failed"] += len(batch)
            results["errors"].append({
                "batch": batch_num,
                "status": "CONNECTION_ERROR",
                "message": str(e)[:200],
            })
        # Delay to prevent API overload
        time.sleep(0.5)

    return results

# Execute send
print(f"Sending {len(json_data)} records (batch size: {BATCH_SIZE})")
result = send_to_api(json_data)
print(f"\nComplete: {result['success']} succeeded, {result['failed']} failed")

Error handling, retry logic, and logging
For reliable automation, add retry logic for failed batches, detailed logging, and result report generation. Use exponential backoff strategy for retries to handle transient network errors. Save the complete processing result to a JSON file for later review.
import logging
from functools import wraps

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s',
    handlers=[
        logging.FileHandler('excel_to_api.log', encoding='utf-8'),
        logging.StreamHandler(),
    ]
)
logger = logging.getLogger(__name__)

def retry_with_backoff(max_retries=3, base_delay=1.0):
    """Exponential backoff retry decorator"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except (requests.exceptions.Timeout,
                        requests.exceptions.ConnectionError) as e:
                    if attempt == max_retries:
                        logger.error(f"Max retries exceeded: {e}")
                        raise
                    delay = base_delay * (2 ** attempt)
                    logger.warning(
                        f"Retry {attempt + 1}/{max_retries} "
                        f"(after {delay}s): {e}"
                    )
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def send_batch(batch: list[dict], headers: dict) -> requests.Response:
    """Send a single batch (with retries).
    send_to_api can call this helper in place of requests.post directly."""
    return requests.post(
        f"{API_BASE_URL}/orders",
        json={"data": batch},
        headers=headers,
        timeout=30,
    )

# === Run full pipeline ===
def run_pipeline(excel_path: str):
    """Excel -> JSON -> API full pipeline"""
    logger.info(f"Pipeline started: {excel_path}")

    # Step 1: Read Excel
    df = pd.read_excel(excel_path, sheet_name="Sales Summary")
    logger.info(f"Excel loaded: {len(df)} rows")

    # Step 2: Cleanse data
    df_clean = clean_dataframe(df)
    logger.info(f"Data cleansed: {len(df_clean)} rows")

    # Step 3: Convert to JSON
    records = dataframe_to_api_json(df_clean)
    logger.info(f"JSON conversion complete: {len(records)} records")

    # Step 4: Send to API
    result = send_to_api(records)
    logger.info(
        f"API send complete: "
        f"{result['success']} succeeded, {result['failed']} failed"
    )

    # Step 5: Save results
    report_path = f"report_{datetime.now():%Y%m%d_%H%M%S}.json"
    with open(report_path, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    logger.info(f"Report saved: {report_path}")

    return result

if __name__ == '__main__':
    import sys
    excel_file = sys.argv[1] if len(sys.argv) > 1 else "weekly_report.xlsx"
    run_pipeline(excel_file)

Core Code
Core code showing the entire flow from reading Excel to sending to the API. For production use, add error handling, retries, and logging.
#!/usr/bin/env python3
"""Excel -> JSON -> API conversion pipeline core code"""
import pandas as pd
import requests
import json

# 1. Read Excel
df = pd.read_excel("report.xlsx", sheet_name="Sales Summary",
                   dtype={'customer_code': str})

# 2. Cleanse data
df = df.dropna(how='all')
df['order_amount'] = pd.to_numeric(
    df['order_amount'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
).fillna(0).astype(int)
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')

# 3. Convert to JSON (column name mapping + date conversion)
column_map = {'order_id': 'order_id', 'customer_name': 'customer_name',
              'order_amount': 'amount', 'order_date': 'order_date'}
df = df.rename(columns=column_map)
df['order_date'] = df['order_date'].apply(
    lambda x: x.isoformat() if pd.notna(x) else None
)
# Cast to object so NaN becomes a real None (serialized as JSON null)
records = df.astype(object).where(pd.notna(df), None).to_dict(orient='records')

# 4. Send to API (in batches)
headers = {"Content-Type": "application/json",
           "Authorization": "Bearer YOUR_TOKEN"}
BATCH = 50
for i in range(0, len(records), BATCH):
    batch = records[i:i+BATCH]
    resp = requests.post("https://api.example.com/v1/orders",
                         json={"data": batch}, headers=headers, timeout=30)
    print(f"Batch {i//BATCH+1}: {resp.status_code}")

Common Mistakes
Excel dates being read as integers (serial numbers)
Excel stores dates internally as serial numbers counting days from its epoch (serial 1 is 1900-01-01). pandas read_excel() usually auto-converts them, but if the cell format is "General", the values come through as plain numbers like 44927. Convert manually with pd.to_datetime(df["date"], unit="D", origin="1899-12-30") (the 1899-12-30 origin compensates for Excel's fictitious 1900 leap day), or correctly set the date format in the original Excel file.
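A quick way to verify the conversion is to run it on made-up serial values (44927 and 45292 below are illustrative):

```python
import pandas as pd

# Hypothetical frame whose dates came through as Excel serial numbers
df = pd.DataFrame({"date": [44927, 45292]})

# Excel's effective day zero is 1899-12-30 (two days before 1900-01-01,
# compensating for the phantom 1900-02-29), so this recovers the real dates
df["date"] = pd.to_datetime(df["date"], unit="D", origin="1899-12-30")

print(df["date"].dt.strftime("%Y-%m-%d").tolist())
# ['2023-01-01', '2024-01-01']
```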
NaN values being transmitted as "NaN" strings in JSON
Python's float("nan") is not a valid value in the JSON standard, yet json.dumps() emits the literal NaN by default, which causes parsing errors on the API server. Replace NaN with None in the DataFrame using df.astype(object).where(pd.notna(df), None) (casting to object first prevents None from being coerced back to NaN in numeric columns), or strip NaN values from the records after to_dict().
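A minimal sketch of both the failure and the fix, using a made-up one-column frame:

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [100.0, np.nan]})

# json.dumps() emits a bare NaN literal by default -- not valid JSON
bad = json.dumps(df.to_dict(orient="records"))
# bad == '[{"amount": 100.0}, {"amount": NaN}]'

# Cast to object first so where() can substitute a real None,
# which serializes as a proper JSON null
records = df.astype(object).where(pd.notna(df), None).to_dict(orient="records")
good = json.dumps(records)
# good == '[{"amount": 100.0}, {"amount": null}]'
```

Passing allow_nan=False to json.dumps() makes the first case raise a ValueError instead of silently producing invalid JSON, which is a useful guard in pipelines.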
Non-ASCII characters in JSON getting corrupted at the API server
The json= parameter of requests.post() encodes the payload as UTF-8 automatically, so corruption there is rare. When saving to a file, however, omitting ensure_ascii=False in json.dumps() escapes every non-ASCII character as \uXXXX (still valid JSON, but unreadable). Explicitly adding charset=utf-8 to the Content-Type header is also safer, since some servers assume a different default encoding.
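A short illustration of the two serialization modes (the payload and file name are made up):

```python
import json

data = {"customer_name": "佐藤"}  # any non-ASCII text

escaped = json.dumps(data)
# '{"customer_name": "\u4f50\u85e4"}' -- valid JSON, but unreadable

readable = json.dumps(data, ensure_ascii=False)
# '{"customer_name": "佐藤"}'

# When writing to disk, pair ensure_ascii=False with an explicit encoding
with open("payload.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

# When building the request manually (rather than via requests' json=),
# declare the charset explicitly in the header
headers = {"Content-Type": "application/json; charset=utf-8"}
```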
Timeouts when POSTing large amounts of data all at once
Sending thousands of records in one request risks timeouts and memory pressure on both server and client. Split the payload into batches of 50-100 records and pause 0.5-1 seconds between batches to spread the load on the API server. Retrying only the failed batches, rather than re-sending everything, is also essential.
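The batch splitting used throughout this guide boils down to a small helper; the record count and batch size below are illustrative:

```python
from typing import Iterator

def chunked(records: list, size: int) -> Iterator[list]:
    """Yield successive batches of at most `size` records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# 120 dummy records with a batch size of 50 -> batches of 50, 50, and 20
batches = list(chunked(list(range(120)), 50))
sizes = [len(b) for b in batches]
print(sizes)  # [50, 50, 20]
```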