liminfo

Python Data Analysis with Pandas/NumPy

A practical step-by-step guide to the complete data analysis workflow: loading, cleaning, transforming, statistical analysis, and visualization using Pandas and NumPy

Python data analysis, Pandas tutorial, NumPy usage, data cleaning, DataFrame manipulation, data visualization, EDA analysis, exploratory data analysis

Problem

You need to analyze e-commerce sales data (CSV, 1 million rows). The data contains missing values, duplicate rows, and incorrect data formats (dates as strings, amounts with commas). You need to perform monthly/category sales trends, customer segment purchase patterns, and outlier detection, then derive insights for management reporting. Excel cannot handle the data volume, so you need to build an automated analysis pipeline in Python.

Required Tools

Pandas

Python's core data analysis library. Efficiently loads, cleans, transforms, and aggregates tabular data using the DataFrame structure.

NumPy

Numerical computing library. Processes vector/matrix operations at C-level speed and serves as the underlying engine for Pandas.
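As a quick illustration of why this matters, a single NumPy expression replaces an explicit Python loop. The arrays below are made-up sample values:

```python
import numpy as np

# Hypothetical price and quantity arrays (illustrative values only)
prices = np.array([19.99, 5.50, 120.00, 3.25])
quantities = np.array([2, 10, 1, 4])

# One vectorized multiply instead of a Python-level loop over rows
revenue = prices * quantities
total = revenue.sum()
```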

Matplotlib / Seaborn

Data visualization libraries. Generate statistical charts, distributions, heatmaps, and more.

Jupyter Notebook

Interactive data analysis environment. Work with code, execution results, and visualizations in a single document.

scipy.stats

Scientific computing library providing statistical tests (t-test, chi-squared, etc.) and probability distribution functions.
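For example, a two-sample t-test can check whether order amounts differ significantly between two groups. This is a minimal sketch on synthetic normally distributed data; in the workflow below you would pass real weekday/weekend slices instead:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic order amounts for two hypothetical customer groups
weekday = rng.normal(loc=50, scale=10, size=500)
weekend = rng.normal(loc=53, scale=10, size=500)

# Welch's t-test: does not assume equal variances between groups
t_stat, p_value = stats.ttest_ind(weekday, weekend, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```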

Solution Steps

1

Data loading and initial exploration (EDA)

The first step in analysis is loading the data and understanding its structure. Use shape, dtypes, describe(), and info() to quickly check data size, types, and basic statistics. For large CSVs, specify dtypes up front to reduce memory usage and have read_csv parse date columns directly.

import pandas as pd
import numpy as np

# === Load data (optimized) ===
df = pd.read_csv(
    'sales_data.csv',
    parse_dates=['order_date'],        # Auto-parse dates
    dtype={
        'customer_id': 'str',
        'product_id': 'str',
        'category': 'category',        # Save memory with categorical
        'quantity': 'int32',
    },
    thousands=',',                      # Handle comma-separated numbers
    na_values=['N/A', 'null', '-'],     # Values to treat as missing
)

# === Initial exploration ===
print(f"Shape: {df.shape}")           # (1000000, 12)
print(f"Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(df.dtypes)
print(df.head(10))

# Basic statistics
print(df.describe())                   # Numeric columns
print(df.describe(include='object'))   # String columns

# Check missing values
missing = df.isnull().sum()
print(missing[missing > 0])
print(f"Missing ratio (%):\n{(df.isnull().mean() * 100).round(2)}")

# Check duplicates
dupes = df.duplicated().sum()
print(f"Duplicate rows: {dupes} ({dupes/len(df)*100:.2f}%)")

# Unique value counts per column
print(df.nunique())
2

Data cleaning (missing values, duplicates, outliers)

Drop rows with missing values or impute them with appropriate substitutes, and remove duplicate rows. Detect outliers using the IQR (interquartile range) method or Z-scores. Convert columns to the correct data types and filter out invalid values (negative amounts, future dates, etc.).

# === Remove duplicates ===
print(f"Before: {len(df)} rows")
df = df.drop_duplicates(subset=['order_id'], keep='first')
print(f"After: {len(df)} rows")

# === Handle missing values ===
# Numeric: replace with median
df['amount'] = df['amount'].fillna(df['amount'].median())
df['quantity'] = df['quantity'].fillna(1)

# Categorical: replace with mode
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Drop rows with missing critical fields
df = df.dropna(subset=['customer_id', 'order_date'])

# === Data type conversion ===
if df['amount'].dtype == 'object':
    df['amount'] = (
        df['amount']
        .str.replace(',', '', regex=False)
        .astype(float)
    )

# === Outlier detection (IQR method) ===
def detect_outliers_iqr(series, multiplier=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    return (series < lower) | (series > upper)

outlier_mask = detect_outliers_iqr(df['amount'])
print(f"Outliers: {outlier_mask.sum()} ({outlier_mask.mean()*100:.2f}%)")

# Remove outliers or cap them
df_clean = df[~outlier_mask].copy()

# Or capping (replace extreme values with boundaries)
Q1, Q3 = df['amount'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df['amount'] = df['amount'].clip(lower=Q1 - 1.5*IQR, upper=Q3 + 1.5*IQR)

# === Validation ===
assert (df['amount'] >= 0).all(), "Negative amounts exist!"
assert (df['quantity'] >= 1).all(), "Zero or negative quantities exist!"
assert df['order_date'].max() <= pd.Timestamp.now(), "Future dates exist!"
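The step description also mentions Z-scores. A minimal sketch of that alternative, shown here on a small synthetic series rather than the full df:

```python
import pandas as pd

def detect_outliers_zscore(series, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold

# Synthetic example: one extreme value among moderate amounts
amounts = pd.Series([100, 102, 98, 101, 99, 100, 5000])
mask = detect_outliers_zscore(amounts, threshold=2.0)
```

Note that a single extreme value inflates the mean and standard deviation, so the Z-score method works best on roughly normal data; for skewed sales amounts the IQR approach above is usually more robust.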
3

Data transformation and feature engineering

Create derived variables needed for analysis. Extract month/weekday/quarter from dates and add revenue calculation columns. Encode categorical variables and add group-level statistics as new features.

# === Date-derived features ===
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['quarter'] = df['order_date'].dt.quarter
df['day_of_week'] = df['order_date'].dt.day_name()
df['is_weekend'] = df['order_date'].dt.dayofweek >= 5
df['year_month'] = df['order_date'].dt.to_period('M')

# === Revenue calculations ===
df['total_amount'] = df['amount'] * df['quantity']
# Avoid division by zero when original_price is 0
df['discount_rate'] = (df['original_price'] - df['amount']) / df['original_price'].replace(0, np.nan)
df['discount_rate'] = df['discount_rate'].clip(0, 1)

# === Customer segments (RFM analysis basics) ===
now = df['order_date'].max() + pd.Timedelta(days=1)
rfm = df.groupby('customer_id').agg(
    recency=('order_date', lambda x: (now - x.max()).days),
    frequency=('order_id', 'nunique'),
    monetary=('total_amount', 'sum'),
).reset_index()

# Segment classification: quartile scores (rank(method='first') breaks ties
# so qcut always finds four distinct bin edges)
rfm['r_score'] = pd.qcut(rfm['recency'].rank(method='first'), 4, labels=[4, 3, 2, 1])
rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['m_score'] = pd.qcut(rfm['monetary'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['rfm_score'] = (
    rfm['r_score'].astype(str) +
    rfm['f_score'].astype(str) +
    rfm['m_score'].astype(str)
)

# === Merge group statistics back to original ===
category_avg = df.groupby('category')['total_amount'].mean().rename('category_avg_amount')
df = df.merge(category_avg, on='category', how='left')
df['vs_category_avg'] = df['total_amount'] / df['category_avg_amount']

print(f"Feature engineering complete: {df.shape[1]} columns")
4

Aggregation analysis and pivot tables

Aggregate data from various angles using groupby() and pivot_table(). Derive business insights including monthly revenue trends, category performance, and customer segment patterns.

# === Monthly revenue trend ===
monthly_sales = (
    df.groupby('year_month')
    .agg(
        revenue=('total_amount', 'sum'),
        orders=('order_id', 'nunique'),
        customers=('customer_id', 'nunique'),
        avg_order_value=('total_amount', 'mean'),
    )
    .reset_index()
)
monthly_sales['revenue_growth'] = monthly_sales['revenue'].pct_change() * 100
print(monthly_sales.tail(12))

# === Category revenue analysis ===
category_analysis = (
    df.groupby('category')
    .agg(
        total_revenue=('total_amount', 'sum'),
        avg_price=('amount', 'mean'),
        total_qty=('quantity', 'sum'),
        order_count=('order_id', 'nunique'),
    )
    .sort_values('total_revenue', ascending=False)
)
category_analysis['revenue_share'] = (
    category_analysis['total_revenue'] / category_analysis['total_revenue'].sum() * 100
)
print(category_analysis)

# === Pivot table: Monthly x Category revenue ===
pivot = df.pivot_table(
    values='total_amount',
    index='year_month',
    columns='category',
    aggfunc='sum',
    fill_value=0,
)
print(pivot.tail(6))

# === Day-of-week pattern ===
dow_pattern = df.groupby('day_of_week')['total_amount'].agg(['sum', 'mean', 'count'])
print(dow_pattern)

# === NumPy advanced statistics ===
amounts = df['total_amount'].values

print(f"Mean: {np.mean(amounts):,.0f}")
print(f"Median: {np.median(amounts):,.0f}")
print(f"Std Dev: {np.std(amounts):,.0f}")
print(f"Skewness: {pd.Series(amounts).skew():.2f}")
print(f"Kurtosis: {pd.Series(amounts).kurtosis():.2f}")
print(f"Percentiles: {np.percentile(amounts, [25, 50, 75, 90, 95, 99])}")
5

Data visualization

Visualize analysis results with Matplotlib and Seaborn. Create revenue trend line charts, category bar charts, distribution histograms, and correlation heatmaps. Apply clean styles suitable for management reporting.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid')

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Monthly revenue trend
ax = axes[0, 0]
ax.plot(monthly_sales['year_month'].astype(str), monthly_sales['revenue'], marker='o')
ax.set_title('Monthly Revenue Trend')
ax.set_xlabel('Month')
ax.set_ylabel('Revenue')
ax.tick_params(axis='x', rotation=45)

# 2. Revenue share by category (pie chart)
ax = axes[0, 1]
top_categories = category_analysis.head(8)
ax.pie(
    top_categories['total_revenue'],
    labels=top_categories.index,
    autopct='%1.1f%%',
    startangle=90,
)
ax.set_title('Revenue by Category')

# 3. Order amount distribution (histogram + KDE)
ax = axes[1, 0]
sns.histplot(df['total_amount'], bins=50, kde=True, ax=ax)
ax.set_title('Order Amount Distribution')
ax.set_xlabel('Amount')
ax.axvline(df['total_amount'].median(), color='r', linestyle='--', label='Median')
ax.legend()

# 4. Average revenue by day of week
ax = axes[1, 1]
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_data = df.groupby('day_of_week')['total_amount'].mean().reindex(day_order)
sns.barplot(x=dow_data.index, y=dow_data.values, ax=ax, palette='viridis')
ax.set_title('Average Revenue by Day of Week')
ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('sales_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

# === Correlation heatmap ===
numeric_cols = df.select_dtypes(include=[np.number]).columns
corr_matrix = df[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
plt.show()
6

Save results and automate reporting

Export analysis results to an Excel file with multiple sheets and wrap the workflow in an automated pipeline function that can be run daily or weekly to refresh reports with the latest data.

# === Export results to Excel (multi-sheet) ===
with pd.ExcelWriter('sales_report.xlsx', engine='openpyxl') as writer:
    monthly_sales.to_excel(writer, sheet_name='Monthly Summary', index=False)
    category_analysis.to_excel(writer, sheet_name='Category Analysis')
    rfm.to_excel(writer, sheet_name='Customer RFM', index=False)
    pivot.to_excel(writer, sheet_name='Monthly by Category')

print("Excel report saved: sales_report.xlsx")

# === Automated pipeline function ===
def analyze_sales(filepath: str, output_dir: str = './reports'):
    """Sales data analysis automation pipeline"""
    import os
    os.makedirs(output_dir, exist_ok=True)

    # 1. Load
    df = pd.read_csv(filepath, parse_dates=['order_date'])
    print(f"Loaded: {len(df)} rows")

    # 2. Clean
    df = df.drop_duplicates(subset=['order_id'])
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    df = df.dropna(subset=['customer_id', 'amount'])

    # 3. Feature engineering
    df['total_amount'] = df['amount'] * df['quantity']
    df['year_month'] = df['order_date'].dt.to_period('M')

    # 4. Aggregate
    summary = df.groupby('year_month').agg(
        revenue=('total_amount', 'sum'),
        orders=('order_id', 'nunique'),
    ).reset_index()

    # 5. Key KPIs
    kpi = {
        'total_revenue': df['total_amount'].sum(),
        'avg_order_value': df['total_amount'].mean(),
        'total_customers': df['customer_id'].nunique(),
        'total_orders': df['order_id'].nunique(),
    }

    # 6. Save
    summary.to_csv(f'{output_dir}/monthly_summary.csv', index=False)
    pd.DataFrame([kpi]).to_csv(f'{output_dir}/kpi.csv', index=False)

    print(f"Analysis complete. KPI: {kpi}")
    return kpi

# Run
kpi = analyze_sales('sales_data.csv')

Core Code

The core Pandas analysis pipeline in one block: load -> clean -> create features -> aggregate -> pivot -> segment customers.

import pandas as pd
import numpy as np

# === Core: Pandas Data Analysis Pipeline ===

# 1. Load (optimized)
df = pd.read_csv('sales.csv', parse_dates=['order_date'],
                  dtype={'category': 'category'})

# 2. Clean
df = df.drop_duplicates(subset=['order_id'])
df['amount'] = df['amount'].fillna(df['amount'].median())

# 3. Feature creation
df['total'] = df['amount'] * df['quantity']
df['year_month'] = df['order_date'].dt.to_period('M')

# 4. Aggregation analysis
monthly = df.groupby('year_month').agg(
    revenue=('total', 'sum'),
    orders=('order_id', 'nunique'),
    aov=('total', 'mean'),
).reset_index()
monthly['growth'] = monthly['revenue'].pct_change() * 100

# 5. Pivot table
pivot = df.pivot_table(values='total', index='year_month',
                       columns='category', aggfunc='sum', fill_value=0)

# 6. RFM customer segments
rfm = df.groupby('customer_id').agg(
    recency=('order_date', lambda x: (df['order_date'].max() - x.max()).days),
    frequency=('order_id', 'nunique'),
    monetary=('total', 'sum'),
)

Common Mistakes

Assigning values to a slice of the original DataFrame instead of a copy, causing SettingWithCopyWarning

Create an explicit copy with df_subset = df[condition].copy() before modifying. Or use .loc[] for direct assignment: df.loc[condition, "column"] = value.
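A minimal illustration with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'amount': [100, -5, 250], 'status': ['ok', 'ok', 'ok']})

# Risky: chained indexing may write to a temporary copy and raise
# SettingWithCopyWarning, silently leaving df unchanged:
# df[df['amount'] < 0]['status'] = 'invalid'

# Safe option 1: explicit copy before modifying
subset = df[df['amount'] < 0].copy()
subset['status'] = 'invalid'

# Safe option 2: single .loc assignment on the original frame
df.loc[df['amount'] < 0, 'status'] = 'invalid'
```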

Not specifying dtypes causing unnecessarily high memory usage

Specify dtype='category' for categorical data, 'int32' for integers, and 'str' for IDs. The category type alone can save 50-90% memory.
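A quick way to see the saving on a hypothetical low-cardinality column:

```python
import pandas as pd

# A string column with only three distinct values, repeated many times
s_obj = pd.Series(['Electronics', 'Books', 'Toys'] * 100_000)
s_cat = s_obj.astype('category')

obj_mb = s_obj.memory_usage(deep=True) / 1e6
cat_mb = s_cat.memory_usage(deep=True) / 1e6
print(f"object: {obj_mb:.1f} MB, category: {cat_mb:.1f} MB")
```

The category dtype stores each distinct string once and keeps only small integer codes per row, which is where the saving comes from.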

Processing rows in a for loop causing extremely slow performance

Use Pandas vectorized operations (df["col1"] * df["col2"]), apply(), or groupby().transform(). Vectorized operations are 100-1000x faster than for loops.
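A sketch contrasting the row loop with vectorized equivalents (random data; column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'amount': rng.uniform(1, 100, 100_000),
    'quantity': rng.integers(1, 10, 100_000),
})

# Slow: Python-level row iteration (avoid)
# totals = [row['amount'] * row['quantity'] for _, row in df.iterrows()]

# Fast: vectorized column arithmetic runs in NumPy's C loops
df['total'] = df['amount'] * df['quantity']

# Group statistics without loops: transform broadcasts each group's mean
# back onto every row of that group
df['bulk_order'] = df['quantity'] >= 5
df['group_avg'] = df.groupby('bulk_order')['amount'].transform('mean')
```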
