
A/B Test Statistical Significance Testing

Covers how to statistically determine whether the conversion rate difference in a landing page A/B test is a real effect rather than random chance. Guides you step by step through hypothesis setup, sample size calculation, chi-squared test, z-test, and p-value interpretation.

AB test statistics, significance testing, p-value interpretation, sample size calculation, chi-squared test, conversion rate comparison, z-test, hypothesis testing, null hypothesis, alternative hypothesis, statistical power, significance level, MDE, minimum detectable effect

Problem

The new landing page (Variant B) shows a higher conversion rate than the existing page (Variant A), but it is unclear whether this difference is statistically significant. Variant A (control) has 150 conversions out of 5,000 visitors (3.0%), and Variant B (treatment) has 195 conversions out of 5,000 visitors (3.9%). You need to determine whether the 0.9 percentage point difference is a real effect or random variation, and make a data-driven decision on whether to continue the test or roll out Variant B to all traffic.

Required Tools

Python 3.x

Runtime environment for statistical analysis scripts. Version 3.8 or above is recommended.

scipy.stats

The statistics module of SciPy. Provides various statistical tests including chi-squared tests, z-tests, and normal distribution functions.

Sample Size Calculator

Pre-calculates the minimum sample size required before starting a test. Uses statsmodels or online calculators.

Solution Steps

1

Set up hypotheses (H0 / H1) and significance level

All statistical tests begin with hypothesis formulation.

- Null hypothesis (H0): there is no difference in conversion rates between the two groups (pA = pB)
- Alternative hypothesis (H1): there is a difference in conversion rates between the two groups (pA != pB) -- two-tailed test
- Alternative (one-tailed): Variant B's conversion rate is higher than Variant A's (pB > pA)

The significance level (alpha) is conventionally set at 0.05 (5%): the upper bound on the probability of incorrectly rejecting the null hypothesis when it is true (a Type I error). Power is typically targeted at 0.80 (80%): the probability of detecting an effect when one truly exists. A two-tailed test is more conservative because it does not assume the direction of the effect in advance; a one-tailed test checks only one direction, which yields higher power but can miss an effect in the opposite direction.

# Hypothesis setup summary
# H0: p_A = p_B  (no difference in conversion rates)
# H1: p_A != p_B (difference exists, two-tailed test)
#
# Significance level: alpha = 0.05
# Power:             power = 0.80
# Current rate:      p_A = 3.0% (control, existing page)
# Observed rate:     p_B = 3.9% (treatment, new page)

alpha = 0.05
power = 0.80
p_control = 0.030    # Variant A conversion rate
p_treatment = 0.039  # Variant B conversion rate

# Minimum Detectable Effect (MDE)
mde = p_treatment - p_control  # 0.009 (0.9pp)
relative_lift = mde / p_control  # 30% relative improvement
print(f"Absolute difference: {mde:.1%}")
print(f"Relative improvement: {relative_lift:.1%}")
2

Pre-calculate sample size (Power Analysis)

Before starting the test, calculate the minimum sample size needed to obtain statistically significant results. Skipping this step can leave you with too few samples to draw any conclusion, or keep the test running far longer than necessary. Factors affecting sample size:

- Baseline conversion rate
- Minimum Detectable Effect (MDE): the smallest difference you want to detect
- Significance level (alpha): typically 0.05
- Power: typically 0.80

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
import math

def calculate_sample_size(p_control, mde, alpha=0.05, power=0.80):
    """Calculate minimum sample size per group for an A/B test"""

    p_treatment = p_control + mde

    # Calculate Cohen's h effect size
    effect_size = proportion_effectsize(p_treatment, p_control)

    # Calculate required sample size (per group)
    analysis = NormalIndPower()
    n_per_group = analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    return math.ceil(n_per_group)

# === Sample size calculation for different scenarios ===
scenarios = [
    {"name": "MDE 0.5pp", "mde": 0.005},
    {"name": "MDE 1.0pp", "mde": 0.010},
    {"name": "MDE 1.5pp", "mde": 0.015},
    {"name": "MDE 2.0pp", "mde": 0.020},
]

p_control = 0.030

print(f"Baseline conversion rate: {p_control:.1%}")
print(f"Significance level: 5%, Power: 80%")
print("-" * 45)

for s in scenarios:
    n = calculate_sample_size(p_control, s["mde"])
    total = n * 2
    print(f"{s['name']}: {n:,} per group ({total:,} total needed)")

# Example output (approximate):
# MDE 0.5pp: ~19,716 per group (~39,432 total needed)
# MDE 1.0pp:  ~5,276 per group (~10,552 total needed)
# MDE 1.5pp:  ~2,494 per group  (~4,988 total needed)
# MDE 2.0pp:  ~1,484 per group  (~2,968 total needed)
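As a cross-check on NormalIndPower, the per-group sample size can also be computed from the closed-form normal-approximation formula n = 2 * ((z_alpha/2 + z_beta) / h)^2, where h is Cohen's effect size. A minimal sketch, assuming equal group sizes:

```python
import math
from scipy import stats

def sample_size_cohens_h(p_control, p_treatment, alpha=0.05, power=0.80):
    """Closed-form per-group sample size via Cohen's h (equal group sizes)."""
    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * math.asin(math.sqrt(p_treatment)) - 2 * math.asin(math.sqrt(p_control))
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = stats.norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / h) ** 2)

# Should agree with the NormalIndPower result for the MDE 1.0pp scenario
n = sample_size_cohens_h(0.030, 0.040)
print(f"Closed-form n per group (MDE 1.0pp): {n:,}")
```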
3

Data collection and descriptive statistics

After running the test and accumulating sufficient data, first check the conversion rate and confidence interval for each group. Organize the actual A/B test data and calculate the sample size, number of conversions, conversion rate, and 95% confidence interval for each group. It is important to understand the basic characteristics of the data before performing the test.

import numpy as np
from scipy import stats

# === A/B Test Data ===
# Variant A (control): existing landing page
n_A = 5000        # Number of visitors
x_A = 150         # Number of conversions
p_A = x_A / n_A   # Conversion rate: 3.0%

# Variant B (treatment): new landing page
n_B = 5000        # Number of visitors
x_B = 195         # Number of conversions
p_B = x_B / n_B   # Conversion rate: 3.9%

# === Descriptive Statistics Output ===
def confidence_interval(successes, total, confidence=0.95):
    """Confidence interval for a proportion (Wald method)"""
    p = successes / total
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    se = np.sqrt(p * (1 - p) / total)
    return (p - z * se, p + z * se)

ci_A = confidence_interval(x_A, n_A)
ci_B = confidence_interval(x_B, n_B)

print("=" * 55)
print(f"{'Group':<10} {'Visitors':>8} {'Conv':>6} {'Rate':>8} {'95% CI':>16}")
print("-" * 55)
print(f"{'A(ctrl)':10} {n_A:>8,} {x_A:>6} {p_A:>8.2%} "
      f"[{ci_A[0]:.2%}, {ci_A[1]:.2%}]")
print(f"{'B(treat)':10} {n_B:>8,} {x_B:>6} {p_B:>8.2%} "
      f"[{ci_B[0]:.2%}, {ci_B[1]:.2%}]")
print("=" * 55)
print(f"Rate difference: {p_B - p_A:+.2%} (relative {(p_B-p_A)/p_A:+.1%})")
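The Wald interval used above is simple but can be inaccurate for proportions near 0 or 1. statsmodels' proportion_confint offers the Wilson score interval, which has better coverage; a quick comparison on the same counts:

```python
from statsmodels.stats.proportion import proportion_confint

# Wilson score interval: better coverage than Wald for small proportions
for label, x, n in [("A", 150, 5000), ("B", 195, 5000)]:
    lo, hi = proportion_confint(x, n, alpha=0.05, method="wilson")
    print(f"{label}: 95% Wilson CI [{lo:.2%}, {hi:.2%}]")
```

At these sample sizes the Wald and Wilson intervals are close; the difference matters more with small n or rates well under 1%.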
4

Perform chi-squared test or z-test

Two methods are commonly used for comparing conversion rates in A/B tests: the chi-squared test and the two-proportion z-test. For a 2x2 table the two are mathematically equivalent (chi2 = z^2, provided the z-test uses the pooled standard error and no continuity correction is applied) and yield the same two-sided p-value. The chi-squared test is based on the difference between observed and expected frequencies, while the z-test divides the difference between the two proportions by its standard error.

from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# ========================================
# Method 1: Chi-squared test
# ========================================

# 2x2 contingency table
#              Converted  Not Converted
# Variant A    150        4850
# Variant B    195        4805
contingency_table = np.array([
    [x_A, n_A - x_A],   # A: [converted, not converted]
    [x_B, n_B - x_B],   # B: [converted, not converted]
])

chi2, p_value_chi2, dof, expected = chi2_contingency(
    contingency_table,
    correction=False  # Without Yates correction (when sample is large enough)
)

print("=== Chi-squared Test Results ===")
print(f"Chi-squared statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value_chi2:.6f}")
print(f"Significant (alpha=0.05): {'Yes' if p_value_chi2 < 0.05 else 'No'}")

# ========================================
# Method 2: Two-proportion z-test
# ========================================

count = np.array([x_B, x_A])   # Conversion counts (treatment first)
nobs = np.array([n_B, n_A])    # Sample sizes

z_stat, p_value_z = proportions_ztest(
    count, nobs,
    alternative='two-sided'  # Two-tailed test
)

print("\n=== Two-proportion z-test Results ===")
print(f"z statistic: {z_stat:.4f}")
print(f"p-value: {p_value_z:.6f}")
print(f"Significant (alpha=0.05): {'Yes' if p_value_z < 0.05 else 'No'}")

# One-tailed test (testing B > A direction only)
z_stat_one, p_value_one = proportions_ztest(
    count, nobs,
    alternative='larger'  # One-tailed: treatment > control
)
print(f"\nOne-tailed p-value (B > A): {p_value_one:.6f}")
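The equivalence between the two tests (chi2 = z^2 for a 2x2 table, with pooled variance and no continuity correction) can be checked directly on this data. A self-contained sketch:

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

n_A, x_A = 5000, 150
n_B, x_B = 5000, 195

table = np.array([[x_A, n_A - x_A], [x_B, n_B - x_B]])
chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)

# proportions_ztest pools the variance under H0 by default (prop_var=False)
z_stat, p_z = proportions_ztest([x_B, x_A], [n_B, n_A], alternative="two-sided")

# For a 2x2 table, the Pearson chi-squared statistic is the squared z statistic
print(f"z^2  = {z_stat**2:.4f}")
print(f"chi2 = {chi2:.4f}")
print(f"p-values equal: {np.isclose(p_chi2, p_z)}")
```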
5

p-value interpretation and decision making

The p-value is the probability of obtaining results as extreme as, or more extreme than, those observed, assuming the null hypothesis (no difference in conversion rates) is true. Decision criteria:

- p < 0.05: reject the null hypothesis -> the conversion rate difference is statistically significant -> consider adopting Variant B
- p >= 0.05: cannot reject the null hypothesis -> the difference may be due to chance -> continue testing or keep Variant A

Caution: a p-value below 0.05 does not by itself mean the difference is practically meaningful. Consider effect size and business context together.

# === Comprehensive Analysis and Decision Making ===

def ab_test_decision(n_a, x_a, n_b, x_b, alpha=0.05):
    """A/B test comprehensive analysis and decision report"""

    p_a = x_a / n_a
    p_b = x_b / n_b
    diff = p_b - p_a
    relative_lift = diff / p_a if p_a > 0 else float('inf')

    # z-test
    count = np.array([x_b, x_a])
    nobs = np.array([n_b, n_a])
    z_stat, p_val = proportions_ztest(count, nobs, alternative='two-sided')

    # 95% confidence interval for conversion rate difference
    se_diff = np.sqrt(p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci_lower = diff - z_crit * se_diff
    ci_upper = diff + z_crit * se_diff

    # Post-hoc power analysis
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower
    es = proportion_effectsize(p_b, p_a)
    observed_power = NormalIndPower().power(
        effect_size=es, nobs1=n_a, alpha=alpha, alternative='two-sided'
    )

    print("=" * 60)
    print("          A/B Test Comprehensive Analysis Report")
    print("=" * 60)
    print(f"Variant A rate: {p_a:.2%} ({x_a}/{n_a})")
    print(f"Variant B rate: {p_b:.2%} ({x_b}/{n_b})")
    print(f"Absolute diff:  {diff:+.2%}")
    print(f"Relative lift:  {relative_lift:+.1%}")
    print(f"Diff 95% CI:    [{ci_lower:+.2%}, {ci_upper:+.2%}]")
    print("-" * 60)
    print(f"z statistic:    {z_stat:.4f}")
    print(f"p-value:        {p_val:.6f}")
    print(f"Observed power: {observed_power:.2%}")
    print("-" * 60)

    if p_val < alpha:
        if ci_lower > 0:
            print("Conclusion: Variant B's conversion rate is significantly higher.")
            print("Recommendation: Consider rolling out Variant B to all traffic.")
        else:
            print("Conclusion: Statistically significant, but CI includes 0.")
            print("Recommendation: Collect more data.")
    else:
        if observed_power < 0.80:
            print("Conclusion: Not significant, but statistical power is insufficient.")
            print("Recommendation: Increase sample size and continue testing.")
        else:
            print("Conclusion: Not significant with sufficient power.")
            print("Recommendation: The conversion rate difference may be practically meaningless.")

    print("=" * 60)

# Execute
ab_test_decision(
    n_a=5000, x_a=150,   # Variant A: 150 conversions out of 5,000
    n_b=5000, x_b=195,   # Variant B: 195 conversions out of 5,000
)

Core Code

Core code for A/B test significance testing. The chi-squared test and z-test provide identical conclusions for a 2x2 table.

import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# A/B test data
n_A, x_A = 5000, 150   # Variant A: visitors, conversions
n_B, x_B = 5000, 195   # Variant B: visitors, conversions

# === Chi-squared test ===
table = np.array([[x_A, n_A - x_A],
                  [x_B, n_B - x_B]])
chi2, p_chi2, dof, _ = chi2_contingency(table, correction=False)

# === z-test ===
z_stat, p_ztest = proportions_ztest(
    [x_B, x_A], [n_B, n_A], alternative='two-sided'
)

# === Confidence interval for rate difference ===
p_A, p_B = x_A/n_A, x_B/n_B
se = np.sqrt(p_A*(1-p_A)/n_A + p_B*(1-p_B)/n_B)
ci = (p_B - p_A - 1.96*se, p_B - p_A + 1.96*se)

print(f"A: {p_A:.2%}, B: {p_B:.2%}, diff: {p_B-p_A:+.2%}")
print(f"p-value: {p_ztest:.6f}")
print(f"95% CI: [{ci[0]:+.2%}, {ci[1]:+.2%}]")
print(f"Significant: {'Yes' if p_ztest < 0.05 else 'No'}")

Common Mistakes

Repeatedly checking results mid-test and stopping early (Peeking Problem)

Repeatedly checking before reaching the pre-determined sample size dramatically increases the Type I error (false positive) rate. For example, checking the p-value daily can raise the actual significance level from 5% to as high as 20-30%. Always collect data up to the pre-calculated sample size and test only once, or use a sequential testing method.
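The inflation is easy to demonstrate with a small A/A simulation: both groups share the same true rate, so every "significant" result is a false positive. The traffic volumes, five interim looks, and simulation count below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

P_TRUE = 0.03                            # same true rate in both groups (A/A)
LOOKS = [1000, 2000, 3000, 4000, 5000]   # cumulative sample sizes per group
N_SIMS = 2000

def two_prop_p(x_a, n_a, x_b, n_b):
    """Two-sided pooled two-proportion z-test p-value."""
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (x_b / n_b - x_a / n_a) / se
    return 2 * stats.norm.sf(abs(z))

peek_fp = final_fp = 0
for _ in range(N_SIMS):
    x_a = x_b = prev = 0
    stopped_early = False
    for n in LOOKS:
        x_a += rng.binomial(n - prev, P_TRUE)
        x_b += rng.binomial(n - prev, P_TRUE)
        prev = n
        if two_prop_p(x_a, n, x_b, n) < 0.05:
            stopped_early = True     # would have declared a winner here
    peek_fp += stopped_early
    final_fp += two_prop_p(x_a, LOOKS[-1], x_b, LOOKS[-1]) < 0.05

print(f"False positive rate, single final test: {final_fp / N_SIMS:.1%}")
print(f"False positive rate with 5 peeks:       {peek_fp / N_SIMS:.1%}")
```

Testing once at the end keeps the false positive rate near the nominal 5%, while stopping at the first significant peek roughly doubles or triples it.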

Sample size too small to obtain significant results

Always perform a power analysis before starting the test. To detect a 0.5 percentage point lift from a baseline conversion rate of 3% (at alpha = 0.05 and 80% power), you need roughly 20,000 people per group. If total daily traffic is 1,000 (500 per group), the test must run for at least 40 days. If traffic is insufficient, consider increasing the MDE or relaxing the significance level.

Ignoring multiple comparisons correction

When simultaneously testing multiple metrics (conversion rate, click-through rate, time on site, etc.) in one test, or testing multiple variants A/B/C/D, the probability that at least one appears "significant" increases. Adjust the significance level using Bonferroni correction (alpha / number of tests) or the Holm-Bonferroni method. When simultaneously testing 3 metrics, use alpha = 0.05/3 ≈ 0.0167 as the threshold.
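statsmodels implements both corrections in one call. The three raw p-values below are hypothetical results from testing three metrics in the same experiment:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values for three metrics tested simultaneously
p_values = [0.012, 0.030, 0.200]

for method in ("bonferroni", "holm"):
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method:10s} adjusted p: {[round(p, 4) for p in p_adj]} "
          f"reject: {list(reject)}")
```

Note that 0.030 survives neither correction here, even though it is below the unadjusted 0.05 threshold.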

Ignoring effect size and looking only at p-value

With very large samples, even tiny, practically meaningless differences can appear "statistically significant." Always check the confidence interval of the conversion rate difference and the relative lift alongside the p-value. For example, even if a 0.01 percentage point difference in conversion rate yields p < 0.05, the effect may be negligible compared to the cost of implementing Variant B from a business perspective.
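A contrived illustration: with tens of millions of visitors per arm, a 0.01 percentage point lift reaches p < 0.05 even though it is almost certainly not worth acting on. The traffic numbers are invented for the demonstration:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical: 30M visitors per group, 3.00% vs 3.01% conversion
n = 30_000_000
x_a, x_b = 900_000, 903_000

z_stat, p_val = proportions_ztest([x_b, x_a], [n, n], alternative="two-sided")
abs_diff = (x_b - x_a) / n
rel_lift = (x_b - x_a) / x_a

print(f"p-value: {p_val:.4f} -> 'significant' at alpha = 0.05")
print(f"Absolute difference: {abs_diff:+.2%} (relative lift {rel_lift:+.2%})")
```

The test is "significant", yet the absolute difference is a hundredth of a percentage point; the confidence interval and business value, not the p-value alone, should drive the rollout decision.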
