Why the Average Isn't Always Right
A neighbourhood has nine households on modest incomes and one household earning \$50M. Mean income is roughly \$5M, yet nobody earns anywhere near that. The mean is pulled by extremes.
Core Insight: The mean minimises the sum of squared deviations. It's the least-squares centre — powerful but sensitive to outliers.
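The least-squares property can be checked numerically. The sketch below is only an illustration, not a proof: it reuses the eight scores from the worked example, and the helper `sum_squared_deviations` and the candidate grid (60 to 100 in steps of 0.25) are hypothetical choices made for the demonstration.

```python
from statistics import mean

scores = [72, 85, 90, 68, 78, 95, 88, 74]  # scores from the worked example


def sum_squared_deviations(values, c):
    """Return the sum of (x - c)^2 over all x in values."""
    return sum((x - c) ** 2 for x in values)


centre = mean(scores)
sse_at_mean = sum_squared_deviations(scores, centre)

# Candidate centres 60.00, 60.25, ..., 100.00 (an arbitrary illustrative grid)
candidates = [60 + 0.25 * k for k in range(161)]
best = min(candidates, key=lambda c: sum_squared_deviations(scores, c))

print(f"Mean: {centre}")               # 81.25
print(f"Best grid candidate: {best}")  # 81.25
```

Any candidate other than the mean gives a strictly larger sum of squared deviations, which is why a single far-away value can drag the mean so strongly: its squared distance dominates the total.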
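Formula
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Grouped data:
$$\bar{x} = \frac{\sum_i f_i m_i}{\sum_i f_i}$$
where $m_i$ is the midpoint of class $i$ and $f_i$ is its frequency.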
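For grouped data, each class is represented by its midpoint and weighted by its frequency. The sketch below shows that calculation starting from class boundaries. The `edges` list is an assumption: it is chosen so the midpoints come out as 15, 25, 35, 45, 55, matching the midpoints and frequencies used in the Python section of this article.

```python
# Hypothetical class boundaries: 10-20, 20-30, 30-40, 40-50, 50-60
edges = [10, 20, 30, 40, 50, 60]
frequencies = [3, 8, 12, 5, 2]

# Midpoint of each class is the average of its lower and upper boundary
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

# Grouped mean = sum(f * m) / sum(f)
total = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / total

print(midpoints)                         # [15.0, 25.0, 35.0, 45.0, 55.0]
print(f"Grouped mean: {grouped_mean:.2f}")  # 33.33
```

The result is an approximation of the true mean, since it treats every observation in a class as if it sat exactly at the midpoint.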
Worked Example
Scores: 72, 85, 90, 68, 78, 95, 88, 74
Sorted: 68, 72, 74, 78, 85, 88, 90, 95
Sum = 650, n = 8
Mean = 650 / 8 = 81.25, which falls between 78 and 85 in the sorted list.
Python Implementation
```python
import numpy as np
import pandas as pd

data = [72, 85, 90, 68, 78, 95, 88, 74]
print(f"Mean: {np.mean(data):.2f}")             # 81.25
print(f"Pandas: {pd.Series(data).mean():.2f}")  # 81.25

# Grouped data: weight each class midpoint by its frequency
midpoints = [15, 25, 35, 45, 55]
frequencies = [3, 8, 12, 5, 2]
grouped_mean = np.average(midpoints, weights=frequencies)
print(f"Grouped: {grouped_mean:.2f}")  # 33.33

# Outlier effect: one extreme value added to the same data
data_out = data + [500]
print(f"Normal mean: {np.mean(data):.2f}")       # 81.25
print(f"Outlier mean: {np.mean(data_out):.2f}")  # 127.78
```
R Implementation
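```r
data <- c(72, 85, 90, 68, 78, 95, 88, 74)
cat("Mean:", mean(data), "\n") # 81.25

# Grouped: midpoints weighted by frequencies
cat("Grouped:", weighted.mean(c(15,25,35,45,55), c(3,8,12,5,2)), "\n") # 33.33333

# Trimmed mean (robust)
# Note: with n = 8, trim = 0.1 drops floor(0.8) = 0 values per end,
# so this prints the plain mean (81.25)
cat("Trimmed (10%):", mean(data, trim=0.1), "\n")
```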
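For readers following the Python track, here is a minimal sketch of a trimmed mean that mirrors the rule R's `mean(x, trim=...)` uses, namely dropping `floor(n * trim)` values from each end of the sorted data. The function name `trimmed_mean` is hypothetical, and restricting `trim` to values below 0.5 is a simplification made for this sketch.

```python
import math


def trimmed_mean(values, trim):
    """Mean after dropping floor(n * trim) values from each end of the sorted data."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)
    k = math.floor(len(ordered) * trim)   # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [72, 85, 90, 68, 78, 95, 88, 74]
data_out = data + [500]

print(f"{trimmed_mean(data, 0.1):.2f}")      # 81.25  (floor(0.8) = 0, nothing dropped)
print(f"{trimmed_mean(data_out, 0.1):.2f}")  # 127.78 (floor(0.9) = 0, outlier kept)
print(f"{trimmed_mean(data_out, 0.2):.2f}")  # 83.14  (floor(1.8) = 1, drops 68 and 500)
```

On small samples the trim fraction has to be large enough to remove at least one value per end, otherwise the trimmed mean is just the ordinary mean and offers no protection against the outlier.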
When to Use Mean vs Median
| Situation | Use |
|---|---|
| Symmetric, no outliers | Mean |
| Skewed data (income, prices) | Median |
| Categorical data | Mode |
| Need SD / variance | Mean |
| Outliers present | Median |
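
The "outliers present" row can be illustrated with a small comparison. This is a sketch using Python's standard `statistics` module and the article's eight scores, with 500 appended as the single extreme value.

```python
from statistics import mean, median

scores = [72, 85, 90, 68, 78, 95, 88, 74]
with_outlier = scores + [500]

for label, values in [("clean", scores), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean={mean(values):.2f}  median={median(values):.2f}")
#        clean: mean=81.25  median=81.50
# with outlier: mean=127.78  median=85.00
```

One extreme value moves the mean by more than 46 points but the median by only 3.5, which is the practical reason to prefer the median for skewed or contaminated data.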
Key Takeaways
- Sum ÷ count: $\bar{x} = \frac{1}{n}\sum_i x_i$; always check for outliers first
- Outlier sensitivity: one extreme value shifts the mean significantly
- Least-squares: the mean minimises $\sum_i (x_i - c)^2$ over every constant $c$
- Grouped data: use $\bar{x} = \sum_i f_i m_i / \sum_i f_i$ when only a frequency table is available
- $\mu$ vs $\bar{x}$: population vs sample symbol; the computation is identical
- Prefer the median for skewed distributions like income or survival times