Importance of Statistics in Data Science
Statistics provides the mathematical foundation for drawing inferences and making predictions from data. Without it, data science reduces to pattern matching with no way to quantify uncertainty.
Population vs Sample
- Population: The complete set of all items of interest
- Sample: A subset of the population used for analysis
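The distinction can be illustrated with a small simulation. This is a minimal sketch on synthetic data: the "heights" population, its size, and the sample size of 200 are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 synthetic "heights" in cm
population = rng.normal(loc=170.0, scale=10.0, size=100_000)

# Simple random sample of 200 items, drawn without replacement
sample = rng.choice(population, size=200, replace=False)

population_mean = population.mean()  # parameter (usually unknown in practice)
sample_mean = sample.mean()          # statistic (an estimate of the parameter)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means should be close but not identical; that gap is the sampling error that statistical inference tries to quantify.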
Types of Data
| Type | Description | Examples |
|---|---|---|
| Numerical | Quantitative values, continuous or discrete | Height, weight, temperature |
| Categorical | Qualitative labels with no inherent order | Gender, color, city |
| Ordinal | Categorical data with a meaningful order | Education level, rating |
| Time Series | Observations indexed by time | Stock prices, daily temperature |
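One way these types might be represented in pandas is sketched below. The column names and values are hypothetical; the point is only that ordinal data can carry an explicit category order while nominal categories do not.

```python
import pandas as pd

df = pd.DataFrame({
    # Numerical: quantitative measurements
    "height_cm": [172.5, 160.1, 181.0],
    # Categorical (nominal): labels with no order
    "city": pd.Categorical(["Paris", "Lima", "Paris"]),
    # Ordinal: categories with an explicit order
    "education": pd.Categorical(
        ["bachelor", "high school", "master"],
        categories=["high school", "bachelor", "master"],
        ordered=True,
    ),
    # Time series: each observation carries a timestamp
    "measured_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

print(df.dtypes)
# Ordering is meaningful for ordinal data, so min/max are defined
print(df["education"].min())
```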
Descriptive Statistics
Measures of Central Tendency: mean, median, and mode.
Measures of Dispersion: range, variance, and standard deviation.
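For reference, the conventional definitions of the sample mean, sample variance, and sample standard deviation for observations x_1, ..., x_n are sketched below. The n - 1 divisor is the usual choice when the data are a sample rather than the whole population, and it corresponds to `ddof=1` in NumPy.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```

The median is the middle value of the sorted data (or the average of the two middle values when n is even), which makes it less sensitive to outliers than the mean.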
Probability Distributions
Common distributions used in data science:
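- Binomial Distribution: Number of successes in n independent trials, each with the same success probability p
- Normal Distribution: Bell-shaped continuous distribution defined by its mean μ and standard deviation σ
- Poisson Distribution: Count of events in a fixed interval, given a constant average rate λ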
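As an illustration, `scipy.stats` exposes each of these distributions. The parameter values below (n = 10, p = 0.5, a standard normal, a rate of 4) are arbitrary choices for the sketch, not values from the text.

```python
from scipy import stats

# Binomial: probability of exactly 3 successes in n=10 trials with p=0.5
p_three = stats.binom.pmf(3, n=10, p=0.5)       # 120/1024 ≈ 0.1172

# Normal: probability that a standard normal value falls below 1.96
p_below = stats.norm.cdf(1.96, loc=0, scale=1)  # ≈ 0.975

# Poisson: probability of exactly 2 events when the mean rate is 4 per interval
p_two = stats.poisson.pmf(2, mu=4)              # 8 * exp(-4) ≈ 0.1465

print(f"binomial P(X=3)  = {p_three:.4f}")
print(f"normal   P(X<1.96) = {p_below:.4f}")
print(f"poisson  P(X=2)  = {p_two:.4f}")
```

Discrete distributions (binomial, Poisson) use `pmf`, while continuous ones (normal) use `pdf` for density and `cdf` for cumulative probability.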
Hypothesis Testing
The foundation of statistical inference:
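- Null Hypothesis (H₀): The default assumption, typically "no effect" or "no difference"
- Alternative Hypothesis (H₁): The claim we are seeking evidence for
- Test Statistic: A value calculated from sample data that measures how far the data depart from H₀
- P-value: Probability of observing results at least as extreme as those in the sample, assuming H₀ is true
- Significance Level (α): Threshold for rejecting H₀ (typically 0.05); H₀ is rejected when the p-value falls below α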
Common tests:
- t-test: Comparing the means of one or two groups
- Chi-square test: Testing independence between two categorical variables
- ANOVA: Comparing means across three or more groups
- Correlation test: Testing whether two variables are linearly related
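Two of the tests listed above, the chi-square test of independence and one-way ANOVA, might be run with SciPy as follows. This is a sketch only: the contingency table and the three groups are made-up numbers, and α = 0.05 follows the typical threshold mentioned earlier.

```python
import numpy as np
from scipy import stats

alpha = 0.05  # significance level

# Chi-square test of independence on a hypothetical 2x2 contingency table
# rows: group A / group B, columns: outcome yes / outcome no
observed = np.array([[30, 10],
                     [20, 20]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)

# One-way ANOVA across three hypothetical groups of scores
g1 = [85, 87, 92, 78, 88]
g2 = [79, 82, 89, 75, 81]
g3 = [90, 94, 91, 88, 95]
f_stat, p_anova = stats.f_oneway(g1, g2, g3)

print(f"chi-square: stat={chi2:.3f}, dof={dof}, p={p_chi2:.4f}, "
      f"reject H0: {p_chi2 < alpha}")
print(f"ANOVA:      F={f_stat:.3f}, p={p_anova:.4f}, "
      f"reject H0: {p_anova < alpha}")
```

In both cases the decision rule is the same: reject H₀ only when the p-value is below α. A significant ANOVA result says that at least one group mean differs, not which one.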
Confidence Intervals
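A confidence interval is a range of values, computed from a sample, that is likely to contain the population parameter. A 95% confidence level means that about 95% of intervals built this way from repeated samples would contain the true value.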
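A common case is a confidence interval for a mean, built as the sample mean plus or minus a critical t value times the standard error. The sketch below assumes a 95% level and roughly normally distributed data; the data values are reused from this article's Python example purely for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

n = data.size
mean = data.mean()
sem = stats.sem(data)                   # standard error of the mean (ddof=1 by default)
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value

lower = mean - t_crit * sem
upper = mean + t_crit * sem

print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The t distribution is used instead of the normal because the population standard deviation is estimated from a small sample; for large n the two give nearly identical intervals.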
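Correlation
The Pearson correlation coefficient r measures the strength and direction of the linear relationship between two variables, and ranges from -1 to 1: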
- r = 1: Perfect positive correlation
- r = -1: Perfect negative correlation
- r = 0: No linear correlation
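The three reference values can be reproduced on toy data. In this sketch the inputs are hypothetical and chosen so the results are exact; the quadratic case shows that r = 0 rules out only a linear relationship, not dependence in general.

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2])

y_pos = 2 * x + 1   # increases linearly with x
y_neg = -3 * x      # decreases linearly with x
y_quad = x ** 2     # fully determined by x, but not linearly

r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
r_quad = np.corrcoef(x, y_quad)[0, 1]

print(f"linear increasing: r = {r_pos:.2f}")
print(f"linear decreasing: r = {r_neg:.2f}")
print(f"quadratic:         r = {r_quad:.2f}")
```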
Python Implementation
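```python
import numpy as np
from scipy import stats

# Descriptive statistics
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
mean = np.mean(data)
median = np.median(data)
std = np.std(data, ddof=1)       # ddof=1 gives the sample standard deviation
variance = np.var(data, ddof=1)  # ddof=1 gives the sample variance
print(f"mean={mean}, median={median}, std={std:.2f}, variance={variance:.2f}")

# Hypothesis testing: independent two-sample t-test
# (assumes equal variances by default; pass equal_var=False for Welch's t-test)
group1 = [85, 87, 92, 78, 88]
group2 = [79, 82, 89, 75, 81]
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t={t_stat:.3f}, p={p_value:.3f}")

# Correlation: Pearson r is the off-diagonal entry of the correlation matrix
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
correlation = np.corrcoef(x, y)[0, 1]
print(f"r={correlation:.3f}")
```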
Key Takeaways
- Statistics provides the framework for data-driven decision making
- Distinguishing populations from samples is essential, because inference generalizes from a sample to its population
- Descriptive statistics summarize data characteristics
- Probability distributions model real-world phenomena
- Hypothesis testing enables statistical inference