Introduction to NumPy for Data Science
NumPy (Numerical Python) is a foundational library for data science in Python. It provides large multi-dimensional arrays and matrices, along with a broad collection of mathematical functions that operate on them.
Why NumPy for Data Science?
NumPy arrays are:
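- Memory efficient: Elements share a single data type and are typically stored in one contiguous memory block
- Vectorized: Operations apply to entire arrays without explicit Python loops
- Fast: Core routines are implemented in C, so numerical work on large arrays typically runs much faster than equivalent loops over Python lists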
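To make these properties concrete, here is a minimal sketch that squares the same values once with a Python loop and once with a single vectorized expression. The array size and the variable names are chosen purely for illustration.

```python
import numpy as np

# Hypothetical sample data: 100,000 float64 values (size chosen for illustration)
values = np.arange(100_000, dtype=np.float64)

# Pure-Python approach: an explicit loop over a list
squares_loop = [v * v for v in values.tolist()]

# NumPy approach: one vectorized expression, evaluated in compiled code
squares_vec = values * values

# Both approaches produce the same numbers
same = np.allclose(squares_loop, squares_vec)
print(same)  # True

# Size of the array's data buffer: 8 bytes per float64 element
print(values.nbytes)  # 800000
```

Timing the two approaches (for example with the standard `timeit` module) usually shows the vectorized version running much faster, although the exact ratio depends on the machine and the array size.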
Creating NumPy Arrays
import numpy as np
# From Python list
data = [1, 2, 3, 4, 5]
arr = np.array(data)
# Using built-in functions
zeros = np.zeros((3, 4)) # 3x4 array of zeros
ones = np.ones((2, 3)) # 2x3 array of ones
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced points
# Random arrays
rand_uniform = np.random.rand(3, 3) # Uniform distribution over [0, 1)
rand_normal = np.random.randn(1000) # Standard normal distribution
rand_int = np.random.randint(0, 10, (5, 5)) # Random integers from 0 to 9 (upper bound is exclusive)
print("Array shape:", arr.shape)
print("Array dtype:", arr.dtype)
print("Array mean:", arr.mean())
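The `np.random.rand`, `randn`, and `randint` functions above belong to NumPy's legacy random interface. NumPy's documentation suggests the `Generator` interface for new code, so the following sketch shows roughly equivalent calls. The seed value is arbitrary.

```python
import numpy as np

# Create a Generator; the seed (42) is an arbitrary choice for reproducibility
rng = np.random.default_rng(42)

uniform = rng.random((3, 3))                 # floats in [0, 1)
normal = rng.standard_normal(1000)           # standard normal samples
integers = rng.integers(0, 10, size=(5, 5))  # integers from 0 to 9

print(uniform.shape, normal.shape, integers.shape)  # (3, 3) (1000,) (5, 5)
```

A generator object keeps its own state, so separate parts of a program can draw random numbers without interfering with each other's reproducibility.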
Array Indexing and Slicing
# 2D array
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12],
                   [13, 14, 15, 16]])
# Basic indexing
print(matrix[0, 0]) # 1 (first element)
print(matrix[-1, -1]) # 16 (last element)
# Slicing
print(matrix[0, :]) # First row: [1, 2, 3, 4]
print(matrix[:, 0]) # First column: [1, 5, 9, 13]
print(matrix[1:3, 1:3]) # Submatrix [[6, 7], [10, 11]]
# Boolean indexing
print(matrix[matrix > 5]) # [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
# Fancy indexing
print(matrix[[0, 2], [1, 3]]) # [2, 12] - elements at (0,1) and (2,3)
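One practical consequence of these indexing styles is worth a short sketch: basic slices return views that share memory with the original array, while boolean and fancy indexing return copies. The variable names below are illustrative.

```python
import numpy as np

matrix = np.arange(1, 17).reshape(4, 4)

# Basic slicing returns a view: writing through it changes the original
row_view = matrix[0, :]
row_view[0] = 100
print(matrix[0, 0])  # 100

# Boolean indexing returns a copy: the original is left untouched
picked = matrix[matrix > 10]
picked[:] = 0
print(matrix[3, 3])  # 16

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(matrix, row_view))  # True
print(np.shares_memory(matrix, picked))    # False
```

When an independent array is needed from a slice, calling `.copy()` on it avoids accidental modification of the source data.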
Vectorized Operations
arr = np.array([1, 2, 3, 4, 5])
# Element-wise operations
print(arr + 10) # [11, 12, 13, 14, 15]
print(arr * 2) # [2, 4, 6, 8, 10]
print(arr ** 2) # [1, 4, 9, 16, 25]
print(np.sqrt(arr)) # [1. , 1.41, 1.73, 2. , 2.24]
# Universal functions (ufuncs)
print(np.sin(arr)) # Sine of each element
print(np.log(arr)) # Natural log
print(np.exp(arr)) # Exponential
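Element-wise operations combine naturally into larger expressions. As a small sketch, the following standardizes an array to z-scores and applies a vectorized condition, both without any explicit loop. The threshold of 3 is an arbitrary value for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Standardize to z-scores: subtract the mean, divide by the standard deviation
z = (arr - arr.mean()) / arr.std()
print(z)

# Vectorized conditional: keep values above 3, replace the rest with 0
kept = np.where(arr > 3, arr, 0)
print(kept)  # [0 0 0 4 5]
```

After standardization the values have mean 0 and standard deviation 1, up to floating-point rounding.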
Statistical Functions for Data Science
data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])
# Central tendency
print("Mean:", np.mean(data)) # 50.5
print("Median:", np.median(data)) # 50.5
print("Standard Deviation:", np.std(data)) # about 28.64 (population std)
# Percentiles (important for EDA)
print("25th percentile:", np.percentile(data, 25)) # 25.75
print("75th percentile:", np.percentile(data, 75)) # 75.25
# Descriptive statistics
print("Min:", np.min(data)) # 11
print("Max:", np.max(data)) # 90
print("Sum:", np.sum(data)) # 505
print("Variance:", np.var(data)) # 820.25
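Two options come up often in practice with these functions: `ddof` for sample (rather than population) statistics, and `axis` for per-row or per-column results. The sketch below reuses the same ten values; reshaping them into a 2x5 table is only for illustration.

```python
import numpy as np

data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])

# By default np.std and np.var divide by n (population statistics).
# ddof=1 divides by n - 1 instead, giving the sample estimate.
pop_std = np.std(data)
sample_std = np.std(data, ddof=1)
print(pop_std < sample_std)  # True

# With 2D data, axis selects the direction of the reduction
table = data.reshape(2, 5)
col_means = table.mean(axis=0)  # one mean per column
row_means = table.mean(axis=1)  # one mean per row
print(col_means)  # [28.5 50.5 72.5 89.5 11.5]
print(row_means)  # [47.2 53.8]
```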
Reshaping and Transposing
arr = np.arange(1, 13) # [1, 2, ..., 12]
# Reshape
print(arr.reshape(3, 4))
# [[1, 2, 3, 4],
# [5, 6, 7, 8],
# [9, 10, 11, 12]]
print(arr.reshape(2, 2, 3))
# [[[1, 2, 3], [4, 5, 6]],
# [[7, 8, 9], [10, 11, 12]]]
# Flatten and ravel
flat = arr.reshape(-1) # Flatten to 1D (-1 lets NumPy infer the length)
print(flat.ravel()) # ravel() returns a view when possible; flatten() always returns a copy
# Transpose
matrix = np.array([[1, 2], [3, 4], [5, 6]])
print(matrix.T) # Transposed matrix
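The difference between views and copies when reshaping can be checked directly. This sketch uses `np.shares_memory` to show when `ravel()` and `flatten()` reuse the original data. The names are illustrative.

```python
import numpy as np

arr = np.arange(1, 13)

# -1 asks NumPy to infer that dimension from the array's size
grid = arr.reshape(3, -1)
print(grid.shape)  # (3, 4)

raveled = grid.ravel()      # a view when the memory layout allows it
flattened = grid.flatten()  # always a copy
print(np.shares_memory(grid, raveled))    # True
print(np.shares_memory(grid, flattened))  # False

# A transposed array is not laid out contiguously in row-major order,
# so ravel() has to copy here
transposed_raveled = grid.T.ravel()
print(np.shares_memory(grid, transposed_raveled))  # False
```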
Broadcasting
# Broadcasting allows operations on arrays of different shapes
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
# b is broadcast to match a's shape
print(a + b)
# [[11, 22, 33],
# [14, 25, 36]]
# 2D + column-vector broadcasting
# c has shape (2, 1), so each row of a gets its own offset
c = np.array([[1], [2]])
print(a + c)
# [[2, 3, 4],
# [6, 7, 8]]
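A common data science use of broadcasting is centering data. The following sketch subtracts column means and row means from a small matrix, then shows what happens when shapes are not compatible. The mismatched array is a made-up example.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# Column means have shape (3,), which broadcasts across both rows
col_centered = a - a.mean(axis=0)
print(col_centered)
# [[-1.5 -1.5 -1.5]
#  [ 1.5  1.5  1.5]]

# keepdims=True keeps shape (2, 1), which broadcasts across the columns
row_centered = a - a.mean(axis=1, keepdims=True)
print(row_centered)
# [[-1.  0.  1.]
#  [-1.  0.  1.]]

# Shapes (2, 3) and (2,) do not line up on the trailing dimension
broadcast_failed = False
try:
    a + np.array([10, 20])
except ValueError as exc:
    broadcast_failed = True
    print("Broadcast error:", exc)
```

NumPy compares shapes from the trailing dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.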
Linear Algebra for Data Science
# Dot product
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b)) # 32 (1*4 + 2*5 + 3*6)
# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.matmul(A, B))
# [[19, 22],
# [43, 50]]
# Matrix inverse (important for linear regression)
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(A_inv)
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:", eigenvectors)
# Determinant
print(np.linalg.det(A)) # 10.0 (up to floating-point rounding)
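When the goal is to solve a linear system rather than to inspect the inverse itself, `np.linalg.solve` is generally preferred over computing the inverse explicitly. A minimal sketch follows; the right-hand side `b` is a hypothetical vector chosen for illustration.

```python
import numpy as np

A = np.array([[4.0, 7.0], [2.0, 6.0]])
b = np.array([1.0, 2.0])  # hypothetical right-hand side

# Two ways to solve A @ x = b
x_inv = np.linalg.inv(A) @ b     # via the explicit inverse
x_solve = np.linalg.solve(A, b)  # via a direct solver

print(np.allclose(x_inv, x_solve))  # True
print(np.allclose(A @ x_solve, b))  # True

# Sanity check: A times its inverse is the identity (up to rounding)
print(np.allclose(A @ np.linalg.inv(A), np.eye(2)))  # True
```

Comparisons use `np.allclose` rather than `==` because floating-point results are rarely bit-for-bit identical.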
Practice Exercise: Data Analysis with NumPy
import numpy as np
# Simulate a dataset (e.g., housing prices)
np.random.seed(42)
n_samples = 1000
# Features: size (sq ft), bedrooms, age, distance to downtown
size = np.random.normal(2000, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
distance = np.random.uniform(1, 30, n_samples)
# Target: price (in thousands)
price = (150 + 0.15 * size + 20 * bedrooms - 2 * age - 3 * distance +
         np.random.normal(0, 20, n_samples))
# Create dataset
data = np.column_stack([size, bedrooms, age, distance, price])
# Analyze
print("Dataset shape:", data.shape)
print("Mean price:", np.mean(price))
print("Median price:", np.median(price))
print("Price std:", np.std(price))
print("Correlation with size:", np.corrcoef(size, price)[0, 1])
print("Correlation with bedrooms:", np.corrcoef(bedrooms, price)[0, 1])
print("Correlation with age:", np.corrcoef(age, price)[0, 1])
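As one possible extension of this exercise, the simulated coefficients can be recovered with ordinary least squares. The sketch below regenerates the same dataset and fits it with `np.linalg.lstsq`; adding a column of ones for the intercept is an assumption about how the model is set up, not part of the original exercise.

```python
import numpy as np

# Regenerate the simulated dataset with the same seed and formulas
np.random.seed(42)
n_samples = 1000
size = np.random.normal(2000, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
distance = np.random.uniform(1, 30, n_samples)
price = (150 + 0.15 * size + 20 * bedrooms - 2 * age - 3 * distance +
         np.random.normal(0, 20, n_samples))

# Design matrix: a column of ones (intercept) followed by the four features
X = np.column_stack([np.ones(n_samples), size, bedrooms, age, distance])

# Least-squares fit; rcond=None selects NumPy's current default cutoff
coef, residuals, rank, singular_values = np.linalg.lstsq(X, price, rcond=None)

names = ["intercept", "size", "bedrooms", "age", "distance"]
for name, value in zip(names, coef):
    print(f"{name}: {value:.3f}")
```

The estimates should land near the values used in the simulation (150, 0.15, 20, -2, -3), with small deviations caused by the added noise.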
Key Takeaways
NumPy is essential for data science because:
- Efficient array operations - Typically much faster than equivalent loops over Python lists
- Vectorized computations - Avoid explicit loops
- Rich mathematical functions - Statistics, linear algebra, etc.
- Foundation for Pandas - Pandas is built on NumPy
When to Use NumPy
- Numerical computations
- Mathematical operations on arrays
- Linear algebra (regression, PCA, etc.)
- Image processing (as numerical matrices)
- Any numerical data manipulation
Next Steps
- Learn Pandas for data manipulation
- Practice vectorized operations
- Explore NumPy's linear algebra capabilities
- Combine with Matplotlib for visualizations