NumPy Arrays for Data Science

Topic: NumPy

Introduction to NumPy for Data Science

NumPy (Numerical Python) is the foundation of most data science work in Python. It provides an N-dimensional array type (ndarray) for large multi-dimensional arrays and matrices, along with a collection of mathematical functions that operate on them.

Why NumPy for Data Science?

NumPy arrays are:

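  • Memory efficient: Elements share one data type and are stored in a contiguous block of memory
  • Vectorized: Operations apply to entire arrays without explicit Python loops
  • Fast: The core loops are implemented in C, so array operations are typically much faster than equivalent loops over Python lists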
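
The three properties above can be checked informally. The following is a minimal sketch, not a benchmark: the array size and the names `values`, `squares_list`, and `squares_arr` are assumptions chosen for illustration, and the timings will vary by machine and NumPy version.

```python
import timeit

import numpy as np

n = 100_000
values = list(range(n))
arr = np.arange(n, dtype=np.int64)

# Pure-Python loop versus a single vectorized expression
squares_list = [v * v for v in values]
squares_arr = arr * arr

# Both approaches compute the same numbers
print("Same result:", squares_arr.tolist() == squares_list)

# An int64 array stores 8 bytes per element in one contiguous buffer
print("Bytes per element:", arr.itemsize)
print("Buffer size:", arr.nbytes, "bytes")
print("C-contiguous:", arr.flags["C_CONTIGUOUS"])

# Rough timing; exact numbers depend on hardware
loop_time = timeit.timeit(lambda: [v * v for v in values], number=10)
vec_time = timeit.timeit(lambda: arr * arr, number=10)
print(f"List comprehension: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")
```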

Creating NumPy Arrays

import numpy as np

# From Python list
data = [1, 2, 3, 4, 5]
arr = np.array(data)

# Using built-in functions
zeros = np.zeros((3, 4))        # 3x4 array of zeros
ones = np.ones((2, 3))          # 2x3 array of ones
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced points

# Random arrays
rand_uniform = np.random.rand(3, 3)    # Uniform distribution on [0, 1)
rand_normal = np.random.randn(1000)    # Standard normal distribution
rand_int = np.random.randint(0, 10, (5, 5))  # Random integers

print("Array shape:", arr.shape)
print("Array dtype:", arr.dtype)
print("Array mean:", arr.mean())
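
The random functions above use NumPy's legacy global random state. As a hedged alternative, newer NumPy versions also offer the `Generator` interface through `np.random.default_rng`. The sketch below mirrors the three random arrays; the seed value and variable names are arbitrary choices for this example.

```python
import numpy as np

# A seeded Generator gives reproducible draws without touching global state
rng = np.random.default_rng(seed=42)

uniform = rng.random((3, 3))                 # uniform on [0, 1)
normal = rng.standard_normal(1000)           # standard normal
integers = rng.integers(0, 10, size=(5, 5))  # integers in [0, 10)

print(uniform.shape, normal.shape, integers.shape)
print("dtype of integers:", integers.dtype)
```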

Array Indexing and Slicing

# 2D array
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12],
                   [13, 14, 15, 16]])

# Basic indexing
print(matrix[0, 0])      # 1 (first element)
print(matrix[-1, -1])    # 16 (last element)

# Slicing
print(matrix[0, :])      # First row: [1, 2, 3, 4]
print(matrix[:, 0])      # First column: [1, 5, 9, 13]
print(matrix[1:3, 1:3])  # Submatrix [[6, 7], [10, 11]]

# Boolean indexing
print(matrix[matrix > 5])  # [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

# Fancy indexing
print(matrix[[0, 2], [1, 3]])  # [2, 12] - elements at (0,1) and (2,3)
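
One detail worth knowing about the indexing styles above: basic slices return views that share memory with the original array, while boolean and fancy indexing return copies. A small sketch, using the same 4x4 values built with `np.arange` for brevity:

```python
import numpy as np

matrix = np.arange(1, 17).reshape(4, 4)

# Basic slicing returns a view: writing to it changes the original
block = matrix[1:3, 1:3]
block[0, 0] = -1
print(matrix[1, 1])   # -1

# Boolean indexing returns a copy: the original is untouched
picked = matrix[matrix > 10]
picked[0] = 0
print(matrix[2, 2])   # 11

print(np.shares_memory(matrix, block))   # True
print(np.shares_memory(matrix, picked))  # False
```

Call `.copy()` on a slice when an independent array is needed.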

Vectorized Operations

arr = np.array([1, 2, 3, 4, 5])

# Element-wise operations
print(arr + 10)        # [11, 12, 13, 14, 15]
print(arr * 2)         # [2, 4, 6, 8, 10]
print(arr ** 2)        # [1, 4, 9, 16, 25]
print(np.sqrt(arr))    # [1. , 1.41, 1.73, 2. , 2.24]

# Universal functions (ufuncs)
print(np.sin(arr))     # Sine of each element
print(np.log(arr))     # Natural log
print(np.exp(arr))     # Exponential
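
Element-wise arithmetic like this is how common preprocessing steps are usually written. As an illustrative sketch (the variable names `scaled` and `z` are chosen for this example), min-max scaling and z-score standardization each take one expression:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5], dtype=float)

# Min-max scaling: map the smallest value to 0 and the largest to 1
scaled = (arr - arr.min()) / (arr.max() - arr.min())
print(scaled)  # values 0, 0.25, 0.5, 0.75, 1

# Z-score standardization using the population standard deviation
z = (arr - arr.mean()) / arr.std()
print(z.mean(), z.std())  # approximately 0 and 1
```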

Statistical Functions for Data Science

data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])

# Central tendency
print("Mean:", np.mean(data))           # 50.5
print("Median:", np.median(data))       # 50.5
print("Standard Deviation:", np.std(data))  # ~28.64 (population std, ddof=0)

# Percentiles (important for EDA), using the default linear interpolation
print("25th percentile:", np.percentile(data, 25))  # 25.75
print("75th percentile:", np.percentile(data, 75))  # 75.25

# Descriptive statistics
print("Min:", np.min(data))    # 11
print("Max:", np.max(data))    # 90
print("Sum:", np.sum(data))    # 505
print("Variance:", np.var(data))  # 820.25
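
Two parameters come up constantly with these functions: `ddof`, which switches between population and sample estimates, and `axis`, which picks the direction of the reduction on 2D data. A short sketch reusing the same ten values; reshaping them into a 2x5 `table` is only an assumption made to have something two-dimensional to reduce.

```python
import numpy as np

data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])

# ddof=0 (default) divides by n; ddof=1 divides by n - 1 (sample estimate)
pop_std = np.std(data)
sample_std = np.std(data, ddof=1)
print(pop_std, sample_std)

# axis selects the direction of the reduction on 2D data
table = data.reshape(2, 5)
col_means = table.mean(axis=0)  # one mean per column, shape (5,)
row_means = table.mean(axis=1)  # one mean per row, shape (2,)
print(col_means)
print(row_means)
```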

Reshaping and Transposing

arr = np.arange(1, 13)  # [1, 2, ..., 12]

# Reshape
print(arr.reshape(3, 4))
# [[1, 2, 3, 4],
#  [5, 6, 7, 8],
#  [9, 10, 11, 12]]

print(arr.reshape(2, 2, 3))
# [[[1, 2, 3], [4, 5, 6]],
#  [[7, 8, 9], [10, 11, 12]]]

# Flatten and ravel
flat = arr.reshape(-1)   # 1D array; -1 tells NumPy to infer the length
print(flat.ravel())      # 1D view when possible (flatten() always returns a copy)

# Transpose
matrix = np.array([[1, 2], [3, 4], [5, 6]])
print(matrix.T)  # Transposed matrix
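
The difference between `ravel` and `flatten`, and the use of `-1` in `reshape`, can be confirmed with `np.shares_memory`. A minimal sketch; the name `grid` is chosen for this example.

```python
import numpy as np

arr = np.arange(1, 13)

# -1 lets NumPy infer one dimension from the total size
grid = arr.reshape(3, -1)
print(grid.shape)  # (3, 4)

# ravel returns a view when it can; flatten always returns a copy
raveled = grid.ravel()
flattened = grid.flatten()
print(np.shares_memory(grid, raveled))    # True
print(np.shares_memory(grid, flattened))  # False

# The transpose is also a view, with the axes swapped
print(grid.T.shape)  # (4, 3)
```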

Broadcasting

# Broadcasting allows operations on arrays of different shapes
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])

# b is broadcast to match a's shape
print(a + b)
# [[11, 22, 33],
#  [14, 25, 36]]

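# 2D + column-vector broadcasting
# c needs one row per row of a: shape (2, 1) broadcasts against (2, 3).
# A (3, 1) column would raise ValueError because 3 does not match a's 2 rows.
c = np.array([[10], [20]])
print(a + c)
# [[11, 12, 13],
#  [24, 25, 26]]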
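
Broadcasting compares shapes from the trailing axis backwards: two dimensions are compatible when they are equal or one of them is 1. The sketch below applies that rule to two common data preparation steps. The matrix `X` is made-up illustrative data, and `np.broadcast_shapes` assumes NumPy 1.20 or newer.

```python
import numpy as np

X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [4.0, 40.0, 400.0]])

# (4, 3) with (3,): subtract each column's mean from that column
col_means = X.mean(axis=0)
centered = X - col_means
print(centered.mean(axis=0))   # approximately zero per column

# keepdims=True keeps the reduced axis, giving shape (4, 1) for row-wise work
row_sums = X.sum(axis=1, keepdims=True)
shares = X / row_sums          # each row now sums to 1
print(shares.sum(axis=1))

# np.broadcast_shapes reports the result shape or raises ValueError
print(np.broadcast_shapes((4, 3), (3,)))   # (4, 3)
try:
    np.broadcast_shapes((2, 3), (3, 1))
    compatible = True
except ValueError as exc:
    compatible = False
    print("Incompatible shapes:", exc)
```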

Linear Algebra for Data Science

# Dot product
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))  # 32 (1*4 + 2*5 + 3*6)

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.matmul(A, B))
# [[19, 22],
#  [43, 50]]

# Matrix inverse (appears in the normal equations for linear regression)
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(A_inv)
# Approximately:
# [[ 0.6, -0.7],
#  [-0.2,  0.4]]

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:", eigenvectors)

# Determinant
print(np.linalg.det(A))  # 10.0 up to floating-point rounding (4*6 - 7*2)
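
When the goal is to solve a linear system rather than to inspect the inverse itself, `np.linalg.solve` is generally the preferred route because it avoids forming the inverse explicitly. A minimal sketch with the same matrix `A`; the right-hand side `b` is an arbitrary vector picked for this example.

```python
import numpy as np

A = np.array([[4.0, 7.0], [2.0, 6.0]])
b = np.array([1.0, 2.0])

# Solve A @ x = b directly instead of forming the inverse
x = np.linalg.solve(A, b)
print(x)  # approximately [-0.8, 0.6]

# Same answer through the explicit inverse, shown only for comparison
x_via_inv = np.linalg.inv(A) @ b
print(np.allclose(x, x_via_inv))  # True

# The @ operator is equivalent to np.matmul for 2D arrays
print(np.allclose(A @ x, b))      # True
```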

Practice Exercise: Data Analysis with NumPy

import numpy as np

# Simulate a dataset (e.g., housing prices)
np.random.seed(42)
n_samples = 1000

# Features: size (sq ft), bedrooms, age, distance to downtown
size = np.random.normal(2000, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
distance = np.random.uniform(1, 30, n_samples)

# Target: price (in thousands)
price = (150 + 0.15 * size + 20 * bedrooms - 2 * age - 3 * distance + 
         np.random.normal(0, 20, n_samples))

# Create dataset
data = np.column_stack([size, bedrooms, age, distance, price])

# Analyze
print("Dataset shape:", data.shape)
print("Mean price:", np.mean(price))
print("Median price:", np.median(price))
print("Price std:", np.std(price))
print("Correlation with size:", np.corrcoef(size, price)[0, 1])
print("Correlation with bedrooms:", np.corrcoef(bedrooms, price)[0, 1])
print("Correlation with age:", np.corrcoef(age, price)[0, 1])
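
As a possible extension of the exercise, the coefficients used to simulate `price` can be recovered with a least-squares fit. This is a sketch under the same simulation settings as above, regenerated here so the block runs on its own; the design matrix `X` and the `names` list are additions for illustration. The estimates should land close to 150, 0.15, 20, -2, and -3, though not exactly, because of the added noise.

```python
import numpy as np

np.random.seed(42)
n_samples = 1000
size = np.random.normal(2000, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
distance = np.random.uniform(1, 30, n_samples)
price = (150 + 0.15 * size + 20 * bedrooms - 2 * age - 3 * distance
         + np.random.normal(0, 20, n_samples))

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n_samples), size, bedrooms, age, distance])

# Least-squares fit; rcond=None selects NumPy's current default cutoff
coef, residuals, rank, singular_values = np.linalg.lstsq(X, price, rcond=None)

names = ["intercept", "size", "bedrooms", "age", "distance"]
for name, value in zip(names, coef):
    print(f"{name:>10}: {value:8.3f}")
```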

Key Takeaways

NumPy is essential for data science because:

  1. Efficient array operations - Much faster than Python lists
  2. Vectorized computations - Avoid explicit loops
  3. Rich mathematical functions - Statistics, linear algebra, etc.
  4. Foundation for Pandas - Pandas is built on NumPy

When to Use NumPy

  • Numerical computations
  • Mathematical operations on arrays
  • Linear algebra (regression, PCA, etc.)
  • Image processing (as numerical matrices)
  • Any numerical data manipulation

Next Steps

  • Learn Pandas for data manipulation
  • Practice vectorized operations
  • Explore NumPy's linear algebra capabilities
  • Combine with Matplotlib for visualizations
