Introduction to Pandas DataFrames
Pandas is the most important library for data science in Python. It provides high-performance, easy-to-use data structures for data manipulation and analysis.
Why Pandas for Data Science?
- DataFrame: Tabular data structure (like Excel/CSV)
- Missing data handling: Built-in NaN support
- Data alignment: Automatic alignment based on labels
- GroupBy: Split-apply-combine operations
- I/O tools: Read/write various formats (CSV, Excel, SQL, etc.)
Creating DataFrames
import pandas as pd
import numpy as np
# From dictionary
data = {
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'age': [25, 30, 35, 28, 32],
'salary': [50000, 60000, 75000, 55000, 70000],
'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing']
}
df = pd.DataFrame(data)
# From CSV (most common in data science)
# df = pd.read_csv('data.csv')
# From NumPy array
arr = np.random.randn(100, 4)
df = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])
# From URL
# df = pd.read_csv('https://example.com/data.csv')
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.shape) # (5, 4) - rows, columns
print(df.columns) # Column names
print(df.dtypes) # Data types
Basic DataFrame Operations
# Selection
df['name'] # Single column (Series)
df[['name', 'salary']] # Multiple columns
# Row selection by index
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
# Row selection by label
df.loc[0] # Row with index 0
# Filtering (most important for data science)
df[df['age'] > 30] # Age > 30
df[(df['age'] > 25) & (df['salary'] > 60000)] # Multiple conditions
# Using query method
df.query('age > 30 and salary > 60000')
Adding and Modifying Columns
# Add new column
df['bonus'] = df['salary'] * 0.1
df['full_name'] = df['name'].str.upper()
# Conditional column creation
df['level'] = df['salary'].apply(
lambda x: 'Senior' if x > 65000 else 'Junior'
)
# Using assign (returns new DataFrame)
df = df.assign(
tax=df['salary'] * 0.2,
net_salary=lambda x: x['salary'] - x['tax']
)
# Drop columns
df = df.drop(columns=['bonus', 'tax'])
Handling Missing Data
# Create DataFrame with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, np.nan],
'C': [1, 2, 3, np.nan, np.nan]
})
# Check for missing values
print(df.isnull()) # Boolean DataFrame
print(df.isnull().sum()) # Count per column
print(df.notnull().sum()) # Non-null count
# Drop missing values
df.dropna() # Drop rows with any NaN
df.dropna(axis=1) # Drop columns with any NaN
df.dropna(thresh=3) # Keep rows with at least 3 non-NaN
# Fill missing values
df.fillna(0) # Fill with 0
df.fillna(df.mean()) # Fill with column mean
df.fillna(method='ffill') # Forward fill
df.fillna(method='bfill') # Backward fill
df['A'].interpolate() # Interpolate
Statistical Operations
# Descriptive statistics
df.describe() # Count, mean, std, min, 25%, 50%, 75%, max
df.mean() # Mean of each column
df.median() # Median
df.std() # Standard deviation
# Aggregation
df['salary'].sum() # Sum
df['salary'].min() # Minimum
df['salary'].max() # Maximum
df['salary'].count() # Count
# GroupBy (essential for data analysis)
grouped = df.groupby('department')
grouped['salary'].mean() # Mean salary per department
grouped.agg({
'salary': ['mean', 'max', 'min'],
'age': 'mean'
})
# Correlation matrix (important for feature selection)
print(df.corr())
DataFrame Manipulation
# Sort values
df.sort_values('salary', ascending=False)
df.sort_values(['department', 'salary'], ascending=[True, False])
# Rank
df['salary_rank'] = df['salary'].rank(ascending=False)
# Rename columns
df = df.rename(columns={'name': 'employee_name', 'salary': 'annual_salary'})
# Replace values
df['department'].replace({'IT': 'Information Technology', 'HR': 'Human Resources'})
# String operations
df['name'].str.lower()
df['name'].str.contains('a')
df['name'].str.split(' ')
# Apply custom function
df['salary'].apply(lambda x: x * 1.1 if x < 60000 else x * 1.05)
Merging and Joining DataFrames
# Create sample DataFrames
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})
df2 = pd.DataFrame({
'id': [1, 2, 3, 5],
'salary': [50000, 60000, 75000, 55000]
})
# Merge (like SQL JOIN)
pd.merge(df1, df2, on='id', how='inner') # Inner join
pd.merge(df1, df2, on='id', how='left') # Left join
pd.merge(df1, df2, on='id', how='right') # Right join
pd.merge(df1, df2, on='id', how='outer') # Outer join
# Concatenate
df3 = pd.DataFrame({
'id': [5, 6],
'name': ['Eve', 'Frank'],
'salary': [70000, 65000]
})
pd.concat([df1, df3], ignore_index=True)
# Append (deprecated, use concat)
# df1.append(df3) # Use pd.concat instead
Pivot Tables
# Create sample data
sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=90),
'product': np.random.choice(['A', 'B', 'C'], 90),
'region': np.random.choice(['North', 'South', 'East', 'West'], 90),
'sales': np.random.randint(100, 1000, 90)
})
# Simple pivot table
pd.pivot_table(sales, values='sales', index='product', aggfunc='sum')
# Multi-level pivot
pd.pivot_table(
sales,
values='sales',
index='product',
columns='region',
aggfunc='sum'
)
# Multiple aggregations
pd.pivot_table(
sales,
values='sales',
index='region',
aggfunc=['sum', 'mean', 'max']
)
Time Series with Pandas
# Create datetime index
dates = pd.date_range('2024-01-01', periods=365, freq='D')
df = pd.DataFrame({
'date': dates,
'value': np.random.randn(365).cumsum()
})
df.set_index('date', inplace=True)
# Resample (downsampling/upsampling)
df.resample('M').mean() # Monthly average
df.resample('W').sum() # Weekly sum
df.resample('Q').max() # Quarterly max
# Rolling windows
df['rolling_mean'] = df['value'].rolling(window=30).mean()
df['rolling_std'] = df['value'].rolling(window=30).std()
# Shift and diff
df['shifted'] = df['value'].shift(1)
df['diff'] = df['value'].diff()
Practice Exercise: EDA with Pandas
import pandas as pd
import numpy as np
# Load dataset (using built-in for demo)
from sklearn.datasets import load_diabetes
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# Basic exploration
print("Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nDescriptive statistics:")
print(df.describe())
# Analysis
print("\nCorrelation with target:")
correlations = df.corr()['target'].sort_values(ascending=False)
print(correlations.head(10))
# Group analysis by binned values
df['age_bin'] = pd.cut(df['age'], bins=5)
print("\nTarget mean by age bin:")
print(df.groupby('age_bin')['target'].mean())
Key Takeaways
Pandas is essential for:
- Data loading - Read from various sources
- Data cleaning - Handle missing values, duplicates
- Data exploration - Statistics, filtering, aggregation
- Data transformation - Feature engineering
- Data merging - Combine multiple sources
Common Data Science Operations in Pandas
| Operation | Code |
|---|---|
| Load CSV | pd.read_csv('file.csv') |
| Filter rows | df[df['col'] > value] |
| Group by | df.groupby('col').agg(...) |
| Pivot table | pd.pivot_table(...) |
| Handle NaN | df.fillna() or df.dropna() |
| Save to CSV | df.to_csv('file.csv') |
Next Steps
- Master groupby operations
- Learn pivot tables and crosstabs
- Practice time series analysis
- Explore pandas for feature engineering