← Back to Data Science

All Topics

Advertisement

Learn/Data Science/Python Foundations

Pandas DataFrames for Data Science

Topic: Pandas

Advertisement

Introduction to Pandas DataFrames

Pandas is the most important library for data science in Python. It provides high-performance, easy-to-use data structures for data manipulation and analysis.

Why Pandas for Data Science?

  • DataFrame: Tabular data structure (like Excel/CSV)
  • Missing data handling: Built-in NaN support
  • Data alignment: Automatic alignment based on labels
  • GroupBy: Split-apply-combine operations
  • I/O tools: Read/write various formats (CSV, Excel, SQL, etc.)

Creating DataFrames

import pandas as pd
import numpy as np

# From dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 75000, 55000, 70000],
    'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing']
}
df = pd.DataFrame(data)

# From CSV (most common in data science)
# df = pd.read_csv('data.csv')

# From NumPy array
arr = np.random.randn(100, 4)
df = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])

# From URL
# df = pd.read_csv('https://example.com/data.csv')

print(df.head())     # First 5 rows
print(df.tail())     # Last 5 rows
print(df.shape)      # (5, 4) - rows, columns
print(df.columns)    # Column names
print(df.dtypes)     # Data types

Basic DataFrame Operations

# Selection
df['name']           # Single column (Series)
df[['name', 'salary']]  # Multiple columns

# Row selection by index
df.iloc[0]           # First row
df.iloc[0:3]         # First 3 rows

# Row selection by label
df.loc[0]            # Row with index 0

# Filtering (most important for data science)
df[df['age'] > 30]              # Age > 30
df[(df['age'] > 25) & (df['salary'] > 60000)]  # Multiple conditions

# Using query method
df.query('age > 30 and salary > 60000')

Adding and Modifying Columns

# Add new column
df['bonus'] = df['salary'] * 0.1
df['full_name'] = df['name'].str.upper()

# Conditional column creation
df['level'] = df['salary'].apply(
    lambda x: 'Senior' if x > 65000 else 'Junior'
)

# Using assign (returns new DataFrame)
df = df.assign(
    tax=df['salary'] * 0.2,
    net_salary=lambda x: x['salary'] - x['tax']
)

# Drop columns
df = df.drop(columns=['bonus', 'tax'])

Handling Missing Data

# Create DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, np.nan],
    'C': [1, 2, 3, np.nan, np.nan]
})

# Check for missing values
print(df.isnull())         # Boolean DataFrame
print(df.isnull().sum())   # Count per column
print(df.notnull().sum())  # Non-null count

# Drop missing values
df.dropna()              # Drop rows with any NaN
df.dropna(axis=1)        # Drop columns with any NaN
df.dropna(thresh=3)      # Keep rows with at least 3 non-NaN

# Fill missing values
df.fillna(0)                       # Fill with 0
df.fillna(df.mean())               # Fill with column mean
df.fillna(method='ffill')          # Forward fill
df.fillna(method='bfill')          # Backward fill
df['A'].interpolate()              # Interpolate

Statistical Operations

# Descriptive statistics
df.describe()         # Count, mean, std, min, 25%, 50%, 75%, max
df.mean()             # Mean of each column
df.median()           # Median
df.std()              # Standard deviation

# Aggregation
df['salary'].sum()    # Sum
df['salary'].min()    # Minimum
df['salary'].max()    # Maximum
df['salary'].count()  # Count

# GroupBy (essential for data analysis)
grouped = df.groupby('department')
grouped['salary'].mean()      # Mean salary per department
grouped.agg({
    'salary': ['mean', 'max', 'min'],
    'age': 'mean'
})

# Correlation matrix (important for feature selection)
print(df.corr())

DataFrame Manipulation

# Sort values
df.sort_values('salary', ascending=False)
df.sort_values(['department', 'salary'], ascending=[True, False])

# Rank
df['salary_rank'] = df['salary'].rank(ascending=False)

# Rename columns
df = df.rename(columns={'name': 'employee_name', 'salary': 'annual_salary'})

# Replace values
df['department'].replace({'IT': 'Information Technology', 'HR': 'Human Resources'})

# String operations
df['name'].str.lower()
df['name'].str.contains('a')
df['name'].str.split(' ')

# Apply custom function
df['salary'].apply(lambda x: x * 1.1 if x < 60000 else x * 1.05)

Merging and Joining DataFrames

# Create sample DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'salary': [50000, 60000, 75000, 55000]
})

# Merge (like SQL JOIN)
pd.merge(df1, df2, on='id', how='inner')  # Inner join
pd.merge(df1, df2, on='id', how='left')   # Left join
pd.merge(df1, df2, on='id', how='right') # Right join
pd.merge(df1, df2, on='id', how='outer')  # Outer join

# Concatenate
df3 = pd.DataFrame({
    'id': [5, 6],
    'name': ['Eve', 'Frank'],
    'salary': [70000, 65000]
})
pd.concat([df1, df3], ignore_index=True)

# Append (deprecated, use concat)
# df1.append(df3)  # Use pd.concat instead

Pivot Tables

# Create sample data
sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=90),
    'product': np.random.choice(['A', 'B', 'C'], 90),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 90),
    'sales': np.random.randint(100, 1000, 90)
})

# Simple pivot table
pd.pivot_table(sales, values='sales', index='product', aggfunc='sum')

# Multi-level pivot
pd.pivot_table(
    sales,
    values='sales',
    index='product',
    columns='region',
    aggfunc='sum'
)

# Multiple aggregations
pd.pivot_table(
    sales,
    values='sales',
    index='region',
    aggfunc=['sum', 'mean', 'max']
)

Time Series with Pandas

# Create datetime index
dates = pd.date_range('2024-01-01', periods=365, freq='D')
df = pd.DataFrame({
    'date': dates,
    'value': np.random.randn(365).cumsum()
})
df.set_index('date', inplace=True)

# Resample (downsampling/upsampling)
df.resample('M').mean()    # Monthly average
df.resample('W').sum()     # Weekly sum
df.resample('Q').max()     # Quarterly max

# Rolling windows
df['rolling_mean'] = df['value'].rolling(window=30).mean()
df['rolling_std'] = df['value'].rolling(window=30).std()

# Shift and diff
df['shifted'] = df['value'].shift(1)
df['diff'] = df['value'].diff()

Practice Exercise: EDA with Pandas

import pandas as pd
import numpy as np

# Load dataset (using built-in for demo)
from sklearn.datasets import load_diabetes
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Basic exploration
print("Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nDescriptive statistics:")
print(df.describe())

# Analysis
print("\nCorrelation with target:")
correlations = df.corr()['target'].sort_values(ascending=False)
print(correlations.head(10))

# Group analysis by binned values
df['age_bin'] = pd.cut(df['age'], bins=5)
print("\nTarget mean by age bin:")
print(df.groupby('age_bin')['target'].mean())

Key Takeaways

Pandas is essential for:

  1. Data loading - Read from various sources
  2. Data cleaning - Handle missing values, duplicates
  3. Data exploration - Statistics, filtering, aggregation
  4. Data transformation - Feature engineering
  5. Data merging - Combine multiple sources

Common Data Science Operations in Pandas

OperationCode
Load CSVpd.read_csv('file.csv')
Filter rowsdf[df['col'] > value]
Group bydf.groupby('col').agg(...)
Pivot tablepd.pivot_table(...)
Handle NaNdf.fillna() or df.dropna()
Save to CSVdf.to_csv('file.csv')

Next Steps

  • Master groupby operations
  • Learn pivot tables and crosstabs
  • Practice time series analysis
  • Explore pandas for feature engineering

Advertisement

Advertisement

Need More Practice?

Get personalized data science help from ChatWhole's AI-powered platform.

Get Expert Help →