Introduction
Python's built-in open() function, together with standard-library modules such as csv and json, covers most file reading and writing needs in data science workflows. File I/O is how datasets are loaded, processed results are saved, and different file formats are handled. Python supports both text and binary modes, and in text mode you choose the character encoding (for example UTF-8) used to convert between bytes and strings.
Key Concepts
- File modes: Read ("r"), write ("w"), and append ("a"), each in text or binary ("b") form
- Context managers: Automatic resource cleanup with the 'with' statement
- Text vs binary: Text mode decodes bytes to str and translates line endings; binary mode returns raw bytes unchanged
- Encodings: UTF-8, ASCII, and other character encodings
- Line-by-line processing: Efficient handling of large files
- CSV and JSON: Common data file formats
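The difference between text mode, binary mode, and encodings can be seen in a minimal sketch like the one below. The file name sample.txt and the temporary directory are assumptions chosen only to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical file in a temporary directory, so nothing on disk is overwritten.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

text = "caf\u00e9"  # "café": 4 characters, one of them non-ASCII

# Text mode: a str goes in and is encoded to bytes with the named encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Binary mode: raw bytes come out, with no decoding and no newline translation.
with open(path, "rb") as f:
    raw = f.read()

# Text mode again: the bytes are decoded back to a str.
with open(path, "r", encoding="utf-8") as f:
    decoded = f.read()

print(len(text), len(raw))  # 4 5 (the accented letter takes two bytes in UTF-8)
print(decoded == text)      # True
```

Reading the same file with a different encoding would either raise an error or silently produce different characters, which is why the encoding used for writing and reading should match.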
Python Implementation
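# Basic file reading: load the whole file into a single string
with open("data.txt", "r", encoding="utf-8") as file:
    content = file.read()

# Reading lines: readlines() returns a list of all lines
with open("data.txt", "r", encoding="utf-8") as file:
    lines = file.readlines()

# Iterating line by line. The file is reopened because readlines()
# above already consumed it; a loop in the same block would see no lines.
with open("data.txt", "r", encoding="utf-8") as file:
    for line in file:
        print(line.strip())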
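# Writing to files ("w" creates the file or truncates an existing one)
with open("output.txt", "w", encoding="utf-8") as file:
    file.write("Hello, World!\n")
    file.writelines(["Line 1\n", "Line 2\n"])  # writelines() adds no newlines itself

# Appending to files ("a" keeps existing content and writes at the end)
with open("log.txt", "a", encoding="utf-8") as file:
    file.write("New entry\n")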
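# CSV handling
import csv

# newline="" lets the csv module control line endings itself
with open("data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Name", "Age"])
    writer.writerows([["Alice", 25], ["Bob", 30]])

# Reading CSV (newline="" is recommended for reading as well)
with open("data.csv", "r", newline="", encoding="utf-8") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # each row is a list of strings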
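# JSON handling
import json

data = {"name": "John", "age": 30}

# Serialize the dictionary to a JSON file
with open("data.json", "w", encoding="utf-8") as file:
    json.dump(data, file)

# Read it back into a Python object
with open("data.json", "r", encoding="utf-8") as file:
    loaded = json.load(file)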
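One detail the CSV example leaves implicit is that the csv module returns every field as a string, so numbers written as integers come back as text. The sketch below shows one way to convert them, using csv.DictReader; the in-memory buffer stands in for a file like data.csv, and the lowercase keys "name" and "age" are illustrative choices rather than part of the original example.

```python
import csv
import io

# In-memory stand-in for a CSV file with a header row (assumed contents).
buffer = io.StringIO("Name,Age\r\nAlice,25\r\nBob,30\r\n", newline="")

# DictReader uses the header row as keys for each subsequent row.
reader = csv.DictReader(buffer)

# Every value arrives as str, so the age is converted explicitly.
people = [{"name": row["Name"], "age": int(row["Age"])} for row in reader]

print(people)
```

JSON behaves differently here: json.load restores numbers, booleans, and nested structures with their types, which is one reason to prefer it for small structured records.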
When to Use
- Loading datasets from disk for analysis
- Saving processed data and results
- Reading configuration files
- Processing log files
- Working with CSV and JSON data exports
- Handling large files with streaming
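As an illustration of the log-processing and streaming cases in the list above, the following sketch counts matching lines without reading the whole file at once. The function name count_matching_lines, the file name app.log, and the log contents are all hypothetical.

```python
import os
import tempfile


def count_matching_lines(path, keyword, encoding="utf-8"):
    """Count the lines that contain keyword, reading one line at a time."""
    count = 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:  # the file object yields lines lazily
            if keyword in line:
                count += 1
    return count


# Build a small sample log in a temporary directory for the demonstration.
log_path = os.path.join(tempfile.mkdtemp(), "app.log")
with open(log_path, "w", encoding="utf-8") as f:
    f.write("INFO start\nERROR disk full\nINFO retry\nERROR timeout\n")

error_count = count_matching_lines(log_path, "ERROR")
print(error_count)  # 2
```

Because only one line is held in memory at a time, the same function works on a file of a few lines or several gigabytes.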
Key Takeaways
- Always use context managers (with statement) for file operations to ensure proper cleanup
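- Specify the encoding explicitly (for example encoding="utf-8") whenever text may contain non-ASCII characters, because the default can depend on the platform's locale settings
- Pass newline="" when opening CSV files for the csv module, for reading as well as writing; without it, writing can produce extra blank rows on Windows and newlines inside quoted fields may be misread
- For large files, process line by line to avoid loading the entire file into memory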
- JSON and CSV are among the most common data interchange formats in data science
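The newline="" advice can be checked with a small round trip. This is a sketch under assumed data: the file name notes.csv and the rows are made up, and one field deliberately contains a line break inside it.

```python
import csv
import os
import tempfile

csv_path = os.path.join(tempfile.mkdtemp(), "notes.csv")

# The second row holds a field with an embedded newline.
rows = [["Name", "Note"], ["Alice", "first line\nsecond line"], ["Bob", "ok"]]

# newline="" on write: the csv module emits its own row terminators.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# newline="" on read: the embedded line break is kept inside its quoted field.
with open(csv_path, "r", newline="", encoding="utf-8") as f:
    loaded_rows = list(csv.reader(f))

print(loaded_rows == rows)  # True
```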