Introduction
Hugging Face Datasets (the datasets Python library) provides a single interface for loading and processing thousands of datasets from the Hugging Face Hub, covering NLP, audio, and computer vision.
Loading Datasets
from datasets import load_dataset
# Load from Hub
dataset = load_dataset('mnist', split='train')
print(dataset)
# Load CSV
dataset = load_dataset('csv', data_files='data.csv', split='train')
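# Load JSON (field='data' reads the records nested under the top-level "data" key)
dataset = load_dataset('json', data_files='data.json', field='data', split='train')
# Load multiple files, one per named split (without split=..., this returns a DatasetDict)
splits = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})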
Dataset Operations
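# Access elements (assumes a single Dataset with a 'text' column)
print(dataset[0]) # First example, as a dict
print(dataset[:5]['text']) # First 5 texts (slice rows first, then pick the column)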
# Properties
print(dataset.features)
print(dataset.num_rows)
print(dataset.column_names)
Transform with map
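from transformers import AutoTokenizer
# 'bert-base-uncased' is only an example checkpoint; any Hub tokenizer works
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def tokenize(batch):
    # With batched=True, batch['text'] is a list of strings
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=128)
tokenized = dataset.map(tokenize, batched=True)
print(tokenized)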
Filter
# Filter by condition
filtered = dataset.filter(lambda x: len(x['text'].split()) > 10)
# Keep only examples whose label is 0 or 1
filtered = dataset.filter(lambda x: x['label'] in [0, 1])
Train/Test Split
# Split dataset
train_test = dataset.train_test_split(test_size=0.2)
train = train_test['train']
test = train_test['test']
# Stratified split (the column must be a ClassLabel feature)
stratified = dataset.train_test_split(test_size=0.2, stratify_by_column='label')
Practice Problems
- Load a dataset from the Hub
- Access individual examples and columns
- Tokenize a text column with map
- Filter examples by a condition
- Create a train/test split