Introduction
Text preprocessing converts raw text into normalized tokens that machine learning models can consume. The usual steps are tokenization, stemming or lemmatization, and stop word removal.
Tokenization
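import nltk
nltk.download('punkt')  # one-time download of tokenizer data; recent NLTK releases may need 'punkt_tab' instead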
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Hello world. This is a test."
# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences) # ['Hello world.', 'This is a test.']
# Word tokenization
words = word_tokenize(text)
print(words) # ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']
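# Whitespace tokenization (punctuation stays attached to the words)
words = text.split()
print(words) # ['Hello', 'world.', 'This', 'is', 'a', 'test.']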
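Between whitespace splitting and a full tokenizer sits the regular-expression approach. The sketch below is an illustration only, not how `word_tokenize` works internally; `TOKEN_RE` and `regex_tokenize` are hypothetical names introduced here.

```python
import re

# Hypothetical pattern: a word (optionally with internal apostrophes)
# or a single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def regex_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(regex_tokenize("Hello world. This is a test."))
# ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']
print(regex_tokenize("Don't stop."))
# ["Don't", 'stop', '.']
```

Unlike this pattern, `word_tokenize` splits contractions such as "Don't" into separate tokens, so the two approaches can disagree on real text.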
Stemming
from nltk.stem import PorterStemmer, SnowballStemmer
stemmer = PorterStemmer()
words = ['running', 'runner', 'ran', 'runs']
stems = [stemmer.stem(w) for w in words]
print(stems) # ['run', 'runner', 'ran', 'run']
# Snowball (Porter2)
snowball = SnowballStemmer('english')
print([snowball.stem(w) for w in words]) # ['run', 'runner', 'ran', 'run']
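Stemmers work by applying suffix-stripping rules rather than looking words up in a dictionary. The toy function below is a deliberately naive sketch of that idea, not the Porter or Snowball algorithm; `naive_stem` and its suffix list are assumptions made for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["running", "runner", "ran", "runs"]])
# ['runn', 'runner', 'ran', 'run']
```

The result 'runn' shows why real stemmers carry extra rules, such as undoubling a final consonant. Like the Porter stemmer, this sketch leaves the irregular form 'ran' untouched, because no suffix rule applies to it.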
Lemmatization
import spacy
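# Install the model first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("running runner ran")
lemmas = [token.lemma_ for token in doc]
print(lemmas) # e.g. ['run', 'runner', 'run']; exact output can vary with the model version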
# WordNet lemmatizer
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v')) # run
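The `pos='v'` argument matters because a lemmatizer maps a word to its dictionary form for a given part of speech, and `WordNetLemmatizer` assumes a noun when none is given. The sketch below mimics that behavior with a tiny hand-written lexicon; `LEMMAS` and `lookup_lemma` are hypothetical and stand in for WordNet or a trained model.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech).
LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def lookup_lemma(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(lookup_lemma("running", pos="v"))  # run
print(lookup_lemma("running"))           # running (looked up as a noun)
print(lookup_lemma("ran", pos="v"))      # run
```

Unlike a stemmer, a lookup of this kind can handle irregular forms such as 'ran' or 'mice', but only when the part of speech is right.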
Stop Words
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stop word lists
stop_words = set(stopwords.words('english'))
words = ['The', 'cat', 'is', 'on', 'the', 'mat']
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered) # ['cat', 'mat']
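Stock stop word lists can remove words that carry meaning for a task; NLTK's English list, for example, includes negations such as 'not'. A minimal sketch of keeping selected words, using a hypothetical hand-picked list (`STOP_WORDS`) and helper (`remove_stop_words`) rather than the NLTK corpus:

```python
# Hypothetical short stop list; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "on", "a", "an", "not"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=frozenset()):
    """Drop tokens whose lowercase form is a stop word, except those listed in `keep`."""
    active = set(stop_words) - set(keep)
    return [t for t in tokens if t.lower() not in active]

tokens = ["The", "cat", "is", "not", "on", "the", "mat"]
print(remove_stop_words(tokens))                # ['cat', 'mat']
print(remove_stop_words(tokens, keep={"not"}))  # ['cat', 'not', 'mat']
```

Keeping 'not' preserves the negation, which can matter for tasks such as sentiment analysis.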
Practice Problems
- Tokenize sentences and words
- Apply stemming
- Use lemmatization
- Remove stop words
- Build a preprocessing pipeline that combines these steps
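As a possible starting point for the last problem, the sketch below chains tokenization, stop word removal, and normalization using only the standard library. The names `preprocess` and `STOP_WORDS` and the `normalize` hook are assumptions for illustration; a stemmer's `stem` method or a lemmatizer could be passed in place of `str.lower`.

```python
import re

STOP_WORDS = {"the", "is", "on", "a", "an"}  # hypothetical small stop list

def preprocess(text, normalize=str.lower, stop_words=STOP_WORDS):
    """Tokenize into words, drop stop words, then normalize what remains."""
    tokens = re.findall(r"\w+", text)  # word tokens only; punctuation is discarded
    kept = [t for t in tokens if t.lower() not in stop_words]
    return [normalize(t) for t in kept]

print(preprocess("The Cat is on the Mat."))  # ['cat', 'mat']
```

Passing the normalizer as a parameter keeps the pipeline independent of any one library, so stemming and lemmatization can be swapped and compared.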