NLP Preprocessing

Topic: Text Processing

Introduction

Text preprocessing converts raw text into normalized tokens that machine learning models can consume. The usual steps are tokenization, stemming or lemmatization, and stop word removal.

Tokenization

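import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time setup: both tokenizers rely on the Punkt models.
# Recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')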

text = "Hello world. This is a test."

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)  # ['Hello world.', 'This is a test.']

# Word tokenization
words = word_tokenize(text)
print(words)  # ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']

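# Whitespace tokenization: no model needed, but punctuation stays attached
words = text.split()
print(words)  # ['Hello', 'world.', 'This', 'is', 'a', 'test.']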
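
When the Punkt models are not available, a regular expression can split punctuation off in a similar way. The snippet below is a minimal sketch using only the standard library. The name TOKEN_PATTERN and the pattern itself are illustrative assumptions, not part of NLTK.

```python
import re

text = "Hello world. This is a test."

# Whitespace splitting leaves punctuation glued to the preceding word.
whitespace_tokens = text.split()

# Hypothetical pattern: a run of word characters, or one non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(whitespace_tokens)  # ['Hello', 'world.', 'This', 'is', 'a', 'test.']
print(regex_tokens)       # ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']
```

This pattern treats every symbol as its own token, so contractions such as "don't" come out as three pieces, whereas word_tokenize splits them as "do" and "n't".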

Stemming

from nltk.stem import PorterStemmer, SnowballStemmer

stemmer = PorterStemmer()
words = ['running', 'runner', 'ran', 'runs']
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['run', 'runner', 'ran', 'run']

# Snowball (Porter2)
snowball = SnowballStemmer('english')
print([snowball.stem(w) for w in words])

Lemmatization

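import spacy

# One-time setup: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("running runner ran")
lemmas = [token.lemma_ for token in doc]
# Typical output; exact lemmas can vary with the model version and context
print(lemmas)  # ['run', 'runner', 'run']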

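# WordNet lemmatizer; requires a one-time nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('running'))  # running (pos defaults to noun)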

Stop Words

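from nltk.corpus import stopwords

# One-time setup: nltk.download('stopwords')
# The list is lowercase, so compare against lowercased tokens
stop_words = set(stopwords.words('english'))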
words = ['The', 'cat', 'is', 'on', 'the', 'mat']
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['cat', 'mat']

Practice Problems

Tokenize sentences and words
Apply stemming
Use lemmatization
Remove stop words
Build a preprocessing pipeline
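
For the last exercise, one possible starting point is sketched below. It uses only the standard library so that it runs without any downloads. DEMO_STOP_WORDS, TOKEN_PATTERN, and preprocess are hypothetical names invented for this sketch, and the tiny stop word list stands in for NLTK's full English list.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical, deliberately tiny stop word list for illustration only;
# in practice pass set(stopwords.words('english')) from NLTK instead.
DEMO_STOP_WORDS = frozenset({"the", "is", "on", "a", "an", "and", "this"})

# Keeps runs of word characters and silently drops punctuation.
TOKEN_PATTERN = re.compile(r"\w+")


def preprocess(
    text: str,
    stop_words: Iterable[str] = DEMO_STOP_WORDS,
    normalize: Callable[[str], str] = lambda token: token,
) -> List[str]:
    """Lowercase, tokenize, drop stop words, then normalize each token."""
    stop_set = set(stop_words)
    tokens = TOKEN_PATTERN.findall(text.lower())
    kept = [t for t in tokens if t not in stop_set]
    return [normalize(t) for t in kept]


result = preprocess("The cat is on the mat. This is a test.")
print(result)  # ['cat', 'mat', 'test']
```

The normalize argument is where a stemmer or lemmatizer would plug in, for example normalize=PorterStemmer().stem. Stop words are removed before normalization here so that the stop list is matched against unmodified words; whether that order suits a given task is a design choice to test.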
