NLP Preprocessing

Topic: Text Processing

Introduction

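Text preprocessing converts raw text into a normalized sequence of tokens that a machine learning model can consume. Typical steps are tokenization, stemming or lemmatization, and stop-word removal.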

Tokenization

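import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time download of the tokenizer models
# (recent NLTK releases name this resource 'punkt_tab')
nltk.download('punkt')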

text = "Hello world. This is a test."

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)  # ['Hello world.', 'This is a test.']

# Word tokenization
words = word_tokenize(text)
print(words)  # ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']

# Whitespace tokenization (no NLTK needed; punctuation stays attached to words)
words = text.split()
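
To make the difference between whitespace splitting and punctuation-aware tokenization concrete, here is a minimal sketch using only the standard library. The regular expression is an illustrative assumption, not what word_tokenize does internally; it simply separates runs of word characters from single punctuation marks.

```python
import re

text = "Hello world. This is a test."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# Assumed pattern for illustration: a run of word characters,
# or any single character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Hello', 'world.', 'This', 'is', 'a', 'test.']
print(regex_tokens)       # ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']
```

For this sentence the regex result happens to match the word_tokenize output shown above, but the two will differ on contractions, abbreviations, and similar cases.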

Stemming

from nltk.stem import PorterStemmer, SnowballStemmer

stemmer = PorterStemmer()
words = ['running', 'runner', 'ran', 'runs']
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['run', 'runner', 'ran', 'run']
# Stemming only strips suffixes by rule, so the irregular form 'ran' is left unchanged

# Snowball (Porter2)
snowball = SnowballStemmer('english')
print([snowball.stem(w) for w in words])
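
As a rough illustration of why a stemmer reduces 'running' and 'runs' but leaves 'ran' alone, the sketch below strips a couple of suffixes by rule. toy_stem is a hypothetical helper written for this example; it is far simpler than the Porter or Snowball algorithms and should not be used in their place.

```python
def toy_stem(word):
    """Hypothetical toy stemmer: strips 'ing' or 's'. Not the Porter algorithm."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a trailing consonant, e.g. 'runn' -> 'run'.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    # No rule matched (e.g. irregular forms such as 'ran').
    return word

words = ['running', 'runner', 'ran', 'runs']
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)  # ['run', 'runner', 'ran', 'run']
```

Because the rules only look at word endings, nothing connects 'ran' to 'run'; that gap is what lemmatization addresses.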

Lemmatization

import spacy

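# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("running runner ran")
lemmas = [token.lemma_ for token in doc]
# Exact lemmas depend on the model version and the part-of-speech tags it predicts
print(lemmas)  # e.g. ['run', 'runner', 'run']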

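# WordNet lemmatizer (requires the corpus: nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# pos='v' marks the word as a verb; the default pos='n' treats it as a noun
print(lemmatizer.lemmatize('running', pos='v'))  # run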
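
Unlike a stemmer, a lemmatizer looks words up in a vocabulary, usually keyed by part of speech. The following sketch mimics that idea with a tiny hand-written table. LEMMA_TABLE and toy_lemmatize are hypothetical names for this example, and the table entries are assumptions chosen to mirror the words above, not data taken from WordNet or spaCy.

```python
# Hypothetical miniature lemma table keyed by (word, part-of-speech tag).
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("ran", pos="v"))      # run
print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("running"))           # running (no noun entry, so unchanged)
```

The last call shows why the part-of-speech argument matters: without the right tag the lookup finds nothing and the word comes back unchanged.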

Stop Words

# Requires the corpus: nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
words = ['The', 'cat', 'is', 'on', 'the', 'mat']
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['cat', 'mat']

Practice Problems

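  1. Tokenize a paragraph into sentences and words
  2. Apply stemming to a list of words
  3. Use lemmatization and compare the results with stemming
  4. Remove stop words from a tokenized sentence
  5. Build a preprocessing pipeline that combines these steps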
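
As a possible starting point for the last problem, here is a minimal pipeline sketch that uses only the standard library. The STOP_WORDS set and the preprocess function are hypothetical names for this example, and the stop-word list is a small hand-picked assumption; a fuller solution would more likely use the NLTK or spaCy components shown earlier.

```python
import re

# Hypothetical, hand-picked stop-word list for illustration only.
STOP_WORDS = {"the", "is", "on", "a", "this"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, extract word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]

result = preprocess("The cat is on the mat.")
print(result)  # ['cat', 'mat']
```

A stemming or lemmatization step could be added to the list comprehension once the basic flow works.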
