Introduction
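Natural language processing (NLP) is a set of techniques for processing and understanding human language. This section covers text preprocessing, TF-IDF vectorization, and sentiment analysis, followed by practice problems.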
Text Preprocessing
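import re
import nltk
def preprocess_text(text):
    # Lowercase so that "Document" and "document" become the same token.
    text = text.lower()
    # Drop everything except letters and whitespace (digits and punctuation go).
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Split on whitespace to get a list of tokens.
    tokens = text.split()
    return tokens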
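# Remove stopwords (requires a one-time nltk.download("stopwords"))
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
tokens = preprocess_text("This is a short example sentence")  # example input
tokens = [w for w in tokens if w not in stop_words]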
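The two steps above can be combined into a single helper. The sketch below is a minimal, standard-library-only version. SMALL_STOPWORDS is a hypothetical, hand-picked stand-in for NLTK's English list (used so the example runs without downloading a corpus), and clean_tokens is an illustrative name that does not appear in the original.

```python
import re

# Hypothetical, hand-picked stopword set; NLTK's English list is much longer.
SMALL_STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to"}

def clean_tokens(text, stop_words=SMALL_STOPWORDS):
    """Lowercase, strip non-letters, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return [w for w in text.split() if w not in stop_words]

print(clean_tokens("This is the FIRST document, and it's short!"))
```

Because the regular expression removes apostrophes, "it's" becomes "its" and is not caught by a stopword list that only contains "it". If contractions matter for the task, they are probably better handled before punctuation is stripped.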
TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is document one", "This is document two", "Document three"]
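# Keep at most the 1000 most frequent terms across the corpus.
vectorizer = TfidfVectorizer(max_features=1000)
# X is a sparse matrix with one row per document and one column per term.
X = vectorizer.fit_transform(corpus)
# The learned vocabulary, in column order.
print(vectorizer.get_feature_names_out())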
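To see what the values in X represent, the weights can be recomputed by hand. The sketch below assumes scikit-learn's documented defaults for TfidfVectorizer: raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper tfidf_rows is an illustrative name, and the whitespace tokenizer is a simplifying assumption that happens to agree with the default tokenizer on this small corpus.

```python
import math
from collections import Counter

corpus = ["This is document one", "This is document two", "Document three"]

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed idf and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[w] * idf[w] for w in vocab]
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm for x in weights])
    return vocab, rows

vocab, rows = tfidf_rows(corpus)
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In this corpus "document" occurs in every text, so its idf is the minimum value of 1, and rarer terms such as "one" or "three" receive larger weights within their rows.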
Sentiment Analysis
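from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Illustrative toy data: 1 = positive, 0 = negative.
texts = ["I love this movie", "Great acting and story", "I hate this film", "Terrible and boring"]
labels = [1, 1, 0, 0]
# Convert each text into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
# Fit a multinomial Naive Bayes classifier on the count vectors.
model = MultinomialNB()
model.fit(X, labels)
# New text must go through the same fitted vectorizer before prediction.
print(model.predict(vectorizer.transform(["I love this story"])))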
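For intuition about what a multinomial Naive Bayes classifier computes, here is a small from-scratch sketch with add-one (Laplace) smoothing. It is not scikit-learn's implementation. The training sentences, the labels, and the names train_nb and predict_nb are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Made-up toy data for illustration only.
train_texts = ["good great fun", "great acting good story",
               "bad boring plot", "awful bad acting"]
train_labels = ["pos", "pos", "neg", "neg"]

def train_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Smoothed denominator: class word total plus alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                       for w in vocab}
    return log_prior, log_like

def predict_nb(text, log_prior, log_like):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in text.lower().split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

log_prior, log_like = train_nb(train_texts, train_labels)
print(predict_nb("good story", log_prior, log_like))
print(predict_nb("boring and bad", log_prior, log_like))
```

The smoothing term alpha keeps a word that appears in only one class from driving the other class's probability to zero. A value of 1.0 is used here, which is also the default alpha in MultinomialNB.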
Practice Problems
- Tokenize and preprocess text
- Create TF-IDF vectors
- Build a text classifier
- Use word embeddings
- Analyze sentiment