Introduction
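Text classification assigns labels to text documents. Common applications include sentiment analysis, spam detection, and topic labeling; the closely related task of topic modeling discovers themes without predefined labels.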
Sentiment Analysis
from transformers import pipeline
# Load the default pretrained sentiment model (downloaded on first use)
sentiment = pipeline("sentiment-analysis")
result = sentiment("I love this product!")
print(result) # [{'label': 'POSITIVE', 'score': 0.99...}]
Topic Modeling
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# Create a document-term matrix of raw word counts
# (documents is assumed to be a list of text strings)
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
dtm = vectorizer.fit_transform(documents)
# Fit an LDA topic model with 10 topics
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(dtm)
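# Print the five highest-weighted words for each topic, strongest first
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {topic_idx}: {top_words}")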
Custom Classifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
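# Named text_clf so it does not shadow the transformers pipeline() imported earlier
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])
# train_texts and test_texts are lists of strings; train_labels holds one label per text
text_clf.fit(train_texts, train_labels)
predictions = text_clf.predict(test_texts)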
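To see a TF-IDF plus logistic regression classifier run end to end, including the classification report mentioned in the practice problems, here is a minimal sketch. The spam/ham texts and labels are made up for illustration, and the name text_clf is an assumption; with so little data the scores say nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
train_texts = [
    "win a free prize now",
    "free money click here",
    "claim your free reward",
    "meeting at noon tomorrow",
    "please review the report",
    "lunch with the team today",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["free prize money", "team meeting tomorrow"]
test_labels = ["spam", "ham"]

# Vectorize unigrams and bigrams, then fit a linear classifier on top
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
text_clf.fit(train_texts, train_labels)
predictions = text_clf.predict(test_texts)

# Per-class precision, recall, and F1 against the held-out labels
report = classification_report(test_labels, predictions, zero_division=0)
print(report)
```

Because the vectorizer lives inside the pipeline, it is fitted only on the training texts, and the test texts are transformed with the same vocabulary at prediction time.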
Multi-label Classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
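# X_train and X_test are feature matrices (e.g. TF-IDF vectors); y_train_multilabel
# is a binary indicator matrix with one column per label
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(X_train, y_train_multilabel)
predictions = classifier.predict(X_test)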
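The snippet above needs the labels as a binary indicator matrix. One way to build it, sketched here under assumptions, is scikit-learn's MultiLabelBinarizer, which converts a list of tag sets into that matrix and back. The texts, tags, and variable names below are hypothetical and chosen only to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data: each text may carry several tags
texts = [
    "stock markets rally as tech shares surge",
    "new smartphone chip boosts battery life",
    "team wins championship after overtime thriller",
    "tech company sponsors football league",
    "central bank raises interest rates",
    "star striker signs record transfer deal",
]
tags = [
    {"finance", "tech"},
    {"tech"},
    {"sports"},
    {"tech", "sports"},
    {"finance"},
    {"sports", "finance"},
]

# Turn the tag sets into a (n_samples, n_labels) matrix of 0s and 1s
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One binary LinearSVC is trained per label column
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(X, y)

X_new = vectorizer.transform(["bank shares fall as tech stock slides"])
pred = classifier.predict(X_new)
predicted_tags = mlb.inverse_transform(pred)  # back to tuples of tag names

print(mlb.classes_)
print(predicted_tags)
```

Each prediction row can contain any number of 1s, including none, which is what distinguishes multi-label from multi-class output.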
Practice Problems
- Perform sentiment analysis with a pretrained model
- Extract topics from a document collection with LDA
- Build a text classification pipeline
- Handle multi-label classification
- Evaluate a classifier with a classification report