NLP Fundamentals
Natural language processing (NLP) covers techniques for processing and analyzing text data.
Text Preprocessing
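Tokenization splits text into words or other tokens. Lowercasing and punctuation removal normalize surface variation.
Stop-word removal and stemming or lemmatization reduce the vocabulary by dropping very common words and mapping inflected forms to a shared base form.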
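The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stop-word list is a tiny hand-picked set, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "on"}


def tokenize(text):
    """Lowercase the text and keep runs of letters/digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The cats are playing in the garden!"))
```

In practice a library stemmer or lemmatizer would replace `crude_stem`, since naive suffix stripping mishandles many words.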
Word Embeddings
Word2Vec learns dense vector representations of words from their surrounding context. GloVe is another embedding method, commonly used through its pre-trained vectors.
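Dense word vectors are usually compared with cosine similarity. The sketch below shows only that comparison; the three-dimensional vectors are made up by hand for illustration and are not output from Word2Vec or GloVe, whose vectors typically have tens to hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors (hypothetical values, NOT real embeddings).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

sim_related = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(sim_related > sim_unrelated)
```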
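CountVectorizer and TfidfVectorizer (scikit-learn) create sparse bag-of-words representations, using raw counts and TF-IDF weights respectively. Unlike embeddings, these representations discard word order and context.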
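To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of count vectors and TF-IDF weighting. The function names `bag_of_words` and `tfidf` are hypothetical helpers, not library APIs. The smoothed IDF formula and L2 row normalization are one common variant, chosen here to resemble the default weighting documented for scikit-learn's TfidfVectorizer; other definitions of TF-IDF exist.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix


def tfidf(counts):
    """Weight raw counts by a smoothed IDF, then L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in counts:
        vec = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in vec))
        weighted.append([v / norm if norm else 0.0 for v in vec])
    return weighted


docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"]]
vocab, counts = bag_of_words(docs)
weights = tfidf(counts)
print(vocab)
print(counts)
```

In the first document, "cat" receives a lower weight than "mat" because "cat" appears in every document and so carries less discriminating information.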
Text Classification
Naive Bayes is a classic baseline for text classification. Logistic regression on TF-IDF features often works well.
LSTMs and BERT are common deep learning approaches.
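As an illustration of the Naive Bayes baseline mentioned above, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written in plain Python. The `NaiveBayes` class and the four-document training set are hypothetical and exist only for this example; a real project would more likely use a library implementation and far more data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
train_docs = [
    ["great", "movie", "loved", "it"],
    ["wonderful", "acting", "great", "plot"],
    ["terrible", "movie", "hated", "it"],
    ["boring", "plot", "terrible", "acting"],
]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["great", "acting"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from zeroing out a class.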
Key Takeaways
- Preprocessing is crucial for NLP
- TF-IDF provides a simple text representation
- Deep learning models such as BERT provide state-of-the-art results on many tasks