Feature Extraction Techniques

These are often used before applying statistical or ML models, especially in classical NLP.

Bag of Words (BoW)
- Converts text into a fixed-length vector of word counts.
- Ignores grammar and word order.
Term Frequency–Inverse Document Frequency (TF-IDF)
- Extends BoW by weighting words by their importance across documents.
n-Grams
- Extends BoW/TF-IDF with multi-word units (e.g., bigrams, trigrams).
- Used both for feature extraction and as standalone statistical models.
Word Embeddings
- Dense vector representations capturing semantic meaning.
- Includes:
  - Word2Vec: Skip-gram and CBOW
  - GloVe: Matrix factorization-based
  - FastText: Includes subword information
Kernel Methods (e.g., string kernels, tree kernels)
- Use structured similarity measures for text, often in SVMs.

Last updated 3 months ago