Day 49 introduces the classic feature-extraction techniques that turn raw text into numeric matrices. Bag-of-words counts and TF-IDF scores are the foundations for many traditional NLP pipelines and remain useful for quick baselines or lightweight models.
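
To make the two representations concrete, here is a minimal pure-Python sketch of how the numbers can be derived by hand. It assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). The helper names `tokenize`, `count_matrix`, and `tfidf_matrix` are illustrative only and are not part of the lesson code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters
    # (assumed to mirror scikit-learn's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


def count_matrix(corpus):
    # One Counter per document, then a sorted shared vocabulary.
    docs = [Counter(tokenize(doc)) for doc in corpus]
    vocab = sorted(set().union(*docs))
    return vocab, [[doc[term] for term in vocab] for doc in docs]


def tfidf_matrix(corpus):
    vocab, counts = count_matrix(corpus)
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Smoothed inverse document frequency.
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    rows = []
    for row in counts:
        weighted = [count * weight for count, weight in zip(row, idf)]
        # L2-normalize each row; guard against an all-zero document.
        norm = math.sqrt(sum(value * value for value in weighted)) or 1.0
        rows.append([value / norm for value in weighted])
    return vocab, rows


if __name__ == "__main__":
    sample = ["The fox is quick.", "The dog was not lazy."]
    terms, weights = tfidf_matrix(sample)
    for term, weight in zip(terms, weights[0]):
        print(f"{term:>6}: {weight:.3f}")
```

In this sketch a word that appears in every document, such as "the", receives the minimum IDF of 1, so it ends up with a lower weight than words that occur in only one document.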

`solutions.py` exposes the `build_count_matrix` and `build_tfidf_matrix` helper functions plus a small demo script that prints both representations for a sample corpus.

Ensure the lesson dependencies are installed (in particular `scikit-learn`, `pandas`, and `numpy`). Then execute the walkthrough:
python Day_49_NLP/solutions.py
The automated checks validate the helper functions against a miniature corpus. Run them with:
pytest tests/test_day_49.py
All tests expect to be run from the repository root so that imports resolve correctly.

See `Day_64_Modern_NLP_Pipelines` for transformer fine-tuning, retrieval-augmented generation, and robust evaluation workflows that build on the vectorization foundations covered here.

Run this lesson's code interactively in your browser:

!!! tip "About JupyterLite"
    JupyterLite runs entirely in your browser using WebAssembly. No installation or server is required. Note: the first launch may take a moment to load.

???+ example "solutions.py"

    ```python title="solutions.py"
    """Utility functions and a demo for bag-of-words and TF-IDF vectorization."""

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


    def build_count_matrix(corpus):
        """Return a document-term matrix of raw counts for the given corpus."""
        vectorizer = CountVectorizer()
        matrix = vectorizer.fit_transform(corpus)
        return pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())


    def build_tfidf_matrix(corpus):
        """Return a document-term matrix of TF-IDF scores for the given corpus."""
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(corpus)
        return pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())


    def demo():
        """Print a walkthrough of bag-of-words and TF-IDF representations."""
        corpus = [
            "The quick brown fox jumped over the lazy dog.",
            "The dog was not lazy.",
            "The fox is quick.",
        ]

        print("--- NLP Vectorization Demo ---")
        print("Sample Corpus:")
        for doc in corpus:
            print(f"- '{doc}'")
        print("-" * 30)

        print("\n--- 1. Bag-of-Words (CountVectorizer) ---")
        df_count = build_count_matrix(corpus)
        print("Vocabulary (Feature Names):")
        print(df_count.columns.to_list())
        print("\nDocument-Term Matrix (Counts):")
        print(df_count)
        print("This matrix shows the count of each word in each document.")
        print("-" * 30)

        print("\n--- 2. TF-IDF (TfidfVectorizer) ---")
        df_tfidf = build_tfidf_matrix(corpus)
        print("Vocabulary (Feature Names):")
        print(df_tfidf.columns.to_list())
        print("\nTF-IDF Matrix:")
        print(df_tfidf.round(2))
        print(
            "This matrix shows the TF-IDF score for each word, highlighting important words."
        )
        print("-" * 30)


    if __name__ == "__main__":
        demo()
    ```