Text Classification with Scikit-learn: TF-IDF to BERT
Jun 20, 2026 4 Min Read 29 Views
(Last Updated)
Text classification is one of the most common NLP tasks in production and one of the most misunderstood in terms of tool selection. Reaching for a transformer model when a Logistic Regression classifier would perform just as well is a common beginner mistake it adds infrastructure cost, training time, and complexity without a meaningful accuracy gain.
Table of contents
- TL;DR Summary
- What Is Text Classification?
- Setting Up the Environment
- Understanding TF-IDF
- Building the Classical Text Classifier
- TF-IDF vs BERT: When to Use Which
- Common Mistakes Beginners Make
- Conclusion
- FAQs
- What is text classification in machine learning?
- How does TF-IDF work in text classification?
- Which classifier works best with TF-IDF?
- When should I use BERT instead of TF-IDF?
- What is the difference between TF-IDF and BERT?
- Do I need a GPU for text classification with scikit-learn?
- How do I handle imbalanced classes in text classification?
TL;DR Summary
- Text classification is the task of assigning predefined categories to text spam detection, sentiment analysis, topic labeling, and intent recognition are all text classification problems.
- In Python, scikit-learn handles the classical pipeline: This guide covers both ends of the spectrum from a working scikit-learn classifier to a BERT fine-tuning setup with code at every step.
Ready to build real machine learning projects from text classifiers to neural networks with structured guidance? Explore HCL GUVI’s Artificial Intelligence & Machine Learning Course designed to take you from Python fundamentals through core machine learning, deep learning, and NLP applications, with hands-on projects, mentorship, and placement support built in.
What Is Text Classification?
Text classification is a supervised machine learning task — a model is trained on labeled text examples and learns to predict the correct label for unseen text. Every classification problem shares the same structure:
- Input — a string of text
- Output — one or more predefined category labels
- Training signal — labeled examples the model learns from
Common real-world applications include spam vs. not-spam email filtering, positive/negative/neutral sentiment analysis, customer support ticket routing by topic, and news article categorization by subject.
The two dominant approaches in 2026 are classical ML with TF-IDF features and transformer-based models like BERT each with a distinct cost-accuracy tradeoff.
Setting Up the Environment
Install all required libraries before writing any code:
pip install scikit-learn pandas numpy transformers torch datasets
The classical pipeline uses only scikit-learn and pandas. The BERT section requires transformers, torch, and datasets from Hugging Face.
Ready to build real machine learning projects from text classifiers to neural networks with structured guidance? Explore HCL GUVI’s Artificial Intelligence & Machine Learning Course designed to take you from Python fundamentals through core machine learning, deep learning, and NLP applications, with hands-on projects, mentorship, and placement support built in.
Understanding TF-IDF
Before building the classifier, understanding what TF-IDF produces is essential it is the feature representation the model actually trains on.
TF-IDF stands for Term Frequency-Inverse Document Frequency. It converts raw text into numerical vectors by measuring how frequently a word appears in a document (TF), then downweighting words that appear across almost every document (IDF). Words that are common everywhere “the”, “is”, “and” get low scores. Words that are distinctive to specific documents get high scores.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"spam offer free money now",
"meeting scheduled for tomorrow",
"click here to claim your prize"
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
Each row in the output matrix is a document each column is a word each value is that word’s TF-IDF score in that document. This matrix is what the classifier receives as input.
Natural Language Processing (NLP) and text analytics continue to be among the most widely deployed machine learning applications in production systems. One of the most common use cases is text classification, which powers tasks such as customer support ticket routing, spam detection, sentiment analysis, content moderation, and document categorization. Because of this, foundational techniques like TF-IDF and practical machine learning libraries such as scikit-learn remain highly valuable skills for Python developers. While modern transformer-based models receive significant attention, traditional NLP pipelines built with TF-IDF and classical classifiers are still widely used due to their simplicity, speed, interpretability, and effectiveness in many real-world business applications.
Building the Classical Text Classifier
- The Dataset
A small labeled dataset is used for this implementation the same structure applies to any text classification problem:
import pandas as pd
from sklearn.model_selection import train_test_split
data = {
'text': [
"Win a free iPhone now click here",
"Meeting at 3pm in conference room B",
"Claim your lottery prize today",
"Project deadline has been moved to Friday",
"Exclusive deal just for you free gift",
"Please review the attached report",
"You have been selected for a cash reward",
"Quarterly review scheduled for next week"
],
'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
df['text'], df['label'], test_size=0.25, random_state=42
)
- Building the Pipeline
Scikit-learn’s Pipeline chains TF-IDF vectorization and classification into a single object — preventing data leakage between training and test sets:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
pipeline = Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
('clf', LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
The ngram_range=(1, 2) parameter includes both single words and two-word phrases as features — significantly improving classification on short texts where word pairs carry meaning that individual words miss.
One of the earliest large-scale email spam filtering systems was built using Naive Bayes, a machine learning algorithm that remains available in modern libraries such as scikit-learn. Despite being decades old, Naive Bayes continues to be used in production text-classification systems because of its speed, low memory requirements, and strong performance on well-structured, labeled datasets. In many NLP tasks—including spam detection, document categorization, sentiment analysis, and support ticket routing—a simple TF-IDF plus Naive Bayes pipeline can deliver surprisingly competitive results while being far easier to train and deploy than more complex deep learning models.
TF-IDF vs BERT: When to Use Which
| Factor | TF-IDF + sklearn | Fine-Tuned BERT |
| Dataset size | Any — works on small datasets | Benefits from 1,000+ labeled examples |
| Training time | Seconds to minutes | Minutes to hours |
| Infrastructure | CPU only | GPU strongly recommended |
| Accuracy ceiling | Moderate | High |
| Interpretability | High | Low |
| Best use case | Baseline, production simplicity | Nuanced language, high accuracy requirements |
The decision rule is straightforward start with TF-IDF, benchmark it, and only move to BERT if the accuracy gap justifies the infrastructure cost.
Want to go beyond text classification and explore the AI models redefining how machines understand language? Download HCL GUVI’s free Generative AI eBook.
Common Mistakes Beginners Make
1. Skipping the TF-IDF baseline — Jumping straight to BERT without establishing a classical baseline means there is no reference point to measure improvement against. TF-IDF often achieves 85–90% accuracy on clean datasets — making BERT unnecessary.
2. Not using Pipeline — Fitting the TF-IDF vectorizer on the full dataset before splitting introduces data leakage. Pipeline ensures the vectorizer only sees training data during fit.
3. Ignoring class imbalance — A dataset with 90% negative examples and 10% positive will produce a classifier that predicts negative every time and reports 90% accuracy. Always check class distribution and use stratified splits.
4. Using BERT without a GPU — BERT fine-tuning on CPU is prohibitively slow for anything beyond toy datasets. Use Google Colab’s free GPU tier if local hardware is unavailable.
5. Not evaluating with the right metrics — Accuracy is misleading on imbalanced datasets. Always report precision, recall, and F1-score per class — not just overall accuracy.
Conclusion
Text classification with scikit-learn is one of the most practical entry points into NLP — the TF-IDF pipeline is fast to build, easy to interpret, and handles the majority of real-world classification tasks without requiring GPU infrastructure or deep learning expertise. BERT extends that capability to language problems where context and nuance determine the correct label — but it earns its complexity only after the classical baseline has been benchmarked.
Start with TF-IDF and Logistic Regression. Measure the result. Move to BERT only if the gap demands it.
FAQs
What is text classification in machine learning?
Text classification is a supervised learning task where a model is trained to assign predefined category labels to text inputs spam detection, sentiment analysis, and topic labeling are the most common real-world applications.
How does TF-IDF work in text classification?
TF-IDF converts raw text into numerical vectors by scoring each word based on how frequently it appears in a document relative to how commonly it appears across all documents. These vectors serve as input features for classifiers like Logistic Regression or SVM.
Which classifier works best with TF-IDF?
Logistic Regression and Linear SVM consistently perform well with TF-IDF features on most text classification tasks. Naive Bayes is a strong choice for small datasets due to its fast training time and low data requirements.
When should I use BERT instead of TF-IDF?
Use BERT when TF-IDF has been benchmarked and accuracy has plateaued below the acceptable threshold — particularly for tasks involving negation, sarcasm, context-dependent language, or low-resource datasets where pretraining compensates for limited labeled examples.
What is the difference between TF-IDF and BERT?
TF-IDF treats text as a bag of words — word order and context are lost. BERT reads text bidirectionally and captures contextual relationships between words across the entire input sequence, producing fundamentally richer text representations.
Do I need a GPU for text classification with scikit-learn?
No — the TF-IDF pipeline runs efficiently on CPU. GPU hardware is only required when fine-tuning BERT or other transformer models, where the training computation is significantly more intensive.
How do I handle imbalanced classes in text classification?
Use stratified train-test splits, evaluate with F1-score rather than accuracy, and consider techniques like class weighting in the classifier (class_weight=’balanced’ in scikit-learn) or oversampling the minority class with SMOTE.



Did you enjoy this article?