Building a Basic Text Classifier Using Python and Scikit-Learn

From sentiment analysis to language translation, natural language processing (NLP) techniques let machines comprehend and process human language, opening doors to applications across many industries. In this blog post, we’ll take a hands-on look at NLP by building a basic text classifier using Python and the popular scikit-learn library.

Through this practical demonstration, we’ll unravel the essential steps involved in processing and analyzing textual data, from tokenization to model evaluation. Whether you’re a seasoned data scientist or a curious beginner, this guide will provide valuable insights into the fundamental concepts and techniques of NLP.

So, let’s dive in and discover the magic of NLP as we embark on our journey to build a text classifier from scratch. By the end of this blog post, you’ll not only have a deeper understanding of NLP principles but also possess the skills to tackle your own text analysis projects with confidence.

Let’s get started!

1. Ensure you have Python 3 installed, then install NLTK and scikit-learn:
pip install nltk scikit-learn

2. Download the punkt zip file from the

3. Unzip it and drop it into:

  • Windows: C:\nltk_data\tokenizers
  • OSX: /usr/local/share/nltk_data/tokenizers
  • Unix: /usr/share/nltk_data/tokenizers
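
If you are unsure which of these folders applies to your machine, a small helper can pick one from Python's platform string and report whether a punkt folder is already there. This is only a sketch: `tokenizers_dir` and `has_punkt` are hypothetical names, and the three paths are simply the ones listed above. NLTK also searches other locations, and newer NLTK releases may look for a resource named punkt_tab instead.

```python
import os
import sys

# Paths copied from the list above; keys are prefixes of sys.platform.
TOKENIZER_DIRS = {
    "win": r"C:\nltk_data\tokenizers",
    "darwin": "/usr/local/share/nltk_data/tokenizers",
    "linux": "/usr/share/nltk_data/tokenizers",
}

def tokenizers_dir(platform=None):
    """Return the tokenizers folder from the list above for a platform string."""
    platform = sys.platform if platform is None else platform
    for prefix, path in TOKENIZER_DIRS.items():
        if platform.startswith(prefix):
            return path
    # Assumption: fall back to the generic Unix location on other systems.
    return TOKENIZER_DIRS["linux"]

def has_punkt(directory):
    """True if an unzipped punkt folder exists inside the given directory."""
    return os.path.isdir(os.path.join(directory, "punkt"))

if __name__ == "__main__":
    target = tokenizers_dir()
    print("Tokenizers folder:", target)
    print("punkt present:", has_punkt(target))
```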

4. Copy and paste this code:

import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load NLTK data (adjust the path to wherever you placed the nltk_data folder)
nltk.data.path.append('/Library/Frameworks/Python.framework/Versions/3.11/share/nltk_data')

# Input strings and their categories
texts = [
    "the quick brown fox jumps over the lazy dog.",
    "apples and oranges are fruits.",
    "the cat is sleeping on the mat.",
    "birds chirp in the morning.",
    "elephants are large mammals.",
    "cars are vehicles used for transportation.",
    "the sun shines brightly in the sky.",
    "the moon rises at night."
]
categories = ["animal-related", "non-animal-related", "animal-related", "animal-related",
              "animal-related", "non-animal-related", "non-animal-related", "non-animal-related"]

# Tokenize the texts
tokenized_texts = [word_tokenize(text) for text in texts]
print("TOKENIZED TEXTS:", tokenized_texts)

# Convert tokenized texts into single strings
text_strings = [' '.join(tokens) for tokens in tokenized_texts]
print("TEXT STRINGS:", text_strings)

# Initialize CountVectorizer to perform feature engineering.
vectorizer = CountVectorizer()

# Fit the vectorizer and transform the texts into count vectors
vectorized_texts = vectorizer.fit_transform(text_strings)
print("VOCABULARY:", vectorizer.get_feature_names_out())

# Show the resulting document-term matrix as a dense array
print("VECTORIZED TEXTS:", vectorized_texts.toarray())

# Convert categories into numerical labels
label_map = {"animal-related": 0, "non-animal-related": 1}
labels = [label_map[category] for category in categories]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(vectorized_texts, labels, test_size=0.2, random_state=42)
print("TRAINING DATA:", X_train.toarray())
print("TESTING DATA:", X_test.toarray())

# Initialize Support Vector Classifier (SVC)
clf = SVC()

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on the test data
predictions = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

# Print the accuracy
print("Accuracy:", accuracy)
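
A note on the number this script prints: accuracy is simply the fraction of test sentences whose predicted label matches the true one. With eight sentences and test_size=0.2, scikit-learn rounds the test set up to two sentences, so the score can only be 0.0, 0.5, or 1.0. The plain-Python sketch below walks through that arithmetic; the helper name `accuracy` and the predictions are made up for illustration and are not output from the classifier above.

```python
import math

def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label.
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

n_samples = 8
test_size = 0.2
# Number of held-out samples for a fractional test size (rounded up).
n_test = math.ceil(n_samples * test_size)

# Made-up labels for two held-out sentences (0 = animal, 1 = non-animal).
y_test = [1, 0]
predictions = [1, 1]

print("Held-out samples:", n_test)
print("Accuracy:", accuracy(y_test, predictions))
```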


  1. Tokenization and Text Cleaning:
    • NLTK’s word_tokenize function splits each sentence into individual word and punctuation tokens. This step prepares the text data for further processing.
  2. Feature Engineering with Bag-of-Words:
    • CountVectorizer from scikit-learn handles the feature engineering. It converts the text into numerical count vectors (a bag-of-words representation), which machine learning models need in order to train on text data.
  3. Splitting Data and Model Training:
    • The data is split into training and testing sets using the train_test_split function, so that the model’s performance can be evaluated on data it did not see during training.
    • The Support Vector Classifier (SVC) from scikit-learn is initialized and trained on the training data.
  4. Model Evaluation:
    • The trained model is used to make predictions on the test data, and the accuracy of the model is calculated using the accuracy_score function. This step demonstrates how to evaluate the performance of the model on unseen data.
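
To make the bag-of-words step more concrete, the counting that CountVectorizer performs can be approximated in a few lines of plain Python. This is a simplified sketch, not scikit-learn's implementation: the helper names `tokenize`, `build_vocabulary`, and `vectorize` are made up for illustration, and the regular expression mirrors CountVectorizer's default behaviour of lowercasing the text and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(texts):
    # Sorted list of every distinct token across all texts.
    return sorted({token for text in texts for token in tokenize(text)})

def vectorize(text, vocabulary):
    # One count per vocabulary word, in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

texts = ["the cat is sleeping on the mat.", "the moon rises at night."]
vocabulary = build_vocabulary(texts)
vectors = [vectorize(text, vocabulary) for text in texts]

print(vocabulary)
for row in vectors:
    print(row)
```

Each row has one entry per vocabulary word. The word "the" appears twice in the first sentence, so its entry in the first row is 2.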

This sample code demonstrates fundamental concepts in NLP and machine learning, making it a suitable illustration of the basic workflow of text processing and classification.
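
As a possible next step, the vectorizer and classifier can be chained so that raw text goes in and a category comes out. The following is a minimal sketch rather than part of the script above: it assumes scikit-learn is installed, uses `make_pipeline` from `sklearn.pipeline`, trains on a shortened copy of the example data, and classifies a made-up sentence. With so few training sentences, the predicted category should not be taken seriously.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# A few sentences from the example (shortened copy, for illustration only).
texts = [
    "the cat is sleeping on the mat.",
    "elephants are large mammals.",
    "cars are vehicles used for transportation.",
    "the moon rises at night.",
]
categories = ["animal-related", "animal-related",
              "non-animal-related", "non-animal-related"]

# Chain vectorization and classification; SVC accepts string labels directly.
model = make_pipeline(CountVectorizer(), SVC())
model.fit(texts, categories)

# Hypothetical new sentence, not part of the training data.
new_sentence = "the cat chases the moon."
prediction = model.predict([new_sentence])[0]
print("Predicted category:", prediction)
```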
