Create simple Bag-of-Words models

Priyansh Kedia
May 28, 2021

In the previous post of this series, we looked at what a Bag-of-Words model is and how to create a simple one manually.

If you haven't read the blog, read it here:

In this blog, we will learn how to create simple BoW models using Scikit-Learn and Keras.

Create BoW using Scikit-Learn

There are different types of scoring methods that can be used to convert textual data to numerical vectors. You can read about these techniques here.

Scikit-Learn provides different methods for the conversion of textual data into vectors of numerical values. Two of these methods are:

  • CountVectorizer
  • TfidfVectorizer

CountVectorizer

Convert a collection of text documents to a matrix of token counts.

from sklearn.feature_extraction.text import CountVectorizer

# input text data
text = ["When your only tool is a hammer, all problems start looking like nails."]
# create the instance of vectorizer
vectorizer = CountVectorizer()
# fit is used to learn the vocabulary
vectorizer.fit(text)
# print the generated vocabulary
print(vectorizer.vocabulary_)
# vectorize the input text
vector = vectorizer.transform(text)
print(vector.shape)
print(vector.toarray())

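In the above code snippet, we pass the sample text to the vectorizer's fit method, which learns the vocabulary from it. By default, the vectorizer lowercases the text and keeps only tokens of two or more characters, so the word "a" does not appear in the vocabulary.

Next, we use the transform method to convert the text into a numerical vector. transform returns a sparse matrix, which is why we call toarray() before printing it.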
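To make the two steps concrete, here is a minimal pure-Python sketch of what fit and transform do conceptually. It is not Scikit-Learn's implementation: the helper names tokenize, fit_vocabulary and transform are made up for this illustration, and the sketch only assumes the default behaviour described above (lowercasing, tokens of two or more word characters, vocabulary indices assigned in alphabetical order).

```python
import re

# Assumption: mimics CountVectorizer's default tokenization
# (lowercase, keep tokens of two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Learn the vocabulary: map each distinct token to a column index."""
    tokens = set()
    for doc in documents:
        tokens.update(tokenize(doc))
    # Indices are assigned in alphabetical order of the tokens.
    return {token: index for index, token in enumerate(sorted(tokens))}


def transform(documents, vocabulary):
    """Count how often each vocabulary token occurs in each document."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token in tokenize(doc):
            if token in vocabulary:  # tokens outside the vocabulary are ignored
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


text = ["When your only tool is a hammer, all problems start looking like nails."]
vocabulary = fit_vocabulary(text)
vector = transform(text, vocabulary)

# A new sentence is encoded with the vocabulary learned above.
unseen = transform(["All nails, all hammers"], vocabulary)

print(vocabulary)
print(vector)
print(unseen)
```

Note how "hammers" is dropped from the second sentence: a word that was not seen while learning the vocabulary has no column in the vector.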

TfidfVectorizer

Convert a collection of raw documents to a matrix of TF-IDF features.

from sklearn.feature_extraction.text import TfidfVectorizer

# input text data
text = ["When your only tool is a hammer, all problems start looking like nails.",
"When your",
"problems start looking"]
# create the instance of vectorizer
vectorizer = TfidfVectorizer()
# fit is used to learn the vocabulary
vectorizer.fit(text)
# print the generated vocabulary and idf values
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# vectorize an input text
vector = vectorizer.transform([text[0]])
print(vector.shape)
print(vector.toarray())

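In the above code snippet, we fit the vectorizer on three documents, so it learns both the vocabulary and the inverse document frequency (idf_) of each term. Words that occur in more than one document, such as "when" and "problems", get a lower idf value than words that occur in only one.

Next, we use the transform method to convert only the first document into a vector of TF-IDF scores.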
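The sketch below works through the same numbers by hand in plain Python. It assumes TfidfVectorizer's default settings (smoothed idf, computed as ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector). The helper names fit_idf and tfidf_vector are hypothetical, and the result is a dictionary rather than the sparse matrix the library returns.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: same default tokenization as in Scikit-Learn's vectorizers.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_idf(documents):
    """Compute a smoothed idf value for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def tfidf_vector(document, idf):
    """Weight raw term counts by idf, then scale the vector to unit length."""
    counts = Counter(t for t in tokenize(document) if t in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


text = [
    "When your only tool is a hammer, all problems start looking like nails.",
    "When your",
    "problems start looking",
]
idf = fit_idf(text)
tfidf = tfidf_vector(text[0], idf)

print(idf["when"], idf["hammer"])
print(tfidf)
```

"hammer" appears in one document and "when" in two, so "hammer" ends up with the larger weight in the first document's vector.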

Create BoW using Keras

We can use the Tokenizer class from the Keras preprocessing module.

This class allows you to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector where the coefficient for each token could be binary, based on word count, based on tf-idf…

The Tokenizer class is very convenient to use as it provides support for the different types of scoring methods out of the box.

from keras.preprocessing.text import Tokenizer

# define 4 documents
docs = ["No man is an island", "Entire of itself,", "Every man is a piece of the continent,", "A part of the main."]
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# encode the documents as a matrix of word counts
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

In the above code snippet, we first fit the tokenizer on all documents to learn the vocabulary. This is done using fit_on_texts. Then we use texts_to_matrix to convert each document to its corresponding vector.
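As a rough illustration of these two steps, the following pure-Python sketch builds a word index and a count matrix for the same four documents. It is a simplification, not the Keras implementation: the tokenize helper is a stand-in for the library's own text cleaning. It follows two conventions of the Tokenizer: more frequent words get lower indices, and index 0 is reserved, so the matrix has one more column than there are words.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified stand-in: lowercase and keep runs of word characters.
    return re.findall(r"\w+", text.lower())


docs = [
    "No man is an island",
    "Entire of itself,",
    "Every man is a piece of the continent,",
    "A part of the main.",
]

# Step 1 (like fit_on_texts): count the words and rank them by frequency.
word_counts = Counter()
for doc in docs:
    word_counts.update(tokenize(doc))

ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

# Step 2 (like texts_to_matrix with mode='count'): one row per document,
# one column per word index. Column 0 stays empty.
count_matrix = []
for doc in docs:
    row = [0] * (len(word_index) + 1)
    for word in tokenize(doc):
        row[word_index[word]] += 1
    count_matrix.append(row)

print(word_index)
for row in count_matrix:
    print(row)
```

With 15 distinct words in the corpus, every row has 16 entries, which matches the shape of the matrix printed by the Keras snippet.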

An important point to note is the second parameter of this function, mode.

We can use mode to specify the type of scoring method we wish to use while vectorizing the text. The value can be one of “binary”, “count”, “tfidf”, “freq”.

# if mode == 'count':
#     x[i][j] = c
# elif mode == 'freq':
#     x[i][j] = c / len(seq)
# elif mode == 'binary':
#     x[i][j] = 1
# elif mode == 'tfidf':
#     tf = 1 + np.log(c)
#     idf = np.log(1 + self.document_count /
#                  (1 + self.index_docs.get(j, 0)))
#     x[i][j] = tf * idf

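The above formulas can be found in the documented explanation of the texts_to_matrix method. Here, c is the number of times token j occurs in document i, seq is the sequence of tokens of that document, and index_docs holds the number of documents each token appears in.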
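To see what these formulas produce, here is a small standalone sketch that applies them to a single token. The function name score and its parameters are made up for this example. The inputs describe the word "of" in the third document of the corpus above: it occurs once (c = 1) in a document of 8 tokens, and it appears in 3 of the 4 documents.

```python
import math


def score(c, seq_len, document_count, doc_freq, mode):
    """Score one token of one document, following the formulas above.

    c              -- occurrences of the token in this document
    seq_len        -- number of tokens in this document
    document_count -- number of documents the tokenizer was fitted on
    doc_freq       -- number of documents that contain the token
    """
    if mode == "count":
        return float(c)
    if mode == "freq":
        return c / seq_len
    if mode == "binary":
        return 1.0
    if mode == "tfidf":
        tf = 1 + math.log(c)
        idf = math.log(1 + document_count / (1 + doc_freq))
        return tf * idf
    raise ValueError(f"Unknown mode: {mode}")


# The word "of" in "Every man is a piece of the continent,"
scores = {
    mode: score(c=1, seq_len=8, document_count=4, doc_freq=3, mode=mode)
    for mode in ("binary", "count", "freq", "tfidf")
}
print(scores)
```

For a word that occurs once, "binary" and "count" agree, "freq" divides by the document length, and "tfidf" reduces to the idf term alone because tf = 1 + ln(1) = 1.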

In this blog post, we looked at different ways of creating Bag-of-Words models using Scikit-Learn and Keras.
