Create simple Bag-of-Words models
In the previous blog of the series, we read about what a Bag-of-Words model is, and how we can manually create a simple model.
If you haven't read the blog, read it here:
In this blog, we will learn how to create simple BoW models using Scikit-Learn and Keras.
Create BoW using Scikit-Learn
There are different scoring methods that can be used to convert textual data to numerical vectors. You can read about these techniques here.
Scikit-Learn provides different methods for converting textual data into vectors of numerical values. Two of these methods are:
- CountVectorizer
- TfidfVectorizer
CountVectorizer
Convert a collection of text documents to a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer

# input text data
text = ["When your only tool is a hammer, all problems start looking like nails."]
# create the instance of the vectorizer
vectorizer = CountVectorizer()
# fit is used to learn the vocabulary
vectorizer.fit(text)
# print the generated vocabulary
print(vectorizer.vocabulary_)
# vectorize the input text
vector = vectorizer.transform(text)
print(vector.shape)
print(vector.toarray())
In the above code snippet, we supply sample text data to the vectorizer's fit method, which learns the vocabulary from the input text. Next, we use the transform method to convert the text into a numerical vector of token counts.
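Note that the vocabulary learned by fit is fixed: when we transform a new document, any word that was not seen during fitting is simply ignored. A quick sketch, continuing from the snippet above (the sentence "swing the hammer at all nails" is made up for illustration):

# transform a new document against the learned vocabulary;
# words unseen during fit ("swing", "the", "at") are ignored
new_vector = vectorizer.transform(["swing the hammer at all nails"])
print(new_vector.toarray())  # non-zero counts only for "hammer", "all", "nails"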
TfidfVectorizer
Convert a collection of raw documents to a matrix of TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer

# input text data
text = ["When your only tool is a hammer, all problems start looking like nails.",
        "When your",
        "problems start looking"]
# create the instance of the vectorizer
vectorizer = TfidfVectorizer()
# fit is used to learn the vocabulary
vectorizer.fit(text)
# print the generated vocabulary and idf values
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# vectorize an input text
vector = vectorizer.transform([text[0]])
print(vector.shape)
print(vector.toarray())
As before, we supply the sample documents to the vectorizer's fit method, which learns the vocabulary; this time the vectorizer also computes an inverse document frequency (IDF) value for each term, exposed through the idf_ attribute. Next, we use the transform method to convert the first document into a vector of TF-IDF scores.
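To make the printed idf_ values less opaque, here is a small sketch that reproduces the IDF of a single term by hand. It assumes scikit-learn's defaults (smooth_idf=True), under which the formula is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t:

import numpy as np

n = 3   # number of documents in our corpus
df = 2  # "when" occurs in the first and second documents
# should match the entry of vectorizer.idf_ for "when"
print(np.log((1 + n) / (1 + df)) + 1)  # ~1.2877

Also note that, by default, each row returned by transform is L2-normalized, so the printed vector contains normalized TF-IDF scores rather than raw ones.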
Create BoW using Keras
We can use the Tokenizer class from the Keras preprocessing module.
This class allows you to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, or based on TF-IDF.
The Tokenizer class is very convenient to use, as it supports the different types of scoring methods out of the box.
from keras.preprocessing.text import Tokenizer

# define 4 documents
docs = ["No man is an island",
        "Entire of itself,",
        "Every man is a piece of the continent,",
        "A part of the main."]
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)
In the above code snippet, we first fit the tokenizer on all documents to learn the vocabulary; this is done using fit_on_texts. Then we use texts_to_matrix to convert the texts to the corresponding vectors.
An important point to note is the second parameter of this function, mode. We can use mode to specify the type of scoring method we wish to use while vectorizing the text. The value can be one of “binary”, “count”, “tfidf”, or “freq”.
# if mode == 'count':
#     x[i][j] = c
# elif mode == 'freq':
#     x[i][j] = c / len(seq)
# elif mode == 'binary':
#     x[i][j] = 1
# elif mode == 'tfidf':
#     tf = 1 + np.log(c)
#     idf = np.log(1 + self.document_count /
#                  (1 + self.index_docs.get(j, 0)))
#     x[i][j] = tf * idf
The above formulas can be found inside the documented explanation of the texts_to_matrix method.
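To see how the choice of mode changes the encoding, we can vectorize the same documents with each scoring method. A small sketch, reusing the tokenizer t fitted above:

# compare the four supported scoring modes on the same documents
for mode in ['binary', 'count', 'freq', 'tfidf']:
    print(mode)
    print(t.texts_to_matrix(docs, mode=mode))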
In this blog post, we read about the different ways of creating a Bag-of-Words model using Scikit-Learn and Keras.