Create simple Bag-of-Words models
In the previous blog of the series, we read about what a Bag-of-Words model is, and how we can manually create a simple model.
If you haven't read the blog, read it here:
In this blog, we will learn how to create simple BoW models using Scikit-Learn and Keras.
Create BoW using Scikit-Learn
There are different scoring methods that can be used to convert textual data to numerical vectors. You can read about these techniques here.
Scikit-Learn provides different methods for converting textual data into vectors of numerical values. Two of these methods are:
- CountVectorizer
- TfidfVectorizer
CountVectorizer
Convert a collection of text documents to a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer

# input text data
text = ["When your only tool is a hammer, all problems start looking like nails."]
# create the instance of the vectorizer
vectorizer = CountVectorizer()
# fit is used to learn the vocabulary
vectorizer.fit(text)
# print the generated vocabulary
print(vectorizer.vocabulary_)
# vectorize the input text
vector = vectorizer.transform(text)
print(vector.shape)
print(vector.toarray())
In the above code snippet, we supply sample text data to the vectorizer's fit method, which learns the vocabulary from the input text. Next, we use the transform method to convert the text into a numerical vector of token counts.
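Note that the vocabulary learned by fit is fixed: when we transform a new document, any word that was not seen during fitting is simply ignored. A quick sketch, continuing from the snippet above (the sentence "swing the hammer at all nails" is made up for illustration):

# transform a new document against the learned vocabulary;
# words unseen during fit ("swing", "the", "at") are ignored
new_vector = vectorizer.transform(["swing the hammer at all nails"])
print(new_vector.toarray())  # non-zero counts only for "hammer", "all", "nails"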
TfidfVectorizer
Convert a collection of raw documents to a matrix of TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer

# input text data
text = ["When your only tool is a hammer, all problems start looking like nails.",
        "When your",
        "problems start looking"]
# create the instance of the vectorizer
vectorizer = TfidfVectorizer()
# fit is used to learn the vocabulary
vectorizer.fit(text)
# print the generated vocabulary and idf values
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# vectorize an input text
vector = vectorizer.transform([text[0]])
print(vector.shape)
print(vector.toarray())
As before, we supply the sample documents to the vectorizer's fit method, which learns the vocabulary; this time the vectorizer also computes an inverse document frequency (IDF) value for each term, exposed through the idf_ attribute. Next, we use the transform method to convert the first document into a vector of TF-IDF scores.
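To make the printed idf_ values less opaque, here is a small sketch that reproduces the IDF of a single term by hand. It assumes scikit-learn's defaults (smooth_idf=True), under which the formula is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t:

import numpy as np

n = 3   # number of documents in our corpus
df = 2  # "when" occurs in the first and second documents
# should match the entry of vectorizer.idf_ for "when"
print(np.log((1 + n) / (1 + df)) + 1)  # ~1.2877

Also note that, by default, each row returned by transform is L2-normalized, so the printed vector contains normalized TF-IDF scores rather than raw ones.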
Create BoW using Keras
We can use the Tokenizer class from the Keras preprocessing module.
This class allows you to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, or based on TF-IDF.
The Tokenizer class is very convenient to use, as it supports the different types of scoring methods out of the box.
from keras.preprocessing.text import Tokenizer

# define 4 documents
docs = ["No man is an island",
        "Entire of itself,",
        "Every man is a piece of the continent,",
        "A part of the main."]
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)
In the above code snippet, we first fit the tokenizer on all documents to learn the vocabulary; this is done using fit_on_texts. Then we use texts_to_matrix to convert the texts to the corresponding vectors.
An important point to note is the second parameter of this function, mode. We can use mode to specify the type of scoring method we wish to use while vectorizing the text. The value can be one of “binary”, “count”, “tfidf”, or “freq”.
# if mode == 'count':
#     x[i][j] = c
# elif mode == 'freq':
#     x[i][j] = c / len(seq)
# elif mode == 'binary':
#     x[i][j] = 1
# elif mode == 'tfidf':
#     tf = 1 + np.log(c)
#     idf = np.log(1 + self.document_count /
#                  (1 + self.index_docs.get(j, 0)))
#     x[i][j] = tf * idf
The above formulas can be found inside the documented explanation of the texts_to_matrix method.
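To see how the choice of mode changes the encoding, we can vectorize the same documents with each scoring method. A small sketch, reusing the tokenizer t fitted above:

# compare the four supported scoring modes on the same documents
for mode in ['binary', 'count', 'freq', 'tfidf']:
    print(mode)
    print(t.texts_to_matrix(docs, mode=mode))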
In this blog post, we read about the different ways of creating a Bag-of-Words model using Scikit-Learn and Keras.