Bag-of-Words models in NLP

4 min read · May 28, 2021

In this two-part series, we shall see how to develop a simple Bag-of-Words model in Natural Language Processing.

We shall start by understanding what the Bag-Of-Words model is, and then see how we can develop a simple model using scikit-learn and Keras.

What is the Bag-Of-Words model?

Broadly speaking, a bag-of-words model is a representation of text that machine learning algorithms can use. Most machine learning algorithms cannot work directly with non-numerical data, which is why we use encoding methods such as One-Hot-Encoding to convert textual data into numerical matrices that the algorithm can consume.

Bag-Of-Words (BoW) aims to extract features from the text, which can be further used in modeling.

Let’s see how this works.

How a bag-of-words model works

The process by which input data is converted into a vector of numerical feature values is known as Feature Extraction. Bag-Of-Words is one such feature extraction technique for textual data.

One important thing to note is that the BoW model does not care about the order of the words in a sentence, hence the name. For example:

sent1 = "Hello, how are you?"
sent2 = "How are you?, Hello"

Both sent1 and sent2 would produce the same output vector.
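As a rough illustration of this order-insensitivity, the sketch below counts the words in each sentence and compares the results. The helper name bag_of_words and the tokenization rule (lowercase, letters only) are assumptions made for this example, not part of any library.

```python
import re
from collections import Counter

def bag_of_words(sentence):
    """Lowercase the sentence, drop punctuation, and count each word."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(words)

sent1 = "Hello, how are you?"
sent2 = "How are you?, Hello"

bag1 = bag_of_words(sent1)
bag2 = bag_of_words(sent2)

# The two bags are equal because only the words and their counts matter,
# not the positions at which the words appear.
print(bag1 == bag2)  # True
```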

BoW consists of two things:

1. A vocabulary of known words - We need to create a list of all the known words that the model will consider during feature extraction. This is similar to how we understand a sentence by referring to the words in a dictionary.

2. A count of the known words which are present - This keeps a count of the words in the input sentence that are also present in the vocabulary created above.
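The two pieces can be sketched together as follows. The vocabulary and the sentence here are hypothetical, chosen only so that one word repeats and two words fall outside the vocabulary. The function name count_vector is likewise an assumption for illustration.

```python
import re

def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [words.count(term) for term in vocabulary]

# Hypothetical vocabulary and sentence, used only for illustration.
vocabulary = ["the", "cat", "sat", "on", "mat"]
sentence = "The cat sat on the mat with the dog"

vector = count_vector(sentence, vocabulary)
print(vector)  # [3, 1, 1, 1, 1]
```

Words that are not in the vocabulary ("with" and "dog" here) simply do not contribute to the vector.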

Let us see how a simple bag of words can be created.

1. First, let us create a vocabulary of the known words:

We shall use the famous poem "No Man Is an Island" by John Donne. Below are the first four lines of the poem:

1. No man is an island, (5 words)
2. Entire of itself, (3 words)
3. Every man is a piece of the continent, (8 words)
4. A part of the main. (5 words)

We shall consider each line to be a separate document, which gives us 4 documents in our example. Now we shall create a vocabulary of all the known words from these documents.

The vocabulary is (ignoring punctuation and case):

no, man, is, an, island, entire, of, itself, every, a, piece, the, continent, part, main

Our vocabulary contains 15 unique words, created from a collection of 21 words in total.
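The vocabulary above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming a simple tokenizer that lowercases the text and keeps only runs of letters. The names tokenize, vocabulary, and total_words are chosen for this example.

```python
import re

# The four lines of the poem, each treated as a separate document.
documents = [
    "No man is an island,",
    "Entire of itself,",
    "Every man is a piece of the continent,",
    "A part of the main.",
]

def tokenize(text):
    """Lowercase the text and return its words without punctuation."""
    return re.findall(r"[a-z]+", text.lower())

vocabulary = []   # unique words, in order of first appearance
total_words = 0   # every word occurrence, including repeats

for doc in documents:
    for word in tokenize(doc):
        total_words += 1
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)
print(len(vocabulary), total_words)  # 15 21
```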

2. After the vocabulary is created, we create a vector for each document. This process is known as word scoring.

The easiest way to do this is the Binary Scoring method.

Since our vocabulary consists of 15 words, we can create vectors of length 15, marking 1 for each word present and 0 for each word absent in a particular document.

So for Document #1, the scoring would look like this:

no: 1, man: 1, is: 1, an: 1, island: 1, entire: 0, of: 0, itself: 0, every: 0, a: 0, piece: 0, the: 0, continent: 0, part: 0, main: 0

Converting this to a vector, it would look like this:

[1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]
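Binary scoring for all four documents can be sketched in the same way. The function name binary_vector is an assumption for this example, and the tokenization (lowercase, letters only) is the same simplification used when ignoring punctuation and case.

```python
import re

vocabulary = ["no", "man", "is", "an", "island", "entire", "of", "itself",
              "every", "a", "piece", "the", "continent", "part", "main"]

documents = [
    "No man is an island,",
    "Entire of itself,",
    "Every man is a piece of the continent,",
    "A part of the main.",
]

def binary_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(re.findall(r"[a-z]+", document.lower()))
    return [1 if term in present else 0 for term in vocabulary]

vectors = [binary_vector(doc, vocabulary) for doc in documents]
for vector in vectors:
    print(vector)
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0]
# [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
```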

Binary scoring, which we saw above, is only one of several ways of scoring words. You can read about the different ways of scoring here.

In this blog, we saw what the Bag-Of-Words model is and how we can create a basic model.

In the next part, we shall create BoW models using scikit-learn and keras.

You can read the next part of the series here.

References:

A Gentle Introduction to the Bag-of-Words Model - Machine Learning Mastery (machinelearningmastery.com)

Written by Priyansh Kedia