Methods for Scoring Words in NLP
Natural language processing (NLP) is concerned with the interaction between computers and human language.
As we all know, human language is messy, and there are many different ways of saying the same thing.
Human beings have plenty of ways to communicate with each other, but what about communication between humans and computers?
That is where Natural Language Processing comes in.
When solving NLP problems, textual data is converted into numerical data so that the machine can work with it. This conversion is crucial to the results of an NLP model. There are many different ways to convert textual data into numerical values (vectors in most cases).
The scoring of words is done with respect to a well-defined vocabulary.
Scoring can be done in several different ways, namely:
- Binary Scoring
- Count Scoring
- Frequency Scoring
- TF-IDF Scoring
Let us briefly go through each of these scoring methods.
Binary Scoring
This is a very simple way of scoring words in a document. In this method, we simply mark 1 when a particular word is present in a document, and 0 when the word is not present.
To understand this, let us take an example.
Let us assume that our vocabulary is:
no, man, is, an, island, entire, of, itself, every, a, piece, the, continent, part, main
So for the document "No man is an island", the scoring would look like this:
No: 1, man: 1, is: 1, an: 1, island: 1, entire: 0, of: 0, itself: 0, every: 0, a: 0, piece: 0, the: 0, continent: 0, part: 0, main: 0
Converting this to a vector, it would look like this:
[1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]
As we can see, we mark 1 for the vocabulary words that are present in the document and 0 for the others. The vectors in this scoring method contain only 0s and 1s, hence the name.
One more observation in this scoring method is that the length of the vector is equal to the number of words in the vocabulary.
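Binary scoring is simple enough to reproduce in a few lines of plain Python. The sketch below assumes a lowercase whitespace tokenizer, which is enough for this toy document, and uses the vocabulary from the example above.

```python
# A minimal sketch of binary scoring, assuming whitespace tokenization.
vocabulary = ["no", "man", "is", "an", "island", "entire", "of", "itself",
              "every", "a", "piece", "the", "continent", "part", "main"]

document = "No man is an island"
tokens = set(document.lower().split())

# Mark 1 if the vocabulary word appears in the document, 0 otherwise.
binary_vector = [1 if word in tokens else 0 for word in vocabulary]
print(binary_vector)
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```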
Count Scoring
This scoring method works on the count of the words in a document.
This creates a vector in which the values correspond to the number of times the particular word has occurred in the document.
If we consider the example above (from binary scoring), we get the same vector, since none of the words repeats in the document.
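A minimal sketch of count scoring is shown below, reusing the same vocabulary. The document here is a made-up variant with repeated words, so that some counts rise above 1.

```python
# A minimal sketch of count scoring with plain Python.
from collections import Counter

vocabulary = ["no", "man", "is", "an", "island", "entire", "of", "itself",
              "every", "a", "piece", "the", "continent", "part", "main"]

document = "No man is an island no man"   # made-up document with repeated words
counts = Counter(document.lower().split())

# Each position holds how many times the vocabulary word occurs in the document.
count_vector = [counts[word] for word in vocabulary]
print(count_vector)
# [2, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```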
Frequency Scoring
This scoring method is often confused with count scoring. The two methods are similar; the only difference is that frequency scoring calculates the frequency of each word in a document, i.e. the number of times the word appears divided by the total number of words in the document.
Let us see this type of scoring with an example.
Let us assume that our vocabulary is:
no, man, is, an, island, entire, of, itself, every, a, piece, the, continent, part, main
So for the document "No man is an island", the scoring would look like this:
No: 0.2, man: 0.2, is: 0.2, an: 0.2, island: 0.2, entire: 0, of: 0, itself: 0, every: 0, a: 0, piece: 0, the: 0, continent: 0, part: 0, main: 0
Converting this to a vector, it would look like this:
[0.2,0.2,0.2,0.2,0.2,0,0,0,0,0,0,0,0,0,0]
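The same example can be reproduced with a short sketch: each count is simply divided by the total number of words in the document (5 in this case), again assuming whitespace tokenization.

```python
# A minimal sketch of frequency scoring with plain Python.
from collections import Counter

vocabulary = ["no", "man", "is", "an", "island", "entire", "of", "itself",
              "every", "a", "piece", "the", "continent", "part", "main"]

document = "No man is an island"
tokens = document.lower().split()
counts = Counter(tokens)
total = len(tokens)  # 5 words in this document

# Divide each word's count by the total number of words in the document.
frequency_vector = [counts[word] / total for word in vocabulary]
print(frequency_vector)
# [0.2, 0.2, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```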
TF-IDF Scoring
This is perhaps the most important type of scoring method in NLP.
Term Frequency - Inverse Document Frequency is a measure of how relevant a word is to a document in a collection of documents. For example, in the document No man is an island, `is` and `an` might not be as relevant to the document as `man`, `no`, and `island`.
TF-IDF is calculated by multiplying the frequency of a word in a document by the inverse of the word's frequency across the collection of documents.
In other words, TF-IDF is the product of two metrics (a short sketch follows the list below):
- Term Frequency: The number of times a word appears in the document
- Inverse Document Frequency: A measure of how rare the word is across the collection of documents, typically computed as the logarithm of the total number of documents divided by the number of documents containing the word.
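The sketch below implements this textbook formulation on a small made-up corpus built from the vocabulary words above. Note that this is only one common variant: libraries such as scikit-learn apply smoothing to the IDF term, so their exact numbers will differ.

```python
# A minimal TF-IDF sketch in plain Python, using tf = count / doc_length
# and idf = log(N / df). This is the textbook variant, not a library API.
import math
from collections import Counter

documents = [
    "No man is an island",
    "No man is an entire piece of the continent",
    "Every man is a part of the main",
]  # a small made-up corpus for illustration

tokenized = [doc.lower().split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def tf_idf(word, tokens):
    tf = tokens.count(word) / len(tokens)   # term frequency in this document
    idf = math.log(n_docs / df[word])       # inverse document frequency
    return tf * idf

tokens = tokenized[0]  # "No man is an island"
for word in ["is", "an", "island"]:
    print(word, round(tf_idf(word, tokens), 3))
# "is" and "an" appear in most or all documents, so their scores are lower
# than that of "island", which appears only in the first document.
```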
In this blog, we read about the different word scoring methods used in NLP and how each of them works.