Bag of Words
First, build a vocabulary of all unique words appearing in the entire set of texts.
Next, iterate through each text in the training set and mark a 1 in the vector position corresponding to each word that text contains. For example:
Texts:
T1: The food was terrible, I hated it. (7 words)
T2: The restaurant was very far away, I hated it. (9 words)
T3: The pasta was delicious, will come back again. (8 words)
(Derived Corpus): The | food | was | terrible | I | hated | it | restaurant | very | far | away | pasta | delicious | will | come | back | again (17 words)
T1 Vector: 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
T2 Vector: 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0
T3 Vector: 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1
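The construction above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenization with commas and periods stripped (matching the corpus derivation above); real pipelines usually lowercase and use a proper tokenizer.

```python
def tokenize(text):
    # Strip simple punctuation and split on whitespace.
    return text.replace(",", "").replace(".", "").split()

texts = [
    "The food was terrible, I hated it.",
    "The restaurant was very far away, I hated it.",
    "The pasta was delicious, will come back again.",
]

# Build the vocabulary in order of first appearance.
vocab = []
for text in texts:
    for word in tokenize(text):
        if word not in vocab:
            vocab.append(word)

# Mark 1 for each vocabulary word present in a text, 0 otherwise.
vectors = [[1 if word in tokenize(text) else 0 for word in vocab]
           for text in texts]

print(vocab)      # 17-word derived corpus
for v in vectors: # one binary row vector per text
    print(v)
```

Note that this encoding records only presence or absence; a count-based variant would store how many times each word occurs instead of a binary flag.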
The problem with this approach is that the feature set becomes extremely large and sparse. To reduce the feature size, and thereby increase computation speed and the performance of classification models, several feature selection techniques have been applied to bag-of-words models.
Filter-based techniques:
- Chi-Squared test
- Mutual Information
- Signal-to-noise ratio
- Area under the Receiver Operating Characteristic (ROC) curve
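As one concrete example of a filter-based technique, the chi-squared score of a binary word feature against a binary class label can be computed from a 2x2 contingency table. The sketch below is hand-rolled for illustration; the toy feature columns and sentiment labels are assumptions, not part of the example texts above, and in practice a library routine (e.g. scikit-learn's `chi2`) would be used.

```python
def chi_squared(feature, labels):
    # Build the 2x2 contingency table: feature value x class label.
    n = len(labels)
    counts = {(f, y): 0 for f in (0, 1) for y in (0, 1)}
    for f, y in zip(feature, labels):
        counts[(f, y)] += 1
    # Sum (observed - expected)^2 / expected over all four cells.
    score = 0.0
    for f in (0, 1):
        for y in (0, 1):
            row = counts[(f, 0)] + counts[(f, 1)]
            col = counts[(0, y)] + counts[(1, y)]
            expected = row * col / n
            if expected > 0:
                score += (counts[(f, y)] - expected) ** 2 / expected
    return score

# Toy data (assumed for illustration): two word features over three
# documents, with label 1 = negative sentiment.
features = [[1, 0],  # doc 1 contains "hated"
            [1, 0],  # doc 2 contains "hated"
            [0, 1]]  # doc 3 contains "delicious"
labels = [1, 1, 0]

for j, name in enumerate(["hated", "delicious"]):
    column = [row[j] for row in features]
    print(name, chi_squared(column, labels))
```

Features whose presence tracks the class label get high scores and are kept; low-scoring features are dropped, shrinking the sparse bag-of-words matrix before training.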