Bag of Words
First, build a vocabulary of all unique words appearing in the entire set of texts.
Next, iterate through each text in the training set and mark a 1 in the vector position corresponding to each word that text contains. For example:
Texts:
T1: The food was terrible, I hated it. (7 words)
T2: The restaurant was very far away, I hated it. (9 words)
T3: The pasta was delicious, will come back again. (8 words)
(Derived Corpus): The | food | was | terrible | I | hated | it | restaurant | very | far | away | pasta | delicious | will | come | back | again (17 words)
T1 Vector: 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
T2 Vector: 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0
T3 Vector: 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1
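The construction above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenization with commas and periods stripped (matching the corpus derivation above); real pipelines usually lowercase and use a proper tokenizer.

```python
def tokenize(text):
    # Strip simple punctuation and split on whitespace.
    return text.replace(",", "").replace(".", "").split()

texts = [
    "The food was terrible, I hated it.",
    "The restaurant was very far away, I hated it.",
    "The pasta was delicious, will come back again.",
]

# Build the vocabulary in order of first appearance.
vocab = []
for text in texts:
    for word in tokenize(text):
        if word not in vocab:
            vocab.append(word)

# Mark 1 for each vocabulary word present in a text, 0 otherwise.
vectors = [[1 if word in tokenize(text) else 0 for word in vocab]
           for text in texts]

print(vocab)      # 17-word derived corpus
for v in vectors: # one binary row vector per text
    print(v)
```

Note that this encoding records only presence or absence; a count-based variant would store how many times each word occurs instead of a binary flag.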
The problem with this approach is that the feature set becomes extremely large and sparse. To reduce the feature size, and thereby increase computation speed and the performance of classification models, several feature selection techniques have been applied to bag-of-words models.
Filter-based techniques:
- Chi-Squared test
- Mutual Information
- Signal-to-noise ratio
- Area under the Receiver Operating Characteristic (ROC) curve
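As one concrete example of a filter-based technique, the chi-squared score of a binary word feature against a binary class label can be computed from a 2x2 contingency table. The sketch below is hand-rolled for illustration; the toy feature columns and sentiment labels are assumptions, not part of the example texts above, and in practice a library routine (e.g. scikit-learn's `chi2`) would be used.

```python
def chi_squared(feature, labels):
    # Build the 2x2 contingency table: feature value x class label.
    n = len(labels)
    counts = {(f, y): 0 for f in (0, 1) for y in (0, 1)}
    for f, y in zip(feature, labels):
        counts[(f, y)] += 1
    # Sum (observed - expected)^2 / expected over all four cells.
    score = 0.0
    for f in (0, 1):
        for y in (0, 1):
            row = counts[(f, 0)] + counts[(f, 1)]
            col = counts[(0, y)] + counts[(1, y)]
            expected = row * col / n
            if expected > 0:
                score += (counts[(f, y)] - expected) ** 2 / expected
    return score

# Toy data (assumed for illustration): two word features over three
# documents, with label 1 = negative sentiment.
features = [[1, 0],  # doc 1 contains "hated"
            [1, 0],  # doc 2 contains "hated"
            [0, 1]]  # doc 3 contains "delicious"
labels = [1, 1, 0]

for j, name in enumerate(["hated", "delicious"]):
    column = [row[j] for row in features]
    print(name, chi_squared(column, labels))
```

Features whose presence tracks the class label get high scores and are kept; low-scoring features are dropped, shrinking the sparse bag-of-words matrix before training.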