TF-IDF
Jul 6, 2018
Pre-process data by removing all duplicate entries and blank responses.
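As a rough sketch of that cleaning step in plain Python: the helper name `drop_duplicates_and_blanks` and the sample responses are made up for illustration, and it assumes "duplicate" means an exact match after trimming whitespace, which may be stricter or looser than what your data needs.

```python
def drop_duplicates_and_blanks(responses):
    """Keep the first occurrence of each non-blank response, preserving order.

    Hypothetical helper: duplicates are exact matches after whitespace is
    stripped, and None or whitespace-only entries count as blank.
    """
    seen = set()
    cleaned = []
    for text in responses:
        if text is None:
            continue
        stripped = text.strip()
        if not stripped or stripped in seen:
            continue
        seen.add(stripped)
        cleaned.append(stripped)
    return cleaned


# Made-up sample data, only to show the behaviour.
raw = ["Great service", "", "Great service", "   ", "Too slow", None]
TEXT = drop_duplicates_and_blanks(raw)
print(TEXT)  # ['Great service', 'Too slow']
```

The result is bound to `TEXT` here only so that it lines up with the variable passed to the vectorizer.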
Then, for each text, calculate TF-IDF (term frequency-inverse document frequency) features.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(strip_accents='ascii',
                                   stop_words='english',
                                   min_df=0.0005,
                                   sublinear_tf=True)
# TEXT is the collection of cleaned text responses (one string per document)
codebook_vectorizer = tfidf_vectorizer.fit(TEXT)
- sublinear_tf is set to True to use logarithmic term-frequency scaling: each raw count tf is replaced with 1 + log(tf)
- min_df, when given as a float, is the minimum proportion of documents a word must be present in to be kept; 0.0005 keeps words found in at least 0.05% of documents
- stop_words is set to “english” to remove common English stop words (“a”, “the”, …), which reduces the number of noisy features.
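To make the three options above more concrete, here is a simplified pure-Python sketch of what they do to the weights. It is not scikit-learn's implementation: the function name `sublinear_tfidf`, the tiny stop-word set, the whitespace tokenizer and the toy documents are all assumptions for illustration. The weighting follows scikit-learn's documented defaults as I understand them (smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document).

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; scikit-learn's built-in
# 'english' list is much longer).
STOP_WORDS = {"a", "the", "is", "on", "and"}


def sublinear_tfidf(docs, min_df=0.0005, stop_words=STOP_WORDS):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Simplified tokenization: lowercase, split on whitespace, drop stop words.
    tokenized = [[w for w in doc.lower().split() if w not in stop_words]
                 for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # min_df as a proportion: keep terms found in at least min_df * n_docs docs.
    vocab = {t for t, d in df.items() if d >= min_df * n_docs}

    rows = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab)
        weights = {
            # sublinear tf (1 + ln(count)) times smoothed idf
            t: (1 + math.log(c)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()
        }
        # L2-normalise each document; an empty document keeps norm 1.0.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = ["the cat sat on the mat", "the cat chased a cat", "a dog barked"]
rows = sublinear_tfidf(docs)               # stop words removed, all terms kept
common = sublinear_tfidf(docs, min_df=0.5)  # only terms in at least half the docs
```

In this toy corpus, "the" and "a" never become features, and raising `min_df` to 0.5 leaves only "cat", the one term that appears in at least two of the three documents. With the very small `min_df=0.0005` used above, the threshold only starts to remove terms once the corpus has a couple of thousand documents or more.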