TF-IDF

Jul 6, 2018

Pre-process data by removing all duplicate entries and blank responses.
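The post does not show how the cleaning is done, so the following is only a minimal sketch of one way it might look. It assumes the responses arrive as a plain list of strings (possibly containing None), that "duplicate" means an exact match after trimming whitespace, and that the first occurrence is the one to keep. The helper name clean_responses and the sample data are hypothetical; the result is named TEXT only to match the variable passed to fit later in this post.

```python
def clean_responses(responses):
    """Drop blank responses and exact duplicates, keeping first occurrences in order."""
    seen = set()
    cleaned = []
    for text in responses:
        if text is None:
            continue  # treat missing values as blank
        stripped = text.strip()
        if not stripped or stripped in seen:
            continue  # skip blanks and anything already kept
        seen.add(stripped)
        cleaned.append(stripped)
    return cleaned


# Hypothetical sample data, for illustration only
raw = ["Great service", "", "Great service", "   ", None, "Too slow"]
TEXT = clean_responses(raw)
```

The comparison here is case-sensitive; lowercasing before the duplicate check would be a reasonable variation if near-identical responses should also be merged.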

Then, for each text, calculate TF-IDF (term frequency-inverse document frequency) features.

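from sklearn.feature_extraction.text import TfidfVectorizer

# TEXT is the collection of pre-processed response strings
tfidf_vectorizer = TfidfVectorizer(strip_accents='ascii',
                                   stop_words='english',
                                   min_df=0.0005,
                                   sublinear_tf=True)
# Learn the vocabulary and IDF weights from the corpus
codebook_vectorizer = tfidf_vectorizer.fit(TEXT)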

sublinear_tf is set to True to use a logarithmic form of the term frequency: the raw count tf is replaced with 1 + log(tf).
min_df is the minimum proportion of documents a word must appear in to be kept (here 0.0005, i.e. 0.05% of the documents).
stop_words is set to 'english' to remove very common English words ("a", "the", …) and so reduce the number of noisy features.
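To make these three settings concrete, here is a small pure-Python sketch of roughly what each one does. It is not scikit-learn's implementation: the toy corpus, the three-word stop list, and the helper names sublinear_tf and kept_terms are all hypothetical, and min_df is set to 0.5 instead of 0.0005 so that the effect is visible on four documents.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and lowercased
docs = [
    ["the", "service", "was", "slow", "slow", "slow", "slow"],
    ["the", "service", "was", "great"],
    ["the", "staff", "was", "great"],
    ["the", "food", "was", "cold"],
]

# Tiny stand-in for an English stop-word list (a real list is much longer)
STOP_WORDS = {"a", "the", "was"}


def sublinear_tf(count):
    """Log-scaled term frequency: 1 + ln(count), for count > 0."""
    return 1.0 + math.log(count)


def kept_terms(docs, min_df):
    """Non-stop-word terms that appear in at least a min_df proportion of docs."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    min_count = min_df * len(docs)
    return {t for t, df in doc_freq.items()
            if df >= min_count and t not in STOP_WORDS}


vocab = kept_terms(docs, min_df=0.5)   # terms present in at least 2 of 4 docs
raw_count = Counter(docs[0])["slow"]   # "slow" appears 4 times in the first doc
scaled = sublinear_tf(raw_count)       # 1 + ln(4), much smaller than 4
```

In this sketch only "service" and "great" survive: the stop words are dropped even though they appear everywhere, and words seen in a single document fall below the min_df threshold. The log scaling means a word repeated four times in one response counts for about 2.4 rather than 4, so heavy repetition does not dominate the feature vector.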

Written by Himanshu Lohiya
