TF-IDF

Himanshu Lohiya
1 min read · Jul 6, 2018


Pre-process data by removing all duplicate entries and blank responses.
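A minimal sketch of this clean-up step, in plain Python. The helper name `drop_blank_and_duplicate` and the sample responses are made up for illustration, and it assumes that "duplicate" means an exact match after trimming surrounding whitespace:

```python
def drop_blank_and_duplicate(texts):
    """Remove blank entries and exact duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for text in texts:
        if text is None:
            continue  # treat missing responses as blank
        stripped = text.strip()
        if not stripped or stripped in seen:
            continue  # skip empty strings and repeats
        seen.add(stripped)
        cleaned.append(stripped)
    return cleaned

# Hypothetical raw responses, for illustration only
raw_texts = ["Great service", "", "Great service", "   ", "Slow delivery", None]
TEXT = drop_blank_and_duplicate(raw_texts)
print(TEXT)  # ['Great service', 'Slow delivery']
```

Depending on the data, you may also want to lowercase or otherwise normalize the text before comparing entries.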

Then, for each text, calculate TF-IDF (term frequency-inverse document frequency) features.

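from sklearn.feature_extraction.text import TfidfVectorizer

# TEXT: the pre-processed texts, as an iterable of strings
tfidf_vectorizer = TfidfVectorizer(strip_accents='ascii',
                                   stop_words='english',
                                   min_df=0.0005,
                                   sublinear_tf=True)
# fit() learns the vocabulary and IDF weights and returns the fitted vectorizer
codebook_vectorizer = tfidf_vectorizer.fit(TEXT)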
  • sublinear_tf is set to True to use a logarithmic form of the term frequency: tf is replaced with 1 + log(tf)
  • min_df is the minimum proportion of documents a word must appear in to be kept (here, 0.05% of documents)
  • stop_words is set to “english” to remove common function words (“a”, “the”, …) and reduce the number of noisy features.
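
To see what the fitted vectorizer produces, here is a minimal sketch on a made-up three-document corpus. The sample sentences and the `features` variable are illustrative assumptions, not the original data. It uses the same settings as above, then transforms the texts into a sparse TF-IDF matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the pre-processed responses (illustrative only)
TEXT = [
    "The delivery was slow",
    "The delivery was quick and the staff was friendly",
    "Friendly staff",
]

tfidf_vectorizer = TfidfVectorizer(strip_accents='ascii',
                                   stop_words='english',
                                   min_df=0.0005,
                                   sublinear_tf=True)
codebook_vectorizer = tfidf_vectorizer.fit(TEXT)

# Sparse matrix: one row per document, one column per retained term
features = codebook_vectorizer.transform(TEXT)
print(features.shape)

# Learned vocabulary; stop words such as "the" are absent
print(sorted(codebook_vectorizer.vocabulary_))
```

By default, TfidfVectorizer also L2-normalizes each row, so documents of different lengths are comparable. On a corpus this small, min_df=0.0005 filters nothing; it only starts to matter with thousands of documents.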
