Wine Rating Prediction
The goal is to predict a wine's rating from its description.
Dataset: 130k reviews with variety, location, winery, price, and description.
import pandas as pd
import numpy as np
import nltk
import re

from bs4 import BeautifulSoup
from contractions import CONTRACTION_MAP
import unicodedata

from nltk.corpus import stopwords
stopword_list = set(stopwords.words("english"))

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# `dataset` is assumed to already hold the reviews as a pandas DataFrame
print("Number of rows before removing duplicates : ", len(dataset))

# Note: duplicated(..., keep=False) flags every row whose description occurs
# more than once, so this mask selects those rows;
# dataset.drop_duplicates('description') would instead keep one copy of each.
dataset = dataset[dataset.duplicated('description', keep=False)]
print("Number of rows after removing duplicates : ", len(dataset))

# dropna returns a new DataFrame, so the result is assigned back
dataset = dataset.dropna(subset=['description', 'points'])
print("Number of rows after removing NaN : ", len(dataset))
Number of rows before removing duplicates : 150930
Number of rows after removing duplicates : 92393
Number of rows after removing NaN : 92393
Let's look at the number of wines per points value (rating):
fig, ax = plt.subplots(figsize=(30,10))

plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks

ax.set_title('Number of wines per points', fontweight="bold", size=25) # Title
ax.set_ylabel('Number of wines', fontsize = 25) # Y label
ax.set_xlabel('Points', fontsize = 25) # X label

dataset.groupby(['points']).count()['description'].plot(ax=ax, kind='bar')
To simplify the multi-class model, let's reduce the problem from about 20 labels to a 4-label classification problem:
1 -> Points 80 to 84 (Below-average wines)
2 -> Points 85 to 89 (Average wines)
3 -> Points 90 to 94 (Good wines)
4 -> Points 95 to 100 (Excellent wines)
# Map a raw points value to one of the four simplified classes
def transform_points_simplified(points):
    if points < 85:
        return 1
    elif 85 <= points < 90:
        return 2
    elif 90 <= points < 95:
        return 3
    else:
        # 95 and above, including a perfect 100
        return 4

dataset = dataset.assign(points_simplified = dataset['points']
                         .apply(transform_points_simplified))
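The same binning can also be written as a boundary lookup, which keeps the cut points in one place. A minimal sketch, assuming a perfect 100 belongs to class 4 (the "95 to 100" group); `BOUNDARIES` and `points_to_class` are illustrative names, not part of the original code:

```python
from bisect import bisect_right

# Lower bounds of classes 2, 3 and 4 (assumed from the ranges listed above)
BOUNDARIES = [85, 90, 95]

def points_to_class(points):
    """Return 1..4: the number of boundaries at or below `points`, plus one."""
    return bisect_right(BOUNDARIES, points) + 1

for p in (80, 84, 85, 89, 90, 94, 95, 100):
    print(p, "->", points_to_class(p))
```

It can be applied to the column the same way, with `dataset['points'].apply(points_to_class)`.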
Cleaning and pre-processing textual data
# Apply to the column (removeStopwords is the stopword-removal helper; it is not defined in this section)
dataset['description'] = dataset['description'].map(removeStopwords)
For deeper cleaning and better results, see: Cleaning and pre-processing textual data.
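`removeStopwords` is not defined in this section. Here is a minimal sketch of what such a function might look like, assuming simple lowercase word tokenization. The small `STOPWORDS` set is a stand-in so the example runs without downloading any data; in the notebook, `stopword_list` from NLTK would take its place:

```python
import re

# Stand-in for the NLTK English stopword list (an assumption for this example)
STOPWORDS = {"a", "an", "and", "is", "of", "the", "this", "with"}

def removeStopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, keep word tokens (letters and apostrophes), drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in stopwords)

print(removeStopwords("This is a ripe wine with notes of cherry and oak."))
# prints: ripe wine notes cherry oak
```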
Feature Building
X = dataset['description']
y = dataset['points_simplified']
Bag of Words: represents each text as a vector whose entries are per-word weights (number of occurrences, etc.). Two common weighting schemes, used as alternatives:
- Count Vectorizer: each word is weighted by its number of occurrences in the text.
vectorizer = CountVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)
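- TF-IDF Vectorizer: the weight increases proportionally to the word's count in the text, but is offset by how frequently the word appears across the whole corpus.
# Use this instead of CountVectorizer, starting from the raw text (X = dataset['description'])
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)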
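To make the two weightings concrete, here is a pure-Python sketch on three made-up descriptions. It assumes the smoothed formula that scikit-learn's TfidfVectorizer applies by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. Tokenization is simplified to whitespace splitting, and all names are illustrative:

```python
import math
from collections import Counter

docs = ["ripe cherry fruit", "ripe plum fruit", "ripe oak spice"]  # made-up examples

# Bag of words: one Counter of raw word counts per document
counts = [Counter(doc.split()) for doc in docs]

# Document frequency: in how many documents each word appears
df = Counter(word for c in counts for word in c)
n = len(docs)

def tfidf(count):
    """Weight counts by smoothed idf, then scale the vector to unit length."""
    raw = {w: tf * (math.log((1 + n) / (1 + df[w])) + 1) for w, tf in count.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = [tfidf(c) for c in counts]
print(counts[0])
print({w: round(v, 3) for w, v in weights[0].items()})
```

In the first description every word occurs once, so the raw counts are equal. Under TF-IDF, "ripe" appears in all three texts and gets the smallest weight, while "cherry" appears in only one and gets the largest.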
Train/Test Split
- Training the Model
90% of the dataset will be used for training (about 80k wines).
- Testing the Model
10% of the dataset will be used for testing (about 9k wines).
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=101)
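For a sense of what the split does, the sketch below shuffles row indices with a fixed seed and holds out 10% of them. It is a simplified stand-in rather than scikit-learn's implementation, so it will not select the same rows as `train_test_split`. `split_indices` is a hypothetical helper, and 92393 is the row count printed earlier:

```python
import math
import random

def split_indices(n_samples, test_size=0.1, seed=101):
    """Shuffle 0..n_samples-1 reproducibly and hold out a test fraction."""
    n_test = math.ceil(test_size * n_samples)  # round the test set size up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # same seed, same shuffle
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(92393)
print(len(train_idx), len(test_idx))
# prints: 83153 9240
```

Fixing the seed, as `random_state=101` does above, makes the split reproducible, so scores from different runs stay comparable.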
Machine Learning Classification:
- RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# Testing the model
predictions = rfc.predict(X_test)
print(classification_report(y_test, predictions))
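`classification_report` summarizes precision and recall per class; the `confusion_matrix` imported at the top gives the underlying counts. Below is a small pure-Python sketch of that tabulation, with rows as true classes and columns as predicted classes. The labels are invented for illustration and `confusion_counts` is a hypothetical helper:

```python
from collections import Counter

def confusion_counts(y_true, y_pred, labels):
    """Return a matrix where cell [i][j] counts true labels[i] predicted as labels[j]."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

y_true = [1, 2, 2, 3, 3, 3, 4]  # made-up true classes
y_pred = [1, 2, 3, 3, 3, 2, 3]  # made-up predictions
matrix = confusion_counts(y_true, y_pred, labels=[1, 2, 3, 4])
for row in matrix:
    print(row)

# Recall for class 3: correct class-3 predictions / all true class-3 samples
recall_3 = matrix[2][2] / sum(matrix[2])
print(round(recall_3, 2))
```

On the real data, `confusion_matrix(y_test, predictions)` would show which neighbouring rating classes the model confuses most often.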