Wine Rating Prediction

Himanshu Lohiya
3 min read · Jul 11, 2018

The goal is to predict a wine's rating from its description.

Dataset : 130k reviews with variety, location, winery, price, and description.

import pandas as pd
import numpy as np
import nltk
import re
from bs4 import BeautifulSoup
from contractions import CONTRACTION_MAP
import unicodedata
from nltk.corpus import stopwords
stopword_list = set(stopwords.words("english"))
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
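# `dataset` is assumed to be the reviews DataFrame, already loaded with pd.read_csv
print("Number of rows before removing duplicates : ", len(dataset))
# Note: duplicated(..., keep=False) flags every row whose description occurs more
# than once; dataset.drop_duplicates('description') keeps one copy of each instead.
dataset = dataset[dataset.duplicated('description', keep=False)]
print("Number of rows after removing duplicates : ", len(dataset))
# dropna returns a new DataFrame, so assign the result back
dataset = dataset.dropna(subset=['description', 'points'])
print("Number of rows after removing NaN : ", len(dataset))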

Number of rows before removing duplicates : 150930
Number of rows after removing duplicates : 92393
Number of rows after removing NaN : 92393

Let's look at the number of wines per points value (rating):

fig, ax = plt.subplots(figsize=(30,10))
plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks
ax.set_title('Number of wines per points', fontweight="bold", size=25) # Title
ax.set_ylabel('Number of wines', fontsize = 25) # Y label
ax.set_xlabel('Points', fontsize = 25) # X label
dataset.groupby(['points']).count()['description'].plot(ax=ax, kind='bar')

To simplify the multi-class model, let's reduce the problem from roughly 20 labels (one per points value) to 4 main labels:

1 -> Points 80 to 84 (below-average wines)
2 -> Points 85 to 89 (average wines)
3 -> Points 90 to 94 (good wines)
4 -> Points 95 to 99 (excellent wines)

Note that the code gives a perfect score of 100 its own label, 5.

# Map a points value to its simplified rating band
def transform_points_simplified(points):
    if points < 85:
        return 1
    elif points >= 85 and points < 90:
        return 2
    elif points >= 90 and points < 95:
        return 3
    elif points >= 95 and points < 100:
        return 4
    else:
        # Only a perfect score of 100 reaches this branch
        return 5

# Add the band as a new column next to the original points
dataset = dataset.assign(points_simplified=dataset['points']
                         .apply(transform_points_simplified))
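
As a cross-check on the band boundaries, the same mapping can be written with the standard library's bisect module. This is only a sketch: `CUTS` and `band_for` are illustrative names that do not appear in the original code.

```python
from bisect import bisect_right

# Lower bounds of bands 2, 3, 4 and 5 (illustrative constant, not from the post)
CUTS = [85, 90, 95, 100]

def band_for(points):
    """Return the band number for a points value, matching the if/elif chain."""
    # bisect_right counts how many cut points are <= points
    return bisect_right(CUTS, points) + 1

# Print the band at each boundary to see where the labels change
for p in (80, 84, 85, 89, 90, 94, 95, 99, 100):
    print(p, "->", band_for(p))
```

Printing the boundary values makes it easy to see that 85, 90 and 95 each open a new band, and that only 100 lands in band 5.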

Cleaning and pre-processing textual data

# Apply the stop-word removal helper (removeStopwords, defined separately) to the column
dataset['description'] = dataset['description'].map(removeStopwords)

For deeper cleaning and better results, see: Cleaning and pre-processing textual data.
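
The post does not show the body of `removeStopwords`, so the following is a minimal sketch of what such a helper might look like, not the author's implementation. It assumes simple lowercase tokenisation and uses a tiny hand-written stop list in place of the much longer NLTK English list built at the top of the post.

```python
import re

# Small illustrative stop list; the post builds its own from
# nltk.corpus.stopwords.words("english"), which is far longer.
STOPWORDS = {"a", "an", "and", "the", "is", "this", "of", "with", "it", "in", "to"}

def removeStopwords(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = removeStopwords("This is a ripe and fruity wine, with a hint of oak.")
print(cleaned)
```

Swapping `STOPWORDS` for the `stopword_list` set defined earlier would give the same behaviour with the full NLTK list.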

Feature Building

X = dataset['description']
y = dataset['points_simplified']

  • Bag of Words : represents each text as a vector over the vocabulary, weighted by how often each word occurs in that text.

vectorizer = CountVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)
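
To make the bag-of-words idea concrete, here is a plain-Python sketch of the counting step on three made-up descriptions. It only illustrates the representation; `CountVectorizer` additionally handles tokenisation details and returns a sparse matrix. The sample texts and the `bag_of_words` name are assumptions for illustration.

```python
from collections import Counter

# Toy descriptions, invented for this example
docs = ["ripe cherry and oak", "crisp apple and citrus", "ripe apple"]

# Vocabulary: every distinct word, sorted so the column order is stable
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bag_of_words(doc) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary word, which is why the real matrix becomes very wide and mostly zeros on tens of thousands of reviews.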
  • TF-IDF Vectorizer : a word's weight increases with its count in a description, but is offset by how many descriptions in the whole corpus contain that word.
# Alternative to CountVectorizer: fit on the raw text, not on an already vectorized X
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(dataset['description'])
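
The weighting can be sketched by hand using the textbook formula tf × log(N / df). This is an illustration under that assumption only: scikit-learn's `TfidfVectorizer` applies smoothing and normalisation by default, so its numbers will differ. The sample texts and the `tfidf` helper are made up for the example.

```python
import math
from collections import Counter

# Toy corpus, invented for this example; every text mentions "wine"
docs = [d.split() for d in ["ripe cherry wine", "crisp apple wine", "ripe apple wine"]]
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    tf = Counter(doc)
    return {word: tf[word] * math.log(n_docs / df[word]) for word in tf}

weights = tfidf(docs[0])
print(weights)
```

A word that appears in every description ("wine" here) gets weight zero, while a word unique to one description ("cherry") gets the highest weight.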

Train/Test Split

  • Training the Model
    90% of the dataset will be used for training (about 80k wines).
  • Testing the Model
    10% of the dataset will be used for testing (about 9k wines).
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.1,
random_state=101)
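
Conceptually, the split shuffles the rows with a fixed seed and sets aside a share of them for testing. The sketch below shows that idea in plain Python; `split_rows` is a hypothetical helper and does not reproduce the exact shuffle that `train_test_split` performs for the same `random_state`.

```python
import random

def split_rows(rows, test_size=0.1, seed=101):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_size))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

rows = list(range(20))  # stand-in for 20 reviews
train, test = split_rows(rows)
print(len(train), len(test))
```

Fixing the seed means the same rows end up in the test set on every run, which keeps scores comparable between experiments.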

Machine Learning Classification :

— RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
# Testing the model
predictions = rfc.predict(X_test)
print(classification_report(y_test, predictions))
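
`classification_report` prints precision, recall, F1 and support for each class. As a sketch of what the first two columns mean, here they are computed by hand on a few made-up labels; `precision_recall` is an illustrative helper, not part of scikit-learn.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted
    fp = sum(t != label and p == label for t, p in pairs)  # predicted but wrong
    fn = sum(t == label and p != label for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented labels for the example, using the simplified bands 1 to 3
y_true = [1, 2, 2, 3, 3, 3]
y_pred = [1, 2, 3, 3, 3, 2]

for label in (1, 2, 3):
    print(label, precision_recall(y_true, y_pred, label))
```

Precision answers "of the wines predicted as this band, how many really were", and recall answers "of the wines really in this band, how many were found".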

Reference : https://www.kaggle.com/olivierg13/wine-ratings-analysis-w-supervised-ml
