Wine Rating Prediction

3 min read · Jul 11, 2018

The goal is to predict a wine's rating from its description.

Dataset : 130k reviews with variety, location, winery, price, and description.

import pandas as pd
import numpy as np
import nltk
import re

from bs4 import BeautifulSoup
from contractions import CONTRACTION_MAP
import unicodedata

from nltk.corpus import stopwords
stopword_list = set(stopwords.words("english"))

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# `dataset` is the reviews DataFrame; the loading step (e.g. pd.read_csv) is not shown in this post.
print("Number of rows before removing duplicates : ", len(dataset))

# duplicated(..., keep=False) flags every row whose description appears more than once,
# so this mask selects those rows. dataset.drop_duplicates(subset='description') would
# instead keep one copy of each description.
dataset = dataset[dataset.duplicated('description', keep=False)]
print("Number of rows after removing duplicates : ", len(dataset))

# dropna returns a new DataFrame, so assign the result back
dataset = dataset.dropna(subset=['description', 'points'])
print("Number of rows after removing NaN : ", len(dataset))

Number of rows before removing duplicates : 150930
Number of rows after removing duplicates : 92393
Number of rows after removing NaN : 92393

Let's look at the number of wines per point score (rating):

fig, ax = plt.subplots(figsize=(30,10))

plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks

ax.set_title('Number of wines per points', fontweight="bold", size=25) # Title
ax.set_ylabel('Number of wines', fontsize = 25) # Y label
ax.set_xlabel('Points', fontsize = 25) # X label

dataset.groupby(['points']).count()['description'].plot(ax=ax, kind='bar')

To simplify the multiclass model, let's reduce the problem from about 20 labels (one per point value) to a 4-label classification problem:

1 -> Points 80 to 84 (Below-average wines)
2 -> Points 85 to 89 (Average wines)
3 -> Points 90 to 94 (Good wines)
4 -> Points 95 to 99 (Excellent wines)

A perfect score of 100 falls outside these ranges and gets its own label, 5, in the function below.

# Map a points score to a simplified class label
def transform_points_simplified(points):
    if points < 85:
        return 1
    elif points >= 85 and points < 90:
        return 2
    elif points >= 90 and points < 95:
        return 3
    elif points >= 95 and points < 100:
        return 4
    else:
        return 5

# Add the simplified label as a new column
dataset = dataset.assign(points_simplified = dataset['points']
                        .apply(transform_points_simplified))
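The same binning can also be written as a lookup over bin edges. Here is a minimal sketch using the standard library `bisect` module, with the edges copied from the thresholds in `transform_points_simplified`. The names `BIN_EDGES` and `points_to_label` are made up for this illustration.

```python
from bisect import bisect_right

# Lower bounds of labels 2, 3, 4 and 5, taken from the thresholds in the function above.
BIN_EDGES = [85, 90, 95, 100]

def points_to_label(points):
    """Return 1 for points < 85, 2 for 85-89, 3 for 90-94, 4 for 95-99, 5 for 100 and up."""
    # bisect_right counts how many edges are <= points; adding 1 gives the label.
    return bisect_right(BIN_EDGES, points) + 1

for score in (80, 85, 89, 90, 95, 100):
    print(score, "->", points_to_label(score))
```

Keeping the edges in one list makes it easy to try a different number of classes without rewriting the if/elif chain.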

Cleaning and pre-processing textual data

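# Apply to the column
# removeStopwords is a helper that strips stop words from a string (its definition is not shown in this post)
dataset['description'] = dataset['description'].map(removeStopwords)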

For deeper cleaning and better results, see: Cleaning and pre-processing textual data.
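The post does not show the `removeStopwords` helper, so here is a rough sketch of what such a function might look like. It is an assumption, not the original code: the function name `remove_stopwords_sketch`, the regex tokenizer, and the tiny hard-coded `STOPWORD_LIST` are all stand-ins (the post builds `stopword_list` from NLTK's English list instead).

```python
import re

# Tiny illustrative stop-word set; the post builds stopword_list from NLTK's English list.
STOPWORD_LIST = {"a", "an", "and", "is", "of", "the", "this", "with"}

def remove_stopwords_sketch(text, stopword_list=STOPWORD_LIST):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [token for token in tokens if token not in stopword_list]
    return " ".join(kept)

cleaned = remove_stopwords_sketch("This is a ripe wine with notes of cherry and oak.")
print(cleaned)  # ripe wine notes cherry oak
```

A function like this can be passed to `.map(...)` on the description column, one review at a time.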

Feature Building

X = dataset['description']
y = dataset['points_simplified']

Bag of Words: each text is represented as a vector with one weight per vocabulary word (number of occurrences, etc.). Two vectorizers can be used:

Count Vectorizer: each word is weighted by how many times it occurs in the text.

vectorizer = CountVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)
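To make the counting idea concrete, here is a small hand-rolled sketch using only the standard library. It is a simplification under stated assumptions: the two descriptions are made up, tokens are split on whitespace, and `count_vector` is a hypothetical helper. `CountVectorizer` has its own tokenization rules and returns a sparse matrix rather than lists.

```python
from collections import Counter

# Two made-up descriptions, used only for illustration.
docs = ["ripe cherry and oak", "cherry cherry tannins"]

# Vocabulary: every distinct word, sorted so each word gets a stable column index.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Return the word counts of `doc`, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

matrix = [count_vector(doc) for doc in docs]
print(vocabulary)  # ['and', 'cherry', 'oak', 'ripe', 'tannins']
for row in matrix:
    print(row)     # [1, 1, 1, 1, 0] then [0, 2, 0, 0, 1]
```

Each row is one description and each column is one word, which is the shape the classifier receives.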

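TF-IDF Vectorizer: the weight increases proportionally to the word's count in the text, but is offset by how frequent the word is across the whole corpus.

# Alternative to CountVectorizer: X must still hold the raw descriptions here
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)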
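Here is a hand-computed TF-IDF sketch to show how the offset works. It uses made-up descriptions and whitespace tokens, with a smoothed IDF of ln((1 + n) / (1 + df)) + 1 followed by scaling each vector to unit length. As far as I know this matches `TfidfVectorizer`'s default settings, but treat that as an assumption and check the scikit-learn docs for the exact behaviour.

```python
import math
from collections import Counter

docs = ["ripe cherry and oak", "cherry cherry tannins"]  # made-up descriptions
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: number of documents containing each word.
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}

def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_vector(tokens):
    """Weight raw counts by IDF, then scale the vector to unit Euclidean length."""
    counts = Counter(tokens)
    weights = [counts[word] * idf(word) for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for word, weight in zip(vocabulary, vectors[0]):
    print(f"{word}: {weight:.3f}")
```

In the first description, "cherry" ends up with a lower weight than "ripe" even though both occur once, because "cherry" appears in every document.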

Train/Test Split

Training the model: 90% of the dataset is used for training (about 83k wines).
Testing the model: 10% of the dataset is used for testing (about 9k wines).

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.1, 
                                                   random_state=101)
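Conceptually, the split shuffles the rows and holds out a fraction for testing. Here is a minimal standard-library sketch of that idea. The function `train_test_split_sketch` is hypothetical and does not reproduce the exact rows scikit-learn would pick for `random_state=101`.

```python
import random

def train_test_split_sketch(items, test_size=0.1, seed=101):
    """Shuffle a copy of `items` and hold out a `test_size` fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable shuffle
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split_sketch(range(100))
print(len(train), len(test))  # 90 10
```

Fixing the seed matters for the same reason `random_state` does: rerunning the notebook gives the same train and test rows, so scores are comparable between runs.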

Machine Learning Classification:

- RandomForestClassifier

# Training the model
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# Testing the model
predictions = rfc.predict(X_test)
print(classification_report(y_test, predictions))
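The classification report lists precision, recall and F1 for each class. Here is a small sketch of how those three numbers are computed, using made-up labels rather than results from the wine model. The helper `per_class_scores` is hypothetical, and it leaves out the support column and the averaged rows of the real report.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from two label sequences."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # times the model said `label`
        actual = sum(t == label for t in y_true)     # times `label` was correct
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Made-up labels for illustration; these are not results from the wine model.
y_true = [1, 2, 2, 3, 3, 3]
y_pred = [1, 2, 3, 3, 3, 2]
for label, (precision, recall, f1) in per_class_scores(y_true, y_pred).items():
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))
```

With imbalanced classes like the wine labels, per-class recall is worth reading closely, since a model can score well overall while rarely predicting the smallest classes.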

Reference : https://www.kaggle.com/olivierg13/wine-ratings-analysis-w-supervised-ml

Written by Himanshu Lohiya
