Data Sources:¶
For our analysis, we sourced our data from the Yelp dataset, which can be found at https://www.yelp.com/dataset.
Specifically, we selected three datasets for analysis:
'yelp_academic_dataset_business.json'
'yelp_academic_dataset_review.json'
'yelp_academic_dataset_tip.json'
These datasets were downloaded from https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset.
WorkFlow¶
- Combine dataset
- Feature Engineering
- Model Building
I. Data Preprocessing
- a. Dataframe
- b. Feature Engineering
II. Supervised Learning
- a. ML models
- b. AUC-ROC Comparison
- c. Best Model Feature Importance
I. Data Preprocessing¶
The data preprocessing for the Yelp review dataset involved loading the data in chunks, sampling 1,000,000 rows, extracting the year from the 'date' column, and visualizing the review distribution over the years.
import nltk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")
import re
import json
import numpy as np
from random import shuffle
from sklearn.svm import SVC
from sklearn import neighbors
from nltk.tokenize import sent_tokenize
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.metrics import recall_score, classification_report
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN
import warnings
warnings.filterwarnings('ignore')
# Define chunksize and collect the chunks in a list
chunksize = 10000
chunks = []
# Iterate over the file in chunks
for chunk in pd.read_json('/Users/Ethan Vaz Falcao/Downloads/Yelp_data/Github/Yelp Datasets/yelp_academic_dataset_review.json', lines=True, chunksize=chunksize):
    chunks.append(chunk)
# Concatenate once at the end; calling pd.concat inside the loop re-copies the growing DataFrame on every iteration
data_review = pd.concat(chunks, ignore_index=True)
# randomly select 1,000,000 rows
data_review = data_review.sample(n=1000000, random_state=42)
# print the first 3 rows of the randomly selected dataframe
data_review.head(3)
review_id | user_id | business_id | stars | useful | funny | cool | text | date | |
---|---|---|---|---|---|---|---|---|---|
1295256 | J5Q1gH4ACCj6CtQG7Yom7g | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2 | 1 | 0 | 0 | Went for lunch and found that my burger was me... | 2018-04-04 21:09:53 |
3297618 | HlXP79ecTquSVXmjM10QxQ | bAt9OUFX9ZRgGLCXG22UmA | pBNucviUkNsiqhJv5IFpjg | 5 | 0 | 0 | 0 | I needed a new tires for my wife's car. They h... | 2020-05-24 12:22:14 |
1217795 | JBBULrjyGx6vHto2osk_CQ | NRHPcLq2vGWqgqwVugSgnQ | 8sf9kv6O4GgEb0j1o22N1g | 5 | 0 | 0 | 0 | Jim Woltman who works at Goleta Honda is 5 sta... | 2019-02-14 03:47:48 |
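The cell above loads the whole review file before sampling. As a hedged alternative sketch (not what the notebook does), reservoir sampling can draw a fixed-size uniform sample while reading a JSON Lines file one record at a time, so the full file never has to sit in memory. The function name `sample_json_lines` and the toy records are assumptions made for illustration.

```python
import io
import json
import random


def sample_json_lines(handle, k, seed=42):
    """Return k records chosen uniformly at random from a JSON Lines stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(handle):
        record = json.loads(line)
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Toy stand-in for the review file: ten one-line JSON records.
fake_file = io.StringIO(
    "\n".join(json.dumps({"review_id": n, "stars": n % 5 + 1}) for n in range(10))
)
sample = sample_json_lines(fake_file, k=3)
print(len(sample))
```

With a real file, `handle` would be the open file object; the resulting list of dictionaries can be passed to `pd.DataFrame`.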
count_by_year = data_review.groupby(data_review['date'].dt.year).size()
print(count_by_year)
date
2005       124
2006       529
2007      2161
2008      6839
2009     10546
2010     19696
2011     33117
2012     41255
2013     55172
2014     74578
2015     98597
2016    108142
2017    116992
2018    129703
2019    129758
2020     79558
2021     88686
2022      4547
dtype: int64
# extract the "date" column
dates = data_review["date"]
# extract the year from each date
years = [date.year for date in dates]
# plot a histogram of the years; the extra bin edge keeps the final year in its own bin
plt.hist(years, bins=range(min(years), max(years) + 2))
plt.xlabel("Year")
plt.ylabel("Frequency")
plt.title("Histogram of Years")
plt.show()
#set the year of the dataset at 2021 for testing
#data_review_test = data_review[data_review['date'].dt.year >= 2018]
#data_review_test = data_review[data_review['date'].dt.year >= 2020]
data_review.head(3)
review_id | user_id | business_id | stars | useful | funny | cool | text | date | |
---|---|---|---|---|---|---|---|---|---|
1295256 | J5Q1gH4ACCj6CtQG7Yom7g | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2 | 1 | 0 | 0 | Went for lunch and found that my burger was me... | 2018-04-04 21:09:53 |
3297618 | HlXP79ecTquSVXmjM10QxQ | bAt9OUFX9ZRgGLCXG22UmA | pBNucviUkNsiqhJv5IFpjg | 5 | 0 | 0 | 0 | I needed a new tires for my wife's car. They h... | 2020-05-24 12:22:14 |
1217795 | JBBULrjyGx6vHto2osk_CQ | NRHPcLq2vGWqgqwVugSgnQ | 8sf9kv6O4GgEb0j1o22N1g | 5 | 0 | 0 | 0 | Jim Woltman who works at Goleta Honda is 5 sta... | 2019-02-14 03:47:48 |
Step 1: Sentiment Analysis¶
We used NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner) module for sentiment analysis. A SentimentIntensityAnalyzer computes polarity scores for each review text in the data_review DataFrame, and the compound score determines the label: 1 when the compound score is greater than 0 (positive sentiment) and -1 otherwise, so reviews with a compound score of exactly 0 are grouped with the negative class. The resulting dataset contains the original reviews and their sentiment labels.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# initialize sentiment analysis model
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
# apply model to reviews and assign labels
data_review['label'] = data_review['text'].apply(lambda x: 1 if sia.polarity_scores(x)['compound'] > 0 else -1)
# print head of resulting dataset
data_review.head(3)
[nltk_data] Downloading package vader_lexicon to C:\Users\Ethan Vaz [nltk_data] Falcao\AppData\Roaming\nltk_data... [nltk_data] Package vader_lexicon is already up-to-date!
review_id | user_id | business_id | stars | useful | funny | cool | text | date | label | |
---|---|---|---|---|---|---|---|---|---|---|
1295256 | J5Q1gH4ACCj6CtQG7Yom7g | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2 | 1 | 0 | 0 | Went for lunch and found that my burger was me... | 2018-04-04 21:09:53 | -1 |
3297618 | HlXP79ecTquSVXmjM10QxQ | bAt9OUFX9ZRgGLCXG22UmA | pBNucviUkNsiqhJv5IFpjg | 5 | 0 | 0 | 0 | I needed a new tires for my wife's car. They h... | 2020-05-24 12:22:14 | 1 |
1217795 | JBBULrjyGx6vHto2osk_CQ | NRHPcLq2vGWqgqwVugSgnQ | 8sf9kv6O4GgEb0j1o22N1g | 5 | 0 | 0 | 0 | Jim Woltman who works at Goleta Honda is 5 sta... | 2019-02-14 03:47:48 | 1 |
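The labeling rule can be isolated from NLTK to make the threshold behavior explicit. This is a minimal sketch: `label_from_compound` is a hypothetical helper, and the scores are made-up values standing in for VADER compound scores, which range from -1 to 1.

```python
def label_from_compound(compound, threshold=0.0):
    """Map a compound score in [-1, 1] to the notebook's +1 / -1 labels."""
    # Strictly greater than the threshold counts as positive;
    # a neutral score of exactly 0 therefore falls into the -1 class.
    return 1 if compound > threshold else -1


scores = [0.85, 0.0, -0.4]  # made-up compound scores
labels = [label_from_compound(s) for s in scores]
print(labels)  # [1, -1, -1]
```

Raising `threshold` (for example to 0.05) would move weakly positive reviews into the -1 class as well; the notebook itself uses 0.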
# take an explicit copy so later column assignments do not act on a slice of data_review
data = data_review[['user_id', 'business_id','date','stars', 'text', "label"]].copy()
# check the data description
# no missing values
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 1295256 to 794448
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype
---  ------       --------------    -----
 0   user_id      1000000 non-null  object
 1   business_id  1000000 non-null  object
 2   date         1000000 non-null  datetime64[ns]
 3   stars        1000000 non-null  int64
 4   text         1000000 non-null  object
 5   label        1000000 non-null  int64
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 53.4+ MB
# make sure 'date' is a datetime column (data.info() above shows it is already datetime64[ns], so this is a no-op)
data['date'] = pd.to_datetime(data['date'])
Step 2: Feature Engineering¶
In the feature engineering step, several new features were created to enhance the analysis of the review dataset. Here is a summary of the features:
1) Maximum number of reviews in a day: This feature calculates the maximum number of reviews a user has written on a single day. It provides insights into the user's reviewing activity and helps identify highly active users.
2) Percentage of reviews with positive/negative ratings: This feature calculates the percentage of reviews that are classified as positive, negative, or neutral based on their star ratings. It offers an understanding of the user's overall sentiment towards the businesses they reviewed.
3) Review count: This feature counts the total number of reviews written by each user. It provides an indication of the user's reviewing activity and engagement level.
4) Negative rating percentage: This feature calculates the percentage of negative reviews out of the total reviews written by a user. It helps identify users who tend to write more negative reviews.
5) Positive rating percentage: This feature calculates the percentage of positive reviews out of the total reviews written by a user. It identifies users who predominantly write positive reviews.
6) Neutral rating percentage: This feature calculates the percentage of neutral reviews out of the total reviews written by a user. It indicates users who write a significant number of neutral reviews.
7) Average review length: This feature calculates the average length of reviews written by each user. It provides insights into the user's writing style and level of detail in their reviews.
8) Standard deviation of ratings: This feature calculates the standard deviation of ratings given by a user. It captures the variability in the user's rating behavior and helps identify users with more diverse rating patterns.
These engineered features provide additional dimensions for analyzing user behavior and sentiment in the review dataset, enabling more comprehensive insights and potential correlations with other variables.
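The per-user features listed above can be sketched on a toy DataFrame. This is an illustrative sketch rather than the notebook's exact code: the four sample rows are invented, and the maximum reviews per day is computed per calendar day (via `.dt.date`), which is an assumption about the intended meaning of that feature.

```python
import pandas as pd

# Invented toy reviews: user u1 writes two reviews on the same calendar day.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "date": pd.to_datetime([
        "2020-01-01 09:00", "2020-01-01 18:30",
        "2020-02-05 12:00", "2021-03-03 08:15",
    ]),
    "stars": [5, 1, 3, 4],
    "text": ["Great food", "Awful", "It was fine", "Nice staff"],
})

stars = reviews.groupby("user_id")["stars"]
features = pd.DataFrame({
    "review_count": stars.size(),
    "pos_rating_percentage": stars.apply(lambda s: (s > 3).mean()),
    "neg_rating_percentage": stars.apply(lambda s: (s < 3).mean()),
    "neu_rating_percentage": stars.apply(lambda s: (s == 3).mean()),
    # A single review has no sample standard deviation; use 0 as the notebook does.
    "std_rating": stars.std().fillna(0),
})

# Average review length in characters per user.
features["avg_review_len"] = reviews["text"].str.len().groupby(reviews["user_id"]).mean()

# Reviews per user per calendar day, then the maximum per user.
per_day = reviews.groupby(["user_id", reviews["date"].dt.date]).size()
features["max_review_on_a_day"] = per_day.groupby(level="user_id").max()

print(features)
```

Computing all aggregates on one grouped object and merging the result back once avoids the repeated `pd.merge` calls on the full review table.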
data.head(3)
user_id | business_id | date | stars | text | label | |
---|---|---|---|---|---|---|
1295256 | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2018-04-04 21:09:53 | 2 | Went for lunch and found that my burger was me... | -1 |
3297618 | bAt9OUFX9ZRgGLCXG22UmA | pBNucviUkNsiqhJv5IFpjg | 2020-05-24 12:22:14 | 5 | I needed a new tires for my wife's car. They h... | 1 |
1217795 | NRHPcLq2vGWqgqwVugSgnQ | 8sf9kv6O4GgEb0j1o22N1g | 2019-02-14 03:47:48 | 5 | Jim Woltman who works at Goleta Honda is 5 sta... | 1 |
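#1. Maximum number of reviews in a day
# Note: 'date' is a full timestamp, so this groups reviews by exact timestamp rather than by calendar day;
# grouping on data['date'].dt.date instead would count reviews per day.
review_day=data.groupby(['user_id','date']).count().sort_values(by='text').reset_index()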
review_day.head(3)
user_id | date | business_id | stars | text | label | |
---|---|---|---|---|---|---|
0 | ---2PmXbF47D870stH1jqA | 2015-01-21 20:39:14 | 1 | 1 | 1 | 1 |
1 | eg0WSgiKFOL3fgb4Ri1bCQ | 2012-07-14 04:33:19 | 1 | 1 | 1 | 1 |
2 | eg0cwrodeKGLeLIDTSCXoA | 2010-08-16 13:43:42 | 1 | 1 | 1 | 1 |
max_review=review_day.groupby('user_id').max().reset_index()[['user_id','text']]
max_review.rename(columns={'text':'max_review_on_a_day'},inplace=True)
max_review.head()
user_id | max_review_on_a_day | |
---|---|---|
0 | ---2PmXbF47D870stH1jqA | 1 |
1 | ---UgP94gokyCDuB5zUssA | 1 |
2 | ---r61b7EpVPkb4UVme5tA | 1 |
3 | --17Db1K-KujRuN7hY9Z0Q | 1 |
4 | --1oZcRo9-QKOtTqREKB6g | 1 |
#merge max_review with data
data=pd.merge(data,max_review,on=['user_id'])
data.head(3)
user_id | business_id | date | stars | text | label | max_review_on_a_day | |
---|---|---|---|---|---|---|---|
0 | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2018-04-04 21:09:53 | 2 | Went for lunch and found that my burger was me... | -1 | 1 |
1 | 56gL9KEJNHiSDUoyjk2o3Q | oEwmCZknUHgHfEBdKA2SZA | 2018-05-16 11:04:31 | 5 | Best Pho in Indy by far and reasonable prices!... | 1 | 1 |
2 | 56gL9KEJNHiSDUoyjk2o3Q | 8zJYnVNKD7XUHaKFMMREBg | 2019-09-23 22:52:23 | 2 | First of all, if you go to a Mexican restauran... | -1 | 1 |
#2. Percentage of reviews with positive / negative ratings
#Assuming a positive rating is > 3 stars, a negative rating is < 3 stars, and exactly 3 stars is neutral
(data.groupby(data['stars']
.apply(lambda x: 'negative' if x < 3 else 'positive' if x > 3 else 'neutral'))['stars']
.count())
stars
negative    230687
neutral      98714
positive    670599
Name: stars, dtype: int64
# count number of reviews each user_id had written
d=data.groupby('user_id').count().reset_index()[['user_id','text']]
d.rename(columns={'text':'review_count'},inplace=True)
#merge review_count with data
data=pd.merge(data,d,on=['user_id'])
data.head(3)
user_id | business_id | date | stars | text | label | max_review_on_a_day | review_count | |
---|---|---|---|---|---|---|---|---|
0 | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2018-04-04 21:09:53 | 2 | Went for lunch and found that my burger was me... | -1 | 1 | 7 |
1 | 56gL9KEJNHiSDUoyjk2o3Q | oEwmCZknUHgHfEBdKA2SZA | 2018-05-16 11:04:31 | 5 | Best Pho in Indy by far and reasonable prices!... | 1 | 1 | 7 |
2 | 56gL9KEJNHiSDUoyjk2o3Q | 8zJYnVNKD7XUHaKFMMREBg | 2019-09-23 22:52:23 | 2 | First of all, if you go to a Mexican restauran... | -1 | 1 | 7 |
# if a rating is less than 3 then label it negative, if greater than 3 then label it positive
# else neutral
data['rating_cat']=data['stars'].apply(lambda x:'negative' if x < 3 else 'positive' if x > 3 else 'neutral')
data.stars.value_counts()
5    462646
4    207953
1    153057
3     98714
2     77630
Name: stars, dtype: int64
# count number of negative reviews for each user_id
neg_review_count=data.groupby('user_id')['rating_cat'].apply(lambda x: (x=='negative').sum()).reset_index(name='count')['count'].to_list()
# count number of positive reviews for each user_id
pos_review_count=data.groupby('user_id')['rating_cat'].apply(lambda x: (x=='positive').sum()).reset_index(name='count')['count'].to_list()
# count number of neutral reviews for each user_id
neu_review_count=data.groupby('user_id')['rating_cat'].apply(lambda x: (x=='neutral').sum()).reset_index(name='count')['count'].to_list()
review_df=data.groupby('user_id').count().reset_index()[['user_id','text']]
review_df['neg_review_count']=neg_review_count
review_df['pos_review_count']=pos_review_count
review_df['neu_review_count']=neu_review_count
review_df['neg_rating_percentage']=review_df['neg_review_count']/review_df['text']
review_df['pos_rating_percentage']=review_df['pos_review_count']/review_df['text']
review_df['neu_rating_percentage']=review_df['neu_review_count']/review_df['text']
review_df.drop(['neg_review_count','pos_review_count',
'neu_review_count','text'],axis=1,inplace=True)
review_df.head()
user_id | neg_rating_percentage | pos_rating_percentage | neu_rating_percentage | |
---|---|---|---|---|
0 | ---2PmXbF47D870stH1jqA | 0.0 | 1.0 | 0.0 |
1 | ---UgP94gokyCDuB5zUssA | 0.0 | 1.0 | 0.0 |
2 | ---r61b7EpVPkb4UVme5tA | 0.0 | 1.0 | 0.0 |
3 | --17Db1K-KujRuN7hY9Z0Q | 0.0 | 1.0 | 0.0 |
4 | --1oZcRo9-QKOtTqREKB6g | 0.0 | 0.0 | 1.0 |
# merge review_df
data=pd.merge(data,review_df,on=['user_id'])
data.head(3)
user_id | business_id | date | stars | text | label | max_review_on_a_day | review_count | rating_cat | neg_rating_percentage | pos_rating_percentage | neu_rating_percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2018-04-04 21:09:53 | 2 | Went for lunch and found that my burger was me... | -1 | 1 | 7 | negative | 0.428571 | 0.571429 | 0.0 |
1 | 56gL9KEJNHiSDUoyjk2o3Q | oEwmCZknUHgHfEBdKA2SZA | 2018-05-16 11:04:31 | 5 | Best Pho in Indy by far and reasonable prices!... | 1 | 1 | 7 | positive | 0.428571 | 0.571429 | 0.0 |
2 | 56gL9KEJNHiSDUoyjk2o3Q | 8zJYnVNKD7XUHaKFMMREBg | 2019-09-23 22:52:23 | 2 | First of all, if you go to a Mexican restauran... | -1 | 1 | 7 | negative | 0.428571 | 0.571429 | 0.0 |
#3. Average review length
data['review_len']=data['text'].apply(lambda x: len(x))
d=data.groupby('user_id')['review_len'].mean().reset_index()
d.rename(columns={'review_len':'avg_review_len'},inplace=True)
data=pd.merge(data,d,on=['user_id'])
#Standard deviation of ratings of the reviewer’s reviews
d=data.groupby('user_id')['stars'].std().reset_index()
d.rename(columns={'stars':'std_rating'},inplace=True)
data=pd.merge(data,d,on=['user_id'])
data['std_rating'] = data['std_rating'].fillna(0)
data.head(3)
user_id | business_id | date | stars | text | label | max_review_on_a_day | review_count | rating_cat | neg_rating_percentage | pos_rating_percentage | neu_rating_percentage | review_len | avg_review_len | std_rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2018-04-04 21:09:53 | 2 | Went for lunch and found that my burger was me... | -1 | 1 | 7 | negative | 0.428571 | 0.571429 | 0.0 | 394 | 796.285714 | 1.718249 |
1 | 56gL9KEJNHiSDUoyjk2o3Q | oEwmCZknUHgHfEBdKA2SZA | 2018-05-16 11:04:31 | 5 | Best Pho in Indy by far and reasonable prices!... | 1 | 1 | 7 | positive | 0.428571 | 0.571429 | 0.0 | 284 | 796.285714 | 1.718249 |
2 | 56gL9KEJNHiSDUoyjk2o3Q | 8zJYnVNKD7XUHaKFMMREBg | 2019-09-23 22:52:23 | 2 | First of all, if you go to a Mexican restauran... | -1 | 1 | 7 | negative | 0.428571 | 0.571429 | 0.0 | 688 | 796.285714 | 1.718249 |
def avg_word_length(x):
    # average number of characters per whitespace-separated word
    words = x.split()
    if not words:
        # guard against reviews that contain only whitespace
        return 0.0
    return round(sum(len(word) for word in words) / len(words), 2)
data['avg_word_length']=data['text'].apply(avg_word_length)
def num_sent(x):
    # number of sentences found by NLTK's sentence tokenizer
    return len(sent_tokenize(x))
data['num_sent']=data['text'].apply(num_sent)
def percentage_of_numerals(x):
    # ratio of digit characters to alphabetic characters
    numbers = sum(c.isdigit() for c in x)
    letter = sum(c.isalpha() for c in x)
    try:
        result = numbers / letter
    except ZeroDivisionError:
        result = -1
    return result
data['percentage_of_numerals']=data['text'].apply(percentage_of_numerals)
# if numbers/letter raises ZeroDivisionError,
# the review contains no letters at all
data[data['percentage_of_numerals']==-1]
# remove reviews that don't contain any letters
data=data[data['percentage_of_numerals'] != -1 ]
# ratio of uppercase letters to all letters
def percentage_of_capitalized(x):
    cap = sum(c.isupper() for c in x)
    letter = sum(c.isalpha() for c in x)
    try:
        result = cap / letter
    except ZeroDivisionError:
        result = -1
    # return the guarded result (the original returned cap/letter, bypassing the try/except)
    return result
data['percentage_of_capitalized']=data['text'].apply(percentage_of_capitalized)
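The three character-level ratios can be checked by hand on a short string. The sketch below bundles them into one hypothetical helper, `text_stats`, and returns `None` for text without words or letters instead of the notebook's -1 sentinel; both choices are assumptions made for illustration.

```python
def text_stats(text):
    """Return character-level ratios for one review, or None if it has no letters."""
    words = text.split()
    letters = sum(c.isalpha() for c in text)
    if not words or letters == 0:
        return None
    return {
        "avg_word_length": round(sum(len(w) for w in words) / len(words), 2),
        "percentage_of_numerals": sum(c.isdigit() for c in text) / letters,
        "percentage_of_capitalized": sum(c.isupper() for c in text) / letters,
    }


stats = text_stats("Best Pho in Indy")
print(stats["avg_word_length"])  # 3.25
print(text_stats("12345"))       # None
```

"Best Pho in Indy" has 13 letters in 4 words, 3 of them uppercase, so the capitalized ratio is 3/13.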
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
#clean up the review text
from nltk.tokenize import word_tokenize
import string
import re
from nltk.corpus import stopwords
# build the stopword set once instead of reloading the list for every word
# (requires nltk.download('stopwords') and, for word_tokenize, nltk.download('punkt'))
STOP_WORDS = set(stopwords.words('english'))
def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    # tokenizer breaks string into a list of words
    text = word_tokenize(text)
    text = " ".join([c for c in text if c not in string.punctuation])
    text = text.lower()  # lowercase text
    # replace symbols with a space; the original pattern left most symbols outside the
    # character class, so it matched almost nothing (the class below is the assumed intent)
    text = re.sub(r'["\'#&?!:_/(){}\[\]|@,;.]', ' ', text)
    text = re.sub(r'\d+', '', text)  # remove digits
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)  # remove stopwords from text
    return text
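The cleaning steps (lowercase, drop digits, replace punctuation, remove stopwords) can be tried without downloading NLTK data. This is a simplified sketch using only the standard library: `clean_text_simple` and the tiny `STOPWORDS` set are hypothetical stand-ins, not NLTK's tokenizer or its English stopword list.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"the", "a", "and", "was", "my", "that", "for"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text_simple(text):
    """Lowercase, strip digits and punctuation, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(PUNCT_TO_SPACE)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


cleaned = clean_text_simple("Went for lunch and found that my burger was $12!")
print(cleaned)  # went lunch found burger
```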
data.label.value_counts()
 1    844717
-1    155282
Name: label, dtype: int64
data=data.reset_index()
data.rename(columns={"index": "review_id"},inplace=True)
data.label.value_counts()
 1    844717
-1    155282
Name: label, dtype: int64
data.columns
Index(['review_id', 'user_id', 'business_id', 'date', 'stars', 'text', 'label', 'max_review_on_a_day', 'review_count', 'rating_cat', 'neg_rating_percentage', 'pos_rating_percentage', 'neu_rating_percentage', 'review_len', 'avg_review_len', 'std_rating', 'avg_word_length', 'num_sent', 'percentage_of_numerals', 'percentage_of_capitalized'], dtype='object')
data.drop(['user_id', 'business_id', 'date','rating_cat'],axis=1,inplace=True)
data = data.reindex(columns=['label','review_id', 'text', 'stars',
'max_review_on_a_day',
'review_count', 'neg_rating_percentage',
'pos_rating_percentage', 'neu_rating_percentage', 'review_len',
'avg_review_len', 'std_rating', 'avg_word_length', 'num_sent',
'percentage_of_numerals', 'percentage_of_capitalized'])
data.head(3)
label | review_id | text | stars | max_review_on_a_day | review_count | neg_rating_percentage | pos_rating_percentage | neu_rating_percentage | review_len | avg_review_len | std_rating | avg_word_length | num_sent | percentage_of_numerals | percentage_of_capitalized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1 | 0 | Went for lunch and found that my burger was me... | 2 | 1 | 7 | 0.428571 | 0.571429 | 0.0 | 394 | 796.285714 | 1.718249 | 4.00 | 4 | 0.0 | 0.016287 |
1 | 1 | 1 | Best Pho in Indy by far and reasonable prices!... | 5 | 1 | 7 | 0.428571 | 0.571429 | 0.0 | 284 | 796.285714 | 1.718249 | 4.11 | 5 | 0.0 | 0.040909 |
2 | -1 | 2 | First of all, if you go to a Mexican restauran... | 2 | 1 | 7 | 0.428571 | 0.571429 | 0.0 | 688 | 796.285714 | 1.718249 | 4.19 | 9 | 0.0 | 0.020913 |
II. Supervised Learning¶
The code below defines helper functions for supervised learning and evaluation. make_pipeline wraps three classifiers (Random Forest, Gradient Boosting, and Logistic Regression) in single-step pipelines so they can be trained and scored uniformly. sort_models computes the weighted F1 score, recall, accuracy, and AUC for each fitted pipeline and sorts the models by F1 score. make_confusion_matrix plots a confusion matrix as a heatmap and, for binary problems, annotates it with accuracy, precision, recall, specificity, and F1 score.
def make_pipeline():
    """Create one single-step pipeline per classifier."""
    rf = RandomForestClassifier(random_state=50)
    gb = GradientBoostingClassifier(random_state=50)
    lr = LogisticRegression(random_state=50)
    classifiers = [rf, gb, lr]
    pipeline = []
    for classifier in classifiers:
        pipe = Pipeline(steps=[('classifier', classifier)])
        pipeline.append(pipe)
    return pipeline
def sort_models(pipelines, X_data, y_data):
    """Score each fitted pipeline and sort the results by weighted F1 score."""
    scores = []
    for pipe in pipelines:
        y_pred = pipe.predict(X_data)
        f1score = f1_score(y_data, y_pred, average='weighted')
        recallscore = recall_score(y_data, y_pred)
        accuracyscore = accuracy_score(y_data, y_pred)
        # AUC uses the predicted probability of the positive class (label 1)
        aucscore = roc_auc_score(y_data, pipe.predict_proba(X_data)[:,1])
        classifier_name = pipe.steps[-1][1].__class__.__name__.split('.')[-1]
        scores.append([classifier_name, f1score, recallscore, accuracyscore, aucscore])
    scores_sorted = sorted(scores, key=lambda x: x[1], reverse=True)
    return scores_sorted
def make_confusion_matrix(cf,
                          group_names=None,
                          categories='auto',
                          count=True,
                          percent=True,
                          cbar=True,
                          xyticks=True,
                          xyplotlabels=True,
                          sum_stats=True,
                          figsize=None,
                          cmap='Blues',
                          title=None):
    # CODE TO GENERATE TEXT INSIDE EACH SQUARE
    blanks = ['' for i in range(cf.size)]
    if group_names and len(group_names)==cf.size:
        group_labels = ["{}\n".format(value) for value in group_names]
    else:
        group_labels = blanks
    if count:
        group_counts = ["{0:0.0f}\n".format(value) for value in cf.flatten()]
    else:
        group_counts = blanks
    if percent:
        group_percentages = ["{0:.2%}".format(value) for value in cf.flatten()/np.sum(cf)]
    else:
        group_percentages = blanks
    box_labels = [f"{v1}{v2}{v3}".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]
    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])
    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS
    if sum_stats:
        # Accuracy is sum of diagonal divided by total observations
        accuracy = np.trace(cf) / float(np.sum(cf))
        # if it is a binary confusion matrix, show some more stats
        if len(cf)==2:
            # Metrics for Binary Confusion Matrices
            precision = cf[1,1] / sum(cf[:,1])  # TP / (TP + FP)
            # How many were correctly classified as positive out of all positives.
            recall = cf[1,1] / sum(cf[1,:])  # TP / (TP + FN)
            specificity = cf[0,0] / sum(cf[0,:])  # TN / (TN + FP)
            # named f1 so it does not shadow sklearn's f1_score
            f1 = 2*(precision*recall / (precision + recall))
            stats_text = "\n\nAccuracy={:0.3f}\nPrecision={:0.3f}\nRecall={:0.3f}\nSpecificity={:0.3f}\nF1 Score={:0.3f}".format(
                accuracy,precision,recall,specificity,f1)
        else:
            stats_text = "\n\nAccuracy={:0.3f}".format(accuracy)
    else:
        stats_text = ""
    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS
    if figsize is None:
        # Get default figure size if not set
        figsize = plt.rcParams.get('figure.figsize')
    if not xyticks:
        # Do not show categories if xyticks is False
        categories = False
    # MAKE THE HEATMAP VISUALIZATION
    plt.figure(figsize=figsize)
    sns.heatmap(cf,annot=box_labels,fmt="",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)
    if xyplotlabels:
        plt.ylabel('True label')
        plt.xlabel('Predicted label' + stats_text)
    else:
        plt.xlabel(stats_text)
    if title:
        plt.title(title)
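The summary statistics that make_confusion_matrix prints can be verified by hand on a small matrix. This sketch uses plain Python lists; `binary_stats` is a hypothetical helper and the counts are invented, arranged in the same [[TN, FP], [FN, TP]] layout that sklearn's confusion_matrix returns for binary labels.

```python
def binary_stats(cf):
    """Compute summary metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cf
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Invented counts: 50 true negatives, 10 false positives,
# 5 false negatives, 35 true positives.
result = binary_stats([[50, 10], [5, 35]])
print({name: round(value, 3) for name, value in result.items()})
```

For these counts, accuracy is 85/100, precision 35/45, recall 35/40, specificity 50/60, and F1 70/85.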
X_up=data.iloc[:,3:]
y_up = data['label']
data.head(3)
label | review_id | text | stars | max_review_on_a_day | review_count | neg_rating_percentage | pos_rating_percentage | neu_rating_percentage | review_len | avg_review_len | std_rating | avg_word_length | num_sent | percentage_of_numerals | percentage_of_capitalized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1 | 0 | Went for lunch and found that my burger was me... | 2 | 1 | 7 | 0.428571 | 0.571429 | 0.0 | 394 | 796.285714 | 1.718249 | 4.00 | 4 | 0.0 | 0.016287 |
1 | 1 | 1 | Best Pho in Indy by far and reasonable prices!... | 5 | 1 | 7 | 0.428571 | 0.571429 | 0.0 | 284 | 796.285714 | 1.718249 | 4.11 | 5 | 0.0 | 0.040909 |
2 | -1 | 2 | First of all, if you go to a Mexican restauran... | 2 | 1 | 7 | 0.428571 | 0.571429 | 0.0 | 688 | 796.285714 | 1.718249 | 4.19 | 9 | 0.0 | 0.020913 |
data.describe()
label | review_id | stars | max_review_on_a_day | review_count | neg_rating_percentage | pos_rating_percentage | neu_rating_percentage | review_len | avg_review_len | std_rating | avg_word_length | num_sent | percentage_of_numerals | percentage_of_capitalized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 999999.000000 |
mean | 0.689436 | 499999.645093 | 3.749503 | 1.000376 | 8.026962 | 0.230687 | 0.670599 | 0.098714 | 568.093435 | 568.092786 | 0.590861 | 4.479397 | 7.771253 | 0.003681 | 0.036528 |
std | 0.724347 | 288675.386807 | 1.478731 | 0.019387 | 21.280554 | 0.351385 | 0.372445 | 0.203614 | 528.544684 | 439.786689 | 0.724417 | 0.775118 | 6.137283 | 0.007732 | 0.022958 |
min | -1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 10.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 1.000000 | 249999.500000 | 3.000000 | 1.000000 | 1.000000 | 0.000000 | 0.500000 | 0.000000 | 229.000000 | 270.000000 | 0.000000 | 4.210000 | 4.000000 | 0.000000 | 0.024339 |
50% | 1.000000 | 500000.000000 | 4.000000 | 1.000000 | 2.000000 | 0.000000 | 0.790698 | 0.000000 | 406.000000 | 454.333333 | 0.000000 | 4.420000 | 6.000000 | 0.000000 | 0.031630 |
75% | 1.000000 | 749999.500000 | 5.000000 | 1.000000 | 6.000000 | 0.333333 | 1.000000 | 0.130435 | 720.000000 | 736.000000 | 1.055290 | 4.670000 | 10.000000 | 0.004819 | 0.042105 |
max | 1.000000 | 999999.000000 | 5.000000 | 2.000000 | 458.000000 | 1.000000 | 1.000000 | 1.000000 | 5000.000000 | 5000.000000 | 2.828427 | 245.500000 | 120.000000 | 0.893617 | 1.000000 |
Training & Testing¶
Two upsampling methods, SMOTE and ADASYN, are applied to create synthetic samples for the minority class. Multiple classifiers, including Random Forest, Gradient Boosting, and Logistic Regression, are trained on the upsampled data using pipeline implementations. The models are evaluated based on various metrics such as F1 score, recall score, accuracy score, and AUC score. The results are visualized through bar plots, showing the performance of each model. The best performing model, using SMOTE, is further analyzed using a confusion matrix to assess its classification performance. The same evaluation process is repeated for the models trained on ADASYN data. Additionally, the ROC curves for the top Gradient Boosting models from both SMOTE and ADASYN are plotted, showcasing the models' discriminative ability. Overall, this report provides insights into the effectiveness of upsampling techniques and different classifiers for fake review classification.
X_up_train, X_up_test, y_up_train, y_up_test = train_test_split(X_up, y_up,test_size=0.2, random_state=3)
X_smoted_train, y_smoted_train = SMOTE(random_state=42).fit_resample(X_up_train, y_up_train)
X_adasyn_train, y_adasyn_train = ADASYN(random_state=42).fit_resample(X_up_train, y_up_train)
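Both resamplers are fitted on the training split only, so the test set keeps its original class balance. To illustrate the idea behind SMOTE, the sketch below creates synthetic minority points by interpolating between a sample and one of its nearest minority neighbours. It is a simplified illustration, not imblearn's implementation; `smote_like` and the toy points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest minority neighbours of base by squared Euclidean distance.
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base))
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
new_points = smote_like(minority, n_new=5)
print(len(new_points))
```

Every synthetic point lies on a segment between two real minority points, so it stays inside the region those points already cover. ADASYN follows the same interpolation idea but generates more points for minority samples that are harder to learn.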
smote_pipelines = make_pipeline()
# Train all the models
for pipe in smote_pipelines:
    pipe.fit(X_smoted_train, y_smoted_train)
# apply sort_models function from above
smoted_sorted_model = sort_models(smote_pipelines, X_up_test, y_up_test)
smoted_sort_model_df = pd.DataFrame(smoted_sorted_model)
smoted_sort_model_df.columns = ['Model','F1 Score','Recall Score','Accuracy Score','AUC Score']
smoted_sort_model_df = pd.melt(smoted_sort_model_df,id_vars=['Model'])
plt.figure(figsize=(9, 4))
plt.title('Models using Smoted Upsampling Behavioral Data',fontsize=15)
ax = sns.barplot(x="Model", y="value", hue = "variable",data=smoted_sort_model_df)
ax.legend(bbox_to_anchor=(1.0, 1.05));
smoted_sort_model_df
Model | variable | value | |
---|---|---|---|
0 | RandomForestClassifier | F1 Score | 0.878752 |
1 | GradientBoostingClassifier | F1 Score | 0.874814 |
2 | LogisticRegression | F1 Score | 0.872867 |
3 | RandomForestClassifier | Recall Score | 0.902930 |
4 | GradientBoostingClassifier | Recall Score | 0.874420 |
5 | LogisticRegression | Recall Score | 0.868476 |
6 | RandomForestClassifier | Accuracy Score | 0.873400 |
7 | GradientBoostingClassifier | Accuracy Score | 0.864645 |
8 | LogisticRegression | Accuracy Score | 0.861800 |
9 | RandomForestClassifier | AUC Score | 0.893174 |
10 | GradientBoostingClassifier | AUC Score | 0.900793 |
11 | LogisticRegression | AUC Score | 0.897184 |
# SMOTE model with the highest AUC -- Gradient Boosting (index 1 in the pipeline list)
y_pred_smoted=smote_pipelines[1].predict(X_up_test)
cm_smoted = confusion_matrix(y_up_test,y_pred_smoted )
sns.set(font_scale=1.0)
labels = ['True Neg','False Pos','False Neg','True Pos']
categories = ['Fake Review', 'Actual Review']
make_confusion_matrix(cm_smoted,
group_names=labels,
categories=categories,
cmap='Blues',figsize=(8,5))
plt.title('Confusion Matrix',fontsize=15)
plt.tight_layout();
adasyn_pipelines = make_pipeline()
# Train all the models
for pipe in adasyn_pipelines:
    pipe.fit(X_adasyn_train, y_adasyn_train)
# apply sort_models function from above
adasyn_sorted_model = sort_models(adasyn_pipelines, X_up_test, y_up_test)
adasyn_sort_model_df = pd.DataFrame(adasyn_sorted_model)
adasyn_sort_model_df.columns = ['Model','F1 Score','Recall Score','Accuracy Score','AUC Score']
adasyn_sort_model_df = pd.melt(adasyn_sort_model_df,id_vars=['Model'])
plt.figure(figsize=(9, 4))
ax = sns.barplot(x="Model", y="value", hue = "variable",data=adasyn_sort_model_df)
ax.legend(bbox_to_anchor=(1.0, 1.05))
plt.title('Models using Adasyn Upsampling Behavioral Data',fontsize=15);
adasyn_sort_model_df
Model | variable | value | |
---|---|---|---|
0 | RandomForestClassifier | F1 Score | 0.876395 |
1 | GradientBoostingClassifier | F1 Score | 0.869196 |
2 | LogisticRegression | F1 Score | 0.866582 |
3 | RandomForestClassifier | Recall Score | 0.899052 |
4 | GradientBoostingClassifier | Recall Score | 0.863496 |
5 | LogisticRegression | Recall Score | 0.856527 |
6 | RandomForestClassifier | Accuracy Score | 0.870470 |
7 | GradientBoostingClassifier | Accuracy Score | 0.857400 |
8 | LogisticRegression | Accuracy Score | 0.853740 |
9 | RandomForestClassifier | AUC Score | 0.890652 |
10 | GradientBoostingClassifier | AUC Score | 0.897070 |
11 | LogisticRegression | AUC Score | 0.896294 |
from sklearn.metrics import roc_curve
plt.figure(figsize=(10,8))
y_pred_smoted=smote_pipelines[0].predict(X_up_test)
#GradientBoosting Upsampling
fpr_gb1, tpr_gb1, thresholds = roc_curve(y_up_test, smote_pipelines[1].predict_proba(X_up_test)[:,1])
fpr_gb2, tpr_gb2, thresholds = roc_curve(y_up_test, adasyn_pipelines[1].predict_proba(X_up_test)[:,1])
roc_auc_GB_smote=roc_auc_score(y_up_test, smote_pipelines[1].predict_proba(X_up_test)[:,1])
roc_auc_GB_adasyn=roc_auc_score(y_up_test, adasyn_pipelines[1].predict_proba(X_up_test)[:,1])
#GradientBoosting with SMOTE and ADASYN
plt.plot(fpr_gb1, tpr_gb1,lw=2, label='Gradient Boosting SMOTE User Activities (ROC area = %0.2f)' % roc_auc_GB_smote)
plt.plot(fpr_gb2, tpr_gb2,lw=2, label='Gradient Boosting ADASYN User Activities (ROC area = %0.2f)' % roc_auc_GB_adasyn)
plt.plot([0,1],[0,1],c='violet',ls='--')
plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])
plt.legend(loc="lower right",ncol=1,fontsize = 'small')
plt.xlabel('False positive rate',fontsize=10)
plt.ylabel('True positive rate',fontsize=10)
plt.title('ROC curve for Fake Review Classification',fontsize=15);
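The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting half. The sketch below computes it that way on invented labels and scores; `auc_from_scores` is a hypothetical helper, not a scikit-learn function.

```python
def auc_from_scores(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_toy = [1, 1, -1, -1, 1]            # invented labels in the notebook's -1 / 1 coding
s_toy = [0.9, 0.6, 0.7, 0.2, 0.8]    # invented predicted probabilities
auc = auc_from_scores(y_toy, s_toy)
print(round(auc, 4))  # 0.8333
```

Five of the six positive-negative pairs are ordered correctly, giving 5/6. The pairwise loop is quadratic, so it is only suitable for small illustrations.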
Best Model¶
To inspect feature importance, we fit a GradientBoostingRegressor on the same engineered features (X_up) and the -1/1 sentiment label (y_up). Note that this is a regressor fitted to a binary label, whereas the model comparison above used GradientBoostingClassifier. The data is split into training and test sets, the model is fit on the training data, and its feature_importances_ attribute gives the relative importance of each input feature. A horizontal bar plot of the sorted importances shows which features contribute most to the prediction and can guide feature selection or further investigation.
from sklearn.ensemble import GradientBoostingRegressor
# Split the engineered features and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_up, y_up, random_state=3)
# Fit the gradient boosting regressor on the training data
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
GradientBoostingRegressor()
# Retrieve the feature importances and sort them in descending order
feature_importance = pd.DataFrame(list(zip(X_train.columns, gbr.feature_importances_)),
columns=['feature', 'importance'])
feature_importance = feature_importance.sort_values(by='importance', ascending=False)
# Visualize the sorted feature importances in a bar plot
plt.figure(figsize=(10, 5))
plt.barh(np.arange(len(feature_importance)), feature_importance['importance'])
plt.yticks(np.arange(len(feature_importance)), feature_importance['feature'], fontsize=13)
plt.title('Feature Importance', fontsize=20)
plt.show()
Business Applications:¶
The predictive model for business attributes based on review and tip textual information has several valuable business applications. These include sentiment analysis to understand customer sentiment and make informed decisions, fraud detection to identify and filter out fake reviews, reputation management to maintain credibility and trustworthiness, and content moderation to ensure high-quality and reliable information for users on review platforms. These applications collectively contribute to improved decision-making, customer satisfaction, and the overall integrity of online review systems.
The complete GitHub repository is available at https://github.com/EthanFalcao/Yelp-Dataset-Challenge-Analysis-and-Prediction