Objective

Our objective is to use a machine learning model to detect fake reviews in the Yelp dataset accurately, and so improve the reliability of online reviews.

Data Sources:

For our analysis, we sourced our data from the Yelp dataset, which can be found at https://www.yelp.com/dataset.

Specifically, we selected three datasets for analysis:

'yelp_academic_dataset_business.json'

'yelp_academic_dataset_review.json'

'yelp_academic_dataset_tip.json'

These datasets were downloaded from https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset.

Workflow

Combine dataset
Feature Engineering
Model Building

I. Data Preprocessing

a. Dataframe
b. Feature Engineering

II. Supervised Learning

a. ML models
b. AUC-ROC Comparison
c. Best Model Feature Importance

I. Data Preprocessing

The data preprocessing for the Yelp review dataset involved loading the data in chunks, sampling 1,000,000 rows, extracting the year from the 'date' column, and visualizing the review distribution over the years.
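The steps described above can be tried on a small scale before running them on the full file. The sketch below is only an illustration: the five JSON lines are a made-up stand-in for yelp_academic_dataset_review.json, and the chunk size and sample size are shrunk to match.

```python
import io
import pandas as pd

# Hypothetical stand-in for yelp_academic_dataset_review.json (JSON Lines format).
raw = io.StringIO(
    '{"review_id": "a", "stars": 5, "date": "2018-04-04 21:09:53"}\n'
    '{"review_id": "b", "stars": 2, "date": "2019-02-14 03:47:48"}\n'
    '{"review_id": "c", "stars": 4, "date": "2019-07-01 10:00:00"}\n'
    '{"review_id": "d", "stars": 1, "date": "2020-05-24 12:22:14"}\n'
    '{"review_id": "e", "stars": 3, "date": "2020-11-30 08:15:00"}\n'
)

# Collect the chunks in a list and concatenate once at the end.
chunks = [chunk for chunk in pd.read_json(raw, lines=True, chunksize=2)]
reviews = pd.concat(chunks, ignore_index=True)

# Make sure the date column is a datetime before using the .dt accessor.
reviews["date"] = pd.to_datetime(reviews["date"])

# Draw a reproducible random sample (4 of 5 rows here, 1,000,000 in the notebook).
sample = reviews.sample(n=4, random_state=42)

# Count the sampled reviews per year.
count_by_year = sample.groupby(sample["date"].dt.year).size()
print(count_by_year)
```

Fixing random_state keeps the sample identical between runs, which matters here because every later feature is computed from the sampled rows.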

In [1]:

import nltk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")


import re
import json
import numpy as np
from random import shuffle
from sklearn.svm import SVC
from sklearn import neighbors
from nltk.tokenize import sent_tokenize
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split


from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

from sklearn.metrics import accuracy_score, recall_score,classification_report 

from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN
import warnings
warnings.filterwarnings('ignore')

In [2]:

# Define the chunk size and collect the chunks in a list
chunksize = 10000

chunks = []
# Iterate over the file in chunks; concatenating once at the end avoids re-copying the growing DataFrame
for chunk in pd.read_json('/Users/Ethan Vaz Falcao/Downloads/Yelp_data/Github/Yelp Datasets/yelp_academic_dataset_review.json', lines=True, chunksize=chunksize):
    chunks.append(chunk)
data_review = pd.concat(chunks, ignore_index=True)

In [3]:

# randomly select 1,000,000 rows
data_review = data_review.sample(n=1000000, random_state=42)

# show the first 3 rows of the randomly selected dataframe
data_review.head(3)

Out[3]:

	review_id	user_id	business_id	stars	useful	text	date
1295256	J5Q1gH4ACCj6CtQG7Yom7g	56gL9KEJNHiSDUoyjk2o3Q	8yR12PNSMo6FBYx1u5KPlw	2	1	Went for lunch and found that my burger was me...	2018-04-04 21:09:53
3297618	HlXP79ecTquSVXmjM10QxQ	bAt9OUFX9ZRgGLCXG22UmA	pBNucviUkNsiqhJv5IFpjg	5	0	I needed a new tires for my wife's car. They h...	2020-05-24 12:22:14
1217795	JBBULrjyGx6vHto2osk_CQ	NRHPcLq2vGWqgqwVugSgnQ	8sf9kv6O4GgEb0j1o22N1g	5	0	Jim Woltman who works at Goleta Honda is 5 sta...	2019-02-14 03:47:48

In [4]:

count_by_year = data_review.groupby(data_review['date'].dt.year).size()
print(count_by_year)

date
2005       124
2006       529
2007      2161
2008      6839
2009     10546
2010     19696
2011     33117
2012     41255
2013     55172
2014     74578
2015     98597
2016    108142
2017    116992
2018    129703
2019    129758
2020     79558
2021     88686
2022      4547
dtype: int64

In [5]:

# extract the "date" column
dates = data_review["date"]

# extract the year from each date
years = [date.year for date in dates]

# plot a histogram of the years
plt.hist(years, bins=range(min(years), max(years) + 1))
plt.xlabel("Year")
plt.ylabel("Frequency")
plt.title("Histogram of Years")
plt.show()

In [6]:

#set the year of the dataset  at 2021 for testing
#data_review_test = data_review[data_review['date'].dt.year >= 2018]
#data_review_test = data_review[data_review['date'].dt.year >= 2020]

In [7]:

data_review.head(3)

Out[7]:

	review_id	user_id	business_id	stars	useful	text	date
1295256	J5Q1gH4ACCj6CtQG7Yom7g	56gL9KEJNHiSDUoyjk2o3Q	8yR12PNSMo6FBYx1u5KPlw	2	1	Went for lunch and found that my burger was me...	2018-04-04 21:09:53
3297618	HlXP79ecTquSVXmjM10QxQ	bAt9OUFX9ZRgGLCXG22UmA	pBNucviUkNsiqhJv5IFpjg	5	0	I needed a new tires for my wife's car. They h...	2020-05-24 12:22:14
1217795	JBBULrjyGx6vHto2osk_CQ	NRHPcLq2vGWqgqwVugSgnQ	8sf9kv6O4GgEb0j1o22N1g	5	0	Jim Woltman who works at Goleta Honda is 5 sta...	2019-02-14 03:47:48

Step 1: Sentiment Analysis

We used NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner) module for sentiment analysis. The code applies SentimentIntensityAnalyzer to each review text in the data_review DataFrame and reads the compound polarity score. Reviews with a compound score above 0 are labelled 1 (positive); all other reviews, including those with a score of exactly 0, are labelled -1 (negative). The resulting dataset holds the original reviews and their sentiment labels.
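The labelling rule can be isolated from the analyzer itself. The following sketch uses made-up compound scores rather than real VADER output, and the helper name label_from_compound is hypothetical; it only shows how the threshold treats positive, zero, and negative scores.

```python
def label_from_compound(compound: float) -> int:
    """Map a VADER compound score in [-1, 1] to the notebook's binary label."""
    # Strictly positive scores are labelled 1; zero and negative scores are labelled -1.
    return 1 if compound > 0 else -1

# Hypothetical compound scores, for illustration only.
scores = [0.84, 0.0, -0.42]
labels = [label_from_compound(s) for s in scores]
print(labels)  # [1, -1, -1]
```

Note that a review VADER scores as exactly neutral ends up in the -1 class under this rule.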

In [8]:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# initialize sentiment analysis model
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# apply model to reviews and assign labels
data_review['label'] = data_review['text'].apply(lambda x: 1 if sia.polarity_scores(x)['compound'] > 0 else -1)

# print head of resulting dataset
data_review.head(3)
 

[nltk_data] Downloading package vader_lexicon to C:\Users\Ethan Vaz
[nltk_data]     Falcao\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

Out[8]:

	review_id	user_id	business_id	stars	useful	text	date	label
1295256	J5Q1gH4ACCj6CtQG7Yom7g	56gL9KEJNHiSDUoyjk2o3Q	8yR12PNSMo6FBYx1u5KPlw	2	1	Went for lunch and found that my burger was me...	2018-04-04 21:09:53	-1
3297618	HlXP79ecTquSVXmjM10QxQ	bAt9OUFX9ZRgGLCXG22UmA	pBNucviUkNsiqhJv5IFpjg	5	0	I needed a new tires for my wife's car. They h...	2020-05-24 12:22:14	1
1217795	JBBULrjyGx6vHto2osk_CQ	NRHPcLq2vGWqgqwVugSgnQ	8sf9kv6O4GgEb0j1o22N1g	5	0	Jim Woltman who works at Goleta Honda is 5 sta...	2019-02-14 03:47:48	1

In [9]:

# copy the selected columns so later column assignments do not act on a view of data_review
data = data_review[['user_id', 'business_id','date','stars', 'text', "label"]].copy()

In [10]:

# check the data description
# no missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 1295256 to 794448
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   user_id      1000000 non-null  object        
 1   business_id  1000000 non-null  object        
 2   date         1000000 non-null  datetime64[ns]
 3   stars        1000000 non-null  int64         
 4   text         1000000 non-null  object        
 5   label        1000000 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 53.4+ MB

In [11]:

#change date to date time
data['date'] = pd.to_datetime(data['date'])

Step 2: Feature Engineering

In the feature engineering step, several new per-user features were created from the review dataset. Here is a summary of the features:

1) Maximum number of reviews in a day: the largest number of reviews a user has written on a single day (the code groups on the full 'date' timestamp). It reflects the user's reviewing activity and helps identify highly active users.

2) Rating category: each review is classified by its star rating as negative (below 3 stars), positive (above 3 stars), or neutral (exactly 3 stars). The three percentage features below are built from this category.

3) Review count: the total number of reviews written by each user. It indicates the user's reviewing activity and engagement level.

4) Negative rating percentage: the share of a user's reviews that are negative. It helps identify users who tend to write more negative reviews.

5) Positive rating percentage: the share of a user's reviews that are positive. It identifies users who predominantly write positive reviews.

6) Neutral rating percentage: the share of a user's reviews that are neutral. It indicates users who write a significant number of neutral reviews.

7) Average review length: the mean length, in characters, of the reviews written by each user. It reflects the user's writing style and level of detail.

8) Standard deviation of ratings: the standard deviation of the star ratings given by a user, set to 0 for users with a single review. It captures how much the user's ratings vary.

The notebook also computes four per-review text features: avg_word_length, num_sent, percentage_of_numerals, and percentage_of_capitalized. Together, these engineered features describe user behavior and writing style and serve as the model inputs in the supervised learning step.
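As a compact illustration of the per-user features listed above, the sketch below computes several of them in one groupby().agg call on a made-up DataFrame. The notebook builds the same columns step by step with separate merges; this is an alternative formulation under the same thresholds, not the notebook's code, and the helper columns is_neg, is_pos, and is_neu are hypothetical.

```python
import pandas as pd

# Toy reviews; the column names follow the notebook, the values are made up.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "stars":   [5, 2, 3, 4],
    "text":    ["Great pho!", "Cold burger.", "It was ok.", "Nice staff"],
})

# Same thresholds as the notebook: < 3 negative, > 3 positive, exactly 3 neutral.
toy["is_neg"] = toy["stars"] < 3
toy["is_pos"] = toy["stars"] > 3
toy["is_neu"] = toy["stars"] == 3
toy["review_len"] = toy["text"].str.len()

user_features = toy.groupby("user_id").agg(
    review_count=("stars", "size"),
    neg_rating_percentage=("is_neg", "mean"),
    pos_rating_percentage=("is_pos", "mean"),
    neu_rating_percentage=("is_neu", "mean"),
    avg_review_len=("review_len", "mean"),
    std_rating=("stars", "std"),
)
# A single review has an undefined standard deviation; use 0 as the notebook does.
user_features["std_rating"] = user_features["std_rating"].fillna(0)
print(user_features)
```

The mean of a boolean column is the fraction of True values, which is why the percentage features fall out of a plain "mean" aggregation. The resulting frame can be merged back onto the review rows by user_id.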

In [12]:

data.head(3)

Out[12]:

	user_id	business_id	date	stars	text	label
1295256	56gL9KEJNHiSDUoyjk2o3Q	8yR12PNSMo6FBYx1u5KPlw	2018-04-04 21:09:53	2	Went for lunch and found that my burger was me...	-1
3297618	bAt9OUFX9ZRgGLCXG22UmA	pBNucviUkNsiqhJv5IFpjg	2020-05-24 12:22:14	5	I needed a new tires for my wife's car. They h...	1
1217795	NRHPcLq2vGWqgqwVugSgnQ	8sf9kv6O4GgEb0j1o22N1g	2019-02-14 03:47:48	5	Jim Woltman who works at Goleta Honda is 5 sta...	1

In [13]:

#1. Maximum number of reviews in a day
review_day=data.groupby(['user_id','date']).count().sort_values(by='text').reset_index()
review_day.head(3)

Out[13]:

	user_id	date	business_id	stars	text	label
0	---2PmXbF47D870stH1jqA	2015-01-21 20:39:14	1	1	1	1
1	eg0WSgiKFOL3fgb4Ri1bCQ	2012-07-14 04:33:19	1	1	1	1
2	eg0cwrodeKGLeLIDTSCXoA	2010-08-16 13:43:42	1	1	1	1
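The cell above groups on the full 'date' timestamp, so two reviews posted by the same user on the same day at different times fall into different groups. If the intent is a count per calendar day, one option is to group on the timestamp normalised to midnight. The sketch below contrasts the two groupings on made-up data; it is a suggestion, not what the notebook ran.

```python
import pandas as pd

# Toy data: u1 posts twice on 2018-04-04 at different times, u2 posts once.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "date": pd.to_datetime([
        "2018-04-04 09:00:00",
        "2018-04-04 21:09:53",
        "2018-05-16 11:04:31",
        "2019-02-14 03:47:48",
    ]),
})

# Grouping on the full timestamp keeps the two same-day reviews in separate groups.
per_timestamp = toy.groupby(["user_id", "date"]).size()

# Normalising to midnight groups reviews by calendar day instead.
per_day = toy.groupby(["user_id", toy["date"].dt.normalize()]).size()

# Largest group size per user under each grouping.
max_per_timestamp = per_timestamp.groupby(level="user_id").max()
max_per_day = per_day.groupby(level="user_id").max()
print(max_per_timestamp)
print(max_per_day)
```

On this toy data the timestamp grouping reports a maximum of one review for u1, while the calendar-day grouping reports two.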

In [14]:

max_review=review_day.groupby('user_id').max().reset_index()[['user_id','text']]
max_review.rename(columns={'text':'max_review_on_a_day'},inplace=True)
max_review.head()

Out[14]:

	user_id	max_review_on_a_day
0	---2PmXbF47D870stH1jqA	1
1	---UgP94gokyCDuB5zUssA	1
2	---r61b7EpVPkb4UVme5tA	1
3	--17Db1K-KujRuN7hY9Z0Q	1
4	--1oZcRo9-QKOtTqREKB6g	1

In [15]:

#merge max_review with data
data=pd.merge(data,max_review,on=['user_id'])
data.head(3)

Out[15]:

	user_id	business_id	date	stars	text	label	max_review_on_a_day
0	56gL9KEJNHiSDUoyjk2o3Q	8yR12PNSMo6FBYx1u5KPlw	2018-04-04 21:09:53	2	Went for lunch and found that my burger was me...	-1	1
1	56gL9KEJNHiSDUoyjk2o3Q	oEwmCZknUHgHfEBdKA2SZA	2018-05-16 11:04:31	5	Best Pho in Indy by far and reasonable prices!...	1	1
2	56gL9KEJNHiSDUoyjk2o3Q	8zJYnVNKD7XUHaKFMMREBg	2019-09-23 22:52:23	2	First of all, if you go to a Mexican restauran...	-1	1

In [16]:

#2. Percentage of reviews with positive / negative ratings (2)

#Assuming a positive rating is stars > 3, a negative rating is stars < 3, and exactly 3 stars is neutral

In [17]:

(data.groupby(data['stars']
    .apply(lambda x: 'negative' if x < 3 else 'positive' if x > 3 else 'neutral'))['stars']
    .count())

Out[17]:

stars
negative    230687
neutral      98714
positive    670599
Name: stars, dtype: int64

In [18]:

# count number of reviews each user_id had written
d=data.groupby('user_id').count().reset_index()[['user_id','text']]
d.rename(columns={'text':'review_count'},inplace=True)

#merge review_count with data
data=pd.merge(data,d,on=['user_id'])
data.head(3)

Out[18]:

	user_id	business_id	date	stars	text	label	max_review_on_a_day	review_count
0	56gL9KEJNHiSDUoyjk2o3Q	8yR12PNSMo6FBYx1u5KPlw	2018-04-04 21:09:53	2	Went for lunch and found that my burger was me...	-1	1	7
1	56gL9KEJNHiSDUoyjk2o3Q	oEwmCZknUHgHfEBdKA2SZA	2018-05-16 11:04:31	5	Best Pho in Indy by far and reasonable prices!...	1	1	7
2	56gL9KEJNHiSDUoyjk2o3Q	8zJYnVNKD7XUHaKFMMREBg	2019-09-23 22:52:23	2	First of all, if you go to a Mexican restauran...	-1	1	7

In [19]:

# if a rating is less than 3 then label it negative, if greater than 3 then label it positive
# else neutral

data['rating_cat']=data['stars'].apply(lambda x:'negative' if x < 3 else 'positive' if x > 3 else 'neutral')

In [20]:

data.stars.value_counts()

Out[20]:

5    462646
4    207953
1    153057
3     98714
2     77630
Name: stars, dtype: int64

In [21]:

# count number of negative reviews for each user_id
neg_review_count=data.groupby('user_id')['rating_cat'].apply(lambda x: (x=='negative').sum()).reset_index(name='count')['count'].to_list()
# count number of positive reviews for each user_id
pos_review_count=data.groupby('user_id')['rating_cat'].apply(lambda x: (x=='positive').sum()).reset_index(name='count')['count'].to_list()
# count number of neutral reviews for each user_id
neu_review_count=data.groupby('user_id')['rating_cat'].apply(lambda x: (x=='neutral').sum()).reset_index(name='count')['count'].to_list()

In [22]:

review_df=data.groupby('user_id').count().reset_index()[['user_id','text']]

In [23]:

review_df['neg_review_count']=neg_review_count
review_df['pos_review_count']=pos_review_count
review_df['neu_review_count']=neu_review_count

In [24]:

review_df['neg_rating_percentage']=review_df['neg_review_count']/review_df['text']
review_df['pos_rating_percentage']=review_df['pos_review_count']/review_df['text']
review_df['neu_rating_percentage']=review_df['neu_review_count']/review_df['text']

In [25]:

review_df.drop(['neg_review_count','pos_review_count',
          'neu_review_count','text'],axis=1,inplace=True)

In [26]:

review_df.head()

Out[26]:

	user_id	pos_rating_percentage	neu_rating_percentage
0	---2PmXbF47D870stH1jqA	1.0	0.0
1	---UgP94gokyCDuB5zUssA	1.0	0.0
2	---r61b7EpVPkb4UVme5tA	1.0	0.0
3	--17Db1K-KujRuN7hY9Z0Q	1.0	0.0
4	--1oZcRo9-QKOtTqREKB6g	0.0	1.0

In [27]:

# merge review_df
data=pd.merge(data,review_df,on=['user_id'])

In [28]:

data.head(3)

Out[28]:

	user_id	business_id	date	stars	text	label	max_review_on_a_day	review_count	rating_cat	neg_rating_percentage	pos_rating_percentage
0	56gL9KEJNHiSDUoyjk2o3Q	8yR12PNSMo6FBYx1u5KPlw	2018-04-04 21:09:53	2	Went for lunch and found that my burger was me...	-1	1	7	negative	0.428571	0.571429
1	56gL9KEJNHiSDUoyjk2o3Q	oEwmCZknUHgHfEBdKA2SZA	2018-05-16 11:04:31	5	Best Pho in Indy by far and reasonable prices!...	1	1	7	positive	0.428571	0.571429
2	56gL9KEJNHiSDUoyjk2o3Q	8zJYnVNKD7XUHaKFMMREBg	2019-09-23 22:52:23	2	First of all, if you go to a Mexican restauran...	-1	1	7	negative	0.428571	0.571429

In [29]:

#3. Average review length 
data['review_len']=data['text'].apply(lambda x: len(x))

In [30]:

d=data.groupby('user_id')['review_len'].mean().reset_index()
d.rename(columns={'review_len':'avg_review_len'},inplace=True)
data=pd.merge(data,d,on=['user_id'])

In [31]:

#Standard deviation of ratings of the reviewer’s reviews 
d=data.groupby('user_id')['stars'].std().reset_index()
d.rename(columns={'stars':'std_rating'},inplace=True)
data=pd.merge(data,d,on=['user_id'])

In [32]:

data['std_rating'] = data['std_rating'].fillna(0)

In [33]:

data.head(3)

Out[33]:

	user_id	business_id	date	stars	text	label	max_review_on_a_day	review_count	rating_cat	neg_rating_percentage	pos_rating_percentage	review_len	avg_review_len	std_rating
0	56gL9KEJNHiSDUoyjk2o3Q	8yR12PNSMo6FBYx1u5KPlw	2018-04-04 21:09:53	2	Went for lunch and found that my burger was me...	-1	1	7	negative	0.428571	0.571429	394	796.285714	1.718249
1	56gL9KEJNHiSDUoyjk2o3Q	oEwmCZknUHgHfEBdKA2SZA	2018-05-16 11:04:31	5	Best Pho in Indy by far and reasonable prices!...	1	1	7	positive	0.428571	0.571429	284	796.285714	1.718249
2	56gL9KEJNHiSDUoyjk2o3Q	8zJYnVNKD7XUHaKFMMREBg	2019-09-23 22:52:23	2	First of all, if you go to a Mexican restauran...	-1	1	7	negative	0.428571	0.571429	688	796.285714	1.718249

In [34]:

def avg_word_length(x):
    sentence = x
    words = sentence.split()
    average = round(sum(len(word) for word in words) / len(words),2)
    return average

In [35]:

data['avg_word_length']=data['text'].apply(lambda x: avg_word_length(x))

In [36]:

def num_sent(x):
    number_of_sentences = sent_tokenize(x)
    return len(number_of_sentences)

In [37]:

data['num_sent']=data['text'].apply(lambda x: num_sent(x))

In [38]:

def percentage_of_numerals(x):
    numbers = sum(c.isdigit() for c in x)
    letter = sum(c.isalpha() for c in x)
    
    try: 
        result = numbers / letter
    except ZeroDivisionError:
        result=-1
    return result

In [39]:

data['percentage_of_numerals']=data['text'].apply(lambda x: percentage_of_numerals(x))

In [40]:

# percentage_of_numerals returns -1 when numbers/letter raises ZeroDivisionError,
# which means the review contains no letters
data[data['percentage_of_numerals']==-1]
# remove reviews that do not contain any letters
data=data[data['percentage_of_numerals'] != -1 ]

In [41]:

# share of uppercase letters among all letters in the review
def percentage_of_capitalized(x):
    cap = sum(c.isupper() for c in x)
    letter = sum(c.isalpha() for c in x)
    
    try: 
        result = cap / letter
    except ZeroDivisionError:
        result=-1
    return result

In [42]:

data['percentage_of_capitalized']=data['text'].apply(lambda x: percentage_of_capitalized(x))
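The text-level helpers above share one pattern: count a class of characters and divide by another count, falling back to -1 when the denominator is zero. A minimal pure-Python sketch of that pattern follows; safe_ratio and text_features are hypothetical names introduced here, and the feature keys mirror the notebook's column names.

```python
def safe_ratio(numerator: int, denominator: int, default: float = -1.0) -> float:
    """Return numerator / denominator, or `default` when the denominator is zero."""
    return numerator / denominator if denominator else default

def text_features(text: str) -> dict:
    """Character-level ratios in the spirit of the notebook's helper functions."""
    letters = sum(c.isalpha() for c in text)
    digits = sum(c.isdigit() for c in text)
    uppers = sum(c.isupper() for c in text)
    words = text.split()
    return {
        "avg_word_length": safe_ratio(sum(len(w) for w in words), len(words)),
        "percentage_of_numerals": safe_ratio(digits, letters),
        "percentage_of_capitalized": safe_ratio(uppers, letters),
    }

print(text_features("Best Pho in Indy 24/7"))
print(text_features("12345"))  # no letters: the two letter-based ratios fall back to -1.0
```

Despite their names, the numeral and capitalization features are ratios against the letter count, not percentages of all characters.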

In [43]:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [44]:

#clean up the review text
from nltk.tokenize import word_tokenize
import string
import re
from nltk.corpus import stopwords

# Compile the pattern and load the stopword list once rather than on every call
SYMBOLS_RE = re.compile(r'["\'#&?!:_/(){}\[\]|@,;.]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    #tokenizer breaks string into a list of words
    text = word_tokenize(text)
    text = " ".join([c for c in text if c not in string.punctuation])
    text = text.lower() # lowercase text
    text = SYMBOLS_RE.sub(' ', text) # replace the listed symbols with a space
    text = re.sub(r'\d+','', text) # remove numbers
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
    return text

In [45]:

data.label.value_counts()

Out[45]:

 1    844717
-1    155282
Name: label, dtype: int64

In [46]:

data=data.reset_index()

In [47]:

data.rename(columns={"index": "review_id"},inplace=True)

In [48]:

data.label.value_counts()

Out[48]:

 1    844717
-1    155282
Name: label, dtype: int64

In [49]:

data.columns

Out[49]:

Index(['review_id', 'user_id', 'business_id', 'date', 'stars', 'text', 'label',
       'max_review_on_a_day', 'review_count', 'rating_cat',
       'neg_rating_percentage', 'pos_rating_percentage',
       'neu_rating_percentage', 'review_len', 'avg_review_len', 'std_rating',
       'avg_word_length', 'num_sent', 'percentage_of_numerals',
       'percentage_of_capitalized'],
      dtype='object')

In [50]:

data.drop(['user_id', 'business_id', 'date','rating_cat'],axis=1,inplace=True)
data = data.reindex(columns=['label','review_id', 'text', 'stars',
                                          'max_review_on_a_day',
                                          'review_count', 'neg_rating_percentage',
                                          'pos_rating_percentage', 'neu_rating_percentage', 'review_len',
                                          'avg_review_len', 'std_rating', 'avg_word_length', 'num_sent',
                                          'percentage_of_numerals', 'percentage_of_capitalized'])

In [51]:

data.head(3)

Out[51]:

	label	review_id	text	stars	max_review_on_a_day	review_count	neg_rating_percentage	pos_rating_percentage	review_len	avg_review_len	std_rating	avg_word_length	num_sent	percentage_of_capitalized
0	-1	0	Went for lunch and found that my burger was me...	2	1	7	0.428571	0.571429	394	796.285714	1.718249	4.00	4	0.016287
1	1	1	Best Pho in Indy by far and reasonable prices!...	5	1	7	0.428571	0.571429	284	796.285714	1.718249	4.11	5	0.040909
2	-1	2	First of all, if you go to a Mexican restauran...	2	1	7	0.428571	0.571429	688	796.285714	1.718249	4.19	9	0.020913

II. Supervised Learning

The provided code includes functions for supervised learning and evaluation. It creates pipelines for three classifiers (Random Forest, Gradient Boosting, and Logistic Regression) to streamline the training and prediction process. The models are then sorted based on their performance metrics such as F1 score, recall, accuracy, and AUC. Additionally, there's a function for visualizing the confusion matrix and calculating summary statistics. These functions aid in developing and comparing models, facilitating informed decision-making in classification tasks.
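To make two of the metrics named above concrete, the sketch below computes recall and AUC by hand on made-up labels and probabilities. The notebook itself uses scikit-learn's recall_score and roc_auc_score; these small functions are only an illustration of what those numbers mean.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted positive: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

def auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (-1 / 1 as in the notebook) and predicted probabilities of class 1.
y_true = [1, 1, 1, -1, -1]
proba = [0.9, 0.6, 0.4, 0.5, 0.2]
y_pred = [1 if p >= 0.5 else -1 for p in proba]

print(recall(y_true, y_pred))  # 2 of the 3 positives are recovered
print(auc(y_true, proba))      # 5 of the 6 positive/negative pairs are ranked correctly
```

Recall depends on the 0.5 decision threshold, while AUC depends only on how the scores rank the examples, which is why the two metrics can order models differently.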

In [52]:

def make_pipeline():
    """Create one single-step pipeline per classifier and return them as a list."""
    
    rf = RandomForestClassifier(random_state=50)
    gb=GradientBoostingClassifier(random_state=50)
    lr = LogisticRegression(random_state=50)
    classifiers=[rf,gb,lr]

    pipeline = []
    for classifier in classifiers:
        pipe = Pipeline(steps=[('classifier', classifier)])
        pipeline.append(pipe)

    return pipeline

In [53]:

def sort_models(pipelines, X_data, y_data):
    """Sort models based on their f1 score."""
    scores = []
    for pipe in pipelines:
        y_pred = pipe.predict(X_data)
        f1score = f1_score(y_data, y_pred, average='weighted')
        recallscore=recall_score(y_data,y_pred)
        accuracyscore=accuracy_score(y_data,y_pred)

        aucscore=roc_auc_score(y_data, pipe.predict_proba(X_data)[:,1])
        
        classfier_name = pipe.steps[-1][1].__class__.__name__.split('.')[-1]
        scores.append([classfier_name,f1score,recallscore,accuracyscore,aucscore])
        
    scores_sorted = sorted(scores,key=lambda x:x[1],reverse=True)
    
    return scores_sorted

In [54]:

def make_confusion_matrix(cf,
                          group_names=None,
                          categories='auto',
                          count=True,
                          percent=True,
                          cbar=True,
                          xyticks=True,
                          xyplotlabels=True,
                          sum_stats=True,
                          figsize=None,
                          cmap='Blues',
                          title=None):
    # CODE TO GENERATE TEXT INSIDE EACH SQUARE
    blanks = ['' for i in range(cf.size)]

    if group_names and len(group_names)==cf.size:
        group_labels = ["{}\n".format(value) for value in group_names]
    else:
        group_labels = blanks

    if count:
        group_counts = ["{0:0.0f}\n".format(value) for value in cf.flatten()]
    else:
        group_counts = blanks

    if percent:
        group_percentages = ["{0:.2%}".format(value) for value in cf.flatten()/np.sum(cf)]
    else:
        group_percentages = blanks

    box_labels = [f"{v1}{v2}{v3}".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]
    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])


    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS
    if sum_stats:
        #Accuracy is sum of diagonal divided by total observations
        accuracy  = np.trace(cf) / float(np.sum(cf))

        #if it is a binary confusion matrix, show some more stats
        if len(cf)==2:
            #Metrics for Binary Confusion Matrices
            precision = cf[1,1] / sum(cf[:,1]) #TP/TP+FP
#             How much were correctly classified as positive out of all positives.
            recall    = cf[1,1] / sum(cf[1,:]) #TP/TP+FN
            specificity = cf[0,0] / sum(cf[0,:]) #TN/TN+FP 
            f1_score  = 2*(precision*recall / (precision + recall))
        
            stats_text = "\n\nAccuracy={:0.3f}\nPrecision={:0.3f}\nRecall={:0.3f}\nSpecificity={:0.3f}\nF1 Score={:0.3f}".format(
                accuracy,precision,recall,specificity,f1_score)
        else:
            stats_text = "\n\nAccuracy={:0.3f}".format(accuracy)
    else:
        stats_text = ""


    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS
    if figsize is None:
        #Get default figure size if not set
        figsize = plt.rcParams.get('figure.figsize')

    if xyticks==False:
        #Do not show categories if xyticks is False
        categories=False


    # MAKE THE HEATMAP VISUALIZATION
    plt.figure(figsize=figsize)
    sns.heatmap(cf,annot=box_labels,fmt="",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)

    if xyplotlabels:
        plt.ylabel('True label')
        plt.xlabel('Predicted label' + stats_text)
    else:
        plt.xlabel(stats_text)
    
    if title:
        plt.title(title)
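The summary statistics inside make_confusion_matrix can be checked by hand on a small matrix. The counts below are made up; the layout (rows are true labels, columns are predicted labels, negative class first) is the one scikit-learn's confusion_matrix produces for binary labels.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows are true labels, columns are predictions.
cf = np.array([[80, 20],
               [10, 90]])

# ravel() flattens row by row, giving TN, FP, FN, TP for this layout.
tn, fp, fn, tp = cf.ravel()
accuracy = (tp + tn) / cf.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} f1={f1:.3f}")
```

With the notebook's labels, the first row and column correspond to -1 and the second to 1, so "positive" in these formulas means label 1.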

In [55]:

X_up=data.iloc[:,3:]
y_up = data['label']

In [56]:

data.head(3)

Out[56]:

	label	review_id	text	stars	max_review_on_a_day	review_count	neg_rating_percentage	pos_rating_percentage	review_len	avg_review_len	std_rating	avg_word_length	num_sent	percentage_of_capitalized
0	-1	0	Went for lunch and found that my burger was me...	2	1	7	0.428571	0.571429	394	796.285714	1.718249	4.00	4	0.016287
1	1	1	Best Pho in Indy by far and reasonable prices!...	5	1	7	0.428571	0.571429	284	796.285714	1.718249	4.11	5	0.040909
2	-1	2	First of all, if you go to a Mexican restauran...	2	1	7	0.428571	0.571429	688	796.285714	1.718249	4.19	9	0.020913

In [57]:

data.describe()

Out[57]:

	label	review_id	stars	max_review_on_a_day	review_count	neg_rating_percentage	pos_rating_percentage	neu_rating_percentage	review_len	avg_review_len	std_rating	avg_word_length	num_sent	percentage_of_numerals	percentage_of_capitalized
count	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000	999999.000000
mean	0.689436	499999.645093	3.749503	1.000376	8.026962	0.230687	0.670599	0.098714	568.093435	568.092786	0.590861	4.479397	7.771253	0.003681	0.036528
std	0.724347	288675.386807	1.478731	0.019387	21.280554	0.351385	0.372445	0.203614	528.544684	439.786689	0.724417	0.775118	6.137283	0.007732	0.022958
min	-1.000000	0.000000	1.000000	1.000000	1.000000	0.000000	0.000000	0.000000	1.000000	10.000000	0.000000	1.000000	1.000000	0.000000	0.000000
25%	1.000000	249999.500000	3.000000	1.000000	1.000000	0.000000	0.500000	0.000000	229.000000	270.000000	0.000000	4.210000	4.000000	0.000000	0.024339
50%	1.000000	500000.000000	4.000000	1.000000	2.000000	0.000000	0.790698	0.000000	406.000000	454.333333	0.000000	4.420000	6.000000	0.000000	0.031630
75%	1.000000	749999.500000	5.000000	1.000000	6.000000	0.333333	1.000000	0.130435	720.000000	736.000000	1.055290	4.670000	10.000000	0.004819	0.042105
max	1.000000	999999.000000	5.000000	2.000000	458.000000	1.000000	1.000000	1.000000	5000.000000	5000.000000	2.828427	245.500000	120.000000	0.893617	1.000000

Training & Testing

Two upsampling methods, SMOTE and ADASYN, are applied to create synthetic samples for the minority class. Multiple classifiers, including Random Forest, Gradient Boosting, and Logistic Regression, are trained on the upsampled data using pipeline implementations. The models are evaluated based on various metrics such as F1 score, recall score, accuracy score, and AUC score. The results are visualized through bar plots, showing the performance of each model. The best performing model, using SMOTE, is further analyzed using a confusion matrix to assess its classification performance. The same evaluation process is repeated for the models trained on ADASYN data. Additionally, the ROC curves for the top Gradient Boosting models from both SMOTE and ADASYN are plotted, showcasing the models' discriminative ability. Overall, this report provides insights into the effectiveness of upsampling techniques and different classifiers for fake review classification.
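The core idea behind SMOTE is to place each synthetic sample on the line segment between a minority-class row and one of its nearest minority-class neighbours. The NumPy sketch below illustrates only that interpolation step; it is not imblearn's implementation, the function name smote_like_samples is hypothetical, and the data is made up. ADASYN follows the same interpolation but generates more samples around minority rows that are surrounded by majority-class neighbours.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))            # pick a minority row
        other = rng.choice(neighbours[base])   # pick one of its k nearest neighbours
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = X[base] + gap * (X[other] - X[base])
    return synthetic

# Hypothetical minority-class rows with two features.
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_rows = smote_like_samples(X_min, n_new=3)
print(new_rows)
```

As in the notebook, resampling of this kind belongs on the training split only, so that the test set keeps the original class balance.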

In [58]:

X_up_train, X_up_test, y_up_train, y_up_test = train_test_split(X_up, y_up,test_size=0.2, random_state=3)


X_smoted_train, y_smoted_train = SMOTE(random_state=42).fit_resample(X_up_train, y_up_train)
X_adasyn_train, y_adasyn_train = ADASYN(random_state=42).fit_resample(X_up_train, y_up_train)

In [59]:

smote_pipelines = make_pipeline()
# Train all the models
for pipe in smote_pipelines:
    pipe.fit(X_smoted_train, y_smoted_train)

In [60]:

# apply sort_models function from above
smoted_sorted_model = sort_models(smote_pipelines, X_up_test, y_up_test)
smoted_sort_model_df = pd.DataFrame(smoted_sorted_model)
smoted_sort_model_df.columns = ['Model','F1 Score','Recall Score','Accuracy Score','AUC Score']
smoted_sort_model_df = pd.melt(smoted_sort_model_df,id_vars=['Model'])

In [61]:

plt.figure(figsize=(9, 4))
plt.title('Models using Smoted Upsampling Behavioral Data',fontsize=15)
ax = sns.barplot(x="Model", y="value", hue = "variable",data=smoted_sort_model_df)
ax.legend(bbox_to_anchor=(1.0, 1.05));

In [62]:

smoted_sort_model_df

Out[62]:

	Model	variable	value
0	RandomForestClassifier	F1 Score	0.878752
1	GradientBoostingClassifier	F1 Score	0.874814
2	LogisticRegression	F1 Score	0.872867
3	RandomForestClassifier	Recall Score	0.902930
4	GradientBoostingClassifier	Recall Score	0.874420
5	LogisticRegression	Recall Score	0.868476
6	RandomForestClassifier	Accuracy Score	0.873400
7	GradientBoostingClassifier	Accuracy Score	0.864645
8	LogisticRegression	Accuracy Score	0.861800
9	RandomForestClassifier	AUC Score	0.893174
10	GradientBoostingClassifier	AUC Score	0.900793
11	LogisticRegression	AUC Score	0.897184

In [63]:

# best SMOTE model by AUC: Gradient Boosting (index 1 in the pipeline list)
y_pred_smoted=smote_pipelines[1].predict(X_up_test)

In [64]:

cm_smoted = confusion_matrix(y_up_test,y_pred_smoted )

sns.set(font_scale=1.0)
labels = ['True Neg','False Pos','False Neg','True Pos']
categories = ['Fake Review', 'Actual Review']
make_confusion_matrix(cm_smoted, 
                      group_names=labels,
                      categories=categories, 
                      cmap='Blues',figsize=(8,5))

plt.title('Confusion Matrix',fontsize=15)
plt.tight_layout();

In [65]:

adasyn_pipelines = make_pipeline()
# Train all the models
for pipe in adasyn_pipelines:
    pipe.fit(X_adasyn_train, y_adasyn_train)

In [66]:

# apply sort_models function from above
adasyn_sorted_model = sort_models(adasyn_pipelines, X_up_test, y_up_test)
adasyn_sort_model_df = pd.DataFrame(adasyn_sorted_model)
adasyn_sort_model_df.columns = ['Model','F1 Score','Recall Score','Accuracy Score','AUC Score']
adasyn_sort_model_df = pd.melt(adasyn_sort_model_df,id_vars=['Model'])

In [67]:

plt.figure(figsize=(9, 4))
ax = sns.barplot(x="Model", y="value", hue = "variable",data=adasyn_sort_model_df)
ax.legend(bbox_to_anchor=(1.0, 1.05))
plt.title('Models using Adasyn Upsampling Behavioral Data',fontsize=15);

In [68]:

adasyn_sort_model_df

Out[68]:

	Model	variable	value
0	RandomForestClassifier	F1 Score	0.876395
1	GradientBoostingClassifier	F1 Score	0.869196
2	LogisticRegression	F1 Score	0.866582
3	RandomForestClassifier	Recall Score	0.899052
4	GradientBoostingClassifier	Recall Score	0.863496
5	LogisticRegression	Recall Score	0.856527
6	RandomForestClassifier	Accuracy Score	0.870470
7	GradientBoostingClassifier	Accuracy Score	0.857400
8	LogisticRegression	Accuracy Score	0.853740
9	RandomForestClassifier	AUC Score	0.890652
10	GradientBoostingClassifier	AUC Score	0.897070
11	LogisticRegression	AUC Score	0.896294

In [69]:

from sklearn.metrics import roc_curve
plt.figure(figsize=(10,8))

y_pred_smoted=smote_pipelines[0].predict(X_up_test)

#GradientBoosting Upsampling
fpr_gb1, tpr_gb1, thresholds = roc_curve(y_up_test, smote_pipelines[1].predict_proba(X_up_test)[:,1])
fpr_gb2, tpr_gb2, thresholds = roc_curve(y_up_test, adasyn_pipelines[1].predict_proba(X_up_test)[:,1])
roc_auc_GB_smote=roc_auc_score(y_up_test, smote_pipelines[1].predict_proba(X_up_test)[:,1])
roc_auc_GB_adasyn=roc_auc_score(y_up_test, adasyn_pipelines[1].predict_proba(X_up_test)[:,1])

#GradientBoosting with SMOTE and ADASYN
plt.plot(fpr_gb1, tpr_gb1,lw=2, label='Gradient Boosting SMOTE User Activities (ROC area = %0.2f)' % roc_auc_GB_smote)
plt.plot(fpr_gb2, tpr_gb2,lw=2, label='Gradient Boosting ADASYN User Activities (ROC area = %0.2f)' % roc_auc_GB_adasyn)



plt.plot([0,1],[0,1],c='violet',ls='--')
plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])

plt.legend(loc="lower right",ncol=1,fontsize = 'small')
plt.xlabel('False positive rate',fontsize=10)
plt.ylabel('True positive rate',fontsize=10)
plt.title('ROC curve for Fake Review Classification',fontsize=15);

Best Model

To see which features drive the predictions, we fit a Gradient Boosting Regressor on the same features and label. The dataset is split into training and test sets, and the model is trained on the training data. The code then reads the model's feature importances, which represent the relative contribution of each input feature in the trained model, and plots them as a sorted bar chart. Because this is a regressor fit to the -1/1 label rather than the Gradient Boosting classifier evaluated above, the importances are indicative of that classifier, not an exact readout of it. The ranking can guide feature selection or further investigation.
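Feature importances can also be read from a Gradient Boosting classifier directly. The sketch below does so on synthetic data in which only one column carries signal; the column names signal and noise and the small n_estimators value are assumptions made for the illustration, and this is not the cell the notebook ran.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: only `signal` is related to the label; `noise` is random.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = np.where(X["signal"] > 0, 1, -1)   # labels coded -1 / 1 as in the notebook

clf = GradientBoostingClassifier(n_estimators=20, random_state=50)
clf.fit(X, y)

# Impurity-based importances, one value per input column, sorted descending.
importance = (
    pd.Series(clf.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importance)
```

The importances are normalised to sum to 1, so each value reads as a share of the total split gain attributed to that feature.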

In [70]:

from sklearn.ensemble import GradientBoostingRegressor


# X_up and y_up are the feature matrix and label defined above
X_train, X_test, y_train, y_test = train_test_split(X_up, y_up, random_state=3)

# Fit the gradient boosting regression model on the training data
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)

Out[70]:

GradientBoostingRegressor()

In [71]:

# Retrieve the feature importances and sort them in descending order
feature_importance = pd.DataFrame(list(zip(X_train.columns, gbr.feature_importances_)),
                                  columns=['feature', 'importance'])
feature_importance = feature_importance.sort_values(by='importance', ascending=False)

# Visualize the sorted feature importances in a bar plot
plt.figure(figsize=(10, 5))
plt.barh(np.arange(len(feature_importance)), feature_importance['importance'])
plt.yticks(np.arange(len(feature_importance)), feature_importance['feature'], fontsize=13)
plt.title('Feature Importance', fontsize=20)
plt.show()

Business Applications:

The predictive model for business attributes based on review and tip textual information has several valuable business applications. These include sentiment analysis to understand customer sentiment and make informed decisions, fraud detection to identify and filter out fake reviews, reputation management to maintain credibility and trustworthiness, and content moderation to ensure high-quality and reliable information for users on review platforms. These applications collectively contribute to improved decision-making, customer satisfaction, and the overall integrity of online review systems.

Here is a link to the complete GitHub repository: https://github.com/EthanFalcao/Yelp-Dataset-Challenge-Analysis-and-Prediction

Yelp Dataset Challenge: Detecting Fake Reviews