Yelp Dataset Challenge: Predicting Business Attributes

Objective

To predict business attributes from the text of reviews and tips.

Data Sources:

For our analysis, we sourced our data from the Yelp dataset, which can be found at https://www.yelp.com/dataset.

Specifically, we selected three datasets for analysis:

'yelp_academic_dataset_business.json'

'yelp_academic_dataset_review.json'

'yelp_academic_dataset_tip.json'

These datasets were downloaded from https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset.

I. Data Preprocessing

In this study, we selected the most prevalent binary attributes, since these are the most pertinent to customers and are compatible with logistic regression. Based on the data presented in Figure 1, the most common binary attributes were BusinessAcceptsCreditCards, BikeParking, and RestaurantsTakeOut. The review text was vectorized into numerical features, and logistic regression was then used to classify each attribute.
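
As a minimal sketch of the vectorize-then-classify approach described above, the snippet below fits TF-IDF features and a logistic regression for a single binary attribute. The reviews, the labels, and the names `toy_reviews`, `takes_cards`, and `sketch_model` are invented for illustration and are not taken from the Yelp files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented reviews and labels (1 = attribute present); illustrative only
toy_reviews = [
    "paid with my credit card, quick checkout",
    "cash only, no cards accepted here",
    "they take visa and mastercard",
    "bring cash because they do not take cards",
]
takes_cards = [1, 0, 1, 0]

# TF-IDF turns each review into a sparse numeric vector;
# logistic regression then learns one weight per term
sketch_model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
sketch_model.fit(toy_reviews, takes_cards)

# Class prediction and class probabilities for an unseen review
toy_pred = sketch_model.predict(["cash only please"])
toy_proba = sketch_model.predict_proba(["cash only please"])
```

With only four training texts the prediction itself is not meaningful; the point is the shape of the workflow, which is the same one the notebook applies to 100,000 reviews.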

In [1]:

import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack

# IMPORTING ALL THE NECESSARY LIBRARIES AND PACKAGES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
import string
import math
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:

chunksize = 10000
business_df = pd.DataFrame()

# Iterate over the file in chunks and concatenate the results
for chunk in pd.read_json('/Users/Ethan Vaz Falcao/Downloads/Yelp_data/Github/Yelp Datasets/yelp_academic_dataset_business.json', lines=True, chunksize=chunksize):
    business_df = pd.concat([business_df, chunk], ignore_index=True)

review_df = pd.DataFrame()
# Iterate over the file in chunks and concatenate the results
for chunk in pd.read_json('/Users/Ethan Vaz Falcao/Downloads/Yelp_data/Github/Yelp Datasets/yelp_academic_dataset_review.json', lines=True, chunksize=chunksize):
    review_df = pd.concat([review_df, chunk], ignore_index=True)

tip_df = pd.DataFrame()

# Iterate over the file in chunks and concatenate the results
for chunk in pd.read_json('/Users/Ethan Vaz Falcao/Downloads/Yelp_data/Github/Yelp Datasets/yelp_academic_dataset_tip.json', lines=True, chunksize=chunksize):
    tip_df = pd.concat([tip_df, chunk], ignore_index=True)
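
Concatenating inside the loop copies the growing DataFrame on every iteration. One possible alternative, sketched here on a small temporary JSON-lines file rather than the real Yelp files, is to collect the chunks in a list and concatenate once. The records and the names `sample_path`, `chunks`, and `sample_df` are assumptions for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Write a tiny JSON-lines file standing in for a Yelp dataset file (invented records)
records = [{"business_id": f"b{i}", "stars": i % 5 + 1} for i in range(7)]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for rec in records:
        tmp.write(json.dumps(rec) + "\n")
    sample_path = tmp.name

# Read the file in chunks of 3 rows and keep every chunk in a list
with pd.read_json(sample_path, lines=True, chunksize=3) as reader:
    chunks = list(reader)

# Concatenate once at the end instead of once per chunk
sample_df = pd.concat(chunks, ignore_index=True)

os.remove(sample_path)
```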

In [3]:

# Extract relevant columns
business_attributes = business_df[['business_id', 'attributes']]
review_data = review_df[['user_id', 'text','date','business_id' ]]
tip_data = tip_df[['business_id', 'text', 'user_id','date']]
tip_data = tip_data.rename(columns={'text': 'tip'})

In [4]:

yelp_data = pd.merge(business_attributes, review_data, on='business_id')
yelp_data = pd.merge(yelp_data, tip_data, on=['business_id', 'user_id'])
yelp_data= yelp_data.dropna()

In [5]:

yelp_data = yelp_data.rename(columns={'date_x': 'date'})
yelp_data= yelp_data.drop('date_y', axis=1)
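
The `date_x` and `date_y` columns handled above come from the merge: both inputs carry a `date` column that is not a join key, so pandas suffixes them. The following sketch reproduces this on invented data; the frames `reviews` and `tips` and their contents are assumptions for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "user_id": ["u1", "u2", "u1"],
    "text": ["great tacos", "slow service", "nice patio"],
    "date": ["2015-01-01", "2016-02-02", "2017-03-03"],
})
tips = pd.DataFrame({
    "business_id": ["b1", "b2", "b2"],
    "user_id": ["u1", "u1", "u3"],
    "tip": ["try the salsa", "sit outside", "closed mondays"],
    "date": ["2015-01-05", "2017-03-04", "2018-04-04"],
})

# Inner join: only (business, user) pairs with both a review and a tip survive
merged = pd.merge(reviews, tips, on=["business_id", "user_id"])

# 'date' exists on both sides, so pandas appends _x (left) and _y (right);
# keep the review date and drop the tip date
merged = merged.rename(columns={"date_x": "date"}).drop(columns="date_y")
```

Because the join is on the business and user pair, a user who left several tips for one business gets the same review repeated once per tip, which is what rows 1 and 2 of the `head()` output show.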

In [6]:

yelp_data.head()

Out[6]:

	business_id	attributes	user_id	text	date	tip
0	mpf3x-BjTdTEA3yCZrAYPw	{'BusinessAcceptsCreditCards': 'True'}	trf3Qcz8qvCDKXiTgjUcEg	Bottom Line: \nClean store, Quick Service, Go...	2011-08-01 03:45:56	Dropping off my Amazon return.
1	tUFrWirKiKi_TAnsVWINQQ	{'BikeParking': 'True', 'BusinessAcceptsCredit...	oAvO0BOHOagOI7WVGXlWSA	This is a neat store! It is also organized! Th...	2012-12-12 17:28:26	This place looks the same as other target at c...
2	tUFrWirKiKi_TAnsVWINQQ	{'BikeParking': 'True', 'BusinessAcceptsCredit...	oAvO0BOHOagOI7WVGXlWSA	This is a neat store! It is also organized! Th...	2012-12-12 17:28:26	Its easy to shop here because its well stocked!
3	MTSW4McQd7CbVtyjqoe9mw	{'RestaurantsDelivery': 'False', 'OutdoorSeati...	OyjJWNmlky-Ase9ov1Pq5Q	bun is sucked here and the waitress was really...	2018-12-07 02:22:36	bun is sucked here and the waitress was really...
4	MTSW4McQd7CbVtyjqoe9mw	{'RestaurantsDelivery': 'False', 'OutdoorSeati...	WqeE5e5ROfaVEgkb9dAkiQ	This is my favorite bakery in Chinatown! It's ...	2017-09-13 00:38:08	Love their pastries and drinks!

In [7]:

yelp_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 434063 entries, 0 to 440639
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   business_id  434063 non-null  object        
 1   attributes   434063 non-null  object        
 2   user_id      434063 non-null  object        
 3   text         434063 non-null  object        
 4   date         434063 non-null  datetime64[ns]
 5   tip          434063 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 23.2+ MB

II. Feature Engineering

In the feature engineering process, we randomly selected 100,000 rows from the merged Yelp data. We counted how often each key appears in the attributes column; the three most frequent were BusinessAcceptsCreditCards, BusinessParking, and RestaurantsPriceRange2. Because BusinessParking and RestaurantsPriceRange2 are not binary, we kept the three most common binary attributes instead: BusinessAcceptsCreditCards, BikeParking, and RestaurantsTakeOut. We expanded the attributes column into separate columns, dropped the original column along with the attributes we did not model, handled missing values, and combined the three retained columns into a new 'attributes' column.

Additionally, we extracted the year from the 'date' column and visualized the distribution of reviews over the years. We computed additional features: review length, average review length per user, average word length, number of sentences, and the ratio of digits to letters in each review. Finally, we defined a text-cleaning function that removes punctuation, symbols, numbers, and stopwords from the review text. These feature engineering steps were conducted to enhance the dataset for further analysis and modeling.

In [8]:

# randomly select 100,000 rows
yelp_data = yelp_data.sample(n=100000, random_state=42)

# print the first 5 rows of the randomly selected dataframe
yelp_data.head()

Out[8]:

	business_id	attributes	user_id	text	date	tip
78427	f5VqTIMC-9Iw747SwhS0rg	{'RestaurantsPriceRange2': '2', 'WheelchairAcc...	lMY8NBPyzlPbbu-KBYfD9A	This isn't a furniture store IT'S A COMPLEX! \...	2015-12-01 23:38:31	Note taking?
19344	UGdufAsFg_vHMdo6MC8aBg	{'GoodForKids': 'False', 'ByAppointmentOnly': ...	xWmYN57XXZbg0LOK8WbbFQ	Local corner mall by Safeway tire store star b...	2017-11-21 00:57:18	Owners fixing bike I ride??
357698	b9sLRv_j1eijFIEso8-RQg	{'RestaurantsPriceRange2': '1', 'RestaurantsTa...	r9S0VYrdXJrdhfR7OXj8tA	My favorite pretzel bakery for the fact that I...	2017-03-27 20:29:15	They are selling dog bone pretzels at this loc...
20804	LUXRw-mr9emGL2gw4otvVA	{'GoodForKids': 'True', 'ByAppointmentOnly': '...	sGaoPWB_p1aOm5z9cnV3Aw	Absolutely incredible view whether you are goi...	2015-05-05 13:34:44	Bad traffic construction going from NJ to PA
367601	Rxf7eEfub8LC27P28ObgsA	{'Alcohol': ''full_bar'', 'RestaurantsDelivery...	gGAptXub1kr8YyxmApaN4w	Live in the area been visiting Tarpon Tavern o...	2016-04-03 01:59:20	To go food takes forever.....

In [9]:

import pandas as pd
# 'yelp_data' is the sampled DataFrame from the previous cell
# its 'attributes' column holds one dictionary of business attributes per row

# Create an empty dictionary to store the attribute frequencies
attribute_counts = {}

# Iterate over each business
for attributes in yelp_data['attributes']:
    if attributes is not None:  # Check if attributes is not None
        # Iterate over each attribute
        for attribute in attributes.keys():
            # Increment the count for the attribute in the dictionary
            attribute_counts[attribute] = attribute_counts.get(attribute, 0) + 1

# Sort the attribute frequencies in descending order
sorted_attributes = sorted(attribute_counts.items(), key=lambda x: x[1], reverse=True)

# Print the seven most popular attributes
print("Seven most popular attributes:")
for attribute, count in sorted_attributes[:7]:
    print(attribute, "-", count)

Seven most popular attributes:
BusinessAcceptsCreditCards - 94617
BusinessParking - 91439
RestaurantsPriceRange2 - 90198
BikeParking - 85745
RestaurantsTakeOut - 79415
WiFi - 78992
GoodForKids - 77166
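
The same frequency count can be written with `collections.Counter` from the standard library. This is a sketch on invented attribute dictionaries; `sample_attributes`, `attribute_counter`, and `top_two` are hypothetical names.

```python
from collections import Counter

# Invented attribute dictionaries in the same shape as the 'attributes' column
sample_attributes = [
    {"BusinessAcceptsCreditCards": "True", "BikeParking": "True"},
    {"BusinessAcceptsCreditCards": "True"},
    None,  # rows without attributes are skipped
    {"BikeParking": "False", "WiFi": "u'free'"},
]

# Count every key of every non-missing dictionary
attribute_counter = Counter(
    key for attrs in sample_attributes if attrs is not None for key in attrs
)

# most_common(n) returns the n most frequent keys with their counts
top_two = attribute_counter.most_common(2)
```

Note that this counts how often a key is present, not how often its value is 'True'.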

In [10]:

yelp_data = yelp_data.join(yelp_data['attributes'].apply(pd.Series))

# drop the original dictionary column
yelp_data = yelp_data.drop(['attributes','RestaurantsPriceRange2', 'CoatCheck',  'RestaurantsDelivery', 'Caters', 'WiFi', 'BusinessParking', 'WheelchairAccessible', 'HappyHour', 'OutdoorSeating', 'HasTV', 'RestaurantsReservations', 'DogsAllowed', 'ByAppointmentOnly', 'Alcohol', 'RestaurantsAttire', 'NoiseLevel', 'Ambience', 'GoodForKids', 'RestaurantsGoodForGroups', 'RestaurantsTableService', 'DriveThru', 'GoodForMeal', 'BusinessAcceptsBitcoin', 'Smoking', 'Music', 'GoodForDancing', 'BestNights', 'BYOB', 'Corkage', 'AcceptsInsurance', 'BYOBCorkage', 'HairSpecializesIn', 'Open24Hours', 'RestaurantsCounterService', 'AgesAllowed', 'DietaryRestrictions'], axis=1)

In [11]:

yelp_data = yelp_data.fillna(0).replace("True", 1)
yelp_data = yelp_data.replace(['False', 'None'], 0)
yelp_data= yelp_data.dropna()
yelp_data = yelp_data.reset_index(drop=True)

In [12]:

# drop any remaining missing values and reset the index
yelp_data= yelp_data.dropna()
yelp_data.fillna(value=0, inplace=True)
yelp_data = yelp_data.reset_index(drop=True)

In [13]:

# Define a lambda function to combine the values of the three columns
combine_cols = lambda row: [row['BusinessAcceptsCreditCards'], row['BikeParking'], row['RestaurantsTakeOut']]
# Apply the lambda function to each row of the DataFrame and store the result in a new column
yelp_data['combined_cols'] = yelp_data.apply(combine_cols, axis=1)

In [14]:

yelp_data = yelp_data.drop(['BusinessAcceptsCreditCards','BikeParking','RestaurantsTakeOut'], axis=1)
yelp_data = yelp_data.rename(columns={'combined_cols': 'attributes'})

In [15]:

# extract the "date" column
dates = yelp_data["date"]

# extract the year from each date
years = [date.year for date in dates]

# plot a histogram of the years
plt.hist(years, bins=range(min(years), max(years) + 1))
plt.xlabel("Year")
plt.ylabel("Frequency")
plt.title("Histogram of Years")
plt.show()

In [16]:

yelp_data.head(2)

Out[16]:

	business_id	user_id	text	date	tip	attributes
0	f5VqTIMC-9Iw747SwhS0rg	lMY8NBPyzlPbbu-KBYfD9A	This isn't a furniture store IT'S A COMPLEX! \...	2015-12-01 23:38:31	Note taking?	[1, 1, 1]
1	UGdufAsFg_vHMdo6MC8aBg	xWmYN57XXZbg0LOK8WbbFQ	Local corner mall by Safeway tire store star b...	2017-11-21 00:57:18	Owners fixing bike I ride??	[1, 0, 0]

In [17]:

# check the data description
# no missing values
yelp_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   business_id  100000 non-null  object        
 1   user_id      100000 non-null  object        
 2   text         100000 non-null  object        
 3   date         100000 non-null  datetime64[ns]
 4   tip          100000 non-null  object        
 5   attributes   100000 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 4.6+ MB

In [18]:

# Review length in characters (averaged per user in the next cell)
yelp_data['review_len']=yelp_data['text'].apply(lambda x: len(x))

In [19]:

d=yelp_data.groupby('user_id')['review_len'].mean().reset_index()
d.rename(columns={'review_len':'avg_review_len'},inplace=True)
yelp_data=pd.merge(yelp_data,d,on=['user_id'])
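
The group-then-merge step above can also be expressed with `groupby(...).transform`, which writes each user's mean straight back onto that user's rows. A small sketch on invented data; the frame `toy` is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "text": ["short", "a much longer review", "ok"],
})

# Review length in characters
toy["review_len"] = toy["text"].str.len()

# transform('mean') broadcasts each user's mean back onto that user's rows,
# so no separate merge is needed and the original row order is kept
toy["avg_review_len"] = toy.groupby("user_id")["review_len"].transform("mean")
```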

In [20]:

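def avg_word_length(x):
    """Return the mean word length of a text, rounded to 2 decimals (0 if it has no words)."""
    words = x.split()
    if not words:  # avoid dividing by zero on empty or whitespace-only text
        return 0
    return round(sum(len(word) for word in words) / len(words), 2)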

In [21]:

yelp_data['avg_word_length']=yelp_data['text'].apply(lambda x: avg_word_length(x))

In [22]:

import nltk
from nltk.tokenize import sent_tokenize
def num_sent(x):
    number_of_sentences = sent_tokenize(x)
    return len(number_of_sentences)

In [23]:

yelp_data['num_sent']=yelp_data['text'].apply(lambda x: num_sent(x))

In [24]:

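def percentage_of_numerals(x):
    """Return the ratio of digit characters to alphabetic characters (-1 if there are no letters)."""
    numbers = sum(c.isdigit() for c in x)
    letters = sum(c.isalpha() for c in x)

    try:
        result = numbers / letters
    except ZeroDivisionError:
        result = -1
    return result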

In [25]:

yelp_data['percentage_of_numerals']=yelp_data['text'].apply(lambda x: percentage_of_numerals(x))

In [26]:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# clean up the review text
from nltk.tokenize import word_tokenize
import string
import re
from nltk.corpus import stopwords

# Build the stopword set once instead of reloading the list for every word
STOPWORDS = set(stopwords.words('english'))
# Symbols to replace with a space (a single character class)
SYMBOLS_RE = re.compile(r'["\'#&?!:_/(){}\[\]|@,;.]')

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    # tokenizer breaks string into a list of words
    tokens = word_tokenize(text)
    text = " ".join(t for t in tokens if t not in string.punctuation)
    text = text.lower()  # lowercase text
    text = SYMBOLS_RE.sub(' ', text)  # replace each matched symbol with a space
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords from text
    return text
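
To make the cleaning steps concrete without requiring the NLTK corpora, here is a standard-library sketch of the same sequence (lowercase, strip punctuation and digits, drop stopwords). `TOY_STOPWORDS` is a tiny hypothetical list, not NLTK's English list, and `clean_text_sketch` is an illustrative name.

```python
import re
import string

# Tiny hypothetical stopword list; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {"the", "a", "is", "and", "it", "to"}

def clean_text_sketch(text):
    """Lowercase, replace punctuation and digits with spaces, drop stopwords."""
    text = text.lower()
    # one character class covering all ASCII punctuation plus the digits 0-9
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text)
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

cleaned = clean_text_sketch("The pizza is GREAT, and it costs $12!")
print(cleaned)  # pizza great costs
```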

In [27]:

yelp_data=yelp_data.reset_index()

In [28]:

yelp_data.head(2)

Out[28]:

	index	business_id	user_id	text	date	tip	attributes	review_len	avg_review_len	avg_word_length	num_sent	percentage_of_numerals
0	0	f5VqTIMC-9Iw747SwhS0rg	lMY8NBPyzlPbbu-KBYfD9A	This isn't a furniture store IT'S A COMPLEX! \...	2015-12-01 23:38:31	Note taking?	[1, 1, 1]	1222	776.22	5.20	9	0.001037
1	1	QV7QOLww8ym3E2zBgdE2Ow	lMY8NBPyzlPbbu-KBYfD9A	The current 5 star raving reviews are pushing ...	2017-04-04 05:21:19	OPER on deck	[1, 1, 1]	625	776.22	5.08	8	0.004098

In [29]:

# Step 2: Feature Extraction
vectorizer = TfidfVectorizer(stop_words='english')
features = vectorizer.fit_transform(yelp_data['text'])

In [30]:

# Create feature matrix
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_matrix = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())

In [31]:

# Step 3: Label Encoding
import ast

def parse_attributes(x):
    # Safely parse string representations such as "[1, 0, 0]"; leave lists untouched
    if isinstance(x, str):
        return ast.literal_eval(x)
    return x
yelp_data['attributes'] = yelp_data['attributes'].apply(parse_attributes)

In [32]:

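from sklearn.preprocessing import MultiLabelBinarizer
# Note: MultiLabelBinarizer reads each row as a set of labels, so a list such as
# [1, 0, 0] is treated as the label set {0, 1}; mlb.classes_ is then [0, 1] (two columns)
mlb = MultiLabelBinarizer()
label_matrix = pd.DataFrame(mlb.fit_transform(yelp_data['attributes']), columns=mlb.classes_)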
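
If one indicator column per attribute is wanted, the list column can be expanded directly, since each list already holds 0/1 flags in a fixed order. This is a sketch of that alternative on invented rows, not the code that produced the results in this notebook; `label_names`, `toy_labels`, and `toy_label_matrix` are hypothetical names.

```python
import pandas as pd

# Column order matches the order used when the three attributes were combined
label_names = ["BusinessAcceptsCreditCards", "BikeParking", "RestaurantsTakeOut"]

# Invented rows shaped like the notebook's 'attributes' column
toy_labels = pd.Series([[1, 1, 1], [1, 0, 0], [0, 0, 1]])

# Each list is already a 0/1 indicator vector, so expand it column-wise
toy_label_matrix = pd.DataFrame(toy_labels.tolist(), columns=label_names)
```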

In [33]:

# Step 4: Model Training
X_train, X_test, y_train, y_test = train_test_split(feature_matrix, label_matrix, test_size=0.3, random_state=42)

In [34]:

from sklearn.multiclass import OneVsRestClassifier
model = OneVsRestClassifier(LogisticRegression())
model.fit(X_train, y_train)

Out[34]:

OneVsRestClassifier(estimator=LogisticRegression())

In [35]:

# Step 5: Model Evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7012666666666667
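
For multi-label targets, scikit-learn's `accuracy_score` reports subset accuracy: a row counts as correct only if every label in it is predicted correctly. The NumPy sketch below contrasts that with per-label accuracy on invented predictions; `y_true` and `y_hat` are made-up arrays, not model output.

```python
import numpy as np

y_true = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]])
y_hat = np.array([[1, 1, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1]])

# Subset accuracy: a row counts only if every label matches
subset_accuracy = np.mean(np.all(y_true == y_hat, axis=1))

# Per-label accuracy: fraction of correct predictions in each column
per_label_accuracy = np.mean(y_true == y_hat, axis=0)

print(subset_accuracy)     # 0.5
print(per_label_accuracy)  # [1.  0.5 1. ]
```

Subset accuracy is the stricter of the two, so reporting per-label figures alongside it can show which attribute is hardest to predict.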

In [36]:

#randomly chooses a review from the df
import random
def random_review(df):
    """
    Returns a random text from the "text" column of the given DataFrame.
    """
    # Get the "text" column of the DataFrame passed in (not the global yelp_data)
    text_series = df["text"]
    
    # Choose a random index from the Series
    random_index = random.choice(text_series.index)
    
    # Get the text at the chosen index
    random_text = text_series[random_index]
    
    return random_text

In [37]:

# Obtain new set of attributes
new_attributes = set(['BusinessAcceptsCreditCards', 'BikeParking', 'RestaurantsTakeOut'])
new_mlb = MultiLabelBinarizer()
new_mlb.fit([new_attributes])

Out[37]:

MultiLabelBinarizer()

In [38]:

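# Attribute names in the order they were combined; indexing by position below
# assumes the model outputs one column per attribute
attribute_list = ['BusinessAcceptsCreditCards', 'BikeParking', 'RestaurantsTakeOut']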


new_text = random_review(yelp_data)
preprocessed_text = vectorizer.transform([new_text])
prediction = model.predict(preprocessed_text)
predicted_vector = prediction.reshape(1, -1)

predicted_attributes = [attribute_list[i] for i in range(len(predicted_vector[0])) if predicted_vector[0][i] == 1]

print("Review: \n\n", new_text)
print("\nPredicted attributes:", predicted_attributes)
print("Predicted attribute vector:", predicted_vector[0])

for i in range(len(predicted_vector[0])):
    print(attribute_list[i], ":", predicted_vector[0][i])

Review: 

 When a pizza craving hits, Pizza Hut will fo for carryout. Iike the thin crust and request it well done, so hopefully bottom will be crispy. Their pepperoni tastes great, but really dries my mouth out...haven't figured that one out yet. Not great pizza by any means, but will do in a pinch.
This store is very busy. The earlier you can order the better. 
Half the staff is usually sitting down or playing around.  Don't be afraid to go et your presence be known, rather than take home a cold pizza. I have never dined in the dtore, njust to go.

Predicted attributes: ['BikeParking']
Predicted attribute vector: [0 1]
BusinessAcceptsCreditCards : 0
BikeParking : 1

Training & Results

The overall accuracy reported for the whole dataset was 0.74; the one-vs-rest logistic regression run shown above scored about 0.70 on the 30% held-out test split. Multi-label classification was applied with review text and tip text as the features and three business attributes as the labels. The models we used were SVC, logistic regression, and random forest.

Business Applications

Predicting business attributes from review and tip text has two primary business applications: targeted marketing and competitor analysis. For the former, businesses can use the model's output to tailor their marketing to specific audiences, for example by building campaigns around the attributes their customers mention most, with the aim of raising customer satisfaction and revenue. For the latter, the model can give businesses insight into their competitors' performance and attributes, so they can adapt their strategies and offerings to stay ahead in the market.

Here is a link to the complete GitHub repository: https://github.com/EthanFalcao/Yelp-Dataset-Challenge-Analysis-and-Prediction
