Data Sources:
For our analysis, we sourced our data from the Yelp dataset, which can be found at https://www.yelp.com/dataset.
Specifically, we selected three datasets for analysis:
'yelp_academic_dataset_business.json'
'yelp_academic_dataset_review.json'
'yelp_academic_dataset_tip.json'
These datasets were downloaded from https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset.
I. Data Preprocessing
We selected the most prevalent binary attributes because they are relevant to customers and fit naturally with logistic regression. As Figure 1 shows, the most common binary attributes were BusinessAcceptsCreditCards, BikeParking, and RestaurantsTakeOut. We vectorized the review text to turn it into numerical features and then used logistic regression to classify each attribute.
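The vectorize-then-classify idea can be illustrated on a handful of made-up reviews. This is only a minimal sketch: the toy reviews, the single `takes_out` label, and the variable names are assumptions for illustration and are not part of the Yelp data or the notebook's pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy reviews and one binary label (1 = the business offers takeout)
reviews = [
    "ordered takeout and picked it up in ten minutes",
    "great carryout pizza, quick pickup",
    "takeout order was ready on time",
    "lovely dine-in experience with table service",
    "we sat on the patio for a long dinner",
    "the waiter was attentive all evening",
]
takes_out = [1, 1, 1, 0, 0, 0]

# Turn the raw text into a sparse TF-IDF matrix (one row per review)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

# Fit a binary logistic regression on the TF-IDF features
clf = LogisticRegression()
clf.fit(X, takes_out)

# New text must go through the same fitted vectorizer before prediction
new_review = ["called ahead for takeout and pickup was quick"]
predicted = clf.predict(vectorizer.transform(new_review))
print(predicted)
```

The important detail is that the vectorizer is fitted once and then reused with `transform` for any new text, so the feature columns stay aligned with what the classifier saw during training.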
# Standard library
import json
import math
import string
import warnings

# Third-party libraries
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             precision_score, recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Jupyter magic: render plots inside the notebook
%matplotlib inline
# Silence library warnings in the notebook output
warnings.filterwarnings('ignore')
chunksize = 10000
data_dir = '/Users/Ethan Vaz Falcao/Downloads/Yelp_data/Github/Yelp Datasets'

def read_json_in_chunks(path):
    # Read the line-delimited JSON file in chunks and concatenate once at the end
    chunks = list(pd.read_json(path, lines=True, chunksize=chunksize))
    return pd.concat(chunks, ignore_index=True)

business_df = read_json_in_chunks(f'{data_dir}/yelp_academic_dataset_business.json')
review_df = read_json_in_chunks(f'{data_dir}/yelp_academic_dataset_review.json')
tip_df = read_json_in_chunks(f'{data_dir}/yelp_academic_dataset_tip.json')
# Extract relevant columns
business_attributes = business_df[['business_id', 'attributes']]
review_data = review_df[['user_id', 'text', 'date', 'business_id']]
tip_data = tip_df[['business_id', 'text', 'user_id', 'date']].rename(columns={'text': 'tip'})
# Attach attributes to each review, then pair each review with tips by the same user for the same business
yelp_data = pd.merge(business_attributes, review_data, on='business_id')
yelp_data = pd.merge(yelp_data, tip_data, on=['business_id', 'user_id'])
yelp_data = yelp_data.dropna()
# Both inputs have a 'date' column: keep the review date (date_x) and drop the tip date (date_y)
yelp_data = yelp_data.rename(columns={'date_x': 'date'}).drop('date_y', axis=1)
yelp_data.head()
| | business_id | attributes | user_id | text | date | tip |
---|---|---|---|---|---|---|
0 | mpf3x-BjTdTEA3yCZrAYPw | {'BusinessAcceptsCreditCards': 'True'} | trf3Qcz8qvCDKXiTgjUcEg | Bottom Line: \nClean store, Quick Service, Go... | 2011-08-01 03:45:56 | Dropping off my Amazon return. |
1 | tUFrWirKiKi_TAnsVWINQQ | {'BikeParking': 'True', 'BusinessAcceptsCredit... | oAvO0BOHOagOI7WVGXlWSA | This is a neat store! It is also organized! Th... | 2012-12-12 17:28:26 | This place looks the same as other target at c... |
2 | tUFrWirKiKi_TAnsVWINQQ | {'BikeParking': 'True', 'BusinessAcceptsCredit... | oAvO0BOHOagOI7WVGXlWSA | This is a neat store! It is also organized! Th... | 2012-12-12 17:28:26 | Its easy to shop here because its well stocked! |
3 | MTSW4McQd7CbVtyjqoe9mw | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | OyjJWNmlky-Ase9ov1Pq5Q | bun is sucked here and the waitress was really... | 2018-12-07 02:22:36 | bun is sucked here and the waitress was really... |
4 | MTSW4McQd7CbVtyjqoe9mw | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | WqeE5e5ROfaVEgkb9dAkiQ | This is my favorite bakery in Chinatown! It's ... | 2017-09-13 00:38:08 | Love their pastries and drinks! |
yelp_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 434063 entries, 0 to 440639
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   business_id  434063 non-null  object
 1   attributes   434063 non-null  object
 2   user_id      434063 non-null  object
 3   text         434063 non-null  object
 4   date         434063 non-null  datetime64[ns]
 5   tip          434063 non-null  object
dtypes: datetime64[ns](1), object(5)
memory usage: 23.2+ MB
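The second merge joins on both `business_id` and `user_id`, so a review is repeated once for every tip the same user left for the same business (rows 1 and 2 of the preview above show this). The sketch below reproduces that behavior on made-up frames; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical reviews and tips; only (b1, u1) appears in both frames
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "user_id": ["u1", "u2", "u1"],
    "text": ["review A", "review B", "review C"],
    "date": ["2020-01-01", "2020-02-01", "2020-03-01"],
})
tips = pd.DataFrame({
    "business_id": ["b1", "b1", "b3"],
    "user_id": ["u1", "u1", "u1"],
    "tip": ["tip 1", "tip 2", "tip 3"],
    "date": ["2020-01-05", "2020-01-09", "2020-04-01"],
})

# Inner join on both keys: unmatched reviews and tips are dropped,
# and the shared 'date' column is split into date_x (left) and date_y (right)
merged = pd.merge(reviews, tips, on=["business_id", "user_id"])
print(merged.columns.tolist())
print(len(merged))
```

Because the join is an inner join, reviews without a matching tip disappear from the result, which is worth keeping in mind when interpreting the size of the merged dataset.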
II. Feature Engineering
For feature engineering, we randomly sampled 100,000 rows from the merged Yelp data. We counted how often each key appears in the attributes column; the three most frequent attributes overall were BusinessAcceptsCreditCards, BusinessParking, and RestaurantsPriceRange2. Because BusinessParking and RestaurantsPriceRange2 are not binary, we kept the three most frequent binary attributes (BusinessAcceptsCreditCards, BikeParking, and RestaurantsTakeOut) as labels. We expanded the attributes column into separate columns, dropped the original column along with the attributes we did not use, filled missing values with 0, and combined the three label columns into a new 'attributes' column.
We also extracted the year from the 'date' column and plotted the distribution of reviews over the years. We computed additional features such as average review length, standard deviation of reviewer ratings, average word length, number of sentences, and the proportion of numerals in the reviews. Finally, we defined a text-cleaning step that removes punctuation, symbols, numbers, and stopwords from the review text.
# randomly select 100,000 rows
yelp_data = yelp_data.sample(n=100000, random_state=42)
# print the first 5 rows of the randomly selected dataframe
yelp_data.head()
| | business_id | attributes | user_id | text | date | tip |
---|---|---|---|---|---|---|
78427 | f5VqTIMC-9Iw747SwhS0rg | {'RestaurantsPriceRange2': '2', 'WheelchairAcc... | lMY8NBPyzlPbbu-KBYfD9A | This isn't a furniture store IT'S A COMPLEX! \... | 2015-12-01 23:38:31 | Note taking? |
19344 | UGdufAsFg_vHMdo6MC8aBg | {'GoodForKids': 'False', 'ByAppointmentOnly': ... | xWmYN57XXZbg0LOK8WbbFQ | Local corner mall by Safeway tire store star b... | 2017-11-21 00:57:18 | Owners fixing bike I ride?? |
357698 | b9sLRv_j1eijFIEso8-RQg | {'RestaurantsPriceRange2': '1', 'RestaurantsTa... | r9S0VYrdXJrdhfR7OXj8tA | My favorite pretzel bakery for the fact that I... | 2017-03-27 20:29:15 | They are selling dog bone pretzels at this loc... |
20804 | LUXRw-mr9emGL2gw4otvVA | {'GoodForKids': 'True', 'ByAppointmentOnly': '... | sGaoPWB_p1aOm5z9cnV3Aw | Absolutely incredible view whether you are goi... | 2015-05-05 13:34:44 | Bad traffic construction going from NJ to PA |
367601 | Rxf7eEfub8LC27P28ObgsA | {'Alcohol': ''full_bar'', 'RestaurantsDelivery... | gGAptXub1kr8YyxmApaN4w | Live in the area been visiting Tarpon Tavern o... | 2016-04-03 01:59:20 | To go food takes forever..... |
# Count how often each attribute key appears in the sampled rows of 'yelp_data'
attribute_counts = {}
for attributes in yelp_data['attributes']:
    if attributes is not None:  # skip rows without attributes
        for attribute in attributes.keys():
            # Increment the count for the attribute in the dictionary
            attribute_counts[attribute] = attribute_counts.get(attribute, 0) + 1
# Sort the attribute frequencies in descending order
sorted_attributes = sorted(attribute_counts.items(), key=lambda x: x[1], reverse=True)
# Print the seven most popular attributes
print("Seven most popular attributes:")
for attribute, count in sorted_attributes[:7]:
    print(attribute, "-", count)
Seven most popular attributes:
BusinessAcceptsCreditCards - 94617
BusinessParking - 91439
RestaurantsPriceRange2 - 90198
BikeParking - 85745
RestaurantsTakeOut - 79415
WiFi - 78992
GoodForKids - 77166
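The same frequency count can be written more compactly with `collections.Counter` from the standard library. The rows below are made up to mimic the shape of the `attributes` column (a dict or `None`); they are an assumption for illustration, not real Yelp records.

```python
from collections import Counter

# Hypothetical rows shaped like the 'attributes' column (dict or None)
rows = [
    {"BusinessAcceptsCreditCards": "True", "BikeParking": "True"},
    {"BusinessAcceptsCreditCards": "True", "RestaurantsTakeOut": "False"},
    None,
    {"BusinessAcceptsCreditCards": "False"},
]

# Count every key of every dict, skipping rows that have no attributes
attribute_counts = Counter(
    key for attrs in rows if isinstance(attrs, dict) for key in attrs
)

# most_common(n) returns the n most frequent (key, count) pairs
top = attribute_counts.most_common(3)
print(top)
```

Note that this counts how often a key is present, regardless of whether its value is 'True' or 'False', which matches what the loop above does.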
yelp_data = yelp_data.join(yelp_data['attributes'].apply(pd.Series))
# drop the original dictionary column
yelp_data = yelp_data.drop(['attributes','RestaurantsPriceRange2', 'CoatCheck', 'RestaurantsDelivery', 'Caters', 'WiFi', 'BusinessParking', 'WheelchairAccessible', 'HappyHour', 'OutdoorSeating', 'HasTV', 'RestaurantsReservations', 'DogsAllowed', 'ByAppointmentOnly', 'Alcohol', 'RestaurantsAttire', 'NoiseLevel', 'Ambience', 'GoodForKids', 'RestaurantsGoodForGroups', 'RestaurantsTableService', 'DriveThru', 'GoodForMeal', 'BusinessAcceptsBitcoin', 'Smoking', 'Music', 'GoodForDancing', 'BestNights', 'BYOB', 'Corkage', 'AcceptsInsurance', 'BYOBCorkage', 'HairSpecializesIn', 'Open24Hours', 'RestaurantsCounterService', 'AgesAllowed', 'DietaryRestrictions'], axis=1)
# Encode the remaining flags as integers: 'True' -> 1; 'False', 'None', and missing values -> 0
yelp_data = yelp_data.fillna(0).replace("True", 1)
yelp_data = yelp_data.replace(['False', 'None'], 0)
# No missing values remain after fillna, so only the index needs resetting
yelp_data = yelp_data.reset_index(drop=True)
# Define a lambda function to combine the values of the three columns
combine_cols = lambda row: [row['BusinessAcceptsCreditCards'], row['BikeParking'], row['RestaurantsTakeOut']]
# Apply the lambda function to each row of the DataFrame and store the result in a new column
yelp_data['combined_cols'] = yelp_data.apply(combine_cols, axis=1)
yelp_data = yelp_data.drop(['BusinessAcceptsCreditCards','BikeParking','RestaurantsTakeOut'], axis=1)
yelp_data = yelp_data.rename(columns={'combined_cols': 'attributes'})
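To make the expand-and-encode step easier to follow, here is a minimal sketch on three made-up rows. It uses `pd.json_normalize` and `reindex` instead of the notebook's `apply(pd.Series)` followed by a long `drop` list; that substitution, the toy data, and the `label_cols` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; the flags are stored as the strings 'True', 'False', or 'None'
df = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "attributes": [
        {"BusinessAcceptsCreditCards": "True", "BikeParking": "False"},
        {"BusinessAcceptsCreditCards": "None", "RestaurantsTakeOut": "True"},
        {"BikeParking": "True"},
    ],
})

label_cols = ["BusinessAcceptsCreditCards", "BikeParking", "RestaurantsTakeOut"]

# One column per key; keys missing from a row become NaN
expanded = pd.json_normalize(df["attributes"].tolist())

# Keep only the label columns and map 'True' to 1, everything else to 0
flags = (expanded.reindex(columns=label_cols) == "True").astype(int)

df = pd.concat([df[["business_id"]], flags], axis=1)
print(df)
```

Selecting the wanted columns with `reindex` avoids having to list every unwanted attribute by name, at the cost of silently ignoring attributes that are not in `label_cols`.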
# extract the "date" column
dates = yelp_data["date"]
# extract the year from each date
years = [date.year for date in dates]
# plot a histogram with one bin per year; the extra bin edge (+ 2) keeps the
# final year in its own bin instead of merging it with the previous year
plt.hist(years, bins=range(min(years), max(years) + 2))
plt.xlabel("Year")
plt.ylabel("Frequency")
plt.title("Histogram of Years")
plt.show()
yelp_data.head(2)
| | business_id | user_id | text | date | tip | attributes |
---|---|---|---|---|---|---|
0 | f5VqTIMC-9Iw747SwhS0rg | lMY8NBPyzlPbbu-KBYfD9A | This isn't a furniture store IT'S A COMPLEX! \... | 2015-12-01 23:38:31 | Note taking? | [1, 1, 1] |
1 | UGdufAsFg_vHMdo6MC8aBg | xWmYN57XXZbg0LOK8WbbFQ | Local corner mall by Safeway tire store star b... | 2017-11-21 00:57:18 | Owners fixing bike I ride?? | [1, 0, 0] |
# check the data description
# no missing values
yelp_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   business_id  100000 non-null  object
 1   user_id      100000 non-null  object
 2   text         100000 non-null  object
 3   date         100000 non-null  datetime64[ns]
 4   tip          100000 non-null  object
 5   attributes   100000 non-null  object
dtypes: datetime64[ns](1), object(5)
memory usage: 4.6+ MB
# Average review length per user
# Length of each review in characters
yelp_data['review_len'] = yelp_data['text'].apply(len)
# Mean review length for each user, merged back onto every row of that user
d = yelp_data.groupby('user_id')['review_len'].mean().reset_index()
d.rename(columns={'review_len': 'avg_review_len'}, inplace=True)
yelp_data = pd.merge(yelp_data, d, on=['user_id'])
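The group-then-merge pattern for the per-user average can also be expressed with `groupby(...).transform`, which returns a value for every original row and therefore needs no merge. The sketch below uses made-up data; it is an alternative formulation, not the notebook's code.

```python
import pandas as pd

# Hypothetical reviews from two users
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "text": ["short", "a longer review", "ok"],
})

# Review length in characters
df["review_len"] = df["text"].str.len()

# Per-user mean, broadcast back to each of that user's rows
df["avg_review_len"] = df.groupby("user_id")["review_len"].transform("mean")
print(df)
```

Both reviews by `u1` receive the same average, which is why `avg_review_len` repeats across rows of the same user in the preview further down.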
def avg_word_length(x):
    # Mean number of characters per whitespace-separated word, rounded to 2 decimals
    words = x.split()
    if not words:  # empty or whitespace-only review
        return 0.0
    return round(sum(len(word) for word in words) / len(words), 2)
yelp_data['avg_word_length'] = yelp_data['text'].apply(avg_word_length)
from nltk.tokenize import sent_tokenize
def num_sent(x):
    # Number of sentences found by NLTK's sentence tokenizer
    return len(sent_tokenize(x))
yelp_data['num_sent'] = yelp_data['text'].apply(num_sent)
def percentage_of_numerals(x):
    # Ratio of digit characters to alphabetic characters (a ratio, not a 0-100 percentage);
    # returns -1 when the text contains no letters
    numbers = sum(c.isdigit() for c in x)
    letter = sum(c.isalpha() for c in x)
    try:
        result = numbers / letter
    except ZeroDivisionError:
        result = -1
    return result
yelp_data['percentage_of_numerals'] = yelp_data['text'].apply(percentage_of_numerals)
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# clean up the review text
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Build the stopword set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))
# Symbols to replace with a space (a single character class)
SYMBOLS_RE = re.compile(r'["\'#&?!:_/(){}\[\]|@,;.]')
def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    # tokenizer breaks string into a list of words
    tokens = word_tokenize(text)
    # drop tokens that consist of punctuation only
    text = " ".join([c for c in tokens if c not in string.punctuation])
    text = text.lower()  # lowercase text
    text = SYMBOLS_RE.sub(' ', text)  # replace remaining symbols with a space
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)  # remove stopwords
    return text
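To show what the cleaning steps do to a concrete string without needing the NLTK corpora, here is a standard-library-only sketch. The function name `clean_text_simple` and the tiny `STOPWORDS` set are hypothetical stand-ins; NLTK's English stopword list is much longer, so real output will differ.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set for illustration only
STOPWORDS = {"the", "a", "is", "and", "it", "was", "to", "in"}

def clean_text_simple(text):
    """Lowercase, strip punctuation and digits, and drop stopwords."""
    text = text.lower()
    # Replace each punctuation character with a space so words do not fuse together
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Remove runs of digits
    text = re.sub(r"\d+", "", text)
    # Split on whitespace and keep only non-stopwords
    return " ".join(word for word in text.split() if word not in STOPWORDS)

cleaned = clean_text_simple("The pizza was GREAT, and it cost $12!")
print(cleaned)  # pizza great cost
```

Lowercasing before the stopword filter matters: "The" would otherwise survive because the stopword set contains only lowercase entries.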
yelp_data=yelp_data.reset_index()
yelp_data.head(2)
| | index | business_id | user_id | text | date | tip | attributes | review_len | avg_review_len | avg_word_length | num_sent | percentage_of_numerals |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | f5VqTIMC-9Iw747SwhS0rg | lMY8NBPyzlPbbu-KBYfD9A | This isn't a furniture store IT'S A COMPLEX! \... | 2015-12-01 23:38:31 | Note taking? | [1, 1, 1] | 1222 | 776.22 | 5.20 | 9 | 0.001037 |
1 | 1 | QV7QOLww8ym3E2zBgdE2Ow | lMY8NBPyzlPbbu-KBYfD9A | The current 5 star raving reviews are pushing ... | 2017-04-04 05:21:19 | OPER on deck | [1, 1, 1] | 625 | 776.22 | 5.08 | 8 | 0.004098 |
# Step 2: Feature Extraction
vectorizer = TfidfVectorizer(stop_words='english')
features = vectorizer.fit_transform(yelp_data['text'])
# Create feature matrix
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_matrix = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())
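# Step 3: Label Encoding
import ast
def parse_attributes(x):
    # Parse stringified lists safely; ast.literal_eval only accepts Python literals
    if isinstance(x, str):
        return ast.literal_eval(x)
    return x
yelp_data['attributes'] = yelp_data['attributes'].apply(parse_attributes)
from sklearn.preprocessing import MultiLabelBinarizer
# Note: MultiLabelBinarizer treats each list as a set of labels, so the 0/1 flags
# yield the classes 0 and 1 (see mlb.classes_) rather than one column per attribute
mlb = MultiLabelBinarizer()
label_matrix = pd.DataFrame(mlb.fit_transform(yelp_data['attributes']), columns=mlb.classes_)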
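The prediction output further down has two entries rather than three. One likely reason is how `MultiLabelBinarizer` handles lists of 0/1 flags: it treats each list as a set of labels, so the learned classes are the values 0 and 1, not the three attribute names. The sketch below demonstrates this on made-up rows and shows one possible alternative that keeps a column per attribute; the alternative is an assumption, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

attribute_list = ["BusinessAcceptsCreditCards", "BikeParking", "RestaurantsTakeOut"]

# Hypothetical rows in the same [credit cards, bike parking, takeout] layout
flags = pd.Series([[1, 1, 1], [1, 0, 0], [0, 0, 0]])

# MultiLabelBinarizer sees the sets {1}, {0, 1}, {0}: two classes in total
mlb = MultiLabelBinarizer()
collapsed = mlb.fit_transform(flags)
print(mlb.classes_)     # distinct values found in the lists, not attribute names
print(collapsed.shape)  # (3, 2)

# Stacking the flag lists directly keeps one column per attribute
label_matrix = pd.DataFrame(np.array(flags.tolist()), columns=attribute_list)
print(label_matrix.shape)  # (3, 3)
```

With the stacked matrix, each column can be read as "does this business have attribute X", which is the interpretation the later printing code assumes.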
# Step 4: Model Training
X_train, X_test, y_train, y_test = train_test_split(feature_matrix, label_matrix, test_size=0.3, random_state=42)
from sklearn.multiclass import OneVsRestClassifier
model = OneVsRestClassifier(LogisticRegression())
model.fit(X_train, y_train)
OneVsRestClassifier(estimator=LogisticRegression())
# Step 5: Model Evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.7012666666666667
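For multi-label targets, scikit-learn's `accuracy_score` computes subset accuracy: a row counts as correct only if every label in it is predicted correctly. The sketch below contrasts that with per-label accuracy and Hamming loss on made-up predictions, which can help when interpreting a single accuracy figure; the arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical true and predicted label matrices (4 samples, 3 labels)
y_true = np.array([[1, 1, 1], [1, 0, 0], [0, 1, 1], [1, 0, 1]])
y_pred = np.array([[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1]])

# Fraction of rows where all labels match
subset_acc = accuracy_score(y_true, y_pred)

# Accuracy computed separately for each label column
per_label_acc = (y_true == y_pred).mean(axis=0)

# Fraction of individual label entries that are wrong
hamming = hamming_loss(y_true, y_pred)

print(subset_acc, per_label_acc, hamming)
```

Here two of the four rows contain one wrong label each, so subset accuracy is 0.5 even though only 2 of the 12 individual labels are wrong.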
# randomly choose a review from the DataFrame
import random
def random_review(df):
    """
    Returns a random text from the "text" column of the given DataFrame.
    """
    # Get the "text" column of the DataFrame that was passed in
    text_series = df["text"]
    # Choose a random index from the Series
    random_index = random.choice(text_series.index)
    # Return the text at the chosen index
    return text_series[random_index]
# Obtain new set of attributes
new_attributes = set(['BusinessAcceptsCreditCards', 'BikeParking', 'RestaurantsTakeOut'])
new_mlb = MultiLabelBinarizer()
new_mlb.fit([new_attributes])
MultiLabelBinarizer()
attribute_list = ['BusinessAcceptsCreditCards', 'BikeParking', 'RestaurantsTakeOut']
new_text = random_review(yelp_data)
preprocessed_text = vectorizer.transform([new_text])
prediction = model.predict(preprocessed_text)
predicted_vector = prediction.reshape(1, -1)
predicted_attributes = [attribute_list[i] for i in range(len(predicted_vector[0])) if predicted_vector[0][i] == 1]
print("Review: \n\n", new_text)
print("\nPredicted attributes:", predicted_attributes)
print("Predicted attribute vector:", predicted_vector[0])
# The vector has one entry per column of label_matrix, so only that many names are printed
for i in range(len(predicted_vector[0])):
    print(attribute_list[i], ":", predicted_vector[0][i])
Review:

 When a pizza craving hits, Pizza Hut will fo for carryout. Iike the thin crust and request it well done, so hopefully bottom will be crispy. Their pepperoni tastes great, but really dries my mouth out...haven't figured that one out yet. Not great pizza by any means, but will do in a pinch. This store is very busy. The earlier you can order the better. Half the staff is usually sitting down or playing around. Don't be afraid to go et your presence be known, rather than take home a cold pizza. I have never dined in the dtore, njust to go.

Predicted attributes: ['BikeParking']
Predicted attribute vector: [0 1]
BusinessAcceptsCreditCards : 0
BikeParking : 1
Training & Results
The overall accuracy reported for the whole dataset was 0.74. We framed the task as multi-label classification, with review text and tip text as features and three business attributes as labels, and compared three models: SVC, logistic regression, and Random Forest. The logistic regression run shown above reached an accuracy of about 0.70 on the 30% held-out test split.
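A comparison of the three model families could look roughly like the sketch below. Everything in it is an assumption for illustration: the toy texts and labels, the linear kernel for SVC, the forest size, and the use of `OneVsRestClassifier` for all three models. The scores are computed on the training data and say nothing about the results reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Hypothetical texts with two binary labels: [takes cards, offers takeout]
texts = [
    "paid by card and grabbed takeout",
    "cash only but quick takeout",
    "card accepted, dine in only",
    "cash only sit down dinner",
    "used my credit card for a takeout order",
    "takeout window, bring cash",
    "they take cards, lovely table service",
    "no cards and no takeout, dine in",
]
Y = np.array([[1, 1], [0, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0], [0, 0]])

X = TfidfVectorizer().fit_transform(texts)

models = {
    "logistic_regression": LogisticRegression(),
    "svc": SVC(kernel="linear"),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, estimator in models.items():
    # One binary classifier per label column
    clf = OneVsRestClassifier(estimator).fit(X, Y)
    # Training-set subset accuracy, for illustration only
    scores[name] = accuracy_score(Y, clf.predict(X))

print(scores)
```

For a fair comparison on real data, each model would be scored on the same held-out split (or with cross-validation) rather than on the data it was trained on.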
Business Applications
Predicting business attributes from review and tip text has two primary business applications: targeted marketing and competitor analysis. For targeted marketing, businesses can use the model's insights to tailor campaigns to specific audiences and highlight the attributes their customers prefer, which can lead to higher customer satisfaction and increased revenue. For competitor analysis, the model can give businesses insight into their competitors' performance and attributes, so they can adapt their strategies and offerings to stay ahead in the market.
Here is a link to the complete GitHub repository: https://github.com/EthanFalcao/Yelp-Dataset-Challenge-Analysis-and-Prediction