Predict Amazon Review Helpfulness with XGBoost, Neural Network, and LSTM Neural Network

Machine learning and feature engineering for text data

The main objective of this project is to use XGBoost, a neural network, and an LSTM neural network to build predictive models that automatically predict how helpful a specific review is to customers.

Many e-commerce companies, Amazon among them, depend heavily on customer reviews to give potential purchasers an evaluation of a product. Unlike local markets, where consumers can choose items after inspecting them on site, online customers must rely on other information to make their purchase decisions. Developing communication channels among customers is therefore critically important for commercial websites. To this end, Amazon lets users write their opinions about products and vote on whether a review is helpful or not helpful. These opinions are valuable to potential buyers and product manufacturers. However, new posts and reviews of “not hot” products have few votes, so customers cannot get enough information from the review evaluations. This project therefore uses machine-learning algorithms to build predictive models that automatically predict the helpfulness of specific reviews to customers.

The data source is Amazon clothing, shoes, and jewelry review data; each record is a single review of a specific product. The dataset contains the following fields:

  1. itemID: product identifier
  2. reviewerID: reviewer identifier
  3. rating: product rating score, from 1 to 5
  4. reviewText: review content
  5. reviewHash: hash of the review
  6. reviewTime: time the review was posted
  7. summary: summary title of the review content
  8. unixReviewTime: Unix time at which the review was posted
  9. outOf: total number of votes on the review
  10. nHelpful: number of helpful votes out of the total votes

Our primary task is to predict nHelpful for the test data set. Predicting the absolute number of helpful votes is generally harder than predicting the helpful ratio (the number of helpful votes divided by the total votes); therefore, we predict the helpful ratio rather than the absolute count. The count can then be recovered by multiplying the predicted ratio by the total votes.

To improve prediction accuracy, we removed the reviews with fewer than two total votes (0 or 1). We chose 2 as the threshold because with a single vote the helpful ratio can only be 0% or 100%, and with no votes it is undefined.
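
The filtering step and the target ratio can be sketched in a few lines of pandas. This is a minimal illustration on toy data, not the original preprocessing code: the frame `reviews` and the column name `helpratio` used here are assumptions, while `nHelpful` and `outOf` follow the field list above.

```python
import pandas as pd

# Toy data; nHelpful and outOf follow the dataset fields described above.
reviews = pd.DataFrame({
    "nHelpful": [0, 1, 3, 8],
    "outOf":    [0, 1, 4, 10],
})

# Keep only reviews with at least two total votes.
filtered = reviews[reviews["outOf"] >= 2].copy()

# Target: share of helpful votes among all votes on the review.
filtered["helpratio"] = filtered["nHelpful"] / filtered["outOf"]
print(filtered)
```

Filtering before dividing also avoids the division by zero that reviews with no votes would cause.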

  1. Delay time from the first same product review
import pandas as pd

# define function to get the first review time for each item
# (positional lookups below assume a default 0..n-1 index)
def first_review_time(data_df):
    time_dict = {}
    for i in range(len(data_df)):
        pid = data_df['itemID'][i]
        time_i = data_df['reviewTime'][i]
        if pid in time_dict:
            # keep the earliest review time seen so far
            if time_i < time_dict[pid]:
                time_dict[pid] = time_i
        else:
            time_dict[pid] = time_i
    data_df['firstReviewTime'] = data_df['itemID'].map(time_dict).values
    return data_df

# convert reviewTime to datetime data type
train_df['reviewTime'] = pd.to_datetime(train_df['reviewTime'])
train_df = first_review_time(train_df)
# delay in whole days since the first review of the same item
train_df['review_first_dif'] = (train_df['reviewTime'] - train_df['firstReviewTime']).dt.days
test_df['reviewTime'] = pd.to_datetime(test_df['reviewTime'])
test_df = first_review_time(test_df)
test_df['review_first_dif'] = (test_df['reviewTime'] - test_df['firstReviewTime']).dt.days
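
The same feature can also be computed without an explicit Python loop. The sketch below is an alternative formulation on toy data, not the code used in this project; it assumes `reviewTime` has already been converted to datetime.

```python
import pandas as pd

# Toy data with the same column names as the review data.
df = pd.DataFrame({
    "itemID": ["A", "A", "B", "A"],
    "reviewTime": pd.to_datetime(
        ["2014-01-10", "2014-01-01", "2014-03-05", "2014-01-31"]
    ),
})

# Earliest review time per item, broadcast back to every row of that item.
df["firstReviewTime"] = df.groupby("itemID")["reviewTime"].transform("min")

# Delay in whole days between each review and the item's first review.
df["review_first_dif"] = (df["reviewTime"] - df["firstReviewTime"]).dt.days
print(df)
```

On large data sets the grouped version is usually much faster than iterating row by row.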

2. Rating score deviation from mean

def deviation_mean(data_df):
    # mean rating of each item
    rating_mean_dict = data_df['rating'].groupby(data_df['itemID']).mean().to_dict()
    data_df['rating_mean'] = data_df['itemID'].map(rating_mean_dict).values
    # how far this review's rating is from the item's mean rating
    data_df['rating_mean_dev'] = data_df['rating'] - data_df['rating_mean']
    return data_df['rating_mean_dev']

train_df['rating_mean_dev'] = deviation_mean(train_df)
test_df['rating_mean_dev'] = deviation_mean(test_df)

3. Number of words of each review text

train_df['reviewWords'] = train_df['reviewText'].apply(lambda x: len(x.split()))
test_df['reviewWords'] = test_df['reviewText'].apply(lambda x: len(x.split()))

4. Number of words of each review summary

train_df['summaryWords'] = train_df['summary'].apply(lambda x: len(x.split()))
test_df['summaryWords'] = test_df['summary'].apply(lambda x: len(x.split()))

5. Ratio of summary words to review text words

train_df['ratiosuWord'] = train_df['summaryWords'] / train_df['reviewWords']
test_df['ratiosuWord'] = test_df['summaryWords'] / test_df['reviewWords']

6. Number of sentences of each review text

def count_sentence(data_df, text):
    # approximate the number of sentences by counting sentence-ending marks
    pun_sen = ['.', '!', '?']
    text_col = data_df[text]
    sentence_counts = []
    for i in text_col:
        sentence_count = []
        for j in pun_sen:
            count_a = i.count(j)
            sentence_count.append(count_a)
        sentence_counts.append(sum(sentence_count))
    data_df['reviewSentences'] = sentence_counts
    return data_df['reviewSentences']

train_df['reviewSentences'] = count_sentence(train_df, 'reviewText')
test_df['reviewSentences'] = count_sentence(test_df, 'reviewText')
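
To see what this count measures, here is a small standalone sketch of the same idea applied to single strings. The helper name `count_sentence_marks` is hypothetical and only used for illustration.

```python
def count_sentence_marks(text):
    """Count '.', '!' and '?' as a rough proxy for the number of sentences."""
    return sum(text.count(mark) for mark in (".", "!", "?"))

print(count_sentence_marks("Great shoes! Fit well. Would buy again?"))
print(count_sentence_marks("Wait..."))
```

Every mark is counted individually, so an ellipsis contributes three "sentences". The feature is an approximation rather than a true sentence segmentation.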

7. Number of characters of each review text

from collections import Counter

# I found that string.punctuation did not include all the punctuation of the brown words, so I added extra punctuation in the following list
punctuation = ['!','"','#','$','%','&',"'",'(',')','*','+',',','-','.','/',':',';','<','=','>','?','@','[','\\',']','^','_','`','{','|','}','~','``',"''",'--']

def count_characters(data_df):
    reviewcharacters = []
    text_col = data_df['reviewText']
    for i in text_col:
        a = dict(Counter(i))  # character -> number of occurrences
        # keep every non-punctuation character (whitespace is included)
        b = {k:v for k, v in a.items() if k not in punctuation}
        c = sum(list(b.values()))
        reviewcharacters.append(c)
    data_df['reviewChars'] = reviewcharacters
    return data_df['reviewChars']

train_df['reviewChars'] = count_characters(train_df)
test_df['reviewChars'] = count_characters(test_df)

8. Readability of each review (ARI as index to measure)

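def readability(data_df):
    wordperSen = []
    charperWord = []
    reviewRead = []
    len_df = len(data_df)
    a = list(data_df['reviewWords'])
    b = list(data_df['reviewSentences'])
    c = list(data_df['reviewChars'])
    for i in range(len_df):
        # words per sentence (0 when no sentence mark was found)
        if b[i] == 0:
            wordperSen.append(0)
        else:
            j = a[i] / b[i]
            wordperSen.append(j)
        # characters per word (0 for an empty review)
        if a[i] == 0:
            charperWord.append(0)
        else:
            l = c[i] / a[i]
            charperWord.append(l)
        # Automated Readability Index
        ari = 4.71 * charperWord[i] + 0.5 * wordperSen[i] - 21.43
        reviewRead.append(ari)
    data_df['reviewRead'] = reviewRead
    return data_df['reviewRead']

train_df['reviewRead'] = readability(train_df)
test_df['reviewRead'] = readability(test_df)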
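
The ARI formula in the function above can be checked by hand on a single set of counts. In this minimal sketch, `ari_score` is a hypothetical helper (not part of the original code) that takes the three counts directly and uses the same zero-count fallbacks.

```python
def ari_score(n_chars, n_words, n_sentences):
    """Automated Readability Index from raw counts."""
    chars_per_word = n_chars / n_words if n_words else 0
    words_per_sentence = n_words / n_sentences if n_sentences else 0
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43

# 100 characters, 20 words, 2 sentences:
# 4.71 * 5 + 0.5 * 10 - 21.43 = 7.12
print(round(ari_score(100, 20, 2), 2))
```

Longer words and longer sentences both push the score up, which corresponds to text that is harder to read.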

9. Number of punctuations of each review text

def numpunct(data_df):
    reviewPuncts = []
    for i in data_df['reviewText']:
        a = dict(Counter(i))  # character -> number of occurrences
        # keep only the punctuation characters
        b = {k:v for k,v in a.items() if k in punctuation}
        c = sum(list(b.values()))
        reviewPuncts.append(c)
    data_df['reviewPuncts'] = reviewPuncts
    return data_df['reviewPuncts']

train_df['reviewPuncts'] = numpunct(train_df)
test_df['reviewPuncts'] = numpunct(test_df)

10. Ratio of punctuations with characters

def ratio_puncts_chars(data_df):
    return data_df['reviewPuncts'] / data_df['reviewChars']

train_df['ratiopunChar'] = ratio_puncts_chars(train_df)
test_df['ratiopunChar'] = ratio_puncts_chars(test_df)

11. Number of capital words of each review text

def numcapwords(data_df):
    reviewCwords = []
    for i in data_df['reviewText']:
        a = i.split()
        # words written entirely in upper case
        b = [word for word in a if word.isupper()]
        c = len(b)
        reviewCwords.append(c)
    data_df['reviewCwords'] = reviewCwords
    return data_df['reviewCwords']

train_df['reviewCwords'] = numcapwords(train_df)
test_df['reviewCwords'] = numcapwords(test_df)

12. Number of capital words of each summary

# same function name as above, redefined here to work on the summary column
def numcapwords(data_df):
    reviewCwords = []
    for i in data_df['summary']:
        a = i.split()
        b = [word for word in a if word.isupper()]
        c = len(b)
        reviewCwords.append(c)
    data_df['summaryCwords'] = reviewCwords
    return data_df['summaryCwords']

train_df['summaryCwords'] = numcapwords(train_df)
test_df['summaryCwords'] = numcapwords(test_df)

13. Number of exclamation and question marks of each review text

import re

def numexclquest(data_df):
    suexcqueMarks = []
    for i in data_df['reviewText']:
        # every '!' or '?' in the text
        a = re.findall(r'[!?]', i)
        suexcqueMarks.append(len(a))
    data_df['numexclquest'] = suexcqueMarks
    return data_df['numexclquest']

train_df['numexclquest'] = numexclquest(train_df)
test_df['numexclquest'] = numexclquest(test_df)

14. Number of exclamation and question marks of each summary text

import re

# same function name as above, redefined here to work on the summary column
def numexclquest(data_df):
    suexcqueMarks = []
    for i in data_df['summary']:
        a = re.findall(r'[!?]', i)
        suexcqueMarks.append(len(a))
    data_df['sunumexclquest'] = suexcqueMarks
    return data_df['sunumexclquest']

train_df['sunumexclquest'] = numexclquest(train_df)
test_df['sunumexclquest'] = numexclquest(test_df)

15. Number of reviews of each product (measure the popularity of each product)

def numreviewPro(data_df):
    # number of reviews per item
    itemid_dict = data_df.groupby('itemID')['itemID'].count().to_dict()
    data_df['numreviewPro'] = data_df['itemID'].map(itemid_dict).values
    return data_df['numreviewPro']

train_df['numreviewPro'] = numreviewPro(train_df)
test_df['numreviewPro'] = numreviewPro(test_df)

16. Number of reviews of each reviewer (measure reviewer’s experience)

# same function name as above, redefined here to count reviews per reviewer
def numreviewPro(data_df):
    itemid_dict = data_df.groupby('reviewerID')['reviewerID'].count().to_dict()
    data_df['numreviews'] = data_df['reviewerID'].map(itemid_dict).values
    return data_df['numreviews']

train_df['numreviews'] = numreviewPro(train_df)
test_df['numreviews'] = numreviewPro(test_df)
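
# drop the raw text, identifier, and time columns that are no longer needed
train_df_less = train_df.drop(columns = ['categoryID', 'categories', 'reviewerID','reviewText','reviewHash','reviewTime','summary','unixReviewTime','helpful','firstReviewTime'])
# target: ratio of helpful votes to total votes
train_df['helpratio'] = train_df['nHelpful'] / train_df['outOf']
test_df_less = test_df.drop(columns = ['categoryID', 'categories', 'reviewerID', 'reviewText','reviewHash','reviewTime','summary','unixReviewTime', 'firstReviewTime','helpful'])
# correlation matrix of the numeric columns (numeric_only requires pandas 1.5 or later)
train_df_less.corr(numeric_only=True)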
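
The modeling code that follows uses X_train, y_train, X_val, and y_val, but the step that builds them is not shown in this post. Since the hyperparameter search relies on a time-series split, one plausible way to create them is a chronological hold-out. The sketch below is an assumption, not the original code: `chronological_split`, its parameters, and the toy frame are all hypothetical.

```python
import pandas as pd

def chronological_split(df, time_col, target_col, val_fraction=0.2):
    """Hold out the most recent rows as a validation set."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n_val = int(len(ordered) * val_fraction)
    cut = len(ordered) - n_val
    X = ordered.drop(columns=[time_col, target_col])
    y = ordered[target_col]
    return X.iloc[:cut], X.iloc[cut:], y.iloc[:cut], y.iloc[cut:]

# Toy frame: one time column, one feature, one target.
toy = pd.DataFrame({
    "unixReviewTime": [50, 10, 40, 20, 30],
    "reviewWords": [12, 80, 45, 7, 150],
    "helpratio": [0.5, 1.0, 0.75, 0.0, 0.9],
})
X_train, X_val, y_train, y_val = chronological_split(
    toy, "unixReviewTime", "helpratio"
)
print(X_train, X_val, sep="\n")
```

Holding out the latest reviews mimics the real use case, where the model scores reviews written after the training data was collected.
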
  1. xgboost Model Predict
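import xgboost as xgb
from math import sqrt
from scipy.stats import uniform, randint
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# search space for the randomized hyperparameter search
params = {"colsample_bytree": uniform(0.7, 0.3),
          "gamma": uniform(0, 0.5),
          "learning_rate": uniform(0.001, 0.3), # default 0.1
          "max_depth": randint(2, 6), # default 3
          "n_estimators": randint(100, 250), # default 100
          "subsample": uniform(0.6, 0.4)}

# "reg:squarederror" replaces the deprecated "reg:linear" objective
xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)
time_split = TimeSeriesSplit(n_splits = 8)
xgb_search = RandomizedSearchCV(xgb_model, param_distributions=params, random_state=42, n_iter=4, cv=time_split, verbose=1, n_jobs=1, return_train_score=True)
xgb_search.fit(X_train, y_train)

# evaluate on the validation set
y_pred = xgb_search.predict(X_val)
rms = sqrt(mean_squared_error(y_val, y_pred))
print ('RMSE:', rms)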
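
An RMSE value is easier to judge against a trivial baseline. The sketch below computes RMSE by hand and scores a model that always predicts the mean helpful ratio of the training set. The toy numbers and the `rmse` helper are made up for illustration and are not results from this project.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy helpful ratios.
y_train_toy = [0.5, 1.0, 0.0, 0.5]
y_val_toy = [1.0, 0.0]

# Baseline: always predict the training mean.
baseline = sum(y_train_toy) / len(y_train_toy)
baseline_rmse = rmse(y_val_toy, [baseline] * len(y_val_toy))
print(baseline_rmse)
```

A trained model is only useful if its validation RMSE is clearly below this baseline.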

2. Multilayer Perceptron Neural Network Predict

from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense

# define function to preprocess data
def preproc(X_train, X_val):
    input_list_train = []
    input_list_val = []
    input_list_test = []

    other_cols = [c for c in X_train.columns]
    # (the lines that fill the input lists are missing from this listing)

    return input_list_train, input_list_val

# data normalization: fit the scaler on each training column,
# then apply the same scaling to the training and validation data
scalar = StandardScaler()
for scalar_feature in X_train.columns:
    scalar.fit(X_train[scalar_feature].values.reshape(-1,1))
    X_train[scalar_feature] = scalar.transform(X_train[scalar_feature].values.reshape(-1,1))
    X_val[scalar_feature] = scalar.transform(X_val[scalar_feature].values.reshape(-1,1))

# build neural network model
model = Sequential()
# get number of columns in training data
n_cols = X_train.shape[1]
# add model layers
model.add(Dense(100, activation='relu', input_shape=(n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
# sigmoid keeps the predicted helpful ratio between 0 and 1
model.add(Dense(1, activation = 'sigmoid'))
# model compile
model.compile(loss = 'mean_squared_error', optimizer = 'adam')
# X_train_list / X_val_list: the preprocessed inputs (assumed to be the lists returned by preproc)
model.fit(X_train_list, y_train, validation_split=0.2, epochs=100)
y_predict = model.predict(X_val_list)
rms = sqrt(mean_squared_error(y_val, y_predict))
print ('RMSE:', rms)
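
The normalization step above rescales each feature to zero mean and unit variance. The NumPy sketch below shows the same idea on toy arrays, with the statistics computed on the training data only and then reused for the validation data. The arrays are made up for illustration.

```python
import numpy as np

# Toy feature matrices: rows are reviews, columns are features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
val = np.array([[2.0, 40.0]])

# Column statistics from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation, as StandardScaler uses

train_scaled = (train - mean) / std
val_scaled = (val - mean) / std
print(train_scaled)
print(val_scaled)
```

Reusing the training statistics keeps validation data on the same scale the network was trained on and avoids leaking information from the validation set.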

3. LSTM Neural Network Predict

import numpy as np
from keras.layers import LSTM

model = Sequential()
# get number of columns in training data
n_cols = X_train.shape[1]
# add model layers: each feature is fed to the LSTM as one time step with a single value
model.add(LSTM(units = 100, input_shape=(X_train.shape[1],1), return_sequences = True))
model.add(LSTM(units = 100, return_sequences = True, activation='relu'))
model.add(LSTM(units = 100, return_sequences = True, activation = 'relu'))
model.add(LSTM(units = 100))
model.add(Dense(units = 1, activation = 'sigmoid'))
model.compile(loss = 'mean_squared_error', optimizer = 'adam')

# reshape training and validation data to 3-dimensional arrays: (samples, time steps, features)
X_train_list = np.reshape(np.array(X_train),(X_train.shape[0],X_train.shape[1],1))
X_val_list = np.reshape(np.array(X_val), (X_val.shape[0],X_val.shape[1],1))
y_train_list = np.array(y_train)
y_val_list = np.array(y_val)
model.fit(X_train_list, y_train_list, epochs = 30)
y_predict = model.predict(X_val_list)
rms = sqrt(mean_squared_error(y_val_list, y_predict))
print ('RMSE:', rms)
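
The reshape step is the part of the LSTM code that most often causes shape errors, so here it is in isolation. This is a small NumPy sketch on a toy matrix; the array `X` is made up for illustration.

```python
import numpy as np

# Toy feature matrix: 2 reviews, 3 features each.
X = np.arange(6, dtype=float).reshape(2, 3)

# LSTM layers expect (samples, time steps, features per step).
# Here every feature becomes its own time step with one value.
X_3d = X.reshape(X.shape[0], X.shape[1], 1)

print(X.shape)     # (2, 3)
print(X_3d.shape)  # (2, 3, 1)
```

The values themselves are unchanged; only a trailing axis of length 1 is added to match the `input_shape=(n_features, 1)` declared in the first LSTM layer.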

In this project, an XGBoost model, a multilayer neural network, and an LSTM neural network were applied to Amazon clothing, shoes, and jewelry review data to predict the helpfulness of a specific review. Sixteen features were derived from the original dataset. I have shared all the code for the feature engineering and the model predictions to make it easier to reuse in your own applications. I hope this post is helpful for your project, and if you like it, please give it as many claps as you can ^-^.

Data Scientist Costar