Benson Duong

Data Science major at UC San Diego

Interests in Data Science, Python Programming, Machine Learning.

View My GitHub Profile

Restaurant Recommendation Data Science Project

(Link to Github Repo)


Sub-Project 1: Restaurant Rating Prediction given User Text Reviews

Data Preparation and Feature Engineering

  1. The review text column was preprocessed by lowercasing it and replacing non-alphanumeric characters with whitespace.
  2. TF-IDF vectorization was applied to the processed review text, excluding the words in NLTK’s English stop-word list. This produced a matrix with one row per review and one column per vocabulary word.
  3. The rows of the TF-IDF matrix were split into two groups according to the binary rating category (good or bad) of the corresponding review.
  4. Within each group, the rows were collapsed with a column-wise mean (i.e. np.mean with axis=0). This produced a new matrix with 2 rows and as many columns as the vocabulary size.
  5. A column-wise argmax (i.e. axis=0) was taken over this 2-row matrix, keeping only the columns where the argmax row index corresponds to the bad-rating category. These column indices, together with the vocabulary mapping of the saved TF-IDF vectorizer object, were used to retrieve the words.
  6. The previous steps used TF-IDF to filter for the words that have a stronger “belongingness” to bad reviews than to good reviews. Of these words, only 100 were kept. Although all of them are more relevant to bad reviews, some are more so than others, so the words were sorted by the absolute difference between the 2 rows, largest first, and the top 100 were picked.
  7. Restricted to these 100 words, a standard bag-of-words count-vectorized matrix was produced from the original reviews dataset. This matrix serves as the input matrix to the model.

Modeling, with Class Imbalance Undersampling

Results and Evaluation

Sub-Project 2: Predicting if a User will visit a Restaurant

Overview

Word2Vec

Image Recognition with Pre-trained VGGNET16

Unsupervised ML: Dividing the Restaurants and Users into Distinct Sub-groups with K-Means Clustering

Prediction Modeling: Binary Classification

Logistic Regression

Multi-Layered Neural Network

Conclusion