We were provided with 20GB Text Dataset of Amazon Reviews
The endgoal was building a machine learning model on the text reviews and predicting the numerical rating that the Amazon review ended giving to the product (from 1 to 5). This was thus a task of customer satisfaction predicted from text, or sentiment analysis.
We used NLP Feature Engineering with PySpark’s in-built TF-IDF Vectorizer, and Word2Vec.
We also pre-processed the data with Pyspark’s in-built One-hot-encoding, and PCA Dimensionality Reduction.
We ran Data-cleaning and data quality tests with Type-casting and Null Imputation
We conducted descriptive statistics tasks and EDA using groupby’s and aggregations in PySpark.
For the machine learning model, we employed PySpark’s Decision Tree Regressor and model pipeline, and trained the model, which took between around half an hour.