PySpark Text Classification

A Pyspark project done by me and Moses Oh
The project is as follows:
- We were provided with 20GB Text Dataset of Amazon Reviews
- The endgoal was building a machine learning model on the text reviews and predicting the numerical rating that the Amazon review ended giving to the product (from 1 to 5). This was thus a task of customer satisfaction predicted from text, or sentiment analysis.
- We used NLP Feature Engineering with PySpark’s in-built TF-IDF Vectorizer, and Word2Vec.
- We also pre-processed the data with Pyspark’s in-built One-hot-encoding, and PCA Dimensionality Reduction.
- We ran Data-cleaning and data quality tests with Type-casting and Null Imputation
- We conducted descriptive statistics tasks and EDA using groupby’s and aggregations in PySpark.
- For the machine learning model, we employed PySpark’s Decision Tree Regressor and model pipeline, and trained the model, which took between around half an hour.