Assignment 5: Predicting Yelp User Ratings Using Spark ML Library

Predict Yelp user ratings from review text using the Spark ML library, covering data exploration, feature engineering, model building, ensembling, and an optional feature addition, with evaluation and interpretation.

Sample Paper for the Above Instruction

In this study, we aim to predict Yelp user ratings based on review texts by utilizing Apache Spark's MLlib. The primary goal is to develop a robust classification framework that can effectively distinguish between satisfied and dissatisfied reviews, characterized respectively by ratings of 4 or 5 stars and ratings of 1, 2, or 3 stars. This comprehensive analysis involves multiple stages: data exploration, feature engineering, model training with hyperparameter tuning, ensembling, and the potential integration of user-related features.

Data Exploration

The initial phase involves loading the review dataset from review.json and extracting pertinent attributes: 'text' for the review content and 'stars' for the user rating. Analyzing the distribution of 'stars' provides insights into the class imbalance inherent in the dataset. Typically, reviews are skewed towards higher ratings, necessitating stratified sampling to balance classes. For instance, by evaluating the counts of each star rating, we observe the predominance of certain classes over others. Implementing Spark's sampleBy method allows us to downsample the majority class, scaling the sampling fractions by 0.1 to generate a manageable subset for modeling, particularly given computational constraints.

Feature Engineering

Next, we transform the star ratings into a binary 'rating' variable: assigning 0 to reviews with 1-3 stars denoting dissatisfaction, and 1 to reviews with 4-5 stars indicating satisfaction. We analyze the distribution of this binary label to evaluate class balance—an essential step to prevent biased models. If imbalance persists, stratified sampling ensures an equal representation of both classes in the training data.

Subsequently, textual feature extraction involves preprocessing the review text: removing stop words and punctuation, applying stemming to reduce word forms, and constructing TF-IDF feature vectors. Using Spark's CountVectorizer with setMinDF(100) restricts the vocabulary to words appearing in at least 100 reviews, which enhances the relevance of our feature set. This process, combined with text normalization strategies, aims to produce meaningful and discriminative features for classification.

Model Development and Hyperparameter Tuning

We develop three distinct classifiers: Logistic Regression, Random Forest, and Gradient-Boosted Trees. Each model pipeline incorporates the TF-IDF features and is subjected to hyperparameter tuning via cross-validation with three folds, optimizing the Area Under the Curve (AUC) metric. This systematic approach facilitates model comparison and selection based on predictive performance. For each model, training on the stratified, balanced dataset ensures generalizability, with evaluation based on the test set's AUC scores, providing insights into each model's capability to discriminate between positive and negative reviews.

Superior model performance is typically characterized by higher AUC values, indicating better ranking ability in distinguishing classes. Frequently, ensemble methods or gradient-boosted trees outperform simpler models due to their ability to capture complex patterns, though this must be empirically validated.

Ensemble Modeling

To leverage the strengths of individual classifiers, an ensemble approach combines predictions from Logistic Regression, Random Forest, and Gradient-Boosted Trees. Predictions from each model are aggregated by majority voting: the label predicted by at least two of the three models is adopted (with three binary classifiers, such a majority always exists). This ensemble prediction is then evaluated using the area under the ROC curve, providing a measure of combined model effectiveness compared to individual models.

Often, ensemble models yield superior ROC-AUC scores, indicating enhanced discrimination ability. The ensemble's performance depends on the diversity and accuracy of base models; thus, combining diverse algorithms generally yields better results.

Incorporating Additional Features

To improve predictive performance further, additional user features such as the average star rating per user—extracted from the user.json dataset—are integrated. Joining this data based on user IDs enables the inclusion of a new numerical feature, 'average_stars,' which potentially influences review ratings. Calculating the correlation between 'average_stars' and the binary 'rating' variable reveals whether user reputation correlates positively with review satisfaction.

Next, the 'average_stars' feature is incorporated into the models using Spark's VectorAssembler, which concatenates it with the TF-IDF features. Retraining the classifiers on these augmented feature vectors assesses whether the enhancement improves predictive performance. Typically, an increase in metrics such as AUC indicates that user reputation information provides additional predictive power and improves classification accuracy.

Conclusion

Throughout this process, integrating textual analysis with structured user features, coupled with rigorous model tuning and ensembling strategies, results in effective prediction of Yelp reviews' satisfaction levels. The approach demonstrates the importance of detailed preprocessing, balanced data handling, and diversified modeling techniques in real-world recommendation and sentiment analysis systems. Future work may explore advanced features, deeper natural language processing techniques, or neural network-based architectures for further performance improvements.
