In This Exercise You Will Be Allowed To Share With Others

1) Analyze the Titanic data - this is a good reference:
2) Focus on visualizing the effectiveness and method of each algorithm.
3) Compare 5 machine learning algorithms and summarize them in a PowerPoint along with the code:
  • Logistic Regression: The assumptions were met and the accuracy of the best model is 0.8539.
  • Linear SVM: Scaled some features and gained an accuracy of 0.8764.
  • Non-linear Radial SVM: Scaled some features and secured the highest accuracy of 0.8820.
  • Random Forest: Gained an accuracy of 0.8427; 10-fold cross-validation did not improve the model accuracy.
  • Naive Bayes: Gained an accuracy of 0.8427, which is identical to the Random Forest accuracy.
4) Include references.

Remember to submit your response as a PowerPoint.

Paper for the Above Instruction

The Titanic dataset has long served as a benchmark for machine learning classification tasks, providing a compelling context for analyzing various algorithms' effectiveness and visualization techniques. This paper explores five prominent machine learning algorithms—Logistic Regression, Linear Support Vector Machine (SVM), Non-linear Radial SVM, Random Forest, and Naive Bayes—by applying them to the Titanic dataset, evaluating their accuracy, and visualizing their methodologies to compare their strengths and limitations.

Before applying any of the algorithms, data preprocessing is crucial: handling missing values, scaling features, and encoding categorical variables. Logistic Regression, a well-understood linear model, assumes independent observations, little multicollinearity among predictors, and a linear relationship between the features and the log-odds of the outcome. After verifying these assumptions, Logistic Regression achieved an accuracy of 0.8539. Its interpretability and simplicity make it a favorable option, especially when the data adheres to its assumptions.
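
As a concrete illustration, here is a minimal scikit-learn sketch of this preprocessing and model fit. It assumes seaborn's bundled copy of the Titanic data as a stand-in for the Kaggle file, and the feature subset, imputation choices, and train/test split are illustrative rather than the exact ones behind the reported 0.8539 accuracy.

    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Load the Titanic data; seaborn's bundled copy is assumed here as a
    # stand-in for the Kaggle train.csv file
    df = sns.load_dataset("titanic")

    # Minimal preprocessing: impute missing ages, encode sex as 0/1,
    # keep a small set of numeric features
    df["age"] = df["age"].fillna(df["age"].median())
    df["sex"] = (df["sex"] == "female").astype(int)
    features = ["pclass", "sex", "age", "sibsp", "parch", "fare"]
    X, y = df[features], df["survived"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # Fit logistic regression and report held-out accuracy
    log_reg = LogisticRegression(max_iter=1000)
    log_reg.fit(X_train, y_train)
    print("Logistic Regression accuracy:", log_reg.score(X_test, y_test))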

The Linear SVM algorithm, which finds a hyperplane that separates classes with maximum margin, benefits from feature scaling to optimize performance. By scaling features, the Linear SVM achieved an accuracy of 0.8764, outperforming Logistic Regression in this case. Visualization of the decision boundary in two-dimensional feature spaces helps understand how the SVM separates the classes, emphasizing the importance of feature scaling for SVM efficiency.
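
A hedged sketch of this step follows. It reuses the same illustrative preprocessing as the logistic regression example and chains a StandardScaler with LinearSVC in a pipeline so that scaling is learned from the training data only; the hyperparameters shown are scikit-learn defaults, not the tuned values behind the reported 0.8764.

    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    # Same illustrative preprocessing as in the logistic regression sketch
    df = sns.load_dataset("titanic")
    df["age"] = df["age"].fillna(df["age"].median())
    df["sex"] = (df["sex"] == "female").astype(int)
    X = df[["pclass", "sex", "age", "sibsp", "parch", "fare"]]
    y = df["survived"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # Scale features, then fit the maximum-margin linear classifier
    linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
    linear_svm.fit(X_train, y_train)
    print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))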

The Non-linear Radial SVM utilizes a radial basis function (RBF) kernel, allowing it to capture complex, non-linear relationships in the data. After scaling features to enhance kernel performance, the Radial SVM secured the highest accuracy among these models at 0.8820. Visualizations such as kernel plots and decision regions illustrate how the RBF kernel maps inputs into higher-dimensional spaces to achieve non-linear decision boundaries.
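
The sketch below swaps in an RBF-kernel SVC under the same illustrative preprocessing; C and gamma are left at scikit-learn defaults rather than the tuned values that would be needed to reproduce the reported 0.8820.

    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    df = sns.load_dataset("titanic")
    df["age"] = df["age"].fillna(df["age"].median())
    df["sex"] = (df["sex"] == "female").astype(int)
    X = df[["pclass", "sex", "age", "sibsp", "parch", "fare"]]
    y = df["survived"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # The RBF kernel measures similarity between feature vectors, so the
    # scaling step matters even more here than for the linear SVM
    rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    rbf_svm.fit(X_train, y_train)
    print("Radial SVM accuracy:", rbf_svm.score(X_test, y_test))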

Random Forest, an ensemble of decision trees, provided an accuracy of 0.8427. Its robustness against overfitting and ability to handle various feature types make it a powerful classifier. Ten-fold cross-validation did not significantly change performance, suggesting the initial accuracy estimate was already stable. Visualizations such as feature importance plots shed light on which attributes most influence survival predictions in the Titanic dataset.
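
A minimal sketch of the Random Forest step, again on the assumed seaborn copy of the data: it reports the 10-fold cross-validated accuracy and then prints the feature importances that underlie a feature importance plot. The number of trees is an illustrative choice, not a tuned value.

    import seaborn as sns
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = sns.load_dataset("titanic")
    df["age"] = df["age"].fillna(df["age"].median())
    df["sex"] = (df["sex"] == "female").astype(int)
    features = ["pclass", "sex", "age", "sibsp", "parch", "fare"]
    X, y = df[features], df["survived"]

    forest = RandomForestClassifier(n_estimators=200, random_state=42)

    # 10-fold cross-validation: mean accuracy across the folds
    scores = cross_val_score(forest, X, y, cv=10, scoring="accuracy")
    print("10-fold CV accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))

    # Fit on all rows to inspect which attributes drive the predictions
    forest.fit(X, y)
    for name, importance in sorted(zip(features, forest.feature_importances_),
                                   key=lambda pair: pair[1], reverse=True):
        print(f"{name}: {importance:.3f}")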

Naive Bayes, based on applying Bayes' theorem with strong independence assumptions, attained an accuracy of 0.8427, matching the Random Forest accuracy. Despite its simplicity and assumptions, it often performs adequately in categorical data contexts. Confusion matrices and probability distribution visualizations help in understanding Naive Bayes' decision-making process despite the independence assumption’s limitations.
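
The Naive Bayes step might look like the sketch below, which applies Gaussian Naive Bayes to the same illustrative numeric features and prints the confusion matrix mentioned above; a Bernoulli or categorical variant would be an equally reasonable choice for the encoded features.

    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score, confusion_matrix

    df = sns.load_dataset("titanic")
    df["age"] = df["age"].fillna(df["age"].median())
    df["sex"] = (df["sex"] == "female").astype(int)
    X = df[["pclass", "sex", "age", "sibsp", "parch", "fare"]]
    y = df["survived"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # Gaussian Naive Bayes treats each feature as conditionally
    # independent given the class label
    nb = GaussianNB()
    nb.fit(X_train, y_train)
    pred = nb.predict(X_test)

    print("Naive Bayes accuracy:", accuracy_score(y_test, pred))
    print("Confusion matrix (rows = actual, columns = predicted):")
    print(confusion_matrix(y_test, pred))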

Comparing these algorithms highlights the trade-offs among interpretability, complexity, and performance. While Radial SVM achieves the highest accuracy, logistic regression remains highly interpretable and computationally efficient. Visualizations across all models enhance understanding of their decision processes, supporting better algorithm selection based on the problem context.
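
For the PowerPoint, a single comparison chart of the reported accuracies is often the clearest summary slide. The sketch below plots the five figures quoted above with matplotlib; the output file name and styling are arbitrary choices.

    import matplotlib.pyplot as plt

    # Accuracies as reported in the text above
    models = ["Logistic\nRegression", "Linear SVM", "Radial SVM",
              "Random Forest", "Naive Bayes"]
    accuracies = [0.8539, 0.8764, 0.8820, 0.8427, 0.8427]

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(models, accuracies, color="steelblue")
    ax.set_ylim(0.80, 0.90)
    ax.set_ylabel("Test accuracy")
    ax.set_title("Titanic survival classification: model comparison")
    for i, acc in enumerate(accuracies):
        ax.text(i, acc + 0.002, f"{acc:.4f}", ha="center")
    fig.tight_layout()
    fig.savefig("model_comparison.png")  # slide-ready figure for the PowerPoint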

References

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
  • Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
  • Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. Morgan Kaufmann.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
  • McLachlan, G. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley-Interscience.
  • Cover, T., & Thomas, J. (2006). Elements of Information Theory. Wiley-Interscience.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.