INSS 662 Project: Using Weka or Other Open Source Data Mining Software

INSS 662 Project: You are required to use Weka or other open source data mining software, not Watson.

1. Find an open dataset on the Internet (see below).
2. Conduct appropriate data mining activities and report the processes and outcomes. You need to use at least three competitive algorithms from the same or different classes of data mining or machine learning techniques. See the lecture slides for Chapter 5, under course materials, for algorithms and techniques.
3. Present your results.

Possible datasets to choose from: data from hackathons such as the Lord of the Machines - Data Science Hackathon, DataHack Premier League, McKinsey Analytics Online Hackathon, etc. (Be brave and take on a challenging problem ...)

Possible sources are:

a.
b.
c. (Datasets for Data Mining and Data Science)
d. Open government dataset @
e. Dataset from URL:
f. Or other datasets, after getting approval from the instructors.

Paper for the Above Instructions

The INSS 662 project requires students to analyze an openly available dataset with an open source data mining tool such as Weka; Watson is not permitted. The core objective is to apply at least three competing algorithms, drawn from the same or different classes of data mining or machine learning techniques, and to compare their results. The process involves identifying a suitable dataset, preprocessing the data, applying the algorithms, and analyzing the results to draw conclusions.

Selection of Dataset

The first step is choosing an appropriate dataset. Students are encouraged to explore datasets from hackathons, government portals, or other credible sources. Datasets from events such as the Lord of the Machines - Data Science Hackathon, DataHack Premier League, or McKinsey Analytics Online Hackathon are suitable, and the assignment encourages taking on a challenging problem, which also gives more room to show how different algorithms behave. Other options include open government datasets, datasets from specific URLs, or other datasets approved by the instructors. The key is to select a dataset with enough records and attributes to support at least three algorithms and to yield meaningful insights.

Data Mining Activities and Methodology

The project centers on a sequence of data mining activities: data cleaning, transformation, and feature selection, followed by the application of at least three algorithms. Each step should be documented, including the rationale for the preprocessing techniques, the parameter settings, and the choice of algorithms. The algorithms may come from the same class or from different classes, such as decision trees, neural networks, clustering algorithms, or ensemble methods; drawing on different classes gives broader coverage of techniques. For example, a decision tree classifier like C4.5 (J48 in Weka), a clustering algorithm like K-Means (SimpleKMeans in Weka), and a neural network such as a Multi-Layer Perceptron (MultilayerPerceptron in Weka) make up a diverse set. Because the assignment asks for competitive algorithms, the comparison is most direct when all three address the same task, for instance three classifiers evaluated on the same target attribute.
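
The comparison step can be illustrated outside Weka with a minimal Python sketch, offered here only as an assumption-laden illustration. It runs k-fold cross-validation over three deliberately simple stand-in classifiers: a majority-class baseline (comparable in spirit to Weka's ZeroR), a 1-nearest-neighbor classifier (comparable to IBk with k = 1), and a nearest-centroid classifier. These are not the C4.5, K-Means, or Multi-Layer Perceptron algorithms named above, and the dataset is synthetic; all function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy data: two numeric features, two well-separated classes.
def make_toy_data(n_per_class=30, seed=0):
    rng = random.Random(seed)
    rows = []
    for label, center in (("a", (0.0, 0.0)), ("b", (3.0, 3.0))):
        for _ in range(n_per_class):
            x = (center[0] + rng.gauss(0, 1), center[1] + rng.gauss(0, 1))
            rows.append((x, label))
    rng.shuffle(rows)
    return rows

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Each fit_* function takes training rows and returns a predict(x) function.
def fit_majority(train):
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: majority

def fit_one_nn(train):
    return lambda x: min(train, key=lambda row: sq_dist(row[0], x))[1]

def fit_nearest_centroid(train):
    sums, counts = {}, Counter()
    for x, label in train:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}
    return lambda x: min(centroids, key=lambda lab: sq_dist(centroids[lab], x))

def cross_val_accuracy(fit, rows, k=5):
    # Split rows into k interleaved folds; each fold is held out once.
    folds = [rows[i::k] for i in range(k)]
    correct = 0
    for i, test in enumerate(folds):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        predict = fit(train)
        correct += sum(predict(x) == label for x, label in test)
    return correct / len(rows)

data = make_toy_data()
candidates = [
    ("majority", fit_majority),
    ("1-NN", fit_one_nn),
    ("nearest centroid", fit_nearest_centroid),
]
results = {name: cross_val_accuracy(fit, data) for name, fit in candidates}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {acc:.3f}")
```

Evaluating every candidate on the same folds keeps the comparison fair, and including a trivial baseline shows how much of each score reflects actual learning.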

Analysis and Reporting of Results

Once the algorithms have been applied, the outcomes should be evaluated with metrics suited to the task: accuracy, precision, and recall for classification, and measures such as the within-cluster sum of squared errors for clustering. Visualizations such as confusion matrices, ROC curves, or cluster plots should accompany the report to aid interpretation. The report must compare the algorithms to identify which performed best on the chosen dataset, and discuss why some algorithms may have outperformed others given the characteristics of the data.
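
As a small illustration of how the classification metrics relate to a confusion matrix, the following Python sketch derives accuracy, precision, recall, and F1 from the four cells of a binary confusion matrix. The labels are made up for the example and the function names are hypothetical; a tool such as Weka reports these figures directly, so this is only meant to make the definitions concrete.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn), treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(actual, predicted, positive):
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels for illustration only.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]

metrics = classification_metrics(actual, predicted, positive="yes")
for name, value in metrics.items():
    print(f"{name:10s} {value:.3f}")
```

Reporting precision and recall alongside accuracy matters most when the classes are uneven, since accuracy alone can look high while the minority class is mostly missed.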

Presentation of Findings

The final deliverable is a report covering the entire process, from dataset selection to final analysis, supported by relevant screenshots, tables, and visualizations. The report should also reflect on potential limitations, such as overfitting, data imbalance, or the limited applicability of the algorithms to real-world scenarios. A clear statement of the conclusions and of possible avenues for further research is also an essential part of the presentation.
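
One way to make the data imbalance limitation concrete is to report the class distribution together with the accuracy a trivial majority-class predictor would achieve. The sketch below is a minimal illustration under assumed, made-up labels; the function name and the label values are hypothetical.

```python
from collections import Counter

def class_balance_report(labels):
    """Summarize class counts and the accuracy of always predicting the majority."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    majority_label, majority_count = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "majority_label": majority_label,
        # Accuracy obtained by predicting the majority class for every record.
        "baseline_accuracy": majority_count / total,
        # Size of the largest class relative to the smallest one.
        "imbalance_ratio": majority_count / min(counts.values()),
    }

# Made-up target column: 90 records of one class, 10 of the other.
labels = ["stay"] * 90 + ["churn"] * 10
report = class_balance_report(labels)
print(report)
```

If a model's accuracy is close to the baseline accuracy, the result says little about the minority class, which is worth stating explicitly among the limitations.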

In summary, this project emphasizes the integration of theoretical knowledge of data mining algorithms with practical implementation using open source tools. It assesses students’ ability to handle real-world data, apply multiple techniques, and interpret the results meaningfully, thereby showcasing their competencies in data science and machine learning.
