Included With This Assignment Is An Excel Spreadsheet

Included with this assignment is an Excel spreadsheet that contains data with two dimension values. The purpose of this assignment is to demonstrate the steps performed in a K-Means cluster analysis. Review the "k-MEANS CLUSTERING ALGORITHM" section in Chapter 4 of the Sharda et al. textbook for additional background. Use Excel to perform the following data analysis:

1. Plot the data on a scatter plot.
2. Determine the ideal number of clusters.
3. Choose random center points (centroids) for each cluster. (Note: Each student will select a different random set of centroids.)
4. Using a standard distance formula, measure the distance from each data point to each center point.
5. Assign each data point to an initial cluster region based on closeness.
6. For each cluster, calculate new center points.
7. Repeat steps 4 through 6.

You will use Excel to help with the calculations, but only standard functions should be used (i.e., do not use a plug-in to perform the analysis for you). You need to show your work by doing this analysis the long way. If you were to repeat steps 4 through 6, what would likely happen to the cluster centroids?

Paper for the Above Instructions

The goal of this assignment is to perform a manual K-Means clustering analysis in Excel, demonstrating understanding of the process by calculating and updating cluster centers iteratively. K-Means clustering is an unsupervised machine learning technique that partitions data into k clusters, grouping similar data points based on their features (in this case, two-dimensional values). Rather than relying on automated tools or plug-ins, students are expected to perform each step manually using Excel's basic functions, such as distance calculations, sorting, and averaging, to deepen their comprehension of the clustering process.

The initial step is to visualize the data points on a scatter plot to understand their distribution. This graphical representation provides insight into the potential number of clusters and their approximate locations. Determining the ideal number of clusters is partly subjective; common aids include the elbow method and inspecting the plot for natural groupings. For simplicity, students may choose a reasonable number of clusters based on the visualization, but they should justify their choice.

Once the number of clusters is decided, students randomly select an initial centroid for each cluster. To ensure variability and prevent bias, each student must choose a starting set different from the one in the sample analysis. These centroids are plotted as coordinate points on the scatter plot. Because the choice of starting points influences the final clustering, this step illustrates the importance of initial conditions in K-Means.

Next, students calculate the Euclidean distance from each data point to each centroid using the standard distance formula:

Distance = √[(x₂ - x₁)² + (y₂ - y₁)²]

Using Excel functions such as SQRT and POWER, students compute the distance from each data point to each centroid. Based on these distances, each data point is assigned to the cluster of its nearest centroid, forming the initial clusters.
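To cross-check the spreadsheet arithmetic, this distance-and-assignment step can be sketched in Python. The points and centroids below are hypothetical placeholders, not the assignment's actual data:

```python
from math import sqrt

# Hypothetical sample points and two starting centroids; substitute the
# spreadsheet's actual values and your own randomly chosen centroids.
points = [(2, 3), (3, 3), (6, 8), (7, 9)]
centroids = [(2, 2), (7, 8)]

def euclidean(p, q):
    """Standard distance formula, the same calculation as Excel's
    =SQRT(POWER(x2-x1,2) + POWER(y2-y1,2))."""
    return sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Assign each point to the index of its nearest centroid.
assignments = [
    min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
    for p in points
]
print(assignments)  # → [0, 0, 1, 1]
```

The first two points fall in the cluster of the first centroid and the last two in the second, exactly what a column of nearest-centroid labels in Excel would show.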

Next, for each cluster, students recalculate the centroid by averaging the x-values and y-values of all data points in the cluster. These new positions replace the previous centroids. The process then repeats: distances are recalculated from each point to the updated centroids, points are reassigned by proximity, and new clusters are formed.
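The centroid-update step can be sketched the same way; the clusters shown here are again hypothetical placeholders:

```python
from statistics import mean

# Hypothetical points grouped by their current cluster assignment;
# replace with the actual clusters from the previous step.
clusters = {
    0: [(2, 3), (3, 3)],
    1: [(6, 8), (7, 9)],
}

# New centroid = (average of x-values, average of y-values), mirroring
# Excel's =AVERAGE(...) applied to each coordinate within a cluster.
new_centroids = {
    label: (mean(x for x, _ in pts), mean(y for _, y in pts))
    for label, pts in clusters.items()
}
print(new_centroids)
```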

This iterative process continues until the centroids stabilize—that is, their positions no longer change significantly between iterations, indicating convergence. If the process is repeated multiple times, the cluster centroids tend to settle into positions that best fit the data's natural groupings. In practice, the centroids move towards local minima of the within-cluster sum of squares, which is the objective function that K-Means seeks to minimize.
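The within-cluster sum of squares itself is straightforward to compute; with hypothetical clusters and centroids:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical converged centroids mapped to their cluster members.
clusters = {
    (2.5, 3.0): [(2, 3), (3, 3)],
    (6.5, 8.5): [(6, 8), (7, 9)],
}

# WCSS: total squared distance from every point to its own centroid.
wcss = sum(dist(p, c) ** 2 for c, pts in clusters.items() for p in pts)
print(round(wcss, 2))  # → 1.5
```

Each iteration of K-Means can only keep this total the same or reduce it, which is why the centroids settle rather than wander.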

Understanding what happens if the steps are repeated is crucial. Generally, repeated iterations lead the cluster centroids to stabilize, often pinpointing optimal or near-optimal locations that minimize intra-cluster variance. However, because the initialization is random, different starting centroids can lead to different local minima, resulting in variations in cluster assignments and positions. To mitigate this, multiple runs with different initializations are recommended, selecting the best solution based on the lowest within-cluster sum of squares.
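Putting the steps together, the full iteration can be sketched end to end with made-up data; the loop repeats assignment and centroid updates until no centroid moves, which is the stabilization described above:

```python
from math import dist
from statistics import mean

points = [(1, 1), (2, 1), (4, 3), (5, 4)]  # hypothetical data

def kmeans(points, centroids):
    """Repeat assignment and centroid updates until the centroids stop moving."""
    while True:
        # Steps 4-5: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 6: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        updated = [
            (mean(x for x, _ in c), mean(y for _, y in c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # converged: no centroid moved
            return centroids, clusters
        centroids = updated

final_centroids, final_clusters = kmeans(points, centroids=[(0, 0), (6, 6)])
print(final_centroids)  # → [(1.5, 1), (4.5, 3.5)]
```

Rerunning the loop with different starting centroids can converge to a different local minimum, which is exactly why the assignment has each student pick a different random initialization.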

In conclusion, manually performing K-Means clustering in Excel involves visualizing data, selecting initial centers, calculating distances, assigning clusters, recalculating centers, and repeating until convergence. This process highlights the algorithm's iterative nature and emphasizes the importance of initial centroid selection. Performing this analysis without automated tools fosters a deeper understanding of the mechanics behind K-Means and improves skills in data analysis and Excel functions.

References

  • Sharda, R., Delen, D., & Turban, E. (2020). Business Intelligence, Analytics, and Data Mining. 10th Edition. Pearson.
  • Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques. 3rd Edition. Morgan Kaufmann.
  • Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666.
  • Everitt, B., Landau, S., Leese, M., & Stahl, D. (2011). Cluster Analysis. 5th Edition. Wiley.
  • MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.
  • Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035.
  • Bray, A., & Curtis, J. T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 27(4), 325–349.
  • Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 28(1), 100–108.
  • Fukunaga, K., & Hostetler, L. D. (1975). The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition. IEEE Transactions on Information Theory, 21(1), 32–40.