In Clustering, The Threshold Used To Find Cluster Density
1. In CLIQUE, the threshold used to find cluster density remains constant, even as the number of dimensions increases. This is a potential problem since density drops as dimensionality increases; i.e., to find clusters in higher dimensions the threshold has to be set at a level that may well result in the merging of low-dimensional clusters. Comment on whether you feel this is truly a problem and, if so, how you might modify CLIQUE to address this problem.
2. Name at least one situation in which you would not want to use clustering based on SNN similarity or density.
3. Give an example of a set of clusters in which merging based on the closeness of clusters leads to a more natural set of clusters than merging based on the strength of connection (interconnectedness) of clusters.
4. We take a sample of adults and measure their heights. If we record the gender of each person, we can calculate the average height and the variance of the height, separately, for men and women. Suppose, however, that this information was not recorded. Would it be possible to still obtain this information? Explain.
5. Explain the difference between likelihood and probability.
6. Traditional K-means has a number of limitations, such as sensitivity to outliers and difficulty in handling clusters of different sizes and densities, or with non-globular shapes. Comment on the ability of fuzzy c-means to handle these situations.
7. Clusters of documents can be summarized by finding the top terms (words) for the documents in the cluster, e.g., by taking the most frequent k terms, where k is a constant, say 10, or by taking all terms that occur more frequently than a specified threshold. Suppose that K-means is used to find clusters of both documents and words for a document data set. (a) How might a set of term clusters defined by the top terms in a document cluster differ from the word clusters found by clustering the terms with K-means? (b) How could term clustering be used to define clusters of documents?
8. Suppose we find K clusters using Ward’s method, bisecting K-means, and ordinary K-means. Which of these solutions represents a local or global minimum? Explain.
9. You are given a data set with 100 records and are asked to cluster the data. You use K-means to cluster the data, but for all values of K, 1 ≤ K ≤ 100, the K-means algorithm returns only one non-empty cluster. You then apply an incremental version of K-means, but obtain exactly the same result. How is this possible? How would single link or DBSCAN handle such data?
Solutions
1. In CLIQUE, the constant density threshold presents challenges as dimensionality increases. As dimensions grow, data points become sparse, so the number of points falling in any single grid cell drops sharply. A threshold set low enough to detect dense units in high-dimensional subspaces will flag far too many cells in low-dimensional subspaces, merging clusters that should remain distinct, so this is a genuine problem. One way to modify CLIQUE would be to use a dynamic threshold that scales with the dimensionality of the subspace being examined, which would preserve the integrity of low-dimensional clusters while still accommodating higher dimensions.
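As a rough illustration, the sketch below computes such a dimension-aware threshold; the grid resolution xi, the multiplier factor, and the function name are illustrative assumptions rather than part of CLIQUE itself. With xi equal-width intervals per dimension, a cell in a d-dimensional subspace covers xi^(-d) of the volume, so a uniform baseline expects n / xi^d points per cell.

```python
# Hypothetical dimension-aware density threshold for a CLIQUE-style grid.
# A cell in a d-dimensional subspace with xi intervals per dimension is
# expected to hold n / xi**d points under uniformity; call a cell dense
# if it holds some multiple of that expectation.

def dense_threshold(n_points, n_dims, xi=10, factor=3.0):
    expected = n_points / (xi ** n_dims)  # expected cell count under uniformity
    return factor * expected

for d in (1, 2, 3, 4):
    print(d, dense_threshold(100_000, d))
```

The threshold thus falls with d, mirroring how observed cell counts fall, instead of staying fixed across all subspaces.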
2. Clustering based on Shared Nearest Neighbor (SNN) similarity or density may not be appropriate when the dataset contains a great deal of noise or outliers: SNN similarity is built from each point's k-nearest-neighbor list, and when many of those neighbors are noise points the similarity values become unreliable. In data with highly skewed distributions, applying SNN or other density-based clustering without preprocessing can likewise yield misleading clusters. For instance, in medical data where anomalies can easily skew the results, applying density-based clustering without first assessing the influence of outliers may produce erroneous conclusions.
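For reference, a minimal sketch of how SNN similarity is computed, assuming scikit-learn's NearestNeighbors is available; the function name and the choice of k are illustrative. Each pair's similarity is the size of the overlap of the two points' k-nearest-neighbor lists, which is exactly the quantity that heavy noise corrupts.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    """SNN similarity: for each pair of points, the number of
    nearest neighbors their k-NN lists have in common."""
    idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)[1]
    neighbor_sets = [set(row) for row in idx]
    n = len(X)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = len(neighbor_sets[i] & neighbor_sets[j])
    return sim
```

Note also the pairwise O(n^2) cost of building the full SNN graph, another reason SNN-based methods can be unattractive for very large datasets.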
3. An example of clustering that benefits from closeness rather than interconnectedness could be geographic clustering of customer locations. Consider customer data for a retail chain where determining clusters based on geographical proximity (closeness) offers a more natural representation of shopping behaviors compared to merely assessing connection (interconnectedness) based on purchase history. In this case, customers who live near each other might form clusters that reveal location-based shopping trends that wouldn’t emerge when only focusing on purchase interconnections.
4. Yes. Even without recorded genders, the combined sample of heights can be modeled as a mixture of two normal distributions, one per gender. The parameters of the two components, that is, their means, variances, and mixing proportions, can be estimated from the unlabeled heights with the EM algorithm, which recovers the average height and variance for each group. The estimates will carry more uncertainty than they would with recorded labels, and attributing the recovered components to men and women requires outside knowledge (e.g., that men are taller on average), but the information can still be obtained.
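A minimal sketch of this idea using scikit-learn's GaussianMixture; the heights are simulated here purely for illustration, so the means and variances used to generate them are arbitrary assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated heights in cm; in the exercise the gender labels are unknown.
heights = np.concatenate([rng.normal(178, 7, 500),
                          rng.normal(165, 6, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)
print(gmm.means_.ravel())        # estimated per-group mean heights
print(gmm.covariances_.ravel())  # estimated per-group variances
print(gmm.weights_)              # estimated mixing proportions
```

With well-separated components like these, the recovered parameters land close to the generating values.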
5. The distinction between likelihood and probability is foundational to statistical inference. Probability treats the model parameters as fixed and measures how likely a particular outcome is, i.e., P(data | parameters). Likelihood reverses the roles: it treats the observed data as fixed and measures, as a function of the candidate parameters, how well each parameter setting explains those data, i.e., L(parameters | data) = P(data | parameters) regarded as a function of the parameters. Probabilities over all possible outcomes sum to one; likelihoods over parameter values need not.
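A small numerical illustration with a coin-flip (binomial) model; the counts are arbitrary and SciPy is assumed to be available.

```python
from scipy.stats import binom

# Probability: the parameter is fixed, the outcome varies.
p = 0.5
for heads in (3, 5, 7):
    print(heads, binom.pmf(heads, 10, p))  # P(heads out of 10 flips | p = 0.5)

# Likelihood: the observed outcome is fixed, the parameter varies.
heads_observed = 7
for p in (0.3, 0.5, 0.7):
    print(p, binom.pmf(heads_observed, 10, p))  # L(p | 7 heads), largest at p = 0.7
```

The first loop evaluates part of a proper probability distribution; the second traces the likelihood function, whose maximum near p = 0.7 is the maximum likelihood estimate.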
6. Fuzzy c-means (FCM) assigns each point a degree of membership in every cluster rather than the single hard assignment of K-means, and these membership degrees carry useful information: an outlier, or a point lying between clusters, receives moderate membership in several clusters rather than being forced wholly into one, which at least flags the ambiguity. However, FCM shares most of K-means' underlying limitations. It still represents clusters by centroids and minimizes a squared-distance objective, so outliers still pull centroids toward themselves, and clusters with non-globular shapes or markedly different sizes and densities remain hard to capture. FCM is best viewed as a softer variant of K-means rather than a remedy for these problems.
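For concreteness, a self-contained NumPy sketch of the standard FCM updates; the fuzzifier m = 2 and the tolerance are conventional defaults, and the function name is my own. The squared-distance structure of both update steps makes the kinship with K-means plain.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means: alternate centroid and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)       # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)               # guard against zero distances
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            return centroids, U_new
        U = U_new
    return centroids, U
```

Setting m close to 1 makes the memberships approach hard K-means assignments; larger m makes them fuzzier.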
7. (a) Term clusters defined by the top terms of a document cluster emphasize the words most characteristic of a set of similarly themed documents, so they inherit the thematic coherence of those documents. Word clusters found by running K-means directly on the terms instead group words by the similarity of their occurrence patterns across documents, with no reference to any document grouping; such clusters can cut across topics or mix terms from several themes, since nothing ties them to a coherent set of documents.
(b) Term clustering can define document clusters by assigning each document to the term cluster whose words account for the largest share of it, e.g., the cluster whose terms carry most of the document's total term weight. Groups of co-occurring terms then act as themes, much as in topic modeling, and documents sharing a dominant theme fall into the same cluster, which facilitates analysis and retrieval.
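A toy sketch of that assignment rule, assuming scikit-learn; the documents, the number of clusters, and the tf-idf weighting are all illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell sharply today", "the market rallied on earnings"]

X = TfidfVectorizer().fit_transform(docs).toarray()   # documents x terms
term_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X.T)  # cluster terms

# Assign each document to the term cluster carrying most of its weight.
weight = np.stack([X[:, term_labels == c].sum(axis=1) for c in range(2)], axis=1)
print(weight.argmax(axis=1))   # document cluster = dominant term cluster
```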
8. None of the three methods guarantees a global minimum of the SSE; finding the globally optimal K-way partition is computationally intractable. Ordinary K-means converges to a local minimum of the SSE, which one depending on the initial placement of the centroids. Bisecting K-means and Ward's method make greedy split or merge decisions that are never revisited, so their final K-cluster solutions are not guaranteed to be even local minima of the overall SSE, although Ward's method does minimize the increase in intra-cluster variance at each individual merge. The hierarchical solutions can, however, be refined by using them to initialize an ordinary K-means run.
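K-means' dependence on initialization is easy to see by running it with a single random initialization under different seeds and comparing the final SSE (inertia_ in scikit-learn); the data below are simulated for illustration, and not every seed will produce a distinct minimum.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs.
X = np.concatenate([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])

for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 2))  # differing SSE values = different local minima
```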
9. The natural explanation is that all 100 records are identical, i.e., the data set consists of duplicates of a single point. Every point is then equidistant (at distance zero) from every centroid, ties are broken the same way for every point, and all points are assigned to the same centroid, leaving the remaining K - 1 clusters empty; incremental K-means behaves identically because each incoming point faces the same tie. Single link and DBSCAN would handle such data the same way as each other: single link sees pairwise distances of zero and immediately merges everything into one cluster, and DBSCAN sees one maximally dense region and likewise reports a single cluster with no noise points.
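This behavior can be checked directly with scikit-learn (which warns that it found fewer distinct clusters than requested); the array of identical records is of course contrived.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

X = np.ones((100, 2))  # 100 identical records

km = KMeans(n_clusters=5, n_init=10).fit(X)
print(np.unique(km.labels_))   # one non-empty cluster: [0]

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(np.unique(db.labels_))   # a single dense cluster, no noise: [0]
```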