Clustering Indices

Clustering algorithms aim to group objects according to their intrinsic characteristics or similarity, thereby revealing the structure of the data. Unfortunately, because clustering is unsupervised, the learning error cannot be estimated, and there is no standard procedure for model selection and validation that guarantees the quality of the results. To address this problem, different validation indices have been proposed. Despite the significant effort that several authors have devoted to proposing validation indices for clustering, each index seems to work well only on certain data sets because it lacks generality. The search continues for a general index that recovers the natural clusters in problems with different types of distributions. Validation indices are organized into two types: external criteria and internal criteria. External criteria use a supervised approach to compare the obtained partitions against the desired partition (class labels). Internal criteria, on the other hand, play an important role when no information about the data distribution is available. They assess characteristics of the obtained partitions such as cluster compactness, separation, scattering, discordance and density. Many internal indices have been proposed since the beginning of clustering, and as a result several comparison studies have appeared in the literature that try to decide which index is the best. The objectives of this work are to present most of the internal clustering validation indices (CVIs) proposed to date and to perform a comparative analysis of these CVIs. The indices are grouped and explained according to the approach on which they are based. In the comparison methodology, six clustering algorithms were used to generate the candidate partitions. A total of 178,737 candidate partitions were obtained. In addition, the effect of data standardization was studied.
All the code needed to reproduce the work is also published. The results show that the performance of the CVIs is low and that further research is still required in most aspects of this field.
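To make the distinction between internal and external criteria concrete, the following is a minimal sketch and not the code published with this work. It assumes Euclidean distance and uses the mean silhouette width as a stand-in internal index and the Rand index as a stand-in external index; the toy points, the label vectors and the function names are illustrative assumptions.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def silhouette(points, labels):
    """Internal index: mean silhouette width, computed from data and partition only."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the same cluster (compactness)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the closest other cluster (separation)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def rand_index(labels_a, labels_b):
    """External index: fraction of object pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy data (assumed for illustration): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
truth = [0, 0, 0, 1, 1, 1]           # class labels, needed only by the external index
candidate_good = [0, 0, 0, 1, 1, 1]  # partition matching the natural groups
candidate_bad = [0, 1, 0, 1, 0, 1]   # partition mixing the two groups

for name, cand in [("good", candidate_good), ("bad", candidate_bad)]:
    print(name, "silhouette:", silhouette(points, cand),
          "Rand:", rand_index(truth, cand))
```

The internal index ranks the candidate partitions without access to `truth`, which is the setting a CVI is meant for; the external index is available only when class labels exist and can then be used to check whether the internal ranking picks the desired partition.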