Journal of Harbin Institute of Technology (New Series)  2024, Vol. 31 Issue (5): 41-54  DOI: 10.11916/j.issn.1005-9113.2023115

Citation 

Dan Simovici, Joshua Yee. Inertial Entropy and External Validation of Clusterings[J]. Journal of Harbin Institute of Technology (New Series), 2024, 31(5): 41-54.   DOI: 10.11916/j.issn.1005-9113.2023115

Corresponding author

Dan Simovici, Ph.D., Professor. Email: Dan.Simovici@umb.edu

Article history

Received: 2023-10-20
Inertial Entropy and External Validation of Clusterings
Dan Simovici, Joshua Yee     
Computer Science Department, University of Massachusetts Boston, Boston 02125, USA
Abstract: Axiomatization of Shannon entropy is a subject that has received lots of attention in the information theory literature. While Shannon entropy is defined on probability distributions, we define a new type of entropy on the set of partitions of finite subsets of metric spaces, which has a rich algebraic structure as a partially ordered set. We propose an axiomatization of an entropy-like measure of partitions of sets of objects located in metric spaces, and we derive an analytic expression of this new type of entropy, referred to as inertial entropy. This approach starts with the notion of inertia of a partition and includes a study of the behavior of the sum of square errors of a partition. In this context, we characterize the chain of partitions produced by the Ward hierarchical clustering method. Starting from inertial entropies of partitions, we introduce conditional entropies which, in turn, generate metrics on partitions of finite sets. These metrics are used as external validation tools for clusterings of labeled data sets. The metric generated by inertial entropy can be used to validate data clustering for labeled data sets. This type of validation aims to determine to what extent the labeling of the data coincides with the clustering obtained algorithmically, and we obtain a high degree of consistency of the data labeling with the results of several hierarchical clusterings.
Keywords: partition    inertia    hierarchical clustering    generalized entropy    
0 Introduction

We introduce the notion of inertial entropy, a characteristic of partitions of finite metric spaces. This type of entropy and metrics associated with it are used to investigate the quality of hierarchical clusterings of objects that occur in a metric space.

Inertial entropy belongs to a broader family of numerical characteristics of partitions of finite sets[1]. It can be used to evaluate how tightly objects are grouped around the cluster centers and the similarity of two partitions of a set of objects. In turn, this similarity allows us to evaluate how close a partition produced by a clustering algorithm is to a reference partition determined by the labels of the objects.

Partitions that define clusterings may be obtained via a diversity of clustering algorithms. In particular, hierarchical clustering algorithms construct chains of partitions of the set of objects, usually referred to as dendrograms. The blocks of the initial dendrogram partition of the set of objects S are singletons. Starting from a given partition of S, a new partition is constructed by fusing a pair of blocks of the previous partition in the chain. The blocks to be fused are selected according to a criterion specific to the clustering method. A variety of methods[2] have been developed for constructing such clusterings (single-link, complete-link, Ward clusterings, group-average, centroid clusterings, etc.). Each of these methods favors finding certain types of clusters (elongated clusters in the case of single-link, globular clusters in the case of complete-link, etc.).

A distinct line of research is the consideration of generalization of entropy in the context of fuzzy sets[3-4]. Also, an interesting line of research is the use of entropy in the study of the singular values of tensors related to hypergraphs[5].

We define entropy using partitions rather than probability distributions. This is justified by the fact that sets of partitions have a rich algebraic structure, while probability distributions are limited to reflecting the sizes of the blocks of partitions. Indeed, as we see next, the set of partitions of a set can be naturally equipped with a partial order such that the set of partitions of a finite set is a semimodular lattice with respect to this partial order or a geometric lattice (see Ref.[6]). Partitions are also more expressive than probability distributions because they capture the relationships between set elements and the partition blocks where they are located.

The exploration of entropy axiomatization has a long tradition in information theory. Relevant work includes the results of Khinchin[7], Faddeev[8], and Ingarden and Urbanik[9]. Daróczy[10], Havrda and Charvat[11], and Tsallis and Furuichi[12-13] introduced generalizations of entropy that are suitable for partitions of unstructured spaces. The entropy proposed here deals with partitions of objects placed in metric spaces and makes use of concepts related to the metric space structure such as distance, centroid, etc.

In Section 1, we introduce partitions and discuss properties of the partially ordered set of partitions. In Section 2, we introduce an axiomatization of generalized entropy for partitions which yields, as a special case, the Shannon entropy. Since the entropy proposed in this note applies to sets of objects located in metric spaces, we introduce the inertia and the notion of centroid of a set and establish the monotonicity of the sum of square errors in Section 3. The inertial entropy and its axiomatization are presented in Section 4. Note that the proposed axiomatization is an adaptation of the general axiomatization to sets of objects located in metric spaces, where the notions of inertia of a set of objects and sum of square errors are applicable. Conditional inertial entropies and the metrics generated by these entropies are studied in Section 5. Section 6 presents an application of the metric space of partitions, equipped with the distance generated by the inertial entropy, to the validation of clustering techniques on labeled data. Section 7 presents ideas for future work.

1 Partitions of Finite Sets

We introduce some basic facts that concern partitions. Formally, a partition of a set S is a collection of non-empty subsets of S referred to as blocks, $\pi=\left\{C_{i} \mid i \in I, C_{i} \subseteq S\right\}$ such that $i, j \in I$ and $i \neq j$ implies $C_{i} \cap C_{j}=\emptyset$ and $\cup_{i \in I} C_{i}=S$. If $\pi$ consists of two blocks, $\pi=\{B, C\}$, we refer to $\pi$ as a bipartition. The set of partitions of a finite set S is denoted by PART(S).

Set partitions are fundamental for the study of clusterings because a non-overlapping set clustering can be viewed as a partition whose blocks are the clusters. Thus, we will use the terms "partition block" and "cluster" interchangeably.

A partial order "$\leqslant$" is defined on partitions in PART(S) by setting $\pi \leqslant \sigma$ if each block of $\pi$ is included in a block of $\sigma$. The partition $\alpha_{S}=\{\{x\} \mid x \in S\}$ is the least element of the partially ordered set (PART$(S), \leqslant$), while the single-block partition $\omega_{S}=\{S\}$ is the largest element of $(\operatorname{PART}(S), \leqslant)$.

If $\pi, \sigma \in \operatorname{PART}(S), \pi \leqslant \sigma$ and there is no partition $\tau \in \operatorname{PART}(S)-\{\pi, \sigma\}$ such that $\pi \leqslant \tau \leqslant \sigma$, then we say that $\sigma$ covers $\pi$ and we write $\pi \triangleleft \sigma$. It is easy to show that $\pi \triangleleft \sigma$ if and only if $\sigma$ is obtained from $\pi$ by fusing two of the blocks of $\pi$.
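To make these order-theoretic notions concrete, here is a small Python sketch (ours, purely illustrative; the helper names is_partition, finer, and fuse are not from the paper). It represents partitions as sets of frozensets and checks the refinement relation and the cover-by-fusion property:

```python
def is_partition(pi, S):
    """Check that pi consists of non-empty, pairwise disjoint blocks covering S."""
    blocks = list(pi)
    if not blocks or any(len(b) == 0 for b in blocks):
        return False
    return set().union(*blocks) == set(S) and sum(len(b) for b in blocks) == len(S)

def finer(pi, sigma):
    """The partial order: pi <= sigma iff every block of pi lies inside a block of sigma."""
    return all(any(b <= c for c in sigma) for b in pi)

def fuse(pi, b1, b2):
    """Fuse two blocks of pi; the result covers pi in (PART(S), <=)."""
    return (pi - {b1, b2}) | {b1 | b2}

S = {1, 2, 3, 4}
alpha = {frozenset({x}) for x in S}      # the least partition alpha_S
omega = {frozenset(S)}                    # the largest partition omega_S
pi = {frozenset({1, 2}), frozenset({3}), frozenset({4})}

assert is_partition(pi, S) and finer(alpha, pi) and finer(pi, omega)
sigma = fuse(pi, frozenset({1, 2}), frozenset({3}))   # sigma covers pi
assert finer(pi, sigma) and not finer(sigma, pi)
```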

If S, T are two non-empty, disjoint sets and $\sigma \in \operatorname{PART}(S)$ and $\tau \in \operatorname{PART}(T)$ are $\sigma=\left\{B_{1}, B_{2}, \cdots, B_{m}\right\}$ and $\tau=\left\{C_{1}, C_{2}, \cdots, C_{n}\right\}$, the sum of the partitions $\sigma$ and $\tau$ is the partition $\sigma+\tau \in \operatorname{PART}(S \cup T)$ defined as

$ \sigma+\tau=\left\{B_{1}, B_{2}, \cdots, B_{m}, C_{1}, C_{2}, \cdots, C_{n}\right\} $

Furthermore, if S, T, U are non-empty disjoint sets, $\sigma \in \operatorname{PART}(S), \tau \in \operatorname{PART}(T)$, and $\nu \in$ $\operatorname{PART}(U)$, we have

$\sigma+(\tau+\nu)=(\sigma+\tau)+\nu $

a property referred to as the associativity of partition addition. If S and T are non-empty disjoint sets, then

$ \alpha_{S}+\alpha_{T}=\alpha_{S \cup T} $

If $\sigma=\left\{B_{1}, B_{2}, \cdots, B_{m}\right\} \in \operatorname{PART}(S)$, then we have:

$ \sigma=\omega_{B_{1}}+\omega_{B_{2}}+\cdots+\omega_{B_{m}} $

If the set S consists of a single element, S={s}, then $\alpha_{S}=\omega_{S}=\{\{s\}\}$.

2 An Axiomatization of Generalized Entropy for Partitions

In previous work[1], we proposed an axiomatization of partition entropy for partitions of finite sets that takes into account the number of elements of the blocks of the partition. The sizes of partition blocks define a probability distribution, and this leads to the idea of axiomatizing the entropy of partitions. The proposed system of axioms for the partition entropy $H_{\beta}$, where $\beta>1$, includes the following properties:

(P1) If $\pi, \pi^{\prime} \in \operatorname{PART}(A)$ are such that $\pi \leqslant \pi^{\prime}$, then $H_{\beta}\left(\pi^{\prime}\right) \leqslant H_{\beta}(\pi)$.

(P2) If A and B are two finite sets such that $|A| \leqslant|B|$, then $H_{\beta}\left(\alpha_{A}\right) \leqslant H_{\beta}\left(\alpha_{B}\right)$.

(P3) For all disjoint finite sets A and B and partitions $\pi \in \operatorname{PART}(A)$ and $\sigma \in \operatorname{PART}(B)$, we have:

$ \begin{aligned} & H_{\beta}(\pi+\sigma)=\left(\frac{|A|}{|A|+|B|}\right)^{\beta} H_{\beta}(\pi)+ \\ & \quad\left(\frac{|B|}{|A|+|B|}\right)^{\beta} H_{\beta}(\sigma)+H_{\beta}(\{A, B\}) \end{aligned} $

(P4) If $\pi \in \operatorname{PART}(A)$ and $\sigma \in \operatorname{PART}(B)$, then

$H_{\beta}(\pi \times \sigma)=\varPhi\left(H_{\beta}(\pi), H_{\beta}(\sigma)\right) $

where $\varPhi: \mathbb{R}^{2} \rightarrow \mathbb{R}$ is a continuous function such that $\varPhi(x, y)=\varPhi(y, x)$, and $\varPhi(x, 0)=x$, for $x, y \in \mathbb{R}$.

It was shown in Ref.[1] that a function $H_{\beta}$ that satisfies axioms (P1)-(P4) necessarily has, for any partition $\pi=\left\{B_{1}, B_{2}, \cdots, B_{m}\right\} \in \operatorname{PART}(S)$, the form:

$H_{\beta}(\pi)=\frac{1}{2^{1-\beta}-1}\left(\sum\limits_{i=1}^{m}\left(\frac{\left|B_{i}\right|}{|S|}\right)^{\beta}-1\right) $

When $\beta=2$, $H_{2}(\pi)=2\left(1-\sum\limits_{i=1}^{m}\left(\frac{\left|B_{i}\right|}{|S|}\right)^{2}\right)$, which is twice the value of the Gini index. On the other hand, we have

$ \lim _{\beta \rightarrow 1, \beta>1} H_{\beta}(\pi)=-\sum\limits_{i=1}^{m} \frac{\left|B_{i}\right|}{|S|} \log _{2} \frac{\left|B_{i}\right|}{|S|} $

which corresponds to Shannon entropy.
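As a quick numerical check (our sketch; the block sizes below are arbitrary), the following Python code evaluates $H_\beta$ from block sizes and confirms both the Gini connection at $\beta=2$ and the Shannon limit as $\beta \rightarrow 1$:

```python
import math

def h_beta(sizes, beta):
    """Generalized beta-entropy from the block sizes of a partition (beta > 1)."""
    n = sum(sizes)
    return (sum((s / n) ** beta for s in sizes) - 1) / (2 ** (1 - beta) - 1)

def shannon(sizes):
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes)

sizes = [5, 3, 2]                                    # arbitrary block sizes, |S| = 10
gini = 1 - sum((s / 10) ** 2 for s in sizes)
assert abs(h_beta(sizes, 2) - 2 * gini) < 1e-12      # H_2 is twice the Gini index
assert abs(h_beta(sizes, 1.000001) - shannon(sizes)) < 1e-3   # beta -> 1 gives Shannon
```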

The $\beta$-entropy of a partition depends exclusively on the cardinalities of the blocks of the partition.

When the set S is a subset of a metric space, it becomes possible to define an entropy-like measure determined by the scattering of the elements of the blocks around the centroids of these blocks, which supplements the description of partitions. This new measure, defined in Section 4 and referred to as the inertial entropy, takes advantage of the metric structure that exists for the set of objects.

3 Inertia of a Set of Points

The notion of inertia of a non-empty finite subset S of $\mathbb{R}^{p}$ relative to a vector z originates in the mechanics of solids. The inertia of S relative to a vector $\mathit{\boldsymbol{z}} \in \mathbb{R}^{p}$ is the number

$ I_{\boldsymbol{z}}(S)=\sum\limits_{\boldsymbol{x} \in S}\|\boldsymbol{x}-\boldsymbol{z}\|_{2}^{2} $

If $S=\emptyset, I_{z}(\emptyset)=0$, because the sum that defines the inertia contains no terms.

The centroid of a non-empty finite set of vectors $S=\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{2}, \cdots, \boldsymbol{x}_{m}\right\}$ is defined as:

$ \boldsymbol{c}_{S}=\frac{1}{|S|} \sum\limits_{i=1}^{m} \boldsymbol{x}_{i} $

Huygens' Inertia Theorem (see Ref.[2]) stipulates that

$ I_{\boldsymbol{z}}(S)-I_{\boldsymbol{c}_{S}}(S)=|S|\left\|\boldsymbol{c}_{S}-\boldsymbol{z}\right\|_{2}^{2} $

for every $\mathit{\boldsymbol{z}} \in \mathbb{R}^{p}$. Thus, the minimal value of the inertia $I_{\boldsymbol{z}}(S)$ is achieved for $\mathit{\boldsymbol{z}}=\boldsymbol{c}_{S}$. The special case of the inertia of S relative to the centroid $\boldsymbol{c}_{S}$ is referred to as the sum of square errors. We denote $I_{\boldsymbol{c}_{S}}(S)$ by sse(S).

It is immediate that if $S \neq \emptyset$, then

$\begin{equation*} \operatorname{sse}(S)=\sum\limits_{\boldsymbol{x} \in S}\|\boldsymbol{x}\|^{2}-|S|\left\|\boldsymbol{c}_{S}\right\|^{2} \end{equation*} $ (1)

Furthermore, if the set S is a singleton, then sse(S)=0.
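The following numpy sketch (ours; it uses random data) implements these definitions and verifies Huygens' theorem and identity (1):

```python
import numpy as np

def inertia(S, z):
    """I_z(S): the sum of squared Euclidean distances from the points of S to z."""
    return float(((S - z) ** 2).sum())

def centroid(S):
    return S.mean(axis=0)

def sse(S):
    """Sum of square errors: the inertia of S relative to its own centroid."""
    return inertia(S, centroid(S))

rng = np.random.default_rng(0)
S = rng.normal(size=(10, 3))             # 10 points in R^3
z = rng.normal(size=3)
c = centroid(S)

# Huygens' theorem: I_z(S) - I_c(S) = |S| * ||c - z||^2
assert np.isclose(inertia(S, z) - sse(S), len(S) * np.sum((c - z) ** 2))
# Identity (1): sse(S) = sum of ||x||^2 over S minus |S| * ||c||^2
assert np.isclose(sse(S), (S ** 2).sum() - len(S) * np.sum(c ** 2))
```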

Clustering sets of vectors in $\mathbb{R}^{p}$ amounts to partitioning these sets such that the resulting partitions (referred to as clusterings) satisfy certain desirable properties. Depending on specific circumstances, we may insist that clusters are well-separated, or have approximately equal sizes, or have other properties.

Theorem 1 Let W be a subset of S and let $\sigma=\{U, V\}$ be a bipartition of W. We have:

$ \operatorname{sse}(W)=\operatorname{sse}(U)+\operatorname{sse}(V)+\frac{|U||V|}{|W|}\left\|\boldsymbol{c}_{U}-\boldsymbol{c}_{V}\right\|^{2} $

Proof By applying the definition of the sum of square errors we have:

$ \begin{aligned} & \operatorname{sse}(W)-\operatorname{sse}(U)-\operatorname{sse}(V)=\sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{W}\right\|^{2} \mid\right. \\ & \quad \boldsymbol{x} \in U \cup V\}-\sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{U}\right\|^{2} \mid \boldsymbol{x} \in U\right\}- \\ & \quad \sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{V}\right\|^{2} \mid \boldsymbol{x} \in V\right\} \end{aligned} $

The centroid of W is given by:

$ \boldsymbol{c}_{W}=\frac{1}{|W|} \sum\{\boldsymbol{x} \mid \boldsymbol{x} \in W\}=\frac{|U|}{|W|} \boldsymbol{c}_{U}+\frac{|V|}{|W|} \boldsymbol{c}_{V} $

This allows us to evaluate the variation of the sum of squared errors:

$ \begin{aligned} & \operatorname{sse}(W)-\operatorname{sse}(U)-\operatorname{sse}(V)=\sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{W}\right\|^{2} \mid\right. \\ & \quad \boldsymbol{x} \in U \cup V\}-\sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{U}\right\|^{2} \mid \boldsymbol{x} \in U\right\}- \\ & \sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{V}\right\|^{2} \mid \boldsymbol{x} \in V\right\}=\sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{W}\right\|^{2}-\right. \\ & \left.\left\|\boldsymbol{x}-\boldsymbol{c}_{U}\right\|^{2} \mid \boldsymbol{x} \in U\right\}+\sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{W}\right\|^{2}-\right. \\ & \left.\left\|\boldsymbol{x}-\boldsymbol{c}_{V}\right\|^{2} \mid \boldsymbol{x} \in V\right\} \end{aligned} $

Observe that:

$ \begin{aligned} & \sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{W}\right\|^{2}-\left\|\boldsymbol{x}-\boldsymbol{c}_{U}\right\|^{2} \mid \boldsymbol{x} \in U\right\}= \\ & \quad \sum\limits_{\boldsymbol{x} \in U}\left(\left(\boldsymbol{x}-\boldsymbol{c}_{W}\right)^{\prime}\left(\boldsymbol{x}-\boldsymbol{c}_{W}\right)-\left(\boldsymbol{x}-\boldsymbol{c}_{U}\right)^{\prime}\left(\boldsymbol{x}-\boldsymbol{c}_{U}\right)\right)= \\ & \quad|U|\left(\boldsymbol{c}_{W}^{\prime} \boldsymbol{c}_{W}-\boldsymbol{c}_{U}^{\prime} \boldsymbol{c}_{U}\right)+2\left(\boldsymbol{c}_{U}^{\prime}-\boldsymbol{c}_{W}^{\prime}\right) \sum\limits_{\boldsymbol{x} \in U} \boldsymbol{x}= \\ & \quad|U|\left(\boldsymbol{c}_{W}^{\prime} \boldsymbol{c}_{W}-\boldsymbol{c}_{U}^{\prime} \boldsymbol{c}_{U}\right)+2|U|\left(\boldsymbol{c}_{U}^{\prime}-\boldsymbol{c}_{W}^{\prime}\right) \boldsymbol{c}_{U}= \\ & \quad|U|\left(\left\|\boldsymbol{c}_{W}\right\|^{2}+\left\|\boldsymbol{c}_{U}\right\|^{2}-2 \boldsymbol{c}_{W}^{\prime} \boldsymbol{c}_{U}\right)=|U|\left\|\boldsymbol{c}_{W}-\boldsymbol{c}_{U}\right\|^{2} \end{aligned} $

Using the equality

$ \begin{equation*} \boldsymbol{c}_{W}-\boldsymbol{c}_{U}=\frac{|U|}{|W|} \boldsymbol{c}_{U}+\frac{|V|}{|W|} \boldsymbol{c}_{V}-\boldsymbol{c}_{U}= \\ \frac{|V|}{|W|}\left(\boldsymbol{c}_{V}-\boldsymbol{c}_{U}\right) \end{equation*} $ (2)

we obtain

$ \begin{aligned} \sum\{ & \left.\left\|\boldsymbol{x}-\boldsymbol{c}_{W}\right\|^{2}-\left\|\boldsymbol{x}-\boldsymbol{c}_{U}\right\|^{2} \mid \boldsymbol{x} \in U\right\}= \\ & \frac{|U||V|^{2}}{|W|^{2}}\left\|\boldsymbol{c}_{V}-\boldsymbol{c}_{U}\right\|^{2} \end{aligned} $

In a similar manner, we have:

$ \sum\left\{\left\|\boldsymbol{x}-\boldsymbol{c}_{W}\right\|^{2}-\left\|\boldsymbol{x}-\boldsymbol{c}_{V}\right\|^{2} \mid \boldsymbol{x} \in V\right\}=\frac{|U|^{2}|V|}{|W|^{2}}\left\|\boldsymbol{c}_{V}-\boldsymbol{c}_{U}\right\|^{2} $

so,

$ \begin{equation*} \operatorname{sse}(W)-\operatorname{sse}(U)-\operatorname{sse}(V)=\frac{|U||V|}{|W|}\left\|\boldsymbol{c}_{V}-\boldsymbol{c}_{U}\right\|^{2} \end{equation*} $ (3)

which is the needed equality.
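Theorem 1 is easy to check numerically; here is a short sketch (ours), assuming the centroid and sse helpers from the previous sketch:

```python
import numpy as np
# assumes the centroid() and sse() helpers from the previous sketch

rng = np.random.default_rng(1)
W = rng.normal(size=(12, 2))
U, V = W[:5], W[5:]                      # the bipartition {U, V} of W

ward_term = len(U) * len(V) / len(W) * np.sum((centroid(U) - centroid(V)) ** 2)
assert np.isclose(sse(W), sse(U) + sse(V) + ward_term)    # Theorem 1
assert sse(U) <= sse(W)                                    # Corollary 1
```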

Corollary 1 If U, W are two finite subsets of $\mathbb{R}^{p}$ such that $U \subseteq W$, we have $\operatorname{sse}(U) \leqslant \operatorname{sse}(W)$.

Proof The conclusion is immediate if U=W. If $U \subset W$, an application of Theorem 1 to the bipartition $\{U, W-U\}$ of W yields $\operatorname{sse}(W) \geqslant \operatorname{sse}(U)$.

If $\pi \in \operatorname{PART}(S)$ and $\pi=\left\{B_{1}, B_{2}, \cdots, B_{m}\right\}$, the sum of square errors of $\pi$ is defined as

$\begin{equation*} \operatorname{sse}(\pi)=\sum\limits_{i=1}^{m} \operatorname{sse}\left(B_{i}\right) \end{equation*} $ (4)

Note that $\operatorname{sse}\left(\alpha_{S}\right)=0$. Also, $\operatorname{sse}(\pi)$ is a monotonic function of partitions, that is, $\pi \leqslant \sigma$ implies sse$(\pi) \leqslant \operatorname{sse}(\sigma)$.

The sum of the square errors of a partition plays an important role in analyzing properties of various clustering algorithms. One example is the proof of convergence of the k-means algorithm, where the sum of the square errors of the current clustering decreases along the sequence of clusterings obtained by successive object reassignments.

4 Inertial Entropy and Its Axiomatization

Let $P_{\text {fin }}\left(\mathbb{R}^{p}\right)$ be the collection of finite subsets of $\mathbb{R}^{p}$. We introduce the inertial entropy of a partition of a finite subset S of $\mathbb{R}^{p}$ as a function H, defined on the partitions of the sets in $P_{\text {fin }}\left(\mathbb{R}^{p}\right)$ and taking values in $\mathbb{R}_{\geqslant 0}$, that satisfies the following axioms:

·Initial values axiom: For every set S, we have $H\left(\alpha_{S}\right)=1$ and $H\left(\omega_{S}\right)=0$.

·Addition axiom: If S and T are finite disjoint sets, $\pi \in \operatorname{PART}(S)$, and $\sigma \in \operatorname{PART}(T)$ then we have:

$H(\pi+\sigma)=\frac{\operatorname{sse}(S)}{\operatorname{sse}(S \cup T)} H(\pi)+\frac{\operatorname{sse}(T)}{\operatorname{sse}(S \cup T)} H(\sigma)+H(\{S, T\}) $

Note the similarity between the addition axiom for inertial entropy and the axiom P3 for partition entropy.

Lemma 1 Let U, V be two disjoint, finite, and non-empty subsets of $\mathbb{R}^{p}$, and let $\pi$ and $\pi^{\prime}$ be partitions in PART$(U \cup V)$ defined as $\pi=\sigma+\alpha_{V}$ and $\pi^{\prime}=\sigma+$ $\omega_{V}$, where $\sigma \in \operatorname{PART}(U)$. Then, we have

$ H(\boldsymbol{\pi})=H\left(\pi^{\prime}\right)+\frac{\operatorname{sse}(V)}{\operatorname{sse}(U \cup V)} $

Proof By the Addition Axiom, we have:

$ \begin{aligned} & H\left(\sigma+\alpha_{V}\right)=\frac{\operatorname{sse}(U)}{\operatorname{sse}(U \cup V)} H(\sigma)+\frac{\operatorname{sse}(V)}{\operatorname{sse}(U \cup V)}+ \\ & H(\{U, V\}) \end{aligned} $

and

$ \begin{aligned} & H\left(\sigma+\omega_{V}\right)=\frac{\operatorname{sse}(U)}{\operatorname{sse}(U \cup V)} H(\sigma)+\frac{\operatorname{sse}(V)}{\operatorname{sse}(U \cup V)} . \\ & H\left(\omega_{V}\right)+H(\{U, V\})=\frac{\operatorname{sse}(U)}{\operatorname{sse}(U \cup V)} H(\sigma)+ \\ & H(\{U, V\}) \end{aligned} $

which implies

$H(\pi)=H\left(\pi^{\prime}\right)+\frac{\operatorname{sse}(V)}{\operatorname{sse}(U \cup V)} $

because $H\left(\alpha_{V}\right)=1$ and $H\left(\omega_{V}\right)=0$.

The next theorem shows that the inertial entropy of a partition is a linear function of the sum of the square errors of the blocks of the partition.

Theorem 2 If $\pi=\left\{B_{1}, B_{2}, \cdots, B_{m}\right\}$ is a partition of a finite and non-empty subset S of $\mathbb{R}^{p}$, with $|S| \geqslant 2$, then

$ H(\boldsymbol{\pi})=1-\frac{1}{\operatorname{sse}(S)} \sum\limits_{i=1}^{m} \operatorname{sse}\left(B_{i}\right)=1-\frac{\operatorname{sse}(\pi)}{\operatorname{sse}\left(\omega_{S}\right)} $

Proof Let $\pi=\left\{B_{1}, B_{2}, \cdots, B_{m}\right\}$ be a partition of S. We have

$ \pi=\omega_{B_{1}}+\omega_{B_{2}}+\cdots+\omega_{B_{m}} $

Starting from $\pi$ consider the descending sequence of partitions

$ \begin{gathered} \pi_{0}=\omega_{B_{1}}+\omega_{B_{2}}+\cdots+\omega_{B_{m}}=\pi \\ \pi_{1}=\alpha_{B_{1}}+\omega_{B_{2}}+\cdots+\omega_{B_{m}} \\ \pi_{2}=\alpha_{B_{1}}+\alpha_{B_{2}}+\cdots+\omega_{B_{m}} \\ \vdots \\ \pi_{m}=\alpha_{B_{1}}+\alpha_{B_{2}}+\cdots+\alpha_{B_{m}}=\alpha_{S} \end{gathered} $

Let

$ \sigma_{i}=\alpha_{B_{1}}+\cdots+\alpha_{B_{i}}+\omega_{B_{i+2}}+\cdots+\omega_{B_{m}} \in \operatorname{PART}\left(S-B_{i+1}\right) $

Then, we have $\pi_{i}=\sigma_{i}+\omega_{B_{i+1}}$ and $\pi_{i+1}=\sigma_{i}+ \alpha_{B_{i+1}}$. By Lemma 1, we have:

$ H\left(\pi_{i+1}\right)=H\left(\pi_{i}\right)+\frac{\operatorname{sse}\left(B_{i+1}\right)}{\operatorname{sse}(S)} $

Thus, we have

$ \begin{gathered} H\left(\pi_{1}\right)=H\left(\pi_{0}\right)+\frac{\operatorname{sse}\left(B_{1}\right)}{\operatorname{sse}(S)} \\ H\left(\pi_{2}\right)=H\left(\pi_{1}\right)+\frac{\operatorname{sse}\left(B_{2}\right)}{\operatorname{sse}(S)} \\ \vdots \\ H\left(\pi_{m}\right)=H\left(\pi_{m-1}\right)+\frac{\operatorname{sse}\left(B_{m}\right)}{\operatorname{sse}(S)} \end{gathered} $

which yields $H\left(\pi_{m}\right)=H\left(\pi_{0}\right)+\frac{1}{\operatorname{sse}(S)} \sum\limits_{i=1}^{m} \operatorname{sse}\left(B_{i}\right)$. Since $\pi_{0}=\pi$ and $\pi_{m}=\alpha_{S}$, we have $H\left(\pi_{m}\right)=1$, and

$H(\boldsymbol{\pi})=1-\frac{1}{\operatorname{sse}(S)} \sum\limits_{i=1}^{m} \operatorname{sse}\left(B_{i}\right) $

An equivalent expression of the inertial entropy of a partition $\pi=\left\{B_{1}, B_{2}, \cdots, B_{n}\right\} \in \operatorname{PART}(S)$ is:

$ \begin{equation*} H(\pi)=\frac{\sum\limits_{i=1}^{n}\left|B_{i}\right|\left\|\boldsymbol{c}_{i}\right\|^{2}-|S|\|\boldsymbol{c}\|^{2}}{\sum\limits_{\boldsymbol{x} \in S}\|\boldsymbol{x}\|^{2}-|S|\|\boldsymbol{c}\|^{2}} \end{equation*} $ (5)

where $\boldsymbol{c}_{i}$ is the centroid of block Bi for $1 \leqslant i \leqslant n$, and c is the centroid of S.

Note that $H\left(\omega_{S}\right)=0$, $H\left(\alpha_{S}\right)=1$, and for any $\pi \in \operatorname{PART}(S)$, we have $0 \leqslant H(\pi) \leqslant 1$.

Theorem 3 Let $\pi, \sigma \in \operatorname{PART}(S)$, where $\pi=\left\{B_{1}, B_{2}, \cdots, B_{m-1}, B_{m}\right\}$. If $\sigma$ is obtained from $\pi$ by fusing the blocks $B_{m-1}$ and $B_{m}$ (so that $\pi \triangleleft \sigma$), then

$ H(\sigma)=H(\pi)-\frac{1}{\operatorname{sse}(S)} \cdot \frac{\left|B_{m-1}\right|\left|B_{m}\right|}{\left|B_{m-1}\right|+\left|B_{m}\right|}\left\|\boldsymbol{c}_{m-1}-\boldsymbol{c}_{m}\right\|^{2} $

where $\boldsymbol{c}_{m-1}$ and $\boldsymbol{c}_{m}$ are the centroids of $B_{m-1}$ and $B_{m}$, respectively.

Proof By Theorem 1, we have:

$ \begin{aligned} H(\sigma)= & 1-\frac{1}{\operatorname{sse}(S)}\left(\sum\limits_{i=1}^{m-2} \operatorname{sse}\left(B_{i}\right)+\operatorname{sse}\left(B_{m-1} \cup B_{m}\right)\right)= \\ & 1-\frac{1}{\operatorname{sse}(S)}\left(\sum\limits_{i=1}^{m-2} \operatorname{sse}\left(B_{i}\right)+\operatorname{sse}\left(B_{m-1}\right)+\operatorname{sse}\left(B_{m}\right)+\frac{\left|B_{m-1}\right|\left|B_{m}\right|}{\left|B_{m-1}\right|+\left|B_{m}\right|}\left\|\boldsymbol{c}_{m-1}-\boldsymbol{c}_{m}\right\|^{2}\right)= \\ & H(\pi)-\frac{1}{\operatorname{sse}(S)} \cdot \frac{\left|B_{m-1}\right|\left|B_{m}\right|}{\left|B_{m-1}\right|+\left|B_{m}\right|}\left\|\boldsymbol{c}_{m-1}-\boldsymbol{c}_{m}\right\|^{2} \end{aligned} $

Corollary 2 The inertial entropy is dually monotonic. In other words, if $\pi, \sigma \in \operatorname{PART}(S)$ and $\pi \leqslant \sigma$, we have $H(\pi) \geqslant H(\sigma)$.

Proof This is an immediate consequence of Theorem 3.

Example 1 Consider the set $S=\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{2}, \boldsymbol{x}_{3}\right.$, $\left.\boldsymbol{x}_{4}\right\}$ shown in Fig. 1, where

$ \boldsymbol{x}_{1}=\binom{0}{1}, \boldsymbol{x}_{2}=\binom{1}{1}, \boldsymbol{x}_{3}=\binom{0}{0}, \boldsymbol{x}_{4}=\binom{1}{0} $
Fig.1 Set $\left\{\mathit{\boldsymbol{x}}_1, \mathit{\boldsymbol{x}}_2, \mathit{\boldsymbol{x}}_3, \mathit{\boldsymbol{x}}_4\right\}$ in $\mathbb{R}^2$

Consider also the partition $\pi=\left\{\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{4}\right\},\left\{\boldsymbol{x}_{2}, \boldsymbol{x}_{3}\right\}\right\}$ of S.

The centroid of $\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{4}\right\}$ is $\boldsymbol{c}=\left(\frac{1}{2}, \frac{1}{2}\right)$. The same point c is the centroid of $\left\{\boldsymbol{x}_{2}, \boldsymbol{x}_{3}\right\}$ and of the entire set S. We have:

$ \begin{gathered} \operatorname{sse}(S)=4\left(\frac{\sqrt{2}}{2}\right)^{2}=2 \\ \operatorname{sse}\left(\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{4}\right\}\right)=\left\|\boldsymbol{x}_{1}-\boldsymbol{c}\right\|^{2}+\left\|\boldsymbol{x}_{4}-\boldsymbol{c}\right\|^{2}=1 \\ \operatorname{sse}\left(\left\{\boldsymbol{x}_{2}, \boldsymbol{x}_{3}\right\}\right)=\left\|\boldsymbol{x}_{2}-\boldsymbol{c}\right\|^{2}+\left\|\boldsymbol{x}_{3}-\boldsymbol{c}\right\|^{2}=1 \end{gathered} $

hence

$ \begin{gathered} H(\pi)=1-\frac{1}{\operatorname{sse}(S)}\left(\operatorname{sse}\left(\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{4}\right\}\right)+\operatorname{sse}\left(\left\{\boldsymbol{x}_{2}, \right.\right.\right. \\ \left.\left.\left.\boldsymbol{x}_{3}\right\}\right)\right)=1-\frac{1}{2}(1+1)=0 \end{gathered} $
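The computation of Example 1 can be replayed mechanically from Theorem 2. In the sketch below (our helper inertial_entropy; a partition is given as a list of index blocks into the point array), the partition $\pi$ indeed has entropy 0 and $\alpha_S$ has entropy 1:

```python
import numpy as np
# assumes the sse() helper from the earlier sketch

def inertial_entropy(X, blocks):
    """H(pi) = 1 - sse(pi)/sse(S) (Theorem 2); blocks are lists of row indices of X."""
    return 1.0 - sum(sse(X[list(b)]) for b in blocks) / sse(X)

X = np.array([[0, 1], [1, 1], [0, 0], [1, 0]], dtype=float)   # x1, x2, x3, x4
print(inertial_entropy(X, [[0, 3], [1, 2]]))      # pi = {{x1, x4}, {x2, x3}} -> 0.0
print(inertial_entropy(X, [[0], [1], [2], [3]]))  # alpha_S -> 1.0
print(inertial_entropy(X, [[0, 1], [2, 3]]))      # {{x1, x2}, {x3, x4}} -> 0.5
```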

In Fig. 2, the inertial entropies of the partitions of the set S are shown in bold letters.

Fig.2 The annotated Hasse diagram of ($\text { PART }(\{\mathbf{1 , 2 , 3}, \mathbf{4}\}), \leqslant$)

Theorem 3 implies that the decrease in inertial entropy that results from fusing two blocks of a partition is proportional to the squared distance between the centroids of the fused blocks and to the factor $\frac{\left|B_{m-1}\right|\left|B_{m}\right|}{\left|B_{m-1}\right|+\left|B_{m}\right|}$, which is half the harmonic mean of the sizes of the fused blocks.

The choice of a particular hierarchical clustering method has an obvious impact on the variation of entropy. In each of the methods discussed below the pair of clusters (U, V) to be fused is selected such that the corresponding cluster dissimilarity is minimized.

Table 1 contains the cluster dissimilarity that is minimized when two clusters U and V are fused in some of the most popular hierarchical clustering methods.

Table 1 Dissimilarities used in clustering methods

Single link: $\min \{\|\boldsymbol{x}-\boldsymbol{y}\| \mid \boldsymbol{x} \in U, \boldsymbol{y} \in V\}$
Complete link: $\max \{\|\boldsymbol{x}-\boldsymbol{y}\| \mid \boldsymbol{x} \in U, \boldsymbol{y} \in V\}$
Group average: $\frac{1}{|U| \cdot|V|} \sum\{\|\boldsymbol{x}-\boldsymbol{y}\| \mid \boldsymbol{x} \in U, \boldsymbol{y} \in V\}$
Centroid: $\left\|\boldsymbol{c}_{U}-\boldsymbol{c}_{V}\right\|^{2}$
Ward: $\operatorname{ward}(U, V)=\frac{|U| \cdot|V|}{|U|+|V|}\left\|\boldsymbol{c}_{U}-\boldsymbol{c}_{V}\right\|^{2}$

This allows us to characterize sequences of partitions that are produced by Ward clusterings[14-15].

Theorem 4 Let S be a finite subset of $\mathbb{R}^{p}$. A sequence of partitions ($\pi_{1}, \pi_{2}, \cdots, \pi_{m}$) is obtained by the Ward method if and only if $H\left(\pi_{i}\right)-H\left(\pi_{i+1}\right)$ is minimal at each step, $1 \leqslant i \leqslant m-1$.

Proof Let $\left(\pi_{1}, \pi_{2}, \cdots, \pi_{m}\right)$ be a sequence obtained by the Ward method. Then, $\pi_{i+1}$ is obtained from $\pi_{i}$ by fusing two blocks B and $B^{\prime}$ of $\pi_{i}$ such that $\operatorname{ward}\left(B, B^{\prime}\right)$ is minimal. By Theorem 3, we have

$H\left(\pi_{i}\right)-H\left(\pi_{i+1}\right)=\frac{1}{\operatorname{sse}(S)} \operatorname{ward}\left(B, B^{\prime}\right) $

where B and $B^{\prime}$ are the blocks of $\pi_{i}$ that were fused to produce $\pi_{i+1}$. By the choice of B and $B^{\prime}$, $\operatorname{ward}\left(B, B^{\prime}\right)$ is minimal, and the desired conclusion follows.

Conversely, if the sequence ($\pi_{1}, \pi_{2}, \cdots, \pi_{m}$) is such that $H\left(\pi_{i}\right)-H\left(\pi_{i+1}\right)$ is minimal at each step, by the same theorem, it follows that

$ H\left(\pi_{i}\right)-H\left(\pi_{i+1}\right)=\frac{1}{\operatorname{sse}(S)} \operatorname{ward}\left(B, B^{\prime}\right) $

so B and $B^{\prime}$ are selected such that $\operatorname{ward}\left(B, B^{\prime}\right)$ is minimal, which implies that the sequence $\left(\pi_{1}, \pi_{2}, \cdots, \pi_{m}\right)$ was obtained through the Ward method.

In Fig. 2, the chains of partitions obtained by Ward clustering are represented by thick continuous lines.
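The characterization of Theorem 4 suggests a direct implementation: at each step, fuse the pair of blocks minimizing the Ward dissimilarity and observe that the inertial entropy drops by exactly ward(U, V)/sse(S), per Theorem 3. A greedy sketch (ours, reusing the earlier helpers; it is not intended as an efficient implementation):

```python
import numpy as np
from itertools import combinations
# assumes the centroid(), sse(), and inertial_entropy() helpers from the earlier sketches

def ward(X, b1, b2):
    """The Ward dissimilarity between the index blocks b1 and b2."""
    d2 = np.sum((centroid(X[list(b1)]) - centroid(X[list(b2)])) ** 2)
    return len(b1) * len(b2) / (len(b1) + len(b2)) * d2

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
pi = [[i] for i in range(len(X))]        # start from alpha_S
while len(pi) > 1:
    h_before = inertial_entropy(X, pi)
    # fuse the pair of blocks with minimal Ward dissimilarity
    i, j = min(combinations(range(len(pi)), 2), key=lambda p: ward(X, pi[p[0]], pi[p[1]]))
    cost = ward(X, pi[i], pi[j])
    pi = [b for k, b in enumerate(pi) if k not in (i, j)] + [pi[i] + pi[j]]
    # Theorem 3: each fusion decreases H by exactly ward(U, V)/sse(S)
    assert np.isclose(h_before - inertial_entropy(X, pi), cost / sse(X))
```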

5 Conditional Inertial Entropy and Related Metrics

External clustering validation consists of comparing the results of a clustering algorithm to an externally known result, such as externally provided class labels. This type of validation can be used for selecting the right clustering algorithm for a specific dataset and has been examined in Refs. [16-22].

In this paper, we evaluate the distance between a reference partition and the partition supplied by a clustering algorithm using the metric on the set of partitions of the set of objects generated by the inertial entropy.

Let $\pi=\left\{B_{1}, B_{2}, \cdots, B_{m}\right\} \in \operatorname{PART}(S)$ and let $C \subseteq S$. The trace of $\pi$ on C is the partition

$\pi_{C}=\{B \cap C \mid B \in \pi, B \cap C \neq \emptyset\} $

We have $\pi_{C} \in \operatorname{PART}(C)$; also, if C is a block of $\pi$, then $\pi_{C}=\omega_{C}$.

Definition 1 Let $\pi, \sigma \in \operatorname{PART}(S)$, where $\sigma=\left\{C_{1}, C_{2}, \cdots, C_{n}\right\}$. The conditional inertial entropy of $\pi$ relative to $\sigma$ is

$ H(\pi \mid \sigma)=\sum\limits_{j=1}^{n} \frac{\operatorname{sse}\left(C_{j}\right)}{\operatorname{sse}(S)} H\left(\pi_{C_{j}}\right) $

Note that

$ \begin{gathered} H\left(\pi \mid \omega_{S}\right)=H(\pi) \\ H\left(\omega_{S} \mid \sigma\right)=\sum\limits_{j=1}^{n} \frac{\operatorname{sse}\left(C_{j}\right)}{\operatorname{sse}(S)} H\left(\omega_{C_{j}}\right)=0 \end{gathered} $

and $H\left(\pi \mid \alpha_{S}\right)=0$ for every $\pi \in \operatorname{PART}(S)$.
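A sketch of Definition 1 (ours; the helper names trace and cond_entropy are hypothetical, and the sse and inertial_entropy helpers from the earlier sketches are assumed), checking the identities noted above:

```python
import numpy as np
# assumes the sse() and inertial_entropy() helpers from the earlier sketches

def trace(pi, C):
    """The trace of pi on C: the non-empty intersections B ∩ C for B in pi."""
    CC = set(C)
    return [[g for g in b if g in CC] for b in pi if CC & set(b)]

def cond_entropy(X, pi, sigma):
    """H(pi | sigma) per Definition 1; partitions are lists of index blocks."""
    total, h = sse(X), 0.0
    for C in sigma:
        w = sse(X[list(C)])
        if w > 0:                                   # singleton blocks contribute nothing
            pos = {g: k for k, g in enumerate(C)}   # re-index the points of C from 0
            blocks = [[pos[g] for g in b] for b in trace(pi, C)]
            h += (w / total) * inertial_entropy(X[list(C)], blocks)
    return h

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))
pi = [[0, 1, 2], [3, 4, 5]]
omega = [list(range(6))]
alpha = [[i] for i in range(6)]
assert np.isclose(cond_entropy(X, pi, omega), inertial_entropy(X, pi))  # H(pi | omega_S) = H(pi)
assert np.isclose(cond_entropy(X, pi, alpha), 0.0)                      # H(pi | alpha_S) = 0
```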

For $\pi=\left\{B_{1}, B_{2}, \cdots, B_{m}\right\}$ and $\sigma=\left\{C_{1}, C_{2}, \cdots\right.$, $\left.C_{n}\right\}$ in $\operatorname{PART}(S)$, the conditional inertial entropy can be written explicitly as:

$ \begin{aligned} H(\pi \mid \sigma) & =\sum\limits_{j=1}^{n} \frac{\operatorname{sse}\left(C_{j}\right)}{\operatorname{sse}(S)} H\left(\pi_{C_{j}}\right)=\sum\limits_{j=1}^{n} \frac{\operatorname{sse}\left(C_{j}\right)}{\operatorname{sse}(S)}\left(1-\frac{1}{\operatorname{sse}\left(C_{j}\right)} \sum\limits_{i=1}^{m} \operatorname{sse}\left(B_{i} \cap C_{j}\right)\right)= \\ & \frac{1}{\operatorname{sse}(S)} \sum\limits_{j=1}^{n}\left(\operatorname{sse}\left(C_{j}\right)-\sum\limits_{i=1}^{m} \operatorname{sse}\left(B_{i} \cap C_{j}\right)\right)=H(\pi \wedge \sigma)-H(\sigma) \end{aligned} $

Since $H(\pi \wedge \sigma)=H(\pi \mid \sigma)+H(\sigma)$ and, by Corollary 2, $\pi \wedge \sigma \leqslant \pi$ implies $H(\pi) \leqslant H(\pi \wedge \sigma)$, it follows that

$ \begin{equation*} H(\boldsymbol{\pi}) \leqslant H(\boldsymbol{\pi} \mid \sigma)+H(\boldsymbol{\sigma}) \end{equation*} $ (6)

for $\pi, \sigma \in \operatorname{PART}(S)$.

Theorem 5 Let $\pi, \sigma \in \operatorname{PART}(S)$ be two partitions of a finite set S. We have $H(\pi \mid \sigma)=0$ if and only if $\sigma \leqslant \pi$.

Proof Suppose that $\sigma=\left\{C_{1}, C_{2}, \cdots, C_{n}\right\}$. If $\sigma \leqslant \pi$, then $\pi_{C j}=\omega_{C_{j}}$ and, therefore,

$ H(\pi \mid \sigma)=\sum\limits_{j=1}^{n} \frac{\operatorname{sse}\left(C_{j}\right)}{\operatorname{sse}(S)} H\left(\omega_{C_{j}}\right)=0 $

Conversely, suppose that

$ H(\pi \mid \sigma)=\sum\limits_{j=1}^{n} \frac{\operatorname{sse}\left(C_{j}\right)}{\operatorname{sse}(S)} H\left(\pi_{C j}\right)=0 $

This implies $H\left(\pi_{C_{j}}\right)=0$ whenever $\operatorname{sse}\left(C_{j}\right)>0$, which means that $\pi_{C_{j}}=\omega_{C_{j}}$; if $\operatorname{sse}\left(C_{j}\right)=0$, then $C_{j}$ is a singleton and $\pi_{C_{j}}=\omega_{C_{j}}$ holds trivially. Therefore, each block $C_{j}$ of $\sigma$ is included in a block of $\pi$, so $\sigma \leqslant \pi$.

Corollary 3 Let $\pi, \sigma \in \operatorname{PART}(S)$, where S is a finite set. We have

$H(\pi \wedge \sigma)=H(\pi \mid \sigma)+H(\sigma)=H(\sigma \mid \pi)+H(\pi) $

Proof This is a direct consequence of the expression of the conditional inertial entropy obtained above, applied to both $H(\pi \mid \sigma)$ and $H(\sigma \mid \pi)$, since $\pi \wedge \sigma=\sigma \wedge \pi$.

Next, we show that the conditional inertial entropy is dually monotonic with respect to its first argument and monotonic with respect to its second argument.

Theorem 6 Let $\pi, \sigma, \sigma^{\prime} \in \operatorname{PART}(S)$, where S is a finite subset of $\mathbb{R}^{p}$. If $\sigma \leqslant \sigma^{\prime}$, then $H(\sigma \mid \pi) \geqslant$ $H\left(\sigma^{\prime} \mid \pi\right)$, and $H(\pi \mid \sigma) \leqslant H\left(\pi \mid \sigma^{\prime}\right)$.

Proof Since $\sigma \leqslant \sigma^{\prime}$, we have $\pi \wedge \sigma \leqslant \pi \wedge$ $\sigma^{\prime}$, so $H(\pi \wedge \sigma) \geqslant H\left(\pi \wedge \sigma^{\prime}\right)$. Therefore, $H(\sigma \mid$ $\pi)+H(\pi) \geqslant H\left(\sigma^{\prime} \mid \pi\right)+H(\pi)$, which implies $H(\sigma \mid \pi) \geqslant H\left(\sigma^{\prime} \mid \pi\right)$.

For the second part, it suffices to prove the inequality for partitions $\sigma$ and $\sigma^{\prime}$ such that $\sigma \triangleleft \sigma^{\prime}$. Without loss of generality, assume that

$ \begin{gathered} \sigma=\left\{C_{1}, C_{2}, \cdots, C_{n-2}, C_{n-1}, C_{n}\right\} \\ \sigma^{\prime}=\left\{C_{1}, C_{2}, \cdots, C_{n-2}, C_{n-1} \cup C_{n}\right\} \end{gathered} $

We have

$ \begin{aligned} & H\left(\pi \mid \sigma^{\prime}\right)=\sum\limits_{j=1}^{n-2} \frac{\operatorname{sse}\left(C_{j}\right)}{\operatorname{sse}(S)} H\left(\pi_{C_{j}}\right)+\frac{\operatorname{sse}\left(C_{n-1} \cup C_{n}\right)}{\operatorname{sse}(S)} H\left(\pi_{C_{n-1} \cup C_{n}}\right) \geqslant \\ & \sum\limits_{j=1}^{n-2} \frac{\operatorname{sse}\left(C_{j}\right)}{\operatorname{sse}(S)} H\left(\pi_{C_{j}}\right)+\frac{\operatorname{sse}\left(C_{n-1}\right)}{\operatorname{sse}(S)} H\left(\pi_{C_{n-1}}\right)+\frac{\operatorname{sse}\left(C_{n}\right)}{\operatorname{sse}(S)} H\left(\pi_{C_{n}}\right) \quad \text{(by the addition axiom)}= \\ & H(\pi \mid \sigma) \end{aligned} $

Theorem 7 Let $\pi, \sigma, \tau$ be three partitions of the finite set S, where $S \subseteq \mathbb{R}^{p}$. We have:

$H(\pi \mid \sigma \wedge \tau)+H(\sigma \mid \tau)=H(\pi \wedge \sigma \mid \tau) $

Proof By Corollary 3, we have

$ H(\pi \mid \sigma \wedge \tau)=H(\pi \wedge \sigma \wedge \tau)-H(\sigma \wedge \tau) $
$ H(\sigma \mid \tau)=H(\sigma \wedge \tau)-H(\tau) $

By adding these equalities, we have

$ H(\boldsymbol{\pi} \mid \sigma \wedge \boldsymbol{\tau})+H(\boldsymbol{\sigma} \mid \boldsymbol{\tau})=H(\boldsymbol{\pi} \wedge \sigma \wedge \boldsymbol{\tau})-H(\boldsymbol{\tau}) $

A further application of Corollary 3 yields the desired equality.

Theorem 8 Let $\pi, \sigma, \tau$ be three partitions of the finite set S, where $S \subseteq \mathbb{R}^{p}$. We have:

$ H(\boldsymbol{\pi} \mid \sigma)+H(\sigma \mid \tau) \geqslant H(\boldsymbol{\pi} \mid \boldsymbol{\tau}) $

Proof The monotonicity of the conditional inertial entropy in its second argument and its dual monotonicity in the first argument allow us to write:

$ \begin{gathered} H(\pi \mid \sigma)+H(\sigma \mid \tau) \geqslant H(\pi \mid \sigma \wedge \tau)+ \\ H(\sigma \mid \tau)=H(\pi \wedge \sigma \mid \tau) \geqslant H(\boldsymbol{\pi} \mid \tau) \end{gathered} $

which is the desired inequality.

Corollary 4 Let $\pi, \sigma$ be two partitions of the finite set S, where $S \subseteq \mathbb{R}^{p}$. We have

$ H(\boldsymbol{\pi} \vee \boldsymbol{\sigma})+H(\pi \wedge \sigma) \leqslant H(\pi)+H(\sigma) $

Proof Let $\tau \in \operatorname{PART}(S)$. By Theorem 8, we have $H(\pi \mid \sigma) \leqslant H(\pi \mid \tau)+H(\tau \mid \sigma)$. Replacing the conditional inertial entropies, the following inequality is obtained:

$ H(\pi \wedge \sigma)-H(\sigma) \leqslant H(\pi \wedge \tau)-H(\tau)+H(\tau \wedge \sigma)-H(\sigma) $

which implies

$ H(\boldsymbol{\tau})+H(\boldsymbol{\pi} \wedge \boldsymbol{\sigma}) \leqslant H(\boldsymbol{\pi} \wedge \boldsymbol{\tau})+H(\boldsymbol{\tau} \wedge \sigma) $

Choosing $\tau=\pi \vee \sigma$ yields the desired inequality.

Theorem 9 The mapping d: PART $(S)^{2} \rightarrow$ $\mathbb{R}_{\geqslant 0}$ defined as

$ d(\boldsymbol{\pi}, \sigma)=H(\boldsymbol{\pi} \mid \sigma)+H(\sigma \mid \boldsymbol{\pi}) $

is a metric on PART(S).

Proof A double application of Theorem 8 yields:

$ \begin{aligned} & H(\pi \mid \sigma)+H(\sigma \mid \tau) \geqslant H(\pi \mid \tau) \\ & H(\sigma \mid \pi)+H(\tau \mid \sigma) \geqslant H(\tau \mid \pi) \end{aligned} $

Adding these inequalities gives the triangular inequality for d:

$ d(\boldsymbol{\pi}, \boldsymbol{\sigma})+d(\sigma, \boldsymbol{\tau}) \geqslant d(\boldsymbol{\pi}, \tau) $

The symmetry of d is immediate and it is clear that $d(\pi, \pi)=0$ for every $\pi \in \operatorname{PART}(S)$.

Suppose $d(\pi, \sigma)=0$. Because the values of H are non-negative, this implies $H(\pi \mid \sigma)=H(\sigma \mid \pi)=0$. By Theorem 5, we have both $\sigma \leqslant \pi$ and $\pi \leqslant \sigma$, so $\pi=\sigma$. Thus, d is a metric on PART(S).

We refer to the metric on PART(S) introduced in Theorem 9 as the inertial metric on the set of partitions PART(S).

Lemma 2 Let $\pi, \sigma, \tau$ be three partitions in PART(S). We have:

$ \frac{H(\pi \mid \sigma)}{H(\pi \wedge \sigma)}+\frac{H(\sigma \mid \tau)}{H(\sigma \wedge \tau)} \geqslant \frac{H(\pi \mid \tau)}{H(\pi \wedge \tau)} $

Proof Applying Corollary 3 and inequality (6), the desired inequality is obtained as follows (the first inequality holds because $H(\sigma) \leqslant H(\sigma \mid \tau)+H(\tau)$, and the second because $H(\pi \mid \sigma)+H(\sigma \mid \tau) \geqslant H(\pi \mid \tau)$ by Theorem 8 and the function $x \mapsto \frac{x}{x+H(\tau)}$ is increasing):

$ \begin{gathered} \frac{H(\pi \mid \sigma)}{H(\pi \wedge \sigma)}+\frac{H(\sigma \mid \tau)}{H(\sigma \wedge \tau)}=\frac{H(\pi \mid \sigma)}{H(\pi \mid \sigma)+H(\sigma)}+\frac{H(\sigma \mid \tau)}{H(\sigma \mid \tau)+H(\tau)} \geqslant \\ \frac{H(\pi \mid \sigma)}{H(\pi \mid \sigma)+H(\sigma \mid \tau)+H(\tau)}+\frac{H(\sigma \mid \tau)}{H(\pi \mid \sigma)+H(\sigma \mid \tau)+H(\tau)}= \\ \frac{H(\pi \mid \sigma)+H(\sigma \mid \tau)}{H(\pi \mid \sigma)+H(\sigma \mid \tau)+H(\tau)} \geqslant \frac{H(\pi \mid \tau)}{H(\pi \mid \tau)+H(\tau)}=\frac{H(\pi \mid \tau)}{H(\pi \wedge \tau)} \end{gathered} $

Theorem 10 The mapping $\delta:$ PART $(S)^{2} \rightarrow$ $\mathbb{R}_{\geqslant 0}$ defined as

$ \delta(\pi, \sigma)=\frac{d(\pi, \sigma)}{H(\pi \wedge \sigma)} $

is a metric on PART(S) such that $0 \leqslant \delta(\pi, \sigma) \leqslant 1$ for $\pi, \sigma \in \operatorname{PART}(S)$.

Proof The non-negativity and the symmetry of $\delta$ are immediate. To prove the triangular axiom, we write:

$ \begin{aligned} & \delta(\pi, \tau)=\frac{d(\pi, \tau)}{H(\pi \wedge \tau)}=\frac{H(\pi \mid \tau)+H(\tau \mid \pi)}{H(\pi \wedge \tau)} \leqslant \\ & \frac{H(\pi \mid \sigma)}{H(\pi \wedge \sigma)}+\frac{H(\sigma \mid \tau)}{H(\sigma \wedge \tau)}+\frac{H(\tau \mid \sigma)}{H(\tau \wedge \sigma)}+\frac{H(\sigma \mid \pi)}{H(\sigma \wedge \pi)} \quad(\text{by Lemma 2})= \\ & \delta(\pi, \sigma)+\delta(\sigma, \tau) \end{aligned} $

Furthermore, since $H(\pi \mid \sigma) \leqslant H\left(\pi \mid \omega_{S}\right)=H(\pi)$ by Theorem 6, we have $H(\pi \wedge \sigma)=H(\pi \mid \sigma)+H(\sigma) \leqslant H(\pi)+H(\sigma)$, and it follows that $0 \leqslant \delta(\pi, \sigma) \leqslant 1$.

Note that for $\pi, \sigma \in \operatorname{PART}(S)$, we have:

$ \delta(\pi, \sigma)=2-\frac{H(\pi)+H(\sigma)}{H(\pi \wedge \sigma)} $

The metric $\delta(\pi, \sigma)$, whose values vary in the interval [0, 1], can be used for the external validation of clusterings. This type of validation is useful when we have labeled data and we wish to examine how various clusterings agree with the labels. For example, for the iris data set with its three classes (setosa, versicolor, and virginica), it is known that the classes are not well-separated. If $\pi$ is a clustering of the data set and $\sigma$ is the three-class partition determined by the three types of irises, then $\delta(\pi, \sigma)$ measures the quality of the clustering $\pi$. The closer $\delta(\pi, \sigma)$ is to 0, the better the quality of the clustering.
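The quantities of Theorems 9 and 10 can be computed directly from the helpers above. In the sketch below (ours), the meet $\pi \wedge \sigma$ is realized as the set of non-empty pairwise block intersections, d follows Theorem 9, and $\delta$ uses the closed form above:

```python
import numpy as np
# assumes the inertial_entropy() and cond_entropy() helpers from the earlier sketches

def meet(pi, sigma):
    """The meet pi ∧ sigma: the non-empty intersections of blocks of pi and sigma."""
    return [sorted(set(b) & set(c)) for b in pi for c in sigma if set(b) & set(c)]

def dist(X, pi, sigma):
    """The inertial metric d of Theorem 9."""
    return cond_entropy(X, pi, sigma) + cond_entropy(X, sigma, pi)

def delta(X, pi, sigma):
    """The normalized metric of Theorem 10: delta = d / H(pi ∧ sigma)."""
    return dist(X, pi, sigma) / inertial_entropy(X, meet(pi, sigma))

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 2))
pi = [[0, 1, 2], [3, 4, 5]]
sigma = [[0, 1], [2, 3], [4, 5]]
print(dist(X, pi, sigma), delta(X, pi, sigma))   # 0 <= delta <= 1
```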

6 External Clustering Validation Using the Inertial Entropy Metric

Clustering seeks to partition sets of objects such that objects located in the same block of a partition are similar and objects located in distinct blocks are dissimilar.

In general, the number of blocks of the desired partition is unknown, and the extent to which the proposed partition fits the input data is an important problem in clustering. Moreover, an optimal clustering algorithm does not exist, which means that different algorithms may produce different partitions, and none of these partitions needs to be optimal[23]. Also, for many clustering algorithms, the number of clusters is a parameter supplied by the analyst.

This paper focuses on external clustering validation, which involves clustering data whose items are already labeled with the cluster to which each item should belong. These labels define a ground truth partition $\sigma$.

The purpose of external clustering validation is to test a variety of clustering algorithms in order to determine which of these algorithms is better suited to produce a correct clustering $\pi$ (that is, a clustering that is as consistent as possible with the ground truth partition $\sigma$).

Recent surveys[24-26] discuss the challenges of clustering validation, a difficult task which lacks the theoretical foundation of supervised learning. Our focus is to provide such a theoretical basis using inertial entropy and the normalized distance $\delta(\pi, \sigma) \in[0,1]$ generated by it. Smaller values of $\delta$ (close to 0) mean that the clustering algorithm produces a result consistent with the ground truth partition. The minimum of $\delta(\pi, \sigma)$ identifies the computed partition $\pi$ that is as consistent as possible with the ground truth partition.

We apply external validation to the iris data set from the UC Irvine Machine Learning Repository, which has three natural classes that form the reference partition $\sigma$. The partition obtained via a hierarchical clustering is denoted by $\pi$, and the quality of various hierarchical clusterings is evaluated through the distance $\delta(\pi, \sigma)$. For each partition in the dendrogram, we plot the value of $\delta$ against the number of clusters in that partition.

We focus primarily on the Ward clustering method, depicted in Fig. 3, given its relevance to our theoretical framework. We also include plots for the complete link and average link methods for comparison in Figs. 4 and 5, respectively; the single link method is discussed below. Each plot includes a vertical dashed line denoting the number of clusters at which the $\delta$ function has its global minimum, and the minimum value of $\delta$ is stated in the caption.
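A sketch of this experimental setup (ours; it assumes the delta helper above together with SciPy and scikit-learn, and the exact values may differ slightly from the figures depending on implementation details):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris
# assumes the delta() helper from the earlier sketch

X, y = load_iris(return_X_y=True)
sigma = [list(np.where(y == lab)[0]) for lab in np.unique(y)]   # reference partition

Z = linkage(X, method="ward")            # Ward dendrogram
scores = {}
for k in range(2, 31):
    labels = fcluster(Z, t=k, criterion="maxclust")             # cut at k clusters
    pi = [list(np.where(labels == lab)[0]) for lab in np.unique(labels)]
    scores[k] = delta(X, pi, sigma)

best = min(scores, key=scores.get)
print(best, scores[best])                # Fig. 3 reports the global minimum at k = 3
```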

Fig.3 The values of $\delta(\pi, \sigma)$ plotted against the number of clusters for the iris data set using the Ward method (The true partition has 3 clusters; the global minimum of $\delta$ also occurs at 3, with a value of 0.0490)

Fig.4 The values of $\delta$ plotted against the number of clusters for the iris data set using the complete link method (The true partition has 3 clusters; the global minimum of $\delta$ also occurs at 3, with a value of 0.0680)

Fig.5 The values of $\delta$ plotted against the number of clusters for the iris data set using the average link method (The true partition has 3 clusters; the global minimum of $\delta$ also occurs at 3, with a value of 0.0430)

If the chosen clustering method is suitable for a specific data set, our results show a remarkable coincidence between the location of the global minimum of $\delta$ and the real number of clusters given in the description of the data set. Our results also coincide with the experimental results obtained in Ref.[25]. However, when we apply the single-link technique to the iris data set, the minimum distance between the reference partition $\sigma$ and the partition computed by the single-link clustering occurs for n=20. This is due to the well-known preference of single-link algorithms for elongated clusters, which the iris data set does not contain (see Fig. 6).

Fig.6 The values of $\delta$ plotted against the number of clusters for the iris data set using the single link method (The true partition has 3 clusters, but the global minimum of $\delta$ occurs at 20, with a value of 0.0720)

We also note that the Ward and average link methods had the smallest minimum values of the $\delta$ function, indicating that they produce clusterings closest to the true underlying structure of the data set. This, in turn, indicates that the Ward and average link methods are the best choices for hierarchical clustering of this particular data set.

The $\delta$ function paired with the Ward method has a global minimum coinciding with the natural number of clusters of the iris data set. This suggests that the $\delta$ function can be used for partition validation by identifying its global minimum. We test this idea by repeating the above experiment on several additional data sets: one synthetic data set with 5 clusters and six natural data sets from the UC Irvine repository (the balance scale, breast cancer Wisconsin, glass, congressional votes, wine, and zoo data sets), as shown in Figs. 7-13. Each set was pre-processed by normalizing the features to the [0, 1] range and removing records with missing values, as sketched below.
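The pre-processing step can be sketched as follows (ours; a generic min-max normalization with removal of records containing missing values):

```python
import numpy as np

def preprocess(X):
    """Drop records with missing values, then scale each feature to the [0, 1] range."""
    X = X[~np.isnan(X).any(axis=1)]              # remove records with missing values
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # guard against constant features
    return (X - lo) / span
```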

Fig.7 The values of $\delta$ plotted against the number of clusters for the synthetic data set using the Ward method (The true partition has 5 clusters; the global minimum of $\delta$ also occurs at 5, with a value of 0.0003)

Fig.8 The values of $\delta$ plotted against the number of clusters for the balance scale data set using the Ward method (The true partition has 3 clusters, but the global minimum of $\delta$ occurs at 4, with a value of 0.6300)

Fig.9 The values of $\delta$ plotted against the number of clusters for the breast cancer Wisconsin data set using the Ward method (The true partition has 2 clusters; the global minimum of $\delta$ also occurs at 2, with a value of 0.0730)

Fig.10 The values of $\delta$ plotted against the number of clusters for the glass data set using the Ward method (The true partition has 6 clusters, but the global minimum of $\delta$ occurs at 12, with a value of 0.6240. There are also local minima at 3 and 5 clusters, along with other local minima at higher cluster values not shown in the zoomed window)

Fig.11 The values of $\delta$ plotted against the number of clusters for the congressional votes data set using the Ward method (The true partition has 2 clusters; the global minimum of $\delta$ also occurs at 2, with a value of 0.4080)

Fig.12 The values of $\delta$ plotted against the number of clusters for the wine data set using the Ward method (The true partition has 3 clusters; the global minimum of $\delta$ also occurs at 3, with a value of 0.0660)

Fig.13 The values of $\delta$ plotted against the number of clusters for the zoo data set using the Ward method (The true partition has 7 clusters, but the global minimum of $\delta$ occurs at 6, with a value of 0.1100)

After pre-processing, each set was clustered by the Ward hierarchical method, and the $\delta$ values between the dendrogram partitions and the true partition were computed and plotted against the number of clusters. Each plot is labeled with the true number of classes in the data and the name of the data set.

In the majority of the experimental cases, the global minimum of $\delta$ coincides with the number of clusters in the true class structure of the data. Additionally, this minimum was very sharp and distinct, and had no close competing local minima. This was the case for the iris, breast cancer Wisconsin, congressional votes, and wine data sets, where the numbers of clusters identified coincide with the numbers obtained in Ref.[25]. This behavior did not hold for the zoo, glass, and balance scale data sets.

In the zoo data set, the global minimum is at 6 instead of 7 clusters. This behavior is in part due to the imbalance between the numbers of elements in the clusters of the true partition, whose seven clusters have 41, 20, 5, 13, 4, 8, and 10 elements.

The behavior of $\delta$ in the glass set is more erratic, with the global minimum at 12 clusters and a number of local minima, none of which coincide with the number of clusters in the true partition. Again, this behavior may be in part due to the heavily unbalanced nature of the glass set. Its six clusters have sizes 70, 76, 17, 13, 9, and 29.

The balance scale data set is similarly unevenly weighted, with its cluster sizes being 49, 288, and 288. This seems to suggest that the $\delta$ function is less effective at detecting the natural number of clusters in a data set when the clusters are heavily imbalanced. We investigate this phenomenon by examining each data set used and noting its cluster sizes and its minimum $\delta$ value. The results are summarized in Table 2.

Table 2 Experimental results on several data sets

The three data sets that did not have a minimum $\delta$ value coinciding with the natural cluster number, zoo, glass, and balance scale, are in fact the most imbalanced based on the cluster sizes in Table 2. They are also among the highest in their minimum $\delta$ values, indicating a greater distance from the natural partition. These experiments suggest that, in general, the $\delta$ function is more effective at detecting the natural cluster number through its minimum value when the cluster sizes of the data are more balanced. In particular, its minimum value tends to be smaller when cluster sizes are balanced. However, it should be noted that the congressional votes data set defies this trend: it is well balanced and the minimum $\delta$ value coincides with its natural cluster number, but it has a high minimum $\delta$ value, unlike the other four well-balanced sets.

Experimental results suggest that when the data set is not severely imbalanced in cluster size, and a suitable hierarchical linkage method is used, the $\delta$ function is an effective method of partition validation. The "ideal" partition of a dendrogram is indicated by the minimum value of the $\delta$ function taking the dendrogram partitions and the true class partition of the data set as arguments.

7 Conclusions and Future Work

We introduced axiomatically the notion of inertial entropy for partitions of data sets that are subsets of a metric space, together with the conditional variant of this type of entropy. The conditional entropy generates a metric on the space of partitions, and this metric can be normalized such that it ranges in the interval [0, 1]. We then demonstrated that the partition metric can be used for comparing a reference partition of labeled data with partitions produced by a variety of clustering methods, which allows us to evaluate various clustering methods as well as the clusterability of a data set.

It appears that the entropic approach introduced here can be extended to other types of data sets, in particular to graph data sets. This would allow external validation of clusterings of this type of data, a topic for future research.

References
[1]
Simovici D A, Jaroszewicz S. An axiomatization of partition entropy. IEEE Transactions on Information Theory, 2002, 48(7): 2138-2142. DOI: 10.1109/TIT.2002.1013159
[2]
Simovici D A. Clustering: Theoretical and Practical Aspects. Singapore: World Scientific, 2021.
[3]
Verma R, Merigó J M. On Sharma-Mittal's entropy under intuitionistic fuzzy environment. Cybernetics and Systems, 2021, 52(6): 498-521. DOI: 10.1080/01969722.2021.1903722
[4]
Boffa S, Ciucci D. Logical entropy and aggregation of fuzzy orthopartitions. Fuzzy Sets and Systems, 2023, 455: 77-101. DOI: 10.1016/j.fss.2022.07.014
[5]
Chen C, Rajapakse I. Tensor entropy for uniform hypergraphs. IEEE Transactions on Network Science and Engineering, 2020, 7(4): 2889-2900. DOI: 10.1109/TNSE.2020.3002963
[6]
Birkhoff G. Lattice Theory. 3rd edn. Providence, RI: American Mathematical Society, 1973.
[7]
Khinchin A I. Mathematical Foundations of Information Theory. New York: Dover, 1957.
[8]
Faddeev D K. On the notion of entropy of finite probability distributions (in Russian). Usp. Mat. Nauk, 1956, 11: 227-231.
[9]
Ingarden R S, Urbanik K. Information without probability. Coll. Math., 1962, 1: 281-304.
[10]
Daróczy Z. Generalized information functions. Information and Control, 1970, 16: 36-51. DOI: 10.1016/S0019-9958(70)80040-7
[11]
Havrda J H, Charvat F. Quantification methods of classification processes: Concepts of structural α-entropy. Kybernetika, 1967, 3: 30-35.
[12]
Tsallis C. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 1988, 52: 479-487. DOI: 10.1007/BF01016429
[13]
Furuichi S, Yanagi K, Kuriyama K. Fundamental properties of Tsallis relative entropy. Journal of Mathematical Physics, 2004, 45(12): 4868-4877. DOI: 10.1063/1.1805729
[14]
Ward J H. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 1963, 58: 236-244. DOI: 10.1080/01621459.1963.10500845
[15]
Miyamoto S, Abe R, Endo Y, et al. Ward method of hierarchical clustering for non-Euclidean similarity measures. Proceedings of the 2015 7th International Conference of Soft Computing and Pattern Recognition (SoCPaR). Piscataway: IEEE, 2015: 60-63.
[16]
Draszawka K, Szymański J. External validation measures for nested clustering of text documents. Ryżko D, Rybiński H, Gawrysiak P, et al. (eds). Emerging Intelligent Technologies in Industry. Heidelberg: Springer Berlin, 2011: 207-225. DOI: 10.1007/978-3-642-22732-5
[17]
Brun M, Sima C, Hua J, et al. Model-based evaluation of clustering validation measures. Pattern Recognition, 2007, 40(3): 807-824. DOI: 10.1016/j.patcog.2006.06.026
[18]
Liu Y, Li Z, Xiong H, et al. Understanding of internal clustering validation measures. Proceedings of the 2010 IEEE International Conference on Data Mining. Piscataway: IEEE, 2010: 911-916. DOI: 10.1109/ICDM.2010.35
[19]
Wu J, Chen J, Xiong H, et al. External validation measures for k-means clustering: A data distribution perspective. Expert Systems with Applications, 2009, 36(3): 6050-6061. DOI: 10.1016/j.eswa.2008.06.093
[20]
Ullmann T, Hennig C, Boulesteix A-L. Validation of cluster analysis results on validation data: A systematic framework. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2022, 12(3): 1444. DOI: 10.1002/widm.1444
[21]
Halmos P R. Lectures on Boolean Algebras. New York: Springer, 1974.
[22]
Xiong H, Li Z. Clustering validation measures. Aggarwal C C, Reddy C K (eds). Data Clustering. New York: CRC Press, 2018: 571-606.
[23]
Pal N R, Biswas J. Cluster validation using graph theoretic concepts. Pattern Recognition, 1997, 30(6): 847-857. DOI: 10.1016/S0031-3203(96)00127-6
[24]
Schubert E. Stop using the elbow criterion for k-means and how to choose the number of clusters instead. ACM SIGKDD Explorations Newsletter, 2023, 25(1): 36-42. DOI: 10.1145/3606274.3606278
[25]
Arbelaitz O, Gurrutxaga I, Muguerza J, et al. An extensive comparative study of cluster validity indices. Pattern Recognition, 2013, 46(1): 243-256. DOI: 10.1016/j.patcog.2012.07.021
[26]
Milligan G W, Cooper M C. An examination of procedures for determining the number of clusters in a data set. Psychometrika, 1985, 50: 159-179. DOI: 10.1007/BF02294245