Bhattacharyya distance

In statistics, the Bhattacharyya distance measures the similarity of two probability distributions, which may be discrete or continuous. It is often used to measure the separability of classes in classification.

For discrete probability distributions p and q over the same domain X, it is defined as:
 * $$D_B(p,q) = -\ln \left( BC(p,q) \right)$$

where:
 * $$BC(p,q) = \sum_{x\in X} \sqrt{p(x) q(x)}$$

is the Bhattacharyya coefficient. For continuous distributions, the Bhattacharyya coefficient is defined as:
 * $$BC(p,q) = \int \sqrt{p(x) q(x)}\, dx$$

In either case, $$0 \le BC \le 1$$ and $$0 \le D_B \le \infty$$. $$D_B$$ does not obey the triangle inequality, but the Hellinger distance $$\sqrt{1-BC}$$ does.
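The discrete definitions can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the three example distributions are chosen here for demonstration, and the inputs are assumed to be already normalized probability lists over the same ordered domain. The example distributions were picked so that $$D_B$$ visibly violates the triangle inequality while $$\sqrt{1-BC}$$ (the Hellinger distance) satisfies it.

```python
import math

def bhattacharyya_coefficient(p, q):
    """BC(p, q): sum of sqrt(p(x) * q(x)) over a shared, ordered domain."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same domain")
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def bhattacharyya_distance(p, q):
    """D_B = -ln(BC); infinite when the distributions do not overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0.0 else -math.log(bc)

def hellinger_distance(p, q):
    """sqrt(1 - BC), clamped at 0 to guard against rounding error."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))

# Illustrative distributions (assumed for this example only).
p = [0.9, 0.1]
q = [0.5, 0.5]
r = [0.1, 0.9]

# D_B(p, r) exceeds D_B(p, q) + D_B(q, r): the triangle inequality fails.
print(bhattacharyya_distance(p, r) >
      bhattacharyya_distance(p, q) + bhattacharyya_distance(q, r))

# sqrt(1 - BC) satisfies the triangle inequality for the same triple.
print(hellinger_distance(p, r) <=
      hellinger_distance(p, q) + hellinger_distance(q, r))
```

Both comparisons print `True` for this triple: going from p to r through the intermediate distribution q is "shorter" than going directly when measured with $$D_B$$, which a true metric would not allow.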

For multivariate Gaussian distributions $$p_i=N(m_i,P_i)$$,
 * $$D_B={1\over 8}(m_1-m_2)^T P^{-1}(m_1-m_2)+{1\over 2}\ln \,\left({\det P \over \sqrt{\det P_1 \, \det P_2} }\right)$$,

where $$m_i$$ and $$P_i$$ are the means and covariances of the distributions, and
 * $$P={P_1+P_2 \over 2}$$.

Note that the first term in the Bhattacharyya distance is one eighth of the squared Mahalanobis distance between the two means, computed with the averaged covariance $$P$$.
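As a worked illustration, the Gaussian formula can be specialized to one dimension, where the covariances $$P_i$$ reduce to scalar variances and the determinant and inverse become ordinary arithmetic. The sketch below makes that assumption explicitly; the function name and parameter names are hypothetical, and the general multivariate case would need matrix operations that are not shown here.

```python
import math

def bhattacharyya_distance_gaussian_1d(mean1, var1, mean2, var2):
    """Bhattacharyya distance between N(mean1, var1) and N(mean2, var2).

    One-dimensional special case of the multivariate formula:
    the covariance matrices P_1, P_2 are the scalar variances.
    """
    if var1 <= 0 or var2 <= 0:
        raise ValueError("variances must be positive")
    avg_var = (var1 + var2) / 2.0                     # P = (P_1 + P_2) / 2
    mahalanobis_sq = (mean1 - mean2) ** 2 / avg_var   # squared Mahalanobis distance
    mean_term = mahalanobis_sq / 8.0                  # first term of D_B
    cov_term = 0.5 * math.log(avg_var / math.sqrt(var1 * var2))  # second term
    return mean_term + cov_term

# Equal variances: only the mean term contributes.
print(bhattacharyya_distance_gaussian_1d(0.0, 1.0, 2.0, 1.0))  # 0.5

# Equal means: only the covariance term contributes.
print(bhattacharyya_distance_gaussian_1d(0.0, 1.0, 0.0, 4.0))
```

Splitting the result into `mean_term` and `cov_term` mirrors the two terms of the formula: the first vanishes when the means coincide, the second when the variances coincide.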

Bhattacharyya coefficient
The Bhattacharyya coefficient is an approximate measurement of the amount of overlap between two statistical samples. The coefficient can be used to determine the relative closeness of the two samples being considered.

Calculating the Bhattacharyya coefficient involves a rudimentary form of integration of the overlap of the two samples. The interval spanned by the values of the two samples is split into a chosen number of partitions, and the number of members of each sample in each partition is used in the following formula:


 * $$\mathrm{Bhattacharyya} = \sum_{i=1}^{n}\sqrt{a_i\, b_i}$$

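where a and b are the two samples, n is the number of partitions, and $$a_i$$ and $$b_i$$ are the numbers of members of samples a and b in the i-th partition. For the result to lie between 0 and 1, in agreement with the definition of $$BC$$ above, each count is first divided by the size of its sample.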
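The partition-based estimate can be sketched as follows. This is one possible reading of the procedure, under stated assumptions: the partitions are taken to be equal-width bins over the combined range of both samples, and the counts are divided by the sample sizes so that the result stays between 0 and 1. The function names are illustrative.

```python
import math

def histogram_counts(sample, low, high, n_partitions):
    """Count the members of `sample` in n equal-width partitions of [low, high]."""
    counts = [0] * n_partitions
    width = (high - low) / n_partitions
    for value in sample:
        # Clamp the index so that value == high lands in the last partition.
        index = min(int((value - low) / width), n_partitions - 1)
        counts[index] += 1
    return counts

def bhattacharyya_coefficient_from_samples(a, b, n_partitions):
    """Estimate the Bhattacharyya coefficient of two samples from shared bins."""
    low = min(min(a), min(b))
    high = max(max(a), max(b))
    if high == low:
        return 1.0  # every value is identical, so the samples overlap completely
    counts_a = histogram_counts(a, low, high, n_partitions)
    counts_b = histogram_counts(b, low, high, n_partitions)
    # Normalize counts to fractions before taking the square root of the product.
    return sum(math.sqrt((ca / len(a)) * (cb / len(b)))
               for ca, cb in zip(counts_a, counts_b))

a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
print(bhattacharyya_coefficient_from_samples(a, b, 5))  # 0.5
```

With five partitions over the range 1 to 6, sample a occupies the first four partitions and sample b the last three; only the two shared partitions contribute, each adding 0.25.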

The value of this formula therefore grows with each partition that contains members of both samples, and grows further when a partition contains a large share of both samples' members. The choice of the number of partitions depends on the number of members in each sample: too few partitions lose accuracy by overestimating the overlap region, and too many partitions lose accuracy by creating partitions that contain no members even though they lie in a densely populated region of the sample space.

The Bhattacharyya coefficient is 0 if the samples do not overlap at all, because the product in every partition then has a zero factor. As a result, the coefficient alone does not reveal how far apart two fully separated samples are.
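A small sketch makes this limitation concrete. The per-partition counts below are invented for illustration: in the first pair the two samples sit in adjacent partitions, in the second pair they are separated by four empty partitions, yet the coefficient is the same. The helper name is hypothetical and the counts are normalized by sample size, which is an assumption of this sketch.

```python
import math

def coefficient_from_counts(counts_a, counts_b):
    """Bhattacharyya coefficient from per-partition counts of two samples."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return sum(math.sqrt((ca / total_a) * (cb / total_b))
               for ca, cb in zip(counts_a, counts_b))

# Samples in adjacent partitions, with no shared partition.
near = coefficient_from_counts([5, 5, 0, 0], [0, 0, 5, 5])

# Samples separated by four empty partitions.
far = coefficient_from_counts([5, 5, 0, 0, 0, 0, 0, 0],
                              [0, 0, 0, 0, 0, 0, 5, 5])

print(near, far)  # 0.0 0.0
```

Both cases yield 0, so a measure that also uses the positions of the partitions would be needed to distinguish them.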