Cosine Similarity: Not for Probabilities
Tags: lda, probability, topic modeling
Categories: Data Science, Mathematics
Over the past year and a half, I’ve gained experience with topic modeling to identify mobile apps that are similar in functionality. One of the most popular topic modeling algorithms is LDA, or Latent Dirichlet Allocation. It maps a document to an inferred probability distribution over the topics learned by the model during training. Gensim is an easy-to-use library for implementing topic models like LDA. It also supports Doc2Vec, which produces low-dimensional vector representations of documents instead of probability distributions. One difference between probabilistic models like LDA and embedding models like Doc2Vec is the concept of similarity. For Doc2Vec, the cosine similarity is a popular and appropriate measure: document vectors that point in nearly the same direction are similar, whereas those that are near-orthogonal are not.
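As a rough illustration of that mapping, here is a minimal Gensim sketch on a toy, made-up corpus (the `texts` contents are purely illustrative): an LDA model turns a bag-of-words document into a probability distribution over topics.

```python
# Minimal sketch on a toy corpus: Gensim's LDA maps a bag-of-words
# document to a probability distribution over topics.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["photo", "editor", "filter", "camera"],
    ["expense", "budget", "tracker", "finance"],
    ["camera", "filter", "collage", "sticker"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Ask for the full distribution rather than only the dominant topics.
doc_topics = lda.get_document_topics(corpus[0], minimum_probability=0)
print(doc_topics)  # e.g. [(0, 0.9...), (1, 0.0...)]
```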
For LDA, the cosine similarity is not typically used because it is not a measure designed for comparing probability distributions. But I have yet to see an explanation of why it is an inappropriate measure. This fact is mentioned briefly in one of Radim Řehůřek’s gensim tutorials, but not elaborated on. So why use the symmetrized KL divergence, the Hellinger distance, or the Jensen-Shannon divergence over the cosine similarity? In the next section, I present a simple case that demonstrates why the cosine similarity is flawed for probability distributions.
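For reference, here is a sketch of how those three measures can be computed, assuming `p` and `q` are plain NumPy arrays that each sum to 1 (the vectors here are arbitrary toy values, not tied to any particular topic model):

```python
# Sketch of the probability-aware measures mentioned above.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

sym_kl = entropy(p, q) + entropy(q, p)  # symmetrized KL divergence
hellinger = np.sqrt(0.5) * np.linalg.norm(np.sqrt(p) - np.sqrt(q))
js_distance = jensenshannon(p, q)       # square root of the JS divergence

print(sym_kl, hellinger, js_distance)
```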
Update (2022-09-26): Criticisms of the cosine similarity are also put forth in this paper on an alternate similarity measure (one that generalizes the Hellinger similarity for probabilities discussed in the next section). The authors state that the cosine similarity is not effective for dealing with probabilities, with no follow-up explanation. They also note that the cosine similarity is related to the Euclidean distance, which does not perform well in high dimensions due to the curse of dimensionality. This is true! But the issue of dimensionality is not specific to probability distributions, and it is not really an issue for sparse vectors, which have a lower intrinsic dimensionality and are common in NLP applications. In any case, it’s a nice paper that summarizes the many other similarity measures that exist before showing that the authors’ measure outperforms several of them.
Probability Vectors
Let’s think about exactly what a discrete probability vector is. We know that probabilities must sum to 1 over all possible values:
$$ \sum\limits_{i=1}^n p_i = 1 $$
An alternate way to think of this is that a probability vector is a unit vector under the L1-norm (its entries are non-negative, so their absolute values sum to 1). If we take the square root of every entry in this vector, it becomes a unit vector under the L2-norm:
$$ \sqrt{\sum\limits_{i=1}^n (\sqrt{p_i})^2} = 1 $$
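A quick numerical sanity check of both statements, using an arbitrary three-element probability vector:

```python
# A probability vector has unit L1 norm, and its elementwise square root
# has unit L2 norm.
import numpy as np

p = np.array([0.5, 0.3, 0.2])
print(np.linalg.norm(p, ord=1))           # 1.0
print(np.linalg.norm(np.sqrt(p), ord=2))  # 1.0
```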
We can use this knowledge to make some connections between the cosine similarity and other kinds of similarities used for probability distributions. For example, the Bhattacharyya coefficient, which is effectively a “Hellinger similarity”, can be written like so (relying on the assumption that \(\vec{p}\) and \(\vec{q}\) are unit vectors under the L1-norm):
$$ BC(p,q) = \sum\limits_{i=1}^n \sqrt{p_i}\sqrt{q_i} = \text{cossim}\left(\sqrt{\vec{p}}, \sqrt{\vec{q}}\right) $$
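That identity is easy to verify in code; the sketch below uses two arbitrary probability vectors and small helper functions written for illustration:

```python
# The Bhattacharyya coefficient equals the cosine similarity of the
# square-rooted vectors (whose L2 norms are already 1).
import numpy as np

def bhattacharyya(p, q):
    return np.sum(np.sqrt(p) * np.sqrt(q))

def cossim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.6, 0.1, 0.3])

print(bhattacharyya(p, q))             # ≈ 0.966
print(cossim(np.sqrt(p), np.sqrt(q)))  # same value
```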
So why can we not also apply the cosine similarity to the untransformed probability vectors? Well – let’s look at an example. Let’s say we have the following two probability vectors defined on \(x \in \{1,2,3\}\):
$$ \vec{p}(x) = [1, 0, 0]$$
$$ \vec{q}(x) = [1/2, y, 1/2-y] $$
where \(0 \leq y \leq \frac{1}{2}\). Because the support of \(\vec{p}\) is restricted to \(x=1\), any changes to \(y\) should not affect the similarity between the \(\vec{p}\) and \(\vec{q}\) distributions at all. With the square-root normalization, this is indeed the case:
$$ \sqrt{\vec{p}} \cdot \sqrt{\vec{q}} = \sqrt{1/2} $$
Now let’s instead take the cosine similarity using the standard L2-normalization, where the normalizing factor depends on the sum of squares of all the entries:
$$ \hat{x} = \frac{\vec{x}}{||\vec{x}||_{2}} = \frac{\vec{x}}{ \sqrt{\sum_i x_i^2 } }$$
and
$$ \text{cossim}(\vec{p}, \vec{q}) = \frac{1}{2\sqrt{1/4 + y^2 + (1/2-y)^2}} $$
Though the overlap in probability does not depend on \(y\) at all, the cosine similarity changes with \(y\). This does not make sense!
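Here is a small sketch that reproduces this: sweeping \(y\) leaves the Bhattacharyya coefficient fixed at \(\sqrt{1/2} \approx 0.707\), while the cosine similarity moves between roughly 0.707 (at \(y = 0\) or \(y = 1/2\)) and 0.816 (at \(y = 1/4\)).

```python
# Sweep y: the Bhattacharyya coefficient stays at sqrt(1/2) ≈ 0.707,
# while the cosine similarity varies between ≈ 0.707 and ≈ 0.816.
import numpy as np

def cossim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

p = np.array([1.0, 0.0, 0.0])
for y in [0.0, 0.1, 0.25, 0.5]:
    q = np.array([0.5, y, 0.5 - y])
    bc = np.sum(np.sqrt(p) * np.sqrt(q))
    print(f"y={y:.2f}  BC={bc:.3f}  cossim={cossim(p, q):.3f}")
```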
In contrast, if we are considering the similarity between two vectors that do not represent probability distributions, then all that we care about is the angle between them. So therein lies the distinction: probability vectors increase in similarity as the overlap on their common axes increases, while embedding vectors increase in similarity as the angle between them decreases. Thus, these vectors require different similarity measures.
This is especially pertinent to topic models: we may have a model that allows for a hundred topics, but each document will only contain a few, so we expect these topic distributions to be sparse. Say we are comparing documents A, B, and C. Document A belongs to topics 1, 2, and 3, while documents B and C belong to topics 3, 5, and 6. If the only difference between documents B and C is the proportion of their membership in topics 5 and 6, they should have the same similarity to document A (which shares only topic 3 with them, and neither 5 nor 6).
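To make this concrete, here is one hypothetical set of topic distributions over six topics that fits the scenario (the numbers are made up): the Bhattacharyya coefficient rates B and C as equally similar to A, while the cosine similarity does not.

```python
# Hypothetical topic distributions for the scenario above (6 topics).
# B and C differ only in how they split their mass between topics 5 and 6,
# so both should be equally similar to A.
import numpy as np

def cossim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def bhattacharyya(a, b):
    return np.sum(np.sqrt(a) * np.sqrt(b))

A = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0])  # topics 1, 2, 3
B = np.array([0.0, 0.0, 0.2, 0.0, 0.4, 0.4])  # topics 3, 5, 6
C = np.array([0.0, 0.0, 0.2, 0.0, 0.7, 0.1])  # topics 3, 5, 6

print(bhattacharyya(A, B), bhattacharyya(A, C))  # 0.2 and 0.2 -- equal
print(cossim(A, B), cossim(A, C))                # ≈ 0.108 vs ≈ 0.088 -- not equal
```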
So in short: the cosine similarity is not for probability distributions! (But the Hellinger similarity captures the spirit of it while technically being correct.)