Semantic similarity

tags: NLP, Evaluating NLP

N-gram matching

For a reference sequence $x$ and a candidate sequence $\hat{x}$, we denote their sequences of $n$-grams by $S_x^n$ and $S^n_{\hat{x}}$. The number of candidate $n$-grams that also occur in the reference is: \[ \sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ] \] with $\mathbb{I}$ the indicator function.

From this we can construct the exact match precision (Exact-$P_n$) and recall (Exact-$R_n$): \[ \text{Exact-}P_n = \frac{\sum_{w \in S_{\hat{x}}^n} \mathbb{I}[w \in S_{x}^n ]}{| S_{\hat{x}}^n|} \] and \[ \text{Exact-}R_n = \frac{\sum_{w \in S_{x}^n} \mathbb{I}[w \in S_{\hat{x}}^n ]}{| S_{x}^n|} \]
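The two ratios can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace-tokenized sentences; the helper names `ngrams` and `exact_precision_recall` are hypothetical and not taken from any library.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision_recall(reference, candidate, n):
    """Compute (Exact-P_n, Exact-R_n) for two tokenized sentences."""
    ref_ngrams = ngrams(reference, n)
    cand_ngrams = ngrams(candidate, n)
    if not ref_ngrams or not cand_ngrams:
        # Assumption: return zeros when either side has no n-grams.
        return 0.0, 0.0
    ref_set, cand_set = set(ref_ngrams), set(cand_ngrams)
    # Precision: fraction of candidate n-grams found in the reference.
    precision = sum(w in ref_set for w in cand_ngrams) / len(cand_ngrams)
    # Recall: fraction of reference n-grams found in the candidate.
    recall = sum(w in cand_set for w in ref_ngrams) / len(ref_ngrams)
    return precision, recall


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
p1, r1 = exact_precision_recall(reference, candidate, 1)
p2, r2 = exact_precision_recall(reference, candidate, 2)
```

For this pair, five of the six unigrams and three of the five bigrams match in each direction, so precision and recall coincide.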

Here are some well-known metrics based on the number of $n$-gram matches.

BLEU

The BLEU metric (Papineni et al. 2002) is very widely used in NLP, particularly in machine translation. It is based on Exact-$P_n$ with some key modifications:

$n$-grams in the reference can be matched only once.
The number of exact matches is accumulated for all reference-candidate pairs in the corpus and divided by the total number of $n$-grams in all candidate sentences.
A brevity penalty discourages very short candidates.

The score is computed with various values of $n$ and geometrically averaged.
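The three modifications and the geometric averaging can be put together in a rough sketch. It assumes a single reference per candidate, uniform weights over $n = 1, \dots, 4$, and no smoothing; the function names `ngram_counts` and `corpus_bleu_sketch` are hypothetical, and this is not meant to reproduce the scores of any reference implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(references, candidates, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per candidate."""
    log_precisions = []
    for n in range(1, max_n + 1):
        matched, total = 0, 0
        # Matches and totals are accumulated over the whole corpus.
        for ref, cand in zip(references, candidates):
            ref_counts = ngram_counts(ref, n)
            cand_counts = ngram_counts(cand, n)
            # Clipping: a reference n-gram can be matched only once.
            matched += sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
            total += sum(cand_counts.values())
        if matched == 0 or total == 0:
            # Without smoothing, one zero precision zeroes the score.
            return 0.0
        log_precisions.append(math.log(matched / total))
    ref_len = sum(len(r) for r in references)
    cand_len = sum(len(c) for c in candidates)
    # Brevity penalty: only candidates shorter than the reference are penalized.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    # Geometric mean of the n-gram precisions.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

With `max_n=1`, the candidate `the the the the` against the reference `the cat` scores 0.25 instead of 1.0, which is the effect of clipping.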

A smoothed extension of the BLEU metric, SENT-BLEU (Koehn et al. 2007), is computed at the sentence level.

METEOR

METEOR (Banerjee, Lavie 2005) computes Exact-$P_1$ and Exact-$R_1$ with the possibility to match word stems and synonyms.

The extension METEOR-1.5 (Denkowski, Lavie 2014) weighs content and function words differently, and also applies importance weighting to different matching types.

More recently, METEOR++ 2.0 added a learned paraphrase resource to the algorithm. Because METEOR relies on such external resources, its full feature set is available for only a few languages.

ROUGE

ROUGE (Lin 2004) is a widely used metric for summarization evaluation.

ROUGE-$n$ computes Exact-$R_n$ (usually $n = 1, 2$), while ROUGE-$L$ is a variant of Exact-$R_1$ with the numerator replaced by the length of the longest common subsequence.
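As an illustration of the ROUGE-$L$ variant described here, the sketch below computes the longest common subsequence with standard dynamic programming and divides by the reference length. The function names are hypothetical, and the full ROUGE toolkit offers further options (such as an F-measure) that are not covered.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the subsequence ending before both items.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one item from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """LCS length divided by the number of reference tokens."""
    if not reference:
        return 0.0
    return lcs_length(reference, candidate) / len(reference)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
score = rouge_l_recall(reference, candidate)
```

Unlike $n$-gram matching, the subsequence does not need to be contiguous, only in the same order: here it is `the cat on the mat`, of length 5.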

Edit-distance-based metrics

Embedding-based metrics

Universal sentence encoder

The Universal Sentence Encoder (USE) (Cer et al. 2018) embeds a sentence into a single vector. The distance between two sentence embeddings can then be used as a proxy for semantic similarity; since the embeddings are usually normalized, this is typically the cosine distance.
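The comparison step can be sketched independently of the encoder. The vectors below are made-up 3-dimensional stand-ins for sentence embeddings (a real encoder outputs far more dimensions), and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat a zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Smaller values mean the two embeddings point in closer directions."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical embeddings, for illustration only.
emb_a = [0.2, 0.9, 0.1]
emb_b = [0.25, 0.8, 0.05]
emb_c = [0.9, -0.1, 0.4]

closer = cosine_distance(emb_a, emb_b) < cosine_distance(emb_a, emb_c)
```

When the embeddings are already unit-normalized, the cosine similarity reduces to a plain dot product.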

Learned metrics

BERTScore

The goal of this metric is to compute and aggregate the pairwise similarity between BERT embeddings of words in the sentence (Zhang et al. 2020).
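One way to aggregate the pairwise similarities, following the greedy matching formulation as I understand it from Zhang et al. (2020), is to pair each token with its most similar token in the other sentence and average the resulting scores. The sketch below uses toy vectors in place of BERT embeddings, assumes all vectors are non-zero, and omits the optional idf weighting and baseline rescaling; the function names are hypothetical.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def greedy_match_scores(ref_embeddings, cand_embeddings):
    """Return (precision, recall, F1) from greedy cosine matching."""
    ref = [_normalize(v) for v in ref_embeddings]
    cand = [_normalize(v) for v in cand_embeddings]
    # sim[i][j]: cosine similarity of reference token i and candidate token j.
    sim = [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]
    # Recall: each reference token takes its best candidate match.
    recall = sum(max(row) for row in sim) / len(ref)
    # Precision: each candidate token takes its best reference match.
    precision = sum(
        max(sim[i][j] for i in range(len(ref))) for j in range(len(cand))
    ) / len(cand)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy 2-dimensional "token embeddings", for illustration only.
precision, recall, f1 = greedy_match_scores([[1, 0], [0, 1]], [[1, 0]])
```

In this toy case the single candidate token matches one reference token perfectly, while the second reference token has no counterpart, so precision is higher than recall.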

Bibliography

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. 2002. "Bleu: A Method for Automatic Evaluation of Machine Translation". In , 311–18.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, et al. 2007. In , edited by John A. Carroll, Antal van den Bosch, and Annie Zaenen. The Association for Computational Linguistics. https://aclanthology.org/P07-2045/.
Satanjeev Banerjee, Alon Lavie. 2005. In , edited by Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare R. Voss, 65–72. Association for Computational Linguistics. https://aclanthology.org/W05-0909/.
Michael J. Denkowski, Alon Lavie. 2014. In , 376–80. The Association for Computer Linguistics. DOI.
Chin-Yew Lin. July 2004. In , 74–81. Association for Computational Linguistics. https://aclanthology.org/W04-1013.
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, et al. 2018. CoRR abs/1803.11175. http://arxiv.org/abs/1803.11175.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi. February 24, 2020. DOI.