Multi-SimLex: A dataset for evaluating representation learning models

Multi-SimLex is a large-scale multilingual resource for measuring lexical semantic similarity. The current version provides human judgments on the semantic similarity of word pairs for 13 monolingual and 66 cross-lingual datasets. The languages covered are typologically diverse and include both major languages (e.g., Chinese, Arabic, Spanish, Russian) and less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels.

Download the Multi-SimLex dataset here

Please cite the following paper:

Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity