Word embeddings for biomedical natural language processing: A survey


Word representations are mathematical objects that capture the semantic and syntactic properties of words in a machine-interpretable form. Recently, encoding word properties into low-dimensional vector spaces using neural networks has become increasingly popular. Such word embeddings are now used as the main input to natural language processing (NLP) applications, where they achieve state-of-the-art results. Nevertheless, most word-embedding studies are carried out with general-domain text and evaluation datasets, and their results do not necessarily apply to text from other domains (e.g., biomedicine) that are linguistically distinct from general English. To achieve maximum benefit in biomedical NLP tasks, word embeddings need to be induced and evaluated using in-domain resources. A detailed review of biomedical embeddings is therefore needed as a reference for researchers training in-domain models. In this paper, we review biomedical word embedding studies from three key aspects: the corpora, the models, and the evaluation methods. We first describe the characteristics of various biomedical corpora and then compare popular embedding models. After that, we discuss different evaluation methods for biomedical embeddings. For each aspect, we summarize the challenges discussed in the literature. Finally, we conclude by proposing future directions that will help advance research into biomedical embeddings.

Language and Linguistics Compass