Literature-based Discovery Evaluation Dataset

The aim of literature-based discovery (LBD) is to discover new knowledge from existing literature. This is typically achieved using the ABC model (also known as Swanson linking), where a graph is used to represent co-occurrence in the literature. Concepts are the nodes, and an edge represents the co-occurrence of two concepts in, for example, a sentence.

The ABC model hypothesizes that there is a meaningful relation between a concept A extracted from some publication(s) and a concept C extracted from another when some concept B appears in both. The results of the ABC model are relations (termed paths) among entities A, B, and C.
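
The co-occurrence graph and the ABC paths described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the LION LBD implementation: the function names build_cooccurrence_graph and abc_paths are hypothetical, each sentence is assumed to be already reduced to a list of concept names, and the toy input loosely follows Swanson's well-known fish oil and Raynaud's disease example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Build an undirected graph: concept -> set of co-occurring concepts.

    `sentences` is an iterable of concept lists, one list per sentence
    (an assumed input format, chosen for illustration).
    """
    graph = defaultdict(set)
    for concepts in sentences:
        # Every pair of distinct concepts in one sentence gets an edge.
        for u, v in combinations(sorted(set(concepts)), 2):
            graph[u].add(v)
            graph[v].add(u)
    return dict(graph)


def abc_paths(graph, a):
    """Return (A, B, C) paths where C is linked to A only through B."""
    direct = graph.get(a, set())
    paths = []
    for b in sorted(direct):
        for c in sorted(graph.get(b, set())):
            # Skip A itself and concepts that already co-occur with A.
            if c != a and c not in direct:
                paths.append((a, b, c))
    return paths


# Toy data (illustrative only, not taken from the dataset).
sentences = [
    ["fish oil", "blood viscosity"],
    ["blood viscosity", "Raynaud disease"],
]
graph = build_cooccurrence_graph(sentences)
paths = abc_paths(graph, "fish oil")
print(paths)
```

Here "fish oil" and "Raynaud disease" never appear in the same sentence, so the only path connecting them runs through the shared concept "blood viscosity".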

There are two modes of Literature-based Discovery: Open Discovery and Closed Discovery. The goal of Open Discovery is to find a path ABC that leads to a previously unknown discovery, while Closed Discovery aims to find the linking concept B given both A and C as inputs to the system.
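
The difference between the two modes can be sketched as two small functions over a co-occurrence graph. This is a hedged illustration: the names closed_discovery and open_discovery are hypothetical, the graph is an assumed dictionary from each concept to the set of its neighbours, and ranking candidate C concepts by their number of linking B concepts is a simple stand-in, not necessarily the scoring used by any particular LBD system.

```python
def closed_discovery(graph, a, c):
    """Closed discovery: given A and C, return the linking concepts B."""
    return sorted(graph.get(a, set()) & graph.get(c, set()))


def open_discovery(graph, a):
    """Open discovery: given only A, propose C concepts with their B links.

    Candidates are ranked by the number of distinct linking concepts
    (an assumed, deliberately simple ranking), then alphabetically.
    """
    known = graph.get(a, set())
    candidates = {}
    for b in known:
        for c in graph.get(b, set()):
            # Only keep concepts not already directly connected to A.
            if c != a and c not in known:
                candidates.setdefault(c, set()).add(b)
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Toy graph with placeholder concept names (illustrative only).
toy_graph = {
    "A": {"B1", "B2"},
    "B1": {"A", "C1"},
    "B2": {"A", "C1", "C2"},
    "C1": {"B1", "B2"},
    "C2": {"B2"},
}
linking = closed_discovery(toy_graph, "A", "C1")
ranked = open_discovery(toy_graph, "A")
```

In this toy graph, closed discovery for the pair (A, C1) yields both B1 and B2, while open discovery starting from A ranks C1 ahead of C2 because C1 is reachable through two linking concepts.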

For either mode of discovery, an LBD system requires evaluation via ‘time travel,’ where a well-established scientific discovery is used to evaluate the system by artificially removing all related nodes and edges from the graph. Likewise, all nodes and edges that appear after the year of the discovery are removed from the graph. This forces the system to try to predict the discovery from the remaining information.
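
A rough sketch of this time-slicing step is given below. It makes several simplifying assumptions that are not taken from the dataset: each edge is assumed to carry the year of its first co-occurrence, the function name time_slice is hypothetical, and "removing all related nodes and edges" is reduced to dropping only the direct edge of the held-out discovery.

```python
def time_slice(edges, discovery_year, held_out):
    """Rebuild the graph as it might have looked before a known discovery.

    `edges` maps frozenset({u, v}) to the first year the pair co-occurred
    (an assumed representation). Edges first seen after `discovery_year`
    and the held-out discovery edge itself are dropped; nodes left with
    no edges disappear as a side effect.
    """
    graph = {}
    for pair, year in edges.items():
        if year > discovery_year or pair == held_out:
            continue
        u, v = tuple(pair)
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


# Toy edge list with placeholder concepts and years (illustrative only).
edges = {
    frozenset({"A", "B"}): 1980,
    frozenset({"B", "C"}): 1983,
    frozenset({"A", "C"}): 1986,  # the known discovery to be held out
    frozenset({"C", "D"}): 1990,  # appears after the discovery year
}
sliced = time_slice(edges, 1986, frozenset({"A", "C"}))
```

After slicing, A and C are connected only through B, so a system that recovers the pair (A, C) from this graph has effectively predicted the held-out discovery.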

We have produced a dataset to evaluate our LION LBD system, and have made this dataset public for anyone intending to evaluate LBD systems and compare with the provided gold standard as well as with the performance of our system.

You can find out more and download the Literature-based Discovery Dataset here

Please cite the following paper:

LION LBD: a literature-based discovery system for cancer biology