Hallmarks of Cancer Corpus

The Hallmarks of Cancer (HOC) corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes arranged in a hierarchy. Each sentence in the corpus is assigned zero or more class labels. The labels are found under the “labels” directory, while the tokenized text is found under the “text” directory. Each filename is the PubMed ID (PMID) of the corresponding abstract.
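
The directory layout described above can be read with a short script. The following Python sketch is only an illustration, not part of the corpus distribution: the function name `load_hoc` is hypothetical, and it assumes that each file holds one sentence per line and that line i of a label file annotates line i of the matching text file. Neither the file extension nor the label syntax is stated here, so the sketch matches files by PMID only and returns each label line as a raw string (empty when a sentence has no labels).

```python
from pathlib import Path


def load_hoc(root):
    """Pair each text file with its label file by PMID (hypothetical helper)."""
    root = Path(root)
    # Map PMID -> file path; the PMID is the filename without its extension.
    text_files = {p.stem: p for p in (root / "text").iterdir() if p.is_file()}
    label_files = {p.stem: p for p in (root / "labels").iterdir() if p.is_file()}

    corpus = {}
    # Only PMIDs present in both directories are loaded.
    for pmid in sorted(text_files.keys() & label_files.keys()):
        sentences = text_files[pmid].read_text(encoding="utf-8").splitlines()
        labels = label_files[pmid].read_text(encoding="utf-8").splitlines()
        # Assumption: the two files are line-aligned, one sentence per line.
        if len(sentences) != len(labels):
            raise ValueError(f"PMID {pmid}: text and label line counts differ")
        corpus[pmid] = list(zip(sentences, labels))
    return corpus
```

Calling `load_hoc` on the unpacked corpus directory would return a dictionary keyed by PMID, each value a list of (sentence, label line) pairs. If the line-alignment assumption does not hold for the actual files, the function raises an error rather than pairing sentences with the wrong labels.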

Download the Hallmarks of Cancer corpus here

Please cite the following papers:

Automatic semantic classification of scientific literature according to the hallmarks of cancer

Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer