Research Projects

A tool that allows users to quickly find the association between the hallmarks of cancer and a search query

Text mining millions of scientific articles to identify chemical risk

A literature-based discovery (LBD) system for cancer

Automatic extraction of supply chain networks from news articles

Resources & Datasets

A corpus of 3661 PubMed abstracts, manually annotated by experts according to a taxonomy describing how toxic chemicals enter the body, and how they can be monitored.

A corpus of 1852 PubMed abstracts, manually annotated by experts according to a taxonomy describing how cancer starts and spreads in the body.

A dataset of 10 large graphs representing co-occurrence of concepts in PubMed abstract sentences. It can be used to evaluate the performance of LBD systems using real-world scientific discoveries by applying ‘time travelling’.

A large dataset of semantic similarity scores for 1888 word pairs in 13 languages, as well as derived cross-lingual scores.

143 pairs of verbs annotated for semantic similarity by 10 annotators.

A corpus of 7,803 sentences annotated with 33,524 relations assigning types to variables appearing in mathematical text.

Teaching & Supervision

I currently supervise an introductory undergraduate course in Computational Linguistics.

In the past, I co-lectured a course on Biomedical Information Processing at the Department of Computer Science and Technology and supervised undergraduate students in several courses, including Object-Oriented Programming (year 1), Further Java (year 2), and Software Engineering (year 2).

I also supervise postgraduate research projects (mainly MPhil projects). If you are looking for a research project and are based (or about to start) in Cambridge, feel free to contact me.


  • simon.baker.gen [at]
  • Language Technology Lab (LTL)
    Theoretical & Applied Linguistics
    University of Cambridge
    9 West Road
    Cambridge CB3 9DA, United Kingdom
  • LinkedIn profile
  • Twitter profile