BioBERT: A new frontier in biomedical entity recognition

Published research is an essential resource for researchers planning their own studies. To use it effectively, however, they face the daunting task of wading through millions of available publications to find the few that are relevant to them. The figure below shows the number of publications submitted each year to MEDLINE, a database of abstracts of biomedical publications. Annual submissions have recently started to surpass one million, which makes a solution to this problem urgent.

For researchers to be able to filter through all these new publications, the publications need to be annotated in some way. Manual annotation is an expensive and time-consuming task that requires a lot of domain-specific expertise, so a frequently used approach nowadays is to let Artificial Intelligence systems do the annotating. The relevant field of research, Natural Language Processing, covers tasks ranging from recognising specific terms in a text (Named Entity Recognition) to finding the documents that best answer a given question (Question Answering).

Although these systems show tremendous potential, they are still limited to some degree. One major problem is the amount of annotated data these models need to train on, which still has to be created manually. To overcome this problem, Google has created a new Artificial Intelligence system called BERT, which stands for Bidirectional Encoder Representations from Transformers. This model has managed to greatly reduce the amount of training data required by using a method called Transfer Learning, which involves splitting BERT’s training into two phases. First it trains without any supervision on enormous amounts of unannotated data, after which it only needs a small amount of additional training on annotated data for a specific task. This major reduction in required annotated training data makes BERT especially interesting for the biomedical domain, where very little annotated data is available. BERT has therefore quickly been adapted for this purpose, resulting in a new version called BioBERT [1].
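To make the first training phase more concrete, the sketch below shows how unannotated text can supply its own training targets: some tokens are hidden and the model is asked to predict them. This is a simplified illustration of the masked-token objective BERT uses for pre-training, not BERT's actual data pipeline. The function name `make_masked_example` and the whitespace tokenisation are assumptions made for this example, and the real procedure includes details that are left out here.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"


def make_masked_example(
    tokens: List[str], mask_prob: float, rng: random.Random
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Hide some tokens; the hidden tokens become the prediction targets.

    Hypothetical helper for illustration only. No human annotation is
    involved: the targets come from the text itself.
    """
    masked = list(tokens)
    targets = []
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets.append((i, token))
    return masked, targets


# Whitespace splitting stands in for a real subword tokeniser.
sentence = "insulin regulates glucose uptake in muscle cells".split()
masked_sentence, targets = make_masked_example(sentence, 0.15, random.Random(0))
print(masked_sentence)
print(targets)
```

The second phase, training on annotated data for a specific task, then starts from the model produced by this first phase instead of starting from scratch.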

The game-changing potential of this new technology led Vartion to set up a research assignment focusing on applying BioBERT to Named Entity Recognition. This Natural Language Processing task was chosen because it is a prerequisite for many others. In this project, a model was trained to recognise four types of biomedical entities in PubMed abstracts: organisms, proteins, lipids and saccharides.
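As an illustration of what a Named Entity Recognition model produces, the sketch below turns one label per token into a list of recognised entities. It assumes the widely used BIO tagging scheme (B = beginning of an entity, I = inside an entity, O = outside any entity). The project's actual label format is not described here, so the label names and the `decode_bio` helper are hypothetical.

```python
from typing import List, Tuple

# Hypothetical label names for the four entity types in the project.
ENTITY_TYPES = ("ORGANISM", "PROTEIN", "LIPID", "SACCHARIDE")


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group per-token BIO tags into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens: List[str] = []
    current_type = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_type is not None:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # "B-" starts a new entity; a stray "I-" is treated as a start too.
        if prefix in ("B", "I") and label in ENTITY_TYPES:
            current_tokens, current_type = [token], label

    if current_type is not None:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Insulin", "lowers", "glucose", "in", "Mus", "musculus", "."]
tags = ["B-PROTEIN", "O", "B-SACCHARIDE", "O", "B-ORGANISM", "I-ORGANISM", "O"]
print(decode_bio(tokens, tags))
```

For the example sentence this yields three entities: "Insulin" as a protein, "glucose" as a saccharide and "Mus musculus" as an organism.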

This research involved several steps. First, training data was created based on PubMed abstracts, after which the BioBERT model was trained on this data. Then, this Named Entity Recognition model was used to predict the entities present in all the abstracts in the PubMed database. At the same time, these abstracts were classified into the main biomedical research domains they corresponded to, so that the distribution of entities across these domains could be determined. The figure below shows an example of the annotations a BioBERT transformer model can provide.
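The last step, determining how entities are spread over research domains, comes down to counting. The sketch below assumes each abstract has already been given a domain label and a list of predicted entities. The record layout, the `entity_spread` helper and the domain names are invented for this example and do not reflect the project's actual data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# One record per abstract: (research domain, predicted entities).
# Both fields are hypothetical stand-ins for the outputs of the
# domain classifier and the Named Entity Recognition model.
Record = Tuple[str, List[Tuple[str, str]]]


def entity_spread(records: Iterable[Record]) -> Dict[str, Counter]:
    """Count predicted entity types within each research domain."""
    spread: Dict[str, Counter] = defaultdict(Counter)
    for domain, entities in records:
        for _text, entity_type in entities:
            spread[domain][entity_type] += 1
    return dict(spread)


# Made-up example records, for illustration only.
records = [
    ("microbiology", [("Escherichia coli", "ORGANISM"), ("lipid A", "LIPID")]),
    ("microbiology", [("Escherichia coli", "ORGANISM")]),
    ("endocrinology", [("insulin", "PROTEIN"), ("glucose", "SACCHARIDE")]),
]
for domain, counts in entity_spread(records).items():
    print(domain, dict(counts))
```

Counting entity types rather than individual entity names keeps the summary small. Counting names instead would only require changing the key that is incremented.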

BioBERT models show great promise for document annotation, but a few challenges still need to be overcome. Although much less training data is needed than before, creating this data is still a difficult process. Furthermore, BioBERT currently requires substantial computational resources, which makes it difficult to deploy. Google has already published new research in this direction, however, in the form of the ALBERT model, which reduces the resources required.

With further research into new strategies for creating training data, and with ALBERT adapted to the biomedical domain, this new transformer-based Artificial Intelligence method can become an invaluable tool to assist today’s researchers in building a brighter tomorrow.