Articles on clinical trials, which describe the experiments performed or observations made in clinical research, are published in large quantities. They often describe the procedures and drugs used and the outcomes of these new procedures. Much of the information in these articles is valuable for accelerating current studies or starting new research. However, manually extracting the articles relevant to a specific research topic is labour-intensive and inefficient. With Natural Language Processing (NLP) techniques, large numbers of clinical trial articles can be analysed and the desired data extracted. This, in turn, can accelerate the literature review stage of a study.
Therefore, we decided to develop two Named Entity Recognition (NER) models, one extracting drugs and the other extracting complaints/illnesses (CI) from clinical trial articles. To achieve this goal, we used the Convolutional Neural Network (CNN) of the open-source NLP library spaCy.
Each model has learned to identify occurrences of named entities from their context and semantics, together with word vectors. These word vectors are derived from the data on which the model was pretrained.
Figure 1: The procedure used to investigate the possibility of identifying drugs and adverse events in clinical trial data.
The procedure used in this research started with the retrieval of articles. Subsequently, the data was converted to the format required by the web-based annotation tool. To achieve consistent annotations, one drug definition and one CI definition were created. A training dataset and an evaluation dataset were then annotated following these predefined definitions. The training dataset contained roughly 1200 annotated paragraphs and the evaluation dataset 300. After training, the model was evaluated. The scores are shown in the table below.
Table 1: The scores achieved on the evaluation dataset, showing the F-score, precision and recall for both entity types.
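To make the three reported metrics concrete, the following is a minimal sketch of how precision, recall and F-score can be computed at the entity level. It assumes exact matching of span boundaries and label, which may differ from the scoring actually used in this research. The function name `prf`, the tuple layout and the example spans are hypothetical and serve only as illustration; they are not results from the study.

```python
def prf(gold, predicted):
    """Entity-level precision, recall and F-score under exact matching.

    Each entity is a tuple (paragraph_id, start_char, end_char, label).
    This tuple layout is an assumption made for illustration only.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical annotations, not data from the study.
gold = [(0, 0, 7, "DRUG"), (0, 20, 28, "DRUG"), (1, 5, 17, "DRUG")]
pred = [(0, 0, 7, "DRUG"), (1, 5, 17, "DRUG"), (1, 30, 36, "DRUG")]

precision, recall, f_score = prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f={f_score:.2f}")
```

Under exact matching, an entity that is only partially annotated counts as both a false positive and a false negative, so incomplete annotations lower precision and recall at the same time.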
The two models were applied to all clinical trial abstracts and the data was saved to a database. “Chemotherapy”, “Insulin” and “Saline” were the drug entities most frequently annotated by the model. For the CI entities these were “Pain”, “Tumor” and “Infection”. A review of the visualised annotated abstracts showed that most entities were properly annotated except for a few which were incomplete.
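A frequency ranking like the one above can be obtained by counting the extracted entity strings. The sketch below assumes the entity mentions have already been read from the database into a list of strings and that mentions are compared case-insensitively; both are assumptions, and the function name `most_frequent_entities` and the example mentions are hypothetical.

```python
from collections import Counter


def most_frequent_entities(entity_mentions, n=3):
    """Return the n most frequent entity mentions, ignoring case and
    surrounding whitespace."""
    counts = Counter(mention.strip().lower() for mention in entity_mentions)
    return counts.most_common(n)


# Hypothetical mentions, not data from the study.
mentions = [
    "Insulin", "insulin", "Saline",
    "Chemotherapy", "chemotherapy", "Chemotherapy", "Aspirin",
]
top = most_frequent_entities(mentions, n=2)
print(top)
```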
Notwithstanding the high F-score of the models, some improvements might be made. For example, the model could be modified to identify only adverse events (AEs) instead of all CIs, which could be done with relationship extraction. Creating the annotations to train and evaluate the model was time-consuming; this could be accelerated by generating semi-automated annotations. Based on the results presented and discussed, it can be concluded that it is feasible to create a custom NER model that can identify drugs and their corresponding adverse events.
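One possible form of semi-automated annotation is dictionary-based pre-annotation, in which known terms are matched in the text and the resulting spans are offered to the annotator for correction. The sketch below illustrates this idea only; it is not the method used in this research. The function name `pre_annotate`, the term list and the example sentence are hypothetical, and the `(start_char, end_char, label)` output layout is an assumption chosen because character offsets are a common input format for annotation tools.

```python
import re


def pre_annotate(text, terms, label):
    """Suggest (start_char, end_char, label) spans for every whole-word,
    case-insensitive occurrence of a known term. The suggestions are
    meant to be reviewed by a human annotator, not used as-is."""
    spans = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


# Hypothetical sentence and term list, not data from the study.
text = "Patients received insulin and saline; insulin doses were adjusted."
spans = pre_annotate(text, ["insulin", "saline"], "DRUG")
for start, end, label in spans:
    print(label, text[start:end])
```

A dictionary can only suggest terms it already contains, so the annotator would still need to add unseen drugs and remove matches that are not drugs in context.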