Machine Learning Techniques for Biomedical Text Retrieval in PubMed

Rezarta Islamaj and Lana Yeganova

National Center for Biotechnology Information

Abstract:

PubMed provides free access to MEDLINE®, the US National Library of Medicine’s premier bibliographic database containing citations and author abstracts from more than 5,000 biomedical journals published in the United States and in other countries. Currently, there are more than 19 million biomedical citations in PubMed. PubMed is accessed by millions of users each day. Users’ daily interactions with the system include queries for particular citations, abstract views, and full-text article views. The current PubMed retrieval system displays a listing of relevant articles in reverse chronological order.

Understanding users’ queries and retrieving the relevant articles is crucial to the accuracy of a search engine such as PubMed. Our work consists of designing machine learning methods that better capture aspects of users’ queries and improve the retrieval of relevant articles. Specifically:

- Queries in PubMed contain three words on average. Would handling a multiword query as a meaningful phrase, rather than as a Boolean conjunction of separate query words, affect retrieval quality?

- Queries in PubMed contain an author name 36% of the time. Different articles sharing an author name may have been written by different individuals. How can we identify the different individuals behind the same name?

- Published articles contain unrecognized abbreviations that hinder indexing algorithms. How can we identify the correct definition of abbreviated terms in text?

- Queries in PubMed contain words preferred by users to access a particular article. How can we identify the article keywords that make an article easier to find?
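
To make the first question concrete, the following is a minimal sketch of the difference between treating a query as a Boolean conjunction of words and treating it as a phrase. It is an illustration only and not PubMed's retrieval logic: the function names, the simple tokenizer, and the two toy titles are all assumptions introduced here.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_boolean_and(query, document):
    """True if every query word occurs somewhere in the document."""
    doc_words = set(tokenize(document))
    return all(word in doc_words for word in tokenize(query))

def matches_phrase(query, document):
    """True if the query words occur contiguously and in order."""
    q = tokenize(query)
    d = tokenize(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

# Hypothetical toy titles, not real PubMed citations.
docs = [
    "Heart failure outcomes in elderly patients",
    "Failure of treatment in congenital heart disease",
]
query = "heart failure"

boolean_hits = [d for d in docs if matches_boolean_and(query, d)]
phrase_hits = [d for d in docs if matches_phrase(query, d)]
```

In this toy setting the Boolean conjunction retrieves both titles, while the phrase interpretation retrieves only the first, which is the kind of difference in retrieval behavior the question asks about.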

In this tutorial we will survey these specific applications in PubMed and our solutions for efficient retrieval.
