Machine Learning Techniques for Biomedical Text Retrieval in PubMed
Rezarta Islamaj and Lana Yeganova
Abstract:
PubMed provides free access to MEDLINE®, the US National
Library of Medicine’s premier bibliographic database containing citations and
author abstracts from more than 5000 biomedical journals published in the
Understanding users’ queries and retrieving the relevant articles are crucial to the accuracy of a search engine such as PubMed. Our work consists of designing machine learning methods that better capture the aspects of users’ queries and better retrieve the relevant articles. Specifically:
- Queries in PubMed contain three words on average. Would handling a multiword query as a meaningful phrase, rather than as a Boolean conjunction of separate query words, affect retrieval quality?
- Queries in PubMed contain an author name 36% of the time. Different articles sharing an author name may have been written by different individuals. How can we identify the different individuals behind the same name?
- Published articles contain unrecognized abbreviations which hinder indexing algorithms. How can we identify the correct definition of abbreviated terms in text?
- Queries in PubMed contain the words that users prefer when looking for a particular article. How can we identify the article keywords that make an article easier to find?
In this tutorial we will survey these specific applications in PubMed and our solutions for efficient retrieval.
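As a small illustration of the first question above, the following sketch contrasts the two readings of a multiword query: a Boolean conjunction of its words versus an exact phrase. It is a toy example under stated assumptions, not the retrieval method used in PubMed. The documents and the function names (build_positional_index, boolean_and, phrase_match) are hypothetical and invented for illustration.

```python
from collections import defaultdict

# Toy documents, invented for illustration (not PubMed data).
DOCS = {
    1: "heart attack risk in older adults",
    2: "panic attack and heart rate variability",
    3: "risk factors for heart attack",
}


def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def boolean_and(index, query):
    """Documents containing every query word, in any order or position."""
    words = query.lower().split()
    if not words:
        return set()
    # index.get avoids creating empty entries for unseen words.
    doc_sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*doc_sets)


def phrase_match(index, query):
    """Documents containing the query words consecutively, in order."""
    words = query.lower().split()
    hits = set()
    for doc_id in boolean_and(index, query):
        starts = index[words[0]][doc_id]
        # A phrase starts at p if word i of the query occurs at position p + i.
        if any(
            all(p + i in index[w][doc_id] for i, w in enumerate(words))
            for p in starts
        ):
            hits.add(doc_id)
    return hits


index = build_positional_index(DOCS)
conjunction_hits = boolean_and(index, "heart attack")
phrase_hits = phrase_match(index, "heart attack")
```

On these toy documents, the conjunction matches all three documents, including document 2, where "heart" and "attack" occur in unrelated contexts, while the phrase reading matches only documents 1 and 3. Whether the stricter reading improves retrieval quality on real queries is the question the tutorial raises.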