Medline Analysis

From NCBO Wiki
Revision as of 13:58, 22 February 2010 by Rong (talk | contribs)

Introduction

The Unified Medical Language System (UMLS) Metathesaurus is the most widely used underlying source for biomedical natural language processing (NLP) systems, even though it was not designed as a terminology for NLP tasks. However, the performance of these systems is not satisfactory. In this study, we systematically analyzed UMLS terms by examining their occurrences in over 18 million MEDLINE abstracts written in natural language. Our goals are threefold: (1) analyze UMLS term frequency and syntactic distribution in MEDLINE; (2) build an automatically filtered UMLS Metathesaurus based on the MEDLINE analysis; (3) build an augmented UMLS Metathesaurus in which each term is associated with its MEDLINE frequency and syntactic distribution statistics. The automatically filtered and augmented UMLS Metathesaurus can be used to improve the efficiency and precision of UMLS-based information retrieval and NLP tasks. After automatic MEDLINE filtering, the augmented UMLS contains 518,835 terms, roughly 13% of original terms. Each term in the augmented UMLS is associated with a vector of syntactic distribution statistics and its MEDLINE frequency.

Code repository: <on our g-forge server> .. https://bmir-gforge.stanford.edu/gf/project/

Data and Methods

A total of 18,413,784 abstracts published in MEDLINE from 1965 to 2009 were split into sentences (96,374,837 in total). Each sentence was parsed with the Stanford Parser to generate a parse tree. We used Lucene, a publicly available information retrieval library, to index the sentences and their corresponding parse trees. We used the UMLS 2009AB version, which includes 5,175,449 distinct English strings and 2,120,271 concepts. Term frequency (sentence level) and document frequency (abstract level) were calculated by counting the occurrences of each UMLS term in all MEDLINE sentences and abstracts, respectively. The tf-idf (term frequency-inverse document frequency) of each UMLS term was calculated as follows: tf_idf = (1 + log10(tf)) * log(N/df), where tf is the term frequency, df is the document frequency, and N is the total number of abstracts (18,413,784). The syntactic types and their frequencies for each term were collected from all parse trees in which the term appears. Each term was then assigned a vector of syntactic types and probabilities.
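
The two per-term statistics described above can be illustrated with a minimal Python sketch. This is not the project's actual code: the function names tf_idf and syntactic_distribution are hypothetical, the base of the logarithm in the idf factor is assumed to be 10 (the text only says "log"), and the syntactic labels in the usage note are made-up examples rather than values taken from the study.

```python
import math
from collections import Counter

# Total number of MEDLINE abstracts, as stated in the text.
N_ABSTRACTS = 18_413_784


def tf_idf(tf, df, n_docs=N_ABSTRACTS):
    """Compute (1 + log10(tf)) * log10(n_docs / df).

    Assumption: the idf logarithm is base 10; the source only writes "log".
    """
    if tf <= 0 or df <= 0:
        # Convention assumed here: a term that never occurs scores 0.
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


def syntactic_distribution(labels):
    """Turn the syntactic type labels observed for one term into probabilities."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

For instance, `tf_idf(100, 10, n_docs=1000)` evaluates (1 + 2) * 2, and `syntactic_distribution(["NP", "NP", "NP", "VP"])` assigns 0.75 to "NP" and 0.25 to "VP". A term that appears in every abstract gets an idf factor of zero, which is why very common strings are pushed to the bottom of the ranking.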

How to Use
