IMPROVING TEXT CLASSIFICATION USING BACKGROUND KNOWLEDGE
 


Abstract:
        
The widespread availability of information in digital form has been both a blessing and a 
curse.  We now have enormous amounts of important information at our fingertips, yet 
finding information relevant to a given task has become more and more difficult as the 
amount of information has continued to grow without bound.  One important tool in 
managing the glut of online information is automated text classification, in which text 
documents are automatically assigned to any of a fixed set of given categories.  For 
example, Web pages could be automatically placed into a hierarchy of topics based on 
content, or email messages could be prioritized based on interest.

Inductive classification learning is one of the most popular tools for text classification.  
This supervised learning method takes a corpus of documents already assigned to 
appropriate categories and extrapolates from them procedures for labeling new, 
uncategorized documents with suitable categories.  However, such methods are 
data-intensive, requiring significant human effort to label enough documents to achieve 
effective levels of accuracy.  Recently, researchers have begun developing methods for 
exploiting additional sources of information to improve the performance of text 
classification learning methods even when given limited amounts of data.

This talk will describe three approaches for integrating sources of background knowledge 
into the text classification process.  The first takes existing unlabeled documents and 
assigns labels to them, adding them to the corpus of labeled documents as additional data 
that did not require any explicit human labeling.  The second approach redefines the 
process by which documents are compared to one another, assessing similarity not based 
on direct comparison between documents but instead via similarity to shared pieces of 
background text.  The third approach uses background text to re-express documents into a 
new space in which similarities are easier to discern and in which document comparison 
takes place.  The use of background knowledge is shown to generally improve text 
classification accuracy across a range of tasks, and the talk will discuss the characteristics 
of a learning problem that make each of these approaches more or less effective at 
improving classification.  The talk will conclude with a discussion of broader lessons on the 
value of background knowledge in machine learning.
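
As a rough illustration of the second approach described above, the sketch below compares 
two documents through a shared piece of background text rather than directly.  It is not the 
method presented in the talk: the bag-of-words representation, the use of cosine similarity, 
the product-and-maximum scoring rule, and all function names and sample texts are 
assumptions chosen only to keep the example small.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a deliberately crude tokenizer (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bridged_similarity(doc_a, doc_b, background):
    """Score two documents by how strongly both match the same background text.

    The score for one background text is the product of the two cosine
    similarities; the best background text wins (an illustrative choice).
    """
    vec_a, vec_b = bag_of_words(doc_a), bag_of_words(doc_b)
    scores = []
    for bg in background:
        vec_bg = bag_of_words(bg)
        scores.append(cosine(vec_a, vec_bg) * cosine(vec_b, vec_bg))
    return max(scores, default=0.0)


def classify(query, labeled, background):
    """Return the label of the labeled document closest to the query via background text."""
    best_doc, best_label = max(
        labeled, key=lambda pair: bridged_similarity(query, pair[0], background)
    )
    return best_label


# Hypothetical toy data: the query shares no words with either labeled document.
labeled = [
    ("neural network training", "ml"),
    ("stock market prices", "finance"),
]
background = [
    "deep learning uses neural network models trained by backpropagation",
    "investors watch stock market prices and bond yields",
]
query = "backpropagation models"

print(classify(query, labeled, background))  # prints: ml
```

In this toy data the query and the "ml" training document have no words in common, so a 
direct comparison scores them at zero; they are linked only because both overlap with the 
same background sentence.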