IMPROVING TEXT CLASSIFICATION USING BACKGROUND KNOWLEDGE

Abstract: The widespread availability of information in digital form has been both a blessing and a curse. We now have enormous amounts of important information at our fingertips, yet finding information relevant to a given task has become more and more difficult as the amount of information has continued to grow without bound. One important tool in managing the glut of online information is automated text classification, in which text documents are automatically assigned to any of a fixed set of given categories. For example, Web pages could be automatically placed into a hierarchy of topics based on content, or email messages could be prioritized based on interest. Inductive classification learning is one of the most popular tools for text classification. This supervised learning method takes a corpus of documents already assigned to appropriate categories, and extrapolates from them procedures for labeling new, otherwise uncategorized documents with suitable categories. However, such methods are data-intensive, requiring significant human effort to label enough documents to achieve effective levels of accuracy. Recently, researchers have begun developing methods for exploiting additional sources of information to improve the performance of text classification learning methods even when given limited amounts of labeled data. This talk will describe three approaches for integrating sources of background knowledge into the text classification process. The first takes existing unlabeled documents and assigns labels to them, adding them to the corpus of labeled documents as additional data that did not require any explicit human labeling. The second approach redefines the process by which documents are compared to one another, assessing similarity not by direct comparison between documents but instead via their similarity to shared pieces of background text.
The third approach uses background text to re-express documents in a new space in which similarities are easier to discern and in which document comparison takes place. The use of background knowledge is shown to generally improve text classification accuracy across a range of tasks, and the talk will discuss the characteristics of a learning problem that make each of these approaches more or less effective at improving classification. The talk will conclude with a discussion of broader lessons on the value of background knowledge in machine learning.
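
As a rough illustration of the second approach, the sketch below compares two short documents through their similarity to shared background texts rather than to each other. The abstract does not say which similarity measure is used, so the bag-of-words cosine measure, the function names (bag_of_words, cosine, background_profile, bridged_similarity), and the toy texts are all assumptions made for this example and not the method presented in the talk.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def background_profile(doc, background):
    """Similarity of one document to each piece of background text."""
    vec = bag_of_words(doc)
    return [cosine(vec, bag_of_words(b)) for b in background]


def bridged_similarity(doc_a, doc_b, background):
    """Compare two documents by comparing their background profiles."""
    pa = background_profile(doc_a, background)
    pb = background_profile(doc_b, background)
    # Reuse the sparse cosine by indexing each profile by background position.
    return cosine(dict(enumerate(pa)), dict(enumerate(pb)))


# Hypothetical unlabeled background texts and short documents.
background = [
    "the pitcher threw a fastball and the batter hit a home run",
    "the senate passed the budget bill after a long debate",
]
doc_a = "pitcher strikes out batter"
doc_b = "fastball home run"
doc_c = "budget debate"

direct_ab = cosine(bag_of_words(doc_a), bag_of_words(doc_b))
bridged_ab = bridged_similarity(doc_a, doc_b, background)
bridged_ac = bridged_similarity(doc_a, doc_c, background)
print(direct_ab, bridged_ab, bridged_ac)
```

In this toy setup doc_a and doc_b share no words, so a direct comparison finds nothing in common, yet both resemble the same background text and so come out as similar once compared through it. Short documents with little word overlap are the kind of case where such a bridge can plausibly help.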