IMPROVING TEXT CLASSIFICATION USING BACKGROUND KNOWLEDGE
Abstract:
The widespread availability of information in digital form has been both a blessing and a
curse. We now have enormous amounts of important information at our fingertips, yet
finding information relevant to a given task has become more and more difficult as the
amount of information has continued to grow without bound. One important tool in
managing the glut of online information is automated text classification, in which text
documents are automatically assigned to any of a fixed set of given categories. For
example, Web pages could be automatically placed into a hierarchy of topics based on
content, or email messages could be prioritized based on interest.
Inductive classification learning is one of the most popular tools for text classification.
This supervised learning method takes a corpus of documents already assigned to
appropriate categories and extrapolates from them procedures for labeling new, otherwise
uncategorized documents with suitable categories. However, such methods are data
intensive, requiring significant human effort to label sufficient documents to achieve
effective levels of accuracy. Recently, researchers have begun developing methods for
exploiting additional sources of information to improve the performance of text
classification learning methods even when given limited amounts of data.
This talk will describe three approaches for integrating sources of background knowledge
into the text classification process. The first takes existing unlabeled documents and
assigns labels to them, adding them to the corpus of labeled documents as additional data
that did not require any explicit human labeling. The second approach redefines the
process by which documents are compared to one another, assessing similarity not based
on direct comparison between documents but instead via similarity to shared pieces of
background text. The third approach uses background text to re-express documents into a
new space in which similarities are easier to discern and in which document comparison
takes place. The use of background knowledge is shown to generally improve text
classification accuracy across a range of tasks, and the talk will discuss the characteristics
of a learning problem that make each of these approaches more or less effective at
improving classification. The talk will conclude with a discussion of broader lessons on the
value of background knowledge in machine learning.
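
As a rough illustration of the second approach, the sketch below scores two short documents by how strongly each resembles the same piece of background text, rather than by their direct word overlap. It is a minimal sketch under stated assumptions, not the method presented in the talk: the bag-of-words cosine measure, the function names (bag, cosine, bridged_similarity, classify), and the toy documents are all invented for illustration.

```python
import math
from collections import Counter

def bag(text):
    """Lowercased bag-of-words counts for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def bridged_similarity(doc_a, doc_b, background):
    """Score two documents through the background text that best links them.

    Assumption: the link strength through one background piece is the
    product of each document's cosine similarity to that piece.
    """
    va, vb = bag(doc_a), bag(doc_b)
    return max(
        (cosine(va, bag(bg)) * cosine(bag(bg), vb) for bg in background),
        default=0.0,
    )

def classify(query, labeled, background):
    """Return the label of the labeled document most strongly bridged to the query."""
    best_text, best_label = max(
        labeled, key=lambda pair: bridged_similarity(query, pair[0], background)
    )
    return best_label

# Toy data, invented for illustration only.
background = [
    "the jaguar is a fast big cat that hunts in the rainforest",
    "the stock market fell as investors sold shares of the bank",
]
labeled = [("big cat hunts", "animals"), ("bank shares fell", "finance")]
query = "jaguar rainforest"

print(cosine(bag(query), bag("big cat hunts")))  # no shared words, so direct similarity is zero
print(classify(query, labeled, background))
```

In this toy case the query shares no words with either labeled document, so direct comparison cannot separate the categories; the background sentence about jaguars supplies the missing link to the "animals" example.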