Term Discrimination Based Robust Text Classification with Application to E-mail Spam Filtering

Junejo, Khurram Nazir

Please use this identifier to cite or link to this item: http://localhost:80/xmlui/handle/123456789/5337

Title:	Term Discrimination Based Robust Text Classification with Application to E-mail Spam Filtering
Authors:	Junejo, Khurram Nazir
Keywords:	Computer science, information & general works
Issue Date:	2008
Publisher:	Lahore University of Management Sciences
Abstract:	The Internet has touched every part of our lives, including our interactions and communications. Printed books are being replaced by electronic books (e-books), personal and official correspondence has shifted to electronic mail (e-mail), and news is now read online. This is generating huge volumes of unstructured textual data that must be analyzed, filtered, and organized automatically in order to harness its wealth of information for profitable gains. By 2013, the worldwide volume of e-mail is projected to reach 507 billion messages per day, of which 89% will be spam [Radicati (2009)]. In 2008, the cost of spam to businesses in terms of hardware, software, and human resources was around $140 billion [Research (2008)]. Content-based text classification can automatically organize text documents into predefined thematic categories. However, text classification is challenging in the modern Internet environment. Firstly, text documents are sparsely represented in a very high-dimensional feature space (easily hundreds of thousands of dimensions), making learning and generalization difficult. Secondly, due to the high cost of labeling documents, researchers are forced to collect training data from sources different from the target domain, which results in a distribution shift between training and test data. Thirdly, although unlabeled data is easily available, its utilization for improved performance in practical text classification remains a challenge. One important domain for text classification that embodies these challenges is e-mail spam filtering. A typical e-mail service provider (ESP) serves thousands to millions of users, each of whom can have their own topics of interest and their own preferences regarding spam and non-spam e-mails.
Personalized service-side spam filtering provides a solution to this problem; however, for such solutions to be practically usable they must be efficient, scalable, and robust to distribution shifts. In this thesis, we propose a robust text classification technique that combines local generative models and global discriminative classifiers through the use of discriminative term weighting and linear opinion pooling. Terms in the documents are assigned weights that quantify the discrimination information they provide for one category over the others. These weights, called discriminative term weights (DTW), also serve to partition the terms into two sets. An opinion pooling strategy consolidates the discrimination information of terms in the sets to yield a two-dimensional feature space, in which a discriminant function is learned to categorize the documents. In addition to a supervised technique, we also develop two semi-supervised variants for personalizing the local and global models using unlabeled data. We then generalize our technique into a classifier framework that integrates different feature selection criteria, discriminative term weighting schemes, information pooling strategies, and discriminative classifiers. We provide a theoretical comparison of our proposed framework with existing generative, discriminative, and hybrid classifiers. Our text classification framework is evaluated with five discriminative term weighting strategies, six opinion consolidation techniques, and four discriminative classifiers. We employ nine real-world datasets from different domains in our experimental evaluation, and the results are compared with four benchmark text classification algorithms via accuracy and AUC values. Our framework is also evaluated under varying distribution shift, on gray e-mails, on unseen e-mails, and under varying classifier size. Scalability of our spam filter is also demonstrated for personalized service-side spam filtering.
Statistical significance tests confirm that our technique performs significantly better than the compared techniques in both supervised and semi-supervised settings, and in global and personalized spam filtering. In particular, it performs remarkably well when distribution shift is high between training and test data, a phenomenon common in e-mail systems. Additional contributions of this thesis include a systematic analysis of the spam filtering problem and the challenges to effective global and personalized spam filtering at the service side. We formally define key characteristics of e-mail classification such as distribution shift and gray e-mails, and relate them to machine learning problem settings. The concept of term discrimination introduced in this work has also found applications in text clustering, visualization, and feature extraction, and it can be extended for keyword extraction and topic identification from textual documents.
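The pipeline outlined in the abstract (term weights, a partition of terms into two sets, pooling into a two-dimensional space, and a discriminant function) can be illustrated with a minimal sketch. The abstract does not state which formulas the thesis uses, so everything concrete here is an assumption: the weight is taken to be a smoothed ratio of per-class document frequencies, pooling is a plain average over a document's distinct terms, and the discriminant is a fixed line with hand-set `slope` and `bias` instead of a learned one. All function and parameter names are hypothetical.

```python
from collections import Counter


def term_probs(docs, vocab, smoothing=1.0):
    """Smoothed fraction of documents in one class that contain each term."""
    df = Counter(t for d in docs for t in set(d))
    return {t: (df[t] + smoothing) / (len(docs) + 2 * smoothing) for t in vocab}


def fit_term_weights(spam_docs, ham_docs):
    """Assign each term a weight and place it in the spam or non-spam set.

    Assumption: the weight is the ratio of the two class probabilities,
    oriented so that it is always greater than 1.
    """
    vocab = {t for d in spam_docs + ham_docs for t in d}
    p_spam = term_probs(spam_docs, vocab)
    p_ham = term_probs(ham_docs, vocab)
    spam_w, ham_w = {}, {}
    for t in vocab:
        if p_spam[t] > p_ham[t]:
            spam_w[t] = p_spam[t] / p_ham[t]
        elif p_ham[t] > p_spam[t]:
            ham_w[t] = p_ham[t] / p_spam[t]
        # Terms equally likely in both classes carry no weight here.
    return spam_w, ham_w


def pool(doc, spam_w, ham_w):
    """Map a document to two scores by averaging the weights of its terms."""
    terms = set(doc)
    n = max(len(terms), 1)  # avoid division by zero for empty documents
    spam_score = sum(spam_w.get(t, 0.0) for t in terms) / n
    ham_score = sum(ham_w.get(t, 0.0) for t in terms) / n
    return spam_score, ham_score


def classify(doc, spam_w, ham_w, slope=1.0, bias=0.0):
    """Apply a fixed linear discriminant in the two-dimensional score space.

    Assumption: slope and bias are set by hand; the thesis learns them.
    """
    spam_score, ham_score = pool(doc, spam_w, ham_w)
    return "spam" if spam_score - slope * ham_score - bias > 0 else "non-spam"


# Toy data, invented for illustration only.
spam_docs = [["free", "money", "now"], ["win", "free", "prize"]]
ham_docs = [["meeting", "at", "noon"], ["project", "meeting", "notes"]]

spam_w, ham_w = fit_term_weights(spam_docs, ham_docs)
print(classify(["free", "prize"], spam_w, ham_w))
print(classify(["meeting", "notes"], spam_w, ham_w))
```

Because every document is reduced to just two scores, the final classifier has only two parameters to fit, which may be one reason such a design suits per-user personalization at the service side; that reading is an inference from the abstract, not a claim made in it.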
URI:	http://142.54.178.187:9060/xmlui/handle/123456789/5337
Appears in Collections:	Thesis
