The student Iñaki Velez de Mendizabal Gonzalez obtained an OUTSTANDING grade with the 'INTERNATIONAL DOCTORATE' mention
Thesis title: Dimensionality reduction for the improvement of anti-spam filters.
Examination committee:
- Chair: Octavian Adrian Postolache (ISCTE)
- Member: Iryna Yevseyeva (De Montfort University)
- Member: José Mª Gómez Hidalgo (TIBCO SOFTWARE)
- Member: Ekhi Zugasti Uriguen (Mondragon Unibertsitatea)
- Secretary: Iñaki Garitano Garitano (Mondragon Unibertsitatea)
Abstract:
Nowadays, spam represents more than 45% of the world's email traffic. Filtering techniques to combat spam distribution have been the subject of many research studies in recent years, and several combinations of legal, administrative and technical approaches have been tested. Combining technical approaches, namely the widely used content-based and token-based filtering techniques, has yielded only minor improvements in spam classification performance. Because of the limited performance of token-based strategies, new knowledge representation schemes (such as those based on word embeddings, topics or synsets) have been developed. Using synsets to represent the meaning of words moves the community towards identifying the intent of a message, making it possible to classify messages that try to sell products, obtain personal information, and so on. The advantage of synset-based representations lies in their ability to group concepts taxonomically while handling polysemy and synonymy. This research work exploits these properties to design novel Machine Learning (ML) based lossless feature reduction schemes built on concept-grouping strategies. These schemes reduce the dimensionality of the classification problem (the number of features) while improving classification performance. In a second step, the thesis introduces and demonstrates the effectiveness of a new feature reduction scheme that combines the strengths of lossless and lossy strategies. Finally, a decoder has been designed and tested so that words obfuscated with Leetspeak can be used. The proposed system considerably reduces the number of unprocessed words, improving the classification rates of spam messages.
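
As a rough illustration of two ideas mentioned in the abstract (decoding Leetspeak and grouping words under a shared concept to reduce the number of features), here is a minimal Python sketch. It is not the thesis's method: the substitution table `LEET_MAP`, the toy taxonomy `PARENT`, and the functions `decode_leet` and `concept_features` are hypothetical names invented for this example, and a real system would presumably rely on a lexical database of synsets rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical substitution table; real Leetspeak is far more varied.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# Hypothetical toy taxonomy: word -> more general concept.
# It stands in for the taxonomic relations between synsets.
PARENT = {
    "viagra": "drug",
    "pills": "drug",
    "cash": "money",
    "dollars": "money",
}


def decode_leet(token: str) -> str:
    """Lower-case a token and undo simple character substitutions.

    Tokens without any letter (for example plain numbers) are left
    untouched so that "100" is not turned into "ioo".
    """
    token = token.lower()
    if not any(ch.isalpha() for ch in token):
        return token
    return token.translate(LEET_MAP)


def concept_features(text: str) -> Counter:
    """Count features after decoding and grouping words by concept."""
    counts = Counter()
    for raw in text.split():
        word = decode_leet(raw.strip(".,?"))
        # Words that share a parent concept collapse into one feature.
        counts[PARENT.get(word, word)] += 1
    return counts


if __name__ == "__main__":
    print(concept_features("Buy v14gr4 p1ll5 for c4$h"))
```

In this toy example the five tokens of the message collapse into four features, because the two decoded drug-related words share one concept. The sketch only shows the direction of the idea; it says nothing about how the thesis selects which concepts to merge or how it keeps the reduction lossless.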