Generic Urdu NLP Framework for Urdu Text Analysis: Hybridization of heuristics and Machine Learning Techniques

Khan, Wahab

Please use this identifier to cite or link to this item: http://localhost:80/xmlui/handle/123456789/5148

Title:	Generic Urdu NLP Framework for Urdu Text Analysis: Hybridization of heuristics and Machine Learning Techniques
Authors:	Khan, Wahab
Keywords:	Computer Science
Issue Date:	2019
Publisher:	International Islamic University, Islamabad.
Abstract:	The internet was initially designed to present information to users in English. However, with the passage of time and the development of standard web technologies such as browsers, programming languages, libraries, frameworks, databases, front ends and back ends, protocols, APIs, and data formats, the internet became a multilingual source of information. In the last few years, the natural language processing (NLP) research community has observed rapid growth in online multilingual content. Thus, the NLP community aims to explore monolingual and cross-lingual information retrieval (IR) tasks. Digital online content in Urdu is also currently increasing at a rapid pace. Urdu, the national language of Pakistan and the most widely spoken and understood language of the Indian subcontinent, is considered a low-resource language (Mukund, Srihari, & Peterson, 2010). Part-of-speech (POS) tagging and named entity recognition (NER) are considered the most basic NLP tasks. Investigating these two tasks in Urdu is very hard. POS tagging, the assignment of syntactic categories to words in running text, is significant to natural language processing as a preliminary task in applications such as speech processing, information extraction, and others. Named entity recognition (NER) corresponds to the identification and classification of all proper nouns in texts into predefined categories, such as persons, locations, organizations, expressions of time, quantities, monetary values, etc. It is considered a sub-task and/or sub-problem in information extraction (IE) and machine translation. NER is one of the hardest tasks in Urdu language processing. Most previous Urdu NER systems are based on machine learning (ML) models. However, an ML model needs sufficiently large annotated corpora for better performance (Das, Ganguly, & Garain, 2017). 
Urdu is termed a scarce-resource language, for which a sufficiently large annotated corpus for evaluating ML models is not available. Therefore, the adoption of a semi-supervised approach, which depends largely on the use of a huge amount of unlabeled data, is a feasible solution. In this thesis, we propose a generic Urdu NLP framework for Urdu text analysis based on machine learning (ML) and deep learning approaches. Initially, we addressed POS challenges by developing a novel tagging approach using linear-chain conditional random fields (CRF). We employed a strong, stable, balanced set of language-independent and language-dependent features for the Urdu POS task and used the context word window method. Our approach was evaluated against a support vector machine (SVM) technique for Urdu POS tagging, considered the state of the art, on two benchmark datasets. The results show that our CRF approach improves upon the F-measure of prior attempts by 8.3 to 8.5%. Secondly, we adopted deep recurrent neural network (DRNN) learning algorithms with various model structures, and with word embedding as a feature, for the task of Urdu named entity recognition and classification. These DRNN models include a long short-term memory (LSTM) forward recurrent neural network (RNN), an LSTM bi-directional RNN, a backpropagation through time (BPTT) forward RNN, and a BPTT bi-directional RNN. We consider language-dependent features such as part-of-speech (POS) tags as well as language-independent features such as N-grams. Our results show that the proposed DRNN-based approach outperforms existing work that employs CRF-based approaches. Our work is the first to use a DRNN architecture and word embedding as a feature for the Urdu NER task, and it improves upon prior attempts by 9.5% in the case of maximum margin.
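As a rough illustration of the context word window method mentioned in the abstract, the following Python sketch builds one feature dictionary per token from the token itself and its neighbours. The function names, the window size of 2, the `<BOS>`/`<EOS>` padding markers, and the specific surface features are assumptions made for this example; they are not the feature set used in the thesis. Dictionaries of this shape are a common input format for linear-chain CRF toolkits.

```python
from typing import Dict, List


def window_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Build a feature dict for tokens[i] from the word and its neighbours.

    Hypothetical feature set for illustration only; the thesis does not
    list its features in the abstract.
    """
    word = tokens[i]
    feats = {
        "word": word,
        "length": str(len(word)),
        "prefix1": word[:1],       # first character
        "suffix2": word[-2:],      # last two characters
        "is_digit": str(word.isdigit()),
    }
    # Add the surrounding words, padding past the sentence boundaries.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        key = f"word[{offset:+d}]"
        if j < 0:
            feats[key] = "<BOS>"   # assumed begin-of-sentence marker
        elif j >= len(tokens):
            feats[key] = "<EOS>"   # assumed end-of-sentence marker
        else:
            feats[key] = tokens[j]
    return feats


def sentence_features(tokens: List[str], window: int = 2) -> List[Dict[str, str]]:
    """Return one feature dict per token in the sentence."""
    return [window_features(tokens, i, window) for i in range(len(tokens))]
```

Calling `sentence_features` on a tokenized Urdu sentence yields a list of dictionaries, one per token, which could then be paired with a list of POS tags of the same length to train a sequence labeller.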
Gov't Doc #:	18222
URI:	http://142.54.178.187:9060/xmlui/handle/123456789/5148
Appears in Collections:	Thesis

Files in This Item:

File	Description	Size	Format
10445.htm		121 B	HTML
