An OCR System for Printed Nasta'liq Script: A Segmentation Based Approach

Naz, Saeeda

Please use this identifier to cite or link to this item: http://localhost:80/xmlui/handle/123456789/5059

Full metadata record

DC Field	Value	Language
dc.contributor.author	Naz, Saeeda	-
dc.date.accessioned	2019-06-25T11:19:11Z	-
dc.date.accessioned	2020-04-11T15:35:40Z	-
dc.date.available	2020-04-11T15:35:40Z	-
dc.date.issued	2016	-
dc.identifier.govdoc	16779	-
dc.identifier.uri	http://142.54.178.187:9060/xmlui/handle/123456789/5059	-
dc.description.abstract	Machine simulation of human reading has been a subject of intensive research for almost four decades. Recent improvements in recognition methods and systems for Latin script are very promising, and mature products for those languages are available on the market. In contrast, despite more than a decade of research on Urdu Optical Character Recognition (OCR), machine reading of Urdu still lags far behind human reading. Automatic Urdu character recognition is challenging because it has received less attention from researchers and because of the intrinsic complexity of Urdu text: its highly cursive and calligraphic nature, diagonal writing, and vertical overlap between characters in a sub-word. In this research, we present a novel implicit-segmentation-based technique for developing an OCR system for printed Nasta'liq text lines. The approach is robust and is based on statistical models for recognizing Urdu text in the Nasta'liq style. Unlike classical approaches, which segment text into words, ligatures, or characters, we employ implicit segmentation, in which text lines are recognized during segmentation. The developed system is evaluated on standard Urdu text databases and compared with the state-of-the-art recognition techniques proposed to date. The proposed recognition system uses two strategies: the first is based on manual features and the second on automatic features. In the first strategy, we split each text line image into small frames of width n using a sliding window and extract many features from each frame. These features are then concatenated to form a feature vector for the text line. In the second strategy, we extract features automatically, using a Multi-dimensional (MD) Long Short-Term Memory (LSTM) model in one scenario and a Convolutional Neural Network (CNN) model in the other.
Features extracted from the text lines, along with their respective transcriptions, are fed to a Recurrent Neural Network (RNN) for training or classification. Recognition is performed by MDLSTM-based recognizers with a Connectionist Temporal Classification (CTC) output layer. Experiments conducted on the standard UPTI database yield promising results: we obtained recognition rates of 96.40% (3.6% error rate) using manual features, 98% (2.0% error rate) using raw-pixel-based features, and 98.12% (1.88% error rate) using CNN-based features.	en_US
dc.description.sponsorship	Higher Education Commission, Pakistan	en_US
dc.language.iso	en_US	en_US
dc.publisher	Hazara University, Mansehra	en_US
dc.subject	Computer Sciences	en_US
dc.title	An OCR System for Printed Nasta'liq Script: A Segmentation Based Approach	en_US
dc.type	Thesis	en_US
Appears in Collections:	Thesis
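
The abstract's first strategy (sliding-window frames over a text line image) and its CTC output layer can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the two per-frame features (ink density and vertical centre of ink), the step size, and the blank label index are all assumptions chosen for illustration, and the image is assumed to be a binarized list of rows with 1 for ink.

```python
from typing import List, Sequence


def sliding_window_frames(image: Sequence[Sequence[int]], width: int,
                          step: int = 1) -> List[List[List[int]]]:
    """Split a text-line image (rows x columns) into vertical frames of `width` columns."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    n_cols = len(image[0]) if image else 0
    frames = []
    for start in range(0, n_cols - width + 1, step):
        frames.append([list(row[start:start + width]) for row in image])
    return frames


def frame_features(frame: List[List[int]]) -> List[float]:
    """Two illustrative (hypothetical) features for one frame:
    ink density and the normalized vertical centre of the ink."""
    height = len(frame)
    width = len(frame[0])
    ink = sum(sum(row) for row in frame)
    density = ink / (height * width)
    if ink == 0:
        centre = 0.5  # no ink: fall back to the middle of the frame
    else:
        weighted_rows = sum(r * sum(row) for r, row in enumerate(frame))
        centre = weighted_rows / ink / max(height - 1, 1)
    return [density, centre]


def line_features(image: Sequence[Sequence[int]], width: int,
                  step: int = 1) -> List[List[float]]:
    """Per-frame feature vectors for a whole text line, in frame order."""
    return [frame_features(f) for f in sliding_window_frames(image, width, step)]


def ctc_greedy_collapse(labels: Sequence[int], blank: int = 0) -> List[int]:
    """Standard CTC best-path post-processing: merge repeated labels, then drop blanks."""
    out = []
    prev = None
    for lab in labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

For example, `line_features(image, width=2, step=2)` yields one short feature vector per non-overlapping frame, and `ctc_greedy_collapse` turns a per-frame label sequence from a recognizer into a shorter output label sequence. A real system would use a richer feature set and a trained network to produce the per-frame labels.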

Files in This Item:

File	Description	Size	Format
9944.htm		120 B	HTML
