An approach to analysis of Arabic text documents into text lines, words, and characters

Hakim A. Abdo, Ahmed Abdu, Ramesh Manza, Shobha Bawiskar

Abstract


Extracting text lines from a document image, segmenting them into isolated words, and segmenting those words into individual characters are among the most critical processes in developing OCR systems and in turning a document into a searchable electronic representation. This paper presents a new approach to analyzing Arabic text documents. The proposed approach consists of four steps: preprocessing, text line segmentation, word segmentation, and character segmentation. The horizontal projection method is used to detect and extract text lines from the preprocessed document image. In the word segmentation step, a space threshold is computed to classify each space between connected components in a text line as a within-word space or a between-words space, so that the line can be segmented into isolated words. Finally, a thinning method is applied to find the skeleton of each segmented word, and the geometric characteristics of the characters are analyzed to detect ligatures and characters. The proposed approach was tested and evaluated on a set of 115 text images, comprising images from the KFUPM Handwritten Arabic TexT (KHATT) database and some images produced by the authors. The experimental results are encouraging, with success rates of 98.6% for line segmentation, 96% for word segmentation, and 87.1% for character segmentation.
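
As a rough illustration of the line and word segmentation steps summarized above, the following Python sketch applies a horizontal projection to find text lines and a gap threshold to group ink into words. It is not the authors' implementation. The function names, the list-of-lists binary image format (1 = ink, 0 = background), the use of column runs from a vertical projection in place of true connected components, and the mean-gap threshold are all assumptions made for illustration; the abstract does not state how the space threshold is computed.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background (assumed format)


def runs_of_ink(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of maximal runs where profile > 0."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(img: Image) -> List[Tuple[int, int]]:
    """Horizontal projection: count ink pixels per row, then split at empty rows."""
    profile = [sum(row) for row in img]
    return runs_of_ink(profile)


def segment_words(img: Image, top: int, bottom: int,
                  gap_threshold: Optional[float] = None) -> List[Tuple[int, int]]:
    """Group ink columns of one text line into words using a gap threshold.

    Column runs stand in for connected components (an assumption; overlapping
    handwritten components would need real connected-component labeling).
    """
    width = len(img[0])
    profile = [sum(img[r][c] for r in range(top, bottom)) for c in range(width)]
    pieces = runs_of_ink(profile)
    if not pieces:
        return []
    gaps = [pieces[i + 1][0] - pieces[i][1] for i in range(len(pieces) - 1)]
    if gap_threshold is None:
        # Assumed heuristic: the mean gap separates within-word from between-word spaces.
        gap_threshold = sum(gaps) / len(gaps) if gaps else 0.0
    words = [list(pieces[0])]
    for gap, piece in zip(gaps, pieces[1:]):
        if gap > gap_threshold:
            words.append(list(piece))   # between-words space: start a new word
        else:
            words[-1][1] = piece[1]     # within-word space: extend the current word
    return [(start, end) for start, end in words]


if __name__ == "__main__":
    sample = [
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
        [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    ]
    for top, bottom in segment_lines(sample):
        print((top, bottom), segment_words(sample, top, bottom))
```

The sketch returns column ranges in left-to-right image order; for Arabic, reading order would be obtained by reversing the word list of each line. Character segmentation by thinning and geometric analysis is not sketched here, because the abstract does not give enough detail to illustrate it faithfully.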

Keywords


Arabic text; Geometric characteristics; Handwritten; Segmentation projection


DOI: http://doi.org/10.11591/ijeecs.v26.i2.pp754-763

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
