Digitization And Indexing Of Arabic Historical Manuscripts In Qatar
Abstract
Background: Hundreds of thousands of rare Arabic manuscripts are available in Qatar. A rich archival heritage of the Islamic world and of Qatar is preserved in the Qatar National Library (QNL). A number of international projects have been carried out in different parts of the world to digitize Arabic manuscripts, for example by the World Digital Library in cooperation with several international bodies, such as UNESCO, the Bibliotheca Alexandrina, and King Abdullah University of Science and Technology. Still, the search tools offered by these projects have neither the ability nor the interface to find words inside the image of a manuscript. Our indexing system was implemented on different Arabic manuscript datasets, including samples from the QNL.
Objectives: The goal of this research project is to query for words in the images of any manuscript database and to point out each word's location in the images together with the equivalent text. The system shows the results of the query to the user, who can then view the text in our interactive website. As shown in Figure 1, the website interface helps users find their query easily.
Methods: In this project, we designed and implemented a novel indexing system. We present an algorithm for automatic segmentation of manuscripts. The segmented page is then manually annotated to correct mistakes in segmentation. During the correction phase, information about the image is extracted and stored in a database. This extracted information is then indexed, and users can use our search interface to find words in any ancient manuscript that has been added to the system. To the best of our knowledge, this is the first word search system for manuscripts that uses text queries to highlight the search terms in the manuscript image.
Results: The main challenge in this project is segmenting handwritten Arabic manuscripts so that they can be indexed word by word. Therefore, manual correction of the automatic segmentation was added to reach a 100% segmentation rate. This paper discusses an automated method for indexing Arabic manuscripts that has not previously been developed for handwritten manuscripts: an interactive website for the word search engine that indexes and stores the manuscripts and gives users searching and highlighting capability in the document image.
Conclusions: We considered the need to convert the words in handwritten documents into electronic data so that they become searchable online. A system prototype applying the proposed approach is being developed and experimentally tested to fully demonstrate the capabilities of the website on Arabic manuscripts. An overview of the initial experimental studies is presented. We expect the proposed word retrieval system to take search in manuscripts to a new level.
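The indexing and search step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (`WordOccurrence`), the in-memory `WordIndex`, the sample identifiers and coordinates, and the diacritic-stripping `normalize` helper are all hypothetical, chosen only to show how a text query could be mapped to word locations in a page image for highlighting.

```python
from collections import defaultdict
from dataclasses import dataclass
import unicodedata


@dataclass(frozen=True)
class WordOccurrence:
    """One annotated word and where it sits in a page image (hypothetical schema)."""
    manuscript_id: str
    page: int
    box: tuple[int, int, int, int]  # (x, y, width, height) in pixels
    text: str                       # transcription entered during manual correction


def normalize(word: str) -> str:
    """Strip surrounding whitespace and combining marks such as Arabic diacritics.

    This normalization is an assumption for the sketch, not taken from the source.
    """
    return "".join(ch for ch in word.strip() if unicodedata.category(ch) != "Mn")


class WordIndex:
    """In-memory inverted index: normalized word -> list of occurrences."""

    def __init__(self) -> None:
        self._index: dict[str, list[WordOccurrence]] = defaultdict(list)

    def add(self, occ: WordOccurrence) -> None:
        self._index[normalize(occ.text)].append(occ)

    def search(self, query: str) -> list[WordOccurrence]:
        # .get() avoids creating empty entries for unknown queries
        return list(self._index.get(normalize(query), []))


# Made-up sample data standing in for segmented and corrected pages.
index = WordIndex()
index.add(WordOccurrence("ms-001", 3, (120, 45, 60, 28), "كتاب"))
index.add(WordOccurrence("ms-001", 7, (300, 210, 64, 30), "كِتاب"))
index.add(WordOccurrence("ms-002", 1, (15, 90, 50, 25), "علم"))

hits = index.search("كتاب")
for hit in hits:
    # Each hit carries the page and bounding box a viewer could highlight.
    print(hit.manuscript_id, hit.page, hit.box)
```

Storing the bounding box alongside the transcription is what would let an interface highlight the matched word directly in the manuscript image rather than only returning the page.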
DOI/handle
http://hdl.handle.net/10576/29633
Collections
- Computer Science & Engineering [2343 items ]