A robust method for line and word segmentation in handwritten text
Abstract
Line and word segmentation is a key step in any document image analysis system. It is used, for instance, in handwriting recognition to separate words before they are recognized. Line segmentation can also serve as a preliminary step before extracting the geometric characteristics of lines, which are unique to each writer. Text line and word segmentation is not an easy task because of the following problems: 1) text lines do not all have the same direction in handwritten text; 2) text lines are not always horizontal, which makes their separation more difficult; 3) characters may overlap between successive text lines; 4) inter-word and intra-word distances are often hard to tell apart. In our method, line segmentation uses a smoothed version of the handwritten document, which makes it possible to detect the main line components with a subsequent thresholding algorithm. Each connected component of the resulting image is then assigned a separate label, which represents a line component. Then, each text region that intersects exactly one line component is assigned the label of that line component. The Voronoi diagram of the image thus obtained is then computed in order to label the remaining text pixels. Word segmentation is performed by computing a generalized Chamfer distance in which the horizontal distance is slightly favored. This distance is subsequently smoothed so that it reflects the distances between word components and neglects the distances to dots and diacritics. Word segmentation is then performed by thresholding the distance thus obtained. The threshold depends on the characteristics of the handwriting. We have therefore computed several features in order to predict it, including the sum of maximum distances within each line component, the number of connected components within the document, and the average width and height of lines.
The optimal threshold is then obtained by fitting a linear regression on those features, using a training set of about 100 documents. This method achieved the best performance on the ICFHR Handwriting Segmentation Contest dataset, reaching a matching score of 97.4% on line segmentation and 91% on word segmentation. The method has also been tested on the QUWI Arabic dataset, reaching 97.1% on line segmentation and 49.6% on word segmentation. The relatively low performance of word segmentation in Arabic script is due to the fact that words are much closer to each other than in English script. The proposed method tackles most of the problems of line and word segmentation and achieves high segmentation results. It could, however, be improved by combining it with a handwriting recognizer, which would eliminate words that are not recognized.
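The word segmentation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weights `w_h` and `w_v`, the function names, and the per-column grouping rule are all assumptions made for illustration, and the smoothing step and the regression-based threshold are omitted (the threshold is simply passed in). "Favoring" the horizontal direction is interpreted here as making horizontal steps cheaper than vertical ones.

```python
import math

INF = float("inf")


def chamfer_distance(ink, w_h=1.0, w_v=1.5):
    """Two-pass weighted chamfer distance from every pixel to the nearest ink pixel.

    ink is a list of rows of 0/1 values (1 = ink). w_h and w_v are hypothetical
    step costs; w_h < w_v makes horizontal steps cheaper than vertical ones.
    """
    rows, cols = len(ink), len(ink[0])
    w_d = math.hypot(w_h, w_v)  # cost of a diagonal step
    dist = [[0.0 if ink[r][c] else INF for c in range(cols)] for r in range(rows)]

    # Forward pass: propagate distances from the left and from the row above.
    for r in range(rows):
        for c in range(cols):
            d = dist[r][c]
            if c > 0:
                d = min(d, dist[r][c - 1] + w_h)
            if r > 0:
                d = min(d, dist[r - 1][c] + w_v)
                if c > 0:
                    d = min(d, dist[r - 1][c - 1] + w_d)
                if c < cols - 1:
                    d = min(d, dist[r - 1][c + 1] + w_d)
            dist[r][c] = d

    # Backward pass: propagate distances from the right and from the row below.
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            d = dist[r][c]
            if c < cols - 1:
                d = min(d, dist[r][c + 1] + w_h)
            if r < rows - 1:
                d = min(d, dist[r + 1][c] + w_v)
                if c < cols - 1:
                    d = min(d, dist[r + 1][c + 1] + w_d)
                if c > 0:
                    d = min(d, dist[r + 1][c - 1] + w_d)
            dist[r][c] = d
    return dist


def split_words(line_ink, threshold, w_h=1.0, w_v=1.5):
    """Split one text line into words by thresholding the distance map.

    A column belongs to a word if its smallest distance to ink is at most
    threshold; consecutive such columns form one word. Returns a list of
    inclusive (first_column, last_column) ranges. This column-wise grouping
    is a simplification chosen for the sketch.
    """
    dist = chamfer_distance(line_ink, w_h, w_v)
    cols = len(line_ink[0])
    col_min = [min(row[c] for row in dist) for c in range(cols)]

    words, start = [], None
    for c, d in enumerate(col_min):
        if d <= threshold and start is None:
            start = c
        elif d > threshold and start is not None:
            words.append((start, c - 1))
            start = None
    if start is not None:
        words.append((start, cols - 1))
    return words


# One text line, 3 rows by 12 columns: a 1-pixel gap inside the first word
# and a 4-pixel gap before the second word.
line = [
    [0] * 12,
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    [0] * 12,
]
print(split_words(line, threshold=1.5))  # [(0, 5), (8, 11)]
```

With a threshold of 1.5 the narrow gap is treated as intra-word spacing and the wide gap as a word boundary. In the method described above, that threshold would instead be predicted per document from handwriting features.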
DOI/handle
http://hdl.handle.net/10576/27945
Collections
- Computer Science & Engineering [2402 items]