Preliminary Research for Provision of Javanese Script Image Dataset from Javanese Script Printed Book
Anastasia Rita Widiarti, Gabriel Ryan Prima, Ciprianus Kuntoro Adi

Informatics Department
Faculty of Science and Technology
Sanata Dharma University
Yogyakarta
Indonesia


Abstract

Developing a system that transliterates Javanese script into other scripts by means of character recognition first requires training data in the form of script images covering all possible written forms. However, no existing reference catalogs the unique forms of Javanese script.
This research aims to provide a training dataset that covers all unique character shapes. The dataset is built from script images taken from a book written in Javanese script and processed with image processing techniques. The captured images are then grouped into their respective classes. The K-means clustering algorithm offers a way to automate this grouping.
Data preparation in this research starts with preprocessing the document image, which comprises binarization, inversion, and filtering. The scripts are then segmented using the projection profile method. Features are extracted from each script image using the Intensity of Character (IoC) algorithm, and the resulting feature vectors are grouped using the K-means clustering algorithm.
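The preprocessing, segmentation, and feature extraction steps can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that IoC counts the ink pixels in each cell of an n x n grid laid over the glyph, it omits the filtering step, and the function names and the tiny test page are hypothetical.

```python
def binarize_inverted(gray, threshold=128):
    """Binarize a grayscale image and invert it so that ink pixels become 1."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def split_runs(profile):
    """Return (start, end) index pairs of consecutive non-zero profile entries."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment(binary):
    """Cut text lines with a horizontal profile, then glyphs with a vertical one."""
    glyphs = []
    row_profile = [sum(row) for row in binary]
    for top, bottom in split_runs(row_profile):
        line = binary[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]
        for left, right in split_runs(col_profile):
            glyphs.append([row[left:right] for row in line])
    return glyphs


def ioc_features(glyph, n=3):
    """Assumed IoC: count ink pixels in each cell of an n x n grid (row-major)."""
    h, w = len(glyph), len(glyph[0])
    feats = []
    for i in range(n):
        for j in range(n):
            r0, r1 = i * h // n, (i + 1) * h // n
            c0, c1 = j * w // n, (j + 1) * w // n
            feats.append(sum(sum(row[c0:c1]) for row in glyph[r0:r1]))
    return feats


# Hypothetical 5 x 8 grayscale page: 255 is paper, 0 is ink.
W, B = 255, 0
page = [
    [W, W, W, W, W, W, W, W],
    [W, B, B, B, W, B, W, W],
    [W, B, W, B, W, B, W, W],
    [W, B, B, B, W, B, W, W],
    [W, W, W, W, W, W, W, W],
]
glyphs = segment(binarize_inverted(page))
features = [ioc_features(g) for g in glyphs]
```

On real scans the glyphs would normally be resized to a common size before the grid is applied, so that narrow glyphs do not leave grid cells empty.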
The data were taken from scans of pages 2 and 59 of Hamong Tani's book. Preprocessing and segmentation yielded 597 Javanese script images. Using the IoC 3x3 feature with the number of groups set to 65 classes, the grouping achieved a silhouette index of 0.5060, which places the cluster structure in the feasible category.
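How a grouping and its silhouette index can be computed is sketched below using only the Python standard library. This is an illustrative sketch, not the code used in the research: it assumes Euclidean distance and the standard mean silhouette coefficient, the function names are hypothetical, and the six 2-D points are toy data standing in for the 9-dimensional IoC 3x3 feature vectors.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated groups of three points each.
toy_features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels = kmeans(toy_features, k=2)
toy_score = silhouette_index(toy_features, toy_labels)
```

The index ranges from -1 to 1, with higher values indicating tighter and better separated clusters. In the setting described above, the input would be the 597 feature vectors with k set to 65.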
Checking against the ground truth, that is, inspecting whether the images within each formed group are similar, gave an accuracy of 86%. It can therefore be concluded that the steps taken in this research can serve as a model for providing a Javanese script image dataset.
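One possible way to turn such a ground-truth check into a number is sketched below. The abstract does not define the accuracy formula, so this is an assumption: each cluster is labeled with its most frequent true class, and accuracy is the fraction of images that match their cluster's majority class. The function name and the sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def majority_accuracy(cluster_ids, true_labels):
    """Fraction of items whose true label equals their cluster's majority label."""
    groups = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        groups[cid].append(label)
    # For each cluster, count how many members carry its most common label.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in groups.values())
    return correct / len(true_labels)


# Hypothetical example: 8 images in 3 clusters, with manually assigned classes.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = ["ha", "ha", "na", "na", "na", "ca", "ca", "ha"]
accuracy = majority_accuracy(cluster_ids, true_labels)
```

In this example each cluster has two images of its majority class, so 6 of the 8 images count as correctly grouped.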

Keywords: K-means algorithm, clustering, Javanese script, image processing

Topic: Computer Science

BTS 2022 Conference | Conference Management System