Enhancing Indonesian Text Processing with Rule-Based Stemming for Affixed and Reduplicated Words

Irwan Setiawan, Fitri Diani, Yadhi A. Permana, Suprihanto

Department of Computer and Informatics Engineering, Politeknik Negeri Bandung, Indonesia

Abstract

This paper presents the development and evaluation of two rule-based stemming algorithms, SFAIS (Suffix-First Approach Indonesian Stemmer) and PFAIS (Prefix-First Approach Indonesian Stemmer), designed to address the morphological challenges of the Indonesian language. The study also builds datasets comprising 31,310 unique root words and 19,075 unique affixed words, including 1,966 reduplicated words derived from 6,872 root words. These datasets are made publicly available to support further research. SFAIS achieved an Index Compression Factor of 65.09, a Word Stemmed Factor of 99.89%, and a Correctly Stemmed Words Factor of 93.24%, while PFAIS achieved an Index Compression Factor of 63.94, a Word Stemmed Factor of 98.08%, and a Correctly Stemmed Words Factor of 92.17%. SFAIS reached an overall accuracy of 93.14%, outperforming PFAIS at 90.40%. Error analysis showed that SFAIS made fewer errors in total (1,309) than PFAIS (1,832), with fewer under-stemming and mis-stemming errors. These results indicate that the suffix-first approach processes Indonesian affixed and reduplicated words more accurately. The study contributes more accurate stemming algorithms and reusable datasets to Indonesian natural language processing.
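
The reported accuracies can be checked against the reported error counts. The sketch below assumes that overall accuracy is the share of the 19,075 affixed words that were not counted as errors; the abstract does not state this formula, and the function name `accuracy_pct` is hypothetical, but under this assumption the figures agree.

```python
# Assumption (not stated in the abstract): overall accuracy is
# 1 - total_errors / evaluated_words, with the 19,075 unique
# affixed words taken as the evaluation set.

AFFIXED_WORDS = 19_075  # unique affixed words reported in the abstract


def accuracy_pct(total_errors: int, evaluated_words: int = AFFIXED_WORDS) -> float:
    """Return accuracy as a percentage, rounded to two decimals."""
    return round(100 * (1 - total_errors / evaluated_words), 2)


sfais = accuracy_pct(1_309)  # SFAIS total error count from the abstract
pfais = accuracy_pct(1_832)  # PFAIS total error count from the abstract

print(f"SFAIS: {sfais:.2f}%")  # SFAIS: 93.14%
print(f"PFAIS: {pfais:.2f}%")  # PFAIS: 90.40%
```

Both values match the accuracies given above, which suggests the error counts and the accuracy figures refer to the same set of affixed words.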

Keywords: Indonesian Language Processing, Rule-Based Stemming, Morphological Analysis, Affixed Words, Reduplicated Words

Topic: Artificial Intelligence (AI)

ICAST 2024 Conference | Conference Management System