Maged S. Al-Shaibani
Computational Linguistics 1–42.
Published: 15 November 2024
Abstract
This article introduces a novel representation of Arabic text as an alternative approach for Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation through extensive analysis of various text corpora, differing in size and domain and tokenized using multiple tokenization techniques. Furthermore, we examined the information density of this representation and compared it with standard dotted Arabic text using text entropy analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English text analysis to gain additional insights. Our investigation extended to various upstream and downstream NLP tasks, including language modeling, text classification, sequence labeling, and machine translation, examining the implications of both representations. Specifically, we performed seven different downstream tasks using various tokenization schemes, comparing the standard dotted and dotless Arabic text representations. Performance with both representations was comparable across the different tokenizations. However, the dotless representation achieves these results with a significant reduction in vocabulary size, with reductions of up to 50% in some scenarios. Additionally, we present a system that restores the dots to dotless Arabic text, which is useful for tasks that require Arabic text as output.
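The core idea, as an informal illustration only: Arabic letters that differ solely by their dots collapse onto a shared skeleton, shrinking the character inventory and, after tokenization, the vocabulary. Below is a minimal Python sketch of such a mapping together with a character-level entropy comparison; the letter grouping and the `to_dotless`/`char_entropy` helpers are illustrative assumptions, not the article's actual implementation.

```python
import math
from collections import Counter

# Illustrative grouping of Arabic letters that differ only by their dots.
# Each group is collapsed onto one dotless representative; the exact mapping
# used in the article may differ (e.g., final forms of noon and yeh have
# distinct skeletons and may be handled separately).
DOTLESS_GROUPS = {
    "ب": "بتثني",
    "ج": "جحخ",
    "د": "دذ",
    "ر": "رز",
    "س": "سش",
    "ص": "صض",
    "ط": "طظ",
    "ع": "عغ",
    "ف": "فق",
}

# Flat character -> representative translation table.
_TABLE = str.maketrans({c: rep for rep, group in DOTLESS_GROUPS.items() for c in group})


def to_dotless(text: str) -> str:
    """Replace every dotted letter with its dotless-skeleton representative."""
    return text.translate(_TABLE)


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    sample = "تجربة نص عربي قصير"
    dotless = to_dotless(sample)
    print(dotless)
    # The dotless text draws on fewer distinct symbols, so its character-level
    # entropy is typically lower than that of the dotted original.
    print(char_entropy(sample), char_entropy(dotless))
```

Applying the same collapse to a corpus before tokenization is what would merge forms distinguished only by dots into shared tokens, which is consistent with the vocabulary reductions the abstract reports, although the article's own pipeline and tokenizers are not reproduced here.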