One distinguishing characteristic of the Arabic language is its diacritics. Although
diacritics are used in other languages such as French, Romanian, and Croatian, they are limited to a few letters, as opposed to Arabic, which has diacritics on most letters (Darwish et al.). There are eight Arabic diacritics: fatha, damma, kasra, fathatan, dammatan, kasratan, sukun, and shadda. These diacritics serve two main functions: core-word diacritics aid correct word pronunciation, and case-ending diacritics mark grammatical case. The three most common case-endings are the nominative (المرفوع), accusative (المنصوب), and genitive (المجرور) cases. The nominative case includes the subject of a verbal sentence (الفاعل) and the subject and predicate of a nominal sentence (المبتدأ والخبر). The accusative case includes the object of a transitive verb (المفعول به) and adverbial expressions of time and place (ظرف الزمان والمكان). The genitive case includes the object of a preposition (الاسم المجرور) and the second term of an idāfa (المضاف إليه). While both case-ending and core-word diacritics are often omitted in Arabic writing, such as news articles and books, they still play an important role in comprehension, and some diacritics are often added where needed for clarification.
The role of diacritics also extends to Islam in many ways. Arabic diacritics were not part of early Quranic scriptures, but the present-day Quran is fully diacritized. Abu Al-Aswad Al-Du’ali, one of the earliest Arabic grammarians and known as the Father of Arabic Grammar, was responsible for the diacritization of the Quran (Stearns). He walked into a mosque one day to pray and heard a man reciting (in Arabic) the verse, “And [it is] an announcement from Allah and His Messenger to the people on the day of the greater pilgrimage that Allah is disassociated from the disbelievers, and [so is] His Messenger” (Quran 9:3):
"وَأَذَٰنٌۭ مِّنَ ٱللَّهِ وَرَسُولِهِۦٓ إِلَى ٱلنَّاسِ يَوْمَ ٱلْحَجِّ ٱلْأَكْبَرِ أَنَّ ٱللَّهَ بَرِىٓءٌۭ مِّنَ ٱلْمُشْرِكِينَ ۙ وَرَسُولُه"
In his recitation, the man incorrectly pronounced the last word as “wa rasoolih” (وَرَسُولِه) instead of “wa rasooluh” (وَرَسُولُه) (Dodge 88), which changed the meaning of the verse from “Allah and His Messenger are disassociated from the disbelievers” to “Allah is disassociated from both the disbelievers and His Messenger”, even though both “rasoolih” and “rasooluh” mean “His Messenger”. This led Al-Du’ali to formalize the rules of Arabic grammar, including case-ending diacritics, to prevent such mistakes in reading the Quran.
The presence of diacritics in the Arabic language allows for more fluid sentence structures, such as those in which the direct object comes before the subject of a sentence. The subjects and direct objects can be understood and differentiated with case-ending diacritics. There are several other similar examples in the Quran where one incorrect case-ending can drastically change the meaning of a verse. Another example is, “Only those fear Allah, from among His servants, who have knowledge” (Quran 35:28): "إِنَّمَا يَخْشَى ٱللَّهَ مِنْ عِبَادِهِ ٱلْعُلَمَـٰٓؤُا۟". In Arabic, changing the case-ending on the word Allah from “Allaha” (ٱللَّهَ) to “Allahu” (ٱللَّهُ) would make Allah the subject of the sentence rather than the direct object, which in turn would change the meaning of the verse to “Allah fears from among His servants those who have knowledge.” From these examples, the power of diacritics is evident in their ability to flip the meaning of a sentence entirely.
Arabic diacritics are also used among Islamic scholars to debate Islamic jurisprudence. One example of Arabic diacritics and grammar being used to determine Islamic rulings is a Quranic verse on wudu’, or ablution. The verse states, “O you who have believed, when you rise to [perform] prayer, wash your faces and your forearms to the elbows and wipe over your heads and wash your feet to the ankles” (Quran 5:6):
"يَـٰٓأَيُّهَا ٱلَّذِينَ ءَامَنُوٓا۟ إِذَا قُمْتُمْ إِلَى ٱلصَّلَوٰةِ فَٱغْسِلُوا۟ وُجُوهَكُمْ وَأَيْدِيَكُمْ إِلَى ٱلْمَرَافِقِ وَٱمْسَحُوا۟ بِرُءُوسِكُمْ وَأَرْجُلَكُمْ إِلَى ٱلْكَعْبَيْنِ "
Scholars used this verse to debate the rulings on several aspects of wudu’, including whether it is obligatory to follow the mentioned order of washing, and whether the feet should be washed or wiped. In analyzing the syntactic structure of the verse, we can see that “your faces and your forearms” (وُجُوهَكُمْ وَأَيْدِيَكُمْ) are in the accusative form as they are the direct objects of the verb “wash” (فَٱغْسِلُوا۟). In the second part of the verse, “wipe over your heads”, “your heads” (رُءُوسِكُمْ) is in the genitive form because of the preposition (بِ) directly before it. However, “your feet” (أَرْجُلَكُمْ) is in the accusative form, like the faces and forearms and unlike the heads. This shows that the “and” (وَ) before “your feet” conjoins it with the first part of the verse (“wash your faces and your forearms”) rather than the second part (“wipe over your heads”), which is how the ruling was determined to be washing the feet rather than wiping them. However, there are alternative readings of the Quran that have the feet in the genitive form, and these are interpreted as wiping the feet rather than washing them. Then comes the question of why the feet were mentioned after the verb “wipe” rather than after the verb “wash” if the feet are also meant to be washed. In other words, why didn’t the verse state “wash your faces and your forearms to the elbows and your feet to the ankles and wipe over your heads” instead? Some scholars used this breaking up of the clauses as evidence that performing wudu’ in the specified order is obligatory, whereas others argued that the specified order is optional because “and” doesn’t signify order, as opposed to “then” (ثمَّ) (Abd al-Husayn and al-Musawi).
The aforementioned examples point to the role of case-endings in understanding the overall meaning of a sentence. Core-word diacritics, on the other hand, have a role in differentiating between similar words. Because Arabic follows a root system in which most nouns, adjectives, adverbs, and verbs are derived from a set of 10,000 roots of mostly 3 letters (Darwish et al.), there are many words that are spelled the same but have different core-word diacritics. One example of this is the word بر, which could mean righteousness, land, or wheat, depending on the diacritic on the first letter (ب). The word بِرّ means righteousness, whereas بَرّ means land and بُرّ means wheat. Homonyms, words having the same spelling or pronunciation but different meanings, are not exclusive to Arabic. In English, for example, there are homonyms such as trunk, which could mean an elephant’s trunk or a tree trunk, and bat, which could refer to the animal or sports equipment. In many of these cases, the meanings of homonyms are identified through context. While this can also be the case for Arabic and is part of the reason why Arabic newspapers and books aren’t always diacritized, there are cases where context may not be helpful. For example, the word فلك can be either فَلَك, which means orbit, or فُلْك, which means ship. One verse states, “It is not allowable for the sun to reach the moon, nor does the night overtake the day, but each, in an orbit, is swimming” (Quran 36:40):
”لَا ٱلشَّمْسُ يَنۢبَغِى لَهَآ أَن تُدْرِكَ ٱلْقَمَرَ وَلَا ٱلَّيْلُ سَابِقُ ٱلنَّهَارِ ۚ وَكُلٌّۭ فِى فَلَكٍۢ يَسْبَحُونَ “
In this example, the context word “swimming” may suggest فلك to mean ship since they both have to do with bodies of water, but the word “swimming” is rather metaphorical to describe the motion of the sun and moon in orbits.
An additional challenge in distinguishing words without diacritics in Arabic is distinguishing the passive form of a verb from its active form. For example, the sentence قتل الرجل without diacritics is ambiguous as it could have one of three different meanings: the man killed [someone] (قَتَلَ الرَّجُلُ), [someone] killed the man (قَتَلَ الرَّجُلَ), or the man was killed (قُتِلَ الرَّجُلُ). Distinguishing between different active forms of verbs from the same root can also be difficult, as the only difference between a form I verb and a form II verb is the shadda on the second letter of the verb, as in the form I verb سَمِع (to hear) and the form II verb سمَّع (to recite). Further ambiguities arise in the conjugations of undiacritized verbs, such as سمعت, which could translate to either “she heard” (سَمِعَتْ) or “I heard” (سَمِعْتُ). These ambiguities in undiacritized Arabic text also pose challenges for children and non-native speakers learning Arabic. Without diacritics, it can be difficult to read unfamiliar words correctly, since the short vowels are not written. It is also difficult to look up new words in Arabic dictionaries, which list words of the same root next to each other, making it unclear which diacritized entry corresponds to the undiacritized word being looked up.
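The ambiguity described above can be made concrete with a short Python sketch. It assumes that diacritics can be identified as Unicode combining marks, which holds for the eight diacritics listed earlier; the function name strip_diacritics is only illustrative and does not come from any of the cited tools.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (which include the Arabic diacritics) from text."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# The three readings of the sentence discussed above.
readings = {
    "the man killed [someone]": "قَتَلَ الرَّجُلُ",
    "[someone] killed the man": "قَتَلَ الرَّجُلَ",
    "the man was killed": "قُتِلَ الرَّجُلُ",
}

# Collect the distinct undiacritized spellings of the three readings.
undiacritized = {strip_diacritics(sentence) for sentence in readings.values()}
print(len(readings), "readings share", len(undiacritized), "undiacritized spelling(s)")
```

Automated diacritization is the inverse of this function: it has to recover which of the diacritized readings was intended from the single undiacritized spelling and its context.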
The challenges that arise from the lack of diacritics in Arabic texts, together with the fact that diacritization is time-consuming and requires linguistic expertise, point to the usefulness of a tool that automates the process of diacritization. This tool could be used by children and other Arabic learners to help them correctly read raw text from a wider range of sources, as opposed to limiting themselves to diacritized texts or attempting to read undiacritized texts while facing the previously mentioned challenges. However, the uses of automated diacritization are not limited to Arabic learners but extend to native Arabic speakers; automated diacritization has technological applications in the development of widely used features such as Arabic text-to-speech models, automated speech recognition systems, and machine translation (Almanaseer et al.). The absence of diacritics in Arabic text is a problem when developing text-to-speech software (Rebai and BenAyed), which needs to pronounce the correct short vowels regardless of whether diacritics are present in a text. In machine translation, automated diacritization can disambiguate words for more accurate translations. Arabic is the fifth most spoken language in the world and one of the six official languages of the United Nations; it is spoken by over 422 million people (Abbad and Xiong), and there are more than 168 million Arabic internet users globally and over 2.1 billion Arabic websites (Almanaseer et al.). In our current digital world, these numbers make a tool that automates diacritization with high accuracy all the more important.
Existing research relies primarily on three datasets: the Linguistic Data Consortium’s Arabic Treebank part 3 (LDC-ATB3), Tashkeela, and the Quran (Abbad and Xiong; Abandah and Abdel-Karim; Almanaseer et al.; Nelken and Shieber). Some research methods make a distinction between Classical Arabic (CA), which is the language of the Quran and old poems and books, and Modern Standard Arabic (MSA), which is the language used in the media and in education today (Abandah and Abdel-Karim). The LDC-ATB3 dataset consists of MSA while Tashkeela includes both MSA and CA. The LDC-ATB3 consists of 599 distinct newswire stories from the Lebanese publication An Nahar (Arabic Treebank), whereas the Tashkeela dataset contains over 75 million fully vocalized words obtained from 97 books, collected mostly from Islamic classical books; it also includes Modern Standard Arabic texts that represent 1.15% of the corpus, which is about 867,913 words (Zerrouki). The limited number of available datasets is mainly a result of the limited number of diacritized texts available, as diacritics are often omitted. This is already a significant limitation, as research using machine learning models often depends on the size and diversity of the data. Moreover, the datasets were diacritized by humans and are therefore prone to human error; if all the models are trained on the same datasets, they could pick up and incorporate the same human errors in different ways.
Three main approaches have been used in research on automated diacritization methods: rule-based models, statistical models, and hybrid models. Rule-based models rely on the linguistic rules of Arabic to determine the diacritic on each letter, with rules being specified for both core-word and case-ending diacritics. Core-word diacritic rules involve the identity of a letter as well as its position in the word, such as: for the letter إ, the diacritic must be a kasra; for the first letter of a word, the diacritic cannot be a sukun; and if a letter is not the last character of the word, the diacritic cannot be any form of tanween (Abbad and Xiong). Case-ending diacritic rules involve looking at surrounding words, such as specifying a rule that “the relative pronoun (اسم موصول) is always followed by a nominative verb in the present tense (فعل مضارع مرفوع) or a verb in the past tense or a nominative noun (اسم مرفوع) or a particle (حرف)” or the rule that “the preposition (حرف جر) is always followed by a genitive noun (اسم مجرور)” (Chennoufi and Mazroui). Other rules that would need to be specified are exceptions to the common grammatical rules of Arabic, such as cases when an indefinite noun cannot have any tanween or kasra diacritics (الممنوع من الصرف). Statistical models, on the other hand, assign a probability to each diacritic appearing on each character in a word and choose the diacritic with the highest probability. These probabilities are determined by applying deep learning algorithms, such as Recurrent Neural Networks with Long Short-Term Memory (Abbad and Xiong; Abandah and Abdel-Karim) and Deep Belief Networks (Almanaseer et al.), to a dataset to train a model. The last and most common approach is the hybrid approach, which combines rule-based models with statistical models, often using rule-based approaches to fine-tune the outputs of statistical models to achieve lower error rates (Almanaseer et al.).
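As a rough illustration of how a hybrid model might combine the two, the Python sketch below encodes the three core-word rules quoted above as a filter and then picks the most probable diacritic that survives it. This is a minimal sketch under stated assumptions, not the method of any cited paper: the function names are hypothetical, the probabilities are made-up stand-ins for the output of a trained model, and the sketch ignores that a shadda can combine with another diacritic or that a letter can carry none.

```python
DIACRITICS = ["fatha", "damma", "kasra", "fathatan",
              "dammatan", "kasratan", "sukun", "shadda"]
TANWEEN = {"fathatan", "dammatan", "kasratan"}
ALEF_HAMZA_BELOW = "\u0625"  # the letter إ


def allowed_diacritics(word, index):
    """Apply the three core-word rules to the letter at `index` of an undiacritized word."""
    allowed = set(DIACRITICS)
    if word[index] == ALEF_HAMZA_BELOW:
        allowed &= {"kasra"}        # rule 1: إ must take a kasra
    if index == 0:
        allowed.discard("sukun")    # rule 2: no sukun on the first letter
    if index != len(word) - 1:
        allowed -= TANWEEN          # rule 3: tanween only on the last character
    return allowed


def pick_diacritic(word, index, probabilities):
    """Choose the most probable diacritic among those the rules allow."""
    allowed = allowed_diacritics(word, index)
    candidates = {d: p for d, p in probabilities.items() if d in allowed}
    if not candidates:
        raise ValueError("the rules exclude every diacritic the model proposed")
    return max(candidates, key=candidates.get)


# Hypothetical model output for the first letter of إلى: the statistical
# model alone would prefer fatha, but the rule filter overrides it.
made_up_probabilities = {"fatha": 0.6, "kasra": 0.3, "sukun": 0.1}
print(pick_diacritic("\u0625\u0644\u0649", 0, made_up_probabilities))  # kasra
```

Filtering before taking the maximum is only one possible design; a hybrid system could equally use rules to correct a model's output after the fact.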
The performance of the developed models is measured in two ways: the Word Error Rate (WER) and the Diacritic Error Rate (DER), with the error rates for diacritization including and excluding case endings being measured separately. The WER is the percentage of incorrectly diacritized words, that is, words with at least one incorrectly diacritized letter, whereas the DER is the percentage of incorrectly diacritized letters (Almanaseer et al.). The performance of automated diacritization models has improved considerably since 2005, when the error rates for one model were 12.8% DER and 23.6% WER when including case endings, and 6.5% DER and 7.3% WER when excluding case endings (Nelken and Shieber). Recent models have error rates ranging between roughly 1% and 7%, including a model that, when tested on the testing data (a subset of the LDC-ATB3 dataset that wasn’t used in training the model), had a performance of 2.21% DER and 6.73% WER when including case endings, and 1.2% DER and 2.89% WER when excluding case endings. That same model was tested on another dataset, newly formed to increase the number of available datasets for research, that contains 26 thousand MSA words from children’s story books. It had a similar performance of 2.4% DER and 6.57% WER when including case endings, and 1.33% DER and 2.83% WER when excluding case endings (Almanaseer et al.).
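The two definitions can be sketched in a few lines of Python. The sketch assumes that the reference and the prediction contain the same words with the same base letters, and it counts every non-combining character (including digits and punctuation) as a letter; published evaluations may follow different conventions, so the function names and counting choices here are illustrative only.

```python
import unicodedata


def split_letters(word):
    """Pair each base character with the combining marks attached to it."""
    units = []
    for ch in word:
        if unicodedata.combining(ch) and units:
            units[-1][1].append(ch)
        else:
            units.append((ch, []))
    # Sort the marks so that shadda+fatha and fatha+shadda compare as equal.
    return [(base, "".join(sorted(marks))) for base, marks in units]


def error_rates(reference, prediction):
    """Return (DER, WER) as percentages for two diacritized strings."""
    ref_words, pred_words = reference.split(), prediction.split()
    if not ref_words or len(ref_words) != len(pred_words):
        raise ValueError("texts must be non-empty and have the same number of words")
    letters = wrong_letters = wrong_words = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_units, pred_units = split_letters(ref_word), split_letters(pred_word)
        if [b for b, _ in ref_units] != [b for b, _ in pred_units]:
            raise ValueError("base letters differ between reference and prediction")
        errors = sum(r != p for (_, r), (_, p) in zip(ref_units, pred_units))
        letters += len(ref_units)
        wrong_letters += errors
        wrong_words += 1 if errors else 0
    return 100 * wrong_letters / letters, 100 * wrong_words / len(ref_words)


# Reference: "the man killed [someone]"; prediction has the wrong case ending.
der, wer = error_rates("قَتَلَ الرَّجُلُ", "قَتَلَ الرَّجُلَ")
print(f"DER = {der:.1f}%, WER = {wer:.1f}%")
```

A single wrong case ending makes the whole word count as an error, which is one reason the WER is consistently higher than the DER in the results reported above.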
Although these single-digit error rates may seem low, they add up given the volume of text we read: at a 3% word error rate, a 750-word article would be expected to contain approximately 22 incorrectly diacritized words. Additionally, one noteworthy pattern is that error rates are at least 50% higher when case endings are taken into account. As such, case-ending diacritization has proved to be a challenge not only for Arabic learners and native speakers alike, but also for machine learning models, because of the complex interdependence of words in sentences. An additional challenge faced by researchers when building models is that case-ending diacritics aren’t necessarily on the last letter of a word. This is the case in words that end with pronouns, such as كتابَه ,كتابِه ,كتابُه, which all mean “his book”, where the last letter is the pronoun “his” and the case-ending diacritic is on the penultimate letter.
Although there are several diacritization tools that are available for use, including Farasa, MADAMIRA, Mishkal, and Shakkala, such diacritization tools are limited as many researchers do not release their source code or provide applications for people to use (Abbad and Xiong). Farasa and Mishkal are application programming interfaces (APIs) that can be inaccessible to many as they require some knowledge of coding to set them up for use. MADAMIRA and Shakkala are web applications that diacritize text upon entering undiacritized text and pressing a button. Using the example provided on Farasa’s website
"يُشار إلى أن اللغة العربية يتحدثها أكثر من 422 مليون نسمة" (it is noteworthy that the Arabic language is spoken by over 422 million people), when diacritized,
Farasa outputs “يُشَارُ إِلَى أَنَّ اللُّغَةَ العَرَبيَّةَ يَتَحَدَّثُها أَكْثَرُ مِنْ 422 مِلْيونَ نَسَمَةٍ”,
MADAMIRA outputs “يُشارُ إِلَى أَنَّ اللُّغَةَ العَرَبِيَّةَ يَتَحَدَّثُها أَكْثَرَ مِن 422 مِلْيُونِ نَسَمَةٍ”, and
Shakkala outputs “يَُشَارَ إلَى أَنَّ اللُّغَةَ الْعَرَبِيَّةَ يَتَحَدَّثُهَا أَكْثَرُ مَنْ 422 مُلْيُونُ نَسْمَةً”.
In this relatively simple and short example, all three tools output different case endings for the word مليون (million), where Farasa outputs the accusative case (مِلْيونَ), MADAMIRA outputs the genitive case (مِلْيُونِ), and Shakkala outputs the nominative case with the incorrect core-word diacritic on the first letter (مُلْيُونُ). This shows that although the diacritization tools available can still be useful in certain cases, these tools don’t use the most up-to-date methods and are likely to make errors frequently.
Overall, diacritics are often omitted in books, news articles, and other written texts because native Arabic speakers with a mastery of the language can infer them on their own. However, diacritics still have a significant role in disambiguating words and meanings in many different ways. Case endings clarify the roles of words in sentences where different diacritics can drastically change the meaning, especially given the flexible sentence structures of Arabic, which don’t require a subject and allow the direct object of a verb to come before its subject. Core-word diacritics, on the other hand, are used to disambiguate words, such as homonyms, passive and active verb forms, and verb conjugations, all of which differ only in their diacritics. The role of diacritics is especially significant in the Islamic context, where Quranic verses are interpreted differently by different scholars, with diacritics being used as evidence for different interpretations. Additionally, the Quran is an error-free diacritized text that is useful as a dataset for developing and testing diacritization systems.
Although Arabic diacritization research has made significant progress over recent years, the work related to Arabic diacritics is far from done. Automated diacritization methods can still be refined to improve accuracy and reduce error rates, and diacritization tools need to be implemented as user-friendly web or mobile applications, or as features embedded into existing Arabic websites, so that they can be more widely available and used in Arabic education. Another area that needs work is increasing the number of diacritized datasets available for research. This would require manual diacritization of large amounts of text from various sources, genres, levels, and time periods by individuals who have expert knowledge of Arabic grammar and diacritization rules. In addition to CA and MSA, which have been used in Arabic diacritics research, Arabic dialects are becoming more commonly used in writing (Abandah and Abdel-Karim), such as in text messaging and posting to social media platforms. Unlike CA and MSA, Arabic dialects are not bound by the grammar rules of Arabic and don’t include case-ending diacritics (the diacritic that is used is the sukun, which represents the absence of a short vowel), but they are nonetheless important to consider in research, especially in applications such as speech recognition, where dialects are the spoken form of Arabic.
References