Creating Readability Formulas for non-English languages: The Problem of the Syllable
A readability score is a computer-calculated index which can tell you roughly what level of education someone will need to be able to read a piece of text easily. The original formulas used to generate these scores, such as Flesch Reading Ease, were created for use with the English language. However, in recent years the potential value of readability tests for non-English languages has grown. For example, researchers developing a readability measure for use in India highlight its social potential in ensuring the selection of appropriate educational materials and improving literacy rates in the country.
The body of research exploring the use of readability measurement with non-English languages is developing fast. Research into the measurement of readability for alphabetic languages such as Spanish began as early as the 1950s. Alongside these advances, recent years have seen a growing focus on measuring the readability of non-alphabetic languages such as Arabic, Bangla and Hindi.
Requirements for, and challenges to, the creation of valid and reliable readability measures are too many to address in a single article. The focus here will be on the challenge of syllables: how to count them with accuracy and how important they are, or not, to the readability of text for different languages.
Challenges of counting syllables
As anyone who has ever tried to learn another language will know, language is hugely complex. For every rule there are exceptions, and countless words do not behave the way you would expect them to. Syllables are a key contributor to the confusion.
Consider, for example, the vowel cluster oi. This cluster appears in lots of words, e.g. boil and going, but whether and how oi is counted in terms of syllables differs across different sorts of words. In words such as going, -ing is a suffix added onto a root word, e.g. go-ing. In these words, -ing counts as a syllable. So, go is one syllable, and by adding -ing we create a two-syllable word with the oi in the middle of the word but split across two syllables. However, ing can also be a word ending without being a suffix, as in the onomatopoeic word boing. In this case, it is the oi that is emphasised and the word has only one syllable in total. This contrasts with the word going, which has the same number of letters and an oi vowel cluster in the same position but is counted as two syllables.
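The going/boing contrast can be made concrete with a short sketch. The code below assumes a deliberately naive heuristic, counting each run of consecutive vowel letters as one syllable; the function name is hypothetical and is not taken from any particular readability tool.

```python
import re

def naive_syllable_count(word: str) -> int:
    """Count runs of consecutive vowel letters as syllables (a rough heuristic)."""
    vowel_runs = re.findall(r"[aeiouy]+", word.lower())
    # Every word is treated as having at least one syllable.
    return max(1, len(vowel_runs))

for word in ("boil", "boing", "going"):
    print(word, naive_syllable_count(word))
```

Under this heuristic all three words come out as one syllable, because each contains a single vowel run. That is right for boil and boing but wrong for going, which is why automatic counters tend to need exception lists or pronunciation dictionaries on top of simple spelling rules.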
The complex differences in the way words behave are problematic when it comes to the automatic calculation of readability formulas. For many readability measures, the number of syllables is one of the building blocks on which the formula is created. Flesch Reading Ease, the Gunning Fog Index and SMOG all use syllable counting as a basis to judge how easy or hard a piece of text is to read. So, any inaccuracies or inconsistencies in the counting of syllables will have consequences for the accuracy of the readability measurement.
As we've seen in the example of oi above, vowel clusters, that is, two or more vowels occurring next to each other in a word without an intervening consonant, can be particularly problematic: sometimes they are pronounced as one syllable, sometimes as two. Syllabic consonants, those consonants that either form a syllable on their own or form the nucleus of a syllable, are also a problem. For example, linguist and phonetician Peter Ladefoged explains that the word predatory would be pronounced by some as a four-syllable word, pre-da-to-ry, and by others as a three-syllable word, pre-da-tory, with the tory sounding more like the phonetic ‘tree’.
As linguistics PhD Marc Ettlinger summarises in his blog, along with the challenges posed by vowel clusters (e.g. going vs. boing) and syllabic consonants, there is a grey area where the syllable count of a word is just not definitive. For example, is feel one syllable or two? What about fire? Or pile?
These challenges can lead to discrepancies in the number of syllables counted for a given word, posing a problem for any readability formula in which syllables are counted automatically and form a building block of the score.
So, for readability formulas for different languages, the way that syllables are counted needs to reflect, and be tailored to, how syllables emerge in that language.
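To illustrate how a counting error feeds through to a score, here is a minimal sketch using the standard published Flesch Reading Ease formula: 206.835 minus 1.015 times the average words per sentence, minus 84.6 times the average syllables per word. The word, sentence and syllable totals below are made-up numbers chosen only for illustration.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease score from raw counts."""
    avg_sentence_length = words / sentences
    avg_syllables_per_word = syllables / words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# Hypothetical text: 100 words in 5 sentences.
correct = flesch_reading_ease(words=100, sentences=5, syllables=150)
# The same text, with a counter that over-counts by 10 syllables.
overcounted = flesch_reading_ease(words=100, sentences=5, syllables=160)

print(f"correct count:     {correct:.2f}")
print(f"over-counted:      {overcounted:.2f}")
print(f"score difference:  {correct - overcounted:.2f}")
```

Because the syllable term is multiplied by 84.6, over-counting by just one syllable in every ten words lowers the score by 8.46 points, which can be enough to move a text from one difficulty band to another.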
The relationship between number of syllables and difficulty
As well as the issues around accurate counting, syllables pose a further issue for the development of readability formulas. That is, how much importance should be placed on the number of syllables when calculating how easy or difficult it is to read a given text?
In English readability formulas, word length, often measured by number of syllables, is key to how difficult or easy a text is judged to be. In formulas such as the Gunning Fog Index, words with three or more syllables are categorised as ‘hard’ words, while Flesch Reading Ease penalises a higher average number of syllables per word. So, texts with a higher proportion of words with three or more syllables will receive a more difficult readability score than texts with fewer such words.
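The ‘hard word’ idea can be sketched with the standard Gunning Fog formula: 0.4 times the sum of the average sentence length and the percentage of words with three or more syllables. This is a simplified sketch: it takes the syllable counts as given and ignores the exclusions the full method applies (for example proper nouns and common suffixes). The sample counts are invented for illustration.

```python
def gunning_fog(syllables_per_word: list[int], sentence_count: int) -> float:
    """Simplified Gunning Fog Index from per-word syllable counts."""
    word_count = len(syllables_per_word)
    # A word is treated as 'hard' if it has three or more syllables.
    hard_words = sum(1 for n in syllables_per_word if n >= 3)
    avg_sentence_length = word_count / sentence_count
    percent_hard = 100 * hard_words / word_count
    return 0.4 * (avg_sentence_length + percent_hard)

# Hypothetical ten-word sentence containing two words of 3+ syllables.
sample = [1, 1, 2, 3, 1, 4, 1, 1, 2, 1]
print(gunning_fog(sample, sentence_count=1))
```

Every word at or above the three-syllable threshold raises the score, regardless of how familiar that word is to the reader.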
This notion that the greater the number of syllables, the greater the difficulty in reading, although perhaps fine for English, becomes problematic when we apply it to other languages. For example, Greenlandic, spoken by some residents of Greenland and Denmark, uses very long words. Of course, many languages contain long words, but in common usage Greenlandic uses a lot of long words a lot of the time. A single word can express a complete sentence. For example, ‘inniminneereerpunga’ translates to ‘I have a reservation’. While this may seem difficult to read, it is only a single word compared to the four words in its English translation. If a reader is used to these long, polysyllabic words, reading fewer, longer words might be easier than reading more, shorter words.
Research into text readability in Bangla has brought into question the importance of word length and number of syllables to the readability of the language. In a study of the readability of Bangla texts using machine learning approaches, researchers state that polysyllabic words are common in everyday use. For example, the Bangla pronoun AaMaDer (আমাদের) contains three syllables and translates to the single-syllable English pronoun ‘our’. An English readability formula would count this as a hard word, whereas in Bangla it would not be viewed as difficult.
A similar issue arises in Arabic. Researchers developing readability formulas for measuring the reading difficulty of Arabic texts note that many Arabic words consist of three syllables. For example, GuWanTi (جونتي) in Arabic is a three-syllable word and translates to the single-syllable English noun ‘gloves’. As well as the number of syllables or length of a word, the researchers also highlight that the familiarity of a word should be considered in readability formulas. They argue that a familiar word will be easier to read than an unfamiliar one, with words becoming easier to read if they are used in everyday language. Given that polysyllabic words are commonly used in languages such as Bangla and Arabic, the notion that words with more than two syllables are hard to read is far from a universal truth. The lack of acknowledgement of word familiarity in many English readability formulas could be seen as a weakness, and it emphasises the value of considering measures that do take word familiarity into account, such as the Dale-Chall readability formula, alongside other readability scores.
So, syllables, both in terms of how to count them and how important they are when you do count them, pose a challenge to the development of readability formulas. Emerging findings for readability measures for other languages indicate a move away from syllables as a building block. For example, the Automatic Arabic Readability Index avoids syllables altogether, defining difficult words as those consisting of more than six letters after removing “ال” from the beginning of the word (“ال” translates as ‘the’).
As well as allowing us to measure ease of reading across a range of languages, advances in readability formulas for non-English languages also provide an opportunity to reflect on existing English formulas and to ask: could they be improved and, if so, how?
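The letter-based rule described above can be sketched in a few lines. This is only an illustration of the rule as stated here, not the published index itself: the function name and the example words are hypothetical choices, and the sketch assumes unvocalised text (diacritic marks would be counted as extra characters).

```python
DEFINITE_ARTICLE = "ال"  # the Arabic prefix meaning 'the'

def is_difficult_arabic_word(word: str, max_letters: int = 6) -> bool:
    """Flag a word as difficult if it has more than six letters once a leading article is removed."""
    if word.startswith(DEFINITE_ARTICLE):
        word = word[len(DEFINITE_ARTICLE):]
    return len(word) > max_letters

# Illustrative words only: 'the school' and 'the independence'.
for word in ("المدرسة", "الاستقلالية"):
    print(word, is_difficult_arabic_word(word))
```

The first word has five letters once the article is removed and is not flagged; the second has nine and is. No syllable counting is involved, which sidesteps the ambiguities discussed earlier.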