Quran Corpus

This database is the first step towards building a psycholinguistic database for Qur’anic Arabic. It currently contains the entire Qur’anic Arabic Corpus (Dukes, 2009), which was built on the verified Arabic text of the Qur’an distributed by the Tanzil project (Zarabbi-Zadeh, 2008). In this corpus, 77,430 orthographic tokens had already been segmented at the whitespace between them in the text. The position of each token was annotated with its surah (chapter) number, sentence number, and word position in the sentence. Each token also came with a Buckwalter transliteration, which uses ASCII characters to represent Arabic orthography.

To build the lexicon, we scripted rules that convert each token’s Buckwalter transliteration into a contextual broad phonetic transcription. The transcription takes into account the co-articulatory effects of continuous Qur’anic recitation that are marked orthographically in the script, as well as the pauses in recitation, which are reflected in sentence endings and compulsory pause markers. This corpus is unique in that all of its words appear in a fixed order and are recited in that order. Under the strict rules of recitation, or tajweed, the pronunciation of a word depends on its position in the sentence and on the words that precede and follow it, so context strongly shapes how a word is pronounced. This sets the Qur’an lexicon apart from lexicons created from corpora of words in isolation.
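As a rough illustration of how a Buckwalter transliteration represents Arabic orthography with ASCII characters, the sketch below maps between the two scripts. It covers the standard Buckwalter table only; the corpus may use extended symbols for Qur’anic marks that are not handled here, and the function names to_arabic and to_buckwalter are hypothetical, not part of the database. It does not attempt the contextual phonetic transcription described above.

```python
# Standard Buckwalter table (ASCII -> Arabic code points).
# Assumption: extra Qur'anic symbols used by the corpus are NOT covered.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A",
    # Diacritics: tanween, short vowels, shadda, sukun
    "F": "\u064B", "N": "\u064C", "K": "\u064D", "a": "\u064E",
    "u": "\u064F", "i": "\u0650", "~": "\u0651", "o": "\u0652",
    # Superscript alef and alef wasla
    "`": "\u0670", "{": "\u0671",
}

# Inverse table (Arabic -> ASCII); the mapping is one-to-one.
ARABIC_TO_BUCKWALTER = {v: k for k, v in BUCKWALTER_TO_ARABIC.items()}


def to_arabic(text: str) -> str:
    """Convert Buckwalter ASCII to Arabic script; unknown characters pass through."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


def to_buckwalter(text: str) -> str:
    """Convert Arabic script to Buckwalter ASCII; unknown characters pass through."""
    return "".join(ARABIC_TO_BUCKWALTER.get(ch, ch) for ch in text)


if __name__ == "__main__":
    token = "bisomi"  # illustrative token in Buckwalter transliteration
    arabic = to_arabic(token)
    print(arabic)
    print(to_buckwalter(arabic) == token)
```

Because each ASCII symbol corresponds to exactly one Arabic character, the conversion is reversible, which is what makes the transliteration convenient as a search key.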

More information on the creation of the Qur’an Lexicon can be found in Binte Faizal, Khattab, and McKean (2015).

The work formed part of Siti Syuhada Binte Faizal’s PhD thesis (2019), Visual word processing of non-Arabic-speaking Qur’anic memorisers.

This first version of the database is searchable by word form (in Buckwalter transliteration), part-of-speech tag, stem, lemma, and root. A search lists every entry that matches the search terms, together with further information: person, gender, and number for nouns and adjectives; aspect, mood, voice, and form for verbs; derived nouns; and state and case for nominals.
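The kind of lookup described above can be sketched as a simple filter over annotated entries. This is only an illustration under assumptions: the Entry class, the search function, and the sample rows are hypothetical and are not the database’s actual schema, interface, or data.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Entry:
    """Hypothetical record; the real database schema may differ."""
    form: str   # word form in Buckwalter transliteration
    pos: str    # part-of-speech tag
    stem: str
    lemma: str
    root: str


SEARCHABLE = {f.name for f in fields(Entry)}


def search(entries, **criteria):
    """Return the entries whose fields equal every given search term."""
    unknown = set(criteria) - SEARCHABLE
    if unknown:
        raise ValueError(f"unsupported search fields: {sorted(unknown)}")
    return [
        entry for entry in entries
        if all(getattr(entry, name) == value for name, value in criteria.items())
    ]


# Made-up sample rows for illustration only; not copied from the database.
ENTRIES = [
    Entry("bisomi", "N", "somi", "{som", "smw"),
    Entry("{ll~ahi", "PN", "{ll~ahi", "{ll~ah", "Alh"),
    Entry("{lr~aHoma`ni", "ADJ", "r~aHoma`ni", "r~aHoma`n", "rHm"),
    Entry("{lr~aHiymi", "ADJ", "r~aHiymi", "r~aHiym", "rHm"),
]

if __name__ == "__main__":
    for entry in search(ENTRIES, root="rHm", pos="ADJ"):
        print(entry.form, entry.lemma)
```

Combining several search terms narrows the result, since an entry is returned only if it matches all of them.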

In the future, we aim to add searches for lexical variables (length in characters, syllables, and phones; item, syllable, biphone, and phone frequency; lexical uniqueness point; orthographic and phonological neighbourhood sizes; and orthographic and phonological Levenshtein distances) as well as phonotactic probabilities (positional segment and biphone).
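Two of the planned lexical variables can be sketched briefly. The code below computes the Levenshtein (edit) distance between two strings and, on the assumption that a neighbour is any other type at an edit distance of exactly 1, a neighbourhood size. The database may define neighbourhoods differently; the function names and the sample word list are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-symbol insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, symbol_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, symbol_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (symbol_a != symbol_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def neighbourhood_size(word: str, lexicon) -> int:
    """Count distinct other types at edit distance 1 (assumed definition)."""
    return sum(
        1 for other in set(lexicon)
        if other != word and levenshtein(word, other) == 1
    )


if __name__ == "__main__":
    # Hypothetical word list in Buckwalter transliteration.
    lexicon = ["kataba", "katabo", "kutiba", "kAtaba"]
    print(levenshtein("kataba", "kutiba"))
    print(neighbourhood_size("kataba", lexicon))
```

The same functions apply to orthographic strings and to phonological transcriptions, provided each phone is represented by a single symbol.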

This open-source resource will be useful for researchers studying lexical and phonological processing in Qur’anic Arabic, and for systematic cross-linguistic comparisons that help separate language-specific from language-general processes in language processing.

If you have any questions about this activity, please contact Ghada Khattab at [email protected]. Many thanks to the Research Software Engineering Unit at Newcastle for their support in building this resource.

References:

Dukes, K. (2009). Quranic Arabic Corpus. Retrieved 31 January 2015 from http://corpus.quran.com

Binte Faizal, S. S. (2019). Visual word processing of non-Arabic-speaking Qur'anic memorisers (Doctoral dissertation, Newcastle University). https://theses.ncl.ac.uk/jspui/handle/10443/4674

Binte Faizal, S. S., Khattab, G., & McKean, C. (2015). The Qur'an Lexicon Project: A database of lexical statistics and phonotactic probabilities for 19,286 contextually and phonetically transcribed types in Qur'anic Arabic. In The Scottish Consortium for ICPhS 2015 (Ed.), Proceedings of the 18th International Congress of Phonetic Sciences, p. 0968. ISSN: 0241-0669. https://eprints.ncl.ac.uk/213017

Zarabbi-Zadeh, H. (2008). Tanzil Project. Retrieved 31 January 2015 from http://tanzil.net
