Abstract
This article provides a detailed insight into computational approaches for deciphering Bronze Age Aegean and Cypriot scripts, namely, the Archanes script and the Archanes formula, Phaistos Disk, Cretan hieroglyphic (including the Malia Altar Stone and Arkalochori Axe), Linear A, Linear B, Cypro-Minoan, and Cypriot scripts. The unique contributions of this article are threefold: (1) a thorough review of major Bronze Age Aegean and Cypriot scripts and inscriptions, digital data and corpora associated with them, existing computational decipherment methods developed in order to decipher them, and possible links to other scripts and languages; (2) the definition of 15 major challenges that can be encountered in computational decipherments of ancient scripts; and (3) an outline of a computational model that could possibly be used to simulate traditional decipherment processes of ancient scripts based on palaeography and epigraphy. In the context of this article the term decipherment denotes the process of discovery of the language and/or the set of symbols behind an unknown script, and the meaning behind it.
1. Introduction
Determining the precise time frame of when writing first appeared is extremely difficult. It is currently believed that writing first appeared around 3200 BCE in the region of Sumer (Glassner 2018), located in southern ancient Mesopotamia (present-day Iraq). The script that appeared in this region is known today as Sumerian pre-cuneiform, and this writing served as a foundation for the later Sumerian cuneiform. After the appearance of Sumerian cuneiform, other scripts emerged in Egypt (approximately 3100 BCE), the Indus Valley (approximately 2500 BCE, although it is arguable whether this was a linguistic script or not [Sproat 2014; Oakes 2019]) in present-day India and Pakistan, Crete (approximately 1900 BCE) in present-day Greece, China (approximately 1200 BCE), and Central America (approximately 600 BCE) (Robinson 2007). The first sign of writing west of Egypt is usually considered to be the Archanes script, discovered in the 1960s in the Aegean (Decorte 2018). Not much is known about this script, and it is not generally agreed whether it is indeed a script or a repeated collection of symbols known as the Archanes formula. Regardless of its nature, the Archanes script and the Archanes formula were precursors to Europe’s most famous scripts used during the Bronze Age (c. 3000–700 BCE [Vandkilde 2016]), and possibly some of the earliest European scripts: Cretan hieroglyphic, the Phaistos Disk, Linear A, Linear B, Cypro-Minoan, and Cypriot scripts. Many decipherment attempts for these scripts have been proposed over the years, but only Linear B and the Cypriot script are generally accepted as deciphered. Decipherment attempts on ancient scripts have usually been based on epigraphy (the study of ancient inscriptions), paleography (the study of ancient handwriting), grammatical comparison to possibly related scripts, and so forth; only recently have computational methods aimed at deciphering these scripts started to appear.
Even though the term decipherment usually refers to the process of determining the symbol system behind an unknown script or text, in the context of this paper it will denote the process of discovering the language and/or the set of symbols behind an unknown script. According to Gelb and Whiting (1975), decipherments in their broader sense can be classified into four types based on the level of expert knowledge regarding the writing system and the language involved: (0) both the language and the writing system are known, (I) the language is known but the writing system is unknown, (II) the language is unknown but the writing system is known, and (III) both the language and the writing system are unknown. Gelb and Whiting (1975) emphasize that the “known” and “unknown” concepts in this context are not solid, but rather fuzzy categories that shade into each other. The first of these types of decipherment (Type 0) is trivial and does not represent a problem at all, the second (Type I) and the third (Type II) are difficult, and the fourth (Type III) is the most difficult of all and might even be undecipherable. Taking this categorization into account, we can categorize the Bronze Age Aegean and Cypriot scripts and languages. The Linear B and Cypriot syllabaries would fall into the Type 0 category, since the symbol sets and the languages behind them are already known. The Linear A syllabary would fall into the Type II category, since the language behind it is unknown, but the set of symbols is mostly known because it is related to Linear B. The Archanes script and the Archanes formula, the Phaistos Disk, the Cretan hieroglyphic script, and the Cypro-Minoan syllabary would all unfortunately fall into the Type III category, since neither the language nor the majority of their symbols are currently known. Attempts at computational decipherment of these scripts should focus on discovering the language(s) behind them, a complete set of symbols, and the writing systems they were used in.
This article presents a systematic review of the Archanes script and the Archanes formula, Phaistos Disk, Cretan hieroglyphic, Linear A, Linear B, Cypro-Minoan, and Cypriot scripts, and focuses on the computational approaches proposed for their decipherment. Even though Linear B and Cypriot scripts are already considered to be deciphered, computational approaches to their decipherment still exist because they might prove useful to the decipherments of other scripts possibly related to them but still undeciphered. In addition, we give an overview of available digital corpora related to the Bronze Age Aegean and Cypriot scripts and discuss challenges one might face when attempting to decipher them. As far as we are aware, this is the first systematic review that focuses only on the techniques proposed for the computational decipherment of Bronze Age Aegean and Cypriot scripts. The only similar review we were able to find was presented in Ferrara and Tamburini (2022), and although it is an interesting resource that focuses on paleographic, epigraphic, and computational techniques for the decipherment of ancient scripts, as well as the challenges that the researchers might face, it does not discuss many of the existing methods for the computational decipherment of Bronze Age Aegean and Cypriot scripts, nor the digital corpora or the related languages that might prove useful in the decipherment process.
We would like to fully disclose that this article is not written from the standard epigraphical or paleographical point of view. Its primary target audience comprises computer science researchers for whom archaeology, linguistics, epigraphy, or any of the related fields mainly associated with the traditional methods for the decipherment of ancient scripts are not primary research areas. This article is primarily intended for researchers interested in using computational techniques to decipher ancient scripts, as there has been a surge in the use of natural language processing (NLP) and deep learning techniques in the last few years. It is our belief that these novel resources available to computer scientists could prove useful in the decipherment of ancient scripts. Besides these main goals, this article can also serve as an introduction to some of the earliest European scripts and languages.
The article is structured as follows. In Section 2 we discuss human language, the linguistic classification of language families, and various classifications of the different types of writing systems. In Section 3 we discuss Bronze Age Aegean and Cypriot scripts, namely, the Archanes script and the Archanes formula, Cretan hieroglyphic (including the Malia Altar Stone and Arkalochori Axe), Phaistos Disk, Linear A, Linear B, Cypro-Minoan, and Cypriot scripts. In Section 4 we discuss challenges that need to be addressed when attempting to decipher ancient, unknown scripts. In Section 5 we discuss digital resources and corpora available for the Bronze Age Aegean and Cypriot scripts. In Section 6 we give a detailed overview of traditional and deep learning-based approaches commonly used in NLP. In Section 7 we give an overview of computational approaches to the decipherment of Bronze Age Aegean and Cypriot scripts, alongside an overview of computational approaches proposed for the decipherment of other scripts. Additionally, we look at possible methods that can be used to simulate the traditional processes used for the decipherment of ancient scripts. In Section 8 we discuss languages and scripts that were geographically close to and contemporary with the Bronze Age Aegean and Cypriot scripts, in the hope that some of them can be indisputably linked to the undeciphered Bronze Age Aegean and Cypriot scripts and aid in their decipherment. In Section 9 we give a conclusion and discuss future work.
2. Languages and Writing Systems
Human language, or the capacity for it, is probably at least 150,000 to 200,000 years old (Pagel 2017). Approximately 7,000 languages exist in the world today (Radikov et al. 2019), and most of them are classified as belonging to certain language families. Language families represent groups of languages that share some similarities and have the same ancestral language, usually called a proto-language. For example, the English language (alongside most European languages) belongs to the Indo-European language family, whose ancestral language is hypothesized to be an unknown Proto-Indo-European language. Even though most existing languages can be classified as belonging to one of the known language families, there are languages that are distinct and unique enough to be classified as language isolates because their connection to known language families cannot be reliably established (e.g., the Basque and Burushaski languages [Urban 2021]). For a detailed overview of the world’s languages and language families we refer the reader to Pereltsvaig (2021), and for a list of language families and the languages they encompass we refer the reader to Glottolog (2023).
As a means of writing down various languages, many different writing systems and scripts have been developed over the years. A writing system is a notational system for a natural language (Neef 2015), while its physical representation is referred to as a script (Pae and Wang 2022). Writing systems can be divided into various categories based on the phonetics or semantics connected with the symbols they are associated with. For example, a writing system where symbols represent vowels and consonants can be referred to as a type of alphabet called a phonemic alphabet, whose specific instances include, for example, the Latin and Greek scripts. A writing system where symbols represent syllables instead of vowels or consonants can be referred to as a syllabary, and an example of a script belonging to this type of writing system is Cherokee (Cushman 2011).
Even though we mentioned phonemic alphabets and syllabaries, we need to emphasize that there is no universally agreed upon taxonomy of writing systems, and many researchers in this field propose different classifications. For example, Robinson (2009) divided writing systems into six categories: syllabic systems (e.g., Japanese Kana), logosyllabic systems (e.g., Chinese), logoconsonantal systems (e.g., Egyptian), consonantal alphabets (e.g., Arabic), phonemic alphabets (e.g., Greek), and logophonemic alphabets (e.g., English). Allan (2015) divided writing systems into logographic and phonographic scripts, and split them into even smaller subcategories based on the size of the phonetic units that their symbols represent. Sproat and Gutkin (2021) divided writing systems into syllabic systems (e.g., Chinese), moraic systems (e.g., Japanese Kana), consonantal systems or abjads (e.g., Semitic scripts), abugidas or alphasyllabaries (e.g., Brahmic scripts), and alphabets (e.g., Greek). Gelb and Whiting (1975) divided full writing systems into logo-syllabic, syllabic, and alphabetic. Daniels and Bright (2010) divided writing systems into logosyllabaries, syllabaries, consonantaries or abjads, alphabets, and abugidas.
It is important to note that many scripts cannot be fully classified as belonging to just one type of writing system, regardless of the name. Many scripts contain properties that are characteristic of multiple types of writing systems, and this is why there is no universally agreed upon taxonomy of writing systems. In this article, however, we will use the taxonomy of writing systems presented in Daniels and Bright (2010), which divides writing systems into logosyllabaries, syllabaries, consonantaries or abjads, alphabets, and abugidas. Logosyllabaries are types of writing systems where “the characters of a script denote individual words (or morphemes) as well as particular syllables” (Daniels and Bright 2010). Examples of logosyllabaries include the Mayan logosyllabary, cuneiform (Gnanadesikan 2009), Egyptian hieroglyphs, and Chinese. Syllabaries are types of writing systems where “the characters denote particular syllables, and there is no systematic graphic similarity between the characters for phonetically similar syllables” (Daniels and Bright 2010). Examples of syllabaries include Cretan hieroglyphic, Linear A, Linear B, Cypro-Minoan, Cypriot (Karnava 2014b), and Cherokee (Cushman 2011). Abjads or consonantaries are types of writing systems where “the characters denote consonants (only)” (Daniels and Bright 2010). They are also referred to as consonantal alphabets since they have symbols for consonants like true alphabets do, but, unlike true alphabets, not for the vowels. In abjads, the vowels are inferred from the word’s syntax or semantics rather than stated explicitly. Examples of abjads include Hebrew, Ugaritic, Mandaic, Estrangelo, and Arabic (Daniels 2003). Alphabets are types of writing systems where “the characters denote consonants and vowels” (Daniels and Bright 2010). Examples of alphabets include Latin (or Roman), Greek, Glagolitic, Cyrillic, Etruscan, and Phoenician. The Latin alphabet is the most widely used script in the world (Kolinsky and Verhaeghe 2017). Abugidas are types of writing systems where “each character denotes a consonant accompanied by a specific vowel, and the other vowels are denoted by a consistent modification of the consonant symbols” (Daniels and Bright 2010). They are also known as alphasyllabaries, since they fall somewhere in between true alphabets and true syllabaries. Examples of abugidas include Thai, Burmese, Khmer, and Lao (Ding, Utiyama, and Sumita 2018).
In order to decipher an unknown script, it is important to first determine the type of writing system it belongs to. Based on the characteristics of an undeciphered script it is possible, with a certain degree of confidence, to determine the writing system it belongs to. If, for example, a script has only a small number of symbols, it is probably an alphabet. Alphabets use symbols that represent either consonants or vowels, and therefore do not need a large set of symbols to represent words. However, if a script has a very large number of symbols, it is possible that they represent either syllables or ideas, and that script may therefore be a logosyllabary or a syllabary. According to Coe (1999) as cited in Fuls (2015), alphabets do not have more than 36 signs, while pure syllabaries have between 40 and 90. Scripts that have hundreds or thousands of signs are more complex and combine a small number of phonetic signs with a large number of logograms (Robinson 2009).
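To make this heuristic concrete, the following is a minimal sketch that maps a sign count to a candidate writing-system type, using the thresholds cited above from Coe (1999) as cited in Fuls (2015) and Robinson (2009); the function name and the handling of counts that fall between the cited ranges are our own illustrative assumptions:

```python
def guess_writing_system(num_unique_signs: int) -> str:
    """Rough guess of the writing-system type from the number of unique
    signs, following the thresholds cited in the text above."""
    if num_unique_signs <= 36:
        return "alphabet (or abjad/abugida)"
    if 40 <= num_unique_signs <= 90:
        return "pure syllabary"
    if num_unique_signs >= 100:
        return "logosyllabary (few phonetic signs, many logograms)"
    return "ambiguous (falls between the cited ranges)"

# Example: the Phaistos Disk has 45 unique signs, the Latin alphabet 26.
print(guess_writing_system(45))  # pure syllabary
print(guess_writing_system(26))  # alphabet (or abjad/abugida)
```

Such a heuristic is, of course, only a first filter: allographs, incomplete vocabularies, and mixed systems (discussed in Section 4) can all distort the observed sign count.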
In this article we will focus on the computational decipherment of Bronze Age Aegean and Cypriot scripts, which are usually classified as syllabaries. The potential decipherment of different types of writing systems would represent an entirely different problem than the one we are focused on, and would require completely different viewpoints and methods. This falls outside the scope of our current research, but would be an interesting area to focus on in the future.
3. Bronze Age Aegean and Cypriot Scripts and Inscriptions
Many ancient scripts and inscriptions still await decipherment, for example, the inscriptions written in the Etruscan alphabet (present-day Italy), Cretan hieroglyphic script (present-day Greece), Linear A (present-day Greece), Proto-Elamite script (present-day Iran), Rongorongo (present-day Easter Island), Indus script (present-day Pakistan and India), Vinča or Danube script (present-day Serbia and Romania), Sitovo inscription (present-day Bulgaria), and so forth. Deciphering these scripts is an extremely complicated process because researchers can encounter many obstacles in their way. These obstacles are outlined in more detail in Section 4, but can include things such as multiple unknown languages behind an unknown script, unknown writing systems, and unknown reading directions.
The decipherment of any of the world’s undeciphered scripts and inscriptions would represent an enormous breakthrough. However, in this article, we are particularly focused on scripts and inscriptions found mostly in the Aegean area and in Cyprus.
Major Bronze Age Aegean and Cypriot scripts include the Archanes script and the Archanes formula, Cretan hieroglyphic script, Phaistos Disk, Linear A syllabary, Linear B syllabary, Cypro-Minoan syllabary, and the Cypriot syllabary. There are a few inscriptions and early signs of writing from the Aegean area that are not generally accepted as standalone scripts, and these include the Malia Altar Stone and Arkalochori Axe. Figure 1 shows the locations of the islands of Crete and Cyprus, where the majority of Bronze Age Aegean and Cypriot inscriptions were discovered. For the creation of the map we used QGIS geospatial software (QGIS Development Team 2023).
Figure 2 shows the timeline of Bronze Age Aegean and Cypriot scripts. The approximate dates of use for each of these scripts are taken from corresponding sources cited in Sections 3.1–3.7.
In subsections below we give an overview of Bronze Age Aegean and Cypriot scripts; for an extensive and in-depth review of Aegean pre-alphabetic scripts we refer the reader to Davis (2010).
3.1 The Archanes Script and the Archanes Formula
The earliest recognized evidence for the use of a script on Crete was found on a number of seals, and these inscriptions are known today as the Archanes formula (Anastasiadou 2016). The Archanes formula refers to two sign groups, or sign sequences, present in their entirety or in part on a number of seals belonging to the wider collection of inscribed objects called the Archanes script (Olivier, Godart, and Poursat 1996; Karnava 1999; Decorte 2018). The objects belonging to the Archanes script were excavated by Yannis Sakellarakis in the mid-1960s near Archanes on Crete, and are generally dated somewhere between the end of the third and the beginning of the second millennium BCE (c. 2200–1800 BCE) (Decorte 2018). According to Ferrara and Weingarten (2022), the Archanes formula cannot be considered a proper writing system or a script. Out of all the Aegean scripts or inscriptions that have been analyzed, the Archanes script and the Archanes formula have attracted the fewest decipherment attempts, possibly due to the small number of inscriptions.
3.2 Cretan Hieroglyphic Script
The Cretan hieroglyphic script, or Cretan hieroglyphs, dates back to the Middle Bronze Age in Crete (c. 2100/2050–1700/1675 BCE) (Nosch and Ulanowska 2021). Because this script does not appear to be connected to other writing systems that used hieroglyphs in their inscriptions (e.g., Egyptian or Hittite), it is thought to have been invented locally on Crete (Karnava 2014a) rather than developed from other writing systems that predated it. This represents a significant obstacle to its potential decipherment: if the hieroglyphs are not related to other known scripts or inscriptions, it might be incredibly difficult or even impossible to ever understand them. To add to the difficulties, there are currently fewer than 400 known inscriptions bearing the Cretan hieroglyphs, and the inscriptions themselves are rather short (Ferrara, Montecchi, and Valério 2021).
It is generally assumed that the Cretan hieroglyphic script represents syllabic writing (Revesz 2022; Karnava 2014a) with a logographic element (Karnava 2014a), so it can be classified as a logosyllabary (Civitillo 2021). According to Karnava (2014a), both right-to-left and left-to-right reading and writing directions are used in the Cretan hieroglyphic script. The language behind the Cretan hieroglyphic script is known as Minoan (Revesz 2017b). This language was probably spoken by the Bronze Age inhabitants of Crete, and remains unknown.
Over the years, researchers have attempted to determine whether the Cretan hieroglyphic script is related to other writing systems, inscriptions, or languages. They have examined various formulas and scripts, such as the Archanes formula (Ferrara, Montecchi, and Valério 2021), Linear A (Ferrara, Montecchi, and Valério 2022), Linear B (Daggumati and Revesz 2019), the Proto-Hungarian and Hungarian languages (Revesz 2016b), and so on. In a study by Serafimov and Tomezzoli (2011), the authors present evidence of an early Slavic presence on Crete in the second millennium BCE, and note similarities between the Cretan hieroglyphic, Linear A, and Linear B scripts and the writings of the Vinča (in present-day Serbia), Gradeshnitsa, and Karanovo (both in present-day Bulgaria) cultures. Revesz (2016a) suggests that Crete is the likely origin of the Cretan Script Family, which includes the Cretan hieroglyphic script, Linear A, Linear B, the Cypriot syllabary, the Greek alphabet, the Phoenician alphabet, the Old Hungarian alphabet, the South Arabic alphabet, and the Tifinagh alphabet. This script family was expanded in Revesz (2017a) with the addition of the Carian alphabet. Despite numerous theories about the origin of the Cretan hieroglyphs and their connection to other scripts and languages, there is no single generally accepted theory, and the Cretan hieroglyphic script remains a mystery.
The Cretan hieroglyphic script and Linear A co-existed for almost two centuries (Ferrara, Montecchi, and Valério 2022), even though the Cretan hieroglyphic script predates Linear A and Linear A is based on it (Davis 2010). The Cretan hieroglyphic script consists of 96 syllabograms, ten of which also serve as logograms (Nosch and Ulanowska 2021), which again points to its logosyllabic character. A list of Cretan hieroglyphic signs can be found in Ferrara, Montecchi, and Valério (2023).
Some of the longest known inscriptions in Cretan hieroglyphs are the Phaistos Disk, the Malia Altar Stone, and the Arkalochori Axe (Revesz 2017c), but it is important to emphasize that the Phaistos Disk and the Arkalochori Axe are sometimes listed as separate scripts (Revesz 2018). Within the scope of this article we will regard the Phaistos Disk as a separate script, as this seems to be the current consensus, and the Malia Altar Stone and Arkalochori Axe as texts written in the Cretan hieroglyphic script. The Malia Altar Stone (or Malia Stone Inscription) was excavated in 1937 in Malia on Crete, and it is the only known inscription in Cretan hieroglyphs engraved in stone (Kenanidis and Papakitsos 2017) (most of the other inscriptions are found on clay objects or sealstones [Younger 1999]). The Malia Altar Stone consists of 16 symbols (Revesz 2017d), and a translation of them was proposed in Revesz (2017c, d). The Arkalochori Axe is a bronze double axe discovered in 1935 in a cave in central Crete (Duhoux 1998). It contains 15 signs and dates to c. 1700 BCE (Davis 2010). A translation of the Arkalochori Axe was proposed in Revesz (2017c).
The Cretan hieroglyphs are still generally regarded as undeciphered, although there have been attempts at their decipherment (e.g., Revesz 2016b).
3.3 Phaistos Disk
The Phaistos Disk is a unique circular clay object discovered in the ruins of the Phaistos palace in 1908 by Luigi Pernier (Reczko 2009). It differs from other Aegean and Cypriot Bronze Age artefacts in terms of its construction, as it was imprinted with small seals rather than inscribed by hand. The diameter of the Phaistos Disk varies from 158 to 165 millimeters, as it is not perfectly round, and its thickness varies from 16 to 21 millimeters (Evans 1909b). The disk dates back to the seventeenth century BCE (Chorozoglou, Koukis, and Papakitsos 2017) and is imprinted on both sides with 241 symbols, 45 of which are unique (Reczko 2009). The Phaistos Disk is still generally regarded as undeciphered, although there have been a few attempts at its decipherment (e.g., Achterberg 2004; Revesz 2015b, 2016c). A summary of decipherment attempts can be found in Eisenberg (2008). The reading direction of the disk has been debated and is still not generally agreed upon. Some of the signs imprinted on the disk have a stroke underneath and might represent diacritical marks, similar to the strokes used in the Devanagari script for writing Sanskrit (Chadwick 1987). Although there is a general view among archaeologists that the Phaistos Disk is an authentic Bronze Age artefact, there have been times when this view was questioned and it was suggested that the disk itself might be a forgery (Eisenberg 2008). The Phaistos Disk is currently on display in the Heraklion Archaeological Museum on Crete.
3.4 Linear A Script
Linear A is an undeciphered script that was discovered by the British archaeologist Sir Arthur J. Evans during his excavations in Greece in the late 1800s and early 1900s. It was used during the period c. 1750–1450 BCE (Robinson 2009), and is classified as a logosyllabary (Salgarella and Castellan 2021a; Salgarella 2022). This script is mainly associated with Crete, but was also used on other islands in the Aegean Sea as well as on Mainland Greece. The total number of signs from all 1,370 known Linear A documents is estimated to be between 7,362 and 7,396 (Schoep 2002). Of all these signs, 97 are unique, and 64 of them were adopted into Linear B with slight modifications (Tan 2022). The average word length in Linear A documents is 3.3 signs (Fuls 2015). The majority of Linear A clay tablets come from the Minoan palace of Haghia Triada (Chadwick 1987).
The language behind Linear A is considered to be the Minoan language, the same as the one behind the Cretan hieroglyphic script (Revesz 2017b). A link between these two scripts in terms of their common origin was suggested in Owens (1996).
Since the language behind Linear A is still unknown, there have been many attempts to discover what this language is or where it originated from. Some of these attempts include trying to connect Linear A with Hurro-Urartian languages, the Indo-European language family, and the Etruscan language (Facchetti 2002). Duhoux (1990) states the following: “Linear A’s language has been recognised as cognate to an impressive list of languages: Hittite, Luwian, Lycian, Sanskrit, Greek, Indo-European, Semitic, Carian, Basque, and so on.” Recently, Revesz (2016c, 2020) proposed a theory that Linear A is connected to the Uralic language family. According to Schoep (2002), the best-founded hypotheses for the language behind Linear A are Semitic (non-Indo-European language) and Lycian (an extinct language belonging to the Anatolian branch of the Indo-European language family). We will discuss languages possibly related to Bronze Age Aegean and Cypriot scripts in more detail in Section 8.
3.5 Linear B Script
Linear B was discovered in the form of clay tablets and vessels by the British archaeologist Sir Arthur J. Evans during his excavations in Greece in the late 1800s and early 1900s. It was in use on Crete and Mainland Greece during the period c. 1400–1190 BCE (Salgarella 2020). It is classified as a logosyllabary (Salgarella and Castellan 2021a).
Linear B was deciphered in the 1950s by Michael Ventris and John Chadwick, with significant contributions from Emmett L. Bennett and Alice E. Kober (Bennett, Chadwick, and Ventris 1956; Chadwick and Ventris 1973; Ventris and Chadwick 2015; Kober 1948). It was used for writing Mycenaean Greek (Davis 2010).
The Linear A and Linear B scripts have approximately the same number of signs (Melena 2014). Out of the 97 unique Linear A signs, 64 were adopted into Linear B with slight modifications (Tan 2022). Syllabograms common to both Linear A and Linear B, alongside their phonetic transcriptions, are listed in Melena (2014). Although there is a clear link between the Linear A and Linear B scripts, they were almost certainly used to write different languages (Whittaker 2005). Linear B was used for writing the Mycenaean Greek language (Davis 2010), while the language written in the Linear A script is still unknown but referred to as Minoan. Additionally, both scripts were primarily used for administrative purposes, but the usage of Linear A was wider than that of Linear B, which seems to have been restricted to palatial bureaucracy (Whittaker 2005). For a detailed comparison of Linear A and Linear B inscriptions in terms of places of discovery and the objects on which the inscriptions were found, the reader is encouraged to consult Tomas (2010).
3.6 Cypro-Minoan Syllabary
The Cypro-Minoan syllabary was the script used by the pre-Greek inhabitants of Cyprus (Davis 2010). Although its name implies that it was used only on Cyprus, it was also used in coastal parts of Syria and in Tiryns in Greece (Valério 2016). The script was in use between 1550 and 1050 BCE (Smith and Hirschfeld 1999), and comes in three varieties that are linked to the places where the inscriptions were discovered (Billigmeier 1976). Not all researchers agree on the number of varieties of this script, and research presented in Corazza et al. (2022) indicates that it is a unitary script. The Cypro-Minoan syllabary is based on Linear A, has around 100 unique symbols, and is still undeciphered (Davis 2011). Nearly 250 inscriptions in Cypro-Minoan are known to exist, and the total number of syllabograms that can be found in these inscriptions is less than 4,000 (Valério 2016).
3.7 Cypriot Syllabary
The Cypriot syllabary (or Classical Cypriot script) was mainly used on Cyprus between the eighth and third centuries BCE (Karnava 2014c). It was developed from the Cypro-Minoan syllabary by the people who lived on Cyprus and spoke their own dialect of the Greek language (Davis 2010). The syllabary consisted of either 54 signs (common syllabary) or 55 signs (Paphian syllabary) (Karnava 2014c). There are around 880 inscriptions bearing Cypriot syllabary writing, 800 of which were found on Cyprus and 80 in Egypt (Davis 2010). The Cypriot syllabary has been deciphered; its decipherment began in 1871 with the work of the Assyriologist George Smith (Davis 2010), and many other researchers contributed to it later as well (Karnava 2014c). The Cypriot syllabary was used to write the Arcado-Cypriot Greek language (Kenanidis and Papakitsos 2015).
4. Challenges in Deciphering Unknown Scripts
Luo et al. (2021) described the two main challenges that can be encountered during attempts to decipher lost languages: “(1) the scripts are not fully segmented into words; (2) the closest known language is not determined.” While this is true, we would also like to expand this list with other challenges that should be addressed when attempting to decipher lost scripts, and propose the following:
Unknown writing system and reading direction. This obstacle is one of the greatest challenges in deciphering ancient scripts. Current practices commonly use vocabulary size to determine the type of writing system to which the script belongs. If this size is uncertain or unknown, there is no reliable way to determine whether the writing system is an alphabet, abugida, abjad, syllabary, logosyllabary, or a mixture of different writing systems. If the writing system is unknown, establishing the reading direction can also be difficult. Scripts can have many different reading directions. For example, some scripts are read from right to left, while others are read from left to right. Some scripts use the boustrophedon style of writing (e.g., the Great Inscription on Crete that is part of the Gortyn law code [Remondino et al. 2008]). Sometimes, the reading direction can be inferred from scripts that are geographically and chronologically close to the script being deciphered. However, this method does not always work, as some scripts are autochthonous and not related to other scripts.
Unknown punctuation. Not knowing whether the script uses punctuation or spaces between the words can make the undeciphered text difficult to segment into words and subsequently decipher.
Unknown language. Not being familiar with the language or languages the script is used to encode can make the script almost undecipherable. This is unfortunately the case with most of the currently undeciphered scripts such as the Cretan hieroglyphic script, Linear A, and Cypro-Minoan script.
Small dataset size. When the dataset of available inscriptions in an undeciphered script is very small, it might be impossible to decipher that script. It should be remembered that archaeologists are still discovering ancient artefacts, and there is hope that the small datasets of undeciphered inscriptions will be expanded in the near future. Recently, in 2022, scientists succeeded in translating the oldest sentence written in the world’s first alphabet (Vainstub et al. 2022; Osborne 2022), and this gives us hope that new discoveries, translations, and decipherments of ancient inscriptions will follow in the future.
Incomplete vocabulary. Sometimes the vocabulary of an unknown script is incomplete, usually due to the small number of inscriptions or texts available. This makes it difficult to determine the type of writing system the script possibly belongs to, and complicates the potential frequency analysis that can be performed on the text in order to gain more information about its content, language, or grammar (a minimal illustration of such an analysis is given after this list of challenges).
Unknown syntactic (or word order) typology. Not being familiar with the dominant order of different types of words (e.g., verbs, subjects, and objects) in a sentence can pose a significant problem in a decipherment process. In linguistics, the word order in a sentence can be classified into six traditional typologies regarding the relative position of subject (S), object (O), and verb (V): SVO, SOV, VSO, VOS, OVS, and OSV (Croft 2003). However, these are not the only existing word order typologies as others have been proposed as well. For example, Dryer (1997) proposed a word order typology that classifies languages as OV or VO, and as SV or VS. Additionally, free word order typologies also exist (e.g., Slavic languages are generally classified as free SVO languages [Siewierska and Uhlířová 1998]), and can further complicate the decipherment process of unknown scripts.
Unknown morphological (or word structure) typology. Not knowing how, or whether, the words in an undeciphered script can change their form can complicate the decipherment process. In morphological typology, languages are classified into two distinct types, analytic and synthetic (Šincek 2020). Synthetic languages, which are more flexible than analytic ones, can further be divided into four subtypes: agglutinating, fusional or inflected, polysynthetic, and oligosynthetic languages (Šincek 2020). Agglutinating languages, which combine smaller morphemes into words, are particularly interesting as they include many ancient languages (e.g., Sumerian, Elamite, Hattic [Tóth 2007]) that are possibly related to the undeciphered Aegean and Cypriot scripts.
Existence of allographs. Not knowing whether multiple symbols or signs are used for the same sound, syllable, word, concept, or idea can complicate the decipherment process. For example, in the English alphabet each letter has an uppercase and a lowercase form, and someone unfamiliar with this convention who tried to decipher a text written in English would certainly encounter a number of difficulties. Their constructed chart of possible symbols used in the undeciphered text would be twice as large as it realistically needs to be, which would probably lead to many incorrect conclusions. This example does not even consider different writing styles such as cursive, which would further complicate the entire process.
Difficulties regarding the penmanship of each particular scribe. Another problem, closely related to the previous one, regards the handwriting or penmanship of each particular scribe who wrote or inscribed ancient texts. Depending on their handwriting, the same symbols could look very different, which may lead to further difficulties in constructing the complete set of symbols connected to a certain script.
Existence of exonyms. Not knowing whether exonyms exist in the text can make the decipherment process of undeciphered text more difficult. Exonyms are usually localized names for foreign places that differ from their autochthonous names. For example, Croatia is an exonym (a name for Croatia used by the English-speaking people), while Hrvatska is an endonym (a name for Croatia used by the Croatian people).
Existence of homonyms. Not knowing whether the words in an undeciphered script change their meaning based on the context they are used in can have an impact on the decipherment process. Such words are known as homonyms, and can be either spelled or pronounced the same. Since it is usually unknown how to accurately pronounce words belonging to undeciphered scripts, in this context we consider homonyms to be words that differ in meaning but share the same spelling, while their pronunciation can be considered unknown.
Unknown context. Not knowing the exact location where the undeciphered texts were discovered can have an adverse effect on the decipherment process. The location where the texts are found can provide significant contextual information about the content of the text. For example, if the inscribed clay tablet with unknown text is found inside an old church, it is far more likely to contain religious text than bureaucratic text.
Unavailable parallel data. In the context of NLP, parallel data usually encompasses the same text written in different languages or scripts. The lack of such data makes the decipherment process very difficult, and is unfortunately the reality for all of the undeciphered scripts today, as far as we are aware. For example, Egyptian hieroglyphs presented a major decipherment problem and were only successfully deciphered after the unearthing of the Rosetta Stone, which bears the same text in three scripts (Egyptian hieroglyphic, Demotic, and Ancient Greek).
Unavailable digital data and corpora. The majority of undeciphered languages and scripts lack an adequate associated digital dataset that could be used in their potential computational decipherment.
Unavailable hardware resources. Computational decipherment of unknown ancient texts is an extremely challenging task, and it may be impossible to accomplish on limited hardware. The decipherment process could therefore necessitate the use of high performance computing (HPC), which is not readily available to everyone.
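As a concrete illustration of the frequency analysis mentioned under the incomplete vocabulary challenge above, the following minimal sketch counts sign frequencies over a set of transliterated inscriptions; the sign labels and the inscriptions themselves are hypothetical placeholders, not real corpus data:

```python
from collections import Counter

# Hypothetical transliterated inscriptions: each inscription is a list
# of sign labels; real corpora would be far larger and far noisier.
inscriptions = [
    ["SI01", "SI07", "SI02"],
    ["SI07", "SI02", "SI15", "SI07"],
    ["SI02", "SI01"],
]

sign_counts = Counter(sign for text in inscriptions for sign in text)

# An incomplete vocabulary typically shows up as a long tail of signs
# attested only once (hapax legomena), which weakens any statistical
# inference about the underlying writing system or language.
hapaxes = [sign for sign, count in sign_counts.items() if count == 1]
print(sign_counts.most_common(3))
print("signs attested only once:", hapaxes)
```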
From the challenges outlined above we can conclude that many different factors can affect the decipherment process of an unknown script. The more that is known about the unknown script, the people who used it, and their environment, the more likely it is that the script will one day be deciphered.
The challenges outlined above can be encountered whether the unknown script is being deciphered by computational approaches or by standard approaches based on epigraphy and paleography. One challenge, however, is associated only with computational approaches to unknown script decipherment, and that challenge relates to the computational power available for the task. Building flexible computational frameworks for the automated decipherment of scripts is these days almost inseparable from powerful hardware resources, both in terms of computation and data storage. HPC solutions provide an adequate computing platform. Scaling decipherment models is facilitated by HPC platforms, no matter what kind of approach is used to build the model (e.g., deep learning neural networks, machine learning methods, statistical models). When additional parameters are needed to extend the capabilities and accuracy of existing models, or when newly discovered texts should be added to the existing ones, hardware resources that easily support such additions and extensions are a good choice. HPC resources today are much more accessible for research and business applications, both technically and financially, whether as on-site solutions, through different public institutions, or through private vendors offering cloud computing resources such as Microsoft Azure (Microsoft 2023), Amazon Web Services (Amazon 2023), Google Cloud Platform (Google 2023), and many others.
5. Data and Corpora
Traditional or printed corpora of Bronze Age inscriptions related to the Minoan civilization first started appearing in the early 1900s, with works such as Evans (1909a, b), and continued with a number of other corpora such as GORILA (Godart and Olivier 1985) and CHIC (Olivier, Godart, and Poursat 1996). Out of all the available traditional corpora related to the Bronze Age Minoan civilization, GORILA and CHIC corpora are the ones that are most frequently used. The GORILA corpus obtained its name from the first letters of surnames of its authors (Godart and Olivier), and from the first letters of the first part of the title of their manuscript (Recueil des Inscriptions en Linéaire A). The CHIC corpus obtained its name from the first letters of words included in the title of the manuscript where it was presented (Corpus Hieroglyphicarum Inscriptionum Cretae).
Many of the digital corpora related to the Bronze Age Minoan civilization are based on the CHIC or GORILA corpora. The most well-known available digital corpora on ancient Bronze Age Aegean and Cypriot scripts (Aurora 2015a; Revesz, Rashid, and Tuyishime 2019b; Rashid 2019; Salgarella and Castellan 2021a; Ferrara et al. 2023a; Lastilla, Ravanelli, and Ferrara 2019; Younger 2023; Petrolito et al. 2015; Papavassileiou, Owens, and Kosmopoulos 2020; Greco, Flouda, and Notti 2023; Hogan 2022, 2023) are discussed below in more detail.
Aurora (2015a) presented DĀMOS (Database of Mycenaean at Oslo), an electronic corpus of Mycenaean texts written in Linear B script. Alongside Mycenaean texts, this database includes information about scribal hands, sites where the texts were discovered, approximate time frames, phonological transcriptions of the texts, and so forth. The author plans on enriching the database with closely related digital resources, for example, the ones that contain information about Minoan Linear A. The DĀMOS database is currently available at Aurora (2015b).
Revesz, Rashid, and Tuyishime (2019b) and Rashid (2019) presented the AIDA (Ancient Inscription Database and Analytics) system, which currently stores inscriptions written on the Phaistos Disk and the ones written in Linear A and Cretan hieroglyphic scripts. The authors plan on expanding the AIDA database with inscriptions written in Sumerian, Elamite, and the Indus Valley script. AIDA is currently available at Revesz, Rashid, and Tuyishime (2019a), where it provides possible syllabic values, translations, cognates, and related languages of Linear A symbols and sequences.
Salgarella and Castellan (2021a) presented SigLA (Signs of Linear A) online database of Linear A symbols and sequences that includes their graphical representations (drawings), phonetic transcriptions, places of origin, time period, types of artefacts on which the inscriptions are found, and so forth. The SigLA dataset is available at Salgarella and Castellan (2021b).
Through the ERC INSCRIBE (INvention of SCRIpts and their BEginnings) project (Ferrara et al. 2023a; Lastilla, Ravanelli, and Ferrara 2019), many 3D representations of artefacts bearing Cretan hieroglyphic and Linear A inscriptions are available online at Ferrara et al. (2023b). The ERC INSCRIBE project is still ongoing and it is possible that the online resources available through it will be expanded in the future.
Younger (2023) is a Web site created and maintained by professor emeritus John G. Younger. This Web site contains links to many online resources (some of which are written by Prof. Younger) regarding the Phaistos Disk, Linear A, Linear B, and Cretan hieroglyphic scripts. Some of the linked resources contain, among much valuable information, information on possible phonetic transcriptions of Linear A sequences.
Other digital corpora have been presented as well, for example, Petrolito et al. (2015) (Linear A) and Papavassileiou, Owens, and Kosmopoulos (2020) (Linear B), but unfortunately we were unable to find these datasets online. Additional online resources regarding the Bronze Age Aegean scripts include Greco, Flouda, and Notti (2023) (Linear A and Linear B), Tselentis (2011) (Linear B), Luo, Cao, and Barzilay (2019a, b) (Linear B; this dataset is a modification of the one presented in Tselentis [2011]), Hogan (2022) (Linear A and Linear B), and a tool for exploring the Linear A syllabary presented in Hogan (2023) and associated with Hogan (2022).
Figure 3 shows an overview of available online resources regarding the Bronze Age Aegean and Cypriot scripts.
It can be seen from the discussion above that there is a significant lack of digital resources associated with Bronze Age Aegean and Cypriot scripts, and this represents a significant challenge. It is also important to note that even though some digital resources regarding the Bronze Age Aegean and Cypriot scripts do exist, they are by no means standardized or complete. It is an unfortunate fact that we cannot be completely certain of the completeness of the set of symbols belonging to a particular script, because often only scarce or short inscriptions written in that script are available to us. One example of this is the Phaistos Disk, which is a unique object that cannot be definitively linked to anything else. Another problem related to the standardization of the set of symbols belonging to a particular undeciphered ancient script is that many of the symbols in digital databases are hand drawn by computer scientists. This is a direct consequence of the fact that digital representations of ancient symbols are usually not freely available and can be copyrighted. This represents a certain obstacle to constructing digitalized datasets that are compatible with each other, as each available dataset will usually have its own set of hand-drawn symbols or contain phonemic transcriptions of symbols.
6. Commonly Used Methods in Natural Language Processing
Commonly used methods in NLP and subsequently in the automatic decipherment of ancient scripts can be divided into three categories: methods commonly associated with pre-processing of textual information, methods commonly associated with processing of textual information, and evaluation techniques that quantify how well the final NLP model performs on novel textual data it has not seen during the training phase. Figure 4 shows a flow diagram of common NLP methods discussed in the following sections.
6.1 Pre-processing of Textual Information
Pre-processing of textual information is of utmost importance in NLP applications. Methods commonly associated with this step are used to prepare textual data for further processing, which subsequently helps improve the accuracy and speed of the model. Pre-processing techniques in NLP encompass statistical and linguistic text analysis, and commonly include processes such as tokenization, stemming, lemmatization, POS (Part-of-Speech) tagging, and NER (Named Entity Recognition) identification.
6.1.1 Tokenization.
Tokenization can be defined as the deconstruction of text into smaller tokens such as words, subwords, individual symbols, or syllables (e.g., the word “disk” could be split into “di” and “sk”). Tokenization is a general term that encompasses sentence tokenization (the process of splitting the input text into sentences), word tokenization (the process of splitting the input text into words), subword tokenization (the process of splitting the input text into subwords), and character-based tokenization (the process of splitting the input text into singular characters such as letters, spaces, and punctuation marks). Even though each of these tokenization methods can be used as a pre-processing step in NLP, by far the most widely used one is subword tokenization.
Subword tokenization has some advantages over the other tokenization methods in terms of vocabulary size and out-of-vocabulary words. In NLP, the vocabulary can be defined as the set of unique tokens (words, subwords, etc.) that the model learns during the training phase and subsequently uses in further processing. Subword tokenization keeps the vocabulary size relatively stable, while sentence and word tokenization methods tend to have much larger vocabularies, which increases the overall complexity of the model and decreases its speed. Character-based tokenization methods have the smallest vocabularies, but also fail to deliver any contextual information to the model, which hinders its performance. Subword tokenization also deals very well with out-of-vocabulary words, that is, words that appear in the input text given to the model but were not encountered by the model during training. In such cases, sentence and word tokenization methods would fail, but character-level tokenization would not (given that the newly encountered word does not contain any novel characters), and subword tokenization most likely would not fail either (although this cannot be guaranteed, given its limited vocabulary size and the particular subword tokenization method used).
The most widely used subword tokenization methods include the Byte-Pair Encoding (BPE) (Gage 1994; Sennrich, Haddow, and Birch 2016), WordPiece (Schuster and Nakajima 2012; Wu et al. 2016), and Unigram (Kudo 2018) models. The BPE model was initially designed for data compression purposes (Gage 1994), and later adapted for word segmentation in neural machine translation (Sennrich, Haddow, and Birch 2016). It works by finding characters and/or groups of characters that most commonly appear together in the training dataset, and merging them into unique tokens that are used for vocabulary construction. The BPE model is most notably used in OpenAI’s GPT-2 (Das and Verma 2020). The WordPiece model was initially constructed for a Japanese and Korean voice search system (Schuster and Nakajima 2012), and later adapted for neural machine translation (Wu et al. 2016). According to Wu et al. (2016), the WordPiece model is similar to that of Sennrich, Haddow, and Birch (2016), and is “data-driven to maximize the language-model likelihood of the training data, given an evolving word definition.” The Unigram model, on the other hand, is a probabilistic “mixture of characters, subwords and word segmentations” (Kudo 2018), and is based on the Unigram language model. For more in-depth information on tokenization methods in NLP we would encourage the reader to consult Spathis and Kawsar (2023) and Mielke et al. (2021).
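To illustrate the core of BPE, the following is a minimal sketch of its merge loop, modeled on the reference algorithm in Sennrich, Haddow, and Birch (2016); the toy corpus, the end-of-word marker, and the number of merge operations are our own illustrative choices:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word: frequency} vocabulary,
    where each word is a space-separated sequence of symbols."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every free-standing occurrence of the pair into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated characters plus an end-of-word marker.
vocab = {"d i s k </w>": 5, "d i s c </w>": 2, "r i s k </w>": 3}
for _ in range(3):  # learn three merge operations
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)  # ('i', 's'), then ('is', 'k'), then ('isk', '</w>')
```

In a trained BPE tokenizer, the learned sequence of merges is replayed on new text, which is what keeps the vocabulary size fixed while still covering previously unseen words.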
6.1.2 Stemming.
Stemming can be defined as the process of determining the stem of a word (i.e., the base of the word to which affixes such as prefixes and suffixes can be added; e.g., the stem of the word “decoding” is “decod”). Stemming is incredibly important when processing highly inflected languages (e.g., Croatian, Slovenian, Polish, and Finnish) because it keeps the vocabulary size under control by recognizing different forms of a word as variations of the same word instead of entirely new words.
Stemming algorithms (or stemmers) can in general be classified as being either linguistic-based or computational-based (Jabbar et al. 2020). Linguistic-based stemmers use hand-crafted grammatical rules to determine the stem of the word (e.g., Porter 1980; Kariyawasam, Senanayake, and Haddela 2019; Kaur and Buttar 2020), while computational-based stemmers use statistical (i.e., machine learning-based; e.g., Melucci and Orio 2003; Bölücü and Can 2019) or non-statistical (i.e., corpus-based; e.g., Singh and Gupta 2017, 2019) computations (Jabbar et al. 2020).
It is important to note that even though many different stemming algorithms exist, not all of them will work well for all languages, as languages vary and can be highly specific and unique.
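As a minimal illustration of the linguistic-based approach, the sketch below strips a few hand-picked English suffixes; the suffix list is purely our own toy assumption and is far cruder than real stemmers such as the one of Porter (1980):

```python
# A toy rule-based stemmer: strip the longest matching suffix, but only
# if a reasonably long base remains. The suffix list is illustrative only.
SUFFIXES = ["ings", "ing", "edly", "ed", "es", "ly", "s"]

def toy_stem(word: str) -> str:
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("decoding"))  # decod (matching the example in the text)
print(toy_stem("decoded"))   # decod
print(toy_stem("decodes"))   # decod
```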
6.1.3 Lemmatization.
Lemmatization can be defined as the process of determination of the lemma of the word (i.e., its canonical or dictionary form; e.g., the lemma of the word “decoding” is “decode”). It is similar to stemming, but inherently more complex and time-consuming. In terms of accuracy, it is difficult to determine which of the two is better, as studies have shown inconsistent results when comparing the two (Pramana et al. 2022).
Lemmatization algorithms (or lemmatizers) are usually either rule-based (e.g., Plisson et al. 2004; Stanković et al. 2016; Nandathilaka, Ahangama, and Weerasuriya 2018) or machine learning-based (e.g., Kestemont et al. 2017; Freihat et al. 2018; Manjavacas, Kadar, and Kestemont 2019; Akhmetov et al. 2020; Karwatowski and Pietron 2022; Hafeez et al. 2023), or they use a combination of these methods and represent hybrid models (e.g., Ingason et al. 2008; Sahala et al. 2023). Even though many traditional machine learning techniques have been used for lemmatization throughout the years (e.g., the random forest classification model [Akhmetov et al. 2020] and the maximum entropy classifier [Freihat et al. 2018]), researchers have lately been more oriented toward deep learning-based approaches (e.g., Kestemont et al. 2017; Manjavacas, Kadar, and Kestemont 2019; Ezhilarasi and Maheswari 2021b; Karwatowski and Pietron 2022; Hafeez et al. 2023).
Lemmatizers have even been developed for ancient inscriptions such as those written in Ancient Greek (Vatri and McGillivray 2020), Early Irish (i.e., Old and Middle Irish) (Dereza 2018), Classical Armenian, Old Georgian, and Syriac (Vidal-Gorène and Kindt 2020), and Akkadian (Sahala et al. 2023), as well as for paleographic eleventh-century stone inscriptions (Ezhilarasi and Maheswari 2021b).
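To contrast lemmatization with the stemming sketch above, the following is a minimal hybrid lemmatizer that consults an exception dictionary before falling back on naive rules; both the dictionary and the rules are our own toy assumptions (real lemmatizers, as the cited works show, are rule-based, machine learning-based, or hybrid systems of far greater sophistication):

```python
# A toy hybrid lemmatizer: exception dictionary first, naive rules second.
EXCEPTIONS = {"went": "go", "better": "good", "mice": "mouse"}

def toy_lemmatize(word: str) -> str:
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word.endswith("ing") and len(word) > 5:
        return word[:-3] + "e"   # naive: "decoding" -> "decode"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]         # naive plural stripping
    return word

print(toy_lemmatize("decoding"))  # decode (matching the example in the text)
print(toy_lemmatize("went"))      # go
```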
6.1.4 POS Tagging.
POS tagging or grammatical tagging can be defined as the process of determination of parts of speech (nouns, verbs, adjectives, etc.) for words appearing in textual documents that are being examined. POS tagging is usually accomplished by using rule-based (or linguistic) approaches, stochastic (or probabilistic) approaches, or hybrid approaches.
Rule-based approaches to POS tagging generally encompass pre-defined grammatical rules determined by linguistic experts, and they usually focus on one specific language. This ensures that the rules determined for one language are not used for the POS tagging of another, as they are often not compatible. If a large enough textual dataset tagged with POS labels exists, then a simple algorithm can be constructed that automatically behaves (more or less successfully) like a human expert and constructs grammatical rules useful for POS tagging. One of the simplest examples of this kind of approach is Brill’s tagger (Brill 1992). Brill’s tagger is a simple yet powerful algorithm that extracts grammatical rules from a tagged corpus, and uses these rules to accomplish POS tagging of novel textual documents. Even though this tagger is usually classified as belonging to rule-based approaches to POS tagging, it is sometimes classified as belonging to machine learning-based approaches (Chiche and Yitagesu 2022), which are inherently stochastic.
Stochastic approaches to POS tagging are used for determining the probabilities with which a word, given a certain context, belongs to each of the predefined POS tags. These approaches encompass commonly used machine learning algorithms such as artificial neural networks (e.g., Vidal-Gorène and Kindt 2020; Ezhilarasi and Maheswari 2021a, b), Hidden Markov Models (e.g., Lee, Tsujii, and Rim 2000; Gao and Johnson 2008; Stratos, Collins, and Hsu 2016), Support Vector Machines (e.g., Nakagawa, Kudo, and Matsumoto 2001; Giménez and Màrquez 2004; Ekbal and Bandyopadhyay 2008), and Conditional Random Fields (e.g., Krishnapriya et al. 2014). Artificial neural networks are mathematical models designed to mimic the human brain. They generally require large amounts of training data and computational resources, but often produce relatively accurate results. Hidden Markov Models (HMMs) are probabilistic finite state machines in which states correspond to POS tags, and observations to words (Zin and Thein 2009). HMMs are often used alongside the Viterbi algorithm (Viterbi 2006), a dynamic programming algorithm used to discover the hidden state path (Cahyani and Vindiyanto 2019), that is, the most probable sequence of POS tags for a given textual sequence. The Support Vector Machine (SVM) is a machine learning algorithm for binary classification that, in its simplest form, learns a linear hyperplane separating one set of examples (so-called positive examples) from another set (negative examples) (Antony, Mohan, and Soman 2010). Even though it was originally designed for binary classification, it has been adapted to work with multiple classes as well (Hsu and Lin 2002). In SVM-based POS tagging, multiclass classification is tackled by taking one POS tag at a time as the positive class, and the rest as negative (Fernando et al. 2016). Conditional Random Fields (CRFs) (Lafferty, McCallum, and Pereira 2001) are used to build probabilistic models that are able to segment and label sequence data (Pallavi and Pillai 2014), and are in fact random fields globally conditioned on the observations, which might range over natural sentences (Lafferty, McCallum, and Pereira 2001). They are often used in various NLP tasks (e.g., for NER identification in Patil, Patil, and Pawar [2020]) and not just for POS tagging. Stochastic approaches to POS tagging also commonly make use of n-gram analysis (e.g., Mittal, Sethi, and Sharma 2014), which can be defined as the deconstruction of text into smaller parts containing n words (e.g., the sentence “over the mountains” could be split into “over the” and “the mountains” in 2-gram or bigram analysis). This kind of analysis enables the examination of context because it allows words to be analyzed alongside their neighbors, which in turn provides insight into the kinds of words that usually appear together. A minimal sketch of HMM-based tagging with Viterbi decoding is given below.
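The sketch below decodes the most probable tag sequence with the Viterbi algorithm over a toy HMM; the tagset, vocabulary, and all probabilities are invented for illustration (a real tagger would estimate them from a tagged corpus):

```python
import math

# Toy HMM: two tags with hand-picked probabilities (illustrative only).
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"scribes": 0.5, "record": 0.2, "goods": 0.3},
          "VERB": {"scribes": 0.1, "record": 0.7, "goods": 0.2}}

def viterbi(words):
    # V[t][s]: log-probability of the best tag path ending in state s at step t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][words[0]]) for s in states}]
    backpointers = []
    for word in words[1:]:
        column, pointers = {}, {}
        for s in states:
            # Best previous state leading into state s.
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            column[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][word]))
            pointers[s] = prev
        V.append(column)
        backpointers.append(pointers)
    # Backtrace from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["scribes", "record", "goods"]))  # ['NOUN', 'VERB', 'NOUN']
```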
Hybrid approaches to POS tagging encompass methods that combine rule-based and stochastic techniques.
6.1.5 NER Identification.
NER (Named Entity Recognition) identification can be defined as the process of determination of words that denote the names of people, places, organizations, dates, common abbreviations, and so forth. It represents one of the crucial steps in knowledge extraction and in the construction of semantic networks and knowledge graphs commonly used in artificial intelligence. Lately, NER identification methods have commonly been deep learning based (e.g., Lample et al. 2016; Yan, Jiang, and Dang 2021; Cui et al. 2021), but they also encompass methods based on hand-crafted rules (e.g., Riaz 2010; Studiawan, Hasan, and Pratomo 2023), CRFs (e.g., Chen et al. 2019; Patil, Patil, and Pawar 2020), HMMs (e.g., Azarine, Arif Bijaksana, and Asror 2019), and various combinations of different methods (e.g., Jin et al. 2019; Drovo et al. 2019; Yi et al. 2020).
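For readers unfamiliar with the task, the sketch below runs an off-the-shelf NER model using the spaCy library; the pre-trained model en_core_web_sm must be downloaded separately, and both the example sentence and the entity labels indicated in the comment are illustrative rather than guaranteed outputs.

```python
# A minimal sketch of off-the-shelf NER with spaCy; requires the pre-trained
# English model (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Michael Ventris deciphered Linear B tablets found at Knossos in 1952.")
for ent in doc.ents:
    # Likely entities: "Michael Ventris" (PERSON), "Knossos" (GPE), "1952" (DATE)
    print(ent.text, ent.label_)
```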
It is important to note that many NER identification methods for widely spoken languages exist, but NER identification is still a challenging problem for low-resource languages. Ancient unknown languages unfortunately fall into this latter category, which subsequently makes the associated scripts extremely difficult to decipher. Automatic analysis of low-resource languages is an ongoing problem that has gained momentum in recent years (Haddow et al. 2022; Costa-jussà et al. 2022). It is our belief that advancements in this field will translate directly into advancements in the computational decipherment of ancient scripts. This is somewhat supported by the fact that even Michael Ventris used a process similar to NER identification in the decipherment of Linear B when he assumed that certain Linear B words indicated the names of places on Crete (Mycenaean Epigraphy Group 2023). Granted, his decipherment process did not involve the computational models that are the main focus of our investigation, but parallels can still be drawn between his research and the current state-of-the-art models aimed at the decipherment of ancient scripts.
6.2 Processing of Textual Information
Processing of textual information is usually based on traditional machine learning methods, deep learning models, or a combination of both.
Traditional machine learning methods include well-known approaches such as HMMs, SVMs, and CRFs. These models are statistical learning methods that were previously discussed in Section 6.1 in terms of their use in pre-processing tasks of NLP.
Artificial neural networks are computational models that attempt, more or less successfully, to mimic the internal workings and decision-making processes of the human brain. They are commonly associated with the term deep learning, which can be defined as a methodological toolkit for building multilayer neural networks (Saxe, Nelli, and Summerfield 2021). Deep learning based models have gained considerable momentum in the decipherment of ancient scripts in recent years. On one hand, they are able to model highly complex functions (Goodfellow, Bengio, and Courville 2016), but on the other hand they require large amounts of training data and computational resources. The amount of training data required by deep learning based models poses a significant challenge in the decipherment of ancient scripts, since that data usually does not exist in the required quantities nor is it digitalized. When it comes to ancient undeciphered scripts, the types of digital training data associated with them are either visual (i.e., digital photographs or renderings of tablets containing ancient inscriptions), textual (i.e., digitalized inscriptions), or auditory (i.e., sounds associated with symbols used in a particular script—but this type of data is extremely rare and usually converted to the form of phonetic transcriptions). The available types of training data for a particular undeciphered script have a significant impact on the selection of the artificial neural network type and model best suited for the automatic decipherment task. Convolutional neural networks (CNNs) are preferred for visual data processing, while textual data is usually processed by artificial neural networks built with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber 1997), Gated Recurrent Units (Cho et al. 2014), and so on. An important breakthrough in deep learning was the introduction of the attention mechanism by Bahdanau, Cho, and Bengio (2014). The attention mechanism improved the performance of basic encoder-decoder architectures in neural machine translation systems, which previously degraded as the length of the input sentence increased. This breakthrough led to the Transformer architecture in 2017 (Vaswani et al. 2017), which is used today as the main architecture for large language models. Large language models such as Generative Pre-trained Transformers (GPTs) (Radford et al. 2018) (on which ChatGPT [OpenAI 2023] is based), BART (Lewis et al. 2020), PaLM 2 (Anil et al. 2023), and Phi-2 (Javaheripi and Bubeck 2023) represent a major breakthrough in NLP and a giant step towards general artificial intelligence.
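To illustrate the core computation behind these architectures, the following NumPy sketch implements scaled dot-product attention as introduced with the Transformer (Vaswani et al. 2017); the matrix sizes and random inputs are illustrative assumptions.

```python
# A minimal NumPy sketch of scaled dot-product attention (Vaswani et al. 2017).
# Dimensions and inputs are illustrative assumptions.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted mix of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, model dimension 8
K = rng.normal(size=(6, 8))   # 6 key/value positions
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)     # -> (4, 8)
```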
Despite the significant progress made in NLP in recent years, automatic decipherment of ancient scripts still remains a challenging task. There are, however, some positive changes occurring as well. Open source models, solutions, and data sources are becoming increasingly common and therefore facilitate further research in NLP and automatic decipherment attempts. For example, the source code for the Ithaca deep neural network for the textual restoration and geographical and chronological attribution of ancient Greek inscriptions (Assael et al. 2022a) is available at Assael et al. (2022b). Additionally, an increase in digital data that can be used for training NLP models oriented towards ancient inscriptions has been noted recently. For example, the Perseus Digital Library containing Greek and Roman text collections (Crane 2023a, b) and searchable libraries of Linear A inscriptions presented in Salgarella and Castellan (2021a, b) and Hogan (2023) are available online.
Since the automatic decipherment of ancient scripts is a relatively novel area of research, and since not many methods that tackle this problem exist, it cannot be stated with certainty whether it is better to use traditional or deep learning based approaches. Currently, a combination of both methods is recommended.
If the reader is interested in a more thorough and in-depth review of statistical machine learning and deep learning methods, we would suggest the following: a detailed overview of traditional machine learning methods can be found in Bontempi (2021), while a survey presented in Sommerschield et al. (2023) gives a detailed overview of machine learning methods and approaches for tackling different aspects of ancient language research.
6.3 Evaluation
Traditional metrics used for measuring the performance of NLP models are BLEU (Bilingual Evaluation Understudy) (Papineni et al. 2002) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) (Lin 2004), but these metrics have been shown to suffer from low correlation with human judgment (Blagec et al. 2022). Many submetrics related to BLEU and ROUGE have been proposed over the years (Graham 2015; Blagec et al. 2022), as well as many completely novel metrics (e.g., BERTScore; Zhang et al. 2020). However, it is important to note that it is currently impossible to construct a completely automatic evaluation method, requiring no human intervention, that is suitable for every natural language processing model in existence, and this is especially true for large language models associated with generative artificial intelligence (e.g., GPT-3; Brown et al. 2020). When it comes to ancient undeciphered scripts, it is just as difficult to evaluate and score the results of their computational analysis since in many cases there is nothing to compare them against, and their evaluation remains an open problem. In the case of Bronze Age Aegean and Cypriot undeciphered scripts, the obtained results might be compared to other geographically close known scripts from a similar time period in order to try and evaluate them, and for this very purpose we discuss these scripts in Section 8. If the reader is further interested in the evaluation methods of common natural language processing tasks (each of which comes with its own set of challenges), we encourage them to consult Blagec et al. (2022) and Lee et al. (2023) for additional details.
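As a brief illustration of how such a metric is computed in practice, the sketch below scores a candidate sentence against a reference with NLTK's BLEU implementation; the sentences are invented, and the smoothing method is one of several available options.

```python
# A small sketch of BLEU scoring with NLTK; sentences are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "tablet", "records", "thirty", "sheep"]]  # list of references
candidate = ["the", "tablet", "lists", "thirty", "sheep"]

smooth = SmoothingFunction().method1   # avoids zero scores on short sentences
print(sentence_bleu(reference, candidate, smoothing_function=smooth))
```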
7. Computational Approaches to Deciphering Ancient Scripts
Computational approaches to deciphering unknown scripts, as opposed to the standard epigraphical and paleographical decipherments, started appearing in the late 1990s, with works such as Knight and Yamada (1999).
Current computational approaches to deciphering unknown scripts can be divided into two categories: (1) approaches aimed at partial or complete translation of unknown text into one or more known languages, and (2) approaches aimed at the discovery of novel information about the undeciphered script that might aid in future partial or complete translation. Most of the existing computational approaches to deciphering unknown scripts fall into the second category, and they include tasks such as cognate or related word detection, closest known language detection, automatic dataset augmentation, statistical analysis of ancient texts, phonetic decipherments and transcriptions, word alignments, and so forth.
Since there are not many existing computational approaches to deciphering ancient scripts (not just regarding the Bronze Age Aegean and Cypriot scripts, but in general), we decided to divide them into Aegean and Cypriot approaches (discussed in Subsection 7.1), and non-Aegean and non-Cypriot approaches (discussed in Subsection 7.2). Finally, we also discuss computational simulations of traditional decipherment methods in Subsection 7.3.
7.1 Existing Approaches for the Decipherment of Aegean and Cypriot Scripts
In this section we present an overview of computational approaches regarding the decipherment of Aegean and Cypriot scripts. Existing approaches for the computational decipherment of these scripts can be loosely divided into those mostly based on the analysis of individual symbols, those mostly based on the analysis of words or word parts, and those oriented towards the prediction of missing or damaged symbols and dataset augmentation. These categories do not have hard boundaries as many of the existing approaches for the decipherment of Aegean and Cypriot scripts combine elements from multiple categories, and their assignment to a particular category is therefore rather fuzzy.
Figure 5 shows a chronological overview of computational approaches for the decipherment of Bronze Age Aegean and Cypriot scripts alongside their brief descriptions and indications regarding the script or scripts they are focused on. The figure does not include information regarding the Archanes script and the Archanes formula for the simple reason that we did not manage to find any papers focused on their computational decipherment. From Figure 5 it is clear that the vast majority of the computational decipherments of Bronze Age Aegean and Cypriot scripts revolve around the Linear A and Linear B syllabaries, and that the other scripts are not well represented in this emerging research field. Furthermore, there is an obvious lack of a standardized digital dataset of Bronze Age Aegean and Cypriot inscriptions, as most of the existing approaches use internal datasets that are not available online.
7.1.1 Approaches Based on the Analysis of Individual Symbols.
This category of approaches for the decipherment of Aegean and Cypriot scripts is oriented towards the comparison of individual symbols, words, or word parts from one or more undeciphered scripts to the symbols, words, or word parts from one or more deciphered scripts. These deciphered scripts are usually selected for this comparison on the basis of geographical proximity (i.e., they were or are used in or near the general area where the undeciphered scripts were used) and on the basis of temporal proximity (i.e., they were used around the same time period as the undeciphered scripts). This category of approaches encompasses several methods (Revesz 2015a, 2016a, b, 2017b, c, d; Daggumati and Revesz 2019, 2023; Corazza et al. 2021; Srivatsan et al. 2021; Corazza et al. 2022).
Revesz (2015a) presented a bioinformatics inspired analysis of the signs belonging to Linear A, Linear B, Phoenician, South Arabic, Greek, Old Hungarian, and Cretan hieroglyphic scripts. They compared the symbols from these scripts to each other in terms of visual and phonetic features and presumable meanings, and grouped them into four categories they named A, C, G, and T after the four DNA nucleotides (adenine, cytosine, guanine, and thymine). They used these categories to encode the seven scripts they analyzed in their research, and used ClustalW2 phylogenetic algorithms to construct their hypothetical evolutionary tree. ClustalW2 is a DNA or protein multiple sequence alignment program for three or more sequences (EMBL’s European Bioinformatics Institute 2023). The authors in Revesz (2015a) reported that the results they obtained indicate that the seven compared scripts had a common ancestor. Their evolutionary tree consisted of three branches, the first one containing Linear A and Linear B scripts, the second one containing the Cretan hieroglyphic script, and the third one containing Old Hungarian, South Arabic, Phoenician, and Greek scripts.
Revesz (2016a) presented a bioinformatics inspired approach to determining the evolutionary language tree of the Cretan hieroglyphic, Linear A, Linear B, and Cypriot syllabaries, and the Greek, Phoenician, Old Hungarian, South Arabic, and Tifinagh alphabets. The authors presented a table where they showed symbols from these scripts that have known phonemic values and compared them with each other. They proposed a measure for calculating the similarity of pairs of symbols in these scripts, and concluded that the Cretan hieroglyphic, Cypriot, Linear A, and Linear B syllabaries, and the Old Hungarian and Tifinagh alphabets belong to one language branch, while Greek, Phoenician, and South Arabic alphabets belong to a different language branch.
Revesz (2016b) proposed a computer-aided translation of the Cretan hieroglyphic script based on the possible phonetic correspondences between the symbols of the Cretan hieroglyphic script and the symbols of the Phaistos Disk. In their research they utilized findings from their previous work presented in Revesz (2016c), where they proposed a translation of the Phaistos Disk.
Revesz (2017c) presented a semi-automatic method for the translation of the Arkalochori Axe and the Malia Altar Stone inscriptions. In their method the authors used synoptic transliteration, which aligned symbols from several scripts that are possibly evolutionarily related and whose phonetic values are at least partially known. This made it easier to guess the potential phonetic values of symbols in the unknown script. The possibly related scripts that the authors used were the Carian and Old Hungarian alphabets. The authors further extended their work on the Malia Altar Stone analysis in Revesz (2017d).
Revesz (2017b) proposed a feature-based similarity measure to visually compare symbols from different scripts. The proposed similarity measure compares symbols based on 13 different features—for example, whether the symbol contains a curved line or not, whether it contains parallel lines or not, and so on. They used this similarity measure to develop a new phonetic grid for the Linear A syllabary, and then to construct an English-Minoan-Uralic dictionary. The algorithm that the authors developed can, for example, take a Linear A symbol as input and return its syllabic value.
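The general idea of such a feature-based measure can be illustrated with the toy sketch below; the three binary features and the two symbol descriptions are our own placeholders, not the 13 features actually used in Revesz (2017b).

```python
# A toy sketch of a feature-based symbol similarity measure; the features and
# symbol descriptions are illustrative placeholders, not those of Revesz (2017b).
FEATURES = ["has_curved_line", "has_parallel_lines", "has_enclosed_area"]

def symbol_similarity(a, b):
    """Fraction of binary visual features on which two symbols agree."""
    return sum(a[f] == b[f] for f in FEATURES) / len(FEATURES)

sign_1 = {"has_curved_line": True, "has_parallel_lines": False, "has_enclosed_area": True}
sign_2 = {"has_curved_line": True, "has_parallel_lines": False, "has_enclosed_area": False}
print(symbol_similarity(sign_1, sign_2))  # -> 0.666...
```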
Daggumati and Revesz (2019) and Daggumati and Revesz (2023) presented a method for discovering relationships (in terms of visual similarity) between ancient scripts. They used CNNs and SVMs to look for correlation between symbols in different scripts. The ancient scripts that the authors focused on included the Brahmi script (34 symbols were used from this script), Cretan hieroglyphic script (22 symbols were used), Greek alphabet (27 symbols were used), Indus Valley script (23 symbols were used), Linear B syllabary (20 symbols were used), Phoenician alphabet (22 symbols were used), Proto-Elamite script (17 symbols were used), and Sumerian pictographs (34 symbols were used). The dataset that the authors used included 900 images (780 for training and 120 for validation) for each symbol, and this large number of images was reached via image augmentation of hand-drawn symbols. The results that the authors obtained showed that the Linear B script is highly correlated with the Cretan hieroglyphic script.
Corazza et al. (2021) proposed a method based on optimization research for assessing the values of some of the more problematic numerical fraction signs used in the Linear A syllabary. Their method combined palaeographical, statistical, typological, and constraint-based computational approaches. In their work the authors listed all of the fraction signs presumably used in Linear A, and discussed different theories on their possible meanings. Even though Linear A and Linear B are inherently linked, the authors emphasized that the quantities of commodities are recorded differently in these two scripts (namely, Linear B did not use fractions), thus making the comparison between the two challenging. This comparison was possible, however, between Linear A fraction signs and the signs used in the “Egyptian systems (hieroglyphic, hieratic, ’Eye of Horus,’ and demotic), Mesopotamian cuneiform (Old Akkadian/Old Babylonian phases), Greek alphabetical, Coptic alphabetical, North African Fez/Rumi, the ’vulgar’ Arabic system, and Indian Grantha.” The research presented by the authors helped narrow down the range of possible values of Linear A fraction signs, and assisted in identifying some of the languages and scripts that may be useful in deciphering it.
Srivatsan et al. (2021) proposed a neural framework for Linear B script analysis and demonstrated it by learning scribal hand representations from raw images of Linear B glyphs. The aim of their study was to develop an approach that could potentially be used in uncovering patterns associated with the ways in which various scribes may have written the same symbol, and perhaps even to discover how the Linear B writing system evolved. The authors represented each Linear B glyph with two vector embeddings—one associated with the potential scribal hand that made it, and one representing the syllable the glyph is associated with. The vector embeddings associated with scribal hands are learned during the training phase of the convolutional neural network the authors used. In order to avoid overfitting due to the limited size of the dataset, the authors used image augmentation to increase the size of the dataset from 4,134 to 111,618 images. The augmentation types that they used consisted of dilation, erosion, and horizontal and vertical translation. The authors identified a total of 74 scribal hands associated with the images the model was trained on, and evaluated their results against baseline human-generated scribal hand representations from Skelton (2008). The authors report that the results obtained by their method were closer to the baseline manual representations than the results obtained by an autoencoder model they compared their method against.
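The augmentation types reported by the authors (dilation, erosion, and horizontal and vertical translation) can be sketched with OpenCV as shown below; the kernel size, shift offsets, and the synthetic stand-in glyph are our assumptions.

```python
# A sketch of dilation, erosion, and translation augmentations using OpenCV;
# kernel size, offsets, and the synthetic glyph are illustrative assumptions.
import cv2
import numpy as np

def augment(glyph):
    """Yield simple variants of a binary glyph image (2D uint8 array)."""
    kernel = np.ones((3, 3), np.uint8)
    yield cv2.dilate(glyph, kernel, iterations=1)      # thicken strokes
    yield cv2.erode(glyph, kernel, iterations=1)       # thin strokes
    h, w = glyph.shape
    for dx, dy in [(2, 0), (-2, 0), (0, 2), (0, -2)]:  # small shifts
        M = np.float32([[1, 0, dx], [0, 1, dy]])
        yield cv2.warpAffine(glyph, M, (w, h))

glyph = np.zeros((32, 32), np.uint8)
cv2.line(glyph, (8, 8), (24, 24), 255, 1)              # a stand-in "glyph"
print(len(list(augment(glyph))))                       # -> 6 augmented images
```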
Corazza et al. (2022) proposed a method based on CNNs for classification of symbols belonging to the Cypro-Minoan syllabary. Their model was named Sign2Vecd, and it represented a modification of the DeepCluster-v2 model (Caron et al. 2018, 2020). The authors used context in their approach and looked at what symbols usually appeared together in inscriptions. The model was trained on 2,899 Cypro-Minoan sign images obtained from 213 inscriptions. The results they obtained indicate that the Cypro-Minoan syllabary is a unitary script rather than three separate varieties, and that the differences between Cypro-Minoan symbols are mainly due to the material of the artefact on which they are located.
7.1.2 Approaches Based on the Analysis of Words or Word Parts.
This category of approaches for the decipherment of Aegean and Cypriot scripts is oriented towards the comparison of words or word parts between undeciphered and deciphered scripts, in order to potentially discover some connections between them that could aid in the decipherment process. This category of approaches encompasses several methods (Revesz 2015b, 2016c; Luo, Cao, and Barzilay 2019a; Min Eu, Xu, and Cacciafoco 2019; Colin and Cacciafoco 2020; Mavridaki, Galiotou, and Papakitsos 2021a, b).
Revesz (2015b) and Revesz (2016c) presented a semi-automatic method for the translation of the Phaistos Disk by using its connections to other ancient languages and scripts, including the Proto-Finno-Ugric language and the Old Hungarian alphabet. Their method includes five steps: transliteration of the Phaistos Disk symbols, construction of the Proto-Finno-Ugric and Proto-Hungarian dictionary, determination of a consonant base for each word in the dictionary, determination of matches between transliterated text and dictionary words, and the formation and translation of sentences. Their work was based on Revesz (2015a) and the assumption that the sound changes within a word (during a longer time period) are similar to genetic mutations. According to the authors, the Phaistos Disk possibly represents an ancient sun hymn that might be connected to a winter solstice ceremony.
Luo, Cao, and Barzilay (2019a) proposed a method for the automatic decipherment of lost languages based on LSTM based sequence-to-sequence neural decipherment via minimum-cost flow. The goal of their model was to determine character-level correspondences between cognates or related words. They tested their method on the Linear B syllabary, where it correctly translated 67.3% of cognates on a challenging Linear B dataset, and 84.7% on a noiseless, less challenging Linear B corpus. In the challenging Linear B dataset, roughly 50% of the Linear B words were missing a cognate word in Greek. The dataset that the authors used in the development of their method is available at Luo, Cao, and Barzilay (2019b), and is a modification of the one presented in Tselentis (2011).
Min Eu, Xu, and Cacciafoco (2019) presented an initial design of a computational method for comparing Linear A strings to lexical lists and dictionaries of languages from a compatible time period. Because their proposed design is still under development, not much in-depth information is available on it.
Colin and Cacciafoco (2020) proposed a “brute-force attack” based algorithmic approach for comparing Linear A clusters to different dictionaries of languages used in relative chronological and geographical vicinity to Linear A. Their aim was to perform an exhaustive search for a language or a language family that the Linear A script might be related to. This search included the comparison between words from the dictionaries and lexical lists of different languages, and the list containing information on Linear A. Preliminary results that the authors obtained indicate that links could exist between Linear A words and words found in languages such as Luwian, Thracian, the Hamito-Semitic languages, and so forth.
Mavridaki, Galiotou, and Papakitsos (2021a) and Mavridaki, Galiotou, and Papakitsos (2021b) presented an outline of a software tool that could be used for understanding and learning Linear A. Their tool contains a multilingual database and could be used to connect Linear A words to related words in other languages. Since the software tool the authors presented is still under development, not much information on it has been released to the public.
7.1.3 Approaches Oriented Towards the Prediction of Missing or Damaged Symbols.
This category of approaches for the decipherment of Aegean and Cypriot scripts is mostly oriented towards the prediction of missing or damaged symbols with the aim of text restoration and dataset augmentation. Since not many ancient texts associated with certain undeciphered languages exist, this could prove incredibly important for future researchers. This category of approaches encompasses several methods (Assael, Sommerschield, and Prag 2019; Papavassileiou, Owens, and Kosmopoulos 2020; Karajgikar, Al-Khulaidy, and Berea 2021; Papavassileiou, Kosmopoulos, and Owens 2023).
Assael, Sommerschield, and Prag (2019) proposed a model for ancient text restoration that recovers missing characters from damaged texts by the use of a bidirectional LSTM based deep neural network. They named their model Pythia and trained it on Ancient Greek inscriptions dating from the seventh century BCE to the fifth century CE. For each text restoration task Pythia returns the 20 most probable outcomes (decoded by beam search) instead of just one. The Pythia model was expanded in Assael et al. (2022c), where the authors presented a deep neural network for textual restoration, geographical attribution, and chronological attribution of ancient Greek inscriptions, and named it Ithaca. Ithaca achieves 62% accuracy on text restoration tasks, and helps historians improve their accuracy from 25% to 72%. Additionally, Ithaca can determine the geographical origin of ancient texts with 71% accuracy, and ascertain the timeframe in which they were created with an error range of ±30 years.
Papavassileiou, Owens, and Kosmopoulos (2020) presented a dataset of Mycenaean Linear B sequences. On one part of this dataset they conducted an experiment that aimed to predict the phonetic values of unidentified or illegible Linear B signs by using a linear-chain CRF based approach. The part of the dataset that was used in this experiment included tablets from Knossos that referred to lists of personnel. The authors plan on making the dataset public once it is completed and reviewed by experts in the field.
Karajgikar, Al-Khulaidy, and Berea (2021) proposed a method for exploring the Linear A dataset with the purpose of obtaining information on the distribution of symbols in the corpus. Their method consisted of many different natural language processing techniques, such as generating bigrams and trigrams, using weighted frequency for symbol prediction, word embedding, and so on. The framework of their proposed model included exploratory n-gram analysis and exploratory knowledge mining. Exploratory n-gram analysis revealed high levels of entropy within the probable words in Linear A dataset, higher than the levels commonly found in other languages. This indicated to the authors that Linear A may not represent a language at all. However, when the authors performed the analysis on the entire Linear A dataset and not just on probable words from the dataset, the entropy decreased. Results of the exploratory n-gram analysis were therefore inconclusive. Exploratory knowledge mining, on the other hand, was used to determine whether the symbols in the dataset represented words or ideograms. The authors compared k-Nearest Neighbors and naïve Bayes classifiers for that task, and ultimately chose the former as the final model.
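A minimal version of the bigram entropy computation underlying such exploratory n-gram analysis is sketched below; the symbol sequences are invented placeholders rather than actual Linear A data.

```python
# A sketch of exploratory bigram entropy analysis; sequences are placeholders.
import math
from collections import Counter

def bigram_entropy(sequences):
    """Shannon entropy (in bits) of the bigram distribution over sequences."""
    bigrams = Counter()
    for seq in sequences:
        bigrams.update(zip(seq, seq[1:]))
    total = sum(bigrams.values())
    return -sum((c / total) * math.log2(c / total) for c in bigrams.values())

corpus = [["A", "B", "C"], ["A", "B", "D"], ["C", "A", "B"]]
print(round(bigram_entropy(corpus), 3))  # higher values indicate less regularity
```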
Papavassileiou, Kosmopoulos, and Owens (2023) presented a generative neural language model based on bidirectional recurrent neural networks for restoration of damaged and incomplete Mycenaean texts (e.g., the ones written in Linear B). Their method could be used for augmenting already scarce inscriptions contained in Linear B datasets by means of predicting the missing syllables. The dataset of Linear B sequences that the authors used in their model is the one presented in their previous work in Papavassileiou, Owens, and Kosmopoulos (2020), but it is unfortunately unavailable online. In order to help improve the accuracy of their augmentation method, the authors manually extracted 4 rules from a subset of the Linear B dataset. This subset is called series D, and it consists of 1,100 tablets that contain information about the sheep herds of Crete. The rules that the authors extracted from that subset are the result of linguistic analysis and include information about the common structure of Linear B tablets from series D (e.g., the common order of toponyms and personal names in texts inscribed on tablets). Finally, the authors released a table containing inscriptions from damaged tablets alongside their most probable augmentations generated by the proposed model.
7.2 Existing Approaches for the Decipherment of Non-Aegean and Non-Cypriot Scripts
In the interest of covering a greater selection of algorithms that might be used for the decipherment of ancient scripts, in this section we present a chronological overview of computational approaches regarding the decipherment of non-Aegean and non-Cypriot scripts. As opposed to the Aegean and Cypriot scripts, these scripts may use non-syllabic writing systems and may come with their own unique sets of challenges that fall outside of the scope of this article.
There are several existing approaches for the computational decipherment of non-Aegean and non-Cypriot scripts (Knight and Yamada 1999; Knight et al. 2006; Terras 2006b; Snyder and Barzilay 2008; Snyder 2010; Snyder, Barzilay, and Knight 2010; Snyder and Barzilay 2010; Berg-Kirkpatrick and Klein 2011; Kim and Snyder 2013; Currey and Karakanta 2016; Deri and Knight 2016; Luo et al. 2021). We divide these approaches into three categories: (1) those based on the analysis of cognate words, (2) those based on the digitalization of the papyrology process, and (3) those based on using information gathered from high-resource languages. These categories intertwine and do not have hard set boundaries.
7.2.1 Approaches Based on the Analysis of Cognate Words.
This category of approaches encompasses methods based on the discovery and analysis of related words from different languages and includes several approaches (Snyder, Barzilay, and Knight 2010; Berg-Kirkpatrick and Klein 2011; Luo et al. 2021).
Snyder, Barzilay, and Knight (2010) proposed a method for the automatic decipherment of lost languages that is based on a non-parametric Bayesian framework. Their method required a non-parallel corpus in a known related language to produce alphabetic mappings and identify cognate words, and was based on the assumption that the writing system of the language trying to be deciphered was alphabetic in nature. The authors concentrated on Ugaritic and Hebrew languages. When using Hebrew as a related known language to Ugaritic, their method accurately mapped 29 of 30 Ugaritic letters, and accurately translated 60.4% of all cognate words. Their proposed method was elaborated upon in additional detail (Snyder 2010; Snyder and Barzilay 2010).
Berg-Kirkpatrick and Klein (2011) proposed a method for the detection of cognate words between two scripts written in an alphabetic writing system. They formulated the problem they were trying to solve as a combinatorial optimization problem where coordinate descent procedure was used to discover solutions. Their method incorporated both a matching between alphabets and a matching between lexicons, and was evaluated on four datasets where it achieved promising results. The languages that the method was evaluated on included English, Italian, Spanish, Portuguese, Hebrew, and Ugaritic.
Luo et al. (2021) proposed a generative framework for deciphering undersegmented ancient scripts using a phonetic prior. The inputs to their model include an undersegmented inscription in an undeciphered language written in an alphabetic writing system, and a vocabulary in a deciphered language also written in an alphabetic writing system. The aim of their model is to extract spans from the undeciphered text and match them to cognates in the deciphered text. The authors used International Phonetic Alphabet (IPA) embeddings and represented each symbol of a known language with a vector of its phonological features. The authors also proposed a measure of language closeness that can be used to discover languages that are close or related to the undeciphered language. The authors showed that their method accurately detected the deciphered languages Gothic and Proto-Germanic as related, the deciphered languages Ugaritic and Hebrew as related, and the undeciphered Iberian language as probably not related to any of the languages they compared it against. They also concluded that Iberian is more similar to Basque (also undeciphered) than to any of the other compared languages. Since the selection of compared languages was rather small (Latin, Spanish, Turkish, Hungarian, Proto-Germanic, and Basque), these findings could perhaps be expanded upon in the future by including a greater number of languages in the linguistic analysis.
7.2.2 Approaches Based on the Digitalization of the Papyrology Process.
This category of approaches encompasses methods based on the digitalization of the traditional papyrology process, and currently includes a single approach, described below.
Terras (2006b) proposed a computational model of the papyrology process that can aid in the complicated procedure of reading very damaged ancient documents. They demonstrated their method, based on image processing and image analysis via the adapted GRAVA (Grounded Reflective Adaptive Vision Architecture) architecture (Robertson 1999, 2001), on the Vindolanda tablets discovered in the ruins of the ancient Roman fort Vindolanda located in Chesterholm in Northern England. Although not initially proposed for the decipherment of unknown texts, their model behaves in a manner similar to expert papyrologists and can potentially accelerate the decipherment process, or at least the construction of digital datasets of unknown texts and inscriptions. The method proposed in Terras (2006b) was further elaborated upon in Terras and Robertson (2005) and Terras (2006a).
7.2.3 Approaches Based on Using High-resource Languages as a Base.
This category of approaches encompasses methods based on using information gathered from high-resource languages for the analysis of possibly related low-resource languages. For the purposes of model evaluation and testing, researchers sometimes use high-resource languages as if they were low-resource languages and then try to “decipher” them. This category of approaches has been used by many papers (Knight and Yamada 1999; Knight et al. 2006; Snyder and Barzilay 2008; Kim and Snyder 2013; Currey and Karakanta 2016; Deri and Knight 2016).
Knight and Yamada (1999) proposed a method for the phonetic decipherment (or text-to-speech conversion) of unfamiliar scripts written in a known language (as in the case of Aegean Linear B). They used a bidirectional sound-to-character sequence finite-state transducer, and used the Expectation-Maximization algorithm to determine sound-character mapping probabilities. In their work the authors concentrated on the Spanish, Japanese, and Chinese languages. Even though they did not test their algorithm on any ancient unknown script, they did conclude that decipherment is possible even with only limited knowledge of the language behind the script. This is promising because oftentimes not much is known about the language behind an unknown script, and this is certainly true in the case of undeciphered Bronze Age Aegean and Cypriot scripts.
Knight et al. (2006) proposed a method for automatic decipherment of ciphertexts into plaintexts that is based on the Expectation-Maximization algorithm. The authors particularly focused on phonetic decipherments, and divided them into two separate categories: regular phonetic decipherments (when the language behind the ciphertext is known), and universal phonetic decipherments (when the language behind the ciphertext is unknown). In the latter case, ciphertexts can be regarded as texts that researchers have trouble deciphering, such as Linear A inscriptions and most of the texts and documents associated with the Minoan civilization. The authors built phoneme n-gram databases for 80 different languages, and showed that, when compared to these databases, an input text written in the Spanish language and encoded with a simple substitution cipher can successfully be recognized as being written in Spanish (and if not in Spanish, then the model predicts the Galician, Portuguese, or Kurdish languages as the closest matches to the encoded text). The work presented in Knight et al. (2006) can therefore be extremely helpful in discovering more information about ancient scripts written in unknown languages.
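The following sketch illustrates the general principle of n-gram based language identification in a heavily simplified form; the two tiny bigram profiles and the overlap measure are our own stand-ins for the 80-language phoneme n-gram databases of Knight et al. (2006).

```python
# A heavily simplified sketch of n-gram based language identification;
# the toy "databases" stand in for real phoneme n-gram databases.
from collections import Counter

def bigram_profile(text):
    text = text.lower().replace(" ", "_")
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def closeness(p, q):
    """Shared probability mass between two bigram distributions."""
    return sum(min(p.get(bg, 0.0), q.get(bg, 0.0)) for bg in set(p) | set(q))

databases = {"spanish": bigram_profile("el gato come pescado en la casa"),
             "english": bigram_profile("the cat eats fish in the house")}
unknown = bigram_profile("la casa del gato")
print(max(databases, key=lambda lang: closeness(unknown, databases[lang])))  # spanish
```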
Snyder and Barzilay (2008) presented a non-parametric, unsupervised, and hierarchical Bayesian model that induces morpheme segmentations of multiple selected languages and concurrently identifies cross-lingual morpheme patterns. The authors concentrated on Arabic, Hebrew, Aramaic, and English languages. The motivation behind their work was to discover connections between different languages and determine whether those connections could prove useful in automatic language analysis.
Kim and Snyder (2013) proposed a method for the prediction of consonants and vowels for an unknown language and alphabet. They performed posterior inference over 503 languages in order to obtain information that would allow them to discover general linguistic patterns. In order to obtain that information, they used HMMs and made 3 assumptions: (1) each language has an unobserved set of parameters that can explain its observed vocabulary, (2) each language-specific set of parameters has an unobserved common prior shared between a cluster of related languages, and (3) each of those clusters derives its parameters from a common prior of all language groups. The motivation behind their research was to discover, for a given unknown language and alphabet, a group of languages they are related to. This would provide them with certain assumptions on common linguistic patterns which they could use in the prediction of consonants and vowels. Even though their research was limited to alphabets, they plan on expanding it to include other writing systems as well.
Currey and Karakanta (2016) proposed a method for lessening the problem of a lack of training data for low-resource languages by augmenting them with data from related high-resource languages. In their work, the authors treated the Spanish language as if it were a low-resource language, and used Italian and Portuguese as related high-resource languages. Although their method did not show that language modeling for a low-resource language can be improved by using information from related high-resource languages, it did show promising results in the field of statistical machine translation.
Deri and Knight (2016) proposed a grapheme-to-phoneme model trained on high-resource language(s) and applied to related low-resource language(s). Grapheme-to-phoneme models convert written words into pronunciations that are usually represented in IPA. The dataset on which the authors trained the model consisted of more than 650,000 word-pronunciation pairs from more than 500 languages. The authors made use of an online repository of cross-lingual phonological data named Phoible (Moran, McCloy, and Wright 2014). Phoible is composed of language phoneme inventories that contain sets of phonemes represented in IPA, and each phoneme also contains a unique feature vector that represents its phonological features.
7.3 Computational Simulations of Traditional Decipherment Methods
Traditionally, decipherments of ancient scripts are based on epigraphy, paleography, and linguistics. Michael Ventris and his colleagues who jointly contributed to the decipherment of Linear B did not have computer resources to help them in their decipherment process, but were still able to successfully achieve their goal. In this section of the article we summarize the traditional Linear B decipherment process and propose automatic methods that could be used to simulate it.
According to the Mycenaean Epigraphy Group (2023), the Linear B decipherment process included the following:
determination of Linear B’s writing system—judging by the number of phonetic symbols it was a syllabary, but it contained a smaller fraction of logograms as well;
determination of an overall content of Linear B inscriptions—in certain cases it was possible to infer that the inscriptions represented administrative documents based on easily recognizable logograms that provided the much needed contextual information;
establishment of the definitive list of Linear B symbols and their variants with the same phonetic values—this analysis was performed by Emmett L. Bennett Jr.;
statistical analysis of Linear B words that only slightly differed from one another—this analysis was performed by Alice Kober and showed that the language behind Linear B was an inflected language—that the words could change depending on, for example, gender;
educated guesses—Michael Ventris made a series of educated guesses about the Linear B script, such as hypothesizing that certain Linear B words could represent the names of real places on Crete, that vowels are usually found at the beginning of words in syllabic scripts, and that consonants followed by vowels usually appear in the middle. Ventris then created a table depicting the symbols and their possible phonetic values, and through trial and error realized that the language behind Linear B was a form of archaic Greek.
To simulate the above-mentioned steps one could use a number of different computational methods discussed in Section 6. Here we present a general outline of a computational approach to five traditional decipherment steps outlined above. In all five steps we assume that there is enough data on which the algorithm can be trained or analysis performed, but sadly this is not always the case.
7.3.1 Determination of a Writing System Associated with an Unknown Script.
Knowledge about the writing system used in an unknown script is extremely important. A decipherment process associated with an alphabet based writing system would be very different from one based on a logosyllabary. Sometimes, a good guess about the type of writing system used in a certain unknown script can be based upon similar or possibly related scripts (though such relations may not be known), on the number of unique symbols used in the unknown script (which may also be unknown), or on a statistical language analysis of available texts and an examination of their internal structure.
Usually, alphabets have a small number of unique symbols while logosyllabaries have many more. A hard line between different types of writing systems unfortunately cannot be drawn, and many scripts exhibit the characteristics of multiple types of writing systems instead of just one. Computational approaches associated with these processes could include expert-based rules regarding the number of unique symbols found in unknown scripts, a statistical comparison with scripts whose writing systems are already known, and a computational expert-based statistical analysis of grammatical structure.
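A toy version of such an expert rule based on symbol counts might look as follows; the thresholds are rough assumed ranges rather than established boundaries, while the sign counts used in the example are commonly cited figures.

```python
# A toy expert-rule heuristic mapping unique symbol counts to writing systems;
# the thresholds are rough assumed ranges, not established boundaries.
def guess_writing_system(unique_symbol_count):
    if unique_symbol_count <= 40:
        return "likely an alphabet"
    if unique_symbol_count <= 120:
        return "likely a syllabary"
    return "likely a logosyllabary or logography"

for script, n in [("Phoenician", 22), ("Linear B", 87), ("Sumerian cuneiform", 600)]:
    print(script, "->", guess_writing_system(n))
```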
7.3.2 Context Determination.
In order to determine the overall context of an undeciphered text, one must first look at the location at which the text was found. If the text was found in the ruins of a religious monument, one might conclude that the text itself had a religious meaning. If the text was found on or near the remnants of an old market, perhaps it contained numbers and fractions associated with the prices of goods once offered at that market. However, since the original location of the text cannot always be determined, one might use image processing based search and recognition of certain pictograms and logograms in order to gain insight into the overall context of the text. If, for example, an automatic recognition system recognizes a pictogram or a logogram resembling a person, it might conclude that the symbols near it represent that person’s name. This visual recognition system might use traditional image processing methods (e.g., corner detection, edge detection, and various other filters) or convolutional neural networks trained on logograms with already known meanings. The same algorithm could also be modified to search for similarities between pictograms and logograms in different scripts, and try to determine the relationships (if any) between them.
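As a simple illustration of the search component, the sketch below scans an inscription photograph for a known logogram via OpenCV template matching; the file names and the match threshold are hypothetical.

```python
# A sketch of logogram search via template matching; the image files and the
# 0.8 threshold are hypothetical placeholders.
import cv2

inscription = cv2.imread("tablet_photo.png", cv2.IMREAD_GRAYSCALE)
logogram = cv2.imread("person_logogram.png", cv2.IMREAD_GRAYSCALE)

scores = cv2.matchTemplate(inscription, logogram, cv2.TM_CCOEFF_NORMED)
_, max_score, _, max_loc = cv2.minMaxLoc(scores)
if max_score > 0.8:
    print("possible 'person' logogram near pixel position", max_loc)
```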
7.3.3 Vocabulary Construction.
Vocabulary construction is the most important step of the decipherment process since many of the other steps depend on it, but it can also be the most challenging one to realize. Symbols can have many different variations and in order to uncover them all, a statistical analysis of the text must be performed. Once obtained, these results can then be compared to the grammatical features of other languages in order to uncover the relationship (if any) between them.
7.3.4 Statistical Text Analysis.
Statistical analysis of words in an undeciphered text assumes that the text can be segmented into words. This is not always the case, however, as word dividers or punctuation marks might not exist or might be unknown. If words can be extracted from the text, a simple calculation of the edit distance between the extracted words will locate similar words and provide a measure of their similarity. Further grammatical analysis of similar words can try to discover whether the language behind the script is inflectional or not.
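The edit distance computation mentioned above can be realized with standard dynamic programming, as in the following sketch; the two Latinized word forms are illustrative placeholders (ko-no-so is the attested Linear B spelling of Knossos).

```python
# A standard dynamic-programming (Levenshtein) edit distance between two words.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))       # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("ko-no-so", "ko-no-si"))  # -> 1, an inflection-like variation
```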
7.3.5 Educated Guesses.
This step of the decipherment process of ancient scripts is possibly the most difficult one to simulate via a computational method. It depends on knowledge from many different areas (historical, geographical, grammatical, linguistic, etc.), and on the cognitive process by which that knowledge is ultimately combined. In order to even begin to realize this step of the decipherment process, the construction of a generalized database that incorporates all of that knowledge is required. How this should be performed, however, and what data should be included in it, is still an open question.
7.4 Final Remarks on the Computational Decipherment of Ancient Scripts
It can be concluded from previous sections that the computational research on ancient unknown scripts is a complex and challenging field. With the advancements in computer science, and especially in artificial intelligence and deep learning models, this field continues to evolve. Our final remarks on the reviewed approaches for the computational decipherments of ancient scripts are as follows:
There is an obvious lack of digital data and corpora associated with the computational decipherment of ancient scripts. Datasets currently linked to ancient unknown scripts are not standardized among researchers, and in turn this makes various decipherment models and their associated results difficult to compare.
The datasets associated with ancient unknown scripts are generally small and require extensive computational augmentation. However, since the methods used for these augmentations are not standardized, even if different researchers start working with the same dataset, that dataset might change drastically after augmentation.
Different methods for the decipherment of ancient scripts can start with different assumptions, and therefore arrive at completely different conclusions. This is especially true for cases when researchers assume that the ancient unknown scripts are related to certain languages and scripts, as these can widely differ.
Computational approaches for the decipherment of ancient scripts usually involve comparison with chronologically and geographically related languages and scripts. This comparison is usually based on the analysis of visual and phonological features of symbols and words, with the aim of identifying those that might be related and have similar meanings.
Deep learning based models for the computational decipherment of ancient scripts are becoming increasingly common, and are starting to outnumber the methods based on traditional machine learning (e.g., SVM or HMM). This is not surprising, however, as large language models and generative artificial intelligence are capable of making connections between different data points that the traditional machine learning algorithms are simply not designed for. This makes deep learning models better at finding hidden connections between different languages and scripts, which can in turn greatly impact the field of computational decipherment of ancient scripts.
Computational decipherment of ancient scripts is an incredibly interesting and rapidly growing field. With the advancements in computer science, we believe it will one day be possible to decipher not only Bronze Age Aegean and Cypriot scripts, but many other unknown scripts as well. We hope that this will lead to new insights about the lives of the people who used these scripts, as well as to additional information about our own history.
8. Related Scripts and Languages
One of the most important things in deciphering unknown ancient languages is to try to determine their closest relative(s), as this could have a positive impact on the resolution of challenges outlined in Section 4. The importance of using related languages to enhance statistical language models is not new, and it was already emphasized in previous work (Currey and Karakanta 2016; Pourdamghani and Knight 2017; Karakanta, Dehdari, and van Genabith 2018; Pourdamghani and Knight 2019; Mavridaki, Galiotou, and Papakitsos 2021b). Tóth (2007) also states that the distribution of many typological features of languages is not random but restricted to a relative geographical area. Since the language(s) behind the Bronze Age Aegean and Cypriot scripts are mostly unknown, it is helpful to look at the history of those scripts and the people that used them in order to gather more information about them and connect them to possible related scripts and languages.
British archaeologist Sir Arthur J. Evans named the Bronze Age culture of Crete Minoan (Chadwick 1987). Even though Evans hypothesized that the Minoans were refugees from Northern Egypt (Callaway 2013), the ancestry of the Minoan civilization is still a highly debated topic and no consensus on this issue exists among researchers. Recent analysis of mitochondrial DNA taken from Minoan osseous remains from a cave ossuary in the Lassithi plateau of Crete, presented in Hughey et al. (2013), refutes the Evans hypothesis of Minoan North African origin, and supports the hypothesis of autochthonous development of the Minoan civilization by the descendants of the Neolithic settlers of Crete who probably arrived there around 9,000 years ago from Anatolia and/or the Middle East (Hughey et al. 2013). Research presented in Hofmanová et al. (2016) demonstrated that a direct genetic link exists between the Mediterranean and Central European early farmers and those of Greece and Anatolia. It should be emphasized that the authors in Hofmanová et al. (2016) concentrated on paleogenomic data obtained from northern Greece and northwestern Turkey, and not on Crete or Cyprus. Additionally, research presented in Omrak et al. (2016) shows a direct link between Anatolia and the early European Neolithic gene pool. Other theories on the Minoan origins exist as well. For example, in Lazaridis et al. (2017) the Minoan ancestry is linked to Anatolian and Aegean Neolithic farmers, and to the ancient populations related to those of Caucasus and Iran; in Revesz (2019) the Minoan ancestry is linked to Anatolia, the Danube Basin, and the Black Sea littoral area; and in Revesz (2021) the Minoan ancestry is linked to Northern Greece, Anatolia, the Caucasus, and the Balkans (mostly the Danube Basin). Previous attempts at linking the Bronze Age Aegean and Cypriot scripts to other possibly related scripts can also be found (Revesz 2017b; Schrijver 2019; Revesz 2020).
Languages used during the Bronze Age in relative proximity to Greece and Cyprus that we will focus on in this section include the Sumerian, Ancient Egyptian, Eblaite, Akkadian, Hattic, Hittite, Palaic, Luwian, and Phoenician languages. These languages were mostly used in certain places close to the Mediterranean Sea, especially Mesopotamia, Ancient Egypt, Anatolia, and Phoenicia. The timeline of these languages is summarized in Table 1, and discussed and referenced in detail in Sections 8.1–8.4. Additionally, in Section 8.5 we discuss geographically close languages that became prominent after the disappearance of the Bronze Age Aegean and Cypriot scripts, as these languages could have potentially been influenced by the disappeared Aegean and Cypriot scripts, and could perhaps be helpful in their decipherment.
| Language | Linguistic classification | Script | Geographical region | Approximate timeframe |
| --- | --- | --- | --- | --- |
| Sumerian | Language isolate | Cuneiform | Mesopotamia | 4000–2000 BCE |
| Old Egyptian | Afro-Asiatic → Ancient Egyptian | Egyptian hieroglyphs | Ancient Egypt | 2700–2200 BCE |
| Eblaite | Afro-Asiatic → Semitic → East Semitic | Adapted cuneiform | Syria | 2450–2350 BCE |
| Akkadian | Afro-Asiatic → Semitic → East Semitic | Adapted cuneiform | Mesopotamia | 2350 BCE – first century CE |
| Middle Egyptian | Afro-Asiatic → Ancient Egyptian | Egyptian hieroglyphs | Ancient Egypt | 2200–1800 BCE |
| Hattic | Language isolate, possibly Uralic → Finno-Ugric → Ugric | Not recorded by its native speakers; known from Hittite cuneiform texts | Anatolia | late third to mid-second millennium BCE |
| Kalasmic | Indo-European → Anatolian | Cuneiform | Anatolia | 1650–1200 BCE |
| Hittite | Indo-European → Anatolian | Cuneiform | Anatolia, Syria | 1650–1180 BCE |
| Palaic | Indo-European → Anatolian | Cuneiform | Anatolia | sixteenth to twelfth century BCE |
| New Egyptian | Afro-Asiatic → Ancient Egyptian | Egyptian hieroglyphs | Ancient Egypt | 1580–700 BCE |
| Luwian | Indo-European → Anatolian | Adapted cuneiform, Anatolian hieroglyphs | Anatolia, Syria | 1500–700 BCE |
| Phoenician | Afro-Asiatic → Proto-Semitic → West Semitic → Central Semitic → Northwest Semitic → Canaanite | Phoenician alphabet | Lebanon and many other Mediterranean coastal areas | twelfth century BCE – 196 CE |
Figure 6 shows the approximate locations of Mesopotamia, Ancient Egypt, Anatolia, and Phoenicia. For the creation of the map we used QGIS geospatial software (QGIS Development Team 2023). A further review of computational natural language processing approaches to similar and related languages, language varieties, and dialects can be found in Zampieri, Nakov, and Scherrer (2020).
8.1 Mesopotamian Languages
The historical region of Mesopotamia approximately includes the area that is now eastern Syria, southeastern Turkey, and most of Iraq (Frye et al. 2023).
Mesopotamian languages were written mostly in a cuneiform script. According to Kudrinski (2016), cuneiform script was invented in the late fourth millennium BCE for writing one of the languages of southern Mesopotamia, supposedly Sumerian. The Sumerian language (c. 4000–2000 BCE [Hajiyeva 2019]) is considered to be a language isolate (Pereltsvaig 2021), which means it does not appear to share an ancestral language with any other known language. It was written in the Sumerian cuneiform script (c. 3400 BCE – 75 CE [Gutherz et al. 2023]). The paleographic links between the Sumerian language and the Aegean scripts (mainly Linear A and Linear B) are outlined in Papakitsos and Kenanidis (2015).
Two additional Mesopotamian languages that might be related to the Aegean scripts are Eblaite and Akkadian. Both were written in a script adapted from Sumerian cuneiform (Pereltsvaig 2021). Eblaite was spoken in the northeastern Levant, in present-day Syria (Gordon 1997) (as cited in Kitchen et al. 2009), and is attested only from the Ebla archives (c. 2450–2350 BCE) (Pearce 2010). The Akkadian language (c. 2350 BCE – first century CE [Seri 2010]) was spoken in Mesopotamia (Beckman 1983). Eblaite and Akkadian belong to the Afro-Asiatic → Semitic → East Semitic language branch (Hetzron 2009).
8.2 Ancient Egyptian Languages
According to Chen (1999), the Ancient Egyptian language can be divided into five phases: Old Egyptian (c. 2700–2200 BCE), Middle Egyptian (c. 2200–1800 BCE), New Egyptian (c. 1580–700 BCE), Demotic (c. 700 BCE – 600 CE), and Coptic (c. 600–1000 CE). Since the Bronze Age lasted from approximately 3000 BCE to 700 BCE (Vandkilde 2016), only the Old, Middle, and New Egyptian phases were in use during that time, while the Demotic and Coptic phases came later. According to Loprieno and Müller (2012), the Ancient Egyptian language represents a separate branch of the Afro-Asiatic language family.
8.3 Anatolian Languages
The historical region of Anatolia refers to the area encompassing most of the territory of present-day Turkey. The Anatolian branch is the oldest attested branch of the Indo-European language family (Joseph 2003), but all of the languages belonging to it are considered extinct (Pereltsvaig 2021). Anatolian languages include Hittite, Luwian, Palaic, Lycian, Milyan, Carian, Pisidian, and Sidetic (Adiego 2016).
The Hittite language was written in cuneiform and used in areas that belong to present-day Turkey and northern Syria in the following periods: 1650–1500 BCE (Old Hittite), 1500–1350 BCE (Middle Hittite), and 1350–1180 BCE (New Hittite) (Sukhareva et al. 2017). Hittite is the oldest attested Indo-European language (Kloekhorst 2007). A grammatical comparison of Linear A and Hittite is available in Janke (2022).
The Luwian language was used in central and southern Anatolia and northwestern Syria in the period c. 1500–700 BCE, and was written in Anatolian hieroglyphs and an adaptation of Mesopotamian cuneiform (Yakubovich 2015).
The Palaic language was used in the northern region of modern-day Turkey and is attested from the sixteenth to the twelfth century BCE (Bianconi 2019). Palaic was written in cuneiform (Kudrinski 2018).
According to the attested timeframes given in Bianconi (2019) for the Carian (between the seventh and the third century BCE), Lycian (between the sixth and the fourth century BCE), Sidetic (third century BCE), and Pisidian (between the first century BCE and the third century CE) languages, we can conclude that they were not used during the Bronze Age, whose approximate timeframe is given as 3000–700 BCE in Vandkilde (2016).
One other language spoken in Anatolia during the Bronze Age, but belonging neither to the Indo-European language family nor to its Anatolian branch, is Hattic. The Hattic (or Hattian) language was used in Anatolia from the late third to the mid-second millennium BCE, and its native speakers did not record it in writing (Goedegebuure 2013). The only information we have on this language comes from Hittite documents that occasionally included Hattic words and sentences (Revesz 2017b). According to Revesz (2017b), the Hattic language is generally considered by linguists to be a language isolate, but it could possibly be a Ugric language, and some form of Proto-Hattic could represent the origin of the Minoan language. In terms of linguistic classification, Ugric is a branch of the Finno-Ugric language family (which might be a sub-branch of the larger Uralic language family) and encompasses the Hungarian and Ob-Ugric sub-branches (Pereltsvaig 2021). The Hungarian sub-branch refers to the Hungarian language, which is native to Hungary and is one of the official languages of the European Union. The Ob-Ugric sub-branch encompasses the Khanty and Mansi languages (Pereltsvaig 2021), which are spoken in Russia.
In Revesz (2017b) an expansion of the Uralic language family is proposed, in which the Hungarian sub-branch is replaced with a West-Ugric sub-branch that encompasses the Minoan, Hattic, and Hungarian languages. Revesz (2017b) also suggests that the Minoan language is additionally connected to both the Indo-European language family and the Greek language. In addition to Revesz (2017b), other papers have also connected the Minoan language to the Finno-Ugric language branch (e.g., Schrijver 2019; Revesz 2018, 2020).
Finally, it is interesting to note that in late 2023 a new language was discovered on cuneiform tablets from Hattusa, Anatolia (Julius-Maximilians-Universität Würzburg 2023a, b; de Lazaro 2023; Keys 2023). According to de Lazaro (2023), it is believed that this newly discovered Indo-European language might have been spoken by the people of Kalasma (located approximately in modern northwestern Turkey), and it has therefore been named the Kalasma, Kalašma, or Kalasmic language. Not much is currently known about this language, but it was used from approximately 1650 BCE to 1200 BCE (Julius-Maximilians-Universität Würzburg 2023a). The discovery of the Kalasmic language gives hope to researchers working with ancient languages, as it shows that there are still secrets waiting to be uncovered, even after thousands of years.
8.4 Phoenician Language
Phoenicia was an “ancient country of southwestern Asia at the eastern end of the Mediterranean Sea where modern Lebanon and adjacent parts of Syria and Israel now are” (Merriam-Webster.com Dictionary 2023b).
The Phoenician language was spoken in the coastal parts of today’s Lebanon from the twelfth century BCE until 196 CE, and was spread to many Mediterranean areas by Phoenician merchants (Hetzron 2009). The linguistic classification of the Phoenician language is Afro-Asiatic → Proto-Semitic → West Semitic → Central Semitic → Northwest Semitic → Canaanite (Rubin 2008).
8.5 Later Languages
The most significant languages and/or language groups that became prominent after, or at the brink of, the disappearance of the Bronze Age Aegean and Cypriot scripts, and in the relative vicinity of the Bronze Age Minoan civilization, include the Etruscan language (central-west Italy and the French island of Corsica), the Messapic language (southeastern Italy), the Thracian language (southeastern Europe), the Illyrian language (southeastern Europe), the Phrygian language (Turkey), the Dacian language (southeastern Europe), the Aramaic language (Middle East), the Lepontic language (northern Italy), and many other languages and dialects. Since these languages were used in relative geographical proximity to Crete and Cyprus, they may have been influenced in some ways by the Minoan civilization and may hold the key to the decipherment of the Cretan hieroglyphic, Linear A, and Cypro-Minoan scripts.
People who spoke the Etruscan and Messapic languages may have had connections to the Minoan civilization. The earliest known inscriptions in the Etruscan language are dated to about 700 BCE (Freeman 1999). According to Robinson (2009), Herodotus wrote that the Etruscans migrated to Italy through the Aegean islands from Lydia in Anatolia, but most scholars disagree with that view due to the lack of archaeological evidence. Etruscan was spoken in central-west Italy, was written in the Etruscan alphabet, and still remains undeciphered. Messapic, on the other hand, was first attested in the sixth century BCE (Matzinger 2015) and was spoken in southeastern Italy. According to Blažek (2005), the Messapian people are first mentioned by Herodotus as descendants of the Cretans at the time of Minos. In Greek mythology, Minos was the king of Crete associated with the labyrinth of the Minotaur, and the king after whom Sir Arthur J. Evans named the Minoan civilization.
9. Conclusion
In this article we presented a review of computational approaches to deciphering Bronze Age Aegean and Cypriot scripts, namely, the Archanes script and the Archanes formula, Cretan hieroglyphic (including the Malia Altar Stone and Arkalochori Axe), Phaistos Disk, Linear A, Linear B, Cypro-Minoan, and Cypriot scripts. We analyzed and compared the available digital corpora and resources associated with these scripts, outlined the possible challenges, and proposed the steps required to improve the likelihood of computational decipherment of ancient scripts.
Our main conclusions are as follows:
- There is a dire need for a unified, digitized dataset of Aegean and Cypriot inscriptions and dictionaries, alongside possible phonetic transcriptions of the different symbols.
- This unified dataset should also contain inscriptions and dictionaries for the other languages and scripts that might be related to the Bronze Age Aegean and Cypriot scripts, mainly those used in ancient Anatolia, Mesopotamia, Phoenicia, and Egypt.
- Bronze Age Aegean and Cypriot inscriptions are scarce and short, so augmentation of the proposed unified dataset should be considered, either computational or physical (i.e., via novel archaeological discoveries); a minimal illustration of what computational augmentation could look like is sketched after this list.
- More research (genealogical, linguistic, or archaeological) is needed on the possible connections of the Bronze Age Aegean and Cypriot scripts to other scripts and languages used in relative geographical and chronological vicinity. This will probably require the use of high-performance computing.
- Finally, more public exposure of the currently undeciphered ancient scripts is needed. In our personal experience, many people are unaware of the existence of these scripts, and the more information about them is released to the general public, the higher the probability of their successful decipherment becomes.
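As a purely illustrative example of the computational augmentation mentioned in the list above, the following Python sketch generates noisy variants of short sign sequences by randomly deleting or substituting individual signs. The sign labels, probabilities, and function names are hypothetical placeholders, not real transcriptions or recommended settings.

```python
# Minimal sketch of one possible computational augmentation strategy
# for short sign sequences. The sign inventory and sequences below are
# hypothetical placeholders, not real Linear A transcriptions.
import random

def augment(sequence, inventory, n_variants=5, p_swap=0.1, p_drop=0.05, seed=0):
    """Generate noisy variants of a sign sequence by random
    substitution and deletion of individual signs."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        variant = []
        for sign in sequence:
            r = rng.random()
            if r < p_drop:
                continue                               # simulate a damaged/lost sign
            if r < p_drop + p_swap:
                variant.append(rng.choice(inventory))  # simulate a misread sign
            else:
                variant.append(sign)
        variants.append(variant)
    return variants

# Hypothetical sign IDs, loosely modeled on "AB"-style sign numbering.
inventory = ["AB01", "AB02", "AB03", "AB04", "AB08", "AB10"]
print(augment(["AB01", "AB08", "AB10", "AB02"], inventory))
```

Whether such synthetic variants actually help a decipherment model would, of course, have to be validated against held-out inscriptions.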
This article can serve as an introduction to some of the earliest European scripts and languages, and it can help shed light on the scripts and languages that came after them, such as the Illyrian language, which is still not well understood. In our future work we plan to expand our research to other undeciphered scripts and languages used geographically close to Crete and Cyprus (we will mainly focus on the Balkan Peninsula), but not necessarily in the same time period. We also plan to perform a linguistic typology analysis of the Bronze Age Aegean and Cypriot languages, alongside their possibly related languages, in order to obtain more information on possible connections between these scripts. Furthermore, we are already working on an artificial intelligence and machine learning oriented analysis of low-resource languages, and we plan to continue and expand our research in this increasingly important field.
Acknowledgments
This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 951732. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Germany, Bulgaria, Austria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, Greece, Hungary, Ireland, Italy, Lithuania, Latvia, Poland, Portugal, Romania, Slovenia, Spain, Sweden, United Kingdom, France, Netherlands, Belgium, Luxembourg, Slovakia, Norway, Switzerland, Turkey, Republic of North Macedonia, Iceland, and Montenegro. This work was also partly supported by the Ministry of Science and Education of the Republic of Croatia under the FESB VIF project “iEnv - Intelligent Observers in Environmental Protection.” The authors would like to thank Dr. Simon Castellan from the INRIA Rennes-Bretagne Atlantique research centre for his swift and informative replies to their questions about the SigLA database, and would also like to express their gratitude to the anonymous reviewers who helped improve this article.
Notes
More information on the Proto-Indo-European language can be found in Rau (2010).
Boustrophedon writing is defined in Merriam-Webster.com Dictionary (2023a) as “the writing of alternate lines in opposite directions (as from left to right and from right to left).” When given enough input data, this kind of writing is relatively simple to recognize computationally, as it uses systematic mirror imaging or flipping of individual glyphs; a minimal sketch of normalizing such text is given after these notes.
Low-resource languages are generally considered to be languages for which little data exists that can be used in natural language processing applications; high-resource languages are their exact opposites.
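To make the boustrophedon note above concrete, here is a minimal Python sketch that normalizes a boustrophedon inscription by reversing every other line so that all lines share one reading direction. The sign labels are hypothetical, and real inscriptions would additionally require un-mirroring the individual glyph images, which this sketch does not attempt.

```python
# Minimal sketch of normalizing boustrophedon text: alternate lines are
# reversed so every line reads in the same direction. Input is assumed
# to be one list of signs per physical line of the inscription.
def normalize_boustrophedon(lines, first_line_reversed=False):
    """Reverse every other line so all lines share one reading order."""
    normalized = []
    for i, line in enumerate(lines):
        reverse = (i % 2 == 1) != first_line_reversed
        normalized.append(list(reversed(line)) if reverse else list(line))
    return normalized

# Hypothetical example with placeholder sign labels.
inscription = [["s1", "s2", "s3"],
               ["s6", "s5", "s4"],   # written right-to-left on the object
               ["s7", "s8", "s9"]]
print(normalize_boustrophedon(inscription))
# -> [['s1', 's2', 's3'], ['s4', 's5', 's6'], ['s7', 's8', 's9']]
```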
References
Author notes
Action Editor: Xuanjing Huang