A Computational Approach to Scribal Practice

The study of the construction of social meaning in ancient Maya communities of Mesoamerica poses a variety of methodological problems in historical sociolinguistics, because it relies on written records produced with a writing system that itself exhibits variation. While variation in writing systems has previously been studied in terms of diachronic shifts and dialectal variation, systematic approaches remain elusive. This paper explores new avenues for the automatic extraction of useful sociolinguistic features from written corpora using Machine Learning algorithms. We show that these features can help illuminate the contribution of pragmatic choices in the selection of graphemes to stylistic practices that are key in the construction of Mayan scribal communities of practice.


Introduction
The analysis of the construction of social meaning in ancient Maya communities of Mesoamerica rests on understanding the social meaning of material signs, discourses, and actions, which have to be adequately situated in their context in the physical world (Geertz 1973; Scollon and Scollon 2003; Houston 2004). The resulting reliance on written records poses a variety of methodological problems in the historical sociolinguistics of Mesoamerican languages, compounded by the difficulty of accessing traditional sociolinguistic variables. Only in a few cases is it possible to determine the approximate age, sex, and status of a writer. In addition, the writing system used to encode language exhibits a complex relation with variation in the host language due to an array of sociolinguistic factors. Although sociolinguists have been interested in the interaction between writing and language since the early work of Weinreich (1953), this relation has been studied mainly in terms of diachronic change and dialectal variation (Weinreich 1953; Labov 1972a, inter alia). In this paper, we explore the contribution of pragmatic choices in the selection of graphemes to stylistic practices that are key in the construction of scribal communities of practice (Justeson 1978). In particular, we follow up on recent studies addressing change and variation in both the encoded languages and their writing systems in Mesoamerica (e.g. Lacadena and Wichmann 2002) by means of a systematic analysis of a proxy corpus of Classic Maya inscriptions.

Sociolinguistic Variation in Classic Maya Writing
Language variation and change were reflected in the Classic Maya writing system, which is therefore an important source of linguistic information for the reconstruction of ancestral varieties of several modern Mayan languages (Lacadena and Wichmann 2002). However, a systematic sociolinguistic analysis of variation and change as observed in the writing system is still to be developed, essentially for two reasons. First and foremost, comprehensive, large corpora of Classic Maya are unavailable, and appropriate tools for analysis are generally lacking even for small corpora. Second, the partially deciphered writing system maintains a complex relation with the languages represented in the texts, as seen above. In our opinion, the complex relation between the Classic Maya writing system and the hosted languages can be better addressed by using the methodological approaches traditionally used in sociolinguistics (Labov 1972b; Shibamoto Smith and Schmidt 1996, inter alia). As such, variation and change in both the languages and the writing system can be productively analyzed in sociolinguistic terms by introducing relevant linguistic and social features (Labov 1972b).
The emerging picture during the Classic Period is that of a diverse linguistic area in which several vernaculars are in contact and percolate to different degrees into the high-prestige language of the inscriptions (Houston et al. 2000; Lacadena and Wichmann 2002). For most of the corpus, the determination of social features is difficult. Access to traditional social variables such as the age, sex, and status of a writer is partial in most cases. However, it is also possible to consider most of the additional types of congruence introduced by Weinreich as archetypical of languages in contact (Weinreich 1953). Therefore, variables such as geographic area, cultural or ethnic group, religion, occupation, and rural vs. urban population, for which access is less of a problem, emerge as important candidates for understanding the dynamics between linguistic varieties.
The bilingual individuals constituting the locus of study in language contact (Weinreich 1953) are to be found in our setting as bilingual scribes who form part of a scribal community of practice, the workshop, that enforces normativity through the establishment of grammatical rules and sets of prescribed scribal practices. A central question to address is, then, which features, linguistic and graphemic, enable the identification of communities of scribal practice. The Classic Maya corpus shows a wealth of information about sociopolitical actors and their interactions (Martin and Grube 2000), resulting in dense data that enables social network analysis. Once these communities are determined, it could be possible to contrast the sociolinguistic differences between communities of practice (Milroy and Margrain 1980), and to examine the relation between these communities and the sociopolitical landscape.
The mechanisms and structural causes of transfer at the graphemic level are to be traced back to linguistic variation and change. The following sections discuss, following the layout established by Weinreich (1953), some of the preliminary findings in the literature and also found during the early stages of corpus construction regarding linguistic features that index variation and change. Then, new features at the graphemic level are introduced and analyzed in a synchronic case study.

Diachronic variation
Language change along the diachronic axis (Labov 1972a) is reflected at the phonological, morphological, and semantic levels. By the end of the Classic Period, around 900 A.D., Classic Ch'olan experienced a series of phonological changes that could have happened in a short period of time (cf. Trudgill 2002). Long vowels shortened and glottalized vowels disappeared (Lacadena and Wichmann 2004). The distinction between velar and glottal fricatives, once predicted by Norman (1984 [1978]) in their initial reconstruction of Proto-Cholan and found in early Classic Cholan by Grube (2004), also vanished by the end of the Classic Period. Houston et al. (2000) have documented a shift of -h -… -aj from intransitive positional marker to passive, with the old passive marker -V1y becoming a marker of the mediopassive. After that process, a positional marker -wan is introduced as an innovation, probably after a process of percolation from the vernacular Ch'olan of the Tabasco region into the high-prestige variety. Lexical change is more difficult to track without the availability of computational methods, but some examples can be recognized: early logograms, probably borrowed from nearby written traditions such as the Epi-Olmec, changed their lexical values while presumably keeping the same semantic reference. This seems to be the case of the word for sun or sun god, initially rendered using the term JAMA, borrowed from Mixe-Sokean languages and used only in the early stages of Maya writing. Later, the same logogram carries the Mayan word K'IN, sun or sun god. Semantic changes are also difficult to trace. A possible example involves the adjective k'uh, which is originally taken to mean sacred but shifts in the semantic space towards venerable by the time of the Spanish contact.

Synchronic variation
Synchronic variation in Mayan languages during different stages of the Classic Period points towards the existence of a plurality of spoken vernaculars that sometimes percolated into the written high-prestige variety. Although this follows the paradigmatic situation of diglossia introduced by Ferguson (1959), several questions remain to be answered about how the interaction between the different vernaculars manifests itself in the contact situation, and about their interaction with the high-prestige variety. Phonemic variation shows distinct phonological systems in southern (Ch'olan) and northern (Yucatecan) varieties. Among the numerous examples we can mention the spelling of the number 4 in a series of texts from Ek B'alam, in the north of the Yucatan peninsula, which uses the Yucatecan phonology as ka-na, kan, instead of the attested Ch'olan version chan registered in southern sites. The abstractive suffix -il appears to derive abstract nouns from concrete nouns, such as 'ajawil, 'kingdom', from 'ajaw, 'king', in most of the Southern Maya Lowlands. However, the same suffix is replaced by the abstractive -lel in the Western Maya Lowlands, home to modern-day Ch'ol and Chontal, which retain reflexes of that suffix (Lacadena and Wichmann 2002). In the Northern Maya Lowlands of Yucatan, the attested abstractive is -lil, which is the ancestor of the abstractives found in Yucatecan languages. Lexical variation seems to be less common, but there are some unequivocal cases such as the word for month, winik in Western Lowland Mayan and winal in Eastern varieties.

A Proxy Corpus for the Analysis of Mayan Scribal Practice
The analysis of scribal practices requires the construction of appropriate corpora of Classic Mayan texts (Cases et al. 2014). This process involves designing the corpus, collecting the data, encoding it in a machine-readable format, and properly assembling and storing the relevant metadata, including linguistic and epigraphic annotation (McEnery and Hardie 2012).
The question of what exactly constitutes an appropriate corpus is important, and it has different answers depending on the overall objectives of the research. At the bare minimum, a well-designed research corpus should avoid confirmation bias, i.e. a design that favors the interpreter's initial hypothesis or beliefs (McEnery and Hardie 2012: 14). Confirmation bias can be avoided using the principle of total accountability, which states that the researcher must not select a favorable subset of the data.
The long term objective is the construction of a monitor corpus of Maya writing, including all the glyphic texts from the Classic and Post-Classic periods, as well as Colonial and Post-Colonial alphabetic texts. The medium term objective, and the aim in this paper, has been the creation of a balanced corpus of Classic Maya texts with specific attention to diachronic balance, genre and provenience. The short term objective was the development of an opportunistic corpus of manageable size, but large enough to obtain relevant data for a given area. The area selected was the Western Maya Lowlands, with texts from the sites of Palenque (PAL) and Comalcalco, Tabasco (CML), covering a small variety of genres over a span of 200 years in the Late Classic Period. The results in the following sections have been obtained using this proxy corpus.
In order to illustrate the procedure used in the construction of the corpus, we will use an example text from Comalcalco. In a remarkable discovery in 1998, archaeologist Ricardo Armijo-Torres found a sealed urn in the platform between Temples II and II-A, which are situated in the Main Plaza. The archaeological contextualization shows that the urn contained the burial of a male individual, accompanied by 74 beads of jade, 52 shark's teeth, a series of ornamental shells, prismatic obsidians, eccentric flints, specular hematite counters, seven stingray needles, sixteen stingray spines carved with hieroglyphs, and 82 small pendants, 36 of which were carved with glyphs (Armijo 1999; Armijo, Gallegos y Zender 2000; Armijo, Zender y Gallegos 2000). Other organic remains in the interior could have been due to the presence of leather, opening the possibility that the material functioned as a bag for the rest of the artifacts (Armijo, personal communication to Cases, 2006).
One of the texts from this urn is the stingray spine number 3, technically referred to as CML Urna 26 Spine 3. The epigraphic rendition of the text from the Spine 3 is shown in Figure 1. An epigraphic contextualization includes the transcription, transliteration, and morphological segmentation as follows:
These components constitute a wireframing for the first set of contextualizations at the archaeological and epigraphical level, necessary to perform historical sociolinguistic analysis, and serve to illustrate the procedure used in the corpus construction.

Identifying Scribal Practices
In Classic Maya writing, graphemic types are usually combined in glyphic blocks following a series of internal rules that restrict the permissible combinations, resulting in observable patterns in the written texts. The analysis of these patterns, which could be termed graphotactics, has been an object of study since the beginning of the decipherment of the writing system. Most of the early work focused primarily on the arrangement of signs inside glyphic blocks (e.g., Thompson 1950, 1962; Kelley 1976; Justeson 1978; Grube 1990; Lacadena 1995) and graphemic chains, resulting in spelling rules (Kelley 1976; Justeson 1978; Bricker 1986; Grube 1990; Houston et al. 1998; Kaufman with Justeson 2003; Lacadena and Wichmann 2004). The constraints inside a graphemic chain or block will be referred to as short range graphotactics. Medium range graphotactics would consider graphemic restrictions inside a given text. This section briefly considers the evaluation of medium range statistical graphotactics, i.e., the analysis of possible constraints in graphemic selection as they result from statistical data, which can eventually serve as graphemic features for a sociolinguistic analysis. The selectional restriction of graphemic types in short range graphotactics results from the scribal practice of applying a series of spelling rules with the aim of representing an utterance of the underlying language, and a series of compositional rules constraining their graphic arrangement. These constraints have been studied by Justeson (1978). The rules were probably learned as part of scribal training in the high-prestige variety inside the community of practice (Ferguson 1959).
A question to be raised is the size of the range, measured for example in number of graphemes, over which these rules apply. The range of action of spelling rules seems to be limited to the extent of graphemic chains, whether they create constraints in the nucleus or suffix domains. Figure 2 shows the distribution of the number of graphemes per graphemic chain, together with a fit using Epanechnikov's kernel (the dotted red line). The distribution is centered on three graphemes per graphemic chain, which could belong to any type of sign: in this corpus the glyphic blocks are most frequently composed of three graphemes, followed by blocks composed of four graphemes, and so on. However, useful as this can be as an initial description of the text, this type of distribution merges all types of graphemes for all the graphemic chains and is mostly a representation of the short-range selection of graphemic types; a selectional restriction on types like this can only serve as a simple metric for downstream tasks. The information can be analyzed further by taking into account the sign type of known graphemes with reading value, i.e. either logographic or syllabic. Figure 3 represents the distribution of syllabic versus logographic signs for all the texts in the corpus, where the radius of each circle is proportional to the relative frequency. It can be seen that combinations with one logogram and two phonograms are the most common, followed by graphemic chains with three syllabic signs and no logograms, observations compatible with the distribution mentioned before. More interesting observations appear when the information is plotted taking into account parameters like provenience, authorship, or genre.
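As a minimal sketch of the kind of smoothing shown in Figure 2, the following Python fragment computes a histogram of chain lengths and a kernel density estimate with the Epanechnikov kernel. The chain lengths, the bandwidth, and the function names are illustrative assumptions, not values or code from the corpus.

```python
from collections import Counter

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(samples, x, bandwidth):
    """Kernel density estimate at x from a list of numeric samples."""
    n = len(samples)
    return sum(epanechnikov((x - s) / bandwidth) for s in samples) / (n * bandwidth)

# Hypothetical chain lengths (graphemes per graphemic chain), for illustration only.
chain_lengths = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5]

histogram = Counter(chain_lengths)            # raw counts per chain length
grid = [i / 2 for i in range(0, 14)]          # evaluation points 0.0 .. 6.5
density = [kde(chain_lengths, x, bandwidth=1.5) for x in grid]
mode = grid[density.index(max(density))]      # location of the smoothed peak
```

With these made-up lengths the smoothed peak falls on three graphemes per chain; the bandwidth controls how much neighbouring lengths contribute to each estimate and would need tuning on real data.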
Figures 4, 5 and 6 portray the frequencies of combinations in the syllabic-logographic plane for the texts produced by the scribal workshops of Kan B'ahlam, K'an Joy Chitam, and Ahku'l Mo' Naahb' (K'uk' B'ahlam in Palenque and Aj Pakal Tahn from Comalcalco are not shown). In the case of Kan B'ahlam, the texts are arranged adjacent to each other according to artifact, which is ultimately closely related to both size and genre. It is notable that the plots for the panels from the Temple of the Inscriptions show similar patterns, with high frequencies of three-phonogram graphemic chains and of chains with one logogram and two phonograms. The main texts from the Group of the Cross are also remarkably similar. However, these representations depend on very short-range properties of the text, and the graphs are not easily comparable. This is one of the main reasons to look for medium range quantities in the next section. In medium range graphotactics, the derived measurements result from quantities that are averaged across the full length of the text. Therefore, in contrast to short range ones, they provide better estimates for style or authorship detection. Nonetheless, it is the combination of the information obtained from both ranges that ultimately helps to identify scribal practices.
Figure 3. Distribution of syllabic versus logographic signs for all the texts in the corpus, where the radius of each circle is proportional to the relative frequency. The bubbles with larger radius correspond to graphemic chains composed of one logogram and two phonograms (1 on the logogram axis, 2 on the phonogram axis). Graphemic chains with three syllabic signs and no logograms are next in terms of frequency (0 logograms and 3 syllabic signs).
Figure 4. Distribution of graphemic chains in the syllabic-logographic plane (syllabic versus logographic graphemes) for the texts produced by the scribal workshops of K'inich Kan B'ahlam from Palenque. The texts are arranged adjacent to each other according to textual artifact, which is closely related to both text size and literary genre. Note how graphemic chains consisting of one logogram and one or two phonograms have higher frequencies.
Figure 5. Distribution of graphemic chains in the syllabic-logographic plane (syllabic versus logographic graphemes) for the texts of K'inich K'an Joy Chitam from Palenque, arranged by artifact kind. The text distributions from this workshop are similar to the distributions from K'inich Kan B'ahlam, suggesting some sort of continuity in the scribal practices.
Figure 6. Distribution of graphemic chains in the syllabic-logographic plane for the texts of K'inich Ahku'l Mo' Naahb' from Palenque, arranged by artifact kind. The text distributions from this workshop, especially the remarkable Temple XIX platform, have higher weights in the lower parts of the diagrams. This suggests a change in scribal practice that favored a more logographic writing, in clear departure from the previous tradition. This change in practice could be correlated with more profound changes brought about by K'inich Ahku'l Mo' Naahb' (Stuart 2005).
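The syllabic-logographic plane used in Figures 3 to 6 can be tabulated along the following lines. This is only a sketch: the encoding of each graphemic chain as a list of "L" (logogram), "S" (syllabic sign) and "?" (unknown) labels is an assumption made for illustration, and the chains below are invented.

```python
from collections import Counter

# Hypothetical encoding: each graphemic chain is a list of sign types,
# "L" for logogram, "S" for syllabic sign (phonogram), "?" for unknown.
chains = [
    ["L", "S", "S"],
    ["S", "S", "S"],
    ["L", "S", "S"],
    ["L", "S"],
    ["L", "?", "S"],
]

def plane_coordinates(chain):
    """Return (logograms, phonograms) for one chain, ignoring unknown signs."""
    return chain.count("L"), chain.count("S")

# Number of chains falling in each cell of the plane.
counts = Counter(plane_coordinates(c) for c in chains)
total = sum(counts.values())

# Relative frequency per cell; in a bubble plot this would set the circle radius.
relative_frequency = {cell: n / total for cell, n in counts.items()}
```

Grouping the chains by workshop or artifact before counting would give one such table per panel, which is how per-scribe plots of this kind could be compared.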

Graphemic Signatures
It is now useful to introduce a quantity that will be termed morphographicity, referring to the ratios of logograms and phonograms over tokens calculated across the full length of a given text. As will be shown, morphographicity can help to estimate the amount of phoneticism that, deliberately or not, the scribe chose in his practice of writing.
At a given point t in the text, where t indicates the number of tokens, if γ_l(t) is the number of logograms, γ_s(t) the number of phonograms, and γ_r(t) the number of other graphemes, these quantities satisfy the trivial sum
γ_l(t) + γ_s(t) + γ_r(t) = t.
In relative terms, the relative amounts of logograms and phonograms with respect to the total number of signs up to that point can be defined as
M_l(t) = γ_l(t) / t,   M_s(t) = γ_s(t) / t.
These quantities would sum to 1 if all the signs were known, but this is not always the case due to eroded or unknown graphemes. The text T with N_T tokens is therefore partitioned as
N_T = γ_l(N_T) + γ_s(N_T) + γ_r(N_T).
Figure 7 includes the morphographicity diagram for Aj Pakal Tahn's CML Urn 26 Spine 3. The text starts with a number of logograms higher than the number of phonetic signs, a result expected considering that the initial clauses are dates, and these are usually represented with logograms for numbers and day names. In order to analyze morphographicity, it is convenient to represent each text in the logographic-syllabic plane, where the texts have coordinates (M_s, M_l), as in Figure 8. In this morphographicity plane, those texts for which all the graphemes are known lie on the line depicted in black. Depending on the number of eroded graphemes (a physical parameter) or of graphemes with unknown reading value (an observer interference), texts will deviate from this line of perfect accessibility into a region of imperfect accessibility. In other words, from a graphemic perspective, researchers have more knowledge about the texts close to the line than about those in the region beneath it. Fully phonetic texts will have coordinates (1,0), and fully logographic texts (0,1). Thus, it is interesting to note that K8885, a conch shell of unknown provenance, lies close to the fully phonetic point (1,0) in this plane. An example of a fully logographic text comes from an early pendant from Kaminaljuyu, where the scribe chose to write a text with no morphological marking.
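The running ratios M_l(t) and M_s(t) described above can be sketched in a few lines of Python. The sign-type labels and the sample sequence are hypothetical; only the definitions of the ratios are taken from the text.

```python
def morphographicity(sign_types):
    """Running ratios (M_l(t), M_s(t)) for t = 1..N_T.

    sign_types: sequence of "L" (logogram), "S" (phonogram) or
    "?" (eroded or unknown grapheme), one label per token.
    """
    n_l = n_s = 0
    curve = []
    for t, kind in enumerate(sign_types, start=1):
        if kind == "L":
            n_l += 1
        elif kind == "S":
            n_s += 1
        curve.append((n_l / t, n_s / t))
    return curve

# Invented text: a logographic opening (as in a date) followed by phonograms.
text = ["L", "L", "L", "S", "?", "S", "S", "S"]
curve = morphographicity(text)
m_l, m_s = curve[-1]   # final values, i.e. the text's position in the plane
```

Because one token is unknown, the final ratios add up to less than 1, which is what places such a text below the line of perfect accessibility.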
Most texts fall in the middle region, with a notable gap between the fully logographic point and the populated area. The reason for this lies in the complexity of constructing a narrative with a fully logographic text, in which inflectional and derivational markers cannot be rendered in most cases.
Figure 7. Distribution of graphemic types (logograms and phonograms) versus token position for Aj Pakal Tahn's Spine 3 (Comalcalco). Green lines indicate the relative amount of phonograms M_s(t) at a given position t inside the text, while blue lines represent the relative amount of logograms M_l(t) for that position. This representation makes explicit the overall ratio of types: the text starts with more logograms than syllabic signs. The crossing of the lines indicates the point at which the number of phonograms starts to exceed the number of logograms.
For the corpus at hand, two main clusters can be observed in this plane. A color plot with symbols for each author helps to identify that one cluster is formed by Aj Pakal Tahn's texts and one more text from Comalcalco, around the point (60,30), while Palenque's texts cluster around (40,50), with coordinates scaled up to 100 (Figure 8).
In order to better discern the clusters, it is possible to project the texts along the line orthogonal to the perfect accessibility line that passes through each point. This is equivalent to performing a linear extrapolation of the coordinates M_l and M_s that a given text would have if it were perfectly accessible. After the projection, the texts on the perfect accessibility line form a linear distribution (see footnote 3).
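A minimal sketch of this projection, assuming coordinates (M_s, M_l) on a 0 to 1 scale and the perfect accessibility line M_s + M_l = 1, is given below. The function name and the sample coordinates are hypothetical.

```python
def project_to_accessibility_line(m_s, m_l):
    """Orthogonal projection of (M_s, M_l) onto the line M_s + M_l = 1.

    The displacement is along the direction (1, 1), which is normal to the
    line, so the missing share is split equally between both coordinates.
    """
    missing = 1.0 - m_s - m_l          # share of eroded or unknown signs
    return m_s + missing / 2.0, m_l + missing / 2.0

# Invented text with 12.5% of its graphemes unreadable.
s, l = project_to_accessibility_line(0.5, 0.375)
```

After the projection the two coordinates sum to 1, so a single number (for instance the projected syllabicity s) is enough to place each text along the line.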
The density of this distribution has two main modes, corresponding to the two main clusters of Palenque and Comalcalco scribes, and a smaller third one corresponding to K8885 at the extreme right. Setting this last case aside, an unsupervised Machine Learning model based on a multimodal mixture of Gaussian distributions provides the parameters for the former two distributions. These parameters fully characterize the distribution of the morphographicity of the texts.
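The text does not specify how the mixture was fitted. As one possible illustration, the sketch below fits a two-component, one-dimensional Gaussian mixture with a plain expectation-maximization (EM) loop. The choice of EM, the initialization, and the syllabicity values are all assumptions; the data are synthetic and are not taken from the corpus.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_gaussians(xs, iterations=200):
    """EM for a two-component 1-D Gaussian mixture; returns (weights, means, sds)."""
    xs = sorted(xs)
    half = len(xs) // 2
    # Initialize each component on one half of the sorted data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sds = [0.1, 0.1]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsing component
    return weights, means, sds

# Synthetic projected syllabicity values (not corpus data), two loose groups.
syllabicity = [0.40, 0.44, 0.47, 0.50, 0.53, 0.62, 0.66, 0.69, 0.72, 0.76]
weights, means, sds = fit_two_gaussians(syllabicity)
```

The fitted means and standard deviations play the role of the parameters discussed in the text; an outlying text such as K8885 would be excluded before fitting, as described above.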
The parameters show that the distribution model for Aj Pakal Tahn is centered on a syllabicity of around 69%, and therefore a logographicity of 31%, with a deviation of only 7.5%. The Palenque scribes are centered on a syllabicity of around 47% and a logographicity of 53%, also with a deviation of 7.5%. Insofar as these distributions model morphographicity, the values of their parameters are determined by a combination of factors including linguistic features (prominently genre and topic) and graphemic features, including functional constraints, graphemic style and authorship, or in other words, scribal practice. Consequently, these distributions are considered here as the scribal graphemic signature, with the understanding that they represent a mixture of factors that results in a series of observable patterns at the graphemic level. A representation of the fit appears in Figure 9.
3 Essentially, this is achieved by an affine transformation composed of a clockwise rotation of 45 degrees centered on the origin and a scaling to the range [0,1].

Conclusions and Future Work
The case study of Comalcalco and Palenque shows the possible application of graphemic features to sociolinguistic analysis. It is important to note that only two distributions have been considered in this analysis: it is possible, and desirable, to extend the analysis to the intra-site discrimination of scribal traditions, to inter-site analysis, and to the diachronic evolution of the graphemic signatures. In the general framework, the next step involves the analysis of the interaction between the scribal practices detected at the graphemic level and linguistic variation and change as presented previously, according to Weinreich's types. Once the dynamics of this interaction are established, we will be able to link variation between communities of scribal and linguistic practice to sociopolitical networks.
Figure 9. Distribution of texts in the morphographicity plane scaled up to 100 (the horizontal axis is syllabicity, the vertical axis is logographicity). Each point represents the final values of the graphemic type distributions for each text (cf. Figures 7 and 8), with coordinates (M_s, M_l). A text for which all the glyphs can be read (or at least for which the type of every grapheme is known) would be represented by a point lying on the line drawn in the top right section of this plane. Texts with a number of missing graphemes will lie anywhere in the area between the upper bound represented by that line and the origin. A text with a relatively large number of logograms will be closer to the upper left part, while a text with a relatively large number of phonograms will be closer to the lower right part. There is a gradient in the use of graphemic types that can be associated with different scribal practices. Ellipses show the unsupervised clustering of texts that generates the notion of graphemic signature.