A Computational Analysis of the Maltese Broken Plural

Alex Farrugia
Department of Artificial Intelligence, University of Malta
Email: alexfar@gmail.com

Michael Rosner
Department of Artificial Intelligence, University of Malta
Email: mike.rosner@um.edu.mt

Abstract: The Maltese broken plurals have always been treated as a mechanism totally lacking in rules or structure; the traditional view has been that there is simply no relation between the singular and broken plural forms. Tamara Schembri [1], in her B.A. thesis, argues that this view is not entirely correct and offers evidence for regularities governing the transformations between the respective forms. In this paper we describe an attempt to examine the computational implications of Schembri's work for the classification of singular nouns and the generation of the corresponding plural forms. The solution adopted is based on artificial neural networks and involves a pattern associator flanked on both sides by an encoding and a decoding unit. We experimented with various encoding schemes and parameter settings, and after optimisation the network produces results that correlate closely with Schembri's theory. The results indicate that, although we are still far off from creating a complete computational model of the broken plural, the classification of nouns in their singular form is not impossible and can be achieved with a relatively high degree of accuracy using machine learning techniques.

Index Terms: computational morphology, artificial neural networks, broken plural

1. Introduction

Within Semitic morphology, there are two types of noun and adjective plural forms: sound (regular) plurals and broken (irregular) plurals. Sound plurals are formed by systematically adding a suffix, e.g. omm, ommijiet (mother, mothers), or by changing a final vowel, e.g. karozza, karozzi (car, cars). The relation between singular and sound plural forms is regular. Broken plurals, on the other hand, involve a more diverse set of morphological transformations, including internal vowel changes, e.g. tifel, tfal (boy, boys) and qamar, qmura (moon, moons), redoubling of consonants, e.g. ġdid, ġodda (new), and combinations thereof. Moreover, the relation between the forms is apparently not systematic. This is clearly displayed by simple examples such as ratal, rtal (a unit of weight) and ħawt, ħwat (trough, troughs), which share the same plural pattern but have different singular patterns.

This view is substantiated by the arguments presented in other texts. "Dwar il-plural miksur m'hemmx regoli" ("There are no rules regarding the broken plural") is the view expressed in the book Ir-Regoli tal-Kitba tal-Malti (1998) concerning the broken plurals in Maltese. Sutcliffe [2] also points out that it is impossible to induce any rules for their structure. Mons. L. Cachia [3] follows suit, saying "Il-plurali miksura huma varji ħafna u ma nistgħux nagħtu regoli dwarhom" ("The broken plurals display a significant degree of variety and we cannot infer any rules from them"). Cremona [4] further strengthens the point: "Għalhekk fil-Malti l-Plural Miksur bħala regula ma jistax jinbena fuq il-għamla tas-Singular" ("Therefore, as a rule, the Maltese broken plural cannot be constructed on the basis of the singular form"). This pessimistic outlook on inducing rules for the plural miksur has become the accepted view: Borg and Azzopardi-Alexander's authoritative work on Maltese [5] states that there is no particular connection between the singular form and the plural pattern, and since Sutcliffe's observations the position has hardly been challenged at all.

Nevertheless, Tamara Schembri [1] has recently offered some solid evidence against this position. Although she did not solve the problem of generating the correct plural from the corresponding singular form, she does provide a measure of generalisation by dividing the set of all forms into eleven classes and showing that only some transformations between subclasses are possible. She concludes that the broken plural formation process is not entirely irregular. The existence of this partial evidence for a more systematic account of the broken plural is interesting from a computational perspective and is the underlying raison d'être of the present project.

This paper continues as follows. Section 2 describes the project's aims and objectives. Section 3 discusses the main design issues, followed, in Section 4, by an account of the implementation. The remainder of the paper concerns results and conclusions (Sections 5 and 6 respectively). The work reported here was carried out by the first author as a final year project for the B.Sc. IT (Hons) degree at the University of Malta; some of the material in the paper is drawn verbatim from the project's final report [6].

2. Aims and Objectives

The main aim of the project is to incorporate the linguistic regularities hinted at by Schembri into a computer program which actually computes the broken plural. Normally, when designing such programs, we already have in mind a set of rules describing the regularities, and we then go about encoding those rules in a way understandable to the computer. Under the present circumstances no such rules are available; instead, we have a somewhat fragmentary, partial, and semi-technical theory of the phenomena at hand. Our first objective was therefore to decide on a computational model for dealing with this situation. Not surprisingly, we looked towards a data-driven solution based on machine learning, as described further in the next section.
All machine learning systems require an encoding of the problem. It turns out that the representation chosen can radically affect the difficulty, and hence, given limited resources, the quality of the solution. To illustrate this point, suppose you have to perform a long division using only Roman numerals. Obviously, the solution procedure is much harder to express than when using Arabic numerals: Arabic numerals are a "better" representation than Roman ones for this problem. In the case at hand, we are looking for an optimal representation for encoding the linguistic data. A second objective was therefore to investigate which kind of representation works best for learning a solution from the data. The main hypothesis we wanted to test was that a linguistically-based representation, i.e. a representation based on linguistic features of the word, produces better results than a non-linguistically-based encoding. We tested this assertion by pitting two different encoding methods based on linguistic features against a non-linguistic encoding method. Apart from the linguistic aspect, we also experimented within the area of artificial neural networks, working with different network architectures and varying different parameters in order to investigate their effects on the performance of the system.

3. Design Issues

We considered a number of different computational issues before adopting the final design. First of all, we required an underlying computational model for carrying out mappings between singular and plural forms. Finite state transducers are well known, mathematically elegant and computationally sufficient for this job. Moreover, they have been successfully applied to Semitic morphology (cf. Kiraz [7]). However, the power of related formalisms, such as Beesley and Karttunen's xfst [8], lies in their ability to express complex algebraic operations like composition over transducers, not in their ability to express generalised transformations.
So in one sense the xfst formalism is too powerful: we do not need to express such complex operations. In another sense it is too weak: being finite, every form has to be either explicitly present in the lexicon or generated dynamically by applying a replace rule to a base form. Generalised transformations involving variables, for example, cannot be expressed. We eventually decided on a machine-learning approach which would attempt to learn the regularities present in the original dataset provided by Schembri. This dataset comprises a few hundred examples of words together with their (broken) plurals, and a classification scheme for the plural forms.

The basic design adopted was similar to that used by Rumelhart and McClelland [9] and close in detail to that of Plunkett and Nakisa [10]. The model used to generate the plural forms of singular nouns or adjectives is shown in Figure 1. Words are input to the Encoding Layer, which converts each word to its phonetic representation and then transforms this into a binary feature string. This string is passed to the input layer of the network, which feeds the inputs forward into the computational units of the hidden and output layers. The resultant binary string is then passed from the Output Layer to the Decoding Unit, which transforms it into a grapheme string. (Figure 1. Architecture of the network used in the generation process.)

The networks created for categorisation purposes have identical encoding and input layers (refer to Figure 2). However, the architecture then proceeds in a cascading fashion, following the first hidden layer with two smaller hidden layers and an output layer containing one output node for each possible category provided by Schembri's theory. (Figure 2. Architecture of the network used in the categorisation process.)

Both types of network made use of a template, similar to that used by Plunkett and Nakisa [10], for aligning words in the input layer.
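To make the generation pipeline concrete, the following sketch shows a pattern associator of the kind described above: a binary feature string fed forward through one hidden layer and thresholded back to a binary string. It is a hypothetical illustration only, with untrained random weights; the dimensions (21 template slots, 23 features per slot, hidden layer twice the input size) are borrowed from figures quoted elsewhere in this paper, not from the project's actual code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Illustrative dimensions: 21 template slots x 23 binary features per slot.
n_in = 21 * 23
n_hidden = 2 * n_in        # hidden layer twice the input size (cf. Section 5.2)
n_out = n_in

# Random weights stand in for a network trained with back-propagation.
W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))

def feed_forward(features):
    """Map a binary feature string to a binary output string."""
    hidden = sigmoid(W1 @ features)
    output = sigmoid(W2 @ hidden)
    return (output > 0.5).astype(int)   # threshold activations back to bits

x = rng.integers(0, 2, n_in)            # a made-up encoded singular form
y = feed_forward(x)
print(y.shape)
```

In the real system the input bits come from the encoding unit and the output bits are handed to the decoding unit for conversion back to graphemes; here both ends are left abstract.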
The template consisted of an alternating series of vowels and consonants (VCVCVCVCVCVCVCVCVCVCV). A word is fitted to the template simply by inserting an empty string at any position which does not correspond to the pattern; for example, 'kaxxa' produces the following output: '', 'k', 'a', 'x', '', 'x', 'a', '', '', '', '', '', '', '', '', '', ''. The template was useful because of the root-and-pattern morphology involved in the broken plurals, where in most forms the root consonants of the singular are retained in the plural. The same template was also used in the output layer of the generation networks.

4. Implementation

Three main methods of encoding were used.

4.1. Phonetic Features

The first, labelled the "Phonetic Features" approach, relies on a grapheme-to-phoneme conversion process which offers a very faithful rendition of the underlying phonetics of each word. The decoding process (phoneme-to-grapheme conversion) is less clear-cut, partly due to a lack of linguistic data. For this reason we decided to extend the set of Maltese phonemes handled in order to be able to fully reverse the encoding process. Examples of such cases include the presence of silent graphemes such as 'h' or 'għ', and words which are subject to the word-final devoicing rule (see Borg and Azzopardi-Alexander [5]). This approach made use of a phoneme set of 74 different symbols, most of which were geminates, slight variants, or allophones of particular phonemes. For the implementation we adapted MGPT, a grapheme-to-phoneme transcriber created by Sinclair Calleja [11], to include the extended alphabet required by this encoding method. We also created a component which made use of most of the rules of the MGPT module, together with some extra rules necessary to complete the reverse process. Each phoneme had a unique binary representation made up of 27 distinct linguistic features.

4.2.
Features Lite

The next encoding method, similar to the former, was dubbed "Features Lite". It takes a much more general approach to converting from graphemes to phonemes and vice versa, based on a 1:1 mapping of graphemes to phonemes which produces a much smaller set of unique phonemes. Consequently the feature set used was also different and slightly smaller, with a total of 23 distinct features needed to fully represent our alphabet. Refer to Figure 3 for a graphic explanation of this encoding process. (Figure 3. Features Lite Encoding Process.)

4.3. Grapheme Encoding

Finally, another encoding approach, the "Grapheme Encoding" method, was used. This method is based on Maltese orthography and hence not on any linguistic principles; it simply encodes a word, without considering the grapheme-to-phoneme process, by assigning an arbitrary binary string to each possible character. Six bits were sufficient to uniquely represent each grapheme. This method was designed specifically to test the hypothesis that an encoding method based on phonetics is more likely to produce good results than an arbitrary one.

The networks used the back-propagation algorithm (first described by Werbos [12] and subsequently adopted by the connectionist community) for the learning process. We also made use of parameters such as momentum and input gain in order to optimise performance.

5. Results

5.1. Cross-Validation

The main method used in testing both the generation and classification networks is known as cross-validation: the dataset is partitioned into a number of subsets, the system is trained on all of the subsets except one, which is then used to test the system, and this is repeated for each of the subsets.
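The cross-validation loop just described can be sketched as follows. This is a minimal illustration with toy train/test functions, not the project's actual evaluation code; the `train` and `test` callables are hypothetical placeholders.

```python
from typing import Callable, Sequence

def k_fold_cross_validation(data: Sequence, k: int,
                            train: Callable, test: Callable) -> float:
    """Partition `data` into k folds; train on k-1 folds and test on the
    held-out fold, rotating the held-out fold each round. Returns the
    mean test score over all k rounds."""
    folds = [data[i::k] for i in range(k)]   # k roughly equal subsets
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(test(model, held_out))
    return sum(scores) / k

# Toy usage: "training" memorises items; "testing" measures the fraction
# of held-out items that were unseen during training.
data = list(range(20))
score = k_fold_cross_validation(
    data, k=10,
    train=lambda items: set(items),
    test=lambda model, held: sum(x not in model for x in held) / len(held))
print(score)  # 1.0: every held-out item is disjoint from its training folds
```

The rotation guarantees that every example is used for testing exactly once, which is what makes the method attractive for a dataset of only a few hundred items.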
Ideally, the leave-one-out cross-validation method would have been used (this entails partitioning the data set into a number of subsets equal to the number of members in the set). However, on a system such as ours this would require a prohibitive amount of time, so we opted for the k-fold cross-validation method instead, in which the training set is divided into k equally sized subsets. We gave k a value of 10, a value chosen because it balances the small size of our dataset against the time needed to train the whole system reasonably well.

We also made use of an extra data set of nonsense words extracted from Schembri's project. Although the latter only spans the first four categories, it helped us test the system on entirely new forms.

Throughout the training process, an important assumption was made with regard to nouns having several different plural forms. Although any one of these forms was deemed correct as far as the network's output was concerned, if the network's output for such a noun did not exactly match any of its possible forms, we chose the first form (according to the class hierarchy) in order to train the system and calculate its error. We based this assumption on the fact that, statistically speaking, the most important forms are clearly the ones at the very top of the hierarchy; indeed, the classes appear to be labelled in descending order of frequency.

Throughout the course of the project, the feature sets used evolved slowly, either in response to new linguistic information we had uncovered or as experiments to assess how removing or adding a particular feature would affect the output. One of the main conclusions we can draw from the linguistically-based approaches is that certain phonetic defaults are much more important than previously assumed.
For example, most consonants are by default central, and vowels are by default both sonorant and voiced. We noticed that specifying these properties as defaults increased the performance of the system radically.

5.2. Generation Model Performance

We now report on the performance of the generation model. Initially we tested our model using a hidden layer containing twice the number of neurons in the input and output layers. This delivered relatively good generalisation results. However, further testing showed that increasing the number of neurons in this layer to three times that of the other layers increased the performance on novel forms by about 3%. We also noticed that increasing the number of neurons increases the amount of variance between forms and hence adds to the complexity of the function to be learnt, so the learning rate had to be decreased. A value of -0.03 was assigned, offering a good trade-off between the speed of convergence and the ability of the network to learn the whole set. Given much more time, the optimal value would probably be around -0.01; however, the number of connections in this type of network made training a very time-consuming process. Momentum, on the other hand, was set to 0.9 in order to compensate for the lower learning rate and speed up the process somewhat.

5.3. Performance of the Different Encoding Methods

We also tested the different encoding methods on separate networks. The "Grapheme Encoding" method managed to learn about 52% of the set after around 3000 epochs, whereas the "Phonetic Features" method did not manage to learn more than around 40% of the set with the settings used. We think that the former was restricted by its simplicity and lack of representational power, whereas the latter went to the other extreme, offering a representation too rich and specific for such a data set and such a problem.
The Features Lite method, on the other hand, offered a good balance, learning very close to all of the set (around 98%). Unfortunately, this model did not generalise to new forms as well as the categorisation model, producing an average of 26.6% correct generations. The results indicate that Schembri's analysis of the singular-to-plural correspondences is correct. Category B, although less productive than Category A, rendered a better performance. As seen with Category H, some of the forms were hardly significant statistically, yet they were still learnt reasonably well. We also noted that Category B type plurals were sometimes inflected from singular nouns or adjectives belonging to other forms.

The types of error generated suggest that the encoding method, coupled with the input and output layer template, did not provide the best way of representing and presenting words to the network. The fact that even the radicals were sometimes changed when mapping from the singular to the plural form indicates that their importance was possibly not emphasised strongly enough in our system. We are also unsure why the vowel patterns sometimes get confused. Yet, judging by the nonsense words and the responses generated by native speakers in Schembri's work, plural forms for novel, nonsense forms are produced by different people using different vowel patterns; one might say that our results emulate this sort of behaviour.

6. Conclusion

These results show that the Maltese broken plural is a genuinely complex system and that modelling it is easier said than done. However, we have also seen that most of the regularities present in a given data set can be learnt by a neural network, given a suitable encoding. We have also shown that categorisation of nouns in the singular form is not impossible and can be achieved with a relatively good degree of accuracy.
We hope, and are confident, that given some future improvements to the design reported in this paper, a system which performs even better at the categorisation process can be built. We also confirmed the hypothesis that a phonetically-based encoding can yield much better results than a non-linguistic one. This seems to confirm Rumelhart and McClelland's [9] belief that phonetics plays a rather important role in the learning and acquisition of a language. However, this conclusion is relative to the data at hand and is underspecified, since there are many potential phonetic representations; finding out which phonetic representation is best requires further investigation on larger data sets.

This confirmation involved building an improvement to Calleja's MGPT system [11], an existing grapheme-to-phoneme conversion component, which is not only efficient but includes a rule set that is accessible and easy for the non-programmer to understand. At the same time, we changed and adapted the rules in order to make the process reversible for phoneme-to-grapheme conversion.

In conclusion, the evidence suggests that a neural approach to Maltese morphological analysis is promising, at least for the case of broken plurals, in that it allows us to avoid explicitly programming the classification task in the form of rules. It did not, however, perform very well when generalising to unseen examples. The real questions to investigate next are how much of Maltese morphology can be covered in this way, and how a machine learning approach can be integrated with those parts of Maltese morphology that at first sight are better handled by means of explicit morphological rules composed by experts in linguistics.

References

[1] T. Schembri, "The broken plural in Maltese - an analysis of the Maltese broken plural," University of Malta, Tech. Rep., 2006, unpublished B.A. thesis.
[2] E. Sutcliffe, A Grammar of the Maltese Language. Progress Press, 1924.
[3] L.
Cachia, Regoli tal-Kitba tal-Malti. Valletta: Klabb Kotba Maltin, 1997.
[4] A. Cremona, Tagħlim fuq il-Kitba Maltija. Valletta: Union Press, 1975.
[5] A. Borg and M. Azzopardi-Alexander, Maltese. London: Routledge, 1997.
[6] A. Farrugia, "Maltimorph: A computational analysis of the Maltese broken plural," University of Malta, Tech. Rep., 2008, unpublished B.Sc. thesis.
[7] G. Kiraz, Computational Nonlinear Morphology, with Emphasis on Semitic Languages. Cambridge, UK: Cambridge University Press, 2001.
[8] K. R. Beesley and L. Karttunen, Finite State Morphology, ser. CSLI Studies in Computational Linguistics. Stanford: CSLI Publications, 2003. [Online]. Available: http://linguistlist.org/pubs/books/getbook.cfm?BookID=6754
[9] D. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press, 1986.
[10] K. Plunkett and R. C. Nakisa, "A connectionist model of the Arabic plural system," Language and Cognitive Processes, vol. 12, no. 5/6, pp. 807-836, 1997.
[11] S. Calleja, "Maltese speech recognition over mobile telephony," Master's thesis, University of Malta, 2004.
[12] P. J. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. New York, NY, USA: Wiley-Interscience, 1994.