The Spanish Collocation Tool and Its Application in Corpus-Based Study of Spanish for Teaching and Learning

This paper introduces a software program, the “Spanish Collocation Tool (SCT)”, and its application in corpus-based studies. The SCT was designed to assist with the research and analysis of Spanish collocations. It allows searches for collocated elements that are not limited to words but also include parts of speech and lemmas. Furthermore, it can compare two collocation lists to detect any significant differences between them. In this study, the SCT was combined with CEATE, a written corpus of Taiwanese L3 learners of Spanish, to give efficient and systematic access to results. Finally, the pedagogical implications of the search results for the development of on-line multimedia material for learning Spanish collocations are discussed.

According to Lee (2010), there is a gap between Spanish and English studies in the development of collocation tools and in corpus-based collocation research, in terms of both quantity and technique. Therefore, the purpose of this paper is to present a new software program called the "Spanish Collocation Tool (SCT)" as well as its application in related studies. This paper is organized as follows. Section 2 reviews previous studies on collocation tools and related research. Section 3 presents the design and development of the SCT. Section 4.1 applies the SCT to the study of Taiwanese L3 learners of Spanish as an applied example of linguistic analysis. The pedagogical implications of the SCT for on-line teaching material are discussed in Section 4.2, and Section 5 provides some final conclusions.

Literature review
This section reviews previous research, in both theory and practice, to describe the current state of the field and the main functions of existing collocation tools.
Earlier studies on collocation have explored statistical issues that occur in natural language processing. For example, Handl (2008) indicated that the structure of a collocation consists of various dimensions, including both semantic and lexical structures as well as statistics. Almela Sánchez (2006) mentioned that the extraction of collocations takes into consideration the relationship between a head and its collocated element regarding co-occurrence, mutual selection, window size, frequency, and statistical significance.
Computational technology can be used for developing new search functions intended to extract systematic and generalized information from large corpora, which could provide strong evidence instead of being dependent only on linguistic intuition (Fontenelle, 1994). Fontenelle (1994) also pointed out that possible collocated elements can be found by counting the words that appear before or after a head. If two words co-occur more frequently than expected, they can be considered a significant combination, or a collocation. The procedure included analyzing frequency, the window size of collocated elements, parts of speech, and syntax. Lin (1998) also analyzed combinations consisting of a head, its collocated element, and modifiers in a 100-million-word text to find the probability of their co-occurrence.
With regard to statistical methods, Butler (1998) indicated three ways to measure Spanish collocations: observed or expected values, t-values, and types or tokens. In addition, McCarthy et al. (2003) suggested three statistical methods: chi-square, mutual information, and the log-likelihood ratio (LLR). A widely cited study by Manning & Schütze (1999) indicated that the χ2 (chi-square) test could be used to overcome a shortcoming of the t-test, which assumes that the population is normally distributed. The χ2 test can be used to examine whether the co-occurrence of two collocated words is within the confidence level. If not, the words are not considered a collocation. On the other hand, it was found that using mutual information (MI) could help in determining frequency independence. However, some errors might occur with respect to the confidence level, because low-frequency words were shown to have higher scores than high-frequency words. Fontenelle (1994) also argued that computational programs that rely on complicated statistical methods to identify collocations sometimes return combinations that linguists would not regard as collocations. Therefore, our study generated collocations through statistical tests and then checked the generated collocation list against references and by manual inspection for further selection and modification.
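To make two of these measures concrete, the following Python sketch computes a Pearson chi-square statistic and a mutual information score (in the pointwise form commonly used in collocation studies) for a single word pair from a 2x2 contingency table of counts. It is an illustration only and not the implementation used in any of the cited tools; the function name `association_scores` and its parameters are assumptions introduced here.

```python
import math


def association_scores(pair_count, w1_count, w2_count, total):
    """Return (chi_square, mutual_information) for one candidate pair.

    pair_count: times w1 and w2 occur together (hypothetical input)
    w1_count, w2_count: occurrences of each word in the pair positions
    total: total number of pairs observed in the sample
    """
    # Observed 2x2 contingency table.
    o11 = pair_count                               # w1 with w2
    o12 = w1_count - pair_count                    # w1 without w2
    o21 = w2_count - pair_count                    # w2 without w1
    o22 = total - w1_count - w2_count + pair_count  # neither word

    # Pearson chi-square for a 2x2 table:
    # N * (o11*o22 - o12*o21)^2 / (product of the four marginal totals)
    numerator = total * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    chi_square = numerator / denominator if denominator else 0.0

    # Pointwise mutual information: log2( P(w1, w2) / (P(w1) * P(w2)) )
    if pair_count:
        mutual_information = math.log2(pair_count * total / (w1_count * w2_count))
    else:
        mutual_information = float("-inf")
    return chi_square, mutual_information
```

Under these definitions, a pair that occurs once and whose words each occur only once in a sample of 100 pairs receives a higher mutual information score than a pair that occurs ten times and whose words each occur ten times, which illustrates the low-frequency bias noted above.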
Finally, we evaluated two existing tools: Sketch Engine (Kilgarriff et al., 2014) and Corpus del Español (Davies, 2002-). Sketch Engine has three advantages. First, its retrieval results display combinations with different parts of speech. Secondly, the collocation examples help users understand how a collocation is used. Thirdly, Sketch Engine provides more advanced retrieval functions, for instance, a comparison between two similar verbs and their collocated words. However, Sketch Engine is not free for public use, although four corpora are available for selection. Furthermore, setting conditions for collocation retrieval is complicated. The other tool with a collocation search function, Corpus del Español, offers different ways of retrieving collocations and also has several advantages. It orders retrieval results by frequency, which helps users select an appropriate collocation easily. The provided examples facilitate users' understanding of the searched collocations, and the retrieval results are displayed as collocation lists that are easy to understand. Nevertheless, users cannot import their own data for searching.

Spanish Collocation Tool (SCT)
The "Spanish Collocation Tool (SCT)" was developed by the Web Mining and Multilingual Knowledge System (WMMKS) Laboratory in the Department of Computer Science and Information Engineering (CSIE) at the National Cheng Kung University (NCKU) in Taiwan.

Design of SCT
The SCT was written in the C# programming language and was designed to offer flexibility in setting the window size and selecting different statistical methods for collocated elements.
To set the window size, the user enters a range of numbers that specifies the permitted distance between two collocated elements. This setting makes it possible to control whether pairs with modifiers inserted between the two words of a collocation are included in the results. We experimented with different window sizes and found that a smaller window size yielded fewer results, while a larger window size yielded noisier information. For the purposes of this study, however, a large window size range was set in order to observe more examples, followed by manual checking to exclude the noisy information.
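The effect of the window size can be sketched in a few lines of Python. The example below is a hypothetical illustration rather than the SCT's C# code: the function name `window_pairs` and the convention that a distance of 1 means two adjacent tokens are assumptions made here.

```python
from collections import Counter


def window_pairs(tokens, min_dist=1, max_dist=2):
    """Count ordered pairs (w1, w2) where w2 follows w1 at a distance
    between min_dist and max_dist tokens (1 = directly adjacent)."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        for dist in range(min_dist, max_dist + 1):
            j = i + dist
            if j >= len(tokens):
                break  # the window runs past the end of the text
            counts[(w1, tokens[j])] += 1
    return counts
```

For the token sequence `tomar el sol`, a window of 1 to 1 yields only the adjacent pairs, whereas a window of 1 to 2 also yields the pair (`tomar`, `sol`), at the cost of admitting more candidate pairs that must be checked manually.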
Furthermore, the calculation of statistical probability and correlation restricts the search to significant pairs of two "closed" words. In the statistical design of the SCT, chi-square estimation and mutual information were used to test whether the probability of two co-occurring elements in a combination was within the confidence level. We adopted the chi-square and mutual information measures to calculate values for imported data after comparing several statistical methods used in collocation studies in computational linguistics and corpus linguistics, based on an extensive review of previous studies (e.g., Manning & Schütze, 1999). In addition, the SCT can compare two collocation lists in order to detect any significant difference between them by calculating the mutual information (MI) or chi-square (χ2) scores for each collocated pair with relative entropy (also called the Kullback-Leibler divergence). High positive scores indicate a contrastive difference between the two imported corpora.
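The exact formula used by the SCT for this comparison is not given here, so the following Python sketch shows only one plausible reading: each pair receives its individual contribution to the Kullback-Leibler divergence between the pair distributions of the two lists. The function name `kl_contributions`, the use of raw pair counts as input, and the add-one smoothing (parameter `alpha`) are all assumptions made for illustration.

```python
import math


def kl_contributions(counts_a, counts_b, alpha=1.0):
    """Per-pair terms p * log2(p / q) of the KL divergence D(A || B).

    counts_a, counts_b: dicts mapping a collocation pair to its count
    alpha: additive smoothing so pairs missing from one list get q > 0
           (an assumption; not taken from the SCT)
    """
    pairs = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(pairs)
    total_b = sum(counts_b.values()) + alpha * len(pairs)
    scores = {}
    for pair in pairs:
        p = (counts_a.get(pair, 0) + alpha) / total_a
        q = (counts_b.get(pair, 0) + alpha) / total_b
        scores[pair] = p * math.log2(p / q)
    return scores
```

In this reading, a pair with a positive score is relatively more frequent in the first list than in the second, a pair with a negative score is relatively less frequent, and the sum of all scores is the (non-negative) divergence between the two lists.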
Instead of searching only for collocated pairs of raw data (i.e., words), the grouping of different parts of speech and words of the same lemmas by the SCT can provide more systematic and generalized results that will facilitate the typological classifications used for further research analysis. Therefore, the SCT was designed to undertake the procedure of POS-tagging and lemmatization in order to develop more informative functions. We evaluated several POS taggers and found the Tree Tagger (a program developed by Helmut Schmid at the University of Stuttgart, http://www.cele.nottingham.ac.uk/~ccztk/treetagger.php) much more efficient than other tools in terms of processing speed. Thus, it was adopted to POS-tag imported data. The Spanish tagger and lemmatizer system were preinstalled in the internal program of the SCT and could be triggered automatically for convenience. Therefore, the SCT allows searches for combinations of words, parts of speech, and lemmas. In addition, the contexts where these collocation pairs appear can be retrieved in order to obtain complete information about the usage of learners and native speakers in a corpus for further analysis.
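How lemma-based grouping generalizes over word forms can be sketched as follows. The Python example assumes tagger output with one token per line in the tab-separated order word, POS tag, lemma, and counts adjacent pairs by lemma and POS instead of by surface form. The function names and the sample tags are hypothetical and serve only to illustrate the idea; this is not the SCT's internal code.

```python
from collections import Counter


def parse_tagged(lines):
    """Parse lines of the assumed form 'word<TAB>POS<TAB>lemma'."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            rows.append(tuple(parts))
    return rows


def adjacent_lemma_pairs(rows):
    """Count adjacent pairs keyed by (lemma1, pos1, lemma2, pos2)."""
    counts = Counter()
    for (_, pos1, lemma1), (_, pos2, lemma2) in zip(rows, rows[1:]):
        counts[(lemma1, pos1, lemma2, pos2)] += 1
    return counts
```

With this grouping, the surface pairs `buenos viajes` and `buen viaje` are counted as two occurrences of the same lemma pair (`bueno`, `viaje`), which is what allows results to be summarized by collocation type.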

Interface of results
In the interface (Figure 1), a collocation list appears on the left side of the screen. Each collocation consists of Word 1 and Word 2, whose base forms are Lemma 1 and Lemma 2, respectively. POS1 and POS2 are their parts of speech. The Count represents the frequency at which a collocation appears in the imported data. The Score is the result of the kernel methods. W1Count and W2Count represent the frequency of occurrence of each collocated element in the imported data. After viewing the first collocation result output, the user can set further conditions in the "Filter" area and then click "Apply" to restrict the output and obtain a more systematic result for subsequent analysis.
Examples of possible submitted commands:
lemma2=civil and score>2
word1count>2 or word2count>2
Pos1=NC
Pos1=NC and Pos2=NP and word1=sociedad
To run a comparison, the user must have two collocation lists ready. The result of comparing the two lists appears in another window. Results are ordered by score: the higher the absolute value of a score, the more significant the difference between the same collocation in the two imported corpora.
Although the SCT is free for public use and can derive collocation lists of not only words but also parts of speech and lemmas, it still has certain limitations. For example, the retrieved collocations can only consist of two collocated elements. Therefore, we hope to extend the query function from bi-gram to N-gram collocations in order to cover a wider range of collocated elements. We have also developed an error detection and revision suggestion system and combined it with "Spanish Collocation" to enrich its assisting functions and applications. By taking advantage of these combined functions, related research can be expanded to a larger scope.
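The filter conditions listed above can be read as simple comparisons on the columns of the result list, joined by "and" and "or". The Python sketch below shows one way such expressions could be evaluated against a single result row. It is not the SCT's parser: the function names, the lowercase column keys, the case-insensitive matching, and the rule that "and" binds more tightly than "or" are all assumptions made for illustration.

```python
import re

# One condition: a column name, an operator, and a value, e.g. "score>2".
CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|=|>|<)\s*(\S+)\s*$")


def check(row, condition):
    """Evaluate a single condition against a row (dict with lowercase keys)."""
    match = CONDITION.match(condition)
    if not match:
        raise ValueError(f"cannot parse condition: {condition!r}")
    column, operator, value = match.group(1).lower(), match.group(2), match.group(3)
    actual = row[column]
    if operator == "=":
        return str(actual).lower() == value.lower()
    number, target = float(actual), float(value)
    return {
        ">": number > target,
        "<": number < target,
        ">=": number >= target,
        "<=": number <= target,
    }[operator]


def matches(row, expression):
    """True if any 'or' group has all of its 'and' conditions satisfied."""
    groups = re.split(r"\s+or\s+", expression, flags=re.IGNORECASE)
    return any(
        all(check(row, c) for c in re.split(r"\s+and\s+", group, flags=re.IGNORECASE))
        for group in groups
    )
```

Given a hypothetical row for "sociedad civil" tagged NC + ADJ with a score of 3.2, `matches(row, "lemma2=civil and score>2")` holds, whereas `matches(row, "Pos1=NC and Pos2=NP and word1=sociedad")` does not, because the second element is not tagged NP.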

The application of the Spanish Collocation Tool in teaching and learning
This section focuses on SCT applications in a corpus-based study and in on-line teaching material.

Corpus-based study
Firstly, we demonstrate the application of the SCT in a corpus-based collocation study by analyzing L3 learner data. This study was intended to examine learners' collocation use by analyzing the data compiled in an annotated L3 learner corpus called the Taiwanese Learners' Written Corpus of Spanish (Corpus Escrito de Aprendices Taiwaneses de Español, CEATE, Figure 2) and by comparing learner-written texts with their revised versions corrected by native speakers of Spanish. The comparison of the two types of texts was intended to measure the distance between the interlanguage of L3 Spanish learners and usage in the target language.
The learners' written corpus (CEATE) consisted of texts written by those who studied Spanish as their third language (L3) after learning English as their second language (L2) and Mandarin-Chinese as their mother tongue (L1). This learner corpus was POS-tagged and error-correction annotated.

Methodology
To avoid noisy information, we extracted texts using the following criteria in order to obtain consistent data characteristics: a passage of 100-200 Spanish words in length, a descriptive text type, and a theme related to leisure life and daily routines. To observe collocation usage in Spanish, we focused on learners at the intermediate level with 576 to 1,088 hours of instruction, excluding learners with special backgrounds (such as immigrants and exchange or transfer students). Ultimately, around 36,000 words in Spanish were analyzed, of which 17,914 words came from the original texts and 17,563 words from the revised texts.
When importing data, we first fixed the maximum window size at 2 (the range between two words). Then we set conditions on the collocation results that we intended to obtain for further analysis. To observe two types of collocation, lexical and grammatical, we defined two groups. The first group consisted of the collocation types N-Adj/Adj-N, V-N, Adv-Adj, and V-Adv/Adv-V. The second group included the types N-Prep/Prep-N, Adj-Prep/Prep-Adj, V-Prep/Prep-V, and V-V. Then, texts containing collocated pairs were retrieved for further detailed analysis. Since POS-tagging is unavoidably less accurate on learners' texts, we checked the POS-tagged results manually in the final step of extracting data to maintain the basic quality of the tagging.
By using the SCT to compare the original texts written by learners with the texts corrected by native speakers of Spanish, we were able to derive lists of the collocations that learners overused and underused. Overused collocations were combinations that appeared more often in the original texts than in the revised versions. For classroom teaching, the results suggest that learners should be steered away from relying on the overused items, whereas the underused items on the collocation list should be stressed. We considered not only the scores (a high chi-square score) but also the counts (at least four occurrences) of collocated pairs, in order to exclude pairs that appeared only once but had a high score. In addition, the pairs with high positive values of relative entropy (KL values) were considered important and should be stressed in teaching and learning.
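The selection step described above can be sketched as follows. The Python example keeps only pairs that reach a minimum count in at least one of the two text versions and then labels each pair as overused or underused according to which version contains it more often. The function name `classify_usage`, the way the count threshold is applied, and the direct comparison of raw counts (reasonable only because the two sub-corpora are of similar size) are assumptions made for illustration; the score-based criteria are omitted for brevity.

```python
def classify_usage(original, revised, min_count=4):
    """Split pairs into (overused, underused) lists.

    original, revised: dicts mapping a collocation pair to its count in
    the learners' texts and in the native-speaker revisions, respectively.
    min_count: pairs below this count in both versions are ignored.
    """
    overused, underused = [], []
    for pair in set(original) | set(revised):
        in_original = original.get(pair, 0)
        in_revised = revised.get(pair, 0)
        if max(in_original, in_revised) < min_count:
            continue  # too rare to be reliable
        if in_original > in_revised:
            overused.append(pair)
        elif in_revised > in_original:
            underused.append(pair)
    return sorted(overused), sorted(underused)
```

If the two versions differed substantially in size, relative frequencies would have to be compared instead of raw counts.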

Results
In order to identify how accurately the different combinations were used, we derived a lexical collocation scale of N+Adj>V+Adv>Adj+N>V+N for the texts written by learners at the intermediate level. That is, the N+Adj combination shows the highest accuracy rate, whereas the V+N combination has the lowest rate among all combinations. We also derived the following hierarchical order of grammatical collocation uses by learners at the intermediate level: V+P>Adv+P>V+V. If we take 50% similarity as the threshold for stable development, we can conclude that N+Adj might have been developed at the beginning level, whereas the Adj+N combination enters a more stable stage at the intermediate level, and the V+N combination is still developing and is expected to stabilize at a later stage. These observations may imply an order of priority for the design of instructional materials.
With respect to the structure of V+N, combinations such as "solucionar problema, tomar sol" were found at the intermediate level. For the combination of Adj+N, we found that learners at the intermediate level might either know how to distinguish pre-noun from post-noun adjectives (for example, "buen/mal humor, nuevos amigos") or already be familiar with these fixed expressions. In the N+Adj combination, we observed many examples related to abstract concepts at the intermediate level, for example, "experiencia especial, comercio exterior". By applying our new software to the corpus-based contrastive research described above, which compares learners' written texts with native speakers' revised texts, the derived collocation lists can provide a clearer direction for designing pedagogical materials for the learning of collocation. With the SCT and the extracted L3 data, we were able to obtain a more conclusive generalization of collocation usage for a specific level of Taiwanese learners of Spanish.

Teaching material
The sample teaching material begins with a text entitled "Los recuerdos con mi familia", which integrates ten keywords that were intended to be included in the instruction. In addition to the connected videos and the vocabulary list, further information related to their collocated pairs was included in hyperlinks. Take two keywords, "viaje" and "foto", as examples. In the learning material, different types of collocations such as N-Adj/Adj-N ("buen viaje, largo viaje, viaje maravilloso") and V-N ("hacer viaje, tomar fotos, sacar fotos") can be demonstrated based on the text extracted from learner data and modified by native speakers of Spanish. Furthermore, the context (i.e., sentences) in which these collocated elements appear and their corresponding translations in Chinese, the mother tongue of our Taiwanese learners, can be offered as a learning option.
This sample of collocation teaching material, based on the development of the SCT and its application in a corpus-based contrastive study, served as a prototype for designing further pedagogical material. In the future, references to a native corpus, Corpus del Español (Davies, 2002), can be added to obtain a more generalized view of the natural language by examining the frequency of and tendencies in native usage.

Conclusions
In light of the current trend in corpus linguistics research, we have developed a Spanish Collocation Tool (SCT) to facilitate the analysis of collocations in Spanish. In comparison with other existing tools, the SCT is free to download and supports selecting statistical methods, setting window sizes, running queries that include POS-tag and lemma information, and viewing source texts.
The present study also combined the SCT with CEATE, a written corpus of Taiwanese L3 learners of Spanish, to give efficient and systematic access to results. By using the SCT in the study, we derived a learning sequence and found patterns in learners' collocation uses in contrast to those used by native speakers.
Furthermore, by using the SCT, the pedagogical implications of the search results and generalization of usage inclination for Taiwanese learners of Spanish led to the development of on-line multimedia material for learning Spanish collocations. The provided sample served as a prototype for designing effective pedagogical materials to facilitate Taiwanese learners' L3 learning.
Finally, it is hoped that researchers and teachers with the same interests can benefit by applying our software in data analysis and in teaching material design.