“We All Make Mistakes!”. Analysing an Error-coded Corpus of Spanish University Students’ Written English

The present study analyses the errors identified in the written argumentative texts of 304 Spanish university students of English taken from two different corpora, one from a technical university context and the other from learners enrolled in the Humanities. Considered an important design criterion for computer learner corpora studies, the students' metadata was recorded and their competence levels were measured using the Oxford Quick Placement Test. The scores obtained (0 to 60) were then related to the CEFR (Common European Framework of Reference for Languages) levels ranging from A1 to C2.


Introduction
Computer learner corpora and computer-aided interlanguage (IL) analysis have undoubtedly become household names within the field of applied linguistics and language learning. Both of these provide the framework for the present study, which involves an analysis of the linguistic features of Spanish university students' written English. The learner texts were automatically parsed and subsequently manually error-coded. The learner corpus with all the annotated errors provides information concerning the learners' pedagogical needs, and especially highlights those errors typically made by Spanish L1 students. The error-coded corpus was created within an earlier publicly funded project named TREACLE and is currently being used for a new project which proposes to develop individualised online self-study programmes tailor-made for learners with different needs, i.e. an online system to allow for targeted learning of English grammar for Spanish L1 learners.
To begin, we will briefly review the theoretical concepts that underlie the study: the characteristics of different language teaching and learning methods and approaches for which IL errors have been central; the reason why these linguistic features of learner language are still of interest to teachers, researchers and the learners themselves; and how the study of this learner language can be facilitated by certain tools and software. We then describe the learner corpus and the tool used for the analysis. Finally, the results are presented and discussed.

The rise and fall of IL error studies
Language errors have always been of interest to both teachers and learners of foreign languages. The different teaching methodologies of the last fifty or sixty years have tended to swing from one extreme to the other as regards the importance given to the errors produced in the learner's interlanguage (IL). In the days of Contrastive Analysis (CA) the aim was to avoid error-making by identifying the differences between the learner's mother tongue and the target language, predicting and describing patterns that were likely to cause difficulty, and eliminating these through drilling. Researchers conducted these contrastive analyses comparing different languages, which, according to Fries (1945), was an indispensable requirement for the creation of language teaching materials. The reasoning behind the theory was simple: when learning a second language a person will tend to transfer mother tongue structures to second language production (Lado 1957), and where L1 structures differ from the L2, mistakes will be made. According to CA, the identification of the differences and similarities between various languages and the subsequent prediction of possible errors was enough to deal with the problems of teaching those languages.
However, a growing body of research in the 60s and 70s (Briere 1964, Nemser 1971, Whitman and Jackson 1972) showed that many of the predicted errors were not observed in the learners' IL, whilst other errors were commonly made by language learners with very different mother tongues. It was also noted that too much attention was being paid to hypothesising about what the learner may do, to the detriment of studying what she or he actually does (Schachter 1974). Subsequently, Error Analysis (EA), a method that attempted to explain the essentially creative nature of the language acquisition process (Schachter and Celce-Murcia 1977: 442), emerged as an alternative methodology to the previous behaviourist habit-formation theory, providing researchers with a way of studying learner language, i.e. by not concentrating exclusively on what Ellis (1994: 48) describes as 'fully-formed languages' (L1 and L2). Thus, EA became integrated into the field of applied linguistics, above all thanks to the seminal work of Corder (1967), who highlighted the importance of errors in the learner's IL since they not only provided teachers and researchers with information concerning how much the learners had learnt, and how they were learning, but also, through making errors, how the learners themselves could discover the rules of the target language.
Later, EA was criticised for not having a more rigorous methodology (Ellis 1994) and for concentrating on the negative aspects of the learners' IL while ignoring the achievements (Enkvist 1973, Hammarberg 1974). Moreover, the phenomenon of avoidance (Schachter 1974) and the way learner language develops over time (Ellis 1994: 69) were important aspects that were never addressed.
Since the 1970s, trends in second or foreign language classroom research have seen a shift in focus from program-product relations to process-product or process-process oriented research (Chaudron 2000: 1699). The advent of the communicative approach to language teaching in the 70s meant that there was, in general terms, a greater focus on learners and the strategies used by them to acquire foreign languages. This, in turn, centred research interest on the process of learning rather than the product, and on fluency rather than accuracy. Inevitably, attention was drawn away from the linguistic study of what learners really produce: grammatical errors were overlooked to a certain extent in favour of meaning, with the result that linguistic accuracy suffered (Harley and Swain 1984, Lyster 1987, Alderson and Steel 1994, Renou 2000).
Despite changing trends in language teaching/learning methodologies over the years, errors made by learners are still of interest to teachers, researchers and, above all, to the learners themselves. Although over twenty years have passed since Carl James, one of the most prestigious researchers in the field, stated that CA and EA were still going strong (1994: 179), there are indeed aspects of both of these methods that can be seen to play an important part in the description and explanation of SLA processes, since in order to describe and then explain the IL, both L1 and L2 are normally referred to. Thus, as opposed to earlier studies which involved CA and EA as methods of teaching/learning, in recent years error analysis has played a more discreet, although it must be added constant, role, not so much as a theory of language learning but on a day-to-day basis. Teachers continue to correct errors and improve learning materials, and researchers look for, among other aspects, ways of helping learners to be more successful with their output by investigating feedback and error correction. This research concerns, for instance, the effectiveness (or not) of error correction (Frantzen 1995, Ferris 1995a, Chandler 2003, Bruton 2009a, Bitchener et al. 2005, Bitchener and Knoch 2010b, Sheen 2010a, Truscott 1996) and the effect of different types of feedback, i.e. teacher-centred feedback (Ferris 1995a, Ferris 1995b, Ferris and Hedgcock 1998, Ferris and Roberts 2001, Bitchener et al. 2005, Ellis et al. 2008), peer feedback (Sotillo 2006, Díez-Bedmar and Pérez-Paredes 2012), or computer-mediated feedback (Dekhinet 2008, Sauro 2009, Vinagre and Muñoz 2011).

Technology for IL students: computer learner corpora and computer-aided analysis
There have also been several technological developments in the last twenty or so years which have resulted in a renewal of interest in the study of learner errors and which have enabled teachers, researchers and learners to analyse or study output with a view to understanding more about the process of language acquisition, and how to teach or learn more efficiently.
Firstly, Computer Learner Corpora (CLC) were developed, which involved the creation of large computerised databases of authentic written or spoken language produced by language learners (Granger 2003). Since the first projects in the 1980s and 1990s (Faerch et al. 1984, Granger 1993, 1998), interest has increased a hundredfold, as can be seen on the Learner Corpora around the World web page created by the Centre for Corpus Linguistics (CECL) at Louvain, Belgium (see also Pravec 2002, Granger 2008, Neff et al. 2007 and Diez Bedmar 2008 for Spanish CLC studies). With regard to CLC used for pedagogical purposes, several projects have been designed with the intention of improving curriculum design for the teaching of English as a Foreign Language. These projects do not exclusively focus on the so-called 'negative' aspects of learner interlanguage but also analyse the positive linguistic properties of student texts with the aim of describing what students can be expected to learn at each level of proficiency, and how teaching material based on learner corpora should be sequenced. The English Profile Project is a pioneer in this field. The aim was initially to determine the criterial linguistic features of each of the CEFR levels by comparing the usage of different (positive) features in learner texts as well as the negative features, i.e. the incidence of errors, at each proficiency level. If a particular feature is found to be significantly more frequent at one level in comparison with lower levels, it is understood to be 'criterial' to that level and those above it. In this way, materials writers and teachers should have an idea of what aspects of English are typically learned at each CEFR level, and therefore which forms and structures can most suitably be introduced in learning programmes (Hawkins and Buttery 2010, Hawkins and Filipović 2012). Up to now, two free online tools have been developed as grammar and vocabulary teaching aids.
Secondly, computer programs have been developed which can analyse the language produced in seconds, generating frequency lists, concordances, syntactic and POS analyses, and so on. Some are commercially available, such as WordSmith Tools (Scott 1996), WMatrix (Rayson 2003) and Concordance (Watt 1999), and others are open source, such as AntConc (Anthony 2012), among many others. As regards learner language, and more specifically IL error analysis, there are a few programs that have been specifically designed to code IL errors (Hutchinson 1996, Dagneaux et al. 1996, Izumi 2005, O'Donnell 2008), although to date these programs are not automatic: they mainly provide user-friendly interfaces and tagsets together with comprehensive tagging manuals, which help the researcher with the task of coding the errors once they have been identified.

Aims of the research
In the present study the language errors made by Spanish university students of English are identified and coded. The following research questions were posed:
a. Concerning the whole error-coded corpus:
a.1 Of the six main categories of errors in the coding system (lexical, grammatical, phrasing, pragmatic, punctuation and uncodable), which is the most frequent in the corpus?
a.2 Which are the most frequent subcategories and specific features of this main error category?
b. Concerning the different competence levels of the learners:
b.1 Of the six main categories of errors, which have been identified as the most frequent when comparing the CEFR levels of the students? Once this is determined, which are the most frequent subcategories and specific features of this main category per level?
b.2 Do the errors made at different competence levels vary? Do some errors improve whilst others become more salient as language competence increases?

The corpora
The project we describe centres on the English language output of undergraduates in two different types of degree programmes: English Philology in a Humanities faculty, and courses taught at a technical university where engineering studies (Telecommunications, Civil Engineering, Industrial Engineering, etc.) are combined with Fine Arts, Architecture and Applied Computer Science. The general aim of TREACLE, involving the Universidad Autónoma de Madrid and Universitat Politècnica de València, was to improve the efficiency and quality of language learning at tertiary education institutions in Spain.
With regard to the corpora used for the analysis, the UPV Learner corpus is part of the MiLC corpus (Andreu et al. 2010) and consists of 950 written compositions (180,000 words) from Spanish students of all levels, mostly centring on the topic of immigration. The WriCLE Corpus (Rollinson and Mendikoetxea 2010) is composed of 750 argumentative essays (on immigration, homosexual marriage and traffic problems, amongst other topics) written by Spanish learners of English of all levels of proficiency. For the error coding in the TREACLE project, 304 of these essays (109,974 words) are analysed. Students were given the Oxford Quick Placement Test (paper and pen version); scores were taken at the time of writing and were converted to CEFR levels as indicated in the user manual.

Annotation software
The corpora were parsed and manually error-annotated. The rationale is that parsing the learners' output will provide information concerning what students are attempting and getting right, and the detection and tagging of errors will tell us what students are getting wrong. In both cases, the annotation was carried out with the UAM CorpusTool (O'Donnell 2008).
[Figure 1 callouts: 1. Select text containing error. 2. Provide the corrected text here. 3. Assign features to current segment here.]
Figure 1 shows how the error annotation system works. The UAM CorpusTool first allows the coder to select the text where the error has been identified (step 1 in the diagram); s/he can then provide the correct word or words, as shown in step 2; finally, the coder can choose from the different levels of error in order to establish the exact error code (step 3). In order to facilitate the coding of the error, the system provides a set of hierarchically-organised error codes. For the error in figure 1, the coder chooses from among the six main categories of error shown in figure 2. These are grammar, lexical, punctuation, pragmatic, phrasing and uncodable. However, it should be mentioned that we are not primarily aiming at elaborating "lexically organized dictionaries of errors but instead it is the grammatical topic in which the error would be taught during an EFL course that is of prime interest" (O'Donnell 2012a). For this reason, the error detected is associated with the grammatical unit which provides the context for the error, i.e. the clause and/or phrase. In order to code the error in figure 1 to its most delicate level, the coder would first choose 'grammar error', followed by 'np-error' (noun phrase error) and then 'determiner-error'. As we are dealing with an error regarding the article, the next level is 'determiner-choice-error' (see figure 3 for how coding can be done to the most delicate level in this case). With a view to facilitating the coder's job, there is both a comprehensive gloss in the coding window and a longer Coding Criteria Manual, with clear guidelines and examples of each error, that can be referred to whilst selecting the code. The error coding scheme contains 170 error features in total, of which 132 are leaf features (not more delicately specified).
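The hierarchical coding path described above can be pictured as a small tree. The sketch below is only an illustration in Python: the nested-dictionary layout and the function names are assumptions made for this example and are not the UAM CorpusTool's own scheme format, and the tree contains just the one branch named in the text rather than the full set of 170 features.

```python
# Hypothetical fragment of the error hierarchy: only the branch described in
# the text is included. An empty dict marks a leaf feature (one that is not
# more delicately specified).
ERROR_TREE = {
    "grammar-error": {
        "np-error": {
            "determiner-error": {
                "determiner-choice-error": {},
                "determiner-absent-required": {},
                "determiner-present-not-required": {},
            },
        },
    },
}


def path_to(tree, feature, prefix=()):
    """Return the list of codes from a main category down to `feature`, or None."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == feature:
            return list(path)
        found = path_to(children, feature, path)
        if found:
            return found
    return None


def leaf_features(tree):
    """Collect every feature that has no subcategories below it."""
    leaves = []
    for name, children in tree.items():
        if children:
            leaves.extend(leaf_features(children))
        else:
            leaves.append(name)
    return leaves


print(" > ".join(path_to(ERROR_TREE, "determiner-choice-error")))
print(leaf_features(ERROR_TREE))
```

Storing the full path rather than only the leaf code is what allows the same annotation to be counted at any level of delicacy, which is how the results are reported later in the paper.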
Since one of the aims of developing the error-tagged corpus is to explore the nature of the different errors made by Spanish university learners of English, the grammar network is structured in such a way as to mirror typical recommended grammar books such as Quirk and Greenbaum (1973) and Downing and Locke (2006), among others.
The major divisions in the coding system refer to errors in phrases (Noun phrase, Prepositional phrase, Adjectival phrase, Adverbial phrase); errors in clause construction (clause-error and verb phrase-error); and errors in the formation of complex clauses.
In several previous studies, researchers working on the TREACLE project have reported on the error coding process in general. The present study gives the results by considering first the total errors in the coded corpus, then the number of errors per 1,000 words, and finally the number of errors made per CEFR level, concentrating on the total number of errors made by each learner group.

Results and discussion
Having carried out the manual error coding of the 110,000-word corpus, a total of 15,850 errors were identified. Figure 4 shows the general results concerning the error coding of the texts in our corpus. In answer to our first research question concerning the most frequent errors in the corpus, by far the greatest number of errors can be found within the grammar error category (N = 7413). The second largest group consists of the lexical errors (N = 3345), followed by punctuation (N = 2089), pragmatic errors (N = 1542), phrasing errors (N = 1270) and uncodable errors (N = 191).
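As a check on the figures just reported, the short Python sketch below recomputes the total, each category's share of all errors, and the error rate per 1,000 words. The counts and the corpus size (109,974 words) are those given in the text; the variable and function names are illustrative only.

```python
# Category counts as reported in the text; corpus size as reported in
# the section on the corpora.
CATEGORY_COUNTS = {
    "grammar": 7413,
    "lexical": 3345,
    "punctuation": 2089,
    "pragmatic": 1542,
    "phrasing": 1270,
    "uncodable": 191,
}
CORPUS_WORDS = 109_974


def summarise(counts, corpus_words):
    """Return the total, percentage share per category, and errors per 1,000 words."""
    total = sum(counts.values())
    shares = {name: 100 * n / total for name, n in counts.items()}
    per_thousand = 1000 * total / corpus_words
    return total, shares, per_thousand


total, shares, per_thousand = summarise(CATEGORY_COUNTS, CORPUS_WORDS)
print(f"total errors: {total}")
print(f"errors per 1,000 words: {per_thousand:.1f}")
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12}{share:5.1f}%")
```

The six category counts sum to the reported total of 15,850, grammar errors make up just under 47% of them, and the overall rate works out at roughly 144 errors per 1,000 words.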
Grammar errors are those where some grammatical rule has been broken (wrong class for context, word order, agreement problem, missing but necessary element, present but unnecessary element, etc.).
Having seen that the grammar errors account for the most frequent main category, we turn to analyse the next most delicate level of errors within this group.
As figure 5 shows, noun phrase errors (np-errors) (N = 3334) are almost three times more frequent than any other sub-category, the second most frequent being prepositional-phrase errors (N = 1233), followed by verb-phrase errors in third place (N = 1171).
Figure 5. Sub-categories within GRAMMAR error
Looking in greater detail at the most delicate level, i.e. those codes which have no other subcategories below them, we give in Table 1 the twelve grammar errors most frequently found in the corpus.
We display the results in two different ways. In the first column, we offer the results comparing the specific errors to the total grammar errors. According to this, a total of 27% of all grammar errors involve determiners. In addition, we have added in the third column the percentage of each subcategory in relation to the 12 most frequent grammar errors and, in the case of determiner errors, we find this is as high as 43%. This particular category of error, as we mention later, is the most frequent across all the competence levels from A1 to B2.
Determiner-absent-required:
• This law not only is going to improve (…) health of
• the general opinion about (…) education system is focused
Determiner-present-not-required: when looking at the number and type of determiner errors, the most frequent is the use of a determiner when none is required, as in the following examples.
• due to the fact that the seventy percent of Spanish population is nonsmoker
• the law goes against the freedom of the smokers
Wrong-number: a common error involving the use of the singular when a plural is required, or vice versa. This particular linguistic feature is found in the head of a noun phrase, as in the following examples:
• alternative ways of using the transports
• if every rich countries gave them
Wrong-category-for-np-head: this error occurs when the head of a noun phrase cannot be used in that particular context (i.e. an adjective is written instead of a noun, or a noun is used which is unsuitable semantically and/or syntactically).
• In these days the controversial has risen of tone
• this new law will make the illegal traffic of drugs
Pronoun-choice-error: the pronoun chosen is inappropriate for the linguistic context.
• that muslim's sons could study his religion in the school
• I think that them don't want
Errors with the choice of preposition are the most frequent within the preposition error category. Some examples are:
Prep-choice-error:
• In the other hand…
• The purpose of this essay is to give a personal account on some of the possible ways
Unnecessary-preposition: a preposition is added when not required.
• This is the reason on why the local council
• and it prevents to the poor countries
Finally, clause errors within the grammar category include cases where a subject is required but not given.
Obligatory-subject-absent:
• If a student is at the University even if is not in class
• the younger a person is, the easier will be to reintegrate her/him
Adjunct-order-error: the adjuncts chosen, although correct, do not follow syntactic rules.
• there are so many people who insult them still and that makes homosexuals
• One possible solution to this problem might be to use less the car
Subject-finite-agreement: subject and verb do not agree in number.
• People who is not in favour of
• when someone die beyond our control
The present study is also concerned with reporting on the key errors made according to the different CEFR levels the students have been assigned. It must be stressed, however, that dividing learner groups into levels is a somewhat arbitrary division within the continuum of language learning. When can we say that a learner has really acquired a particular form or structure? (see O'Donnell 2016). Indeed, as many characteristics of learner writing are present at all levels, the figures we show can only be said to indicate the tendencies within each level, i.e. a feature is observable at, say, A1 level, but does the frequency of this error decrease when looking at higher levels?
Our second research question focuses on the incidence of different errors in relation to the level the learner-writer has been assigned. We respond to this question in three stages: firstly, we look at which main error types are the most frequent per level, and of these, we identify which are the subcategories and leaf features of these main categories.
Secondly, and in order to address the second part of this research question concerning whether the errors made at different levels vary or not, we carry out an analysis of the 10 most frequent grammar errors, looking at the incidence of a particular error type in relation to all the errors made at that particular level in order to get an insight into the type of errors that are most prominent (if any) at each level. Thirdly, of these 10 most frequent errors, we investigate whether some types of errors improve whilst others become more salient as language competence increases.
In figure 6 below, the most frequent main category is shown. At all competence levels, it is the grammar category, ranging from 32% of errors at C2 level to 50% of errors at A2 level. We note that lexical errors also fall with rising proficiency, whereas punctuation, phrasing and pragmatic errors go up.
Figure 6. Frequency of main categories of errors according to CEFR levels.
Concentrating on the most numerous category of error, we now look in greater detail at the grammar errors made at each level, but in this case as a proportion of all the grammar errors made within each level and not as a percentage of all the errors in the corpus as we have done until now.
Once again, noun phrase errors are by far the most numerous: at A1, A2, B1, B2 and C1, between 40% and 50% of all grammar errors fall within the noun phrase category. At C2 level, the errors are more evenly distributed between noun phrase, prepositional phrase and clause errors. This can be seen in more detail in table 2. We now move on to a comparative study of the most frequent errors within each level, in relation to the total grammar errors made by the students. This means focusing on the most delicate level of errors or leaf features. So, for instance, within the category noun-phrase (np) errors the subcategories are given, for example determiner-error and then determiner-absent-required, in this way following the hierarchy shown in figure 3 and providing more specific information about a particular error.
The results show that the most frequent errors do indeed vary somewhat according to the competence level of the writers. We explain these results by comparing, in the first place, levels A1, A2, B1 and B2 and, secondly, levels C1 and C2, since the main differences seem to lie between these two larger groups. We only look at the ten most frequent errors since after this the quantity of errors is very low, especially in the case of the C1 and C2 levels. The two most frequent grammar errors coincided at all levels from A1 to B2 (table 3). The most frequent, when a determiner is used but not required, ranges from 14% of grammar errors at A1, rising to 16% at level A2 and falling slightly to 15% at B1 and B2 levels. In the case of A1, of the 113 determiner-present-not-required errors, 101 were cases where the definite article the was used when not required, compared to only 11 involving the use of the indefinite article a/an. At A2 level, this is also the case, with 383 of the 433 errors in this category involving the definite article, 36 involving a/an, and a few involving other determiners which were not used by the A1 group. At B1, the figure is 230 out of 257 and, at B2, 162 out of 186 errors in this category. Similarly, the second most frequent error within the grammar category was in all cases an error of preposition choice, ranging from 9.68% of total grammar errors at B1 to 11.71% at A1. The prepositions that were most often used wrongly were: of and in at A1; of, for and in at A2; in and of at B1 and B2. The use of the preposition of may be a result of direct translation, as in the biggest problem of the world or many persons of Spain or we could infer of this study.
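The proportions behind these counts are easy to make explicit. The sketch below, in Python, computes the share of determiner-present-not-required errors that involve the definite article at each level from A1 to B2. It assumes that the B1 and B2 figures (230 of 257 and 162 of 186) refer to the definite article, as the A1 and A2 figures do; the text does not say so explicitly, and the variable names are illustrative only.

```python
# (errors involving "the", all determiner-present-not-required errors),
# as reported in the text. Reading the B1 and B2 pairs as definite-article
# counts is an assumption.
REPORTED = {
    "A1": (101, 113),
    "A2": (383, 433),
    "B1": (230, 257),
    "B2": (162, 186),
}

# Percentage of the category accounted for by the definite article, per level.
definite_share = {
    level: 100 * with_the / total
    for level, (with_the, total) in REPORTED.items()
}

for level, share in definite_share.items():
    print(f"{level}: {share:.1f}% of determiner-present-not-required errors involve 'the'")
```

Under this reading, the definite article accounts for between roughly 87% and 90% of the category at every level from A1 to B2, so the pattern is stable rather than specific to beginners.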
Table 4. Most frequent grammar errors at C1 and C2.
Turning to levels C1 and C2 (table 4), errors concerning choice of preposition are the most frequent at these levels. At C1 level, they account for 15%, and at C2 level, 23%. In fact, wrong choice errors account for 76% and 83% respectively of all the prepositional phrase errors made at these levels. Similar to the lower levels, the most common errors are made using of and in at C1 level and to and of at C2 level. In second place, 9.5% of C1 errors involved the determiner-present-not-required feature. In contrast, the second most frequent error within the C2 group involved an adjunct error, namely the incorrect order of the adjuncts, representing 11% of the total. This result might be expected, since higher-level students use structures that beginners and intermediate learners would not use. The errors coded as adjunct-order-error are cases where the adjunct is misplaced in relation to the verb, the object or complement, or other adjuncts. We are therefore dealing with more complicated syntactic structures which are more prevalent at more advanced levels.
As can be seen in Table 5, subject-finite-agreement was the third most frequent error, ranging from 6.6% of all errors made at A1 level to 9.2% at B1 level. Some of these may be a result of transfer from the learners' L1, as with the noun 'people' in I feel that people is looking for new(…) or each people have a different opinion (…), which is found across all levels. Note also the difference between a typical A1 error in this category, (…) and you thinks that you hasn't (…), and a more 'advanced' error, as in (…) halt this pandemic that affects 30 million people around the world and kill more than two millions. Fourth place at A1, A2 and B2 was taken by determiner-absent-required, as in this law not only is going to improve health of passive smokers (B2). Within B1 level errors, wrong-number was fourth most frequent. Table 6 shows the third and fourth most frequent errors within levels C1 and C2. In the case of C1, these were determiner-absent-required and wrong-number respectively, whereas within the C2 level third and fourth place are taken by determiner-present-not-required and subject-finite-agreement.
The fifth and sixth most frequent errors (table 7) involved wrong-number in the case of levels A1, A2 and B2, and determiner-absent-required at B1 level; determiner-choice-error followed at A1 and A2 levels, whilst the feature obligatory-subject-absent is most frequent at B1 and B2. Interestingly, the most frequent verb used with the subject missing is the verb 'to be', accounting for 34 of the 63 cases at B1 level and 32 of the 45 cases at B2 level. At levels C1 and C2 (table 8) there is no exact coincidence: the fifth most frequent error is determiner-choice at C1 and determiner-absent-required at C2, followed by subject-finite-agreement (C1) and obligatory-object-absent (C2). Unlike the case of obligatory-subject-absent, there are no examples with the verb 'to be', but at these levels there are few errors of this type.
In seventh place (table 9), determiner-agreement predominates at A1, A2 and B2 levels, whereas at B1 level we find unnecessary-preposition. The determiner 'this' is the most predominant at levels A1 (9 out of 31), A2 (46 out of 91 determiners) and B2 (19 out of 34). Other common errors with determiners involve another/other and possessive adjectives such as his/her/your, etc. The eighth most frequent grammar error varies more within the levels: A1 (unnecessary-preposition; the prepositions most frequently added when not required were to and of); A2 (obligatory-subject-absent); B1 (determiner-agreement); B2 (determiner-choice-error). Table 10 also shows more variation within the levels regarding the type of errors made. For C1 the seventh and eighth most frequent errors are adjunct-order-error and determiner-agreement, whilst for C2 we have wrong-number and modal-tense-aspect-choice-error.
Turning to the ninth and tenth most frequent grammar errors in the corpus, table 11 shows that few coincide: for A1 level, pronoun-choice-error and obligatory-subject-absent; A2 level, unnecessary-preposition and adjunct-order-error; B1, determiner-choice-error and pronoun-choice-error; B2, unnecessary-preposition and infinitive-clause-formation-error. With regard to the higher competence levels (table 12), at C1 we have obligatory-subject-absent and unnecessary-preposition, and at C2, premodifier-order-problem and obligatory-subject-absent.
Within the noun phrase category of errors, several types of determiner errors are clearly found across all levels, although frequency varies. The feature determiner-present-not-required is the most frequent, but in the case of C1 and C2 the percentage of these in comparison to total grammar errors is lower (9% and 10% respectively). With levels A1 to B2, the amount varies from 14% to 16%. Another sub-category of determiner error is determiner-absent-required, which is also found at all levels, varying from 5% to 6% across all levels. In the case of determiner-choice-error, the highest incidence is at A1 and C1 levels (around 4.5%). Neither determiner-choice-error nor determiner-agreement errors were recorded at C2 level, although the latter stands at 3% across levels A1 to C1. Still within the noun phrase, the category wrong-number is found at all levels, although with a lower percentage at C1 and C2 levels. Pronoun errors are similar across all levels (around 2%), except that they are not found amongst the ten most frequent errors at C2.
Although it must be said that the total number of C2 errors is lower than any other competence level, the most frequent error within this group, belonging to the prepositional phrase category, namely preposition-choice-error, accounts for 23% of C2 errors, considerably more than the 9% to 11% at levels A1 to B2.The amount found for C1 lies at a point in-between, standing at 15%.However, there is one type of preposition wrongly used across all levels and that is of, which may be attributed to directly translating from Spanish.This particular misuse was also found in MacDonald (2005).We also found unnecessary-preposition to be present at all levels, although in all cases, lower than 3% of the total grammar errors per level.
The most frequent category of error related to the verb phrase involves subject-finite-agreement. This is found at all levels, and the frequency ranges from 4% at C1 to the highest, 9% at B1.
Turning to what we classify as clause errors, the more notable differences in the results emerge. For instance, among the ten most frequent grammar errors, adjunct-order-error is not found at A1, B1 or B2. It accounts for 2.5% at level A2, but at the higher levels the figure is 4.3% at C1 and a much higher 11% at C2. The category obligatory-subject-absent is found across all levels within the ten most frequent errors, but only in the case of C2 do we find obligatory-object-absent. Also classified as a clause error, infinitive-clause-formation-error was found among the ten most frequent errors only at level B2.
Finally, there are two features that were only found amongst the C2 level errors: modal-tense-aspect-choice-error (within verb phrase) and premodifier-order-problem (noun phrase).
Based on the above data, and bearing in mind that we are concentrating on the percentage of the most frequent errors within the grammar category per CEFR level, the prominence of determiner errors fell with rising proficiency, whereas certain errors such as preposition-choice-error and adjunct-order-error became more salient.
Although it might have been hypothesized that the percentage of the ten most frequent grammar errors would drop according to the competence level of the writers, this was not in fact corroborated by our data. The presence or absence of a particular error did not seem to depend on level (with the exception of some of the C2 errors), since in several cases the presence was somewhat erratic across the levels. To give an example, subject-finite-agreement as a percentage of errors per level goes up steadily from A1 through A2 and B1, but then drops at B2 and goes up once again at C1 level. Similarly, if we take determiner-choice-error, the incidence goes down from A1 to B2 but then increases at C1, whilst at C2 there were no examples in the data.
Pronoun-choice-error was present among the ten most frequent errors at A1 and B1 levels, but was not found at A2, B2, C1 or C2.

Conclusion
In the present study we set out to analyse the data obtained from an error-coded corpus of Spanish university students' compositions in English. The corpus was compiled with written texts from the Wricle Corpus and the UPV Learner corpus. The students were assigned competence levels after doing the Oxford Quick Placement Test.
The texts were error coded using the UAM Corpustool. Two different ways of interpreting the data were employed. Firstly, we looked at the main categories of errors within the whole corpus, then concentrated on the subcategories, and finally moved on to the most specific features. Secondly, and in a similar way, we focused on the most frequent errors, but this time taking into account the competence level of the students and how the interlanguage errors varied according to this variable.
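The second way of reading the data, ranking the most frequent features within each competence level and expressing each as a percentage of that level's errors, can be sketched as follows. This is a minimal illustration only: the function name `top_features_by_level`, the `(level, feature)` input format and the toy annotations are assumptions made for the example, and do not reproduce the export format of the coding tool or any figure from the corpus.

```python
from collections import Counter, defaultdict


def top_features_by_level(annotations, n=10):
    """For each level, return the n most frequent error features together
    with their share (%) of all errors annotated at that level.

    `annotations` is an iterable of (level, feature) pairs; this input
    format is hypothetical, chosen only for the sketch.
    """
    counts = defaultdict(Counter)
    for level, feature in annotations:
        counts[level][feature] += 1

    ranking = {}
    for level, counter in counts.items():
        total = sum(counter.values())
        ranking[level] = [
            (feature, round(100 * count / total, 1))
            for feature, count in counter.most_common(n)
        ]
    return ranking


# Invented toy annotations, NOT data from the corpus described here.
toy = (
    [("A1", "determiner-present-not-required")] * 3
    + [("A1", "preposition-choice-error")] * 1
    + [("C2", "preposition-choice-error")] * 2
    + [("C2", "adjunct-order-error")] * 2
)

ranking = top_features_by_level(toy, n=2)
for level, features in sorted(ranking.items()):
    print(level, features)
```

Because the percentages are computed within each level, they describe the relative prominence of a feature among that level's errors rather than how often learners at that level make the error per word written.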
Concerning the most frequent errors found in the corpus, the main category, Grammar errors, was the most numerous. Moving down a level to the Grammar subcategories, noun phrase errors were the most frequent. Below this level, determiner errors accounted for almost a third of grammar errors, alongside errors with the head of the noun phrase (9%), preposition errors (14%) and clause errors (13%). Examples of each subcategory were provided.
Our second research question focused on the competence level of the writers and the frequency and type of errors made by the learners at a particular level. Once more we looked at the most frequent main category and the subcategories under it. Results show that grammar errors were the most frequent, accounting for 50% of errors at A2 level and falling to 32% at C2 level. In addition, the more specific features within this category were analysed, and noun phrase errors were found to be the most frequent at levels A1, A2, B1, B2 and C1 (between 40% and 50% of all grammar errors). At C2, noun phrase errors were not so clearly salient (28% of grammar errors) but were similar in frequency to prepositional phrase errors (27% of grammar errors) and clause errors (25%).
Following this, a comparative study was carried out of the most frequent errors at each competence level. In this case the most specific features were examined within the grammar category. It was found that there is a slight difference in the most frequent errors depending on the CEFR level of the students. The results were explained by making reference firstly to levels A1 to B2 and then, separately, to C1 and C2.
Summarising the most relevant findings, first, the most frequent errors at levels A1 through to B2 were determiner errors, namely cases where a determiner is used but not required. By far the most frequently misused determiner at all levels was the definite article the. The second most frequent category is preposition-choice-error, and it was found that the preposition most often misused was of, which we concluded was possibly due to direct transfer from Spanish. At C1 and C2, the most frequent error was preposition-choice-error, accounting for 23% of all grammar errors at C2 level and 15% at C1. This particular error was the highest within the prepositional phrase errors and also included of as the most misused preposition, followed by to and in. Next, we found adjunct-order-error to be the second most frequent error at C2 level. At this level, learners should be highly competent in written and spoken English and are likely to be using more complicated syntactic and grammatical structures, which may explain why this particular error is the second most frequent.
The third most frequent error among levels A1 to B2 involved subject-finite-agreement, which is to be expected among the lower levels but is perhaps a little surprising at B2. However, we noted that the linguistic context did tend to be different according to competence level. The next most frequent error was the same at levels A1, A2 and B2 (determiner-absent-required).
On the whole, errors diminish as competence levels increase. However, despite the fact that the order of frequency changes, the categories of the most frequent errors coincide in most cases, i.e. there are no spectacular differences between one level and another. Several types of determiner errors (agreement, choice, needed but not present and vice versa) are found at all levels, although at C2 some were not present within the ten most frequent errors analysed. Other features found at all levels were: subject-finite-agreement (verb phrase); wrong-number, obligatory-subject-absent (head of noun phrase); preposition-choice-error, unnecessary-preposition (prepositional phrase). By contrast, some errors were among the most frequent at some levels only: pronoun error (A1 and B1); adjunct-order-error (A2, C1 and C2). Finally, some features were only among the ten most frequent at one level: infinitive-clause-formation-error (B2); obligatory-object-absent (C2); modal-tense-aspect-choice-error (C2) and premodifier-order-problem (C2).
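The three-way grouping used above (features among the most frequent at every level, at several levels, or at a single level) amounts to a simple membership comparison across the per-level lists. The sketch below illustrates that comparison under stated assumptions: `classify_by_spread` is a hypothetical helper, and the shortened lists are invented for the example rather than taken from tables 3 to 12.

```python
def classify_by_spread(top_lists):
    """Group features by how many levels include them among their most
    frequent errors.

    `top_lists` maps a level label to the collection of features ranked
    among that level's most frequent errors (hypothetical input format).
    """
    levels = list(top_lists)
    all_features = set().union(*top_lists.values())

    groups = {"all_levels": [], "some_levels": [], "one_level": []}
    for feature in sorted(all_features):
        present = [lvl for lvl in levels if feature in top_lists[lvl]]
        if len(present) == len(levels):
            groups["all_levels"].append(feature)
        elif len(present) == 1:
            groups["one_level"].append((feature, present[0]))
        else:
            groups["some_levels"].append((feature, present))
    return groups


# Invented, shortened lists for illustration only (not the study's tables).
toy_top = {
    "A1": {"preposition-choice-error", "pronoun-choice-error"},
    "A2": {"preposition-choice-error", "adjunct-order-error"},
    "B2": {"preposition-choice-error", "infinitive-clause-formation-error"},
    "C2": {"preposition-choice-error", "adjunct-order-error"},
}

groups = classify_by_spread(toy_top)
print(groups)
```

Note that the outcome depends on the cut-off chosen for each list: a feature ranked eleventh at one level would count as absent there, even if its frequency is close to that of the tenth.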
This is interesting, as it must be noted that there are 132 error categories in total. For some of these, no errors were recorded, but with this study we have shown that there is in fact not as much variation between different competence levels as we might have hypothesized.
Obviously we must mention that taking the first 10 most frequent errors and not, say, the first 20 is a somewhat arbitrary decision, but the frequencies and figures drop after this point and become quite insignificant, especially in the study at hand, which set out to deal with the frequency of the most specific features. The number of words and texts at each level also varies somewhat, and this may be a factor that influences the results to a certain extent. Another limitation is that the results are not easily comparable with those of other error coding systems. Diaz-Negrillo and Fernández-Dominguez (2006) suggest that some sort of standardisation process or establishment of "a benchmark for the analysis of computerized learner errors" (2006: 86) would be necessary in order to address this issue. Unlike our taxonomy, the Louvain error coding system presents a total of 40 error categories, which are difficult to compare with the 170 features used in the present study. A larger corpus with more texts representing the different levels would also have made the results more representative of the population under study.
In the present article we have centred on the most frequent errors in the corpus, which belonged to the grammar category. Of particular interest for the next stage of the project is to determine in greater detail which grammatical rules have been broken in the most frequent errors, in order to provide the basis for the development of a web-based language learning system which dynamically adapts to the student.

Figure 2. Main error types in coding scheme.

Figure 4. Major categories of errors in coded corpus.

Table 2. Distribution of grammar errors according to CEFR levels.

Table 5. Third and fourth most frequent errors A1 to B2.

Table 6. Third and fourth most frequent errors C1 and C2.

Table 7. Fifth and sixth most frequent errors A1 to B2.

Table 8. Fifth and sixth most frequent errors C1 and C2.

Table 9. Seventh and eighth most frequent errors A1 to B2.

Table 10. Seventh and eighth most frequent errors C1 and C2.

Table 11. Ninth and tenth most frequent errors A1 to B2.

Table 12. Ninth and tenth most frequent errors C1 and C2.