ITEM AND TASK DIFFICULTY IN A B2 READING EXAMINATION: PERCEPTIONS OF TEST-TAKERS AND CEFR ALIGNMENT EXPERTS COMPARED WITH PSYCHOMETRIC MEASUREMENTS

The article presents a study of a CEFR B2-level reading subtest that is part of the Slovenian national secondary-school leaving examination in English as a foreign language, and compares the test-takers' actual performance (objective difficulty) with the test-takers' and experts' perceptions of item difficulty (subjective difficulty). The study also analyses the test-takers' comments on item difficulty obtained from a while-reading questionnaire. The results are discussed in the framework of existing research in the field of (the assessment of) reading comprehension, and are addressed with regard to their implications for item-writing, FL teaching and curriculum development.

stopar and ilc: item and task difficulty 319


Introduction
Following the well-established distinction between objective and subjective difficulty (Fulmer and Tulis 2013), the present study aims at determining possible correlations and interdependencies between these two types of difficulty, with special attention paid to their importance for item-writers, test-, policy- and curriculum-developers, as well as CEFR¹-alignment experts.
The study draws on the reading comprehension subtest of the Slovenian national end-of-secondary-school leaving exam in English, the General Matura in English (henceforth GM), which has only recently been fully validated and aligned with the B2 level of the CEFR scale (Bitenc Peharc and Tratnik 2014). For the purposes of the investigation, the GM reading subtest was administered to a group of test-takers together with a while-reading questionnaire, in which the test-takers commented on their perception of item/task difficulty. In order to determine to what extent the objective difficulty correlates with the subjective difficulty, the study compares (i) the psychometric measurements of the reading subtest (objective difficulty) with (ii) the answers from the while-reading questionnaire as well as the judgments of the language experts who aligned the GM examination with the CEFR (subjective difficulty). The reason for including the language experts in the study is twofold: first, in our context, most of the language experts participating in the CEFR alignment project are also item-writers for the national examinations, and second, we want to address the question of experts and their reportedly weak ability to predict item/task difficulty (Alderson and Lukmani 1989; Sydorenko 2011). In addition, an in-depth analysis of the test-takers' while-reading questionnaire is employed to identify the underlying factors that can contribute to item/task difficulty and influence test-taker performance.
We strongly believe that apart from theoretical implications, the results of our investigation will also have practical value, especially in educational environments where the test provider does not follow all the standardised test-design procedures described in Green (2014), among others. For example, in Slovenia, the national high-stakes examinations, including the GM, are neither piloted nor pre-tested (Ilc, Rot Gabrovec, and Stopar 2014; Šifrar Kalan and Trenc 2014). Consequently, the item-writers, test-developers and (CEFR) alignment experts must rely solely on their subjective judgment regarding item/task difficulty. Their misjudgement about item/task difficulty may negatively affect the test validity and reliability, which is an undesired result, particularly so in the case of high-stakes examinations. Therefore, a better understanding of item/task difficulty may have positive ramifications for test validity and reliability.

¹ Common European Framework of Reference for Languages (Council of Europe 2001)

clac 67/2016, 318-342

Reading Comprehension and Item/Task difficulty: Basic Tenets
Reading has long been treated as a cornerstone of foreign language (FL) teaching. The mid-20th-century notion of reading as one of the four discrete FL skills remains relevant today (Hinkel 2010), even though it has been amended by the findings of many studies showing that reading should not be treated as a single, monolithic skill but rather as a complex and extensive set of activities involving multifarious skills. As Alderson (2000) points out, the lists of reading (sub-)skills and the descriptions of how they interact are numerous and varied, depending on the theorist researching them. Some that have frequently surfaced in the literature and have persisted for decades include decoding, linguistic knowledge, knowledge of discourse structure, knowledge of the world, synthesis and evaluation, metacognitive knowledge, and others (Bloom et al. 1956; Grabe 1991; Koda 2005; Munby 1978; Urquhart and Weir 1998). Khalifa and Weir (2009) propose a detailed, 7-point taxonomic scale of reading ability which involves (from lowest to highest): word recognition, lexical access, syntactic parsing, establishing propositional meaning, inferencing, building a mental model, and creating a text-level structure.²

² Empirical studies have shown that the hierarchical ordering of the proposed levels should be slightly modified (Wu 2011; Ilc and Stopar 2014).

While the lower levels mostly deal with lexis and syntax that are explicitly recoverable from the text, the higher levels focus on the contextual dimensions of reading, such as recognizing implicit meaning, connecting the text with the knowledge of the world, and establishing intra-/inter-textual links. A similar proposal is put forward by Grabe (2009), which distinguishes word recognition, syntactic parsing, and proposition encoding (lower levels) from text processing strategies, background knowledge/inferencing, and understanding the discourse structure as well as the context of the reading act (higher levels). All of these levels are fully interconnected, and '[c]omprehension cannot occur without the smooth operation of these processes' (Grabe 2009: 22). Given these assumptions, one would expect a direct correlation between a taxonomic level and reading comprehension difficulty: higher taxonomic skills should intrinsically be more difficult than the lower ones. However, the empirical study of Brunfaut and McCray (2015) has shown that such an overgeneralisation is problematic. According to their study, some readers have been able to aim their attention exclusively at higher-order skills, making little use of lower-level skills. This conclusion also supports previous claims that the difficulty level of a particular reading subskill cannot be directly linked to the taxonomic levels. For instance, Alderson and Lukmani (1989: 268) observe that some linguistically weaker test-takers perform overall 'somewhat better on the higher order questions than on lower order questions'. They attribute this fact to the test-takers' non-linguistic cognitive abilities. Harding, Alderson and Brunfaut (2015: 7) further point out that reading skills
also need to be closely linked with different cognitive processes, including working memory capacity, attention and the automaticity of word recognition.
Due to these factors, the question of 'how to diagnose problems at the higher level, or problems related to the interactions between lower- and higher-level processes, is less clear' (ibid.).
Despite these observed and reported discrepancies between taxonomic and difficulty levels, contemporary FL teaching practices and policies by and large follow the assumption that the relative taxonomic ranking of a particular comprehension skill directly reflects the skill's complexity and difficulty. This strategy is evident in the CEFR (Council of Europe 2001). The document, for instance, describes the reading ability of a B2 student as one that includes reading different types of discourse; dealing with 'contemporary problems', which can be interpreted as part of the reader's knowledge of the world; and recognising 'particular attitudes or viewpoints' (Council of Europe 2001: 27). These descriptors can be directly linked with the 'building a mental model' and 'inferencing' taxonomic levels of Khalifa and Weir's (2009) classification. In contrast, the reading ability of an A2 student is defined by descriptors associated with lower taxonomic levels (word recognition and lexical access), for instance, 'can understand short, simple texts containing highest frequency vocabulary' (Council of Europe 2001: 69).
After the publication of Bachman's (1990) and Bachman and Palmer's (1996) seminal works on language testing, much research has been dedicated to testing reading comprehension, and also to the relationship between the factors that give rise to item/task difficulty. As Fulmer and Tulis (2013) observe, two different types of item/task difficulty have been discussed: objective and subjective difficulty. While the former mostly pertains to readability, which can be objectively measured using various tools and item/task analyses, the latter involves a subjective judgment of difficulty based on cognitive, motivational and emotional factors (Efklides 2002; Fulmer and Tulis 2013).
Discussing objective difficulty, Freedle and Kostin (1993, 1999) analyse in detail factors such as vocabulary selection, sentence/passage length, topic (abstract vs. concrete), syntactic features (rhetorical organisers, referentials, fronting, negation), text organisation (topicalisation), and item type (explicit/implicit detail, explicit/implicit gist, textual organisation/structure). When addressing the relationship between item type and difficulty, which is also discussed in this paper, Freedle and Kostin (1999: 18) observe that, at least as far as listening comprehension testing is concerned, items that involve identifying the main idea and inference-application are easier than inference items. Lund (1991) reports that, given the same language proficiency, test-takers find main-idea items and inference items easier than supporting-idea items in the case of listening comprehension, whereas the situation is exactly the reverse with reading comprehension.
The perceived (i.e., subjective) difficulty involves both ability and affective variables.
While the ability variables (intelligence, aptitude, cognitive style) are more permanent and can be diagnosed ahead of time, the affective variables (confidence, motivation, anxiety) are more temporary and less predictable (Robinson 2001: 32). Consequently, the reported discrepancies between objective and subjective difficulty can be attributed to affective variables (Fulmer and Tulis 2013).

The theoretical considerations involving the complexity of the reading process as well as FL testing (see above) have led authors such as Alderson (2000) and Spaan (2007) to suggest that a valid and reliable reading comprehension test should always contain an appropriate selection of tasks and texts that not only test the appropriate micro-skills but also include tasks (and items) targeting the intended level of difficulty. Such a requirement, coupled with the requirements of curricula increasingly aligned with the CEFR, presents a significant challenge for testing and assessment (Figueras 2012; Fulcher 2004). This is especially the case with examinations for which the curriculum also serves as the test construct. In our context, the GM test-developers and item-writers are faced with the responsibility of creating valid tests that adhere to the requirements of their exam constructs, which, in turn, are rigidly aligned to the CEFR. The item-writers are also not supported by (external) validation and evaluation of the items (e.g., piloting, pre-testing). Thus, the test validity and reliability depend exclusively on the item-writers' and test-developers' judgments about item/task difficulty. The importance of pinpointing the desired difficulty level is also demonstrated by projects and studies focusing on relating examinations to the CEFR and identifying alternatives to (often impractical) piloting and pre-testing procedures (cf. Alderson et al.
2004; Bitenc Peharc and Tratnik 2014; Cizek 2001; Council of Europe 2009; Hambleton and Jirka 2006; Kaftandijeva 2010; Little 2007; Martyniuk 2010; Sydorenko 2011). In line with these attempts, the research presented herein explores to what extent the test-takers' and language experts' subjective perceptions of item/task difficulty can be used as an alternative to piloting and pre-testing.

Context
The study presents three different reading comprehension tasks from the GM in relation to item difficulty as shown by psychometric measurements and by the perceptions of test-takers and the CEFR-relating experts. For the purposes of the research, we collected the test-takers' psychometric measurements, the test-takers' answers to the while-reading questionnaire on item difficulty, and the experts' judgments of item difficulty. The reading tasks and the while-reading questionnaires were administered for the purposes of the present investigation only (i.e., they were not part of the GM administration); however, the GM administration guidelines were strictly followed.
The GM is a high-stakes exam, serving both as an achievement test (i.e., a national secondary-school leaving exam) and as a proficiency test (i.e., a tertiary-education entrance exam). The GM is provided and administered by the Slovenian National Examinations Centre (RIC), and it comprises three obligatory and two elective subjects. One of the obligatory subjects is an FL. The GM in English consists of five subtests: reading comprehension, listening comprehension, language in use, writing and speaking. The first four subtests are administered on the national level and marked externally; the last is administered by the Matura school committees using standardised prompts and criteria.

Participants
The data presented herein were collected from the responses of a total of 83³ test-takers, all of whom are non-native speakers of English. With regard to EFL and the CEFR levels, they all share the same background: they have English as an FL1 subject in their curricula, and their expected proficiency level is, according to the curricula, B2. The test-takers were selected randomly from the GM population at different Slovenian secondary schools (last-year students, age range 17 to 19). The participants were required to complete three different reading comprehension tasks that were originally administered by the RIC, together with the accompanying while-reading questionnaire. Comparing the performance of the 83 test-takers included in the study with the performance of the test-takers who originally sat the GM, we can observe a high level of consistency in correlations: 0.89, 0.77 and 0.87 for Tasks 1, 2 and 3, respectively, which suggests that our sample is representative of the GM test-taker population.

³ The original number of participants was 100, but 17 test-takers did not complete the while-reading questionnaire.

Reading comprehension subtest
The three reading tasks included in the study were taken from the RIC test paper bank and were administered by the test provider in autumn 2009 to 1,022 test-takers (Tasks 1 and 2), and in spring 2013 to 4,375 test-takers (Task 3). The reason for selecting these three reading tasks for our research is twofold. First, Tasks 1 and 2 were also used by the panellists who aligned the GM reading subtest to the CEFR levels, so by using these two tasks, we have been able to compare the perception of item difficulty from the perspective of both the test-takers and the panellists. Second, Task 3 was selected intentionally to create a representative array of task-types that frequently occur in the GM reading subtests: Task 1 is a short-answer (SA) task type (Items 1-10), Task 2 (Items 11-20) is a gapped-text (GT) task type, and Task 3 (Items 21-28) is a multiple-choice (MC) task type. Following Freedle and Kostin's (1999) classification of items, we identified detail explicit (D-E) items (12 items), detail implicit (D-I) items (2 items), gist explicit (G-E) items (2 items), gist implicit (G-I) items (2 items), and items targeting textual organisation/structure (O-S) (10 items). The items are presented in Table 1.

We assigned numeric values to the participants' descriptive responses as follows: test items marked as easy were assigned the value 0.95, items marked as moderate the value 0.50, and items marked as difficult the value 0.05.
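As a minimal sketch of this conversion (the function name and the response counts are invented for illustration), each item's perceived facility value is simply the mean of the numeric equivalents of its ratings:

```python
# Numeric equivalents for the three-point difficulty scale described above.
RATING_VALUES = {"easy": 0.95, "moderate": 0.50, "difficult": 0.05}

def perceived_facility(ratings):
    """Mean numeric value of a list of easy/moderate/difficult ratings for one item."""
    return sum(RATING_VALUES[r] for r in ratings) / len(ratings)

# Invented example: 60 test-takers rate an item as
# 40 x easy, 15 x moderate, 5 x difficult.
ratings = ["easy"] * 40 + ["moderate"] * 15 + ["difficult"] * 5
print(perceived_facility(ratings))
```

An item rated as easy by everyone thus receives a perceived facility value of 0.95, directly comparable to the facility values in the test provider's reports.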
To analyse the replies to the open-ended question, we applied the clustering method (Miles and Huberman 1994), which involves first identifying general topics and then breaking them down into more specific sub-topics.

Correlations and Comparisons
With regard to RQ1, we find that the correlation between test-taker perceptions of item difficulty and test-taker performance (Items 1-28) is relatively high, at 0.73.The correlation between expert judgments and test-taker performance (Items 1-20) is very high, at 0.83.
The test-takers perceive the test as noticeably more difficult than it actually is (for Tasks 1-3, the average perceived facility value is 0.67, while the average performance-based facility value is 0.82). Their predictions are most reliable for Task 1 (correlation: 0.70) and Task 2 (correlation: 0.88), whereas the correlation between perceptions and performance is the lowest for Task 3, at 0.44. In contrast, the gap between the average perceived facility value (0.53) and the average performance facility value (0.66) is the least noticeable for the same task.
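The correlations reported in this section are standard Pearson coefficients computed over per-item facility values. A minimal sketch of the computation (the two value lists below are invented for illustration and are not the study's data):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-item values: perceived vs. observed facility for five items.
perceived = [0.60, 0.85, 0.40, 0.75, 0.55]
observed = [0.72, 0.91, 0.58, 0.80, 0.70]
print(round(pearson(perceived, observed), 2))
```

Note that a high correlation can coexist with a systematic gap in means, which is exactly the pattern observed above: the test-takers rank the items reliably while underestimating their own performance overall.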
Unfortunately, an identical comparison between the test-takers and the alignment experts is not possible, since the data from the CEFR alignment project do not include expert difficulty judgments on the multiple-choice task (Task 3). Despite this limitation, we can observe that even if the correlation analysis is confined to the first two tasks, the result remains unchanged; namely, the correlation between test-taker perceptions and performance is 0.73.
Focusing on individual items from the tasks that the test-takers and the judges had in common, we can observe a high degree of agreement about the items perceived/judged as the most difficult or the easiest. For instance, the data show that four of the five items perceived as the most difficult are the same in both groups (Items 5, 4, 16 and 3), albeit not in the same order. Both groups also share the perception that four of the five easiest items in the first two tasks are Items 1, 2, 7 and 11.
For reasons of practicality, the detailed presentation of results in the following sections is limited to at most five test items that are (i) perceived as the most difficult; (ii) perceived as the easiest; and (iii) most noticeably misperceived.

Items Perceived as the Most Difficult
The five items that the test-takers perceived as particularly challenging in Tasks 1-3 are, in order of perceived difficulty, Items 27, 3, 5, 22 and 26. The first three require the test-takers to process implicit information; additional factors affecting their difficulty are comparison (Item 27), negation (Item 3) and the fact that all of them preclude syntactic/lexical lifting. Items 22 and 26 target explicit information but involve the processing of numerous details and key words (in the text and in the distractors).
The five items that the experts marked as the most difficult in Tasks 1-2 are, in order of perceived difficulty, Items 5, 4, 16, 3 and 15. Items 5 and 3 are presented above. Items 4 and 15 rely on the test-takers' familiarity with the C1⁶ word 'flee' and the low-frequency, subject-specific word 'doge' (chief magistrate of the Venetian Republic). Item 15 is also cognitively demanding, since it includes contrasting.

⁶ The CEFR level as provided in the online dictionary Cambridge Dictionaries Online (based on the English Vocabulary Profile).

Items Perceived as the Easiest
The five items that the test-takers saw as the easiest in Tasks 1-3 are, starting with the easiest, Items 7, 1, 2, 12 and 11.The short-answer items test the ability to identify explicit details and allow the answers to be recovered verbatim.The gapped-text items are syntactically and lexically undemanding and contain explicit lexico-grammatical cohesion links to the rest of the text.
The same justification can be given for the five items that the experts judged as the easiest in Tasks 1-2: starting with the easiest, these are Items 7, 11, 2, 19 and 1.

Gaps between Perceptions and Performance
In line with RQ2, we also observed the characteristics of the reading comprehension items that exhibit the greatest differences between test-taker perceptions/expert judgments and psychometric statistics.
All of these items are judged as more difficult than they are according to the statistics. The difficulty of these items is related to their implicitness (Items 3 and 8), demanding vocabulary (Items 4 and 26) and comparison (Item 27).
The items that the experts most noticeably misperceived (in Tasks 1-2) are Items 4, 14, 17, 12, 3 and 8 (a sixth item is included in the analysis because the numerical gap between the perceptions and the actual performance was identical for Items 3 and 8).
Like the test-takers, the experts perceive the items as more difficult than they actually are. Items 3, 4 and 8 are discussed above, while Items 12, 14 and 17 are structurally ambiguous (from a lexico-grammatical perspective, more than one option fits the gap).

Test-takers' Qualitative Comments
In the while-reading questionnaire, the test-takers were asked to comment on the factors that influence item difficulty. Their responses are presented by task-type. Given the unstable status of the affective variables (see above), we must mention that the perceived difficulty reported by the test-takers included in our study may differ somewhat from that of the test-takers sitting the GM examination, due to different circumstances (testing situation, motivation, etc.).
The five items that the test-takers marked as the most difficult in Tasks 1-3 include two short-answer items (3 and 5) and three multiple-choice items (22, 26 and 27).
The 64 comments on the short-answer items (3 and 5) have been clustered as follows:
- The answer is not explicitly stated (62.50%): 'the answer has to be deduced from the text', 'the answer is not in the text';
- Issues with the question (25.00%): 'misleading', 'difficult to understand';
- Issues with the text (9.38%): 'the text is not clear', 'the article is ambiguous';
- General comments (3.12%): 'I don't know the answer', 'difficult'.
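Percentages of this kind follow from a simple frequency count over the coded comment labels. A minimal sketch (the labels are our own wording, with counts of 40, 16, 6 and 2 reconstructed from the reported percentages of the 64 comments):

```python
from collections import Counter

# One cluster label per coded comment; counts reconstructed from the
# percentages reported above (40 + 16 + 6 + 2 = 64 comments).
labels = (
    ["answer not explicitly stated"] * 40
    + ["issue with the question"] * 16
    + ["issue with the text"] * 6
    + ["general comment"] * 2
)

counts = Counter(labels)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {100 * n / total:.2f}%")
```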

Discussion and Conclusions
The study explores the relationship between the objective and subjective difficulty of the GM reading comprehension test. Overall, the findings confirm the predictions of the taxonomies proposed by Khalifa and Weir (2009) and Grabe (2009): the higher the taxonomic level, the more challenging the item is for the reader. The most frequent difficulties reported by our test-takers thus involve the higher-order skills of inferencing and text processing. Nevertheless, a common factor contributing to the difficulty of the test is also vocabulary, the recognition of which is ranked as a lower-order skill. We can observe that all these reading obstacles are reliably detected by both the test-takers and the expert judges.
Our empirical data show that test-takers are reliable judges of item difficulty. Their perceptions closely correlate (0.73) with their performance on the examination. This observation corroborates the previous findings of Apostolou (2010: 45-47) on test-taker perceptions of item difficulty in listening comprehension tests; we have thus shown that a similar conclusion can be extended to reading comprehension as well. Another finding is that the CEFR alignment judges are even more accurate in their assessment of testing items (0.83), which is expected given their training and professional experience. This supports the findings of Fortus, Coriat and Fund (1998), who report a very similar correlation of 0.82 for trained judges assessing reading comprehension items. In contrast, our result partly refutes some previous studies (Alderson and Lukmani 1989; Sydorenko 2011) claiming that (experienced) item-writers' intuitions are weak predictors of item difficulty. We propose that this difference is the result of the training that the CEFR-relating judges received. The correlations observed herein attest to the reliability of both test-taker difficulty perceptions and expert judgments, and thus demonstrate their relevance for test-design, CEFR-alignment procedures, and assessment in teaching.
Despite the otherwise consistently high correlations between the perceptions (of both the test-takers and the experts) and the psychometric data, we can observe that for some items the differences are quite pronounced. In the case of the test-takers, this is typical of items that test implicit information, prevent recovering the answers verbatim, and contain low-frequency vocabulary. There are also common issues with the overall comprehension of the text, even with items that target explicit information. In the group of experts, the most problematic items are likewise related to less frequent language and, with regard to gapped-text items, to gaps that are structurally ambiguous and thus rely mostly on comprehension of the context. Also noteworthy is that the most noticeably misperceived items often overlap with the items perceived/judged as the most difficult. It would appear that the perception of difficulty is intensified when test-takers or judges encounter the most challenging items.
A closer analysis of the reading items also reveals that the items perceived as the most demanding involve processing implicit information and main ideas. Such a finding confirms Lund's (1991) study, which established these factors as challenging in a reading test, and supports the idea that there is a link between the difficulty of a skill and its taxonomic position (cf. Freedle and Kostin 1999; Khalifa and Weir 2009). However, we also observe that some of the items perceived as very difficult are detail explicit items. Our analysis and the test-takers' responses in the while-reading questionnaire indicate that such items include some other factor contributing to their difficulty, such as overall text comprehension issues and, quite frequently, the presence of challenging or over-demanding vocabulary. This also demonstrates the impact that linguistic factors have even on lower-order questions: a test-taker may be cognitively capable of higher-level processing (in terms of Grabe 2009) but still fail to answer a question owing to a word recognition issue. In comparison, the items perceived as the least demanding involve the identification of explicit details that can be recovered verbatim from the text. With regard to the gapped-text items, we conclude that the main factors contributing to their simplicity are syntactic and lexical accessibility. This, too, is consistent with the relevant literature: for instance, the impact of sentence length and complex vocabulary has been shown in Freedle and Kostin's work (1993), and both notions are included in the CEFR descriptors. The above findings are well supported by the responses collected in the while-reading questionnaire.
The study also offers some valuable insights into task-types. Firstly, the results (quantitative and qualitative) further attest to the importance of using a variety of task-types in a test (see Alderson 2004 and Weir 2005, for instance). Secondly, while all task-types are consistently perceived as more difficult than they really are, the average facility values perceived by the test-takers are the most accurate for the most difficult of the three tasks in this study, i.e. the multiple-choice task. This finding is relevant in light of Sydorenko's (2011: 43) claim that item-writers seem to have difficulties distinguishing between intermediate- and advanced-level items. If item-writers fail to distinguish some difficulty levels, then this gap can be filled by including the perceptions of test-takers, who have proved to be very successful in predicting the average facility values of the most difficult task in our study. Admittedly, judging from the observed correlations, the experts are not very likely to fail in their predictions; however, in contexts where item-writers are unable to receive sufficient training, such an alternative to piloting and pre-testing procedures is desirable.
The findings presented herein will not only help test-developers and item-writers predict item/task difficulty and give them an insight into test-takers' perception of difficulty, but also provide practical implications for FL teaching and curriculum development. For instance, the study shows that the micro-skills in reading comprehension that B2-level students feel most insecure about include searching for main ideas and, perhaps most significantly, reading for implicit information. In addition, the data indicate that more emphasis should be placed on strategies for tackling unknown vocabulary. Such skills, incidentally, are already part of the CEFR descriptors for level B2, which serve as the curricular basis for the national reading test analysed in this study.
From 2008 to 2013, the RIC conducted a project that aligned all national examinations in English with the CEFR (Council of Europe 2001). The judging panel consisted of eleven to twelve experienced Slovenian education professionals. Most of the panellists are primary, secondary or tertiary teachers of English who cooperate with the RIC as item-writers and/or test-developers. The project's final report, published in 2014, states that the GM is aligned with the B2 level of the CEFR scale (Bitenc Peharc and Tratnik 2014).

Table 1. The twenty-eight reading items used in the reading comprehension test

Together with the reading tasks, the respondents were given a while-reading questionnaire, which had to be completed after answering each item. The respondents were asked to answer two questions for each item: (i) whether they found the item easy/moderate/difficult (a close-ended question), and (ii) what made the item easy/moderate/difficult (an open-ended question). We decided that the test-takers should evaluate the item difficulty level on a three-point scale (i.e. easy/moderate/difficult) so that the results could be directly compared with the facility values⁴ from the official exam reports of the test provider, taking into consideration the test provider's ranking of the items, which coincides with the rankings proposed in the relevant literature (see Bailey 1998).

The expert judgments analysed herein (Bitenc Peharc and Tratnik 2014) are taken from the Slovenian alignment project. As stated in the project's final report (Bitenc Peharc and Tratnik 2014), the reading subtest of the GM was aligned to the B2 level, with its cut score set at 80%. During the standard-setting procedure for the reading comprehension subtest, the panellists used a combination of the Angoff and the Basket Methods⁵ (op. cit.: 10) in order to minimize the influence of a particular method on the final standard-setting results, which is also in accordance with the recommendations for CEFR-alignment projects (Council of Europe 2009: 61-65, 75-77; Kaftandijeva 2010: 131). Using the Basket Method, the experts ranked items as B1, B2 or C1, each abbreviation reflecting a CEFR level assigned during the evaluation procedure. We converted their descriptive evaluations into numeric values in the same fashion as the participants' judgments about item difficulty. Since the GM targets the B2 level, we considered B1 items as easy, B2 items as moderate, and C1 items as difficult; consequently, the numeric values of 0.95, 0.50 and 0.05 were assigned, respectively. As shown later (section 3.5), our proposed numeric conversion of the Basket judgments correlates highly with the experts' numeric item difficulty perception values based on the Angoff Method.

⁴ These are calculated with classical test theory. Facility values, together with other statistical data and their interpretation, are included in the test provider's final report, published electronically for each administered exam (http://www.ric.si/splosna_matura/statisticni_podatki/?lng=eng).

⁵ The Basket Method builds directly on the connection between an item and the CEFR descriptors: to align an item to a CEFR level, a panellist has to establish at what CEFR level a test-taker can already answer the item correctly (Council of Europe 2009: 75). The Angoff Method, on the other hand, is based on the notion of a 'minimally acceptable person' or 'minimally competent candidate' at the targeted level: for each item, a panellist has to decide how likely it is that such a test-taker will answer the item correctly (Council of Europe 2009: 61).
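To make the two quantities above concrete, the sketch below computes a facility value (the proportion of correct responses, as in classical test theory) and an Angoff-style cut score (the sum over items of the mean panellist probability that a minimally competent candidate answers correctly). Function names and all data are invented for illustration:

```python
def facility_value(responses):
    """Classical-test-theory facility value: proportion of correct (1) responses."""
    return sum(responses) / len(responses)

def angoff_cut_score(estimates_by_item):
    """Sum over items of the mean panellist probability that a minimally
    competent candidate answers the item correctly."""
    return sum(sum(est) / len(est) for est in estimates_by_item)

# Invented data: 1 = correct, 0 = incorrect, one item across five test-takers.
print(facility_value([1, 1, 0, 1, 0]))  # 0.6

# Invented data: three panellists' probability estimates for three items.
item_estimates = [
    [0.90, 0.80, 0.85],
    [0.60, 0.70, 0.65],
    [0.40, 0.50, 0.45],
]
print(round(angoff_cut_score(item_estimates), 2))  # raw-score cut out of 3 points
```

A Basket-style conversion, by contrast, needs no probability estimates: each B1/B2/C1 label is simply mapped to a numeric value, as described in the text.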