Presidential Address

Invited Paper 1 (Qian)

Invited Paper 2 (Phakiti)

Invited Paper 3 (Isaacs)

Invited Paper 4 (Lee)








Presidential Address Abstract



Antony John Kunnan, Nanyang Technological University, Singapore


How can a language assessment be evaluated using fairness and justice?


Evaluations of language assessments often invoke the concept of fairness. The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) include a section titled “Fairness in Testing,” and the codes of ethics and practice of assessment agencies such as ILTA, ETS, and ALTE also address fairness. Recent publications have treated fairness as situated ethics (Kunnan & Davidson, 2004) and shown how fairness can be investigated (Xi, 2010). The term justice is less well known in the assessment literature, although the idea of justice has been discussed in writings from Plato to Rawls and Sen. The concept includes “distributive justice,” which refers to institutions distributing the benefits they provide to a society in a just manner. In language assessment, Kunnan (2004, 2008, 2010, 2014) has tied the two concepts together, while McNamara and Roever (2010) have offered separation and clarity.


Based on the work of Rawls and Sen, this talk will present principles and sub-principles of fairness and justice for the evaluation of language assessments. It will apply the idea of fairness as relating to persons (how assessments ought to be fair to test takers) and the idea of justice as relating to institutions (how institutions ought to be just to test takers).


Principle 1: The Principle of Fairness: An assessment ought to be fair to all test takers.

Sub-principle 1: An assessment ought to provide all test takers adequate opportunity to learn the knowledge, abilities, or skills to be assessed.

Sub-principle 2: An assessment ought to be consistent and meaningful in terms of its test-score interpretation for all test takers.

Sub-principle 3: An assessment ought to be free of bias against any test taker, in particular bias introduced by assessing construct-irrelevant matters.

Sub-principle 4: An assessment ought to use appropriate access, administration, and standard setting procedures so that decision-making is equitable for all test takers.


Principle 2: The Principle of Justice: An assessment institution ought to be just.

Sub-principle 1: An assessment institution ought to bring benefits to society by making a positive social impact.

Sub-principle 2: An assessment institution ought to advance justice through public reasoning of its assessment.


A discussion of these principles and sub-principles, along with the warrants that accompany them (framed in the Toulmin (1958) argumentation model of grounds, warrants, backing, rebuttals, etc.; Kane, 2010; Bachman, 2005), is then presented. Finally, sample empirical studies are presented and mapped onto the Toulmin model so that evaluations can be made regarding claims of fairness and justice for particular language assessments.













Invited Papers Abstracts



Invited Paper 1


David Qian, Hong Kong Polytechnic University, Hong Kong


Putting academic vocabulary lists to the test: Measuring the academicality of different generations of the TOEFL


In the context of profiling spoken discourse features in academic settings, with three academic spoken corpora of three million words as the database, it became clear that academic vocabulary plays an important role in such discourse. Existing research in other contexts also contends that vocabulary knowledge plays a significant role in listening comprehension, providing further support for our argument that knowledge of academic vocabulary facilitates academic communication and that academic vocabulary should form an important element of an English proficiency test for academic purposes. Since the creation of the University Word List (Xue & Nation, 1984), a number of academic vocabulary lists have appeared, including the Academic Word List (AWL; Coxhead, 2000), the Academic Formulas List (AFL; Simpson-Vlach & Ellis, 2010), the PHRASE List (Martinez & Schmitt, 2012), and the Academic Vocabulary List (AVL; Gardner & Davies, 2013). These lists were created following different frameworks and therefore differ in their scope of coverage. The present study aims to evaluate the usefulness of two of these lists, the AVL and the AFL, for detecting academicality, i.e., the density of academic vocabulary, in different generations of the TOEFL, a test intended to determine candidates’ suitability for academic study at university. In the present study, I first analyze the approaches and criteria adopted in creating these vocabulary lists and then apply the lists to profiling the academic lexical coverage of the listening and reading sub-tests of multiple forms of the TOEFL PBT and TOEFL iBT. I will finally report my findings from these analyses.












Invited Paper 2


Aek Phakiti, University of Sydney, Australia


Test-takers’ calibration and strategy use in IELTS listening tasks


To date, little is known about the ‘optimal condition’ in which appropriate and desirable strategy use results in significantly better test performance. Language testing researchers and test developers lack a sufficient empirical understanding of test-takers’ metacognitive judgments about their current test performance and of the factors affecting their judgment accuracy. This presentation reports on an empirical study that investigates test-takers’ calibration and its relationship to reported strategy use in an IELTS listening test. Test-taker calibration denotes the degree of correspondence between confidence in performance success and actual performance outcomes. In other words, a study of calibration aims to evaluate the alignment between test-takers’ perceived confidence and their actual performance. Calibration or miscalibration thus indicates the nature of test-takers’ metacognitive judgment, monitoring accuracy, and/or self-appraisal. A total of 388 English as a second language (ESL) test-takers in Australia took part in this study. Before they took the listening test, they were asked to report on their general strategy use in IELTS listening tests. While completing each of the IELTS test questions, they reported their level of confidence in the correctness of their answer (from 0% to 100%), and at the end of the test they reported the cognitive and metacognitive strategies they had used during the listening test. Their calibration, confidence, and reported strategy use scores were analyzed using several statistical tests, in particular a structural equation modeling (SEM) approach. It was found that, on average, the test takers were not well calibrated and tended to be overconfident across the listening test sections. Their calibration scores and confidence ratings were positively, yet only marginally, related to their cognitive and metacognitive strategy use.
The present study advances our knowledge of the strategic processes, including calibration and strategy use, that form part of or affect IELTS listening test performance. Implications of the study and recommendations for future research will also be articulated.













Invited Paper 3


Talia Isaacs, University of Bristol, U.K.


Perceptions and ratings of lay listeners, teachers, and examiners in L2 pronunciation scale development and validation


In second language (L2) pronunciation assessment, the consequences of an intuitive/experiential approach to rating scale development have led to shortcomings in the quality of the pronunciation descriptors used in current scales (Isaacs, 2013). For example, the main CEFR scales, which were compiled from assorted intuitively-derived descriptors, exclude reference to pronunciation, partially reflecting the inadequacy of those descriptors. Speaking scales that include a pronunciation component are also problematic. Some haphazardly describe behavioural indicators across levels (e.g., ACTFL), whereas others are so general that the specific linguistic features that constitute level distinctions are often unclear (e.g., IELTS). Still other scales implicitly or directly equate increased intelligibility (i.e., understandability of L2 speech) with the reduced presence of a foreign accent (e.g., the CEFR phonological control scale). However, this practice contradicts strong research evidence that perfectly intelligible speech does not preclude a noticeable L2 accent, and that a heavy accent is not necessarily a hallmark of unintelligible speech (Derwing & Munro, 2009). Developing an evidential basis for operationalizing pronunciation features in rating scales is essential for generating more valid assessments.


This paper will report on a research program that uses raters’ perceptions and judgments of L2 speech to better understand the linguistic properties underlying speech that is easily understandable (often termed ‘intelligible’ or ‘comprehensible’ in rating scales). After presenting findings from an initial scale development study involving ‘lay’ raters’ judgments of L2 French learners’ comprehensibility and a follow-up study on speaker first-language effects (Crowther et al., 2014), the paper will turn to focus group and rating data from eight accredited IELTS examiners on their impressions of the IELTS Pronunciation scale. Implications for pronunciation scale development and validation, including challenges in teacher-raters’ pronunciation literacy, will be discussed.














Invited Paper 4


Yong-Won Lee, Seoul National University, South Korea

Reconsidering consequential validity in diagnostic language assessment


Diagnostic language assessment (DLA) is attracting a great deal of attention from language testing researchers and practitioners. DLA is designed to identify learners’ weaknesses, as well as their strengths, in a targeted domain of communicative competence. One unique feature of DLA is that it has an explicit goal of positively impacting subsequent learning by providing learners with diagnostic feedback and (guidance for) remedial activities. One implication of this learning-inducing characteristic of DLA for validation is that evidence for consequential validity should be carefully collected and evaluated in support of the accuracy, meaningfulness, and effectiveness of diagnosis, feedback, and remedial learning/instruction based on assessment results.

Despite this strong need for careful evaluation of consequences in DLA, however, there have been ongoing debates in the measurement community regarding whether it is justifiable to include the consequences of testing in validity frameworks for psychological and educational tests (Borsboom, 2006; Kane, 2009; Lissitz & Samuelsen, 2007; Markus & Borsboom, 2013; Messick, 1989; Popham, 2007; Shepard, 1997). Reductionists claim that the notion of validity should be confined to the accuracy of score-based inferences, whereas expansionists argue for including the consequences of test use and score-based actions in the validity framework. In this regard, DLA seems to provide a good testing ground for refining the rationales and methods for dealing with consequences in validity frameworks.
Against this background, the major goals of the study are to: (a) re-examine the major arguments for and against the inclusion of consequences in the validity framework from the perspective of DLA, (b) identify some of the major issues that need to be considered in creating validity frameworks for DLA, and (c) propose and illustrate two alternative validity (or utility argument) frameworks for DLA. In the paper, I also argue that the scope of validation and evaluation in DLA should include not only quantitative information (scores) but also its linkage to qualitative information describing the nature of the attributes being measured, learners’ proficiency levels on those attributes, weakness-strength patterns, and referral information for recommended remedial activities.