Oxford University Press

English Language Teaching Global Blog



Assessment Literacy – the key concepts you need to know!

Research shows that the typical teacher can spend up to a third of their professional life involved in assessment-related activities (Stiggins and Conklin, 1992), yet a lack of focus on assessment literacy in initial teacher training has left many teachers feeling less than confident in this area. In this blog, we’ll be dipping our toes into some of the key concepts of language testing. If you find this interesting, be sure to sign up for my Oxford English Assessment Professional Development assessment literacy session.

What is assessment literacy?

As with many terms in ELT, ‘assessment literacy’ has competing definitions; for this blog, we’re adopting Malone’s (2011):

Assessment literacy is an understanding of the measurement basics related directly to classroom learning; language assessment literacy extends this definition to issues specific to language classrooms.

As you can imagine, language assessment literacy (LAL) is a huge area. For now, though, we’re going to limit ourselves to the key concepts encapsulated in ‘VRAIP’.

What’s VRAIP?

VRAIP is an abbreviation for Validity, Reliability, Authenticity, Impact and Practicality. These are key concepts in LAL and can be used as a handy checklist for evaluating language tests. Let’s take each one briefly in turn.

Validity

Face, concurrent, construct, content, criterion, predictive… the list of types of validity goes on, but at its core, validity refers to how well a test measures what it sets out to measure. The different types of validity can highlight the strengths and weaknesses of language tests, tell us what test results say about the test taker, and help us see whether a test is being misused. Take construct validity. This refers to the appropriateness of any inferences drawn from the test scores; the test itself is neither valid nor invalid. With that in mind, would you say the test in Figure 1 is a valid classroom progress test of grammar? What about a valid proficiency speaking test?

Figure 1

Student A

Ask your partner the questions about the magazine.

 

1. What / magazine called?

2. What / read about?

3. How much?

Student B

Answer your partner with this information.

Teen Now Magazine

Download the Teen Now! app on your phone or tablet for all the latest music and fashion news.

Only £30 per year!

http://www.teennow.oup

Reliability

‘Reliability’ refers to consistency in measurement, and however valid a test, without reliability its results cannot be trusted. Reliability is usually established statistically, yet ironically there is a general distrust of statistics, reflected in the joke that “a statistician’s role is to turn an unwarranted assumption into a foregone conclusion”. This distrust is often rooted in a lack of appreciation of how statistics works, but the key statistical concepts are well within the average teacher’s grasp, and once you have mastered them, you are in a much stronger position to evaluate language tests critically.
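To make ‘consistency in measurement’ a little more concrete, here is a minimal sketch of one classic reliability check, test-retest reliability, expressed as a correlation between two sittings of the same test. The scores and the helper function are invented purely for illustration; they are not taken from any real test.

```python
# Illustration only: test-retest reliability as a correlation between two sittings.
# The scores below are invented for demonstration purposes.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Ten test takers sit the same test twice (practice effects magically removed)
first_sitting  = [55, 62, 70, 48, 81, 66, 59, 74, 52, 68]
second_sitting = [58, 60, 73, 50, 79, 68, 57, 76, 49, 70]

print(f"Test-retest reliability: {pearson(first_sitting, second_sitting):.2f}")
```

A value close to 1 suggests the test is measuring consistently; a low value would mean the scores cannot be trusted, however carefully the test was designed.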

Authenticity

The advent of Communicative Language Teaching in the 1970s saw a greater desire for ‘realism’ in the context of the ELT classroom, and since then the place of ‘authenticity’ has continued to be debated. A useful distinction to make is between ‘text’ authenticity and ‘task’ authenticity, the former concerning the ‘realness’ of spoken or written texts, the latter concerning the type of activity used in the test. Intuitively, it feels right to design tests based on ‘real’ texts, using tasks which closely mirror real-world activities the test taker might do in real life. However, as we will see in the Practicality section below, the ideal is rarely realised.

Impact

An English language qualification can open doors and unlock otherwise unrealisable futures. But the flip side is that a lack of such a qualification can play a gatekeeping role, potentially limiting opportunities. As Pennycook (2001) argues, the English language

‘has become one of the most powerful means of inclusion or exclusion from further education, employment and social positions’.

As language tests are often the arbiters of English language proficiency, we need to take their potential impact seriously.

Back in the ELT classroom, a more local instance of impact is ‘washback’, which can be defined as the positive and negative effects that tests have on teaching and learning. An example of negative washback that many exam preparation course teachers will recognise is the long hours spent teaching students how to answer weird, inauthentic exam questions, hours which could more profitably be spent on actually improving the students’ English.

Take the exam question in Figure 2, for instance, which a test taker has completed. To answer it, you need to make sentence B as close in meaning as possible to sentence A by using the upper-case word. But you mustn’t change the upper-case word. And you mustn’t use more than five words. And you must remember to count contracted words as their full forms. Phew! That’s a lot to teach your students. Is this really how we want to spend our precious time with our students?

By the way, the test taker’s answer in Figure 2 didn’t get full marks. Can you see why? The solution is at the end of this blog.

Figure 2

A    I haven’t received an invite from Anna yet.

STILL

B     Anna still hasn’t sent an invite.
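As an aside, the formal rules of this task type are mechanical enough that you could sketch them in a few lines of code. The checker below is purely hypothetical (it is not a real mark scheme) and, tellingly, it can only police the word limit and the key word; it says nothing about whether the meaning is ‘as close as possible’ to sentence A, which is exactly where the test taker in Figure 2 lost a mark.

```python
# Hypothetical sketch of the task's formal constraints; not an actual mark scheme.
# Rule 1: the key word must be used, unchanged.
# Rule 2: the gap may contain at most five words, counting contractions as full forms.

CONTRACTIONS = {"hasn't": "has not", "haven't": "have not", "isn't": "is not"}

def word_count(gap: str) -> int:
    """Count the words in the gap, expanding contractions to their full forms first."""
    words = []
    for token in gap.split():
        words.extend(CONTRACTIONS.get(token.lower(), token).split())
    return len(words)

def meets_formal_rules(gap: str, key_word: str) -> bool:
    """True if the key word appears unchanged and the expanded word count is five or fewer."""
    return key_word.lower() in gap.lower().split() and word_count(gap) <= 5

print(meets_formal_rules("still hasn't sent", "STILL"))     # True, but loses a mark on meaning
print(meets_formal_rules("still hasn't sent me", "STILL"))  # True, and keeps the meaning intact
```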

This type of negative washback is typically caused by test design that emphasises reliability at the expense of authenticity. But before we get too critical, we need to appreciate that balancing all these elements is always an exercise in compromise, which brings us nicely to the final concept in VRAIP…

Practicality

There is always a trade-off between validity, reliability, authenticity and impact. Want a really short placement test? Then you’re probably going to have to sacrifice some construct validity. Want a digitally-delivered proficiency test? Then you’re probably going to have to sacrifice some authenticity. Compromise in language testing is inevitable, so we need to be assessment literate enough to recognise when VRAIP is sufficiently balanced for a test’s purpose. If you’d like to boost your LAL, sign up for my assessment literacy session.

If you’re a little rusty, or new to key language assessment concepts such as validity, reliability, impact, and practicality, then my assessment literacy session is the session for you:

Register for the webinar

Solution: The test taker did not get full marks because their answer was not ‘as close as possible’ to sentence A. To get full marks, they needed to write “still hasn’t sent me”.

 


References

  • Malone, M. E. (2011). Assessment literacy for language educators. CAL Digest, October 2011.
  • Pennycook, A. (2001). English in the World/The World in English. In A. Burns & C. Coffin (Eds.), Analysing English in a Global Context: A Reader. London: Routledge.
  • Stiggins, R. J., & Conklin, N. F. (1992). In teachers’ hands: Investigating the practices of classroom assessment. Albany: State University of New York Press.

 

Colin Finnerty is Head of Assessment Production at Oxford University Press. He has worked in language assessment at OUP for eight years, heading a team which created the Oxford Young Learners Placement Test and the Oxford Test of English. His interests include learner corpora, learning analytics, and adaptive technology.



Adaptive testing in ELT with Colin Finnerty | ELTOC 2020

OUP offers a suite of English language tests: the Oxford Online Placement Test (for adults), the Oxford Young Learners Placement Test, the Oxford Test of English (a proficiency test for adults) and, from April 2020, the Oxford Test of English for Schools. What’s the one thing that unites all these tests (apart from them being brilliant!)? Well, they are all adaptive tests. In this blog, we’ll dip our toes into the topic of adaptive testing, which I’ll be exploring in more detail in my ELTOC session. If you like this blog, be sure to come along to the session.

The first standardised tests

Imagine the scene. A test taker walks nervously into the exam room, hands in any forbidden items to the invigilator (a bag, mobile phone, notepad, and so on) and is escorted to a randomly allocated desk, separated from the other desks to prevent copying. The test taker completes a multiple-choice test, anonymised to protect against potential bias from the person marking it, all under the watchful eyes of the invigilators. Sound familiar? Now imagine this isn’t happening today, but over one and a half thousand years ago.

The first recorded standardised test dates back to the year 606. A large-scale, high-stakes examination for the Chinese civil service, it pioneered many of the examination procedures that we take for granted today. And while the system had many features we would shy away from today (the tests were so long that people died while trying to finish them), this approach to standardised testing lasted a millennium until it came to an end in 1905. Coincidentally, that same year the next great innovation in testing was introduced by the French polymath Alfred Binet.

A revolution in testing

Binet was an accomplished academic whose research included investigations into palmistry, the mnemonics of chess players, and experimental psychology. But perhaps his best-known contribution is the IQ test. The test broke new ground, not only because it was the first attempt to measure intelligence, but also because it was the first ever adaptive test. Adaptive testing was an innovation well ahead of its time, and it was another 100 years before it became widely available. But why? To answer this, we first need to explore how traditional paper-based tests work.

The problem with paper-based tests

We’ve all done paper-based tests: everyone gets the same paper of, say, 100 questions. You then get a score out of 100 depending on how many questions you got right. These tests are known as ‘linear tests’ because everyone answers the same questions in the same order. It’s worth noting that many computer-based tests are actually linear, often being just paper-based tests which have been put onto a computer.

But how are these linear tests constructed? Well, they focus on “maximising internal consistency reliability by selecting items (questions) that are of average difficulty and high discrimination” (Weiss, 2011). Let’s unpack what that means with an illustration. Imagine a CEFR B1 paper-based English language test. Most of the items will be around the ‘middle’ of the B1 level, with fewer questions at either the lower or higher end of the B1 range. While this approach provides precise measurements for test takers in the middle of the B1 range, test takers at the extremes will be asked fewer questions at their level, and therefore receive a less precise score. That’s a very inefficient way to measure, and is a missed opportunity to offer a more accurate picture of the true ability of the test taker.
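Here is a small illustration of that point, using an invented difficulty scale where 0.0 stands for mid-B1, negative values are easier, and positive values are harder. The difficulties are made up (real papers are calibrated from trial data); the code simply counts how many items sit close to a low, mid, and high B1 test taker.

```python
# Invented item difficulties for a hypothetical B1 paper, clustered around mid-B1 (0.0).
item_difficulties = [
    -1.0, -0.8, -0.6, -0.4, -0.3, -0.2, -0.1, 0.0, 0.0, 0.1,
     0.1,  0.2,  0.2,  0.3,  0.4,  0.6,  0.8,  1.0,
]

def items_near(ability: float, difficulties, window: float = 0.3) -> int:
    """Count the items whose difficulty lies within `window` of a test taker's level."""
    return sum(1 for d in difficulties if abs(d - ability) <= window)

for label, ability in [("low B1", -1.0), ("mid B1", 0.0), ("high B1", 1.0)]:
    print(f"{label}: {items_near(ability, item_difficulties)} items near this level")
```

The mid-B1 test taker meets far more items at their own level than the test takers at either extreme, which is why their score comes with less measurement error.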

Standard Error of Measurement

Now we’ll develop this idea further. The concept of Standard Error of Measurement (SEM), from Classical Test Theory, is that whenever we measure a latent trait such as language ability or IQ, the measurement will always contain some error. To illustrate, imagine giving the same test to the same test taker on two consecutive days (magically erasing their memory of the first test before the second to avoid practice effects). While their ‘True Score’ (i.e. underlying ability) would remain unchanged, the two measurements would almost certainly show some variation. SEM is a statistical measure of that variation: the smaller it is, the more reliable the test score is likely to be. Applying this concept to the paper-based test example in the previous section, we find that the SEM is higher for test takers at both the lower and higher extremes of the B1 range.
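For readers who like to see the arithmetic, Classical Test Theory estimates the SEM from the spread of observed scores and the test’s reliability: SEM = SD × √(1 − reliability). The numbers below are invented simply to show the calculation.

```python
# Classical Test Theory: SEM = observed-score standard deviation * sqrt(1 - reliability).
# The values are invented; they do not describe any particular test.
from math import sqrt

def standard_error_of_measurement(sd_observed: float, reliability: float) -> float:
    """Estimate the SEM from the observed-score SD and a reliability coefficient (0 to 1)."""
    return sd_observed * sqrt(1 - reliability)

sem = standard_error_of_measurement(sd_observed=12.0, reliability=0.91)
print(f"SEM is roughly {sem:.1f} points")  # about 3.6 points of 'noise' around any one score
```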

Back to our B1 paper-based test example. In Figure 1, the horizontal axis of the graph shows B1 test scores going from low to high, and the vertical axis shows increasing SEM. The higher the SEM, the less precise the measurement. The dotted line illustrates the SEM. We can see that a test taker in the middle of the B1 range will have a low SEM, which means they are getting a precise score. However, the low and high level B1 test takers’ measurements are less precise.

Aren’t we supposed to treat all test takers the same?

Figure 1.

How computer-adaptive tests work

So how are computer-adaptive tests different? Well, unlike linear tests, computer-adaptive tests have a bank of hundreds of questions which have been calibrated with different difficulties. The questions are presented to the test taker based on a sophisticated algorithm, but in simple terms, if the test taker answers the question correctly, they are presented with a more difficult question; if they answer incorrectly, they are presented with a less difficult question. And so it goes until the end of the test when a ‘final ability estimate’ is produced and the test taker is given a final score.
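The blog describes the algorithm only ‘in simple terms’, so the sketch below is a deliberately naive step-up/step-down version with invented numbers, not the sophisticated algorithm a real computer-adaptive test uses. It just shows the core loop: a correct answer raises the difficulty of the next question, an incorrect answer lowers it.

```python
import random
from math import exp

# Deliberately simplified adaptive loop; real tests use calibrated item banks and
# statistical ability estimation, not a fixed step size. All numbers are invented.
def run_adaptive_test(true_ability: float, num_items: int = 20, step: float = 0.5) -> float:
    difficulty = 0.0  # start with a question of middling difficulty
    for _ in range(num_items):
        # A test taker is more likely to answer correctly when the item is below their level
        p_correct = 1 / (1 + exp(difficulty - true_ability))
        answered_correctly = random.random() < p_correct
        difficulty += step if answered_correctly else -step  # harder if right, easier if wrong
    return difficulty  # a crude stand-in for the 'final ability estimate'

random.seed(1)
print(f"Estimated ability: {run_adaptive_test(true_ability=1.2):.2f}")
```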

Binet’s adaptive test was paper-based and must have been a nightmare to administer. It could only be administered to one test taker at a time, with an invigilator marking each question as the test taker completed it, then finding and administering each successive question. But the advent of the personal computer means that questions can be marked and administered in real-time, giving the test taker a seamless testing experience, and allowing a limitless number of people to take the test at the same time.

The advantages of adaptive testing

So why bother with adaptive testing? Well, there are lots of benefits compared with paper-based tests (or indeed linear tests delivered on a computer). Firstly, because the questions are at just the right level of challenge for each individual, the SEM is the same for each test taker, and scores are more precise than in traditional linear tests (see Figure 2). This means that each test taker is treated fairly. Another benefit is that, because adaptive tests are more efficient, they can be shorter than traditional paper-based tests. That’s good news for test takers. And because the questions presented are always at just the right level of challenge, test takers won’t be stressed by questions which are too difficult, or bored by questions which are too easy.

This is all good news for test takers, who will benefit from an improved test experience and confidence in their results.

 

Figure 2.


Colin spoke further on this topic at ELTOC 2020. Stay tuned to our Facebook and Twitter pages for more information about upcoming professional development events from Oxford University Press.


Colin Finnerty is Head of Assessment Production at Oxford University Press. He has worked in language assessment at OUP for eight years, heading a team which created the Oxford Young Learners Placement Test and the Oxford Test of English. His interests include learner corpora, learning analytics, and adaptive technology.

You can catch up on past Professional Development events using our webinar library.

These resources are available via the Oxford Teachers’ Club.

Not a member? Registering is quick and easy to do, and it gives you access to a wealth of teaching resources.


References

Weiss, D. J. (2011). Better data from better measurements using computerized adaptive testing. Journal of Methods and Measurement in the Social Sciences, 2(1), 1–27.

Oxford Online Placement Test and Oxford Young Learners Placement Test: www.oxfordenglishtesting.com

The Oxford Test of English and Oxford Test of English for Schools: www.oxfordtestofenglish.com