October 9, 2016

Good Assessment: What is "Good"?

As instructional designers, we often have to either write ourselves, or work with SME-written, assessments, quizzes and other varieties of multiple (or not so multiple) choice questions. And often we're horrible at it. Before investing time into learning the principles behind assessment writing, I made every mistake that was possible to make, once again proving that learning without theory is not learning.

A Good Assessment is...

While a proper evaluation of assessment quality, which would be necessary for high-stakes assessment, requires specialised knowledge and skills, in my daily work I have found it extremely helpful simply to be aware of the underlying complexity of assessment development. That is why in this post I would like to provide a simplified overview of assessment quality criteria: even a basic knowledge of these has helped me design better assessments.

So, in simple terms, an assessment is good when it is valid and reliable.

Understanding Validity

Validity, put simply, means that your assessment is measuring what it is supposed to. There are multiple ways of measuring the validity of your test, but they ultimately depend on the purpose of your assessment, which should always be formulated clearly. For example, your assessment may claim that anyone who scores X out of Y points will be able to perform a job or a task they were taught. In this case, in order for your assessment to be valid, you need to confirm the correlation between the test score and the job performance. If you find that people who failed your assessment are performing just as well as anyone who passed, your assessment is most likely invalid.

More often, however, you will want to assess the achievement of learning objectives. In this case, the validity of your assessment can be evaluated by verifying that:

  • Questions are linked to the learning objectives. In fact, each test item should measure only one learning objective, or only one "thing". While it may seem like a good idea to create complex questions that relate to two or more learning objectives, this makes it impossible to identify the reason behind a failed question. How can you know which learning objective wasn't achieved?
  • There is an acceptable proportion of questions for each objective. For example, a learning objective such as "Will be able to save a document" can most likely be assessed with one test question, while "Will be able to troubleshoot technical issues" almost certainly needs more than one.
  • The cognitive level of test items is adequate. For example, if your objective is "To analyse the impact of World War II on the global economy", a question like "In which year did World War II start?" is not valid for this learning objective. This is where Bloom's taxonomy and well-formulated learning objectives are vital.
  • The difficulty of test items, and of the assessment as a whole, is appropriate. While defining difficulty can be quite complex, at the very least you can ask a group of SMEs, as well as someone who closely resembles the assessment's target audience, to review your assessment draft and establish its level of difficulty.
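The first two checks above can even be partly automated. Here is a minimal sketch of a "blueprint" check, assuming each question has already been tagged with the single learning objective it measures; the objective names, question IDs and minimum counts below are invented for illustration:

```python
# Hypothetical blueprint check: does each objective get enough questions,
# and is every question tagged with exactly one objective?
from collections import Counter

# Illustrative blueprint: objective -> minimum number of questions required.
required = {"save_document": 1, "troubleshoot": 3}

# Each question is tagged with the one objective it measures.
questions = [
    ("Q1", "save_document"),
    ("Q2", "troubleshoot"),
    ("Q3", "troubleshoot"),
    ("Q4", "troubleshoot"),
]

coverage = Counter(objective for _, objective in questions)

for objective, minimum in required.items():
    have = coverage.get(objective, 0)
    status = "OK" if have >= minimum else "UNDER-COVERED"
    print(f"{objective}: {have}/{minimum} questions - {status}")
```

Even a simple table like this, kept next to the assessment draft, makes gaps in objective coverage visible at a glance.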

Understanding Reliability

A valid test measures what it's supposed to. A reliable test does so consistently. For a quick example of the difference between validity and reliability, take a look at this ruler:

It is very reliable (it will consistently measure the same distance), but invalid (because it actually has 8 centimeters instead of the declared 9).

Assessment reliability measurement is usually complex and requires specific knowledge. Therefore, if you need to create a high-stakes assessment, such as a final exam or a certification program which has impact on the learners' employment or study outcomes, I would highly recommend hiring a professionally trained assessment specialist to assist you in this process. However, for the development of low-stakes assessment it is helpful to keep in mind the general principles.

Generally speaking, assessment reliability can be verified by:

  • Test/Retest - administering the same test to the same audience over a period of time. For example, if you have ever taken the same personality test several times and got a different result each time (and there's nothing in your life that could dramatically affect your personality), then the test may not be reliable.
  • Equivalence - successive administration of parallel forms of the same test. In very simple terms, this means that you will need to create two versions of the test, with questions that are different but assess the same learning objectives / skills / knowledge areas.
  • Item reliability - the degree to which test questions that address the same learning objective / skill / knowledge area produce similar results.
  • When evaluating test questions, it is also helpful to consider item discrimination, i.e. whether the item can distinguish good performers from poor ones. For example, are there test questions that high performers consistently find confusing?
  • Inter-rater reliability - if you're using human graders (for essay questions or observations), you should verify to what extent the grades for each test item can vary between individual graders.

Based on my practical experience with assessments following internal on-boarding programs, the reliability of an assessment can be compromised by unreliable training processes. This is particularly true if your training program is being run across multiple locations, by training teams that are not operating jointly or even belong to different legal entities within your corporation. Local issues, miscommunication between departments, lack of support for trainers - all of these factors should be considered before drawing conclusions about either the quality of the assessment or the learners' performance.

Check Your LMS

It may not always be obvious, but some learning management systems (LMS) can actually help you assess your assessment. Moodle, in particular, is extremely helpful in this regard. For example, it can identify test questions that have poor discrimination or inadequate difficulty. At the very least, your LMS should provide you with the percentage of learners answering each question correctly/incorrectly (this is what is called "difficulty") - pay attention to questions with extreme percentages (whether low or high) and particularly to 50/50 splits. These can indicate either an issue with the question, or an issue outside of the test, e.g. outdated or inadequate training.
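If your LMS only exports raw answers, both statistics are easy to approximate yourself. Below is a toy sketch assuming a small response matrix (1 = correct, 0 = incorrect); the learner names and answers are invented, and the top/bottom split is a crude stand-in for the more careful grouping a real item analysis would use:

```python
# Toy item analysis: per-question difficulty (share answering correctly)
# and discrimination (top performers vs. bottom performers).
responses = {
    # learner: answers to items Q1..Q4 (1 = correct, 0 = incorrect)
    "ann": [1, 1, 1, 0],
    "ben": [1, 1, 0, 1],
    "cat": [1, 0, 1, 0],
    "dan": [0, 0, 0, 1],
}

totals = {name: sum(answers) for name, answers in responses.items()}
ranked = sorted(responses, key=totals.get, reverse=True)
top, bottom = ranked[:2], ranked[-2:]  # crude high/low performer split

n_items = 4
stats = []
for i in range(n_items):
    # Difficulty: proportion of all learners answering the item correctly.
    difficulty = sum(responses[name][i] for name in responses) / len(responses)
    # Discrimination: top-group minus bottom-group proportion correct;
    # values near 0 mean the item doesn't separate strong from weak learners.
    disc = (sum(responses[name][i] for name in top) / len(top)
            - sum(responses[name][i] for name in bottom) / len(bottom))
    stats.append((difficulty, disc))
    print(f"Q{i + 1}: difficulty={difficulty:.2f}, discrimination={disc:+.2f}")
```

Moodle's quiz statistics report computes more refined versions of these same numbers, so this is mainly useful as a way to understand what that report is telling you.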

The Worst Practice to Avoid

Creating an assessment that has 20 questions and a passing grade of 80% for no reason other than this being the so-called "best practice". Plainly speaking, neither the number of test items nor the passing grade should be set arbitrarily, as this compromises the validity of the assessment. In fact, an assessment based solely on these two numbers (or either one of them) usually leads to disputes and confusion, as there is no clarity about what the numbers mean or what the purpose of the assessment is. There are specific principles that can help you make reasonable decisions about the number of test items as well as the passing grade - these should be applied to all assessments.

In general, a poorly written assessment is at the very least a waste of time (and at worst highly unethical or even illegal) - both yours as the assessment writer, and the learner's. In this case, no assessment is better than a bad assessment!

Further Reading

In addition to this extensive summary of the views on validity and reliability and this (less academic) overview, I found the following printed resources helpful, practical and accessible to more general audiences:

  • Anderson, P., Morgan, G. (2008) 'Developing Tests and Questionnaires for a National Assessment of Educational Achievement', vol. 2, World Bank Publications.
  • Burton et al. (1991) 'How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty', Brigham Young University.
  • Downing, S., Haladyna, T. (1997) 'Test Item Development: Validity Evidence From Quality Assurance Procedures', Applied Measurement in Education, vol. 10, nr. 1, pp. 61-82.

Most of them may be found through Google Scholar.
