Development and assessment of tests for education: Comparing standard setting methods

Introducing methods for standard setting

Tests are made and taken to see how well students understand a certain topic. Developing a test involves many decisions. Taken together, these decisions form the standard setting process, which determines how well someone has to perform on a test to pass it. The process includes setting performance standards, writing exam questions, and selecting a method for setting a cut-score. The cut-score is the minimum number of items that must be answered correctly to pass the test. This post explains and compares different methods for setting a cut-score.

There are several methods for standard setting, and they can be divided into three subgroups: norm-referenced, criterion-referenced, and mixed. In norm-referenced methods students are compared to each other, while criterion-referenced methods are chosen when students need a certain level of knowledge or skills to pass the test (Ertoprak & Dogan, 2016). Mixed methods combine the two approaches.

The passing percentage method and Cohen’s method are examples of norm-referenced methods. The passing percentage method can be used when a desired percentage of students needs to pass, for example because of selection or a limited number of available places. A reason not to use this method is that the content and quality of the test are not considered when deciding who passes: even if the test were badly made, the same share of students would still pass. Cohen’s method is similar, but it uses the best performing student as the reference and sets the cut-score at 60-65% of that student's score. The advantage is that student ability across exams tends to be more stable than panelist ratings, and panelist ratings can be too expensive. On the other hand, student ability across exams might still fluctuate too much for the reference score to be reliable.
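To make the two norm-referenced methods concrete, here is a minimal Python sketch. The scores are made up for illustration and are not the exam data analysed later in this post. The function names, the rounding of Cohen's cut-score up to a whole item, and the use of the highest observed score as the reference (following the description above) are assumptions of this sketch.

```python
import math

def passing_percentage_cut_score(scores, target_pass_rate):
    """Highest cut-score at which at least target_pass_rate of the group passes."""
    ranked = sorted(scores, reverse=True)
    # Number of students that must pass (at least one); the inner round()
    # guards against floating-point noise such as 7.000000000000001.
    n_pass = max(1, math.ceil(round(target_pass_rate * len(ranked), 9)))
    # The score of the last student who still has to pass becomes the cut-score.
    return ranked[n_pass - 1]

def cohen_cut_score(scores, reference_percent=60):
    """Cut-score as a percentage of the best score (rounding up is an assumption)."""
    return math.ceil(max(scores) * reference_percent / 100)

# Hypothetical scores of ten students.
scores = [30, 35, 38, 40, 41, 41, 44, 50, 55, 60]
print(passing_percentage_cut_score(scores, 0.5))  # cut-score if half should pass
print(cohen_cut_score(scores, 60))                # 60% of the best score
```

With these made-up scores, the passing percentage method gives a cut-score of 41. Because two students are tied at 41, six of the ten students pass, slightly more than the 50% target.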

The linear transformation and expert panels are examples of criterion-referenced methods. The linear transformation draws a straight line between the guess-score and the maximum number of points that can be obtained. This method can be used when exams do not differ in difficulty, guess-score, or maximum score; if they do differ, it cannot be used. The second method relies on expert panels and is also called the Angoff-method. Around ten panelists estimate, for every item on the test, the probability that a minimally competent student answers it correctly. Because experts set the cut-score, the criterion reference, quality assurance, and the definition of minimal competence are clear. Additionally, professionals are engaged in the process. Since the process costs a lot of time and money, it might not always be the best method to choose.
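The Angoff-method can be sketched in a few lines of Python. The ratings below are invented, and averaging the panelists' summed ratings is only one common way to combine them; this post does not state which aggregation the panel actually used, so treat the function as an illustration under that assumption.

```python
from statistics import mean

def angoff_cut_score(ratings):
    """ratings[p][i] is panelist p's estimated probability that a minimally
    competent student answers item i correctly."""
    # Each panelist's expected score for a minimally competent student.
    per_panelist = [sum(panelist_ratings) for panelist_ratings in ratings]
    # The cut-score is the average of those expected scores across panelists.
    return mean(per_panelist)

# Hypothetical ratings: four panelists, four items.
ratings = [
    [0.50, 0.75, 0.50, 0.25],
    [0.50, 0.50, 0.75, 0.25],
    [0.75, 0.75, 0.50, 0.50],
    [0.75, 0.50, 0.75, 0.50],
]
print(angoff_cut_score(ratings))  # expected number of correct items out of 4
```

Note that the function only uses the panelists' judgments of the items, not the scores of the students who actually took the test.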

Lastly, the Hofstee-method is a combination of norm- and criterion-referenced methods. Experts decide on an acceptable passing score by setting the minimum and maximum failure rate and the minimum and maximum passing score. This is done to prevent critical panelists from causing extreme failure rates. A reason not to use this method is that the ability of examinees across exams can differ a lot.
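A rough Python sketch of the Hofstee-method follows. It assumes the usual graphical reading of the method: a straight line runs from the point (minimum cut-score, maximum failure rate) to the point (maximum cut-score, minimum failure rate), and the cut-score is taken where that line comes closest to the observed cumulative failure rate. The scores, the expert boundaries, and the function name are hypothetical.

```python
def hofstee_cut_score(scores, k_min, k_max, f_min, f_max):
    """Whole-number cut-score in [k_min, k_max] where the observed failure
    rate is closest to the experts' line."""
    if k_max <= k_min:
        raise ValueError("k_max must be larger than k_min")
    n = len(scores)
    best_cut, best_gap = None, None
    for cut in range(k_min, k_max + 1):
        # Observed share of students scoring below this candidate cut-score.
        observed_fail = sum(s < cut for s in scores) / n
        # Failure rate on the line from (k_min, f_max) to (k_max, f_min).
        line_fail = f_max + (f_min - f_max) * (cut - k_min) / (k_max - k_min)
        gap = abs(observed_fail - line_fail)
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut

# Hypothetical scores and expert boundaries.
scores = [30, 35, 38, 40, 41, 41, 44, 50, 55, 60]
print(hofstee_cut_score(scores, k_min=38, k_max=44, f_min=0.2, f_max=0.6))
```

Because the observed failure rate comes from the students' scores, the same expert boundaries can give a different cut-score for a different group of students.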

Analysing methods for setting the passing percentage and cut-score

The analysis uses data from a high-stakes Mathematics exam to find similarities and differences between the methods discussed above. The outcomes and comparisons are discussed below. A few details of the data: students could obtain 66 points on the exam, the guess-score was 16.5, and 1945 students took the exam.

Across all methods, students had to answer between 40 and 44 items correctly to pass the test. The passing percentage method shows that if 51% of the students should pass, the cut-score should be set at 41. To let 57% of the students pass, the cut-score should be 40. The linear method shows that students had to answer 41.5 items correctly to get the passing grade of 5.5. Since it is not possible to obtain this score, the cut-score should be set at 41 or 42. A score of 41 results in a 5.0, while answering 42 items correctly results in a 6.0. These two methods differ greatly in how they decide whether students pass the test.

When the cut-scores are set with the other methods, the differences are smaller. The graph of the Hofstee-method shows an intersection at a cut-score of 41, in which case 49% of the students fail the test. Twelve panelists performed the Angoff-method, which resulted in a cut-score of 40. Lastly, Cohen’s method gives a cut-score of 41, with 49.1% of the students failing. So, the Hofstee-method and Cohen’s method give similar results, while the Angoff-method gives a cut-score that is a little lower. If this cut-score were used in the Hofstee-method and Cohen’s method, the percentage of students failing would drop to 43% and 43.2%, respectively.
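The effect of moving the cut-score by one point can be checked by counting how many students fall below each candidate cut-score. The snippet below is a small sketch with invented scores, not the exam data from this post, and the function name is hypothetical.

```python
def failure_percentage(scores, cut_score):
    """Percentage of students scoring below the cut-score."""
    n_failed = sum(s < cut_score for s in scores)
    return 100 * n_failed / len(scores)

# Hypothetical scores of ten students.
scores = [30, 35, 38, 40, 41, 41, 44, 50, 55, 60]
for cut in (41, 40):
    print(cut, failure_percentage(scores, cut))
```

In this toy group, lowering the cut-score from 41 to 40 lets the one student who scored exactly 40 pass, so the failure percentage drops from 40% to 30%.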

Changing student ability, what happens to the passing percentage and cut-score?

If the students taking the same test had a lower ability level, the passing percentage and cut-score would change. Firstly, with the passing percentage method and the linear transformation method, fewer items would need to be answered correctly to pass the test. If around 60% of the students still need to pass and they all score lower on the test, they will have to answer fewer than 40 items correctly. For the linear transformation method, if the highest score among the students is still 66, the number of items that need to be answered correctly to pass would not change. However, if the highest score on that test were lower than 66, the number of items that need to be answered correctly to pass would be lower. So, when the same exam is used in a group in which students’ abilities are lower, the passing percentage could differ depending on the method that is used and the highest score on the test.

Changes in the cut-score also depend on the method that is used to set it. The Hofstee-method provides a range in which the cut-score can lie. The score will be different when the students’ performance is lower because the cumulative graph changes, not because the experts have set other values for the minimum and maximum failure rate and passing score. If the students score lower on the test, the cut-score will be lower. The cut-score determined by the Angoff-method will be different because the probability of students answering the items correctly is considered. So, if the students’ ability is lower, the cut-score will be lower. Lastly, the cut-score set by Cohen’s method can be different since it depends on the highest score in the group, so the same holds as for the linear transformation method. If the best performing student now scores lower than in the previous group, the cut-score will also be lower. If the best performing student performs equally well as in the previous group, the cut-score will not change.

References

Ertoprak, D. G., & Dogan, N. (2016). A research on the classification validity of the decisions made according to norm and criterion-referenced assessment approaches. Anthropologist, 23(3), 612–619. https://doi.org/10.1080/09720073.2014.11891981

5 comments:

Atayo's World, 12 May 2020 at 11:33
Hello Birgit,
I enjoyed reading your blog. You show a good understanding of the topic. A quick one though: are you sure the cut-score will remain the same for lower ability students? The Angoff method is a criterion-referenced item orientation method. The experts set the cut-score using the item difficulty, not the expected ability of the candidates. In my opinion, the cut-score for low ability students will remain the same. What do you think?
Anonymous, 15 May 2020 at 20:41
Chantha
Monika Vaheoja, 25 May 2020 at 12:28
Nice blog Birgit. I do, however, have a few questions about it. You state that the highest cut-score was equal to 44; for which method?

You also state that, for the linear transformation, the maximum point changes when students' ability changes. But that's not truly correct: the maximum possible number of points stays the same, even when the ability changes.

And when you think a bit ahead: different methods result in different cut-scores, but what does it mean for the performance standard (in terms of the difficulty of the exam) if the cut-scores for the same exam are different?

Monday 11 May 2020
