Monday 25 May 2020

Reviewing a test item with the Evidence Centered Design model


Introduction of the ECD model
In this post, a test item will be reviewed to check whether it is suitable for assessing a certain target skill. This item, shown in Figure 1, is taken from an old Dutch traffic theory exam for cycling, aimed at primary school children in grade 5. For this review, the Evidence Centered Design (ECD) model is used. This model consists of six sub-models: the student, task, evidence, presentation, assembly, and reporting model, see Figure 2 (Almond, Steinberg, & Mislevy, 2002). In this post, only parts of the student, task, and evidence model will be discussed. The student model provides insight into the minimum requirements or skills needed to perform a task. The task model answers which assessment tasks are needed to gain information about the student. The evidence model answers the questions: ‘What counts as evidence for proficiency?’ and ‘How should evidence be interpreted when drawing a conclusion about the target skill?’. Of these models, the following aspects will be discussed: the target skill, the traffic task and task situation, the task context, the task complexity features, the response, and the response processing.

Figure 1. Item used for the assessment of the target skill, translated into English (VNN, 2014).



Figure 2. The sub-models that together form the ECD model

Target skill
The target skill seems to be decision making in traffic, or having sufficient knowledge of the traffic rules and insight into the situation to carry out the task safely. The task in this case, as described in Figure 1, is to safely pass the car that is backing out of a parking spot. The students go through a few processes while solving the item: the perception process, the anticipation process, and the decision process. In the perception process, they become aware of their environment: more specifically, their own speed, the speed and direction of the other vehicles, and the fact that the rules indicate that the car should give way to them. In the anticipation process, they need to know the possible outcomes and predict the behaviour of the other people involved in the situation. The following are a few possibilities that may cross their mind: we can continue if the car lets us go first; we should slow down to make sure the car driver notices us in time; we should slow down to avoid a collision if the car driver does not see us in time; we should not brake abruptly, since we would then risk a collision with each other (the other cyclists). These thoughts focus on traffic flow, safety, and social participation in traffic. Lastly, a decision needs to be made on how to act, and the corresponding answer needs to be chosen. The decision to continue or slow down will be based on the speed and movement of the car. If it stops, the cyclists can continue. If the car does not stop, the cyclists will have to slow down and eventually brake. Therefore, the safest option is to at least slow down until the cyclists are noticed by the car driver. So, Sanne gives the correct answer (see Figure 1).
            The description of solving this item might seem straightforward; however, it is important to be aware of the pitfalls of this question. It is essential to pay attention to what can go wrong during these processes: not all relevant information may be noticed in the perception phase, not all possible outcomes may be thought of in the anticipation phase, and the wrong decision may still be made despite correct perception and anticipation. It is even possible that, for example, the students’ reading skills are not fully developed, so a student might not understand the question, which can also lead to an incorrect answer to this item.
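The three processes above can be caricatured in code. This is my own toy illustration (the function and its inputs are invented, not part of the exam), collapsing perception and anticipation into two boolean observations:

```python
def decide(car_is_backing_out: bool, car_stops_for_us: bool) -> str:
    """Toy decision rule for the cyclists in Figure 1."""
    if not car_is_backing_out:
        return "continue"      # perception: no conflict on the road
    if car_stops_for_us:
        return "continue"      # anticipation: the driver gives way, as the rules require
    return "slow down"         # decision: reduce speed until the driver notices us

print(decide(car_is_backing_out=True, car_stops_for_us=False))  # slow down
```

The real skill is of course richer than two booleans; the sketch only mirrors the reasoning chain described above.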

Traffic task and task situation
The traffic situation can be described with the help of a detailed overview of the traffic task, shown in Table 1. The table shows that the cyclists are cycling on a road with other road users. The cyclists encounter a car that is backing out of a parking spot into the road, and they must change their position to pass the car and avoid a collision. Once they have passed the car, they can return to cycling on the right side of the road.

Table 1

Characteristics of task situations

Main task               | Subtask        | Light       | Weather | Road category | Road section  | Other road users
Cruising                | Stay on course | Normal view | Normal  | 30 km/h road  | Straight road | Cars
Cruising                | Stay on course | Normal view | Normal  | 30 km/h road  | Straight road | Bikers
Change lateral position | Overtaking car | Normal view | Normal  | 30 km/h road  | Straight road | Cars
Change lateral position | Overtaking car | Normal view | Normal  | 30 km/h road  | Straight road | Bikers
Change lateral position | Merging        | Normal view | Normal  | 30 km/h road  | Straight road | Cars
Change lateral position | Merging        | Normal view | Normal  | 30 km/h road  | Straight road | Bikers

Task context
The context of a task can be real, simulated, described, or context-free. Since the item is a picture, it is clearly not a real or simulated context. The picture could be evaluated as either a described situation or as context-free. The actors, objects, and materials are shown visually, and the focus is only on the specific task of safely continuing the route without being interrupted or hit by the car that is backing out of the parking spot. An argument for context-free is that only this specific task is in focus, so the test taker should mainly be aware of the rule that the car driver must give way to the cyclists. However, since a picture of the situation is provided rather than only a rule or sign, I would say the task context is described rather than context-free.
            To let a student experience a learning task more realistically, the task context could be real or simulated. For example, when cycling and wanting to turn left, you must think about many things, e.g. looking for other road users, signalling, and deciding whether you can continue or should wait. Practising such a skill might therefore be helpful. Doing this in a simulation has the advantage of resembling a real situation without the actual danger.

Task complexity features
There are various aspects that add to the complexity of carrying out the target skill in a real situation; a few of them are shown in Table 2. As described in the target skill section, the student needs to notice the other road users and their speed and direction, and consider them when deciding how to handle the situation. The decision that is made should be feasible to carry out in terms of space and time.
The complexity of this task could be varied by changing the situation in which the target skill is carried out. To increase the complexity, it could be night-time instead of daytime, which decreases sight and visibility. To make the task less difficult, the car coming from the opposite direction or the parked cars could be removed, increasing the visibility and the space available to pass the car.

Table 2

Task complexity features

Feature                      | Perception | Decision making | Action execution
Sight and visibility         | X          |                 |
Presence of other road users | X          | X               |
Regulation of situation      |            | X               |
Speed differences            | X          | X               | X
Time pressure                |            | X               | X
Space to carry out actions   | X          | X               | X

Response
The students respond to this item by selecting one of the three answers provided, so the answer is given through the visual response channel. However, if the answer were constructed by the students themselves, they would have the possibility to express and explain themselves better, and the question would not be marked as pass or fail based only on the final answer. A constructed answer could be given to the assessor verbally, or an element of interaction could be added to the question by letting students move certain elements in the situation. However, since the item is paper-based, it would be more insightful to present it as an open question instead of having students select one option.

Response processing
There are different parts of a response to which points can be assigned: the actions and strategies, the actual solution, or the consequences of the solution. In this case, since only A, B, or C can be chosen, the reasoning of the test taker is not available to the assessor. So, either the actual solution or the consequences the answer has in this situation can be scored. However, to assign points to the answer with the best outcome, the item should be formulated differently: the wording of the question should make the aim of the item clear. If the item were presented as an open question, as suggested above, the actions and strategies used by the student could be scored together with the actual solution.
            The scoring for the alternative task mentioned in the task context (turning left while cycling) differs from the item in Figure 1. The alternative task has a simulated context, and the actions and their consequences are important, so the student should get points based on (one of) these parts. However, a student should be made aware of the aim of the task before carrying it out. When the assessor is interested in the reasoning behind the actions, the student could share their thinking process while deciding what to do, or they could be asked afterwards why they made certain decisions.

Conclusion
A few aspects of the student, task, and evidence model were discussed. These aspects are an example of what must be considered when constructing an item for a test. It is important to know which skill needs to be measured in order to develop a fitting item or test that answers the question of whether the student is proficient in that skill.
            For this specific item, I think it is overall a good item to measure the target skill. However, it could benefit from rethinking the way the student has to respond, since the reasoning behind the chosen answer is currently unknown to the assessor. That reasoning could give the assessor valuable information about possible mistakes made during the decision-making process.

References
Almond, R. G., Steinberg, L. S., & Mislevy, R. J. (2002). Enhancing the design and delivery of assessment systems: A four process architecture. Journal of Technology, Learning, and Assessment, 1(5), 1–64.

Monday 18 May 2020

Assessing quality of items and tests


The quality of tests needs to be assessed before they can be used to test candidates’ knowledge or skills. The RCEC review system is an analytical review system developed to evaluate the quality of educational exams (‘The RCEC review system for the quality of tests and exams’, n.d.). The system has six criteria that together cover the substantive and organizational aspect and the psychometric aspect. Purpose and use, test and examination material, and test administration and security form the first aspect; representativeness, reliability, and standard setting and maintenance form the second. To determine whether the six criteria are met, questions are answered with ‘insufficient’, ‘sufficient’, or ‘good’, which gives a score of 1, 2, or 3 respectively. At the end, for each criterion it is checked whether enough points were gathered to say that the criterion is met. In this post, a few questions from the criteria of the psychometric aspect are answered using data that was provided. This data contains the analysis of the test and its items: it was a test for group 7 (grade 5), 199 students participated, and the test consisted of 40 items.

For criterion 3, representativeness, question 3.2 ‘Is the degree of difficulty of the items and/or the actions adjusted to the intended target population?’ was selected. To answer this question with sufficient, 75%–90% of the items should have a p-value >0.20 and ≤0.80. If the percentage is lower than 75, the question is marked insufficient, and if it is higher than 90, it is marked good. In this data, fewer than 75% of the items have a p-value between 0.20 and 0.80. Therefore, this question must be answered with insufficient and gets a score of 1.
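As a sketch of how question 3.2’s rule could be checked mechanically (the p-values below are invented; the actual item data is not reproduced here):

```python
# Question 3.2: the share of items with 0.20 < p <= 0.80 determines the score:
# below 75% -> insufficient (1), 75-90% -> sufficient (2), above 90% -> good (3).
def difficulty_score(p_values):
    ok = sum(1 for p in p_values if 0.20 < p <= 0.80)
    pct = 100 * ok / len(p_values)
    if pct < 75:
        return "insufficient", 1
    return ("sufficient", 2) if pct <= 90 else ("good", 3)

example = [0.15, 0.95, 0.90, 0.85] + [0.50] * 6   # only 6 of 10 items in range
print(difficulty_score(example))
```

With only 60% of the invented items in range, the function returns insufficient, mirroring the judgement above.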

For criterion 4, reliability, questions 4.2 and 4.3 were selected. Question 4.2 ‘Is the reliability of the test correctly calculated?’ is answered based on the number of candidates used for the calculation of the reliability. At least 200 candidates should be used; however, in this data, only 199 candidates took the test. Therefore, the answer to this question is insufficient and gets a score of 1. Had there been 200 candidates, the score would have gone up to sufficient. Additionally, there was an objective scoring system, established in question 2.9 (criterion 2, question 9), so the score would then go up to good. So, with at least one extra candidate, the score for this question would go from 1 to 3.
The second question in this criterion is 4.3, ‘Is the reliability sufficient, considering the decisions that have to be based on the test?’. To answer it, the reliability coefficient is considered: a reliability of ≥0.80 and <0.90 is sufficient, lower than 0.80 is insufficient, and 0.90 or higher is good. In this data, coefficient alpha is only 0.68, so this question is also answered with insufficient and gets a score of 1.
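The two rules for criterion 4 can be captured in small helper functions. This is a hedged sketch of how I read the thresholds; the function names are my own:

```python
# Question 4.2: at least 200 candidates are needed for the calculation;
# an objective scoring system lifts the score from sufficient to good.
def calculation_score(n_candidates: int, objective_scoring: bool):
    if n_candidates < 200:
        return "insufficient", 1
    return ("good", 3) if objective_scoring else ("sufficient", 2)

# Question 4.3: alpha < 0.80 is insufficient, 0.80-0.90 sufficient, >= 0.90 good.
def reliability_score(alpha: float):
    if alpha < 0.80:
        return "insufficient", 1
    return ("sufficient", 2) if alpha < 0.90 else ("good", 3)

print(calculation_score(199, True))   # the data: 199 candidates -> insufficient
print(reliability_score(0.68))        # coefficient alpha 0.68 -> insufficient
```

The first call shows the “one extra candidate” effect: with 200 candidates and objective scoring, the same function returns good.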

For criterion 5, standard setting and maintenance, questions 5.1, 5.2a, and 5.2c were selected. Question 5.1 is ‘Are norms/standards/cut-off scores provided?’, so either at least one of these is provided or not. The data shows that the Angoff method was used and that a cut-off score has been set. So, this question can be marked as good and gets a score of 3.
The second question is 5.2, ‘Has the standard setting been carried out correctly?’, which is divided into three sub-questions; only sub-questions a and c will be discussed.
Sub-question a is ‘Has the standard setting method been carried out correctly?’. To answer this, the professional consideration or argumentation supporting the chosen cut-off score needs to be examined. The Angoff method was used to set the cut-off score and seems to have been carried out correctly; however, the reasoning and support of the experts is missing. Therefore, this question is answered with sufficient and gets a score of 2.
Sub-question c is ‘Is there sufficient agreement between the qualified experts?’. Sufficient agreement lies between 0.60 and 0.80. In this data, the agreement between the qualified experts is 0.89, which means this question can be answered with good and thus gets a score of 3.

In summary, the review system has strict rules with which the quality evaluation is carried out. However, it is not always as straightforward as it might seem; see, for example, criterion 4. Most importantly, no conclusion can be drawn from answering only a few questions: all questions must be answered to produce a reliable evaluation of the quality of the test.


Reference
The RCEC review system for the quality of tests and exams. (n.d.). Retrieved 18 May 2020, from https://www.rcec.nl/en/review-system/

Monday 11 May 2020

Comparing standard setting methods

Introducing methods for standard setting
Tests are made and taken to see how well students understand a certain topic. When developing a test, many decisions need to be made. All these decisions combined are called the standard setting process, in which it is determined how well someone has to perform on a test to pass it. It also includes setting performance standards, making exam questions, and selecting a method for setting a cut-score. The cut-score represents the minimum number of items that need to be answered correctly to pass the test. This post focuses on explaining and comparing different methods to set a cut-score.
            There are a few methods for standard setting, and they can be divided into three subgroups: norm-referenced, criterion-referenced, and mixed. In norm-referenced methods, students are compared to each other, while criterion-referenced methods are chosen when the student needs a certain level of knowledge or skills to be able to pass the test (Ertoprak & Dogan, 2016). It is also possible to use a mix of these methods.
            The passing percentage and Cohen’s method are examples of norm-referenced methods. The passing percentage method can be used when there is a desired percentage of students who need to pass, for example due to selection or a limited number of places available. A reason not to use this method is that the content and quality of the test are not considered when deciding who passes: even if the test were made badly, people would still be able to pass. Cohen’s method is similar; however, the best performing student is used as a reference, and 60–65% of that score is used to determine the cut-score. The advantage is that student ability across exams is more stable than panelist ratings; additionally, panelist ratings can be too expensive. On the other hand, student ability across exams might fluctuate too much.
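The two norm-referenced methods can be sketched numerically. The scores, the 60% pass rate, and the 60% factor for Cohen’s method below are invented for illustration:

```python
import random

random.seed(0)
scores = sorted(random.randint(10, 66) for _ in range(200))  # made-up exam scores

# Passing percentage method: pick the cut-score so that roughly 60% of the
# students pass, i.e. take the score of the student at the 40th percentile.
cut_pass = scores[int(0.40 * len(scores))]

# Cohen's method: the cut-score is a fixed fraction (here 60%) of the score
# of the best performing student.
cut_cohen = round(0.60 * max(scores))

print(cut_pass, cut_cohen)
```

Note that neither computation looks at item content, which is exactly the objection raised above.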
            The linear transformation and expert panels are examples of criterion-referenced methods. The linear transformation draws a straight line between the guess-score and the maximum score that can be obtained. This method can be used when there are no differences between exams in difficulty, guess-score, and maximum score; if there are differences, this method cannot be used. The second method uses expert panels and is also called the Angoff method. Around ten panelists estimate the probability that a minimally competent student answers an item correctly, and they do this for all items on the test. With experts setting the cut-score, the criterion reference, quality assurance, and minimal competence are clear. Additionally, professionals are engaged in the process. Since this is a time- and money-consuming process, it might not always be the best method to choose.
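The Angoff procedure described above boils down to simple arithmetic: average the panelists’ probability estimates per item, then sum the averages. The ratings below are invented (3 items, 4 panelists) rather than the real panel data:

```python
# Rows are items, columns are panelists: each entry is the estimated
# probability that a minimally competent student answers the item correctly.
ratings = [
    [0.6, 0.7, 0.5, 0.6],
    [0.8, 0.9, 0.8, 0.7],
    [0.4, 0.5, 0.5, 0.4],
]

item_means = [sum(row) / len(row) for row in ratings]  # mean estimate per item
cut_score = sum(item_means)                            # Angoff cut-score

print(round(cut_score, 2))  # 1.85
```

On a real exam the sum runs over all items, so the cut-score lands on the test’s score scale.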
            Lastly, the Hofstee method is a combination of norm- and criterion-referenced methods. Experts decide on an acceptable passing score: they set the minimum and maximum failure rate and the minimum and maximum passing score. This is done to control for extreme failure rates set by critical panelists. A reason not to use this method is that the ability of examinees across exams can differ a lot.
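The Hofstee compromise can be sketched as follows. The panel bounds and score distribution below are invented; the method takes the cut-score where the line from (minimum cut-score, maximum failure rate) to (maximum cut-score, minimum failure rate) meets the observed cumulative failure-rate curve:

```python
k_min, k_max = 35, 45          # hypothetical panel bounds on the cut-score
f_min, f_max = 0.20, 0.60      # hypothetical bounds on the failure rate

scores = [30, 33, 36, 38, 39, 40, 41, 42, 44, 50] * 20  # made-up scores

def fail_rate(cut):            # share of students scoring below the cut
    return sum(s < cut for s in scores) / len(scores)

def hofstee_line(cut):         # panel line, falling from f_max to f_min
    return f_max + (f_min - f_max) * (cut - k_min) / (k_max - k_min)

# Walk through candidate cut-scores and take the first one where the observed
# failure rate reaches the panel line (the intersection point).
cut = next(c for c in range(k_min, k_max + 1) if fail_rate(c) >= hofstee_line(c))
print(cut, fail_rate(cut))
```

In a real application the intersection is usually read off a graph, as in the analysis below; the loop is just the discrete equivalent.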

Analysing methods for setting the passing percentage and cut-score
The analysis will be performed with data from a high-stakes Mathematics exam to find similarities and differences between the previously discussed methods. The outcomes and comparisons are discussed below. A few details of the data: students could obtain 66 points in the exam, the guess-score was 16.5, and 1945 students took the exam.
            Looking at all methods, students had to answer between 40 and 44 items correctly to pass the test. The passing percentage method shows that if 51% of the students should pass, the cut-score should be set at 41; to let 57% of the students pass, the cut-score should be 40. The linear method shows that students had to answer 41.5 items correctly to get the passing grade of 5.5. Since it is not possible to get this score, the cut-score should be set to 41 or 42: a score of 41 results in a 5.0, while answering 42 items correctly results in a 6.0. There is a big difference between these two methods in how it is decided whether students pass the test.
            When setting the cut-scores with the other methods, it becomes clear that there are fewer differences. The graph of the Hofstee method shows an intersection that provides a cut-score of 41; in this case, 49% of the students will fail the test. Twelve panelists performed the Angoff method, which resulted in a cut-score of 40. Lastly, Cohen’s method gives a cut-score of 41, with 49.1% of the students failing. So, the Hofstee and Cohen’s methods have similar results. The Angoff method gives a cut-score that is a little lower; if this cut-score were used in the Hofstee and Cohen’s methods, the percentage of students failing would drop to 43% and 43.2%, respectively.

Changing student ability, what happens to the passing percentage and cut-score?
If the students taking the same test had a lower ability level, there would be some changes in the passing percentage and cut-score. Firstly, looking at the passing percentage and linear transformation methods: with the passing percentage method, fewer items would need to be answered correctly to pass the test. If around 60% of the students still need to pass and they all score lower on the test, they will have to answer fewer than 40 items correctly. For the linear transformation method, if the highest score of one of the students is still 66, there would be no difference in the number of items that need to be answered correctly to pass. However, if the highest score on that test were lower than 66, the number of items needed to pass would also be lower. So, when using the same exam in a group with lower student abilities, the passing percentage could be different depending on the method that is used and the highest score on the test.
            Changes in the cut-score also depend on the method that is used to set it. The Hofstee method provides a range in which the cut-score can lie; the score will be different when the students’ performance is lower because of the change in the cumulative graph, not because the experts have set other values for the minimum and maximum failure rate and passing score. If the students score lower on the test, the cut-score will be lower. The cut-score determined by the Angoff method will be different because the probability of students answering the items correctly is considered; so, if the students’ ability is lower, the cut-score will be lower. Lastly, the cut-score set by Cohen’s method can differ since it depends on the highest score in the group, so the same holds as for the linear transformation method: if the best performing student now scores lower than in the previous group, the cut-score will also be lower; if the best performing student performs equally well, the cut-score will not change.

References
Ertoprak, D. G., & Dogan, N. (2016). A research on the classification validity of the decisions made according to norm and criterion-referenced assessment approaches. Anthropologist, 23(3), 612–619. https://doi.org/10.1080/09720073.2014.11891981
