Development and assessment of tests for education

Monday, 22 June 2020

Defining educational measurement and describing its innovations and future

Over the last eight weeks, I have learned about different topics in educational measurement, as can be seen in the previous six blog posts. In this post I would like to summarize and define educational measurement based on those six topics, discuss innovations and the future of educational measurement, and reflect on what my contribution or role in this field could be.

What is educational measurement?

Definitions of educational measurement in the literature focus on the skill or knowledge that is measured and on the tool used to measure it. Cito formulates educational measurement as the “objective measurements of the knowledge, skills and competencies of your students and professionals” (“Educational Measurement,” n.d.), while another definition, provided by St. Thomas University, defines measurement as “determining the attributes or dimensions of an object, skill or knowledge” (“Importance of Educational Measurement, Assessment and Evaluation,” 2018). To measure a skill or knowledge in a reliable way, tests should meet certain standards. Looking at the previous blog posts, I would say that a definition of educational measurement should at least include the following aspects: developing a test, assessing the quality of the test, deciding how much needs to be answered correctly to pass the test, analysing the data generated during a test, and being aware of fraud. Combining the definitions found online with the information provided during the course, I would propose the following definition for educational measurement:

Educational measurement entails the development, quality assessment, standard setting, data analysis, and fraud detection for a test or assignment.

What are the upcoming innovations in educational measurement?

When thinking about innovation, I easily wander into the field of technology and search for opportunities to improve or support current practices. Over the last years, technology has had a big impact on how tests are developed, taken, and analysed (Zenisky & Sireci, 2002). The development in the field of test construction is visible in computerized adaptive testing (CAT). CAT selects items from an item bank based on the level of the test taker, so that fewer items are needed to measure the ability of the test taker (Henk van der Kolk, 2018). This type of testing would not be possible without the technological advancements that exist nowadays. The worldwide collection of data (PISA) is also easier with the technology currently available. Finding schools to participate, distributing the tests and surveys, and collecting the data are all important aspects of this research that are positively influenced by the development of technology. Lastly, running analyses on big data or doing statistical analyses to detect fraud is easier with the software that has been developed to carry out those analyses.
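To make the idea of adaptive item selection more concrete, here is a minimal sketch in Python. It is my own illustration and not how any particular CAT system is implemented. I assume a Rasch model, a tiny made-up item bank with hypothetical difficulties, item selection by maximum information, and a simple grid-based ability estimate with a standard normal prior.

```python
import math

def p_correct(theta, b):
    """Rasch model: probability that ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def select_next_item(theta, bank, administered):
    """Pick the not-yet-administered item that is most informative at the current estimate."""
    candidates = [item for item in bank if item not in administered]
    return max(candidates, key=lambda item: item_information(theta, bank[item]))

def estimate_theta(responses, bank):
    """Posterior mean of ability on a grid from -4 to 4, using a standard normal prior."""
    grid = [g / 10.0 for g in range(-40, 41)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # prior weight (unnormalised)
        for item, score in responses.items():
            p = p_correct(t, bank[item])
            w *= p if score == 1 else (1.0 - p)
        weights.append(w)
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

# Hypothetical item bank: item name -> difficulty (made-up values for illustration only)
bank = {"easy": -2.0, "medium": 0.0, "hard": 2.0}

first = select_next_item(0.0, bank, set())          # start at average ability
theta = estimate_theta({first: 1}, bank)            # suppose the first answer was correct
second = select_next_item(theta, bank, {first})     # the next item adapts to the new estimate
```

In this sketch a correct answer raises the ability estimate, so the next item selected is a more difficult one. That is the mechanism that allows a CAT to reach a precise estimate with fewer items.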

What will educational measurement look like in the future?

For the future, I think it is important that the field of technology is researched and used to support and improve the field of educational measurement. There have been many improvements already over the last years, and I assume that these improvements will continue in the coming years. New technologies or software might increase the quality and speed of data analyses, and new programs might make it easier to distribute tests and collect data. The security of those programs and software needs to evolve as well to keep up with new ways in which people will try to steal items from an item bank, or the concept of an item bank might be changed or improved so that it is more difficult to hack such a system.

What role would I like to fulfil as an educational specialist in the future (of educational measurement) described above?

Educational measurement consists of various aspects, which makes it a broad field. Of the aspects in the definition given at the beginning, I would like to contribute to the analysis of test data, either for learning purposes or for fraud detection. In combination with the development of technology in the future, I think this is a promising area for research. Studying the current field to find what can be improved, and researching how current technological developments can support those improvements, seems interesting, relevant, and important.

References

Educational Measurement. (n.d.). Retrieved June 5, 2020, from https://www.cito.com/we-are/educational-measurement

Henk van der Kolk. (2018, June 14). Computerized Adaptive Testing [Video file]. Retrieved from http://www.youtube.com/watch?v=jckSV5vHSIs

Importance of Educational Measurement, Assessment and Evaluation. (2018). St. Thomas University. Retrieved June 5, 2020.

Zenisky, A. L., & Sireci, S. G. (2002). Technological innovations in large-scale assessment. Applied Measurement in Education, 15(4), 337–362. https://doi.org/10.1207/S15324818AME1504

Designing a fraud-proof certification program

It is important that a test is taken without any cheating or fraud happening. Otherwise, there is a negative impact on the validity, reliability, and credibility of a test. Firstly, it is uncertain whether the ability of a student is being tested or whether (s)he simply knew the answers to the items in advance. Secondly, the result of the test might depend more on items that were known to the test takers than on uncompromised items. Lastly, the credibility of a certificate becomes questionable since it is uncertain whether the candidate has sufficient skills and knowledge. To make sure a test or program is valid, reliable, and credible, it needs to be fraud-proof.

Cheating can happen on different levels. A test taker can copy the answers from another test taker, the answers might already be known to him/her, or a test taker can try to remember as many items of a test as possible and distribute or publish those items. On a larger scale, an item bank might be hacked, so that a lot of items can then be used in practice tests, which decreases the reliability of the test.

When designing a fraud-proof certification program, the following aspects are considered: discouraging, preventing, detecting, responding to, and recovering from fraud. For each aspect, its importance and the way it can be addressed are discussed.

Discouraging

To discourage test takers from cheating, it is important to let them know what the consequences of their behaviour will be, for example having to do a retake or being expelled from a course or study. The bigger the consequences, the greater the chance that the test taker will think again before cheating. Discouraging people from cheating or committing fraud, because the punishment is not worth it, is the first step in making a test or certification program fraud-proof.

Preventing

For step two, making it difficult to cheat, steal items, or bribe supervisors or teachers, multiple measures can be taken. In the first case, leaving enough space between two test takers, placing physical obstacles, distributing different versions of a test, having electronic devices and books handed in, checking for cheat sheets or notes, and having supervisors walk around are a few examples of means to prevent cheating during a test. On a different level, making sure that the test is stored somewhere safe before it is taken prevents fraud being committed with the test. Lastly, a background check on supervisors might reveal whether they have a history of providing answers to test takers or pretending not to see someone cheating.

Detecting

If discouraging and preventing were not enough to stop cheating and fraud, software can help to indicate whether fraud was committed and by whom. There are multiple methods to detect fraud statistically; two examples are the Guttman error model and the log-normal response time model (de Klerk, van Noord, & van Ommering, 2006). Additionally, the likelihood ratio test and the score test can be used to detect preknowledge of items (Sinharay, 2017).

The Guttman error model is defined by the Guttman scale, patterns, and errors. Test items are ordered from least to most difficult; it is expected that when an item is answered incorrectly, all items that are more difficult are also answered incorrectly. In practice, this is not always the case, which is why the number of Guttman errors is calculated. A Guttman error is a pair of items in which the easier item is answered incorrectly while the more difficult item is answered correctly. The Guttman score is the number of Guttman errors divided by the product of the number of items answered correctly and the number of items answered incorrectly (de Klerk & Bijl, 2020).

The log-normal response time model describes the response times on test items for each test taker (de Klerk & Bijl, 2020). A long response time can indicate that the test taker is trying to remember the item so that (s)he can pass it on to others later, while a short response time can indicate that the test taker has seen the item(s) before. The response time of the test taker is compared with the expected response time, and the differences are analysed to see whether the test taker may have cheated on the test (Van Der Linden, 2006).
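The Guttman score described above can be computed with a few lines of Python. This is a small sketch of my own, assuming that the responses of one test taker are given as 0/1 scores that are already ordered from the easiest to the most difficult item, and that a Guttman error is a pair in which an easier item is incorrect while a more difficult item is correct.

```python
def guttman_errors(responses):
    """Count Guttman errors in 0/1 responses ordered from easiest to hardest item.

    Every correct answer adds one error for each easier item that was answered incorrectly.
    """
    errors = 0
    incorrect_so_far = 0
    for score in responses:
        if score == 1:
            errors += incorrect_so_far
        else:
            incorrect_so_far += 1
    return errors

def guttman_score(responses):
    """Number of Guttman errors divided by (items correct * items incorrect)."""
    n_correct = sum(responses)
    n_incorrect = len(responses) - n_correct
    if n_correct == 0 or n_incorrect == 0:
        return 0.0  # no errors are possible when everything is correct or everything is incorrect
    return guttman_errors(responses) / (n_correct * n_incorrect)

perfect_pattern = [1, 1, 1, 0, 0]   # easy items correct, hard items incorrect
reversed_pattern = [0, 0, 1, 1, 1]  # easy items incorrect, hard items correct
mixed_pattern = [1, 0, 1, 0]
```

A score of 0 means that the response pattern fits the Guttman scale perfectly, while a score of 1 means the pattern is completely reversed. A high score is a signal to look further, not proof of fraud.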
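The comparison between observed and expected response times can be sketched as follows. This is a simplified illustration of my own, loosely based on the log-normal model, in which the logarithm of the response time is assumed to be normally distributed around (time intensity of the item minus speed of the test taker). The item parameters `beta` and `alpha`, the response times, and the cut-off of 1.96 are all made-up assumptions for the example.

```python
import math

def estimate_speed(times, beta, alpha):
    """Estimate the speed of one test taker as a precision-weighted mean of (beta - log time)."""
    num = sum(a * a * (b - math.log(t)) for t, b, a in zip(times, beta, alpha))
    den = sum(a * a for a in alpha)
    return num / den

def standardized_residuals(times, beta, alpha):
    """Standardized difference between observed and expected log response time per item."""
    tau = estimate_speed(times, beta, alpha)
    return [a * (math.log(t) - (b - tau)) for t, b, a in zip(times, beta, alpha)]

def flag_items(times, beta, alpha, threshold=1.96):
    """Return the positions of items whose response time is unusually short or long."""
    residuals = standardized_residuals(times, beta, alpha)
    return [i for i, z in enumerate(residuals) if abs(z) > threshold]

# Hypothetical values: five items with the same time intensity and discrimination,
# and one test taker who answers the last item in only 5 seconds.
beta = [4.0, 4.0, 4.0, 4.0, 4.0]     # time intensity on the log-seconds scale
alpha = [2.0, 2.0, 2.0, 2.0, 2.0]    # discrimination (inverse of the standard deviation)
times = [55.0, 60.0, 50.0, 58.0, 5.0]

flagged = flag_items(times, beta, alpha)
residuals = standardized_residuals(times, beta, alpha)
```

In this example only the last item is flagged, with a strongly negative residual, which fits the interpretation that a very short response time may point to an item that was seen before.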

Lastly, the detection of item preknowledge is important too. If items on a test are available on the internet, for example, the result of the test does not represent the skills or knowledge of the test taker. Therefore, the validity, reliability, and credibility of the test diminish or disappear (de Klerk & Bijl, 2020).

Responding

When cheating or fraud is suspected, the specific test taker should be contacted. During the conversation, this person should be informed about what exactly (s)he is accused of and should have the possibility to explain or defend him-/herself in case an error was made. It is possible that the test taker admits to fraudulent behaviour, in which case the punishment should be communicated. The student could be expelled from the course, program, or school, or an extra test or assignment could replace the test that no longer counts.

In case the test taker does not admit to having committed fraud, it might be that the test is still not admissible, and the student must retake the test or do an alternative assignment. This might depend on how reliable the software is, that is, how accurate the detection is when it indicates that someone cheated.

Recovering

The last step in the process is the recovery. Depending on how the fraud or cheating happened, the recovery consists of a different action. If it was possible to cheat during the test, more obstacles or stricter measures should be implemented. If items are compromised, the security of item banks should be improved. If such items were used in a test, the internet and test preparation organizations should be checked more often to find out whether a lot of items are known and being used for practice purposes. In case a test contains compromised items, the test should be thrown out and a new test with secure items should be made. A company, for example eX:plain, could even be hired to advise on or improve security, or to provide training to decrease the possibility that fraud is committed (“Data Forensics,” n.d.).

References

Data Forensics. (n.d.). Retrieved June 22, 2020, from https://www.explain.nl/onderzoek-en-innovatie/data-forensics

de Klerk, S., & Bijl, A. (2020, June 15). Cheating and the prevention and detection of test fraud [Slides]. Retrieved from https://canvas.utwente.nl/courses/5049/pages/data-forensics?module_item_id=148083

de Klerk, S., van Noord, S., & van Ommering, C. J. (2006). The Theory and Practice of Educational Data Forensics. In B. P. Veldkamp & C. Sluijter (Eds.), Methodology of Educational Measurement and Assessment (pp. 1–20). https://doi.org/10.1007/978-3-030-18480-3_20

Sinharay, S. (2017). Detection of Item Preknowledge Using Likelihood Ratio Test and Score Test. Journal of Educational and Behavioral Statistics, 42(1), 46–68. https://doi.org/10.3102/1076998616673872

Van Der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204. https://doi.org/10.3102/10769986031002181

Monday, 15 June 2020

Analysing PISA data with RStudio

PISA – Programme for International Student Assessment

The Organisation for Economic Co-operation and Development (OECD) commissions the PISA study every three years (“PISA,” n.d.). This is an international test in which the skills and level of 15-year-olds regarding reading, mathematics, and science are tested (“PISA,” n.d.). Students all over the world participate in this study, which makes it possible to compare the (quality of) education of different countries and track the development of education (Feskens, 2020). In every cycle, there is one main topic, which means there are more items on either reading, mathematics, or science (“PISA (Programma for International Student Assessment),” n.d.). The test usually has only a few items because it is a low-stakes test (the result has no personal consequences for the student), and a short test helps to keep motivation and attention high. Next to the test part, there is also a survey, which is filled out by the parents of the students taking part in the test, to learn about the background of the test takers (Feskens, 2020).

A few analyses were done on the results that were gathered in 2018. Firstly, the data were loaded. Then a classical test theory analysis was performed to obtain the item and test statistics, the results of the Netherlands and Germany were compared, and finally the performance of all countries was compared.

Loading data in R

Before any analysis can be done, the data need to be loaded in the program. The program that is used to analyse the data is RStudio. Figure 1 shows how the necessary libraries are loaded, how the working directory is set, and how the files are given variable names so that they can be referred to more easily. Lastly, a dexter project is started with the scoring rules and item responses that are needed for the analysis. Additionally, previews are shown of the data file and the item responses.

Figure 1. Code showing data being loaded in RStudio

CTT analysis

Classical test theory (CTT) is a first framework used to analyse test data (Feskens, 2020). Test and item statistics are shown in Figure 2. These statistics show the number of items (nItems), the alpha value, the mean p-, rit-, and rir-value, the maximum test score, and the number of responses (N). The p-value is the proportion of the maximum score obtained on an item, the rit-value is the correlation between the item score and the total test score, and the rir-value is the correlation between the item score and the score on the rest of the test. The test statistics show the average values for the test, while the item statistics show the values for each item in the data set. Cronbach’s alpha shows the reliability of the test, and a value of 0.83 indicates that the test is reliable.

Figure 2. CTT analysis of data

Comparing results

The data set can be analysed as a whole, but specific countries can also be analysed separately or compared. In this example, the test statistics of the Netherlands and Germany are compared. As shown in Figure 3, there is a difference in the alpha value between the Netherlands and Germany. The alpha value for the Netherlands is 0.82, while the alpha value for Germany is 0.67. This means that the results from the Netherlands are more reliable than the results from Germany.

Figure 3. Test statistics of the Netherlands and Germany

Comparing performances

To make a ranking of the performances of all countries that participated in the PISA study, the test scores need to be compared. Figure 4 shows a part of the individual test scores of this PISA study. Figure 5 shows a part of the test scores per country and Figure 6 shows a graph of the test scores of all countries that participated in this PISA study. This last graph shows that Japan performed best.
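The step from individual test scores to a ranking of countries can be sketched in a few lines of Python. The country labels and scores below are made up for the example and are not PISA results, and I simply use the unweighted mean score per country, whereas the official PISA results are based on a more elaborate procedure.

```python
from collections import defaultdict
from statistics import mean

def rank_countries(records):
    """Group (country, score) pairs by country and sort the mean scores from high to low."""
    by_country = defaultdict(list)
    for country, score in records:
        by_country[country].append(score)
    means = {country: mean(scores) for country, scores in by_country.items()}
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)

# Made-up individual test scores for three fictional countries
records = [
    ("Country A", 12), ("Country A", 15), ("Country A", 9),
    ("Country B", 18), ("Country B", 16),
    ("Country C", 7), ("Country C", 11),
]

ranking = rank_countries(records)
best_country, best_mean = ranking[0]
```

The first entry of the ranking is the country with the highest mean score, which is the same information that the graph in Figure 6 shows visually.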

Figure 4. Individual test scores

Figure 5. Test score per country (partly)

Figure 6. Test scores of all countries

References

Feskens, R. (2020, June 8). Programme for International Student Assessment [Slides]. Retrieved from https://canvas.utwente.nl/courses/5049/pages/pisa?module_item_id=148084

PISA. (n.d.). Retrieved June 15, 2020, from https://www.oecd.org/pisa/

PISA (Programma for International Student Assessment). (n.d.). Retrieved June 12, 2020, from https://www.cito.nl/kennis-en-innovatie/onderzoek/in-opdracht/internationaal-pisa/

Monday, 25 May 2020

Reviewing a test item with the Evidence Centered Design model

Introduction of the ECD model

In this post, an item will be assessed to check whether it is suitable for assessing a certain target skill. This item, shown in Figure 1, is taken from an old Dutch traffic theory exam for cycling, aimed at primary school children in grade 5. For this assessment, the evidence centered design (ECD) model is used. This model consists of six sub-models: the student, task, evidence, presentation, assembly, and reporting model, see Figure 2 (Almond, Steinberg, & Mislevy, 2002). In this post, only part of the student, task, and evidence model will be discussed. The student model provides insight into the minimum requirements or skills needed to perform a task. The task model answers the question of which assessment tasks are needed to gain information about the student. The evidence model answers the questions ‘What counts as evidence for proficiency?’ and ‘How to interpret evidence when drawing a conclusion about the target skill?’. Of those models, the following aspects will be discussed: the target skill, the traffic task and task situation, the task context, the task complexity features, the response, and the response processing.

Figure 1. Item used for the assessment of the target skill, translated into English (VNN, 2014).

Figure 2. The sub-models that together form the ECD model

Target skill

The target skill seems to be decision making in traffic, or having sufficient knowledge of traffic rules and insight into a situation to carry out the task safely. The task in this case, as described in Figure 1, is to safely pass the car that is backing out of a parking spot. The students go through three processes while solving the item: the perception process, the anticipation process, and the decision process. In the perception process, they become aware of their environment: more specifically, their own speed, the speed and direction of the other vehicles, and the fact that the rules indicate that the car should give way to them. In the anticipation process, they need to know the possible outcomes and predict the behaviour of the other people involved in this situation. The following are a few possibilities that may cross their minds: we can continue if the car lets us go first, we should slow down to make sure the car driver notices us in time, we should slow down to avoid a collision if the car driver does not see us in time, and we should not brake abruptly since we would then risk a collision with each other (the other cyclists). These thoughts focus on traffic flow, safety, and social participation in traffic. Lastly, a decision needs to be made on how to act, and the corresponding answer needs to be chosen. The decision to continue or slow down will be based on the speed and movement of the car. If it stops, the cyclists can continue. If the car does not stop, the cyclists will have to slow down and eventually brake. Therefore, the safest option is to at least slow down until the cyclists are noticed by the car driver. So, Sanne gives the correct answer (see Figure 1).

Solving this item might seem straightforward; however, it is important to be aware of the pitfalls of this question and to pay attention to what can go wrong during these processes. It is possible that not all relevant information is noticed in the perception phase, that not all possible outcomes are thought of in the anticipation phase, or that the wrong decision is still made despite correct perception and anticipation. It might even be that, for example, the reading skills of the student are not fully developed, so that the student does not understand the question, which can also be a reason for giving an incorrect answer to this item.

Traffic task and task situation

The traffic situation can be described with the help of a detailed overview of the traffic task, shown in Table 1. The table shows that the cyclists are cycling on a road and that there are other road users. The cyclists encounter a car that is backing out of a parking spot into the road, and they must change their position to pass the car and avoid a collision. Once they have passed the car, they can go back to cycling on the right side of the road.

Table 1

Characteristics of task situations

Main task	Subtask	Light	Weather	Road category	Road section	Other road users
Cruising	Stay on course	Normal view	Normal	30 km/h road	Straight road	Cars
Cruising	Stay on course	Normal view	Normal	30 km/h road	Straight road	Bikers
Change lateral position	Overtaking car	Normal view	Normal	30 km/h road	Straight road	Cars
Change lateral position	Overtaking car	Normal view	Normal	30 km/h road	Straight road	Bikers
Change lateral position	Merging	Normal view	Normal	30 km/h road	Straight road	Cars
Change lateral position	Merging	Normal view	Normal	30 km/h road	Straight road	Bikers

Task context

The context of the task can be real, simulated, described, or context-free. Since the item is a picture, it is clearly not a real or simulated context. I think the picture can be evaluated as a described situation as well as context-free. The actors, objects, and materials are visually shown and there is only focus on the specific task of safely continuing the route without being interrupted or hit by the car that is getting out of the parking spot. An argument that can be made for context-free is the fact that only this specific task is focused on, so the test taker should mainly be aware of the rule that the car driver should give way to the cyclists. However, since a picture with the situation is provided rather than only a rule or sign, I would say that the context of the task is more described than context-free.

To let a student experience a learning task more realistically, the task context could be real or simulated. For example, if you are cycling and you want to turn left, you must think about a lot of things, e.g. looking for other road users, signalling, and deciding whether you can continue or should wait. Therefore, practising such a skill might be helpful. Doing this in a simulation has the advantage that it looks like a real situation but without the actual danger.

Task complexity features

There are various aspects that add to the complexity of carrying out the target skill in a real situation. A few of these aspects are shown in Table 2. As described under the target skill, the student needs to notice other road users and their speed and direction. To decide how to handle the situation, other road users and their speed and direction should be considered. The decision that is made should be feasible to carry out with regard to space and time.

The complexity of this task could be varied by changing the situation in which the target skill is carried out. To increase the complexity, it could be night-time instead of daytime, which decreases sight and visibility. To make the task less difficult, the car coming from the opposite direction or the parked cars could be removed to increase visibility and the space to carry out the action of passing the car.

Table 2

Task complexity features

	Perception	Decision making	Action execution
Sight and visibility	X
Presence of other road users	X	X
Regulation of situation		X
Speed differences	X	X	X
Time pressure		X	X
Space to carry out actions	X	X	X

Response

The students respond to this item by selecting one of the three answers that are provided. Therefore, the answer is provided using the visual response channel. However, if the answer were constructed by the students themselves, they would have the possibility to express and explain themselves better, and the question would not be marked with a pass or fail based only on the final answer. The constructed answer could be provided to the assessor verbally, or an element of interaction could be added to the question by letting the student move certain elements in the situation. However, the item is paper-based, so, instead of selecting one option, it would be more insightful to present this item as an open question.

Response processing

There are different parts of a response that points can be assigned to: actions and strategies, the actual solution, or the consequences of the solution. In this case, since only A, B, or C can be chosen, the reasoning of the test taker is not provided to the assessor. So, either the actual solution or the consequences the answer has in this situation can be scored. However, if points are assigned to the answer with the best outcome, the item should be formulated differently, because the wording of the question should make the aim of the item clear. If the item were presented as an open question as suggested above, the actions and strategies used by the student could be scored together with the actual solution.

The scoring for the alternative task mentioned in the task context (turning left when cycling) differs from the item provided in Figure 1. The alternative task has a simulated context, the actions and the consequences are important, thus the student should get points based on (one of) these parts. However, a student should be made aware of the aim of the task before carrying it out. When the assessor is interested in the reasoning behind the actions, the student could share their thinking process when deciding what to do or they could be asked later why they made certain decisions.

Conclusion

A few aspects of the student, task, and evidence model were discussed. Those aspects are an example of what must be considered when constructing an item for a test. It is important to know what skill needs to be measured to develop a fitting item or test that provides an answer to the question if the student is proficient in a certain skill.

For this specific item, I think that overall it is a good item to measure the target skill. However, it could benefit from rethinking the way the student has to respond to the item, since the reasoning behind the answer is currently unknown to the assessor. That reasoning can also give the assessor valuable information about possible mistakes that are made during the decision-making process.

References

Almond, R. G., Steinberg, L. S., & Mislevy, R. J. (2002). Enhancing the design and delivery of assessment systems: A four process architecture. Journal of Technology, Learning, and Assessment, 1(5), 1–64.