Development and assessment of tests for education: June 2020

Monday, 22 June 2020

Defining educational measurement and describing its innovations and future

In the last eight weeks, I have learned about different topics in educational measurement, as can be seen in the previous six blog posts. In this post I would like to summarize and define educational measurement based on those six topics, discuss innovations and the future of educational measurement, and describe what my contribution or role in this field could be.

What is educational measurement?

Definitions of educational measurement in the literature focus on the skill or knowledge that is measured and on the tool used to measure it. Cito formulates educational measurement as the “objective measurements of the knowledge, skills and competencies of your students and professionals” (“Educational Measurement,” n.d.), while St. Thomas University defines measurement as “determining the attributes or dimensions of an object, skill or knowledge” (“Importance of Educational Measurement, Assessment and Evaluation,” 2018). To measure a skill or knowledge in a reliable way, tests should meet certain standards. Looking back at the previous blog posts, I would say that a definition of educational measurement should at least include the following aspects: developing a test, assessing the quality of the test, deciding how much needs to be answered correctly to pass the test, analysing the data generated during a test, and being aware of fraud. Combining the definitions found online with the information provided during the course, I would propose the following definition of educational measurement:

Educational measurement entails the development and quality assessment of a test or assignment, standard setting, the analysis of test data, and the detection of fraud.

What are the upcoming innovations in educational measurement?

When thinking about innovation, I easily wander into the field of technology and search for opportunities to improve or support current practices. Over the last years, technology has had a big impact on how tests are developed, taken, and analysed (Zenisky & Sireci, 2002). In test construction, this development is visible in computerized adaptive testing (CAT). CAT selects items from an item bank based on the level of the test taker, so that fewer items are needed to measure the ability of the test taker (Henk van der Kolk, 2018). This type of testing would not be possible without the technological advancements that exist nowadays. The worldwide collection of data, as in PISA, is also easier with current technology. Finding schools to participate, distributing the tests and surveys, and collecting the data are all important parts of this research that benefit from technological development. Lastly, running analyses on big data or doing statistical analyses to detect fraud is easier with software developed for those analyses.
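To make the idea of adaptive item selection more concrete, here is a minimal Python sketch. It is not taken from the lecture: it assumes a Rasch model and a "pick the most informative item" rule, which is one common choice among several, and the item bank with its difficulties is entirely hypothetical.

```python
import math

def prob_correct(theta, difficulty):
    """Rasch model (assumed here): chance of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def item_information(theta, difficulty):
    """Fisher information of a Rasch item at ability theta: p * (1 - p)."""
    p = prob_correct(theta, difficulty)
    return p * (1.0 - p)

def select_next_item(theta, item_bank, administered):
    """Pick the unused item that is most informative at the current estimate."""
    candidates = [item for item in item_bank if item not in administered]
    return max(candidates, key=lambda item: item_information(theta, item_bank[item]))

# Hypothetical item bank: item id -> difficulty on the ability scale.
item_bank = {"easy": -2.0, "medium": 0.0, "hard": 2.0}

# Start at an average ability estimate of 0.0.
first_item = select_next_item(0.0, item_bank, administered=set())

# Suppose the estimate moved up to 1.5 after a correct answer.
next_item = select_next_item(1.5, item_bank, administered={first_item})
print(first_item, next_item)
```

Because each item is chosen close to the current ability estimate, every response is relatively informative, which is why fewer items can be enough. A real CAT would also re-estimate the ability after each response and apply a stopping rule; both are left out here.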

What will educational measurement look like in the future?

For the future, I think it is important that technology is researched and used to support and improve the field of educational measurement. There have been many improvements already over the last years, and I assume that these will continue in the coming years. Newly developed technologies or software might increase the quality and speed of data analyses, and new programs might make it easier to distribute tests and collect data. The security of those programs and software needs to evolve as well, to keep up with new ways in which people will try to steal items from an item bank. Alternatively, the concept of an item bank might be changed or improved so that it is more difficult to hack such a system.

What role would I like to fulfil as an educational specialist in the future (of educational measurement) described above?

Educational measurement consists of various aspects, which makes it a broad field. Of the aspects in the definition given at the beginning, I would like to contribute to the analysis of test data, either for learning purposes or for fraud detection. In combination with the development of technology in the future, I think this is a promising area for research. Studying the current field to find what can be improved, and researching how current technological developments can support those improvements, seems interesting, relevant, and important.

References

Educational Measurement. (n.d.). Retrieved June 5, 2020, from https://www.cito.com/we-are/educational-measurement

Henk van der Kolk. (2018, June 14). Computerized Adaptive Testing [Video file]. Retrieved from http://www.youtube.com/watch?v=jckSV5vHSIs

Importance of Educational Measurement, Assessment and Evaluation. (2018). Retrieved June 5, 2020, from determining the attributes or dimensions of an object, skill or knowledge

Zenisky, A. L., & Sireci, S. G. (2002). Applied measurement in education: A review of strategies for validating computer-automated scoring. Applied Measurement in Education, 15(4), 337–362. https://doi.org/10.1207/S15324818AME1504

Designing a fraud-proof certification program

It is important that a test is taken without any cheating or fraud. Otherwise, the validity, reliability, and credibility of the test are negatively affected. Firstly, it is uncertain whether the ability of a student was tested or whether (s)he already knew the responses to the items that were asked. Secondly, the result of the test might depend more on items that were known to the test takers than on uncompromised items. Lastly, the credibility of a certificate becomes questionable, since it is uncertain whether the candidate has sufficient skills and knowledge. To make sure a test or program is valid, reliable, and credible, it needs to be fraud-proof.

Cheating can happen on different levels. A test taker can copy the answers from another test taker, may already know the answers, or can try to remember as many items of a test as possible and distribute or publish those items. On a larger scale, an item bank might be hacked, so that many items can then be used in practice tests, which decreases the reliability of the test.

When designing a fraud-proof certification program, the following aspects are considered: discouraging, preventing, detecting, responding to, and recovering from fraud. For every aspect, its importance and how it is taken care of are discussed.

Discouraging

To discourage test takers from cheating, it is important to let them know what the consequences of their behaviour will be, for example having to do a retake or being expelled from a course or study. The bigger the consequences, the greater the chance that the test taker will think again before cheating. Discouraging people from cheating or committing fraud because the punishment is not worth it is the first step in making a test or certification program fraud-proof.

Preventing

Step two is making it difficult to cheat, steal items, or bribe supervisors or teachers, and multiple measures can be taken. To prevent cheating during a test, examples are leaving enough space between test takers, placing physical obstacles, distributing different versions of a test, having electronic devices and books handed in, checking for cheat sheets or notes, and having supervisors walk around. On a different level, storing the test somewhere safe before it is taken prevents fraud being committed with the test. Lastly, a background check on supervisors might reveal whether they have a history of providing answers to test takers or pretending not to see someone cheating.

Detecting

If discouragement and prevention were not enough to stop cheating and fraud, software can help to indicate whether fraud was committed and by whom. There are multiple methods to detect fraud statistically; two examples are the Guttman error model and the log-normal response time model (Klerk, Noord, & Ommering, 2006). Additionally, the likelihood ratio test and the score test can be used to detect preknowledge of items (Sinharay, 2017).

The Guttman error model is defined by the Guttman scale, patterns, and errors. Test items are ordered from least to most difficult; it is expected that when an item is answered incorrectly, all items that are more difficult are also answered incorrectly. In practice, this is not always the case, which is why the Guttman error is calculated. The Guttman score = (number of Guttman errors) / (items answered correctly * items answered incorrectly) (Klerk & Bijl, 2020).
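The Guttman score above can be illustrated with a small Python sketch. I assume here that a Guttman error is counted per pair of items, namely every pair in which the easier item is answered incorrectly and the harder item correctly; the slides may count errors slightly differently. The function names and the response patterns are made up for illustration.

```python
def guttman_errors(responses):
    """Count item pairs where the easier item is wrong and the harder one right.

    `responses` is a list of 0/1 scores ordered from easiest to hardest item.
    """
    errors = 0
    incorrect_so_far = 0
    for score in responses:
        if score == 0:
            incorrect_so_far += 1
        else:
            # Each easier item answered incorrectly forms an error pair
            # with this correctly answered, harder item.
            errors += incorrect_so_far
    return errors

def guttman_score(responses):
    """Number of Guttman errors divided by (number correct * number incorrect)."""
    n_correct = sum(responses)
    n_incorrect = len(responses) - n_correct
    if n_correct == 0 or n_incorrect == 0:
        return 0.0  # no error pair is possible for an all-right or all-wrong pattern
    return guttman_errors(responses) / (n_correct * n_incorrect)

print(guttman_score([1, 1, 1, 0, 0]))  # perfect Guttman pattern
print(guttman_score([1, 0, 1, 0, 1]))  # mixed pattern
print(guttman_score([0, 0, 1, 1, 1]))  # fully reversed pattern
```

Under this counting, the denominator is the largest possible number of error pairs, so the score runs from 0 for a perfect Guttman pattern to 1 for a fully reversed one. A score close to 1 means the test taker mainly got the difficult items right and the easy ones wrong, which is the kind of unexpected pattern that may deserve a closer look.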

The log-normal response time model describes the response times on test items for each test taker (Klerk & Bijl, 2020). A long response time can indicate that the test taker is trying to remember the item so that (s)he can pass it on to others later, while a short response time can indicate that the test taker has seen the item(s) before. The response time of the test taker is compared with the expected response time, and the differences are analysed to see whether the test taker might have cheated on the test (Van Der Linden, 2006).
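The comparison between observed and expected response times can be sketched in Python as follows. In the lognormal model, the logarithm of the response time is assumed to be normally distributed around (time intensity of the item minus speed of the test taker). The parameter values, the threshold of 2.0, and the function names below are my own assumptions for illustration; in practice the parameters are estimated from data.

```python
import math

def standardized_log_time_residual(seconds, time_intensity, speed, discrimination=1.0):
    """Standardized difference between observed and expected log response time.

    The expected log time is time_intensity - speed; `discrimination` is the
    inverse of the standard deviation of the log times for the item.
    """
    expected_log_time = time_intensity - speed
    return discrimination * (math.log(seconds) - expected_log_time)

def flag_response_times(times, time_intensities, speed, threshold=2.0):
    """Label each response as 'fast', 'slow', or 'ok' (threshold is hypothetical)."""
    flags = []
    for seconds, time_intensity in zip(times, time_intensities):
        z = standardized_log_time_residual(seconds, time_intensity, speed)
        if z < -threshold:
            flags.append("fast")
        elif z > threshold:
            flags.append("slow")
        else:
            flags.append("ok")
    return flags

# Hypothetical items that each take about exp(4.0), roughly 55 seconds,
# for a test taker of average speed (speed = 0.0).
observed_seconds = [55, 5, 600]
flags = flag_response_times(observed_seconds, [4.0, 4.0, 4.0], speed=0.0)
print(flags)
```

A flag is only a signal that a response time is unusual given the model. It is not proof of fraud, which is why the responding step that follows matters.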

Lastly, the detection of item preknowledge is important too. If items on a test are available on the internet, for example, the result of the test does not represent the skills or knowledge of the test taker. Therefore, the validity, reliability, and credibility of the test diminish or disappear (Klerk & Bijl, 2020).

Responding

When cheating or fraud is suspected, the specific test taker should be contacted. During the conversation, this person should be informed about what exactly (s)he is accused of and should have the possibility to explain or defend him- or herself in case an error was made. It is possible that the test taker admits to fraudulent behaviour, in which case the punishment should be communicated. The student could be expelled from the course, program, or school, or an extra test or assignment could substitute for the test that no longer counts.

In case the test taker does not admit to having committed fraud, it might be that the test is still not admissible, and the student must retake the test or do an alternative assignment. This decision might depend on how reliable the software is, that is, on how accurate the detection is when it indicates that someone cheated.

Recovering

The last step in the process is recovery. Depending on how the fraud or cheating happened, the recovery consists of a different action. If it was possible to cheat during the test, more obstacles or stricter measures should be implemented. If items are compromised, the security of item banks should be improved. If such items were used in a test, the internet and test preparation organizations should be checked more often, to find out whether many items are known and being used for practice purposes. In case a test consists of compromised items, the test should be thrown out and a new test with secure items should be made. A company, for example eX:plain, could even be hired to advise on or improve security, or to provide training that decreases the possibility of fraud being committed (“Data Forensics,” n.d.).

References

Data Forensics. (n.d.). Retrieved June 22, 2020, from https://www.explain.nl/onderzoek-en-innovatie/data-forensics

de Klerk, S., & Bijl, A. (2020, June 15). Cheating and the prevention and detection of test fraud [Slides]. Retrieved from https://canvas.utwente.nl/courses/5049/pages/data-forensics?module_item_id=148083

de Klerk, S., van Noord, S., & van Ommering, C. J. (2006). The Theory and Practice of Educational Data Forensics. In B. P. Veldkamp & C. Sluijter (Eds.), Methodology of Educational Measurement and Assessment (pp. 1–20). https://doi.org/10.1007/978-3-030-18480-3_20

Sinharay, S. (2017). Detection of Item Preknowledge Using Likelihood Ratio Test and Score Test. Journal of Educational and Behavioral Statistics, 42(1), 46–68. https://doi.org/10.3102/1076998616673872

Van Der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204. https://doi.org/10.3102/10769986031002181

Monday, 15 June 2020

Analysing PISA data with RStudio

PISA – Programme for International Student Assessment

The Organisation for Economic Co-operation and Development (OECD) commissions the PISA study every three years (“PISA,” n.d.). This is an international test in which the skills and level of 15-year-olds in reading, mathematics, and science are tested (“PISA,” n.d.). Students all over the world participate in this study, which makes it possible to compare the (quality of) education of different countries and track the development of education (Feskens, 2020). In every cycle, there is one main topic, which means there are more items on either reading, mathematics, or science (“PISA (Programma for International Student Assessment),” n.d.). The test usually has only a few items, both because it is a low-stakes test, with no personal consequences based on the result, and to keep motivation and attention high. Next to the test part, there is also a survey which is filled out by the parents of the students taking part in the test to learn about the background of the test takers (Feskens, 2020).

A few analyses were done on the results that were gathered in 2018. Firstly, the data were loaded. Then a classical test theory analysis was performed to get the item and test statistics, the results of the Netherlands and Germany were compared, and finally the performance of all countries was compared.

Loading data in R

Before any analysis can be done, the data need to be loaded into the program. The program that is used to analyse the data is RStudio. Figure 1 shows how the necessary libraries are loaded, how the working directory is set, and how the files are given a variable name so that they can be referred to more easily. Lastly, a dexter project is started with the scoring rules and item responses that are needed for the analysis. Additionally, previews of the data file and the item responses are shown.

Figure 1. Code showing data being loaded in RStudio

CTT analysis

Classical test theory (CTT) is a first framework used to analyse test data (Feskens, 2020). Test and item statistics are shown in Figure 2. These statistics show the number of items (nItems), the alpha value, the mean p-, rit-, and rir-value, the maximum test score, and the number of responses (N). The test statistics show the average values of the test, while the item statistics show the values for each item in the data set. Cronbach’s alpha indicates the reliability of the test; a value of 0.83 is generally taken to mean that the test is reliable.
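To show what these statistics mean, here is a small Python sketch that computes them by hand. It is not the dexter code from Figure 2 and the response matrix is made up; I assume dichotomous (0/1) items, so that the p-value is simply the proportion correct, the rit is the correlation between item and total score, and the rir is the correlation between the item and the total score without that item.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (spread_x * spread_y) ** 0.5

def cronbach_alpha(scores):
    """Cronbach's alpha; `scores` holds one list of item scores per test taker."""
    n_items = len(scores[0])
    item_variances = [pvariance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = pvariance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)

def item_statistics(scores):
    """p-value, rit, and rir for every item."""
    totals = [sum(row) for row in scores]
    stats = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        rest = [total - score for total, score in zip(totals, item)]
        stats.append({"p": mean(item),
                      "rit": pearson(item, totals),
                      "rir": pearson(item, rest)})
    return stats

# Hypothetical responses of four test takers to three items.
scores = [[1, 1, 1],
          [1, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]

alpha = cronbach_alpha(scores)
stats = item_statistics(scores)
print(round(alpha, 2), stats[1])
```

The rir is lower than the rit because the item itself is removed from the total score, which takes away the part of the correlation that comes from the item correlating with itself.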

Figure 2. CTT analysis of data

Comparing results

The data set can be analysed as a whole, but specific countries can also be analysed separately or compared. In this example, the test statistics of the Netherlands and Germany are compared. As shown in Figure 3, there is a difference in the alpha value between the Netherlands and Germany. The alpha value for the Netherlands is 0.82, while the alpha value for Germany is 0.67. This suggests that the results from the Netherlands are more reliable than the results from Germany.

Figure 3. Test statistics of the Netherlands and Germany

Comparing performances

To make a ranking of the performances of all countries that participated in the PISA study, the test scores need to be compared. Figure 4 shows a part of the individual test scores of this PISA study. Figure 5 shows a part of the test scores per country and Figure 6 shows a graph of the test scores of all countries that participated in this PISA study. This last graph shows that Japan performed best.
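The ranking step can be sketched in Python as follows. This is a simplified illustration and not the code behind Figures 4 to 6: it takes the plain mean of individual test scores per country, without the weighting used in the official PISA reports. The country labels and scores are invented, so they say nothing about real results.

```python
from collections import defaultdict
from statistics import mean

def rank_countries(records):
    """Return (country, mean score) pairs, sorted from highest to lowest mean.

    `records` is a list of (country, test score) pairs, one per test taker.
    """
    scores_by_country = defaultdict(list)
    for country, score in records:
        scores_by_country[country].append(score)
    means = {country: mean(scores) for country, scores in scores_by_country.items()}
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical individual test scores for three made-up countries.
records = [("A", 12), ("A", 15), ("B", 18), ("B", 20), ("C", 9), ("C", 14)]

ranking = rank_countries(records)
for position, (country, mean_score) in enumerate(ranking, start=1):
    print(position, country, mean_score)
```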

Figure 4. Individual test scores

Figure 5. Test score per country (partly)

Figure 6. Test scores of all countries

References

Feskens, R. (2020, June 8). Programme for International Student Assessment [Slides]. Retrieved from https://canvas.utwente.nl/courses/5049/pages/pisa?module_item_id=148084

PISA. (n.d.). Retrieved June 15, 2020, from https://www.oecd.org/pisa/

PISA (Programma for International Student Assessment). (n.d.). Retrieved June 12, 2020, from https://www.cito.nl/kennis-en-innovatie/onderzoek/in-opdracht/internationaal-pisa/