Modification Description: 1-01

We regret that we did not fully explain the basic principles of Classical Test Theory (CTT), Item Response Theory (IRT), and the Rasch model in the original manuscript; they are explained and presented more clearly in the revised article. In the revised article we have also cited the original works in which these theories were developed (e.g. Gulliksen 1950; Lord 1953; Rasch 1960). The specific modifications are as follows:

2.1 Theories of Test Development

2.2.1 Classical Test Theory

The basic idea of Classical Test Theory (CTT) is to view the score of a test (often called the observed score of a test) as a linear combination of true scores and error scores, i.e.

X = T + E

Here X is the observed score (the measurement result), T is the true score and E is the error score (Gulliksen 1950). The principles and methods that traditional testing uses for reliability, validity and item analysis are based on this model. Every test theory rests on assumptions, which can be divided into weak assumptions, which most test data satisfy easily, and strong assumptions, which most test data do not. CTT rests on three weak assumptions: (i) If a particular psychological trait of a person could be measured repeatedly with parallel tests, the mean of the observed scores would approach the true score. (ii) The correlation between true scores and error scores is zero. (iii) The correlation between the error scores on any two parallel tests is zero. With CTT, the reliability and validity of a measurement instrument and the difficulty and discrimination of its items can be examined (Brennan 2010).

The theoretical system of CTT is well established and has the following advantages: (i) Its theoretical assumptions are weak and its requirements for implementation are not stringent, so it is widely applicable. (ii) It focuses on the validity of the measurement instrument, especially construct validity. (iii) It requires only a small sample; a sample of 200-500 is usually sufficient.

However, CTT also has shortcomings: (i) Item dependence: test scores depend on the difficulty of the items (the harder the items, the lower the scores), which makes it difficult to compare subjects who take different tests. (ii) Sample dependence: item difficulty depends heavily on the sample of subjects; if the ability of the sample is high, the estimated item difficulty is low. (iii) Item difficulty and subject ability are not expressed in the same frame of reference, so it is not possible to verify how well the items match the subjects (Lord 1953; Hambleton and Jones 1993; Fayers 2004).
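The decomposition X = T + E and weak assumptions (ii) and (iii) can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the normal distributions, the standard deviations, the sample size and the function names (`simulate_ctt`, `pearson`) are hypothetical choices, not part of the study. It relies on the standard CTT result that reliability equals var(T) / var(X), which the correlation between two parallel forms should approximate.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_ctt(n_persons=2000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate observed scores X = T + E on two parallel forms.

    All numeric settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_persons)]
    # Errors are drawn independently of T and of each other,
    # mirroring weak assumptions (ii) and (iii).
    errors_a = [rng.gauss(0.0, error_sd) for _ in range(n_persons)]
    errors_b = [rng.gauss(0.0, error_sd) for _ in range(n_persons)]
    form_a = [t + e for t, e in zip(true_scores, errors_a)]
    form_b = [t + e for t, e in zip(true_scores, errors_b)]
    return true_scores, errors_a, form_a, form_b


true_scores, errors_a, form_a, form_b = simulate_ctt()

# Reliability implied by the simulation settings: var(T) / (var(T) + var(E)).
expected_reliability = 10.0 ** 2 / (10.0 ** 2 + 5.0 ** 2)

print("implied reliability:", expected_reliability)
print("parallel-forms correlation:", round(pearson(form_a, form_b), 3))
print("corr(T, E) on form A:", round(pearson(true_scores, errors_a), 3))
```

With a large simulated sample, the parallel-forms correlation should lie close to the implied reliability and the correlation between true and error scores close to zero.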

2.2.2 Item Response Theory

Modern test theory emerged in response to the shortcomings of CTT. For item analysis, its main development is Item Response Theory (IRT), which is based on latent trait theory. A latent trait (denoted θ) is a stable, intrinsic characteristic that is not directly observable, that governs a subject's responses to the corresponding items, and that makes those responses consistent. The latent trait of a subject is related to the responses to items measuring that trait as follows: as the latent trait increases, the probability P(θ) of responding correctly to the item also increases (Lord 1977). IRT comprises many models, and choosing a model that fits the data allows a more accurate analysis of the items. Currently, the most commonly used models are the one-parameter logistic model (the 1PL or Rasch model, which has only a difficulty parameter), the two-parameter logistic model (the 2PL model, which has difficulty and discrimination parameters) and the three-parameter logistic model (the 3PL model, which has difficulty, discrimination and guessing parameters). The 3PL model equation is:

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-D a_i (\theta - b_i)}}, \qquad i = 1, 2, \ldots, n$$

In the formula, a, b and c are the three item parameters, namely item discrimination, item difficulty and the guessing factor, and D is a scaling constant equal to 1.7. If the guessing factor is not taken into account, then c = 0 and the model is the 2PL model; if it is further assumed that all items have the same discrimination but different difficulties, then a takes a common value for all items (conventionally a = 1) and c = 0, and the model becomes the 1PL model.
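The three logistic models can be written as one function with special cases. The following is a minimal sketch, assuming the standard 3PL form P(θ) = c + (1 - c) / (1 + exp(-D a (θ - b))); the function names are hypothetical, and the 1PL case fixes the common discrimination at 1, which is an illustrative convention rather than a requirement.

```python
import math

D = 1.7  # scaling constant named in the text


def p_3pl(theta, a, b, c, d=D):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def p_2pl(theta, a, b):
    """2PL model: the 3PL model with no guessing (c = 0)."""
    return p_3pl(theta, a, b, c=0.0)


def p_1pl(theta, b):
    """1PL model: c = 0 and one discrimination shared by all items.

    The shared value is set to 1 here as an assumption for illustration.
    """
    return p_3pl(theta, a=1.0, b=b, c=0.0)


# When ability equals difficulty, the 2PL and 1PL probabilities are 0.5,
# while the 3PL probability lies halfway between c and 1.
for name, p in [
    ("3PL", p_3pl(0.0, a=1.2, b=0.0, c=0.2)),
    ("2PL", p_2pl(0.0, a=1.2, b=0.0)),
    ("1PL", p_1pl(0.0, b=0.0)),
]:
    print(name, p)
```

Setting a = 0 would instead make the probability the same for every θ, which is why the 1PL case keeps a positive common discrimination.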

Whereas CTT is based on weak assumptions, IRT is based on strong assumptions, of which three are basic: (i) Unidimensionality of the latent trait: all items that make up a given test measure the same latent trait. (ii) Local independence: for subjects at a given ability level, the responses to different items are statistically independent of one another. (iii) The item characteristic curve: a model specifies the relationship between the probability of a correct response to an item and the subject's ability. The main advantages of IRT include: (i) It solves the sample-dependence problem of CTT. (ii) It solves the item-dependence problem of CTT. (iii) It estimates the subject's ability and the difficulty of the items on the same scale, so it can verify whether the items match the subject's ability (Hambleton, Swaminathan, and Rogers 1991). However, IRT also has shortcomings; for example, it usually requires a sample size of more than 500.
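Local independence has a direct computational consequence: for a fixed ability, the probability of a whole response pattern is the product of the item probabilities. The sketch below illustrates this with a simple one-parameter logistic item response function; the function names, the ability value and the item difficulties are hypothetical values chosen only for the example.

```python
import itertools
import math


def p_correct(theta, b):
    """One-parameter logistic item response function (illustrative choice)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pattern_likelihood(theta, difficulties, responses):
    """Probability of a response pattern given ability theta.

    Under local independence the item responses are independent given theta,
    so the pattern probability is the product of the item probabilities.
    """
    likelihood = 1.0
    for b, x in zip(difficulties, responses):
        p = p_correct(theta, b)
        likelihood *= p if x == 1 else 1.0 - p
    return likelihood


difficulties = [-1.0, 0.0, 1.5]  # hypothetical item difficulties
theta = 0.5                      # hypothetical ability

# The probabilities of all 2**n possible patterns must sum to 1.
total = sum(
    pattern_likelihood(theta, difficulties, pattern)
    for pattern in itertools.product([0, 1], repeat=len(difficulties))
)
print("sum over all response patterns:", total)
```

This product form is what ability and item parameter estimation builds on; when local independence fails, the product no longer gives the pattern probability.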

2.2.3 Rasch Model

The Rasch model, one of the IRT models, was developed by the Danish mathematician Georg Rasch. It is formulated as:

$$P(X = 1 \mid B, D) = \frac{e^{B - D}}{1 + e^{B - D}}$$

P is the probability that an individual with ability B correctly answers an item of difficulty D, and X is the random variable denoting success or failure on the item, with X = 1 indicating success and X = 0 indicating failure. The Rasch model assumes that the probability of success is influenced only by individual ability and item difficulty, i.e. the probability that an individual answers an item correctly depends only on the subject's ability and the item's difficulty (Rasch 1960). The Rasch model is currently the simplest model in the field of IRT and requires the fewest parameters to be estimated, incorporating only the difficulty and ability parameters. Its mathematical expression shows that the difficulty and ability parameters are symmetrical to each other. Because of this symmetry, the Rasch model can transform a non-linear data matrix of item responses into two columns of interval data, one reflecting the ability parameters and one reflecting the difficulty parameters. These properties give the Rasch model two major advantages in practical applications. Firstly, it can transform non-linear data into data with equal-interval meaning, i.e. construct higher-level linear measures from lower-level (ordinal) data, thus providing more informative, more accurate and more objective measurements (Bond and Fox 2015; Fischer and Molenaar 2012). Secondly, the Rasch model places subjects and items on the same scale, which helps the researcher to assess and interpret more accurately the match between the target being measured and the measurement instrument (Lunz 2010). In addition, parameter estimates in the Rasch model are generally more stable and accurate than those of more complex models and are less susceptible to extraneous factors during parameter estimation or data transformation, which can help to improve the reliability of the measure.
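The symmetry of ability and difficulty and the common scale can be made concrete with a short numerical sketch. It assumes the standard logistic form of the Rasch model, P(X = 1) = exp(B - D) / (1 + exp(B - D)); the function names and the numeric values are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success for ability B and difficulty D (Rasch model)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Logit transform: maps a probability onto the linear (interval) scale."""
    return math.log(p / (1.0 - p))


# Only the difference B - D matters, so shifting both by the same amount
# leaves the probability unchanged: persons and items share one scale.
p_original = rasch_probability(ability=2.0, difficulty=0.5)
p_shifted = rasch_probability(ability=3.0, difficulty=1.5)
print("same difference, same probability:", p_original, p_shifted)

# The log-odds of success equal B - D, which is linear in both parameters.
print("log-odds of success:", log_odds(p_original))
```

Because the log-odds equal B - D, a one-unit gain in ability and a one-unit drop in difficulty have the same effect on the odds of success, which is the sense in which the two parameters are symmetrical.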