## Some useful definitions

Let us first note that when I refer to a **measurand** I simply mean the (psychological) attribute that we intend to measure.

A **realist** defines the measurand outside of the measurement procedure, that is, the measurand is some test-independent attribute that is estimated by scores produced by the test.

An **operationalist** would instead suppose that the measurand is defined by the test itself and cannot be defined outside of the test (i.e., the measurand is test-dependent). Note that the *strict* operationalist I discuss in this blog post is a sort of extreme operationalist. In the measurement error modeling section, I also describe a *laxed* operationalist, but that label is just a useful descriptor for contrasting two types of operationalist measurement models.

A **respectful operationalist** is an operationalist who defines the measurand in terms of the test, but requires that the test be validated for extra-operational meanings and usability (i.e., meanings and usability beyond the test; Vessonen 2021). So the respectful operationalist is kind of like “sure, yeah we can define the measurand in terms of the test, but the test needs to incorporate common connotations of the target concept and produce usable scores.”

## Operationalist and Realist Measurement Error Models

Okay let’s start at the foundation: we observe a test score for a person, $X_p$.

**A strict operationalist** may simply define a person’s *true* score as the observed score such that, $T_p \equiv X_p$. Trivially, the observed score can be modeled *as* the true score,

$$
X_p = T_p \tag{1}
$$

There is something attractive and practical about this model; however, there are immediate problems of usability. What if we test that person again and they obtain a different score? What if the test is scored by a rater and a different rater assigns a different score to that person? To answer this, let us use an example where a person’s true score is defined as the observed score $X_{par}$, for person $p$, test administration $a$, and rater $r$ such that,

$$
T_p = X_{par}
$$

However, the problem now is that if the observed score varies across test administrations and/or raters, then the true score will not be generalizable across test administrations or raters, since for some other administration $a'$ and rater $r'$,

$$
T_p = X_{par} \neq X_{pa'r'}
$$

This will force us to define a true score for not only each person, but *also* each test administration and rater such that,

$$
T_{par} = X_{par}
$$

However, differences in observed scores between test administrations and raters may not be of scientific interest if we only really care about person-level differences, which makes this specific operationalist approach unusable in practice. It is interesting to note, though, that the reliability of observed scores is necessarily perfect, as shown by the reliability index, $\rho_{XT} = 1$ (i.e., the correlation between true and observed scores).
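The trade-off described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the number of persons, administrations, and raters, the score distributions, and the helper name `pearson` are all hypothetical and not part of the original argument.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(1)

# Hypothetical design: 50 persons, 3 test administrations, 2 raters.
persons, administrations, raters = range(50), range(3), range(2)

# Observed scores: a person-level component plus noise that changes
# with every administration and rater (assumed distributions).
observed = {}
for p in persons:
    person_level = random.gauss(50, 10)
    for a in administrations:
        for r in raters:
            observed[(p, a, r)] = person_level + random.gauss(0, 5)

# Strict operationalism: the true score IS the observed score,
# so there is one true score per person-administration-rater cell.
true = dict(observed)

keys = sorted(observed)
reliability_index = pearson([true[k] for k in keys], [observed[k] for k in keys])

# Number of distinct "true scores" the first person ends up with.
distinct_true_scores_p0 = len(
    {true[(0, a, r)] for a in administrations for r in raters}
)

print(round(reliability_index, 6))
print(distinct_true_scores_p0)
```

The correlation between true and observed scores comes out as 1 by construction, but a single person carries several different true scores, one for each administration and rater, which is what makes the model hard to use.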

A more **laxed operationalist** definition of the true score would be based on the *expectation* of observed scores conditional on a person. Writing $X_{par}$ for the observed score of person $p$ on test administration $a$ scored by rater $r$, the true score for a given person can be expressed by,

$$
T_p = \mathbb{E}[X_{par} \mid p]
$$

The conditional expectation ($\mathbb{E}[\cdot \mid p]$) is the average observed score over all possible observed scores for a given person. This also provides the beautiful property that, for a given person, $X_{par}$ is an “unbiased estimate” of $T_p$, since by the definition of statistical bias,

$$
\text{Bias}(X_{par} \mid p) = \mathbb{E}[X_{par} \mid p] - T_p = T_p - T_p = 0
$$

So now, even if observed scores vary across raters and/or test administrations, the true score will remain constant. The question at this point is how to model a single instance of an observed score. Here, we will need to introduce measurement errors. **Measurement errors** can be defined as differences between observed scores and true scores. Remember that in our strict operationalist model (Equation 1) we did not have any measurement errors since the true score was equivalent to the observed score (i.e., if $X_p = T_p$, then $E_p = X_p - T_p = 0$). Now that the true score is the expected observed score for a given person, a single instance of an observed score *can* differ from the true score such that,

$$
X_{par} = T_p + E_{par} \tag{2}
$$

where $E_{par}$ indicates the error in the observed score for a given person and measurement. Since we started with the definition of true scores being $T_p = \mathbb{E}[X_{par} \mid p]$, an extremely useful consequence of this is that the conditional expectation of measurement errors within a person is zero. We can demonstrate why this is the case with a short derivation (first re-arranging Equation 2 so that $E_{par}$ is on the left-hand side),

$$
\begin{aligned}
E_{par} &= X_{par} - T_p \\
\mathbb{E}[E_{par} \mid p] &= \mathbb{E}[X_{par} - T_p \mid p] \\
&= \mathbb{E}[X_{par} \mid p] - T_p \\
&= T_p - T_p \\
&= 0
\end{aligned}
$$

Errors that balance out over all possible observed scores for a given person are what distinguish the classical test theory model from other measurement models (Kroc and Zumbo 2020). It is important to note that the form of the model, $X = T + E$, is not what defines the classical test theory model; in fact, many measurement error models come in an analogous form. Rather, the conditional expectations between the components of the model are generally what distinguish measurement error models across disciplines (Kroc and Zumbo 2020).
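As a quick numerical check of the zero-mean error property, here is a minimal sketch. It assumes a single person with a small, made-up list of equally likely possible observed scores (`possible_scores` is hypothetical), and uses exact fractions so no rounding is involved.

```python
from fractions import Fraction

# Hypothetical: every observed score one person could obtain across
# administrations and raters, each assumed equally likely.
possible_scores = [12, 15, 11, 14, 13, 16]

# Laxed operationalist true score: the expectation of the observed
# scores for this person.
true_score = Fraction(sum(possible_scores), len(possible_scores))

# Measurement error of each possible observed score.
errors = [x - true_score for x in possible_scores]

# Conditional expectation of the errors for this person.
mean_error = sum(errors) / len(errors)

print(true_score)  # 27/2
print(mean_error)  # 0
```

Any other list of scores would give the same `mean_error`, because the true score is defined as the mean of exactly those scores.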

**A realist** model may suppose that the true score is a construct score: an objective value that is defined outside of the measurement (the actual value of the construct/attribute being measured; Borsboom and Mellenbergh 2002). Therefore we cannot define true scores in terms of observed scores. The model for an observed score is superficially identical to Equation 2,

$$
X_{par} = T_p + E_{par} \tag{3}
$$

with the major difference that $T_p$ is not defined as $\mathbb{E}[X_{par} \mid p]$ (so, in general, $T_p \neq \mathbb{E}[X_{par} \mid p]$). Note that Equation 3 is *not* the classical test theory model because it does not, by default, meet the assumptions of classical test theory that the laxed operationalist model does (Equation 2). The big difference between this realist model and the laxed operationalist model is that the conditional expectation of errors is no longer guaranteed to be zero in the realist approach, and thus systematic errors (i.e., biased estimates of true scores) can exist. In order to recover the unbiased estimate of true scores, we would have to just assume that $\mathbb{E}[E_{par} \mid p] = 0$ (see flaws below).
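To see how systematic error can enter the realist model, consider the following sketch. The construct score of 10 and the list of possible observed scores are invented for illustration; in practice the construct score is not observable, which is part of the point.

```python
from fractions import Fraction

# Hypothetical construct score, defined independently of the test.
# In real applications this value is unknown.
construct_score = Fraction(10)

# Hypothetical: all equally likely observed scores for this person.
possible_scores = [12, 15, 11, 14, 13, 16]

# Expected observed score for this person.
expected_observed = Fraction(sum(possible_scores), len(possible_scores))

# Realist errors are measured against the construct score,
# not against the expected observed score.
errors = [x - construct_score for x in possible_scores]
mean_error = sum(errors) / len(errors)

print(expected_observed)  # 27/2
print(mean_error)         # 7/2
```

Here the errors do not average out: their mean equals the gap between the expected observed score and the construct score, so observed scores are biased unless that gap is assumed to be zero.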

When it comes to psychological constructs, the realist approach has two major flaws in my view.

1. It makes an ontological commitment to the existence of objective construct scores that exist independently of the measure. The problem with this is that it includes an additional parameter whose nature is undefined and for whose existence we lack evidence. It therefore adds complexity to our theory.

2. The operationalist model produces an unbiased measure by definition, whereas the realist model needs an additional assumption (i.e., conditioned on a given person, the expectation of measurement errors is zero) in order for the observed scores to be unbiased.

The operationalist model may have the biggest flaw so far:

- You cannot really draw any meaningful inferences about anything outside of the measure.

There is a third option though!

## Respectful Operationalist Measurement Error Model

As the originator of respectful operationalism, Elina Vessonen (2021), puts it,

> [Respectful Operationalism] is the view that we may define a target concept in terms of a test, as long as that test is validated to incorporate common connotations of the target concept and the usability of the measure

Therefore we need to identify which observed scores are produced by tests that have been validated for extra-operational connotations for the concept of interest. Let’s define an observed score produced by a sufficiently valid test as $X^\star_{par}$, which belongs to the set of valid observed scores $\mathcal{X}^\star$, itself contained in the set of all possible observed scores $\mathcal{X}$, such that $X^\star_{par} \in \mathcal{X}^\star \subset \mathcal{X}$. As an example, a measure of depression that produces observed scores from responses to the question, “what is your age?”, would not encompass common connotations of depression, and therefore those observed scores would be in the total set of observed scores, but *not* in the set of *valid* observed scores. This new model can be defined as,

$$
X^\star_{par} = T_p + E_{par}
$$

where

$$
T_p = \mathbb{E}[X^\star_{par} \mid p]
$$

In this way, inferences about concepts can be made from observed scores since they hold inherent relevance by virtue of prior validation of the test’s extra-operational meaning.
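The idea of restricting the true score to validated tests can be sketched in code. Everything here is an illustrative assumption: the class `ObservedScore`, the set `VALIDATED_TESTS`, the test names, and the score values are hypothetical. Deciding which tests count as validated is a substantive research process that a lookup set only stands in for, and the true score is taken to be the average of the valid observed scores.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class ObservedScore:
    person: str
    test: str
    value: int

# Hypothetical record of tests that passed validation for the
# target concept (here, depression).
VALIDATED_TESTS = {"depression_inventory"}

# Hypothetical set of all observed scores, valid or not.
scores = [
    ObservedScore("p1", "depression_inventory", 18),
    ObservedScore("p1", "depression_inventory", 22),
    ObservedScore("p1", "depression_inventory", 20),
    ObservedScore("p1", "age_question", 34),
    ObservedScore("p1", "age_question", 34),
]

def valid_scores(all_scores, validated_tests):
    """Keep only observed scores produced by validated tests."""
    return [s for s in all_scores if s.test in validated_tests]

def true_score(all_scores, person, validated_tests):
    """Average of a person's valid observed scores."""
    values = [
        s.value
        for s in valid_scores(all_scores, validated_tests)
        if s.person == person
    ]
    if not values:
        raise ValueError(f"no valid observed scores for {person!r}")
    return Fraction(sum(values), len(values))

print(true_score(scores, "p1", VALIDATED_TESTS))  # 20
```

The scores from the age question remain in the full set of observed scores but never enter the true score, mirroring the depression example above.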