
IRT and Adaptive Testing

Item Response Theory (IRT) refers to a family of probabilistic models that relate the probability of a particular response to an item to an unobserved latent variable representing the respondent’s location on the construct being measured, whether that construct is an ability or a personality trait. Unlike classical test theory, which treats a test as a fixed instrument and focuses on the total score as its primary output, IRT models the interaction between respondents and items directly. The central move is to specify how item characteristics—such as difficulty, discrimination, and, in some models, guessing—combine with a respondent’s position on the latent trait being measured (commonly denoted as θ) to generate observable responses (Rasch, 1960; Lord and Novick, 1968).

Introduction

This reframing has two consequences that are foundational for the development of Computerised Adaptive Testing (CAT). First, item properties become explicit objects of estimation rather than implicit features of a test form. Second, scores are no longer treated as simply the number of items answered correctly, but as estimates of a latent location on θ accompanied by quantified uncertainty. These features make it possible to compare performance across different sets of items in a principled way, provided the items have been calibrated on a common scale (Wright and Masters, 1982; Hambleton, Swaminathan and Rogers, 1991).

IRT therefore underpins a shift from thinking of assessment as a static instrument to thinking of it as a measurement system. Tests become instances drawn from larger calibrated item banks; scores become conditional estimates whose precision depends on the items administered; and uncertainty is treated as an inherent and informative part of measurement rather than as an error to be eliminated. These ideas set the stage for later developments in item banking, computerised adaptive testing, and modern operational assessment systems, but they are rooted in a small number of core models and conceptual commitments established in the second half of the twentieth century.

Core IRT models and measurement principles

Rasch model and parameter invariance

The model most closely associated with the foundational measurement philosophy of IRT is the Rasch model. Introduced by Georg Rasch in 1960, the model specifies that the probability of a correct response is a logistic function of the difference between a respondent’s location on the latent trait and an item’s difficulty parameter (Rasch, 1960). In its simplest form, the model includes only a single item parameter, difficulty, and assumes that all items discriminate equally between respondents at different locations on the latent variable. Hence only items that meet this criterion should be selected for the test. What distinguished Rasch’s contribution was not only the mathematical form of the model, but the measurement principles it embodied. Rasch emphasised parameter invariance: under the model, item difficulties should be independent of the particular sample of respondents used for estimation, and respondent locations should be independent of the particular set of items administered, within the limits of sampling variability. This reciprocity was presented as a criterion for measurement, rather than as an empirical convenience (Rasch, 1960; Wright and Masters, 1982).

In practice, the Rasch model is often referred to as the one-parameter logistic (1PL) model, but this retrospective label can obscure its original intent. Rasch did not propose the model as a descriptive approximation to observed data at any cost; instead, he argued that the discipline imposed by the model was what allowed meaningful measurement claims to be made. This stance has had lasting influence, particularly in educational assessment, health outcomes measurement, and other applied domains where comparability and interpretability are central concerns (Wright and Masters, 1982). The Rasch model also provides a useful reference point for later developments. Many subsequent IRT models can be understood as relaxations of Rasch’s constraints, trading strict invariance for improved empirical fit. The tension between these two goals—measurement discipline versus descriptive adequacy—runs throughout the history of IRT and reappears in later debates about model choice, fairness, and operational practice.

Birnbaum/Lord multi-parameter models and empirical fit

A different trajectory within IRT emerged from the work of Allan Birnbaum and colleagues in the 1950s and 1960s. Birnbaum proposed extensions to the basic latent trait framework that allowed items to differ not only in difficulty but also in discrimination, and, in some contexts, in lower asymptotes associated with guessing (Birnbaum, 1968). These models, later formalised within the framework of modern test theory, are commonly referred to as two-parameter (2PL) and three-parameter (3PL) logistic models. The systematic exposition of these models, and their integration into a general statistical theory of testing, is most closely associated with the work of Lord and Novick (1968). In this formulation, item parameters are treated as descriptive features to be estimated from data, and model fit becomes a central criterion for model adequacy. Compared with the Rasch model, multi-parameter models offer greater flexibility in representing observed response patterns, particularly in large-scale testing contexts where items vary widely in quality and format.

This flexibility came at a cost. Allowing item discrimination and guessing parameters to vary weakens the strong invariance properties emphasised by Rasch, and it complicates the interpretation of scales and scores. Nonetheless, multi-parameter models proved attractive in many operational settings, especially in the United States, where large item pools and high-stakes decisions placed a premium on empirical fit and predictive accuracy (Hambleton, Swaminathan and Rogers, 1991). The coexistence of Rasch and Birnbaum-type models illustrates an enduring feature of IRT practice: model choice is not determined solely by statistical considerations, but also by the purposes of measurement, the institutional context, and the kinds of inferences that users wish to draw. This plurality of models provides both flexibility and a source of ongoing debate within psychometrics.

Non-parametric IRT and Mokken scaling

Alongside parametric IRT models, a parallel tradition developed that relaxed assumptions about the functional form of item response curves. Mokken scaling, introduced by Rob Mokken in the early 1970s, provides a non-parametric approach to analysing item response data (Mokken, 1971). Rather than specifying a particular logistic form, Mokken models focus on weaker assumptions such as monotonicity of item response functions and stochastic ordering of respondents. Mokken scaling is often described as a bridge between classical test theory and parametric IRT. Like IRT, it treats items as the primary units of analysis and seeks to establish ordering on a latent variable. Unlike parametric models, however, it does not yield interval-scale estimates of respondent location, nor does it require strong assumptions about item characteristic curves. The resulting scalability coefficients provide information about the extent to which items and respondents can be meaningfully ordered (Mokken, 1971; Sijtsma and Molenaar, 2002).

An interesting aspect of this approach is that it pays particular attention to item ordering—something that classical “number-correct” scoring typically ignores. Consider a 10-item mathematics test with items arranged from easiest to hardest. A raw score of 5 might mean that a candidate answered items 1–5 correctly and 6–10 incorrectly. But it could equally mean that they answered items 1, 3, 5, 7, and 9 correctly and items 2, 4, 6, 8, and 10 incorrectly. Under conventional scoring both candidates obtain 5, and if the pass mark is 5 they both pass—despite the very different response patterns that those two profiles suggest. This is precisely the kind of information that IRT-style models are designed to exploit, because they treat responses as a function of item difficulty (and, depending on the model, discrimination and guessing) rather than as interchangeable “marks”.
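To make this concrete, the sketch below (Python, with invented item difficulties) evaluates both response patterns under a Rasch model at the same candidate location. Because the raw score is a sufficient statistic under the Rasch model, both patterns lead to the same estimate of θ; what differs sharply is how plausible each pattern is at that location, which is the kind of information that person-fit analysis, and models allowing varying discrimination, can exploit.

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def log_likelihood(pattern, theta, b):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    p = rasch_prob(theta, b)
    return np.sum(pattern * np.log(p) + (1 - pattern) * np.log(1 - p))

# Hypothetical difficulties for a 10-item test, ordered easiest to hardest.
b = np.linspace(-2.0, 2.0, 10)

# Two response patterns with the same raw score of 5.
consistent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])    # items 1-5 correct
alternating = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])   # items 1, 3, 5, 7, 9 correct

theta = 0.0  # a mid-scale candidate location
print(log_likelihood(consistent, theta, b))   # higher: consistent with the difficulty ordering
print(log_likelihood(alternating, theta, b))  # much lower: an aberrant pattern
```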

In personality and attitude measurement, similar issues arise—particularly around scale midpoints. A midpoint score on an introversion–extraversion scale might represent someone who is genuinely moderate (neither strongly introverted nor strongly extraverted), but it might also reflect a person who endorses (or can present) both ends of the continuum depending on context, yielding a mixed pattern that averages out to the midpoint. Approaches that model items through their difficulty/endorsement ordering can therefore be especially useful in analysing attitude and personality measures, where assumptions of equal discrimination or a precise functional form are often hard to justify. At the same time, the interpretive limits of Mokken scaling need to be kept in view: a mid-range score (for example, a stanine of five) primarily reflects relative ordering rather than a precise location on a metric scale, and its substantive meaning depends strongly on the distribution of item difficulties and—crucially—the content of the items that define that region of the scale (Sijtsma, 2012). Mokken scaling thus reinforces a theme that recurs throughout this chapter: different measurement models support different kinds of inference. Understanding what a model does not provide is as important as understanding what it does, particularly when scores are used to inform consequential decisions.

Estimation, information, and computation

Central to the practical use of IRT is the estimation of both item parameters and respondent locations on the latent scale. Unlike classical test theory, where observed scores are treated as direct estimates of ability, IRT makes explicit the probabilistic relationship between responses and latent variables. Estimation is therefore inseparable from uncertainty: any estimate of a respondent’s location is conditional on the items administered and the model assumed, and must be accompanied by an indication of its precision. Early developments in IRT estimation focused on maximum likelihood approaches, which estimate respondent locations by identifying the value of the latent variable that maximises the likelihood of the observed response pattern given fixed item parameters. While conceptually straightforward, maximum likelihood estimation has well-known limitations, particularly at the extremes of the scale: all-correct or all-incorrect response patterns have no finite maximum likelihood estimate, and estimates near the extremes are unstable because relatively little data is available there.

These difficulties motivated the development of marginal maximum likelihood (MML) methods for item parameter estimation, most notably by Bock and Aitkin (1981). In this approach, the latent variable is treated as a random effect, typically assumed to follow a normal distribution, and integrated out of the likelihood, allowing item parameters to be estimated independently of specific respondent estimates. MML provided a practical and statistically principled foundation for large-scale calibration of item banks and remains central to operational IRT practice. Bayesian approaches further extended the estimation framework by introducing explicit prior distributions for both item and person parameters. By combining prior information with observed responses, Bayesian estimation yields posterior distributions rather than point estimates, making uncertainty an integral part of the inferential output. This perspective proved particularly influential in contexts where data are sparse, where adaptive testing produces uneven information across respondents, or where stability across administrations is a concern (Mislevy, 1986).

The concept of information plays a crucial role in linking estimation to test design. In IRT, item information functions quantify how much information an item provides about respondent location at different points on the latent scale. Test information functions aggregate this information across items, providing a direct measure of expected precision at each location. These concepts make explicit the trade-offs involved in test construction and adaptive item selection, and they underpin the efficiency gains associated with CAT (Hambleton, Swaminathan and Rogers, 1991).

Importantly, information is not uniform across the scale. Tests are most precise where item information is concentrated and less precise elsewhere. This unevenness has practical implications for score interpretation, especially when fixed cut scores or classification decisions are imposed on a scale whose measurement precision varies by trait level. Recognising, quantifying, and communicating this variability is a central responsibility of psychometric practice. The same insight also highlights a long-recognised limitation of classical test theory: reporting a single reliability coefficient can encourage the mistaken impression that precision is constant across the entire score range. In practice, however, estimates of precision at the extremes—whether framed in CTT or IRT terms—can be particularly uncertain, because pilot and standardisation samples typically include relatively few respondents at very high or very low trait levels. As a result, behaviour in the tails may be inferred with limited empirical support and may depend disproportionately on model assumptions and extrapolation. Once this is acknowledged, difficult questions follow about the fairness and defensibility of using ostensibly “reliable” ability scores for high-stakes prediction and threshold decisions, including cut-offs used in special educational needs diagnosis.

The increasing adoption of IRT in operational settings was closely tied to advances in computation and the development of specialised software. Early IRT analyses were computationally demanding, limiting their practical application. As computing power increased and algorithms improved, it became feasible to estimate complex models with large datasets, enabling the routine use of IRT in testing programmes of substantial scale. Software implementations played a decisive role in this transition. Programs such as BILOG, and later more flexible environments such as Concerto, supported the estimation of a wide range of IRT models, lowering the barrier to entry for practitioners and encouraging methodological standardisation. These tools embodied particular estimation choices and modelling assumptions, shaping practice as well as reflecting it.

More recent developments have emphasised flexibility and extensibility, with software frameworks designed to accommodate multidimensional models, complex sampling designs, and integration with adaptive testing systems. The availability of robust computational tools has made it possible to treat IRT not as a specialised analytical technique, but as an integral component of broader measurement infrastructures (Cai, 2010). At the same time, increased computational power has not eliminated the need for judgement. Choices about model specification, estimation method, convergence criteria, and diagnostic evaluation remain consequential. The apparent automation of estimation can obscure these decisions, reinforcing the importance of psychometric expertise in interpreting results and ensuring that modelling choices align with the intended uses of scores.

Formal structure: response functions, likelihood, and information

This section introduces the core mathematical structure of item response theory. The aim is not to provide full derivations, but to make explicit the objects being modelled, the quantities being estimated, and the reasons why uncertainty and information play such a central role in IRT-based systems. Readers seeking detailed proofs or algorithmic implementations are referred to the specialist literature cited in the References section below; the focus here is on conceptual clarity and interpretability.

Notation and formatting conventions

Throughout this section, the following conventions are used for consistency. The respondent’s latent location is denoted by θ, item difficulty by b, item discrimination by a, and lower asymptote (guessing) by c. Subscripts i and j index items and respondents respectively. Probabilities are expressed using the logistic function, and expectations about precision are discussed in terms of information rather than raw score variance. Where equations are presented, they are intended to illustrate structure and interpretation rather than to invite derivation.

Item response functions

At the heart of IRT is the item response function (IRF), which specifies the probability of a particular response as a function of a respondent’s location on a latent variable. In the simplest case, the Rasch model expresses the probability of a correct response to item i by respondent j as

P(Xᵢⱼ = 1 | θⱼ) = exp(θⱼ − bᵢ) / [1 + exp(θⱼ − bᵢ)],

where θⱼ denotes the respondent’s location on the latent scale and bᵢ denotes the difficulty of item i. The model asserts that response probability depends only on the difference between these two quantities. Items are most informative for respondents whose locations lie close to their difficulties, and least informative far from that point.

Multi-parameter models extend this formulation by allowing items to differ in how sharply they discriminate between respondents at different locations. In the two-parameter logistic (2PL) model, a discrimination parameter aᵢ scales the difference between θⱼ and bᵢ, while in the three-parameter logistic (3PL) model a lower asymptote cᵢ is added to represent the probability of a correct response due to guessing. These extensions increase descriptive flexibility, but they also complicate interpretation and weaken the strong invariance properties emphasised in the Rasch model.
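The minimal sketch below (Python, with illustrative parameter values) implements the item response functions just described; a single function covers the Rasch, 2PL, and 3PL forms by fixing or freeing the discrimination and guessing parameters.

```python
import numpy as np

def irf(theta, b, a=1.0, c=0.0):
    """Item response function for the 1PL/2PL/3PL family.

    P(X = 1 | theta) = c + (1 - c) * logistic(a * (theta - b))
    a = 1, c = 0 gives the Rasch (1PL) form; freeing a gives the 2PL;
    freeing a and c gives the 3PL.
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(irf(theta, b=0.0))                  # Rasch item of average difficulty
print(irf(theta, b=0.0, a=2.0))           # steeper (more discriminating) 2PL item
print(irf(theta, b=0.0, a=2.0, c=0.25))   # 3PL item with a guessing floor of .25
```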

Likelihood and estimation

Given a set of item parameters, the likelihood of a respondent’s response pattern is defined as the product of the probabilities of the observed responses across items. Estimation of a respondent’s location involves identifying the value of θ that maximises this likelihood. In practice, this process reveals an important feature of IRT: estimation precision varies across the scale and depends on the match between item difficulties and respondent location.
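As an illustration of likelihood-based person estimation, the sketch below (Python, with hypothetical calibrated 2PL item parameters) finds the value of θ that maximises the likelihood of an observed response pattern. The bounds guard against the familiar problem that all-correct or all-incorrect patterns have no finite maximum.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_likelihood(theta, responses, b, a):
    """Log-likelihood of a response pattern under a 2PL model with known items."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

# Hypothetical calibrated item parameters and one observed response pattern.
b = np.array([-1.5, -0.5, 0.0, 0.7, 1.4])
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
responses = np.array([1, 1, 1, 0, 0])

# Maximum likelihood estimate of theta, bounded to keep the search finite.
result = minimize_scalar(lambda t: -log_likelihood(t, responses, b, a),
                         bounds=(-4, 4), method="bounded")
print(result.x)  # the ML estimate of the respondent's location
```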

For item calibration, marginal maximum likelihood estimation treats respondent locations as random variables drawn from a population distribution. By integrating over this distribution, item parameters can be estimated without conditioning on particular respondent estimates. This separation of item calibration from person estimation is a key reason why IRT supports item banking and adaptive testing at scale.

Bayesian approaches make the role of uncertainty explicit by assigning prior distributions to parameters and producing posterior distributions as outputs. Rather than yielding single point estimates, Bayesian estimation characterises uncertainty directly, an advantage in contexts where data are sparse or unevenly distributed, as is often the case in adaptive testing systems.
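A corresponding Bayesian sketch (again with hypothetical item parameters) computes an expected a posteriori (EAP) estimate by combining a standard normal prior with the response likelihood over a grid of θ values; the posterior standard deviation expresses the uncertainty directly.

```python
import numpy as np

def eap_estimate(responses, b, a, prior_mean=0.0, prior_sd=1.0, n_points=61):
    """EAP estimate of theta under a 2PL model with a normal prior,
    using simple numerical quadrature over a grid of theta values."""
    grid = np.linspace(-4, 4, n_points)
    prior = np.exp(-0.5 * ((grid - prior_mean) / prior_sd) ** 2)
    p = 1.0 / (1.0 + np.exp(-a[:, None] * (grid[None, :] - b[:, None])))
    like = np.prod(np.where(responses[:, None] == 1, p, 1 - p), axis=0)
    posterior = prior * like
    posterior /= posterior.sum()
    mean = np.sum(grid * posterior)
    sd = np.sqrt(np.sum((grid - mean) ** 2 * posterior))
    return mean, sd   # point estimate plus posterior uncertainty

b = np.array([-1.5, -0.5, 0.0, 0.7, 1.4])
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
print(eap_estimate(np.array([1, 1, 1, 0, 0]), b, a))
```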

Information and precision

The concept of information provides the formal link between the probabilistic model and measurement precision. For a given item, the item information function quantifies how much information the item provides about θ at each point on the latent scale. In the Rasch model, information is maximised when θ equals the item difficulty, reflecting the fact that responses are most sensitive to changes in θ at that point.

Test information functions aggregate item information across all administered items, providing a location-specific measure of expected precision. The inverse of test information approximates the variance of the estimator of θ, making explicit the relationship between item selection and uncertainty. This formal connection explains why adaptive testing is efficient: by selecting items that are most informative at the respondent’s current estimated location, CAT concentrates measurement effort where it yields the greatest reduction in uncertainty.
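The sketch below makes the link between information and precision explicit for a small set of hypothetical 2PL items: test information is the sum of item information, and its inverse square root approximates the standard error of the θ estimate at each location.

```python
import numpy as np

def item_information(theta, b, a=1.0):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

def test_information(theta, b, a):
    """Test information is the sum of item informations at theta."""
    return np.sum(item_information(theta, b, a))

b = np.array([-1.5, -0.5, 0.0, 0.7, 1.4])
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])

for theta in (-2.0, 0.0, 2.0):
    info = test_information(theta, b, a)
    se = 1.0 / np.sqrt(info)    # approximate standard error of the theta estimate
    print(theta, info, se)      # precision is highest where item difficulties cluster
```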

Crucially, information is not constant across the scale. Regions with sparse or poorly targeted items yield less precise estimates, even in long tests. This variability has important implications for interpretation, particularly when fixed cut scores or classifications are imposed. IRT does not eliminate these issues, but it makes them visible and quantifiable.

Mathematical structure and system design

Seen together, the mathematical elements of IRT—item response functions, likelihood-based estimation, and information—form a coherent framework for measurement. They clarify why different respondents benefit from different items, why uncertainty is unavoidable, and why adaptive systems must be designed with explicit objectives and constraints. The mathematics does not stand apart from the system-level considerations discussed earlier; rather, it provides the formal language through which those considerations are expressed and evaluated.

The purpose of presenting this material here is therefore not to complete a mathematical treatment, but to establish a shared understanding of how the models behave and why they support the kinds of measurement systems now in widespread use.

The move from instruments to systems

Item banking

As assessment programmes grew in scale and frequency—especially from the late twentieth century onwards—the limitations of fixed test forms became hard to ignore. Repeated reuse of items created predictable risks: compromised security, unequal exposure, and score inflation in high-stakes settings where memorisation is rational. Yet curricula and test specifications also constrain the item universe, so simply “writing more items” is not a scalable remedy. Item banking emerged as the operational response. An item bank is not just a repository; it is a calibrated collection in which each item is located on a common measurement scale under an explicit psychometric model. Calibration permits different subsets of items to be assembled while preserving comparability of scores—provided items share the same scale and the model assumptions hold (Kolen & Brennan, 2014).

This is the decisive conceptual shift. Tests stop being singular instruments and become temporary instantiations drawn from a larger measurement system. That reconceptualisation enables automated assembly, adaptive delivery, and continuous replenishment, while also making governance part of validity: decisions about inclusion, retirement, content balance, and security policy now shape the measurement process itself.

Computerised adaptive testing as a system

CAT became feasible at scale once item banking and IRT calibration made dynamic selection compatible with score comparability. Early CAT was limited by computing constraints, modest pools, and institutional caution, so deployments often began in narrow domains or as supplements rather than replacements (Lord, 1980; Wainer, 2000). Even so, CAT made the systems logic explicit: measurement precision is conditional. Different respondents require different items, and efficiency comes from matching difficulty to an examinee’s estimated location on the scale. In CAT the “test” is no longer a fixed artefact but a controlled selection process operating over calibrated infrastructure. The algorithm becomes policy in code.
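The logic of a basic adaptive cycle can be illustrated in a few lines. The simulation below uses an invented 2PL item bank and a simulated examinee; at each step it administers the most informative unused item at the current estimate and updates that estimate from the accumulated responses. Operational systems layer content, exposure, and security constraints on top of this core loop, as the following subsections discuss.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_correct(theta, b, a):
    """Probability of a correct response under a 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap(responses, b_used, a_used, grid=np.linspace(-4, 4, 61)):
    """Grid-based EAP estimate of theta with a standard normal prior."""
    post = np.exp(-0.5 * grid ** 2)
    for r, b, a in zip(responses, b_used, a_used):
        p = p_correct(grid, b, a)
        post *= p if r else (1 - p)
    post /= post.sum()
    return float(np.sum(grid * post))

# Invented calibrated bank of 200 2PL items and one simulated examinee.
bank_b = rng.normal(0, 1, 200)
bank_a = rng.uniform(0.8, 2.0, 200)
true_theta = 1.2               # unknown to the selection algorithm
administered, responses = [], []
theta_hat = 0.0                # provisional starting estimate

for step in range(15):
    p = p_correct(theta_hat, bank_b, bank_a)
    info = bank_a ** 2 * p * (1 - p)       # 2PL item information at theta_hat
    info[administered] = -np.inf           # never readminister an item
    item = int(np.argmax(info))            # most informative remaining item
    correct = rng.random() < p_correct(true_theta, bank_b[item], bank_a[item])
    administered.append(item)
    responses.append(int(correct))
    theta_hat = eap(responses, bank_b[administered], bank_a[administered])

print(round(theta_hat, 2))     # the estimate should move toward true_theta
```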

Exposure control and security constraints

A central operational tension in CAT is that highly informative items tend to be selected repeatedly, accelerating overexposure and the risk of compromise. This exposes a general trade-off: maximising information step-by-step can be incompatible with maintaining a sustainable pool over time. Exposure control procedures manage that trade-off. Mature systems do not simply select the single most informative item at each step; they impose constraints that cap usage rates, enforce content coverage, and protect the long-term integrity of the bank. Exposure control is therefore not an optional add-on but part of what an adaptive system is. Selection is shaped by measurement goals and policy, security, and lifecycle management.
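One simple illustration of the idea is “randomesque” selection, in which the next item is drawn at random from the few most informative candidates rather than always being the single best one. The sketch below is a minimal version of this heuristic; operational exposure control typically combines several such mechanisms (including probabilistic schemes such as Sympson–Hetter) with content constraints.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_item(theta_hat, bank_b, bank_a, used, k=5):
    """Randomesque selection: choose at random among the k most informative
    unused items instead of always administering the single best one."""
    p = 1.0 / (1.0 + np.exp(-bank_a * (theta_hat - bank_b)))
    info = bank_a ** 2 * p * (1 - p)
    info[list(used)] = -np.inf          # exclude items already administered
    top_k = np.argsort(info)[-k:]       # the k most informative candidates
    return int(rng.choice(top_k))

bank_b = rng.normal(0, 1, 100)
bank_a = rng.uniform(0.8, 2.0, 100)
print(select_item(0.0, bank_b, bank_a, used={3, 17}))
```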

Optimisation frameworks for assembly and CAT

As programmes grew more complex, test assembly and item selection were increasingly formalised as constrained optimisation problems: maximise precision (or minimise error) subject to content, exposure, and operational constraints. This perspective is developed most clearly by van der Linden, who showed how fixed-form assembly and CAT selection can be expressed within a unified optimisation framework (van der Linden, 2005). In this view, CAT is not merely “pick the psychometrically best next item,” but solve a dynamic decision problem under explicitly stated objectives and constraints—consolidating item banking, CAT, and exposure control into a coherent systems framework (van der Linden & Glas, 2010).
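A minimal sketch of the assembly idea is given below. It uses a greedy heuristic with an invented bank and blueprint (maximise information at a target θ within each content area's required item count) rather than the mixed-integer programming formulation developed by van der Linden, but it shows how precision becomes an objective pursued under explicit constraints.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented bank: 2PL parameters plus a content-area label for each item.
n = 60
b = rng.normal(0, 1, n)
a = rng.uniform(0.8, 2.0, n)
content = rng.choice(["algebra", "geometry", "number"], size=n)

theta_target = 0.0                                        # assemble for precision near a cut score
blueprint = {"algebra": 4, "geometry": 3, "number": 3}    # required items per content area

p = 1.0 / (1.0 + np.exp(-a * (theta_target - b)))
info = a ** 2 * p * (1 - p)                               # item information at the target

selected = []
for area, needed in blueprint.items():
    ranked = [i for i in np.argsort(-info) if content[i] == area]
    selected.extend(ranked[:needed])                      # best items within each area

print(sorted(int(i) for i in selected))
print(round(float(info[selected].sum()), 2))              # information delivered at theta_target
```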

Online calibration and continuous replenishment

Modern systems face frequent item turnover driven by security, curricular change, and the need to refresh content. Recalibrating an entire bank from scratch each time is rarely practical. Online calibration addresses this by embedding new items in live administrations and estimating their parameters on an existing scale using responses to calibrated anchor items. When linking assumptions hold, comparability can be preserved while the bank evolves (Ban et al., 2019). But operational convenience does not remove epistemic risk. Decisions about when to seed new items, how much data is enough before operational use, how to detect drift, and when to retire items become validity-critical. In a system-centred conception, governance and professional judgement are not external to psychometrics; they are part of what makes measurement defensible.

Once tests are treated as evolving systems—rather than fixed instruments—the next technical question is not only how to estimate ability, but how to maintain meaning over time: linking across forms and years, monitoring parameter drift and population change, and deciding when “change in score” reflects real change versus shifts in the system itself. This sets up the move to longitudinal measurement, invariance and drift detection, and the increasingly central role of monitoring and audit in large-scale adaptive programmes.

Adaptive testing and dynamic measurement systems today

By the late 2010s, the core logic of IRT-based adaptive testing was well established: calibrated item banks, information functions, and real-time estimation. Developments since 2019 have not altered that foundation so much as shifted the centre of gravity. Operational adaptive testing has become more architecturally disciplined—less focused on unconstrained information maximisation and more explicitly governed by test specifications, security, and defensibility. In parallel, measurement has increasingly moved from “one-off” administrations to continuously operating systems, where both item behaviour and examinee states may evolve over time. As a result, adaptive testing is now best understood not only as a method of efficient item selection, but as a managed measurement system with ongoing monitoring, updating, and governance.

Constrained optimisation as the operational norm

Pure information maximisation is rarely sufficient in high-stakes contexts. Content balance, exposure control, and regulatory defensibility require explicit constraint management, and this has pushed adaptive testing toward formal optimisation architectures. Shadow testing (van der Linden, 2005) provides the canonical framework: at each step, a full constrained test is assembled (a “shadow test”) that satisfies blueprint and operational constraints, and the next administered item is chosen from that solution. Simulation evidence continues to show that well-designed constrained CAT systems usually incur only modest efficiency losses relative to unconstrained selection, while materially improving blueprint fidelity and item exposure management (Chen and Ankenmann, 2004; Liu and Wang, 2021).

A central unresolved issue is robustness under model mis-specification. Constrained systems can perform “well” operationally—satisfying blueprints and exposure limits—while propagating systematic distortions if the calibrated model omits important structure, such as latent multidimensionality, local dependence, or response-style contamination. The need for large-scale empirical validation under realistically messy conditions (heterogeneous subpopulations, parameter drift, and shifting test use) remains more pressing than the sophistication of optimisation machinery might suggest.

Multistage testing, auditability, and institutional fit

Multistage testing (MST) has continued to develop as an institutional compromise between item-level adaptivity and transparency. By routing candidates through pre-assembled modules rather than selecting every item individually, MST reduces the perception of algorithmic opacity while retaining much of the efficiency gain. Empirical studies indicate that carefully designed MST systems can approach the precision of fully adaptive CAT, while providing clearer audit trails and more readily communicable forms of comparability (von Davier and Yan, 2020). In practice, this transparency is often decisive in settings where governance and explainability are operational requirements rather than optional virtues.

Statistically, MST remains vulnerable to early routing error when information is sparse; in well-designed systems later stages can compensate, but performance can degrade when population distributions are skewed relative to calibration samples or when parameters are unstable. The choice between MST and fully adaptive CAT is therefore not only statistical but organisational: it reflects differences in tolerance for complexity, requirements for auditability, and the institutional context of the decisions supported by the test.

From fixed calibration to continuous monitoring

Classical IRT treats item parameters as stable once calibrated. In occasional administrations this assumption is serviceable; in continuously operating systems it becomes an empirical question. Research on parameter drift detection (Sinharay, 2020; Patton, Cheng and Sinharay, 2021) shows that even modest exposure effects, coaching, or population shifts can produce detectable changes in item difficulty over time. This has encouraged sequential updating frameworks in which recalibration is embedded within routine monitoring, so calibration becomes less a periodic event and more a managed process—supported by drift detection, flagged items, and controlled updating.
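A simple screening rule conveys the flavour of drift monitoring: compare banked and recalibrated difficulty estimates and flag items whose change exceeds a multiple of the combined standard error. The sketch below (Python, invented values) is only a first filter; operational programmes combine such statistics with richer models and human review.

```python
import numpy as np

def flag_drift(b_old, se_old, b_new, se_new, z_crit=2.58):
    """Flag items whose recalibrated difficulty differs from the banked value
    by more than z_crit combined standard errors (a crude screening rule)."""
    z = (b_new - b_old) / np.sqrt(se_old ** 2 + se_new ** 2)
    return np.abs(z) > z_crit

# Hypothetical banked and recalibrated difficulty estimates for five items.
b_old = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])
se_old = np.array([0.05, 0.04, 0.04, 0.05, 0.06])
b_new = np.array([-1.25, -0.28, 0.45, 0.82, 1.48])   # item 3 has become markedly harder
se_new = np.array([0.06, 0.05, 0.05, 0.06, 0.07])

print(flag_drift(b_old, se_old, b_new, se_new))      # only the drifted item is flagged
```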

The methodological challenge is not merely to detect drift, but to distinguish substantive change from noise in realistic operational settings. Under unequal subgroup sizes, changing testing conditions, and constrained item selection, drift detection and recalibration can themselves introduce feedback effects. Scale continuity becomes a governance problem as well as a statistical one: operational programmes must decide how much updating is acceptable before comparability (and stakeholder confidence) is compromised.

From static traits to evolving latent states

Alongside attention to drifting items, there is now increased emphasis on evolving persons. Temporal IRT models treat ability not as a static estimate tied to a single administration, but as a latent trajectory updated across time (Kim et al., 2023). Empirical work suggests that such models can improve predictive accuracy relative to static formulations without prohibitive computational cost, especially in systems where measurement is distributed across repeated short interactions. Conceptually, this marks a shift in emphasis: IRT increasingly functions as a dynamic state estimator rather than solely as a one-off measurement device.
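The shift can be illustrated with a discretised Bayesian filter: belief about θ is propagated forward under a random-walk assumption between sessions and then updated with each session’s responses. The sketch below is a deliberately simple stand-in for the temporal models cited above, with invented item parameters and responses.

```python
import numpy as np

grid = np.linspace(-4, 4, 81)

def session_likelihood(responses, b_list, a_list):
    """Likelihood of one session's 2PL responses at each grid point."""
    like = np.ones_like(grid)
    for r, b, a in zip(responses, b_list, a_list):
        p = 1.0 / (1.0 + np.exp(-a * (grid - b)))
        like *= p if r else (1 - p)
    return like

def transition(belief, drift_sd=0.3):
    """Propagate belief forward under a Gaussian random walk on theta."""
    kernel = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / drift_sd) ** 2)
    kernel /= kernel.sum(axis=0)          # each column is a proper transition distribution
    return kernel @ belief

belief = np.exp(-0.5 * grid ** 2)         # standard normal prior at time 0
belief /= belief.sum()

sessions = [                              # hypothetical (responses, difficulties, discriminations)
    ([1, 0, 1], [-0.5, 0.2, 0.0], [1.0, 1.2, 0.9]),
    ([1, 1, 1], [0.1, 0.4, 0.8], [1.1, 1.0, 1.3]),
]
for responses, b_list, a_list in sessions:
    belief = transition(belief)                               # the learner may have changed
    belief *= session_likelihood(responses, b_list, a_list)   # update with this session's evidence
    belief /= belief.sum()
    print(np.sum(grid * belief))                              # posterior mean trajectory over time
```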

Longitudinal updating, however, introduces new tensions. Learning effects, practice, feedback, and changing motivation can violate independence assumptions and complicate interpretation of observed change. The central inferential problem becomes methodological discrimination: how reliably can true growth be separated from artefactual change, and how can that separation be defended under realistic sampling variability and operational drift? Longitudinal CAT therefore represents both an efficiency gain and a modelling challenge, because it makes measurement more continuous while making “change” harder to interpret without explicit modelling of time and context.

Sequential decision theory, bandits, and active learning

As adaptivity becomes more explicitly sequential, its structural links to decision theory become clearer. Item selection can be framed as an exploration–exploitation trade-off under uncertainty, analogous to the multi-armed bandit family of problems in machine learning (Sutton and Barto, 2018). In this perspective, “arms” correspond to item or module choices whose information yield is uncertain, and selection rules (including probabilistic strategies such as Thompson sampling) can be interpreted as allocating queries according to current beliefs while preserving exploration.

This framing has been made explicit in work that treats adaptive testing as a contextual bandit problem; BanditCAT is one example (Sharpnack et al., 2024). Simulation evidence suggests that such approaches can be competitive with classical information-based methods in some regimes, particularly where uncertainty about item performance or changing conditions makes exploration valuable. Relatedly, active learning formalises query selection to reduce uncertainty (Settles, 2009). The key difference is constraint: psychometric adaptivity is embedded within blueprint, fairness, exposure, and comparability requirements, which means “informativeness” is never the only objective. Nonetheless, the convergence is useful because it clarifies shared inferential structure while highlighting why operational testing must remain more tightly governed than many machine-learning querying settings.
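A Thompson-style selection rule can be sketched directly in this vocabulary: draw a plausible θ from the current posterior and administer the item most informative at that draw, so that posterior uncertainty itself induces exploration. The code below is an illustrative sketch of this idea, not the BanditCAT procedure, and it omits the blueprint and exposure constraints that operational selection must respect.

```python
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(-4, 4, 61)

def posterior_over_theta(responses, b_used, a_used):
    """Grid posterior over theta under a 2PL model and a standard normal prior."""
    post = np.exp(-0.5 * grid ** 2)
    for r, b, a in zip(responses, b_used, a_used):
        p = 1.0 / (1.0 + np.exp(-a * (grid - b)))
        post *= p if r else (1 - p)
    return post / post.sum()

def thompson_select(responses, b_used, a_used, bank_b, bank_a, used):
    """Sample a plausible theta from the current posterior, then pick the item
    most informative at that sampled value; the randomness in the draw
    spreads selections across items, trading some efficiency for exploration."""
    post = posterior_over_theta(responses, b_used, a_used)
    theta_draw = rng.choice(grid, p=post)
    p = 1.0 / (1.0 + np.exp(-bank_a * (theta_draw - bank_b)))
    info = bank_a ** 2 * p * (1 - p)
    info[list(used)] = -np.inf
    return int(np.argmax(info))

bank_b = rng.normal(0, 1, 50)
bank_a = rng.uniform(0.8, 2.0, 50)
used = [0, 1]
responses = [1, 0]
print(thompson_select(responses, bank_b[used], bank_a[used], bank_b, bank_a, used))
```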

Fairness, invariance, and governance

Differential prediction, DIF, and measurement invariance

As IRT-based methods became embedded in large-scale assessment systems, questions of fairness and comparability moved from methodological considerations to governance requirements. A central concern is whether scores retain the same meaning across groups, administrations, or contexts. In psychometrics, this is formalised through measurement invariance and differential item functioning (DIF).

Measurement invariance refers to the degree to which the relationship between the latent variable and observed responses remains stable across groups. In its strongest forms, invariance implies that items function equivalently for different populations, permitting meaningful comparisons of latent locations. DIF analysis focuses more narrowly on identifying items whose response probabilities differ across groups after controlling for overall location on the latent variable. Importantly, DIF is a statistical finding that requires substantive interpretation; it does not, by itself, establish bias or unfairness (Meredith, 1993; Millsap, 2011; Zumbo, 1999). A useful working maxim is that DIF is diagnostic, not a verdict.
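The Mantel–Haenszel procedure remains one of the most widely used DIF screens and illustrates the logic of conditioning on overall performance. The sketch below (Python, simulated data) computes the MH common odds ratio for a single item, stratifying on total score; it omits the continuity corrections and significance tests used in practice.

```python
import numpy as np

def mantel_haenszel_dif(item_correct, group, total_score):
    """Mantel-Haenszel common odds ratio for one item, stratifying on total score.

    item_correct: 0/1 responses to the studied item
    group:        'ref' or 'focal' membership
    total_score:  matching variable (e.g. total or rest score on the test)
    Returns the MH odds ratio and the ETS delta statistic, -2.35 * ln(alpha).
    """
    num, den = 0.0, 0.0
    for s in np.unique(total_score):
        stratum = total_score == s
        ref, foc = stratum & (group == "ref"), stratum & (group == "focal")
        A = np.sum(item_correct[ref] == 1)   # reference group, correct
        B = np.sum(item_correct[ref] == 0)   # reference group, incorrect
        C = np.sum(item_correct[foc] == 1)   # focal group, correct
        D = np.sum(item_correct[foc] == 0)   # focal group, incorrect
        n = A + B + C + D
        if n > 0:
            num += A * D / n
            den += B * C / n
    alpha = num / den
    return alpha, -2.35 * np.log(alpha)

# Simulated data: one item, two groups, matched on total score, with a mild
# built-in advantage for the reference group at the same matched score.
rng = np.random.default_rng(4)
n = 400
group = np.where(rng.random(n) < 0.5, "ref", "focal")
total_score = rng.integers(0, 11, n)
p = 0.3 + 0.05 * total_score - 0.10 * (group == "focal")
item_correct = (rng.random(n) < p).astype(int)

alpha, delta = mantel_haenszel_dif(item_correct, group, total_score)
print(alpha, delta)   # alpha > 1: higher odds of success for the reference group at matched scores
```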

The increasing diversity of test-taking populations, combined with changes in delivery modes, has made invariance testing a routine expectation in responsible assessment practice. Contemporary approaches increasingly treat invariance as a matter of degree rather than a binary condition, emphasising what level of invariance is required for particular inferences and uses. This aligns psychometric analysis with broader validity arguments, in which fairness is understood as a property of the measurement system as a whole rather than as a purely statistical outcome (Vandenberg & Lance, 2000; Putnick & Bornstein, 2016).

Fairness as embedded operational practice

Over roughly the past five years, fairness work has shifted further from being an occasional diagnostic to becoming part of operational design and monitoring. Psychometric approaches long predate contemporary AI debates: differential prediction (Cleary, 1968), DIF analysis (Lord, 1980; Holland & Wainer, 1993), and measurement invariance (Millsap, 2011) were developed in response to legal, ethical, and validity pressures decades ago. What has changed operationally is not the underlying concern, but the extent to which fairness diagnostics are integrated into delivery pipelines and ongoing governance.

Bayesian multiple-group models support partial pooling across subpopulations, stabilising subgroup parameter estimates under unequal sample sizes (Fox, 2010). Evidence suggests that such approaches can reduce instability and false positives in DIF detection under imbalanced designs (Sinharay, 2020). In large digital platforms, fairness checks are increasingly embedded in continuous monitoring rather than treated as retrospective audits—consistent with the wider shift toward managed calibration and lifecycle oversight.

A technical frontier concerns behaviour under adaptivity. Many fairness procedures were developed under linear testing assumptions. Their operating characteristics in constrained adaptive systems—where exposure rules, routing, and optimisation shape which items are seen by whom—remain insufficiently mapped and require more systematic empirical investigation.

Remote and online administration: security, privacy, equity

The expansion of remote and online assessment has altered the context in which IRT-based systems operate. Online delivery increases accessibility and scalability, but it also changes the security threat model and introduces new variability in testing conditions. The rapid shift to remote assessment during the COVID-19 pandemic brought these tensions into sharp relief, especially in settings where test integrity and comparability are high-stakes requirements.

Remote proctoring technologies were widely adopted to preserve test integrity, but they introduced trade-offs involving privacy, equity, and candidate experience. Research from this period highlights the complexity of balancing security with validity and acceptability, and cautions against assuming that technological surveillance provides a complete solution to cheating or misconduct (Patael et al., 2022; Belzak et al., 2024). From a measurement perspective, the key point is that changes in administration mode can interact with item functioning, respondent behaviour, and score interpretation in ways that require explicit monitoring rather than implicit assumption. The core principles of IRT remain applicable, but their application becomes inseparable from the institutional and technological context in which measurement is carried out.

Fairness and causal modelling in AI discourse

Algorithmic fairness research (Hardt, Price & Srebro, 2016) and causal modelling frameworks (Pearl & Mackenzie, 2018) address predictive disparities and structural bias using tools that are often framed outside classical latent-variable modelling. Psychometric DIF and invariance frameworks address parallel questions—group comparability and differential functioning—within structured measurement models.

The convergence reflects a shared concern with comparability under modelling assumptions. AI’s emphasis on causal reasoning sharpens attention to data-generating mechanisms and intervention; psychometrics offers mature tools for operationalising invariance and diagnosing differential functioning within latent variable systems. Taken together, these traditions suggest a productive division of labour: causal reasoning can clarify which disparities are structural and which are measurement-linked, while psychometrics provides established methods for maintaining scale meaning under heterogeneous populations and changing contexts.

Multidimensionality and response processes

Multidimensional IRT in applied settings

Many constructs assessed in educational and psychological testing are inherently multidimensional. Early applications of IRT often relied on unidimensional approximations for reasons of tractability and interpretability, but advances in computation have made multidimensional IRT (MIRT) increasingly accessible. MIRT models allow multiple latent variables to be estimated simultaneously, capturing patterns of association among item responses that cannot be represented on a single scale (Reckase, 2009).

In applied settings, MIRT has been used to support profile reporting, diagnostic feedback, and the analysis of complex test structures. At the same time, multidimensional models raise challenges for score interpretation and communication. Added flexibility does not automatically translate into clearer reporting, and the decision to adopt multidimensional models must be guided by intended score uses and by the capacity of stakeholders to interpret the resulting structure. Wider availability of MIRT software has contributed to adoption, but it reinforces a central theme: computational tractability does not settle interpretive questions. Decisions about dimensionality, model fit, and reporting remain substantive as well as technical.

Multidimensional modelling and score interpretation

Over the past few years, multidimensional and hierarchical modelling have become computationally routine (Reckase, 2009; van der Linden, 2018). Hierarchical priors can stabilise small-domain estimates without materially distorting global scores when assumptions are approximately satisfied. The limiting factor is increasingly interpretive rather than computational. Improvements in global fit indices do not automatically confer psychological distinctiveness, and statistical identifiability should not be conflated with substantive differentiation.

The capacity to report subscores and multidimensional profiles now exceeds the evidence base evaluating their behavioural consequences. A pressing agenda is therefore not only to refine estimation, but to study how multidimensional reporting shapes institutional and individual decision-making—whether it improves decisions, increases noise, changes incentives, or shifts attention toward dimensions that are statistically detectable but psychologically or educationally marginal.

Response styles and assumption hierarchies

Recent work has also sharpened attention on response processes and the uneven scrutiny of modelling assumptions in operational IRT practice. Formal IRT-based response-style models show that acquiescence and extreme responding can bias discrimination estimates even when global fit statistics appear acceptable (Bolt, 2024). In operational terms, ignoring response styles is not a harmless simplification: it can alter item parameter estimates, distort subgroup comparisons, and affect classification when response tendencies differ systematically across populations.

At the same time, reviews of empirical IRT practice suggest that certain assumptions—such as local independence or detailed item-level fit—are less consistently evaluated than unidimensionality (Yiğiter & Boduroğlu, 2024). The resulting hierarchy of assumption scrutiny is itself an empirical phenomenon with consequences for practice. The methodological challenge is not to eliminate all violations, but to determine which violations materially distort inference and which mainly affect formal fit metrics. As psychometrics becomes more mathematically refined, its psychological commitments—and the practical implications of its assumptions—must remain explicit.

Convergences between psychometrics and contemporary AI

Latent representation

Deep learning systems operate in high-dimensional latent spaces learned through representation learning (Goodfellow, Bengio & Courville, 2016; Vaswani et al., 2017). Variational autoencoders (Kingma & Welling, 2014) formalise probabilistic latent-variable modelling within neural architectures. This logic is continuous with latent trait modelling in psychometrics, but the key distinction lies in interpretive constraint. Psychometric latent variables are anchored to constructs and evaluated through invariance testing and validity arguments (Messick, 1989; Borsboom, 2005). AI embeddings, while powerful, are often not tied to explicit construct theory, even when they yield excellent predictive performance. Interest in disentanglement and identifiability within representation learning echoes longstanding psychometric concerns, but in a setting where interpretability and construct discipline are often secondary objectives.

Dynamic state estimation

Sequential updating in reinforcement learning and Bayesian filtering treats hidden states as evolving quantities (Bishop, 2006; Sutton & Barto, 2018). Temporal IRT models represent a parallel development within measurement science (Kim et al., 2023). AI has advanced scalable inference in complex dynamic systems; psychometrics contributes disciplined attention to comparability, scale continuity, and the interpretive meaning of change. A particularly promising interface concerns integration of dynamic modelling with equating principles—preserving continuity of meaning while allowing latent states and item behaviour to vary over time.

Adaptive query selection

Active learning and contextual bandit methods formalise query selection to reduce uncertainty (Settles, 2009). CAT implements a constrained version of this principle using information functions and optimisation under blueprint constraints. The parallel underscores that adaptive measurement and adaptive learning share a common inferential structure. What distinguishes psychometrics is its sustained concern with construct validity and regulated decision contexts, which makes constraint management—not only efficiency—a defining feature of operational adaptivity.

Explainability and construct discipline

Explainable AI has gained prominence as models increase in complexity (Rudin, 2019). Psychometric models have traditionally prioritised parameter interpretability tied to constructs, even when predictive accuracy is not maximised. The convergence suggests complementary strengths: AI contributes scalable modelling; psychometrics contributes disciplined construct specification and comparability frameworks. The dialogue between fields is increasingly technical rather than speculative, and the points of contact are now as much methodological as conceptual.

Integrative perspective

Taken together, developments in IRT, CAT, and AI reveal not a replacement of one field by another, but a growing technical alignment around shared inferential problems. Psychometrics has moved toward dynamic, adaptive, and temporally indexed modelling; AI has increasingly engaged with latent representation, sequential updating, fairness constraints, and interpretability. Both traditions address inference under uncertainty; what continues to distinguish psychometrics is sustained attention to construct definition, invariance, and decision accountability within regulated contexts. On this view, convergence strengthens rather than diminishes the relevance of psychometric theory.

Conclusion

The developments traced in this chapter reveal a consistent trajectory: item response theory evolved not merely as a set of statistical models, but as the conceptual core of modern measurement systems. From the Rasch model’s emphasis on invariance and measurement discipline, through the flexibility of multi-parameter models, to the operational realities of item banking and adaptive testing, IRT progressively reframed what it means to measure psychological and educational constructs at scale.

Several themes recur. First, uncertainty is not a defect to be engineered away but an intrinsic feature of inference that must be managed, communicated, and aligned with intended uses. Second, fairness is best understood as a property of systems rather than items in isolation: invariance and DIF analyses provide essential tools, but their interpretation depends on substantive judgement and governance as populations, delivery modes, and institutional pressures change. Third, operational mechanisms—item banks, exposure controls, optimisation frameworks, and online calibration—are not ancillary technicalities but the means by which measurement principles are realised in practice. Finally, methodological power does not remove the need for expertise: choices about models, dimensionality, constraints, and reporting shape the meaning of scores and the consequences of their use.

This chapter has focused on IRT as a mature measurement framework. By the late 2010s, calibrated item banks, adaptive delivery, formal estimation, invariance analysis, and operational governance were firmly in place, together defining a paradigm capable of supporting large-scale, high-stakes assessment across diverse contexts. Recent developments in artificial intelligence—automated item generation, large-scale simulation, and more sophisticated analysis of response patterns—do not render these principles obsolete; rather, they make their assumptions more visible and, in high-stakes contexts, more consequential.

The practical challenge is therefore integration. As new technologies are incorporated into assessment, the distinctions emphasised throughout this chapter—between model and system, estimation and interpretation, and statistical detection and substantive judgement—become even more important. IRT provides the conceptual vocabulary and methodological discipline needed to evaluate innovations critically, ensuring that efficiency gains do not come at the expense of validity, fairness, or interpretability. A fuller treatment of these developments, and of how AI-assisted methods interact with established psychometric practice, is reserved for a subsequent chapter.

© John Rust (February 2026). All Rights Reserved.

References

Ban, J. C., Hanson, B. A., Wang, T., Yi, Q. and Harris, D. J. (2019). Using response times to assess the validity of joint response time and accuracy models. Educational Measurement: Issues and Practice, 38, 33–46.

Belzak, W. C. M., Smith, K. M. and Roussos, L. A. (2024). Fairness and judgement in remote proctoring decisions. Educational Measurement: Issues and Practice, 43, 45–58.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord and M. R. Novick (eds.), Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

Bishop, C.M. (2006) Pattern Recognition and Machine Learning. New York: Springer.

Bock, R. D. and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.

Bolt, D.M. (2024) ‘IRT-based response style models and related methodology’, British Journal of Mathematical and Statistical Psychology.

Borsboom, D. (2005) Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge: Cambridge University Press.

Cai, L. (2010). High-dimensional exploratory item factor analysis by a Metropolis–Hastings Robbins–Monro algorithm. Psychometrika, 75, 33–57.

Chen, S.-Y. and Ankenmann, R.D. (2004) ‘Effects of item exposure control procedures on item pool usage and test security in CAT’, Applied Psychological Measurement, 28(6), pp. 417–432.

Cleary, T.A. (1968) ‘Test bias: prediction of grades of Negro and white students in integrated colleges’, Journal of Educational Measurement, 5(2), pp. 115–124.

Fox, J.-P. (2010) Bayesian Item Response Modeling: Theory and Applications. New York: Springer.

Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. Cambridge, MA: MIT Press.

Hambleton, R. K., Swaminathan, H. and Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage.

Hardt, M., Price, E. and Srebro, N. (2016) ‘Equality of opportunity in supervised learning’, Advances in Neural Information Processing Systems, 29, pp. 3315–3323.

Holland, P.W. and Wainer, H. (eds.) (1993) Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum.

Kim, Y., Sankaranarayanan, S., Piech, C. and Thille, C. (2023) ‘Variational temporal IRT: fast, accurate, and explainable inference of dynamic learner proficiency’, arXiv preprint arXiv:2311.08594.

Kingma, D.P. and Welling, M. (2014) ‘Auto-encoding variational Bayes’, Proceedings of the International Conference on Learning Representations.

Kolen, M. J. and Brennan, R. L. (2014). Test Equating, Scaling, and Linking (3rd ed.). New York: Springer.

Liu, C.W. and Wang, W.-C. (2021) ‘Item selection methods for computerized adaptive testing: a review and new perspectives’, Psychometrika, 86(1), pp. 1–28.

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Erlbaum.

Lord, F. M. and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.


Ma, W.A., et al. (2025) ‘ROAR-CAT: efficient online adaptive reading assessment’, Behavior Research Methods.

Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543.

Messick, S. (1989) ‘Validity’, in Linn, R.L. (ed.) Educational Measurement (3rd edn). New York: Macmillan, pp. 13–103.

Millsap, R. E. (2011). Statistical Approaches to Measurement Invariance. New York: Routledge.

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.

Mokken, R. J. (1971). A Theory and Procedure of Scale Analysis. The Hague: Mouton.

Patton, J.M., Cheng, Y. and Sinharay, S. (2021) ‘A framework for monitoring item parameter drift in large-scale assessment’, Journal of Educational Measurement, 58(3), pp. 353–376.

Pearl, J. and Mackenzie, D. (2018) The Book of Why: The New Science of Cause and Effect. London: Penguin.

Putnick, D. L. and Bornstein, M. H. (2016). Measurement invariance conventions and reporting. Developmental Review, 41, 71–90.

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research.

Reckase, M. D. (2009). Multidimensional Item Response Theory. New York: Springer.


Rudin, C. (2019) ‘Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead’, Nature Machine Intelligence, 1, pp. 206–215.

Settles, B. (2009) ‘Active learning literature survey’, University of Wisconsin–Madison, Computer Sciences Technical Report 1648.

Sharpnack, J., Hao, K., Mulcaire, P., Bicknell, K., LaFlair, G., Yancey, K. and von Davier, A. (2024) ‘BanditCAT and AutoIRT: machine learning approaches to computerized adaptive testing and item calibration’, arXiv preprint arXiv:2410.21033.

Sijtsma, K. (2012). Psychometrics in practice: Critical evaluation of the current status of measurement in psychology. Psychometrika, 77, 617–633.

Sijtsma, K. and Molenaar, I. W. (2002). Introduction to Nonparametric Item Response Theory. Thousand Oaks, CA: Sage.

Sinharay, S. (2020) ‘Monitoring item parameter drift in operational testing programs’, Educational Measurement: Issues and Practice, 39(2), pp. 16–26.

Sireci, S.G. (2020) ‘Fairness in testing: the old and the new’, Educational Measurement: Issues and Practice, 39(1), pp. 3–6.

Sutton, R.S. and Barto, A.G. (2018) Reinforcement Learning: An Introduction (2nd edn). Cambridge, MA: MIT Press.

Vandenberg, R. J. and Lance, C. E. (2000). A review and synthesis of the measurement invariance literature. Organizational Research Methods, 3, 4–70.

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) ‘Attention is all you need’, Advances in Neural Information Processing Systems, 30, pp. 5998–6008.

Wainer, H. (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Erlbaum.

Wang, C. and Liu, Y. (2019) ‘Online calibration in item response theory models’, Educational and Psychological Measurement, 79(3), pp. 487–512.

Weiss, D. J. (ed.) (1983). New Horizons in Testing: Latent Trait Test Theory and Computerized Adaptive Testing. New York: Academic Press.

Wright, B. D. and Masters, G. N. (1982). Rating Scale Analysis. Chicago: MESA Press.

Zumbo, B. D. (1999). A Handbook on the Theory and Methods of Differential Item Functioning. Ottawa: Directorate of Human Resources Research and Evaluation, Department of National Defense.