On the Status of This Chapter
This chapter on AI-Assisted Test Development is presented as a working draft of a proposed chapter for a future edition of Modern Psychometrics. In light of the rapid pace of change in AI-assisted assessment and online research methods, it is being made available here in advance of formal publication.
The chapter is intended as a methodological intervention rather than a final position statement. Its purpose is to articulate a structural problem that has emerged at the intersection of online sampling, automated item generation, and psychometric modelling, and to reconnect contemporary practice with principles that have long been recognised within the field. Where appropriate, revisions will be made, and the chapter will evolve as part of an open, ongoing conversation about how psychometrics should adapt to new technologies. This draft may be cited, with the understanding that it represents a developing contribution rather than a settled text.
Current Limits to AI-Assisted Test Development
Rust, J. (2025) Working chapter draft, Modern Psychometrics (5th ed., in preparation). Available at: https://johnrust.website/reading/books/ai-assisted-test-development/
Introduction
Psychometric test development has always involved a tension between formal structure and lived cognition. On the one hand, tests must be systematic, reliable, and grounded in defensible models of ability. On the other, they must function as intended when encountered by real people, under realistic conditions, and across the full range of abilities for which they are designed. Historically, this tension was managed through close attention to examinee behaviour during development and calibration. In recent years, however, changes in method and infrastructure have altered how that balance is struck.
Two developments in particular have reshaped contemporary practice. The first is the widespread adoption of online participant panels for item piloting and calibration. These panels have made psychometric development faster, cheaper, and more scalable, while offering high levels of data quality and experimental control. The second is the rapid integration of artificial intelligence, and especially large language models, into item generation and early-stage test design. Together, these tools have transformed what is possible in psychometrics, but they have also introduced new risks that are not yet fully recognised.
In this chapter, cognitive representativeness means that a development sample reproduces (within tolerances relevant to intended use) the target population’s distribution of: functional reading comprehension and reading speed; working-memory tolerance for multi-clause constraints; susceptibility to instruction misparsing; test anxiety and confidence under evaluation; and digital/administrative fluency (device, UI, timed navigation). These dimensions can vary independently of standard demographic strata.

It is argued that a specific and systematic misalignment has emerged at the intersection of these developments. Items are increasingly derived from expert-coded sources, including curricula, formal frameworks, and AI-generated templates. They are then evaluated using samples that are cognitively selective, procedurally expert, and unusually well adapted to abstract testing environments. The resulting evidence suggests that items are accessible and appropriately calibrated. Yet when the same items are deployed in operational settings, particularly with lower-ability or less practised examinees, they often prove unexpectedly difficult, confusing, or fragile.
A distinct alternative explanation is administration-ecology non-invariance: item parameters may shift because operational settings differ from panel settings in stakes, timing constraints, proctoring friction, device quality, language context, and motivation. Where such context shifts dominate, the issue is not panel conditioning but a construct shift across modes. The argument here is narrower: even holding administration constant, development pipelines can drift toward samples and item sources that under-express the lower-tail cognitive behaviours that matter for accessibility.

The source of this discrepancy is not poor modelling, inadequate technology, or declining standards. It lies in a gradual shift away from cognitive realism as an explicit design constraint. Difficulty has come to be treated primarily as a statistical property inferred from clean data, rather than as an empirical property of human cognition observed under realistic conditions. The consequence is a closed loop in which expert-like items are validated by expert-adapted respondents and then generalised to populations for whom neither the items nor the evidence is representative.
The purpose of this chapter is to unpack how this loop arises and why it persists, despite longstanding insights from psychometrics and cognitive psychology. It does so by examining, in turn, the cognitive selectivity of online panels, the effects of repeated participation on test performance, the consequences of over-clean data, and the limitations of curriculum-aligned, symplectic, and AI-generated item sources when used without appropriate constraints. Throughout, the emphasis is on mechanisms rather than on critique, and on structural causes rather than individual decisions. The chapter then turns to the sociological and institutional factors that allowed these practices to become normalised, including changes in training, division of labour within development pipelines, and the incentives created by speed and scale. Finally, it sets out practical principles for restoring cognitive realism to modern psychometric development without abandoning the genuine benefits of online methods and AI-assisted tools.
The argument advanced here is not a rejection of contemporary practice, but a correction of its implicit assumptions. Psychometrics remains a discipline concerned with drawing valid inferences about human abilities. Maintaining that commitment in an era of powerful new technologies requires renewed attention to how tests are actually experienced by the people who take them. The sections that follow aim to make that attention systematic again.
Empirically, the “closed loop” hypothesis predicts a characteristic pattern: (i) calibrated difficulty for text-heavy, multi-constraint items will be systematically underestimated relative to operational deployment, especially in the lower tail; (ii) this underestimation will concentrate in items with high reading-load and working-memory coordination demands; (iii) the gap will manifest as differential test functioning (DTF) and/or item-level differential item functioning (DIF) when moving from panel calibration to cognitively naïve field samples, with elevated omission and abandonment rates and longer response times among lower-ability examinees; and (iv) these effects will be reduced when calibration samples explicitly include cognitively naïve participants and when “error” behaviours are retained and modelled rather than screened out.
Demographic Diversity Is Not Cognitive Representativeness
A persistent assumption in contemporary psychometric practice is that demographic diversity functions as a proxy for cognitive representativeness. If a sample contains respondents with varied ages, genders, educational levels, incomes, and geographical backgrounds, it is often taken to approximate the population for whom a test is intended. This assumption underpins the widespread use of online participant panels in early, and increasingly late, stages of item development. It is also, in crucial respects, incorrect. The error lies in treating social descriptors as substitutes for cognitive functioning in testing situations. Demographic variables describe who people are and how they are positioned socially. Cognitive representativeness concerns how individuals read, interpret, and respond to test materials under conditions of abstraction, time pressure, and uncertainty. The two are related, but they are not interchangeable, and confusing them leads to systematic distortion of item statistics.
Online research platforms such as Prolific have significantly improved the quality of behavioural data relative to earlier crowdworking systems. Comparative studies show higher levels of attention, better instruction-following, lower rates of random responding, and more reliable data overall (Palan & Schitter, 2018; Douglas et al., 2023). These improvements are real and valuable. However, they also signal a form of cognitive selection that is rarely made explicit in psychometric reasoning.
Participation in online panels is not neutral with respect to cognition. Individuals who register for, remain active on, and succeed within such platforms tend to be selected for a distinctive profile: high digital literacy, comfort with abstract and decontextualised tasks, tolerance for dense written instructions, and familiarity with the conventions of psychological testing. These traits are directly relevant to performance on reasoning and ability items. They are also unevenly distributed in the populations for whom many applied tests are designed. This leads to a critical distinction that is often glossed over in practice. A sample may be demographically diverse while remaining cognitively selective. The presence of respondents with low formal qualifications or lower income does not guarantee the inclusion of individuals who struggle with reading comprehension, working memory load, or task interpretation in the ways that matter for applied assessment. Someone with limited educational attainment who voluntarily completes online studies for payment is already atypical of the broader population of low-qualification or low-ability examinees encountered in workplace, educational, or safety-critical testing contexts.
Methodological reviews of online panels tend to acknowledge this indirectly. Douglas et al. (2023), for example, show that Prolific outperforms MTurk and other panels on standard data-quality metrics, but also note that the sample remains disproportionately educated and cognitively engaged. Uittenhove and Vergauwe (2023) make the point more directly in the context of cognitive research, arguing that who is tested matters at least as much as how testing is conducted, particularly for tasks that place sustained demands on attention and comprehension. These observations are often treated as caveats to generalisability. From a psychometric perspective, they point to a deeper issue: the systematic underrepresentation of the very cognitive difficulties that define lower-ability performance.
The problem is not that online panel participants are uniformly high in general intelligence. Rather, they are highly adapted to the testing environment itself. Repeated exposure to online studies may encourage participants to develop efficiencies in parsing instructions, identifying the core of a task, managing time pressure, and avoiding common distractors. These are procedural competencies that improve performance across a wide range of items without necessarily reflecting the latent abilities the test is intended to measure.
As a result, item statistics derived from such samples are biased in predictable ways. Items appear easier than they will be in operational use. Variance is reduced, particularly at the lower end of the ability distribution. Difficulty gradients are flattened, and items that would challenge less confident or less practised examinees show little sign of doing so during development. Tang et al. (2022) document external-validity gaps between online panels and target populations in a different substantive domain (privacy/security surveys). The point here is not domain equivalence but mechanism plausibility: when participation is cognitively selective and tasks are cognitively demanding, panel-derived estimates can diverge from estimates obtained in samples closer to the target ecology. In psychometric terms, that risk is consistent with a failure of cognitive representativeness.
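The mechanism can be made concrete with a small simulation. The sketch below (Python; all parameter values, including the strength of the procedural-literacy effect, are illustrative assumptions rather than empirical estimates) generates responses from a Rasch-style model in which performance depends on both the target ability and an unmodelled procedural test literacy. Calibrating difficulty on a panel-like sample in which that literacy is elevated understates the difficulty that a field sample with the same ability distribution will experience.

```python
# Sketch: how an unmodelled "procedural test literacy" term in a panel sample
# leads a Rasch-style calibration to understate item difficulty for the field.
# All parameter values are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit  # logistic function

rng = np.random.default_rng(0)

def simulate_responses(theta, skill, b, skill_weight=0.8):
    """Score 1/0 with P(correct) = logistic(theta + skill_weight * skill - b)."""
    return rng.binomial(1, expit(theta + skill_weight * skill - b))

def estimate_difficulty(theta, responses):
    """MLE of b under P(correct) = logistic(theta - b), treating theta as known
    (a deliberate simplification to keep the sketch short)."""
    def neg_loglik(b):
        p = np.clip(expit(theta - b), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_loglik, bounds=(-5, 5), method="bounded").x

n_persons, true_b = 2000, 1.0
theta = rng.normal(0, 1, n_persons)             # same ability distribution in both samples
panel_skill = rng.normal(1.0, 0.3, n_persons)   # panel: elevated procedural literacy
field_skill = np.zeros(n_persons)               # field: no such advantage

b_panel = estimate_difficulty(theta, simulate_responses(theta, panel_skill, true_b))
b_field = estimate_difficulty(theta, simulate_responses(theta, field_skill, true_b))

print(f"difficulty calibrated on panel sample: {b_panel:.2f}")  # ~0.2, understates 1.0
print(f"difficulty calibrated on field sample: {b_field:.2f}")  # ~1.0
```

In this toy example the panel-based estimate sits well below the generating difficulty while the field-based estimate recovers it; the size of the gap simply tracks whatever weight the unmodelled skill carries.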
Importantly, stratifying online samples by education or income does not resolve this problem. Lower educational attainment within an online panel does not equate to lower cognitive functioning in testing contexts. Participants with fewer qualifications who are active online workers are often more motivated, more attentive, and more practised at abstract problem-solving than many individuals with similar qualifications outside the panel ecosystem. They may also possess compensatory skills, such as strong verbal strategies or pattern-recognition heuristics, that allow them to perform well on reasoning tasks despite limited formal education.
This helps explain a recurring pattern in modern test development. Items that behave well during online piloting later prove problematic when deployed in real settings. Practitioners report unexpected confusion, reading demands that exceed expectations, or sharp drops in performance among groups for whom the test was explicitly intended to be accessible. In operational terms, the problem can be detected as systematic deviations from predicted p-values or IRT difficulties, elevated omission and abandonment rates, atypically long response-time distributions, and item-level misfit or DIF concentrated in subgroups defined by literacy and test familiarity rather than by the intended latent trait alone. These failures are frequently attributed to administration conditions or contextual stress. In many cases, the simpler explanation is that the items were never calibrated against a cognitively realistic sample.
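In practice, such drift can be screened for with quite ordinary tooling. The sketch below (Python/pandas) compares each item’s operational proportion correct against the proportion the panel calibration predicted, expresses the gap on the logit scale, and flags items with large shifts or elevated omission rates. The column names, example thresholds, and the use of classical proportions rather than a full IRT refit are all simplifying assumptions; a production analysis would follow up flagged items with proper DIF/DTF modelling.

```python
# Sketch of a panel-to-field drift screen. Column names, thresholds, and the
# expected inputs are assumptions; this is a first-pass flag, not a DIF analysis.
import numpy as np
import pandas as pd

def logit(p, eps=1e-4):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def drift_screen(field: pd.DataFrame, calibration: pd.DataFrame,
                 drift_threshold: float = 0.5, omission_threshold: float = 0.10):
    """field: one row per item attempt with columns item_id, correct (1/0, NaN = omitted),
    rt_seconds. calibration: one row per item with columns item_id, expected_p
    (proportion correct predicted from the panel calibration)."""
    by_item = field.groupby("item_id").agg(
        observed_p=("correct", "mean"),                        # NaN omissions ignored
        omission_rate=("correct", lambda s: s.isna().mean()),
        median_rt=("rt_seconds", "median"),
    ).reset_index()
    merged = by_item.merge(calibration, on="item_id")
    # Positive drift: the item is harder in the field than the panel predicted.
    merged["difficulty_drift"] = logit(merged["expected_p"]) - logit(merged["observed_p"])
    merged["flagged"] = (
        (merged["difficulty_drift"] > drift_threshold)
        | (merged["omission_rate"] > omission_threshold)
    )
    return merged.sort_values("difficulty_drift", ascending=False)
```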
The issue is compounded by selective retention within online panels. Participants who find tasks confusing, frustrating, or cognitively taxing are more likely to disengage, fail attention checks, or drop out altogether. Over time, this produces a pool that is increasingly homogeneous in its ability to cope with complex testing demands. From a psychometric standpoint, this constitutes an unacknowledged form of range restriction operating precisely on the dimensions that influence item difficulty.

None of this implies that online panels are inappropriate for psychometric research. On the contrary, they are highly effective for specific stages of test development. They are well suited to identifying gross item flaws, estimating discrimination parameters under ideal conditions, and exploring dimensional structure in clean data. The error lies in treating their outputs as indicative of how items will function across the full spectrum of real examinees.
The distinction becomes particularly important when tests are intended for heterogeneous or vulnerable populations: individuals with limited literacy, second-language users, candidates under stress, or examinees with low confidence in abstract reasoning tasks. These are contexts in which cognitive ergonomics, reading load, and task interpretation play a decisive role. Online panels systematically under-sample these difficulties.

The growing reliance on such panels reflects a broader shift in psychometric culture. As testing has moved online, convenience has increasingly substituted for theory. Statistical indicators derived from large, clean datasets are granted authority even when the underlying sampling assumptions are weak. Difficulty becomes something inferred from response patterns alone, rather than something grounded in an explicit model of how people with varying cognitive resources engage with test materials.
Recognising the limits of demographic diversity as a proxy for cognitive representativeness is therefore a necessary corrective. It reopens questions about where different kinds of samples are appropriate and for what purposes. It also prepares the ground for understanding why newer developments, particularly the use of AI-generated test items, interact with online panels in ways that exacerbate these distortions rather than resolving them.

Boundary conditions matter. For high-literacy, test-savvy populations, for low reading-load items, and for contexts where the administration ecology closely matches panel testing (self-paced, low stakes, familiar interfaces), online panels may yield difficulty estimates that transfer adequately. The concern is greatest for heterogeneous or vulnerable populations and for items whose difficulty is mediated by comprehension, working-memory coordination, or interpretation under uncertainty. The next section examines a closely related mechanism: the way repeated participation in online studies produces procedural expertise that further inflates apparent item easiness, even when demographic diversity is preserved.
The Expertise-Through-Participation Effect
Even when demographic diversity is preserved, online panels introduce a second, more subtle distortion into psychometric development: repeated participation produces procedural test literacy, a form of expertise that systematically inflates apparent item easiness. This effect is not incidental. It is a structural consequence of how online research platforms operate and how participants adapt to them over time.
For clarity, procedural test literacy here refers to transferable skills in navigating test ecologies (instruction parsing, time management, recognising common distractor formats). Depending on intended use, some of these skills may be construct-irrelevant nuisance variance (e.g., when measuring reasoning independent of literacy), while in other contexts they may be construct-relevant (e.g., where workplace performance requires rule-following under written constraints). The concern in this chapter arises when procedural literacy is implicitly treated as irrelevant but is nonetheless differentially concentrated in the calibration sample.
Online panel participants do not encounter tests as naïve examinees. Many complete dozens or even hundreds of studies. Through this repeated exposure, they acquire skills that are only weakly related to the constructs being measured but strongly related to success in testing environments. These skills include rapid identification of task demands, efficient parsing of instructions, recognition of common distractor patterns, and strategies for balancing speed and accuracy under time pressure. In applied psychometrics, such competencies are rarely modelled explicitly, yet they exert a powerful influence on performance. This phenomenon can be described as an expertise-through-participation effect. Participants may become increasingly expert not in the content of specific items, but in the ecology of testing itself. They learn how tests “work”: what kinds of answers are typically rewarded, how ambiguity is usually resolved, and how to manage cognitive effort across a session. The result is an elevation of performance that is largely independent of the latent abilities the test is intended to measure.
From a psychometric standpoint, this matters because item statistics implicitly assume a population of respondents who are naïve with respect to the testing process. Classical item difficulty indices, and their modern counterparts in item response theory, are interpreted as reflecting how demanding an item is for individuals at different levels of ability. When a substantial proportion of respondents possess unmodelled procedural expertise, these indices no longer have that interpretation. They instead reflect performance under conditions of partial training.
Empirical work on online panels consistently hints at this effect, even when it is not the primary focus. Douglas et al. (2023) report lower rates of careless responding and higher instruction compliance on Prolific than on other platforms. While this is often framed as improved data quality, it also indicates that participants are highly practised at meeting the implicit demands of cognitive tasks. Tang et al. (2022) show that estimates derived from online panels diverge from those obtained from probability samples; it is plausible that similar divergence extends to tasks requiring sustained attention or complex reasoning. One likely mechanism for such divergence is differential familiarity with test-like problem solving.

The consequences for item development are systematic. Items that rely on careful reading, multi-step integration of information, or the inhibition of superficially plausible distractors will appear easier in samples populated by experienced test-takers. Performance becomes more uniform, reducing variance and masking the presence of lower-performing response strategies. Difficulty gradients flatten, and the lower end of the ability distribution becomes poorly resolved.
This effect is especially pronounced for items that resemble puzzles, logic problems, or academic exercises. Online panel participants are disproportionately likely to enjoy such tasks and to have developed strategies for approaching them efficiently. In contrast, many real-world examinees, particularly those with limited testing experience or lower confidence in abstract reasoning, approach the same items with hesitation, misinterpretation, or cognitive overload. These reactions are not noise; they are part of what the test is meant to measure.
Crucially, the expertise-through-participation effect is reinforced by platform-level incentives. Online panels reward speed, accuracy, and compliance. Participants who consistently perform well are more likely to be retained, invited to further studies, and remunerated efficiently. Those who struggle are filtered out through attention checks, exclusion criteria, or self-selection. Over time, the participant pool becomes increasingly skewed toward individuals who are adept at navigating testing demands. This process can resemble a form of informal training regime, one that operates continuously but invisibly.
Traditional psychometric theory recognises the importance of practice effects, but typically treats them as item-specific or content-specific phenomena. The expertise-through-participation effect is broader. It reflects learning at the level of task interpretation and response strategy, not familiarity with particular items. As such, it can influence performance across an entire test battery without leaving obvious traces such as item exposure or memory effects.
The implications for ability testing are significant. When item parameters are estimated using samples enriched with procedurally expert respondents, those parameters encode assumptions about test-taker behaviour that do not hold in operational contexts. Items calibrated as easy may in fact require levels of reading fluency, working-memory coordination, or strategic insight that exceed the capacities of intended users. Conversely, items that genuinely tap lower-level abilities may be discarded prematurely because they fail to discriminate within an overly competent sample.
This helps explain why psychometric failures often emerge only after deployment. Developers are surprised to find that candidates misunderstand instructions, misparse scenarios, or resort to guessing in ways that were rarely observed during piloting. These behaviours were present all along, but the sampling regime systematically filtered them out.
The expertise-through-participation effect also interacts with demographic stratification in misleading ways. Researchers may believe that by recruiting participants with lower educational attainment they are capturing lower levels of ability. In practice, they are often recruiting individuals who have compensated for limited formal education through extensive experience with abstract online tasks. The result is a conflation of educational background with cognitive performance that does not generalise beyond the panel environment. It is important to emphasise that this effect is not a flaw of any particular platform, nor is it easily eliminated through improved sampling procedures. It arises from the basic logic of online research: voluntary participation, repeated exposure, and selective retention. As such, it must be addressed conceptually rather than technologically.
For psychometricians, the key lesson is that repeated participation constitutes an unmeasured source of variance that systematically biases item statistics. Treating online panel data as if it were drawn from a population of naïve examinees leads to predictable errors in difficulty estimation and test design. Recognising this requires a shift in how early-stage data are interpreted, and a clearer separation between stages of development that benefit from expert respondents and those that require cognitively realistic samples. The next section examines a third, closely related mechanism: the over-cleanliness of online panel data, and how the very procedures designed to ensure quality simultaneously remove the forms of error and misunderstanding that define lower-ability performance in real testing contexts.
The Over-Cleanliness Problem
A third mechanism compounds the distortions introduced by cognitive selectivity and expertise-through-participation: the systematic over-cleanliness of online panel data. The very procedures designed to improve data quality also remove precisely those forms of error, hesitation, and misunderstanding that define lower-ability performance in real testing contexts.
Online research platforms place strong emphasis on data hygiene. Participants are screened using attention checks, minimum completion times, comprehension questions, and exclusion rules designed to eliminate careless or disengaged responding. From the perspective of experimental control, these practices are entirely sensible. They reduce noise, improve reliability, and increase statistical power. From the perspective of applied psychometrics, however, they have an unintended and often unacknowledged consequence: they can selectively exclude cognitive behaviours that are diagnostically relevant for accessibility and lower-tail functioning in many applied tests.
In operational testing environments, lower performance is rarely expressed as random responding. It is more often expressed as slow reading, partial comprehension, misinterpretation of task demands, difficulty integrating multiple pieces of information, or premature abandonment of effort. These behaviours are not artefacts to be cleaned away. They are central features of the construct space in which many ability tests operate. When development samples systematically exclude respondents who exhibit them, item statistics are calibrated against an unrealistically narrow slice of human cognition.
The contrast between research quality and psychometric realism is particularly stark in online panels. Participants who read slowly, struggle with instructions, or require repeated clarification are more likely to fail attention checks or exceed time limits. They are filtered out either automatically or through researcher-imposed exclusion criteria. Over time, this produces a participant pool that is unusually proficient at coping with complex written material and abstract task structures. The resulting data are clean, but they are also atypical.
Several reviews of online research methods implicitly acknowledge this tension. Douglas et al. (2023), for example, document the effectiveness of Prolific’s quality controls in reducing inattentive responding relative to other platforms. While this is presented as a methodological advantage, it also means that individuals who find cognitive tasks genuinely difficult are disproportionately removed from the sample. Rodd (2024) makes a similar point in the context of experimental psychology, noting that online testing environments tend to privilege participants who are already comfortable with sustained, self-directed cognitive work.
From a psychometric standpoint, the implications are straightforward. Difficulty estimates derived from such samples systematically underestimate the demands that items place on real examinees. Items that require careful parsing of text, maintenance of information in working memory, or suppression of misleading cues appear easier when administered to respondents who rarely experience breakdowns in these processes. The lower tail of the ability distribution is effectively truncated before calibration even begins.
This over-cleanliness also suppresses variance in ways that are difficult to detect statistically. Because respondents who struggle are excluded rather than retained as low scorers, distributions appear tighter and more orderly than they would be in operational use. Reliability estimates may improve, discrimination indices may look respectable, and factor structures may appear stable. Yet these apparent strengths rest on a restricted range of cognitive behaviour. When the test is later administered to a broader population, variance reappears in unanticipated forms, often accompanied by unexpected item failures.
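The statistical signature of this filtering can be illustrated with a short simulation (Python; the distributions, the number of items, and the strength of the ability–screening correlation are assumed for illustration only). When the probability of passing quality screens rises with the target ability, the retained sample shows higher item p-values and a visibly compressed total-score spread, even though no individual response has been altered.

```python
# Illustrative sketch: quality screening that correlates with ability truncates
# the lower tail before calibration. Parameter values are assumptions.
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)

n_persons, n_items = 5000, 20
theta = rng.normal(0, 1, n_persons)                  # target ability
b = np.linspace(-1.5, 1.5, n_items)                  # item difficulties
responses = rng.binomial(1, expit(theta[:, None] - b[None, :]))

# Probability of passing the screen rises with ability
# (attention checks, minimum reading times, comprehension gates).
p_pass = expit(1.5 * theta + 1.0)
retained = rng.binomial(1, p_pass).astype(bool)

def summarise(resp):
    total = resp.sum(axis=1)
    return resp.mean(axis=0).mean(), total.std()

full_p, full_sd = summarise(responses)
clean_p, clean_sd = summarise(responses[retained])

print(f"retained {retained.mean():.0%} of respondents")
print(f"mean item p-value: full sample {full_p:.2f} vs screened sample {clean_p:.2f}")
print(f"total-score SD:    full sample {full_sd:.2f} vs screened sample {clean_sd:.2f}")
```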
The problem is particularly acute for tests intended to be accessible. In many applied contexts, psychometricians are explicitly tasked with developing items that can be understood and attempted by individuals with limited literacy, limited formal education, or low confidence in abstract reasoning. Online panel data are ill suited to this purpose, not because participants lack demographic diversity, but because the sampling and filtering processes systematically remove those who would struggle most.
The over-cleanliness problem also interacts with the expertise-through-participation effect described in the previous section. Participants who have learned how to succeed in online studies are not only more likely to remain in the panel; they are also more likely to pass quality checks that rely on speed, accuracy, and compliance. In this way, procedural expertise is rewarded and reinforced, while genuine difficulty is treated as noise.

Traditional psychometric practice recognised the importance of observing error patterns as data in their own right. Early test developers often examined incorrect responses, hesitations, and partial solutions to understand how items were being interpreted. In modern online pipelines, these behaviours are increasingly filtered out before analysis. The result is a loss of information about how items fail, particularly among lower-ability examinees.
It is important to distinguish this argument from a general critique of data quality standards. Clean data are essential for many forms of analysis, and careless responding can invalidate results. The issue is not cleanliness per se, but misalignment between cleanliness and purpose. When the goal is to estimate how demanding an item will be for real users, excluding those who struggle most with the task undermines that goal.

The cumulative effect of cognitive selectivity, expertise-through-participation, and over-cleanliness is a development environment in which items are evaluated under conditions that systematically favour success. Items appear easier, more discriminating, and more orderly than they will be in practice. When such items are later deployed, their failure is often attributed to contextual factors rather than to the calibration process itself.
Recognising the over-cleanliness problem forces a reconsideration of how quality controls are used in psychometric development. It suggests that different stages of the pipeline require different tolerances for error, confusion, and slow responding. Early-stage prototyping may benefit from clean data, but difficulty calibration and accessibility assessment require exposure to the full range of cognitive behaviours present in the target population. The next section turns to a closely related but conceptually distinct issue: why item sources grounded in expert cognitive structures, such as curriculum-aligned or symplectic materials, systematically fail to capture the demands experienced by lower-ability examinees, even when they appear well designed in clean development samples.
Why Curriculum-Aligned and Symplectic Item Sources Fail Lower-Ability Examinees
Difficulties in modern item development do not arise solely from how items are sampled and calibrated. They also arise from where items come from. Curriculum-aligned repositories and LLM generative templating frameworks that preserve deep structure are often treated as principled sources of test material because they are systematic, traceable to educational standards, and internally coherent. These qualities are appealing, particularly in large-scale or automated development pipelines. Yet they can encode assumptions about cognition that disproportionately disadvantage lower-ability examinees, especially when reading load and constraint-tracking demands are not explicitly controlled.

The core problem is a mismatch between expert cognitive structure and novice cognitive experience. Curriculum frameworks are constructed by subject-matter experts who have internalised the conceptual organisation of their domain. They reflect how knowledge is ideally structured once learning has been successful. Symplectic approaches intensify this orientation by privileging structural invariance and formal relations across items. What both approaches tend to overlook is how individuals with limited mastery actually encounter and interpret tasks.
This distinction has long been recognised in cognitive psychology. Classic work by Chi, Feltovich, and Glaser (1981) showed that experts and novices perceive the same problems in fundamentally different ways. Experts organise problems according to deep structural principles; novices attend to surface features and struggle to identify what is relevant. When items are designed around expert representations, they implicitly assume the very skills that novices lack. As a result, formally simple items can impose substantial cognitive demands on less able examinees.
Psychometric theory provides a parallel insight. Embretson’s construct representation framework (Embretson, 1983) distinguishes between the formal properties of a task and the cognitive processes it elicits. An item may involve a single logical operation, yet require complex mental representations, sustained working memory, or sophisticated strategy selection. When item design focuses on formal structure alone, without modelling how examinees actually process the task, difficulty estimates become detached from lived cognitive experience.
Curriculum-aligned items often rely on implicit assumptions of shared meaning, fluent reading, and efficient integration of information. These assumptions are invisible to expert designers but consequential for examinees who lack confidence or experience in abstract problem solving. Large-scale studies of adult literacy and problem solving illustrate the scale of this gap. Surveys such as the National Adult Literacy Survey and later OECD skills assessments show that substantial proportions of adults struggle with tasks involving dense text, embedded conditions, and implicit inference, even when those tasks are nominally aligned with school curricula (Kirsch et al., 1993; OECD, 2013). These are precisely the features that curriculum-derived items tend to normalise.
Symplectic item generation adds a further layer of difficulty. By preserving formal relations across items, symplectic methods often compress surface variation while maintaining deep structural equivalence. This compression increases cognitive load. Cognitive load theory makes clear why this matters. Sweller (1988) demonstrated that problem-solving performance deteriorates when intrinsic and extraneous load exceed working-memory capacity, particularly for novices. Later work extended this insight to instructional and assessment contexts, showing that expert-designed tasks routinely underestimate the processing burden imposed on less skilled individuals (Sweller, Ayres, & Kalyuga, 2011).
In practical terms, this means that symplectically generated items can be formally elegant while being cognitively alien. They may require examinees to track multiple constraints, inhibit misleading cues, or recognise abstract equivalences that are obvious only to those with well-developed domain schemas. For experienced test-takers, such demands pose little difficulty. For lower-ability examinees, they lead to misinterpretation, overload, or disengagement.
The interaction with sampling environments is crucial. When curriculum-aligned or symplectic items are piloted on cognitively selective online panels, they appear to function well. Participants recognise the underlying structure, apply appropriate heuristics, and respond efficiently. Difficulty indices suggest accessibility, and discrimination statistics may even improve. These results are then taken as evidence that the items are suitable for broad use. When the same items are deployed operationally, the mismatch becomes apparent. Examinees misparse the stem, overlook conditions, or fail to grasp what the item is asking. Performance drops in ways that were not predicted by development data. These failures are often attributed to motivation or test anxiety. Yet validity theory suggests a more direct explanation. Messick’s unified view of validity emphasises that the consequences of testing are inseparable from the interpretation of scores (Messick, 1995). If an item systematically disadvantages certain groups because of unexamined cognitive demands, that is a validity problem, not an incidental side effect.
Item response theory does not insulate against this issue. As Embretson and Reise (2000) note, item parameters inherit the assumptions of the calibration sample. If items are calibrated on respondents who share expert-like cognitive habits, the resulting difficulty estimates encode those habits. When the test is later administered to populations who do not share them, misfit is inevitable. The problem lies not in the model, but in the representativeness of the cognitive processes on which it was estimated.

Traditional item-writing guidelines implicitly recognised these dangers. Early psychometricians stressed simplicity of language, minimal working-memory demands, and transparency of task requirements, particularly for tests aimed at lower-ability groups. These principles were grounded in observation of examinee behaviour rather than in abstract task analysis. As item development has become increasingly automated and theory-driven, this observational grounding has weakened.
The issue, then, is not that curriculum alignment or symplectic structure are inherently misguided. Both are valuable for ensuring content coverage and internal coherence. The problem arises when they are treated as sufficient conditions for cognitive appropriateness. Without explicit attention to how items are interpreted by less skilled readers and problem-solvers, these approaches systematically overshoot their intended audience.

This has direct implications for AI-assisted item generation. Large language models are trained predominantly on expert-produced texts and reproduce expert cognitive structures by default. When prompted to generate curriculum-aligned or formally structured items, they tend to amplify dense phrasing, implicit assumptions, and compressed reasoning. When such items are evaluated using cognitively selective samples, their inaccessibility remains hidden.
The result is a development pipeline in which expert-centric items are validated by expert-like respondents and then deployed to non-expert populations. The ensuing failures follow predictable patterns rooted in cognitive psychology and psychometric theory. Recognising these patterns requires shifting attention away from formal elegance and toward cognitive realism. The next section examines how these issues are intensified by large language models themselves, and why the interaction between AI-generated items and online panels creates a particularly misleading impression of item difficulty.
LLM-Generated Items and the Amplification of Expert Cognitive Structure
Large language models introduce a new source of distortion into item development, not because they depart from earlier patterns, but because they intensify them. When used to generate test items, LLMs systematically reproduce expert cognitive structures at scale. When these items are evaluated using cognitively selective online panels, the resulting evidence creates a particularly misleading impression of accessibility and difficulty. This outcome follows directly from how large language models are trained. LLMs learn from vast corpora dominated by expert-produced texts: textbooks, instructional explanations, formal worked examples, and curated problem sets. As a result, they internalise not only domain content but the representational habits of experts. When prompted to generate assessment items, especially those aligned with curricula or formal logical structures, they naturally produce tasks that reflect how competent humans think about problems once understanding is already established.
The cognitive implications of this have been understood for decades. Classic work on expert–novice differences demonstrated that experts organise problems around deep structural principles, while novices attend to surface features and struggle to identify what is relevant (Chi, Feltovich, & Glaser, 1981). LLM-generated items mirror expert representations in precisely this sense. They compress information, rely on implicit assumptions, and expect examinees to recognise abstract relations that are obvious only to those with well-developed schemas. From a psychometric perspective, this matters because formal simplicity does not guarantee cognitive simplicity. Embretson’s construct representation framework makes clear that item difficulty is determined by the cognitive processes a task elicits, not by its logical description (Embretson, 1983). Many LLM-generated items involve a small number of formal operations but require examinees to integrate information across clauses, track multiple constraints, or infer unstated relationships. These demands are invisible to experts and to models trained on expert discourse, but they dominate performance among lower-ability examinees.
Cognitive load theory explains why such items disproportionately disadvantage less skilled respondents. Sweller’s work showed that novices are especially sensitive to tasks that impose high intrinsic or extraneous load, and that formats efficient for experts can overwhelm those with limited working-memory resources (Sweller, 1988). When LLMs compress reasoning or omit surface cues in the interest of elegance or brevity, they increase cognitive load in ways that are rarely detected during development. In principle, these issues could be identified through careful piloting. In practice, they are masked by the evaluation environment. Online panels are populated by participants who are unusually well equipped to cope with dense phrasing, implicit structure, and abstract reasoning. As a result, LLM-generated items that will later prove confusing or inaccessible appear to function smoothly. Difficulty indices suggest that the items are easy, and discrimination statistics appear satisfactory.
Recent work on LLMs as tools for survey research and synthetic data generation illustrates a closely related phenomenon. Studies examining algorithmic fidelity show that LLM outputs can act as “silicon samples”, that is, simulated subpopulations which, while approximating aggregate human patterns, can introduce systematic distortions and mask subgroup differences (Argyle et al., 2023). Although this work focuses on LLMs as respondents rather than item generators, the underlying lesson applies directly: when models and evaluation samples share the same cognitive biases, distortions reinforce rather than cancel.

Psychometric modelling does not eliminate this problem. Item response theory can estimate parameters for LLM-generated items, but those parameters are conditional on the calibration sample. As Embretson and Reise (2000) emphasised, item parameters reflect task–person interactions within a specific population. When both items and respondents privilege expert-like cognition, the resulting difficulty estimates encode a narrow and misleading view of accessibility.
This helps explain a pattern that has been reported by some practitioners. Items generated or refined using AI appear well calibrated during development but fail in operational settings. Examinees misunderstand what is being asked, abandon items mid-way, or resort to guessing strategies that were rarely observed during piloting. These outcomes are not anomalies. They are predictable consequences of calibrating cognitively dense items on cognitively adept samples.

Importantly, this is not an argument against the use of large language models in psychometrics. LLMs are valuable tools for early-stage prototyping, dimensional exploration, and stress-testing of constructs. But final calibration should continue to rely on human data drawn from cognitively appropriate samples. At present, LLM-based simulation can approximate some aggregate patterns but still exhibits systematic distortions and subgroup mismatches (Argyle et al., 2023). Problems arise when this distinction is ignored and AI-generated items are treated as ready for deployment on the basis of online-panel evidence alone.
Validity theory provides the appropriate interpretive frame. Messick’s emphasis on the consequences of testing reminds us that underestimating difficulty is not a technical inconvenience but a threat to the validity of inferences drawn from scores (Messick, 1995). When AI-generated items systematically disadvantage lower-ability examinees because their cognitive demands were never observed during development, the resulting scores misrepresent the very abilities they are intended to measure.
The interaction between LLM-generated items and online panels thus creates a closed cognitive loop. Expert-trained models generate expert-coded items. Expert-adapted respondents validate them. The resulting evidence suggests accessibility and fairness. Only when the test encounters real examinees does the loop break.

Breaking this loop requires explicit attention to cognitive realism. LLMs must be constrained not only by content specifications but by models of how tasks are actually interpreted by less skilled readers and problem-solvers. Evaluation samples must include naïve and less practised respondents, even when this reduces apparent data quality. Misunderstanding, slow responding, and partial failure must be treated as diagnostic information rather than noise. The next section steps back from specific mechanisms to ask a broader question: why a field with such a long history of attending to these issues was slow to recognise the closed loop created by modern item sources, online panels, and AI-assisted generation.
Why the Field Missed This
Given the longevity of the cognitive and psychometric principles invoked in the preceding sections, it is reasonable to ask why the closed loop created by expert-coded items, cognitively selective samples, and AI-assisted generation went largely unremarked for so long. The answer does not lie in ignorance of theory, but in a convergence of institutional habits, training norms, and technological incentives that gradually displaced older forms of methodological vigilance.
One contributing factor is the success of online testing itself. As psychometrics moved from paper-and-pencil administration to digital delivery, online panels solved a series of genuine problems. They reduced costs, accelerated development cycles, increased sample sizes, and enabled rapid iteration. These gains were real and professionally rewarding. Over time, however, convenience began to substitute for representativeness. Practices adopted initially as pragmatic compromises hardened into defaults, and the assumptions embedded in those defaults were rarely revisited.
A parallel shift occurred in training. Contemporary psychometric education places strong emphasis on modelling, estimation, and statistical diagnostics. Students are taught to interrogate reliability, discrimination, dimensionality, and model fit with considerable sophistication. Much less attention is paid to observing how real examinees struggle with items, misinterpret instructions, or abandon tasks. Earlier generations of psychometricians encountered these behaviours directly through field testing and face-to-face administration, an approach that treated validation as inseparable from observation of test use (Cronbach, 1971). As those experiences receded from curricula, so too did the intuition that difficulty is something lived before it is modelled.
The organisation of modern test development further reinforced this drift. Item writing, piloting, modelling, and deployment are now frequently handled by different teams, sometimes separated by institutional or commercial boundaries. Item sources are justified by curricular alignment, models are validated statistically, and deployment failures are attributed to context or implementation. No single stage takes responsibility for cognitive realism across the full pipeline. Mislevy’s work on evidence-centred design anticipated this risk, warning that inferences about ability weaken when cognitive interpretation is divorced from modelling and use (Mislevy, 1994). In practice, that separation has become routine.

AI intensified these tendencies by accelerating every stage it touched. Large language models made it possible to generate large numbers of formally coherent items quickly, reinforcing the appeal of automated pipelines. Online panels provided equally rapid validation. The speed and apparent robustness of this process reduced opportunities for reflective pause. When difficulties emerged later, they were often interpreted as anomalies rather than as signals of systematic misalignment.
There is also a sociological dimension. Online panels and AI tools carry a strong association with technical progress and methodological sophistication. Questioning their adequacy can be read, unfairly, as resistance to innovation rather than as concern for validity. In such an environment, arguments grounded in cognitive psychology or observational experience may be discounted as anecdotal, even when they rest on well-established theory. Similar dynamics have been observed more broadly in methodological research, where incentive structures reward speed, scale, and novelty over careful examination of assumptions.
Finally, success masked the problem. For higher-ability, highly motivated, and test-savvy populations, many AI-generated and curriculum-aligned items work well enough. Scores discriminate, reliability is acceptable, and users report few difficulties. It is only when tests are deployed in heterogeneous or lower-ability populations that failures become visible. By then, the development evidence appears to contradict the complaints, and the weight of quantitative data favours the original design decisions.

Seen in this light, the field did not so much overlook the problem as lack a vantage point from which it could be seen clearly. Sampling biases, item-source assumptions, and modelling practices each contributed a small distortion. Together, they formed a self-reinforcing system that appeared methodologically sound while drifting away from cognitive realism.
Recognising this history reframes the task ahead. The issue is not to assign blame or to abandon useful tools, but to restore a balance that psychometrics once maintained more explicitly. That balance involves treating difficulty as an empirical property of human cognition under realistic conditions, not merely as a parameter estimated under ideal ones. The final section of this chapter turns from diagnosis to prescription. It sets out practical principles for reintroducing cognitive realism into modern, AI-assisted psychometric development without sacrificing the efficiencies that new technologies provide.
Restoring Cognitive Realism in AI-Assisted Psychometric Development
The preceding sections have described a closed loop in contemporary psychometric practice: expert-coded item sources, cognitively selective samples, and AI-assisted generation reinforce one another in ways that systematically underestimate difficulty for real examinees. The purpose of this final section is to set out practical principles for breaking that loop without discarding the efficiencies and genuine advances that modern tools provide.
The central corrective is straightforward in concept, though demanding in practice. Psychometric development should again distinguish clearly between stages that benefit from clean, expert-like data and stages that require exposure to the full range of cognitive behaviours present in the target population. Confusion, slow responding, misinterpretation, and partial failure need to be treated as evidence about item functioning rather than as defects to be eliminated.
One necessary principle concerns the separation of developmental stages. Online panels and AI-generated items are highly effective during early phases of test construction. They support rapid prototyping, exploration of dimensional structure, identification of gross item flaws, and stress-testing of constructs under favourable conditions. What they cannot reliably establish is how demanding an item will be for less practised, less confident, or less literate examinees. Difficulty calibration, norm development, and cut-score setting typically require samples that are cognitively representative of the population for whom consequential decisions will be made.
A second principle concerns cognitive realism as a design constraint. Item specifications should address not only content coverage and formal structure, but also reading load, working-memory demands, and interpretive transparency. For AI-generated items, this entails constraining prompts and selection criteria so that surface clarity is valued alongside structural coherence. Items should be evaluated in terms of how they are likely to be parsed by novice readers, not solely by whether they satisfy formal or curricular criteria.
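One way to make such constraints operational is to screen candidate stems, including AI-generated ones, against simple surface-load heuristics before they reach piloting. The sketch below (Python) is illustrative only: the metrics, word lists, and thresholds are assumptions standing in for whatever reading-load and working-memory criteria a development team adopts and validates for its own population.

```python
# Illustrative surface-load screen for candidate item stems. The metrics and
# thresholds are assumptions standing in for a team's own accessibility criteria.
import re

CONSTRAINT_MARKERS = {"if", "unless", "except", "only", "provided", "whereas",
                      "respectively", "neither", "either", "not"}

def surface_load(stem: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", stem) if s.strip()]
    words = re.findall(r"[A-Za-z']+", stem)
    long_words = [w for w in words if len(w) >= 9]
    markers = [w for w in words if w.lower() in CONSTRAINT_MARKERS]
    return {
        "words": len(words),
        "mean_sentence_length": len(words) / max(len(sentences), 1),
        "long_word_share": len(long_words) / max(len(words), 1),
        "constraint_markers": len(markers),
    }

def flag_item(stem: str, max_sentence_len=20, max_long_share=0.20,
              max_constraints=3) -> list:
    m = surface_load(stem)
    reasons = []
    if m["mean_sentence_length"] > max_sentence_len:
        reasons.append("long sentences")
    if m["long_word_share"] > max_long_share:
        reasons.append("dense vocabulary")
    if m["constraint_markers"] > max_constraints:
        reasons.append("many embedded conditions")
    return reasons  # empty list = passes the surface screen

stem = ("Unless the second shipment arrives before the first has been unloaded, "
        "and provided that neither warehouse is already at capacity, which option "
        "minimises the total handling cost, respectively, for each depot?")
print(flag_item(stem))   # -> ['long sentences', 'many embedded conditions']
```

Such a screen does not replace piloting with cognitively realistic samples; it simply prevents the most obviously overloaded items from consuming scarce field-testing capacity.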
A third principle involves how sampling diversity is understood. Demographic heterogeneity alone is insufficient. Developers should seek exposure to a diversity of cognitive experience, particularly including individuals who are naïve to testing, uncertain in abstract reasoning, or prone to misinterpretation under cognitive load. Where large field samples are impractical, partnerships with schools, training programmes, apprenticeships, or entry-level workplaces can provide small but revealing windows into real examinee behaviour. The aim is early detection of cognitive misalignment, not statistical completeness.
A fourth principle concerns tolerance for error at appropriate stages. Quality controls must be aligned with developmental purpose. While it is reasonable to exclude random or disengaged responding during early modelling, later stages should retain respondents who struggle, read slowly, or misunderstand instructions. Patterns of error, hesitation, nonresponse, and abandonment are diagnostically valuable. Removing them in the name of data cleanliness deprives developers of precisely the information needed to assess accessibility.
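Operationally, this means retaining struggling respondents in the dataset with an explicit flag rather than excluding them, and reporting item statistics for flagged and unflagged groups side by side. The sketch below (Python/pandas) shows one minimal way to do this; the column names and screening rules are assumptions, not a prescribed standard.

```python
# Sketch: retain struggling respondents with an explicit flag and report item
# statistics by group. Column names and screening rules are assumptions.
import pandas as pd

def flag_strugglers(sessions: pd.DataFrame) -> pd.DataFrame:
    """sessions: one row per respondent with columns respondent_id,
    failed_attention_checks, median_rt_seconds, items_omitted, abandoned (bool)."""
    slow = sessions["median_rt_seconds"] > sessions["median_rt_seconds"].quantile(0.90)
    flagged = (
        (sessions["failed_attention_checks"] > 0)
        | slow
        | (sessions["items_omitted"] > 2)
        | sessions["abandoned"]
    )
    return pd.DataFrame({"respondent_id": sessions["respondent_id"], "struggler": flagged})

def item_stats_by_group(responses: pd.DataFrame, flags: pd.DataFrame) -> pd.DataFrame:
    """responses: one row per (respondent_id, item_id) with column correct
    (1/0, NaN = omitted). Returns p-values and omission rates for each group."""
    merged = responses.merge(flags, on="respondent_id")
    grouped = merged.groupby(["item_id", "struggler"])["correct"]
    return pd.DataFrame({
        "p_value": grouped.mean(),                          # omissions excluded from mean
        "omission_rate": grouped.apply(lambda s: s.isna().mean()),
    }).unstack("struggler")
```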
A fifth principle relates to the role assigned to large language models. LLMs are best understood as generators of hypotheses and design variants rather than as arbiters of readiness. Their value lies in speed, breadth, and the ability to explore item spaces that would be impractical to construct manually. Judgements about difficulty, fairness, and suitability for use should remain grounded in human data drawn from cognitively appropriate samples. This distinction should be formalised in development protocols rather than left to informal judgement.
Finally, there are implications for training. Psychometric education must reconnect modelling expertise with cognitive observation. Students should be taught not only how to evaluate reliability, discrimination, and fit, but how to recognise when those statistics are being produced under unrealistically favourable conditions. Direct exposure to real examinee behaviour, including confusion and failure, should again be treated as part of methodological competence rather than as an inconvenience to be engineered away.
None of these recommendations require abandoning online panels, curriculum alignment, symplectic structure, or AI-assisted generation. They require placing those tools back within a framework that recognises the limits of their assumptions. The aim is not to slow innovation, but to prevent efficiency and formal elegance from standing in for validity.
The broader implication of this chapter is that psychometrics faces a familiar choice when powerful new technologies reshape its methods. It can allow convenience, scale, and formal sophistication to define practice by default, or it can deliberately reassert the primacy of the human cognition it seeks to measure. Restoring cognitive realism is not a retreat from modernity. It is the condition under which modern psychometrics can continue to justify the inferences it draws.
References
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C. F., & Wingate, D. (2023). Out of one, many: Using large language models to simulate human samples. Political Analysis, 31(3), 337–351.
(Introduces algorithmic fidelity and shows how LLM outputs can look valid while masking systematic distortions.)
Chi, M. T. H., Feltovich, P. J., & Glaser, R. (1981). Categorization and representation of physics problems by experts and novices. Cognitive Science, 5(2), 121–152.
(A classic demonstration that experts and novices perceive the same problem differently, with experts organising by deep structure and novices by surface features. Foundational for understanding why LLM-generated items privilege expert cognition.)
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). American Council on Education.
(Represents an earlier tradition in which validation was grounded in observation of test use and consequences.)
Douglas, B. D., Ewell, P. J., & Brauer, M. (2023). Data quality in online human-subjects research: Comparisons between MTurk, Prolific, CloudResearch, Qualtrics, and SONA. PLOS ONE, 18(3), e0279720.
(Comparative analysis showing Prolific’s superior data quality, instruction-following, and compliance relative to other platforms, alongside persistent sample biases indicative of participant expertise. Also documents the effectiveness of Prolific’s quality controls, which are consistent with homogeneity in retained samples.)
Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197.
(This is the foundational source for the distinction between formal task structure and the cognitive processes examinees actually use. Embretson’s construct representation framework underpins the argument that items can be formally simple yet cognitively demanding, and that item validity depends on modelling how tasks are mentally represented by examinees, not just how they are logically defined.)
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum.
(Provides the psychometric backdrop for understanding why unmodelled cognitive demands distort item parameters. The book makes clear that IRT parameters inherit the assumptions of the calibration sample, which directly supports the claim that symplectic or curriculum-aligned items calibrated on expert-like samples will misestimate difficulty for novice populations.)
Kirsch, I. S., Jungeblut, A., Jenkins, L., & Kolstad, A. (1993). Adult literacy in America. National Center for Education Statistics.
(A landmark study showing that real-world populations struggle with tasks involving dense text, embedded conditions, and implicit assumptions. It provides strong external grounding for the claim that many “simple” items are cognitively inaccessible to lower-ability examinees despite being curriculum-aligned.)
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances. American Psychologist, 50(9), 741–749.
(Provides the broader validity framework within which this critique sits. Messick’s emphasis on consequences of testing supports the argument that item sources must be evaluated for how they function in real populations, not just for internal coherence.)
Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika, 59(4), 439–483.
(Links psychometric modelling to cognitive interpretation, reinforcing the idea that item performance data are evidence about cognition only insofar as the task representation is appropriate for the examinees.)
OECD (2013). OECD Skills Outlook 2013: First results from the Survey of Adult Skills. OECD Publishing.
(Reinforces the literacy and problem-solving difficulties observed in applied populations, especially when tasks require integration across multiple pieces of information. This directly contrasts with the performance seen in online panels and explains why curriculum-aligned items often fail operationally.)
Palan, S., & Schitter, C. (2018). Prolific.ac—A subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17, 22–27.
(Foundational description of Prolific’s participant pool and its advantages and limitations.)
Rodd, J. M. (2024). Moving experimental psychology online: How to obtain high quality data when we can’t see our participants. Journal of Memory and Language, 134, 104472.
(Discusses the trade-off between data quality and ecological validity in online cognitive testing.)
Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285.
(Introduces cognitive load theory, which explains why items that compress multiple relations or rely on implicit structure disproportionately disadvantage less able examinees. This supports the critique of symplectic item design, where formal elegance often increases intrinsic and extraneous load.)
Sweller, J., Ayres, P., & Kalyuga, S. (2011). Cognitive load theory. Springer.
(Extends cognitive load theory to instructional and assessment contexts, reinforcing the argument that expert-designed tasks routinely underestimate the processing burden imposed on novices.)
Tang, J., Birrell, E., & Lerner, A. (2022). How well do my results generalize now? The external validity of online privacy and security surveys. In Proceedings of the Eighteenth Symposium on Usable Privacy and Security (SOUPS 2022).
(Empirical demonstration of systematic divergence between online panels and target populations, particularly for cognitively demanding tasks.)
Uittenhove, K., & Vergauwe, E. (2023). From lab-testing to web-testing in cognitive research: Who you test is more important than how you test. Journal of Cognition, 6(1), 13.
(Emphasises that participant exclusion and task demands interact to shape observed performance. Explicitly argues that participant characteristics critically shape cognitive task performance online. Although focused on testing mode, this paper supports the broader claim that expert-like samples systematically misrepresent task difficulty for broader populations.)
© John Rust (December 2025). All Rights Reserved.