AI Statistical Inference (ASI)
Is this the future of Psychometrics?
If high-quality human data becomes scarce, polluted, or ethically gated, under what conditions could AI Statistical Inference (ASI) – LLM-based priors plus synthetic samples – actually outperform classical, human-sample-only psychometrics? And what would that world look like for test development and use?
DeepMind/Google’s current work on Generative Data Refinement (GDR) explicitly targets the looming shortage of clean human training data by using models to “refine” or regenerate large corpora while preserving informational content and avoiding model collapse. The field is already quietly assuming that synthetic or model-based data will become central, not peripheral. But let’s take the question one step further: what if human data becomes so rare and second-class that it is no longer viable?
What we already know
We’re no longer speculating about whether LLMs can act as “synthetic humans”:
- Argyle et al. (2023) show that GPT-3–based “silicon samples” can approximate opinion distributions of real subpopulations when conditioned on demographic backstories. They introduce the idea of algorithmic fidelity: does a model reproduce not just means but correlation structures similar to real data? (Cambridge University Press & Assessment)
- Multiple follow-ups (e.g. public-opinion simulations, policy-preference surveys) show that LLM-generated samples often match aggregate survey patterns, but with systematic distortions: reduced variance, more positivity, and item-level heterogeneity. (ACM Digital Library)
- Pellert et al. (2024) treat LLMs as objects of psychometrics: administering inventories to models and analysing factor structure, reliability, and validity, essentially building a bridge between classical psychometrics and LLM behaviour. (PubMed; SAGE Journals)
- Google DeepMind researchers have found ways to generate large corpora while preserving informational content and avoiding model collapse. These corpora – the books, websites, articles, and code used to train, test, and analyse large language models – are the raw material from which LLMs learn patterns, grammar, and information about the world. (Business Insider; arXiv)
When could ASI actually beat human-sample psychometrics?
Here’s the key move: treat both approaches as noisy measurement channels of a true latent population distribution.
Let:
- P* = the “true” joint distribution of item responses in a target population
- DH = a finite human dataset (N respondents) sampled from that population
- M = a trained LLM, which defines an approximate distribution PM over responses given prompts/personas
Classical psychometrics estimates item parameters and norms from DH. ASI estimates them from samples drawn from PM (possibly conditioned on pseudo-demographics).
In Bayesian terms, ASI is roughly the move: assume PM is an informative prior over P*, and use a tiny DH (if any) as data.
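To make that framing concrete, here is a toy case (my own simplification, not a formalism from the sources above): estimating a single item parameter θ with roughly normal errors, where the model-implied value acts as a prior and the small human sample pulls it towards the data.

```latex
% Toy precision-weighted combination of a model prior and a small human sample.
% theta_M : value implied by sampling from PM, with assumed prior variance tau^2
% xbar_H  : mean of the n human responses in DH, with response variance sigma^2
\hat{\theta}_{\mathrm{ASI}}
  = \frac{\tau^{-2}\,\theta_M + n\,\sigma^{-2}\,\bar{x}_H}
         {\tau^{-2} + n\,\sigma^{-2}}
% As tau^2 grows (an untrustworthy model) this tends to the classical
% human-only estimate; as n shrinks to zero it becomes fully synthetic.
```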
So in what sense could PM be “better” than DH?
1. Sample-efficiency regime
Human data:
- Limited N
- Biased sampling frame: WEIRD (Western, educated, industrialised, rich, democratic) respondents, panel fatigue, bots
- Expensive and slow to expand
Model priors:
- Effectively infinite N (you can sample as much as you like)
- Encodes a mixture of many historical populations
- Cheap to resample, but biased in its own way
ASI wins when:
- Effective sample size: the bias of PM is smaller than the combined sampling error plus bias of DH.
- Structural fidelity: the model reproduces covariance structures (factor loadings, correlations) more stably than small DH. That’s what Argyle et al. call algorithmic fidelity (Cambridge University Press & Assessment).
In that regime, a psychometrician who ignores PM is willingly throwing away information.
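A minimal simulation sketch of that trade-off, with every number an illustrative assumption rather than an empirical value: a small unbiased human sample versus an effectively unlimited but slightly biased (and variance-collapsed) synthetic sample, compared on root-mean-square error for a single item mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# All numbers here are illustrative assumptions, not estimates from any study.
true_mean = 3.2           # "true" population mean of a 1-5 Likert item
item_sd = 1.1             # response standard deviation in the population
human_n = 150             # small but unbiased human sample
synth_n = 20_000          # effectively unlimited synthetic sample...
synth_bias = 0.10         # ...with a fixed systematic shift
synth_sd = 0.8 * item_sd  # and some variance collapse
n_reps = 2_000

human_err, synth_err = [], []
for _ in range(n_reps):
    human = rng.normal(true_mean, item_sd, size=human_n)
    synth = rng.normal(true_mean + synth_bias, synth_sd, size=synth_n)
    human_err.append(human.mean() - true_mean)
    synth_err.append(synth.mean() - true_mean)

print("human RMSE :", round(float(np.sqrt(np.mean(np.square(human_err)))), 3))
print("synth RMSE :", round(float(np.sqrt(np.mean(np.square(synth_err)))), 3))
# With these assumptions the two channels sit near the crossover point:
# human sampling error (~ item_sd / sqrt(human_n) ≈ 0.09) vs. synthetic bias (0.10).
```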
2. Domain regime
Where could this hold for bona fide psychometric traits?
- Broad, culturally ubiquitous constructs with huge textual footprint: e.g. extraversion/introversion, anxiety, political cynicism, techno-optimism.
- Traits closely reflected in ordinary language: attitudes, preferences, stances, values.
- Much less so for tightly controlled, non-verbal, stimulus-driven constructs: e.g. spatial rotation, working-memory span, speeded cancellation tasks. Here, the mapping from text corpora to response distributions is much weaker.
So ASI will first outcompete human-sample methods in attitudinal and contextual scales long before it can emulate fine-grained ability tests. However, even for ability tests there are alternative approaches. Criterion-based standardization, for example, can generate Bayesian priors for newly created items from the estimated achievement levels (means and standard deviations) of children aged 5 to 16, as predicted by the syllabus for each age in their country's national curriculum.
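A sketch of how such criterion-based seeding might look, with entirely hypothetical numbers: the syllabus-implied mean and standard deviation for each age band define a normal prior on an item set's difficulty metric, which a small human sample then updates.

```python
import numpy as np

# Hypothetical curriculum-implied achievement levels (illustrative only):
# mean and SD of an ability scale per age in years, as read off the syllabus.
curriculum_prior = {
    age: {"mean": 80.0 + 8.0 * (age - 5), "sd": 12.0}
    for age in range(5, 17)
}

def posterior_mean(age, human_scores, obs_sd=12.0):
    """Normal-normal update: the curriculum prior is shrunk towards observed scores."""
    prior = curriculum_prior[age]
    n = len(human_scores)
    prior_prec = 1.0 / prior["sd"] ** 2
    data_prec = n / obs_sd ** 2
    return (prior_prec * prior["mean"] + data_prec * np.mean(human_scores)) / (
        prior_prec + data_prec
    )

# Example: twelve ten-year-olds sit the new item set
scores = np.random.default_rng(1).normal(125.0, 12.0, size=12)
print(posterior_mean(10, scores))
```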
What if human data is scarce or “polluted”?
Now let's take the counterfactual seriously: human data is rare, expensive, polluted by bots and participation bias, or legally restricted (privacy, data protection, political regulation).
We already see early signals of this:
- Studies detecting AI bots passing as humans in online surveys; panel contamination is no longer hypothetical. (The Times)
- Growing ethical and legal restrictions around psychological data collection, especially for minors, health, or political attitudes.
In that world, the cost of obtaining a clean, representative human norm sample DH skyrockets. Two consequences follow:
- Norms become a luxury good. Only large institutions (state, Big Tech, defence) can afford to collect them.
- Synthetic priors become epistemically attractive, not just economically convenient.

The rational move for a psychometrician is then to:
- Use PM to generate a high-quality synthetic “proto-norm”, including plausible demographic slices.
- Use tiny, carefully audited human subsamples as sanity checks, outlier detectors, and fairness audits—not as the main estimation engine.
In other words, even if human data doesn’t vanish, it stops being the primary driver of the model.
What fails if we try to “go fully synthetic”?
This is where the philosophers will (rightly) push. Three failure modes are already visible in the literature:
- Variance collapse – synthetic samples often under-represent extreme or rare patterns compared to real data (ACM Digital Library). For high-stakes testing (clinical risk, rare pathologies, tail-risk safety roles) this is fatal.
- Invariance illusions – a model trained mostly on Anglo-American data may give you a beautifully stable factor structure that silently fails on marginalised groups or other cultures.
- Feedback / model collapse – if you train tomorrow’s models on data generated by yesterday’s models, you risk a form of informational inbreeding. DeepMind’s GDR work is explicitly a response to this threat (Business Insider, arXiv).
So a purely synthetic psychometrics, with no ties back to actual people, is epistemically fragile. But that’s now. What does the future hold?
Can we reach a world where almost all new tests start ASI-first, and human data is deployed selectively rather than universally?
This, I think, is not just plausible but likely.
What does test development look like in that ASI-first world?
Let's sketch a development pipeline:
1. Latent trait specification
Human theorists articulate a construct (e.g. “Attitudes Toward AI Emergence”) with a conceptual map: sub-facets, hypothesised oppositions, expected correlates.
2. ASI item generation + parameter seeding
An LLM, prompted with this map and constraints, generates thousands of candidate items. The same or another model is used as a psychometric oracle: simulate responses for many virtual personas to estimate difficulty, discrimination, local dependence, and factor structure.
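A minimal sketch of that oracle step. The helper `ask_persona` is hypothetical (a stand-in for the actual LLM call, stubbed with random responses so the example runs), and the item statistics are deliberately simple: mean endorsement as difficulty and corrected item-total correlation as discrimination.

```python
import numpy as np

rng = np.random.default_rng(2)

def ask_persona(persona: dict, item: str) -> int:
    """Hypothetical stand-in for an LLM call returning a 1-5 Likert response.
    In practice this would prompt the model with the persona backstory plus
    the item text; here it is a random stub so the sketch runs."""
    return int(rng.integers(1, 6))

personas = [{"age": a, "region": r}
            for a in (18, 25, 35, 50, 60)
            for r in ("north", "south", "urban", "rural", "coastal")]
items = [f"item_{k}" for k in range(8)]   # candidate items from the generation step

# Simulated response matrix: personas x items
responses = np.array([[ask_persona(p, it) for it in items] for p in personas], float)

total = responses.sum(axis=1)
for j, it in enumerate(items):
    difficulty = responses[:, j].mean()                        # mean endorsement
    rest = total - responses[:, j]                             # corrected total score
    discrimination = np.corrcoef(responses[:, j], rest)[0, 1]  # item-rest correlation
    print(f"{it}: difficulty={difficulty:.2f}  discrimination={discrimination:.2f}")
```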
3. Synthetic population modelling
You sample a large synthetic population from PM, stratified by pseudo-demographics (age bands, regions, occupations) and generate a full proto-norm table plus expected correlations with other constructs.
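A sketch of the proto-norm step, assuming the synthetic responses have already been scored into totals; the stratification variables, cell labels, and score distribution are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Placeholder synthetic scores: in practice these come from scoring samples drawn from PM
n = 20_000
synthetic = pd.DataFrame({
    "age_band": rng.choice(["18-29", "30-44", "45-64", "65+"], size=n),
    "region":   rng.choice(["north", "south"], size=n),
    "score":    rng.normal(50, 10, size=n),
})

# Proto-norm table: score percentiles within each pseudo-demographic cell
proto_norms = (
    synthetic
    .groupby(["age_band", "region"])["score"]
    .quantile([0.05, 0.25, 0.50, 0.75, 0.95])
    .unstack()
    .round(1)
)
print(proto_norms)
```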
4. Micro-sample human anchoring
Instead of 1,000–10,000 participants, you collect, say, 100–200 highly curated respondents from the actual target population(s). You use them not to define the whole structure, but to:
- check measurement invariance,
- detect systematic discrepancies,
- re-scale the score metric,
- set cut-scores anchored in real-world outcomes where available.
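A sketch of the re-scaling step under the simplest possible assumption, a linear (mean-and-SD) linking from the synthetic proto-norm metric onto the human anchor sample's metric; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative data: large synthetic score distribution vs. a ~150-person anchor
synthetic_scores = rng.normal(50.0, 8.0, size=50_000)   # proto-norm metric
human_anchor     = rng.normal(47.0, 11.0, size=150)     # curated human micro-sample

# Linear linking: map the synthetic metric onto the human anchor's mean and SD
slope = human_anchor.std(ddof=1) / synthetic_scores.std(ddof=1)
intercept = human_anchor.mean() - slope * synthetic_scores.mean()

def rescale(score: float) -> float:
    """Re-express a synthetic-metric score on the human-anchored metric."""
    return slope * score + intercept

print(rescale(50.0))   # a synthetic-average respondent on the anchored scale
```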
5. Ongoing Bayesian updating
As real data trickles in from actual use, you update parameter estimates with the ASI priors as regularisers.
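One simple way to cash out "priors as regularisers", shown here for an item endorsement rate with a Beta-Binomial update; the prior rate, the decision to down-weight the synthetic sample to an effective prior size of 200, and the incoming counts are all illustrative assumptions.

```python
# ASI prior for an item's endorsement rate, derived from the synthetic sample:
# e.g. 62% endorsement among synthetic respondents, deliberately down-weighted
# to an effective prior sample size of 200 so that real data can move it.
prior_rate, prior_strength = 0.62, 200          # illustrative assumptions
alpha, beta = prior_rate * prior_strength, (1 - prior_rate) * prior_strength

def update(alpha: float, beta: float, endorsed: int, total: int) -> tuple[float, float]:
    """Beta-Binomial update: the ASI prior acts as a regulariser on live data."""
    return alpha + endorsed, beta + (total - endorsed)

# Real responses trickle in from operational use of the test
for endorsed, total in [(9, 20), (31, 50), (55, 100)]:
    alpha, beta = update(alpha, beta, endorsed, total)
    print(f"posterior endorsement rate ≈ {alpha / (alpha + beta):.3f}")
```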
Your Attitudes Toward AI Emergence instrument is almost tailor-made to be an exemplar of this logic.
Why philosophers should care
There are at least three philosophically legible claims:
1. ASI doesn’t abolish reference to people; it changes where reference lives.
In classical psychometrics, reference is local: this cohort of respondents, at this time, defines the norms. In ASI-first psychometrics, reference is partly historical and distributed: the model encodes many cohorts across time and space. Human samples become correction points rather than the whole map.
The philosophical question: is that still “measurement of humans”? Or are we measuring the latent structure of a socio-technical system that includes the model itself?
2. The epistemic unit shifts from “sample” to “training history”.
A single ASI-based test implicitly relies on the entire training pipeline of the model (data selection, fine-tuning, safety filters). Validity arguments must therefore include not just sample properties, but model provenance.
3. Agency and fairness become multi-layered.
Bias is no longer only in the sampling frame; it’s in the model, the prompts, the “virtual personas” used to simulate responses, and the tiny human anchor sample. This raises new questions about responsibility: who owns the error when a synthetic prior mischaracterises a real group?
From a philosophical perspective, psychometrics becomes a paradigm example of AI-mediated social epistemology, not a niche technical application.
So here is the research agenda
In a world of scarce and contaminated human data, the most epistemically defensible psychometric systems will be ASI-first and human-anchored: they use large-model priors to define and stabilise latent structures, and deploy small, carefully governed human datasets to audit, warp, and re-anchor these structures to lived reality. The central scientific question is not whether ASI can “replace” human data, but how far we can push this asymmetry before psychological meaning and fairness break down.
Other References
- Studies detecting AI bots passing as humans in online surveys; panel contamination is no longer hypothetical (The Times).
- Variance collapse: synthetic samples often under-represent extreme or rare patterns compared to real data (ACM Digital Library).
- Feedback / model collapse: training tomorrow’s models on data generated by yesterday’s models risks a form of informational inbreeding; DeepMind’s GDR work is explicitly a response to this threat (Business Insider; arXiv).