"We're about to buy an AI tool for eligibility determinations. We need independent evidence it works before we sign."
→ AI Evidence Framework & Roadmap
We help agencies, companies, and organizations identify promising AI use cases, evaluate vendors on real workflows, and measure whether tools improve outcomes before they are scaled.
Built by some of the world's leading researchers in AI evaluation and A/B testing from MIT, University of Chicago, UT Austin, Princeton, and other top research universities.
Demos and benchmarks aren't enough.
Sources: METR randomized study of experienced developers (2025); federal AI use case inventory; agent-benchmark survey of 1,300+ papers; Michigan MiDAS unemployment fraud-detection system.
Why it matters
AI adoption is outpacing the evidence needed to guide it. Federal AI contracting jumped 1,200% to $4.5 billion in 2024, but only 10% of federal agencies have a comprehensive AI governance framework in place. New regulations in the U.S. and globally are starting to require evidence ahead of procurement.
Demos and self-reported benchmarks show potential, but they rarely predict real-world impact. Wherever you are in the process, TrialAI helps you answer one question: does this tool actually improve outcomes for the people you serve? We measure what AI tools do in your workflows so you can scale what works, fix what doesn't, and walk away from what shouldn't be deployed at all.
What we do
Not every decision needs a randomized trial. We match the strength of evidence to the size of the decision: from a procurement roadmap to a full embedded evaluation. The three services work independently or together as a single pipeline to full-scale deployment.
AI Evidence Framework & Roadmap
We provide training and frameworks for evaluating AI claims: what benchmarks measure, what they can and cannot tell you about real-world value, and what kind of evidence different decisions require and why. We then apply that framework to your specific workstreams to identify where AI is a plausible fit and where caution is warranted.
You receive: a decision framework your team can keep using, a prioritized portfolio of AI use cases, and a procurement-ready roadmap.
Independent Vendor Evaluation
We test AI tools on the workflows you actually run, such as benefits determinations, document verification, eligibility support, and citizen-facing chat, and we measure fairness, reliability, accessibility, cost, and speed against the standards your organization has to meet.
You receive: a vendor evaluation memo, benchmark report, and procurement-ready criteria.
Impact & Measurement
We design and run randomized trials that measure how AI actually affects the outcomes you care about, such as service access, completion, placement, learning, and fairness. We also build the data pipelines and monitoring infrastructure to keep evaluating as models, prompts, and features change.
You receive: a trial design, measurement system, and ongoing monitoring dashboard.
Where we come in
"We're about to buy an AI tool for eligibility determinations. We need independent evidence it works before we sign."
→ AI Evidence Framework & Roadmap
"We're deploying an application screening model. We need to know if it's fair and accurate."
→ Independent Vendor Evaluation
"We funded three AI pilots. Now we need evidence about which ones improve outcomes before deciding which to scale."
→ Impact & Measurement
"We shipped an AI feature. Usage is up, but we don't know if it's actually helping users."
→ Impact & Measurement
How we work
For low-stakes pilots, a workflow-specific evaluation may be enough. For tools that will shape benefits eligibility, hiring, or learning at scale, you need trial-level evidence. We help you decide which level of rigor fits the decision in front of you. Then we deliver it.
Setting up rigorous evaluation infrastructure is mostly a one-time cost. Once the data pipelines, randomization, and monitoring are in place, every follow-on test runs at a fraction of the original effort.
On past engagements, we've evaluated subsequent product changes for a small fraction of the original setup cost, turning a one-time investment into a durable capability.
Why TrialAI
Rigorous AI evaluation usually requires three kinds of expertise that rarely sit on the same team: research-grade methods, hands-on ML and engineering, and extensive public-sector domain knowledge. We bring all three, so we can scope real use cases, build or stress-test models, and measure whether a tool improves outcomes.
Our team designs and runs randomized controlled trials in live production systems, applies tools from ML and econometrics to measure impact, and can even help build models. Our evaluation, fairness, and bias-auditing work is grounded in methods from peer-reviewed journal articles, many of which we have authored. And we bring direct experience in multiple domains, such as benefits access, workforce, housing, education, criminal justice, and financial inclusion.
That range of expertise on one team is what makes rigorous AI evaluation possible, from scoping a use case through measuring outcomes in the field.
TrialAI is a program of Learning Collider at Renaissance Philanthropy, drawing on a network of researchers from Chicago, Oxford, MIT, Princeton, Brown, UC San Diego, and UT Austin, and policy relationships across more than 20 states.
Let's scope the use case, the evidence standard, and the evaluation approach before you commit.
hello@trialai.co