I Asked 12 AI Models the Same Questions - Blind. Here Is What the Data Shows.

I ran 12 AI models through 45 identical prompts - blind, no shared context. Here is what the patterns of agreement and disagreement reveal about reliability.

There is a question I kept coming back to: when two AI models give me different answers to the same question, which one do I trust?

Most people just go with whichever sounds more confident. That never sat right with me.

So I ran an experiment.

The Setup

I called it BMAS - Blind Multi-Agent Synthesis. The concept is straightforward: send the same prompt to multiple AI models simultaneously, with no shared context between them, then measure how much they agree.
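In code, the fan-out is simple. The sketch below is illustrative rather than the exact harness from the repo: `query_model` is a hypothetical wrapper around whatever SDK each provider offers, and the model names and system prompt are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder identifiers -- the real run used 12 models.
MODELS = ["model-a", "model-b", "model-c"]

# A deliberately neutral system prompt, identical for every model.
NEUTRAL_SYSTEM_PROMPT = "Answer the question as accurately and concisely as you can."

def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around each provider's API. Every call is a
    fresh, isolated conversation: no shared history, and no hint that
    other models are answering the same prompt."""
    raise NotImplementedError("plug in the SDK of your choice here")

def blind_fanout(prompt: str) -> dict[str, str]:
    """Send one prompt to every model in parallel and collect the answers."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {m: pool.submit(query_model, m, prompt) for m in MODELS}
        return {model: future.result() for model, future in futures.items()}
```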

The hypothesis: if multiple models independently arrive at the same answer, that answer is probably reliable. If they scatter, that is a signal worth paying attention to.

Twelve models. Forty-five prompts. Three domains:

  • Technical (A): Factual, specific questions with verifiable answers - CVE severity scores, cryptographic standards, protocol specifications
  • Regulatory (B): Compliance and legal interpretation - GDPR articles, NIS-2 requirements, BSI framework standards
  • Strategic (C): Open-ended questions without a single correct answer - risk prioritization, threat modeling, long-term organizational decisions

Each model received the same prompt in complete isolation, with a neutral system prompt. No model knew another was answering the same question. Agreement was measured four ways: cosine similarity, BERTScore, the Jaccard coefficient, and DBSCAN clustering to flag outliers - models that went completely off-script relative to the others.
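For concreteness, here is roughly what that measurement can look like in code. It is a simplified sketch, not the pipeline from the repo: the sentence-transformers embedder and the DBSCAN parameters are assumptions, and BERTScore is left out to keep the example short.

```python
import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer   # assumed embedding backend
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence embedder works

def agreement_metrics(responses: dict[str, str], eps: float = 0.25) -> dict:
    """Score one prompt's responses: mean pairwise cosine similarity,
    mean pairwise Jaccard over token sets, and DBSCAN outliers."""
    models = list(responses)
    texts = [responses[m] for m in models]
    pairs = list(combinations(range(len(models)), 2))

    # Pairwise cosine similarity over sentence embeddings.
    emb = embedder.encode(texts)
    sim = cosine_similarity(emb)
    mean_cosine = float(np.mean([sim[i, j] for i, j in pairs]))

    # Jaccard coefficient over lowercased token sets.
    tokens = [set(t.lower().split()) for t in texts]
    mean_jaccard = float(np.mean([
        len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j]) for i, j in pairs
    ]))

    # DBSCAN on cosine distance; label -1 marks models outside the main cluster.
    labels = DBSCAN(eps=eps, min_samples=2, metric="cosine").fit_predict(emb)
    outliers = [m for m, label in zip(models, labels) if label == -1]

    return {"mean_cosine": mean_cosine, "mean_jaccard": mean_jaccard, "outliers": outliers}
```

A mean cosine above a chosen threshold counts as convergence; anything DBSCAN flags gets a second look.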

What the Data Shows

The first finding was expected, but the confirmation was still satisfying.

On factual, well-defined questions, models converge. Mean cosine similarity of 0.851 on technical prompts - well above the 0.75 convergence threshold I set. When there is a known answer, the models largely agree.

On strategic, open-ended questions, similarity drops to 0.845. The drop is small in absolute terms, but it shows up on every prompt in that domain. There is no ground truth to converge on.

What caught me off guard was the outlier pattern. Models that diverged on technical prompts tended to do so consistently - same models, different prompts, same direction. That is not random noise. It is a property of how a specific model approaches a domain. Some models are systematically more conservative in technical claims. Some are more expansive. The experiment made that visible.
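Spotting that pattern takes nothing fancy: once every prompt has been scored, count how often each model lands outside the main cluster. A small sketch, reusing the hypothetical `agreement_metrics` output from above:

```python
from collections import Counter

def outlier_profile(per_prompt_metrics: dict[str, dict]) -> Counter:
    """per_prompt_metrics maps prompt_id -> agreement_metrics(...) output.
    A model that keeps appearing in this count is a systematic outlier
    for the domain, not random noise."""
    counts = Counter()
    for metrics in per_prompt_metrics.values():
        counts.update(metrics["outliers"])
    return counts

# outlier_profile(results).most_common(3) lists the models that
# diverge most often within a domain.
```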

What This Is Actually Useful For

I work in IT management and deal with security, compliance, and infrastructure questions daily. AI tools are a real part of that workflow now. The question I started with is not abstract.

What BMAS showed me is that the answer is not "trust model X, avoid model Y." The answer is: use agreement as a signal.

High inter-model agreement on factual questions is a useful reliability indicator. Not a guarantee - models can agree on something wrong - but a meaningful calibration layer.

For strategic questions, expect divergence and treat it as useful information. Different models surfacing different framings of the same problem is not a failure - it gives you a better map of the solution space than any single model provides.

The dangerous zone is a factual question where models happen to agree on something incorrect. That is harder to detect, and it is where domain expertise and primary sources still matter. BMAS does not replace verification. It adds a layer before it.

What Comes Next

The complete dataset - 540 model responses across 45 prompts and 12 models - is on GitHub. BMAS is open research: the code, the prompts, and every raw response are public.

The repository also documents the three synthesis strategies I tested: majority-vote aggregation, semantic centroid, and LLM-as-Judge. The LLM-as-Judge results in particular are worth a separate look.
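Purely to illustrate the semantic-centroid idea (a sketch of the concept, not the repository's implementation, and the embedding model is again an assumption): embed every response, average the embeddings, and return the answer that sits closest to that average.

```python
from sentence_transformers import SentenceTransformer   # assumed embedding backend
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_centroid(responses: dict[str, str]) -> tuple[str, str]:
    """Return the (model, response) pair closest to the mean embedding --
    the most 'representative' answer in the group."""
    models = list(responses)
    emb = embedder.encode([responses[m] for m in models])
    centroid = emb.mean(axis=0, keepdims=True)
    best = int(cosine_similarity(emb, centroid).argmax())
    return models[best], responses[models[best]]
```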

I did not start this to publish something. I started it because the question was bothering me and I wanted actual data. The results turned out to be more interesting than expected, so I am sharing them.

If you are using AI for decisions that matter - security assessments, compliance interpretations, technical recommendations - the convergence approach is worth thinking about. It does not require running your own experiment. It just requires asking the question twice, with different models, and paying attention to where they disagree.


BMAS (Blind Multi-Agent Synthesis) is open research. Code and full dataset on GitHub.