CovaSyn

ICLR 2026 Benchmark

On the peer-reviewed MolecularIQ benchmark, frontier LLMs score 21–41 % on chemical structure analysis. With CovaSyn MCP tools: 85–92 %. Here are the numbers, and what they don't show.

Fig. 1. Baseline vs. + CovaSyn MCP across three frontier LLMs: score per model × configuration on 3,540 verified chemistry questions.

Top-line numbers

| Model | Baseline | + CovaSyn MCP | Δ | Lift |
|---|---|---|---|---|
| Claude Haiku 4.5 | 21.18 % | 85.38 % | +64.20 pp | 4.03× |
| Claude Opus 4.7 | 40.75 % | 91.51 % | +50.76 pp | 2.25× |
| OpenAI GPT-5.5 | 22.29 % | 89.92 % | +67.63 pp | 4.03× |
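
The Δ and Lift columns follow directly from the two accuracy columns: Δ is the difference in percentage points, Lift is the ratio. A minimal sketch that recomputes both from the scores in the table; the function and variable names are illustrative and not part of any CovaSyn API.

```python
# Accuracy (%) per model, copied from the top-line table:
# (baseline, with CovaSyn MCP).
SCORES = {
    "Claude Haiku 4.5": (21.18, 85.38),
    "Claude Opus 4.7": (40.75, 91.51),
    "OpenAI GPT-5.5": (22.29, 89.92),
}


def delta_and_lift(baseline, with_tools):
    """Return (gain in percentage points, relative lift factor), rounded to 2 places."""
    return round(with_tools - baseline, 2), round(with_tools / baseline, 2)


for model, (baseline, with_tools) in SCORES.items():
    delta_pp, lift = delta_and_lift(baseline, with_tools)
    print(f"{model}: +{delta_pp:.2f} pp, {lift:.2f}x")
```

Reporting both columns matters because they answer different questions: Opus 4.7 has the smallest lift factor yet still ends with the highest absolute score.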

What this means in cost terms

Frontier models are expensive. With CovaSyn, you can often run the cheaper model and give up little or no accuracy.

| Configuration | Accuracy | $/question | Latency |
|---|---|---|---|
| Opus 4.7 baseline | 40.75 % | $0.02529 | 5.1 s |
| Opus 4.7 + CovaSyn MCP | 91.51 % | $0.12536 | 7.4 s |
| Haiku 4.5 + CovaSyn MCP | 85.38 % | $0.00781 | 5.8 s |
| Haiku 4.5 baseline | 21.18 % | $0.00069 | 2.1 s |
| GPT-5.5 + CovaSyn MCP | 89.92 % | $0.03005 | 9.4 s |

The sharp claim:

Haiku 4.5 + CovaSyn delivers 2.1× the accuracy of Opus 4.7 baseline at about 31 % of the cost, and stays 16× cheaper than Opus 4.7 + CovaSyn while giving up only about 6 pp of accuracy.
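
The ratios in this claim can be rechecked from the cost table. The sketch below assumes the cost column reads $0.02529 (Opus 4.7 baseline), $0.12536 (Opus 4.7 + CovaSyn MCP) and $0.00781 (Haiku 4.5 + CovaSyn MCP) per question; the constant names are illustrative.

```python
# (accuracy in %, cost in USD per question).
# The cost values are an assumed reading of the cost table.
OPUS_BASELINE = (40.75, 0.02529)
OPUS_MCP = (91.51, 0.12536)
HAIKU_MCP = (85.38, 0.00781)

# Haiku + MCP compared with Opus baseline.
accuracy_ratio = HAIKU_MCP[0] / OPUS_BASELINE[0]
cost_share = HAIKU_MCP[1] / OPUS_BASELINE[1]

# Haiku + MCP compared with Opus + MCP.
cost_factor = OPUS_MCP[1] / HAIKU_MCP[1]
accuracy_gap_pp = OPUS_MCP[0] - HAIKU_MCP[0]

print(f"accuracy vs Opus baseline: {accuracy_ratio:.1f}x")
print(f"cost vs Opus baseline:     {cost_share:.0%}")
print(f"cheaper than Opus + MCP:   {cost_factor:.0f}x")
print(f"accuracy given up:         {accuracy_gap_pp:.2f} pp")
```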

Fig. 2. Cost-accuracy Pareto frontier: accuracy on the y-axis versus cost per question on the x-axis. Haiku 4.5 with CovaSyn sits top-left: high accuracy at low cost per question.
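
A configuration is on the Pareto frontier when no other configuration is both at least as accurate and at least as cheap, and strictly better on one of the two. The sketch below applies that definition to the five configurations in the cost table (GPT-5.5 baseline has no cost entry there and is left out). The cost values are an assumed reading of the table, and `Config` and `pareto_front` are illustrative names.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # percent
    cost: float      # USD per question (assumed reading of the cost table)


CONFIGS = [
    Config("Opus 4.7 baseline", 40.75, 0.02529),
    Config("Opus 4.7 + CovaSyn MCP", 91.51, 0.12536),
    Config("Haiku 4.5 + CovaSyn MCP", 85.38, 0.00781),
    Config("Haiku 4.5 baseline", 21.18, 0.00069),
    Config("GPT-5.5 + CovaSyn MCP", 89.92, 0.03005),
]


def pareto_front(configs):
    """Return the non-dominated configurations, cheapest first."""
    front = []
    for c in configs:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.cost <= c.cost
            and (o.accuracy > c.accuracy or o.cost < c.cost)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.cost)


for c in pareto_front(CONFIGS):
    print(f"{c.name}: {c.accuracy:.2f} % at ${c.cost:.5f}")
```

Under these inputs only Opus 4.7 baseline drops off the frontier, because Haiku 4.5 + CovaSyn MCP is both cheaper and more accurate.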

Where CovaSyn lifts hardest

Mean accuracy lift across 8 question categories (averaged across all three models):

| Category | Baseline | + CovaSyn MCP | Δ |
|---|---|---|---|
| Scaffold & fragments | 18.0 % | 86.5 % | +68.4 pp |
| Rings & topology | 29.4 % | 93.2 % | +63.8 pp |
| Bonds & chains | 17.6 % | 80.9 % | +63.3 pp |
| Multi-feature questions | 27.3 % | 88.4 % | +61.1 pp |
| Atom & formula counts | 38.7 % | 98.3 % | +59.7 pp |
| Stereochemistry | 28.7 % | 86.0 % | +57.4 pp |
| Electronics & H-bonds | 31.2 % | 81.5 % | +50.3 pp |

Fig. 3. Per-category lift across all models, grouped by question type. Strongest leverage: scaffold & fragments (+68.4 pp).

Fig. 4. Overall summary: accuracy of all six configurations (three models, two configurations each) across three complexity bins.

Methodology

Benchmark

MolecularIQ by Bartmann et al., ICLR 2026 (arXiv:2601.15279). 3,540 tasks, 65 features, three complexity bins. Dataset public on HuggingFace.

Models

Claude Haiku 4.5, Claude Opus 4.7 and GPT-5.5. Each tested with and without CovaSyn MCP.

Verification

Symbolic, no LLM judges. A response scores only when the full answer matches the ground truth; there is no partial credit.
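
A minimal sketch of this all-or-nothing scoring rule, assuming answers are represented as dictionaries of feature name to value. The representation and the field names (`ring_count`, `aromatic_ring_count`) are hypothetical and do not reflect the benchmark's actual answer format.

```python
def score_response(predicted, truth):
    """Return 1 only if every field matches the ground truth exactly, else 0."""
    return int(predicted == truth)


def accuracy(pairs):
    """Mean score in percent over (predicted, truth) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no responses to score")
    return 100.0 * sum(score_response(p, t) for p, t in pairs) / len(pairs)


# Hypothetical two-feature question: one wrong field zeroes the whole answer.
truth = {"ring_count": 2, "aromatic_ring_count": 1}
fully_right = {"ring_count": 2, "aromatic_ring_count": 1}
half_right = {"ring_count": 2, "aromatic_ring_count": 2}

print(score_response(fully_right, truth))
print(score_response(half_right, truth))
print(accuracy([(fully_right, truth), (half_right, truth)]))
```

Exact matching keeps the score deterministic and reproducible, at the price of treating a nearly correct multi-part answer the same as a wrong one.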

Tools

Five chemistry primitives from the CovaBasicChem suite. Cheminformatics operations, deterministic, validated.
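
To illustrate what a deterministic primitive looks like, here is a small atom counter for flat molecular formulas. It is not one of the CovaBasicChem tools and says nothing about their interface. It is a hypothetical stand-in that shows the property that matters: the same input always yields the same, checkable output.

```python
import re
from collections import Counter

# Element symbol (one uppercase letter, optional lowercase letter) plus optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula):
    """Count atoms in a flat formula such as 'C6H12O6'.

    Brackets, charges and isotopes are not supported, and element symbols
    are not checked against the periodic table.
    """
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            raise ValueError(f"unexpected text at position {pos} in {formula!r}")
        element, number = match.groups()
        counts[element] += int(number) if number else 1
        pos = match.end()
    if pos != len(formula):
        raise ValueError(f"unexpected text at position {pos} in {formula!r}")
    return counts


print(dict(atom_counts("C6H12O6")))
print(dict(atom_counts("C2H5OH")))
```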

Volume

10,720 model responses in total. Haiku ran the full set, Opus and GPT-5.5 a stratified sample.
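
A stratified sample draws from each subgroup separately so that the sample keeps the mix of the full set. The sketch below shows the general technique only. The stratum key (a complexity bin), the sampling fraction and the seed are assumptions for illustration; the strata and sample sizes actually used for Opus and GPT-5.5 are not stated here.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Draw the same fraction from every stratum, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for label in sorted(strata):
        group = strata[label]
        k = max(1, round(len(group) * fraction))  # at least one item per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical task list: 300 tasks spread evenly over three complexity bins.
tasks = [{"id": i, "bin": i % 3} for i in range(300)]
subset = stratified_sample(tasks, key=lambda t: t["bin"], fraction=0.1)
print(len(subset))
```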

Where we still improve

We do not hit 100 %, and we do not want to hide that. Here is how the remaining gap breaks down and where you would look closer for your own validation.

| Outcome | Haiku + MCP | Opus + MCP | GPT-5.5 + MCP |
|---|---|---|---|
| Correct | 73.2 % | 83.0 % | 83.6 % |
| Tool result discarded | 21.6 % | 14.5 % | 10.9 % |
| Tool value off | 4.8 % | 2.2 % | 1.4 % |
| Format error | 0.2 % | 0.2 % | 4.1 % |

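Most of the remaining gap sits between tool and model, not in the tool itself: the model discards the tool result in 10.9–21.6 % of responses, while the tool value itself is off in only 1.4–4.8 %. We address that continuously.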
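
For your own validation, a breakdown like this can be produced by comparing three values per response: the model's final answer, the value the tool returned, and the ground truth. The sketch below uses assumed definitions of the four outcomes, since the exact criteria behind the table are not spelled out here; `classify` and the sample data are hypothetical.

```python
from collections import Counter


def classify(answer, tool_value, truth):
    """Assign one outcome label to a response (assumed definitions).

    answer is None when no answer could be parsed from the model output.
    """
    if answer is None:
        return "Format error"
    if answer == truth:
        return "Correct"
    if tool_value != truth:
        return "Tool value off"
    # The tool returned the right value, but the model answered something else.
    return "Tool result discarded"


# Hypothetical responses: (model answer, tool value, ground truth).
responses = [
    ("3", "3", "3"),
    ("2", "3", "3"),
    ("4", "4", "3"),
    (None, "3", "3"),
]

breakdown = Counter(classify(a, v, t) for a, v, t in responses)
for outcome, count in breakdown.items():
    print(f"{outcome}: {100 * count / len(responses):.1f} %")
```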

Citation

Bartmann C., Schimunek J., Ielanskyi M., Seidl P., Klambauer G., Luukkonen S. (2026). MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs. ICLR 2026 (poster, Pavilion 4 · P4-#5202, 24 Apr 2026), arXiv:2601.15279. Code: github.com/ml-jku/moleculariq. Dataset: huggingface.co/datasets/ml-jku/moleculariq-v0.0. Data snapshot: 2026-05-17.

Go deeper

In-depth analysis with methodology, implications and FAQ

About 12 minutes of reading. Background on model choice, cost Pareto in detail, GxP implications, FAQs.

Test it yourself

The tools that produced this lift are available in every CovaSyn account, including the free tier with 100 credits per week.
