Benchmark · 9 min read · May 18, 2026

21 % → 85 %: How CovaSyn scores on the ICLR 2026 chemistry benchmark

Klambauer's lab (JKU Linz) built a molecular-reasoning benchmark for ICLR 2026 that checks answers against ground truth instead of relying on LLM judges. Frontier LLMs score 21 to 41 percent on it. With CovaSyn MCP attached, the same models reach 85 to 92 percent. We publish the full numbers, including the gaps.

Oliver Kraft

CovaSyn

What we wanted to know

How far does a modern frontier LLM actually get in chemistry when it works on its own, and how much of that changes once you give it a proper MCP layer? Counting atoms, identifying rings, locating stereocenters, extracting scaffolds: that is where you can see whether a model computes or guesses.

Instead of building our own benchmark for this, we picked an independent one.

The benchmark

MolecularIQ comes out of JKU Linz's Institute for Machine Learning and was accepted at ICLR 2026 (Bartmann, Schimunek, Ielanskyi, Seidl, Klambauer, Luukkonen; arXiv:2601.15279). The dataset contains 13,170 questions across eight splits. We used the test split with 3,540 questions, restricted to the symbolically verifiable task types. Scoring runs without LLM judges: only a full match against ground truth counts as a hit.
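To make the full-match rule concrete, here is a minimal sketch of exact-match scoring. It is not MolecularIQ's actual evaluation code: the function names, the dict-based answer format, and the example keys are assumptions chosen for illustration.

```python
def is_hit(predicted: dict, truth: dict) -> bool:
    """Full match only: the answer must equal the ground truth exactly.

    A partially correct answer (one wrong or missing value) scores zero.
    """
    return predicted == truth


def accuracy(predictions: list[dict], truths: list[dict]) -> float:
    """Share of questions answered with a full match, in percent."""
    if not truths or len(predictions) != len(truths):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(is_hit(p, t) for p, t in zip(predictions, truths))
    return 100.0 * hits / len(truths)


# Hypothetical two-part questions (keys are illustrative, not the benchmark's schema).
truths = [
    {"ring_count": 2, "stereocenters": 1},
    {"ring_count": 0, "stereocenters": 0},
]
predictions = [
    {"ring_count": 2, "stereocenters": 3},  # one value wrong, so no credit
    {"ring_count": 0, "stereocenters": 0},  # full match
]

print(accuracy(predictions, truths))  # 50.0
```

The point of the strict rule is that a near miss is still a miss, which is what separates computing from guessing.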

The numbers

Three frontier models, each with and without CovaSyn MCP:

| Model              | Baseline | + CovaSyn MCP | Lift  |
|--------------------|----------|---------------|-------|
| Claude Haiku 4.5   | 21.18 %  | 85.38 %       | 4.03× |
| Claude Opus 4.7    | 40.75 %  | 91.51 %       | 2.25× |
| OpenAI GPT-5.5     | 22.29 %  | 89.92 %       | 4.03× |
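The lift column is the accuracy with MCP divided by the baseline accuracy. A short sketch that recomputes it from the published percentages, with the gain in percentage points alongside (the variable and function names are ours, for illustration only):

```python
# (baseline %, with CovaSyn MCP %) as published in the table above.
results = {
    "Claude Haiku 4.5": (21.18, 85.38),
    "Claude Opus 4.7": (40.75, 91.51),
    "OpenAI GPT-5.5": (22.29, 89.92),
}


def lift(baseline: float, with_mcp: float) -> float:
    """Relative lift: how many times higher the accuracy is with MCP."""
    return with_mcp / baseline


for model, (baseline, with_mcp) in results.items():
    gain_pp = with_mcp - baseline  # absolute gain in percentage points
    print(f"{model}: {lift(baseline, with_mcp):.2f}x, +{gain_pp:.2f} pp")
```

The two views differ: Opus 4.7 has the smallest relative lift because it starts from the highest baseline, yet it still gains about 51 percentage points.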

Three models, same direction, similar end point. Baselines of 21 to 41 percent rise to 85 to 92 percent, a lift of 2.25× to 4.03×.

What that means in cost

Same snapshot, cost per question:

- Opus 4.7 without MCP: 40.75 % at $0.02529 per question
- Opus 4.7 with MCP: 91.51 % at $0.12536 per question
- Haiku 4.5 with MCP: 85.38 % at $0.00781 per question

The interesting entry is the last one. Haiku 4.5 with MCP delivers more than twice the accuracy of the Opus baseline at roughly a third of the cost. Against Opus plus MCP it gives up six percentage points of accuracy but costs about one-sixteenth as much. That opens a new middle ground for teams that were forced to choose between cheap-and-wrong and expensive-and-correct.
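One way to put accuracy and price on a single axis is cost per correct answer: the cost per question divided by the hit rate. This is a derived metric of our own for illustration, not a number reported by the benchmark, and the names below are assumptions.

```python
# (accuracy %, cost per question in USD) from the snapshot above.
configs = {
    "Opus 4.7 without MCP": (40.75, 0.02529),
    "Opus 4.7 with MCP": (91.51, 0.12536),
    "Haiku 4.5 with MCP": (85.38, 0.00781),
}


def cost_per_correct(accuracy_pct: float, cost_per_question: float) -> float:
    """Expected spend to obtain one correct answer, in USD."""
    return cost_per_question / (accuracy_pct / 100.0)


for name, (accuracy_pct, cost) in configs.items():
    print(f"{name}: ${cost_per_correct(accuracy_pct, cost):.4f} per correct answer")

# Ratios quoted in the text, recomputed from the raw numbers.
haiku_cost = configs["Haiku 4.5 with MCP"][1]
print(f"Haiku+MCP vs Opus baseline cost: {haiku_cost / configs['Opus 4.7 without MCP'][1]:.2f}")
print(f"Opus+MCP vs Haiku+MCP cost: {configs['Opus 4.7 with MCP'][1] / haiku_cost:.1f}x")
```

On this metric Haiku 4.5 with MCP comes out at under one cent per correct answer, against roughly six cents for the Opus baseline and roughly fourteen cents for Opus with MCP.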

Where the lift is biggest

Averaged over the three models, grouped by benchmark category:

- Scaffold and fragments: 18.0 % → 86.5 %
- Rings and topology: 29.4 % → 93.2 %
- Bonds and chains: 17.6 % → 80.9 %
- Multi-feature constraints: 27.3 % → 88.4 %
- Atom and formula counts: 38.7 % → 98.3 %
- Stereochemistry: 28.7 % → 86.0 %
- Electronics and H-bonds: 31.2 % → 81.5 %
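To see where the lift is biggest, the categories can be ranked by their gain in percentage points. A minimal sketch using only the averages listed above (the data structure and names are ours):

```python
# Category averages over the three models: (baseline %, with MCP %).
categories = {
    "Scaffold and fragments": (18.0, 86.5),
    "Rings and topology": (29.4, 93.2),
    "Bonds and chains": (17.6, 80.9),
    "Multi-feature constraints": (27.3, 88.4),
    "Atom and formula counts": (38.7, 98.3),
    "Stereochemistry": (28.7, 86.0),
    "Electronics and H-bonds": (31.2, 81.5),
}

# Sort by absolute gain in percentage points, largest first.
ranked = sorted(
    ((name, round(after - before, 1)) for name, (before, after) in categories.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, gain_pp in ranked:
    print(f"{name}: +{gain_pp} pp")
```

Scaffold and fragments gains the most at 68.5 percentage points, and Electronics and H-bonds the least at 50.3.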

These are exactly the building blocks medicinal chemistry uses every day. On a meaningful share of sub-tasks the cheapest configuration even hits 100 percent, which never happens without tools.

Where it is not yet 100 percent

Some model answers remain wrong even when the tool returned the correct value, because the model overrides or ignores it. A smaller share comes from format issues or tool values that do not fit. The breakdown sits on [covasyn.com/benchmark](/en/benchmark). We publish it because an honest gap is more useful than a polished marketing claim.

What this means in practice

Model spend is no longer automatically the bottleneck. Teams paying about $0.03 per question can move to about $0.008 per question at equal or better accuracy. The lift showed up on all three models tested, so we expect it to carry over to whichever frontier model comes next.

For regulated pharma and CDMO workflows that matters, because validation hangs on reproducible numbers. Those are exactly what we published.

What's next

We have seven more CovaSyn modules lined up for matching datasets. Once the numbers land, they go up in the same format on [covasyn.com/benchmark](/en/benchmark). Methodology, gaps, all of it.

The free tier lets you see how the integration feels first-hand: create an account, generate an API key, attach it to your agent. 100 credits per week included.

Citation

Bartmann C., Schimunek J., Ielanskyi M., Seidl P., Klambauer G., Luukkonen S. (2026). MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs. ICLR 2026, arXiv:2601.15279. Code: github.com/ml-jku/moleculariq. Dataset: huggingface.co/datasets/ml-jku/moleculariq-v0.0. Data snapshot: 2026-05-17.
