CovaSyn
Explainer | 6 min read | May 22, 2026

Why LLMs fail at chemistry, and what the tokenizer has to do with it

LLMs fail at counting rings. The reason is not missing chemistry knowledge but the tokenizer: SMILES strings shatter into disconnected fragments, and the molecular graph is lost. What the research shows, and how validated tools solve the problem.


Oliver Kraft

CovaSyn


Key takeaways

  • Modern LLMs fail at surprisingly simple chemistry tasks, even counting rings in a structure.
  • The cause is not missing "chemistry knowledge" but the tokenizer: SMILES strings break apart into disconnected token fragments, so the underlying molecular graph is lost.
  • This is documented in the literature (KAIST 2025, Scientific Reports 2024) and cannot be fixed simply by scaling up the model.
  • The reliable workaround: offload the computation to validated, deterministic tools that the LLM calls when needed. This is the tool-augmentation approach documented in Nature Machine Intelligence.

A simple test that almost every LLM fails

Give a current language model the SMILES notation of a molecule and ask: "How many rings does this structure have?" The answer sounds confident and is surprisingly often wrong. The same goes for counting stereocenters, identifying bridged ring systems, or decomposing a scaffold.

This is neither an isolated case nor a prompt problem. Current LLMs struggle to interpret SMILES and fail even at basic tasks like counting rings. A group at KAIST showed this systematically in 2025, which matches what our own chemistry benchmark shows: frontier models without tools often reach only 14 to 41 percent on symbolically verifiable chemistry tasks.

The interesting question is not whether they fail, but why.

The actual reason: the tokenizer

An LLM never sees a molecule as a molecule. It sees a text string, and before the model computes anything, a tokenizer breaks that string into fragments. That is exactly where the chemistry gets lost.

SMILES encodes a structure through a compact but syntactically dense grammar: parentheses mark branches, digits mark ring closures, and connected atoms often do not sit next to each other in the string. These conventions frequently produce non-contiguous tokens for connected substructures. A ring is closed in the text by two identical digits at entirely different positions; for the tokenizer, those are two independent characters with no visible link.
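To make the distance between matching ring-closure digits concrete, here is a minimal sketch in plain Python. The function name `ring_closure_pairs` is hypothetical, and it works on character positions rather than on the subword tokens a real LLM tokenizer would produce, so it only illustrates the gap, not any specific tokenizer.

```python
def ring_closure_pairs(smiles: str) -> list[tuple[str, int, int]]:
    """Return (label, open_pos, close_pos) for every ring closure in a SMILES string."""
    open_labels: dict[str, int] = {}
    pairs: list[tuple[str, int, int]] = []
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside are isotopes, H counts, or charges.
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            label, width = smiles[i + 1 : i + 3], 3  # two-digit label such as %12
        elif ch.isdigit():
            label, width = ch, 1
        else:
            i += 1
            continue
        if label in open_labels:
            # Second occurrence of the label: the ring bond is closed here.
            pairs.append((label, open_labels.pop(label), i))
        else:
            open_labels[label] = i
        i += width
    return pairs


if __name__ == "__main__":
    # Naphthalene: two fused rings.
    for label, start, end in ring_closure_pairs("c1ccc2ccccc2c1"):
        print(f"ring bond {label}: opens at index {start}, closes at index {end}")
```

For naphthalene, the digit `1` opens at index 1 and closes at index 13, the last character of the string. The two atoms joined by that bond are neighbours in the molecule but sit at opposite ends of the text.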

The consequence: this structural bottleneck prevents LLMs from fully grasping the underlying molecular-graph structure. The model "reads" a character string from which the topology of the molecule (which atom is connected to which) does not directly emerge. It has to reconstruct the topology from the string, and that reconstruction is unreliable.

To make things worse, the input itself is ambiguous in two ways. SMILES strings rely on a precise sequence of atoms and bonds, and different tokenization schemes produce meaningful differences in how molecules are parsed and represented. In addition, one molecule can be written as several valid SMILES strings. The same molecule, written differently, can yield different tokens and therefore different answers.
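A small sketch of this effect, using three valid SMILES spellings of toluene. The regex tokenizer below is an illustrative atom-level scheme chosen for this example, not the tokenizer of any particular model (LLM tokenizers are typically subword-based and split differently). The atom inventory is only a rough order-independent check and does not prove that two strings describe the same molecule.

```python
import re
from collections import Counter

# Illustrative atom-level tokenizer (an assumption for this sketch):
# bracket atoms, two-letter halogens, %nn ring labels, single atoms, digits, syntax.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-/\\.:@*]")


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    return TOKEN.findall(smiles)


def atom_counts(smiles: str) -> Counter:
    """Count atoms by element, merging aromatic (c) and aliphatic (C) spellings.

    Bracket atoms are ignored here to keep the sketch short.
    """
    return Counter(t.capitalize() for t in tokenize(smiles) if t[0].isalpha())


if __name__ == "__main__":
    toluene_variants = ["Cc1ccccc1", "c1ccccc1C", "c1ccc(C)cc1"]
    for s in toluene_variants:
        print(s, tokenize(s), dict(atom_counts(s)))
```

All three strings contain seven carbon atoms, yet each produces a different token sequence. A model that conditions on the sequence sees three different inputs for one molecule.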

Why "a bigger model" does not solve this

The intuitive hope that the next, larger frontier model will fix it falls short here. The problem is not the number of parameters or the amount of training data but the representation. As long as a molecule arrives as a token sequence rather than a graph, even the most capable model wrestles with a task that is trivial for a deterministic algorithm.

A ring can be counted exactly. A molecular weight can be computed exactly. A stereocenter can be determined exactly. These are not tasks a probabilistic text generator should be guessing. They are tasks with one correct answer that a specialist tool delivers in milliseconds. That is precisely why guessing fails, no matter how good the model is at language.
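As an illustration of how little work exact ring counting is once the molecule is a graph, here is a minimal sketch in plain Python. It is not how RDKit or CovaSyn implement it. The parser covers only a SMILES subset (organic-subset atoms, bracket atoms, branches, ring closures) and does no validation, and the function names are hypothetical. "Ring count" here means the number of independent cycles, computed as bonds minus atoms plus connected components.

```python
def smiles_to_graph(smiles: str) -> tuple[list[str], list[tuple[int, int]]]:
    """Parse a SMILES subset into (atoms, bonds); bonds are pairs of atom indices."""
    atoms: list[str] = []
    bonds: list[tuple[int, int]] = []
    branch_stack: list[int] = []
    open_rings: dict[str, int] = {}
    prev = None  # index of the atom the next atom will bond to
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            branch_stack.append(prev)
            i += 1
        elif ch == ")":
            prev = branch_stack.pop()
            i += 1
        elif ch == ".":
            prev = None  # start of a disconnected fragment
            i += 1
        elif ch.isdigit() or ch == "%":
            label = smiles[i + 1 : i + 3] if ch == "%" else ch
            i += 3 if ch == "%" else 1
            if label in open_rings:
                bonds.append((open_rings.pop(label), prev))
            else:
                open_rings[label] = prev
        elif ch == "[" or ch.isalpha():
            if ch == "[":
                end = smiles.index("]", i) + 1
            elif smiles[i : i + 2] in ("Cl", "Br"):
                end = i + 2
            else:
                end = i + 1
            atoms.append(smiles[i:end])
            if prev is not None:
                bonds.append((prev, len(atoms) - 1))
            prev = len(atoms) - 1
            i = end
        else:
            i += 1  # bond symbols and stereo marks do not change connectivity
    return atoms, bonds


def count_rings(smiles: str) -> int:
    """Number of independent cycles: bonds - atoms + connected components."""
    atoms, bonds = smiles_to_graph(smiles)
    parent = list(range(len(atoms)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)
    components = len({find(k) for k in range(len(atoms))})
    return len(bonds) - len(atoms) + components
```

With this sketch, `count_rings("c1ccc2ccccc2c1")` (naphthalene) returns 2 and `count_rings("CC(=O)O")` (acetic acid) returns 0, every time, for every valid spelling within the supported subset. A production tool would additionally validate the input and handle the full SMILES grammar.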

How we solve it at CovaSyn: offload the computation

The research community's answer to this problem is not "let it guess better" but "stop letting it guess." The established route is called tool augmentation: the LLM keeps what it does well (understanding language, recognizing intent, contextualizing results) and delegates exact computation to specialized, deterministic tools.
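The delegation pattern can be sketched in a few lines. Everything named here is hypothetical: the JSON message shape, the `TOOLS` registry, and the `ring_closure_count` tool are assumptions made for illustration and do not represent the MCP wire format or the CovaSyn API. The sketch only shows the division of labour: the model emits a structured request, and a deterministic function produces the value.

```python
import json
import re
from typing import Callable


def ring_closure_count(smiles: str) -> int:
    """Count ring-closure bonds in a SMILES string.

    For a valid, connected SMILES this equals the number of independent rings.
    """
    without_brackets = re.sub(r"\[[^\]]*\]", "", smiles)  # drop isotope/charge digits
    labels = re.findall(r"%\d{2}|\d", without_brackets)
    return len(labels) // 2  # every ring bond is written as two matching labels


# Hypothetical registry of deterministic tools the agent is allowed to call.
TOOLS: dict[str, Callable[..., object]] = {
    "ring_closure_count": ring_closure_count,
}


def run_tool_call(raw: str) -> str:
    """Execute a model-emitted tool call and return a JSON result for the model."""
    call = json.loads(raw)
    tool = TOOLS.get(call["tool"])
    if tool is None:
        return json.dumps({"error": f"unknown tool: {call['tool']}"})
    return json.dumps({"tool": call["tool"], "result": tool(**call["arguments"])})


if __name__ == "__main__":
    # Assumed shape of what the model emits instead of guessing the answer itself.
    request = '{"tool": "ring_closure_count", "arguments": {"smiles": "c1ccc2ccccc2c1"}}'
    print(run_tool_call(request))
```

The model never computes the number. It only decides which tool to call and how to phrase the returned value for the user.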

That this works is well documented. ChemCrow, published in Nature Machine Intelligence, demonstrates it: LLMs show strong performance across many domains but struggle with chemistry-related problems and lack access to external knowledge sources; by integrating 18 expert tools, ChemCrow extends the chemistry performance of the LLM. Many of these tools rely on established libraries like RDKit, which correctly process chemical structures as graphs, not as text.

The broader finding holds well beyond chemistry: integrating symbolic reasoning with LLMs has been explored to improve their performance on arithmetic and other computational tasks where deterministic solutions are critical. Where a task has an exact answer, it belongs to a tool, not to the language generator.

That is exactly the idea behind CovaSyn. We build the deterministic chemistry layer (structure analysis, solubility, toxicology, stability, analytics) as validated tools that an AI agent calls through the Model Context Protocol (MCP). The molecule is processed where it is treated as a graph; the LLM gets a verified value back and only has to communicate it correctly. The ability to compute chemistry is delegated to systems explicitly built for it, and the LLM reaches for them as needed.

We measured what this means in numbers on independent data: with CovaSyn attached, the same frontier models rise from 14 to 41 percent correct answers to 76 to 92 percent. Not because the models got better, but because they stopped guessing.

How big the effect is for a cheap model is broken down in detail in our Gemini 3.5 Flash deep dive; how a deterministic layer solves a single concrete property is shown in the CovaSolv solubility post.

The bottom line

LLMs are excellent language and reasoning engines, but molecules are not language; they are graphs. The tokenizer is where this mismatch becomes a source of error. A bigger model only shifts the boundary; a validated tool removes it. Anyone who wants to deploy AI agents reliably in pharma and chemistry R&D needs both: the linguistic capability of the model and the deterministic reliability of the tools underneath.

You can try it yourself on the free tier: create an account, generate an API key, and attach it to your agent. 100 credits per week, no credit card. → See CovaSyn MCP

FAQ

Why can ChatGPT not count atoms or rings?

Because the model sees a molecule as a text string (SMILES) that a tokenizer breaks into disconnected fragments. The graph structure of the molecule is lost in the process, so the model has to guess rather than compute the result.

Do LLMs understand SMILES?

Only to a limited degree. Studies (including KAIST 2025) show that current LLMs fail even at basic SMILES parsing tasks such as ring counting, because SMILES uses non-contiguous tokens for connected substructures.

Does a bigger model solve the chemistry problem?

Not reliably. The problem lies in the representation (text instead of graph), not in model size. Exact tasks with one correct answer belong to a deterministic tool.

How do you make an LLM good at chemistry?

Through tool augmentation: the LLM calls validated, deterministic chemistry tools (e.g. RDKit-based) that process structures as graphs. That is exactly what CovaSyn delivers through the Model Context Protocol.

Sources

  • Jang Y., Kim J., Ahn S. (2025). Improving Chemical Understanding of LLMs via SMILES Parsing. arXiv:2505.16340.
  • SMILES vs. SELFIES tokenization comparison: Scientific Reports 14 (2024), s41598-024-76440-8.
  • Bran A. M. et al. (2024). Augmenting large language models with chemistry tools (ChemCrow). Nature Machine Intelligence.
  • McNaughton et al. CACTUS: Chemistry Agent Connecting Tool Usage to Science.
  • Integrating External Tools with LLMs to Improve Accuracy. arXiv:2507.08034 (2025).

CovaSyn MCP

Scientific tools in your AI workflow.

130+ functions for pharma, biotech and chemistry. Free tier instantly active.
