Olfactory Perception Benchmark for Large Language Models

Benchmark for Assessing Olfactory Perception of Large Language Models

Summary: arXiv:2604.00002v1 Announce Type: cross

This article introduces the Olfactory Perception (OP) benchmark, a framework for evaluating how well large language models (LLMs) reason about smell. The benchmark covers a range of olfactory tasks, from classifying odors to predicting olfactory receptor activation.

What is the Olfactory Perception Benchmark?

The OP benchmark assesses LLMs through structured questions spanning eight task categories. They are listed below under seven headings (intensity and pleasantness judgments share one line):

Odor classification
Odor primary descriptor identification
Intensity and pleasantness judgments
Multi-descriptor prediction
Mixture similarity
Olfactory receptor activation
Smell identification from real-world odor sources

In total, the benchmark comprises 1,010 questions, each presented in two prompt formats: compound names and isomeric SMILES. The dual format makes it possible to measure how the molecular representation affects model performance.
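As a rough illustration of the dual-format setup, the sketch below renders one multiple-choice question twice, once with the compound name and once with its SMILES string. The `OdorQuestion` class, the `render_prompt` function, the prompt wording, and the answer choices are all hypothetical; the benchmark's actual prompt template is not given in this summary. Vanillin is used only as a familiar example molecule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OdorQuestion:
    """Hypothetical container for one benchmark-style question."""
    compound_name: str
    isomeric_smiles: str
    question: str
    choices: tuple[str, ...]


def render_prompt(q: OdorQuestion, fmt: str) -> str:
    """Render the same question with either the name or the SMILES string."""
    if fmt == "name":
        molecule = q.compound_name
    elif fmt == "smiles":
        molecule = q.isomeric_smiles
    else:
        raise ValueError(f"unknown prompt format: {fmt!r}")
    # Label the choices A, B, C, ... so only the molecule line differs.
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(q.choices))
    return f"Molecule: {molecule}\n{q.question}\n{options}\nAnswer:"


# Illustrative question; the wording and choices are assumptions.
example = OdorQuestion(
    compound_name="vanillin",
    isomeric_smiles="COC1=C(C=CC(=C1)C=O)O",
    question="Which descriptor best matches the odor of this molecule?",
    choices=("vanilla", "minty", "sulfurous", "fishy"),
)

name_prompt = render_prompt(example, "name")
smiles_prompt = render_prompt(example, "smiles")
```

Because the two prompts differ only in the molecule line, any accuracy gap between them can be attributed to the representation rather than to the question wording.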

Evaluation of Model Configurations

The study evaluates 21 model configurations across major model families. The main results on olfactory reasoning are:

Compound-name prompts consistently outperform isomeric SMILES prompts.
The gain from compound names over isomeric SMILES ranges from +2.4 to +18.9 percentage points, with a mean of approximately +7 points.
The best-performing model achieved an overall accuracy of 64.4%.

These findings indicate that current LLMs tend to access olfactory knowledge primarily through lexical associations, rather than through structural molecular reasoning.
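The percentage-point gains reported above are simple differences between two accuracies per model configuration. The sketch below shows that arithmetic under stated assumptions: the configuration names and accuracy values are placeholders invented for illustration, not results from the study, and `prompt_format_gains` is a hypothetical helper.

```python
from statistics import mean


def prompt_format_gains(results: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return the name-over-SMILES gain in percentage points per configuration.

    `results` maps a configuration label to (name_accuracy, smiles_accuracy),
    both expressed as fractions between 0 and 1.
    """
    return {
        config: round((name_acc - smiles_acc) * 100, 1)
        for config, (name_acc, smiles_acc) in results.items()
    }


# Placeholder accuracies, NOT taken from the paper.
placeholder_results = {
    "config_a": (0.60, 0.55),
    "config_b": (0.50, 0.40),
    "config_c": (0.45, 0.42),
}

gains = prompt_format_gains(placeholder_results)
mean_gain = round(mean(list(gains.values())), 1)
```

A positive gain means the compound-name prompt scored higher; the mean over all configurations corresponds to the "approximately +7 points" style of summary figure.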

Cross-Language Evaluation

The benchmark also evaluates a subset of the OP questions across 21 languages. Aggregating predictions across languages improves olfactory prediction:

The best-performing language ensemble achieved an area under the receiver operating characteristic curve (AUROC) of 0.86.
This improvement suggests that LLMs can leverage linguistic diversity to enhance their olfactory reasoning abilities.
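To make the idea of a language ensemble concrete, the sketch below averages per-item scores obtained in different languages and computes AUROC on the result. This is only one plausible aggregation scheme: the summary does not say how the study combines predictions, so the simple mean is an assumption, and the function names, language codes, labels, and scores are toy values invented for illustration.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def ensemble_scores(per_language: dict[str, list[float]]) -> list[float]:
    """Average the scores for each item across languages (assumed aggregation)."""
    columns = list(per_language.values())
    n_items = len(columns[0])
    if any(len(col) != n_items for col in columns):
        raise ValueError("every language must score the same items")
    return [sum(col[i] for col in columns) / len(columns) for i in range(n_items)]


# Toy data, NOT from the benchmark: four items, two languages.
labels = [1, 0, 1, 0]
toy_scores = {
    "en": [0.9, 0.6, 0.4, 0.2],
    "fr": [0.5, 0.3, 0.7, 0.6],
}

per_language_auroc = {lang: auroc(labels, s) for lang, s in toy_scores.items()}
ensemble_auroc = auroc(labels, ensemble_scores(toy_scores))
```

In this toy case each language misranks a different pair of items, so averaging cancels the errors and the ensemble ranks all positives above all negatives. Real gains depend on how correlated the per-language errors are.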

Conclusion

The Olfactory Perception benchmark provides a structured way to measure how LLMs handle olfactory information, complementing evaluations built around visual and auditory data. The results suggest that LLMs show emerging capabilities in olfactory reasoning, but substantial gaps remain: the best overall accuracy is 64.4%, and models perform worse when given molecular structure instead of compound names.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
