Benchmark for Assessing Olfactory Perception of Large Language Models
Summary: arXiv:2604.00002v1 (cross-listed)
This article introduces the Olfactory Perception (OP) benchmark, which evaluates how well large language models (LLMs) reason about smell. The benchmark covers a broad range of olfactory tasks, from classifying odors to predicting olfactory receptor activation.
What is the Olfactory Perception Benchmark?
The OP benchmark assesses LLMs with structured questions that span eight task categories:
- Odor classification
- Odor primary descriptor identification
- Intensity judgments
- Pleasantness judgments
- Multi-descriptor prediction
- Mixture similarity
- Olfactory receptor activation
- Smell identification from real-world odor sources
In total, the benchmark comprises 1,010 questions, each presented in two different prompt formats: compound names and isomeric SMILES. This dual-format approach is intended to investigate the impact of molecular representations on the models’ performance.
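To make the dual-format design concrete, here is a minimal sketch that renders one question in both formats. The `Question` class, the prompt template, and the vanillin item are illustrative assumptions; none of them is taken from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one item (not the benchmark's actual schema)."""
    compound_name: str
    isomeric_smiles: str
    question: str
    options: tuple[str, ...]


def render_prompt(q: Question, fmt: str) -> str:
    """Render the same question with the molecule given as a name or as SMILES."""
    if fmt == "name":
        molecule = q.compound_name
    elif fmt == "smiles":
        molecule = q.isomeric_smiles
    else:
        raise ValueError(f"unknown format: {fmt!r}")
    # Label the answer options A, B, C, ...
    choices = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(q.options))
    return f"Molecule: {molecule}\n{q.question}\n{choices}"


# Illustrative item. Vanillin has no stereocenters, so its isomeric SMILES
# carries no stereo markers.
example = Question(
    compound_name="vanillin",
    isomeric_smiles="COc1cc(C=O)ccc1O",
    question="Which descriptor best matches this odor?",
    options=("vanilla", "fishy", "sulfurous", "minty"),
)

name_prompt = render_prompt(example, "name")
smiles_prompt = render_prompt(example, "smiles")
```

Only the molecule line differs between the two prompts, so any accuracy difference between them can be attributed to the molecular representation rather than to the question wording.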
Evaluation of Model Configurations
The study evaluates 21 model configurations across major model families. The main results are:
- Compound-name prompts consistently outperform isomeric SMILES prompts.
- The gain from compound-name prompts over SMILES prompts ranges from +2.4 to +18.9 percentage points, with a mean of approximately +7 points.
- The best-performing model achieved an overall accuracy of 64.4%.
These findings indicate that current LLMs tend to access olfactory knowledge primarily through lexical associations, rather than through structural molecular reasoning.
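The per-model gain reported above is a difference in accuracy between the two prompt formats, in percentage points. A minimal sketch of that arithmetic follows; the model names and accuracy values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

# Hypothetical accuracies (percent) for three made-up configurations.
# These are NOT results from the paper.
accuracy = {
    "model_a": {"name": 60.0, "smiles": 50.0},
    "model_b": {"name": 45.5, "smiles": 43.0},
    "model_c": {"name": 52.0, "smiles": 46.5},
}


def name_vs_smiles_gain(acc):
    """Percentage-point gain of name prompts over SMILES prompts, per model."""
    return {model: round(a["name"] - a["smiles"], 1) for model, a in acc.items()}


gains = name_vs_smiles_gain(accuracy)
mean_gain = round(mean(gains.values()), 1)
```

A positive gain for every configuration corresponds to the "consistently outperform" claim, and the mean of the gains corresponds to the reported average increase.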
Cross-Language Evaluation
The study also evaluates a subset of the OP benchmark across 21 languages. Aggregating predictions across languages improves olfactory prediction:
- The best-performing language ensemble achieved an area under the receiver operating characteristic curve (AUROC) of 0.86.
- This improvement suggests that LLMs can leverage linguistic diversity to enhance their olfactory reasoning abilities.
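One way to picture cross-language aggregation is sketched below: average the score each item receives across languages, then compute AUROC on the averaged scores. Mean aggregation, the binary labels, and all scores here are assumptions made for illustration; the summary does not specify how the paper combines per-language predictions.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_mean(per_language_scores):
    """Average the score each item receives across languages (assumed scheme)."""
    columns = zip(*per_language_scores.values())
    return [sum(col) / len(col) for col in columns]


# Hypothetical labels and per-language scores; NOT data from the paper.
labels = [1, 0, 1, 0]
per_language_scores = {
    "en": [0.9, 0.6, 0.4, 0.3],
    "de": [0.6, 0.2, 0.7, 0.65],
    "ja": [0.8, 0.5, 0.3, 0.1],
}

single_aurocs = {lang: auroc(labels, s) for lang, s in per_language_scores.items()}
ensemble_auroc = auroc(labels, ensemble_mean(per_language_scores))
```

In this toy data each language makes a different ranking error, so averaging cancels the errors and the ensemble ranks every positive above every negative. Averaging is not guaranteed to help in general; it helps when the per-language errors are not strongly correlated.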
Conclusion
The Olfactory Perception benchmark advances the evaluation of LLMs by highlighting their potential to process olfactory information alongside visual and auditory data. The results suggest that LLMs show emerging capabilities in olfactory reasoning, but substantial gaps remain before that potential can be fully realized.
