Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
The emergence of Large Language Models (LLMs) has highlighted the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, and so fail to capture the holistic nature of authentic human behavior. To bridge this gap, researchers have introduced OmniBehavior, the first user simulation benchmark constructed entirely from real-world data.
Introducing OmniBehavior
OmniBehavior integrates long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Because it is built from real-world data rather than synthetic or single-scenario data, it is intended to address the limitations of earlier benchmarks and to give a more complete picture of human behavior.
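To make the three properties concrete, here is a minimal sketch of how a single user's trace could be represented. It is an illustration only, not OmniBehavior's actual schema: the class names `BehaviorEvent` and `UserTrace`, the field names, and the scenario and action labels are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BehaviorEvent:
    """One logged action. All field names here are hypothetical."""
    timestamp: int     # seconds since an arbitrary epoch (assumed unit)
    scenario: str      # e.g. "search", "video", "shopping" (assumed labels)
    action: str        # e.g. "click", "purchase", "comment" (assumed labels)
    content: str = ""  # optional free-text payload, e.g. a query or comment


@dataclass
class UserTrace:
    """All events for one user, kept in chronological order."""
    user_id: str
    events: List[BehaviorEvent] = field(default_factory=list)

    def add(self, event: BehaviorEvent) -> None:
        self.events.append(event)
        self.events.sort(key=lambda e: e.timestamp)

    def scenarios(self) -> List[str]:
        # Cross-scenario: how many distinct contexts does the trace span?
        return sorted({e.scenario for e in self.events})

    def action_types(self) -> List[str]:
        # Heterogeneous: how varied is the action space?
        return sorted({e.action for e in self.events})

    def horizon(self) -> int:
        # Long-horizon: time between the first and last event.
        if not self.events:
            return 0
        return self.events[-1].timestamp - self.events[0].timestamp


trace = UserTrace(user_id="u1")
trace.add(BehaviorEvent(86_400, "shopping", "purchase"))
trace.add(BehaviorEvent(0, "search", "query", "running shoes"))
trace.add(BehaviorEvent(3_600, "video", "watch"))
```

In this sketch a trace counts as cross-scenario when `scenarios()` returns more than one label, and its horizon is simply the span between its first and last event.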
Empirical Evidence and Findings
Empirical evidence from OmniBehavior shows that previous datasets relying on isolated scenarios suffer from tunnel vision. Real-world decision-making, in contrast, depends on long-term, cross-scenario causal chains. The findings reveal several key insights:
- Tunnel Vision in Existing Datasets: Previous benchmarks fail to account for the interconnectedness of human behaviors across scenarios.
- Long-term Decision-making: Authentic decision-making processes often involve complex causal relationships that extend beyond isolated actions.
- State-of-the-art LLM Performance: Evaluations of current LLMs demonstrate that they struggle to accurately simulate these complex behaviors, indicating a pressing need for improvement.
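The tunnel-vision point can be illustrated with a toy example. The sketch below compares the history a simulator would see when restricted to one scenario with the history available across all scenarios. The event tuples, the scenario labels, and the helper `history_before` are hypothetical and are not taken from the benchmark.

```python
from typing import List, Optional, Tuple

# (timestamp, scenario, action) -- a simplified, assumed event format
Event = Tuple[int, str, str]


def history_before(events: List[Event], cutoff: int,
                   scenario: Optional[str] = None) -> List[Event]:
    """Return events strictly before `cutoff`, oldest first.

    If `scenario` is given, keep only events from that scenario,
    which mimics an isolated, single-scenario dataset.
    """
    past = [e for e in events if e[0] < cutoff]
    if scenario is not None:
        past = [e for e in past if e[1] == scenario]
    return sorted(past)


events: List[Event] = [
    (1, "search", "query: running shoes"),
    (2, "video", "watch: marathon training"),
    (3, "shopping", "view: running shoes"),
    (4, "shopping", "purchase: running shoes"),
]

# Context available when predicting the purchase at timestamp 4.
isolated = history_before(events, cutoff=4, scenario="shopping")
cross_scenario = history_before(events, cutoff=4)
```

Here the isolated view contains only the product view, while the cross-scenario view also includes the earlier search and video events that plausibly led to the purchase.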
Structural Bias in Large Language Models
One of the critical discoveries from the research is a fundamental structural bias inherent in LLMs. The models tend to converge toward a positive average persona, characterized by:
- Hyper-activity: Simulated users often display exaggerated levels of activity that do not reflect real human behavior.
- Persona Homogenization: LLMs demonstrate a tendency to produce similar personas, leading to a loss of individuality in simulations.
- Utopian Bias: The results suggest that LLMs favor idealized versions of behaviors, neglecting the diverse and often messy realities of human actions.
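One way to check for these three symptoms is to compare simple statistics of simulated and real behavior. The sketch below is an assumption-laden illustration, not the metrics used in the OmniBehavior evaluation: the function names, the choice of variance as a diversity proxy, the set of "positive" actions, and the toy numbers are all hypothetical.

```python
from statistics import mean, pvariance
from typing import FrozenSet, Sequence

# Assumed set of actions that count as "positive"; purely illustrative.
POSITIVE_ACTIONS: FrozenSet[str] = frozenset({"like", "purchase", "share"})


def activity_ratio(sim_counts: Sequence[int], real_counts: Sequence[int]) -> float:
    """Mean simulated actions per user over mean real actions per user.

    Values well above 1 would be consistent with hyper-activity.
    """
    return mean(sim_counts) / mean(real_counts)


def diversity_ratio(sim_counts: Sequence[int], real_counts: Sequence[int]) -> float:
    """Variance of simulated per-user activity over variance of real activity.

    Values well below 1 would be consistent with persona homogenization.
    """
    return pvariance(sim_counts) / pvariance(real_counts)


def positive_share(actions: Sequence[str]) -> float:
    """Fraction of actions that fall in POSITIVE_ACTIONS.

    A much higher share for simulated users than for real users
    would be consistent with a utopian bias.
    """
    if not actions:
        return 0.0
    return sum(a in POSITIVE_ACTIONS for a in actions) / len(actions)


# Toy data: actions per user and a flat list of action labels.
real_counts = [0, 1, 2, 9]
sim_counts = [5, 6, 6, 7]
real_actions = ["like", "skip", "skip", "dislike"]
sim_actions = ["like", "like", "purchase", "skip"]

hyper = activity_ratio(sim_counts, real_counts)
homog = diversity_ratio(sim_counts, real_counts)
utopia_gap = positive_share(sim_actions) - positive_share(real_actions)
```

On this toy data the simulated users are twice as active on average, vary far less from one another than the real users, and take positive actions three times as often.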
Implications for Future Research
The findings point to several directions for future high-fidelity simulation research. LLMs need to improve before they can accurately reflect the complexities of real-world human behavior. Potential avenues for further exploration include:
- Enhancing datasets to include a wider variety of behavioral patterns.
- Developing models capable of understanding and simulating the intricacies of long-term decision-making.
- Addressing the structural biases to create more authentic representations of diverse human behaviors.
As the field progresses, benchmarks built on real-world data, such as OmniBehavior, will become increasingly important for advancing user simulation and improving the capabilities of Large Language Models.
