Sanity Checks for Agentic Data Science
Summary: arXiv:2604.11003v1
Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect.
To address this issue, researchers have proposed a pair of lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science. These checks employ reasonable perturbations to determine whether an agent can reliably distinguish between meaningful signals and random noise. By applying these checks, users can impose a falsifiability constraint that helps expose unsupported affirmative conclusions.
Proposed Sanity Checks
The two proposed sanity checks aim to characterize the trustworthiness of an ADS output. They assess whether the ADS has identified stable signals or is merely responding to noise. Additionally, they evaluate whether the conclusions drawn by the ADS are sensitive to incidental aspects of the input data.
- Check 1: Signal Detection – This check assesses the agent’s ability to reliably identify the presence of a signal amidst noise.
- Check 2: Noise Sensitivity – This check evaluates how sensitive the agent’s conclusions are to variations in the input data.
Validation on synthetic data with controlled signal-to-noise ratios confirmed that the outcomes of the sanity checks track the ground-truth signal strength.
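The idea behind the two checks can be illustrated with a minimal sketch. It is not the paper's implementation: the perturbation used here (shuffling the outcome column to destroy any real relationship) is an assumed example of a "reasonable perturbation", and `toy_agent` is a hypothetical stand-in for an ADS system that simply thresholds a correlation. All function names are illustrative.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Dataset = List[Tuple[float, float]]  # rows of (feature, outcome)
Agent = Callable[[Dataset], bool]    # True means an affirmative conclusion


def permute_outcome(data: Dataset, rng: random.Random) -> Dataset:
    """Shuffle outcomes across rows, breaking any feature-outcome link."""
    outcomes = [y for _, y in data]
    rng.shuffle(outcomes)
    return [(x, y) for (x, _), y in zip(data, outcomes)]


def affirmative_rate(agent: Agent, datasets: Sequence[Dataset]) -> float:
    """Fraction of datasets on which the agent answers affirmatively."""
    return sum(agent(d) for d in datasets) / len(datasets)


def sanity_check(agent: Agent, data: Dataset,
                 n_runs: int = 20, seed: int = 0) -> Dict[str, float]:
    """Compare affirmative rates on the real data and on pure-noise copies."""
    rng = random.Random(seed)
    original = [data] * n_runs  # repeated runs; informative for stochastic agents
    null = [permute_outcome(data, rng) for _ in range(n_runs)]
    return {
        "original": affirmative_rate(agent, original),
        "null": affirmative_rate(agent, null),
    }


def toy_agent(data: Dataset, threshold: float = 0.5) -> bool:
    """Hypothetical agent: affirmative if |Pearson correlation| > threshold."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    n = len(data)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return False
    return abs(cov / (vx * vy) ** 0.5) > threshold
```

Under these assumptions, an agent that answers affirmatively about as often on the shuffled copies as on the real data is not distinguishing signal from noise, so its affirmative conclusion on the real data carries little weight.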
Real-World Applications
The researchers further demonstrated the effectiveness of these sanity checks on 11 real-world datasets using OpenAI Codex. This evaluation aimed to characterize the trustworthiness of each conclusion drawn by the ADS.
In 6 of the 11 datasets, an affirmative conclusion was not well supported, even though a single ADS run may have suggested otherwise. This discrepancy highlights the need to apply sanity checks whenever ADS is used in practice.
Analysis of Failure Modes
In addition to validating the sanity checks, the researchers conducted an analysis of failure modes in ADS systems. The results indicated that the self-reported confidence levels of ADS are often poorly calibrated to the empirical stability of their conclusions. This finding underscores the importance of skepticism when interpreting results generated by ADS systems.
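One way to make the calibration finding concrete is to compare an agent's average self-reported confidence with how often its repeated runs agree. The sketch below is an illustration under stated assumptions, not the paper's metric: "empirical stability" is defined here as the share of runs that match the majority conclusion, and `RunResult`, `empirical_stability`, and `calibration_gap` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    affirmative: bool   # did this run reach an affirmative conclusion?
    confidence: float   # agent's self-reported confidence, in [0, 1]


def empirical_stability(runs: List[RunResult]) -> float:
    """Share of runs that agree with the majority conclusion (assumed definition)."""
    yes = sum(r.affirmative for r in runs)
    return max(yes, len(runs) - yes) / len(runs)


def calibration_gap(runs: List[RunResult]) -> float:
    """Mean self-reported confidence minus empirical stability.

    A large positive value suggests the agent is more confident
    than the agreement among its own runs warrants.
    """
    mean_conf = sum(r.confidence for r in runs) / len(runs)
    return mean_conf - empirical_stability(runs)
```

For example, ten runs that each report 0.9 confidence but split six to four on the conclusion would have a stability of 0.6 and a gap of about 0.3 under this definition.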
Conclusion
As the use of agentic data science pipelines continues to increase, ensuring the reliability and trustworthiness of their outputs becomes paramount. The introduction of sanity checks, grounded in the PCS framework, provides a necessary tool for users to critically evaluate the conclusions drawn by these systems. Future research should continue exploring additional methods to enhance the reliability of ADS, ensuring that these powerful tools serve as trustworthy aids in data analysis.
