ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations

Published in EMNLP 2025

Yuejin Xie, Youliang Yuan, Wenxuan Wang, Fan Mo, Jianmin Guo, Pinjia He

We introduce ToolSafety, a safety fine-tuning dataset designed to address critical safety vulnerabilities in tool-using AI systems. The dataset contains 5,668 direct harm samples, 4,311 indirect harm samples, and 4,311 multi-step samples. Fine-tuning LLaMA3.1-8B-Instruct and Qwen2.5-7B-Instruct on ToolSafety yields models that maintain safety in multi-step and indirect harm scenarios while preserving helpfulness.
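For illustration, here is a minimal sketch of how a safety fine-tune like this could be set up with Hugging Face TRL. The file name `toolsafety.jsonl`, its chat-message record format, and the hyperparameters are assumptions made for this example, not the paper's actual training recipe.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical local copy of the ToolSafety samples, one chat-formatted
# ({"messages": [...]}) record per line; the paper's real data format may differ.
dataset = load_dataset("json", data_files="toolsafety.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # or "Qwen/Qwen2.5-7B-Instruct"
    train_dataset=dataset,
    args=SFTConfig(output_dir="toolsafety-sft", num_train_epochs=1),
)
trainer.train()
```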

Recommended citation: Yuejin Xie, Youliang Yuan, Wenxuan Wang, Fan Mo, Jianmin Guo, Pinjia He. (2025). "ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations." Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://aclanthology.org/2025.emnlp-main.714/