SENTINEL — Trustworthy Guardrails for Web-Agent LLM Services
A model-agnostic guardrail framework: five inference-time enforcement layers inside a nine-layer reference architecture, wrapping any OpenAI-compatible endpoint to block jailbreaks, prompt injection, and indirect tool-injection — without retraining the model.
- 49.2% of attacks blocked pre-generation at 5.0% benign false-positive rate
- More than 2× a keyword-matching baseline (19.0%)
- Llama 3.1 8B end-to-end: block rate 80.2% → 90.5%, true ASR 19.8% → 9.5%