
Paper deep dive

Foundation Models as Guardrails: LLM- and VLM-Based Approaches to Safety and Alignment

Huy H. Nguyen, Pride Kavumba, Tomoya Kurosawa, Koki Wataoka

Year: 2025 · Venue: APSIPA ASC 2025 · Area: Adversarial Robustness · Type: Survey · Embeddings: 1

Models: Gemini, LLMs (general), VLMs (general)

Abstract

The growing deployment of large language models (LLMs) and vision-language models (VLMs) raises urgent concerns about safety and alignment. While alignment techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) improve model behavior, they are not sufficient to prevent harmful outputs. This paper reviews recent approaches that use foundation models themselves as guardrails: systems that monitor or filter inputs and outputs for safety. We cover LLM-based moderation, neural classifiers, and multimodal safety filters, highlighting both academic advances and industry tools. We also discuss empirical evaluation methods such as red teaming and adversarial prompting. Finally, we outline open challenges in robustness, interpretability, and policy adaptation, pointing to key directions for building trustworthy guardrails for generative AI.
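As a rough illustration of the guardrail pattern the abstract describes (a foundation model judging another model's inputs and outputs before anything is returned), here is a minimal Python sketch. The query_llm helper, the policy wording, and the refusal messages are hypothetical placeholders, not anything specified by the paper.

# Minimal sketch of an LLM-based guardrail: the same (or a separate) model is
# asked for a SAFE/UNSAFE verdict on both the user input and the generated
# output. `query_llm` is a hypothetical placeholder for a real model endpoint.

MODERATION_PROMPT = (
    "You are a safety moderator. Classify the following text as SAFE or "
    "UNSAFE under this policy: no instructions for harm, no hate speech, "
    "no private personal data.\n\nText: {text}\n\nAnswer with one word."
)

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to the guardrail LLM/VLM."""
    raise NotImplementedError("Connect this to a real model endpoint.")

def is_safe(text: str) -> bool:
    """Ask the guardrail model for a SAFE/UNSAFE verdict on a piece of text."""
    verdict = query_llm(MODERATION_PROMPT.format(text=text)).strip().upper()
    return verdict.startswith("SAFE")

def guarded_generate(user_input: str, generate) -> str:
    """Wrap a generator function with input-side and output-side guardrails."""
    if not is_safe(user_input):
        return "Request refused by the input guardrail."
    output = generate(user_input)
    if not is_safe(output):
        return "Response withheld by the output guardrail."
    return output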

Tags

adversarial-robustness (suggested, 92%) · ai-safety (imported, 100%) · alignment-training (suggested, 80%) · red-teaming (suggested, 80%) · safety-evaluation (suggested, 80%) · survey (suggested, 88%)

Links

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 92%

Last extracted: 3/11/2026, 12:35:53 AM

Summary

This paper reviews the use of foundation models, specifically LLMs and VLMs, as guardrail systems to enhance safety and alignment in generative AI. It covers moderation techniques, neural classifiers, multimodal filters, and evaluation methods like red teaming, while identifying challenges in robustness and interpretability.

Entities (5)

Large Language Models · technology · 98%
Vision-Language Models · technology · 98%
Reinforcement Learning from Human Feedback · methodology · 95%
Supervised Fine-Tuning · methodology · 95%
Red Teaming · methodology · 92%

Relation Signals (3)

LLMs used as Guardrails

confidence 90% · This paper reviews recent approaches that use foundation models themselves as guardrails

VLMs used as Guardrails

confidence 90% · This paper reviews recent approaches that use foundation models themselves as guardrails

Red Teaming evaluates Guardrails

confidence 85% · We also discuss empirical evaluation methods such as red teaming and adversarial prompting.

Cypher Suggestions (2)

Find all technologies used as guardrails · confidence 90% · unvalidated

MATCH (t:Technology)-[:USED_AS]->(g:Concept {name: 'Guardrails'}) RETURN t.name

List all methodologies used for safety evaluation · confidence 85% · unvalidated

MATCH (m:Methodology)-[:EVALUATES]->(g:Concept) RETURN m.name
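For context, the suggested queries can be run with the official neo4j Python driver. The sketch below assumes a Neo4j instance already populated with the Technology/Concept schema implied by the suggestions above; the connection URI and credentials are hypothetical placeholders.

# Run the first suggested query against a Neo4j instance (hypothetical URI
# and credentials; assumes the USED_AS relationship exists in the graph).
from neo4j import GraphDatabase

QUERY = (
    "MATCH (t:Technology)-[:USED_AS]->(g:Concept {name: 'Guardrails'}) "
    "RETURN t.name AS name"
)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(QUERY):
        print(record["name"])  # e.g. the LLM/VLM entities extracted above
driver.close()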

Full Text

834 characters extracted from source content.


Foundation Models as Guardrails: LLM- and VLM-Based Approaches to Safety and Alignment | IEEE Conference Publication | IEEE Xplore