For Beginners
Currently, only the Japanese version is available. The English version is coming soon.
Informartion Hub
Survey
- TrustLLM: Trustworthiness in Large Language Models
- Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
- A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
Attack
Prompt Injection
- Prompt Injection attack against LLM-integrated Applications
- Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
- Ignore Previous Prompt: Attack Techniques For Language Models
- Safeguarding Crowdsourcing Surveys from ChatGPT with Prompt Injection
- A Prompting-based Approach for Adversarial Example Generation and Robustness Enhancement
- Black Box Adversarial Prompting for Foundation Models
- Adversarial Soft Prompt Tuning for Cross-Domain Sentiment Analysis
- Prompt Injection: Parameterization of Fixed Inputs
- Prompt Injection Attacks and Defenses in LLM-Integrated Applications
- From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application?
- Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Jailbreak
- How to jailbreak ChatGPT: get it to really do what you want
- Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
- Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Jailbroken: How Does LLM Safety Training Fail?
- Jailbreaking Black Box Large Language Models in Twenty Queries
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
- MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- Red Teaming Language Models with Language Models
- GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
- Increased LLM Vulnerabilities from Fine-tuning and Quantization
- All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
Backdoor
- Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- Unleashing Cheapfakes through Trojan Plugins of Large Language Models
- “Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment
- Get Rid Of Your Trail: Remotely Erasing Backdoors in Federated Learning
- TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4
- BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP
- Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
- The Philosopher's Stone: Trojaning Plugins of Large Language Models
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
RAG
Multi-Modal
Others
- "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice
- Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and In-Context Learning
- Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models
- DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
- Proof-of-Learning: Definitions and Practice
Defense
- Adversarial Meta Prompt Tuning for Open Compound Domain Adaptive Intent Detection
- PromptCARE: Prompt Copyright Protection by Watermark Injection and Verificationn
- From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy
- Probing LLMs for hate speech detection: strengths and vulnerabilities
- Defending LLMs against Prompt Injection
- jackhhao/jailbreak-classifier
- Defending ChatGPT against jailbreak attack via self-reminders
- Safeguarding Crowdsourcing Surveys from ChatGPT with Prompt Injection
- RAIN: Your Language Models Can Align Themselves without Finetuning
- A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- A Survey of Adversarial Defences and Robustness in NLP
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- Detecting Language Model Attacks with Perplexity
- Certifying LLM Safety against Adversarial Prompting
- Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
- Intention Analysis Makes LLMs A Good Jailbreak Defender
- MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- Can LLMs Recognize Toxicity? Structured Toxicity Investigation Framework and Semantic-Based Metric
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- Explaining Toxic Text via Knowledge Enhanced Text Generation
- Your fairness may vary: Pretrained language model fairness in toxic text classification
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
LLM Vulnerability Evaluation
- Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models
- Demystifying RCE Vulnerabilities in LLM-Integrated Apps
- Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
- Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Datasets
- Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
- Automatic Construction of a Korean Toxic Instruction Dataset for Ethical Tuning of Large Language Models
- Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
- deepset/prompt-injections
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
- AnswerCarefully Dataset
- “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models