panda_guard.role.defenses package
Subpackages
- panda_guard.role.defenses.repe_utils package
- Submodules
- panda_guard.role.defenses.repe_utils.pipelines module
- panda_guard.role.defenses.repe_utils.rep_control_pipeline module
- panda_guard.role.defenses.repe_utils.rep_control_reading_vec module
WrappedBlockWrappedReadingVecModelWrappedReadingVecModel.forward()WrappedReadingVecModel.generate()WrappedReadingVecModel.get_activations()WrappedReadingVecModel.get_logits()WrappedReadingVecModel.is_wrapped()WrappedReadingVecModel.reset()WrappedReadingVecModel.run_prompt()WrappedReadingVecModel.set_controller()WrappedReadingVecModel.set_masks()WrappedReadingVecModel.unwrap()WrappedReadingVecModel.wrap()WrappedReadingVecModel.wrap_all()WrappedReadingVecModel.wrap_block()WrappedReadingVecModel.wrap_decoder_block()
- panda_guard.role.defenses.repe_utils.rep_readers module
- panda_guard.role.defenses.repe_utils.rep_reading_pipeline module
- Module contents
Submodules
panda_guard.role.defenses.back_translate module
- class panda_guard.role.defenses.back_translate.BackTranslationDefender(config: BackTranslationDefenderConfig)[source]
Bases:
BaseDefenderDefending LLMs against Jailbreaking Attacks via Backtranslation.
Reference: Wang, Yihan and Shi, Zhouxing and Bai, Andrew and Hsieh, Cho-Jui https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation
- Parameters:
config – Configuration for the BackTranslation Defender.
- class panda_guard.role.defenses.back_translate.BackTranslationDefenderConfig(defender_cls: str = 'BackTranslationDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, threshold: float = -2.0, infer_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, infer_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig = <factory>, return_new_response_anyway: bool = False)[source]
Bases:
BaseDefenderConfigConfiguration for the BackTranslation Defender.
- Parameters:
defender_cls – Class of the defender, default is “BackTranslationDefender”.
threshold – Threshold for determining the likelihood of harmful prompts.
infer_llm_config – Configuration for the LLM used for inference.
infer_llm_gen_config – Configuration for generating output with the inference LLM.
return_new_response_anyway – Whether to return the new response regardless of its content.
- defender_cls: str = 'BackTranslationDefender'
- infer_llm_config: BaseLLMConfig
- infer_llm_gen_config: LLMGenerateConfig
- return_new_response_anyway: bool = False
- threshold: float = -2.0
panda_guard.role.defenses.base module
- class panda_guard.role.defenses.base.BaseDefender(config: BaseDefenderConfig)[source]
Bases:
ABCAbstract Base Class for Defender.
- Parameters:
config – Configuration for the defender.
- class panda_guard.role.defenses.base.BaseDefenderConfig(defender_cls: str | None = None, defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None)[source]
Bases:
ABCConfiguration for the Base Defender.
- Parameters:
defender_cls – Class of the defender.
defender_name – Name of the defender.
target_llm_config – Configuration for the target language model.
target_llm_gen_config – Configuration for generating output with the target LLM.
- defender_cls: str = None
- defender_name: str = None
- target_llm_config: BaseLLMConfig
- target_llm_gen_config: LLMGenerateConfig = None
panda_guard.role.defenses.goal_priority module
- class panda_guard.role.defenses.goal_priority.GoalPriorityDefender(config: GoalPriorityDefenderConfig)[source]
Bases:
BaseDefenderGoal Priority Defender for handling attack prompts with goal prioritization. https://arxiv.org/abs/2311.09096
- Parameters:
config – Configuration object that defines the defender’s behavior.
- add_defense(attack_prompt, defense_type='priority')[source]
Generate a defense message based on the attack prompt and the specified defense type.
- Parameters:
attack_prompt – The attack prompt to defend against.
defense_type – The type of defense to apply.
- Returns:
A string containing the defense prompt.
- defense(messages=typing.List[typing.Dict[str, str]])[source]
Perform defense against an attack prompt by analyzing and responding based on goal prioritization.
- Parameters:
messages – List of message dictionaries where each dictionary contains ‘role’ and ‘content’ key.
- Returns:
List of message dictionaries with defense applied.
- class panda_guard.role.defenses.goal_priority.GoalPriorityDefenderConfig(defender_cls: str = 'GoalPriorityDefender', defender_name: str = 'GoalPriority', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, defense_type: str = 'priority')[source]
Bases:
BaseDefenderConfigGoal Priority Defender Configuration.
- Parameters:
defender_cls – Class name of the defender, default is “GoalPriorityDefender”.
defender_name – Name of the defender, default is “GoalPriority”.
target_llm_config – Configuration for the target LLM.
target_llm_gen_config – Configuration for the LLM generation process.
defense_type – Type of defense, default is “priority”.
- defender_cls: str = 'GoalPriorityDefender'
- defender_name: str = 'GoalPriority'
- defense_type: str = 'priority'
- target_llm_config: BaseLLMConfig
- target_llm_gen_config: LLMGenerateConfig = None
panda_guard.role.defenses.gradsafe module
- class panda_guard.role.defenses.gradsafe.GradSafeDefender(config: GradSafeDefenderConfig)[source]
Bases:
BaseDefenderGradSafe Defender class. Prevents the model from generating unsafe or unethical content using gradient norms and cosine similarity. https://arxiv.org/abs/2402.13494
- defense(messages: List[Dict[str, str]])[source]
Defend the model’s output to prevent unsafe content generation.
- Parameters:
messages – Input messages
- Returns:
Defended messages
- get_target_llm_loss(message: List[Dict[str, str]])[source]
Calculate the loss (log likelihood) for the target LLM.
- Parameters:
message – Input messages
- Returns:
Negative log likelihood loss
- class panda_guard.role.defenses.gradsafe.GradSafeDefenderConfig(defender_cls: str = 'GradSafeDefender', defender_name: str = 'GradSafe', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, unsafe_set: ~typing.List[str] = <factory>, safe_set: ~typing.List[str] = <factory>, unsafe_threshold: float = 0.8, verbose: bool = True)[source]
Bases:
BaseDefenderConfigConfiguration for the Base Defender.
- Parameters:
defender_cls – Class of the defender.
defender_name – Name of the defender.
target_llm_config – Configuration for the target language model.
target_llm_gen_config – Configuration for generating output with the target LLM.
- defender_cls: str = 'GradSafeDefender'
- defender_name: str = 'GradSafe'
- safe_set: List[str]
- target_llm_config: BaseLLMConfig
- target_llm_gen_config: LLMGenerateConfig = None
- unsafe_set: List[str]
- unsafe_threshold: float = 0.8
- verbose: bool = True
panda_guard.role.defenses.icl module
- class panda_guard.role.defenses.icl.IclDefender(config: IclDefenderConfig)[source]
Bases:
BaseDefenderIn-Context Learning (ICL) Defender that guards against harmful prompts using few in-context demonstrations.
Reference: Wei, Zeming, Yifei Wang, and Yisen Wang. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387 (2023).
- Parameters:
config – Configuration for the ICL Defender.
- class panda_guard.role.defenses.icl.IclDefenderConfig(defender_cls: str = 'IclDefender', defender_name: str = 'IclDefender', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, in_context_prompt: ~typing.List[~typing.Dict[str, str]] = <factory>)[source]
Bases:
BaseDefenderConfigConfiguration for the In-Context Learning (ICL) Defender.
- Parameters:
defender_cls – Class of the defender, default is “IclDefender”.
defender_name – Name of the defender, default is “IclDefender”.
in_context_prompt – In-context demonstration examples to guide the model’s behavior.
- defender_cls: str = 'IclDefender'
- defender_name: str = 'IclDefender'
- in_context_prompt: List[Dict[str, str]]
panda_guard.role.defenses.paraphrase module
- class panda_guard.role.defenses.paraphrase.ParaphraseDefender(config: ParaphraseDefenderConfig)[source]
Bases:
BaseDefenderParaphrase Defender that reformulates user inputs to mitigate harmful content.
Reference: Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
- Parameters:
config – Configuration for the Paraphrase Defender.
- class panda_guard.role.defenses.paraphrase.ParaphraseDefenderConfig(defender_cls: str = 'ParaphraseDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, paraphrase_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, paraphrase_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig = <factory>, paraphrase_prompt: str = 'paraphrase the following paragraph: \n"{prompt}"\n\n')[source]
Bases:
BaseDefenderConfigConfiguration for the Paraphrase Defender.
- Parameters:
defender_cls – Class of the defender, default is “ParaphraseDefender”.
paraphrase_llm_config – Configuration for the paraphrasing language model.
paraphrase_llm_gen_config – Configuration for generating output with the paraphrasing LLM.
paraphrase_prompt – Template for generating paraphrasing prompts.
- defender_cls: str = 'ParaphraseDefender'
- paraphrase_llm_config: BaseLLMConfig
- paraphrase_llm_gen_config: LLMGenerateConfig
- paraphrase_prompt: str = 'paraphrase the following paragraph: \n"{prompt}"\n\n'
panda_guard.role.defenses.perplexity_filter module
- class panda_guard.role.defenses.perplexity_filter.PerplexityFilterDefender(config: PerplexityFilterDefenderConfig)[source]
Bases:
BaseDefenderPerplexity Filter Defender that rejects responses with high perplexity values.
Reference: Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein. “Baseline Defenses for Adversarial Attacks Against Aligned Language Models.” arXiv preprint arXiv:2309.00614.
- Parameters:
config – Configuration for the Perplexity Filter Defender.
- class panda_guard.role.defenses.perplexity_filter.PerplexityFilterDefenderConfig(defender_cls: str = 'PerplexityFilterDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, threshold: float = 500, filter_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, filter_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig = <factory>)[source]
Bases:
BaseDefenderConfigConfiguration for the Perplexity Filter Defender.
- Parameters:
defender_cls – Class of the defender, default is “PerplexityFilterDefender”.
threshold – Threshold for perplexity to determine if a response should be rejected.
filter_llm_config – Configuration for the language model used for filtering.
filter_llm_gen_config – Configuration for generating output with the filtering LLM.
- defender_cls: str = 'PerplexityFilterDefender'
- filter_llm_config: BaseLLMConfig
- filter_llm_gen_config: LLMGenerateConfig
- threshold: float = 500
panda_guard.role.defenses.repe module
- class panda_guard.role.defenses.repe.RepeDefender(config: RepeDefenderConfig)[source]
Bases:
BaseDefenderRepe Defender class for mitigating harmful content by controlling model representations.
- Parameters:
config – Configuration for Repe Defender.
- calc_representing(dataset: Dataset) Tuple[Any, Any, Any][source]
Calculate the representation for the given dataset.
- Parameters:
dataset – Dataset to be used for representation calculations.
- Returns:
Representation reading pipeline, reader, and dataset.
- calc_significance() Tuple[List[Any], List[Any]][source]
Calculate the significance of each hidden layer in the model.
- Returns:
List of hidden layers and their corresponding significance values.
- calc_topk(x: Tensor, k: float) Tensor[source]
Calculate the top k activations based on the given selector method.
- Parameters:
x – Input tensor.
k – Top k percentage to select.
- Returns:
Masked tensor with top k activations.
- defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]
Apply the defense mechanism using representation control.
- Parameters:
messages – Input messages for defense.
- Returns:
Modified list of messages after applying the defense strategy.
- generate(messages: List[Dict[str, str]], config: LLMGenerateConfig) List[Dict[str, str]] | Tuple[List[Dict[str, str]], List[float]][source]
Generate responses based on input messages and configurations.
- Parameters:
messages – List of input messages.
config – Generation configuration for LLM.
- Returns:
Generated responses from the model.
Get the hidden layers to be used for control based on their significance.
- Parameters:
ctrl_hidden_layers – List of specified hidden layers.
ctrl_hidden_top_p – Top proportion of hidden layers to select.
- Returns:
Selected hidden layers and their significance.
- preprocess_dataset(dataset: Dataset) Dict[str, Dict[str, Any]][source]
Preprocess the dataset for representation calculations.
- Parameters:
dataset – Dataset to be preprocessed.
- Returns:
Preprocessed dataset dictionary.
- set_activations(ctrl_factor: float, topk: float | None = None, selector: str | None = None, ctrl_hidden_layers: List[int] | None = None) None[source]
Set the activations for controlling the model.
- Parameters:
ctrl_factor – Control factor affecting the representation.
topk – The top k percentage to select activations.
selector – Method to select the activations, “abs_max” or “random”.
ctrl_hidden_layers – Hidden layers for control.
- class panda_guard.role.defenses.repe.RepeDefenderConfig(defender_cls: str = 'Repe', defender_name: str = 'Repe', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, dataset: str = 'justinphan3110/harmful_harmless_instructions', dataset_args: ~typing.List[str] = <factory>, system_template: str = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.", prompt_template: str = 'Remember, you should be a responsible language model and should not generate harmful or misleading content!\n{content}\n', direction_method: str = 'pca', rep_token: int = -1, ctrl_method: str = 'reading_vec', ctrl_block_name: str = 'decoder_block', ctrl_hidden_layers: ~typing.List[int] | None = None, ctrl_hidden_top_p: float = 0.375, ctrl_factor: float = 1.0, ctrl_batch_size: int = 2, topk: float = 0.0, selector: str = 'abs_max')[source]
Bases:
BaseDefenderConfigConfiguration for the Repe Defender.
- Parameters:
defender_cls – Class of the defender, default is “Repe”.
defender_name – Name of the defender, default is “Repe”.
dataset – Dataset name used for the training of Repe.
dataset_args – Additional arguments for the dataset.
system_template – Template for the system message.
prompt_template – Template for user prompts.
direction_method – Method for determining direction (e.g., “pca”).
rep_token – Token used for representation.
ctrl_method – Method used for controlling representations.
ctrl_block_name – Name of the control block in the LLM model.
ctrl_hidden_layers – Hidden layers for control.
ctrl_hidden_top_p – Top proportion of hidden layers used for control.
ctrl_factor – Control factor affecting the representation.
ctrl_batch_size – Batch size used during control operations.
topk – The top k percentage to select activations.
selector – Method to select the activations, “abs_max” or “random”.
- ctrl_batch_size: int = 2
- ctrl_block_name: str = 'decoder_block'
- ctrl_factor: float = 1.0
- ctrl_method: str = 'reading_vec'
- dataset: str = 'justinphan3110/harmful_harmless_instructions'
- dataset_args: List[str]
- defender_cls: str = 'Repe'
- defender_name: str = 'Repe'
- direction_method: str = 'pca'
- prompt_template: str = 'Remember, you should be a responsible language model and should not generate harmful or misleading content!\n{content}\n'
- rep_token: int = -1
- selector: str = 'abs_max'
- system_template: str = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
- topk: float = 0.0
panda_guard.role.defenses.rewrite module
- class panda_guard.role.defenses.rewrite.NoneDefender(config: NoneDefenderConfig)[source]
Bases:
BaseDefenderNoneDefender class that performs no modifications to the messages, simply passes them through.
- Parameters:
config – Configuration for NoneDefender.
- class panda_guard.role.defenses.rewrite.NoneDefenderConfig(defender_cls: str = 'NoneDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None)[source]
Bases:
BaseDefenderConfigConfiguration for NoneDefender, which performs no defense and simply passes the messages through.
- Parameters:
defender_cls – Class of the defender, default is “NoneDefender”.
defender_name – Name of the defender, can be None.
- defender_cls: str = 'NoneDefender'
- defender_name: str = None
- class panda_guard.role.defenses.rewrite.RewriteDefender(config: RewriteDefenderConfig)[source]
Bases:
BaseDefenderRewriteDefender class that rewrites prompts based on given templates to mitigate harmful content.
- Parameters:
config – Configuration for RewriteDefender.
- class panda_guard.role.defenses.rewrite.RewriteDefenderConfig(defender_cls: str = 'RewriteDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, system_template: str | None = None, prompt_template: str | None = None)[source]
Bases:
BaseDefenderConfigConfiguration for RewriteDefender, which rewrites prompts based on given templates.
- Parameters:
defender_cls – Class of the defender, default is “RewriteDefender”.
system_template – Template for the system message.
prompt_template – Template for the user prompts.
- defender_cls: str = 'RewriteDefender'
- prompt_template: str = None
- system_template: str = None
- class panda_guard.role.defenses.rewrite.SelfReminderDefender(config: SelfReminderDefenderConfig)[source]
Bases:
RewriteDefenderSelfReminderDefender class that adds self-reminders to prompts to enhance responsible responses.
- Parameters:
config – Configuration for SelfReminderDefender.
- class panda_guard.role.defenses.rewrite.SelfReminderDefenderConfig(defender_cls: str = 'SelfReminderDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, system_template: str = '{content}You are a responsible language model and good at avoiding generating harmful or misleading content! Please answer the following user query in a responsible way.', prompt_template: str = 'Remember, you should be a responsible language model and should not generate harmful or misleading content!\n{content}\n')[source]
Bases:
RewriteDefenderConfigConfiguration for SelfReminderDefender, a specific type of RewriteDefender that adds self-reminders.
- Parameters:
defender_cls – Class of the defender, default is “SelfReminderDefender”.
system_template – Template for the system message with a self-reminder.
prompt_template – Template for the user prompts with reminders.
- defender_cls: str = 'SelfReminderDefender'
- prompt_template: str = 'Remember, you should be a responsible language model and should not generate harmful or misleading content!\n{content}\n'
- system_template: str = '{content}You are a responsible language model and good at avoiding generating harmful or misleading content! Please answer the following user query in a responsible way.'
panda_guard.role.defenses.rpo module
- class panda_guard.role.defenses.rpo.RPODefender(config: RPODefenderConfig)[source]
Bases:
BaseDefenderRPODefender that guards against harmful prompts using defensive suffix. Reference:https://arxiv.org/abs/2401.17263
- Parameters:
config (RPODefenderConfig) – Configuration object containing parameters for the defender.
- class panda_guard.role.defenses.rpo.RPODefenderConfig(defender_cls: str = 'RPODefender', defender_name: str = 'RPO', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None)[source]
Bases:
BaseDefenderConfigConfiguration for the Base Defender.
- Parameters:
defender_cls (str) – Class of the defender.
defender_name (str) – Name of the defender.
target_llm_config (BaseLLMConfig) – Configuration for the target language model.
target_llm_gen_config (LLMGenerateConfig) – Configuration for generating output with the target LLM.
- defender_cls: str = 'RPODefender'
- defender_name: str = 'RPO'
- target_llm_config: BaseLLMConfig
- target_llm_gen_config: LLMGenerateConfig = None
panda_guard.role.defenses.self_defense module
- class panda_guard.role.defenses.self_defense.SelfDefenseDefender(config: SelfDefenseDefenderConfig)[source]
Bases:
BaseDefenderSelfDefenseDefender implements the defense mechanism for the attack, generating responses and judging their appropriateness. https://arxiv.org/abs/2308.07308
- defense(messages=typing.List[typing.Dict[str, str]])[source]
Perform defense by generating a response, judging it, and applying defense action.
- Parameters:
messages (List[Dict[str, str]]) – The input messages that are subject to defense.
- Returns:
The final defense response after evaluating the defense.
- Return type:
List[Dict[str, str]]
- class panda_guard.role.defenses.self_defense.SelfDefenseDefenderConfig(defender_cls: str = 'SelfDefenseDefender', defender_name: str = 'SelfDefense', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, judge_config: ~panda_guard.role.judges.base.BaseJudgeConfig = <factory>, judge_score_threshold: int = 5)[source]
Bases:
BaseDefenderConfigConfiguration for the Base Defender.
- Parameters:
defender_cls (str) – Class of the defender.
defender_name (str) – Name of the defender.
target_llm_config (BaseLLMConfig) – Configuration for the target language model.
target_llm_gen_config (LLMGenerateConfig) – Configuration for generating output with the target LLM.
judge_config (BaseJudgeConfig) – Configuration for the judge, used to assess the defense.
judge_score_threshold (int) – The threshold score for the judge to decide the defense action.
- defender_cls: str = 'SelfDefenseDefender'
- defender_name: str = 'SelfDefense'
- judge_config: BaseJudgeConfig
- judge_score_threshold: int = 5
- target_llm_config: BaseLLMConfig
- target_llm_gen_config: LLMGenerateConfig = None
panda_guard.role.defenses.semantic_smoothllm module
- class panda_guard.role.defenses.semantic_smoothllm.SemanticSmoothLLMDefender(config: SemanticSmoothLLMDefenderConfig)[source]
Bases:
BaseDefenderSemanticSmoothLLMDefender applies semantic smoothing to defend against jailbreak attacks.
Based on “Smoothllm: Defending large language models against jailbreaking attacks” by Robey et al. (2023). Paper link: https://arxiv.org/abs/2310.03684
- Parameters:
config – Configuration for SemanticSmoothLLMDefender.
- defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]
Apply semantic smoothing defense to the given messages.
- Parameters:
messages – List of messages to defend against jailbreak attacks.
- Returns:
List of messages after applying semantic smoothing.
- extract_res(outputs: List[str]) List[str][source]
Extract the actual response content from LLM-generated outputs.
- Parameters:
outputs – List of LLM-generated outputs.
- Returns:
List of extracted responses.
- class panda_guard.role.defenses.semantic_smoothllm.SemanticSmoothLLMDefenderConfig(defender_cls: str = 'SemanticSmoothLLMDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, perturbation_type: str = 'random', num_samples: int = 3, batch_size: int = 1, perturbation_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, perturbation_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig = <factory>)[source]
Bases:
BaseDefenderConfigConfiguration for SemanticSmoothLLMDefender.
- Parameters:
defender_cls – Class of the defender, default is “SemanticSmoothLLMDefender”.
perturbation_type – Type of perturbation to apply, default is ‘random’.
num_samples – Number of perturbed samples to generate, default is 3.
batch_size – Batch size for processing, default is 1.
perturbation_llm_config – Configuration for the LLM used for perturbation generation.
perturbation_llm_gen_config – Configuration for generation settings for the perturbation LLM.
- batch_size: int = 1
- defender_cls: str = 'SemanticSmoothLLMDefender'
- num_samples: int = 3
- perturbation_llm_config: BaseLLMConfig
- perturbation_llm_gen_config: LLMGenerateConfig
- perturbation_type: str = 'random'
panda_guard.role.defenses.smoothllm module
- class panda_guard.role.defenses.smoothllm.SmoothLLMDefender(config: SmoothLLMDefenderConfig)[source]
Bases:
BaseDefenderSmoothLLMDefender applies perturbations to defend against jailbreak attacks.
Based on “Smoothllm: Defending large language models against jailbreaking attacks” by Robey et al. (2023). Paper link: https://arxiv.org/abs/2310.03684
- Parameters:
config – Configuration for SmoothLLMDefender.
- defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]
Apply SmoothLLM defense by generating multiple perturbed versions of the user’s message and analyzing the responses.
- Parameters:
messages – List of messages to defend against jailbreak attacks.
- Returns:
List of messages after applying SmoothLLM defense.
- class panda_guard.role.defenses.smoothllm.SmoothLLMDefenderConfig(defender_cls: str = 'SmoothLLMDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, perturbation_type: str = 'swap', perturbation_ratio: float = 0.1, num_perturbations: int = 3, batch_inference: bool = True)[source]
Bases:
BaseDefenderConfigConfiguration for SmoothLLMDefender.
- Parameters:
defender_cls – Class of the defender, default is “SmoothLLMDefender”.
perturbation_type – Type of perturbation to apply, default is “swap”.
perturbation_ratio – Ratio of the prompt to perturb, default is 0.1.
num_perturbations – Number of perturbed prompts to generate, default is 3.
batch_inference – Boolean flag indicating whether batch inference should be used, default is True.
- batch_inference: bool = True
- defender_cls: str = 'SmoothLLMDefender'
- num_perturbations: int = 3
- perturbation_ratio: float = 0.1
- perturbation_type: str = 'swap'
Module contents
- panda_guard.role.defenses.create_defender(config: BaseDefenderConfig) BaseDefender[source]