panda_guard.role.defenses package

Subpackages

Submodules

panda_guard.role.defenses.back_translate module

class panda_guard.role.defenses.back_translate.BackTranslationDefender(config: BackTranslationDefenderConfig)[source]

Bases: BaseDefender

Defending LLMs against Jailbreaking Attacks via Backtranslation.

Reference: Wang, Yihan and Shi, Zhouxing and Bai, Andrew and Hsieh, Cho-Jui https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation

Parameters:

config – Configuration for the BackTranslation Defender.

defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Execute the defense mechanism by utilizing backtranslation techniques.

Parameters:

messages – List of input messages.

Returns:

Modified list of messages after applying the defense strategy.

class panda_guard.role.defenses.back_translate.BackTranslationDefenderConfig(defender_cls: str = 'BackTranslationDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, threshold: float = -2.0, infer_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, infer_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig = <factory>, return_new_response_anyway: bool = False)[source]

Bases: BaseDefenderConfig

Configuration for the BackTranslation Defender.

Parameters:
  • defender_cls – Class of the defender, default is “BackTranslationDefender”.

  • threshold – Threshold for determining the likelihood of harmful prompts.

  • infer_llm_config – Configuration for the LLM used for inference.

  • infer_llm_gen_config – Configuration for generating output with the inference LLM.

  • return_new_response_anyway – Whether to return the new response regardless of its content.

defender_cls: str = 'BackTranslationDefender'
infer_llm_config: BaseLLMConfig
infer_llm_gen_config: LLMGenerateConfig
return_new_response_anyway: bool = False
threshold: float = -2.0

panda_guard.role.defenses.base module

class panda_guard.role.defenses.base.BaseDefender(config: BaseDefenderConfig)[source]

Bases: ABC

Abstract Base Class for Defender.

Parameters:

config – Configuration for the defender.

abstract defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Abstract method to execute defense on a sequence of messages.

Parameters:

messages – List of input messages.

Returns:

Modified list of messages after applying the defense strategy.

class panda_guard.role.defenses.base.BaseDefenderConfig(defender_cls: str | None = None, defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None)[source]

Bases: ABC

Configuration for the Base Defender.

Parameters:
  • defender_cls – Class of the defender.

  • defender_name – Name of the defender.

  • target_llm_config – Configuration for the target language model.

  • target_llm_gen_config – Configuration for generating output with the target LLM.

defender_cls: str = None
defender_name: str = None
target_llm_config: BaseLLMConfig
target_llm_gen_config: LLMGenerateConfig = None

panda_guard.role.defenses.goal_priority module

class panda_guard.role.defenses.goal_priority.GoalPriorityDefender(config: GoalPriorityDefenderConfig)[source]

Bases: BaseDefender

Goal Priority Defender for handling attack prompts with goal prioritization. https://arxiv.org/abs/2311.09096

Parameters:

config – Configuration object that defines the defender’s behavior.

add_defense(attack_prompt, defense_type='priority')[source]

Generate a defense message based on the attack prompt and the specified defense type.

Parameters:
  • attack_prompt – The attack prompt to defend against.

  • defense_type – The type of defense to apply.

Returns:

A string containing the defense prompt.

defense(messages=typing.List[typing.Dict[str, str]])[source]

Perform defense against an attack prompt by analyzing and responding based on goal prioritization.

Parameters:

messages – List of message dictionaries where each dictionary contains ‘role’ and ‘content’ key.

Returns:

List of message dictionaries with defense applied.

class panda_guard.role.defenses.goal_priority.GoalPriorityDefenderConfig(defender_cls: str = 'GoalPriorityDefender', defender_name: str = 'GoalPriority', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, defense_type: str = 'priority')[source]

Bases: BaseDefenderConfig

Goal Priority Defender Configuration.

Parameters:
  • defender_cls – Class name of the defender, default is “GoalPriorityDefender”.

  • defender_name – Name of the defender, default is “GoalPriority”.

  • target_llm_config – Configuration for the target LLM.

  • target_llm_gen_config – Configuration for the LLM generation process.

  • defense_type – Type of defense, default is “priority”.

defender_cls: str = 'GoalPriorityDefender'
defender_name: str = 'GoalPriority'
defense_type: str = 'priority'
target_llm_config: BaseLLMConfig
target_llm_gen_config: LLMGenerateConfig = None

panda_guard.role.defenses.gradsafe module

class panda_guard.role.defenses.gradsafe.GradSafeDefender(config: GradSafeDefenderConfig)[source]

Bases: BaseDefender

GradSafe Defender class. Prevents the model from generating unsafe or unethical content using gradient norms and cosine similarity. https://arxiv.org/abs/2402.13494

defense(messages: List[Dict[str, str]])[source]

Defend the model’s output to prevent unsafe content generation.

Parameters:

messages – Input messages

Returns:

Defended messages

get_target_llm_loss(message: List[Dict[str, str]])[source]

Calculate the loss (log likelihood) for the target LLM.

Parameters:

message – Input messages

Returns:

Negative log likelihood loss

log(tag: str, msg)[source]

Log detailed information during the attack process.

Parameters:
  • tag (str) – The label for the log message.

  • msg (str) – The log message.

response(unsafe_score: float, messgaes: List[Dict[str, str]])[source]

Decide whether to return a safe response based on the unsafe score.

Parameters:
  • unsafe_score – Unsafe score

  • messgaes – Input messages

Returns:

Safe or unsafe response

class panda_guard.role.defenses.gradsafe.GradSafeDefenderConfig(defender_cls: str = 'GradSafeDefender', defender_name: str = 'GradSafe', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, unsafe_set: ~typing.List[str] = <factory>, safe_set: ~typing.List[str] = <factory>, unsafe_threshold: float = 0.8, verbose: bool = True)[source]

Bases: BaseDefenderConfig

Configuration for the Base Defender.

Parameters:
  • defender_cls – Class of the defender.

  • defender_name – Name of the defender.

  • target_llm_config – Configuration for the target language model.

  • target_llm_gen_config – Configuration for generating output with the target LLM.

defender_cls: str = 'GradSafeDefender'
defender_name: str = 'GradSafe'
safe_set: List[str]
target_llm_config: BaseLLMConfig
target_llm_gen_config: LLMGenerateConfig = None
unsafe_set: List[str]
unsafe_threshold: float = 0.8
verbose: bool = True
panda_guard.role.defenses.gradsafe.get_default_safe_set()[source]

Get the default set of safe content.

Returns:

A list of default safe content.

Return type:

list

panda_guard.role.defenses.gradsafe.get_default_unsafe_set()[source]

Get the default set of unsafe content.

Returns:

A list of default unsafe content.

Return type:

list

panda_guard.role.defenses.icl module

class panda_guard.role.defenses.icl.IclDefender(config: IclDefenderConfig)[source]

Bases: BaseDefender

In-Context Learning (ICL) Defender that guards against harmful prompts using few in-context demonstrations.

Reference: Wei, Zeming, Yifei Wang, and Yisen Wang. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387 (2023).

Parameters:

config – Configuration for the ICL Defender.

defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Apply the in-context learning defense by adding demonstration examples to the conversation.

Parameters:

messages – List of input messages.

Returns:

Modified list of messages after applying the defense strategy.

class panda_guard.role.defenses.icl.IclDefenderConfig(defender_cls: str = 'IclDefender', defender_name: str = 'IclDefender', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, in_context_prompt: ~typing.List[~typing.Dict[str, str]] = <factory>)[source]

Bases: BaseDefenderConfig

Configuration for the In-Context Learning (ICL) Defender.

Parameters:
  • defender_cls – Class of the defender, default is “IclDefender”.

  • defender_name – Name of the defender, default is “IclDefender”.

  • in_context_prompt – In-context demonstration examples to guide the model’s behavior.

defender_cls: str = 'IclDefender'
defender_name: str = 'IclDefender'
in_context_prompt: List[Dict[str, str]]

panda_guard.role.defenses.paraphrase module

class panda_guard.role.defenses.paraphrase.ParaphraseDefender(config: ParaphraseDefenderConfig)[source]

Bases: BaseDefender

Paraphrase Defender that reformulates user inputs to mitigate harmful content.

Reference: Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.

Parameters:

config – Configuration for the Paraphrase Defender.

defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Execute the defense mechanism by paraphrasing the latest user input.

Parameters:

messages – List of input messages.

Returns:

Modified list of messages after applying the defense strategy.

class panda_guard.role.defenses.paraphrase.ParaphraseDefenderConfig(defender_cls: str = 'ParaphraseDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, paraphrase_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, paraphrase_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig = <factory>, paraphrase_prompt: str = 'paraphrase the following paragraph: \n"{prompt}"\n\n')[source]

Bases: BaseDefenderConfig

Configuration for the Paraphrase Defender.

Parameters:
  • defender_cls – Class of the defender, default is “ParaphraseDefender”.

  • paraphrase_llm_config – Configuration for the paraphrasing language model.

  • paraphrase_llm_gen_config – Configuration for generating output with the paraphrasing LLM.

  • paraphrase_prompt – Template for generating paraphrasing prompts.

defender_cls: str = 'ParaphraseDefender'
paraphrase_llm_config: BaseLLMConfig
paraphrase_llm_gen_config: LLMGenerateConfig
paraphrase_prompt: str = 'paraphrase the following paragraph: \n"{prompt}"\n\n'

panda_guard.role.defenses.perplexity_filter module

class panda_guard.role.defenses.perplexity_filter.PerplexityFilterDefender(config: PerplexityFilterDefenderConfig)[source]

Bases: BaseDefender

Perplexity Filter Defender that rejects responses with high perplexity values.

Reference: Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein. “Baseline Defenses for Adversarial Attacks Against Aligned Language Models.” arXiv preprint arXiv:2309.00614.

Parameters:

config – Configuration for the Perplexity Filter Defender.

defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Execute the defense mechanism by filtering responses based on perplexity.

Parameters:

messages – List of input messages.

Returns:

Modified list of messages after applying the defense strategy.

class panda_guard.role.defenses.perplexity_filter.PerplexityFilterDefenderConfig(defender_cls: str = 'PerplexityFilterDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, threshold: float = 500, filter_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, filter_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig = <factory>)[source]

Bases: BaseDefenderConfig

Configuration for the Perplexity Filter Defender.

Parameters:
  • defender_cls – Class of the defender, default is “PerplexityFilterDefender”.

  • threshold – Threshold for perplexity to determine if a response should be rejected.

  • filter_llm_config – Configuration for the language model used for filtering.

  • filter_llm_gen_config – Configuration for generating output with the filtering LLM.

defender_cls: str = 'PerplexityFilterDefender'
filter_llm_config: BaseLLMConfig
filter_llm_gen_config: LLMGenerateConfig
threshold: float = 500

panda_guard.role.defenses.repe module

class panda_guard.role.defenses.repe.RepeDefender(config: RepeDefenderConfig)[source]

Bases: BaseDefender

Repe Defender class for mitigating harmful content by controlling model representations.

Parameters:

config – Configuration for Repe Defender.

calc_representing(dataset: Dataset) Tuple[Any, Any, Any][source]

Calculate the representation for the given dataset.

Parameters:

dataset – Dataset to be used for representation calculations.

Returns:

Representation reading pipeline, reader, and dataset.

calc_significance() Tuple[List[Any], List[Any]][source]

Calculate the significance of each hidden layer in the model.

Returns:

List of hidden layers and their corresponding significance values.

calc_topk(x: Tensor, k: float) Tensor[source]

Calculate the top k activations based on the given selector method.

Parameters:
  • x – Input tensor.

  • k – Top k percentage to select.

Returns:

Masked tensor with top k activations.

defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Apply the defense mechanism using representation control.

Parameters:

messages – Input messages for defense.

Returns:

Modified list of messages after applying the defense strategy.

generate(messages: List[Dict[str, str]], config: LLMGenerateConfig) List[Dict[str, str]] | Tuple[List[Dict[str, str]], List[float]][source]

Generate responses based on input messages and configurations.

Parameters:
  • messages – List of input messages.

  • config – Generation configuration for LLM.

Returns:

Generated responses from the model.

get_ctrl_hidden_layers(ctrl_hidden_layers: List[int] | None, ctrl_hidden_top_p: float | None) Tuple[List[int], Tuple][source]

Get the hidden layers to be used for control based on their significance.

Parameters:
  • ctrl_hidden_layers – List of specified hidden layers.

  • ctrl_hidden_top_p – Top proportion of hidden layers to select.

Returns:

Selected hidden layers and their significance.

preprocess_dataset(dataset: Dataset) Dict[str, Dict[str, Any]][source]

Preprocess the dataset for representation calculations.

Parameters:

dataset – Dataset to be preprocessed.

Returns:

Preprocessed dataset dictionary.

set_activations(ctrl_factor: float, topk: float | None = None, selector: str | None = None, ctrl_hidden_layers: List[int] | None = None) None[source]

Set the activations for controlling the model.

Parameters:
  • ctrl_factor – Control factor affecting the representation.

  • topk – The top k percentage to select activations.

  • selector – Method to select the activations, “abs_max” or “random”.

  • ctrl_hidden_layers – Hidden layers for control.

class panda_guard.role.defenses.repe.RepeDefenderConfig(defender_cls: str = 'Repe', defender_name: str = 'Repe', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, dataset: str = 'justinphan3110/harmful_harmless_instructions', dataset_args: ~typing.List[str] = <factory>, system_template: str = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.", prompt_template: str = 'Remember, you should be a responsible language model and should not generate harmful or misleading content!\n{content}\n', direction_method: str = 'pca', rep_token: int = -1, ctrl_method: str = 'reading_vec', ctrl_block_name: str = 'decoder_block', ctrl_hidden_layers: ~typing.List[int] | None = None, ctrl_hidden_top_p: float = 0.375, ctrl_factor: float = 1.0, ctrl_batch_size: int = 2, topk: float = 0.0, selector: str = 'abs_max')[source]

Bases: BaseDefenderConfig

Configuration for the Repe Defender.

Parameters:
  • defender_cls – Class of the defender, default is “Repe”.

  • defender_name – Name of the defender, default is “Repe”.

  • dataset – Dataset name used for the training of Repe.

  • dataset_args – Additional arguments for the dataset.

  • system_template – Template for the system message.

  • prompt_template – Template for user prompts.

  • direction_method – Method for determining direction (e.g., “pca”).

  • rep_token – Token used for representation.

  • ctrl_method – Method used for controlling representations.

  • ctrl_block_name – Name of the control block in the LLM model.

  • ctrl_hidden_layers – Hidden layers for control.

  • ctrl_hidden_top_p – Top proportion of hidden layers used for control.

  • ctrl_factor – Control factor affecting the representation.

  • ctrl_batch_size – Batch size used during control operations.

  • topk – The top k percentage to select activations.

  • selector – Method to select the activations, “abs_max” or “random”.

ctrl_batch_size: int = 2
ctrl_block_name: str = 'decoder_block'
ctrl_factor: float = 1.0
ctrl_hidden_layers: List[int] = None
ctrl_hidden_top_p: float = 0.375
ctrl_method: str = 'reading_vec'
dataset: str = 'justinphan3110/harmful_harmless_instructions'
dataset_args: List[str]
defender_cls: str = 'Repe'
defender_name: str = 'Repe'
direction_method: str = 'pca'
prompt_template: str = 'Remember, you should be a responsible language model and should not generate harmful or misleading content!\n{content}\n'
rep_token: int = -1
selector: str = 'abs_max'
system_template: str = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
topk: float = 0.0

panda_guard.role.defenses.rewrite module

class panda_guard.role.defenses.rewrite.NoneDefender(config: NoneDefenderConfig)[source]

Bases: BaseDefender

NoneDefender class that performs no modifications to the messages, simply passes them through.

Parameters:

config – Configuration for NoneDefender.

defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Pass the messages through without modifications.

Parameters:

messages – Input messages for defense.

Returns:

List of messages after no modifications.

class panda_guard.role.defenses.rewrite.NoneDefenderConfig(defender_cls: str = 'NoneDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None)[source]

Bases: BaseDefenderConfig

Configuration for NoneDefender, which performs no defense and simply passes the messages through.

Parameters:
  • defender_cls – Class of the defender, default is “NoneDefender”.

  • defender_name – Name of the defender, can be None.

defender_cls: str = 'NoneDefender'
defender_name: str = None
class panda_guard.role.defenses.rewrite.RewriteDefender(config: RewriteDefenderConfig)[source]

Bases: BaseDefender

RewriteDefender class that rewrites prompts based on given templates to mitigate harmful content.

Parameters:

config – Configuration for RewriteDefender.

defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Apply the rewrite defense strategy using provided templates.

Parameters:

messages – Input messages for defense.

Returns:

Modified list of messages after applying the rewrite strategy.

class panda_guard.role.defenses.rewrite.RewriteDefenderConfig(defender_cls: str = 'RewriteDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, system_template: str | None = None, prompt_template: str | None = None)[source]

Bases: BaseDefenderConfig

Configuration for RewriteDefender, which rewrites prompts based on given templates.

Parameters:
  • defender_cls – Class of the defender, default is “RewriteDefender”.

  • system_template – Template for the system message.

  • prompt_template – Template for the user prompts.

defender_cls: str = 'RewriteDefender'
prompt_template: str = None
system_template: str = None
class panda_guard.role.defenses.rewrite.SelfReminderDefender(config: SelfReminderDefenderConfig)[source]

Bases: RewriteDefender

SelfReminderDefender class that adds self-reminders to prompts to enhance responsible responses.

Parameters:

config – Configuration for SelfReminderDefender.

class panda_guard.role.defenses.rewrite.SelfReminderDefenderConfig(defender_cls: str = 'SelfReminderDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, system_template: str = '{content}You are a responsible language model and good at avoiding generating harmful or misleading content! Please answer the following user query in a responsible way.', prompt_template: str = 'Remember, you should be a responsible language model and should not generate harmful or misleading content!\n{content}\n')[source]

Bases: RewriteDefenderConfig

Configuration for SelfReminderDefender, a specific type of RewriteDefender that adds self-reminders.

Parameters:
  • defender_cls – Class of the defender, default is “SelfReminderDefender”.

  • system_template – Template for the system message with a self-reminder.

  • prompt_template – Template for the user prompts with reminders.

defender_cls: str = 'SelfReminderDefender'
prompt_template: str = 'Remember, you should be a responsible language model and should not generate harmful or misleading content!\n{content}\n'
system_template: str = '{content}You are a responsible language model and good at avoiding generating harmful or misleading content! Please answer the following user query in a responsible way.'

panda_guard.role.defenses.rpo module

class panda_guard.role.defenses.rpo.RPODefender(config: RPODefenderConfig)[source]

Bases: BaseDefender

RPODefender that guards against harmful prompts using defensive suffix. Reference:https://arxiv.org/abs/2401.17263

Parameters:

config (RPODefenderConfig) – Configuration object containing parameters for the defender.

defense(messages=typing.List[typing.Dict[str, str]])[source]

Perform defense by adding defensive suffix.

Parameters:

messages (List[Dict[str, str]]) – The input messages that are subject to defense.

Returns:

The final defense response after evaluating the defense.

Return type:

List[Dict[str, str]]

class panda_guard.role.defenses.rpo.RPODefenderConfig(defender_cls: str = 'RPODefender', defender_name: str = 'RPO', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None)[source]

Bases: BaseDefenderConfig

Configuration for the Base Defender.

Parameters:
  • defender_cls (str) – Class of the defender.

  • defender_name (str) – Name of the defender.

  • target_llm_config (BaseLLMConfig) – Configuration for the target language model.

  • target_llm_gen_config (LLMGenerateConfig) – Configuration for generating output with the target LLM.

defender_cls: str = 'RPODefender'
defender_name: str = 'RPO'
target_llm_config: BaseLLMConfig
target_llm_gen_config: LLMGenerateConfig = None

panda_guard.role.defenses.self_defense module

class panda_guard.role.defenses.self_defense.SelfDefenseDefender(config: SelfDefenseDefenderConfig)[source]

Bases: BaseDefender

SelfDefenseDefender implements the defense mechanism for the attack, generating responses and judging their appropriateness. https://arxiv.org/abs/2308.07308

defense(messages=typing.List[typing.Dict[str, str]])[source]

Perform defense by generating a response, judging it, and applying defense action.

Parameters:

messages (List[Dict[str, str]]) – The input messages that are subject to defense.

Returns:

The final defense response after evaluating the defense.

Return type:

List[Dict[str, str]]

response(message: List[Dict[str, str]])[source]

Generate a response from the target language model.

Parameters:

message (List[Dict[str, str]]) – The input message to be processed by the target language model.

Returns:

Generated response based on the input message.

Return type:

List[Dict[str, str]]

class panda_guard.role.defenses.self_defense.SelfDefenseDefenderConfig(defender_cls: str = 'SelfDefenseDefender', defender_name: str = 'SelfDefense', target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, judge_config: ~panda_guard.role.judges.base.BaseJudgeConfig = <factory>, judge_score_threshold: int = 5)[source]

Bases: BaseDefenderConfig

Configuration for the Base Defender.

Parameters:
  • defender_cls (str) – Class of the defender.

  • defender_name (str) – Name of the defender.

  • target_llm_config (BaseLLMConfig) – Configuration for the target language model.

  • target_llm_gen_config (LLMGenerateConfig) – Configuration for generating output with the target LLM.

  • judge_config (BaseJudgeConfig) – Configuration for the judge, used to assess the defense.

  • judge_score_threshold (int) – The threshold score for the judge to decide the defense action.

defender_cls: str = 'SelfDefenseDefender'
defender_name: str = 'SelfDefense'
judge_config: BaseJudgeConfig
judge_score_threshold: int = 5
target_llm_config: BaseLLMConfig
target_llm_gen_config: LLMGenerateConfig = None

panda_guard.role.defenses.semantic_smoothllm module

class panda_guard.role.defenses.semantic_smoothllm.SemanticSmoothLLMDefender(config: SemanticSmoothLLMDefenderConfig)[source]

Bases: BaseDefender

SemanticSmoothLLMDefender applies semantic smoothing to defend against jailbreak attacks.

Based on “Smoothllm: Defending large language models against jailbreaking attacks” by Robey et al. (2023). Paper link: https://arxiv.org/abs/2310.03684

Parameters:

config – Configuration for SemanticSmoothLLMDefender.

defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Apply semantic smoothing defense to the given messages.

Parameters:

messages – List of messages to defend against jailbreak attacks.

Returns:

List of messages after applying semantic smoothing.

extract_res(outputs: List[str]) List[str][source]

Extract the actual response content from LLM-generated outputs.

Parameters:

outputs – List of LLM-generated outputs.

Returns:

List of extracted responses.

perturb(perturbation_type: str, harmful_prompt: str) str[source]

Apply a specific type of perturbation to the harmful prompt.

Parameters:
  • perturbation_type – The type of perturbation to apply.

  • harmful_prompt – The harmful prompt to be perturbed.

Returns:

The perturbed prompt.

perturb_with_llm(template: str, harmful_prompt: str) str[source]

Use the LLM to generate a perturbed version of the harmful prompt.

Parameters:
  • template – The template to use for perturbation.

  • harmful_prompt – The harmful prompt to be perturbed.

Returns:

The perturbed prompt generated by the LLM.

class panda_guard.role.defenses.semantic_smoothllm.SemanticSmoothLLMDefenderConfig(defender_cls: str = 'SemanticSmoothLLMDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, perturbation_type: str = 'random', num_samples: int = 3, batch_size: int = 1, perturbation_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, perturbation_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig = <factory>)[source]

Bases: BaseDefenderConfig

Configuration for SemanticSmoothLLMDefender.

Parameters:
  • defender_cls – Class of the defender, default is “SemanticSmoothLLMDefender”.

  • perturbation_type – Type of perturbation to apply, default is ‘random’.

  • num_samples – Number of perturbed samples to generate, default is 3.

  • batch_size – Batch size for processing, default is 1.

  • perturbation_llm_config – Configuration for the LLM used for perturbation generation.

  • perturbation_llm_gen_config – Configuration for generation settings for the perturbation LLM.

batch_size: int = 1
defender_cls: str = 'SemanticSmoothLLMDefender'
num_samples: int = 3
perturbation_llm_config: BaseLLMConfig
perturbation_llm_gen_config: LLMGenerateConfig
perturbation_type: str = 'random'

panda_guard.role.defenses.smoothllm module

class panda_guard.role.defenses.smoothllm.SmoothLLMDefender(config: SmoothLLMDefenderConfig)[source]

Bases: BaseDefender

SmoothLLMDefender applies perturbations to defend against jailbreak attacks.

Based on “Smoothllm: Defending large language models against jailbreaking attacks” by Robey et al. (2023). Paper link: https://arxiv.org/abs/2310.03684

Parameters:

config – Configuration for SmoothLLMDefender.

defense(messages: List[Dict[str, str]] | None = None) List[Dict[str, str]][source]

Apply SmoothLLM defense by generating multiple perturbed versions of the user’s message and analyzing the responses.

Parameters:

messages – List of messages to defend against jailbreak attacks.

Returns:

List of messages after applying SmoothLLM defense.

class panda_guard.role.defenses.smoothllm.SmoothLLMDefenderConfig(defender_cls: str = 'SmoothLLMDefender', defender_name: str | None = None, target_llm_config: ~panda_guard.llms.base.BaseLLMConfig = <factory>, target_llm_gen_config: ~panda_guard.llms.base.LLMGenerateConfig | None = None, perturbation_type: str = 'swap', perturbation_ratio: float = 0.1, num_perturbations: int = 3, batch_inference: bool = True)[source]

Bases: BaseDefenderConfig

Configuration for SmoothLLMDefender.

Parameters:
  • defender_cls – Class of the defender, default is “SmoothLLMDefender”.

  • perturbation_type – Type of perturbation to apply, default is “swap”.

  • perturbation_ratio – Ratio of the prompt to perturb, default is 0.1.

  • num_perturbations – Number of perturbed prompts to generate, default is 3.

  • batch_inference – Boolean flag indicating whether batch inference should be used, default is True.

batch_inference: bool = True
defender_cls: str = 'SmoothLLMDefender'
num_perturbations: int = 3
perturbation_ratio: float = 0.1
perturbation_type: str = 'swap'

Module contents

panda_guard.role.defenses.create_defender(config: BaseDefenderConfig) BaseDefender[source]