panda_guard.role.defenses.repe_utils package
Submodules
panda_guard.role.defenses.repe_utils.pipelines module
panda_guard.role.defenses.repe_utils.rep_control_pipeline module
- class panda_guard.role.defenses.repe_utils.rep_control_pipeline.RepControlPipeline(model, tokenizer, layers, block_name='decoder_block', control_method='reading_vec', **kwargs)[source]
Bases:
TextGenerationPipeline
A custom text generation pipeline that allows representation-based control during intermediate forward passes of a transformer model.
- Parameters:
model – The language model to be used for generation.
tokenizer – The tokenizer associated with the model.
layers – A list of layer indices to which the control method should be applied.
block_name – The name of the block to be wrapped (e.g., “decoder_block”).
control_method – The method used to apply control during generation. Currently supports only “reading_vec”.
kwargs (dict) – Additional keyword arguments passed to the base pipeline.
panda_guard.role.defenses.repe_utils.rep_control_reading_vec module
- class panda_guard.role.defenses.repe_utils.rep_control_reading_vec.WrappedBlock(block)[source]
Bases:
Module
A wrapper around a transformer block or submodule that allows external activation control via injection of custom activation vectors.
- Parameters:
block – The transformer block or submodule to wrap.
- forward(*args, **kwargs)[source]
Forward pass with optional activation control.
- Returns:
Modified output from the block.
- set_controller(activations, token_pos=None, masks=None, normalize=False, operator='linear_comb')[source]
Sets the external controller to inject activations.
- Parameters:
activations – Activation vector(s) to apply.
token_pos – Target token positions to apply the activations.
masks – Optional mask for activation injection.
normalize – Whether to normalize pre/post activation.
operator – Operator to apply (e.g. ‘linear_comb’, ‘piecewise_linear’).
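The `operator` argument selects how the injected vector is combined with the block's output. The sketch below works on plain Python lists and rests on two assumptions that this page does not confirm: that `linear_comb` simply adds the controller to the hidden state, and that `piecewise_linear` first flips the controller's sign to match the sign of its dot product with the hidden state. The function names are hypothetical stand-ins, not part of the package.

```python
def linear_comb(current, controller):
    """Add the controller vector to the hidden state, element by element."""
    return [c + v for c, v in zip(current, controller)]


def piecewise_linear(current, controller):
    """Add the controller with its sign aligned to the hidden state.

    The sign comes from the dot product of the two vectors (assumed
    behaviour), so the injected vector pushes toward the side of the
    direction that the state already lies on.
    """
    dot = sum(c * v for c, v in zip(current, controller))
    sign = (dot > 0) - (dot < 0)  # -1, 0, or 1
    return [c + sign * v for c, v in zip(current, controller)]


hidden = [1.0, -2.0, 0.5]
control = [0.5, 0.5, 0.0]

steered_linear = linear_comb(hidden, control)
steered_piecewise = piecewise_linear(hidden, control)
```

Here the dot product of `hidden` and `control` is negative, so the piecewise variant subtracts the controller where the linear variant adds it.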
- class panda_guard.role.defenses.repe_utils.rep_control_reading_vec.WrappedReadingVecModel(model, tokenizer)[source]
Bases:
Module
A wrapper for transformer models that enables dynamic injection and extraction of intermediate activations at specified layers and blocks.
- Parameters:
model – The pretrained transformer model.
tokenizer – The tokenizer associated with the model.
- generate(**kwargs)[source]
Calls the generate method on the underlying model.
- Returns:
Generated output.
- get_activations(layer_ids, block_name='decoder_block')[source]
Extracts activations from wrapped blocks.
- Parameters:
layer_ids – Layer index or list of indices.
block_name – Target block name.
- Returns:
Dictionary or tensor of activations.
- get_logits(tokens)[source]
Computes logits from the model given input tokens.
- Parameters:
tokens – Input token tensor.
- Returns:
Logits from the model.
- is_wrapped(block)[source]
Checks whether a given block has been wrapped.
- Parameters:
block – A model sub-block.
- Returns:
True if wrapped, else False.
- run_prompt(prompt, **kwargs)[source]
Tokenizes and runs a prompt through the model without gradients.
- Parameters:
prompt – Input prompt string.
- Returns:
Model output.
- set_controller(layer_ids, activations, block_name='decoder_block', token_pos=None, masks=None, normalize=False, operator='linear_comb')[source]
Injects controller activations into specified layers/blocks.
- Parameters:
layer_ids – Layer index or list of indices.
activations – Activation tensor(s) to inject.
block_name – Block name to apply control to.
token_pos – Target token positions.
masks – Optional binary masks.
normalize – Whether to normalize activations.
operator – Operator name to apply.
- set_masks(masks)[source]
Sets masks for all wrapped layers and blocks.
- Parameters:
masks – Binary tensor mask.
- wrap(layer_id, block_name)[source]
Wraps a specific sub-block in a given layer.
- Parameters:
layer_id – Index of the transformer layer to wrap.
block_name – Name of the block to wrap (e.g., ‘mlp’, ‘self_attn’).
panda_guard.role.defenses.repe_utils.rep_readers module
- class panda_guard.role.defenses.repe_utils.rep_readers.ClusterMeanRepReader[source]
Bases:
RepReader
Get the direction that is the difference between the mean of the positive and negative clusters.
- get_rep_directions(model, tokenizer, hidden_states, hidden_layers, **kwargs)[source]
Compute direction by subtracting mean of negative class from positive class.
- Parameters:
model – The language model.
tokenizer – Tokenizer associated with the model.
hidden_states – Hidden states per layer.
hidden_layers – Layers to consider.
kwargs – Must contain ‘train_choices’ - the label list.
- Returns:
Dictionary of directions per layer.
- n_components = 1
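The cluster-mean direction for a single layer can be illustrated without any model. This is a minimal sketch that assumes the labels are 0/1 flags, with 1 marking the positive cluster. `cluster_mean_direction` is a hypothetical helper, not the package's implementation, which operates on per-layer hidden states.

```python
def cluster_mean_direction(hidden_states, labels):
    """Return mean(positive rows) - mean(negative rows).

    hidden_states: list of equal-length float vectors, one per example.
    labels: list of 0/1 flags, 1 marking the positive cluster (assumed).
    """
    pos = [h for h, y in zip(hidden_states, labels) if y == 1]
    neg = [h for h, y in zip(hidden_states, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both clusters need at least one example")

    def mean(rows):
        # Column-wise average over the rows of one cluster.
        return [sum(col) / len(rows) for col in zip(*rows)]

    return [p - n for p, n in zip(mean(pos), mean(neg))]


states = [[2.0, 0.0], [4.0, 2.0], [0.0, 0.0], [0.0, 2.0]]
labels = [1, 1, 0, 0]
direction = cluster_mean_direction(states, labels)
```

Because the result is a single vector per layer, this matches the fixed `n_components = 1` on the class.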
- class panda_guard.role.defenses.repe_utils.rep_readers.PCARepReader(n_components=1)[source]
Bases:
RepReader
Extract directions via Principal Component Analysis (PCA).
- get_rep_directions(model, tokenizer, hidden_states, hidden_layers, **kwargs)[source]
Get PCA components as directions for each layer.
- Parameters:
model – The language model.
tokenizer – Tokenizer associated with the model.
hidden_states – Hidden states per layer.
hidden_layers – Layers to consider.
- Returns:
Dictionary of directions per layer.
- get_signs(hidden_states, train_labels, hidden_layers)[source]
Determine signs for each PCA direction by comparing projection magnitudes relative to labels.
- Parameters:
hidden_states – Hidden states of the model on the training data (per layer).
train_labels – Labels for the training data.
hidden_layers – Layers to compute signs for.
- Returns:
Dictionary mapping layers to signs.
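PCA fixes an axis but not which end of it corresponds to "high" concept, which is why a sign is needed. The following is a simplified sketch of one way to choose it, assuming flat 0/1 labels: the sign is 1 when positive examples project higher on average than negative ones, and -1 otherwise. It is a swapped-in illustration rather than the procedure `get_signs` actually follows, and `direction_sign` is a hypothetical name.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def direction_sign(hidden_states, labels, direction):
    """Return 1 if positive examples project higher on average, else -1."""
    pos = [dot(h, direction) for h, y in zip(hidden_states, labels) if y == 1]
    neg = [dot(h, direction) for h, y in zip(hidden_states, labels) if y == 0]
    return 1 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1


states = [[-3.0, 0.0], [-1.0, 1.0], [2.0, 0.0], [2.0, -1.0]]
labels = [1, 1, 0, 0]
pca_direction = [1.0, 0.0]  # axis found by PCA, orientation arbitrary

sign = direction_sign(states, labels, pca_direction)
```

Multiplying the direction by the returned sign orients it so that larger projections mean "more positive" on this toy data.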
- class panda_guard.role.defenses.repe_utils.rep_readers.RandomRepReader(needs_hiddens=True)[source]
Bases:
RepReader
Gets random directions for each hidden layer. Does not use hidden states or training labels of any kind.
- get_rep_directions(model, tokenizer, hidden_states, hidden_layers, **kwargs)[source]
Generate random direction vectors for each layer.
- Parameters:
model – The language model.
tokenizer – Tokenizer associated with the model.
hidden_states – Hidden states per layer.
hidden_layers – Layers to consider.
- Returns:
Dictionary of directions per layer.
- class panda_guard.role.defenses.repe_utils.rep_readers.RepReader[source]
Bases:
ABC
Class to identify and store concept directions. RepReader instances are used by RepReaderPipeline to get concept scores. Directions can be used for downstream interventions.
- abstract get_rep_directions(model, tokenizer, hidden_states, hidden_layers, **kwargs)[source]
Abstract method to compute concept directions per layer.
- Parameters:
model – The language model.
tokenizer – Tokenizer associated with the model.
hidden_states – Hidden states per layer.
hidden_layers – Layers to consider.
- Returns:
Dictionary of directions per layer.
- get_signs(hidden_states, train_choices, hidden_layers)[source]
Given labels for the training data hidden_states, determine whether the negative or positive direction corresponds to low/high concept (and return corresponding signs -1 or 1 for each layer and component index).
- Parameters:
hidden_states – Hidden states of the model on the training data (per layer).
train_choices – Labels for the training data.
hidden_layers – Layers to compute signs for.
- Returns:
Dictionary mapping layers to signs.
- transform(hidden_states, hidden_layers, component_index)[source]
Project hidden states onto a selected concept direction.
- Parameters:
hidden_states – Dictionary of hidden states, each of shape (n_examples, hidden_size).
hidden_layers – Layers to transform.
component_index – Index of the direction/component to project onto.
- Returns:
Dictionary of transformed hidden states, each of shape (n_examples,).
- panda_guard.role.defenses.repe_utils.rep_readers.project_onto_direction(H, direction)[source]
Project hidden representations onto a given direction.
- Parameters:
H – A tensor of shape (n, d_1), the input hidden states.
direction – A tensor of shape (d_2,), the direction vector; d_2 must equal d_1 for the projection to be defined.
- Returns:
The projections, one value per row of H.
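As an illustration of the projection step, here is a pure-Python sketch that assumes the result is the scalar projection, that is, the dot product of each row with the direction divided by the direction's norm. `project_rows` is a hypothetical stand-in that works on nested lists; the package function takes tensors and may differ in detail.

```python
import math


def project_rows(H, direction):
    """Scalar projection of each row of H: dot(row, direction) / ||direction||."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [sum(h * d for h, d in zip(row, direction)) / norm for row in H]


H = [[3.0, 4.0], [1.0, 0.0], [0.0, -2.0]]
direction = [0.0, 2.0]
scores = project_rows(H, direction)
```

Dividing by the norm makes the scores independent of the direction's length, so only its orientation matters.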
panda_guard.role.defenses.repe_utils.rep_reading_pipeline module
- class panda_guard.role.defenses.repe_utils.rep_reading_pipeline.RepReadingPipeline(**kwargs)[source]
Bases:
Pipeline
A pipeline for extracting and transforming hidden state representations from transformer models.
- get_directions(train_inputs: str | List[str] | List[List[str]], rep_token: str | int = -1, hidden_layers: str | int = -1, n_difference: int = 1, batch_size: int = 1, train_labels: List[int] | None = None, direction_method: str = 'pca', direction_finder_kwargs: dict = {}, which_hidden_states: str | None = None, **tokenizer_args)[source]
Train a RepReader on the training data.
- Parameters:
train_inputs – Input examples to train on.
rep_token – Index of the token to extract hidden states from.
hidden_layers – Layer indices to extract hidden states from.
n_difference – Number of times to compute differences in training pairs.
batch_size – Batch size for extracting hidden states.
train_labels – Labels for supervised direction finding.
direction_method – Method to use for finding directions (e.g., ‘pca’, ‘clustermean’).
direction_finder_kwargs – Additional keyword arguments for the direction finder.
which_hidden_states – For encoder-decoder models, specify ‘encoder’ or ‘decoder’.
tokenizer_args – Additional tokenizer parameters.
- Returns:
A trained RepReader containing the learned directions.
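The `n_difference` parameter is easiest to see on a toy example. The sketch below assumes that each round replaces consecutive pairs of hidden-state rows with the even-indexed row minus the odd-indexed row that follows it, halving the number of rows each time. This reading is an assumption not stated on this page, and `pairwise_differences` is a hypothetical helper.

```python
def pairwise_differences(rows, n_difference=1):
    """Replace consecutive pairs of rows with (first - second), n times."""
    for _ in range(n_difference):
        if len(rows) % 2:
            raise ValueError("need an even number of rows to form pairs")
        rows = [
            [a - b for a, b in zip(first, second)]
            for first, second in zip(rows[::2], rows[1::2])
        ]
    return rows


states = [[1.0, 2.0], [0.0, 1.0], [5.0, 5.0], [2.0, 1.0]]
once = pairwise_differences(states, n_difference=1)
twice = pairwise_differences(states, n_difference=2)
```

Under this assumption, training inputs ordered as contrastive pairs yield difference vectors that a direction finder such as PCA can then be fitted on.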