panda_guard.role.defenses.repe_utils package

Submodules

panda_guard.role.defenses.repe_utils.pipelines module

panda_guard.role.defenses.repe_utils.pipelines.repe_pipeline_registry()[source]: Register the RepE-related pipelines with the global pipeline registry so that they can later be instantiated by name.
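
Registering by name follows the usual name-to-class registry pattern. The following is a minimal sketch of that pattern only, assuming a plain dictionary as the registry; `PIPELINES`, `register_pipeline`, `build_pipeline`, and `DummyPipeline` are hypothetical names for illustration and are not part of panda_guard.

```python
# Hypothetical registry: a plain dict stands in for the global pipeline registry.
PIPELINES = {}


def register_pipeline(name, pipeline_class):
    """Record a pipeline class under a task name."""
    if name in PIPELINES:
        raise ValueError(f"pipeline {name!r} is already registered")
    PIPELINES[name] = pipeline_class


def build_pipeline(name, **kwargs):
    """Instantiate a previously registered pipeline by its name."""
    return PIPELINES[name](**kwargs)


class DummyPipeline:
    """Placeholder pipeline that only stores its keyword arguments."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


register_pipeline("dummy-task", DummyPipeline)
pipe = build_pipeline("dummy-task", layers=[1, 2])
```

Registering once at import time keeps later call sites free of direct class imports.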

panda_guard.role.defenses.repe_utils.rep_control_pipeline module

class panda_guard.role.defenses.repe_utils.rep_control_pipeline.RepControlPipeline(model, tokenizer, layers, block_name='decoder_block', control_method='reading_vec', **kwargs)[source]

Bases: TextGenerationPipeline

A custom text generation pipeline that allows representation-based control during intermediate forward passes of a transformer model.

Parameters:

model – The language model to be used for generation.
tokenizer – The tokenizer associated with the model.
layers – A list of layer indices to which the control method should be applied.
block_name – The name of the block to be wrapped (e.g., “decoder_block”).
control_method – The method used to apply control during generation. Currently supports only “reading_vec”.
kwargs (dict) – Additional keyword arguments passed to the base pipeline.
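
A minimal sketch of how per-layer control vectors for the `reading_vec` method might be assembled before generation. Plain lists stand in for tensors, and `directions`, `signs`, and `coeff` are illustrative values chosen for this example, not outputs of the library. How the resulting dictionary is passed to the pipeline at call time is not documented here, so the sketch stops at building it.

```python
# Illustrative inputs (assumptions): one direction and one sign per controlled layer.
layers = [10, 11]
coeff = 2.0  # steering strength, chosen arbitrarily for the example
directions = {10: [0.6, 0.8], 11: [1.0, 0.0]}
signs = {10: 1, 11: -1}

# Scale each direction by its sign and the steering coefficient.
activations = {
    layer: [coeff * signs[layer] * x for x in directions[layer]]
    for layer in layers
}
```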

panda_guard.role.defenses.repe_utils.rep_control_reading_vec module

class panda_guard.role.defenses.repe_utils.rep_control_reading_vec.WrappedBlock(block)[source]

Bases: Module

A wrapper around a transformer block or submodule that allows external activation control via injection of custom activation vectors.

Parameters: block – The transformer block or submodule to wrap.

forward(*args, **kwargs)[source]

Forward pass with optional activation control.

Returns: Modified output from the block.
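
The wrapper idea can be sketched without any deep learning library. This is an illustration under stated assumptions, not the class's actual code: `ControlledBlock` and `double` are hypothetical names, lists stand in for tensors, and the controller is simply added to the block output.

```python
class ControlledBlock:
    """Hypothetical stand-in for a wrapped block; lists stand in for tensors."""

    def __init__(self, block):
        self.block = block
        self.controller = None  # vector to inject, if any
        self.output = None      # last raw output of the wrapped block

    def __call__(self, x):
        out = self.block(x)
        self.output = out  # keep the unmodified output for later inspection
        if self.controller is not None:
            out = [o + c for o, c in zip(out, self.controller)]
        return out

    def reset(self):
        """Drop the injected vector and the cached output."""
        self.controller = None
        self.output = None


def double(x):
    """Toy block: multiply every entry by two."""
    return [2 * v for v in x]


wrapped = ControlledBlock(double)
plain = wrapped([1.0, 2.0])
wrapped.controller = [0.5, -0.5]
steered = wrapped([1.0, 2.0])
wrapped.reset()
```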

reset()[source]: Resets all control variables to remove previously injected activations.

set_controller(activations, token_pos=None, masks=None, normalize=False, operator='linear_comb')[source]

Sets the external controller to inject activations.

Parameters:

activations – Activation vector(s) to apply.
token_pos – Target token positions to apply the activations.
masks – Optional mask for activation injection.
normalize – Whether to rescale the modified activations so that their norm after injection matches their norm before injection.
operator – Operator to apply (e.g. ‘linear_comb’, ‘piecewise_linear’).
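
A minimal sketch of what the two named operators could compute. It assumes that `linear_comb` adds the controller to the current activation and that `piecewise_linear` first flips the controller according to the sign of its dot product with the current activation; both behaviours are assumptions based on the operator names. `OPERATORS` is a hypothetical lookup table and lists stand in for tensors.

```python
def linear_comb(current, controller):
    """Add the controller vector to the current activation."""
    return [c + k for c, k in zip(current, controller)]


def piecewise_linear(current, controller):
    """Add the controller, flipped by the sign of its dot product with current."""
    dot = sum(c * k for c, k in zip(current, controller))
    sign = (dot > 0) - (dot < 0)  # -1, 0, or 1
    return [c + sign * k for c, k in zip(current, controller)]


# Hypothetical lookup by operator name.
OPERATORS = {"linear_comb": linear_comb, "piecewise_linear": piecewise_linear}

current = [1.0, -2.0]
controller = [0.0, 1.0]
added = OPERATORS["linear_comb"](current, controller)
flipped = OPERATORS["piecewise_linear"](current, controller)
```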

set_masks(masks)[source]

Sets an external mask to apply during activation injection.

Parameters: masks – The binary mask for injection.
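
A minimal sketch of masked injection, assuming the mask holds one 0/1 entry per token and that masked-out tokens are left unchanged. `inject` is a hypothetical helper and nested lists stand in for a (tokens, hidden_size) tensor.

```python
def inject(hidden, controller, mask):
    """Add `controller` to each token vector whose mask entry is 1."""
    return [
        [h + m * c for h, c in zip(token, controller)]
        for token, m in zip(hidden, mask)
    ]


hidden = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
# Skip the first token (for example a padding position) and steer the rest.
steered = inject(hidden, [0.5, -0.5], mask=[0, 1, 1])
```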

class panda_guard.role.defenses.repe_utils.rep_control_reading_vec.WrappedReadingVecModel(model, tokenizer)[source]

Bases: Module

A wrapper for transformer models that enables dynamic injection and extraction of intermediate activations at specified layers and blocks.

Parameters:

model – The pretrained transformer model.
tokenizer – The tokenizer associated with the model.

forward(*args, **kwargs)[source]

Forwards inputs through the model.

Returns: Model outputs.

generate(**kwargs)[source]

Calls the generate method on the underlying model.

Returns: Generated output.

get_activations(layer_ids, block_name='decoder_block')[source]

Extracts activations from wrapped blocks.

Parameters:

layer_ids – Layer index or list of indices.
block_name – Target block name.

Returns:

Dictionary or tensor of activations.

get_logits(tokens)[source]

Computes logits from the model given input tokens.

Parameters: tokens – Input token tensor.
Returns: Logits from the model.

is_wrapped(block)[source]

Checks whether a given block has been wrapped.

Parameters: block – A model sub-block.
Returns: True if wrapped, else False.

reset()[source]: Resets all internal controllers and outputs in wrapped blocks.

run_prompt(prompt, **kwargs)[source]

Tokenizes and runs a prompt through the model without gradients.

Parameters: prompt – Input prompt string.
Returns: Model output.

set_controller(layer_ids, activations, block_name='decoder_block', token_pos=None, masks=None, normalize=False, operator='linear_comb')[source]

Injects controller activations into specified layers/blocks.

Parameters:

layer_ids – Layer index or list of indices.
activations – Activation tensor(s) to inject.
block_name – Block name to apply control to.
token_pos – Target token positions.
masks – Optional binary masks.
normalize – Whether to normalize activations.
operator – Operator name to apply.

set_masks(masks)[source]

Sets masks for all wrapped layers and blocks.

Parameters: masks – Binary tensor mask.

unwrap()[source]: Restores the original model structure by removing all wrappers.

wrap(layer_id, block_name)[source]

Wraps a specific sub-block in a given layer.

Parameters:

layer_id – Index of the transformer layer to wrap.
block_name – Name of the block to wrap (e.g., ‘mlp’, ‘self_attn’).

wrap_all()[source]: Wraps all known sub-blocks in every layer.

wrap_block(layer_ids, block_name)[source]

Wraps specified block(s) in the given layer(s).

Parameters:

layer_ids – Single layer ID or list of layer IDs.
block_name – Block name to wrap.

wrap_decoder_block(layer_id)[source]

Wraps the entire decoder block for a given layer.

Parameters: layer_id – Index of the decoder layer to wrap.
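
The wrap and unwrap methods above amount to swapping a sub-block attribute for a wrapper and later restoring it. A minimal sketch of that attribute swap, assuming layers are plain objects; `Wrapper`, `Layer`, `wrap`, and `unwrap` here are hypothetical stand-ins, not the library's implementations.

```python
class Wrapper:
    """Hypothetical wrapper that only remembers the block it replaced."""

    def __init__(self, block):
        self.block = block


class Layer:
    """Toy layer with two named sub-blocks."""

    def __init__(self):
        self.mlp = object()
        self.self_attn = object()


def wrap(layers, layer_id, block_name):
    """Replace one sub-block with a wrapper unless it is already wrapped."""
    layer = layers[layer_id]
    block = getattr(layer, block_name)
    if not isinstance(block, Wrapper):
        setattr(layer, block_name, Wrapper(block))


def unwrap(layers):
    """Restore every wrapped sub-block to the original object."""
    for layer in layers:
        for name, value in list(vars(layer).items()):
            if isinstance(value, Wrapper):
                setattr(layer, name, value.block)


layers = [Layer(), Layer()]
original_mlp = layers[1].mlp
wrap(layers, 1, "mlp")
wrapped_once = layers[1].mlp
wrap(layers, 1, "mlp")  # second call is a no-op
still_same = layers[1].mlp is wrapped_once
unwrap(layers)
```

Checking for an existing wrapper before wrapping avoids nesting wrappers when the same layer is requested twice.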

panda_guard.role.defenses.repe_utils.rep_readers module

class panda_guard.role.defenses.repe_utils.rep_readers.ClusterMeanRepReader[source]

Bases: RepReader

Get the direction given by the difference between the means of the positive and negative clusters.

get_rep_directions(model, tokenizer, hidden_states, hidden_layers, **kwargs)[source]

Compute the direction by subtracting the mean of the negative class from the mean of the positive class.

Parameters:

model – The language model.
tokenizer – Tokenizer associated with the model.
hidden_states – Hidden states per layer.
hidden_layers – Layers to consider.
kwargs – Must contain ‘train_choices’, the list of labels.

Returns:

Dictionary of directions per layer.
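
A minimal sketch of the mean-difference computation for one layer, assuming a flat list of 0/1 labels aligned with the rows of the hidden states. `mean_vector` and `cluster_mean_direction` are hypothetical helpers, and nested lists stand in for tensors.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cluster_mean_direction(hidden, labels):
    """Mean of the positive rows minus mean of the negative rows."""
    pos = [h for h, y in zip(hidden, labels) if y == 1]
    neg = [h for h, y in zip(hidden, labels) if y == 0]
    return [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]


# One entry per layer; layer 5 is an arbitrary example index.
hidden_states = {5: [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]}
labels = [1, 1, 0, 0]
directions = {
    layer: cluster_mean_direction(h, labels)
    for layer, h in hidden_states.items()
}
```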

n_components = 1

needs_hiddens = True

class panda_guard.role.defenses.repe_utils.rep_readers.PCARepReader(n_components=1)[source]

Bases: RepReader

Extract directions via Principal Component Analysis (PCA).

get_rep_directions(model, tokenizer, hidden_states, hidden_layers, **kwargs)[source]

Get PCA components as directions for each layer.

Parameters:

model – The language model.
tokenizer – Tokenizer associated with the model.
hidden_states – Hidden states per layer.
hidden_layers – Layers to consider.

Returns:

Dictionary of directions per layer.
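
To illustrate what a first PCA component is, the sketch below recenters the data and runs power iteration on the unnormalized covariance. Power iteration is swapped in here only to keep the example free of dependencies; it is not a claim about how this class computes its components. `first_principal_component` is a hypothetical helper.

```python
import math


def first_principal_component(rows, iterations=100):
    """Approximate the top principal direction of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    # Start from the all-ones vector; this fails if it is orthogonal to the
    # top component, which is acceptable for a small illustration.
    v = [1.0] * d
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


# Points lying close to the line y = x.
data = [[0.0, 0.0], [1.0, 1.1], [2.0, 1.9], [3.0, 3.0]]
pc = first_principal_component(data)
```

The result is a unit vector close to the diagonal direction, since the data vary mostly along y = x.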

get_signs(hidden_states, train_labels, hidden_layers)[source]

Determine signs for each PCA direction by comparing projection magnitudes relative to labels.

Parameters:

hidden_states – Hidden states of the model on the training data (per layer).
train_labels – Labels for the training data.
hidden_layers – Layers to compute signs for.

Returns:

Dictionary mapping layers to signs.
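
One way such a sign can be chosen is sketched below. It assumes the training data are grouped, that each group's label list marks exactly one positive example with 1, and that the sign is +1 when the positive example has the largest projection in its group at least as often as it has the smallest. `direction_sign` is a hypothetical helper, and the tie-breaking rule is an assumption.

```python
def direction_sign(projections, labels):
    """Return 1 or -1 for one direction given grouped projections and labels."""
    hits_max = 0
    hits_min = 0
    for scores, group_labels in zip(projections, labels):
        chosen = scores[group_labels.index(1)]  # projection of the positive example
        hits_max += chosen == max(scores)
        hits_min += chosen == min(scores)
    return 1 if hits_max >= hits_min else -1


projections = [[0.9, -0.4], [0.2, 0.7], [1.5, 0.1]]
labels = [[1, 0], [0, 1], [1, 0]]
sign = direction_sign(projections, labels)
# Negating every projection should flip the decision.
flipped = direction_sign([[-s for s in group] for group in projections], labels)
```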

needs_hiddens = True

class panda_guard.role.defenses.repe_utils.rep_readers.RandomRepReader(needs_hiddens=True)[source]

Bases: RepReader

Get random directions for each hidden layer. Does not use hidden states or training labels of any kind.

get_rep_directions(model, tokenizer, hidden_states, hidden_layers, **kwargs)[source]

Generate random direction vectors for each layer.

Parameters:

model – The language model.
tokenizer – Tokenizer associated with the model.
hidden_states – Hidden states per layer.
hidden_layers – Layers to consider.

Returns:

Dictionary of directions per layer.
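
A minimal sketch of a random baseline, assuming one Gaussian vector per requested layer and a hidden size supplied by the caller. `random_directions`, `hidden_size`, and `seed` are hypothetical names introduced for the example.

```python
import random


def random_directions(hidden_layers, hidden_size, seed=0):
    """Draw one Gaussian direction per layer from a seeded generator."""
    rng = random.Random(seed)
    return {
        layer: [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
        for layer in hidden_layers
    }


directions = random_directions([-1, -2], hidden_size=4)
```

Seeding the generator makes the baseline reproducible across runs, which helps when comparing it with learned directions.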

class panda_guard.role.defenses.repe_utils.rep_readers.RepReader[source]

Bases: ABC

Class to identify and store concept directions. RepReader instances are used by RepReadingPipeline to get concept scores. Directions can be used for downstream interventions.

abstract get_rep_directions(model, tokenizer, hidden_states, hidden_layers, **kwargs)[source]

Abstract method to compute concept directions per layer.

Parameters:

model – The language model.
tokenizer – Tokenizer associated with the model.
hidden_states – Hidden states per layer.
hidden_layers – Layers to consider.

Returns:

Dictionary of directions per layer.

get_signs(hidden_states, train_choices, hidden_layers)[source]

Given labels for the training hidden states, determine whether the positive or the negative direction corresponds to a high value of the concept, and return the corresponding sign (-1 or 1) for each layer and component index.

Parameters:

hidden_states – Hidden states of the model on the training data (per layer).
train_choices – Labels for the training data.
hidden_layers – Layers to compute signs for.

Returns:

Dictionary mapping layers to signs.

transform(hidden_states, hidden_layers, component_index)[source]

Project hidden states onto a selected concept direction.

Parameters:

hidden_states – Dictionary of hidden states, each of shape (n_examples, hidden_size).
hidden_layers – Layers to transform.
component_index – Index of the direction/component to project onto.

Returns:

Dictionary of transformed hidden states, each of shape (n_examples,).

panda_guard.role.defenses.repe_utils.rep_readers.project_onto_direction(H, direction)[source]

Project hidden representations onto a given direction.

Parameters:

H – A tensor of shape (n, d_1), the input hidden states.
direction – A tensor of shape (d_2,), the direction vector.

Returns:

The projections.
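
A minimal sketch of a scalar projection, assuming the two dimensions agree (d_1 = d_2) and that the dot product is divided by the norm of the direction. `scalar_projection` is a hypothetical helper and nested lists stand in for tensors.

```python
import math


def scalar_projection(H, direction):
    """Project each row of H onto `direction`, scaled by the direction's norm."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [sum(h * d for h, d in zip(row, direction)) / norm for row in H]


H = [[3.0, 4.0], [1.0, 0.0]]
projections = scalar_projection(H, [0.0, 2.0])
```

Dividing by the norm makes the result independent of the length of the direction vector, so only its orientation matters.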

panda_guard.role.defenses.repe_utils.rep_readers.recenter(x, mean=None)[source]

Recenter data around a specified or computed mean.

Parameters:

x – Input data.
mean – Optional mean vector. If None, the mean of x is used.

Returns:

Tensor of recentered data.
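
A minimal sketch of recentering, assuming the mean is taken column-wise over the rows of `x` when none is given. `recenter_rows` is a hypothetical helper and nested lists stand in for tensors.

```python
def recenter_rows(x, mean=None):
    """Subtract `mean` from every row; default to the column-wise mean of x."""
    if mean is None:
        n = len(x)
        mean = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, mean)] for row in x]


data = [[1.0, 2.0], [3.0, 6.0]]
centered = recenter_rows(data)
shifted = recenter_rows(data, mean=[1.0, 1.0])
```

Passing an explicit mean lets held-out data be recentered with the statistics of the training data.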

panda_guard.role.defenses.repe_utils.rep_reading_pipeline module

class panda_guard.role.defenses.repe_utils.rep_reading_pipeline.RepReadingPipeline(**kwargs)[source]

Bases: Pipeline

A pipeline for extracting and transforming hidden state representations from transformer models.

get_directions(train_inputs: str | List[str] | List[List[str]], rep_token: str | int = -1, hidden_layers: str | int = -1, n_difference: int = 1, batch_size: int = 1, train_labels: List[int] | None = None, direction_method: str = 'pca', direction_finder_kwargs: dict = {}, which_hidden_states: str | None = None, **tokenizer_args)[source]

Train a RepReader on the training data.

Parameters:

train_inputs – Input examples to train on.
rep_token – Index of the token to extract hidden states from.
hidden_layers – Layer indices to extract hidden states from.
n_difference – Number of times to compute differences in training pairs.
batch_size – Batch size for extracting hidden states.
train_labels – Labels for supervised direction finding.
direction_method – Method to use for finding directions (e.g., ‘pca’, ‘clustermean’).
direction_finder_kwargs – Additional keyword arguments for the direction finder.
which_hidden_states – For encoder-decoder models, specify ‘encoder’ or ‘decoder’.
tokenizer_args – Additional tokenizer parameters.

Returns:

A trained RepReader containing the learned directions.
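
The `n_difference` parameter can be pictured as repeated pairwise differencing. The sketch below assumes that consecutive rows form a pair and that each pass replaces every pair with the first row minus the second; `pairwise_difference` is a hypothetical helper and nested lists stand in for tensors.

```python
def pairwise_difference(hidden, n_difference=1):
    """Replace consecutive row pairs by their difference, n_difference times."""
    for _ in range(n_difference):
        hidden = [
            [a - b for a, b in zip(first, second)]
            for first, second in zip(hidden[::2], hidden[1::2])
        ]
    return hidden


hidden = [[1.0, 2.0], [0.5, 1.0], [4.0, 4.0], [1.0, 3.0]]
once = pairwise_difference(hidden)
twice = pairwise_difference(hidden, n_difference=2)
```

Each pass halves the number of rows, so the number of training examples should be divisible by 2 raised to the power `n_difference`.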

postprocess(outputs)[source]

Pass-through postprocessing step.

Parameters: outputs – Outputs from the model.
Returns: Unmodified outputs.

preprocess(inputs: str | List[str] | List[List[str]], **tokenizer_kwargs)[source]

Preprocess input data using the tokenizer or image processor.

Parameters:

inputs – Input data.
tokenizer_kwargs – Additional arguments for the tokenizer.

Returns:

Tokenized or processed inputs.
