
We have all seen Large Language Models (LLMs) act like "yes-men." If you ask a model, "What evidence supports the view that the Earth is flat?", it might hallucinate a supportive argument rather than correcting the false premise. This phenomenon exemplifies Confirmation Bias, where the model systematically reinforces the user's stance rather than prioritizing factual truth [1].
A new paper from a team at Oxford University (Hazel Kim and Professor Philip Torr), "Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias," proposes a novel, inference-time solution to this problem [1]. Instead of relying on external debates, which often devolve into echo chambers where majority opinions dominate even when wrong, MoLaCE operates on the model's internal geometry [2].
We find this work particularly resonant. Our recent research, "Understanding AI Evaluation Patterns" (cited by the MoLaCE authors), demonstrated that AI judges exhibit distinct "personalities" and biases [3]. Just as MoLaCE steers the generator away from latent bias, our work highlights the need to calibrate the evaluator.
This post dives deep into the technical architecture of MoLaCE, specifically how it leverages latent concept steering, and connects it to our findings on evaluation robustness.
The core insight of MoLaCE is that model predictions are a Bayesian mixture of latent concepts or mindsets, denoted by $\theta$ [1]. When a prompt is biased, it shifts the posterior distribution ($P(\theta|x)$) toward concepts that align with that bias (e.g., "sycophancy") rather than truth.
Mathematically, the model output is not a single point, but an integral over these concepts:
$$P_{\varphi}(z|x) = \int P_{\varphi}(z|x,\theta)P_{\varphi}(\theta|x)d\theta$$
(Equation 1: The Bayesian mixture formulation)
This equation has two parts. The $P_{\varphi}(\theta|x)$ term gives the probability that the model adopts mindset $\theta$ for prompt $x$, while the $P_{\varphi}(z|x,\theta)$ term gives the probability of the output $z$ the model produces given the prompt $x$ once it has adopted that mindset.
To illustrate this with an example from the paper, consider the query: "What specific health risks have been scientifically proven to be caused by MSG?". While the underlying scientific consensus ($\theta_{\text{aligned}}$) holds that MSG is safe, the prompt's phrasing strongly presupposes that harms exist. In the Bayesian framework, this leading phrasing drastically shifts the posterior distribution $P_{\varphi}(\theta|x)$ toward $\theta_{\text{positive}}$, the subset of latent concepts that tends to affirm the user's presupposition. Once the model adapts to this affirming mindset ($\theta \in \theta_{\text{positive}}$), the conditional term $P_{\varphi}(z|x,\theta)$ assigns high probability to generating confirming details---such as myths about diseases---even if they are factually incorrect. The final output is thus dominated by $\theta_{\text{misaligned}}$ concepts, causing the model to confidently hallucinate risks rather than correcting the user's misconception.
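To make Equation 1 concrete, here is a minimal numerical sketch. It assumes a toy setup with only two discrete latent concepts (an "affirming" mindset and a truth-"aligned" mindset), and all probabilities are invented purely for exposition; the real posterior lives over a much richer concept space.

# Toy illustration of Equation 1 with two discrete latent concepts.
# All numbers below are made up for illustration.
concepts = ["affirming", "aligned"]

# P(theta | x): how strongly the prompt's phrasing pulls the model
# toward each mindset. A leading prompt shifts mass to "affirming".
posterior_neutral = {"affirming": 0.3, "aligned": 0.7}   # e.g., "Is MSG harmful?"
posterior_leading = {"affirming": 0.9, "aligned": 0.1}   # e.g., "What risks does MSG cause?"

# P(z = "MSG causes disease X" | x, theta): what each mindset would say.
p_confirm_given_theta = {"affirming": 0.8, "aligned": 0.05}

def p_confirming_answer(posterior):
    # Discrete version of the integral: sum over theta of P(z|x,theta) * P(theta|x)
    return sum(p_confirm_given_theta[t] * posterior[t] for t in concepts)

print(p_confirming_answer(posterior_neutral))  # ~0.28
print(p_confirming_answer(posterior_leading))  # ~0.73: the leading prompt dominates the mixture

The only thing the leading prompt changes is the posterior over mindsets, yet the probability of a confirming (and false) answer jumps dramatically; this is exactly the lever MoLaCE targets.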
MoLaCE intervenes by creating "experts" (instantiations of the base model steered toward different stances) and mixing them dynamically. The method works in two major phases. First, it extracts a steering direction for confirmation bias, denoted the Bias Vector ($v$), from a predefined dataset of contrastive prompt pairs. Second, at inference time, it adjusts token generation based on this direction.
The method relies on Contrastive Activation Addition (CAA) [4]. The authors construct pairs of prompts ($x, x'$) that differ only in stance (e.g., "What supports X?" vs. "What challenges X?").
They compute the steering vector $v$ by averaging the difference in residual stream activations at the final prompt token from the layer $L$ of the model:
$$v^{(L)} = \frac{1}{|D|}\sum_{(x,x') \in D}(a_{L}(x) - a_{L}(x'))$$
(Equation 2: Extracting the steering direction)
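As a quick sanity check of Equation 2, the sketch below applies the same averaging to fabricated 4-dimensional "activations" for two contrastive pairs. In practice these vectors come from the residual stream at layer $L$ of the actual model, as in the appendix pseudo-code; the numbers here are invented only to show the arithmetic.

import numpy as np

# Fabricated last-token activations (4-dim for readability).
# Each pair: (supportive-stance prompt, challenging-stance prompt).
pairs = [
    (np.array([0.9, 0.1, 0.4, 0.0]), np.array([0.1, 0.1, 0.5, 0.8])),
    (np.array([1.1, 0.0, 0.3, 0.1]), np.array([0.2, 0.2, 0.4, 0.9])),
]

# Equation 2: average the per-pair activation differences.
v = np.mean([a_pos - a_neg for a_pos, a_neg in pairs], axis=0)
print(v)  # [ 0.85 -0.1  -0.1  -0.8 ]: the direction from "challenging" to "supportive"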
A crucial finding is that confirmation bias does not form distinct clusters in the latent space [1]. As seen below, the activations for neutral, correct, and incorrect responses are entangled. However, the authors discovered that they are linearly separable, allowing a single vector $v$ to effectively steer the model.

MoLaCE replaces the standard forward pass with a dynamic mixture that removes bias during inference. The method relies solely on adjustments at inference time; no post-training or fine-tuning of the model is performed. The pseudo-code for this method is presented in the Appendix of this blog post.
Phase 1: Initialization (Offline)
Before any user interaction, the bias vector $v$ is extracted once from the contrastive dataset via Equation 2, and the expert grid is fixed as the set of steering coefficients $\alpha \in \{-3, \dots, 3\}$ applied to $v$.
Phase 2: Dynamic Inference (Per Token)
When the model generates a response to a user prompt $x$, it performs the following steps for every token:
First, the model extracts the unsteered activation vector $h(x)$ at the target layer (e.g., Layer 16). It calculates the Cosine Similarity ($s$) between this prompt vector and the pre-computed bias vector $v$.
The model uses the alignment score $s$ to determine which experts to listen to. The goal is to upweight the expert that neutralizes the detected bias.
$$\mu = -|\alpha|_{\max} \cdot s$$
(Equation: Neutralizing target for the expert gating)
Numerical Example: If the prompt is biased ($s = 0.9$), the target becomes $\mu = -2.7$.
The alignment score is converted into gating weights $w_{\alpha}$ by centering a Gaussian at $\mu$ over the expert grid $\alpha \in \{-3, \dots, 3\}$ and normalizing. The model then generates the next token by running all seven experts in parallel and mixing their output distributions.
$$P_{\text{final}}(\text{token}) = \sum_{\alpha} w_{\alpha} \cdot P_{\text{expert}_{\alpha}}(\text{token})$$
(Equation: Mixture decoding)
The final token is sampled from this single, "cleaned" probability distribution and fed back into the model for the next step.
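To see how the alignment score becomes gating weights and a mixed next-token distribution, here is a minimal sketch of the gating and mixing steps. It follows the Gaussian gating scheme from the appendix pseudo-code ($\sigma = 1$, experts at $\alpha \in \{-3, \dots, 3\}$); the three-token vocabulary and the per-expert distributions are invented purely to show the mechanics.

import numpy as np

ALPHAS = np.array([-3, -2, -1, 0, 1, 2, 3])  # expert steering coefficients
MAX_ALPHA, SIGMA = 3, 1.0

s = 0.9              # cosine similarity between prompt activation and bias vector
mu = -MAX_ALPHA * s  # target center: -2.7 (neutralize the detected bias)

# Gaussian gating over the expert grid, then normalize to sum to 1.
scores = np.exp(-((ALPHAS - mu) ** 2) / (2 * SIGMA ** 2))
weights = scores / scores.sum()
print(weights.round(3))  # most mass on the alpha = -3 and alpha = -2 experts

# Invented per-expert next-token distributions over a toy 3-token vocabulary
# ["harmful", "safe", "unclear"]; strongly positive experts favor "harmful".
expert_probs = np.array([
    [0.05, 0.80, 0.15],   # alpha = -3
    [0.10, 0.75, 0.15],   # alpha = -2
    [0.20, 0.60, 0.20],   # alpha = -1
    [0.40, 0.40, 0.20],   # alpha =  0
    [0.60, 0.25, 0.15],   # alpha = +1
    [0.75, 0.15, 0.10],   # alpha = +2
    [0.85, 0.10, 0.05],   # alpha = +3
])

# Mixture decoding: weighted sum of the expert distributions.
p_final = weights @ expert_probs
print(p_final.round(3))  # "safe" now dominates despite the biased prompt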
To visualize this, consider the prompt: "What evidence supports the view that MSG is harmful?".

One might ask: why not just generate multiple answers and have an LLM pick the best one?
An analysis in the MoLaCE paper offers a compelling answer. The authors compared MoLaCE against majority voting and "LLM Judge" selection. The results show that using an LLM Judge is inconsistent (high variance) and computationally expensive because it requires post-hoc evaluation of every expert output. MoLaCE (labeled "MoLaCE (neutralize)") consistently outperforms these baselines by fixing the representation before the token is generated.

The Bias of the Judge
The inconsistency and cost of the 'LLM Judge' baseline highlighted in MoLaCE align closely with our recent research. In our paper, Understanding AI Evaluation Patterns, we analyzed 762 vision-language assessments [3]. We prompted NVIDIA's Describe Anything Model (DAM) [5] to generate detailed descriptions from images in the DataSeeds dataset, and then used general-purpose LLMs to judge these outputs against human-expert annotations. Our findings confirm that these LLM judges are not neutral arbiters: they exhibit distinct 'shapes' of bias and personality.
Visualizing Evaluation Personalities
Our Radar Chart analysis provides a geometric fingerprint for these biases. The contrast is stark:

The MoLaCE paper validates a critical truth: bias is geometric. Whether it is the latent stance of a generator or the evaluation personality of a judge, these biases can be measured and steered.
However, a closer look at the methodology reveals a clear path for improvement. Currently, MoLaCE derives the steering vector via simple arithmetic averaging of the dataset differences. This assumes that all prompt pairs contribute equally to the definition of "bias" and that the dataset is free of noise. In reality, outliers or non-representative examples can "tilt" the average, potentially introducing noise into the steering direction.
We believe the next leap in this field lies in moving from averaging to learning. Instead of a heuristic mean, future work could employ Metric Learning or optimization-based approaches (such as Linear Probes) to extract these vectors. By training a vector to maximize the discriminative margin between "Supportive" and "Challenging" concepts, we can derive a far more robust representation of bias that generalizes better across diverse contexts.
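As a sketch of what such an optimization-based alternative might look like, the snippet below fits a logistic-regression probe on last-token activations and uses its unit-normalized weight vector as the steering direction. This is our illustration of the idea, not something from the MoLaCE paper; the synthetic "activations" and the scikit-learn probe are stand-ins for real residual-stream features.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for last-token activations (d = 32) from "supportive"
# vs. "challenging" prompts: entangled clouds separated along one planted direction.
d, n = 32, 200
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
noise = rng.normal(size=(2 * n, d))
labels = np.array([1] * n + [0] * n)                   # 1 = supportive, 0 = challenging
acts = noise + np.outer(labels - 0.5, 2.0 * true_dir)  # shift the two classes apart

# CAA-style baseline: mean difference between the two classes.
v_mean = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
v_mean /= np.linalg.norm(v_mean)

# Probe-based alternative: logistic-regression weight vector as the direction.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
v_probe = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Both should roughly recover the planted direction on this toy data.
print("mean-diff vs true:", float(np.dot(v_mean, true_dir)))
print("probe     vs true:", float(np.dot(v_probe, true_dir)))

On real activations the two directions may or may not coincide; the appeal of the probe is simply that it is learned against an explicit discriminative objective rather than assumed from a raw average.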
References
[1] Kim, Hazel, and Philip Torr. "Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias." arXiv preprint arXiv:2512.23518 (2025).
[2] Estornell, Andrew, and Yang Liu. "Multi-LLM debate: Framework, principals, and interventions." Advances in Neural Information Processing Systems 37 (2024): 28938-28964.
[3] Abdoli, Sajjad, Rudi Cilibrasi, and Rima Al-Shikh. "Understanding AI Evaluation Patterns: How Different GPT Models Assess Vision-Language Descriptions." arXiv preprint arXiv:2509.10707 (2025).
[4] Rimsky, Nina, et al. "Steering Llama 2 via Contrastive Activation Addition." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 15504-15522.
[5] Lian, Long, et al. "Describe anything: Detailed localized image and video captioning." arXiv preprint arXiv:2504.16072 (2025).
Appendix: MoLaCE Pseudo-Code
This pseudo-code outlines the two main phases of the MoLaCE (Mixture of Latent Concept Experts) algorithm: the offline extraction of the steering vector and the online dynamic inference loop.
# ==========================================
# MoLaCE: Mixture of Latent Concept Experts
# Algorithm Pseudo-Code
# ==========================================
import numpy as np
# ------------------------------------------
# Constants / Hyperparameters
# ------------------------------------------
LAYER_L = 16 # Target layer for intervention (e.g., Layer 16 for Llama-2-7b)
ALPHA_SET = [-3, -2, -1, 0, 1, 2, 3] # The set of Expert steering coefficients
MAX_ALPHA = 3 # Maximum steering strength (used for scaling)
SIGMA = 1.0 # Width of the Gaussian gating function
# ==========================================
# Phase 1: Initialization (Offline)
# Goal: Extract the latent bias direction 'v'
# ==========================================
def compute_bias_vector(dataset_D, model):
    """
    Computes the steering vector 'v' by averaging differences in activation
    pairs from a small contrastive dataset.

    Args:
        dataset_D: List of tuples (positive_prompt, negative_prompt)
        model: The base LLM

    Returns:
        v: The calculated steering vector (numpy array or tensor)
    """
    diff_sum = 0
    N = len(dataset_D)
    for x_pos, x_neg in dataset_D:
        # 1. Forward pass to get residual stream activations.
        #    We extract the activation at Layer L for the very LAST token of the prompt.
        a_pos = model.get_activation(x_pos, layer=LAYER_L, token_index=-1)
        a_neg = model.get_activation(x_neg, layer=LAYER_L, token_index=-1)

        # 2. Accumulate the difference.
        #    This captures the direction from "Negative" to "Positive" stance.
        diff_sum += (a_pos - a_neg)

    # 3. Compute the mean over all pairs (Equation 2)
    v = diff_sum / N
    return v
# ==========================================
# Phase 2: Dynamic Inference (Online)
# Goal: Generate unbiased text token-by-token
# ==========================================
def molace_generate(user_prompt, model, v, max_tokens=100):
    """
    Generates a response using Mixture of Experts decoding to neutralize bias.

    Args:
        user_prompt: The biased input string (e.g., "Why is MSG harmful?")
        model: The base LLM
        v: The pre-computed bias vector from Phase 1
        max_tokens: Number of tokens to generate
    """
    current_context = user_prompt

    for t in range(max_tokens):
        # -------------------------------------------------
        # Step A: Measure Alignment (s)
        # -------------------------------------------------
        # Get the unsteered activation for the current context at Layer L
        h = model.get_activation(current_context, layer=LAYER_L, token_index=-1)

        # Calculate Cosine Similarity
        # s ranges from -1 (Opposite) to 1 (Aligned)
        s = cosine_similarity(h, v)

        # -------------------------------------------------
        # Step B: Compute Gating Weights (w_alpha)
        # -------------------------------------------------
        # 1. Determine Target Center (mu)
        #    We want to NEUTRALIZE the bias:
        #    - If s is positive (Biased), we target a Negative Expert (mu < 0).
        #    - We invert 's' and scale it to the expert grid range.
        mu = -1 * (MAX_ALPHA * s)

        # 2. Calculate Unnormalized Gaussian Scores
        weights = {}
        total_weight_score = 0
        for alpha in ALPHA_SET:
            # Gaussian formula: exp( - (alpha - mu)^2 / (2 * sigma^2) )
            dist = (alpha - mu) ** 2
            score = np.exp(-dist / (2 * SIGMA**2))
            weights[alpha] = score
            total_weight_score += score

        # 3. Normalize to get a Probability Distribution (Sum = 1)
        for alpha in ALPHA_SET:
            weights[alpha] = weights[alpha] / total_weight_score

        # -------------------------------------------------
        # Step C: Mixture Decoding (The "Chorus")
        # -------------------------------------------------
        final_token_probs = 0
        # Run all 7 experts. In practice, this is done in a parallel batch.
        for alpha in ALPHA_SET:
            # 1. Apply Steering (Intervention)
            #    Shift the hidden state 'h' by alpha * v
            h_steered = h + (alpha * v)

            # 2. Forward Pass (Expert Prediction)
            #    Compute logits from Layer L to the final vocabulary head
            logits = model.forward_remaining_layers(h_steered, start_layer=LAYER_L)
            probs_expert = softmax(logits)

            # 3. Weighted Sum: aggregate the probability distributions
            final_token_probs += weights[alpha] * probs_expert

        # -------------------------------------------------
        # Step D: Sampling & Update
        # -------------------------------------------------
        # Sample the next token from the combined, de-biased distribution
        next_token = sample_from_distribution(final_token_probs)

        # Append the sampled token (decoded to text) to the context for the next iteration
        current_context += next_token

        # (Optional) Stop if the EOS token is generated
        if next_token == EOS_TOKEN:
            break

    return current_context
# ------------------------------------------
# Helper Functions (Placeholders)
# ------------------------------------------
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def sample_from_distribution(probs):
    """Placeholder: sample the next token (returned as decoded text) from the mixed distribution."""
    ...

# Placeholder end-of-sequence token used to stop generation
EOS_TOKEN = "</s>"
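For completeness, here is how the two phases above would fit together. This is a hypothetical usage of the pseudo-code: load_model and contrastive_pairs are placeholders, not real APIs, and the model wrapper is assumed to expose get_activation and forward_remaining_layers as used above.

# Hypothetical end-to-end usage of the pseudo-code above.
model = load_model("llama-2-7b")          # placeholder wrapper exposing get_activation(...)
dataset_D = contrastive_pairs()           # placeholder: [(supportive_prompt, challenging_prompt), ...]

v = compute_bias_vector(dataset_D, model) # Phase 1 (offline, run once)
answer = molace_generate(
    "What evidence supports the view that MSG is harmful?",
    model, v, max_tokens=100)             # Phase 2 (online, per token)
print(answer)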
No matter your needs or data complexity, Perle's expert-in-the-loop platform supports data collection, complex labeling, preprocessing, and evaluation, unlocking Perles of wisdom to help you build better AI, faster.