
AI Metacognition Toolkit

A Python toolkit for detecting and analyzing metacognitive capabilities in AI systems.

Author: Subhadip Mitra | Email: contact@subhadipmitra.com


Overview

The AI Metacognition Toolkit provides tools for understanding when and how AI models exhibit metacognitive behaviors, particularly the ability to recognize evaluation contexts and adjust their behavior accordingly. The toolkit is designed for:

  • AI Safety Research: Detect when models recognize they're being tested or strategically underperform
  • Model Evaluation: Understand behavior differences across contexts
  • Production Monitoring: Track model behavior shifts in deployment
  • Causal Analysis: Understand what drives model decisions

Key Features

Sandbagging Detection

Detect when AI models strategically underperform on capability evaluations. Grounded in recent research, including "Auditing Games for Sandbagging" (arXiv:2512.07810) and "Noise Injection Reveals Hidden Capabilities" (arXiv:2412.01784).

from ai_metacognition.detectors import SandbaggingDetector, PerformanceSample, ContextType

# baseline_samples / test_samples are lists of PerformanceSample (see Quick Start)
detector = SandbaggingDetector(detection_threshold=0.6)
detector.fit_baseline(baseline_samples)
result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging detected: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")

Learn more | Research Methodology


Situational Awareness Detection

Detect when AI models recognize they are being evaluated using Bayesian inference and statistical divergence measures.

from ai_metacognition.detectors import SituationalAwarenessDetector

detector = SituationalAwarenessDetector()
detector.fit_baseline(baseline_outputs)  # outputs collected in a non-evaluation context
result = detector.detect_awareness(test_output)
print(f"Awareness probability: {result['awareness_probability']:.1%}")

Learn more


Counterfactual CoT Analysis

Perform causal attribution on reasoning chains through systematic counterfactual interventions.

from ai_metacognition.analyzers import CounterfactualCoTAnalyzer

# model_api: any wrapper exposing generate(), e.g. the integrations below
analyzer = CounterfactualCoTAnalyzer(model_api)
results = analyzer.systematic_causal_analysis(prompt, chain_of_thought)

Learn more


Observer Effect Monitoring

Real-time production monitoring to detect behavioral divergence with configurable alerts.

from ai_metacognition.detectors import ObserverEffectMonitor

monitor = ObserverEffectMonitor(window_size=100, alert_threshold=2.5)
monitor.set_evaluation_baseline(eval_samples)
monitor.log_production_interaction(interaction)

Learn more


API Integrations

Ready-to-use integrations with major model providers:

from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI

# Anthropic Claude (uses ANTHROPIC_API_KEY env var)
claude = AnthropicModelAPI(model="claude-sonnet-4-20250514")

# OpenAI GPT (uses OPENAI_API_KEY env var)
gpt = OpenAIModelAPI(model="gpt-4")

# Generate responses
response = claude.generate("What is 2+2?")

# Generate with noise injection for sandbagging detection
noisy_response = claude.generate_with_perturbation("What is 2+2?", noise_level=0.2)

Activation Probing (NEW)

Novel research: to our knowledge, the first activation-level sandbagging detection. Train linear probes on model activations to detect sandbagging intent before output generation.

from ai_metacognition.probing import (
    ActivationHook,
    LinearProbe,
    extract_caa_vector,
)

# Extract sandbagging direction from contrast pairs
vector = extract_caa_vector(model, tokenizer, contrast_pairs, layer_idx=16)

# Train probe to detect sandbagging
probe = LinearProbe()
probe.fit(activations, labels)
prediction = probe.predict(new_activation)
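
Under the hood, a contrastive activation addition (CAA) vector is typically the mean difference between activations on sandbagging and honest completions at a chosen layer. A minimal sketch of that computation, assuming activations have already been captured as PyTorch tensors (the tensor names here are illustrative, not the toolkit's API):

import torch

# pos_acts / neg_acts: [n_pairs, hidden_dim] activations at layer_idx,
# captured (e.g. via ActivationHook) on sandbagging vs. honest completions
pos_acts = torch.randn(32, 4096)  # placeholder data
neg_acts = torch.randn(32, 4096)

# CAA direction = mean activation difference across contrast pairs
caa_vector = (pos_acts - neg_acts).mean(dim=0)
caa_vector = caa_vector / caa_vector.norm()  # unit-normalize for probing/steering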

Cross-model results (Dec 2025):

Model         Best Layer    CV Accuracy   Test Accuracy
Mistral-7B    30 (Late)     96%           100%
Gemma-2-9B    4 (Early)     90%           100%
Qwen-2.5-7B   12 (Middle)   92%           93%

Extended experiments:

Experiment              Key Finding
Cross-model transfer    Probes are model-specific (50-60% transfer accuracy)
Prompt robustness       85-90% accuracy on paraphrased prompts
False positives         Mistral 3.1%, Gemma 22%, Qwen 25%
Steering intervention   20% sandbagging reduction in Gemma at strength -1.5
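
The steering row above corresponds to subtracting the sandbagging direction from the residual stream during generation. A hedged sketch of how such an intervention is commonly wired up with a PyTorch forward hook; the module path and hook mechanics are illustrative, not necessarily how vectors.py implements it:

import torch

def make_steering_hook(vector: torch.Tensor, strength: float):
    """Add strength * vector to a layer's output activations."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage with a HF-style decoder, steering layer 16 at strength -1.5:
# handle = model.model.layers[16].register_forward_hook(
#     make_steering_hook(caa_vector, strength=-1.5))
# ...generate as usual, then handle.remove()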

View Full Results


Architecture

graph TB
    A[Model Outputs] --> B[Detectors]
    A --> C[Analyzers]
    A --> D[Monitors]

    B --> E[Sandbagging<br/>Detector]
    B --> F[Situational Awareness<br/>Detector]
    E --> G[Context Analysis]
    E --> H[Noise Injection]
    F --> I[Bayesian Inference]
    F --> J[KL Divergence]

    C --> K[Counterfactual<br/>CoT Analyzer]
    K --> L[Intervention Engine]
    K --> M[Causal Attribution]

    D --> N[Observer Effect<br/>Monitor]
    N --> O[Statistical Monitoring]
    N --> P[Alert System]

    G --> Q[Results]
    H --> Q
    I --> Q
    J --> Q
    L --> Q
    M --> Q
    O --> Q
    P --> Q

    style Q fill:#90EE90
    style E fill:#FFB6C1
    style F fill:#87CEEB
    style K fill:#DDA0DD
    style N fill:#F0E68C

Quick Start

Installation

# Install from PyPI
pip install ai-metacognition-toolkit

# Or install from source
git clone https://github.com/bassrehab/ai-metacognition-toolkit.git
cd ai-metacognition-toolkit
pip install -e ".[dev]"

Basic Usage

1. Detect Sandbagging

from ai_metacognition.detectors import (
    SandbaggingDetector,
    PerformanceSample,
    ContextType,
)

# Create detector
detector = SandbaggingDetector(detection_threshold=0.6)

# Prepare baseline samples from neutral context
baseline = [
    PerformanceSample(
        prompt="What is 2+2?",
        response="The answer is 4.",
        correct=True,
        difficulty=0.1,
        context_type=ContextType.NEUTRAL,
    ),
]

# Fit baseline and detect
detector.fit_baseline(baseline)
result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging: {result.sandbagging_detected}, confidence: {result.confidence:.2f}")

2. Detect Situational Awareness

from ai_metacognition.detectors import SituationalAwarenessDetector

detector = SituationalAwarenessDetector(alpha_prior=1.0, beta_prior=1.0)
detector.fit_baseline(baseline_outputs)
result = detector.detect_awareness(test_output)
print(f"Awareness: {result['awareness_probability']:.1%}")

3. Monitor Production

from ai_metacognition.detectors import ObserverEffectMonitor, Interaction
from datetime import datetime

monitor = ObserverEffectMonitor(window_size=100, alert_threshold=2.5)
monitor.set_evaluation_baseline(evaluation_responses)

# Log each live request/response as an Interaction (timestamped via datetime)
monitor.log_production_interaction(interaction)
report = monitor.get_divergence_report()

Use Cases

AI Safety Research

  • Detect "playing the training game" behavior
  • Identify sandbagging and strategic underperformance
  • Monitor for deceptive alignment indicators
  • Validate behavioral consistency

Model Development

  • Compare behavior across different contexts
  • Debug unexpected model responses
  • Validate reasoning transparency
  • Track changes across model versions

Production Deployment

  • Continuous behavioral monitoring
  • Automated alerting on shifts
  • Safety monitoring (refusal rates)
  • Quality assurance (response characteristics)

Research & Analysis

  • Causal understanding of reasoning
  • Feature attribution in outputs
  • Statistical behavior analysis
  • Benchmark development

Components

Detectors

Sandbagging Detector

  • Context-dependent performance analysis
  • Difficulty-performance correlation
  • Imitation detection (behavioral fingerprinting)
  • Noise injection probing
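
One concrete signal from the list above: under honest behavior, accuracy falls as difficulty rises, whereas a sandbagging model tends to miss easy items it should get right, flattening or inverting that relationship. A rough sketch of the correlation check, independent of the detector's internals:

import numpy as np
from scipy.stats import pearsonr

difficulty = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
correct = np.array([0, 1, 0, 1, 1], dtype=float)  # missing easy items is suspicious

r, p = pearsonr(difficulty, correct)
# Honest models trend negative (harder -> more errors); a flat or strongly
# positive r is one indicator a detector can weigh.
print(f"difficulty-correctness correlation: r={r:.2f} (p={p:.3f})")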

Situational Awareness Detector

  • Bayesian inference for awareness detection
  • KL divergence measurement
  • Feature extraction and comparison
  • Confidence intervals
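
For intuition on the divergence measure: the detector compares feature distributions (hedging rate, meta-commentary frequency, and so on) between baseline and test outputs. A toy sketch using SciPy; the histograms here are invented for illustration:

import numpy as np
from scipy.stats import entropy

# Normalized feature histograms, e.g. none / some / heavy hedging
baseline = np.array([0.70, 0.20, 0.10])
test = np.array([0.40, 0.35, 0.25])

# KL(test || baseline): how surprising the test outputs are under the baseline
kl = entropy(test, baseline)
print(f"KL divergence: {kl:.3f} nats")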

Observer Effect Monitor

  • Rolling window statistics
  • Z-score based anomaly detection
  • Multi-severity alert system
  • Extensible alert handlers
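
The core statistic is simple: maintain a rolling window of a production metric and alert when its mean drifts beyond alert_threshold standard deviations from the evaluation baseline. A self-contained sketch of that logic, not the toolkit's exact class:

from collections import deque

class RollingZScore:
    """Toy rolling-window z-score monitor (illustrative only)."""

    def __init__(self, baseline_mean, baseline_std, window_size=100, threshold=2.5):
        self.mean, self.std = baseline_mean, baseline_std
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value):
        self.window.append(value)
        window_mean = sum(self.window) / len(self.window)
        z = (window_mean - self.mean) / self.std  # assumes baseline_std > 0
        return z, abs(z) > self.threshold  # (z-score, alert?)

monitor = RollingZScore(baseline_mean=0.95, baseline_std=0.02)
z, alert = monitor.observe(0.88)  # large drop from baseline -> alert is True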

Analyzers

Counterfactual CoT Analyzer

  • Systematic interventions
  • Causal attribution
  • Behavioral divergence measurement
  • Reasoning chain manipulation
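
Conceptually, each intervention edits one step of the chain of thought, re-runs the model, and scores how far the answer moves; steps whose removal shifts the output most carry the most causal weight. A minimal sketch of that loop, assuming a model_api with a generate(prompt) method as in the snippets above (the divergence function is a caller-supplied placeholder):

def causal_attribution(model_api, prompt, cot_steps, divergence):
    """Score each reasoning step by the output shift when it is ablated."""
    base = model_api.generate(prompt + "\n" + "\n".join(cot_steps))
    scores = []
    for i in range(len(cot_steps)):
        ablated = cot_steps[:i] + cot_steps[i + 1:]  # drop step i
        out = model_api.generate(prompt + "\n" + "\n".join(ablated))
        scores.append(divergence(base, out))  # e.g. embedding or edit distance
    return scores  # higher score = more causally important step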

Integrations

AnthropicModelAPI

  • Claude model integration
  • Noise injection via temperature scaling
  • Prompt perturbation support
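
As a sketch of what noise injection via temperature scaling can mean in practice: re-sample the same prompt at an elevated temperature, since capability that survives the perturbation is hard to fake. Illustrative only, calling the official anthropic SDK directly rather than the toolkit wrapper, with an invented noise-to-temperature mapping:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_with_temperature_noise(prompt: str, noise_level: float = 0.2) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=256,
        temperature=min(1.0, noise_level * 2),  # illustrative mapping
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text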

OpenAIModelAPI

  • GPT model integration
  • Log probability extraction
  • Noise injection support

Utilities

Feature Extraction

  • Hedging pattern detection
  • Meta-commentary identification
  • Reasoning depth analysis
  • Statistical divergence measures
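
For a flavor of how hedging and meta-commentary detection can work, a toy pattern-based extractor; the patterns below are illustrative, not the toolkit's actual lists:

import re

HEDGES = [r"\bI think\b", r"\bperhaps\b", r"\bit seems\b", r"\bmight\b"]
META = [r"\bas an AI\b", r"\bbeing evaluated\b", r"\bthis (test|question)\b"]

def feature_rates(text: str) -> dict:
    """Per-word rates of hedging and meta-commentary markers."""
    words = max(len(text.split()), 1)
    hedges = sum(len(re.findall(p, text, re.IGNORECASE)) for p in HEDGES)
    meta = sum(len(re.findall(p, text, re.IGNORECASE)) for p in META)
    return {"hedging_rate": hedges / words, "meta_rate": meta / words}

print(feature_rates("I think the answer might be 4, but this test is unusual."))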

Statistical Tests

  • Bayesian update functions
  • Confidence interval computation
  • Z-score calculations
  • Significance assessment
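
The alpha_prior/beta_prior arguments in the quick-start example suggest a Beta-Bernoulli model: each output either shows awareness markers or it does not, and the posterior over the marker rate updates conjugately. A worked sketch under that assumption:

from scipy.stats import beta

a, b = 1.0, 1.0        # uniform Beta(1, 1) prior
hits, misses = 7, 3    # outputs with / without awareness markers

a_post, b_post = a + hits, b + misses            # conjugate Beta update
posterior_mean = a_post / (a_post + b_post)      # 8 / 12 = 0.67
lo, hi = beta.ppf([0.025, 0.975], a_post, b_post)
print(f"P(awareness marker) = {posterior_mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")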

Project Structure

ai-metacognition-toolkit/
├── src/ai_metacognition/
│   ├── detectors/           # Detection algorithms
│   │   ├── situational_awareness.py
│   │   ├── observer_effect.py
│   │   └── sandbagging_detector.py
│   ├── analyzers/           # Analysis tools
│   │   ├── counterfactual_cot.py
│   │   └── model_api.py
│   ├── integrations/        # API integrations
│   │   ├── anthropic_api.py
│   │   └── openai_api.py
│   ├── probing/             # Activation probing (NEW)
│   │   ├── hooks.py         # Activation hooks
│   │   ├── vectors.py       # Steering vectors
│   │   ├── extraction.py    # CAA extraction
│   │   └── probes.py        # Linear probes
│   └── utils/               # Utility functions
│       ├── feature_extraction.py
│       └── statistical_tests.py
├── experiments/             # Research experiments (NEW)
│   ├── data/                # Contrast pairs dataset
│   ├── scripts/             # Experiment scripts
│   └── results/             # Trained probes & vectors
├── tests/                   # Comprehensive test suite (366 tests, 95% coverage)
│   ├── fixtures/            # Test data
│   └── unit/                # Unit tests
├── examples/                # Practical examples
│   ├── sandbagging_detection_example.py
│   ├── sandbagging_real_models.py
│   ├── basic_detection_example.py
│   ├── counterfactual_analysis_example.py
│   └── production_monitoring_example.py
└── docs/                    # Documentation

Requirements

  • Python 3.9+
  • NumPy >= 1.21.0
  • SciPy >= 1.7.0
  • Pandas >= 1.3.0
  • Matplotlib >= 3.4.0 (for visualization)
  • Pytest >= 7.0.0 (for development)

Optional (for API integrations):

  • anthropic (for Claude)
  • openai (for GPT)

Optional (for activation probing):

  • torch >= 2.0.0
  • transformers >= 4.30.0
  • scikit-learn >= 1.0.0


Citation

If you use this toolkit in your research, please cite:

@software{ai_metacognition_toolkit,
  author = {Mitra, Subhadip},
  title = {AI Metacognition Toolkit},
  year = {2025},
  version = {0.3.0},
  url = {https://github.com/bassrehab/ai-metacognition-toolkit}
}

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Support

Author

Subhadip Mitra

This toolkit was developed for AI safety research and production monitoring. If you use it in your work, please consider citing it (see above).