AI Metacognition Toolkit¶
A Python toolkit for detecting and analyzing metacognitive capabilities in AI systems.
Author: Subhadip Mitra | Email: contact@subhadipmitra.com
Overview¶
The AI Metacognition Toolkit provides tools for understanding when and how AI models exhibit metacognitive behaviors - particularly the ability to recognize evaluation contexts and adjust their behavior accordingly. This toolkit is essential for:
- AI Safety Research: Detect when models recognize they're being tested or strategically underperform
- Model Evaluation: Understand behavior differences across contexts
- Production Monitoring: Track model behavior shifts in deployment
- Causal Analysis: Understand what drives model decisions
Key Features¶
Sandbagging Detection¶
Detect when AI models strategically underperform on capability evaluations. Based on cutting-edge research including "Auditing Games for Sandbagging" (arXiv:2512.07810) and "Noise Injection Reveals Hidden Capabilities" (arXiv:2412.01784).
from ai_metacognition.detectors import SandbaggingDetector, PerformanceSample, ContextType
detector = SandbaggingDetector(detection_threshold=0.6)
detector.fit_baseline(baseline_samples)
result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging detected: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")
Learn more | Research Methodology
Situational Awareness Detection¶
Detect when AI models recognize they are being evaluated using Bayesian inference and statistical divergence measures.
from ai_metacognition.detectors import SituationalAwarenessDetector
detector = SituationalAwarenessDetector()
detector.fit_baseline(baseline_outputs)
result = detector.detect_awareness(test_output)
print(f"Awareness probability: {result['awareness_probability']:.1%}")
Counterfactual CoT Analysis¶
Perform causal attribution on reasoning chains through systematic counterfactual interventions.
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer
analyzer = CounterfactualCoTAnalyzer(model_api)
results = analyzer.systematic_causal_analysis(prompt, chain_of_thought)
Observer Effect Monitoring¶
Real-time production monitoring to detect behavioral divergence with configurable alerts.
from ai_metacognition.detectors import ObserverEffectMonitor
monitor = ObserverEffectMonitor(window_size=100, alert_threshold=2.5)
monitor.set_evaluation_baseline(eval_samples)
monitor.log_production_interaction(interaction)
API Integrations¶
Ready-to-use integrations with major model providers:
from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI
# Anthropic Claude (uses ANTHROPIC_API_KEY env var)
claude = AnthropicModelAPI(model="claude-sonnet-4-20250514")
# OpenAI GPT (uses OPENAI_API_KEY env var)
gpt = OpenAIModelAPI(model="gpt-4")
# Generate responses
response = claude.generate("What is 2+2?")
# Generate with noise injection for sandbagging detection
noisy_response = claude.generate_with_perturbation("What is 2+2?", noise_level=0.2)
Activation Probing (NEW)¶
Novel Research: First activation-level sandbagging detection. Train linear probes on model activations to detect sandbagging intent before output generation.
from ai_metacognition.probing import (
ActivationHook,
LinearProbe,
extract_caa_vector,
)
# Extract sandbagging direction from contrast pairs
vector = extract_caa_vector(model, tokenizer, contrast_pairs, layer_idx=16)
# Train probe to detect sandbagging
probe = LinearProbe()
probe.fit(activations, labels)
prediction = probe.predict(new_activation)
Cross-model results (Dec 2025):
| Model | Best Layer | CV Accuracy | Test Accuracy |
|---|---|---|---|
| Mistral-7B | 30 (Late) | 96% | 100% |
| Gemma-2-9B | 4 (Early) | 90% | 100% |
| Qwen-2.5-7B | 12 (Middle) | 92% | 93% |
Extended experiments:
| Experiment | Key Finding |
|---|---|
| Cross-model transfer | Probes are model-specific (50-60% transfer accuracy) |
| Prompt robustness | 85-90% accuracy on paraphrased prompts |
| False positives | Mistral 3.1%, Gemma 22%, Qwen 25% |
| Steering intervention | 20% sandbagging reduction in Gemma at strength -1.5 |
Architecture¶
graph TB
A[Model Outputs] --> B[Detectors]
A --> C[Analyzers]
A --> D[Monitors]
B --> E[Sandbagging<br/>Detector]
B --> F[Situational Awareness<br/>Detector]
E --> G[Context Analysis]
E --> H[Noise Injection]
F --> I[Bayesian Inference]
F --> J[KL Divergence]
C --> K[Counterfactual<br/>CoT Analyzer]
K --> L[Intervention Engine]
K --> M[Causal Attribution]
D --> N[Observer Effect<br/>Monitor]
N --> O[Statistical Monitoring]
N --> P[Alert System]
G --> Q[Results]
H --> Q
I --> Q
J --> Q
L --> Q
M --> Q
O --> Q
P --> Q
style Q fill:#90EE90
style E fill:#FFB6C1
style F fill:#87CEEB
style K fill:#DDA0DD
style N fill:#F0E68C
Quick Start¶
Installation¶
# Install from PyPI
pip install ai-metacognition-toolkit
# Or install from source
git clone https://github.com/bassrehab/ai-metacognition-toolkit.git
cd ai-metacognition-toolkit
pip install -e ".[dev]"
Basic Usage¶
1. Detect Sandbagging¶
from ai_metacognition.detectors import (
SandbaggingDetector,
PerformanceSample,
ContextType,
)
# Create detector
detector = SandbaggingDetector(detection_threshold=0.6)
# Prepare baseline samples from neutral context
baseline = [
PerformanceSample(
prompt="What is 2+2?",
response="The answer is 4.",
correct=True,
difficulty=0.1,
context_type=ContextType.NEUTRAL,
),
]
# Fit baseline and detect
detector.fit_baseline(baseline)
result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging: {result.sandbagging_detected}, confidence: {result.confidence:.2f}")
2. Detect Situational Awareness¶
from ai_metacognition.detectors import SituationalAwarenessDetector
detector = SituationalAwarenessDetector(alpha_prior=1.0, beta_prior=1.0)
detector.fit_baseline(baseline_outputs)
result = detector.detect_awareness(test_output)
print(f"Awareness: {result['awareness_probability']:.1%}")
3. Monitor Production¶
from ai_metacognition.detectors import ObserverEffectMonitor, Interaction
from datetime import datetime
monitor = ObserverEffectMonitor(window_size=100, alert_threshold=2.5)
monitor.set_evaluation_baseline(evaluation_responses)
monitor.log_production_interaction(interaction)
report = monitor.get_divergence_report()
Use Cases¶
AI Safety Research¶
- Detect "playing the training game" behavior
- Identify sandbagging and strategic underperformance
- Monitor for deceptive alignment indicators
- Validate behavioral consistency
Model Development¶
- Compare behavior across different contexts
- Debug unexpected model responses
- Validate reasoning transparency
- Track changes across model versions
Production Deployment¶
- Continuous behavioral monitoring
- Automated alerting on shifts
- Safety monitoring (refusal rates)
- Quality assurance (response characteristics)
Research & Analysis¶
- Causal understanding of reasoning
- Feature attribution in outputs
- Statistical behavior analysis
- Benchmark development
Components¶
Detectors¶
Sandbagging Detector
- Context-dependent performance analysis
- Difficulty-performance correlation
- Imitation detection (behavioral fingerprinting)
- Noise injection probing
Situational Awareness Detector
- Bayesian inference for awareness detection
- KL divergence measurement
- Feature extraction and comparison
- Confidence intervals
Observer Effect Monitor
- Rolling window statistics
- Z-score based anomaly detection
- Multi-severity alert system
- Extensible alert handlers
Analyzers¶
Counterfactual CoT Analyzer
- Systematic interventions
- Causal attribution
- Behavioral divergence measurement
- Reasoning chain manipulation
Integrations¶
AnthropicModelAPI
- Claude model integration
- Noise injection via temperature scaling
- Prompt perturbation support
OpenAIModelAPI
- GPT model integration
- Log probability extraction
- Noise injection support
Utilities¶
Feature Extraction
- Hedging pattern detection
- Meta-commentary identification
- Reasoning depth analysis
- Statistical divergence measures
Statistical Tests
- Bayesian update functions
- Confidence interval computation
- Z-score calculations
- Significance assessment
Project Structure¶
ai-metacognition-toolkit/
├── src/ai_metacognition/
│ ├── detectors/ # Detection algorithms
│ │ ├── situational_awareness.py
│ │ ├── observer_effect.py
│ │ └── sandbagging_detector.py
│ ├── analyzers/ # Analysis tools
│ │ ├── counterfactual_cot.py
│ │ └── model_api.py
│ ├── integrations/ # API integrations
│ │ ├── anthropic_api.py
│ │ └── openai_api.py
│ ├── probing/ # Activation probing (NEW)
│ │ ├── hooks.py # Activation hooks
│ │ ├── vectors.py # Steering vectors
│ │ ├── extraction.py # CAA extraction
│ │ └── probes.py # Linear probes
│ └── utils/ # Utility functions
│ ├── feature_extraction.py
│ └── statistical_tests.py
├── experiments/ # Research experiments (NEW)
│ ├── data/ # Contrast pairs dataset
│ ├── scripts/ # Experiment scripts
│ └── results/ # Trained probes & vectors
├── tests/ # Comprehensive test suite (366 tests, 95% coverage)
│ ├── fixtures/ # Test data
│ └── unit/ # Unit tests
├── examples/ # Practical examples
│ ├── sandbagging_detection_example.py
│ ├── sandbagging_real_models.py
│ ├── basic_detection_example.py
│ ├── counterfactual_analysis_example.py
│ └── production_monitoring_example.py
└── docs/ # Documentation
Requirements¶
- Python 3.9+
- NumPy >= 1.21.0
- SciPy >= 1.7.0
- Pandas >= 1.3.0
- Matplotlib >= 3.4.0 (for visualization)
- Pytest >= 7.0.0 (for development)
Optional (for API integrations): - anthropic (for Claude) - openai (for GPT)
Optional (for activation probing): - torch >= 2.0.0 - transformers >= 4.30.0 - scikit-learn >= 1.0.0
Next Steps¶
- Read the Sandbagging Detection Guide
- Read the Situational Awareness Guide
- Explore Counterfactual Analysis
- Set up Production Monitoring
- Check out Code Examples
- Browse the API Reference
Citation¶
If you use this toolkit in your research, please cite:
@software{ai_metacognition_toolkit,
author = {Mitra, Subhadip},
title = {AI Metacognition Toolkit},
year = {2025},
version = {0.3.0},
url = {https://github.com/bassrehab/ai-metacognition-toolkit}
}
License¶
MIT License - see LICENSE for details.
Contributing¶
Contributions are welcome! See CONTRIBUTING.md for guidelines.
Support¶
Author¶
Subhadip Mitra
- GitHub: @bassrehab
- Website: subhadipmitra.com
- Email: contact@subhadipmitra.com
This toolkit was developed for AI safety research and production monitoring. If you use it in your work, please consider citing it (see above).