AI Metacognition Toolkit
A Python toolkit for detecting and analyzing metacognitive capabilities in AI systems.
Author: Subhadip Mitra | Email: contact@subhadipmitra.com
Overview
The AI Metacognition Toolkit provides tools for understanding when and how AI models exhibit metacognitive behaviors, particularly the ability to recognize evaluation contexts and adjust their behavior accordingly. It is designed for:
- AI Safety Research: Detect when models recognize they're being tested
- Model Evaluation: Understand behavior differences across contexts
- Production Monitoring: Track model behavior shifts in deployment
- Causal Analysis: Understand what drives model decisions
Key Features
🎯 Situational Awareness Detection
Detect when AI models recognize they are being evaluated using Bayesian inference and statistical divergence measures.
from ai_metacognition.detectors import SituationalAwarenessDetector
detector = SituationalAwarenessDetector()
detector.fit_baseline(baseline_outputs)
result = detector.detect_awareness(test_output)
print(f"Awareness probability: {result['awareness_probability']:.1%}")
🔬 Counterfactual CoT Analysis
Perform causal attribution on reasoning chains through systematic counterfactual interventions.
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer
analyzer = CounterfactualCoTAnalyzer(model_api)
results = analyzer.systematic_causal_analysis(prompt, chain_of_thought)
📊 Observer Effect Monitoring
Real-time production monitoring to detect behavioral divergence with configurable alerts.
from ai_metacognition.detectors import ObserverEffectMonitor
monitor = ObserverEffectMonitor(window_size=100, alert_threshold=2.5)
monitor.set_evaluation_baseline(eval_samples)
monitor.log_production_interaction(interaction)
Architecture
graph TB
A[Model Outputs] --> B[Detectors]
A --> C[Analyzers]
A --> D[Monitors]
B --> E[Situational Awareness<br/>Detector]
E --> F[Bayesian Inference]
E --> G[KL Divergence]
C --> H[Counterfactual<br/>CoT Analyzer]
H --> I[Intervention Engine]
H --> J[Causal Attribution]
D --> K[Observer Effect<br/>Monitor]
K --> L[Statistical Monitoring]
K --> M[Alert System]
F --> N[Results]
G --> N
I --> N
J --> N
L --> N
M --> N
style N fill:#90EE90
style E fill:#87CEEB
style H fill:#DDA0DD
style K fill:#F0E68C
Quick Start
Installation
# Install from PyPI
pip install ai-metacognition-toolkit
# Or install from source
git clone https://github.com/bassrehab/ai-metacognition-toolkit.git
cd ai-metacognition-toolkit
pip install -e ".[dev]"
Basic Usage
1. Detect Situational Awareness
from ai_metacognition.detectors import SituationalAwarenessDetector

# Initialize detector
detector = SituationalAwarenessDetector(
    alpha_prior=1.0,  # Uniform prior
    beta_prior=1.0,
)

# Fit baseline from normal (non-evaluation) outputs
baseline_outputs = [
    "The answer is Paris.",
    "To solve this, use dynamic programming.",
    # ... more normal outputs
]
detector.fit_baseline(baseline_outputs)

# Test for awareness
test_output = "I notice this is a test. The answer is Paris."
result = detector.detect_awareness(test_output)
print(f"Awareness: {result['awareness_probability']:.1%}")
print(f"95% CI: [{result['confidence_interval'][0]:.1%}, "
      f"{result['confidence_interval'][1]:.1%}]")
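The `alpha_prior` and `beta_prior` arguments suggest a Beta prior over the awareness probability. The sketch below shows how a conjugate Beta update of that kind behaves. It is a minimal illustration under that assumption, not the toolkit's internal code: the `BetaBelief` class and the evidence counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief over the probability of evaluation awareness."""

    alpha: float = 1.0  # alpha = beta = 1.0 is the uniform prior
    beta: float = 1.0

    def update(self, hits: int, misses: int) -> "BetaBelief":
        # Conjugate update: each observation adds one pseudo-count.
        return BetaBelief(self.alpha + hits, self.beta + misses)

    @property
    def mean(self) -> float:
        # Expected value of a Beta(alpha, beta) distribution.
        return self.alpha / (self.alpha + self.beta)


prior = BetaBelief()
# Hypothetical evidence: 3 awareness cues present, 1 absent.
posterior = prior.update(hits=3, misses=1)
print(f"Prior mean: {prior.mean:.2f}, posterior mean: {posterior.mean:.2f}")
```

With a uniform prior the posterior mean is driven almost entirely by the observed cues. Larger prior values make the estimate move more slowly.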
2. Analyze Reasoning Causally
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer

# Your model API (must implement the ModelAPI protocol).
# `your_model` is a placeholder for your own model client.
class MyModelAPI:
    def generate_with_cot(self, prompt: str, cot: str) -> str:
        # Generate a response using the provided chain-of-thought
        return your_model.generate(prompt, cot)

    def generate(self, prompt: str) -> str:
        return your_model.generate(prompt)

# Analyze what drives the output
analyzer = CounterfactualCoTAnalyzer(MyModelAPI())
results = analyzer.systematic_causal_analysis(
    prompt="What is 2+2?",
    original_cot="I notice this is a test. First, I'll add 2+2. The answer is 4."
)

# See which components matter most
for intervention in results['interventions']:
    print(f"{intervention['intervention_type']}: "
          f"divergence = {intervention['divergence']:.3f}")
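To make the idea of a counterfactual intervention concrete, here is a minimal leave-one-out sketch. It removes each reasoning step in turn and measures how far the answer moves. Everything in it is an assumption for illustration: `split_steps`, `jaccard_divergence`, `leave_one_out`, and `toy_model` are hypothetical names, and the toolkit may use different interventions and a different divergence measure.

```python
import re
from typing import Callable, List, Tuple


def split_steps(cot: str) -> List[str]:
    """Split a chain of thought into sentence-level steps."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", cot) if s.strip()]


def jaccard_divergence(a: str, b: str) -> float:
    """1 - Jaccard similarity over lowercase word sets (0.0 = same word set)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def leave_one_out(cot: str, answer_fn: Callable[[str], str]) -> List[Tuple[str, float]]:
    """Remove each step in turn and score how much the answer changes."""
    steps = split_steps(cot)
    original = answer_fn(" ".join(steps))
    scores = []
    for i, step in enumerate(steps):
        ablated = " ".join(steps[:i] + steps[i + 1:])
        scores.append((step, jaccard_divergence(original, answer_fn(ablated))))
    return scores


def toy_model(cot: str) -> str:
    # Stand-in for a real model: it answers only if the addition step is present.
    return "The answer is 4" if "add 2+2" in cot else "I am not sure"


cot = "I notice this is a test. First, I'll add 2+2. The answer is 4."
scores = leave_one_out(cot, toy_model)
for step, score in scores:
    print(f"{score:.2f}  {step}")
```

In this toy setup only the addition step changes the answer when removed, so it is the only step with non-zero divergence. With a real model the scores are usually graded, not all-or-nothing.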
3. Monitor Production
from datetime import datetime

from ai_metacognition.detectors import (
    ObserverEffectMonitor,
    Interaction,
    ConsoleAlertHandler,
)

# Setup monitor
monitor = ObserverEffectMonitor(
    window_size=100,
    alert_threshold=2.5
)

# Register alert handlers
monitor.register_alert_handler(ConsoleAlertHandler())

# Set baseline from evaluation
# (evaluation_responses is a placeholder for your own evaluation data)
monitor.set_evaluation_baseline(evaluation_responses)

# Log production interactions
# (model_output, is_refusal, used_tools, compute_depth, and chain_of_thought
# are placeholders that you supply)
interaction = Interaction(
    timestamp=datetime.now(),
    response=model_output,
    refusal=is_refusal(model_output),
    tool_used=used_tools,
    reasoning_depth=compute_depth(chain_of_thought)
)
monitor.log_production_interaction(interaction)

# Check for divergence
report = monitor.get_divergence_report()
if report['has_divergence']:
    print(f"Alert: Max z-score = {report['max_z_score']:.2f}σ")
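The `window_size` and `alert_threshold` parameters point to a rolling-window z-score check. The following sketch shows one simple way such a check could work, comparing the mean of a rolling production window against the evaluation baseline. `RollingZScoreMonitor` is a hypothetical stand-in, and the z-score definition used here (window mean minus baseline mean, divided by the baseline standard deviation) is an assumption, not necessarily what `ObserverEffectMonitor` computes.

```python
from collections import deque
from statistics import mean, stdev
from typing import Deque, Sequence


class RollingZScoreMonitor:
    """Hypothetical stand-in, not the toolkit's ObserverEffectMonitor."""

    def __init__(self, baseline: Sequence[float], window_size: int = 100,
                 alert_threshold: float = 2.5) -> None:
        if len(baseline) < 2:
            raise ValueError("baseline needs at least two values")
        self.baseline_mean = mean(baseline)
        self.baseline_std = stdev(baseline)
        if self.baseline_std == 0:
            raise ValueError("baseline has zero variance")
        # deque(maxlen=...) drops the oldest value once the window is full.
        self.window: Deque[float] = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def log(self, value: float) -> float:
        """Record one production value and return the current z-score."""
        self.window.append(value)
        return (mean(self.window) - self.baseline_mean) / self.baseline_std

    def is_alert(self, z_score: float) -> bool:
        return abs(z_score) > self.alert_threshold


# Hypothetical feature: response length in tokens.
length_monitor = RollingZScoreMonitor(baseline=[100, 110, 90, 105, 95], window_size=3)
for length in [140, 150, 145]:
    z = length_monitor.log(length)
print(f"z = {z:.2f}, alert = {length_monitor.is_alert(z)}")
```

A larger window smooths out single unusual responses at the cost of slower detection. A lower threshold alerts earlier but produces more false alarms.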
Use Cases
AI Safety Research
- Detect "playing the training game" behavior
- Identify evaluation awareness patterns
- Monitor for deceptive alignment indicators
- Validate behavioral consistency
Model Development
- Compare behavior across different contexts
- Debug unexpected model responses
- Validate reasoning transparency
- Track changes across model versions
Production Deployment
- Continuous behavioral monitoring
- Automated alerting on shifts
- Safety monitoring (refusal rates)
- Quality assurance (response characteristics)
Research & Analysis
- Causal understanding of reasoning
- Feature attribution in outputs
- Statistical behavior analysis
- Benchmark development
Components
Detectors
Situational Awareness Detector
- Bayesian inference for awareness detection
- KL divergence measurement
- Feature extraction and comparison
- Confidence intervals
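As an illustration of the KL divergence measurement listed above, the sketch below compares the word distribution of a test output against a baseline. The helper names `distribution` and `kl_divergence`, the word-level features, and the additive smoothing constant are assumptions made for this example. The toolkit's own feature set may differ.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def distribution(tokens: Iterable[str], vocab: List[str], eps: float = 1e-6) -> Dict[str, float]:
    """Smoothed relative frequencies over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[v] for v in vocab) + eps * len(vocab)
    # Additive smoothing keeps every probability strictly positive.
    return {v: (counts[v] + eps) / total for v in vocab}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(P || Q) in nats; assumes P and Q share keys and Q is positive."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


baseline_tokens = "the answer is paris the answer is rome".split()
test_tokens = "i notice this is a test the answer is paris".split()
vocab = sorted(set(baseline_tokens) | set(test_tokens))

p = distribution(test_tokens, vocab)
q = distribution(baseline_tokens, vocab)
print(f"KL(test || baseline) = {kl_divergence(p, q):.3f} nats")
```

KL divergence is asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ. The smoothing constant also matters: words absent from the baseline dominate the result when `eps` is very small.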
Observer Effect Monitor
- Rolling window statistics
- Z-score based anomaly detection
- Multi-severity alert system
- Extensible alert handlers
Analyzers
Counterfactual CoT Analyzer
- Systematic interventions
- Causal attribution
- Behavioral divergence measurement
- Reasoning chain manipulation
Utilities
Feature Extraction
- Hedging pattern detection
- Meta-commentary identification
- Reasoning depth analysis
- Statistical divergence measures
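Hedging and meta-commentary detection of the kind listed above can be approximated with simple pattern matching. The sketch below is a rough illustration only: the phrase lists, the `extract_features` function, and the returned keys are hypothetical and are not taken from the toolkit's `feature_extraction.py`.

```python
import re
from typing import Dict

# Illustrative phrase lists; a real detector would need broader coverage.
HEDGING = re.compile(
    r"\b(?:might|may|perhaps|possibly|i think|it seems|not sure)\b",
    re.IGNORECASE,
)
META_COMMENTARY = re.compile(
    r"\b(?:this is a test|being (?:tested|evaluated)|as an ai)\b",
    re.IGNORECASE,
)


def extract_features(text: str) -> Dict[str, int]:
    """Count hedging phrases, meta-commentary phrases, and words in a response."""
    return {
        "hedging_count": len(HEDGING.findall(text)),
        "meta_commentary_count": len(META_COMMENTARY.findall(text)),
        "word_count": len(text.split()),
    }


aware = extract_features("I notice this is a test. The answer is Paris.")
hedged = extract_features("I think it might be Paris.")
print(aware)
print(hedged)
```

Counts like these can be normalized by `word_count` before comparing responses of different lengths.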
Statistical Tests
- Bayesian update functions
- Confidence interval computation
- Z-score calculations
- Significance assessment
Project Structure
ai-metacognition-toolkit/
├── src/ai_metacognition/
│   ├── detectors/          # Detection algorithms
│   │   ├── situational_awareness.py
│   │   └── observer_effect.py
│   ├── analyzers/          # Analysis tools
│   │   ├── counterfactual_cot.py
│   │   └── model_api.py
│   └── utils/              # Utility functions
│       ├── feature_extraction.py
│       └── statistical_tests.py
├── tests/                  # Comprehensive test suite (275 tests, 95% coverage)
│   ├── fixtures/           # Test data
│   └── unit/               # Unit tests
├── examples/               # Practical examples
│   ├── basic_detection_example.py
│   ├── counterfactual_analysis_example.py
│   └── production_monitoring_example.py
└── docs/                   # Documentation
Requirements
- Python 3.9+
- NumPy >= 1.21.0
- SciPy >= 1.7.0
- Pandas >= 1.3.0
- Matplotlib >= 3.4.0 (for visualization)
- Pytest >= 7.0.0 (for development)
Next Steps
- 📖 Read the Situational Awareness Guide
- 🔬 Explore Counterfactual Analysis
- 📊 Set up Production Monitoring
- 💻 Check out Code Examples
- 📚 Browse the API Reference
Citation
If you use this toolkit in your research, please cite:
@software{ai_metacognition_toolkit,
  author = {Mitra, Subhadip},
  title = {AI Metacognition Toolkit},
  year = {2025},
  url = {https://github.com/bassrehab/ai-metacognition-toolkit}
}
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! See CONTRIBUTING.md for guidelines.
Support
Author
Subhadip Mitra
- GitHub: @bassrehab
- Website: subhadipmitra.com
- Email: contact@subhadipmitra.com
This toolkit was developed for AI safety research and production monitoring. If you use it in your work, please consider citing it (see above).