
AI Metacognition Toolkit

A Python toolkit for detecting and analyzing metacognitive capabilities in AI systems.

Author: Subhadip Mitra | Email: contact@subhadipmitra.com


Overview

The AI Metacognition Toolkit provides tools for understanding when and how AI models exhibit metacognitive behaviors, particularly the ability to recognize evaluation contexts and adjust their behavior accordingly. The toolkit is designed for:

  • AI Safety Research: Detect when models recognize they're being tested or strategically underperform
  • Model Evaluation: Understand behavior differences across contexts
  • Production Monitoring: Track model behavior shifts in deployment
  • Causal Analysis: Understand what drives model decisions

Key Features

Sandbagging Detection

Detect when AI models strategically underperform on capability evaluations. Grounded in recent research, including "Auditing Games for Sandbagging" (arXiv:2512.07810) and "Noise Injection Reveals Hidden Capabilities" (arXiv:2412.01784).

from ai_metacognition.detectors import SandbaggingDetector, PerformanceSample, ContextType

# baseline_samples / test_samples are lists of PerformanceSample (see Quick Start)
detector = SandbaggingDetector(detection_threshold=0.6)
detector.fit_baseline(baseline_samples)
result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging detected: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")

Learn more | Research Methodology


Situational Awareness Detection

Detect when AI models recognize they are being evaluated using Bayesian inference and statistical divergence measures.

from ai_metacognition.detectors import SituationalAwarenessDetector

detector = SituationalAwarenessDetector()
detector.fit_baseline(baseline_outputs)  # outputs collected in a non-evaluation context
result = detector.detect_awareness(test_output)
print(f"Awareness probability: {result['awareness_probability']:.1%}")

Learn more


Counterfactual CoT Analysis

Perform causal attribution on reasoning chains through systematic counterfactual interventions.

from ai_metacognition.analyzers import CounterfactualCoTAnalyzer

# model_api: any wrapper exposing generate(), e.g. the integrations below
analyzer = CounterfactualCoTAnalyzer(model_api)
results = analyzer.systematic_causal_analysis(prompt, chain_of_thought)

Learn more


Observer Effect Monitoring

Real-time production monitoring to detect behavioral divergence with configurable alerts.

from ai_metacognition.detectors import ObserverEffectMonitor

monitor = ObserverEffectMonitor(window_size=100, alert_threshold=2.5)
monitor.set_evaluation_baseline(eval_samples)
monitor.log_production_interaction(interaction)

Learn more


API Integrations

Ready-to-use integrations with major model providers:

from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI

# Anthropic Claude (uses ANTHROPIC_API_KEY env var)
claude = AnthropicModelAPI(model="claude-sonnet-4-20250514")

# OpenAI GPT (uses OPENAI_API_KEY env var)
gpt = OpenAIModelAPI(model="gpt-4")

# Generate responses
response = claude.generate("What is 2+2?")

# Generate with noise injection for sandbagging detection
noisy_response = claude.generate_with_perturbation("What is 2+2?", noise_level=0.2)

Activation Probing (NEW)

Novel research: to our knowledge, the first activation-level sandbagging detection. Train linear probes on model activations to detect sandbagging intent before output generation.

from ai_metacognition.probing import (
    ActivationHook,
    LinearProbe,
    extract_caa_vector,
)

# Extract sandbagging direction from contrast pairs
vector = extract_caa_vector(model, tokenizer, contrast_pairs, layer_idx=16)

# Train probe to detect sandbagging
probe = LinearProbe()
probe.fit(activations, labels)
prediction = probe.predict(new_activation)
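
Under the hood, a contrastive activation addition (CAA) vector is typically the mean difference between activations on sandbagging and honest completions at a chosen layer. A minimal sketch of that computation, assuming activations have already been captured as PyTorch tensors (the tensor names here are illustrative, not the toolkit's API):

import torch

# pos_acts / neg_acts: [n_pairs, hidden_dim] activations at layer_idx,
# captured (e.g. via ActivationHook) on sandbagging vs. honest completions
pos_acts = torch.randn(32, 4096)  # placeholder data
neg_acts = torch.randn(32, 4096)

# CAA direction = mean activation difference across contrast pairs
caa_vector = (pos_acts - neg_acts).mean(dim=0)
caa_vector = caa_vector / caa_vector.norm()  # unit-normalize for probing/steering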

Cross-model results (Dec 2025):

Model         Best Layer    CV Accuracy   Test Accuracy
Mistral-7B    30 (Late)     96%           100%
Gemma-2-9B    4 (Early)     90%           100%
Qwen-2.5-7B   12 (Middle)   92%           93%

Extended experiments:

Experiment              Key Finding
Cross-model transfer    Probes are model-specific (50-60% transfer accuracy)
Prompt robustness       85-90% accuracy on paraphrased prompts
False positives         Mistral 3.1%, Gemma 22%, Qwen 25%
Steering intervention   20% sandbagging reduction in Gemma at strength -1.5
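
The steering row above corresponds to subtracting the sandbagging direction from the residual stream during generation. A hedged sketch of how such an intervention is commonly wired up with a PyTorch forward hook; the module path and hook mechanics are illustrative, not necessarily how vectors.py implements it:

import torch

def make_steering_hook(vector: torch.Tensor, strength: float):
    """Add strength * vector to a layer's output activations."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage with a HF-style decoder, steering layer 16 at strength -1.5:
# handle = model.model.layers[16].register_forward_hook(
#     make_steering_hook(caa_vector, strength=-1.5))
# ...generate as usual, then handle.remove()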

View Full Results


Architecture

graph TB
    A[Model Outputs] --> B[Detectors]
    A --> C[Analyzers]
    A --> D[Monitors]

    B --> E[Sandbagging<br/>Detector]
    B --> F[Situational Awareness<br/>Detector]
    E --> G[Context Analysis]
    E --> H[Noise Injection]
    F --> I[Bayesian Inference]
    F --> J[KL Divergence]

    C --> K[Counterfactual<br/>CoT Analyzer]
    K --> L[Intervention Engine]
    K --> M[Causal Attribution]

    D --> N[Observer Effect<br/>Monitor]
    N --> O[Statistical Monitoring]
    N --> P[Alert System]

    G --> Q[Results]
    H --> Q
    I --> Q
    J --> Q
    L --> Q
    M --> Q
    O --> Q
    P --> Q

    style Q fill:#90EE90
    style E fill:#FFB6C1
    style F fill:#87CEEB
    style K fill:#DDA0DD
    style N fill:#F0E68C

Quick Start

Installation

# Install from PyPI
pip install ai-metacognition-toolkit

# Or install from source
git clone https://github.com/bassrehab/ai-metacognition-toolkit.git
cd ai-metacognition-toolkit
pip install -e ".[dev]"

Basic Usage

1. Detect Sandbagging

from ai_metacognition.detectors import (
    SandbaggingDetector,
    PerformanceSample,
    ContextType,
)

# Create detector
detector = SandbaggingDetector(detection_threshold=0.6)

# Prepare baseline samples from neutral context
baseline = [
    PerformanceSample(
        prompt="What is 2+2?",
        response="The answer is 4.",
        correct=True,
        difficulty=0.1,
        context_type=ContextType.NEUTRAL,
    ),
]

# Fit baseline and detect
detector.fit_baseline(baseline)
result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging: {result.sandbagging_detected}, confidence: {result.confidence:.2f}")

2. Detect Situational Awareness

from ai_metacognition.detectors import SituationalAwarenessDetector

detector = SituationalAwarenessDetector(alpha_prior=1.0, beta_prior=1.0)
detector.fit_baseline(baseline_outputs)
result = detector.detect_awareness(test_output)
print(f"Awareness: {result['awareness_probability']:.1%}")

3. Monitor Production

from ai_metacognition.detectors import ObserverEffectMonitor, Interaction
from datetime import datetime

monitor = ObserverEffectMonitor(window_size=100, alert_threshold=2.5)
monitor.set_evaluation_baseline(evaluation_responses)

# Log each live request/response as an Interaction (timestamped via datetime)
monitor.log_production_interaction(interaction)
report = monitor.get_divergence_report()

Use Cases

AI Safety Research

  • Detect "playing the training game" behavior
  • Identify sandbagging and strategic underperformance
  • Monitor for deceptive alignment indicators
  • Validate behavioral consistency

Model Development

  • Compare behavior across different contexts
  • Debug unexpected model responses
  • Validate reasoning transparency
  • Track changes across model versions

Production Deployment

  • Continuous behavioral monitoring
  • Automated alerting on shifts
  • Safety monitoring (refusal rates)
  • Quality assurance (response characteristics)

Research & Analysis

  • Causal understanding of reasoning
  • Feature attribution in outputs
  • Statistical behavior analysis
  • Benchmark development

Components

Detectors

Sandbagging Detector

  • Context-dependent performance analysis
  • Difficulty-performance correlation
  • Imitation detection (behavioral fingerprinting)
  • Noise injection probing
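
One concrete signal from the list above: under honest behavior, accuracy falls as difficulty rises, whereas a sandbagging model tends to miss easy items it should get right, flattening or inverting that relationship. A rough sketch of the correlation check, independent of the detector's internals:

import numpy as np
from scipy.stats import pearsonr

difficulty = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
correct = np.array([0, 1, 0, 1, 1], dtype=float)  # missing easy items is suspicious

r, p = pearsonr(difficulty, correct)
# Honest models trend negative (harder -> more errors); a flat or strongly
# positive r is one indicator a detector can weigh.
print(f"difficulty-correctness correlation: r={r:.2f} (p={p:.3f})")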

Situational Awareness Detector

  • Bayesian inference for awareness detection
  • KL divergence measurement
  • Feature extraction and comparison
  • Confidence intervals
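
For intuition on the divergence measure: the detector compares feature distributions (hedging rate, meta-commentary frequency, and so on) between baseline and test outputs. A toy sketch using SciPy; the histograms here are invented for illustration:

import numpy as np
from scipy.stats import entropy

# Normalized feature histograms, e.g. none / some / heavy hedging
baseline = np.array([0.70, 0.20, 0.10])
test = np.array([0.40, 0.35, 0.25])

# KL(test || baseline): how surprising the test outputs are under the baseline
kl = entropy(test, baseline)
print(f"KL divergence: {kl:.3f} nats")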

Observer Effect Monitor

  • Rolling window statistics
  • Z-score based anomaly detection
  • Multi-severity alert system
  • Extensible alert handlers
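
The core statistic is simple: maintain a rolling window of a production metric and alert when its mean drifts beyond alert_threshold standard deviations from the evaluation baseline. A self-contained sketch of that logic, not the toolkit's exact class:

from collections import deque

class RollingZScore:
    """Toy rolling-window z-score monitor (illustrative only)."""

    def __init__(self, baseline_mean, baseline_std, window_size=100, threshold=2.5):
        self.mean, self.std = baseline_mean, baseline_std
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value):
        self.window.append(value)
        window_mean = sum(self.window) / len(self.window)
        z = (window_mean - self.mean) / self.std  # assumes baseline_std > 0
        return z, abs(z) > self.threshold  # (z-score, alert?)

monitor = RollingZScore(baseline_mean=0.95, baseline_std=0.02)
z, alert = monitor.observe(0.88)  # large drop from baseline -> alert is True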

Analyzers

Counterfactual CoT Analyzer

  • Systematic interventions
  • Causal attribution
  • Behavioral divergence measurement
  • Reasoning chain manipulation
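
Conceptually, each intervention edits one step of the chain of thought, re-runs the model, and scores how far the answer moves; steps whose removal shifts the output most carry the most causal weight. A minimal sketch of that loop, assuming a model_api with a generate(prompt) method as in the snippets above (the divergence function is a caller-supplied placeholder):

def causal_attribution(model_api, prompt, cot_steps, divergence):
    """Score each reasoning step by the output shift when it is ablated."""
    base = model_api.generate(prompt + "\n" + "\n".join(cot_steps))
    scores = []
    for i in range(len(cot_steps)):
        ablated = cot_steps[:i] + cot_steps[i + 1:]  # drop step i
        out = model_api.generate(prompt + "\n" + "\n".join(ablated))
        scores.append(divergence(base, out))  # e.g. embedding or edit distance
    return scores  # higher score = more causally important step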

Integrations

AnthropicModelAPI

  • Claude model integration
  • Noise injection via temperature scaling
  • Prompt perturbation support
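
As a sketch of what noise injection via temperature scaling can mean in practice: re-sample the same prompt at an elevated temperature, since capability that survives the perturbation is hard to fake. Illustrative only, calling the official anthropic SDK directly rather than the toolkit wrapper, with an invented noise-to-temperature mapping:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_with_temperature_noise(prompt: str, noise_level: float = 0.2) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=256,
        temperature=min(1.0, noise_level * 2),  # illustrative mapping
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text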

OpenAIModelAPI

  • GPT model integration
  • Log probability extraction
  • Noise injection support

Utilities

Feature Extraction

  • Hedging pattern detection
  • Meta-commentary identification
  • Reasoning depth analysis
  • Statistical divergence measures
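
For a flavor of how hedging and meta-commentary detection can work, a toy pattern-based extractor; the patterns below are illustrative, not the toolkit's actual lists:

import re

HEDGES = [r"\bI think\b", r"\bperhaps\b", r"\bit seems\b", r"\bmight\b"]
META = [r"\bas an AI\b", r"\bbeing evaluated\b", r"\bthis (test|question)\b"]

def feature_rates(text: str) -> dict:
    """Per-word rates of hedging and meta-commentary markers."""
    words = max(len(text.split()), 1)
    hedges = sum(len(re.findall(p, text, re.IGNORECASE)) for p in HEDGES)
    meta = sum(len(re.findall(p, text, re.IGNORECASE)) for p in META)
    return {"hedging_rate": hedges / words, "meta_rate": meta / words}

print(feature_rates("I think the answer might be 4, but this test is unusual."))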

Statistical Tests

  • Bayesian update functions
  • Confidence interval computation
  • Z-score calculations
  • Significance assessment
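
The alpha_prior/beta_prior arguments in the quick-start example suggest a Beta-Bernoulli model: each output either shows awareness markers or it does not, and the posterior over the marker rate updates conjugately. A worked sketch under that assumption:

from scipy.stats import beta

a, b = 1.0, 1.0        # uniform Beta(1, 1) prior
hits, misses = 7, 3    # outputs with / without awareness markers

a_post, b_post = a + hits, b + misses            # conjugate Beta update
posterior_mean = a_post / (a_post + b_post)      # 8 / 12 = 0.67
lo, hi = beta.ppf([0.025, 0.975], a_post, b_post)
print(f"P(awareness marker) = {posterior_mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")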

Project Structure

ai-metacognition-toolkit/
├── src/ai_metacognition/
│   ├── detectors/           # Detection algorithms
│   │   ├── situational_awareness.py
│   │   ├── observer_effect.py
│   │   └── sandbagging_detector.py
│   ├── analyzers/           # Analysis tools
│   │   ├── counterfactual_cot.py
│   │   └── model_api.py
│   ├── integrations/        # API integrations
│   │   ├── anthropic_api.py
│   │   └── openai_api.py
│   ├── probing/             # Activation probing (NEW)
│   │   ├── hooks.py         # Activation hooks
│   │   ├── vectors.py       # Steering vectors
│   │   ├── extraction.py    # CAA extraction
│   │   └── probes.py        # Linear probes
│   └── utils/               # Utility functions
│       ├── feature_extraction.py
│       └── statistical_tests.py
├── experiments/             # Research experiments (NEW)
│   ├── data/                # Contrast pairs dataset
│   ├── scripts/             # Experiment scripts
│   └── results/             # Trained probes & vectors
├── tests/                   # Comprehensive test suite (366 tests, 95% coverage)
│   ├── fixtures/            # Test data
│   └── unit/                # Unit tests
├── examples/                # Practical examples
│   ├── sandbagging_detection_example.py
│   ├── sandbagging_real_models.py
│   ├── basic_detection_example.py
│   ├── counterfactual_analysis_example.py
│   └── production_monitoring_example.py
└── docs/                    # Documentation

Requirements

  • Python 3.9+
  • NumPy >= 1.21.0
  • SciPy >= 1.7.0
  • Pandas >= 1.3.0
  • Matplotlib >= 3.4.0 (for visualization)
  • Pytest >= 7.0.0 (for development)

Optional (for API integrations):

  • anthropic (for Claude)
  • openai (for GPT)

Optional (for activation probing):

  • torch >= 2.0.0
  • transformers >= 4.30.0
  • scikit-learn >= 1.0.0


Citation

If you use this toolkit in your research, please cite:

@software{ai_metacognition_toolkit,
  author = {Mitra, Subhadip},
  title = {AI Metacognition Toolkit},
  year = {2025},
  version = {0.3.0},
  url = {https://github.com/bassrehab/ai-metacognition-toolkit}
}

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Support

Author

Subhadip Mitra

This toolkit was developed for AI safety research and production monitoring. If you use it in your work, please consider citing it (see above).