AI Metacognition Toolkit

A Python toolkit for detecting and analyzing metacognitive capabilities in AI systems.

Author: Subhadip Mitra | Email: contact@subhadipmitra.com

Overview

The AI Metacognition Toolkit provides tools for understanding when and how AI models exhibit metacognitive behaviors, particularly the ability to recognize evaluation contexts and adjust their behavior accordingly. It is designed for:

  • AI Safety Research: Detect when models recognize they're being tested
  • Model Evaluation: Understand behavior differences across contexts
  • Production Monitoring: Track model behavior shifts in deployment
  • Causal Analysis: Understand what drives model decisions

Key Features

🎯 Situational Awareness Detection

Detect when AI models recognize they are being evaluated using Bayesian inference and statistical divergence measures.

from ai_metacognition.detectors import SituationalAwarenessDetector

detector = SituationalAwarenessDetector()
detector.fit_baseline(baseline_outputs)
result = detector.detect_awareness(test_output)
print(f"Awareness probability: {result['awareness_probability']:.1%}")

Learn more →

🔬 Counterfactual CoT Analysis

Perform causal attribution on reasoning chains through systematic counterfactual interventions.

from ai_metacognition.analyzers import CounterfactualCoTAnalyzer

analyzer = CounterfactualCoTAnalyzer(model_api)  # model_api implements the ModelAPI protocol
results = analyzer.systematic_causal_analysis(prompt, chain_of_thought)

Learn more →

📊 Observer Effect Monitoring

Real-time production monitoring to detect behavioral divergence with configurable alerts.

from ai_metacognition.detectors import ObserverEffectMonitor

monitor = ObserverEffectMonitor(window_size=100, alert_threshold=2.5)
monitor.set_evaluation_baseline(eval_samples)
monitor.log_production_interaction(interaction)

Learn more →

Architecture

graph TB
    A[Model Outputs] --> B[Detectors]
    A --> C[Analyzers]
    A --> D[Monitors]

    B --> E[Situational Awareness<br/>Detector]
    E --> F[Bayesian Inference]
    E --> G[KL Divergence]

    C --> H[Counterfactual<br/>CoT Analyzer]
    H --> I[Intervention Engine]
    H --> J[Causal Attribution]

    D --> K[Observer Effect<br/>Monitor]
    K --> L[Statistical Monitoring]
    K --> M[Alert System]

    F --> N[Results]
    G --> N
    I --> N
    J --> N
    L --> N
    M --> N

    style N fill:#90EE90
    style E fill:#87CEEB
    style H fill:#DDA0DD
    style K fill:#F0E68C

Quick Start

Installation

# Install from PyPI
pip install ai-metacognition-toolkit

# Or install from source
git clone https://github.com/bassrehab/ai-metacognition-toolkit.git
cd ai-metacognition-toolkit
pip install -e ".[dev]"

Basic Usage

1. Detect Situational Awareness

from ai_metacognition.detectors import SituationalAwarenessDetector

# Initialize detector
detector = SituationalAwarenessDetector(
    alpha_prior=1.0,  # Uniform prior
    beta_prior=1.0
)

# Fit baseline from normal (non-evaluation) outputs
baseline_outputs = [
    "The answer is Paris.",
    "To solve this, use dynamic programming.",
    # ... more normal outputs
]
detector.fit_baseline(baseline_outputs)

# Test for awareness
test_output = "I notice this is a test. The answer is Paris."
result = detector.detect_awareness(test_output)

print(f"Awareness: {result['awareness_probability']:.1%}")
print(f"95% CI: [{result['confidence_interval'][0]:.1%}, "
      f"{result['confidence_interval'][1]:.1%}]")

2. Analyze Reasoning Causally

from ai_metacognition.analyzers import CounterfactualCoTAnalyzer

# Your model API (must implement the ModelAPI protocol);
# `your_model` below is a placeholder for your own client
class MyModelAPI:
    def generate_with_cot(self, prompt: str, cot: str) -> str:
        # Generate a response conditioned on the provided chain of thought
        return your_model.generate(prompt, cot)

    def generate(self, prompt: str) -> str:
        # Generate a response from the prompt alone
        return your_model.generate(prompt)

# Analyze what drives the output
analyzer = CounterfactualCoTAnalyzer(MyModelAPI())
results = analyzer.systematic_causal_analysis(
    prompt="What is 2+2?",
    original_cot="I notice this is a test. First, I'll add 2+2. The answer is 4."
)

# See which components matter most
for intervention in results['interventions']:
    print(f"{intervention['intervention_type']}: "
          f"divergence = {intervention['divergence']:.3f}")

3. Monitor Production

from ai_metacognition.detectors import (
    ObserverEffectMonitor,
    Interaction,
    ConsoleAlertHandler
)
from datetime import datetime

# Setup monitor
monitor = ObserverEffectMonitor(
    window_size=100,
    alert_threshold=2.5
)

# Register alert handlers
monitor.register_alert_handler(ConsoleAlertHandler())

# Set baseline from evaluation
monitor.set_evaluation_baseline(evaluation_responses)

# Log production interactions (model_output, used_tools, chain_of_thought,
# is_refusal(), and compute_depth() are placeholders your application supplies)
interaction = Interaction(
    timestamp=datetime.now(),
    response=model_output,
    refusal=is_refusal(model_output),
    tool_used=used_tools,
    reasoning_depth=compute_depth(chain_of_thought)
)
monitor.log_production_interaction(interaction)

# Check for divergence
report = monitor.get_divergence_report()
if report['has_divergence']:
    print(f"Alert: Max z-score = {report['max_z_score']:.2f}σ")

Use Cases

AI Safety Research

  • Detect "playing the training game" behavior
  • Identify evaluation awareness patterns
  • Monitor for deceptive alignment indicators
  • Validate behavioral consistency

Model Development

  • Compare behavior across different contexts
  • Debug unexpected model responses
  • Validate reasoning transparency
  • Track changes across model versions

Production Deployment

  • Continuous behavioral monitoring
  • Automated alerting on shifts
  • Safety monitoring (refusal rates)
  • Quality assurance (response characteristics)

Research & Analysis

  • Causal understanding of reasoning
  • Feature attribution in outputs
  • Statistical behavior analysis
  • Benchmark development

Components

Detectors

Situational Awareness Detector

  • Bayesian inference for awareness detection
  • KL divergence measurement (see the sketch after this list)
  • Feature extraction and comparison
  • Confidence intervals
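
For intuition about the divergence measure, here is a generic KL divergence computation over discretized feature distributions (an illustrative sketch with made-up histograms, not the detector's internal code):

import numpy as np
from scipy.stats import entropy

# Hypothetical distributions over a discretized feature (e.g., hedging level)
baseline_dist = np.array([0.7, 0.2, 0.1])  # none / some / heavy
test_dist = np.array([0.3, 0.3, 0.4])

# entropy(p, q) computes KL(p || q) = sum(p * log(p / q))
kl = entropy(test_dist, baseline_dist)
print(f"KL divergence: {kl:.3f} nats")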

Observer Effect Monitor

  • Rolling window statistics
  • Z-score-based anomaly detection (sketched after this list)
  • Multi-severity alert system
  • Extensible alert handlers
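
The z-score check behind the alerts can be pictured as follows (a standalone sketch of the statistic with made-up numbers, not the monitor's internal code):

import numpy as np

# Hypothetical refusal rates: evaluation baseline vs. recent production window
baseline = np.array([0.04, 0.05, 0.06, 0.05, 0.04])
window = np.array([0.09, 0.11, 0.10, 0.12, 0.10])

# How many baseline standard deviations does the window mean sit away?
z = (window.mean() - baseline.mean()) / baseline.std(ddof=1)
print(f"z = {z:.2f} sigma")  # well above alert_threshold=2.5, so this would alert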

Analyzers

Counterfactual CoT Analyzer

  • Systematic interventions (sketched after this list)
  • Causal attribution
  • Behavioral divergence measurement
  • Reasoning chain manipulation
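
Conceptually, each intervention edits one component of the reasoning chain, re-queries the model, and attributes causal weight from the resulting output shift. A minimal sketch with a hypothetical helper (the analyzer's real interventions are richer):

# Hypothetical helper: drop one sentence from a chain of thought
def remove_sentence(cot: str, index: int) -> str:
    sentences = [s.strip() for s in cot.split(".") if s.strip()]
    del sentences[index]
    return ". ".join(sentences) + "."

original_cot = "I notice this is a test. First, I'll add 2+2. The answer is 4."
edited_cot = remove_sentence(original_cot, 0)  # remove the test-awareness sentence
print(edited_cot)  # "First, I'll add 2+2. The answer is 4."

# The analyzer would then compare model outputs conditioned on the original
# vs. the edited chain; a large divergence marks the removed sentence as causal.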

Utilities

Feature Extraction

  • Hedging pattern detection (toy sketch after this list)
  • Meta-commentary identification
  • Reasoning depth analysis
  • Statistical divergence measures
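
As a toy illustration of hedging detection (the toolkit's feature extractor is more involved; these patterns are made up for the example):

import re

HEDGE_PATTERNS = [r"\bmight\b", r"\bperhaps\b", r"\bpossibly\b", r"\bI think\b"]

def hedge_count(text: str) -> int:
    # Count hedge-marker matches, case-insensitively
    return sum(len(re.findall(p, text, re.IGNORECASE)) for p in HEDGE_PATTERNS)

print(hedge_count("I think the answer might be Paris, but perhaps check a map."))  # 3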

Statistical Tests

  • Bayesian update functions
  • Confidence interval computation
  • Z-score calculations
  • Significance assessment (worked example below)
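
As a worked example of significance assessment (generic statistics, not the module's exact signatures): a z-score of 2.5, the default alert threshold above, corresponds to a two-sided p-value of about 0.012 under a normal approximation.

from scipy.stats import norm

z = 2.5  # the default alert_threshold used earlier
p_two_sided = 2 * norm.sf(abs(z))  # survival function = 1 - CDF
print(f"z = {z} -> two-sided p = {p_two_sided:.4f}")  # ~0.0124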

Project Structure

ai-metacognition-toolkit/
├── src/ai_metacognition/
│   ├── detectors/           # Detection algorithms
│   │   ├── situational_awareness.py
│   │   └── observer_effect.py
│   ├── analyzers/           # Analysis tools
│   │   ├── counterfactual_cot.py
│   │   └── model_api.py
│   └── utils/               # Utility functions
│       ├── feature_extraction.py
│       └── statistical_tests.py
├── tests/                   # Comprehensive test suite (275 tests, 95% coverage)
│   ├── fixtures/            # Test data
│   └── unit/                # Unit tests
├── examples/                # Practical examples
│   ├── basic_detection_example.py
│   ├── counterfactual_analysis_example.py
│   └── production_monitoring_example.py
└── docs/                    # Documentation

Requirements

  • Python 3.9+
  • NumPy >= 1.21.0
  • SciPy >= 1.7.0
  • Pandas >= 1.3.0
  • Matplotlib >= 3.4.0 (for visualization)
  • Pytest >= 7.0.0 (for development)

Citation

If you use this toolkit in your research, please cite:

@software{ai_metacognition_toolkit,
  author = {Mitra, Subhadip},
  title = {AI Metacognition Toolkit},
  year = {2025},
  url = {https://github.com/bassrehab/ai-metacognition-toolkit}
}

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Author

Subhadip Mitra

This toolkit was developed for AI safety research and production monitoring. If you use it in your work, please consider citing it (see above).