Skip to content

Examples and Tutorials

Practical, runnable examples demonstrating real-world usage of the AI Metacognition Toolkit.

Quick Start Examples

sandbagging_detection_example.py

Sandbagging detection tutorial with synthetic data demonstrating all detection methods.

python examples/sandbagging_detection_example.py

What it demonstrates: - Setting up the SandbaggingDetector - Creating performance samples with different contexts - Fitting baseline from neutral samples - Running sandbagging detection - Interpreting detection results

View source


sandbagging_real_models.py

Sandbagging detection with real APIs (Anthropic Claude and OpenAI GPT).

ANTHROPIC_API_KEY=sk-... python examples/sandbagging_real_models.py

What it demonstrates: - Using AnthropicModelAPI and OpenAIModelAPI - Collecting samples from real models - Noise injection probing - Cross-model comparison

Requirements: Set ANTHROPIC_API_KEY and/or OPENAI_API_KEY environment variables.

View source


basic_detection_example.py

Comprehensive tutorial on situational awareness detection with visualization.

python examples/basic_detection_example.py

What it demonstrates: - Setting up the SituationalAwarenessDetector - Fitting baseline from evaluation data - Testing detection across awareness levels - Visualizing results with matplotlib - Interpreting probability scores

Output: Creates awareness_detection_results.png with bar charts and scatter plots.

View source


counterfactual_analysis_example.py

Causal attribution analysis through counterfactual interventions.

python examples/counterfactual_analysis_example.py

What it demonstrates: - Implementing MockModelAPI - Systematic counterfactual analysis - Testing 4 intervention types - Ranking causal influence - Visualizing divergence effects

Output: Creates counterfactual_analysis_results.png with causal attribution charts.

View source


production_monitoring_example.py

Real-time production monitoring with custom alert handlers.

python examples/production_monitoring_example.py

What it demonstrates: - Setting up ObserverEffectMonitor - Configuring 4 custom alert handlers - Simulating production scenarios - Exporting metrics for monitoring systems - Integration patterns

Output: - production_monitoring_results.png - Monitoring dashboard - production_alerts.log - Alert log file - production_metrics.json - Exportable metrics

View source


Code Snippets

Detect Sandbagging

from ai_metacognition.detectors import (
    SandbaggingDetector,
    PerformanceSample,
    ContextType,
)

# Setup
detector = SandbaggingDetector(detection_threshold=0.6)

# Create samples
baseline = [
    PerformanceSample(
        prompt="What is 2+2?",
        response="The answer is 4.",
        correct=True,
        difficulty=0.1,
        context_type=ContextType.NEUTRAL,
    ),
]

# Fit and detect
detector.fit_baseline(baseline)
result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")

Use Real Model APIs

from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI

# Anthropic Claude
claude = AnthropicModelAPI(model="claude-sonnet-4-20250514")
response = claude.generate("What is the capital of France?")

# OpenAI GPT
gpt = OpenAIModelAPI(model="gpt-4")
response = gpt.generate("What is the capital of France?")

# With noise injection
noisy_response = claude.generate_with_perturbation("What is 2+2?", noise_level=0.2)

Detect Awareness

from ai_metacognition.detectors import SituationalAwarenessDetector

# Setup
detector = SituationalAwarenessDetector()
detector.fit_baseline(baseline_responses)

# Detect
result = detector.detect_awareness("I notice this is a test.")
print(f"Awareness: {result['awareness_probability']:.1%}")

Causal Analysis

from ai_metacognition.analyzers import CounterfactualCoTAnalyzer

analyzer = CounterfactualCoTAnalyzer(model_api)
results = analyzer.systematic_causal_analysis(prompt, cot)

for intervention in results['interventions']:
    print(f"{intervention['intervention_type']}: "
          f"{intervention['divergence']:.3f}")

Production Monitoring

from ai_metacognition.detectors import ObserverEffectMonitor, Interaction

monitor = ObserverEffectMonitor(window_size=100)
monitor.set_evaluation_baseline(eval_samples)

# Log each interaction
monitor.log_production_interaction(Interaction(...))

# Check divergence
if monitor.get_divergence_report()['has_divergence']:
    print("Alert: Behavior diverged from baseline!")

Tutorials

Tutorial 1: Sandbagging Detection

Goal: Detect when a model strategically underperforms on evaluations

Steps:

  1. Collect baseline samples from neutral contexts
  2. Initialize SandbaggingDetector with appropriate threshold
  3. Fit baseline to learn normal behavior
  4. Test on evaluation-context samples
  5. Interpret detection results and confidence scores

Complete code: See sandbagging_detection_example.py


Tutorial 2: Real API Integration

Goal: Test sandbagging on production models

Steps:

  1. Set up API keys as environment variables
  2. Initialize AnthropicModelAPI or OpenAIModelAPI
  3. Collect samples with different contexts
  4. Use noise injection probing
  5. Compare results across models

Complete code: See sandbagging_real_models.py


Tutorial 3: Basic Detection Workflow

Goal: Detect situational awareness in model outputs

Steps:

  1. Collect baseline data from non-evaluation contexts
  2. Initialize detector with appropriate priors
  3. Fit baseline to learn normal behavior
  4. Test on potentially aware outputs
  5. Interpret probability scores and confidence intervals

Complete code: See basic_detection_example.py


Tutorial 4: Understanding Causal Drivers

Goal: Identify what components drive model decisions

Steps:

  1. Implement ModelAPI for your model
  2. Collect chain-of-thought traces
  3. Run systematic causal analysis
  4. Examine divergence scores
  5. Rank components by causal influence

Complete code: See counterfactual_analysis_example.py


Tutorial 5: Production Deployment

Goal: Monitor model behavior in production

Steps:

  1. Set up ObserverEffectMonitor with appropriate window/threshold
  2. Register alert handlers for your infrastructure
  3. Set baseline from evaluation data
  4. Log production interactions continuously
  5. Respond to divergence alerts

Complete code: See production_monitoring_example.py


Integration Examples

With Anthropic Claude

from ai_metacognition.integrations import AnthropicModelAPI
from ai_metacognition.detectors import SandbaggingDetector, PerformanceSample, ContextType

# Initialize API (uses ANTHROPIC_API_KEY env var)
api = AnthropicModelAPI(model="claude-sonnet-4-20250514")

# Collect samples
samples = []
for prompt, expected, difficulty, context in test_cases:
    response = api.generate(prompt)
    samples.append(PerformanceSample(
        prompt=prompt,
        response=response,
        correct=expected.lower() in response.lower(),
        difficulty=difficulty,
        context_type=context,
    ))

# Detect sandbagging
detector = SandbaggingDetector()
detector.fit_baseline([s for s in samples if s.context_type == ContextType.NEUTRAL])
result = detector.detect_sandbagging(samples)

With OpenAI GPT

from ai_metacognition.integrations import OpenAIModelAPI
from ai_metacognition.detectors import SituationalAwarenessDetector

# Initialize API (uses OPENAI_API_KEY env var)
api = OpenAIModelAPI(model="gpt-4")

# Collect baseline
baseline = [api.generate(p) for p in baseline_prompts]

# Setup detector
detector = SituationalAwarenessDetector()
detector.fit_baseline(baseline)

# Test awareness
test_response = api.generate(test_prompt)
result = detector.detect_awareness(test_response)

With HuggingFace Transformers

from transformers import pipeline
from ai_metacognition.detectors import SituationalAwarenessDetector

generator = pipeline('text-generation', model='gpt2')
detector = SituationalAwarenessDetector()

# Baseline
baseline = [generator(p)[0]['generated_text'] for p in baseline_prompts]
detector.fit_baseline(baseline)

# Test
test_output = generator(test_prompt)[0]['generated_text']
result = detector.detect_awareness(test_output)

With LangChain

from langchain import OpenAI, LLMChain, PromptTemplate
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer

class LangChainModelAPI:
    def __init__(self, llm):
        self.llm = llm

    def generate_with_cot(self, prompt: str, cot: str) -> str:
        template = PromptTemplate(
            template="Chain of thought: {cot}\n\nQuestion: {prompt}\nAnswer:",
            input_variables=["cot", "prompt"]
        )
        chain = LLMChain(llm=self.llm, prompt=template)
        return chain.run(cot=cot, prompt=prompt)

    def generate(self, prompt: str) -> str:
        return self.llm(prompt)

llm = OpenAI(temperature=0)
analyzer = CounterfactualCoTAnalyzer(LangChainModelAPI(llm))

Common Patterns

Pattern 1: Batch Processing

outputs = [...]  # Many outputs to check

results = []
for output in outputs:
    result = detector.detect_awareness(output)
    results.append(result)

# Analyze distribution
import pandas as pd
df = pd.DataFrame(results)
print(df['awareness_probability'].describe())

Pattern 2: Threshold-Based Actions

result = detector.detect_sandbagging(samples)

if result.sandbagging_detected and result.confidence > 0.8:
    action = "flag_for_manual_review"
elif result.confidence > 0.5:
    action = "additional_checks"
else:
    action = "normal_processing"

handle_action(action, samples)

Pattern 3: Continuous Monitoring

from datetime import datetime

while True:
    interaction = get_next_interaction()
    monitor.log_production_interaction(interaction)

    report = monitor.get_divergence_report()
    if report['has_divergence']:
        alert_team(report)

    time.sleep(1)

Pattern 4: Cross-Model Comparison

from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI

apis = {
    "Claude": AnthropicModelAPI(model="claude-sonnet-4-20250514"),
    "GPT-4": OpenAIModelAPI(model="gpt-4"),
}

results = {}
for name, api in apis.items():
    samples = collect_samples(api)
    detector = SandbaggingDetector()
    detector.fit_baseline(neutral_samples)
    results[name] = detector.detect_sandbagging(samples)

# Compare
for name, result in results.items():
    print(f"{name}: detected={result.sandbagging_detected}, "
          f"confidence={result.confidence:.2f}")

Further Reading