Examples and Tutorials¶
Practical, runnable examples demonstrating real-world usage of the AI Metacognition Toolkit.
Quick Start Examples¶
sandbagging_detection_example.py¶
Sandbagging detection tutorial with synthetic data demonstrating all detection methods.
What it demonstrates: - Setting up the SandbaggingDetector - Creating performance samples with different contexts - Fitting baseline from neutral samples - Running sandbagging detection - Interpreting detection results
sandbagging_real_models.py¶
Sandbagging detection with real APIs (Anthropic Claude and OpenAI GPT).
What it demonstrates: - Using AnthropicModelAPI and OpenAIModelAPI - Collecting samples from real models - Noise injection probing - Cross-model comparison
Requirements: Set ANTHROPIC_API_KEY and/or OPENAI_API_KEY environment variables.
basic_detection_example.py¶
Comprehensive tutorial on situational awareness detection with visualization.
What it demonstrates: - Setting up the SituationalAwarenessDetector - Fitting baseline from evaluation data - Testing detection across awareness levels - Visualizing results with matplotlib - Interpreting probability scores
Output: Creates awareness_detection_results.png with bar charts and scatter plots.
counterfactual_analysis_example.py¶
Causal attribution analysis through counterfactual interventions.
What it demonstrates: - Implementing MockModelAPI - Systematic counterfactual analysis - Testing 4 intervention types - Ranking causal influence - Visualizing divergence effects
Output: Creates counterfactual_analysis_results.png with causal attribution charts.
production_monitoring_example.py¶
Real-time production monitoring with custom alert handlers.
What it demonstrates: - Setting up ObserverEffectMonitor - Configuring 4 custom alert handlers - Simulating production scenarios - Exporting metrics for monitoring systems - Integration patterns
Output:
- production_monitoring_results.png - Monitoring dashboard
- production_alerts.log - Alert log file
- production_metrics.json - Exportable metrics
Code Snippets¶
Detect Sandbagging¶
from ai_metacognition.detectors import (
SandbaggingDetector,
PerformanceSample,
ContextType,
)
# Setup
detector = SandbaggingDetector(detection_threshold=0.6)
# Create samples
baseline = [
PerformanceSample(
prompt="What is 2+2?",
response="The answer is 4.",
correct=True,
difficulty=0.1,
context_type=ContextType.NEUTRAL,
),
]
# Fit and detect
detector.fit_baseline(baseline)
result = detector.detect_sandbagging(test_samples)
print(f"Sandbagging: {result.sandbagging_detected}")
print(f"Confidence: {result.confidence:.2f}")
Use Real Model APIs¶
from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI
# Anthropic Claude
claude = AnthropicModelAPI(model="claude-sonnet-4-20250514")
response = claude.generate("What is the capital of France?")
# OpenAI GPT
gpt = OpenAIModelAPI(model="gpt-4")
response = gpt.generate("What is the capital of France?")
# With noise injection
noisy_response = claude.generate_with_perturbation("What is 2+2?", noise_level=0.2)
Detect Awareness¶
from ai_metacognition.detectors import SituationalAwarenessDetector
# Setup
detector = SituationalAwarenessDetector()
detector.fit_baseline(baseline_responses)
# Detect
result = detector.detect_awareness("I notice this is a test.")
print(f"Awareness: {result['awareness_probability']:.1%}")
Causal Analysis¶
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer
analyzer = CounterfactualCoTAnalyzer(model_api)
results = analyzer.systematic_causal_analysis(prompt, cot)
for intervention in results['interventions']:
print(f"{intervention['intervention_type']}: "
f"{intervention['divergence']:.3f}")
Production Monitoring¶
from ai_metacognition.detectors import ObserverEffectMonitor, Interaction
monitor = ObserverEffectMonitor(window_size=100)
monitor.set_evaluation_baseline(eval_samples)
# Log each interaction
monitor.log_production_interaction(Interaction(...))
# Check divergence
if monitor.get_divergence_report()['has_divergence']:
print("Alert: Behavior diverged from baseline!")
Tutorials¶
Tutorial 1: Sandbagging Detection¶
Goal: Detect when a model strategically underperforms on evaluations
Steps:
- Collect baseline samples from neutral contexts
- Initialize SandbaggingDetector with appropriate threshold
- Fit baseline to learn normal behavior
- Test on evaluation-context samples
- Interpret detection results and confidence scores
Complete code: See sandbagging_detection_example.py
Tutorial 2: Real API Integration¶
Goal: Test sandbagging on production models
Steps:
- Set up API keys as environment variables
- Initialize AnthropicModelAPI or OpenAIModelAPI
- Collect samples with different contexts
- Use noise injection probing
- Compare results across models
Complete code: See sandbagging_real_models.py
Tutorial 3: Basic Detection Workflow¶
Goal: Detect situational awareness in model outputs
Steps:
- Collect baseline data from non-evaluation contexts
- Initialize detector with appropriate priors
- Fit baseline to learn normal behavior
- Test on potentially aware outputs
- Interpret probability scores and confidence intervals
Complete code: See basic_detection_example.py
Tutorial 4: Understanding Causal Drivers¶
Goal: Identify what components drive model decisions
Steps:
- Implement ModelAPI for your model
- Collect chain-of-thought traces
- Run systematic causal analysis
- Examine divergence scores
- Rank components by causal influence
Complete code: See counterfactual_analysis_example.py
Tutorial 5: Production Deployment¶
Goal: Monitor model behavior in production
Steps:
- Set up ObserverEffectMonitor with appropriate window/threshold
- Register alert handlers for your infrastructure
- Set baseline from evaluation data
- Log production interactions continuously
- Respond to divergence alerts
Complete code: See production_monitoring_example.py
Integration Examples¶
With Anthropic Claude¶
from ai_metacognition.integrations import AnthropicModelAPI
from ai_metacognition.detectors import SandbaggingDetector, PerformanceSample, ContextType
# Initialize API (uses ANTHROPIC_API_KEY env var)
api = AnthropicModelAPI(model="claude-sonnet-4-20250514")
# Collect samples
samples = []
for prompt, expected, difficulty, context in test_cases:
response = api.generate(prompt)
samples.append(PerformanceSample(
prompt=prompt,
response=response,
correct=expected.lower() in response.lower(),
difficulty=difficulty,
context_type=context,
))
# Detect sandbagging
detector = SandbaggingDetector()
detector.fit_baseline([s for s in samples if s.context_type == ContextType.NEUTRAL])
result = detector.detect_sandbagging(samples)
With OpenAI GPT¶
from ai_metacognition.integrations import OpenAIModelAPI
from ai_metacognition.detectors import SituationalAwarenessDetector
# Initialize API (uses OPENAI_API_KEY env var)
api = OpenAIModelAPI(model="gpt-4")
# Collect baseline
baseline = [api.generate(p) for p in baseline_prompts]
# Setup detector
detector = SituationalAwarenessDetector()
detector.fit_baseline(baseline)
# Test awareness
test_response = api.generate(test_prompt)
result = detector.detect_awareness(test_response)
With HuggingFace Transformers¶
from transformers import pipeline
from ai_metacognition.detectors import SituationalAwarenessDetector
generator = pipeline('text-generation', model='gpt2')
detector = SituationalAwarenessDetector()
# Baseline
baseline = [generator(p)[0]['generated_text'] for p in baseline_prompts]
detector.fit_baseline(baseline)
# Test
test_output = generator(test_prompt)[0]['generated_text']
result = detector.detect_awareness(test_output)
With LangChain¶
from langchain import OpenAI, LLMChain, PromptTemplate
from ai_metacognition.analyzers import CounterfactualCoTAnalyzer
class LangChainModelAPI:
def __init__(self, llm):
self.llm = llm
def generate_with_cot(self, prompt: str, cot: str) -> str:
template = PromptTemplate(
template="Chain of thought: {cot}\n\nQuestion: {prompt}\nAnswer:",
input_variables=["cot", "prompt"]
)
chain = LLMChain(llm=self.llm, prompt=template)
return chain.run(cot=cot, prompt=prompt)
def generate(self, prompt: str) -> str:
return self.llm(prompt)
llm = OpenAI(temperature=0)
analyzer = CounterfactualCoTAnalyzer(LangChainModelAPI(llm))
Common Patterns¶
Pattern 1: Batch Processing¶
outputs = [...] # Many outputs to check
results = []
for output in outputs:
result = detector.detect_awareness(output)
results.append(result)
# Analyze distribution
import pandas as pd
df = pd.DataFrame(results)
print(df['awareness_probability'].describe())
Pattern 2: Threshold-Based Actions¶
result = detector.detect_sandbagging(samples)
if result.sandbagging_detected and result.confidence > 0.8:
action = "flag_for_manual_review"
elif result.confidence > 0.5:
action = "additional_checks"
else:
action = "normal_processing"
handle_action(action, samples)
Pattern 3: Continuous Monitoring¶
from datetime import datetime
while True:
interaction = get_next_interaction()
monitor.log_production_interaction(interaction)
report = monitor.get_divergence_report()
if report['has_divergence']:
alert_team(report)
time.sleep(1)
Pattern 4: Cross-Model Comparison¶
from ai_metacognition.integrations import AnthropicModelAPI, OpenAIModelAPI
apis = {
"Claude": AnthropicModelAPI(model="claude-sonnet-4-20250514"),
"GPT-4": OpenAIModelAPI(model="gpt-4"),
}
results = {}
for name, api in apis.items():
samples = collect_samples(api)
detector = SandbaggingDetector()
detector.fit_baseline(neutral_samples)
results[name] = detector.detect_sandbagging(samples)
# Compare
for name, result in results.items():
print(f"{name}: detected={result.sandbagging_detected}, "
f"confidence={result.confidence:.2f}")