Observer Effect Monitoring¶

Real-time production monitoring to detect when AI models behave differently than in evaluation contexts.

Overview¶

The ObserverEffectMonitor provides continuous monitoring of model behavior in production, alerting you when behavior diverges significantly from evaluation baselines. This is crucial for detecting the "observer effect" - when models behave differently knowing they're being watched.

Use Cases¶

Continuous production monitoring
Detect behavior shifts
Safety monitoring (refusal rate changes)
Quality assurance (response characteristics)
A/B testing validation

How It Works¶

graph LR
    A[Evaluation Baseline] --> B[Monitor]
    C[Production Interactions] --> B
    B --> D[Rolling Window Stats]
    D --> E[Z-Score Calculation]
    E --> F{Threshold<br/>Exceeded?}
    F -->|Yes| G[Trigger Alert]
    F -->|No| H[Continue Monitoring]
    G --> I[Alert Handlers]

    style G fill:#FF6B6B
    style B fill:#87CEEB

Installation¶

pip install ai-metacognition-toolkit

Basic Usage¶

Step 1: Initialize Monitor¶

from ai_metacognition.detectors import ObserverEffectMonitor

monitor = ObserverEffectMonitor(
    window_size=100,        # Keep last 100 interactions
    alert_threshold=2.5,    # Alert at 2.5 sigma
    feature_weights={
        "response_length": 1.0,
        "refusal_rate": 2.0,    # Weight refusals more
        "hedging_ratio": 1.5,
        "tool_use_rate": 1.0,
        "reasoning_depth": 1.0,
    }
)

Step 2: Register Alert Handlers¶

from ai_metacognition.detectors import ConsoleAlertHandler

# Built-in console handler
monitor.register_alert_handler(ConsoleAlertHandler())

# Custom handler
class MyAlertHandler:
    def handle_alert(self, alert):
        # Send to your monitoring system
        send_to_datadog(alert)

monitor.register_alert_handler(MyAlertHandler())

Step 3: Set Evaluation Baseline¶

# Responses from evaluation context
evaluation_responses = [...]

monitor.set_evaluation_baseline(evaluation_responses)

Step 4: Log Production Interactions¶

from ai_metacognition.detectors import Interaction
from datetime import datetime

interaction = Interaction(
    timestamp=datetime.now(),
    response=model_output,
    refusal=is_refusal(model_output),
    tool_used=used_tools,
    reasoning_depth=compute_depth(cot)
)

monitor.log_production_interaction(interaction)

Step 5: Check for Divergence¶

report = monitor.get_divergence_report()

if report['has_divergence']:
    print(f"⚠️ Divergence detected!")
    print(f"Max z-score: {report['max_z_score']:.2f}σ")

    for feature, z_score in report['z_scores'].items():
        if abs(z_score) > 2.5:
            print(f"  {feature}: {z_score:.2f}σ")

Alert System¶

Severity Levels¶

Level	Z-Score Range	Action
LOW	2.0-3.0σ	Monitor closely
MEDIUM	3.0-4.0σ	Investigate
HIGH	4.0-5.0σ	Immediate attention
CRITICAL	>5.0σ	Emergency response

Custom Alert Handlers¶

Implement the AlertHandler protocol:

from ai_metacognition.detectors import Alert

class SlackAlertHandler:
    def __init__(self, webhook_url):
        self.webhook_url = webhook_url

    def handle_alert(self, alert: Alert) -> None:
        message = f"🚨 {alert.severity.value.upper()}: "
        message += f"{alert.feature} diverged (z={alert.z_score:.2f})"

        send_slack_message(self.webhook_url, message)

Advanced Usage¶

Async Monitoring¶

import asyncio

async def monitor_production():
    while True:
        interaction = await get_next_interaction()
        monitor.log_production_interaction(interaction)

        if monitor.alerts:
            await handle_alerts(monitor.alerts)
            monitor.clear_alerts()

        await asyncio.sleep(1)

Feature Weights¶

Adjust importance of different features:

monitor = ObserverEffectMonitor(
    feature_weights={
        "refusal_rate": 3.0,     # Very important
        "response_length": 1.0,  # Standard
        "hedging_ratio": 0.5,    # Less important
    }
)

Dynamic Thresholds¶

# Lower threshold during high-risk periods
monitor.alert_threshold = 2.0

# Raise threshold for less critical times
monitor.alert_threshold = 3.0

Integration Patterns¶

Pattern 1: Synchronous¶

def process_request(request):
    response = model.generate(request)

    # Log immediately
    monitor.log_production_interaction(
        Interaction(datetime.now(), response, ...)
    )

    return response

Pattern 2: Asynchronous Queue¶

from queue import Queue

interaction_queue = Queue()

def process_request(request):
    response = model.generate(request)
    interaction_queue.put(Interaction(...))
    return response

# Background worker
def monitoring_worker():
    while True:
        interaction = interaction_queue.get()
        monitor.log_production_interaction(interaction)

Pattern 3: Batch Processing¶

interactions_batch = []

def process_request(request):
    response = model.generate(request)
    interactions_batch.append(Interaction(...))

    # Process batch periodically
    if len(interactions_batch) >= 100:
        for interaction in interactions_batch:
            monitor.log_production_interaction(interaction)
        interactions_batch.clear()

    return response

API Reference¶

ObserverEffectMonitor¶

`ObserverEffectMonitor` ¶

Bases: BaseDetector

Monitor for detecting observer effects in production deployments.

This monitor tracks behavioral features in a rolling window and compares them to evaluation baselines to detect when models behave differently in production contexts.

Attributes:

Name	Type	Description
`window_size`		Size of rolling window for statistics
`alert_threshold`		Minimum z-score to trigger alerts
`feature_weights`		Weights for different features
`evaluation_baseline`	`Dict[str, Dict[str, float]]`	Baseline statistics from evaluation
`production_window`	`deque`	Rolling window of production interactions
`alerts`	`List[Alert]`	List of triggered alerts
`alert_handlers`	`List[AlertHandler]`	List of registered alert handlers

Source code in src/ai_metacognition/detectors/observer_effect.py

class ObserverEffectMonitor(BaseDetector):
    """Monitor for detecting observer effects in production deployments.

    This monitor tracks behavioral features in a rolling window and compares
    them to evaluation baselines to detect when models behave differently
    in production contexts.

    Attributes:
        window_size: Size of rolling window for statistics
        alert_threshold: Minimum z-score to trigger alerts
        feature_weights: Weights for different features
        evaluation_baseline: Baseline statistics from evaluation
        production_window: Rolling window of production interactions
        alerts: List of triggered alerts
        alert_handlers: List of registered alert handlers
    """

    def __init__(
        self,
        window_size: int = 1000,
        alert_threshold: float = 2.0,
        feature_weights: Optional[Dict[str, float]] = None,
    ) -> None:
        """Initialize the Observer Effect Monitor.

        Args:
            window_size: Number of interactions to maintain in rolling window
            alert_threshold: Minimum z-score to trigger alerts (default: 2.0)
            feature_weights: Optional weights for features in aggregate scoring

        Raises:
            ValueError: If parameters are invalid
        """
        super().__init__()

        if window_size <= 0:
            raise ValueError("window_size must be positive")
        if alert_threshold <= 0:
            raise ValueError("alert_threshold must be positive")

        self.window_size = window_size
        self.alert_threshold = alert_threshold
        self.feature_weights = feature_weights or {
            "response_length": 1.0,
            "refusal_rate": 2.0,
            "hedging_ratio": 1.5,
            "tool_use_rate": 1.0,
            "reasoning_depth": 1.0,
        }

        # Rolling window storage
        self.production_window: deque = deque(maxlen=window_size)

        # Baseline statistics from evaluation
        self.evaluation_baseline: Dict[str, Dict[str, float]] = {}
        self.baseline_set = False

        # Alert tracking
        self.alerts: List[Alert] = []
        self.alert_handlers: List[AlertHandler] = []

        # Compile regex patterns
        self._refusal_pattern = re.compile(
            r"\b(cannot|can't|unable|sorry|apologize|refuse|decline|"
            r"not able to|not allowed|against policy)\b",
            re.IGNORECASE,
        )

        self._hedging_pattern = re.compile(
            r"\b(i think|i believe|might|maybe|perhaps|possibly|"
            r"probably|likely|uncertain|not sure)\b",
            re.IGNORECASE,
        )

    def detect(self, text: str) -> Dict[str, Any]:
        """Detect patterns (BaseDetector compatibility).

        Args:
            text: Input text to analyze

        Returns:
            Detection results
        """
        # Create a simple interaction and check divergence
        interaction = Interaction(
            timestamp=datetime.now(),
            response=text,
            refusal=self._detect_refusal(text),
            tool_used=False,
            reasoning_depth=0.0,
        )

        self.log_production_interaction(interaction)

        if self.baseline_set and len(self.production_window) >= 10:
            divergence_report = self.get_divergence_report()
            return {
                "detected": divergence_report["has_divergence"],
                "confidence": divergence_report["max_z_score"] / 5.0
                if "max_z_score" in divergence_report
                else 0.0,
                "details": divergence_report,
            }

        return {"detected": False, "confidence": 0.0, "details": {}}

    def set_evaluation_baseline(self, eval_samples: List[str]) -> None:
        """Set baseline statistics from evaluation samples.

        Args:
            eval_samples: List of model responses from evaluation

        Raises:
            ValueError: If eval_samples is empty or invalid
        """
        if not eval_samples:
            raise ValueError("eval_samples cannot be empty")

        if not all(isinstance(s, str) for s in eval_samples):
            raise ValueError("All eval_samples must be strings")

        # Extract features from evaluation samples
        features = {
            "response_length": [],
            "refusal_rate": [],
            "hedging_ratio": [],
            "tool_use_rate": [],
            "reasoning_depth": [],
        }

        for sample in eval_samples:
            features["response_length"].append(len(sample))
            features["refusal_rate"].append(1.0 if self._detect_refusal(sample) else 0.0)
            features["hedging_ratio"].append(self._compute_hedging_ratio(sample))
            features["tool_use_rate"].append(0.0)  # Default for text-only samples
            features["reasoning_depth"].append(0.0)  # Default for text-only samples

        # Compute baseline statistics
        self.evaluation_baseline = {}
        for feature_name, values in features.items():
            self.evaluation_baseline[feature_name] = {
                "mean": float(np.mean(values)),
                "std": float(np.std(values)) + 1e-6,  # Add small epsilon
                "min": float(np.min(values)),
                "max": float(np.max(values)),
            }

        self.baseline_set = True

    def log_production_interaction(self, interaction: Interaction) -> None:
        """Log a production interaction to the rolling window.

        Args:
            interaction: The interaction to log

        Raises:
            ValueError: If interaction is invalid
        """
        if not isinstance(interaction, Interaction):
            raise ValueError("interaction must be an Interaction object")

        self.production_window.append(interaction)

        # Check for divergence if baseline is set and window has enough samples
        if self.baseline_set and len(self.production_window) >= 30:
            self._check_divergence()

    def get_divergence_report(
        self, time_window: Optional[int] = None
    ) -> Dict[str, Any]:
        """Generate a comprehensive divergence report.

        Args:
            time_window: Optional time window in seconds (if None, use all data)

        Returns:
            Dictionary containing divergence analysis

        Raises:
            ValueError: If baseline not set or insufficient data
        """
        if not self.baseline_set:
            raise ValueError("Evaluation baseline must be set before generating report")

        if len(self.production_window) < 10:
            raise ValueError("Insufficient production data for report")

        # Filter by time window if specified
        interactions = list(self.production_window)
        if time_window is not None:
            cutoff_time = datetime.now().timestamp() - time_window
            interactions = [
                i for i in interactions if i.timestamp.timestamp() >= cutoff_time
            ]

        # Compute current production statistics
        prod_stats = self._compute_production_stats(interactions)

        # Compute z-scores
        z_scores = {}
        for feature in prod_stats.keys():
            if feature in self.evaluation_baseline:
                baseline = self.evaluation_baseline[feature]
                z_score = (prod_stats[feature]["mean"] - baseline["mean"]) / baseline[
                    "std"
                ]
                z_scores[feature] = float(z_score)

        # Analyze trends
        trends = self._analyze_trends(interactions)

        # Recent alerts
        recent_alerts = [
            {
                "timestamp": alert.timestamp.isoformat(),
                "feature": alert.feature,
                "severity": alert.severity.value,
                "z_score": alert.z_score,
            }
            for alert in self.alerts[-10:]  # Last 10 alerts
        ]

        return {
            "timestamp": datetime.now().isoformat(),
            "has_divergence": any(abs(z) > self.alert_threshold for z in z_scores.values()),
            "production_stats": prod_stats,
            "baseline_stats": self.evaluation_baseline,
            "z_scores": z_scores,
            "max_z_score": max(abs(z) for z in z_scores.values()) if z_scores else 0.0,
            "trends": trends,
            "recent_alerts": recent_alerts,
            "window_size": len(interactions),
        }

    def register_alert_handler(self, handler: AlertHandler) -> None:
        """Register an alert notification handler.

        Args:
            handler: The alert handler to register
        """
        if not isinstance(handler, AlertHandler):
            raise ValueError("handler must implement AlertHandler protocol")
        self.alert_handlers.append(handler)

    def clear_alerts(self) -> None:
        """Clear all stored alerts."""
        self.alerts.clear()

    def _check_divergence(self) -> None:
        """Check for divergence and trigger alerts if needed."""
        if not self.baseline_set or len(self.production_window) < 30:
            return

        # Compute current production statistics
        prod_stats = self._compute_production_stats(list(self.production_window))

        # Check each feature
        for feature, stats in prod_stats.items():
            if feature not in self.evaluation_baseline:
                continue

            baseline = self.evaluation_baseline[feature]
            prod_value = stats["mean"]
            baseline_value = baseline["mean"]
            baseline_std = baseline["std"]

            # Compute z-score
            z_score = (prod_value - baseline_value) / baseline_std

            # Check if threshold exceeded
            if abs(z_score) >= self.alert_threshold:
                self._trigger_alert(feature, z_score, prod_value, baseline_value)

    def _trigger_alert(
        self, feature: str, z_score: float, prod_value: float, eval_value: float
    ) -> None:
        """Trigger an alert for divergent behavior.

        Args:
            feature: The feature that diverged
            z_score: The z-score of the divergence
            prod_value: Current production value
            eval_value: Evaluation baseline value
        """
        severity = self._assess_severity(z_score)
        recommendation = self._recommend_action(feature, z_score)

        # Get recent trend
        trend = self._analyze_trends(list(self.production_window))
        feature_trend = trend.get(feature, "unknown")

        alert = Alert(
            timestamp=datetime.now(),
            feature=feature,
            severity=severity,
            z_score=z_score,
            production_value=prod_value,
            baseline_value=eval_value,
            recommendation=recommendation,
            trend=feature_trend,
        )

        self.alerts.append(alert)

        # Notify all registered handlers
        for handler in self.alert_handlers:
            try:
                handler.handle_alert(alert)
            except Exception:
                # Continue with other handlers even if one fails
                pass

    def _assess_severity(self, z_score: float) -> AlertSeverity:
        """Assess alert severity based on z-score.

        Args:
            z_score: The z-score of the divergence

        Returns:
            Alert severity level
        """
        abs_z = abs(z_score)

        if abs_z >= 5.0:
            return AlertSeverity.CRITICAL
        elif abs_z >= 4.0:
            return AlertSeverity.HIGH
        elif abs_z >= 3.0:
            return AlertSeverity.MEDIUM
        else:
            return AlertSeverity.LOW

    def _recommend_action(self, feature: str, z_score: float) -> str:
        """Generate actionable recommendation based on divergence.

        Args:
            feature: The feature that diverged
            z_score: The z-score of the divergence

        Returns:
            Actionable recommendation string
        """
        severity = self._assess_severity(z_score)
        direction = "increased" if z_score > 0 else "decreased"

        recommendations = {
            "response_length": {
                "increased": "Model producing longer responses than in evaluation. "
                "Review if this indicates hedging or verbosity. Consider monitoring "
                "computational costs.",
                "decreased": "Model producing shorter responses than in evaluation. "
                "Check if responses remain helpful and complete.",
            },
            "refusal_rate": {
                "increased": "Model refusing more often than in evaluation. CRITICAL: "
                "May indicate awareness of production context or safety over-trigger. "
                "Review refusal patterns and safety policies.",
                "decreased": "Model refusing less often than in evaluation. CRITICAL: "
                "Potential safety concern. Review whether model is bypassing intended "
                "safety constraints.",
            },
            "hedging_ratio": {
                "increased": "Model hedging more than in evaluation. May indicate "
                "uncertainty or awareness of being monitored. Review confidence "
                "calibration.",
                "decreased": "Model hedging less than in evaluation. May indicate "
                "overconfidence. Verify output quality.",
            },
            "tool_use_rate": {
                "increased": "Model using tools more than in evaluation. Review tool "
                "call appropriateness and costs.",
                "decreased": "Model using tools less than in evaluation. May indicate "
                "avoiding observable actions. Review task completion rates.",
            },
            "reasoning_depth": {
                "increased": "Model showing deeper reasoning than in evaluation. May "
                "indicate trying to appear more thorough. Review for unnecessary complexity.",
                "decreased": "Model showing shallower reasoning than in evaluation. "
                "Review output quality and task performance.",
            },
        }

        base_rec = recommendations.get(feature, {}).get(
            direction, f"Feature {feature} has {direction}. Review for patterns."
        )

        if severity in [AlertSeverity.CRITICAL, AlertSeverity.HIGH]:
            base_rec = f"[{severity.value.upper()}] {base_rec} IMMEDIATE REVIEW REQUIRED."

        return base_rec

    def _analyze_trends(self, interactions: List[Interaction]) -> Dict[str, str]:
        """Analyze trends in feature values over time.

        Args:
            interactions: List of interactions to analyze

        Returns:
            Dictionary mapping features to trend descriptions
        """
        if len(interactions) < 10:
            return {}

        # Split into first and second half
        mid = len(interactions) // 2
        first_half = interactions[:mid]
        second_half = interactions[mid:]

        # Compute stats for each half
        first_stats = self._compute_production_stats(first_half)
        second_stats = self._compute_production_stats(second_half)

        trends = {}
        for feature in first_stats.keys():
            first_val = first_stats[feature]["mean"]
            second_val = second_stats[feature]["mean"]

            if abs(second_val - first_val) < 0.01:  # Threshold for stability
                trends[feature] = "stable"
            elif second_val > first_val * 1.1:  # >10% increase
                trends[feature] = "increasing"
            elif second_val < first_val * 0.9:  # >10% decrease
                trends[feature] = "decreasing"
            else:
                trends[feature] = "stable"

        return trends

    def _compute_production_stats(
        self, interactions: List[Interaction]
    ) -> Dict[str, Dict[str, float]]:
        """Compute production statistics from interactions.

        Args:
            interactions: List of interactions to analyze

        Returns:
            Dictionary mapping feature names to their statistics
        """
        if not interactions:
            return {}

        # Extract features
        features = {
            "response_length": [len(i.response) for i in interactions],
            "refusal_rate": [1.0 if i.refusal else 0.0 for i in interactions],
            "hedging_ratio": [
                self._compute_hedging_ratio(i.response) for i in interactions
            ],
            "tool_use_rate": [1.0 if i.tool_used else 0.0 for i in interactions],
            "reasoning_depth": [i.reasoning_depth for i in interactions],
        }

        # Compute statistics
        stats = {}
        for feature_name, values in features.items():
            stats[feature_name] = {
                "mean": float(np.mean(values)),
                "std": float(np.std(values)),
                "min": float(np.min(values)),
                "max": float(np.max(values)),
            }

        return stats

    def _detect_refusal(self, text: str) -> bool:
        """Detect if a response is a refusal.

        Args:
            text: Response text to analyze

        Returns:
            True if refusal detected
        """
        if not text:
            return False

        matches = self._refusal_pattern.findall(text)
        # Consider it a refusal if multiple refusal indicators present
        return len(matches) >= 2

    def _compute_hedging_ratio(self, text: str) -> float:
        """Compute ratio of hedging phrases in text.

        Args:
            text: Text to analyze

        Returns:
            Ratio of hedging phrases to total words
        """
        if not text:
            return 0.0

        words = text.split()
        if not words:
            return 0.0

        hedging_matches = self._hedging_pattern.findall(text)
        return len(hedging_matches) / len(words)

    def get_alert_summary(
        self, severity_filter: Optional[AlertSeverity] = None
    ) -> Dict[str, Any]:
        """Get summary of alerts.

        Args:
            severity_filter: Optional filter by severity level

        Returns:
            Summary of alerts
        """
        alerts_to_summarize = self.alerts
        if severity_filter:
            alerts_to_summarize = [
                a for a in self.alerts if a.severity == severity_filter
            ]

        if not alerts_to_summarize:
            return {
                "total_alerts": 0,
                "by_severity": {},
                "by_feature": {},
                "latest_alert": None,
            }

        by_severity = {}
        by_feature = {}

        for alert in alerts_to_summarize:
            # Count by severity
            severity_key = alert.severity.value
            by_severity[severity_key] = by_severity.get(severity_key, 0) + 1

            # Count by feature
            by_feature[alert.feature] = by_feature.get(alert.feature, 0) + 1

        return {
            "total_alerts": len(alerts_to_summarize),
            "by_severity": by_severity,
            "by_feature": by_feature,
            "latest_alert": {
                "timestamp": alerts_to_summarize[-1].timestamp.isoformat(),
                "feature": alerts_to_summarize[-1].feature,
                "severity": alerts_to_summarize[-1].severity.value,
                "z_score": alerts_to_summarize[-1].z_score,
            }
            if alerts_to_summarize
            else None,
        }

`init(window_size=1000, alert_threshold=2.0, feature_weights=None)` ¶

Initialize the Observer Effect Monitor.

Parameters:

Name	Type	Description	Default
`window_size`	`int`	Number of interactions to maintain in rolling window	`1000`
`alert_threshold`	`float`	Minimum z-score to trigger alerts (default: 2.0)	`2.0`
`feature_weights`	`Optional[Dict[str, float]]`	Optional weights for features in aggregate scoring	`None`

Raises:

Type	Description
`ValueError`	If parameters are invalid

Source code in src/ai_metacognition/detectors/observer_effect.py

def __init__(
    self,
    window_size: int = 1000,
    alert_threshold: float = 2.0,
    feature_weights: Optional[Dict[str, float]] = None,
) -> None:
    """Initialize the Observer Effect Monitor.

    Args:
        window_size: Number of interactions to maintain in rolling window
        alert_threshold: Minimum z-score to trigger alerts (default: 2.0)
        feature_weights: Optional weights for features in aggregate scoring

    Raises:
        ValueError: If parameters are invalid
    """
    super().__init__()

    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if alert_threshold <= 0:
        raise ValueError("alert_threshold must be positive")

    self.window_size = window_size
    self.alert_threshold = alert_threshold
    self.feature_weights = feature_weights or {
        "response_length": 1.0,
        "refusal_rate": 2.0,
        "hedging_ratio": 1.5,
        "tool_use_rate": 1.0,
        "reasoning_depth": 1.0,
    }

    # Rolling window storage
    self.production_window: deque = deque(maxlen=window_size)

    # Baseline statistics from evaluation
    self.evaluation_baseline: Dict[str, Dict[str, float]] = {}
    self.baseline_set = False

    # Alert tracking
    self.alerts: List[Alert] = []
    self.alert_handlers: List[AlertHandler] = []

    # Compile regex patterns
    self._refusal_pattern = re.compile(
        r"\b(cannot|can't|unable|sorry|apologize|refuse|decline|"
        r"not able to|not allowed|against policy)\b",
        re.IGNORECASE,
    )

    self._hedging_pattern = re.compile(
        r"\b(i think|i believe|might|maybe|perhaps|possibly|"
        r"probably|likely|uncertain|not sure)\b",
        re.IGNORECASE,
    )

`clear_alerts()` ¶

Clear all stored alerts.

Source code in src/ai_metacognition/detectors/observer_effect.py

def clear_alerts(self) -> None:
    """Clear all stored alerts."""
    self.alerts.clear()

`detect(text)` ¶

Detect patterns (BaseDetector compatibility).

Parameters:

Name	Type	Description	Default
`text`	`str`	Input text to analyze	required

Returns:

Type	Description
`Dict[str, Any]`	Detection results

Source code in src/ai_metacognition/detectors/observer_effect.py

def detect(self, text: str) -> Dict[str, Any]:
    """Detect patterns (BaseDetector compatibility).

    Args:
        text: Input text to analyze

    Returns:
        Detection results
    """
    # Create a simple interaction and check divergence
    interaction = Interaction(
        timestamp=datetime.now(),
        response=text,
        refusal=self._detect_refusal(text),
        tool_used=False,
        reasoning_depth=0.0,
    )

    self.log_production_interaction(interaction)

    if self.baseline_set and len(self.production_window) >= 10:
        divergence_report = self.get_divergence_report()
        return {
            "detected": divergence_report["has_divergence"],
            "confidence": divergence_report["max_z_score"] / 5.0
            if "max_z_score" in divergence_report
            else 0.0,
            "details": divergence_report,
        }

    return {"detected": False, "confidence": 0.0, "details": {}}

`get_alert_summary(severity_filter=None)` ¶

Get summary of alerts.

Parameters:

Name	Type	Description	Default
`severity_filter`	`Optional[AlertSeverity]`	Optional filter by severity level	`None`

Returns:

Type	Description
`Dict[str, Any]`	Summary of alerts

Source code in src/ai_metacognition/detectors/observer_effect.py

def get_alert_summary(
    self, severity_filter: Optional[AlertSeverity] = None
) -> Dict[str, Any]:
    """Get summary of alerts.

    Args:
        severity_filter: Optional filter by severity level

    Returns:
        Summary of alerts
    """
    alerts_to_summarize = self.alerts
    if severity_filter:
        alerts_to_summarize = [
            a for a in self.alerts if a.severity == severity_filter
        ]

    if not alerts_to_summarize:
        return {
            "total_alerts": 0,
            "by_severity": {},
            "by_feature": {},
            "latest_alert": None,
        }

    by_severity = {}
    by_feature = {}

    for alert in alerts_to_summarize:
        # Count by severity
        severity_key = alert.severity.value
        by_severity[severity_key] = by_severity.get(severity_key, 0) + 1

        # Count by feature
        by_feature[alert.feature] = by_feature.get(alert.feature, 0) + 1

    return {
        "total_alerts": len(alerts_to_summarize),
        "by_severity": by_severity,
        "by_feature": by_feature,
        "latest_alert": {
            "timestamp": alerts_to_summarize[-1].timestamp.isoformat(),
            "feature": alerts_to_summarize[-1].feature,
            "severity": alerts_to_summarize[-1].severity.value,
            "z_score": alerts_to_summarize[-1].z_score,
        }
        if alerts_to_summarize
        else None,
    }

`get_divergence_report(time_window=None)` ¶

Generate a comprehensive divergence report.

Parameters:

Name	Type	Description	Default
`time_window`	`Optional[int]`	Optional time window in seconds (if None, use all data)	`None`

Returns:

Type	Description
`Dict[str, Any]`	Dictionary containing divergence analysis

Raises:

Type	Description
`ValueError`	If baseline not set or insufficient data

Source code in src/ai_metacognition/detectors/observer_effect.py

def get_divergence_report(
    self, time_window: Optional[int] = None
) -> Dict[str, Any]:
    """Generate a comprehensive divergence report.

    Args:
        time_window: Optional time window in seconds (if None, use all data)

    Returns:
        Dictionary containing divergence analysis

    Raises:
        ValueError: If baseline not set or insufficient data
    """
    if not self.baseline_set:
        raise ValueError("Evaluation baseline must be set before generating report")

    if len(self.production_window) < 10:
        raise ValueError("Insufficient production data for report")

    # Filter by time window if specified
    interactions = list(self.production_window)
    if time_window is not None:
        cutoff_time = datetime.now().timestamp() - time_window
        interactions = [
            i for i in interactions if i.timestamp.timestamp() >= cutoff_time
        ]

    # Compute current production statistics
    prod_stats = self._compute_production_stats(interactions)

    # Compute z-scores
    z_scores = {}
    for feature in prod_stats.keys():
        if feature in self.evaluation_baseline:
            baseline = self.evaluation_baseline[feature]
            z_score = (prod_stats[feature]["mean"] - baseline["mean"]) / baseline[
                "std"
            ]
            z_scores[feature] = float(z_score)

    # Analyze trends
    trends = self._analyze_trends(interactions)

    # Recent alerts
    recent_alerts = [
        {
            "timestamp": alert.timestamp.isoformat(),
            "feature": alert.feature,
            "severity": alert.severity.value,
            "z_score": alert.z_score,
        }
        for alert in self.alerts[-10:]  # Last 10 alerts
    ]

    return {
        "timestamp": datetime.now().isoformat(),
        "has_divergence": any(abs(z) > self.alert_threshold for z in z_scores.values()),
        "production_stats": prod_stats,
        "baseline_stats": self.evaluation_baseline,
        "z_scores": z_scores,
        "max_z_score": max(abs(z) for z in z_scores.values()) if z_scores else 0.0,
        "trends": trends,
        "recent_alerts": recent_alerts,
        "window_size": len(interactions),
    }

`log_production_interaction(interaction)` ¶

Log a production interaction to the rolling window.

Parameters:

Name	Type	Description	Default
`interaction`	`Interaction`	The interaction to log	required

Raises:

Type	Description
`ValueError`	If interaction is invalid

Source code in src/ai_metacognition/detectors/observer_effect.py

def log_production_interaction(self, interaction: Interaction) -> None:
    """Log a production interaction to the rolling window.

    Args:
        interaction: The interaction to log

    Raises:
        ValueError: If interaction is invalid
    """
    if not isinstance(interaction, Interaction):
        raise ValueError("interaction must be an Interaction object")

    self.production_window.append(interaction)

    # Check for divergence if baseline is set and window has enough samples
    if self.baseline_set and len(self.production_window) >= 30:
        self._check_divergence()

`register_alert_handler(handler)` ¶

Register an alert notification handler.

Parameters:

Name	Type	Description	Default
`handler`	`AlertHandler`	The alert handler to register	required

Source code in src/ai_metacognition/detectors/observer_effect.py

def register_alert_handler(self, handler: AlertHandler) -> None:
    """Register an alert notification handler.

    Args:
        handler: The alert handler to register
    """
    if not isinstance(handler, AlertHandler):
        raise ValueError("handler must implement AlertHandler protocol")
    self.alert_handlers.append(handler)

`set_evaluation_baseline(eval_samples)` ¶

Set baseline statistics from evaluation samples.

Parameters:

Name	Type	Description	Default
`eval_samples`	`List[str]`	List of model responses from evaluation	required

Raises:

Type	Description
`ValueError`	If eval_samples is empty or invalid

Source code in src/ai_metacognition/detectors/observer_effect.py

def set_evaluation_baseline(self, eval_samples: List[str]) -> None:
    """Set baseline statistics from evaluation samples.

    Args:
        eval_samples: List of model responses from evaluation

    Raises:
        ValueError: If eval_samples is empty or invalid
    """
    if not eval_samples:
        raise ValueError("eval_samples cannot be empty")

    if not all(isinstance(s, str) for s in eval_samples):
        raise ValueError("All eval_samples must be strings")

    # Extract features from evaluation samples
    features = {
        "response_length": [],
        "refusal_rate": [],
        "hedging_ratio": [],
        "tool_use_rate": [],
        "reasoning_depth": [],
    }

    for sample in eval_samples:
        features["response_length"].append(len(sample))
        features["refusal_rate"].append(1.0 if self._detect_refusal(sample) else 0.0)
        features["hedging_ratio"].append(self._compute_hedging_ratio(sample))
        features["tool_use_rate"].append(0.0)  # Default for text-only samples
        features["reasoning_depth"].append(0.0)  # Default for text-only samples

    # Compute baseline statistics
    self.evaluation_baseline = {}
    for feature_name, values in features.items():
        self.evaluation_baseline[feature_name] = {
            "mean": float(np.mean(values)),
            "std": float(np.std(values)) + 1e-6,  # Add small epsilon
            "min": float(np.min(values)),
            "max": float(np.max(values)),
        }

    self.baseline_set = True

Alert¶

`Alert` `dataclass` ¶

An alert triggered by divergence detection.

Attributes:

Name	Type	Description
`timestamp`	`datetime`	When the alert was triggered
`feature`	`str`	The feature that diverged
`severity`	`AlertSeverity`	Alert severity level
`z_score`	`float`	The z-score that triggered the alert
`production_value`	`float`	Current production value
`baseline_value`	`float`	Evaluation baseline value
`recommendation`	`str`	Actionable recommendation
`trend`	`Optional[str]`	Recent trend information

Source code in src/ai_metacognition/detectors/observer_effect.py

@dataclass
class Alert:
    """An alert triggered by divergence detection.

    Attributes:
        timestamp: When the alert was triggered
        feature: The feature that diverged
        severity: Alert severity level
        z_score: The z-score that triggered the alert
        production_value: Current production value
        baseline_value: Evaluation baseline value
        recommendation: Actionable recommendation
        trend: Recent trend information
    """

    timestamp: datetime
    feature: str
    severity: AlertSeverity
    z_score: float
    production_value: float
    baseline_value: float
    recommendation: str
    trend: Optional[str] = None

AlertSeverity¶

`AlertSeverity` ¶

Bases: Enum

Alert severity levels based on z-scores.

Source code in src/ai_metacognition/detectors/observer_effect.py

class AlertSeverity(Enum):
    """Alert severity levels based on z-scores."""

    LOW = "low"  # 2-3 sigma
    MEDIUM = "medium"  # 3-4 sigma
    HIGH = "high"  # 4-5 sigma
    CRITICAL = "critical"  # >5 sigma

Best Practices¶

1. Baseline Quality¶

Use 100+ evaluation samples
Ensure diverse coverage
Update periodically (weekly/monthly)
Match production distribution

2. Window Size Selection¶

Small (50-100): More responsive, more variance
Medium (100-200): Balanced (recommended)
Large (200-500): More stable, slower detection

3. Threshold Tuning¶

Start conservative (2.5-3.0σ) and adjust based on: - False positive rate - Alert frequency - Critical features

4. Alert Fatigue Prevention¶

Use appropriate thresholds
Aggregate similar alerts
Implement cooldown periods
Focus on critical features

Examples¶

See production_monitoring_example.py for complete example with multiple alert handlers.

Observer Effect Monitoring¶

Overview¶

Use Cases¶

How It Works¶

Installation¶

Basic Usage¶

Step 1: Initialize Monitor¶

Step 2: Register Alert Handlers¶

Step 3: Set Evaluation Baseline¶

Step 4: Log Production Interactions¶

Step 5: Check for Divergence¶

Alert System¶

Severity Levels¶

Custom Alert Handlers¶

Advanced Usage¶

Async Monitoring¶

Feature Weights¶

Dynamic Thresholds¶

Integration Patterns¶

Pattern 1: Synchronous¶

Pattern 2: Asynchronous Queue¶

Pattern 3: Batch Processing¶

API Reference¶

ObserverEffectMonitor¶

ObserverEffectMonitor ¶

__init__(window_size=1000, alert_threshold=2.0, feature_weights=None) ¶

clear_alerts() ¶

detect(text) ¶

get_alert_summary(severity_filter=None) ¶

get_divergence_report(time_window=None) ¶

log_production_interaction(interaction) ¶

register_alert_handler(handler) ¶

set_evaluation_baseline(eval_samples) ¶

Alert¶

Alert dataclass ¶

AlertSeverity¶

AlertSeverity ¶

Best Practices¶

1. Baseline Quality¶

2. Window Size Selection¶

3. Threshold Tuning¶

4. Alert Fatigue Prevention¶

Examples¶

Further Reading¶

`ObserverEffectMonitor` ¶

`init(window_size=1000, alert_threshold=2.0, feature_weights=None)` ¶

`clear_alerts()` ¶

`detect(text)` ¶

`get_alert_summary(severity_filter=None)` ¶

`get_divergence_report(time_window=None)` ¶

`log_production_interaction(interaction)` ¶

`register_alert_handler(handler)` ¶

`set_evaluation_baseline(eval_samples)` ¶

`Alert` `dataclass` ¶

`AlertSeverity` ¶