MCP server for observability tooling: query traces, metrics, and logs from local JSONL files for Claude Code sessions. Optionally integrates with SigNoz Cloud for enhanced observability.

```bash
npm install observability-toolkit
```

Register the server with Claude Code:

```bash
claude mcp add observability-toolkit -- npx -y observability-toolkit
```

Or for local development:

```bash
claude mcp add observability-toolkit -- node ~/.claude/mcp-servers/observability-toolkit/dist/server.js
```
| Tool | Description |
|------|-------------|
| obs_query_traces | Query traces with filtering, regex, numeric operators |
| obs_query_metrics | Query metrics with aggregations (sum, avg, p50, p95, p99, rate) |
| obs_query_logs | Query logs with boolean search, field extraction |
| obs_query_llm_events | Query LLM events with token usage and duration metrics |
| obs_query_evaluations | Query evaluation events with aggregations and groupBy |
| obs_query_verifications | Query human verification events for EU AI Act compliance |
| obs_health_check | Check telemetry system health with cache statistics |
| obs_context_stats | Get context window utilization stats |
| obs_get_trace_url | Get SigNoz trace viewer URL (requires SigNoz) |
| obs_setup_claudeignore | Add entries to .claudeignore |
| obs_export_langfuse | Export evaluations to Langfuse via OTLP HTTP |
| Variable | Description | Default |
|----------|-------------|---------|
| TELEMETRY_DIR | Local telemetry directory | ~/.claude/telemetry |
| SIGNOZ_URL | SigNoz instance URL | - |
| SIGNOZ_API_KEY | SigNoz API key | - |
| CACHE_TTL_MS | Query cache TTL in milliseconds | 60000 |
| RETENTION_DAYS | Days to retain telemetry files | 7 |
| LANGFUSE_ENDPOINT | Langfuse OTLP endpoint URL | - |
| LANGFUSE_PUBLIC_KEY | Langfuse public key | - |
| LANGFUSE_SECRET_KEY | Langfuse secret key | - |
```javascript
// Basic query
obs_query_traces({ limit: 10 })
// Filter by trace ID
obs_query_traces({ traceId: "abc123..." })
// Filter by service and duration
obs_query_traces({ serviceName: "claude-code", minDurationMs: 100 })
// Regex pattern matching
obs_query_traces({ spanNameRegex: "^http\\..*" })
// Numeric attribute filtering
obs_query_traces({
numericFilter: [
{ attribute: "http.status_code", operator: "gte", value: 400 }
]
})
// Existence checks
obs_query_traces({
attributeExists: ["error.message"],
attributeNotExists: ["http.response.body"]
})
// OTel GenAI agent/tool filters
obs_query_traces({ agentName: "Explore", toolName: "Read" })
obs_query_traces({ operationName: "execute_tool", toolCallId: "toolu_123" })
```
```javascript
// Basic severity filter
obs_query_logs({ severity: "ERROR", limit: 20 })
// Boolean search (AND)
obs_query_logs({
searchTerms: ["timeout", "connection"],
searchOperator: "AND"
})
// Boolean search (OR)
obs_query_logs({
searchTerms: ["error", "warning", "critical"],
searchOperator: "OR"
})
// Field extraction from JSON logs
obs_query_logs({
extractFields: ["user.id", "request.method", "response.status"]
})
// Exclude patterns
obs_query_logs({
search: "error",
excludeSearch: "health-check"
})
```
```javascript
// Basic query
obs_query_metrics({ metricName: "session.context.size" })
// Aggregations
obs_query_metrics({ metricName: "http.duration", aggregation: "avg" })
obs_query_metrics({ metricName: "http.duration", aggregation: "p95" })
obs_query_metrics({ metricName: "requests.count", aggregation: "rate" })
// Time bucket grouping
obs_query_metrics({
metricName: "token.usage",
aggregation: "sum",
timeBucket: "1h",
groupBy: ["model"]
})
// Percentiles
obs_query_metrics({ metricName: "latency", aggregation: "p99" })
```
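The pXX aggregations above can be sketched with a nearest-rank percentile computation. This is an illustrative sketch only; the toolkit's actual interpolation method may differ.

```javascript
// Nearest-rank percentile: sort, then pick the ceil(p% * n)-th value.
// Illustrative only -- not the toolkit's internal implementation.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const latencies = Array.from({ length: 100 }, (_, i) => i + 1); // 1..100
const p95 = percentile(latencies, 95);
```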
```javascript
// Basic query
obs_query_llm_events({ limit: 20 })
// Filter by model and provider
obs_query_llm_events({ model: "claude-3-opus", provider: "anthropic" })
// OTel GenAI operation types
obs_query_llm_events({ operationName: "chat" })
obs_query_llm_events({ operationName: "invoke_agent" })
// Filter by conversation
obs_query_llm_events({ conversationId: "conv-abc123" })
// Combine filters
obs_query_llm_events({
operationName: "chat",
provider: "anthropic",
conversationId: "conv-abc123"
})
```
Query events from any LLM provider using OTel GenAI standard identifiers:
```javascript
// Anthropic Claude
obs_query_llm_events({ provider: "anthropic", model: "claude-3-opus" })
// OpenAI
obs_query_llm_events({ provider: "openai", model: "gpt-4o" })
// Google Gemini
obs_query_llm_events({ provider: "gcp.gemini", model: "gemini-1.5-pro" })
// Mistral AI
obs_query_llm_events({ provider: "mistral_ai", model: "mistral-large" })
// Cohere
obs_query_llm_events({ provider: "cohere", model: "command-r-plus" })
// AWS Bedrock (multi-model)
obs_query_llm_events({ provider: "aws.bedrock" })
// Azure OpenAI
obs_query_llm_events({ provider: "azure.ai.openai" })
// Local models (Ollama)
obs_query_llm_events({ provider: "ollama", model: "llama3:8b" })
// Groq
obs_query_llm_events({ provider: "groq", model: "llama-3.3-70b" })
```
Provider Fallback Chain: The toolkit uses OTel GenAI v1.39-compliant attribute lookup:
1. gen_ai.provider.name (primary)
2. gen_ai.system (legacy OTel)
3. provider (custom/fallback)
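The fallback chain amounts to trying each attribute key in order and returning the first one present. A minimal sketch, with illustrative names rather than the toolkit's actual internals:

```javascript
// Try each provider attribute key in priority order; return the first hit.
// Key order matches the fallback chain above; the function name is illustrative.
function resolveProvider(attributes) {
  const keys = ["gen_ai.provider.name", "gen_ai.system", "provider"];
  for (const key of keys) {
    if (attributes[key] !== undefined) return attributes[key];
  }
  return null;
}

// A legacy event that only sets gen_ai.system still resolves.
const legacyProvider = resolveProvider({ "gen_ai.system": "openai" });
```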
```javascript
obs_health_check({ verbose: true })
// Returns:
{
"status": "ok",
"backends": { ... },
"cache": {
"traces": { "hits": 10, "misses": 5, "hitRate": 0.67, "size": 15, "evictions": 0 },
"logs": { "hits": 8, "misses": 12, "hitRate": 0.4, "size": 20, "evictions": 2 },
"metrics": { "hits": 0, "misses": 0, "hitRate": 0, "size": 0, "evictions": 0 },
"llmEvents": { "hits": 0, "misses": 0, "hitRate": 0, "size": 0, "evictions": 0 }
}
}
```
| Feature | Description |
|---------|-------------|
| Percentile Aggregations | p50, p95, p99 for metrics |
| Time Bucket Grouping | 1m, 5m, 1h, 1d buckets for trend analysis |
| Rate Calculations | Per-second rate of change |
| Numeric Operators | gt, gte, lt, lte, eq for attribute filtering |
| Regex Patterns | Advanced span name filtering |
| Boolean Search | AND/OR operators for log queries |
| Field Extraction | Extract JSON paths from structured logs |
| Negation Filters | Exclude matching spans/logs |
| Existence Checks | Filter by attribute presence |
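Time-bucket grouping from the table above can be sketched as flooring each timestamp to its bucket start, then aggregating per bucket. This is an assumption about the mechanics, not the toolkit's code:

```javascript
// Floor a millisecond timestamp to the start of its bucket.
// Bucket labels mirror the table above (1m, 5m, 1h, 1d).
function bucketStart(timestampMs, bucket) {
  const sizes = { "1m": 60_000, "5m": 300_000, "1h": 3_600_000, "1d": 86_400_000 };
  return Math.floor(timestampMs / sizes[bucket]) * sizes[bucket];
}

// Sum metric points grouped by bucket start time.
function sumByBucket(points, bucket) {
  const out = new Map();
  for (const { timestampMs, value } of points) {
    const start = bucketStart(timestampMs, bucket);
    out.set(start, (out.get(start) ?? 0) + value);
  }
  return out;
}

const byHour = sumByBucket(
  [
    { timestampMs: 0, value: 1 },
    { timestampMs: 30 * 60_000, value: 2 },  // same hour as the first point
    { timestampMs: 90 * 60_000, value: 4 },  // second hour
  ],
  "1h"
);
```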
| Feature | Description |
|---------|-------------|
| severityNumber | Standard OTel severity levels |
| statusCode | UNSET, OK, ERROR for spans |
| Histogram Buckets | Full histogram distribution support |
| InstrumentationScope | Library/module metadata |
| Span Links | Cross-trace relationships |
| Exemplars | Metric-to-trace correlation |
| Aggregation Temporality | DELTA, CUMULATIVE support |
| Feature | Description |
|---------|-------------|
| gen_ai.operation.name | Filter by chat, embeddings, invoke_agent, execute_tool |
| gen_ai.provider.name | Provider fallback: gen_ai.provider.name → gen_ai.system → provider |
| gen_ai.conversation.id | Filter LLM events by conversation ID |
| gen_ai.agent.id/name | Filter traces by agent attributes |
| gen_ai.tool.name/call.id | Filter traces by tool attributes |
| gen_ai.response.model | Actual model that responded |
| gen_ai.response.finish_reasons | Why generation stopped |
| gen_ai.request.temperature | Sampling temperature |
| gen_ai.request.max_tokens | Maximum output tokens |
| Percentiles | p50, p95, p99, rate aggregations |
| Provider ID | Description | Example Models |
|-------------|-------------|----------------|
| anthropic | Anthropic Claude | claude-3-opus, claude-3-sonnet, claude-3-haiku |
| openai | OpenAI GPT | gpt-4o, gpt-4-turbo, gpt-3.5-turbo, o1, o1-mini |
| gcp.gemini | Google AI Studio | gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash |
| gcp.vertex_ai | Google Vertex AI | gemini-pro, claude-3-opus (via Vertex) |
| aws.bedrock | AWS Bedrock | claude-3-sonnet, titan-text, llama-3 |
| azure.ai.openai | Azure OpenAI | gpt-4-deployment, gpt-35-turbo |
| mistral_ai | Mistral AI | mistral-large, mistral-small, codestral |
| cohere | Cohere | command-r-plus, command-r, embed-english |
| groq | Groq | llama-3.3-70b, mixtral-8x7b |
| ollama | Ollama (local) | llama3, mistral, codellama |
| together_ai | Together AI | llama-3-70b, mixtral-8x7b |
| fireworks_ai | Fireworks AI | llama-v3-70b, mixtral-8x7b |
| huggingface | HuggingFace | Various open models |
| replicate | Replicate | Various hosted models |
| perplexity | Perplexity | sonar-pro, sonar |
Note: Custom provider identifiers are also supported for internal or unlisted LLM services.
| Feature | Description |
|---------|-------------|
| Query Caching | LRU cache with configurable TTL |
| File Indexing | .idx sidecars for fast lookups |
| Gzip Support | Transparent decompression of .jsonl.gz files |
| BatchWriter | Buffered writes to reduce I/O |
| Streaming | Early termination for large files |
| Parallel Queries | Concurrent multi-directory queries |
| Cursor Pagination | Efficient large result set handling |
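Cursor pagination from the table above can be sketched over a time-sorted result set, using the last-seen timestamp as the cursor. The cursor shape is an assumption for illustration; the toolkit's real cursor format may differ.

```javascript
// Return one page of records after `cursor` (a timestamp, or null for the
// first page), plus the cursor for the next page. Illustrative sketch only.
function pageAfter(records, cursor, pageSize) {
  const start = cursor === null ? 0 : records.findIndex((r) => r.ts > cursor);
  const page = start === -1 ? [] : records.slice(start, start + pageSize);
  const nextCursor = page.length === pageSize ? page[page.length - 1].ts : null;
  return { page, nextCursor };
}

const records = [1, 2, 3, 4, 5].map((ts) => ({ ts }));
const first = pageAfter(records, null, 2);           // ts 1, 2
const second = pageAfter(records, first.nextCursor, 2); // ts 3, 4
const last = pageAfter(records, second.nextCursor, 2);  // ts 5, no next page
```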
| Feature | Description |
|---------|-------------|
| Cache Metrics | Hit/miss/eviction tracking |
| Query Timing | Slow query warnings (>500ms) |
| Circuit Breaker Logging | State transition visibility |
| Health Check Stats | Cache statistics in health output |
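The slow-query warning above (>500ms) amounts to timing each query and logging when the threshold is exceeded. A minimal sketch; the wrapper name and shape are assumptions:

```javascript
// Wrap a query function so that calls slower than thresholdMs emit a warning.
// The 500 ms default matches the table above; everything else is illustrative.
function timed(fn, label, warn = console.warn, thresholdMs = 500) {
  return (...args) => {
    const start = Date.now();
    const result = fn(...args);
    const elapsed = Date.now() - start;
    if (elapsed > thresholdMs) warn(`slow query ${label}: ${elapsed}ms`);
    return result;
  };
}

const double = timed((x) => x * 2, "demo");
```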
| Feature | Description |
|---------|-------------|
| Query Escaping | ClickHouse-specific escaping, 22-pattern blocklist |
| Memory Limits | MAX_RESULTS_IN_MEMORY=10000, streaming aggregation |
| Input Validation | limit≤1000, date range≤365 days, regex limits |
| Type Safety | NaN/Infinity rejection, explicit type assertions |
See docs/security.md for details.
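The input-validation row above (limit ≤ 1000, NaN/Infinity rejection) can be sketched as a guard that rejects non-finite or out-of-range values before a query runs. Function name and error messages are illustrative:

```javascript
// Reject non-finite or out-of-range limits before running a query.
// Bounds mirror the table above; the rest is an illustrative sketch.
function validateLimit(limit) {
  if (typeof limit !== "number" || !Number.isFinite(limit)) {
    throw new Error("limit must be a finite number");
  }
  if (limit < 1 || limit > 1000) {
    throw new Error("limit must be between 1 and 1000");
  }
  return Math.floor(limit);
}
```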
Scans multiple telemetry directories:
- Global: ~/.claude/telemetry/ (always checked)
- Project-local: .claude/telemetry/, telemetry/, .telemetry/

File patterns (supports gzip compression):
- traces-YYYY-MM-DD.jsonl / .jsonl.gz
- logs-YYYY-MM-DD.jsonl / .jsonl.gz
- metrics-YYYY-MM-DD.jsonl / .jsonl.gz
- llm-events-YYYY-MM-DD.jsonl / .jsonl.gz
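Matching those dated filenames can be sketched with a single regular expression over the four signal kinds. This matcher is illustrative; the toolkit's real file discovery may differ.

```javascript
// Match telemetry filenames like "traces-2026-01-28.jsonl(.gz)".
// Pattern mirrors the list above; helper names are illustrative.
const FILE_RE = /^(traces|logs|metrics|llm-events)-(\d{4}-\d{2}-\d{2})\.jsonl(\.gz)?$/;

function parseTelemetryFilename(name) {
  const m = FILE_RE.exec(name);
  if (!m) return null;
  return { kind: m[1], date: m[2], gzipped: Boolean(m[3]) };
}
```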
When configured, queries SigNoz Cloud API with:
- Circuit breaker protection
- Cursor-based pagination
- Response time tracking
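The circuit-breaker protection above follows the usual pattern: after repeated failures the circuit opens and calls are rejected until a reset window elapses. A minimal synchronous sketch; thresholds, class name, and states are assumptions, not the toolkit's API:

```javascript
// Minimal circuit breaker: open after `failureThreshold` consecutive
// failures, allow retry (half-open) after `resetAfterMs`. Illustrative only.
class CircuitBreaker {
  constructor(failureThreshold = 5, resetAfterMs = 60_000) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.failures = 0;
    this.openedAt = null;
  }
  get state() {
    if (this.openedAt === null) return "closed";
    return Date.now() - this.openedAt >= this.resetAfterMs ? "half-open" : "open";
  }
  call(fn) {
    if (this.state === "open") throw new Error("circuit open");
    try {
      const result = fn();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      if (++this.failures >= this.failureThreshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```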
Export data in OpenTelemetry format:
```javascript
// Export traces
const otlpTraces = await backend.exportTracesOTLP({ startDate: "2026-01-28" });
// Export logs
const otlpLogs = await backend.exportLogsOTLP({ severity: "ERROR" });
// Export metrics
const otlpMetrics = await backend.exportMetricsOTLP({ metricName: "http.duration" });
```
Export evaluations to Langfuse for unified tracing and evaluation analysis:
```javascript
// Export all evaluations from last 7 days
obs_export_langfuse({})
// Export with filters
obs_export_langfuse({
evaluationName: "quality",
scoreMin: 0.8,
limit: 500,
batchSize: 100
})
// Dry run to preview export
obs_export_langfuse({
startDate: "2026-01-28",
dryRun: true
})
// Override credentials (for testing)
obs_export_langfuse({
endpoint: "https://cloud.langfuse.com",
publicKey: "pk-lf-...",
secretKey: "sk-lf-..."
})
```
Features:
- Batched OTLP HTTP export with retry logic
- Memory protection (400MB warn, 600MB abort)
- Progress logging for large exports
- Credential sanitization in error messages
- DNS rebinding protection
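The batched export above splits evaluation records into fixed-size chunks (the `batchSize` parameter) before sending each over OTLP HTTP. A sketch of the chunking step only; the helper name is illustrative:

```javascript
// Split a list of records into batches of at most `batchSize` items.
// Mirrors the batchSize parameter above; the function itself is illustrative.
function toBatches(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const batches = toBatches([1, 2, 3, 4, 5], 2); // two full batches plus a remainder
```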
Single-pass LLM evaluation for output quality:
```typescript
import { gEval, qagEvaluate, JudgeCircuitBreaker } from './lib/llm-as-judge.js';
// G-Eval pattern with chain-of-thought
const result = await gEval(testCase, criteria, llmFn);
// QAG faithfulness evaluation
const faithfulness = await qagEvaluate(testCase, llmFn);
// Production circuit breaker
const breaker = new JudgeCircuitBreaker(5, 60000);
const result = await breaker.evaluate(() => gEval(...));
```
Multi-step agent evaluation with trajectory analysis:
```typescript
import {
verifyToolCalls,
aggregateStepScores,
analyzeTrajectory,
collectiveConsensus,
ProceduralJudge,
ReactiveJudge,
} from './lib/agent-as-judge.js';
// Verify tool call correctness
const verifications = verifyToolCalls(actions, expectedTools);
// Analyze agent trajectory efficiency
const metrics = analyzeTrajectory({ actions, expectedSteps: 5 });
// Multi-agent consensus evaluation
const consensus = await collectiveConsensus(judges, { id: 'eval-1' }, {
rounds: 3,
convergenceThreshold: 0.05,
});
// Procedural multi-stage evaluation
const proceduralJudge = new ProceduralJudge([
{ name: 'syntax', evaluate: syntaxChecker },
{ name: 'semantic', evaluate: semanticAnalyzer },
]);
const result = await proceduralJudge.evaluate(evaluand);
// Reactive specialist-based evaluation
const reactiveJudge = new ReactiveJudge(router, specialists, deepDiveSpecialists);
const result = await reactiveJudge.evaluate(evaluand);
```
```javascript
// Filter by agent ID/name
obs_query_evaluations({
agentId: 'agent-123',
agentName: 'TaskRunner',
evaluationName: 'tool_correctness',
})
// Response includes agent-specific fields
{
stepScores: [{ step: 0, score: 0.9, explanation: '...' }],
toolVerifications: [{ toolName: 'search', toolCorrect: true, score: 1.0 }],
trajectoryLength: 5,
}
```
```bash
cd ~/.claude/mcp-servers/observability-toolkit
npm install
npm run build
npm test # 3254 tests
npm run start
```
- docs/changelog/ - Version history and changelogs
- docs/reliability/security.md - Security controls and hardening
- docs/quality/llm-as-judge.md - LLM-as-Judge architecture
- docs/quality/agent-as-judge.md - Agent-as-Judge architecture
- docs/backlog/ - Feature backlog and roadmap
- docs/changelog/SESSION_HISTORY.md - Development session logs
- docs/Summary.md - Full documentation index