Under the hood

Built for how incidents actually work.

A stateful investigation graph. Parallel evidence collection. Human-gated actions. Every decision logged, traced, and auditable.

How the investigation graph works

Triggers: POST /investigate · Anomaly scanner (APScheduler) · Slash command /rca

ingest: entry point
↓
triage: classify · dedupe · drop noise
    dismiss / duplicate → END
    proceed → 15 fetch nodes in parallel
↓
fetch nodes: fetch_logs · fetch_metrics · fetch_traces · fetch_errors · fetch_tickets · fetch_runbooks · fetch_similar_rcas · fetch_code · fetch_infra · fetch_db · fetch_apm · fetch_pipeline · fetch_workflow · fetch_queue · fetch_serverless
    evidence merged
↓
synthesiser: Pass 1: primary_role · Pass 2: deep_rca_role if confidence < 0.70
↓
embed_rca: stores RCA in vector store · silent side-effect
↓
human_review: graph_interrupt · approve / reject / re-investigate
    on approval only
↓
create_ticket: Jira ticket · full RCA evidence attached
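The flow above can be read as plain control flow. The sketch below is a minimal illustration in ordinary Python, not the product's LangGraph code: the `Investigation` class, `run_investigation`, and every callable passed in are hypothetical names chosen for this example. Only the node names and the 0.70 confidence threshold come from the diagram. The fetch step runs sequentially here for brevity, and the re-investigate branch is left out.

```python
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.70  # from the diagram: Pass 2 runs if confidence < 0.70


@dataclass
class Investigation:
    """Hypothetical state object carried through the graph."""
    alert: str
    evidence: dict = field(default_factory=dict)
    rca: str = ""
    confidence: float = 0.0
    path: list = field(default_factory=list)  # node names visited, in order


def run_investigation(alert, triage, fetchers, primary, deep, review, create_ticket):
    inv = Investigation(alert=alert)
    inv.path += ["ingest", "triage"]
    if triage(alert) != "proceed":
        return inv  # dismiss / duplicate -> END

    # Fetch nodes (sequential in this sketch; the page describes them as parallel).
    inv.path.append("fetch")
    inv.evidence = {name: fetch(alert) for name, fetch in fetchers.items()}

    # Synthesiser: pass 1 always runs, pass 2 only on low confidence.
    inv.path.append("synthesiser")
    inv.rca, inv.confidence = primary(inv.evidence)
    if inv.confidence < CONFIDENCE_THRESHOLD:
        inv.rca, inv.confidence = deep(inv.evidence)

    inv.path.append("embed_rca")  # the vector-store write is omitted here

    # Human gate: the ticket is created on approval only.
    inv.path.append("human_review")
    if review(inv) == "approve":
        inv.path.append("create_ticket")
        create_ticket(inv)
    return inv
```

Passing the nodes in as callables keeps the routing testable without any backend: a stub `review` that returns "reject" should leave the ticket list empty.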

Every signal. In parallel.

Each fetch node queries one category of evidence. All 15 run concurrently.

| Category | Node | Backends | Status |
|---|---|---|---|
| Logs | fetch_logs | Loki, ELK, CloudWatch, Cloud Logging, New Relic | implemented |
| Metrics | fetch_metrics | Prometheus, CloudWatch Metrics, Cloud Monitoring | implemented |
| Tracing | fetch_traces | Grafana Tempo, AWS X-Ray, Cloud Trace | implemented |
| Errors | fetch_errors | Sentry, New Relic APM | implemented |
| Tickets | fetch_tickets | Jira via MCP | implemented |
| Runbooks | fetch_runbooks | Confluence via MCP + vector similarity retrieval | implemented |
| Vector | fetch_similar_rcas | Qdrant, ApertureDB: prior incidents by similarity score | implemented |
| Code | fetch_code | GitHub, GitLab, Bitbucket: commit diff at alert time | implemented |
| Infra | fetch_infra | Kubernetes pod events, OOMKill, CrashLoop, ReplicaSet rollouts | implemented |
| Database | fetch_db | PostgreSQL, RDS/Aurora, MySQL, Neo4j: slow queries, locks | implemented |
| APM | fetch_apm | New Relic application traces | implemented |
| Pipeline | fetch_pipeline | GitHub Actions, GitLab CI, CircleCI, Jenkins, Cloud Build | implemented |
| Workflow | fetch_workflow | AWS Step Functions, Cloud Workflows | implemented |
| Queue | fetch_queue | SQS queue depth and age | implemented |
| Serverless | fetch_serverless | AWS Lambda errors and cold starts | implemented |
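One way to run fetch nodes like those above concurrently is a fan-out with `asyncio.gather`. The sketch below is an assumption about the general shape, not the product's code: `collect_evidence`, the stub fetchers, the `timeout_s` parameter, and the `evidence`/`errors` result layout are all hypothetical. Only three of the fifteen node names are used, with fake backends.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Fetcher = Callable[[str], Awaitable[Any]]


async def collect_evidence(incident_id: str, fetchers: Dict[str, Fetcher],
                           timeout_s: float = 10.0) -> Dict[str, Dict[str, Any]]:
    """Run every fetcher concurrently and merge results keyed by node name."""
    names = list(fetchers)
    results = await asyncio.gather(
        *(asyncio.wait_for(fetchers[name](incident_id), timeout_s) for name in names),
        return_exceptions=True,  # a failing backend must not abort the others
    )
    merged: Dict[str, Dict[str, Any]] = {"evidence": {}, "errors": {}}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            merged["errors"][name] = repr(result)
        else:
            merged["evidence"][name] = result
    return merged


# Stub fetchers standing in for real backends (hypothetical data).
async def fetch_logs(incident_id: str):
    await asyncio.sleep(0.01)
    return [f"{incident_id}: error rate spike"]


async def fetch_metrics(incident_id: str):
    await asyncio.sleep(0.01)
    return {"p99_latency_ms": 1200}


async def fetch_traces(incident_id: str):
    raise ConnectionError("tracing backend unreachable")


if __name__ == "__main__":
    fetchers = {"fetch_logs": fetch_logs, "fetch_metrics": fetch_metrics,
                "fetch_traces": fetch_traces}
    print(asyncio.run(collect_evidence("INC-1", fetchers)))
```

Collecting exceptions instead of raising them means the merged evidence still reaches the next step when one backend is down, and the failure is recorded alongside it.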

Proactive detection. Always running.

| Schedule | Job | What it does | Status |
|---|---|---|---|
| every 5 min | Anomaly scan | z-score + EWMA per service error rate and latency | implemented |
| every 30 min | Correlation sweep | simultaneous anomalies across multiple services | implemented |
| every 4 hours | Health digest | deterministic summary · no LLM cost | implemented |
| daily · 06:00 UTC | Trend analysis | week-on-week regressions, error pattern growth | implemented |
| nightly · 02:00 UTC | Nightly embed | resolved incidents + Confluence pages → vector store | implemented |

All intervals are configurable via aqweth.yaml. Scheduled jobs use deterministic math, so routine reports incur no LLM cost.
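To make "z-score + EWMA" concrete, here is a minimal sketch of one common way to combine the two: keep an exponentially weighted mean and variance per series, and score each new sample against that baseline. The class name `EwmaDetector` and the parameters `alpha`, `z_threshold`, and `warmup` are assumptions for illustration. The page does not state the product's actual formula or defaults.

```python
import math


class EwmaDetector:
    """Flags samples whose z-score against an EWMA baseline exceeds a threshold."""

    def __init__(self, alpha: float = 0.3, z_threshold: float = 3.0, warmup: int = 5):
        self.alpha = alpha              # weight given to the newest sample
        self.z_threshold = z_threshold  # |z| above this counts as anomalous
        self.warmup = warmup            # samples to observe before flagging
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x: float) -> bool:
        """Score x against the current baseline, then fold it in if it looks normal."""
        self.n += 1
        if self.mean is None:
            self.mean = x
            return False
        std = math.sqrt(self.var)
        z = (x - self.mean) / std if std > 0 else 0.0
        anomalous = self.n > self.warmup and abs(z) > self.z_threshold
        if not anomalous:
            # Incremental EWMA mean and variance update; anomalous points are
            # skipped so a spike does not inflate the baseline.
            diff = x - self.mean
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1 - self.alpha) * (self.var + diff * incr)
        return anomalous


if __name__ == "__main__":
    detector = EwmaDetector()
    error_rates = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 9.0]  # made-up values
    print([detector.update(x) for x in error_rates])
```

Because this is plain arithmetic, a scan of this kind can run every few minutes per service without calling a model.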

No black box.

Every decision is logged, traced, and inspectable.

Structured logs

JSON · correlation IDs

Every log line carries incident_id, run_id, node, duration. Middleware stamps a correlation ID across the whole investigation.
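A log line of that shape can be sketched with the standard library alone. The example below is an illustration under assumptions, not the product's middleware: `JsonFormatter`, `build_logger`, and the `correlation_id` context variable are hypothetical names. The four field names are the ones listed above.

```python
import contextvars
import io
import json
import logging

# Set once per investigation (for example by request middleware); every log
# call in the same context then picks it up automatically.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    FIELDS = ("incident_id", "run_id", "node", "duration")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        for name in self.FIELDS:
            # Values passed via `extra=` become attributes on the record.
            payload[name] = getattr(record, name, None)
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    out = io.StringIO()
    log = build_logger("aqweth.sketch", out)
    correlation_id.set("corr-123")
    log.info("node finished", extra={"incident_id": "INC-1", "run_id": "run-9",
                                     "node": "fetch_logs", "duration": 0.42})
    print(out.getvalue())
```

Using a context variable rather than a function argument means nodes do not have to thread the ID through every call.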

Distributed tracing

OTel · per-node spans

An aqweth.node.<name> span per graph node. Exportable to any OTLP backend you already run.

Prometheus metrics

cost · latency · by model

Investigation duration, per-node timing, backend error rates, LLM token spend per model. Grafana dashboards included.

Prompt audit trail

snapshot per investigation

Each LangGraph checkpoint stores the exact prompt config that produced its RCA. Auditable record of which prompt produced which conclusion.
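As a rough illustration of what such a snapshot could contain, the sketch below stores a canonical copy of the prompt config together with a SHA-256 digest, so two investigations can be compared by hash. This is an assumption about the idea, not the product's checkpoint format: `snapshot_prompt_config` and the config keys in the example are hypothetical.

```python
import hashlib
import json


def snapshot_prompt_config(prompt_config: dict) -> dict:
    """Return a canonical copy of the config plus a digest that identifies it."""
    # Sorted keys and fixed separators make the serialisation order-independent.
    canonical = json.dumps(prompt_config, sort_keys=True, separators=(",", ":"))
    return {
        "prompt_config": json.loads(canonical),
        "prompt_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


if __name__ == "__main__":
    # Hypothetical config keys, for illustration only.
    snap = snapshot_prompt_config({"role": "primary", "template": "rca_v3", "temperature": 0.2})
    print(snap["prompt_sha256"])
```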

Per-role LLM assignment.

Every agent role is independently configurable. No code changes.

| Role | Default | Example production |
|---|---|---|
| Primary reasoning / RCA | qwen3:4b | claude-sonnet-4-6 · gpt-4o |
| Deep RCA (thinking mode) | qwen3:4b | claude-opus-4-7 · Qwen3-30B self-hosted |
| Fast triage / classification | qwen3:4b | Qwen3-4B self-hosted · gpt-4o-mini |
| Code understanding | qwen3:4b | Qwen3-Coder self-hosted · claude-sonnet-4-6 |
| Embeddings | nomic-embed-text | bge-m3 self-hosted · text-embedding-3-small |
aqweth.yaml
llm:
  gateway_url: http://litellm.internal:4000
  roles:
    triage:    qwen3-4b-instruct
    primary:   claude-sonnet-4-6
    deep_rca:  claude-opus-4-7
    coder:     qwen3-coder-next
    embedder:  bge-m3
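The config above implies a simple lookup: each role resolves to the model named in `llm.roles`, falling back to a default when the role is not overridden. The sketch below illustrates that idea only. `resolve_roles` and `DEFAULT_ROLES` are hypothetical names, the config is written as a Python dict to avoid a YAML dependency, and the mapping of the five role keys to the defaults in the table is an assumption.

```python
# Assumed defaults, taken from the "Default" column of the role table.
DEFAULT_ROLES = {
    "triage": "qwen3:4b",
    "primary": "qwen3:4b",
    "deep_rca": "qwen3:4b",
    "coder": "qwen3:4b",
    "embedder": "nomic-embed-text",
}


def resolve_roles(config: dict) -> dict:
    """Merge per-role overrides from the config over the defaults."""
    overrides = (config.get("llm") or {}).get("roles") or {}
    unknown = set(overrides) - set(DEFAULT_ROLES)
    if unknown:
        # Fail early on typos such as "deep-rca" instead of silently ignoring them.
        raise ValueError(f"unknown LLM roles: {sorted(unknown)}")
    return {**DEFAULT_ROLES, **overrides}


if __name__ == "__main__":
    config = {
        "llm": {
            "gateway_url": "http://litellm.internal:4000",
            "roles": {"primary": "claude-sonnet-4-6", "deep_rca": "claude-opus-4-7"},
        }
    }
    print(resolve_roles(config))
```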

One Helm chart. Online in under a day.

1. aqweth init: Interactive wizard. Live connection validation. Profile presets: aws · gcp · k8s · local

2. aqweth.yaml + .env: Single config file. Per-role LLM, per-service backends, deployment context.

3. helm install: Own namespace. Deploys into its own namespace. Never touches production namespaces.

4. /rca: First investigation. RCA card returns within ~30s.

aqweth validate re-checks every backend connection before deployment. No "deploy and pray."

Exit is a namespace delete.

You delete a namespace, revoke an IAM role, and you are done.

No proprietary data format · Standard Helm + Terraform · Your data stays in your infra

Let us run an investigation on one of your incidents.

No deployment required. You nominate an incident from your retro doc — we run the analysis together.

Request access

Or email us at hello@aqweth.ai · No commitment