Under the hood

15-node parallel RCA. Built for how incidents actually work.

A stateful investigation graph. Parallel evidence collection. Human-gated actions. Every decision logged, traced, and auditable.

How the investigation graph works

POST /investigate
Anomaly scanner (APScheduler)
Slash command /rca

ingest

entry point

triage

classify · dedupe · drop noise

dismiss / duplicate → END

proceed
15 fetch nodes parallel
fetch_logs fetch_metrics fetch_traces fetch_errors fetch_tickets fetch_runbooks fetch_similar_rcas fetch_code fetch_infra fetch_db fetch_apm fetch_pipeline fetch_workflow fetch_queue fetch_serverless
evidence merged

synthesiser

Pass 1: primary_role · Pass 2: deep_rca_role if confidence < 0.70

embed_rca

stores RCA in vector store · silent side-effect

human_review

graph_interrupt · approve / reject / re-investigate

on approval only

create_ticket

Jira ticket · full RCA evidence attached

What's live today.

Full production support across AWS, GCP, and Kubernetes.

Graph

15-node parallel investigation

Triggers

Reactive + 5 scheduled jobs

Chat

Slack · Google Chat

Block Kit · HMAC · threaded

AWS

CloudWatch · X-Ray · Lambda · S3 · IRSA

GCP

Cloud Logging · Monitoring · Trace · GCS · Build · Workflows

Kubernetes

EKS · GKE · AKS

pod events · restarts · rollouts

Tickets / KB

Jira + Confluence via MCP

Vector

Qdrant · ApertureDB

State

Redis-backed LangGraph checkpoint

investigations survive restarts

CLI

aqweth init · live validation

profiles: aws · gcp · k8s · local

Prompts

Per-deployment customisation in aqweth.yaml

no code changes

Self-observable

OTel traces · Prometheus metrics

Grafana dashboards included

Deploy

Helm + Terraform · IRSA-native

amd64 · arm64

Alerts in

Alertmanager · Grafana · New Relic webhooks

Every signal. In parallel.

Each fetch node queries one category of evidence. All 15 run concurrently.

Logs fetch_logs Loki, ELK, CloudWatch, Cloud Logging, New Relic
Metrics fetch_metrics Prometheus, CloudWatch Metrics, Cloud Monitoring
Tracing fetch_traces Grafana Tempo, AWS X-Ray, Cloud Trace
Errors fetch_errors Sentry, New Relic APM
Tickets fetch_tickets Jira via MCP
Runbooks fetch_runbooks Confluence via MCP + vector similarity retrieval
Vector fetch_similar_rcas Qdrant, ApertureDB — prior incidents by similarity score
Code fetch_code GitHub, GitLab, Bitbucket — commit diff at alert time
Infra fetch_infra Kubernetes pod events, OOMKill, CrashLoop, ReplicaSet rollouts
Database fetch_db PostgreSQL, RDS/Aurora, MySQL, Neo4j — slow queries, locks
APM fetch_apm New Relic application traces
Pipeline fetch_pipeline GitHub Actions, GitLab CI, CircleCI, Jenkins, Cloud Build
Workflow fetch_workflow AWS Step Functions, Cloud Workflows
Queue fetch_queue SQS queue depth and age
Serverless fetch_serverless AWS Lambda errors and cold starts

Proactive detection. Automatic investigation.

When a scan detects an anomaly, Aqweth fans out 15 fetch nodes and delivers an RCA card to Slack — no manual trigger required.

every 5 min
Anomaly scan z-score + EWMA per service error rate and latency
every 30 min
Correlation sweep simultaneous anomalies across multiple services
every 4 hours
Health digest deterministic summary · no LLM cost
daily · 06:00 UTC
Trend analysis week-on-week regressions, error pattern growth
nightly · 02:00 UTC
Nightly embed resolved incidents + Confluence pages → vector store

All intervals configurable via aqweth.yaml. Scheduled jobs use deterministic math — no LLM cost on routine reports.

No black box.

Every decision is logged, traced, and inspectable.

Structured logs

JSON · correlation IDs

Every log line carries incident_id, run_id, node, duration. Middleware stamps a correlation ID across the whole investigation.

Distributed tracing

OTel · per-node spans

An aqweth.node.<name> span per graph node. Exportable to any OTLP backend you already run.

Prometheus metrics

cost · latency · by model

Investigation duration, per-node timing, backend error rates, LLM token spend per model. Grafana dashboards included.

Prompt audit trail

snapshot per investigation

Each LangGraph checkpoint stores the exact prompt config that produced its RCA. Auditable record of which prompt produced which conclusion.

Per-role LLM assignment.

Every agent role is independently configurable. No code changes.

Primary reasoning / RCA qwen3:4b
claude-sonnet-4-6 · gpt-4o
Deep RCA (thinking mode) qwen3:4b
claude-opus-4-7 · Qwen3-30B self-hosted
Fast triage / classification qwen3:4b
Qwen3-4B self-hosted · gpt-4o-mini
Code understanding qwen3:4b
Qwen3-Coder self-hosted · claude-sonnet-4-6
Embeddings nomic-embed-text
bge-m3 self-hosted · text-embedding-3-small
aqweth.yaml yaml
llm:
  gateway_url: http://litellm.internal:4000
  roles:
    triage:    qwen3-4b-instruct
    primary:   claude-sonnet-4-6
    deep_rca:  claude-opus-4-7
    coder:     qwen3-coder-next
    embedder:  bge-m3

One Helm chart. Online in under a day.

1
aqweth init

Interactive wizard

Live connection validation. Profile presets: aws · gcp · k8s · local

2
aqweth.yaml + .env

Single config file

Per-role LLM, per-service backends, deployment context.

3
helm install

Own namespace

Deploys into its own namespace. Never touches production namespaces.

4
/rca

First investigation

RCA card returns within ~30s.

aqweth validate re-checks every backend connection before deployment. No "deploy and pray."

Exit is a namespace delete.

You delete a namespace, revoke an IAM role, and you are done.

no proprietary data format standard Helm + Terraform your data stays in your infra

Let us run an investigation on one of your incidents.

No deployment required. You nominate an incident from your retro doc — we run the analysis together.

Request access

Or email us at hello@aqweth.ai · No commitment