Under the hood
15-node parallel RCA. Built for how incidents actually work.
A stateful investigation graph. Parallel evidence collection. Human-gated actions. Every decision logged, traced, and auditable.
How the investigation graph works
ingest
entry point
triage
classify · dedupe · drop noise
dismiss / duplicate → END
synthesiser
Pass 1: primary_role · Pass 2: deep_rca_role if confidence < 0.70
embed_rca
stores RCA in vector store · silent side-effect
human_review
graph_interrupt · approve / reject / re-investigate
create_ticket
Jira ticket · full RCA evidence attached
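The triage exit and the two-pass synthesiser above can be sketched in a few lines. This is a minimal illustration, not Aqweth's implementation: the `RCA` dataclass, `synthesise`, and `route_after_triage` are hypothetical names, and the two role callables stand in for real LLM calls. Only the 0.70 threshold and the dismiss / duplicate exit come from the flow described above.

```python
from dataclasses import dataclass
from typing import Callable

CONFIDENCE_THRESHOLD = 0.70  # from the flow above: Pass 2 runs if confidence < 0.70


@dataclass
class RCA:
    summary: str
    confidence: float
    role: str  # which role produced this result (hypothetical field)


def route_after_triage(verdict: str) -> str:
    """Dismissed or duplicate alerts end the run; anything else is investigated."""
    return "END" if verdict in {"dismiss", "duplicate"} else "investigate"


def synthesise(
    evidence: dict,
    primary: Callable[[dict], RCA],
    deep_rca: Callable[[dict], RCA],
) -> RCA:
    """Run the primary role first; escalate to deep_rca only on low confidence."""
    first = primary(evidence)
    if first.confidence >= CONFIDENCE_THRESHOLD:
        return first
    return deep_rca(evidence)
```

Keeping the escalation rule in one small function means the threshold can be tuned without touching either role.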
What's live today.
Full production support across AWS, GCP, and Kubernetes.
Graph
15-node parallel investigation
Triggers
Reactive + 5 scheduled jobs
Chat
Slack · Google Chat
Block Kit · HMAC · threaded
AWS
CloudWatch · X-Ray · Lambda · S3 · IRSA
GCP
Cloud Logging · Monitoring · Trace · GCS · Build · Workflows
Kubernetes
EKS · GKE · AKS
pod events · restarts · rollouts
Tickets / KB
Jira + Confluence via MCP
Vector
Qdrant · ApertureDB
State
Redis-backed LangGraph checkpoint
investigations survive restarts
CLI
aqweth init · live validation
profiles: aws · gcp · k8s · local
Prompts
Per-deployment customisation in aqweth.yaml
no code changes
Self-observable
OTel traces · Prometheus metrics
Grafana dashboards included
Deploy
Helm + Terraform · IRSA-native
amd64 · arm64
Alerts in
Alertmanager · Grafana · New Relic webhooks
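The Chat entry above lists HMAC. Assuming that refers to verifying Slack's signed requests, the check could look roughly like the sketch below. The signing scheme (a `v0:timestamp:body` base string, HMAC-SHA256, a `v0=` prefix) follows Slack's documented request signing; the function name, parameters, and the five-minute skew window are illustrative choices, not Aqweth's code.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,   # value of the X-Slack-Request-Timestamp header
    body: bytes,      # raw request body, exactly as received
    signature: str,   # value of the X-Slack-Signature header
    now: Optional[float] = None,
    max_skew_s: int = 300,
) -> bool:
    """Return True only for a fresh request whose signature matches."""
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except ValueError:
        return False
    if abs(now - ts) > max_skew_s:  # reject stale or replayed requests
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison on bytes avoids timing leaks.
    return hmac.compare_digest(expected.encode(), signature.encode())
```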
Every signal. In parallel.
Each fetch node queries one category of evidence. All 15 run concurrently.
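Concurrent evidence collection of this kind can be sketched with `asyncio.gather`. This is an illustration under stated assumptions: `fetch_all`, the `Fetcher` type, and the 20-second timeout are hypothetical, and each fetcher stands in for one evidence category. A slow or failing backend yields an error entry instead of aborting the other fetches.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# Hypothetical signature: one async callable per evidence category.
Fetcher = Callable[[str], Awaitable[dict]]


async def fetch_all(
    incident_id: str,
    fetchers: Dict[str, Fetcher],
    timeout_s: float = 20.0,
) -> Dict[str, dict]:
    """Run every fetcher concurrently and collect results by category name."""

    async def guarded(name: str, fetch: Fetcher) -> Tuple[str, dict]:
        try:
            return name, await asyncio.wait_for(fetch(incident_id), timeout_s)
        except Exception as exc:  # one bad backend must not sink the rest
            return name, {"error": repr(exc)}

    pairs = await asyncio.gather(*(guarded(n, f) for n, f in fetchers.items()))
    return dict(pairs)
```

Called as `asyncio.run(fetch_all("inc-1", {"logs": fetch_logs, "traces": fetch_traces}))`, total wall time is bounded by the slowest fetcher or the timeout rather than by the sum of all fetchers.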
Proactive detection. Automatic investigation.
When a scan detects an anomaly, Aqweth fans out 15 fetch nodes and delivers an RCA card to Slack — no manual trigger required.
All intervals are configurable in aqweth.yaml. Scheduled jobs use deterministic math, so routine reports incur no LLM cost.
No black box.
Every decision is logged, traced, and inspectable.
Structured logs
JSON · correlation IDs
Every log line carries incident_id, run_id, node, duration. Middleware stamps a correlation ID across the whole investigation.
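A log line of this shape can be produced with the standard library alone. The sketch below is an assumption about how it might be done, not Aqweth's code: `correlation_id`, `JsonFormatter`, and `make_logger` are hypothetical names, and a context variable stands in for whatever the middleware actually uses.

```python
import contextvars
import json
import logging

# Hypothetical context variable; middleware would set it once per investigation.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the fields named above."""

    FIELDS = ("incident_id", "run_id", "node", "duration")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        for field in self.FIELDS:
            # Fields arrive via logging's `extra=` argument; absent ones become null.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def make_logger(stream=None) -> logging.Logger:
    """Build a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger("aqweth.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A node would then log with `log.info("node finished", extra={"incident_id": ..., "run_id": ..., "node": ..., "duration": ...})`, and the correlation ID is picked up from context without being passed around.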
Distributed tracing
OTel · per-node spans
An aqweth.node.<name> span per graph node. Exportable to any OTLP backend you already run.
Prometheus metrics
cost · latency · by model
Investigation duration, per-node timing, backend error rates, LLM token spend per model. Grafana dashboards included.
Prompt audit trail
snapshot per investigation
Each LangGraph checkpoint stores the exact prompt config that produced its RCA. Auditable record of which prompt produced which conclusion.
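One way to make such a snapshot easy to audit is to store a stable digest next to the prompt config. The sketch below is illustrative only: `prompt_fingerprint`, `snapshot`, and the `prompt_sha256` key are hypothetical, and a plain dict stands in for the checkpoint.

```python
import hashlib
import json


def prompt_fingerprint(prompt_config: dict) -> str:
    """Stable SHA-256 of a prompt config; key order does not affect the digest."""
    canonical = json.dumps(prompt_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(checkpoint: dict, prompt_config: dict) -> dict:
    """Return a copy of the checkpoint carrying the prompt config and its digest."""
    return {
        **checkpoint,
        "prompt_config": prompt_config,
        "prompt_sha256": prompt_fingerprint(prompt_config),
    }
```

Two investigations with the same digest ran under the same prompts, which turns "which prompt produced this conclusion" into a string comparison.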
Per-role LLM assignment.
Every agent role is independently configurable. No code changes.
llm:
  gateway_url: http://litellm.internal:4000
  roles:
    triage: qwen3-4b-instruct
    primary: claude-sonnet-4-6
    deep_rca: claude-opus-4-7
    coder: qwen3-coder-next
    embedder: bge-m3
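To show how a role-to-model mapping like this might be consumed, here is a minimal sketch. The dict mirrors the `llm` section above; `ModelTarget` and `resolve_model` are hypothetical names, and a real deployment would load the values from aqweth.yaml with a YAML parser rather than hard-coding them.

```python
from dataclasses import dataclass

# Mirrors the llm section above as a plain dict (illustration only).
LLM_CONFIG = {
    "gateway_url": "http://litellm.internal:4000",
    "roles": {
        "triage": "qwen3-4b-instruct",
        "primary": "claude-sonnet-4-6",
        "deep_rca": "claude-opus-4-7",
        "coder": "qwen3-coder-next",
        "embedder": "bge-m3",
    },
}


@dataclass(frozen=True)
class ModelTarget:
    gateway_url: str
    model: str


def resolve_model(role: str, config: dict = LLM_CONFIG) -> ModelTarget:
    """Look up the model for a role; unknown roles fail loudly instead of defaulting."""
    roles = config["roles"]
    if role not in roles:
        raise KeyError(f"no model configured for role {role!r}; known roles: {sorted(roles)}")
    return ModelTarget(config["gateway_url"], roles[role])
```

Because every call site asks for a role rather than a model name, swapping the model behind `deep_rca` is a config edit, which matches the "no code changes" claim above.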
One Helm chart. Online in under a day.
Interactive wizard
Live connection validation. Profile presets: aws · gcp · k8s · local
Single config file
Per-role LLM, per-service backends, deployment context.
Own namespace
Deploys into its own namespace. Never touches production namespaces.
First investigation
RCA card returns within ~30s.
aqweth validate re-checks every backend connection before deployment. No "deploy and pray."
Exit is a namespace delete.
You delete a namespace, revoke an IAM role, and you are done.
Let us run an investigation on one of your incidents.
No deployment required. You nominate an incident from your retro doc — we run the analysis together.
Request access
Or email us at hello@aqweth.ai · No commitment