Production RCA · Autonomous

Aqweth sees your production so your engineers don't have to.

Agentic AI for autonomous root-cause analysis. Up to 15 fetch nodes investigate logs, metrics, traces, deploys, and code in parallel, then report back in plain language within seconds.

15 nodes · parallel
any LLM · any stack
read-only · never mutates
aqweth
[14:32:04] jm /rca payments-api 5xx spike
[14:32:06] aq Investigating inc-2104 across 15 sources…
[14:32:09] aq fetch_logs — 47 ERRORs in 30min
[14:32:09] aq fetch_metrics — p99 2.3s (baseline 180ms)
[14:32:10] aq fetch_pipeline — deploy v2.4.1 at 14:28
RCA · inc-2104 · confidence 91% live

Root cause

Null check removed in PaymentProcessor.validate() in commit a3f92b (deploy 14:28).

Suggested fix

Revert to v2.4.0 or restore the null guard on line 142.

Production incidents cost more than downtime.

Two structural problems compound every incident.

problem · 01

Access walls

Engineers paged at 3 AM spend the first 20 minutes or more getting onto the VPN, requesting elevated access, and waiting for approval. By the time they reach the logs, the critical window has passed.

03:14 Paged
03:18 Responder hits access wall
03:22 Escalates to prod-on-call
03:41 Investigation begins

problem · 02

Tool fragmentation

RCA means manually cross-referencing five different systems with no shared timeline. Each tool has its own auth flow, query syntax, and data model, and engineers juggle them all under pressure, in the middle of the night.

logs metrics traces deploys code

One investigation. Fifteen sources. Seconds.

Triage runs first, so noise is dismissed before a single LLM token is spent.

1

/rca · alert · proactive

Trigger

Invoke with /rca in Slack, connect Aqweth to your alerting pipeline, or let it run proactive scans on a schedule. Any alert format, any channel.

2

Dedupe + classify

Triage

Signal is separated from noise before a single LLM token is spent. Duplicate alerts are merged, severity is classified, and irrelevant signals are dropped.
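
A minimal sketch of what deterministic triage can look like, in Python. The `Alert` shape, the `(service, name)` fingerprint, the severity ranking, and the rule that resolved and info-level alerts count as noise are all illustrative assumptions, not Aqweth's actual rules; the point is that deduplication and classification need no model calls.

```python
from dataclasses import dataclass

# Hypothetical alert shape; real payloads (Alertmanager, CloudWatch, ...) differ.
@dataclass
class Alert:
    service: str
    name: str
    severity: str           # "info" | "warning" | "critical"
    status: str = "firing"  # "firing" | "resolved"

# Assumed severity ordering used for classification.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}

@dataclass
class Signal:
    service: str
    name: str
    severity: str
    count: int = 1  # how many raw alerts were merged into this signal

def triage(alerts: list[Alert]) -> list[Signal]:
    """Merge duplicates, keep the worst severity, drop noise. No LLM involved."""
    merged: dict[tuple[str, str], Signal] = {}
    for a in alerts:
        # Assumed noise rule: resolved alerts and info-level chatter are dropped.
        if a.status != "firing" or a.severity == "info":
            continue
        key = (a.service, a.name)  # hypothetical dedup fingerprint
        if key not in merged:
            merged[key] = Signal(a.service, a.name, a.severity)
            continue
        sig = merged[key]
        sig.count += 1
        if SEVERITY_RANK[a.severity] > SEVERITY_RANK[sig.severity]:
            sig.severity = a.severity  # escalate to the worst severity seen
    # Most severe first, then the noisiest.
    return sorted(merged.values(), key=lambda s: (-SEVERITY_RANK[s.severity], -s.count))

alerts = [
    Alert("payments-api", "High5xxRate", "warning"),
    Alert("payments-api", "High5xxRate", "critical"),
    Alert("search", "DiskUsage", "info"),
    Alert("auth", "LatencyHigh", "warning", status="resolved"),
]
signals = triage(alerts)
```

Four raw alerts collapse into one critical `payments-api` signal with a count of 2; only signals like this would move on to the fan-out step.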

3

15 fetch nodes · parallel

Fan-out

Up to 15 fetch nodes execute in parallel, each querying a different backend. Slow or offline backends time out gracefully while the rest continue.

logs metrics traces errors tickets runbooks similar_rcas code infra db apm pipeline workflow queue serverless

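
The graceful-timeout behaviour can be sketched with Python's `asyncio`: every node runs concurrently under its own `asyncio.wait_for` deadline, and a timeout or connection error becomes a recorded gap instead of failing the investigation. `fake_fetch`, the delays, and the 0.2 s budget are hypothetical stand-ins for real backend queries.

```python
import asyncio
from typing import Dict, Optional, Tuple

# Hypothetical fetch node: sleeps to simulate backend latency, optionally fails.
async def fake_fetch(name: str, delay: float, offline: bool = False) -> str:
    await asyncio.sleep(delay)
    if offline:
        raise ConnectionError(f"{name} unreachable")
    return f"{name}: evidence"

async def run_node(name: str, delay: float, offline: bool,
                   timeout: float) -> Tuple[str, Optional[str]]:
    try:
        # Each node gets its own deadline; a slow backend cannot stall the others.
        return name, await asyncio.wait_for(fake_fetch(name, delay, offline), timeout)
    except (asyncio.TimeoutError, ConnectionError):
        return name, None  # record a gap and let the remaining nodes continue

async def fan_out(nodes: Dict[str, Tuple[float, bool]],
                  timeout: float) -> Dict[str, Optional[str]]:
    results = await asyncio.gather(
        *(run_node(name, delay, offline, timeout)
          for name, (delay, offline) in nodes.items())
    )
    return dict(results)

NODES = {
    "logs": (0.01, False),
    "metrics": (0.02, False),
    "traces": (2.0, False),  # too slow: exceeds the budget
    "db": (0.01, True),      # offline: raises ConnectionError
}
results = asyncio.run(fan_out(NODES, timeout=0.2))
```

The whole run takes roughly the timeout budget (0.2 s here), not the 2 s of the slowest backend, and `traces` and `db` come back as gaps.
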
4

RCA card → chat

Synthesise

All evidence is assembled into a structured RCA card with a confidence score, root cause, and suggested fix, then streamed directly to the Slack thread that triggered the investigation.

Aqweth recommends. Your engineers act.

The only write action Aqweth can take is opening a Jira ticket, and only on explicit approval.

no rollbacks · no restarts · no config changes

No automation, no surprises, no "AI rolled back the deploy while you slept."

RCA card posted

in Slack / Google Chat

Engineer reviews

evidence, confidence, fix

Approve or reject

human_review interrupt

on approval only ↓

Jira ticket opened

with full RCA evidence attached
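
The approve-only write path can be expressed as a small gate. This is a sketch under assumptions: `RCACard`, `review_gate`, and the injected `ask_human` and `open_ticket` callables are hypothetical names, and `fake_jira` stands in for a real Jira client rather than showing its API. The design choice is that anything other than an explicit `True` counts as a rejection, so no ticket is ever opened by default.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass(frozen=True)
class RCACard:
    incident: str
    root_cause: str
    suggested_fix: str
    confidence: float               # 0.0 to 1.0
    evidence: Tuple[str, ...] = ()

def review_gate(
    card: RCACard,
    ask_human: Callable[[RCACard], bool],
    open_ticket: Callable[[RCACard], str],
) -> Optional[str]:
    """The only write path: open a ticket if, and only if, a human approves."""
    decision = ask_human(card)  # in practice this would wait on the engineer
    if decision is not True:    # rejection, timeout, or anything ambiguous
        return None
    return open_ticket(card)    # the full card, evidence included, goes along

card = RCACard(
    incident="inc-2104",
    root_cause="Null check removed in PaymentProcessor.validate() in commit a3f92b",
    suggested_fix="Revert to v2.4.0 or restore the null guard on line 142",
    confidence=0.91,
    evidence=("fetch_logs: 47 ERRORs in 30min",
              "fetch_metrics: p99 2.3s (baseline 180ms)"),
)

filed: List[str] = []

def fake_jira(c: RCACard) -> str:  # hypothetical stand-in for a ticketing client
    filed.append(c.incident)
    return "OPS-1"                 # hypothetical ticket key

rejected = review_gate(card, lambda c: False, fake_jira)
approved = review_gate(card, lambda c: True, fake_jira)
```

The rejected review opens nothing; only the approved one reaches the ticketing stand-in.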

Fits the stack you already have.

Switching backends is one line in aqweth.yaml. No code. No rebuild.

Available today
Logs
Loki · ELK · CloudWatch · Cloud Logging · New Relic
Metrics
Prometheus · CloudWatch Metrics · Cloud Monitoring
Tracing
Grafana Tempo · AWS X-Ray · Cloud Trace
Errors / APM
Sentry · New Relic
CI/CD
GitHub Actions · GitLab CI · CircleCI · Jenkins · Cloud Build · Cloud Workflows
Infra
Kubernetes (EKS · GKE · AKS)
Database
PostgreSQL · RDS/Aurora · MySQL · Neo4j
Serverless
AWS Lambda · API Gateway
Queues
SQS
Code
GitHub · GitLab · Bitbucket
Tickets
Jira · Confluence
Notifications
Slack · Google Chat
Vector store
Qdrant · ApertureDB
LLM
Claude · OpenAI · Google · vLLM · Ollama
Coming soon
Logs
VictoriaLogs · SumoLogic · Azure Monitor · Splunk
Metrics
VictoriaMetrics · Azure Monitor
Tracing
Jaeger · Zipkin · Azure AppInsights
APM
Honeycomb · AppDynamics
Errors
Rollbar · Bugsnag
CI/CD
ArgoCD · Buildkite · Concourse
Database
MongoDB · Cassandra · Redis
Queues
Kafka · RabbitMQ · Azure Service Bus · Google Pub/Sub
Code
Azure DevOps
Tickets
Linear · ServiceNow · GitHub Issues
Knowledge
Notion · GitBook · Outline · Bookstack
Notifications
Microsoft Teams · Discord · PagerDuty · OpsGenie
Vector store
pgvector · Weaviate · Milvus · Chroma
LLM
AWS Bedrock · Vertex AI
Cloud
Azure

Data residency on your terms.

Run all inference in your cluster. Or use cloud APIs. Or mix both. One config file either way.

Cloud API

+ No GPU footprint
+ Frontier models out of the box
− Prod context leaves your network
primary_role: claude-opus-4-7

Self-hosted
air-gap compatible

+ Data never leaves your cluster
+ Works offline
− GPU infra required
primary_role: vllm/qwen3-30b-a3b

Or mix both: run the embedder and triage self-hosted, and send reasoning to a cloud API. One YAML line per role.

Always on. Not just when alerts fire.

Aqweth catches degradation before it crosses alert thresholds.

every 5 min

Anomaly scan

z-score + EWMA on error rates and latency per service.
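
One common way to combine the two techniques: keep an exponentially weighted moving average (EWMA) of the mean and variance per series, and score each new point by its z-score against the baseline built from earlier points. Everything below is an illustrative assumption rather than Aqweth's implementation, including `alpha`, the warm-up length, the 3.0 threshold, and the variance floor (5% of the baseline mean), which keeps a very quiet series from flagging tiny wiggles.

```python
import math
from typing import List, Sequence, Tuple

def ewma_zscores(series: Sequence[float], alpha: float = 0.3,
                 warmup: int = 5, rel_floor: float = 0.05) -> List[Tuple[int, float]]:
    """Score each point against an EWMA mean/variance built only from earlier points."""
    mean, var = float(series[0]), 0.0
    scores = []
    for i, x in enumerate(series[1:], start=1):
        # Floor the std so near-constant series do not divide by ~0 (assumed tweak).
        std = max(math.sqrt(var), rel_floor * abs(mean))
        if i >= warmup and std > 0:
            scores.append((i, (x - mean) / std))
        # Incremental exponentially weighted mean and variance update.
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return scores

def anomalies(series: Sequence[float], threshold: float = 3.0) -> List[int]:
    return [i for i, z in ewma_zscores(series) if abs(z) > threshold]

# p99 latency in ms per scan; the last point mirrors the 2.3s spike from the demo.
p99 = [180, 182, 179, 181, 180, 183, 178, 181, 2300]
flagged = anomalies(p99)
```

Only index 8, the 2300 ms point, is flagged; the 178 to 183 ms jitter stays well under the threshold.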

every 30 min

Correlation sweep

Multi-service degradation within a time window.
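
A correlation sweep can start as simple time-window grouping over per-service anomaly events. This hypothetical sketch sorts events by time, chains together events whose gaps are at most `window` apart, and reports a group only when at least `min_services` distinct services degraded in it. The event tuples, the 10-minute window, and the service names other than `payments-api` are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Event = Tuple[str, datetime]  # (service, time the anomaly was detected)

def correlate(events: Iterable[Event], window: timedelta,
              min_services: int = 2) -> List[Tuple[datetime, datetime, List[str]]]:
    """Group time-adjacent anomalies and keep groups spanning several services."""
    ordered = sorted(events, key=lambda e: e[1])
    groups: List[List[Event]] = []
    for svc, ts in ordered:
        # Extend the current group if the gap is within the window, else start anew.
        if groups and ts - groups[-1][-1][1] <= window:
            groups[-1].append((svc, ts))
        else:
            groups.append([(svc, ts)])
    result = []
    for g in groups:
        services = sorted({svc for svc, _ in g})
        if len(services) >= min_services:
            result.append((g[0][1], g[-1][1], services))
    return result

t0 = datetime(2024, 1, 1, 14, 28)
events = [
    ("payments-api", t0 + timedelta(minutes=1)),
    ("checkout", t0 + timedelta(minutes=3)),
    ("ledger", t0 + timedelta(minutes=6)),
    ("search", t0 + timedelta(minutes=60)),  # isolated, single service
]
incidents = correlate(events, window=timedelta(minutes=10))
```

Chaining on gaps means a long, slow cascade can form one group wider than `window`; a fixed sliding window is the stricter alternative.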

every 4 hours

Health digest

Deterministic summary posted to the SRE channel.

daily · 06:00 UTC

Trend report

Week-on-week regressions, no LLM cost.
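
Because the trend report is deterministic, it can be plain arithmetic. A minimal sketch, assuming per-service daily p99 samples for the current and previous week and a hypothetical 20% regression threshold: compare the weekly means and flag services whose latency rose by more than the threshold. Services with no baseline are skipped rather than guessed at.

```python
from statistics import mean
from typing import Dict, List

def week_on_week(this_week: Dict[str, List[float]],
                 last_week: Dict[str, List[float]],
                 threshold: float = 0.20) -> Dict[str, float]:
    """Return {service: fractional increase} for services that regressed past threshold."""
    regressions = {}
    for svc, values in this_week.items():
        prev = last_week.get(svc)
        if not values or not prev:
            continue  # no baseline to compare against
        before, after = mean(prev), mean(values)
        if before > 0:
            change = (after - before) / before
            if change > threshold:
                regressions[svc] = round(change, 3)
    return regressions

report = week_on_week(
    this_week={"payments-api": [240, 250, 260], "search": [100, 100, 100], "auth": [50]},
    last_week={"payments-api": [200, 200, 200], "search": [100, 95, 105]},
)
```

Here `payments-api` is flagged at +25% (250 ms vs 200 ms average), `search` is flat, and `auth` is skipped for lack of last week's data.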

nightly · 02:00 UTC

Nightly embed

Resolved incidents + runbooks → vector store.

Let us run an investigation on one of your incidents.

No deployment required. You nominate an incident from your retro doc, and we run the analysis together.

Request access

Or email us at hello@aqweth.ai · No commitment