Agentic SRE · Autonomous RCA · AIOps

Aqweth sees your production so your on-call engineers don't have to.

// ˈAK-WETH // NOUN · PRODUCTION RELIABILITY

An AI SRE agent for autonomous root-cause analysis. 15 parallel fetch nodes investigate across logs, metrics, traces, deploys, and code — and report back to on-call engineers in plain language within seconds.

30–90 minutes of every incident goes to gathering evidence, not fixing it.

Aqweth automates the investigation: AI-powered RCA that cuts time to diagnosis from 30–90 minutes to seconds.

Mean time to diagnosis

30–90

minutes

Engineering cost per incident

2–6

eng-hours

Who gets paged

1

engineer

Incident window
Investigation · 38–80 min
12m
With Aqweth
~30s
12m
~70m freed

What you're correlating, by hand, at 03:14 a.m.

Grafana Loki Prometheus Sentry Jira GitLab recent deploys runbooks (Confluence) last week's similar inc K8s pod events

Production incidents cost more than downtime.

Three structural problems compound every incident response.

problem · 01

Silent degradation

Error rates climb, latency drifts — gradually, beneath alert thresholds. No single metric trips the wire. By the time a page fires, users have already been impacted for minutes and the clearest evidence has started to decay.

02:30 Error rate begins climbing
03:01 First user reports
03:08 Latency threshold crossed
03:14 Alert fires → paged

problem · 02

Access walls

Engineers are paged at 3 AM and spend the first 20 to 30 minutes navigating VPN, requesting elevated access, and waiting for approval. By the time they reach the logs, the critical window has passed.

03:14 Paged
03:18 Responder hits access wall
03:22 Escalates to prod-on-call
03:41 Investigation begins

problem · 03

Tool fragmentation

RCA means manually cross-referencing five different systems with no shared timeline. Every tool has a different auth flow, a different query syntax, and a different data model — all under pressure, in the middle of the night.

logs · metrics · traces · deploys · code

One investigation. Fifteen sources. Seconds.

Triage runs first — noise dismissed before a single LLM token is spent.

1

/rca · alert · proactive

Trigger

Invoke with /rca in Slack, connect to your alerting pipeline, or let Aqweth run proactive scans on schedule. Any alert format, any channel.

2

Dedupe + classify

Triage

Signal is separated from noise before a single LLM token is spent. Duplicate alerts are merged, severity is classified, irrelevant signals are dropped.
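A deterministic triage pass of this kind could look roughly like the sketch below. It assumes alerts arrive as dictionaries with `service`, `alertname`, and `severity` keys; those field names, the fingerprint rule, and the severity ordering are illustrative assumptions, not Aqweth's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical severity ordering; a higher number means more urgent.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def rank(severity: str) -> int:
    """Unknown severities are treated as the lowest level."""
    return SEVERITY_RANK.get(severity, 0)


@dataclass
class TriagedAlert:
    service: str
    alertname: str
    severity: str
    count: int  # number of raw alerts merged into this entry


def triage(alerts: list[dict], min_severity: str = "warning") -> list[TriagedAlert]:
    """Merge duplicates, keep the highest severity seen, drop low-severity noise."""
    merged: dict[tuple[str, str], TriagedAlert] = {}
    for alert in alerts:
        key = (alert["service"], alert["alertname"])  # assumed fingerprint
        severity = alert.get("severity", "info")
        if key not in merged:
            merged[key] = TriagedAlert(key[0], key[1], severity, 1)
            continue
        entry = merged[key]
        entry.count += 1
        if rank(severity) > rank(entry.severity):
            entry.severity = severity
    kept = [a for a in merged.values() if rank(a.severity) >= rank(min_severity)]
    # Most urgent first; ties broken by how often the alert fired.
    return sorted(kept, key=lambda a: (-rank(a.severity), -a.count))
```

Nothing here calls a model, which is the point of running triage first: only what survives this pass would go on to the fetch nodes.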

3

15 fetch nodes · parallel

Fan-out

Up to 15 fetch nodes execute in parallel, each querying a different backend. Slow or offline backends time out gracefully — the rest continue.

logs · metrics · traces · errors · tickets · runbooks · similar_rcas · code · infra · db · apm · pipeline · workflow · queue · serverless
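The fan-out behaviour described above (parallel queries, with slow or offline backends timing out while the rest continue) can be sketched with `asyncio`. This is a minimal illustration, not Aqweth's implementation: the `fan_out` function, the `Fetcher` type, and the shape of the result stubs are assumptions.

```python
import asyncio
from typing import Awaitable, Callable

# A fetcher is any zero-argument coroutine function that returns a dict.
Fetcher = Callable[[], Awaitable[dict]]


async def fan_out(fetchers: dict[str, Fetcher], timeout_s: float = 5.0) -> dict[str, dict]:
    """Run every fetcher concurrently; a slow or failing one yields a stub result."""

    async def guarded(name: str, fetch: Fetcher) -> tuple[str, dict]:
        try:
            return name, await asyncio.wait_for(fetch(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return name, {"status": "timeout"}
        except Exception as exc:  # backend offline, auth failure, and so on
            return name, {"status": "error", "detail": str(exc)}

    pairs = await asyncio.gather(*(guarded(n, f) for n, f in fetchers.items()))
    return dict(pairs)


async def demo() -> dict[str, dict]:
    async def fast() -> dict:
        return {"status": "ok", "lines": 47}

    async def slow() -> dict:
        await asyncio.sleep(1.0)  # simulates an unresponsive backend
        return {"status": "ok"}

    return await fan_out({"logs": fast, "traces": slow}, timeout_s=0.05)
```

Because each fetcher is wrapped individually, one timeout never cancels its siblings; the synthesis step simply sees a `timeout` stub for that source.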

4

RCA card → chat

Synthesise

All evidence is assembled into a structured RCA card with a confidence score, root cause, and suggested fix, then streamed directly to the Slack thread that triggered the investigation.

The deliverable

A cited RCA card. In the chat platform you already use.

Hypothesis with citations. Confidence score. Ranked fixes. Approve or reject — never auto-applied.

  1. Every claim is cited to the raw evidence: log line, span ID, deploy SHA, ticket. No hallucinated conclusions.

  2. Below 0.70 confidence, Aqweth automatically escalates to deep reasoning and surfaces uncertainty explicitly on the card.

  3. Slack and Google Chat shipping today. Microsoft Teams on the roadmap.
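The 0.70 escalation rule in the second point can be sketched as a two-pass function. The threshold comes from the text; the `Hypothesis` type, the `fast` and `deep` callables, and the `uncertain` flag are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

ESCALATION_THRESHOLD = 0.70  # from the text above


@dataclass
class Hypothesis:
    root_cause: str
    confidence: float
    uncertain: bool = False  # surfaced on the card when still below threshold


def synthesise(
    evidence: Any,
    fast: Callable[[Any], Hypothesis],
    deep: Callable[[Any], Hypothesis],
) -> Hypothesis:
    """Try the cheap pass first; escalate to deep reasoning only when needed."""
    first = fast(evidence)
    if first.confidence >= ESCALATION_THRESHOLD:
        return first
    second = deep(evidence)
    # If deep reasoning is still unsure, flag it rather than hide it.
    second.uncertain = second.confidence < ESCALATION_THRESHOLD
    return second
```

Under this sketch, a result that stays below the threshold after escalation is still returned, just marked as uncertain, which matches the idea of surfacing uncertainty rather than suppressing the card.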

CRITICAL inc-2104 confidence 0.84

payments-api 5xx surge after deploy a3f1c92

Root cause

a3f1c92 removed the null check in PaymentProcessor.validate() at line 142. Every transaction since deploy 14:28 fails validation.

Evidence · 4 sources

· Error rate jumped 14× within 90s of rollout.
  prometheus · payments_5xx_total · 14:30:12 → 14:31:42

· 47 ERRORs in 30 min, NullPointerException at line 142.
  loki · payments-api · trace 9d4ba2e1

· Removed null check in PaymentProcessor.validate() in commit a3f1c92.
  github · PR #4417 · merged 14:28 UTC

· Similar failure resolved in inc-1879 (similarity 0.91); 4 prior matches.
  fetch_similar_rcas · vector store · embed_role bge-m3

Confidence

0.84

Suggested fixes · ranked

01 · Revert payments-api to a3f0d4b, restoring the null guard. Runbook: payments-validation. MTTR < 2 min

02 · Add null guard at line 142 in PaymentProcessor.validate(): forward-fix, no rollback required. MTTR ~5 min
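One way to picture the sample card above as data is the sketch below. The class and field names are assumptions made for illustration (the real card schema is not shown here); the values are taken from the sample card. The `uncited` check shows how a "every claim is cited" rule could be enforced before a card is posted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    claim: str
    source: str     # backend that produced the evidence, e.g. "loki"
    reference: str  # pointer into that backend, e.g. a trace ID or PR number


@dataclass
class RcaCard:
    incident_id: str
    severity: str
    confidence: float
    root_cause: str
    evidence: list[Evidence] = field(default_factory=list)
    fixes: list[str] = field(default_factory=list)  # ranked, best first

    def uncited(self) -> list[Evidence]:
        """Return evidence items that lack a source or a reference."""
        return [e for e in self.evidence if not (e.source.strip() and e.reference.strip())]


card = RcaCard(
    incident_id="inc-2104",
    severity="CRITICAL",
    confidence=0.84,
    root_cause="a3f1c92 removed the null check in PaymentProcessor.validate()",
    evidence=[
        Evidence("Error rate jumped 14x within 90s of rollout", "prometheus", "payments_5xx_total"),
        Evidence("NullPointerException at line 142", "loki", "trace 9d4ba2e1"),
        Evidence("Null check removed in commit a3f1c92", "github", "PR #4417"),
    ],
    fixes=["Revert payments-api to a3f0d4b", "Add null guard at line 142"],
)
```

A card would only be posted when `card.uncited()` returns an empty list.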

Aqweth recommends. Your engineers act.

The only production action Aqweth can take is opening a Jira ticket — and only on explicit approval.

no rollbacks · no restarts · no config changes

No automated remediation, no surprises, no "AI rolled back the deploy while you slept."

RCA card posted in Slack / Chat
↓
Engineer reviews evidence, confidence, fix
↓
Approve or reject (human_review interrupt)
↓ on approve only
Jira ticket opened with full RCA evidence attached
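The approval gate in the flow above reduces to a small rule: nothing happens unless a human explicitly approves. A minimal sketch follows, in plain Python rather than any particular workflow framework; `Decision`, `handle_review`, and the `open_ticket` callback are hypothetical names, and the actual Jira call is deliberately left to the caller.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"


def handle_review(
    card: dict,
    decision: Decision,
    open_ticket: Callable[[dict], str],
) -> Optional[str]:
    """Open a ticket only on explicit approval; anything else is a no-op."""
    if decision is not Decision.APPROVE:
        return None
    return open_ticket(card)  # returns the new ticket key
```

Keeping the side effect behind a single callback means the only write path is easy to audit: there is one place where a ticket can be created, and it sits after the approval check.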

Fits the observability stack you already have.

Works with Kubernetes, AWS, and GCP out of the box. Switching backends is one line in aqweth.yaml. No code. No rebuild.

Available today

Logs: Loki · ELK · CloudWatch · Cloud Logging · New Relic
Metrics: Prometheus · CloudWatch Metrics · Cloud Monitoring
Tracing: Grafana Tempo · AWS X-Ray · Cloud Trace
Errors / APM: Sentry · New Relic
CI/CD: GitHub Actions · GitLab CI · CircleCI · Jenkins · Cloud Build · Cloud Workflows
Infra: Kubernetes (EKS · GKE · AKS)
Database: PostgreSQL · RDS/Aurora · MySQL · Neo4j
Serverless: AWS Lambda · API Gateway
Queues: SQS
Code: GitHub · GitLab · Bitbucket
Tickets: Jira · Confluence
Notifications: Slack · Google Chat
Vector store: Qdrant · ApertureDB
LLM: Claude · OpenAI · Google · vLLM · Ollama

Coming soon

Logs: VictoriaLogs · SumoLogic · Azure Monitor · Splunk
Metrics: VictoriaMetrics · Azure Monitor
Tracing: Jaeger · Zipkin · Azure AppInsights
APM: Honeycomb · AppDynamics
Errors: Rollbar · Bugsnag
CI/CD: ArgoCD · Buildkite · Concourse
Database: MongoDB · Cassandra · Redis
Queues: Kafka · RabbitMQ · Azure Service Bus · Google Pub/Sub
Code: Azure DevOps
Tickets: Linear · ServiceNow · GitHub Issues
Knowledge: Notion · GitBook · Outline · Bookstack
Notifications: Microsoft Teams · Discord · PagerDuty · OpsGenie
Vector store: pgvector · Weaviate · Milvus · Chroma
LLM: AWS Bedrock · Vertex AI
Cloud: Azure

Data residency on your terms.

Run all inference in your cluster. Or use cloud APIs. Or mix both. One config file either way.

Cloud API

+ No GPU footprint
+ Frontier models out of the box
− Prod context leaves your network
primary_role: claude-opus-4-7

Self-hosted · air-gap compatible

+ Data never leaves your cluster
+ Works offline
− GPU infra required
primary_role: vllm/qwen3-30b-a3b

Mix both: embedder + triage self-hosted, reasoning via cloud API. One YAML line per role.
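To make the mixed setup concrete, the sketch below groups roles by where their model would run. It is an illustration under stated assumptions: `primary_role`, `embed_role`, and the `vllm/` prefix appear on this page, while the `triage_role` key, the `ollama/` prefix, the prefix-based rule itself, and the example values marked hypothetical are not taken from Aqweth's configuration format.

```python
# Assumption: locally served models carry a runtime prefix such as "vllm/".
# The "ollama/" prefix is a hypothetical addition for illustration.
SELF_HOSTED_PREFIXES = ("vllm/", "ollama/")


def placement(roles: dict[str, str]) -> dict[str, list[str]]:
    """Group role names by whether their model runs in-cluster or via a cloud API."""
    groups: dict[str, list[str]] = {"self_hosted": [], "cloud_api": []}
    for role, model in roles.items():
        key = "self_hosted" if model.startswith(SELF_HOSTED_PREFIXES) else "cloud_api"
        groups[key].append(role)
    return groups


mixed = {
    "embed_role": "vllm/bge-m3",          # hypothetical value
    "triage_role": "vllm/qwen3-30b-a3b",  # hypothetical key
    "primary_role": "claude-opus-4-7",
}
```

A check like this is useful mainly as an audit: the roles listed under `cloud_api` are the ones whose prompts would leave the network.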

Always on. Not just when alerts fire.

When an anomaly is detected, Aqweth automatically triggers a full investigation — RCA card in Slack before anyone is paged.

every 5 min

Anomaly scan

z-score + EWMA on error rates and latency per service.

every 30 min

Correlation sweep

Multi-service degradation within a time window.

every 4 hours

Health digest

Deterministic summary posted to SRE channel.

daily · 06:00 UTC

Trend report

Week-on-week regressions, no LLM cost.

nightly · 02:00 UTC

Nightly embed

Resolved incidents + runbooks → vector store.
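The anomaly scan in the schedule above combines z-scores with an EWMA. One common way to do that, shown here as a sketch rather than Aqweth's actual detector, is to keep an exponentially weighted mean and variance per series and score each new point against the estimates from the points before it. The function names and the `alpha`, `threshold`, and `warmup` defaults are assumptions.

```python
import math


def ewma_zscores(values: list[float], alpha: float = 0.3) -> list[float]:
    """Score each point against the EWMA mean and variance of earlier points."""
    if not values:
        return []
    mean = values[0]
    var = 0.0
    scores = [0.0]  # the first point has no history to compare against
    for x in values[1:]:
        std = math.sqrt(var)
        diff = x - mean
        scores.append(diff / std if std > 0 else 0.0)
        # Update the running estimates after scoring, so a spike cannot mask itself.
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return scores


def anomalies(
    values: list[float],
    threshold: float = 3.0,
    alpha: float = 0.3,
    warmup: int = 5,
) -> list[int]:
    """Return indices whose |z| exceeds the threshold, skipping the warm-up period."""
    scores = ewma_zscores(values, alpha)
    return [i for i, z in enumerate(scores) if i >= warmup and abs(z) > threshold]
```

For example, an error-rate series such as `[1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 1.0, 14.0]` flags only the final point. Because the baseline adapts with each sample, a slow drift is harder to catch with this method alone, which is where a wider correlation sweep helps.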

Engineers wake up to a resolved investigation, not a page.

Let us run an investigation on one of your incidents.

No deployment required. You nominate an incident from your retro doc — we run the analysis together.

Request access

Or email us at hello@aqweth.ai · No commitment