How does Aqweth perform root cause analysis?

Aqweth fans out 15 parallel fetch nodes, each querying a different backend — logs, metrics, traces, error tracking, tickets, runbooks, similar past incidents, code, infrastructure, database, APM, CI/CD pipeline, workflow, queue, and serverless. Evidence is synthesised by an AI model into a structured RCA card with root cause, confidence score, and ranked fixes.

What observability tools does Aqweth integrate with?

Aqweth integrates with Grafana, Loki, Prometheus, Sentry, Jira, GitLab, and Confluence out of the box. It supports AWS, GCP, and Kubernetes infrastructure. Additional backends are configured with a single line in aqweth.yaml — no code changes required.

How much does Aqweth reduce MTTR?

Aqweth reduces mean time to resolution (MTTR) by cutting the investigation phase from the typical 30–90 minutes of manual evidence gathering down to approximately 30 seconds. Engineers spend their time on the fix, not the investigation.

Is Aqweth self-hosted?

Yes. Aqweth deploys via a single Helm chart into its own Kubernetes namespace and never touches production namespaces. All data stays in your own infrastructure. Exit is a namespace delete — no vendor lock-in and no proprietary data formats.

Does Aqweth support self-hosted LLMs?

Yes. Aqweth supports both self-hosted LLMs (Ollama, vLLM) and cloud LLM providers (OpenAI, Anthropic, Gemini). The LLM is configurable per role in aqweth.yaml, so different models can be used for triage, fetch, and synthesis stages.

Does Aqweth detect and investigate incidents proactively?

Yes. Aqweth runs scheduled anomaly scans every 5 minutes and correlation sweeps every 30 minutes — catching gradual degradation across multiple data sources that no single alert threshold would detect, before users notice. When an anomaly is identified, Aqweth automatically triggers a full 15-node investigation and delivers an RCA card to Slack before engineers are paged. Engineers wake up to a resolved investigation, not an alert.

Agentic SRE · Autonomous RCA · AIOps

Aqweth sees your production so your on-call engineers don't have to.

// ˈAK-WETH // NOUN · PRODUCTION RELIABILITY

An AI SRE agent for autonomous root-cause analysis. 15 parallel fetch nodes investigate across logs, metrics, traces, deploys, and code — and report back to on-call engineers in plain language within seconds.

Request access · See how it works ↓

15 nodes · parallel

any LLM · any stack

read-only · never mutates

aqweth

[14:32:04] jm > /rca payments-api 5xx spike

[14:32:06] aq Investigating inc-2104 across 15 sources…

[14:32:09] aq ✓ fetch_logs — 47 ERRORs in 30min

[14:32:09] aq ✓ fetch_metrics — p99 2.3s (baseline 180ms)

[14:32:10] aq ✓ fetch_pipeline — deploy v2.4.1 at 14:28

RCA · inc-2104 · confidence 91% live

Root cause

Null check removed in PaymentProcessor.validate() in commit a3f1c92 (deploy 14:28).

Suggested fix

Revert to v2.4.0 or patch null guard on line 142.

30–90 minutes of every incident goes to gathering evidence, not fixing it.

Aqweth automates the investigation: AI-powered RCA that cuts evidence gathering from 30–90 minutes to about 30 seconds.

Mean time to diagnosis

30–90

minutes

Engineering cost per incident

2–6

eng-hours

Who gets paged

engineer

Incident window

Investigation · 38–80 min

Fix · 12 min

With Aqweth

~30s

Fix · 12 min

70+ min freed

What you're correlating, by hand, at 03:14 a.m.

Grafana · Loki · Prometheus · Sentry · Jira · GitLab · recent deploys · runbooks (Confluence) · last week's similar incident · K8s pod events

Production incidents cost more than downtime.

Three structural problems compound every incident response.

problem · 01

Silent degradation

Error rates climb, latency drifts — gradually, beneath alert thresholds. No single metric trips the wire. By the time a page fires, users have already been impacted for minutes and the clearest evidence has started to decay.

02:30 Error rate begins climbing

03:01 First user reports

03:08 Latency threshold crossed

03:14 Alert fires → paged

problem · 02

Access walls

Engineers are paged at 3 AM and spend the first 20 minutes navigating VPN, requesting elevated access, and waiting for approval. By the time they reach logs, the critical window has passed.

03:14 Paged

03:18 Responder hits access wall

03:22 Escalates to prod-on-call

03:41 Investigation begins

problem · 03

Tool fragmentation

RCA means manually cross-referencing five different systems with no shared timeline. Every tool has a different auth flow, a different query syntax, and a different data model — all under pressure, in the middle of the night.

logs · metrics · traces · deploys · code

One investigation. Fifteen sources. Seconds.

Triage runs first — noise dismissed before a single LLM token is spent.

/rca · alert · proactive

Trigger

Invoke with /rca in Slack, connect to your alerting pipeline, or let Aqweth run proactive scans on schedule. Any alert format, any channel.

Dedupe + classify

Triage

Signal is separated from noise before a single LLM token is spent. Duplicate alerts are merged, severity is classified, irrelevant signals are dropped.
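The triage step described above can be pictured as a purely deterministic filter. The sketch below is illustrative only: the `Alert` fields, the `drop_below` cutoff, and the severity rule are assumptions made for the example, not Aqweth's actual schema or thresholds.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    service: str
    signal: str    # e.g. "5xx_rate" (hypothetical signal name)
    value: float   # observed value as a multiple of baseline (assumed unit)

def triage(alerts: list[Alert], drop_below: float = 1.5) -> list[tuple[Alert, str]]:
    """Merge duplicates, drop weak signals, classify the rest. No LLM call involved."""
    merged: dict[tuple[str, str], Alert] = {}
    for alert in alerts:
        key = (alert.service, alert.signal)
        # Duplicate alerts collapse to the strongest observation per (service, signal).
        if key not in merged or alert.value > merged[key].value:
            merged[key] = alert
    result = []
    for alert in merged.values():
        if alert.value < drop_below:
            continue  # irrelevant signal: dropped before any model is invoked
        severity = "critical" if alert.value >= 10 else "warning"
        result.append((alert, severity))
    return result

alerts = [
    Alert("payments-api", "5xx_rate", 14.0),
    Alert("payments-api", "5xx_rate", 12.0),  # duplicate of the alert above
    Alert("search", "latency", 1.1),          # too close to baseline to matter
]
triaged = triage(alerts)
```

Only the surviving alerts would go on to the fan-out stage, which is why noise costs no LLM tokens.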

15 fetch nodes · parallel

Fan-out

Up to 15 fetch nodes execute in parallel, each querying a different backend. Slow or offline backends time out gracefully — the rest continue.

logs · metrics · traces · errors · tickets · runbooks · similar_rcas · code · infra · db · apm · pipeline · workflow · queue · serverless
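A parallel fan-out with graceful timeouts, as described above, could look roughly like this in Python's `asyncio`. It is a minimal sketch under stated assumptions: the three fetch functions are stand-ins that sleep instead of querying a backend, and the timeout value is arbitrary.

```python
import asyncio

async def fetch_logs(incident: str) -> str:
    await asyncio.sleep(0.01)   # stands in for a real backend query
    return "47 ERRORs in 30min"

async def fetch_metrics(incident: str) -> str:
    await asyncio.sleep(0.01)
    return "p99 2.3s (baseline 180ms)"

async def fetch_traces(incident: str) -> str:
    await asyncio.sleep(5)      # simulates a slow or offline backend
    return "never reached"

async def run_node(name, fetch, incident, timeout):
    """Run one fetch node; a timeout yields None instead of failing the whole run."""
    try:
        return name, await asyncio.wait_for(fetch(incident), timeout)
    except asyncio.TimeoutError:
        return name, None       # this node timed out; the others are unaffected

async def fan_out(incident: str, timeout: float = 0.2) -> dict:
    nodes = {"logs": fetch_logs, "metrics": fetch_metrics, "traces": fetch_traces}
    # gather() schedules every node concurrently and preserves input order.
    results = await asyncio.gather(
        *(run_node(name, fetch, incident, timeout) for name, fetch in nodes.items())
    )
    return dict(results)

evidence = asyncio.run(fan_out("inc-2104"))
```

Here `evidence["traces"]` ends up as `None` while the log and metric results are kept, so synthesis can proceed on partial evidence.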

RCA card → chat

Synthesise

All evidence is assembled into a structured RCA card with confidence score, root cause, and suggested fix. Streamed directly to the Slack thread that triggered the investigation.

Full product architecture →

The deliverable

A cited RCA card. In the chat platform you already use.

Hypothesis with citations. Confidence score. Ranked fixes. Approve or reject — never auto-applied.

01
Every claim is cited to the raw evidence — log line, span ID, deploy SHA, ticket. No hallucinated conclusions.
02
Below 0.70 confidence, Aqweth automatically escalates to deep reasoning and surfaces uncertainty explicitly on the card.
03
Slack and Google Chat shipping today. Microsoft Teams on the roadmap.
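One way to picture a cited card with a 0.70 escalation threshold is as a small data structure. The class and field names below are hypothetical, chosen for illustration; only the 0.70 threshold and the kinds of citation come from the text above.

```python
from dataclasses import dataclass, field

DEEP_REASONING_THRESHOLD = 0.70  # threshold stated in the text above

@dataclass
class Evidence:
    claim: str
    source: str      # e.g. "loki", "prometheus"
    reference: str   # log line, span ID, deploy SHA, or ticket

@dataclass
class RCACard:
    incident: str
    root_cause: str
    confidence: float
    evidence: list[Evidence] = field(default_factory=list)

    def needs_deep_reasoning(self) -> bool:
        # Low-confidence cards are escalated rather than presented as certain.
        return self.confidence < DEEP_REASONING_THRESHOLD

    def uncited_claims(self) -> list[str]:
        # Any claim without a reference back to raw evidence is flagged.
        return [e.claim for e in self.evidence if not e.reference]

card = RCACard(
    incident="inc-2104",
    root_cause="Null check removed in PaymentProcessor.validate()",
    confidence=0.84,
    evidence=[Evidence("47 ERRORs in 30 min", "loki", "trace 9d4ba2e1")],
)
uncertain = RCACard("inc-0000", "unknown", confidence=0.65)  # made-up example card
```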

CRITICAL · inc-2104 · confidence 0.84

payments-api 5xx surge after deploy a3f1c92

Root cause

a3f1c92 removed the null check in PaymentProcessor.validate() at line 142. Every transaction since deploy 14:28 fails validation.

Evidence · 4 sources

Error rate jumped 14× within 90s of rollout.

prometheus · payments_5xx_total · 14:30:12 → 14:31:42

47 ERRORs in 30 min, NullPointerException at line 142.

loki · payments-api · trace 9d4ba2e1

Removed null check in PaymentProcessor.validate() in commit a3f1c92.

github · PR #4417 · merged 14:28 UTC

Similar failure resolved in inc-1879 (similarity 0.91) — 4 prior matches.

fetch_similar_rcas · vector store · embed_role bge-m3

Confidence

0.84

Suggested fixes · ranked

Revert payments-api to a3f0d4b — restores null guard. Runbook: payments-validation.

MTTR < 2 min

Add null guard at line 142 in PaymentProcessor.validate() — forward-fix, no rollback required.

MTTR ~5 min

Aqweth recommends. Your engineers act.

The only production action Aqweth can take is opening a Jira ticket — and only on explicit approval.

✗ no rollbacks ✗ no restarts ✗ no config changes

No automated remediation, no surprises, no "AI rolled back the deploy while you slept."

RCA card posted (in Slack / Chat) → Engineer reviews (evidence, confidence, fix) → Approve or reject (human_review interrupt)

On approve only ↓

Jira ticket opened, with full RCA evidence attached
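The approval gate in this flow can be sketched as a function that never reaches the ticketing call unless a human has approved. Everything here is an assumption for illustration: `human_review` borrows its name from the diagram label, and `fake_open_ticket` is a stand-in, not a real Jira client.

```python
from typing import Callable, Optional

def human_review(approved: bool, summary: str,
                 open_ticket: Callable[[str], str]) -> Optional[str]:
    """Gate the single write action behind an explicit human decision."""
    if not approved:
        return None              # rejected: nothing is created or changed
    return open_ticket(summary)  # approved: hand off to the ticketing client

opened: list[str] = []

def fake_open_ticket(summary: str) -> str:
    # Stand-in for a real ticketing call, which this sketch does not implement.
    opened.append(summary)
    return f"TICKET-{len(opened)}"

rejected = human_review(False, "Revert payments-api", fake_open_ticket)
approved = human_review(True, "Revert payments-api", fake_open_ticket)
```

Keeping the write path behind one function makes the "on approve only" guarantee easy to audit.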

Fits the observability stack you already have.

Works with Kubernetes, AWS, and GCP out of the box. Switching backends is one line in aqweth.yaml. No code. No rebuild.

Available today

Logs

Loki · ELK · CloudWatch · Cloud Logging · New Relic

Metrics

Prometheus · CloudWatch Metrics · Cloud Monitoring

Tracing

Grafana Tempo · AWS X-Ray · Cloud Trace

Errors / APM

Sentry · New Relic

CI/CD

GitHub Actions · GitLab CI · CircleCI · Jenkins · Cloud Build · Cloud Workflows

Infra

Kubernetes (EKS · GKE · AKS)

Database

PostgreSQL · RDS/Aurora · MySQL · Neo4j

Serverless

AWS Lambda · API Gateway

Queues

SQS

Code

GitHub · GitLab · Bitbucket

Tickets

Jira · Confluence

Notifications

Slack · Google Chat

Vector store

Qdrant · ApertureDB

LLM

Claude · OpenAI · Google · vLLM · Ollama

Coming soon

Logs

VictoriaLogs · SumoLogic · Azure Monitor · Splunk

Metrics

VictoriaMetrics · Azure Monitor

Tracing

Jaeger · Zipkin · Azure AppInsights

APM

Honeycomb · AppDynamics

Errors

Rollbar · Bugsnag

CI/CD

ArgoCD · Buildkite · Concourse

Database

MongoDB · Cassandra · Redis

Queues

Kafka · RabbitMQ · Azure Service Bus · Google Pub/Sub

Code

Azure DevOps

Tickets

Linear · ServiceNow · GitHub Issues

Knowledge

Notion · GitBook · Outline · Bookstack

Notifications

Microsoft Teams · Discord · PagerDuty · OpsGenie

Vector store

pgvector · Weaviate · Milvus · Chroma

LLM

AWS Bedrock · Vertex AI

Cloud

Azure

Full backend reference →

Data residency on your terms.

Run all inference in your cluster. Or use cloud APIs. Or mix both. One config file either way.

Cloud API

+ No GPU footprint

+ Frontier models out of the box

− Prod context leaves your network

primary_role: claude-opus-4-7

air-gap compatible

Self-hosted

+ Data never leaves your cluster

+ Works offline

− GPU infra required

primary_role: vllm/qwen3-30b-a3b

Mix both: embedder + triage self-hosted, reasoning via cloud API. One YAML line per role.
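To make the mixed setup concrete, the sketch below maps roles to models and reports which roles would send data outside the cluster. The `primary_role` and `embed_role` names and the model strings appear on this page; `triage_role`, the `ollama/` prefix on the embedder, and the prefix-based check are assumptions for the example, not the real aqweth.yaml schema.

```python
# Assumed convention: self-hosted models are addressed with a runtime prefix.
SELF_HOSTED_PREFIXES = ("vllm/", "ollama/")

roles = {
    "embed_role": "ollama/bge-m3",         # embedder kept in-cluster (assumed prefix)
    "triage_role": "vllm/qwen3-30b-a3b",   # hypothetical role name
    "primary_role": "claude-opus-4-7",     # reasoning via a cloud API
}

def leaves_network(model: str) -> bool:
    """True when the model is served by a cloud API rather than in-cluster."""
    return not model.startswith(SELF_HOSTED_PREFIXES)

external_roles = sorted(role for role, model in roles.items() if leaves_network(model))
```

A check like this could back a data-residency review: in the mixed setup only the reasoning role crosses the network boundary.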

Always on. Not just when alerts fire.

When an anomaly is detected, Aqweth automatically triggers a full investigation — RCA card in Slack before anyone is paged.

every 5 min

Anomaly scan

z-score + EWMA on error rates and latency per service.

every 30 min

Correlation sweep

Multi-service degradation within a time window.

every 4 hours

Health digest

Deterministic summary posted to SRE channel.

daily · 06:00 UTC

Trend report

Week-on-week regressions, no LLM cost.

nightly · 02:00 UTC

Nightly embed

Resolved incidents + runbooks → vector store.
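The anomaly scan in the schedule above combines z-scores with EWMA. A minimal sketch of that idea follows: each sample is scored against an exponentially weighted mean and variance of the samples before it. The smoothing factor, the sample data, and the cutoff of 3 are assumptions for illustration, not Aqweth's tuned values.

```python
import math

def ewma_zscores(samples: list[float], alpha: float = 0.3) -> list[float]:
    """Score each sample against the EWMA mean and variance of earlier samples."""
    mean = samples[0]
    var = 0.0
    scores = [0.0]  # the first sample has no history to compare against
    for x in samples[1:]:
        std = math.sqrt(var)
        diff = x - mean
        scores.append(diff / std if std > 0 else 0.0)
        # Update the running estimates after scoring, so a spike cannot mask itself.
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return scores

# Made-up per-interval error rates for one service, ending in a sudden jump.
error_rates = [0.010, 0.011, 0.009, 0.010, 0.011, 0.010, 0.050]
scores = ewma_zscores(error_rates)
is_anomaly = scores[-1] > 3  # assumed cutoff
```

Because the baseline adapts over time, this kind of check can flag a drift relative to recent behaviour even when no fixed alert threshold is crossed.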

Engineers wake up to a resolved investigation, not a page.

Let us run an investigation on one of your incidents.

No deployment required. You nominate an incident from your retro doc — we run the analysis together.

Request access

Or email us at hello@aqweth.ai · No commitment