Monday · 9:15 AM · Your agents ran all weekend

Your agents are working.
Are they healthy?

Agent-level observability for production AI. See what's running, what's stuck, what's costing you — in 2 seconds.

See what happened to the sales agent
The Investigation

You see the red.
You click. Here's what you find in 15 seconds.

The sales agent has been silently failing for 2 hours. Eight leads are backed up. No one noticed — until now.

sales
ERROR
sales
2m ago
Q:8 1 issue
Agent-Reported Issue
HIGH
CRM API returning 403 Forbidden
for all workspace query endpoints
Category
permissions
Occurrences
×8
First Seen
2h 14m ago
Root Cause
Expired credentials
Timeline
task-lead-acme
⏱ 47.1s ✗ failed ◆ 12 LLM
Time Breakdown
Total: 47.1s
LLM
12.2s (26%)
Tools
6.6s (14%)
Retries
27.3s (58%)
Other
1.0s (2%)
Plan · 5 steps
1/5 completed · failed at step 2
task-lead-acme 47.1s
Process Acme Corp inbound lead
✗ failed
LLM · reasoning 1.1s
claude-sonnet-4-5 920→142 $0.008
crm_search 0.8s
HTTP 403 Forbidden — CRM API rejected auth token
query="Acme Corp" · endpoint=/api/v2/contacts/search
✗ failed
retry #1 after 2s backoff
HTTP 403 Forbidden — same auth rejection
retry #2 after 5s backoff
HTTP 403 Forbidden — auth token expired
retry #3 after 10s backoff
HTTP 403 Forbidden — max retries exhausted
LLM · error analysis 0.9s
claude-sonnet-4-5 1400→180 $0.012 → report issue
report_issue 0.1s
severity=high · category=permissions · "CRM API returning 403"
task_failed 47.1s total
Lead processing aborted — CRM unavailable after 3 retries. Lead queued for retry.
✗ failed
The CRM credentials expired over the weekend.
All lead processing silently failed for 2 hours.
Without HiveBoard, you'd have found out when a customer complained — days later.
8
Leads backed up
2h 14m
Silent failure
15s
Time to diagnose
But invisible failures aren't the only thing hiding
The Optimization

Invisible failures aren't the only thing hiding.
So is invisible waste.

Your agents are making LLM calls every second. Do you know which model, how many tokens, and how much each one costs?

Cost Explorer
Last 7 days
Total LLM Spend
$847.20
↓ 74% from last week
LLM Calls
12,847
avg 1,835/day
Avg Cost / Task
$0.04
was $0.16 last week
Total Tokens
8.4M
↓ 68% from last week
Savings This Week
$2,412
vs. previous run rate
Daily LLM Spend by Model
claude-opus
claude-sonnet
claude-haiku
HiveBoard visibility begins
Mon
Tue
Wed
Thu
Fri
Sat
Sun
Mon
Tue
Wed
Today
Cost by Model
Model · Calls · Cost
claude-opus 842 $412.60
claude-sonnet 6,204 $318.40
claude-haiku 5,801 $116.20
💡 Optimization found
Opus used for classify_intent (842 calls).
Haiku handles this at 1/10th the cost.
Cost by Agent
Agent · Tasks · Cost
sales 4,280 $412.80
support 3,640 $286.40
main 2,927 $148.00
💡 Anomaly detected
sales costs $0.10/task, support costs $0.08.
Same task type. Check prompt sizes.
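The per-task comparison behind that anomaly card is simple division plus a threshold. A minimal sketch — the agent names and dollar figures come from the table above, but the rule itself (flag any agent more than 20% above the fleet median) is an assumption, not HiveBoard's actual detector:

```python
def cost_per_task(total_cost, tasks):
    return total_cost / tasks

def find_cost_anomalies(agents, threshold=1.2):
    """Flag agents whose per-task cost exceeds the fleet median by `threshold`.

    `agents` maps agent name -> (task count, total cost in dollars).
    """
    costs = {name: cost_per_task(cost, tasks) for name, (tasks, cost) in agents.items()}
    median = sorted(costs.values())[len(costs) // 2]
    return {name: round(c, 3) for name, c in costs.items() if c > median * threshold}

# Figures from the Cost by Agent table above
agents = {
    "sales":   (4280, 412.80),
    "support": (3640, 286.40),
    "main":    (2927, 148.00),
}
anomalies = find_cost_anomalies(agents)  # sales stands out at ~$0.10/task
```

Same task type, roughly 25% more per task — that is the signal the dashboard surfaces as "check prompt sizes."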
Prompt Bloat Analysis
Before — avg prompt size ~18,000 tok
18,000 tokens
What's inside that prompt
redundant context
verbose instructions
repeated turns
actual content
After — trimmed & optimized ~5,000 tok
5,000 tokens
18k
Before
5k
After
−72%
Token reduction
$40
/hour
$8
/hour
No model switch. No architecture change.
Just visibility into what was already happening, followed by informed prompt optimization.
80% cost reduction from observability alone
Saving $768/week at current volume
All of this — from a 3-line change to your code
Cost Explorer

The $40/hr was hidden in
842 Opus calls doing Haiku's job.

Click a model row. See every call. Spot the pattern. Opus was running classify_intent — a task Haiku handles at 1/10th the cost. That's $370/week found in 30 seconds.

Step 1 · See
Cost by Model breakdown
Total spend split by model. One bar dominates. Opus at $412 — nearly half the total. That's where you look first.
What you see
claude-opus-4-6 · 842 calls · $412.60
Step 2 · Drill
Click the row. See every call.
Expand to individual LLM calls. Name, agent, tokens, cost. Same function, repeated 842 times.
The pattern
classify_intent · 4,200 in / 120 out · $0.49
classify_intent · 4,180 in / 115 out · $0.49
classify_intent · 4,210 in / 118 out · $0.49
Step 3 · Fix
Switch model. Save $370/week.
classify_intent is simple routing. Haiku handles it at 1/10th the cost with identical accuracy. One config change.
The result
Opus: $0.49/call → Haiku: $0.05/call
842 calls × $0.44 saved = $370/week
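The savings math above is worth making explicit. A back-of-envelope sketch — the per-call prices are the ones shown in the mockup, and the weekly figure assumes the same 842 calls recur each week:

```python
# Per-call prices from the Step 2/Step 3 panels above (illustrative figures)
opus_per_call = 0.49
haiku_per_call = 0.05
calls_per_week = 842

saved_per_call = opus_per_call - haiku_per_call          # $0.44 per call
weekly_savings = round(saved_per_call * calls_per_week)  # about $370/week
```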
The most expensive line item in your AI stack isn't the model.
It's using the wrong model for the right task.
Three lines of code to get here
The Integration

Three layers.
Ship after any one.

Each layer builds on the last. Each one unlocks new dashboard capabilities. Start with 3 lines, go deeper when you're ready.

0
Presence & Heartbeat
— your agent is alive
~10 lines · 2 minutes
initialization — run once at startup
import hiveloop

# One line: connect to HiveBoard
hb = hiveloop.init(api_key="hb_live_your_key", environment="production")

# Two lines: register your agent
agent = hb.agent(
    agent_id="sales",
    type="sales",
    version="1.2.0",
    heartbeat_interval=30,
)

# That's it. Your agent now has:
# ✓ Heartbeat every 30s
# ✓ Online/offline detection
# ✓ Stuck detection (5m default)
# ✓ Dashboard card with sparkline
What lights up on the dashboard
sales
IDLE
sales
12s ago
v1.2.0
Activity Stream
agent_registered
sales · v1.2.0
heartbeat
sales · 12s ago
status_change
sales → IDLE
Unlocks →
Agent Cards
Heartbeat Sparklines
Stuck Detection
Online/Offline Status
Connection Indicator
1
Tasks & Actions
— what your agent is doing
~30 lines · 10 minutes
wrap your task boundary
def process_lead(agent, lead):
    with agent.task(f"task-lead-{lead.id}", project="sales", type="lead") as task:
        result = run_pipeline(lead)
        return result

# ✓ task_started on entry
# ✓ task_completed on clean exit
# ✓ task_failed on exception (auto-caught)

decorate key functions (5-7 nodes, not 30)
@agent.track("evaluate_lead")
def evaluate_lead(lead):
    score = run_scoring_model(lead)
    return score

@agent.track("crm_search")
def search_crm(query):
    return crm_client.search(query)

@agent.track("send_email")
def send_outreach(contact, template):
    return email_client.send(contact, template)

or use context managers for dynamic names
# When the LLM picks tools at runtime:
for tool_call in response.tool_calls:
    with agent.track_context(tool_call.name) as ctx:
        result = execute_tool(tool_call.name, tool_call.args)
What lights up on the dashboard
sales
PROCESSING
↳ task-lead-4801
Timeline
task-lead-4801
task-lead-4801
14.2s
evaluate_lead
1.8s
crm_search
0.8s
send_email
3.1s
↳ ConnectionError: smtp refused
Unlocks →
Task Table
Action Timelines
Success / Failure Rates
Duration Tracking
Error Attribution
2
Full Narrative Telemetry
— the complete story
~5–10 lines per call site
2a · LLM call tracking → cost explorer
response = llm_client.chat(messages, tools=tool_catalog)
task.llm_call(
    "agent_turn",
    model=response.model,
    tokens_in=response.usage.input_tokens,
    tokens_out=response.usage.output_tokens,
    cost=estimate_cost(
        response.model,
        response.usage.input_tokens,
        response.usage.output_tokens,
    ),
    duration_ms=elapsed_ms,
)

2b · Plans, issues, escalations
# Agent decides on a plan
task.plan(["Search CRM", "Score lead", "Draft email", "Send", "Log result"])

# Agent detects something wrong
agent.report_issue("CRM API returning 403", severity="high", category="permissions")

# Agent needs a human
task.escalate("Credit >$200 needs approval", to="support-lead")

# Queue health snapshot
agent.queue_snapshot(depth=8, oldest_age_s=2820)

2c · Framework integrations (one line each)
# LangChain
from hiveloop.integrations.langchain import HiveLoopCallback
chain.invoke(input, config={"callbacks": [HiveLoopCallback(hb)]})

# CrewAI
from hiveloop.integrations.crewai import CrewAICallback
crew = Crew(agents=agents, callbacks=[CrewAICallback(hb)])

# AutoGen
from hiveloop.integrations.autogen import AutoGenCallback
callback = AutoGenCallback(hb, project="sales-pipeline")
What lights up on the dashboard
Timeline
task-lead-4801
◆ 5 LLM · $0.04
LLM · reasoning
1.2s
claude-sonnet-4-5 · 842→156 · $0.008
crm_search
0.8s
LLM · tool use
1.4s
report_issue
⚑ high
Cost Explorer
$0.04 this task
claude-sonnet-4-5
5 calls · $0.04
Tokens: 5,200 in / 890 out
Rich Events
plan
5 steps · 1/5 completed
issue
CRM 403 · high · ×8
escalation
Credit approval → support-lead
queue
depth:8 · oldest: 47m
Unlocks →
Cost Explorer
Token Usage
LLM Nodes in Timeline
Plan Progress Bars
Issue Tracking
Escalation Visibility
Queue Health
🛡 The Safety Contract
Observability is a side channel. If HiveBoard goes down, your agents continue running identically. Every SDK call follows the guard pattern:
if hiveloop_agent:
    try:
        hiveloop_agent.some_method(...)
    except Exception:
        pass  # observability must never break the agent
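Rather than repeating that guard at every call site, it can be packaged once as a decorator. A sketch — this helper is not part of the HiveLoop SDK, just one way to apply the contract uniformly:

```python
import functools
import logging

def observability_guard(fn):
    """Wrap an instrumentation call so its failures never reach agent code."""
    @functools.wraps(fn)
    def safe(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Log locally at debug level; never re-raise into the agent.
            logging.getLogger("hiveloop").debug("telemetry call failed", exc_info=True)
            return None
    return safe

@observability_guard
def emit_heartbeat(client):
    client.heartbeat()  # may raise if HiveBoard is unreachable

# Simulate the dashboard being down: the agent keeps running regardless.
class DownClient:
    def heartbeat(self):
        raise ConnectionError("HiveBoard unreachable")

result = emit_heartbeat(DownClient())  # exception swallowed, returns None
```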
Claude Code
~/my-agent $
Scanning codebase… found LangChain + 4 tool functions
Added hiveloop to requirements.txt
Instrumented: init → agent → task → llm_call → shutdown
Enhanced: @track on 4 tools, plan, issues, log bridge
HiveLoop integration complete. Start your agent to see live telemetry.
NEW The Shortcut
Or let Claude Code
do it for you.
One slash command. Full integration. Our Claude Code skill scans your codebase, discovers your framework, and instruments every layer automatically — from heartbeat to LLM call tracking.
/integrate-hiveloop hb_live_your_key
Works with: LangChain · CrewAI · AutoGen · Semantic Kernel · Custom loops
38 questions, one dashboard
What HiveBoard Sees

38 questions.
One dashboard. Four moments.

Every interaction with an observability tool happens in one of four moments. HiveBoard was designed to serve all of them — from a 2-second glance to a 30-minute strategic review.

Moment 1
The Glance
⏱ 2 seconds
5 questions
Walking past a screen. Checking between meetings. Is everything OK?
"Are my agents running?"
Heartbeat dots on every agent card — green, amber, or red. No click needed.
"Does anything need my attention?"
Attention badge — red pulsing pill "2 ⚠". If it's not there, nothing needs you.
"Is anything stuck?"
Stuck counter in Stats Ribbon. Stuck agents glow red, sorted to the top.
"Is work flowing?"
Four mini-charts: throughput, success rate, errors, cost. Shapes, not numbers.
"Is anything happening right now?"
Activity Stream with green "Live" pulse. If events appear, agents are working.
Moment 2
The Investigation
⏱ 2–5 minutes
11 questions
Something's wrong. An agent is stuck. A task failed. What happened?
Agent-level
"What is this agent doing right now?"
Click agent card → live timeline with task ID and elapsed time.
"Is this agent's heartbeat healthy?"
Three states: green (recent), amber (drifting), red (stale). Sparkline shows trend.
"Does it have pending work nobody's looking at?"
Queue badge "Q:4" on card. Amber if exceeding threshold. Pipeline tab shows contents.
"Has it reported its own problems?"
Issues via agent.report_issue() → red dot, occurrence count, severity, category.
Task-level
"What steps did this task take?"
Timeline renders every event as color-coded nodes. Read the story without clicking.
"What was the plan, and where did it go wrong?"
Plan progress bar: green-green-red-gray = third step failed.
"Which tool failed?"
Red nodes in timeline. Click → tool name, arguments, error, retries visible.
"Which LLM was called, and what did it see?"
Purple nodes with model badge. Click → tokens, cost, prompt/response preview.
"How long did each step take?"
Duration labels on every node. Time Breakdown bar shows where time went.
"Was it escalated? Did it need human approval?"
Amber escalation nodes. Approval request with approver name and resolution.
"Can I share this investigation?"
Every timeline has a permalink. Paste in Slack → full story in 15 seconds.
Moment 3
The Optimization
⏱ 10–15 minutes
12 questions
Nothing is on fire. But you suspect things could be better.
Cost optimization
"How much are my agents costing me?"
Cost Explorer: total spend broken down by model and by agent.
"Am I using expensive models where cheap ones would work?"
Cost by Model table. Opus for classify_intent? That's Haiku's job at 1/10th the cost.
"Why did costs spike this week?"
Stacked timeseries chart. Click into spike period → inspect LLM call nodes.
"Is there prompt bloat?"
LLM node: 18,000 tokens-in, 200 tokens-out = bloated. Fastest path to savings.
"Are different agents doing similar work at different costs?"
Cost by Agent table. Compare per-task cost across agents doing the same job.
Invisible failures
"Are tasks being silently dropped?"
Queue items aging beyond expected processing time. The most dangerous failure.
"Is the queue growing while the agent reports idle?"
IDLE badge + "Q:8" amber = scheduling bug or silent crash recovery.
"Are credentials failing silently?"
Issue: "CRM API 403" with occurrence count climbing. Pattern visible in timeline.
"Is the heartbeat doing less than it used to?"
Payload-aware heartbeats catch behavioral drift — not just alive/dead.
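One way to check for that kind of drift: each heartbeat carries a small work summary instead of a bare "alive" ping, and the monitor compares it to a rolling baseline. A sketch only — the field names and the 50% drop threshold are assumptions, not HiveLoop's actual heartbeat schema:

```python
def is_drifting(baseline, current, drop_ratio=0.5):
    """Flag a heartbeat whose reported throughput fell far below baseline."""
    if baseline["tasks_per_interval"] == 0:
        return False
    ratio = current["tasks_per_interval"] / baseline["tasks_per_interval"]
    return ratio < drop_ratio

baseline = {"tasks_per_interval": 12}
healthy  = {"tasks_per_interval": 11}  # normal variation, no alert
drifting = {"tasks_per_interval": 2}   # agent alive, but doing far less work
```

The point is the payload: a binary alive/dead heartbeat would report both agents as green.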
Operational health
"Are human approvals backing up?"
Stats Ribbon "Waiting" count. Activity Stream "human" filter shows queue depth.
"Which action within a plan consistently fails?"
Same red node position across timelines. Logs say "failed"; timelines say where.
"Is the same issue recurring without resolution?"
Issue at "×50 occurrences" = agent flagged it fifty times, nobody addressed it.
Moment 4
The Review
⏱ 20–30 minutes
10 questions
End of week. Before a board meeting. After a deploy. Are things getting better?
Performance trends
"Is my success rate improving?"
Stats Ribbon success rate + mini-chart trend. Line up = deploys working.
"Are tasks getting faster or slower?"
Avg duration + Time Breakdown. LLM time up after model switch = explanation.
"Which agent fails most often?"
Compare sparklines across agent cards. Red-trending = localized problem.
"Are agents getting better after deploys?"
Compare timelines before/after. Success rate, duration, cost, turns per task.
Cost accountability
"What's our total agent infrastructure cost?"
Cost Explorer, full time range. One number. Break down by model or agent.
"Is cost per task trending up or down?"
LLM Cost/Task mini-chart. Rising = prompts growing. Falling = optimizations working.
"Can I prove ROI on agent observability?"
Cost before vs. after. $40/hr → $8/hr. That's your ROI.
Fleet-level insights
"How many agents are in production?"
Stats Ribbon: Total Agents. Each visible in The Hive with its own card.
"What's the overall health of the fleet?"
All dots green + stuck=0 + errors low + success above baseline = healthy. 2 seconds.
"Are we ready to scale?"
High success + stable costs + manageable queues = safe to add agents. All three signals, same screen.
Questions nobody else answers
Architectural differences, not feature gaps — these can't be patched
"Is my agent stuck?"
No heartbeat concept. No stuck threshold. No liveness beyond "last API call."
LangSmith · Langfuse · Datadog
"What's in the work queue?"
No intent pipeline. They see what happened, not what's waiting to happen.
LangSmith · Langfuse · Datadog
"What did the agent plan to do vs. what did it actually do?"
No plan-step tracking. No planned-vs-actual comparison.
LangSmith · Langfuse · Datadog
"Is the heartbeat still doing what it used to?"
No payload-aware heartbeat. Heartbeat is binary: alive or dead.
LangSmith · Langfuse · Datadog
The story behind the product
HiveMind · Insights Engine

38 questions.
Already answered. Before you ask.

HiveMind continuously analyzes your agent fleet and pre-computes the answers — from a 2-second health check to a full ROI report. Open the page and the story is already there.

94.2%
Success Rate
↑ 3.1% from last week
72/100
Fleet Health
Degraded — 3 items to fix
$0.06
Avg Cost / Task
↓ 18% cheaper this week
$4,260
Projected Savings / yr
ROI calculated live
Pre-computed → Fleet Health Gauge · Silent Failure Detectors · Scale Readiness Checklist · ROI Calculator · Cost Trend Analysis · Agent Comparison
Other tools show you data and wait for you to ask questions.
HiveMind answers them before you open the page.
HiveMind · Insights Engine
The story behind the product
The Story

Built from real pain.
Documented in three parts.

HiveBoard wasn't imagined in a meeting room. It was forged from two weeks of deploying AI agents into production and watching them fail in ways nobody could see.

48h
Total build time
Part 1 · Why it exists
THE JOURNEY
How we built HiveBoard — the Datadog for AI Agents — in 48 hours. A chronicle of pain, pivots, and the moment visibility changed everything.
~2 hours actual coding out of 48 total
$
$40/hr → $8/hr — 80% cost cut from visibility alone
FormsFlow killed in a single session. No sunk cost fallacy.
13 event types, 6 major specs, 5 data model iterations
"Nobody needs observability on demo day. Everyone needs it on day 30 when the agent silently stopped working and nobody noticed for 6 hours."
Read the full chronicle
450+
Audit checkpoints
Part 2 · How it was built
The Hive Method
A development methodology for building complex systems with AI teams. One human, three Claude instances, and the process that made it work.
1 human + 3 Claude instances — PM, Dev Team 1, Dev Team 2
12 critical bugs caught by cross-auditing — invisible to unit tests
~96% specs, ~4% code — the code almost wrote itself
Divergent perspectives — CLI vs Cloud caught complementary blind spots
"Having Team 1 audit Team 2 and vice versa caught issues neither team found in their own work. The consumer of an API catches contract mismatches the producer's tests structurally cannot."
Read the methodology
38
Questions answered
Part 3 · What it does
What HiveBoard Sees
The 38 questions your agents can finally answer — organized by the four moments every operator lives in, from a 2-second glance to a 30-minute review.
The Glance — 5 questions in 2 seconds
🔍
The Investigation — 11 questions in 2–5 minutes
The Optimization — 12 questions in 10–15 minutes
📊
The Review — 10 questions in 20–30 minutes
"Existing tools think your agent is a function that calls an LLM. HiveBoard thinks your agent is a worker that takes tasks, gets stuck, asks for help, and recovers. That's the difference."
Read the full catalog
"I was spending $40/hour running my agents. I instrumented them with HiveLoop. I could see every prompt, every response, every token count. I cut my costs to $8/hour in a week."
— That's not a feature list. That's a before-and-after that sells itself.
Ready to see for yourself?
Get Started

See your agents
in 60 seconds.

One API key. Three lines of code. No signup required.
Full visibility before you finish your coffee.

Free tier · 5 agents · 500K events/month · No credit card · pip install hiveloop
Open-Source SDK
HiveLoop is open source. Instrument your agents in minutes. LangChain, CrewAI, AutoGen, or custom — it works with all of them.
Hosted Dashboard
Real-time agent cards, task timelines, cost explorer, and 38 questions answered. The wall-monitor test: healthy or on fire in 2 seconds.
Built from Production Pain
Not a demo tool. Born from deploying agents that silently failed, burned money, and dropped tasks. Every feature traces to a real problem.
$40/hr → $8/hr. The only thing that changed was visibility.
3 lines of code. 30 seconds. Your agent has a heartbeat.
The most dangerous agent failure is the one that doesn't look like a failure.