Production Monitoring

Once your app is deployed to production, the Production tab in the IDE workspace shows live metrics from Cloudflare. This page explains what each metric means, what normal looks like, and how to investigate when something goes wrong.


How to open it

Inside your project in the IDE, click the Production tab in the workspace panel. It only appears after at least one production deploy.

Use the time window selector in the top-right to switch between 1h, 6h, 24h (default), and 7d views. Narrowing the window helps isolate a spike; widening it shows trends.


Traffic metrics

Requests

Total HTTP requests received by your app's Worker in the selected window. This includes all routes — API calls, page loads, static assets, background job triggers, etc.

What's normal: Depends entirely on your app and how many people use it. A personal tool used by one person might see 50–200 requests/day. A team dashboard could see thousands.

Watch for: An unexpected spike (could be a polling loop, a broken cron job, or load from a new user) or a sudden drop to zero (your app may be down or broken in a way that prevents requests from reaching it).


Errors

Requests that returned a 5xx server error. These are failures in your app's backend — unhandled exceptions, crashes, missing environment variables or secrets, timeouts.

What's normal: Zero, or very close to it. A well-functioning app should have an error rate well under 1%.

Watch for: Any sustained non-zero error count. Even a handful of 5xx responses is worth investigating, especially if users are actively reporting problems.


Error rate

Errors as a percentage of total requests. This normalizes error counts against traffic volume — useful for distinguishing "we had 10 errors" from "10% of requests failed."

What's normal: Below 0.5% is healthy. Above 1% indicates something that needs attention.

What 100% error rate means: Your app is crashing on every request — usually a missing secret, a broken import, or an unhandled top-level exception in your Worker.
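
If you hit a 100% error rate, a top-level try/catch in the Worker's fetch handler turns an opaque crash into a logged error you can act on. Here's a minimal sketch, assuming a module-syntax TypeScript Worker; handleRequest and the MY_SECRET binding are hypothetical names:

  // Minimal sketch: catch unhandled exceptions at the top level so a single
  // bug shows up in Server Logs as a 500 with a message, not a silent crash.
  export default {
    async fetch(request: Request, env: { MY_SECRET?: string }): Promise<Response> {
      try {
        // A missing secret is a common cause of a 100% error rate after a deploy.
        if (!env.MY_SECRET) {
          throw new Error("MY_SECRET binding is missing");
        }
        return await handleRequest(request, env);
      } catch (err) {
        console.error("Unhandled exception:", err); // lands in Server Logs
        return new Response("Internal error", { status: 500 });
      }
    },
  };

  // Hypothetical app logic, standing in for your routes.
  async function handleRequest(request: Request, env: unknown): Promise<Response> {
    return new Response("ok");
  }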


Performance metrics

CPU time

How long the Worker's CPU was actively executing code per request. Reported as:

  • p50 — the median request (half of requests finished faster than this)
  • p99 — the slowest 1% of requests
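
To make those definitions concrete, here's a small sketch of how percentiles fall out of a list of per-request timings (the dashboard computes these for you):

  // Nearest-rank percentile over a sample of per-request timings (ms).
  function percentile(samples: number[], p: number): number {
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, rank)];
  }

  const cpuTimesMs = [2, 3, 3, 4, 5, 5, 6, 8, 12, 40];
  console.log(percentile(cpuTimesMs, 50)); // 5  (the median request)
  console.log(percentile(cpuTimesMs, 99)); // 40 (the slow tail; with only 10 samples, p99 is the max)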

What's normal:

App type                               p50        p99
Simple CRUD (KV or D1)                 < 5ms      < 20ms
With API calls to external services    < 50ms     < 200ms
With AI inference (Workers AI)         < 500ms    < 2,000ms

Watch for: p99 creeping above 1,000ms for non-AI apps — that suggests a slow query, a blocking operation, or a hot path that needs optimization. Ask the AI: "The p99 CPU time is high — what might be causing slow requests?"


Duration

Wall-clock time from request received to response sent. Includes CPU time plus anything the Worker was waiting on — network round trips to external APIs, KV reads, D1 queries, AI inference.

Duration is always higher than CPU time. The gap between them is time spent waiting on I/O.

What's normal: 2–5× CPU time is typical when there's external I/O (KV, D1, API calls). A very large gap (e.g., 5ms CPU but 3,000ms duration) usually means a slow external dependency — an API timing out or a D1 query taking longer than expected.
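
To find which dependency is responsible for a large gap, you can log wall-clock timings around each external call. Here's a minimal sketch, with a hypothetical DB binding and URL; in Workers, Date.now() advances across awaited I/O, so the deltas approximate time spent waiting:

  // Wrap each awaited call so its wall-clock cost shows up in Server Logs.
  async function timed<T>(label: string, work: Promise<T>): Promise<T> {
    const start = Date.now();
    const result = await work;
    console.log(`${label} took ${Date.now() - start}ms`);
    return result;
  }

  // D1Database comes from @cloudflare/workers-types; the binding name is hypothetical.
  export async function handleData(env: { DB: D1Database }): Promise<Response> {
    const api = await timed("external API", fetch("https://api.example.com/data"));
    const rows = await timed("D1 query", env.DB.prepare("SELECT * FROM items").all());
    return Response.json({ status: api.status, count: rows.results.length });
  }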


Storage metrics

Only shown if your app uses that storage type.

KV

  • Reads — how many times your app read a value from KV storage
  • Writes — how many times your app wrote or updated a KV value
  • Deletes — how many KV values were removed
  • Lists — how many KV list operations ran (listing keys by prefix)

Watch for: An unexpected spike in writes or lists can indicate a runaway cron job or a polling loop that's writing on every tick instead of only on change.
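
One cheap fix is to make the job compare before it writes. A minimal sketch of the write-only-on-change pattern, with a hypothetical CACHE binding and helper:

  // Cron-driven sync that costs 1 read and 0 writes on ticks where
  // nothing changed, instead of 1 write per tick.
  export async function syncStatus(env: { CACHE: KVNamespace }): Promise<void> {
    const next = await computeStatus();
    const current = await env.CACHE.get("status");
    if (current === next) {
      return; // unchanged, skip the write
    }
    await env.CACHE.put("status", next);
  }

  // Hypothetical stand-in for the job's real work.
  async function computeStatus(): Promise<string> {
    return "ok";
  }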


D1 (SQL database)

  • Read queries — SELECT statements executed
  • Write queries — INSERT, UPDATE, DELETE statements executed
  • Rows read — total rows scanned (not just returned) by queries
  • Rows written — total rows inserted or updated

Watch for: High "rows read" relative to "rows returned" means your queries are doing full or near-full table scans. Ask the AI to add an index if this is happening on a frequently-called route.


R2 (file storage)

  • Gets — file download operations
  • Puts — file upload operations
  • Deletes — file removals

Watch for: High put counts you didn't expect — could mean files are being uploaded repeatedly when they should be cached, or a job is regenerating the same file on every run.
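
A minimal sketch of guarding against redundant uploads with an existence check; the REPORTS binding and key scheme are hypothetical:

  // head() returns null if the object doesn't exist, so the put only
  // runs for genuinely new files instead of on every job run.
  async function ensureReport(env: { REPORTS: R2Bucket }, day: string, body: string): Promise<void> {
    const key = `reports/${day}.csv`;
    const existing = await env.REPORTS.head(key);
    if (existing !== null) {
      return; // already stored, no put needed
    }
    await env.REPORTS.put(key, body);
  }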


Investigating a problem in production

High error rate

  1. Open the Server Logs tab to see live error output — the actual exception message and stack trace will be there
  2. Copy the error and paste it into chat: "Users are seeing errors in production. Here's what the server log shows: [paste]"
  3. If the problem started after a specific deploy, consider rolling back via Versions & Rollback while you fix the root cause

High CPU time or duration

  1. Check whether you recently added a new feature that makes external API calls or runs heavy queries
  2. Ask the AI: "Response times spiked after the last change — what in the code could cause slow requests?"
  3. For D1 performance: check whether "rows read" is far higher than the number of rows your queries actually return; that usually means missing indexes on frequently-filtered columns

Unexpected traffic spike

  1. Check if a cron job is misconfigured — a job set to * * * * * (every minute) instead of 0 * * * * (every hour) generates 60× the expected load
  2. Look at whether the requests are concentrated on one route (visible in Server Logs) or spread across many
  3. If the source is unclear, contact the platform team — they can inspect Cloudflare's request logs with more detail than the in-app metrics show

App appears down (zero requests or total 5xx)

  1. Try opening your production URL directly — if you see a Cloudflare error page rather than your app, the Worker failed to start
  2. Check Build Logs for any failed deploy that may have pushed broken code
  3. If you need to recover quickly, roll back to the last known-good version

For rollback steps, see Versions & Rollback.


Limits and what happens when you hit them

Metric                          Platform limit    What you'll see
CPU time per request            30 seconds        503 / Worker timeout
Memory per request              128 MB            Worker crash, 5xx error
Requests per day (free tier)    100,000           Throttling (unlikely for internal tools)

Most internal tools run well within these limits. If you're hitting CPU time limits, the most common cause is AI inference or a very large data processing loop — ask the AI to break the work into smaller chunks.
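
A minimal sketch of that chunking pattern, using a KV key as a cursor between scheduled runs; all names (JOBS, fetchBatch, processItem) are hypothetical:

  const BATCH_SIZE = 100;

  // Each scheduled run processes one bounded batch, then records where it
  // stopped, so no single invocation approaches the CPU-time limit.
  export async function scheduled(env: { JOBS: KVNamespace }): Promise<void> {
    const cursor = Number((await env.JOBS.get("cursor")) ?? "0");
    const items = await fetchBatch(cursor, BATCH_SIZE);
    for (const item of items) {
      await processItem(item);
    }
    await env.JOBS.put("cursor", String(cursor + items.length));
  }

  // Hypothetical stand-ins for the real data source and per-item work.
  async function fetchBatch(offset: number, limit: number): Promise<string[]> {
    return [];
  }
  async function processItem(item: string): Promise<void> {}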


Questions

Reach out in #eng_product_platform_team if you're seeing unexpected metrics or need help interpreting what you're looking at.

Built by the Veho Developer Platform team