Why are stack traces not enough for debugging distributed systems?
Stack traces are still useful. They tell you many important things, such as where an exception surfaced within a process, which call path led to it, and sometimes even which line of code deserves attention first. While certainly helpful, they become insufficient once a request crosses service boundaries. In debugging distributed systems, the hard part is usually not finding where one service failed. It is understanding how a request moved through several services and which failure caused it, rather than the fallout.
Stack traces are local by design, which is their main limitation. They explain what happened within a single process, but say nothing about what happened before the request reached it.
In a monolith, that local view is often enough. A request enters the app, touches a few modules, throws an exception, and the stack trace gives you a reasonably direct path back to the bug. You can often reproduce it, set a breakpoint, and fix it without much ceremony.
In a multi-service system, the same user action can hit an API gateway, an auth service, a payment service, a queue, a worker, and a database. If the final service throws an exception, the stack trace only shows what happened in that last process. It does not show the request that triggered it, the upstream timeout that shaped the call, or the retries that made things worse.
A few limits become apparent quickly: a stack trace cannot show which upstream call produced the failing input, how long each hop took, or whether the exception in front of you is the root cause or just downstream fallout.
A real example is a checkout request timing out at the edge while the payment service logs a database exception. The API service may only show a generic timeout. The payment service stack trace may point to a failed query. The order service might log that the payment was never completed. None of these views alone explains the entire sequence.
Correlation IDs help, but only up to a point. They group logs from the same request, which is useful. Still, they usually leave you reading a flat timeline across many services. You can search better, but you still do a lot of reconstruction in your head. Tools such as Hud.io are built to go further.
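As a sketch of the correlation-ID idea, a service can reuse an incoming ID header (or mint one), attach it to its log records, and forward it on outbound calls. The header name and helper functions here are illustrative, not from any specific framework:

```python
import logging
import uuid

# Illustrative header name; many systems use X-Request-ID or X-Correlation-ID.
CORRELATION_HEADER = "X-Correlation-ID"

def get_correlation_id(headers: dict) -> str:
    """Reuse the caller's ID when present, otherwise mint a new one."""
    return headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

def handle_request(headers: dict) -> dict:
    corr_id = get_correlation_id(headers)
    # Attach the ID to the log record so every line from this request
    # can later be grouped into one flat per-request timeline.
    logging.info("charging card", extra={"correlation_id": corr_id})
    # Forward the same ID on outbound calls so downstream services
    # join the same group.
    return {CORRELATION_HEADER: corr_id}

out = handle_request({"X-Correlation-ID": "req-123"})
print(out)  # {'X-Correlation-ID': 'req-123'}
```

This gives you searchable grouping, but nothing more: the logs are still a flat list, with no timing tree and no parent-child structure.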
Distributed tracing helps because it models the request rather than just the error. Instead of showing a single stack within a single runtime, it follows a request from service to service using a shared trace ID. Each unit of work becomes a span. That span records timing, parent-child relationships, and a bit of context around what the service was doing.
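A minimal, hand-rolled sketch of that span model (not a real tracing library; real systems use something like OpenTelemetry): every span shares the request's trace ID, records its parent and its timing, and together the spans form the request tree:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One unit of work in a trace: shares the request's trace_id and
    records its parent span, so spans assemble into a per-request tree."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.monotonic()

    @property
    def duration_ms(self) -> float:
        stop = self.end if self.end is not None else time.monotonic()
        return (stop - self.start) * 1000

# One trace ID covers the whole request, however many services it crosses.
trace_id = uuid.uuid4().hex
root = Span("POST /checkout", trace_id)
auth = Span("auth.verify", trace_id, parent_id=root.span_id)
auth.finish()
payment = Span("payment.charge", trace_id, parent_id=root.span_id)
payment.finish()
root.finish()
```

In a real deployment the trace ID travels between services in a propagation header (the W3C `traceparent` header is the common standard), which is what lets each service attach its spans to the same tree.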
That alters the debugging workflow in practice: you can see where the time went, which hop failed first, and how timeouts and retries propagated across services.
Say an HTTP request takes 8 seconds and returns a 500 response. The API service stack trace says a dependency call timed out. This information is helpful, but incomplete. A trace might show that auth took 40 ms, inventory took 120 ms, payment retried 3 times, and the real delay came from a single slow database call inside payment that blocked the rest of the flow. That is a markedly different starting point for debugging.
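Under that model, finding the real delay becomes a query over span records rather than a read through stack frames. A sketch, with made-up data mirroring the example above:

```python
# Hypothetical span records for the 8-second request described above.
spans = [
    {"name": "auth.verify", "duration_ms": 40},
    {"name": "inventory.check", "duration_ms": 120},
    {"name": "payment.charge (retry 1)", "duration_ms": 1900},
    {"name": "payment.charge (retry 2)", "duration_ms": 1950},
    {"name": "payment.db.query", "duration_ms": 3800},
]

# The trace view reduces "why was this request slow?" to a query over timings.
slowest = max(spans, key=lambda s: s["duration_ms"])
print(slowest["name"])  # payment.db.query
```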
This is where distributed tracing becomes useful compared with relying only on a stack trace. It answers questions that stack traces were never meant to answer: which services the request touched, where the time went, whether the failure started here or upstream, and how many retries happened before the error finally surfaced.
This does not make stack traces obsolete. You still need them when you are inside the failing service and want code-level detail. A trace can tell you that payment-service failed while calling a fraud provider. The stack trace inside payment-service still tells you which function blew up and what exception type was thrown.
A better pattern, however, is to treat them as different layers of evidence.
A stack trace still earns its place. It is just too narrow to carry distributed debugging on its own. Once requests cross services, queues, and async workers, you need the request-level view that distributed tracing provides. Then you turn to logs and stack trace details once you know which hop actually deserves your time.