Most teams I work with have the same problem: they’ve instrumented their services with OpenTelemetry, they have traces flowing into their backend, and yet — when something breaks in production — the traces don’t actually help them find the problem.
This isn’t an OpenTelemetry problem. It’s a design problem.
The Instrumentation Trap
When engineers first discover OpenTelemetry, the reaction is almost always the same: instrument everything. Add spans to every function, every database call, every HTTP request. More data means more visibility, right?
Not quite.
The problem with instrumenting everything is that you end up with a trace that looks like a call stack — hundreds of spans, nested five levels deep, each one named after a function or a class method. It’s technically a trace, but it’s not useful observability data.
// What most people do: instrument the implementation
tracer.spanBuilder("UserService._loadUserFromCache").startSpan().use {
user = cache.get(userId)
}
tracer.spanBuilder("UserService._queryDatabase").startSpan().use {
user = db.query("SELECT * FROM users WHERE id = ?", userId)
}
This gives you timing data for your internal functions. That’s rarely what you need when debugging a production incident.
What Spans Should Tell You
A well-designed span should answer a business question, not describe an implementation detail. The key shift is moving from “what is the code doing?” to “what is the system doing?”
Compare these two approaches:
| Bad span name | Good span name |
|---|---|
UserService._loadFromDb | user.fetch |
CacheManager.get | cache.lookup{resource=user} |
HttpClient.execute | GET /api/users/{id} |
JSONSerializer.serialize | (not worth a span) |
The good names describe what happened in terms your team can understand during an incident, not how the code works internally.
The Three Questions Every Span Should Answer
Before adding a span, ask yourself:
- Would this span help me during an incident? If you can’t imagine a scenario where you’d be grateful this span existed at 3am, don’t add it.
- Does this span represent a meaningful unit of work? Spans should map to something a non-engineer could understand — “fetch user”, “send email”, “process payment” — not “call method”.
- Does this span have useful attributes? A span without context is almost useless. If you’re adding a
user.fetchspan, make sure it carriesuser.id,db.name, and whether it was a cache hit.
Attributes Are More Important Than Spans
Here’s the thing most teams get wrong: the value of a trace isn’t the spans themselves — it’s the attributes attached to those spans.
tracer.spanBuilder("user.fetch").startSpan().use { span ->
span.setAttribute("user.id", userId)
span.setAttribute("db.name", "users_primary")
span.setAttribute("cache.hit", false)
val user = db.fetchUser(userId)
span.setAttribute("user.tier", user.subscriptionTier)
span.setAttribute("user.region", user.region)
user
}
Now when you’re debugging a latency spike, you can filter by user.region = "eu-west" or user.tier = "free" and immediately narrow down whether the issue affects all users or a specific segment.
A Practical Heuristic
When reviewing instrumentation, I use this rule: one span per I/O boundary, plus spans for significant business operations.
- Every external HTTP call: one span
- Every database query (not each query in a loop): one span per logical operation
- Every cache operation (check + set): one span
- Payment processing, order creation, email sending: one span each
- Internal method calls, JSON serialization, string manipulation: no spans
This keeps your traces readable and your trace storage costs manageable.
A well-structured Kotlin service might look like this:
class OrderService(private val tracer: Tracer) {
fun processOrder(orderId: String, userId: String): Order {
// ONE span for the business operation
return tracer.spanBuilder("order.process")
.startSpan()
.use { span ->
span.setAttribute("order.id", orderId)
span.setAttribute("user.id", userId)
// Internal methods — no extra spans needed
val order = fetchOrder(orderId)
validateOrder(order)
// ONE span per external I/O boundary
val payment = chargePayment(order) // has its own span inside
notifyWarehouse(order) // has its own span inside
span.setAttribute("order.amount", order.totalAmount)
span.setAttribute("payment.id", payment.id)
order
}
}
private fun chargePayment(order: Order): Payment {
return tracer.spanBuilder("payment.charge")
.setSpanKind(SpanKind.CLIENT)
.startSpan()
.use { span ->
span.setAttribute("payment.provider", "stripe")
span.setAttribute("payment.amount", order.totalAmount)
paymentGateway.charge(order)
}
}
}
Conclusion
Effective observability isn’t about collecting more data — it’s about collecting the right data. Start by defining what questions you need to answer during incidents, then design your instrumentation to answer those questions.
If you can look at a trace during a production incident and immediately understand what the system was doing, which external calls were slow, and what user context was involved — your instrumentation is good. If you’re staring at a wall of nested function names, start over.