Service Maps are a lie

Service Maps are a common feature of APM solutions today. They are marketed as tools to help you understand the communication graph of your services. However, the reality is that they simply lack the context to give you accurate information and can even lead you to misunderstand the flow of requests in your system.

A simple example

Here is a graph of flights between three cities. The graph says that 200 passengers fly from San Francisco to New York. It doesn't say how many of those 200 started in Seattle and are only connecting through San Francisco. Looking at this graph, we might also conclude that everyone leaving Seattle is headed only to San Francisco, when in reality some of them are traveling on to New York with San Francisco as their connection.

We have connected two disjoint pieces of information, flights leaving San Francisco and flights leaving Seattle, without any context. With that context added in the diagram below, things are much clearer.

We can now see that of the 100 passengers flying from Seattle to San Francisco, 80 have a connecting flight to New York. We now know exactly how many passengers arrive in New York from each city.
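The difference between the two diagrams can be sketched in a few lines of Python. This is only an illustration: the 200, 100 and 80 figures come from the example above, while the 20 and 120 are derived from them (100 - 80 and 200 - 80) on the assumption that Seattle and San Francisco are the only origins. The function names are made up for this sketch.

```python
from collections import Counter

# Each itinerary keeps its full route, which is the context the first graph lacks.
# 80 comes from the example; 20 and 120 are derived (100 - 80 and 200 - 80).
itineraries = [
    (("Seattle", "San Francisco", "New York"), 80),
    (("Seattle", "San Francisco"), 20),
    (("San Francisco", "New York"), 120),
]


def edge_counts(itineraries):
    """Collapse itineraries into per-leg totals, as the first graph does."""
    counts = Counter()
    for route, passengers in itineraries:
        for src, dst in zip(route, route[1:]):
            counts[(src, dst)] += passengers
    return counts


def arrivals_by_origin(itineraries, destination):
    """Count passengers ending at `destination`, grouped by where they started."""
    counts = Counter()
    for route, passengers in itineraries:
        if route[-1] == destination:
            counts[route[0]] += passengers
    return counts


edges = edge_counts(itineraries)
origins = arrivals_by_origin(itineraries, "New York")

print(dict(edges))    # per-leg totals: all the first graph can show
print(dict(origins))  # per-origin totals: needs the full route
```

The per-leg totals can be computed from the itineraries, but the itineraries cannot be recovered from the per-leg totals. That lost information is exactly what the second diagram restores.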

Here is a Service Map from Datadog. Looking at it alone, we cannot answer questions such as:

  • How many calls to postgresql are made with requests that came only from du-router?
  • How do the calls change over time?
  • Are there direct requests to du-coord, ones that do not come via another service, that result in calls to postgresql?
  • Is du-coord running a periodic job that makes calls to postgresql or du-indexer?
  • Do incoming requests from du-indexer result in separate calls being made back to du-indexer (resulting in a cycle) or are calls to du-indexer from du-coord only made on incoming requests via du-router?

The Service Map has simply connected two disjoint pieces of information:

  1. There are 2 services (du-router and du-indexer) that talk to du-coord.
  2. du-coord talks to 2 different services (postgresql and du-indexer).

The map is showing correlation when we are looking for causation.
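To make this concrete, here is a minimal sketch showing that two very different request flows collapse into the same map. Only the service names come from the map above. The request paths and counts are invented for illustration, and service_map is a hypothetical helper, not how Datadog or any other APM actually builds its map.

```python
from collections import Counter


def service_map(traces):
    """Aggregate request paths into caller -> callee edge counts."""
    edges = Counter()
    for path, count in traces:
        for caller, callee in zip(path, path[1:]):
            edges[(caller, callee)] += count
    return edges


# Hypothetical scenario A: only du-router requests reach postgresql,
# and every du-indexer request causes a call back to du-indexer (a cycle).
scenario_a = [
    (("du-router", "du-coord", "postgresql"), 10),
    (("du-indexer", "du-coord", "du-indexer"), 5),
]

# Hypothetical scenario B: du-router requests split between postgresql
# and du-indexer, and du-indexer requests go to postgresql (no cycle).
scenario_b = [
    (("du-router", "du-coord", "postgresql"), 5),
    (("du-router", "du-coord", "du-indexer"), 5),
    (("du-indexer", "du-coord", "postgresql"), 5),
]

map_a = service_map(scenario_a)
map_b = service_map(scenario_b)

print(map_a == map_b)  # True: both flows draw the identical map
```

Because both scenarios produce the same edges with the same counts, no amount of staring at the map can tell you which one your system is actually running.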

The Service Map simply doesn't have enough information for you to understand and debug the flow of requests through your system. It is also only a point-in-time graph, so it doesn't show you how communication between entities changed over time.