Service Maps are a lie
Service Maps are a common feature of APM solutions today. They are marketed as tools to help you understand the communication graph of your services. However, the reality is that they simply lack the context to give you accurate information and can even lead you to misunderstand the flow of requests in your system.
A simple example
Here is a graph of flights between 3 cities. The graph says there are 200 passengers flying from San Francisco to New York. It doesn't say how many of those 200 passengers started in Seattle and are only connecting through San Francisco. Looking at this graph, we might also assume that everyone leaving Seattle is only going to San Francisco, when in reality some of them may be traveling on to New York with San Francisco as their connection.
We have connected 2 disjointed pieces of information, flights leaving San Francisco and flights leaving Seattle, without any context. With that context added in the diagram below, things are much clearer.
We can now see that of the 100 Seattle passengers flying to San Francisco, 80 have a connecting flight to New York. We now know exactly how many passengers are arriving in New York from each city.
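The difference between the two diagrams can be sketched in a few lines of Python. This is only an illustration: the `itineraries` structure and both helper functions are made up for this example, and the 20 and 120 figures are derived from the numbers above (100 passengers leave Seattle, 80 of them connect onward, and 200 fly from San Francisco to New York).

```python
from collections import Counter

# Each itinerary is (path, passengers). Hypothetical data derived from the
# flight example: 80 of the 100 Seattle passengers connect to New York, so
# 120 of the 200 San Francisco -> New York passengers start in San Francisco.
itineraries = [
    (("Seattle", "San Francisco"), 20),
    (("Seattle", "San Francisco", "New York"), 80),
    (("San Francisco", "New York"), 120),
]


def edge_counts(itineraries):
    """Context-free view: passengers per individual flight leg."""
    counts = Counter()
    for path, passengers in itineraries:
        # Split each path into consecutive (from, to) legs.
        for leg in zip(path, path[1:]):
            counts[leg] += passengers
    return counts


def arrivals_by_origin(itineraries, destination):
    """Context-aware view: passengers reaching a destination, by starting city."""
    counts = Counter()
    for path, passengers in itineraries:
        if path[-1] == destination:
            counts[path[0]] += passengers
    return counts


legs = edge_counts(itineraries)
origins = arrivals_by_origin(itineraries, "New York")

print(dict(legs))     # per-leg totals, as in the first graph
print(dict(origins))  # per-origin totals, as in the second diagram
```

The per-leg totals can always be computed from the full itineraries, but not the other way around: once the paths are collapsed into legs, the split by origin is gone.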
Here is a Service Map from Datadog.
- How many calls to
postgresql are made with requests that came only from
- How do the calls change over time?
- Are there direct requests to
du-coord made, that do not come via another service, but that result in calls to
du-coord running a periodic job that makes calls to
- Do incoming requests from
du-indexer result in separate calls being made back to
du-indexer (resulting in a cycle) or are calls to
du-coord only made on incoming requests via
The Service Map has simply connected two disjointed pieces of information:
- There are 2 services (
du-indexer) that talk to
- du-coord talks to 2 different services (
The map is showing correlation when we are looking for causation.
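To make the correlation point concrete, here is a small Python sketch in which two different request flows collapse into exactly the same map. Everything in it is an assumption for illustration: the call paths are invented, `frontend` and `cache` are hypothetical service names that do not come from the Datadog map, and `service_map` and `calls_from_entry` are helpers written for this example, not part of any APM product.

```python
def service_map(call_paths):
    """Collapse full call paths into the set of caller -> callee edges."""
    return {edge for path in call_paths for edge in zip(path, path[1:])}


def calls_from_entry(call_paths, entry, callee):
    """Count requests that entered via `entry` and went on to reach `callee`."""
    return sum(
        1 for path in call_paths
        if path[0] == entry and callee in path[1:]
    )


# Scenario A (hypothetical): requests from du-indexer never reach postgresql.
scenario_a = [
    ("frontend", "du-coord", "postgresql"),
    ("du-indexer", "du-coord", "cache"),
]

# Scenario B (hypothetical): only requests from du-indexer reach postgresql.
scenario_b = [
    ("frontend", "du-coord", "cache"),
    ("du-indexer", "du-coord", "postgresql"),
]

same_map = service_map(scenario_a) == service_map(scenario_b)
a_count = calls_from_entry(scenario_a, "du-indexer", "postgresql")
b_count = calls_from_entry(scenario_b, "du-indexer", "postgresql")

print(same_map)          # both scenarios draw the same four edges
print(a_count, b_count)  # yet the causal answer differs between them
```

Both scenarios produce an identical set of edges, so the map alone cannot say which one describes your system. Answering the question requires the per-request path, which is the context the map has thrown away.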
The Service Map simply doesn't have enough information for you to understand and debug the flow of requests through your system. It is also only a point-in-time graph, so it doesn't show you how the communication between entities changed over time.