Introspection

So far, we have discussed many tasks that the orchestrator is responsible for and that it can execute completely autonomously. But human operators also need to be able to see and analyze what is currently running on the cluster, and to know the state and health of the individual applications. For all this, we need introspection. The orchestrator needs to surface crucial information in a way that is easy to consume and understand.

The orchestrator should collect system metrics from all the cluster nodes and make them accessible to the operators. Metrics include CPU, memory, and disk usage, network bandwidth consumption, and more. The information should be easily available on a node-per-node basis, as well as in an aggregated form.
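
To make the difference between the node-per-node view and the aggregated view concrete, here is a minimal sketch in Python. It is not tied to any particular orchestrator: the NodeMetrics type, its field names, and the aggregate function are hypothetical names chosen for illustration, and the sample values are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class NodeMetrics:
    """One metrics sample for a single cluster node (hypothetical shape)."""
    node: str
    cpu_percent: float
    memory_percent: float


def aggregate(samples):
    """Reduce per-node samples to a cluster-wide summary."""
    if not samples:
        raise ValueError("at least one node sample is required")
    busiest = max(samples, key=lambda s: s.cpu_percent)
    return {
        "nodes": len(samples),
        "avg_cpu_percent": mean(s.cpu_percent for s in samples),
        "avg_memory_percent": mean(s.memory_percent for s in samples),
        "busiest_node": busiest.node,
    }


# Made-up sample data for three nodes.
samples = [
    NodeMetrics("node-1", cpu_percent=20.0, memory_percent=30.0),
    NodeMetrics("node-2", cpu_percent=40.0, memory_percent=50.0),
    NodeMetrics("node-3", cpu_percent=90.0, memory_percent=70.0),
]
summary = aggregate(samples)
```

The list of samples is the node-per-node view, while the summary is the aggregated one. A real orchestrator would typically also keep a time series per node rather than a single sample.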

We also want the orchestrator to give us access to the logs produced by service instances or containers. Beyond that, the orchestrator should give us exec access to each and every container, provided we have the correct authorization to do so. With exec access to containers, we can then debug misbehaving containers.

In highly distributed applications, where each request to the application goes through numerous services until it is completely handled, tracing requests is a really important task. Ideally, the orchestrator supports us in implementing a tracing strategy or gives us some good guidelines to follow.

Finally, human operators can best monitor a system when working with a graphical representation of all the collected metrics and logging and tracing information. Here, we are speaking about dashboards. Every decent orchestrator should offer at least some basic dashboard with a graphical representation of the most critical system parameters.

But human operators are not the only ones concerned with introspection. We also need to be able to connect external systems to the orchestrator so that they can consume this information. There needs to be an API through which external systems can access data such as cluster state, metrics, and logs, and use this information to make automated decisions, such as creating pager or phone alerts, sending out emails, or triggering an alarm siren if the system exceeds certain thresholds.
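
The kind of automated decision described above can be sketched as a simple threshold check. This is only an illustration under stated assumptions: the function name check_thresholds, the metric names, and the threshold values are hypothetical, and in a real setup the metrics dictionary would be filled from the orchestrator's API rather than hard-coded.

```python
def check_thresholds(metrics, thresholds):
    """Return one alert message for every metric above its threshold.

    Both arguments map a metric name to a number. Metrics without a
    configured threshold, and thresholds without a reported metric,
    are ignored.
    """
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts


# Made-up values standing in for data fetched from the orchestrator.
current = {"cpu_percent": 95.0, "memory_percent": 40.0}
limits = {"cpu_percent": 90.0, "memory_percent": 80.0}

alerts = check_thresholds(current, limits)
for message in alerts:
    # A real integration would page an operator or send an email here.
    print(message)
```

An external system would run a check like this periodically and forward each returned message to whatever alerting channel it uses.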