Observability and Telemetry: Why IBM i Shops Should Care
Dawn May outlines IBM i's unique strengths in observability and telemetry while exploring ways to extend those capabilities to other environments
When the AS/400 was introduced in the late 1980s, its advertised strength was applications. Everything needed to build great business applications shipped with the system, most notably the database. Today, the strength of IBM i is the integrated database. While the application model has expanded, the database remains.
In many enterprises, the application model is now distributed: Application servers and web servers are built external to IBM i but are entirely dependent on the data that resides on IBM i. IBM i is also surrounded by other technologies that IT shops use to manage this distributed network of systems, such as security monitoring, performance monitoring, availability monitoring and AI. COMMON recognizes this fact and allocated a track at PowerUp2026: Technologies Integrated with IBM i.
Looking at the MELT Data
As a performance person, I’m interested in monitoring solutions, and also in the word “observability.” IBM’s documentation states that “Observability is the ability to understand the internal state or condition of a complex system based solely on knowledge of its external outputs, specifically its telemetry.”
Now we have another word, telemetry. IBM states that “Telemetry is the automated collection and transmission of data and measurements from distributed or remote sources to a central system for monitoring, analysis and resource optimization.” This is referred to as MELT data—metrics, events, logs and traces.
Observability and telemetry are differentiators for IBM i, although the words we use tend to be different. We don’t talk about telemetry; we talk about messages, job logs and performance data. Let’s quickly review IBM i’s MELT data.
Metrics – There are four performance data collectors: Collection Services, Performance Explorer, Job Watcher and Disk Watcher. These collectors efficiently gather and store an insane amount of metric data. Many messages carry metric data as well (e.g., CPF1164). The SQL Plan Cache and Db Monitor facilities belong here too. This data is automatically collected and managed by the system; there is nothing an application developer needs to do to collect it. Optionally, a developer could add metrics specific to their application, but often what the system collects automatically is sufficient.
Events – Messages are event data. QSYSOPR, QSYSMSG, the history log and ordinary message queues all contain event data (see the example query after this list).
Logs – Job logs and the history log fall here, along with problem activity logs and Licensed Internal Code logs. IBM i also generates logs similar to those of LUW systems, such as Apache log files or other logs for open-source software.
Traces – IBM i features extensive built-in tracing facilities: job trace, communications trace, TCP trace, trace internal and others.
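As a quick illustration of how accessible this data is, here is a minimal sketch that pulls the metric-laden job-end message (CPF1164) from the history log using the QSYS2.HISTORY_LOG_INFO service. The 24-hour window is just an example value.

```sql
-- Event and log data via SQL: job-end messages (CPF1164) from the
-- history log over the last 24 hours (the window is illustrative).
SELECT MESSAGE_TIMESTAMP,
       FROM_JOB,
       MESSAGE_TEXT
  FROM TABLE(QSYS2.HISTORY_LOG_INFO(
               START_TIME => CURRENT TIMESTAMP - 24 HOURS)) AS H
 WHERE MESSAGE_ID = 'CPF1164'
 ORDER BY MESSAGE_TIMESTAMP DESC;
```

The text of each CPF1164 message includes the job’s CPU time and other end-of-job details, a nice example of events and metrics overlapping.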
Extending IBM i’s Observability
The IT industry deals with many distributed systems that have limited built-in diagnostic capabilities, but IBM i is not one of them. A whole ecosystem of applications and technologies has been developed to extend observability and telemetry across these environments.
The challenge as I see it is that IBM i often sits in the middle of these environments as a backend database server, doing what it does very well. In fact, it works so well that little attention is often paid to it. However, as a database server, it is a critical component of the business application. Shouldn’t IBM i be more visible?
“Observability” applications are numerous. IBM has Instana; open-source examples include Prometheus (for telemetry data) and Grafana; commercial tools include Datadog and Dynatrace; and there are homegrown utilities for monitoring the system. The list goes on. The problem with these solutions is how they capture telemetry data from IBM i: most of them use the various QSYS2 services to collect system information. It’s easy, but it’s resource-heavy, particularly when the same information is retrieved repeatedly. Retrieving information about top-consuming jobs is especially expensive.
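For example, a monitor that wants the top CPU consumers typically runs something like the following query against the QSYS2.ACTIVE_JOB_INFO service; the ten-row cutoff and the elapsed-CPU ordering here are my choices, not a requirement. Convenient as it is, every invocation has to materialize information for the active jobs on the partition, which is exactly the overhead that adds up when a monitor polls every few seconds.

```sql
-- Easy but resource-intensive: the current top CPU consumers.
-- RESET_STATISTICS => 'YES' starts a fresh elapsed-statistics window.
SELECT JOB_NAME,
       AUTHORIZATION_NAME,
       ELAPSED_CPU_PERCENTAGE,
       ELAPSED_TOTAL_DISK_IO_COUNT
  FROM TABLE(QSYS2.ACTIVE_JOB_INFO(RESET_STATISTICS => 'YES')) AS A
 ORDER BY ELAPSED_CPU_PERCENTAGE DESC
 FETCH FIRST 10 ROWS ONLY;
```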
Rather than relying on the easy-to-use but resource-intensive QSYS2 services, direct queries over the Collection Services data can deliver a more efficient solution, because Collection Services has already gathered the data at low cost. However, this requires deep knowledge of the Collection Services database files. That performance data can also be harvested to build a signature of your system’s behavior.
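Here is a minimal sketch of that approach, assuming the default collection library QPFRDATA. QAPMJOBMI is the job-level interval file; the member name, the job name and the exact column names below are illustrative, so verify the field names against the Collection Services data files documentation for your release.

```sql
-- Collection Services files are multi-member; expose one collection
-- (member) through an alias. The member name Q123000002 is hypothetical.
CREATE OR REPLACE ALIAS QTEMP.JOBMI FOR QPFRDATA.QAPMJOBMI (Q123000002);

-- One row per job per collection interval. INTNUM is the interval
-- number and DTETIM the interval timestamp; JBNAME/JBUSER/JBNBR
-- identify the job. MYJOB is a placeholder.
SELECT INTNUM, DTETIM, JBNAME, JBUSER, JBNBR
  FROM QTEMP.JOBMI
 WHERE JBNAME = 'MYJOB'
 ORDER BY INTNUM;
```

Because this reads data Collection Services has already gathered, the monitoring tool adds no new instrumentation load, and interval-by-interval history like this is exactly the raw material for building that behavioral signature.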
Using AI/ML, newer tools could determine your performance signature and proactively identify changes or anomalies before they become a crisis. With AI/ML technologies advancing quickly, I’m looking forward to seeing what products will appear in the marketplace.