Application stability has become an increasingly frequent concern for companies, especially when we talk about high-value applications. Every time a core application stops working, money is lost or stops being made. Because of that, telemetry for applications is being discussed more and more often. But what is telemetry for software, actually, and how do you get benefits from this practice?
What is telemetry?
Telemetry is the act of measuring something remotely and automatically.
In software, telemetry is already very easy to find. Some simple examples are the Chrome platform; operating systems such as Windows, OS X, Android, iOS and Sony’s PlayStation system software; and mobile and desktop apps such as Spotify and Microsoft Office. While they operate, these products gather the data that matters to how they work. Then, naturally only if you allow it, they send this data to their makers (Google, Microsoft, Apple, etc.). Once the data has been collected, the next step is to analyze it and change things where necessary. The core intention is to improve the systems so that they behave properly across many different environments.
So the main idea of telemetry is to operate, gather data, analyze it, and then improve the system’s code to reach a better behavior.
Telemetry can bring a lot of value to the business. Let’s explore an example. Imagine you have an app running with a non-interactive FAQ screen for your users. Users who find what they were looking for on that screen no longer need to call your call center, which saves your company money. Now imagine that one of the answers on this screen (a quick how-to video) stops being shown for some reason. If nothing is checking this screen’s health, the failure will be hard to notice, because nobody browses your systems 24/7. You will then depend on the goodwill of some user to tell you that the screen stopped working. That will happen eventually, but until then your call center will receive more and more calls. That is money wasted because of badly behaving software.
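As an illustration of what "checking this screen's health" could look like, here is a minimal sketch in Python. It assumes, purely for the example, that the how-to video is embedded with a `<video>` tag and that you already have the page's HTML as a string; the names `VideoFinder` and `faq_video_status` are hypothetical. It only confirms that the markup for the video is still there, not that the video actually plays.

```python
from html.parser import HTMLParser


class VideoFinder(HTMLParser):
    """Collects the src of every <video> tag and of <source> tags nested in one."""

    def __init__(self):
        super().__init__()
        self.sources = []
        self._in_video = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "video":
            self._in_video = True
            if attributes.get("src"):
                self.sources.append(attributes["src"])
        elif tag == "source" and self._in_video and attributes.get("src"):
            self.sources.append(attributes["src"])

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False


def faq_video_status(page_html):
    """Return "green" when the page still embeds a video, "red" otherwise."""
    finder = VideoFinder()
    finder.feed(page_html)
    return "green" if finder.sources else "red"
```

A scheduled job could fetch the FAQ page, pass its HTML to `faq_video_status`, and raise an alert on "red", so that the problem is noticed before the call center is.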
Telemetry can be used to check many levels of service operation:
- Very low-level information, like each machine’s CPU and memory. Are they green?
- Does the number of machines that are up match what your history says is needed to support 1, 2 or 3 thousand simultaneous users?
- Are the core web services that your customers access all the time up?
- Is the final screen that users use to log in up?
Then, when a telemetry practice is watching the important things in your application, you will be able to take action and prevent problems from happening.
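The first two levels in the list above can be sketched in a few lines of Python. Everything numeric here is an assumption made for illustration: the 80% limit for CPU and memory and the `BASELINE` table of users versus machines are hypothetical, and in practice they would come from your own history.

```python
import bisect

# Hypothetical baseline taken from history:
# (concurrent users, machines that were needed to support them).
BASELINE = [(1000, 2), (2000, 4), (3000, 6)]


def resource_status(cpu_percent, memory_percent, limit=80.0):
    """Return "green" while both CPU and memory usage are at or below the limit."""
    return "green" if cpu_percent <= limit and memory_percent <= limit else "red"


def machines_expected(concurrent_users):
    """Return the machine count of the smallest baseline tier that covers the load."""
    tiers = [users for users, _ in BASELINE]
    index = bisect.bisect_left(tiers, concurrent_users)
    # Loads above the largest known tier fall back to that largest tier.
    index = min(index, len(BASELINE) - 1)
    return BASELINE[index][1]


def fleet_status(machines_up, concurrent_users):
    """Return "green" when at least the expected number of machines is up."""
    return "green" if machines_up >= machines_expected(concurrent_users) else "red"
```

For example, `fleet_status(3, 2000)` is "red" under this baseline, because history says four machines were needed for two thousand users. The web service and login screen checks follow the same idea: reduce each check to a green or red answer.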
How to start?
A good telemetry implementation depends on the amount of information you want to store and analyze. The more information you have, the more infrastructure and knowledge you will need to process it all. It can even mean using big data tools. But let’s look at another simple example: a system that receives vehicle security data from a company that sells insurance policies. A good way to start is as follows:
- Identify why to measure: let’s measure because it is a core service that tells our customers how their loads are being transported over the roads;
- Be sure of the related goals: the main goal is to keep the entire system running without downtime, because every time this application stops, the contract allows the customers not to pay for that downtime;
- Identify what to measure: let’s check if all of our many data inputs are up. Let’s also check if the inputs are sending the same amount of data as they usually do;
- Set a strategy for measuring: let’s reach every endpoint of the many data inputs. If an endpoint is available, it’s healthy. If it is not, it’s red. Then let’s read the amount of data received in the last minute. If it’s around the known number, it’s healthy;
- Set an analysis strategy: can we automate anything? If one of the endpoints is down, is it useful to restart the operating system, container or application server of the load balancer or the microservices? Or will it have to be shown on a dashboard for a human to analyze?
- Implement ways to gather the data: let’s create the code to gather data and take actions, or to show it;
- Show it! Now it’s time to show it. Is it useful to create a chart? Let’s show it using colors, so people can easily identify when there is a problem. If we need results fast, a good first small step (an MVP) could be opening a ticket with the infrastructure team;
- Analyze: this is the most important step. It’s time to be critical and identify the root cause of the problem. Do not focus on the problem itself, but on why it is happening. Do we have problems with the sources that send data to us? Is the problem on our side? If yes, is our application running the way it should? Do we have to change something in our development process?
- Take actions: through code or not, solve the things you found in all the steps above.
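The measuring and showing steps above can be sketched as follows. This is only a sketch under stated assumptions: the `DataInput` type, the 20% tolerance used for "around the known number", and the choice to mark an unusual volume as red are all hypothetical. The `probe` callable stands in for whatever really reaches the endpoint (an HTTP request, a socket connection, and so on).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataInput:
    name: str
    probe: Callable[[], bool]     # returns True when the endpoint answers
    received_last_minute: int     # records received in the last minute
    usual_per_minute: int         # the "known number" taken from history


def volume_is_usual(received, usual, tolerance=0.2):
    """True when the received volume is within +/- tolerance of the usual one."""
    return abs(received - usual) <= tolerance * usual


def status_of(data_input, tolerance=0.2):
    """Return "green" when the endpoint is reachable and its volume is usual."""
    try:
        reachable = data_input.probe()
    except Exception:
        # A probe that fails with an error counts as an unavailable endpoint.
        reachable = False
    if not reachable:
        return "red"
    usual = volume_is_usual(
        data_input.received_last_minute, data_input.usual_per_minute, tolerance
    )
    return "green" if usual else "red"


def report(inputs: List[DataInput]) -> Dict[str, str]:
    """Map every data input name to its color, ready for a dashboard."""
    return {data_input.name: status_of(data_input) for data_input in inputs}


def needs_ticket(statuses: Dict[str, str]) -> List[str]:
    """List the inputs that are red, for example to open a ticket for each."""
    return [name for name, color in statuses.items() if color == "red"]
```

Keeping the probe as a parameter means the same logic can be exercised with fake probes first and pointed at the real endpoints later. Whether a red input triggers an automatic restart or only a ticket is the decision made in the analysis strategy step.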
Telemetry can bring a lot of value to the business. It gives you the intelligence to act before things happen. If your business is critical, that can easily mean a lot of money. This is already common practice in many of our administrative areas, as in the PDCA mindset, so why not do the same for our software?