


Production Operations and Development

Agile methods require agile operations

One of the significant benefits of agile software development methods is that at the end of each short iteration, a new version of the program is complete and can be deployed to production. New functionality can thus be put to use quickly.

In traditional software development, production deployments may have occurred once a year, whereas in agile software development they can be made even daily. This is why agile methods place new requirements on production operations, maintenance and control.

Operability is established during development

The fact that the system is built for production use must be considered during system development. Error management and error logs must be given special attention in development. A system will always have errors, because all possible cases can never be exhaustively tested in unit, integration and acceptance testing. An error can also occur “outside” the system, in back end systems, network traffic, name services, etc.

It makes sense to build a system so that an error in one back end system, for example, does not prevent the use of the entire system. The system should log all runtime exceptions. In addition, it often makes sense to log the calls made to back end systems, and the answers received, in more detail. The most critical functionality, such as online payments, should be logged in even greater detail.
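
As an illustration of the two ideas above (log every back end call and its answer, and keep one failing back end from taking down the whole page), here is a minimal sketch. The names call_backend, fetch_profile, fetch_balance and build_page are hypothetical and not part of any particular system; a real implementation would also need timeouts and care about logging sensitive data.

```python
import logging
import time

logger = logging.getLogger("backend_calls")


def call_backend(name, func, *args, fallback=None):
    """Call one back end, log the call and its answer, and isolate failures."""
    started = time.monotonic()
    try:
        result = func(*args)
    except Exception:
        # logger.exception logs at ERROR level and includes the traceback.
        logger.exception("back end %s failed, args=%r", name, args)
        return fallback
    elapsed_ms = (time.monotonic() - started) * 1000
    logger.info("back end %s ok in %.1f ms, args=%r, answer=%r",
                name, elapsed_ms, args, result)
    return result


def fetch_profile(account_id):
    # Hypothetical back end that works.
    return {"id": account_id, "name": "Example User"}


def fetch_balance(account_id):
    # Hypothetical back end that is currently down.
    raise ConnectionError("balance service unreachable")


def build_page(account_id):
    # The page is still built when the balance back end is down.
    return {
        "profile": call_backend("profile", fetch_profile, account_id),
        "balance": call_backend("balance", fetch_balance, account_id,
                                fallback="unavailable"),
    }
```

With these assumptions, build_page still returns the profile data when the balance service raises an error, and the failure ends up in the log with its traceback.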

Mechanisms for monitoring the internal state and behavior of the system must be built during the development phase. This development work should be done in cooperation with operations, helping them learn about the system at the same time. It is important that the operations team is involved in development early on, particularly if the development team at some point moves on to new projects and further development is left entirely to operations. Operations must have access to source code and all available documentation (such as wiki pages, discussion areas and the error database). Communication between developers and operations can be improved with instant messaging tools (e.g. Skype or IRC).
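
One simple form such a mechanism can take is a set of counters that the application updates and that operations can read at any time. The sketch below is only an assumption about how this might look; the class name SystemStatus and the counter names are hypothetical.

```python
import json
import threading
import time
from collections import Counter


class SystemStatus:
    """Thread-safe counters that the application updates and operations reads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = Counter()
        self._started = time.time()

    def increment(self, name, amount=1):
        with self._lock:
            self._counters[name] += amount

    def snapshot(self):
        # Copy under the lock so readers never see a half-updated state.
        with self._lock:
            counters = dict(self._counters)
        uptime = round(time.time() - self._started, 1)
        return {"uptime_seconds": uptime, "counters": counters}

    def to_json(self):
        # A JSON view is easy to serve from a status page or feed to a monitor.
        return json.dumps(self.snapshot(), sort_keys=True)
```

The application would call, for example, status.increment("payments.failed") where a payment fails, and a status page or monitoring agent would read status.to_json().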

Frequent production deployments decrease risks

Contrary to what is often thought, frequent production deployments decrease the risk of problems. When deployments are more frequent, each one contains smaller changes. At the same time, the need to automate the process is noticed quickly, and automation in turn decreases the chance of human error.

There are many open source solutions for production deployments (e.g. Fabric, Capistrano, Chef, Puppet), but what matters most is that the deployment process is simple, straightforward and can be repeated identically each time.
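
Independently of the tool chosen, the core of a repeatable deployment is a fixed list of steps that always runs in the same order and stops at the first failure. The sketch below shows only that idea; the deploy function and the step names in the usage note are hypothetical, and it is not the API of any of the tools mentioned above.

```python
def deploy(steps, log=print):
    """Run named steps in a fixed order and stop at the first failure.

    `steps` is a list of (name, callable) pairs. Returns a tuple of
    (success flag, names of the steps that completed).
    """
    completed = []
    for name, action in steps:
        log(f"running step: {name}")
        try:
            action()
        except Exception as error:
            log(f"step {name} failed: {error}")
            return False, completed
        completed.append(name)
    return True, completed
```

A step list might read [("upload package", upload), ("run migrations", migrate), ("restart service", restart)], where each callable is defined elsewhere. Because the list is data, every deployment runs exactly the same steps.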

Monitoring and automation speed up problem resolution

Because production deployments are made frequently, even daily, nobody has time to follow the application logs by hand. Problems must nevertheless be recognized as soon as they occur, or preferably before they become serious. This is why the best practice is to collect and save relevant data continuously and automate all possible monitoring. Several commercial and open source products are available for monitoring (Zabbix, Nagios, Ganglia, Cacti, etc.).

When identifying the cause of an error, it is often important to know the state of the system at that particular moment. If operations start investigating 30 minutes after the event, for example, the system may already have recovered, and finding out causes and consequences can be difficult. When monitoring is continuous and automatic, the circumstances in which the error happened can be viewed afterwards. This information is relevant when solving the problem, but also for preventing problems: we can set alarms for certain limit values and possibly take action before the situation gets worse.

When an error occurs in a production environment, the investigation must start immediately, the cause must be identified quickly, a temporary solution must be put in place as fast as possible, and work on a proper correction must begin at once. At the same time, users must be informed about the problem, its extent and effects, as well as about the schedule for resolution.

Decisions for problem resolution are often needed from the business operations side: how serious the problem is, how its correction is prioritized and what the immediate actions are. Solving the problem can therefore not be left to operations alone, and help from at least the developers, customer service and business operations is likely to be needed. In order to resolve the situation fast, communication must be swift and the different parties must work together regardless of organizational boundaries.
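
Alarms on limit values can be sketched as follows. This is a minimal illustration, assuming each metric has a warning limit and a critical limit; the Threshold type, the evaluate function and the metric names are hypothetical, and the monitoring products mentioned above have their own configuration formats for this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    warning: float
    critical: float


def evaluate(samples, thresholds):
    """Return (metric, level, value) for every sample at or above a limit."""
    alarms = []
    for threshold in thresholds:
        value = samples.get(threshold.metric)
        if value is None:
            continue  # no data for this metric in this round
        if value >= threshold.critical:
            alarms.append((threshold.metric, "critical", value))
        elif value >= threshold.warning:
            alarms.append((threshold.metric, "warning", value))
    return alarms
```

The warning level is what allows action before the situation gets worse: a disk at 85 % can raise a warning long before it reaches a critical 95 %.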

The significance of processes and tools in monitoring is considerable, but the most important factors are the people and company culture. The attitude towards mistakes is essential: The fact is that mistakes happen and cannot be fully prevented. However, reacting to mistakes can be improved and mistakes can be learnt from.

Monitoring to bring transparency to production

All data collected in monitoring must be available to developers easily and in real time, so that they can follow the status and reliability of the system, and the effects of production deployments on performance, for example. This should be in the interests of every self-respecting developer. It makes sense to create separate views for the representatives of business operations, so that they can view the information that is important to them (such as number of visitors or subscriptions) and use this information to support their decisions.

In the longer term, the collected data can be used in capacity planning. By studying the graphs it is possible to predict when more disk space, memory, CPU or servers will be needed. This is an important cost factor: server hardware ages fast, so postponing a purchase is likely to improve its price-quality ratio. On the other hand, a purchase decision made too late can cause major problems and financial losses. Collected data is the only reliable way to predict future needs.
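
As a rough illustration of such a prediction, the sketch below fits a straight line to daily disk usage with ordinary least squares and estimates how many days remain until capacity is reached. The function name days_until_full is hypothetical, and the linear growth assumption is a simplification; real usage is often seasonal or bursty.

```python
def days_until_full(daily_usage_gb, capacity_gb):
    """Estimate days until capacity is reached, counted from the last sample.

    Fits a least-squares line through one sample per day. Returns None
    when usage is flat or shrinking.
    """
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two samples")
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(daily_usage_gb) / n
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(days, daily_usage_gb))
    variance = sum((x - mean_x) ** 2 for x in days)
    slope = covariance / variance  # growth in GB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Day index at which the fitted line crosses the capacity.
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - (n - 1))
```

For example, with usage growing steadily by 2 GB a day from 100 GB to 108 GB over five samples and a 200 GB disk, the estimate is 46 days from the last sample.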

Peace of mind with 24/7 service

Critical systems must work all the time, so someone has to be on call to investigate and solve problems at all hours. It is crucial that the team on call knows how the system works and can immediately start solving the problem. The primary goal in all problem situations is to minimize the damage caused by the problem. This may mean removing the erroneous server from the cluster, removing certain services from use, or the shutdown of the entire system.
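
Removing an erroneous server from the cluster is one of the actions that can be prepared in advance. The sketch below assumes a health check is run against each server and that a server is taken out of rotation after a number of consecutive failures; the ServerPool class and its parameters are hypothetical, and in practice this logic usually lives in the load balancer.

```python
class ServerPool:
    """Tracks which servers receive traffic, based on health-check results."""

    def __init__(self, servers, max_failures=3):
        self.max_failures = max_failures
        self._failures = {server: 0 for server in servers}

    def report(self, server, healthy):
        # A successful check resets the count; a failed one increments it.
        if healthy:
            self._failures[server] = 0
        else:
            self._failures[server] += 1

    def active(self):
        # Servers with too many consecutive failures are left out.
        return [server for server, count in self._failures.items()
                if count < self.max_failures]
```

Requiring several consecutive failures avoids dropping a server because of one slow response, and a single successful check brings it back into rotation.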

When problems come up, quick actions and decisions are usually needed and the situation can be very stressful. Problem resolution reveals how well operations know the system, how well they have prepared, for example by automating the most common functions, and how well problem situations were considered during the development phase. These situations often stretch our stress tolerance, but fortunately experience gives confidence here, too.

It is a good idea to have a separate team dedicated to the operations of a critical system. The team's primary task is reacting to problem situations. The amount of required work is difficult to estimate beforehand, so scheduling other tasks is a challenge. The benefit of a dedicated team is that when the red lights aren’t flashing, so to speak, the team can focus on developing the system and the production environment, as well as on improving their operations.

A system in production requires attention and care

Operating a system in production often reveals new development needs related to, for example, the runtime environment or its operations. The development needs almost always involve automation. Examples of this include building the runtime environment on a new server, configuration changes, collecting diagnostic data, booting the servers or processes and, of course, production deployments.

Problems related to performance or scalability, for example, often go unnoticed until production. These problems can be solved on many different levels. The solution is considerably affected by how the system architecture scales. Server capacity can be increased horizontally, for example by adding new servers to the cluster, or vertically by increasing the capacity of an individual server (more processor power, more memory, etc.). Problems cannot always be solved by increasing server capacity, however, and sometimes the system and its architecture also have to be changed. The essential thing here is that the problem has been located by monitoring and diagnostics, and that developers and operations work together to solve it. This improvement work is also iterative: after fixing one bottleneck, we soon spot the next.

In addition to the above, operations is also responsible for the tasks that traditionally belong to it, and for organizing them: backups, load balancing configurations, security updates, support during scheduled service breaks, and so on.

DevOps changes organizational culture and ways of working

Building, developing and maintaining online services requires agility, and only a product that is in production use brings added value. It is absolutely vital that developers, testers, operations, business people and everyone else involved continuously cooperate and support each other with their own strengths and expertise. Time is money, particularly in developing and launching online services, so everything possible should be automated. Automating only testing and packaging is not enough: production deployments, configuration changes and database migrations should also be automated.

Together, these practices are often called DevOps, which stands for development and operations. The name is slightly misleading, however: in addition to developers and operations, participation from business operations, for example, is also very important in this process. In the past few years, DevOps has become more popular in organizations of all sizes around the world.

Bringing DevOps to a traditional organizational and operational model is usually a major change. It changes the culture and ways of working of the IT department, the product development organization, and even the entire company. The change is unavoidable, though, if the company is to react quickly to changes and new information and to keep its business-critical systems, such as an online bank, reliable.
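
Automated database migrations can be sketched as a list of versioned statements plus a table that records which versions have already been applied. The example below uses Python's standard sqlite3 module purely for illustration; the migrations, the table names and the migrate function are hypothetical.

```python
import sqlite3

# Hypothetical migrations, applied in version order; each runs exactly once.
MIGRATIONS = [
    (1, "CREATE TABLE subscriptions (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    (2, "ALTER TABLE subscriptions ADD COLUMN created_at TEXT"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply pending migrations and return the versions applied in this run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    newly_applied = []
    for version, statement in sorted(migrations):
        if version in applied:
            continue  # already applied in an earlier deployment
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        # Commit after each migration so earlier ones stay recorded
        # even if a later one fails.
        conn.commit()
        newly_applied.append(version)
    return newly_applied
```

Running migrate on every deployment is safe under these assumptions: the first run applies versions 1 and 2, and later runs find nothing to do until a new entry is added to the list.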


Markus Ylä-Ilomäki, Software Architect

Markus Ylä-Ilomäki has extensive experience in the operations and further development of critical systems in different environments. He has used agile software development methods for nearly ten years.