Server or application downtime is a mission-critical problem for organizations of all sizes, from small businesses to enterprise-level architectures. For larger companies, the combined costs can add up to millions of dollars lost every year. When a server or application fails unexpectedly, there is not only a direct revenue impact; long-term business operations, reputation, and SEO can also suffer. In extreme cases, unplanned outages have even forced companies out of business. So how do IT professionals avoid such a catastrophic event? By identifying common causes of downtime and architecting an infrastructure, and the policies around it, that mitigate them. This article provides insight into how to prepare for and avoid downtime and keep the applications your business depends on up and running.
Server Overload
It’s no surprise that an overloaded server is a common cause of downtime. When an application is first tested before being deployed to production, it probably performed well and was accepted as the right solution. Once that same solution goes live and users begin running their day-to-day tasks, however, unanticipated demand can put severe stress on a server, resulting in drastically decreased performance and even component failure.
Applications drain server resources such as memory, storage, and database space. Combine this with a high volume of users accessing data and executing processes, and downtime is almost certainly around the corner if proper precautions have not been taken. Increasing server memory capacity helps speed up data transfer rates and relieve stress on the server. Identifying and eliminating resource and network bottlenecks, such as memory usage, CPU utilization, and user connections, also lessens overload conditions. Load shedding helps prevent server flapping, a condition in which an overloaded server reboots only to be overloaded again once back online, as well as cascading failures, incidents that begin in one system or area of the application and snowball through the rest of the architecture.
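The load-shedding idea above can be sketched in a few lines: instead of queueing every incoming request until the server collapses, the server rejects work beyond a known concurrency limit. This is a minimal illustration, not a production implementation; the class name, the capacity value, and the 503 response shape are all hypothetical.

```python
import threading

class LoadShedder:
    """Reject new requests once the server is at capacity, instead of
    queueing them until the whole machine falls over (hypothetical sketch)."""

    def __init__(self, max_concurrent):
        # The capacity figure should come from load testing, not guesswork.
        self._slots = threading.Semaphore(max_concurrent)

    def handle(self, request, worker):
        # Non-blocking acquire: if no slot is free, shed the request
        # immediately with a "try again later" response.
        if not self._slots.acquire(blocking=False):
            return {"status": 503, "body": "server busy, retry later"}
        try:
            return {"status": 200, "body": worker(request)}
        finally:
            self._slots.release()

shedder = LoadShedder(max_concurrent=2)
result = shedder.handle("GET /", lambda req: "ok")
print(result["status"])  # -> 200
```

Because rejected requests return immediately, the server stays responsive for the traffic it can actually serve, which is exactly what prevents the flapping and cascading-failure patterns described above.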
Redundancy and Fault Tolerance
Electronics fail. People unplug things they shouldn’t. Unplanned events occur. These are all realities of business. Even with the best-laid plans, it’s unlikely you will be able to avoid a failure in production. To account for this, everything in mission-critical applications should be at least “N+1” redundant. N+1 indicates you have the quantity required for the application (N) plus at least one more up and running at all times. Implementing N+1 redundancy throughout an application minimizes single points of failure and better ensures business continuity.
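The value of N+1 can be made concrete with some rough availability math. Assuming components fail independently and any one surviving unit can carry the load (i.e., N = 1), the pair's combined downtime is roughly the square of a single unit's downtime; the 99% figure below is purely illustrative.

```python
# Back-of-the-envelope math behind N+1 redundancy, assuming independent
# failures and that a single surviving unit can carry the full load.

def parallel_availability(per_unit_availability, units):
    """Probability that at least one of `units` independent,
    identical components is up at any given moment."""
    failure = 1.0 - per_unit_availability
    return 1.0 - failure ** units

single = parallel_availability(0.99, 1)    # one server: ~99% uptime
n_plus_1 = parallel_availability(0.99, 2)  # N+1 pair: ~99.99% uptime
print(f"single: {single:.4f}, N+1 pair: {n_plus_1:.4f}")
```

In other words, one spare turns roughly 3.7 days of expected annual downtime into under an hour, which is why N+1 is treated as the baseline for mission-critical applications. Real components share failure modes (power, cooling, software), so the independence assumption is optimistic; that is precisely why the redundancy should extend to UPS, circuits, generators, and cooling as described below.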
While budget and infrastructure constraints can limit what’s practical for a given application, IT teams should do all they can to become at least N+1 redundant. For an application that can run just fine on one server, making it truly N+1 redundant and fault tolerant would entail using two servers (likely in a “Blue/Green” deployment model; see TechTarget’s article on “Blue/Green” for more), each with redundant power supplies, connected to two separate UPS systems, on two separate AC power circuits, backed up by two separate generators, and with redundancy built into the cooling system.
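The Blue/Green model mentioned above can be sketched as a simple traffic router: one environment serves users while the other sits idle; a new release goes to the idle side, and traffic flips only after it passes a health check. The class and method names here are hypothetical, standing in for whatever load balancer or DNS mechanism actually performs the switch.

```python
# Hypothetical blue/green cutover: traffic points at one environment while
# the other is idle; a release deploys to the idle side, gets health-checked,
# and only then does the router flip traffic over.

class BlueGreenRouter:
    def __init__(self):
        self.live = "blue"    # environment currently serving traffic
        self.idle = "green"   # environment available for the next release

    def cut_over(self, health_check):
        """Flip traffic to the idle environment only if it passes its
        health check; otherwise leave live traffic untouched."""
        if not health_check(self.idle):
            return False  # a failed deploy never receives user traffic
        self.live, self.idle = self.idle, self.live
        return True

router = BlueGreenRouter()
router.cut_over(lambda env: True)  # pretend green passed its health check
print(router.live)  # -> green
```

The design benefit for uptime is that the previous environment remains intact after the flip: if the new release misbehaves, rolling back is just flipping traffic again rather than redeploying under pressure.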
Human Error
Widely considered by IT specialists to be a leading downtime culprit, human error can lead to disastrous conditions. In March 2017, Amazon reported that its Simple Storage Service (S3), part of its cloud and data center infrastructure business, had suffered a roughly four-hour service disruption on February 28th. An analysis by Cyence found that S&P 500 companies incurred losses of $150 million as a result (Source: Business Insider). After an investigation to determine the root cause, Amazon provided this response (Source: Amazon):
"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems."
To paraphrase: an employee made a typo with severe ramifications. The effects of the outage were felt by prominent online retailers such as Disney, Target, and Nike, whose web pages took more than 30 seconds to load when they normally average only a fraction of that. With S3 designed to deliver 99.999999999% data durability, the incident only proves that human error can bring down even the most well-designed systems.
Unfortunately, there is no single solution to the human error problem. Training employees in their specific roles and alerting them to mission-critical processes remains an important component of error prevention. Nor is this limited to new hires: refresher training for long-standing personnel serves as a strong reminder of the significance of their actions. Limiting access to sensitive resources and systems compounds the benefits of training; allowing only select, well-trained staff to perform vital monitoring and tasks helps protect the application’s infrastructure. Physical and software-based security and monitoring help prevent human-error (or malicious-user) incidents and are valuable tools for maintaining maximum uptime and performance. Additionally, designing, implementing, and testing a disaster recovery plan will minimize loss when an error does occur. Disaster simulations present real-time cause-and-effect situations that improve reaction times and judgment while serving as a good reminder of the severity of user actions.
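Tooling can also limit the blast radius of a mistake. In the spirit of the safeguards Amazon described adding after the S3 incident, a capacity-removal tool can refuse to take out more than a small fraction of a fleet in one command, regardless of what the operator typed. This sketch is illustrative only; the function name, fleet representation, and 10% cap are all assumptions.

```python
# Hypothetical guard rail for a destructive operations command: cap how
# much capacity a single invocation can remove, so a typo in the input
# (as in the S3 incident) cannot take down multiple subsystems at once.

def remove_servers(fleet, requested, max_fraction=0.1):
    """Return the fleet with `requested` servers removed, refusing any
    single command that would remove more than max_fraction of the fleet."""
    limit = max(1, int(len(fleet) * max_fraction))
    if len(requested) > limit:
        raise ValueError(
            f"refusing to remove {len(requested)} servers; "
            f"limit is {limit} per command"
        )
    return [s for s in fleet if s not in requested]
```

A guard like this does not eliminate human error, but it converts a fleet-wide outage into a rejected command and an error message, buying the operator a second chance to check the input.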
Power Protection
One cannot overlook the importance of consistent, high-quality power for the entire computing environment. In May 2017, British Airways was forced to cancel more than 400 flights, stranding 75,000 passengers on one day of a holiday weekend (Source: The Register). The subsequent investigation traced the problem to a power surge that occurred when a technician unplugged and reconnected a power supply, damaging mission-critical components of the IT infrastructure. While this incident blends human error with power problems, it only reinforces the value of consistent, clean power for data center applications.
Uninterruptible power supply (UPS) systems are an essential safeguard against power-loss-induced downtime. With varying levels of power capacity, battery backup runtime, and outlet configurations available from a large selection of manufacturers, network and data center administrators can select an appropriate solution for their needs. Features such as online double-conversion technology (which converts AC power to DC and back to AC, eliminating harmful harmonic distortion, voltage fluctuations, and waveform anomalies) add another level of protection that can lengthen the operating life of connected equipment. Enterprise-level units provide network communication with the UPS, enabling power and load monitoring, network notifications of power events, and remote power management through controllable outlets.
Generators are larger-capacity backup devices focused on the general power requirements of a facility, but models with computer-grade AC output allow a UPS to use them as a utility-power replacement until the underlying power problem has been addressed. In many applications, the battery backup only needs to provide a few minutes of runtime during a blackout, long enough for a generator to start and take over powering the components necessary for continued business operation.
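The UPS-to-generator handoff above reduces to simple sizing arithmetic: the battery must carry the load from the moment utility power drops until the generator is running and the transfer switch has engaged, plus a safety margin. The start, transfer, and margin figures below are assumed for illustration; real values come from the equipment datasheets.

```python
# Back-of-the-envelope sizing for the UPS-plus-generator handoff: the
# battery only needs to bridge the gap between a utility outage and a
# stable generator taking over the load.

def required_runtime_minutes(generator_start_s, transfer_s, safety_factor=2.0):
    """Minimum UPS battery runtime in minutes, with headroom for a
    slow generator start or a failed first start attempt."""
    return (generator_start_s + transfer_s) * safety_factor / 60.0

# Assumed figures for illustration: 30 s generator start,
# 10 s transfer-switch delay, 2x safety margin.
print(required_runtime_minutes(30, 10))  # -> about 1.3 minutes
```

Even with generous margins the requirement is measured in minutes, not hours, which is why pairing a modest UPS with a generator is usually far more economical than sizing batteries to ride out an entire outage.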
High-Availability Server Hardware
The fundamental building block of a high-availability deployment model is server hardware built with uptime in mind. With features like redundant power supplies and fans, hot-swappable HDD or SSD storage bays, industry-trusted Intel PCSD server boards, and a tool-less design, Premio’s high-availability servers are uniquely equipped and built from the ground up to minimize downtime and achieve fault tolerance at the server level.
In conclusion, while there is no single solution available that can prevent downtime, identifying single points of failure and building a solution with fault tolerance in mind can help organizations maximize uptime and business continuity. Contact us today to begin building your “zero downtime” solution. We offer an extensive line of server options to meet numerous computing design needs that help develop a secure, stable environment with high-density capacity options and industry-leading processing speeds.