An ounce of prevention is worth a pound of cure. Cliché as it is, the old adage holds especially true for IT infrastructure. With 98% of organizations reporting that a single hour of downtime costs over $100,000 (Source: ITIC), monitoring your servers' performance to prevent downtime before it occurs is more important than ever. In this post, we'll dive into server monitoring and discuss some of the technical details and best practices.
What is server monitoring?
Server monitoring is the act of polling and capturing data on server status and performance. The objective is to measure vital server statistics, confirm that servers are using resources effectively, and give administrators a detailed view of their infrastructure so they can address issues proactively as they develop and be notified quickly should equipment go down.
Monitoring can be done using native operating system tools (e.g. Windows Performance Monitor), third-party tools and scanners (e.g. Pingdom), network management systems, a.k.a. NMSes (e.g. SolarWinds or Nagios), or custom scripts and command-line utilities for ad-hoc checks (e.g. PowerShell scripts using WMI or bash scripts using Net-SNMP).
Some of the most common protocols used for monitoring servers over the network are SNMP (Simple Network Management Protocol), IPMI (Intelligent Platform Management Interface), and WMI (Windows Management Instrumentation). In addition to these protocols, many tools are capable of monitoring based on SSH, FTP, ICMP, DNS, HTTP(S), RESTful APIs, vendor-specific agents, or other protocols and services.
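To make this concrete, here is a minimal sketch of an agentless, ad-hoc SNMP check in Python. It assumes the Net-SNMP command-line tools are installed and that the target host (a placeholder name here) allows SNMP v2c reads with the community string "public":

```python
# A minimal sketch of an agentless, ad-hoc SNMP poll. Assumes the Net-SNMP
# command-line tools are installed and the target allows SNMP v2c reads
# with the (placeholder) community string "public".
import subprocess

def snmp_get(host: str, oid: str, community: str = "public") -> str:
    """Poll a single OID from a server using Net-SNMP's snmpget."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Ovq", host, oid],
        capture_output=True, text=True, timeout=5,
    )
    result.check_returncode()  # raise if the poll failed
    return result.stdout.strip()

# sysUpTime: how long the SNMP agent (and typically the host) has been up.
print(snmp_get("server01.example.com", "1.3.6.1.2.1.1.3.0"))
```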
Agent-based and agentless monitoring are two popular terms when it comes to server monitoring. Agent-based means an agent must be running on the server for monitoring to occur; agentless implies that no agent is required and a standard protocol like SNMP can be used instead (although this is technically a misnomer, since SNMP itself is built on a manager-to-agent model). Each method has its pros and cons: the general tradeoff is that agent-based methods can provide more detail and functionality, while agentless methods are more lightweight and conducive to centralized monitoring. For more details on the topic, check out Nagios' descriptions of their agent-based and agentless monitoring options.
Some of the most popular metrics to monitor include memory, disk utilization, network traffic, CPU usage, ping (ICMP to check whether a server is up or down), database statistics, virtualization statistics, power metrics, fan status, temperature, and humidity. These variables are just the tip of the iceberg, and your use case may dictate more specific monitoring. For example, if you run a web server, regularly checking HTTP status is a must (Uptime Robot is an example of a tool that can do this for free), whereas a MySQL server may call for regular checks of the error log, and a file and print server may benefit more from monitoring of specific processes and services. Additionally, many server monitoring tools can help you track inventory and gain a more holistic view of your overall IT infrastructure.
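As a simple illustration, here is a minimal HTTP uptime check using only the Python standard library. The URL is a placeholder; a real deployment would run a check like this against every monitored endpoint on a schedule:

```python
# A minimal sketch of a web server health check using only the Python
# standard library. The URL is a placeholder.
from urllib.request import urlopen
from urllib.error import URLError

def check_http(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers with HTTP 2xx within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except URLError:
        # Covers connection refusals, DNS failures, timeouts, and
        # HTTP error responses (HTTPError subclasses URLError).
        return False

if not check_http("https://www.example.com/"):
    print("ALERT: web server is not responding to HTTP requests")
```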
How does this positively affect your server's performance?
The two big-picture metrics that surround server monitoring are MTTD (mean time to detect) and MTTR (mean time to resolve). Server monitoring, when done right, reduces both, thereby maximizing uptime for your mission-critical servers (for more on maximizing uptime, check out our Is Your Server Down? Here Are Some Tips To Help Get To Zero Downtime post) and making your IT operations more efficient.
While the demands of a particular environment will dictate which specific variables your team monitors, the one commonality across use cases is this: monitoring empowers your team to be proactive, to understand resource utilization, and to react quickly when a problem arises.
Many of the monitoring tools mentioned offer notification features (SMS, email, etc.) and automation features (execute a script or program when a threshold is exceeded or a device goes down) that make responding to and resolving issues faster and more efficient. The next time a web server stops responding to HTTP requests, the responsible administrator can be notified immediately, or a script can be executed to restart the HTTP service (e.g. Apache, nginx, or IIS). Similarly, if a server constantly exceeds a predefined CPU utilization threshold, your team can investigate and resolve the cause before it degrades application performance and user experience.
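Below is a minimal sketch of what such threshold-based alerting and automated remediation might look like in a custom script. It assumes the third-party psutil package, a local SMTP relay, a systemd-managed nginx service, and placeholder addresses and thresholds; any real monitoring tool would provide equivalents through its own configuration:

```python
# A minimal sketch of threshold-based alerting and automated remediation.
# Assumes the psutil package, an SMTP relay on localhost, a systemd-managed
# nginx service, and placeholder addresses; the 80% threshold is illustrative.
import smtplib
import subprocess
from email.message import EmailMessage

import psutil

CPU_THRESHOLD = 80.0  # percent

def send_alert(body: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "Server monitoring alert"
    msg["From"] = "monitor@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

# Sample CPU utilization over one second and compare against the threshold.
cpu = psutil.cpu_percent(interval=1)
if cpu > CPU_THRESHOLD:
    send_alert(f"CPU utilization at {cpu:.1f}% (threshold {CPU_THRESHOLD}%)")

# Automated remediation: restart the web server if it is no longer active.
if subprocess.run(["systemctl", "is-active", "--quiet", "nginx"]).returncode != 0:
    subprocess.run(["systemctl", "restart", "nginx"], check=True)
    send_alert("nginx was down and has been restarted automatically")
```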
Additionally, capturing monitoring data over time allows you to track and report on it and identify trends before they become problems. By understanding the baseline performance of your infrastructure, you can make informed scaling decisions, identify weak spots, and increase server performance through data-driven tweaks and optimizations.
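As a rough illustration (with invented sample numbers), comparing recent utilization against an established baseline can surface a trend long before any hard alert threshold is crossed:

```python
# A minimal sketch of establishing a performance baseline from collected
# samples and flagging a trend away from it. The sample data is invented;
# in practice it would come from your monitoring tool's history.
import statistics

# Hypothetical daily average CPU utilization (%) for the past two weeks.
history = [22, 25, 21, 24, 23, 26, 22, 28, 31, 33, 36, 38, 41, 44]

baseline = statistics.mean(history[:7])   # first week as the baseline
recent = statistics.mean(history[-7:])    # most recent week

# A sustained 25% rise over baseline suggests capacity planning is due.
if recent > baseline * 1.25:
    print(f"Trend warning: recent avg {recent:.1f}% vs baseline {baseline:.1f}%")
```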
Getting started with server monitoring
To get started with server monitoring, you should ask yourself a number of questions to see exactly what you should monitor and why. Here, we’ll help you define those questions and drill down to specific actionable items.
What services matter to your business?
If you are monitoring web servers, you will likely need to select a tool that monitors HTTP requests, network latency, uptime, and other web server vitals. Database servers will require a solution that can run database queries to check for failed logins, errors, and database status. Servers that act as hypervisors will need highly granular resource monitoring to ensure that virtual machines and hosts are consuming resources efficiently.
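For the database case, here is a minimal sketch assuming the mysql-connector-python package and placeholder host and credentials; it reads MySQL's Aborted_connects status counter, which counts failed connection attempts (including failed logins):

```python
# A minimal sketch of a database health check. Assumes the
# mysql-connector-python package and placeholder credentials.
import mysql.connector

conn = mysql.connector.connect(
    host="db01.example.com", user="monitor", password="changeme"
)
cur = conn.cursor()
# Aborted_connects counts failed attempts to connect to the server.
cur.execute("SHOW GLOBAL STATUS LIKE 'Aborted_connects'")
name, value = cur.fetchone()
print(f"{name} = {value}")
cur.close()
conn.close()
```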
What is the size of your environment and how will it scale in the future?
The needs of an SMB that doesn't project much IT infrastructure growth differ from the needs of a Fortune 500 company that runs its own data center. Understand the scale of your environment before selecting a monitoring tool; this will help you avoid overpaying for services you don't need or missing out on features you do.
What tools are available to meet your needs?
A small network of servers may be able to meet all of its monitoring needs using native operating system tools and built-in features. For example, Windows Performance Monitor and a few custom scripts for email alerts may be sufficient for a small Windows environment.
A large network of servers with varying functions and applications will likely benefit from a feature-rich NMS that supports the aforementioned use cases (and more) and allows for centralized monitoring, alerting, and reporting.
What do you do once you have a monitoring solution in place?
The answer will vary depending on the tool selected and the requirements of your use case, but generally you will need to provision the tool: discover the servers, define thresholds for the metrics that matter to your business, add contacts to be notified when events occur, and define any scripts or other custom actions that need to be set up. Note that many enterprise-grade monitoring solutions support auto-discovery and will automatically begin detecting and monitoring networked devices. A few key points to remember when provisioning your server monitoring solution:
- Automate as much as possible: Your team probably already knows the fixes for common issues; automate those solutions wherever possible. For example, if a monitored service locks up semi-regularly and the fix is simply to restart it, configure an action to do so.
- Notify the right people: Automation is great, but your IT team is there for a reason. Be sure they are notified when thresholds are exceeded or servers go down, and leverage email and SMS notifications where possible.
- Avoid desensitizing your team to alarms: One of the most overlooked issues in IT is alarm overload. Avoid sending your team alerts for trivial events that don't require action; that means setting thresholds based on performance baselines, not arbitrary numbers that don't fit your use case. For example, if server CPU spikes to 82% once a day when a particular batch process runs and your team gets alarms at 80%, they'll quickly learn to ignore those alarms.
- Track data to set performance baselines: To make better long-term decisions, you need to know what your IT investment is doing today. Leverage the data you gather to map out server resource utilization, areas where upgrades are required, and areas that may be creating bottlenecks in your infrastructure (see the sketch after this list).
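Here is a minimal sketch that ties these points together: thresholds derived from a measured baseline rather than an arbitrary number, named contacts, and an automated action only where a known-good fix exists. The structure is illustrative and not any particular tool's schema; the sample data and addresses are invented:

```python
# A minimal sketch of provisioning alert rules from a measured baseline.
# The sample history, contacts, and rule structure are illustrative only.
import statistics

# Hypothetical CPU history that includes the known daily batch spike to 82%.
cpu_samples = [18, 22, 25, 21, 30, 82, 24, 23, 26, 22]

mean = statistics.mean(cpu_samples)
stdev = statistics.stdev(cpu_samples)

alert_rules = {
    "cpu_utilization": {
        # Baseline + 3 standard deviations tolerates the expected batch
        # spike instead of paging the team for normal behavior.
        "threshold": round(mean + 3 * stdev, 1),
        "notify": ["oncall@example.com", "+1-555-0100"],
        "action": None,  # investigate manually; no safe auto-fix
    },
    "http_service": {
        "threshold": "down",
        "notify": ["oncall@example.com"],
        "action": "systemctl restart nginx",  # known-good automated fix
    },
}
print(alert_rules)
```

Deriving the threshold from the baseline rather than picking a round number is what keeps the 82% batch spike from becoming a daily nuisance alarm.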
How can you tell if it's actually working?
Once your thresholds are in place and your alerts and actions are configured, how do you know they will actually work when it matters? While many tools offer "test" or "simulation" features to confirm everything is working, a more surefire way to test your configuration is to set a threshold low enough that it will fire an action (send an email or execute a script). For example, if your normal threshold for a CPU utilization email alert is 80% but the CPU generally runs at 5%, drop the threshold to 1% to trigger the alert. If you receive the notification as expected, the business logic is functioning correctly and you can change the threshold back.
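In script form, that test might look like the following minimal sketch (again assuming the psutil package, with values invented for illustration):

```python
# A minimal sketch of the test described above: temporarily drop the
# threshold below normal utilization so the alert path fires end to end,
# then restore the normal value. Assumes the psutil package.
import psutil

def check_cpu(threshold: float) -> bool:
    """Return True (and alert) if CPU utilization exceeds the threshold."""
    cpu = psutil.cpu_percent(interval=1)
    if cpu > threshold:
        print(f"ALERT: CPU at {cpu:.1f}% exceeds threshold {threshold}%")
        return True
    return False

NORMAL_THRESHOLD = 80.0
TEST_THRESHOLD = 1.0  # low enough that even an idle server will trip it

assert check_cpu(TEST_THRESHOLD), "alert path did not fire; check configuration"
# ...verify the notification arrived, then restore the normal threshold.
check_cpu(NORMAL_THRESHOLD)
```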
At the core of any server monitoring plan should be robust server infrastructure designed with uptime in mind. Our line of Durastreams Mission Critical Servers is designed to keep your mission-critical applications up and running 24/7/365. Contact us today to learn more about what Premio's industry-leading server and storage designs and knowledgeable solutions experts can do for you.