By: Brian Wilson Our customers continue to amaze me. Each one is a pioneer navigating a complex IT landscape. A recent blog post, Adding OnCommand Unified Manager to Gain Single Pane Event Monitoring in IT, by Ed Wang at NetApp offers valuable insight into hybrid monitoring. Read the entire post below. Ed previously won our prestigious GalaxZ 16 Innovator of the Year Award. Register today for GalaxZ 17 to hear more strategies and best practices from our customers and partners!
Adding OnCommand Unified Manager to Gain Single Pane Event Monitoring in IT
By Ed Wang, Senior Manager, Automation and Monitoring Tools, NetApp IT, and Tim Burr, Sr. Manager, Infrastructure Operations, NetApp IT
IT Event Monitoring to Identify and Prevent Issues
Necessity is the mother of invention, at least when you work in IT and support global resources that include five data centers, 5,300 servers, and 52 PB of data center storage. An environment of this size generates a constant flow of alerts. Our ongoing challenge is to quickly identify the root cause of each issue and prevent it from happening again.
Our event monitoring strategy plays an important role in addressing this challenge. We want to ensure critical alerts quickly rise to the top for immediate attention, while informational alerts can be analyzed separately for later action. To support this strategy, we needed to consolidate our alerts into a single ecosystem made up of individual, best-in-class components. This would feed alerts into our incident management software for auto-ticketing.
This ‘single pane of glass’ strategy enables the NetApp resources on the infrastructure support team, called the Command Center, to quickly resolve critical issues 24x7 across the globe and not be sidetracked by non-urgent alerts. This approach improves IT’s responsiveness and focus, ultimately resulting in increased operational stability.
NetApp IT Redefines IT Monitoring Strategy
Our first step was developing an alerting process. Like most IT shops, we have a two-tier alerting system, but we classified our alerts in a slightly different way:
• Reactive: This is the only type of alert that is automatically forwarded to the Command Center for immediate action. It is defined as “actionable” and requires attention by the team.
• Proactive: These alerts are typically performance related and less urgent, so they are not immediately forwarded to the Command Center for action. Dashboards are used to manage thresholds for the alerts at a broader level. The Command Center monitors the dashboards to proactively address issues, such as storage capacity or CPU utilization, with partner application support teams. These types of alerts remain a key volume driver for the Command Center, but teams continue to focus on streamlining and automating these responses over time.
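The two-tier split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alert` fields, the `actionable` flag, and the `route` function are hypothetical names, not part of NetApp IT's tooling or of any product API.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    REACTIVE = "reactive"    # forwarded to the Command Center immediately
    PROACTIVE = "proactive"  # surfaced on dashboards for later review


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape, for illustration only.
    source: str
    summary: str
    actionable: bool


def classify(alert: Alert) -> Tier:
    """Actionable alerts are reactive; everything else is proactive."""
    return Tier.REACTIVE if alert.actionable else Tier.PROACTIVE


def route(alerts):
    """Split alerts into a Command Center queue and a dashboard queue."""
    command_center, dashboard = [], []
    for alert in alerts:
        if classify(alert) is Tier.REACTIVE:
            command_center.append(alert)
        else:
            dashboard.append(alert)
    return command_center, dashboard
```

In this sketch, only the first list would be forwarded for immediate action; the second would feed the threshold dashboards that the Command Center reviews with application support teams.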
Over the course of about nine months, process and support teams focused on understanding which existing alerts, thresholds, and events were most important and “actionable.” This work positioned NetApp IT to implement a single, integrated service management and alerting ecosystem, with significantly less noise for those accountable for responding to the alerts.
NetApp IT Builds an IT Integrated Monitoring Ecosystem
Our plan was to create an event monitoring ecosystem that fed alerts into central incident management software. A single ecosystem would enable the sorting, tracking, and accurate routing of alerts from our IT systems into our incident management software through auto-ticketing. For storage events specifically, this required that we integrate multiple tools--Zenoss, Splunk, and NetApp OnCommand® Unified Manager (OCUM)--into our ServiceNow incident management platform.
We created a Zenoss ZenPack (to be published soon in the Zenoss community), a plug-in module that outlines the business rules for OCUM to pass its monitoring events to Zenoss. Zenoss screens the alerts, dedupes them, then identifies the critical alerts for auto-ticketing. This integration brings storage alerts into the ecosystem, along with similar alerting configurations for server virtualization, network, and security components in the data center. It also enables NetApp IT to achieve another critical step toward consolidated event and incident management.
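As a rough illustration of the screen, dedupe, and select-for-ticketing flow described above, here is a minimal Python sketch. It does not use the Zenoss, OCUM, or ServiceNow APIs; the `Event` fields, the severity names, and the deduplication key (device plus event class) are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed severity ordering; real tools define their own levels.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape, for illustration only.
    source: str       # reporting tool, e.g. "OCUM" or "Splunk"
    device: str
    event_class: str
    severity: str     # one of the keys in SEVERITY_RANK


def dedupe(events):
    """Keep one event per (device, event_class), preferring higher severity.

    When two events share a key and a severity, the first one seen is kept.
    """
    best = {}
    for event in events:
        key = (event.device, event.event_class)
        current = best.get(key)
        if (current is None
                or SEVERITY_RANK[event.severity] > SEVERITY_RANK[current.severity]):
            best[key] = event
    return list(best.values())


def select_for_ticketing(events):
    """Return the deduplicated critical events that would open a ticket."""
    return [e for e in dedupe(events) if e.severity == "critical"]
```

With this sketch, the same volume-offline condition reported by two tools would yield a single ticket candidate, while a capacity warning would stay out of the ticketing path.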
Improving IT Operational Stability
The new alerting strategy offers many benefits. The Command Center has greatly reduced its dependency on email for event notifications. Team members don’t need to sort through alerts to find the critical ones, dedupe alerts about the same issue from multiple sources, or risk assuming that someone has already addressed the issue. The team only receives alerts that specifically require action.
When a device goes offline, a storage volume becomes unavailable, or a storage system experiences a hardware failure, the team is positioned to respond appropriately. As a result, urgent infrastructure issues are identified and fixed more rapidly, before they cause havoc in our IT environment, which reduces the overall number and impact of P1 incidents.
Regardless of the incident management or event monitoring software being used, any IT organization can benefit from rationalizing the number of actionable alerts and adopting an integrated event monitoring ecosystem. By creating a strategy that enables fast action on high-priority issues, we’ve improved the efficiency and effectiveness of our Command Center.
Ultimately, this approach has a direct impact on the stability of IT operations for our customers, partners, and employees.
The NetApp-on-NetApp blog series features advice from subject matter experts from NetApp IT who share their real-world experiences using NetApp’s industry-leading storage solutions to support business goals. Want to learn more about the program? Visit www.NetAppIT.com.