Analyst IT Operations (Nagios Engineer – Enterprise Monitoring & Infrastructure Operations)
- Full-time
- Location: India - Hyderabad
Job Description
Mattel India is seeking an experienced Nagios Engineer to manage, enhance, and optimize Nagios-based monitoring systems in a SOX-regulated infrastructure environment. You will ensure accurate alerting, efficient performance monitoring, and platform reliability across servers, networks, applications, cloud resources, and critical business systems (e.g., financial reporting, supply chain, e-commerce). This role supports proactive issue detection, SOX ITGC compliance (e.g., monitoring for unauthorized changes, SLA adherence, anomaly detection, audit evidence), and high availability to minimize disruptions for global toy brand operations.
Key Responsibilities
- Manage and administer Nagios (Nagios Core or Nagios XI) environments: install, configure, upgrade, and maintain core servers, agents (NRPE), plugins, and distributed monitoring setups.
- Design, implement, and enhance monitoring checks for infrastructure (servers, storage, virtualization such as VMware), networks (devices, bandwidth, latency), applications, APIs, and cloud resources (AWS/Azure).
- Configure accurate alerting: Define thresholds, dependencies, notification rules (email, PagerDuty/Slack/ServiceNow integrations), escalation policies, and false-positive reduction to ensure reliable, actionable alerts.
- Monitor system health, performance metrics, availability, and trends; generate reports/dashboards for leadership and compliance needs.
- Provide 24x5 support during core hours for monitoring operations, tuning, and troubleshooting; participate in weekend on-call rotation for critical incidents, alert responses, restarts, or failover activations.
- Support SOX compliance: Set up monitors for ITGC-relevant events (e.g., configuration changes, access anomalies, performance deviations impacting financial systems), assist in audit evidence collection, control testing, and remediation of findings.
- Develop/customize plugins/scripts (e.g., Perl, Python, Bash) for advanced checks, integrations, and automation of routine tasks.
- Collaborate with Infrastructure, Security, Compliance, Application, and Dev teams to define monitoring requirements, SLAs, and best practices in a regulated setup.
- Perform root-cause analysis on incidents, contribute to post-incident reviews, and drive continuous improvement (e.g., scaling Nagios, migrating to hybrid setups, reducing alert noise).
- Document configurations, procedures, runbooks, and SOX-related artifacts; support knowledge transfer and team handovers.
Required Skills & Experience
- 3–5 years of hands-on experience administering and enhancing Nagios (Nagios Core/XI) in enterprise/production environments.
- Strong expertise in:
  - Nagios configuration (object definitions, macros, time periods, contacts/contact groups).
  - Plugin ecosystem (standard + custom), NRPE, NSCA, and distributed monitoring.
  - Performance graphing (e.g., Nagiosgraph, PNP4Nagios), dashboards, and reporting.
  - Alerting integrations and notification optimization.
- Solid understanding of SOX compliance in IT monitoring: ITGCs, change/access monitoring, logging/auditing, anomaly detection, and regulatory evidence support.
- Experience in 24x5/24x7 operations with on-call (including weekends), incident response, and high-availability monitoring.
- Knowledge of infrastructure technologies: Windows/Linux servers, networking (Cisco/Juniper), virtualization (VMware), cloud (AWS/Azure), and scripting (Perl/Python/Bash for Nagios extensions).
- Familiarity with ITIL processes, ServiceNow/Jira for ticketing, and complementary tools (e.g., Splunk, PagerDuty).
- Strong troubleshooting, analytical, and documentation skills in a global, regulated context.
Preferred Qualifications
- Experience in consumer goods, retail, manufacturing, or entertainment sectors (e.g., monitoring supply chain, DTC/e-commerce, or financial batch systems).
- Certifications: Nagios-related (e.g., Nagios XI Certified), ITIL Foundation, or compliance-focused (CISA/CRISC).
- Exposure to hybrid monitoring (Nagios + modern tools like Prometheus, Zabbix) or additional platforms (SolarWinds, SCOM, AlertSite).
- Experience scripting for automation in regulated environments.