SRE Manager

  • Full-time

Company Description

Ivy is a global, cutting-edge software and support services provider, partnering with one of the world’s biggest online gaming and entertainment groups. Founded in 2001, we’ve grown from a small tech company in Hyderabad to one creating innovative software solutions used by millions of consumers around the world, with billions of transactions taking place, a volume that rivals even some of the biggest technology giants. Focused on quality at scale, we deliver excellence to our customers day in and day out, with everyone working together to make what sometimes feels impossible, possible.

This means that not only do you get to work for a dynamic organization delivering pioneering technology, gaming and business solutions, you can also have an exciting and entertaining career. At Ivy, Bright Minds Shine Brighter. 

Job Description

As an SRE Manager, you will focus on ensuring the reliability, performance, and scalability of services and infrastructure.

Reporting to the Head of Engineering, you will be part of the Product & Technology team and will actively participate in all aspects of Site Reliability Engineering, including technical vision, telemetry and observability decisions, automation strategy, solution delivery, and platform incident and problem management. This is a leadership role with both technical and people leadership responsibilities. As such, this role participates in short- and long-term systems planning, as well as team and organizational planning. This position reports directly to the Director of Engineering.

What you will do

  • Provide technical and people leadership to the SRE teams by facilitating one-on-one, team, and performance review meetings.
  • Fulfil the role of Escalation Manager/Critical Incident Manager on critical/major incidents, facilitating quick and effective incident resolution to minimize player and business impact.
  • Conduct root cause analysis (RCA) and Post-Incident Reviews (PIRs) in a blameless manner to identify root causes and prevent recurrence.
  • Build advanced Incident Management and Problem Management support (SOPs and runbooks) to effectively identify, remediate, and resolve issues related to platform reliability, stability, and performance through careful analysis of telemetry data and system logs.
  • Continuously improve problem identification and service restoration of platforms by leading and overseeing efforts to define, enhance, and deliver automated alerting and response systems with intelligent, self-healing capabilities.
  • Collaborate with platform engineers on implementation decisions to achieve highly reliable infrastructure, systems, and integrations (for example, synthetic monitoring, health dashboards, reliable alerts, and system performance monitoring).
  • Promote automation (CI/CD) and infrastructure-as-code (IaC) practices, and develop tools and processes for seamless deployments, rollbacks, monitoring, and troubleshooting.
  • Define reviews and ensure they are in place to minimise Mean Time to Recover (MTTR) and Mean Time to Discover (MTTD), and to maximise Mean Time to Failure (MTTF).
  • Work with development teams to set error budgets, SLIs/SLOs, and policies. Work with SRE teams to implement alerts and policies that minimize the impact of failures and outages on players.

Qualifications

  • Graduate or Post-Graduate with a strong engineering background.
  • 10+ years of experience working in global organizations with the ability to effectively communicate with executives, leaders and individual contributors across the organization.
  • 5+ years of SRE experience working with telemetry, observability, self-healing solutions, and platform automation.
  • Proficient in analysing complex technical issues, identifying root causes, and implementing effective solutions under pressure.
  • Experience with monitoring, logging & telemetry tools like New Relic, Splunk, ELK, Nagios, Prometheus, AWS CloudWatch, Datadog, etc.
  • Experience in Disaster Recovery and Chaos Engineering with tools like Chaos Mesh and Chaos Monkey, including periodic testing of resiliency and failovers.
  • Hands-on exposure to automation and tools such as (but not limited to) GitLab CI, Jenkins, Terraform, Ansible, etc.
  • Expert in designing, creating, and supporting automation (PowerShell, Python, Ruby, AWK, sed, etc.) that runs health checks and provides self-healing capabilities for the platforms.
  • Experience with networking, Content Delivery Networks (CDNs, e.g. Akamai, Cloudflare), streaming platform technologies such as Apache Kafka, and databases (Oracle, MS SQL, etc.).
  • Experience with Cloud platforms, especially Amazon Web Services (AWS).
  • Knowledge of application security, the practice of safeguarding applications through access control, authentication and authorization (AuthN and AuthZ), data encryption, and secure communication using TLS/SSL and mTLS.
  • Experience with collaboration and change management tools: Jira, ServiceNow, SharePoint, etc.
  • Experience in managing relationships with third-party vendors and service providers contributing to the business.

Additional Information

What we offer

At Ivy, we know that signing top players requires a great starting package, and plenty of support to inspire peak performance. Join us, and a competitive salary is just the beginning. Working for us in Hyderabad, you can expect to receive great benefits like:

  • Safe home pickup and home drop
  • Group Mediclaim policy
  • Group Critical Illness policy
  • Communication & Relocation allowance
  • Annual Health check

And outside of this, you’ll have the chance to turn recognition from leaders and colleagues into amazing prizes. Join a winning team of talented people and be a part of an inclusive and supportive community where everyone is celebrated for being themselves.

Should you need any adjustments or accommodations to the recruitment process, at either application or interview, please contact us.

At Ivy, we do what’s right. It’s one of our core values, and that’s why we’re taking the lead when it comes to creating a diverse, equitable and inclusive future for our people and for the wider global sports betting and gaming sector. However you identify, across any protected characteristic, our ambition is to ensure our people across the globe feel valued and respected, with their individuality celebrated.
