Site Reliability Engineer

Full-time

Sub Division: Group Information Technology
Division: Group Technology & Transformation

Company Description

Now it’s your time to join the #1 bank in the Middle East and one of the most prestigious financial companies in the region. Shaking up the world of banking requires a lot of smarts and skill. We’re looking for the brightest and best to help us reach our goals, and we’ll also help you reach yours. Your success is our success as you grow stronger in your career. Join us and leave a legacy of your own, as a pioneer in both the company and the industry.

Job Description

Job Purpose

The Principal Engineer is responsible for creating scalable and highly reliable software systems by leveraging tools and automation. The Principal Engineer will focus on SRE-related roles and on improving the performance and operational efficiency of the bank's Business Applications.

Key Accountabilities

Carry out Capacity Management best practices for Business Applications in scope.
Monitor and Report on the coverage of Business Applications in scope.
Identify and automate improvements to the Reliability and Availability of applications in scope by leveraging the bank's tools and knowledge of scripting.
Implement Chaos Engineering practices to be better prepared for the recovery of business-critical services and to drive down MTTD and MTTR. Demonstrate progress through monthly/quarterly reports.
Identify and implement CSI initiatives with a focus on reducing technical debt and improving reliability/scalability/availability.
Participate actively in incident/problem management calls, BPM and RC.

Qualifications

Academics:

Bachelor’s degree or equivalent.

Job knowledge, skills & experience:

10+ years of demonstrable hands-on experience in improving the reliability of Critical Business Applications through SRE Best Practices.
Exceptional knowledge of systems monitoring, alerting and analytics (AppDynamics, Dynatrace, Splunk, etc.).
Experience in troubleshooting highly available, secure and reliable services with automatic failover using containers and container-orchestration tools like Kubernetes/OpenShift, while leveraging the monitoring solutions of the bank.
Extensive experience with Cloud Technologies (Amazon Web Services and/or Azure).
Ability to define and report on the KPIs to be tracked and improved using SRE best practices.
Experience in automating routine tasks, with knowledge of Python, Bash, Ansible and Terraform.
Experienced in working closely with Performance and Load test teams to define, track and analyse performance and availability targets for the Business Applications.
Ability to define comprehensive coverage requirements for monitored Business Applications and define the goals and outcomes to increase reliability and improve/maintain SLAs.
Demonstrates understanding of the Architecture of Business Applications, with the ability to recommend changes that improve reliability and uptime.
Experience using Chaos Engineering practices to build resiliency through the development lifecycle and Production.
Hands-on knowledge of the build automation and continuous integration/delivery ecosystem: GitLab, Docker, Nexus, Selenium, Jenkins, Kubernetes.
Experience in working on a Linux-based infrastructure.
Critical thinking and problem-solving skills.

Must have knowledge

APM and log aggregation solution knowledge
Expertise in at least one monitoring tool such as Splunk / ELK / AppDynamics / Dynatrace / New Relic
Proficient in scripting - Python, Bash or Java
Experience working on Linux-based infrastructure
ITIL Certified

Bonus knowledge

Experience in developing Continuous Integration / Continuous Delivery (CI/CD) pipelines – GitLab / Azure DevOps / Jenkins
Good hands-on knowledge of Configuration Management, Orchestration and Deployment tools like Ansible and Terraform.
Cloud environment knowledge – Kubernetes, AWS EKS, Azure AKS
Working knowledge of various tools, open-source technologies, and cloud services

Behavioural Skills:

Independent, self-driven and able to bring ideas to the table.
Ability to make decisions and drive changes.
Excellent communication skills, with the ability to communicate with senior stakeholders as well as technical teams.
Knowledgeable and a quick learner.
Fosters Innovation
