Platform Reliability Engineer

Full-time

Company Description

We are a global algorithmic trading company fully dedicated to technology. Our team consists of 100+ people located around the world. Seventy percent of our team members are technical specialists in development, infrastructure, testing, and analytics; the rest work in business units, including operations, legal, and finance.

Job Description

We are looking for a Platform Reliability Engineer who will be responsible for ensuring the reliable operation of our platform, working with metrics to improve production process efficiency, and participating in testing new product versions.

Your career with our team starts with a bootcamp, our preparatory program. It is a paid, remote, two-month bootcamp with possible employment at our company after successful completion. During the bootcamp, you will work on real business tasks and gain new knowledge.

Responsibilities:

Monitoring the operation of the platform and its environment, detecting anomalies in the system’s performance (using tools such as Zabbix, Grafana, Prometheus/ELK).
Investigating incidents related to software, network, and equipment.
Resolving failures in real time, independently where possible or together with the DevOps, development, and QA teams.
Processing the data needed for issue resolution, using ClickHouse.
Configuring and optimizing monitoring, alerting, and logging tools.
Automating and optimizing routine processes (using Python).

Qualifications

1+ years of experience working in support, maintenance, analytics, software administration, DevOps, or SRE.
Interest in trading, or any experience trading independently.
Experience using Python and SQL scripts for automation and optimization.
Degree in Computer Science, Physics, or Mathematics.
English sufficient for reading technical documentation (B1-B2).
