Site Reliability Engineer

  • Full Time
  • Toronto

Epsilon Solutions Ltd.

Title: SRE Engineer

Location: Toronto, ON (Hybrid)

Details & Criteria

Functional Title: SRE Engineer


SRE: distributed systems, Dynatrace, Catchpoint, incident handling, PCF, debugging skills, good communication

SRE Job Description:


What will SREs do?

• Provide hands-on 24×7 SRE support, including incident management, problem management, root cause analysis, monitoring, alerting, infrastructure maintenance, and compliance

• Track, audit, monitor, and deliver on technical work streams

• Act as portfolio SME (Subject Matter Expert): understand and document the common components, core functionality, and infrastructure of supported applications


• Be an escalation point in the on-call rotation, and support our maintenance, scheduled work, support, and release deployment requirements

• Lead incident management and problem management for applications in scope, and own the fulfillment of RCA action items

• Focus on continuous improvement and technical standards: drive improvements in productivity, monitoring, tooling, and best practices

• Manage technology currency (server patching, certificate renewal, compliance, etc.) with a keen eye on automation opportunities



• Drive best-in-class technical solutions by closely tracking industry-leading solutions and applying them to the RBC environment and its needs

• Leverage the value in unit, department, and enterprise-wide teams to develop better solutions and achieve a cross-enterprise mindset

Engineering:

• Develop SRE solutions (monitoring and alerting, machine learning anomaly detection, self-healing and reliability testing)

• Apply a design-thinking and agile mindset when working with SREs, Scrum Masters, and Incident Leads


• Contribute to and leverage best practices in SRE

• Simplify development by building repeatable solutions to manual tasks

• Support the unit’s goals to adopt automation solutions for applications in scope

Production Support:

• Perform a production support role, including off-hours support and rotational on-call support, compensated with overtime pay, lieu time, and an on-call allowance


• Assist in incident management and problem management for applications in scope

• Evaluate continuously: what went well, what went wrong, and what can be done to improve and to prevent recurrence

• Maintain technology currency (perform server patching, certificate renewal, etc.) with a keen eye on automation opportunities

• Ensure availability and uptime of applications in scope, as per service level objectives


• Ensure compliance of all systems and applications in scope, including maintaining segregation of duties

Technical Consultation:

• Support initiatives outside of application or squad level scope

• Consult with other teams in RBPT and the enterprise on products built

Innovation and Learning:

• Stay abreast of technology change and learn constantly, through official training assignments and self-assigned learning


• Provide demos of new technology findings to the team at large


Must have:

• 3-5 years of experience in related field

• Advanced knowledge of the following SRE practices and technologies
  o Python, YAML, Shell scripting
  o Azure, Linux
  o Dynatrace, Prometheus, PagerDuty, Moog, Splunk, Elastic, Azure Monitor
  o Chaos Engineering
  o MQ, Kafka

• Perform production support role, including off-hours support

• In-depth hands-on experience in a variety of SRE tools (Ansible, Azure Automation, Catchpoint)

Good to have:

• A Bachelor’s degree in Computer Science or related technical field (Example:

To apply, please visit the following URL: