Epsilon Solutions Ltd.
Title: SRE Engineer
Location: Toronto, ON (Hybrid)
Details & Criteria
Functional Title: SRE Engineer
SRE: Distributed, Dynatrace, Catchpoint, incident handling, PCF, debugging skills, good communication
SRE Job Description:
What will SREs do?
• Provide hands-on 24×7 SRE support, including incident management, problem management, root cause analysis, monitoring, alerting, and maintenance of infrastructure and compliance
• Track, audit, monitor, and deliver on technical work streams
• Act as portfolio SME (Subject Matter Expert): understand and document the common components, core functionality, and infrastructure of supported applications
• Be an escalation point in the on-call rotation, and support our maintenance, scheduled work, and release deployment requirements
• Lead incident management and problem management for applications in scope, and own the fulfillment of RCA action items
• Focus on continuous improvement and technical standards: drive improvements in productivity, monitoring, tooling, and best practices
• Manage technology currency (server patching, certificate renewal, compliance, etc.) with a keen eye on automation opportunities
• Drive best-in-class technical solutions by closely tracking industry-leading solutions and applying them to the RBC environment and its needs
• Leverage the value in unit, department, and enterprise-wide teams to develop better solutions and achieve a cross-enterprise mindset
Engineering:
• Develop SRE solutions (monitoring and alerting, machine learning anomaly detection, self-healing and reliability testing)
• Apply design thinking and an agile mindset when working with SREs, Scrum Masters, and Incident Leads
• Contribute to and leverage best practices in SRE
• Simplify development by building repeatable solutions to manual tasks
• Support the unit’s goals to adopt automation solutions for applications in scope
Production Support:
• Perform a production support role, including off-hours and rotational on-call support, compensated with overtime pay, lieu time, and an on-call allowance
• Assist in incident management and problem management for applications in scope
• Evaluate continuously: what went well, what went wrong, and what can be done to improve and prevent recurrence
• Maintain technology currency (perform server patching, certificate renewal, etc.) with a keen eye on automation opportunities
• Ensure availability and uptime of applications in scope, as per service level objectives
• Ensure compliance of all systems and applications in scope, including maintaining segregation of duties
Technical Consultation:
• Support initiatives outside of application or squad level scope
• Consult on product builds for other teams in RBPT and across the enterprise
Innovation and Learning:
• Stay abreast of technology change and learn constantly through official training assignments and self-assigned learning
• Demo new technology findings to the team at large
Must have:
• Advanced knowledge of the following SRE practices and technologies:
  o Python, YAML, shell scripting
  o Azure, Linux
  o Dynatrace, Prometheus, PagerDuty, Moog, Splunk, Elastic, Azure Monitor
  o Chaos Engineering
  o MQ, Kafka
• 3-5 years of experience in a related field
• Perform production support role, including off-hours support
• In-depth hands-on experience in a variety of SRE tools (Ansible, Azure Automation, Catchpoint)
Good to have:
• A Bachelor’s degree in Computer Science or related technical field (Example: