Senior Digital Platform Ops Specialist
Job Description
We are looking for a Senior Digital Platform Ops Specialist to lead and manage a squad responsible for the end-to-end stability, performance, and reliability of CelcomDigi’s digital platforms that support our consumer and enterprise services. This role encompasses cloud/application support, including troubleshooting, monitoring and issue resolution, as well as vendor management and SRE automation practices. The ideal candidate will drive operational excellence across the squad and build reliability into systems to ensure seamless service continuity across our digital ecosystem.
Responsibilities
- Lead and provide application support, including monitoring system health to detect issues proactively, supporting application release activities, ensuring timely resolution of incidents with root cause analysis, and maintaining service compliance with defined SLAs.
- Manage and optimize cloud infrastructures to ensure scalability, reliability, and cost efficiency, including capacity planning and resource utilization optimization.
- Drive the implementation of SRE principles such as error budgets and service level objectives (SLOs), automating repetitive activities to reduce toil and enable higher-value engineering initiatives.
- Manage third-party vendors to ensure delivery of their responsibilities and compliance with agreed support scope, quality standards, and SLAs, including handling vendor escalations and performance issues.
- Drive continuous improvement by identifying operational gaps, proposing enhancements, mentoring junior team members and developing a comprehensive knowledge base of best practices, SOPs, and playbooks.
Requirements
- Bachelor's degree in Computer Science, Software Engineering, or a related field.
- 5+ years of experience in digital platform operations, large-scale IT systems support, or SRE-related roles.
- Strong understanding of mobile/web application architecture, APIs, and middleware.
- Proficiency with tools such as Dynatrace, Firebase, AWS CloudWatch, Datadog, PRTG, Sentry, or similar platforms for monitoring and incident management.
- Proficiency in automation scripting (Python, Bash, or equivalent).
- Strong vendor management experience with the ability to enforce governance.
- Excellent communication and collaboration skills, especially during incident management.
Job Segment:
Technical Support, Cloud, Computer Science, Software Engineer, Engineer, Technology, Engineering