Title: Senior Observability / SRE Engineer (Splunk | Linux | Python)
Location: Remote
Duration: Long term
We are looking for a highly experienced Senior Observability & Site Reliability Engineer to support large-scale enterprise platforms and mission-critical applications.
The ideal candidate will have deep hands-on experience in building and operating end-to-end monitoring, logging, and alerting solutions across distributed environments.
This role involves close collaboration with development, infrastructure, and operations teams to ensure platform reliability, performance visibility, and incident response effectiveness.
Key Responsibilities:
- Design, implement, and maintain enterprise observability solutions using Splunk Enterprise, including dashboards, alerts, and data ingestion pipelines
- Develop and enhance monitoring frameworks for infrastructure, applications, and web platforms
- Automate operational processes using Linux shell scripting and Python
- Implement intelligent alerting strategies to reduce noise and improve incident response efficiency
- Provide L3 production support for business-critical applications and infrastructure
- Support cloud and containerized deployments across AWS and Kubernetes environments
- Collaborate with engineering teams to standardize logging and telemetry practices
- Drive root cause analysis, post-incident reviews, and continuous reliability improvements
- Build operational runbooks, disaster recovery procedures, and service continuity plans
- Integrate monitoring and deployment workflows with CI/CD tools such as Jenkins, Git, and TeamCity
- Support database monitoring and performance analysis across SQL Server, Oracle, DB2, and MySQL platforms
- Participate in ITIL-based change, incident, and problem management processes
Required Skills:
- Strong hands-on expertise in Splunk engineering, administration, and architecture
- Advanced experience in Linux / Unix environments
- Proficiency in Python, Shell scripting, and automation frameworks
- Experience with AWS cloud services and Kubernetes / Docker platforms
- Knowledge of monitoring tools such as Nagios and custom observability solutions
- Experience supporting high-availability web platforms and distributed systems
- Strong troubleshooting and production incident management skills
- Understanding of CI/CD pipelines and deployment automation
- Familiarity with ITIL processes and service management tools like ServiceNow
Preferred Qualifications:
- Splunk certifications (Power User / Admin / Architect)
- Experience building large-scale telemetry platforms
- Background in financial services or high-transaction enterprise environments
- Experience designing intelligent alerting and automated incident workflows
Experience Level:
- 15+ years in production engineering / SRE / observability roles
- Prior experience supporting mission-critical enterprise systems
For applications and inquiries, contact: hirings@openkyber.com