Job Functions
- Designing and implementing cloud infrastructure
- Automating infrastructure provisioning and management using Terraform & Python
- Collaborating with development teams to optimize cloud resources and enhance system reliability
- Developing and maintaining monitoring and alerting systems to proactively identify and resolve issues affecting the reliability of our writing solutions
- Conducting post-mortem analyses of system failures to identify root causes and implement preventive measures
- Optimizing and scaling our cloud infrastructure to support growing user demand and ensure cost efficiency

Job Requirements
- Bachelor's degree in Computer Science, Engineering, or a related technical field
- Expertise in Site Reliability Engineering with a minimum of 7 years of hands-on experience
- Deep understanding of system architecture and infrastructure design to ensure high availability and performance
- Strong proficiency in programming languages such as Python, Java, or Go for automation and monitoring
- Experience with cloud platforms like AWS, Azure, or GCP, and their respective services for scalable and resilient systems
- Expertise in containerization technologies (e.g., Docker, Kubernetes) and orchestration tools
- Knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack) to maintain system health and performance

Skills
- Proven expertise in Site Reliability Engineering with a minimum of 7 years of hands-on experience
- Deep understanding of system architecture and infrastructure design to ensure high availability and performance
- Strong proficiency in programming languages such as Python, Java, or Go for automation and monitoring
- Experience with cloud platforms like AWS, Azure, or GCP, and their respective services for scalable and resilient systems
- Expertise in containerization technologies (e.g., Docker, Kubernetes) and orchestration tools
- Knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack) to maintain system health and performance
- Excellent communication skills to collaborate effectively with cross-functional teams and stakeholders
- Proactive approach to identifying and mitigating potential system failures and performance bottlenecks
- Ability to lead and mentor junior engineers in best practices for reliability and system optimization