Category-defining tech. Career-defining work.
Lots of tech companies disrupt, but many fail when they try to scale. We're different. CockroachDB makes it easier for companies to build and scale apps. This is how and why we're helping some of the most innovative companies on the planet. We tackle problems head-on and focus on solutions that create lasting impact.
Because when our customers win, we all win. 
The Role
CockroachDB provides the backbone for storing data on a global scale. Our core mission on the SRE team is to operate a secure and reliable Cockroach Cloud product at scale. We provide consultation, planning, architectural oversight, concrete designs, development, and implementation that improve the resilience, efficiency, performance, and availability of our Cloud Service. We also take pride in being good on-call engineers. We believe regular reflection on the experience of being on-call can contribute, in the short, medium, and long term, to improvements to the core product, including CRDB itself.
As a Site Reliability Engineer you’ll help manage and scale our CockroachCloud service, a fully managed global offering of CockroachDB spanning multiple cloud providers.
You will oversee our production system, ensuring that we can provide stable and scalable infrastructure as we deliver CockroachDB to our customers.
You Will
- Manage the infrastructure for cloud services, including running internal production systems and hosting CockroachDB for our external customers.
- Design, write and deliver software and systems to increase product reliability and operational efficiency.
- Develop custom tools as necessary.
- Keep a complex system running and solve problems relating to mission-critical services.
- Design, implement, operate, and troubleshoot the automation and monitoring of production clusters to maximize performance and availability.
- Drive the company through disaster recovery tests, where we manually turn down pieces of CockroachDB to test its overall resilience to failures.
- Participate in an on-call rotation for our production systems and hosted services.
The Expectations
In your first 30 days, you will onboard and be exposed to our current internal and customer-facing production systems. Working with our existing SRE and engineering teams, you will pair on production operations and build out runbooks for the operation of different systems. We believe that it's essential for you to take this first month to become familiar with our technology and our company.
After 3 months, you'll be fully integrated into the team. You will develop and own tooling for reliability, automation, and other issues related to CockroachCloud’s stability and scalability.
You will identify new opportunities for automating processes, streamlining delivery, deploying new core functionality, and building great tools. You will help make CockroachCloud the best platform to host CockroachDB on by bringing your expertise to our database.
You Have
- Expertise in analyzing, monitoring, and troubleshooting large-scale distributed systems.
- Experience in software development using one or more of the following: Go, C, C++, Python, Java.
- Proficiency working with algorithms, data structures, and production troubleshooting.
- Expertise in working with major cloud providers (AWS, Azure, GCP, etc.) and Cloud APIs.
- Experience debugging and optimizing code and automating routine tasks.
- Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.).
- Prior on-call experience, exhibiting a sense of ownership, attention to detail, and urgency.
- Experience building collaborative relationships with your colleagues. You enjoy being part of the code review process and partnering with your teammates on challenging problems.
The Team 
We are a group of software engineers first & foremost. We use software engineering as a means to achieve our mission; this is the SRE way. The SRE team is currently distributed across North America (5) and India (4).
Reporting to Tom Schmidt - Sr. Manager, Engineering (Site Reliability Engineering)
Tom recently joined Cockroach Labs as manager of Site Reliability Engineering and has taken responsibility for Cockroach Cloud’s production operations. Tom joined Cockroach Labs after 15 years at IBM, where he initially contributed in a wide variety of technical leadership roles, generally focusing on quality and automation across compiler development, test frameworks, CI/CD, and more. Over the past 7 years, Tom has become an enthusiastic advocate of the Site Reliability Engineering discipline, presenting on the topic at conferences, developing certification curriculum, and securing multiple patents.
Tom was also a primary contributor to the establishment of IBM's formal SRE profession and was recognized as one of the first three SRE Thought Leaders within the company. Most recently, Tom transitioned into a management role where he introduced Site Reliability Engineering to the IBM Business Analytics organization, building an SRE team from the ground up, eventually managing over 20 individuals across 3 project areas while establishing practices that now guide over 80 engineers internationally. Cockroach Labs presented a new and unique opportunity to gain experience in a fast-paced startup environment, laying the foundation for scalable reliability as we prepare for the rapid growth of our Cockroach Cloud offering. Beyond the business, Tom is blessed to call himself a proud father of a 4-year-old boy, and otherwise enjoys finding balance between spending time in nature (hiking, camping, exploring) and testing his mettle in competitive gaming.
Jordan Lewis - Senior Director of Engineering
Jordan is the Head of Engineering for CockroachDB Cloud. He’s responsible for the teams that build and maintain CockroachDB Cloud and keep it reliably serving the needs of Cockroach Labs’ most demanding customer base. He joined Cockroach Labs as a database engineer in 2016, when it was just 25 people, before moving into engineering leadership and most recently moving to lead the Cloud organization. Jordan lives in his hometown of Brooklyn, NY, with his wife. Outside of work he enjoys folk music and riding his electric scooter around town.
Isaac Wong - EVP of Engineering
Isaac is responsible for the health of the engineering organization at Cockroach Labs. He partners closely with teams to ensure we have a balanced culture that promotes quality and innovation in pursuit of our goals. Before joining Cockroach Labs, Isaac was in life sciences for 16 years with Medidata Solutions, where he had a front-row seat on the exciting ride from a 30-person startup to more than 2000 people worldwide. But the lure of distributed, resilient, and consistent SQL databases, along with the amazing technology and culture at Cockroach Labs, proved too much.
When not working he likes to draw, play the piano and search NYC for cannolis with his wife and kids.
Cockroach Labs is proud to be an Equal Opportunity Employer building a diverse and inclusive workforce. If you need additional accommodations to feel comfortable during your interview process, please email us at accessibility@cockroachlabs.com.
Cockroach Labs has a hybrid work model, with Roachers that are local to one of our offices coming in on Mondays, Tuesdays, and Thursdays and working flexibly the rest of the week. While we’ve learned valuable lessons working remotely, nothing can replace the connection, creativity, and fun that occurs when Roachers get together and we are committed to fostering a workplace that encourages collaboration and allows us all to do our best work.
Benefits
- Stock Options
- Medical Insurance
- Vision Insurance
- Dental Insurance
- Life and Disability Insurance
- Professional Development Funds
- Flexible Time Off
- Paid Holidays
- Paid Sick Days
- Paid Parental Leave
- Retirement Benefits
- Mental Wellbeing Benefits
- And more!
The annual anticipated base salary range for U.S. candidates for this role is listed in USD below. Salary is one component of the Cockroach Labs Total Rewards package, which also includes, for each employee: stock options, medical insurance, vision insurance, dental insurance, life and disability insurance, funds towards professional development resources, flexible paid time off, 11 paid holidays a year, 10 paid sick days a year, paid parental leave, a 401(k) plan, and wellbeing benefits. We set standard ranges for all U.S.-based roles based on function, level, and geographic location, benchmarked against similar stage growth companies.
Actual salaries may vary and fall outside of this range depending on factors such as a candidate’s qualifications, geographic location, skills, experience, and competencies. In addition, we are often open to a wide variety of profiles, and recognize that the person we hire may be less experienced (or more senior) than this job description as posted. Salaries for candidates outside the U.S. will vary based on local compensation structures. This position will remain posted until filled. Applicants should apply via our Careers Page.
Annual Anticipated Base Salary Range (U.S.)
$179,000 to $236,900 USD