Near Palo Alto, CA
Created Dec 10, 2020
We are seeking a Senior Site Reliability Engineer to join an SRE team in creating best practices and solutions that keep the Rivian Digital Commerce sites and applications highly available and reliable. This is an exciting role working with software engineering teams from the ground up to build cloud-based solutions using the latest technologies, tools, and practices. The right candidate will be passionate about site reliability and about serving millions of customers with full automation and limited downtime.
This is what you’ll do:
- Work with engineering teams to deliver high quality products and solutions that delight Rivian customers.
- Work with engineering teams to design robust cloud-based architectures and redundant, fault-tolerant solutions utilizing practices around CI/CD, blue-green deployments, canary testing, and traffic management.
- Define non-functional requirements (NFRs) for engineering teams around security, logging, monitoring, alerting, configuration, and testing, and work with those teams as they implement their apps and services.
- Develop runbooks and standard operating procedures (SOPs) for each service and application to ensure DevOps and SRE teams can detect incidents or issues before customers are impacted and act quickly to restore impacted services.
- Define practices and procedures around postmortems and root cause analysis to ensure service quality and maintainability KPIs are improving and downtime and service interruption are negligible.
- Work collaboratively with various stakeholders to provide team-based solutions, creating a culture of inclusion and diversity of skillsets.
- Participate in a 24x7 on-call rotation and define and implement on-call practices and procedures.
This is what you’ll need:
- 5+ years in a technical role in Site Reliability, Operations, Systems Administration, or Cloud Infrastructure.
- 5+ years of experience being responsible for the uptime and reliability of customer-facing web or mobile applications and critical services.
- 5+ years of experience maintaining and administering large-scale Linux-based environments with best practices for security and automation.
- 5+ years of experience providing and maintaining cloud-based infrastructure such as AWS, GCP, or Azure, or internal data center solutions based on vSphere, OpenStack, etc.
- 3+ years implementing and maintaining monitoring and alerting systems, creating service level indicators (SLIs), service level objectives (SLOs), and focusing on systems that self-heal or alert teams to take action before system downtime.
- 3+ years designing and operating fault-tolerant systems with zero or near-zero downtime.
- Expert knowledge of monitoring systems such as AppDynamics, New Relic, Prometheus, Grafana, Graphite, Nagios, AWS CloudWatch, etc.
- Knowledge of network architectures, security, and troubleshooting of connectivity or latency issues.
- Comfortable managing deployments of several thousand nodes and the automation it takes to ensure system uptime and redundancy.
- Experience with Docker, Kubernetes (K8s), or AWS Lambda is a plus.
- Proficiency in writing automation scripts and tools using Bash, Python, awk, etc.
- Bachelor’s degree in computer science, electrical engineering, or information systems, or equivalent work experience.