At Lyft, community is what we are and it’s what we do. It’s what makes us different. To create the best ride for all, we start in our own community by creating an open, inclusive, and diverse organization where all team members are recognized for what they bring.
Passengers rely on Lyft to get to work, to go to the doctor, or to get home safely when public transit has stopped running. Drivers use Lyft for income and flexibility. Building a stable and reliable application for our passengers and drivers is a responsibility we take very seriously. To deliver a consistent and highly reliable user experience, we are building out a team of Software Engineers focused on reliability.
This Reliability Software Engineering (RSWE) team will standardize and support the rapidly growing teams throughout our organization by assessing their architecture, helping them design scalable services, and fostering excellent operational practices. The role is mission-critical: it ensures that our systems are always healthy, monitored, automated, and designed to scale.
Data is the core of our business at Lyft, helping us create an exceptional transportation experience for our customers and providing insights into the effectiveness of our product and features. To support this, we operate extensive big data infrastructure in the AWS cloud. In addition to relying on big data compute engines like Hive and Presto, we build an ecosystem of tools and services that allows all Lyft teams to use the platform as a cohesive service. We are also building a next-generation streaming platform based on Apache Flink and Apache Kafka. Our platform runs thousands of jobs, processes billions of events, and supports hundreds of data analysts and engineers across the company.
As a member of the cross-functional RSWE team, you will embed with the engineers in Data Platform to develop a reliable data infrastructure that scales with the company's incredible growth.
- Build holistic visibility into SLIs, SLOs, SLAs, dependency graphs, past job performance, and system load to bring much-needed clarity to job executions.
- Build infrastructure and drive projects that deliberately break things with the aim of improving production systems.
- Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, and production readiness reviews to run the Data Platform Infrastructure.
- Step back to observe patterns and develop innovative tools and automation to minimize toil. Use what you learn to drive the best operational practices.
- Partner with the broader Lyft organization to build a culture of rigorously learning from incidents.
Experience & Skills:
- Extensive programming experience in Python or Go
- Passion for building tools and automation to make infrastructure more robust
- Experience working with public cloud platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure, etc.)
- Experience designing, debugging, and running fault-tolerant, large-scale distributed systems
- Hands-on experience with the Hadoop (or a similar) ecosystem (YARN, Hive, HDFS, Spark, Presto, Parquet, HBase, Flink, Kafka, Kinesis) is a plus
Lyft is an Equal Employment Opportunity employer that proudly pursues and hires a diverse workforce. Lyft does not make hiring or employment decisions on the basis of race, color, religion or religious belief, ethnic or national origin, nationality, sex, gender, gender identity, sexual orientation, disability, age, military or veteran status, or any other basis protected by applicable local, state, or federal laws or prohibited by Company policy. Lyft also strives for a healthy and safe workplace and strictly prohibits harassment of any kind. Pursuant to the San Francisco Fair Chance Ordinance and other similar state laws and local ordinances, and its internal policy, Lyft will also consider for employment qualified applicants with arrest and conviction records.