This position is on the Production Operations Team, which is responsible for Zoox's application, compute, and storage infrastructure. A successful candidate for this role will have strong project management and organizational skills to complement their technical skills. This individual will work with a team of developers and other infrastructure engineers to design, implement, and maintain Zoox's global, internal, and external cloud infrastructure.
Responsibilities
Build CPU and GPU clustered compute systems
Design, implement, and support our internal and cloud systems
Track key metrics and logs
Oversee capacity planning for our clusters
Work directly with application developers to help investigate upgrades, system tweaks, and next-generation hardware
Participate in a 24x7 on-call rotation
Qualifications
10+ years of experience and ability to work with little or no supervision
Familiar with GPU usage in compute clusters
Familiar with CUDA and TensorFlow workloads
Expert-level knowledge of virtualization platforms (vSphere, Xen, Docker, or KVM)
Experience with large HPC clusters (>10,000 cores)
Familiar with container clustering (Kubernetes/K8s, Swarm, etc.)
Familiar with job and resource schedulers such as Slurm (preferred) or LSF
Ability to script in any of the following: Perl, Python, Ruby, or Bash
About Zoox
Zoox is developing the first ground-up, fully autonomous vehicle fleet and the supporting ecosystem required to bring this technology to market. Sitting at the intersection of artificial intelligence, robotics, and design, Zoox aims to provide the next generation of mobility-as-a-service in urban environments. We’re looking for top talent that shares our passion and wants to be part of a fast-moving and highly execution-oriented team.
You do not need to match every listed expectation to apply for this position. Here at Zoox, we know that diverse perspectives foster the innovation we need to be successful, and we are committed to building a team that encompasses a variety of backgrounds, experiences, and skills.