As humans, our vision and hearing are wired to navigate our fast-paced, ever-changing environment. Our brains constantly scan for potential risks and hazards, taking in new information before determining our next action. Vision is one of the most important safety features we have while driving. It is also one of our primary senses, shaping how we perceive the world. 

But how does a vehicle that lacks human senses perceive and make sense of its environment? How can robotics reproduce that instinct for survival and perceive the world around us? The Zoox perception system is the ‘eyes and ears’ of our vehicles. The data it collects helps the vehicle make sense of the real-world situations and scenarios unfolding around it. For an autonomous vehicle, perception software has three main objectives: detecting objects and obstacles, tracking those objects over time, and determining their attributes (such as location, speed, and direction) and classifications.
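
To make those three objectives concrete, here is a minimal sketch of what a single perception output record might hold. The class names, fields, and labels are hypothetical and chosen only for illustration; they do not describe the actual Zoox software.

```python
from dataclasses import dataclass
from enum import Enum


class ObjectClass(Enum):
    """Hypothetical classification labels, for illustration only."""
    VEHICLE = "vehicle"
    PEDESTRIAN = "pedestrian"
    CYCLIST = "cyclist"
    UNKNOWN = "unknown"


@dataclass
class TrackedObject:
    """One object that has been detected, tracked, and classified."""
    track_id: int        # stable ID that persists across frames (tracking)
    x_m: float           # position in the vehicle frame, metres (attribute)
    y_m: float
    speed_mps: float     # speed in metres per second (attribute)
    heading_rad: float   # direction of travel in radians (attribute)
    object_class: ObjectClass = ObjectClass.UNKNOWN  # classification


def describe(obj: TrackedObject) -> str:
    """Return a short human-readable summary of a tracked object."""
    return (f"track {obj.track_id}: {obj.object_class.value} at "
            f"({obj.x_m:.1f}, {obj.y_m:.1f}) m, {obj.speed_mps:.1f} m/s")
```

A detection creates the record, tracking keeps `track_id` stable from one frame to the next, and the remaining fields carry the attributes and classification.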

The all-seeing eyes

In ideal situations, humans are pretty good at navigating the driving environment. Unfortunately, humans are also fallible. Fatigue and intoxication, as well as low visibility or heavy rain, can impair our judgment and reaction speed. Even the best human drivers have blind spots. Zoox vehicles don’t get bored or sleepy. They can also ‘see’ much better than us. Unlike our limited field of view, Zoox’s sensors provide a 360-degree field of view.

Sensors on each corner of the Zoox vehicle enable it to ‘see’ equally in all directions

“When it comes to perception, the goal is clear,” said RJ He, Director of Perception. “We seek the most accurate understanding of the world around the vehicle. We rely on redundant sensors to achieve a high degree of robustness.”

The Zoox sensor architecture consists of identical sensor pods strategically positioned on each of the four corners of the vehicle. The pods combine five sensing modalities to see (and hear!) their surroundings: cameras, lidars, radars, long-wave infrared (LWIR) sensors, and microphones. Each modality gives us complementary information, so we can create an accurate representation of the world.

  • Cameras are closest to human vision and provide high-resolution color information, which is helpful when looking for traffic lights or attributes like pedestrian gestures. But they are also susceptible to severe weather conditions and can perform worse in low-visibility situations, such as when detecting dark objects at night.
  • Radar transmits radio waves in a targeted direction; the receiver picks up the reflected waves and uses them to determine objects’ location and speed. Radar is especially useful for long-range detections, providing velocity measurements with low latency and sensing motion even in adverse weather conditions.
  • Lidar – or light detection and ranging – uses light in the form of a pulsed laser to calculate the distance to objects in the vehicle’s surroundings. Millions of accurate 3D points per second paint a detailed map of what’s going on all around the vehicle, providing high confidence that no relevant object will go undetected.
  • Long-wave infrared (LWIR) thermal imaging cameras sense heat and differentiate between objects based on their temperature, which is particularly advantageous for detecting people and animals, even at night and in adverse weather conditions.
  • Audio sensors (microphones) act as the ears of our vehicles and are invaluable for detecting sirens from emergency vehicles, and even discerning their direction of arrival.
The Zoox sensor pod
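
As a rough illustration of the lidar principle described above, the distance to an object follows from the time a laser pulse takes to travel out and back. This is a textbook time-of-flight sketch, not Zoox code; the function name is hypothetical.

```python
SPEED_OF_LIGHT_MPS = 299_792_458.0  # metres per second, in vacuum


def lidar_range_m(round_trip_time_s: float) -> float:
    """Distance to a target from the round-trip time of a laser pulse."""
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the target and back, so halve the path length.
    return SPEED_OF_LIGHT_MPS * round_trip_time_s / 2.0
```

For example, a pulse that returns after 200 nanoseconds corresponds to a target roughly 30 metres away.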

Like everything on the Zoox vehicle, this sensor positioning is deliberate. Beyond safety, the configuration also enables a fail-operational architecture: in the unlikely event that an individual sensor fails, the vehicle can still complete its drive. Tricky blind spots are now a thing of the past, which is critically important when navigating dense urban environments crowded with pedestrians, bicyclists, and other road users.

Bringing it all together

Now that we have all the raw data collected from the modalities mentioned above, it’s time to combine everything. This process is called sensor fusion.

“Sensor fusion combines data from the different modalities to detect and classify agents and objects around our vehicle,” said He. “This gives us less uncertainty than relying on a single sensor modality. Our perception algorithms and machine learning models leverage the strengths of each sensor modality to build a robust understanding of the world around our vehicle.” 
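
One simple way to see why combining modalities reduces uncertainty is inverse-variance weighting, a standard textbook technique for merging two independent measurements of the same quantity. The sketch below is an illustration under that assumption only; the function name and the example numbers are hypothetical, and production fusion relies on far richer algorithms and learned models.

```python
def fuse_estimates(value_a: float, var_a: float,
                   value_b: float, var_b: float) -> tuple[float, float]:
    """Fuse two independent measurements by inverse-variance weighting.

    Returns the fused value and its variance. The fused variance is
    always smaller than either input variance.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    weight_a = 1.0 / var_a  # more certain measurements get more weight
    weight_b = 1.0 / var_b
    fused_var = 1.0 / (weight_a + weight_b)
    fused_value = fused_var * (weight_a * value_a + weight_b * value_b)
    return fused_value, fused_var


# Hypothetical readings of the same distance: one sensor reports 20.0 m
# with variance 0.04, another reports 20.6 m with variance 0.16.
fused_range_m, fused_variance = fuse_estimates(20.0, 0.04, 20.6, 0.16)
```

The fused estimate sits closer to the more certain reading, and its variance is lower than that of either sensor alone.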

Our perception output is then used by other components of the AI stack, namely, prediction and planner. Prediction forecasts potential behavioral outcomes by other ‘agents’ – for example, cars, pedestrians, and cyclists – that have already been identified and classified by perception. Then planner – the vehicle’s executive decision-maker – determines the best route for the vehicle to take.
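
To show how perception output feeds prediction, here is a deliberately simple baseline: a constant-velocity forecast that extrapolates an agent's current position, speed, and heading over a short horizon. This is an assumption made for illustration, with hypothetical names and parameters; it is not how the Zoox prediction component works.

```python
import math


def predict_positions(x_m: float, y_m: float, speed_mps: float,
                      heading_rad: float, horizon_s: float,
                      step_s: float) -> list[tuple[float, float, float]]:
    """Constant-velocity forecast as a list of (time_s, x_m, y_m) points."""
    if step_s <= 0:
        raise ValueError("step_s must be positive")
    # Split the speed into x and y components using the heading.
    vx = speed_mps * math.cos(heading_rad)
    vy = speed_mps * math.sin(heading_rad)
    steps = int(round(horizon_s / step_s))
    return [(k * step_s, x_m + vx * k * step_s, y_m + vy * k * step_s)
            for k in range(1, steps + 1)]
```

A planner could then compare forecasts like these against candidate routes for the vehicle.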

Fusing the data from our sensors provides our vehicles with a highly accurate view of what is happening around them

From perception to planning, everything happens in real time. “Zoox vehicles must be able to react to any environmental change as they occur,” said He. “When our vehicles encounter a new scenario, they will approach cautiously, not unlike what a human driver would do. Our vehicles also understand when there are temporary environmental changes, such as unmapped construction zones.”

Optimal perception, performance, and safety

Designing an autonomous vehicle from the ground up is challenging and time-consuming, but it does provide benefits. One is having complete control of the perception systems. Zoox has implemented three complementary real-time perception systems that run simultaneously, each independent of and redundant to the others. The first is the main AI system, which provides our primary perception output.

“The first system uses a variety of machine learning models and obtains a sophisticated understanding of the world around the vehicle,” said He. “This includes our vision, lidar, and radar sensors—all needed for detection, tracking, and segmentation. We also use a motion modeling system that gives our system an understanding of how agents move through the world.”

The second and third systems act as collision-avoidance systems. They both focus on checking for possible obstructions in the Zoox vehicle’s direct path that might lead to a collision. One of these systems consists of geometric, interpretable algorithms that operate on sensor data and are responsible for detecting objects in our intended driving path. The other is a machine-learned algorithm, which we call Safety Net, that performs both detection and prediction of future movement over a short time horizon, 360 degrees around our vehicle.

These two systems are architecturally different from the main AI system to avoid common-cause failures. They are also optimized for low end-to-end latency, allowing our vehicle to react to sudden obstructions, such as a pedestrian who runs in front of our vehicle after being occluded by a vehicle on the side of the road. If the probability of a future collision meets a certain threshold, the system will trigger the vehicle to stop.
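
The two ideas above, a geometric check of the intended driving path and a stop triggered by a probability threshold, can be sketched in a few lines. Everything here is a simplifying assumption for illustration: a straight rectangular corridor, a vehicle frame with x pointing forward and y to the left, and hypothetical function names and values. It is not the Zoox collision-avoidance system.

```python
from typing import Iterable, Tuple


def obstacle_in_path(points_xy: Iterable[Tuple[float, float]],
                     corridor_length_m: float,
                     corridor_half_width_m: float) -> bool:
    """True if any sensed point lies inside a straight corridor ahead.

    Points are (x, y) in metres in the vehicle frame, with the origin
    at the front of the vehicle (an assumption of this sketch).
    """
    return any(0.0 <= x <= corridor_length_m
               and abs(y) <= corridor_half_width_m
               for x, y in points_xy)


def should_stop(collision_probability: float, threshold: float) -> bool:
    """Trigger a stop when the collision probability meets the threshold."""
    if not 0.0 <= collision_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return collision_probability >= threshold
```

Because a check like this is purely geometric, its decisions are easy to interpret, and it shares no learned components with a machine-learned system running alongside it.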

Long-tail cases and next steps

One of the biggest challenges in perception technology is solving long-tail edge cases. Edge cases are scenarios that have a low probability of occurring. Here, humans are sometimes instinctively more proficient at interpreting the semantics of an unusual scene; however, our perception technology continues to improve as our software gets smarter and we train on more data. There are many innovations, techniques, and algorithms in development across all aspects of our perception system.

Zoox has adopted a sophisticated and comprehensive verification and validation strategy, spanning real-world dense urban testing, probabilistic testing in simulation, and structured testing to enable our vehicles to perceive and make sense of their environment. 
