As autonomous driving moves into production, the complexity of safety-driven autonomous software becomes more evident. The complexity arises from several factors, including safety compliance, data-driven safe architecture, integration, test coverage, virtual simulation, mileage coverage, and homologation. As a result, most production programs for autonomy level L3 and above face delays. Further, meeting the ISO 26262 safety requirements across the complete Autonomous Driving Software Stack is critical to fulfilling the requirements of L3 and higher levels of autonomy.
A “Fail Operational” safe redundant software layer, acting as a Secondary Channel, provides the redundancy that the Autonomous Driving Software Stack needs to meet ASIL D requirements. Although each function and feature in the Primary Channel must adhere to safety design, a Secondary Channel is also critical for ASIL D compliance: its “Fail Operational” function takes over the dynamic driving task when the Primary Channel fails, so that the autonomous vehicle continues to drive safely.
In this paper, a systematic approach to deriving a Fail-operational safe architecture is discussed. An autonomous highway pilot feature was taken as the system under consideration. Detailed system engineering activities and functional safety analyses were conducted to identify the necessity of the secondary channel. The concept was formalized using the ISO 26262 and ISO 15288 processes, and the architecture evolved to meet redundancy requirements. From a system engineering process point of view, the identified faults were analyzed, and a detailed technical analysis was performed for the perception component to derive Minimum Risk Elements (MREs), that is, Minimum Risk Conditions (MRCs) and Minimum Risk Maneuvers (MRMs), thus making the system Fail-operational. To obtain the MREs, further subcomponents were identified in the architecture and are discussed. Alternative architectures, Active-Standby versus Fail-operational Safe, were also analyzed before arriving at the Fail-operational architecture.
When primary channel failures are detected and the driver does not take control in time, the secondary channel function starts degraded operation that leads to a safe stop of the autonomous vehicle. This is achieved by fully functional Fail-operational software that includes sensing, perception, localization, planning, and motion control. To define the architecture of the secondary channel software, detailed system engineering and safety analyses were conducted according to the ISO 26262 process, as described later. The AD L3 feature is the system under consideration from which the item definition, operational conditions, and hazard analysis and risk assessment (HARA) are derived. As part of the HARA, the identified faults are analyzed and Minimum Risk Conditions (MRCs) are defined. To bring the system to a safe state during a failure, the MRCs become the basis for defining the MRMs. The Fail-operational architecture for the secondary channel is then defined from the MRMs.
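The takeover sequence described above can be sketched as a small state machine. This is an illustrative sketch only, not the architecture from the paper: the mode names, the `next_mode` function, and the `TAKEOVER_TIMEOUT_S` value are assumptions introduced here for illustration.

```python
from enum import Enum, auto


class Mode(Enum):
    PRIMARY = auto()            # primary channel performs the driving task
    TAKEOVER_REQUEST = auto()   # primary failure detected, driver alerted
    DEGRADED = auto()           # secondary channel executes the MRM
    SAFE_STOP = auto()          # MRC reached, vehicle at standstill
    DRIVER = auto()             # driver has taken control


TAKEOVER_TIMEOUT_S = 4.0  # hypothetical driver response budget, not from the source


def next_mode(mode, *, primary_ok, driver_in_control, waited_s, stopped):
    """Return the next mode; SAFE_STOP and DRIVER are terminal in this sketch."""
    if mode in (Mode.SAFE_STOP, Mode.DRIVER):
        return mode
    if driver_in_control:
        return Mode.DRIVER
    if mode is Mode.PRIMARY:
        return Mode.PRIMARY if primary_ok else Mode.TAKEOVER_REQUEST
    if mode is Mode.TAKEOVER_REQUEST:
        # waited_s is the time elapsed since the takeover request was issued
        timed_out = waited_s >= TAKEOVER_TIMEOUT_S
        return Mode.DEGRADED if timed_out else Mode.TAKEOVER_REQUEST
    # mode is Mode.DEGRADED: keep running the MRM until standstill
    return Mode.SAFE_STOP if stopped else Mode.DEGRADED
```

Keeping the transition logic in one pure function makes each path (driver takes over, driver times out, vehicle reaches standstill) easy to test in isolation.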
A Highly Automated Driving system, or L3+ system, relies mainly on the system itself for the entire Dynamic Driving Task. This includes longitudinal and lateral control, object detection and recognition, and Dynamic Driving Task fallback. For a Highly Automated Driving system, the fallback-ready user is expected to regain control at short notice when the system requests it. However, such systems cannot rely on the driver to respond to the request within the limited time available.
To solve this problem at higher levels of autonomy, the availability and improvement of the system are key. A Fail-operational architecture for the AD system is required to achieve an acceptable safety level, and it must execute a minimum risk maneuver to bring the system to a safe state if any component or sub-function fails. The Fail-operational system must detect errors caused by faults, assess the damage, recover from the error within the fault-tolerant time, and isolate the fault. The system ensures the integrity of the output data used to control the vehicle actuators and provides continuous safe operation in the presence of faults. One way to achieve Fail-operational behavior is to add redundant sensors, controllers, actuators, etc., but this adds cost, weight, and space, all of which are constrained in the automotive industry. Therefore, the solution must lie in the software stack.
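One common way to detect a failed channel within the fault-tolerant time is a heartbeat monitor whose detection deadline leaves enough margin for the reaction. The sketch below assumes this technique; the paper does not state how faults are detected, and the class name `ChannelMonitor`, its fields, and the timing values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChannelMonitor:
    ftti_s: float             # fault-tolerant time (assumed to come from the safety analysis)
    reaction_s: float         # assumed time the secondary channel needs to take over
    last_beat_s: float = 0.0  # timestamp of the last heartbeat from the primary channel

    def __post_init__(self):
        if not 0.0 <= self.reaction_s < self.ftti_s:
            raise ValueError("reaction time must be shorter than the fault-tolerant time")

    @property
    def detection_deadline_s(self) -> float:
        # Detection plus reaction must fit inside the fault-tolerant time.
        return self.ftti_s - self.reaction_s

    def beat(self, now_s: float) -> None:
        """Record a heartbeat received from the primary channel."""
        self.last_beat_s = now_s

    def primary_failed(self, now_s: float) -> bool:
        """True once the primary channel has been silent past the deadline."""
        return (now_s - self.last_beat_s) > self.detection_deadline_s
```

For example, with `ChannelMonitor(ftti_s=0.5, reaction_s=0.2)` the primary channel is declared failed once it has been silent for more than 0.3 s.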
Before we get into the details of the architecture, it is essential to understand the complete process followed to define, develop, and validate such safety functions. Figure 1 below shows the process workflow.
Figure 1: Safety process to define the Fail-operational architecture
A process ensures the systematic development of safe software. Usually, a detailed process is not followed diligently before the prototype is realized. For a Fail-operational system, however, it is critical to follow the process, because the architecture must be derived from the safety analysis artefacts.
Through elaborate system and safety development, the Fail-operational architecture is derived to meet the ASIL D requirement at the AD system level. Figure 2 below shows the system architecture, in which the primary and secondary functions together ensure safe driving on the highway. The primary and secondary functions are physically isolated, with independent power supplies, to avoid a single point of failure. Dissimilar software architectures (primary and secondary) provide more robustness and less bias, because they use distinct technologies and are owned by different owners.
Figure 2: Fail-operational architecture for the Secondary Functions
Based on the scenario in the architecture, each component failure is analyzed and multiple possible MREs are derived. These MREs are then grouped in the architecture (Figure 2) according to component layer. The goal of all these MREs is to reach the final MRC, which is a safe, comfort, or emergency stop. As an example, MRE_Group1 is elaborated to derive the internal MREs that achieve a Fail-operational condition in case of any component failure.
This architecture considers appropriate interfaces with the primary channel functions, for example, sensor interfaces, application interfaces, and middleware interfaces. Each layer of the architecture is divided into different MRE groups (Perception, Trajectory, Vehicle motion control, etc.). A smooth transition from primary system execution to the Fail-operational function must be ensured.
The perception component may fail because of one or more failures at the Object Fusion, Lane Fusion, and/or Sensor Fusion levels. For each component failure, a detailed analysis is performed to derive internal MREs and achieve limited operation.
As shown in Figure 3, Object Fusion can fail for several reasons, such as a camera detection failure, a radar detection failure, or both, leading to severe object detection failures. When a single sensor failure is detected, for example a camera failure, the perception module can still rely on the other sensors, especially if a moving target is detected by radar. Other techniques can also help to perform the fail operation, such as locking the target information once the target has been detected for more than a certain period and using this historical information from before the fault condition. Similarly, when radar detection fails, camera data errors due to pitch can be mitigated using gyro-based distance stabilization.
Figure 3: Derived possible MREs for Object Fusion Failure
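The target-locking idea mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the paper: the `Track` and `TargetLock` names, the threshold values, and the constant-velocity extrapolation used while detection is lost are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Track:
    range_m: float         # longitudinal distance to the target
    range_rate_mps: float  # relative speed (negative means closing)
    tracked_s: float       # how long the target has been continuously detected


class TargetLock:
    """Hold a well-established target and coast it briefly when detection drops out."""

    def __init__(self, min_tracked_s: float = 1.0, max_coast_s: float = 2.0):
        self.min_tracked_s = min_tracked_s  # hypothetical lock threshold
        self.max_coast_s = max_coast_s      # hypothetical limit on using history
        self._locked: Optional[Track] = None
        self._coast_s = 0.0

    def update(self, track: Optional[Track], dt_s: float) -> Optional[Track]:
        if track is not None:
            # Fresh detection: lock it only if it has been seen long enough.
            self._coast_s = 0.0
            self._locked = track if track.tracked_s >= self.min_tracked_s else None
            return track
        # Detection lost: fall back to history if a lock exists and is still fresh.
        if self._locked is None or self._coast_s + dt_s > self.max_coast_s:
            self._locked = None
            return None
        self._coast_s += dt_s
        prev = self._locked
        # Assumed constant relative velocity while coasting.
        self._locked = Track(prev.range_m + prev.range_rate_mps * dt_s,
                             prev.range_rate_mps, prev.tracked_s)
        return self._locked
```

Bounding the coasting time keeps stale history from being trusted indefinitely; once the limit is exceeded, `update` returns `None` and a more conservative MRE would have to take over.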
Lane Fusion Failure
When there is partial lane detection or low-confidence camera data, the ego lane geometry can be estimated from the paths of other traffic objects. If road boundaries or guard rails are detected (as clusters in the radar reflections), they can help to construct virtual lane information. In the worst-case scenario, where no lane information is available, the system can rely on front or side target information and follow the target at a safe distance.
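The lane fallbacks above form a cascade from the best available source to the most conservative one, which might be sketched as below. The `LaneInputs` fields, the threshold values, the returned labels, and the ordering between the traffic-path and road-boundary fallbacks are assumptions for illustration; the final `"stop"` stands in for the final MRC when no fallback is available.

```python
from dataclasses import dataclass


@dataclass
class LaneInputs:
    camera_confidence: float  # 0..1 confidence of camera lane detection
    traffic_paths: int        # number of tracked traffic object paths
    boundary_detected: bool   # road boundary or guard rail seen as radar clusters
    target_available: bool    # a front or side target is being tracked


CAMERA_CONF_MIN = 0.7   # hypothetical confidence threshold
MIN_TRAFFIC_PATHS = 2   # hypothetical minimum number of paths


def select_lane_source(inp: LaneInputs) -> str:
    """Pick the lane information source, from nominal to most degraded."""
    if inp.camera_confidence >= CAMERA_CONF_MIN:
        return "camera_lanes"
    if inp.traffic_paths >= MIN_TRAFFIC_PATHS:
        return "ego_lane_from_traffic_paths"
    if inp.boundary_detected:
        return "virtual_lane_from_road_boundary"
    if inp.target_available:
        return "follow_target_at_safe_distance"
    return "stop"
```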
IMU Sensor Failure
In the case of an intermittent IMU failure, vehicle dynamic models and vehicle states (from vehicle sensors), combined with linear, extended, or unscented Kalman filters, help to improve sensor accuracy. In the case of a severe failure, however, it is proposed that the architecture have a redundant sensor for the secondary channel.
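As a minimal sketch of the linear Kalman filter case, the example below estimates yaw rate from a model-based value (speed and steering angle through a kinematic bicycle model) and from the IMU when its sample is available. The class name `YawRateKF`, the noise values, the wheelbase, and the choice of a random-walk process model are assumptions made here, not details from the paper.

```python
import math


class YawRateKF:
    """Scalar Kalman filter: state is yaw rate (rad/s), modelled as a random walk."""

    def __init__(self, q=0.01, r_imu=0.001, r_model=0.05, wheelbase_m=2.8):
        self.x = 0.0            # yaw rate estimate
        self.p = 1.0            # estimate variance
        self.q = q              # process noise (hypothetical)
        self.r_imu = r_imu      # IMU measurement noise (hypothetical)
        self.r_model = r_model  # noise of the model-based value (hypothetical)
        self.wheelbase_m = wheelbase_m

    def _update(self, z, r):
        k = self.p / (self.p + r)    # Kalman gain
        self.x += k * (z - self.x)   # correct the estimate with the innovation
        self.p *= (1.0 - k)          # shrink the variance

    def step(self, speed_mps, steer_rad, imu_yaw_rate=None):
        self.p += self.q  # predict: random walk keeps x, inflates variance
        # Kinematic bicycle model: yaw rate = v * tan(delta) / L
        model_rate = speed_mps * math.tan(steer_rad) / self.wheelbase_m
        self._update(model_rate, self.r_model)
        if imu_yaw_rate is not None:  # skip the IMU update during a dropout
            self._update(imu_yaw_rate, self.r_imu)
        return self.x
```

During an IMU dropout the estimate is carried by the vehicle model alone, so its variance stays bounded but larger than in nominal operation. This covers only intermittent loss; a persistent failure would still call for the redundant sensor.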
The need for a secondary safety function was established through the system engineering process. A systematic approach to developing a Fail-operational secondary function according to ISO 26262 has been discussed with an example. Independent development of the safety function, with all system and safety artifacts, would eliminate bias from the primary function owners. The architecture design is modular and can be customized and integrated with the primary functions and middleware to make the Autonomous Driving software stack compliant with ASIL requirements. This, in turn, accelerates AD development while fulfilling regulatory requirements.
About the Author:
Dr. Manaswini Rath,
Vice President and Global Head, Autonomous Driving,
Senior Solution Architect,