End-to-End Autonomous Driving Without Costly Modularization and 3D Manual Annotation
/ Authors
/ Abstract
We propose UAD, an end-to-end framework with an <bold>U</bold>nsupervised pretext task for vision-based <bold>A</bold>utonomous <bold>D</bold>riving, which achieves the best open-loop evaluation performance on nuScenes while showing robust closed-loop driving quality in CARLA. Our motivation stems from the observation that current end-to-end autonomous driving (E2EAD) models still mimic the modular architecture of typical driving stacks, with carefully designed <bold>supervised</bold> perception and prediction subtasks that provide environment information for subsequent planning. Despite achieving groundbreaking progress, such a design has certain drawbacks: 1) the preceding subtasks require massive high-quality 3D annotations as supervision, posing a significant impediment to scaling the training data; and 2) each submodule entails substantial computation overhead in both training and inference. To this end, we propose UAD, an E2EAD framework with an <bold>unsupervised</bold><sup>1</sup> proxy that addresses all these issues. First, we design a novel Angular Perception Pretext to eliminate the annotation requirement. The pretext perceives the driving scene by predicting angular-wise spatial objectness and temporal dynamics, without manual annotation. Second, we propose a self-supervised training strategy that learns the consistency of the predicted trajectories under different augmented views, enhancing planning robustness in steering scenarios. UAD achieves a 38.7% relative improvement over UniAD on the average collision rate in nuScenes open-loop evaluation and obtains a route completion score of 98.5% in closed-loop evaluation on CARLA's Town05 Long benchmark, outperforming the recent VADv2.
Moreover, the proposed method consumes only 44.3% of the training resources of UniAD and runs <inline-formula><tex-math notation="LaTeX">$3.4\times$</tex-math><alternatives><mml:math><mml:mrow><mml:mn>3</mml:mn><mml:mo>.</mml:mo><mml:mn>4</mml:mn><mml:mo>×</mml:mo></mml:mrow></mml:math><inline-graphic xlink:href="guo-ieq1-3610517.gif"/></alternatives></inline-formula> faster in inference when employing the same backbone network. Our innovative design not only demonstrates, for the first time, clear performance advantages over supervised counterparts, but also enjoys unprecedented efficiency in data, training, and inference.
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence