Recently, Li Xiang discussed the latest progress of Li Auto’s autonomous driving research and development at the Chongqing Forum. This article expands on that discussion, aiming to help readers better understand the underlying logic and technical implementation.
Everything began with a discussion with Xiang Ge in the second half of last year. Without high-definition maps, the no-map solution sometimes fails at complex intersections, leading to wrong turns. Xiang Ge asked whether we could teach the NOA system to read navigation maps the way humans do, rather than relying solely on perception results. After some thought, we realized the car needs two systems: one for driving and one for reading maps. Thus, the journey began.
The core theoretical idea comes from the concepts of System 1 and System 2 in cognitive psychology, proposed by Nobel laureate Daniel Kahneman in his book “Thinking, Fast and Slow.” The fast-and-slow-system theory offers a new perspective on human decision-making, showing how intuition and analytical thinking interact and how people balance intuitive judgment with logical analysis in complex situations. It also helps explain both the predictability and the unpredictability of human behavior in specific environments.
System 1 is an automatic, fast, and unconscious mode of thinking, usually based on intuition and experience; it corresponds to behavioral intelligence. LeCun noted in his convolutional neural network paper that “CNNs can quickly and efficiently process image data, similar to human quick intuitive responses.” The tremendous success of deep learning over the past decade also falls into this category, representing the most basic form of human intelligence: behavioral intelligence.
System 2 focuses on simulating the thought processes of humans or other advanced organisms. This type of intelligence involves deeper understanding, reasoning, learning, and adaptation, corresponding to cognitive intelligence. Cognitive intelligence attempts to replicate how the human brain works in information processing, decision-making, problem-solving, and language comprehension. It builds on behavioral intelligence, simplifying and abstracting the information processed by behavioral intelligence, and performing more complex analyses, reasoning, and advanced thinking processes.
Since 2023, LLMs (Large Language Models) like the GPT series have made preliminary progress toward AGI (Artificial General Intelligence). Although LLMs still rely heavily on deep learning techniques to fit inputs to outputs, new paradigms such as Chain of Thought (CoT) represent a shift: AI models can now analyze, decompose, and solve tasks in a way that closely resembles how humans tackle complex problems. Andrej Karpathy mentioned in his recent talk “Introduction to LLMs for Busy People” that System 2 is currently a research focus in AI, and that techniques like CoT or Tree of Thought (ToT) make System 2 capabilities easier to achieve.
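As a hedged illustration of the difference between a direct question and a chain-of-thought prompt, consider the sketch below; the scene description and prompt wording are invented for this article and are not taken from any production system.

```python
# Illustrative only: comparing a direct prompt with a chain-of-thought prompt
# for a driving scene. The wording is an assumption, not a real system prompt.
direct_prompt = "A herd of sheep is crossing the road ahead. What should the car do?"

cot_prompt = (
    "A herd of sheep is crossing the road ahead.\n"
    "Think step by step before answering:\n"
    "1. List the key objects in the scene and what they are.\n"
    "2. Reason about how they are likely to move over the next few seconds.\n"
    "3. Decide the safest maneuver and when it becomes safe to proceed.\n"
    "Finally, state the driving decision in one sentence."
)
```

Forcing the model to produce explicit intermediate steps is what makes its behavior resemble System 2 deliberation rather than a single intuitive response.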
With the theoretical foundation of System 1 and System 2, let’s apply it to human driving:
- Automated driving process (System 1): This involves habitual and intuitive behaviors, such as automatically shifting gears or stopping at a red light without deep thought. These processes are fast, unconscious, and become more automated with driving experience.
- Complex control process (System 2): In complex or novel driving situations, such as emergencies, heavy traffic, or unfamiliar routes, drivers need to focus more and make deliberate decisions. These processes are slow, conscious, and involve advanced cognitive functions like judgment, planning, and decision-making.
Most of the time (>95%), we drive quickly and intuitively without any complex thought process, which is why we can chat with passengers while still driving safely: System 1 is sufficient for ordinary driving tasks. However, rare corner cases require us to mobilize our mental resources, analyze the situation, and think before arriving at a safe solution. For example, encountering a herd of cows or sheep on the road requires recognizing them as animals (common sense), slowing down for safety, and analyzing their movements to determine when it is safe to pass. This deliberate process requires System 2. Therefore, autonomous driving technology also needs similar cognitive intelligence to handle such corner cases properly and achieve fully autonomous driving without human intervention. The autonomous driving system must handle corner cases independently and gradually solidify those solutions into everyday driving ability, mirroring how humans learn to drive.
How do we implement System 1 and System 2 in autonomous driving, and how do we validate them? Li Auto’s answer is E2E (End-to-End Model) + VLM (Vision-Language Model) + Simulation Tests (World Model). Efficient automated data labeling is also essential.
The working principle of this system closely mirrors human driving behavior: the end-to-end model carries System 1 and the VLM carries System 2, deployed across two OrinX chips. One OrinX runs the real-time end-to-end model, which takes sensor data as input and outputs the planned trajectory that the actuators use for lateral and longitudinal control. The other OrinX runs a VLM with 2 billion parameters, the first large model deployed on an NVIDIA in-vehicle chip, achieving near-real-time inference speed to meet System 2’s needs. The training data for System 1 and System 2 exceeds 1 million clips, each a 30-second video, amounting to roughly 10,000 hours or 500,000 kilometers of driving. These clips are “handpicked” from over 100 million kilometers of driving data and represent “experienced driver” behavior. Data labeling and model training are fully automated, with 3-5 version iterations per week. By the end of this year, the training data volume is expected to reach 10 million clips.
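To make the division of labor concrete, here is a minimal Python sketch of how such a dual-process runtime could be organized, assuming a fast planner that runs every sensor frame and a slower VLM advisor that refreshes guidance asynchronously. The class names, interfaces, and update rates are illustrative assumptions, not Li Auto’s actual implementation.

```python
import threading
import time

class EndToEndPlanner:
    """System 1 stand-in: maps sensor input directly to a planned trajectory."""
    def plan(self, sensor_frame, guidance=None):
        # A real end-to-end model would output a trajectory for the actuators;
        # here we return a placeholder that records the guidance it used.
        return {"trajectory": [(0.0, 0.0), (1.0, 0.1)], "guidance_used": guidance}

class VisionLanguageAdvisor:
    """System 2 stand-in: a slower model that produces high-level guidance."""
    def analyze(self, sensor_frame, nav_context):
        # A real VLM would reason about the scene and navigation context,
        # e.g. "animals crossing, slow down"; here we return a fixed hint.
        return {"advice": "proceed with caution", "max_speed_mps": 5.0}

class DualProcessDriver:
    def __init__(self):
        self.system1 = EndToEndPlanner()
        self.system2 = VisionLanguageAdvisor()
        self._guidance = None
        self._lock = threading.Lock()

    def system2_loop(self, get_frame, nav_context, period_s=0.5, steps=10):
        # Near-real-time loop (second chip): refresh guidance asynchronously.
        for _ in range(steps):
            guidance = self.system2.analyze(get_frame(), nav_context)
            with self._lock:
                self._guidance = guidance
            time.sleep(period_s)

    def control_step(self, sensor_frame):
        # Real-time loop (first chip): always emit a trajectory, folding in
        # whatever System 2 guidance is currently available.
        with self._lock:
            guidance = self._guidance
        return self.system1.plan(sensor_frame, guidance)
```

The key design point in this sketch is that System 1 never waits on System 2: the planner always emits a trajectory at frame rate, while the slower guidance is consumed opportunistically on the next control cycle.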
Evaluating an autonomous driving system differs from evaluating intelligent driving (or assisted driving). Intelligent driving functions are designed and then tested against their design requirements. An autonomous driving system, by contrast, possesses “capabilities” which, like human abilities, can continuously iterate and grow, and these capabilities can only be evaluated through “exams.” Like a novice driver who must pass exams before getting on the road, the autonomous driving system must be examined and reach a road-ready level before being delivered to users.
We use a World Model + Shadow Mode approach. The World Model reconstructs and generates realistic scenarios to serve as exams. Reconstructed scenarios come from real driving data and bad cases; like real exam questions, they ensure the system does not repeat human driving errors. Generated scenarios are similar but not identical to the reconstructed ones; they act as mock exam questions that support broader learning and more comprehensive evaluation. After passing these simulations, we use early-bird and internal testing vehicles for real-car exams, much like the road test at a driving school. If the system fails, we iterate until it passes. Once delivered to users, the system continues to improve with new driving data, becoming better over time.
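As a rough sketch of the “exam” idea, the loop below scores a driving policy against reconstructed scenarios (rebuilt from real logs and bad cases) and generated variations, and passes the model only if every scenario clears a threshold. The scenario format, world-model step, scoring, and pass threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Scenario:
    name: str
    kind: str          # "reconstructed" (real exam) or "generated" (mock exam)
    initial_state: dict

def step_world_model(state: dict, action: dict):
    # Stand-in for a learned world model: advance the scene one tick and
    # return a penalty for events such as collisions, hard braking, or wrong turns.
    return state, 0.0

def run_scenario(policy: Callable[[dict], dict], scenario: Scenario, horizon: int = 100) -> float:
    """Roll the policy through one scenario and return a score in [0, 1]."""
    state, score = scenario.initial_state, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, penalty = step_world_model(state, action)
        score -= penalty
    return max(score, 0.0)

def take_exam(policy, scenarios: List[Scenario], pass_score: float = 0.9) -> bool:
    results: Dict[str, float] = {s.name: run_scenario(policy, s) for s in scenarios}
    failed = [name for name, score in results.items() if score < pass_score]
    if failed:
        print("Failed scenarios, iterate and retake:", failed)
    return not failed
```

A failing result feeds back into the next training iteration, mirroring the “iterate until it passes” loop described above.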
This summarizes our research work alongside regular deliveries. However, technical introductions are best understood through papers. Here are some core papers from the team with summaries to help you understand the technical details.
System 1 (E2E) Direction:
- Zhou L, Tang T, Hao P, et al. “UA-Track: Uncertainty-Aware End-to-End 3D Multi-Object Tracking.” arXiv preprint arXiv:2406.02147, 2024.
- This paper proposes UA-Track, a 3D multi-object tracking framework addressing uncertainties caused by occlusion and small targets in autonomous driving perception. It achieves significant performance improvements, especially in the nuScenes benchmark, reaching 66.3% AMOTA, surpassing previous best end-to-end solutions.
- Li F, Hou W, Jia P. “RMFA-Net: A Neural ISP for Real RAW to RGB Image Reconstruction.” arXiv preprint arXiv:2406.11469, 2024.
- This paper introduces RMFA-Net, a novel neural network for high-quality RAW to RGB image reconstruction. It addresses traditional algorithms’ deficiencies in processing RAW data through implicit black level correction, tri-channel splitting, and tone mapping based on Retinex theory. Experimental results show RMFA-Net surpasses existing technologies in image quality, achieving over 25 dB PSNR scores.
System 2 (VLM) Direction:
- Tian X, Gu J, Li B, et al. “DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models.” arXiv preprint arXiv:2402.12289, 2024.
- This paper introduces DriveVLM, an autonomous driving system combining large vision-language models to enhance understanding of complex and corner-case scenarios in urban environments. The innovative chain of thought module improves scene description, analysis, and hierarchical planning capabilities. DriveVLM-Dual hybrid system overcomes spatial reasoning and computational intensity limitations, achieving robust spatial understanding and real-time inference speed, significantly improving performance in complex and unpredictable driving conditions.
Simulation Test (World Model) Direction:
- Yan Y, Lin H, Zhou C, et al. “Street Gaussians for Modeling Dynamic Urban Scenes.” arXiv preprint arXiv:2401.01339, 2024.
- This paper introduces Street Gaussians, a new technique for simulating dynamic urban street scenes. It optimizes point clouds and 3D Gaussian representations, addressing existing methods’ slow training and rendering speed and high dependence on vehicle tracking accuracy. This work achieves superior performance in multiple benchmarks, enabling real-time rendering and scene editing.
- Ma E, Zhou L, Tang T, et al. “Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation.” arXiv preprint arXiv:2406.01349, 2024.
- This paper proposes Delphi, a novel end-to-end autonomous driving video generation method addressing spatial and temporal consistency issues in existing technologies for generating long videos. It significantly enhances the generalization capability of autonomous driving models, achieving high performance on small datasets.
- Du X, Sun H, Wang S, et al. “3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views.” arXiv preprint arXiv:2406.04875, 2024.
- This paper introduces 3DRealCar, a large-scale high-quality real vehicle 3D dataset with 360-degree views and varying lighting conditions. It addresses the low quality of existing 3D vehicle datasets, providing essential research resources for autonomous driving systems, virtual reality, and achieving significant progress in vehicle 3D reconstruction and understanding.
Data (Auto-Labeling) Direction:
- Wei D, Gao T, Jia Z, et al. “BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving.” arXiv preprint arXiv:2401.01065, 2024.
- This paper proposes the BEV-TSR framework, addressing text-scene retrieval challenges in autonomous driving by leveraging bird’s-eye view (BEV) space and large language models (LLM). It solves traditional image retrieval methods’ deficiencies in global feature representation and complex text retrieval capabilities, achieving significant performance improvements in efficiently and accurately retrieving traffic scenes in BEV space.
- Jin B, Zheng Y, Li P, et al. “TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes.” arXiv preprint arXiv:2403.19589, 2024.
- This paper proposes the TOD3Cap network and dataset, addressing the challenges of indoor-outdoor scene differences and significantly enhancing 3D object detection and description performance.
If you have read this far, you are clearly passionate about autonomous driving. Thank you for your time, and we hope you now have a better understanding of Li Auto’s autonomous driving. We believe the System 1 + System 2 technical architecture applies not only to autonomous driving but also to future intelligent robots, since autonomous driving is itself a form of wheeled robotics. Achieving autonomous driving amounts to achieving artificial intelligence, but the process poses significant challenges for companies in computing power, data, and talent. Finally, to quote Xiang Ge: “We are confident we can deliver supervised L3 autonomous driving as early as the second half of this year.”