Background
Baidu has released ApolloScape, an open dataset for autonomous driving. In autonomous driving development and testing, massive amounts of high-quality real-world data are an indispensable raw material. However, few teams have the resources to develop and maintain a suitable autonomous driving platform, regularly calibrate sensors, and collect new data.
According to the announcement, the ApolloScape release from the Apollo open platform not only opens a data volume more than 10 times larger than comparable datasets such as Cityscapes — hundreds of thousands of frames of high-resolution imagery with per-pixel semantic segmentation annotations, spanning perception, simulation scenario, and road network data — but also covers more complex environments, weather, and traffic conditions. In terms of difficulty, ApolloScape covers more complex road scenes (for example, up to 162 vehicles or 80 pedestrians in a single image), and with its per-pixel semantic segmentation labeling it is described as the autonomous driving dataset with the most complex environments, the most precise annotation, and the largest data volume to date.
The Apollo open platform will also join UC Berkeley in hosting a Workshop on Autonomous Driving at CVPR 2018 (the IEEE Conference on Computer Vision and Pattern Recognition), defining several challenge tasks on the large-scale ApolloScape dataset and providing a platform for autonomous driving developers and researchers worldwide to jointly explore frontier technical breakthroughs and application innovations.
Reference 1: PoseNet implementation for self-driving car localization using Pytorch on Apolloscape dataset
This article covers the very beginning of the journey: reading and visualizing the Apolloscape dataset for the localization task. I implement the PoseNet [2] architecture for monocular image pose prediction and visualize the results. I use Python and Pytorch for the task.
NOTE: If you want to jump straight to the code, here is the GitHub repo. It is still a work in progress, where I intend to implement VidLoc [7], Pose Graph Optimization [3,8], and Structure from Motion [9] pipelines for the Apolloscape dataset in the context of the localization task.
Apolloscape Pytorch Dataset
For Pytorch I need a Dataset object that prepares and feeds the data to the loader and then to the model. I want a robust dataset class that can:
- support stereo and mono images
- support train/validation splits that came along with data or generate a new one
- support pose normalization
- support different pose representations (needed mainly for visualization and experiments with loss functions)
- support filtering by record id
- support general Apolloscape folder structure layout
I am not including the full listing of the Apolloscape dataset class here and will concentrate solely on how to use it and what data we can get from it. For the full source code, please refer to the GitHub file datasets/apolloscape.py.
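To give a sense of the interface, here is a minimal skeleton of such a Dataset class (a sketch only; the real datasets/apolloscape.py is more involved, and the constructor arguments shown are assumptions):

```python
from PIL import Image
import torch.utils.data as data

class Apolloscape(data.Dataset):
    """Minimal interface sketch; the real datasets/apolloscape.py
    is more involved and these constructor args are assumptions."""
    def __init__(self, root, road, train=True, transform=None):
        self.transform = transform
        # In the real class this index is built from the road's pose files
        self.samples = []  # [(image_path, 7-DoF pose tensor), ...]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, pose = self.samples[idx]
        image = Image.open(path)
        if self.transform is not None:
            image = self.transform(image)
        return image, pose
```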
Here is how to create a dataset:
```python
from datasets.apolloscape import Apolloscape

# The listing after the import is truncated in the original; this is a
# plausible completion (the argument names are assumptions)
train_dataset = Apolloscape(root="./data/apolloscape", road="zpark", train=True)
print(train_dataset)
```
Output:
```
Dataset: Apolloscape
```
$APOLLO_PATH is a folder with the unpacked Apolloscape datasets, e.g. $APOLLO_PATH/road02_seg or $APOLLO_PATH/zpark. Download the data from the Apolloscape page and unpack it. Let's assume that we've also created a symlink ./data/apolloscape that points to the $APOLLO_PATH folder.
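One way to create that symlink (a sketch, assuming the APOLLO_PATH environment variable is set):

```python
import os

# Link ./data/apolloscape to the folder with the unpacked datasets
os.makedirs("./data", exist_ok=True)
os.symlink(os.environ["APOLLO_PATH"], "./data/apolloscape")
```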
We can view the list of available records with the number of data samples in each:
```python
# Show records with numbers of data points
```
Output:
```
Records:
```
We can draw a route for one record with a sampled camera image:
```python
from utils.common import draw_record
```
Output: (route plot for the record with a sampled camera image)
Alternatively, we can see all records at once in one chart:
```python
# Draw all records for current dataset
```
Output: (chart with the routes of all records)
Another option is to see it in a video:
```python
from utils.common import make_video
```
Output: (cut gif version of the generated video)
For the PoseNet training we will use mono images with zero-mean normalized poses and camera images center-cropped to 250px:
```python
# Resize and CenterCrop
# (the rest of the listing is truncated in the original; this completion
# uses standard torchvision transforms, and normalize_poses is an assumed
# argument name for the zero-mean pose normalization)
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(250),
    transforms.CenterCrop(250),
    transforms.ToTensor(),
])
train_dataset = Apolloscape(root="./data/apolloscape", road="zpark",
                            train=True, transform=transform,
                            normalize_poses=True)
```
Output: (sample of center-cropped training images)
The implemented Apolloscape Pytorch dataset also supports a cache_transform option which, when enabled, pickles all transformed images to disk and retrieves them on subsequent epochs, without redoing the conversion and transform operations on every image read. The cache saves up to 50% of training time, though it does not work with image augmentation transforms like torchvision.transforms.ColorJitter.
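The caching idea can be sketched as follows (a hypothetical helper illustrating the behavior, not the repo's actual code; the file naming scheme is an assumption):

```python
import hashlib
import os
import pickle

def cached_transform(transform, image, cache_dir, key):
    """Apply transform to image, pickling the result to disk so that
    subsequent epochs skip the transform pipeline entirely."""
    os.makedirs(cache_dir, exist_ok=True)
    name = hashlib.md5(key.encode()).hexdigest() + ".pickle"
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = transform(image)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

This also explains why random augmentations like ColorJitter break the cache: the first epoch's random result would be replayed verbatim in every later epoch.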
We can also get the mean and the standard deviation that we will need later to recover the true pose translations:
```python
poses_mean = train_dataset.poses_mean
poses_std = train_dataset.poses_std  # assumed companion attribute; the listing is truncated in the original
```
Output:
```
Translation poses_mean = [ 449.95782055 -2251.24771214 40.17147932] in meters
```
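With these statistics, recovering the true translation in meters from a normalized prediction is a one-liner (a sketch, assuming zero-mean, unit-variance pose normalization):

```python
import numpy as np

# Hypothetical normalized network output for one pose translation
pred_translation = np.array([0.1, -0.2, 0.05])

# Undo the zero-mean normalization to get back to meters
true_translation = pred_translation * poses_std + poses_mean
```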
You can find all the mentioned examples in the Apolloscape_View_Records.ipynb notebook.
Now let's turn to something more interesting and useful: training the PoseNet deep convolutional network to regress poses from camera images.
PoseNet localization task
Reference: PoseNet implementation for self-driving car localization using Pytorch on Apolloscape dataset
A Pytorch implementation of the PoseNet model using a mono image:
```python
import torch
```
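The listing above is truncated, so here is a minimal reconstruction sketch of such a model. The ResNet34 backbone and the 2048-feature FC layer come from the training details below; the layer names and dropout value are assumptions and differ from the repo's models/posenet.py:

```python
import torch
import torch.nn as nn
from torchvision import models

class PoseNet(nn.Module):
    """Sketch of a PoseNet-style pose regressor for a mono image."""
    def __init__(self, feature_extractor, num_features=2048, dropout=0.5):
        super().__init__()
        self.feature_extractor = feature_extractor
        # Swap the ImageNet classifier head for an FC layer of num_features
        fe_out = self.feature_extractor.fc.in_features
        self.feature_extractor.fc = nn.Linear(fe_out, num_features)
        self.dropout = nn.Dropout(dropout)
        # Separate regressors for translation (x, y, z) and rotation quaternion
        self.fc_xyz = nn.Linear(num_features, 3)
        self.fc_quat = nn.Linear(num_features, 4)

    def forward(self, x):
        x = torch.relu(self.feature_extractor(x))
        x = self.dropout(x)
        return torch.cat([self.fc_xyz(x), self.fc_quat(x)], dim=1)
```

For mono images it would be instantiated as PoseNet(models.resnet34(pretrained=True)).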
For further experiments I've also implemented a stereo version (currently it simply processes two images in parallel without any additional constraints), an option to switch off stats tracking for BatchNorm layers, and Kaiming He normal weight initialization [4]. The full source code is in models/posenet.py.
PoseNet Loss Functions
For more details on where this loss comes from and an intro to Bayesian Deep Learning (BDL), you can refer to an excellent post by Alex Kendall where he explains the different types of uncertainty and their implications for multi-task models. Even more results can be found in the papers “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics” [5] and “What uncertainties do we need in Bayesian deep learning for computer vision?” [6].
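For reference, the learned-weighting loss from the geometric loss paper [1], which the criterion below follows, is

$$L = L_x \exp(-\hat{s}_x) + \hat{s}_x + L_q \exp(-\hat{s}_q) + \hat{s}_q$$

where $L_x$ and $L_q$ are the translation and rotation losses and $\hat{s}_x$, $\hat{s}_q$ are learnable parameters that balance the two terms.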
The Pytorch implementation for both versions of the loss function is the following:
```python
class PoseNetCriterion(torch.nn.Module):
```
If the learn_beta param is False, it's a simple weighted-sum version of the loss; if learn_beta is True, it uses the sx and sq params with gradients enabled, trained together with the other network parameters by the same optimizer.
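Since the listing above is truncated, here is a minimal sketch of such a criterion (the L1 distance, beta default, and initial sx/sq values are assumptions; see the repo for the actual implementation):

```python
import torch
import torch.nn as nn

class PoseNetCriterion(nn.Module):
    """Sketch: weighted-sum loss, or learned sx/sq weighting."""
    def __init__(self, beta=500.0, learn_beta=False, sx=0.0, sq=-3.0):
        super().__init__()
        self.beta = beta
        self.learn_beta = learn_beta
        self.sx = nn.Parameter(torch.tensor(sx), requires_grad=learn_beta)
        self.sq = nn.Parameter(torch.tensor(sq), requires_grad=learn_beta)
        self.loss_fn = nn.L1Loss()

    def forward(self, pred, target):
        # First 3 components are the translation, the last 4 the quaternion
        loss_x = self.loss_fn(pred[:, :3], target[:, :3])
        loss_q = self.loss_fn(pred[:, 3:], target[:, 3:])
        if self.learn_beta:
            # Learned weighting: L = Lx*exp(-sx) + sx + Lq*exp(-sq) + sq
            return (torch.exp(-self.sx) * loss_x + self.sx
                    + torch.exp(-self.sq) * loss_q + self.sq)
        # Fixed weighted sum
        return loss_x + self.beta * loss_q
```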
PoseNet Training Implementation Details
Now let's combine it all into the training loop. I use the torch.optim.Adam optimizer with a learning rate of 1e-5, ResNet34 pretrained on ImageNet as a feature extractor, and 2048 features in the last FC layer before the pose regressors.
```python
from torchvision import transforms, models
```
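The rest of the setup listing is truncated, so here is a minimal sketch of it, reusing the PoseNet and PoseNetCriterion sketches from above (the learning rate and feature size come from the text; other details are assumptions):

```python
import torch
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ResNet34 pretrained on ImageNet as the feature extractor
feature_extractor = models.resnet34(pretrained=True)
model = PoseNet(feature_extractor, num_features=2048).to(device)
criterion = PoseNetCriterion(learn_beta=True).to(device)

# Optimize the network weights and the loss-weighting params together
param_list = [{"params": model.parameters()},
              {"params": criterion.parameters()}]
optimizer = torch.optim.Adam(param_list, lr=1e-5)
```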
A slightly simplified train function, with error calculation used solely for logging purposes, is shown below:
```python
def train(train_loader, model, criterion, optimizer, epoch, max_epoch,
```
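The listing is truncated above; a minimal sketch of such a training epoch (reusing the device variable from the setup sketch, with the logging and error calculation of the full version omitted) could look like this:

```python
def train(train_loader, model, criterion, optimizer, epoch, max_epoch):
    # epoch and max_epoch are used only for logging in the full version
    model.train()
    for images, poses in train_loader:
        images, poses = images.to(device), poses.to(device)
        optimizer.zero_grad()
        pred = model(images)
        loss = criterion(pred, poses)
        loss.backward()
        optimizer.step()
```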
The validate function is similar to train except for the model.eval()/model.train() modes, logging, and error calculations. Please refer to /utils/training.py on GitHub for the full versions of the train and validate functions.
The training converges after about 1-2k epochs. On my machine with a GTX 1080 Ti, it takes about 22 seconds per epoch on the ZPark sample train dataset with 2242 images, pre-processed and scaled to 250x250 pixels. Total training time is 6-12 hours.
PoseNet Results on Apolloscape dataset. ZPark sample road.
After 2k epochs of training, the model managed to predict pose translation with a mean error of 40.6 meters and rotation with a mean error of 1.69 degrees.
Further development
The established results are far from what can be used in autonomous navigation, where a system needs to know its location to within an accuracy of 15 cm. Such precision is vital for a car to act safely, correctly predict the behavior of others, and plan actions accordingly. In any case, it's a good baseline and provides the building blocks of a pipeline for working with the Apolloscape dataset that I can develop and improve further.
There are many things to try next:
- Use temporal nature of a video.
- Rely on geometrical features of stereo cameras.
- Pose graph optimization techniques.
- Loss based on 3D reprojection errors.
- Structure from motion methods to build 3D map representation.
And, more importantly, all the above-mentioned methods need no additional information beyond what we already have in the ZPark sample road from the Apolloscape dataset.
References
1. Kendall, Alex, and Roberto Cipolla. “Geometric loss functions for camera pose regression with deep learning.” (2017).
2. Kendall, Alex, Matthew Grimes, and Roberto Cipolla. “PoseNet: A convolutional network for real-time 6-DOF camera relocalization.” (2015).
3. Brahmbhatt, Samarth, et al. “MapNet: Geometry-aware learning of maps for camera localization.” (2017).
4. He, Kaiming, et al. “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.” (2015).
5. Kendall, Alex, Yarin Gal, and Roberto Cipolla. “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.” (2017).
6. Kendall, Alex, and Yarin Gal. “What uncertainties do we need in Bayesian deep learning for computer vision?” (2017).
7. Clark, Ronald, et al. “VidLoc: A deep spatio-temporal model for 6-DoF video-clip relocalization.” (2017).
8. Calafiore, Giuseppe, Luca Carlone, and Frank Dellaert. “Pose graph optimization in the complex domain: Lagrangian duality, conditions for zero duality gap, and optimal solutions.” (2015).
9. Martinec, Daniel, and Tomas Pajdla. “Robust rotation and translation estimation in multiview reconstruction.” (2007).