We model video information with a two-stream CNN (a spatial stream and a motion stream), both built on ResNet50, on an NDA (Non-Driving-Activity) dataset that we collected ourselves. NDAs are the activities a driver performs when not driving: when a Level 3 autonomous vehicle is in autopilot mode, driving is no longer the driver's primary task. We still need to understand what the driver is doing, because if the driver has to take back control, the reaction time associated with the current activity becomes crucial information. For the motion stream, we use FlowNet 2.0 to generate the optical flow.
- This is the original video data from our NDA dataset. Its size is about 3 GB.
- For the temporal stream, we used FlowNet 2.0 to extract the optical flow. Its size is also about 3 GB.
- You can follow the FlowNet 2.0 Colab notebook to obtain the optical flow for the video stream.
- We use ResNet50 pre-trained on ImageNet as the backbone. The input to the spatial CNN is a frame extracted from the original video stream, with shape (3, 224, 224). Resizing and data augmentation are done automatically in the code.
- The input to the temporal CNN is a stack of optical flow images, each an RGB-visualised optical flow image produced by FlowNet 2.0. The input shape is therefore (30, 224, 224), which can be viewed as a 30-channel image: 3 RGB channels times 10 flow images per stack.
- To reuse the ImageNet pre-trained weights in our model, we modify the weights of the first convolution layer from shape (64, 3, 7, 7) to (64, 30, 7, 7); see the model sketch after this list.
- For every video in a mini-batch, we randomly select 3 frames. A consensus among the frame-level predictions is then derived as the video-level prediction used to compute the loss (see the training sketch after this list).
- In every mini-batch, we randomly select 16 (the batch size) videos from the training set and further randomly select 1 stacked optical flow from each video.
- Both streams apply the same data augmentation techniques, such as random cropping.
- For every test video, we uniformly sample 19 frames, and the video-level prediction is the voting result over all 19 frame-level predictions (see the test-time sketch after this list).
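
As a concrete illustration of the two backbones and the first-layer weight modification described above, here is a minimal PyTorch sketch. The cross-modality initialisation (averaging the three RGB kernels and replicating them across the 30 flow channels) and the `NUM_CLASSES` value are assumptions for illustration, not details taken from this project.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # hypothetical; set to the number of NDA classes

def build_spatial_stream(num_classes=NUM_CLASSES):
    # ResNet50 pre-trained on ImageNet; expects (3, 224, 224) frames.
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def build_motion_stream(num_classes=NUM_CLASSES, flow_channels=30):
    # Same backbone, but the first conv must accept a 30-channel
    # stack (10 RGB-visualised flow images, 3 channels each).
    model = models.resnet50(pretrained=True)
    old_conv = model.conv1                      # weight: (64, 3, 7, 7)
    new_conv = nn.Conv2d(flow_channels, 64, kernel_size=7,
                         stride=2, padding=3, bias=False)
    with torch.no_grad():
        # Assumed cross-modality initialisation: average the RGB
        # kernels and replicate them across the 30 flow channels.
        mean_w = old_conv.weight.mean(dim=1, keepdim=True)
        new_conv.weight.copy_(mean_w.repeat(1, flow_channels, 1, 1))
    model.conv1 = new_conv                      # weight: (64, 30, 7, 7)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```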
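
The training-time sampling and consensus for the spatial stream might look like the following; averaging the frame-level logits as the consensus is an assumption, since the text above only says that a consensus is derived.

```python
import random
import torch

def sample_training_frames(num_frames, k=3):
    # Randomly pick k frame indices from one video (spatial stream).
    return sorted(random.sample(range(num_frames), k))

def video_level_logits(model, frames):
    # frames: (k, 3, 224, 224) tensor holding the k sampled frames.
    logits = model(frames)        # (k, num_classes) frame-level logits
    # Assumed consensus: average the frame-level predictions to get
    # the video-level prediction used in the loss.
    return logits.mean(dim=0)     # (num_classes,)
```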
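
At test time, the 19 uniformly spaced frames and the vote could be implemented as below; reading the "voting result" as a majority vote over per-frame argmax predictions is an assumption, and score averaging would be an equally plausible variant.

```python
import numpy as np
import torch

def uniform_indices(num_frames, k=19):
    # k evenly spaced frame indices spanning the whole video.
    return np.linspace(0, num_frames - 1, num=k).astype(int)

@torch.no_grad()
def predict_video(model, frames):
    # frames: (19, 3, 224, 224) tensor of the uniformly sampled frames.
    votes = model(frames).argmax(dim=1)   # one class vote per frame
    # Assumed reading of "voting result": majority vote over frames.
    return torch.mode(votes).values.item()
```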
Network | Top-1 accuracy |
---|---|
Spatial stream | 80.8% |
Motion stream | 83.3% |
Fusion | 96.4% |
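
The Fusion row combines the two streams' outputs. The exact fusion rule is not described here; a minimal sketch, assuming a simple weighted average of the two streams' softmax scores (a common late-fusion choice), is:

```python
import torch

def late_fusion(spatial_logits, motion_logits, w=0.5):
    # Weighted average of the two streams' softmax scores; equal
    # weights (w=0.5) are an assumption, not a documented choice.
    scores = (w * spatial_logits.softmax(dim=-1)
              + (1.0 - w) * motion_logits.softmax(dim=-1))
    return scores.argmax(dim=-1)
```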
- Below is the link to my Google Colab notebook; all instructions can be found in it. If there is any problem, please open an issue on this git repository.
- Google Colab notebook
- This code is modified from jeffreyhuang's implementation. All modifications were made by me to fit the NDA Recognition project.
- The optical flow images were estimated using FlowNet 2.0 from the NVIDIA git repository.