Single Frame CNN


Deep learning has been shown to learn highly effective features from image and video data, yielding high accuracy in many tasks. In this phase, we performed our training experiments on an AWS g4dn.xlarge instance (with a Tesla T4 GPU). The data preparation and CNN architecture details are given below.

DATA PREPARATION


Train-test-validation split:

We used 66 videos for training, 22 for testing, and 16 for validation.

Data pipeline (using the tf.data Dataset API of TensorFlow):

First, we created a list of filenames of the jpg images (for this, we stored all individual frames of the videos in a separate folder) and a corresponding list of labels. We then applied the following steps to create the input pipeline for our model (a code sketch follows the list):

  1. Created a dataset from slices of the filenames and labels.
  2. Shuffled the data with a buffer size equal to the length of the dataset, which ensures thorough shuffling.
  3. Parsed the images from filenames into pixel values, using multiple threads to speed up preprocessing.
  4. Applied data augmentation: random brightness, contrast, and saturation changes, again using multiple threads to speed up preprocessing.
  5. Batched the images (batch size = 16).
  6. Prefetched one batch to make sure that a batch is ready to be served at all times.
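A minimal sketch of this pipeline, assuming 224x224 inputs and the augmentation ranges shown (both are placeholder values, not taken from the project code):

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let tf.data choose the number of threads

def parse_image(filename, label):
    # Step 3: read a jpg frame and convert it to float pixel values
    image = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [224, 224])  # assumed input size
    return image, label

def augment(image, label):
    # Step 4: random brightness, contrast, and saturation
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    return image, label

filenames = ["frames/vid01_000.jpg", "frames/vid01_001.jpg"]  # placeholder paths
labels = [0, 1]                                               # placeholder labels

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))  # step 1
dataset = dataset.shuffle(buffer_size=len(filenames))              # step 2
dataset = dataset.map(parse_image, num_parallel_calls=AUTOTUNE)    # step 3
dataset = dataset.map(augment, num_parallel_calls=AUTOTUNE)        # step 4
dataset = dataset.batch(16)                                        # step 5
dataset = dataset.prefetch(1)                                      # step 6
```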

Reference: https://cs230.stanford.edu/blog/datapipeline/

In this approach, we used the MobileNetV2 architecture with additional dense layers on top. We chose MobileNetV2 because it is a lightweight architecture, particularly well suited to mobile and embedded vision applications.
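A sketch of this kind of model, assuming a 224x224 input and an ImageNet-pretrained backbone; the sizes of the added dense layers are placeholders, not the project's actual configuration:

```python
import tensorflow as tf

# MobileNetV2 backbone without its ImageNet classifier head
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),   # placeholder head size
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary safe/unsafe score
])
```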


Model training

  • We trained the model for 150 epochs, saving a checkpoint at the best value of precision@90recall (see the training sketch after this list).
  • Loss function used: BinaryCrossentropy. We also assigned class weights (unsafe: 1, safe: 1.92) during training, as our dataset is imbalanced.
  • Optimizer used: Adam.
  • The Python implementation can be found here.
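A sketch of this training setup, reusing the `model` and `dataset` objects from the sketches above; the validation pipeline and the label encoding (unsafe = 0, safe = 1) are assumptions:

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.PrecisionAtRecall(recall=0.90, name="p_at_r90"),
             tf.keras.metrics.BinaryAccuracy()])

# Checkpoint the weights whenever precision@90recall improves on validation data
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_p_at_r90", mode="max", save_best_only=True)

val_dataset = dataset  # placeholder; in practice a separate validation pipeline

model.fit(dataset,
          validation_data=val_dataset,
          epochs=150,
          class_weight={0: 1.0, 1: 1.92},  # assumes unsafe -> 0, safe -> 1
          callbacks=[checkpoint])
```
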
Model performance

  • Precision: 0.90, Recall: 0.99, Binary accuracy: 0.99 (on train data)
  • Precision: 0.90, Recall: 0.60, Binary accuracy: 0.85 (on test data)
  • Throughput on Jetson Nano after conversion into a TensorRT graph: 15 fps.
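This report does not detail the conversion step itself; one common route is TF-TRT, which rewrites a TensorFlow SavedModel so that supported subgraphs run through TensorRT. A sketch, with placeholder paths:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a SavedModel into a TF-TRT graph (FP16 suits the Jetson Nano GPU)
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode="FP16")
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="saved_model",  # placeholder input path
    conversion_params=params)
converter.convert()
converter.save("saved_model_trt")         # placeholder output path
```
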
Sample Prediction Outputs

[Images: two true positives, two true negatives, one false positive, one false negative]


As the MobileNetV2-based architecture did not give satisfactory performance on test data, we developed our own CNN architecture from scratch.
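The exact layer configuration is in the linked implementation; purely as an illustration (every layer size below is a placeholder), a from-scratch CNN of this kind, including the larger-than-3 kernels discussed later, might look like:

```python
import tensorflow as tf

# Hypothetical from-scratch CNN; only the presence of kernel sizes > 3
# reflects the actual model described in this report.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=7, strides=2, activation="relu",
                           input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```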


Model training

  • We trained the model for 150 epochs, saving a checkpoint at the best value of precision@90recall.
  • Loss function used: BinaryCrossentropy. We also assigned class weights (unsafe: 1, safe: 1.92) during training, as our dataset is imbalanced.
  • Optimizer used: Adam, with a learning rate of 0.0002.
  • The Python implementation can be found here.
Model performance

  • Precision: 0.90, Recall: 0.99, Binary accuracy: 0.99 (on train data)
  • Precision: 0.90, Recall: 0.72, Binary accuracy: 0.89 (on test data)
  • Throughput on Jetson Nano after conversion into a TensorRT graph: 3 fps. The inference speed on the Jetson Nano is very low because our architecture has convolutional layers with kernel sizes larger than 3, which the TensorRT engine could not optimize.
Sample Prediction Outputs

[Images: two true positives, two true negatives, one false positive, one false negative]


Since convolutional layers with larger kernel sizes were not optimized by the TensorRT engine, we reduced the kernel sizes and added dilation to those layers to compensate for the reduced receptive field. This resulted in a higher inference speed after conversion to a TensorRT graph.
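The idea: a k x k kernel with dilation d covers an effective extent of k + (k - 1)(d - 1) pixels, so a 3x3 kernel with dilation 2 spans the same 5x5 region as a 5x5 kernel while keeping the small kernel size that TensorRT optimizes well. A sketch of the substitution (filter counts are placeholders):

```python
import tensorflow as tf

# Before: a 5x5 convolution, which TensorRT could not optimize for us
large_kernel = tf.keras.layers.Conv2D(64, kernel_size=5, activation="relu")

# After: a 3x3 convolution with dilation 2; effective extent
# 3 + (3 - 1) * (2 - 1) = 5, matching the 5x5 receptive field
dilated = tf.keras.layers.Conv2D(64, kernel_size=3, dilation_rate=2,
                                 activation="relu")
```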


Model training

  • We trained the model for 150 epochs, saving a checkpoint at the best value of precision@90recall.
  • Loss function used: BinaryCrossentropy. We also assigned class weights (unsafe: 1, safe: 1.92) during training, as our dataset is imbalanced.
  • Optimizer used: Adam.
  • The Python implementation can be found here.
Model performance

  • Precision: 0.90, Recall: 0.99, Binary accuracy: 0.97 (on train data)
  • Precision: 0.90, Recall: 0.77, Binary accuracy: 0.84 (on test data)
  • Throughput on Jetson Nano after conversion into a TensorRT graph: 8 fps.
  • Since this model gives a decent recall of 0.77 at a high precision of 0.90, along with a good inference speed, we have deployed it on the Jetson Nano to build a real-time, portable road crossing assistant.
Sample Prediction Outputs

[Images: two true positives, two true negatives, one false positive, one false negative]