Single Frame CNN


Deep learning has been shown to learn highly effective features from image and video data, yielding high accuracy in many tasks. In this phase, we performed our training experiments on an AWS g4dn.xlarge instance (with a Tesla T4 GPU). The data preparation and CNN architecture details are given below.

DATA PREPARATION


Train-test-validation split:

We used 66 videos for training, 22 for testing, and 16 for validation.

Data pipeline (using the tf.data Dataset API of TensorFlow):

First, we created a list of filenames of the jpg images (for this, we stored all individual frames of the videos in a separate folder) and a corresponding list of labels. We then applied the following steps to create the input pipeline for our model (a code sketch follows the list):

  1. Created a dataset from slices of the filenames and labels.
  2. Shuffled the data with a buffer size equal to the length of the dataset, which ensures thorough shuffling.
  3. Parsed the images from filenames into pixel values, using multiple threads to speed up preprocessing.
  4. Applied data augmentation: random brightness, contrast, and saturation changes, again using multiple threads to speed up preprocessing.
  5. Batched the images (batch size = 16).
  6. Prefetched one batch to make sure that a batch is ready to be served at all times.
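A minimal sketch of this pipeline, assuming 224x224 inputs and the augmentation ranges shown (both are placeholder values, not taken from the project code):

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let tf.data choose the number of threads

def parse_image(filename, label):
    # Step 3: read a jpg frame and convert it to float pixel values
    image = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [224, 224])  # assumed input size
    return image, label

def augment(image, label):
    # Step 4: random brightness, contrast, and saturation
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    return image, label

filenames = ["frames/vid01_000.jpg", "frames/vid01_001.jpg"]  # placeholder paths
labels = [0, 1]                                               # placeholder labels

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))  # step 1
dataset = dataset.shuffle(buffer_size=len(filenames))              # step 2
dataset = dataset.map(parse_image, num_parallel_calls=AUTOTUNE)    # step 3
dataset = dataset.map(augment, num_parallel_calls=AUTOTUNE)        # step 4
dataset = dataset.batch(16)                                        # step 5
dataset = dataset.prefetch(1)                                      # step 6
```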

Reference: https://cs230.stanford.edu/blog/datapipeline/

In this approach, we used the MobileNetV2 architecture with additional dense layers on top. We chose MobileNetV2 because it is a lightweight architecture, particularly well suited to mobile and embedded vision applications.
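A sketch of this kind of model, assuming a 224x224 input and an ImageNet-pretrained backbone; the sizes of the added dense layers are placeholders, not the project's actual configuration:

```python
import tensorflow as tf

# MobileNetV2 backbone without its ImageNet classifier head
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),   # placeholder head size
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary safe/unsafe score
])
```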


Model training

  • We trained the model for 150 epochs, saving a checkpoint at the best value of precision@90recall (see the training sketch after this list).
  • Loss function used: BinaryCrossentropy. We also assigned class weights (unsafe: 1, safe: 1.92) during training, as our dataset is imbalanced.
  • Optimizer used: Adam.
  • The Python implementation can be found here.
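A sketch of this training setup, reusing the `model` and `dataset` objects from the sketches above; the validation pipeline and the label encoding (unsafe = 0, safe = 1) are assumptions:

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.PrecisionAtRecall(recall=0.90, name="p_at_r90"),
             tf.keras.metrics.BinaryAccuracy()])

# Checkpoint the weights whenever precision@90recall improves on validation data
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_p_at_r90", mode="max", save_best_only=True)

val_dataset = dataset  # placeholder; in practice a separate validation pipeline

model.fit(dataset,
          validation_data=val_dataset,
          epochs=150,
          class_weight={0: 1.0, 1: 1.92},  # assumes unsafe -> 0, safe -> 1
          callbacks=[checkpoint])
```
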
Model performance

  • Precision: 0.90, Recall: 0.99, Binary accuracy: 0.99 (on train data)
  • Precision: 0.90, Recall: 0.60, Binary accuracy: 0.85 (on test data)
  • Throughput on Jetson Nano after conversion into a TensorRT graph: 15 fps.
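This report does not detail the conversion step itself; one common route is TF-TRT, which rewrites a TensorFlow SavedModel so that supported subgraphs run through TensorRT. A sketch, with placeholder paths:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a SavedModel into a TF-TRT graph (FP16 suits the Jetson Nano GPU)
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode="FP16")
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="saved_model",  # placeholder input path
    conversion_params=params)
converter.convert()
converter.save("saved_model_trt")         # placeholder output path
```
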
Sample Prediction Outputs

[Images: two true positives, two true negatives, one false positive, one false negative]


As the MobileNetV2-based architecture did not give satisfactory performance on test data, we developed our own CNN architecture from scratch.
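The exact layer configuration is in the linked implementation; purely as an illustration (every layer size below is a placeholder), a from-scratch CNN of this kind, including the larger-than-3 kernels discussed later, might look like:

```python
import tensorflow as tf

# Hypothetical from-scratch CNN; only the presence of kernel sizes > 3
# reflects the actual model described in this report.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=7, strides=2, activation="relu",
                           input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```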


Model training

  • We trained the model for 150 epochs, saving a checkpoint at the best value of precision@90recall.
  • Loss function used: BinaryCrossentropy. We also assigned class weights (unsafe: 1, safe: 1.92) during training, as our dataset is imbalanced.
  • Optimizer used: Adam, with a learning rate of 0.0002.
  • The Python implementation can be found here.
Model performance

  • Precision: 0.90, Recall: 0.99, Binary accuracy: 0.99 (on train data)
  • Precision: 0.90, Recall: 0.72, Binary accuracy: 0.89 (on test data)
  • Throughput on Jetson Nano after conversion into a TensorRT graph: 3 fps. The inference speed on the Jetson Nano is very low because our architecture has convolutional layers with kernel sizes larger than 3, which the TensorRT engine could not optimize.
Sample Prediction Outputs

[Images: two true positives, two true negatives, one false positive, one false negative]


Since convolutional layers with larger kernel sizes were not optimized by the TensorRT engine, we reduced the kernel sizes and added dilation to those layers to compensate for the reduced receptive field. This resulted in a higher inference speed after conversion to a TensorRT graph.
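The idea: a k x k kernel with dilation d covers an effective extent of k + (k - 1)(d - 1) pixels, so a 3x3 kernel with dilation 2 spans the same 5x5 region as a 5x5 kernel while keeping the small kernel size that TensorRT optimizes well. A sketch of the substitution (filter counts are placeholders):

```python
import tensorflow as tf

# Before: a 5x5 convolution, which TensorRT could not optimize for us
large_kernel = tf.keras.layers.Conv2D(64, kernel_size=5, activation="relu")

# After: a 3x3 convolution with dilation 2; effective extent
# 3 + (3 - 1) * (2 - 1) = 5, matching the 5x5 receptive field
dilated = tf.keras.layers.Conv2D(64, kernel_size=3, dilation_rate=2,
                                 activation="relu")
```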


Model training

  • We trained the model for 150 epochs, saving a checkpoint at the best value of precision@90recall.
  • Loss function used: BinaryCrossentropy. We also assigned class weights (unsafe: 1, safe: 1.92) during training, as our dataset is imbalanced.
  • Optimizer used: Adam.
  • The Python implementation can be found here.
Model performance

  • Precision: 0.90, Recall: 0.99, Binary accuracy: 0.97 (on train data)
  • Precision: 0.90, Recall: 0.77, Binary accuracy: 0.84 (on test data)
  • Throughput on Jetson Nano after conversion into a TensorRT graph: 8 fps.
  • Since this model gives a decent recall of 0.77 at a high precision of 0.90, along with a good inference speed, we have deployed it on the Jetson Nano to build a real-time, portable road crossing assistant.
Sample Prediction Outputs

[Images: two true positives, two true negatives, one false positive, one false negative]