Model Training - Overview

To develop a model that predicts road crossing safety, we ran the training experiments summarized below. Because assisting visually impaired pedestrians is an application that demands high precision (along with reasonably good recall) before it can be deployed in real time, we use precision and recall as the evaluation metrics.

Single frame SVM

Simple handcrafted features

precision : 0.51, recall : 0.70

This is our simplest approach: we extracted simple per-frame features capturing the number, location, and size of vehicles, and trained an SVM classifier on them.
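A minimal sketch of this pipeline; the specific features, toy detections, and labels below are illustrative assumptions, not the actual ones used in our experiments:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def frame_features(boxes):
    """Hypothetical per-frame features from vehicle boxes (x, y, w, h):
    vehicle count, mean position, and mean box area."""
    if not boxes:
        return np.zeros(4)
    b = np.asarray(boxes, dtype=float)
    return np.array([len(b), b[:, 0].mean(), b[:, 1].mean(),
                     (b[:, 2] * b[:, 3]).mean()])

# Toy frames with illustrative labels: 1 = safe to cross, 0 = unsafe.
frames = [
    [],                                        # empty road
    [(100, 50, 40, 30)],                       # one small/distant vehicle
    [(200, 120, 90, 70), (50, 110, 80, 60)],   # two large/near vehicles
    [(220, 130, 100, 80)],                     # one large/near vehicle
]
X = np.stack([frame_features(f) for f in frames])
y = np.array([1, 1, 0, 0])

clf = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X, y)
print(clf.predict(frame_features([]).reshape(1, -1)))  # empty road -> [1]
```

Standardizing the features matters here: vehicle counts and box areas live on very different scales, and an unscaled kernel would be dominated by the area term.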

Existing features + directional vehicle filtering

precision : 0.55, recall : 0.74

This improves on the previous approach: using vehicle tracking, the feature extraction now ignores vehicles travelling on the opposite half of the road.
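The filter can be sketched as below; the "approaching means drifting left" convention and the threshold are assumptions for illustration, since the real direction test depends on camera placement:

```python
# Hypothetical directional filter over per-vehicle tracks, where each
# track is the vehicle's x-centre in recent frames (oldest first).
def approaching(track, min_shift=2.0):
    """Assume vehicles relevant to the pedestrian drift left in the image."""
    return (track[0] - track[-1]) >= min_shift

def filter_relevant(tracks):
    # Keep only vehicles on the pedestrian's half; drop opposite traffic.
    return [t for t in tracks if approaching(t)]

tracks = [
    [300, 280, 260, 240],   # moving left  -> relevant
    [100, 120, 140, 160],   # moving right -> opposite half, ignored
]
print(len(filter_relevant(tracks)))  # 1
```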

Existing features + Vehicle speed

precision : 0.68, recall : 0.87

This improves on the previous approach in two ways: the feature extraction now considers the relative speed of the vehicles, and the labels were improved by annotating videos frame-wise (instead of second-wise).
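One simple way to derive a speed feature from the same tracks; this pixel-displacement formulation and the frame rate are assumptions, not the exact computation we used:

```python
# Hypothetical speed feature: mean displacement of a tracked vehicle's
# x-centre between consecutive frames, scaled by the frame rate.
def pixel_speed(track, fps=30):
    """track: x-centres in consecutive frames; returns pixels/second."""
    deltas = [abs(b - a) for a, b in zip(track, track[1:])]
    return (sum(deltas) / len(deltas)) * fps

print(pixel_speed([300, 290, 281, 273]))  # 270.0 px/s
```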

Multi frame SVM

Multi frame features

precision : 0.75, recall : 0.88

Since even humans do not decide whether it is safe to cross a road from a single glance, this approach uses multi-frame features instead of per-frame features.
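A sketch of how per-frame feature vectors can be stacked into multi-frame samples (the window length and feature dimension here are arbitrary placeholders):

```python
import numpy as np

# Build multi-frame samples by concatenating the per-frame feature
# vectors of the last k frames, so the classifier sees short-term motion.
def multi_frame_features(per_frame, k=5):
    """per_frame: (num_frames, d) array -> (num_frames - k + 1, k * d)."""
    windows = [per_frame[i:i + k].ravel()
               for i in range(len(per_frame) - k + 1)]
    return np.stack(windows)

per_frame = np.random.rand(20, 4)     # 20 frames, 4 features each
X = multi_frame_features(per_frame, k=5)
print(X.shape)  # (16, 20)
```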

Optimized multi-frame features

precision : 0.79, recall : 0.84

This is similar to the previous approach, computing multi-frame features in a sliding-window manner, but with a somewhat optimized feature extraction logic.
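One plausible form such an optimization could take (an assumption on our part, shown for illustration): maintaining window statistics incrementally instead of recomputing them from scratch at every frame.

```python
from collections import deque

# Incremental sliding-window aggregate: update a running sum in O(1) as
# frames enter and leave the window, rather than rescanning all k frames.
class RunningWindowMean:
    def __init__(self, k):
        self.k, self.buf, self.total = k, deque(), 0.0

    def push(self, value):
        self.buf.append(value)
        self.total += value
        if len(self.buf) > self.k:
            self.total -= self.buf.popleft()
        return self.total / len(self.buf)

w = RunningWindowMean(k=3)
print([w.push(v) for v in [3, 6, 9, 12]])  # [3.0, 4.5, 6.0, 9.0]
```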

Single frame CNN

MobileNetV2 based architecture

precision : 0.90, recall : 0.60

In this approach, we used the MobileNetV2 architecture with additional dense layers on top. We chose MobileNetV2 because it is a lightweight architecture particularly suited to mobile and embedded vision applications.
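A Keras sketch of such a model; the input size, head widths, and `weights=None` (which skips downloading ImageNet weights) are assumptions, not the exact configuration we trained:

```python
import tensorflow as tf

# MobileNetV2 backbone plus a small dense head producing P(safe to cross).
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights=None)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary safe/unsafe
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
print(model.output_shape)  # (None, 1)
```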


Self developed architecture

precision : 0.90, recall : 0.72

As MobileNetV2 did not give satisfactory performance on the test data, we developed our own CNN architecture. However, since its convolutional layers used kernels larger than 3x3, its inference speed was very low even after optimization to a TensorRT graph.

Self developed architecture with dilated convolutions

precision : 0.90, recall : 0.77

This improves on the previous approach by replacing the convolutional layers with large kernels by dilated convolutional layers, which resulted in a higher inference speed after optimization to a TensorRT graph.
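The intuition behind the swap can be checked with the standard receptive-field formula for dilated convolutions: a k x k kernel with dilation d covers k + (k - 1)(d - 1) input positions per axis, so a 3x3 kernel with dilation 3 sees the same 7x7 region as a dense 7x7 kernel while using 9 weights instead of 49.

```python
# Effective (per-axis) kernel size of a dilated convolution.
def effective_kernel(k, d):
    return k + (k - 1) * (d - 1)

print(effective_kernel(3, 1))  # 3: ordinary 3x3 convolution
print(effective_kernel(3, 2))  # 5: matches a 5x5 receptive field
print(effective_kernel(3, 3))  # 7: matches a 7x7 receptive field
```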


Note : All precision and recall values reported above are measured on test data.