Modified Single Shot Detector Architecture for Autonomous Delivery Robot

By Abhishek Jadhav

Freelance Tech Writer

November 30, 2021



With Industry 4.0 kicking in, the need for autonomous robots to perform inference at the edge is increasing exponentially. The integrated sensors on the robotic platform have been an important aspect in the design for robot localization, navigation, and obstacle avoidance.

HermesBot Autonomous Delivery Robot [Image Credit: Research Paper]

Due to Covid-19, logistics providers had to devise new methods for last-mile delivery. In shifting from traditional methods to unmanned aerial vehicles and autonomous ground vehicles, lifting capacity, range, and cost-effectiveness have been among the significant factors to consider. Even autonomous ground vehicles face several difficulties: they need to be robust and require precise localization technologies along with obstacle detection and avoidance algorithms.

For an autonomous delivery robot, obstacle detection and avoidance require real-time data from the integrated sensors about moving objects in the surroundings. The data acquisition sensors integrated into the robot include LiDAR, ultrasonic distance sensors, and infrared and visible-spectrum cameras. The HermesBot used in the experimental setup has six rolling-shutter cameras on its perimeter, providing a 360-degree field of view. With this massive stream of real-time data, the robot's onboard computer can struggle to keep up, running into computational power and memory constraints. There is also always a trade-off between an algorithm's accuracy and its inference time. The ongoing research focuses on these aspects to improve the robot's efficiency and enable cost-effectiveness.

CNN-based Omnidirectional Object Detection for HermesBot

The work aims to improve the efficiency and efficacy of object detection systems on remote delivery robots. The methodology is suitable for highly complex systems with massive real-time incoming data from multiple cameras with limited computation power. The HermesBot delivery robot has two sets of RealSense cameras on the front and back sides for robot localization and six RasPi NoIR V2 cameras for pedestrian detection.

The R-CNN (Region-based Convolutional Neural Network) technique for object detection works in two stages: one network proposes candidate regions, and a second classifies the objects within them. To increase computational speed, the researchers instead used the Single Shot MultiBox Detector (SSD) architecture with EfficientNet-B0 for feature extraction. A Single Shot Detector finds multiple objects in a single forward pass over the image, making it faster than RPN-based approaches, though with somewhat lower accuracy.
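SSD's single-pass design can be made concrete by counting how many "default boxes" it evaluates at once. The sketch below is illustrative, not the paper's code; the feature-map sizes and boxes-per-location are the common defaults for a 300x300 SSD input, which may differ from the EfficientNet-B0 variant used in the study.

```python
# Illustrative sketch: SSD predicts class scores and box offsets in one
# forward pass by tiling "default boxes" over feature maps at several scales.
# Values below are the standard SSD300 configuration (an assumption here).

feature_maps = [38, 19, 10, 5, 3, 1]   # spatial size of each detection layer
boxes_per_cell = [4, 6, 6, 6, 4, 4]    # default boxes tiled at each location

total_boxes = sum(f * f * b for f, b in zip(feature_maps, boxes_per_cell))
print(total_boxes)  # 8732 boxes scored in a single shot
```

All of these boxes are classified and regressed simultaneously, which is why SSD avoids the separate region-proposal pass that R-CNN-style detectors need.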

Architecture of Single Shot Detector with EfficientNet-B0 feature extractor [Image Credit: Research Paper]

From the architecture of the Single Shot Detector with EfficientNet-B0 feature extractor, it can be seen that the input image is passed through a feature extractor (backbone), which produces feature maps at various convolutional layers. The earlier layers retain more spatial information, which helps the detection block find smaller objects, while deeper layers capture higher-level features. The features extracted at all of these layers are then sent to the object detection block.
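The reason for tapping several layers is that each feature level is responsible for objects of a different size. A common way to assign sizes, sketched below with the standard SSD scale formula (the values are the usual defaults, not taken from this paper), gives small scales to the large, early feature maps and large scales to the small, deep ones:

```python
# Sketch of the standard SSD default-box scale assignment:
# s_k = s_min + (s_max - s_min) * k / (m - 1), for m feature levels.
# s_min/s_max are the common defaults (an assumption, not from the paper).

s_min, s_max, m = 0.2, 0.9, 6
scales = [round(s_min + (s_max - s_min) * k / (m - 1), 2) for k in range(m)]
print(scales)  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Earlier (higher-resolution) maps get the smallest scales, so they handle small, distant pedestrians, while the final 1x1 map covers objects filling most of the frame.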

For classification models, EfficientNet-B0 is one of the fastest feature-extraction backbones. Three parameters are important for this method: the depth of the layers, the number of input and output channels (width), and the spatial size (resolution). But the traditional Single Shot Detector methodology still faces difficulties, such as providing real-time information on detected pedestrians.
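These three parameters are exactly what the EfficientNet family scales jointly. As a brief sketch (the coefficients come from the original EfficientNet paper, not from this robot study), a single coefficient phi multiplies depth, width, and resolution together, with B0 being the phi = 0 baseline:

```python
# Sketch of EfficientNet's compound scaling. alpha/beta/gamma are the
# grid-searched constants from the EfficientNet paper (an assumption here).

alpha, beta, gamma = 1.2, 1.1, 1.15

def scaled(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# The constants satisfy alpha * beta^2 * gamma^2 ~ 2, so FLOPs roughly
# double with each unit increase of phi; B0 is simply phi = 0.
print(scaled(0))                                # (1.0, 1.0, 1.0)
print(round(alpha * beta ** 2 * gamma ** 2, 2)) # ~2 per unit of phi
```

B0's small depth, width, and resolution are what make it a fast backbone for an embedded platform like HermesBot.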

Modified Single Shot Detector Architecture

Modified Single Shot Detector Architecture [Image Credit: Research Paper]

The research modifies the architecture by adding a classification layer before the extra feature-extraction convolutional layers. This improves the speed of the human-detection pipeline by skipping frames that do not contain the target object.
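The control flow of this early-exit idea can be sketched in a few lines. This is a hypothetical illustration of the concept, not the authors' code: the function names, threshold, and stand-in callables are all invented for demonstration.

```python
# Hypothetical sketch of the early-exit gate: run a cheap binary classifier
# first, and only invoke the full SSD detection head when a person is likely.

def detect_people(frame, person_score, run_detector, threshold=0.5):
    """Skip the expensive detection pass on frames without people."""
    if person_score(frame) < threshold:   # cheap "person present?" classifier
        return []                         # early exit: detection head never runs
    return run_detector(frame)            # full SSD pass only when needed

# Toy usage with stand-in callables (illustrative only):
empty_frame, busy_frame = "empty", "busy"
score = lambda f: 0.9 if f == "busy" else 0.1
detector = lambda f: [("person", 0.87)]
print(detect_people(empty_frame, score, detector))  # []
print(detect_people(busy_frame, score, detector))   # [('person', 0.87)]
```

With six cameras streaming simultaneously, most frames on most cameras contain no pedestrians, so gating the detection head this way saves the bulk of the computation.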


Performance of the Modified SSD on the number of frames with people, compared to Classical SSD [Image Credit: Research Paper]

As per the results, this method could be a breakthrough for multiple-camera setups on delivery robots. The performance improvement of the modified SSD architecture can be seen in the chart above: in most cases, the proposed algorithm significantly reduces the computational complexity of object detection. The method is also suitable for other detection architectures in which a classifier is used as the feature extractor. Future work could identify the density of people around the robot in urban areas to further improve efficiency.

Abhishek Jadhav is an engineering student, freelance tech writer, RISC-V Ambassador, and leader of the Open Hardware Developer Community.