An experiment with YOLOv3-4-5 and EfficientDet for infrastructure asset management

The above video shows the results of YOLOv4 trained on a small dataset of hectometre sign images. In a short experiment we compared YOLOv3, YOLOv4, YOLOv5 and EfficientDet. YOLOv4 performed better than YOLOv3: with v4, smaller plates in the image are detected, and signs partially hidden by grass are also detected, although later than uncovered signs. YOLOv5 showed similar results, but trained roughly six times faster. However, YOLOv5 is not yet solidly substantiated and there is some controversy about both the content of the model and the claim to the YOLO name. We leave this discussion for what it is and recommend testing YOLOv5 yourself.


By Gerard Mol (Managing Partner) and Brand Hop (Chief Data Science), Result! Data, Zoetermeer, the Netherlands

Why this blog?

In this blog we want to explore possibilities for the automatic extraction of data about physical assets, such as roads, bridges, ports, railroads or street objects, that are usually managed by Asset Managers in both the public and private sector. We focus on Object Detection, a technique to localize and classify objects within an image or video. We thank Roboflow.ai for their help, fruitful discussions and the tutorials on their website. Please check out https://roboflow.ai/ for the latest developments and tutorials on Computer Vision!

Asset Management Information

Infrastructure Asset Managers in industry, utilities, municipalities or public organizations manage massive amounts of physical assets representing high capital expenditure. Their task is to provide reliable, available, maintainable, safe infrastructure at acceptable operational cost and capital expenditure, while avoiding risks to health and the environment. Good asset management optimizes this blend of activities to achieve the best balance for the organization. All these activities call for good information. As we will see, there are two types of data involved: first, the Static Data or Master Data, and second, the Dynamic Data. The quality of these two types of data has a high impact on the quality of Asset Management.

The first, and often biggest, challenge for many Asset Managers is the quality of the Master Data. One of the most important causes of data errors is reliance on human entry. Asset Managers must create and maintain data on millions of objects, and as we all know, human beings have their limitations in processing massive amounts of data accurately. If Master Data is still present while the physical object has already been removed, is missing, or contains errors and omissions in its characteristics, it degrades the quality of Asset Management processes in both proactive and reactive maintenance and replacement, and costly correction measures must be taken. The cost comes from regular, traditionally human, inspections to correct data errors and from manpower within the organization to compensate for the effects and resulting risks of data errors.

Over the last decades, Asset Managers have improved data quality as far as human input allows, and they have reached the boundaries of what is possible in the current way of working. Errors and shortcomings in static and dynamic data limit further improvement of Asset Management. We must therefore look for new possibilities to take the next step in Data-Driven Asset Management.

How Computer Vision can help to improve master data quality

One of the most promising techniques for this is Computer Vision (CV). With advanced Machine Learning it has been possible to detect objects in images better and better for several years now. In recent years, these models have become able to detect objects from multiple classes in real time and even indicate their position in the image. This offers opportunities to gather object data automatically instead of manually. Reducing or even eliminating human interpretation and entry of object data could significantly improve the quality of both static and dynamic data of physical assets. We imagine a world relieved of cumbersome and error-prone human inspection, taking notes, extracting data from notes and images, and finally entering data manually. We believe Machine Learning, including the subset of models for CV, offers great potential for the automatic extraction of data on physical assets. That is the main reason for the experiment described in this blog: to get an impression of the potential of state-of-the-art Object Detection models.

Recent developments in Object Detection

The field of CV develops very rapidly, offering models that are ever easier to use as the availability of models and frameworks grows. To investigate the possibilities of automated detection, inspection and data verification, we compared four models (YOLOv3, YOLOv4, EfficientDet and YOLOv5), of which the last three appeared in 2020, the last one only at the end of May 2020. Our intention was not to conduct a rigorous, scientifically sound investigation, but rather to explore the potential of the models. Do not expect a thorough comparison on metrics such as mean average precision (mAP), but rather a quick illustration of the possibilities for Asset Management. So, let's go!

Our experiment

To explore the possibilities of the recently published models, we took pictures of hectometre signs along Dutch national roads and highways. Next we standardized the size of all pictures, annotated the bounding boxes and applied data augmentation (cropping, rotating, adjusting brightness, and adding blur and noise), resulting in three extra images per original image on average. The next picture shows an example of a still image, taken with an iPhone X from behind the windscreen of a driving car.
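As an aside, below is a minimal sketch of what such an augmentation pipeline can look like in Python, using the open-source albumentations library. This is not our exact pipeline; the transforms, parameters and file names are illustrative assumptions.

```python
import albumentations as A
import cv2

# Illustrative augmentation pipeline: crop, rotate, brightness, blur and
# noise, as described above. Parameters are examples, not our exact setup.
transform = A.Compose(
    [
        A.RandomCrop(width=416, height=416),
        A.Rotate(limit=10, p=0.5),
        A.RandomBrightnessContrast(p=0.5),
        A.GaussianBlur(p=0.3),
        A.GaussNoise(p=0.3),
    ],
    # Keep bounding boxes in sync with the transformed image; 'yolo'
    # format is (x_center, y_center, width, height), normalized to [0, 1].
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("hectometre_sign.jpg")   # hypothetical file name
bboxes = [[0.52, 0.48, 0.05, 0.08]]         # one example annotation
out = transform(image=image, bboxes=bboxes, class_labels=["hectometre_sign"])
aug_image, aug_bboxes = out["image"], out["bboxes"]
```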

Data split, training and results

We augmented 192 original, annotated images into 764 images, used 612 for training and validation, and held out 152 images for testing. With YOLOv4 we reached 88.4% mAP. This seems like a high score, which can probably be explained by the fact that the dataset is small and testing was done on a random selection of images taken in only a few runs. So you shouldn't read too much into this value; for the time being it was simply our best result. We also trained YOLOv3, EfficientDet and YOLOv5. YOLOv3 produced lower scores, EfficientDet came closer, and YOLOv5 showed results similar to YOLOv4. The difference was that YOLOv5 trained in 20 minutes instead of the 2 hours YOLOv4 needed. Some care is needed here, since the YOLOv4 we used was based on a different framework (Darknet) than YOLOv5 (PyTorch). So it is really a comparison between model-framework combinations rather than the models alone.
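For illustration, a minimal sketch of such a hold-out split in Python is shown below. The directory layout, seed and 80/20 ratio are assumptions for the example; the Darknet-style .txt lists of image paths are what the YOLOv3/v4 training pipelines consume.

```python
import random
from pathlib import Path

# Sketch of a simple hold-out split; paths and ratio are illustrative.
images = sorted(Path("dataset/images").glob("*.jpg"))
random.seed(0)                    # hypothetical seed, for reproducibility
random.shuffle(images)

n_test = int(0.2 * len(images))   # ~152 of 764 images in our case
test_set, trainval_set = images[:n_test], images[n_test:]

# Darknet-style lists: one image path per line, consumed by YOLOv3/v4.
Path("test.txt").write_text("\n".join(str(p) for p in test_set))
Path("train.txt").write_text("\n".join(str(p) for p in trainval_set))
```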

The trained model applied to a video

At the top of this post we published a short video showing results of the trained YOLOv4 model on a video. It appears to detect the signs quite well, even when partly covered by grass. The results do show an offset of the bounding box predictions, especially when the signs are taken from a more sideways angle. We do not have a good explanation yet but we suspect that the training images being taken almost straight ahead may be of influence. If anyone has a good explanation, feel free to contact us.
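For readers who want to try something similar, here is a minimal sketch of frame-by-frame video inference in Python with OpenCV. It loads the public ultralytics/yolov5 hub model purely for illustration; in our experiment we used our own trained weights, and the video file name is a placeholder.

```python
import cv2
import torch

# Load a detector via torch.hub; 'yolov5s' is the public small model,
# used here only for illustration (we used our own trained weights).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

cap = cv2.VideoCapture("drive.mp4")  # placeholder video file
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # model expects RGB
    results = model(rgb)
    # Each detection row is (x1, y1, x2, y2, confidence, class).
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)),
                      (0, 255, 0), 2)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```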

Conclusions

The world of object detection moves fast. Two years ago YOLOv3 was state-of-the-art; then EfficientDet improved results for a short period, but was quickly overtaken by YOLOv4, which came out in April this year, and shortly after that by YOLOv5, which came out in June. There is a wide range of models to choose from for object detection.

Hectometre signs are relatively small compared to other objects in an image, but the new models handle these objects relatively easily. Our experiment confirmed that YOLOv3 had difficulties detecting small objects in the images. EfficientDet was better, but not as good as YOLOv4 and v5; it showed slightly lower performance with training times comparable to YOLOv4. Setting aside the current discussion about YOLOv5, both on the model itself and on the claim to the YOLO name, we found that it produced promising results in this case. The training time of YOLOv5 was a factor of 6 shorter than that of YOLOv4, and especially its model size and inference speed seem very promising for high-speed requirements. That speed might not always be necessary for gathering data on infrastructure, but it can definitely be of use on measurement trains or on cameras mounted on regular passenger and freight trains, enabling frequent inspections of rails, sleepers, bolts, clamps and switches. It could also be promising for frequent road inspections or airport security applications.

We conclude that recently developed object detection models offer great potential for the automatic gathering of data on infrastructure objects. We performed a small experiment using models with fast inference, although in many cases slower post-processing would still be acceptable. The great advantage of the models shown in this blog, however, is that they enable streaming applications, avoiding the need to store and transport massive amounts of data. If you are an Infrastructure Asset Manager, think of the tens to hundreds of terabytes of images, videos and point clouds you produce each year. Storing these amounts in cloud solutions from Microsoft, AWS or Google causes significant cost. Object detection makes it possible to extract only the detected classes and locations and compare them with your Enterprise Asset Management databases. In our view these models offer great potential for automatic verification and imputation of data.
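As a sketch of what such a streaming pipeline could look like: detect on each frame and persist only compact records (timestamp, class, bounding box, confidence) for later comparison with the asset database, instead of storing the raw video. The function and file names below are hypothetical.

```python
import csv
import time

import cv2

def stream_detections(frames, model, out_path="detections.csv"):
    """Hypothetical sketch: keep only detection records, not raw frames.

    A few bytes per detection replace megabytes per video frame.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "class", "x1", "y1", "x2", "y2", "conf"])
        for frame in frames:                      # frames are BGR arrays
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            results = model(rgb)                  # e.g. a YOLOv5 hub model
            for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
                writer.writerow([time.time(), int(cls), x1, y1, x2, y2, conf])
```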

Final notes

This was not a rigorously performed quantitative comparison between models. As the video shows, predicted bounding boxes were offset when signs were seen more from the side, probably resulting in an IoU of 0.0, so technically the model would miss the object entirely, which would definitely influence mAP. As we didn't determine the ground truth for our video, we judged it qualitatively to get a feeling for its potential for our purposes. Consider it a short experiment to explore the possibilities, and in our opinion, they are great.
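For reference, IoU (intersection over union) measures the overlap between a predicted and a ground-truth box; a prediction that does not overlap its ground truth at all scores 0.0 and counts as a miss. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# An offset prediction with no overlap scores 0.0, i.e. a missed object:
print(iou((10, 10, 50, 50), (60, 10, 100, 50)))  # -> 0.0
```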

Acknowledgements

We greatly thank Joseph Nelson, CEO of Roboflow.ai, for his help and the fruitful discussions on the work and the models. Roboflow is dedicated to Computer Vision and offers services to go from raw images to trained models in hours or even minutes. You might want to check out their tutorials.