Computer vision model for industrial safety


Industrial safety is an important topic because industrial jobs such as construction involve many people doing manual work, and it is difficult to verify that every worker uses the required safety equipment. Emerging technologies such as computer vision are helping to automate many activities, and this work proposes a solution based on it.

Computer vision is an interdisciplinary field that enables computers to understand images and videos. Building on this, this paper investigates how to apply computer vision algorithms such as RetinaNet to detect whether a worker is wearing a helmet, with the sole objective of reducing the number of accidents caused by the lack of a safety helmet.


Construction is under way almost everywhere: apartment buildings, shopping centers, roads, and so on. People work at these sites and their safety is very important, so many companies give their employees safety equipment such as helmets, gloves, or glasses, depending on the type of work. The most common item is the helmet because, in high-risk jobs, a helmet can make the difference between living and dying. On the other hand, workers are often careless and avoid wearing the helmet while working on the construction site.

Improvisation can be a response to emergency situations, but it can also be a way to get around barriers imposed by the lack of tools, materials, or personnel at the moment the worker needs them. However, improvisation is one of the factors that lead to accidents in the workplace. Preventing injuries from occupational accidents is a public priority: in the United States alone, 3.9 out of every 100 full-time workers suffered some kind of nonfatal occupational accident or illness. The high accident rate is associated with several variables, and in their day-to-day activity workers are exposed to many risk factors such as high noise levels, heights, and heat.

In Peru, in December 2020 the most frequent nonfatal accidents were blows from objects (11.56%), falls on the same level (10.56%), and physical overexertion (10.44%). By occupational category, the people who suffered the most accidents were those with manual jobs, for example operators and laborers. According to the same bulletin, 3% of accidents specifically affected the head (Ministry of Labor and Employment Promotion, 2020).

The aim of this study is to develop a computer vision model to prevent injuries and accidents associated with the lack of a safety helmet by detecting whether or not a worker is wearing a helmet.

Materials and methods


The data for this work consisted of around 1500 images of people and workers wearing safety helmets, each labeled manually. The labeling followed two approaches. The first labels the helmet and the worker separately (two targets, helmet and worker), so that a later step can check whether the detected helmet is on the worker's head. The second labels the worker and the helmet jointly, which also gives two targets (worker with helmet and worker without helmet).

The images initially came from different sources, for example the internet, our own photographs, and security cameras. However, most of the internet images were of very poor quality and were removed. The remaining images are diverse in color and size and show people in different poses.

Computer Vision

Computer vision is the ability of computers to capture and analyze images and to make interpretations and decisions about them. For example, it can be used to detect and recognize images and to identify patterns or objects within them. Images are processed through a set of components that perform various transformations and produce a final result. This process is known as the image processing pipeline or computer vision pipeline.

An artificial neural network is a computing system designed to work the way the human brain works. For example, when we see an object, our eyes capture an image of that object that is passed to the brain as an input signal. Neurons in our brain perform the computation on the input signals and generate outputs. In other words, we learn to recognize an object after seeing it just once or a few times.

Artificial neural networks copy this behavior, and they have produced many results. The brain, however, has many neural networks, and one way to represent this is through deep learning. Deep learning is another name for a multilayer artificial neural network. There are different types of deep learning systems depending on the neural network architecture and its working principles, for example feed-forward neural networks, convolutional networks, recurrent neural networks, autoencoders, and deep belief networks.

For this work, the multilayer artificial neural network architecture was a convolutional network, because it is the basis for recognizing images and detecting objects. Object detection involves two distinct sets of activities: locating objects and classifying objects. Locating objects within the image is called localization, which is typically performed by drawing bounding boxes around the objects (Ansari, 2020).

RetinaNet is a deep learning model that detects and classifies objects. It was introduced in the paper Focal Loss for Dense Object Detection and is a dense, one-stage network composed of a ResNet-type base network and two task-specific subnetworks. The base network computes convolutional feature maps at different image scales using a Feature Pyramid Network (FPN). The first subnet performs object classification and the second subnet performs convolutional bounding box regression (Kar, 2020).
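The paper that introduced RetinaNet trains the classification subnet with the focal loss, which down-weights easy examples so that the many background regions do not dominate training. A minimal sketch of its binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), is given below. The function name is illustrative, and the defaults alpha = 0.25 and gamma = 2 are the values reported in that paper, not settings confirmed for this work.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive class (between 0 and 1)
    y: ground-truth label, 1 for positive and 0 for negative
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction contributes far less than an uncertain one.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
```

With gamma set to 0 and alpha set to 1 for a positive example, the expression reduces to the ordinary cross-entropy term, which is a quick way to check an implementation.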

The Residual Network, or ResNet, is simple: it starts and ends exactly like GoogLeNet (except without a dropout layer), and in between is just a very deep stack of simple residual units. Each residual unit is composed of two convolutional layers (and no pooling layers), with Batch Normalization and ReLU activation, using 3 x 3 kernels and preserving spatial dimensions (stride 1, “same” padding) (Géron, 2019).

ResNet architecture

Technical Stack

From a technological perspective, there is a wide variety of deep learning frameworks, for example TensorFlow, Keras, PyTorch, and Caffe. There are also many object detection models, for example YOLO, R-CNN, Fast R-CNN, Faster R-CNN, SSD, and RetinaNet. This work uses the Keras implementation of RetinaNet object detection.


First, we labeled the images as described above. For the first approach, we labeled two targets in each image, helmet and worker. The idea is that the model detects helmets and workers, and then the bounding boxes of each object are used to check whether the helmet is on the worker’s head. This idea is shown in the next image.

Image labeling (first approach)
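The bounding box check in the first approach can be sketched as follows. The source does not state the exact geometric rule, so this is only one possible rule, assumed for illustration: a helmet counts as worn when its center falls inside the top part of the worker's box. The function names and the head_fraction parameter are hypothetical.

```python
def helmet_on_head(worker_box, helmet_box, head_fraction=0.25):
    """Return True if the helmet's center lies in the top part of the worker box.

    Boxes are (x1, y1, x2, y2) in pixels, with the origin at the top-left
    corner of the image, so a smaller y means higher in the image.
    head_fraction is a hypothetical tuning parameter: the share of the
    worker's height, measured from the top, that is treated as the head.
    """
    wx1, wy1, wx2, wy2 = worker_box
    hx1, hy1, hx2, hy2 = helmet_box
    cx = (hx1 + hx2) / 2.0  # helmet center, horizontal
    cy = (hy1 + hy2) / 2.0  # helmet center, vertical
    head_bottom = wy1 + head_fraction * (wy2 - wy1)
    return wx1 <= cx <= wx2 and wy1 <= cy <= head_bottom


def workers_without_helmet(worker_boxes, helmet_boxes):
    """Return the worker boxes for which no detected helmet sits on the head."""
    return [
        w for w in worker_boxes
        if not any(helmet_on_head(w, h) for h in helmet_boxes)
    ]


workers = [(100, 50, 200, 350), (300, 60, 400, 360)]
helmets = [(120, 45, 180, 95)]  # sits on the head of the first worker only
unprotected = workers_without_helmet(workers, helmets)
```

A rule like this depends on the worker standing roughly upright and on both detections being accurate, which is one reason the comparison step can be fragile.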

The second approach to labeling uses two targets, one called “With Helmet” and the other “Without Helmet”. Here no calculation with the bounding boxes is needed; the detection alone gives the answer. This is more practical and easier, but at this point we do not yet know whether it performs better in training and detection.

Image labeling (second approach)
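With joint labels, the check reduces to reading the predicted class of each detection. The sketch below assumes that each detection is a (box, score, label) tuple and uses a hypothetical score threshold of 0.5; neither the tuple layout nor the threshold comes from the source or from a specific library.

```python
def find_violations(detections, score_threshold=0.5):
    """Return the boxes of confident "Without Helmet" detections.

    detections: iterable of (box, score, label) tuples. This layout is an
    assumption for illustration, not the output format of a given library.
    """
    return [
        box for box, score, label in detections
        if label == "Without Helmet" and score >= score_threshold
    ]


detections = [
    ((100, 50, 200, 350), 0.91, "With Helmet"),
    ((300, 60, 400, 360), 0.84, "Without Helmet"),
    ((500, 80, 560, 300), 0.32, "Without Helmet"),  # below threshold, ignored
]
violations = find_violations(detections)
```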

After labeling, all the data must be gathered into a single set for the later preprocessing of the images. Training improves when it starts from a pretrained model, a technique called transfer learning. For the purposes of this work we use the “coco” pretrained model.
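One way to gather the labeled boxes into a single set is to write them to CSV. The sketch below assumes the CSV layout documented by the keras-retinanet project (one row per box as path,x1,y1,x2,y2,class_name, plus a class mapping file with rows class_name,id). The source does not say which annotation format was used, and the function name, image path, and sample boxes are hypothetical.

```python
import csv
import io


def write_annotations(annotations, annotations_file, classes_file):
    """Write the boxes and the class mapping as CSV.

    annotations: list of (image_path, x1, y1, x2, y2, class_name) tuples.
    Both file arguments are open, writable text file objects.
    Returns the sorted list of class names; each name's index is its id.
    """
    class_names = sorted({row[5] for row in annotations})
    writer = csv.writer(annotations_file, lineterminator="\n")
    for row in annotations:
        writer.writerow(row)
    class_writer = csv.writer(classes_file, lineterminator="\n")
    for class_id, name in enumerate(class_names):
        class_writer.writerow([name, class_id])
    return class_names


# Hypothetical labels for one image, following the first labeling approach.
annotations = [
    ("images/site_001.jpg", 120, 45, 180, 95, "helmet"),
    ("images/site_001.jpg", 100, 50, 200, 350, "worker"),
]
ann_buf, cls_buf = io.StringIO(), io.StringIO()
class_names = write_annotations(annotations, ann_buf, cls_buf)
```

In-memory buffers are used here only to keep the example self-contained; real files opened with open(path, "w", newline="") would work the same way.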

As stated before, we use the Keras RetinaNet model. A complete implementation is available in a public repository, which contains all the steps from image preprocessing through training and testing.

The training was done on a computer with a GPU, specifically a GeForce MX130. The training took around four hours for 16 epochs on around 1500 images. The next table shows the training results for each epoch for the first approach.

Results of training (first approach)

For the second approach, the results in each epoch are the following.

Results of training (second approach)


The aim of this article was to develop a computer vision model to prevent injuries and accidents associated with the lack of a safety helmet by detecting whether or not a worker is wearing a helmet. Keras-RetinaNet provided a robust object detection algorithm that trained better when given good images. The GPU was a very important resource: depending on it, training can take more than ten hours or, as in this work, only two for each epoch.

The labeling followed two approaches in order to compare which is better. For this reason, the next images show the detections obtained with each labeling.

Detection for the first approach
Detection for the first approach
Detection for the second approach
Detection for the second approach

According to the results, it is evident that the model trained with the second labeling produces better detections than the other. The difference arises because the first approach, although it has two labels, must compare bounding boxes at detection time, which makes it harder to determine whether the helmet is on the worker's head. The second approach is easier: it also has two labels, but it needs no comparison and the detection is direct.

Industrial safety is crucial because human lives are involved, so it is important that the model detects accurately. It can be improved with more training images or with a better algorithm, but a combination of the two would be much better.


This study has explored the use of a computer vision algorithm, specifically Keras-RetinaNet, in industrial safety. The findings consistently supported that it is possible to use this model for this purpose. The association between the data, the labeling, and the algorithm was confirmed and tested in different scenarios.

It was also demonstrated that the form of labeling is important. Most conclusions indicate that the data and the model should be improved, but based on the two scenarios proposed, this research supports the assumption that labeling matters when using computer vision models.

Last and most important, the study indicates that it is possible to use computer vision models in industrial safety. As proposed in this work, they would help to detect missing safety equipment and would support the systems already proposed.


Ansari, S. (2020). Building Computer Vision Applications Using Artificial Neural Networks: With Step-by-Step Examples in OpenCV and TensorFlow with Python. Centreville: Apress.

Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. O’Reilly Media, Inc.

Kar, K. (2020). Mastering Computer Vision with TensorFlow 2.x. Birmingham: Packt Publishing.

Ministry of Labor and Employment Promotion. (2020). Notificaciones de accidentes de trabajo, incidentes peligrosos y enfermedades ocupacionales. Lima.
