Classical object detection in computer vision typically uses a sliding window: the whole image is scanned with windows of different sizes and scales, and a classifier is run on each window. The main drawback of this approach is the huge time cost of scanning the image many times to find objects.
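To make that cost concrete, here is a minimal sliding-window sketch in Python. The function name, window sizes, and stride are illustrative assumptions, and a blank NumPy array stands in for a real image:

```python
import numpy as np

def sliding_windows(image, window_sizes, stride):
    """Yield (x, y, crop) tuples at every position and scale.

    A classifier has to be run on every crop, which is why this
    approach is so slow: the number of windows grows with the
    number of positions times the number of scales.
    """
    height, width = image.shape[:2]
    for w, h in window_sizes:
        for y in range(0, height - h + 1, stride):
            for x in range(0, width - w + 1, stride):
                yield x, y, image[y:y + h, x:x + w]

# A 640 x 480 image scanned with three window sizes and a stride of 8
image = np.zeros((480, 640, 3), dtype=np.uint8)
windows = sliding_windows(image, [(64, 64), (128, 128), (256, 256)], stride=8)
print(sum(1 for _ in windows))  # thousands of crops, each needing a classifier pass
```

Even this small configuration produces thousands of crops, and each one requires a full classifier pass, which is exactly the redundancy YOLO is designed to avoid.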
YOLO uses a different approach: it divides the image into an S x S grid. For each grid cell, the deep learning model predicts B bounding boxes, a confidence score that each box contains an object, and a confidence score for each category in the training dataset. The following screenshot shows the S x S grid:
YOLO is trained with a 19 x 19 grid, 5 bounding boxes per grid cell, and 80 categories. The output is therefore a 19 x 19 x 425 tensor, where 425 comes from the data stored per box (the x, y, width, and height of the bounding box, plus the object confidence and the 80 class confidences) multiplied by the number of boxes per cell: 5 boxes * (4 + 1 + 80) values = 425.
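As a quick sanity check of that arithmetic, the following snippet computes the output shape from the values given above (the variable names S, B, and C are just illustrative):

```python
# Values from the text: a 19 x 19 grid, 5 boxes per cell, 80 classes
S = 19                      # grid cells per side
B = 5                       # bounding boxes per cell
C = 80                      # number of categories
values_per_box = 4 + 1 + C  # x, y, w, h + object confidence + 80 class confidences
print((S, S, B * values_per_box))  # (19, 19, 425)
```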
The YOLO v3 architecture is based on Darknet-53, a backbone network with 53 convolutional layers; YOLO adds 53 more layers on top of it, for a total of 106 layers. If you need a faster architecture, you can look at YOLO v2 or the Tiny YOLO variants, which use fewer layers.
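If you want to experiment with a pretrained YOLO v3 network without training it yourself, one common option is OpenCV's DNN module, which can load Darknet configs and weights directly. This sketch assumes OpenCV 4 and the yolov3.cfg and yolov3.weights files from the official Darknet distribution; the image path is a placeholder:

```python
import cv2

# Assumed files: yolov3.cfg and yolov3.weights from the official
# Darknet distribution; "input.jpg" is a placeholder image path.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

image = cv2.imread("input.jpg")
# 416 x 416 is the default input size in the distributed yolov3.cfg;
# pixel values are scaled to [0, 1] and BGR is swapped to RGB.
blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255.0,
                             size=(416, 416), swapRB=True, crop=False)
net.setInput(blob)

# YOLO v3 has three output layers (detections at three scales).
outputs = net.forward(net.getUnconnectedOutLayersNames())
```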