Algorithms Description used in project Huy Le Viet Anh, Quan Vo Nguyen Trung, Anh Hoang Tuan June 1, 2023 1 Contents 1 YOLO [1] 2 2 PaddleOCR [2] 4 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Unied Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 The Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 3 4 6 1 YOLO [1] 1.1 Introduction The human visual system quickly and accurately identies objects in images, their positions, and interactions. Ecient object detection algorithms would enable computers to drive cars without additional sensors, assistive devices to provide real-time scene information, and responsive robotic systems Authors of YOLO, you only look once, redene object detection as a regression problem, directly predicting bounding box coordinates and class probabilities from image pixels. YOLO uses a single convolutional network that predicts multiple boxes and corresponding class probabilities simultaneously, as shown in Figure 1. Figure 1: The YOLO Detection System [1] . What is convolutional neural network? Before diving into convolutional neural network, let have a brief understanding about neural network. Neural networks, or articial neural networks (ANNs), are computational models inspired by the human brain. They consist of interconnected articial neurons organized in layers, which receive inputs, perform mathematical operations, and produce output signals. Neural networks learn from data through training, adjusting weights and biases to improve performance. A convolutional neural network (CNN) is a type of neural network commonly used for analyzing visual data, such as images and videos and particularly particularly eective in tasks such as image classication, object detection, and image segmentation CNNs are designed to automatically learn hierarchical representations of data through the application of convolutional layers. These layers use lters (also known as kernels) to scan the input data, extracting local features and patterns. By employing multiple layers of convolutions, along with pooling layers for downsampling, CNNs can progressively learn more abstract and complex features. The architecture of a CNN typically consists of convolutional layers, activation functions (e.g., ReLU), pooling layers, fully connected layers, and an output layer. Convolutional layers detect local patterns, activation functions introduce non-linearity, pooling layers reduce the spatial dimensionality, and fully connected layers make the nal predictions. YOLO's benets 1. YOLO achieves remarkable speed 2. YOLO takes a global perspective of the image during prediction. 3 3. YOLO acquires generalizable object representations through learning. 1.2 Unied Detection Separate components of object detection are unied into a single neural network. The network utilizes features from the complete image to predict bounding boxes for each object. Additionally, it simultaneously predicts bounding boxes for all object classes in the image. Figure 2: The Model Network Design The model is implemented as a convolutional neural network and assess its performance using the PASCAL VOC detection dataset. The initial convolutional layers of the network extract image features, while the fully connected layers predict the probabilities and coordinates for the output. Figure 3: The architecture 4 2 PaddleOCR [2] 2.1 Introduction Scene text recognition is one of the most important and and challenging tasks in imagebased sequence recognition. The novel neural network architecture named Convolutional Recurent Neural Network (CRNN)is proposed for this problem CRNN is the combination of both DCNN and RNN 1. DCNN Deep Convolutional Neural Networks (DCNNs) and Convolutional Neural Networks (CNNs) are both types of neural networks commonly used for analyzing visual data. While they share similarities, there are some key dierences between the two: (a) Depth The main distinction between DCNNs and CNNs is the depth of the network. DCNNs are characterized by having a signicantly larger number of layers compared to traditional CNNs. The increased depth allows DCNNs to learn more abstract and complex features, capturing hierarchical representations of the data. (b) Feature Extraction Both DCNNs and CNNs use convolutional layers to extract features from input data. However, in DCNNs, the stacking of multiple convolutional layers enables the network to learn increasingly sophisticated and discriminative features. This depth enhances the ability of DCNNs to capture intricate details and patterns in the data. (c) Complexity Due to their deeper architecture, DCNNs tend to be more complex and computationally expensive compared to traditional CNNs. Training and optimizing DCNNs can require more computational resources and time. (d) Performance The increased depth of DCNNs often leads to improved performance in tasks that require complex visual analysis. DCNNs are known for achieving state-ofthe-art results in various computer vision tasks, such as image classication, object detection, and image segmentation. Traditional CNNs, on the other hand, may perform well in simpler visual tasks but may struggle to capture intricate details and complex relationships. 2. RNN A recurrent neural network (RNN) is a type of neural network that is designed to eectively process sequential data or data with temporal dependencies. Unlike feedforward neural networks, which process data in a single pass from input to output, RNNs have feedback connections that allow information to be carried from one step to the next. The key feature of RNNs is their ability to maintain a form of memory or context about previous inputs as they process subsequent inputs in a sequence. This memory is achieved through recurrent connections, where the output of a previous time step is fed back as an input to the current time step. 5 2.2 The Network Architecture Figure 4: The network architecture In gure 4 1. The convolutional layers automatically extract a sequence of features from each image. 2. The recurrent network is constructed to make predictions for each frame of the feature sequence. 3. The transcription layer at the highest level of CRNN translates the per-frame predictions from the recurrent layers into a sequence of labels. 6 References [1] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unied, real-time object detection. CoRR, abs/1506.02640, 2015. [2] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR, abs/1507.05717, 2015. 7