Uploaded by tuananh121225

algorithmsDescription

advertisement
Algorithms Description used in project
Huy Le Viet Anh, Quan Vo Nguyen Trung, Anh Hoang Tuan
June 1, 2023
1
Contents
1 YOLO [1]
2
2 PaddleOCR [2]
4
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Unied Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 The Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . .
2
2
3
4
6
1
YOLO [1]
1.1 Introduction
The human visual system quickly and accurately identies objects in images, their positions, and interactions. Ecient object detection algorithms would enable computers to
drive cars without additional sensors, assistive devices to provide real-time scene information, and responsive robotic systems
Authors of YOLO, you only look once, redene object detection as a regression problem, directly predicting bounding box coordinates and class probabilities from image
pixels.
YOLO uses a single convolutional network that predicts multiple boxes and corresponding class probabilities simultaneously, as shown in Figure 1.
Figure 1: The YOLO Detection System [1] .
What is convolutional neural network?
Before diving into convolutional neural network, let have a brief understanding about
neural network. Neural networks, or articial neural networks (ANNs), are computational
models inspired by the human brain. They consist of interconnected articial neurons
organized in layers, which receive inputs, perform mathematical operations, and produce
output signals. Neural networks learn from data through training, adjusting weights and
biases to improve performance.
A convolutional neural network (CNN) is a type of neural network commonly used for
analyzing visual data, such as images and videos and particularly particularly eective
in tasks such as image classication, object detection, and image segmentation
CNNs are designed to automatically learn hierarchical representations of data through
the application of convolutional layers. These layers use lters (also known as kernels) to
scan the input data, extracting local features and patterns. By employing multiple layers
of convolutions, along with pooling layers for downsampling, CNNs can progressively
learn more abstract and complex features.
The architecture of a CNN typically consists of convolutional layers, activation functions (e.g., ReLU), pooling layers, fully connected layers, and an output layer. Convolutional layers detect local patterns, activation functions introduce non-linearity, pooling
layers reduce the spatial dimensionality, and fully connected layers make the nal predictions.
YOLO's benets
1. YOLO achieves remarkable speed
2. YOLO takes a global perspective of the image during prediction.
3
3. YOLO acquires generalizable object representations through learning.
1.2 Unied Detection
Separate components of object detection are unied into a single neural network. The
network utilizes features from the complete image to predict bounding boxes for each
object. Additionally, it simultaneously predicts bounding boxes for all object classes in
the image.
Figure 2: The Model
Network Design
The model is implemented as a convolutional neural network and assess its performance using the PASCAL VOC detection dataset. The initial convolutional layers of the
network extract image features, while the fully connected layers predict the probabilities
and coordinates for the output.
Figure 3: The architecture
4
2
PaddleOCR [2]
2.1 Introduction
Scene text recognition is one of the most important and and challenging tasks in imagebased sequence recognition. The novel neural network architecture named Convolutional
Recurent Neural Network (CRNN)is proposed for this problem
CRNN is the combination of both DCNN and RNN
1. DCNN
Deep Convolutional Neural Networks (DCNNs) and Convolutional Neural Networks
(CNNs) are both types of neural networks commonly used for analyzing visual data.
While they share similarities, there are some key dierences between the two:
(a) Depth
The main distinction between DCNNs and CNNs is the depth of the network.
DCNNs are characterized by having a signicantly larger number of layers
compared to traditional CNNs. The increased depth allows DCNNs to learn
more abstract and complex features, capturing hierarchical representations of
the data.
(b) Feature Extraction
Both DCNNs and CNNs use convolutional layers to extract features from input data. However, in DCNNs, the stacking of multiple convolutional layers
enables the network to learn increasingly sophisticated and discriminative features. This depth enhances the ability of DCNNs to capture intricate details
and patterns in the data.
(c) Complexity
Due to their deeper architecture, DCNNs tend to be more complex and computationally expensive compared to traditional CNNs. Training and optimizing
DCNNs can require more computational resources and time.
(d) Performance
The increased depth of DCNNs often leads to improved performance in tasks
that require complex visual analysis. DCNNs are known for achieving state-ofthe-art results in various computer vision tasks, such as image classication,
object detection, and image segmentation. Traditional CNNs, on the other
hand, may perform well in simpler visual tasks but may struggle to capture
intricate details and complex relationships.
2. RNN
A recurrent neural network (RNN) is a type of neural network that is designed
to eectively process sequential data or data with temporal dependencies. Unlike
feedforward neural networks, which process data in a single pass from input to
output, RNNs have feedback connections that allow information to be carried from
one step to the next.
The key feature of RNNs is their ability to maintain a form of memory or context
about previous inputs as they process subsequent inputs in a sequence. This memory is achieved through recurrent connections, where the output of a previous time
step is fed back as an input to the current time step.
5
2.2 The Network Architecture
Figure 4: The network architecture
In gure 4
1. The convolutional layers automatically extract a sequence of features from each
image.
2. The recurrent network is constructed to make predictions for each frame of the
feature sequence.
3. The transcription layer at the highest level of CRNN translates the per-frame predictions from the recurrent layers into a sequence of labels.
6
References
[1] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You
only look once: Unied, real-time object detection. CoRR, abs/1506.02640, 2015.
[2] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network
for image-based sequence recognition and its application to scene text recognition.
CoRR, abs/1507.05717, 2015.
7
Download