Appendix-A
A REPORT
ON
Object Detection using Machine-Learning Models and
Developing Use-cases with the help of these models
BY
Name(s) of the Student(s): Neel Patel
ID.No.(s): 2022AAPS0624G
AT
L&T Energy Hydrocarbons, Vadodara
A Practice School-I Station
of
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
(July, 2024)
Appendix-B
A REPORT
ON
Object Detection using Machine-Learning Models and
Developing Use-cases with the help of these models
BY
Name(s) of the Student(s): Neel Patel
ID.No.(s): 2022AAPS0624G
Discipline(s): Electronics and Communication
Prepared in partial fulfillment of the
Practice School-I Course Nos.
BITS C221/BITS C231/BITS C241
AT
L&T Energy Hydrocarbons, Vadodara
A Practice School-I Station
of
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
(July, 2024)
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to Birla Institute of Technology and
Science, Pilani for providing me with the opportunity to undertake this summer-term
Practice School-I at Larsen & Toubro Energy Hydrocarbons, Vadodara. This
experience has been invaluable in enhancing my practical knowledge and
understanding of Video Analytics.
I am deeply thankful to Prof. V. Ramgopal Rao, Vice Chancellor of Birla
Institute of Technology and Science, Pilani, and Prof. Sudhirkumar Barai, Director of the
Institute, for their leadership and commitment to providing students with such
enriching experiences.
My heartfelt appreciation goes to Dr. Raghuram Ammavajjala, my professor in
charge, for his guidance, support, and valuable insights throughout this
training period. His expertise and mentorship have been instrumental in
shaping my understanding of the subject matter.
I would also like to extend my gratitude to the management and staff of L&T
Energy for their cooperation and for providing me with the opportunity to
learn about Video Analytics in a real-world industrial setting.
This experience has significantly contributed to my academic and professional
growth, and I am grateful to all those who have made it possible.
Appendix-C
Format of an Abstract Sheet
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE
PILANI (RAJASTHAN)
Practice School Division
Station: L&T Energy - Hydrocarbons          Centre: Vadodara
Duration: 8 weeks          Date of Start: 28/05/2024
Date of Submission: 19/07/2024
Title of the Project: Object Detection using Machine-Learning Models and Developing Use-cases with the help of these models
ID No./Name(s)/Discipline(s) of the student(s): 2022AAPS0624G / Neel Patel / Electronics and Communication
Name(s) and designation(s) of the expert(s): Vivek Chaudhary, Assistant Manager, IT & Systems
Name(s) of the PS Faculty: Ammavajjala Sesha Sai Raghuram
Key Words: Deep Learning, CNN, Computer Vision, Object Detection
Project Areas: Artificial Intelligence / Machine Learning
Abstract: Training models on different classes and increasing accuracy through hyperparameter tuning.
Signature(s) of Student(s)          Signature of PS Faculty
Date: 19/07/2024          Date:
TABLE OF CONTENTS
1. Introduction
1.1 Background
1.2 Objective
1.3 Scope
1.4 Methodology
1.5 Significance
2. Project Overview
2.1 Convolutional Neural Networks (CNN)
2.1.1 Working of CNN
2.1.2 Applications
2.1.3 Advancements
2.2 Data Collection & Annotation
2.2.1 Data Integration and Pre-Processing
2.2.2 Creating YAML Dataset
2.3 YOLO (You Only Look Once)
2.4 Use-Cases developed using YOLO
2.4.1 Safety Harness
2.4.2 Helmet and Jumpsuit Detection
2.4.3 Crowd Detection
2.4.4 Fall Detection
2.4.5 Oil Spillage Detection
2.4.6 Number-Plate Detection
2.4.7 Posture Detection
3. Conclusion
4. References
5. Glossary
1. Introduction
1.1 Background
AI and ML have their roots in the broader field of computer science and data analytics. AI
refers to the simulation of human intelligence in machines that are programmed to think
and learn. ML, a subset of AI, involves the development of algorithms that allow
computers to learn from and make decisions based on data. These technologies have seen
exponential growth in recent years, driven by advancements in computational power, the
proliferation of big data, and innovative algorithms.
1.2 Objective
The primary objective of this project is to explore the application of AI and ML in the
energy and hydrocarbons industry, with a focus on video analytics for industrial safety, highlighting how these
technologies can solve complex problems, enhance operational efficiency, and drive
innovation. This report aims to:
1. Provide a comprehensive overview of AI and ML concepts and methodologies.
2. Analyze the current state of AI and ML applications in the chosen field.
3. Present case studies showcasing successful implementations.
4. Identify the challenges and limitations associated with AI and ML adoption.
5. Offer insights and recommendations for future research and development.
1.3 Scope
This report encompasses a detailed study of AI and ML technologies, including supervised
and unsupervised learning, neural networks, and deep learning. It examines the impact of
these technologies on various processes, from predictive analytics and decision-making to
automation and customer engagement. The scope also includes an evaluation of ethical
considerations, data privacy issues, and the regulatory landscape affecting AI and ML
deployment.
1.4 Methodology
The research methodology for this project involves a combination of literature review,
case study analysis, and empirical research. Data is gathered from academic journals,
industry reports, expert interviews, and real-world examples. The analysis focuses on
both qualitative and quantitative aspects, providing a holistic view of AI and ML's role in
modern industry. Data is collected from various sources such as Roboflow and Cornell
University datasets for the various models.
1.5 Significance
Understanding AI and ML is crucial for businesses and researchers aiming to stay
competitive in an increasingly data-driven world. By leveraging these technologies,
organizations can gain a strategic advantage, improve decision-making processes, and
foster innovation. This project report serves as a valuable resource for stakeholders
seeking to harness the potential of AI and ML to drive growth and efficiency.
In conclusion, AI and ML represent the forefront of technological advancement, promising
to reshape industries and society. This report aims to shed light on their transformative
potential, offering a roadmap for successful implementation and integration in various
sectors.
2. Project Overview
2.1 Convolutional Neural Networks (CNN)
Neural Networks are a subset of machine learning, and they are at the heart of deep
learning algorithms. They are comprised of node layers, containing an input layer, one or
more hidden layers, and an output layer. Each node connects to another and has an
associated weight and threshold. If the output of any individual node is above the
specified threshold value, that node is activated, sending data to the next layer of the
network. Otherwise, no data is passed along to the next layer of the network.
While the feed-forward network is the most basic architecture, there are various
types of neural nets, which are used for different use cases and data types. For example,
recurrent neural networks are commonly used for natural language processing and speech
recognition, whereas convolutional neural networks (ConvNets or CNNs) are more often
utilized for classification and computer vision tasks. Prior to CNNs, manual, time-consuming
feature extraction methods were used to identify objects in images. However,
convolutional neural networks now provide a more scalable approach to image
classification and object recognition tasks, leveraging principles from linear algebra,
specifically matrix multiplication, to identify patterns within an image.
2.1.1 Working of CNN
Convolutional neural networks are distinguished from other neural networks by their
superior performance with image, speech, or audio signal inputs. They have three main
types of layers, which are:
· Convolutional layer
· Pooling layer
· Fully-connected (FC) layer
The convolutional layer is the first layer of a convolutional network. While convolutional
layers can be followed by additional convolutional layers or pooling layers, the
fully-connected layer is the final layer. With each layer, the CNN increases in complexity,
identifying greater portions of the image. Earlier layers focus on simple features, such as
colors and edges. As the image data progresses through the layers of the CNN, it starts to
recognize larger elements or shapes of the object until it finally identifies the intended
object.
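To make this layer progression concrete, here is a minimal PyTorch sketch of such a network. The layer sizes, the 64x64 input resolution, and the three output classes are illustrative assumptions, not the architecture used in this project:

```python
import torch
import torch.nn as nn

# Minimal CNN: two convolution + pooling stages followed by a
# fully-connected classifier head (illustrative sizes only).
class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges, colors
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # deeper layer: larger shapes
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, num_classes),          # fully-connected layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN()
dummy = torch.randn(1, 3, 64, 64)   # one 64x64 RGB image
print(model(dummy).shape)           # torch.Size([1, 3])
```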
2.1.2 Applications
CNNs have revolutionized a multitude of fields with their exceptional ability to analyze
visual data. Key applications explored in the project include:
· Image Classification: Categorizing images into predefined classes, with applications ranging from medical diagnostics to autonomous vehicles.
· Object Detection: Identifying and localizing objects within images, crucial for surveillance, robotics, and augmented reality.
· Semantic Segmentation: Assigning a class label to each pixel in an image, used in medical imaging and autonomous driving for understanding environments at a granular level.
· Face Recognition: Enhancing security systems and enabling personalized user experiences.
· Visual Search Engines: Powering advanced search functionalities in e-commerce and digital libraries.
2.1.3 Advancements
The project highlighted several recent advancements in CNN technology, including:
· Transfer Learning: Utilizing pre-trained models on large datasets to improve performance on specific tasks with limited data.
· Generative Adversarial Networks (GANs): Combining CNNs with adversarial training to generate realistic images.
· Capsule Networks: Addressing the limitations of traditional CNNs in understanding spatial hierarchies.
· Automated Machine Learning (AutoML): Leveraging automated techniques to design and optimize CNN architectures without extensive manual intervention.
Future research directions identified include improving computational efficiency,
enhancing model interpretability, and addressing ethical concerns related to bias and
privacy.
2.2 Data Collection & Annotation
To effectively train and evaluate our Convolutional Neural Network (CNN) models, we
collected a diverse dataset from multiple sources. This included both live data from
cameras pre-installed at various sites in India and publicly available datasets from
platforms like Roboflow.
Live Data Collection from Pre-installed Cameras
Our primary data collection involved leveraging an extensive network of pre-installed
cameras across different sites in India. These cameras were strategically placed in urban,
suburban, and rural areas to capture a wide range of visual scenarios. The collected data
included:
· Traffic Surveillance: Images and videos capturing traffic conditions, vehicle types, and pedestrian movements.
· Public Spaces: Footage from parks, markets, and streets, providing diverse scenes of human activity.
· Industrial Sites: Visual data from manufacturing plants and construction sites, showcasing various machinery and work processes.
The data collection process was automated using scripts that periodically retrieved
footage from these cameras, ensuring a consistent and comprehensive dataset. To address
privacy concerns and adhere to ethical standards, all collected data was anonymized, with
faces and other personally identifiable information blurred.
Sourcing Data from Online Platforms
To augment our dataset and ensure a rich variety of images for training our CNN models,
we sourced additional data from online platforms, particularly Roboflow. Roboflow is
known for its extensive collection of labeled image datasets, which are crucial for training
robust AI models. The data sourced from Roboflow included:
· Annotated Datasets: High-quality images with detailed annotations for object detection, classification, and segmentation tasks, prepared using LabelImg and LabelMe.
· Diverse Categories: A wide range of categories, from common objects like vehicles and animals to specific items relevant to our applications.
· Pre-processed Data: Images that have been pre-processed and standardized, facilitating seamless integration into our training pipeline.
2.2.1 Data Integration and Pre-processing
Combining data from live cameras and online platforms required meticulous pre-processing to ensure consistency and quality. The pre-processing steps included:
· Data Cleaning: Removing duplicates, corrupted files, and irrelevant images to maintain a high-quality dataset.
· Annotation Standardization: Ensuring that annotations from different sources followed a consistent format, crucial for effective training.
· Normalization: Scaling pixel values and standardizing image sizes to meet the input requirements of our CNN models.
· Data Augmentation: Applying techniques like rotation, flipping, and color adjustment to artificially expand the dataset, improving model generalization (an illustrative pipeline is sketched after this list).
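As an illustration of the normalization and augmentation steps, the following torchvision sketch shows an image-level pipeline. The specific transforms and parameter values are assumptions, and in an object-detection setting the bounding-box annotations would have to be transformed alongside the images:

```python
from torchvision import transforms

# Illustrative image-level pipeline: resize, random flip/rotation/color jitter,
# then conversion to a tensor with scaled and normalized pixel values.
train_transforms = transforms.Compose([
    transforms.Resize((640, 640)),                          # standardize image size
    transforms.RandomHorizontalFlip(p=0.5),                 # flipping
    transforms.RandomRotation(degrees=10),                  # rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # color adjustment
    transforms.ToTensor(),                                  # scale pixels to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```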
2.2.2 Creating YAML Dataset
To effectively train our YOLO (You Only Look Once) models for object detection tasks, we
need to format our collected data into a structured format that YOLO can understand. The
YAML file format is used to define the dataset configuration, specifying the paths to the
training, validation, and testing data, as well as the classes of objects to be detected. This
section outlines the process of creating a YAML dataset from our collected data. This
involves separating the images and their corresponding annotations into training,
validation, and test sets. The directory structure typically looks like this (a minimal example of the YAML configuration file is given after the list):
· images/train/: Contains the training images.
· images/val/: Contains the validation images.
· images/test/: Contains the test images.
· labels/train/: Contains the annotation files (in YOLO format) corresponding to the
training images.
· labels/val/: Contains the annotation files corresponding to the validation images.
· labels/test/: Contains the annotation files corresponding to the test images.
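A minimal sketch of such a dataset YAML is shown below; the paths follow the directory structure above, while the class names (borrowed from the safety-harness use-case) are only illustrative:

```yaml
# data.yaml - illustrative dataset configuration for YOLO training
train: images/train
val: images/val
test: images/test

# number of classes and their names (example classes from the harness use-case)
nc: 3
names: ["no harness", "harness without hook", "harness with hook"]
```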
The annotation files are in YOLO format, where each annotation file corresponds to an
image and contains lines of data representing the bounding boxes of detected objects.
Each line in the annotation file follows this format (an example line is given after the list):
· class_id: The integer identifier for the class of the object.
· x_center: The normalized x-coordinate of the center of the bounding box.
· y_center: The normalized y-coordinate of the center of the bounding box.
· width: The normalized width of the bounding box.
· height: The normalized height of the bounding box.
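For example, an annotation line such as "0 0.52 0.47 0.18 0.36" (purely illustrative numbers) would describe a class-0 object whose bounding box is centred at 52% of the image width and 47% of its height, and spans 18% of the width and 36% of the height.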
2.3 YOLO (You Only Look Once)
You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm
introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in
their famous research paper “You Only Look Once: Unified, Real-Time Object Detection”.
The authors frame the object detection problem as a regression problem instead of a
classification task by spatially separating bounding boxes and associating probabilities to
each of the detected images using a single convolutional neural network (CNN).
Some of the reasons why YOLO is leading the competition include its:
1- Speed
YOLO is extremely fast because it does not deal with complex pipelines. It can process
images at 45 Frames Per Second (FPS). In addition, YOLO reaches more than twice the
mean Average Precision (mAP) compared to other real-time systems, which makes it a
great candidate for real-time processing.
Benchmark comparisons show YOLO far ahead of other real-time object detectors,
reaching up to 91 FPS.
2- High detection accuracy
YOLO is far beyond other state-of-the-art models in accuracy with very few background
errors.
3- Better generalization
This is especially true for the newer versions of YOLO. With those
advancements, YOLO pushed a little further by providing a better
generalization for new domains, which makes it great for applications relying on fast and
robust object detection.
For instance, the "Automatic Detection of Melanoma with YOLO Deep Convolutional Neural
Networks" paper shows that the first version, YOLOv1, has the lowest mean average
precision for the automatic detection of melanoma disease, compared to YOLOv2 and
YOLOv3.
4- Open source
Making YOLO open-source led the community to constantly improve the model. This is one
of the reasons why YOLO has made so many improvements in such a limited time.
2.4 Use-Cases developed using YOLO
2.4.1 Safety Harness
To address the specific task of detecting different classes related to harnesses—namely,
"no harness," "harness without hook," and "harness with hook"—we utilized the YOLOv7
model. This section details the process of training the YOLOv7 model on these classes,
from dataset preparation to model evaluation.
YOLOv7 is an evolution of the YOLO (You Only Look Once) family of models, optimized for
real-time object detection. The configuration involves setting up parameters such as
model architecture, training hyperparameters, and dataset configuration in a YAML file.
Training the YOLOv7 model involved executing scripts with configurations specific to our
dataset and model architecture. Here's a basic outline of the training procedure (a typical command is given after the list):
1. Model Initialization: Initialize YOLOv7 model architecture using configuration files
(yolov7.yaml).
2. Data Loading: Load dataset and annotations using the data.yaml configuration file.
3. Training Setup: Define training parameters such as batch size, learning rate, and
number of epochs.
4. Model Training: Run the training script (train.py), which iteratively optimizes the
model weights based on the labeled dataset. During training, the model learns to identify
and classify harnesses into the specified categories.
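For reference, a typical invocation with the public YOLOv7 repository looks roughly like: python train.py --data data.yaml --cfg cfg/training/yolov7.yaml --weights yolov7.pt --epochs 300 --batch-size 16 --img 640 640. The exact flag names can differ between repository versions, and the batch size and image size shown here are only illustrative.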
Evaluation and Fine-tuning
After training, the model’s performance was evaluated using the validation set to assess
metrics like precision, recall, and mean average precision (mAP). Fine-tuning involved
adjusting parameters and augmenting data to improve detection accuracy and robustness.
Results
Accuracy: 65%
Epochs: 300
Learning rate: 0.01
Activation: ReLU
Regularization: None
2.4.2 Helmet and Jumpsuit Detection
In addition to harness detection, the YOLOv7 model was employed to detect safety
helmets and jumpsuits, essential for ensuring workplace safety compliance. The dataset
for this task was meticulously curated to include images of individuals wearing helmets
and jumpsuits, as well as instances where these safety gears were absent. The annotations
were prepared in YOLO format, specifying the bounding boxes and class labels for "no
helmet", "no jumpsuit" and "person". By training YOLOv7 on this annotated dataset, the
model learned to accurately identify and differentiate between the presence and absence
of helmets and jumpsuits in various environments. During the training process, we fine-tuned the model parameters and employed data augmentation techniques to enhance its
robustness and accuracy. The trained YOLOv7 model demonstrated high precision and
recall in detecting helmets and jumpsuits, making it a reliable tool for real-time
monitoring and enforcement of safety protocols in industrial and construction settings.
This capability not only aids in compliance checks but also significantly contributes to
reducing workplace accidents by ensuring that all personnel are appropriately equipped
with necessary safety gear.
Automated Annotation Using the Pretrained Model
To expedite the annotation process for our dataset, the pretrained and fine-tuned YOLOv7
model was also used for automatic annotation. This involved running the model on a large
set of unannotated images to generate initial bounding boxes and labels for helmets and
jumpsuits. The steps included:
1. Inference: Running inference on the unannotated images using the fine-tuned
YOLOv7 model to detect and classify helmets and jumpsuits.
2. Annotation Generation: The model outputs were saved in the YOLO format, creating
automatic annotations that were then reviewed and, if necessary, adjusted by
human annotators for accuracy (a minimal conversion sketch is given after this list).
3. Dataset Expansion: The automatically annotated images were added to the training
dataset, further enriching it and enabling iterative training cycles to continuously
improve the model’s performance.
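The core of the annotation-generation step is converting detector output from pixel coordinates into normalized YOLO label lines. The sketch below assumes detections are already available as (class_id, x1, y1, x2, y2) tuples in pixels; the helper name and file paths are illustrative:

```python
from pathlib import Path

def write_yolo_labels(detections, img_w, img_h, label_path):
    """Convert pixel-space (class_id, x1, y1, x2, y2) detections into YOLO-format
    lines: class_id x_center y_center width height (all coordinates normalized)."""
    lines = []
    for class_id, x1, y1, x2, y2 in detections:
        x_center = ((x1 + x2) / 2) / img_w
        y_center = ((y1 + y2) / 2) / img_h
        width = (x2 - x1) / img_w
        height = (y2 - y1) / img_h
        lines.append(f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}")
    out = Path(label_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines))

# Example: one detected "no helmet" box (class 0) in a 1280x720 frame.
write_yolo_labels([(0, 400, 120, 520, 300)], 1280, 720, "labels/train/frame_0001.txt")
```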
Results
Accuracy: 72%
Epochs: 200
Learning rate: 0.001
Activation: ReLU
Regularization: L1 (Lasso)
2.4.3 Crowd Detection
Density map based crowd counting
To estimate the number of people in a given image via the Convolutional Neural Networks
(CNNs), there are two natural configurations. One is a network whose input is the image
and the output is the estimated head count. The other one is to output a density map of
the crowd (say how many people per square meter), and then obtain the head count by
integration. The density-map approach is generally favored for the following reasons:
1. Density map preserves more information. Compared to the total number of the crowd,
density map gives the spatial distribution of the crowd in the given image, and such
distribution information is useful in many applications. For example, if the density in a
small region is much higher than that in other regions, it may indicate something
abnormal happens there.
2. In learning the density map via a CNN, the learned filters are more adapted to heads of
different sizes, hence more suitable for arbitrary inputs whose perspective effect varies
significantly. Thus the filters are more semantically meaningful, which consequently improves
the accuracy of crowd counting.
Density map via geometry-adaptive kernels
Since the CNN needs to be trained to estimate the crowd density map from an input image,
the quality of density given in the training data very much determines the performance of
our method. We first describe how to convert an image with labeled people heads to a
map of crowd density.
If there is a head at pixel xi, we represent it as a delta function δ(x − xi). Hence an image
with N heads labeled can be represented as the function H(x) = Σ δ(x − xi), where the sum runs over the N labeled head positions xi.
To convert this to a continuous density function, we may convolve this function with a
Gaussian kernel Gσ so that the density is F(x) = H(x) ∗ Gσ(x). However, such a density
function assumes that these xi are independent samples in the image plane which is not
the case here: In fact, each xi is a sample of the crowd density on the ground in the 3D
scene and due to the perspective distortion, and the pixels associated with different
samples xi correspond to areas of different sizes in the scene. Therefore, we should
determine the spread parameter σ based on the size of the head for each person within the
image. However, in practice, it is almost impossible to accurately get the size of the head due
to occlusion in many cases.
That’s why the concept we applied is a little bit different than density mapping as its both
computationally heavy and demands too much hassle.
The model detects 4 or more people in a image and it calculates the average area of the
each individual person’s bounding box and only takes those people in the crowd whose
bounding box falls within 6x the area of the calculated average area of bounding box. It
works on the principle of relative addressing. It detects a person in the image and creates
a bounding box around the person and calculates the area of the bounding box and checks
if it has any more bounding boxes within the 6x radius of the bounding box. If the number
of bounding boxes of people exceed the number 4 then a net bounding box is created
enclosing all the bounding boxes within that 6x area and mark it as CROWD.
It also gives the ‘COUNT’ of the number of people inside the bounding box of crowd.
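A minimal sketch of this grouping logic is given below. It assumes person detections are already available as (x1, y1, x2, y2) boxes in pixel coordinates; the 6x factor and the minimum of four people come from the description above, while the function name and the exact way membership is tested are illustrative:

```python
def detect_crowd(boxes, area_factor=6.0, min_people=4):
    """Group person boxes into a crowd using the relative-area rule:
    keep boxes whose area is within `area_factor` times the average area,
    and declare a CROWD if at least `min_people` such boxes remain."""
    if len(boxes) < min_people:
        return None

    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    avg_area = sum(areas) / len(areas)

    # Keep only people of comparable size (filters out very near/far outliers).
    members = [b for b, a in zip(boxes, areas) if a <= area_factor * avg_area]
    if len(members) < min_people:
        return None

    # Net bounding box enclosing all member boxes, plus the head count.
    x1 = min(b[0] for b in members)
    y1 = min(b[1] for b in members)
    x2 = max(b[2] for b in members)
    y2 = max(b[3] for b in members)
    return {"box": (x1, y1, x2, y2), "count": len(members)}

print(detect_crowd([(10, 10, 60, 150), (70, 12, 120, 155),
                    (130, 8, 185, 160), (200, 15, 250, 150)]))
```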
2.4.4 Fall Detection
Introduction
This report details the development of a fall detection system using advanced computer
vision techniques. The primary objective of the system is to identify and classify falls in real-time video footage, which is crucial for elderly care, workplace safety, and various
monitoring applications. The system integrates two powerful models, YOLOv8x and
YOLOv8x-pose-p6, to detect falls accurately.
Models Used
1. YOLOv8x: A state-of-the-art object detection model used for identifying the presence of a
person and detecting falls.
2. YOLOv8x-pose-p6: An advanced pose estimation model used for analyzing keypoints of a
person to confirm fall incidents.
Key Formulas
Aspect Ratio = Height of the Bounding Box / Width of the Bounding Box
Normalized X coordinate = X coordinate / Width of frame
Normalized Y coordinate = Y coordinate / Height of frame
Height change = Height of bounding box in current frame - Height of bounding box in previous frame
Width change = Width of bounding box in current frame - Width of bounding box in previous frame
Theory
The current version of our fall detection code classifies a person into two primary classes:
STANDING and FALLEN. However, it also uses two intermediate classes, CROPPED and
FALLING, to enhance prediction accuracy.
- STANDING vs. FALLEN: A person who is standing typically has a higher aspect ratio
compared to someone who has fallen. However, relying solely on the aspect ratio can be
misleading due to variations in camera zoom levels. To address this, additional parameters
like changes in bounding box dimensions and keypoint movements from the YOLO-pose
model are used.
Solution Implementation
1. Identifying FALLING (a minimal sketch of steps 1 and 2 is given after this list):
- Conditions to identify if a person is falling:
- Height change < -5 (adjustable)
- Width change > 3 (adjustable)
- Differentiate between FALLING and CROPPED:
- Height decrease + Width does not change → CROPPED
- Height decrease + Width increases → FALLING
2. Classification Based on Aspect Ratio:
- If a person is detected as FALLING at any point:
- Aspect ratio > 1.5 → STANDING
- Aspect ratio < 1.5 → FALLEN
- If a person is never detected as FALLING:
- Aspect ratio > 1.25 → STANDING
- Aspect ratio < 1.25 → FALLEN
3. Using Keypoints for Confirmation:
- If a person is detected as FALLEN without falling (aspect ratio < 1.25), further analyze
keypoints (shoulders, knees, waist) using YOLO-pose to confirm the fall.
- If YOLO-pose provides a different prediction from the original, the YOLO-pose prediction
is considered.
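The following sketch illustrates steps 1 and 2 using the thresholds quoted above (-5, 3, 1.5, and 1.25); the per-frame update structure and variable names are illustrative, and in practice the thresholds would be tuned per camera:

```python
def update_fall_state(prev_box, cur_box, was_falling):
    """Classify one person per frame from bounding-box changes.
    Boxes are (width, height) in pixels; returns (label, falling_flag)."""
    height_change = cur_box[1] - prev_box[1]
    width_change = cur_box[0] - prev_box[0]

    # Step 1: detect a FALLING transition (thresholds from the report, adjustable).
    if height_change < -5 and width_change > 3:
        was_falling = True          # height drops and width grows -> FALLING
    # height drops while width stays the same would indicate CROPPED, not a fall

    # Step 2: classify by aspect ratio, with a stricter cut if FALLING was seen.
    aspect_ratio = cur_box[1] / cur_box[0]
    threshold = 1.5 if was_falling else 1.25
    label = "STANDING" if aspect_ratio > threshold else "FALLEN"
    return label, was_falling

print(update_fall_state((60, 180), (90, 120), False))  # ('FALLEN', True)
```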
Fall Detection Using Keypoints
1. Extract Relevant Keypoints: Shoulder, knee, and waist coordinates are extracted.
2. Normalize Coordinates: Coordinates are normalized to account for varying zoom levels.
3. Calculate Average Coordinates:
- Average x-coordinate of shoulders.
- Average x-coordinate of knees.
- Average x-coordinate of waist.
4. Check Alignment:
- If (average x-coordinate of shoulders - average x-coordinate of waist) < threshold value
→ Aligned
- If (average x-coordinate of waist - average x-coordinate of knees) < threshold value →
Aligned
- Typical threshold value = 0.03
5. Check Shoulder Position: If the average y-coordinate of the shoulders > the average y-coordinate of
the waist → the shoulders are above the waist, which means the person has not fallen yet.
6. Standing Condition:
- If the knees and waist, or the waist and shoulders, are aligned, and the shoulders are above
the waist, then STANDING is detected; otherwise FALLEN is detected.
7. Handle Missing Keypoints: If the relevant keypoints are not detected, the result is
shown as FALLEN (a minimal sketch of this keypoint check is given below).
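A minimal sketch of this keypoint check is given below. It assumes normalized (x, y) keypoints for the shoulders, waist (hips), and knees are already available and that the image origin is at the top-left (so a smaller y means higher in the frame); the 0.03 threshold comes from the description above, the rest is illustrative:

```python
def classify_from_keypoints(shoulders, waist, knees, threshold=0.03):
    """Keypoint-based standing/fallen check on normalized coordinates.
    Each argument is a list of (x, y) pairs, e.g. [left, right]."""
    avg = lambda pts, i: sum(p[i] for p in pts) / len(pts)

    sh_x, sh_y = avg(shoulders, 0), avg(shoulders, 1)
    wa_x, wa_y = avg(waist, 0), avg(waist, 1)
    kn_x = avg(knees, 0)

    # Vertical alignment of body segments (x-coordinates close together).
    shoulders_waist_aligned = abs(sh_x - wa_x) < threshold
    waist_knees_aligned = abs(wa_x - kn_x) < threshold

    # Shoulders above waist: with a top-left image origin, y grows downward,
    # so "above" means a smaller y value (assumed convention).
    shoulders_above_waist = sh_y < wa_y

    if (shoulders_waist_aligned or waist_knees_aligned) and shoulders_above_waist:
        return "STANDING"
    return "FALLEN"

# Upright person: shoulders, hips, knees roughly on one vertical line.
print(classify_from_keypoints([(0.50, 0.30), (0.54, 0.30)],
                              [(0.51, 0.55), (0.53, 0.55)],
                              [(0.51, 0.75), (0.53, 0.75)]))  # STANDING
```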
Integrating YOLO-Pose and YOLO
Both YOLOv8x and YOLOv8x-pose-p6 can be used in the same environment. The steps to
confirm a fall are as follows:
1. Use YOLOv8x to get bounding box coordinates if a person is detected as FALLEN without
falling.
2. Provide these coordinates as input to a function implementing YOLO-pose.
3. Run YOLO-pose within these coordinates and check the fall conditions using keypoints.
Conclusion
Our fall detection system combines the strengths of YOLOv8x for object detection and
YOLOv8x-pose-p6 for pose estimation. This dual-model approach ensures a robust and
reliable fall detection system. By using aspect ratios, bounding box changes, and keypoint
analysis, we achieve high accuracy in distinguishing between standing and fallen states.
Future improvements could focus on refining the detection thresholds and enhancing the
integration of keypoint analysis for even better performance.
2.4.5 Oil Spillage Detection
In this project, a system is designed to detect oil spills using image segmentation techniques. This system is
crucial for environmental monitoring and industrial safety, enabling quick responses to oil spills to minimize
damage.
How It Works
1. Data Preparation:
o We created a dataset using Roboflow, which includes images of oil spills in
different environments.
o This dataset was annotated for segmentation tasks using Labelme. The
coordinates of points in the segmented areas were stored in JSON files to
prepare the data for model training (a small mask-conversion sketch is given after this list).
2. Model Training:
o We used the U-Net architecture, a powerful model for image segmentation, to
train on our prepared dataset.
o The training process involved feeding the model images along with their
segmentation maps, enabling the model to learn to accurately identify and
segment areas affected by oil spills.
3. Validation and Testing:
o After training, the model was tested using Roboflow to measure its
performance.
o The model achieved a precision level of 82% and a mean Average Precision
(mAP) of 96.4%, indicating high accuracy in segmenting oil spills.
4. Deployment:
o The trained segmentation model was deployed to analyze video feeds in real time.
o When the system detects an oil spill, it triggers an alert, allowing for immediate
response to mitigate environmental damage.
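As an illustration of the data-preparation step, the sketch below converts a Labelme-style JSON polygon annotation into a binary mask suitable for U-Net training. It assumes the standard Labelme fields (shapes, points, imageHeight, imageWidth); the label name and file path are illustrative:

```python
import json
import numpy as np
import cv2

def labelme_to_mask(json_path):
    """Build a binary mask (1 = oil spill) from a Labelme annotation file,
    assuming the standard 'shapes'/'points'/'imageHeight'/'imageWidth' fields."""
    with open(json_path) as f:
        ann = json.load(f)

    mask = np.zeros((ann["imageHeight"], ann["imageWidth"]), dtype=np.uint8)
    for shape in ann["shapes"]:
        if shape["label"] == "oil_spill":          # illustrative label name
            pts = np.array(shape["points"], dtype=np.int32)
            cv2.fillPoly(mask, [pts], 1)           # fill the annotated polygon region
    return mask

# mask = labelme_to_mask("annotations/frame_0001.json")  # hypothetical path
```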
By leveraging the advanced capabilities of the U-Net model for image segmentation, our oil
spillage detection system accurately identifies and segments areas affected by oil spills. The
integration with other safety monitoring solutions further enhances industrial safety,
offering comprehensive protection against both workplace injuries and environmental
hazards. Future improvements will focus on refining the detection thresholds and enhancing
the model's ability to handle diverse environments for even better performance.
2.4.6 Number-Plate Detection
In this report, we'll walk through the development of a system designed to detect and track
number plates from video footage. The system leverages advanced technologies like YOLOv7
for object detection and EasyOCR for reading text from images. This setup aims to
automatically detect vehicles, extract their number plates, and keep track of them over time,
which can be incredibly useful for traffic monitoring, law enforcement, and security
purposes.
System Overview
The system has two main components:
1. Object Detection: We used YOLOv7, a cutting-edge model known for its real-time
object detection capabilities, to identify vehicles and number plates in video frames.
2. OCR (Optical Character Recognition): EasyOCR helps extract the text from the
detected number plates, ensuring accurate reading of the characters.
How It Works
1. Processing Video Input:
o The system processes each frame of the video one by one. Each frame undergoes
both object detection and OCR procedures.
2. Detecting Vehicles:
o YOLOv7 is trained to detect various types of vehicles, such as cars, motorcycles,
buses, trucks, and trains. It draws bounding boxes around each detected vehicle.
3. Detecting Number Plates:
o Another YOLOv7 model is specifically tuned to detect number plates. This model
identifies the exact region in each frame where a number plate is present.
4. Reading Number Plates (OCR):
o EasyOCR reads the text within the detected number plate regions, converting
the image data into alphanumeric characters. We also implemented a correction
algorithm to fix common OCR errors and ensure the text matches expected
formats (a minimal OCR sketch is given after this list).
5. Tracking Vehicles:
o The system tracks vehicles across frames by assigning unique IDs based on their
location in consecutive frames. This helps maintain continuity and correctly
associate number plates with specific vehicles over time. It also allows many
frames of a single vehicle to be recorded in the CSV file, which reduces mistakes
in reading the number-plate text.
6. Recording Data:
o The system records various details like frame number, vehicle ID, bounding box
coordinates, plate text, confidence scores, and image paths in a CSV file. This
data is crucial for further analysis and reporting. From this data, the text
detected most often by the OCR is taken, to reduce the chance of a wrong entry.
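A minimal sketch of the OCR step is given below, assuming a cropped number-plate image is already available (for example as a NumPy array); the reader settings and the highest-confidence selection are illustrative choices:

```python
import easyocr

# The reader is created once and reused for every frame.
reader = easyocr.Reader(["en"], gpu=False)

def read_plate(plate_crop):
    """Return the highest-confidence text found in a cropped plate image.
    readtext() yields (bounding box, text, confidence) tuples."""
    results = reader.readtext(plate_crop)
    if not results:
        return "", 0.0
    _, text, conf = max(results, key=lambda r: r[2])
    return text, conf
```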
Improving OCR Accuracy
OCR isn't always perfect, so we included a text correction algorithm to improve accuracy:
· The algorithm removes special characters and spaces.
· It converts the entire text to uppercase.
· Indian number plates follow a specific pattern (for example, the first two characters are letters and the next two are digits, and so on); we exploited this logic to make the correction more accurate.
· It corrects common mistakes, such as replacing 'O' with '0' or 'I' with '1', to match expected number plate formats (a sketch of this routine is given below).
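A minimal sketch of such a correction routine is shown below. The 10-character pattern (two letters, two digits, two letters, four digits) is one common Indian plate format; the confusion tables and the way other lengths are handled are illustrative, not the exact algorithm used in the project:

```python
import re

# Characters the OCR commonly confuses, in each direction.
TO_DIGIT = {"O": "0", "I": "1", "Z": "2", "S": "5", "B": "8"}
TO_LETTER = {v: k for k, v in TO_DIGIT.items()}

def correct_plate(raw_text):
    """Clean OCR output and coerce it towards the Indian plate pattern
    LL DD LL DDDD, e.g. 'GJ05AB1234' (pattern handling is illustrative)."""
    text = re.sub(r"[^A-Za-z0-9]", "", raw_text).upper()   # strip symbols and spaces
    if len(text) != 10:
        return text                     # only the most common 10-character form is fixed here
    fixed = []
    letter_positions = {0, 1, 4, 5}     # GJ 05 AB 1234 -> letters at these indices
    for i, ch in enumerate(text):
        if i in letter_positions:
            fixed.append(TO_LETTER.get(ch, ch))   # digit seen where a letter belongs
        else:
            fixed.append(TO_DIGIT.get(ch, ch))    # letter seen where a digit belongs
    return "".join(fixed)

print(correct_plate("GJ O5 AB I234"))   # -> GJ05AB1234
```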
2.4.7 Posture Detection
In this report, we'll walk through the development of a system designed to detect and
classify human postures using the YOLOv8-pose model. This system identifies 17 specific
keypoints on the human body and classifies postures such as standing, sitting, squatting, and
leaning. It can also detect if someone remains idle while sitting for too long. This setup is
crucial for applications in health monitoring, workplace safety, and activity recognition.
System Overview
The system has two main components:
1. Keypoint Detection: Using the YOLOv8-pose model, we identify 17 keypoints on the
human body, which correspond to specific anatomical locations.
2. Posture Classification: We classify the detected keypoints into different postures and
identify an idle state if someone sits for a prolonged period.
How It Works
1. Training the Model:
- The YOLOv8-pose model learns to identify keypoints by analyzing labeled data. This data
includes 2D coordinates of 17 keypoints for each person in every frame, stored in a CSV file.
This allows the model to understand the spatial relationships and positions of these points.
2. Processing Video Input:
- The system processes each frame of a video one by one. Each frame is analyzed to detect
key-points, which are then used to determine the posture.
3. Key-point Detection:
- The model identifies 17 keypoints, such as the nose, eyes, shoulders, elbows, wrists, hips,
knees, and ankles. These points are superimposed on the video to visualize the detection.
4. Posture Classification:
- The system classifies postures based on the keypoints:
- Standing: Shoulders are above the waist and knees.
- Sitting: Hips are lower with knees bent at a right angle.
- Squatting: Hips are even lower, with knees bent significantly.
- Leaning: The body tilts, indicating a leaning posture.
5. Idle Detection:
- If someone sits for too long, the system labels them as 'idle' in addition to 'sitting'. This is
done by setting a time threshold for the sitting posture.
6. Output Video Generation:
- The system creates an output video showing the detected keypoints and the classified
posture for each frame. This helps in understanding and analyzing posture changes over
time (a minimal keypoint-extraction sketch is given below).
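A minimal sketch of the keypoint-extraction step with the Ultralytics YOLOv8-pose API is given below; the simple standing/sitting rule at the end only illustrates the kind of criteria described above and is not the exact classification logic used:

```python
from ultralytics import YOLO

model = YOLO("yolov8x-pose-p6.pt")   # pretrained pose model (17 COCO keypoints)

def classify_posture(frame):
    """Return a rough posture label for the first person detected in a frame."""
    result = model(frame)[0]
    if result.keypoints is None or result.keypoints.xyn.shape[0] == 0:
        return "no person"
    kpts = result.keypoints.xyn[0]   # normalized (x, y) for 17 keypoints

    # COCO ordering: 5/6 shoulders, 11/12 hips, 13/14 knees (y grows downward).
    shoulder_y = float(kpts[5][1] + kpts[6][1]) / 2
    hip_y = float(kpts[11][1] + kpts[12][1]) / 2
    knee_y = float(kpts[13][1] + kpts[14][1]) / 2

    # Illustrative rule: when standing, the hips sit well above the knees;
    # when sitting, the hips drop towards knee level.
    if (knee_y - hip_y) > 0.5 * (hip_y - shoulder_y):
        return "standing"
    return "sitting"

# label = classify_posture("frames/frame_0001.jpg")   # hypothetical input path
```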
Enhancing Accuracy and Performance
- Data Preparation: We organized and pre-processed the training data to fit the YOLOv8
framework, ensuring high accuracy in keypoint detection.
- Posture Classification Logic: The system uses specific criteria based on keypoint positions
and relationships to classify postures accurately.
- Idle Detection Mechanism: By integrating a time threshold for sitting, the system
effectively identifies prolonged idleness, providing useful insights.
This system's ability to detect and classify human postures in real-time video footage is
invaluable for various applications, including monitoring the elderly, ensuring workplace
safety, and analyzing human activities. Future improvements will focus on refining detection
thresholds and enhancing keypoint analysis for even more accurate and reliable
performance.
3. Conclusion
In this project, we've created advanced systems using the latest AI models to tackle several
important challenges like detecting safety gear, monitoring postures, identifying oil spills,
reading number plates, counting crowds, and detecting falls. Each system is built on robust
training with carefully curated datasets, ensuring they deliver accurate results where it
matters most.
For example, our safety gear detection system powered by YOLOv7 excels at spotting
harnesses, helmets, and jumpsuits, crucial for workplace safety. Meanwhile, our posture
detection system with YOLOv8-pose accurately recognizes human postures, including
detecting when someone sits for too long—a key feature for health monitoring and
ergonomics.
Our oil spill detection system, using U-Net, swiftly identifies and maps affected areas, aiding
in quick responses to environmental crises. Additionally, our number plate OCR system
combines YOLOv7 and Easy-OCR to track vehicles and read number plates, enhancing
security and traffic management.
These projects highlight how AI can make a real difference in practical scenarios. Looking
ahead, we're dedicated to refining these systems further, exploring new techniques, and
integrating the latest advancements to ensure our solutions remain at the forefront of
technology. By continually improving and innovating, we aim to provide tools that not only
meet current challenges effectively but also pave the way for future breakthroughs in AI
applications for safety, environmental protection, and more.
4. References
1. https://www.ibm.com/topics/convolutional-neural-networks
2. https://www.datacamp.com/blog/yolo-object-detection-explained
3. M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert. Density-aware person detection
and tracking in crowds. In ICCV, pages 2423–2430. IEEE, 2011.
4. K. Tota and H. Idrees. Counting in dense crowds using deep features.
5. C. S. Regazzoni and A. Tesei. Distributed data fusion for real-time crowding
estimation. Signal Processing, 53(1):47–63, 1996.
6. K. Chen, S. Gong, T. Xiang, and C. C. Loy. Cumulative attribute space for age and
crowd density estimation. In CVPR, pages 2467–2474. IEEE, 2013.
7. B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single
image by Bayesian combination of edgelet part detectors. In ICCV, volume 1, pages 90–
97. IEEE, 2005.
8. T. Zhao, R. Nevatia, and B. Wu. Segmentation and tracking of multiple humans in
crowded environments. Pattern Analysis and Machine Intelligence, 30(7):1198–1211,
2008.
9. Real-Time Multi-Person 2D Pose Estimation using Part Affinity Fields by Zhe Cao et al.
10. U-Net: Convolutional Networks for Biomedical Image Segmentation by Olaf
Ronneberger
11. https://github.com/JaidedAI/EasyOCR
12. A Survey of Fall Detection Techniques: Classification and Comparison by Amer A. Ali
13. Deep Learning for Semantic Segmentation in Environmental Monitoring: A Survey by
Juan D. Bolanos
14. Real-time Video Analysis and Object Tracking by Liang Wang
5. Glossary
Bounding Box: A rectangular border around an object in an image, used to define the
position and extent of the object for detection and annotation purposes.
CNN (Convolutional Neural Network): A type of deep learning model particularly effective
for image recognition and classification tasks, capable of automatically and adaptively
learning spatial hierarchies of features from input images.
Data Augmentation: Techniques used to increase the diversity of training data without
actually collecting new data, often by applying random transformations such as rotations,
translations, and flips to the existing dataset.
Dataset: A collection of data, typically organized in a structured format, used for training,
validating, and testing machine learning models.
Density Map: A representation of the spatial distribution of objects within an image, used
particularly in crowd counting to estimate the number of people in different regions of the
image.
Fine-Tuning: The process of taking a pretrained model and further training it on a new,
often smaller, dataset specific to a particular task to improve its performance for that task.
Hyperparameters: Configurable parameters external to the model that influence the
training process, such as learning rate, batch size, and number of epochs.
Inference: The process of using a trained model to make predictions on new, unseen data.
Iterative Training: A training process that involves continuously adding new data and retraining the model to improve its accuracy and robustness over time.
Mean Average Precision (mAP): A metric used to evaluate the accuracy of object detection
models, averaging the precision over a range of recall values.
Model Architecture: The structure and organization of layers within a neural network
model, defining how the model processes input data to produce output predictions.
Pretrained Model: A model that has been previously trained on a large dataset and can be
fine-tuned for a specific task, reducing the time and computational resources required for
training.
Real-Time Monitoring: The capability of a system to process and analyze data instantly as
it is being collected, allowing for immediate insights and actions.
Regularization: Techniques used during training to prevent overfitting by penalizing
complex models, such as L1 (Lasso) regularization which encourages sparsity in the model
parameters.
YOLO (You Only Look Once): A family of real-time object detection models known for their
speed and accuracy, processing entire images at once and predicting bounding boxes and
class probabilities simultaneously.
YOLOv7: An advanced version of the YOLO model family, optimized for real-time object
detection with improved accuracy and efficiency.
YAML: A human-readable data serialization standard used to define configurations for
datasets and other settings in machine learning projects.
Active Learning: A machine learning approach where the model selectively queries a
human annotator to label new data points with uncertain predictions, improving model
performance iteratively.
Annotated Dataset: A dataset where each image or piece of data is labeled with relevant
information, such as bounding boxes and class labels, necessary for training machine
learning models.
Annotation Standardization: Ensuring consistency in labeling across different data sources
to maintain uniformity and accuracy in the dataset used for training.
Automated Machine Learning (AutoML): Techniques and tools that automate the process
of model selection, hyperparameter tuning, and feature engineering to improve machine
learning workflows without extensive manual intervention.
Capsule Networks: A type of neural network designed to better capture spatial hierarchies
and relationships between parts of an object, addressing some limitations of traditional
CNNs.
Class Label: An identifier assigned to an object in an image, indicating its category, used
in object detection and classification tasks.
Convolutional Layer: A layer in a convolutional neural network that applies convolution
operations to the input, extracting features such as edges, textures, and patterns from
images.
Data Cleaning: The process of removing duplicates, corrupted files, and irrelevant data to
ensure the quality and consistency of the dataset.
Data Integration: Combining data from multiple sources into a single, coherent dataset,
ensuring consistency in format and quality.
Deep Learning: A subset of machine learning involving neural networks with many layers,
capable of learning complex patterns and representations from large amounts of data.
Empirical Research: The process of collecting and analyzing data through direct and
indirect observation or experience, forming the basis for model training and evaluation.
Evaluation Metrics: Quantitative measures such as precision, recall, and mean average
precision (mAP) used to assess the performance of a machine learning model.
Fully-Connected Layer: A layer in a neural network where each neuron is connected to
every neuron in the previous layer, typically used in the final stages of a CNN to perform
classification.
Generative Adversarial Networks (GANs): A class of deep learning models where two
networks (generator and discriminator) are trained simultaneously, with the generator
creating realistic data and the discriminator distinguishing between real and generated
data.
Learning Rate: A hyperparameter that controls how much the model's weights are
adjusted during training with respect to the loss gradient.
Model Initialization: The process of setting up a model's architecture and initial weights
before training begins.
Neural Network: A series of algorithms that attempt to recognize underlying relationships
in a set of data through a process that mimics the way the human brain operates.
Object Detection: The task of identifying and localizing objects within an image, crucial
for applications like surveillance, robotics, and augmented reality.
Pooling Layer: A layer in a convolutional neural network that performs down-sampling,
reducing the spatial dimensions of the input and helping to make the representation more
manageable.
Supervised Learning: A type of machine learning where the model is trained on a labeled
dataset, learning to map inputs to the correct outputs.
Transfer Learning: A technique in machine learning where a pretrained model is used as
the starting point for a new, related task, improving performance and reducing training
time.
Unsupervised Learning: A type of machine learning where the model is trained on an
unlabeled dataset, discovering patterns and relationships within the data without
predefined labels.
Aspect Ratio: Ratio of the height to the width of a bounding box around an object or
person. Used to infer standing or fallen states.
OCR (Optical Character Recognition): Technology used to recognize text within images.
EasyOCR is mentioned for reading number plates.
Posture Classification: Classification of human body positions (like standing, sitting,
squatting) based on detected keypoints.
Threshold Values: Predefined values used as limits or conditions for making decisions
in algorithms, such as for detecting changes in bounding box dimensions or keypoint
alignments.
Environmental Monitoring: Surveillance and analysis of environmental conditions to
ensure compliance with safety and regulatory standards, such as detecting and
responding to oil spills.