Appendix-A

A REPORT ON
Object Detection using Machine-Learning Models and Developing Use-cases with the Help of These Models

BY
Neel Patel    2022AAPS0624G

AT
L&T Energy Hydrocarbons, Vadodara
A Practice School-I Station of
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
(July, 2024)

Appendix-B

A REPORT ON
Object Detection using Machine-Learning Models and Developing Use-cases with the Help of These Models

BY
Neel Patel    2022AAPS0624G    Electronics and Communication

Prepared in partial fulfillment of the Practice School-I Course Nos. BITS C221/BITS C231/BITS C241

AT
L&T Energy Hydrocarbons, Vadodara
A Practice School-I Station of
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
(July, 2024)

ACKNOWLEDGEMENT

I would like to express my sincere gratitude to Birla Institute of Technology and Science, Pilani for providing me with the opportunity to undertake this summer-term Practice School-I at Larsen & Toubro Energy Hydrocarbons, Vadodara. This experience has been invaluable in enhancing my practical knowledge and understanding of video analytics.

I am deeply thankful to Prof. V. Ramgopal Rao, Vice Chancellor of Birla Institute of Technology and Science, Pilani, and Prof. Sudhir Kumar Birai, Director of the Institute, for their leadership and commitment to providing students with such enriching experiences.

My heartfelt appreciation goes to Dr. Raghuram Ammavajjala, my professor in charge, for his guidance, support, and valuable insights throughout this training period. His expertise and mentorship have been instrumental in shaping my understanding of the subject matter.

I would also like to extend my gratitude to the management and staff of L&T Energy Hydrocarbons for their cooperation and for providing me with the opportunity to learn about video analytics in a real-world industrial setting.

This experience has significantly contributed to my academic and professional growth, and I am grateful to all those who have made it possible.

Appendix-C Abstract Sheet

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI (RAJASTHAN)
Practice School Division

Station: L&T Energy Hydrocarbons        Centre: Vadodara
Duration: 8 weeks                       Date of Start: 28/05/2024
Date of Submission: 19/07/2024

Title of the Project: Object Detection using Machine-Learning Models and Developing Use-cases with the Help of These Models

ID No./Name/Discipline of the student: 2022AAPS0624G, Neel Patel, Electronics and Communication

Name and designation of the expert: Vivek Chaudhary, Assistant Manager, IT & Systems

Name of the PS Faculty: Ammavajjala Sesha Sai Raghuram

Key Words: Deep Learning, CNN, Computer Vision, Object Detection

Project Areas: Artificial Intelligence / Machine Learning

Abstract: Training object-detection models on different classes and increasing their accuracy through hyperparameter tuning.

Signature(s) of Student(s)              Signature of PS Faculty
Date: 19/07/2024                        Date:

TABLE OF CONTENTS

1. Introduction
   1.1 Background
   1.2 Objective
   1.3 Scope
   1.4 Methodology
   1.5 Significance
2. Project Overview
   2.1 Convolutional Neural Networks (CNN)
       2.1.1 Working of CNN
       2.1.2 Applications
       2.1.3 Advancements
   2.2 Data Collection & Annotation
       2.2.1 Data Integration and Pre-processing
       2.2.2 Creating the YAML Dataset
   2.3 YOLO (You Only Look Once)
   2.4 Use-cases Developed Using YOLO
       2.4.1 Safety Harness
       2.4.2 Helmet and Jumpsuit Detection
       2.4.3 Crowd Detection
       2.4.4 Fall Detection
       2.4.5 Oil Spillage Detection
       2.4.6 Number-Plate Detection
       2.4.7 Posture Detection
3. Conclusion
4. References
5. Glossary

1. Introduction

1.1 Background

AI and ML have their roots in the broader field of computer science and data analytics. AI refers to the simulation of human intelligence in machines that are programmed to think and learn. ML, a subset of AI, involves the development of algorithms that allow computers to learn from data and make decisions based on it. These technologies have seen exponential growth in recent years, driven by advancements in computational power, the proliferation of big data, and innovative algorithms.

1.2 Objective

The primary objective of this project is to explore the application of AI and ML to industrial video analytics, highlighting how these technologies can solve complex problems, enhance operational efficiency, and drive innovation. This report aims to:

1. Provide a comprehensive overview of AI and ML concepts and methodologies.
2. Analyze the current state of AI and ML applications in the chosen field.
3. Present case studies showcasing successful implementations.
4. Identify the challenges and limitations associated with AI and ML adoption.
5. Offer insights and recommendations for future research and development.

1.3 Scope

This report encompasses a detailed study of AI and ML technologies, including supervised and unsupervised learning, neural networks, and deep learning. It examines the impact of these technologies on various processes, from predictive analytics and decision-making to automation and customer engagement. The scope also includes an evaluation of ethical considerations, data privacy issues, and the regulatory landscape affecting AI and ML deployment.

1.4 Methodology

The research methodology for this project involves a combination of literature review, case study analysis, and empirical research. Data is gathered from academic journals, industry reports, expert interviews, and real-world examples. The analysis covers both qualitative and quantitative aspects, providing a holistic view of AI and ML's role in modern industry. Training data for the various models is collected from sources such as Roboflow and Cornell University datasets.

1.5 Significance

Understanding AI and ML is crucial for businesses and researchers aiming to stay competitive in an increasingly data-driven world. By leveraging these technologies, organizations can gain a strategic advantage, improve decision-making processes, and foster innovation. This report serves as a resource for stakeholders seeking to harness the potential of AI and ML to drive growth and efficiency.

In conclusion, AI and ML represent the forefront of technological advancement, promising to reshape industries and society. This report aims to shed light on their transformative potential, offering a roadmap for successful implementation and integration in various sectors.

2. Project Overview

2.1 Convolutional Neural Networks (CNN)

Neural networks are a subset of machine learning, and they are at the heart of deep learning algorithms.
They are composed of node layers: an input layer, one or more hidden layers, and an output layer. Each node connects to others and has an associated weight and threshold. If the output of an individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network; otherwise, no data is passed along. Feed-forward networks are the most common type, but there are various kinds of neural networks used for different use cases and data types. For example, recurrent neural networks are commonly used for natural language processing and speech recognition, whereas convolutional neural networks (ConvNets or CNNs) are more often used for classification and computer vision tasks. Prior to CNNs, manual, time-consuming feature extraction methods were used to identify objects in images. Convolutional neural networks now provide a more scalable approach to image classification and object recognition, leveraging principles from linear algebra, specifically matrix multiplication, to identify patterns within an image.

2.1.1 Working of CNN

Convolutional neural networks are distinguished from other neural networks by their superior performance on image, speech, and audio-signal inputs. They have three main types of layers:

- Convolutional layer
- Pooling layer
- Fully-connected (FC) layer

The convolutional layer is the first layer of a convolutional network. While convolutional layers can be followed by additional convolutional layers or pooling layers, the fully-connected layer is the final layer. With each layer, the CNN increases in complexity, identifying greater portions of the image. Earlier layers focus on simple features, such as colors and edges. As the image data progresses through the layers of the CNN, the network starts to recognize larger elements or shapes of the object until it finally identifies the intended object.

2.1.2 Applications

CNNs have revolutionized a multitude of fields with their exceptional ability to analyze visual data. Key applications explored in the project include:

- Image Classification: Categorizing images into predefined classes, with applications ranging from medical diagnostics to autonomous vehicles.
- Object Detection: Identifying and localizing objects within images, crucial for surveillance, robotics, and augmented reality.
- Semantic Segmentation: Assigning a class label to each pixel in an image, used in medical imaging and autonomous driving for understanding environments at a granular level.
- Face Recognition: Enhancing security systems and enabling personalized user experiences.
- Visual Search Engines: Powering advanced search functionalities in e-commerce and digital libraries.

2.1.3 Advancements

The project highlighted several recent advancements in CNN technology, including:

- Transfer Learning: Utilizing pre-trained models on large datasets to improve performance on specific tasks with limited data (a brief illustrative sketch follows this list).
- Generative Adversarial Networks (GANs): Combining CNNs with adversarial training to generate realistic images.
- Capsule Networks: Addressing the limitations of traditional CNNs in understanding spatial hierarchies.
- Automated Machine Learning (AutoML): Leveraging automated techniques to design and optimize CNN architectures without extensive manual intervention.
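To make the transfer-learning idea above concrete, the following is a minimal sketch, assuming PyTorch and a recent torchvision are available; the backbone, class count, and training loop are illustrative rather than the project's actual code.

```python
# Minimal transfer-learning sketch (illustrative): a ResNet-18 pretrained on
# ImageNet is reused, its feature extractor is frozen, and only a new
# classification head is trained for a small custom dataset.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # e.g. "no harness", "harness without hook", "harness with hook"

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor so only the new head is updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for our classes.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.randn(8, 3, 224, 224)          # batch of 8 RGB images
labels = torch.randint(0, NUM_CLASSES, (8,))  # random labels for the sketch
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

Freezing the backbone keeps training cheap when labeled data is limited; unfreezing deeper layers is a common next step once the new head has converged.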
Future research directions identified include improving computational efficiency, enhancing model interpretability, and addressing ethical concerns related to bias and privacy.

2.2 Data Collection & Annotation

To effectively train and evaluate our Convolutional Neural Network (CNN) models, we collected a diverse dataset from multiple sources. This included both live data from cameras pre-installed at various sites in India and publicly available datasets from platforms like Roboflow.

Live Data Collection from Pre-installed Cameras

Our primary data collection involved leveraging an extensive network of pre-installed cameras across different sites in India. These cameras were strategically placed in urban, suburban, and rural areas to capture a wide range of visual scenarios. The collected data included:

- Traffic Surveillance: Images and videos capturing traffic conditions, vehicle types, and pedestrian movements.
- Public Spaces: Footage from parks, markets, and streets, providing diverse scenes of human activity.
- Industrial Sites: Visual data from manufacturing plants and construction sites, showcasing various machinery and work processes.

The data collection process was automated using scripts that periodically retrieved footage from these cameras, ensuring a consistent and comprehensive dataset. To address privacy concerns and adhere to ethical standards, all collected data was anonymized, with faces and other personally identifiable information blurred.

Sourcing Data from Online Platforms

To augment our dataset and ensure a rich variety of images for training our CNN models, we sourced additional data from online platforms, particularly Roboflow. Roboflow is known for its extensive collection of labeled image datasets, which are crucial for training robust AI models. The data sourced from Roboflow included:

- Annotated Datasets: High-quality images with detailed annotations for object detection, classification, and segmentation tasks, prepared using LabelImg and LabelMe.
- Diverse Categories: A wide range of categories, from common objects like vehicles and animals to specific items relevant to our applications.
- Pre-processed Data: Images that have been pre-processed and standardized, facilitating seamless integration into our training pipeline.

2.2.1 Data Integration and Pre-processing

Combining data from live cameras and online platforms required meticulous pre-processing to ensure consistency and quality. The pre-processing steps included:

- Data Cleaning: Removing duplicates, corrupted files, and irrelevant images to maintain a high-quality dataset.
- Annotation Standardization: Ensuring that annotations from different sources followed a consistent format, crucial for effective training.
- Normalization: Scaling pixel values and standardizing image sizes to meet the input requirements of our CNN models.
- Data Augmentation: Applying techniques like rotation, flipping, and color adjustment to artificially expand the dataset, improving model generalization.

2.2.2 Creating the YAML Dataset

To effectively train our YOLO (You Only Look Once) models for object detection tasks, the collected data must be organized in a structured format that YOLO can read. A YAML file defines the dataset configuration, specifying the paths to the training, validation, and testing data as well as the classes of objects to be detected. This section outlines the process of creating such a dataset from our collected data; a minimal example configuration is sketched below.
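As an illustration, a dataset configuration along these lines might look like the following sketch, assuming the data.yaml layout used by YOLOv5/YOLOv7-style training scripts; the paths and class names are placeholders, shown here for the harness use-case.

```yaml
# data.yaml - illustrative YOLO dataset configuration (paths and names are placeholders)
train: images/train
val: images/val
test: images/test

nc: 3                      # number of classes
names:
  - no_harness
  - harness_without_hook
  - harness_with_hook
```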
Preparing the dataset involves separating the images and their corresponding annotations into training, validation, and test sets. The directory structure typically looks like this:

- images/train/: Contains the training images.
- images/val/: Contains the validation images.
- images/test/: Contains the test images.
- labels/train/: Contains the annotation files (in YOLO format) corresponding to the training images.
- labels/val/: Contains the annotation files corresponding to the validation images.
- labels/test/: Contains the annotation files corresponding to the test images.

The annotation files are in YOLO format: each annotation file corresponds to one image and contains one line per object, describing its bounding box. Each line follows the format "class_id x_center y_center width height", where:

- class_id: The integer identifier for the class of the object.
- x_center: The normalized x-coordinate of the center of the bounding box.
- y_center: The normalized y-coordinate of the center of the bounding box.
- width: The normalized width of the bounding box.
- height: The normalized height of the bounding box.

2.3 YOLO (You Only Look Once)

You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in their research paper "You Only Look Once: Unified, Real-Time Object Detection". The authors frame object detection as a regression problem rather than a classification task: a single convolutional neural network (CNN) predicts spatially separated bounding boxes and the class probabilities associated with each of them. Some of the reasons why YOLO leads the competition include:

1. Speed
YOLO is extremely fast because it does not rely on a complex pipeline. It can process images at 45 frames per second (FPS), and it reaches more than twice the mean Average Precision (mAP) of other real-time systems, which makes it a great candidate for real-time processing. Published benchmarks place YOLO well ahead of other real-time object detectors at 91 FPS.

2. High detection accuracy
YOLO is far ahead of other state-of-the-art models in accuracy, with very few background errors.

3. Better generalization
This is especially true for the newer versions of YOLO. With those advancements, YOLO provides better generalization to new domains, which makes it well suited to applications relying on fast and robust object detection. For instance, the paper "Automatic Detection of Melanoma with YOLO Deep Convolutional Neural Networks" shows that the first version, YOLOv1, has the lowest mean average precision for the automatic detection of melanoma, compared to YOLOv2 and YOLOv3.

4. Open source
Making YOLO open-source led the community to constantly improve the model. This is one of the reasons why YOLO has improved so much in such a limited time.

2.4 Use-cases Developed Using the YOLOv7 Model

2.4.1 Safety Harness

To address the specific task of detecting different classes related to harnesses, namely "no harness", "harness without hook", and "harness with hook", we utilized the YOLOv7 model. This section details the process of training the YOLOv7 model on these classes, from dataset preparation to model evaluation. YOLOv7 is an evolution of the YOLO (You Only Look Once) family of models, optimized for real-time object detection.
The configuration involves setting up parameters such as the model architecture, training hyperparameters, and dataset configuration in YAML files. Training the YOLOv7 model involved executing scripts with configurations specific to our dataset and model architecture. Here is a basic outline of the training procedure:

1. Model Initialization: Initialize the YOLOv7 model architecture using its configuration file (yolov7.yaml).
2. Data Loading: Load the dataset and annotations using the data.yaml configuration file.
3. Training Setup: Define training parameters such as batch size, learning rate, and number of epochs.
4. Model Training: Run the training script (train.py), which iteratively optimizes the model weights based on the labeled dataset. During training, the model learns to identify and classify harnesses into the specified categories.

Evaluation and Fine-tuning

After training, the model's performance was evaluated on the validation set to assess metrics like precision, recall, and mean average precision (mAP). Fine-tuning involved adjusting parameters and augmenting data to improve detection accuracy and robustness.

Results

- Accuracy: 65%
- Epochs: 300
- Learning rate: 0.01
- Activation: ReLU
- Regularization: None

2.4.2 Helmet and Jumpsuit Detection

In addition to harness detection, the YOLOv7 model was employed to detect safety helmets and jumpsuits, essential for ensuring workplace safety compliance. The dataset for this task was meticulously curated to include images of individuals wearing helmets and jumpsuits, as well as instances where these safety gears were absent. The annotations were prepared in YOLO format, specifying the bounding boxes and class labels for "no helmet", "no jumpsuit", and "person". By training YOLOv7 on this annotated dataset, the model learned to accurately identify and differentiate between the presence and absence of helmets and jumpsuits in various environments.

During the training process, we fine-tuned the model parameters and employed data augmentation techniques to enhance its robustness and accuracy. The trained YOLOv7 model demonstrated high precision and recall in detecting helmets and jumpsuits, making it a reliable tool for real-time monitoring and enforcement of safety protocols in industrial and construction settings. This capability not only aids in compliance checks but also significantly contributes to reducing workplace accidents by ensuring that all personnel are appropriately equipped with the necessary safety gear.

Automated Annotation Using the Pretrained Model

To expedite the annotation process for our dataset, the pretrained and fine-tuned YOLOv7 model was also used for automatic annotation. This involved running the model on a large set of unannotated images to generate initial bounding boxes and labels for helmets and jumpsuits. The steps, sketched in code after this list, included:

1. Inference: Running inference on the unannotated images using the fine-tuned YOLOv7 model to detect and classify helmets and jumpsuits.
2. Annotation Generation: The model outputs were saved in YOLO format, creating automatic annotations that were then reviewed and, if necessary, adjusted by human annotators for accuracy.
3. Dataset Expansion: The automatically annotated images were added to the training dataset, further enriching it and enabling iterative training cycles to continuously improve the model's performance.
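The sketch below illustrates the annotation-generation step: a trained detector is run over unlabeled images and its predictions are written out as YOLO-format label files. The inference call, run_detector, is a hypothetical stand-in for whatever routine wraps the fine-tuned YOLOv7 model; the rest of the sketch is generic.

```python
# Sketch of the automatic-annotation idea: run a trained detector over
# unlabelled images and write its predictions out as YOLO-format label files.
# `run_detector` is a stand-in for the project's inference routine; it is
# assumed to return, per image, a list of (class_id, x1, y1, x2, y2) boxes
# in pixel coordinates.
from pathlib import Path
from PIL import Image

def boxes_to_yolo_lines(boxes, img_w, img_h):
    """Convert pixel-space boxes to normalized YOLO 'class cx cy w h' lines."""
    lines = []
    for class_id, x1, y1, x2, y2 in boxes:
        cx = ((x1 + x2) / 2) / img_w
        cy = ((y1 + y2) / 2) / img_h
        w = (x2 - x1) / img_w
        h = (y2 - y1) / img_h
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines

def auto_annotate(image_dir, label_dir, run_detector):
    """Write one YOLO-format .txt label file per image in image_dir."""
    label_dir = Path(label_dir)
    label_dir.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(image_dir).glob("*.jpg")):
        img_w, img_h = Image.open(img_path).size
        boxes = run_detector(img_path)          # hypothetical inference call
        lines = boxes_to_yolo_lines(boxes, img_w, img_h)
        (label_dir / f"{img_path.stem}.txt").write_text("\n".join(lines))
```

The generated .txt files follow the same format described in Section 2.2.2, so they can be reviewed by annotators and merged directly into the training set.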
Results

- Accuracy: 72%
- Epochs: 200
- Learning rate: 0.001
- Activation: ReLU
- Regularization: L1 (Lasso)

2.4.3 Crowd Detection

Density-map-based crowd counting

To estimate the number of people in a given image via convolutional neural networks (CNNs), there are two natural configurations. One is a network whose input is the image and whose output is the estimated head count. The other is to output a density map of the crowd (say, how many people per square meter) and then obtain the head count by integration. We favor the second choice for the following reasons:

1. A density map preserves more information. Compared to the total count of the crowd, the density map gives the spatial distribution of the crowd in the given image, and such distribution information is useful in many applications. For example, if the density in a small region is much higher than in other regions, it may indicate that something abnormal is happening there.
2. In learning the density map via a CNN, the learned filters are more adapted to heads of different sizes, and hence more suitable for arbitrary inputs whose perspective effect varies significantly. The filters are thus more semantically meaningful, which improves the accuracy of crowd counting.

Density map via geometry-adaptive kernels

Since the CNN needs to be trained to estimate the crowd density map from an input image, the quality of the density maps given in the training data largely determines the performance of the method. We first describe how to convert an image with labeled heads into a map of crowd density. If there is a head at pixel xi, we represent it as a delta function δ(x − xi), so an image with N labeled heads can be represented as H(x) = Σ δ(x − xi), summing over all N heads. To convert this into a continuous density function, we convolve it with a Gaussian kernel Gσ, so that the density is F(x) = H(x) ∗ Gσ(x). However, such a density function assumes that the xi are independent samples in the image plane, which is not the case here: each xi is in fact a sample of the crowd density on the ground in the 3D scene, and because of perspective distortion, the pixels associated with different samples xi correspond to areas of different sizes in the scene. Therefore, the spread parameter σ should be determined by the size of the head of each person within the image. In practice, however, it is almost impossible to accurately estimate head sizes because of occlusion.

For this reason, the approach we applied differs somewhat from density mapping, which is both computationally heavy and cumbersome to set up. The model detects four or more people in an image, calculates the average area of each individual person's bounding box, and includes in the crowd only those people whose bounding boxes fall within six times that average area. It works on the principle of relative addressing: the model detects a person, creates a bounding box around them, calculates the area of that box, and checks whether any other bounding boxes lie within the 6x neighborhood of that box. If the number of such bounding boxes exceeds four, a net bounding box is created enclosing all the bounding boxes within that 6x area and marked as CROWD. The system also gives the COUNT of the number of people inside the crowd bounding box.
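A simplified sketch of this grouping rule follows. The 6x factor and the minimum of four people mirror the description above, while the use of a center-distance neighborhood derived from the average box area is an assumption of the sketch and may differ from the deployed logic.

```python
# Simplified sketch of the crowd-grouping rule: person detections are grouped
# by proximity relative to the average box size, and any group of more than
# four is enclosed in a single "CROWD" box with a COUNT.
# Boxes are (x1, y1, x2, y2) in pixels.
import math

def group_crowds(person_boxes, area_factor=6.0, min_people=4):
    if len(person_boxes) < min_people:
        return []
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in person_boxes]
    avg_area = sum(areas) / len(areas)
    radius = math.sqrt(area_factor * avg_area)   # neighborhood scale from avg area
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in person_boxes]

    crowds = []
    used = set()
    for i, (cx, cy) in enumerate(centers):
        if i in used:
            continue
        group = [j for j, (nx, ny) in enumerate(centers)
                 if math.hypot(nx - cx, ny - cy) <= radius]
        if len(group) >= min_people:
            xs1, ys1, xs2, ys2 = zip(*[person_boxes[j] for j in group])
            crowds.append({
                "box": (min(xs1), min(ys1), max(xs2), max(ys2)),  # net CROWD box
                "count": len(group),                              # COUNT label
            })
            used.update(group)
    return crowds
```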
2.4.4 Fall Detection

Introduction

This section details the development of a fall detection system using advanced computer vision techniques. The primary objective of the system is to identify and classify falls in real-time video footage, which is crucial for elderly care, workplace safety, and various monitoring applications. The system integrates two models, YOLOv8x and YOLOv8x-pose-p6, to detect falls accurately.

Models Used

1. YOLOv8x: A state-of-the-art object detection model used for identifying the presence of a person and detecting falls.
2. YOLOv8x-pose-p6: An advanced pose estimation model used for analyzing a person's keypoints to confirm fall incidents.

Key Formulas

- Aspect ratio = height of the bounding box / width of the bounding box
- Normalized x-coordinate = x-coordinate / width of frame
- Normalized y-coordinate = y-coordinate / height of frame
- Height change = height of bounding box in current frame - height of bounding box in previous frame
- Width change = width of bounding box in current frame - width of bounding box in previous frame

Theory

The current version of our fall detection code classifies a person into two primary classes, STANDING and FALLEN. It also uses two intermediate classes, CROPPED and FALLING, to enhance prediction accuracy.

STANDING vs. FALLEN: A person who is standing typically has a higher aspect ratio than someone who has fallen. However, relying solely on the aspect ratio can be misleading due to variations in camera zoom levels. To address this, additional parameters such as changes in bounding-box dimensions and keypoint movements from the YOLO-pose model are used. These rules are summarized in the code sketch at the end of this subsection.

Solution Implementation

1. Identifying FALLING:
   - Conditions to identify that a person is falling:
     - Height change < -5 (adjustable)
     - Width change > 3 (adjustable)
   - Differentiating between FALLING and CROPPED:
     - Height decreases and width does not change → CROPPED
     - Height decreases and width increases → FALLING
2. Classification based on aspect ratio:
   - If a person is detected as FALLING at any point:
     - Aspect ratio > 1.5 → STANDING
     - Aspect ratio < 1.5 → FALLEN
   - If a person is never detected as FALLING:
     - Aspect ratio > 1.25 → STANDING
     - Aspect ratio < 1.25 → FALLEN
3. Using keypoints for confirmation:
   - If a person is detected as FALLEN without falling (aspect ratio < 1.25), the keypoints (shoulders, knees, waist) are further analyzed using YOLO-pose to confirm the fall.
   - If YOLO-pose gives a prediction different from the original one, the YOLO-pose prediction is used.

Fall Detection Using Keypoints

1. Extract relevant keypoints: shoulder, knee, and waist coordinates are extracted.
2. Normalize coordinates: coordinates are normalized to account for varying zoom levels.
3. Calculate average coordinates: the average x-coordinates of the shoulders, the knees, and the waist.
4. Check alignment:
   - If (average x-coordinate of shoulders - average x-coordinate of waist) < threshold value → aligned.
   - If (average x-coordinate of waist - average x-coordinate of knees) < threshold value → aligned.
   - A typical threshold value is 0.03.
5. Check shoulder position: if the average y-coordinate of the shoulders is greater than the average y-coordinate of the waist, the shoulders are taken to be above the waist, meaning the person has not fallen yet.
6. Standing condition: if the knees and waist, or the waist and shoulders, are aligned and the shoulders are above the waist, STANDING is detected; otherwise FALLEN is detected.
7. Handle missing keypoints: if the relevant keypoints are not detected, the result is reported as FALLEN.
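The rules above can be summarized in the following sketch. The thresholds (-5, 3, 1.5, 1.25, 0.03) are the values quoted in this section; the function signatures, data layout, and the direction of the shoulder-above-waist comparison (which depends on the image's y-axis convention) are assumptions of the sketch rather than the project's exact code.

```python
# Sketch of the fall-classification rules described above.

def box_state(height_change, width_change):
    """Distinguish FALLING from a box that is merely being CROPPED."""
    if height_change < -5 and width_change > 3:
        return "FALLING"
    if height_change < -5 and abs(width_change) <= 3:
        return "CROPPED"
    return "STABLE"

def classify_by_aspect_ratio(height, width, was_falling):
    """Apply the 1.5 / 1.25 aspect-ratio thresholds from the text."""
    aspect_ratio = height / width
    threshold = 1.5 if was_falling else 1.25
    return "STANDING" if aspect_ratio > threshold else "FALLEN"

def keypoint_check(shoulders, waist, knees, align_thresh=0.03):
    """Confirm a fall from normalized (x, y) keypoints in [0, 1]."""
    if not (shoulders and waist and knees):
        return "FALLEN"                      # missing keypoints -> report FALLEN
    sx = sum(p[0] for p in shoulders) / len(shoulders)
    wx = sum(p[0] for p in waist) / len(waist)
    kx = sum(p[0] for p in knees) / len(knees)
    sy = sum(p[1] for p in shoulders) / len(shoulders)
    wy = sum(p[1] for p in waist) / len(waist)
    aligned = abs(sx - wx) < align_thresh or abs(wx - kx) < align_thresh
    # Comparison direction follows the rule as stated in the text; flip it if
    # the y-axis origin of the keypoints is at the top of the frame.
    shoulders_above_waist = sy > wy
    return "STANDING" if (aligned and shoulders_above_waist) else "FALLEN"
```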
Integrating YOLO-Pose and YOLO

Both YOLOv8x and YOLOv8x-pose-p6 can be used in the same environment. The steps to confirm a fall are as follows:

1. Use YOLOv8x to get the bounding-box coordinates if a person is detected as FALLEN without falling.
2. Provide these coordinates as input to a function implementing YOLO-pose.
3. Run YOLO-pose within these coordinates and check the fall conditions using keypoints.

Conclusion

Our fall detection system combines the strengths of YOLOv8x for object detection and YOLOv8x-pose-p6 for pose estimation. This dual-model approach ensures a robust and reliable fall detection system. By using aspect ratios, bounding-box changes, and keypoint analysis, we achieve high accuracy in distinguishing between standing and fallen states. Future improvements could focus on refining the detection thresholds and enhancing the integration of keypoint analysis for even better performance.

2.4.5 Oil Spillage Detection

In this project, a system was designed to detect oil spills using image segmentation techniques. This system is crucial for environmental monitoring and industrial safety, enabling quick responses to oil spills to minimize damage.

How It Works

1. Data Preparation:
   - We created a dataset using Roboflow, which includes images of oil spills in different environments.
   - This dataset was annotated for segmentation tasks using Labelme. The coordinates of the points outlining the segmented areas were stored in JSON files to prepare the data for model training.
2. Model Training:
   - We used the U-Net architecture, a powerful model for image segmentation, to train on our prepared dataset.
   - The training process involved feeding the model images along with their segmentation maps, enabling the model to learn to accurately identify and segment areas affected by oil spills.
3. Validation and Testing:
   - After training, the model was tested using Roboflow to measure its performance.
   - The model achieved a precision of 82% and a mean Average Precision (mAP) of 96.4%, indicating high accuracy in segmenting oil spills.
4. Deployment:
   - The trained segmentation model was deployed to analyze video feeds in real time.
   - When the system detects an oil spill, it triggers an alert, allowing for an immediate response to mitigate environmental damage.

By leveraging the capabilities of the U-Net model for image segmentation, our oil spillage detection system accurately identifies and segments areas affected by oil spills. The integration with other safety monitoring solutions further enhances industrial safety, offering comprehensive protection against both workplace injuries and environmental hazards. Future improvements will focus on refining the detection thresholds and enhancing the model's ability to handle diverse environments for even better performance.

2.4.6 Number-Plate Detection

In this section, we walk through the development of a system designed to detect and track number plates in video footage. The system leverages YOLOv7 for object detection and EasyOCR for reading text from images. This setup automatically detects vehicles, extracts their number plates, and keeps track of them over time, which is useful for traffic monitoring, law enforcement, and security purposes.

System Overview

The system has two main components:

1. Object Detection: We used YOLOv7, a cutting-edge model known for its real-time object detection capabilities, to identify vehicles and number plates in video frames.
2. OCR (Optical Character Recognition): EasyOCR extracts the text from the detected number plates, ensuring accurate reading of the characters.

How It Works

1. Processing Video Input: The system processes each frame of the video one by one. Each frame undergoes both object detection and OCR.
2. Detecting Vehicles: YOLOv7 is trained to detect various types of vehicles, such as cars, motorcycles, buses, trucks, and trains. It draws bounding boxes around each detected vehicle.
3. Detecting Number Plates: Another YOLOv7 model is specifically tuned to detect number plates. This model identifies the exact region in each frame where a number plate is present.
4. Reading Number Plates (OCR): EasyOCR reads the text within the detected number-plate regions, converting the image data into alphanumeric characters. We also implemented a correction algorithm to fix common OCR errors and ensure the text matches the expected formats.
5. Tracking Vehicles: The system tracks vehicles across frames by assigning unique IDs based on their location in consecutive frames. This helps maintain continuity and correctly associate number plates with specific vehicles over time. It is also used to collect many frames of a single vehicle in the CSV file, reducing mistakes in reading the number-plate text.
6. Recording Data: The system records details such as frame number, vehicle ID, bounding-box coordinates, plate text, confidence scores, and image paths in a CSV file. This data is crucial for further analysis and reporting. From this data, the text detected most often by the OCR for each vehicle is taken as the final reading, to decrease the chance of a wrong entry.

Improving OCR Accuracy

OCR is not always perfect, so we included a text-correction algorithm to improve accuracy:

- The algorithm removes special characters and spaces.
- It converts the entire text to uppercase.
- Number plates in India follow a specific pattern (for example, the first two characters are letters and the next two are digits, and so on). We exploited this structure to make the output more accurate.
- It corrects common mistakes, such as replacing 'O' with '0' or 'I' with '1', to match the expected number-plate format.
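A sketch of this correction logic is shown below; the confusion table and the assumed nine/ten-character plate pattern are illustrative simplifications rather than the exact rules used in the project.

```python
# Sketch of the plate-text correction idea: strip noise, uppercase, then coerce
# characters towards the expected Indian plate pattern (two letters, two digits,
# one or two letters, four digits). The mapping tables are illustrative.
import re

TO_DIGIT  = {"O": "0", "I": "1", "Z": "2", "S": "5", "B": "8"}
TO_LETTER = {v: k for k, v in TO_DIGIT.items()}

def correct_plate_text(raw: str) -> str:
    text = re.sub(r"[^A-Za-z0-9]", "", raw).upper()   # drop spaces and specials
    if len(text) < 9 or len(text) > 10:
        return text                                    # unexpected length: leave as-is
    # Expected character type per position: L = letter, D = digit (assumed pattern).
    pattern = "LLDDLDDDD" if len(text) == 9 else "LLDDLLDDDD"
    fixed = []
    for ch, kind in zip(text, pattern):
        if kind == "L" and ch.isdigit():
            fixed.append(TO_LETTER.get(ch, ch))        # e.g. '0' -> 'O'
        elif kind == "D" and ch.isalpha():
            fixed.append(TO_DIGIT.get(ch, ch))         # e.g. 'I' -> '1'
        else:
            fixed.append(ch)
    return "".join(fixed)

# Example: correct_plate_text("GJ O5 AB I234") returns "GJ05AB1234".
```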
2.4.7 Posture Detection

In this section, we walk through the development of a system designed to detect and classify human postures using the YOLOv8-pose model. The system identifies 17 specific keypoints on the human body and classifies postures such as standing, sitting, squatting, and leaning. It can also detect when someone remains idle while sitting for too long. This setup is useful for applications in health monitoring, workplace safety, and activity recognition.

System Overview

The system has two main components:

1. Keypoint Detection: Using the YOLOv8-pose model, we identify 17 keypoints on the human body, corresponding to specific anatomical locations.
2. Posture Classification: We classify the detected keypoints into different postures and identify an idle state if someone sits for a prolonged period.

How It Works

1. Training the Model: The YOLOv8-pose model learns to identify keypoints by analyzing labeled data. This data includes the 2D coordinates of 17 keypoints for each person in every frame, stored in a CSV file, allowing the model to understand the spatial relationships and positions of these points.
2. Processing Video Input: The system processes each frame of a video one by one. Each frame is analyzed to detect keypoints, which are then used to determine the posture.
3. Keypoint Detection: The model identifies 17 keypoints, such as the nose, eyes, shoulders, elbows, wrists, hips, knees, and ankles. These points are superimposed on the video to visualize the detection.
4. Posture Classification: The system classifies postures based on the keypoints:
   - Standing: The shoulders are above the waist and knees.
   - Sitting: The hips are lower, with the knees bent at a right angle.
   - Squatting: The hips are even lower, with the knees bent significantly.
   - Leaning: The body tilts, indicating a leaning posture.
5. Idle Detection: If someone sits for too long, the system labels them as 'idle' in addition to 'sitting'. This is done by setting a time threshold for the sitting posture.
6. Output Video Generation: The system creates an output video showing the detected keypoints and the classified posture for each frame. This helps in understanding and analyzing posture changes over time.

Enhancing Accuracy and Performance

- Data Preparation: We organized and pre-processed the training data to fit the YOLOv8 framework, ensuring high accuracy in keypoint detection.
- Posture Classification Logic: The system uses specific criteria based on keypoint positions and relationships to classify postures accurately.
- Idle Detection Mechanism: By integrating a time threshold for sitting, the system effectively identifies prolonged idleness, providing useful insights.

This system's ability to detect and classify human postures in real-time video footage is invaluable for various applications, including monitoring the elderly, ensuring workplace safety, and analyzing human activities. Future improvements will focus on refining detection thresholds and enhancing keypoint analysis for even more accurate and reliable performance.

3. Conclusion

In this project, we created systems using state-of-the-art AI models to tackle several important challenges: detecting safety gear, monitoring postures, identifying oil spills, reading number plates, counting crowds, and detecting falls. Each system is built on robust training with carefully curated datasets, ensuring accurate results where it matters most.

For example, our safety-gear detection system powered by YOLOv7 excels at spotting harnesses, helmets, and jumpsuits, crucial for workplace safety. Meanwhile, our posture detection system with YOLOv8-pose accurately recognizes human postures, including detecting when someone sits for too long, a key feature for health monitoring and ergonomics. Our oil spill detection system, using U-Net, swiftly identifies and maps affected areas, aiding quick responses to environmental crises. Additionally, our number-plate OCR system combines YOLOv7 and EasyOCR to track vehicles and read number plates, enhancing security and traffic management.

These projects highlight how AI can make a real difference in practical scenarios. Looking ahead, we are dedicated to refining these systems further, exploring new techniques, and integrating the latest advancements to ensure our solutions remain at the forefront of technology. By continually improving and innovating, we aim to provide tools that not only meet current challenges effectively but also pave the way for future breakthroughs in AI applications for safety, environmental protection, and more.

4. References

1. https://www.ibm.com/topics/convolutional-neural-networks
2. https://www.datacamp.com/blog/yolo-object-detection-explained
3. M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert. Density-aware person detection and tracking in crowds. In ICCV, pages 2423–2430. IEEE, 2011.
4. K. Tota and H. Idrees. Counting in dense crowds using deep features.
5. C. S. Regazzoni and A. Tesei. Distributed data fusion for real-time crowding estimation. Signal Processing, 53(1):47–63, 1996.
6. K. Chen, S. Gong, T. Xiang, and C. C. Loy. Cumulative attribute space for age and crowd density estimation. In CVPR, pages 2467–2474. IEEE, 2013.
7. B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In ICCV, volume 1, pages 90–97. IEEE, 2005.
8. T. Zhao, R. Nevatia, and B. Wu. Segmentation and tracking of multiple humans in crowded environments. Pattern Analysis and Machine Intelligence, 30(7):1198–1211, 2008.
9. Z. Cao et al. Real-Time Multi-Person 2D Pose Estimation using Part Affinity Fields.
10. O. Ronneberger et al. U-Net: Convolutional Networks for Biomedical Image Segmentation.
11. https://github.com/JaidedAI/EasyOCR
12. A. A. Ali. A Survey of Fall Detection Techniques: Classification and Comparison.
13. J. D. Bolanos. Deep Learning for Semantic Segmentation in Environmental Monitoring: A Survey.
14. L. Wang. Real-time Video Analysis and Object Tracking.

5. Glossary

Bounding Box: A rectangular border around an object in an image, used to define the position and extent of the object for detection and annotation purposes.

CNN (Convolutional Neural Network): A type of deep learning model particularly effective for image recognition and classification tasks, capable of automatically and adaptively learning spatial hierarchies of features from input images.

Data Augmentation: Techniques used to increase the diversity of training data without collecting new data, often by applying random transformations such as rotations, translations, and flips to the existing dataset.

Dataset: A collection of data, typically organized in a structured format, used for training, validating, and testing machine learning models.

Density Map: A representation of the spatial distribution of objects within an image, used particularly in crowd counting to estimate the number of people in different regions of the image.

Fine-Tuning: The process of taking a pretrained model and further training it on a new, often smaller, dataset specific to a particular task to improve its performance on that task.

Hyperparameters: Configurable parameters external to the model that influence the training process, such as learning rate, batch size, and number of epochs.

Inference: The process of using a trained model to make predictions on new, unseen data.

Iterative Training: A training process that involves continuously adding new data and retraining the model to improve its accuracy and robustness over time.

Mean Average Precision (mAP): A metric used to evaluate the accuracy of object detection models, averaging the precision over a range of recall values.

Model Architecture: The structure and organization of layers within a neural network model, defining how the model processes input data to produce output predictions.

Pretrained Model: A model that has been previously trained on a large dataset and can be fine-tuned for a specific task, reducing the time and computational resources required for training.

Real-Time Monitoring: The capability of a system to process and analyze data instantly as it is being collected, allowing for immediate insights and actions.
Regularization: Techniques used during training to prevent overfitting by penalizing complex models, such as L1 (Lasso) regularization, which encourages sparsity in the model parameters.

YOLO (You Only Look Once): A family of real-time object detection models known for their speed and accuracy, processing entire images at once and predicting bounding boxes and class probabilities simultaneously.

YOLOv7: An advanced version of the YOLO model family, optimized for real-time object detection with improved accuracy and efficiency.

YAML: A human-readable data serialization standard used to define configurations for datasets and other settings in machine learning projects.

Active Learning: A machine learning approach where the model selectively queries a human annotator to label new data points with uncertain predictions, improving model performance iteratively.

Annotated Dataset: A dataset where each image or piece of data is labeled with relevant information, such as bounding boxes and class labels, necessary for training machine learning models.

Annotation Standardization: Ensuring consistency in labeling across different data sources to maintain uniformity and accuracy in the dataset used for training.

Automated Machine Learning (AutoML): Techniques and tools that automate the process of model selection, hyperparameter tuning, and feature engineering to improve machine learning workflows without extensive manual intervention.

Capsule Networks: A type of neural network designed to better capture spatial hierarchies and relationships between parts of an object, addressing some limitations of traditional CNNs.

Class Label: An identifier assigned to an object in an image, indicating its category, used in object detection and classification tasks.

Convolutional Layer: A layer in a convolutional neural network that applies convolution operations to the input, extracting features such as edges, textures, and patterns from images.

Data Cleaning: The process of removing duplicates, corrupted files, and irrelevant data to ensure the quality and consistency of the dataset.

Data Integration: Combining data from multiple sources into a single, coherent dataset, ensuring consistency in format and quality.

Deep Learning: A subset of machine learning involving neural networks with many layers, capable of learning complex patterns and representations from large amounts of data.

Empirical Research: The process of collecting and analyzing data through direct and indirect observation or experience, forming the basis for model training and evaluation.

Evaluation Metrics: Quantitative measures such as precision, recall, and mean average precision (mAP) used to assess the performance of a machine learning model.

Fully-Connected Layer: A layer in a neural network where each neuron is connected to every neuron in the previous layer, typically used in the final stages of a CNN to perform classification.

Generative Adversarial Networks (GANs): A class of deep learning models in which two networks (a generator and a discriminator) are trained simultaneously, with the generator creating realistic data and the discriminator distinguishing between real and generated data.

Learning Rate: A hyperparameter that controls how much the model's weights are adjusted during training with respect to the loss gradient.

Model Initialization: The process of setting up a model's architecture and initial weights before training begins.
Neural Network: A series of algorithms that attempt to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates.

Object Detection: The task of identifying and localizing objects within an image, crucial for applications like surveillance, robotics, and augmented reality.

Pooling Layer: A layer in a convolutional neural network that performs down-sampling, reducing the spatial dimensions of the input and helping to make the representation more manageable.

Supervised Learning: A type of machine learning where the model is trained on a labeled dataset, learning to map inputs to the correct outputs.

Transfer Learning: A technique in machine learning where a pretrained model is used as the starting point for a new, related task, improving performance and reducing training time.

Unsupervised Learning: A type of machine learning where the model is trained on an unlabeled dataset, discovering patterns and relationships within the data without predefined labels.

Aspect Ratio: The ratio of the height to the width of a bounding box around an object or person, used to infer standing or fallen states.

OCR (Optical Character Recognition): Technology used to recognize text within images; EasyOCR is used here for reading number plates.

Posture Classification: Classification of human body positions (such as standing, sitting, and squatting) based on detected keypoints.

Threshold Values: Predefined values used as limits or conditions for making decisions in algorithms, such as for detecting changes in bounding-box dimensions or keypoint alignments.

Environmental Monitoring: Surveillance and analysis of environmental conditions to ensure compliance with safety and regulatory standards, such as detecting and responding to oil spills.