TrashWatch: Empowering Cleanliness through Smart Cameras

Jash Jain, Manthan Juthani, Kashish Jain and Anant V. Nimkar
Sardar Patel Institute Of Technology, Mumbai, India
{jash.jain, manthan.juthani, kashish.jain, anant_nimkar}@spit.ac.in

Abstract. Illegal garbage disposal is a pressing social and environmental issue. Garbage dumped at arbitrary sites damages the environment and is a major health hazard. We aim to minimize the disposal of garbage in public places. This paper introduces a novel approach to addressing littering incidents by using computer vision and surveillance cameras to detect and monitor littering actions accurately. The system distinguishes littering from litter removal by combining YOLOv4, the Area Over Intersection method, and traditional geometry. The research also determines the camera features, including focal length and maximum camera angle, needed to optimize the system's performance in different settings. The proposed system, named 'TrashWatch', lets users choose camera resolutions and settings based on their specific requirements. The study demonstrates the system's effectiveness in identifying littering and cleaning actions while considering user preferences for camera features. Ultimately, this comprehensive solution lays the groundwork for automating the entire process of detecting litterbugs, including tracking the responsible individuals and employing facial recognition for penalty enforcement.

Keywords: Littering Detection, Camera Features, Frame Extraction, Throwing Action

1 Introduction

Littering, such as throwing a Styrofoam cup from a moving car, is a relatively recent phenomenon. It became common in the 1950s with the spread of throwaway products and plastic packaging. An estimated 8 million tons of plastic waste enter the oceans every year, polluting water, land, and air. As litter disintegrates, cigarette butts leach arsenic and formaldehyde into waterways and endanger humans and animals. Trash causes 60% of global water contamination, and over 40% of it is burned in the open, producing harmful toxins that cause breathing problems and acid rain. Traditional littering detection relies on human intervention and eyewitness testimony, which can delay action and mislead investigations. Video surveillance cameras can automate littering detection. The accuracy and scope of such a system depend on camera features including focal length, resolution, distance, and angle. This study provides a mathematical link between these elements to enable customized setups.

Currently, no research considers both the littering action and the camera features required to detect it. While papers by R. Csordás [1] and S. Mahankali [2] discuss specific littering actions, and Porikli's paper [7] explains general object detection, this paper covers these topics more extensively. Object detection is the first step, followed by geometry, center of mass, and Area Over Intersection to determine whether an action is littering or merely carrying an object. The proposed system records littering on camera and can identify and trace litterers using YOLOv4, Area Over Intersection, and geometry, while differentiating littering from picking litter up. The solution suits both public and private deployments. The second part of the approach suggests camera features based on the coverage area so that littering can be accurately identified in private installations.
The technique used a simulator to iterate over candidate settings and find the greatest camera angle at a given focal length. After the user enters the objects to be identified as litter and the distance at which they must be detected, the system outputs several camera resolutions, the ideal focal length at each resolution, the maximum camera angle that will cover the most area, and the percentage of the area covered. The user can then pick the option that suits their needs.

The paper's main contribution is the implementation of the proposed system 'TrashWatch', which identifies littering and cleaning actions and matches user choices for surveillance camera features. Computer vision and state-of-the-art methods are used to detect littering, and the system's ability to meet user camera-feature requirements makes it a complete solution. This technology may therefore automate the full litterbug detection process, including identifying the littering behavior, tracking the perpetrator, and administering sanctions via facial recognition. This paper proposes an organized approach to public-place pollution. Section 2 analyzes current computer vision and object-tracking developments pertinent to the problem. Section 3 discusses the novel qualities of the proposed system and its research contributions. Section 4 describes the approach and instruments used to evaluate the effectiveness of the system. Section 5 presents the findings of the study together with a more in-depth analysis. Section 6 highlights the most important findings, contributions, and next steps.

2 Literature Survey

Several research papers have addressed various challenges in object detection and recognition, as well as related applications in surveillance and waste management. Csordás et al. [1] proposed a robust method for detecting objects thrown over a fence using a monocular camera system. Their approach utilized optical flow to track object trajectories and was particularly effective for tracking blurry, small, or variably shaped objects. However, it was limited to a specific camera placement at the end of the fence and focused on object detection over the fence line. Mahankali et al. [2] presented a system for identifying illegal garbage dumping in video footage using deep learning algorithms. Their system achieved a high accuracy of 95% in classifying objects as garbage or non-garbage based on shape, size, and color features. Nevertheless, the system had limitations in detecting obscured or specific types of garbage and in identifying the individuals responsible for the dumping.

In the field of object detection, Liu et al. [3] introduced a Region Proposal Network (RPN) that improved the efficiency and precision of object detection networks by sharing convolutional features between the RPN and the detection network. This integration allowed for real-time operation and enhanced the quality of proposed regions. Conversely, Esen et al. [4] focused on motion-based detection for surveillance videos, proposing the motion co-occurrence feature (MCF) as a promising candidate for abnormal event detection. However, the computational time required for high frame-history values limited its suitability for real-time applications. For action recognition in videos, Wang et al.
[5] presented the temporal segment network (TSN), which improved the modeling of long-range temporal structures and achieved state-of-the-art performance in action recognition while maintaining computational efficiency. Zhang et al. [6] conducted a survey of vision-based fall detection methods, highlighting the challenge of distinguishing falls from other similar activities in daily life. They found that vision-based methods alone may not provide a comprehensive solution for fall detection. Porikli et al. [7] proposed a pixel-wise method that relied on dual foregrounds to detect objects brought into a scene at a later time, such as abandoned items or illegally parked vehicles. This approach did not rely on object tracking and was effective in crowded scenarios. Zhou et al. [8] presented DECOLOR, a motion-based algorithm for moving object detection that addressed non-static backgrounds by using a parametric motion model and a low-rank representation. While DECOLOR achieved accurate results, it was not suitable for real-time object detection.

In garbage detection, Majchrowska et al. [13] analyzed existing trash datasets and introduced two new benchmarks, detect-waste and classify-waste, in which EfficientDet-D2 localized litter and EfficientNet-B2 categorized it into garbage types in a two-stage detector. Xu et al. [14] enhanced the YOLO-CS detection system, enabling YOLOv4 to distinguish several objects in a single cell; their joint prediction method outperformed state-of-the-art detectors on CrowdHuman and CityPersons.

3 TrashWatch

The proposed project involves the development of a litter detection algorithm capable of detecting instances of littering and identifying the specific type of waste being discarded. To implement this system, the proposal offers a hardware recommendation that enables the government or public entities to specify the prevalent type of litter in a given area, the minimum size of the debris, and the distance at which it must be detected. On the basis of these inputs, the system derives the camera parameters necessary for the proposed solution: the camera's minimum resolution, focal length, coverage area, and angle of view.

3.1 Littering Detection

At the outset, to identify instances of throwing or littering, we considered isolating an individual as a pixelated binary entity. If the entire human form were converted into such an entity, a separate binary blob could be discerned by observing its movement away from the human; this second blob would be assumed to be a piece of refuse. This line of reasoning led to the idea of using a video that documents changes in the scene. However, the method suffers from fragmentation of the human subject in the frame, which leads to inconsistent results.

YOLOv4 (You Only Look Once) is a real-time object detection system capable of identifying specific objects in various media formats such as videos, live feeds, and still images [9][10]. The YOLO algorithm employs a deep convolutional neural network and leverages the learned features to identify objects accurately. A minimal sketch of invoking such a detector is given below.
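The following sketch illustrates how a pretrained YOLOv4 model could be invoked with OpenCV's DNN module and filtered down to the COCO classes treated as litter. The file paths, thresholds, and the exact litter-class list are assumptions for illustration, not the configuration used in TrashWatch.

```python
import cv2

# COCO classes treated as litter in this sketch (an assumption based on the
# objects named in the text: bottles, bags, umbrellas, bananas, apples, ...).
LITTER_CLASSES = {"bottle", "cup", "handbag", "backpack", "umbrella", "banana", "apple"}

# Hypothetical paths to the standard Darknet YOLOv4 configuration and weights.
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

with open("coco.names") as f:  # the 80 COCO class names, one per line
    coco_names = [line.strip() for line in f]

def detect_litter(frame, conf=0.4, nms=0.4):
    """Run YOLOv4 on one frame and keep only detections of litter classes."""
    class_ids, scores, boxes = model.detect(frame, confThreshold=conf, nmsThreshold=nms)
    return [(coco_names[int(c)], tuple(box), float(s))
            for c, s, box in zip(class_ids, scores, boxes)
            if coco_names[int(c)] in LITTER_CLASSES]
```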
Internally, the prediction step uses 1 × 1 convolutions, so the spatial dimensions of the preceding feature map and the resulting prediction map are equal. The YOLO model distinguishes a total of eighty object classes, of which thirteen were identified as litter in our analysis. These include containers for liquids, bags for carrying personal belongings, and protective gear for rain, along with edible produce such as bananas and apples.

People in the stream were identified using the Histogram of Oriented Gradients (HOG) technique. The HOG person detector slides a detection window across the image. The texture-based HOG method, with the advantages and features elucidated in [?], appeared to be the most appropriate choice. A HOG descriptor is computed at every position of the detection window and is then passed to a trained Support Vector Machine (SVM), which classifies it as either "a person" or "not a person". A constant was added to the borders of each bounding box to reduce the multiple boxes enclosing a single person to one.

After waste is detected by the YOLO model, the AOI (Area over Intersection) technique is applied. When two axis-aligned bounding boxes intersect, the result is always another axis-aligned bounding box. Using this overlap principle, the area of intersection between the person and the waste item is computed:

AoI = area of overlap / area of union

Fig. 1. Area Over Intersection
Fig. 2. Littering and Cleaning Action Detected

Once the litter has separated from the person and the person's surroundings no longer contain the item, the action is treated as littering and the perpetrator can be identified. Conversely, a cleaning action is considered complete once a litter item enters a person's bounding box and subsequently moves a designated distance upward or downward. To validate the model, both a live webcam feed and a pre-recorded video stream were used. Both were effective in identifying individuals engaged in either cleaning up or disposing of waste.

3.2 Camera Feature Extraction

As part of our research contribution, various types of litter, such as purses, bottles, and umbrellas, were used to generate camera features. These objects were used in different sizes and placed at varying distances from the camera, simulating real-world littering scenarios. The purpose of this approach was to accurately assess the performance of the TrashWatch system in detecting littering incidents. By analyzing the output of the system when confronted with different objects and distances, valuable data was gathered. This data was then entered into the JVSG Lens Simulator, which provided the camera features required for effective detection if the objects were littered in a real-life CCTV camera scenario.
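Before turning to the simulation results, the overlap test at the heart of the littering/cleaning decision in Section 3.1 can be summarized in a short sketch. This is a minimal illustration only; the bounding-box padding, the overlap threshold, and the track-ordering logic are our assumptions rather than the exact parameters used in TrashWatch.

```python
import cv2

# OpenCV's built-in HOG + linear SVM pedestrian detector.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_people(frame, pad=15):
    """Return padded (x, y, w, h) person boxes; the constant padding helps merge
    several boxes around one person into a single enclosing box."""
    rects, _ = hog.detectMultiScale(frame, winStride=(8, 8))
    return [(x - pad, y - pad, w + 2 * pad, h + 2 * pad) for (x, y, w, h) in rects]

def area_over_intersection(box_a, box_b):
    """Overlap of two axis-aligned boxes divided by their union, as in Fig. 1."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def classify_action(person_box, litter_boxes, overlap_thresh=0.05):
    """Given the chronological bounding boxes of one litter item, flag littering
    when the item overlapped the person earlier but no longer does; the reverse
    order of states indicates a cleaning action."""
    overlapped = [area_over_intersection(person_box, b) > overlap_thresh
                  for b in litter_boxes]
    if any(overlapped[:-1]) and not overlapped[-1]:
        return "littering"
    if not overlapped[0] and overlapped[-1]:
        return "cleaning"
    return "none"
```

In the full pipeline the person box would come from the HOG detector and the litter track from the YOLOv4 detections of the previous sketch; a further vertical-movement check, as described above, would confirm the cleaning case.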
Through the simulations run, various camera angles and coverage percentages at which the littering action was still detected were found. Using the angle of view, the ideal focal length was then calculated using

AOV = 2 · tan⁻¹(d / (2f))    (1)

where AOV represents the angle of view, d corresponds to the chosen dimension (such as film or sensor size), and f denotes the effective focal length of the camera. By conducting these experiments and leveraging the simulator, precise and relevant camera features specific to litter detection were obtained. This research contributes to the advancement of surveillance systems by enhancing their ability to identify and monitor instances of littering, leading to improved cleanliness and environmental preservation.

4 Experimental Setup

The experimental setup used TrashWatch to process camera video. The Person Detection Model found people in the frame while the custom YOLOv4 model found litter; the models supplied bounding-box coordinates and detection probabilities. The AOI Function determined whether the activity was cleaning or littering by calculating the bounding-box overlap. The Camera Features Model specified resolution, focal length, angle, and coverage area; object distance and size were entered into the JVSG Lens Simulator to calculate these values. The TrashWatch system was tested using varied video footage, ground-truth annotations, and quantitative evaluations. Together, the Person Detection Model, custom YOLOv4 model, AOI Function, and Camera Features Model enable real-time litter detection and monitoring.

The study used the state-of-the-art YOLOv4 model to improve litter detection accuracy. Bottles, handbags, and umbrellas were initially hard to identify, so the YOLOv4 model was fine-tuned on a bespoke dataset to overcome this limitation. Cropping and rotation were used to augment a wide range of litter images. Training the model for 100 epochs at a learning rate of 0.001 improved litter prediction accuracy significantly. Due to the lack of litter detection datasets, only 12 litter categories can be identified. The minimum object height the model handles is 13 cm, so this threshold was chosen for detection. The video resolution should be at least 480p to detect litter and faces accurately enough for penalization, and the camera is expected to see pedestrians' faces at a specific point to help track litterers.

The next step in the procedure was to create a framework that specifies only the camera hardware necessary to set up a litter detection system in a neighborhood. The JVSG Lens Calculator was used to obtain information based on the lens type and distance. Based on the subject's height and distance, the camera's focal length (in millimeters), resolution, and angle were set. The following resolutions were tried and tested: 480p, 640p, 1280p, and 2048p.

Fig. 3. Architectural Diagram

5 Results & Discussions

The study faces challenges in its attempt to simplify the system for widespread use, as stated in its scope. The detection model relies on object and pedestrian detection for all of its calculations. The camera has several internal features that affect its object detection accuracy, but evaluating them all is too demanding for a typical user, which contradicts the study's goal.
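To give a concrete sense of how these basic features interact, the sketch below assumes a simple pinhole camera mounted at a fixed height and shows how the angle of view from Eq. (1) and the tilt (depression) angle translate into the ground area covered. The sensor width, tilt, and vertical angle of view used in the example are illustrative assumptions; the sketch is not a substitute for the JVSG simulator, which models the lens in far more detail.

```python
import math

def focal_length_mm(sensor_dim_mm, aov_deg):
    """Invert Eq. (1): f = d / (2 * tan(AOV / 2))."""
    return sensor_dim_mm / (2 * math.tan(math.radians(aov_deg) / 2))

def ground_coverage(height_m, tilt_deg, vfov_deg):
    """Near and far ground distances seen by a camera mounted height_m above the
    ground, tilted down by tilt_deg, with a vertical angle of view of vfov_deg."""
    lower = math.radians(tilt_deg + vfov_deg / 2)  # ray pointing most steeply down
    upper = math.radians(tilt_deg - vfov_deg / 2)  # ray pointing closest to horizontal
    near = height_m / math.tan(lower)
    far = height_m / math.tan(upper) if upper > 0 else float("inf")
    return near, far

# Illustrative numbers only: an assumed 6.4 mm sensor width and a 60 degree
# angle of view give a focal length of roughly 5.5 mm ...
print(round(focal_length_mm(6.4, 60.0), 1))
# ... and a camera at 3.5 m tilted down 35 degrees with a 40 degree vertical
# angle of view covers the ground from about 2.5 m to about 13.1 m away.
print(tuple(round(v, 1) for v in ground_coverage(3.5, 35.0, 40.0)))
```

Sweeping the tilt angle in such a model until the required detection distance just falls inside the covered span is one simple way of arriving at a maximum camera angle of the kind reported below.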
In addition, not all these factors are considered during purchase or installation. To simplify matters and make conclusions easier to draw, only the fundamental factors are considered: the height of the installation (kept constant at 3.5 m), the focal length of the lens (measured in mm), the camera's resolution, the horizontal distance of the object to be detected from the camera (measured in meters), and the angle of depression from the lens to the object. The results draw conclusions from hundreds of simulations exploring the association between these factors.

Table 1 shows the results displayed when a user enters their requirements in terms of the object to be detected and the distance up to which they want it detected. When the user wants to detect bananas at a maximum distance of 4 meters, they get four options from which to pick according to requirements, availability, and budget. The camera with the lowest resolution gives 73% coverage, the next resolution gives 92% coverage, and the two highest resolutions give 100% coverage in this case. Along with the coverage, the study also outputs the ideal focal length of the lens to be installed and the maximum angle at which the camera should point at the detection area for optimum detection.

Table 1. Camera features for a selected object at a distance of 4 m

Resolution    | Ideal Focal Length (mm) | Max Camera Angle (deg) | % Coverage
480 x 360     | 23                      | 55.9                   | 73%
640 x 512     | 17                      | 62.3                   | 92%
1280 x 720    | 7.7                     | 72.4                   | 100%
2048 x 1536   | 7.7                     | 72.4                   | 100%

First, we look for a connection between the ideal focal length and the camera resolution, assuming both the object's distance from the camera and its size remain the same. The result, shown in Figure 4, is that the ideal focal length decreases as the camera resolution improves. The ideal focal length is the value the study recommends to give the user maximum coverage of the area. It is also mathematically related to the maximum angle of depression the camera can have, as given in [12], while still being able to recognize the object at that distance.

Figure 5 shows the relationship between the minimum object length and the percentage coverage of the total area for various focal lengths and camera angles, for objects such as a banana, a bottle, and a bag, assuming a constant object-to-camera distance of 4 meters and a constant camera resolution of 480 x 360 pixels. The minimum length is considered because if a 20-centimeter bottle is detected properly, then a bottle with comparable characteristics but a slightly greater height will also be detected. Testing was done with bottles of various sizes, but for the camera feature analysis the minimum object length was chosen: 20 cm for bottles, 13 cm for bananas, and 35 cm for bags.

6 Conclusion

TrashWatch detects littering involving twelve object classes with high accuracy, and the derived camera specifications make personalized deployment straightforward. The system's efficacy has been demonstrated in varied settings and with different test participants. The camera's viewing range decreased slightly as object size increased, with distance and resolution held constant. Optimizing camera resolution and focal length increased coverage significantly. The study showed how camera resolution and focal length affect object detection.
Future work on TrashWatch could develop a comprehensive solution that combines surveillance-based litter detection, parallel active monitoring for facial identification, and an automatic penalty system. Active tracking can locate offenders, and working with authorities to build a robust image database linked to National Identification Cards (NIC) can support enforcement in schools and offices. This study lays the groundwork for further developments in waste management technology and reinforces the commitment to cleaner and more accountable surroundings.

Fig. 4. Ideal Focal Length of the Camera for Maximum Coverage vs Camera Resolution
Fig. 5. Maximum Percentage Area Covered for Detection using the System vs Minimum Object Length to be Detected

References

1. Csordás, Róbert, László Havasi, and Tamás Szirányi. "Detecting objects thrown over fence in outdoor scenes." International Conference on Computer Vision Theory and Applications, Vol. 2. SciTePress, 2015.
2. Mahankali, Sriya, et al. "Identification of illegal garbage dumping with video analytics." 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2018.
3. Liu, Wei, et al. "SSD: Single shot multibox detector." Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I. Springer International Publishing, 2016.
4. Esen, Ersin, Mehmet Ali Arabaci, and Medeni Soysal. "Fight detection in surveillance videos." 2013 11th International Workshop on Content-Based Multimedia Indexing (CBMI). IEEE, 2013.
5. Wang, Limin, et al. "Temporal segment networks: Towards good practices for deep action recognition." European Conference on Computer Vision. Springer, Cham, 2016.
6. Zhang, Zhong, Christopher Conly, and Vassilis Athitsos. "A survey on vision-based fall detection." Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments. 2015.
7. Porikli, Fatih, Yuri Ivanov, and Tetsuji Haga. "Robust abandoned object detection using dual foregrounds." EURASIP Journal on Advances in Signal Processing 2008 (2007): 1-11.
8. Zhou, Xiaowei, Can Yang, and Weichuan Yu. "Moving object detection by detecting contiguous outliers in the low-rank representation." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.3 (2012): 597-610.
9. Jiang, Peiyuan, et al. "A review of YOLO algorithm developments." Procedia Computer Science 199 (2022): 1066-1073.
10. Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).
11. Rana, Md Sohel, Aiden Nibali, and Zhen He. "Selection of object detections using overlap map predictions." Neural Computing and Applications 34.21 (2022): 18611-18627.
12. Li, Xiang, et al. "Evaluating effects of focal length and viewing angle in a comparison of recent face landmark and alignment methods." EURASIP Journal on Image and Video Processing 2021 (2021): 1-18.
13. Majchrowska, Sylwia, et al. "Deep learning-based waste detection in natural and urban environments." Waste Management 138 (2022): 274-284.
14. Xu, Hong-hui, et al. "Object detection in crowded scenes via joint prediction." Defence Technology (2021).
15. Begur, Hema, et al. "An edge-based smart mobile service system for illegal dumping detection and monitoring in San Jose."
2017 IEEE SmartWorld, Ubiquitous Intelligence Computing, Advanced Trusted Computed, Scalable Computing Communications, Cloud Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). IEEE, 2017.
16. Dabholkar, Akshay, et al. "Smart illegal dumping detection." 2017 IEEE Third International Conference on Big Data Computing Service and Applications (BigDataService). IEEE, 2017.
17. Zhang, Qing, Yongwei Nie, and Wei-Shi Zheng. "Dual illumination estimation for robust exposure correction." Computer Graphics Forum. Vol. 38, No. 7. 2019.
18. Guo, Xiaojie, Yu Li, and Haibin Ling. "LIME: Low-light image enhancement via illumination map estimation." IEEE Transactions on Image Processing 26.2 (2016): 982-993.
19. Hasan, I., et al. "Generalizable pedestrian detection: The elephant in the room." 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021.
20. Wang, Haoran, Zhen Hua, and Jinjiang Li. "Two-stage progressive residual learning network for multi-focus image fusion." IET Image Processing 16.3 (2022): 772-786.
21. Singh, Mohit, Vijay Laxmi, and Parvez Faruki. "Dense spatially-weighted attentive residual-haze network for image dehazing." Applied Intelligence 52.12 (2022): 13855-13869.