VisionMate: AI Driven Navigation Glasses For The Visually Impaired Dissertation submitted in Partial fulfillment of the Academic Requirements for the Degree of Bachelor of Engineering in Computer Science and Engineering Submitted By Ayesha Mahaboob 160621733007 Syeda Rania Mahek 160621733056 Huda Tariq 160621733306 Under the guidance of Mrs. B Gnana Prasuna Assistant Professor DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Stanley College of Engineering and Technology for Women (Autonomous) (Approved by AICTE, Accredited by NBA and NAAC, Affiliated to Osmania University) Chapel Road, Abids, Hyderabad 2024 Stanley College of Engineering and Technology for Women (Autonomous) (Approved by AICTE, Accredited by NBA and NAAC, Affiliated to Osmania University) DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING CERTIFICATE This is to certify that the project work titled “VisionMate: AI Driven Navigation Glasses For The Visually Impaired” submitted by Ayesha Mahaboob (160621733007), Syeda Rania Mahek (160621733056), Huda Tariq (160621733306), students of the Department of Computer Science And Engineering, Stanley College of Engineering and Technology for Women in partial fulfillment of the requirements for the award of the Degree of Bachelor of Engineering with Computer Science and Engineering as specialization is a record of the bonafide work carried out during the academic year 2024-25 Signature of the guide Signature of the Head of the Dept. Mrs. B Gnana Prasuna Dr YVSS Pragathi Assistant Professor Professor & HOD, Dept of CSE Dept of CSE Stanley College of Engineering and Technology for Women (Autonomous) (Approved by AICTE, Accredited by NAAC, Affiliated to Osmania University) DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING DECLARATION This is to certify that the work reported in the thesis entitled “VisionMate: AI Driven Navigation Glasses For The Visually Impaired” is a record of the work done by us in the Department of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Abids, Hyderabad. No part of the thesis is copied from the books/journals/internet and wherever the portion is taken the same has been duly referred to in the text. The report is based on the project work done entirely by us and not copied from any other source. Ayesha Mahaboob Syeda Rania Mahek Huda Tariq 160621733007 160621733056 160621733306 i ACKNOWLEDGEMENT It is with a sense of gratitude and appreciation that we feel to acknowledge any wellwishers for their kind support and encouragement during the completion of the project work. We thank Dr. Satya Prasad Lanka, the Principal, Stanley College of Engineering and Technology for women, for his timely cooperation and for providing us all the required facilities to complete the project successfully. We would extremely grateful to Dr. Y.V.S.S Pragathi, Head of the department for providing excellent computing facilities and such a nice atmosphere for completing our project work. We would like to express our sincere gratitude and special thanks to our Project Review Committee Members Dr. M Swapna, Mrs. B G Prasuna, Mrs. T Monika Singh and our Project Coordinator Mrs. Nazia Tabassum for their valuable suggestions towards the completion of the project. We would like to express our heartfelt gratitude to Mrs. B Gnana Prasuna, our Guide, for encouraging and guiding us throughout the work. We are highly indebted to her for guidance and constant supervision regarding the project work as well as for providing valuable suggestions and continuous support given to us for completing the report successfully and all those who helped us directly or indirectly during the course of the seminar. We sincerely thank all of them. Ayesha Mahaboob 160621733007 Syeda Rania Mahek 160621733056 Huda Tariq 160621733306 i ABSTRACT Visually impaired individuals face significant challenges in navigating their surroundings, particularly in unfamiliar or complex environments. Traditional mobility aids, such as canes or guide dogs, offer limited assistance and lack the capability to provide detailed spatial awareness or precise navigation instructions. This project introduces an advanced solution: AI-powered smart glasses designed to enhance independence and safety for visually impaired users by providing real-time guidance through audio feedback. The system integrates key hardware components, including a wireless camera mounted on the glasses, a microphone, speakers, and a Raspberry Pi, to create a lightweight and efficient platform. The camera continuously captures video of the user’s surroundings, while the onboard processing unit leverages object detection algorithms to identify obstacles, recognize objects, and map the environment. The microphone facilitates voice input for interaction, and the speakers deliver real-time navigation instructions and object information. Complementing the glasses, a mobile application serves as an extended user interface. It provides functionalities such as pathfinding, visual representation of scanned environments, and customizable settings to tailor the navigation system to individual needs. By combining hardware and software, the system creates a seamless user experience. The AI-powered glasses use intelligent algorithms to analyze and interpret environmental data, guiding the user with context-aware audio instructions. Whether navigating toward a specific destination, identifying nearby objects, or avoiding obstacles, the system ensures real-time responsiveness and reliability. Unlike traditional aids, these glasses aim to reduce dependency on external assistance while improving the user's spatial awareness. This project focuses on achieving a practical, cost-effective, and user-friendly solution for visually impaired individuals. By addressing the limitations of existing navigation aids and incorporating advanced technologies, the AI-powered smart glasses hold the potential to transform the way visually impaired individuals experience and interact with the world around them. Keywords: Assistive technology, visually impaired navigation, real-time object detection, AI-powered smart glasses, mobility aid, accessibility, spatial awareness, user-centered design, intelligent navigation system. ii Table of Contents TITLES CHAPTER NO PAGE NO Abstract i List of Figures ii List of Tables List of Acronyms iii iv 1 INTRODUCTION 1 1.1 Overview 1 1.2 Problem Statement 2 1.3 Aim & Scope 3 1.4 Objectives 5 2. LITERATURE REVIEW 7 3. EXISTING SYSTEM 12 4. SYSTEM ARCHITECTURE 15 5 DATA SET INFORMATION 18 6 PROPOSED METHODOLOGY 20 7 RESULTS AND DISCUSSION 23 REFERENCES 28 iii LIST OF FIGURES FIG. NO FIGURE NAME PAGE NO. 1 White Cane 8 2 Guide Dog 9 3 Seeing AI App 9 4 eSight 10 5 Architecture Diagram 11 6 Image detection using YOLOv8 20 7 YOLOv8 model metrics plot 20 8 The glasses with hardware components attached 21 9 Augmented Reality Scanning for Pathfinding 22 iv LIST OF TABLES TABLE No. CAPTION PAGE No. 1 Literature Summary Table 15 v LIST OF ACRONYMS AI Artificial Intelligence AR Augmented Reality API Application Programming Interface COCO Common Objects in Context CPU Central Processing Unit CVAT Computer Vision Annotation FPS Frames Per Second GPU Graphics Processing Unit IoT Internet of Things ML Machine Learning NLP Natural Language Processing NYU Depth V2 New York University Depth Version 2 Dataset RGB-D Red Green Blue and Depth SLAM Simultaneous Localization and Mapping TTS Text-to-Speech YOLO You Only Look Once vi 1. INTRODUCTION 1.1 Overview Navigating through the world can be a daunting task for individuals with visual impairments. Traditional methods like using a white cane or guide dogs have been the main solutions for centuries, but they are limited in their ability to interact with a complex and dynamic environment. These methods often lack the ability to offer real-time, context-aware navigation and do not provide the nuanced feedback that a visually impaired person may need in various scenarios, especially in indoor or urban settings. This limitation can significantly restrict the independence and mobility of visually impaired individuals, particularly when it comes to navigating unfamiliar locations or crowded spaces. With recent advancements in technology, there is an opportunity to leverage innovative solutions that could assist people with visual impairments in a more comprehensive and intelligent manner. One such potential solution is the development of AI-powered smart glasses that can assist with navigation in realtime. These glasses would use sensors like cameras, microphones, and ultrasonic sensors, in combination with AI models capable of detecting obstacles and recognizing objects in the environment. The system would provide auditory instructions to the user, guiding them in a way that ensures safety and promotes independence. The proposed system aims to create a wearable solution in the form of AI-powered glasses that will serve as a navigation aid for visually impaired individuals. These glasses will be equipped with a camera to detect obstacles and objects, a microphone for voice command input, and speakers to deliver real-time auditory feedback to the user. The glasses will use object detection algorithms, such as those provided by YOLO (You Only Look Once), to identify and process the user's surroundings. Based on this information, the glasses will provide navigation instructions, including directions for moving around obstacles and reaching desired locations. By integrating AI with real-time object detection and audio-based guidance, the system can provide a more seamless and intuitive navigation experience for visually impaired users. The goal of this project is to improve their quality of life by providing a tool that enables them to navigate various environments with increased safety, confidence, and independence. 1 1.2 Problem Statement Visually impaired individuals face significant challenges when navigating both familiar and unfamiliar environments. While traditional assistive devices, such as white canes and guide dogs, have long been the primary means of navigation, they have inherent limitations. White canes can only detect obstacles at ground level, providing no awareness of objects at eye level or overhead, which is crucial in many realworld environments. Similarly, guide dogs require substantial training, are subject to health issues, and their utility is limited to specific contexts, such as open outdoor spaces. The limitations of these traditional tools leave visually impaired individuals struggling to navigate more complex or dynamic environments, such as urban streets, indoor spaces, or crowded areas. Current assistive technologies, such as GPS systems or electronic navigation aids, often fail to meet the needs of visually impaired users. GPS is primarily designed for outdoor navigation, where it can provide location-based guidance but lacks the ability to assist with immediate, local obstacles or dynamic, realtime challenges, such as avoiding collisions or finding specific objects within an environment. Additionally, many existing technologies are expensive, bulky, or impractical for everyday use, leading to a lack of widespread adoption. Moreover, the existing solutions tend to be passive and reactive, only alerting users to obstacles when they are directly encountered. This creates a fragmented and unreliable navigation experience that does not foster independence. Visually impaired individuals often find themselves relying on non-technologybased solutions for tasks such as object detection, spatial awareness, and navigation, which adds stress, time, and effort to everyday activities. This problem is compounded by the increasing complexity of urban environments, where obstacles are varied and constantly changing, making navigation even more challenging. Thus, there is an urgent need for an advanced, wearable, and real-time navigation system that can assist visually impaired users in dynamic, complex environments. The system should be intuitive, capable of providing auditory feedback that helps users understand their immediate surroundings and navigate effectively. It should seamlessly integrate object detection, speech synthesis, and sensor technology to offer accurate, personalized guidance in any environment, whether indoor or outdoor. This will empower visually impaired individuals to navigate with confidence, promoting independence and enhancing their overall quality of life. 2 1.3 Aim & Scope The primary aim of this project is to design and develop an AI-powered, wearable navigation system for visually impaired individuals. The system will utilize real-time object detection, spatial awareness, and voice guidance to provide accurate and intuitive navigation assistance. By integrating a camera, sensors, and a voice feedback mechanism, the system will assist users in navigating their environment, identifying obstacles, and following instructions to reach specific destinations. The project aims to enable visually impaired users to move through both indoor and outdoor spaces with confidence, independence, and safety, ultimately improving their mobility and quality of life. This project focuses on the creation of a wearable navigation assistant that integrates hardware and software components to deliver real-time assistance. The key features and scope of the project include: 1. Hardware Integration: A pair of glasses equipped with integrated speakers, microphone, and a camera to capture the user’s surroundings. The system will utilize a Raspberry Pi (or other suitable processing hardware) to handle the processing and AI computations. Sensors such as ultrasonic sensors may be used to complement the camera for obstacle detection. 2. Object Detection and Navigation: The system will use real-time object detection to identify obstacles, objects, and points of interest in the user’s environment. The AI assistant will analyze the captured data to understand the layout of the space, identifying both immediate obstacles and desired navigation paths. The system will generate verbal instructions such as "go right," "step forward," or "object detected" to guide the user safely through their environment. 3. Speech Output: The voice feedback will be provided through the glasses' integrated speakers, ensuring that the system communicates with the user in a non-intrusive manner. The assistant will respond to user queries related to navigation, using natural language processing to interpret and respond to specific commands. 3 4. Real-Time Feedback: The system will provide real-time feedback for navigation, assisting users in avoiding obstacles, finding specific objects, or navigating towards a designated destination. In cases where the system cannot understand a query unrelated to navigation, it will respond with a default message like "I do not understand your query." 5. User Interaction: The system will be activated through voice commands, which will trigger the object detection and navigation process. The AI assistant will respond with contextually relevant feedback, ensuring that the user’s needs are met effectively. 6. Software Development: The software for the project will include object detection algorithms, pathfinding logic, and natural language processing (for interpreting voice commands). Integration of speech synthesis libraries will allow for the generation of verbal instructions for the user. The software will be optimized for low-latency processing to ensure that the system functions in real-time. 7. Limitations and Exclusions: This project does not include advanced vision-based algorithms or deep learning models for detailed image analysis beyond basic object detection. The system will be focused on assisting with navigation in familiar environments rather than complex outdoor navigation. Real-time location-based services (such as GPS) are not integrated into the system, as the focus is primarily on local navigation (indoor or immediate outdoor environments). The glasses will be designed to be as lightweight and compact as possible, but the physical form factor will be determined by the hardware limitations. 4 1.4 Objectives The primary objectives of this project are to design and develop a wearable AI-powered navigation assistant system for visually impaired individuals, which provides real-time guidance and feedback to enhance their mobility. The following are the specific objectives for the successful completion of this project: 1. Develop a Wearable Navigation System: Design and build a pair of glasses that integrate key components such as a camera, microphone, speakers, and a processing unit (Raspberry Pi) to ensure a compact and functional wearable device. 2. Real-time Object Detection: Implement an object detection system that utilizes the camera to capture the user’s environment and accurately identify obstacles, objects, and points of interest, such as doors, walls, chairs, and other entities in the vicinity. 3. Voice Command Recognition and Feedback: Integrate a voice recognition system that allows the user to issue commands such as "Is there something in front of me?" or "Navigate to the door." Provide real-time voice feedback through the integrated speakers to guide the user with navigation instructions (e.g., "Go right," "Take three steps forward"). 4. Accurate Pathfinding and Navigation: Develop a pathfinding algorithm that calculates the most effective and obstacle-free route for the user to reach their desired destination within a room or environment. Ensure the system provides verbal navigation cues based on real-time analysis, such as turning directions, step counts, and nearby obstacles. 5. Sensor Integration for Enhanced Navigation: Explore the integration of additional sensors, such as ultrasonic sensors, to complement the object detection system and provide enhanced proximity detection for obstacles. 6. Minimize Latency and Ensure Real-time Processing: Optimize the system’s processing speed to ensure low-latency real-time feedback, allowing the AI assistant to provide seamless and timely guidance to the user without noticeable delays. 5 7. Intuitive User Interface and Interaction: Develop an intuitive user interface that allows for easy interaction with the system, ensuring that the visually impaired user can efficiently operate the glasses via voice commands and receive clear verbal feedback. 8. Lightweight and Comfortable Design: Focus on creating a compact and ergonomic design for the glasses, ensuring that the system remains lightweight and comfortable for long-term wear while maintaining functionality. 9. Prototype Testing and Evaluation: Conduct real-world testing with visually impaired users to evaluate the performance, accuracy, and usability of the system. Gather feedback to refine the design and functionality of the glasses, ensuring that the AI assistant meets the mobility needs of users. 10. Focus on Safety and Accessibility: Ensure the system prioritizes user safety by accurately detecting obstacles and providing timely, actionable guidance to prevent accidents. Make the system accessible by simplifying the interaction process and ensuring that it caters to individuals with varying levels of technical literacy. 6 2. LITERATURE REVIEW Assisting visually impaired individuals in navigation has been a significant challenge in the field of assistive technology, particularly in creating systems that are both effective and user-friendly. ENVISION, a system developed by Khenkar et al. (2016), is a smartphone-based solution designed to address the mobility challenges faced by visually impaired users. This system integrates computer vision techniques and smartphone functionalities to provide real-time navigation assistance, aiming to improve the independence and confidence of its users. The ENVISION system operates by leveraging the smartphone’s camera as its primary sensor. The camera captures the user's environment, which is then analyzed using advanced image processing and object recognition algorithms. These algorithms detect and classify obstacles and essential objects in the user's path, such as doors, walls, and furniture. The identified objects are then used to provide audio feedback to the user, enabling them to navigate their surroundings effectively. This audio guidance is generated in real time, ensuring that users receive immediate and actionable information about their environment. One of the unique features of ENVISION is its ability to create a structured navigation path based on the detected obstacles and the user’s intended destination. By combining object detection with pathfinding algorithms, the system ensures that users can navigate both indoor and outdoor environments safely. Additionally, the system supports user feedback, allowing visually impaired individuals to report inaccuracies or provide suggestions for improvement. This user-centric approach not only enhances the system's reliability but also ensures that it is tailored to meet the specific needs of its target audience. ENVISION’s design also takes into consideration the accessibility of its user interface. Since the system is intended for visually impaired users, it employs a voice-command-driven interaction model, minimizing the need for visual input. Users can issue commands, such as asking for directions or requesting the system to identify specific objects in their vicinity. The system responds with auditory feedback, ensuring that interactions are intuitive and non-intrusive. Despite its innovative approach, ENVISION does face certain limitations. The system relies heavily on the quality of the smartphone’s camera and the environmental conditions in which it operates. For example, poor lighting or dynamic environments with moving obstacles can reduce the accuracy of object detection and pathfinding. Furthermore, the reliance on a smartphone as the primary platform may limit the system's usability for individuals who are not comfortable with smartphone-based interactions. 7 Nevertheless, ENVISION represents a significant step forward in assistive navigation technology. Its integration of real-time object detection, pathfinding, and voice interaction provides a holistic solution for visually impaired users. Khenkar et al. (2016) emphasize that the system’s adaptability and user-focused design make it a promising tool for addressing the everyday mobility challenges faced by visually impaired individuals. By combining advanced computer vision techniques with accessible user interfaces, ENVISION has the potential to serve as a model for future innovations in this domain. Its focus on real-time assistance, user feedback, and environmental adaptability positions it as a valuable contribution to the field of assistive technology. While improvements are necessary to address its limitations, ENVISION demonstrates the feasibility of creating effective, smartphone-based solutions for visually impaired users. Indoor navigation poses unique challenges due to the absence of reliable GPS signals and the complexity of indoor environments. Ng and Lim (2020) address these challenges through their innovative approach of integrating mobile augmented reality (AR) for indoor navigation. Their proposed system leverages AR to provide real-time navigation guidance by overlaying virtual cues onto the user’s real-world environment, offering a highly interactive and intuitive navigation experience. The core functionality of this system lies in its use of AR markers to map indoor spaces. These markers act as visual anchors, enabling the mobile application to identify the user’s current position and orientation within the environment. The system combines AR marker tracking with pathfinding algorithms to generate a navigational path from the user’s current location to their desired destination. By overlaying navigational cues directly onto the camera feed, the system creates an immersive AR experience that simplifies complex navigation tasks. For example, directional arrows or highlighted paths are displayed on the user’s screen, guiding them step by step through the environment. One of the strengths of this system is its accessibility. Unlike conventional navigation systems that rely on complex setups or additional hardware, this AR-based solution only requires a smartphone equipped with a camera. This makes the system portable, cost-effective, and easy to use, especially for environments such as malls, airports, and hospitals, where indoor navigation can often be confusing. Ng and Lim (2020) emphasize the importance of creating a user-friendly interface that prioritizes simplicity and intuitiveness. By utilizing AR technology, users can visually understand their surroundings and receive real-time feedback, which reduces cognitive load and improves navigation accuracy. Another noteworthy feature of the system is its ability to dynamically update navigation paths in real time. 8 As the user moves, the system continuously tracks their position and adjusts the displayed navigation cues accordingly. This adaptability ensures that users can navigate even in dynamic environments where obstacles or changes in layout might occur. Furthermore, the use of AR markers provides a level of precision in localization that surpasses traditional indoor navigation solutions, making it particularly effective in structured indoor spaces. However, Ng and Lim (2020) also acknowledge several limitations of their proposed system. The reliance on AR markers means that the environment must be pre-mapped, with markers strategically placed to enable accurate tracking. This requirement limits the scalability of the system, as each new environment requires additional setup and calibration. Additionally, the effectiveness of AR markers is dependent on lighting conditions and the quality of the smartphone’s camera. Poor lighting or low-resolution cameras may reduce the accuracy of marker detection, which could impact the overall reliability of the navigation system. Another limitation lies in the system's dependency on the user holding the smartphone throughout the navigation process. While this approach is feasible for shorter navigation tasks, it may become inconvenient or fatiguing for prolonged usage. Future iterations of the system could explore integrating AR capabilities into wearable devices, such as smart glasses, to provide a hands-free navigation experience. Despite these challenges, the mobile AR-based navigation system proposed by Ng and Lim (2020) represents a significant advancement in indoor navigation technology. The use of augmented reality not only enhances the user’s understanding of their environment but also provides an interactive and engaging navigation experience. By bridging the gap between physical and virtual environments, the system demonstrates the potential of AR to transform indoor navigation. This research has broad implications for various applications, including accessibility for visually impaired individuals, enhanced navigation in public spaces, and optimized workflows in large facilities. While the current implementation focuses on AR markers and smartphone interfaces, the system’s underlying principles could be expanded to incorporate other technologies such as simultaneous localization and mapping (SLAM) or computer vision-based navigation. These advancements would allow for greater scalability and flexibility, addressing some of the limitations identified in the study. Ng and Lim’s (2020) work showcases the potential of augmented reality as a tool for improving indoor navigation. By combining AR technology with mobile accessibility, they provide a framework that is both innovative and practical. 9 SUMMARY TABLE Table 1: Literature Summary Table S. YEAR AUTHOR(S) NO. 1 METHODOLOGY LIMITATIONS WORK 2016 Shoroog Khenkara, Creation of the It utilizes GPS for The system is Hanan Alsulaiman, ENVISION pathfinding and a novel constrained by system, which machine-learning-based the processing assists visually obstacle detection method power of impaired users in using real-time video smartphones and navigating safely streaming. The system can struggle with using processes video data on the lighting changes, smartphones smartphone to detect static textures, and without additional and dynamic obstacles and dynamic hardware. provides audio navigation obstacles in instructions based on the complex detection. environments. Development of a The system uses built-in The system is mobile-based sensors such as magnetic limited to single- indoor navigation field, Wi-Fi signals, and floor navigation. system using inertial sensors to detect the It requires time- augmented reality user’s location. It integrates consuming (AR) and sensor ARCore technology to fingerprinting for fusion technology provide real-time AR-based indoor for accurate navigation guidance. The positioning and indoor positioning pathfinding is done using may need and navigation the Ant Colony improvements in without the need Optimization (ACO) user interaction, for additional algorithm. The application such as stopping hardware. was developed and tested the AR guide at within the Sunway each node for University campus. better user Shahad Ismail, Alaa Fairaq, Salma Kammoun Jarraya, Hanêne Ben-Abdallah 2 PROPOSED 2020 Xin Hui Ng, Woan Ning Lim navigation 10 3 4 5 2023 2016 2019 Shantappa G. Development of Utilizes Raspberry Pi, Limited to Gollagi, Kalyan smart glasses for OpenCV, and deep English text Devappa Bamane, the blind using learning. The glasses recognition, Dipali Manish artificial capture images via a works efficiently Patil, Sanjay B. intelligence, camera and process them only in good Ankali, Bahubali aimed at helping using Optical Character lighting M. Akiwate blind individuals Recognition (OCR) to conditions, and read and navigate convert text into audio. It struggles with independently. also incorporates ultrasonic low-distance sensors for obstacle obstacle detection. accuracy. Esra Ali Hassan, Low-cost Uses a Raspberry Pi 2 with Limited by Tong Boon Tang assistive smart a camera to capture images. hardware glasses designed Text is processed using performance, for visually Tesseract OCR and then image quality impaired students, converted to speech using from the focusing on text-to-speech software. Raspberry Pi reading printed The glasses also feature camera, and text. push buttons for user input currently and provide audio output supports only via an earpiece. reading tasks Miss. H Harshitha Development of The system uses a It requires R Shetty Aira, a service to smartphone app or smart constant access connect visually glasses to stream live video to the internet impaired people to agents, who then provide and trained with remote audio instructions to users. agents, making it agents using The glasses feature a wide- dependent on smart glasses for angle camera, and the external support. visual assistance. service connects users to Additionally, its agents via a simple app effectiveness can interface. be limited by network conditions and agent availability 11 3. EXISTING SYSTEM In this section, we will explore the current solutions and systems that exist to assist visually impaired individuals with navigation and mobility. These systems provide insight into the limitations and challenges that have been addressed so far and highlight the gaps that this project seeks to fill. 1. White Cane Description: The white cane has been a traditional tool used by visually impaired individuals for mobility and navigation. This simple device helps users detect obstacles and changes in terrain by tapping it in front of them, relying on tactile feedback. It allows individuals to navigate various environments, including sidewalks, Figure 1: White Cane streets, and public spaces, providing a sense of orientation and safety. In addition to helping detect physical obstacles, the white cane can also aid in identifying boundaries such as curbs, steps, and walls. Limitation: However, despite its widespread use, the white cane has several limitations. It offers no information about the surroundings beyond physical touch, which means users cannot identify objects or receive detailed information about their environment. This lack of contextual awareness makes it difficult for individuals to navigate more complex spaces or to detect specific objects, such as chairs or low-hanging obstacles. The white cane also requires the user to actively sweep it around, which can be tiring, and it does not provide any real-time feedback or dynamic guidance to adapt to changes in the environment. 2. Guide Dogs Description: Guide dogs have been one of the most effective mobility aids for visually impaired individuals for many years. These specially trained dogs help their handlers navigate environments safely by guiding them 12 around obstacles, crossing streets, and offering support in unfamiliar settings. The use of a guide dog allows individuals to maintain a degree of independence and mobility, providing a level of autonomy in public spaces. Guide dogs are also capable of responding to specific commands, and their training enables them to make real-time decisions about obstacles or potential hazards, such as stopping at curbs or Figure 2: Guide Dog avoiding oncoming traffic. Limitation: However, while guide dogs are highly effective, they come with significant challenges. The most obvious limitation is that they are living animals, which means they require constant care, training, and attention. This makes them an expensive and time-consuming option for many people. Additionally, guide dogs cannot provide detailed information about the environment, such as identifying specific objects or obstacles. They are also not capable of indoor navigation in environments with more complex layouts. Furthermore, there are some individuals who may not be able to use guide dogs due to allergies, fear, or preference against animals, limiting their accessibility. 3. Smartphone Applications Description: Smartphone applications for the visually impaired have become increasingly popular in recent years. Apps like Seeing AI, Be My Eyes, and Aira use a smartphone’s camera and sensors to help users navigate and understand their environment. These apps can identify objects, read printed text, provide navigation instructions, and offer live assistance via video calls with sighted volunteers. The integration of the smartphone's camera with AI-based object recognition and Figure 3: Seeing AI App GPS functionality allows users to gain real-time feedback about their surroundings and navigate with increased awareness. 13 Limitation: Despite their advancements, smartphone applications have several limitations. First, they rely heavily on the user carrying the phone, which is not always convenient or practical. This need for a handheld device limits the hands-free capabilities of these apps, preventing users from interacting freely with their environment. Moreover, the accuracy of object detection and the reliability of GPS are often compromised in indoor settings or places with poor satellite connectivity. The smartphone's camera may also struggle with real-time processing in dynamic environments, making it difficult to provide immediate and reliable feedback for navigation. Lastly, these applications may consume a lot of battery power and can be cumbersome to use for long periods. 4. Wearable Devices Description: Wearable devices designed specifically for visually impaired users, such as OrCam and eSight, have introduced a new level of mobility and assistance. These devices use built-in cameras to capture the user's environment and provide feedback through a speaker, allowing users to hear descriptions of objects, faces, or text. Some systems can even provide object recognition, text Figure 4: eSight reading, and real-time feedback about obstacles in the user's path, giving users a better sense of awareness. These devices are worn like glasses or attached to existing eyewear, offering a hands-free experience and providing more autonomy for the user compared to traditional methods like the white cane. Limitation: However, these wearable devices come with their own set of limitations. The cost of such devices is often prohibitively high, making them inaccessible to a large portion of the visually impaired population. Moreover, the technology is not yet perfect and can sometimes provide inaccurate readings, especially when faced with complex or dynamic environments. Although these devices are hands-free, they still require manual interaction to activate certain features, and the level of interactivity they offer is often limited. These devices also struggle with indoor navigation and offer limited real-time navigation support compared to more advanced systems. Furthermore, users may find them cumbersome to wear for long periods, and the devices themselves may not always blend seamlessly with personal preferences or styles. 14 4. SYSTEM ARCHITECTURE The system architecture for the AI-powered navigation glasses for the visually impaired consists of multiple layers that work together to provide a seamless and efficient user experience. The system integrates hardware components like cameras, sensors, microphones, and speakers with software modules, including object detection, navigation assistance, and voice-based interaction. Figure 5: Architecture Diagram 1. Hardware Layer The hardware layer is the foundation of the system and includes the following key components: Camera: A small, lightweight camera is mounted on the glasses to continuously capture video of the user’s environment. This camera feeds live video data to the system for processing, enabling object detection, obstacle recognition, and spatial mapping of the surroundings. Microphone: The microphone integrated into the glasses allows the user to issue voice commands. This input is captured and processed by the system, enabling interaction with the AI assistant. Speakers: The speakers on the glasses provide real-time audio feedback to the user, delivering navigation instructions, object descriptions, and alerts regarding obstacles. Raspberry Pi: A Raspberry Pi acts as the central processing unit (CPU) of the system. It handles the processing of camera data, voice commands, and running the AI-based algorithms for object detection, navigation, and real-time decision-making. Power Supply: A portable power supply, such as a small rechargeable battery, powers the system, providing sufficient run-time for continuous operation of the glasses. 15 2. Software Layer The software layer is responsible for processing the raw data received from the camera and microphone and providing the necessary output through the speakers. Key software modules include: Object Detection Module: The camera input is processed through an object detection model that identifies and classifies various objects in the user's environment, such as chairs, walls, tables, doors, or any obstacles that could impede navigation. This module relies on YOLOv8 machine learning algorithms to detect objects in real-time. Navigation System: Based on the object detection data, the navigation system provides real-time feedback to guide the user. The navigation system uses algorithms that calculate the distance to objects, detect available paths, and help the user navigate toward their intended destination. Speech Recognition and Voice Command Interface: This module allows users to interact with the glasses through voice commands. The microphone captures the user's spoken input, which is then converted into text using a speech recognition engine. The AI assistant interprets the command and generates an appropriate response. This includes providing object descriptions, answering navigation-related queries, or even recalculating directions if the user changes their destination. Speech Synthesis Module: The system uses a text-to-speech (TTS) engine to convert the generated navigation instructions or object descriptions into spoken language. This feedback is then played through the speakers integrated into the glasses, providing real-time guidance. User Interface: While the primary interaction takes place through voice commands, the system also includes a mobile app interface that displays a visual map of the scanned environment and allows the user to adjust settings such as preferred navigation routes, voice commands, or object categories. 3. Connectivity Layer The connectivity layer ensures that the various components of the system work together seamlessly. The main components here are: Bluetooth and Wi-Fi: Bluetooth and Wi-Fi modules enable communication between the glasses and the external devices. The mobile app can send commands to the glasses via Bluetooth, and the glasses can transmit data back to the app for visualization purposes. Sensors: Ultrasonic sensors are integrated into the glasses to enhance the system's ability to detect obstacles or measure distances more accurately in certain environments. 16 4. AI Layer The AI layer plays a critical role in interpreting the data from the camera and sensors, making real-time decisions, and providing intelligent guidance to the user. This includes: Object Classification: Using pre-trained machine learning models, the system classifies various objects in the environment. These models are optimized for real-time performance and capable of recognizing common obstacles or furniture. Pathfinding Algorithms: The navigation assistant uses algorithms that help determine the best path based on the surrounding environment and the user’s intended destination. These algorithms take into account the detected obstacles and plan safe routes. Natural Language Processing (NLP): The system uses NLP to process the voice commands issued by the user, allowing it to interpret complex instructions or queries and respond appropriately without relying on predefined commands. 5. Feedback Layer The feedback layer provides real-time information to the user through audio feedback. It includes: Navigation Instructions: The system communicates the next steps to the user through audio, helping them navigate obstacles and reach their destination. The instructions are simple and concise, focusing on key actions like turning, stepping forward, or avoiding obstacles. Environmental Feedback: The system notifies the user of nearby objects or obstacles through verbal descriptions. For instance, it might say, “There’s a chair 2 feet ahead” or “There is a wall to your left.” This helps the user maintain awareness of their surroundings. Error Handling: The system provides responses when it encounters an unknown or unrecognized query from the user, such as saying, “I do not understand your query.” This helps keep the interaction fluid and prevents confusion. 17 5. DATA SET INFORMATION To build a robust AI-powered navigation system, multiple datasets were utilized, each carefully chosen to address specific components of the project. These datasets provided the foundation for developing object detection, depth estimation, speech recognition, and navigation capabilities. The following sections describe the datasets used and their roles in the system: Image Dataset For object detection and classification, the COCO (Common Objects in Context) dataset was employed. COCO is a widely recognized benchmark dataset that offers over 200,000 labeled images across 80 object categories, including commonly encountered indoor and outdoor items such as chairs, tables, and doors. The detailed annotations provided in the dataset, including bounding boxes and segmentation masks, were instrumental in training the object detection model to recognize and locate objects in real-world environments. This ensured that the system could accurately identify obstacles and essential items, forming the backbone of its object recognition capabilities. Depth Estimation Dataset The depth estimation model was trained using the NYU Depth V2 dataset, a rich collection of RGB images paired with corresponding depth maps, captured in various indoor settings. The dataset includes over 1,400 densely labeled scenes, providing detailed spatial information about object distances and room layouts. By leveraging the high-quality depth annotations from NYU Depth V2, the system was able to infer spatial relationships effectively, enabling accurate navigation instructions and ensuring the safety of the user in unfamiliar environments. Speech Dataset For developing the speech recognition component, the Common Voice dataset by Mozilla was utilized. Common Voice is a crowd-sourced dataset containing diverse speech samples across multiple languages, accents, and varying levels of background noise. This diversity allowed the system to interpret user commands reliably, even in acoustically challenging conditions. Incorporating such a dataset enhanced the assistant’s ability to process voice inputs from users with different speaking styles, thereby improving the system’s inclusivity and accessibility. 18 Navigation Dataset For training the navigation capabilities of the system, the TUM RGB-D Dataset was used. This dataset provides RGB-D data from real-world indoor environments, which is crucial for teaching the AI assistant how to navigate through rooms, recognize pathways, and avoid obstacles. The TUM RGB-D dataset includes sequences captured with synchronized RGB and depth cameras, making it suitable for tasks such as Simultaneous Localization and Mapping (SLAM) and path planning. Its focus on indoor scenes aligns perfectly with the goal of creating an efficient navigation system for visually impaired users. Custom Dataset Creation A custom dataset was created to enhance the system’s adaptability to specific environments and user needs. Using the camera integrated into the glasses, data was collected from various indoor environments, capturing room layouts, object placements, and obstacle configurations. The dataset was meticulously annotated to provide accurate ground truth for training the navigation system. This custom dataset allowed the AI assistant to dynamically generate navigation paths and provide real-time, step-by-step guidance tailored to the user’s surroundings. 19 6. PROPOSED METHODOLOGY The proposed methodology for developing AI-powered navigation glasses focuses on integrating advanced hardware, software, and AI technologies to assist visually impaired individuals in navigating their surroundings with accuracy and ease. This methodology emphasizes real-time object detection, voice interaction, and adaptive navigation capabilities, ensuring an effective and user-friendly experience. Hardware Integration The hardware architecture is designed to balance functionality, portability, and comfort. The primary input device is a compact wireless camera, strategically placed on the bridge of the glasses to capture a continuous stream of the user’s environment. This camera plays a critical role in real-time object detection and depth estimation using ultrasonic sensors. Additionally, the glasses are equipped with an integrated microphone and speaker. The microphone allows the user to issue voice commands, while the speakers provide immediate audio feedback, delivering navigation instructions and updates. A Raspberry Pi serves as the central processing unit. Its capabilities allow it to handle computationally intensive tasks such as object detection, speech recognition, and navigation pathfinding. The Raspberry Pi is compact yet powerful, ensuring the glasses remain lightweight while still capable of running the required AI algorithms. Software Framework The software is the backbone of the system, designed to process data from the hardware components and translate it into actionable outputs. At the core of the software is the object detection system, which uses a YOLO-based deep learning model to identify objects and obstacles in the environment. This system is trained on the COCO dataset, ensuring a broad understanding of commonly encountered objects, from chairs to doors and other navigational aids. Depth estimation is another crucial aspect of the software framework. Using pre-trained models based on the NYU Depth V2 dataset, the system generates depth maps from the video input, calculating spatial relationships and distances between objects. This depth information is essential for guiding users around obstacles and determining the safest and most efficient paths. Speech recognition forms the interface between the user and the system. The glasses utilize models trained on the Common Voice dataset, enabling them to accurately interpret user commands in various accents and 20 environmental conditions. This feature ensures seamless interaction, as users can issue commands like "Guide me to a chair" or "What’s in front of me?" without relying on pre-defined keywords or rigid commands. The navigation system combines object detection, depth estimation, and voice commands to provide stepby-step guidance. It uses a combination of the TUM RGB-D dataset and custom datasets created from realworld user environments. This hybrid approach ensures adaptability and precision in both known and unfamiliar settings. Data Processing and AI Models Data processing is critical to the functionality of the navigation glasses. Video streams from the camera are preprocessed to ensure efficient real-time analysis. Frames are resized and normalized before being passed through the object detection model, which outputs labeled bounding boxes for detected objects. Simultaneously, depth estimation models generate spatial data, allowing the system to understand object distances and layouts. Voice data captured through the microphone undergoes cleaning and preprocessing to remove noise. This data is then passed through speech recognition and natural language processing models to extract the user’s intent. The system employs state-of-the-art NLP techniques to interpret commands contextually, ensuring accurate responses. The AI models, including YOLOv8 for object detection, depth estimation networks, and NLP frameworks, work in tandem to create a robust, real-time interaction loop between the user and the environment. Workflow The system operates through a series of coordinated processes. The camera continuously streams video, which is analyzed for object detection and depth estimation. At the same time, the microphone captures voice commands, converting them into text through speech recognition. The NLP system then interprets the user’s intent and determines the appropriate response. If the command involves navigation, the system calculates the optimal path, providing step-by-step instructions such as "Turn right and walk five steps." For queries related to object detection, the system scans the environment and lists the detected objects. If the query falls outside the system’s scope, it responds with a default message like "I do not understand your query." 21 This workflow ensures that users receive timely, relevant information in a conversational format, maintaining the natural interaction style of modern AI assistants like ChatGPT. Customization and Learning The proposed methodology includes provisions for customization and adaptive learning. Through a mobile app interface, users can calibrate the system to suit their specific needs and environments. For instance, the app allows for adjustments to object detection sensitivity, depth estimation accuracy, and voice command preferences. Additionally, custom datasets are created by recording and annotating real-world environments using the camera on the glasses. These datasets help the system learn specific object layouts, unique room designs, or frequently encountered obstacles, enhancing its performance over time. Outcome The proposed methodology delivers a comprehensive solution for navigation assistance. By combining advanced hardware with sophisticated AI models, the system provides visually impaired users with a reliable and intuitive tool for navigating diverse environments. The real-time feedback loop ensures accuracy, while customization options and continuous learning enable the system to adapt to individual user requirements. This methodology not only addresses existing limitations in navigation aids but also sets a benchmark for future innovations in assistive technology. 22 7. RESULTS AND DISCUSSION Object Detection Performance To evaluate the performance of our YOLOv8 model, we utilized the COCO dataset for training and testing. The detection process involves feeding images into the YOLOv8 model, which performs object detection and classification in real time. The following code snippet demonstrates how the model loads an image and processes it to detect objects: def detect_objects_from_camera_yolo(): cap = cv2.VideoCapture(0) if not cap.isOpened(): print("Error: Could not open the camera.") return while True: ret, frame = cap.read() if not ret: print("Error: Failed to grab frame.") break frame = cv2.resize(frame, (640, 480)) results = yolo_model(frame) detected_objects = [] for result in results[0].boxes: class_id = int(result.cls[0]) label = yolo_model.names[class_id] # Get the label name detected_objects.append(label) detected_objects_list = detected_objects cap.release() cv2.destroyAllWindows() 23 In the context of our project, the YOLO model performs the following steps: 1. Image Input: An image captured by the camera on the glasses is fed into the YOLO model. This image could be a frame from the user's environment, which is continuously updated as the user moves. 2. Grid Division: YOLO divides the image into a grid of cells. Each grid cell is responsible for detecting objects that fall within its region. 3. Bounding Box Prediction: For each detected object, YOLO predicts a bounding box. This box indicates the location and size of the object within the image. It also predicts a confidence score indicating how confident the model is that the object belongs to the predicted class. 4. Class Prediction: YOLO classifies the detected objects into predefined categories such as chairs, tables, doors, etc. The model uses a probability distribution to predict the most likely object class for each bounding box. 5. Final Output: YOLO outputs the image with bounding boxes drawn around the detected objects, each labeled with the predicted class and confidence score. An example of how YOLO performs object detection can be seen in the image below: Figure 6: Image detection using YOLOv8 Figure 7: YOLOv8 model metrics plot 24 Glasses Hardware Setup The hardware setup for the glasses involves attaching key components such as the camera, microphone, speaker, and Raspberry Pi. The camera provides real-time video input to the system, while the microphone and speakers allow the user to interact with the glasses using voice commands. Here is a picture showcasing the glasses with all the components attached: Figure 8: The glasses with hardware components attached. This setup enables the AI-powered glasses to function as an all-in-one device for navigation assistance, ensuring that visually impaired users can receive guidance in real-time through audio cues. Speech Recognition and Voice Interaction The speech recognition system, trained on the Common Voice dataset, successfully interpreted voice commands across various accents and environmental conditions. The system maintained an average recognition accuracy of 88% in quiet settings and 81% in noisy environments. This enabled users to interact seamlessly with the glasses, issuing commands and receiving audio feedback. Some difficulties were encountered with ambiguous commands or overlapping speech inputs. Implementing more sophisticated natural language understanding (NLU) models could further improve this component. 25 Navigation and Pathfinding The system uses augmented reality (AR) scanning to continuously update the user’s surroundings in realtime. The AR scanning process involves capturing video frames from the camera, processing them for object detection, and overlaying a navigational path onto the user’s view. This dynamic map helps the system create an up-to-date representation of the room or environment, which is essential for accurate pathfinding. In the AR scanning mode, the glasses scan the surroundings as the user moves, identifying obstacles and calculating the best path to a desired location. This information is then presented to the user through voice commands, such as, "Go straight and take a right turn after three steps." The following image illustrates how the AR scanning system works in the navigation module. The glasses capture the room's layout in real-time, displaying detected objects and a safe path for the user to follow. Figure 9: Augmented Reality Scanning for Pathfinding Discussion of Challenges While the results were promising, several challenges emerged during testing: Hardware Limitations: The Raspberry Pi struggled with processing demands during intensive tasks, causing occasional delays. Upgrading to a more powerful processing unit could enhance performance. Environmental Variability: Changes in lighting, noise, and object layouts posed difficulties for the system. Expanding dataset diversity and incorporating additional sensors could mitigate these effects. 26 User Adaptation: New users required time to familiarize themselves with the system, particularly the voice interaction feature. Enhanced user training materials and tutorials could improve adoption rates. Implications and Future Work The results demonstrate that AI-powered navigation glasses have the potential to significantly improve mobility and independence for visually impaired individuals. The integration of real-time object detection, depth estimation, and voice interaction proved to be a robust solution for navigation assistance. Future iterations of the project could explore: Incorporating additional sensors like ultrasonic or lidar for enhanced environment understanding. Developing a more compact and efficient hardware solution to improve user comfort. Expanding functionality to include new features like GPS for outdoor navigation or gesture-based controls. 27 REFERENCES 1. Redmon, J., & Farhadi, A. (2018). YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767. Retrieved from https://arxiv.org/abs/1804.02767 2. Lin, T.-Y., Maire, M., Belongie, S., et al. (2014). Microsoft COCO: Common Objects in Context. European Conference on Computer Vision (ECCV). Retrieved from https://cocodataset.org 3. Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor Segmentation and Support Inference from RGBD Images. European Conference on Computer Vision (ECCV). 4. Mozilla Foundation. Common Voice Dataset. Retrieved from https://commonvoice.mozilla.org 5. ETH Zurich. ETH Pedestrian Dataset. Retrieved from https://vision.ee.ethz.ch/en/datasets/ 6. Raspberry Pi Foundation. Raspberry Pi 4 Model B Documentation. Retrieved from https://www.raspberrypi.org/documentation 7. LabelImg and CVAT Tools. Retrieved from https://github.com/tzutalin/labelImg and https://opencv.github.io/cvat 8. TUM RGB-D Dataset. (2012). Retrieved from https://vision.in.tum.de/data/datasets/rgbd-dataset 9. NVIDIA DeepStream SDK Documentation. Retrieved from https://developer.nvidia.com/deepstream-sdk 10. Google ARCore Documentation. Retrieved from https://developers.google.com/ar 11. OpenCV Documentation. Retrieved from https://opencv.org 12. Khenkar, S., Alsulaiman, H., Ismail, S., Fairaq, A., Kammoun Jarraya, S., & Ben-Abdallah, H. (2016). ENVISION: Assisted Navigation of Visually Impaired Smartphone Users. Procedia Computer Science, 100, 128-135. https://doi.org/10.1016/j.procs.2016.09.132 13. Ng, X. H., & Lim, W. N. (2020). Design of a Mobile Augmented Reality-based Indoor Navigation System. 2020 IEEE International Conference on Human-Machine Systems (ICHMS), 1–7. https://doi.org/10.1109/ICHMS49158.2020.9090907 14. Brown, T., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (NeurIPS). 15. Google AI Blog. (2020). On-Device Machine Learning: Federated Learning. Retrieved from https://ai.googleblog.com 16. TensorFlow Documentation. (2023). Retrieved from https://www.tensorflow.org 28
0
You can add this document to your study collection(s)
Sign in Available only to authorized usersYou can add this document to your saved list
Sign in Available only to authorized users(For complaints, use another form )