Walchand Institute of Technology, Solapur
Honors in Artificial Intelligence and Machine Learning
T.Y. B.Tech. (Computer Science and Engineering), Semester VI
21CSU6HM3: NATURAL LANGUAGE PROCESSING

Teaching Scheme
Lecture: 3 Hours/Week
Practical: 2 Hours/Week

Examination Scheme
ESE – 60 Marks
ISE – 40 Marks
ICA – 25 Marks

Natural Language Processing (NLP)

Introduction
• Natural Language Processing (NLP) is, in essence, about teaching machines to understand human languages and extract meaning from text.
• This course is intended as a theoretical and methodological introduction to the most widely used and effective current techniques, strategies and toolkits for natural language processing.
• This course also covers the basics of syntax, semantic analysis and discourse analysis, and carries them through to machine translation.
• Pre-requisites: a basic course in an object-oriented programming language, theory of computation, and parsers.

COURSE OUTCOMES:
At the end of the course students will be able to
1. Understand the fundamentals of Natural Language Processing.
2. Analyze how words are formed morphologically and how they are related to each other.
3. Develop strategies for language modeling, syntax and semantic analysis.
4. Design, implement and analyze Natural Language Processing algorithms for real-world applications.

Section I
Unit I: Introduction (6)
Introduction to NLP, Machine Learning and NLP, Why is NLP hard?, Programming Languages vs Natural Languages, Are natural languages regular?, Finite automata for NLP, Stages of NLP, Challenges (open problems) in NLP.
Basics of Text Processing: Tokenization, Stemming, Lemmatization, Part-of-Speech Tagging.

Introduction to NLP
• Natural Language Processing (NLP) is a branch of computer science and artificial intelligence that deals with the interaction between computers and humans using natural language.
• The goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both effective and efficient.

Natural Language Processing (NLP) is the study of the computational treatment of natural (human) language; in other words, teaching computers how to understand (and generate) human language. It is a field of Computer Science, Artificial Intelligence and Computational Linguistics.
Natural language processing systems take strings of words (sentences) as their input and produce structured representations capturing the meaning of those strings as their output. The nature of this output depends heavily on the task at hand.

NLP has a wide range of applications:
• Chatbots and virtual assistants: these use NLP to understand natural language and provide human-like responses to queries and requests.
• Sentiment analysis: NLP can be used to analyze social media posts, customer reviews, and other types of text to determine the sentiment of the writer.
• Machine translation: NLP is used to translate text from one language to another, as in Google Translate and Microsoft Translator.
• Text summarization: NLP can be used to summarize lengthy documents or articles, making it easier for people to digest large amounts of information.
• Named entity recognition: NLP can be used to automatically identify and classify named entities such as people, places, and organizations within a text.
• Speech recognition: NLP can be used to recognize and transcribe spoken language, enabling applications such as virtual assistants and speech-to-text systems.
• Text classification: NLP can be used to classify text into categories, such as spam or non-spam emails, or positive and negative sentiment in customer reviews.
• Question answering: NLP can be used to answer questions posed by users, such as on search engines or in chatbots.
• Automatic summarization: NLP can be used to automatically summarize news articles or other long-form content, giving users a quick overview of the content.
• Language modelling: NLP can be used to create models of how language is used in a particular domain, such as legal or medical language; these models can then be used to generate new text in that domain.
• Natural Language Generation (NLG): the process of generating human-like language from structured data or machine representations. It is used in applications such as chatbots and automated report generation.

Perspectivising NLP: Areas of AI and their inter-dependencies
[Figure: areas of AI shown as inter-dependent nodes: Search, Logic, Machine Learning, NLP, Vision, Knowledge Representation, Planning, Robotics, Expert Systems]
AI is the forcing function for Computer Science.

What is NLP?
A branch of AI with two goals:
• Science goal: understand the way language operates.
• Engineering goal: build systems that analyze and generate language; reduce the man-machine gap.

Machine Learning (ML) and NLP
• Machine Learning (ML) is a subfield of artificial intelligence that focuses on building systems that can learn and improve from experience without being explicitly programmed.
• NLP and ML are closely related because ML algorithms are often used in NLP applications to automatically learn patterns and relationships in language data.
Some common ML techniques used in NLP include:
1. Supervised learning: A machine learning model is trained on a labeled dataset, where each example is labeled with the correct output. For example, a sentiment analysis model can be trained on a dataset of customer reviews, where each review is labeled as positive or negative. Once trained, the model can predict the sentiment of new, unlabeled reviews.
2. Unsupervised learning: A machine learning model is trained on an unlabeled dataset, and the goal is to learn patterns and relationships in the data without any specific guidance.
For example, a clustering algorithm can be used to group similar documents together based on their content.
3. Semi-supervised learning: A machine learning model is trained on a combination of labeled and unlabeled data. This approach can be useful when labeled data is scarce or expensive to obtain.
4. Deep learning: A subfield of ML that uses neural networks with multiple layers to learn complex patterns in data. Deep learning has been particularly successful in NLP applications such as machine translation and text classification.
• Together, NLP and ML have led to significant advances in natural language understanding and communication, and have numerous practical applications in industries such as healthcare, finance, and marketing.

Machine learning (ML) and natural language processing (NLP) are two closely related fields that are often used together in applications such as speech recognition, language translation, and chatbots. While the two fields overlap, there are some key differences:
1. Focus: Machine learning is a general term for the process of teaching a computer system to recognize patterns in data and make predictions based on those patterns. NLP, on the other hand, focuses specifically on the processing and analysis of human language.
2. Data: Machine learning algorithms can be applied to any type of data, such as images, audio, numerical data, and text. NLP, however, is specifically focused on analyzing and processing textual data.
3. Techniques: Machine learning techniques can be used in NLP, but there are also many specialized tasks and techniques that are specific to NLP, such as part-of-speech tagging, named entity recognition, and sentiment analysis.
4. Tools: There are many machine learning libraries and frameworks that can be used for a wide range of applications, such as TensorFlow and scikit-learn. For NLP, there are specific tools and libraries such as NLTK, spaCy, and Gensim that are designed for processing and analyzing natural language.
5. Applications: While machine learning can be applied to a wide range of applications, NLP is specifically focused on applications related to human language, such as speech recognition, language translation, and chatbots.
In summary, while there is overlap between machine learning and NLP, NLP is a specialized field focused specifically on the processing and analysis of human language, while machine learning is a more general field that can be applied to a wide range of data types and applications.

Machine Learning (ML) and Natural Language Processing (NLP) are both subfields of artificial intelligence (AI), but they differ in their goals and approaches.
Machine learning is the broader field: it uses algorithms to analyze data and learn patterns from it. Machine learning models can then make predictions or classify data based on the patterns they have learned from past examples.
Natural Language Processing, on the other hand, focuses specifically on the interaction between computers and human language. NLP involves the development of algorithms and models that can understand, analyze, and generate human language.
While both fields use algorithms to analyze data, the data analyzed in NLP is typically text-based, whereas machine learning can be applied to a wide range of data types, including images, audio, and numerical data.
Another key difference is that NLP often calls for more complex models than many other machine learning tasks. NLP models must capture the meaning and context of language as well as the grammar and syntax of sentences. This requires a deeper understanding of language and how it works, as well as the ability to deal with ambiguity and variability in human language.
Machine learning and NLP are complementary fields that are often used together to develop powerful AI applications.
Machine learning provides the foundation for analyzing and understanding data, while NLP enables computers to interact with humans in a more natural and intuitive way.

Why NLP is complex
Natural language is extremely rich in form and structure, and very ambiguous. Two central questions are:
• How to represent meaning.
• Which structures map to which meaning structures.
One input can mean many different things, and ambiguity can arise at different levels:
• Lexical (word-level) ambiguity: different meanings of words.
• Syntactic ambiguity: different ways to parse the sentence.
• Interpreting partial information: how to interpret pronouns.
• Contextual information: the context of a sentence may affect the meaning of that sentence.

Attachment ambiguity in natural language processing
• Attachment ambiguity is a type of syntactic ambiguity.

Syntactic ambiguity
• Syntactic ambiguity is ambiguity about the syntactic structure of a sentence: the sentence can be parsed in more than one way, and so interpreted in more than one way. The doubt is about which of the possible structures is the correct one.
• For example, the phrase “old men and women” is ambiguous: the adjective “old” may apply to both “men” and “women”, or to “men” alone.

Attachment ambiguity
It arises from uncertainty about which part of a sentence a phrase or clause attaches to. It commonly occurs when a sentence contains one or more prepositional phrases.

Example 1
In the sentence “the boy saw the girl with the telescope”, the uncertainty is whether the prepositional phrase “with the telescope” attaches to “the girl” or to the boy's act of seeing (the verb “saw”). Depending on the attachment, the sentence means:
1. The boy saw the girl carrying a telescope.
2. The boy saw the girl through the telescope.
The first meaning arises if we attach the prepositional phrase to “the girl”, whereas the second arises if we attach it to the verb “saw”, so that the telescope is the instrument the boy uses.

Example 2
Consider the sentence “Guna ate an ice cream with fruits from Chennai”. It contains two prepositional phrases, “with fruits” and “from Chennai”. The possible meanings are as follows:
1. Guna, who is from Chennai, ate an ice cream filled with fruits.
2. Guna ate an ice cream filled with fruits, and the ice cream was brought from Chennai.
3. Guna, who is from Chennai, ate the ice cream with the help of fruits.
4. Guna, with the help of fruits, ate the ice cream, which was brought from Chennai.
Here the two prepositional phrases give four possibilities. Each one arises from how we attach “with fruits” and “from Chennai”: either to “Guna” (and his act of eating) or to the “ice cream”.

Prepositional Phrase (PP) Attachment Problem
Pattern: V – NP1 – P – NP2 (here P means preposition)
Does the PP (P NP2) attach to NP1, or does it attach to V?

Parse Trees for a Structurally Ambiguous Sentence
Let the grammar be:
S → NP VP
NP → DT N | DT N PP
PP → P NP
VP → V NP PP | V NP

For the sentence “I saw a boy with a telescope”, there are two parse trees. (The trees label determiners Det, written DT in the grammar, and analyze the pronoun “I” as NP → N, a rule not listed above.)

Parse Tree 1 (PP attached to the noun phrase “a boy”):
[S [NP [N I]] [VP [V saw] [NP [Det a] [N boy] [PP [P with] [NP [Det a] [N telescope]]]]]]

Parse Tree 2 (PP attached to the verb phrase):
[S [NP [N I]] [VP [V saw] [NP [Det a] [N boy]] [PP [P with] [NP [Det a] [N telescope]]]]]

Lexical Knowledge Networks
• Lexical knowledge networks, also known as lexical semantic networks, are a type of knowledge representation used in NLP and computational linguistics.
• They represent the relationships between words based on their meaning or semantic content.
• In a lexical knowledge network, words are represented as nodes, and the relationships between words are represented as edges.
• The edges may be labeled with a specific relationship type, such as "synonym", "antonym", "hypernym" (a word that is more general than another word), or "hyponym" (a word that is more specific than another word).
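The node-and-labeled-edge description above can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular resource is implemented: the class name `LexicalNetwork`, its methods, and the handful of word relations are all assumptions made up for this example.

```python
from collections import defaultdict

# Each relation and the label of its reverse edge.
INVERSE = {
    "synonym": "synonym",
    "antonym": "antonym",
    "hypernym": "hyponym",
    "hyponym": "hypernym",
}


class LexicalNetwork:
    """Toy lexical knowledge network: words are nodes, labeled relations are edges."""

    def __init__(self):
        # edges[word][relation] is the set of words reached by that relation.
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, word, relation, other):
        """Record that `other` is a <relation> of `word`, plus the reverse edge."""
        self.edges[word][relation].add(other)
        self.edges[other][INVERSE[relation]].add(word)

    def related(self, word, relation):
        """Words linked to `word` by `relation`, in alphabetical order."""
        return sorted(self.edges.get(word, {}).get(relation, set()))

    def hypernym_chain(self, word):
        """All more general terms reachable from `word`, nearest first."""
        found, queue = [], [word]
        while queue:
            current = queue.pop(0)
            for parent in self.related(current, "hypernym"):
                if parent not in found:
                    found.append(parent)
                    queue.append(parent)
        return found


# Hand-written toy data (hypothetical), not taken from any real resource.
net = LexicalNetwork()
net.add("car", "synonym", "automobile")
net.add("car", "hypernym", "vehicle")
net.add("vehicle", "hypernym", "object")
net.add("hot", "antonym", "cold")

if __name__ == "__main__":
    print(net.related("automobile", "synonym"))  # ['car']
    print(net.related("vehicle", "hyponym"))     # ['car']
    print(net.hypernym_chain("car"))             # ['vehicle', 'object']
```

Storing the reverse edge on every insertion is a design choice for this sketch: it keeps "hypernym" and "hyponym" consistent with each other, so a query can start from either end of a relation.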
There are several types of lexical knowledge networks, each with its own characteristics and uses. Some of the best-known include WordNet, FrameNet, and ConceptNet.
• WordNet is a lexical database of English words and their relationships, developed at Princeton University's Cognitive Science Laboratory.
• It is widely used in NLP and computational linguistics for tasks such as word sense disambiguation, text classification, and machine translation.
• WordNet groups English words into sets of synonyms, called synsets, each of which is defined by a common sense or meaning.
• Each synset contains one or more words that are related in meaning and can be used interchangeably in certain contexts.
• For example, the synset for "car" (in the sense of a motor vehicle) includes the words "auto", "automobile", and "motorcar". The more general word "vehicle" is not a member of this synset; it is a hypernym of "car".
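To make the idea of synsets concrete, here is a minimal sketch in which one word can belong to several synsets, one per sense. It does not use the real WordNet database: the `Synset` class, the `senses` and `synonyms` helpers, and the three entries are hand-written assumptions, loosely styled after WordNet's naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    name: str      # hypothetical identifier, styled like "car.n.01"
    lemmas: tuple  # words that share this one sense
    gloss: str     # short definition of the sense


# Hand-written toy entries (assumed for illustration, not a WordNet extract).
SYNSETS = [
    Synset("car.n.01", ("car", "auto", "automobile", "machine", "motorcar"),
           "a motor vehicle with four wheels"),
    Synset("car.n.02", ("car", "railcar", "railway car", "railroad car"),
           "a wheeled vehicle adapted to the rails of a railroad"),
    Synset("bank.n.01", ("bank",),
           "sloping land beside a body of water"),
]


def senses(word):
    """Every synset the word belongs to: one synset per sense of the word."""
    return [s for s in SYNSETS if word in s.lemmas]


def synonyms(word):
    """All other words sharing at least one synset with `word`, sorted."""
    found = set()
    for synset in senses(word):
        found.update(synset.lemmas)
    found.discard(word)
    return sorted(found)


if __name__ == "__main__":
    for synset in senses("car"):
        print(synset.name, "-", synset.gloss)
    print(synonyms("motorcar"))  # ['auto', 'automobile', 'car', 'machine']
```

Because "car" appears in two synsets here, choosing the right one for a given sentence is exactly the word sense disambiguation task mentioned above. With NLTK installed and its WordNet data downloaded, `nltk.corpus.wordnet.synsets("car")` gives access to the real synsets instead of this toy list.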