BEYKOZ UNIVERSITY
NATURAL LANGUAGE PROCESSING
"CLASSIFYING MOVIE REVIEWS" USING PYTHON

Name: Heraa Farooq

Synopsis

INTRODUCTION:
Sentiment refers to the meaning of a word or sequence of words and is usually associated with an opinion or emotion. Analysis is the process of examining data and drawing inferences from it; in this case, using machine learning to learn and predict whether a movie review is positive or negative. People often decide whether to watch a movie based on whether its reviews are positive or negative. Sentiment analysis is a commonly used NLP (natural language processing) technique for determining whether a text is positive, negative, or neutral. It is frequently used to gauge customer satisfaction from review sentiment, or to provide an additional perspective when analyzing text data.

DATASET:
The dataset used for sentiment analysis is the Internet Movie Database (IMDb) review dataset, which contains 50,000 reviews of movies from all over the world. It is a binary classification dataset that labels each review as positive or negative. It has 25,000 samples for training and 25,000 for testing.

PREPARATION OF DATA
1. Clean and Preprocess
The raw text of these reviews is messy, so it must be cleaned before any analysis can be done.
2. Vectorization
For the data to make sense to a machine learning algorithm, each review must be converted to a numeric representation. This conversion is called vectorization.
3. Tokenizing
Tokenization is the process of breaking chunks of text down into smaller pieces. spaCy comes with a default processing pipeline that begins with tokenization, which makes this step straightforward. In spaCy, you can do either word tokenization or sentence tokenization:
   - Word tokenization breaks text down into individual words.
   - Sentence tokenization breaks text down into individual sentences.

Tools Required: spaCy, NLTK
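
The three preparation steps described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: to stay self-contained it uses only the Python standard library as a stand-in for spaCy and NLTK, the two sample reviews are invented for illustration, and the helper names (clean_review, tokenize_words, tokenize_sentences, build_vocabulary, vectorize) are hypothetical. The vectorization shown is a plain bag-of-words count, which is one common choice among several.

```python
import re
from collections import Counter


def clean_review(text):
    """Strip HTML tags, lowercase, and drop punctuation (assumed cleaning rules)."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove tags such as <br />
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)   # keep letters, digits, apostrophes
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


def tokenize_words(text):
    """Word tokenization: split cleaned text on whitespace."""
    return text.split()


def tokenize_sentences(text):
    """Sentence tokenization: naive split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_vocabulary(token_lists):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def vectorize(tokens, vocab):
    """Bag-of-words vectorization: count how often each vocabulary word occurs."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokens).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


# Invented sample reviews, for illustration only.
reviews = ["A great movie!<br />Great cast.", "Boring plot. Bad acting."]

cleaned = [clean_review(r) for r in reviews]
token_lists = [tokenize_words(c) for c in cleaned]
vocab = build_vocabulary(token_lists)
vectors = [vectorize(t, vocab) for t in token_lists]

# Sentence tokenization runs on the raw text, before punctuation is removed.
sentences = tokenize_sentences(reviews[1])

print(cleaned[0])
print(vectors[0])
print(sentences)
```

Note that sentence tokenization is applied to the raw review, since cleaning removes the punctuation that marks sentence boundaries. In the real project, spaCy's pipeline or NLTK's tokenizers would replace the regex-based helpers, which do not handle abbreviations, contractions, or other edge cases.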