Advanced Programming

A Report on
Classification Project: Extract Stock Sentiment from News Headlines

Under the guidance of Prof. Rajinder Chitoria
Master of Business Administration - Business Analytics, 2019-20

Submitted by: Dimpy Tyagi, Devika Singh, Poonam Mishra, Shalabh Sharma

Introduction

Sentiment analysis is a text analysis method that detects polarity (e.g. a positive or negative opinion) within text, whether a whole document, paragraph, sentence, or clause. Understanding people’s emotions is essential for businesses, since customers are able to express their thoughts and feelings more openly than ever before. By automatically analysing customer feedback, from survey responses to social media conversations, brands are able to listen attentively to their customers and tailor products and services to meet their needs.

Why Perform Sentiment Analysis?

It is estimated that 80% of the world’s data is unstructured, in other words unorganized. Huge volumes of text data (emails, support tickets, chats, social media conversations, surveys, articles, documents, etc.) are created every day, but this data is hard, time-consuming and expensive to analyse, understand and sort through. Sentiment analysis helps businesses make sense of all this unstructured text by automatically tagging it. Benefits of sentiment analysis include:

Sorting Data at Scale
Manually sorting through thousands of tweets, customer support conversations, or surveys is impractical; there is simply too much data to process by hand. Sentiment analysis helps businesses process huge amounts of data in an efficient and cost-effective way.

Real-Time Analysis
Sentiment analysis can identify critical issues in real time. For example, is a PR crisis on social media escalating? Is an angry customer about to churn? Sentiment analysis models can help you identify these kinds of situations immediately and gauge brand sentiment, so you can take action right away.
Consistent Criteria
It is estimated that people agree only around 60-65% of the time when determining the sentiment of a particular text. Tagging text by sentiment is highly subjective, influenced by personal experiences, thoughts, and beliefs. By using a centralized sentiment analysis system, companies can apply the same criteria to all of their data, helping them improve accuracy and gain better insights.

Classification Project: Extract Stock Sentiment from News Headlines

Objective: To gain investing insight by applying sentiment analysis to financial news headlines from FINVIZ.com, using various sentiment analysis techniques in Python.

Tasks Performed:
1. Tag each news heading with a “Positive” or “Negative” label.
2. Visualize the day-wise number of positive and negative news headings.
3. Visualize the news-agency-wise number of positive and negative news headings. The name of the news agency can be viewed by clicking on the icon before the news heading’s date and time.
4. Repeat the above tasks for blog headings as well.
5. Determine which classification machine learning algorithm is the best choice for prediction on the above data (e.g. Logistic Regression, Decision Tree Classification).

Why FINVIZ:
FinViz is one of the go-to websites for information on the stock market. From fundamental ratios and technical indicators to news headlines and insider trading data, it serves as a comprehensive stock screener. Furthermore, it has updated information on the performance of each sector, industry and major stock index.
Link: https://finviz.com/

Some important non-conventional libraries used in our project:
etree
BeautifulSoup (“Beautiful Soup”): used to parse the data from FinViz
urllib.request (urlopen, Request): used to fetch the data

#1 Tag each news heading with a “Positive” or “Negative” label (News Articles)
The AFINN lexicon is perhaps one of the simplest and most popular lexicons used for sentiment analysis.
It was developed and curated by Finn Årup Nielsen; more details on the lexicon can be found in the paper “A new ANEW: evaluation of a word list for sentiment analysis in microblogs”, Proceedings of the ESWC 2011 Workshop. The current version of the lexicon is AFINN-en-165.txt, which contains over 3,300 words, each with an associated polarity score. The author has also created a Python wrapper library for it called afinn, which we use for our analysis.

[Figure: Visual representation of labels as per ticker]

#2 Visualize Day-Wise Number of Positive & Negative News Headings
Using Python, we took the average of the sentiment scores of all news headlines collected on each date and plotted it on a bar chart. Note that on days with no headlines in a particular category (“positive”, “negative” or “neutral”), there is no score for that category.

Interpretation: Most of the news is neutral on any given day. On May 21, a high number of negative news items was received.

#3 Visualize News-Agency-Wise Number of Positive & Negative News Headings
The name of the news agency can be viewed by clicking on the icon before the news heading’s date and time. We implemented this in three ways, all of which include interactive charts in which the user can select the agency of interest.

a) Agency-wise positive, neutral and negative news
Interpretation: For the Bloomberg agency, most news was neutral; negative news was comparatively low in number.

b) Agency- and ticker-wise positive, neutral and negative news
Interpretation: For Tesla, all news reported by Yahoo Finance was positive.

c) Agency-wise and according to date
Interpretation: For the Reuters agency, only negative news was reported on May 15. After that day, only positive and neutral news was reported.

#4 Similar scenarios have been implemented for blogs as well.
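The labelling and day-wise counting used for both news and blog headings (tasks 1, 2 and 4) can be sketched roughly as follows. This is a minimal illustration and not the project’s actual code: TOY_LEXICON is a hypothetical stand-in for the AFINN word list (the afinn package exposes the real lexicon through Afinn().score(text)), the sample headlines are invented, and the rule that a score of zero means “Neutral” is an assumption.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for the AFINN lexicon: word -> integer valence.
# The real word list is far larger; these entries are for illustration only.
TOY_LEXICON = {"gain": 2, "surge": 2, "strong": 2, "loss": -3, "drop": -1, "weak": -2}


def score_headline(text, lexicon=TOY_LEXICON):
    """Sum the valence of every lexicon word found in the headline."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


def label_from_score(score):
    """Map a numeric score to a label (assumed rule: zero means Neutral)."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


def count_labels_by_day(rows):
    """Turn (date, headline) pairs into {date: Counter of labels}."""
    counts = defaultdict(Counter)
    for date, headline in rows:
        counts[date][label_from_score(score_headline(headline))] += 1
    return dict(counts)


# Made-up sample rows, not data from the project.
sample_rows = [
    ("May-20", "Shares surge on strong quarterly results"),
    ("May-21", "Carmaker reports heavy loss as sales drop"),
    ("May-21", "Company schedules annual shareholder meeting"),
]
daily_counts = count_labels_by_day(sample_rows)
for day, counter in daily_counts.items():
    print(day, dict(counter))
```

The per-day counters produced here are the kind of table a bar chart of positive, negative and neutral headings per date could be drawn from. Grouping by agency instead of date only requires passing (agency, headline) pairs.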
The blogs data was filtered out of the main data frame.

[Figure: Blogs data filtered from the main data frame]
[Figure: Sentiment analysis output]
[Figure: Visualisation as per labels and date]
[Figure: Agency-wise representation]

From the agency-wise representation we can see which agency is posting more positive or negative blogs.

#5 Classification: Random Forest & Decision Tree for Headlines

The following data pre-processing was applied to the dataset:
- Loading the data into a pandas data frame
- Handling the imbalanced target variable through SMOTE oversampling
- Train-test split for model creation
- Labelling the categories with numbers
- Feature extraction using TfidfVectorizer
- Dimensionality reduction using TruncatedSVD

5.1 Classification with 2 categories, viz. “Positive & Negative”

Output 1 for Headlines
a) Data is transformed into clean text
[Figure: Clean text]
[Figure: Prediction]
Result: Random Forest is the best model for the given data, with the highest accuracy, precision and F1 score.

Classification with 3 categories, “Positive, Negative & Neutral”
Decision Tree gives better results than Random Forest.

5.2 Random Forest & Decision Tree for Blogs
[Figure: Clean text]
[Figure: Classification with 2 categories, viz. “Positive & Negative”]
[Figure: Output for classification with three categories]
Looking at the various evaluation metrics, we can say that Random Forest is the best choice.

Conclusion:
Sentiment analysis in business, also known as opinion mining, is a process in which a piece of text is identified and categorised according to the tone it conveys. The text can be tweets, comments, reviews or even random rants, carrying positive, negative or neutral feelings. Every business needs to carry out automated analysis of these feelings. Such analysis can never be 100 per cent precise, and naturally a machine does not understand sarcasm. However, people themselves often disagree about sentiment: as noted earlier, they are estimated to agree only around 60-65% of the time. This means that although the machine’s precision does not score a perfect 10, it can still be more consistent than human analysis. Also, manually analysing a huge corpus is not an option.
One cannot overlook the applications of sentiment analysis in business. Market sentiment analysis can be a significant aid in revitalising a brand. The key to running a profitable company with sentiment data is the ability to turn unstructured data into actionable insights. Machine learning models that rely largely on features generated manually before classification have served this purpose well for the past few years. Sentiment analysis is therefore pivotal for market research.

Through our project we can now identify which type of articles and blogs (negative, positive or neutral) appear more often on which platform. This enables an individual to decide which site is more prone to giving negative, positive or neutral news.
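The model comparisons in section 5 rest on accuracy, precision and F1 score. As a reminder of how these are computed in the two-category case, here is a minimal sketch. The function name binary_metrics, the generic model_a and model_b predictions, and the choice of “Positive” as the class of interest are all assumptions made for illustration; the labels are invented and are not results from the project.

```python
def binary_metrics(y_true, y_pred, positive="Positive"):
    """Return accuracy, precision, recall and F1, treating one class as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels for illustration only.
actual = ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"]
model_a = ["Positive", "Negative", "Negative", "Negative", "Positive", "Positive"]
model_b = ["Positive", "Positive", "Negative", "Positive", "Positive", "Positive"]

scores_a = binary_metrics(actual, model_a)
scores_b = binary_metrics(actual, model_b)
print("model_a:", scores_a)
print("model_b:", scores_b)
```

In this toy example both sets of predictions have the same accuracy but different precision and F1, which is why comparing models on several metrics, as done in section 5, is more informative than relying on accuracy alone.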