TextToStocks
Presentation 2
Team 48
Shrey Joshi, Adrian Mah, Suryansh Pratap Singh, Ryan Van Duren, Muhammad Raza
Advisor: Yuqian Zhang

The Problem
“Meme stocks” are equities that have gained a cult-like following online and across various social media platforms. These equities are often hyped by online communities, which can greatly influence their price. This raises the question: do internet discussions about equities correlate with the future prices of those equities, and if so, can this correlation be used to improve the accuracy of equity price predictions?

Scope of Work Recap
To reiterate from last semester, our goal is to draw conclusions about the future price of an equity based on prior internet discussion of that equity, or to establish that there is no relationship between the two if that is the case. We will apply Natural Language Processing (NLP) to messages scraped from certain internet forums and use the results to predict equity prices with a Long Short-Term Memory (LSTM) machine learning model.

Task Allocation
● Work is shared among all team members, but work ownership is allocated as follows:
  ○ Adrian: ML Research and Implementation
  ○ Shrey: Sorting and categorizing scraped messages
  ○ Muhammad: Webscraping needed pages for data
  ○ Ryan: Prediction algorithm based on Natural Language Processing, data collection
  ○ Suryansh: Machine training dataset

General Progress and Implementation
● Selected environment and several packages
  ○ Environment:
    ■ Anaconda3 with Python
    ■ Windows
  ○ Packages:
    ■ Deep Learning : TensorFlow
    ■ Data Acquisition : pandas_datareader and Yahoo Finance
    ■ Plotting : matplotlib
    ■ Web Scraping : Beautiful Soup
    ■ NLP : spaCy

Neural Network : Objective
● Built a NN that takes in basic stock data and outputs a prediction
  ○ “Control” prediction
  ○ Data:
    ■ Indexed by date, separated by equity
    ■ High, Low, Open, Close, Volume, Adj Close
● Train the model to maximize the accuracy of the prediction
  ○ Cannot account for world events
    ■ Accuracy will never be perfect
● Build a second NN that incorporates sentiment
  ○ Compare accuracy of the two networks
  ○ Ideally demonstrate improved accuracy with the sentiment parameter

Neural Network : Progress
● We selected and built a Long Short-Term Memory (LSTM) Neural Network
  ○ Recurrent NN that is capable of learning long-term dependencies
  ○ Uses each prior state and the current state for calculations
  ○ Common for equity-based deep learning projects
  ○ Our simple LSTM model can predict stock prices when given cherry-picked example data
● Input Parameters
  ○ High, Low, Open, Close, Volume, Adj Close
  ○ Eventually : Sentiment
    ■ We must calculate sentiment through natural language processing

Neural Network : Issues
● Cannot account for real-world events
  ○ Accuracy will never be perfect
    ■ Especially with political instability…
● Model must be trained
  ○ Takes time
● Hardware configuration
  ○ Configure the training environment so that the most efficient hardware is used regardless of which device the application is run on
    ■ GPU if available

Neural Network : Remaining Objectives
● Continue to train the model with stock data
  ○ Improve accuracy
  ○ Train with a variety of stocks
● Build a duplicate network and add the sentiment parameter
  ○ Sentiment analysis uses natural language processing and ML to interpret and classify emotions within text
● Tweak NN architecture to better accommodate sentiments
● Train NN with sentiment data
  ○ From r/wallstreetbets

Natural Language Processing : Objective
● Analyze relevant online discussion
  ○ Determine which equity each conversation refers to and which conversations are relevant
  ○ Produce a daily “sentiment” value for each measured equity that accurately reflects how real people feel about that equity
  ○ To be used as an input parameter for the NN
● State of the Art: Reddit Sentiment Analysis
  ○ Has been done before
  ○ Limited to r/wallstreetbets
    ■ Easy due to ticker system
    ■ We must expand to other subreddits and other platforms in order to get a more complete picture

Natural Language Processing : Progress
● Identified the online resources we are going to use for sentiment analysis
  ○ r/wallstreetbets
● Using Beautiful Soup to collect the data from the source for use in natural language processing
● Created a basic algorithm using spaCy to identify positive and negative sentiment in a sentence
  ○ It has no sarcasm detection yet.

Natural Language Processing : Issues
● Expanding outside of r/wallstreetbets
  ○ No easily defined tickers
    ■ Difficult to accurately identify what company is being discussed
  ○ Solution : Use larger context to assume the relevant ticker if key words are missing
    ■ Ex: someone responding to this tweet with “will be selling my shares!”
      ● We can infer the response is relevant to a Musk company
● Expanding outside of Reddit
  ○ Which Tweets are important?
    ■ Can’t analyze them all
  ○ Which other platforms are relevant?

Natural Language Processing : Issues
● Improving sarcasm detection
  ○ Ex:
    ■ WOW… Tesla’s new car is incredible! The people are so excited
    ■ This is Tesla’s “incredible” new car people are so excited for? WOW…
  ○ Sentences with similar words can have opposite meanings
● What comments / discussions are relevant
  ○ Discussions may not be relevant if they…
    ■ Are made by bots
      ● More common than you may think
    ■ Are made by biased individuals
      ● Shareholders, employees, fans, etc.
    ■ Are not popular or are particularly unpopular
      ● Comments / tweets with no likes / responses
      ● Comments / tweets with negative responses
        ○ May indicate the opposite sentiment

Remaining Objectives
● Scrape the web for relevant discussions
  ○ Identify which equity the discussion is relevant to
● Calculate a sentiment value accurate enough that its inclusion improves the accuracy of the LSTM network
  ○ Must account for sarcasm and popularity
● Required Steps:
  ○ Improve the accuracy of the sentiment value
    ■ Sarcasm detection
    ■ Bot identification
  ○ Train NN with and without sentiment included
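
Example : Windowing the stock data for the LSTM
The Neural Network slides list High, Low, Open, Close, Volume, Adj Close as the per-day inputs. An LSTM usually consumes a fixed-length run of past days per sample, so one plausible preprocessing step is sketched below. The function name `make_windows`, the window length, and the choice of next-day Close as the prediction target are assumptions for illustration, not details taken from the slides.

```python
# Column order follows the slides: High, Low, Open, Close, Volume, Adj Close.
CLOSE_INDEX = 3  # assumption: the model predicts the next day's Close


def make_windows(rows, window, target_index=CLOSE_INDEX):
    """Turn chronological daily rows into (inputs, targets) samples.

    rows   : list of per-day feature lists, oldest first
    window : number of consecutive past days fed to the network per sample
    Each input is `window` consecutive days; its target is the chosen
    column of the day immediately after that window.
    """
    inputs, targets = [], []
    for end in range(window, len(rows)):
        inputs.append(rows[end - window:end])
        targets.append(rows[end][target_index])
    return inputs, targets


# Toy data: five days of made-up values, not real quotes.
days = [
    [11.0, 9.0, 10.0, 10.5, 1000, 10.5],
    [12.0, 10.0, 10.5, 11.5, 1200, 11.5],
    [12.5, 11.0, 11.5, 12.0, 900, 12.0],
    [13.0, 11.5, 12.0, 12.5, 1100, 12.5],
    [13.5, 12.0, 12.5, 13.0, 1300, 13.0],
]
samples, next_closes = make_windows(days, window=2)
```

A daily sentiment value could later be appended as a seventh column without changing this function.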
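
Example : Daily sentiment value per ticker
The NLP objective is a daily “sentiment” value per equity that accounts for popularity. A minimal sketch of one way to aggregate it is shown below. It assumes each scraped message already carries a sentiment score in [-1, 1] (for instance from the spaCy-based step) and an upvote count. The field names, the `$TICKER` pattern, and the rule of skipping messages with a score of zero or less are all hypothetical choices, and sarcasm and bot filtering are not handled here.

```python
import re
from collections import defaultdict

# Assumed ticker convention: "$" followed by 1-5 capital letters, e.g. "$TSLA".
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")


def extract_tickers(text):
    """Return the set of cashtag tickers mentioned in a message."""
    return set(CASHTAG.findall(text))


def daily_sentiment(messages):
    """Popularity-weighted mean sentiment per (date, ticker).

    Each message is a dict with hypothetical keys:
      date (str), text (str), sentiment (float in [-1, 1]), score (int).
    Messages with score <= 0 are skipped as unpopular.
    """
    totals = defaultdict(float)
    weights = defaultdict(float)
    for msg in messages:
        if msg["score"] <= 0:
            continue
        for ticker in extract_tickers(msg["text"]):
            key = (msg["date"], ticker)
            totals[key] += msg["sentiment"] * msg["score"]
            weights[key] += msg["score"]
    return {key: totals[key] / weights[key] for key in totals}


# Made-up messages for illustration only.
posts = [
    {"date": "2022-03-01", "text": "$TSLA to the moon", "sentiment": 1.0, "score": 3},
    {"date": "2022-03-01", "text": "selling my $TSLA", "sentiment": -1.0, "score": 1},
    {"date": "2022-03-01", "text": "$TSLA is doomed", "sentiment": -1.0, "score": 0},
    {"date": "2022-03-01", "text": "no ticker here", "sentiment": 1.0, "score": 9},
]
result = daily_sentiment(posts)
```

The slides note that negative responses may indicate the opposite sentiment; flipping the sign for such messages instead of skipping them would be an alternative worth testing.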
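
Example : Comparing the control and sentiment networks
The final required step is to train the NN with and without sentiment and compare accuracy. The slides do not name an accuracy metric, so the sketch below assumes root-mean-square error (RMSE) on the same held-out prices for both networks. The function names and the result keys are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length price sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def compare_models(actual, control_pred, sentiment_pred):
    """Score both networks against the same actual prices."""
    control = rmse(control_pred, actual)
    with_sentiment = rmse(sentiment_pred, actual)
    return {
        "control_rmse": control,
        "sentiment_rmse": with_sentiment,
        "sentiment_improves": with_sentiment < control,
    }


# Made-up numbers for illustration only.
report = compare_models(
    actual=[10.0, 12.0],
    control_pred=[11.0, 13.0],
    sentiment_pred=[10.0, 12.0],
)
```

A lower RMSE on one test set is only suggestive; repeating the comparison across several stocks and time periods would give a fairer picture.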