Comparative Stock Market Analysis in R using Quandl & tidyverse - Part I

Introduction

What differentiates the best data scientists from others? It is their focus on the application of data science. The best data scientists I know see data science and its applications everywhere they look. They see the world as an outcome of flows of data and information.

Beginners, on the other hand, often ask: how do we apply what we have learned to real-life problems? In this post (and the one that follows), I have picked a real-life dataset (stock markets in India) and shown how I would use it to arrive at useful insights.

I hope you find this useful. The idea is to show the vast opportunities present in data science in a simple yet powerful manner. If you can think of more examples like this, let me know in the comments below!

For the best results, I strongly recommend building the application yourself as you follow the tutorial.

Objective of this Tutorial

In this article, we will analyze the banking segment of the stock market, based on the bank stocks listed on NSE India. Our objective is to find trends (seasonal or cyclic) in banking stocks.

In our comparative analysis we will use several packages, with the primary focus on the tidyverse. The emphasis is on grouping with the help of tibble data frames from the tidyverse. This lets us perform the same operation on multiple groups at once, which reduces code length and computation time.

This article also covers the API key, database code search using Quandl, and how to download the data directly from the R console.

So let's get started!
Note: For best results, run the code below on the R command line.

Table of Contents

- Setting up the system
- Getting Started with Comparative Analysis
  - Creating the dataset
  - Visualizing the monthly prices
  - Discovering the Relation between Total Traded Quantity vs Close Price
  - Finding the Density Distribution of Deviation of High Price from Open Price
  - Observing the Autocorrelation lags

Setting Up The System

There are a few things to take care of before you go further. These are the packages you need to install:

- Quandl for data download
- timetk to coerce the data frame into xts
- tidyverse to use tibble for grouping and to perform a single operation on multiple groups
- tidyquant for time series and financial functions used in the analysis
- ggplot2 for plotting and visualization
- gganimate to plot the monthly prices. For more information on gganimate, please read my previous post on Analytics Vidhya
- forcats for modifying factor levels
- stringr for string handling

If you are missing any of these packages, use the code below to install them. Edit the pckgs variable if some of them are already installed.

    pckgs <- c("Quandl", "timetk", "tidyverse", "tidyquant", "ggplot2",
               "gganimate", "forcats", "stringr", "plyr", "gridExtra")
    install.packages(pckgs, dependencies = TRUE)

You can then load the necessary packages using the code below.

    library(Quandl)
    library(plyr)       # loaded before tidyverse so that dplyr's verbs are not masked
    library(tidyverse)
    library(ggplot2)
    library(tidyquant)
    library(timetk)
    library(forcats)
    library(stringr)
    library(gganimate)
    library(gridExtra)

Getting Started with Comparative Analysis

Creating the Dataset

We will be using Quandl, an online repository for core financial data, macroeconomic statistics and forex. Quandl has a vast collection of free and open data collected from a variety of organizations: central banks, governments, multinational organizations and more. You can use it without payment and with few restrictions. Both free and premium data are available.
Authenticated free users have a limit of 300 calls per 10 seconds, 2,000 calls per 10 minutes and 50,000 calls per day. Premium data subscribers have a limit of 5,000 calls per 10 minutes and 720,000 calls per day.

We will use this online repository to get our data with the "Quandl" package, directly from the R console. The Quandl package interacts directly with the Quandl API to offer data in a number of formats usable in R, to download a zip with all the data from a Quandl database, and to search. For more information on the Quandl package, please visit this page.

To get started with Quandl, create an account and get the Quandl API key. Please click here to create an account, then click on the Login button in the top right corner of the screen. Once the registration is complete, please click here to get the API key.

In our analysis, we have selected the following banks:

- ICICI
- BOB
- CANARA
- AXIS
- SBI
- PNB

We have selected these banks because they are in the price band of Rs 200 to Rs 500. We will use calls of the following form to get the data into the R console.

    Quandl(code = "NSE/---", collapse = "---", start_date = "----", type = "...")

The parameters we use are as follows:

- code: dataset code on Quandl, specified as a string or an array of strings.
- collapse: collapse frequency of the data, e.g. "daily", "weekly", "monthly", "quarterly", "annual".
- start_date: desired start date.
- type: type of data returned, specified as a string. Can be "raw", "ts", "zoo", "xts" or "timeSeries".

Now we will download the data, add a column "Stock" as the stock identifier, and then paste the respective stock name into each downloaded dataset.
We will then consolidate all the stock data into one master data frame for analysis.

    ## Set up a free Quandl account and API key, then copy and paste the API key here to authenticate
    Quandl.api_key("<Your-API-Key>")

    ## Download the data set
    ICICI  <- Quandl("NSE/ICICIBANK",  collapse = "daily", start_date = "2016-09-01", type = "raw")
    PNB    <- Quandl("NSE/PNB",        collapse = "daily", start_date = "2016-09-01", type = "raw")
    Axis   <- Quandl("NSE/AXISBANK",   collapse = "daily", start_date = "2016-09-01", type = "raw")
    Canara <- Quandl("NSE/CANBK",      collapse = "daily", start_date = "2016-09-01", type = "raw")
    BOB    <- Quandl("NSE/BANKBARODA", collapse = "daily", start_date = "2016-09-01", type = "raw")
    SBI    <- Quandl("NSE/SBIN",       collapse = "daily", start_date = "2016-09-01", type = "raw")

    ## Add another column ("Stock") to each dataset using the cbind command
    ICICI  <- cbind(ICICI,  Stock = "")
    PNB    <- cbind(PNB,    Stock = "")
    Axis   <- cbind(Axis,   Stock = "")
    SBI    <- cbind(SBI,    Stock = "")
    Canara <- cbind(Canara, Stock = "")
    BOB    <- cbind(BOB,    Stock = "")

    ## Paste the stock name in the Stock column
    ICICI$Stock  <- paste(ICICI$Stock,  "ICICI",  sep = "")
    PNB$Stock    <- paste(PNB$Stock,    "PNB",    sep = "")
    Axis$Stock   <- paste(Axis$Stock,   "Axis",   sep = "")
    SBI$Stock    <- paste(SBI$Stock,    "SBI",    sep = "")
    Canara$Stock <- paste(Canara$Stock, "Canara", sep = "")
    BOB$Stock    <- paste(BOB$Stock,    "BOB",    sep = "")

    ## Consolidate under one dataset
    Master_Data <- rbind(ICICI, PNB, Axis, SBI, Canara, BOB)

Visualization of monthly prices

Let us look at the monthly and daily price patterns of the stocks using the ggplot2 package. For this we need to group the master data frame by Stock.

We have heavily customized the theme section of ggplot to get the desired plot. More information on the plot is provided here.
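As a rough illustration of what "grouping by stock and month" means, here is a minimal base R sketch on toy data. The data frame `toy`, the bank names `BankA` and `BankB`, and all prices are made up for the example and are not part of the Quandl data; the sketch only assumes a layout with `Date`, `Close` and `Stock` columns, like the master data frame described above.

```r
# Toy data standing in for the master data frame; all values are invented.
toy <- data.frame(
  Date  = as.Date(c("2016-09-01", "2016-09-02", "2016-10-03",
                    "2016-09-01", "2016-09-02", "2016-10-03")),
  Close = c(250, 254, 260, 140, 138, 145),
  Stock = rep(c("BankA", "BankB"), each = 3),
  stringsAsFactors = FALSE
)

# Derive the month directly from the Date column.
toy$Month <- format(toy$Date, "%m")

# Mean close price for every Stock/Month combination.
monthly_close <- aggregate(Close ~ Stock + Month, data = toy, FUN = mean)
print(monthly_close)
```

Each row of `monthly_close` is one stock in one month, which is the same unit the monthly plots work with. Deriving the month with `format()` is one possible alternative to splitting the date string by hand.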
    ## Visualisation in ggplot2 ("Comparative Visualisation of Close Price listed on NSE")

    ## Convert the dates into character in order to split the column into "Y", "m" and "dd" columns
    Master_Data$Date <- as.character(Master_Data$Date)

    ## Split the date and create a list for the same
    date_parts <- strsplit(Master_Data$Date, "-")

    ## Convert the list into a dataframe
    ## (plyr::ldply is called with its namespace so plyr does not mask dplyr's verbs)
    Master_Date1 <- plyr::ldply(date_parts)
    colnames(Master_Date1) <- c("Year", "Month", "Day")

    ## Column bind with the main dataframe
    Master_Data <- cbind(Master_Data, Master_Date1)
    names(Master_Data)

    ## Change the scale for Traded Quantity
    Master_Data$`Total Trade Quantity` <- Master_Data$`Total Trade Quantity` / 100000

    ## Convert the Date to as.Date()
    Master_Data$Date <- as.Date(Master_Data$Date)

    ## Visualisation with Bubble Plot
    P <- ggplot(Master_Data, aes(factor(Stock), Close, color = Stock, frame = Month)) +
      geom_jitter(aes(size = Close, colour = Stock, alpha = .02)) +
      ylim(0, 1000) +
      labs(title = "Bank Stock Monthly Prices", x = "Banks", y = "Close Price") +
      theme(panel.border = element_blank(),
            panel.grid.major = element_line(colour = "grey61", size = 0.5, linetype = "dotted"),
            panel.grid.minor = element_blank(),
            axis.line = element_line(colour = "black"),
            plot.title = element_text(hjust = 0.5, size = 18, colour = "indianred4")) +
      theme(legend.position = "none")

    ## Note: gganimate() with a `frame` aesthetic is the legacy (pre-1.0) gganimate interface;
    ## newer releases use transition_*() and animate() instead.
    P1 <- gganimate(P, 'Price_Range.gif', ani.width = 600, ani.height = 400, interval = 1)

    ## Group By Stock
    Master_Data <- Master_Data %>%
      tibble::as_tibble() %>%
      group_by(Stock)

    ## Visualisation for Daily Stock Prices
    Master_Data %>%
      ggplot(aes(x = Date, y = Close, color = Stock)) +
      geom_point() +
      labs(title = "Daily Close Price", x = "Month", y = "Close Price") +
      facet_wrap(~ Stock, ncol = 3, scales = "free_y") +
      scale_fill_tq(fill = "green4", theme = "light") +
      theme_tq() +
      theme(panel.border = element_blank(),
            panel.grid.major = element_line(colour = "grey61", size = 0.5, linetype = "dotted"),
            panel.grid.minor = element_blank(),
            axis.line = element_line(colour = "black"),
            plot.title = element_text(hjust = 0.5, size = 18, colour = "indianred4")) +
      theme(legend.position = "none")

Discovering the Relation between Total Traded Quantity vs Close Price

Usually, the traded quantity increases if the stock price rises or falls too rapidly on a given day. This parameter is important for our prediction model, so we should take some time to identify the relation between the two in our data.

    ## Traded Quantity vs Price
    z <- Master_Data %>%
      ggplot(aes(x = `Total Trade Quantity`, y = Close, color = Stock, frame = Month)) +
      geom_smooth(method = 'loess') +
      xlim(0, 400) +
      labs(title = "Monthly Traded Quantity vs Price", x = "Traded Quantity (Lacs)", y = "Close Price") +
      facet_wrap(~ Stock, ncol = 3, scales = "free_y") +
      scale_fill_tq(fill = "green4", theme = "light") +
      theme_tq() +
      theme(panel.border = element_blank(),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            plot.title = element_text(hjust = 0.5, size = 18, colour = "indianred4"),
            axis.line = element_line(colour = "black")) +
      theme(legend.position = "none")

    ## Legacy (pre-1.0) gganimate call, as above
    z1 <- gganimate(z, 'Quantity_Price.gif', ani.width = 600, ani.height = 400, interval = 0.7)

We now have an idea of the trend of the stock prices, but not much is clear from the monthly prices. The Axis Bank share price improved in September and stayed at Rs 750 for a month, whereas all the other banks were consistent and did not show much volatility.

Finding the Density Distribution of Deviation of High Price from Open Price

Now we will look at the density distribution of the deviation of the high price from the open price, to understand how much the price deviates in either direction (north or south) on a weekly basis. This gives us an idea of the price range of a stock in intraday trading.

We will use the tq_transmute() function from the tidyquant package to compute the weekly values. Please click here to get more information.

The steps are:

- Add a new column with the difference between the high and open price using the mutate function.
- Add another new column with the difference between the open and low price using the mutate function.
- Calculate the weekly average of the differences using the tq_transmute() function from the tidyquant package.
- Visualize both density plots with dot distributions in ggplot.

    ## Deviation from High & Low Price
    Master_Data_High <- Master_Data %>% mutate(Dev_High = High - Open)
    Master_Data_Low  <- Master_Data %>% mutate(Dev_Low = Open - Low)

    ## Computation of weekly average for High Price
    Master_Data_High_Week <- Master_Data_High %>%
      tq_transmute(
        select     = Dev_High,
        mutate_fun = apply.weekly,
        FUN        = mean,
        na.rm      = TRUE,
        col_rename = "Dev_High_Mean"
      )

    ## Computation of weekly average for Low Price
    Master_Data_Low_Week <- Master_Data_Low %>%
      tq_transmute(
        select     = Dev_Low,
        mutate_fun = apply.weekly,
        FUN        = mean,
        na.rm      = TRUE,
        col_rename = "Dev_Low_Mean"
      )

    ## Visualisation of density distribution of High Price
    High <- Master_Data_High_Week %>%
      ggplot(aes(x = Dev_High_Mean, color = Stock)) +
      geom_dotplot(binwidth = 0.50, aes(fill = Stock)) +
      xlim(0, 10) +
      scale_fill_manual(values = c("#999999", "#E69F00", "#CC9933", "#99FF00", "#CC3399", "#FF9933")) +
      labs(title = "Distribution of High Price Deviation from Open Price", x = "Weekly Mean Deviation") +
      scale_color_tq(values = c("#999999")) +
      facet_wrap(~ Stock, ncol = 3, scales = "free_y") +
      theme_tq() +
      theme(panel.border = element_blank(),
            panel.grid.major = element_line(colour = "grey61", size = 0.5, linetype = "dotted"),
            panel.grid.minor = element_blank(),
            axis.line = element_line(colour = "black"),
            plot.title = element_text(hjust = 0.5, size = 16, colour = "indianred4")) +
      theme(legend.position = "none")

    ## Visualisation of density distribution of Low Price
    Low <- Master_Data_Low_Week %>%
      ggplot(aes(x = Dev_Low_Mean, color = Stock)) +
      geom_dotplot(binwidth = 0.50, aes(fill = Stock)) +
      xlim(0, 10) +
      scale_fill_manual(values = c("#999999", "#E69F00", "#CC9933", "#99FF00", "#CC3399", "#FF9933")) +
      labs(title = "Distribution of Weekly Low Price Deviation from Open Price", x = "Weekly Mean Deviation") +
      scale_color_tq(values = c("#999999")) +
      facet_wrap(~ Stock, ncol = 3, scales = "free_y") +
      theme_tq() +
      theme(panel.border = element_blank(),
            panel.grid.major = element_line(colour = "grey61", size = 0.5, linetype = "dotted"),
            panel.grid.minor = element_blank(),
            axis.line = element_line(colour = "black"),
            plot.title = element_text(hjust = 0.5, size = 16, colour = "indianred4")) +
      theme(legend.position = "none")

    ## Arrange
    grid.arrange(High, Low, ncol = 2, nrow = 1)

Observing the Autocorrelation lags

The lag operator (also known as the backshift operator) is a function that shifts (offsets) a time series so that the "lagged" values are aligned with the actual time series. The lags can be shifted by any number of units, which simply controls the length of the backshift. Here, "k" denotes the lag. We will look at lags over a 180-day period and see how the stocks behave.

These are the steps for the computation:

- Define the k lag periods
- Create column names for the lag periods
- Group the data by Stock, creating a new data frame for the lags
- Apply lag.xts using the tq_mutate() function on the new data frame
- Compute the autocorrelations

    k <- 1:180
    col_names <- paste0("lag_", k)

    ## Only select columns "Date" and "Close" from the master data frame
    ## (Stock is kept as well because it is the grouping variable)
    Master_Data_lags <- Master_Data %>%
      tibble::as_tibble() %>%
      group_by(Stock)
    Master_Data_lags <- Master_Data_lags %>% select(Stock, Date, Close)

    # Apply lag.xts function using tq_mutate
    Master_Data_lags <- Master_Data_lags %>%
      tq_mutate(
        select     = Close,
        mutate_fun = lag.xts,
        k          = 1:180,
        col_rename = col_names
      )

    # Calculate the autocorrelations and 95% cutoffs
    Master_Data_AutoCorrelations <- Master_Data_lags %>%
      gather(key = "lag", value = "lag_value", -c(Stock, Date, Close)) %>%
      mutate(lag = str_sub(lag, start = 5) %>% as.numeric) %>%
      group_by(Stock, lag) %>%
      summarize(
        cor = cor(x = Close, y = lag_value, use = "pairwise.complete.obs"),
        cutoff_upper = 2/(n())^0.5,
        cutoff_lower = -2/(n())^0.5
      )

    ## Visualisation of Autocorrelation: ACF Plot
    Master_Data_AutoCorrelations %>%
      ggplot(aes(x = lag, y = cor, color = Stock, group = Stock)) +
      # Add horizontal line at y = 0
      geom_hline(yintercept = 0) +
      # Plot autocorrelations
      geom_point(size = 2) +
      geom_segment(aes(xend = lag, yend = 0), size = 1) +
      # Add cutoffs
      geom_line(aes(y = cutoff_upper), color = "blue", linetype = 2) +
      geom_line(aes(y = cutoff_lower), color = "blue", linetype = 2) +
      # Add facets
      facet_wrap(~ Stock, ncol = 3) +
      # Aesthetics
      expand_limits(y = c(-1, 1)) +
      scale_color_tq() +
      theme_tq() +
      labs(
        title = paste0("Tidyverse ACF Plot: Lags ", rlang::expr_text(k)),
        x = "Lags"
      ) +
      theme(
        legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_line(colour = "grey61", size = 0.5, linetype = "dotted"),
        plot.title = element_text(hjust = 0.5, size = 18, colour = "indianred4")
      )

It is apparent from the ACF plot that there is no weekly or monthly pattern.

End Notes

This article contains a descriptive analysis of stocks in terms of daily and weekly price fluctuations. It also includes an analysis of the deviation of the high and low prices from the open price, and examines the relationship between the daily traded quantity of shares and the close price.
In the later part, the main focus is on the xts package for the computation of autocorrelation. The article builds the lags and the ACF plot with ggplot rather than with a conventional time series package. This includes analyzing the ACF at different lags to check whether there is any pattern in the series.

You can read part 2 of this article here.

Aritra Chatterjee is a professional in the field of Data Science and Operation Management with more than 5 years of experience. He aspires to develop his skills in the fields of Automation, Data Science and Machine Learning.

Article Url - https://www.analyticsvidhya.com/blog/2017/09/comparative-stock-analysis/