Uploaded by Mayur Kapur

comparative-stock-analysis

advertisement
Comparative Stock Market Analysis in R using Quandl &
tidyverse - Part I
D AT A E XPLO RAT I O N
D AT A M I NI NG
D AT A VI S UA LI Z AT I O N
LI BRA RI E S
M A C HI NE LE A RNI NG
PRO G RA M M I NG
R
S T O C K T RA D I NG
S T RUC T URE D D AT A
S UPE RVI S E D
T I M E S E RI E S F O RE C A S T I NG
Introduction
What differentiates the best data scientists from others? It is their focus on application of data science.
The best data scientists I know of, see data science and its application every where they look. They look at
this world as an outcome of flow of data and information.
On the other hand, most beginners often ask the question – how do we apply our learning on real life
problems?
In this post (and another one following this), I have picked up a real life dataset (Stock Markets in India)
and showed how I would use this data to come out with useful insights.
I hope that you will find this useful. The idea is show the vast opportunities present in data science in a
simple yet powerful manner. If you can think of more examples like this – let me know in comments below!
For the best results, I would strongly recommend to build the application yourself as you follow the
tutorial.
Objective of this Tutorial
In this article, we will analyze stock market in banking segment based on the bank stocks which are listed
in NSE India. Our objective is to find the trends (Seasonal or cyclic) in banking stocks.
In our comparative analysis we will use several packages and the primary focus will be on tidy verse
package. The emphasis will be given on grouping with the help of tibble dataframe from tidy verse
package. This will help to perform similar operation on multiple groups at a time, hence reducing the code
length and computational time.
This article also focuses on API Key, database code search using quandl, and finally how to directly
download the data from R Console.
So lets get started!
Note: The code that has been mentioned below is to be run on the R command line for best results.
Table of Contents
Setting up the system
Getting Started with Comparative Analysis
Creating the dataset
Visualizing the monthly prices
Discovering the Relation between Total Traded Quantity vs Close Price
Finding the Density Distribution of Deviation of High Price from Open Price
Observing the Autocorrelation lags
Setting Up The System
There are a few things you should take care of before you go on further. Below mentioned are the
packages you need to install in the system
Quandl for Data Download
timetk to coerce the dataframe into xts
tidyverse to Use tibble for grouping and perform single operation on multiple groups
tidyquant for Time Series and Financial functions to perform the analysis
gglot for Plotting and Visualization
gganimate to plot the monthly prices. To get more information on gganimate, please read my previous
post on Analytics Vidhya
forcats for modification of factor levels
stringr for string use
If you don’t have any of the packages, then use the below code to install the packages. Modify the
packages variable if any of the above packages are already installed.
pckgs<-c("Quandl","Sweep","tidyverse","tidyquant","ggplot","forcats","stringr")
install.packages(pckgs,dependencies = TRUE)
You can then call the necessary packages using the code below
library(Quandl)
library(tidyverse)
library(ggplot2)
library(tidyquant)
library(timetk)
library(forcats)
library(stringr) library(gganimate) library(plyr) library(stringr) library(gridExtra)
Getting Started with Comparative Analysis
Creating the Dataset
We will be using Quandl is online repository for the core financial, macroeconomic statistics and forex.
Quandl has a vast collection of free and open data collected from a variety of organizations: central banks,
governments, multinational organizations and more. You can use it without payment and with few
restrictions.
Both Free and Premium data are available. Authenticated Free users have a limit of 300 calls per 10
seconds, 2,000 calls per 10 minutes and a limit of 50,000 calls per day. Premium data subscribers have a
limit of 5,000 calls per 10 minutes and a limit of 720,000 calls per day.
We will use this online repository to get our data using “Quandl” package directly from the R Console.
Quandl package directly interacts with the Quandl API to offer data in a number of formats usable in R,
downloading a zip with all data from a Quandl database, and the ability to search.
For More information on Quandl Package, please visit this page.
To get started with Quandl, create an account and get the quandl API key. Please click here to create an
account. Then click on the Login button provided on the top right corner of the screen. Once the
registration is complete, please click here to get the API Key.
In our analysis, we have selected following banks
ICICI
BOB
CANARA
AXIS
SBI
PNB
We have selected these banks as they are in the price band of Rs 200 to Rs 500. We will use the following
codes to get the data into R console.
Quandl(Code=“NSE/—”,collapse=“—”,start_date=“—-”,type=“…”)
The parameters we use are as follows:
Code Dataset code on Quandl specified as a string or an array of strings.
collapse Collapse frequency of Data.Eg; “daily”, “monthly”, “weekly”, “yearly”.
star t_date Desired Start Date
type Type of data returned specified as string. Can be ‘raw’, ‘ts’, ‘zoo’, ‘xts’ or ‘timeSeries’
Now we will download the data, add a column “Stock” for the stock identifier, and then we paste the
respective stock name in the downloaded dataset. We will then consolidate all stock data into one Master
Data frame for analysis
## Setup the Quandl Free Account and API Key, Please copy and paste the API key in order to #authenticate
Quandl.api_key("<Your-API-Key>")
##
Download
the
data
Set
ICICI
=
Quandl("NSE/ICICIBANK",collapse="daily",start_date="2016-09-01",type="raw")
PNB=
Quandl("NSE/PNB",collapse="daily",start_date="2016-09-01",type="raw")
Axis=Quandl("NSE/AXISBANK",collapse="daily",start_date="2016-09-01",type="raw")
Canara=Quandl("NSE/CANBK",collapse="daily",start_date="2016-09-01",type="raw")
BOB=Quandl("NSE/BANKBARODA",collapse="daily",start_date="2016-09-01",type="raw")
SBI=Quandl("NSE/SBIN",collapse="daily",start_date="2016-09-01",type="raw")
in
Datasets
using
cbind(Axis,Stock="")
Paste
the
stock
cbind
command
ICICI<-cbind(ICICI,Stock="")
SBI<-cbind(SBI,Stock="")
name
in
stock
column
##
Add
another
("Stock")
PNB<-cbind(PNB,Stock="")
Canara<-cbind(Canara,Stock="")
coloumn
Axis<-
BOB<-cbind(BOB,Stock="")
ICICI$Stock<-paste(ICICI$Stock,"ICICI",sep="")
##
PNB$Stock<-
paste(PNB$Stock,"PNB",sep="")
Axis$Stock<-paste(Axis$Stock,"Axis",sep="")
SBI$Stock<-
paste(SBI$Stock,"SBI",sep="")
Canara$Stock<-paste(Canara$Stock,"Canara",sep="")
BOB$Stock<-
paste(BOB$Stock,"BOB",sep="")
##
Consolidate
under
one
dataset
Master_Data<-
rbind(ICICI,PNB,Axis,SBI,Canara,BOB)
Visualization of monthly prices
Let us look at Monthly and Daily price pattern for Stocks using ggplot package. For this we will need to
group the master dataframe according by Stock.
We have heavily manipulated the theme section of ggplot to get the desired plot. More information on plot
is provided here.
## Visualisation in ggplot2 ("Comparative Visulisation of Close Price listed on NSE") ## Convert the dates
into
character
in
order
to
as.character(Master_Data$Date)
strsplit(Master_Data$Date,"-")
split
##
##
the
Split
Convert
coloumn
the
the
colnames(Master_Date1)<-c("Year","Month","Day")
into
date
"Y"
and
list
into
##
Column
"m"
create
dataframe
bind
"dd""
a
columns
list
for
library(plyr)
with
the
Master_Data$Date<the
same
list<-
Master_Date1<-ldply(list)
main
dataframe
Master_Data<-
cbind(Master_Data,Master_Date1) names(Master_Data) ## Change the scale for Traded Quantity Master_Data$`Total
Trade
Quantity`<-Master_Data$`Total
Trade
Quantity`/100000
Master_Data$Date<-as.Date(Master_Data$Date)
##
##
Convert
Visualisation
with
ggplot(Master_Data,aes(factor(Stock),Close,color=Stock,frame=Month))
+
the
Date
to
Bubble
as.Date()
Plot
geom_jitter(aes(size
P<-
=
Close,
colour=Stock, alpha=.02)) + ylim(0,1000)+ labs(title = "Bank Stock Monthly Prices", x = "Banks", y= "Close
Price") + theme(panel.border = element_blank(), panel.grid.major =
0.5,
linetype
plot.title
=
=
"dotted"),
panel.grid.minor
element_text(hjust
=
=
element_line(colour
element_blank(),
=
"grey61",
size
=
axis.line=element_line(colour="black"),
0.5,size=18,colour="indianred4"))+
gganimate(P,'Price_Range.gif',ani.width=600,ani.height=400,interval=1)
theme(legend.position="none")
##
Group
By
Stock
P1<-
Master_Data<-
Master_Data%>% tibble::as.tibble()%>% group_by(Stock) ## Visualisation for Daily Stock Prices Master_Data %>%
ggplot(aes(x = Date, y = Close, color = Stock)) + geom_point() +
"Month",y="Close
Price")
+
facet_wrap(~
scale_fill_tq(fill="green4",theme="light")
panel.grid.major
element_blank(),
=
element_line(colour
=
+
Stock,
ncol
theme_tq()
"grey61",
size
=
+
0.5,size=18,colour="indianred4"))+ theme(legend.position="none")
=
=
3,
"Daily
scale
theme(panel.border
0.5,
axis.line=element_line(colour="black"),
labs(title
linetype
plot.title
=
=
=
"dotted"),
=
Close
Price",
x
"free_y")
=
+
element_blank(),
panel.grid.minor
element_text(hjust
=
=
Discovering the Relation between Total Traded Quantity vs Close Price
Usually, traded quantity increases if the stock price increases or decreases too rapidly on a given day. This
parameter is important for our model for prediction. So we should take some time out to identify the
relation between them in our data.
## Traded Quantity vs Price z<-Master_Data %>% ggplot(aes(x = `Total
Stock,frame=Month)) +
geom_smooth(method='loess')
+
xlim(0,400)+
Trade
labs(title
Quantity`, y = Close, color =
=
"Monthly
Traded
Quantity
vs
Price", x = "Traded Quantity (Lacs)",y="Close Price") + facet_wrap(~ Stock, ncol = 3, scale = "free_y") +
scale_fill_tq(fill="green4",theme="light")
panel.grid.major
=
element_blank(),
+
theme_tq()
panel.grid.minor
=
+
theme(panel.border
element_blank(),
plot.title
=
=
element_blank(),
element_text(hjust
=
0.5,size=18,colour="indianred4"), axis.line = element_line(colour = "black"))+ theme(legend.position="none")
z1<-gganimate(z,'Quantity_Price.gif',ani.width=600,ani.height=400,interval=0.7)
We have an idea of trend of the stock price, but not much is clear from the Monthly prices. Axis Bank share
price improved in september and stayed at Rs750 for a month. whereas all other Banks were consistent
and did not show much of volatility.
Finding the Density Distribution of Deviation of High Price from Open Price
Now we will see the density distribution of High Price from Open Price in order to get an understanding
that how much price is deviating in either direction (North or South) on weekly basis. This gives us an idea
of price range for any stock in intraday trading.
We will use the transmute_tq() function from tidyquant package to compute the weekly prices. Please click
here to get more information.
For this add a new column with the difference of high and open price using mutate function. Add another
new column with the difference of low and open price using mutate function. Calculate the weekly average
of differences using “tq_transmute()” function from tidyverse package. Visualize both density plots with
dot distribution on ggplot
##
Deviation
from
High
&
Low
Price
Master_Data_High<-Master_Data%>%mutate(Dev_High=High-Open)
Master_Data_Low<-Master_Data%>%mutate(Dev_Low=Open-Low)
Master_Data_High_Week <- Master_Data_High %>%
apply.weekly,
FUN
= mean,
##
Computation
tq_transmute(
na.rm
of
weekly
select
= TRUE,
average
for
= Dev_High,
high
mutate_fun =
col_rename = "Dev_High_Mean"
Computation weekly average for Low Price Master_Data_Low_Week<-Master_Data_Low%>%
tq_transmute(
select
na.rm = TRUE,
= Dev_Low,
col_rename
=
mutate_fun = apply.weekly,
"Dev_Low_Mean"
)
##
FUN = mean,
Visualisation
of
density
Price
distribution
of
High
Price
) ##
High<-
Master_Data_High_Week%>%ggplot(aes(x=Dev_High_Mean,color=Stock))+
geom_dotplot(binwidth=0.50,aes(fill=Stock))+
xlim(0,10)+
"#E69F00","#CC9933","#99FF00","#CC3399","#FF9933"))+
Open
Price",x="Weekly
Mean
labs(title="Distribution of High Price Deviation from
Deviation")+
scale_color_tq(values=c("#999999"))+
scale_fill_manual(values=c("#999999",
facet_wrap(~Stock,ncol=3,scale="free_y")+
theme_tq()+
theme(panel.border
=
element_blank(),
panel.grid.major = element_line(colour = "grey61", size = 0.5, linetype = "dotted"),
= element_blank(),
axis.line=element_line(colour="black"),
0.5,size=16,colour="indianred4"))+
of
Low
Price
theme(legend.position="none") ## Visualisation of density distribution
xlim(0,10)+
"#E69F00","#CC9933","#99FF00","#CC3399","#FF9933"))+
Open
plot.title = element_text(hjust =
Low<-Master_Data_Low_Week%>%ggplot(aes(x=Dev_Low_Mean,color=Stock))+
geom_dotplot(binwidth=0.50,aes(fill=Stock))+
from
Price",x="Weekly
Mean
scale_color_tq(values=c("#999999"))+
Deviation")+
theme_tq()+
scale_fill_manual(values=c("#999999",
labs(title="Distribution of Weekly Low Price Deviation
facet_wrap(~Stock,ncol=3,scale="free_y")+
theme(panel.border
=
element_blank(),
panel.grid.major = element_line(colour = "grey61", size = 0.5, linetype = "dotted"),
= element_blank(),
panel.grid.minor
axis.line=element_line(colour="black"),
0.5,size=16,colour="indianred4"))+
panel.grid.minor
plot.title = element_text(hjust =
theme(legend.position="none") ## Arrange grid.arrange(High,Low,ncol = 2,
nrow = 1)
Observing the Autocorrelation lags
The lag operator (also known as backshift operator) is a function that shifts (offsets) a time series such
that the “lagged” values are aligned with the actual time series. The lags can be shifted any number of
units, which simply controls the length of the backshift.
Here, “k” is denoted as lag. We will see the lag of 180 days period and see how stocks behave.
These are the steps for Computation
Define k lag period
Create columns for lag periods
Group the data by Stock by creating new data frame for lags
Apply lag.xts using tq_mutate() function on the new dataframe
Apply Auto-correaltion
k <- 1:180 col_names <- paste0("lag_", k) ## Only Select Columns "Date" and "Close" from hte master data
frame.
Master_Data_lags<-Master_Data%>%
tibble::as_tibble()
Master_Data_lags%>%select(Date,Close)
#
Apply
Master_Data_lags%>%
=
Close,
tq_mutate(
select
lag.xts
%>%
group_by(Stock)
function
mutate_fun
=
using
lag.xts,
Master_Data_lags<-
tq_mutate
k=1:180,
Master_Data_lags<-
col_rename=col_names
)
#
Calculate the autocorrelations and 95% cutoffs Master_Data_AutoCorrelations<-Master_Data_lags %>% gather(key
= "lag", value = "lag_value", -c(Stock,Date, Close)) %>% mutate(lag = str_sub(lag, start = 5) %>% as.numeric)
%>% group_by(Stock, lag) %>% summarize( cor = cor(x = Close, y = lag_value, use = "pairwise.complete.obs"),
cutoff_upper
=
2/(n())^0.5,
Master_Data_AutoCorrelations
horizontal
line
a
y=0
cutoff_lower
%>%
=
-2/(n())^0.5
ggplot(aes(x
=
geom_hline(yintercept
lag,
=
0)
)
y
+
##
Visualisation
=
of
=
cor,
color
#
Plot
autocorrelations
Autocorrelation:
ACF
group
Stock))
+
geom_point(size
=
Stock,
=
Plot
#
Add
2)
+
geom_segment(aes(xend = lag, yend = 0), size = 1) + # Add cutoffs geom_line(aes(y = cutoff_upper), color =
"blue",
linetype
=
2)
+
geom_line(aes(y
=
cutoff_lower),
color
=
"blue",
linetype
=
2)
+
#
Add
facets
facet_wrap(~ Stock, ncol = 3) + # Aesthetics expand_limits(y = c(-1, 1)) + scale_color_tq() + theme_tq() +
labs( title = paste0("Tidyverse ACF Plot: Lags ", rlang::expr_text(k)), x = "Lags" ) + theme( legend.position
=
"none",
"grey61",
axis.text.x
size
=
=
element_text(angle
0.5,
linetype
=
45,
=
hjust
=
"dotted"),
1),
panel.grid.major
plot.title
=
=
element_line(colour
element_text(hjust
=
=
0.5,size=18,colour="indianred4") )
Its apparent from the ACF plot, that there is no weekly or monthly pattern.
End Notes
This article contains descriptive analysis of stocks in terms of Daily/Weekly Price fluctuations. It also
includes analysis on deviation from High and Low Price. The focus is also given on the relationship
between the daily traded quantity of shares & close price and to check for the relationship. In the later part,
the main focus is on xts package for the computation of Auto-correaltion. In the article, the focus is
provided on finding lag and acf plot using ggplot rather than using the conventional time series package.
This includes the analysis on ACF using different lags and to check if there is any pattern in the series.
You can read part 2 of this article here.
Aritra Chatterjee is a professional in the field of Data Science and Operation Management having experience
of more than 5 years. He aspires to develop skill in the field of Automation, Data Science and Machine
Learning.
Learn, engage,compete, and get hired!
Article Url - https://www.analyticsvidhya.com/blog/2017/09/comparative-stock-analysis/
Guest Blog
Download