A Machine Learning Approach: Network Analysis
of the NASDAQ Stock Market
by
Ali Bagheri (SNR: 2044174)
A thesis submitted in partial fulfillment of the requirements for the degree of
Bachelor in Econometrics and Operations Research
Tilburg School of Economics and Management
Tilburg University
Supervised by: Denis Kojevnikov
Date: January, 2023
Abstract
The primary aim of this paper was to explore the potential of network analysis of stock
markets for risk management purposes by leveraging state-of-the-art machine learning
techniques. Yearly networks of the 500 largest NASDAQ stocks by market capitalization
were constructed using the Pearson correlation coefficient and the Planar Maximally
Filtered Graph algorithm. Furthermore, a Random Forest classifier was built and trained
on the topological attributes of the stocks in the network to predict the stocks with
returns in the top quantile(s) of the next-day returns. The resulting predictive model
attained an average cross-validation score of 67.5% and therefore indicated some degree
of robustness in its generalization capacity. Characteristically, closeness and eigenvector
centrality contributed more significantly to the model's performance than other topological
measures. Lastly, a series of portfolios using the acquired predictive model were simulated
and compared to other types of portfolios, including the NASDAQ-100 index, for the
considered days.
Contents

1 Introduction
2 Literature Review
  2.1 Motivation
  2.2 Network Construction
  2.3 Network Topology
  2.4 Data Mining
  2.5 Classification Algorithms
  2.6 Random Forest Model
  2.7 Cross-Validation
  2.8 Hyper-parameter Optimization
  2.9 Feature Testing: Permutation Importance
3 Methodology
  3.1 Data Definition
  3.2 Network Topology
  3.3 Designing the Classification Model
4 Empirical Findings and Analysis
  4.1 Pre-Downsampling
  4.2 Post-Downsampling
  4.3 Variable Importance
  4.4 Portfolio Construction
5 Conclusion
6 Discussion
  6.1 The Network
  6.2 The Predictive Model
  6.3 Portfolio Construction
7 Appendix
  7.1 Appendix A: Visualization Networks
1 Introduction
In recent times, there has been a spike in studying complex natural and physical phenomena from the perspective of network science. Large intricate systems can be represented by
networks where individual entities, commonly referred to as nodes or vertices, interact with
one another. Characteristically, these vertices have scale-free relationships, meaning that
a relative change in a node of the network causes proportional changes in all of the linked
nodes and, thus, changes the aggregate behavior of the network (Albert and Barabási 2002).
Due to this topological property, large complex systems can be represented, with
a certain degree of approximation, as networks. These representations are used to gain insights
into complicated issues in various disciplines. The study of complex systems was the subject
of classical graph theory in the 20th century. However, in recent decades, exponential
advances in computing capacity, new robust statistical instruments, and
the availability of large amounts of data have shaped contemporary network science
(Dorogovtsev and Mendes 2002).
Network analysis could prove to be an additional robust risk-management tool to help analyze relationships and connections within the stock market that may not always be directly
apparent from traditional risk-management tools such as the widely used mean-variance
model, which relies heavily on statistical properties like expected return. A shortcoming
of these traditional methods is that they depend on the chronological consistency of stock
returns and volatility and could therefore lead to potentially inferior portfolios. For this
reason, it is worth examining whether network analysis could mitigate these shortcomings
and prove a valuable instrument for risk managers.
The primary aim of this paper is to explore the potential of network analysis of stock markets
for risk management purposes by leveraging state-of-the-art machine learning techniques.
We will apply existing knowledge about the topological properties of the NASDAQ stock
market to a machine-learning framework to create a risk management tool. This research
will involve multiple components that will be presented in the next section.
2 Literature Review

2.1 Motivation
Network analysis has been used to gain insights about intricate issues in various subjects
from a different perspective. For instance, Yang and Leskovec (2010) investigated the information diffusion process within social media platforms. Shirley and Rushton (2005)
researched the impact of network topology on disease spread to study the risk and evolution of epidemic spread. Similarly, exploiting network analysis in finance has also been a
recurring theme.
Fan et al. (2019) used the topology of product similarity to make predictions about the
evolution of the market. Namaki et al. (2011) explored and compared the topological structure of the emerging TSE stock market with the mature Dow Jones Industrial Average
(DJIA), confirming the scale-free properties of both markets' networks. Long, Zhang, and Tang (2017) demonstrated that stocks with high
values of topological measures, such as betweenness centrality and clustering coefficients, are linked
with higher systematic risk. Eng-Uthaiwat (2018) confirmed empirically, using a
network topological measure (the diameter), that idiosyncratic risk is not always diversified
away as commonly believed and that networks occasionally heighten the effect of idiosyncratic risk to cause aggregate fluctuations. He et al. (2022) have found that during abrupt
shock periods such as the Covid-19 crisis, the network structure exhibits particular topological indications of the systematic risk contribution that are consistent across different
markets. Yun, Jeong, and Park (2019) studied the topology of PageRank centrality to explore systematic risk and found that it was a more robust measure for systematic risk than
traditional risk measures such as Conditional Value at Risk (CoVaR) or Marginal Expected
Shortfall (MES) measures.
To fulfil the aim of this thesis, which is to investigate the potential value of network analysis for risk management by using a machine-learning framework, we will need to include
multiple components in our research.
Firstly, a robust method for constructing the network is investigated. Crucially, the
network should represent the whole equity market while simultaneously minimizing outliers
for bias prevention and filtering weak links to reduce the computational complexity of the
market. In addition, the network's topology is examined, and metrics of the network's
stocks are extracted as appropriate, such as centrality measures, clustering coefficients, or
degree distributions (Namaki et al. 2011; Long, Zhang, and Tang 2017).
Furthermore, a machine learning algorithm will be considered, namely the Random Forest
algorithm, to recognize the pattern of returns of the network's vertices with respect to their
topological measures and past returns, and to classify the position of the nodes relative to the
rest of the network. Moreover, it is crucial to explore the contribution of all the topological
measures by performing variable testing, also commonly referred to as feature importance
testing in the data-science field, to identify these measures' significance for the model at
hand. Subsequently, cross-validation and grid search will be employed to optimize
the classification model by preventing over- or underfitting and by tuning the relevant
hyper-parameters of the learning model to control the model's complexity.
2.2 Network Construction
In the financial literature regarding network analysis, the correlation of returns of individual stocks often serves as the foundation for constructing links in the network, because
such correlations capture the interconnectedness of the stock market (Namaki et al. 2011; He
et al. 2022; Eng-Uthaiwat 2018). Millington and Niranjan (2021) compared the construction of networks from Pearson correlation matrices and rank-based correlation matrices
like Kendall and Spearman and found virtually no difference in robustness between
the correlation coefficients. B. Podobnik et al. (2010) construct correlation matrices with a
time lag, unlike the frequently used zero-lag correlation matrices of Pearson, Kendall, and
Spearman. Boris Podobnik and Stanley (2008) propose a novel method of forming cross-correlation matrices based on detrended covariance.
Thus, it is possible to establish the edges of our network by using multiple criteria. However, as the Pearson correlation coefficient is widely used for financial network analysis,
this coefficient will be used in this paper because the exploration of alternative correlation
matrices is outside of the scope of this thesis.
Consequently, it is imperative to reduce the complexity of the network while still preserving
critical information about the interconnectedness of stocks by using a filtering algorithm. He
et al. (2022) use a threshold method: a certain threshold is set for the correlation between two
stocks, and an adjacency matrix is constructed in which only correlations exceeding the threshold are retained, while smaller correlations are discarded. Alternatively, another widely
employed filtering procedure is constructing a Minimum Spanning Tree (MST) from the
correlation matrix using algorithms such as Kruskal or Prim’s algorithm (Long, Zhang,
and Tang 2017; Millington and Niranjan 2021; Mantegna 1999). The algorithm begins by
sorting the correlations from large to small and forms edges corresponding to the highest
correlation one by one while ensuring that no cycles form. Hence, when a correlation produces a cycle, the given link is ignored, and the algorithm continues with the subsequent
highest correlation in the queue. MST results in the formation of n−1 edges when there are
n nodes. However, Eng-Uthaiwat (2018) claims that utilizing a Minimum Spanning Tree
approach to form the network is perhaps too strict, resulting in vast amounts of information loss. Therefore, Tumminello et al. (2005) propose an alternative approach to filtering
complex correlation-based graphs that preserves the hierarchical association of Minimum
Spanning Trees while retaining a larger amount of information than MST, named the Planar Maximally Filtered Graph (PMFG) algorithm. The PMFG algorithm results in the
formation of 3(n − 2) edges from n nodes.
Because PMFG is widely utilized within the financial network literature and is similar to,
yet less strict than, MST, it is the method opted for in this paper (Eng-Uthaiwat 2018;
Tumminello et al. 2005; Hong and Yoon 2022).
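To make the MST variant of this filtering step concrete, the Kruskal-style procedure described above can be sketched in pure Python. The tickers and correlation values are hypothetical, and this is a minimal sketch of MST filtering only, not the PMFG algorithm actually used in this thesis:

```python
def kruskal_mst(nodes, weighted_edges):
    """Build an MST by adding the strongest correlations first,
    skipping any edge that would close a cycle (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    mst = []
    # Sort correlations from large to small, as described above.
    for corr, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:          # the edge does not form a cycle
            parent[ru] = rv
            mst.append((u, v, corr))
    return mst

# Hypothetical correlations between four stocks:
edges = [(0.9, "A", "B"), (0.8, "B", "C"), (0.7, "A", "C"), (0.4, "C", "D")]
tree = kruskal_mst(["A", "B", "C", "D"], edges)
# n - 1 = 3 edges remain; the (A, C) edge is discarded because it closes a cycle.
```

The resulting tree retains n − 1 of the strongest links, which illustrates why MST filtering discards more information than the 3(n − 2) edges kept by PMFG.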
2.3 Network Topology
As aforementioned, studies on topological properties can help to understand both the local
behavior of individual entities and the network’s dynamic behavior. Namaki et al. (2011)
and He et al. (2022) consider a wide range of topological measures from graph theory, such
as centrality measures, degree distribution, and clustering coefficients, which are relevant
for emerging and mature markets. We will consider numerous topological measures and
compute them for all the nodes in the constructed network. First, the square N × N
adjacency matrix that represents the given network corresponding to the N given stocks is
defined as A. These topological measures include:
1. Eigenvector centrality: This measure indicates the impact of a vertex in a network
based on the influence of its neighbouring vertices. Essentially, the measure portrays
the extent of influence of a node concerning its links to other important nodes (Long,
Zhang, and Tang 2017; He et al. 2022; Bonacich 2007). It is iteratively calculated as
follows:
N
X
EVi = λ
Aij EVj
(1)
j=1
where λ represents the largest eigenvector of matrix A. A high EVi value indicates
that i node is linked to many other well-connected nodes. In this paper, 100 iterations
will be performed to attain the measures values for all the vertices of the constructed
network.
2. Degree centrality: This measure indicates the number of edges connected to the given
node relative to the total number of possible links (Long, Zhang, and Tang 2017; He
et al. 2022; Gong et al. 2019). The greater the degree centrality of a node, the more
connections the given node has relative to the rest of the nodes in the network.

D(i) = \frac{\sum_{j=1}^{N} A_{ij}}{N - 1} \qquad (2)
3. Betweenness centrality: This measure mirrors the number of times a node finds itself
on the shortest path between other nodes. Betweenness centrality portrays the extent
of dependence of other nodes on the given node (Freeman 1978; He et al. 2022).
BW(i) = \frac{2}{(N-1)(N-2)} \sum_{(j,l)} \frac{\beta_{jl}(i)}{\beta_{jl}}, \qquad j \neq i \neq l \qquad (3)

where \beta_{jl}(i) represents the number of shortest paths from node j to node l that pass
through node i, and \beta_{jl} reflects the number of shortest paths from node j to node l.
4. Closeness centrality: This centrality measure reflects the average shortest path of the
given node to all its connected nodes (He et al. 2022). Closeness centrality portrays
the closeness of a node to the core of the network.

C(i) = \frac{N - 1}{\sum_{j} \beta_{ij}} \qquad (4)

where \beta_{ij} represents the length of the shortest path from node i to node j.
5. Clustering coefficient: This measure illustrates how tightly the given node tends to
cluster together with other nodes in the network. The computation of the clustering
coefficient is based on the formation of triangles, where triangles are a form of a cycle
between three given nodes. The measure is then computed as the number of triangles
the given node is part of, divided by the maximum number of triangles the node could
form given its degree (Watts and Strogatz 1998; He et al. 2022; Wang et al. 2016).

CC(i) = \frac{1}{\sum_{j} A_{ij} \left( \sum_{j} A_{ij} - 1 \right)} \sum_{(j,k)} A_{ij} A_{jk} A_{ki} \qquad (5)
6. PageRank: Comparable to eigenvector centrality, which considers the quality of the
connected nodes, PageRank also considers the direction of an edge by exploiting the degree of
a node and assigning edges a particular weight (Brin and Page 1998). Traditionally,
PageRank is only defined for directed graphs. However, it is possible to transform an
undirected graph into an equivalent directed graph in order to compute the PageRank:
every edge of the undirected graph is replaced with two new directed edges of opposite
direction. Furthermore, the computation of PageRank involves an iterative approach.
First, all nodes in the graph are assigned an equal initial value of 1/N. Then, using the
following formula, a new PageRank is calculated for all vertices in the network:

PR(i) = \frac{1-d}{N} + d \sum_{j \neq i} \frac{PR(j)}{E_j} \qquad (6)

where E_j reflects the number of outgoing edges from node j. Additionally, d ∈ (0, 1)
reflects the damping factor, which is meant to compensate for some degree of randomness
in the connections between the nodes: only a proportion d of the information corresponding
to the incoming edges of a node is considered when computing its PageRank. In this paper,
100 iterations are considered for the computation of PageRank with a damping factor of 0.85.
Ultimately, the union of these measures captures the different facets of the interconnectedness of the entities in the entire network.
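To make the iterative computations above concrete, the power iterations behind eigenvector centrality (Eq. 1) and PageRank (Eq. 6) can be sketched in pure Python on a small undirected example graph. The graph is an assumption for illustration; the thesis itself computes these measures with the Networkx package:

```python
def eigenvector_centrality(A, iters=100):
    """Power iteration for Eq. (1): repeatedly multiply the score
    vector by the adjacency matrix and renormalize."""
    n = len(A)
    ev = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(A[i][j] * ev[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in nxt) ** 0.5
        ev = [v / norm for v in nxt]
    return ev

def pagerank(A, d=0.85, iters=100):
    """Iterative PageRank of Eq. (6); each undirected edge acts as two
    opposite directed edges, so E_j is simply the degree of node j."""
    n = len(A)
    deg = [sum(row) for row in A]
    pr = [1.0 / n] * n
    for _ in range(iters):
        pr = [(1 - d) / n + d * sum(A[j][i] * pr[j] / deg[j]
                                    for j in range(n) if A[j][i])
              for i in range(n)]
    return pr

# Example: triangle 0-1-2 with a pendant node 3 attached to node 0.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
ev, pr = eigenvector_centrality(A), pagerank(A)
# The hub (node 0) scores highest on both measures.
```

Node 0 is linked to every other node, so both measures rank it first, while the pendant node 3, connected only to the hub, ranks last on eigenvector centrality.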
2.4 Data Mining
Data mining is the action of finding (hidden) patterns and peculiarities in large amounts
of data. Due to the size, complexity, and fast-expanding nature of financial data, data
mining has become increasingly valuable in quantitative finance. When combining data
mining techniques with machine learning methods, it is possible to make computers learn
and comprehend the results from these mining processes and swiftly provide users with
outcome predictions or classify entities within the data.
2.5 Classification Algorithms
A significant subset of data mining and machine learning is formed by classification models, where
the aim is to determine which class, or group, an observation belongs to. Supervised classification algorithms revolve around constructing a learning model subject to a pre-labeled
training set in order to label unseen data correctly (Aly 2005). Furthermore, classification
algorithms can be categorized into binary and multi-class classification models. The former,
binary classification, entails that the model should distinguish between two class types,
commonly referred to as labels. In contrast, multi-class classification involves training the
model for multiple possible labels. In this study, binary classification will be implemented,
as it corresponds to a robust learning model for the financial data and the interrelation of
the variables of interest.
2.6 Random Forest Model
Random Forest is a machine-learning model for classification and regression tasks developed
by Leo Breiman (2001). It is an ensemble technique that incorporates the predictions of
numerous decision trees to create a robust predictive model widely known for its effectiveness as a supervised learning model (Biau and Scornet 2015).
Decision trees work by creating a tree-like model of decisions based on the values of the
predictor variables, commonly referred to as features in machine learning. At each internal
node, the model decides based on the value of a feature and then splits the data into different
branches based on the outcome of that decision. Specifically, this can be done by using a
particular importance measure called Gini Criterion which minimizes the impurity of a node
by selecting the feature that results in the highest proportion of the samples belonging to
the same class (Leo Breiman 1996). For a classification problem, the Gini value
corresponding to a particular node of a decision tree is realized in the following
fashion:

Gini(P) = \sum_{i} P_i (1 - P_i) \qquad (7)
where Pi reflects the frequency of class i concerning the sample at the given node. Finally,
the leaves of the trees, which are pure nodes, contain the final prediction for a given sample
and have a Gini value of zero. The decision tree algorithm works by recursively partitioning
the data into smaller and smaller subgroups based on the value of a feature. It chooses the
feature that maximizes the information gained at each step using the Gini Criterion, which
measures how well the feature can predict the target variable. The process continues until
the tree reaches a maximum depth or until all the leaves are pure, meaning they contain
only samples from a single class and thus have no child nodes (L. Breiman et al. 1984).
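The Gini-based splitting rule can be illustrated with a short sketch. This is a simplified single-feature threshold search for illustration, not the scikit-learn implementation:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of Eq. (7): sum_i P_i * (1 - P_i)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(feature, labels):
    """Pick the threshold on one feature that minimizes the weighted
    Gini impurity of the two resulting child nodes."""
    best = (float("inf"), None)
    for t in sorted(set(feature)):
        left = [y for x, y in zip(feature, labels) if x <= t]
        right = [y for x, y in zip(feature, labels) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        best = min(best, (score, t))
    return best

# A pure node has impurity 0; a 50/50 node has impurity 0.5.
assert gini(["good"] * 4) == 0.0
assert gini(["good", "bad"] * 2) == 0.5
# Splitting at 2 separates the classes perfectly (weighted Gini 0):
assert best_split([1, 2, 3, 4], ["bad", "bad", "good", "good"]) == (0.0, 2)
```

A real decision tree applies this search recursively over all candidate features at every internal node.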
In the Random Forest algorithm, the decision trees are fed and trained on random subsets of the training data (Leo Breiman 2001). Finally, the Random Forest aggregates the
predictions of all the decision trees in the forest and makes a prediction based on the majority vote (i.e., output) of all the individual trees. This process of bootstrap sampling and
aggregation, known as bagging, helps to reduce overfitting to the training set (i.e., reduces
the variance of the model) and thus improves the model's generalization ability (Biau and
Scornet 2015). In Figure 1, a summarized flowchart of the workings of a Random Forest
classifier is depicted (Wu, Gao, and Jiao 2019).
Figure 1: Overview Random Forest Classifier.
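The bootstrap sampling and majority-vote aggregation described above can be sketched as follows. This is a minimal illustration of the principle, not Breiman's full algorithm; the fixed stump "trees" are hypothetical stand-ins for fitted decision trees:

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw len(X) samples with replacement from the training set."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def forest_predict(trees, x):
    """Aggregate the trees' outputs: the most common label wins."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical 'trees' (fixed threshold stumps on one feature):
trees = [lambda x: "good" if x > 0 else "bad",
         lambda x: "good" if x > 1 else "bad",
         lambda x: "good" if x > -1 else "bad"]
# For x = 0.5 the votes are good, bad, good, so the forest predicts "good".
```

In the real algorithm, each tree is trained on its own bootstrap sample, which is what decorrelates the trees and reduces the ensemble's variance.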
There are a variety of classification models available, including logistic regression, support
vector machine, decision trees and their ensemble counterparts (e.g., Random Forest and
Gradient Boosting), and Bayesian models like the Gaussian Naïve Bayes algorithm. Each
model has unique characteristics and may be suitable for a given classification task. Naturally, the performance of different machine learning models is case-dependent due to differences in their statistical assumptions.
Random Forest has certain advantageous characteristic properties, namely,
1. Random Forest is a non-parametric model that does not make distributional assumptions.
2. It has been empirically shown that Random Forests are more resistant when highly
correlated explanatory variables are present and thus will result in a less biased model
compared to, for example, logistic regression (Lindner, Puck, and Verbeke 2022). This
is due to the bootstrap sampling property of the Random Forest and the fact that
not all predictor variables are used at once.
3. It can handle a relatively more extensive number of features since not all predictor
variables are used simultaneously.
4. Random Forests can also capture non-linear patterns in the data, unlike logistic regression or linear support vector machines, which assume linearity.
As the interconnectedness of financial assets is quite complex and non-linear, and their
distribution, in general, is unknown, the Random Forest algorithm is a robust method for
the aim of this thesis. Additionally, as aforementioned, it is more flexible in including
variables that exhibit multicollinearity. For instance, centrality measures of a network have
been empirically shown to be linearly correlated in some networks (Oldham et al. 2019).
However, by leveraging Random Forest, both of these variables may be included in the model,
and consequently both can be tested on whether they offer predictive value to the
classification model.
2.7 Cross-Validation
When constructing a robust learning model, it is imperative to evaluate the model's generalization performance and prevent over- or underfitting to the training set. A commonly
used method in statistics and machine learning is cross-validation.
Cross-validation is a statistical resampling method where a single dataset is partitioned
into k equal-sized subsets called folds (Sammut and Webb 2010). The learning model is then
executed k times; in the j-th iteration, the union of all folds except the j-th is utilized as
the training set, while the j-th fold serves as the test set. Finally, the scores of the learning
model over all iterations are averaged to attain a cross-validation score that serves as a
measure of the model's generalization ability. In the diagram below, a 10-fold cross-validation
scheme is depicted (Aatqb and Iqbal 2019).
Figure 2: Overview Cross-Validation procedure.
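A minimal sketch of the k-fold procedure follows. The trivial majority-class "learner" and plain-accuracy scorer are hypothetical placeholders for the Random Forest and balanced accuracy used later in the thesis:

```python
from collections import Counter

def kfold_indices(n, k):
    """Partition indices 0..n-1 into k (roughly) equal-sized folds."""
    folds, start = [], 0
    for j in range(k):
        size = n // k + (1 if j < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_score(fit, score, X, y, k=10):
    """In iteration j, fold j is the test set and the union of the
    remaining folds is the training set; the k scores are averaged."""
    scores = []
    for test_idx in kfold_indices(len(X), k):
        test = set(test_idx)
        train = [i for i in range(len(X)) if i not in test]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in test_idx],
                            [y[i] for i in test_idx]))
    return sum(scores) / k

# Toy 'learner': always predict the majority class of the training folds.
fit = lambda X, y: Counter(y).most_common(1)[0][0]
score = lambda model, X, y: sum(label == model for label in y) / len(y)

cv = cross_val_score(fit, score, list(range(10)),
                     ["good"] * 8 + ["bad"] * 2, k=5)
# Folds of size 2; the model predicts 'good' everywhere, so cv = 0.8.
```

Note that the folds here are contiguous for simplicity; shuffling or stratifying the folds is common in practice.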
2.8 Hyper-parameter Optimization
Depending on the complexity and purpose of the model, it is crucial to tune the parameters
of the learning model, often referred to as hyper-parameters, accordingly. There are several
ways to achieve this, including Bayesian optimization, gradient-based optimization, exact
search methods like grid or random search, or heuristics such as genetic algorithm or simulated annealing, which are advantageous when computing power is constraining (Feurer
and Hutter 2019).
For the aim of this thesis, the opted method of hyper-parameter optimization will be grid
search, as the learning model will not be exposed to an extensive dataset, and only a few
parameters of the Random Forest will be considered.
Grid search aims to find a set of hyperparameters for which the model’s performance is
best concerning a specific measure such as mean squared error or the balanced accuracy of
the target classes. Specifically, this is done by choosing a set of values, a grid, regarding
the relevant hyperparameters subject to the optimization. Finally, the learning model is
trained and evaluated for all the combinations of those parameters (Bergstra and Bengio
2012).
As aforementioned, the Random Forest algorithm will be leveraged in this paper. Several
relevant (hyper-)parameters can be tuned in a Random Forest model to control the intricacy
of the model, including:
1. The number of decision trees
2. The maximum depth of each tree: the maximum number of nodes from the tree's
root node to the farthest leaf node
3. The minimum number of samples required to split a node and the minimum number
of samples required at a leaf node (Biau and Scornet 2015)
4. The maximum number of features: this parameter regulates the number of predictor
variables considered when searching for the optimal split at each node in each tree
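A bare-bones grid search over such a parameter grid might look as follows. The evaluator is a dummy placeholder for, e.g., a cross-validated balanced accuracy, and the grid values are hypothetical; this is a sketch of the principle, not scikit-learn's GridSearchCV:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Evaluate the model for every combination of the grid values
    and keep the best-scoring parameter set."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        s = evaluate(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Hypothetical grid over two of the Random Forest parameters listed above:
grid = {"n_estimators": [100, 300, 500], "max_depth": [3, 5, 10]}
# Dummy evaluator favouring 300 trees of depth 5, for illustration only:
params, score = grid_search(
    grid, lambda p: -abs(p["n_estimators"] - 300) - abs(p["max_depth"] - 5))
# params == {"n_estimators": 300, "max_depth": 5}
```

The cost grows multiplicatively with the grid size, which is why grid search is reserved here for a small dataset and few parameters.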
2.9 Feature Testing: Permutation Importance
When constructing a particular model, it is paramount to examine which predictor variables
correlate significantly with the dependent variable and thus are relevant for explaining the
model.
Permutation importance is a technique widely used in machine learning for such purposes
and works in the following fashion: the machine learning model is fitted to the training data
and tested employing cross-validation. Afterward, the values of a single feature are randomly
rearranged, or permuted, potentially dropping the model's performance. This procedure is
repeated several times for all of the learning model's features. The change, or rather decrease,
in the model's performance concerning its (balanced) accuracy indicates the degree of
importance of the given feature for the dependent variable. In this fashion, the
proportion of the predictive value of each feature for the given model is attained (Leo
Breiman 2001). In a nutshell, the permutation importance, I, of feature l is realized as
follows:

I_l = \phi - \frac{1}{R} \sum_{r=1}^{R} \phi_{r,l} \qquad (8)

where \phi is the score of the trained model before permuting the features, R is the total number of iterations for which the permutation importance procedure is repeated, and \phi_{r,l} is the
score of the predictive model where the observations corresponding to feature l are randomly
shuffled in iteration r.
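Eq. (8) translates into a short sketch. The toy scoring function and data are hypothetical, for illustration only; in practice a fitted model's balanced accuracy plays the role of the score:

```python
import random

def permutation_importance(score, X, y, l, R=10, seed=0):
    """Eq. (8): I_l = phi - (1/R) * sum_r phi_{r,l}, where phi is the
    baseline score and phi_{r,l} the score after shuffling feature l."""
    rng = random.Random(seed)
    phi = score(X, y)                  # baseline score before permuting
    shuffled_scores = []
    for _ in range(R):
        Xp = [row[:] for row in X]     # copy, then permute column l only
        col = [row[l] for row in Xp]
        rng.shuffle(col)
        for row, v in zip(Xp, col):
            row[l] = v
        shuffled_scores.append(score(Xp, y))
    return phi - sum(shuffled_scores) / R

# Toy 'model' score: accuracy of predicting y from the sign of feature 0.
def score(X, y):
    return sum((row[0] > 0) == label for row, label in zip(X, y)) / len(y)

X = [[1, 9], [2, 9], [-1, 9], [-2, 9], [3, 9], [-3, 9]]
y = [True, True, False, False, True, False]
# Feature 1 is constant, so shuffling it never changes the score:
assert permutation_importance(score, X, y, l=1) == 0.0
```

Shuffling the informative feature 0, by contrast, generally degrades the score, giving it a positive importance.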
Notably, this procedure does not suggest whether the feature is appropriate for the given
problem in general; it is solely indicative of the trained model at hand. Another notable
drawback of permutation importance is that when there is a high degree of correlation between features, the importance of these features will be lower than their 'true' importance,
because when such a correlated feature is shuffled, the model will still acquire some information about the shuffled feature from its correlated counterpart. This drawback could be
mitigated by clustering the correlated features and only including one feature from each
cluster when performing the permutation importance procedure. Nevertheless, if a model
exhibits a limited amount of bias and a particular feature has a considerable permutation
importance value, the variable is likely of value to the predictive model.
Permutation importance will be implemented on the Random Forest that will be constructed
in this thesis to gain insight into the predictive value of the features tested and especially
concerning the network topological measures.
3 Methodology

3.1 Data Definition
The data used throughout this thesis originate from the financial database of the Center
for Research in Security Prices (CRSP). Multiple datasets are extracted from CRSP. First,
as the network changes dynamically, eight years of data between 2013 and 2022 are
considered; the year starting in 2015 is excluded due to a significant amount of missing
values. For each year, a network of the 500 biggest stocks with respect to market
capitalization on the NASDAQ stock exchange is constructed, resulting in eight networks
corresponding to the eight years of data. Each year of data starts on April 1st and ends
on the second-to-last working day of March of the following year. Illustratively, the first
network is based on daily records from April 1st, 2013 until March 27th, 2014. The variables
considered in the datasets are given in Table 1.
Variables                        Acronym
Company Name                     COMNAM
Ticker                           TICKER
CRSP Permanent Company Number    PERMCO
Security Status                  SECSTAT
Price                            PRC
Share Volume                     VOL
Holding Period Return            RET
Number of Shares Outstanding     SHROUT
Delisting Code                   DLSTCD

Table 1: List of extracted variables from the CRSP database for the considered years.
The first step in constructing our network is to define our nodes and edges. The nodes of
the given network correspond to the different stocks, whereas the edges correspond to the
distances between the different stocks. These distances will be based on the correlation of
the returns of all stocks. As mentioned in the literature review, the first phase of forming the
network is to compute the Pearson correlation matrix. Next, a distance matrix
is attained from the resulting correlation matrix, which will serve as the foundation of the
Planar Maximally Filtered Graph algorithm. First, the natural logarithm of the prices of
all stocks, s = 1, ..., N, is taken for each period, where P_s(t) denotes the price of stock s
at time period t. Then, the log prices of consecutive periods are subtracted from each other
for all t = 1, ..., T to attain the return rates of all the periods successively.

r_s(t) = \ln \frac{P_s(t+1)}{P_s(t)} \qquad (9)
Afterward, using the returns of all the stocks for the different periods, the Pearson coefficient
is used to compute the correlation matrix of the considered stocks.
C_{i,j} = \frac{\sum_{t=1}^{T} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{T} (x_{it} - \bar{x}_i)^2 \sum_{t=1}^{T} (x_{jt} - \bar{x}_j)^2}} \qquad (10)
Subsequently, the symmetric N × N correlation matrix is transformed by the formula
d_{ij} = \sqrt{2(1 - C_{ij})}, ensuring multiple properties: higher correlations correspond to
smaller distances and hence more similarity between the nodes. Moreover, d_{ij} equals d_{ji},
and thus symmetry still holds. Lastly, self-loops resulting from the diagonal entries are
discarded. The next phase is to apply the Planar Maximally Filtered Graph algorithm
to the distance matrix to attain the network. The network construction is implemented
using Python 3.10 with the Pandas, Numpy, Networkx, and Planarity packages.
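The pipeline of Eq. (9), Eq. (10), and the distance transform can be sketched as follows, with a hypothetical price series for illustration; the PMFG filtering step itself is omitted:

```python
import math

def log_returns(prices):
    """Eq. (9): r_s(t) = ln(P_s(t+1) / P_s(t))."""
    return [math.log(prices[t + 1] / prices[t])
            for t in range(len(prices) - 1)]

def pearson(x, y):
    """Eq. (10): Pearson correlation coefficient of two return series."""
    T = len(x)
    mx, my = sum(x) / T, sum(y) / T
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def distance(c):
    """d_ij = sqrt(2 * (1 - C_ij)): high correlation -> small distance.
    The max() clamps tiny negative values from floating-point round-off."""
    return math.sqrt(max(0.0, 2 * (1 - c)))

# A series is perfectly correlated with itself, so its distance is ~0;
# perfect anti-correlation maps to the maximum distance of 2.
r = log_returns([100.0, 101.0, 103.0, 102.0])
```

Applying `pearson` to every pair of return series fills the N × N correlation matrix, and `distance` maps it to the distance matrix fed to the PMFG step.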
3.2 Network Topology
Subsequently, the network topology measures must be extracted from the attained network.
With reference to the literature review, the betweenness centrality, degree centrality, closeness
centrality, eigenvector centrality, PageRank, and clustering coefficient of all the nodes
in the network are computed and stored.
3.3 Designing the Classification Model
The algorithm aims to predict whether the return of a stock in the next period will be in
the top Z quantile with respect to all the other stocks by feeding it various input variables,
such as the returns of multiple periods before and the topological network measures of the
given stocks.
For the model’s design, it is imperative to construct the data set on which the machine
learning algorithm will be trained in a practical manner. The structure will be as follows:
each row will correspond to a different stock, while the columns will contain returns from
multiple periods and topology measures. After structuring the dataset in the desired format, the stocks will be labeled according to the last return period of that year.
The labeling procedure is done in the following fashion: specific quantiles of all the returns
in the last period are calculated. Subsequently, stocks with returns below the Z quantile of
all the returns will be labeled as ’Bad Relative Position.’ Alternatively, stocks with returns
exceeding the Z quantile will be labeled ’Good Relative Position.’
\text{Class} = \begin{cases} \text{Good Relative Position}, & \text{if } r_s(T) > Q_Z \\ \text{Bad Relative Position}, & \text{otherwise} \end{cases}    (11)

where Q_Z denotes the Z quantile of all last-period returns.
The relative position of a stock is the target variable on which our algorithm is trained, such that it can predict the relative position for the next day. The resulting pre-labeled dataset will be the training set of the Random Forest algorithm to predict the class of a stock, i.e., whether a stock has a Good or Bad Relative Position. Additionally, these relative positions can indicate which stocks are suitable to invest in for the next period, in this case, the next day. As an illustration, a training set will contain the variables presented in Table 2.
Features
Returns of periods T-10 until T-1
Volumes of periods T-3 until T-1
PageRank
Clustering Coefficient
Closeness Centrality
Betweenness Centrality
Degree Centrality
Eigenvector Centrality
Class of the Stock (target variable)
Table 2: List of features (variables) used to train the Random Forest classifier; the class of the stock is the target variable.
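The labeling rule of Eq. (11) can be sketched in pandas as follows; the function name and label strings mirror the text, and `z` is the quantile threshold:

```python
import pandas as pd

def label_relative_position(last_returns: pd.Series, z: float = 0.75) -> pd.Series:
    """Label each stock by comparing its last-period return to the Z quantile."""
    threshold = last_returns.quantile(z)  # Q_Z of all last-period returns
    return (last_returns > threshold).map(
        {True: "Good Relative Position", False: "Bad Relative Position"}
    )
```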
Furthermore, after constructing the Random Forest algorithm for the training set at hand, cross-validation will be performed for different fold values using a scoring measure called balanced accuracy. Balanced accuracy is the average of the model's per-class accuracies, i.e., the mean of the proportions of correctly predicted observations in each class. In the case of binary classification, the balanced accuracy (ϕ) is defined as follows:
\phi = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)    (12)
where T P , T N , F P , and F N reflect the number of true positives, true negatives, false
positives, and false negatives, respectively. Moreover, positive refers to the class where
stocks have a good relative position, and negative refers to the class where stocks have a
bad relative position.
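Eq. (12) translates directly into code (for binary labels, scikit-learn's `balanced_accuracy_score` computes the same quantity from predictions):

```python
def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Eq. (12): mean of sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))
```

For instance, 40 true positives, 60 true negatives, 40 false positives, and 10 false negatives yield ϕ = 0.5(0.8 + 0.6) = 0.7, even though plain accuracy would be pulled toward the larger class.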
It is essential to use this scoring measure instead of precision, as the training set is imbalanced and thus contains more 'bad' stocks; the model would, in that case, be biased toward classifying stocks as 'bad.' Lastly, with the attained optimal number of folds for our training set, a grid search is performed to optimize the hyper-parameters of the Random Forest, including the number of estimators, the maximum depth of a tree, and the maximum number of features considered in a tree.
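The grid search over these three hyper-parameters can be sketched with scikit-learn; the thesis names the tuned hyper-parameters but not the candidate values, so the grid below is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical candidate values for the three hyper-parameters named in the text.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, None],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="balanced_accuracy",  # the scoring measure of Eq. (12)
    cv=10,                        # the fold count used in the thesis
)
# After search.fit(X_train, y_train), search.best_params_ holds the tuned values.
```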
Finally, multiple Z values will be considered to examine the algorithm's sensitivity to the threshold that decides whether a stock has a bad or good relative position. The threshold values of Z considered are the 0.55, 0.65, 0.75, and 0.85 quantiles.
4 Empirical Findings and Analysis
4.1 Pre-Downsampling
After implementing the proposed methodology, the Random Forest algorithm is fitted to the training data and a 10-fold cross-validation procedure is implemented. The following results have been attained, displayed using confusion matrices: diagrams that summarize the performance of a classification model by showing the shares of true positive, true negative, false positive, and false negative predictions.
Figure 3: The confusion matrices corresponding to the results of the given Random Forest
model on the test set without downsampling the imbalanced training set for different
values of the quantile threshold Z.
From the confusion matrices for the different Z quantiles, it is clear that as the threshold quantile increases, the training set becomes more imbalanced and, therefore, the Random Forest model's bias increases. For this reason, the larger class is downsampled to the size of the smaller class when training the machine learning model, while the imbalance is kept when testing.
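A minimal downsampling sketch in pandas, assuming the class labels live in a column of the training frame (all names are illustrative):

```python
import pandas as pd

def downsample_majority(df: pd.DataFrame, label_col: str, seed: int = 0) -> pd.DataFrame:
    """Downsample the larger class to the size of the smaller one."""
    counts = df[label_col].value_counts()
    minority, n = counts.idxmin(), counts.min()
    parts = [
        group if name == minority else group.sample(n=n, random_state=seed)
        for name, group in df.groupby(label_col)
    ]
    # Shuffle so class blocks are not contiguous in the training set.
    return pd.concat(parts).sample(frac=1.0, random_state=seed)
```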
4.2 Post-Downsampling
After performing the downsampling of the stocks with the bad relative position, the subsequent confusion matrices are acquired for different values of Z.
Figure 4: The confusion matrices corresponding to the results of the given Random Forest
model on the test set after downsampling the imbalanced training set for different values
of the quantile threshold Z.
The classification model is more robust after downsampling the larger class and is less biased, with a cross-validation score of 67.5% across all 10 folds of the training set, averaged over all values of Z. Additionally, the classification model also has a balanced accuracy of 67.5% on the test set, averaged over all values of Z. Furthermore, after downsampling the larger class, the learning model's accuracy is significantly less sensitive to different values of the threshold Z. This could be explained by the fact that the algorithm is already rather strict in classifying stocks as good. Another noticeable insight the confusion matrices reveal is that the Random Forest algorithm tends to become greedier toward the good class as the quantile threshold Z increases.
Given that the algorithm is trained on a dataset of 4,500 records with 10-fold cross-validation, the cross-validation score of the learning model, which is representative of its generalizability, is acceptable enough to gain insights into the predictive value of the network topology measures. However, one should keep in mind that the bias of the given model is not negligible and that a certain degree of omitted variable bias is likely to be present.
4.3 Variable Importance
Moreover, it is essential to implement the permutation importance procedure (section 2.9) to examine each feature's share of the model's predictive value. For each feature, the procedure is repeated twenty times. The following graphs depict the feature importance for all values of Z.
Figure 5: The bar charts depicting the permutation importance of all predictor variables
for different values of the quantile threshold Z corresponding to the constructed Random
Forest model.
Firstly, as Z increases, more features become relevant for the given Random Forest classifier. From the confusion matrices in the previous section, it was also observed that the algorithm became greedier toward picking the 'good' class as Z increased, which could potentially explain why the variables' importance became more significant for the model's performance.
Furthermore, for all values of Z, closeness and eigenvector centrality contribute the most to the model's performance compared to the other network topological properties. As mentioned in section 2.3, these network measures can be highly correlated. Especially for the .55 and .65 quantiles, where the algorithm is relatively more balanced in its bias toward either class, multicollinearity is more visible and could have led to the non-significance of PageRank, degree centrality, betweenness centrality, and the clustering coefficient for the model at hand. Even for the .75 and .85 quantiles, these four topology measures still have relatively low significance compared to closeness and eigenvector centrality, indicating potential collinearity between the pair of closeness and eigenvector centrality and the remaining topology measures.
Lastly, the model's predictive performance stays nearly the same when the Random Forest algorithm is run without the variables deemed non-significant by the permutation importance results. Moreover, the importance of the remaining topology measures then becomes more pronounced, again suggesting the presence of multicollinearity between the topology measures.
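The procedure above can be sketched with scikit-learn's `permutation_importance`. Synthetic data stands in for the thesis dataset; the twenty repeats match the text, while scoring with balanced accuracy is my assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the labeled stock dataset (six features, like the
# six topology measures).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Twenty shuffles per feature; importances_mean[k] is the mean drop in the
# score when feature k is permuted.
result = permutation_importance(
    model, X, y, n_repeats=20, scoring="balanced_accuracy", random_state=0
)
```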
4.4 Portfolio Construction
In order to assess the applicability of the machine learning model, it is possible to construct a portfolio for the next day using the built model and to examine whether the portfolio reflects, or is even capable of beating, the market. The latter depends on how severe the error of the machine learning algorithm is when it misclassifies a stock. If, for a threshold value of the .85 quantile, the algorithm misclassifies a stock belonging to a quantile greater than or equal to the .5 quantile, then the resulting portfolio would still be adequate. On the other hand, if the misclassified stocks had belonged to a quantile below the .5 quantile, the resulting portfolio would be less satisfactory. Similarly, suppose the learning model were relatively good at identifying the top .05 quantile. In that case, these returns could mitigate the adverse effect of the misclassified stocks in the portfolio.
To examine the practicality of the learning model for risk-management purposes, numerous network portfolios have been constructed for each year and compared to three other kinds of portfolios. The types of constructed portfolios are as follows:
1. Network portfolios: Portfolios consisting of stocks predicted to have a good relative
position for the next day established by the Random Forest model based on network
topological attributes of the considered stocks.
2. Random Stock Portfolios: Portfolios consisting of randomly chosen stocks from the 500 largest stocks by market capitalization.
3. Average Stock Portfolios: Portfolios consisting of stocks that all have the average
return of the predicted day as their return. These portfolios represent the trend of
the NASDAQ stock market in the given days.
4. NASDAQ-100 Index: A portfolio consisting only of the NASDAQ-100 index, an index of the 100 largest non-financial companies listed on the NASDAQ stock market by market capitalization.
The network portfolios are constructed using sliding windows, meaning that all the data
is used to train the Random Forest classifier except the year in which the desired day lies.
For instance, to predict the stocks with good relative positions on 30th March 2022, the
algorithm is evidently not trained on that year’s data. Furthermore, only Z = 0.75 quantile
is considered in this section, i.e., the classification model is trained on data where the stocks
with returns in the top 0.25 quantile have the class of good relative position.
Finally, as mentioned in section 3.1, eight years of data corresponding to eight yearly networks have been considered in this study. Therefore, eight days have been predicted using
this model, corresponding to the penultimate working day of March.
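The leave-one-year-out (sliding window) scheme can be sketched as follows; the column names and function name are assumptions for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def sliding_window_predict(data: pd.DataFrame, target_year: int,
                           feature_cols, label_col="class", year_col="year"):
    """Train on every year except `target_year`, then predict that year."""
    train = data[data[year_col] != target_year]
    test = data[data[year_col] == target_year]
    model = RandomForestClassifier(random_state=0)
    model.fit(train[feature_cols], train[label_col])
    return pd.Series(model.predict(test[feature_cols]), index=test.index)
```

The network portfolio for a given day would then hold the stocks predicted 'good' for that day.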
Figure 6: The different types of simulated portfolios for the considered years and their
corresponding returns in percentages.
From the eight next-day predictions of the Random Forest algorithm and the corresponding network portfolios, six of the attained portfolios consisting of 100 stocks result in significantly higher returns than the NASDAQ-100 index. These network portfolios and their corresponding returns indicate that the constructed machine-learning model is relatively suitable for portfolio construction for the given days. However, in 2020 and 2022, the network portfolio of 100 stocks had lower returns than the NASDAQ-100. In 2020, the Network Portfolio still had a greater return than the corresponding Average Return Stock Portfolio consisting of 100 stocks. Hence, the return of the Network Portfolio was still greater than the 0.5 quantile of all returns, since, in an uptrend, the average return exceeds the median return.
However, the NASDAQ-100 still outperformed the Network Portfolio in 2020. A potential reason is that the Covid-19 pandemic hit at the start of 2020, after which huge NASDAQ-100-listed tech companies such as Tesla, Amazon, and Meta realized historically immense returns. Tesla, for instance, returned almost 740% in 2020.
In 2022, the Network Portfolio of 100 stocks performs significantly worse than the NASDAQ-100 and slightly worse than the Average Return Stock Portfolio. A potential explanation for the subpar performance is the steep downtrend of 2022. In contrast, except for 2018, the NASDAQ stock market was in a moderate uptrend in every other considered year. Consequently, the given Random Forest classifier has been trained predominantly on periods of an uptrend and is potentially less adequate in times of a downtrend. In 2018, the more 'optimistic' trait of the machine learning model was less problematic, but in 2022 the algorithm's greediness is much more perceptible and costly.
Additionally, these portfolios have been constructed with the threshold quantile value Z = 0.75. Furthermore, it was observed earlier that the algorithm became greedier in classifying stocks as 'good' as the threshold quantile value Z increased. Therefore, a potential factor in the relatively worse performance during the downtrend is the threshold quantile Z used. Thus, the value of Z should be chosen with the market trend in mind when constructing portfolios.
5 Conclusion
This paper aimed to examine the potential of network analysis for risk-management purposes when constructing stock portfolios, bridging the gap between theoretical knowledge about network analysis and state-of-the-art machine learning techniques and applying both to the NASDAQ stock market.
A Random Forest model was constructed based on yearly topological properties of the given stocks and other relevant independent variables to predict the relative position of stocks for the next day, attaining an average cross-validation score of 67.5%. Although the bias of the attained learning model was not negligible, it showed a certain degree of robustness concerning model variance, enough to assess the predictive value of network topology measures and, thus, the usefulness of network analysis. Moreover, a permutation importance procedure was implemented to assess each topology measure's share of the model's predictive value. Consequently, eigenvector and closeness centrality appeared to have the most predictive value compared to the other topology measures. However, multicollinearity between the topology measures could have made the measures that correlate with eigenvector and closeness centrality less relevant.
Furthermore, numerous portfolios were constructed for eight days using the Random Forest model and compared to other portfolios, including the NASDAQ-100 index. The Network Portfolio outperformed the NASDAQ-100 index on six of the eight considered days. Furthermore, the model's performance exhibited a certain degree of sensitivity to the market trend, potentially due to the bias of the training data toward an uptrend.
Finally, to address the purpose of this thesis: network analysis certainly shows significant potential as an additional instrument for analyzing the complex workings of an equity market. Furthermore, machine learning and the increased availability of historical data and computing power have created the prospect of making network analysis adequate for risk-management purposes. However, more research, discussed in the next section, is necessary to make network analysis a practical risk-management tool for stock market agents.
6 Discussion
6.1 The Network
Regarding the construction of the underlying network of the NASDAQ stock market that
is used as the foundation to build the particular Random Forest model, certain limitations
could be improved in future research.
First, Pearson correlation matrices of stock returns are used to form the networks in this study. However, as presented in the literature review, the Pearson coefficient captures only linear relationships between the different stocks, so the constructed networks do not contain the non-linear interconnectedness between the given stocks. Additionally, the Pearson coefficient is a zero-time-lag correlation coefficient and therefore neglects the time lags that could be present in the correlations between the given stocks. This shortcoming could be addressed by introducing a time-lagged correlation matrix; however, the resulting correlation matrix would not be symmetric. Hence, the correlation matrix on which the prediction model is based has certain shortcomings that could be addressed in future research.
Furthermore, in this investigation, yearly networks are constructed based on daily data of the corresponding year. However, this could be done more systematically. When predicting the relative positions of given stocks for a particular day, it is potentially more appropriate to build monthly or quarterly networks instead of yearly networks. Specifically, if the correlations between stocks change significantly in the last periods of the year on which the network is based, the topological measures extracted from the network may be less representative of a stock's next-day relative position. With monthly networks, it is also possible to include lags of a stock's topological measures, potentially revealing the dynamic shift of the stock's relative position more clearly.
Moreover, it is viable to include both yearly and monthly networks, so that both the stocks' short-term and relatively long-term topological measures can be used for constructing the predictive model. Lastly, it would also be interesting to predict next-month relative positions instead of daily ones: daily returns are more prone to outliers than monthly returns, which aggregate the returns of several days, so the bias of outlying days is diminished. In that instance, it would be interesting to consider yearly and quarterly networks. Hence, the networks should be constructed with the desired prediction period in mind, so that the dynamic behavior of the stocks in the networks is revealed and their short-term and long-term relative positions can be utilized.
6.2 The Predictive Model
In this study, the chosen predictive model was the Random Forest classifier. Random Forest is a non-parametric model and is somewhat resistant to multicollinearity in comparison to, for example, linear regression or Bayesian learning, as not all predictor variables are used simultaneously in the decision trees of the Random Forest.
There are some limitations to the predictive model that could be improved in future research.
First, the training data of the model should incorporate predictor variables corresponding to different macroeconomic factors, such as the state of the economy, the market trend, the profitability of the given companies, and the level of interest rates, since macroeconomic conditions also affect the behavior of the stock market. More generally, the predictive model should be subjected to a greater number of independent variables that could affect future stock returns and, thus, the relative position of a given stock in the network. Evidently, these predictor variables should be tested for (statistical) significance in the given model. Consequently, the predictive model would be less prone to omitted variable bias, which was likely present in this paper.
Furthermore, the quality and size of the training data are also crucial for the model's performance. Instead of considering the 500 largest stocks by market capitalization, more stocks in the given equity market could be included. Alternatively, multiple stock markets could be considered jointly, as stocks across these different markets could correlate significantly. Regarding the quality of the training data, the model should be subjected to all different types of market movements and should not be biased toward a bearish or bullish market state. A naïve method to achieve this is to feed the predictive model more data.
When utilizing a large amount of data, it may be beneficial to employ other machine learning models, such as Gradient Boosting, which is faster and more robust as the training data grows; Random Forest, by comparison, is more suitable for relatively small or moderate amounts of data.
Lastly, as discovered in this investigation, specific topological properties are potentially correlated. To uncover the relevance of these predictor variables, the correlated variables could be clustered together, and the learning model could be guided to employ only one variable from each cluster at a time. In this manner, the predictive value of each variable is more easily recognized, and the model can be adjusted accordingly.
6.3 Portfolio Construction
In this paper, the portfolio construction served to demonstrate the potential of the constructed predictive model. Specific points could be improved and taken into consideration in future work.
When constructing portfolios, it is essential to consider the risk tolerance of the user, which may be balanced against the introduced threshold quantile Z: as Z grows, the learning model becomes greedier toward picking the 'good' class and hence takes more risk. Furthermore, the degree of downsampling of the 'bad' class, introduced to balance the training set, can also serve to adjust the risk threshold. A higher level of downsampling of the 'bad' class leads to more stocks being labeled 'good.' In comparison, a lower level of downsampling, which leaves the 'bad' class larger than the 'good' class, results in more stocks being classified as having a relatively bad position and, thus, less risk being taken.
Principally, when constructing portfolios, more simulations over several periods with various macroeconomic conditions are necessary to better assess the predictive model's performance. However, running more simulations and predicting more periods may be computationally constraining. Ultimately, the attained network portfolios could be further improved using robust risk-assessment measures, such as the Sharpe Ratio or Conditional Value at Risk, to better regulate the risk of these portfolios.
7 Appendix
7.1 Appendix A: Visualization of the Networks
Figure 7: The constructed networks of the 500 largest NASDAQ stocks by market capitalization for the considered years: (a) 2014, (b) 2016, (c) 2017, (d) 2018, (e) 2019, (f) 2020, (g) 2021, (h) 2022.