A Machine Learning Approach: Network Analysis of the NASDAQ Stock Market

by Ali Bagheri (SNR: 2044174)

A thesis submitted in partial fulfillment of the requirements for the degree of Bachelor in Econometrics and Operations Research
Tilburg School of Economics and Management, Tilburg University
Supervised by: Denis Kojevnikov
Date: January, 2023

Abstract

The primary aim of this paper is to explore the potential of network analysis of stock markets for risk-management purposes by leveraging state-of-the-art machine learning techniques. Yearly networks of the 500 largest NASDAQ stocks by market capitalization were constructed using the Pearson correlation coefficient and the Planar Maximally Filtered Graph algorithm. Furthermore, a Random Forest classifier was trained on the topological attributes of the stocks in the network to predict which stocks would have returns in the top quantile(s) of next-day returns. The resulting predictive model attained an average cross-validation score of 67.5% and therefore indicated some degree of robustness in its generalization capacity. Characteristically, closeness and eigenvector centrality contributed more significantly to the model's performance than the other topological measures. Lastly, a series of portfolios using the acquired predictive model were simulated and compared to other types of portfolios, including the NASDAQ-100 index, for the considered days.

Contents

1 Introduction
2 Literature Review
  2.1 Motivation
  2.2 Network Construction
  2.3 Network Topology
  2.4 Data Mining
  2.5 Classification Algorithms
  2.6 Random Forest Model
  2.7 Cross-Validation
  2.8 Hyper-parameter Optimization
  2.9 Feature Testing: Permutation Importance
3 Methodology
  3.1 Data Definition
  3.2 Network Topology
  3.3 Designing the Classification Model
4 Empirical Findings and Analysis
  4.1 Pre-Downsampling
  4.2 Post-Downsampling
  4.3 Variable Importance
  4.4 Portfolio Construction
5 Conclusion
6 Discussion
  6.1 The Network
  6.2 The Predictive Model
  6.3 Portfolio Construction
7 Appendix
  7.1 Appendix A: Visualization Networks

1 Introduction

In recent times, there has been a spike in the study of complex natural and physical phenomena from the perspective of network science. Large intricate systems can be represented by networks in which individual entities, commonly referred to as nodes or vertices, interact with one another. Characteristically, these networks are scale-free, meaning that their degree distribution follows a power law, so that a relative change in a node of the network propagates to its linked nodes and thereby changes the aggregate behavior of the network (Albert and Barabási 2002).
Due to this topological property, large complex systems can be represented, with a certain degree of approximation, as networks, which are then used to gain insights into complicated issues in various disciplines. The study of complex systems was the subject of classical graph theory in the 20th century. In recent decades, however, exponential advances in computing capacity, new robust statistical instruments, and the availability of large amounts of data have given rise to contemporary network science (Dorogovtsev and Mendes 2002).

Network analysis could prove to be an additional robust risk-management tool for analyzing relationships and connections within the stock market that are not always directly apparent from traditional risk-management tools, such as the widely used mean-variance model, which relies heavily on statistical properties like expected return. A shortcoming of these traditional methods is that they depend on the chronological consistency of stock returns and volatility, and they could therefore lead to potentially inferior portfolios. For this reason, it is worth examining whether network analysis can mitigate these shortcomings and prove a valuable instrument for risk managers.

The primary aim of this paper is to explore the potential of network analysis of stock markets for risk-management purposes by leveraging state-of-the-art machine learning techniques. We will apply existing knowledge about the topological properties of the NASDAQ stock market within a machine-learning framework to create a risk-management tool. This research involves multiple components, which are presented in the next section.

2 Literature Review

2.1 Motivation

Network analysis has been used to gain insights into intricate issues in various subjects from a different perspective. For instance, Yang and Leskovec (2010) investigated the information diffusion process within social media platforms.
Shirley and Rushton (2005) researched the impact of network topology on disease spread to study the risk and evolution of epidemics. Similarly, exploiting network analysis in finance has been a recurring theme. Fan et al. (2019) used the topology of product similarity to make predictions about the evolution of the market. Namaki et al. (2011) explored and compared the topological structure of the emerging TSE stock market with the mature Dow Jones Industrial Average (DJIA), confirming the scale-free properties of the networks of both markets. Long, Zhang, and Tang (2017) demonstrated that stocks with high topological measures, such as betweenness centrality and clustering coefficients, are linked with higher systematic risk. Eng-Uthaiwat (2018) confirmed empirically, using the network diameter as a topological measure, that idiosyncratic risk is not always diversified away as commonly believed and that networks occasionally heighten the effect of idiosyncratic risk and cause aggregate fluctuations. He et al. (2022) found that during abrupt shock periods such as the Covid-19 crisis, the network structure exhibits particular topological indications of systematic risk contribution that are consistent across different markets. Yun, Jeong, and Park (2019) studied the topology of PageRank centrality to explore systematic risk and found it to be a more robust measure of systematic risk than traditional measures such as Conditional Value at Risk (CoVaR) or Marginal Expected Shortfall (MES). To fulfil the aim of this thesis, which is to investigate the potential value of network analysis for risk management using a machine-learning framework, we will need to include multiple components in our research. The first component is a robust method for constructing the network.
Crucially, the network should represent the whole equity market while simultaneously minimizing outliers for bias prevention and filtering weak links to reduce the computational complexity of the market. In addition, the network's topology is examined, and appropriate metrics of the network's stocks are extracted, such as centrality measures, clustering coefficients, or degree distributions (Namaki et al. 2011; Long, Zhang, and Tang 2017). Furthermore, a machine learning algorithm, namely the Random Forest algorithm, is considered to recognize the pattern of returns of the network's vertices with respect to their topological measures and past returns, and to classify the position of the nodes relative to the rest of the network. Moreover, it is crucial to explore the contribution of all the topological measures by performing variable testing, commonly referred to as feature importance testing in data science, to identify these measures' significance for the model at hand. Subsequently, cross-validation and grid search analysis are employed to optimize the classification model by preventing over- or underfitting and by tuning the relevant hyper-parameters of the learning model to control the model's complexity.

2.2 Network Construction

In the financial literature on network analysis, the correlations between the returns of individual stocks often serve as the foundation for constructing links in the network, because they capture the interconnectedness of the stock market (Namaki et al. 2011; He et al. 2022; Eng-Uthaiwat 2018). Millington and Niranjan (2021) compared networks constructed from Pearson correlation matrices with those constructed from rank-based correlation matrices like Kendall and Spearman, and found virtually no difference in robustness between the correlation coefficients. B. Podobnik et al.
(2010) construct correlation matrices with a time lag, unlike the frequently used zero-lag correlation matrices of Pearson, Kendall, and Spearman. Boris Podobnik and Stanley (2008) propose a novel method of forming cross-correlation matrices based on detrended covariance. Thus, it is possible to establish the edges of our network using multiple criteria. However, as the Pearson correlation coefficient is widely used for financial network analysis, this coefficient is adopted in this paper; the exploration of alternative correlation matrices is outside the scope of this thesis. Consequently, it is imperative to reduce the complexity of the network while still preserving critical information about the interconnectedness of stocks by applying a filtering algorithm. He et al. (2022) use a threshold method that sets a certain threshold for the correlation between two stocks and constructs an adjacency matrix in which only correlations exceeding the threshold are retained, while smaller correlations are discarded. Alternatively, another widely employed filtering procedure is constructing a Minimum Spanning Tree (MST) from the correlation matrix using algorithms such as Kruskal's or Prim's (Long, Zhang, and Tang 2017; Millington and Niranjan 2021; Mantegna 1999). The algorithm begins by sorting the correlations from large to small and forms the edges corresponding to the highest correlations one by one while ensuring that no cycles form. Hence, when a correlation would produce a cycle, the given link is ignored, and the algorithm continues with the next highest correlation in the queue. The MST results in the formation of n − 1 edges when there are n nodes. However, Eng-Uthaiwat (2018) argues that utilizing a Minimum Spanning Tree approach to form the network is perhaps too strict, resulting in vast amounts of information loss. Therefore, Tumminello et al.
(2005) propose an alternative approach to filtering complex correlation-based graphs that preserves the hierarchical association of Minimum Spanning Trees while retaining a larger amount of information, named the Planar Maximally Filtered Graph (PMFG) algorithm. The PMFG algorithm results in the formation of 3(n − 2) edges from n nodes. Because PMFG is widely utilized within the financial network literature and is less strict than, yet similar to, MST, PMFG is opted for in this paper (Eng-Uthaiwat 2018; Tumminello et al. 2005; Hong and Yoon 2022).

2.3 Network Topology

As aforementioned, studies on topological properties can help to understand both the local behavior of individual entities and the network's dynamic behavior. Namaki et al. (2011) and He et al. (2022) consider a wide range of topological measures from graph theory, such as centrality measures, degree distribution, and clustering coefficients, which are relevant for emerging and mature markets. We will consider numerous topological measures and compute them for all the nodes in the constructed structure. First, the square N × N matrix that represents the network of the N given stocks is defined as A. The topological measures include:

1. Eigenvector centrality: This measure indicates the impact of a vertex in a network based on the influence of its neighbouring vertices. Essentially, the measure portrays the extent of influence of a node through its links to other important nodes (Long, Zhang, and Tang 2017; He et al. 2022; Bonacich 2007). It is iteratively calculated as follows:

EV_i = \frac{1}{\lambda} \sum_{j=1}^{N} A_{ij} EV_j \quad (1)

where \lambda represents the largest eigenvalue of matrix A. A high EV_i value indicates that node i is linked to many other well-connected nodes. In this paper, 100 iterations are performed to attain the measure's values for all the vertices of the constructed network.

2. Degree centrality: This measure indicates the number of edges connected to the given node relative to the total number of possible links (Long, Zhang, and Tang 2017; He et al. 2022; Gong et al. 2019). The greater the degree centrality of a node, the more connections the node has relative to the rest of the nodes in the network.

D(i) = \frac{\sum_{j=1}^{N} A_{ij}}{N - 1} \quad (2)

3. Betweenness centrality: This measure mirrors the number of times a node finds itself on the shortest path between other nodes. Betweenness centrality portrays the extent to which other nodes depend on the given node (Freeman 1978; He et al. 2022).

BW(i) = \frac{2}{(N - 1)(N - 2)} \sum_{(j,l),\, j \neq i \neq l} \frac{\beta_{jl}(i)}{\beta_{jl}} \quad (3)

where \beta_{jl}(i) represents the number of shortest paths from node j to node l passing through node i, and \beta_{jl} the total number of shortest paths from node j to node l.

4. Closeness centrality: This centrality measure reflects the average shortest path of the given node to all its connected nodes (He et al. 2022). Closeness centrality portrays the closeness of a node to the core of the network.

C(i) = \frac{N - 1}{\sum_{j} \beta_{ij}} \quad (4)

where \beta_{ij} represents the length of the shortest path from node i to node j.

5. Clustering coefficient: This measure illustrates how tightly the given node tends to cluster together with other nodes in the network. The computation of the clustering coefficient is based on the formation of triangles, where a triangle is a cycle between three given nodes. The measure is computed as the number of triangles the given node is part of, divided by the maximum number of triangles the node could form given its degree (Watts and Strogatz 1998; He et al. 2022; Wang et al. 2016).

CC(i) = \frac{\sum_{(j,k)} A_{ij} A_{jk} A_{ki}}{\sum_{j} A_{ij} \left( \sum_{j} A_{ij} - 1 \right)} \quad (5)

6. PageRank: Comparable to eigenvector centrality, which considers the quality of the connected nodes, PageRank also considers the direction of an edge by exploiting the degree of a node and assigning edges a particular weight (Brin and Page 1998). Traditionally, PageRank is only defined for directed graphs. However, it is possible to transform an undirected graph into an equivalent directed graph in order to compute the PageRank: every edge of the undirected graph is replaced by two new directed edges with opposite directions. Furthermore, the computation of PageRank involves an iterative approach. First, all nodes in the graph are assigned an equal initial value of 1/N. Then, using the following formula, a new PageRank is calculated for all vertices in the network:

PR(i) = \frac{1 - d}{N} + d \sum_{j \to i} \frac{PR(j)}{E_j} \quad (6)

where the sum runs over the nodes j with an edge pointing to node i, and E_j reflects the number of outgoing edges of node j. Additionally, d ∈ (0, 1) is the damping factor, which compensates for some degree of randomness in the connections between the nodes: a proportion d of the information from the incoming edges of a node is considered when computing its PageRank. In this paper, 100 iterations are used for the computation of PageRank with a damping factor of 0.85.

Ultimately, the union of these measures captures the different facets of the interconnectedness of the entities in the entire network.

2.4 Data Mining

Data mining is the act of finding (hidden) patterns and peculiarities in large amounts of data. Due to the size, complexity, and fast-expanding nature of financial data, data mining has become increasingly valuable in quantitative finance. When combining data mining techniques with machine learning methods, it is possible to make computers learn from the results of these mining processes and swiftly provide users with outcome predictions or classify entities within the data.

2.5 Classification Algorithms

A significant subset of data mining and machine learning is formed by classification models, where the aim is to determine which class, or group, an observation belongs to.
Supervised classification algorithms revolve around constructing a learning model from a pre-labeled training set in order to label unseen data correctly (Aly 2005). Furthermore, classification algorithms can be categorized into binary and multi-class classification models. The former, binary classification, entails that the model distinguishes between two class types, commonly referred to as labels, whereas multi-class classification involves training the model for multiple possible labels. In this study, binary classification is implemented, yielding a robust learning model with respect to the financial data and the interrelation of the variables of interest.

2.6 Random Forest Model

Random Forest is a machine-learning model for classification and regression tasks developed by Leo Breiman (2001). It is an ensemble technique that incorporates the predictions of numerous decision trees to create a robust predictive model, widely known for its effectiveness as a supervised learning model (Biau and Scornet 2015). Decision trees work by creating a tree-like model of decisions based on the values of the predictor variables, commonly referred to as features in machine learning. At each internal node, the model decides based on the value of a feature and then splits the data into different branches based on the outcome of that decision. Specifically, this can be done using a particular importance measure called the Gini criterion, which minimizes the impurity of a node by selecting the split that results in the highest proportion of samples belonging to the same class (Leo Breiman 1996). In the case of a classification problem with classes indexed by i, the Gini value corresponding to a particular node of a decision tree is realized in the following fashion:

Gini(P) = \sum_{i} P_i (1 - P_i) \quad (7)

where P_i reflects the frequency of class i in the sample at the given node.
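As an illustration, Eq. (7) can be computed directly from the class labels of the samples at a node. The following minimal Python sketch (the function name and sample labels are hypothetical, chosen only to mirror the formula) shows the two extreme cases:

```python
from collections import Counter

def gini_impurity(labels):
    """Eq. (7): sum over classes of P_i * (1 - P_i), where P_i is the
    relative frequency of class i among the samples at the node."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())

# A pure node (only one class) has impurity 0; a perfectly mixed
# binary node has the maximum impurity of 0.5.
print(gini_impurity(["good", "good", "good"]))        # 0.0
print(gini_impurity(["good", "bad", "good", "bad"]))  # 0.5
```

A decision tree would evaluate this quantity for every candidate split and prefer the split whose child nodes have the lowest weighted impurity.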
Finally, the leaves of the trees, which are pure nodes, contain the final prediction for a given sample and have a Gini value of zero. The decision tree algorithm works by recursively partitioning the data into smaller and smaller subgroups based on the value of a feature. It chooses the split that maximizes the information gained at each step using the Gini criterion, which measures how well the feature can predict the target variable. The process continues until the tree reaches a maximum depth or until all the leaves are pure, meaning they contain only samples from a single class and thus have no child nodes (L. Breiman et al. 1984). In the Random Forest algorithm, the decision trees are trained on random bootstrap subsets of the training data (Leo Breiman 2001). Finally, the Random Forest aggregates the predictions of all the decision trees in the forest and makes a prediction based on the majority vote of the individual trees. This process of bootstrap sampling and aggregation, known as bagging (bootstrap aggregating), helps to reduce overfitting to the training set, i.e., it reduces the variance of the model and thus improves the model's generalization ability (Biau and Scornet 2015). In Figure 1, a summarized flowchart of the workings of a Random Forest classifier is depicted (Wu, Gao, and Jiao 2019).

Figure 1: Overview Random Forest Classifier.

There is a variety of classification models available, including logistic regression, support vector machines, decision trees and their ensemble counterparts (e.g., Random Forest and Gradient Boosting), and Bayesian models like the Gaussian Naïve Bayes algorithm. Each model has unique characteristics and may be suitable for a given classification task. Naturally, the performance of different machine learning models varies from case to case due to differences in their statistical assumptions. Random Forest has certain advantageous characteristic properties, namely:

1.
Random Forest is a non-parametric model that does not make distributional assumptions.

2. It has been shown empirically that Random Forests are more resistant when highly correlated explanatory variables are present and thus result in a less biased model compared to, for example, logistic regression (Lindner, Puck, and Verbeke 2022). This is due to the bootstrap sampling property of the Random Forest and the fact that not all predictor variables are used at once.

3. It can handle a relatively larger number of features, since not all predictor variables are used simultaneously.

4. Random Forests can also detect non-linear relationships in the data, unlike logistic regression or support vector machines, which assume linearity.

As the interconnectedness of financial assets is quite complex and non-linear, and their distribution is generally unknown, the Random Forest algorithm is a robust method for the aim of this thesis. Additionally, as aforementioned, it is more flexible in including variables that exhibit multicollinearity. For instance, centrality measures of a network have been shown empirically to be linearly correlated in some networks (Oldham et al. 2019). However, by leveraging Random Forest, both such variables may be included in the model, and consequently both can be tested on whether they offer predictive value to the classification model.

2.7 Cross-Validation

When constructing a robust learning model, it is imperative to evaluate the model's generalization performance and prevent over- or underfitting to the training set. A commonly used method in statistics and machine learning is cross-validation, a statistical resampling method where a single dataset is partitioned into k equal-sized subsets called folds (Sammut and Webb 2010).
The learning model is then executed k times, where in iteration j the union of all folds except the j-th is used as the training set, while the j-th fold serves as the test set. Finally, the scores of the learning model across all iterations are averaged to attain a cross-validation score that serves as a measure of the model's generalization ability. In the diagram below, a 10-fold cross-validation scheme is depicted (Aatqb and Iqbal 2019).

Figure 2: Overview Cross-Validation procedure.

2.8 Hyper-parameter Optimization

Depending on the complexity and purpose of the model, it is crucial to tune the parameters of the learning model, often referred to as hyper-parameters, accordingly. There are several ways to achieve this, including Bayesian optimization, gradient-based optimization, search methods like grid or random search, or heuristics such as genetic algorithms or simulated annealing, which are advantageous when computing power is a constraint (Feurer and Hutter 2019). For the aim of this thesis, the chosen method of hyper-parameter optimization is grid search, as the learning model will not be exposed to an extensive dataset and only a few parameters of the Random Forest are considered. Grid search aims to find a set of hyper-parameters for which the model's performance is best with respect to a specific measure, such as the mean squared error or the balanced accuracy of the target classes. Specifically, this is done by choosing a set of values, a grid, for the relevant hyper-parameters subject to optimization. The learning model is then trained and evaluated for all combinations of those parameter values (Bergstra and Bengio 2012). As aforementioned, the Random Forest algorithm is leveraged in this paper. Several relevant (hyper-)parameters can be tuned in a Random Forest model to control the intricacy of the model, including:

1. The number of decision trees

2.
The maximum depth of each tree: this parameter refers to the maximum number of nodes from the tree's root node to the farthest leaf node

3. The minimum number of samples required to split a node and to be at a leaf node (Biau and Scornet 2015)

4. The maximum number of features: this parameter regulates the number of predictor variables that are considered when searching for the optimal split at each node in each tree

2.9 Feature Testing: Permutation Importance

When constructing a particular model, it is paramount to examine which predictor variables correlate significantly with the dependent variable and are thus relevant for explaining the model. Permutation importance is a technique widely used in machine learning for such purposes and works in the following fashion: the machine learning model is fitted to the training data and tested by employing cross-validation. Afterward, the values of a single feature are randomly rearranged, or permuted, potentially dropping the model's performance. This procedure is repeated several times for all the features of the learning model. The change, or rather decrease, in the model's performance in terms of its (balanced) accuracy indicates the degree of importance of the given feature for the dependent variable. In this fashion, the proportion of the predictive value of each feature for the given model is attained (Leo Breiman 2001). In a nutshell, the permutation importance I_l of feature l is realized as follows:

I_l = \phi - \frac{1}{R} \sum_{r=1}^{R} \phi_{r,l} \quad (8)

where \phi is the score of the trained model before permuting the features, R is the total number of iterations for which the permutation importance procedure is repeated, and \phi_{r,l} is the score of the predictive model when the observations corresponding to feature l are randomly shuffled in iteration r. Notably, this procedure does not suggest whether the feature is appropriate for the given problem in general but is solely indicative of the trained model at hand.
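Eq. (8) can be sketched directly in a few lines of Python. In this illustrative sketch, the "model" is a hypothetical scoring function (not the thesis's Random Forest) that classifies by thresholding feature 0 and ignoring feature 1, so shuffling feature 0 should hurt the score while shuffling feature 1 should not:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_importance(score, X, y, feature, R=10):
    """Eq. (8): I_l = phi - (1/R) * sum_r phi_{r,l}."""
    phi = score(X, y)                  # baseline score phi of the trained model
    shuffled_scores = []
    for _ in range(R):
        Xp = X.copy()
        rng.shuffle(Xp[:, feature])    # permute only the column of feature l
        shuffled_scores.append(score(Xp, y))   # phi_{r,l}
    return phi - np.mean(shuffled_scores)

# Toy setup: feature 0 determines the label, feature 1 is pure noise.
X = np.column_stack([np.linspace(-1.0, 1.0, 200), rng.normal(size=200)])
y = (X[:, 0] > 0).astype(int)
accuracy = lambda X_, y_: np.mean((X_[:, 0] > 0).astype(int) == y_)

print(permutation_importance(accuracy, X, y, feature=0))  # large: informative feature
print(permutation_importance(accuracy, X, y, feature=1))  # zero: ignored feature
```

The informative feature loses roughly half its accuracy when shuffled, while the noise feature's importance is zero, mirroring how the topological measures are ranked in this thesis.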
Another notable drawback of permutation importance is that when there is a high degree of correlation between features, the importance of these features will be lower than their 'true' importance, because when such a correlated feature is shuffled, the model can still acquire some information about it from its correlated counterpart. This drawback can be mitigated by clustering the correlated features and including only one feature from each cluster when performing the permutation importance procedure. Nevertheless, if a model exhibits a limited amount of bias and a particular feature has a considerable permutation importance, the variable is likely of value to the predictive model. Permutation importance will be applied to the Random Forest constructed in this thesis to gain insight into the predictive value of the tested features, especially the network topological measures.

3 Methodology

3.1 Data Definition

The data used throughout this thesis originate from the financial database of the Center for Research in Security Prices (CRSP), from which multiple datasets are extracted. First, as the network changes dynamically, eight years of data between 2013 and 2022 are considered, excluding 2015 due to a significant amount of missing values. For each year, a network of the 500 biggest stocks by market capitalization on the NASDAQ stock exchange is constructed, resulting in eight networks corresponding to the eight years of data. For each year, the data start on April 1st and end on the second-to-last working day of March of the following year. Illustratively, the first network is based on daily records from 1 April 2013 until 27 March 2014. The variables considered in the datasets are given in Table 1.
Variable                           Acronym
Company Name                       COMNAM
Ticker                             TICKER
CRSP Permanent Company Number      PERMCO
Security Status                    SECSTAT
Price                              PRC
Share Volume                       VOL
Holding Period Return              RET
Number of Shares Outstanding       SHROUT
Delisting Code                     DLSTCD

Table 1: List of extracted variables from the CRSP database for the considered years.

The first step in constructing our network is to define our nodes and edges. The nodes of the given network correspond to the different stocks, whereas the edges correspond to the distances between the different stocks. These distances are based on the correlations between the returns of all stocks. As mentioned in the literature review, the first phase of forming the network is to compute the Pearson correlation matrix. Next, a distance-based adjacency matrix is obtained from the resulting correlation matrix, which serves as the foundation of the Planar Maximally Filtered Graph algorithm. First, the natural logarithm of the prices of all stocks s = 1, ..., N is taken for each period, where P_s(t) denotes the price of stock s at time t. Then, the log-prices of consecutive periods are differenced for all t = 1, ..., T to attain the return rates of all periods successively:

r_s(t) = \ln \frac{P_s(t+1)}{P_s(t)} \quad (9)

Afterward, using the returns of all the stocks for the different periods, the Pearson coefficient is used to compute the correlation matrix of the considered stocks:

C_{i,j} = \frac{\sum_{t=1}^{T} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{T} (x_{it} - \bar{x}_i)^2 \sum_{t=1}^{T} (x_{jt} - \bar{x}_j)^2}} \quad (10)

Subsequently, the symmetric N × N correlation matrix is transformed by the formula d_{ij} = \sqrt{2(1 - C_{ij})}, which ensures several properties: higher correlations correspond to smaller distances and hence more similarity between the nodes, and d_{ij} equals d_{ji}, so symmetry still holds. Lastly, self-loops resulting from the diagonal entries are discarded.
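The pipeline from prices to the distance matrix, Eqs. (9) and (10) followed by the distance transform, can be sketched with the Pandas and Numpy packages named above. The tickers and price values here are purely illustrative, not taken from the CRSP sample:

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices, one column per stock (values illustrative).
prices = pd.DataFrame({
    "AAPL": [100.0, 102.0, 101.0, 105.0],
    "MSFT": [200.0, 203.0, 201.0, 208.0],
    "INTC": [50.0, 49.0, 51.0, 50.0],
})

# Eq. (9): r_s(t) = ln(P_s(t+1) / P_s(t)), i.e. differenced log-prices.
log_returns = np.log(prices).diff().dropna()

# Eq. (10): Pearson correlation matrix of the return series.
C = log_returns.corr(method="pearson")

# Distance transform d_ij = sqrt(2 * (1 - C_ij)); high correlation -> small distance.
D = np.sqrt(2 * (1 - C))
np.fill_diagonal(D.values, 0.0)    # discard self-loops from the diagonal

print(D.round(3))
```

The resulting symmetric matrix D is exactly what a filtering algorithm such as PMFG would consume to decide which edges to keep.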
The next phase is to apply the Planar Maximally Filtered Graph algorithm to the distance matrix to attain the network. The network construction is implemented using Python 3.10 with the Pandas, Numpy, Networkx, and Planarity packages.

3.2 Network Topology

Subsequently, the network topology measures must be extracted from the attained network. With reference to the literature review, the betweenness centrality, degree centrality, closeness centrality, eigenvector centrality, PageRank, and clustering coefficient of all the nodes in the network are realized and stored.

3.3 Designing the Classification Model

The algorithm aims to predict whether the return of a stock in the next period will be in the top Z quantile relative to all other stocks, by feeding it various input variables, such as the returns of multiple preceding periods and the topological network measures of the given stocks. For the model's design, it is imperative to construct the dataset on which the machine learning algorithm will be trained in a practical manner. The structure is as follows: each row corresponds to a different stock, while the columns contain returns from multiple periods and the topology measures. After structuring the dataset in the desired format, the stocks are labeled according to the last return period of that year. The labeling procedure is done in the following fashion: the Z quantile of all the returns in the last period, denoted Q_Z, is calculated. Stocks with returns below Q_Z are labeled 'Bad Relative Position,' while stocks with returns exceeding Q_Z are labeled 'Good Relative Position.'

\text{Class} = \begin{cases} \text{Good Relative Position}, & \text{if } r_s(T) > Q_Z \\ \text{Bad Relative Position}, & \text{otherwise} \end{cases} \quad (11)

The relative position of a stock is the target variable on which our algorithms are trained, such that they will be able to predict the relative position for the next day.
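The quantile-based labeling of Eq. (11) amounts to one quantile computation and one comparison. A minimal sketch, with hypothetical tickers and last-period returns chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical last-period returns r_s(T) for six stocks (values illustrative).
last_returns = pd.Series(
    [0.012, -0.004, 0.021, 0.003, -0.010, 0.030],
    index=["S1", "S2", "S3", "S4", "S5", "S6"],
)

Z = 0.75                                  # quantile threshold Z from Eq. (11)
cutoff = last_returns.quantile(Z)         # the Z quantile of the last-period returns

# Eq. (11): returns above the cutoff -> 'Good', otherwise 'Bad'.
labels = np.where(last_returns > cutoff,
                  "Good Relative Position",
                  "Bad Relative Position")

print(dict(zip(last_returns.index, labels)))
```

With Z = 0.75, roughly the top quarter of stocks receive the 'Good Relative Position' label, which is why the training set becomes more imbalanced as Z grows.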
The resulting pre-labeled dataset will be the training set on which the Random Forest algorithm learns to predict the class of a stock, i.e., whether the stock has a Good or Bad Relative Position. Additionally, these relative positions can indicate which stocks are suitable to invest in for the next period, in this case, the next day's return. As an illustration, a training set contains the variables presented in Table 2.

Features
Returns of periods T−10 until T−1
Volumes of periods T−3 until T−1
PageRank
Clustering Coefficient
Closeness Centrality
Betweenness Centrality
Degree Centrality
Eigenvector Centrality
Class of the Stock

Table 2: List of features (variables) used to train the Random Forest Classifier.

Furthermore, after constructing the Random Forest algorithm for the training set at hand, cross-validation is performed for different fold values with respect to a scoring measure called balanced accuracy. Balanced accuracy is the weighted average accuracy of the learning model over the classes; in other words, it is the average of the proportions of correctly predicted observations in each class. In the case of binary classification, the balanced accuracy (φ) is defined as follows:

\varphi = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \qquad (12)

where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively. Here, positive refers to the class of stocks with a good relative position, and negative to the class of stocks with a bad relative position.
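Eq. (12) can be worked through directly from the four confusion-matrix counts; the counts below are made up for illustration:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Balanced accuracy, Eq. (12): the mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)  # share of 'good' stocks correctly identified
    specificity = tn / (tn + fp)  # share of 'bad' stocks correctly identified
    return 0.5 * (sensitivity + specificity)

# Hypothetical counts: 30 true positives, 20 false negatives,
# 80 true negatives, 20 false positives.
phi = balanced_accuracy(tp=30, fn=20, tn=80, fp=20)  # 0.5 * (0.6 + 0.8) = 0.7
```

Unlike plain accuracy, this score is unaffected by the relative class sizes, which is why it is appropriate for the imbalanced training set described next.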
It is essential to use this scoring measure instead of precision, as the training set is imbalanced and thus contains more 'bad' stocks; the model's evaluation would, in that case, be biased toward classifying stocks as 'bad.' Lastly, with the attained optimal number of folds for our training set, a grid search is performed to optimize the hyper-parameters of the Random Forest, including the number of estimators, the maximum depth of a tree, and the maximum number of features considered in a tree. Finally, multiple Z values are considered to examine the algorithm's sensitivity to the set threshold of whether a stock has a bad or good relative position. The values considered for the threshold Z are the 0.55, 0.65, 0.75, and 0.85 quantiles.

4 Empirical Findings and Analysis

4.1 Pre-Downsampling

After implementing the proposed methodology, the Random Forest algorithm is fitted to the training data and a 10-fold cross-validation procedure is implemented. The attained results are displayed using confusion matrices, diagrams that summarize the performance of a classification model by showing the shares of true positive, true negative, false positive, and false negative predictions.

Figure 3: The confusion matrices corresponding to the results of the given Random Forest model on the test set without downsampling the imbalanced training set for different values of the quantile threshold Z.

The confusion matrices for the different Z quantiles make clear that as the threshold quantile increases, the training set becomes more imbalanced and, therefore, the Random Forest model's bias increases. For this reason, the larger class is downsampled to the same size as the smaller class when training the machine learning model, while the imbalance is kept when testing.
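The downsampling step just described can be sketched as follows; the function name `downsample_majority` and the toy DataFrame are illustrative, not taken from the thesis code:

```python
import pandas as pd

def downsample_majority(df, label_col, seed=0):
    """Randomly downsample the larger class to the size of the smaller one."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = counts.min()
    parts = [
        grp if name == minority_label else grp.sample(n=n_minority, random_state=seed)
        for name, grp in df.groupby(label_col)
    ]
    return pd.concat(parts)

# Hypothetical imbalanced training set: 8 'Bad' stocks vs. 2 'Good' stocks.
train = pd.DataFrame({"ret": range(10), "cls": ["Bad"] * 8 + ["Good"] * 2})
balanced = downsample_majority(train, "cls")
```

Only the training set is balanced this way; the test set keeps its natural imbalance, as stated above.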
4.2 Post-Downsampling

After downsampling the stocks with the bad relative position, the subsequent confusion matrices are acquired for different values of Z.

Figure 4: The confusion matrices corresponding to the results of the given Random Forest model on the test set after downsampling the imbalanced training set for different values of the quantile threshold Z.

The classification model is more robust after the downsampling of the larger class and is less biased, with an average cross-validation score of 67.5% across all 10 folds of the training set and all values of Z. Additionally, the classification model attains an average balanced accuracy of 67.5% on the test set across all values of Z. Furthermore, after downsampling the larger class, the learning model's accuracy is significantly less sensitive to different values of the threshold Z. This could be explained by the fact that the algorithm is already relatively strict in classifying stocks as good. Another noticeable insight from the confusion matrices is that the Random Forest algorithm tends to become greedier toward the good class as the quantile threshold Z increases. Given that the algorithm is trained on a dataset of 4500 records with 10-fold cross-validation, the cross-validation score of the learning model, which is representative of the model's generalizability, is acceptable for gaining insights into the predictive value of the network topology measures. However, one should keep in mind that the bias of the given model is not negligible and that a certain degree of omitted variable bias is likely to be present.

4.3 Variable Importance

Moreover, it is essential to implement the permutation importance procedure (Section 2.9) to examine each feature's proportion of predictive value for the constructed model. For each feature, the procedure is repeated twenty times.
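A minimal sketch of this permutation-importance setup, using scikit-learn's `permutation_importance` on a synthetic dataset in place of the thesis's stock features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stock feature matrix (6 features, binary class).
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature 20 times and record the drop in balanced accuracy,
# mirroring the twenty repetitions used in the thesis.
result = permutation_importance(
    model, X_test, y_test,
    scoring="balanced_accuracy", n_repeats=20, random_state=0,
)
```

`result.importances_mean` then gives one importance value per feature, which is what the bar charts in Figure 5 plot.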
The following graphs depict the feature importance for all values of Z.

Figure 5: The bar charts depicting the permutation importance of all predictor variables for different values of the quantile threshold Z corresponding to the constructed Random Forest model.

Firstly, as Z increases, more features become relevant for the given Random Forest classifier. The confusion matrices in the previous section also showed that the algorithm became greedier toward picking the 'good' class as Z increased, which could potentially explain why the variables' importance became more significant for the model's performance. Furthermore, for all values of Z, closeness and eigenvector centrality contribute the most toward the model's performance compared to the other network topological properties. As mentioned in Section 2.3 on topological measures, these network measures can be highly correlated. Especially in the case of the .55 and .65 quantiles, where the algorithm is relatively more balanced in terms of bias toward one of the classes, multicollinearity is more visible and could have led to the non-significance of the topology measures PageRank, degree centrality, betweenness centrality, and clustering coefficient for the model at hand. Even in the case of the .75 and .85 quantiles, these four topology measures still have a relatively low significance compared to closeness and eigenvector centrality. This indicates potential collinearity between the pair of closeness and eigenvector centrality and the remaining topology measures. Lastly, the model's predictive performance remains nearly unchanged when running the Random Forest algorithm without the variables that are non-significant according to the permutation importance results. Moreover, the importance of the remaining topology measures then becomes more significant, again suggesting the presence of multicollinearity between the topology measures.
4.4 Portfolio Construction

In order to assess the applicability of the machine learning model, it is possible to construct a portfolio for the next day using the built model, and to examine whether the portfolio tracks or is even capable of beating the market. The latter depends on how significant the error of the machine learning algorithm is when misclassifying a stock. If the algorithm for a threshold value of the .85 quantile misclassifies a stock belonging to a quantile larger than or equal to the .5 quantile, then the resulting portfolio would still be adequate. On the other hand, if the given misclassified stocks had belonged to a quantile less than the .5 quantile, the resulting portfolio would be less satisfactory. Similarly, suppose the learning model is relatively good at identifying the top .05 quantile. In that case, these returns could mitigate the adverse effect of the proportion of misclassified stocks in the portfolio. To examine the practicality of the learning model for risk-management purposes, numerous network portfolios have been constructed for each year and compared to three other kinds of portfolios. The types of constructed portfolios are as follows:

1. Network Portfolios: Portfolios consisting of stocks predicted to have a good relative position for the next day, established by the Random Forest model based on the network topological attributes of the considered stocks.

2. Random Stock Portfolios: Portfolios consisting of randomly chosen stocks from the 500 largest stocks by market capitalization.

3. Average Stock Portfolios: Portfolios consisting of stocks that all have the average return of the predicted day as their return. These portfolios represent the trend of the NASDAQ stock market on the given days.

4. NASDAQ-100 Index: A portfolio consisting only of the NASDAQ-100 index, an index of the 100 largest non-financial companies of the NASDAQ stock market by market capitalization.
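The first three portfolio types above can be compared in a few lines; everything here (the return draws, the classifier output, the equal weighting) is an illustrative assumption, not the thesis's actual simulation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical next-day returns for 500 stocks and a hypothetical
# classifier output flagging roughly 20% of them as 'good'.
next_day_returns = rng.normal(0.0, 0.02, size=500)
predicted_good = rng.random(500) < 0.2

# 1. Network portfolio: equally weighted over stocks predicted 'good'.
network_ret = next_day_returns[predicted_good].mean()

# 2. Random stock portfolio: equally weighted over a random draw of equal size.
random_ret = rng.choice(next_day_returns,
                        size=int(predicted_good.sum()), replace=False).mean()

# 3. Average stock portfolio: the market-wide mean return of the day.
average_ret = next_day_returns.mean()
```

Comparing `network_ret` against `average_ret` is exactly the "does the portfolio beat the trend" question posed above.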
The network portfolios are constructed using sliding windows, meaning that all the data is used to train the Random Forest classifier except the year in which the desired day lies. For instance, to predict the stocks with good relative positions on 30 March 2022, the algorithm is evidently not trained on that year's data. Furthermore, only the Z = 0.75 quantile is considered in this section, i.e., the classification model is trained on data where the stocks with returns in the top 0.25 quantile have the class of good relative position. Finally, as mentioned in Section 3.1, eight years of data corresponding to eight yearly networks have been considered in this study. Therefore, eight days have been predicted using this model, corresponding to the penultimate working day of March.

Figure 6: The different types of simulated portfolios for the considered years and their corresponding returns in percentages.

Of the eight next-day predictions of the Random Forest algorithm and the corresponding network portfolios, six of the attained portfolios consisting of 100 stocks result in significantly higher returns than the NASDAQ-100 index. These network portfolios and their corresponding returns indicate that the constructed machine learning model is relatively suitable for portfolio construction for the given days. However, in 2020 and 2022, the network portfolio of 100 stocks had lower returns than the NASDAQ-100. In 2020, the Network Portfolio still had a greater return than the corresponding Average Return Stock Portfolio of 100 stocks; hence, the return of the Network Portfolio was still greater than the 0.5 quantile of all returns, since in an uptrend the average return exceeds the median return. Nevertheless, the NASDAQ-100 still outperformed the Network Portfolio in 2020.
A potential reason for this is that the Covid-19 pandemic hit at the start of 2020, when huge tech companies such as Tesla, Amazon, Meta, and others, which are NASDAQ-100 listed, realized historically immense returns. Tesla, for instance, realized almost 740% returns in 2020. In 2022, the Network Portfolio consisting of 100 stocks performs significantly worse than the NASDAQ-100 and slightly worse than the Average Return Stock Portfolio. A potential explanation for the subpar performance is that 2022 saw a steep downtrend, whereas in every other considered year except 2018 there was a moderate uptrend in the NASDAQ stock market. Consequently, the given Random Forest classifier has been trained more extensively on periods of an uptrend and is potentially less adequate in times of a downtrend. In 2018, the more 'optimistic' trait of the machine learning model was less problematic, but in 2022 the algorithm's greediness is much more perceptible and costly. Additionally, these portfolios have been constructed with the threshold quantile value Z = 0.75, and it was observed earlier that the algorithm becomes greedier in classifying stocks as 'good' as the threshold quantile value Z increases. Therefore, a potential factor in the relatively worse performance during the downtrend is the threshold quantile Z used. Thus, the value of Z should be chosen with the market trend in mind when constructing portfolios.

5 Conclusion

This paper aimed to examine the potential of network analysis for risk-management purposes when constructing stock portfolios by bridging the gap between the theoretical knowledge about network analysis and state-of-the-art machine learning techniques and applying it to the NASDAQ stock market.
A Random Forest model was constructed based on the yearly topological properties of the given stocks and other relevant independent variables to predict the relative position of stocks for the next day, with an average cross-validation score of 67.5%. Although the bias of the attained learning model was not negligible, it showed a certain degree of robustness concerning the variance of the model, allowing an assessment of the predictive value of network topology measures and, thus, the usefulness of network analysis. Moreover, a permutation importance procedure was implemented to assess the proportion of the predictive value of the topology measures for the given model. Eigenvector and closeness centrality appeared to have the most predictive value compared to the other topology measures. However, multicollinearity between the topology measures could have made the measures that correlate with eigenvector and closeness centrality less relevant. Furthermore, numerous portfolios were constructed for eight days using the Random Forest model and compared to other portfolios, including the NASDAQ-100 index. For the considered days, the Network Portfolio outperformed the NASDAQ-100 index on six of the eight days. Furthermore, the model's performance exhibited a certain degree of sensitivity to the market trend, which could potentially be due to the bias of the training data toward an uptrend. Finally, to address the purpose of this thesis: network analysis certainly shows significant potential as an additional instrument for analyzing the complex workings of an equity market. Furthermore, machine learning and the increased availability of historical data and computing power have created the prospect of making network analysis adequate for risk-management purposes. However, more research is necessary to make network analysis a practical risk-management tool for the agents of the stock market; directions for such research are presented in the next section.
6 Discussion

6.1 The Network

Regarding the construction of the underlying network of the NASDAQ stock market that is used as the foundation for the particular Random Forest model, certain limitations could be addressed in future research. First, the networks in this study are formed from Pearson correlation matrices of stock returns. However, as presented in the literature review, the Pearson coefficient captures only linear relationships between the different stocks. Therefore, the constructed networks do not contain the non-linear interconnectedness between the given stocks. Additionally, the Pearson coefficient is a zero-time-lag correlation coefficient and therefore neglects the time lags that could be present in the correlations between the given stocks. This shortcoming could be addressed, for example, by introducing a time-lagged correlation matrix; however, the resulting correlation matrix would then no longer be symmetric. Hence, the correlation matrix on which the prediction model is based has certain shortcomings that could be improved upon in future research. Furthermore, in this investigation, yearly networks are constructed based on daily data of the corresponding year. This element could be handled more systematically. When predicting the relative positions of given stocks for a particular day, it is potentially more appropriate to build monthly or quarterly networks instead of yearly networks. Specifically, if there is a significant change in the correlations between stocks in the last periods of the year on which the network is based, then the topological measures extracted from the network may be less representative of the next-day relative position of a given stock.
In the case of monthly networks, it is also possible to include lags of the topological measures corresponding to a stock, potentially revealing the dynamic shift of the stock's relative position more clearly. Moreover, it is viable to include both yearly and monthly networks, as then both the short-term and the relatively long-term topological measures of the stocks can be used for constructing the predictive model. Lastly, it would also be interesting to predict next-month relative positions instead of daily relative positions. With daily returns, the probability of outliers is greater than with monthly returns, as the latter are aggregated returns over several days; the bias from outlying days is therefore diminished. In that instance, it would be interesting to consider yearly and quarterly networks. Hence, with respect to the desired prediction period, the networks should be constructed so that the dynamic behavior of the stocks in the networks is revealed and both the short-term and long-term relative positions of the stocks can be utilized.

6.2 The Predictive Model

In this study, the chosen predictive model was the Random Forest classifier. The Random Forest is a non-parametric model and is somewhat resistant to the presence of multicollinearity compared to, for example, linear regression or Bayesian learning, as not all predictor variables are used simultaneously in the decision trees of the Random Forest. Nevertheless, there are some limitations to the predictive model that could be addressed in future research. First, the training data of the model should incorporate predictor variables corresponding to different macroeconomic factors, such as the state of the economy, the market trend, the profitability of the given companies, and the level of the interest rate, since macroeconomic conditions also affect the behavior of the stock market.
More generally, the predictive model should be subjected to a more significant number of independent variables that could affect future stock returns and, thus, the relative position of a given stock in the network. Evidently, these predictor variables should be tested for (statistical) significance for the given model. Consequently, the predictive model would be less prone to omitted variable bias, which was likely present in this paper. Furthermore, the quality and size of the training data are also crucial for the model's performance. Instead of considering the 500 largest stocks by market capitalization, more stocks in the given equity market could be included. Alternatively, it is possible to consider multiple stock markets jointly, as the stocks within these different markets could correlate significantly. Furthermore, regarding the quality of the training data, the model should be subjected to all different types of market movements and should not be biased toward a bearish or bullish market state. A naïve method to achieve this is to feed the predictive model more data. When utilizing a large amount of data, it may be beneficial to use other machine learning models, such as Gradient Boosting, which is faster and more robust as the size of the training data grows, whereas the Random Forest is more suitable for relatively small or moderate amounts of data. Lastly, as discovered in this investigation, specific topological properties are potentially correlated. In order to discover the relevance of these predictor variables, it is possible to cluster the correlated variables together and guide the learning model to employ these variables only separately. In this manner, the predictive value of all variables is more easily recognized, and the model can be adjusted accordingly.
6.3 Portfolio Construction

In this paper, the portfolio construction using the predictive model was intended to demonstrate the potential of the constructed predictive model. Specific points could be improved and taken into consideration in future work. When constructing portfolios, it is essential to consider the risk tolerance of the user, which may be balanced against the introduced threshold quantile Z: as Z grows, the learning model becomes greedier toward picking the 'good' class and hence takes more risk. Furthermore, the downsampling of the 'bad' class, used to create a balanced training set for the predictive model, can also be used to adjust the risk threshold. A higher level of downsampling of the 'bad' class leads to labeling more stocks as 'good.' In comparison, a lower level of downsampling, which leaves the 'bad' class larger than the 'good' class, will result in more stocks being classified as having a relatively bad position and, thus, less risk being taken. Principally, when constructing the portfolios, it is necessary to conduct more simulations over several periods with various macroeconomic conditions to better assess the predictive model's performance. However, running more simulations and predicting more periods may be computationally constraining. Ultimately, the attained network portfolios could be further improved using robust risk assessment measures such as the Sharpe Ratio or Conditional Value at Risk to better regulate the risk of these portfolios.

References

Aatqb, Johar and Amer Iqbal (Apr. 2019). “Introduction to Support Vector Machines and Kernel Methods”. url: https://www.researchgate.net/publication/332370436_Introduction_to_Support_Vector_Machines_and_Kernel_Methods.

Albert, Réka and Albert-László Barabási (Jan. 2002). “Statistical mechanics of complex networks”. In: Rev. Mod. Phys. 74 (1), pp. 47–97. doi: 10.1103/RevModPhys.74.47.
url: https://link.aps.org/doi/10.1103/RevModPhys.74.47.

Aly, Mohamed (2005). “Survey on multiclass classification methods”. In: Neural Netw 19.1, p. 9.

Bergstra, James and Yoshua Bengio (2012). “Random Search for Hyper-Parameter Optimization”. In: Journal of Machine Learning Research 13.10, pp. 281–305. url: http://jmlr.org/papers/v13/bergstra12a.html.

Biau, Gérard and Erwan Scornet (2015). A Random Forest Guided Tour. doi: 10.48550/ARXIV.1511.05741. url: https://arxiv.org/abs/1511.05741.

Bonacich, Phillip (2007). “Some unique properties of eigenvector centrality”. In: Social Networks 29.4, pp. 555–564. issn: 0378-8733. doi: 10.1016/j.socnet.2007.04.002. url: https://www.sciencedirect.com/science/article/pii/S0378873307000342.

Breiman, L. et al. (1984). Classification and Regression Trees. Taylor & Francis. isbn: 9780412048418. url: https://books.google.nl/books?id=JwQx-WOmSyQC.

Breiman, Leo (July 1996). “Technical Note: Some Properties of Splitting Criteria”. In: Machine Learning 24.1, pp. 41–47. issn: 1573-0565. doi: 10.1023/A:1018094028462. url: https://doi.org/10.1023/A:1018094028462.

— (Oct. 2001). “Random Forests”. In: Machine Learning 45.1, pp. 5–32. issn: 1573-0565. doi: 10.1023/A:1010933404324. url: https://doi.org/10.1023/A:1010933404324.

Brin, Sergey and Lawrence Page (1998). “The anatomy of a large-scale hypertextual Web search engine”. In: Computer Networks and ISDN Systems 30.1. Proceedings of the Seventh International World Wide Web Conference, pp. 107–117. issn: 0169-7552. doi: 10.1016/S0169-7552(98)00110-X. url: https://www.sciencedirect.com/science/article/pii/S016975529800110X.

Dorogovtsev, S. N. and J. F. F. Mendes (2002). “Evolution of networks”. In: Advances in Physics 51.4, pp. 1079–1187. doi: 10.1080/00018730110112519. url: https://doi.org/10.1080/00018730110112519.
Eng-Uthaiwat, Harnchai (Aug. 2018). “Stock market return predictability: Does network topology matter?” In: Review of Quantitative Finance and Accounting 51.2, pp. 433–460. issn: 1573-7179. doi: 10.1007/s11156-017-0676-3. url: https://doi.org/10.1007/s11156-017-0676-3.

Fan, Jingfang et al. (Aug. 2019). “Topology of products similarity network for market forecasting”. In: Applied Network Science 4.1, p. 69. issn: 2364-8228. doi: 10.1007/s41109-019-0171-y. url: https://doi.org/10.1007/s41109-019-0171-y.

Feurer, Matthias and Frank Hutter (2019). “Hyperparameter Optimization”. In: Automated Machine Learning: Methods, Systems, Challenges. Ed. by Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Cham: Springer International Publishing, pp. 3–33. isbn: 978-3-030-05318-5. doi: 10.1007/978-3-030-05318-5_1. url: https://doi.org/10.1007/978-3-030-05318-5_1.

Freeman, Linton C. (1978). “Centrality in social networks conceptual clarification”. In: Social Networks 1.3, pp. 215–239. issn: 0378-8733. doi: 10.1016/0378-8733(78)90021-7. url: https://www.sciencedirect.com/science/article/pii/0378873378900217.

Gong, Xiao-Li et al. (2019). “Financial systemic risk measurement based on causal network connectedness analysis”. In: International Review of Economics & Finance 64, pp. 290–307. issn: 1059-0560. doi: 10.1016/j.iref.2019.07.004. url: https://www.sciencedirect.com/science/article/pii/S1059056018305586.

He, Chengying et al. (2022). “Sudden shock and stock market network structure characteristics: A comparison of past crisis events”. In: Technological Forecasting and Social Change 180, p. 121732. issn: 0040-1625. doi: 10.1016/j.techfore.2022.121732. url: https://www.sciencedirect.com/science/article/pii/S004016252200258X.

Hong, Mi Yeon and Ji Won Yoon (Feb. 2022). “The impact of COVID-19 on cryptocurrency markets: A network analysis based on mutual information”. In: PLOS ONE 17.2, pp. 1–24.
doi: 10.1371/journal.pone.0259869. url: https://doi.org/10.1371/journal.pone.0259869.

Lindner, Thomas, Jonas Puck, and Alain Verbeke (Sept. 2022). “Beyond addressing multicollinearity: Robust quantitative analysis and machine learning in international business research”. In: Journal of International Business Studies 53.7, pp. 1307–1314. issn: 1478-6990. doi: 10.1057/s41267-022-00549-z. url: https://doi.org/10.1057/s41267-022-00549-z.

Long, Haiming, Ji Zhang, and Nengyu Tang (July 2017). “Does network topology influence systemic risk contribution? A perspective from the industry indices in Chinese stock market”. In: PLOS ONE 12.7, pp. 1–19. doi: 10.1371/journal.pone.0180382. url: https://doi.org/10.1371/journal.pone.0180382.

Mantegna, R. N. (Sept. 1999). “Hierarchical structure in financial markets”. In: The European Physical Journal B - Condensed Matter and Complex Systems 11.1, pp. 193–197. issn: 1434-6036. doi: 10.1007/s100510050929. url: https://doi.org/10.1007/s100510050929.

Millington, Tristan and Mahesan Niranjan (2021). “Construction of minimum spanning trees from financial returns using rank correlation”. In: Physica A: Statistical Mechanics and its Applications 566, p. 125605. issn: 0378-4371. doi: 10.1016/j.physa.2020.125605. url: https://www.sciencedirect.com/science/article/pii/S0378437120309031.

Namaki, A. et al. (2011). “Network analysis of a financial market based on genuine correlation and threshold method”. In: Physica A: Statistical Mechanics and its Applications 390.21, pp. 3835–3841. issn: 0378-4371. doi: 10.1016/j.physa.2011.06.033. url: https://www.sciencedirect.com/science/article/pii/S0378437111004808.

Oldham, Stuart et al. (July 2019). “Consistency and differences between centrality measures across distinct classes of networks”. In: PLOS ONE 14.7, pp. 1–23. doi: 10.1371/journal.pone.0220061. url: https://doi.org/10.1371/journal.pone.0220061.

Podobnik, B. et al. (June 2010).
“Time-lag cross-correlations in collective phenomena”. In: Europhysics Letters 90.6, p. 68001. doi: 10.1209/0295-5075/90/68001. url: https://dx.doi.org/10.1209/0295-5075/90/68001.

Podobnik, Boris and H. Eugene Stanley (Feb. 2008). “Detrended Cross-Correlation Analysis: A New Method for Analyzing Two Nonstationary Time Series”. In: Phys. Rev. Lett. 100 (8), p. 084102. doi: 10.1103/PhysRevLett.100.084102. url: https://link.aps.org/doi/10.1103/PhysRevLett.100.084102.

“Cross-Validation” (2010). In: Encyclopedia of Machine Learning. Ed. by Claude Sammut and Geoffrey I. Webb. Boston, MA: Springer US, p. 249. isbn: 978-0-387-30164-8. doi: 10.1007/978-0-387-30164-8_190. url: https://doi.org/10.1007/978-0-387-30164-8_190.

Shirley, Mark D.F. and Steve P. Rushton (2005). “The impacts of network topology on disease spread”. In: Ecological Complexity 2.3, pp. 287–299. issn: 1476-945X. doi: 10.1016/j.ecocom.2005.04.005. url: https://www.sciencedirect.com/science/article/pii/S1476945X05000280.

Tumminello, M. et al. (2005). “A tool for filtering information in complex systems”. In: Proceedings of the National Academy of Sciences 102.30, pp. 10421–10426. doi: 10.1073/pnas.0500298102. url: https://www.pnas.org/doi/abs/10.1073/pnas.0500298102.

Wang, Yu et al. (Nov. 2016). “Comparison of Different Generalizations of Clustering Coefficient and Local Efficiency for Weighted Undirected Graphs”. In: Neural Computation 29, pp. 1–19. doi: 10.1162/NECO_a_00914.

Watts, Duncan J. and Steven H. Strogatz (June 1998). “Collective dynamics of ‘small-world’ networks”. In: Nature 393.6684, pp. 440–442. issn: 1476-4687. doi: 10.1038/30918. url: https://doi.org/10.1038/30918.

Wu, Xin, Yuchen Gao, and Dian Jiao (2019). “Multi-Label Classification Based on Random Forest Algorithm for Non-Intrusive Load Monitoring System”. In: Processes 7.6. issn: 2227-9717. doi: 10.3390/pr7060337.
url: https://www.mdpi.com/2227-9717/7/6/337.

Yang, Jaewon and Jure Leskovec (Dec. 2010). “Modeling Information Diffusion in Implicit Networks”. In: pp. 599–608. doi: 10.1109/ICDM.2010.22.

Yun, Tae-Sub, Deokjong Jeong, and Sunyoung Park (2019). ““Too central to fail” systemic risk measure using PageRank algorithm”. In: Journal of Economic Behavior & Organization 162, pp. 251–272. issn: 0167-2681. doi: 10.1016/j.jebo.2018.12.021. url: https://www.sciencedirect.com/science/article/pii/S0167268118303536.

7 Appendix

7.1 Appendix A: Visualization of the Networks

(a) 2014 (b) 2016 (c) 2017 (d) 2018 (e) 2019 (f) 2020 (g) 2021 (h) 2022

Figure 7: The constructed networks of the 500 largest NASDAQ stocks by market capitalization for the considered years.