
Heterochromatin Gene Dynamics & Cancer Instability

Exploring Heterochromatin Gene Dynamics Reveals Major
Contributors to Chromosomal Instability in Cancer
Student: Dylan Haynes-Simmons
Student Number: 2780600
Supervisor: Dr Aniek Janssen
Daily Supervisor: Aditya Dixit
Department: Centre for Molecular Medicine UMC Utrecht
Group: Janssen Group
Examiner: Reza Haydarlou
MSc: Bioinformatics and Systems Biology
Track: Bioinformatics
Course Code: XM_0103
Credit Points (ECs): 60
Duration: 10 months
1. Abstract
Heterochromatin plays a critical role in maintaining genomic stability and regulating gene
expression, making it essential for the proper functioning of cells. It ensures genomic
integrity by stabilising repetitive DNA sequences and protecting against harmful genetic
recombinations. Chromosomal instability (CIN), a hallmark of cancer, involves increased rates of
chromosomal missegregation leading to aneuploidy, structural chromosomal rearrangements,
and copy number variations. Dysregulation of genes coding for heterochromatin proteins is
thought to contribute to CIN and cancer progression. However, the underlying functional
mechanisms remain poorly understood, particularly regarding the impact of individual genes
on CIN. This study addresses this gap by constructing an interactome of genes encoding
heterochromatin proteins and employing a machine-learning model to explore their relation
to CIN features in cancer. Our analysis revealed a distinct transcriptional signature of
heterochromatin genes in cancerous conditions. The machine-learning model demonstrated
an overall accuracy of 0.805 in classifying arm-level aneuploidies. Genes identified as
significant contributors to CIN features included CENP-A, linked to multiple arm
aneuploidies; SETDB1, associated with pericentromeric instability on chromosome 1; and
NUP133 and NUP155, which influenced aneuploidies across several chromosomes,
highlighting their potential roles in maintaining genomic stability. These findings emphasise
the essential roles of heterochromatin genes in CIN, illustrating how misregulation of specific
genes such as CENP-A, SETDB1, NUP133, and NUP155 could contribute to tumorigenesis.
This deeper understanding of the molecular mechanisms influencing chromosomal instability
provides valuable insights for future research which could further elucidate the role of
heterochromatin protein-encoding genes in cancer.
2. Introduction
Heterochromatin, a tightly packed form of
DNA, is a chromatin type identified by
darkly stained regions of the nucleus
(Passarge, 1979). It is essential for
maintaining genome stability and
regulating gene expression in eukaryotic
cells. It was initially described through
studies on position-effect variegation in
Drosophila, where heterochromatin was
observed to cause mosaic gene silencing
(Lomberk et al., 2006). The structure of
heterochromatin is characterised by the
presence of specific proteins, such as
Heterochromatin Protein 1 (HP1), which
binds to di- and tri-methylated histone H3
lysine 9 (H3K9me2/3) and facilitates the
formation of a repressive chromatin
environment (Hennig, 1999). Importantly,
heterochromatin functions as a “guardian”
of genomic integrity, ensuring the stability
of repetitive DNA sequences and
protecting against harmful genetic
recombinations (Janssen et al., 2018).
Heterochromatin is often associated with
the nuclear lamina, a fibrous layer lining
the inner nuclear membrane, and the
proteins that reside there (Olins et al.,
2010; Margalit et al., 2007; Maison et al.,
1997; Poleshko et al., 2013; Zeng et al.,
2010; Shibuya et al., 2014; Dunce et al.,
2018; Li et al., 2018). The nuclear lamina
helps organise chromosomes and establish
genome spatial organisation through
lamina-associated domains (LADs) (Van
Steensel & Belmont, 2017; Shevelyov &
Ulianov, 2019). LADs typically contain
silent or lowly expressed genes and
contribute to transcriptional repression and
chromatin compaction (Van Steensel &
Belmont, 2017). Despite its known roles in
gene silencing and chromosomal
architecture, many aspects of
heterochromatin’s functional mechanisms
remain poorly understood (Lomberk et al.,
2006; Hennig, 1999). This lack of
understanding is significant given
heterochromatin’s crucial role in
chromosomal stability, a concept central to
exploring how chromosomal instability
originates in cancer.
Chromosomal instability (CIN) is a
hallmark of cancer characterised by an
increased rate of chromosomal
missegregation, leading to aneuploidy,
structural rearrangements, and copy
number variations (Janssen et al., 2018).
The dysregulation of heterochromatin and
its associated proteins is thought to play a
significant role in CIN. Alterations in the
composition of the nuclear lamina and
LADs during cancer progression result in
the reorganisation of heterochromatin,
leading to changes in gene expression and
genomic instability (Bellanger et al.,
2022). These changes in LADs are often
accompanied by epigenetic modifications,
such as the loss of H3K9me2 and DNA
hypomethylation, which contribute to the
loss of heterochromatin integrity and
increased susceptibility to chromosomal
rearrangements (Van Steensel & Belmont,
2017). Consequently, the restructuring of
LADs and the nuclear lamina in cancer
cells calls attention to the critical
relationship between nuclear architecture
and chromosomal stability (Smith et al.,
2018). Studying the dynamics of
heterochromatin-associated proteins in the
context of cancer could provide significant
insights into their roles in regulating
features of CIN, paving the way for
targeted therapeutic strategies.
Recent studies have highlighted that the
dysregulation of heterochromatin-associated
proteins and their encoding genes
contributes significantly to CIN. For
instance, proteins such as HP1 are
essential for maintaining chromosomal
integrity, gene silencing, and the formation
of heterochromatin, with reductions in
these proteins linked to cancer progression
(Dialynas et al., 2008). Similarly,
Polycomb-protein encoding genes are
frequently upregulated in aggressive
cancers, contributing to transcriptional
repression and tumour development
(Clermont et al., 2015). Defects in
enzymes such as topoisomerases,
important for heterochromatin structure
and function, can lead to the de-repression
of transposons, resulting in chromosomal
instability (Lee & Wang, 2019; Amoiridis
et al., 2024). These examples underline the
pivotal roles of heterochromatin-associated
proteins in maintaining genomic stability
and their significant impact on
chromosomal instability in cancer.
However, a complete understanding of the
role of heterochromatin misregulation in
the development of CIN in cancer is
currently missing. Therefore, further
research is necessary to elucidate the
molecular mechanisms by which such
genes influence CIN and to
explore their potential as therapeutic
targets. Given the potentially fundamental
role of heterochromatin in maintaining
genomic stability and its dysregulation in
cancer, our study aimed to delineate the
roles of “heterochromatin-associated genes”
(i.e. genes encoding for heterochromatin
proteins) in and across cancer types by
constructing a comprehensive interactome
and employing a machine-learning model to
assess their impact on chromosomal
instability. Previous studies have
demonstrated the utility of machine
learning in genomic feature selection and
association studies. Random forest
algorithms have been effectively used to
select relevant genes from high-dimensional
microarray data, improving the
identification of biomarkers associated with
cancer progression (Anaissi et al., 2013).
Additionally, an ensemble model based on
a support vector machine approach proved
adept in feature selection and providing
robust biomarker identification (Anaissi et
al., 2016). Similarly, integrating gene
expression profiles to infer CIN signatures
has provided significant insights into the
underlying molecular mechanisms and has
been predictive of clinical outcomes in
multiple cancer types (Carter et al., 2006).
Here, our approach involved integrating
RNA-seq data from The Cancer Genome
Atlas (TCGA) to build a detailed
expression profile of heterochromatin-
associated genes across various cancer
types and analyse their impact on
chromosomal instability (Fig.1). By
querying the STRING Protein-Protein
Interaction (PPI) database, we expanded
our core gene set to construct an extensive
heterochromatin interactome. This
interactome included numerous genes
encoding for proteins linked to
heterochromatin structure or function and
implicated in cancer dysregulation. To
relate the aberrations in the expression of
heterochromatin-associated genes to CIN
features, we developed a machine
learning-based approach using a multi-output stacked ensemble model. This model estimated
multiple features indicative of CIN, such as arm-level aneuploidies, homologous recombination
deficiency (HRD) signatures, and copy number variations in pericentromeric regions. A
subsequent feature importance analysis identified key genes as significant contributors to
CIN features, providing novel insights into the molecular mechanisms underlying chromosomal
instability in cancer. An additional interaction analysis provided further context into the
network of heterochromatin-associated gene interactions. By employing this detailed
approach, we aim to bridge the gap in understanding the complex relationship between
heterochromatin dynamics and chromosomal instability in cancer.

Figure 1. Flowchart of the workflow
Structure of the complete workflow from the core protein set to the feature importance.
Step 1: expansion of the heterochromatin gene set (core protein list, STRING PPI, set-of-interest).
Step 2: creation of the expression & CIN dataset for the set-of-interest (set-of-interest,
TCGA PanCan dataset, set-of-interest dataset).
Step 3: modelling the CIN features from the set-of-interest dataset (XGBoost ML model,
genomic feature importance, CIN feature interactions).
3. Results
3.1 A transcriptionally distinct heterochromatin gene set is a signature of cancer
Heterochromatin genes are known to serve
a multitude of nuclear functions, yet their
roles in cancer remain enigmatic (Janssen
et al., 2018); as such, an interactome of
heterochromatin-associated genes could
provide crucial insights into how these
genes influence chromosomal instability,
gene expression regulation, and ultimately,
cancer progression.
A core set of 18 protein-coding genes
known to be enriched in heterochromatin
and the nuclear lamina (Table 1) as
identified in recent literature (Olins et al.,
2010; Margalit et al., 2007; Maison et al.,
1997; Poleshko et al., 2013; Zeng et al.,
2010; Shibuya et al., 2014; Dunce et al.,
2018; Li et al., 2018), was used to find the
candidate heterochromatin interactome and
generate the set-of-interest (SOI).
This initial set of 18 proteins was expanded by querying the STRING protein-protein
interaction (PPI) database (Szklarczyk et al., 2023) with the candidate gene set,
specifically for genes encoding physically interacting proteins (Table S1). This led to an
SOI of 327 protein-encoding genes.

Table 1. Heterochromatin & LAD protein set
Table of the initial 18 heterochromatin and LAD proteins selected to build the
heterochromatin interactome. The anchor factor specifies the complex or protein linking the
heterochromatin protein to the nuclear lamina.

Protein name    Gene name      Anchor factor
Man1            LEMD3          BAF
LBR             LBR            HP1
Emerin          EMD            BAF
LAP1B           TOR1AIP1       Lamin A/C
LAP2β           TMPO           BAF
LEMD2 (LEM2)    LEMD2          BAF
PRR14           PRR14          HP1
Lamin A/C       LMNA           Histone
Lamin B1        LMNB1          Histone
Lamin B2        LMNB2          Histone
IFFO1           IFFO1          XRCC4 & Lamin A/C
BAF             BAF (BANF1)    BAF
TERB1           TERB1          TRF1 (Shelterin)
TERB2           TERB2          TERB1/TRF1
MAJIN           MAJIN          TERB1/2/TRF1
HP1α            CBX5           H3K9
HP1β            CBX1           H3K9
HP1γ            CBX3           H3K9

Initially, the SOI was analysed to gain a general understanding of its transcriptional
characteristics and evaluate its potential for studying chromosomal instability in the
context of cancer. To achieve this, TCGA’s PanCan dataset was used to obtain RNAseq data
for the SOI in cancer samples (n = 10535), as well as healthy samples (n = 725).
Additionally, unique control sets (n = 547), each composed of genes not included in the SOI
but identical to it in number of genes (n = 327), were generated to provide multiple
baselines for comparison across both conditions.
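As an illustration only, the set expansion and the control-set sampling described above could be sketched as follows. The function names, the toy edge list and the background gene names are hypothetical and are not taken from the study's code; in particular, the real interaction partners came from STRING, whereas this sketch assumes they are already available as a list of gene pairs.

```python
import math
import random


def expand_gene_set(core_genes, interactions):
    """Return the core genes plus every direct interaction partner of a core gene."""
    core = set(core_genes)
    expanded = set(core)
    for gene_a, gene_b in interactions:
        if gene_a in core:
            expanded.add(gene_b)
        if gene_b in core:
            expanded.add(gene_a)
    return expanded


def sample_control_sets(all_genes, soi, n_sets, seed=0):
    """Draw n_sets distinct random gene sets, each the size of the SOI and disjoint from it."""
    background = sorted(set(all_genes) - set(soi))
    size = len(soi)
    if size > len(background) or n_sets > math.comb(len(background), size):
        raise ValueError("not enough background genes for the requested control sets")
    rng = random.Random(seed)
    seen, controls = set(), []
    while len(controls) < n_sets:
        candidate = frozenset(rng.sample(background, size))
        if candidate not in seen:  # keep every control set unique
            seen.add(candidate)
            controls.append(sorted(candidate))
    return controls


# Toy data (hypothetical edge list and background genes).
core = ["LMNA", "CBX5"]
edges = [("LMNA", "EMD"), ("SUV39H1", "CBX5"), ("GENE_X", "GENE_Y")]
soi = expand_gene_set(core, edges)
all_genes = sorted(soi) + ["GENE_X", "GENE_Y"] + [f"BG{i}" for i in range(20)]
controls = sample_control_sets(all_genes, soi, n_sets=5)
```

Fixing the random seed makes the control sets reproducible, and rejecting duplicates mirrors the requirement that the control sets be unique.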
When compared with each other and with the SOI, the control sets showed distinct
transcriptional profiles. This was supported by Friedman’s test (p < 2.2e-16), which
indicated that the control sets did not all differ from the SOI by the same amount in mean
expression level, in both the cancerous and the healthy conditions. To validate the
uniqueness of the control sets, we conducted a contrast analysis for each set. This
involved comparing the difference in mean expression between the SOI and the control set to
the global mean difference between the SOI and all other control sets. We repeated this
process for each control set. This analysis revealed that the majority of control sets
differed from the global trend (Mann-Whitney U, adjusted p < 0.05) for both the cancerous
subset (99.7%) and the normal subset (96.2%), indicating that the control sets would indeed
serve as a broad and distinct range of baselines.

Figure 2. Differential gene expression patterns of the interactome set-of-interest in cancerous versus normal tissues.
A&B) Differential expression (log2 of the mean differences in expression values between paired samples; x-axis) in the set-of-interest (red)
compared to 547 randomly generated control sets (blue), for all healthy and cancerous paired samples (A) and for selected cancer types (B:
breast invasive carcinoma, head and neck squamous cell carcinoma, lung adenocarcinoma and lung squamous cell carcinoma).
Adjusted p-values (−log10; y-axis) calculated using the Wilcoxon signed-rank test.
C) Differential gene expression, highlighting genes in the set-of-interest (red) compared to background genes from the TCGA dataset (blue) for
all paired samples. P-values (−log10; y-axis) and log2 fold changes (x-axis) calculated with the edgeR package. Genes labelled in the panel
include CBX3, SYNE1, CBX7, FOS, SMARCA2, GABARAPL1, PPP2CB, RHOJ, DSN1, DNMT1, TOP2A, KIF20A, EZH2, CENPA, CDCA8, AURKB, CDK1,
CHAF1B, HMGA2, QKI and EGF.
All p-value cutoffs set to 0.01.
To determine whether the SOI’s
expression differs significantly between
cancerous and healthy conditions
compared to the controls, we conducted a
differential analysis of all gene sets using
only paired cancer-healthy samples. We
found that the SOI distinctly showed a
significant change in average expression
(paired permutation test, p = 10^-20) that
differed from the general trend presented
by the control sets (Fig.2a). Moreover,
when refining the analysis to specific
cancer types, this aberration from the trend
was also observed in three of the four
representative cancer types (breast
invasive carcinoma, head and neck
squamous cell carcinoma & lung
squamous cell carcinoma; Fig.2b). These
results seemed to support the suitability of
the defined set for the next analyses.
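The idea behind a paired permutation test can be sketched as below. This is a minimal Monte Carlo sign-flip version written for illustration; the function name, the number of permutations and the toy expression values are assumptions, and the exact procedure used in the study may differ. Note that a Monte Carlo estimate of this kind cannot resolve p-values smaller than roughly one over the number of permutations.

```python
import random


def paired_permutation_test(cancer, healthy, n_permutations=10000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference."""
    diffs = [c - h for c, h in zip(cancer, healthy)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical mean expression of one gene set in eight paired samples.
cancer = [5.1, 6.0, 5.8, 6.3, 5.9, 6.1, 5.7, 6.4]
healthy = [4.0, 4.2, 4.1, 4.5, 4.3, 4.0, 4.4, 4.2]
p_value = paired_permutation_test(cancer, healthy)
p_null = paired_permutation_test([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], n_permutations=100)
```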
We further resolved the expression
characteristics of the individual genes
within the SOI by performing a differential
gene expression analysis (DGEA) on the
full TCGA genome across cancerous and
healthy conditions (Fig.2c). This analysis
demonstrated that several of the genes
within our set display significantly altered
expression in cancer which could drive the
distinctive aggregate expression patterns
observed previously (Fig.2a). These genes,
which include TOP2A, KIF20A, CENPA,
AURKB, EZH2 and CHAF1B, mostly
code for structural, mitotic and chromatin
modifier proteins. The standout genes
proved to be fairly invariable when the
DGEA was repeated on the representative
cancer types (Fig.S1).
Thus, the heterochromatin interactome,
which constitutes the SOI, offers
significant and unique value for analysing
heterochromatin-based aberrations and
their relation to chromosomal instability in
cancer.
3.2 Employing a Machine Learning model to decode the impact of heterochromatin on chromosomal instability features
To relate the aberrations in the expression
of heterochromatin-associated genes to the
presence of CIN features, we developed a
machine learning-based approach with a
multi-output stacked ensemble model to
uncover the underlying links (Fig.3). The
model used the cancerous SOI RNAseq
data as input and estimated multiple
features (n = 56) indicative of CIN. The
model was structured for the classification
of arm-level aneuploidies (n = 39), and
estimation of homologous recombination
deficiency (HRD) signatures (n = 3) as
well as copy number variation (CNV)
values for a selection of pericentromeric
genomic regions (n = 14).
We selected these specific features for
measuring CIN based on their established
roles in depicting chromosomal instability.
Arm-level aneuploidies were chosen due
to their direct representation of
chromosome number variations, which are
pivotal in understanding the extent of
chromosomal missegregation, a hallmark
of CIN (Baker et al., 2024; Adell et al.,
2023). Homologous recombination
deficiency (HRD) signatures, including
loss of heterozygosity scores (Abkevich et
al., 2012), number of telomeric imbalances
(Birkbak et al., 2012), and large-scale state
transitions (Popova et al., 2012), are
reflective of defects in DNA repair
mechanisms, ultimately contributing to
genomic instability (Pellegrino et al.,
2020; Doig et al., 2023). Finally, copy
number variations in pericentromeric
regions were included since these
heterochromatin-enriched regions are
known to undergo chromosomal
rearrangements but remain poorly
understood in terms of their genomic
importance (Ehrlich, 2002; Barra &
Fachinetti, 2018).
Figure 3. Configuration of the Chromosome Instability Model
Configuration of the XGBoost-based, stacked, multi-output model. The SOI RNAseq dataset is
the input to a base layer of classification learners (arm-level aneuploidies, e.g. 1p, 1q,
2p, 2q, 3p, 3q, 4p, 4q) and regression learners (e.g. ai1, loh, lst1, peri1, peri2, peri3).
The base predictions are passed to a meta layer, which produces the output for the same
features.
Together, we expect these features to
provide a solid representation of structural
DNA anomalies associated with CIN.
The HRD and arm aneuploidy feature
types were readily provided for the
PanCan TCGA samples. For the
pericentromeric copy number (CN)
features, a custom approach was devised
which involved leveraging annotations for
transposable element repeats throughout
the human genome from recent studies
(Hoyt et al., 2022; Altemose et al., 2022),
made possible by the Telomere-to-Telomere
reference genome (Nurk et al.,
2022), and associating them with the
segmented CN records for the TCGA
samples (see methods).
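The exact procedure is given in the methods; purely as a hedged illustration of what associating an annotated region with segmented CN records can look like, the sketch below takes a length-weighted mean of the segment values that overlap a region. The function name, the half-open coordinate convention, the assumption that segments do not overlap one another, and all coordinates and values are hypothetical.

```python
def region_copy_number(region, segments):
    """Length-weighted mean segment value over the part of `region` covered by segments.

    region   -- (chrom, start, end), half-open coordinates
    segments -- iterable of (chrom, start, end, value) copy-number segments
    Returns None when no segment overlaps the region.
    """
    chrom, r_start, r_end = region
    weighted_sum = 0.0
    covered = 0
    for s_chrom, s_start, s_end, value in segments:
        if s_chrom != chrom:
            continue
        overlap = min(r_end, s_end) - max(r_start, s_start)
        if overlap > 0:
            weighted_sum += overlap * value
            covered += overlap
    return weighted_sum / covered if covered else None


# Hypothetical pericentromeric region and segmented CN records for one sample.
peri_region = ("chr1", 100, 200)
sample_segments = [
    ("chr1", 0, 150, 0.2),
    ("chr1", 150, 300, 0.6),
    ("chr2", 100, 200, 5.0),  # different chromosome, ignored
]
peri_cn = region_copy_number(peri_region, sample_segments)
no_overlap = region_copy_number(("chr3", 0, 10), sample_segments)
```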
The configuration of the model employed
a stacked design where the target CIN
features were initially inferred
independently of one another with extreme
gradient boosting (XGBoost; Chen &
Guestrin, 2016) learners followed by
a second layer of predictions with
XGBoost learners using the previous
predictions as input (Fig.3).
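The data flow of such a stacked, multi-output design can be sketched without any external library. To keep the example self-contained, a depth-1 regression tree (a decision stump) stands in for the XGBoost learners; this substitution, the function names and the toy data are assumptions made for illustration. For brevity the meta layer here is trained on in-sample base predictions and the classification target is treated as a number; the text above does not specify how the study handled either point.

```python
def fit_stump(X, y):
    """Fit a depth-1 regression tree: one feature, one threshold, two leaf means."""
    n = len(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] <= threshold]
            right = [y[i] for i in range(n) if X[i][j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((v - left_mean) ** 2 for v in left)
                   + sum((v - right_mean) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:  # every feature is constant: fall back to the overall mean
        mean = sum(y) / n
        return lambda row: mean
    _, j, threshold, left_mean, right_mean = best
    return lambda row: left_mean if row[j] <= threshold else right_mean


def fit_stacked(X, Y, fit=fit_stump):
    """Train one base learner per target, then one meta learner per target
    that takes the base-layer predictions for all targets as its input."""
    targets = list(Y)
    base = {t: fit(X, Y[t]) for t in targets}
    base_preds = [[base[t](row) for t in targets] for row in X]
    meta = {t: fit(base_preds, Y[t]) for t in targets}

    def predict(row):
        z = [base[t](row) for t in targets]
        return {t: meta[t](z) for t in targets}

    return predict


# Toy expression matrix (4 samples x 2 genes) and two hypothetical CIN targets.
X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
Y = {"arm_1q": [0, 0, 1, 1], "peri_1": [0.0, 0.1, 0.9, 1.0]}
model = fit_stacked(X, Y)
prediction = model([0.85, 3.0])
```

Because every meta learner sees the base predictions for all targets, its feature importances indicate which CIN features inform one another, which is the property exploited in the interaction analysis.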
After the final predictions, the original
genes in the SOI were assessed in terms of
their impact on the presence of the distinct
CIN features.
3.3 The Chromosome Instability model accurately classifies arm-level aneuploidies
The model accurately classified most arm-level aneuploidy features, achieving a global mean
accuracy of 0.805. It performed slightly better in the base layer (mean accuracy = 0.807)
than in the meta layer (mean accuracy = 0.802) (Table S2). The worst accuracy occurred for
18p, the p arm of chromosome 18 (mean accuracy = 0.731), which saw an increase in
performance from the base prediction (base accuracy = 0.727) to the meta prediction (meta
accuracy = 0.736). The best accuracy occurred for 5q (mean accuracy = 0.859) despite seeing
a decrease in performance from the base to the meta layer (base accuracy = 0.862, meta
accuracy = 0.856) (Fig.4a).
Figure 4. Performance of the general Chromosome Instability model predictions.
Predictive performance of the individual XGBoost learners (x-axis) determined using the overall accuracy (y-axis) for the 39 classification
learners (A; Accuracy of the Aneuploidy Learners), and the coefficient of determination (R²; y-axis) for the 17 regression-based learners
(B; Pericentromeric and HRD Learners in terms of R²), assessed for both the base layer (blue) and meta layer (red).
The arm-level aneuploidy features uniformly displayed a high degree of imbalance across the
three possible classes, with the “normal” aneuploidy class consistently making up more than
two-thirds of the distribution. Hence, we included a probabilistic “class weight” parameter
in the arm-level aneuploidy learners during modelling to account for the imbalance. This
method appears to have been effective, since 76.9% of these learners significantly
outperformed their corresponding baseline model (p <= 0.05), which classifies solely using
the majority class (Table S2). The general accuracy tended to decrease from the base layer
to the meta layer, while the F1 score increased from the base layer classifications (mean
base F1 = 0.718) to the meta layer (mean meta F1 = 0.725) (supplementary data). This
increase in overall F1 score was observed for over half of the learners (59%) when averaging
across all classes and was particularly relevant for minority class F1 scores (69% of
learners for the “loss” class; 56% for the “gain” class), in contrast to the majority class
F1 scores, which did not follow this trend (21%) (supplementary data).
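The quantities discussed above can be made concrete with a small sketch. The weighting shown is the common inverse-frequency heuristic (the same formula scikit-learn uses for class_weight="balanced"); it is an assumption that the “class weight” parameter of the study behaves similarly. Function names and the toy labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a baseline that always predicts the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def per_class_f1(y_true, y_pred):
    """F1 score of each class present in y_true, treated one-vs-rest."""
    scores = {}
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Toy arm-level labels: the "normal" class dominates, as in the text.
y_true = ["normal"] * 7 + ["gain"] * 2 + ["loss"]
y_pred = ["normal"] * 6 + ["gain"] + ["gain", "normal"] + ["loss"]
weights = inverse_frequency_weights(y_true)
baseline = majority_baseline_accuracy(y_true)
f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)
```

Reporting F1 per class alongside accuracy matters here because a learner can match the majority-class baseline in accuracy while never identifying a single gain or loss.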
We concluded that the general model’s classification demonstrates a strong capability in
accurately identifying arm-level aneuploidy features, with a notable global mean accuracy
of 0.805.
Further analysis of the cancer-specific subsets revealed that while the learners maintained
a performance within a similarly consistent range, the accuracy was somewhat lower
(Fig.S2), indicating that the model’s generalisability across different cancer types may
vary.
3.4 The Chromosome Instability model exhibits variable performance in estimating HRD and Pericentromeric features
Although the model performed consistently for the classification objectives, its
performance on regression-based tasks was more varied. When estimating the remaining CIN
features, the predictiveness of the individual learners showed a high degree of
variability, with the coefficient of determination (R²) for the best-performing learner,
peri_20 (mean R² = 0.765), and the worst, peri_8 (mean R² = 0.225), separated by 0.54
(Table S3). The variation in performance was less pronounced for the HRD-based learners,
however, whose R² remained above 0.570 (Fig.4b). Interestingly, across all but one
regression learner, the performance increased upon the transition from the base layer to
the meta layer (mean base R² = 0.445 vs mean meta R² = 0.472). In the sole exception,
peri_3, the learner performance was identical across both layers (R² = 0.510). When
considering the root mean squared error (RMSE), however, a decrease from base-layer to
meta-layer estimations was observed for every feature, emphasising the performance
improvement as a result of the stacked approach.
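For reference, the two regression metrics used above follow their standard definitions and can be computed as in this short sketch. The function names and the toy predictions are illustrative only.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant target")
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical target values with base-layer and meta-layer estimates.
y_true = [1.0, 2.0, 3.0, 4.0]
base_pred = [2.0, 2.0, 3.0, 3.0]
meta_pred = [1.5, 2.0, 3.0, 3.5]
r2_base, r2_meta = r_squared(y_true, base_pred), r_squared(y_true, meta_pred)
rmse_base, rmse_meta = rmse(y_true, base_pred), rmse(y_true, meta_pred)
```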
With a relatively low global average R²
(0.459), a significant amount of variability
among the individual learners, and a
reliable improvement in the meta-layer
compared to the base layer, the
performance for the regression portion of
the model proved to be mixed. These
results highlight the challenges in
accurately predicting certain features with
consistency and suggest a potentially more
complex relationship between the
expression of the genes within the SOI and
certain characteristics of chromosome
instability.
When the model was applied to cancer-specific
subsets, the observed trends
largely mirrored those of the general
model, albeit with lower performance
metrics (Fig.S3). Notably, the cancer-specific
analysis revealed greater
variability in the predictive dominance of
meta-learners compared to base-learners,
deviating from the consistency observed in
the general model. This suggests that while
the model maintains its predictive
approach across different cancer types, the
performance nuances in these subsets
underscore the complexity of CIN
features’ behaviour in varied oncological
contexts.
3.5 Feature Importance analysis reveals significant gene impacts on Chromosome Instability features
To examine the impact of specific genes
on the prediction of distinct features, we
employed a feature importance analysis
using the gain metric (G) which calculates
the relative improvement in predictive
performance attributable to a specific input
feature. As a result of having a large
number of input genes, only the top
contributing genes were included in the
evaluation.
The results from the base-layer genomic feature analysis revealed that a total of 45
different genes contributed the most to the base-learners, yielding a large number of
paired gene-to-learner contributions (56*45) (Fig.5a). MCM3AP displayed the greatest
individual impact in the analysis from its contribution to the 21q learner (G_MCM3AP =
0.255). Yet, MCM3AP’s average impact was low (mean G_MCM3AP = 0.007, Table 2), slightly
above the base layer’s uniform gain threshold (UGT_base = 0.003), the theoretical gain
given an equal contribution by all genes (UGT_base = 1/327), and below the average impact
of the other 44 genes (0.008). CENPA possessed a greater overall influence (mean G_CENPA =
0.020) and was the top gene for all three HRD-based learners, ai1 (G_CENPA = 0.216), lst1
(G_CENPA = 0.176) and loh_hrd (G_CENPA = 0.123) (Table 2).
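The uniform gain threshold is simply the reciprocal of the number of inputs to a learner, which the sketch below reproduces for the two layers. It assumes, as the definition above implies, that the gains of each learner are normalised to sum to 1; the helper names and the toy gains are hypothetical.

```python
def uniform_gain_threshold(n_inputs):
    """Gain each input would receive if all inputs contributed equally."""
    return 1 / n_inputs


def normalise_gains(raw_gains):
    """Scale one learner's raw gains so that they sum to 1."""
    total = sum(raw_gains.values())
    return {name: g / total for name, g in raw_gains.items()}


def above_threshold(gains, n_inputs):
    """Names of the inputs whose normalised gain exceeds the uniform threshold."""
    ugt = uniform_gain_threshold(n_inputs)
    return sorted(name for name, g in gains.items() if g > ugt)


ugt_base = uniform_gain_threshold(327)  # 327 genes feed each base learner
ugt_meta = uniform_gain_threshold(56)   # 56 base predictions feed each meta learner

# Toy learner with four hypothetical input genes.
toy_gains = normalise_gains({"GENE_A": 6.0, "GENE_B": 3.0, "GENE_C": 1.0, "GENE_D": 0.0})
toy_top = above_threshold(toy_gains, n_inputs=4)
```

Rounded to three decimals, the two thresholds evaluate to 0.003 and 0.018, matching the values quoted for the base and meta layers.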
Table 2. Top genomic impact by CIN feature
Table of the CIN features and the top genomic contributor to their base-learner in terms of
impact (Gain) with the corresponding value, and the gene's average impact across all
features.

CIN Feature   Gene       Impact (Gain)   Average impact
1p            ZNF644     0.151           0.005
1q            NUP133     0.120           0.004
2p            PCGF1      0.101           0.009
2q            XRCC5      0.157           0.006
3p            ZNF621     0.116           0.004
3q            RPN1       0.195           0.010
4p            CDK1       0.045           0.020
4q            PGRMC2     0.091           0.006
5p            NUP155     0.200           0.014
5q            CAMLG      0.214           0.007
6p            BAG6       0.198           0.006
6q            ORC3       0.229           0.007
7p            GET4       0.136           0.011
7q            AGK        0.142           0.006
8p            CHMP7      0.216           0.011
8q            ESRP1      0.152           0.008
9p            PSIP1      0.160           0.005
9q            TOR1A      0.158           0.007
10p           SUV39H2    0.120           0.004
10q           DNAJB12    0.239           0.012
11p           RRP8       0.157           0.005
11q           EED        0.188           0.005
12p           PHC1       0.130           0.005
12q           SLC25A3    0.199           0.007
13q           UBAC2      0.157           0.004
14q           PPP2R5C    0.096           0.004
15q           ZNF280D    0.152           0.005
16p           USP7       0.149           0.006
16q           VPS4A      0.249           0.014
17p           C1QBP      0.136           0.004
17q           CHMP6      0.133           0.007
18p           SMCHD1     0.222           0.005
18q           MBD1       0.151           0.006
19p           SGTA       0.104           0.004
19q           TRIM28     0.115           0.004
20p           DDRGK1     0.150           0.005
20q           VAPB       0.122           0.008
21q           MCM3AP     0.255           0.007
22q           XRCC6      0.142           0.006
ai1           CENPA      0.216           0.020
lst1          CENPA      0.176           0.020
loh_hrd       CENPA      0.123           0.020
peri_1        SETDB1     0.153           0.007
peri_2        PCGF1      0.167           0.009
peri_3        RPN1       0.196           0.010
peri_4        YTHDC1     0.060           0.005
peri_5        CAMLG      0.061           0.007
peri_7        GET4       0.120           0.011
peri_8        ESRP1      0.065           0.008
peri_9        TOR1A      0.115           0.007
peri_10       DNAJB12    0.148           0.012
peri_16_1     VPS4A      0.103           0.014
peri_16_2     VPS4A      0.251           0.014
peri_17       MPRIP      0.068           0.006
peri_20       CHMP4B     0.253           0.009
peri_22_3     SMARCB1    0.188           0.007
Average                  0.154           0.008

Figure 5. General Model Feature Importance.
The impact of the input features towards the individual model learners in terms of the Gain metric (gradient). Evaluating the impact of the
base-learner’s top contributing genes (x-axis) on all base-learners (y-axis; A), and the impact of the base learners (x-axis) on all meta-learners
(y-axis; B). Uniform Gain Threshold values (see methods): 0.003 (A) and 0.018 (B).

Overall, the impact of the top genes showed notable variability. A moderate proportion of
gene-learner pairs (14.8%) provided no apparent contribution (G = 0.000) while a smaller
fraction of pairs (2%) supplied more than a tenth of the total contribution (G > 0.100)
towards specific learners (Fig.5a).
These findings accentuate the complex
nature of gene contributions to
chromosomal instability and demonstrate
the nuances captured by the model, with
certain genes exerting significant influence
across multiple learners while others show
substantial impact on unique features with
narrower influence.
Analysing the genomic feature importance
from the cancer-specific models, we saw
they had a similar number of total
contributing genes (45-49) (Fig.S4). We
also found that only nine contributing
genes were shared between the different
cancer-specific feature importance
analyses and that each model was
represented by a unique most impactful
gene based on the greatest gain: YBX1
(LUSC, 1p G_YBX1 = 0.257), EGFR
(LUAD, peri_7 G_EGFR = 0.287), MAPRE1
(BRCA, peri_20 G_MAPRE1 = 0.331) and EED
(HNSC, 11q G_EED = 0.400), all with
varying degrees of impact and to distinct
base-learners (Fig.S5). This variability in
gene contributions across different cancer
types not only highlights the tailored
nature of genomic interactions in cancer
but also reflects the model’s capability to
adaptively capture these distinct
interactions.
3.6 Exploration of feature interactions reveals selective dependencies in Chromosome Instability
We then explored dependencies between
CIN features within our stacked model
configuration. This analysis considered the
initial base-layer predictions and their
corresponding contributions to each of the
refined meta-layer learners. Here, the
UGTmeta (0.018) was higher than UGTbase
due to the number of base-learners (n =
56) being less than the number of genes.
The analysis revealed that feature-specific
base-learners were the primary
contributors to their corresponding meta-learners for most CIN features (95%).
The exceptions were three features: peri_9,
peri_16_2, and peri_22_3. For these cases,
the main contributors were the q-arm
aneuploidy features for the same
chromosomes (9q, 16q, and 22q
respectively; Fig.5b).
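The link between the threshold and the number of inputs can be illustrated with a short sketch. It assumes, because the definition sits in the Methods and is not restated here, that the Uniform Gain Threshold is the gain each input would receive if gain were shared equally (1/n). This assumption reproduces the reported value of 0.018 for 56 base-learners. The function names are hypothetical.

```python
def uniform_gain_threshold(n_inputs):
    """Gain each input would hold if total gain were shared equally.

    Assumption: UGT = 1 / n_inputs (inferred from UGT_meta = 0.018 for n = 56).
    """
    if n_inputs < 1:
        raise ValueError("n_inputs must be at least 1")
    return 1.0 / n_inputs


def above_uniform_share(gains):
    """Keep the inputs whose gain exceeds the uniform share.

    gains: {input_name: fraction of a learner's total gain}.
    """
    threshold = uniform_gain_threshold(len(gains))
    return {name: gain for name, gain in gains.items() if gain > threshold}


# 56 base-learners feeding one meta-learner
ugt_meta = uniform_gain_threshold(56)
```

Under this reading, fewer inputs always give a higher threshold, which is why the meta-layer threshold exceeds the base-layer one.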
The findings also demonstrated a high
degree of variability in the contributions of
the most impactful learners. The
contribution from 4p’s base-learner to its
meta-learner (G_4p = 0.254) was
approximately one-third of that from peri_20’s base-learner to its own meta-learner (G_peri_20
= 0.908). We found a few minor
interactions between pericentromeric base-learners and arm-specific meta-learners
for the same chromosome (Fig.5b).
The contribution from the peri_2 base-learner to the 2q meta-learner (G_peri_2 = 0.265)
illustrates this pattern. The reverse
contributions, from
the arm-level base-learner to the
pericentromeric meta-learner, did not
have the same degree of impact (peri_2,
G_2q = 0.018). The lack of contributions
between different feature-specific learners
was further reflected in the low median
meta-learner gains (the median
contribution a given meta-learner
receives), which never exceeded 0.003
(Table S4). This observation was
reinforced by the low median base-learner
contributions (the median impact
contributed by a given base-learner),
which exceeded UGT_meta only for the
ai1 base-learner (median G_ai1 = 0.042).
The examination of CIN feature
interactions ultimately highlighted
nuanced, yet selective dependencies
between base and meta-layer
contributions. This suggests that distinct
CIN features can selectively interact, but
that broad, large-scale dynamics are
uncommon.
Viewing the interactions in the context of
the individual cancer models, we found
that the trend showing feature-specific
base-learners contributing the most to their
corresponding meta-learner largely
remained true. Yet, the interactions
demonstrated some significant differences
from the general model, as well as
discrepancies between the unique cancer
models (Fig.S5). For the LUSC model, the
loss of heterozygosity (loh_hrd) base-learner showed an elevated impact on all
HRD-based meta-learners (G_loh_hrd = 0.268
to 0.570), greater than that of ai1, which
tended to be more impactful across the
HRD features for all other models. The
LUSC and BRCA models demonstrated
relevant contributions from chromosome
arm feature base-learners to
pericentromeric feature meta-learners
which differed from the general model
pattern where the reverse was more
prominent. Additionally, pericentromeric
feature and HRD feature base-learner
contributions were noticeably greater and
more wide-ranging, affecting all types of
meta-learners for the HNSC model and to
an even greater extent for the LUAD
model. The cancer-specific feature
interaction analyses reveal a marked shift
from the general model’s feature
interactions and suggest that CIN feature
dynamics are uniquely modulated across
various cancer types.
4.
Discussion
The current study aimed to delineate the
roles of heterochromatin-associated genes
in cancer by constructing a comprehensive
interactome and employing an advanced
machine-learning model to assess their
impact on chromosomal instability (CIN).
Our findings emphasise the identification
of potentially relevant heterochromatin
genes that significantly influence
chromosomal instability, providing new
insights into the molecular mechanisms
underpinning cancer development and
progression.
The integrity of the analysis depended on
both a core gene set and an extended
interactome, which together constructed a
representative set of heterochromatin-associated genes potentially dysregulated
in cancer. The interactome constructed
using the STRING query included
numerous genes previously linked to
heterochromatin, structurally or
functionally, several of which had been
implicated in cancerous dysregulation.
Polycomb group gene EZH2 is pivotal for
transcriptional repression and was shown
to be upregulated in aggressive cancers,
such as neuroendocrine prostate cancer 18.
Similarly, topoisomerases, especially
TOP2A, have been demonstrated to be
essential for maintaining chromatin
integrity and potentially contribute to
cancer when defects arise (Lee et al., 2019;
Amoiridis et al., 2024). Additionally, the
centromere-specific CENP-A is vital for
maintaining the integrity of centromeric
DNA during replication. Its removal
causes replication stress and mitotic
defects (De Rop et al., 2012), contributing
to chromosomal instability (Giunta et al.,
2021). The recovery of these and other well-documented genes during the extension
phase supports the robustness of the
methodology and the
relevance of the resulting gene set for
understanding the
role of heterochromatin-associated genes
in chromosomal instability in cancer.
Our genomic feature importance analysis
identified several heterochromatin-associated genes as significant contributors
to CIN features. These genes include
CENP-A, NUP133, NUP155, and
SETDB1, each playing a distinct role in
maintaining genomic stability and
influencing cancer progression. The
analysis identified CENP-A as a
significant contributor to various CIN
features including numerous arm
aneuploidies. This association of CENP-A with these structural CIN features
is consistent with prior studies, which
linked defects in CENP-A to
missegregation events that contribute to the
increased presence of aneuploidies (De
Rop et al., 2012; Giunta et al., 2021; Black
et al., 2018). The analysis in the general
model also highlighted SETDB1, a histone
methyltransferase, as a significant
contributor to the pericentromeric CIN
feature on chromosome 1. Robbez-Masson
et al. (2017) showed that the loss of
SETDB1 function leads to the
upregulation of retrotransposons, resulting
in genomic instability. Retrotransposons, a
type of transposable element (TE), have
been extensively mapped to
pericentromeric regions of chromosome 1
(Nurk et al., 2022; Altemose et al., 2022),
suggesting that the identified importance
of SETDB1 in our analysis may be due to
its role in maintaining the stability of these
regions by repressing retrotransposon
activity, thus preventing the genomic
instability associated with copy number
variations in these critical areas. These
results reinforce the central roles of
specific heterochromatin-associated genes
in contributing to distinct CIN features.
Nucleoporins, the building blocks of
Nuclear Pore Complexes (NPCs),
redistribute to structures such as
kinetochores, spindle poles, and the
mitotic spindle during mitosis (Chatel &
Fahrenkrog, 2011). Prior studies have
shown that NPCs are important in DNA
damage repair in heterochromatin (Ryu et
al., 2015), correlated NUP155 with
microsatellite instability (Wang et al.,
2024), and linked mutations or altered
expression of nucleoporins to
missegregation events leading to
aneuploidies (Nakano et al., 2011). Our
results not only corroborated these
findings by underlining NUP133 and
NUP155 as significant CIN contributors
but also revealed their specific impact on
distinct features, associating NUP133 with
aneuploidies on the q arm of chromosome
1 and NUP155 with aneuploidies across
5p, 5q, 8p and 13q among others. This
further emphasises the fundamental role of
nucleoporins in safeguarding genomic
integrity and highlights the consequences
of their dysregulation in promoting
chromosomal missegregation and
aneuploidy in cancer cells, offering new
insights into potential mechanisms
underlying these phenomena. Notably
absent from the 45 distinct genes revealed
during the genomic feature importance
analysis were any of the genes constituting
the initial core gene set.
In cancer-specific scenarios, the breast
cancer model’s feature importance analysis
identified CDK1 as the top contributor to
the CIN feature associated with the
pericentromeric region of chromosome 1.
CDK1 plays a critical role in cell cycle
regulation and is frequently overexpressed
in various cancers, including breast cancer
(Ryu et al., 2015; Sofi et al., 2022).
Malumbres and Barbacid (2009) describe
how CDK1, along with other cyclin-dependent kinases (CDKs), controls cell
cycle progression and ensures proper
chromosome segregation during mitosis.
Misregulation and overexpression of
CDK1 can disrupt mitotic processes,
leading to chromosomal instability. The
identification of CDK1’s impact on the
pericentromeric CIN feature aligns with
these findings, supporting its role in
driving chromosomal instability in cancer.
In the head and neck cancer CIN model,
CDC42 was also identified as the top
contributor to the CIN feature involving
the p arm of chromosome 1. CDC42,
known for regulating cell cycle
progression, cytoskeletal organisation, and
cellular migration, has been implicated in
multiple cancers due to its role in
oncogenic processes like invasion and
metastasis (Hodder et al., 2023). CDC42-associated kinase (ACK) promotes cancer
progression by regulating protein stability
and degradation pathways. The
identification of CDC42 in our model is
consistent with its role in maintaining
chromosomal stability through actin
dynamics and cell division.
Overall, the genomic feature importance
analysis confirms existing knowledge and
proposes novel mechanisms by linking
various heterochromatin-associated genes
to chromosomal instability features. This
demonstrates the robust and exhaustive
aspect of our methodology, indicating its
potential to provide valuable insights into
the molecular underpinnings of cancer.
The cancer-specific feature importance
analyses further enhance our
understanding by highlighting unique gene
contributions across different cancer types.
These findings lay the groundwork for
further research to explore these potential
connections and their implications for
understanding chromosomal instability in
cancer.
The feature importance analysis in our
study relied on identifying the top
contributors, based on Gain, to each CIN
feature. While this approach highlights key
genes, it has potential drawbacks.
Specifically, any gene which was not the
top contributor to any one feature was
excluded from the final analysis, even if it
contributed significantly to multiple CIN
characteristics. This limitation could have
caused us to overlook important genes
which play a substantial role in
chromosomal instability. To address this,
future analyses could include more than
just the top gene per feature. Additionally,
redefining the impact of genes on the
features by using alternative feature
importance metrics provided by XGBoost,
such as coverage or frequency, or a
combination of these metrics, may offer
insights from a different perspective.
Coverage measures the number of samples
affected by a particular gene, while
frequency assesses how often a gene is
used in a learner’s decision-making
process. Integrating these metrics could
provide a more nuanced picture of each
gene’s contribution to CIN features.
Moreover, the inferences drawn from the
feature importance analyses must be
viewed in the context of the model's
overall performance. Gain measures how
the performance of a learner improves with
the inclusion of a specific gene. Therefore,
a low-performing base-learner can assign
significant importance to certain genes,
which may skew the perceived impact of
these genes. Consequently, any
conclusions made should consider not only
the feature importance metrics but also the
learners' overall performance to ensure a
balanced and accurate interpretation of the
results.
The performance evaluation of our
machine-learning model, encompassing
both classification and regression tasks,
accentuates its effectiveness in capturing
key aspects of chromosomal instability
(CIN). The classification tasks, particularly
for arm-level aneuploidies, demonstrated
robust accuracy and improved F1 scores
from the base to the meta layer, indicating
a balanced model more adept at handling
diverse class distributions. However, the
regression tasks exhibited variability in
performance, reflecting the inherent
complexity of predicting certain CIN
features. The observed improvements in
the meta layer for the regression tasks
suggest that the stacked model
configuration enhances predictive
refinement. Yet the variability and limited
improvement indicate the need for further
optimisation to achieve consistent
performance. This mixed performance
highlights the model’s strengths in
classification while pointing to areas for
improvement in regression, emphasising
the multifaceted nature of genomic
instability and the challenges in providing
wide-ranging predictions. Exploring
alternative machine-learning models for
the individual CIN base-learners, such as
Random Forest, which has demonstrated
notable performance when handling both
class imbalances and high-dimensional
biological data (Anaissi et al., 2013; Chafai
et al., 2023), or Support Vector Machines
which have proven effective with genomic
feature selection and classification through
ensemble methods (Anaissi et al., 2016;
Chafai et al., 2023), could potentially
enhance the prediction accuracy and
stability for specific CIN features.
The second importance analysis, which
explored interactions between CIN
features, revealed selective dependencies
that offer deeper insights into the structural
anomalies associated with chromosomal
instability. The identification of specific
arm aneuploidies significantly impacting
pericentromeric meta-learner predictions
stresses the interconnected nature of
certain CIN features. This nuanced
understanding of feature interactions
suggests that while broad-scale
interactions are rare, specific dependencies
play a vital role in the manifestation of
genomic instability. The variability in
cancer-specific analyses further indicates
that CIN feature dynamics could be
uniquely modulated across different cancer
types, highlighting the model’s ability to
adapt to distinct oncological contexts.
These findings enrich our broader analysis
by emphasising the importance of
considering both individual gene impacts
and feature interdependencies, ultimately
enhancing our understanding of the
complex mechanisms underlying
chromosomal instability in cancer.
5.
Conclusion
In conclusion, this study successfully
delineated the roles of heterochromatin-associated genes in cancer by employing a
broad heterochromatin interactome and an
ensemble machine-learning model to
assess their impact on chromosomal
instability (CIN). The methodology proved
robust, combining a core gene set and an
extended interactome to identify
significant contributors to CIN features.
The genomic feature importance analysis
not only corroborated existing knowledge
but also revealed novel mechanisms,
highlighting the contributions of key genes
across various cancer types. While the
model demonstrated strong performance in
classification tasks, particularly for arm-level aneuploidies, it also showed
variability in regression tasks, affirming
the complexity of predicting CIN features.
Additionally, the analysis of interactions
between CIN features emphasised the
intricate dependencies among them,
suggesting that specific structural
anomalies are interlinked. These findings
enhance our understanding of the
molecular mechanisms underlying
chromosomal instability in cancer,
providing a solid foundation for
subsequent research. Future studies could
employ knockdowns, knockouts, or
induced over-expression of highlighted
genes to further explore these connections
and their implications regarding CIN in
greater detail.
6.
Methods
Candidate Protein Selection and Interactome Construction
To identify a heterochromatin-specific
interactome, an initial list of 18
heterochromatic proteins, recognised for
their association with the nuclear lamina
and interactions with heterochromatin,
was manually curated from the literature
and annotations in
the UniProt database (Wang et al., 2021)
(Table 1).
To explore the interaction landscape of
these candidate proteins, a Python script
(STRING_Interactome_Generation.py)
was developed to interface with the
STRING Protein-Protein Interaction (PPI)
database version 12.0 (Szklarczyk et al.,
2023). This script was designed to
translate protein names into STRING
database identifiers, subsequently
retrieving lists of genes encoding proteins
that physically interact with the candidates.
Interaction queries were filtered to include
only physical network interactions with a
minimum confidence score of 0.4 (scale 0
to 1), ensuring a thorough yet relevant
interactome dataset consisting of 327
unique genes including the initial
candidates. The Python script executed its
query via a custom function. Each protein
name from the provided list was converted
into the corresponding STRING identifier
using the database’s API. This conversion
selected the most relevant
identifier for each protein by restricting
matches to human entries via the NCBI
taxonomy identifier
9606 (Schoch et al., 2020; Sayers et al.,
2019). Subsequently, for each obtained
STRING identifier, the physical
interaction partners were retrieved. To
manage the volume of data, the
results were limited to a maximum of
1,500 interactions per protein. The output
from the STRING database, a
unique list of gene names encompassing
both the original candidates and their
interaction partners, formed the
interactome for subsequent analyses.
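The query procedure described above can be sketched as follows. This is a minimal illustration and not the contents of STRING_Interactome_Generation.py: the function names and the versioned host URL are assumptions, while `get_string_ids` and `interaction_partners` are methods of the public STRING REST API. Note that the API expresses confidence on a 0 to 1000 scale, so the 0.4 cut-off is passed as 400.

```python
import json
import urllib.parse
import urllib.request

# Assumption: versioned STRING 12.0 host; the original script may use another base URL.
STRING_API = "https://version-12-0.string-db.org/api"
HUMAN_TAXON = 9606  # NCBI taxonomy identifier for Homo sapiens


def partner_query_params(string_id, min_score=0.4, limit=1500):
    """Build the query for the physical interaction partners of one identifier."""
    return {
        "identifiers": string_id,
        "species": HUMAN_TAXON,
        "network_type": "physical",
        # STRING uses a 0-1000 confidence scale, so 0.4 becomes 400.
        "required_score": int(round(min_score * 1000)),
        "limit": limit,
    }


def fetch_json(method, params):
    """Call a STRING API method and decode the JSON reply (needs network access)."""
    url = f"{STRING_API}/json/{method}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def build_interactome(protein_names):
    """Return the sorted, unique gene names of the candidates and their partners."""
    genes = set()
    for name in protein_names:
        hits = fetch_json(
            "get_string_ids",
            {"identifiers": name, "species": HUMAN_TAXON, "limit": 1},
        )
        if not hits:
            continue  # the name could not be mapped to a STRING identifier
        genes.add(hits[0]["preferredName"])
        partners = fetch_json(
            "interaction_partners", partner_query_params(hits[0]["stringId"])
        )
        genes.update(row["preferredName_B"] for row in partners)
    return sorted(genes)
```

Keeping the parameter construction in its own function makes the filtering choices (physical network, score cut-off, partner limit) easy to inspect without contacting the server.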
6.1
Acquisition and Processing of TCGA Data
RNA-seq Data
RNA sequencing data from The Cancer
Genome Atlas (TCGA) Pan-Cancer
(PANCAN) initiative was utilised for this
study, accessed through the TOIL project
via the Xena Browser platform (Goldman
et al., 2021). This data includes gene
expression patterns from the TCGA
samples.
Two primary datasets were utilised: the
tcga_RSEM_gene_tpm and the
tcga_RSEM_gene_expected_counts. Each
dataset includes data from 10,535 samples
and covers approximately 60,499 unique
gene identifiers. The
tcga_RSEM_gene_tpm dataset comprises
transcripts per million (TPM) values,
normalised and log-transformed as
log2(TPM + 0.001),
which facilitates consistent analysis across
genes with low and high expression. Gene
identifiers in this dataset are aligned with
the Gencode v23 annotation, ensuring
precise gene mapping and relevance to
recognised genomic features (Frankish et
al., 2023). The
tcga_RSEM_gene_expected_counts
dataset provides raw count measures of
gene expression. This format was selected
to support specific quantitative analyses
such as library size normalisation (Vivian
et al., 2017). The raw
tcga_RSEM_gene_expected_counts
dataset was processed into a structured
CSV file using the
TCGA_Raw_Data_Processing.py script.
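Because the TPM values are stored as log2(TPM + 0.001), any step that needs linear-scale values has to invert the transform. The sketch below shows the forward and reverse operations. The pseudocount of 1 for the expected-count dataset is an assumption, since only the TPM pseudocount is stated above, and the function names are hypothetical.

```python
import math

TPM_PSEUDOCOUNT = 0.001   # stated for tcga_RSEM_gene_tpm
COUNT_PSEUDOCOUNT = 1.0   # assumption for tcga_RSEM_gene_expected_counts


def log2_transform(value, pseudocount):
    """Forward transform: log2(value + pseudocount)."""
    return math.log2(value + pseudocount)


def reverse_log2(log_value, pseudocount):
    """Inverse transform: 2 ** log_value - pseudocount.

    Tiny negative results caused by floating-point rounding are clamped to 0.
    """
    return max(2.0 ** log_value - pseudocount, 0.0)


# Round trip for a gene with 5 TPM
stored = log2_transform(5.0, TPM_PSEUDOCOUNT)
recovered = reverse_log2(stored, TPM_PSEUDOCOUNT)
```

The small TPM pseudocount keeps zero-expression genes finite on the log scale (log2(0.001) is roughly -9.97) without noticeably shifting expressed genes.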
Copy Number Variation Data
To examine genomic alterations across
various cancers, copy number variation
(CNV) data for TCGA samples was
collected, using two distinct types of CNV
data. The first type, gene-level CNVs, was
sourced from the Xena Browser, utilising
the SNP6 array method across all TCGA
cohorts. This dataset represents gene-level
copy number alterations expressed as log
ratios of tumour to normal tissue. The
processing of this gene-level data employs
the GISTIC2 method as part of the TCGA
pipeline, which maps segmented CNV data
to gene-level estimates (Mermel et al.,
2011).
The second type of data, segmented CNVs,
was acquired from the Genomic Data
Commons (GDC) (Grossman et al., 2016).
This dataset includes data for all TCGA
cohorts, providing a detailed
representation of chromosomal alterations,
segmented by chromosome and base pair
regions.
Chromosome Instability Data Types
To assess the genomic stability across all
cancer samples, data on chromosome
instability was analysed. This included
arm-level aneuploidies and aneuploidy
scores, along with three measures of
homologous recombination deficiency
(HRD).
Arm-level aneuploidies and aneuploidy
scores were sourced from the ABSOLUTE
purity/ploidy file, named
TCGA_mastercalls.abs_tables_JSedit.fixed.txt.
This dataset quantifies whole-arm or
segmental gains and losses of
chromosomes by sample, providing a
detailed measure of aneuploidy across the
TCGA PANCAN cohort.
HRD scores, measuring homologous
recombination deficiency, encapsulate
multiple aspects of chromosomal
instability, including LOH (loss of
heterozygosity, “loh_hrd”), LST (large-scale state transitions, “lst1”), and NtAI
(number of telomeric allelic imbalances,
“ai1”). LST specifically measures
chromosomal breakages producing
fragments larger than 10 Mb, while NtAI
assesses allelic imbalances extending to
telomeres. Each score provides a single
value per sample.
TCGA Sample Metadata
Additionally, TCGA metadata was
downloaded in TSV format from the GDC
(Grossman et al., 2016). The metadata
selected included the Tissue Source Site
(TSS) codes, which link samples to
specific cancer studies, TCGA Study
Abbreviations that identify cancer types,
and Sample Type Codes that differentiate
between cancerous and normal tissues.
6.2
Exploratory Gene-set Analysis
Set-of-Interest and Control Sets Generation
This analysis, implemented in R, utilised
Log2 TPM RNA-seq data from the TCGA
PANCAN project, which was restructured
to create specific datasets for evaluation.
The primary datasets focused on the 327
interactome genes identified via the
STRING query, collectively termed the Set-of-Interest (SOI). For the cancerous SOI
dataset, sample IDs were filtered to
include only those from cancerous tissues.
Conversely, the non-cancerous SOI dataset
excluded any samples derived from
cancerous or metastatic tissue types.
To provide a comparative framework,
multiple cancerous control datasets were
generated. Each control set contained a
random selection of genes, matching the
SOI gene count but excluding any SOI
genes. To ensure the uniqueness of each
control dataset, impose an upper limit on
the number of control sets and prevent
redundancy, no control set was composed
solely of genes previously used in other
sets. This selection process continued
iteratively until all non-SOI genes were
included in at least one control set,
ensuring full genomic coverage.
Corresponding non-cancerous datasets
were also generated for each cancerous
control, maintaining an identical genomic
background for accurate comparisons.
This methodology generated 547 unique
control datasets for the cancerous and non-cancerous conditions, which were saved in
CSV format for subsequent analyses.
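The iterative control-set construction can be sketched as below. This is one possible reading of the procedure, not the code in SOI_and_Control_Set_Configuration.py: it assumes that each new control set is seeded with one gene that has not yet been used, which guarantees that no set consists solely of previously used genes and that the loop terminates once every non-SOI gene is covered. The function name and the seeding strategy are hypothetical.

```python
import random


def generate_control_sets(all_genes, soi_genes, seed=0):
    """Draw SOI-sized random gene sets until every non-SOI gene is covered."""
    rng = random.Random(seed)
    soi = set(soi_genes)
    pool = sorted(set(all_genes) - soi)  # control sets never contain SOI genes
    set_size = len(soi)
    if not 1 <= set_size <= len(pool):
        raise ValueError("need a non-empty SOI and at least as many non-SOI genes")

    uncovered = set(pool)
    control_sets = []
    while uncovered:
        # Assumption: seed each set with one unused gene so that it is never
        # composed solely of genes already present in earlier sets.
        fresh = rng.choice(sorted(uncovered))
        others = rng.sample([g for g in pool if g != fresh], set_size - 1)
        control = [fresh] + others
        control_sets.append(control)
        uncovered.difference_update(control)
    return control_sets


# Toy example: 40 genes, of which the first 5 form the set of interest
genes = [f"G{i}" for i in range(40)]
soi = genes[:5]
controls = generate_control_sets(genes, soi)
```

With fully random draws the number of sets depends on how quickly the last unused genes happen to be sampled, so the count obtained from this sketch will not necessarily match the 547 sets reported above.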
Comparative Expression Analysis within Cancerous and Non-Cancerous Tissues
To enable a complete comparative
aggregate expression analysis of the Set-of-Interest (SOI) against the control sets,
preprocessing involved calculating the
mean log2 expression across all genes
within the SOI for each
sample. These calculations were
performed for both cancerous and non-cancerous datasets, using a function
designed to output a summary data frame.
This data frame contained the global mean
expression value for each sample. A
similar approach was adopted for each
control set. The process generated a global
expression value for each sample by
calculating the mean log2 expression across
the genes in each control set. The global
expression values from the controls were
then directly compared with their
corresponding SOI global expression
values. The difference in global expression
values between the SOI and each control
set was computed per sample, capturing
distinct expression patterns between the
SOI and the control sets. These differences
were recorded in differential data frames,
one for cancerous and another for non-cancerous conditions. Additionally, the
average expression values from the control
sets were also stored in condition-specific
data frames.
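The per-sample aggregation and the SOI-minus-control difference can be sketched in a few lines. The original analysis was implemented in R; the Python below is only an illustration using plain dictionaries, and the function names and the data layout are assumptions.

```python
from statistics import mean


def global_expression(expr, genes):
    """Mean log2 expression per sample over the requested genes.

    expr: {sample_id: {gene: log2_expression}}.
    Genes missing from a sample are skipped.
    """
    return {
        sample_id: mean(values[g] for g in genes if g in values)
        for sample_id, values in expr.items()
    }


def soi_minus_control(expr, soi_genes, control_genes):
    """Per-sample difference between SOI and control global expression."""
    soi = global_expression(expr, soi_genes)
    control = global_expression(expr, control_genes)
    return {sample_id: soi[sample_id] - control[sample_id] for sample_id in soi}


# Toy data: two samples, SOI = {A, B}, one control set = {C}
toy = {
    "S1": {"A": 2.0, "B": 4.0, "C": 1.0},
    "S2": {"A": 0.0, "B": 2.0, "C": 5.0},
}
differences = soi_minus_control(toy, ["A", "B"], ["C"])
```

Here sample S1 yields a positive difference (SOI higher than control) and S2 a negative one; repeating the call for each control set gives the differential data frames described above.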
The distribution of expression differences
for each SOI-control set pair was visually
assessed using violin plots generated using
the ggplot2 package in R (Wickham,
2016). These plots provided a detailed
view of the shape of the distributions,
highlighting any skew or deviations from
normality. Based on this, appropriate non-parametric statistical tests were performed.
The Friedman test (non-parametric
equivalent of the repeated measures
ANOVA) was applied using the coin
package in R (Hothorn et al., 2006) to
determine if there were significant
differences across multiple control sets
when paired with the SOI. The test
indicated whether at least one
control set differed significantly
from the other control sets within the same
condition in its expression difference
relative to the SOI.
For each SOI-control pair, the Wilcoxon
Signed-Rank Test was conducted through
the stats package in R (R Core Team,
2013). This compared the mean difference
of the aggregate expression levels for the
specific pair to the global mean difference
of all other SOI-to-control combinations.
This test helped identify specific pairings
where the expression differences were
significantly above or below the overall
mean, highlighting unique expression
patterns between the SOI and each control
set.
A two-sided paired permutation test,
implemented with the coin package, was
performed for each SOI-to-control pair to
further assess the significance of the
difference between the aggregate expression
level of the SOI and that of each control set.
By randomly shuffling the labels of the
datasets and recalculating the test statistic
for each permutation, a distribution of the
statistic under the null hypothesis was
constructed. The significance levels from
these tests were adjusted for multiple
comparisons using the Benjamini-Hochberg procedure.
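The logic of the paired permutation test and the subsequent correction can be sketched as follows. The analysis itself used the coin package in R; the Python below is a Monte Carlo approximation written for illustration only. For paired data, shuffling the labels within a pair is equivalent to flipping the sign of that pair's difference, which is what the sketch does. The function names and the default number of permutations are assumptions.

```python
import random


def paired_permutation_pvalue(soi_values, control_values, n_permutations=2000, seed=0):
    """Two-sided paired permutation test on the sum of paired differences."""
    rng = random.Random(seed)
    diffs = [s - c for s, c in zip(soi_values, control_values)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Swapping the labels within a pair flips the sign of its difference.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example: the SOI is consistently higher than the control in eight samples
soi = [2.0, 2.2, 1.9, 2.1, 2.3, 1.8, 2.0, 2.2]
control = [1.0] * 8
p_value = paired_permutation_pvalue(soi, control)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In practice one p-value would be computed per SOI-control pair and the full list passed to the correction step.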
To understand if the expression patterns
observed in cancerous and non-cancerous
datasets were consistent across different
types of tissues, a narrow version of the
analysis was performed, assessing the
aggregate expression difference between
the SOI and a single control set, parsed by
cancer type. A single control set was
selected and the paired permutation test
with multiple comparison correction was
reapplied with the SOI across different
tissue types.
Comparative Expression Analysis between Conditions and across Cancer Types
Two data frames containing the mean
expression value across all genes for each
participant were created: one based on the
cancerous SOI dataset and another using
the corresponding non-cancerous set. The
two data frames were then merged on the
participant ID (sample ID), with cancer-type information integrated from the
metadata to facilitate condition-specific
analyses. The difference between each
paired metric was calculated, yielding the
log2 fold change for each sample.
A two-sided Wilcoxon Signed Rank test
was then applied to compare the log fold
changes between normal and cancerous
conditions for each sample. The test was
repeated for all control sets, the P values
were adjusted for multiple comparison and
the results were saved to a new data frame.
The results were visualised in a volcano
plot (Fig.2a), using the EnhancedVolcano
package (Blighe et al., 2013), displaying
the log fold changes against the negative
logarithm (base 10) of the adjusted P-values, to emphasise sets with statistically
significant and biologically relevant
changes in their overall expression. This
analysis was further refined by repeating
the Wilcoxon Signed Rank test and
volcano plotting for subsets of data
categorised by cancer type (Fig.2a).
A differential gene expression analysis
(DGEA) was also performed between the
cancerous and non-cancerous conditions
using the complete genomic data available.
The expected counts RNA-seq data was
pre-processed by restricting the set to
participants with paired (cancer-healthy)
samples, exponentiating the counts
(reversing the log transformation),
removing samples with no counts and
genes with low counts (<10), and
normalising using the TMM method
(Robinson & Oshlack, 2010) to adjust for
library size differences across samples
using the edgeR package (Chen et al.,
2024; Chen et al., 2016; McCarthy et al.,
2012; Robinson et al., 2010). The DGEA
was visualised in the form of an
EnhancedVolcano plot highlighting the
SOI genes (Fig.2c). This process was
repeated across the different cancer types
(Fig.S1).
The configuration of the Set of Interest
(SOI) and corresponding control sets, the
generation of normal gene expression
datasets, the comparative exploratory
analysis and the DGEA were performed
using custom Python and R Markdown
scripts
(SOI_and_Control_Set_Configuration.py,
TCGA_Normal_Gene_Set_Generation.py,
TCGA_Comparative_exploratory_annalysis.Rmd,
and TCGA_dge_analysis.Rmd,
respectively).
6.3
Chromosome Instability Model
A stacked, multi-output, Extreme Gradient
Boosting (XGBoost; Chen & Guestrin,
2016) machine-learning model was
designed to utilise RNA-seq data from the
set of interest (SOI) to predict
chromosome instability feature values
derived from the TCGA PANCAN dataset
and ultimately to quantify the genomic
contributions (Fig.3). All development was
done in R.
RNA-seq Data Preparation
The RNA sequencing data utilised was
based on the tcga_RSEM_gene_tpm and
tcga_RSEM_gene_expected_counts data
from the TCGA.
The RNA-seq data, comprising both log2
TPM (transcripts per million) and log2
expected counts, was loaded from the
saved files. The metadata files including
tissue source site identifiers, disease study
categorisations, and gene annotations were
integrated to filter and annotate the
datasets appropriately.
The datasets underwent several processing
measures. Using the TCGA metadata,
samples not classified as cancerous were
excluded from the datasets. Both TPM and
expected count datasets were
exponentiated. Sample identifiers were
matched with corresponding metadata to
ensure accurate sample annotation. Library size
differences were normalised using the
same TMM method as for the DGEA.
Where multiple entries for a single gene
existed, median expression values were
computed and used. Post-normalisation,
the dataset was further refined by retaining
only those genes that are part of the SOI.
Final Dataset Compilation: The processed
data was structured into four formats:
• Raw counts
• Normalised counts (scaled)
• Log-transformed raw counts (log)
• Log-transformed normalised counts (log-scaled)
Each format was prepared separately for
both the TPM and expected counts
datasets, yielding four different SOI
RNA-seq data types per condition, which
were saved to separate CSV files.
The generated datasets underwent a
process to segregate individual samples
into test and training sets. All training sets
contained the same samples to ensure
consistency. The training sets were used
for the hyperparameter tuning of the
model, whereas the complete data
(training + test) was used for the model
training and testing, which was performed
using a five-fold cross-validation process.
All sample IDs from the eight datasets
were extracted. Common sample IDs
across all data types were identified
through the intersection of sample ID lists.
A random 75% of common sample IDs
were selected to form the training set,
ensuring that the training sets included
only samples present in all datasets. For
each RNA-seq data type, the full dataset
was then split into training and testing sets
based on this randomly selected list of
sample IDs. The data was arranged to
ensure that all datasets maintained the
same order of samples, enhancing
consistency across the model input data.
The training and testing sets for each data
type were saved in CSV format.
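The sample-splitting logic can be sketched as follows. The model was developed in R; this Python version is only an illustration, and the function names, the fixed seed and the dictionary-based data layout are assumptions. The key point is that the training IDs are drawn once from the intersection of all datasets and then reused for every data type.

```python
import random


def common_training_ids(sample_id_lists, train_fraction=0.75, seed=0):
    """Pick a random share of the sample IDs that occur in every dataset."""
    common = set(sample_id_lists[0]).intersection(*sample_id_lists[1:])
    ordered = sorted(common)  # sort first so the draw is reproducible
    n_train = round(train_fraction * len(ordered))
    return set(random.Random(seed).sample(ordered, n_train))


def split_dataset(rows, train_ids):
    """Split one dataset into (train, test), both ordered by sample ID.

    rows: {sample_id: expression_values}.
    """
    train = {sid: rows[sid] for sid in sorted(rows) if sid in train_ids}
    test = {sid: rows[sid] for sid in sorted(rows) if sid not in train_ids}
    return train, test


# Toy example with two data types that share four samples
id_lists = [["a", "b", "c", "d", "e"], ["b", "c", "d", "e", "f"]]
train_ids = common_training_ids(id_lists)
dataset = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
train, test = split_dataset(dataset, train_ids)
```

Samples that are absent from at least one data type can still appear in a dataset's test portion, but never in a training set.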
Pericentromeric CNV Analysis & Feature Engineering
To assess cancer-associated genomic
alterations, an R-based methodology
(Pericentromeric_cnv_segmentation.Rmd)
for identifying and quantifying CNVs
within pericentromeric regions of each
TCGA sample was developed.
The TCGA PanCan segmented CNV
data (Chrom_seg_CNV.txt) was loaded
using dplyr. Pericentromeric regions were
defined to include classical human
satellites such as HSat, gamma (γ), beta
(β), and specific alpha (α) satellite
sequences, along with centromeric
transition (ct) regions, based on recent
genomic studies involving data generated
by the Telomere-to-Telomere (T2T)
consortium (Barra & Fachinetti, 2018; Hoyt
et al., 2022; Altemose et al., 2022). The
T2T annotation data was accessed via the
UCSC Genome Browser in the form of a
BigBed file (Nassar et al., 2023) and
loaded with the rtracklayer package
(Lawrence et al., 2009). Genomic segments
in the sourced data were filtered on these
annotations to ensure broad coverage of
pericentromeric regions, after first
excluding non-pericentromeric genomic
features such as rDNA segments. The
resulting segments within each chromosome
were grouped and sorted by start position.
Adjacent segments (where the start
position of one coincided with the end
position of the other) and overlapping
segments were merged to form contiguous
pericentromeric regions. For each
chromosome, the longest contiguous
pericentromeric region starting and ending
with an HSat was identified. Additionally,
pericentromeric regions starting and
ending with an HSat and exceeding a
threshold of one megabase pair (1 Mbp)
were also selected to ensure coverage of
significant pericentromeric areas. These
regions were then labelled with unique
identifiers for each chromosome (“peri_”
followed by the chromosome number and
an additional number if more than one
pericentromeric region was found for the
chromosome in question).
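The merging and selection steps can be illustrated with a short Python sketch (the actual implementation is the R markdown script named above). The function names and coordinates are hypothetical, and the requirement that a region starts and ends with an HSat is deliberately left out because it depends on the annotation labels.

```python
def merge_segments(segments):
    """Merge touching or overlapping (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Touches or overlaps the previous region: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def select_regions(regions, min_length=1_000_000):
    """Keep the longest region plus any region longer than min_length."""
    if not regions:
        return []
    longest = max(regions, key=lambda r: r[1] - r[0])
    return [r for r in regions if r == longest or r[1] - r[0] > min_length]


# Hypothetical satellite segments on a single chromosome.
segments = [(100, 500), (500, 900), (850, 1_200_000), (3_000_000, 3_000_400)]
regions = merge_segments(segments)
selected = select_regions(regions)
```

Sorting by start position first means a single pass is enough: each segment either extends the most recent region or opens a new one.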
The TCGA segmented CNV data was then
filtered to include only segments which
partially or fully aligned with any of the
newly defined pericentromeric regions.
For each pericentromeric region, the
overlap with the CNV segments from the
TCGA PanCan dataset was calculated to
determine the overlap-weighted mean
CNV. This metric considered the
proportion of overlap between each TCGA
segment and the corresponding T2T-defined
pericentromeric region, with a
greater extent of overlap resulting in a
higher weight for that segment's CNV value
in the pericentromeric region's value for
the given sample.
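A minimal Python sketch of an overlap-weighted mean is given below. The exact weighting formula is not stated in the text, so this assumes each segment is weighted by the number of bases it shares with the region and that the weights are normalised by the total overlapping length; the function name and the toy values are hypothetical.

```python
def overlap_weighted_mean_cnv(region, cnv_segments):
    """Average segment CNV values, weighted by their overlap with the region.

    region: (start, end); cnv_segments: iterable of (start, end, value).
    Returns None when no segment overlaps the region.
    """
    region_start, region_end = region
    weighted_sum = 0.0
    total_overlap = 0
    for seg_start, seg_end, value in cnv_segments:
        overlap = min(region_end, seg_end) - max(region_start, seg_start)
        if overlap > 0:
            weighted_sum += overlap * value
            total_overlap += overlap
    if total_overlap == 0:
        return None
    return weighted_sum / total_overlap


# One sample: two segments overlap the region, the third lies outside it.
region = (100, 200)
sample_segments = [(0, 120, 0.0), (120, 200, 1.0), (500, 600, 9.0)]
region_cnv = overlap_weighted_mean_cnv(region, sample_segments)
```

Here the first segment covers 20 bases of the region and the second covers 80, so the second segment dominates the regional value.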
The resulting CNV values were aggregated
across all TCGA samples, producing a
dataset of the copy number aberrations
across fourteen distinct pericentromeric
regions by sample.
Additional Chromosomal Instability (CIN)
Features
The model incorporates several other
chromosomal instability features for each
sample from the TCGA datasets, namely:
TCGA arm-level aneuploidy data for
thirty-nine chromosomal arms (each
categorised as -1, 0 or 1), and three
continuous numerical Homologous
Recombination Deficiency (HRD)
signatures.
Machine-Learning Model Configuration
The model employed a stacked
architecture with two distinct layers,
designed to predict the fifty-six
chromosomal instability (CIN) features. In
the base layer, each feature was addressed
independently with a unique XGBoost
learner (model) optimised for the feature’s
characteristics, using one of the generated
SOI RNA-seq datasets (training and test)
as input in a five-fold cross-validation
strategy. The cross-validation strategy
allowed for the use of the full data to make
predictions while limiting any data
leakage.
XGBoost learners were selected for their
ability to handle categorical and numerical
data and because they make no assumptions
about the form of the relationship between
the variables. Out-of-fold (OOF)
predictions, generated during the testing
phases of the cross-validation, were
aggregated by sample into a dataset.
This dataset, aligned with the true labels,
underwent a second five-fold split
to train the feature-specific meta-learners
in the meta layer and refine
the predictions. The final output comprised
refined OOF predictions across all CIN
features.
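The two-layer out-of-fold scheme can be sketched in Python as follows. This is an illustration of the data flow only: a trivial mean predictor stands in for the XGBoost learners, and the function names, seeds and the feature names "1p" and "HRD_1" are assumptions made for the example.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds held-out folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def oof_predictions(X, y, fit, predict, n_folds=5, seed=0):
    """Predict every sample with a model that never saw it during fitting."""
    oof = [None] * len(y)
    for held_out in kfold_indices(len(y), n_folds, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            oof[i] = predict(model, X[i])
    return oof


# Placeholder learner (stand-in for XGBoost): predicts the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


def stacked_oof(X, targets, n_folds=5):
    """targets maps each CIN feature name to its list of true values."""
    base = {f: oof_predictions(X, y, fit_mean, predict_mean, n_folds, seed=0)
            for f, y in targets.items()}
    # Meta-layer input: one row per sample holding all base OOF predictions.
    names = sorted(base)
    meta_X = [[base[f][i] for f in names] for i in range(len(X))]
    return {f: oof_predictions(meta_X, y, fit_mean, predict_mean, n_folds, seed=1)
            for f, y in targets.items()}


X = [[float(i)] for i in range(10)]
targets = {"1p": [0.0] * 10, "HRD_1": [float(i) for i in range(10)]}
refined = stacked_oof(X, targets)
```

Each meta-learner sees the base-layer OOF predictions for all CIN features, which is what allows interdependencies between features to inform the refined predictions.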
Model Hyperparameters
The learners in both the base and meta
layers used several parameters tailored
to their target features.
Most of these are hyperparameters
routinely applied to XGBoost models to
prevent under- or overfitting, including
maximum tree depth (max_depth),
minimum child weight
(min_child_weight), number of trees
(nrounds), learning rate (eta) and the
regularisation parameter gamma (Chen &
Guestrin, 2016).
Additionally, the data type of the SOI
RNA-seq dataset used as input in the base
layer was selected based on
the target CIN feature. Each base learner
therefore used one of the eight possible
datasets as input.
All parameters were tuned in a ten-fold
cross-validation process, separately for
each layer, using the SOI RNA-seq
training sets. This process produced
hyperparameter tables listing the candidate
hyperparameter values and their resulting
evaluation metric (log loss or RMSE). The
value for each combination of parameter,
CIN feature and layer was then selected
manually by comparing these evaluation
metrics.
The tables used to select the
hyperparameters were created using
several R scripts, depending on the learner
type (regression or classification) and the
layer (base or meta): base_class_tune.r,
base_regress_tune.r and meta_learner_tune.r.
Class Imbalance
To manage class imbalances in the
arm-level aneuploidy (categorical) features,
class weights inversely proportional to
the class frequencies were assigned to each
learner. These weights were normalised
to lie between 0 and 1, with a
minimum value of 0.1 to prevent
assigning a weight of 0 to the majority
class. This was implemented in an R script
(arm_lev_aneu_weight.r), which returned a
table of the class weights to use.
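One way to obtain such weights is sketched below in Python. The normalisation method is not specified in the text; min-max scaling of the inverse frequencies is assumed here because it is the variant that would otherwise give the majority class a weight of 0. The function name and the label counts are hypothetical.

```python
from collections import Counter


def class_weights(labels, floor=0.1):
    """Inverse-frequency class weights, min-max scaled to [floor, 1]."""
    counts = Counter(labels)
    inverse = {c: len(labels) / n for c, n in counts.items()}
    lo, hi = min(inverse.values()), max(inverse.values())
    if hi == lo:
        # All classes equally frequent: no re-weighting needed.
        return {c: 1.0 for c in inverse}
    # Min-max scaling maps the majority class to 0, so apply the floor.
    return {c: max(floor, (v - lo) / (hi - lo)) for c, v in inverse.items()}


# Hypothetical arm-level labels: 20 losses, 70 neutral, 10 gains.
labels = [-1] * 20 + [0] * 70 + [1] * 10
weights = class_weights(labels)
```

With these counts the rarest class (gain) receives the full weight, the majority class (neutral) is held at the floor, and the loss class falls in between.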
Prediction Evaluation
The predictions made across all learners
were saved as CSV files and evaluated
separately by learner type.
Classification learners were assessed using
mean log loss during cross-validation and
on their accuracy, balanced
accuracy, precision, recall, specificity and
F1 score for the final test predictions.
Regression learners were evaluated using
the root mean squared error (RMSE)
during cross-validation and using the R²
metric for the final predictions.
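For reference, a few of these metrics are written out below in plain Python using their standard textbook definitions. This is not the evaluation code used in the study, and the function names are chosen for the example.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so rare classes count equally."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

Balanced accuracy is the more informative of the two classification scores when one aneuploidy class dominates, since plain accuracy can be high for a learner that always predicts the majority class.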
Feature Importance
The primary feature importance analysis
involved the quantification of the influence
of the input genes on each of the base layer
predictions from the CIN learners using
XGBoost’s framework. For the heatmap
evaluation, only genes identified as top
contributors to a CIN learner were
included. A subsequent interaction feature
importance analysis was implemented after
the meta-layer predictions to evaluate
interdependencies between CIN features.
The quantification metric was defined in
terms of Gain, which indicates the
contribution of an input feature to a learner
by measuring the relative improvement in
predictive performance calculated based
on the difference in the loss function.
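For reference, the gain of a single tree split as defined by Chen & Guestrin (2016) is given below, where G and H are the sums of the first- and second-order gradients of the loss over the samples in the left (L) and right (R) child nodes, λ is the L2 regularisation term and γ is the minimum loss reduction. The second expression, which summarises the Gain of a gene g as its share of the total gain over all splits S in a learner, follows the convention of XGBoost's importance output; whether any further scaling was applied in this analysis is not stated, so it should be read as an assumption.

```latex
\mathrm{Gain}_{\text{split}}
  = \frac{1}{2}\left[
      \frac{G_L^{2}}{H_L + \lambda}
    + \frac{G_R^{2}}{H_R + \lambda}
    - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
    \right] - \gamma

\mathrm{Gain}(g)
  = \frac{\sum_{s \in S_g} \mathrm{Gain}_{s}}{\sum_{s \in S} \mathrm{Gain}_{s}},
  \qquad S_g = \{\, s \in S : s \text{ splits on gene } g \,\}
```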
The CIN model training, testing and the
genomic and interaction feature
importance analyses were implemented in
an R markdown script
(CIN_model_analysis.Rmd).
Cancer-Specific Analysis
The compiled TCGA PanCan data was
split by cancer type, and the model was
run separately for each cancer-specific
subset containing 300 samples or more.
CIN features with a constant value
throughout a particular subset were not
used in the modelling and were therefore
dropped from that cancer-specific analysis.
Cancer-specific scripts were developed for
hyperparameter tuning
(cancer_specific_base_class_tune.r,
cancer_specific_base_regress_tune.r,
cancer_specific_meta_learner_tune.r),
class weight configuration
(cancer_specific_arm_lev_aneu_weight.r),
and predictions and feature importance
analyses
(cancer_specific_CIN_model_analysis.Rmd).
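The subset filtering described in this section can be sketched in Python as follows. The record layout, the "cancer_type" key and the function names are assumptions made for the example, and a small threshold is used so that the toy data exercises both filters.

```python
from collections import defaultdict


def subsets_by_cancer_type(samples, min_samples=300):
    """Group samples by cancer type and keep only sufficiently large groups."""
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample["cancer_type"]].append(sample)
    return {t: rows for t, rows in by_type.items() if len(rows) >= min_samples}


def variable_features(rows, feature_names):
    """Return the features that take more than one value within the subset."""
    return [f for f in feature_names if len({row[f] for row in rows}) > 1]


# Toy records: "2q" is constant within BRCA; LUAD is too small to keep.
samples = (
    [{"cancer_type": "BRCA", "1p": i % 3 - 1, "2q": 0} for i in range(4)]
    + [{"cancer_type": "LUAD", "1p": 0, "2q": 1} for _ in range(2)]
)
subsets = subsets_by_cancer_type(samples, min_samples=3)
kept = {t: variable_features(rows, ["1p", "2q"]) for t, rows in subsets.items()}
```

Dropping constant features per subset avoids training a learner on a target with a single class or value, which would carry no information for that cancer type.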
7.
Abbreviations
CIN - Chromosomal Instability
HP1 - Heterochromatin Protein 1
LADs - Lamina-Associated Domains
PPI - Protein-Protein Interaction
SOI - Set of Interest
TCGA - The Cancer Genome Atlas
TPM - Transcripts Per Million
HRD - Homologous Recombination Deficiency
CNV - Copy Number Variation
XGBoost - Extreme Gradient Boosting
DGEA - Differential Gene Expression Analysis
NPC - Nuclear Pore Complex
TSS - Tissue Source Site
GDC - Genomic Data Commons
LUSC - Lung Squamous Cell Carcinoma
BRCA - Breast Carcinoma
LUAD - Lung Adenocarcinoma
HNSC - Head and Neck Squamous Cell Carcinoma
AI - Allelic Imbalance
LST - Large-Scale State Transitions
8.
Data Availability
The TCGA PanCan Gene-level RNAseq data, HRD
score data, and the Gene-level Copy Number data
can be found at
https://xenabrowser.net/datapages/?cohort=TCGA%20PanCancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443.
The Segmented Copy Number data and Arm-level
aneuploidy data can be accessed at
https://gdc.cancer.gov/about-data/publications/pancanatlas.
The sample metadata for the TCGA PanCan study can be
downloaded from
https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables.
The peri/centromeric satellite sequence
annotations from the T2T project can be
downloaded from
https://genome.ucsc.edu/cgi-bin/hgTables?db=hub_3671779_hs1&hgta_group=map&hgta_track=hub_3671779_censat&hgta_table=hub_3671779_censat&hgta_doSchema=describe+table+schema.
The scripts are available at
https://github.com/DylJHS/Chromosome_Instability.
For supplementary data, contact
dylanhsimmons@hotmail.fr
9.
Acknowledgements
Sincere thanks are extended to Aniek Janssen,
Aditya Dixit, the Janssen Group, and Stefan
Prekovic for their invaluable support and
contributions.
10.
References
1. Passarge E. Emil Heitz and the concept of
heterochromatin: longitudinal
chromosome differentiation was
recognized fifty years ago. Am J Hum
Genet. 1979 Mar;31(2):106-15. PMID:
377956; PMCID: PMC1685768.
2. Lomberk, G., Wallrath, L., & Urrutia, R.
(2006). The heterochromatin protein 1
family. Genome biology, 7, 1-8.
3. Hennig, W. (1999). Heterochromatin.
Chromosoma, 108(1), 1-9.
4. Janssen, A., Colmenares, S. U., & Karpen,
G. H. (2018). Heterochromatin: guardian
of the genome. Annual review of cell and
developmental biology, 34(1), 265-288.
5. Olins, A. L., Rhodes, G., Welch, D. B. M.,
Zwerger, M., & Olins, D. E. (2010).
Lamin B receptor: Multi-tasking at the
nuclear envelope. Nucleus, 1(1), 53–70.
https://doi.org/10.4161/nucl.1.1.10515
6. Margalit, A., Brachner, A., Gotzmann, J.,
Foisner, R., & Gruenbaum, Y. (2007).
Barrier-to-autointegration factor--a
BAFfling little protein. Trends in cell
biology, 17(4), 202–208.
https://doi.org/10.1016/j.tcb.2007.02.004
7. Maison, C., Pyrpasopoulou, A.,
Theodoropoulos, P. A., & Georgatos, S. D.
(1997). The inner nuclear membrane
protein LAP1 forms a native complex with
B-type lamins and partitions with
spindle-associated mitotic vesicles. The EMBO
journal, 16(16), 4839–4850.
https://doi.org/10.1093/emboj/16.16.4839
8. Poleshko, A., Mansfield, K. M.,
Burlingame, C. C., Andrake, M. D., Shah,
N. R., & Katz, R. A. (2013). The human
protein PRR14 tethers heterochromatin to
the nuclear lamina during interphase and
mitotic exit. Cell reports, 5(2), 292–301.
https://doi.org/10.1016/j.celrep.2013.09.024
9. Zeng W, Ball AR Jr, Yokomori K. HP1:
heterochromatin binding proteins working
the genome. Epigenetics. 2010 May
16;5(4):287-92. doi:
10.4161/epi.5.4.11683. Epub 2010 May 3.
PMID: 20421743; PMCID: PMC3103764.
10. Shibuya, H., Ishiguro, K., & Watanabe, Y.
(2014). The TRF1-binding protein TERB1
promotes chromosome movement and
telomere rigidity in meiosis. Nature cell
biology, 16(2), 145–156.
https://doi.org/10.1038/ncb2896
11. Dunce, J.M., Milburn, A.E., Gurusaran,
M. et al. Structural basis of meiotic
telomere attachment to the nuclear
envelope by MAJIN-TERB2-TERB1. Nat
Commun 9, 5355 (2018).
https://doi.org/10.1038/s41467-018-07794-7
12. Li, W., Bai, X., Li, J., Zhao, Y., Liu, J.,
Zhao, H., Liu, L., Ding, M., Wang, Q.,
Shi, F. Y., Hou, M., Ji, J., Gao, G., Guo,
R., Sun, Y., Liu, Y., & Xu, D. (2019). The
nucleoskeleton protein IFFO1 immobilizes
broken DNA and suppresses chromosome
translocation during tumorigenesis. Nature
cell biology, 21(10), 1273–1285.
https://doi.org/10.1038/s41556-019-0388-0
13. Van Steensel, B., & Belmont, A. S. (2017).
Lamina-associated domains: links with
chromosome architecture,
heterochromatin, and gene repression.
Cell, 169(5), 780-791.
14. Shevelyov, Y. Y., & Ulianov, S. V. (2019).
The nuclear lamina as an organizer of
chromosome architecture. Cells, 8(2), 136.
15. Bellanger A, Madsen-Østerbye J,
Galigniana NM, Collas P. Restructuring of
Lamina-Associated Domains in
Senescence and Cancer. Cells. 2022;
11(11):1846.
https://doi.org/10.3390/cells11111846
16. Smith, E. R., Capo-Chichi, C. D., & Xu,
X. X. (2018). Defective Nuclear Lamina
in Aneuploidy and Carcinogenesis.
Frontiers in oncology, 8, 529.
https://doi.org/10.3389/fonc.2018.00529
17. Dialynas GK, Vitalini MW, Wallrath LL.
Linking Heterochromatin Protein 1 (HP1)
to cancer progression. Mutat Res. 2008
Dec 1;647(1-2):13-20. doi:
10.1016/j.mrfmmm.2008.09.007. Epub
2008 Sep 24. PMID: 18926834; PMCID:
PMC2637788.
18. Clermont PL, Lin D, Crea F, Wu R, Xue
H, Wang Y, Thu KL, Lam WL, Collins
CC, Wang Y, Helgason CD. Polycomb-mediated
silencing in neuroendocrine
prostate cancer. Clin Epigenetics. 2015
Apr 3;7(1):40. doi: 10.1186/s13148-015-0074-4.
PMID: 25859291; PMCID:
PMC4391120.
19. Lee SK, Wang W. Roles of
Topoisomerases in Heterochromatin,
Aging, and Diseases. Genes. 2019;
10(11):884.
https://doi.org/10.3390/genes10110884
20. Amoiridis, M., Verigos, J., Meaburn, K. et
al. Inhibition of topoisomerase 2 catalytic
activity impacts the integrity of
heterochromatin and repetitive DNA and
leads to interlinks between clustered
repeats. Nat Commun 15, 5727 (2024).
https://doi.org/10.1038/s41467-024-49816-7
21. Anaissi, A., Kennedy, P.J., Goyal, M. et al.
A balanced iterative random forest for
gene selection from microarray data. BMC
Bioinformatics 14, 261 (2013).
https://doi.org/10.1186/1471-2105-14-261
22. Anaissi, Ali, et al. "Ensemble feature
learning of genomic data using support
vector machine." PloS one 11.6 (2016):
e0157330.
23. Carter, S. L., Eklund, A. C., Kohane, I. S.,
Harris, L. N., & Szallasi, Z. (2006). A
signature of chromosomal instability
inferred from gene expression profiles
predicts clinical outcome in multiple
human cancers. Nature genetics, 38(9),
1043–1048.
https://doi.org/10.1038/ng1861
24. Szklarczyk D, Kirsch R, Koutrouli M,
Nastou K, Mehryary F, Hachilif R, Annika
GL, Fang T, Doncheva NT, Pyysalo S,
Bork P‡, Jensen LJ‡, von Mering C‡. The
STRING database in 2023: protein–
protein association networks and
functional enrichment analyses for any
sequenced genome of interest. Nucleic
Acids Res. 2023 Jan 6;51(D1):D638-646.
25. Baker, T.M., Waise, S., Tarabichi, M. et al.
Aneuploidy and complex genomic
rearrangements in cancer evolution. Nat
Cancer 5, 228–239 (2024).
https://doi.org/10.1038/s43018-023-00711-y
26. Adell MAY, Klockner TC, Höfler R, et al.
Adaptation to spindle assembly
checkpoint inhibition through the selection
of specific aneuploidies. Genes Dev.
2023;37(5-6):171-190.
doi:10.1101/gad.350182.122
27. Abkevich, V., Timms, K. M., Hennessy, B.
T., Potter, J., Carey, M. S., Meyer, L. A.,
Smith-McCune, K., Broaddus, R., Lu, K.
H., Chen, J., Tran, T. V., Williams, D.,
Iliev, D., Jammulapati, S., FitzGerald, L.
M., Krivak, T., DeLoia, J. A., Gutin, A.,
Mills, G. B., & Lanchbury, J. S. (2012).
Patterns of genomic loss of heterozygosity
predict homologous recombination repair
defects in epithelial ovarian cancer. British
journal of cancer, 107(10), 1776–1782.
https://doi.org/10.1038/bjc.2012.451
28. Birkbak, N. J., Wang, Z. C., Kim, J. Y.,
Eklund, A. C., Li, Q., Tian, R., Bowman-Colin,
C., Li, Y., Greene-Colozzi, A.,
Iglehart, J. D., Tung, N., Ryan, P. D.,
Garber, J. E., Silver, D. P., Szallasi, Z., &
Richardson, A. L. (2012). Telomeric
allelic imbalance indicates defective DNA
repair and sensitivity to DNA-damaging
agents. Cancer discovery, 2(4), 366–375.
https://doi.org/10.1158/2159-8290.CD-11-0206
29. Popova, T., Manié, E., Rieunier, G.,
Caux-Moncoutier, V., Tirapo, C., Dubois, T.,
Delattre, O., Sigal-Zafrani, B., Bollet, M.,
Longy, M., Houdayer, C., Sastre-Garau,
X., Vincent-Salomon, A., Stoppa-Lyonnet,
D., & Stern, M. H. (2012). Ploidy and
large-scale genomic instability
consistently identify basal-like breast
carcinomas with BRCA1/2 inactivation.
Cancer research, 72(21), 5454–5462.
https://doi.org/10.1158/0008-5472.CAN-12-1470
30. Pellegrino B, Musolino A, Llop-Guevara
A, et al. Homologous Recombination
Repair Deficiency and the Immune
Response in Breast Cancer: A Literature
Review. Transl Oncol. 2020;13(2):410-422.
doi:10.1016/j.tranon.2019.10.010
31. Doig KD, Fellowes AP, Fox SB.
Homologous Recombination Repair
Deficiency: An Overview for Pathologists.
Mod Pathol. 2023;36(3):100049.
doi:10.1016/j.modpat.2022.100049
32. Ehrlich M. DNA hypomethylation, cancer,
the immunodeficiency, centromeric region
instability, facial anomalies syndrome and
chromosomal rearrangements [published
correction appears in J Nutr 2002
Nov;132(11):3432]. J Nutr. 2002;132(8
Suppl):2424S-2429S.
doi:10.1093/jn/132.8.2424S
33. Barra, V., Fachinetti, D. The dark side of
centromeres: types, causes and
consequences of structural abnormalities
implicating centromeric DNA. Nat
Commun 9, 4340 (2018).
https://doi.org/10.1038/s41467-018-06545-y
34. Hoyt, S. J., Storer, J. M., Hartley, G. A.,
Grady, P. G. S., Gershman, A., de Lima, L.
G., Limouse, C., Halabian, R., Wojenski,
L., Rodriguez, M., Altemose, N., Rhie, A.,
Core, L. J., Gerton, J. L., Makalowski, W.,
Olson, D., Rosen, J., Smit, A. F. A.,
Straight, A. F., Vollger, M. R., … O'Neill,
R. J. (2022). From telomere to telomere:
The transcriptional and epigenetic state of
human repeat elements. Science (New
York, N.Y.), 376(6588), eabk3112.
https://doi.org/10.1126/science.abk3112
35. Altemose, N., Logsdon, G. A., Bzikadze,
A. V., Sidhwani, P., Langley, S. A., Caldas,
G. V., Hoyt, S. J., Uralsky, L., Ryabov, F.
D., Shew, C. J., Sauria, M. E. G.,
Borchers, M., Gershman, A., Mikheenko,
A., Shepelev, V. A., Dvorkina, T.,
Kunyavskaya, O., Vollger, M. R., Rhie, A.,
McCartney, A. M., … Miga, K. H. (2022).
Complete genomic and epigenetic maps of
human centromeres. Science (New York,
N.Y.), 376(6588), eabl4178.
https://doi.org/10.1126/science.abl4178
36. Nurk S, Koren S, Rhie A, et al. The
complete sequence of a human genome.
Science. 2022;376(6588):44-53.
doi:10.1126/science.abj6987
37. Chen, T., & Guestrin, C. (2016).
XGBoost: A Scalable Tree Boosting
System. In Proceedings of the 22nd ACM
SIGKDD International Conference on
Knowledge Discovery and Data Mining
(pp. 785–794). New York, NY, USA:
ACM.
https://doi.org/10.1145/2939672.2939785
38. De Rop V, Padeganeh A, Maddox PS.
CENP-A: the key player behind
centromere identity, propagation, and
kinetochore assembly. Chromosoma.
2012 Dec;121(6):527-38. doi:
10.1007/s00412-012-0386-5. Epub 2012
Oct 26. PMID: 23095988; PMCID:
PMC3501172.
39. Giunta, S., Hervé, S., White, R. R.,
Wilhelm, T., Dumont, M., Scelfo, A.,
Gamba, R., Wong, C. K., Rancati, G.,
Smogorzewska, A., Funabiki, H., &
Fachinetti, D. (2021). CENP-A chromatin
prevents replication stress at centromeres
to avoid structural aneuploidy.
Proceedings of the National Academy of
Sciences of the United States of America,
118(10), e2015634118.
https://doi.org/10.1073/pnas.2015634118
40. Black EM, Giunta S. Repetitive Fragile
Sites: Centromere Satellite DNA As a
Source of Genome Instability in Human
Diseases. Genes (Basel). 2018 Dec
7;9(12):615. doi: 10.3390/genes9120615.
PMID: 30544645; PMCID: PMC6315641.
41. Luisa Robbez-Masson, Christopher H.C.
Tie, Helen M. Rowe; Cancer cells, on your
histone marks, get SETDB1, silence
retrotransposons, and go!. J Cell Biol 6
November 2017; 216 (11): 3429–3431.
doi:
https://doi.org/10.1083/jcb.201710068
42. Chatel, G., & Fahrenkrog, B. (2011).
Nucleoporins: leaving the nuclear pore
complex for a successful mitosis. Cellular
signalling, 23(10), 1555–1562.
https://doi.org/10.1016/j.cellsig.2011.05.023
43. Ryu T, Spatola B, Delabaere L, Bowlin K,
Hopp H, Kunitake R, Karpen GH, Chiolo
I. Heterochromatic breaks move to the
nuclear periphery to continue
recombinational repair. Nat Cell Biol.
2015 Nov;17(11):1401-11. doi:
10.1038/ncb3258. Epub 2015 Oct 26.
PMID: 26502056; PMCID: PMC4628585.
44. Wang ZQ, Wu ZX, Wang ZP, Bao JX, Wu
HD, Xu DY, Li HF, Xu YY, Wu RX, Dai
XX. Pan-cancer analysis of NUP155 and
validation of its role in breast cancer cell
proliferation, migration, and apoptosis.
BMC Cancer. 2024 Mar 19;24(1):353. doi:
10.1186/s12885-024-12039-6. PMID:
38504158; PMCID: PMC10953186.
45. Nakano H, Wang W, Hashizume C,
Funasaka T, Sato H, Wong RW.
Unexpected role of nucleoporins in
coordination of cell cycle progression.
Cell Cycle. 2011;10(3):425-433.
doi:10.4161/cc.10.3.14721
46. Sofi S, Mehraj U, Qayoom H, Aisha S,
Almilaibary A, Alkhanani M, Mir MA.
Targeting cyclin-dependent kinase 1
(CDK1) in cancer: molecular docking and
dynamic simulations of potential CDK1
inhibitors. Med Oncol. 2022 Jun
20;39(9):133. doi: 10.1007/s12032-022-01748-2.
PMID: 35723742; PMCID:
PMC9207877.
47. Malumbres, M., & Barbacid, M. (2009).
Cell cycle, CDKs and cancer: a changing
paradigm. Nature reviews. Cancer, 9(3),
153–166. https://doi.org/10.1038/nrc2602
48. Hodder S, Fox M, Binti Ahmad Mokhtar
AM, Mott HR, Owen D. ACKnowledging
the role of the Activated-Cdc42 associated
kinase (ACK) in regulating protein
stability in cancer. Small GTPases. 2023
Dec;14(1):14-25. doi:
10.1080/21541248.2023.2212573. PMID:
37194323; PMCID: PMC10193877.
49. Chafai, N., Bonizzi, L., Botti, S., &
Badaoui, B. (2023). Emerging applications
of machine learning in genomic medicine
and healthcare. Critical Reviews in
Clinical Laboratory Sciences, 61(2), 140–
163.
https://doi.org/10.1080/10408363.2023.2259466
50. Wang Y, Wang Q, Huang H, Huang W,
Chen Y, McGarvey PB, Wu CH, Arighi
CN, UniProt Consortium. A
crowdsourcing open platform for literature
curation in UniProt Plos Biology.
19(12):e3001464 (2021)
51. Schoch CL, et al. NCBI Taxonomy: a
comprehensive update on curation,
resources and tools. Database (Oxford).
2020: baaa062.
52. Sayers EW, et al. GenBank. Nucleic Acids
Res. 2019. 47(D1):D94-D99.
53. Goldman, M.J., Craft, B., Hastie, M. et al.
Visualizing and interpreting cancer
genomics data via the Xena platform. Nat
Biotechnol (2020).
https://doi.org/10.1038/s41587-020-0546-8
54. Frankish A, Carbonell-Sala S, Diekhans
M, Jungreis I, Loveland JE, Mudge JM,
Sisu C, Wright JC, Arnan C, Barnes I,
Banerjee A, Bennett R, Berry A, Bignell
A, Boix C, Calvet F, Cerdán-Vélez D,
Cunningham F, Davidson C, Donaldson S,
Dursun C, Fatima R, Giorgetti S, Giron
CG, Gonzalez JM, Hardy M, Harrison
PW, Hourlier T, Hollis Z, Hunt T, James
B, Jiang Y, Johnson R, Kay M, Lagarde J,
Martin FJ, Gómez LM, Nair S, Ni P, Pozo
F, Ramalingam V, Ruffier M, Schmitt BM,
Schreiber JM, Steed E, Suner MM,
Sumathipala D, Sycheva I, Uszczynska-Ratajczak
B, Wass E, Yang YT, Yates A,
Zafrulla Z, Choudhary JS, Gerstein M,
Guigo R, Hubbard TJP, Kellis M, Kundaje
A, Paten B, Tress ML, Flicek P. Nucleic
Acids Res 2023 : 51 ; d1 ; D942-D949.
55. Vivian J, Rao AA, Nothaft FA, Ketchum
C, Armstrong J, Novak A, Pfeil J,
Narkizian J, Deran AD, Musselman-Brown
A, Schmidt H, Amstutz P, Craft B,
Goldman M, Rosenbloom K, Cline M,
O'Connor B, Hanna M, Birger C, Kent
WJ, Patterson DA, Joseph AD, Zhu J,
Zaranek S, Getz G, Haussler D, Paten B.
Toil enables reproducible, open source,
big biomedical data analyses. Nat
Biotechnol. 2017 Apr 11;35(4):314-316.
doi: 10.1038/nbt.3772. PMID: 28398314;
PMCID: PMC5546205.
56. Mermel, C.H., Schumacher, S.E., Hill, B.
et al. GISTIC2.0 facilitates sensitive and
confident localization of the targets of
focal somatic copy-number alteration in
human cancers. Genome Biol 12, R41
(2011). https://doi.org/10.1186/gb-2011-12-4-r41
57. Grossman, Robert L., Heath, Allison P.,
Ferretti, Vincent, Varmus, Harold E.,
Lowy, Douglas R., Kibbe, Warren A.,
Staudt, Louis M. (2016) Toward a Shared
Vision for Cancer Genomic Data. New
England Journal of Medicine 375:12,
1109-1112
58. Blighe K, Rana S, Lewis M (2024).
EnhancedVolcano: Publication-ready
volcano plots with enhanced colouring and
labeling. R package version 1.22.0,
https://github.com/kevinblighe/EnhancedVolcano.
59. Robinson, M.D., Oshlack, A. A scaling
normalization method for differential
expression analysis of RNA-seq data.
Genome Biol 11, R25 (2010).
https://doi.org/10.1186/gb-2010-11-3-r25
60. Chen Y, Chen L, Lun ATL, Baldoni P,
Smyth GK (2024). “edgeR 4.0: powerful
differential analysis of sequencing data
with expanded functionality and improved
support for small counts and larger
datasets.” bioRxiv.
doi:10.1101/2024.01.21.576131.
61. Chen Y, Lun ATL, Smyth GK (2016).
“From reads to genes to pathways:
differential expression analysis of RNA-Seq
experiments using Rsubread and the
edgeR quasi-likelihood pipeline.”
F1000Research, 5, 1438.
doi:10.12688/f1000research.8987.2.
62. McCarthy DJ, Chen Y, Smyth GK (2012).
“Differential expression analysis of
multifactor RNA-Seq experiments with
respect to biological variation.” Nucleic
Acids Research, 40(10), 4288-4297.
doi:10.1093/nar/gks042.
63. Robinson MD, McCarthy DJ, Smyth GK
(2010). “edgeR: a Bioconductor package
for differential expression analysis of
digital gene expression data.”
Bioinformatics, 26(1), 139-140.
doi:10.1093/bioinformatics/btp616.
64. Lawrence M, Gentleman R, Carey V
(2009). “rtracklayer: an R package for
interfacing with genome browsers.”
Bioinformatics, 25, 1841-1842.
doi:10.1093/bioinformatics/btp328,
http://bioinformatics.oxfordjournals.org/content/25/14/1841.abstract.
65. Nassar, L. R., Barber, G. P., Benet-Pagès,
A., Casper, J., Clawson, H., Diekhans, M.,
Fischer, C., Gonzalez, J. N., Hinrichs, A.
S., Lee, B. T., Lee, C. M., Muthuraman, P.,
Nguy, B., Pereira, T., Nejad, P., Perez, G.,
Raney, B. J., Schmelter, D., Speir, M. L.,
Wick, B. D., … Kent, W. J. (2023). The
UCSC Genome Browser database: 2023
update. Nucleic acids research, 51(D1),
D1188–D1195.
https://doi.org/10.1093/nar/gkac1072
66. Nassar, L. R., Barber, G. P., Benet-Pagès,
A., Casper, J., Clawson, H., Diekhans, M.,
Fischer, C., Gonzalez, J. N., Hinrichs, A.
S., Lee, B. T., Lee, C. M., Muthuraman, P.,
Nguy, B., Pereira, T., Nejad, P., Perez, G.,
Raney, B. J., Schmelter, D., Speir, M. L.,
Wick, B. D., … Kent, W. J. (2023). The
UCSC Genome Browser database: 2023
update. Nucleic acids research, 51(D1),
D1188–D1195.
https://doi.org/10.1093/nar/gkac1072
67. Wickham H (2016). ggplot2: Elegant
Graphics for Data Analysis. Springer-Verlag
New York. ISBN 978-3-319-24277-4,
https://ggplot2.tidyverse.org.
68. Hothorn T, Hornik K, van de Wiel MA,
Zeileis A (2006). “A Lego system for
conditional inference.” The American
Statistician, 60(3), 257–263.
doi:10.1198/000313006X118430.
69. R Core Team (2013). R: A language and
environment for statistical computing. R
Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0
Supplemental Information
CHD5
TMEM201
MAD2L2
AGTRAP
CLCN6
CDC42
KDM1A
AHDC1
SPOCD1
HDAC1
RBBP4
AK2
CDCA8
ZMPSTE24
GUCA2B
YBX1
CC2D1B
ZNF644
LRIF1
RHOC
SIKE1
SETDB1
POGZ
LMNA
PMF1
POU2F1
TOR1AIP1
RNF2
KDM5B
PPP2R5A
NSL1
LBR
PARP1
NUP133
HNRNPU
AHCTF1
E2F6
DNMT3A
CENPA
SPDYA
SPAST
PCBP1
PCGF1
CHMP3
BCL2L11
RND3
CACNB4
HECW2
CANX
TXN
HMGA2
RPA1
BARD1
CDYL
SPTAN1
TMPO
C1QBP
XRCC5
NUP153
SET
SLC25A3
MIS12
OBSL1
POU5F1
TOR1A
TCHP
ACAP1
ACSL3
BAG6
BRD3
TBX5
POLR2A
SP100
EHMT2
WDR5
HNF1A
CHD3
PDE6D
NELFE
EHMT1
RHOF
AURKB
ATG7
BRD2
PFKP
SMAD9
NDEL1
TMEM43
RING1
SUV39H2
RB1
MPRIP
ZNF621
DAXX
VIM
KPNA3
SREBF1
CTNNB1
LEMD2
COMMD3.BMI1
LMO7
SUZ12
TRAK1
HMGA1
BMI1
UBAC2
PCGF2
RHOA
CUL7
ZNF239
CUL4A
CDK12
BAP1
MLIP
ZWINT
TFDP1
TOP2A
RYBP
ORC3
CDK1
CHAMP1
STAT3
CHMP2B
HDAC2
SIRT1
PARP2
RND2
SENP7
BCLAF1
DNAJB12
DAD1
BRCA1
RPN1
SYNE1
DUSP13
PRMT5
GFAP
TFDP2
QKI
PCGF6
CHMP4A
KPNB1
KPNA4
SUN1
NUP98
TINF2
CBX1
FXR1
GET4
RHOG
RHOJ
SRSF1
SOX2
ACTB
RRP8
SYNE2
KPNA2
DNAJB11
RAC1
TRIM66
MAX
SUMO2
TFRC
RPA3
MYOD1
FOS
CBX8
YTHDC1
CBX3
PAX6
BCL11B
EIF4A3
PAPSS1
HOXA5
DDB2
YY1
CHMP6
EGF
EGFR
DDB1
PPP2R5C
RAC3
PGRMC2
POM121
INCENP
RCOR1
SMCHD1
SMARCA5 AKAP9
STX5
BAHD1
PIAS2
SMAD1
SND1
TM7SF2
MGA
SMAD2
NPR3
NUP205
BANF1
TP53BP1
MBD1
NIPBL
TRIM24
RHOD
WDR76
ADNP2
NUP155
KIAA1549 GSTP1
ZNF280D
LMNB2
AGGF1
ZC3HAV1
NUMA1
SMAD3
SGTA
AP3B1
KDM7A
EED
PML
CHAF1A
XRCC4
BRAF
IFFO1
USP7
HNRNPM
PRDM6
AGK
CHD4
PRKCB
DNMT1
LMNB1
FAM131B PHC1
MAPK3
ILF3
PPP2CA
EZH2
GABARAPL1 ZNF764
SMARCA4
CAMLG
CHMP7
GUCY2C
PRR14
BRD4
SMAD5
PPP2CB
LRRK2
CTCF
AKAP8L
KIF20A
WRN
YAF2
CDH1
WIZ
MATR3
TERF1
SENP1
VPS4A
CEBPA
HDAC3
CHMP4C
RND1
SF3B3
ZNF382
FAM114A2 ESRP1
CBX5
IST1
SIRT2
NPM1
SMARCA2 IL23A
GABARAPL2 ERCC2
FAF2
PSIP1
BAZ2A
WWOX
CRX
PPP2R1A
ZNF579
TRIM28
CHMP2A
DDRGK1
RRBP1
BANF2
ASXL1
DNMT3B
MAPRE1
CHMP4B
DSN1
L3MBTL1
ZMYND8
ADNP
VAPB
CTSZ
TPTE
CHAF1B
RRP1B
MCM3AP
MAPK1
SMARCB1
RAC2
SUN2
CBX7
L3MBTL2
XRCC6
NUP50
RBBP7
SUV39H1
KDM5C
ZMYM3
ATRX
CUL4B
MMGT1
FATE1
MECP2
EMD
UBL4A
TTN
ZNF462
LEMD3
ZC3H18
CCDC155
Table S1. Heterochromatin Interactome
Genes output by the STRING query composing the heterochromatin interactome or Set-of-Interest (SOI)
[Figure: volcano plots, one panel per cancer type (Breast invasive carcinoma; Head and Neck squamous cell carcinoma; Lung adenocarcinoma; Lung squamous cell carcinoma). x-axis: Log2 fold change; y-axis: −Log10 P; legend: Background Gene, Gene of Interest.]
Supplemental Figure 1. Differential Gene Expression in the set-of-interest across different tissue types.
Differential gene expression in cancerous versus normal conditions. Genes in the set-of-interest (red) are highlighted against background
genes from the TCGA dataset (blue) for all paired samples within specific tissue types. All p-value cutoffs are set to 0.01. P-values (y-axis) and
fold changes (x-axis) were produced with the edgeR package. Plots were produced with the EnhancedVolcano R package.
             Base Layer                              Meta Layer                              Mean
CIN Feature  Accuracy  Baseline Accuracy  p-value    Accuracy  Baseline Accuracy  p-value    Accuracy  Baseline Accuracy
1p           0.847     0.795              9.10E-28   0.833     0.794              1.63E-21   0.840     0.794
1q           0.843     0.697              4.36E-166  0.846     0.703              5.09E-214  0.844     0.700
2p           0.819     0.833              9.99E-01   0.808     0.832              1.00E+00   0.813     0.833
2q           0.793     0.869              1.00E+00   0.789     0.866              1.00E+00   0.791     0.867
3p           0.836     0.702              5.53E-141  0.828     0.703              1.55E-161  0.832     0.703
3q           0.817     0.747              8.63E-43   0.807     0.747              3.19E-40   0.812     0.747
4p           0.738     0.731              1.08E-01   0.736     0.730              8.76E-02   0.737     0.730
4q           0.815     0.755              1.40E-31   0.810     0.756              3.28E-34   0.813     0.756
5p           0.801     0.708              8.64E-68   0.801     0.708              8.49E-89   0.801     0.708
5q           0.862     0.722              1.63E-162  0.856     0.728              2.61E-184  0.859     0.725
6p           0.825     0.800              2.28E-07   0.817     0.799              8.32E-06   0.821     0.799
6q           0.811     0.755              3.05E-28   0.794     0.753              7.07E-21   0.803     0.754
7p           0.830     0.699              4.16E-134  0.825     0.700              2.98E-160  0.828     0.699
7q           0.824     0.739              2.10E-60   0.823     0.742              7.00E-72   0.823     0.741
8p           0.802     0.592              5.13E-290  0.802     0.596              0.00E+00   0.802     0.594
8q           0.772     0.669              3.08E-76   0.776     0.672              2.13E-102  0.774     0.670
9p           0.752     0.689              1.32E-29   0.754     0.688              1.45E-42   0.753     0.688
9q           0.832     0.731              4.13E-83   0.825     0.728              4.08E-103  0.828     0.730
10p          0.800     0.746              2.73E-25   0.782     0.751              1.51E-12   0.791     0.748
10q          0.835     0.773              6.26E-36   0.833     0.774              2.57E-42   0.834     0.774
11p          0.782     0.781              4.24E-01   0.770     0.784              9.99E-01   0.776     0.782
11q          0.806     0.787              8.96E-05   0.792     0.786              9.04E-02   0.799     0.787
12p          0.746     0.754              9.52E-01   0.745     0.759              9.99E-01   0.745     0.756
12q          0.800     0.820              1.00E+00   0.788     0.821              1.00E+00   0.794     0.820
13q          0.837     0.701              1.60E-144  0.833     0.700              3.36E-182  0.835     0.700
14q          0.826     0.767              5.45E-32   0.823     0.766              3.66E-40   0.825     0.767
15q          0.784     0.785              5.90E-01   0.781     0.790              9.80E-01   0.783     0.788
16p          0.805     0.771              4.13E-12   0.786     0.771              2.99E-04   0.796     0.771
16q          0.824     0.686              1.01E-143  0.828     0.686              2.17E-203  0.826     0.686
17p          0.842     0.607              0.00E+00   0.847     0.613              0.00E+00   0.845     0.610
17q          0.813     0.796              1.66E-04   0.803     0.797              5.74E-02   0.808     0.796
18p          0.727     0.705              3.13E-05   0.736     0.706              2.49E-10   0.732     0.705
18q          0.804     0.691              1.32E-96   0.800     0.692              3.03E-116  0.802     0.691
19p          0.824     0.809              6.10E-04   0.814     0.807              3.40E-02   0.819     0.808
19q          0.784     0.796              9.92E-01   0.786     0.793              9.40E-01   0.785     0.794
20p          0.792     0.715              4.25E-46   0.776     0.720              6.87E-33   0.784     0.718
20q          0.836     0.714              1.90E-118  0.835     0.716              4.82E-150  0.835     0.715
21q          0.779     0.736              3.25E-16   0.771     0.739              1.64E-12   0.775     0.737
22q          0.833     0.719              3.51E-105  0.824     0.714              8.06E-126  0.828     0.716
Mean         0.808     0.741                         0.802     0.742                         0.805     0.741

Table S2. Chromosome instability model classification performance
Performance of the classification XGBoost learners in the base layer of the general model.
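Table S2 reports, for each learner, an accuracy, a baseline accuracy and a p-value. This section does not state how the baseline or the p-values were obtained. As a minimal sketch only, assuming the baseline is the majority-class frequency and the p-value comes from a one-sided exact binomial test of the number of correct predictions against that baseline, the comparison could be computed as below. The class encoding, the toy labels and the function names are hypothetical.

```python
from math import comb


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / len(labels)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    Interpreted here as the chance of k or more correct predictions
    if the true per-sample accuracy were only the baseline p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy data (hypothetical encoding): 0 = no change, 1 = gain, -1 = loss
y_true = [0, 0, 0, 0, 0, 0, 1, 1, -1, -1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, -1, 0]

n = len(y_true)
correct = sum(yt == yp for yt, yp in zip(y_true, y_pred))
accuracy = correct / n
baseline = majority_baseline(y_true)
p_value = binomial_upper_tail(correct, n, baseline)

print(accuracy, baseline, p_value)
```

Under this reading, a learner whose accuracy does not exceed its baseline (for example 2p or 2q in the base layer) yields a p-value close to 1, which matches the pattern in the table.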
[Figure: four panels (A to D), each titled "Accuracy of the Aneuploidy Learners", for cancer types BRCA (A), LUSC (B), LUAD (C) and HNSC (D). x-axis: Learner (chromosome arms 1p to 22q); y-axis: Accuracy; legend: Learner Type (Base, Meta).]
Supplemental Figure 2. Classification performance of the cancer-specific Chromosome Instability models
Classification performance of the individual XGBoost learners, measured as overall accuracy (y-axis) for the 39 classification learners
(x-axis), assessed for both the base layer (blue) and meta layer (red) for Breast invasive carcinoma (A), Lung squamous cell carcinoma (B),
Lung adenocarcinoma (C) and Head and Neck squamous cell carcinoma (D).
             Base Layer      Meta Layer      Mean
CIN Feature  R²      RMSE    R²      RMSE    R²      RMSE
ai1          0.710   3.923   0.740   3.678   0.725   3.801
lst1         0.570   4.569   0.600   4.384   0.585   4.476
loh_hrd      0.590   3.218   0.620   3.050   0.605   3.134
peri_1       0.240   0.145   0.260   0.143   0.250   0.144
peri_2       0.400   0.104   0.430   0.100   0.415   0.102
peri_3       0.510   0.162   0.510   0.159   0.510   0.161
peri_4       0.220   0.155   0.260   0.149   0.240   0.152
peri_5       0.280   0.149   0.310   0.142   0.295   0.145
peri_7       0.460   0.164   0.490   0.156   0.475   0.160
peri_8       0.210   0.207   0.240   0.201   0.225   0.204
peri_9       0.400   0.173   0.410   0.168   0.405   0.171
peri_10      0.440   0.152   0.470   0.147   0.455   0.149
peri_16_1    0.420   0.130   0.460   0.126   0.440   0.128
peri_16_2    0.470   0.171   0.500   0.164   0.485   0.167
peri_17      0.360   0.168   0.390   0.161   0.375   0.165
peri_20      0.750   0.117   0.780   0.107   0.765   0.112
peri_22_3    0.540   0.164   0.550   0.162   0.545   0.163
Mean         0.445   0.816   0.472   0.776   0.459   0.796

Table S3. Chromosome instability model regression performance
Performance of the regression XGBoost learners in the base layer of the general model in terms of the
coefficient of determination and root mean squared error.
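For reference, the two regression metrics in Table S3 follow their standard definitions: the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares, and the RMSE is the square root of the mean squared residual. The sketch below computes both on toy numbers. The function names and data are illustrative and are not taken from the study's code.

```python
from math import sqrt


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return sqrt(mse)


# Toy targets and predictions (hypothetical values)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

print(r_squared(y_true, y_pred))  # SS_res = 4, SS_tot = 5
print(rmse(y_true, y_pred))       # sqrt(4 / 4)
```

Because RMSE is expressed in the units of the target, the values for ai1, lst1 and loh_hrd are not directly comparable with those of the peri_* features. R² is unitless and is the more suitable column for comparing learners across features.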
[Figure: four panels (A to D), each titled "Pericentromeric and HRD models in terms of R2", for cancer types BRCA (A), LUSC (B), LUAD (C) and HNSC (D). x-axis: Learner; y-axis: R2 (0.0 to 1.0); legend: Learner Type (Base, Meta). Learners shown: BRCA: ai1, lst1, loh_hrd, peri_1, peri_8, peri_10, peri_16_1, peri_16_2, peri_17, peri_20, peri_22_3. LUSC: ai1, lst1, loh_hrd, peri_3, peri_4, peri_5, peri_7, peri_8, peri_9, peri_10, peri_17, peri_20, peri_22_3. LUAD: ai1, lst1, loh_hrd, peri_1, peri_5, peri_7, peri_8. HNSC: ai1, lst1, loh_hrd, peri_3, peri_7, peri_8, peri_20.]
Supplemental Figure 3. Regression performance of the cancer-specific Chromosome Instability models
Performance of the individual XGBoost learners, measured as the coefficient of determination (R2; y-axis) for the regression learners
(x-axis), assessed for both the base layer (blue) and meta layer (red) for Breast invasive carcinoma (A), Lung squamous cell carcinoma (B),
Lung adenocarcinoma (C) and Head and Neck squamous cell carcinoma (D).
[Figure: four panels, one per cancer type (BRCA, LUSC, LUAD, HNSC). x-axis: Gene; y-axis: Base Learner (arm-level learners 1p to 22q plus the ai1, lst1, loh_hrd and peri_* learners of each model); colour scale: Gain. Uniform Gain Threshold: 0.003 in each panel.]
Supplementary Figure 4. Genomic Importance of Cancer-Specific Models.
The contribution of individual genes (x-axis) to the individual base-learners (y-axis) in terms of the Gain metric (gradient) for the Breast invasive carcinoma
(A), Lung squamous cell carcinoma (B), Lung adenocarcinoma (C) and Head and Neck squamous cell carcinoma (D) models. Each panel displays
the Uniform Gain Threshold value (see methods).
[Figure: four panels, one per cancer type (BRCA, LUSC, LUAD, HNSC). x-axis: Base Learner; y-axis: Meta Learner (both covering arm-level learners 1p to 22q plus the ai1, lst1, loh_hrd and peri_* learners of each model); colour scale: Gain. Uniform Gain Threshold values shown on the panels: 0.02, 0.019, 0.022 and 0.022.]
Supplementary Figure 5. The CIN feature impacts of Cancer-Specific Models.
The contributions of the base-learners (x-axis) to the individual meta-learners (y-axis) in terms of the Gain metric (gradient) for the Breast
invasive carcinoma (A), Lung squamous cell carcinoma (B), Lung adenocarcinoma (C) and Head and Neck squamous cell carcinoma (D)
models. Each panel displays the corresponding Uniform Gain Threshold value (see methods).
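The definition of the Uniform Gain Threshold is given in the methods and is not reproduced in this section. As a hedged sketch, one reading is that the threshold is the share of gain each input would receive if the total gain were spread evenly, that is 1 divided by the number of input features. This reading is an assumption. It is consistent with the displayed values: 50, 52 and 46 inputs would give 0.02, 0.019 and 0.022 after rounding. The function names and the toy gain values below are hypothetical.

```python
def normalise_gain(raw_gain):
    """Scale raw gain values so that they sum to 1 across features."""
    total = sum(raw_gain.values())
    return {feature: gain / total for feature, gain in raw_gain.items()}


def uniform_gain_threshold(n_features):
    """Assumed definition: gain per feature if all features contributed equally."""
    return 1.0 / n_features


def features_above_threshold(raw_gain):
    """Keep the features whose normalised gain exceeds the uniform threshold."""
    normalised = normalise_gain(raw_gain)
    threshold = uniform_gain_threshold(len(normalised))
    return {f: g for f, g in normalised.items() if g > threshold}


# Toy raw gain values for four inputs of one learner (hypothetical numbers)
toy_gain = {"1p": 6.0, "1q": 3.0, "ai1": 1.0, "lst1": 0.0}
print(features_above_threshold(toy_gain))  # threshold is 1/4 = 0.25 here

# Thresholds implied by different input counts under this assumption
for n_inputs in (50, 52, 46):
    print(n_inputs, round(uniform_gain_threshold(n_inputs), 3))
```

Under this assumption, an input above the threshold contributes more than an equal share of the total gain, which gives a common yardstick across learners with different numbers of inputs.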