① 肿瘤WES研究除了需要患者自己的肿瘤组织外,还需要自己的非肿瘤组织来匹配,后者是过滤掉患者独特的胚胎系统变异(即我们生来就与其他人不同DNA序列变异),否则将无法判断来自体细胞的突变 (后天获得),还是天生就有 (遗传或新突变),即:Somatic Mutation vs. Germline Mutation。公共数据库不记录患者的先天遗传变异,因此患者自身的正常组织总是需要配对。
② 在大多数情况下,体细胞突变是随机的。因此,同一肿瘤组织一般可分为不同的细胞亚群,包括不同的驱动突变,然后有不同的转录表达 (这是空间异质性。肿瘤的演变/进化研究可能涉及时间异质性,如晚期癌症转移组织或服用靶向药物一年后的耐药性突变组织等。
到目前为止,我们已经展示了我们所知道的最全面的癌症相关基因变化目录 (Catalogueof cancer-associated gene alterations),该目录描述了国际癌症基因组联盟(International Cancer Genome Consortium, ICGC)以及癌症基因组图谱(The Cancer Genome Atlas, TCGA)的全基因组癌症分析(Pan-Cancer Analysis of Whole Genomes, PCAWG)联盟1,188名捐赠者的肿瘤转录组获得的。
我们 (进一步)利用匹配的 (Matched)全基因组测序数据组测序数据RNA改变与胚系和体细胞DNA改变相关联 (Associated several categories of RNA alterations with germline and somatic DNAalterations),并确定了可能性遗传机制(Genetic mechanism)。
Cis-eQTLs vs. trans-eQTLs.Expression quantitative trait loci (eQTLs) are genetic variants that influence expression levels of mRNA transcripts. Cis-eQTLs commonly refer to genetic variations that act on local genes , and trans-eQTLs are those that act on distant genes and genes residing on different chromosomes.
a) cis-eQTL, b) trans-eQTL, c) mediated/介导 trans-eQTL with a single cis-mediator, and d) mediated/介导 trans-eQTL with multiple cis-mediators. 清华大学统计科学中心,https://doi.org/10.1186/s12859-019-2651-6,BMC Bioinformatics
Identification of eQTLs can help advance our understanding of genetics and regulatory mechanisms of gene expression in various organisms. Consistent findings suggest that many genes are regulated by nearby SNPs, and the identified cis-eQTLs are typically close to transcription start sites (TSSs). In contrast to cis-eQTLs, trans-eQTL identification is much more challenging because a greater number of SNP-gene pairs are tested for trans-association. In order to achieve the same power, analysis of trans-eQTLs requires a much larger sample size and/or effect than that in the cis-eQTL analysis. However, trans-eQTLs tend to have weaker effects than cis-eQTLs.
Mediation diagram of the trans-association between rs2239804 and RPL34
Several methods have been developed to improve trans-eQTL detection, such as reducing the multiple-testing burden based on pairwise partial correlations from the gene expression data to increase power, and constructing or selecting variables to control for unmeasured confounders that may lead to spurious association
b,对基因表达水平进行方差成分分析 (Variance component analysis),显示不同种系和体细胞因素,对不同基因集的方差所占的平均比例 (Average proportion of variance explained by different germline and somatic factors for different sets of genes),包括所有因子的平均效应:
Extended Data Fig. 7 | 与遗传先导负荷 (Genic lead burden)相关联的7个体细胞eGenes的曼哈顿图
Extended Data Fig. 8 | 8个体细胞eGenes的散点图,显示先导权重负荷对基因表达残差的影响 (Plots show the effect of the lead weighted burden on the gene expression residuals (见原文的Methods) of these genes. a, CDK12. b, PI4KA. c, IRF4. d, AICDA. e, C11orf73. f, BCL2. g, SGK1. h, TEKT5
大多数eQTL (68.4%)与侧翼非编码突变负荷相关 (Extended Data Fig. 6e,见上文)。(由此可见:基因组的非编码区虽然不直接体现生命活动 (蛋白),但对基因表达的调控非常重要)
c, Enrichment of SINE elements in SAVs (Splicing-associated variants,剪接相关变异) compared to sequence background (BG). Shown for SINE elements overlapping in sense (middle) and antisense (right) directions.
a, 不同组织类型的不同改变的中位数. Histotypes are ordered by hierarchical clustering based on the pattern of different types of alteration. Alt., alternative; non-syn, non-synonymous. Cancer-type abbreviations are listed in Supplementary Table 23.
b, c, Circular representations of the selected genes significantly co-occurred with B2M (b) and PCBP2 (c). Connecting lines indicate the specific types of co-occurrence of alteration pairs. 内部直方图显示不同颜色的不同DNA/RNA变化类型的发生频率。
a Consensus matrix for four clusters. Samples are in both rows and columns and pairwise values range from 0 (samples never cluster together; white) to 1 (samples always cluster together; dark blue). (样本的相关性矩阵,发现聚集为4类)
b Comparison between the three UROMOL2016 transcriptomic classes and the UROMOL2021 four-cluster solution (76% of tumors in UROMOL2016 class 1 remained class 1, 92% of tumors in UROMOL2016 class 2 remained class 2a/2b and 67% of tumors in UROMOL2016 class 3 remained class 3). (样本前后分类、聚集的比较)
c Kaplan–Meier plot of progression-free survival (PFS) for 530 patients stratified by transcriptomic class. (以分组的转录组聚集分类,做无进展生存曲线;四条曲线分别对应4种分类)
d Kaplan–Meier plot of recurrence-free survival (RFS) for 511 patients stratified by transcriptomic class. (同上,无复发生存期 生存曲线)
e, f Clinicopathological information and selected gene expression signatures for all patients stratified by transcriptomic class. Samples are ordered after increasing silhouette score within each class (lowest to highest class correlation). CIS carcinoma in situ, EORTC European Organisation for Research and Treatment of Cancer, EAU European Association of Urology, MIBC muscle-invasive bladder cancer, EMT epithelial-mesenchymal transition. (转录组分类的,所有患者的临床病理信息、及选定的基因表达特征,二者的信息映射。样本在每个类别中增加轮廓分数后排序(从最低到最高类别相关性)。CIS原位癌,EORTC欧洲癌症研究和治疗组织,EAU欧洲泌尿学协会,MIBC肌肉浸润性膀胱癌,EMT上皮-间质转化) (比如EMT基因集合,在各个样本中的表达值做加和?)
g RNA-based immune score and immune-related gene expression signatures for all patients stratified by transcriptomic class. (转录组分类的所有患者的RNA免疫评分和免疫相关基因表达特征)
h Regulon activity profiles for 23 transcription factors. Samples are ordered after increasing silhouette score within each class (lowest to highest class correlation). Regulons (rows) are hierarchically clustered. (23个转录因子的调控活性图谱。样本在每个类别中增加轮廓分数后排序(从最低到最高类别相关性)。规则(行)是层级聚类的)
i Regulon activity profiles for potential regulators associated with chromatin remodeling. The most-upregulated regulons within each class are shown. Regulons are hierarchically clustered. P-values were calculated using two-sided Fisher’s exact test for categorical variables, Kruskal–Wallis rank-sum test for continuous variables and two-sided log-rank test for comparing survival curves. Source data are provided as a Source data file. (与染色质重塑相关的潜在调控因子的调控活性谱。每个类别中最受限制的规则显示出来。规则是层级聚类的。P值的计算采用分类变量的双侧Fisher精确检验,连续变量的Kruskal-Wallis秩和检验,生存曲线的比较采用双侧log-rank检验。源数据作为源数据文件提供)
图2 NMIBC中拷贝数的变化
a 根据基因组类别 (Genomic class, GC) 1-3分层的473个肿瘤的全基因组拷贝数图。增益(增益+高平衡增益)和损失(损失+高平衡损失)汇总在染色体带面板的左侧。EORTC欧洲癌症研究与治疗组织,EAU欧洲泌尿外科协会,MIBC肌肉浸润性膀胱癌。
b 426例按基因组分类的无进展生存期(PFS) Kaplan-Meier图。
c 399例按基因组分类的患者无复发生存期(RFS) Kaplan-Meier图。
d EORTC高危评分(n = 163)按基因组分类分层的患者的PFS Kaplan-Meier图。p值的计算采用双侧log-rank检验。源数据作为源数据文件提供。
Fig. 3 Genomic alterations associated with transcriptomic classes. (与转录组分类相关的基因组改变)
a Genomic classes (GCs) compared to transcriptomic classes (n = 303). 两个组学分类方式的交叉分布展示、统计检验。
b. 12-gene qPCR-based progression risk score compared to GCs. Colors indicate transcriptomic classes.
c Kaplan–Meier plot of progression-free survival (PFS) for 154 patients (including only class 2a and 2b tumors) stratified by GC.
d. Number of RNA-derived mutations according to transcriptomic classes.
e Landscape of genomic alterations according to transcriptomic classes. Samples are ordered after the combined contribution of the APOBEC-related mutational signatures. Panels: RNA-derived mutational load, relative contribution of four RNA-derived mutational signatures (inferred from 441 tumors having more than 100 single nucleotide variations), selected RNA-derived mutated genes, copy number alterations in selected disease driver genes (derived from SNP arrays). Asterisks indicate p-values below 0.05. Daggers indicate BH-adjusted p-values below 0.05.
f. Comparison of RNA-derived single nucleotide variations to whole-exome sequencing (WES) data from 38 patients for 11,016 mutations in all genes, 280 mutations in the genes most frequently mutated or differentially affected between the classes (n = 82, Supplementary Fig. 5b) and 93 mutations in 19 selected bladder cancer genes (Fig. 3e). Only mutations with > 10 reads in tumor and germline DNA were considered and a mutation was called observed when the frequency of the alternate allele was above 2%.
g. Genomic alterations significantly enriched in one transcriptomic class vs. all others.
h Overview of p53 pathway alterations for all tumors with available copy number data and RNA-Seq data (n = 303).
i Amount of genome altered according to p53 pathway alteration. intact (完好无损的)
j Number of mutations according to mutations in DNA-damage response (DDR) genes (including TP53, ATM, BRCA1, ERCC2, ATR, MDC1).
k. RNA-based immune scoreaccording to GCs.
l RNA-derived mutational load according to GCs.
m Relative contribution of the APOBEC-related mutational signaturesaccording to transcriptomic class.
(采用的统计检验方法等) P-values were calculated using two-sided Fisher’s exact test for categorical variables, Kruskal–Wallis rank-sum test for continuous variables and twosided log-rank test for comparing survival curves. For all boxplots, the center line represents the median, box hinges represent first and third quartiles and whiskers represent ± 1.5× interquartile range. Source data are provided as a Source data file.
Fig. 4 Spatial proteomics analysis of tumor immune contexture. a Multiplex immunofluorescence staining with Panel 1 (CD3, CD8, and FOXP3) of tumors with high- and low immune infiltration with magnifications of T helper cells (CD3+, CD8− and FOXP3−), a cytotoxic T lymphocyte (CTL; CD3+, CD8−, FOXP3−) and a regulatory T cell (Treg; CD3+, CD8− and FOXP3+). Yellow dashed lines divide the tumor tissue into parenchymal and stromal regions. Scale bar: 20 µm. All protein measurements were performed once for each distinct sample. b Spatial organization of immune cell infiltration and antigen recognition/escape mechanisms (MHC class 1 and PD-L1) with associated data for genomic class, transcriptomic class, and recurrence rate. The immune cells and immune evasion markers are defined as the percentage of positive cells in the different regions (stroma and parenchyma) and normalized using zscores, (1) z ¼ ðxμÞ σ . Columns are sorted by the degree of immune infiltration into the tumor parenchyma in descending order from left to right. c Immune infiltration stratified by transcriptomic class. Immune infiltration is defined as the percentage of total cells in the parenchyma classified as immune cells. The p-value was calculated using two-sided Wilcoxon rank-sum test. d Immune infiltration stratified by recurrence rate. The p-value was calculated by the one-sided Jonckheere–Terpstra test for trend. e Kaplan–Meier plot of recurrence-free survival (RFS) for patients with tumors with few genomic alterations (GC1 + 2) stratified by immune infiltration. P-value was calculated using two-sided log-rank test. f Distribution of CK5/6 and GATA3 positive carcinoma cells stratified by transcriptomic class. Each column represents a patient. The p-value reflects the difference in CK5/6 expression across classes and was calculated by chi-squared test. For boxplots, the center line represents the median, box hinges represent first and third quartiles and whiskers represent ± 1.5× interquartile range. Source data are provided as a Source data file.
Fig. 5 Prediction models and summary characteristics of classes. a Overview of hazard ratios calculated from univariate Cox regressions of progressionfree survival using clinical and molecular features. Black dots indicate hazard ratios and horizontal lines show 95% confidence intervals (CI). Asterisks indicate p-values below 0.05 and the sample sizes, n, used to derive statistics are written to the right. CIS carcinoma in situ, EORTC European Organisation for Research and Treatment of Cancer, EAU European Association of Urology. b Receiver operating characteristic (ROC) curves for predicting progression within 5 years using logistic regression models (n = 301, events = 19). Asterisks indicate significant model improvement compared to the EORTC model (Likelihood ratio test, BH-adjusted p-value below 0.05). AUC area under the curve, CI confidence interval. c Summary characteristics of the transcriptomic classes. Molecular features associated with the classes are mentioned, and suggestions for therapeutic options with potential clinical benefit are listed. MIBC muscle-invasive bladder cancer, EMT epithelial-mesenchymal transition, CTLs cytotoxic T lymphocytes. Source data are provided as a Source data file.
Fig. 6 Validation of transcriptomic classes in independent cohorts. a Summary of classification results and stage distribution for all tumors, tumors with microarray data and tumors with RNA-Seq data (1228 tumors were classified in total and 1225 of these were assigned to a class). b Association of tumor stage, tumor grade and FGFR3 and TP53 mutation status with transcriptomic classes. P-values were calculated using two-sided Fisher’s exact test. c Kaplan–Meier plot of progression-free survival (PFS) for 511 patients stratified by transcriptomic class. The p-value was calculated using two-sided logrank test. d Association of regulon activities (active vs. repressed status) with transcriptomic classes in the UROMOL cohort (including samples with positive silhouette scores, n = 505) and transcriptomic classes in the independent cohorts (pooled). The heatmap illustrates BH-adjusted p-values from two-sided Fisher’s exact tests. e Pathway enrichment scores within transcriptomic classes in the UROMOL cohort (including samples with positive silhouette scores, n = 505) and transcriptomic classes in the independent cohorts (pooled). Asterisks indicate significant association between pathway and class (one class vs. all other classes, two-sided Wilcoxon rank-sum test, BH-adjusted p-value below 0.05). Triangles indicate direction swaps of pathway enrichment in the independent cohorts compared to the UROMOL cohort. GSVA gene set variation analysis. Source data are provided as a Source data file.
Fig. 1 Landscape of somatic mutation profiles in HCC samples. A Mutation information of each gene in each sample was shown in the waterfall plot, where different colors with specific annotations at the bottom meant the various mutation types. The barplot above the legend exhibited the number of mutation burden. B Cohort summary plot displaying distribution of variants according to variant classification, type and SNV class. Bottom part (from left to right) indicates mutation load for each sample, variant classification type. A stacked barplot shows top ten mutated genes. C TCGA HCC样品降雨图,每个点都是一个根据SNV类型编码的突变颜色 (Rainfall plot of TCGA HCC sample TCGA−UB−A7MB−01A−11D−A33Q−10. Each point is a mutation color coded according to SNV class.) D 显示肝癌中SNV分布的 (核苷酸)转变及反转,可分为6个转变和反转事件。堆叠条形图显示了MAF文件中每个样本的突变谱分布 (Transition and transversion plot displaying distribution of SNVs in HCC classified into six transition and transversion events. Stacked bar plot (bottom) shows distribution of mutation spectra for every sample in the MAF file. E 突变基因间的一致性和排他性联系 (The coincident and exclusive associations across mutated genes). TMB与年龄的相关性 (The correlation of TMB with age) (F), gender (G) and T status (H)
Fig. 2 Construction of weighted gene co-expression network of HCC samples.
A Sample dendrogram and clinical-traits heatmap was plotted. B Selection of the soft threshold made the index of scale-free topologies reach 0.90 and analysis of the average connectivity of 1–20 soft threshold power. C TMB-related genes with similar expression patterns were merged into the same module using a dynamic tree-cutting algorithm, creating a hierarchical clustering tree. D Heatmap of the correlations between the modules and TMB value (traits). Within every square, the number on the top refers to the coefficient between the TMB level and corresponding module, and the bottom is the P value
Fig. 3 Differential analysis of gene expression data in high- and low-TMB groups and enrichment pathway annotation. A Volcano plot was delineated to visualize the DEGs. Red represented upregulated and green represented downregulated. B Heatmap of top 40 DEGswas drawn to reveal different distribution of expression state, where the colors of red to blue represented alterations from high expression to low expression. C Venn diagram of the hub genes from WGCNA blue module and DEGs. Pathway enrichment analyses of TMB hub genes. D Gene Ontology (GO) enrichment analysis of naïve B cells-related genes: biological processes (BP), cellular components (CC) and molecular function (MF). E KEGG enrichment analysis of naïve B cells-related genes.
Fig. 4 发现组预后风险特征的验证 (Validation of the prognostic risk signature in discovery group). A Heatmap presents the expression pattern of three hub genes in each patient. B 多基因签名风险评分分布 (Distribution of multi-genes signature risk score). C The survival status and interval of HCC patients. D Kaplan–Meier curve analysis presenting difference of overall survival between the high-risk and low-risk groups. E 体细胞突变数分布 (Distribution of somatic mutation count). F 总生存期的单因素Cox回归分析 (Univariate Cox regression analyses of overall survival). G多因素Cox回归分析总生存期 (Multivariate Cox regression analyses of overall survival).
Fig. 5 预后风险特征的临床意义 (Clinical significance of the prognostic risk signature). A 热图显示每个样本的临床特征,及相应的风险评分的分布情况。高、低风险评分组临床变量亚型的比例 (Heatmap presents the distribution of clinical feature and corresponding risk score in each sample. Rate of clinical variables subtypes in high or low risk score groups). B Age, C Gender, D WHO grade, E clinical stage, F T status, G N status and H M status
Extended Data Figure 1 | Cohort description and workflow. a, Venn diagram of samples analysed by whole-exome (WES), whole genome (CGI) and whole transcriptome (RNA-seq) sequencing in this cohort.