top of page
cihandcatingmembcr

Sonja Model (Sets 01 13) 33



Logistic regression was used to estimate ORs, using the additive model and adjusting for age and gender (and significant eigenvectors in the discovery set). Meta-analysis of the discovery and replication was conducted using the fixed-effects inverse variance method based on the β estimates and standard errors from each study.


Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets.




Sonja Model (Sets 01 13) 33




GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA contributes to the current need of GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at


To address these challenges, we present a non-parametric, unsupervised method called Gene Set Variation Analysis (GSVA). GSVA calculates sample-wise gene set enrichment scores as a function of genes inside and outside the gene set, analogously to a competitive gene set test. Further, it estimates variation of gene set enrichment over the samples independently of any class label. Conceptually, this methodology can be understood as a change in coordinate systems for gene expression data, from genes to gene sets. This transformation facilitates post-hoc construction of pathway-centric models, such as differential pathway activity identification or survival prediction. Further, we demonstrate the flexibility of GSVA by applying it to RNA-seq data.


Although the GSVA algorithm itself does not evaluate statistical significance for the enrichment of gene sets, significance with respect to a phenotype can be easily evaluated using conventional statistical models. Likewise, false discovery rates can be estimated by permuting the sample labels (Methods). We make no general prescription for thresholds of significance or false discovery, as these choices are highly context dependent and may vary according to each experiment. Examples of these techniques are provided in the following section.


Methods for gene set enrichment can be generally partitioned according to the criteria of supervised vs unsupervised, and population vs single sample assessments. Most GSE methods, such as GSEA [4], are supervised and population based, in that they compute an enrichment score per gene set to describe the entire data set, modeled on a phenotype (discrete, such as case-control, or continuous). The simplest of this genre is described by Tian et al.[6, 19], evaluated as the mean differential expression (e.g. case vs control) of a set of genes, compared to those genes not in the gene set. One of the major drawbacks of this method is that gene correlations are not taken into account, which might lead to an increased number of false-positive gene sets with respect to GSEA [30]. Many other supervised, population based approaches have also been described [12, 17, 20, 31-34].


In the first study, we simulated microarray data from a linear additive model with sample and probe effects for p=1,000 genes and two groups of samples (see Methods). Using this model we have generated data sets of increasing sample size and defined two gene sets formed by 30 genes each, where one gene set is differentially expressed (DE) and the other is not. For the DE gene set we considered strong and weak signal-to-noise ratios and two different fractions of DE genes (50% and 80%) resulting in four different simulation scenarios. Using the simulated data from each scenario, we have calculated pathway activity profiles with the four sample-wise GSE methods (GSVA, ssGSEA, PLAGE and the combined z-score) and applied a t-test on the DE and non-DE gene sets between the two groups of samples. Using the DE gene set and a significance threshold of α=0.05, we have estimated the statistical power of each method as function of the sample size. On the same data, but using the non-DE gene set, we have estimated the empirical type-I error rate at α=0.05. The results of this simulation in Figure 2 show that GSVA attains higher statistical power than the other three methods in each of the four simulated scenarios while providing similar control of the type-I error rate.


Comparison of statistical power and type-I error rate between GSVA, PLAGE, single sample GSEA (ssGSEA) and combined z-score (zscore). The averaged results of 1,000 simulations are depicted as function of the sample size on the x-axis, for each of the GSE methods. On the y-axis either the statistical power (A, C, E, G) or the empirical type-I error rate (B, D, F, H) is shown. Data were simulated from a linear additive model with sample and probe effects (see Methods) for p=1,000 genes. GSE scores were calculated with each method with respect to two gene sets, one of them differentially expressed (DE) and the other one not. Statistical power and empirical type-I error rates were estimated by performing a t-test on the DE and non-DE gene sets, respectively, at a significance level of α=0.05. These simulations were carried out under the following four different scenarios for the DE gene set: (A,B) weak signal-to-noise ratio, 50% of DE genes in the DE gene set; (C,D) strong signal-to-noise ratio, 50% of DE genes in the DE gene set; (E, F) weak signal-to-noise ratio, 80% of DE genes in the DE gene set; (G, H) strong signal-to-noise ratio, 80% of DE genes in the DE gene set.


In the second simulation study, we compared the accuracy of each GSE method to identify differential pathway activity by calling DE gene sets. For this, we used the previously defined four simulation scenarios as well as the linear additive model with a fixed sample size of n=60 and p=10,000 genes to simulate data of more realistic dimensions. We set the first 2,000 genes as DE and simulated 1,000 gene sets of which we defined 500 as DE (see Methods). For each simulated gene expression data set, GSE scores were calculated and a two-sample t-test was employed to call DE gene sets at 5% FDR. The performance of each GSE method was measured by the area under the ROC curve (AUC) across 100 independent simulations (see Methods). AUC values were calculated from the binary vector of DE calls to compare the ability of each method to identify DE gene sets at a genome-wide significance level. The results are shown in Figure 3. This figure shows that GSVA attains significantly higher mean AUC values than the other GSE methods (P


Finally, we carried out a third simulation study in the context of survival analysis. We used again the former linear additive model to simulate microarray data with p=1,000 genes and two groups of samples. This time, however, we performed a cross-validation study to assess predictive power using 50 gene sets, each consisting of 10 genes. One of the gene sets was set as DE between the two sample groups while the other 49 were not DE and formed by sampling uniformly at random among the other 990 genes. We used a fixed configuration on the magnitude of differential expression (strong) and on the fraction of DE genes in the DE gene set (50%). In a similar way to the survival simulation by Bair and Tibshirani [36], we generated survival times and censoring status for each observation with different parameters for each group of samples (see Methods). This setting was generated twice to have independent training and test data sets.


GSE scores were calculated separately on the training and test data. A Cox proportional hazards model (Cox PHM) was fitted to each GSE score profile in the training data. The model with the lowest p-value provided by the Wald test was used to predict risk on the test data. As baseline comparison, we also fitted a Cox PHM to each gene expression profile on the training data and selected the 10 genes, corresponding to the gene set size across all gene sets, with lowest p-values given by the Wald test to also predict risk on the test data.


The performance of each gene set and gene-level model (using 10 genes) on the test data was assessed by the concordance index. This simulation was repeated 100 times and four entire runs were performed on increasing sample sizes n=25,50,75,100 of the simulated data. In Figure 4 the distribution of concordance index values is reported separately for each method and sample size. GSVA provides higher mean and median concordance index values than the other methods at every of the four sample sizes and the difference in means is significant (P


Comparison of the predictive power for survival analysis of gene-level, GSVA, PLAGE, single sample GSEA (ssGSEA) and combined z-score (zscore) on simulated data. Each panel corresponds to a different sample size of the simulated data. The y-axis shows the concordance index values of predicting survival risk on test data from 100 independent simulations. On top of each boxplot the p-value of the t-test for no difference in means between GSVA and the corresponding method is reported. The method gene refers to a simple gene-level survival model built from the top 10 genes with lowest p-values reported by the Wald test performed on the training data. Diamonds indicate mean values in boxplots.


We next examined pathway models for predicting patient survival in ovarian serous cystadenocarcinoma (OV). We used a large gene expression experiment (n=588) from TCGA [40] to obtain pathway enrichment scores for each of the canonical gene sets (C2) in MSigDB, and compared the four GSE methods. We performed a five-fold cross-validation and calculated GSE scores separately on each training and testing partition of the data with each of the four compared methods. We also considered the original expression data for a simple gene-level model. On each of the training data sets, we fitted a Cox PHM for each gene set, and each gene, in the gene-level model. Then, we selected those five gene-sets, or genes in the gene-level model, with the lowest p-value of the Wald test for no effect on survival. Using the selected gene-sets, we fitted again a Cox PHM on the training data and used it to predict risk on the training and test data sets of GSE scores. We repeated this for the gene-level model. Finally, we assessed the predictive performance of those models, each of them representing a different method, by calculating the concordance index of the predicted risk. As Figure 6 shows, except for the training data set using the gene-level model, GSVA attains higher mean and median concordance index values than the other methods in both, training and testing data sets. 2ff7e9595c


1 view0 comments

Recent Posts

See All

Bubble Shooter 3 Download

Bubble Shooter 3 Download: Como jogar o jogo mais viciante de todos os tempos Você adora jogar jogos fáceis, divertidos e relaxantes?...

Kommentit


bottom of page