Simulation analysis (brainscore.simulate)
brainscore.simulate_demo.Rmd
Introduction
This tutorial demonstrates how to perform do simulation to assess the false positive rate of analysis with brainscore. The brainscore
function calculates the gene set scores and correlates them with the predictor variable. The simulation analysis is performed by
permuting the predictor variable (
sim_type ='randomize_pred'
);spinning the brain data (
sim_type ='spin_brain'
);resampling the gene data (
sim_type ='resampling_gene'
).
The simulation analysis is performed to assess the false positive rate (Type1 error; sim_setting ='type1'
) of the analysis, as when the above elements are permuted, we expect no the correlation between the gene set scores and the predictor variable. For demonstration purposes, we have used a small number of iterations (sim_n = 5) and one subsample size (subsample_size = 100). In practice, we recommend using a larger number of iterations (i.e. 1000), various subsample sizes (e.g., 50, 100, 150, 200) and larger size of null models for a more robust analysis.
1. Prepare Data for the Analysis
First, we will prepare the data needed for enrichment analysis. The data is simulated (not real data) from the HCP dataset, and we’ll be using various pre-defined gene sets from SynGO and brain data for analysis.
1.1. Load sample data simulated from HCP data.
The data is simulated from HCP data. Let’s load and inspect it.
data(sim_hcp)
str(sim_hcp)
#> 'data.frame': 100 obs. of 37 variables:
#> $ L_bankssts : num 2.56 2.48 2.47 2.63 2.48 ...
#> $ L_caudalanteriorcingulate : num 2.56 2.7 2.63 2.48 2.43 ...
#> $ L_caudalmiddlefrontal : num 2.52 2.75 2.57 2.67 2.68 ...
#> $ L_cuneus : num 1.7 1.92 1.87 1.87 1.85 ...
#> $ L_entorhinal : num 3.51 3.38 3.38 3.15 2.63 ...
#> $ L_fusiform : num 2.57 2.64 2.7 2.55 2.56 ...
#> $ L_inferiorparietal : num 2.42 2.51 2.51 2.56 2.54 ...
#> $ L_inferiortemporal : num 2.55 2.71 2.76 2.8 2.76 ...
#> $ L_isthmuscingulate : num 2.18 2.54 2.43 2.41 2.47 ...
#> $ L_lateraloccipital : num 2 2.25 2.21 2.17 2.29 ...
#> $ L_lateralorbitofrontal : num 2.79 2.77 2.7 2.84 2.67 ...
#> $ L_lingual : num 1.84 2.07 2.11 2.12 2.07 ...
#> $ L_medialorbitofrontal : num 2.54 2.43 2.41 2.68 2.45 ...
#> $ L_middletemporal : num 2.75 2.64 2.8 2.9 2.81 ...
#> $ L_parahippocampal : num 2.88 2.47 2.68 3.1 2.98 ...
#> $ L_paracentral : num 2.32 2.43 2.38 2.56 2.47 ...
#> $ L_parsopercularis : num 2.63 2.67 2.47 2.57 2.72 ...
#> $ L_parsorbitalis : num 2.86 2.74 3.08 2.78 2.61 ...
#> $ L_parstriangularis : num 2.44 2.39 2.54 2.51 2.47 ...
#> $ L_pericalcarine : num 1.3 1.87 1.86 1.82 1.61 ...
#> $ L_postcentral : num 1.74 2.18 2.14 2.42 2.13 ...
#> $ L_posteriorcingulate : num 2.34 2.72 2.32 2.38 2.52 ...
#> $ L_precentral : num 2.54 2.7 2.69 2.79 2.64 ...
#> $ L_precuneus : num 2.26 2.35 2.31 2.34 2.36 ...
#> $ L_rostralanteriorcingulate: num 3.06 2.72 2.91 2.57 2.83 ...
#> $ L_rostralmiddlefrontal : num 2.34 2.42 2.39 2.57 2.49 ...
#> $ L_superiorfrontal : num 2.81 2.83 2.71 2.77 2.85 ...
#> $ L_superiorparietal : num 2.06 2.24 2.16 2.35 2.29 ...
#> $ L_superiortemporal : num 2.73 2.9 2.91 3.09 2.86 ...
#> $ L_supramarginal : num 2.42 2.48 2.62 2.71 2.62 ...
#> $ L_frontalpole : num 2.8 2.75 3.1 2.87 3.29 ...
#> $ L_temporalpole : num 3.52 3.44 3.89 3.67 2.87 ...
#> $ L_transversetemporal : num 2.16 2.64 2.71 2.58 2.47 ...
#> $ L_insula : num 2.83 3.23 2.95 3.16 3.04 ...
#> $ Age : int 29 28 36 29 28 27 29 26 34 31 ...
#> $ Sex : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 2 ...
#> $ BMI : num 37.2 23.8 30.1 32.1 25.7 ...
1.2. prepare brain_data for scoring
Next, we’ll prepare the brain data by selecting the relevant columns and transforming the data for scoring.
brain_data <- dplyr::select(sim_hcp, starts_with("L_")) %>% t()
colnames(brain_data) <- paste0("sub-", 1:ncol(brain_data))
str(brain_data)
#> num [1:34, 1:100] 2.56 2.56 2.52 1.7 3.51 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:34] "L_bankssts" "L_caudalanteriorcingulate" "L_caudalmiddlefrontal" "L_cuneus" ...
#> ..$ : chr [1:100] "sub-1" "sub-2" "sub-3" "sub-4" ...
1.3. Load gene expression data for the Analysis
We will now load gene expression data for the analysis. This data is based on a specific brain atlas (Desikan) and includes gene expression levels (correlation across donnors > 0.4) for the left hemisphere.
gene_data <- get_geneExp(atlas = "desikan", rdonor = "r0.4", hem = "L")
str(gene_data)
#> num [1:34, 1:6513] 0.469 0.607 0.512 0.341 0.664 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:34] "L_bankssts" "L_caudalanteriorcingulate" "L_caudalmiddlefrontal" "L_cuneus" ...
#> ..$ : chr [1:6513] "A1BG" "A1BG-AS1" "AACS" "AADAT" ...
1.4. Load SynGO for the Analysis
We will load SynGO terms that will be used in the enrichment analysis. This data represents gene sets associated with various molecular functions.
annoData <- get_annoData(type = "SynGO")
geneSetList <- get_geneSetList(annoData)
length(geneSetList)
#> [1] 302
print(sprintf("Number of SynGO terms: %d", length(geneSetList)))
#> [1] "Number of SynGO terms: 302"
Filter the geneSetList (Optional)
Filtering the gene sets by size is optional and is shown here for demonstration purposes. This step is embedded in the brainenrich/brainscore function.
selected.gs <- filter_geneSetList(bg_genes = colnames(gene_data), geneSetList = geneSetList, minGSSize = 20, maxGSSize = 200)
print(sprintf("%d MF terms have gene size ranging between 20 and 200", length(selected.gs)))
2. Run the Simulation Analysis to assess the False Positive Rate (sim_setting ='type1'
)
In this section, we will run the enrichment analysis using the prepared data. We permute the data in different ways and subsample it to examine the false positive rate.
2.1. Simulation analysis with randomized pred variable i.e., BMI (sim_type ='randomize_pred'
)
Note: Since the gene set scores are already calculated, randomizing the predictor variable (BMI) offers limited information regarding the false positive rate of individual scoring. Basically, it is testing the false discovery rate of the correlation (i.e., lm model).
cov_df <- sim_hcp %>% dplyr::select(Age, Sex)
pred_df <- sim_hcp %>% dplyr::select(BMI)
res <- brainscore.simulate(
pred_df = pred_df,
cov_df = cov_df,
brain_data = brain_data,
gene_data = gene_data,
annoData = annoData,
sim_n = 5,
subsample_size = 100,
sim_setting = "type1",
sim_type = "randomize_pred",
cor_method = "pearson",
aggre_method = "mean",
minGSSize = 20,
maxGSSize = 200
)
str(res)
#> List of 5
#> $ sim_1_subsample_100: tibble [55 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_1_subsample_100: num [1:55] 0.735 0.916 0.517 0.588 0.315 ...
#> $ sim_2_subsample_100: tibble [55 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_2_subsample_100: num [1:55] 0.286 0.354 0.337 0.323 0.652 ...
#> $ sim_3_subsample_100: tibble [55 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_3_subsample_100: num [1:55] 0.622 0.731 0.884 0.219 0.487 ...
#> $ sim_4_subsample_100: tibble [55 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_4_subsample_100: num [1:55] 0.911 0.79 0.659 0.446 0.91 ...
#> $ sim_5_subsample_100: tibble [55 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_5_subsample_100: num [1:55] 0.907 0.696 0.425 0.348 0.44 ...
The result is a list of lists, where each sublist contains the significance testing results (p values) for each term, separately for both with and without FDR correction. Two types of statistical tests are performed: parametric (based on lm model itself; without “np”) and non-parametric (based on the permutation of gene sets; with “np”). The permutation is done by spinning the brain data.
2.2. Simulation analysis with spin_brain (sim_type ='spin_brain'
)
We simulate the brain data by spinning it and then calculate the gene set scores before correlating them with the predictor variable. Both parametric and non-parametric tests (here, based on spin_brain) are used.
cov_df <- sim_hcp %>% dplyr::select(Age, Sex)
pred_df <- sim_hcp %>% dplyr::select(BMI)
res <- brainscore.simulate(
pred_df = pred_df,
cov_df = cov_df,
brain_data = brain_data,
gene_data = gene_data,
annoData = annoData,
sim_n = 5,
subsample_size = 100,
sim_setting = "type1",
sim_type = "spin_brain",
cor_method = "pearson",
aggre_method = "mean",
n_perm = 10,
perm_id = perm_id_dk_lh_5000,
minGSSize = 20,
maxGSSize = 200
)
str(res)
#> List of 5
#> $ sim_1_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_1_subsample_100 : num [1:55] 0.6465 0.8591 0.1079 0.5399 0.0892 ...
#> ..$ np_pval_nofdr_sim_1_subsample_100: num [1:55] 0.727 0.727 0.182 0.818 0.364 ...
#> $ sim_2_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_2_subsample_100 : num [1:55] 0.264 0.22 0.162 0.211 0.35 ...
#> ..$ np_pval_nofdr_sim_2_subsample_100: num [1:55] 0.455 0.273 0.364 0.545 0.818 ...
#> $ sim_3_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_3_subsample_100 : num [1:55] 0.125 0.589 0.355 0.675 0.129 ...
#> ..$ np_pval_nofdr_sim_3_subsample_100: num [1:55] 0.273 0.455 0.818 0.909 0.455 ...
#> $ sim_4_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_4_subsample_100 : num [1:55] 0.2692 0.7928 0.1688 0.3648 0.0861 ...
#> ..$ np_pval_nofdr_sim_4_subsample_100: num [1:55] 0.545 0.636 0.455 0.727 0.273 ...
#> $ sim_5_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_5_subsample_100 : num [1:55] 0.717 0.393 0.319 0.115 0.826 ...
#> ..$ np_pval_nofdr_sim_5_subsample_100: num [1:55] 0.818 0.364 0.636 0.273 1 ...
The result is a list of lists, where each sublist contains the significance testing results (p values) for each term, separately for both with and without FDR correction. Two types of statistical tests are performed: parametric (based on lm model itself; without “np”) and non-parametric (based on the permutation of gene sets; with “np”). The permutation is done by spinning the brain data.
2.3. Simulation analysis with spin_brain (sim_type ='spin_brain'
) and pre-calculated null gsScore
As generation the null gsScore is time-consuming, we can use the pre-calculated null gsScore to speed up the simulation.
# pre-calculate the null gsScore
gsScore.spin_brain <- brainscore(
brain_data = brain_data,
gene_data = gene_data,
annoData = annoData,
null_model = "spin_brain",
n_perm = 10,
perm_id = perm_id_dk_lh_5000,
minGSSize = 20,
maxGSSize = 200
)
cov_df <- sim_hcp %>% dplyr::select(Age, Sex)
pred_df <- sim_hcp %>% dplyr::select(BMI)
res <- brainscore.simulate(
pred_df = pred_df,
cov_df = cov_df,
brain_data = brain_data,
gene_data = gene_data,
annoData = annoData,
gsScoreList.null = gsScore.spin_brain,
sim_n = 5,
subsample_size = 100,
sim_setting = "type1",
sim_type = "spin_brain",
cor_method = "pearson",
aggre_method = "mean",
n_perm = 10,
perm_id = perm_id_dk_lh_5000,
minGSSize = 20,
maxGSSize = 200
)
str(res)
#> List of 5
#> $ sim_1_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_1_subsample_100 : num [1:55] 0.6465 0.8591 0.1079 0.5399 0.0892 ...
#> ..$ np_pval_nofdr_sim_1_subsample_100: num [1:55] 0.727 0.727 0.182 0.818 0.364 ...
#> $ sim_2_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_2_subsample_100 : num [1:55] 0.264 0.22 0.162 0.211 0.35 ...
#> ..$ np_pval_nofdr_sim_2_subsample_100: num [1:55] 0.455 0.273 0.364 0.545 0.818 ...
#> $ sim_3_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_3_subsample_100 : num [1:55] 0.125 0.589 0.355 0.675 0.129 ...
#> ..$ np_pval_nofdr_sim_3_subsample_100: num [1:55] 0.273 0.455 0.818 0.909 0.455 ...
#> $ sim_4_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_4_subsample_100 : num [1:55] 0.2692 0.7928 0.1688 0.3648 0.0861 ...
#> ..$ np_pval_nofdr_sim_4_subsample_100: num [1:55] 0.545 0.636 0.455 0.727 0.273 ...
#> $ sim_5_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_5_subsample_100 : num [1:55] 0.717 0.393 0.319 0.115 0.826 ...
#> ..$ np_pval_nofdr_sim_5_subsample_100: num [1:55] 0.818 0.364 0.636 0.273 1 ...
To ensure that the gsScore.spin_brain was generated with the same parameters and can be used in the function, The attribute of gsScore.spin_brain should match the following argument of the brainscore.simulate: cor_method, aggre_method, n_perm, perm_id, minGSSize, maxGSSize.
2.4. Simulation analysis with resample_gene (sim_type ='resample_gene'
)
We simulate the gene sets by resampling genes and then calculate the gene set scores before correlating them with the predictor variable. Both parametric and non-parametric tests (here, based on resample_gene) are used.
cov_df <- sim_hcp %>% dplyr::select(Age, Sex)
pred_df <- sim_hcp %>% dplyr::select(BMI)
res <- brainscore.simulate(
pred_df = pred_df,
cov_df = cov_df,
brain_data = brain_data,
gene_data = gene_data,
annoData = annoData,
sim_n = 5,
subsample_size = 100,
sim_setting = "type1",
sim_type = "resample_gene",
cor_method = "pearson",
aggre_method = "mean",
n_perm = 10,
minGSSize = 20,
maxGSSize = 200
)
str(res)
#> List of 5
#> $ sim_1_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_1_subsample_100 : num [1:55] 0.4827 0.0331 0.0861 0.4237 0.2015 ...
#> ..$ np_pval_nofdr_sim_1_subsample_100: num [1:55] 1 0.182 0.364 0.727 0.727 ...
#> $ sim_2_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_2_subsample_100 : num [1:55] 0.1399 0.0245 0.11 0.075 0.9835 ...
#> ..$ np_pval_nofdr_sim_2_subsample_100: num [1:55] 0.727 0.182 0.364 0.273 1 ...
#> $ sim_3_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_3_subsample_100 : num [1:55] 0.9767 0.0284 0.0609 0.6526 0.1876 ...
#> ..$ np_pval_nofdr_sim_3_subsample_100: num [1:55] 1 0.182 0.364 0.909 0.727 ...
#> $ sim_4_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_4_subsample_100 : num [1:55] 0.0964 0.5168 0.5042 0.2902 0.8038 ...
#> ..$ np_pval_nofdr_sim_4_subsample_100: num [1:55] 0.455 0.818 0.818 0.727 0.909 ...
#> $ sim_5_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_5_subsample_100 : num [1:55] 0.0808 0.9064 0.0641 0.1122 0.0285 ...
#> ..$ np_pval_nofdr_sim_5_subsample_100: num [1:55] 0.455 0.909 0.364 0.364 0.273 ...
The result is a list of lists, where each sublist contains the significance testing results (p values) for each term, separately for both with and without FDR correction. Two types of statistical tests are performed: parametric (based on lm model itself; without “np”) and non-parametric (based on the permutation of gene sets; with “np”). The permutation is done by spinning the brain data.
2.5. Simulation analysis with resample_gene (sim_type ='resample_gene'
) and pre-calculated null gsScore
As generation the null gsScore is time-consuming, we can use the pre-calculated null gsScore to speed up the simulation.
# pre-calculate the null gsScore
gsScore.resample_gene <- brainscore(
brain_data = brain_data,
gene_data = gene_data,
annoData = annoData,
null_model = "resample_gene",
n_perm = 10,
minGSSize = 20,
maxGSSize = 200
)
cov_df <- sim_hcp %>% dplyr::select(Age, Sex)
pred_df <- sim_hcp %>% dplyr::select(BMI)
res <- brainscore.simulate(
pred_df = pred_df,
cov_df = cov_df,
brain_data = brain_data,
gene_data = gene_data,
annoData = annoData,
gsScoreList.null = gsScore.resample_gene,
sim_n = 5,
subsample_size = 100,
sim_setting = "type1",
sim_type = "resample_gene",
cor_method = "pearson",
aggre_method = "mean",
n_perm = 10,
minGSSize = 20,
maxGSSize = 200
)
str(res)
#> List of 5
#> $ sim_1_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_1_subsample_100 : num [1:55] 0.9139 0.0248 0.8958 0.1857 0.269 ...
#> ..$ np_pval_nofdr_sim_1_subsample_100: num [1:55] 1 0.273 0.818 0.636 0.636 ...
#> $ sim_2_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_2_subsample_100 : num [1:55] 0.21522 0.00712 0.05478 0.79082 0.24777 ...
#> ..$ np_pval_nofdr_sim_2_subsample_100: num [1:55] 0.7273 0.0909 0.2727 0.8182 0.5455 ...
#> $ sim_3_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_3_subsample_100 : num [1:55] 0.7798 0.0478 0.1206 0.3709 0.0784 ...
#> ..$ np_pval_nofdr_sim_3_subsample_100: num [1:55] 1 0.364 0.273 0.818 0.364 ...
#> $ sim_4_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_4_subsample_100 : num [1:55] 0.0575 0.0325 0.3707 0.5771 0.0417 ...
#> ..$ np_pval_nofdr_sim_4_subsample_100: num [1:55] 0.455 0.364 0.636 0.818 0.182 ...
#> $ sim_5_subsample_100: tibble [55 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ Dependent_vars : chr [1:55] "GO:0007268" "GO:0007416" "GO:0008021" "GO:0016079" ...
#> ..$ pval_nofdr_sim_5_subsample_100 : num [1:55] 0.703 0.141 0.327 0.238 0.458 ...
#> ..$ np_pval_nofdr_sim_5_subsample_100: num [1:55] 1 0.727 0.636 0.727 0.727 ...
Again, to ensure that the gsScore.resample_gene was generated with the same parameters and can be used in the function, The attribute of gsScore.resample_gene should match the following argument of the brainscore.simulate: cor_method, aggre_method, n_perm, perm_id, minGSSize, maxGSSize.
3 Run the Simulation Analysis to assess the power (sim_setting ='power'
)
Compared to the false positive rate, the power analysis is performed to assess the ability of the analysis to detect the true signal. As there is no grand truth in the simulation analysis, here we can estimate the power by assuming there is a signal and then examining the ability of the analysis in mutiple subsamples to detect it.
This analysis could be done by setting the sim_setting
to ‘power’ in the brainscore.simulate
function. The rest of the parameters are the same as in the type1 error simulation analysis.
res <- brainscore.simulate(
pred_df = pred_df,
cov_df = cov_df,
brain_data = brain_data,
gene_data = gene_data,
annoData = annoData,
gsScoreList.null = gsScore.resample_gene,
sim_n = 5,
subsample_size = 100,
sim_setting = "power",
sim_type = "resample_gene",
cor_method = "pearson",
aggre_method = "mean",
n_perm = 10,
minGSSize = 20,
maxGSSize = 200
)