--- title: "Grouped analysis across features" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Grouped analysis across features} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(ptestR) library(dplyr) library(tidyr) library(purrr) ``` ## The real use case In neuroimaging and multivariate biomedical research, it is common to run the same model across many features simultaneously — brain regions, EEG channels, metabolites, or questionnaire subscales. ptestR is designed for exactly this pattern via the `nest()` + `map()` + `unnest()` idiom from the tidyverse. --- ## Simulated dataset We simulate a dataset with 15 subjects, each measured at 3 time points, across 2 groups and 4 features. This gives 45 observations per feature — enough for `lmer` to fit random intercepts per subject comfortably. ```{r simulate} set.seed(42) n_subj <- 15 n_time <- 3 features <- c("feature_A", "feature_B", "feature_C", "feature_D") # Subject-level covariates subjects <- data.frame( subject = seq_len(n_subj), group = rep(c("control", "case"), length.out = n_subj), age = round(rnorm(n_subj, mean = 35, sd = 8), 1) ) # Expand to subject × time × feature, then add values df <- expand.grid( subject = seq_len(n_subj), time = seq_len(n_time), feature = features, stringsAsFactors = FALSE ) |> left_join(subjects, by = "subject") |> mutate( # Random intercept per subject (shared across features) rand_int = rep(rnorm(n_subj, sd = 1), times = n_time * length(features)), value = case_when( feature == "feature_A" ~ 0.6 * (group == "case") + 0.2 * time + rand_int + rnorm(n()), feature == "feature_B" ~ 0.1 * (group == "case") + 0.1 * time + rand_int + rnorm(n()), feature == "feature_C" ~ -0.5 * (group == "case") + 0.3 * time + rand_int + rnorm(n()), feature == "feature_D" ~ 0.0 * (group == "case") + 0.0 * time + rand_int + rnorm(n()) ) ) |> select(subject, time, group, age, feature, value) glimpse(df) ``` Each nested sub-dataset (one per feature) will have `n_subj × n_time = 45` rows with 15 distinct subjects — well within `lmer`'s requirements. --- ## Running across features with `grouped_perm_glmm()` The pattern is: `group_by()` the feature → `nest()` → `map()` the permutation function → `unnest()` the results. ```{r grouped-glmm} results <- df |> group_by(feature) |> nest() |> mutate( perm = map(data, \(d) grouped_perm_glmm( tbl = d, formla = value ~ group + age + time + (1 | subject), var_to_perm = "value", permNum = 499, seed = 42 )) ) |> unnest(perm) |> select(-data) results ``` --- ## Filtering to a term of interest Filter to the `group` effect across all features: ```{r filter-group} group_results <- results |> filter(term == "groupcontrol") |> arrange(p.perm) group_results |> select(feature, estimate, statistic, p.perm) ``` Feature A and C should show the strongest effects (we simulated them that way), while feature D should be near noise. --- ## FDR correction With multiple features, apply FDR correction across all models for each term: ```{r fdr} group_results <- group_results |> mutate(p.fdr = p.adjust(p.perm, method = "fdr")) group_results |> select(feature, estimate, statistic, p.perm, p.fdr) ``` --- ## Handling `p.perm = 0` When no permuted statistic is as extreme as the observed one, `p.perm = 0`. This is not a true zero — it means the effect is stronger than all `permNum` permuted null statistics. Report as `p < 1/permNum`: ```{r zero-perm} group_results |> mutate( p.label = ifelse(p.perm == 0, paste0("p < ", round(1 / 499, 4)), paste0("p = ", round(p.perm, 3))) ) |> select(feature, p.label, p.fdr) ``` With `permNum = 9999` (recommended for publication), the bound becomes `p < 0.0001`. --- ## Plotting the results A simple dot plot showing effect size and significance: ```{r dotplot, fig.width = 5.5, fig.height = 3.5} if (requireNamespace("ggplot2", quietly = TRUE)) { library(ggplot2) group_results |> mutate(sig = p.fdr < 0.05) |> ggplot(aes(x = estimate, y = feature, colour = sig)) + geom_vline(xintercept = 0, linetype = "dashed", colour = "#DCC3AA") + geom_point(size = 4) + scale_colour_manual( values = c("TRUE" = "#810B38", "FALSE" = "#DCC3AA"), labels = c("TRUE" = "FDR < 5%", "FALSE" = "ns"), name = NULL ) + labs( title = "Group effect by feature", x = "Coefficient estimate", y = NULL ) + theme_minimal(base_size = 12) } ``` --- ## Using `grouped_perm_glm()` instead If your data have no repeated measures, swap to `grouped_perm_glm()` and drop the random-effects term: ```{r grouped-glm} # Summarise to one observation per subject × feature (e.g. mean across time) df_subj <- df |> group_by(subject, group, age, feature) |> summarise(value = mean(value), .groups = "drop") glm_results <- df_subj |> group_by(feature) |> nest() |> mutate( perm = map(data, \(d) grouped_perm_glm( tbl = d, formla = value ~ group + age, var_to_perm = "value", permNum = 499, seed = 42 )) ) |> unnest(perm) |> select(-data) glm_results |> filter(term == "groupcontrol") |> select(feature, estimate, statistic, p.perm) |> arrange(p.perm) ``` --- ## Tips for large-scale analyses **Use more permutations for publication.** 499 is used here for speed; use 9 999 or more for final results. **Parallelise with `furrr`.** Each `map()` call is independent — swap `purrr::map()` for `furrr::future_map()` for free parallelisation: ```{r parallel, eval = FALSE} library(furrr) plan(multisession, workers = 4) results <- df |> group_by(feature) |> nest() |> mutate( perm = future_map(data, \(d) grouped_perm_glmm( d, formla = value ~ group + age + time + (1 | subject), var_to_perm = "value", permNum = 9999, seed = 42 ), .options = furrr_options(seed = TRUE)) ) |> unnest(perm) |> select(-data) ``` --- ## Further reading - `vignette("getting-started")` — core concepts and `grouped_perm_glm()` - `vignette("mixed-effects-models")` — `grouped_perm_glmm()` in depth - `?grouped_perm_glmm`, `?grouped_perm_glm`