Title: | Classifying High-Dimensional Phenotypes with Ensemble Learning |
---|---|
Description: | A system for binary and multi-class classification of high-dimensional phenotypic data using ensemble learning. By combining predictions from different classification models, this package attempts to improve performance over individual learners. The pre-processing, training, validation, and testing are performed end-to-end to minimize user input and simplify the process of classification. |
Authors: | Jay Devine [aut, cre, cph], Benedikt Hallgrimsson [aut] |
Maintainer: | Jay Devine |
License: | GPL (>= 3) |
Version: | 0.1.0 |
Built: | 2025-02-15 03:56:51 UTC |
Source: | https://github.com/jaydevine/pheble |
The ph_anomaly function detects and removes anomalies with an autoencoder (method = "ae", the default) or an extended isolation forest (method = "iso"). Because it is general purpose, it can be applied to a variety of data types. The parameters in this function (e.g., activation, hidden, dropout_ratio) can be supplied as lists or vectors (see parameter details) to perform a grid search for the optimal hyperparameter combination. The autoencoder with the lowest reconstruction error is selected as the best model.
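The idea of scoring observations by reconstruction error can be sketched in base R. This is only an illustration under stated assumptions: a PCA projection stands in for the autoencoder (it is not the H2O model the package uses), the data are simulated, and the 99th-percentile cutoff is an arbitrary choice rather than the rule ph_anomaly applies.

```r
# Illustration only: PCA stands in for the autoencoder; this is not pheble code.
set.seed(123)

# Simulate 200 observations that lie close to a 2-D plane in 6-D space.
n <- 200
z <- matrix(rnorm(n * 2), ncol = 2)
w <- rbind(c(1, 1, 1, 0, 0, 0),
           c(0, 0, 0, 1, 1, 1))
x <- z %*% w + matrix(rnorm(n * 6, sd = 0.1), ncol = 6)

# Plant one anomaly by pushing row 1 off the plane.
x[1, ] <- x[1, ] + c(3, -3, 0, 3, -3, 0)

# "Encode" each row to k components, then "decode" back to the original space.
k <- 2
pca <- prcomp(x, center = TRUE, scale. = FALSE)
recon <- pca$x[, seq_len(k), drop = FALSE] %*%
  t(pca$rotation[, seq_len(k), drop = FALSE])
recon <- sweep(recon, 2, pca$center, `+`)

# The per-row mean squared reconstruction error acts as the anomaly score.
anom_score <- rowMeans((x - recon)^2)

# Flag rows above an arbitrary cutoff (99th percentile, an assumption).
cutoff <- quantile(anom_score, 0.99)
anomalies <- which(anom_score > cutoff)

# Drop the flagged rows to obtain the cleaned data.
x_clean <- x[-anomalies, , drop = FALSE]
```

Rows that the low-dimensional model cannot reproduce receive large scores, which is why the planted row stands out.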
ph_anomaly( df, ids_col, class_col, method = "ae", scale = FALSE, center = NULL, sd = NULL, max_mem_size = "15g", port = 54321, train_seed = 123, hyper_params = list(), search = "random", tune_length = 100 )
df |
A |
ids_col |
A |
class_col |
A |
method |
A |
scale |
A |
center |
Either a |
sd |
Either a |
max_mem_size |
A |
port |
A |
train_seed |
A |
hyper_params |
A |
search |
A |
tune_length |
A |
A list containing the following components:
df |
The data frame with anomalies removed. |
model |
The best model from the grid search used to detect anomalies. |
anom_score |
A data frame of predicted anomaly scores. |
## Import data.
data(ph_crocs)
## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Alternatively, remove anomalies with extended isolation forest. Notice
## that port is defined, because running H2O sessions one after another
## can return connection errors.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "iso", port = 50001)
Abstract: The degree to which the ontogeny of organisms could facilitate our understanding of phylogenetic relationships has long been a subject of contention in evolutionary biology. The famed notion that ‘ontogeny recapitulates phylogeny’ has been largely discredited, but there remains an expectation that closely related organisms undergo similar morphological transformations throughout ontogeny. To test this assumption, we used three-dimensional geometric morphometric methods to characterize the cranial morphology of 10 extant crocodylian species and construct allometric trajectories that model the post-natal ontogenetic shape changes. Using time-calibrated molecular and morphological trees, we employed a suite of comparative phylogenetic methods to assess the extent of phylogenetic signal in these trajectories. All analyses largely demonstrated a lack of significant phylogenetic signal, indicating that ontogenetic shape changes contain little phylogenetic information. Notably, some Mantel tests yielded marginally significant results when analysed with the morphological tree, which suggest that the underlying signal in these trajectories is correlated with similarities in the adult cranial morphology. However, despite these instances, all other analyses, including more powerful tests for phylogenetic signal, recovered statistical and visual evidence against the assumption that similarities in ontogenetic shape changes are commensurate with phylogenetic relatedness and thus bring into question the efficacy of using allometric trajectories for phylogenetic inference.
ph_crocs
ph_crocs
A data frame of Procrustes superimposed shape data with 183 rows and 236 columns:
Biosample
Species
...
Downloaded from <doi:10.5061/dryad.14fn1>
The ph_ctrl function automatically generates a trControl object. This can be used in the train function to automatically tune hyperparameters for every classification model in the ensemble.
ph_ctrl( class, resample_method = "boot", number = ifelse(grepl("cv", resample_method, ignore.case = TRUE), 10, 25), repeats = ifelse(grepl("dcv$", resample_method, ignore.case = TRUE), 3, NA), search = "random", sampling = NULL )
class |
A |
resample_method |
A |
number |
An |
repeats |
An |
search |
A |
sampling |
A |
A trainControl object for the train function.
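The default values of number and repeats in the usage above depend on resample_method. The sketch below evaluates those two default expressions in isolation so that their behaviour is easy to inspect. The helper name resample_defaults is hypothetical and not part of the package, and the method names passed to it are assumed to follow the resampling names accepted by trainControl.

```r
# Hypothetical helper that mirrors the default expressions in the usage above.
resample_defaults <- function(resample_method) {
  # Any method containing "cv" gets 10 folds; everything else 25 resamples.
  number <- ifelse(grepl("cv", resample_method, ignore.case = TRUE), 10, 25)
  # Only methods ending in "dcv" (e.g. "repeatedcv") get repeats.
  repeats <- ifelse(grepl("dcv$", resample_method, ignore.case = TRUE), 3, NA)
  data.frame(resample_method, number, repeats)
}

defaults <- resample_defaults(c("boot", "cv", "repeatedcv", "LOOCV"))
print(defaults)
```

With these expressions, bootstrapping defaults to 25 resamples, cross-validation variants default to 10, and only a method whose name ends in "dcv" receives 3 repeats.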
## Import data.
data(ph_crocs)
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
The ph_ensemble function uses classification predictions from a list of algorithms to train an ensemble model. The list can contain manually trained algorithms from train or, more conveniently, the output from ph_train. The hyperparameter tuning and model evaluations are handled internally to simplify the ensembling process. This function assumes some preprocessing has been performed, hence the training, validation, and test set requirements.
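The general idea of training an ensemble on model predictions (stacking) can be sketched in base R. This is a simplified binary illustration on simulated data, not the package's implementation: the two glm base learners and the glm metalearner are stand-ins, and fitting the metalearner on validation-set predictions is an assumption about the workflow rather than something the text above states.

```r
# Illustration only: a minimal stacking workflow on simulated binary data.
set.seed(123)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, "pos", "neg"))
dat <- data.frame(x1, x2, y)

# Split into training, validation, and test sets.
idx <- sample(rep(c("train", "vali", "test"), times = c(180, 60, 60)))
train_df <- dat[idx == "train", ]
vali_df <- dat[idx == "vali", ]
test_df <- dat[idx == "test", ]

# Base learners: two logistic regressions that each see one predictor.
m1 <- glm(y ~ x1, data = train_df, family = binomial)
m2 <- glm(y ~ x2, data = train_df, family = binomial)

# Collect the base-learner probabilities for any data frame.
base_prob <- function(df) {
  data.frame(m1 = predict(m1, df, type = "response"),
             m2 = predict(m2, df, type = "response"))
}

# Metalearner: fit on the validation predictions (assumed workflow).
vali_meta <- cbind(base_prob(vali_df), y = vali_df$y)
meta <- glm(y ~ m1 + m2, data = vali_meta, family = binomial)

# Ensemble predictions and accuracy on the held-out test set.
test_prob <- predict(meta, base_prob(test_df), type = "response")
ensemble_pred <- factor(ifelse(test_prob > 0.5, "pos", "neg"),
                        levels = levels(dat$y))
acc <- mean(ensemble_pred == test_df$y)
```

Keeping the test set out of both fitting stages is what makes the final accuracy an honest estimate, which is one reason three separate sets are required.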
ph_ensemble( train_models, train_df, vali_df, test_df, class_col, ctrl, train_seed = 123, n_cores = 2, task = "multi", metric = ifelse(task == "multi", "Kappa", "ROC"), top_models = 3, metalearner = ifelse(task == "multi", "glmnet", "rf"), tune_length = 10, quiet = FALSE )
train_models |
A |
train_df |
A |
vali_df |
A |
test_df |
A |
class_col |
A |
ctrl |
A |
train_seed |
A |
n_cores |
An |
task |
A |
metric |
A |
top_models |
A |
metalearner |
A |
tune_length |
If |
quiet |
A |
A list containing the following components:
ensemble_test_preds |
The ensemble predictions for the test set. |
vali_preds |
The validation predictions for the top models. |
test_preds |
The test predictions for the top models. |
all_test_preds |
The test predictions for every successfully trained model. |
all_test_results |
The confusion matrix results obtained from comparing the model test predictions (i.e., original models and ensemble) against the actual test classes. |
ensemble_model |
The ensemble train object. |
var_imps |
The ensemble variable importances obtained via weighted averaging. The original train importances are multiplied by the model's importance in the ensemble, then averaged across models and normalized. |
train_df |
The training data frame. |
vali_df |
The validation data frame. |
test_df |
The test data frame. |
train_models |
The train models for the ensemble. |
ctrl |
A trainControl object. |
metric |
The summary metric used to select the optimal model. |
task |
The type of classification task. |
tune_length |
The maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid"). |
top_models |
The number of top methods selected for the ensemble. |
metalearner |
The algorithm used to train the ensemble. |
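The weighted averaging described for var_imps can be illustrated with a small worked example. The importance values and model weights below are made up, and rescaling the result to sum to 100 is an assumption, since the normalization used by the package is not specified here.

```r
# Illustration only: made-up importances for three predictors and three models.
imps <- matrix(c(60, 30, 10,
                 80, 15, 5,
                 50, 40, 10),
               nrow = 3,
               dimnames = list(c("PC1", "PC2", "PC3"),
                               c("lda", "nnet", "pda")))

# Made-up importance of each model within the ensemble.
model_weights <- c(lda = 0.5, nnet = 0.3, pda = 0.2)

# Multiply each model's importances (columns) by that model's weight.
weighted <- sweep(imps, 2, model_weights, `*`)

# Average across models, then normalize (sum to 100 is an assumption).
avg <- rowMeans(weighted)
var_imps <- 100 * avg / sum(avg)
print(var_imps)
```

A predictor therefore ranks highly only if it matters to the models that the ensemble itself relies on.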
## Import data.
data(ph_crocs)
## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda", "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Train the ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
ensemble_model <- ph_ensemble(train_models = train_models$train_models,
                              train_df = pc_dfs$train_df,
                              vali_df = pc_dfs$vali_df,
                              test_df = pc_dfs$test_df,
                              class_col = "Species",
                              ctrl = ctrl,
                              task = "multi",
                              top_models = 3,
                              metalearner = "glmnet",
                              tune_length = 25,
                              quiet = FALSE)
The ph_equate function ensures that the factor levels in all columns are equal. When classifications are heavily biased or inaccurate, they can return class predictions that do not contain every level in the original data. This can interfere with model evaluation functions, for example those that build a confusion matrix.
ph_equate(df, class)
df |
A |
class |
A |
A data frame of column-wise class predictions with class levels equal to the observed class.
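The effect of equating levels can be reproduced in base R. The sketch below is a minimal stand-in for the idea, not the package's code: each prediction column is rebuilt as a factor that carries every observed level, so a cross-tabulation against the observed classes is always square.

```r
# Illustration only: a base R stand-in for equating factor levels.
obs <- factor(c("A", "C", "B", "D", "E"))
preds <- data.frame(
  method_a = c("A", "B", "B", "C", "D"),
  method_b = c("A", "C", "B", "D", "C"),
  stringsAsFactors = TRUE
)

# Before: each column only has the levels it happened to predict.
levels_before <- sapply(preds, nlevels)

# Rebuild every column with the full set of observed levels.
equated <- as.data.frame(lapply(preds, function(col) {
  factor(as.character(col), levels = levels(obs))
}))

# After: a cross-tabulation against the observed classes is square.
cm <- table(pred = equated$method_b, obs = obs)
print(dim(cm))
```

Levels that a model never predicted appear as all-zero rows, which keeps per-class metrics aligned with the observed classes.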
## Make data frame of predicted classes with different levels.
## An internal or external column should contain the observed
## classes with every possible level.
obs <- as.factor(c("A", "C", "B", "D", "E"))
method_a <- c("A", "B", "B", "C", "D")
method_b <- c("A", "C", "B", "D", "C")
method_c <- c("A", "C", "B", "B", "C")
df <- data.frame(method_a, method_b, method_c)
df <- ph_equate(df = df, class = obs)
The ph_eval function generates a confusion matrix for binary or multi-class classification; for the multi-class case, the results are averaged across all class levels.
ph_eval(pred, obs)
pred |
A |
obs |
A |
A data.frame of confusion matrix evaluation results; for the multi-class case, the results are averaged across all class levels.
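Averaging across class levels can be shown with a small base R example. The labels are made up and the three statistics are chosen for illustration; the exact set of statistics that ph_eval reports is not listed here, so treat this as a sketch of the averaging step only.

```r
# Illustration only: macro-averaged statistics from a 3-class confusion matrix.
obs <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C"))
pred <- factor(c("A", "A", "B", "B", "B", "C", "C", "C", "C"),
               levels = levels(obs))

# Rows are predictions, columns are observed classes.
cm <- table(pred = pred, obs = obs)
tp <- diag(cm)

# Per-class statistics.
sensitivity <- tp / colSums(cm)  # true positives / observed in class
precision <- tp / rowSums(cm)    # true positives / predicted as class

# Average the per-class values to get one row of results.
results <- data.frame(accuracy = sum(tp) / sum(cm),
                      sensitivity = mean(sensitivity),
                      precision = mean(precision))
print(results)
```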
## Import data.
data(ph_crocs)
## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train a few models for ensemble, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda", "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Evaluate e.g. the first model.
test_pred <- predict(train_models$train_models[[1]], pc_dfs$test_df)
test_obs <- as.factor(pc_dfs$test_df$Species)
test_cm <- ph_eval(pred = test_pred, obs = test_obs)
The ph_iqr function computes the interquartile range.
ph_iqr(x, na.rm = FALSE, type = 7)
x |
A |
na.rm |
A |
type |
An |
The interquartile range.
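For reference, the interquartile range is the difference between the third and first quartiles. The sketch below computes it with base R's quantile and compares the result with stats::IQR. It assumes that the na.rm and type arguments have the same meaning as in those base functions, which the matching defaults suggest but the text does not confirm.

```r
# Illustration only: interquartile range from quantiles, using base R.
x <- c(2, 4, 4, 5, 7, 9, 30, NA)

# First and third quartiles, ignoring the missing value.
q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, type = 7)

# IQR is the third quartile minus the first quartile.
iqr_manual <- unname(q[2] - q[1])

# The built-in function gives the same answer.
iqr_builtin <- IQR(x, na.rm = TRUE, type = 7)
print(c(manual = iqr_manual, builtin = iqr_builtin))
```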
The ph_outs function identifies outliers with the interquartile range method.
ph_outs(x)
x |
A |
The outlier indices.
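A common form of the interquartile method flags values that fall more than 1.5 times the interquartile range beyond the quartiles. The sketch below uses that convention. The 1.5 multiplier is an assumption here, since the fences used by ph_outs are not stated.

```r
# Illustration only: outlier indices with the 1.5 * IQR rule (assumed fences).
x <- c(2, 4, 4, 5, 7, 9, 30)

# Quartiles and interquartile range.
q <- unname(quantile(x, probs = c(0.25, 0.75), type = 7))
iqr <- q[2] - q[1]

# Lower and upper fences.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Indices of the values outside the fences.
out_idx <- which(x < lower | x > upper)
print(out_idx)
```

Here the quartiles are 4 and 8, so the fences sit at -2 and 14 and only the seventh value is flagged.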
The ph_prep function splits a data frame into training, validation, and test sets, all while ensuring that every class is represented in each dataset. By default, it performs a Principal Component Analysis on the training set data and projects the validation and test data into that space. If a non-linear dimensionality reduction strategy is preferred instead, an autoencoder can be used to extract deep features. Note that the parameters max_mem_size, activation, hidden, dropout_ratio, rate, search, and tune_length are NULL unless an autoencoder, method = "ae", is used. In this case, lists or vectors can be supplied to these parameters (see parameter details) to perform a grid search for the optimal hyperparameter combination. The autoencoder with the lowest reconstruction error is selected as the best model.
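The default PCA workflow described above can be sketched with base R on the built-in iris data. This is an illustration under stated assumptions, not the package's code: the per-class splitting loop and the rounding of set sizes are hypothetical choices, and the 0.95 threshold is read as the cumulative proportion of variance to retain, following the pca_pct default.

```r
# Illustration only: stratified split, PCA on training data, projection of the rest.
set.seed(123)
df <- iris
vali_pct <- 0.15
test_pct <- 0.15

# Assign rows to sets class by class so every class appears in each set.
labels <- rep("train", nrow(df))
for (cl in levels(df$Species)) {
  i <- sample(which(df$Species == cl))
  n_test <- max(1, round(test_pct * length(i)))
  # Validation size is taken from what remains after the test rows.
  n_vali <- max(1, round(vali_pct * (length(i) - n_test)))
  labels[i[seq_len(n_test)]] <- "test"
  labels[i[n_test + seq_len(n_vali)]] <- "vali"
}

# Fit the PCA on the training rows only.
num_cols <- sapply(df, is.numeric)
pca <- prcomp(df[labels == "train", num_cols], center = TRUE, scale. = FALSE)

# Keep the fewest components that reach 95% cumulative variance.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pcs <- which(cum_var >= 0.95)[1]

# Training scores, plus validation and test rows projected into the same space.
train_pcs <- pca$x[, seq_len(n_pcs), drop = FALSE]
vali_pcs <- predict(pca, df[labels == "vali", num_cols])[, seq_len(n_pcs), drop = FALSE]
test_pcs <- predict(pca, df[labels == "test", num_cols])[, seq_len(n_pcs), drop = FALSE]
```

Fitting the rotation on the training rows alone keeps information from the validation and test sets out of the predictors.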
ph_prep( df, ids_col, class_col, vali_pct = 0.15, test_pct = 0.15, scale = FALSE, center = NULL, sd = NULL, split_seed = 123, method = "pca", pca_pct = 0.95, max_mem_size = "15g", port = 54321, train_seed = 123, hyper_params = list(), search = "random", tune_length = 100 )
df |
A |
ids_col |
A |
class_col |
A |
vali_pct |
A |
test_pct |
A |
scale |
A |
center |
Either a |
sd |
Either a |
split_seed |
A |
method |
A |
pca_pct |
If |
max_mem_size |
If |
port |
A |
train_seed |
A |
hyper_params |
A |
search |
If |
tune_length |
If |
A list containing the following components:
train_df |
The training set data frame. |
vali_df |
The validation set data frame. |
test_df |
The test set data frame. |
train_split |
The training set indices from the original data frame. |
vali_split |
The validation set indices from the original data frame. |
test_split |
The test set indices from the original data frame. |
vali_pct |
The percentage of training data used as validation data. |
test_pct |
The percentage of total data used as test data. |
method |
The dimensionality reduction method. |
## Import data.
data(ph_crocs)
## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Alternatively, preprocess data frame into train, validation, and test
## sets with latent variables as predictors. Notice that port is defined,
## because running H2O sessions one after another can cause connection
## errors.
ae_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "ae", port = 50001)
Abstract: Ecological character displacement between the sexes, and sexual selection, integrate into a convergent set of factors that produce sexual variation. Ecologically-modulated, sexually mediated variation within and between sexes may be a major contributor to the amount of total variation that selection can act on in species. Threespine stickleback (Gasterosteus aculeatus) display rapid adaptive responses and sexual variation in many phenotypic traits. We examined phenotypic variation in the skull, pectoral and pelvic girdles of threespine stickleback from two freshwater and two coastal marine sites on the Sunshine Coast of British Columbia, Canada, using an approach that avoids a priori assumptions about bimodal patterns of variation. We quantified shape and size of the cranial, pectoral and pelvic regions of sticklebacks in marine and freshwater habitats using 3D geometric morphometrics and an index of sexually mediated variation. We show that the expression of phenotypic variation is structured in part by the effects of both habitat (marine vs freshwater) and the effects of individual sites within each habitat. Relative size exerts variable influence, and patterns of phenotypic variation associated with sex vary among body regions. This fine-grained quantification of sexually mediated variation in the context of habitat difference and different anatomical structures indicates a complex relationship between genetically inferred sex and environmental factors, demonstrating that the interplay between shared genetic background and sexually mediated, ecologically-based selective pressures structures the phenotypic expression of complex traits.
ph_stickleback
ph_stickleback
A data frame of Procrustes superimposed shape data with 190 rows and 214 columns:
Biosample
Habitat
Population
Sex
...
Downloaded from <doi:10.5061/dryad.xd2547dkw>
The ph_train function automatically trains a set of binary or multi-class classification models to ultimately build a new dataset of predictions. The data preprocessing and hyperparameter tuning are handled internally to minimize user input and simplify the training.
ph_train( train_df, vali_df, test_df, class_col, ctrl, train_seed = 123, n_cores = 2, task = "multi", methods = "all", metric = ifelse(task == "multi", "Kappa", "ROC"), tune_length = 10, quiet = FALSE )
train_df |
A |
vali_df |
A |
test_df |
A |
class_col |
A |
ctrl |
A |
train_seed |
A |
n_cores |
An |
task |
A |
methods |
A |
metric |
A |
tune_length |
If |
quiet |
A |
A list containing the following components:
train_models |
The train models for the ensemble. |
train_df |
The training data frame. |
vali_df |
The validation data frame. |
test_df |
The test data frame. |
task |
The type of classification task. |
ctrl |
A list of resampling parameters used in trainControl . |
methods |
The names of the classification methods to ensemble. |
search |
The hyperparameter search strategy. |
n_cores |
The number of cores for parallel processing. |
metric |
The summary metric used to select the optimal model. |
tune_length |
The maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid"). |
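The default metric for the multi-class task is "Kappa". As a reminder of what that statistic measures, the sketch below computes Cohen's kappa by hand from a small made-up confusion matrix. It illustrates the statistic itself and is not taken from the package's internals.

```r
# Illustration only: Cohen's kappa from a confusion matrix of made-up labels.
obs <- factor(c("A", "A", "B", "B", "C", "C"))
pred <- factor(c("A", "A", "B", "C", "C", "C"), levels = levels(obs))
cm <- table(pred = pred, obs = obs)
n <- sum(cm)

# Observed agreement: proportion of cases on the diagonal.
p_obs <- sum(diag(cm)) / n

# Expected agreement by chance, from the row and column totals.
p_exp <- sum(rowSums(cm) * colSums(cm)) / n^2

# Kappa rescales observed agreement relative to chance agreement.
kappa <- (p_obs - p_exp) / (1 - p_exp)
print(kappa)
```

In this example the observed agreement is 5/6 and the chance agreement is 1/3, giving a kappa of 0.75. Unlike raw accuracy, kappa discounts the agreement expected from class frequencies alone.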
## Import data.
data(ph_crocs)
## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda", "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)