Package 'pheble' reference manual

Title:	Classifying High-Dimensional Phenotypes with Ensemble Learning
Description:	A system for binary and multi-class classification of high-dimensional phenotypic data using ensemble learning. By combining predictions from different classification models, this package attempts to improve performance over individual learners. The pre-processing, training, validation, and testing are performed end-to-end to minimize user input and simplify the process of classification.
Authors:	Jay Devine [aut, cre, cph], Benedikt Hallgrimsson [aut]
Maintainer:	Jay Devine <[email protected]>
License:	GPL (>= 3)
Version:	0.1.0
Built:	2025-03-17 04:05:32 UTC
Source:	https://github.com/jaydevine/pheble

The ph_anomaly function detects and removes anomalies with an autoencoder. Because it is general purpose, it can be applied to a variety of data types. The parameters in this function (e.g., activation, hidden, dropout_ratio) can be supplied as lists or vectors (see parameter details) to perform a grid search for the optimal hyperparameter combination. The autoencoder with the lowest reconstruction error is selected as the best model.

Usage

ph_anomaly(
  df,
  ids_col,
  class_col,
  method = "ae",
  scale = FALSE,
  center = NULL,
  sd = NULL,
  max_mem_size = "15g",
  port = 54321,
  train_seed = 123,
  hyper_params = list(),
  search = "random",
  tune_length = 100
)
ph_anomaly(
  df,
  ids_col,
  class_col,
  method = "ae",
  scale = FALSE,
  center = NULL,
  sd = NULL,
  max_mem_size = "15g",
  port = 54321,
  train_seed = 123,
  hyper_params = list(),
  search = "random",
  tune_length = 100
)

Arguments

`df`	A `data.frame` containing a column of ids, a column of classes, and an arbitrary number of predictors.
`ids_col`	A `character` value for the name of the ids column.
`class_col`	A `character` value for the name of the class column.
`method`	A `character` value for the anomaly detection method: "ae" (default), "iso" (abbv. for extended isolation forest).
`scale`	A `logical` value for whether to scale the data: FALSE (default). Recommended if `method = "ae"` and if user intends to train other models.
`center`	Either a `logical` value or numeric-alike vector of length equal to the number of columns of data to scale in `df`, where ‘numeric-alike’ means that as.numeric(.) will be applied successfully if is.numeric(.) is not true: NULL (default). If `scale = TRUE`, this is set to `TRUE` and is used to subtract the mean.
`sd`	Either a `logical` value or a numeric-alike vector of length equal to the number of columns of data to scale in `df`: NULL (default). If `scale = TRUE`, this is set to `TRUE` and is used to divide by the standard deviation.
`max_mem_size`	A `character` value for the memory of an h2o session: "15g" (default).
`port`	A `numeric` value for the port number of the H2O server.
`train_seed`	A `numeric` value to set the control the randomness of creating resamples: 123 (default).
`hyper_params`	A `list` of hyperparameters to perform a grid search. If `method = "ae"`, the "default" list is: list(missing_values_handling = "Skip", activation = c("Rectifier", "Tanh"), hidden = list(5, 25, 50, 100, 250, 500, nrow(df_h2o)), input_dropout_ratio = c(0, 0.1, 0.2, 0.3), rate = c(0, 0.01, 0.005, 0.001)) If `method = "iso"`, the "default" list is: list(ntrees = c(50, 100, 150, 200), sample_size = c(64, 128, 256, 512))
`search`	A `character` value for the hyperparameter search strategy: "random" (default), "grid".
`tune_length`	A `numeric` value (integer) for either the maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid"): 100 (default).

Value

A list containing the following components:

`df`	The data frame with anomalies removed.

`model`	The best model from the grid search used to detect anomalies.

`anom_score`	A data frame of predicted anomaly scores.

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Alternatively, remove anomalies with extended isolation forest. Notice
## that port is defined, because running H2O sessions one after another
## can return connection errors.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "iso",
                      port = 50001)

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Alternatively, remove anomalies with extended isolation forest. Notice
## that port is defined, because running H2O sessions one after another
## can return connection errors.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "iso",
                      port = 50001)

Data from: The utility of cranial ontogeny for phylogenetic inference: a case study in crocodylians using geometric morphometrics

Description

Abstract: The degree to which the ontogeny of organisms could facilitate our understanding of phylogenetic relationships has long been a subject of contention in evolutionary biology. The famed notion that ‘ontogeny recapitulates phylogeny’ has been largely discredited, but there remains an expectation that closely related organisms undergo similar morphological transformations throughout ontogeny. To test this assumption, we used three-dimensional geometric morphometric methods to characterize the cranial morphology of 10 extant crocodylian species and construct allometric trajectories that model the post-natal ontogenetic shape changes. Using time-calibrated molecular and morphological trees, we employed a suite of comparative phylogenetic methods to assess the extent of phylogenetic signal in these trajectories. All analyses largely demonstrated a lack of significant phylogenetic signal, indicating that ontogenetic shape changes contain little phylogenetic information. Notably, some Mantel tests yielded marginally significant results when analysed with the morphological tree, which suggest that the underlying signal in these trajectories is correlated with similarities in the adult cranial morphology. However, despite these instances, all other analyses, including more powerful tests for phylogenetic signal, recovered statistical and visual evidence against the assumption that similarities in ontogenetic shape changes are commensurate with phylogenetic relatedness and thus bring into question the efficacy of using allometric trajectories for phylogenetic inference.

Usage

ph_crocs
ph_crocs

Format

`ph_crocs`

A data frame of Procrustes superimposed shape data with 183 rows and 236 columns:

Biosample: Biosample
Species: Species

...

Source

Downloaded from <\doi{dx.doi.org/10.5061/dryad.14fn1}>

Parameters for resampling and training a dataset.

Description

The ph_ctrl function automatically generates a trControl object. This can be used in the train function to automatically tune hyperparameters for every classification model in the ensemble.

Usage

ph_ctrl(
  class,
  resample_method = "boot",
  number = ifelse(grepl("cv", resample_method, ignore.case = TRUE), 10, 25),
  repeats = ifelse(grepl("dcv$", resample_method, ignore.case = TRUE), 3, NA),
  search = "random",
  sampling = NULL
)
ph_ctrl(
  class,
  resample_method = "boot",
  number = ifelse(grepl("cv", resample_method, ignore.case = TRUE), 10, 25),
  repeats = ifelse(grepl("dcv$", resample_method, ignore.case = TRUE), 3, NA),
  search = "random",
  sampling = NULL
)

Arguments

`class`	A `factor` value for training data classes.
`resample_method`	A `character` value for the resampling training method: "boot" (default), "cv", LOOCV", "repeatedcv".
`number`	An `integer` value for the number of resampling iterations (25 default for boot) or folds (10 default for cross-validation).
`repeats`	An `integer` value for the number of sets of folds for repeated cross-validation.
`search`	A `character` value for the hyperparameter search strategy: "random" (default), "grid".
`sampling`	A `character` value for the sampling strategy, sometimes used to fix class imbalances: `NULL` (default), "up", "down", "smote".

Value

A trainControl object for the train function.

Examples

## Import data.
data(ph_crocs)
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Import data.
data(ph_crocs)
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")

Classify phenotypes via ensemble learning.

Description

The ph_ensemble function uses classification predictions from a list of algorithms to train an ensemble model. This can be a list of manually trained algorithms from train or, more conveniently, the output from ph_train. The hyperparameter tuning and model evaluations are handled internally to simplify the ensembling process. This function assumes some preprocessing has been performed, hence the training, validation, and test set requirements.

Usage

ph_ensemble(
  train_models,
  train_df,
  vali_df,
  test_df,
  class_col,
  ctrl,
  train_seed = 123,
  n_cores = 2,
  task = "multi",
  metric = ifelse(task == "multi", "Kappa", "ROC"),
  top_models = 3,
  metalearner = ifelse(task == "multi", "glmnet", "rf"),
  tune_length = 10,
  quiet = FALSE
)
ph_ensemble(
  train_models,
  train_df,
  vali_df,
  test_df,
  class_col,
  ctrl,
  train_seed = 123,
  n_cores = 2,
  task = "multi",
  metric = ifelse(task == "multi", "Kappa", "ROC"),
  top_models = 3,
  metalearner = ifelse(task == "multi", "glmnet", "rf"),
  tune_length = 10,
  quiet = FALSE
)

Arguments

`train_models`	A `list` of at least two `train` models.
`train_df`	A `data.frame` containing a class column and the training data.
`vali_df`	A `data.frame` containing a class column and the validation data.
`test_df`	A `data.frame` containing a class column and the test data.
`class_col`	A `character` value for the name of the class column. This should be consistent across data frames.
`ctrl`	A `list` containing the resampling strategy (e.g., "boot") and other parameters for `trainControl`. Automatically create one via `ph_ctrl` or manually create it with `trainControl`.
`train_seed`	A `numeric` value to set the training seed and control the randomness of creating resamples: 123 (default).
`n_cores`	An `integer` value for the number of cores to include in the cluster: 2 (default). We highly recommend increasing this value to, e.g., parallel::detectCores() - 1.
`task`	A `character` value for the type of classification `task`: "multi" (default), "binary".
`metric`	A `character` value for which summary metric should be used to select the optimal model: "ROC" (default for "binary") and "Kappa" (default for "multi"). Other options include "logLoss", "Accuracy", "Mean_Balanced_Accuracy", and "Mean_F1".
`top_models`	A `numeric` value for the top n training models to ensemble: 3 (default). Every training model is ordered according to their final metric value (e.g., "ROC" or "Kappa") and the top n models are selected.
`metalearner`	A `character` value for the algorithm used to train the ensemble: "glmnet" (default), "rf". Other methods, such as those listed in ph_train methods, may also be used.
`tune_length`	If `search = "random"` (default), this is an `integer` value for the maximum number of hyperparameter combinations to test for each training model in the ensemble; if `search = "grid"`, this is an `integer` value for the number of levels of each hyperparameter to test for each model.
`quiet`	A `logical` value for whether progress should be printed: TRUE (default), FALSE.

Value

A list containing the following components:

`ensemble_test_preds`	The ensemble predictions for the test set.

`vali_preds`	The validation predictions for the top models.

`test_preds`	The test predictions for the top models.

`all_test_preds`	The test predictions for every successfully trained model.

`all_test_results`	The confusion matrix results obtained from comparing the model test predictions (i.e., original models and ensemble) against the actual test classes.

`ensemble_model`	The ensemble `train` object.

`var_imps`	The ensemble variable importances obtained via weighted averaging. The original train importances are multiplied by the model's importance in the ensemble, then averaged across models and normalized.

`train_df`	The training data frame.

`vali_df`	The validation data frame.

`test_df`	The test data frame.

`train_models`	The `train` models for the ensemble.

`ctrl`	A `trainControl` object.

`metric`	The summary metric used to select the optimal model.

`task`	The type of classification task.

`tune_length`	The maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid").

`top_models`	The number of top methods selected for the ensemble.

`metalearner`	The algorithm used to train the ensemble.

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda",
                         "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Train the ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
ensemble_model <- ph_ensemble(train_models = train_models$train_models,
                              train_df = pc_dfs$train_df,
                              vali_df = pc_dfs$vali_df,
                              test_df = pc_dfs$test_df,
                              class_col = "Species",
                              ctrl = ctrl,
                              task = "multi",
                              top_models = 3,
                              metalearner = "glmnet",
                              tune_length = 25,
                              quiet = FALSE)

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda",
                         "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Train the ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
ensemble_model <- ph_ensemble(train_models = train_models$train_models,
                              train_df = pc_dfs$train_df,
                              vali_df = pc_dfs$vali_df,
                              test_df = pc_dfs$test_df,
                              class_col = "Species",
                              ctrl = ctrl,
                              task = "multi",
                              top_models = 3,
                              metalearner = "glmnet",
                              tune_length = 25,
                              quiet = FALSE)

Equate factors levels.

Description

The ph_equate function ensures that the factor levels in all columns are equal. When classification are heavily biased or inaccurate, they can return new class predictions that do not contain every level in the original data. This can interfere with model evaluation functions e.g. via a confusion matrix.

Usage

ph_equate(df, class)
ph_equate(df, class)

Arguments

`df`	A `data.frame` of column-wise class predictions.
`class`	A `factor` value for the observed or classes.

Value

A data frame of column-wise class predictions with class levels equal to the observed class.

Examples

## Make data frame of predicted classes with different levels.
## An internal or external column should contain the observed
## classes with every possible level.
obs <- as.factor(c("A", "C", "B", "D", "E"))
method_a <- c("A", "B", "B", "C", "D")
method_b <- c("A", "C", "B", "D", "C")
method_c <- c("A", "C", "B", "B", "C")
df <- data.frame(method_a, method_b, method_c)
df <- ph_equate(df = df, class = obs)
## Make data frame of predicted classes with different levels.
## An internal or external column should contain the observed
## classes with every possible level.
obs <- as.factor(c("A", "C", "B", "D", "E"))
method_a <- c("A", "B", "B", "C", "D")
method_b <- c("A", "C", "B", "D", "C")
method_c <- c("A", "C", "B", "B", "C")
df <- data.frame(method_a, method_b, method_c)
df <- ph_equate(df = df, class = obs)

Evaluate a phenotype classification model.

Description

The ph_eval function generates a confusion matrix for binary or multi-class classification; for the multi-class case, the results are averaged across all class levels.

Usage

ph_eval(pred, obs)
ph_eval(pred, obs)

Arguments

`pred`	A `factor` value of predicted classes.
`obs`	A `factor` value of the observed or actual classes.

Value

A data.frame of confusion matrix evaluation results; for the multi-class case, the results are averaged across all class levels.

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train a few models for ensemble, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda",
                         "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Evaluate e.g. the first model.
test_pred <- predict(train_models$train_models[[1]], pc_dfs$test_df)
test_obs <- as.factor(pc_dfs$test_df$Species)
test_cm <- ph_eval(pred = test_pred, obs = test_obs)

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train a few models for ensemble, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda",
                         "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Evaluate e.g. the first model.
test_pred <- predict(train_models$train_models[[1]], pc_dfs$test_df)
test_obs <- as.factor(pc_dfs$test_df$Species)
test_cm <- ph_eval(pred = test_pred, obs = test_obs)

Compute interquartile range.

Description

The ph_iqr function computes the interquartile range.

Usage

ph_iqr(x, na.rm = FALSE, type = 7)
ph_iqr(x, na.rm = FALSE, type = 7)

Arguments

`x`	A `numeric` vector.
`na.rm`	A `logical` value: FALSE (default). If true, any NA is removed before quantiles are computed.
`type`	An `integer` value (1:9) selecting one of 9 quantile algorithms.

Value

The interquartile range.

Find outlier indices.

Description

The ph_outs function computes outliers with the interquartile method.

Usage

ph_outs(x)
ph_outs(x)

Arguments

`x`	A `numeric` vector.

Value

The outlier indices.

Preprocessing for phenotype classification via ensemble learning.

Description

The ph_prep function splits a data frame into training, validation, and test sets, all while ensuring that every class is represented in each dataset. By default, it performs a Principal Component Analysis on the training set data and projects the validation and test data into that space. If a non-linear dimensionality reduction strategy is preferred instead, an autoencoder can be used to extract deep features. Note that the parameters max_mem_size, activation, hidden, dropout_ratio, rate, search, and tune_length are NULL unless an autoencoder, method = "ae", is used. In this case, lists or vectors can be supplied to these parameters (see parameter details) to perform a grid search for the optimal hyperparameter combination. The autoencoder with the lowest reconstruction error is selected as the best model.

Usage

ph_prep(
  df,
  ids_col,
  class_col,
  vali_pct = 0.15,
  test_pct = 0.15,
  scale = FALSE,
  center = NULL,
  sd = NULL,
  split_seed = 123,
  method = "pca",
  pca_pct = 0.95,
  max_mem_size = "15g",
  port = 54321,
  train_seed = 123,
  hyper_params = list(),
  search = "random",
  tune_length = 100
)
ph_prep(
  df,
  ids_col,
  class_col,
  vali_pct = 0.15,
  test_pct = 0.15,
  scale = FALSE,
  center = NULL,
  sd = NULL,
  split_seed = 123,
  method = "pca",
  pca_pct = 0.95,
  max_mem_size = "15g",
  port = 54321,
  train_seed = 123,
  hyper_params = list(),
  search = "random",
  tune_length = 100
)

Arguments

`df`	A `data.frame` containing a column of unique ids, a column of classes, and an arbitrary number of `numeric` columns.
`ids_col`	A `character` value for the name of the ids column.
`class_col`	A `character` value for the name of the class column.
`vali_pct`	A `numeric` value for the percentage of training data to use as validation data: 0.15 (default).
`test_pct`	A `numeric` value for the percentage of total data to use as test data: 0.15 (default).
`scale`	A `logical` value for whether to scale the data: FALSE (default). Recommended if `method = "ae"` and if user intends to train other models.
`center`	Either a `logical` value or numeric-alike vector of length equal to the number of columns of data to scale in `df`, where ‘numeric-alike’ means that as.numeric(.) will be applied successfully if is.numeric(.) is not true: NULL (default). If `scale = TRUE`, this is set to `TRUE` and is used to subtract the mean.
`sd`	Either a `logical` value or a numeric-alike vector of length equal to the number of columns of data to scale in `df`: NULL (default). If `scale = TRUE`, this is set to `TRUE` and is used to divide by the standard deviation.
`split_seed`	A `numeric` value to set the seed and control the randomness of splitting the data: 123 (default).
`method`	A `character` value for the dimensionality reduction method: "pca" (default), "ae", "none".
`pca_pct`	If `method = "pca"`, a `numeric` value for the proportion of variance to subset the PCA with: 0.95 (default).
`max_mem_size`	If `method = "ae"`, a `character` value for the memory of an h2o session: "15g" (default).
`port`	A `numeric` value for the port number of the H2O server.
`train_seed`	A `numeric` value to set the control the randomness of creating resamples: 123 (default).
`hyper_params`	A `list` of hyperparameters to perform a grid search. the "default" list is: list(missing_values_handling = "Skip", activation = c("Rectifier", "Tanh"), hidden = list(5, 25, 50, 100, 250, 500, nrow(df_h2o)), input_dropout_ratio = c(0, 0.1, 0.2, 0.3), rate = c(0, 0.01, 0.005, 0.001)).
`search`	If `method = "ae"`, a `character` value for the hyperparameter search strategy: "random" (default), "grid".
`tune_length`	If `method = "ae"`, a `numeric` value (integer) for either the maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid").

Value

A list containing the following components:

`train_df`	The training set data frame.

`vali_df`	The validation set data frame.

`test_df`	The test set data frame.

`train_split`	The training set indices from the original data frame.

`vali_split`	The validation set indices from the original data frame.

`test_split`	The test set indices from the original data frame.

`vali_pct`	The percentage of training data used as validation data.

`test_pct`	The percentage of total data used as test data.

`method`	The dimensionality reduction method.

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Alternatively, preprocess data frame into train, validation, and test
## sets with latent variables as predictors. Notice that port is defined,
## because running H2O sessions one after another can cause connection
## errors.
ae_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample", class_col = "Species",
                  vali_pct = 0.15, test_pct = 0.15, method = "ae", port = 50001)

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Alternatively, preprocess data frame into train, validation, and test
## sets with latent variables as predictors. Notice that port is defined,
## because running H2O sessions one after another can cause connection
## errors.
ae_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample", class_col = "Species",
                  vali_pct = 0.15, test_pct = 0.15, method = "ae", port = 50001)

Data from: Sexually mediated phenotypic variation within and between sexes as a continuum structured by ecology: The mosaic nature of skeletal variation across body regions in Threespine stickleback (Gasterosteus aculeatus L.)

Description

Abstract: Ecological character displacement between the sexes, and sexual selection, integrate into a convergent set of factors that produce sexual variation. Ecologically-modulated, sexually mediated variation within and between sexes may be a major contributor to the amount of total variation that selection can act on in species. Threespine stickleback (Gasterosteus aculeatus) display rapid adaptive responses and sexual variation in many phenotypic traits. We examined phenotypic variation in the skull, pectoral and pelvic girdles of threespine stickleback from two freshwater and two coastal marine sites on the Sunshine Coast of British Columbia, Canada, using an approach that avoids a priori assumptions about bimodal patterns of variation. We quantified shape and size of the cranial, pectoral and pelvic regions of sticklebacks in marine and freshwater habitats using 3D geometric morphometrics and an index of sexually mediated variation. We show that the expression of phenotypic variation is structured in part by the effects of both habitat marine vs freshwater and the effects of individual sites within each habitat. Relative size exerts variable influence, and patterns of phenotypic variation associated with sex vary among body regions. This fine-grained quantification of sexually mediated variation in the context of habitat difference and different anatomical structures indicates a complex relationship between genetically inferred sex and environmental factors, demonstrating that the interplay between shared genetic background and sexually mediated, ecologically- based selective pressures structures the phenotypic expression of complex traits.

Usage

ph_stickleback
ph_stickleback

Format

`ph_stickleback`

A data frame of Procrustes superimposed shape data with 190 rows and 214 columns:

Biosample: Biosample
Habitat: Habitat
Population: Population
Sex: Sex

...

Source

Downloaded from <doi:doi.org/10.5061/dryad.xd2547dkw>

Generate predictions for phenotype ensemble.

Description

The ph_train function automatically trains a set of binary or multi-class classification models to ultimately build a new dataset of predictions. The data preprocessing and hyperparameter tuning are handled internally to minimize user input and simplify the training.

Usage

ph_train(
  train_df,
  vali_df,
  test_df,
  class_col,
  ctrl,
  train_seed = 123,
  n_cores = 2,
  task = "multi",
  methods = "all",
  metric = ifelse(task == "multi", "Kappa", "ROC"),
  tune_length = 10,
  quiet = FALSE
)
ph_train(
  train_df,
  vali_df,
  test_df,
  class_col,
  ctrl,
  train_seed = 123,
  n_cores = 2,
  task = "multi",
  methods = "all",
  metric = ifelse(task == "multi", "Kappa", "ROC"),
  tune_length = 10,
  quiet = FALSE
)

Arguments

`train_df`	A `data.frame` containing a class column and the training data.
`vali_df`	A `data.frame` containing a class column and the validation data.
`test_df`	A `data.frame` containing a class column and the test d
`class_col`	A `character` value for the name of the class column shared across the train, validation, and test sets.
`ctrl`	A `list` containing the resampling strategy (e.g., "boot") and other parameters for `trainControl`. Automatically create one via `ph_ctrl` or manually create it with `trainControl`.
`train_seed`	A `numeric` value to set the training seed and control the randomness of creating resamples: 123 (default).
`n_cores`	An `integer` value for the number of cores to include in the cluster: 2 (default). We highly recommend increasing this value to, e.g., parallel::detectCores() - 1.
`task`	A `character` value for the type of classification `task`: "multi" (default), "binary".
`methods`	A `character` value enumerating the names (at least two, unless "all") of the classification methods to ensemble: "all" (default). If `task = "binary"`, there are 33 methods to choose from: "AdaBag", "AdaBoost.M1", "C5.0", "evtree", "glmnet", "hda", "kernelpls", "kknn", "lda", "loclda", "mda", "nb", "nnet", "pda", "pls", "qda", "rda", "rf", "sparseLDA", "stepLDA", "stepQDA", "treebag", "svmLinear", "svmPoly","svmRadial", "gaussprLinear" (slow), "gaussprPoly" (slow), "gaussprRadial" (slow), "bagEarthGCV", "cforest", "earth", "fda", "hdda". If `task = "multi"`, there are 30 methods to choose from: "AdaBag", "AdaBoost.M1", "C5.0", "evtree", "glmnet", "hda", "kernelpls", "kknn", "lda", "loclda", "mda", "nb", "nnet", "pda", "pls", "qda", "rda", "rf", "sparseLDA", "stepLDA", "stepQDA", "treebag", "svmLinear", "svmPoly", "svmRadial", "bagEarthGCV", "cforest", "earth", "fda", "hdda".
`metric`	A `character` value for which summary metric should be used to select the optimal model: "ROC" (default for "binary") and "Kappa" (default for "multi"). Other options include "logLoss", "Accuracy", "Mean_Balanced_Accuracy", and "Mean_F1".
`tune_length`	If `search = "random"` (default), this is an `integer` value for the maximum number of hyperparameter combinations to test for each training model in the ensemble; if `search = "grid"`, this is an `integer` value for the number of levels of each hyperparameter to test for each model.
`quiet`	A `logical` value for whether progress should be printed: TRUE (default), FALSE.

Value

A list containing the following components:

`train_models`	The `train` models for the ensemble.

`train_df`	The training data frame.

`vali_df`	The validation data frame.

`test_df`	The test data frame.

`task`	The type of classification task.

`ctrl`	A list of resampling parameters used in `trainControl`.

`methods`	The names of the classification methods to ensemble.

`search`	The hyperparameter search strategy.

`n_cores`	The number of cores for parallel processing.

`metric`	The summary metric used to select the optimal model.

`tune_length`	The maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid").

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda",
                         "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Echo control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda",
                         "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)

Package 'pheble'

Help Index

Global variables.

Description

Detect anomalies.

Description

Usage

Arguments

Value

Examples

Data from: The utility of cranial ontogeny for phylogenetic inference: a case study in crocodylians using geometric morphometrics

Description

Usage

Format

ph_crocs

Source

Parameters for resampling and training a dataset.

Description

Usage

Arguments

Value

Examples

Classify phenotypes via ensemble learning.

Description

Usage

Arguments

Value

Examples

Equate factors levels.

Description

Usage

Arguments

Value

Examples

Evaluate a phenotype classification model.

Description

Usage

Arguments

Value

Examples

Compute interquartile range.

Description

Usage

Arguments

Value

Find outlier indices.

Description

Usage

Arguments

Value

Preprocessing for phenotype classification via ensemble learning.

Description

Usage

Arguments

Value

Examples

Data from: Sexually mediated phenotypic variation within and between sexes as a continuum structured by ecology: The mosaic nature of skeletal variation across body regions in Threespine stickleback (Gasterosteus aculeatus L.)

Description

Usage

Format

ph_stickleback

Source

Generate predictions for phenotype ensemble.

Description

Usage

Arguments

Value

Examples

`ph_crocs`

`ph_stickleback`