Title: | Applicability Domain Methods of Viral Load and CD4 Lymphocytes |
---|---|
Description: | Provides methods for assessing the applicability domain of models that predict viral load and CD4 (Cluster of Differentiation 4) lymphocyte counts. These methods help determine the extent of extrapolation when making predictions. |
Authors: | Juan Pablo Acuña González [aut, cre] |
Maintainer: | Juan Pablo Acuña González <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.6.9000 |
Built: | 2025-01-15 04:55:39 UTC |
Source: | https://github.com/juanv66x/viraldomain |
This function fits a K-Nearest Neighbor (KNN) model to the provided data and computes a domain applicability score based on PCA distances.
knn_domain_score( featured_col, train_data, knn_hyperparameters, test_data, threshold_value )
featured_col | The name of the response variable to predict. |
train_data | The training dataset containing predictor variables and the response variable. |
knn_hyperparameters | A list of hyperparameters for the KNN model, including neighbors, weight_func, and dist_power. |
test_data | The test dataset for making predictions. |
threshold_value | The threshold value used for computing domain scores. |
A data frame containing the computed domain scores for each observation in the test dataset.
set.seed(123)
library(viraldomain)
library(dplyr)
featured_col <- "cd_2022"
# Specify the features used for training and testing
train_data <- viral |> dplyr::select(cd_2022, vl_2022)
test_data <- sero
knn_hyperparameters <- list(neighbors = 5, weight_func = "optimal", dist_power = 0.3304783)
threshold_value <- 0.99
# Call the function
knn_domain_score(featured_col, train_data, knn_hyperparameters, test_data, threshold_value)
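The idea of a domain applicability score based on PCA distances can be illustrated with base R alone. The sketch below is not the package's implementation. It assumes that threshold_value acts as a cumulative-variance cutoff for choosing principal components, and that the score is the percentile of a new observation's distance relative to the training distances. The synthetic data and the names train, new, pca_distance, and distance_pctl are hypothetical.

```r
# Synthetic stand-ins for training and new data (hypothetical values).
set.seed(123)
train <- data.frame(cd = rnorm(50, 800, 150), vl = rnorm(50, 100, 30))
new <- data.frame(cd = c(810, 2500), vl = c(95, 900))

# Fit PCA on centred and scaled training predictors.
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Keep the fewest components whose cumulative variance reaches the threshold
# (assumed meaning of the threshold in this sketch).
threshold <- 0.99
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(cum_var >= threshold)[1]

# Euclidean distance of each observation from the centre of the training scores.
pca_distance <- function(data) {
  scores <- predict(pca, newdata = data)[, seq_len(n_comp), drop = FALSE]
  sqrt(rowSums(scores^2))
}
train_dist <- pca_distance(train)
new_dist <- pca_distance(new)

# Percentile of each new distance relative to the training distances.
distance_pctl <- ecdf(train_dist)(new_dist) * 100
result <- data.frame(new, distance = new_dist, distance_pctl = distance_pctl)
print(result)
```

In this sketch the first new observation lies close to the training data and receives a low percentile, while the second lies far outside the training range and reaches the 100th percentile, which signals extrapolation.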
This function fits a MARS (Multivariate Adaptive Regression Splines) model to the provided data and computes a domain applicability score based on PCA distances.
mars_domain_score( featured_col, train_data, mars_hyperparameters, test_data, threshold_value )
featured_col | The name of the featured column. |
train_data | A data frame containing the training data. |
mars_hyperparameters | A list of hyperparameters for the MARS model, including num_terms, prod_degree, and prune_method. |
test_data | A data frame containing the test data. |
threshold_value | The threshold value for the domain score. |
A tibble with the domain applicability scores.
set.seed(123)
library(viraldomain)
library(dplyr)
featured_col <- "cd_2022"
# Specify the features used for training and testing
train_data <- viral |> dplyr::select(cd_2022, vl_2022)
test_data <- sero
mars_hyperparameters <- list(num_terms = 3, prod_degree = 1, prune_method = "none")
threshold_value <- 0.99
# Call the function
mars_domain_score(featured_col, train_data, mars_hyperparameters, test_data, threshold_value)
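For readers unfamiliar with MARS, the model is built from hinge functions of the form max(0, x - knot). The base R sketch below shows a single hinge term fitted by ordinary least squares. It is only an illustration under simplifying assumptions: the knot is fixed by hand, whereas a real MARS fit searches for knots and terms automatically, and the data and the helper hinge() are hypothetical.

```r
# A hinge basis function: zero below the knot, linear above it.
hinge <- function(x, knot) pmax(0, x - knot)

# Synthetic data with a change in slope at x = 4 (hypothetical values).
set.seed(123)
x <- seq(0, 10, length.out = 100)
y <- 2 + 1.5 * hinge(x, 4) + rnorm(100, sd = 0.2)

# With the knot fixed, the model is linear in the hinge term.
fit <- lm(y ~ hinge(x, 4))
hinge_slope <- unname(coef(fit)[2])
print(coef(fit))
```

With prod_degree = 1, as in the example above, terms of this kind enter additively, with no products between hinge functions.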
This function fits a Neural Network model to the provided data and computes a domain applicability score based on PCA distances.
nn_domain_score( featured_col, train_data, nn_hyperparameters, test_data, threshold_value )
featured_col | The name of the featured column in the training data. |
train_data | The training data used to fit the Neural Network model. |
nn_hyperparameters | A list of Neural Network hyperparameters, including hidden_units, penalty, and epochs. |
test_data | The testing domain data used to calculate the domain applicability score. |
threshold_value | The threshold value for domain applicability scoring. |
A tibble with the domain applicability scores.
set.seed(123)
library(viraldomain)
library(dplyr)
featured_col <- "cd_2022"
# Specify the features used for training and testing
train_data <- viral |> dplyr::select(cd_2022, vl_2022)
test_data <- sero
nn_hyperparameters <- list(hidden_units = 1, penalty = 0.3746312, epochs = 480)
threshold_value <- 0.99
# Call the function
nn_domain_score(featured_col, train_data, nn_hyperparameters, test_data, threshold_value)
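To give a sense of what the three hyperparameters control, the sketch below fits a single-hidden-layer network directly with the nnet package. This is an illustration only: it assumes that hidden_units, penalty, and epochs correspond to nnet's size, decay, and maxit arguments, and whether the package uses nnet internally is not confirmed by this page. The data are synthetic and hypothetical.

```r
library(nnet)

# Synthetic, standardised stand-ins for a predictor and a response.
set.seed(123)
vl <- runif(40, 20, 200)
cd <- 900 - 2 * vl + rnorm(40, sd = 30)
x <- scale(vl)
y <- scale(cd)

# One hidden unit, weight decay as the penalty, and a cap on training iterations.
fit <- nnet(x, y, size = 1, decay = 0.3746312, maxit = 480,
            linout = TRUE, trace = FALSE)

# Predictions on the (standardised) training inputs.
pred <- predict(fit, x)
print(head(pred))
```

The linout = TRUE setting requests a linear output unit, which is the usual choice for regression on a continuous response.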
This function generates a domain plot for a normalized model based on PCA distances of the provided data.
normalized_domain_plot(featured_col, train_data, test_data, treshold_value)
featured_col | The name of the featured column. |
train_data | A data frame containing the training data. |
test_data | A data frame containing the test data. |
treshold_value | The threshold value for the domain plot. |
A domain plot visualizing the distances of imputed values.
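set.seed(123)
library(viraldomain)
library(dplyr)
# Specify the featured column and the data
featured_col <- "cd_2022"
train_data <- viral |> dplyr::select("cd_2022", "vl_2022")
test_data <- sero
treshold_value <- 0.99
# Call the function
normalized_domain_plot(featured_col, train_data, test_data, treshold_value)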
This function fits a Random Forest model to the provided data and computes a domain applicability score based on PCA distances.
rf_domain_score( featured_col, train_data, rf_hyperparameters, test_data, threshold_value )
featured_col | A character string specifying the name of the response variable to predict. |
train_data | A data frame containing predictor variables and the response variable for training the model. |
rf_hyperparameters | A list of hyperparameters for the Random Forest model, including mtry, min_n, and trees. |
test_data | A data frame for making predictions. |
threshold_value | A numeric threshold value used for computing domain applicability scores. |
Random Forest creates a large number of decision trees, each independent of the others. The final prediction combines the predictions from all individual trees. This function uses the ranger engine for fitting regression models.
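The way independent trees are combined can be sketched in base R. The example below is a deliberate simplification and not what ranger does: each "tree" is a one-split stump fitted to a bootstrap sample, there is a single predictor, and no random feature selection takes place, so it is closer to plain bagging than to a full Random Forest. The helpers fit_stump() and predict_stump() and the data are hypothetical.

```r
# Fit a one-split regression stump by minimising the sum of squared errors.
fit_stump <- function(x, y) {
  candidates <- sort(unique(x))
  candidates <- candidates[-length(candidates)]  # keep both sides non-empty
  sse <- vapply(candidates, function(s) {
    left <- y[x <= s]
    right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  best <- candidates[which.min(sse)]
  list(split = best, left = mean(y[x <= best]), right = mean(y[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

# Synthetic data with a step at x = 5 (hypothetical values).
set.seed(123)
x <- runif(200, 0, 10)
y <- ifelse(x > 5, 3, 1) + rnorm(200, sd = 0.3)

# Each stump is fitted independently on its own bootstrap sample.
n_trees <- 50
stumps <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(length(x), replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# The ensemble prediction is the average over all individual predictions.
x_new <- c(2, 8)
tree_preds <- vapply(stumps, predict_stump, numeric(2), x = x_new)
ensemble_pred <- rowMeans(tree_preds)
print(ensemble_pred)
```

In the actual hyperparameter list, trees sets the number of trees, mtry the number of predictors sampled at each split, and min_n the minimum number of observations required to split a node further.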
A data frame containing the computed domain applicability scores for each observation in the test dataset.
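set.seed(123)
library(viraldomain)
library(dplyr)
featured_col <- "cd_2022"
# Specify the features used for training and testing
train_data <- viral %>% dplyr::select(cd_2022, vl_2022)
test_data <- sero
rf_hyperparameters <- list(mtry = 2, min_n = 5, trees = 500)
threshold_value <- 0.99
# Call the function
rf_domain_score(featured_col, train_data, rf_hyperparameters, test_data, threshold_value)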
This dataset is designed for testing the applicability domain of methods related to HIV research. It provides a tibble with 53 rows and 2 columns containing numeric measurements of CD4 lymphocyte counts (cd_2022) and viral load (vl_2022) for seropositive individuals in 2022. These measurements are vital indicators of HIV disease status. This dataset is ideal for evaluating the performance and suitability of various HIV-predictive models and as an aid in developing diagnostic tools within a seropositive context.
data(sero)
A tibble (data frame) with 53 rows and 2 columns.
To explore more rows of this dataset, you can use the print(n = ...) function.
Juan Pablo Acuña González [email protected]
data(sero)
sero
This function generates a domain plot for a simple model based on PCA distances of the provided data.
simple_domain_plot(featured_col, train_data, test_data, treshold_value)
featured_col | Name of the featured column in the training data. |
train_data | The training data used to fit the model. |
test_data | The testing domain data used to calculate PCA distances. |
treshold_value | The threshold for domain applicability scoring. |
A simple domain plot.
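set.seed(123)
library(viraldomain)
library(dplyr)
# Specify the featured column and the data
featured_col <- "cd_2022"
train_data <- viral |> dplyr::select("cd_2022", "vl_2022")
test_data <- sero
treshold_value <- 0.99
# Call the function
simple_domain_plot(featured_col, train_data, test_data, treshold_value)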
This dataset serves as input for predictive modeling tasks related to HIV research. It contains numeric measurements of CD4 lymphocyte counts (cd) and viral load (vl) at three different time points: 2019, 2021, and 2022. These measurements are crucial indicators of HIV disease progression.
data(viral)
A tibble (data frame) with 35 rows and 6 columns.
To explore more rows of this dataset, you can use the print(n = ...) function.
Juan Pablo Acuña González [email protected]
data(viral)
viral