Title: | Methodologies for Functional Data Based on the Epigraph and Hypograph Indices |
---|---|
Description: | Implements methods for functional data analysis based on the epigraph and hypograph indices. These methods transform functional datasets, whether in one or multiple dimensions, into multivariate datasets. The transformation involves applying the epigraph, hypograph, and their modified versions to both the original curves and their first and second derivatives. The calculation of these indices is tailored to the dimensionality of the functional dataset, with special considerations for dependencies between dimensions in multidimensional cases. This approach extends traditional multivariate data analysis techniques to the functional data setting. A key application of this package is the EHyClus method, which enhances clustering analysis for functional data across one or multiple dimensions using the epigraph and hypograph indices. See Pulido et al. (2023) <doi:10.1007/s11222-023-10213-7> and Pulido et al. (2024) <doi:10.48550/arXiv.2307.16720>. |
Authors: | Belen Pulido [aut, cre] |
Maintainer: | Belen Pulido <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1 |
Built: | 2025-02-26 05:47:04 UTC |
Source: | https://github.com/bpulidob/ehymet |
Create a table containing four validation metrics for clustering: Purity, F-measure, Rand Index (RI), and Adjusted Rand Index (ARI). This function considers pairs of points
clustering_validation(clusters, true_labels, digits = 4)
clusters |
The clusters predicted by the clustering method. |
true_labels |
Atomic vector with the true labels of the data. |
digits |
Number of digits for rounding. |
A list
containing values for Purity, F-measure, RI and ARI.
set.seed(1221)
vars <- list(c("dtaEI", "dtaMEI"))
data <- sim_model_ex1()
true_labels <- c(rep(1, 50), rep(2, 50))
data_ind <- generate_indices(data)
clus_kmeans <- clustInd_kmeans(data_ind, vars)
cluskmeans_mahalanobis_dtaEIdtaMEI <- clus_kmeans$kmeans_mahalanobis_dtaEIdtaMEI$cluster
clustering_validation(cluskmeans_mahalanobis_dtaEIdtaMEI, true_labels)
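To make two of the validation metrics concrete, the following base R sketch computes Purity and the Rand Index by hand. The helper names `purity_sketch` and `rand_index_sketch` are hypothetical and are not part of the package; the package's own implementation may differ in details such as rounding.

```r
# Purity: share of observations that carry the majority true label of their cluster
purity_sketch <- function(clusters, true_labels) {
  tab <- table(clusters, true_labels)
  sum(apply(tab, 1, max)) / length(clusters)
}

# Rand Index: share of pairs of points on which both partitions agree
rand_index_sketch <- function(clusters, true_labels) {
  pairs <- utils::combn(length(clusters), 2)
  same_cluster <- clusters[pairs[1, ]] == clusters[pairs[2, ]]
  same_label   <- true_labels[pairs[1, ]] == true_labels[pairs[2, ]]
  # A pair agrees when it is grouped together in both or separated in both
  mean(same_cluster == same_label)
}

pred  <- c(1, 1, 1, 2, 2, 2)
truth <- c(1, 1, 2, 2, 2, 2)
purity <- purity_sketch(pred, truth)
ri     <- rand_index_sketch(pred, truth)
```

Both metrics lie in [0, 1], with 1 indicating a partition identical to the true labels up to relabelling.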
Perform hierarchical clustering for different combinations of indices, methods, and distances.
clustInd_hierarch(
  ind_data,
  vars_combinations,
  method_list = c("single", "complete", "average", "centroid", "ward.D2"),
  dist_vector = c("euclidean", "manhattan"),
  n_cluster = 2,
  true_labels = NULL,
  n_cores = 1
)
ind_data |
Dataframe containing indices applied to the original data and its first and second derivatives. See generate_indices. |
vars_combinations |
|
method_list |
|
dist_vector |
|
n_cluster |
Number of clusters to generate. |
true_labels |
Vector of true labels for validation. If the labels are unknown, true_labels is set to NULL. |
n_cores |
Number of cores to do parallel computation. 1 by default, which means no parallel execution. |
A list containing hierarchical clustering results for each configuration.
vars1 <- c("dtaEI", "dtaMEI")
vars2 <- c("dtaHI", "dtaMHI")
data <- ehymet::sim_model_ex1()
data_ind <- generate_indices(data)
clustInd_hierarch(data_ind, list(vars1, vars2))
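The idea of running hierarchical clustering over every method and distance combination can be sketched with base R alone. The data frame below is synthetic, and the result names are a hypothetical naming scheme chosen for illustration; this is not the package's implementation.

```r
set.seed(1)
# Synthetic index data: two groups of 10 observations, two index columns
ind_data <- data.frame(
  dtaEI  = c(rnorm(10, 0.2, 0.05), rnorm(10, 0.8, 0.05)),
  dtaMEI = c(rnorm(10, 0.3, 0.05), rnorm(10, 0.7, 0.05))
)

# Every linkage method paired with every distance
grid <- expand.grid(
  method = c("single", "complete", "average"),
  dist   = c("euclidean", "manhattan"),
  stringsAsFactors = FALSE
)

results <- lapply(seq_len(nrow(grid)), function(i) {
  d    <- stats::dist(ind_data, method = grid$dist[i])
  tree <- stats::hclust(d, method = grid$method[i])
  stats::cutree(tree, k = 2)  # cut the dendrogram into 2 clusters
})
names(results) <- paste("hierarch", grid$method, grid$dist, sep = "_")
```

Each element of `results` is an integer vector of cluster labels, one per row of `ind_data`.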
Perform kernel k-means clustering for different combinations of indices and kernels.
clustInd_kkmeans(
  ind_data,
  vars_combinations,
  kernel_list = c("rbfdot", "polydot"),
  n_cluster = 2,
  true_labels = NULL,
  n_cores = 1
)
ind_data |
Dataframe containing indices applied to the original data and its first and second derivatives. See generate_indices. |
vars_combinations |
|
kernel_list |
List of kernels |
n_cluster |
Number of clusters to create |
true_labels |
Vector of true labels for validation. If the labels are unknown, true_labels is set to NULL. |
n_cores |
Number of cores to do parallel computation. 1 by default, which means no parallel execution. |
A list containing kernel k-means clustering results for each configuration.
vars1 <- c("dtaEI", "dtaMEI")
vars2 <- c("dtaHI", "dtaMHI")
data <- ehymet::sim_model_ex1()
data_ind <- generate_indices(data)
clustInd_kkmeans(data_ind, list(vars1, vars2))
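The kernel names "rbfdot" and "polydot" match the Gaussian RBF and polynomial kernels of the kernlab package. As a rough illustration of what those two kernels compute, the base R sketch below builds both kernel matrices for a tiny dataset. The helper names `rbf_kernel` and `poly_kernel` are hypothetical, and the parameter defaults are assumptions made for this sketch.

```r
# Gaussian RBF kernel: k(x, y) = exp(-sigma * ||x - y||^2)
rbf_kernel <- function(X, sigma = 1) {
  exp(-sigma * as.matrix(stats::dist(X))^2)
}

# Polynomial kernel: k(x, y) = (scale * <x, y> + offset)^degree
poly_kernel <- function(X, degree = 1, scale = 1, offset = 1) {
  (scale * tcrossprod(X) + offset)^degree
}

X <- rbind(c(0, 0), c(1, 0), c(0, 2))
K_rbf  <- rbf_kernel(X)
K_poly <- poly_kernel(X, degree = 2)
```

Kernel k-means and spectral clustering both operate on such a similarity matrix rather than on the raw coordinates.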
Perform k-means clustering for different combinations of indices and distances.
clustInd_kmeans(
  ind_data,
  vars_combinations,
  dist_vector = c("euclidean", "mahalanobis"),
  n_cluster = 2,
  init = "random",
  true_labels = NULL,
  n_cores = 1
)
ind_data |
Dataframe containing indices applied to the original data and its first and second derivatives. See generate_indices. |
vars_combinations |
|
dist_vector |
Atomic vector of distance metrics. The possible values are "euclidean", "mahalanobis", or both. |
n_cluster |
Number of clusters to create. |
init |
Centroid initialization method. It can be "random" or "kmeanspp". |
true_labels |
Vector of true labels for validation. If the labels are unknown, true_labels is set to NULL. |
n_cores |
Number of cores to do parallel computation. 1 by default, which means no parallel execution. |
A list containing k-means clustering results for each configuration.
vars1 <- c("dtaEI", "dtaMEI")
vars2 <- c("dtaHI", "dtaMHI")
data <- ehymet::sim_model_ex1()
data_ind <- generate_indices(data)
clustInd_kmeans(data_ind, list(vars1, vars2))
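One common way to obtain k-means under the Mahalanobis distance is to whiten the data with the Cholesky factor of the covariance matrix and then run ordinary Euclidean k-means. The base R sketch below shows that equivalence on synthetic data; whether the package uses this exact route is an assumption, not something stated in this manual.

```r
set.seed(1)
# Synthetic data: two groups with very different scales per column
X <- cbind(
  c(rnorm(15, 0, 1), rnorm(15, 6, 1)),
  c(rnorm(15, 0, 3), rnorm(15, 6, 3))
)

S <- stats::cov(X)
R <- chol(S)         # upper triangular factor, S = t(R) %*% R
Z <- X %*% solve(R)  # whitened data

# Squared Euclidean distance in Z equals squared Mahalanobis distance in X
d_euclid <- sum((Z[1, ] - Z[2, ])^2)
d_mahal  <- stats::mahalanobis(X[1, ], center = X[2, ], cov = S)

# Ordinary k-means on the whitened data
fit <- stats::kmeans(Z, centers = 2, nstart = 10)
```

Whitening removes the effect of differing scales and correlations between index columns before distances are compared.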
Perform spectral clustering for different combinations of indices and kernels.
clustInd_spc(
  ind_data,
  vars_combinations,
  kernel_list = c("rbfdot", "polydot"),
  n_cluster = 2,
  true_labels = NULL,
  n_cores = 1
)
ind_data |
Dataframe containing indices applied to the original data and its first and second derivatives. See generate_indices. |
vars_combinations |
|
kernel_list |
List of kernels |
n_cluster |
Number of clusters to create |
true_labels |
Vector of true labels for validation. If the labels are unknown, true_labels is set to NULL. |
n_cores |
Number of cores to do parallel computation. 1 by default, which means no parallel execution. |
A list containing spectral clustering results for each configuration.
vars1 <- c("dtaEI", "dtaMEI")
vars2 <- c("dtaHI", "dtaMHI")
data <- ehymet::sim_model_ex1()
data_ind <- generate_indices(data)
clustInd_spc(data_ind, list(vars1, vars2))
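For readers unfamiliar with spectral clustering, here is a minimal base R sketch of one standard variant (normalized affinity matrix, leading eigenvectors, then k-means on the embedded rows). The function `spectral_sketch` is hypothetical and uses an RBF affinity with an assumed `sigma`; it illustrates the general technique and is not the routine the package calls.

```r
spectral_sketch <- function(X, k = 2, sigma = 1) {
  A <- exp(-sigma * as.matrix(stats::dist(X))^2)  # RBF affinity matrix
  diag(A) <- 0
  d_inv_sqrt <- 1 / sqrt(rowSums(A))
  L <- A * outer(d_inv_sqrt, d_inv_sqrt)          # D^{-1/2} A D^{-1/2}
  # Eigenvectors of the k largest eigenvalues form the spectral embedding
  U <- eigen(L, symmetric = TRUE)$vectors[, seq_len(k), drop = FALSE]
  U <- U / sqrt(rowSums(U^2))                     # scale each row to unit length
  stats::kmeans(U, centers = k, nstart = 10)$cluster
}

set.seed(1)
# Two well-separated synthetic groups of 10 points each
X <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 4, sd = 0.5), ncol = 2)
)
cl <- spectral_sketch(X)
```

The final k-means step runs on the eigenvector embedding, where well-separated groups collapse to nearly distinct points.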
Creates a multivariate dataset containing the epigraph, hypograph, and/or their modified versions applied to the curves and their derivatives, and then performs hierarchical clustering, k-means, kernel k-means, and spectral clustering.
EHyClus(
  curves,
  vars_combinations,
  k = 30,
  n_clusters = 2,
  bs = "cr",
  clustering_methods = c("hierarch", "kmeans", "kkmeans", "spc"),
  l_method_hierarch = c("single", "complete", "average", "centroid", "ward.D2"),
  l_dist_hierarch = c("euclidean", "manhattan"),
  l_dist_kmeans = c("euclidean", "mahalanobis"),
  l_kernel = c("rbfdot", "polydot"),
  true_labels = NULL,
  only_best = FALSE,
  verbose = FALSE,
  n_cores = 1,
  ...
)
curves |
Dataset containing the curves to apply a clustering algorithm.
The functional dataset can be one dimensional ( |
vars_combinations |
If |
k |
Number of basis functions for the B-splines. If equals to |
n_clusters |
Number of clusters to generate. |
bs |
A two letter character string indicating the (penalized) smoothing
basis to use. See |
clustering_methods |
Character vector specifying at least one of the following clustering methods to be computed: "hierarch", "kmeans", "kkmeans" or "spc". |
l_method_hierarch |
|
l_dist_hierarch |
|
l_dist_kmeans |
|
l_kernel |
|
true_labels |
Numeric vector of true labels for validation. If provided, evaluation metrics are computed in the final result. |
only_best |
|
verbose |
If |
n_cores |
Number of cores to do parallel computation. 1 by default, which means no parallel execution. Must be an integer greater than or equal to 1. |
... |
Additional arguments for tfb. See |
A list containing the clustering partition for each method and indices combination and, if true_labels is provided, a data frame containing the time elapsed for obtaining a clustering partition of the indices dataset for each methodology. Also, the number of generated clusters and the combinations of variables used can be seen as attributes of this object.
# univariate data without labels
curves <- sim_model_ex1(n = 10)
vars_combinations <- list(c("dtaEI", "dtaMEI"), c("dtaHI", "dtaMHI"))
EHyClus(curves, vars_combinations = vars_combinations)

# multivariate data with labels
curves <- sim_model_ex2(n = 5)
true_labels <- c(rep(1, 5), rep(2, 5))
vars_combinations <- list(c("dtaMEI", "ddtaMEI"), c("dtaMEI", "d2dtaMEI"))
res <- EHyClus(curves, vars_combinations = vars_combinations, true_labels = true_labels)
res$cluster # clustering results

# multivariate data and generic (default) vars_combinations
curves <- sim_model_ex2(n = 5)
EHyClus(curves)
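The variable names used in the examples above (such as "dtaMEI", "ddtaMEI", and "d2dtaMEI") appear to combine a prefix for the data level with an index name. Assuming that reading, namely `dta` for the original curves, `ddta` for first derivatives, and `d2dta` for second derivatives, the following base R sketch enumerates all such names and every pair of them. It is only an illustration of how candidate combinations could be built by hand, not a description of the package defaults.

```r
# Assumed meaning of the prefixes, inferred from the examples
prefixes <- c(
  dta   = "original curves",
  ddta  = "first derivatives",
  d2dta = "second derivatives"
)
indices <- c("EI", "HI", "MEI", "MHI")

# All prefix/index names, e.g. "dtaEI", "ddtaEI", "d2dtaEI", ...
all_vars <- as.vector(outer(names(prefixes), indices, paste0))

# Every pair of variables as one candidate combination
var_pairs <- utils::combn(all_vars, 2, simplify = FALSE)
```

A list such as `var_pairs[1:3]` has the same shape as the `vars_combinations` argument used in the examples.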
The Epigraph Index of a curve x is one minus the proportion of curves in the sample that are above x.
EI(curves, ...)
curves |
|
... |
Ignored. |
Numeric vector containing the EI for each curve.
x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
EI(x)

y <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7,
             -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
EI(y)
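The definition of the Epigraph Index can be traced by hand for the one-dimensional matrix case. The sketch below assumes that rows are curves, that "above" means greater than or equal at every grid point, and that each curve is compared with the whole sample including itself. The function `ei_sketch` is a hypothetical helper, and these conventions may differ from those of `EI`.

```r
ei_sketch <- function(curves) {
  n <- nrow(curves)
  vapply(seq_len(n), function(i) {
    # TRUE for every curve j that is >= curve i at all grid points
    above <- apply(curves, 1, function(xj) all(xj >= curves[i, ]))
    1 - sum(above) / n
  }, numeric(1))
}

x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
ei <- ei_sketch(x)
ei
# 0.25 0.75 0.75 0.50 under the conventions stated above
```

For instance, the first curve (1, 2, 3) is dominated everywhere by itself and by curves 3 and 4, so its value is 1 - 3/4.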
Create a dataset with indices from a functional dataset in one or multiple dimensions
generate_indices(
  curves,
  k,
  bs = "cr",
  indices = c("EI", "HI", "MEI", "MHI"),
  n_cores = 1,
  ...
)
curves |
|
k |
Number of basis functions for the B-splines. If equal to 0, the number of basis functions will be automatically selected. |
bs |
A two letter character string indicating the (penalized) smoothing
basis to use. See |
indices |
Set of indices to be applied to the dataset. Any subset of EI, HI, MEI, and MHI. |
n_cores |
Number of cores to do parallel computation. 1 by default, which means no parallel execution. Must be an integer greater than or equal to 1. |
... |
Additional arguments for tfb. See |
A dataframe containing the indices provided in indices for the original data and its first and second derivatives.
# 3-dimensional array
x1 <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7,
              -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
generate_indices(x1, k = 4)

# matrix
x2 <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), nrow = 3, ncol = 4)
generate_indices(x2, k = 4)

# using additional parameter for tf::tfb
curves <- sim_model_ex1(n = 10)
generate_indices(
  curves = curves,
  k = 20,
  bs = "bs",
  m = c(3, 2), # additional parameter for tfb
  penalized = FALSE # additional parameter for tfb
)
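The step of smoothing each curve and evaluating its first and second derivatives can be illustrated without the tf package. The sketch below swaps in `stats::smooth.spline` as a stand-in smoother, so it is not the package's tfb-based procedure; the data, the grid on [0, 1], and the helper `eval_spline` are assumptions made for this example.

```r
set.seed(1)
grid <- seq(0, 1, length.out = 30)
# 5 noisy synthetic curves, one per row
curves <- t(replicate(5, sin(2 * pi * grid) + rnorm(30, sd = 0.1)))

# Fit a smoothing spline to one curve and evaluate it or its derivative
eval_spline <- function(y, deriv) {
  fit <- stats::smooth.spline(grid, y)
  stats::predict(fit, grid, deriv = deriv)$y
}

smoothed <- list(
  dta   = t(apply(curves, 1, eval_spline, deriv = 0)),  # smoothed curves
  ddta  = t(apply(curves, 1, eval_spline, deriv = 1)),  # first derivatives
  d2dta = t(apply(curves, 1, eval_spline, deriv = 2))   # second derivatives
)
```

Applying an index to each of the three matrices would then give one column per index and data level, which is the layout described for the returned dataframe.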
The Hypograph Index of a curve x is the proportion of curves in the sample that are below x.
HI(curves, ...)
curves |
|
... |
Ignored. |
Numeric vector containing the HI for each curve.
x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
HI(x)

y <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7,
             -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
HI(y)
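A hand computation of the Hypograph Index for a matrix of curves might look as follows. It assumes rows are curves, "below" means less than or equal at every grid point, and the curve itself counts as part of the sample. The helper `hi_sketch` is hypothetical, and `HI` may use different conventions.

```r
hi_sketch <- function(curves) {
  p <- ncol(curves)
  vapply(seq_len(nrow(curves)), function(i) {
    # A curve is below curve i only if it is <= at all p grid points
    below_everywhere <- rowSums(sweep(curves, 2, curves[i, ], FUN = "<=")) == p
    mean(below_everywhere)
  }, numeric(1))
}

x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
hi <- hi_sketch(x)
hi
# 0.25 0.25 0.75 0.50 under the conventions stated above
```

Here the third curve (3, 5, 8) lies above curves 1 and 4 and itself at every point, which gives 3/4.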
The Modified Epigraph Index of a curve x is one minus the proportion of "time" the curves in the sample are above x.
MEI(curves, ...)
curves |
|
... |
Ignored. |
Numeric vector containing the MEI for each curve.
x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
MEI(x)

y <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7,
             -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
MEI(y)
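The "proportion of time" in the modified index can be read as an average over both curves and grid points. The following sketch computes it that way for a matrix whose rows are curves, assuming a non-strict inequality and including the curve itself in the sample. `mei_sketch` is a hypothetical helper, not the package function.

```r
mei_sketch <- function(curves) {
  vapply(seq_len(nrow(curves)), function(i) {
    # TRUE where a sample curve is >= curve i at that grid point
    at_or_above <- sweep(curves, 2, curves[i, ], FUN = ">=")
    1 - mean(at_or_above)
  }, numeric(1))
}

x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
mei <- mei_sketch(x)
mei
# 1/12 1/3 7/12 1/3 under the conventions stated above
```

Unlike the unmodified index, a curve that is crossed by others still receives partial credit for the grid points where it is not dominated.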
The Modified Hypograph Index of a curve x is the proportion of "time" the curves in the sample are below x.
MHI(curves, ...)
curves |
|
... |
Ignored. |
Numeric vector containing the MHI for each curve.
x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
MHI(x)

y <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7,
             -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
MHI(y)
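Because the modified index counts, at each grid point, how many sample values lie at or below the curve, it can also be written in terms of column-wise ranks. The sketch below uses that shortcut for a matrix with curves in rows, assuming a non-strict inequality and including the curve itself. `mhi_sketch` is a hypothetical helper and may not match `MHI` in its treatment of ties.

```r
mhi_sketch <- function(curves) {
  # With ties.method = "max", the rank of a value equals the number of
  # values in that column that are <= it
  counts <- apply(curves, 2, rank, ties.method = "max")
  rowMeans(counts) / nrow(curves)
}

x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
mhi <- mhi_sketch(x)
mhi
# 5/12 7/12 11/12 9/12 under the conventions stated above
```

The rank formulation avoids an explicit loop over curves, which can matter for large samples.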
Each dataset has 2 groups with n
curves each, defined in the interval
with
p
equidistant points. The first n
curves are
generated from the following model
where
is the mean function and
is a centered Gaussian process with
covariance matrix
The remaining n functions are generated from model
i_sim
with
i_sim
The first three models contain changes in the mean, while the covariance
matrix does not change. Models 4 and 5 are obtained by multiplying the
covariance matrix by a constant. Model 6 is obtained from adding to
a centered Gaussian process
whose covariance matrix
is given by
.
Models 7 and 8 are obtained by using a different mean function.
sim_model_ex1(n = 50, p = 30, i_sim = 1)
n |
Number of curves to generate for each of the two groups. Set to 50 by default. |
p |
Number of grid points of the curves.
Curves are generated over the interval |
i_sim |
Integer set to |
data matrix of size .
sm1 <- sim_model_ex1()
dim(sm1)
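The general recipe of a mean function plus a centered Gaussian process can be sketched in base R. Everything specific in the code below is a placeholder chosen for illustration: the interval [0, 1], the two mean functions, and the exponential covariance are assumptions and should not be read as the models used by `sim_model_ex1`. The function `simulate_two_groups` is hypothetical.

```r
simulate_two_groups <- function(n = 50, p = 30) {
  t_grid <- seq(0, 1, length.out = p)     # assumed interval [0, 1]
  mean_1 <- function(t) sin(2 * pi * t)   # placeholder mean, group 1
  mean_2 <- function(t) sin(2 * pi * t) + 0.5  # placeholder mean, group 2
  # Placeholder exponential covariance evaluated on the grid
  Sigma <- 0.3 * exp(-abs(outer(t_grid, t_grid, "-")) / 0.3)
  R <- chol(Sigma)                        # Sigma = t(R) %*% R
  # Rows of `noise` are draws from a centered Gaussian with covariance Sigma
  noise <- matrix(stats::rnorm(2 * n * p), nrow = 2 * n) %*% R
  means <- rbind(
    matrix(mean_1(t_grid), nrow = n, ncol = p, byrow = TRUE),
    matrix(mean_2(t_grid), nrow = n, ncol = p, byrow = TRUE)
  )
  means + noise
}

set.seed(1)
sim <- simulate_two_groups(n = 5, p = 30)
dim(sim)
```

The first n rows belong to the first group and the remaining n rows to the second, mirroring the two-group layout described above.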
The function can generate one-dimensional or multi-dimensional curves. For i_sim equal to 1 or 2, one-dimensional curves are generated. For i_sim equal to 3 or 4, multi-dimensional curves are generated.
sim_model_ex2(n = 50, p = 150, i_sim = 1)
n |
Number of curves to generate for each of the two groups. Set to 50 by default. |
p |
Number of grid points of the curves.
Curves are generated over the interval |
i_sim |
Integer set to |
data matrix of size if
or an array of dimensions
if
.
sm1 <- sim_model_ex2()
dim(sm1) # This should output (100, 150) by default, since n = 50 and p = 150

sm4 <- sim_model_ex2(i_sim = 4)
dim(sm4) # This should output (100, 150, 2) by default, since n = 50 and p = 150
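The three-way layout reported by `dim(sm4)` above (curves, grid points, dimensions) can be explored with ordinary array indexing. The sketch below uses a zero-filled stand-in array with the same shape rather than actual simulated data, so the values themselves are meaningless.

```r
n <- 50
p <- 150
d <- 2
# Stand-in for a multi-dimensional sample: 2n curves, p grid points, d dimensions
arr <- array(0, dim = c(2 * n, p, d))

first_component <- arr[, , 1]  # 100 x 150 matrix: first dimension of every curve
first_curve     <- arr[1, , ]  # 150 x 2 matrix: all dimensions of the first curve
```

Slicing along the third index recovers an ordinary curves-by-points matrix for each dimension.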