Package 'ehymet'

Title: Methodologies for Functional Data Based on the Epigraph and Hypograph Indices
Description: Implements methods for functional data analysis based on the epigraph and hypograph indices. These methods transform functional datasets, whether in one or multiple dimensions, into multivariate datasets. The transformation involves applying the epigraph, hypograph, and their modified versions to both the original curves and their first and second derivatives. The calculation of these indices is tailored to the dimensionality of the functional dataset, with special considerations for dependencies between dimensions in multidimensional cases. This approach extends traditional multivariate data analysis techniques to the functional data setting. A key application of this package is the EHyClus method, which enhances clustering analysis for functional data across one or multiple dimensions using the epigraph and hypograph indices. See Pulido et al. (2023) <doi:10.1007/s11222-023-10213-7> and Pulido et al. (2024) <doi:10.48550/arXiv.2307.16720>.
Authors: Belen Pulido [aut, cre] , Jose Ignacio Diez [ctr]
Maintainer: Belen Pulido <[email protected]>
License: MIT + file LICENSE
Version: 0.1.1
Built: 2025-02-26 05:47:04 UTC
Source: https://github.com/bpulidob/ehymet

Help Index


Create a table containing four validation metrics for clustering: Purity, F-measure and Rand Index (RI) and Adjusted Rand Index (ARI). This function considers pairs of points

Description

Create a table containing four validation metrics for clustering: Purity, F-measure and Rand Index (RI) and Adjusted Rand Index (ARI). This function considers pairs of points

Usage

clustering_validation(clusters, true_labels, digits = 4)

Arguments

clusters

The clusters predicted by the clustering method.

true_labels

Atomic vector with the true labels of the data.

digits

Number of digits for rounding.

Value

A list containing values for Purity, F-measure, RI and ARI.

Examples

set.seed(1221)
vars <- list(c("dtaEI", "dtaMEI"))
data <- sim_model_ex1()
true_labels <- c(rep(1, 50), rep(2, 50))
data_ind <- generate_indices(data)
clus_kmeans <- clustInd_kmeans(data_ind, vars)
cluskmeans_mahalanobis_dtaEIdtaMEI <- clus_kmeans$kmeans_mahalanobis_dtaEIdtaMEI$cluster
clustering_validation(cluskmeans_mahalanobis_dtaEIdtaMEI, true_labels)

Perform hierarchical clustering for a different combinations of indices, method and distance

Description

Perform hierarchical clustering for a different combinations of indices, method and distance

Usage

clustInd_hierarch(
  ind_data,
  vars_combinations,
  method_list = c("single", "complete", "average", "centroid", "ward.D2"),
  dist_vector = c("euclidean", "manhattan"),
  n_cluster = 2,
  true_labels = NULL,
  n_cores = 1
)

Arguments

ind_data

Dataframe containing indices applied to the original data and its first and second derivatives. See generate_indices.

vars_combinations

list containing one or more combinations of indices in ind_data. If it is non-named, the names of the variables are set to vars1, ..., varsk, where k is the number of elements in vars_combinations.

method_list

list of clustering methods.

dist_vector

list of distance metrics.

n_cluster

number of clusters to generate.

true_labels

Vector of true labels for validation (if it is not known true_labels is set to NULL)

n_cores

Number of cores to do parallel computation. 1 by default, which mean no parallel execution.

Value

A list containing hierarchical clustering results for each configuration.

Examples

vars1 <- c("dtaEI", "dtaMEI")
vars2 <- c("dtaHI", "dtaMHI")
data <- ehymet::sim_model_ex1()
data_ind <- generate_indices(data)
clustInd_hierarch(data_ind, list(vars1, vars2))

Kernel k-means clustering using indices

Description

Perform kernel kmeans clustering for a different combinations of indices and kernel

Usage

clustInd_kkmeans(
  ind_data,
  vars_combinations,
  kernel_list = c("rbfdot", "polydot"),
  n_cluster = 2,
  true_labels = NULL,
  n_cores = 1
)

Arguments

ind_data

Dataframe containing indices applied to the original data and its first and second derivatives. See generate_indices.

vars_combinations

list containing one or more combinations of indices in ind_data. If it is non-named, the names of the variables are set to vars1, ..., varsk, where k is the number of elements in vars_combinations.

kernel_list

List of kernels

n_cluster

Number of clusters to create

true_labels

Vector of true labels for validation (if it is not known true_labels is set to NULL)

n_cores

Number of cores to do parallel computation. 1 by default, which mean no parallel execution.

Value

A list containing kernel-kmeans clustering results for each configuration.

Examples

vars1 <- c("dtaEI", "dtaMEI")
vars2 <- c("dtaHI", "dtaMHI")
data <- ehymet::sim_model_ex1()
data_ind <- generate_indices(data)
clustInd_kkmeans(data_ind, list(vars1, vars2))

K-means clustering with indices

Description

Perform k-means clustering for a different combinations of indices and distances.

Usage

clustInd_kmeans(
  ind_data,
  vars_combinations,
  dist_vector = c("euclidean", "mahalanobis"),
  n_cluster = 2,
  init = "random",
  true_labels = NULL,
  n_cores = 1
)

Arguments

ind_data

Dataframe containing indices applied to the original data and its first and second derivatives. See generate_indices.

vars_combinations

list containing one or more combinations of indices in ind_data. If it is non-named, the names of the variables are set to vars1, ..., varsk, where k is the number of elements in vars_combinations.

dist_vector

Atomic vector of distance metrics. The possible values are, "euclidean", "mahalanobis" or both.

n_cluster

Number of clusters to create.

init

Centroids initialization meathod. It can be "random" or "kmeanspp".

true_labels

Vector of true labels for validation. (if it is not known true_labels is set to NULL)

n_cores

Number of cores to do parallel computation. 1 by default, which mean no parallel execution.

Value

A list containing hierarchical clustering results for each configuration

A list containing kmeans clustering results for each configuration

Examples

vars1 <- c("dtaEI", "dtaMEI")
vars2 <- c("dtaHI", "dtaMHI")
data <- ehymet::sim_model_ex1()
data_ind <- generate_indices(data)
clustInd_kmeans(data_ind, list(vars1, vars2))

Spectral clustering using indices

Description

Perform spectral clustering for a different combinations of indices and kernels

Usage

clustInd_spc(
  ind_data,
  vars_combinations,
  kernel_list = c("rbfdot", "polydot"),
  n_cluster = 2,
  true_labels = NULL,
  n_cores = 1
)

Arguments

ind_data

Dataframe containing indices applied to the original data and its first and second derivatives. See generate_indices.

vars_combinations

list containing one or more combinations of indices in ind_data. If it is non-named, the names of the variables are set to vars1, ..., varsk, where k is the number of elements in vars_combinations.

kernel_list

List of kernels

n_cluster

Number of clusters to create

true_labels

Vector of true labels for validation (if it is not known true_labels is set to NULL)

n_cores

Number of cores to do parallel computation. 1 by default, which mean no parallel execution.

Value

A list containing kkmeans clustering results for each configuration

Examples

vars1 <- c("dtaEI", "dtaMEI")
vars2 <- c("dtaHI", "dtaMHI")
data <- ehymet::sim_model_ex1()
data_ind <- generate_indices(data)
clustInd_spc(data_ind, list(vars1, vars2))

Clustering using Epigraph and Hypograph indices

Description

It creates a multivariate dataset containing the epigraph, hypograph and/or its modified versions on the curves and derivatives and then perform hierarchical clustering, kmeans, kernel kmeans, and spectral clustering

Usage

EHyClus(
  curves,
  vars_combinations,
  k = 30,
  n_clusters = 2,
  bs = "cr",
  clustering_methods = c("hierarch", "kmeans", "kkmeans", "spc"),
  l_method_hierarch = c("single", "complete", "average", "centroid", "ward.D2"),
  l_dist_hierarch = c("euclidean", "manhattan"),
  l_dist_kmeans = c("euclidean", "mahalanobis"),
  l_kernel = c("rbfdot", "polydot"),
  true_labels = NULL,
  only_best = FALSE,
  verbose = FALSE,
  n_cores = 1,
  ...
)

Arguments

curves

Dataset containing the curves to apply a clustering algorithm. The functional dataset can be one dimensional (n×pn \times p) where nn is the number of curves and p the number of time points, or multidimensional (n×p×qn \times p \times q) where qq represents the number of dimensions in the data

vars_combinations

If list, each element of the list should be an atomic vector of strings with the names of the variables. Combinations with non-valid variable names will be discarded. If the list is non-named, the names of the variables are set to vars1, ..., varsk, where k is the number of elements in vars_combinations. If "auto" is provided, an unique combination of variable will be found. If not provided, generic combinations of variables will be used. They will not be the same for uni-dimensional and multi-dimensional problems.

k

Number of basis functions for the B-splines. If equals to 0, the number of basis functions will be automatically selected.

n_clusters

Number of clusters to generate.

bs

A two letter character string indicating the (penalized) smoothing basis to use. See smooth.terms.

clustering_methods

character vector specifying at least one of the following clustering methods to be computed: "hierarch", "kmeans", "kkmeans" or "spc".

l_method_hierarch

list of clustering methods for hierarchical clustering.

l_dist_hierarch

list of distances for hierarchical clustering.

l_dist_kmeans

list of distances for kmeans clustering.

l_kernel

list of kernels for kkmeans or spc.

true_labels

Numeric vector of true labels for validation. If provided, evaluation metrics are computed in the final result.

only_best

logical value. If TRUE and true_labels is provided, the function will return only the result for the best clustering method based on the Rand Index. Defaults to FALSE.

verbose

If TRUE, the function will print logs for about the execution of some clustering methods. Defaults to FALSE.

n_cores

Number of cores to do parallel computation. 1 by default, which mean no parallel execution. Must be an integer number greater than 1.

...

Additional arguments for tfb. See tfb.

Value

A list containing the clustering partition for each method and indices combination and, if true_labels is provided a data frame containing the time elapsed for obtaining a clustering partition of the indices dataset for each methodology. Also, the number of generated clusters and the combinations of variables used can be seen as attributes of this object.

Examples

# univarariate data without labels
curves <- sim_model_ex1(n = 10)
vars_combinations <- list(c("dtaEI", "dtaMEI"), c("dtaHI", "dtaMHI"))
EHyClus(curves, vars_combinations = vars_combinations)

# multivariate data with labels
curves <- sim_model_ex2(n = 5)
true_labels <- c(rep(1, 5), rep(2, 5))
vars_combinations <- list(c("dtaMEI", "ddtaMEI"), c("dtaMEI", "d2dtaMEI"))
res <- EHyClus(curves, vars_combinations = vars_combinations, true_labels = true_labels)
res$cluster # clustering results

# multivariate data and generic (default) vars_combinations
curves <- sim_model_ex2(n = 5)
EHyClus(curves)

Epigraph Index (EI) for a functional dataset

Description

The Epigraph Index of a curve x is one minus the proportion of curves in the sample that are above x.

Usage

EI(curves, ...)

Arguments

curves

matrix where each row represents a curve, and each column represents values along the curve or array with dimension n×p×qn \times p \times q with nn curves, pp values along the curve, and qq dimensions.

...

Ignored.

Value

numeric vector containing the EI for each curve.

Examples

x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
EI(x)

y <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7, -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
EI(y)

Create a dataset with indices from a functional dataset in one or multiple dimensions

Description

Create a dataset with indices from a functional dataset in one or multiple dimensions

Usage

generate_indices(
  curves,
  k,
  bs = "cr",
  indices = c("EI", "HI", "MEI", "MHI"),
  n_cores = 1,
  ...
)

Arguments

curves

matrix with dimension n×pn \times p in the case of a one-dimensional functional dataset, or array of dimension n×p×qn \times p \times q in the case of a multivariate functional dataset. nn represents the number of curves, pp the number of values along the curve, and in the second case, qq is the number of dimensions.

k

Number of basis functions for the B-splines. If equals to 0, the number of basis functions will be automatically selected.

bs

A two letter character string indicating the (penalized) smoothing basis to use. See smooth.terms.

indices

Set of indices to be applied to the dataset. They should be any between EI, HI, MEI and MHI.

n_cores

Number of cores to do parallel computation. 1 by default, which mean no parallel execution. Must be an integer number greater than 1.

...

Additional arguments for tfb. See tfb.

Value

A dataframe containing the indices provided in indices for original data, first and second derivatives

Examples

# 3-dimensional array
x1 <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7, -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
generate_indices(x1, k = 4)

# matrix
x2 <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), nrow = 3, ncol = 4)
generate_indices(x2, k = 4)

# using additional parameter for tf::tfb
curves <- sim_model_ex1(n = 10)
generate_indices(
  curves = curves,
  k = 20,
  bs = "bs",
  m = c(3,2),        # additional parameter for tfb
  penalized = FALSE  # additional parameter for tfb
)

Hypograph Index (HI) for a functional dataset

Description

The Hypograph Index of a curve x is the proportion of curves in the sample that are below x.

Usage

HI(curves, ...)

Arguments

curves

matrix where each row represents a curve, and each column represents values along the curve or array with dimension n×p×qn \times p \times q with nn curves, pp values along the curve, and qq dimensions.

...

Ignored.

Value

numeric vector containing the HI for each curve.

Examples

x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
HI(x)

y <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7, -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
HI(y)

Modified Epigraph Index (MEI) for functional dataset.

Description

The Modified Epigraph Index of a curve x is one minus the proportion of "time" the curves in the sample are above x.

Usage

MEI(curves, ...)

Arguments

curves

matrix where each row represents a curve, and each column represents values along the curve or an array with dimension n×p×qn \times p \times q with nn curves, pp values along the curve, and qq dimensions.

...

Ignored.

Value

numeric vector containing the MEI for each curve.

Examples

x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
MEI(x)
y <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7, -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
MEI(y)

Modified Hypograph Index (MHI) for a functional dataset

Description

The Modified Hypograph Index of a curve x is the proportion of "time" the curves in the sample are below x.

Usage

MHI(curves, ...)

Arguments

curves

matrix where each row represents a curve, and each column represents values along the curve or an array with dimension n×p×qn \times p \times q with nn curves, pp values along the curve, and qq dimensions.

...

Ignored.

Value

numeric vector containing the MHI for each curve.

Examples

x <- matrix(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7), ncol = 3, nrow = 4)
MHI(x)
y <- array(c(1, 2, 3, 3, 2, 1, 5, 2, 3, 9, 8, 7, -1, -5, -6, 2, 3, 0, -1, 0, 2, -1, -2, 0),
  dim = c(3, 4, 2)
)
MHI(y)

Function for generating functional data in one dimension

Description

Each dataset has 2 groups with n curves each, defined in the interval t=[0,1]t=[0, 1] with p equidistant points. The first n curves are generated fron the following model X1(t)=E1(t)+e(t)X_1(t)=E_1(t)+e(t) where E1(t)=E1(X(t))=30t32(1t)E_1(t)=E_1(X(t))=30t^{ \frac{3}{2}}(1-t) is the mean function and e(t)e(t) is a centered Gaussian process with covariance matrix Cov(e(ti),e(tj))=0.3exp(titj0.3)Cov(e(t_i),e(t_j))=0.3 \exp(-\frac{\lvert t_i-t_j \rvert}{0.3}) The remaining 50 functions are generated from model i_sim with i_sim {1,,8}.\in \{1, \ldots, 8\}. The first three models contain changes in the mean, while the covariance matrix does not change. Model 4 and 5 are obtained by multiplying the covariance matrix by a constant. Model 6 is obtained from adding to E1(t)E_1(t) a centered Gaussian process h(t)h(t) whose covariance matrix is given by Cov(e(ti),e(tj))=0.5exp(titj0.2)Cov(e(t_i),e(t_j))=0.5 \exp (-\frac{\lvert t_i-t_j\rvert}{0.2}). Model 7 and 8 are obtained by a different mean function.

Model 1.

X1(t)=30t32(1t)+0.5+e(t).X_1(t)=30t^{\frac{3}{2}}(1-t)+0.5+e(t).

Model 2.

X2(t)=30t32(1t)+0.75+e(t).X_2(t)=30t^{\frac{3}{2}}(1-t)+0.75+e(t).

Model 3.

X3(t)=30t32(1t)+1+e(t).X_3(t)=30t^{\frac{3}{2}}(1-t)+1+e(t).

Model 4.

X4(t)=30t32(1t)+2e(t).X_4(t)=30t^{\frac{3}{2}}(1-t)+2 e(t).

Model 5.

X5(t)=30t32(1t)+0.25e(t).X_5(t)=30t^{\frac{3}{2}}(1-t)+0.25 e(t).

Model 6.

X6(t)=30t32(1t)+h(t).X_6(t)=30t^{\frac{3}{2}}(1-t)+ h(t).

Model 7.

X7(t)=30t(1t)2+h(t).X_7(t)=30t{(1-t)}^2+ h(t).

Model 8.

X8(t)=30t(1t)2+e(t).X_8(t)=30t{(1-t)}^2+ e(t).

Usage

sim_model_ex1(n = 50, p = 30, i_sim = 1)

Arguments

n

Number of curves to generate for each of the two groups. Set to 50 by default.

p

Number of grid points of the curves. Curves are generated over the interval [0,1][0, 1]. Set to 30 grid point by default.

i_sim

Integer set to 1,,81, \ldots, 8.

Value

data matrix of size 2n×p2n \times p.

Examples

sm1 <- sim_model_ex1()
dim(sm1)

Function for generating functional data in one or multiple dimension

Description

The function can generate one-dimensional or multi-dimensional curves. For i_sim 1 or 2, one-dimensional curves are generated. For i_sim 3 or 4, multi-dimensional curves are generated.

Usage

sim_model_ex2(n = 50, p = 150, i_sim = 1)

Arguments

n

Number of curves to generate for each of the two groups. Set to 50 by default.

p

Number of grid points of the curves. Curves are generated over the interval [0,1][0, 1]. Set to 150 grid point by default.

i_sim

Integer set to 1,,41, \ldots, 4 NULL by default in which case a seed is not set.

Value

data matrix of size 2n×p2n \times p if i_sim1,2i\_sim \in {1,2} or an array of dimensions 2n×p×22n \times p \times 2 if i_sim3,4i\_sim \in {3, 4}.

Examples

sm1 <- sim_model_ex2()
dim(sm1) # This should output (100, 150) by default, since n = 50 and p = 150

sm4 <- sim_model_ex2(i_sim = 4)
dim(sm4) # This should output (100, 150, 2) by default, since n = 50 and p = 150