Title: | Integrative Clustering for Heterogeneous Biomedical Datasets |
---|---|
Description: | Integrative context-dependent clustering for heterogeneous biomedical datasets. Identifies local clustering structures in related datasets, and global clusters that exist across the datasets. |
Authors: | Evelina Gabasova |
Maintainer: | Evelina Gabasova <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1 |
Built: | 2025-02-28 04:47:52 UTC |
Source: | https://github.com/evelinag/clusternomics |
Estimate sizes of clusters from global cluster assignments.
clusterSizes(assignments)
assignments | Matrix of cluster assignments, where each row corresponds to the cluster assignments sampled in one MCMC iteration. |
Sizes of individual clusters in each MCMC iteration.
# Generate a simple test dataset
groupCounts <- c(50, 10, 40, 60)
means <- c(-1.5, 1.5)
testData <- generateTestData_2D(groupCounts, means)
datasets <- testData$data

# Fit the model
# 1. Specify the number of clusters
clusterCounts <- list(global = 10, context = c(3, 3))
# 2. Run inference
# The number of iterations is kept small for demonstration purposes;
# use a larger number of iterations in practice!
results <- contextCluster(datasets, clusterCounts,
    maxIter = 10, burnin = 5, lag = 1,
    dataDistributions = 'diagNormal', verbose = TRUE)

# Extract only the sampled global assignments
samples <- results$samples
clusters <- plyr::laply(seq_along(samples), function(i) samples[[i]]$Global)
clusterSizes(clusters)
Compute the posterior co-clustering matrix from global cluster assignments.
coclusteringMatrix(assignments)
assignments | Matrix of cluster assignments, where each row corresponds to the cluster assignments sampled in one MCMC iteration. |
Posterior co-clustering matrix, where element [i, j] represents the posterior probability that data points i and j belong to the same cluster.
# Generate a simple test dataset
groupCounts <- c(50, 10, 40, 60)
means <- c(-1.5, 1.5)
testData <- generateTestData_2D(groupCounts, means)
datasets <- testData$data

# Fit the model
# 1. Specify the number of clusters
clusterCounts <- list(global = 10, context = c(3, 3))
# 2. Run inference
# The number of iterations is kept small for demonstration purposes;
# use a larger number of iterations in practice!
results <- contextCluster(datasets, clusterCounts,
    maxIter = 10, burnin = 5, lag = 1,
    dataDistributions = 'diagNormal', verbose = TRUE)

# Extract only the sampled global assignments
samples <- results$samples
clusters <- plyr::laply(seq_along(samples), function(i) samples[[i]]$Global)
coclusteringMatrix(clusters)
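The co-clustering matrix is often summarised into a single consensus clustering. A minimal sketch of one common approach (not part of the clusternomics API), using base R hierarchical clustering on the matrix from the example above; the choice of k = 4 is illustrative only:

```r
# Sketch: derive a consensus clustering from the posterior co-clustering
# matrix. Assumes `clusters` was computed as in the example above.
coclust <- coclusteringMatrix(clusters)

# Turn co-clustering probabilities into distances: points that are often
# assigned to the same cluster end up close to each other
distances <- as.dist(1 - coclust)
tree <- hclust(distances, method = "average")

# Cut the dendrogram into a chosen number of consensus clusters
consensus <- cutree(tree, k = 4)
table(consensus)
```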
This function fits the context-dependent clustering model to the data using Gibbs sampling. It allows the user to specify different numbers of clusters at the global level as well as at the local, context-specific level.
contextCluster(datasets, clusterCounts, dataDistributions = "diagNormal",
    prior = NULL, maxIter = 1000, burnin = NULL, lag = 3, verbose = FALSE)
datasets | List of data matrices where each matrix represents a context-specific dataset. Each data matrix has the size N times M, where N is the number of data points and M is the dimensionality of the data. The full list of matrices has length C. The number of data points N must be the same for all data matrices. |
clusterCounts | Number of clusters at the global level and in each context. A list with the structure list(global = ..., context = c(...)), where global is the number of global clusters and context is a vector giving the number of clusters in each context, as in the examples below. |
dataDistributions | Distribution of the data in each dataset. Can be either a list of length C, giving the distribution for each context separately, or a single string that applies to all contexts (the examples use 'diagNormal', a Gaussian with diagonal covariance). |
prior | Prior distribution. If NULL (the default), a prior is generated automatically from the data; a prior can also be constructed explicitly with generatePrior or empiricalBayesPrior. |
maxIter | Number of iterations of the Gibbs sampling algorithm. |
burnin | Number of burn-in iterations that will be discarded. If not specified, the algorithm discards the first half of the maxIter samples. |
lag | Used for thinning the samples: only every lag-th sample is kept. |
verbose | Print progress; by default FALSE. |
Returns a list containing the sequence of MCMC states and the log likelihoods of the individual states.
samples | List of assignments sampled from the posterior; each state contains the global cluster assignments (Global) and the context-specific assignments (Context 1, ..., Context C). |
logliks | Log likelihoods during MCMC iterations. |
DIC | Deviance information criterion to help select the number of clusters. Lower values of DIC correspond to better-fitting models. |
# Example with simulated data (see the vignette for details)
# Number of elements in each cluster
groupCounts <- c(50, 10, 40, 60)
# Centers of clusters
means <- c(-1.5, 1.5)
testData <- generateTestData_2D(groupCounts, means)
datasets <- testData$data

# Fit the model
# 1. Specify the number of clusters
clusterCounts <- list(global = 10, context = c(3, 3))
# 2. Run inference
# The number of iterations is kept small for demonstration purposes;
# use a larger number of iterations in practice!
results <- contextCluster(datasets, clusterCounts,
    maxIter = 10, burnin = 5, lag = 1,
    dataDistributions = 'diagNormal', verbose = TRUE)

# Extract results from the samples
# Final state:
state <- results$samples[[length(results$samples)]]
# 1) Assignment to global clusters
globalAssgn <- state$Global
# 2) Context-specific assignments: assignment in a specific dataset (context)
contextAssgn <- state[, "Context 1"]

# Assess the fit of the model with DIC
results$DIC
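Since lower DIC indicates a better fit, the criterion can be used to compare runs with different cluster counts. A minimal sketch (the candidate values are hypothetical; assumes `datasets` from the example above, with iteration counts kept small purely for illustration):

```r
# Sketch: compare DIC across candidate numbers of global clusters
candidates <- c(5, 10, 15)
dics <- sapply(candidates, function(k) {
  fit <- contextCluster(datasets, list(global = k, context = c(3, 3)),
                        maxIter = 10, burnin = 5, lag = 1,
                        dataDistributions = 'diagNormal')
  fit$DIC
})

# Pick the candidate with the lowest DIC
candidates[which.min(dics)]
```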
Fit an empirical Bayes prior to the data
empiricalBayesPrior(datasets, distributions = "diagNormal",
    globalConcentration = 0.1, localConcentration = 0.1, type = "fitRate")
datasets | List of data matrices where each matrix represents a context-specific dataset. Each data matrix has the size N times M, where N is the number of data points and M is the dimensionality of the data. The full list of matrices has length C. The number of data points N must be the same for all data matrices. |
distributions | Distribution of the data in each dataset. Can be either a list of length C, giving the distribution for each context separately, or a single string that applies to all contexts (the examples use 'diagNormal', a Gaussian with diagonal covariance). |
globalConcentration | Prior concentration parameter for the global clusters. Small values of this parameter give larger prior probability to smaller numbers of clusters. |
localConcentration | Prior concentration parameter for the local context-specific clusters. Small values of this parameter give larger prior probability to smaller numbers of clusters. |
type | Type of prior that is fitted to the data. The algorithm can fit either the rate of the prior covariance matrix ('fitRate', as used in the example below) or the full covariance matrix. |
Returns the prior object that can be used as an input for the contextCluster function.
# Example with simulated data (see the vignette for details)
nContexts <- 2
# Number of elements in each cluster
groupCounts <- c(50, 10, 40, 60)
# Centers of clusters
means <- c(-1.5, 1.5)
testData <- generateTestData_2D(groupCounts, means)
datasets <- testData$data

# Generate the prior
fullDataDistributions <- rep('diagNormal', nContexts)
prior <- empiricalBayesPrior(datasets, fullDataDistributions, 0.01, 0.1, 'fitRate')

# Fit the model
# 1. Specify the number of clusters
clusterCounts <- list(global = 10, context = c(3, 3))
# 2. Run inference
# The number of iterations is kept small for demonstration purposes;
# use a larger number of iterations in practice!
results <- contextCluster(datasets, clusterCounts,
    maxIter = 10, burnin = 5, lag = 1,
    dataDistributions = 'diagNormal', prior = prior, verbose = TRUE)
Creates a basic prior distribution for the clustering model, assuming a unit prior covariance matrix for clusters in each dataset.
generatePrior(datasets, distributions = "diagNormal",
    globalConcentration = 0.1, localConcentration = 0.1)
datasets | List of data matrices where each matrix represents a context-specific dataset. Each data matrix has the size N times M, where N is the number of data points and M is the dimensionality of the data. The full list of matrices has length C. The number of data points N must be the same for all data matrices. |
distributions | Distribution of the data in each dataset. Can be either a list of length C, giving the distribution for each context separately, or a single string that applies to all contexts (the examples use 'diagNormal', a Gaussian with diagonal covariance). |
globalConcentration | Prior concentration parameter for the global clusters. Small values of this parameter give larger prior probability to smaller numbers of clusters. |
localConcentration | Prior concentration parameter for the local context-specific clusters. Small values of this parameter give larger prior probability to smaller numbers of clusters. |
Returns the prior object that can be used as an input for the contextCluster function.
# Example with simulated data (see the vignette for details)
nContexts <- 2
# Number of elements in each cluster
groupCounts <- c(50, 10, 40, 60)
# Centers of clusters
means <- c(-1.5, 1.5)
testData <- generateTestData_2D(groupCounts, means)
datasets <- testData$data

# Generate the prior
fullDataDistributions <- rep('diagNormal', nContexts)
prior <- generatePrior(datasets, fullDataDistributions, 0.01, 0.1)

# Fit the model
# 1. Specify the number of clusters
clusterCounts <- list(global = 10, context = c(3, 3))
# 2. Run inference
# The number of iterations is kept small for demonstration purposes;
# use a larger number of iterations in practice!
results <- contextCluster(datasets, clusterCounts,
    maxIter = 10, burnin = 5, lag = 1,
    dataDistributions = 'diagNormal', prior = prior, verbose = TRUE)
Generate a simple 1D dataset with two contexts, where the data are generated from Gaussian distributions. The generated output contains two datasets; each dataset contains 4 global clusters, originating from two local clusters in each context.
generateTestData_1D(groupCounts, means)
groupCounts | Number of data samples in each global cluster. It is assumed to be a vector of four elements, one count for each of the four global clusters (e.g. c(50, 10, 40, 60)). |
means | Means of the simulated clusters. It is assumed to be a vector of two elements, one mean for each of the two local clusters in each context (e.g. c(-1.5, 1.5)). |
Returns the simulated datasets together with the true assignments.
data | List of datasets for each context. This can be used as an input for the contextCluster function. |
groups | True cluster assignments that were used to generate the data. |
groupCounts <- c(50, 10, 40, 60)
means <- c(-1.5, 1.5)
testData <- generateTestData_1D(groupCounts, means)

# Use the dataset as an input for the contextCluster function for testing
datasets <- testData$data
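The returned `groups` element can also be used to inspect the simulated structure visually. A minimal sketch (assumes, as described above, that each context is an N-by-1 matrix and `groups` is a vector of integer assignments):

```r
groupCounts <- c(50, 10, 40, 60)
means <- c(-1.5, 1.5)
testData <- generateTestData_1D(groupCounts, means)

# Plot the first context, coloured by the true global cluster assignment
plot(testData$data[[1]], col = testData$groups,
     xlab = "Data point index", ylab = "Value")
```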
Generate a simple 2D dataset with two contexts, where the data are generated from Gaussian distributions. The generated output contains two datasets; each dataset contains 4 global clusters, originating from two local clusters in each context.
generateTestData_2D(groupCounts, means, variances = NULL)
groupCounts | Number of data samples in each global cluster. It is assumed to be a vector of four elements, one count for each of the four global clusters (e.g. c(50, 10, 40, 60)). |
means | Means of the simulated clusters. It is assumed to be a vector of two elements, one mean for each of the two local clusters in each context (e.g. c(-1.5, 1.5)). |
variances | Optionally, it is possible to specify a different variance for each of the clusters. The variance is assumed to be a vector of two elements, one variance for each of the two local clusters in each context. |
Returns the simulated datasets together with the true assignments.
data | List of datasets for each context. This can be used as an input for the contextCluster function. |
groups | True cluster assignments that were used to generate the data. |
groupCounts <- c(50, 10, 40, 60)
means <- c(-1.5, 1.5)
testData <- generateTestData_2D(groupCounts, means)

# Use the dataset as an input for the contextCluster function for testing
datasets <- testData$data
Estimate number of clusters from global cluster assignments.
numberOfClusters(assignments)
assignments | Matrix of cluster assignments, where each row corresponds to the cluster assignments sampled in one MCMC iteration. |
Number of unique clusters in each MCMC iteration.
# Generate a simple test dataset
groupCounts <- c(50, 10, 40, 60)
means <- c(-1.5, 1.5)
testData <- generateTestData_2D(groupCounts, means)
datasets <- testData$data

# Fit the model
# 1. Specify the number of clusters
clusterCounts <- list(global = 10, context = c(3, 3))
# 2. Run inference
# The number of iterations is kept small for demonstration purposes;
# use a larger number of iterations in practice!
results <- contextCluster(datasets, clusterCounts,
    maxIter = 10, burnin = 5, lag = 1,
    dataDistributions = 'diagNormal', verbose = TRUE)

# Extract only the sampled global assignments
samples <- results$samples
clusters <- plyr::laply(seq_along(samples), function(i) samples[[i]]$Global)
numberOfClusters(clusters)
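Because the function returns one count per MCMC iteration, the output can be summarised into a posterior distribution over the number of occupied clusters. A minimal sketch, assuming `clusters` from the example above:

```r
nClusters <- numberOfClusters(clusters)

# Frequency of each cluster count across MCMC iterations
table(nClusters)

# Posterior mode of the number of occupied clusters
as.integer(names(which.max(table(nClusters))))
```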