This clustering method accepts a single parameter, called inflation, that controls the granularity of the resulting clustering. Low inflation leads to coarser clusterings; high inflation leads to finer-grained clusterings. It is suggested to try a few values, for example 1.4, 2, and 6.
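To give some intuition for why a larger value yields finer clusters, here is a minimal sketch. It assumes the method is Markov clustering, where inflation is commonly described as raising every entry of a column-stochastic matrix to the power r and then rescaling each column to sum to 1. That assumption, the function name inflate, and the list-of-rows matrix layout are illustrative and not taken from the text above.

```python
def inflate(matrix, r):
    """Raise each entry to the power r, then rescale every column to sum to 1.

    matrix is a list of rows; this layout and the function name are
    illustrative assumptions, not the interface of any real program.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    powered = [[x ** r for x in row] for row in matrix]
    col_sums = [sum(powered[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [
        [powered[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n_cols)]
        for i in range(n_rows)
    ]


m = [[0.50, 0.2],
     [0.25, 0.8],
     [0.25, 0.0]]
inflated = inflate(m, 2.0)
```

With r greater than 1 the largest entry of each column gains mass at the expense of the smaller ones, so higher values sharpen the contrast between strong and weak connections.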
The program mcxload constructs networks from various types of label input. The first is label or column format, where each line in a file contains two white-space separated labels, and optionally a third column containing an edge weight. The second is cluster or fingerprint format, where each line contains a number of labels, representing the set of neighbours or characters for a given entity.
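The label (column) format described above can be illustrated with a small reader. This is only a sketch of the file layout, not of how mcxload works internally: the function name parse_label_lines, the default weight of 1.0 for two-column lines, and the choice to store each edge in both directions are all assumptions made for the example.

```python
from collections import defaultdict


def parse_label_lines(lines, default_weight=1.0):
    """Build a weighted adjacency map from 2- or 3-column label lines.

    Assumptions for this sketch: a missing third column means weight 1.0,
    and every edge is stored symmetrically (undirected graph).
    """
    graph = defaultdict(dict)
    for line in lines:
        fields = line.split()  # splits on any run of white space
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 2 or 3 columns, got: {line!r}")
        a, b = fields[0], fields[1]
        weight = float(fields[2]) if len(fields) == 3 else default_weight
        graph[a][b] = weight
        graph[b][a] = weight
    return {node: dict(nbrs) for node, nbrs in graph.items()}


example = ["cat hat 0.2", "hat bat 0.16", "bat cat"]
network = parse_label_lines(example)
```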
The program mcxarray constructs networks from tabular data such as provided by gene expression arrays. Either Pearson or Spearman correlation can be used. The program can handle missing data in the form of empty columns, NA values (not available/applicable) or NaN (not a number). It is efficient, parallelized and can handle large data sets.
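As an illustration of correlating two data rows that contain missing values, the sketch below computes a Pearson correlation over only those positions where both rows have a value. Whether mcxarray treats missing data this way (pairwise deletion) is not stated above, so that strategy is an assumption, as is the function name pearson_pairwise.

```python
import math


def _missing(value):
    """True for None or NaN; both stand in for missing data in this sketch."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pearson_pairwise(x, y):
    """Pearson correlation over positions where both x and y are present.

    Pairwise deletion is an assumption of this sketch. Returns NaN when
    fewer than two usable pairs remain or when either row is constant.
    """
    pairs = [(a, b) for a, b in zip(x, y) if not _missing(a) and not _missing(b)]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


r = pearson_pairwise([1.0, 2.0, float("nan"), 4.0], [2.0, 4.0, 100.0, 8.0])
```

A Spearman correlation can be obtained from the same function by replacing the values in each row with their ranks first.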
The mcx program in mode query. One main use is to vary a cutoff below which edges are removed, emitting statistics on the resulting thresholded graphs such as the number of components, the number of singletons, the average and median node degrees, and the average and median edge weights. This program can be used for example to find a good correlation cutoff for networks created using mcxarray. It is similarly possible to gauge the same statistics when varying the parameter k in the k-NN transform. In this transform an edge is kept if it occurs in the k edges of highest weight for both of its incident nodes.
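The two operations described above, removing edges below a cutoff and the k-NN transform, can be sketched as follows, together with a few of the statistics mentioned. The function names and the (u, v, weight) edge tuples are illustrative assumptions and do not reflect the program's interface; ties between equal weights are broken arbitrarily by input order here.

```python
def threshold(edges, cutoff):
    """Keep only edges whose weight is at least cutoff."""
    return [(u, v, w) for u, v, w in edges if w >= cutoff]


def mutual_knn(edges, k):
    """Keep an edge only if it is among the k heaviest edges of BOTH endpoints."""
    incident = {}
    for edge in edges:
        u, v, _ = edge
        incident.setdefault(u, []).append(edge)
        incident.setdefault(v, []).append(edge)
    top = {
        node: set(sorted(es, key=lambda e: e[2], reverse=True)[:k])
        for node, es in incident.items()
    }
    return [e for e in edges if e in top[e[0]] and e in top[e[1]]]


def graph_stats(nodes, edges):
    """Number of components, number of singletons, and mean node degree."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = {n: 0 for n in nodes}
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1
        parent[find(u)] = find(v)  # merge the two components
    return {
        "components": len({find(n) for n in nodes}),
        "singletons": sum(1 for n in nodes if degree[n] == 0),
        "mean_degree": sum(degree.values()) / len(nodes),
    }


nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("b", "c", 0.5), ("c", "d", 0.2), ("a", "c", 0.7)]
for cutoff in (0.1, 0.6, 0.8):
    print(cutoff, graph_stats(nodes, threshold(edges, cutoff)))
```

Scanning a range of cutoffs (or values of k) in this way shows where the graph starts to fall apart into many components and singletons.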
The mcx program in mode ctty. It computes betweenness centrality for all nodes in a network, a very compute-intensive task. The program uses the efficient update algorithm by Ulrik Brandes, a clever node-wise parallelizable algorithm. This mode can run on multiple machines, each machine running multiple threads, and hence can make effective use of available resources.
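For readers unfamiliar with the Brandes approach, here is a minimal single-threaded sketch for unweighted, undirected graphs. It is not the program's implementation; the function name and the adjacency-dict input are assumptions. The outer loop over source nodes is the part that is independent per node, which is what makes the computation easy to split across threads or machines.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality for an unweighted, undirected graph.

    adj maps each node to an iterable of its neighbours (assumed symmetric).
    """
    centrality = {v: 0.0 for v in adj}
    for source in adj:  # each source can be processed independently
        order = []                         # nodes in order of discovery
        preds = {v: [] for v in adj}       # predecessors on shortest paths
        sigma = {v: 0 for v in adj}        # number of shortest paths from source
        dist = {v: -1 for v in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:                       # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}      # accumulated dependencies
        while order:                       # visit farthest nodes first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # In an undirected graph every pair is counted from both ends.
    return {v: c / 2.0 for v, c in centrality.items()}


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = betweenness(path)
```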
The mcx program in mode diameter. It computes the diameter of a graph as well as the eccentricity of each node. This is also a computationally intensive task, and this mode can also run on multiple machines, each machine running multiple threads.
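The relationship between eccentricity and diameter can be shown with a short sketch that runs one breadth-first search per node. It assumes an unweighted graph given as an adjacency dict; the function names are illustrative, and for a disconnected graph this sketch only considers the nodes reachable from each start node.

```python
from collections import deque


def eccentricities(adj):
    """Eccentricity of each node: the largest shortest-path distance from it.

    Only nodes reachable from the start node are considered (an assumption
    of this sketch for disconnected inputs).
    """
    result = {}
    for source in adj:  # one independent breadth-first search per node
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        result[source] = max(dist.values())
    return result


def diameter(adj):
    """The diameter is the largest eccentricity over all nodes."""
    return max(eccentricities(adj).values())


path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(eccentricities(path), diameter(path))
```

Because every node needs its own search, the cost grows with the number of nodes times the number of edges, which is why distributing the start nodes over threads and machines pays off.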
The mcx program in mode clcf. It computes the clustering coefficient for each node in a network. This is not a computationally intensive operation, and hence parallelism is not required.
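A minimal sketch of the local clustering coefficient follows, using the common definition for unweighted, undirected graphs: the fraction of pairs of a node's neighbours that are themselves connected. The function name and input layout are assumptions, and nodes with fewer than two neighbours are given the value 0.0 here by convention.

```python
def clustering_coefficient(adj):
    """Local clustering coefficient for every node of an undirected graph.

    adj maps each node to an iterable of neighbours (assumed symmetric).
    Nodes with fewer than two neighbours get 0.0 in this sketch.
    """
    result = {}
    for v, neighbours in adj.items():
        nbrs = set(neighbours) - {v}  # ignore self-loops
        k = len(nbrs)
        if k < 2:
            result[v] = 0.0
            continue
        # Count ordered pairs (u, w) of neighbours that are linked; each
        # undirected link is seen twice, matching the k * (k - 1) ordered pairs.
        linked = sum(1 for u in nbrs for w in set(adj[u]) if w in nbrs and w != u)
        result[v] = linked / (k * (k - 1))
    return result


graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}
coefficients = clustering_coefficient(graph)
```

Each node only needs to inspect its own neighbourhood, which is why this computation is cheap compared with betweenness or diameter.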
The mcx program in mode erdos. It computes ensembles of shortest simple (unweighted) paths between two nodes. It was written with a focus on speed.
The clm program in mode order. Given a set of input clusterings, this program creates a reconciled fully nested set of output clusterings. Additionally, clusters are reordered at all levels such that larger clusters precede smaller clusters. It can output a tree structure that can be converted to Newick format with mcxdump.
The clm program in mode dist. It computes distances between clusterings, according to one of the split/join, variation of information, or Mirkin metrics.
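As an example of such a distance, the sketch below computes the split/join distance between two clusterings of the same n elements, using the usual definition: 2n minus the sum, over the clusters of each clustering, of the largest overlap with any cluster of the other. The function names are illustrative and the code is not derived from the program itself.

```python
def _projection(first, second):
    """Sum over clusters in first of their largest overlap with a cluster in second."""
    return sum(max(len(a & b) for b in second) for a in first)


def split_join_distance(first, second):
    """Split/join distance between two clusterings of the same element set.

    Each clustering is a list of disjoint sets that together cover all elements.
    """
    first = [set(c) for c in first]
    second = [set(c) for c in second]
    n = sum(len(c) for c in first)
    if n != sum(len(c) for c in second):
        raise ValueError("clusterings must cover the same elements")
    return 2 * n - _projection(first, second) - _projection(second, first)


coarse = [{1, 2, 3, 4}]
fine = [{1, 2}, {3, 4}]
d = split_join_distance(coarse, fine)
```

The distance is zero exactly when the two clusterings are identical, and it grows with the number of elements that would have to be moved to turn one clustering into the other by splitting and joining clusters.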
The clm program in mode info. It outputs a simple numerical performance criterion for a clustering. It rewards clusterings both for being granular and for capturing many edges in the input graph. Its criterion lies in the range [0, 1] and achieves 1 only for the canonical clustering of a graph that consists of pairwise disjoint, internally completely connected subparts. In addition, it is affected by differentiation among the edge weights. It is not intended as an optimization criterion, but can be used to detect trends and optionally to spot bad clusterings.
This program can generate random graphs using a uniform edge generation model. It can also shuffle an existing graph while preserving the node degree distribution.
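One common way to shuffle a graph while keeping every node's degree fixed is the double edge swap: pick two edges a-b and c-d and rewire them to a-d and c-b, rejecting swaps that would create a self-loop or a duplicate edge. The text above does not say which technique the program uses, so the sketch below is an assumption, as are the function name and parameters.

```python
import random


def shuffle_preserving_degrees(edges, n_swaps, seed=None):
    """Randomize a simple undirected graph with double edge swaps.

    edges is a list of (u, v) pairs without self-loops or duplicates.
    Gives up after 100 * n_swaps attempts if too few valid swaps exist.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    if len(edges) < 2:
        return edges
    present = {frozenset(e) for e in edges}
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop or leave the graph unchanged
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i] = (a, d)
        edges[j] = (c, b)
        done += 1
    return edges


ring = [(n, (n + 1) % 8) for n in range(8)]
shuffled = shuffle_preserving_degrees(ring, n_swaps=20, seed=1)
```

Each swap removes one edge end from a, b, c and d and gives one back to each of them, so all degrees stay exactly as they were.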