lingpy.algorithm package

Subpackages

Submodules

lingpy.algorithm.cluster_util module

Various utility functions which are useful for algorithmic operations

lingpy.algorithm.cluster_util.generate_all_clusters(numbers)

Generate all possible clusters for a number of elements.

Returns:

clr : iterator

An iterator that will yield the next of all possible clusters.

lingpy.algorithm.cluster_util.generate_random_cluster(numbers, bias=False)

Generate a random cluster for a number of elements.

Parameters:

numbers : int

Number of separate entities which should be clustered.

bias : str (default=False)

When set to “lumper” will tend to create larger groups, when set to “splitter” it will tend to produce smaller groups.

Returns:

cluster : list

A list with consecutive ordering of clusters, starting from zero.

lingpy.algorithm.cluster_util.mutate_cluster(clr, chance=0.5)

Mutate a cluster.

Parameters:

clr : cluster

A list with ordered clusters.

chance : float (default=0.5)

The mutation rate for each element in a cluster. If set to 0.5, this means that in 50% of the cases, an element will be assigned to another cluster or a new cluster.

Returns:

valid_cluster : list

A newly clustered list in consecutive order.

lingpy.algorithm.cluster_util.order_cluster(clr)

Order a cluster into the form of a valid cluster.

Parameters:

clr : list

A list with clusters assigned by given each element a specific clusuter ID.

Returns:

valid_cluster : list

A list in which the IDs start from zero and increase consecutively with each new cluster introduced.

lingpy.algorithm.cluster_util.valid_cluster(sequence)

Only allow to have sequences which have consecutive ordering of elements.

Parameters:

sequence : list

A cluster sequence in which elements should be consecutively ordered, starting from 0, and identical segments in the sequence retrieve the same number.

Returns:

valid_cluster : bool

True, if the cluster is valid, and False if it judged to be invalid.

Examples

We define a valid and an invalid cluster sequence:

>>> clrA = [0, 1, 2, 3]
>>> clrB = [1, 1, 2, 3] # should be [0, 0, 1, 2]
>>> from lingpy.algorithm.cluster_util import valid_cluster
>>> valid_cluster(clrA)
True
>>> valid_cluster(clrB)
False

lingpy.algorithm.clustering module

Module provides general clustering functions for LingPy.

lingpy.algorithm.clustering.best_threshold(matrix, trange=(0.3, 0.7, 0.05))

Calculate the best threshold by maximizing partition density for a given range of thresholds.

Notes

This method makes use of the idea of partition density proposed in Ahn2010.

lingpy.algorithm.clustering.check_taxon_names(taxa)
lingpy.algorithm.clustering.find_threshold(matrix, thresholds=[0.9, 0.8500000000000001, 0.8, 0.75, 0.7000000000000001, 0.65, 0.6000000000000001, 0.55, 0.5, 0.45, 0.4, 0.35000000000000003, 0.30000000000000004, 0.25, 0.2, 0.15000000000000002, 0.1, 0.05], logs=True)

Use a variant of the method by Apeltsin2011 in order to find an optimal threshold.

Parameters:

matrix : list

The distance matrix for which the threshold shall be determined.

thresholds : list (default=[i*0.05 for i in range(1,19)[::-1])

The range of thresholds that shall be tested.

logs : {bool,builtins.function} (default=True)

If set to True, the logarithm of the score beyond the threshold will be assigned as weight to the graph. If set to c{False} all weights will be set to 1. Use a custom function to define individual ways to calculate the weights.

Returns:

threshold : {float,None}

If a float is returned, this is the threshold identified by the method. If None is returned, no threshold could be identified.

Notes

This is a very simple method that may not work well depending on the dataset. So we recommend to use it with great care.

lingpy.algorithm.clustering.flat_cluster(method, threshold, matrix, taxa=None, revert=False)

Carry out a flat cluster analysis based on linkage algorithms.

Parameters:

method : { “upgma”, “single”, “complete”, “ward”}

Select between ‘ugpma’, ‘single’, and ‘complete’. You can also test “ward”, but there’s no guarantee that this is the correct algorithm.

threshold : float

The threshold which terminates the algorithm.

matrix : list

A two-dimensional list containing the distances.

taxa : list (default=None)

A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns:

clusters : dict

A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *
>>> from lingpy.algorithm import squareform

Create a list of arbitrary taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary distance matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]

Carry out the flat cluster analysis.

>>> flat_cluster('upgma',0.6,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}
lingpy.algorithm.clustering.flat_upgma(threshold, matrix, taxa=None, revert=False)

Carry out a flat cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters:

threshold : float

The threshold which terminates the algorithm.

matrix : list

A two-dimensional list containing the distances.

taxa : list (default=None)

A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns:

clusters : dict

A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *
>>> from lingpy.algorithm import squareform

Create a list of arbitrary taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary distance matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]

Carry out the flat cluster analysis.

>>> flat_upgma(0.6,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}
lingpy.algorithm.clustering.fuzzy(threshold, matrix, taxa, method='upgma', revert=False)

Create fuzzy cluster of a given distance matrix.

Parameters:

threshold : float

The threshold that shall be used for the basic clustering of the data.

matrix : list

A two-dimensional list containing the distances.

taxa : list

An list containing the names of all taxa corresponding to the distances in the matrix.

method : { “upgma”, “single”, “complete” } (default=”upgma”)

Select the method for the flat cluster analysis.

distances : bool

If set to “False”, only the topology of the tree will be returned.

revert : bool (default=False)

Specify whether a reverted dictionary should be returned.

Returns:

cluster : dict

A dictionary with cluster-IDs as keys and a list as value, containing the taxa that are assigned to a given cluster-ID.

See also

link_clustering

Notes

This is a very simple fuzzy clustering algorithm. It basically does nothing else than removing taxa successively from the matrix, flat-clustering the remaining taxa with the corresponding threshold, and then returning a combined “consensus” cluster in which taxa may be assigned to multiple clusters.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *
from lingpy.algorithm import squareform

Create a list of arbitrary taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary distance matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]

Carry out the fuzzy flat cluster analysis.

>>> fuzzy(0.5,matrix,taxa)
{1: ['Swedish', 'Icelandic'], 2: ['Dutch', 'German'], 3: ['Dutch', 'English']}

Carry out a link clustering analysis using the method by Ahn2010.

Parameters:

threshold : {float, bool}

The threshold that shall be used for the initial selection of links assigned to the data. If set to c{False}, the weights from the matrix will be used directly.

matrix : list

A two-dimensional list containing the distances.

taxa : list

An list containing the names of all taxa corresponding to the distances in the matrix.

link_threshold : float (default=0.5)

The threshold that shall be used for the internal clustering of the data.

matrix_type : {“distances”,”similarities”,”weights”} (default=”distances”)

Specify the type of the matrix. If the matrix contains distance data, it will be adapted to similarity data. If it contains “similarities”, no adaptation is needed. If it contains “weights”, a weighted version of link clustering (see the supplementary in Ahn2010 for details) ]will be carried out.

Returns:

cluster : dict

A dictionary with cluster-IDs as keys and a list as value, containing the taxa that are assigned to a given cluster-ID.

See also

fuzzy

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *
>>> from lingpy.algorithm import squareform

Create a list of arbitrary taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary distance matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]

Carry out the link-clustering analysis.

>>> link_clustering(0.5,matrix,taxa)
{1: ['Dutch', 'English', 'German'], 2: ['Icelandic', 'Swedish']}
lingpy.algorithm.clustering.matrix2groups(threshold, matrix, taxa, cluster_method='upgma')

Calculate flat cluster of distance matrix.

Parameters:

threshold : float

The threshold to be used for the calculation.

matrix : list

The distance matrix to be used.

taxa : list

A list of the taxa in the distance matrix.

cluster_method : {“upgma”, “mcl”, “single”, “complete”} (default=”upgma”)

Returns:

groups : dict

A dictionary with the taxa as keys and the group assignment as values.

Notes

This function is important for internal calculations within wordlist. It is not recommended for further use.

lingpy.algorithm.clustering.matrix2tree(matrix, taxa, tree_calc='neighbor', distances=True, filename='')

Calculate a tree of a given distance matrix.

Parameters:

matrix : list

The distance matrix to be used.

taxa : list

A list of the taxa in the distance matrix.

tree_calc : str (default=”neighbor”)

The method for tree calculation that shall be used. Select between:

distances : bool (default=True)

If set to c{True}, distances will be included in the tree-representation.

filename : str (default=’’)

If a filename is specified, the data will be written to that file.

Returns:

tree : ~lingpy.thirdparty.cogent.tree.PhyloNode

A ~lingpy.thirdparty.cogent.tree.PhyloNode object for handling tree files.

lingpy.algorithm.clustering.mcl(threshold, matrix, taxa, max_steps=1000, inflation=2, expansion=2, add_self_loops=True, revert=False, logs=True, matrix_type='distances')

Carry out a clustering using the MCL algorithm (Dongen2000).

Parameters:

threshold : {float, bool}

The threshold that shall be used for the initial selection of links assigned to the data. If set to c{False}, the weights from the matrix will be used directly.

matrix : list

A two-dimensional list containing the distances.

taxa : list

An list containing the names of all taxa corresponding to the distances in the matrix.

max_steps : int (default=1000)

Maximal number of iterations.

inflation : int (default=2)

Inflation parameter for the MCL algorithm.

expansion : int (default=2)

Expansion parameter of the MCL algorithm.

add_self_loops : {True, False, builtins.function} (default=True)

Determine whether self-loops should be added, and if so, how they should be weighted. If a function for the calculation of self-loops is given, it will take the whole column of the matrix for each taxon as input.

logs : { bool, function } (default=True)

If set to c{True}, the logarithm of the score beyond the threshold will be assigned as weight to the graph. If set to c{False} all weights will be set to 1. Use a custom function to define individual ways to calculate the weights.

matrix_type : { “distances”, “similarities” }

Specify the type of the matrix. If the matrix contains distance data, it will be adapted to similarity data. If it contains “similarities”, no adaptation is needed.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *
>>> from lingpy.algorithm import squareform

Create a list of arbitrary taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary distance matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]

Carry out the link-clustering analysis.

>>> mcl(0.5,matrix,taxa)
{1: ['German', 'English', 'Dutch'], 2: ['Swedish', 'Icelandic']}
lingpy.algorithm.clustering.neighbor(matrix, taxa, distances=True)

Function clusters data according to the Neighbor-Joining algorithm (Saitou1987).

Parameters:

matrix : list

A two-dimensional list containing the distances.

taxa : list

An list containing the names of all taxa corresponding to the distances in the matrix.

distances : bool (default=True)

If set to False, only the topology of the tree will be returned.

Returns:

newick : str

A string in newick-format which can be further used in biological software packages to view and plot the tree.

See also

upgma

Examples

Function is automatically imported when importing lingpy.

>>> from lingpy import *
>>> from lingpy.algorithm import squareform

Create an arbitrary list of taxa.

>>> taxa = ['Norwegian','Swedish','Icelandic','Dutch','English']

Create an arbitrary matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])

Carry out the cluster analysis.

>>> neighbor(matrix,taxa)
'(((Norwegian,(Swedish,Icelandic)),English),Dutch);'
lingpy.algorithm.clustering.partition_density(matrix, t)

Calculate partition density for a given threshold on a distance matrix.

Notes

See Ahn2012 for details on the calculation of partition density in a given network.

lingpy.algorithm.clustering.upgma(matrix, taxa, distances=True)

Carry out a cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters:

matrix : list

A two-dimensional list containing the distances.

taxa : list

An list containing the names of all taxa corresponding to the distances in the matrix.

distances : bool (default=True)

If set to False, only the topology of the tree will be returned.

Returns:

newick : str

A string in newick-format which can be further used in biological software packages to view and plot the tree.

See also

neighbor

Examples

Function is automatically imported when importing lingpy.

>>> from lingpy import *
>>> from lingpy.algorithm import squareform

Create an arbitrary list of taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])

Carry out the cluster analysis.

>>> upgma(matrix,taxa,distances=False)
'((Swedish,Icelandic),(English,(German,Dutch)));'

lingpy.algorithm.extra module

Adapting specific cluster algorithms from scikit-learn to LingPy.

lingpy.algorithm.extra.affinity_propagation(threshold, matrix, taxa, revert=False)

Compute affinity propagation from the matrix.

Parameters:

threshold : float

The threshold for clustering you want to use.

matrix : list

The two-dimensional matrix passed as list or array.

taxa : list

The list of taxon names. If set to “False” a fake list of taxon names will be created, giving a positive numerical ID in increasing order for each column in the matrix.

revert : bool

If set to “False”, don’t return taxon names but simply the language identifiers and their labels as a dictionary. Otherwise returns a dictionary with labels as keys and list of taxon names as values.

Returns:

clusters : dict

Either a dictionary of taxon identifiers and labels, or a dictionary of labels and taxon names.

Notes

Affinity propagation is a clustering method originally proposed by Frey2007.

Requires the scikitlearn package, downloadable from http://scikit-learn.org/.

lingpy.algorithm.extra.dbscan(threshold, matrix, taxa, revert=False, min_samples=1)

Compute DBSCAN cluster analysis.

Parameters:

threshold : float

The threshold for clustering you want to use.

matrix : list

The two-dimensional matrix passed as list or array.

taxa : list

The list of taxon names. If set to “False” a fake list of taxon names will be created, giving a positive numerical ID in increasing order for each column in the matrix.

revert : bool

If set to “False”, don’t return taxon names but simply the language identifiers and their labels as a dictionary. Otherwise returns a dictionary with labels as keys and list of taxon names as values.

min_samples : int (default=1)

The minimal samples parameter of the DBCSCAN method from the SKLEARN package.

Returns:

clusters : dict

Either a dictionary of taxon identifiers and labels, or a dictionary of labels and taxon names.

Notes

This method does not work as expected, probably since it normally requires distances between points as input. We list it only for completeness here, but urge to be careful when using the code and checking properly our implementation in the source code.

Requires the scikitlearn package, downloadable from http://scikit-learn.org/.

lingpy.algorithm.extra.infomap_clustering(threshold, matrix, taxa=False, revert=False)

Compute the Infomap clustering analysis of the data.

Parameters:

threshold : float

The threshold for clustering you want to use.

matrix : list

The two-dimensional matrix passed as list or array.

taxa : list

The list of taxon names. If set to “False” a fake list of taxon names will be created, giving a positive numerical ID in increasing order for each column in the matrix.

revert : bool

If set to “False”, don’t return taxon names but simply the language identifiers and their labels as a dictionary. Otherwise returns a dictionary with labels as keys and list of taxon names as values.

Returns:

clusters : dict

Either a dictionary of taxon identifiers and labels, or a dictionary of labels and taxon names.

Notes

Infomap clustering is a community detection method originally proposed by Rosvall2008. This method requires the igraph package, downloadable from http://igraph.org/.

Module contents

Package for specific algorithms and time-intensive routines.