lingpy.algorithm.cython package

Submodules

lingpy.algorithm.cython.calign module

lingpy.algorithm.cython.cluster module

lingpy.algorithm.cython.cluster.flat_cluster()

Carry out a flat cluster analysis using UPGMA, single-linkage, or complete-linkage clustering.

Parameters:

method : str { ‘upgma’, ‘single’, ‘complete’ }

Select between ‘upgma’, ‘single’, and ‘complete’.

threshold : float

The threshold which terminates the algorithm.

matrix : list or numpy.array

A two-dimensional list containing the distances.

taxa : list (default = [])

A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns:

clusters : dict

A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *

Create a list of arbitrary taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary distance matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
array([[ 0.  ,  0.5 ,  0.67,  0.8 ,  0.2 ],
       [ 0.5 ,  0.  ,  0.4 ,  0.7 ,  0.6 ],
       [ 0.67,  0.4 ,  0.  ,  0.8 ,  0.8 ],
       [ 0.8 ,  0.7 ,  0.8 ,  0.  ,  0.3 ],
       [ 0.2 ,  0.6 ,  0.8 ,  0.3 ,  0.  ]])

Carry out the flat cluster analysis.

>>> flat_cluster('upgma',0.5,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}

lingpy.algorithm.cython.cluster.flat_upgma()

Carry out a flat cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters:

threshold : float

The threshold which terminates the algorithm.

matrix : list or numpy.array

A two-dimensional list containing the distances.

taxa : list (default = [])

A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns:

clusters : dict

A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *

Create a list of arbitrary taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary distance matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
array([[ 0.  ,  0.5 ,  0.67,  0.8 ,  0.2 ],
       [ 0.5 ,  0.  ,  0.4 ,  0.7 ,  0.6 ],
       [ 0.67,  0.4 ,  0.  ,  0.8 ,  0.8 ],
       [ 0.8 ,  0.7 ,  0.8 ,  0.  ,  0.3 ],
       [ 0.2 ,  0.6 ,  0.8 ,  0.3 ,  0.  ]])

Carry out the flat cluster analysis.

>>> flat_upgma(0.5,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}

lingpy.algorithm.cython.cluster.neighbor()

Cluster data according to the Neighbor-Joining algorithm (Saitou1987).

Parameters:

matrix : list or numpy.array

A two-dimensional list containing the distances.

taxa : list

A list containing the names of all taxa corresponding to the distances in the matrix.

distances : bool

If set to False, only the topology of the tree will be returned.

Returns:

newick : str

A string in Newick format which can be used in biological software packages to view and plot the tree.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *

Create an arbitrary list of taxa.

>>> taxa = ['Norwegian','Swedish','Icelandic','Dutch','English']

Create an arbitrary matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])

Carry out the cluster analysis.

>>> neighbor(matrix,taxa)
'(((Norwegian,(Swedish,Icelandic)),English),Dutch);'

lingpy.algorithm.cython.cluster.upgma()

Carry out a cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters:

matrix : list or numpy.array

A two-dimensional list containing the distances.

taxa : list

A list containing the names of all taxa corresponding to the distances in the matrix.

distances : bool

If set to False, only the topology of the tree will be returned.

Returns:

newick : str

A string in Newick format which can be used in biological software packages to view and plot the tree.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *

Create an arbitrary list of taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])

Carry out the cluster analysis.

>>> upgma(matrix,taxa,distances=False)
'((Swedish,Icelandic),(English,(German,Dutch)));'

lingpy.algorithm.cython.compilePYX module

This script handles the compilation of Cython files to C and to C extension modules.

lingpy.algorithm.cython.compilePYX.main()

lingpy.algorithm.cython.compilePYX.pyx2py(infile, debug=False)

lingpy.algorithm.cython.malign module

This module provides various alignment functions in an optimized version.

lingpy.algorithm.cython.malign.edit_dist()

Return the edit-distance between two strings.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

normalized : bool

Indicate whether you want the normalized or the unnormalized edit distance to be returned.

Returns:

dist : { int, float }

Either the normalized or the unnormalized edit distance.

Notes

This function computes the edit distance between two list-type objects. We recommend using it if you need a fast implementation. Otherwise, especially if you want to pass strings, we recommend the wrapper function with the same name in the pairwise module.
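The computation described above can be illustrated with a minimal pure-Python sketch. It is not the Cython implementation: the name edit_dist_sketch is hypothetical, and normalizing by the length of the longer sequence is an assumption made for this example.

```python
def edit_dist_sketch(seq_a, seq_b, normalized=False):
    """Levenshtein distance between two lists of segments (illustrative only)."""
    m, n = len(seq_a), len(seq_b)
    # Row 0 of the dynamic-programming matrix: j insertions.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    dist = prev[n]
    if normalized:
        # Assumption: divide by the length of the longer sequence.
        longest = max(m, n)
        return dist / longest if longest else 0.0
    return dist


dist = edit_dist_sketch(list("kitten"), list("sitting"))
norm = edit_dist_sketch(list("kitten"), list("sitting"), normalized=True)
```

Because the sketch only compares items for equality, it works on any list of hashable or comparable segments, not just single characters.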

lingpy.algorithm.cython.malign.nw_align()

Align two sequences using the Needleman-Wunsch algorithm.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

scorer : dict

A dictionary containing tuples of two segments as keys and numbers as values.

gap : int

The gap penalty.

Returns:

alignment : tuple

A tuple of the two aligned sequences, and the similarity score.

Notes

This function is a very straightforward implementation of the Needleman-Wunsch algorithm (Needleman1970). We recommend using it if you want to test your own scoring dictionaries and profit from a fast implementation (since we use Cython, the implementation is faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the NW algorithm without specifying a scoring dictionary, we recommend the wrapper function with the same name in the pairwise module.
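To show how a scoring dictionary and a gap penalty interact in global alignment, here is a minimal pure-Python sketch of the Needleman-Wunsch recursion. It is an illustration, not the Cython code: the name nw_align_sketch, the “-” gap symbol, and the convention that the gap penalty is a negative number added to the score are assumptions of this example.

```python
def nw_align_sketch(seq_a, seq_b, scorer, gap=-1):
    """Global alignment of two lists; returns (aligned_a, aligned_b, score)."""
    m, n = len(seq_a), len(seq_b)
    matrix = [[0] * (n + 1) for _ in range(m + 1)]
    # Borders: aligning a prefix against nothing costs one gap per segment.
    for i in range(1, m + 1):
        matrix[i][0] = i * gap
    for j in range(1, n + 1):
        matrix[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]],
                matrix[i - 1][j] + gap,  # gap in seq_b
                matrix[i][j - 1] + gap,  # gap in seq_a
            )
    # Traceback from the bottom-right corner to the origin.
    alm_a, alm_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and matrix[i][j]
                == matrix[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]]):
            alm_a.append(seq_a[i - 1])
            alm_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and matrix[i][j] == matrix[i - 1][j] + gap:
            alm_a.append(seq_a[i - 1])
            alm_b.append("-")
            i -= 1
        else:
            alm_a.append("-")
            alm_b.append(seq_b[j - 1])
            j -= 1
    return alm_a[::-1], alm_b[::-1], matrix[m][n]


# Hypothetical scorer: +1 for identical segments, -1 otherwise.
segments = "abc"
scorer = {(x, y): (1 if x == y else -1) for x in segments for y in segments}
alm_a, alm_b, score = nw_align_sketch(list("abc"), list("ac"), scorer, gap=-1)
```

The scorer must contain an entry for every segment pair that can occur, which is why the example builds it over the full alphabet.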

lingpy.algorithm.cython.malign.restricted_edit_dist()

Return the restricted edit-distance between two strings.

Parameters:

seqA, seqB : list

The two sequences, passed as lists.

resA, resB : str

The restrictions passed as a string with the same length as the corresponding sequence. We note a restriction if the strings show different symbols in their restriction string. If the symbols are identical, it is modeled as a non-restriction.

normalized : bool

Determine whether you want to return the normalized or the unnormalized edit distance.

Notes

Restrictions follow the definition of Heeringa2006: segments that are not allowed to match are given an infinite penalty. We model restrictions as strings, for example consisting of the letters “c” and “v”. The sequence “woldemort” could thus be modeled as “cvccvcvcc”. When aligning it with the sequence “walter” and its restriction string “cvccvc”, matching two segments whose restriction symbols differ is heavily penalized, which prohibits aligning “vowels” with “consonants” (“v” and “c”).
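The effect of the restriction strings can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the Cython code: the name restricted_edit_dist_sketch is hypothetical, insertions and deletions are assumed to cost 1, and the sketch returns only the unnormalized distance.

```python
def restricted_edit_dist_sketch(seq_a, seq_b, res_a, res_b):
    """Edit distance in which segments with different restriction symbols
    may never be matched or substituted (illustrative only)."""
    inf = float("inf")
    m, n = len(seq_a), len(seq_b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            if res_a[i - 1] != res_b[j - 1]:
                cost = inf  # restricted: this pairing is forbidden
            elif seq_a[i - 1] == seq_b[j - 1]:
                cost = 0
            else:
                cost = 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n]


# A vowel may not be substituted by a consonant, so the cheapest path
# is one deletion plus one insertion.
blocked = restricted_edit_dist_sketch(["a"], ["t"], "v", "c")
# With identical restriction symbols a plain substitution is allowed.
allowed = restricted_edit_dist_sketch(["a"], ["t"], "v", "v")
```

Since a forbidden pairing can always be replaced by a deletion followed by an insertion, the result stays finite even when every diagonal step is blocked.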

lingpy.algorithm.cython.malign.structalign()

Carry out a structural alignment analysis using Dijkstra’s algorithm.

Parameters:

seqA, seqB : str

The input sequences.

restricted_chars : str (default = “”)

The characters which are used to separate secondary from primary segments in the input sequences. Currently, the use of restricted chars may fail to yield an alignment.

Notes

Structural alignment is hereby understood as an alignment of two sequences whose alphabets differ. The algorithm returns all alignments with minimal edit distance. Edit distance in this context refers to the number of edit operations that are needed in order to convert one sequence into the other, with repeated edit operations being penalized only once.

lingpy.algorithm.cython.malign.sw_align()

Align two sequences using the Smith-Waterman algorithm.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

scorer : dict

A dictionary containing tuples of two segments as keys and numbers as values.

gap : int

The gap penalty.

Returns:

alignment : tuple

A tuple of the two aligned sequences, and the similarity score.

Notes

This function is a very straightforward implementation of the Smith-Waterman algorithm (Smith1981). We recommend using it if you want to test your own scoring dictionaries and profit from a fast implementation (since we use Cython, the implementation is faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the SW algorithm without specifying a scoring dictionary, we recommend the wrapper function with the same name in the pairwise module.
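For comparison with global alignment, the following pure-Python sketch shows the core of the Smith-Waterman recursion. It is only an illustration: the name sw_align_sketch and the “-” gap symbol are hypothetical, the gap penalty is assumed to be a negative number, and the sketch returns just the locally aligned parts, which may differ from the layout of the tuple returned by the Cython function.

```python
def sw_align_sketch(seq_a, seq_b, scorer, gap=-1):
    """Local alignment of two lists; returns (aligned_a, aligned_b, score)."""
    m, n = len(seq_a), len(seq_b)
    matrix = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score = max(
                0,  # local alignment never drops below zero
                matrix[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]],
                matrix[i - 1][j] + gap,
                matrix[i][j - 1] + gap,
            )
            matrix[i][j] = score
            if score > best:
                best, best_pos = score, (i, j)
    # Trace back from the best cell until a zero cell is reached.
    alm_a, alm_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and matrix[i][j] > 0:
        if (matrix[i][j]
                == matrix[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]]):
            alm_a.append(seq_a[i - 1])
            alm_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif matrix[i][j] == matrix[i - 1][j] + gap:
            alm_a.append(seq_a[i - 1])
            alm_b.append("-")
            i -= 1
        else:
            alm_a.append("-")
            alm_b.append(seq_b[j - 1])
            j -= 1
    return alm_a[::-1], alm_b[::-1], best


seq_a, seq_b = list("xxabcyy"), list("zabcz")
# Hypothetical scorer: +1 for identical segments, -1 otherwise.
alphabet = set(seq_a) | set(seq_b)
scorer = {(x, y): (1 if x == y else -1) for x in alphabet for y in alphabet}
local_a, local_b, local_score = sw_align_sketch(seq_a, seq_b, scorer)
```

The only differences from the global recursion are the zero floor in each cell and the traceback that starts at the best-scoring cell instead of the corner.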

lingpy.algorithm.cython.malign.we_align()

Align two sequences using the Waterman-Eggert algorithm.

Parameters:

seqA, seqB : list

The input sequences passed as a list.

scorer : dict

A dictionary containing tuples of two segments as keys and numbers as values.

gap : int

The gap penalty.

Returns:

alignments : list

A list consisting of tuples. Each tuple gives the alignment of one of the subsequences of the input sequences. Each tuple contains the aligned part of the first, the aligned part of the second sequence, and the score of the alignment.

Notes

This function is a very straightforward implementation of the Waterman-Eggert algorithm (Waterman1987). We recommend using it if you want to test your own scoring dictionaries and profit from a fast implementation (since we use Cython, the implementation is faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the WE algorithm without specifying a scoring dictionary, we recommend the wrapper function with the same name in the pairwise module.

lingpy.algorithm.cython.misc module

class lingpy.algorithm.cython.misc.ScoreDict

Bases: object

This class allows quick access to scoring functions using dictionary syntax.

Parameters:

chars : list

The list of all character tokens for the scoring dictionary.

matrix : list

A two-dimensional scoring matrix.

Notes

Since this class has dictionary syntax, you can always also just create a dictionary in order to store your scoring functions. Scoring dictionaries should contain a tuple of segments to be compared as a key, and a float or integer as a value, with negative values indicating dissimilarity, and positive values similarity.
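As the note says, a plain dictionary can serve the same purpose. The following sketch builds one from a list of characters and a two-dimensional matrix; the variable names and the score values are made up for illustration.

```python
from itertools import product

chars = ["a", "b", "c"]
# Hypothetical scores: positive for similarity, negative for dissimilarity.
matrix = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]

# Keys are tuples of two segments, values are the corresponding scores.
scorer = {
    (char_a, char_b): matrix[i][j]
    for (i, char_a), (j, char_b) in product(enumerate(chars), repeat=2)
}

score_ab = scorer["a", "b"]
score_aa = scorer["a", "a"]
```

Unlike the class, a plain dictionary raises a KeyError for segment pairs that were never stored, so it has to cover every pair that can occur in the data.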

Examples

Initialize a ScoreDict object:

>>> from lingpy.algorithm.cython.misc import ScoreDict
>>> scorer = ScoreDict(['a', 'b'], [[1, -1], [-1, 1]])

Retrieve scores:

>>> scorer['a', 'b']
-1
>>> scorer['a', 'a']
1
>>> scorer['a', 'X']
-22.5

lingpy.algorithm.cython.misc.squareform()

A simplified version of the scipy.spatial.distance.squareform() function.

Parameters:

x : numpy.array or list

The one-dimensional flat representation of a symmetric distance matrix.

Returns:

matrix : numpy.array

The two-dimensional redundant representation of a symmetric distance matrix.
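The conversion from the flat to the redundant representation can be sketched in pure Python. The name squareform_sketch is hypothetical, and the sketch returns a list of lists rather than a numpy.array.

```python
from math import isqrt


def squareform_sketch(flat):
    """Expand the flat upper triangle of a symmetric distance matrix."""
    # Solve n * (n - 1) / 2 == len(flat) for the number of taxa n.
    n = (1 + isqrt(1 + 8 * len(flat))) // 2
    if n * (n - 1) // 2 != len(flat):
        raise ValueError("length of input does not fit a square matrix")
    matrix = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Fill both halves so the matrix is symmetric.
            matrix[i][j] = matrix[j][i] = flat[k]
            k += 1
    return matrix


square = squareform_sketch([0.5, 0.67, 0.8, 0.2, 0.4, 0.7, 0.6, 0.8, 0.8, 0.3])
```

The flat input lists the upper triangle row by row, so ten values yield a matrix for five taxa with zeros on the diagonal.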

lingpy.algorithm.cython.misc.transpose()

Transpose a matrix along its two dimensions.

Parameters:

matrix : list

A two-dimensional list.
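A pure-Python equivalent of such a transposition, given here only as an illustration with the hypothetical name transpose_sketch, could look like this:

```python
def transpose_sketch(matrix):
    """Swap rows and columns of a two-dimensional list."""
    # zip(*matrix) groups the i-th items of all rows into one tuple.
    return [list(column) for column in zip(*matrix)]


transposed = transpose_sketch([[1, 2, 3], [4, 5, 6]])
```

The sketch assumes that all rows have the same length; zip would silently truncate longer rows.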

lingpy.algorithm.cython.talign module

lingpy.algorithm.cython.talign.align_pair()

Align a pair of sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

mode : { “global”, “local”, “overlap”, “dialign” }

Select the mode for the alignment analysis (“overlap” refers to semi-global alignments).

distance : int (default=0)

Select whether you want distances or similarities to be returned (0 indicates similarities, 1 indicates distances, 2 indicates both).

Returns:

alignment : tuple

The aligned sequences and the similarity or distance scores, or both.

Notes

This is a utility function that allows calling any of the four classical alignment functions (lingpy.algorithm.cython.talign.globalign, lingpy.algorithm.cython.talign.semi_globalign, lingpy.algorithm.cython.talign.localign, lingpy.algorithm.cython.talign.dialign) and their secondary counterparts.

lingpy.algorithm.cython.talign.align_pairs()

Align multiple sequence pairs.

Parameters:

seqs : list

The sequences to be aligned, passed as lists.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

mode : { “global”, “local”, “overlap”, “dialign” }

Select the mode for the alignment analysis (“overlap” refers to semi-global alignments).

distance : int (default=0)

Indicate whether distances or similarities should be returned.

Returns:

alignments : list

A list of tuples, containing the aligned sequences, and the similarity or the distance scores.

Notes

This function aligns all pairs which are passed to it.

lingpy.algorithm.cython.talign.align_pairwise()

Align all sequences pairwise.

Parameters:

seqs : list

The sequences to be aligned, passed as lists.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

mode : { “global”, “local”, “overlap”, “dialign” }

Select the mode for the alignment analysis (“overlap” refers to semi-global alignments).

Returns:

alignments : list

A list of tuples, containing the aligned sequences, the similarity and the distance scores.

Notes

This function aligns all possible pairs between the sequences you pass to it. It is important for multiple alignment, where it can be used to construct the guide tree.
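The set of pairs covered by “all possible pairs” can be enumerated as in the following sketch. The variable names are hypothetical, and the order in which the Cython function visits the pairs is an assumption.

```python
from itertools import combinations

seqs = [list("woldemort"), list("walter"), list("waldemar")]

# Every unordered pair of sequence indices, each pair exactly once.
pairs = list(combinations(range(len(seqs)), 2))

# For n sequences there are n * (n - 1) / 2 pairs, which is also the
# length of the flat distance matrix needed for a guide tree.
n_pairs = len(seqs) * (len(seqs) - 1) // 2
```

Each pair would then be aligned once, and the resulting scores can be stored in the flat form that squareform() expands into a full matrix.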

lingpy.algorithm.cython.talign.align_profile()

Align two profiles using the basic modes.

Parameters:

profileA, profileB : list

Two-dimensional list for each of the profiles.

gop : int

The gap opening penalty.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

mode : { “global”, “overlap”, “dialign” }

Select one of the basic modes for alignment analyses.

gap_weight : float

This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns:

alignment : tuple

The aligned profiles, and the overall similarity of the profiles.

Notes

This function computes alignments of two profiles of multiple sequences (see Durbin2002 for details on profiles) and is important for multiple alignment analyses.

lingpy.algorithm.cython.talign.dialign()

Carry out dialign alignment of two sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

M, N : int

The length of the two sequences.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns:

alignment : tuple

The aligned sequences and the similarity score.

Notes

This algorithm carries out dialign alignment (Morgenstern1996).

lingpy.algorithm.cython.talign.globalign()

Carry out global alignment of two sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

M, N : int

The length of the two sequences.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns:

alignment : tuple

The aligned sequences and the similarity score.

Notes

This algorithm carries out classical Needleman-Wunsch alignment (Needleman1970).

lingpy.algorithm.cython.talign.localign()

Carry out local alignment of two sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

M, N : int

The length of the two sequences.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns:

alignment : tuple

The aligned sequences and the similarity score.

Notes

This algorithm carries out local alignment (Smith1981).

lingpy.algorithm.cython.talign.score_profile()

Basic function for the scoring of profiles.

Parameters:

colA, colB : list

The two columns of a profile.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

gap_weight : float (default=0.0)

This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns:

score : float

The score for the profile.

Notes

This function handles how profiles are scored.
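One plausible reading of the gap_weight parameter is sketched below. This is an assumption for illustration, not the Cython code: the name score_profile_sketch and the “-” gap symbol are hypothetical, and averaging over all segment pairs of the two columns is one possible scheme among others.

```python
def score_profile_sketch(col_a, col_b, scorer, gap_weight=0.0, gap="-"):
    """Average pairwise score of two profile columns (illustrative only)."""
    total = 0.0
    weight = 0.0
    for seg_a in col_a:
        for seg_b in col_b:
            if seg_a != gap and seg_b != gap:
                total += scorer[seg_a, seg_b]
                weight += 1.0
            else:
                # Gapped pairs add no score; they only enlarge the
                # denominator, by as much as gap_weight says.
                weight += gap_weight
    return total / weight if weight else 0.0


# Hypothetical scorer: +1 for identical segments, -1 otherwise.
scorer = {(x, y): (1 if x == y else -1) for x in "ab" for y in "ab"}
ignored = score_profile_sketch(["a", "-"], ["a"], scorer, gap_weight=0.0)
counted = score_profile_sketch(["a", "-"], ["a"], scorer, gap_weight=1.0)
```

With a weight of 0 the gapped pair is ignored entirely, while a weight of 1 lets it dilute the column score as much as a scored pair would.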

lingpy.algorithm.cython.talign.semi_globalign()

Carry out semi-global alignment of two sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

M, N : int

The length of the two sequences.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns:

alignment : tuple

The aligned sequences and the similarity score.

Notes

This algorithm carries out semi-global alignment (Durbin2002).

lingpy.algorithm.cython.talign.swap_score_profile()

Basic function for the scoring of profiles in swapped sequences.

Parameters:

colA, colB : list

The two columns of a profile.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

gap_weight : float (default=0.0)

This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

swap_penalty : int (default=-5)

The swap penalty applied to swapped columns.

Returns:

score : float

The score for the profile.

Notes

This function handles how profiles with swapped segments are scored.

Module contents

This package provides modules for time-consuming routines.