lingpy.algorithm.cython package¶
Submodules¶
lingpy.algorithm.cython.calign module¶
lingpy.algorithm.cython.cluster module¶

lingpy.algorithm.cython.cluster.
flat_cluster
()¶ Carry out a flat cluster analysis using the selected linkage method (UPGMA, single linkage, or complete linkage).
Parameters: method : str { ‘upgma’, ‘single’, ‘complete’ }
Select between ‘upgma’, ‘single’, and ‘complete’.
threshold : float
The threshold which terminates the algorithm.
matrix : list or numpy.array
A two-dimensional list containing the distances.
taxa : list (default = [])
A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.
Returns: clusters : dict
A dictionary with cluster IDs as keys and a list of the taxa corresponding to the respective ID as values.
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
array([[ 0.  ,  0.5 ,  0.67,  0.8 ,  0.2 ],
       [ 0.5 ,  0.  ,  0.4 ,  0.7 ,  0.6 ],
       [ 0.67,  0.4 ,  0.  ,  0.8 ,  0.8 ],
       [ 0.8 ,  0.7 ,  0.8 ,  0.  ,  0.3 ],
       [ 0.2 ,  0.6 ,  0.8 ,  0.3 ,  0.  ]])
Carry out the flat cluster analysis.
>>> flat_cluster('upgma',0.5,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}

lingpy.algorithm.cython.cluster.
flat_upgma
()¶ Carry out a flat cluster analysis based on the UPGMA algorithm (Sokal1958).
Parameters: threshold : float
The threshold which terminates the algorithm.
matrix : list or numpy.array
A two-dimensional list containing the distances.
taxa : list (default = [])
A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.
Returns: clusters : dict
A dictionary with cluster IDs as keys and a list of the taxa corresponding to the respective ID as values.
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
array([[ 0.  ,  0.5 ,  0.67,  0.8 ,  0.2 ],
       [ 0.5 ,  0.  ,  0.4 ,  0.7 ,  0.6 ],
       [ 0.67,  0.4 ,  0.  ,  0.8 ,  0.8 ],
       [ 0.8 ,  0.7 ,  0.8 ,  0.  ,  0.3 ],
       [ 0.2 ,  0.6 ,  0.8 ,  0.3 ,  0.  ]])
Carry out the flat cluster analysis.
>>> flat_upgma(0.5,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}

lingpy.algorithm.cython.cluster.
neighbor
()¶ Cluster data according to the Neighbor-Joining algorithm (Saitou1987).
Parameters: matrix : list or numpy.array
A two-dimensional list containing the distances.
taxa : list
A list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool
If set to False, only the topology of the tree will be returned.
Returns: newick : str
A string in Newick format which can be used in biological software packages to view and plot the tree.
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
Create an arbitrary list of taxa.
>>> taxa = ['Norwegian','Swedish','Icelandic','Dutch','English']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> neighbor(matrix,taxa)
'(((Norwegian,(Swedish,Icelandic)),English),Dutch);'

lingpy.algorithm.cython.cluster.
upgma
()¶ Carry out a cluster analysis based on the UPGMA algorithm (Sokal1958).
Parameters: matrix : list or numpy.array
A two-dimensional list containing the distances.
taxa : list
A list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool
If set to False, only the topology of the tree will be returned.
Returns: newick : str
A string in Newick format which can be used in biological software packages to view and plot the tree.
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
Create an arbitrary list of taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> upgma(matrix,taxa,distances=False)
'((Swedish,Icelandic),(English,(German,Dutch)));'
lingpy.algorithm.cython.compilePYX module¶
This script handles the compilation of Cython files to C and to C extension modules.

lingpy.algorithm.cython.compilePYX.
main
()¶

lingpy.algorithm.cython.compilePYX.
pyx2py
(infile, debug=False)¶
lingpy.algorithm.cython.malign module¶
This module provides various alignment functions in an optimized version.

lingpy.algorithm.cython.malign.
edit_dist
()¶ Return the edit distance between two sequences.
Parameters: seqA, seqB : list
The sequences to be compared, passed as lists.
normalized : bool
Indicate whether you want the normalized or the unnormalized edit distance to be returned.
Returns: dist : { int, float }
Either the normalized or the unnormalized edit distance.
Notes
This function computes the edit distance between two list-type objects. We recommend using it if you need a fast implementation. Otherwise, especially if you want to pass strings, we recommend having a look at the wrapper function with the same name in the pairwise module.
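As a rough illustration of what the function computes, the following pure-Python sketch implements the classic dynamic-programming edit distance on list input. The name `edit_dist_sketch` is hypothetical, and dividing by the length of the longer sequence is an assumed normalization; the compiled function may normalize differently.

```python
def edit_dist_sketch(seq_a, seq_b, normalized=False):
    """Levenshtein distance between two lists (illustrative, not LingPy's code)."""
    m, n = len(seq_a), len(seq_b)
    # previous[j] is the distance between the prefix of seq_a handled so far
    # and the first j items of seq_b.
    previous = list(range(n + 1))
    for i in range(1, m + 1):
        current = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            current[j] = min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            )
        previous = current
    dist = previous[n]
    if normalized:
        # Assumption: normalize by the length of the longer sequence.
        longest = max(m, n)
        return dist / longest if longest else 0.0
    return dist
```

For instance, `edit_dist_sketch(list("kitten"), list("sitting"))` returns 3.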

lingpy.algorithm.cython.malign.
nw_align
()¶ Align two sequences using the Needleman-Wunsch algorithm.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
scorer : dict
A dictionary containing tuples of two segments as keys and numbers as values.
gap : int
The gap penalty.
Returns: alignment : tuple
A tuple of the two aligned sequences, and the similarity score.
Notes
This function is a very straightforward implementation of the Needleman-Wunsch algorithm (Needleman1970). We recommend using the function if you want to test your own scoring dictionaries and profit from a fast implementation (since it is written in Cython, it is faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the NW algorithm without specifying a scoring dictionary, we recommend having a look at our wrapper function with the same name in the pairwise module.
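To make the roles of `scorer` and `gap` concrete, here is a minimal pure-Python sketch of Needleman-Wunsch alignment with a linear gap penalty. The function name `nw_align_sketch`, the gap symbol `"-"`, and the convention that `gap` is a (usually negative) number added to the score are assumptions for illustration; the Cython function may differ in details.

```python
def nw_align_sketch(seq_a, seq_b, scorer, gap=-1):
    """Global alignment of two lists; returns (aligned_a, aligned_b, score)."""
    m, n = len(seq_a), len(seq_b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # Border cells: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, n + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]],
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback from the bottom-right cell to the origin.
    alm_a, alm_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]]):
            alm_a.append(seq_a[i - 1])
            alm_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            alm_a.append(seq_a[i - 1])
            alm_b.append("-")
            i -= 1
        else:
            alm_a.append("-")
            alm_b.append(seq_b[j - 1])
            j -= 1
    return alm_a[::-1], alm_b[::-1], score[m][n]
```

With a scorer that gives 1 for identical segments and -1 otherwise, aligning `['a', 'b', 'c']` with `['a', 'c']` yields `(['a', 'b', 'c'], ['a', '-', 'c'], 1)`.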

lingpy.algorithm.cython.malign.
restricted_edit_dist
()¶ Return the restricted edit distance between two sequences.
Parameters: seqA, seqB : list
The two sequences, passed as lists.
resA, resB : str
The restrictions, each passed as a string with the same length as the corresponding sequence. A restriction applies wherever the two restriction strings show different symbols. If the symbols are identical, no restriction applies.
normalized : bool
Determine whether you want to return the normalized or the unnormalized edit distance.
Notes
Restrictions follow the definition of Heeringa2006: Segments that are not allowed to match are given a penalty of . We model restrictions as strings, for example consisting of the letters “c” and “v”. So the sequence “woldemort” could be modeled as “cvccvcvcc”, and when aligning it with the sequence “walter” and its restriction string “cvccvc”, matching segments whose restriction symbols differ would be heavily penalized, thus prohibiting an alignment of “vowels” and “consonants” (“v” and “c”).
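The idea can be sketched in pure Python as an edit distance whose substitution cost depends on the restriction strings. The name `restricted_edit_dist_sketch` and the explicit `penalty` argument are hypothetical; the value used by the compiled function is not given here, so the caller has to supply one.

```python
def restricted_edit_dist_sketch(seq_a, seq_b, res_a, res_b, penalty):
    """Edit distance in which restricted matches cost `penalty` (illustrative)."""
    m, n = len(seq_a), len(seq_b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = i
    for j in range(1, n + 1):
        table[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if res_a[i - 1] != res_b[j - 1]:
                cost = penalty  # restricted pair, e.g. "v" against "c"
            elif seq_a[i - 1] == seq_b[j - 1]:
                cost = 0        # identical segments
            else:
                cost = 1        # ordinary substitution
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match, substitution, or restricted
            )
    return table[m][n]
```

With a large penalty, a vowel can no longer be substituted by a consonant, so the cheapest path deletes one segment and inserts the other: `restricted_edit_dist_sketch(['a'], ['t'], 'v', 'c', penalty=100)` returns 2.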

lingpy.algorithm.cython.malign.
structalign
()¶ Carry out a structural alignment analysis using Dijkstra’s algorithm.
Parameters: seqA,seqB : str
The input sequences.
restricted_chars : str (default = “”)
The characters which are used to separate secondary from primary segments in the input sequences. Currently, the use of restricted chars may fail to yield an alignment.
Notes
Structural alignment is understood here as an alignment of two sequences whose alphabets differ. The algorithm returns all alignments with minimal edit distance. Edit distance in this context refers to the number of edit operations that are needed in order to convert one sequence into the other, with repeated edit operations being penalized only once.

lingpy.algorithm.cython.malign.
sw_align
()¶ Align two sequences using the Smith-Waterman algorithm.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
scorer : dict
A dictionary containing tuples of two segments as keys and numbers as values.
gap : int
The gap penalty.
Returns: alignment : tuple
A tuple of the two aligned sequences, and the similarity score.
Notes
This function is a very straightforward implementation of the Smith-Waterman algorithm (Smith1981). We recommend using the function if you want to test your own scoring dictionaries and profit from a fast implementation (since it is written in Cython, it is faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the SW algorithm without specifying a scoring dictionary, we recommend having a look at our wrapper function with the same name in the pairwise module.
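For comparison with global alignment, a minimal pure-Python sketch of Smith-Waterman local alignment follows. The name `sw_align_sketch` is hypothetical, and the sketch returns only the best-scoring aligned stretch and its score; how the compiled function represents the unaligned remainder of the sequences is not covered here.

```python
def sw_align_sketch(seq_a, seq_b, scorer, gap=-1):
    """Best local alignment of two lists; returns (aligned_a, aligned_b, score)."""
    m, n = len(seq_a), len(seq_b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                0,  # a local alignment may start anywhere
                score[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]],
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    alm_a, alm_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]]:
            alm_a.append(seq_a[i - 1])
            alm_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            alm_a.append(seq_a[i - 1])
            alm_b.append("-")
            i -= 1
        else:
            alm_a.append("-")
            alm_b.append(seq_b[j - 1])
            j -= 1
    return alm_a[::-1], alm_b[::-1], best
```

Because every cell is bounded below by zero, dissimilar flanking material such as the `xx…yy` and `zz…ww` in `xxabcyy` and `zzabcww` is left out of the result.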

lingpy.algorithm.cython.malign.
we_align
()¶ Align two sequences using the Waterman-Eggert algorithm.
Parameters: seqA, seqB : list
The input sequences, passed as lists.
scorer : dict
A dictionary containing tuples of two segments as keys and numbers as values.
gap : int
The gap penalty.
Returns: alignments : list
A list consisting of tuples. Each tuple gives the alignment of one of the subsequences of the input sequences. Each tuple contains the aligned part of the first, the aligned part of the second sequence, and the score of the alignment.
Notes
This function is a very straightforward implementation of the Waterman-Eggert algorithm (Waterman1987). We recommend using the function if you want to test your own scoring dictionaries and profit from a fast implementation (since it is written in Cython, it is faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the WE algorithm without specifying a scoring dictionary, we recommend having a look at our wrapper function with the same name in the pairwise module.
lingpy.algorithm.cython.misc module¶

class
lingpy.algorithm.cython.misc.
ScoreDict
¶ Bases: object
Class allows quick access to scoring functions using dictionary syntax.
Parameters: chars : list
The list of all character tokens for the scoring dictionary.
matrix : list
A two-dimensional scoring matrix.
Notes
Since this class has dictionary syntax, you can always also just create a dictionary in order to store your scoring functions. Scoring dictionaries should contain a tuple of segments to be compared as a key, and a float or integer as a value, with negative values indicating dissimilarity, and positive values similarity.
Examples
Initialize a ScoreDict object:
>>> from lingpy.algorithm.cython.misc import ScoreDict
>>> scorer = ScoreDict(['a', 'b'], [1, 1, 1, 1])
Retrieve scores:
>>> scorer['a', 'b']
1
>>> scorer['a', 'a']
1
>>> scorer['a', 'X']
22.5
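As the notes say, a plain dictionary keyed by segment tuples can serve as a scorer as well. The following sketch builds such a dictionary from a token list and a two-dimensional matrix; the helper name `make_scorer` and the example scores are made up for illustration.

```python
def make_scorer(chars, matrix):
    """Map every ordered pair of tokens to its score in a square matrix."""
    size = len(chars)
    if len(matrix) != size or any(len(row) != size for row in matrix):
        raise ValueError("matrix must be square and match the number of tokens")
    return {
        (a, b): matrix[i][j]
        for i, a in enumerate(chars)
        for j, b in enumerate(chars)
    }

# Positive values indicate similarity, negative values dissimilarity.
scorer = make_scorer(["a", "b"], [[1, -1], [-1, 1]])
```

Here `scorer['a', 'b']` evaluates to -1 and `scorer['a', 'a']` to 1. Unlike the class shown above, a plain dictionary raises `KeyError` for pairs it does not contain.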

lingpy.algorithm.cython.misc.
squareform
()¶ A simplified version of the scipy.spatial.distance.squareform() function.
Parameters: x : numpy.array or list
The one-dimensional flat representation of a symmetric distance matrix.
Returns: matrix : numpy.array
The two-dimensional redundant representation of a symmetric distance matrix.
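The conversion from the flat to the redundant form can be sketched in pure Python as follows. The name `squareform_sketch` is hypothetical, and the sketch returns nested lists rather than a `numpy.array`; it assumes the flat input lists the upper triangle row by row, which is the ordering the examples in this package use.

```python
import math

def squareform_sketch(flat):
    """Expand a flat upper-triangle list into a symmetric square matrix."""
    # A matrix of size n has n * (n - 1) / 2 off-diagonal pairs; solve for n.
    n = (1 + math.isqrt(1 + 8 * len(flat))) // 2
    if n * (n - 1) // 2 != len(flat):
        raise ValueError("length of input does not match a square matrix")
    matrix = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = flat[k]  # mirror across the diagonal
            k += 1
    return matrix
```

For the ten distances `[0.5, 0.67, 0.8, 0.2, 0.4, 0.7, 0.6, 0.8, 0.8, 0.3]`, the first row of the result is `[0.0, 0.5, 0.67, 0.8, 0.2]`.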

lingpy.algorithm.cython.misc.
transpose
()¶ Transpose a matrix along its two dimensions.
Parameters: matrix : list
A two-dimensional list.
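In pure Python, the same operation can be written with `zip`. This is only a sketch under the assumption that all rows have equal length (`zip` silently truncates to the shortest row otherwise); the name `transpose_sketch` is hypothetical.

```python
def transpose_sketch(matrix):
    """Swap rows and columns of a two-dimensional list."""
    return [list(column) for column in zip(*matrix)]
```

For example, `transpose_sketch([[1, 2, 3], [4, 5, 6]])` returns `[[1, 4], [2, 5], [3, 6]]`.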
lingpy.algorithm.cython.talign module¶

lingpy.algorithm.cython.talign.
align_pair
()¶ Align a pair of sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
mode : { “global”, “local”, “overlap”, “dialign” }
Select the mode for the alignment analysis (“overlap” refers to semi-global alignments).
distance : int (default=0)
Select whether you want distances or similarities to be returned (0 indicates similarities, 1 indicates distances, 2 indicates both).
Returns: alignment : tuple
The aligned sequences and the similarity or distance scores, or both.
Notes
This is a utility function that allows calling any of the four classical alignment functions (lingpy.algorithm.cython.talign.globalign, lingpy.algorithm.cython.talign.semi_globalign, lingpy.algorithm.cython.talign.localign, lingpy.algorithm.cython.talign.dialign) and their secondary counterparts.

lingpy.algorithm.cython.talign.
align_pairs
()¶ Align multiple sequence pairs.
Parameters: seqs : list
The sequences to be aligned, passed as lists.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
mode : { “global”, “local”, “overlap”, “dialign” }
Select the mode for the alignment analysis (“overlap” refers to semi-global alignments).
distance : int (default=0)
Indicate whether distances or similarities should be returned.
Returns: alignments : list
A list of tuples, containing the aligned sequences, and the similarity or the distance scores.
Notes
This function aligns all pairs which are passed to it.

lingpy.algorithm.cython.talign.
align_pairwise
()¶ Align all sequences pairwise.
Parameters: seqs : list
The sequences to be aligned, passed as lists.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
mode : { “global”, “local”, “overlap”, “dialign” }
Select the mode for the alignment analysis (“overlap” refers to semi-global alignments).
Returns: alignments : list
A list of tuples, containing the aligned sequences, the similarity and the distance scores.
Notes
This function aligns all possible pairs between the sequences you pass to it. It is important for multiple alignment, where it can be used to construct the guide tree.
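The "all possible pairs" pattern can be sketched with `itertools.combinations`. Both `pairwise_sketch` and the `align` callable are hypothetical stand-ins; the sketch returns a dictionary keyed by index pairs rather than the list of tuples described above.

```python
from itertools import combinations

def pairwise_sketch(seqs, align):
    """Apply `align` to every unordered pair of sequences."""
    results = {}
    for (i, seq_a), (j, seq_b) in combinations(enumerate(seqs), 2):
        results[i, j] = align(seq_a, seq_b)
    return results

# Toy stand-in for an alignment function: count identical positions.
def shared_positions(seq_a, seq_b):
    return sum(a == b for a, b in zip(seq_a, seq_b))

scores = pairwise_sketch([list("ab"), list("ac"), list("bc")], shared_positions)
```

For n sequences this visits n(n-1)/2 pairs, which is why the values can be arranged in a symmetric matrix and used for building a guide tree.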

lingpy.algorithm.cython.talign.
align_profile
()¶ Align two profiles using the basic modes.
Parameters: profileA, profileB : list
A two-dimensional list for each of the profiles.
gop : int
The gap opening penalty.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring function which needs to provide scores for all segments in the two profiles.
mode : { “global”, “overlap”, “dialign” }
Select one of the basic modes for alignment analyses.
gap_weight : float
This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
Returns: alignment : tuple
The aligned profiles, and the overall similarity of the profiles.
Notes
This function computes alignments of two profiles of multiple sequences (see Durbin2002 for details on profiles) and is important for multiple alignment analyses.
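One possible reading of the `scale` parameter described above, stated here as an assumption rather than as the exact rule of the compiled code, is that the first gap in a run costs `gop` and every further consecutive gap costs `gop * scale`. The helper name `gap_run_cost` is hypothetical.

```python
def gap_run_cost(length, gop, scale):
    """Total penalty for `length` consecutive gaps under the assumed scheme."""
    if length <= 0:
        return 0.0
    # Assumption: opening costs `gop`, each extension costs `gop * scale`.
    return gop + (length - 1) * gop * scale
```

Under this reading, with `gop=-2` and `scale=0.5`, a run of three gaps costs -4.0 instead of the -6 that three independent gap openings would cost.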

lingpy.algorithm.cython.talign.
dialign
()¶ Carry out dialign alignment of two sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
M, N : int
The lengths of the two sequences.
scale : float
The gap extension scale.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
Returns: alignment : tuple
The aligned sequences and the similarity score.
Notes
This algorithm carries out dialign alignment (Morgenstern1996).

lingpy.algorithm.cython.talign.
globalign
()¶ Carry out global alignment of two sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
M, N : int
The lengths of the two sequences.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
Returns: alignment : tuple
The aligned sequences and the similarity score.
Notes
This algorithm carries out classical Needleman-Wunsch alignment (Needleman1970).

lingpy.algorithm.cython.talign.
localign
()¶ Carry out local alignment of two sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
M, N : int
The lengths of the two sequences.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
Returns: alignment : tuple
The aligned sequences and the similarity score.
Notes
This algorithm carries out local alignment (Smith1981).

lingpy.algorithm.cython.talign.
score_profile
()¶ Basic function for the scoring of profiles.
Parameters: colA, colB : list
The two columns of a profile.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring function which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
Returns: score : float
The score for the profile.
Notes
This function handles how profiles are scored.
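A rough sketch of how two profile columns might be scored is given below. It assumes that gaps are marked with `"-"`, that every cross-column pair of segments is scored, and that the result is an average in which each gapped pair counts with `gap_weight`; the name `score_columns_sketch`, the gap symbol, and the averaging scheme are all assumptions, not the documented behavior of the compiled function.

```python
def score_columns_sketch(col_a, col_b, scorer, gap_weight=0.0, gap="-"):
    """Average pairwise score between two columns (illustrative only)."""
    total = 0.0
    weight = 0.0
    for a in col_a:
        for b in col_b:
            if a == gap or b == gap:
                # Gapped pairs add no score, only `gap_weight` to the divisor.
                weight += gap_weight
            else:
                total += scorer[a, b]
                weight += 1.0
    return total / weight if weight else 0.0
```

With `gap_weight=0.0`, gapped pairs are ignored entirely, which matches the description of the parameter above; larger values pull the score of gappy columns towards zero.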

lingpy.algorithm.cython.talign.
semi_globalign
()¶ Carry out semi-global alignment of two sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
M, N : int
The lengths of the two sequences.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
Returns: alignment : tuple
The aligned sequences and the similarity score.
Notes
This algorithm carries out semi-global alignment (Durbin2002).

lingpy.algorithm.cython.talign.
swap_score_profile
()¶ Basic function for the scoring of profiles in swapped sequences.
Parameters: colA, colB : list
The two columns of a profile.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
The scoring function which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
swap_penalty : int (default=5)
The swap penalty applied to swapped columns.
Returns: score : float
The score for the profile.
Notes
This function handles how profiles with swapped segments are scored.
Module contents¶
This package provides modules for time-consuming routines.