lingpy.algorithm.cython package¶
Submodules¶
lingpy.algorithm.cython.calign module¶

lingpy.algorithm.cython.calign.
align_pair
()¶ Align a pair of sequences.
Parameters: seqA, seqB : list
The list containing the sequences.
gopA, gopB : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
The prosodic strings which have the same length as seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
mode : { “global”, “local”, “overlap”, “dialign” }
Select one of the four basic modes for alignment analyses.
restricted_chars : str
The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
distance : int (default=0)
Select whether you want to calculate the normalized distance or the similarity between two strings (following
Downey2008
for normalization).Returns: alignment : tuple
The aligned sequences and the similarity or distance.
Notes
This is a utility function that allows calls any of the four classical alignment functions (
lingpy.algorithm.cython.calign.globalign
lingpy.algorithm.cython.calign.semi_globalign
,lingpy.algorithm.cython.calign.localign
,lingpy.algorithm.cython.calign.dialign
,) and their secondary counterparts.

lingpy.algorithm.cython.calign.
align_pairs
()¶ Align multiple sequence pairs.
Parameters: seqs : list
A twodimensional list containing one pair of sequences each.
gops : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
pros : list
The prosodic strings which have the same length as seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
mode : { “global”, “local”, “overlap”, “dialign” }
Select one of the four basic modes for alignment analyses.
restricted_chars : { str }
The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
distance : int (default=0)
Select whether you want to calculate the normalized distance or the similarity between two strings (following
Downey2008
for normalization). If you set this value to 2, both distances and similarities will be returned.Returns: alignments : list
A list of tuples of size 3 or 4, containing the alignments, and the similarity or the distance (or both, if distance is set to 2).
Notes
This function computes alignments of all pairs passed in the list of sequence pairs (a twodimensional list with two sequences each) and is basically used in LingPy’s module for cognate detection (
lingpy.compare.lexstat.LexStat
).

lingpy.algorithm.cython.calign.
align_pairwise
()¶ Align a list of sequences pairwise.
Parameters: seqs : list
The list containing the sequences.
gops : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
pros : list
The prosodic strings which have the same length as seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
The scoring function which needs to provide scores for all segments in seqA and seqB.
mode : { “global”, “local”, “overlap”, “dialign” }
Select one of the four basic modes for alignment analyses.
r : str
The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
Returns: alignments : list
A list of tuples of size 4, containing the alignment, the similarity and the distance for each sequence pair.
Notes
This function computes alignments of all possible pairs passed in the list of sequences and is basically used in LingPy’s module for multiple alignment analyses (
lingpy.align.multiple
).

lingpy.algorithm.cython.calign.
align_profile
()¶ Align two profiles using the basic modes.
Parameters: profileA, profileB : list
Twodimensional list for each of the profiles.
gopA, gopB : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
The prosodic strings which have the same length as profileA and profileB.
gop : int
The general gap opening penalty which will be used to introduce a gap between the two profiles.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in the two profiles.
restricted_chars : { str }
The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence. They need to be computed by computing a consensus string from all prosodic strings in the profile.
mode : { “global”, “local”, “overlap”, “dialign” }
Select one of the four basic modes for alignment analyses.
gap_weight : float
This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
Returns: alignment : tuple
The aligned profiles, and the overall similarity of the profiles.
Notes
This function computes alignments of two profiles of multiple sequences (see
Durbin2002
for details on profiles) and is basically used in LingPy’s module for multiple alignment (lingpy.align.multiple
).

lingpy.algorithm.cython.calign.
corrdist
()¶ Create a correspondence distribution for a given language pair.
Parameters: threshold : float
The threshold of sequence distance which determines whether a sequence pair is included or excluded from the calculation of the distribution.
seqs : list
The sequences passed as a twodimensional list of sequence pairs.
gops : list
The gap opening penalties, passed as individual lists of penalties for each sequence.
pros : list
The list of prosodic strings for each sequence.
gop : int
The general gap opening penalty which will be used to introduce a gap between the two profiles.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in the two profiles.
mode : { “global”, “local”, “overlap”, “dialign” }
Select one of the four basic modes for alignment analyses.
restricted_chars : { str }
The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence. They need to be computed by computing a consensus string from all prosodic strings in the profile.
Returns: results : tuple
A dictionary containing the distribution, and the number of included sequences.
Notes
This function is the core of the
LexStat
function to compute distributions of aligned segment pairs.

lingpy.algorithm.cython.calign.
dialign
()¶ Carry out dialign alignment of two sequences.
Parameters: seqA, seqB : list
The list containing the sequences.
proA, proB : str
The prosodic strings which have the same length as seqA and seqB.
M, N : int
The lengths of seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
Returns: alignment : tuple
A tuple of the two alignments and the alignment score.
Notes
This is the function that is called to carry out local dialign alignment analyses (keyword “dialign”) when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in
List2014d
for details), likePairwise
,Multiple
, orLexStat
. Dialign (seeMorgenstern1996
) is an alignment algorithm that does not require gap penalties and generally works in a rather local fashion.

lingpy.algorithm.cython.calign.
globalign
()¶ Carry out global alignment of two sequences.
Parameters: seqA, seqB : list
The list containing the sequences.
gopA, gopB : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
The prosodic strings which have the same length as seqA and seqB.
M, N : int
The lengths of seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
Returns: alignment : tuple
A tuple of the two alignments and the alignment score.
Notes
This is the function that is called to carry out global alignment analyses when using many of LingPy’s classes for alignment analyses, like
Pairwise
,Multiple
, orLexStat
. It differs from classical NeedlemanWunsch alignment (compareNeedleman1970
) in a couple of aspects. These include, among others, the use of a gap extension scale rather than a gap extension penalty (the scale consecutively reduces the gap penalty and thus lets gap penalties approach zero if gapped regions are large), the use of individual gap opening penalties for all positions of a sequence, and the use of prosodic strings, and prosodic factors that raise scores when segments occur in the same prosodic environment.If one sets certain of these parameters to zero or one and uses the same gap opening penalties, however, the function will behave like the traditional NeedlemanWunsch algorithm, and since it is implemented in Cython, it will work faster than a pure Python implementation for alignment algorithms.
Examples
We show that the NeedlemanWunsch algorithms yields the same result as the globalign algorithm, provided we adjust the parameters:
>>> from lingpy.algorithm.cython.calign import globalign >>> from lingpy.align.pairwise import nw_align >>> nw_align('abab', 'baba') (['a', 'b', 'a', 'b', ''], ['', 'b', 'a', 'b', 'a'], 1) >>> globalign(list('abab'), list('baba'), 4 * [1], 4 * [1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"):1, ("b","a"): 1, ("a","a"): 1, ("b", "b"): 1}) (['a', 'b', 'a', 'b', ''], ['', 'b', 'a', 'b', 'a'], 1.0)

lingpy.algorithm.cython.calign.
localign
()¶ Carry out semiglobal alignment of two sequences.
Parameters: seqA, seqB : list
The list containing the sequences.
gopA, gopB : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
The prosodic strings which have the same length as seqA and seqB.
M, N : int
The lengths of seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
Returns: alignment : tuple
A tuple of the two alignments and the alignment score. The alignments are each a list of suffix, alignment, and prefix.
Notes
This is the function that is called to carry out local alignment analyses when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in
List2014d
for details), likePairwise
,Multiple
, orLexStat
. Local alignment means that only the best matching substring between two sequences is returned (compareSmith1981
), also called the SmithWaterman algorithm.

lingpy.algorithm.cython.calign.
score_profile
()¶ Basic function for the scoring of profiles.
Parameters: colA, colB : list
The two columns of a profile.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
Returns: score : float
The score for the profile
Notes
This function handles how profiles are scored.

lingpy.algorithm.cython.calign.
secondary_dialign
()¶ Carry out dialign alignment of two sequences with sensitivity for secondary sequence structures.
Parameters: seqA, seqB : list
The list containing the sequences.
proA, proB : str
The prosodic strings which have the same length as seqA and seqB.
M, N : int
The lengths of seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
r : { str }
The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
Returns: alignment : tuple
A tuple of the two alignments and the alignment score.
Notes
This is the function that is called to carry out local dialign alignment analyses (keyword “dialign”) when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in
List2014d
for details), likePairwise
,Multiple
, orLexStat
. Dialign (seeMorgenstern1996
) is an alignment algorithm that does not require gap penalties and generally works in a rather local fashion.

lingpy.algorithm.cython.calign.
secondary_globalign
()¶ Carry out global alignment of two sequences with secondary sequence structures.
Parameters: seqA, seqB : list
The list containing the sequences.
gopA, gopB : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
The prosodic strings which have the same length as seqA and seqB.
M, N : int
The lengths of seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
r : { str }
The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
Returns: alignment : tuple
A tuple of the two alignments and the alignment score.
Notes
This is the function that is called to carry out global alignment analyses when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in
List2014d
for details), likePairwise
,Multiple
, orLexStat
. It differs from classical NeedlemanWunsch alignment (compareNeedleman1970
) in a couple of aspects. These include, among others, the use of a gap extension scale rather than a gap extension penalty (the scale consecutively reduces the gap penalty and thus lets gap penalties approach zero if gapped regions are large), the use of individual gap opening penalties for all positions of a sequence, and the use of prosodic strings, and prosodic factors that raise scores when segments occur in the same prosodic environment.If one sets certain of these parameters to zero or one and uses the same gap opening penalties, however, the function will behave like the traditional NeedlemanWunsch algorithm, and since it is implemented in Cython, it will work faster than a pure Python implementation for alignment algorithms.
Examples
 We compare globalign with secondary_globalign::
>>> from lingpy.algorithm.cython.calign import globalign, secondary_globalign >>> globalign(list('abab'), list('baba'), 4 * [1], 4 * [1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"):1, ("b","a"): 1, ("a","a"): 1, ("b", "b"): 1}) (['a', 'b', 'a', 'b', ''], ['', 'b', 'a', 'b', 'a'], 1.0) >>> secondary_globalign(list('ab.ab'), list('ba.ba'), 5 * [1], 5 * [1], 'ab.ab', 'ba.ba', 5, 5, 1, 0, {("a","b"):1, ("b","a"): 1, ("a","a"): 1, ("b", "b"): 1, ("a",".") : 1, ("b","."):1, (".","."):0, (".", "b"): 1, (".", "a"):1}, '.') (['a', 'b', '', '.', 'a', 'b', ''], ['', 'b', 'a', '.', '', 'b', 'a'], 2.0)

lingpy.algorithm.cython.calign.
secondary_localign
()¶ Carry out lobal alignment of two sequences with sensitivity to secondary sequence structures.
Parameters: seqA, seqB : list
The list containing the sequences.
gopA, gopB : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
The prosodic strings which have the same length as seqA and seqB.
M, N : int
The lengths of seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
r : { str }
The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
Returns: alignment : tuple
A tuple of the two alignments and the alignment score. The alignments are each a list of suffix, alignment, and prefix.
Notes
This is the function that is called to carry out local alignment analyses when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in
List2014d
for details), likePairwise
,Multiple
, orLexStat
. Local alignment means that only the best matching substring between two sequences is returned (compareSmith1981
), also called the SmithWaterman algorithm.

lingpy.algorithm.cython.calign.
secondary_semi_globalign
()¶ Carry out semiglobal alignment of two sequences with sensitivity to secondary sequence structures.
Parameters: seqA, seqB : list
The list containing the sequences.
gopA, gopB : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
The prosodic strings which have the same length as seqA and seqB.
M, N : int
The lengths of seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
r : { str }
The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
Returns: alignment : tuple
A tuple of the two alignments and the alignment score.
Notes
This is the function that is called to carry out semiglobal alignment analyses (keyword “overlap”) when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in
List2014d
for details), likePairwise
,Multiple
, orLexStat
. Semiglobal alignment means that the suffixes or prefixes in one of the words are not penalized.

lingpy.algorithm.cython.calign.
semi_globalign
()¶ Carry out semiglobal alignment of two sequences.
Parameters: seqA, seqB : list
The list containing the sequences.
gopA, gopB : list
The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
The prosodic strings which have the same length as seqA and seqB.
M, N : int
The lengths of seqA and seqB.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in seqA and seqB.
Returns: alignment : tuple
A tuple of the two alignments and the alignment score.
Notes
This is the function that is called to carry out semiglobal alignment analyses (keyword “overlap”) when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in
List2014d
for details), likePairwise
,Multiple
, orLexStat
. Semiglobal alignment means that the suffixes or prefixes in one of the words are not penalized.Examples
 We compare globalign with semi_globalign::
>>> from lingpy.algorithm.cython.calign import globalign, semi_globalign >>> globalign(list('abab'), list('baba'), 4 * [1], 4 * [1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"):1, ("b","a"): 1, ("a","a"): 1, ("b", "b"): 1}) (['a', 'b', 'a', 'b', ''], ['', 'b', 'a', 'b', 'a'], 1.0) >>> semi_globalign(list('abab'), list('baba'), 4 * [1], 4 * [1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"):1, ("b","a"): 1, ("a","a"): 1, ("b", "b"): 1}) (['a', 'b', 'a', 'b', ''], ['', 'b', 'a', 'b', 'a'], 3.0)

lingpy.algorithm.cython.calign.
swap_score_profile
()¶ Basic function for the scoring of profiles which contain swapped sequences.
Parameters: colA, colB : list
The two columns of a profile.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
swap_penalty : int (default=5)
The swap penalty applied to swapped columns.
Returns: score : float
The score for the profile.
Notes
This function handles how profiles with swapped segments are scored.
lingpy.algorithm.cython.cluster module¶

lingpy.algorithm.cython.cluster.
flat_cluster
()¶ Carry out a flat cluster analysis based on the UPGMA algorithm.
Parameters: method : str { ‘upgma’, ‘single’, ‘complete’ }
Select between ‘ugpma’, ‘single’, and ‘complete’.
threshold : float
The threshold which terminates the algorithm.
matrix : list or
numpy.array
A twodimensional list containing the distances.
taxa : list (default = [])
A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.
Returns: clusters : dict
A dictionary with clusterIDs as keys and a list of the taxa corresponding to the respective ID as values.
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3]) >>> matrix array([[ 0. , 0.5 , 0.67, 0.8 , 0.2 ], [ 0.5 , 0. , 0.4 , 0.7 , 0.6 ], [ 0.67, 0.4 , 0. , 0.8 , 0.8 ], [ 0.8 , 0.7 , 0.8 , 0. , 0.3 ], [ 0.2 , 0.6 , 0.8 , 0.3 , 0. ]])
Carry out the flat cluster analysis.
>>> flat_upgma(0.5,matrix,taxa) {0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}

lingpy.algorithm.cython.cluster.
flat_upgma
()¶ Carry out a flat cluster analysis based on the UPGMA algorithm (
Sokal1958
).Parameters: threshold : float
The threshold which terminates the algorithm.
matrix : list or
numpy.array
A twodimensional list containing the distances.
taxa : list (default = [])
A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.
Returns: clusters : dict
A dictionary with clusterIDs as keys and a list of the taxa corresponding to the respective ID as values.
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3]) >>> matrix array([[ 0. , 0.5 , 0.67, 0.8 , 0.2 ], [ 0.5 , 0. , 0.4 , 0.7 , 0.6 ], [ 0.67, 0.4 , 0. , 0.8 , 0.8 ], [ 0.8 , 0.7 , 0.8 , 0. , 0.3 ], [ 0.2 , 0.6 , 0.8 , 0.3 , 0. ]])
Carry out the flat cluster analysis.
>>> flat_upgma(0.5,matrix,taxa) {0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}

lingpy.algorithm.cython.cluster.
neighbor
()¶ Function clusters data according to the NeighborJoining algorithm (
Saitou1987
).Parameters: matrix : list or
numpy.array
A twodimensional list containing the distances.
taxa : list
An list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool
If set to
False
, only the topology of the tree will be returned.Returns: newick : str
A string in newickformat which can be further used in biological software packages to view and plot the tree.
Examples
Function is automatically imported when importing lingpy.
>>> from lingpy import *
Create an arbitrary list of taxa.
>>> taxa = ['Norwegian','Swedish','Icelandic','Dutch','English']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> neighbor(matrix,taxa) '(((Norwegian,(Swedish,Icelandic)),English),Dutch);'

lingpy.algorithm.cython.cluster.
upgma
()¶ Carry out a cluster analysis based on the UPGMA algorithm (
Sokal1958
).Parameters: matrix : list or
numpy.array
A twodimensional list containing the distances.
taxa : list
An list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool
If set to
False
, only the topology of the tree will be returned.Returns: newick : str
A string in newickformat which can be further used in biological software packages to view and plot the tree.
Examples
Function is automatically imported when importing lingpy.
>>> from lingpy import *
Create an arbitrary list of taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> upgma(matrix,taxa,distances=False) '((Swedish,Icelandic),(English,(German,Dutch)));'
lingpy.algorithm.cython.compilePYX module¶
Script handles compilation of Cython files to C and also to CExtension modules.

lingpy.algorithm.cython.compilePYX.
main
()¶

lingpy.algorithm.cython.compilePYX.
pyx2py
(infile, debug=False)¶
lingpy.algorithm.cython.malign module¶
This module provides various alignment functions in an optimized version.

lingpy.algorithm.cython.malign.
edit_dist
()¶ Return the editdistance between two strings.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as list.
normalized : bool
Indicate whether you want the normalized or the unnormalized edit distance to be returned.
Returns: dist : { int, float }
Either the normalized or the unnormalized edit distance.

lingpy.algorithm.cython.malign.
nw_align
()¶ Align two sequences using the NeedlemanWunsch algorithm.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as list.
scorer : dict
A dictionary containing tuples of two segments as key and numbers as values.
gap : int
The gap penalty.
Returns: alignment : tuple
A tuple of the two aligned sequences, and the similarity score.
Notes
This function is a very straightforward implementation of the NeedlemanWunsch algorithm (
Needleman1970
). We recommend to use the function if you want to test your own scoring dictionaries and profit from a fast implementation (as we use Cython, the implementation is indeed faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the NW algorithm without specifying a scoring dictionary, we recommend to have a look at our wrapper function with the same name in thepairwise
module.

lingpy.algorithm.cython.malign.
restricted_edit_dist
()¶ Return the restricted editdistance between two strings.
Parameters: seqA, seqB : list
The two sequences passed as list.
resA, resB : str
The restrictions passed as a string with the same length as the corresponding sequence. We note a restriction if the strings show different symbols in their restriction string. If the symbols are identical, it is modeled as a nonrestriction.
normalized : bool
Determine whether you want to return the normalized or the unnormalized edit distance.
Notes
Restrictions follow the definition of
Heeringa2006
: Segments that are not allowed to match are given a penalty of . We model restrictions as strings, for example consisting of letters “c” and “v”. So the sequence “woldemort” could be modeled as “cvccvcvcc”, and when aligning it with the sequence “walter” and its restriction string “cvccvc”, the matching of those segments in the sequences in which the segments of the restriction string differ, would be heavily penalized, thus prohibiting an alignment of “vowels” and “consonants” (“v” and “c”).

lingpy.algorithm.cython.malign.
structalign
()¶ Carry out a structural alignment analysis using Dijkstra’s algorithm.
Parameters: seqA,seqB : str
The input sequences.
restricted_chars : str (default = “”)
The characters which are used to separate secondary from primary segments in the input sequences. Currently, the use of restricted chars may fail to yield an alignment.
Notes
Structural alignment is hereby understood as an alignment of two sequences whose alphabets differ. The algorithm returns all alignments with minimal edit distance. Edit distance in this context refers to the number of edit operations that are needed in order to convert one sequence into the other, with repeated edit operations being penalized only once.

lingpy.algorithm.cython.malign.
sw_align
()¶ Align two sequences using the SmithWaterman algorithm.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as list.
scorer : dict
A dictionary containing tuples of two segments as key and numbers as values.
gap : int
The gap penalty.
Returns: alignment : tuple
A tuple of the two aligned sequences, and the similarity score.
Notes
This function is a very straightforward implementation of the SmithWaterman algorithm (
Smith1981
). We recommend to use the function if you want to test your own scoring dictionaries and profit from a fast implementation (as we use Cython, the implementation is indeed faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the SW algorithm without specifying a scoring dictionary, we recommend to have a look at our wrapper function with the same name in thepairwise
module.

lingpy.algorithm.cython.malign.
we_align
()¶ Align two sequences using the WatermanEggert algorithm.
Parameters: seqA, seqB : list
The input sequences passed as a list.
scorer : dict
A dictionary containing tuples of two segments as key and numbers as values.
gap : int
The gap penalty.
Returns: alignments : list
A list consisting of tuples. Each tuple gives the alignment of one of the subsequences of the input sequences. Each tuple contains the aligned part of the first, the aligned part of the second sequence, and the score of the alignment.
Notes
This function is a very straightforward implementation of the WatermanEggert algorithm (
Waterman1987
). We recommend to use the function if you want to test your own scoring dictionaries and profit from a fast implementation (as we use Cython, the implementation is indeed faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the WE algorithm without specifying a scoring dictionary, we recommend to have a look at our wrapper function with the same name in thepairwise
module.
lingpy.algorithm.cython.misc module¶

class
lingpy.algorithm.cython.misc.
ScoreDict
¶ Bases:
object
Class allows quick access to scoring functions using dictionary syntax.
Parameters: chars : list
The list of all character tokens for the scoring dictionary.
matrix : list
A twodimensional scoring matrix.
Notes
Since this class has dictionary syntax, you can always also just create a dictionary in order to store your scoring functions. Scoring dictionaries should contain a tuple of segments to be compared as a key, and a float or integer as a value, with negative values indicating dissimilarity, and positive values similarity.
Examples
 Initialize a ScoreDict object::
>>> from lingpy.algorith.cython.misc import ScoreDict >>> scorer = ScoreDict(['a', 'b'], [1, 1, 1, 1])
 Retrieve scores::
>>> scorer['a', 'b'] 1 >>> scorer['a', 'a'] 1 >>> scorer['a', 'X'] 22.5

lingpy.algorithm.cython.misc.
squareform
()¶ A simplified version of the
scipy.spatial.distance.squareform()
function.Parameters: x :
numpy.array
or listThe onedimensional flat representation of a symmetrix distance matrix.
Returns: matrix :
numpy.array
The twodimensional redundant representation of a symmetric distance matrix.

lingpy.algorithm.cython.misc.
transpose
()¶ Transpose a matrix along its two dimensions.
Parameters: matrix : list
A twodimensional list.
lingpy.algorithm.cython.talign module¶

lingpy.algorithm.cython.talign.
align_pair
()¶ Align a pair of sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
mode : { “global”, “local”, “overlap”, “dialign” }
Select the mode for the alignment analysis (“overlap” refers to semiglobal alignments).
distance : int (default=0)
Select whether you want distances or similarities to be returned (0 indicates similarities, 1 indicates distances, 2 indicates both).
Returns: alignment : tuple
The aligned sequences and the similarity or distance scores, or both.
Notes
This is a utility function that allows calls any of the four classical alignment functions (
lingpy.algorithm.cython.talign.globalign
lingpy.algorithm.cython.talign.semi_globalign
,lingpy.algorithm.cython.talign.lotalign
,lingpy.algorithm.cython.talign.dialign
,) and their secondary counterparts.

lingpy.algorithm.cython.talign.
align_pairs
()¶ Align multiple sequence pairs.
Parameters: seqs : list
The sequences to be aligned, passed as lists.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
mode : { “global”, “local”, “overlap”, “dialign” }
Select the mode for the alignment analysis (“overlap” refers to semiglobal alignments).
distance : int (default=0)
Indicate whether distances or similarities should be returned.
Returns: alignments : list
A list of tuples, containing the aligned sequences, and the similarity or the distance scores.
Notes
This function aligns all pairs which are passed to it.

lingpy.algorithm.cython.talign.
align_pairwise
()¶ Align all sequences pairwise.
Parameters: seqs : list
The sequences to be aligned, passed as lists.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
mode : { “global”, “local”, “overlap”, “dialign” }
Select the mode for the alignment analysis (“overlap” refers to semiglobal alignments).
Returns: alignments : list
A list of tuples, containing the aligned sequences, the similarity and the distance scores.
Notes
This function aligns all possible pairs between the sequences you pass to it. It is important for multiple alignment, where it can be used to construct the guide tree.

lingpy.algorithm.cython.talign.
align_profile
()¶ Align two profiles using the basic modes.
Parameters: profileA, profileB : list
Twodimensional list for each of the profiles.
gop : int
The gap opening penalty.
scale : float
The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in the two profiles.
mode : { “global”, “overlap”, “dialign” }
Select one of the four basic modes for alignment analyses.
gap_weight : float
This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
Returns: alignment : tuple
The aligned profiles, and the overall similarity of the profiles.
Notes
This function computes alignments of two profiles of multiple sequences (see
Durbin2002
for details on profiles) and is important for multiple alignment analyses.

lingpy.algorithm.cython.talign.
dialign
()¶ Carry out dialign alignment of two sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
M, N : int
The length of the two sequences.
scale : float
The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
Returns: alignment : tuple
The aligned sequences and the similarity score.
Notes
This algorithm carries out dialign alignment (
Morgenstern1996
).

lingpy.algorithm.cython.talign.
globalign
()¶ Carry out global alignment of two sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
M, N : int
The length of the two sequences.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
Returns: alignment : tuple
The aligned sequences and the similarity score.
Notes
This algorithm carries out classical NeedlemanWunsch alignment (
Needleman1970
).

lingpy.algorithm.cython.talign.
localign
()¶ Carry out semiglobal alignment of two sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
M, N : int
The length of the two sequences.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
Returns: alignment : tuple
The aligned sequences and the similarity score.
Notes
This algorithm carries out local alignment (
Smith1981
).

lingpy.algorithm.cython.talign.
score_profile
()¶ Basic function for the scoring of profiles.
Parameters: colA, colB : list
The two columns of a profile.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
Returns: score : float
The score for the profile
Notes
This function handles how profiles are scored.

lingpy.algorithm.cython.talign.
semi_globalign
()¶ Carry out semiglobal alignment of two sequences.
Parameters: seqA, seqB : list
The sequences to be aligned, passed as lists.
M, N : int
The length of the two sequences.
gop : int
The gap opening penalty.
scale : float
The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
The scoring dictionary containing scores for all possible segment combinations in the two sequences.
Returns: alignment : tuple
The aligned sequences and the similarity score.
Notes
This algorithm carries out semiglobal alignment (
Durbin2002
).

lingpy.algorithm.cython.talign.
swap_score_profile
()¶ Basic function for the scoring of profiles in swapped sequences.
Parameters: colA, colB : list
The two columns of a profile.
scorer : { dict,
lingpy.algorithm.cython.misc.ScoreDict
}The scoring function which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
swap_penalty : int (default=5)
The swap penalty applied to swapped columns.
Returns: score : float
The score for the profile.
Notes
This function handles how profiles with swapped segments are scored.
Module contents¶
Package provides modules for timeconsuming routines.