lingpy.algorithm.cython package

Submodules

lingpy.algorithm.cython.calign module

lingpy.algorithm.cython.calign.align_pair()

Align a pair of sequences.

Parameters:

seqA, seqB : list

The list containing the sequences.

gopA, gopB : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

proA, proB : str

The prosodic strings which have the same length as seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

mode : { “global”, “local”, “overlap”, “dialign” }

Select one of the four basic modes for alignment analyses.

restricted_chars : str

The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

distance : int (default=0)

Select whether you want to calculate the normalized distance or the similarity between two strings (following Downey2008 for normalization).

Returns:

alignment : tuple

The aligned sequences and the similarity or distance.

Notes

This is a utility function that allows calls any of the four classical alignment functions (lingpy.algorithm.cython.calign.globalign lingpy.algorithm.cython.calign.semi_globalign, lingpy.algorithm.cython.calign.localign, lingpy.algorithm.cython.calign.dialign,) and their secondary counterparts.

lingpy.algorithm.cython.calign.align_pairs()

Align multiple sequence pairs.

Parameters:

seqs : list

A two-dimensional list containing one pair of sequences each.

gops : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

pros : list

The prosodic strings which have the same length as seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

mode : { “global”, “local”, “overlap”, “dialign” }

Select one of the four basic modes for alignment analyses.

restricted_chars : { str }

The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

distance : int (default=0)

Select whether you want to calculate the normalized distance or the similarity between two strings (following Downey2008 for normalization). If you set this value to 2, both distances and similarities will be returned.

Returns:

alignments : list

A list of tuples of size 3 or 4, containing the alignments, and the similarity or the distance (or both, if distance is set to 2).

Notes

This function computes alignments of all pairs passed in the list of sequence pairs (a two-dimensional list with two sequences each) and is basically used in LingPy’s module for cognate detection (lingpy.compare.lexstat.LexStat).

lingpy.algorithm.cython.calign.align_pairwise()

Align a list of sequences pairwise.

Parameters:

seqs : list

The list containing the sequences.

gops : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

pros : list

The prosodic strings which have the same length as seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

mode : { “global”, “local”, “overlap”, “dialign” }

Select one of the four basic modes for alignment analyses.

r : str

The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns:

alignments : list

A list of tuples of size 4, containing the alignment, the similarity and the distance for each sequence pair.

Notes

This function computes alignments of all possible pairs passed in the list of sequences and is basically used in LingPy’s module for multiple alignment analyses (lingpy.align.multiple).

lingpy.algorithm.cython.calign.align_profile()

Align two profiles using the basic modes.

Parameters:

profileA, profileB : list

Two-dimensional list for each of the profiles.

gopA, gopB : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

proA, proB : str

The prosodic strings which have the same length as profileA and profileB.

gop : int

The general gap opening penalty which will be used to introduce a gap between the two profiles.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

restricted_chars : { str }

The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence. They need to be computed by computing a consensus string from all prosodic strings in the profile.

mode : { “global”, “local”, “overlap”, “dialign” }

Select one of the four basic modes for alignment analyses.

gap_weight : float

This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns:

alignment : tuple

The aligned profiles, and the overall similarity of the profiles.

Notes

This function computes alignments of two profiles of multiple sequences (see Durbin2002 for details on profiles) and is basically used in LingPy’s module for multiple alignment (lingpy.align.multiple).

lingpy.algorithm.cython.calign.corrdist()

Create a correspondence distribution for a given language pair.

Parameters:

threshold : float

The threshold of sequence distance which determines whether a sequence pair is included or excluded from the calculation of the distribution.

seqs : list

The sequences passed as a two-dimensional list of sequence pairs.

gops : list

The gap opening penalties, passed as individual lists of penalties for each sequence.

pros : list

The list of prosodic strings for each sequence.

gop : int

The general gap opening penalty which will be used to introduce a gap between the two profiles.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

mode : { “global”, “local”, “overlap”, “dialign” }

Select one of the four basic modes for alignment analyses.

restricted_chars : { str }

The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence. They need to be computed by computing a consensus string from all prosodic strings in the profile.

Returns:

results : tuple

A dictionary containing the distribution, and the number of included sequences.

Notes

This function is the core of the LexStat function to compute distributions of aligned segment pairs.

lingpy.algorithm.cython.calign.dialign()

Carry out dialign alignment of two sequences.

Parameters:

seqA, seqB : list

The list containing the sequences.

proA, proB : str

The prosodic strings which have the same length as seqA and seqB.

M, N : int

The lengths of seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

Returns:

alignment : tuple

A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out local dialign alignment analyses (keyword “dialign”) when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in List2014d for details), like Pairwise, Multiple, or LexStat. Dialign (see Morgenstern1996) is an alignment algorithm that does not require gap penalties and generally works in a rather local fashion.

lingpy.algorithm.cython.calign.globalign()

Carry out global alignment of two sequences.

Parameters:

seqA, seqB : list

The list containing the sequences.

gopA, gopB : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

proA, proB : str

The prosodic strings which have the same length as seqA and seqB.

M, N : int

The lengths of seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

Returns:

alignment : tuple

A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out global alignment analyses when using many of LingPy’s classes for alignment analyses, like Pairwise, Multiple, or LexStat. It differs from classical Needleman-Wunsch alignment (compare Needleman1970) in a couple of aspects. These include, among others, the use of a gap extension scale rather than a gap extension penalty (the scale consecutively reduces the gap penalty and thus lets gap penalties approach zero if gapped regions are large), the use of individual gap opening penalties for all positions of a sequence, and the use of prosodic strings, and prosodic factors that raise scores when segments occur in the same prosodic environment.

If one sets certain of these parameters to zero or one and uses the same gap opening penalties, however, the function will behave like the traditional Needleman-Wunsch algorithm, and since it is implemented in Cython, it will work faster than a pure Python implementation for alignment algorithms.

Examples

We show that the Needleman-Wunsch algorithms yields the same result as the globalign algorithm, provided we adjust the parameters:

>>> from lingpy.algorithm.cython.calign import globalign
>>> from lingpy.align.pairwise import nw_align
>>> nw_align('abab', 'baba')
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 1)

>>> globalign(list('abab'), list('baba'), 4 * [-1], 4 * [-1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"):-1, ("b","a"): -1, ("a","a"): 1, ("b", "b"): 1})
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 1.0)
lingpy.algorithm.cython.calign.localign()

Carry out semi-global alignment of two sequences.

Parameters:

seqA, seqB : list

The list containing the sequences.

gopA, gopB : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

proA, proB : str

The prosodic strings which have the same length as seqA and seqB.

M, N : int

The lengths of seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

Returns:

alignment : tuple

A tuple of the two alignments and the alignment score. The alignments are each a list of suffix, alignment, and prefix.

Notes

This is the function that is called to carry out local alignment analyses when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in List2014d for details), like Pairwise, Multiple, or LexStat. Local alignment means that only the best matching substring between two sequences is returned (compare Smith1981), also called the Smith-Waterman algorithm.

lingpy.algorithm.cython.calign.score_profile()

Basic function for the scoring of profiles.

Parameters:

colA, colB : list

The two columns of a profile.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

gap_weight : float (default=0.0)

This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns:

score : float

The score for the profile

Notes

This function handles how profiles are scored.

lingpy.algorithm.cython.calign.secondary_dialign()

Carry out dialign alignment of two sequences with sensitivity for secondary sequence structures.

Parameters:

seqA, seqB : list

The list containing the sequences.

proA, proB : str

The prosodic strings which have the same length as seqA and seqB.

M, N : int

The lengths of seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

r : { str }

The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns:

alignment : tuple

A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out local dialign alignment analyses (keyword “dialign”) when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in List2014d for details), like Pairwise, Multiple, or LexStat. Dialign (see Morgenstern1996) is an alignment algorithm that does not require gap penalties and generally works in a rather local fashion.

lingpy.algorithm.cython.calign.secondary_globalign()

Carry out global alignment of two sequences with secondary sequence structures.

Parameters:

seqA, seqB : list

The list containing the sequences.

gopA, gopB : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

proA, proB : str

The prosodic strings which have the same length as seqA and seqB.

M, N : int

The lengths of seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

r : { str }

The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns:

alignment : tuple

A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out global alignment analyses when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in List2014d for details), like Pairwise, Multiple, or LexStat. It differs from classical Needleman-Wunsch alignment (compare Needleman1970) in a couple of aspects. These include, among others, the use of a gap extension scale rather than a gap extension penalty (the scale consecutively reduces the gap penalty and thus lets gap penalties approach zero if gapped regions are large), the use of individual gap opening penalties for all positions of a sequence, and the use of prosodic strings, and prosodic factors that raise scores when segments occur in the same prosodic environment.

If one sets certain of these parameters to zero or one and uses the same gap opening penalties, however, the function will behave like the traditional Needleman-Wunsch algorithm, and since it is implemented in Cython, it will work faster than a pure Python implementation for alignment algorithms.

Examples

We compare globalign with secondary_globalign::
>>> from lingpy.algorithm.cython.calign import globalign, secondary_globalign
>>> globalign(list('abab'), list('baba'), 4 * [-1], 4 * [-1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"):-1, ("b","a"): -1, ("a","a"): 1, ("b", "b"): 1})
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 1.0)
>>> secondary_globalign(list('ab.ab'), list('ba.ba'), 5 * [-1], 5 * [-1], 'ab.ab', 'ba.ba', 5, 5, 1, 0, {("a","b"):-1, ("b","a"): -1, ("a","a"): 1, ("b", "b"): 1, ("a",".") : -1, ("b","."):-1, (".","."):0, (".", "b"): -1, (".", "a"):-1}, '.')
(['a', 'b', '-', '.', 'a', 'b', '-'],
['-', 'b', 'a', '.', '-', 'b', 'a'],
-2.0)
lingpy.algorithm.cython.calign.secondary_localign()

Carry out lobal alignment of two sequences with sensitivity to secondary sequence structures.

Parameters:

seqA, seqB : list

The list containing the sequences.

gopA, gopB : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

proA, proB : str

The prosodic strings which have the same length as seqA and seqB.

M, N : int

The lengths of seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

r : { str }

The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns:

alignment : tuple

A tuple of the two alignments and the alignment score. The alignments are each a list of suffix, alignment, and prefix.

Notes

This is the function that is called to carry out local alignment analyses when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in List2014d for details), like Pairwise, Multiple, or LexStat. Local alignment means that only the best matching substring between two sequences is returned (compare Smith1981), also called the Smith-Waterman algorithm.

lingpy.algorithm.cython.calign.secondary_semi_globalign()

Carry out semi-global alignment of two sequences with sensitivity to secondary sequence structures.

Parameters:

seqA, seqB : list

The list containing the sequences.

gopA, gopB : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

proA, proB : str

The prosodic strings which have the same length as seqA and seqB.

M, N : int

The lengths of seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

r : { str }

The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns:

alignment : tuple

A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out semi-global alignment analyses (keyword “overlap”) when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in List2014d for details), like Pairwise, Multiple, or LexStat. Semi-global alignment means that the suffixes or prefixes in one of the words are not penalized.

lingpy.algorithm.cython.calign.semi_globalign()

Carry out semi-global alignment of two sequences.

Parameters:

seqA, seqB : list

The list containing the sequences.

gopA, gopB : list

The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).

proA, proB : str

The prosodic strings which have the same length as seqA and seqB.

M, N : int

The lengths of seqA and seqB.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

factor : float

The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in seqA and seqB.

Returns:

alignment : tuple

A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out semi-global alignment analyses (keyword “overlap”) when using many of LingPy’s classes for alignment analyses which is at the same time sensitive for secondary sequence structures (see the description of secondary alignment in List2014d for details), like Pairwise, Multiple, or LexStat. Semi-global alignment means that the suffixes or prefixes in one of the words are not penalized.

Examples

We compare globalign with semi_globalign::
>>> from lingpy.algorithm.cython.calign import globalign, semi_globalign
>>> globalign(list('abab'), list('baba'), 4 * [-1], 4 * [-1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"):-1, ("b","a"): -1, ("a","a"): 1, ("b", "b"): 1})
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 1.0)
>>> semi_globalign(list('abab'), list('baba'), 4 * [-1], 4 * [-1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"):-1, ("b","a"): -1, ("a","a"): 1, ("b", "b"): 1})
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 3.0)
lingpy.algorithm.cython.calign.swap_score_profile()

Basic function for the scoring of profiles which contain swapped sequences.

Parameters:

colA, colB : list

The two columns of a profile.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

gap_weight : float (default=0.0)

This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

swap_penalty : int (default=-5)

The swap penalty applied to swapped columns.

Returns:

score : float

The score for the profile.

Notes

This function handles how profiles with swapped segments are scored.

lingpy.algorithm.cython.cluster module

lingpy.algorithm.cython.cluster.flat_cluster()

Carry out a flat cluster analysis based on the UPGMA algorithm.

Parameters:

method : str { ‘upgma’, ‘single’, ‘complete’ }

Select between ‘ugpma’, ‘single’, and ‘complete’.

threshold : float

The threshold which terminates the algorithm.

matrix : list or numpy.array

A two-dimensional list containing the distances.

taxa : list (default = [])

A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns:

clusters : dict

A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *

Create a list of arbitrary taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary distance matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
array([[ 0.  ,  0.5 ,  0.67,  0.8 ,  0.2 ],
       [ 0.5 ,  0.  ,  0.4 ,  0.7 ,  0.6 ],
       [ 0.67,  0.4 ,  0.  ,  0.8 ,  0.8 ],
       [ 0.8 ,  0.7 ,  0.8 ,  0.  ,  0.3 ],
       [ 0.2 ,  0.6 ,  0.8 ,  0.3 ,  0.  ]])

Carry out the flat cluster analysis.

>>> flat_upgma(0.5,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}
lingpy.algorithm.cython.cluster.flat_upgma()

Carry out a flat cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters:

threshold : float

The threshold which terminates the algorithm.

matrix : list or numpy.array

A two-dimensional list containing the distances.

taxa : list (default = [])

A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns:

clusters : dict

A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.

Examples

The function is automatically imported along with LingPy.

>>> from lingpy import *

Create a list of arbitrary taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary distance matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
array([[ 0.  ,  0.5 ,  0.67,  0.8 ,  0.2 ],
       [ 0.5 ,  0.  ,  0.4 ,  0.7 ,  0.6 ],
       [ 0.67,  0.4 ,  0.  ,  0.8 ,  0.8 ],
       [ 0.8 ,  0.7 ,  0.8 ,  0.  ,  0.3 ],
       [ 0.2 ,  0.6 ,  0.8 ,  0.3 ,  0.  ]])

Carry out the flat cluster analysis.

>>> flat_upgma(0.5,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}
lingpy.algorithm.cython.cluster.neighbor()

Function clusters data according to the Neighbor-Joining algorithm (Saitou1987).

Parameters:

matrix : list or numpy.array

A two-dimensional list containing the distances.

taxa : list

An list containing the names of all taxa corresponding to the distances in the matrix.

distances : bool

If set to False, only the topology of the tree will be returned.

Returns:

newick : str

A string in newick-format which can be further used in biological software packages to view and plot the tree.

Examples

Function is automatically imported when importing lingpy.

>>> from lingpy import *

Create an arbitrary list of taxa.

>>> taxa = ['Norwegian','Swedish','Icelandic','Dutch','English']

Create an arbitrary matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])

Carry out the cluster analysis.

>>> neighbor(matrix,taxa)
'(((Norwegian,(Swedish,Icelandic)),English),Dutch);'
lingpy.algorithm.cython.cluster.upgma()

Carry out a cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters:

matrix : list or numpy.array

A two-dimensional list containing the distances.

taxa : list

An list containing the names of all taxa corresponding to the distances in the matrix.

distances : bool

If set to False, only the topology of the tree will be returned.

Returns:

newick : str

A string in newick-format which can be further used in biological software packages to view and plot the tree.

Examples

Function is automatically imported when importing lingpy.

>>> from lingpy import *

Create an arbitrary list of taxa.

>>> taxa = ['German','Swedish','Icelandic','English','Dutch']

Create an arbitrary matrix.

>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])

Carry out the cluster analysis.

>>> upgma(matrix,taxa,distances=False)
'((Swedish,Icelandic),(English,(German,Dutch)));'

lingpy.algorithm.cython.compilePYX module

Script handles compilation of Cython files to C and also to C-Extension modules.

lingpy.algorithm.cython.compilePYX.main()
lingpy.algorithm.cython.compilePYX.pyx2py(infile, debug=False)

lingpy.algorithm.cython.malign module

This module provides various alignment functions in an optimized version.

lingpy.algorithm.cython.malign.edit_dist()

Return the edit-distance between two strings.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as list.

normalized : bool

Indicate whether you want the normalized or the unnormalized edit distance to be returned.

Returns:

dist : { int, float }

Either the normalized or the unnormalized edit distance.

lingpy.algorithm.cython.malign.nw_align()

Align two sequences using the Needleman-Wunsch algorithm.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as list.

scorer : dict

A dictionary containing tuples of two segments as key and numbers as values.

gap : int

The gap penalty.

Returns:

alignment : tuple

A tuple of the two aligned sequences, and the similarity score.

Notes

This function is a very straightforward implementation of the Needleman-Wunsch algorithm (Needleman1970). We recommend to use the function if you want to test your own scoring dictionaries and profit from a fast implementation (as we use Cython, the implementation is indeed faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the NW algorithm without specifying a scoring dictionary, we recommend to have a look at our wrapper function with the same name in the pairwise module.

lingpy.algorithm.cython.malign.restricted_edit_dist()

Return the restricted edit-distance between two strings.

Parameters:

seqA, seqB : list

The two sequences passed as list.

resA, resB : str

The restrictions passed as a string with the same length as the corresponding sequence. We note a restriction if the strings show different symbols in their restriction string. If the symbols are identical, it is modeled as a non-restriction.

normalized : bool

Determine whether you want to return the normalized or the unnormalized edit distance.

Notes

Restrictions follow the definition of Heeringa2006: Segments that are not allowed to match are given a penalty of \infty. We model restrictions as strings, for example consisting of letters “c” and “v”. So the sequence “woldemort” could be modeled as “cvccvcvcc”, and when aligning it with the sequence “walter” and its restriction string “cvccvc”, the matching of those segments in the sequences in which the segments of the restriction string differ, would be heavily penalized, thus prohibiting an alignment of “vowels” and “consonants” (“v” and “c”).

lingpy.algorithm.cython.malign.structalign()

Carry out a structural alignment analysis using Dijkstra’s algorithm.

Parameters:

seqA,seqB : str

The input sequences.

restricted_chars : str (default = “”)

The characters which are used to separate secondary from primary segments in the input sequences. Currently, the use of restricted chars may fail to yield an alignment.

Notes

Structural alignment is hereby understood as an alignment of two sequences whose alphabets differ. The algorithm returns all alignments with minimal edit distance. Edit distance in this context refers to the number of edit operations that are needed in order to convert one sequence into the other, with repeated edit operations being penalized only once.

lingpy.algorithm.cython.malign.sw_align()

Align two sequences using the Smith-Waterman algorithm.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as list.

scorer : dict

A dictionary containing tuples of two segments as key and numbers as values.

gap : int

The gap penalty.

Returns:

alignment : tuple

A tuple of the two aligned sequences, and the similarity score.

Notes

This function is a very straightforward implementation of the Smith-Waterman algorithm (Smith1981). We recommend to use the function if you want to test your own scoring dictionaries and profit from a fast implementation (as we use Cython, the implementation is indeed faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the SW algorithm without specifying a scoring dictionary, we recommend to have a look at our wrapper function with the same name in the pairwise module.

lingpy.algorithm.cython.malign.we_align()

Align two sequences using the Waterman-Eggert algorithm.

Parameters:

seqA, seqB : list

The input sequences passed as a list.

scorer : dict

A dictionary containing tuples of two segments as key and numbers as values.

gap : int

The gap penalty.

Returns:

alignments : list

A list consisting of tuples. Each tuple gives the alignment of one of the subsequences of the input sequences. Each tuple contains the aligned part of the first, the aligned part of the second sequence, and the score of the alignment.

Notes

This function is a very straightforward implementation of the Waterman-Eggert algorithm (Waterman1987). We recommend to use the function if you want to test your own scoring dictionaries and profit from a fast implementation (as we use Cython, the implementation is indeed faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the WE algorithm without specifying a scoring dictionary, we recommend to have a look at our wrapper function with the same name in the pairwise module.

lingpy.algorithm.cython.misc module

class lingpy.algorithm.cython.misc.ScoreDict

Bases: object

Class allows quick access to scoring functions using dictionary syntax.

Parameters:

chars : list

The list of all character tokens for the scoring dictionary.

matrix : list

A two-dimensional scoring matrix.

Notes

Since this class has dictionary syntax, you can always also just create a dictionary in order to store your scoring functions. Scoring dictionaries should contain a tuple of segments to be compared as a key, and a float or integer as a value, with negative values indicating dissimilarity, and positive values similarity.

Examples

Initialize a ScoreDict object::
>>> from lingpy.algorith.cython.misc import ScoreDict
>>> scorer = ScoreDict(['a', 'b'], [1, -1, -1, 1])
Retrieve scores::
>>> scorer['a', 'b']
-1
>>> scorer['a', 'a']
1
>>> scorer['a', 'X']
-22.5
lingpy.algorithm.cython.misc.squareform()

A simplified version of the scipy.spatial.distance.squareform() function.

Parameters:

x : numpy.array or list

The one-dimensional flat representation of a symmetrix distance matrix.

Returns:

matrix : numpy.array

The two-dimensional redundant representation of a symmetric distance matrix.

lingpy.algorithm.cython.misc.transpose()

Transpose a matrix along its two dimensions.

Parameters:

matrix : list

A two-dimensional list.

lingpy.algorithm.cython.talign module

lingpy.algorithm.cython.talign.align_pair()

Align a pair of sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

mode : { “global”, “local”, “overlap”, “dialign” }

Select the mode for the alignment analysis (“overlap” refers to semi-global alignments).

distance : int (default=0)

Select whether you want distances or similarities to be returned (0 indicates similarities, 1 indicates distances, 2 indicates both).

Returns:

alignment : tuple

The aligned sequences and the similarity or distance scores, or both.

Notes

This is a utility function that allows calls any of the four classical alignment functions (lingpy.algorithm.cython.talign.globalign lingpy.algorithm.cython.talign.semi_globalign, lingpy.algorithm.cython.talign.lotalign, lingpy.algorithm.cython.talign.dialign,) and their secondary counterparts.

lingpy.algorithm.cython.talign.align_pairs()

Align multiple sequence pairs.

Parameters:

seqs : list

The sequences to be aligned, passed as lists.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

mode : { “global”, “local”, “overlap”, “dialign” }

Select the mode for the alignment analysis (“overlap” refers to semi-global alignments).

distance : int (default=0)

Indicate whether distances or similarities should be returned.

Returns:

alignments : list

A list of tuples, containing the aligned sequences, and the similarity or the distance scores.

Notes

This function aligns all pairs which are passed to it.

lingpy.algorithm.cython.talign.align_pairwise()

Align all sequences pairwise.

Parameters:

seqs : list

The sequences to be aligned, passed as lists.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

mode : { “global”, “local”, “overlap”, “dialign” }

Select the mode for the alignment analysis (“overlap” refers to semi-global alignments).

Returns:

alignments : list

A list of tuples, containing the aligned sequences, the similarity and the distance scores.

Notes

This function aligns all possible pairs between the sequences you pass to it. It is important for multiple alignment, where it can be used to construct the guide tree.

lingpy.algorithm.cython.talign.align_profile()

Align two profiles using the basic modes.

Parameters:

profileA, profileB : list

Two-dimensional list for each of the profiles.

gop : int

The gap opening penalty.

scale : float

The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

mode : { “global”, “overlap”, “dialign” }

Select one of the four basic modes for alignment analyses.

gap_weight : float

This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns:

alignment : tuple

The aligned profiles, and the overall similarity of the profiles.

Notes

This function computes alignments of two profiles of multiple sequences (see Durbin2002 for details on profiles) and is important for multiple alignment analyses.

lingpy.algorithm.cython.talign.dialign()

Carry out dialign alignment of two sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

M, N : int

The length of the two sequences.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns:

alignment : tuple

The aligned sequences and the similarity score.

Notes

This algorithm carries out dialign alignment (Morgenstern1996).

lingpy.algorithm.cython.talign.globalign()

Carry out global alignment of two sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

M, N : int

The length of the two sequences.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns:

alignment : tuple

The aligned sequences and the similarity score.

Notes

This algorithm carries out classical Needleman-Wunsch alignment (Needleman1970).

lingpy.algorithm.cython.talign.localign()

Carry out semi-global alignment of two sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

M, N : int

The length of the two sequences.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns:

alignment : tuple

The aligned sequences and the similarity score.

Notes

This algorithm carries out local alignment (Smith1981).

lingpy.algorithm.cython.talign.score_profile()

Basic function for the scoring of profiles.

Parameters:

colA, colB : list

The two columns of a profile.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

gap_weight : float (default=0.0)

This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns:

score : float

The score for the profile

Notes

This function handles how profiles are scored.

lingpy.algorithm.cython.talign.semi_globalign()

Carry out semi-global alignment of two sequences.

Parameters:

seqA, seqB : list

The sequences to be aligned, passed as lists.

M, N : int

The length of the two sequences.

gop : int

The gap opening penalty.

scale : float

The gap extension scale.

scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }

The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns:

alignment : tuple

The aligned sequences and the similarity score.

Notes

This algorithm carries out semi-global alignment (Durbin2002).

lingpy.algorithm.cython.talign.swap_score_profile()

Basic function for the scoring of profiles in swapped sequences.

Parameters:

colA, colB : list

The two columns of a profile.

scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }

The scoring function which needs to provide scores for all segments in the two profiles.

gap_weight : float (default=0.0)

This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

swap_penalty : int (default=-5)

The swap penalty applied to swapped columns.

Returns:

score : float

The score for the profile.

Notes

This function handles how profiles with swapped segments are scored.

Module contents

Package provides modules for time-consuming routines.