lingpy.compare package

Submodules

lingpy.compare.lexstat module

class lingpy.compare.lexstat.LexStat(filename, **keywords)

Bases: Wordlist

Basic class for automatic cognate detection.

Parameters:

filename : str

The name of the file that shall be loaded.

model : Model

The sound-class model that shall be used for the analysis. Defaults to the SCA sound-class model.

merge_vowels : bool (default=True)

Indicate whether consecutive vowels should be merged into single tokens or kept apart as separate tokens.

transform : dict

A dictionary that indicates how prosodic strings should be simplified (or generally transformed), using a simple key-value structure with the key referring to the original prosodic context and the value to the new value. Currently, prosodic strings (see prosodic_string()) offer 11 different prosodic contexts. Since not all these are helpful in preliminary analyses for cognate detection, it is useful to merge some of these contexts into one. The default settings distinguish only 5 instead of 11 available contexts, namely:

  • C for all consonants in prosodically ascending position,

  • c for all consonants in prosodically descending position,

  • V for all vowels,

  • T for all tones, and

  • _ for word-breaks.

Make sure to also check the “vowels” keyword when initialising a LexStat object, since the symbols you use for vowels and tones should be identical with the ones you define in your transform dictionary (see the sketch below).
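
A minimal sketch of passing a custom transform dictionary is given below. The context symbols A/B/C, L/M/N, X/Y/Z, T, and _ are assumed here to be the prosodic contexts produced by prosodic_string(); the file name is purely illustrative:

>>> transform = {
...     'A': 'C', 'B': 'C', 'C': 'C',   # ascending consonants -> C
...     'L': 'c', 'M': 'c', 'N': 'c',   # descending consonants -> c
...     'X': 'V', 'Y': 'V', 'Z': 'V',   # vowels -> V
...     'T': 'T',                       # tones
...     '_': '_',                       # word breaks
... }
>>> lex = LexStat('inputfile.tsv', transform=transform, vowels="VT_")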

vowels : str (default=”VT_”)

For scoring function creation using the get_scorer function, you have the possibility to use reduced scores for the matching of tones and vowels by modifying the “vscale” parameter, which is set to 0.5 as a default. In order to make sure that vowels and tones are properly detected, make sure your prosodic string representation of vowels matches the one in this keyword. Thus, if you change the prosodic strings using the “transform” keyword, you also need to change the vowel string, to make sure that “vscale” works as wanted in the get_scorer function.

check : bool (default=False)

If set to True, the input file will first be checked for errors before the calculation is carried out. Errors will be written to the file specified by the errors keyword, which defaults to errors.log. See also apply_checks.

apply_checks : bool (default=False)

If set to True, any errors identified by check will be handled silently.

no_bscorer : bool (default=False)

If set to True, this will suppress the creation of a language-specific scoring function (which may become quite large and constitutes additional ballast if the “lexstat” method is not used after all). If you use the “lexstat” method, however, this needs to be set to False.

errors : str

The name of the error log.

segments : str (default=”tokens”)

The name of the column in your data which contains the segmented transcriptions, or in which the segmented transcriptions should be placed.

transcription : str (default=”ipa”)

The name of the column in your data which contains the unsegmented transcriptions.

classes : str (default=”classes”)

The name of the column in the data which contains the sound class representation of the transcriptions, or in which this information shall be placed after automatic conversion.

numbers : str (default=”numbers”)

The language-specific triples consisting of language id (numeric), sound class string (one character only), and prosodic string (one character only). Usually, numbers are automatically created from the columns “classes”, “prostrings”, and “langid”, but you can also provide them in your data.

langid : str (default=”langid”)

Name of the column that contains a numerical language identifier, needed to produce the language-specific character triples (“numbers”). Unless specified explicitly, this is automatically created.

prostrings : str (default=”prostrings”)

Name of the column containing prosodic strings (see List2014d for more details) of the segmented transcriptions, with one character per segment. Prostrings add a contextual component to phonetic sequences. They are automatically created, but can likewise be submitted with the initial data.

weights : str (default=”weights”)

The name of the column which stores the individual gap-weights for each sequence. Gap weights are positive floats for each segment in a string, which modify the gap opening penalty during alignment.

tokenize : function (default=ipa2tokens)

The function which should be used to tokenize the entries in the column storing the transcriptions in case no segmentation is provided by the user.

get_prostring : function (default=prosodic_string)

The function which should be used to create prosodic strings from the segmented transcription data. If you want to completely ignore prosodic strings in LexStat calculations, you could just pass the following function:

>>> lex = LexStat('inputfile.tsv', get_prostring=lambda x: ["x" for y in x])

cldf : bool (default=True)

If set to True, as by default, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved when internally converting tokens to classes (e.g., laryngeal h₂ in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), users can adopt a `source/target` style in cases where the pronunciation is uncertain, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.
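
As a rough illustration of this convention, a segmented entry in the input data could contain a token such as 'h₂/x'; the sketch below assumes, as the description above suggests, that the part after the slash is the one resolved during sound-class conversion:

>>> lex = LexStat('inputfile.tsv', cldf=True)   # hypothetical input file
>>> # a segmented entry may then contain tokens like 'h₂/x', where 'x' is the
>>> # phonetic interpretation used when converting tokens to sound classes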

Notes

Instantiating this class does not require a lot of parameters. However, the user may modify its behaviour by providing additional attributes in the input file.

Attributes

pairs

dict

A dictionary with tuples of language names as key and indices as value, pointing to unique combinations of words with the same meaning in all language pairs.

model

Model

The sound class model instance which serves to convert the phonetic data into sound classes.

chars

list

A list of all unique language-specific character types in the instantiated LexStat object. The characters in this list consist of

  • the language identifier (numeric, referenced as “langid” as a default, but customizable via the keyword “langid”)

  • the sound class symbol for the respective IPA transcription value

  • the prosodic class value

All values are represented in the above order as one string, separated by a dot. Gaps are also included in this collection. They are traditionally represented as “X” for the sound class and “-” for the prosodic string.

rchars

list

A list containing all unique character types across languages. In contrast to the chars-attribute, the “rchars” (raw chars) do not contain the language identifier, thus they only consist of two values, separated by a dot, namely, the sound class symbol, and the prosodic class value.

scorer

dict

A collection of ScoreDict objects, which are used to score the strings. LexStat distinguishes two different scoring functions:

  • rscorer: A “raw” scorer that is not language-specific and consists only of sound class values and prosodic string values. This scorer is traditionally used to carry out the first alignment in order to calculate the language-specific scorer. It is directly accessible as an attribute of the LexStat class (rscorer). The characters which constitute the values in this scorer are accessible via the “rchars” attribute of each LexStat class.

  • bscorer: The language-specific scorer. This scorer is made of unique language-specific characters. These are accessible via the “chars” attribute of each LexStat class. Like the “rscorer”, the “bscorer” can also be accessed directly as an attribute of the LexStat class (bscorer).
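
A minimal sketch of inspecting these attributes; the input file name is hypothetical, and the character formats shown in the comments are only indicative:

>>> lex = LexStat('inputfile.tsv')      # hypothetical input file
>>> raw_chars = lex.rchars              # e.g. 'K.C' -- sound class plus prosodic context
>>> lang_chars = lex.chars              # e.g. '1.K.C' -- language id, class, prosody
>>> raw_scorer = lex.rscorer            # ScoreDict over the raw characters
>>> lang_scorer = lex.bscorer           # ScoreDict over the language-specific characters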

align_pairs(idxA, idxB, concept=None, **keywords)

Align all or some words of a given pair of languages.

Parameters:

idxA,idxB : {int, str}

Use an integer to refer to the words by their unique internal ID, use language names to select all words for a given language.

method : {‘lexstat’,’sca’}

Define the method to be used for the alignment of the words.

mode : {‘global’,’local’,’overlap’,’dialign’} (default=’overlap’)

Select the mode for the alignment analysis.

gop : int (default=-2)

If ‘sca’ is selected as a method, define the gap opening penalty.

scale : float (default=0.5)

Select the scale for the gap extension penalty.

factor : float (default=0.3)

Select the factor for extra scores for identical prosodic segments.

restricted_chars : str (default=”T_”)

Select the restricted chars (boundary markers) in the prosodic strings in order to enable secondary alignment.

distance : bool (default=True)

If set to True, return the distance instead of the similarity score.

pprint : bool (default=True)

If set to True, print the results to the terminal.

return_distance : bool (default=False)

If set to True, return the distance score, otherwise, nothing will be returned.
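
A brief usage sketch; the language names, word IDs, and input file are illustrative, and with method='lexstat' a scorer must have been computed beforehand via get_scorer:

>>> lex = LexStat('inputfile.tsv')
>>> lex.align_pairs('English', 'German', method='sca', mode='overlap')
>>> d = lex.align_pairs(1, 2, method='sca', pprint=False, return_distance=True)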

cluster(method='sca', cluster_method='upgma', threshold=0.3, scale=0.5, factor=0.3, restricted_chars='_T', mode='overlap', gop=-2, restriction='', ref='', external_function=None, **keywords)

Function for flat clustering of words into cognate sets.

Parameters:

method : {‘sca’,’lexstat’,’edit-dist’,’turchin’} (default=’sca’)

Select the method that shall be used for the calculation.

cluster_method : {‘upgma’,’single’,’complete’, ‘mcl’} (default=’upgma’)

Select the cluster method. ‘upgma’ (Sokal1958) refers to average linkage clustering, ‘mcl’ refers to the “Markov Clustering Algorithm” (Dongen2000).

threshold : float (default=0.3)

Select the threshold for the cluster approach. If set to False, an automatic threshold will be calculated by calculating the average distance of unrelated sequences (use with care).

scale : float (default=0.5)

Select the scale for the gap extension penalty.

factor : float (default=0.3)

Select the factor for extra scores for identical prosodic segments.

restricted_chars : str (default=”T_”)

Select the restricted chars (boundary markers) in the prosodic strings in order to enable secondary alignment.

mode : {‘global’,’local’,’overlap’,’dialign’} (default=’overlap’)

Select the mode for the alignment analysis.

verbose : bool (default=False)

Define whether verbose output should be used or not.

gop : int (default=-2)

If ‘sca’ is selected as a method, define the gap opening penalty.

restriction : {‘cv’} (default=””)

Specify the restriction for calculations using the edit-distance. Currently, only “cv” is supported. If edit-dist is selected as method and restriction is set to cv, consonant-vowel matches will be prohibited in the calculations and the edit distance will be normalized by the length of the alignment rather than the length of the longest sequence, as described in Heeringa2006.

inflation : {int, float} (default=2)

Specify the inflation parameter for the use of the MCL algorithm.

expansion : int (default=2)

Specify the expansion parameter for the use of the MCL algorithm.
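
A typical workflow could look like the following sketch; the file name, thresholds, and target column names ('scaid', 'lexstatid') are illustrative choices, and for method='lexstat' the scorer has to be computed first:

>>> lex = LexStat('inputfile.tsv')
>>> lex.cluster(method='sca', threshold=0.45, ref='scaid')        # language-independent baseline
>>> lex.get_scorer(runs=1000)
>>> lex.cluster(method='lexstat', threshold=0.6, ref='lexstatid')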

get_distances(method='sca', mode='overlap', gop=-2, scale=0.5, factor=0.3, restricted_chars='T_', aggregate=True)

Method calculates different distance estimates for language pairs.

Parameters:

method : {‘sca’,’lexstat’,’edit-dist’,’turchin’} (default=’sca’)

Select the method that shall be used for the calculation.

runs : int (default=100)

Select the number of random alignments for each language pair.

mode : {‘global’,’local’,’overlap’,’dialign’} (default=’overlap’)

Select the mode for the alignment analysis.

gop : int (default=-2)

If ‘sca’ is selected as a method, define the gap opening penalty.

scale : float (default=0.5)

Select the scale for the gap extension penalty.

factor : float (default=0.3)

Select the factor for extra scores for identical prosodic segments.

restricted_chars : str (default=”T_”)

Select the restricted chars (boundary markers) in the prosodic strings in order to enable secondary alignment.

aggregate : bool (default=True)

Return aggregated distances in form of a distance matrix for all taxa in the data.

Returns:

D : numpy.array

An array with all distances calculated for each sequence pair.
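
For illustration (the file name is hypothetical; with method='lexstat' a scorer must have been computed beforehand):

>>> lex = LexStat('inputfile.tsv')
>>> D = lex.get_distances(method='sca')     # aggregated distance matrix for all taxa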

get_frequencies(ftype='sounds', ref='tokens', aggregated=False)

Computes the frequencies of a given wordlist.

Parameters:

ftype : str (default=’sounds’)

The type of frequency which shall be calculated. Select between “sounds” (type-token frequencies of sounds), and “wordlength” (average word length per taxon or in aggregated form), or “diversity” for the diversity index (requires that you have carried out cognate judgments, and make sure to set the “ref” keyword to the column in which your cognates are).

ref : str (default=”tokens”)

The reference column, with the column for “tokens” as a default. Make sure to modify this keyword in case you want to check for the “diversity”.

aggregated : bool (default=False)

Determine whether frequencies should be calculated in an aggregated way, for all languages, or on a language-per-language basis.

Returns:

freqs : {dict, float}

Depending on the selection of the datatype you chose, this returns either a dictionary containing the frequencies or a float indicating the ratio.
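
A short sketch of the three frequency types; the file and cognate column names are illustrative, and the diversity index requires prior cognate judgments as noted above:

>>> lex = LexStat('inputfile.tsv')                        # hypothetical input file
>>> sounds = lex.get_frequencies('sounds')                # type-token frequencies of sounds
>>> lengths = lex.get_frequencies('wordlength')           # average word length per taxon
>>> lex.cluster(method='sca', ref='scaid')                # cognate judgments needed for diversity
>>> diversity = lex.get_frequencies('diversity', ref='scaid')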

get_random_distances(method='lexstat', runs=100, mode='overlap', gop=-2, scale=0.5, factor=0.3, restricted_chars='T_')

Method calculates random scores for unrelated words in a dataset.

Parameters:

method : {‘sca’,’lexstat’,’edit-dist’,’turchin’} (default=’lexstat’)

Select the method that shall be used for the calculation.

runs : int (default=100)

Select the number of random alignments for each language pair.

mode : {‘global’,’local’,’overlap’,’dialign’} (default=’overlap’)

Select the mode for the alignment analysis.

gop : int (default=-2)

If ‘sca’ is selected as a method, define the gap opening penalty.

scale : float (default=0.5)

Select the scale for the gap extension penalty.

factor : float (default=0.3)

Select the factor for extra scores for identical prosodic segments.

restricted_chars : str (default=”T_”)

Select the restricted chars (boundary markers) in the prosodic strings in order to enable secondary alignment.

Returns:

D : numpy.array

An array with all distances calculated for each sequence pair.

get_scorer(**keywords)

Create a scoring function based on sound correspondences.

Parameters:

method : str (default=’shuffle’)

Select between “markov”, for automatically generated random strings, and “shuffle”, for random strings taken directly from the data.

ratio : tuple (default=(3,2))

Define the ratio between derived and original score for sound-matches.

vscale : float (default=0.5)

Define a scaling factor for vowels, in order to decrease their score in the calculations.

runs : int (default=1000)

Choose the number of random runs that shall be made in order to derive the random distribution.

threshold : float (default=0.7)

The threshold which is used to select those words that are compared in order to derive the attested distribution.

modes : list (default = [(“global”,-2,0.5),(“local”,-1,0.5)])

The modes which are used in order to derive the distributions from pairwise alignments.

factor : float (default=0.3)

The scaling factor for sound segments with identical prosodic environment.

force : bool (default=False)

Force recalculation of existing distribution.

preprocessing : bool (default=False)

Select whether SCA-analysis shall be used to derive a preliminary set of cognates from which the attested distribution shall be derived.

rands : int (default=1000)

If “method” is set to “markov”, this parameter defines the number of strings to produce for the calculation of the random distribution.

limit : int (default=10000)

If “method” is set to “markov”, this parameter defines the limit above which no more search for unique strings will be carried out.

cluster_method : {“upgma”, “single”, “complete”} (default=”upgma”)

Select the method to be used for the calculation of cognates in the preprocessing phase, if “preprocessing” is set to True.

gop : int (default=-2)

If “preprocessing” is selected, define the gap opening penalty for the preprocessing calculation of cognates.

unattested : {int, float} (default=-5)

If a pair of sounds is not attested in the data, but expected by the alignment algorithm that computes the expected distribution, the score would be minus infinity. In order to smooth this behaviour and reduce the strictness, we set a default negative value which does not need to be particularly high, since we may well miss a potentially good pairing in the first runs of the alignment analyses. Use this keyword to adjust this parameter.

unexpected : {int, float} (default=0.000001)

If a pair is encountered in a given alignment but not expected according to the randomized alignments, the score would not be calculable, since we would have to divide by zero. For this reason, we set a very small constant by which the score is divided in this case. Note that this constant is only relevant in those cases where the shuffling procedure was not carried out long enough.

get_subset(sublist, ref='concept')

Function creates a specific subset of all word pairs.

Parameters:

sublist : list

A list which contains those items which should be considered for the subset creation, for example, a list of concepts.

ref : string (default=”concept”)

The reference point to compare the given sublist.

Notes

This function can be used to consider only a smaller part of word pairs when creating a scorer. Normally, all words are compared, but defining a subset allows you to compare only those belonging to a specific concept list (e.g., a Swadesh list).
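
A minimal sketch of defining such a subset; the file name and the concept labels are illustrative:

>>> lex = LexStat('inputfile.tsv')                               # hypothetical input file
>>> swadesh_subset = ['hand', 'foot', 'eye', 'water', 'stone']   # hypothetical concept list
>>> lex.get_subset(swadesh_subset, ref='concept')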

output(fileformat, **keywords)

Write data to file.

Parameters:

fileformat : {‘tsv’, ‘tre’,’nwk’,’dst’, ‘taxa’,’starling’, ‘paps.nex’, ‘paps.csv’}

The format that is written to file. This corresponds to the file extension, thus ‘tsv’ creates a file in tsv-format, ‘dst’ creates a file in Phylip-distance format, etc.

filename : str

Specify the name of the output file (defaults to a filename that indicates the creation date).

subset : bool (default=False)

If set to True, return only a subset of the data. Which subset is specified in the keywords ‘cols’ and ‘rows’.

cols : list

If subset is set to True, specify the columns that shall be written to the csv-file.

rows : dict

If subset is set to True, use a dictionary consisting of keys that specify a column and values that give a Python-statement in raw text, such as, e.g., “== ‘hand’”. The content of the specified column will then be checked against the statement passed in the dictionary, and if it evaluates to True, the respective row will be written to file.

ref : str

Name of the column that contains the cognate IDs if ‘starling’ is chosen as an output format.

missing : { str, int } (default=0)

If ‘paps.nex’ or ‘paps.csv’ is chosen as fileformat, this character will be inserted as an indicator of missing data.

tree_calc : {‘neighbor’, ‘upgma’}

If no tree has been calculated and ‘tre’ or ‘nwk’ is chosen as output format, the method that is used to calculate the tree.

threshold : float (default=0.6)

The threshold that is used to carry out a flat cluster analysis if ‘groups’ or ‘cluster’ is chosen as output format.

ignore : { list, “all” }

Modifies the “tsv” file output and allows you to ignore certain blocks in extended “tsv”, like “msa”, “taxa”, “json”, etc., which should be passed as a list. If you choose “all” as a plain string and not a list, this will ignore all additional blocks and output only plain “tsv”.

prettify : bool (default=True)

Inserts comment characters between concepts in the “tsv” file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain “tsv”.
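
A short usage sketch; the output file name is illustrative:

>>> lex = LexStat('inputfile.tsv')                 # hypothetical input file
>>> lex.output('tsv', filename='lexstat-results', ignore='all', prettify=False)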

lingpy.compare.lexstat.char_from_charstring(cstring)
lingpy.compare.lexstat.get_score_dict(chars, model)

lingpy.compare.partial module

Module provides a class for partial cognate detection, expanding the LexStat class.

class lingpy.compare.partial.Partial(infile, **keywords)

Bases: LexStat

Extended class for automatic detection of partial cognates.

Parameters:

filename : str

The name of the file that shall be loaded.

model : Model

The sound-class model that shall be used for the analysis. Defaults to the SCA sound-class model.

merge_vowels : bool (default=True)

Indicate whether consecutive vowels should be merged into single tokens or kept apart as separate tokens.

transform : dict

A dictionary that indicates how prosodic strings should be simplified (or generally transformed), using a simple key-value structure with the key referring to the original prosodic context and the value to the new value. Currently, prosodic strings (see prosodic_string()) offer 11 different prosodic contexts. Since not all these are helpful in preliminary analyses for cognate detection, it is useful to merge some of these contexts into one. The default settings distinguish only 5 instead of 11 available contexts, namely:

  • C for all consonants in prosodically ascending position,

  • c for all consonants in prosodically descending position,

  • V for all vowels,

  • T for all tones, and

  • _ for word-breaks.

Make sure to check also the “vowel” keyword when initialising a LexStat object, since the symbols you use for vowels and tones should be identical with the ones you define in your transform dictionary.

vowels : str (default=”VT_”)

For scoring function creation using the get_scorer function, you have the possibility to use reduced scores for the matching of tones and vowels by modifying the “vscale” parameter, which is set to 0.5 as a default. In order to make sure that vowels and tones are properly detected, make sure your prosodic string representation of vowels matches the one in this keyword. Thus, if you change the prosodic strings using the “transform” keyword, you also need to change the vowel string, to make sure that “vscale” works as wanted in the get_scorer function.

check : bool (default=False)

If set to True, the input file will first be checked for errors before the calculation is carried out. Errors will be written to the file errors, defaulting to errors.log. See also apply_checks

apply_checks : bool (default=False)

If set to True, any errors identified by check will be handled silently.

no_bscorer : bool (default=False)

If set to True, this will suppress the creation of a language-specific scoring function (which may become quite large and constitutes additional ballast if the “lexstat” method is not used after all). If you use the “lexstat” method, however, this needs to be set to False.

errors : str

The name of the error log.

Notes

This class automatically infers partial cognate sets from data which was previously morphologically segmented.

Attributes

pairs

dict

A dictionary with tuples of language names as key and indices as value, pointing to unique combinations of words with the same meaning in all language pairs.

model

Model

The sound class model instance which serves to convert the phonetic data into sound classes.

chars

list

A list of all unique language-specific character types in the instantiated LexStat object. The characters in this list consist of

  • the language identifier (numeric, referenced as “langid” as a default, but customizable via the keyword “langid”)

  • the sound class symbol for the respective IPA transcription value

  • the prosodic class value

All values are represented in the above order as one string, separated by a dot. Gaps are also included in this collection. They are traditionally represented as “X” for the sound class and “-” for the prosodic string.

rchars

list

A list containing all unique character types across languages. In contrast to the chars-attribute, the “rchars” (raw chars) do not contain the language identifier, thus they only consist of two values, separated by a dot, namely, the sound class symbol, and the prosodic class value.

scorer

dict

A collection of ScoreDict objects, which are used to score the strings. LexStat distinguishes two different scoring functions:

  • rscorer: A “raw” scorer that is not language-specific and consists only of sound class values and prosodic string values. This scorer is traditionally used to carry out the first alignment in order to calculate the language-specific scorer. It is directly accessible as an attribute of the LexStat class (rscorer). The characters which constitute the values in this scorer are accessible via the “rchars” attribute of each LexStat class.

  • bscorer: The language-specific scorer. This scorer is made of unique language-specific characters. These are accessible via the “chars” attribute of each LexStat class. Like the “rscorer”, the “bscorer” can also be accessed directly as an attribute of the LexStat class (bscorer).

add_cognate_ids(source, target, idtype='strict', override=False)

Compute normal cognate identifiers from partial cognate sets.

Parameters:

source : str

Name of the source column in your wordlist file.

target : str

Name of the target column in your wordlist file.

idtype : str (default=”strict”)

Select between “strict” and “loose”.

override : bool (default=False)

Specify whether you want to override existing columns.

Notes

While the computation of strict cognate IDs from partial cognate IDs is straightforward and just judges those words as cognate which are identical in all their parts, the computation of loose cognate IDs constructs a network between all words, draws lines between all words that share a common morpheme, and judges all connected components in this network as cognate.
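
A sketch of the typical workflow; the input file and the column names for the partial ('cogids') and derived ('cogid') identifiers are illustrative:

>>> part = Partial('inputfile.tsv')                     # hypothetical input file
>>> part.partial_cluster(method='sca', threshold=0.45, ref='cogids')
>>> part.add_cognate_ids('cogids', 'cogid', idtype='loose')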

get_partial_scorer(**keywords)

Create a scoring function based on sound correspondences.

Parameters:

method : str (default=’shuffle’)

Select between “markov”, for automatically generated random strings, and “shuffle”, for random strings taken directly from the data.

ratio : tuple (default=(3,2))

Define the ratio between derived and original score for sound-matches.

vscale : float (default=0.5)

Define a scaling factor for vowels, in order to decrease their score in the calculations.

runs : int (default=1000)

Choose the number of random runs that shall be made in order to derive the random distribution.

threshold : float (default=0.7)

The threshold which is used to select those words that are compared in order to derive the attested distribution.

modes : list (default = [(“global”,-2,0.5),(“local”,-1,0.5)])

The modes which are used in order to derive the distributions from pairwise alignments.

factor : float (default=0.3)

The scaling factor for sound segments with identical prosodic environment.

force : bool (default=False)

Force recalculation of existing distribution.

preprocessing : bool (default=False)

Select whether SCA-analysis shall be used to derive a preliminary set of cognates from which the attested distribution shall be derived.

rands : int (default=1000)

If “method” is set to “markov”, this parameter defines the number of strings to produce for the calculation of the random distribution.

limit : int (default=10000)

If “method” is set to “markov”, this parameter defines the limit above which no more search for unique strings will be carried out.

cluster_method : {“upgma”, “single”, “complete”} (default=”upgma”)

Select the method to be used for the calculation of cognates in the preprocessing phase, if “preprocessing” is set to True.

gop : int (default=-2)

If “preprocessing” is selected, define the gap opening penalty for the preprocessing calculation of cognates.

unattested : {int, float} (default=-5)

If a pair of sounds is not attested in the data, but expected by the alignment algorithm that computes the expected distribution, the score would be minus infinity. In order to smooth this behaviour and reduce the strictness, we set a default negative value which does not need to be particularly high, since we may well miss a potentially good pairing in the first runs of the alignment analyses. Use this keyword to adjust this parameter.

unexpected : {int, float} (default=0.000001)

If a pair is encountered in a given alignment but not expected according to the randomized alignments, the score would not be calculable, since we would have to divide by zero. For this reason, we set a very small constant by which the score is divided in this case. Note that this constant is only relevant in those cases where the shuffling procedure was not carried out long enough.

partial_cluster(method='sca', threshold=0.45, scale=0.5, factor=0.3, restricted_chars='_T', mode='overlap', cluster_method='infomap', gop=-1, restriction='', ref='', external_function=None, split_on_tones=False, **keywords)

Cluster the words into partial cognate sets.

Function for flat clustering of words into cognate sets.

Parameters:

method : {‘sca’,’lexstat’,’edit-dist’,’turchin’} (default=’sca’)

Select the method that shall be used for the calculation.

cluster_method : {‘upgma’,’single’,’complete’,’mcl’,’infomap’} (default=’infomap’)

Select the cluster method. ‘upgma’ (Sokal1958) refers to average linkage clustering, ‘mcl’ refers to the “Markov Clustering Algorithm” (Dongen2000).

threshold : float (default=0.45)

Select the threshold for the cluster approach. If set to False, an automatic threshold will be calculated by calculating the average distance of unrelated sequences (use with care).

scale : float (default=0.5)

Select the scale for the gap extension penalty.

factor : float (default=0.3)

Select the factor for extra scores for identical prosodic segments.

restricted_chars : str (default=”T_”)

Select the restricted chars (boundary markers) in the prosodic strings in order to enable secondary alignment.

mode : {‘global’,’local’,’overlap’,’dialign’} (default=’overlap’)

Select the mode for the alignment analysis.

verbose : bool (default=False)

Define whether verbose output should be used or not.

gop : int (default=-1)

If ‘sca’ is selected as a method, define the gap opening penalty.

restriction : {‘cv’} (default=””)

Specify the restriction for calculations using the edit-distance. Currently, only “cv” is supported. If edit-dist is selected as method and restriction is set to cv, consonant-vowel matches will be prohibited in the calculations and the edit distance will be normalized by the length of the alignment rather than the length of the longest sequence, as described in Heeringa2006.

inflation : {int, float} (default=2)

Specify the inflation parameter for the use of the MCL algorithm.

expansion : int (default=2)

Specify the expansion parameter for the use of the MCL algorithm.
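
When the ‘lexstat’ method is chosen, a partial scorer should be computed first; a minimal sketch (file name, threshold, and target column name are illustrative):

>>> part = Partial('inputfile.tsv')
>>> part.get_partial_scorer(runs=1000)
>>> part.partial_cluster(method='lexstat', threshold=0.6, ref='partialids')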

lingpy.compare.phylogeny module

Phylogeny-based detection of borrowings in lexicostatistical wordlists.

class lingpy.compare.phylogeny.PhyBo(dataset, tree=None, paps='pap', ref='cogid', tree_calc='neighbor', output_dir=None, **keywords)

Bases: Wordlist

Basic class for calculations using the TreBor method.

Parameters:

dataset : string

Name of the dataset that shall be analyzed.

tree : {None, string}

Name of the tree file.

paps : string (default=”pap”)

Name of the column that stores the specific cognate IDs consisting of an arbitrary integer key and a key for the concept.

ref : string (default=”cogid”)

Name of the column that stores the general cognate ids (the “reference” of the analysis).

tree_calc : {‘neighbor’,’upgma’} (default=’neighbor’)

Select the algorithm to be used for the tree calculation if no tree is passed with the file.

missing : int (default=-1)

Specify how missing data should be handled. If set to -1, missing data can account for both presence or absence of a cognate set in the given language. If set to 0, missing data is treated as absence.

degree : int (default=100)

The degree which is chosen for the projection of the tree layout.

analyze(runs='default', mixed=False, output_gml=False, tar=False, full_analysis=True, plot_dists=False, output_plot=False, plot_mln=False, plot_msn=False, **keywords)

Carry out a full analysis using various parameters.

Parameters:

runs : {str, list} (default=”default”)

Define a set of different models to be analyzed. Select between:

  • ‘default’: weighted analysis, using parsimony and weights for gains and losses

  • ‘topdown’: use the traditional approach by Nelson-Sathi2011

  • ‘restriction’: use the restriction approach

You can also define your own mix of models.

usetex : bool (default=True)

Specify whether you want to use LaTeX to render plots.

mixed : bool (default=False)

If set to True, calculate a mixed model by selecting the best model for each item separately.

output_gml : bool (default=False)

Set to True in order to output every gain-loss-scenario in GML-format.

full_analysis : bool (default=True)

Specifies whether a full analysis is carried out or not.

plot_mln : bool (default=False)

Select or unselect output plot for the MLN.

plot_msn : bool (default=False)

Select or unselect output plot for the MSN.
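
A minimal sketch of running the full analysis; the dataset name and the reference column are illustrative, and a reference tree is computed with ‘neighbor’ if none is supplied:

>>> from lingpy.compare.phylogeny import PhyBo
>>> phy = PhyBo('my-dataset.tsv', ref='cogid')     # hypothetical dataset
>>> phy.analyze(runs='default', mixed=False, plot_mln=False)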

get_ACS(glm, **keywords)

Compute the ancestral character states (ACS) for all internal nodes.

get_AVSD(glm, **keywords)

Function retrieves all paps for ancestor languages in a given tree.

get_CVSD()

Calculate the Contemporary Vocabulary Size Distribution (CVSD).

get_GLS(mode='weighted', ratio=(1, 1), restriction=3, output_gml=False, output_plot=False, tar=False, **keywords)

Create gain-loss-scenarios for all non-singleton paps in the data.

Parameters:

mode : string (default=”weighted”)

Select between “weighted”, “restriction” and “topdown”. The three modes refer to the following frameworks:

  • “weighted” refers to the weighted parsimony framework described in List2014b and List2014a. Weights are specified with help of a ratio for the scoring of gain and loss events. The ratio can be defined with help of the ratio keyword.

  • “restriction” refers to a simple method in which only a specific amount of gain events is allowed. The maximally allowed number of gain events can be defined with help of the restriction keyword.

  • “topdown” refers to the top-down method outlined in Dagan2007 and first applied to linguistic data in Nelson-Sathi2011. This method also defines a maximal number of gain events, but in contrast to the “restriction” approach, it starts from the top of the tree and stops if the maximal number of restrictions has been reached. The maximally allowed number of gain events can, again, be specified with help of the restriction keyword.

ratio : tuple (default=(1,1))

If “weighted” mode is selected, define the ratio between the weights for gains and losses.

restriction : int (default=3)

If “restriction” is selected as mode, define the maximal number of gains.

output_gml : bool (default=False)

If set to True, the decisions for each GLS are stored in a separate file in GML-format.

tar : bool (default=False)

If set to True, the GML-files will be added to a compressed tar-file.

gpl : int (default=1)

Specifies the maximal number of gains per lineage. This parameter specifies how cases should be handled in which a character is first gained, then lost, and then gained again. By setting this parameter to 1 (the default setting), such cases are prohibited, since only one gain per lineage is allowed.

missing_data : int (default=0)

Currently, we offer two ways to handle missing data. The first case just treats missing data in the same way in which the absence of a character is handled and can be evoked by setting this parameter to 0. The second case will treat missing data as either absent or present characters, based on how well each option coincides with the overall evolutionary scenario. This behaviour can be evoked by setting this parameter to -1.

push_gains : bool (default=True)

In bottom-up calculations, there will often be multiple equally parsimonious scenarios, of which only one is selected by the method. In order to define consistent criteria for scenario selection, we follow Mirkin2003 in allowing the algorithm to be forced to prefer those scenarios in which gains are pushed towards the leaves. This behaviour is handled by this parameter. Setting it to True will force the algorithm to push gain events towards the leaves of the tree. Setting it to False will force it to prefer those scenarios where the gains are closer to the root.
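
A short usage sketch based on the parameter descriptions above; the dataset name is hypothetical:

>>> phy = PhyBo('my-dataset.tsv', ref='cogid')     # hypothetical dataset
>>> phy.get_GLS(mode='weighted', ratio=(2, 1), output_gml=False)
>>> phy.get_GLS(mode='restriction', restriction=2)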

get_IVSD(output_gml=False, output_plot=False, tar=True, leading_model=False, mixed_threshold=0.0, evaluation='mwu', **keywords)

Calculate VSD on the basis of each item.

get_MLN(glm, threshold=1, method='mr')

Compute a Minimal Lateral Network for a given model.

Parameters:

glm : str

The dictionary key for the gain-loss-model.

threshold : int (default=1)

The threshold used to exclude edges.

method : str (default=’mr’)

Select the method for MLN calculation. Choose between:

  • “mr”: majority-rule, multiple links are resolved by selecting those which occur most frequently

  • “td”: tree-distance, multiple links are resolved by selecting those which are closest on the tree

  • “bc”: betweenness-centrality, multiple links are resolved by selecting those which have the highest betweenness centrality

get_MSN(glm='', external_edges=False, deep_nodes=False, **keywords)

Plot the Minimal Spatial Network.

Parameters:

glm : str (default=’’)

A string that encodes which model should be plotted.

filename : str

The name of the file to which the plot shall be written.

fileformat : str

The output format of the plot.

threshold : int (default=1)

The threshold for the minimal amount of shared links that shall be plotted.

usetex : bool (default=True)

Specify whether LaTeX shall be used for the plot.

get_PDC(glm, **keywords)

Calculate Patchily Distributed Cognates.

get_edge(glm, nodeA, nodeB, entries='', msn=False)

Return the edge data for a given gain-loss model.

get_stats(glm, subset='', filename='')

Calculate basic statistics for a given gain-loss model.

plot_ACS(glm, **keywords)

Plot a tree in which the node size correlates with the size of the ancestral node.

plot_GLS(glm, **keywords)

Plot the inferred scenarios for a given model.

plot_MLN(glm='', fileformat='pdf', threshold=1, usetex=False, taxon_labels='taxon_short_labels', alphat=False, alpha=0.75, **keywords)

Plot the MLN with help of Matplotlib.

Parameters:

glm : str (default=’’)

Identifier for the gain-loss model that is plotted. Defaults to the model that had the best scores in terms of probability.

filename : str (default=’’)

If no filename is selected, the filename is identical with the dataset.

fileformat : {‘svg’,’png’,’jpg’,’pdf’} (default=’pdf’)

Select the format of the output plot.

threshold : int (default=1)

Select the threshold for drawing lateral edges.

usetex : bool (default=False)

Specify whether you want to use LaTeX to render plots.

colormap : {None, matplotlib.cm}

A matplotlib colormap instance. If set to None, this defaults to jet.

taxon_labels : str (default=’taxon_short_labels’)

Specify the taxon labels that should be included in the plot.

plot_MLN_3d(glm='', filename='', fileformat='pdf', threshold=1, usetex=True, colormap=None, taxon_labels='taxon_short_labels', alphat=False, alpha=0.75, **keywords)

Plot the MLN with help of Matplotlib in 3d.

Parameters:

glm : str (default=’’)

Identifier for the gain-loss model that is plotted. Defaults to the model that had the best scores in terms of probability.

filename : str (default=’’)

If no filename is selected, the filename is identical with the dataset.

fileformat : {‘svg’,’png’,’jpg’,’pdf’} (default=’pdf’)

Select the format of the output plot.

threshold : int (default=1)

Select the threshold for drawing lateral edges.

usetex : bool (default=True)

Specify whether you want to use LaTeX to render plots.

colormap : {None, matplotlib.cm}

A matplotlib colormap instance. If set to None, this defaults to jet.

taxon_labels : str (default=’taxon_short_labels’)

Specify the taxon labels that should be included in the plot.

plot_MSN(glm='', fileformat='pdf', threshold=1, usetex=False, alphat=False, alpha=0.75, only=[], **keywords)

Plot a minimal spatial network.

plot_concept_evolution(glm, concept='', fileformat='png', **keywords)

Plot the evolution of specific concepts along the reference tree.

plot_two_concepts(concept, cogA, cogB, labels={1: '1', 2: '2', 3: '3', 4: '4'}, tcolor={1: 'white', 2: 'black', 3: '0.5', 4: '0.1'}, filename='pdf', fileformat='pdf', usetex=True)

Plot the evolution of two concepts in space.

Notes

This function may be useful to contrast patterns of different words in geographic space.

lingpy.compare.phylogeny.TreBor

alias of PhyBo

lingpy.compare.phylogeny.get_gls(paps, taxa, tree, gpl=1, weights=(1, 1), push_gains=True, missing_data=0)

Calculate a gain-loss scenario.

Parameters:

paps : list

A list containing the presence-absence patterns for all leaves of the reference tree. Presence is indicated by 1, and absence by 0. Missing characters are indicated by -1.

taxa : list

The list of taxa (leaves of the tree).

tree : str

A tree in Newick-format. Taxon names should (of course) be identical with the names in the list of taxa.

gpl : int

Gains per lineage. Specify the maximal amount of gains per lineage. One lineage is hereby defined as one path in the tree. If set to 0, only one gain per lineage is allowed, if set to 1, one additional gain is allowed, and so on. Use with care, since this will lead to larger computation costs (more possibilities have to be taken care of) and can also be quite unrealistic.

weights : tuple (default=(1,1))

Specify the weights for gains and losses. Setting this parameter to (2,1) will penalize gain events with 2 and loss events with 1.

push_gains : bool (default=True)

Determine whether, from a set of equally parsimonious patterns, those should be retained that show gains closer to the leaves of the tree.

missing_data : int (default=0)

Determine how missing data should be represented. If set to 0 (default), missing data will be treated in the same way as absence character states. If you want missing data to be accounted for in the algorithm, set this parameter to -1.

Notes

This is an enhanced version of the older approach to parsimony-based gain-loss mapping. The algorithm is much faster than the previous one, and the code is also written much more clearly. In most tests run so far, it has also outperformed other approaches by finding more parsimonious solutions.
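
A minimal sketch based on the parameter descriptions above; the toy tree, taxa, and presence-absence pattern are purely illustrative:

>>> from lingpy.compare.phylogeny import get_gls
>>> taxa = ['A', 'B', 'C', 'D']
>>> tree = '((A,B),(C,D));'                  # Newick string over the same taxa
>>> paps = [1, 1, 0, -1]                     # presence/absence per leaf, -1 = missing
>>> scenario = get_gls(paps, taxa, tree, gpl=1, weights=(1, 1), push_gains=True)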

lingpy.compare.sanity module

Module provides basic checks for wordlists.

lingpy.compare.sanity.average_coverage(wordlist, concepts='concepts')

Compute average mutual coverage for a given wordlist.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

Your Wordlist object (or a descendant class).

concepts : str (default=”concepts”)

The column which stores your concepts.

Returns:

coverage : float

The average mutual coverage across all language pairs in the data.

Examples

Compute coverage for the KSL.qlc dataset:

>>> from lingpy.compare.sanity import average_coverage
>>> from lingpy import *
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> average_coverage(wl)
1.0
lingpy.compare.sanity.check_cognates(wordlist, ref='crossids')

Function checks for internal consistency of partial cognates.

lingpy.compare.sanity.check_length(a, b, dimA=1, dimB=1)

Custom function to check the length of two basictypes in LingPy.

lingpy.compare.sanity.check_sequence_length(wordlist, entities=['tokens', 'crossids', 'morphemes', 'structure'], dimensions=[2, 1, 2, 1])

Function checks for identical sequence length in different columns.

lingpy.compare.sanity.check_strict_cognates(wordlist, ref='crossids', segments='tokens')

Check if cognates are really strict.

lingpy.compare.sanity.mutual_coverage(wordlist, concepts='concept')

Compute mutual coverage for all language pairs in your data.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

Your Wordlist object (or a descendant class).

concepts : str (default=”concept”)

The column which stores your concepts.

Returns:

coverage : dict

A dictionary of dictionaries whose value is the number of items two languages share.

Examples

Compute coverage for the KSL.qlc dataset:

>>> from lingpy.compare.sanity import mutual_coverage
>>> from lingpy import *
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> cov = mutual_coverage(wl)
>>> cov['English']['German']
200
lingpy.compare.sanity.mutual_coverage_check(wordlist, threshold, concepts='concept')

Check whether a given mutual coverage is fulfilled by the dataset.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

Your Wordlist object (or a descendant class).

concepts : str (default=”concept”)

The column which stores your concepts.

threshold : int

The threshold which should be checked.

Returns:

c : bool

True, if coverage is fulfilled for all language pairs, False otherwise.

Examples

Compute minimal mutual coverage for the KSL dataset:

>>> from lingpy.compare.sanity import mutual_coverage_check
>>> from lingpy import *
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> for i in range(wl.height, 1, -1):
...     if mutual_coverage_check(wl, i):
...         print('mutual coverage is {0}'.format(i))
...         break
mutual coverage is 200
lingpy.compare.sanity.mutual_coverage_subset(wordlist, threshold, concepts='concept')

Compute maximal mutual coverage for all languages in a wordlist.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

Your Wordlist object (or a descendant class).

concepts : str (default=”concept”)

The column which stores your concepts.

threshold : int

The threshold which should be checked.

Returns:

coverage : tuple

A tuple consisting of the number of languages for which the coverage could be found as well as a list of all pairings in which this coverage is possible. The list itself contains the mutual coverage inside each pair and the list of languages.

Examples

Compute all sets of languages with coverage at 200 for the KSL dataset:

>>> from lingpy.compare.sanity import mutual_coverage_subset
>>> from lingpy import *
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> number_of_languages, pairs = mutual_coverage_subset(wl, 200)
>>> for number_of_items, languages in pairs:
...     print(number_of_items, ','.join(languages))
200 Albanian,English,French,German,Hawaiian,Navajo,Turkish
lingpy.compare.sanity.synonymy(wordlist, concepts='concept', languages='doculect')

Check the number of synonyms per language and concept.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

Your Wordlist object (or a descendant class).

concepts : str (default=”concept”)

The column which stores your concepts.

languages : str (default=”doculect”)

The column which stores your language names.

Returns:

synonyms : dict

A dictionary with language and concept as key and the number of synonyms as value.

Examples

Calculate synonymy in KSL.qlc dataset:

>>> from lingpy.compare.sanity import synonymy      
>>> from lingpy import *
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> syns = synonymy(wl)
>>> for a, b in syns.items():
...     if b > 1:
...         print(a[0], a[1], b)

There is no case where synonymy exceeds 1 word per concept per language, since Kessler2001 paid particular attention to avoiding synonyms.

lingpy.compare.strings module

Module provides various string similarity metrics.

lingpy.compare.strings.bidist1(a, b, normalized=True)

Computes bigram-based distance.

Notes

The binary version. Checks if two bigrams are equal or not.

lingpy.compare.strings.bidist2(a, b, normalized=True)

Computes bigram based distance.

Notes

The comprehensive version of the bigram distance.

lingpy.compare.strings.bidist3(a, b, normalized=True)

Computes bigram based distance.

Notes

Computes the positional version of the bigrams. Assigns a partial distance between two bigrams based on positional similarity of bigrams.

lingpy.compare.strings.bisim1(a, b, normalized=True)

Computes the binary version of bigram similarity.

lingpy.compare.strings.bisim2(a, b, normalized=True)

Computes the comprehensive version of bigram similarity.

Notes

Computes the number of common 1-grams between two n-grams.

lingpy.compare.strings.bisim3(a, b, normalized=True)

Computes the positional version of bi-sim.

Notes

The partial similarity between two bigrams is defined as the number of matching 1-grams at each position.

lingpy.compare.strings.dice(a, b, normalized=True)

Computes the Dice measure that measures the number of common bigrams.

lingpy.compare.strings.ident(a, b)

Computes the identity between two strings: returns 1 if the strings are identical, and 0 otherwise.

lingpy.compare.strings.jcd(a, b, normalized=True)

Computes the bigram-based Jaccard Index.

lingpy.compare.strings.jcdn(a, b, normalized=True)

Computes the bigram- and trigram-based Jaccard Index.

lingpy.compare.strings.lcs(a, b, normalized=True)

Computes the longest common subsequence between two strings.

lingpy.compare.strings.ldn(a, b, normalized=True)

Basic Levenshtein distance without swap operation (all operations are equal costs).

lingpy.compare.strings.ldn_swap(a, b, normalized=True)

Basic Levenshtein distance with swap operation included (identifies metathesis).
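
A short illustration of the edit-distance measures; the example strings are arbitrary, and return values are not shown since the exact normalization depends on the keywords used:

>>> from lingpy.compare.strings import ldn, ldn_swap, dice
>>> d1 = ldn('hand', 'hant')          # normalized Levenshtein distance
>>> d2 = ldn_swap('hand', 'hnad')     # swap-aware variant, sensitive to metathesis
>>> d3 = dice('hand', 'hant')         # bigram-based Dice measure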

lingpy.compare.strings.prefix(a, b, normalized=True)

Computes the longest common prefix between two strings.

lingpy.compare.strings.tridist1(a, b, normalized=True)

Computes trigram-based distance.

Notes

The binary version. Checks if two trigrams are equal or not.

lingpy.compare.strings.tridist2(a, b, normalized=True)

Computes trigram-based distance.

Notes

The comprehensive version of the trigram distance.

lingpy.compare.strings.tridist3(a, b, normalized=True)

Computes trigram based distance.

Notes

Computes the positional version of the trigrams. Assigns a partial distance between two trigrams based on positional similarity of trigrams.

lingpy.compare.strings.trigram(a, b, normalized=True)

Computes the number of common trigrams between two strings.

lingpy.compare.strings.trisim1(a, b, normalized=True)

Computes the binary version of trigram similarity.

lingpy.compare.strings.trisim2(a, b, normalized=True)

Computes the comprehensive version of tri-sim.

Notes

Simply computes the number of common 1-grams between two n-grams instead of calling LCS, as is done in the Kondrak2005 paper. Note that the LCS for a trigram can be computed in O(n) time if we assume that list lookup is in constant time.

lingpy.compare.strings.trisim3(a, b, normalized=True)

Computes the positional version of tri-sim.

Notes

Simply computes the number of matching 1-grams in each position.

lingpy.compare.strings.xdice(a, b, normalized=True)

Computes the skip 1 character version of Dice.

lingpy.compare.strings.xxdice(a, b, normalized=True)

Returns the XXDice between two strings.

Notes

Taken from Brew1996.

lingpy.compare.util module

Module contents

Basic module for language comparison.