Cognate Detection (LexStat)

class lingpy.compare.lexstat.LexStat(filename, **keywords)

Basic class for automatic cognate detection.

Parameters:

filename : str

The name of the file that shall be loaded.

model : Model

The sound-class model that shall be used for the analysis. Defaults to the SCA sound-class model.

merge_vowels : bool (default=True)

Indicate whether consecutive vowels should be merged into single tokens or kept apart as separate tokens.

transform : dict

A dictionary that indicates how prosodic strings should be simplified (or generally transformed), using a simple key-value structure with the key referring to the original prosodic context and the value to the new value. Currently, prosodic strings (see prosodic_string()) offer 11 different prosodic contexts. Since not all these are helpful in preliminary analyses for cognate detection, it is useful to merge some of these contexts into one. The default settings distinguish only 5 instead of 11 available contexts, namely:

  • C for all consonants in prosodically ascending position,
  • c for all consonants in prosodically descending position,
  • V for all vowels,
  • T for all tones, and
  • _ for word-breaks.

Also check the “vowels” keyword when initialising a LexStat object, since the symbols you use for vowels and tones should be identical to the ones you define in your transform dictionary.
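As a rough illustration of the key-value structure described above, the following sketch collapses an eleven-symbol prosodic string into the five default contexts. The eleven source symbols (A, B, C, L, M, N, X, Y, Z, T, _) and the helper name simplify_prosody are assumptions made for this example; consult prosodic_string() for the symbols your LingPy version actually emits.

```python
# Hypothetical mapping from 11 prosodic contexts to the 5 default ones.
# The source symbols on the left are assumed for illustration only.
TRANSFORM = {
    "A": "C", "B": "C", "C": "C",   # consonants, ascending position
    "L": "c", "M": "c", "N": "c",   # consonants, descending position
    "X": "V", "Y": "V", "Z": "V",   # vowels
    "T": "T",                       # tones
    "_": "_",                       # word breaks
}


def simplify_prosody(prostring, transform=TRANSFORM):
    """Replace every prosodic symbol by its simplified counterpart."""
    return "".join(transform[symbol] for symbol in prostring)


print(simplify_prosody("AXMBYN"))  # CVcCVc
```

A custom dictionary of the same shape could be passed via the “transform” keyword; any symbol it introduces for vowels or tones then also has to appear in “vowels”.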

vowels : str (default="VT_")

When creating a scoring function with get_scorer, you can reduce the scores for matches of tones and vowels through the “vscale” parameter, which defaults to 0.5. For vowels and tones to be detected properly, their prosodic string representation must match the symbols given in this keyword. Thus, if you change the prosodic strings using the “transform” keyword, you also need to change the vowel string so that “vscale” works as intended in get_scorer.
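The dependency between “transform” and “vowels” can be sketched as follows. This is not LingPy's implementation of get_scorer; the function scaled_score and its arguments are hypothetical and only illustrate why the two keywords have to agree.

```python
def scaled_score(score, pros_a, pros_b, vowels="VT_", vscale=0.5):
    """Reduce a match score when both segments are vowels or tones.

    pros_a and pros_b are the (transformed) prosodic symbols of the two
    segments being compared. Hypothetical helper, for illustration only.
    """
    if pros_a in vowels and pros_b in vowels:
        return score * vscale
    return score


# With the default symbols, a vowel-vowel match is scaled down.
print(scaled_score(10.0, "V", "V"))                 # 5.0
# A consonant-vowel pair keeps its score.
print(scaled_score(10.0, "C", "V"))                 # 10.0
# If "transform" renames vowels to "v" but "vowels" is left unchanged,
# the vowel check no longer fires and the score is not scaled.
print(scaled_score(10.0, "v", "v"))                 # 10.0
print(scaled_score(10.0, "v", "v", vowels="vT_"))   # 5.0
```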

check : bool (default=False)

If set to True, the input file is checked for errors before the calculation is carried out. Errors are written to the file named by the errors keyword, which defaults to errors.log. See also apply_checks.

apply_checks : bool (default=False)

If set to True, any errors identified by check will be handled silently.

no_bscorer : bool (default=False)

If set to True, this suppresses the creation of a language-specific scoring function (which may become quite large and is additional ballast if the method “lexstat” is not used after all). If you use the “lexstat” method, however, this needs to be set to False.

errors : str

The name of the error log.

Notes

Instantiating this class does not require a lot of parameters. However, the user may modify its behaviour by providing additional attributes in the input file.

Attributes

pairs : dict

A dictionary with tuples of language names as keys and indices as values, pointing to unique combinations of words with the same meaning in all language pairs.

model : Model

The sound class model instance which serves to convert the phonetic data into sound classes.

chars : list

A list of all unique language-specific character types in the instantiated LexStat object. The characters in this list consist of

  • the language identifier (numeric, referenced as “langid” as a default, but customizable via the keyword “langid”)
  • the sound class symbol for the respective IPA transcription value
  • the prosodic class value

All values are represented in the above order as one string, separated by a dot. Gaps are also included in this collection. They are traditionally represented as “X” for the sound class and “-” for the prosodic string.

rchars : list

A list containing all unique character types across languages. In contrast to the chars attribute, the “rchars” (raw chars) do not contain the language identifier; they consist of only two values separated by a dot, namely the sound class symbol and the prosodic class value.
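The relation between “chars” and “rchars” described above can be sketched in plain Python. The helper names make_char and to_rchar, the language identifiers, and the sound class symbols used here are hypothetical; only the dot-separated layout and the gap symbols “X” and “-” are taken from the description.

```python
def make_char(langid, sound_class, prosody):
    """Join the three components into one dot-separated string."""
    return "{0}.{1}.{2}".format(langid, sound_class, prosody)


def to_rchar(char):
    """Drop the language identifier, keeping sound class and prosody."""
    return char.split(".", 1)[1]


chars = [
    make_char(1, "K", "C"),
    make_char(1, "A", "V"),
    make_char(2, "K", "C"),
    make_char(1, "X", "-"),  # gap: "X" as sound class, "-" as prosody
]
# Raw characters are shared across languages, so duplicates collapse.
rchars = sorted(set(to_rchar(c) for c in chars))

print(chars)   # ['1.K.C', '1.A.V', '2.K.C', '1.X.-']
print(rchars)  # ['A.V', 'K.C', 'X.-']
```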
scorer : dict

A collection of ScoreDict objects, which are used to score the strings. LexStat distinguishes two different scoring functions:

  • rscorer: A “raw” scorer that is not language-specific and consists only of sound class values and prosodic string values. This scorer is traditionally used to carry out the first alignment in order to calculate the language-specific scorer. It is directly accessible as an attribute of the LexStat class (rscorer). The characters which constitute the values in this scorer are accessible via the “rchars” attribute of each LexStat class.
  • bscorer: The language-specific scorer. This scorer is made of unique language-specific characters. These are accessible via the “chars” attribute of each LexStat class. Like the “rscorer”, the “bscorer” can also be accessed directly as an attribute of the LexStat class (bscorer).

Methods

align_pairs(idxA, idxB[, concept]) Align all or some words of a given pair of languages.
cluster([method, cluster_method, threshold, ...]) Function for flat clustering of words into cognate sets.
get_distances([method, mode, gop, scale, ...]) Method calculates different distance estimates for language pairs.
get_random_distances([method, runs, mode, ...]) Method calculates random scores for unrelated words in a dataset.
get_scorer(**keywords) Create a scoring function based on sound correspondences.
output(fileformat, **keywords) Write data to file.
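A minimal sketch of how the methods listed above might be combined, assuming LingPy is installed and the input file is a valid wordlist. The wrapper name detect_cognates and the threshold value of 0.6 are assumptions made for this example, not recommendations taken from this reference.

```python
def detect_cognates(path, threshold=0.6):
    """Load a wordlist, compute the scorer, and cluster cognates.

    Requires LingPy; the import is deferred so that this sketch can be
    loaded without the library being installed.
    """
    from lingpy.compare.lexstat import LexStat

    lex = LexStat(path)
    # Build the language-specific scoring function (bscorer).
    lex.get_scorer()
    # Partition the words into cognate sets with the "lexstat" method.
    lex.cluster(method="lexstat", threshold=threshold)
    return lex
```

Because the “lexstat” method relies on the language-specific scorer, no_bscorer is left at its default of False here.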

Inherited WordList Methods

pickle([filename]) Store the QLCParser instance in a pickle file.
get_entries(entry) Return all entries matching the given entry-type as a two-dimensional list.
add_entries(entry, source, function[, override]) Add new entry-types to the word list by modifying given ones.
calculate(data[, taxa, concepts, ref]) Function calculates specific data.
export(fileformat[, sections, entries, ...]) Export the wordlist to specific fileformats.
get_dict([col, row, entry]) Function returns dictionaries of the cells matched by the indices.
get_etymdict([ref, entry, modify_ref]) Return an etymological dictionary representation of the word list.
get_list([row, col, entry, flat]) Function returns lists of rows and columns specified by their name.
get_paps([ref, entry, missing, modify_ref]) Function returns a list of present-absent-patterns of a given word list.
output(fileformat, **keywords) Write wordlist to file.
renumber(source[, target, override]) Renumber a given set of string identifiers by replacing the ids by integers.