Partial Cognate Detection (Partial)

class lingpy.compare.partial.Partial(infile, **keywords)

Extended class for automatic detection of partial cognates.

Parameters:

filename : str

The name of the file that shall be loaded.

model : Model

The sound-class model that shall be used for the analysis. Defaults to the SCA sound-class model.

merge_vowels : bool (default=True)

Indicate whether consecutive vowels should be merged into single tokens or kept apart as separate tokens.

transform : dict

A dictionary that indicates how prosodic strings should be simplified (or generally transformed), using a simple key-value structure with the key referring to the original prosodic context and the value to the new value. Currently, prosodic strings (see prosodic_string()) offer 11 different prosodic contexts. Since not all these are helpful in preliminary analyses for cognate detection, it is useful to merge some of these contexts into one. The default settings distinguish only 5 instead of 11 available contexts, namely:

  • C for all consonants in prosodically ascending position,

  • c for all consonants in prosodically descending position,

  • V for all vowels,

  • T for all tones, and

  • _ for word-breaks.
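The reduction from 11 contexts to 5 can be sketched as a plain dictionary lookup. The mapping below is an assumption about what the default transform looks like (three ascending, three descending and three vowel contexts plus tone and word-break), and the helper name simplify_prosody and the sample string are hypothetical. Check the values against your installed version before relying on them.

```python
# Assumed default mapping: 11 prosodic contexts reduced to 5 symbols.
DEFAULT_TRANSFORM = {
    "A": "C", "B": "C", "C": "C",   # consonants, ascending position
    "L": "c", "M": "c", "N": "c",   # consonants, descending position
    "X": "V", "Y": "V", "Z": "V",   # vowels
    "T": "T",                       # tones
    "_": "_",                       # word-breaks
}


def simplify_prosody(prostring, transform=DEFAULT_TRANSFORM):
    """Replace every prosodic context by its simplified symbol.

    Symbols that are missing from the dictionary are kept unchanged.
    """
    return "".join(transform.get(symbol, symbol) for symbol in prostring)


# A made-up prosodic string, used only for illustration.
simplified = simplify_prosody("AXMBYN")
print(simplified)
```

A custom dictionary of the same shape can be passed as the transform keyword. If it changes the symbols used for vowels and tones, the vowels keyword has to be changed accordingly.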

Make sure to also check the “vowels” keyword when initialising a LexStat object, since the symbols you use for vowels and tones should be identical to the ones you define in your transform dictionary.

vowels : str (default=“VT_”)

When creating a scoring function with get_scorer, you can use reduced scores for matches of tones and vowels by modifying the “vscale” parameter, which defaults to 0.5. For vowels and tones to be detected properly, the prosodic string representation of vowels must match the one given in this keyword. Thus, if you change the prosodic strings using the “transform” keyword, you also need to change the vowel string so that “vscale” works as intended in get_scorer.

check : bool (default=False)

If set to True, the input file is checked for errors before the calculation is carried out. Errors are written to the file named by the errors keyword, which defaults to errors.log. See also apply_checks.

apply_checks : bool (default=False)

If set to True, any errors identified by check will be handled silently.

no_bscorer : bool (default=False)

If set to True, this suppresses the creation of a language-specific scoring function, which may become quite large and is unnecessary ballast if the “lexstat” method is not used. If you do use the “lexstat” method, however, this needs to be set to False.

errors : str

The name of the error log.

Notes

This method automatically infers partial cognate sets from data that has previously been morphologically segmented.
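A typical run might be wrapped as in the following sketch, which chains the three Partial methods listed under Methods. The function name detect_partial_cognates, the threshold and runs values, the choice of "upgma" as cluster method, and the column names "cogids" and "cogid" are assumptions made for illustration. The keyword names follow the method signatures as far as they are known and should be checked against your installed version.

```python
def detect_partial_cognates(path, threshold=0.55, runs=1000):
    """Hypothetical wrapper around a partial cognate detection run.

    `path` is assumed to point to a wordlist file whose segmented
    forms already contain morpheme boundaries.
    """
    # Imported lazily so that this sketch can be loaded without lingpy.
    from lingpy.compare.partial import Partial

    part = Partial(path)

    # Language-specific scorer, only needed for the "lexstat" method.
    part.get_partial_scorer(runs=runs)

    # Cluster morphemes into partial cognate sets, stored in "cogids".
    part.partial_cluster(
        method="lexstat",
        threshold=threshold,
        cluster_method="upgma",
        ref="cogids",
    )

    # Derive one cognate identifier per word from the partial sets.
    part.add_cognate_ids("cogids", "cogid", idtype="strict")
    return part
```

If the “lexstat” method is not needed, the scorer step can be skipped and a different method passed to partial_cluster.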

Attributes

pairs

dict

A dictionary with tuples of language names as key and indices as value, pointing to unique combinations of words with the same meaning in all language pairs.
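The shape of such a dictionary can be illustrated with a small, self-contained sketch. It is not the library's own construction: the helper build_pairs and the toy data are hypothetical, and the real attribute may differ in detail (for instance in how a language is paired with itself).

```python
from collections import defaultdict
from itertools import combinations

# Toy data: word index -> (language, concept). Purely illustrative.
WORDS = {
    1: ("German", "hand"),
    2: ("English", "hand"),
    3: ("German", "foot"),
    4: ("English", "foot"),
    5: ("Dutch", "hand"),
}


def build_pairs(words):
    """Map (language, language) tuples to lists of word-index pairs
    whose members express the same concept."""
    by_concept = defaultdict(list)
    for idx, (language, concept) in sorted(words.items()):
        by_concept[concept].append((language, idx))

    pairs = defaultdict(list)
    for entries in by_concept.values():
        for (lang_a, idx_a), (lang_b, idx_b) in combinations(entries, 2):
            if lang_a != lang_b:
                pairs[(lang_a, lang_b)].append((idx_a, idx_b))
    return dict(pairs)


pairs = build_pairs(WORDS)
print(pairs[("German", "English")])
```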

model

Model

The sound class model instance which serves to convert the phonetic data into sound classes.

chars

list

A list of all unique language-specific character types in the instantiated LexStat object. The characters in this list consist of

  • the language identifier (numeric, referenced as “langid” as a default, but customizable via the keyword “langid”)

  • the sound class symbol for the respective IPA transcription value

  • the prosodic class value

All values are represented in the above order as one string, separated by a dot. Gaps are also included in this collection. They are traditionally represented as “X” for the sound class and “-” for the prosodic string.

rchars

list

A list containing all unique character types across languages. In contrast to the chars-attribute, the “rchars” (raw chars) do not contain the language identifier, thus they only consist of two values, separated by a dot, namely, the sound class symbol, and the prosodic class value.
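The relation between the two representations can be sketched with plain string handling. The helper names make_char and to_rchar are hypothetical and only follow the dot-separated layout described above.

```python
def make_char(langid, sound_class, prosody):
    """Join language identifier, sound class and prosodic class with dots."""
    return "{0}.{1}.{2}".format(langid, sound_class, prosody)


def to_rchar(char):
    """Drop the language identifier from a language-specific character."""
    _langid, sound_class, prosody = char.split(".", 2)
    return "{0}.{1}".format(sound_class, prosody)


char = make_char(1, "K", "C")   # language 1, sound class K, context C
gap = make_char(1, "X", "-")    # a gap, using the traditional symbols

print(char, to_rchar(char))
print(gap, to_rchar(gap))
```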

scorer

dict

A collection of ScoreDict objects, which are used to score the strings. LexStat distinguishes two different scoring functions:

  • rscorer: A “raw” scorer that is not language-specific and consists only of sound class values and prosodic string values. This scorer is traditionally used to carry out the first alignment in order to calculate the language-specific scorer. It is directly accessible as an attribute of the LexStat class (rscorer). The characters which constitute the values in this scorer are accessible via the “rchars” attribute of each LexStat class.

  • bscorer: The language-specific scorer. This scorer is made of unique language-specific characters. These are accessible via the “chars” attribute of each LexStat class. Like the “rscorer”, the “bscorer” can also be accessed directly as an attribute of the LexStat class (bscorer).

Methods

get_partial_scorer(**keywords)

Create a scoring function based on sound correspondences.

partial_cluster([method, threshold, scale, ...])

Cluster the words into partial cognate sets.

add_cognate_ids(source, target[, idtype, ...])

Compute normal cognate identifiers from partial cognate sets.
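One way to read this conversion is that two words receive the same cognate identifier only if all of their partial cognate identifiers agree. The sketch below implements that reading on plain lists. It is an assumption about the idea, not the library's implementation, and the helper name strict_cognate_ids is hypothetical.

```python
def strict_cognate_ids(partial_ids):
    """Assign one integer per distinct sequence of partial cognate ids.

    `partial_ids` holds one list of partial cognate ids per word.
    Words with identical sequences receive the same identifier,
    numbered from 1 in order of first appearance.
    """
    seen = {}
    cognate_ids = []
    for ids in partial_ids:
        key = tuple(ids)
        if key not in seen:
            seen[key] = len(seen) + 1
        cognate_ids.append(seen[key])
    return cognate_ids


# Four words: the first two share both morphemes, the third only one.
print(strict_cognate_ids([[1, 2], [1, 2], [1, 3], [4]]))
```

Under this reading, words that share only some of their morphemes end up in different cognate sets.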

Inherited LexStat Methods

align_pairs(idxA, idxB[, concept])

Align all or some words of a given pair of languages.

cluster([method, cluster_method, threshold, ...])

Function for flat clustering of words into cognate sets.

get_distances([method, mode, gop, scale, ...])

Method calculates different distance estimates for language pairs.

get_random_distances([method, runs, mode, ...])

Method calculates random scores for unrelated words in a dataset.

get_scorer(**keywords)

Create a scoring function based on sound correspondences.

output(fileformat, **keywords)

Write data to file.

Inherited WordList Methods

get_entries(entry)

Return all entries matching the given entry-type as a two-dimensional list.

add_entries(entry, source, function[, override])

Add new entry-types to the word list by modifying given ones.

calculate(data[, taxa, concepts, ref])

Function calculates specific data.

export(fileformat[, sections, entries, ...])

Export the wordlist to specific fileformats.

get_dict([col, row, entry])

Function returns dictionaries of the cells matched by the indices.

get_etymdict([ref, entry, modify_ref])

Return an etymological dictionary representation of the word list.

get_list([row, col, entry, flat])

Function returns lists of rows and columns specified by their name.

get_paps([ref, entry, missing, modify_ref])

Function returns a list of present-absent-patterns of a given word list.

output(fileformat, **keywords)

Write wordlist to file.

renumber(source[, target, override])

Renumber a given set of string identifiers by replacing the ids by integers.