lingpy.sequence package

Submodules

lingpy.sequence.generate module

Module provides basic classes for sequence generation using Markov models.

class lingpy.sequence.generate.MCBasic(seqs)

Bases: object

Basic class for creating Markov chains from sequence training data.

Parameters:

seqs : list

A list of sequences. Sequences are assumed to be tokenized, i.e., they should be passed either as lists or as tuples.

walk()

Create random sequence from the distribution.
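Examples

A minimal usage sketch with toy training data (the generated sequence is random and therefore not shown):

>>> from lingpy.sequence.generate import MCBasic
>>> mc = MCBasic([['t', 'a', 'k'], ['t', 'i', 'k'], ['b', 'a', 'k']])
>>> mc.walk()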

class lingpy.sequence.generate.MCPhon(words, tokens=False, prostrings=[], classes=False, class_model=<sca-model "sca">, **keywords)

Bases: MCBasic

Class for the creation of phonetic sequences (“pseudo words”).

Parameters:

words : list

List of phonetic sequences. This list can contain tokenized sequences (lists or tuples), or simple untokenized IPA strings.

tokens : bool (default=False)

If set to True, no tokenization of input sequences is carried out.

prostrings : list (default=[])

List containing the prosodic profiles of the input sequences. If the list is empty, the profiles are generated automatically.

evaluate_string(string, tokens=False, **keywords)
get_string(new=True, tokens=False)

Generate a string from the Markov chain created from the training data.

Parameters:

new : bool (default=True)

Determine whether the string created should be different from the training data or not.

tokens : bool (default=False)

If set to True, the full list of tokens that was internally used to represent the sequences as a Markov chain is returned.
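Examples

A minimal sketch of generating a pseudo word from untokenized IPA strings; the training words are merely illustrative, and the generated output, being random, is not shown:

>>> from lingpy.sequence.generate import MCPhon
>>> mc = MCPhon(['tʰɔxtər', 'dɔːtər', 'tʰɔxtra'])
>>> mc.get_string()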

lingpy.sequence.ngrams module

This module provides methods for generating and collecting ngrams.

The methods allow for collecting different kinds of subsequences, such as standard ngrams (preceding context), skip ngrams with either single or multiple gap openings (both preceding and following context), and positional ngrams (both preceding and following context).

class lingpy.sequence.ngrams.NgramModel(pre_order=0, post_order=0, pad_symbol='$$$', sequences=None)

Bases: object

Class for operation upon sequences using ngrams models.

This class allows different operations upon sequences after training ngram models, such as sequence relative likelihood computation (both per state and overall), random sequence generation, and computation of model entropy and of the cross-entropy/perplexity of a sequence. As model training is computationally expensive and time-consuming for large datasets, trained models can be saved to and loaded (“serialized”) from disk.
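Examples

A minimal training-and-scoring sketch based only on the methods documented below; the toy data is illustrative and the numeric results, which depend on the training data, are not shown:

>>> from lingpy.sequence.ngrams import NgramModel
>>> model = NgramModel(pre_order=1)
>>> model.add_sequences([['t', 'a', 'k'], ['t', 'i', 'k'], ['b', 'a', 'k']])
>>> model.train(method='laplace')
>>> model.score(['t', 'a', 'k'])
>>> model.perplexity(['b', 'i', 'k'])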

add_sequences(sequences)

Adds sequences to a model, collecting their ngrams.

This method does not return any value, but cleans the internal matrix probability, if one was previously computed, and automatically updates the ngram counters. The actual training, with the computation of smoothed log-probabilities, is not performed automatically, and must be requested by the user by calling the .train() method.

Parameters:

sequences: list :

A list of sequences to be added to the model.

entropy(sequence, base=2.0)

Calculates the cross-entropy of a sequence.

Parameters:

sequence: list :

The sequence whose cross-entropy will be calculated.

base: float :

The logarithmic base for the cross-entropy calculation. Defaults to 2.0, following the standard approach set by Shannon, which allows entropy to be interpreted in terms of the number of bits needed for a unique representation.

Returns:

ch: float :

The cross-entropy calculated for the sequence, a real number.

model_entropy()

Return the model entropy.

This method collects P * log(P) for all contexts and returns their sum. This is different from a sequence cross-entropy, and should be used to estimate the complexity of a model.

Please note that for very large models the computation of this entropy might run into underflow problems.

Returns:

h: float :

The model entropy.

perplexity(sequence)

Calculates the perplexity of a sequence under the model.

By definition, this is simply 2.0 raised to the cross-entropy of the given sequence, computed with logarithmic base 2.0.

Parameters:

sequence: list :

The sequence whose perplexity should be calculated.

Returns:

perplexity: float :

The calculated perplexity for the sequence.

random_seqs(k=1, seq_len=None, scale=2, only_longest=False, attempts=10, seed=None)

Return a set of random sequences based on the observed transition frequencies.

This function tries to generate a set of k random sequences from the internal model. Given that the random selection and the parameters might lead to a long or infinite search loop, the number of attempts for each word generation is limited, meaning that there is no guarantee that the returned list will be of length k, but only that it will be at most of length k.

Parameters:

k: int :

The desired and maximum number of random sequences to be returned. While the algorithm should be robust enough for most cases, there is no guarantee that the desired number of sequences, or even a single random sequence, will be returned. In case of missing sequences, try increasing the parameter attempts.

seq_len: int or list :

An optional integer with the length of the sequences to be generated, or a list of lengths to be uniformly drawn from for the generated sequences. If the parameter is not specified, the lengths of the sequences will be drawn from the sequence lengths observed in training, according to their frequencies.

scale: numeric :

The exponent used for weighting ngram probabilities according to their length in number of states. The higher this value, the less likely the algorithm will be to draw shorter ngrams, which contribute to a higher variety in words but can also result in less likely sequences. Defaults to 2.

only_longest: bool :

Whether the algorithm should only collect the longest possible ngrams when computing the search space from which each new random character is obtained. This usually translates into less variation in the generated sequences and a longer search time, which might need to be increased via the attempts parameter. Defaults to False.

attempts: int :

The number of times the algorithm will try to generate a random sequence. If the algorithm is unable to generate a suitable random sequence after the specified number of attempts, the loop will advance to the following sequence (if any). Defaults to 10.

seed: obj :

Any hashable object, used to seed the random number generator and thus make the generated set of random sequences reproducible.

Returns:

seqs: list :

A list of size k with random sequences.
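Examples

A sketch continuing the class-level example above; the output is random (the seed only makes the draw reproducible) and is therefore not shown:

>>> model.random_seqs(k=2, seed=42)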

score(sequence, use_length=True)

Returns the relative likelihood of a sequence.

The model must have been trained before using this function.

Parameters:

sequence: list :

A list of states to be scored.

use_length: bool :

Whether to correct the sequence relative likelihood by using length probability. Defaults to True.

Returns:

prob: float :

The relative likelihood of the sequence, optionally corrected by the length probability.

state_score(sequence)

Returns the relative likelihood for each state in a sequence.

Please note that this does not perform correction due to sequence length, as optionally and by default performed by the .score() method. The model must have been trained in advance.

Parameters:

sequence: list :

A list of states to be scored.

Returns:

prob: list :

A list of floats, of the same length as the sequence, with the individual log-probability for each state.

train(method='laplace', normalize=False, bins=None, **kwargs)

Train a model after ngrams have been collected.

This method does not return any value, but sets the internal variables with smoothed probabilities (such as self._p and self._p0) and internally marks the model as having been trained.

Parameters:

method: str :

The name of the smoothing method to be used, as used by smooth_dist(). Either “uniform”, “random”, “mle”, “lidstone”, “laplace”, “ele”, “wittenbell”, “certaintydegree”, or “sgt”. Defaults to “laplace”.

normalize: boolean :

Whether to normalize the log-probabilities for each ngram in the model after smoothing, i.e., to guarantee that the probabilities (with the probability for unobserved transitions counted a single time) sum to 1.0. This is computationally expensive and should only be used if the model is intended for later serialization. While experiments with real data demonstrated that this normalization does not improve the results or performance of the methods, the computational cost might be justified if descriptive statistics of the model, such as samples from the matrix of transition probabilities or the entropy/perplexity of a sequence, are needed (for publication, for example), as they will be more in line with what is generally expected and will facilitate the comparison of different models.

bins: int :

The number of bins to be assumed when smoothing, for the smoothing methods that use this information. Defaults to the number of unique states observed, as gathered from the count of ngrams with no context.

lingpy.sequence.ngrams.bigrams(sequence, *, order=2, pad_symbol='$$$')

Build an iterator for collecting all bigrams of a sequence.

The sequence is padded by default.

Parameters:

sequence: list or str :

The sequence from which the bigrams will be collected.

pad_symbol: object :

An optional symbol to be used as start-of- and end-of-sequence boundaries. The same symbol is used for both boundaries. Must be a value different from None; defaults to “$$$”.

Returns:

out: iterable :

An iterable over the bigrams of the sequence, returned as tuples.

Examples

>>> from lingpy.sequence import *
>>> sent = "Insurgents killed in ongoing fighting"
>>> for ngram in bigrams(sent):
...     print(ngram)
...
('$$$', 'Insurgents')
('Insurgents', 'killed')
('killed', 'in')
('in', 'ongoing')
('ongoing', 'fighting')
('fighting', '$$$')
lingpy.sequence.ngrams.fourgrams(sequence, *, order=4, pad_symbol='$$$')

Build an iterator for collecting all fourgrams of a sequence.

The sequence is padded by default.

Parameters:

sequence: list or str :

The sequence from which the fourgrams will be collected.

pad_symbol: object :

An optional symbol to be used as start-of- and end-of-sequence boundaries. The same symbol is used for both boundaries. Must be a value different from None; defaults to “$$$”.

Returns:

out: iterable :

An iterable over the fourgrams of the sequence, returned as tuples.

Examples

>>> from lingpy.sequence import *
>>> sent = "Insurgents killed in ongoing fighting"
>>> for ngram in fourgrams(sent):
...     print(ngram)
...
('$$$', '$$$', '$$$', 'Insurgents')
('$$$', '$$$', 'Insurgents', 'killed')
('$$$', 'Insurgents', 'killed', 'in')
('Insurgents', 'killed', 'in', 'ongoing')
('killed', 'in', 'ongoing', 'fighting')
('in', 'ongoing', 'fighting', '$$$')
('ongoing', 'fighting', '$$$', '$$$')
('fighting', '$$$', '$$$', '$$$')
lingpy.sequence.ngrams.get_all_ngrams(sequence, sort=False)

Function returns all possible n-grams of a given sequence.

Parameters:

sequence : list or str

The sequence that shall be converted into its ngram representation.

Returns:

out : list

A list of all ngrams of the input word, sorted in decreasing order of length.

Examples

>>> get_all_ngrams('abcde')
['abcde', 'bcde', 'abcd', 'cde', 'abc', 'bcd', 'ab', 'de', 'cd', 'bc', 'a', 'e', 'b', 'd', 'c']
lingpy.sequence.ngrams.get_all_ngrams_by_order(sequence, orders=None, pad_symbol='$$$')

Build an iterator for collecting all ngrams of a given set of orders.

If no set of orders (i.e., “lengths”) is provided, this will collect all possible ngrams in the sequence.

Parameters:

sequence: list or str :

The sequence from which the ngrams will be collected.

orders: list :

An optional list of the orders of the ngrams to be collected. Can be larger than the length of the sequence, in which case the latter will be padded accordingly if requested. Defaults to the collection of all possible ngrams in the sequence with the minimum padding.

pad_symbol: object :

An optional symbol to be used as start-of- and end-of-sequence boundaries. The same symbol is used for both boundaries. Must be a value different from None; defaults to “$$$”.

Returns:

out: iterable :

An iterable over the ngrams of the sequence, returned as tuples.

Examples

>>> from lingpy.sequence import *
>>> sent = "Insurgents were killed"
>>> for ngram in get_all_ngrams_by_order(sent):
...     print(ngram)
...
('Insurgents',)
('were',)
('killed',)
('$$$', 'Insurgents')
('Insurgents', 'were')
('were', 'killed')
('killed', '$$$')
('$$$', '$$$', 'Insurgents')
('$$$', 'Insurgents', 'were')
('Insurgents', 'were', 'killed')
('were', 'killed', '$$$')
('killed', '$$$', '$$$')
lingpy.sequence.ngrams.get_all_posngrams(sequence, pre_orders, post_orders, pad_symbol='$$$', elm_symbol='###')

Build an iterator for collecting all positional ngrams of a sequence.

The elements of the iterator, as returned by “get_posngrams()”, include a tuple of the context (which, like any tuple, can be hashed), the transition symbol, and the position of the symbol in the sequence. Such output is primarily intended for state-by-state relative likelihood computations with stochastic models, and can be thought of as approximating a collection of “shingles”.

Parameters:

sequence: list or str :

The sequence from which the ngrams will be collected.

pre_orders: int or list :

An integer with the maximum length of the preceding context or a list with all preceding context lengths to be collected. If an integer is passed, all lengths from zero to the given one will be collected.

post_orders: int or list :

An integer with the maximum length of the following context or a list with all following context lengths to be collected. If an integer is passed, all lengths from zero to the given one will be collected.

pad_symbol: object :

An optional symbol to be used as start-of- and end-of-sequence boundaries. The same symbol is used for both boundaries. Must be a value different from None; defaults to “$$$”.

elm_symbol: object :

An optional symbol to be used as transition symbol replacement in the context tuples (the first element in the returned iterator). Defaults to “###”.

Returns:

out: iterable :

An iterable over the positional ngrams of the sequence, returned as tuples whose elements are: (1) a tuple representing the context (thus including preceding context, the transition symbol, and the following context), (2) an object with the value of the transition symbol, and (3) the index of the transition symbol in the sequence.

Examples

>>> from lingpy.sequence import *
>>> sent = "Insurgents were killed"
>>> for ngram in get_all_posngrams(sent, 2, 1):
...     print(ngram)
...
(('###',), 'Insurgents', 0)
(('###',), 'were', 1)
(('###',), 'killed', 2)
(('###', 'were'), 'Insurgents', 0)
(('###', 'killed'), 'were', 1)
(('###', '$$$'), 'killed', 2)
(('$$$', '###'), 'Insurgents', 0)
(('Insurgents', '###'), 'were', 1)
(('were', '###'), 'killed', 2)
(('$$$', '###', 'were'), 'Insurgents', 0)
(('Insurgents', '###', 'killed'), 'were', 1)
(('were', '###', '$$$'), 'killed', 2)
(('$$$', '$$$', '###'), 'Insurgents', 0)
(('$$$', 'Insurgents', '###'), 'were', 1)
(('Insurgents', 'were', '###'), 'killed', 2)
(('$$$', '$$$', '###', 'were'), 'Insurgents', 0)
(('$$$', 'Insurgents', '###', 'killed'), 'were', 1)
(('Insurgents', 'were', '###', '$$$'), 'killed', 2)
lingpy.sequence.ngrams.get_n_ngrams(sequence, order, pad_symbol='$$$')

Build an iterator for collecting all ngrams of a given order.

The sequence can optionally be padded with boundary symbols; the same symbol is used before and after the sequence.

Parameters:

sequence: list or str :

The sequence from which the ngrams will be collected.

order: int :

The order of the ngrams to be collected.

pad_symbol: object :

An optional symbol to be used as start-of- and end-of-sequence boundaries. The same symbol is used for both boundaries. Must be a value different from None; defaults to “$$$”.

Returns:

out: iterable :

An iterable over the ngrams of the sequence, returned as tuples.

Examples

>>> from lingpy.sequence import *
>>> sent = "Insurgents killed in ongoing fighting"
>>> for ngram in get_n_ngrams(sent, 2):
...     print(ngram)
...
('$$$', 'Insurgents')
('Insurgents', 'killed')
('killed', 'in')
('in', 'ongoing')
('ongoing', 'fighting')
('fighting', '$$$')
>>> for ngram in get_n_ngrams(sent, 1):
...     print(ngram)
...
('Insurgents',)
('killed',)
('in',)
('ongoing',)
('fighting',)
>>> for ngram in get_n_ngrams(sent, 0):
...     print(ngram)
...
lingpy.sequence.ngrams.get_posngrams(sequence, pre_order=0, post_order=0, pad_symbol='$$$', elm_symbol='###')

Build an iterator for collecting all positional ngrams of a sequence.

The preceding and following orders (i.e., “contexts”) must always be provided. The elements of the iterator include a tuple of the context (which, like any tuple, can be hashed), the transition symbol, and the position of the symbol in the sequence. Such output is primarily intended for state-by-state relative likelihood computations with stochastic models.

Parameters:

sequence: list or str :

The sequence from which the ngrams will be collected.

pre_order: int :

An optional integer specifying the length of the preceding context. Defaults to zero.

post_order: int :

An optional integer specifying the length of the following context. Defaults to zero.

pad_symbol: object :

An optional symbol to be used as start-of- and end-of-sequence boundaries. The same symbol is used for both boundaries. Must be a value different from None; defaults to “$$$”.

elm_symbol: object :

An optional symbol to be used as transition symbol replacement in the context tuples (the first element in the returned iterator). Defaults to “###”.

Returns:

out: iterable :

An iterable over the positional ngrams of the sequence, returned as tuples whose elements are: (1) a tuple representing the context (thus including preceding context, the transition symbol, and the following context), (2) an object with the value of the transition symbol, and (3) the index of the transition symbol in the sequence.

Examples

>>> from lingpy.sequence import *
>>> sent = "Insurgents killed in ongoing fighting"
>>> for ngram in get_posngrams(sent, 2, 1):
...     print(ngram)
...
(('$$$', '$$$', '###', 'killed'), 'Insurgents', 0)
(('$$$', 'Insurgents', '###', 'in'), 'killed', 1)
(('Insurgents', 'killed', '###', 'ongoing'), 'in', 2)
(('killed', 'in', '###', 'fighting'), 'ongoing', 3)
(('in', 'ongoing', '###', '$$$'), 'fighting', 4)
lingpy.sequence.ngrams.get_skipngrams(sequence, order, max_gaps, pad_symbol='$$$', single_gap=True)

Build an iterator for collecting all skip ngrams of a given length.

The function requires the length of the skip ngrams to be collected, and allows either collecting ngrams with an unlimited number of gap openings (as described and implemented in Guthrie et al. 2006) or ngrams with at most one gap opening.

Parameters:

sequence: list or str :

The sequence from which the ngrams will be collected. Must not include “None” as an element, as it is used as a sentinel during skip ngram collection following the implementation offered by Bird et al. 2018 (NLTK), which is a de facto standard.

order: int :

The order of the ngrams to be collected (parameter “n” in Guthrie et al. 2006).

max_gaps: int :

The maximum number of gaps in the ngrams to be collected (parameter “k” in Guthrie et al. 2006).

pad_symbol: object :

An optional symbol to be used as start-of- and end-of-sequence boundaries. The same symbol is used for both boundaries. Must be a value different from None; defaults to “$$$”.

single_gap: boolean :

An optional boolean indicating whether at most one gap opening is to be allowed (True, the default) or whether multiple gap openings are permitted (False), as in Guthrie et al. (2006) and Bird et al. (2018).

Returns:

out: iterable :

An iterable over the ngrams of the sequence, returned as tuples.

Examples

>>> from lingpy.sequence import *
>>> sent = "Insurgents killed in ongoing fighting"
>>> for ngram in get_skipngrams(sent, 2, 2):
...     print(ngram)
...
('$$$', 'Insurgents')
('Insurgents', 'killed')
('killed', 'in')
('in', 'ongoing')
('ongoing', 'fighting')
('fighting', '$$$')
('$$$', 'killed')
('Insurgents', 'in')
('killed', 'ongoing')
('in', 'fighting')
('ongoing', '$$$')
('$$$', 'in')
('Insurgents', 'ongoing')
('killed', 'fighting')
('in', '$$$')
>>> for ngram in get_skipngrams(sent, 2, 2, single_gap=False):
...     print(ngram)
...
('$$$', 'Insurgents')
('$$$', 'killed')
('$$$', 'in')
('Insurgents', 'killed')
('Insurgents', 'in')
('Insurgents', 'ongoing')
('killed', 'in')
('killed', 'ongoing')
('killed', 'fighting')
('in', 'ongoing')
('in', 'fighting')
('in', '$$$')
('ongoing', 'fighting')
('ongoing', '$$$')
('fighting', '$$$')
lingpy.sequence.ngrams.trigrams(sequence, *, order=3, pad_symbol='$$$')

Build an iterator for collecting all trigrams of a sequence.

The sequence is padded by default.

Parameters:

sequence: list or str :

The sequence from which the trigrams will be collected.

pad_symbol: object :

An optional symbol to be used as start-of- and end-of-sequence boundaries. The same symbol is used for both boundaries. Must be a value different from None; defaults to “$$$”.

Returns:

out: iterable :

An iterable over the trigrams of the sequence, returned as tuples.

Examples

>>> from lingpy.sequence import *
>>> sent = "Insurgents killed in ongoing fighting"
>>> for ngram in trigrams(sent):
...     print(ngram)
...
('$$$', '$$$', 'Insurgents')
('$$$', 'Insurgents', 'killed')
('Insurgents', 'killed', 'in')
('killed', 'in', 'ongoing')
('in', 'ongoing', 'fighting')
('ongoing', 'fighting', '$$$')
('fighting', '$$$', '$$$')

lingpy.sequence.profile module

Module provides methods for the handling of orthography profiles.

lingpy.sequence.profile.context_profile(wordlist, ref='ipa', col='doculect', semi_diacritics='hsʃ̢ɕʂʐʑʒw', merge_vowels=False, brackets=None, splitters='/,;~', merge_geminates=True, clts=False, bad_word='<???>', bad_sound='<?>', unknown_sound='!{0}', examples=2, max_entries=100, normalization_form='NFC')

Create an advanced Orthography Profile with context and doculect information.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

A wordlist from which you want to derive an initial orthography profile.

ref : str (default=”ipa”)

The name of the reference column in which the words are stored.

col : str (default=”doculect”)

Indicate in which column the information on the language variety is stored.

semi_diacritics : str

Indicate characters which can occur both as “diacritics” (second part in a sound) and alone.

merge_vowels : bool (default=False)

Indicate whether consecutive vowels should be merged.

brackets : dict

A dictionary with opening brackets as key and closing brackets as values. Defaults to a pre-defined set of frequently occurring brackets.

splitters : str

The characters which force the automatic splitting of an entry.

clts : dict (default=None)

A dictionary(like) object that converts a given source sound into a potential target sound, using the get()-method of the dictionary. Normally, we think of a CLTS instance here (that is: a cross-linguistic transcription system as defined in the pyclts package).

bad_word : str (default=”<???>”)

Indicate how words that could not be parsed should be handled. Note that both “bad_word” and “bad_sound” are format-strings, so you can add formatting information here.

bad_sound : str (default=”<?>”)

Indicate how sounds that could not be converted to a sound class should be handled. Note that both “bad_word” and “bad_sound” are format-strings, so you can add formatting information here.

unknown_sound : str (default=”!{0}”)

If clts is provided, use this string to indicate that sounds are classified as “unknown sound” in the CLTS framework.

examples : int (default=2)

Indicate the number of examples that should be printed out.

Returns:

profile : generator

A generator of tuples indicating the segment, its frequency, its conversion to sound classes in the Dolgopolsky sound-class model, and its Unicode code points.

lingpy.sequence.profile.simple_profile(wordlist, ref='ipa', semi_diacritics='hsʃ̢ɕʂʐʑʒw', merge_vowels=False, brackets=None, splitters='/,;~', merge_geminates=True, normalization_form='NFC', bad_word='<???>', bad_sound='<?>', clts=None, unknown_sound='!{0}')

Create an initial Orthography Profile using Lingpy’s clean_string procedure.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

A wordlist from which you want to derive an initial orthography profile.

ref : str (default=”ipa”)

The name of the reference column in which the words are stored.

semi_diacritics : str

Indicate characters which can occur both as “diacritics” (second part in a sound) and alone.

merge_vowels : bool (default=False)

Indicate whether consecutive vowels should be merged.

brackets : dict

A dictionary with opening brackets as key and closing brackets as values. Defaults to a pre-defined set of frequently occurring brackets.

splitters : str

The characters which force the automatic splitting of an entry.

clts : dict (default=None)

A dictionary(like) object that converts a given source sound into a potential target sound, using the get()-method of the dictionary. Normally, we think of a CLTS instance here (that is: a cross-linguistic transcription system as defined in the pyclts package).

bad_word : str (default=”<???>”)

Indicate how words that could not be parsed should be handled. Note that both “bad_word” and “bad_sound” are format-strings, so you can add formatting information here.

bad_sound : str (default=”<?>”)

Indicate how sounds that could not be converted to a sound class should be handled. Note that both “bad_word” and “bad_sound” are format-strings, so you can add formatting information here.

unknown_sound : str (default=”!{0}”)

If clts is provided, use this string to indicate that sounds are classified as “unknown sound” in the CLTS framework.

Returns:

profile : generator

A generator of tuples indicating the segment, its frequency, its conversion to sound classes in the Dolgopolsky sound-class model, and its Unicode code points.
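Examples

A minimal usage sketch; the file name is hypothetical, and the profile rows depend on the data, so they are not shown:

>>> from lingpy import Wordlist
>>> from lingpy.sequence.profile import simple_profile
>>> wordlist = Wordlist('wordlist.tsv')
>>> for row in simple_profile(wordlist):
...     print(row)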

lingpy.sequence.smoothing module

Module providing various smoothing methods for ngram models.

The smoothing methods are implemented to be as compatible as possible with those offered by NLTK. In fact, both the implementation and the comments try to follow Bird et al. as closely as possible.

lingpy.sequence.smoothing.certaintydegree_dist(freqdist, **kwargs)

Returns a log-probability distribution based on the degree of certainty.

In this distribution a mass probability is reserved for unobserved samples, computed from the degree of certainty that there are no unobserved samples.

Under development and test by Tiago Tresoldi, this is an experimental probability distribution that should not be used as the sole or main distribution for the time being.

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the probability distribution will be calculated.

bins: int :

The optional number of sample bins that can be generated by the experiment that is described by the probability distribution. If not specified, it will default to the number of samples in the frequency distribution.

unobs_prob : float

An optional mass probability to be reserved for unobserved states, from 0.0 to 1.0.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.

lingpy.sequence.smoothing.ele_dist(freqdist, **kwargs)

Returns an Expected-Likelihood estimate log-probability distribution.

In an Expected-Likelihood estimate log-probability the frequency distribution of observed samples is used to estimate the probability distribution of the experiment that generated such observation, following a parameter given by a real number gamma set by definition to 0.5. As such, it is a generalization of the Lidstone estimate.

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the probability distribution will be calculated.

bins: int :

The optional number of sample bins that can be generated by the experiment that is described by the probability distribution. If not specified, it will default to the number of samples in the frequency distribution.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.

lingpy.sequence.smoothing.laplace_dist(freqdist, **kwargs)

Returns a Laplace estimate log-probability distribution.

In a Laplace estimate log-probability the frequency distribution of observed samples is used to estimate the probability distribution of the experiment that generated such observation, following a parameter given by a real number gamma set by definition to 1. As such, it is a generalization of the Lidstone estimate.

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the probability distribution will be calculated.

bins: int :

The optional number of sample bins that can be generated by the experiment that is described by the probability distribution. If not specified, it will default to the number of samples in the frequency distribution.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.

lingpy.sequence.smoothing.lidstone_dist(freqdist, **kwargs)

Returns a Lidstone estimate log-probability distribution.

In a Lidstone estimate log-probability the frequency distribution of observed samples is used to estimate the probability distribution of the experiment that generated such observation, following a parameter given by a real number gamma, typically ranging from 0.0 to 1.0. The Lidstone estimate approximates the probability of a sample with count c from an experiment with N outcomes and B bins as (c + gamma) / (N + B * gamma). This is equivalent to adding gamma to the count of each bin and taking the Maximum-Likelihood estimate of the resulting frequency distribution, with the corrected space of observation; the probability for an unobserved sample is given by the frequency of a sample with gamma observations. A worked example is given at the end of this entry.

Also called “additive smoothing”, this estimation method is frequently used with a gamma of 1.0 (the so-called “Laplace smoothing”) or of 0.5 (the so-called “Expected likelihood estimate”, or ELE).

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the probability distribution will be calculated.

gamma : float

A real number used to parameterize the estimate.

bins: int :

The optional number of sample bins that can be generated by the experiment that is described by the probability distribution. If not specified, it will default to the number of samples in the frequency distribution.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.
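Examples

The formula can be checked with a small worked sketch that computes the estimate directly from the definition above; note that it does not call the library function, which returns log-probabilities:

>>> freqdist = {'a': 3, 'b': 1}
>>> gamma, N, B = 1.0, sum(freqdist.values()), len(freqdist)
>>> {s: (c + gamma) / (N + B * gamma) for s, c in freqdist.items()}
{'a': 0.6666666666666666, 'b': 0.3333333333333333}
>>> gamma / (N + B * gamma)
0.16666666666666666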

lingpy.sequence.smoothing.mle_dist(freqdist, **kwargs)

Returns a Maximum-Likelihood Estimation log-probability distribution.

In an MLE log-probability distribution the probability of each sample is approximated as the frequency of the same sample in the frequency distribution of observed samples. It is the distribution people intuitively adopt when thinking of probability distributions. A mass probability can optionally be reserved for unobserved samples.

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the probability distribution will be calculated.

unobs_prob : float

An optional mass probability to be reserved for unobserved states, from 0.0 to 1.0.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.

lingpy.sequence.smoothing.random_dist(freqdist, **kwargs)

Returns a random log-probability distribution.

In a random log-probability distribution all samples, no matter the observed counts, will have a random log-probability computed from a set of randomly drawn floating point values. A mass probability can optionally be reserved for unobserved samples.

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the probability distribution will be calculated.

unobs_prob : float

An optional mass probability to be reserved for unobserved states, from 0.0 to 1.0.

seed : any hashable value

An optional seed for the random number generator, defaulting to None.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.

lingpy.sequence.smoothing.sgt_dist(freqdist, **kwargs)

Returns a Simple Good-Turing log-probability distribution.

The returned log-probability distribution is based on the Good-Turing frequency estimation, as first developed by Alan Turing and I. J. Good and implemented in a more easily computable way by Gale and Sampson (1995; reprinted 2001) in the so-called “Simple Good-Turing” method.

This implementation is based mostly on the one by “maxbane” (2011) (https://github.com/maxbane/simplegoodturing/blob/master/sgt.py), as well as on the original one in C by Geoffrey Sampson (1995; 2000; 2005; 2008) (https://www.grsampson.net/Resources.html) and on the one by Loper, Bird et al. (2001-2018, NLTK Project) (http://www.nltk.org/_modules/nltk/probability.html). Please note that, due to minor differences in implementation intended to guarantee non-zero probabilities even in cases of expected underflow, as well as our reliance on scipy’s libraries for speed and our way of handling probabilities that are not computable when the assumptions of SGT are not met, most results will not exactly match those of the “gold standard” of Gale and Sampson, even though the differences are never expected to be significant and are equally distributed across the samples.

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the probability distribution will be calculated.

p_value : float

The p-value for calculating the confidence interval of the empirical Turing estimate, which guides the decision of using either the Turing estimate “x” or the loglinear smoothed “y”. Defaults to 0.05, as per the reference implementation by Sampson; note, however, that the authors, both in their paper and in their code (following suggestions credited to private communication with Fan Yang), consider using a value of 0.1.

allow_fail : bool

A boolean indicating whether the function is allowed to fail, throwing RuntimeWarning exceptions, if the essential assumptions about the frequency distribution are not met, i.e., if the slope of the loglinear regression is > -1.0 or if an unobserved count is reached before the smoothing threshold can be crossed. If set to False, the estimation might result in an unreliable probability distribution; defaults to True.

default_p0 : float

An optional value indicating the probability for unobserved samples (“p0”) in cases where no samples with a single count are observed; if this value is not specified, “p0” will default to a Laplace estimation for the current frequency distribution. Please note that this is an intended change from the reference implementation by Gale and Sampson.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.

lingpy.sequence.smoothing.smooth_dist(freqdist, method, **kwargs)

Returns a smoothed log-probability distribution from a named method.

This function generalizes over all implemented smoothing methods, especially in terms of serialization. The method argument selects the smoothing method to use, and all remaining arguments are passed on to the appropriate function.

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the log-probability distribution will be calculated.

method: str :

The name of the probability smoothing method to use. Either “uniform”, “random”, “mle”, “lidstone”, “laplace”, “ele”, “wittenbell”, “certaintydegree”, or “sgt”.

kwargs: additional arguments :

Additional arguments passed to the appropriate smoothing method function.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.
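Examples

A minimal dispatch sketch, assuming (per the Returns section above) that the two values are returned together as a tuple:

>>> from lingpy.sequence.smoothing import smooth_dist
>>> freqdist = {'a': 3, 'b': 1}
>>> state_prob, unobserved_prob = smooth_dist(freqdist, 'laplace')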

lingpy.sequence.smoothing.uniform_dist(freqdist, **kwargs)

Returns a uniform log-probability distribution.

In a uniform log-probability distribution all samples, no matter the observed counts, will have the same log-probability. A mass probability can optionally be reserved for unobserved samples.

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the log-probability distribution will be calculated.

unobs_prob : float

An optional mass probability to be reserved for unobserved states, from 0.0 to 1.0.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.

lingpy.sequence.smoothing.wittenbell_dist(freqdist, **kwargs)

Returns a Witten-Bell estimate log-probability distribution.

In a Witten-Bell estimate log-probability a uniform probability mass is allocated to yet unobserved samples by using the number of samples that have only been observed once. The probability mass reserved for unobserved samples is equal to T / (N + T), where T is the number of observed sample types and N the total number of observations. This equates to the Maximum-Likelihood Estimate of a new type of sample occurring. The remaining probability mass is discounted such that all probability estimates sum to one, yielding the formulas below (a worked example follows the list):

  • p = T / (Z * (N + T)), if the sample’s count is zero, where Z is the number of unobserved bins

  • p = c / (N + T), otherwise
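The formulas can be checked with a small worked sketch computed directly from the definitions above (not calling the library function, which returns log-probabilities); here bins is assumed to be 3, so that exactly one bin remains unobserved:

>>> freqdist = {'a': 3, 'b': 1}
>>> N, T = sum(freqdist.values()), len(freqdist)
>>> Z = 3 - T
>>> {s: c / (N + T) for s, c in freqdist.items()}
{'a': 0.5, 'b': 0.16666666666666666}
>>> T / (Z * (N + T))
0.3333333333333333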

Parameters:

freqdist : dict

Frequency distribution of samples (keys) and counts (values) from which the probability distribution will be calculated.

bins: int :

The optional number of sample bins that can be generated by the experiment that is described by the probability distribution. If not specified, it will default to the number of samples in the frequency distribution.

Returns:

state_prob: dict :

A dictionary of sample to log-probabilities for all the samples in the frequency distribution.

unobserved_prob: float :

The log-probability for samples not found in the frequency distribution.

lingpy.sequence.sound_classes module

Module provides various methods for the handling of sound classes.

lingpy.sequence.sound_classes.asjp2tokens(seq, merge_vowels=True)
lingpy.sequence.sound_classes.check_tokens(tokens, **keywords)

Function checks whether tokens are given in a consistent input format.

lingpy.sequence.sound_classes.class2tokens(tokens, classes, gap_char='-', local=False)

Turn aligned sound-class sequences into aligned sequences of IPA tokens.

Parameters:

tokens : list

The list of tokens corresponding to the unaligned IPA string.

classes : string or list

The aligned class string.

gap_char : string (default=”-“)

The character which indicates gaps in the output string.

local : bool (default=False)

If set to True a local alignment with prefix and suffix can be converted.

Returns:

alignment : list

A list of tokens with gaps at the positions where they occurred in the alignment of the class string.

Examples

>>> from lingpy import *
>>> tokens = ipa2tokens('t͡sɔyɡə')
>>> aligned_sequence = 'CU-KE'
>>> print(', '.join(class2tokens(tokens, aligned_sequence)))
t͡s, ɔy, -, ɡ, ə
lingpy.sequence.sound_classes.clean_string(sequence, semi_diacritics='hsʃ̢ɕʂʐʑʒw', merge_vowels=False, segmentized=False, rules=None, ignore_brackets=True, brackets=None, split_entries=True, splitters='/,;~', preparse=None, merge_geminates=True, normalization_form='NFC')

Function exhaustively checks how well a sequence is understood by LingPy.

Parameters:

semi_diacritics : str

Indicate characters which can occur both as “diacritics” (second part in a sound) and alone.

merge_vowels : bool (default=False)

Indicate whether consecutive vowels should be merged.

segmentized : bool (default=False)

Indicate whether the input string is already segmentized or not. If set to True, items in brackets can no longer be ignored.

rules : dict

Replacement rules to be applied to a segmentized string.

ignore_brackets : bool

If set to True, ignore all content within a given bracket.

brackets : dict

A dictionary with opening brackets as key and closing brackets as values. Defaults to a pre-defined set of frequently occurring brackets.

split_entries : bool (default=True)

Indicate whether multiple entries (with a comma etc.) should be split into separate entries.

splitters : str

The characters which force the automatic splitting of an entry.

preparse : list

List of tuples, giving simple replacement patterns (source and target), which are applied before any processing starts.

Returns:

cleaned_strings : list

A list of cleaned strings which are segmented by space characters. If splitters are encountered, indicating that the entry contains several variants, the list will contain one entry per variant. If there are no splitters, the list has size one.
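Examples

A minimal sketch, assuming the default splitters and bracket handling; the input is illustrative, and the exact segmentation of the returned strings is not shown:

>>> from lingpy.sequence.sound_classes import clean_string
>>> variants = clean_string('tʰoxter (dialectal), tʰoxtra')
>>> len(variants)
2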

lingpy.sequence.sound_classes.codepoint(s)

Return unicode codepoint(s) for a character set.

lingpy.sequence.sound_classes.get_all_ngrams(sequence, sort=False)

Function returns all possible n-grams of a given sequence.

Parameters:

sequence : list or str

The sequence that shall be converted into its ngram representation.

Returns:

out : list

A list of all ngrams of the input word, sorted in decreasing order of length.

Examples

>>> get_all_ngrams('abcde')
['abcde', 'bcde', 'abcd', 'cde', 'abc', 'bcd', 'ab', 'de', 'cd', 'bc', 'a', 'e', 'b', 'd', 'c']
lingpy.sequence.sound_classes.ipa2tokens(sequence: str, **keywords)

Tokenize IPA-encoded strings.

Parameters:

sequence : str

The input sequence that shall be tokenized.

diacritics : {str, None} (default=None)

A string containing all diacritics which shall be considered in the respective analysis. When set to None, the default diacritic string will be used.

vowels : {str, None} (default=None)

A string containing all vowel symbols which shall be considered in the respective analysis. When set to None, the default vowel string will be used.

tones : {str, None} (default=None)

A string indicating all tone letter symbols which shall be considered in the respective analysis. When set to None, the default tone string will be used.

combiners : str (default=”͜͡”)

A string with characters that are used to combine two separate characters (compare affricates such as t͡s).

breaks : str (default=”-.”)

A string containing the characters that indicate that a new token starts right after them. These can be used to indicate that two consecutive vowels should not be treated as diphthongs or for diacritics that are put before the following letter.

merge_vowels : bool (default=True)

Indicate whether vowels should be merged into diphthongs, or whether each vowel symbol should be considered separately.

merge_geminates : bool (default=True)

Indicate whether identical symbols should be merged into one token, or rather be kept separate.

expand_nasals : bool (default=False)

semi_diacritics: str (default=’’) :

Indicate which symbols shall be treated as “semi-diacritics”, that is, as symbols which can occur on their own, but which, when preceded by a consonant, will form clusters with it. If you want to disable this feature, just set the keyword to an empty string.

clean_string : bool (default=False)

Conduct a rough string-cleaning strategy by which all items between brackets are removed along with the brackets.

Returns:

tokens : list

A list of IPA tokens.

Examples

>>> from lingpy import *
>>> myseq = 't͡sɔyɡə'
>>> ipa2tokens(myseq)
['t͡s', 'ɔy', 'ɡ', 'ə']
lingpy.sequence.sound_classes.ono_parse(word, output='', **keywords)

Carry out a rough onset-nucleus-offset parse of a word in IPA.

Notes

The method is an approximation and is not expected to be free of flaws. It is, however, rather helpful in most instances. It defines a simple model in which seven different contexts are distinguished for each word:

  • “#”: onset cluster in a word’s initial syllable

  • “C”: onset cluster in a word’s non-initial syllable

  • “V”: nucleus vowel in a word’s initial syllable

  • “v”: nucleus vowel in a word’s non-initial and non-final syllable

  • “>”: nucleus vowel in a word’s final syllable

  • “c”: offset cluster in a word’s non-final syllable

  • “$”: offset cluster in a word’s final syllable

lingpy.sequence.sound_classes.pgrams(sequence, **keywords)

Convert a given sequence into bigrams consisting of prosodic string symbols and the tokens of the original sequence.

lingpy.sequence.sound_classes.pid(almA, almB, mode=2)

Calculate the Percentage Identity (PID) score for aligned sequence pairs.

Parameters:

almA, almB : string or list

The aligned sequences which can be either a string or a list.

mode : { 1, 2, 3, 4, 5 }

Indicate which of the four possible PID scores described in Raghava2006 should be calculated; the fifth possibility is added for linguistic purposes:

  1. identical positions / (aligned positions + internal gap positions),

  2. identical positions / aligned positions,

  3. identical positions / shortest sequence, or

  4. identical positions / shortest sequence (including internal gap pos.)

  5. identical positions / (aligned positions + 2 * number of gaps)

Returns:

score : float

The PID score of the given alignment as a floating point number between 0 and 1.

See also

lingpy.compare.Multiple.get_pid

Notes

The PID score is a common measure for the diversity of a given alignment. The implementation employed by LingPy follows the description of Raghava2006 where four different variants of PID scores are distinguished. Essentially, the PID score is based on the comparison of identical residue pairs with the total number of residue pairs in a given alignment.

Examples

Load an alignment from the test suite.

>>> from lingpy import *
>>> pairs = PSA(get_file('test.psa'))

Extract the alignments of the first aligned sequence pair.

>>> almA,almB,score = pairs.alignments[0]

Calculate the PID score of the alignment.

>>> pid(almA,almB)
0.44444444444444442
lingpy.sequence.sound_classes.prosodic_string(string, _output=True, **keywords)

Create a prosodic string of the sonority profile of a sequence.

Parameters:

string : list

A list of integers indicating the sonority of the tokens of the underlying sequence.

stress : str (default=rcParams[‘stress’])

A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.

diacritics : str (default=rcParams[‘diacritics’])

A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.

cldf : bool (default=False)

If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h₂ in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.

Returns:

prostring : string

A prosodic string corresponding to the sonority profile of the underlying sequence.

See also

prosodic

Notes

A prosodic string is a sequence of specific characters indicating the respective prosodic context of the segments (see List2012 or List2012a for a detailed description). In contrast to the previous model, the current implementation allows for a more fine-grained distinction between different prosodic segments. The current scheme distinguishes the following prosodic positions:

  • A: sequence-initial consonant

  • B: syllable-initial, non-sequence initial consonant in a context of ascending sonority

  • C: non-syllable, non-initial consonant in ascending sonority context

  • L: non-syllable-final consonant in descending environment

  • M: syllable-final consonant in descending environment

  • N: word-final consonant

  • X: first vowel in a word

  • Y: non-final vowel in a word

  • Z: vowel occurring in the last position of a word

  • T: tone

  • _: word break

Examples

>>> prosodic_string(ipa2tokens('t͡sɔyɡə'))
'AXBZ'
lingpy.sequence.sound_classes.prosodic_weights(prostring, _transform={})

Calculate prosodic weights for each position of a sequence.

Parameters:

prostring : string

A prosodic string as it is returned by prosodic_string().

_transform : dict

A dictionary that determines how prosodic strings should be transformed into prosodic weights. Use this dictionary to adjust the prosodic strings to your own user-defined prosodic weight schema.

Returns:

weights : list

A list of floats reflecting the modification of the weight for each position.

See also

prosodic_string

Notes

Prosodic weights are specific scaling factors which decrease or increase the gap score of a given segment in alignment analyses (see List2012 or List2012a for a detailed description).

Examples

>>> from lingpy import *
>>> prostring = '#vC>'
>>> prosodic_weights(prostring)
[2.0, 1.3, 1.5, 0.7]
lingpy.sequence.sound_classes.sampa2uni(seq)

Convert sequence in IPA-sampa-format to IPA-unicode.

Notes

This function is based on code taken from Peter Kleiweg (http://www.let.rug.nl/~kleiweg/L04/devel/python/xsampa.html).

lingpy.sequence.sound_classes.syllabify(seq, output='flat', **keywords)

Carry out a simple syllabification of a sequence, using sonority as a proxy.

Parameters:

output: {“flat”, “breakpoints”, “nested”} (default=”flat”) :

Define how to output the syllabification. Select between:

  • “flat”: a syllable separator is introduced to mark the syllable boundaries

  • “breakpoints”: a tuple consisting of indices that slice the original sequence into syllables is returned

  • “nested”: a nested list reflecting the syllable structure is returned

sep : str (default=”◦”)

Select your preferred syllable separator.

Returns:

syllable : list

Either a flat list containing a syllable separator, or a nested list reflecting the syllable structure, or a list of tuples containing the indices indicating where the input sequence should be sliced in order to split it into syllables.

Notes

When analyzing the sequence, we start a new syllable in all cases where we reach a deepest point in the sonority hierarchy of the sonority profile of the sequence. When passing an aligned string to this function, the gaps will be ignored when computing boundaries, but later on re-introduced, if the alignment is passed in segmented form.
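Examples

A minimal sketch using the default flat output; the exact syllable boundaries follow from the sonority profile of the sequence and are therefore not shown:

>>> from lingpy.sequence.sound_classes import ipa2tokens, syllabify
>>> syllabify(ipa2tokens('mantaka'))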

lingpy.sequence.sound_classes.token2class(token, model, stress=None, diacritics=None, cldf=None)

Convert a single token into a sound-class.

Parameters:

token : str

A token (phonetic segment).

model : Model

A Model object.

stress : str (default=rcParams[‘stress’])

A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.

diacritics : str (default=rcParams[‘diacritics’])

A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.

cldf : bool (default=False)

If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h₂ in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.

Returns:

sound_class : str

A sound-class representation of the phonetic segment. If the segment cannot be resolved, the respective string will be rendered as “0” (zero).
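Examples

A minimal sketch; following the tokens2class() example below, the SCA class expected for the affricate is “C”:

>>> from lingpy import *
>>> token2class('t͡s', 'sca')
'C'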

lingpy.sequence.sound_classes.tokens2class(tokens, model, stress=None, diacritics=None, cldf=True)

Convert tokenized IPA strings into their respective class strings.

Parameters:

tokens : list

A list of tokens as they are returned from ipa2tokens().

model : Model

A Model object.

stress : str (default=rcParams[‘stress’])

A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.

diacritics : str (default=rcParams[‘diacritics’])

A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.

cldf : bool (default=True)

If set to True, as by default, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h₂ in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.

Returns:

classes : list

A sound-class representation of the tokenized IPA string in form of a list. If sound classes cannot be resolved, the respective string will be rendered as “0” (zero).

Notes

The function ~lingpy.sequence.sound_classes.token2class returns a “0” (zero) if the sound is not recognized by LingPy’s sound class models. While an unknown sound in a longer sequence is no problem for alignment algorithms, a sequence consisting only of unknown sounds leads to unwanted and often unforeseeable behavior. For this reason, this function raises a ValueError if a resulting sequence contains only unknown sounds.

Examples

>>> from lingpy import *
>>> tokens = ipa2tokens('t͡sɔyɡə')
>>> classes = tokens2class(tokens,'sca')
>>> print(classes)
CUKE
lingpy.sequence.sound_classes.tokens2morphemes(tokens, **keywords)

Split a string into morphemes if it contains separators.

Parameters:

sep : str (default=”◦”)

Select your morpheme separator.

word_sep: str (default=”_”) :

Select your word separator.

Returns:

morphemes : list

A nested list of the original segments split into morphemes.

Notes

Function splits a list of tokens into subsequent lists of morphemes if the list contains morpheme separators. If no separators are found but tone markers are present, it will still split the string according to the tones. If you want to avoid this behavior, set the keyword split_on_tones to False.
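Examples

A minimal sketch, assuming the default “◦” morpheme separator and that, as documented above, the separator itself is dropped from the nested output:

>>> from lingpy.sequence.sound_classes import tokens2morphemes
>>> tokens2morphemes(['t', 'a', '◦', 'k', 'a'])
[['t', 'a'], ['k', 'a']]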

Module contents

Module provides methods and functions for dealing with linguistic sequences.