lingpy.sequence package

Submodules

lingpy.sequence.generate module

Module provides simple basic classes for sequence generation using Markov models.

class lingpy.sequence.generate.MCBasic(seqs)

Bases: object

Basic class for creating Markov chains from sequence training data.

Parameters:

seqs : list

A list of sequences. Sequences are assumed to be tokenized, i.e. they should be either passed as lists or as tuples.

walk()

Create random sequence from the distribution.
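
A minimal usage sketch (the training sequences below are invented for illustration, and the output is omitted since walk() draws a random sequence):

>>> from lingpy.sequence.generate import MCBasic
>>> seqs = [['t', 'a', 'k'], ['t', 'u', 'k'], ['m', 'a', 'k']]   # hypothetical tokenized training data
>>> mc = MCBasic(seqs)
>>> mc.walk()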

class lingpy.sequence.generate.MCPhon(words, tokens=False, prostrings=[], classes=False, class_model=<sca-model "sca">, **keywords)

Bases: lingpy.sequence.generate.MCBasic

Class for the creation of phonetic sequences (“pseudo words”).

Parameters:

words : list

List of phonetic sequences. This list can contain tokenized sequences (lists or tuples), or simple untokenized IPA strings.

tokens : bool (default=False)

If set to True, no tokenization of input sequences is carried out.

prostrings : list (default=[])

List containing the prosodic profiles of the input sequences. If the list is empty, the profiles are generated automatically.

evaluate_string(string, tokens=False, **keywords)
get_string(new=True, tokens=False)

Generate a string from the Markov chain created from the training data.

Parameters:

new : bool (default=True)

Determine whether the string created should be different from the training data or not.

tokens : bool (default=False)

If set to True, the full list of tokens that was internally used to represent the sequences as a Markov chain is returned.
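
A minimal sketch of the typical workflow (the word list is invented, and outputs are omitted since generation is random): train the chain on a handful of IPA strings, generate a pseudo word, and evaluate an attested string against the model.

>>> from lingpy.sequence.generate import MCPhon
>>> words = ['hando', 'fanto', 'hanti', 'fantu']   # hypothetical untokenized IPA strings
>>> mc = MCPhon(words)
>>> pseudo = mc.get_string(new=True)
>>> mc.evaluate_string('hanto')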

lingpy.sequence.profile module

Module provides methods for the handling of orthography profiles.

lingpy.sequence.profile.context_profile(wordlist, ref='ipa', col='doculect', semi_diacritics='hsʃ̢ɕʂʐʑʒw', merge_vowels=False, brackets=None, splitters='/, ;~', merge_geminates=True, clts=False, bad_word='<???>', bad_sound='<?>', unknown_sound='!{0}', examples=2)

Create an advanced Orthography Profile with context and doculect information.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

A wordlist from which you want to derive an initial orthography profile.

ref : str (default=”ipa”)

The name of the reference column in which the words are stored.

col : str (default=”doculect”)

Indicate in which column the information on the language variety is stored.

semi_diacritics : str

Indicate characters which can occur both as “diacritics” (second part in a sound) or alone.

merge_vowels : bool (default=False)

Indicate whether consecutive vowels should be merged.

brackets : dict

A dictionary with opening brackets as key and closing brackets as values. Defaults to a pre-defined set of frequently occurring brackets.

splitters : str

The characters which force the automatic splitting of an entry.

clts : dict (default=None)

A dictionary(like) object that converts a given source sound into a potential target sound, using the get()-method of the dictionary. Normally, we think of a CLTS instance here (that is: a cross-linguistic transcription system as defined in the pyclts package).

bad_word : str (default=”<???>”)

Indicate how words that could not be parsed should be handled. Note that both “bad_word” and “bad_sound” are format-strings, so you can add formatting information here.

bad_sound : str (default=”<?>”)

Indicate how sounds that could not be converted to a sound class should be handled. Note that both “bad_word” and “bad_sound” are format-strings, so you can add formatting information here.

unknown_sound : str (default=”!{0}”)

If a clts object is passed, use this string to indicate sounds that are classified as “unknown sound” in the CLTS framework.

examples : int (default=2)

Indicate the number of examples that should be printed out.

Returns:

profile : generator

A generator of tuples indicating the segment, its frequency, the conversion to sound classes in the Dolgopolsky sound-class model, and the Unicode code points.
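
A sketch of the intended call, assuming a wordlist file with an “ipa” column (the file name is hypothetical; inspect the rows rather than relying on a fixed column order):

>>> from lingpy import Wordlist
>>> from lingpy.sequence.profile import context_profile
>>> wl = Wordlist('wordlist.tsv')        # hypothetical input file
>>> for row in context_profile(wl, ref='ipa'):
...     print(*row, sep='\t')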

lingpy.sequence.profile.simple_profile(wordlist, ref='ipa', semi_diacritics='hsʃ̢ɕʂʐʑʒw', merge_vowels=False, brackets=None, splitters='/, ;~', merge_geminates=True, bad_word='<???>', bad_sound='<?>', clts=None, unknown_sound='!{0}')

Create an initial Orthography Profile using Lingpy’s clean_string procedure.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

A wordlist from which you want to derive an initial orthography profile.

ref : str (default=”ipa”)

The name of the reference column in which the words are stored.

semi_diacritics : str

Indicate characters which can occur both as “diacritics” (second part in a sound) or alone.

merge_vowels : bool (default=False)

Indicate whether consecutive vowels should be merged.

brackets : dict

A dictionary with opening brackets as key and closing brackets as values. Defaults to a pre-defined set of frequently occurring brackets.

splitters : str

The characters which force the automatic splitting of an entry.

clts : dict (default=None)

A dictionary(like) object that converts a given source sound into a potential target sound, using the get()-method of the dictionary. Normally, we think of a CLTS instance here (that is: a cross-linguistic transcription system as defined in the pyclts package).

bad_word : str (default=”<???>”)

Indicate how words that could not be parsed should be handled. Note that both “bad_word” and “bad_sound” are format-strings, so you can add formatting information here.

bad_sound : str (default=”<?>”)

Indicate how sounds that could not be converted to a sound class should be handled. Note that both “bad_word” and “bad_sound” are format-strings, so you can add formatting information here.

unknown_sound : str (default=”!{0}”)

If a clts object is passed, use this string to indicate sounds that are classified as “unknown sound” in the CLTS framework.

Returns:

profile : generator

A generator of tuples indicating the segment, its frequency, the conversion to sound classes in the Dolgopolsky sound-class model, and the Unicode code points.
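
The usage mirrors context_profile() above; a minimal sketch with a hypothetical file name:

>>> from lingpy import Wordlist
>>> from lingpy.sequence.profile import simple_profile
>>> wl = Wordlist('wordlist.tsv')        # hypothetical input file
>>> profile = list(simple_profile(wl, ref='ipa'))
>>> len(profile)                         # roughly the number of distinct segments in the data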

lingpy.sequence.sound_classes module

Module provides various methods for the handling of sound classes.

lingpy.sequence.sound_classes.asjp2tokens(seq, merge_vowels=True)
lingpy.sequence.sound_classes.bigrams(sequence)

Convert a given sequence into a sequence of bigrams.
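
For instance (a sketch; whether boundary symbols are added as padding depends on the implementation, so outputs are not shown):

>>> from lingpy.sequence.sound_classes import bigrams, trigrams
>>> tokens = ['t͡s', 'ɔy', 'ɡ', 'ə']
>>> bigrams(tokens)     # pairs of adjacent tokens
>>> trigrams(tokens)    # triples of adjacent tokens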

lingpy.sequence.sound_classes.check_tokens(tokens, **keywords)

Function checks whether tokens are given in a consistent input format.

lingpy.sequence.sound_classes.class2tokens(tokens, classes, gap_char='-', local=False)

Turn aligned sound-class sequences into aligned sequences of IPA tokens.

Parameters:

tokens : list

The list of tokens corresponding to the unaligned IPA string.

classes : string or list

The aligned class string.

gap_char : string (default=”-“)

The character which indicates gaps in the output string.

local : bool (default=False)

If set to True, a local alignment with prefix and suffix can be converted.

Returns:

alignment : list

A list of tokens with gaps at the positions where they occurred in the alignment of the class string.

Examples

>>> from lingpy import *
>>> tokens = ipa2tokens('t͡sɔyɡə')
>>> aligned_sequence = 'CU-KE'
>>> print(', '.join(class2tokens(tokens, aligned_sequence)))
t͡s, ɔy, -, ɡ, ə
lingpy.sequence.sound_classes.clean_string(sequence, semi_diacritics='hsʃ̢ɕʂʐʑʒw', merge_vowels=False, segmentized=False, rules=None, ignore_brackets=True, brackets=None, split_entries=True, splitters='/, ;~', preparse=None, merge_geminates=True, normalization_form='NFC')

Function exhaustively checks how well a sequence is understood by LingPy.

Parameters:

semi_diacritics : str

Indicate characters which can occur both as “diacritics” (second part in a sound) or alone.

merge_vowels : bool (default=False)

Indicate whether consecutive vowels should be merged.

segmentized : bool (default=False)

Indicate whether the input string is already segmentized or not. If set to True, items in brackets can no longer be ignored.

rules : dict

Replacement rules to be applied to a segmentized string.

ignore_brackets : bool

If set to True, ignore all content within a given bracket.

brackets : dict

A dictionary with opening brackets as key and closing brackets as values. Defaults to a pre-defined set of frequently occurring brackets.

split_entries : bool (default=True)

Indicate whether multiple entries (with a comma etc.) should be split into separate entries.

splitters : str

The characters which force the automatic splitting of an entry.

preparse : list (default=None)

List of tuples giving simple replacement patterns (source and target) which are applied before any processing starts.

Returns:

cleaned_strings : list

A list of cleaned strings which are segmented by space characters. If splitters are encountered, indicating that the entry contains several variants, the list will contain one entry per variant. If there are no splitters, the list contains only one element.
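
A hedged sketch of a typical call on a raw dictionary entry (the entry is invented; with the default keywords, bracketed material is ignored and the comma splits the entry into two cleaned strings):

>>> from lingpy.sequence.sound_classes import clean_string
>>> clean_string('tsɔygə (dial.), tsɔyk')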

lingpy.sequence.sound_classes.codepoint(s)

Return unicode codepoint(s) for a character set.

lingpy.sequence.sound_classes.fourgrams(sequence)

Convert a given sequence into a sequence of fourgrams.

lingpy.sequence.sound_classes.get_all_ngrams(sequence, sort=False)

Function returns all possible n-grams of a given sequence.

Parameters:

sequence : list or str

The sequence that shall be converted into its ngram representation.

Returns:

out : list

A list of all ngrams of the input word, sorted in decreasing order of length.

Examples

>>> get_all_ngrams('abcde')
['abcde', 'bcde', 'abcd', 'cde', 'abc', 'bcd', 'ab', 'de', 'cd', 'bc', 'a', 'e', 'b', 'd', 'c']
lingpy.sequence.sound_classes.get_n_ngrams(sequence, ngram=4)

Convert a given sequence into a sequence of ngrams.

lingpy.sequence.sound_classes.ipa2tokens(istring, **keywords)

Tokenize IPA-encoded strings.

Parameters:

istring : str

The input sequence that shall be tokenized.

diacritics : {str, None} (default=None)

A string containing all diacritics which shall be considered in the respective analysis. When set to None, the default diacritic string will be used.

vowels : {str, None} (default=None)

A string containing all vowel symbols which shall be considered in the respective analysis. When set to None, the default vowel string will be used.

tones : {str, None} (default=None)

A string indicating all tone letter symbols which shall be considered in the respective analysis. When set to None, the default tone string will be used.

combiners : str (default=”͜͡”)

A string with characters that are used to combine two separate characters (compare affricates such as t͡s).

breaks : str (default=”-.”)

A string containing the characters that indicate that a new token starts right after them. These can be used to indicate that two consecutive vowels should not be treated as diphthongs or for diacritics that are put before the following letter.

merge_vowels : bool (default=False)

Indicate whether consecutive vowels should be merged into diphthongs, or whether each vowel symbol should be considered separately.

merge_geminates : bool (default=False)

Indicate, whether identical symbols should be merged into one token, or rather be kept separate.

expand_nasals : bool (default=False)

semi_diacritics : str (default="")

Indicate which symbols shall be treated as “semi-diacritics”, that is, as symbols which can occur on their own, but which eventually, when preceded by a consonant, will form clusters with it. If you want to disable this feature, just set the keyword to an empty string.

clean_string : bool (default=False)

Conduct a rough string-cleaning strategy by which all items between brackets are removed along with the brackets.

Returns:

tokens : list

A list of IPA tokens.

Examples

>>> from lingpy import *
>>> myseq = 't͡sɔyɡə'
>>> ipa2tokens(myseq)
['t͡s', 'ɔy', 'ɡ', 'ə']
lingpy.sequence.sound_classes.ono_parse(word, output='', **keywords)

Carry out a rough onset-nucleus-offset parse of a word in IPA.

Notes

The method is an approximation and is not supposed to be free of flaws. It is, however, rather helpful in most instances. It defines a rather simple model in which seven different contexts are distinguished for each word:

  • “#”: onset cluster in a word’s initial syllable
  • “C”: onset cluster in a word’s non-initial syllables
  • “V”: nucleus vowel in a word’s initial syllable
  • “v”: nucleus vowel in a word’s non-initial and non-final syllable
  • “>”: nucleus vowel in a word’s final syllable
  • “c”: offset cluster in a word’s non-final syllable
  • “$”: offset cluster in a word’s final syllable
lingpy.sequence.sound_classes.pgrams(sequence, **keywords)

Convert a given sequence into bigrams consisting of prosodic string symbols and the tokens of the original sequence.

lingpy.sequence.sound_classes.pid(almA, almB, mode=2)

Calculate the Percentage Identity (PID) score for aligned sequence pairs.

Parameters:

almA, almB : string or list

The aligned sequences which can be either a string or a list.

mode : { 1, 2, 3, 4, 5 }

Indicate which of the four possible PID scores described in Raghava2006 should be calculated; the fifth possibility is added for linguistic purposes:

  1. identical positions / (aligned positions + internal gap positions),
  2. identical positions / aligned positions,
  3. identical positions / shortest sequence, or
  4. identical positions / shortest sequence (including internal gap pos.)
  5. identical positions / (aligned positions + 2 * number of gaps)
Returns:

score : float

The PID score of the given alignment as a floating point number between 0 and 1.

See also

lingpy.compare.Multiple.get_pid

Notes

The PID score is a common measure for the diversity of a given alignment. The implementation employed by LingPy follows the description of Raghava2006 where four different variants of PID scores are distinguished. Essentially, the PID score is based on the comparison of identical residue pairs with the total number of residue pairs in a given alignment.

Examples

Load an alignment from the test suite.

>>> from lingpy import *
>>> pairs = PSA(get_file('test.psa'))

Extract the alignments of the first aligned sequence pair.

>>> almA,almB,score = pairs.alignments[0]

Calculate the PID score of the alignment.

>>> pid(almA,almB)
0.44444444444444442
lingpy.sequence.sound_classes.prosodic_string(string, _output=True, **keywords)

Create a prosodic string of the sonority profile of a sequence.

Parameters:

string : list

A list of integers indicating the sonority of the tokens of the underlying sequence.

stress : str (default=rcParams[‘stress’])

A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.

diacritics : str (default=rcParams[‘diacritics’])

A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.

cldf : bool (default=False)

If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h₂ in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.

Returns:

prostring : string

A prosodic string corresponding to the sonority profile of the underlying sequence.

Notes

A prosodic string is a sequence of specific characters which indicate their respective prosodic context (see List2012 or List2012a for a detailed description). In contrast to the previous model, the current implementation allows for a more fine-grained distinction between different prosodic segments. The current scheme distinguishes the following prosodic positions:

  • A: sequence-initial consonant
  • B: syllable-initial, non-sequence initial consonant in a context of ascending sonority
  • C: non-syllable, non-initial consonant in ascending sonority context
  • L: non-syllable-final consonant in descending environment
  • M: syllable-final consonant in descending environment
  • N: word-final consonant
  • X: first vowel in a word
  • Y: non-final vowel in a word
  • Z: vowel occurring in the last position of a word
  • T: tone
  • _: word break

Examples

>>> prosodic_string(ipa2tokens('t͡sɔyɡə'))
'AXBZ'
lingpy.sequence.sound_classes.prosodic_weights(prostring, _transform={})

Calculate prosodic weights for each position of a sequence.

Parameters:

prostring : string

A prosodic string as it is returned by prosodic_string().

_transform : dict

A dictionary that determines how prosodic strings should be transformed into prosodic weights. Use this dictionary to adjust the prosodic strings to your own user-defined prosodic weight schema.

Returns:

weights : list

A list of floats reflecting the modification of the weight for each position.

See also

prosodic_string

Notes

Prosodic weights are specific scaling factors which decrease or increase the gap score of a given segment in alignment analyses (see List2012 or List2012a for a detailed description).

Examples

>>> from lingpy import *
>>> prostring = '#vC>'
>>> prosodic_weights(prostring)
[2.0, 1.3, 1.5, 0.7]
lingpy.sequence.sound_classes.sampa2uni(seq)

Convert sequence in IPA-sampa-format to IPA-unicode.

Notes

This function is based on code taken from Peter Kleiweg (http://www.let.rug.nl/~kleiweg/L04/devel/python/xsampa.html).
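
A minimal sketch (the exact Unicode rendering depends on the conversion table, so the output is only indicated in the comment):

>>> from lingpy.sequence.sound_classes import sampa2uni
>>> sampa2uni('t_hOxt@r')   # X-SAMPA input; expected to yield something like 'tʰɔxtər'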

lingpy.sequence.sound_classes.syllabify(seq, output='flat', **keywords)

Carry out a simple syllabification of a sequence, using sonority as a proxy.

Parameters:

output : {“flat”, “breakpoints”, “nested”} (default=”flat”)

Define how to output the syllabification. Select between:

  • “flat”: A syllable separator is introduced to mark the syllable boundaries.
  • “breakpoints”: A tuple consisting of indices that slice the original sequence into syllables is returned.
  • “nested”: A nested list reflecting the syllable structure is returned.

sep : str (default=”◦”)

Select your preferred syllable separator.

Returns:

syllable : list

Either a flat list containing a morpheme separator, or a nested list, reflecting the syllable structure, or a list of tuples containing the indices indicating where the input sequence should be sliced in order to split it into syllables.

Notes

When analyzing the sequence, we start a new syllable in all cases where we reach a deepest point in the sonority hierarchy of the sonority profile of the sequence. When passing an aligned string to this function, the gaps will be ignored when computing boundaries, but later on re-introduced, if the alignment is passed in segmented form.
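
A minimal sketch using the tokenized example from ipa2tokens() above (outputs omitted; the exact boundary placement depends on the sonority model):

>>> from lingpy import *
>>> tokens = ipa2tokens('t͡sɔyɡə')
>>> syllabify(tokens)                        # flat list with “◦” marking the boundary
>>> syllabify(tokens, output='breakpoints')  # indices for slicing the sequence instead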

lingpy.sequence.sound_classes.token2class(token, model, stress=None, diacritics=None, cldf=None)

Convert a single token into a sound-class.

Parameters:

token : str

A token (phonetic segment).

model : Model

A Model object.

stress : str (default=rcParams[‘stress’])

A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.

diacritics : str (default=rcParams[‘diacritics’])

A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.

cldf : bool (default=False)

If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h₂ in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.
Returns:

sound_class : str

A sound-class representation of the phonetic segment. If the segment cannot be resolved, the respective string will be rendered as “0” (zero).
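
Consistent with the tokens2class() example below (where ['t͡s', 'ɔy', 'ɡ', 'ə'] maps to “CUKE” in the SCA model), a single segment can be converted as follows; the Model object is loaded by name here:

>>> from lingpy import *
>>> token2class('t͡s', Model('sca'))
'C'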

lingpy.sequence.sound_classes.tokens2class(tokens, model, stress=None, diacritics=None, cldf=False)

Convert tokenized IPA strings into their respective class strings.

Parameters:

tokens : list

A list of tokens as they are returned from ipa2tokens().

model : Model

A Model object.

stress : str (default=rcParams[‘stress’])

A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.

diacritics : str (default=rcParams[‘diacritics’])

A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.

cldf : bool (default=False)

If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h₂ in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.

Returns:

classes : list

A sound-class representation of the tokenized IPA string in form of a list. If sound classes cannot be resolved, the respective string will be rendered as “0” (zero).

Notes

The function ~lingpy.sequence.sound_classes.token2class returns a “0” (zero) if the sound is not recognized by LingPy’s sound class models. While an unknown sound in a longer sequence is no problem for alignment algorithms, we have some unwanted and often even unforeseeable behavior, if the sequence is completely unknown. For this reason, this function raises a ValueError, if a resulting sequence only contains unknown sounds.

Examples

>>> from lingpy import *
>>> tokens = ipa2tokens('t͡sɔyɡə')
>>> classes = tokens2class(tokens,'sca')
>>> print(classes)
CUKE
lingpy.sequence.sound_classes.tokens2morphemes(tokens, **keywords)

Split a string into morphemes if it contains separators.

Parameters:

sep : str (default=”◦”)

Select your morpheme separator.

word_sep : str (default=”_”)

Select your word separator.

Returns:

morphemes : list

A nested list of the original segments split into morphemes.

Notes

Function splits a list of tokens into subsequent lists of morphemes if the list contains morpheme separators. If no separators are found but tone markers are present, the function will still split the string according to the tones. If you want to avoid this behavior, set the keyword split_on_tones to False.
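
As a sketch of the splitting behavior described above (the token list is invented; the exact treatment of the separator token is best verified against the output):

>>> from lingpy.sequence.sound_classes import tokens2morphemes
>>> tokens2morphemes(['t', 'a', '◦', 'b', 'u'])   # expected to yield two morphemes, roughly [['t', 'a'], ['b', 'u']]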

lingpy.sequence.sound_classes.trigrams(sequence)

Convert a given sequence into a sequence of trigrams.

lingpy.sequence.tiers module

Module provides tools to handle transcriptions as multi-tiered sequences.

lingpy.sequence.tiers.cvcv(sequence, **keywords)

Create a CV-template representation out of a sound sequence.

lingpy.sequence.tiers.get_stress(sound)
lingpy.sequence.tiers.is_consonant(sound)
lingpy.sequence.tiers.is_sound(sound, what)

Check whether a sound belongs to a given sound type (e.g., vowel, consonant, or tone).

lingpy.sequence.tiers.is_stressed(sound)

Quick check for stress.

lingpy.sequence.tiers.is_tone(sound)
lingpy.sequence.tiers.is_vowel(sound)
lingpy.sequence.tiers.remove_stress(sound)
lingpy.sequence.tiers.sound_type(sound)

Shortcut to determine basic sound type (C, V, or T).

Module contents

Module provides methods and functions for dealing with linguistic sequences.