lingpy.sequence package


lingpy.sequence.generate module

Module provides simple basic classes for sequence generation using Markov models.

class lingpy.sequence.generate.MCBasic(seqs)

Bases: object

Basic class for creating Markov chains from sequence training data.


seqs : list

A list of sequences. Sequences are assumed to be tokenized, i.e. they should be either passed as lists or as tuples.


Create random sequence from the distribution.
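
As a rough illustration of the idea (not LingPy's actual implementation), a first-order Markov chain over tokenized sequences could be sketched as follows. The helper names `build_chain` and `random_sequence` and the boundary markers "#" and "$" are assumptions made for this example only.

```python
import random
from collections import defaultdict


def build_chain(seqs, start="#", end="$"):
    """Collect observed transitions between adjacent tokens.

    `start` and `end` are hypothetical boundary markers, not LingPy names.
    """
    chain = defaultdict(list)
    for seq in seqs:
        padded = [start] + list(seq) + [end]
        for prev, nxt in zip(padded, padded[1:]):
            # Repeated entries keep the observed transition frequencies.
            chain[prev].append(nxt)
    return chain


def random_sequence(chain, rng, start="#", end="$"):
    """Walk the chain from the start marker until the end marker is drawn."""
    out = []
    current = start
    while True:
        current = rng.choice(chain[current])
        if current == end:
            return out
        out.append(current)


training = [["t", "a"], ["t", "a", "k", "a"]]
chain = build_chain(training)
print(random_sequence(chain, random.Random(42)))
```

Storing successors in a list with repetitions means that `rng.choice` samples each transition in proportion to how often it was seen in the training data.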

class lingpy.sequence.generate.MCPhon(words, tokens=False, prostrings=[], classes=False, class_model=<sca-model "sca">, **keywords)

Bases: lingpy.sequence.generate.MCBasic

Class for the creation of phonetic sequences (“pseudo words”).


words : list

List of phonetic sequences. This list can contain tokenized sequences (lists or tuples), or simple untokenized IPA strings.

tokens : bool (default=False)

If set to True, no tokenization of input sequences is carried out.

prostrings : list (default=[])

List containing the prosodic profiles of the input sequences. If the list is empty, the profiles are generated automatically.

evaluate_string(string, tokens=False, **keywords)
get_string(new=True, tokens=False)

Generate a string from the Markov chain created from the training data.


new : bool (default=True)

Determine whether the string created should be different from the training data or not.

tokens : bool (default=False)

If set to True, the full list of tokens that was used internally to represent the sequences as a Markov chain is returned.

lingpy.sequence.sound_classes module

Module provides various methods for the handling of sound classes.

lingpy.sequence.sound_classes.asjp2tokens(seq, merge_vowels=True)

Convert a given sequence into a sequence of bigrams.

lingpy.sequence.sound_classes.check_tokens(tokens, **keywords)

Function checks whether tokens are given in a consistent input format.

lingpy.sequence.sound_classes.class2tokens(tokens, classes, gap_char='-', local=False)

Turn an aligned sound-class sequence into an aligned sequence of IPA tokens.


tokens : list

The list of tokens corresponding to the unaligned IPA string.

classes : string or list

The aligned class string.

gap_char : string (default="-")

The character which indicates gaps in the output string.

local : bool (default=False)

If set to True a local alignment with prefix and suffix can be converted.


alignment : list

A list of tokens with gaps at the positions where they occurred in the alignment of the class string.


>>> from lingpy import *
>>> tokens = ipa2tokens('t͡sɔyɡə')
>>> aligned_sequence = 'CU-KE'
>>> print(', '.join(class2tokens(tokens, aligned_sequence)))
t͡s, ɔy, -, ɡ, ə
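
The core of this conversion can be sketched in a few lines. This is a minimal illustration, assuming a global alignment in which every non-gap class symbol corresponds to exactly one token; the function name `realign_tokens` is hypothetical and the `local` option is not covered.

```python
def realign_tokens(tokens, aligned_classes, gap_char="-"):
    """Insert gaps into a token list, following an aligned class string."""
    remaining = iter(tokens)
    aligned = []
    for symbol in aligned_classes:
        if symbol == gap_char:
            # A gap in the class string becomes a gap in the token list.
            aligned.append(gap_char)
        else:
            # Every other class symbol consumes the next unaligned token.
            aligned.append(next(remaining))
    return aligned


tokens = ["t͡s", "ɔy", "ɡ", "ə"]
print(", ".join(realign_tokens(tokens, "CU-KE")))
```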

Convert a given sequence into a sequence of trigrams.

lingpy.sequence.sound_classes.get_all_ngrams(sequence, sort=False)

Function returns all possible n-grams of a given sequence.


sequence : list or str

The sequence that shall be converted into its n-gram representation.


out : list

A list of all ngrams of the input word, sorted in decreasing order of length.


>>> get_all_ngrams('abcde')
['abcde', 'bcde', 'abcd', 'cde', 'abc', 'bcd', 'ab', 'de', 'cd', 'bc', 'a', 'e', 'b', 'd', 'c']
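
For orientation, the set of all n-grams can be enumerated with a pair of nested ranges. This is only a sketch under the assumption that an n-gram is any contiguous slice of the input; the name `all_ngrams` is hypothetical, and the order of items within each length differs from the output shown above.

```python
def all_ngrams(sequence):
    """Return every contiguous slice of `sequence`, longest first."""
    n = len(sequence)
    return [
        sequence[start:start + size]
        for size in range(n, 0, -1)          # lengths n, n-1, ..., 1
        for start in range(n - size + 1)     # every start position for that length
    ]


print(all_ngrams("abc"))
```

A sequence of length n has n * (n + 1) / 2 such slices, which gives 15 for 'abcde'.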
lingpy.sequence.sound_classes.get_n_ngrams(sequence, ngram=4)

Convert a given sequence into a sequence of n-grams.

lingpy.sequence.sound_classes.ipa2tokens(istring, **keywords)

Tokenize IPA-encoded strings.


seq : str

The input sequence that shall be tokenized.

diacritics : {str, None} (default=None)

A string containing all diacritics which shall be considered in the respective analysis. When set to None, the default diacritic string will be used.

vowels : {str, None} (default=None)

A string containing all vowel symbols which shall be considered in the respective analysis. When set to None, the default vowel string will be used.

tones : {str, None} (default=None)

A string containing all tone letter symbols which shall be considered in the respective analysis. When set to None, the default tone string will be used.

combiners : str (default=”͜͡”)

A string with characters that are used to combine two separate characters (compare affricates such as t͡s).

breaks : str (default="-.")

A string containing the characters that indicate that a new token starts right after them. These can be used to indicate that two consecutive vowels should not be treated as diphthongs, or for diacritics that are put before the following letter.

merge_vowels : bool (default=True)

Indicate whether vowels should be merged into diphthongs, or whether each vowel symbol should be considered separately.

merge_geminates : bool (default=False)

Indicate whether identical symbols should be merged into one token or kept separate.

expand_nasals : bool (default=False)

semi_diacritics : str (default='')

Indicate which symbols shall be treated as “semi-diacritics”, that is, as symbols which can occur on their own, but which, when preceded by a consonant, form clusters with it. To disable this feature, set the keyword to an empty string.

clean_string : bool (default=False)

Conduct a rough string-cleaning strategy by which all items between brackets are removed along with the brackets, and


tokens : list

A list of IPA tokens.


>>> from lingpy import *
>>> myseq = 't͡sɔyɡə'
>>> ipa2tokens(myseq)
['t͡s', 'ɔy', 'ɡ', 'ə']
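
The interplay of `combiners` and `merge_vowels` can be illustrated with a heavily simplified tokenizer. This is a sketch only: the function name `simple_tokenize` and the tiny vowel inventory are assumptions for this example, and diacritics, tones, breaks, geminates and semi-diacritics are deliberately ignored.

```python
def simple_tokenize(text, vowels, combiners="\u035c\u0361", merge_vowels=True):
    """Split a string into tokens, honouring tie bars and vowel merging."""
    tokens = []
    join_next = False
    for char in text:
        if char in combiners and tokens:
            # A tie bar attaches to the previous token and binds the next character.
            tokens[-1] += char
            join_next = True
        elif join_next:
            tokens[-1] += char
            join_next = False
        elif merge_vowels and tokens and char in vowels and tokens[-1][-1] in vowels:
            # Adjacent vowels are merged into one token (a diphthong).
            tokens[-1] += char
        else:
            tokens.append(char)
    return tokens


# Hypothetical vowel inventory, just large enough for the example word.
toy_vowels = "\u0254y\u0259"
word = "t\u0361s\u0254y\u0261\u0259"
print(simple_tokenize(word, toy_vowels))
print(simple_tokenize(word, toy_vowels, merge_vowels=False))
```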
lingpy.sequence.sound_classes.ono_parse(word, output='', **keywords)

Carry out a rough onset-nucleus-offset parse of a word in IPA.


The method is an approximation and is not expected to be free of flaws. It is, however, helpful in most instances. It defines a simple model in which seven different contexts within a word are distinguished:

  • “#”: onset cluster in a word’s initial syllable
  • “C”: onset cluster in a word’s non-initial syllable
  • “V”: nucleus vowel in a word’s initial syllable
  • “v”: nucleus vowel in a word’s non-initial and non-final syllable
  • “>”: nucleus vowel in a word’s final syllable
  • “c”: offset cluster in a word’s non-final syllable
  • “$”: offset cluster in a word’s final syllable
lingpy.sequence.sound_classes.pgrams(sequence, **keywords)

Convert a given sequence into bigrams consisting of prosodic string symbols and the tokens of the original sequence.

lingpy.sequence.sound_classes.pid(almA, almB, mode=2)

Calculate the Percentage Identity (PID) score for aligned sequence pairs.


almA, almB : string or list

The aligned sequences which can be either a string or a list.

mode : { 1, 2, 3, 4, 5 }

Indicate which PID score should be calculated. Modes 1 to 4 correspond to the four variants described in Raghava2006; mode 5 is added for linguistic purposes:

  1. identical positions / (aligned positions + internal gap positions),
  2. identical positions / aligned positions,
  3. identical positions / shortest sequence,
  4. identical positions / shortest sequence (including internal gap positions), or
  5. identical positions / (aligned positions + 2 * number of gaps).

score : float

The PID score of the given alignment as a floating point number between 0 and 1.


The PID score is a common measure for the diversity of a given alignment. The implementation employed by LingPy follows the description of Raghava2006 where four different variants of PID scores are distinguished. Essentially, the PID score is based on the comparison of identical residue pairs with the total number of residue pairs in a given alignment.
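
To make the ratios concrete, three of the modes can be sketched as below. This is not LingPy's implementation: the name `pid_sketch` is hypothetical, a "gap" is assumed to mean an alignment column in which either sequence has the gap symbol, and modes 1 and 4 are left out because they depend on how internal gaps are delimited.

```python
def pid_sketch(alm_a, alm_b, mode=2, gap="-"):
    """Compute a percentage-identity score for two aligned sequences."""
    if len(alm_a) != len(alm_b):
        raise ValueError("aligned sequences must have the same length")

    identical = aligned = gap_columns = 0
    for a, b in zip(alm_a, alm_b):
        if a == gap or b == gap:
            gap_columns += 1      # assumption: one gap per gapped column
        else:
            aligned += 1          # both sequences have a residue here
            if a == b:
                identical += 1

    if mode == 2:
        denominator = aligned
    elif mode == 3:
        len_a = sum(1 for a in alm_a if a != gap)
        len_b = sum(1 for b in alm_b if b != gap)
        denominator = min(len_a, len_b)
    elif mode == 5:
        denominator = aligned + 2 * gap_columns
    else:
        raise ValueError("only modes 2, 3 and 5 are sketched here")

    return identical / denominator if denominator else 0.0


for mode in (2, 3, 5):
    print(mode, pid_sketch("abcd-", "abxde", mode=mode))
```

In the example pair there are three identical positions, four aligned positions and one gapped column, so mode 5 penalizes the gap more strongly than modes 2 and 3.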


Load an alignment from the test suite.

>>> from lingpy import *
>>> pairs = PSA(get_file('test.psa'))

Extract the alignments of the first aligned sequence pair.

>>> almA,almB,score = pairs.alignments[0]

Calculate the PID score of the alignment.

>>> pid(almA,almB)
lingpy.sequence.sound_classes.prosodic_string(string, _output=True, **keywords)

Create a prosodic string of the sonority profile of a sequence.


seq : list

A list of integers indicating the sonority of the tokens of the underlying sequence.


prostring : string

A prosodic string corresponding to the sonority profile of the underlying sequence.


A prosodic string is a sequence of specific characters which indicate the respective prosodic context of each segment (see List2012 or List2012a for a detailed description). In contrast to the previous model, the current implementation allows for a more fine-grained distinction between different prosodic segments. The current scheme distinguishes 9 prosodic positions:

  • A: sequence-initial consonant
  • B: syllable-initial, non-sequence initial consonant in a context of ascending sonority
  • C: non-syllable-initial, non-sequence-initial consonant in a context of ascending sonority
  • L: non-syllable-final consonant in descending environment
  • M: syllable-final consonant in descending environment
  • N: word-final consonant
  • X: first vowel in a word
  • Y: non-final vowel in a word
  • Z: vowel occurring in the last position of a word


>>> prosodic_string(ipa2tokens('t͡sɔyɡə'))
lingpy.sequence.sound_classes.prosodic_weights(prostring, _transform={})

Calculate prosodic weights for each position of a sequence.


prostring : string

A prosodic string as it is returned by prosodic_string().

_transform : dict

A dictionary that determines how prosodic strings should be transformed into prosodic weights. Use this dictionary to adjust the prosodic strings to your own user-defined prosodic weight schema.


weights : list

A list of floats reflecting the modification of the weight for each position.


Prosodic weights are specific scaling factors which decrease or increase the gap score of a given segment in alignment analyses (see List2012 or List2012a for a detailed description).


>>> from lingpy import *
>>> prostring = '#vC>'
>>> prosodic_weights(prostring)
[2.0, 1.3, 1.5, 0.7]
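
Conceptually, a `_transform` dictionary maps each prosodic symbol to a scaling factor. The lookup can be sketched as follows; the function name `weights_from_prostring` is hypothetical, and the dictionary values are simply read off the example output above rather than taken from LingPy's defaults.

```python
def weights_from_prostring(prostring, transform):
    """Look up one weight per prosodic symbol."""
    return [transform[symbol] for symbol in prostring]


# Illustrative mapping only; values copied from the example output above.
example_transform = {"#": 2.0, "v": 1.3, "C": 1.5, ">": 0.7}
print(weights_from_prostring("#vC>", example_transform))
```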

Convert sequence in IPA-sampa-format to IPA-unicode.


This function is based on code taken from Peter Kleiweg (

lingpy.sequence.sound_classes.syllabify(seq, output='flat', **keywords)

Carry out a simple syllabification of a sequence, using sonority as a proxy.


output : {"flat", "breakpoints", "nested"} (default="flat")

Define how to output the syllabification. Select between:

  • "flat": A syllable separator is introduced to mark the syllable boundaries.
  • "breakpoints": A tuple consisting of indices that slice the original sequence into syllables is returned.
  • "nested": A nested list reflecting the syllable structure is returned.

sep : str (default="◦")

Select your preferred syllable separator.


syllable : list

Either a flat list containing a syllable separator, or a nested list reflecting the syllable structure, or a list of tuples containing the indices indicating where the input sequence should be sliced in order to split it into syllables.


When analyzing the sequence, a new syllable is started wherever the sonority profile of the sequence reaches a local minimum. When an aligned string is passed to this function, gaps are ignored when computing boundaries, but are re-introduced afterwards if the alignment is passed in segmented form.
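
The boundary rule can be sketched on a precomputed sonority profile. This is a simplified illustration, not LingPy's code: the name `syllabify_sketch` and the integer sonority values are assumptions, only the "flat" output is produced, and gaps are not handled.

```python
def syllabify_sketch(tokens, sonority, sep="◦"):
    """Insert `sep` before every token that sits at a local sonority minimum."""
    if len(tokens) != len(sonority):
        raise ValueError("need one sonority value per token")

    out = []
    last = len(tokens) - 1
    for i, token in enumerate(tokens):
        # Word-initial and word-final tokens never open a new syllable here.
        is_minimum = 0 < i < last and sonority[i - 1] >= sonority[i] < sonority[i + 1]
        if is_minimum:
            out.append(sep)
        out.append(token)
    return out


# Hypothetical sonority values: 1 = stop, 3 = fricative, 7 = vowel.
print(syllabify_sketch(["t", "a", "s", "t", "a"], [1, 7, 3, 1, 7]))
```

In the example the profile falls from the vowel through "s" to the second "t" and then rises again, so the boundary is placed before the second "t".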

lingpy.sequence.sound_classes.token2class(token, model, **keywords)

Convert a single token into a sound-class.

token : str

A token (IPA-string).

model : Model

A Model object.

c : str

The corresponding sound-class value.

lingpy.sequence.sound_classes.tokens2class(tstring, model, **keywords)

Convert tokenized IPA strings into their respective class strings.


tokens : list

A list of tokens as they are returned from ipa2tokens().

model : Model

A Model object.


classes : string

A sound-class representation of the tokenized IPA string.


>>> from lingpy import *
>>> tokens = ipa2tokens('t͡sɔyɡə')
>>> classes = tokens2class(tokens,'sca')
>>> print(classes)
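
At its simplest, a sound-class conversion is a lookup from tokens to class symbols. The sketch below assumes a plain dictionary in place of a Model object; the name `tokens_to_classes`, the placeholder "0" for unknown tokens, and the four-entry toy mapping (chosen to match the class string 'CU-KE' used with class2tokens) are all assumptions for illustration.

```python
# Toy mapping, not a real LingPy sound-class model.
TOY_MODEL = {"t͡s": "C", "ɔy": "U", "ɡ": "K", "ə": "E"}


def tokens_to_classes(tokens, model, unknown="0"):
    """Map each token to its class symbol and join the result into a string."""
    return "".join(model.get(token, unknown) for token in tokens)


print(tokens_to_classes(list(TOY_MODEL), TOY_MODEL))
```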
lingpy.sequence.sound_classes.tokens2morphemes(tokens, **keywords)

Function splits a list of tokens into subsequent lists of morphemes if the list contains morpheme separators.


sep : str (default="◦")

Select your morpheme separator.

word_sep : str (default="_")

Select your word separator.


morphemes : list

A nested list of the original segments split into morphemes.
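
The splitting step can be sketched as follows, under the assumption that both the morpheme separator and the word separator start a new sublist and that empty sublists are dropped. The name `split_morphemes` is hypothetical.

```python
def split_morphemes(tokens, sep="◦", word_sep="_"):
    """Split a flat token list into sublists at every separator."""
    morphemes = [[]]
    for token in tokens:
        if token in (sep, word_sep):
            morphemes.append([])            # a separator opens a new morpheme
        else:
            morphemes[-1].append(token)
    # Drop empty sublists caused by leading, trailing or doubled separators.
    return [morpheme for morpheme in morphemes if morpheme]


print(split_morphemes(["h", "a", "◦", "n", "t", "_", "o"]))
```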


Convert a given sequence into a sequence of trigrams.

Module contents

Module provides methods and functions for dealing with linguistic sequences.