What's New in Version 2.6.4?

Version 2.6.4 of LingPy brings a number of refinements, but it does not introduce many new algorithms. New algorithms can now mostly be found in additional packages, such as SinoPy (https://github.com/lingpy/sinopy) or LingRex (https://github.com/lingpy/lingrex). While SinoPy is a very specialized package for the manipulation of SEA and Chinese language data, LingRex offers several new algorithms that we may include in a future version of LingPy, but which have not yet been tested sufficiently to be included right now.

What this version does introduce is a new module for manipulating ngrams and Markov models (provided by Tiago Tresoldi) and a new function for importing CLDF data with the help of the Wordlist class and its derivatives.

In the following, we list both the features introduced in version 2.6.3 and the new functions added in version 2.6.4.

Loading CLDF Data from Wordlist Objects (2.6.4)

Instead of using the from_cldf function, you can now simply use the Wordlist class to load a CLDF dataset:

>>> from lingpy import *
>>> wl = Wordlist.from_cldf('path/to/metadata/json')

Handling Ngrams (2.6.4)

The new ngrams module offers a general class to handle Markov models and many utility functions for the manipulation of ngrams:

>>> from lingpy.sequence.ngrams import *
>>> list(get_n_ngrams('LingPy', 3))
[('$$$', '$$$', '$$$', 'L'),
 ('$$$', '$$$', 'L', 'i'),
 ('$$$', 'L', 'i', 'n'),
 ('L', 'i', 'n', 'g'),
 ('i', 'n', 'g', 'P'),
 ('n', 'g', 'P', 'y'),
 ('g', 'P', 'y', '$$$'),
 ('P', 'y', '$$$', '$$$'),
 ('y', '$$$', '$$$', '$$$')]

As you can see, this function lets you retrieve ngrams of any length from a given string. The new module also takes over the functions that were previously found in the sound_classes module, including shortcuts for creating bigrams and trigrams.
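To make the padding in the output above more concrete, here is a minimal pure-Python sketch. Note that padded_ngrams is a hypothetical helper written for this illustration and not LingPy's implementation; it assumes that a window of width n is moved over the sequence after n-1 padding symbols have been added on each side.

```python
def padded_ngrams(sequence, n, pad_symbol="$$$"):
    """Yield all windows of width n over the padded sequence.

    Hypothetical helper for illustration; not part of LingPy.
    """
    padding = (pad_symbol,) * (n - 1)
    padded = padding + tuple(sequence) + padding
    # Slide a window of width n over the padded sequence.
    for start in range(len(padded) - n + 1):
        yield padded[start:start + n]


four_grams = list(padded_ngrams("LingPy", 4))
print(four_grams[0])    # ('$$$', '$$$', '$$$', 'L')
print(four_grams[-1])   # ('y', '$$$', '$$$', '$$$')
print(len(four_grams))  # 9

# The same helper with width 2 yields padded bigrams.
bigrams = list(padded_ngrams("LingPy", 2))
print(bigrams[0])       # ('$$$', 'L')
```

With a window of width 4, the sketch yields the same nine tuples as the listing above. The padding makes sure that the first and the last segment of a sequence occur in every position of a window.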

Bugfixes in Wordlist Class (2.6.3)

We fixed some problems related to the Wordlist class in LingPy. First, if you make the transition from a Wordlist object to a LexStat object, the data will be deep-copied. As a result, writing:

>>> from lingpy.tests.util import test_data
>>> from lingpy import *
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> lex = LexStat(wl)
>>> lex.add_entries('segments', 'tokens', lambda x: x)
>>> 'segments' in wl.header
False

will no longer add the new entries to the original Wordlist object as well (as happened before). You can now also easily access the header of a wordlist in its current order:

>>> wl = Wordlist(test_data('KSL.qlc'))
>>> wl.columns
['doculect', 'concept', 'glossid', 'orthography', 'ipa', 'tokens', 'cogid']

This is particularly convenient if you construct a Wordlist by creating a dictionary inside a Python script.
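As a rough sketch of what such a dictionary can look like, the following pure-Python example assumes the usual layout in which key 0 holds the header and every other integer key holds one row. The data and the helper column_values are made up for illustration; the commented lines at the end indicate how the dictionary would be handed over to LingPy.

```python
# Assumption: key 0 holds the header, all other integer keys hold one row each.
data = {
    0: ["doculect", "concept", "ipa"],
    1: ["German", "hand", "hant"],
    2: ["English", "hand", "hænd"],
    3: ["German", "foot", "fuːs"],
}


def column_values(table, column):
    """Return the values of one column in row-ID order (hypothetical helper)."""
    index = table[0].index(column)
    return [table[key][index] for key in sorted(table) if key != 0]


print(column_values(data, "ipa"))  # ['hant', 'hænd', 'fuːs']

# With LingPy installed, the dictionary could then be loaded like this:
# from lingpy import Wordlist
# wl = Wordlist(data)
```

Since the column order is defined only by the header row, being able to read it back in its current order helps to keep the header and the rows in sync.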

Sanity Checks of Linguistic Data (2.6.3)

We introduce a new module called sanity, which provides several functions for checking the consistency of your data. One specific focus, given that LingPy concentrates on sequence comparison, is the coverage of words in a wordlist. Coverage is here understood as the minimal number of words shared per meaning slot in each language pair. When testing certain datasets, you will notice that mutual coverage can at times be extremely low. As a rule of thumb, if your data’s mutual coverage is below 100 for moderately divergent language samples, you should not run an analysis using the “lexstat” algorithm of the lingpy.compare.lexstat.LexStat class, but instead turn to the “sca” method provided by the same class. Mutual coverage can be computed in a straightforward manner:

>>> from lingpy.compare.sanity import mutual_coverage_check
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> for i in range(wl.height, 1, -1):
...     if mutual_coverage_check(wl, i):
...         print('mutual coverage is {0}'.format(i))
...         break
mutual coverage is 200

We highly recommend that all users who deal with spotty and patchy data take a closer look at the coverage functions offered in the sanity module. You may also want to check out the synonymy() function, which computes the number of synonyms in your dataset. Here again, we recommend paying specific attention not to exceed three words per concept and language.
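To illustrate the two checks, here is a small pure-Python sketch on toy data. It is not the code of the sanity module: the functions mutual_coverage and max_synonymy are hypothetical, and coverage is read here as the smallest number of concepts for which both languages of a pair have at least one word.

```python
from collections import Counter
from itertools import combinations

# Toy wordlist rows: (language, concept, form). Illustrative data only.
rows = [
    ("A", "hand", "hant"), ("A", "foot", "fus"), ("A", "eye", "auge"),
    ("B", "hand", "hand"), ("B", "foot", "fut"),
    ("C", "hand", "han"), ("C", "eye", "oge"), ("C", "eye", "oga"),
]


def mutual_coverage(rows):
    """Smallest number of concepts shared by any pair of languages."""
    concepts = {}
    for language, concept, _ in rows:
        concepts.setdefault(language, set()).add(concept)
    return min(
        len(concepts[a] & concepts[b])
        for a, b in combinations(sorted(concepts), 2)
    )


def max_synonymy(rows):
    """Largest number of forms given for one concept in one language."""
    counts = Counter((language, concept) for language, concept, _ in rows)
    return max(counts.values())


print(mutual_coverage(rows))  # 1 (languages B and C share only "hand")
print(max_synonymy(rows))     # 2 (language C has two words for "eye")
```

In this toy example, a single sparse language pair determines the mutual coverage of the whole dataset, which is why patchy data can lead to surprisingly low values.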

Alignments and Cognate Sets (2.6.3)

This is again a small change, but to allow for a more consistent integration with external tools, the cognate set identifier “0” can now be used for cases where no decision has yet been made. That means that if you assign cognate set IDs (“COGID”) to your data and one of them is set to “0”, the word will not be aligned and will not be clustered together with other words that have the identifier “0”.
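The intended effect can be sketched in a few lines of plain Python. The function cognate_sets and the toy data are hypothetical and only mirror the behaviour described above; they do not show how LingPy handles the identifier internally.

```python
from collections import defaultdict

# Toy data: (word ID, form, COGID); COGID 0 marks "no decision yet".
words = [
    (1, "hant", 1),
    (2, "hand", 1),
    (3, "main", 0),
    (4, "ruka", 0),
    (5, "fus", 2),
]


def cognate_sets(words, unassigned=0):
    """Group word IDs by COGID, leaving out words without a decision."""
    sets = defaultdict(list)
    for word_id, _, cogid in words:
        if cogid == unassigned:
            continue  # never lump undecided words into one shared set
        sets[cogid].append(word_id)
    return dict(sets)


print(cognate_sets(words))  # {1: [1, 2], 2: [5]}
```

Words 3 and 4 both carry the identifier 0, but they do not end up in a common set, so they would neither be aligned with each other nor counted as cognates.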

Random Clusters, Lumper, and Splitter (2.6.3)

We added a new experimental module with several functions that help to deal with random clusters. The module cluster_util offers ways to mutate a given cluster, to create a random cluster, or to create all possible clusters for a given number of entities. This module was originally prepared for adding random cognate sets to a wordlist, in order to compare this random output with non-random algorithms for cognate detection:

>>> from lingpy.evaluate.acd import random_clusters
>>> random_clusters(wl, ref="randomid")

We then figured that it would also be interesting to compare random clusters with “extreme” clusters, namely, the clusters provided by “lumpers” and “splitters”. Thus, we added another custom function:

>>> from lingpy.evaluate.acd import extreme_clusters
>>> extreme_clusters(wl, ref='lumperid', bias='lumper')

If you assess cognate detection algorithms on your data, we recommend always comparing them against the lumper and the splitter, as this can already tell you a lot about the nature of your data. In some datasets, the lumper will receive evaluation scores of more than 73%, only because the cognate density in the data is so high. This means in turn that if your algorithm does not perform much better than 73% in these cases, its score does not show that it is performing well.
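Why the lumper can score so well on data with high cognate density can be seen in a small pure-Python sketch. The functions below are hypothetical and use a simple pairwise F-score for one concept slot; this measure was chosen for brevity and is not necessarily the one behind the figure mentioned above.

```python
from itertools import combinations


def lumper(size):
    """Assign all words to one cluster."""
    return [1] * size


def splitter(size):
    """Assign every word to its own cluster."""
    return list(range(1, size + 1))


def pairwise_f_score(gold, test):
    """F-score over word pairs that are placed in the same cluster."""
    pairs = list(combinations(range(len(gold)), 2))
    gold_pairs = {(i, j) for i, j in pairs if gold[i] == gold[j]}
    test_pairs = {(i, j) for i, j in pairs if test[i] == test[j]}
    true_positives = len(gold_pairs & test_pairs)
    if not true_positives:
        return 0.0
    precision = true_positives / len(test_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Toy gold standard with high cognate density: four of five words are cognate.
gold = [1, 1, 1, 1, 2]
print(round(pairwise_f_score(gold, lumper(len(gold))), 2))    # 0.75
print(round(pairwise_f_score(gold, splitter(len(gold))), 2))  # 0.0
```

In this toy case the lumper reaches a score of 0.75 without looking at the words at all, so an algorithm that scores only slightly higher on such data has learned very little.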

Adding Support to Read CLDF (2.6.3)

The Cross-Linguistic Data Formats (CLDF) initiative (Forkel2017a) provides standardized formats for wordlists and cognate judgments. The pycldf package provides support for converting LingPy-formatted datasets into CLDF format. LingPy now also provides support for reading CLDF files and converting them into lingpy.basic.wordlist.Wordlist objects:

>>> wl = Wordlist(test_data('KSL.qlc'))
>>> from lingpy.convert.cldf import to_cldf, from_cldf
>>> to_cldf(wl)
>>> wl = from_cldf('cldf/Wordlist-metadata.json')

Adding Nexus Output (2.6.3)

We had nexus output before, but now, Simon Greenhill has helped us to provide a stable export to both MrBayes and Beast, also including the situation where you want to calculate rates for each concept class:

>>> wl = Wordlist(test_data("KSL.qlc"))
>>> from lingpy.convert.strings import write_nexus
>>> mb = write_nexus(wl, 'mrbayes', filename='ksl-mrbayes.nex')
>>> beast = write_nexus(wl, 'beastwords', filename='ksl-beast.nex')

Orthography Profile Creation (2.6.3)

We introduced the command line interface of LingPy in version 2.5, but have not really worked on it since then, as we think that cognate detection analyses should not happen silently by just typing a command into the terminal. Instead, we encourage users to learn what the major algorithms are and how they should be used. We also recommend having a look at an online tutorial which Johann-Mattis List prepared early in 2017. However, in one particular case the command line actually proved to be very useful, namely for the creation of orthography profiles (Moran2017). Thus, if you have a normal LingPy wordlist, you can now preparse it with LingPy to create an initial (hopefully useful) orthography profile via the command line:

$ lingpy profile -i input-file.tsv --column=form --context -o output-profile.tsv

If you choose the “--context” option, LingPy will add information on which language shows which pattern, and whether the pattern occurs at the beginning (“^”) or the end (“$”) of a given phonetic sequence. More information can also be found in [this](https://zenodo.org/badge/latestdoi/109654333) tutorial on LingPy and EDICTOR.
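The idea behind the context markers can be sketched in plain Python. The function context_profile below is hypothetical, treats every character as one segment, and does not reproduce the format of the profile written by the command; it only shows how initial and final positions can be marked and counted per language.

```python
from collections import Counter

# Toy input: (language, form). Each character is treated as one segment.
forms = [("German", "hant"), ("German", "fus"), ("English", "hand")]


def context_profile(forms):
    """Count segments per language, marking initial (^) and final ($) ones."""
    counts = Counter()
    for language, form in forms:
        last = len(form) - 1
        for position, segment in enumerate(form):
            marked = segment
            if position == 0:
                marked = "^" + marked
            if position == last:
                marked = marked + "$"
            counts[(marked, language)] += 1
    return counts


profile = context_profile(forms)
print(profile[("^h", "German")])   # 1
print(profile[("d$", "English")])  # 1
print(profile[("a", "German")])    # 1
```

Keeping the language next to each pattern makes it easier to spot segments that occur in only one variety, which are often the first candidates for corrections in a profile.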

LingPy Cookbook (2.6.3)

We decided to start introducing LingPy in bits and bytes, since the full reference may at times overwhelm new users. We have started collecting some short how-tos in our LingPy Cookbook, which you can find here.