lingpy.basic package

Submodules

lingpy.basic.ops module

Module provides basic operations on Wordlist objects.

lingpy.basic.ops.calculate_data(wordlist, data, taxa='taxa', concepts='concepts', ref='cogid', **keywords)

Manipulate a wordlist object by adding different kinds of data.

Parameters:

data : str

The type of data that shall be calculated. Currently supports

  • “tree”: calculate a reference tree based on shared cognates
  • “dst”: get distances between taxa based on shared cognates
  • “cluster”: cluster the taxa into groups using different methods
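
The “dst” option rests on the proportion of shared cognates between pairs of taxa. A minimal pure-Python sketch of that idea (illustrative data and names, not the library's internal code):

```python
# Illustrative sketch of shared-cognate distances between two taxa.
# cognates_X maps each concept to the cognate-set ID of taxon X's word.
cognates_a = {"hand": 1, "leg": 2, "head": 3}
cognates_b = {"hand": 1, "leg": 4, "head": 3}

shared = sum(
    1 for concept in cognates_a
    if concept in cognates_b and cognates_a[concept] == cognates_b[concept]
)
total = len(set(cognates_a) & set(cognates_b))
distance = 1 - shared / total  # 2 of 3 concepts are cognate here
print(distance)
```

The “tree” option then clusters such pairwise distances into a reference tree.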
lingpy.basic.ops.clean_taxnames(wordlist, column='doculect', f=<function <lambda>>)

Clean taxon names to make sure they can be used in Newick files.

lingpy.basic.ops.coverage(wordlist)

Determine the average coverage of a wordlist.
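
With coverage understood as the number of concepts attested per doculect, the average can be sketched as follows (hypothetical counts, not the function's actual return value):

```python
# Illustrative sketch of average coverage: concepts covered per doculect,
# averaged over all doculects in the wordlist.
counts = {"German": 200, "English": 180, "Russian": 190}
average_coverage = sum(counts.values()) / len(counts)
print(average_coverage)  # 190.0
```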

lingpy.basic.ops.get_score(wl, ref, mode, taxA, taxB, concepts_attr='concepts', ignore_missing=False)
lingpy.basic.ops.renumber(wordlist, source, target='', override=False)

Create numerical identifiers from string identifiers.

lingpy.basic.ops.triple2tsv(triples_or_fname, output='table')

Read a triple file and convert it to a tabular data structure.

lingpy.basic.ops.tsv2triple(wordlist, outfile=None)

Convert a wordlist to a triple data structure.

Notes

The basic values of which the triples consist are:
  • ID (the ID in the TSV file)
  • COLUMN (the column in the TSV file)
  • VALUE (the entry in the TSV file)
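
The mapping can be sketched in plain Python as follows (illustrative data; the uppercased column names are an assumption, not necessarily the library's exact output):

```python
# Illustrative sketch of the triple representation: each cell of a
# tabular wordlist becomes one (ID, COLUMN, VALUE) triple.
header = ["doculect", "concept", "ipa"]
rows = {1: ["German", "hand", "hant"], 2: ["English", "hand", "hænd"]}

triples = [
    (row_id, column.upper(), value)
    for row_id, values in rows.items()
    for column, value in zip(header, values)
]
print(triples[0])  # (1, 'DOCULECT', 'German')
```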
lingpy.basic.ops.wl2dict(wordlist, sections, entries, exclude=None)

Convert a wordlist to a complex dictionary with headings as keys.

lingpy.basic.ops.wl2dst(wl, taxa='taxa', concepts='concepts', ref='cogid', refB='', mode='swadesh', ignore_missing=False, **keywords)

Convert a wordlist to a distance matrix.

lingpy.basic.ops.wl2multistate(wordlist, ref)

Helper function that converts a wordlist to multistate format (compatible with PAUP).

lingpy.basic.ops.wl2qlc(header, data, filename='', formatter='concept', **keywords)

Write the basic data of a wordlist to file.

lingpy.basic.parser module

Basic parser for text files in QLC format.

class lingpy.basic.parser.QLCParser(filename, conf='')

Bases: object

Basic class for the handling of text files in QLC format.

add_entries(entry, source, function, override=False, **keywords)

Add new entry-types to the word list by modifying given ones.

Parameters:

entry : string

A string specifying the name of the new entry-type to be added to the word list.

source : string

A string specifying the basic entry-type that shall be modified. If multiple entry-types shall be used to create a new entry, they should be passed as a single comma-separated string.

function : function

A function which is used to convert the source into the target value.

keywords : {dict}

A dictionary of keywords that are passed as parameters to the function.

Notes

This method can be used to add new entry-types to the data by converting given ones. There are a lot of possibilities for adding new entries, but the most basic procedure is to use an existing entry-type and to modify it with help of a function.
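
A minimal sketch of this conversion logic, using hypothetical data and a naive character-split function (real tokenization in LingPy is more sophisticated; the example call in the comment follows the signature documented above):

```python
# Illustrative sketch of what add_entries does internally: for every row,
# the new entry is computed by applying the function to the source entry.
# With a wordlist that has an 'ipa' column one would write, e.g.:
#     wl.add_entries('tokens', 'ipa', lambda x: list(x))
rows = {1: {"ipa": "hant"}, 2: {"ipa": "hænd"}}
function = lambda x: list(x)  # naive split of the IPA string into tokens

for row in rows.values():
    row["tokens"] = function(row["ipa"])
print(rows[1]["tokens"])  # ['h', 'a', 'n', 't']
```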

pickle(filename=None)

Store the QLCParser instance in a pickle file.

Notes

The function stores a binary file called FILENAME.pkl, with FILENAME corresponding to the name of the original file, in the user cache dir for lingpy on your system. To restore the instance from the pickle, call unpickle().

static unpickle(filename)
class lingpy.basic.parser.QLCParserWithRowsAndCols(filename, row, col, conf)

Bases: lingpy.basic.parser.QLCParser

get_entries(entry)

Return all entries matching the given entry-type as a two-dimensional list.

Parameters:

entry : string

The entry-type of the data that shall be returned in tabular format.

lingpy.basic.tree module

Basic module for the handling of language trees.

class lingpy.basic.tree.Tree(tree, **keywords)

Bases: lingpy.thirdparty.cogent.tree.PhyloNode

Basic class for the handling of phylogenetic trees.

Parameters:

tree : { str, file, list }

A string or a file containing trees in Newick format. As an alternative, you can also simply pass a list containing taxon names. In that case, a random tree will be created from the list of taxa.

branch_lengths : bool (default=False)

When set to True, and a list of taxa is passed instead of a Newick string or a file containing a Newick string, a random tree with random branch lengths will be created, the branch lengths being on the order of the total number of internal branches.

getDistanceToRoot(node)

Return the distance from the given node to the root.

Parameters:

node : str

The name of a given node in the tree.

Returns:

distance : int

The distance of the given node to the root of the tree.

get_distance(other, distance='grf', debug=False)

Return the Robinson-Foulds distance between the two trees.

Parameters:

other : lingpy.basic.tree.Tree

A tree object. It should have the same number of taxa as the initial tree.

distance : { “grf”, “rf”, “branch”, “symmetric” } (default=”grf”)

The distance which shall be calculated. Select between:

  • “grf”: the generalized Robinson-Foulds distance

  • “rf”: the Robinson-Foulds distance

  • “branch”: the distance in terms of branch lengths

  • “symmetric”: the symmetric difference between all partitions of the trees
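
The “symmetric” option counts the partitions found in exactly one of the two trees. A toy sketch with two hand-picked bipartition sets (illustrative only, not lingpy's code):

```python
# Illustrative sketch of the symmetric difference between tree partitions:
# each frozenset stands for one bipartition (clade) of the taxa.
partitions_a = {frozenset({"A", "B"}), frozenset({"C", "D"})}
partitions_b = {frozenset({"A", "B"}), frozenset({"B", "C"})}

# Partitions present in exactly one of the two trees.
symmetric_difference = len(partitions_a ^ partitions_b)
print(symmetric_difference)  # 2
```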

lingpy.basic.tree.random_tree(taxa, branch_lengths=False)

Create a random tree from a list of taxa.

Parameters:

taxa : list

The list containing the names of the taxa from which the tree will be created.

branch_lengths : bool (default=False)

When set to True, a random tree with random branch lengths will be created, the branch lengths being on the order of the total number of internal branches.

Returns:

tree_string : str

A string representation of the random tree in Newick format.
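
A naive pure-Python sketch of how such a random Newick string can be built, by repeatedly joining two random clades (this is an illustration, not lingpy's implementation; no branch lengths are produced):

```python
import random

# Illustrative sketch: build a random binary Newick tree by joining
# two randomly chosen clades until only one clade remains.
def sketch_random_tree(taxa, seed=None):
    rng = random.Random(seed)
    clades = list(taxa)
    while len(clades) > 1:
        a, b = rng.sample(range(len(clades)), 2)
        joined = "(%s,%s)" % (clades[a], clades[b])
        clades = [c for i, c in enumerate(clades) if i not in (a, b)]
        clades.append(joined)
    return clades[0] + ";"

print(sketch_random_tree(["English", "German", "Russian"], seed=1))
```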

lingpy.basic.wordlist module

This module provides a basic class for the handling of word lists.

class lingpy.basic.wordlist.Wordlist(filename, row='concept', col='doculect', conf=None)

Bases: lingpy.basic.parser.QLCParserWithRowsAndCols

Basic class for the handling of multilingual word lists.

Parameters:

filename : { string, dict }

The input file that contains the data. Alternatively, a dictionary with consecutive integers as keys and lists as values, with the key 0 specifying the header.

row : str (default = “concept”)

A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.

col : str (default = “doculect”)

A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.

conf : string (default=None)

A string defining the path to the configuration file (more information in the notes).

Notes

A word list is created from a dictionary containing the data. The idea is a three-dimensional representation of (linguistic) data. The first dimension is called col (column, usually “language”), the second one is called row (row, usually “concept”), the third is called entry, and in contrast to the first two dimensions, which have to consist of unique items, it contains flexible values, such as “ipa” (phonetic sequence), “cogid” (identifier for cognate sets), “tokens” (tokenized representation of phonetic sequences). The LingPy website offers some tutorials for word lists which we recommend to read in case you are looking for more information.

Several methods are provided along with the word list class in order to access the multi-dimensional input data. The main idea is to provide an easy way to access two-dimensional slices of the data by specifying which entry type should be returned. Thus, if a word list consists not only of simple orthographical entries but also of IPA-encoded phonetic transcriptions, both the orthographical source and the IPA transcriptions can be easily accessed as two separate two-dimensional lists.
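
The dictionary input format described above (consecutive integer keys, key 0 as header) can be sketched as follows; the actual Wordlist construction, shown as a comment, assumes lingpy is installed:

```python
# Input format for a Wordlist: integer keys, lists as values, key 0 = header.
data = {
    0: ["doculect", "concept", "ipa"],
    1: ["German",   "hand",    "hant"],
    2: ["English",  "hand",    "hænd"],
    3: ["German",   "leg",     "bain"],
}
# With lingpy installed one would then create the wordlist via:
#     from lingpy import Wordlist
#     wl = Wordlist(data)
# The two unique dimensions correspond to the 'doculect' and 'concept' columns:
doculects = sorted({row[0] for key, row in data.items() if key != 0})
concepts = sorted({row[1] for key, row in data.items() if key != 0})
print(doculects, concepts)  # ['English', 'German'] ['hand', 'leg']
```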

add_entries(entry, source, function, override=False, **keywords)

Add new entry-types to the word list by modifying given ones.

Parameters:

entry : string

A string specifying the name of the new entry-type to be added to the word list.

source : string

A string specifying the basic entry-type that shall be modified. If multiple entry-types shall be used to create a new entry, they should be passed as a single comma-separated string.

function : function

A function which is used to convert the source into the target value.

keywords : {dict}

A dictionary of keywords that are passed as parameters to the function.

Notes

This method can be used to add new entry-types to the data by converting given ones. There are a lot of possibilities for adding new entries, but the most basic procedure is to use an existing entry-type and to modify it with help of a function.

calculate(data, taxa='taxa', concepts='concepts', ref='cogid', **keywords)

Calculate specific data, such as reference trees, distance matrices, or clusters.

Parameters:

data : str

The type of data that shall be calculated. Currently supports

  • “tree”: calculate a reference tree based on shared cognates
  • “dst”: get distances between taxa based on shared cognates
  • “cluster”: cluster the taxa into groups using different methods
coverage(stats='absolute')

Determine the coverage of the wordlist.

export(fileformat, sections=None, entries=None, entry_sep='', item_sep='', template='', **keywords)

Export the wordlist to specific fileformats.

Notes

The difference between export and output is that the latter mostly serves for internal purposes and formats, while the former serves for publication of data, using specific, nested statements to create, for example, HTML or LaTeX files from the wordlist data.

get_dict(col='', row='', entry='', **keywords)

Return a dictionary of the cells matched by the indices.

Parameters:

col : string (default=””)

The column index evaluated by the method. It should contain one of the values in the columns of the Wordlist instance, usually a taxon (language) name.

row : string (default=””)

The row index evaluated by the method. It should contain one of the values in the rows of the Wordlist instance, usually a concept name.

entry : string (default=””)

The index for the entry evaluated by the method. It can be used to specify the datatype of the rows or columns selected. As a default, the indices of the entries are returned.

Returns:

entries : dict

A dictionary of keys and values specifying the selected part of the data. Typically, this can be a dictionary of a given language with keys for the concept and values as specified in the “entry” keyword.

Notes

The “col” and “row” keywords in the function are all aliased according to the description in the wordlist.rc file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like:

>>> Wordlist.get_dict(language='LANGUAGE')

and for the selection of a concept, one may type something like:

>>> Wordlist.get_dict(concept='CONCEPT')

See the examples below for details.

Examples

Load the harry_potter.csv file:

>>> wl = Wordlist('harry_potter.csv')

Select all IPA-entries for the language “German”:

>>> wl.get_dict(language='German',entry='ipa')
{'Harry': ['haralt'], 'hand': ['hant'], 'leg': ['bain']}

Select all words (orthographical representation) for the concept “Harry”:

>>> wl.get_dict(concept="Harry",entry="words")
{'English': ['Harry'], 'German': ['Harald'], 'Russian': ['Гари'], 'Ukrainian': ['Гарi']}

Note that the values of the dictionary that is returned are always lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept).

get_etymdict(ref='cogid', entry='', modify_ref=False)

Return an etymological dictionary representation of the word list.

Parameters:

ref : string (default = “cogid”)

The reference entry which is used to store the cognate ids.

entry : string (default = ‘’)

The entry-type which shall be selected.

modify_ref : function (default=False)

Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to “abs”, and all cognate IDs will be converted to their absolute value.

Returns:

etym_dict : dict

An etymological dictionary representation of the data.

Notes

In contrast to the word-list representation of the data, an etymological dictionary representation sorts the counterparts according to the cognate sets of which they are reflexes. If more than one cognate ID are assigned to an entry, for example in cases of fuzzy cognate IDs or partial cognate IDs, the etymological dictionary will return one cognate set for each of the IDs.
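
The regrouping idea can be sketched in plain Python with hypothetical data (the real method returns a richer structure keyed by cognate ID):

```python
from collections import defaultdict

# Illustrative sketch of the etymological-dictionary idea: words are
# regrouped by their cognate-set ID instead of by concept.
words = [
    ("German",  "hand", "hant", 1),
    ("English", "hand", "hænd", 1),
    ("German",  "leg",  "bain", 2),
]
etymdict = defaultdict(list)
for doculect, concept, form, cogid in words:
    etymdict[cogid].append((doculect, form))
print(dict(etymdict)[1])  # [('German', 'hant'), ('English', 'hænd')]
```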

get_list(row='', col='', entry='', flat=False, **keywords)

Return a list of rows or columns specified by their name.

Parameters:

row : string (default = ‘’)

The row name whose entries are selected from the data.

col : string (default = ‘’)

The column name whose entries are selected from the data.

entry : string (default = ‘’)

The entry-type which is selected from the data.

flat : bool (default = False)

Specify whether the returned list should be one- or two-dimensional and whether it should contain gaps.

Returns:

data : list

A list representing the selected part of the data.

Notes

The ‘col’ and ‘row’ keywords in the function are all aliased according to the description in the wordlist.rc file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like:

>>> Wordlist.get_list(language='LANGUAGE')

and for the selection of a concept, one may type something like:

>>> Wordlist.get_list(concept='CONCEPT')

See the examples below for details.

Examples

Load the harry_potter.csv file:

>>> wl = Wordlist('harry_potter.csv')

Select all IPA-entries for the language “German”:

>>> wl.get_list(language='German',entry='ipa')
['bain', 'hant', 'haralt']

Note that this function returns 0 for missing values (concepts that don’t have a word in the given language). If one wants to avoid this, the ‘flat’ keyword should be set to True.

Select all words (orthographical representation) for the concept “Harry”:

>>> wl.get_list(concept="Harry",entry="words")
[['Harry', 'Harald', 'Гари', 'Гарi']]

Note that the values of the list that is returned are always two-dimensional lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept). If one wants to have a flat representation of the entries, the ‘flat’ keyword should be set to True:

>>> wl.get_list(concept="Harry",entry="words",flat=True)
['Harry', 'Harald', 'Гари', 'Гарi']
get_paps(ref='cogid', entry='concept', missing=0, modify_ref=False)

Return the presence-absence patterns of a given word list.

Parameters:

ref : string (default = “cogid”)

The reference entry which is used to store the cognate ids.

entry : string (default = “concept”)

The field which is used to check for missing data.

missing : string,int (default = 0)

The marker for missing items.
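
A minimal sketch of the presence-absence idea with hypothetical data (not the method's actual return format):

```python
# Illustrative sketch of presence/absence patterns: for every cognate set,
# record 1 if a taxon has a reflex, else the missing-data marker (here 0).
taxa = ["English", "German", "Russian"]
reflexes = {1: {"English", "German"}, 2: {"German", "Russian"}}

paps = {
    cogid: [1 if taxon in present else 0 for taxon in taxa]
    for cogid, present in reflexes.items()
}
print(paps)  # {1: [1, 1, 0], 2: [0, 1, 1]}
```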

output(fileformat, **keywords)

Write wordlist to file.

Parameters:

fileformat : { “tsv”, “tre”, “nwk”, “dst”, “taxa”, “starling”, “paps.nex”, “paps.csv” }

The format that is written to file. This corresponds to the file extension, thus ‘tsv’ creates a file in extended tsv-format, ‘dst’ creates a file in Phylip-distance format, etc.

filename : str

Specify the name of the output file (defaults to a filename that indicates the creation date).

subset : bool (default=False)

If set to True, return only a subset of the data, as specified in the keywords ‘cols’ and ‘rows’.

cols : list

If subset is set to True, specify the columns that shall be written to the csv-file.

rows : dict

If subset is set to True, use a dictionary consisting of keys that specify a column and values that give a Python statement in raw text, such as, e.g., “== ‘hand’”. The content of the specified column will then be checked against the statement passed in the dictionary, and if it evaluates to True, the respective row will be written to file.

ref : str

Name of the column that contains the cognate IDs if ‘starling’ is chosen as an output format.

missing : { str, int } (default=0)

If ‘paps.nex’ or ‘paps.csv’ is chosen as fileformat, this character will be inserted as an indicator of missing data.

tree_calc : {‘neighbor’, ‘upgma’}

If no tree has been calculated and ‘tre’ or ‘nwk’ is chosen as output format, the method that is used to calculate the tree.

threshold : float (default=0.6)

The threshold that is used to carry out a flat cluster analysis if ‘groups’ or ‘cluster’ is chosen as output format.

ignore : { list, “all” }

Modifies the “tsv” output format and allows certain blocks in extended “tsv”, like “msa”, “taxa”, “json”, etc., to be ignored; these should be passed as a list. If you choose “all” as a plain string and not a list, all additional blocks will be ignored and only plain “tsv” will be output.

prettify : bool (default=True)

Inserts comment characters between concepts in the “tsv” file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain “tsv”.

renumber(source, target='', override=False)

Renumber a given set of string identifiers by replacing the ids with integers.

Parameters:

source : str

The source column to be manipulated.

target : str (default=’‘)

The name of the target column. If no name is chosen, the target column will be named by appending “id” to the name of the source column.

override : bool (default=False)

Force to overwrite the data if the target column already exists.

Notes

In addition to a new column, a further entry is added to the “_meta” attribute of the wordlist, by which the newly coined ids can be retrieved from the former string attributes. This attribute is called “source2target” and can be accessed either via the “_meta” dictionary or directly as an attribute of the wordlist.
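
The renumbering logic itself can be sketched in plain Python (illustrative data; the real method also stores the resulting mapping in “_meta”):

```python
# Illustrative sketch of renumbering: distinct string identifiers are
# mapped to consecutive integers, reusing the integer for repeats.
source_values = ["dolgo", "sca", "dolgo", "asjp", "sca"]

mapping = {}
target_values = []
for value in source_values:
    if value not in mapping:
        mapping[value] = len(mapping) + 1
    target_values.append(mapping[value])
print(target_values)  # [1, 2, 1, 3, 2]
print(mapping)        # the 'source2target'-style lookup
```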

lingpy.basic.wordlist.get_wordlist(path, **keywords)

Load a wordlist from a normal CSV file.

Parameters:

path : str

The path to your CSV file.

delimiter : str

The delimiter in the CSV file.

quotechar : str

The quote character in your data.

row : str (default = “concept”)

A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.

col : str (default = “doculect”)

A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.

conf : string (default=’‘)

A string defining the path to the configuration file.

Notes

This function returns a Wordlist object. In contrast to the normal way of loading a wordlist from a tab-separated file, however, this allows you to load a wordlist directly from any “normal” csv-file, with your own specified delimiters and quote characters. If the first cell in the first row of your CSV file is not named “ID”, the integer identifiers, which are required by LingPy, will be created automatically.

lingpy.basic.workflow module

Package provides generic workflow modules for LingPy.

class lingpy.basic.workflow.Workflow(infile)

Bases: object

Class provides access to generic workflows.

Parameters:

infile : str

A tsv-file providing the input data for the given workflow.

cognate_detection(**keywords)

Run a cognate detection analysis.

Module contents

This module provides basic classes for the handling of linguistic data.

The basic idea is to provide classes that allow the user to handle basic linguistic datatypes (spreadsheets, wordlists) in a consistent way.