================== Handling Wordlists ================== What is a Word List? -------------------- Generally, a word list is a simple tabular data structure in which multiple languages are structured in such a way that **words** are ordered in rows and columns according to the **language** to which they belong and the **concept** they denote. A simple word list could thus be displayed as a simple tab-delimited text file:: CONCEPT GERMAN ENGLISH RUSSIAN UKRAINIAN hand Hand hand рука рука leg Bein leg нога нога Woldemort Waldemar Woldemort Владимир Володимир Harry Harald Harry Гарри Гаррi However, this format has a striking drawback, in so far as what we call a "word" can have multiple manifestations in our data. Thus, the same word list could look like this, if we preferred to have the words represented in phonetic transcription:: CONCEPT GERMAN ENGLISH RUSSIAN UKRAINIAN hand hant hænd ruka ruka leg bain lɛg noga noga Woldemort valdəmar wɔldəmɔrt vladimir volodimir Harry haralt hæri gari gari And if we wanted to display only which of the words are cognate, we could represent it in a numerical format, where all words sharing the same number are cognate:: CONCEPT GERMAN ENGLISH RUSSIAN UKRAINIAN hand 1 1 2 2 leg 3 4 5 5 Woldemort 6 6 6 6 Harry 7 7 7 7 When dealing with word lists in general, we thus need something more than just a two-dimensional representation format. A solution is to use a simple csv-format with a header which specifies not only the concept and the language, but also all different possible **entry-types** a word can have, just as in the file `harry_potter.csv`_:: @author: Potter, Harry @date: 2012-11-07 # ID CONCEPT COUNTERPART IPA DOCULECT COGID 1 hand Hand hant German 1 2 hand hand hænd English 1 3 hand рука ruka Russian 2 4 hand рука ruka Ukrainian 2 5 leg Bein bain German 3 6 leg leg lɛg English 4 7 leg нога noga Russian 5 8 leg нога noha Ukrainian 5 9 Woldemort Waldemar valdemar German 6 10 Woldemort Woldemort wɔldemɔrt English 6 11 Woldemort Владимир vladimir Russian 6 12 Woldemort Володимир volodimir Ukrainian 6 13 Harry Harald haralt German 7 14 Harry Harry hæri English 7 15 Harry Гарри gari Russian 7 16 Harry Гаррi hari Ukrainian 7 This format is, of course, much more redundant, than the word list format, but it allows to display multiple entry-types for the counterparts of a given concept in a given language. Moreover, this format is the basic of the :py:class:`~lingpy.basic.wordlist.Wordlist` class in LingPy, which makes it easy to handle word lists with multiple entry-types of words. Basic Operations with Help of Wordlists --------------------------------------- The above-given csv-file `harry_potter.csv`_ is available in the test folder of LingPy. In order to get it loaded, we simply pass it as first argument to the Wordlist class:: >>> from lingpy import * >>> d = Wordlist('harry_potter.csv') If one wants to access only the IPA values in tabular format, all one has to do is:: >>> wl.ipa [['wɔldemɔrt', 'valdemar', 'vladimir', 'volodimir'], ['hæri', 'haralt', 'gari', 'hari'], ['lɛg', 'bain', 'noga', 'noha'], ['hænd', 'hant', 'ruka', 'ruka']] The same for cognates:: >>> wl.cognate [[6, 6, 6, 6], [7, 7, 7, 7], [4, 3, 5, 5], [1, 1, 2, 2]] Or for the languages and the concepts in the dataset:: >>> wl.language ['English', 'German', 'Russian', 'Ukrainian'] >>> wl.concept ['Harry', 'Woldemort', 'hand', 'leg'] Furthermore, using specific functions, even more concise samples of the data can be extracted, thus, using the :py:class:`~lingpy.basic.wordlist.Wordlist.get_dict` function, we can specify a given language and extract all phonetic transcriptions corresponding to a given concept as a dictionary:: >>> wl.get_dict(col="German",entry="ipa") {'Harry': ['haralt'], 'Woldemort': ['valdemar'], 'hand': ['hant'], 'leg': ['bain']} We can likewise extract all cognate IDs corresponding to a given concept by using the function :py:class:`~lingpy.basic.wordlist.Wordlist.get_list`:: >>> wl.get_list(row="hand",entry="cogid",flat=True) [1, 1, 2, 2] Other entry-types can be added:: >>> from lingpy.algorithm.misc import ipa2tokens >>> wl.add_entries("tokens","ipa",ipa2tokens) >>> wl.tokens [[['w', 'ɔ', 'l', 'd', 'e', 'm', 'ɔ', 'r', 't'], ['v', 'a', 'l', 'd', 'e', 'm', 'a', 'r'], ['v', 'l', 'a', 'd', 'i', 'm', 'i', 'r'], ['v', 'o', 'l', 'o', 'd', 'i', 'm', 'i', 'r']], [['l', 'ɛ', 'g'], ['b', 'ai', 'n'], ['n', 'o', 'g', 'a'], ['n', 'o', 'h', 'a']], [['h', 'æ', 'n', 'd'], ['h', 'a', 'n', 't'], ['r', 'u', 'k', 'a'], ['r', 'u', 'k', 'a']], [['h', 'æ', 'r', 'i'], ['h', 'a', 'r', 'a', 'l', 't'], ['g', 'a', 'r', 'i'], ['h', 'a', 'r', 'i']]] The wordlist.rc file -------------------- The structure of word lists is defined by the configuration file `wordlist.rc`_. This file is automatically loaded when initializing a Wordlist instance:: >>> wl = Wordlist(data) It can, however, also be passed by the user:: >>> wl = Wordlist(data, conf="path_to_file") All rc-files (which are used for different wordlist-like object in LingPy) are currently located at `lingpy/data/rc/` and have a simple tab-separated structure of four three columns: 1. basic namespace (alphanumeric, lower case) 2. class of all entries in that namespace 3. alias for the namespace (alphanumeric, all lower case, comma-separated) As an example, consider the following minimal rc-file for a wordlist object:: ipa str ipa,orthography,transcription tokens lambda x: x.split(" ") tokens,segments cogid int cognate_set_id,cognates,cogid This rc-file, which you can call by passing the path of your file as an argument when loading a wordlist, will treat all entries in columns named "ipa" or "orthography" or "transcription" in your data as strings, it will further define "ipa" as the basic name for those columns and use this name when you output the file. It will split all entries in the column "tokens" (or "segments") along spaces and store them as a list, and it will convert all "cogid" entries to integers. .. _harry_potter.csv: examples/harry_potter.csv .. _wordlist.rc: examples/wordlist.rc