lingpy.basic package¶
Submodules¶
lingpy.basic.ops module¶
Module provides basic operations on Wordlist-Objects.
- lingpy.basic.ops.calculate_data(wordlist, data, taxa='taxa', concepts='concepts', ref='cogid', **keywords)¶
Manipulate a wordlist object by adding different kinds of data.
- Parameters
data : str
The type of data that shall be calculated. Currently supports
“tree”: calculate a reference tree based on shared cognates
“dst”: get distances between taxa based on shared cognates
“cluster”: cluster the taxa into groups using different methods
- lingpy.basic.ops.clean_taxnames(wordlist, column='doculect', f=<function <lambda>>)¶
Function cleans taxon names for use in Newick files.
- lingpy.basic.ops.coverage(wordlist)¶
Determine the average coverage of a wordlist.
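To illustrate what “coverage” can mean here, the following self-contained sketch (an illustration, not LingPy’s actual implementation) computes the average number of distinct concepts attested per doculect from a toy list of (doculect, concept) rows:

```python
from statistics import mean

def average_coverage(pairs):
    """Average number of distinct concepts attested per doculect.

    `pairs` is a toy stand-in for wordlist rows: (doculect, concept)."""
    concepts = {}
    for doculect, concept in pairs:
        concepts.setdefault(doculect, set()).add(concept)
    return mean(len(c) for c in concepts.values())

rows = [("German", "hand"), ("German", "leg"), ("English", "hand")]
print(average_coverage(rows))  # 1.5
```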
- lingpy.basic.ops.get_score(wl, ref, mode, taxA, taxB, concepts_attr='concepts', ignore_missing=False)¶
- lingpy.basic.ops.iter_rows(wordlist, *values)¶
Function generates a list of the specified values in a wordlist.
- Parameters
wordlist : ~lingpy.basic.wordlist.Wordlist
A wordlist object or one of the daughter classes of wordlists.
value : str
A value as defined in the header of the wordlist.
- Returns
list : list
A generator object that generates lists containing the key of each row in the wordlist and the corresponding cells, as specified in the headers.
Notes
Use this function to quickly iterate over specified fields in the wordlist. For example, when trying to access all pairs of language names and concepts, you may write:
>>> for k, language, concept in iter_rows(wl, 'language', 'concept'): print(k, language, concept)
Note that this function returns the key of the given row as a first value. So if you do not specify anything, the output will just be the key.
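The behaviour described above can be sketched with a small stand-in generator; the dict-of-dicts wordlist below is a simplification of the real class, and all names are illustrative:

```python
def iter_rows(wordlist, *values):
    """Yield [key, cell, ...] for every row. `wordlist` is modelled as
    a dict mapping row keys to {header: cell} dicts (a simplification
    of the real Wordlist class)."""
    for key, row in sorted(wordlist.items()):
        yield [key] + [row[v] for v in values]

wl = {1: {"language": "German", "concept": "hand"},
      2: {"language": "English", "concept": "hand"}}
for k, language, concept in iter_rows(wl, "language", "concept"):
    print(k, language, concept)
```

Note how, with no value names given, each yielded list contains only the row key.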
- lingpy.basic.ops.renumber(wordlist, source, target='', override=False)¶
Create numerical identifiers from string identifiers.
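The idea behind renumbering can be sketched as follows. This is a hypothetical helper, not LingPy’s implementation: it maps each distinct string to a consecutive integer in order of first appearance and keeps the mapping for later lookup:

```python
def renumber(values):
    """Replace string identifiers with consecutive integers (in order
    of first appearance) and return the mapping alongside."""
    mapping = {}
    renumbered = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
        renumbered.append(mapping[v])
    return renumbered, mapping

ids, mapping = renumber(["dog", "cat", "dog", "fish"])
print(ids)      # [1, 2, 1, 3]
print(mapping)  # {'dog': 1, 'cat': 2, 'fish': 3}
```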
- lingpy.basic.ops.triple2tsv(triples_or_fname, output='table')¶
Function reads a triple file and converts it to a tabular data structure.
- lingpy.basic.ops.tsv2triple(wordlist, outfile=None)¶
Function converts a wordlist to a triple data structure.
Notes
- The basic values of which the triples consist are:
ID (the ID in the TSV file)
COLUMN (the column in the TSV file)
VALUE (the entry in the TSV file)
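A minimal sketch of this triple representation (illustrative only, not the library code): each row of the table is flattened into (ID, COLUMN, VALUE) tuples, one per cell:

```python
def rows_to_triples(rows):
    """Flatten {row_id: {column: value}} into (ID, COLUMN, VALUE)
    triples, one per cell."""
    return [(row_id, column, value)
            for row_id, row in rows.items()
            for column, value in row.items()]

rows = {1: {"DOCULECT": "German", "IPA": "hant"}}
print(rows_to_triples(rows))
# [(1, 'DOCULECT', 'German'), (1, 'IPA', 'hant')]
```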
- lingpy.basic.ops.wl2dict(wordlist, sections, entries, exclude=None)¶
Convert a wordlist to a complex dictionary with headings as keys.
- lingpy.basic.ops.wl2dst(wl, taxa='taxa', concepts='concepts', ref='cogid', refB='', mode='swadesh', ignore_missing=False, **keywords)¶
Function converts wordlist to distance matrix.
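Assuming the “swadesh” mode computes, for each pair of languages, one minus the proportion of shared cognate IDs over the concepts both attest (an assumption based on the description here, not a verified reading of the source), such a distance could be sketched as:

```python
def swadesh_distance(cogs_a, cogs_b):
    """One minus the proportion of matching cognate IDs over the
    concepts attested in both languages.

    cogs_a / cogs_b each map concept -> cognate ID for one language."""
    shared = set(cogs_a) & set(cogs_b)
    if not shared:
        return 1.0
    matches = sum(1 for c in shared if cogs_a[c] == cogs_b[c])
    return 1 - matches / len(shared)

german = {"hand": 1, "leg": 2, "eye": 3}
english = {"hand": 1, "leg": 4, "eye": 3}
print(swadesh_distance(german, english))  # ~0.33
```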
- lingpy.basic.ops.wl2multistate(wordlist, ref, missing)¶
Function converts a wordlist to multistate format (compatible with PAUP).
- lingpy.basic.ops.wl2qlc(header, data, filename='', formatter='concept', **keywords)¶
Write the basic data of a wordlist to file.
lingpy.basic.parser module¶
Basic parser for text files in QLC format.
- class lingpy.basic.parser.QLCParser(filename, conf='')¶
Bases:
object
Basic class for the handling of text files in QLC format.
- add_entries(entry, source, function, override=False, **keywords)¶
Add new entry-types to the word list by modifying given ones.
- Parameters
entry : string
A string specifying the name of the new entry-type to be added to the word list.
source : string
A string specifying the basic entry-type that shall be modified. If multiple entry-types shall be used to create a new entry, they should be passed as a single comma-separated string.
function : function
A function which is used to convert the source into the target value.
keywords : {dict}
A dictionary of keywords that are passed as parameters to the function.
Notes
This method can be used to add new entry-types to the data by converting given ones. There are a lot of possibilities for adding new entries, but the most basic procedure is to use an existing entry-type and to modify it with help of a function.
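A minimal sketch of this mechanism, using a plain dict-of-dicts in place of the real class (all names are illustrative):

```python
def add_entries(wordlist, entry, source, function, override=False):
    """Derive a new entry-type by applying `function` to an existing
    one; `wordlist` is modelled as {row_id: {entry_type: value}}."""
    for row in wordlist.values():
        if entry in row and not override:
            raise ValueError("entry already exists: " + entry)
        row[entry] = function(row[source])

wl = {1: {"ipa": "hant"}, 2: {"ipa": "bain"}}
# derive a naive "tokens" entry by splitting the IPA string into symbols
add_entries(wl, "tokens", "ipa", list)
print(wl[1]["tokens"])  # ['h', 'a', 'n', 't']
```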
- class lingpy.basic.parser.QLCParserWithRowsAndCols(filename, row, col, conf)¶
Bases:
lingpy.basic.parser.QLCParser
- get_entries(entry)¶
Return all entries matching the given entry-type as a two-dimensional list.
- Parameters
entry : string
The entry-type of the data that shall be returned in tabular format.
- lingpy.basic.parser.read_conf(conf='')¶
lingpy.basic.tree module¶
Basic module for the handling of language trees.
- class lingpy.basic.tree.Tree(tree, **keywords)¶
Bases:
lingpy.thirdparty.cogent.tree.PhyloNode
Basic class for the handling of phylogenetic trees.
- Parameters
tree : {str file list}
A string or a file containing trees in Newick format. As an alternative, you can also simply pass a list containing taxon names. In that case, a random tree will be created from the list of taxa.
branch_lengths : bool (default=False)
When set to True, and a list of taxa is passed instead of a Newick string or a file containing a Newick string, a random tree will be created with random branch lengths on the order of the total number of internal branches.
- getDistanceToRoot(node)¶
Return the distance from the given node to the root.
- Parameters
node : str
The name of a given node in the tree.
- Returns
distance : int
The distance of the given node to the root of the tree.
- get_distance(other, distance='grf', debug=False)¶
Function returns the Robinson-Foulds distance between the two trees.
- Parameters
other : lingpy.basic.tree.Tree
A tree object. It should have the same number of taxa as the initial tree.
distance : { “grf”, “rf”, “branch”, “symmetric”} (default=”grf”)
The distance which shall be calculated. Select between:
“grf”: the generalized Robinson-Foulds Distance
“rf”: the Robinson-Foulds Distance
“symmetric”: the symmetric difference between all partitions of the trees
- lingpy.basic.tree.random_tree(taxa, branch_lengths=False)¶
Create a random tree from a list of taxa.
- Parameters
taxa : list
The list containing the names of the taxa from which the tree will be created.
branch_lengths : bool (default=False)
When set to True, a random tree will be created with random branch lengths on the order of the total number of internal branches.
- Returns
tree_string : str
A string representation of the random tree in Newick format.
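One simple way to build such a random Newick string (a hedged sketch, not necessarily the library’s algorithm) is to repeatedly join two randomly chosen subtrees until a single tree remains:

```python
import random

def random_newick(taxa, seed=None):
    """Build a random binary tree over `taxa` by repeatedly joining
    two randomly chosen subtrees; returns a Newick string without
    branch lengths."""
    rng = random.Random(seed)
    nodes = list(taxa)
    while len(nodes) > 1:
        i, j = rng.sample(range(len(nodes)), 2)
        joined = "({0},{1})".format(nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)]
        nodes.append(joined)
    return nodes[0] + ";"

print(random_newick(["German", "English", "Dutch"], seed=1))
```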
lingpy.basic.wordlist module¶
This module provides a basic class for the handling of word lists.
- class lingpy.basic.wordlist.BounceAsID¶
Bases:
object
A helper class for CLDF conversion when tables are missing.
A class with trivial ‘item lookup’:
>>> b = BounceAsID()
>>> b[5]
{"ID": 5}
>>> b["long_id"]
{"ID": "long_id"}
- class lingpy.basic.wordlist.Wordlist(filename, row='concept', col='doculect', conf=None)¶
Bases:
lingpy.basic.parser.QLCParserWithRowsAndCols
Basic class for the handling of multilingual word lists.
- Parameters
filename : { string, dict }
The input file that contains the data. Alternatively, a dictionary with consecutive integers as keys and lists as values, with the key 0 specifying the header.
row : str (default = “concept”)
A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.
col : str (default = “doculect”)
A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.
conf : string (default=’’)
A string defining the path to the configuration file (more information in the notes).
Notes
A word list is created from a dictionary containing the data. The idea is a three-dimensional representation of (linguistic) data. The first dimension is called col (column, usually “language”), the second one is called row (row, usually “concept”), the third is called entry, and in contrast to the first two dimensions, which have to consist of unique items, it contains flexible values, such as “ipa” (phonetic sequence), “cogid” (identifier for cognate sets), “tokens” (tokenized representation of phonetic sequences). The LingPy website offers some tutorials for word lists which we recommend to read in case you are looking for more information.
A couple of methods are provided along with the word list class in order to access the multi-dimensional input data. The main idea is to provide an easy way to access two-dimensional slices of the data by specifying which entry type should be returned. Thus, if a word list consists not only of simple orthographical entries but also of IPA encoded phonetic transcriptions, both the orthographical source and the IPA transcriptions can be easily accessed as two separate two-dimensional lists.
- add_entries(entry, source, function, override=False, **keywords)¶
Add new entry-types to the word list by modifying given ones.
- Parameters
entry : string
A string specifying the name of the new entry-type to be added to the word list.
source : string
A string specifying the basic entry-type that shall be modified. If multiple entry-types shall be used to create a new entry, they should be passed as a single comma-separated string.
function : function
A function which is used to convert the source into the target value.
keywords : {dict}
A dictionary of keywords that are passed as parameters to the function.
Notes
This method can be used to add new entry-types to the data by converting given ones. There are a lot of possibilities for adding new entries, but the most basic procedure is to use an existing entry-type and to modify it with help of a function.
- calculate(data, taxa='taxa', concepts='concepts', ref='cogid', **keywords)¶
Function calculates specific data.
- Parameters
data : str
The type of data that shall be calculated. Currently supports
“tree”: calculate a reference tree based on shared cognates
“dst”: get distances between taxa based on shared cognates
“cluster”: cluster the taxa into groups using different methods
- coverage(stats='absolute')¶
Function determines the coverage of a wordlist.
- export(fileformat, sections=None, entries=None, entry_sep='', item_sep='', template='', **keywords)¶
Export the wordlist to specific fileformats.
Notes
The difference between export and output is that the latter mostly serves for internal purposes and formats, while the former serves for publication of data, using specific, nested statements to create, for example, HTML or LaTeX files from the wordlist data.
- classmethod from_cldf(path, columns=('parameter_id', 'concept_name', 'language_id', 'language_name', 'value', 'form', 'segments', 'language_glottocode', 'concept_concepticon_id', 'language_latitude', 'language_longitude', 'cognacy'), namespace=(('concept_name', 'concept'), ('language_id', 'doculect'), ('segments', 'tokens'), ('language_glottocode', 'glottolog'), ('concept_concepticon_id', 'concepticon'), ('language_latitude', 'latitude'), ('language_longitude', 'longitude'), ('cognacy', 'cognacy'), ('cogid_cognateset_id', 'cogid')), filter=<function Wordlist.<lambda>>, **kwargs)¶
Load a CLDF dataset.
Open a CLDF Dataset – with metadata or metadata-free – (only Wordlist datasets are supported for now, because other modules don’t seem to make sense for LingPy) and transform it into this class. Columns from the FormTable are imported in lowercase; columns from the LanguageTable, ParameterTable and CognateTable are prefixed with language_, concept_ and cogid_ and converted to lowercase.
- Parameters
columns : list or tuple
The list of columns to import. (default: all columns)
filter : function (rowdict → bool)
A condition function for importing only some rows. (default: lambda row: row[“form”])
All other parameters are passed on to `cls`.
- Returns
A `cls` object representing the CLDF dataset.
Notes
CLDFs default column names for wordlists are different from LingPy’s, so you probably have to use:
>>> lingpy.Wordlist.from_cldf("Wordlist-metadata.json")
in order to avoid errors from LingPy not finding required columns.
- get_dict(col='', row='', entry='', **keywords)¶
Function returns dictionaries of the cells matched by the indices.
- Parameters
col : string (default=””)
The column index evaluated by the method. It should contain one of the values in the columns of the Wordlist instance, usually a taxon (language) name.
row : string (default=””)
The row index evaluated by the method. It should contain one of the values in the rows of the Wordlist instance, usually a concept name.
entry : string (default=””)
The index for the entry evaluated by the method. It can be used to specify the datatype of the rows or columns selected. As a default, the indices of the entries are returned.
- Returns
entries : dict
A dictionary of keys and values specifying the selected part of the data. Typically, this can be a dictionary of a given language with keys for the concept and values as specified in the “entry” keyword.
Notes
The “col” and “row” keywords in the function are all aliased according to the description in the wordlist.rc file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like:
>>> Wordlist.get_dict(language='LANGUAGE')
and for the selection of a concept, one may type something like:
>>> Wordlist.get_dict(concept='CONCEPT')
See the examples below for details.
Examples
Load the harry_potter.csv file:
>>> wl = Wordlist('harry_potter.csv')
Select all IPA-entries for the language “German”:
>>> wl.get_dict(language='German',entry='ipa')
{'Harry': ['haralt'], 'hand': ['hant'], 'leg': ['bain']}
Select all words (orthographical representation) for the concept “Harry”:
>>> wl.get_dict(concept="Harry",entry="words")
{'English': ['hæri'], 'German': ['haralt'], 'Russian': ['gari'], 'Ukrainian': ['gari']}
Note that the values of the dictionary that is returned are always lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept).
- get_distances(**kw)¶
- get_etymdict(ref='cogid', entry='', modify_ref=False)¶
Return an etymological dictionary representation of the word list.
- Parameters
ref : string (default = “cogid”)
The reference entry which is used to store the cognate ids.
entry : string (default = ‘’)
The entry-type which shall be selected.
modify_ref : function (default=False)
Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to “abs”, and all cognate IDs will be converted to their absolute value.
- Returns
etym_dict : dict
An etymological dictionary representation of the data.
Notes
In contrast to the word-list representation of the data, an etymological dictionary representation sorts the counterparts according to the cognate sets of which they are reflexes. If more than one cognate ID are assigned to an entry, for example in cases of fuzzy cognate IDs or partial cognate IDs, the etymological dictionary will return one cognate set for each of the IDs.
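The grouping described here can be sketched as follows (illustrative only; the real method returns more structure than this): row IDs are bucketed by the cognate-set ID they carry.

```python
def etym_dict(cognates):
    """Group row IDs by cognate-set ID. `cognates` maps row ID ->
    cognate ID; the result maps each cognate ID to the rows that are
    reflexes of that cognate set."""
    etym = {}
    for row_id, cogid in cognates.items():
        etym.setdefault(cogid, []).append(row_id)
    return etym

print(etym_dict({1: 10, 2: 10, 3: 11}))  # {10: [1, 2], 11: [3]}
```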
- get_list(row='', col='', entry='', flat=False, **keywords)¶
Function returns lists of rows and columns specified by their name.
- Parameters
row : string (default = ‘’)
The row name whose entries are selected from the data.
col : string (default = ‘’)
The column name whose entries are selected from the data.
entry : string (default = ‘’)
The entry-type which is selected from the data.
flat : bool (default = False)
Specify whether the returned list should be one- or two-dimensional, or whether it should contain gaps or not.
- Returns
data : list
A list representing the selected part of the data.
Notes
The ‘col’ and ‘row’ keywords in the function are all aliased according to the description in the wordlist.rc file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like:
>>> Wordlist.get_list(language='LANGUAGE')
and for the selection of a concept, one may type something like:
>>> Wordlist.get_list(concept='CONCEPT')
See the examples below for details.
Examples
Load the harry_potter.csv file:
>>> wl = Wordlist('harry_potter.csv')
Select all IPA-entries for the language “German”:
>>> wl.get_list(language='German',entry='ipa')
['bain', 'hant', 'haralt']
Note that this function returns 0 for missing values (concepts that don’t have a word in the given language). If one wants to avoid this, the ‘flat’ keyword should be set to True.
Select all words (orthographical representation) for the concept “Harry”:
>>> wl.get_list(concept="Harry",entry="words")
[['Harry', 'Harald', 'Гари', 'Гарi']]
Note that the values of the list that is returned are always two-dimensional lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept). If one wants to have a flat representation of the entries, the ‘flat’ keyword should be set to True:
>>> wl.get_list(concept="Harry",entry="words",flat=True)
['hæri', 'haralt', 'gari', 'hari']
- get_paps(ref='cogid', entry='concept', missing=0, modify_ref=False)¶
Function returns a list of present-absent-patterns of a given word list.
- Parameters
ref : string (default = “cogid”)
The reference entry which is used to store the cognate ids.
entry : string (default = “concept”)
The field which is used to check for missing data.
missing : string,int (default = 0)
The marker for missing items.
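A present-absent pattern records, for each cognate set, which taxa contain a reflex of it. The following is an illustrative sketch, not the actual method:

```python
def paps(cognates, taxa):
    """Presence/absence patterns: one row of 1/0 flags per cognate
    set, across the given taxa; `cognates` maps taxon -> set of
    cognate IDs attested in it."""
    all_cogids = sorted(set().union(*cognates.values()))
    return {cogid: [1 if cogid in cognates.get(taxon, set()) else 0
                    for taxon in taxa]
            for cogid in all_cogids}

data = {"German": {1, 2}, "English": {1, 3}}
print(paps(data, ["German", "English"]))
# {1: [1, 1], 2: [1, 0], 3: [0, 1]}
```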
- get_tree(**kw)¶
- iter_cognates(ref, *entries)¶
Iterate over cognate sets in a wordlist.
- iter_rows(*entries)¶
Iterate over the columns in a wordlist.
- Parameters
entries : list
The name of the columns which shall be iterated.
- Returns
iterator : iterator
An iterator yielding lists in which the first entry is the ID of the wordlist row and the following entries are the content of the columns as specified.
Examples
Load a wordlist from LingPy’s test data:
>>> from lingpy.tests.util import test_data
>>> from lingpy import Wordlist
>>> wl = Wordlist(test_data("KSL.qlc"))
>>> list(wl.iter_rows('ipa'))[:10]
[[1, 'ɟiθ'], [2, 'ɔl'], [3, 'tut'], [4, 'al'], [5, 'apa.u'], [6, 'ʔayɬʦo'], [7, 'bytyn'], [8, 'e'], [9, 'and'], [10, 'e']]
So as you can see, the function returns the key of the wordlist as well as the specified entry.
- output(fileformat, **keywords)¶
Write wordlist to file.
- Parameters
fileformat : {“tsv”,”tre”,”nwk”,”dst”, “taxa”, “starling”, “paps.nex”, “paps.csv”}
The format that is written to file. This corresponds to the file extension, thus ‘tsv’ creates a file in extended tsv-format, ‘dst’ creates a file in Phylip-distance format, etc.
filename : str
Specify the name of the output file (defaults to a filename that indicates the creation date).
subset : bool (default=False)
If set to True, return only a subset of the data. Which subset is specified in the keywords ‘cols’ and ‘rows’.
cols : list
If subset is set to True, specify the columns that shall be written to the csv-file.
rows : dict
If subset is set to True, use a dictionary consisting of keys that specify a column and values that give a Python-statement in raw text, such as, e.g., “== ‘hand’”. The content of the specified column will then be checked against the statement passed in the dictionary, and if it is evaluated to True, the respective row will be written to file.
ref : str
Name of the column that contains the cognate IDs if ‘starling’ is chosen as an output format.
missing : { str, int } (default=0)
If ‘paps.nex’ or ‘paps.csv’ is chosen as fileformat, this character will be inserted as an indicator of missing data.
tree_calc : {‘neighbor’, ‘upgma’}
If no tree has been calculated and ‘tre’ or ‘nwk’ is chosen as output format, the method that is used to calculate the tree.
threshold : float (default=0.6)
The threshold that is used to carry out a flat cluster analysis if ‘groups’ or ‘cluster’ is chosen as output format.
ignore : { list, “all” } (default=”all”)
Modifies the output format in “tsv” output and allows to ignore certain blocks in extended “tsv”, like “msa”, “taxa”, “json”, etc., which should be passed as a list. If you choose “all” as a plain string and not a list, this will ignore all additional blocks and output only plain “tsv”.
prettify : bool (default=False)
Inserts comment characters between concepts in the “tsv” file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain “tsv”.
- renumber(source, target='', override=False)¶
Renumber a given set of string identifiers by replacing the ids by integers.
- Parameters
source : str
The source column to be manipulated.
target : str (default=’’)
The name of the target column. If no name is chosen, the target column will be manipulated by adding “id” to the name of the source column.
override : bool (default=False)
Force to overwrite the data if the target column already exists.
Notes
In addition to a new column, a further entry is added to the “_meta” attribute of the wordlist by which newly coined ids can be retrieved from the former string attributes. This attribute is called “source2target” and can be accessed either via the “_meta” dictionary or directly as an attribute of the wordlist.
- lingpy.basic.wordlist.from_cldf(path, to=<class 'lingpy.basic.wordlist.Wordlist'>, concept='Name', concepticon='Concepticon_ID', glottocode='Glottocode', language='Name')¶
Load data from CLDF into a LingPy Wordlist object or similar.
- Parameters
path : str
The path to the metadata-file of your CLDF dataset.
to : ~lingpy.basic.wordlist.Wordlist
A ~lingpy.basic.wordlist.Wordlist object or one of its descendants (LexStat, Alignments).
concept : str (default=’gloss’)
The name used for the basic gloss in the parameters.csv table.
glottocode : str (default=’glottocode’)
The default name for the column storing the Glottolog ID in the languages.csv table.
language : str (default=’name’)
The default name for the language name in the languages.csv table.
concepticon : str (default=’conceptset’)
The default name for the concept set in the parameters.csv table.
Notes
This function does not offer absolute flexibility regarding the data you can input so far. However, it can regularly read CLDF-formatted data into LingPy and thus allow you to use CLDF data in LingPy analyses.
- lingpy.basic.wordlist.get_wordlist(path, delimiter=',', quotechar='"', normalization_form='NFC', **keywords)¶
Load a wordlist from a normal CSV file.
- Parameters
path : str
The path to your CSV file.
delimiter : str
The delimiter in the CSV file.
quotechar : str
The quote character in your data.
row : str (default = “concept”)
A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.
col : str (default = “doculect”)
A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.
conf : string (default=’’)
A string defining the path to the configuration file.
Notes
This function returns a Wordlist object. In contrast to the normal way to load a wordlist from a tab-separated file, this allows you to load a wordlist directly from any “normal” CSV file, with your own specified delimiters and quote characters. If the first cell in the first row of your CSV file is not named “ID”, the integer identifiers required by LingPy will be created automatically.
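A rough stdlib-only sketch of this behaviour (illustrative; the real function returns a Wordlist object, not a plain dict):

```python
import csv
import io

def read_wordlist_csv(text, delimiter=",", quotechar='"'):
    """Parse CSV text into {integer_id: row_dict}. If the data has no
    "ID" column, consecutive integer identifiers are created."""
    reader = csv.DictReader(io.StringIO(text),
                            delimiter=delimiter, quotechar=quotechar)
    wordlist = {}
    for i, row in enumerate(reader, 1):
        key = int(row["ID"]) if "ID" in row else i
        wordlist[key] = row
    return wordlist

text = "DOCULECT,CONCEPT,IPA\nGerman,hand,hant\nEnglish,hand,hænd\n"
wl = read_wordlist_csv(text)
print(wl[2]["IPA"])  # hænd
```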
Module contents¶
This module provides basic classes for the handling of linguistic data.
The basic idea is to provide classes that allow the user to handle basic linguistic datatypes (spreadsheets, wordlists) in a consistent way.