lingpy.align package

Submodules

lingpy.align.multiple module

Module provides classes and functions for multiple alignment analyses.

class lingpy.align.multiple.Multiple(seqs, **keywords)

Bases: object

Basic class for multiple sequence alignment analyses.

Parameters

seqs : list

List of sequences that shall be aligned.

Notes

Depending on the structure of the sequences, further keywords can be specified that manage how the items get tokenized.

align(method, **kw)

get_local_peaks(threshold=2, gap_weight=0.0)

Return all peaks in a given alignment.

Parameters

threshold : { int, float } (default=2)

The threshold to determine whether a given column is a peak or not.

gap_weight : float (default=0.0)

The weight for gaps.

get_pairwise_alignments(**keywords)

Function creates a dictionary of all pairwise alignment scores.

Parameters

new_calc : bool (default=True)

Specify whether the analysis should be repeated from the beginning or whether the results of already conducted analyses should be reused.

model : string (default=”sca”)

A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:

  • “dolgo” – a sound-class model based on Dolgopolsky1986,

  • “sca” – an extension of the “dolgo” sound-class model based on List2012b, and

  • “asjp” – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).

mode : string (default=”global”)

A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:

  • “global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,

  • “dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.

gop : int (default=-3)

The gap opening penalty (GOP) used in the analysis.

gep_scale : float (default=0.6)

The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.

factor : float (default=1)

The factor by which the initial and the descending position shall be modified.

gap_weight : float (default=0)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

restricted_chars : string (default=”T”)

Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to “T”, since this is the character that represents tones in the prosodic strings of sequences.

get_peaks(gap_weight=0)

Calculate the profile score for each column of the alignment.

Parameters

gap_weight : float (default=0)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

Returns

peaks : list

A list containing the profile scores for each column of the given alignment.

get_pid(mode=1)

Return the Percentage Identity (PID) score of the calculated MSA.

Parameters

mode : { 1, 2, 3, 4, 5 } (default=1)

Indicate which of the four possible PID scores described in Raghava2006 should be calculated; a fifth possibility is added for linguistic purposes:

  1. identical positions / (aligned positions + internal gap positions),

  2. identical positions / aligned positions,

  3. identical positions / shortest sequence,

  4. identical positions / shortest sequence (including internal gap positions), or

  5. identical positions / (aligned positions + 2 * number of gaps).

Returns

score : float

The PID score of the given alignment as a floating point number between 0 and 1.
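
The first three PID variants can be illustrated for a single pair of aligned sequences. The following is a minimal sketch, not LingPy's implementation: the function name pid_sketch is hypothetical, and it assumes that an "internal gap position" is a gap with residues on both sides in the same sequence and that "shortest sequence" is the length of the shorter ungapped sequence.

```python
def pid_sketch(alm_a, alm_b, mode=1, gap="-"):
    """Percentage identity of two aligned sequences (modes 1 to 3 only)."""
    if len(alm_a) != len(alm_b):
        raise ValueError("aligned sequences must have equal length")

    def internal_gaps(alm):
        # Gap indices lying between the first and last residue of a row.
        residues = [i for i, seg in enumerate(alm) if seg != gap]
        if not residues:
            return set()
        return {i for i in range(residues[0], residues[-1]) if alm[i] == gap}

    columns = list(zip(alm_a, alm_b))
    identical = sum(1 for a, b in columns if a == b and a != gap)
    aligned = sum(1 for a, b in columns if a != gap and b != gap)
    internal = len(internal_gaps(alm_a) | internal_gaps(alm_b))
    shortest = min(
        sum(1 for seg in alm_a if seg != gap),
        sum(1 for seg in alm_b if seg != gap),
    )
    denominators = {1: aligned + internal, 2: aligned, 3: shortest}
    return identical / denominators[mode]


# Three identical columns, four aligned columns, one internal gap.
print(pid_sketch(list("abc-de"), list("abxfd-"), mode=2))
```

For an MSA, a score of this kind would have to be aggregated over all sequence pairs; how LingPy does this is not stated here.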

iterate_all_sequences(check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1, restricted_chars='T_')

Iterative refinement based on a complete realignment of all sequences.

Parameters

check : { “final”, “immediate” } (default=”final”)

Specify when to check for improved sum-of-pairs scores: After each iteration (“immediate”) or after all iterations have been carried out (“final”).

mode : { “global”, “overlap”, “dialign” } (default=”global”)

A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:

  • “global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,

  • “dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.

  • “overlap” – semi-global alignment, where gaps introduced in the beginning and the end of a sequence do not score.

gop : int (default=-5)

The gap opening penalty (GOP) used in the analysis.

gep_scale : float (default=0.5)

The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1981.

factor : float (default=0.3)

The factor by which the initial and the descending position shall be modified.

gap_weight : float (default=0)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

Notes

This method essentially follows the iterative method of Barton1987 with the exception that an MSA has already been calculated.

iterate_clusters(threshold, check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1, restricted_chars='T_')

Iterative refinement based on a flat cluster analysis of the data.

Parameters

threshold : float

The threshold for the flat cluster analysis.

check : string (default=”final”)

Specify when to check for improved sum-of-pairs scores: After each iteration (“immediate”) or after all iterations have been carried out (“final”).

mode : { “global”, “overlap”, “dialign” } (default=”global”)

A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:

  • ‘global’ – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,

  • ‘dialign’ – global alignment analysis which seeks to maximize local similarities Morgenstern1996.

  • ‘overlap’ – semi-global alignment, where gaps introduced in the beginning and the end of a sequence do not score.

gop : int (default=-5)

The gap opening penalty (GOP) used in the analysis.

gep_scale : float (default=0.6)

The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1981.

factor : float (default=0.3)

The factor by which the initial and the descending position shall be modified.

gap_weight : float (default=0)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

Notes

This method uses the lingpy.algorithm.clustering.flat_upgma() function in order to retrieve a flat cluster of the data.

iterate_orphans(check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1.0, restricted_chars='T_')

Iterate over the most divergent sequences in the sample.

Parameters

check : string (default=”final”)

Specify when to check for improved sum-of-pairs scores: After each iteration (“immediate”) or after all iterations have been carried out (“final”).

mode : { “global”, “overlap”, “dialign” } (default=”global”)

A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:

  • “global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,

  • “dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.

  • “overlap” – semi-global alignment, where gaps introduced in the beginning and the end of a sequence do not score.

gop : int (default=-5)

The gap opening penalty (GOP) used in the analysis.

gep_scale : float (default=0.6)

The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1981.

factor : float (default=0.3)

The factor by which the initial and the descending position shall be modified.

gap_weight : float (default=0)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

Notes

The most divergent sequences are those whose average distance to all other sequences is above the average distance of all sequence pairs.
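
The selection criterion in this note can be sketched on a plain distance matrix. This is an illustrative sketch only: the function name find_orphans and the example matrix are hypothetical, and the code shows the criterion, not the realignment step that follows it.

```python
def find_orphans(matrix):
    """Return indices of rows whose mean distance to all other rows
    exceeds the mean distance over all pairs."""
    n = len(matrix)
    pair_dists = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    overall_mean = sum(pair_dists) / len(pair_dists)
    orphans = []
    for i in range(n):
        mean_i = sum(matrix[i][j] for j in range(n) if j != i) / (n - 1)
        if mean_i > overall_mean:
            orphans.append(i)
    return orphans


# Hypothetical symmetric distances; the fourth sequence is far from the rest.
dists = [
    [0.0, 0.1, 0.2, 0.8],
    [0.1, 0.0, 0.1, 0.9],
    [0.2, 0.1, 0.0, 0.7],
    [0.8, 0.9, 0.7, 0.0],
]
print(find_orphans(dists))
```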

iterate_similar_gap_sites(check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1, restricted_chars='T_')

Iterative refinement based on the Similar Gap Sites heuristic.

Parameters

check : { “final”, “immediate” } (default=”final”)

Specify when to check for improved sum-of-pairs scores: After each iteration (“immediate”) or after all iterations have been carried out (“final”).

mode : { “global”, “overlap”, “dialign” } (default=”global”)

A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:

  • ‘global’ – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,

  • ‘dialign’ – global alignment analysis which seeks to maximize local similarities Morgenstern1996.

  • ‘overlap’ – semi-global alignment, where gaps introduced in the beginning and the end of a sequence do not score.

gop : int (default=-5)

The gap opening penalty (GOP) used in the analysis.

gep_scale : float (default=0.5)

The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.

factor : float (default=0.3)

The factor by which the initial and the descending position shall be modified.

gap_weight : float (default=1)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When, e.g., set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

Notes

This heuristic is fairly simple. The idea is to try to split a given MSA into partitions with identical gap sites.
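
The partitioning idea can be sketched as grouping the rows of an MSA by the positions of their gaps. This is a minimal sketch under the assumption that gaps are written as "-"; the function name gap_site_partitions is hypothetical, and the realignment of the resulting partitions is not shown.

```python
from collections import defaultdict


def gap_site_partitions(msa, gap="-"):
    """Group row indices of an MSA by their tuple of gap positions."""
    groups = defaultdict(list)
    for idx, row in enumerate(msa):
        sites = tuple(i for i, seg in enumerate(row) if seg == gap)
        groups[sites].append(idx)
    return list(groups.values())


# Rows 0, 1 and 3 share a gap in the third column; row 2 has no gap.
msa = [list("ma-ter"), list("mu-ter"), list("mother"), list("pa-ter")]
print(gap_site_partitions(msa))
```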

lib_align(**keywords)

Carry out a library-based progressive alignment analysis of the sequences.

Parameters

model : { “dolgo”, “sca”, “asjp” } (default=”sca”)

A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:

  • “dolgo” – a sound-class model based on Dolgopolsky1986,

  • “sca” – an extension of the “dolgo” sound-class model based on List2012b, and

  • “asjp” – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).

mode : { “global”, “dialign” } (default=”global”)

A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:

  • “global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,

  • “dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.

modes : list (default=[(“global”,-10,0.6),(“local”,-1,0.6)])

Indicate the mode, the gap opening penalties (GOP), and the gap extension scale (GEP scale), of the pairwise alignment analyses which are used to create the library.

gop : int (default=-5)

The gap opening penalty (GOP) used in the analysis.

gep_scale : float (default=0.6)

The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.

factor : float (default=1)

The factor by which the initial and the descending position shall be modified.

tree_calc : { “neighbor”, “upgma” } (default=”upgma”)

The cluster algorithm which shall be used for the calculation of the guide tree. Select between neighbor, the Neighbor-Joining algorithm (Saitou1987), and upgma, the UPGMA algorithm (Sokal1958).

guide_tree : tree_matrix

Use a custom guide tree instead of performing a cluster algorithm for constructing one based on the input similarities. The use of this option makes the tree_calc option irrelevant.

gap_weight : float (default=0)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

restricted_chars : string (default=”T”)

Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to “T”, since this is the character that represents tones in the prosodic strings of sequences.

Notes

In contrast to traditional progressive multiple sequence alignment approaches such as Feng1981 and Thompson1994, library-based progressive alignment Notredame2000 is based on a pre-processing of the data where the information given in global and local pairwise alignments of the input sequences is used to derive a refined scoring function (library) which is later used in the progressive phase.

prog_align(**keywords)

Carry out a progressive alignment analysis of the input sequences.

Parameters

model : { “dolgo”, “sca”, “asjp” } (default=”sca”)

A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:

  • “dolgo” – a sound-class model based on Dolgopolsky1986,

  • “sca” – an extension of the “dolgo” sound-class model based on List2012b, and

  • “asjp” – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).

mode : { “global”, “dialign” } (default=”global”)

A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:

  • “global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,

  • “dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.

gop : int (default=-2)

The gap opening penalty (GOP) used in the analysis.

scale : float (default=0.5)

The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.

factor : float (default=0.3)

The factor by which the initial and the descending position shall be modified.

tree_calc : { “neighbor”, “upgma” } (default=”upgma”)

The cluster algorithm which shall be used for the calculation of the guide tree. Select between neighbor, the Neighbor-Joining algorithm (Saitou1987), and upgma, the UPGMA algorithm (Sokal1958).

guide_tree : tree_matrix

Use a custom guide tree instead of performing a cluster algorithm for constructing one based on the input similarities. The use of this option makes the tree_calc option irrelevant.

gap_weight : float (default=0.5)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

restricted_chars : string (default=”T”)

Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to “T”, since this is the character that represents tones in the prosodic strings of sequences.

sum_of_pairs(alm_matrix='self', mat=None, gap_weight=0.0, gop=-1)

Calculate the sum-of-pairs score for a given alignment analysis.

Parameters

alm_matrix : { “self”, “other” } (default=”self”)

Indicate for which MSA the sum-of-pairs score shall be calculated.

mat : { None, list }

If “other” is chosen as an option for alm_matrix, define for which matrix the sum-of-pairs score shall be calculated.

gap_weight : float (default=0)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

Returns

The sum-of-pairs score of the alignment.
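
A generic, textbook-style sum-of-pairs score can be sketched as follows. It is not LingPy's own computation: the function name, the match and mismatch scores, and the way gap_weight scales the gap penalty are all assumptions made for illustration.

```python
from itertools import combinations


def sum_of_pairs_sketch(msa, match=1.0, mismatch=-1.0, gap_penalty=-1.0,
                        gap_weight=0.0, gap="-"):
    """Sum the column scores over all pairs of rows of an MSA."""
    total = 0.0
    for row_a, row_b in combinations(msa, 2):
        for a, b in zip(row_a, row_b):
            if a == gap and b == gap:
                continue  # a gap facing a gap carries no information
            if a == gap or b == gap:
                # Assumption: gaps count as a weighted gap penalty.
                total += gap_weight * gap_penalty
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


rows = ["abc", "abd", "a-c"]
print(sum_of_pairs_sketch(rows))                  # gaps ignored
print(sum_of_pairs_sketch(rows, gap_weight=0.5))  # gaps count half
```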

swap_check(swap_penalty=-3, score_mode='classes')

Check for possibly swapped sites in the alignment.

Parameters

swap_penalty : { int, float } (default=-3)

Specify the penalty for swaps in the alignment.

score_mode : { “classes”, “library” } (default=”classes”)

Define the score-mode of the calculation which is either based on sound classes proper, or on the specific scores derived from the library approach.

Returns

result : bool

Returns True if a swap was identified, and False otherwise. The information regarding the position of the swap is stored in the attribute swap_index.

Notes

The method for swap detection is described in detail in List2012b.

Examples

Define a set of strings whose alignment contains a swap.

>>> from lingpy import *
>>> mult = Multiple(["woldemort", "waldemar", "vladimir"])

Align the data, using the progressive approach.

>>> mult.prog_align()

Check for swaps.

>>> mult.swap_check()
True

Print the alignment.

>>> print(mult)
w   o   l   -   d   e   m   o   r   t
w   a   l   -   d   e   m   a   r   -
v   -   l   a   d   i   m   i   r   -

lingpy.align.multiple.mult_align(seqs, gop=-1, scale=0.5, tree_calc='upgma', scoredict=False, pprint=False)

A short-cut method for multiple alignment analyses.

Parameters

seqs : list

The input sequences.

gop : int (default=-1)

The gap opening penalty.

scale : float (default=0.5)

The scaling factor by which penalties for gap extensions are decreased.

tree_calc : { “upgma”, “neighbor” } (default=”upgma”)

The algorithm which is used for the calculation of the guide tree.

pprint : bool (default=False)

Indicate whether results shall be printed onto screen.

Returns

alignments : list

A two-dimensional list in which alignments are represented as a list of tokens.

Examples

>>> m = mult_align(["woldemort", "waldemar", "vladimir"], pprint=True)
w   o   l   -   d   e   m   o   r   t
w   a   l   -   d   e   m   a   r   -
-   v   l   a   d   i   m   i   r   -

lingpy.align.pairwise module

Module provides classes and functions for pairwise alignment analyses.

class lingpy.align.pairwise.Pairwise(seqs, seqB=False, **keywords)

Bases: object

Basic class for the handling of pairwise sequence alignments (PSA).

Parameters

seqs : { str, list }

Either the first string of a sequence pair that shall be aligned, or a list of sequence tuples.

seqB : { str, bool } (default=False)

Define the second sequence that shall be aligned with the first sequence, if only two sequences shall be compared.

align(**keywords)

Align a pair of sequences or multiple sequence pairs.

Parameters

gop : int (default=-1)

The gap opening penalty (GOP).

scale : float (default=0.5)

The gap extension penalty (GEP), calculated with help of a scaling factor.

mode : { “global”, “local”, “overlap”, “dialign” }

The alignment mode, see List2012a for details.

factor : float (default = 0.3)

The factor by which matches in identical prosodic position are increased.

restricted_chars : str (default=”T_”)

The restricted chars that function as an indicator of syllable or morpheme breaks for secondary alignment, see List2012c for details.

distance : bool (default=False)

If set to True, return the distance instead of the similarity score. Distance is calculated using the formula by Downey2008.

model : { None, lingpy.data.model.Model }

Specify the sound class model that shall be used for the analysis. If no model is specified, the default model of List2012a will be used.

pprint : bool (default=False)

If set to True, the alignments are printed to the screen.
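
The conversion from similarity to distance mentioned for the distance keyword is commonly stated as one minus twice the similarity of the pair, divided by the sum of the two self-similarities. The sketch below assumes that form; the function name downey_distance and the example scores are hypothetical, and the code is not taken from LingPy.

```python
def downey_distance(sim_ab, sim_aa, sim_bb):
    """Normalize a pairwise similarity score to a distance.

    sim_ab is the score of aligning A with B; sim_aa and sim_bb are the
    scores of aligning each sequence with itself.
    """
    return 1 - (2 * sim_ab) / (sim_aa + sim_bb)


# Identical sequences give a distance of 0.
print(downey_distance(5.0, 5.0, 5.0))
# A weaker match between A and B gives a larger distance.
print(downey_distance(3.0, 7.0, 6.0))
```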

lingpy.align.pairwise.edit_dist(seqA, seqB, normalized=False, restriction='')

Return the edit distance between two strings.

Parameters

seqA, seqB : str

The strings that shall be compared.

normalized : bool (default=False)

Specify whether the normalized edit distance shall be returned. If no restrictions are chosen, the edit distance is normalized by dividing by the length of the longer string. If restriction is set to cv (consonant-vowel), the edit distance is normalized by the length of the alignment.

restriction : {“cv”} (default=””)

Specify the restrictions to be used. Currently, only cv is supported. This prohibits matches of vowels with consonants.

Returns

dist : { int, float }

The edit distance, which is a float if normalized is set to True, and an integer otherwise.

Notes

The edit distance was first formally defined by V. I. Levenshtein (Levenshtein1965). The first algorithm to compute the edit distance was proposed by Wagner and Fischer (Wagner1974).
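
The dynamic-programming scheme referred to here can be sketched in a few lines. This is a plain textbook version written for illustration, not LingPy's code; the function name edit_distance_sketch is hypothetical, and the cv restriction is not covered.

```python
def edit_distance_sketch(seq_a, seq_b, normalized=False):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # table[i][j] holds the distance between seq_a[:i] and seq_b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    dist = table[-1][-1]
    if normalized:
        # Divide by the length of the longer string.
        longest = max(len(seq_a), len(seq_b))
        return dist / longest if longest else 0.0
    return dist


print(edit_distance_sketch("fat cat", "catfat"))
```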

Examples

Align two sequences:

>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> edit_dist(seqA, seqB)
3

lingpy.align.pairwise.nw_align(seqA, seqB, scorer=False, gap=-1)

Carry out the traditional Needleman-Wunsch algorithm.

Parameters

seqA, seqB : {str, list, tuple}

The input strings. These should be iterables, so you can use tuples, lists, or strings.

scorer : dict (default=False)

If set to False, a scorer will automatically be calculated; otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings (segment matches need to be passed as tuples of two segments, following the order of the input sequences). Note also that the scorer may be asymmetric, so you could also use it for two completely different alphabets. All you need to make sure is that the tuples representing the segment matches follow the order of your input sequences.

gap : int (default=-1)

The gap penalty.

Returns

alm : tuple

A tuple consisting of the alignments of the first and the second sequence, and the alignment score.

Notes

The Needleman-Wunsch algorithm (see Needleman1970) returns a global alignment of two sequences.

>>> print(' '.join(almA)+'\n'+' '.join(almB), "(sim={0})".format(sim))
a b a b -
- b a b a (sim=1)

Nothing unexpected so far; you could reach the same result without the scorer. But now let’s make a scorer that favors mismatches for our little two-letter alphabet:

>>> scorer = { ('a','b'): 1, ('a','a'):-1, ('b','b'):-1, ('b', 'a'): 1}
>>> seqA, seqB = 'abab', 'baba'
>>> almA, almB, sim = nw_align(seqA, seqB, scorer=scorer)
>>> print(' '.join(almA)+'\n'+' '.join(almB), "(sim={0})".format(sim))
a b a b
b a b a (sim=4)

Now, let’s analyse two strings which are completely different, but where we use the scorer to define mappings between the segments. We simply do this by using lower case letters in one and upper case letters in the other case, which will, of course, be treated as different symbols in Python:

>>> scorer = { ('A','a'): 1, ('A','b'):-1, ('B','a'):-1, ('B', 'B'): 1}
>>> seqA, seqB = 'ABAB', 'aa'
>>> almA, almB, sim = nw_align(seqA, seqB, scorer=scorer)
>>> print(' '.join(almA)+'\n'+' '.join(almB), "(sim={0})".format(sim))
A B A B
a - a - (sim=0)
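
To show how a scorer dictionary and a linear gap penalty interact in global alignment, here is a minimal textbook Needleman-Wunsch sketch. It is not LingPy's implementation: the function name nw_sketch is hypothetical, and the default scorer (+1 for identical segments, -1 otherwise) is an assumption. When several optimal alignments exist, the traceback order used here may pick a different one than nw_align does.

```python
def nw_sketch(seq_a, seq_b, scorer=None, gap=-1):
    """Global alignment; returns (alm_a, alm_b, score)."""
    if scorer is None:
        # Assumed default: +1 for identical segments, -1 otherwise.
        scorer = {(a, b): (1 if a == b else -1) for a in seq_a for b in seq_b}
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]],
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal path.
    alm_a, alm_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j] ==
                score[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]]):
            alm_a.append(seq_a[i - 1])
            alm_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            alm_a.append(seq_a[i - 1])
            alm_b.append("-")
            i -= 1
        else:
            alm_a.append("-")
            alm_b.append(seq_b[j - 1])
            j -= 1
    return alm_a[::-1], alm_b[::-1], score[n][m]


# A scorer that rewards mismatches, as in the example above.
favor_mismatch = {("a", "b"): 1, ("a", "a"): -1, ("b", "b"): -1, ("b", "a"): 1}
print(nw_sketch("abab", "baba", scorer=favor_mismatch))
```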

lingpy.align.pairwise.pw_align(seqA, seqB, gop=-1, scale=0.5, scorer=False, mode='global', distance=False, **keywords)

Align two sequences in various ways.

Parameters

seqA, seqB : {str, list, tuple}

The input strings. These should be iterables, so you can use tuples, lists, or strings.

scorer : dict (default=False)

If set to False, a scorer will automatically be calculated; otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings.

gop : int (default=-1)

The gap opening penalty.

scale : float (default=0.5)

The gap extension scale. This scale is similar to the gap extension penalty, but in contrast to the traditional GEP, it “scales” the gap opening penalty.

mode : {“global”, “local”, “dialign”, “overlap”} (default=”global”)

Select between one of the four different alignment modes regularly implemented in LingPy, see List2012a for details.

distance : bool (default=False)

If set to True, return the distance score following the formula by Downey2008. Otherwise, return the basic similarity score.
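
One way to read the scale parameter is that the first position of a gap costs the full gap opening penalty and every further position costs the penalty multiplied by the scale. The sketch below assumes exactly that reading; the function name gap_cost is hypothetical, and the actual scoring in LingPy may include further position-specific modifications.

```python
def gap_cost(length, gop=-1, scale=0.5):
    """Cost of one contiguous gap of the given length.

    Assumption: opening costs gop, each extension costs gop * scale.
    """
    if length <= 0:
        return 0.0
    return gop + (length - 1) * gop * scale


for length in (1, 2, 3):
    print(length, gap_cost(length))
```

With scale=1 every gap position costs the same as the opening, which corresponds to a linear gap penalty.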

Examples

Align two words using the dialign algorithm:

>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> pw_align(seqA, seqB, mode='dialign')
(['f', 'a', 't', ' ', 'c', 'a', 't', '-', '-', '-'],
 ['-', '-', '-', '-', 'c', 'a', 't', 'f', 'a', 't'],
 3.0)

lingpy.align.pairwise.structalign(seqA, seqB)

Experimental function for testing structural alignment algorithms.

lingpy.align.pairwise.sw_align(seqA, seqB, scorer=False, gap=-1)

Carry out the traditional Smith-Waterman algorithm.

Parameters

seqA, seqB : {str, list, tuple}

The input strings. These should be iterables, so you can use tuples, lists, or strings.

scorer : dict (default=False)

If set to False, a scorer will automatically be calculated; otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings.

gap : int (default=-1)

The gap penalty.

Returns

alm : tuple

A tuple consisting of prefix, alignment, and suffix of the first and the second sequence, and the alignment score.

Notes

The Smith-Waterman algorithm (see Smith1981) returns a local alignment between two sequences. A local alignment is an alignment of those subsequences of the input sequences that yields the highest score.
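
The difference from global alignment is that cell scores are never allowed to drop below zero and the traceback starts from the best-scoring cell. The following textbook sketch illustrates this under assumed scores of +1 for a match and -1 for a mismatch or gap; the function name sw_sketch is hypothetical, and unlike sw_align it returns only the aligned part, without prefix and suffix.

```python
def sw_sketch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best local alignment; returns (alm_a, alm_b, score)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,  # a local alignment may restart anywhere
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    alm_a, alm_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            alm_a.append(seq_a[i - 1])
            alm_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            alm_a.append(seq_a[i - 1])
            alm_b.append("-")
            i -= 1
        else:
            alm_a.append("-")
            alm_b.append(seq_b[j - 1])
            j -= 1
    return alm_a[::-1], alm_b[::-1], best


print(sw_sketch("fat cat", "catfat"))
```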

Examples

Align two sequences:

>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> sw_align(seqA, seqB)
(([], ['f', 'a', 't'], [' ', 'c', 'a', 't']),
 (['c', 'a', 't'], ['f', 'a', 't'], []),
 3.0)

lingpy.align.pairwise.turchin(seqA, seqB, model='dolgo', **keywords)

Return cognate judgment based on the method by Turchin2010.

Parameters

seqA, seqB : {str, list, tuple}

The input strings. These should be iterables, so you can use tuples, lists, or strings.

model : {“asjp”, “sca”, “dolgo”} (default=”dolgo”)

A sound-class model instance or a string that denotes one of the standard sound class models used in LingPy.

Returns

cognacy : {0, 1}

The cognacy assertion which is either 0 (words are probably cognate) or 1 (words are not likely to be cognate).
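
The idea behind this kind of cognate judgment, comparing the sound classes of the first two consonants of each word, can be sketched with a toy class mapping. Everything below is hypothetical: the mapping TOY_CLASSES is a small stand-in and not the actual "dolgo" model, the function name is invented, and segments missing from the mapping (vowels included) are simply skipped.

```python
# Toy consonant classes for illustration only; not the real "dolgo" model.
TOY_CLASSES = {
    "p": "P", "b": "P", "f": "P",
    "t": "T", "d": "T",
    "s": "S", "z": "S",
    "k": "K", "g": "K",
    "m": "M", "n": "N",
    "r": "R", "l": "R",
    "w": "W", "v": "W",
}


def turchin_sketch(seq_a, seq_b, classes=TOY_CLASSES):
    """Return 0 if the first two consonant classes match, else 1."""
    def first_two(seq):
        # Keep only segments that have a class, then take the first two.
        return [classes[seg] for seg in seq if seg in classes][:2]

    return 0 if first_two(seq_a) == first_two(seq_b) else 1


print(turchin_sketch("pater", "father"))  # classes P, T in both words
print(turchin_sketch("pater", "mater"))   # P, T versus M, T
```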

lingpy.align.pairwise.we_align(seqA, seqB, scorer=False, gap=-1)

Carry out the traditional Waterman-Eggert algorithm.

Parameters

seqA, seqB : {str, list, tuple}

The input strings. These should be iterables, so you can use tuples, lists, or strings.

scorer : dict (default=False)

If set to False, a scorer will automatically be calculated; otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings.

gap : int (default=-1)

The gap penalty.

Returns

alms : list

A list consisting of tuples. Each tuple gives the alignment of one of the subsequences of the input sequences. Each tuple contains the aligned part of the first, the aligned part of the second sequence, and the score of the alignment.

Notes

The Waterman-Eggert algorithm (see Waterman1987) returns all local matches between two sequences.

Examples

Align two sequences:

>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> we_align(seqA, seqB)
[(['f', 'a', 't'], ['f', 'a', 't'], 3.0),
 (['c', 'a', 't'], ['c', 'a', 't'], 3.0)]

lingpy.align.sca module

Basic module for pairwise and multiple sequence comparison.

The module consists of four classes which deal with pairwise and multiple sequence comparison from the sequence and the alignment perspective. The sequence perspective deals with unaligned sequences. The alignment perspective deals with aligned sequences.

class lingpy.align.sca.Alignments(infile, row='concept', col='doculect', conf='', modify_ref=False, _interactive=True, split_on_tones=False, ref='cogid', **keywords)

Bases: lingpy.basic.wordlist.Wordlist

Class handles Wordlists for the purpose of alignment analyses.

Parameters

infile : str

The name of the input file that should conform to the basic format of the lingpy.basic.wordlist.Wordlist class and define a specific ID for cognate sets.

row : str (default = “concept”)

A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.

col : str (default = “doculect”)

A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.

conf : string (default=’’)

A string defining the path to the configuration file.

ref : string (default=’cogid’)

The name of the column that stores the cognate IDs.

modify_ref : function (default=False)

Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to the built-in function abs, and all cognate IDs will be converted to their absolute value.

split_on_tones : bool (default=False)

If set to True, this means that in the case of fuzzy alignment mode, the algorithm will attempt to split words into morphemes by tones if no explicit morpheme markers can be found.

Notes

This class inherits from Wordlist and additionally creates instances of the Multiple class for all cognate sets that are specified by the ref keyword.

Attributes

msa : dict

A dictionary storing multiple alignments as dictionaries which can be directly opened and aligned with help of the lingpy.align.sca.SCA function. The alignment objects are referenced by a key which is identical with the “reference” (ref-keyword) of the alignment, that is, the name of the column which contains the cognate identifiers.

add_alignments(ref=False, modify_ref=False, fuzzy=False, split_on_tones=True, override=False)

Function adds a new set of alignments to the data.

Parameters

ref : str (default=False)

Use this to set the name of the column which contains the cognate sets.

fuzzy : bool (default=False)

If set to True, force the algorithm to treat the cognate sets as fuzzy cognate sets, i.e., as multiple cognate sets which are assigned in order to a word (proper “partial cognates”).

align(**keywords)

Carry out a multiple alignment analysis of the data.

Parameters

method : { “progressive”, “library” } (default=”progressive”)

Select the method to use for the analysis.

iteration : bool (default=False)

Set to True in order to use iterative refinement methods.

swap_check : bool (default=False)

Set to True in order to carry out a swap-check.

model : { ‘dolgo’, ‘sca’, ‘asjp’ }

A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:

  • “dolgo” – a sound-class model based on Dolgopolsky1986,

  • “sca” – an extension of the “dolgo” sound-class model based on List2012b, and

  • “asjp” – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).

mode : { ‘global’, ‘dialign’ }

A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:

  • “global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,

  • “dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.

modes : list (default=[(‘global’,-2,0.5),(‘local’,-1,0.5)])

Indicate the mode, the gap opening penalties (GOP), and the gap extension scale (GEP scale), of the pairwise alignment analyses which are used to create the library.

gop : int (default=-5)

The gap opening penalty (GOP) used in the analysis.

scale : float (default=0.6)

The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.

factor : float (default=1)

The factor by which the initial and the descending position shall be modified.

tree_calc : { ‘neighbor’, ‘upgma’ } (default=’upgma’)

The cluster algorithm which shall be used for the calculation of the guide tree. Select between neighbor, the Neighbor-Joining algorithm (Saitou1987), and upgma, the UPGMA algorithm (Sokal1958).

gap_weight : float (default=0)

The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

restricted_chars : string (default=”T”)

Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to “T”, since this is the character that represents tones in the prosodic strings of sequences.
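The interplay of gop, scale, and gap_weight described above can be illustrated with a small standalone sketch. It does not call LingPy and is not LingPy's implementation: the function names gap_cost and column_score are hypothetical, the gap extension penalty is assumed to be gop multiplied by scale, and identity of segments stands in for the similarity scores of a sound-class model.

```python
from itertools import combinations


def gap_cost(length, gop=-5, scale=0.6):
    """Cost of one gap spanning `length` positions (assumed affine scheme)."""
    if length <= 0:
        return 0.0
    # The first gapped position pays the full opening penalty; every
    # further position pays the opening penalty reduced by `scale`.
    return gop + (length - 1) * gop * scale


def column_score(column, gap_weight=0.0, gap="-"):
    """Weighted share of identical segment pairs in one alignment column."""
    total = weight_sum = 0.0
    for a, b in combinations(column, 2):
        if a == gap and b == gap:
            continue  # gap-gap pairs carry no information
        # Pairs involving a gap count with `gap_weight`, all others with 1.
        weight = gap_weight if gap in (a, b) else 1.0
        weight_sum += weight
        if a == b:
            total += weight
    return total / weight_sum if weight_sum else 0.0


print(gap_cost(1), gap_cost(3))
print(column_score(["a", "a", "-"], gap_weight=0.0))
print(column_score(["a", "a", "-"], gap_weight=0.5))
```

With gap_weight set to 0 the gapped pairs drop out of the score entirely, while 0.5 lets them count half as much as pairs of ordinary segments, which mirrors the description of the parameter.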

get_confidence(scorer, ref='lexstatid', gap_weight=0.25)

Function creates confidence scores for a given set of alignments.

Parameters

scorer : ScoreDict

A ScoreDict object which gives similarity scores for all segments in the alignment.

ref : str (default=”lexstatid”)

The reference entry-type, referring to the cognate-set to be used for the analysis.

gap_weight : float (default=0.25)

Determine the weight assigned to matches containing gaps.

get_consensus(tree=False, gaps=False, classes=False, consensus='consensus', counterpart='ipa', weights=[], return_data=False, **keywords)

Calculate a consensus string of all MSAs in the wordlist.

Parameters

msa : { list, lingpy.align.multiple.Multiple }

Either an MSA object or an MSA matrix.

tree : { str, lingpy.thirdparty.cogent.PhyloNode }

A tree object or a Newick string along which the consensus shall be calculated.

gaps : bool (default=False)

If set to True, return the gap positions in the consensus.

classes : bool (default=False)

Specify whether sound classes shall be used to calculate the consensus.

model : lingpy.data.model.Model

A sound class model according to which the IPA strings shall be converted to sound-class strings.

return_data : bool (default=False)

Return the data instead of adding it in a column to the wordlist object.

get_msa(ref)
output(fileformat, **keywords)

Write wordlist to file.

Parameters

fileformat : { “tsv”, “msa”, “tre”, “nwk”, “dst”, “taxa”, “starling”, “paps.nex”, “paps.csv”, “html” }

The format that is written to file. This corresponds to the file extension, thus ‘tsv’ creates a file in tsv-format, ‘dst’ creates a file in Phylip-distance format, etc. Specific output is created for the formats “html” and “msa”:

  • “msa” will create a folder containing all alignments of all cognate sets in “msa”-format

  • “html” will create html-output in which words are sorted according to meaning, cognate set, and all cognate words are aligned

filename : str

Specify the name of the output file (defaults to a filename that indicates the creation date).

subset : bool (default=False)

If set to True, return only a subset of the data. The subset is specified by the keywords ‘cols’ and ‘rows’.

cols : list

If subset is set to True, specify the columns that shall be written to the csv-file.

rows : dict

If subset is set to True, use a dictionary consisting of keys that specify a column and values that give a Python statement in raw text, e.g., “== ‘hand’”. The content of the specified column is then checked against the statement passed in the dictionary, and if it evaluates to True, the respective row is written to file.

ref : str

Name of the column that contains the cognate IDs if ‘starling’ is chosen as an output format.

missing : { str, int } (default=0)

If ‘paps.nex’ or ‘paps.csv’ is chosen as fileformat, this character will be inserted as an indicator of missing data.

tree_calc : {‘neighbor’, ‘upgma’}

If no tree has been calculated and ‘tre’ or ‘nwk’ is chosen as output format, the method that is used to calculate the tree.

threshold : float (default=0.6)

The threshold that is used to carry out a flat cluster analysis if ‘groups’ or ‘cluster’ is chosen as output format.

style : str (default=”id”)

If “msa” is chosen as output format, this will write the alignments for each msa-file in a specific format in which the first column contains a direct reference to the word via its ID in the wordlist.

ignore : { list, “all” }

Modifies the output format in “tsv” output and makes it possible to ignore certain blocks in extended “tsv”, like “msa”, “taxa”, “json”, etc., which should be passed as a list. If you choose “all” as a plain string and not a list, this will ignore all additional blocks and output only plain “tsv”.

prettify : bool (default=True)

Inserts comment characters between concepts in the “tsv” file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain “tsv”.
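The subset, cols, and rows keywords described above amount to a filter over the rows of the wordlist. The following standalone sketch shows one way such a filter could work. It is not LingPy's code: select_rows and the sample records are hypothetical, and evaluating raw statements with eval is only acceptable for trusted input.

```python
def select_rows(records, rows, cols=None):
    """Keep records whose column values satisfy every statement in `rows`."""
    selected = []
    for record in records:
        # Each statement such as "== 'hand'" is appended to the repr of the
        # cell value and evaluated as a Python expression.
        keep = all(
            eval(repr(record[col]) + " " + statement, {"__builtins__": {}})
            for col, statement in rows.items()
        )
        if keep:
            # Write only the requested columns, or all of them by default.
            selected.append([record[c] for c in (cols or list(record))])
    return selected


# Hypothetical wordlist rows, not taken from a LingPy dataset.
records = [
    {"doculect": "German", "concept": "hand", "ipa": "hant"},
    {"doculect": "English", "concept": "hand", "ipa": "hænd"},
    {"doculect": "German", "concept": "foot", "ipa": "fuːs"},
]
subset = select_rows(records, {"concept": "== 'hand'"}, cols=["doculect", "ipa"])
print(subset)
```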

reduce_alignments(alignment=False, ref=False)

Function reduces alignments which contain columns that are marked to be ignored by the user.

Notes

This function changes the data only internally: All alignments are checked as to whether they contain data that should be ignored. If this is the case, the alignments are then reduced, and stored in a specific item of the alignment string. If the method doesn’t find any instances for reduction, it still makes copies of the alignments in order to guarantee that the alignments with which we want to work are at the same place in the dictionary.
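The reduction step can be pictured with a minimal standalone sketch. How LingPy marks columns to be ignored is not specified here, so the sketch assumes a plain collection of column indices. The name reduce_alignment and the sample data are hypothetical.

```python
def reduce_alignment(alignment, ignore=()):
    """Return a copy of `alignment` without the column indices in `ignore`."""
    skip = set(ignore)
    # A new list is built for every row, so a copy is returned even when
    # no column is dropped.
    return [
        [segment for idx, segment in enumerate(row) if idx not in skip]
        for row in alignment
    ]


# Hypothetical aligned cognate set in which the last column is ignored.
alignment = [["h", "a", "n", "t"], ["h", "æ", "n", "d"]]
reduced = reduce_alignment(alignment, ignore=[3])
unchanged = reduce_alignment(alignment)
print(reduced)
```

Returning a copy in both cases mirrors the note above: later steps can always read the reduced version from the same place, whether or not anything was removed.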

class lingpy.align.sca.MSA(infile, **keywords)

Bases: lingpy.align.multiple.Multiple

Basic class for carrying out multiple sequence alignment analyses.

Parameters

infile : file

A file in msq-format or msa-format.

merge_vowels : bool (default=True)

Indicate whether neighboring vowels should be merged into diphthongs, or whether they should be kept separated during the analysis.

comment : char (default=’#’)

The comment character which, inserted in the beginning of a line, prevents that line from being read.

normalize : bool (default=True)

Normalize the alignment, that is, add gap characters for all sequences which are shorter than the longest sequence, and delete all columns from the alignment in which only gaps occur.
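The normalization described above can be sketched without LingPy as follows. This is an illustration under the assumption that an alignment is a list of segment lists and that “-” is the gap symbol. The function name normalize is hypothetical.

```python
def normalize(alignment, gap="-"):
    """Pad short rows with gaps, then drop columns that contain only gaps."""
    width = max((len(row) for row in alignment), default=0)
    # Step 1: extend every sequence to the length of the longest one.
    padded = [list(row) + [gap] * (width - len(row)) for row in alignment]
    # Step 2: keep only the columns with at least one non-gap segment.
    keep = [i for i in range(width) if any(row[i] != gap for row in padded)]
    return [[row[i] for i in keep] for row in padded]


normalized = normalize([["w", "o", "-", "l"], ["w", "a", "-"]])
print(normalized)
```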

Notes

There are two possible input formats for this class: the MSQ-format, and the MSA-format (see Multiple Alignments (MSQ and MSA) for details). This class directly inherits all methods of the Multiple class.

Examples

Get the path to a file from the testset.

>>> from lingpy import *
>>> path = rc("test_path")+'harry.msq'

Load the file into the MSA class.

>>> mult = MSA(path)

Carry out a progressive alignment analysis of the sequences.

>>> mult.prog_align()

Print the result to the screen:

>>> print(mult)
w    o    l    -    d    e    m    o    r    t
w    a    l    -    d    e    m    a    r    -
v    -    l    a    d    i    m    i    r    -

ipa2cls(**keywords)

Retrieve sound-class strings from aligned IPA sequences.

Parameters

model : str (default=’sca’)

The sound-class model according to which the sequences shall be converted.

Notes

This function is only useful when an msa-file with already conducted alignment analyses was loaded.

output(fileformat='msa', filename=None, sorted_seqs=False, unique_seqs=False, **keywords)

Write data to file.

Parameters

fileformat : { “psa”, “msa”, “msq”, “html” }

Indicate which data should be written to file. Select between:

  • “psa” – output of all pairwise alignments in psa-format,

  • “msa” – output of the multiple alignment in msa-format,

  • “msq” – output of the multiple sequences in msq-format, or

  • “html” – output of the multiple alignment in html-format.

filename : str

Select a specific name for the outfile; otherwise, the name of the infile is used by default.

sorted_seqs : bool

Indicate whether the sequences should be sorted or not (applies only to ‘msa’ and ‘msq’ output).

unique_seqs : bool

Indicate whether only unique sequences should be written to file or not.

class lingpy.align.sca.PSA(infile, **keywords)

Bases: lingpy.align.pairwise.Pairwise

Basic class for dealing with the pairwise alignment of sequences.

Parameters

infile : file

A file in psq-format.

merge_vowels : bool (default=True)

Indicate whether neighboring vowels should be merged into diphthongs, or whether they should be kept separated during the analysis.

comment : char (default=’#’)

The comment character which, inserted in the beginning of a line, prevents that line from being read.

Notes

In order to read in data from text files, two different file formats can be used along with this class: the PSQ-format, and the PSA-format (see Pairwise Alignments (PSQ and PSA) for details). This class inherits the methods of the Pairwise class.

Attributes

taxa

list

A list of tuples containing the taxa of all sequence pairs.

seqs

list

A list of tuples containing all sequence pairs.

tokens

list

A list of tuples containing all sequence pairs in a tokenized form.

output(fileformat='psa', filename=None, **keywords)

Write the results of the analyses to a text file.

Parameters

fileformat : { ‘psa’, ‘psq’ }

Indicate which data should be written to file. Select between:

  • ‘psa’ – output of all pairwise alignments in psa-format, or

  • ‘psq’ – output of all sequence pairs in psq-format.

filename : str

Select a specific name for the outfile; otherwise, the name of the infile is used by default.

lingpy.align.sca.SCA(infile, **keywords)

Method returns alignment objects depending on input file or input data.

Notes

This method checks for the type of an alignment object and returns an alignment object of the respective type.

lingpy.align.sca.get_consensus(msa, gaps=False, taxa=False, classes=False, **keywords)

Calculate a consensus string of a given MSA.

Parameters

msa : { list, lingpy.align.multiple.Multiple }

Either an MSA object or an MSA matrix.

gaps : bool (default=False)

If set to True, return the gap positions in the consensus.

taxa : { list, bool } (default=False)

If tree is chosen as a parameter, specify the taxa in order of the aligned strings.

classes : bool (default=False)

Specify whether sound classes shall be used to calculate the consensus.

model : lingpy.data.model.Model

A sound class model according to which the IPA strings shall be converted to sound-class strings.

local : { bool, “peaks”, “gaps” } (default=False)

Specify whether local pre-processing should be applied to the data. If set to “peaks”, the average alignment score of each column is taken as reference to remove low-scoring columns from the alignment. If set to “gaps”, the columns with the highest proportion of gaps will be excluded.

Returns

cons : str

A consensus string of the given MSA.
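The idea of a consensus over an MSA matrix can be shown with a simplified standalone sketch. It uses a plain majority vote per column and is not the algorithm implemented by lingpy.align.sca.get_consensus, which can additionally take sound classes, a tree, and local pre-processing into account. The name majority_consensus is hypothetical. The matrix is the alignment printed in the MSA example.

```python
from collections import Counter


def majority_consensus(matrix, gaps=False, gap="-"):
    """Pick the most frequent symbol of every column of an MSA matrix."""
    consensus = []
    for column in zip(*matrix):
        # most_common(1) yields the symbol with the highest count.
        segment = Counter(column).most_common(1)[0][0]
        # Gap positions are kept only when `gaps` is set to True.
        if segment != gap or gaps:
            consensus.append(segment)
    return consensus


matrix = [
    list("wol-demort"),
    list("wal-demar-"),
    list("v-ladimir-"),
]
print("".join(majority_consensus(matrix)))
print("".join(majority_consensus(matrix, gaps=True)))
```

A plain majority vote treats all sequences as equally informative. Weighting the vote by sound-class similarity or by position in a guide tree is the kind of refinement the classes, model, and tree parameters point to.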

Module contents

Package provides basic modules for alignment analyses.