lingpy.align package
Submodules
lingpy.align.multiple module
Module provides classes and functions for multiple alignment analyses.
- class lingpy.align.multiple.Multiple(seqs, **keywords)
Bases: object
Basic class for multiple sequence alignment analyses.
- Parameters
seqs : list
List of sequences that shall be aligned.
Notes
Depending on the structure of the sequences, further keywords can be specified that manage how the items get tokenized.
- align(method, **kw)
- get_local_peaks(threshold=2, gap_weight=0.0)
Return all peaks in a given alignment.
- Parameters
threshold : { int, float } (default=2)
The threshold to determine whether a given column is a peak or not.
gap_weight : float (default=0.0)
The weight for gaps.
- get_pairwise_alignments(**keywords)
Function creates a dictionary of all pairwise alignment scores.
- Parameters
new_calc : bool (default=True)
Specify whether the analysis should be repeated from the beginning, or whether the results of already conducted analyses should be reused.
model : string (default=“sca”)
A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:
“dolgo” – a sound-class model based on Dolgopolsky1986,
“sca” – an extension of the “dolgo” sound-class model based on List2012b, and
“asjp” – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).
mode : string (default=“global”)
A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
“global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
“dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.
gop : int (default=-3)
The gap opening penalty (GOP) used in the analysis.
gep_scale : float (default=0.6)
The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=1)
The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=0)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
restricted_chars : string (default=“T”)
Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to “T”, since this is the character that represents tones in the prosodic strings of sequences.
- get_peaks(gap_weight=0)
Calculate the profile score for each column of the alignment.
- Parameters
gap_weight : float (default=0)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
- Returns
peaks : list
A list containing the profile scores for each column of the given alignment.
- get_pid(mode=1)
Return the Percentage Identity (PID) score of the calculated MSA.
- Parameters
mode : { 1, 2, 3, 4, 5 } (default=1)
Indicate which of the four possible PID scores described in Raghava2006 should be calculated; a fifth possibility is added for linguistic purposes:
1. identical positions / (aligned positions + internal gap positions),
2. identical positions / aligned positions,
3. identical positions / shortest sequence,
4. identical positions / shortest sequence (including internal gap positions), or
5. identical positions / (aligned positions + 2 * number of gaps).
- Returns
score : float
The PID score of the given alignment as a floating point number between 0 and 1.
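The first three ratios can be made concrete for a single pair of aligned rows. The following is a minimal sketch, not LingPy's implementation: the helper name `pairwise_pid` is hypothetical, gaps are assumed to be written as "-", "aligned positions" are read as columns without a gap, and "internal gap positions" as gap columns lying between the first and the last aligned column.

```python
def pairwise_pid(row_a, row_b, mode=1, gap="-"):
    """Percentage identity of two equally long aligned rows (modes 1-3 only)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(row_a, row_b))
    # Indices of columns in which both rows carry a segment.
    aligned_idx = [i for i, (a, b) in enumerate(columns) if gap not in (a, b)]
    if not aligned_idx:
        return 0.0
    identical = sum(1 for i in aligned_idx if columns[i][0] == columns[i][1])
    first, last = aligned_idx[0], aligned_idx[-1]
    # Gap columns between the first and last aligned column (assumed reading
    # of "internal gap positions").
    internal_gaps = sum(1 for i in range(first, last + 1) if gap in columns[i])
    if mode == 1:
        return identical / (len(aligned_idx) + internal_gaps)
    if mode == 2:
        return identical / len(aligned_idx)
    if mode == 3:
        shortest = min(
            sum(1 for seg in row if seg != gap) for row in (row_a, row_b)
        )
        return identical / shortest
    raise ValueError("this sketch only covers modes 1, 2 and 3")


row_a = list("wol-demort")
row_b = list("wal-demar-")
print(pairwise_pid(row_a, row_b, mode=2))
```

For a full MSA, one plausible extension is to average such pairwise values over all pairs of rows; whether LingPy does exactly this is not stated here.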
- iterate_all_sequences(check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1, restricted_chars='T_')
Iterative refinement based on a complete realignment of all sequences.
- Parameters
check : { “final”, “immediate” } (default=”final”)
Specify when to check for improved sum-of-pairs scores: After each iteration (“immediate”) or after all iterations have been carried out (“final”).
mode : { “global”, “overlap”, “dialign” } (default=“global”)
A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
“global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
“dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996,
“overlap” – semi-global alignment, where gaps introduced in the beginning and the end of a sequence do not score.
gop : int (default=-3)
The gap opening penalty (GOP) used in the analysis.
scale : float (default=0.5)
The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=0)
The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=1)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
Notes
This method essentially follows the iterative method of Barton1987 with the exception that an MSA has already been calculated.
- iterate_clusters(threshold, check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1, restricted_chars='T_')
Iterative refinement based on a flat cluster analysis of the data.
- Parameters
threshold : float
The threshold for the flat cluster analysis.
check : string (default=”final”)
Specify when to check for improved sum-of-pairs scores: After each iteration (“immediate”) or after all iterations have been carried out (“final”).
mode : { “global”, “overlap”, “dialign” } (default=“global”)
A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
“global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
“dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996,
“overlap” – semi-global alignment, where gaps introduced in the beginning and the end of a sequence do not score.
gop : int (default=-3)
The gap opening penalty (GOP) used in the analysis.
scale : float (default=0.5)
The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=0)
The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=1)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
Notes
This method uses the lingpy.algorithm.clustering.flat_upgma() function in order to retrieve a flat cluster of the data.
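The idea of a flat, threshold-based average-linkage (UPGMA) clustering can be sketched in a few lines. This is an illustration only and not the code of lingpy.algorithm.clustering.flat_upgma(): the function name `flat_upgma_sketch`, the plain list-of-lists distance matrix, and the rule of stopping once the closest pair is farther apart than the threshold are assumptions made for the example.

```python
def flat_upgma_sketch(threshold, matrix):
    """Merge clusters by average linkage until no pair is within threshold."""
    # Start with every sequence index in its own cluster.
    clusters = [[i] for i in range(len(matrix))]

    def average_distance(c1, c2):
        # Mean of all pairwise distances between members of two clusters.
        return sum(matrix[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            (average_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


distances = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
print(flat_upgma_sketch(0.5, distances))
```

Each resulting cluster would then be a candidate group of sequences to realign against the rest during refinement.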
- iterate_orphans(check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1.0, restricted_chars='T_')
Iterate over the most divergent sequences in the sample.
- Parameters
check : string (default=”final”)
Specify when to check for improved sum-of-pairs scores: After each iteration (“immediate”) or after all iterations have been carried out (“final”).
mode : { “global”, “overlap”, “dialign” } (default=“global”)
A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
“global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
“dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996,
“overlap” – semi-global alignment, where gaps introduced in the beginning and the end of a sequence do not score.
gop : int (default=-3)
The gap opening penalty (GOP) used in the analysis.
scale : float (default=0.5)
The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=0)
The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=1.0)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
See also
Multiple.iterate_clusters, Multiple.iterate_similar_gap_sites, Multiple.iterate_all_sequences
Notes
The most divergent sequences are those whose average distance to all other sequences is above the average distance of all sequence pairs.
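The criterion in the note above can be written down directly. The sketch below assumes a symmetric distance matrix given as a list of lists with at least two sequences; the function name `divergent_sequences` is hypothetical and the code is not taken from LingPy.

```python
def divergent_sequences(matrix):
    """Return indices whose mean distance to the others exceeds the overall mean."""
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Average distance over all sequence pairs.
    overall = sum(matrix[i][j] for i, j in pairs) / len(pairs)
    orphans = []
    for i in range(n):
        # Average distance of sequence i to every other sequence.
        mean_i = sum(matrix[i][j] for j in range(n) if j != i) / (n - 1)
        if mean_i > overall:
            orphans.append(i)
    return orphans


distances = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.1, 0.8],
    [0.2, 0.1, 0.0, 0.9],
    [0.9, 0.8, 0.9, 0.0],
]
print(divergent_sequences(distances))
```

In this toy matrix only the last sequence is far from all others, so it is the only one flagged for realignment.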
- iterate_similar_gap_sites(check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1, restricted_chars='T_')
Iterative refinement based on the Similar Gap Sites heuristic.
- Parameters
check : { “final”, “immediate” } (default=”final”)
Specify when to check for improved sum-of-pairs scores: After each iteration (“immediate”) or after all iterations have been carried out (“final”).
mode : { “global”, “overlap”, “dialign” } (default=“global”)
A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
“global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
“dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996,
“overlap” – semi-global alignment, where gaps introduced in the beginning and the end of a sequence do not score.
gop : int (default=-3)
The gap opening penalty (GOP) used in the analysis.
scale : float (default=0.5)
The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=0)
The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=1)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When, e.g., set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
Notes
This heuristic is fairly simple. The idea is to try to split a given MSA into partitions with identical gap sites.
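The partitioning step can be illustrated as follows. This is a minimal sketch under the assumption that the MSA is a list of equally long rows and that gaps are written as "-"; the function name `partition_by_gap_sites` is hypothetical, and the realignment of the partitions that would follow is left out.

```python
from collections import defaultdict


def partition_by_gap_sites(msa, gap="-"):
    """Group row indices of an MSA by the exact positions of their gaps."""
    groups = defaultdict(list)
    for idx, row in enumerate(msa):
        # The tuple of gap positions serves as the key of the partition.
        sites = tuple(i for i, seg in enumerate(row) if seg == gap)
        groups[sites].append(idx)
    return list(groups.values())


msa = [list("h-and"), list("h-ant"), list("ha-nd")]
print(partition_by_gap_sites(msa))
```

Rows 0 and 1 share their only gap in the second column and end up in one partition, while row 2 forms a partition of its own.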
- lib_align(**keywords)
Carry out a library-based progressive alignment analysis of the sequences.
- Parameters
model : { “dolgo”, “sca”, “asjp” } (default=“sca”)
A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:
“dolgo” – a sound-class model based on Dolgopolsky1986,
“sca” – an extension of the “dolgo” sound-class model based on List2012b, and
“asjp” – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).
mode : { “global”, “dialign” } (default=“global”)
A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
“global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
“dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.
modes : list (default=[(“global”,-10,0.6),(“local”,-1,0.6)])
Indicate the mode, the gap opening penalties (GOP), and the gap extension scale (GEP scale), of the pairwise alignment analyses which are used to create the library.
gop : int (default=-5)
The gap opening penalty (GOP) used in the analysis.
gep_scale : float (default=0.6)
The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=1)
The factor by which the initial and the descending position shall be modified.
tree_calc : { “neighbor”, “upgma” } (default=“upgma”)
The cluster algorithm which shall be used for the calculation of the guide tree. Select between neighbor, the Neighbor-Joining algorithm (Saitou1987), and upgma, the UPGMA algorithm (Sokal1958).
guide_tree : tree_matrix
Use a custom guide tree instead of performing a cluster algorithm for constructing one based on the input similarities. The use of this option makes the tree_calc option irrelevant.
gap_weight : float (default=0)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
restricted_chars : string (default=“T”)
Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to “T”, since this is the character that represents tones in the prosodic strings of sequences.
Notes
In contrast to traditional progressive multiple sequence alignment approaches such as Feng1981 and Thompson1994, library-based progressive alignment (Notredame2000) is based on a pre-processing of the data where the information given in global and local pairwise alignments of the input sequences is used to derive a refined scoring function (library) which is later used in the progressive phase.
- prog_align(**keywords)
Carry out a progressive alignment analysis of the input sequences.
- Parameters
model : { “dolgo”, “sca”, “asjp” } (default=“sca”)
A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:
“dolgo” – a sound-class model based on Dolgopolsky1986,
“sca” – an extension of the “dolgo” sound-class model based on List2012b, and
“asjp” – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).
mode : { “global”, “dialign” } (default=“global”)
A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
“global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
“dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.
gop : int (default=-2)
The gap opening penalty (GOP) used in the analysis.
scale : float (default=0.5)
The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=0.3)
The factor by which the initial and the descending position shall be modified.
tree_calc : { “neighbor”, “upgma” } (default=“upgma”)
The cluster algorithm which shall be used for the calculation of the guide tree. Select between neighbor, the Neighbor-Joining algorithm (Saitou1987), and upgma, the UPGMA algorithm (Sokal1958).
guide_tree : tree_matrix
Use a custom guide tree instead of performing a cluster algorithm for constructing one based on the input similarities. The use of this option makes the tree_calc option irrelevant.
gap_weight : float (default=0.5)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
restricted_chars : string (default=“T”)
Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to “T”, since this is the character that represents tones in the prosodic strings of sequences.
- sum_of_pairs(alm_matrix='self', mat=None, gap_weight=0.0, gop=-1)
Calculate the sum-of-pairs score for a given alignment analysis.
- Parameters
alm_matrix : { “self”, “other” } (default=”self”)
Indicate for which MSA the sum-of-pairs score shall be calculated.
mat : { None, list }
If “other” is chosen as an option for alm_matrix, define for which matrix the sum-of-pairs score shall be calculated.
gap_weight : float (default=0)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
- Returns
The sum-of-pairs score of the alignment.
- swap_check(swap_penalty=-3, score_mode='classes')
Check for possibly swapped sites in the alignment.
- Parameters
swap_penalty : { int, float } (default=-3)
Specify the penalty for swaps in the alignment.
score_mode : { “classes”, “library” } (default=”classes”)
Define the score-mode of the calculation which is either based on sound classes proper, or on the specific scores derived from the library approach.
- Returns
result : bool
Returns True, if a swap was identified, and False otherwise. The information regarding the position of the swap is stored in the attribute swap_index.
Notes
The method for swap detection is described in detail in List2012b.
Examples
Define a set of strings whose alignment contains a swap.
>>> from lingpy import *
>>> mult = Multiple(["woldemort", "waldemar", "vladimir"])
Align the data, using the progressive approach.
>>> mult.prog_align()
Check for swaps.
>>> mult.swap_check()
True
Print the alignment.
>>> print(mult)
w o l - d e m o r t
w a l - d e m a r -
v - l a d i m i r -
- lingpy.align.multiple.mult_align(seqs, gop=-1, scale=0.5, tree_calc='upgma', scoredict=False, pprint=False)
A short-cut method for multiple alignment analyses.
- Parameters
seqs : list
The input sequences.
gop : int (default=-1)
The gap opening penalty.
scale : float (default=0.5)
The scaling factor by which penalties for gap extensions are decreased.
tree_calc : { “upgma”, “neighbor” } (default=“upgma”)
The algorithm which is used for the calculation of the guide tree.
pprint : bool (default=False)
Indicate whether results shall be printed onto screen.
- Returns
alignments : list
A two-dimensional list in which alignments are represented as a list of tokens.
Examples
>>> m = mult_align(["woldemort", "waldemar", "vladimir"], pprint=True)
w o l - d e m o r t
w a l - d e m a r -
- v l a d i m i r -
lingpy.align.pairwise module
Module provides classes and functions for pairwise alignment analyses.
- class lingpy.align.pairwise.Pairwise(seqs, seqB=False, **keywords)
Bases: object
Basic class for the handling of pairwise sequence alignments (PSA).
- Parameters
seqs : { str, list }
Either the first string of a sequence pair that shall be aligned, or a list of sequence tuples.
seqB : { str, bool } (default=False)
Define the second sequence that shall be aligned with the first sequence, if only two sequences shall be compared.
- align(**keywords)
Align a pair of sequences or multiple sequence pairs.
- Parameters
gop : int (default=-1)
The gap opening penalty (GOP).
scale : float (default=0.5)
The gap extension penalty (GEP), calculated with help of a scaling factor.
mode : { “global”, “local”, “overlap”, “dialign” }
The alignment mode, see List2012a for details.
factor : float (default=0.3)
The factor by which matches in identical prosodic position are increased.
restricted_chars : str (default=“T_”)
The restricted chars that function as an indicator of syllable or morpheme breaks for secondary alignment, see List2012c for details.
distance : bool (default=False)
If set to True, return the distance instead of the similarity score. Distance is calculated using the formula by Downey2008.
model : { None, lingpy.data.model.Model }
Specify the sound class model that shall be used for the analysis. If no model is specified, the default model of List2012a will be used.
pprint : bool (default=False)
If set to True, the alignments are printed to the screen.
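The conversion from similarity to distance mentioned for the distance keyword is commonly stated as 1 - 2 * sim(A, B) / (sim(A, A) + sim(B, B)), where sim(A, A) and sim(B, B) are the scores of each sequence aligned with itself. The sketch below assumes this form of the Downey2008 formula; the function name `downey_distance` is hypothetical and the three scores are expected to be computed beforehand.

```python
def downey_distance(sim_ab, sim_aa, sim_bb):
    """Turn a pairwise similarity score into a normalized distance.

    sim_ab: similarity score of the alignment of A with B
    sim_aa, sim_bb: scores of A and B aligned with themselves
    """
    return 1 - (2 * sim_ab) / (sim_aa + sim_bb)


# Identical sequences score as well against each other as against
# themselves, so their distance is 0.
print(downey_distance(4, 4, 4))
# A weak alignment of two sequences yields a distance close to 1.
print(downey_distance(1, 4, 4))
```

Because the self-similarities act as an upper bound, the result is comparable across word pairs of different length.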
- lingpy.align.pairwise.edit_dist(seqA, seqB, normalized=False, restriction='')
Return the edit distance between two strings.
- Parameters
seqA,seqB : str
The strings that shall be compared.
normalized : bool (default=False)
Specify whether the normalized edit distance shall be returned. If no restrictions are chosen, the edit distance is normalized by dividing by the length of the longer string. If restriction is set to cv (consonant-vowel), the edit distance is normalized by the length of the alignment.
restriction : { “cv” } (default=“”)
Specify the restrictions to be used. Currently, only cv is supported. This prohibits matches of vowels with consonants.
- Returns
dist : { int, float }
The edit distance, which is a float if normalized is set to True, and an integer otherwise.
Notes
The edit distance was first formally defined by V. I. Levenshtein (Levenshtein1965). The first algorithm to compute the edit distance was proposed by Wagner and Fischer (Wagner1974).
Examples
Align two sequences:
>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> edit_dist(seqA, seqB)
3
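The dynamic-programming scheme behind this measure can be sketched without LingPy. The following is a plain textbook version of the Wagner-Fischer recurrence with unit costs; the function name `edit_dist_sketch` is hypothetical, the cv restriction is not covered, and normalization divides by the length of the longer sequence as described for the unrestricted case.

```python
def edit_dist_sketch(seq_a, seq_b, normalized=False):
    """Unit-cost edit distance, computed row by row."""
    # prev[j] is the distance between the processed prefix of seq_a
    # and the first j items of seq_b.
    prev = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, start=1):
        curr = [i]
        for j, b in enumerate(seq_b, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (a != b),  # match or substitution
                )
            )
        prev = curr
    dist = prev[-1]
    if normalized:
        longest = max(len(seq_a), len(seq_b))
        return dist / longest if longest else 0.0
    return dist


print(edit_dist_sketch("fat cat", "catfat"))
```

Keeping only two rows of the matrix is enough when just the distance, and not the alignment itself, is needed.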
- lingpy.align.pairwise.nw_align(seqA, seqB, scorer=False, gap=-1)
Carry out the traditional Needleman-Wunsch algorithm.
- Parameters
seqA, seqB : {str, list, tuple}
The input strings. These should be iterables, so you can use tuples, lists, or strings.
scorer : dict (default=False)
If set to False a scorer will automatically be calculated, otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings (segment matches need to be passed as tuples of two segments, following the order of the input sequences). Note also that the scorer can well be asymmetric, so you could also use it for two completely different alphabets. All you need to make sure is that the tuples representing the segment matches follow the order of your input sequences.
gap : int (default=-1)
The gap penalty.
- Returns
alm : tuple
A tuple consisting of the alignments of the first and the second sequence, and the alignment score.
Notes
The Needleman-Wunsch algorithm (see Needleman1970) returns a global alignment of two sequences.
Examples
>>> print(' '.join(almA)+'\n'+' '.join(almB), "(sim={0})".format(sim))
a b a b -
- b a b a (sim=1)
Nothing unexpected so far, you could reach the same result without the scorer. But now let’s make a scorer that favors mismatches for our little two-letter alphabet:
>>> scorer = { ('a','b'): 1, ('a','a'):-1, ('b','b'):-1, ('b', 'a'): 1}
>>> seqA, seqB = 'abab', 'baba'
>>> almA, almB, sim = nw_align(seqA, seqB, scorer=scorer)
>>> print(' '.join(almA)+'\n'+' '.join(almB), "(sim={0})".format(sim))
a b a b
b a b a (sim=4)
Now, let’s analyse two strings which are completely different, but where we use the scorer to define mappings between the segments. We simply do this by using lower case letters in one and upper case letters in the other case, which will, of course, be treated as different symbols in Python:
>>> scorer = { ('A','a'): 1, ('A','b'):-1, ('B','a'):-1, ('B', 'B'): 1}
>>> seqA, seqB = 'ABAB', 'aa'
>>> almA, almB, sim = nw_align(seqA, seqB, scorer=scorer)
>>> print(' '.join(almA)+'\n'+' '.join(almB), "(sim={0})".format(sim))
A B A B
a - a - (sim=0)
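How a scorer dictionary and a linear gap penalty interact can be seen in a compact, pure-Python version of the algorithm. This is a textbook sketch and not LingPy's implementation: the function name `nw_sketch` is hypothetical, the default scorer (+1 for identical segments, -1 otherwise) is an assumption, and ties in the traceback are resolved in a fixed but arbitrary order.

```python
def nw_sketch(seq_a, seq_b, scorer=None, gap=-1):
    """Global alignment with a linear gap penalty."""
    if scorer is None:
        # Assumed default: reward identity, punish everything else.
        scorer = {(a, b): (1 if a == b else -1) for a in seq_a for b in seq_b}
    n, m = len(seq_a), len(seq_b)
    # score[i][j] is the best score for seq_a[:i] against seq_b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]],
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    alm_a, alm_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + scorer[seq_a[i - 1], seq_b[j - 1]]):
            alm_a.append(seq_a[i - 1])
            alm_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            alm_a.append(seq_a[i - 1])
            alm_b.append("-")
            i -= 1
        else:
            alm_a.append("-")
            alm_b.append(seq_b[j - 1])
            j -= 1
    return alm_a[::-1], alm_b[::-1], score[n][m]


# An asymmetric scorer mapping upper-case onto lower-case segments.
mapping = {("A", "a"): 1, ("A", "b"): -1, ("B", "a"): -1, ("B", "b"): 1}
print(nw_sketch("ABAB", "aa", scorer=mapping))
```

The scorer is looked up with the segment of the first sequence first, which is why the key order has to follow the order of the input sequences.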
- lingpy.align.pairwise.pw_align(seqA, seqB, gop=-1, scale=0.5, scorer=False, mode='global', distance=False, **keywords)
Align two sequences in various ways.
- Parameters
seqA, seqB : {str, list, tuple}
The input strings. These should be iterables, so you can use tuples, lists, or strings.
scorer : dict (default=False)
If set to False a scorer will automatically be calculated, otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings.
gop : int (default=-1)
The gap opening penalty.
scale : float (default=0.5)
The gap extension scale. This scale is similar to the gap extension penalty, but in contrast to the traditional GEP, it “scales” the gap opening penalty.
mode : {“global”, “local”, “dialign”, “overlap”} (default=”global”)
Select between one of the four different alignment modes regularly implemented in LingPy, see List2012a for details.
distance : bool (default=False)
If set to True, return the distance score following the formula by Downey2008. Otherwise, return the basic similarity score.
Examples
Align two words using the dialign algorithm:
>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> pw_align(seqA, seqB, mode='dialign')
(['f', 'a', 't', ' ', 'c', 'a', 't', '-', '-', '-'], ['-', '-', '-', '-', 'c', 'a', 't', 'f', 'a', 't'], 3.0)
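The remark that scale “scales” the gap opening penalty can be read as follows: the first gap position costs gop, and every further position of the same gap costs gop * scale. The sketch below spells out this reading; it is an assumption about how the two parameters combine, the function name `gap_cost` is hypothetical, and LingPy's internal scoring may differ in detail.

```python
def gap_cost(length, gop=-1, scale=0.5):
    """Cost of one contiguous gap of the given length."""
    if length <= 0:
        return 0.0
    # The extension penalty is derived from the opening penalty.
    gep = gop * scale
    return gop + (length - 1) * gep


# With the defaults, opening costs -1 and each extension -0.5.
for length in (1, 2, 3):
    print(length, gap_cost(length))
```

Under this reading, a scale below 1 makes one long gap cheaper than several short ones of the same total length.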
- lingpy.align.pairwise.structalign(seqA, seqB)
Experimental function for testing structural alignment algorithms.
- lingpy.align.pairwise.sw_align(seqA, seqB, scorer=False, gap=-1)
Carry out the traditional Smith-Waterman algorithm.
- Parameters
seqA, seqB : {str, list, tuple}
The input strings. These should be iterables, so you can use tuples, lists, or strings.
scorer : dict (default=False)
If set to False a scorer will automatically be calculated, otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings.
gap : int (default=-1)
The gap penalty.
- Returns
alm : tuple
A tuple consisting of prefix, alignment, and suffix of the first and the second sequence, and the alignment score.
Notes
The Smith-Waterman algorithm (see Smith1981) returns a local alignment between two sequences. A local alignment is an alignment of those subsequences of the input sequences that yields the highest score.
Examples
Align two sequences:
>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> sw_align(seqA, seqB)
(([], ['f', 'a', 't'], [' ', 'c', 'a', 't']), (['c', 'a', 't'], ['f', 'a', 't'], []), 3.0)
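The score of the best local alignment can be computed with a small change to the global recurrence: cell values are never allowed to drop below zero, and the maximum over all cells is returned. The sketch below is a textbook version restricted to the score; the function name `sw_score` and the match and mismatch values of +1 and -1 are assumptions, and the prefix and suffix bookkeeping of sw_align is left out.

```python
def sw_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Score of the best local alignment between two sequences."""
    best = 0
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[-1])
        prev = curr
    return best


print(sw_score("fat cat", "catfat"))
```

For the two example words, the best local match is one of the shared three-letter stretches, which agrees with the score of 3.0 shown above.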
- lingpy.align.pairwise.turchin(seqA, seqB, model='dolgo', **keywords)
Return cognate judgment based on the method by Turchin2010.
- Parameters
seqA, seqB : {str, list, tuple}
The input strings. These should be iterables, so you can use tuples, lists, or strings.
model : {“asjp”, “sca”, “dolgo”} (default=”dolgo”)
A sound-class model instance or a string that denotes one of the standard sound class models used in LingPy.
- Returns
cognacy : {0, 1}
The cognacy assertion which is either 0 (words are probably cognate) or 1 (words are not likely to be cognate).
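The underlying idea, comparing the sound classes of the first two consonants of each word, can be sketched with a toy class table. Everything in the example is illustrative: `TOY_CLASSES` is a hypothetical stand-in for a real sound-class model and only covers a few ASCII letters, the vowel set is simplified, and the function names are not part of LingPy.

```python
# Hypothetical miniature consonant classes, loosely in the spirit of
# sound-class models; NOT the table used by LingPy.
TOY_CLASSES = {
    "p": "P", "b": "P", "f": "P",
    "t": "T", "d": "T",
    "s": "S", "z": "S",
    "k": "K", "g": "K",
    "m": "M", "n": "N",
    "r": "R", "l": "R",
    "w": "W", "v": "W",
    "j": "J", "h": "H",
}
VOWELS = set("aeiou")


def first_two_classes(word):
    """Classes of the first two consonants (raises KeyError on unknown letters)."""
    classes = [TOY_CLASSES[ch] for ch in word if ch not in VOWELS]
    return classes[:2]


def turchin_sketch(word_a, word_b):
    """Return 0 if the first two consonant classes match, 1 otherwise."""
    return 0 if first_two_classes(word_a) == first_two_classes(word_b) else 1


print(turchin_sketch("pater", "fader"))
print(turchin_sketch("hand", "mano"))
```

The return values follow the convention stated above: 0 for a probable cognate pair and 1 otherwise, so the result can be used like a distance.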
- lingpy.align.pairwise.we_align(seqA, seqB, scorer=False, gap=-1)
Carry out the traditional Waterman-Eggert algorithm.
- Parameters
seqA, seqB : {str, list, tuple}
The input strings. These should be iterables, so you can use tuples, lists, or strings.
scorer : dict (default=False)
If set to False a scorer will automatically be calculated, otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings.
gap : int (default=-1)
The gap penalty.
- Returns
alms : list
A list consisting of tuples. Each tuple gives the alignment of one of the subsequences of the input sequences. Each tuple contains the aligned part of the first, the aligned part of the second sequence, and the score of the alignment.
Notes
The Waterman-Eggert algorithm (see Waterman1987) returns all local matches between two sequences.
Examples
Align two sequences:
>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> we_align(seqA, seqB)
[(['f', 'a', 't'], ['f', 'a', 't'], 3.0), (['c', 'a', 't'], ['c', 'a', 't'], 3.0)]
lingpy.align.sca module
Basic module for pairwise and multiple sequence comparison.
The module consists of four classes which deal with pairwise and multiple sequence comparison from the sequence and the alignment perspective. The sequence perspective deals with unaligned sequences. The alignment perspective deals with aligned sequences.
- class lingpy.align.sca.Alignments(infile, row='concept', col='doculect', conf='', modify_ref=False, _interactive=True, split_on_tones=False, ref='cogid', **keywords)
Bases: lingpy.basic.wordlist.Wordlist
Class handles Wordlists for the purpose of alignment analyses.
- Parameters
infile : str
The name of the input file that should conform to the basic format of the lingpy.basic.wordlist.Wordlist class and define a specific ID for cognate sets.
row : str (default = “concept”)
A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.
col : str (default = “doculect”)
A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.
conf : string (default=’’)
A string defining the path to the configuration file.
ref : string (default=’cogid’)
The name of the column that stores the cognate IDs.
modify_ref : function (default=False)
Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to “abs”, and all cognate IDs will be converted to their absolute value.
split_on_tones : bool (default=False)
If set to True, this means that in the case of fuzzy alignment mode, the algorithm will attempt to split words into morphemes by tones if no explicit morpheme markers can be found.
Notes
This class inherits from Wordlist and additionally creates instances of the Multiple class for all cognate sets that are specified by the ref keyword.
Attributes
msa : dict
A dictionary storing multiple alignments as dictionaries which can be directly opened and aligned with help of the lingpy.align.sca.SCA function. The alignment objects are referenced by a key which is identical with the “reference” (ref-keyword) of the alignment, that is the name of the column which contains the cognate identifiers.
- add_alignments(ref=False, modify_ref=False, fuzzy=False, split_on_tones=True, override=False)
Function adds a new set of alignments to the data.
- Parameters
ref : str (default=False)
Use this to set the name of the column which contains the cognate sets.
fuzzy : bool (default=False)
If set to True, force the algorithm to treat the cognate sets as fuzzy cognate sets, i.e., as multiple cognate sets which are assigned in order to a word (proper “partial cognates”).
- align(**keywords)
Carry out a multiple alignment analysis of the data.
- Parameters
method : { “progressive”, “library” } (default=”progressive”)
Select the method to use for the analysis.
iteration : bool (default=False)
Set to True in order to use iterative refinement methods.
swap_check : bool (default=False)
Set to True in order to carry out a swap-check.
model : { ‘dolgo’, ‘sca’, ‘asjp’ }
A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:
“dolgo” – a sound-class model based on Dolgopolsky1986,
“sca” – an extension of the “dolgo” sound-class model based on List2012b, and
“asjp” – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).
mode : { ‘global’, ‘dialign’ }
A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
“global” – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
“dialign” – global alignment analysis which seeks to maximize local similarities Morgenstern1996.
modes : list (default=[(‘global’,-2,0.5),(‘local’,-1,0.5)])
Indicate the mode, the gap opening penalties (GOP), and the gap extension scale (GEP scale), of the pairwise alignment analyses which are used to create the library.
gop : int (default=-5)
The gap opening penalty (GOP) used in the analysis.
scale : float (default=0.6)
The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties (Gotoh1982).
factor : float (default=1)
The factor by which the initial and the descending position shall be modified.
tree_calc : { ‘neighbor’, ‘upgma’ } (default=’upgma’)
The cluster algorithm which shall be used for the calculation of the guide tree. Select between neighbor, the Neighbor-Joining algorithm (Saitou1987), and upgma, the UPGMA algorithm (Sokal1958).
gap_weight : float (default=0)
The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
restricted_chars : string (default=”T”)
Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to “T”, since this is the character that represents tones in the prosodic strings of sequences.
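The interplay of the gop and scale parameters described above can be illustrated with a minimal sketch. It assumes that the first position of a gap costs gop and that every further position costs gop multiplied by scale; it ignores the position-specific weighting controlled by factor. The function name gap_penalty is hypothetical and not part of LingPy.

```python
def gap_penalty(length, gop=-5, scale=0.6):
    """Total penalty for a gap spanning `length` consecutive positions.

    Assumption (not taken from the LingPy source): opening the gap costs
    `gop`, and each additional position costs `gop * scale`.
    """
    if length <= 0:
        return 0.0
    return gop + (length - 1) * gop * scale


# Longer gaps grow more slowly in cost than `length * gop` would.
for n in (1, 2, 3):
    print(n, gap_penalty(n))
```

With a scale below 1, one long gap is cheaper than several short ones of the same total length, which is the usual motivation for affine gap penalties.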
- get_confidence(scorer, ref='lexstatid', gap_weight=0.25)¶
Function creates confidence scores for a given set of alignments.
- Parameters
scorer : ScoreDict
A ScoreDict object which gives similarity scores for all segments in the alignment.
ref : str (default=”lexstatid”)
The reference entry-type, referring to the cognate-set to be used for the analysis.
gap_weight : float (default=0.25)
Determine the weight assigned to matches containing gaps.
- get_consensus(tree=False, gaps=False, classes=False, consensus='consensus', counterpart='ipa', weights=[], return_data=False, **keywords)¶
Calculate a consensus string of all MSAs in the wordlist.
- Parameters
msa : { list, lingpy.align.multiple.Multiple }
Either an MSA object or an MSA matrix.
tree : { str, lingpy.thirdparty.cogent.PhyloNode }
A tree object or a Newick string along which the consensus shall be calculated.
gaps : bool (default=False)
If set to True, return the gap positions in the consensus.
classes : bool (default=False)
Specify whether sound classes shall be used to calculate the consensus.
model : lingpy.data.model.Model
A sound class model according to which the IPA strings shall be converted to sound-class strings.
return_data : bool (default=False)
Return the data instead of adding it in a column to the wordlist object.
- get_msa(ref)¶
- output(fileformat, **keywords)¶
Write wordlist to file.
- Parameters
fileformat : { “tsv”, “msa”, “tre”, “nwk”, “dst”, “taxa”, “starling”, “paps.nex”, “paps.csv”, “html” }
The format that is written to file. This corresponds to the file extension, thus ‘tsv’ creates a file in tsv-format, ‘dst’ creates a file in Phylip-distance format, etc. Specific output is created for the formats “html” and “msa”:
“msa” will create a folder containing all alignments of all cognate sets in “msa”-format
“html” will create html-output in which words are sorted according to meaning, cognate set, and all cognate words are aligned
filename : str
Specify the name of the output file (defaults to a filename that indicates the creation date).
subset : bool (default=False)
If set to True, return only a subset of the data. The subset is specified by the keywords ‘cols’ and ‘rows’.
cols : list
If subset is set to True, specify the columns that shall be written to the csv-file.
rows : dict
If subset is set to True, use a dictionary consisting of keys that specify a column and values that give a Python-statement in raw text, such as, e.g., “== ‘hand’”. The content of the specified column will then be checked against the statement passed in the dictionary, and if it evaluates to True, the respective row will be written to file.
ref : str
Name of the column that contains the cognate IDs if ‘starling’ is chosen as an output format.
missing : { str, int } (default=0)
If ‘paps.nex’ or ‘paps.csv’ is chosen as fileformat, this character will be inserted as an indicator of missing data.
tree_calc : {‘neighbor’, ‘upgma’}
If no tree has been calculated and ‘tre’ or ‘nwk’ is chosen as output format, the method that is used to calculate the tree.
threshold : float (default=0.6)
The threshold that is used to carry out a flat cluster analysis if ‘groups’ or ‘cluster’ is chosen as output format.
style : str (default=”id”)
If “msa” is chosen as output format, this will write the alignments for each msa-file in a specific format in which the first column contains a direct reference to the word via its ID in the wordlist.
ignore : { list, “all” }
Modifies the output format in “tsv” output and makes it possible to ignore certain blocks in extended “tsv”, like “msa”, “taxa”, “json”, etc., which should be passed as a list. If you choose “all” as a plain string and not a list, this will ignore all additional blocks and output only plain “tsv”.
prettify : bool (default=True)
Inserts comment characters between concepts in the “tsv” file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain “tsv”.
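The row selection described for the rows keyword can be pictured with the following minimal sketch. It assumes that each statement is appended to the literal value of the cell and then evaluated; the helper select_rows and the sample table are hypothetical and are not LingPy code.

```python
def select_rows(table, rows):
    """Keep the rows whose cells satisfy every statement in `rows`.

    table : list of dicts, one dict per word
    rows  : dict mapping a column name to a statement such as "== 'hand'"

    Assumption: a statement is evaluated against the repr() of the cell.
    eval() must never be used on statements from untrusted sources.
    """
    selected = []
    for row in table:
        if all(eval(repr(row[col]) + " " + stmt) for col, stmt in rows.items()):
            selected.append(row)
    return selected


table = [
    {"doculect": "German", "concept": "hand", "ipa": "hant"},
    {"doculect": "English", "concept": "hand", "ipa": "hænd"},
    {"doculect": "German", "concept": "foot", "ipa": "fuːs"},
]

subset = select_rows(table, {"concept": "== 'hand'"})
for row in subset:
    print(row["doculect"], row["ipa"])
```

Several conditions can be combined by adding further keys, for example {"concept": "== 'hand'", "doculect": "!= 'English'"}.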
- reduce_alignments(alignment=False, ref=False)¶
Function reduces alignments which contain columns that are marked to be ignored by the user.
Notes
This function changes the data only internally: All alignments are checked as to whether they contain data that should be ignored. If this is the case, the alignments are then reduced, and stored in a specific item of the alignment string. If the method doesn’t find any instances for reduction, it still makes the copies of the alignments in order to guarantee that the alignments with which we want to work are at the same place in the dictionary.
- class lingpy.align.sca.MSA(infile, **keywords)¶
Bases:
lingpy.align.multiple.Multiple
Basic class for carrying out multiple sequence alignment analyses.
- Parameters
infile : file
A file in msq-format or msa-format.
merge_vowels : bool (default=True)
Indicate whether neighboring vowels should be merged into diphthongs, or whether they should be kept separated during the analysis.
comment : char (default=’#’)
The comment character which, inserted in the beginning of a line, prevents that line from being read.
normalize : bool (default=True)
Normalize the alignment, that is, add gap characters for all sequences which are shorter than the longest sequence, and delete all columns from the alignment in which only gaps occur.
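What normalization amounts to can be shown with a short, self-contained sketch. It follows the description of the normalize parameter (pad shorter sequences with gaps, then drop columns that contain only gaps); the function below is a hypothetical illustration and not the LingPy implementation.

```python
def normalize(alignment, gap="-"):
    """Pad all sequences to equal length and drop gap-only columns."""
    width = max(len(seq) for seq in alignment)
    # Step 1: add gap characters to sequences shorter than the longest one.
    padded = [list(seq) + [gap] * (width - len(seq)) for seq in alignment]
    # Step 2: keep only the columns that contain at least one non-gap symbol.
    keep = [i for i in range(width) if any(row[i] != gap for row in padded)]
    return [[row[i] for i in keep] for row in padded]


msa = [
    ["w", "-", "o", "l"],
    ["w", "-", "a"],
    ["v", "-", "l", "a"],
]

for row in normalize(msa):
    print(" ".join(row))
```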
Notes
There are two possible input formats for this class: the MSQ-format, and the MSA-format (see Multiple Alignments (MSQ and MSA) for details). This class directly inherits all methods of the Multiple class.
Examples
Get the path to a file from the testset.
>>> from lingpy import *
>>> path = rc("test_path")+'harry.msq'
Load the file into the Multiple class.
>>> mult = Multiple(path)
Carry out a progressive alignment analysis of the sequences.
>>> mult.prog_align()
Print the result to the screen:
>>> print(mult)
w o l - d e m o r t
w a l - d e m a r -
v - l a d i m i r -
- ipa2cls(**keywords)¶
Retrieve sound-class strings from aligned IPA sequences.
- Parameters
model : str (default=’sca’)
The sound-class model according to which the sequences shall be converted.
Notes
This function is only useful when an msa-file with already conducted alignment analyses was loaded.
- output(fileformat='msa', filename=None, sorted_seqs=False, unique_seqs=False, **keywords)¶
Write data to file.
- Parameters
fileformat : { “psa”, “msa”, “msq”, “html” }
Indicate which data should be written to file. Select between:
“psa” – output of all pairwise alignments in psa-format,
“msa” – output of the multiple alignment in msa-format,
“msq” – output of the multiple sequences in msq-format, or
“html” – output of the multiple alignment in html-format.
filename : str
Select a specific name for the outfile, otherwise, the name of the infile will be taken by default.
sorted_seqs : bool
Indicate whether the sequences should be sorted or not (applies only to ‘msa’ and ‘msq’ output).
unique_seqs : bool
Indicate whether only unique sequences should be written to file or not.
- class lingpy.align.sca.PSA(infile, **keywords)¶
Bases:
lingpy.align.pairwise.Pairwise
Basic class for dealing with the pairwise alignment of sequences.
- Parameters
infile : file
A file in psq-format.
merge_vowels : bool (default=True)
Indicate whether neighboring vowels should be merged into diphthongs, or whether they should be kept separated during the analysis.
comment : char (default=’#’)
The comment character which, inserted in the beginning of a line, prevents that line from being read.
Notes
In order to read in data from text files, two different file formats can be used along with this class: the PSQ-format, and the PSA-format (see Pairwise Alignments (PSQ and PSA) for details). This class inherits the methods of the Pairwise class.
Attributes
taxa : list
A list of tuples containing the taxa of all sequence pairs.
seqs : list
A list of tuples containing all sequence pairs.
tokens : list
A list of tuples containing all sequence pairs in a tokenized form.
- output(fileformat='psa', filename=None, **keywords)¶
Write the results of the analyses to a text file.
- Parameters
fileformat : { ‘psa’, ‘psq’ }
Indicate which data should be written to file. Select between:
‘psa’ – output of all pairwise alignments in psa-format,
‘psq’ – output of the multiple sequences in psq-format.
filename : str
Select a specific name for the outfile, otherwise, the name of the infile will be taken by default.
- lingpy.align.sca.SCA(infile, **keywords)¶
Method returns alignment objects depending on input file or input data.
Notes
This method checks for the type of an alignment object and returns an alignment object of the respective type.
- lingpy.align.sca.get_consensus(msa, gaps=False, taxa=False, classes=False, **keywords)¶
Calculate a consensus string of a given MSA.
- Parameters
msa : { list, lingpy.align.multiple.Multiple }
Either an MSA object or an MSA matrix.
gaps : bool (default=False)
If set to True, return the gap positions in the consensus.
taxa : { list, bool } (default=False)
If tree is chosen as a parameter, specify the taxa in order of the aligned strings.
classes : bool (default=False)
Specify whether sound classes shall be used to calculate the consensus.
model : lingpy.data.model.Model
A sound class model according to which the IPA strings shall be converted to sound-class strings.
local : { bool, “peaks”, “gaps” } (default=False)
Specify whether local pre-processing should be applied to the data. If set to “peaks”, the average alignment score of each column is taken as reference to remove low-scoring columns from the alignment. If set to “gaps”, the columns with the highest proportion of gaps will be excluded.
- Returns
cons : str
A consensus string of the given MSA.
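The basic idea of a consensus over an MSA matrix can be sketched as a column-wise majority vote. The following example is a simplified illustration under that assumption; it ignores trees, sound classes, and local pre-processing, returns a list of segments rather than a string, and the function name consensus is hypothetical.

```python
from collections import Counter


def consensus(matrix, gaps=False, gap="-"):
    """Return the most frequent symbol of each column of an MSA matrix.

    Columns whose most frequent symbol is the gap character are only
    kept when `gaps` is True.
    """
    cons = []
    for column in zip(*matrix):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap or gaps:
            cons.append(symbol)
    return cons


msa = [
    ["w", "o", "l", "-", "d"],
    ["w", "a", "l", "-", "d"],
    ["v", "a", "l", "a", "d"],
]

print(" ".join(consensus(msa)))
print(" ".join(consensus(msa, gaps=True)))
```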
Module contents¶
Package provides basic modules for alignment analyses.