lingpy.algorithm package¶
Subpackages¶
Submodules¶
lingpy.algorithm.cluster_util module¶
Various utility functions which are useful for algorithmic operations
- lingpy.algorithm.cluster_util.generate_all_clusters(numbers)¶
Generate all possible clusters for a number of elements.
- Returns
clr : iterator
An iterator that will yield the next of all possible clusters.
- lingpy.algorithm.cluster_util.generate_random_cluster(numbers, bias=False)¶
Generate a random cluster for a number of elements.
- Parameters
numbers : int
Number of separate entities which should be clustered.
bias : str (default=False)
When set to “lumper” will tend to create larger groups, when set to “splitter” it will tend to produce smaller groups.
- Returns
cluster : list
A list with consecutive ordering of clusters, starting from zero.
- lingpy.algorithm.cluster_util.mutate_cluster(clr, chance=0.5)¶
Mutate a cluster.
- Parameters
clr : cluster
A list with ordered clusters.
chance : float (default=0.5)
The mutation rate for each element in a cluster. If set to 0.5, this means that in 50% of the cases, an element will be assigned to another cluster or a new cluster.
- Returns
valid_cluster : list
A newly clustered list in consecutive order.
- lingpy.algorithm.cluster_util.order_cluster(clr)¶
Order a cluster into the form of a valid cluster.
- Parameters
clr : list
A list with clusters assigned by given each element a specific clusuter ID.
- Returns
valid_cluster : list
A list in which the IDs start from zero and increase consecutively with each new cluster introduced.
- lingpy.algorithm.cluster_util.valid_cluster(sequence)¶
Only allow to have sequences which have consecutive ordering of elements.
- Parameters
sequence : list
A cluster sequence in which elements should be consecutively ordered, starting from 0, and identical segments in the sequence retrieve the same number.
- Returns
valid_cluster : bool
True, if the cluster is valid, and False if it judged to be invalid.
Examples
We define a valid and an invalid cluster sequence:
>>> clrA = [0, 1, 2, 3] >>> clrB = [1, 1, 2, 3] # should be [0, 0, 1, 2] >>> from lingpy.algorithm.cluster_util import valid_cluster >>> valid_cluster(clrA) True >>> valid_cluster(clrB) False
lingpy.algorithm.clustering module¶
Module provides general clustering functions for LingPy.
- lingpy.algorithm.clustering.best_threshold(matrix, trange=(0.3, 0.7, 0.05))¶
Calculate the best threshold by maximizing partition density for a given range of thresholds.
Notes
This method makes use of the idea of partition density proposed in
Ahn2010
.
- lingpy.algorithm.clustering.check_taxon_names(taxa)¶
- lingpy.algorithm.clustering.find_threshold(matrix, thresholds=[0.9, 0.8500000000000001, 0.8, 0.75, 0.7000000000000001, 0.65, 0.6000000000000001, 0.55, 0.5, 0.45, 0.4, 0.35000000000000003, 0.30000000000000004, 0.25, 0.2, 0.15000000000000002, 0.1, 0.05], logs=True)¶
Use a variant of the method by
Apeltsin2011
in order to find an optimal threshold.- Parameters
matrix : list
The distance matrix for which the threshold shall be determined.
thresholds : list (default=[i*0.05 for i in range(1,19)[::-1])
The range of thresholds that shall be tested.
logs : {bool,builtins.function} (default=True)
If set to True, the logarithm of the score beyond the threshold will be assigned as weight to the graph. If set to c{False} all weights will be set to 1. Use a custom function to define individual ways to calculate the weights.
- Returns
threshold : {float,None}
If a float is returned, this is the threshold identified by the method. If None is returned, no threshold could be identified.
Notes
This is a very simple method that may not work well depending on the dataset. So we recommend to use it with great care.
- lingpy.algorithm.clustering.flat_cluster(method, threshold, matrix, taxa=None, revert=False)¶
Carry out a flat cluster analysis based on linkage algorithms.
- Parameters
method : { “upgma”, “single”, “complete”, “ward”}
Select between ‘ugpma’, ‘single’, and ‘complete’. You can also test “ward”, but there’s no guarantee that this is the correct algorithm.
threshold : float
The threshold which terminates the algorithm.
matrix : list
A two-dimensional list containing the distances.
taxa : list (default=None)
A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.
- Returns
clusters : dict
A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.
See also
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import * >>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3]) >>> matrix [[0.0, 0.5, 0.67, 0.8, 0.2], [0.5, 0.0, 0.4, 0.7, 0.6], [0.67, 0.4, 0.0, 0.8, 0.8], [0.8, 0.7, 0.8, 0.0, 0.3], [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the flat cluster analysis.
>>> flat_cluster('upgma',0.6,matrix,taxa) {0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}
- lingpy.algorithm.clustering.flat_upgma(threshold, matrix, taxa=None, revert=False)¶
Carry out a flat cluster analysis based on the UPGMA algorithm (
Sokal1958
).- Parameters
threshold : float
The threshold which terminates the algorithm.
matrix : list
A two-dimensional list containing the distances.
taxa : list (default=None)
A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.
- Returns
clusters : dict
A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.
See also
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import * >>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3]) >>> matrix [[0.0, 0.5, 0.67, 0.8, 0.2], [0.5, 0.0, 0.4, 0.7, 0.6], [0.67, 0.4, 0.0, 0.8, 0.8], [0.8, 0.7, 0.8, 0.0, 0.3], [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the flat cluster analysis.
>>> flat_upgma(0.6,matrix,taxa) {0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}
- lingpy.algorithm.clustering.fuzzy(threshold, matrix, taxa, method='upgma', revert=False)¶
Create fuzzy cluster of a given distance matrix.
- Parameters
threshold : float
The threshold that shall be used for the basic clustering of the data.
matrix : list
A two-dimensional list containing the distances.
taxa : list
An list containing the names of all taxa corresponding to the distances in the matrix.
method : { “upgma”, “single”, “complete” } (default=”upgma”)
Select the method for the flat cluster analysis.
distances : bool
If set to “False”, only the topology of the tree will be returned.
revert : bool (default=False)
Specify whether a reverted dictionary should be returned.
- Returns
cluster : dict
A dictionary with cluster-IDs as keys and a list as value, containing the taxa that are assigned to a given cluster-ID.
See also
Notes
This is a very simple fuzzy clustering algorithm. It basically does nothing else than removing taxa successively from the matrix, flat-clustering the remaining taxa with the corresponding threshold, and then returning a combined “consensus” cluster in which taxa may be assigned to multiple clusters.
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import * from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3]) >>> matrix [[0.0, 0.5, 0.67, 0.8, 0.2], [0.5, 0.0, 0.4, 0.7, 0.6], [0.67, 0.4, 0.0, 0.8, 0.8], [0.8, 0.7, 0.8, 0.0, 0.3], [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the fuzzy flat cluster analysis.
>>> fuzzy(0.5,matrix,taxa) {1: ['Swedish', 'Icelandic'], 2: ['Dutch', 'German'], 3: ['Dutch', 'English']}
- lingpy.algorithm.clustering.link_clustering(threshold, matrix, taxa, link_threshold=False, revert=False, matrix_type='distances', fuzzy=True)¶
Carry out a link clustering analysis using the method by
Ahn2010
.- Parameters
threshold : {float, bool}
The threshold that shall be used for the initial selection of links assigned to the data. If set to c{False}, the weights from the matrix will be used directly.
matrix : list
A two-dimensional list containing the distances.
taxa : list
An list containing the names of all taxa corresponding to the distances in the matrix.
link_threshold : float (default=0.5)
The threshold that shall be used for the internal clustering of the data.
matrix_type : {“distances”,”similarities”,”weights”} (default=”distances”)
Specify the type of the matrix. If the matrix contains distance data, it will be adapted to similarity data. If it contains “similarities”, no adaptation is needed. If it contains “weights”, a weighted version of link clustering (see the supplementary in
Ahn2010
for details) ]will be carried out.- Returns
cluster : dict
A dictionary with cluster-IDs as keys and a list as value, containing the taxa that are assigned to a given cluster-ID.
See also
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import * >>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3]) >>> matrix [[0.0, 0.5, 0.67, 0.8, 0.2], [0.5, 0.0, 0.4, 0.7, 0.6], [0.67, 0.4, 0.0, 0.8, 0.8], [0.8, 0.7, 0.8, 0.0, 0.3], [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the link-clustering analysis.
>>> link_clustering(0.5,matrix,taxa) {1: ['Dutch', 'English', 'German'], 2: ['Icelandic', 'Swedish']}
- lingpy.algorithm.clustering.matrix2groups(threshold, matrix, taxa, cluster_method='upgma')¶
Calculate flat cluster of distance matrix.
- Parameters
threshold : float
The threshold to be used for the calculation.
matrix : list
The distance matrix to be used.
taxa : list
A list of the taxa in the distance matrix.
cluster_method : {“upgma”, “mcl”, “single”, “complete”} (default=”upgma”)
- Returns
groups : dict
A dictionary with the taxa as keys and the group assignment as values.
Notes
This function is important for internal calculations within wordlist. It is not recommended for further use.
- lingpy.algorithm.clustering.matrix2tree(matrix, taxa, tree_calc='neighbor', distances=True, filename='')¶
Calculate a tree of a given distance matrix.
- Parameters
matrix : list
The distance matrix to be used.
taxa : list
A list of the taxa in the distance matrix.
tree_calc : str (default=”neighbor”)
The method for tree calculation that shall be used. Select between:
“neighbor”: Neighbor-joining method (
Saitou1987
)“upgma” : UPGMA method (
Sokal1958
)
distances : bool (default=True)
If set to c{True}, distances will be included in the tree-representation.
filename : str (default=’’)
If a filename is specified, the data will be written to that file.
- Returns
tree : ~lingpy.thirdparty.cogent.tree.PhyloNode
A ~lingpy.thirdparty.cogent.tree.PhyloNode object for handling tree files.
- lingpy.algorithm.clustering.mcl(threshold, matrix, taxa, max_steps=1000, inflation=2, expansion=2, add_self_loops=True, revert=False, logs=True, matrix_type='distances')¶
Carry out a clustering using the MCL algorithm (
Dongen2000
).- Parameters
threshold : {float, bool}
The threshold that shall be used for the initial selection of links assigned to the data. If set to c{False}, the weights from the matrix will be used directly.
matrix : list
A two-dimensional list containing the distances.
taxa : list
An list containing the names of all taxa corresponding to the distances in the matrix.
max_steps : int (default=1000)
Maximal number of iterations.
inflation : int (default=2)
Inflation parameter for the MCL algorithm.
expansion : int (default=2)
Expansion parameter of the MCL algorithm.
add_self_loops : {True, False, builtins.function} (default=True)
Determine whether self-loops should be added, and if so, how they should be weighted. If a function for the calculation of self-loops is given, it will take the whole column of the matrix for each taxon as input.
logs : { bool, function } (default=True)
If set to c{True}, the logarithm of the score beyond the threshold will be assigned as weight to the graph. If set to c{False} all weights will be set to 1. Use a custom function to define individual ways to calculate the weights.
matrix_type : { “distances”, “similarities” }
Specify the type of the matrix. If the matrix contains distance data, it will be adapted to similarity data. If it contains “similarities”, no adaptation is needed.
Examples
The function is automatically imported along with LingPy.
>>> from lingpy import * >>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3]) >>> matrix [[0.0, 0.5, 0.67, 0.8, 0.2], [0.5, 0.0, 0.4, 0.7, 0.6], [0.67, 0.4, 0.0, 0.8, 0.8], [0.8, 0.7, 0.8, 0.0, 0.3], [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the link-clustering analysis.
>>> mcl(0.5,matrix,taxa) {1: ['German', 'English', 'Dutch'], 2: ['Swedish', 'Icelandic']}
- lingpy.algorithm.clustering.neighbor(matrix, taxa, distances=True)¶
Function clusters data according to the Neighbor-Joining algorithm (
Saitou1987
).- Parameters
matrix : list
A two-dimensional list containing the distances.
taxa : list
An list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool (default=True)
If set to False, only the topology of the tree will be returned.
- Returns
newick : str
A string in newick-format which can be further used in biological software packages to view and plot the tree.
See also
Examples
Function is automatically imported when importing lingpy.
>>> from lingpy import * >>> from lingpy.algorithm import squareform
Create an arbitrary list of taxa.
>>> taxa = ['Norwegian','Swedish','Icelandic','Dutch','English']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> neighbor(matrix,taxa) '(((Norwegian,(Swedish,Icelandic)),English),Dutch);'
- lingpy.algorithm.clustering.partition_density(matrix, t)¶
Calculate partition density for a given threshold on a distance matrix.
Notes
See
Ahn2012
for details on the calculation of partition density in a given network.
- lingpy.algorithm.clustering.upgma(matrix, taxa, distances=True)¶
Carry out a cluster analysis based on the UPGMA algorithm (
Sokal1958
).- Parameters
matrix : list
A two-dimensional list containing the distances.
taxa : list
An list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool (default=True)
If set to False, only the topology of the tree will be returned.
- Returns
newick : str
A string in newick-format which can be further used in biological software packages to view and plot the tree.
See also
Examples
Function is automatically imported when importing lingpy.
>>> from lingpy import * >>> from lingpy.algorithm import squareform
Create an arbitrary list of taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> upgma(matrix,taxa,distances=False) '((Swedish,Icelandic),(English,(German,Dutch)));'
lingpy.algorithm.extra module¶
Adapting specific cluster algorithms from scikit-learn to LingPy.
- lingpy.algorithm.extra.affinity_propagation(threshold, matrix, taxa, revert=False)¶
Compute affinity propagation from the matrix.
- Parameters
threshold : float
The threshold for clustering you want to use.
matrix : list
The two-dimensional matrix passed as list or array.
taxa : list
The list of taxon names. If set to “False” a fake list of taxon names will be created, giving a positive numerical ID in increasing order for each column in the matrix.
revert : bool
If set to “False”, don’t return taxon names but simply the language identifiers and their labels as a dictionary. Otherwise returns a dictionary with labels as keys and list of taxon names as values.
- Returns
clusters : dict
Either a dictionary of taxon identifiers and labels, or a dictionary of labels and taxon names.
Notes
Affinity propagation is a clustering method originally proposed by
Frey2007
.Requires the scikitlearn package, downloadable from http://scikit-learn.org/.
- lingpy.algorithm.extra.dbscan(threshold, matrix, taxa, revert=False, min_samples=1)¶
Compute DBSCAN cluster analysis.
- Parameters
threshold : float
The threshold for clustering you want to use.
matrix : list
The two-dimensional matrix passed as list or array.
taxa : list
The list of taxon names. If set to “False” a fake list of taxon names will be created, giving a positive numerical ID in increasing order for each column in the matrix.
revert : bool
If set to “False”, don’t return taxon names but simply the language identifiers and their labels as a dictionary. Otherwise returns a dictionary with labels as keys and list of taxon names as values.
min_samples : int (default=1)
The minimal samples parameter of the DBCSCAN method from the SKLEARN package.
- Returns
clusters : dict
Either a dictionary of taxon identifiers and labels, or a dictionary of labels and taxon names.
Notes
This method does not work as expected, probably since it normally requires distances between points as input. We list it only for completeness here, but urge to be careful when using the code and checking properly our implementation in the source code.
Requires the scikitlearn package, downloadable from http://scikit-learn.org/.
- lingpy.algorithm.extra.infomap_clustering(threshold, matrix, taxa=False, revert=False)¶
Compute the Infomap clustering analysis of the data.
- Parameters
threshold : float
The threshold for clustering you want to use.
matrix : list
The two-dimensional matrix passed as list or array.
taxa : list
The list of taxon names. If set to “False” a fake list of taxon names will be created, giving a positive numerical ID in increasing order for each column in the matrix.
revert : bool
If set to “False”, don’t return taxon names but simply the language identifiers and their labels as a dictionary. Otherwise returns a dictionary with labels as keys and list of taxon names as values.
- Returns
clusters : dict
Either a dictionary of taxon identifiers and labels, or a dictionary of labels and taxon names.
Notes
Infomap clustering is a community detection method originally proposed by
Rosvall2008
. This method requires the igraph package, downloadable from http://igraph.org/.
Module contents¶
Package for specific algorithms and time-intensive routines.