如何为给定外群的一组物种生成所有可能的纽维克树排列？

2023-12-31

对于那些不知道什么是 Newick 树格式的人，可以在以下位置找到一个很好的描述：https://en.wikipedia.org/wiki/Newick_format https://en.wikipedia.org/wiki/Newick_format

我想为给定外群的一组物种创建所有可能的纽维克树排列。我期望处理的叶节点数量很可能是 4、5 或 6 个叶节点。

“软”和“硬”多瓣都是允许的。https://en.wikipedia.org/wiki/Polytomy#Soft_polytomies_vs._hard_polytomies https://en.wikipedia.org/wiki/Polytomy#Soft_polytomies_vs._hard_polytomies https://biology.stackexchange.com/questions/23667/evidence-discussions-of-hard-polytomy https://biology.stackexchange.com/questions/23667/evidence-discussions-of-hard-polytomy

下面显示的是理想的输出，其中“E”设置为外群

理想输出：

((("A","B","C"),("D"),("E"));
((("A","B","D"),("C"),("E"));
((("A","C","D"),("B"),("E"));
((("B","C","D"),("A"),("E"));
((("A","B")("C","D"),("E"));
((("A","C")("B","D"),("E"));
((("B","C")("A","D"),("E"));
(("A","B","C","D"),("E"));
(((("A","B"),"C"),"D"),("E"));

然而，我使用 itertools 提供的任何可能的解决方案，特别是 itertools.permutations，都遇到了等效输出的问题。我想到的最后一个想法涉及如下所示的等效输出。

等效输出：

((("C","B","A"),("D"),("E"));
((("C","A","B"),("D"),("E"));
((("A","C","B"),("D"),("E"));

这是我的解决方案想法的开始。但是，除了 itertools 之外，我现在不太确定如何解决这个问题。

import itertools

def Newick_Permutation_Generator(list_of_species, name_of_outgroup)
    permutations_list =(list(itertools.permutations(["A","B","C","D","E"])))

    for given_permutation in permutations_list:
        process(given_permutation)

Newick_Permutation_Generator(["A","B","C","D","E"], "E")

作为叶子集合的递归集合的树

让我们暂时搁置 newick 表示，并考虑该问题的可能的 python 表示。

一棵有根的树可以被视为一棵集合的递归层次结构（组（组...））叶子。集合是无序的，这非常适合描述树中的分支：{{{"A", "B"}, {"C", "D"}}, "E"}应该是一样的{{{"C", "D"}, {"B", "A"}}, "E"}.

如果我们考虑初始的叶子集{"A", "B", "C", "D", "E"}，以“E”为外群的树是以下形式的集合的集合{tree, "E"} where trees 取自可以从叶子集合构建的树集合{"A", "B", "C", "D"}。我们可以尝试写一个递归trees函数来生成这组树，我们的总树集将表示如下：

{{tree, "E"} for tree in trees({"A", "B", "C", "D"})}

（这里，我使用的是集合理解 https://en.wikipedia.org/wiki/List_comprehension#Set_comprehension符号。）

实际上，python 不允许集合的集合，因为集合的元素必须是“可散列的”（也就是说，python 必须能够计算对象的一些“散列”值，以便能够检查它们是否属于集）。碰巧Python集合没有这个属性。幸运的是，我们可以使用一个类似的数据结构，名为frozenset https://docs.python.org/3/library/stdtypes.html#frozenset，其行为非常像一个集合，但无法修改并且是“可散列的”。因此，我们的树集将是：

all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})

实施`trees`功能

现在让我们重点关注trees功能。

对于每一个可能的分割（分解为一组不相交的子集，包括所有元素）的叶子集，我们需要为分区的每个部分找到所有可能的树（通过递归调用）。对于给定的分区，我们将为跨其各个部分的子树的每个可能组合创建一棵树。

例如，如果一个分区是{"A", {"B", "C", "D"}}，我们将考虑可以由部分组成的所有可能的树"A"（其实只是叶子"A"本身），以及可以由部分组成的所有可能的树{"B", "C", "D"}（那是，trees({"B", "C", "D"})）。然后，通过获取其中一个元素仅来自的所有可能对来获得该分区的可能树"A"，另一个来自trees({"B", "C", "D"}).

这可以推广到具有两个以上部分的分区，并且product函数来自itertools这里似乎很有用。

因此，我们需要一种方法来生成一组叶子的可能分区。

生成集合的分区

这里我做了一个partitions_of_set函数改编自这个解决方案 https://stackoverflow.com/a/30134039/1878788:

# According to https://stackoverflow.com/a/30134039/1878788:
# The problem is solved recursively:
# If you already have a partition of n-1 elements, how do you use it to partition n elements?
# Either place the n'th element in one of the existing subsets, or add it as a new, singleton subset.
def partitions_of_set(s):
    if len(s) == 1:
        yield frozenset(s)
        return
    # Extract one element from the set
    # https://stackoverflow.com/a/43804050/1878788
    elem, *_ = s
    rest = frozenset(s - {elem})
    for partition in partitions_of_set(rest):
        for subset in partition:
            # Insert the element in the subset
            try:
                augmented_subset = frozenset(subset | frozenset({elem}))
            except TypeError:
                # subset is actually an atomic element
                augmented_subset = frozenset({subset} | frozenset({elem}))
            yield frozenset({augmented_subset}) | (partition - {subset})
        # Case with the element in its own extra subset
        yield frozenset({elem}) | partition

为了检查获得的分区，我们创建一个函数来使它们更容易显示（这对于稍后创建树的 newick 表示也很有用）：

def print_set(f):
    if type(f) not in (set, frozenset):
        return str(f)
    return "(" + ",".join(sorted(map(print_set, f))) + ")"

我们测试分区是否有效：

for partition in partitions_of_set({"A", "B", "C", "D"}):
    print(len(partition), print_set(partition))

Output:

1 ((A,B,C,D))
2 ((A,B,D),C)
2 ((A,C),(B,D))
2 ((B,C,D),A)
3 ((B,D),A,C)
2 ((A,B,C),D)
2 ((A,B),(C,D))
3 ((A,B),C,D)
2 ((A,D),(B,C))
2 ((A,C,D),B)
3 ((A,D),B,C)
3 ((A,C),B,D)
3 ((B,C),A,D)
3 ((C,D),A,B)
4 (A,B,C,D)

实际代码`trees`功能

现在我们可以写出tree功能：

from itertools import product
def trees(leaves):
    if type(leaves) not in (set, frozenset):
        # It actually is a single leaf
        yield leaves
        # Don't try to yield any more trees
        return
    # Otherwise, we will have to consider all the possible
    # partitions of the set of leaves, and for each partition,
    # construct the possible trees for each part
    for partition in partitions_of_set(leaves):
        # We need to skip the case where the partition
        # has only one subset (the initial set itself),
        # otherwise we will try to build an infinite
        # succession of nodes with just one subtree
        if len(partition) == 1:
            part, *_ = partition
            # Just to be sure the assumption is correct
            assert part == leaves
            continue
        # We recursively apply *tree* to each part
        # and obtain the possible trees by making
        # the product of the sets of possible subtrees.
        for subtree in product(*map(trees, partition)):
            # Using a frozenset guarantees
            # that there will be no duplicates
            yield frozenset(subtree)

测试它：

all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})

for tree in all_trees:
    print(print_set(tree) + ";")

Output:

(((B,C),A,D),E);
((((A,B),D),C),E);
((((B,D),A),C),E);
((((C,D),A),B),E);
(((A,D),B,C),E);
((A,B,C,D),E);
((((B,D),C),A),E);
(((A,B,C),D),E);
((((A,C),B),D),E);
((((C,D),B),A),E);
((((B,C),A),D),E);
(((A,B),C,D),E);
(((A,C),(B,D)),E);
(((B,D),A,C),E);
(((C,D),A,B),E);
((((A,B),C),D),E);
((((A,C),D),B),E);
(((A,C,D),B),E);
(((A,D),(B,C)),E);
((((A,D),C),B),E);
((((B,C),D),A),E);
(((A,B),(C,D)),E);
(((A,B,D),C),E);
((((A,D),B),C),E);
(((A,C),B,D),E);
(((B,C,D),A),E);

我希望结果是正确的。

这种方法要正确执行有点棘手。我花了一些时间才弄清楚如何避免无限递归（当分区时会发生这种情况）{{"A", "B", "C", "D"}}).

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)