This answer comes a bit late, but still I'd like to share it:
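I used networkx http://networkx.github.com/documentation/latest/ and lxml http://lxml.de/ (which I found lets you traverse the DOM tree much more elegantly). However, the tree layout depends on graphviz http://www.graphviz.org/ and pygraphviz http://networkx.lanl.gov/pygraphviz/ being installed; networkx on its own would just distribute the nodes somehow on the canvas. The code is actually longer than it needs to be, because I draw the labels myself to get them boxed (networkx provides a function for drawing labels, but when I wrote this it did not pass the bbox keyword on to matplotlib).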
import networkx as nx
from lxml import html
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout
raw = "...your raw html"
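def traverse(parent, graph, labels):
    # Remember the tag name of this element so it can be drawn as a label.
    labels[parent] = parent.tag
    # Iterating an lxml element yields its direct children
    # (getchildren() does the same but is deprecated).
    for node in parent:
        graph.add_edge(parent, node)
        traverse(node, graph, labels)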
G = nx.DiGraph()
labels = {} # needed to map from node to tag
html_tag = html.document_fromstring(raw)
traverse(html_tag, G, labels)
pos = graphviz_layout(G, prog='dot')
label_props = {'size': 16,
               'color': 'black',
               'weight': 'bold',
               'horizontalalignment': 'center',
               'verticalalignment': 'center',
               'clip_on': True}
bbox_props = {'boxstyle': "round, pad=0.2",
              'fc': "grey",
              'ec': "b",
              'lw': 1.5}
nx.draw_networkx_edges(G, pos, arrows=True)
ax = plt.gca()
for node, label in labels.items():
    x, y = pos[node]
    ax.text(x, y, label,
            bbox=bbox_props,
            **label_props)
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
plt.show()
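If graphviz or pygraphviz cannot be installed, a hand-rolled layout can stand in for graphviz_layout. The following is only a minimal sketch and not part of my original code: the name tree_layout and its parameters (children, x_gap, y_gap) are made up for illustration, and it assumes the graph really is a tree (no cycles, one parent per node). Leaves are spread left to right and every parent is centred over its first and last child.

```python
def tree_layout(root, children, x_gap=1.0, y_gap=1.0):
    """Return {node: (x, y)} for a tree.

    `children` is any callable mapping a node to an iterable of its
    child nodes (hypothetical interface chosen for this sketch).
    """
    pos = {}
    next_leaf_x = 0.0  # x coordinate handed to the next leaf we meet

    def place(node, depth):
        nonlocal next_leaf_x
        kids = list(children(node))
        if not kids:
            # Leaves take the next free slot on the x axis.
            x = next_leaf_x
            next_leaf_x += x_gap
        else:
            # Inner nodes sit midway between first and last child.
            xs = [place(kid, depth + 1) for kid in kids]
            x = (xs[0] + xs[-1]) / 2.0
        # Deeper levels are drawn further down the canvas.
        pos[node] = (x, -depth * y_gap)
        return x

    place(root, 0)
    return pos


# Tiny stand-in for a DOM tree, just to exercise the function.
demo_tree = {'html': ['head', 'body'], 'body': ['p', 'div']}
demo_pos = tree_layout('html', lambda n: demo_tree.get(n, []))
```

With the code above, the call would presumably look like pos = tree_layout(html_tag, G.successors) in place of the graphviz_layout line. It recurses once per tree level, which should be fine for ordinary HTML nesting depths.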
If you prefer (or have) to use BeautifulSoup, change the code as follows.
I'm no expert, this was just my first look at BS4, but it works:
#from lxml import html
from bs4 import BeautifulSoup
from bs4.element import NavigableString
...
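def traverse(parent, graph, labels):
    # Use id() as the graph node key: as far as I can tell, bs4 derives a
    # Tag's hash from its markup, so with hash() two identical elements
    # (e.g. two empty <li> tags) would collapse into a single node.
    labels[id(parent)] = parent.name
    for node in parent.children:
        # Skip text nodes (comments and doctypes are NavigableStrings too).
        if isinstance(node, NavigableString):
            continue
        graph.add_edge(id(parent), id(node))
        traverse(node, graph, labels)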
...
#html_tag = html.document_fromstring(raw)
soup = BeautifulSoup(raw, 'html.parser')  # name the parser explicitly; bs4 warns if it has to guess
html_tag = next(soup.children)
...