I am currently using PHP and DOMXPath to get the contents of all the <p> elements in a web page:
<?php
...
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");
foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}
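My problem is that the string produced by textContent does not respect the <br /> tags that exist within the <p> elements. Instead, it removes the line breaks and pushes together words that would normally be on separate lines. For example:
Sample HTML: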
<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>
<p>
Random information and what not<br />
Isn't that cool?
</p>
Current output of the PHP above:
Some happy talk goes here talking about our great product.We would love for you to buy it!
Random information and what notIsn't that cool?
I also tried $paragraphs = $doc->getElementsByTagName("p"); and it gave me the same result.
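Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate every word in a paragraph, and the current output does not allow that.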
If there is an alternative way to retrieve the strings within the <p> elements while preserving either <br /> or '\n', that would be great too.
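One alternative that might fit here, sketched under assumptions rather than taken from any answer: replace each <br> element with a newline text node inside the DOM before reading textContent. The function name paragraphsWithLineBreaks is hypothetical, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original post): returns the text of
// every <p>, with each <br> converted to a "\n" character.
function paragraphsWithLineBreaks(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Copy the matched <br> nodes into an array first, so the loop is not
    // affected by the DOM changes made inside it.
    foreach (iterator_to_array($xpath->query('//br')) as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    $texts = [];
    foreach ($xpath->query('/html/body//p') as $p) {
        $texts[] = $p->textContent;
    }
    return $texts;
}

$html = '<p>Random information and what not<br>Isn\'t that cool?</p>';
print_r(paragraphsWithLineBreaks($html));
```

Because the replacement happens on the parsed tree, it does not depend on how the tag was written in the source (<br>, <br /> or <BR>).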
EDIT
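After further investigation, the HTML in question turned out to be a list of anchors separated by <br> tags, with no actual line breaks in the source: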
<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>
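As it turns out, the code does work correctly with the sample HTML originally given, because that markup contains real line breaks after each <br />.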
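The difference between the two inputs can be checked with a small script. This is only a sketch: paragraphTexts is a hypothetical wrapper around the same loop as in the question, and the two one-line HTML strings are made-up stand-ins for the real markup.

```php
<?php
// Hypothetical wrapper around the loop from the question: returns the
// textContent of every <p> under <body>.
function paragraphTexts(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $texts = [];
    foreach ($xpath->query('/html/body//p') as $p) {
        $texts[] = $p->textContent;
    }
    return $texts;
}

// <br /> followed by a real newline: the newline character is part of the
// text node, so it is still present in textContent.
var_dump(paragraphTexts("<p>one<br />\ntwo</p>"));

// <br> with nothing after it: the element contributes no text of its own,
// so the neighbouring words run together.
var_dump(paragraphTexts("<p>one<br>two</p>"));
```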
UPDATE: SOLVED
With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this has been solved. It isn't very elegant, but it will do for now.
Example:
foreach ($paragraphs as $paragraph) {
    // Re-parse the paragraph's inner HTML with every <br> replaced by a line break.
    $p_text = new DOMDocument();
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
    // Do whatever; in this case, get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}
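This uses DOMinnerHTML (https://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument, as suggested by @netcoder) to get each paragraph's markup, replaces the instances of <br> with "\r\n" (as suggested by @ircmaxell), and then reads the result through textContent.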
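DOMinnerHTML is not a built-in function; it comes from the linked question and is not reproduced in this post. A minimal stand-in might look like the following, assuming PHP 5.3.6 or later (where saveHTML() accepts a node argument) plus PHP 7 type declarations. It is a sketch, not the code from the linked answer, and the short anchor list is made-up sample input.

```php
<?php
// Stand-in for the DOMinnerHTML helper referenced above. This is an
// assumption about its behavior, not a copy of the linked code.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // Serialize each child node (elements and text) back to HTML.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

$doc = new DOMDocument();
$doc->loadHTML('<p><a href="/a.html">Classic Checking</a><br> <a href="/b.html">Interest Checking</a></p>');
$p = $doc->getElementsByTagName('p')->item(0);

// Prints the markup inside the <p>, without the <p> tags themselves.
echo DOMinnerHTML($p), "\n";
```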
There is obviously still room for improvement, but it solves my current problem.
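One possible improvement, offered only as a sketch: the explode(" ", ...) call produces empty strings wherever two separators are adjacent. Splitting with preg_split() and the PREG_SPLIT_NO_EMPTY flag avoids that. The function name splitWords is hypothetical, and the character class simply mirrors the separators used in the str_ireplace() call above.

```php
<?php
// Hypothetical helper: split text into words on whitespace and on the
// punctuation characters , . & : - while dropping empty pieces.
function splitWords(string $text): array
{
    return preg_split('/[\s,.&:\-]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitWords("Classic Checking\r\n Interest Checking\r\n CDs"));
```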
Thanks for the help, everyone,
Ben