"rect" in links extracted by Solr ExtractingRequestHandler

2023-12-30

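I am using the Solr ExtractingRequestHandler to extract and index HTML content. My question concerns the extracted links it produces: the returned content has "rect" inserted at positions where nothing of the kind exists in the HTML source.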

My solrconfig configuration for Solr Cell is as follows:

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

My Solr schema.xml contains the following entries:

   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="meta" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="content_encoding" type="string" indexed="false" stored="true" multiValued="false"/>
   <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

I post the following HTML to Solr Cell:

<!DOCTYPE html>
<html>
<body>
  <h1>Heading1</h1><a href="http://www.google.com">Link to Google</a><a href=
  "http://www.google.com">Link to Google2</a><a href="http://www.google.com">Link to
  Google3</a><a href="http://www.google.com">Link to Google</a>

  <p>Paragraph1</p>
</body>
</html>

Solr indexes the following:

  {
    "meta": [
      "Content-Encoding",
      "ISO-8859-1",
      "ignored_hbaseindexer_mime_type",
      "text/html",
      "Content-Type",
      "text/html; charset=ISO-8859-1"
    ],
    "links": [
      "rect",
      "http://www.google.com",
      "rect",
      "http://www.google.com",
      "rect",
      "http://www.google.com",
      "rect",
      "http://www.google.com"
    ],
    "content_encoding": "ISO-8859-1",
    "content_type": [
      "text/html; charset=ISO-8859-1"
    ],
    "content": [
      "             Heading1  Link to Google  Link to Google2  Link to Google3  Link to Google  Paragraph1   "
    ],
    "id": "row69",
    "_version_": 1461665607851180000
  }

Note the "rect" before each link. Why is Solr Cell or Tika inserting these? I have not defined a Tika configuration file. Do I need to configure Tika?
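One possible explanation, which this thread does not confirm and which should be read as an assumption: Tika's HTML parsing is known to report anchor elements with a default shape="rect" attribute, and with captureAttr=true Solr Cell indexes attribute values in general rather than only href. The Python sketch below is just a model of that idea, not Solr or Tika code; the `anchors` list and both helper functions are hypothetical.

```python
# Hypothetical model: each <a> element as a parser might report it,
# with an assumed default shape="rect" attribute next to href.
URL = "http://www.google.com"
anchors = [{"shape": "rect", "href": URL} for _ in range(4)]


def capture_all_attrs(elements):
    """Collect every attribute value, in document order."""
    return [value for element in elements for value in element.values()]


def capture_href_only(elements):
    """Collect only the href values, ignoring all other attributes."""
    return [element["href"] for element in elements if "href" in element]


print(capture_all_attrs(anchors))  # alternating "rect" and URL values
print(capture_href_only(anchors))  # URL values only
```

Under this assumption, collecting all attribute values reproduces the alternating "rect" / URL pattern seen in the links field above, while restricting capture to href would not.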

Although this is an old question, I ran into the same issue when indexing HTML documents with Solr 8.7.0.

<requestHandler name="/update/extract" 
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    ....

HTML:

<p>My website is <a href="https://buriedtruth.com/">BuriedTruth.com</a>.</p>

Result:

My website is rect https://buriedtruth.com/ BuriedTruth.com .

[ I post/index from the Linux command line: solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html; ]

I grepped the Solr code for that word (ripgrep: rg --color=always -w -e 'rect' . | less) but found nothing, so where the rect http... in the indexed URLs comes from puzzles me.

My solution was to add a regex processor to my solrconfig.xml:

  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <!-- ======================================== -->
    <!-- https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html -->
    <!-- Solr bug? URLs parse as "rect https..."  Managed-schema (Admin UI): defined p as text_general -->
    <!-- but did not parse. Looking at content | title: text_general copied to string, so added  -->
    <!-- copyfield of p (text_general) as p_str ... regex below now works! -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">content</str>
      <str name="fieldName">title</str>
      <str name="fieldName">p</str>
      <!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies -->
      <!-- of this processor as needed): -->
      <str name="pattern">rect http</str>
      <str name="replacement">http</str>
      <bool name="literalReplacement">true</bool>
    </processor>
    <!-- ======================================== -->
    <!-- This needs to be last (may need to clear documents and reindex to see changes, e.g. Solr Admin UI): -->
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
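To illustrate the effect of that pattern and replacement on a single field value, here is a minimal Python sketch. It is not Solr code and only approximates what the processor does; `strip_rect` is a hypothetical helper name.

```python
import re

# Same pattern and replacement as in the processor above; matching is case-sensitive.
PATTERN = re.compile(r"rect http")
REPLACEMENT = "http"


def strip_rect(value: str) -> str:
    """Replace every 'rect http' with 'http' in one field value."""
    # Using a function as the replacement inserts the text literally
    # (no backreference expansion), loosely mirroring literalReplacement=true.
    return PATTERN.sub(lambda _match: REPLACEMENT, value)


print(strip_rect("My website is rect https://buriedtruth.com/ BuriedTruth.com ."))
```

Because the pattern is a prefix match on "rect http", it covers both http and https URLs, but it would leave a differently cased "Rect http" untouched.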

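As mentioned in my comments in that processor, I am extracting <p />-formatted HTML content into a p field (field: p | type: text_general).

At first, that content was not processed by the RegexReplaceProcessorFactory processor.

In the Solr Admin UI I noticed that title and content were copied as strings (e.g., field: content | type: text_general | copied to: content_str), so I created a copy field (p >> p_str), which resolved the regex issue.

For completeness, here are the relevant parts of my solrconfig.xml related to HTML document indexing: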

  <lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

  <!-- https://lucene.472066.n3.nabble.com/Prons-an-Cons-of-Startup-Lazy-a-Handler-td4059111.html -->
                  <!-- startup="lazy" -->

  <requestHandler name="/update/extract"
                  class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <str name="capture">div</str>
      <str name="fmap.div">div</str>
      <str name="capture">p</str>
      <str name="fmap.p">p</str>
    </lst>
  </requestHandler>

... noting again that I added the fields to the managed-schema via the Solr Admin UI.

Result:

My website is https://buriedtruth.com/ BuriedTruth.com .

  <field name="p" type="text_general" uninvertible="true" indexed="true" stored="true"/>
  <copyField source="p" dest="p_str"/>

See also:

re: <requestHandler name="/update/extract"...:
- Solr 8.6.3 could not index html file https://stackoverflow.com/questions/64659922/solr-8-6-3-could-not-index-html-file
- https://lucene.apache.org/solr/guide/8_6/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-extractingrequesthandler-in-solrconfig-xml
My answer here (concerning issues with the <updateRequestProcessorChain /> above) on switching from the Solr managed-schema to the classic schema.xml:
- How to extract metatags from HTML files and index them in SOLR and TIKA https://stackoverflow.com/questions/15005919/how-to-extract-metatags-from-html-files-and-index-them-in-solr-and-tika/64884222#64884222
