Excel - 使用 FILTERXML 从字符串中提取子字符串

2024-05-04

Background

最近,我一直在尝试更熟悉将分隔字符串更改为 XML 以使用 Excel 进行解析的概念FILTERXML https://support.microsoft.com/en-us/office/filterxml-function-4df72efc-11ec-4951-86f5-c1374812f5b7并检索那些感兴趣的子字符串。请注意,此功能从 Excel 2013 开始提供,在 Excel for Mac 或 Excel Online 上不可用。

对于分隔字符串,我指的是使用空格作为分隔符的普通句子或可用于定义字符串中的子字符串的任何其他字符组合。例如,让我们想象以下情况:

ABC|123|DEF|456|XY-1A|ZY-2F|XY-3F|XY-4f|xyz|123

Question

所以,很多人都知道如何获得nth元素(例如:=TRIM(MID(SUBSTITUTE(A1,"|",REPT(" ",LEN(A1))),3*LEN(A1)+1,LEN(A1)))检索456)。或者其他组合LEN(), MID(), FIND()以及所有这些构造,我们如何使用FILTERXML使用更具体的标准来提取关注的子字符串并清理整个字符串?例如如何检索:

  • 按位置排列的元素
  • 数字或非数字元素
  • 自己包含子字符串的元素
  • 以子字符串开头或结尾的元素
  • 大写或小写的元素
  • 持有数字的元素
  • 独特的价值观
  • ...

Excel's FILTERXML uses XPATH 1.0不幸的是,这意味着它并不像我们希望的那样多样化。另外,Excel 似乎not允许返回修改后的节点值,并且专门允许您按出现顺序选择节点。然而,我们仍然可以使用相当一部分功能。可以找到更多相关信息here https://www.w3.org/TR/1999/REC-xpath-19991116/.

该函数有两个参数:=FILTERXML(<A string in valid XML format>,<A string in valid XPATH format>)

比如说细胞A1保存字符串:ABC|123|DEF|456|XY-1A|ZY-2F|XY-3F|XY-4f|xyz|123。为了创建有效的 XML 字符串,我们使用SUBSTITUTE将分隔符更改为有效的结束标记和开始标记结构。因此,要为给定示例获取有效的 XML 构造,我们可以执行以下操作:

"<t><s>"&SUBSTITUTE(A1,"|","</s><s>")&"</s></t>"

出于可读性原因,我将使用单词来引用上述结构<XML> as a 占位符。下面你会发现不同的有用XPATH有效构造中的函数用于过滤节点:


1) All Elements:

=FILTERXML(<XML>,"//s")

Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f, xyz and 123 (all nodes)


2) Elements by position:

=FILTERXML(<XML>,"//s[position()=4]")

Or:

=FILTERXML(<XML>,"//s[4]")

Returns: 456 (node on index 4)

=FILTERXML(<XML>,"//s[position()<4]")

Returns: ABC, 123 and DEF (nodes on index < 4)

=FILTERXML(<XML>,"//s[position()=2 or position()>5]")

Returns: 123, ZY-2F, XY-3F, XY-4f, xyz and 123 (nodes on index 2 or > 5)

=FILTERXML(<XML>,"//s[last()]")

Returns: 123 (node on last index)

=FILTERXML(<XML>,"//s[position() mod 2 = 1]")

Returns: ABC, DEF, XY-1A, XY-3F and xyz (odd nodes)

=FILTERXML(<XML>,"//s[position() mod 2 = 0]")

Returns: 123, 456, ZF-2F, XY-4f and 123 (even nodes)


3) (Non) numeric elements:

=FILTERXML(<XML>,"//s[number()=.]")

Or:

=FILTERXML(<XML>,"//s[.*0=0]")

Returns: 123, 456, and 123 (numeric nodes)

=FILTERXML(<XML>,"//s[not(number()=.)]")

Or:

=FILTERXML(<XML>,"//s[.*0!=0)]")

Returns: ABC, DEF, XY-1A, ZY-2F, XY-3F, XY-4f and xyz (non-numeric nodes)


4) Elements that (not) contain:

=FILTERXML(<XML>,"//s[contains(., 'Y')]")

Returns: XY-1A, ZY-2F, XY-3F and XY-4f (containing 'Y', notice XPATH is case sensitive, exclusing xyz)

=FILTERXML(<XML>,"//s[not(contains(., 'Y'))]")

Returns: ABC, 123, DEF, 456, xyz and 123 (not containing 'Y', notice XPATH is case sensitive, including xyz)


5) Elements that (not) start or/and end with:

=FILTERXML(<XML>,"//s[starts-with(., 'XY')]")

Returns: XY-1A, XY-3F and XY-4f (starting with 'XY')

=FILTERXML(<XML>,"//s[not(starts-with(., 'XY'))]")

Returns: ABC, 123, DEF, 456, ZY-2F, xyz and 123 (don't start with 'XY')

=FILTERXML(<XML>,"//s[substring(., string-length(.) - string-length('F') +1) = 'F']")

Returns: DEF, ZY-2F and XY-3F (end with 'F', notice XPATH 1.0 does not support ends-with)

=FILTERXML(<XML>,"//s[not(substring(., string-length(.) - string-length('F') +1) = 'F')]")

Returns: ABC, 123, 456, XY-1A, XY-4f, xyz and 123 (don't end with 'F')

=FILTERXML(<XML>,"//s[starts-with(., 'X') and substring(., string-length(.) - string-length('A') +1) = 'A']")

Returns: XY-1A (start with 'X' and end with 'A')


6) Elements that are upper- or lowercase:

=FILTERXML(<XML>,"//s[translate(.,'abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ')=.]")

Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F and 123 (uppercase nodes)

=FILTERXML(<XML>,"//s[translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')=.]")

Returns: 123, 456, xyz and 123 (lowercase nodes)

NOTE: Unfortunately XPATH 1.0 does not support upper-case() nor lower-case() so the above is a workaround. Add special characters if need be.


7) Elements that (not) contain any number:

=FILTERXML(<XML>,"//s[translate(.,'1234567890','')!=.]")

Returns: 123, 456, XY-1A, ZY-2F, XY-3F, XY-4f and 123 (contain any digit)

=FILTERXML(<XML>,"//s[translate(.,'1234567890','')=.]")

Returns: ABC, DEF and xyz (don't contain any digit)

=FILTERXML(<XML>,"//s[translate(.,'1234567890','')!=. and .*0!=0]")

Returns: XY-1A, ZY-2F, XY-3F and XY-4f (holding digits but not a a number on it's own)


8) Unique elements or duplicates:

=FILTERXML(<XML>,"//s[preceding::*=.]")

Returns: 123 (duplicate nodes)

=FILTERXML(<XML>,"//s[not(preceding::*=.)]")

Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f and xyz (unique nodes)

=FILTERXML(<XML>,"//s[not(following::*=. or preceding::*=.)]")

Returns: ABC, DEF, 456, XY-1A, ZY-2F, XY-3F and XY-4f (nodes that have no similar sibling)


9) Elements of certain length:

=FILTERXML(<XML>,"//s[string-length()=5]")

Returns: XY-1A, ZY-2F, XY-3F and XY-4f (5 characters long)

=FILTERXML(<XML>,"//s[string-length()<4]")

Returns: ABC, 123, DEF, 456, xyz and 123 (shorter than 4 characters)


10) Elements based on preceding/following:

=FILTERXML(<XML>,"//s[preceding::*[1]='456']")

Returns: XY-1A (previous node equals '456')

=FILTERXML(<XML>,"//s[starts-with(preceding::*[1],'XY')]")

Returns: ZY-2F, XY-4f, and xyz (previous node starts with 'XY')

=FILTERXML(<XML>,"//s[following::*[1]='123']")

Returns: ABC, and xyz (following node equals '123')

=FILTERXML(<XML>,"//s[contains(following::*[1],'1')]")

Returns: ABC, 456, and xyz (following node contains '1')

=FILTERXML(<XML>,"//s[preceding::*='ABC' and following::*='XY-3F']")

Or:

=FILTERXML(<XML>,"//s[.='ABC']/following::s[following::s='XY-3F']")    

Returns: 123, DEF, 456, XY-1A and ZY-2F (everything between 'ABC' and 'XY-3F')


11) Elements based on sub-strings:

=FILTERXML(<XML>,"//s[substring-after(., '-') = '3F']")

Returns: XY-3F (nodes ending with '3F' after hyphen)

=FILTERXML(<XML>,"//s[contains(substring-after(., '-') , 'F')]")

Returns: ZY-2F and XY-3F (nodes containing 'F' after hyphen)

=FILTERXML(<XML>,"//s[substring-before(., '-') = 'ZY']")

Returns: ZY-2F (nodes starting with 'ZY' before hyphen)

=FILTERXML(<XML>,"//s[contains(substring-before(., '-'), 'Y')]")

Returns: XY-1A, ZY-2F, XY-3F and XY-4f (nodes containing 'Y' before hyphen)


12) Elements based on concatenation:

=FILTERXML(<XML>,"//s[concat(., '|', following::*[1])='ZY-2F|XY-3F']")

Returns: ZY-2F (nodes when concatenated with '|' and following sibling equals 'ZY-2F|XY-3F')

=FILTERXML(<XML>,"//s[contains(concat(., preceding::*[2]), 'FA')]")

Returns: DEF (nodes when concatenated with sibling two indices to the left contains 'FA')


13) Empty vs. Non-empty:

=FILTERXML(<XML>,"//s[count(node())>0]")

Or:

=FILTERXML(<XML>,"//s[node()]")

Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f, xyz and 123 (all nodes that are not empty)

=FILTERXML(<XML>,"//s[count(node())=0]")

Or:

=FILTERXML(<XML>,"//s[not(node())]")

Returns: None (all nodes that are empty)


14) Preceding or Following:

=FILTERXML(<XML>,"//s[substring(., string-length(.) - string-length('F') +1) = 'F'][last()]/following::*")

Returns: XY-4f, xyz and 123 (all nodes to the right of the last node that ends with an uppercase 'F')

=FILTERXML(<XML>,"//s[substring(., string-length(.) - string-length('F') +1) = 'F'][1]/preceding::*")

Returns: ABC and 123 (all nodes to the left of the first node that ends with an uppercase 'F')


15) (Preceding or Following) and self:

=FILTERXML(<XML>,"(//s[.*0!=0][last()]|//s[.*0!=0][last()]/preceding::*)")

Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f and xyz (trim all numeric nodes from the right)††

=FILTERXML(<XML>,"(//s[.*0=0][1]|//s[.*0=0][1]/following::*)")

Returns: 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f, xyz and 123 (trim all non-numeric nodes from the left)


16) Maximum or Minimum:

=FILTERXML(<XML>,"(//s[.*0=0][not(.<//s[.*0=0])])[1]")

Returns: 456 (The maximum value looking at numeric nodes)

=FILTERXML(<XML>,"(//s[.*0=0][not(.>//s[.*0=0])])[1]")

Returns: 123 (The minimum value looking at numeric nodes)

NOTE: This is the equivalent to returning all numeric nodes as per #3 and post-process the array using Excel's MIN() and MAX() functions.


现在显然上面是可能性的演示XPATH 1.0功能,您可以获得上述及更多功能的一系列组合!我试图涵盖最常用的字符串函数。如果您有遗漏,请随时评论。

尽管问题本身相当广泛,但我希望就如何使用提供一些总体指导FILTERXML对于手头的查询。该公式返回要以任何其他方式使用的节点数组。很多时候我会用它TEXTJOIN() or INDEX()。但我想其他选项将是新的 DA 功能来泄露结果。

请注意,在解析字符串时FILTERXML()、与号字符 (&) 和左尖括号 (not以字面形式出现。它们分别需要替换为&amp; or &lt;。另一种选择是使用 ISO/IEC 10646 数字字符code https://m.tonysweb.be/ISO_IEC%2010646-12000(E).html being &#38; or &#60;分别。解析后,该函数将以文字形式将这些字符返回给您。不用说,用分号分割字符串就变得很棘手。


Each predicate, the structure between the opening and closing square brackets, is a filter of a given nodelist. To write multiple of these structures is in fact anding https://stackoverflow.com/questions/62969654/xpath-what-do-nested-square-brackets-mean/62970207#62970207 such predicates.

‡‡ There isn't really an easy, following/preceding siblings and self, construct. Therefor I used the Union https://examples.javacodegeeks.com/core-java/xml/xpath/xpath-union-operator-example/ operator. This, however, requires multiple expressions to be inside paranthesis https://stackoverflow.com/a/26598269/9758194. Much like alternations within a capture group if one would think about regular expressions.

本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

Excel - 使用 FILTERXML 从字符串中提取子字符串 的相关文章

随机推荐