If you want to group the sections, you can use itertools.groupby,
with empty lines as the delimiters:
from itertools import groupby

with open("in.txt") as f:
    for k, sec in groupby(f, key=lambda x: bool(x.strip())):
        if k:
            print(list(sec))
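To see what that loop produces without needing a file, here is a minimal sketch that runs the same grouping over an in-memory list of lines. The sample lines and the helper name `split_on_blank_lines` are made up for illustration; they are not part of the question.

```python
from itertools import groupby


def split_on_blank_lines(lines):
    """Return the runs of consecutive non-blank lines as lists (hypothetical helper)."""
    return [
        [line.rstrip("\n") for line in group]
        for has_text, group in groupby(lines, key=lambda x: bool(x.strip()))
        if has_text  # skip the groups that consist of blank lines
    ]


# Made-up sample input: two blocks separated by one blank line.
sample = ["first block, line 1\n", "first block, line 2\n", "\n", "second block\n"]
blocks = split_on_blank_lines(sample)
print(blocks)
# [['first block, line 1', 'first block, line 2'], ['second block']]
```

Each inner list is one run of consecutive non-blank lines, so any number of blank lines in a row acts as a single delimiter.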
With a bit more itertools foo, we can use the uppercase titles as the delimiter to get the sections:
from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: x.isupper())
    for k, sec in grps:
        # if we hit a title line
        if k:
            # pull all paragraphs
            v = next(grps)[1]
            # skip two empty lines after title
            next(v, ""), next(v, "")
            # take all lines up to next empty line/second paragraph
            print(list(takewhile(lambda x: bool(x.strip()), v)))
This gives you:
['There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.\n']
['What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.']
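Each section starts with an all-uppercase title, so once we hit a title we know two empty lines follow, then the first paragraph, and then the pattern repeats.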
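A small sketch of why pulling the next group right after a title works: with `str.isupper` as the key, groupby alternates between a group holding the title line and a group holding everything up to the next title. The lines below are an abbreviated stand-in for the text in the question, not the real file.

```python
from itertools import groupby

# Abbreviated stand-in for in.txt (assumed layout: title, two blank lines, paragraphs).
lines = [
    "THE LAY OF THE LAND\n",
    "\n",
    "\n",
    "First paragraph.\n",
    "\n",
    "Second paragraph.\n",
    "\n",
    "\n",
    "WILD ANIMAL TEMPERAMENT & INDIVIDUALITY\n",
    "\n",
    "\n",
    "Another first paragraph.\n",
]

# Materialise each group so the alternation between titles and bodies is visible.
groups = [(is_title, list(group)) for is_title, group in groupby(lines, key=str.isupper)]
for is_title, group in groups:
    print(is_title, group)
```

Blank lines are not uppercase, so they land in the body group together with the paragraphs, which is why the two leading blank lines have to be skipped by hand.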
To break that down into explicit loops:
from itertools import groupby
def parse_sec(bk):
    with open(bk) as f:
        grps = groupby(f, key=lambda x: bool(x.isupper()))
        for k, sec in grps:
            if k:
                print("First paragraph from section titled :{}".format(next(sec).rstrip()))
                v = next(grps)[1]
                next(v, ""), next(v, "")
                for line in v:
                    if not line.strip():
                        break
                    print(line)
For your text:
In [11]: cat -E in.txt
THE LAY OF THE LAND$
$
$
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.$
$
Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.$
$
$
WILD ANIMAL TEMPERAMENT & INDIVIDUALITY$
$
$
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.
The dollar signs mark the line endings, and the output is:
In [12]: parse_sec("in.txt")
First paragraph from section titled :THE LAY OF THE LAND
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.
First paragraph from section titled :WILD ANIMAL TEMPERAMENT & INDIVIDUALITY
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.
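If you would rather collect the paragraphs than print them, one possible variation is a generator that yields (title, first paragraph) pairs. This is only a sketch under the same assumptions as above (an all-uppercase title followed by exactly two blank lines). The name `first_paragraphs` and the sample lines are hypothetical.

```python
from itertools import groupby, islice, takewhile


def first_paragraphs(lines):
    """Yield (title, first_paragraph) pairs (hypothetical helper).

    Assumes every title is an all-uppercase line followed by exactly two blank lines.
    """
    grps = groupby(lines, key=str.isupper)
    for is_title, sec in grps:
        if not is_title:
            continue
        title = next(sec).rstrip()
        # The group after a title holds that section's body; the default
        # avoids StopIteration when a title is the last line of the input.
        _, body = next(grps, (False, iter(())))
        # Skip the two blank lines that follow the title.
        body = islice(body, 2, None)
        # Keep lines up to the next blank line, i.e. the first paragraph.
        paragraph = "".join(takewhile(lambda x: bool(x.strip()), body)).rstrip()
        yield title, paragraph


# Made-up sample input with the same layout as the text in the question.
sample = [
    "TITLE ONE\n", "\n", "\n", "Para one.\n", "\n", "Para two.\n", "\n", "\n",
    "TITLE TWO\n", "\n", "\n", "Para three.",
]
result = list(first_paragraphs(sample))
print(result)
# [('TITLE ONE', 'Para one.'), ('TITLE TWO', 'Para three.')]
```

Because it accepts any iterable of lines, the same function should also work on an open file, e.g. `with open("in.txt") as f: result = list(first_paragraphs(f))`.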