【Lecture 5.3】Release the Kraken

2023-05-16

文章目录

- Release the Kraken

Release the Kraken

我们要看的下一个库是 Kraken，它是由巴黎PSL大学开发的。我们将使用Kraken的目的是在给定图像中检测文本行来作为的边界框（detect lines of text as bounding boxes）, tesseract 的最大局限在于其内部缺少布局（layout）引擎。 Tesseract 希望输入 clean 的文本图像，并且，如果我们不 crop up 其他artifacts，它就可能无法正确处理，但是Kraken可以帮助我们分割页面。让我们来看看。

首先我们来看看 Kranken 模块

import kraken
help(kraken)
----
Help on package kraken:

NAME
    kraken - entry point for kraken functionality

PACKAGE CONTENTS
    binarization
    ketos
    kraken
    lib (package)
    linegen
    pageseg # 
    repo
    rpred
    serialization
    transcribe

DATA
    absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...
    division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...
    print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...

FILE
    /opt/conda/lib/python3.7/site-packages/kraken/__init__.py

这里没有太多讨论，但是有许多看起来很有趣的子模块。我花了一些时间在他们的网站上，我认为处理所有页面细分的 the pageseg module 是我们要使用的模块。让我们看看

from kranken import pageseg
help(pageseg)

因此，看起来我们可以调用一些不同的函数，而 the segment function 看起来特别合适。

 segment(im, text_direction='horizontal-lr', scale=None, maxcolseps=2, black_colseps=False, no_hlines=True, pad=0, mask=None)
        # Segments a page into text lines.
        
        Segments a page into text lines and returns the absolute coordinates of
        each line in reading order.
        ...
        Args:
            im (PIL.Image): A bi-level page of mode '1' or 'L'
            text_direction (str): Principal direction of the text
                                  (horizontal-lr/rl/vertical-lr/rl)
            scale (float): Scale of the image
            maxcolseps (int): Maximum number of whitespace column separators
            black_colseps (bool): Whether column separators are assumed to be
                                  vertical black lines or not # 列分隔符是否为垂直黑线
            no_hlines (bool): Switch for horizontal line removal
            pad (int or tuple): Padding to add to line bounding boxes. If int the
                                same padding is used both left and right. If a
                                2-tuple, uses (padding_left, padding_right).
            mask (PIL.Image): A bi-level mask image of the same size as `im` where
                              0-valued regions are ignored for segmentation
                              purposes. Disables column detection.
        Returns:
            {'script_detection': True, 'text_direction': '$dir', 'boxes':
            [[(script, (x1, y1, x2, y2)),...]]}: A dictionary containing the text
            direction and a list of lists of reading order sorted bounding boxes
            under the key 'boxes' with each list containing the script segmentation
            of a single line. Script is a ISO15924 4 character identifier.

我喜欢这个库在文档方面的表现力-我可以立即看到我们正在使用PIL.Image文件，并且作者甚至指示我们需要传递二值图（e.g. ‘l’）或灰度图（e.g. ‘L’）。我们还可以看到，返回值是带有两个键的字典对象，“ text_direction”将返回文本方向的字符串，“ boxes”将显示为元组列表，其中每个元组为 a box iln the original image.。

让我们在文本图像上尝试一下。我在一个名为 two_col.png 的文件中有一段简单的文字，该文件来自校园的报纸

from PIL import Image
im = Image.open('readonly/two_col.png')
# 显示图片
display(im)
# 现在让我们将其转换为黑白，并用kranken包把它分成几行
bounding_boxes = pageseg.segment(im.convert('l'))['boxes']
# 然后我们把 box 打印出来：box的左上和右下的坐标
print(bounding_boxes)

[[100, 50, 449, 74], [131, 88, 414, 120], [59, 196, 522, 229], [18, 239, 522, 272], [19, 283, 522, 316], [19, 327, 525, 360], [19, 371, 523, 404], [18, 414, 524, 447], [17, 458, 522, 491], [19, 502, 141, 535], [58, 546, 521, 579], [18, 589, 522, 622], [19, 633, 521, 665], [563, 21, 1066, 54], [564, 64, 1066, 91], [563, 108, 1066, 135], [564, 152, 1065, 179], [563, 196, 1065, 229], [563, 239, 1066, 272], [562, 283, 909, 316], [600, 327, 1066, 360], [562, 371, 1066, 404], [562, 414, 1066, 447], [563, 458, 1065, 485], [563, 502, 1065, 535], [562, 546, 1066, 579], [562, 589, 1064, 622], [562, 633, 1066, 660], [18, 677, 833, 704], [18, 721, 1066, 754], [18, 764, 1065, 797], [17, 808, 1065, 841], [18, 852, 1067, 885], [18, 895, 1065, 928], [17, 939, 1065, 972], [17, 983, 1067, 1016], [18, 1027, 1065, 1060], [18, 1070, 1065, 1103], [18, 1114, 1065, 1147]]

好吧，非常简单的两列文本，然后是列表，这些列表是该文本行的边界框（a list of lists which are the bounding boxes of lines of that text. ）让我们编写一些例程以尝试更清楚地看到效果。我将整理一下自己的行为并编写真实的文档，这是一个好习惯

def show_boxes(img):
    '''修改传递的图像，以显示由kraken运行的图像上的一系列边界框
    :param img: A PIL.Image object
    :return img: The modified PIL.Image object'''
    # 引入 PIL 模块的 draw 对象 以对获取到的坐标进行方框绘制
    from PIL import ImageDraw
    drawing_object = ImageDraw.Draw(img)
    # 使用 pageseg.segment 方法创建一系列方框坐标
    bounding_boxes = pages.segment(img.convert('l'))['boxes']
    # 遍历这个（包含方框坐标元组）的列表
    for box in bounding_boxes:
        drawing_object.rectangle(box, fill = None, outline = 'red')
        
    return img

# 使用display 展示变更过后的 图像
display(show_boxes(Image.open('readonly/two_col.png')))

很不错！有趣的是，kraken不能完全确定如何处理这两列格式的文本。在某些情况下，kraken 在每一单列中标识了一条线，而在其他情况下，kraken把两列文本识别为一行。这有关系吗？好吧，这确实取决于我们的目标。在这种情况下，我想看看我们是否可以对此有所改进。

因此，我们将在此处不说脚本。尽管本周的讲座是关于库的，但最后一门课程的目标是使您充满信心，即使您使用的库不能完全满足您的要求，也可以将您的知识应用于实际的编程任务。看上图，带有两列示例和红色框，您认为我们如何修改此图像以提高kraken的文本行能力？

感谢您分享您的想法，我期待看到课程中每个人都提出的广泛想法。这是我的解决方案-在浏览 pageseg() 函数上的kraken文档时，我看到有一些参数来提高segmentation的效果。其中之一是 black_colseps 参数。如果设置为True，则kraken将假定各列是用黑线分隔。我们这里并没有用黑色竖线分割两列，但是，我认为我们可以将源图像更改为在列之间使用黑色分隔符。

我们在

def show_boxes(img):
    # Lets bring in our ImageDraw object
    from PIL import ImageDraw
    # And grab a drawing object to annotate that image
    drawing_object=ImageDraw.Draw(img)
    # 增加一个文本分割的参数 black_colseps=True
    bounding_boxes=pageseg.segment(img.convert('1'), black_colseps=True)['boxes']
    # Now lets go through the list of bounding boxes
    for box in bounding_boxes:
        # An just draw a nice rectangle
        drawing_object.rectangle(box, fill = None, outline ='red')
    # And to make it easy, lets return the image object
    return img

下一步是考虑我们要检测空白列分隔符（detect a white column separator）的算法。在进行一些实验时，我决定仅在的间隔至少为25个像素宽（大约是一个字符的宽度，六倍行高）时才认为这是分隔符。宽度很容易，让我们做一个变量

char_width = 25
# 高度较难，因为它取决于文本的高度。 我将编写一个例程来计算一行的平均高度
def calculate_line_height(img):
    '''计算 一个image 的 the average height of a line 
    :param img: A PIL.Image object
    :retuen: The average line height in pixels'''
    # 首先得到 bounding box 坐标的列表
    bounding_boxes = pageseg.segment(img.convert('l'))['boxes']
    # each box is a tuple of (top, left, bottom, right)
    # 因此每行的高度是(top-bottom)
    height_acculator = 0
    for box in bounding_boxes:
        height_accmulator = height_accumlator + box[3]-box[1]
        # 这有点棘手，请记住，我们从PIL的左上角开始计数！ 现在让我们只返回平均高度，就可以通过将其变为整数来将其更改为最接近的完整像素
    return int(height_accumulator/len(bounding_boxes))

# 让我们测试下每行的平均高度
line_height = calculate_line_height(Image.open('readonly/two_col.png'))
print(line_height)

好的，所以一条线的平均高度是31。现在，我们要扫描图像-依次查看每个像素：以确定是否存在空白的区域（whitespace）。 How bit of a block should we look for? 那是一门艺术，而不是一门科学。看我们的示例图像，我要说一个合适的块的大小应该是 one char_width wide, and six line_heights tall. 但是我也只是目测的，让我们创建一个新的框，称为间隙框（gap_box），代表该区域

# Lets create a new box called gap box that represents this area
gap_box=(0,0,char_width,line_height*6)
gap_box
---
(0, 0, 25, 186)

我们将希望有一个函数，输入一幅图像中的像素，可以检查该像素是否在其右侧和下方有whitespace。本质上，我们要测试像素是否是 gap_box 的对象的左上角。 If so, we should insert a line to “break up” this box before sending to kraken

回想一下，我们可以使用 img.getpixel() 函数获得一个坐标的像素值。它以整数元组的形式返回此值，每个颜色通道一个。当适用于二值化图像（黑白），将只返回一个值。如果值为0，则为黑色像素；如果为白色，则值为255
我们将假设图像已被二值化。检查边界框的算法非常简单：我们有一个位置作为起点，然后我们要检查该位置右边的所有像素，直到 gap_box[2]

# Lets call this new function gap_check
def gap_check(img, location):
    '''Checks the img in a given (x,y) location to see if it fits the description
    of a gap_box
    :param img: A PIL.Image file
    :param location: A tuple (x,y) which is a pixel location in that image
    :return: True if that fits the definition of a gap_box, otherwise False
    '''
    for x in range(location[0], location[0]+gap_box[2]):
        # the height is similar, so lets iterate a y variable to gap_box[3]
        for y in range(location[1], location[1]+gap_box[3]):
            # 我们想要check这个区域内的像素是不是白色，如果是白色我们什么都不做，
            # 如果是黑色,return False 说明从这个坐标点爱看i是的 block区域内不是 列间隔区域
            # 我们最好确认这在图像内
            if x < img.width and y < img.height:
                if img.getpixel((x,y)) != 255:
                    return False
    # 如果我们遍历了所有的以 loaction 为起点的 gap_box 则这就是一个gap
    return True

好的现在我们有了一个 function 来 check 列文本图像中的列gap，当我们找到这个 gap 之后我们想要做的是在这个 gap_box 的正中间画一条线，让我们写一个function来完成这个工作

def draw_sep(img,location):
    '''Draws a line in img in the middle of the gap discovered at location. Note that
    this doesn't draw the line in location, but draws it at the middle of a gap_box
    starting at location.
    :param img: A PIL.Image file
    :param location: A tuple(x,y) which is a pixel location in the image
    '''
    # First 画 操作的代码段 
    from PIL import ImageDraw
    drawing_object=ImageDraw.Draw(img)
    # next, lets decide what the middle means in terms of coordinates in the image
    x1=location[0]+int(gap_box[2]/2)
    # and our x2 is just the same thing, since this is a one pixel vertical line
    x2=x1
    # our starting y coordinate is just the y coordinate which was passed in, the top of the box
    y1=location[1]
    # but we want our final y coordinate to be the bottom of the box
    y2=y1+gap_box[3]
    drawing_object.rectangle((x1,y1,x2,y2), fill = 'black', outline ='black')
    # 这个函数不需要 return，因为我们改变了这个图

现在，我们把所有的方法合起来，我们迭代整个文本图像的所有像素来确认它是否存在一个 gap，如果存在则在整个 gap_box 中间画一条线，

def process_image(img):
    '''Takes in an image of text and adds black vertical bars to break up columns
    :param img: A PIL.Image file
    :return: A modified PIL.Image file
    '''
    # we'll start with a familiar iteration process
    for x in range(img.width):
        for y in range(img.height):
            # check if there is a gap at this point
            if (gap_check(img, (x,y))):
                # then update image to one which has a separator drawn on it
                draw_sep(img, (x,y))
    # and for good measure we'll return the image we modified
    return img

# Lets read in our test image and convert it through binarization
i=Image.open("readonly/two_col.png").convert("L")
i=process_image(i)
display(i)

#Note: This will take some time to run! Be patient!

一点也不差！图片底部的效果对我来说有点出乎意料，但这是有道理的。您可以想象有几种方法可以尝试控制它。让我们看看通过kraken layout engine运行时新图像的工作方式

display(show_boxes(i))  # i=process_image(i)

看起来非常准确，并且可以解决我们面临的问题。虽然我们创建的方法确实非常慢，但是如果我们想在较大的文本上使用它，这将是一个问题。但我想向您展示如何混合使用自己的逻辑并使用正在使用的库。仅仅因为Kraken不能完美地工作，并不意味着我们不能在其之上构建更具体的用例。
我想暂停一下本讲座，并请您反思一下我们在此处编写的代码。我们以一些非常简单的库用法开始了本课程，但是现在我们正在更深入地研究并在这些库的帮助下自行解决问题。在进入最后一个库之前，您认为您准备如何充分利用python技能？

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)