AWS Glue:爬网程序无法识别 CSV 格式的时间戳列

2024-02-13

运行 AWS Glue 爬网程序时,它无法识别时间戳列。

我已在 CSV 文件中正确设置 ISO8601 时间戳的格式。首先,我希望 Glue 能够自动将它们分类为时间戳,但事实并非如此。

我还尝试从此链接中使用自定义时间戳分类器https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html

这是我的分类器的样子

这也不能正确分类我的时间戳。

我已经放入 grok 调试器(https://grokdebug.herokuapp.com/ https://grokdebug.herokuapp.com/)我的数据,例如

id,iso_8601_now,iso_8601_yesterday
0,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
1,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056

并且它在两者上都匹配

%{TIMESTAMP_ISO8601:时间戳}

%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?

import csv
from datetime import datetime, timedelta

with open("timestamp_test.csv", 'w', newline='') as f:
    w = csv.writer(f, delimiter=',')

    w.writerow(["id", "iso_8601_now", "iso_8601_yesterday"])

    for i in range(1000):
        w.writerow([i, datetime.utcnow().isoformat(), (datetime.utcnow() - timedelta(days=1)).isoformat()])

我希望 AWSglue 能够自动将 iso_8601 列分类为时间戳。即使添加自定义 grok 分类器,它仍然不会将任一列分类为时间戳。

两列都被分类为字符串。

The classifer is active on the crawler classifier active

爬虫输出timestamp_test表

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "id",
                    "type": "bigint",
                    "comment": ""
                },
                {
                    "name": "iso_8601_now",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "iso_8601_yesterday",
                    "type": "string",
                    "comment": ""
                }
            ]
        },
        "location": "s3://REDACTED/_csv_timestamp_test/",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "-1",
        "SerDeInfo": {
            "name": "",
            "serializationLib": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "parameters": {
                "field.delim": ","
            }
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {
            "skip.header.line.count": "1",
            "sizeKey": "58926",
            "objectCount": "1",
            "UPDATED_BY_CRAWLER": "REDACTED",
            "CrawlerSchemaSerializerVersion": "1.0",
            "recordCount": "1227",
            "averageRecordSize": "48",
            "CrawlerSchemaDeserializerVersion": "1.0",
            "compressionType": "none",
            "classification": "csv",
            "columnsOrdered": "true",
            "areColumnsQuoted": "false",
            "delimiter": ",",
            "typeOfData": "file"
        },
        "SkewedInfo": {},
        "storedAsSubDirectories": "false"
    },
    "parameters": {
        "skip.header.line.count": "1",
        "sizeKey": "58926",
        "objectCount": "1",
        "UPDATED_BY_CRAWLER": "REDACTED",
        "CrawlerSchemaSerializerVersion": "1.0",
        "recordCount": "1227",
        "averageRecordSize": "48",
        "CrawlerSchemaDeserializerVersion": "1.0",
        "compressionType": "none",
        "classification": "csv",
        "columnsOrdered": "true",
        "areColumnsQuoted": "false",
        "delimiter": ",",
        "typeOfData": "file"
    }
}

根据创建表 https://docs.aws.amazon.com/athena/latest/ug/create-table.htmldoc,时间戳格式为yyyy-mm-dd hh:mm:ss[.f...]

如果必须使用ISO8601格式,请添加此Serde参数'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS'

您可以从 Glue(1) 更改该表或从 Athena(2) 重新创建它:

  1. Glue 控制台 > 表格 > 编辑表格 > 将以上内容添加到 Serde 参数中。您还需要单击“编辑架构”并将数据类型从字符串更改为时间戳
  2. 从 Athena 删除表并运行:
CREATE EXTERNAL TABLE `table1`(
  `id` bigint, 
  `iso_8601_now` timestamp, 
  `iso_8601_yesterday` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 
  'field.delim' = ',',
  'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS') 
LOCATION
  's3://REDACTED/_csv_timestamp_test/'
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

AWS Glue:爬网程序无法识别 CSV 格式的时间戳列 的相关文章

随机推荐