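I am trying to have my program write a stream of data in Parquet format using Apache Arrow's StreamWriter, but the output file has no metadata footer. When I try to read the Parquet file with Python pandas, I get the following error: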
Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
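This Arrow Jira ticket, https://issues.apache.org/jira/browse/ARROW-14066, seems to offer a solution. It states that the ParquetFileWriter inside the StreamWriter must be closed for the footer to be written. (The ticket suggests calling Close() indirectly, by invoking the StreamWriter's destructor.) But I always get a segmentation fault on ParquetFileWriter.Close().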
Here is how I set up the writer:
std::shared_ptr<::arrow::io::FileOutputStream> outfile_;
std::string outputFilePath_ = "/tmp/part.0.parquet";
PARQUET_ASSIGN_OR_THROW(
    outfile_,
    ::arrow::io::FileOutputStream::Open(outputFilePath_)
);
// build column names
parquet::schema::NodeVector columnNames_{};
columnNames_.push_back(
    parquet::schema::PrimitiveNode::Make(
        "Time", parquet::Repetition::REQUIRED, parquet::Type::INT64, parquet::ConvertedType::UINT_64
    )
);
columnNames_.push_back(
    parquet::schema::PrimitiveNode::Make(
        "Value", parquet::Repetition::REQUIRED, parquet::Type::INT64, parquet::ConvertedType::UINT_64
    )
);
auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
    parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, columnNames_)
);
parquet::WriterProperties::Builder builder;
std::unique_ptr<parquet::ParquetFileWriter> fwriter = parquet::ParquetFileWriter::Open(outfile_, schema, builder.build());
parquet::StreamWriter os_ = parquet::StreamWriter {std::move(fwriter)};
// Start writing to os_, would be in a callback function
os_ << std::uint64_t{5} << std::uint64_t{59};
os_.EndRow();
os_.EndRowGroup();
I have tried the following, but both result in a segfault:
os_.~StreamWriter();
OR
fwriter->Close();
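
For illustration, here is a minimal, standard-library-only sketch of the ownership pattern the ticket seems to describe. FakeFileWriter and FakeStreamWriter are hypothetical stand-ins, not the Arrow classes; the sketch assumes the real ParquetFileWriter writes the footer when it is closed or destroyed, and that the StreamWriter owns it after the std::move. What is certain from the C++ standard is that a moved-from std::unique_ptr is null, so dereferencing fwriter after the move is undefined behavior, and that calling a destructor explicitly on a local object makes it run a second time at scope exit.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for parquet::ParquetFileWriter.
// Assumption: closing (or destroying) the writer appends the footer.
struct FakeFileWriter {
    explicit FakeFileWriter(std::string* sink) : sink_(sink) {}
    ~FakeFileWriter() { Close(); }
    void Close() {
        if (!closed_) {
            *sink_ += "[footer]";
            closed_ = true;
        }
    }
    std::string* sink_;
    bool closed_ = false;
};

// Hypothetical stand-in for parquet::StreamWriter: it takes ownership
// of the file writer, so destroying it also destroys (and closes) the writer.
class FakeStreamWriter {
public:
    explicit FakeStreamWriter(std::unique_ptr<FakeFileWriter> w)
        : writer_(std::move(w)) {}
    void Write(const std::string& row) { *writer_->sink_ += row; }
private:
    std::unique_ptr<FakeFileWriter> writer_;
};

std::string WriteWithScopedStream() {
    std::string file;
    auto fwriter = std::make_unique<FakeFileWriter>(&file);
    {
        FakeStreamWriter os{std::move(fwriter)};
        // After the move, fwriter holds nullptr; calling fwriter->Close()
        // here would dereference a null pointer.
        assert(fwriter == nullptr);
        os.Write("row;");
    }  // os is destroyed exactly once here, which closes the owned writer.
    return file;
}
```

Under these assumptions, limiting the stream's lifetime with a scope (or holding it in something resettable such as std::optional or std::unique_ptr) lets the footer be written without an explicit destructor call.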