Create a minimal example and reverse engineer the format
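Create a simple repository and, before any packfiles are created (git gc, git config gc.auto, git-prune-packed, ...), decompress the commit objects with one of the methods from: How to DEFLATE with a command line tool to extract a git object? https://stackoverflow.com/questions/3178566/deflate-command-line-tool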
export GIT_AUTHOR_DATE="1970-01-01T00:00:00+0000"
export GIT_AUTHOR_EMAIL="author@example.com"
export GIT_AUTHOR_NAME="Author Name"
export GIT_COMMITTER_DATE="2000-01-01T00:00:00+0000"
export GIT_COMMITTER_EMAIL="committer@example.com"
export GIT_COMMITTER_NAME="Committer Name"
git init
# First commit.
echo
touch a
git add a
git commit -m 'First message'
# (for python2, remove the two `.buffer`s in the next line)
python -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" \
<.git/objects/45/3a2378ba0eb310df8741aa26d1c861ac4c512f | hd
# Second commit.
echo
touch b
git add b
git commit -m 'Second message'
# (for python2, remove the two `.buffer`s in the next line)
python -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" \
<.git/objects/74/8e6f7e22cac87acec8c26ee690b4ff0388cbf5 | hd
The output is:
Initialized empty Git repository in /home/ciro/test/git/.git/
[master (root-commit) 453a237] First message
Author: Author Name <author@example.com>
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 a
00000000 63 6f 6d 6d 69 74 20 31 37 34 00 74 72 65 65 20 |commit 174.tree |
00000010 34 39 36 64 36 34 32 38 62 39 63 66 39 32 39 38 |496d6428b9cf9298|
00000020 31 64 63 39 34 39 35 32 31 31 65 36 65 31 31 32 |1dc9495211e6e112|
00000030 30 66 62 36 66 32 62 61 0a 61 75 74 68 6f 72 20 |0fb6f2ba.author |
00000040 41 75 74 68 6f 72 20 4e 61 6d 65 20 3c 61 75 74 |Author Name <aut|
00000050 68 6f 72 40 65 78 61 6d 70 6c 65 2e 63 6f 6d 3e |hor@example.com>|
00000060 20 30 20 2b 30 30 30 30 0a 63 6f 6d 6d 69 74 74 | 0 +0000.committ|
00000070 65 72 20 43 6f 6d 6d 69 74 74 65 72 20 4e 61 6d |er Committer Nam|
00000080 65 20 3c 63 6f 6d 6d 69 74 74 65 72 40 65 78 61 |e <committer@exa|
00000090 6d 70 6c 65 2e 63 6f 6d 3e 20 39 34 36 36 38 34 |mple.com> 946684|
000000a0 38 30 30 20 2b 30 30 30 30 0a 0a 46 69 72 73 74 |800 +0000..First|
000000b0 20 6d 65 73 73 61 67 65 0a | message.|
000000b9
[master 748e6f7] Second message
Author: Author Name <author@example.com>
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 b
00000000 63 6f 6d 6d 69 74 20 32 32 33 00 74 72 65 65 20 |commit 223.tree |
00000010 32 39 36 65 35 36 30 32 33 63 64 63 30 33 34 64 |296e56023cdc034d|
00000020 32 37 33 35 66 65 65 38 63 30 64 38 35 61 36 35 |2735fee8c0d85a65|
00000030 39 64 31 62 30 37 66 34 0a 70 61 72 65 6e 74 20 |9d1b07f4.parent |
00000040 34 35 33 61 32 33 37 38 62 61 30 65 62 33 31 30 |453a2378ba0eb310|
00000050 64 66 38 37 34 31 61 61 32 36 64 31 63 38 36 31 |df8741aa26d1c861|
00000060 61 63 34 63 35 31 32 66 0a 61 75 74 68 6f 72 20 |ac4c512f.author |
00000070 41 75 74 68 6f 72 20 4e 61 6d 65 20 3c 61 75 74 |Author Name <aut|
00000080 68 6f 72 40 65 78 61 6d 70 6c 65 2e 63 6f 6d 3e |hor@example.com>|
00000090 20 30 20 2b 30 30 30 30 0a 63 6f 6d 6d 69 74 74 | 0 +0000.committ|
000000a0 65 72 20 43 6f 6d 6d 69 74 74 65 72 20 4e 61 6d |er Committer Nam|
000000b0 65 20 3c 63 6f 6d 6d 69 74 74 65 72 40 65 78 61 |e <committer@exa|
000000c0 6d 70 6c 65 2e 63 6f 6d 3e 20 39 34 36 36 38 34 |mple.com> 946684|
000000d0 38 30 30 20 2b 30 30 30 30 0a 0a 53 65 63 6f 6e |800 +0000..Secon|
000000e0 64 20 6d 65 73 73 61 67 65 0a |d message.|
000000ea
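The decompression one-liner used above can also be written as a small helper. The following is a minimal Python 3 sketch, not part of the original script: read_loose_object is a hypothetical name, and it assumes the object is still loose, i.e. stored under objects/<first two hex digits>/<remaining 38 hex digits> inside the .git directory, as in the paths used above.

```python
import os
import tempfile
import zlib


def read_loose_object(git_dir, object_id):
    """Return the decompressed bytes of a loose object.

    Hypothetical helper: assumes the object has not been packed yet,
    so it lives at <git_dir>/objects/<2 hex digits>/<38 hex digits>.
    """
    path = os.path.join(git_dir, "objects", object_id[:2], object_id[2:])
    with open(path, "rb") as f:
        return zlib.decompress(f.read())


def demo():
    """Self-contained demo on a fake object store in a temporary directory."""
    fake_id = "12" + "3" * 38  # made-up ID, only used to build a path
    payload = b"example object bytes"  # made-up content
    with tempfile.TemporaryDirectory() as git_dir:
        object_dir = os.path.join(git_dir, "objects", fake_id[:2])
        os.makedirs(object_dir)
        with open(os.path.join(object_dir, fake_id[2:]), "wb") as f:
            f.write(zlib.compress(payload))
        print(read_loose_object(git_dir, fake_id))


if __name__ == "__main__":
    demo()
```

Inside the example repository, read_loose_object(".git", "453a2378ba0eb310df8741aa26d1c861ac4c512f") should return the same bytes that hd dumps for the first commit.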
So we deduce that the format is as follows:
- Top level:
        commit {size}\0{content}
    where {size} is the number of bytes in {content}.
    All object types follow the same pattern.
- {content}:
        tree {tree_sha}
        {parents}
        author {author_name} <{author_email}> {author_date_seconds} {author_date_timezone}
        committer {committer_name} <{committer_email}> {committer_date_seconds} {committer_date_timezone}

        {commit message}
    where:
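    - {tree_sha}: SHA of the tree object this commit points to.
      This represents the top-level directory of the Git repository.
      That SHA comes from the format of the tree object: What is the internal format of a Git tree object? https://stackoverflow.com/questions/14790681/what-is-the-internal-format-of-a-git-tree-object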
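    - {parents}: optional list of parent commit objects, of the form:
          parent {parent1_sha}
          parent {parent2_sha}
          ...
      The list can be empty if there are no parents, e.g. for the first commit in a repository.
      Two parents occur in regular merge commits.
      More than two parents are possible with an octopus merge (git merge -s octopus), but this is not a common workflow. Here is an example: https://github.com/cirosantilli/test-octopus-100k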
    - {author_name}: e.g.: Ciro Santilli. Cannot contain <, \n
    - {author_email}: e.g.: author@example.com. Cannot contain >, \n
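    - {author_date_seconds}: seconds since the Unix epoch (1970-01-01T00:00:00 UTC), e.g. 946684800 is the first second of the year 2000
    - {author_date_timezone}: e.g.: +0000 is UTC
    - committer fields: analogous to the author fields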
    - {commit message}: arbitrary.
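The deduced layout can be checked by rebuilding the first commit from its fields. The following is a minimal Python 3 sketch under one assumption: the repository uses the default SHA-1 object format, where the object ID is the SHA-1 of the uncompressed header plus content. All function names are made up for illustration.

```python
import hashlib


def person_line(role, name, email, seconds, timezone):
    """Format an author or committer line, without the trailing newline."""
    return f"{role} {name} <{email}> {seconds} {timezone}"


def build_commit_content(tree_sha, parents, author, committer, message):
    """Build {content}: tree, optional parents, author, committer, message."""
    lines = [f"tree {tree_sha}"]
    lines += [f"parent {parent_sha}" for parent_sha in parents]
    lines.append(person_line("author", *author))
    lines.append(person_line("committer", *committer))
    # One empty line separates the header fields from the commit message.
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


def build_object(object_type, content):
    """Prepend the '{type} {size}\\0' header shared by all object types."""
    header = f"{object_type} {len(content)}".encode("ascii") + b"\0"
    return header + content


def object_id(raw_object):
    """Assumption: SHA-1 repository, so the ID is the SHA-1 of the raw object."""
    return hashlib.sha1(raw_object).hexdigest()


if __name__ == "__main__":
    # Field values copied from the first commit of the example above.
    content = build_commit_content(
        "496d6428b9cf92981dc9495211e6e1120fb6f2ba",
        [],  # root commit: no parents
        ("Author Name", "author@example.com", 0, "+0000"),
        ("Committer Name", "committer@example.com", 946684800, "+0000"),
        "First message\n",
    )
    raw = build_object("commit", content)
    print(len(content))  # size that goes into the header
    print(object_id(raw))  # compare with the ID reported by git commit
```

If the layout above is right, the size should be 174 as in the dump, and the printed ID should match 453a2378ba0eb310df8741aa26d1c861ac4c512f, the ID of the first commit in the example.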
I made a minimal Python script that generates a git repository with a few commits at: https://github.com/cirosantilli/test-git-web-interface/blob/864d809c36b8f3b232d5b0668917060e8bcba3e8/other-test-repos/util.py#L83
I have used it for fun things like:
- Who is the user with the longest streak on GitHub? https://stackoverflow.com/questions/20099235/who-is-the-user-with-the-longest-streak-on-github/27742165#27742165
- https://www.quora.com/Which-GitHub-repo-has-the-most-commits/answer/Ciro-Santilli
- https://github.com/isaacs/github/issues/1344
Here is a similar analysis of the tag object format: What is the format of a git tag object and how to calculate its SHA? https://stackoverflow.com/questions/10986615/what-is-the-format-of-a-git-tag-object-and-how-to-calculate-its-sha/52193441#52193441