Hadoop 2.7.2 Study Notes 05 - Hadoop FileSystem API Definition - The Hadoop filesystem class org.apache.hadoop.fs.FileSystem

2023-11-01

class `org.apache.hadoop.fs.FileSystem`

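The abstract class FileSystem is the most basic, native way to access Hadoop filesystems; its non-abstract subclasses implement each of the filesystems that Hadoop supports.

Every operation based on this interface must support relative paths. A relative path is resolved against the working directory, which is set with setWorkingDirectory().

Each client has its own notion of a current working directory. Changing that directory does not affect the filesystem; it only affects the client.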
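As a minimal sketch of that client-side behavior, the Python model below resolves relative paths against a per-client working directory. The `Client` class and its method names are hypothetical and only mirror the idea behind setWorkingDirectory(); they are not the Hadoop API.

```python
import posixpath


class Client:
    """Hypothetical client that keeps its own working directory."""

    def __init__(self, working_dir="/"):
        self.working_dir = working_dir

    def set_working_directory(self, path):
        # Only this client's state changes; the filesystem is untouched.
        self.working_dir = self.resolve(path)

    def resolve(self, path):
        # Absolute paths are kept as-is; relative ones are joined
        # to the client's working directory.
        return posixpath.normpath(posixpath.join(self.working_dir, path))


alice = Client("/user/alice")
bob = Client("/user/bob")
alice.set_working_directory("data")
print(alice.resolve("part-0000"))  # /user/alice/data/part-0000
print(bob.resolve("part-0000"))    # /user/bob/part-0000
```

Two clients resolve the same relative path to different absolute paths, which is why the working directory is client state and not filesystem state.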

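All requirements of a valid filesystem are expressed as preconditions and postconditions. The result of any operation on a valid filesystem must itself be a valid filesystem.

The following operations report the state of the filesystem.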

`boolean exists(Path p)`

def exists(FS, p) = p in paths(FS)

`boolean isDirectory(Path p)`

def isDirectory(FS, p)= p in directories(FS)

`boolean isFile(Path p)`

def isFile(FS, p) = p in files(FS)

`boolean isSymlink(Path p)`

def isSymlink(FS, p) = p in symlinks(FS)
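The four predicates above can be sketched in Python by modelling the filesystem as three collections. The `FS` dataclass and the snake_case function names are assumptions made for illustration; only the set-membership definitions come from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class FS:
    """Toy filesystem state (hypothetical model, not a Hadoop class)."""
    directories: set = field(default_factory=lambda: {"/"})
    files: dict = field(default_factory=dict)      # path -> file data (bytes)
    symlinks: dict = field(default_factory=dict)   # path -> link target


def paths(fs):
    # Every path known to the filesystem, whatever its type.
    return fs.directories | set(fs.files) | set(fs.symlinks)


def exists(fs, p):
    return p in paths(fs)


def is_directory(fs, p):
    return p in fs.directories


def is_file(fs, p):
    return p in fs.files


def is_symlink(fs, p):
    return p in fs.symlinks


fs = FS(directories={"/", "/data"}, files={"/data/a.txt": b"hello"})
print(exists(fs, "/data/a.txt"), is_file(fs, "/data/a.txt"),
      is_directory(fs, "/data/a.txt"))  # True True False
```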

`boolean inEncryptionZone(Path p)`

Returns true if the data at path p is encrypted. The metadata is not encrypted.

Preconditions:

if not exists(FS, p) : raise FileNotFoundException

Postconditions:

forall d in directories(FS): inEncryptionZone(FS, d) implies
  forall c in children(FS, d) where (isFile(FS, c) or isDir(FS, c)) :
    inEncryptionZone(FS, c)

The data of every file in an encryption zone is encrypted, but the type of encryption is not specified.

  forall f in files(FS) where inEncryptionZone(FS, f):
    isEncrypted(data(f))

`FileStatus getFileStatus(Path p)`

Returns the status of the path.

Preconditions:

if not exists(FS, p) : raise FileNotFoundException

Postconditions:

result = stat: FileStatus where:
    if isFile(FS, p) :
        stat.length = len(FS.Files[p])
        stat.isdir = False
    elif isDir(FS, p) :
        stat.length = 0
        stat.isdir = True
    elif isSymlink(FS, p) :
        stat.length = 0
        stat.isdir = False
        stat.symlink = FS.Symlinks[p]
    if inEncryptionZone(FS, p) :
        stat.isEncrypted = True
    else :
        stat.isEncrypted = False
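The postcondition can be read as a small function. The sketch below is a Python model under stated assumptions: the `FileStatus` dataclass is a simplified stand-in for the Hadoop class, the `encrypted` set is a hypothetical substitute for inEncryptionZone, and Python's `FileNotFoundError` stands in for FileNotFoundException.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileStatus:
    """Simplified stand-in for org.apache.hadoop.fs.FileStatus."""
    length: int
    isdir: bool
    symlink: Optional[str] = None
    is_encrypted: bool = False


def get_file_status(directories, files, symlinks, p, encrypted=frozenset()):
    # Precondition: the path must exist.
    if p in files:
        stat = FileStatus(length=len(files[p]), isdir=False)
    elif p in directories:
        stat = FileStatus(length=0, isdir=True)
    elif p in symlinks:
        stat = FileStatus(length=0, isdir=False, symlink=symlinks[p])
    else:
        raise FileNotFoundError(p)
    # 'encrypted' is an assumed set of paths inside an encryption zone.
    stat.is_encrypted = p in encrypted
    return stat


status = get_file_status({"/", "/d"}, {"/d/f": b"abc"}, {}, "/d/f")
print(status.length, status.isdir)  # 3 False
```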

`Path getHomeDirectory()`

Returns the home directory of the current user.

Postconditions:

result = p where valid-path(FS, p)

`FileSystem.listStatus(Path, PathFilter )`

The PathFilter f is a predicate that returns true if and only if a path meets the filter's conditions.

Preconditions:

if not exists(FS, p) : raise FileNotFoundException

Postconditions:

if isFile(FS, p) and f(p) :
    result = [getFileStatus(p)]

elif isFile(FS, p) and not f(p) :
    result = []

elif isDir(FS, p):
    result = [getFileStatus(c) for c in children(FS, p) where f(c) == True]
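A minimal Python sketch of these three cases follows. It is a toy model, not the Hadoop implementation: for brevity it returns the matching paths instead of FileStatus objects, and the `accept` callable plays the role of the PathFilter.

```python
import posixpath


def children(all_paths, p):
    # Direct children of p: paths whose parent is exactly p.
    return sorted(c for c in all_paths
                  if c != p and posixpath.dirname(c) == p)


def list_status(directories, files, p, accept=lambda path: True):
    all_paths = set(directories) | set(files)
    if p not in all_paths:
        raise FileNotFoundError(p)
    if p in files:
        # A file is listed as itself, subject to the filter.
        return [p] if accept(p) else []
    return [c for c in children(all_paths, p) if accept(c)]


dirs = {"/", "/logs"}
files = {"/logs/a.txt": b"", "/logs/b.log": b""}
print(list_status(dirs, files, "/logs", lambda path: path.endswith(".log")))
# ['/logs/b.log']
```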

Atomicity and Consistency

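When listStatus() returns, there is no guarantee that its result reflects the current state of the filesystem. The state of some directories may change while the listStatus operation is in progress.

After path p has been created, and before any other change is made to the filesystem, listStatus(p) must find the file and return its status.

After path p has been deleted, listStatus(p) must raise FileNotFoundException.

After path p has been created, and before any other change is made to the filesystem, the result of listStatus(parent(p)) must include the value of getFileStatus(p).

After path p has been deleted, and before any other change is made to the filesystem, the result of listStatus(parent(p)) must not include the value of getFileStatus(p).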

`List[BlockLocation] getFileBlockLocations(FileStatus f, int s, int l)`

Preconditions:

if s < 0 or l < 0 : raise {HadoopIllegalArgumentException, InvalidArgumentException}

For an invalid offset or length, HDFS throws HadoopIllegalArgumentException, which extends IllegalArgumentException.

Postconditions:

If the filesystem is location-aware, it returns the list of block locations. Here p is the path described by the FileStatus f.

if f == null :
    result = null
elif f.getLen() <= s :
    result = []
else :
    result = [ locations(FS, b) for all b in blocks(FS, p, s, s+l)]

def locations(FS, b) = a list of all locations of a block in the filesystem

def blocks(FS, p, s, s + l) = a list of the blocks containing data(FS, p)[s:s+l]
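To make the `blocks` definition concrete, the sketch below computes which block indexes hold the byte range `[s, s+l)`. It assumes every block of the file has the same fixed size, and treating an empty range as covering no blocks is a choice of this sketch, not something the specification states. The function name and parameters are hypothetical.

```python
def blocks(file_length, block_size, s, l):
    """Indexes of the blocks that hold data[s:s+l] of one file."""
    if s < 0 or l < 0:
        # Stand-in for HadoopIllegalArgumentException.
        raise ValueError("offset and length must be non-negative")
    if file_length <= s or l == 0:
        return []
    end = min(s + l, file_length)          # clamp the range to the file
    first = s // block_size                # block holding the first byte
    last = (end - 1) // block_size         # block holding the last byte
    return list(range(first, last + 1))


print(blocks(300, 128, 100, 50))  # [0, 1]
```

With a 300-byte file and 128-byte blocks, bytes 100 to 149 straddle the boundary at offset 128, so both block 0 and block 1 are reported.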

Because length(FS, f) is 0 for a directory, getFileBlockLocations() on a directory returns [].

If the filesystem is not location-aware, it should return

  [
    BlockLocation(["localhost:50010"],
              ["localhost"],
              ["/default/localhost"],
              0, f.getLen())
  ]

`getFileBlockLocations(Path P, int S, int L)`

Preconditions:

if P == null : raise NullPointerException
if not exists(FS, P) : raise FileNotFoundException

Postconditions:

result = getFileBlockLocations(getFileStatus(P), S, L)

`getDefaultBlockSize()`

Postconditions:

result = integer >= 0

`getDefaultBlockSize(Path P)`

Postconditions:

result = integer >= 0

`getBlockSize(Path P)`

Preconditions:

if not exists(FS, P) : raise FileNotFoundException

Postconditions:

result == getFileStatus(P).getBlockSize()

The return value must equal the block size reported by getFileStatus(P).

`boolean mkdirs(Path p, FsPermission permission )`

Preconditions:

if exists(FS, p) and not isDir(FS, p) :
    raise [ParentNotDirectoryException, FileAlreadyExistsException, IOException]

Postconditions:

FS' where FS'.Directories' = FS.Directories + [p] + ancestors(FS, p)
result = True
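The postcondition says that p and all of its ancestors are directories afterwards. A minimal Python sketch, assuming absolute, normalized paths and using `FileExistsError` as a stand-in for FileAlreadyExistsException:

```python
import posixpath


def ancestors(p):
    """All proper ancestors of an absolute path, up to the root."""
    result = []
    while p != "/":
        p = posixpath.dirname(p)
        result.append(p)
    return result


def mkdirs(directories, files, p):
    # Precondition: p must not exist as a non-directory.
    if p in files:
        raise FileExistsError(p)
    # FS'.Directories = FS.Directories + [p] + ancestors(FS, p)
    return directories | {p} | set(ancestors(p)), True


new_dirs, ok = mkdirs({"/"}, {}, "/a/b/c")
print(sorted(new_dirs), ok)  # ['/', '/a', '/a/b', '/a/b/c'] True
```

The function returns a new directory set instead of mutating its input, which mirrors the FS / FS' notation used in the specification.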

`FSDataOutputStream create(Path, ...)`

FSDataOutputStream create(Path p,
      FsPermission permission,
      boolean overwrite,
      int bufferSize,
      short replication,
      long blockSize,
      Progressable progress) throws IOException;

Preconditions:

Without overwrite, the file must not already exist:

if not overwrite and isFile(FS, p) : raise FileAlreadyExistsException

Writing to a directory must fail:

if isDir(FS, p) : raise {FileAlreadyExistsException, FileNotFoundException, IOException}

Postconditions:

FS' where :
   FS'.Files'[p] == []
   ancestors(p) is-subset-of FS'.Directories'

result = FSDataOutputStream
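The effect of create() on the filesystem state can be sketched as below. This is a toy Python model: it returns the new state instead of an output stream, and the Python exceptions stand in for the Java ones named in the preconditions.

```python
import posixpath


def create(directories, files, p, overwrite=False):
    """Return (directories', files') after creating an empty file at p."""
    if p in directories:
        raise IsADirectoryError(p)       # cannot write over a directory
    if p in files and not overwrite:
        raise FileExistsError(p)         # no-overwrite create of existing file
    new_dirs = set(directories)
    parent = posixpath.dirname(p)
    while parent not in new_dirs:        # ancestors(p) become directories
        new_dirs.add(parent)
        parent = posixpath.dirname(parent)
    new_files = dict(files)
    new_files[p] = b""                   # FS'.Files[p] == []
    return new_dirs, new_files


dirs, files = create({"/"}, {}, "/out/part-0")
print(sorted(dirs), files)  # ['/', '/out'] {'/out/part-0': b''}
```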

`FSDataOutputStream append(Path p, int bufferSize, Progressable progress)`

May throw UnsupportedOperationException.

Preconditions:

if not exists(FS, p) : raise FileNotFoundException

if not isFile(FS, p) : raise [FileNotFoundException, IOException]

Postconditions:

FS' = FS
result = FSDataOutputStream

`FSDataInputStream open(Path f, int bufferSize)`

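May throw UnsupportedOperationException.

Preconditions:

if not isFile(FS, p) : raise [FileNotFoundException, IOException]

The file must exist at the time it is opened, but there is no guarantee that the file or its data remains available while it is read afterwards.

Postconditions:

result = FSDataInputStream(0, FS.Files[p])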

`FileSystem.delete(Path P, boolean recursive)`

Preconditions:

A directory that has children can only be deleted recursively:

if isDir(FS, p) and not recursive and (children(FS, p) != {}) : raise IOException

Postconditions:

If the path does not exist, the filesystem is unchanged:

if not exists(FS, p):
    FS' = FS
    result = False

A single file is removed:

if isFile(FS, p) :
    FS' = (FS.Directories, FS.Files - [p], FS.Symlinks)
    result = True

Deleting an empty root directory does not change the filesystem:

if isDir(FS, p) and isRoot(p) and children(FS, p) == {} :
    FS' = FS
    result = (undetermined)

An empty directory that is not the root is removed:

if isDir(FS, p) and not isRoot(p) and children(FS, p) == {} :
    FS' = (FS.Directories - [p], FS.Files, FS.Symlinks)
    result = True

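Recursive delete of the root directory

The POSIX model permits deleting everything:

if isDir(FS, p) and isRoot(p) and recursive :
    FS' = ({["/"]}, {}, {}, {})
    result = True

HDFS never permits deletion of the root directory; if an empty filesystem is needed, take the filesystem offline and reformat it:

if isDir(FS, p) and isRoot(p) and recursive :
    FS' = FS
    result = False

Recursive delete of a directory that is not the root removes the path and all of its descendants:

if isDir(FS, p) and not isRoot(p) and recursive :
    FS' where:
        not isDir(FS', p)
        and forall d in descendants(FS, p):
            not isDir(FS', d)
            not isFile(FS', d)
            not isSymlink(FS', d)
    result = True

Deleting a file, deleting an empty directory, and recursively deleting a directory must all be atomic operations.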
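The delete cases can be combined into one small function. The sketch below is a toy Python model under stated assumptions: symlinks are left out, the root directory follows the HDFS behavior (never deleted, result False), and `OSError` stands in for IOException.

```python
def delete(directories, files, p, recursive=False):
    """Return (directories', files', result) for a toy delete."""
    if p not in directories and p not in files:
        return directories, files, False            # nothing to delete
    if p in files:
        remaining = {f: data for f, data in files.items() if f != p}
        return directories, remaining, True
    prefix = p.rstrip("/") + "/"

    def below(c):
        # True for strict descendants of the directory p.
        return c != p and c.startswith(prefix)

    if any(below(c) for c in set(directories) | set(files)) and not recursive:
        raise OSError("directory is not empty: " + p)
    if p == "/":
        return directories, files, False            # HDFS: root is never deleted
    new_dirs = {d for d in directories if d != p and not below(d)}
    new_files = {f: data for f, data in files.items() if not below(f)}
    return new_dirs, new_files, True


dirs = {"/", "/a", "/a/b"}
files = {"/a/b/f": b"x", "/g": b"y"}
print(delete(dirs, files, "/a", recursive=True))
# ({'/'}, {'/g': b'y'}, True)
```

Because the function builds the new state in one step and returns it, the caller never observes a half-deleted tree, which is the property the atomicity requirement asks of a real implementation.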

`FileSystem.rename(Path src, Path d)`

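Rename first has to compute the destination path. If the destination exists and is a directory, the final destination is the destination plus the filename of the source path.

let dest = if (isDir(FS, d) and d != src) :
        d + [filename(src)]
    else :
        d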

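Preconditions:

The source path must exist:

exists(FS, src) else raise FileNotFoundException

The destination must not be a descendant of the source:

if isDescendant(FS, src, dest) : raise IOException

The destination must be the root directory or have a parent that exists:

isRoot(FS, dest) or exists(FS, parent(dest)) else raise IOException

The parent of the destination must not be a file:

if isFile(FS, parent(dest)) : raise IOException

The destination must not be an existing file:

if isFile(FS, dest) : raise [FileAlreadyExistsException, IOException]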

Postconditions:

Renaming a directory onto itself is a no-op and the return value is not specified; POSIX returns False, HDFS returns True.

if isDir(FS, src) and src == dest :
    FS' = FS
    result = (undefined)

Renaming a file onto itself is a no-op; the result is True.

if isFile(FS, src) and src == dest :
    FS' = FS
    result = True

Renaming a file into a directory moves the file under the destination directory and keeps the filename of the source path.

if isFile(FS, src) and src != dest:
    FS' where:
        not exists(FS', src)
        and exists(FS', dest)
        and data(FS', dest) == data(FS, src)
    result = True

If the source is a directory and the destination is also a directory, all descendants of the source are moved under the destination and the source directory no longer exists.

if isDir(FS, src) and isDir(FS, dest) and src != dest :
    FS' where:
        not exists(FS', src)
        and dest in FS'.Directories
        and forall c in descendants(FS, src) :
            not exists(FS', c)
        and forall c in descendants(FS, src) where isDir(FS, c):
            isDir(FS', dest + childElements(src, c))
        and forall c in descendants(FS, src) where not isDir(FS, c):
            data(FS', dest + childElements(src, c)) == data(FS, c)
    result = True

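When the parent of the destination path does not exist, the filesystems supported by Hadoop behave differently.

HDFS fails the rename:

FS' = FS; result = False

The local filesystem and S3N succeed and create the missing parent directories:

exists(FS', parent(dest))

Other filesystems (including Swift) explicitly reject the operation and raise FileNotFoundException.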
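The rename rules can be sketched as one Python function. This is a toy model under several assumptions: it ignores symlinks, it follows the HDFS choices where the specification leaves room (True for a rename onto itself, False when the parent of the destination is missing), and Python exceptions stand in for the Java ones.

```python
import posixpath


def rename(directories, files, src, d):
    """Return (directories', files', result) for a toy rename."""
    if src not in directories and src not in files:
        raise FileNotFoundError(src)
    # An existing destination directory means "move under it".
    if d in directories and d != src:
        dest = posixpath.join(d, posixpath.basename(src))
    else:
        dest = d
    if dest == src:
        return directories, files, True      # no-op; True is the HDFS answer
    if dest.startswith(src.rstrip("/") + "/"):
        raise OSError("destination is under the source")
    if dest in files:
        raise FileExistsError(dest)
    parent = posixpath.dirname(dest)
    if parent in files:
        raise OSError("parent of destination is a file")
    if parent not in directories:
        return directories, files, False     # HDFS behavior: fail, no change

    def moved(path):
        # Rewrite src and everything below it to live at dest.
        if path == src or path.startswith(src + "/"):
            return dest + path[len(src):]
        return path

    new_dirs = {moved(x) for x in directories}
    new_files = {moved(f): data for f, data in files.items()}
    return new_dirs, new_files, True


dirs = {"/", "/in", "/out"}
files = {"/in/f": b"data"}
print(rename(dirs, files, "/in/f", "/out")[1])  # {'/out/f': b'data'}
```

Renaming the directory `/in` to the existing directory `/out` in the same model yields `/out/in` with the file at `/out/in/f`, matching the destination rule at the start of this section.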

`concat(Path p, Path sources[])`

Joins multiple blocks together into a single file. In practice only HDFS implements this operation.

May throw UnsupportedOperationException.

Preconditions:

if not exists(FS, p) : raise FileNotFoundException

if sources==[] : raise IllegalArgumentException

All sources must be in the same directory:

for s in sources: if parent(s) != parent(p) : raise IllegalArgumentException

All block sizes must match the block size of the target:

for s in sources: getBlockSize(FS, s) == getBlockSize(FS, p)

No path may appear more than once:

not (exists p1, p2 in (sources + [p]) where p1 == p2)

HDFS additionally requires every block except the last one to be complete:

for s in (sources[0:length(sources)-1] + [p]):
  (length(FS, s) mod getBlockSize(FS, p)) == 0

Postconditions:

FS' where:
 (data(FS', p) = data(FS, p) + data(FS, sources[0]) + ... + data(FS, sources[length(sources)-1]))
 and for s in sources: not exists(FS', s)
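A minimal Python sketch of the concat postcondition follows. It models files as byte strings and skips the block-size preconditions, since this toy model has no notion of blocks; `ValueError` stands in for IllegalArgumentException.

```python
import posixpath


def concat(files, p, sources):
    """Return files' after appending every source to p, in order."""
    if p not in files:
        raise FileNotFoundError(p)
    if not sources:
        raise ValueError("no sources given")
    every = list(sources) + [p]
    if len(set(every)) != len(every):
        raise ValueError("duplicate paths")
    for s in sources:
        if posixpath.dirname(s) != posixpath.dirname(p):
            raise ValueError("source is not in the target's directory: " + s)
    # The sources disappear and their data is appended to the target.
    result = {f: data for f, data in files.items() if f not in sources}
    result[p] = files[p] + b"".join(files[s] for s in sources)
    return result


files = {"/d/t": b"ab", "/d/s1": b"cd", "/d/s2": b"e"}
print(concat(files, "/d/t", ["/d/s1", "/d/s2"]))  # {'/d/t': b'abcde'}
```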

`boolean truncate(Path p, long newLength)`

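Truncates file p to the given length. May throw UnsupportedOperationException.

Preconditions:

if not exists(FS, p) : raise FileNotFoundException

if isDir(FS, p) : raise [FileNotFoundException, IOException]

if newLength < 0 or newLength > len(FS.Files[p]) : raise HadoopIllegalArgumentException

HDFS requires the source file to be closed; truncate cannot run on a file that is open for writing.

Postconditions:

FS' where:
    len(FS'.Files[p]) = newLength

Returns true if the truncation has finished and the file can be written to immediately; otherwise returns false.

When HDFS returns false, a background truncation process has been started and the client must wait for it to complete.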
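The truncate contract can be sketched in Python as below. This is a toy model: an in-memory truncation finishes immediately, so it always returns True, whereas HDFS may return false while a background process finishes. `ValueError` stands in for HadoopIllegalArgumentException.

```python
def truncate(files, p, new_length):
    """Return (files', result) after cutting p down to new_length bytes."""
    if p not in files:
        raise FileNotFoundError(p)
    if new_length < 0 or new_length > len(files[p]):
        raise ValueError("new length is out of range")
    result = dict(files)
    result[p] = files[p][:new_length]    # len(FS'.Files[p]) == new_length
    return result, True


new_files, done = truncate({"/f": b"abcdef"}, "/f", 4)
print(new_files, done)  # {'/f': b'abcd'} True
```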
