5) The Java Interface
a) Reading Data from a Hadoop URL
b) Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems.
c) One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block.
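To see the registration mechanism in isolation, here is a minimal JDK-only sketch. It does not touch Hadoop at all: the "demo" scheme, the DemoHandler class, and the fake body it returns are all made up for illustration, standing in for hdfs and FsUrlStreamHandlerFactory. It shows a factory installed from a static block, and that a second registration in the same JVM is rejected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;

// JDK-only sketch: "demo" is a made-up scheme standing in for hdfs.
class UrlSchemeSketch {

    static {
        // Install the factory once, when the class is loaded.
        // Returning null lets the JDK fall back to its built-in handlers.
        URL.setURLStreamHandlerFactory(
            protocol -> "demo".equals(protocol) ? new DemoHandler() : null);
    }

    // Hypothetical handler that serves a small in-memory body for any demo:// URL.
    static final class DemoHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL target) {
            return new URLConnection(target) {
                @Override
                public void connect() {
                    // nothing to connect to in this sketch
                }

                @Override
                public InputStream getInputStream() {
                    String body = "contents of " + getURL().getPath();
                    return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
                }
            };
        }
    }

    // Opens the URL with openStream(), as in the idiom above, and returns the body as text.
    static String read(String spec) {
        try (InputStream in = new URL(spec).openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A second registration in the same JVM is rejected with java.lang.Error.
    static boolean secondRegistrationFails() {
        try {
            URL.setURLStreamHandlerFactory(protocol -> null);
            return false;
        } catch (Error alreadyDefined) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(read("demo://host/user/tom/quangle.txt"));
        System.out.println(secondRegistrationFails());
    }
}
```

Because the factory slot is JVM-wide, this approach stops working as soon as some other component has already claimed it, which is one reason to prefer the FileSystem API.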
d) Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler
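import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

    static {
        // Register the handler for hdfs URLs; allowed only once per JVM
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            // 4096-byte buffer; false = do not close the streams after copying
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}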
run:
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
e) We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out, in this case). The last two arguments to the copyBytes() method are the buffer size used for copying and whether to close the streams when the copy is complete. We close the input stream ourselves, and System.out doesn’t need to be closed.
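To make the role of those two arguments concrete, here is a hand-written copy loop using only the JDK. It is not Hadoop's IOUtils source; the class and method names (CopySketch, copy, closeQuietly) are hypothetical, and the sketch only mirrors the behavior described above: a buffer of the given size, and an optional close of both streams at the end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative only: a hand-written copy loop, not Hadoop's IOUtils.
class CopySketch {

    // Copies in to out through a buffer of bufferSize bytes and returns the byte count.
    // When close is true, both streams are closed once the copy ends.
    static long copy(InputStream in, OutputStream out, int bufferSize, boolean close) {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
            out.flush();
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (close) {
                closeQuietly(in);
                closeQuietly(out);
            }
        }
    }

    // Null-safe close that ignores IOException, suitable for a finally clause.
    static void closeQuietly(Closeable stream) {
        if (stream == null) {
            return;
        }
        try {
            stream.close();
        } catch (IOException ignored) {
            // best-effort cleanup
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello, hadoop\n".getBytes(StandardCharsets.UTF_8);
        // close = false: System.out stays usable after the copy
        long copied = copy(new ByteArrayInputStream(data), System.out, 4096, false);
        System.out.println(copied + " bytes copied");
    }
}
```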
f) Reading Data Using the FileSystem API
g) FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use (HDFS, in this case). There are several static factory methods for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
h) A Configuration object encapsulates a client or server’s configuration, which is set using configuration files read from the classpath, such as etc/hadoop/core-site.xml. The first method returns the default filesystem (as specified in core-site.xml, or the default local filesystem if not specified there). The second uses the given URI’s scheme and authority to determine the filesystem to use, falling back to the default filesystem if no scheme is specified in the given URI. The third retrieves the filesystem as the given user, which is important in the context of security.
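The "scheme and authority" that the second method looks at are ordinary java.net.URI components. The small JDK-only sketch below (the class name UriPartsSketch is made up) prints them for a fully qualified HDFS URI and for a bare path, which is the case where a filesystem lookup would have to fall back to the default.

```java
import java.net.URI;

// JDK-only sketch: shows which URI parts are available to pick a filesystem.
class UriPartsSketch {

    // Describes the scheme, authority, and path of the given URI string.
    static String describe(String spec) {
        URI uri = URI.create(spec);
        return "scheme=" + uri.getScheme()
            + " authority=" + uri.getAuthority()
            + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // prints: scheme=hdfs authority=localhost path=/user/tom/quangle.txt
        System.out.println(describe("hdfs://localhost/user/tom/quangle.txt"));
        // prints: scheme=null authority=null path=/user/tom/quangle.txt
        System.out.println(describe("/user/tom/quangle.txt"));
    }
}
```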
j) The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:
package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
    implements Seekable, PositionedReadable {
    // implementation elided
}
k) The Seekable interface permits seeking to a position in the file and provides a query method for the current offset from the start of the file (getPos()):
public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
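The seek()/getPos() contract can be tried out without a cluster. The sketch below is a hypothetical in-memory stream (SeekableBytes is a made-up class, not part of Hadoop) that offers the same two operations over a byte array, then reads its content, rewinds with seek(0), and reads it again.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical in-memory stream that mimics the seek()/getPos() contract.
class SeekableBytes extends InputStream {
    private final byte[] data;
    private int pos;

    SeekableBytes(byte[] data) {
        this.data = data.clone();
    }

    // Moves the read position; positions outside the data are rejected.
    public void seek(long newPos) throws IOException {
        if (newPos < 0 || newPos > data.length) {
            throw new IOException("Cannot seek to " + newPos);
        }
        pos = (int) newPos;
    }

    // Current offset from the start of the data.
    public long getPos() {
        return pos;
    }

    @Override
    public int read() {
        return pos < data.length ? (data[pos++] & 0xff) : -1;
    }

    // Reads to the end, rewinds with seek(0), and reads everything again.
    static String readTwice(byte[] content) {
        SeekableBytes in = new SeekableBytes(content);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            in.transferTo(out);
            in.seek(0); // go back to the start
            in.transferTo(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints: abcabc
        System.out.println(readTwice("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Rewinding an array index is free here; on a distributed filesystem a seek is far more costly, so this sketch says nothing about performance.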
l) Example 3-3. Displaying files from a Hadoop filesystem on standard output twice, by using seek()
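import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // The URI's scheme and authority select the filesystem
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}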
run:
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
m) Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.
n) Writing Data
o) The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
p) There’s also an overloaded method for passing a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:
package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}
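The callback pattern itself is easy to sketch with plain JDK streams. In the example below, ProgressListener is a hypothetical stand-in for Progressable, and the choice to call it once per copied chunk is an assumption made for the sketch; when Hadoop actually invokes progress() is decided by the filesystem implementation, not by this code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;

// JDK-only sketch of a progress callback; not Hadoop's write path.
class ProgressCopySketch {

    // Hypothetical stand-in for a callback interface such as Progressable.
    interface ProgressListener {
        void progress();
    }

    // Copies in to out in chunks and notifies the listener after each chunk written.
    static long copy(InputStream in, OutputStream out, int chunkSize, ProgressListener listener) {
        byte[] chunk = new byte[chunkSize];
        long total = 0;
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
                total += n;
                listener.progress();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Counts how many callbacks a copy of the given data triggers.
    static int countCallbacks(byte[] data, int chunkSize) {
        AtomicInteger calls = new AtomicInteger();
        copy(new ByteArrayInputStream(data), new ByteArrayOutputStream(), chunkSize,
            calls::incrementAndGet);
        return calls.get();
    }

    public static void main(String[] args) {
        // 10 bytes in chunks of 4 are read as 4 + 4 + 2, so this prints 3
        System.out.println(countCallbacks(new byte[10], 4));
    }
}
```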
q) As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):
public FSDataOutputStream append(Path f) throws IOException
r) Example 3-4. Copying a local file to a Hadoop filesystem
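import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print("."); // one dot per progress callback
            }
        });
        // true = close both streams when the copy completes
        IOUtils.copyBytes(in, out, 4096, true);
    }
}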
s) The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:
package org.apache.hadoop.fs;

public class FSDataOutputStream extends DataOutputStream implements Syncable {
    public long getPos() throws IOException {
        // implementation elided
    }
    // implementation elided
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file or appends to an already written file. In other words, there is no support for writing to anywhere other than the end of the file, so there is no value in being able to seek while writing.
t) FileSystem provides a method to create a directory:
public boolean mkdirs(Path f) throws IOException
Often, you don’t need to explicitly create a directory, because writing a file by calling create() will automatically create any parent directories.
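This differs from what plain Java does on a local disk, which is worth seeing once. The JDK-only sketch below uses java.nio.file, not the Hadoop API, and all directory and file names in it are made up. It shows that Files.createDirectories() builds a whole chain of missing directories, much as mkdirs() is documented to do, while a plain Files.write() fails when the parent directory is missing instead of creating it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// JDK-only contrast on the local filesystem; paths and names are made up.
class ParentDirsSketch {

    // Creates a scratch directory to experiment in (left behind for inspection).
    static Path newTempRoot() {
        try {
            return Files.createTempDirectory("parent-dirs-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates every missing directory on the way to root/a/b/c.
    static boolean createNested(Path root) {
        Path dir = root.resolve("a").resolve("b").resolve("c");
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Files.isDirectory(dir);
    }

    // Returns true if writing a file under a missing parent directory fails.
    static boolean plainWriteNeedsParents(Path root) {
        Path file = root.resolve("x").resolve("y").resolve("data.txt");
        try {
            Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
            return false;
        } catch (NoSuchFileException missingParent) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = newTempRoot();
        System.out.println("nested directories created: " + createNested(root));
        System.out.println("plain write needs parents: " + plainWriteNeedsParents(root));
    }
}
```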
u) Querying the Filesystem
v) An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.
w) The method getFileStatus() on FileSystem provides a way of getting a FileStatus object for a single file or directory.
x) Finding information on a single file or directory is useful, but you also often need to be able to list the contents of a directory. That’s what FileSystem’s listStatus() methods are for:
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException
When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory.
y) Example 3-6. Showing the file statuses for a collection of paths in a Hadoop filesystem
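import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ListStatus {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // Every command-line argument is treated as a path to list
        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }

        FileStatus[] status = fs.listStatus(paths);
        // stat2Paths() turns the FileStatus array into an array of Path objects
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}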
z) Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
Hadoop supports the same set of glob characters as the Unix bash shell.
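For readers who have not used globs much, the common characters can be tried out locally with the JDK's own glob matcher. This is only an approximation: java.nio's PathMatcher is not Hadoop's glob implementation, and the two may differ for less common patterns. The date-style paths below are made-up sample data, and GlobSketch is a hypothetical helper.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// JDK-only sketch of glob matching; not Hadoop's globStatus().
class GlobSketch {

    // Returns the candidates that match the glob, in their original order.
    static List<String> match(String glob, String... candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up year/month/day directories
        String[] paths = {"2007/12/30", "2007/12/31", "2008/01/01", "2008/01/02"};

        // * matches any run of characters within one path component
        System.out.println(match("*/12/*", paths));
        // ? matches exactly one character
        System.out.println(match("200?/01/*", paths));
        // [ab] matches one character from the set
        System.out.println(match("2007/12/3[01]", paths));
        // {a,b} matches either alternative
        System.out.println(match("*/*/{30,01}", paths));
    }
}
```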
Hadoop: The Definitive Guide (4th Edition), key points translated (5): Chapter 3. The HDFS (5)
Original article: http://blog.csdn.net/thinkpadshi/article/details/47701213