.. warning::

   This section documents the deprecated filesystem layer. You should
   use the :ref:`new filesystem layer <filesystem>` instead.
PyArrow comes with bindings to a C++-based interface to the Hadoop File
System. You connect using the :func:`pyarrow.hdfs.connect` function:
.. code-block:: python

   import pyarrow as pa

   fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)

   with fs.open(path, 'rb') as f:
       ...  # Do something with f
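The returned filesystem handle supports writing as well as reading. A minimal
sketch, assuming a hypothetical host ``namenode.example.com``, port, user, and
path; substitute your cluster's values:

.. code-block:: python

   import pyarrow as pa

   # Hypothetical connection parameters; replace with your cluster's values.
   fs = pa.hdfs.connect('namenode.example.com', 8020, user='alice')

   # Write bytes to a new HDFS file, then read them back.
   with fs.open('/tmp/example.dat', 'wb') as f:
       f.write(b'some payload')

   with fs.open('/tmp/example.dat', 'rb') as f:
       data = f.read()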
By default, :class:`pyarrow.hdfs.HadoopFileSystem` uses libhdfs, a JNI-based
interface to the Java Hadoop client. This library is loaded at runtime
(rather than at link / library load time, since the library may not be in your
``LD_LIBRARY_PATH``), and relies on some environment variables.
* ``HADOOP_HOME``: the root of your installed Hadoop distribution. Often has
  ``lib/native/libhdfs.so``.
* ``JAVA_HOME``: the location of your Java SDK installation.
* ``ARROW_LIBHDFS_DIR`` (optional): explicit location of ``libhdfs.so`` if it
  is installed somewhere other than ``$HADOOP_HOME/lib/native``.
* ``CLASSPATH``: must contain the Hadoop jars. You can set these using:

  .. code-block:: shell

     export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`
If ``CLASSPATH`` is not set, then it will be set automatically if the
``hadoop`` executable is in your system path, or if ``HADOOP_HOME`` is set.
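These variables can also be configured from Python via ``os.environ`` before
connecting. A minimal sketch, assuming hypothetical installation paths; the
subprocess call mirrors the ``hdfs classpath --glob`` shell command above:

.. code-block:: python

   import os
   import subprocess

   import pyarrow as pa

   # Hypothetical installation paths; adjust to your environment.
   os.environ['HADOOP_HOME'] = '/opt/hadoop'
   os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'

   # Populate CLASSPATH with the Hadoop jars, mirroring
   # `$HADOOP_HOME/bin/hdfs classpath --glob` from the shell.
   classpath = subprocess.check_output(
       [os.path.join(os.environ['HADOOP_HOME'], 'bin', 'hdfs'),
        'classpath', '--glob'])
   os.environ['CLASSPATH'] = classpath.decode('utf-8').strip()

   fs = pa.hdfs.connect('namenode.example.com', 8020, user='alice')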
You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal
Labs:

.. code-block:: python

   fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path,
                        driver='libhdfs3')
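With either driver, the connected filesystem object exposes the management
operations listed in the API summary below. A minimal sketch, using
hypothetical paths and a hypothetical local file ``local.csv``:

.. code-block:: python

   # Hypothetical paths; adjust to your cluster.
   fs.mkdir('/user/alice/data')

   print(fs.ls('/user/alice'))           # list directory contents
   print(fs.exists('/user/alice/data'))  # True once created
   print(fs.info('/user/alice/data'))    # metadata such as size and kind

   # Copy a local file into HDFS, then remove it again.
   with open('local.csv', 'rb') as f:
       fs.upload('/user/alice/data/local.csv', f)
   fs.rm('/user/alice/data/local.csv')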
.. currentmodule:: pyarrow
.. autosummary::
   :toctree: generated/

   hdfs.connect
   HadoopFileSystem.cat
   HadoopFileSystem.chmod
   HadoopFileSystem.chown
   HadoopFileSystem.delete
   HadoopFileSystem.df
   HadoopFileSystem.disk_usage
   HadoopFileSystem.download
   HadoopFileSystem.exists
   HadoopFileSystem.get_capacity
   HadoopFileSystem.get_space_used
   HadoopFileSystem.info
   HadoopFileSystem.ls
   HadoopFileSystem.mkdir
   HadoopFileSystem.open
   HadoopFileSystem.rename
   HadoopFileSystem.rm
   HadoopFileSystem.upload
   HdfsFile