Filesystem Interface (legacy)
=============================

.. warning::

   This section documents the deprecated filesystem layer. You should use the
   :ref:`new filesystem layer <filesystem>` instead.

Hadoop File System (HDFS)
-------------------------

PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:

.. code-block:: python

   import pyarrow as pa

   fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
   with fs.open(path, 'rb') as f:
       ...  # do something with f
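Once connected, the returned filesystem object exposes standard file and
directory operations (see the HDFS API listing below). A minimal sketch of a
round trip, assuming the ``fs`` object from the snippet above and a placeholder
scratch path:

.. code-block:: python

   # A hypothetical scratch directory; replace with a path you can write to.
   base = '/tmp/pyarrow-legacy-example'

   fs.mkdir(base)

   # Write a small file, then read it back.
   with fs.open(base + '/hello.txt', 'wb') as f:
       f.write(b'hello from pyarrow')

   with fs.open(base + '/hello.txt', 'rb') as f:
       print(f.read())

   # Inspect the directory contents and clean up.
   print(fs.ls(base))
   fs.delete(base, recursive=True)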

By default, ``pyarrow.hdfs.HadoopFileSystem`` uses libhdfs, a JNI-based interface to the Java Hadoop client. This library is loaded at runtime (rather than at link / library load time, since the library may not be in your ``LD_LIBRARY_PATH``), and relies on some environment variables:

* ``HADOOP_HOME``: the root of your installed Hadoop distribution. Often has
  ``lib/native/libhdfs.so``.
* ``JAVA_HOME``: the location of your Java SDK installation.
* ``ARROW_LIBHDFS_DIR`` (optional): explicit location of ``libhdfs.so`` if it is
  installed somewhere other than ``$HADOOP_HOME/lib/native``.
* ``CLASSPATH``: must contain the Hadoop jars. You can set these using:

  .. code-block:: shell

     export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`

If ``CLASSPATH`` is not set, then it will be set automatically if the ``hadoop`` executable is in your system path, or if ``HADOOP_HOME`` is set.
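These variables can also be set from Python before the first call to
``pyarrow.hdfs.connect``. A rough sketch, assuming a Hadoop installation under
``/opt/hadoop`` (all paths here are illustrative, not defaults):

.. code-block:: python

   import os
   import subprocess

   # Illustrative locations; adjust to your installation.
   os.environ['HADOOP_HOME'] = '/opt/hadoop'
   os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'

   # Mirror the `hdfs classpath --glob` shell command shown above.
   hdfs_bin = os.path.join(os.environ['HADOOP_HOME'], 'bin', 'hdfs')
   classpath = subprocess.check_output([hdfs_bin, 'classpath', '--glob'])
   os.environ['CLASSPATH'] = classpath.decode('utf-8').strip()

   import pyarrow as pa
   fs = pa.hdfs.connect()  # connects to the configuration's default filesystem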

You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:

.. code-block:: python

   fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path,
                        driver='libhdfs3')

HDFS API
~~~~~~~~

.. currentmodule:: pyarrow

.. autosummary::
   :toctree: generated/

   hdfs.connect
   HadoopFileSystem.cat
   HadoopFileSystem.chmod
   HadoopFileSystem.chown
   HadoopFileSystem.delete
   HadoopFileSystem.df
   HadoopFileSystem.disk_usage
   HadoopFileSystem.download
   HadoopFileSystem.exists
   HadoopFileSystem.get_capacity
   HadoopFileSystem.get_space_used
   HadoopFileSystem.info
   HadoopFileSystem.ls
   HadoopFileSystem.mkdir
   HadoopFileSystem.open
   HadoopFileSystem.rename
   HadoopFileSystem.rm
   HadoopFileSystem.upload
   HdfsFile
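
As a brief illustration of a few of these methods, the following sketch reports
cluster capacity and usage and copies a local file to and from HDFS (``fs`` is
an existing connection; all file paths are placeholders):

.. code-block:: python

   print(fs.get_capacity())    # total cluster capacity, in bytes
   print(fs.get_space_used())  # bytes currently in use

   # Copy a local file into HDFS, then back out again.
   with open('local.csv', 'rb') as source:
       fs.upload('/tmp/remote.csv', source)

   with open('local_copy.csv', 'wb') as sink:
       fs.download('/tmp/remote.csv', sink)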