---
output:
  github_document:
    html_preview: false
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# arrow
[CRAN](https://cran.r-project.org/package=arrow) | [conda-forge](https://anaconda.org/conda-forge/r-arrow) | [Nightly macOS Build Status](https://travis-ci.org/ursa-labs/arrow-r-nightly) | [Nightly Windows Build Status](https://ci.appveyor.com/project/nealrichardson/arrow-r-nightly-yxl55/branch/master) | [codecov](https://codecov.io/gh/ursa-labs/arrow-r-nightly)
[Apache Arrow](https://arrow.apache.org/) is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.
The `arrow` package exposes an interface to the Arrow C++ library to access many of its features in R. This includes support for working with Parquet (`read_parquet()`, `write_parquet()`) and Feather (`read_feather()`, `write_feather()`) files, as well as lower-level access to Arrow memory and messages.
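As a rough illustration of the file functions named above, the following sketch round-trips a small data frame through Parquet and Feather. It assumes the `arrow` package and the Arrow C++ library are already installed; the data frame and the temporary file paths are made up for the example.

```r
# Minimal sketch: write a data frame to Parquet and Feather, then read it back.
# Assumes the arrow R package and the Arrow C++ library are both installed.
library(arrow)

# Illustrative data only
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Parquet round trip through a temporary file
parquet_file <- tempfile(fileext = ".parquet")
write_parquet(df, parquet_file)
df_parquet <- read_parquet(parquet_file)

# Feather round trip through a temporary file
feather_file <- tempfile(fileext = ".feather")
write_feather(df, feather_file)
df_feather <- read_feather(feather_file)
```

Both readers return a data frame by default, so `df_parquet` and `df_feather` can be used like any other R data frame.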
## Installation
Install the latest release of `arrow` from CRAN with
```r
install.packages("arrow")
```
Conda users on Linux and macOS can install `arrow` from conda-forge with
```shell
conda install -c conda-forge r-arrow
```
On macOS and Windows, installing a binary package from CRAN will handle Arrow's C++ dependencies for you. On Linux, unless you use `conda`, you'll need to install the C++ library first. See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, including Debian, Ubuntu, and CentOS. Install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS; either package also installs the Arrow C++ library as a dependency. On other Linux distributions, you must build the C++ library from source.
If you install the `arrow` package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call
```r
arrow::install_arrow()
```
for version- and platform-specific guidance on installing the Arrow C++ library.
When installing from source, if the R and C++ library versions do not match, installation may fail. If you've previously installed the libraries and want to upgrade the R package, you'll need to update the Arrow C++ library first.
## Example
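```{r}
library(arrow)
set.seed(24)
# Build an Arrow table from R vectors
tab <- arrow::table(x = 1:10, y = rnorm(10))
# Inspect the column names and types
tab$schema
tab
# Convert back to an R data frame
as.data.frame(tab)
```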
## Installing a development version
Binary R packages for macOS and Windows are built daily and hosted at https://dl.bintray.com/ursalabs/arrow-r/. To install from there:
```r
install.packages("arrow", repos="https://dl.bintray.com/ursalabs/arrow-r")
```
These daily package builds are not official Apache releases and are not recommended for production use. They may be useful for testing bug fixes and new features under active development.
Linux users will need to build the Arrow C++ library from source. See "Developing" below. Once you have the C++ library, you can install the R package from GitHub using the [`remotes`](https://remotes.r-lib.org/) package. From within an R session,
```r
# install.packages("remotes") # Or install "devtools", which includes remotes
remotes::install_github("apache/arrow/r")
```
or if you prefer to stay at the command line,
```shell
R -e 'remotes::install_github("apache/arrow/r")'
```
You can specify a particular commit, branch, or [release](https://github.com/apache/arrow/releases) to install by including a `ref` argument to `install_github()`. This is particularly useful to match the R package version to the C++ library version you've installed.
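For instance, pinning to a release might look like the sketch below. It assumes that release tags in the GitHub repository follow an `apache-arrow-<version>` pattern, and the version number `0.14.1` is only illustrative; check the releases page for the tag that matches your C++ library.

```r
# Sketch: build the `ref` for the release that matches an installed C++ library.
# Assumption: release tags follow the "apache-arrow-<version>" pattern.
cpp_version <- "0.14.1"  # illustrative version number, not a recommendation
ref <- paste0("apache-arrow-", cpp_version)

# Not run here because it downloads and compiles the package:
# remotes::install_github("apache/arrow/r", ref = ref)
```

A commit SHA or branch name can be passed as `ref` in the same way.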
## Developing
Windows and macOS users who wish to contribute to the R package and don't need to alter the Arrow C++ library may be able to obtain a recent version of the library without building from source. On macOS, you may install the C++ library using [Homebrew](https://brew.sh/):
```shell
# For the released version:
brew install apache-arrow
# Or for a development version, you can try:
brew install apache-arrow --HEAD
```
On Windows, you can download a .zip file with the Arrow dependencies from the [rwinlib](https://github.com/rwinlib/arrow/releases) project, and then set the `RWINLIB_LOCAL` environment variable to point to that zip file before installing the `arrow` R package. That project contains released versions of the C++ library; for a development version, Windows users may be able to find a binary by going to the [Apache Arrow project's AppVeyor](https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow), selecting an R job from a recent build, and downloading the `build\arrow-*.zip` file from the "Artifacts" tab.
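One way to set `RWINLIB_LOCAL` is from within the R session that will run the installation, as in this sketch. The path is hypothetical; replace it with the location of the archive you downloaded.

```r
# Sketch: point RWINLIB_LOCAL at a downloaded archive before installing.
# The path below is hypothetical; use the location of your own .zip file.
Sys.setenv(RWINLIB_LOCAL = "C:/path/to/arrow.zip")

# Confirm the variable is visible to this session and its child processes
Sys.getenv("RWINLIB_LOCAL")

# Then install the package from the same session, for example from the r/
# directory of a checkout (not run here):
# devtools::install()
```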
If you need to alter both the Arrow C++ library and the R package code, or if
you can't get a binary version of the latest C++ library elsewhere, you'll need
to build it from source too.
First, install the C++ library. See the [C++ developer
guide](https://arrow.apache.org/docs/developers/cpp.html) for details.
Note that after any change to the C++ library, you must reinstall it and run
`make clean` or `git clean -fdx .` to remove any cached object code in the `r/src/`
directory before reinstalling the R package. This is only necessary if you make changes to the C++ library source; you do not need to manually purge object files if you are only editing R or Rcpp code inside `r/`.
Once you've built the C++ library, you can install the R package and its
dependencies, along with additional dev dependencies, from the git checkout:
```shell
# From a C++ build directory two levels below the repository root, move to r/
cd ../../r
R -e 'install.packages(c("devtools", "roxygen2", "pkgdown")); devtools::install_dev_deps()'
R CMD INSTALL .
```
If you need to set any compilation flags while building the Rcpp extensions,
you can use the `ARROW_R_CXXFLAGS` environment variable. For example, if you
are using `perf` to profile the R extensions, you may need to set
```shell
export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
```
If the package fails to install/load with an error like this:
```
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
```
try setting the environment variable `R_LD_LIBRARY_PATH` to the directory where
`make install` placed the Arrow C++ library, e.g.
`export R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
For any other build/configuration challenges, see the [C++ developer
guide](https://arrow.apache.org/docs/developers/cpp.html#building).
### Editing Rcpp code
The `arrow` package uses some customized tools on top of `Rcpp` to prepare its
C++ code in `src/`. If you change C++ code in the R package, you will need to
set the `ARROW_R_DEV` environment variable to `TRUE` (optionally, add it to
your `~/.Renviron` file to persist across sessions) so that the
`data-raw/codegen.R` file is used for code generation.
The `data-raw/codegen.R` script has these dependencies:
```r
remotes::install_github("romainfrancois/decor")
install.packages(c("dplyr", "purrr", "glue"))
```
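The `ARROW_R_DEV` setting described above can be applied to the current session as sketched here. The `.Renviron` line in the comment is the usual `NAME=value` form for that file.

```r
# Sketch: enable code generation for the current R session only
Sys.setenv(ARROW_R_DEV = "TRUE")

# Check the value that child processes (such as the package build) will see
Sys.getenv("ARROW_R_DEV")

# To persist across sessions, add this line to ~/.Renviron instead:
# ARROW_R_DEV=TRUE
```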
### Useful functions
Within an R session, these can help with package development:
```r
devtools::load_all() # Load the dev package
devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
devtools::document() # Update roxygen documentation
rmarkdown::render("README.Rmd") # To rebuild README.md
pkgdown::build_site() # To preview the documentation website
devtools::check() # All package checks; see also below
```
Any of those can be run from the command line by wrapping them in
`R -e '$COMMAND'`. There's also a `Makefile` to help with some common tasks
from the command line (`make test`, `make doc`, `make clean`, etc.).
### Full package validation
```shell
R CMD build --keep-empty-dirs .
R CMD check arrow_*.tar.gz --as-cran --no-manual
```