Init the code of spark-rapids-container
Co-authored-by: Navin Kumar <[email protected]>

re-arranging directory and initial README.md

Signed-off-by: Navin Kumar <[email protected]>

Can build 22.08 container, will need some updates to find the right locations, and some documentation fixes

Signed-off-by: Navin Kumar <[email protected]>

Cleanup ganglia directory

Signed-off-by: Navin Kumar <[email protected]>

Add webterminal to Docker container

Signed-off-by: Navin Kumar <[email protected]>

Addressing some feedback

Signed-off-by: Navin Kumar <[email protected]>

Some updates including adding the alluxio init script

Signed-off-by: Navin Kumar <[email protected]>

Add init-alluxio init script

Signed-off-by: Navin Kumar <[email protected]>

Cleanup alluxio init script, make REQUIREMENTS an ARG, and allow for local jar file to be used in lieu of URL

Signed-off-by: Navin Kumar <[email protected]>

Move webterminal code to later in the Dockerfile, separating it from main build

Signed-off-by: Navin Kumar <[email protected]>

Cleanup and change CUDA version to 11.5.1

Add tmux and JAR_FILE ARG

Signed-off-by: Navin Kumar <[email protected]>

Init script running inside Docker container

Usage docs for Databricks docker container

Updated copyrights and docs

Update init scripts to allow customization of heap, cache percent and saving logs

Sync with GitHub PR updates

Add supervisor for Alluxio

Signed-off-by: Chong Gao <[email protected]>

Refactor

Remove Python Libraries from Docker container

Switch to a CUDA runtime base instead of cudnn

Add lost code: install supervisor

Update README with some step-reordering and screenshots to make instructions a bit clearer
NVnavkumar authored and GaryShen2008 committed on Nov 3, 2022 · commit 8db8741 (1 parent: 51d013c)
Showing 25 changed files with 2,223 additions and 0 deletions.
94 changes: 94 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,94 @@
# Contributing to RAPIDS Accelerator for Apache Spark Docker container

Contributions to RAPIDS Accelerator for Apache Spark Docker container fall into the following three categories.

1. To report a bug, request a new feature, or report a problem with
documentation, please file an [issue](https://github.com/NVIDIA/spark-rapids-container/issues/new/choose)
describing in detail the problem or new feature. The project team evaluates
and triages issues, and schedules them for a release. If you believe the
issue needs priority attention, please comment on the issue to notify the
team.
2. To propose and implement a new feature, please file a new feature request
[issue](https://github.com/NVIDIA/spark-rapids-container/issues/new/choose). Describe the
intended feature and discuss the design and implementation with the team and
community. Once the team agrees that the plan looks good, go ahead and
implement it using the [code contributions](#code-contributions) guide below.
3. To implement a feature or bug-fix for an existing outstanding issue, please
follow the [code contributions](#code-contributions) guide below. If you
need more context on a particular issue, please ask in a comment.

## Branching Convention

There are two branches in this repository:

* `dev`: development branches, which can change often. Note that we merge into
the branch with the greatest version number, as that is our default branch.

* `main`: the branch with the latest released code, where the version tag (e.g. `v0.1.0`)
is applied. `main` changes with new releases, but otherwise it does not change with
every merged pull request, making it a more stable branch.

## Code contributions

### Sign your work

We require that all contributors sign off on their commits. This certifies that the contribution is your original work, or that you have the right to submit it under the same license or a compatible license.

Any contribution which contains commits that are not signed off will not be accepted.

To sign off on a commit use the `--signoff` (or `-s`) option when committing your changes:

```shell
git commit -s -m "Add cool feature."
```

This will append the following to your commit message:

```
Signed-off-by: Your Name <[email protected]>
```

The sign-off is a simple line at the end of the explanation for the patch. Your signature certifies that you wrote the patch or otherwise have the right to pass it on as an open-source patch. Use your real name, no pseudonyms or anonymous contributions. If you set your `user.name` and `user.email` git configs, you can sign your commit automatically with `git commit -s`.
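For example, a minimal one-time setup might look like the following (the name and email here are placeholders — use your own):

```shell
# Configure the identity used for the Signed-off-by line
git config user.name "Your Name"
git config user.email "your.name@example.com"

# Commit with an automatic sign-off
git commit -s -m "Add cool feature."
```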


The sign-off means you certify the below (from [developercertificate.org](https://developercertificate.org)):

```
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
```
21 changes: 21 additions & 0 deletions Databricks/00-custom-spark-driver-defaults.conf
@@ -0,0 +1,21 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
[driver] {
"spark.plugins" = "com.nvidia.spark.SQLPlugin"
"spark.rapids.memory.pinnedPool.size" = "2G"
"spark.databricks.delta.optimizeWrite.enabled" = "false"
"spark.sql.optimizer.dynamicPartitionPruning.enabled" = "false"
"spark.sql.files.maxPartitionBytes" = "512m"
"spark.rapids.sql.concurrentGpuTasks" = "2"
}
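For orientation, these driver defaults map to standard Spark and RAPIDS Accelerator configuration keys (apart from the Databricks-specific `spark.databricks.delta.optimizeWrite.enabled`). As a rough sketch — the jar and application names below are placeholders, not part of this commit — the equivalent settings could be passed to a plain `spark-submit` outside of Databricks:

```shell
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.memory.pinnedPool.size=2G \
  --conf spark.sql.optimizer.dynamicPartitionPruning.enabled=false \
  --conf spark.sql.files.maxPartitionBytes=512m \
  --conf spark.rapids.sql.concurrentGpuTasks=2 \
  --jars rapids-4-spark_2.12-22.10.0.jar \
  my-app.jar
```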
228 changes: 228 additions & 0 deletions Databricks/Dockerfile
@@ -0,0 +1,228 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#############
# This Dockerfile combines the following Dockerfiles:
# https://github.com/databricks/containers/blob/master/ubuntu/gpu/cuda-11/base/Dockerfile
# https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile
# https://github.com/databricks/containers/blob/master/ubuntu/python/Dockerfile
# https://github.com/databricks/containers/blob/master/ubuntu/dbfsfuse/Dockerfile
# https://github.com/databricks/containers/blob/master/ubuntu/standard/Dockerfile
# https://github.com/databricks/containers/blob/master/experimental/ubuntu/ganglia/Dockerfile
# https://github.com/dayananddevarapalli/containers/blob/main/webterminal/Dockerfile
#############
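# The build is organized as a chain of stages (base -> with-plugin ->
# with-ganglia -> with-webterminal -> with-alluxio); if only part of the
# stack is needed, an intermediate image can be built with
# `docker build --target <stage>`.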
ARG CUDA_VERSION=11.5.2
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu20.04 as base

ARG CUDA_PKG_VERSION=11-5
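# Note: CUDA_PKG_VERSION is assumed to track the major.minor of CUDA_VERSION
# above (11.5.x -> 11-5) so that the cuda-* package names below resolve.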

#############
# Install all needed libs
#############

RUN set -ex && \
cd /etc/apt/sources.list.d && \
mv cuda.list cuda.list.disabled && \
apt-get -y update && \
apt-get -y install wget && \
wget -qO - https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/3bf863cc.pub | apt-key add - && \
cd /etc/apt/sources.list.d && \
mv cuda.list.disabled cuda.list && \
apt-get -y update && \
apt-get -y upgrade && \
apt-get install -y software-properties-common && \
add-apt-repository ppa:deadsnakes/ppa -y && \
apt-get -y install python3.8 virtualenv python3-filelock libcairo2 cuda-cupti-${CUDA_PKG_VERSION} \
cuda-toolkit-${CUDA_PKG_VERSION}-config-common cuda-toolkit-11-config-common cuda-toolkit-config-common \
openjdk-8-jdk-headless iproute2 bash sudo coreutils procps gpg fuse openssh-server && \
apt-get -y install cuda-cudart-dev-${CUDA_PKG_VERSION} cuda-cupti-dev-${CUDA_PKG_VERSION} cuda-driver-dev-${CUDA_PKG_VERSION} \
cuda-nvcc-${CUDA_PKG_VERSION} cuda-thrust-${CUDA_PKG_VERSION} cuda-toolkit-${CUDA_PKG_VERSION}-config-common cuda-toolkit-11-config-common \
cuda-toolkit-config-common python3.8-dev libpq-dev libcairo2-dev build-essential unattended-upgrades cmake ccache \
openmpi-bin linux-headers-5.4.0-117 linux-headers-5.4.0-117-generic linux-headers-generic libopenmpi-dev unixodbc-dev \
sysstat ssh tmux supervisor && \
apt-get install -y less vim && \
/var/lib/dpkg/info/ca-certificates-java.postinst configure && \
# Initialize the default environment that Spark and notebooks will use
virtualenv -p python3.8 --system-site-packages /databricks/python3 --no-download --no-setuptools \
&& /databricks/python3/bin/pip install --no-cache-dir --upgrade pip \
&& /databricks/python3/bin/pip install \
databricks-cli \
ipython \
&& /databricks/python3/bin/pip install --force-reinstall \
virtualenv \
&& /databricks/python3/bin/pip cache purge && \
apt-get -y purge --autoremove software-properties-common cuda-cudart-dev-${CUDA_PKG_VERSION} cuda-cupti-dev-${CUDA_PKG_VERSION} \
cuda-driver-dev-${CUDA_PKG_VERSION} cuda-nvcc-${CUDA_PKG_VERSION} cuda-thrust-${CUDA_PKG_VERSION} \
python3.8-dev libpq-dev libcairo2-dev build-essential unattended-upgrades cmake ccache openmpi-bin \
linux-headers-5.4.0-117 linux-headers-5.4.0-117-generic linux-headers-generic libopenmpi-dev unixodbc-dev \
virtualenv python3-virtualenv && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
mkdir -p /databricks/jars && \
mkdir -p /mnt/driver-daemon && \
#############
# Disable NVIDIA repos to prevent accidental upgrades.
#############
ln -s /databricks/jars /mnt/driver-daemon/jars && \
cd /etc/apt/sources.list.d && \
mv cuda.list cuda.list.disabled && \
# Create user "ubuntu"
useradd --create-home --shell /bin/bash --groups sudo ubuntu

#############
# Set all env variables
#############
ARG DATABRICKS_RUNTIME_VERSION=10.4
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
ENV DATABRICKS_RUNTIME_VERSION=${DATABRICKS_RUNTIME_VERSION}
ENV LANG=C.UTF-8
ENV USER=ubuntu
ENV PATH=/usr/local/nvidia/bin:/databricks/python3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

FROM base as with-plugin

#############
# Spark RAPIDS configuration
#############
ARG DRIVER_CONF_FILE=00-custom-spark-driver-defaults.conf
ARG JAR_FILE=rapids-4-spark_2.12-22.10.0.jar
ARG JAR_URL=https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.10.0/${JAR_FILE}
ARG INIT_SCRIPT=init.sh
COPY ${DRIVER_CONF_FILE} /databricks/driver/conf/00-custom-spark-driver-defaults.conf

WORKDIR /databricks/jars
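# JAR_URL may also be overridden with the path of a jar inside the build
# context (e.g. --build-arg JAR_URL=rapids-4-spark_2.12-22.10.0.jar), since
# ADD accepts local paths as well as URLs.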
ADD $JAR_URL /databricks/jars/${JAR_FILE}

ADD $INIT_SCRIPT /opt/spark-rapids/init.sh
RUN chmod 755 /opt/spark-rapids/init.sh

WORKDIR /databricks

#############
# Setup Ganglia
#############
FROM with-plugin as with-ganglia

WORKDIR /databricks
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -q -y --force-yes --fix-missing --ignore-missing \
ganglia-monitor \
ganglia-webfrontend \
ganglia-monitor-python \
python3-pip \
wget \
rsync \
cron \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Upgrade Ganglia to 3.7.2 to patch XSS bug, see CJ-15250
# Upgrade Ganglia to 3.7.4 and use private forked repo to patch several security bugs, see CJ-20114
# SC-17279: We run gmetad as user ganglia, so change the owner from nobody to ganglia for the rrd directory
RUN cd /tmp \
&& export GANGLIA_WEB=ganglia-web-3.7.4-db-4 \
&& wget https://s3-us-west-2.amazonaws.com/databricks-build-files/$GANGLIA_WEB.tar.gz \
&& tar xvzf $GANGLIA_WEB.tar.gz \
&& cd $GANGLIA_WEB \
&& make install \
&& chown ganglia:ganglia /var/lib/ganglia/rrds
# Install Phantom.JS
RUN cd /tmp \
&& export PHANTOM_JS="phantomjs-2.1.1-linux-x86_64" \
&& wget https://s3-us-west-2.amazonaws.com/databricks-build-files/$PHANTOM_JS.tar.bz2 \
&& tar xvjf $PHANTOM_JS.tar.bz2 \
&& mv $PHANTOM_JS /usr/local/share \
&& ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin
# Apache2 config. The `sites-enabled` config files are loaded into the container
# later.
RUN rm /etc/apache2/sites-enabled/* && a2enmod proxy && a2enmod proxy_http
RUN mkdir -p /etc/monit/conf.d
RUN echo '\
check process ganglia-monitor with pidfile /var/run/ganglia-monitor.pid\n\
start program = "/usr/sbin/service ganglia-monitor start"\n\
stop program = "/usr/sbin/service ganglia-monitor stop"\n\
if memory usage > 500 MB for 3 cycles then restart\n\
' > /etc/monit/conf.d/ganglia-monitor-not-active
RUN echo '\
check process gmetad with pidfile /var/run/gmetad.pid\n\
start program = "/usr/sbin/service gmetad start"\n\
stop program = "/usr/sbin/service gmetad stop"\n\
if memory usage > 500 MB for 3 cycles then restart\n\
\n\
check process apache2 with pidfile /var/run/apache2/apache2.pid\n\
start program = "/usr/sbin/service apache2 start"\n\
stop program = "/usr/sbin/service apache2 stop"\n\
if memory usage > 500 MB for 3 cycles then restart\n\
' > /etc/monit/conf.d/gmetad-not-active
RUN echo '\
check process spark-slave with pidfile /tmp/spark-root-org.apache.spark.deploy.worker.Worker-1.pid\n\
start program = "/databricks/spark/scripts/restart-workers"\n\
stop program = "/databricks/spark/scripts/kill_worker.sh"\n\
' > /etc/monit/conf.d/spark-slave-not-active
# Add the Ganglia configuration file indicating the DocumentRoot - Databricks checks this to enable Ganglia upon cluster startup
RUN mkdir -p /etc/apache2/sites-enabled
ADD ganglia/ganglia.conf /etc/apache2/sites-enabled
RUN chmod 775 /etc/apache2/sites-enabled/ganglia.conf
ADD ganglia/gconf/* /etc/ganglia/
RUN mkdir -p /databricks/spark/scripts/ganglia/
RUN mkdir -p /databricks/spark/scripts/
ADD ganglia/start_spark_slave.sh /databricks/spark/scripts/start_spark_slave.sh

# Add the local monit shell script in the right location
RUN mkdir -p /etc/init.d
ADD scripts/monit /etc/init.d
RUN chmod 775 /etc/init.d/monit

#############
# Set up webterminal ssh
#############
FROM with-ganglia as with-webterminal

RUN wget https://github.com/tsl0922/ttyd/releases/download/1.6.3/ttyd.x86_64 && \
mkdir -p /databricks/driver/logs && \
mkdir -p /databricks/spark/scripts/ttyd/ && \
mkdir -p /etc/monit/conf.d/ && \
mv ttyd.x86_64 /databricks/spark/scripts/ttyd/ttyd && \
export TTYD_BIN_FILE=/databricks/spark/scripts/ttyd/ttyd

ENV TTYD_DIR=/databricks/spark/scripts/ttyd
ENV TTYD_BIN_FILE=$TTYD_DIR/ttyd

COPY webterminal/setup_ttyd_daemon.sh $TTYD_DIR/setup_ttyd_daemon.sh
COPY webterminal/stop_ttyd_daemon.sh $TTYD_DIR/stop_ttyd_daemon.sh
COPY webterminal/start_ttyd_daemon.sh $TTYD_DIR/start_ttyd_daemon.sh
COPY webterminal/webTerminalBashrc $TTYD_DIR/webTerminalBashrc
RUN echo '\
check process ttyd with pidfile /var/run/ttyd-daemon.pid\n\
start program = "/databricks/spark/scripts/ttyd/start_ttyd_daemon.sh"\n\
stop program = "/databricks/spark/scripts/ttyd/stop_ttyd_daemon.sh"' > /etc/monit/conf.d/ttyd-daemon-not-active

FROM with-webterminal as with-alluxio
#############
# Setup Alluxio
#############
ARG ALLUXIO_VERSION=2.8.0
ARG ALLUXIO_HOME="/opt/alluxio-${ALLUXIO_VERSION}"
ARG ALLUXIO_TAR_FILE="alluxio-${ALLUXIO_VERSION}-bin.tar.gz"
ARG ALLUXIO_DOWNLOAD_URL="https://downloads.alluxio.io/downloads/files/${ALLUXIO_VERSION}/${ALLUXIO_TAR_FILE}"

RUN wget -O /tmp/$ALLUXIO_TAR_FILE ${ALLUXIO_DOWNLOAD_URL} \
&& tar zxf /tmp/${ALLUXIO_TAR_FILE} -C /opt/ \
&& rm -f /tmp/${ALLUXIO_TAR_FILE} \
&& cp ${ALLUXIO_HOME}/client/alluxio-${ALLUXIO_VERSION}-client.jar /databricks/jars/
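# The Alluxio client jar lands in /databricks/jars (symlinked to
# /mnt/driver-daemon/jars in the base stage), so Spark should pick it up on
# the classpath without further configuration.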

#############
# Allow ubuntu user to sudo without password
#############
RUN echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu \
&& chmod 555 /etc/sudoers.d/ubuntu
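As a usage sketch (the image tag is illustrative and the build-arg values simply restate the defaults above — none of this is mandated by the commit), the container could be built from the `Databricks` directory with the ARGs overridden on the command line:

```shell
docker build \
  --build-arg CUDA_VERSION=11.5.2 \
  --build-arg JAR_FILE=rapids-4-spark_2.12-22.10.0.jar \
  --build-arg INIT_SCRIPT=init.sh \
  -t spark-rapids-databricks:22.10.0 \
  Databricks/
```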
