Init the code of spark-rapids-container
Co-authored-by: Navin Kumar <[email protected]>

re-arranging directory and initial README.md

Signed-off-by: Navin Kumar <[email protected]>

Can build 22.08 container, will need some updates to find the right locations, and some documentation fixes

Signed-off-by: Navin Kumar <[email protected]>

Cleanup ganglia directory

Signed-off-by: Navin Kumar <[email protected]>

Add webterminal to Docker container

Signed-off-by: Navin Kumar <[email protected]>

Addressing some feedback

Signed-off-by: Navin Kumar <[email protected]>

Some updates including adding the alluxio init script

Signed-off-by: Navin Kumar <[email protected]>

Add init-alluxio init script

Signed-off-by: Navin Kumar <[email protected]>

Cleanup alluxio init script, make REQUIREMENTS an ARG, and allow for local jar file to be used in lieu of URL

Signed-off-by: Navin Kumar <[email protected]>

Move webterminal code to later in the Dockerfile, separating it from main build

Signed-off-by: Navin Kumar <[email protected]>

Cleanup and change CUDA version to 11.5.1

Add tmux and JAR_FILE ARG

Signed-off-by: Navin Kumar <[email protected]>

Init script running inside Docker container

Usage docs for Databricks docker container

Updated copyrights and docs

Update init scripts to allow customization of heap, cache percent and saving logs

Sync with GitHub PR updates

Add supervisor for Alluxio

Signed-off-by: Chong Gao <[email protected]>

Refactor

Remove Python Libraries from Docker container

Switch to a CUDA runtime base instead of cudnn

Add lost code: install supervisor

Update README with some step-reordering and screenshots to make instructions a bit clearer
NVnavkumar authored and GaryShen2008 committed on Nov 3, 2022 · commit 8db8741 (1 parent: 51d013c)
Showing 25 changed files with 2,223 additions and 0 deletions.
94 changes: 94 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,94 @@
# Contributing to RAPIDS Accelerator for Apache Spark Docker container

Contributions to RAPIDS Accelerator for Apache Spark Docker container fall into the following three categories.

1. To report a bug, request a new feature, or report a problem with
documentation, please file an [issue](https://github.com/NVIDIA/spark-rapids-container/issues/new/choose)
describing in detail the problem or new feature. The project team evaluates
and triages issues, and schedules them for a release. If you believe the
issue needs priority attention, please comment on the issue to notify the
team.
2. To propose and implement a new feature, please file a new feature request
[issue](https://github.com/NVIDIA/spark-rapids-container/issues/new/choose). Describe the
intended feature and discuss the design and implementation with the team and
community. Once the team agrees that the plan looks good, go ahead and
implement it using the [code contributions](#code-contributions) guide below.
3. To implement a feature or bug-fix for an existing outstanding issue, please
follow the [code contributions](#code-contributions) guide below. If you
need more context on a particular issue, please ask in a comment.

## Branching Convention

There are two branches in this repository:

* `dev`: development branches, which can change often. Note that we merge into
the branch with the greatest version number, as that is our default branch.

* `main`: the branch with the latest released code, where the version tag (e.g. `v0.1.0`)
is applied. `main` changes with new releases, but otherwise it does not change with
every merged pull request, making it a more stable branch.

## Code contributions

### Sign your work

We require that all contributors sign off on their commits. This certifies that the contribution is your original work, or that you have the right to submit it under the same license or a compatible license.

Any contribution which contains commits that are not signed off will not be accepted.

To sign off on a commit use the `--signoff` (or `-s`) option when committing your changes:

```shell
git commit -s -m "Add cool feature."
```

This will append the following to your commit message:

```
Signed-off-by: Your Name <[email protected]>
```

The sign-off is a simple line at the end of the explanation for the patch. Your signature certifies that you wrote the patch or otherwise have the right to pass it on as an open-source patch. Use your real name, no pseudonyms or anonymous contributions. If you set your `user.name` and `user.email` git configs, you can sign your commit automatically with `git commit -s`.
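For example, a minimal one-time setup might look like the following (the name and email here are placeholders — use your own):

```shell
# Configure the identity used for the Signed-off-by line
git config user.name "Your Name"
git config user.email "your.name@example.com"

# Commit with an automatic sign-off
git commit -s -m "Add cool feature."
```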


The sign-off means you certify the below (from [developercertificate.org](https://developercertificate.org)):

```
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
```
21 changes: 21 additions & 0 deletions Databricks/00-custom-spark-driver-defaults.conf
@@ -0,0 +1,21 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
[driver] {
"spark.plugins" = "com.nvidia.spark.SQLPlugin"
"spark.rapids.memory.pinnedPool.size" = "2G"
"spark.databricks.delta.optimizeWrite.enabled" = "false"
"spark.sql.optimizer.dynamicPartitionPruning.enabled" = "false"
"spark.sql.files.maxPartitionBytes" = "512m"
"spark.rapids.sql.concurrentGpuTasks" = "2"
}
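For orientation, these driver defaults map to standard Spark and RAPIDS Accelerator configuration keys (apart from the Databricks-specific `spark.databricks.delta.optimizeWrite.enabled`). As a rough sketch — the jar and application names below are placeholders, not part of this commit — the equivalent settings could be passed to a plain `spark-submit` outside of Databricks:

```shell
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.memory.pinnedPool.size=2G \
  --conf spark.sql.optimizer.dynamicPartitionPruning.enabled=false \
  --conf spark.sql.files.maxPartitionBytes=512m \
  --conf spark.rapids.sql.concurrentGpuTasks=2 \
  --jars rapids-4-spark_2.12-22.10.0.jar \
  my-app.jar
```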
228 changes: 228 additions & 0 deletions Databricks/Dockerfile
@@ -0,0 +1,228 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#############
# This Dockerfile combines the following Dockerfiles:
# https://github.com/databricks/containers/blob/master/ubuntu/gpu/cuda-11/base/Dockerfile
# https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile
# https://github.com/databricks/containers/blob/master/ubuntu/python/Dockerfile
# https://github.com/databricks/containers/blob/master/ubuntu/dbfsfuse/Dockerfile
# https://github.com/databricks/containers/blob/master/ubuntu/standard/Dockerfile
# https://github.com/databricks/containers/blob/master/experimental/ubuntu/ganglia/Dockerfile
# https://github.com/dayananddevarapalli/containers/blob/main/webterminal/Dockerfile
#############
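# The build is organized as a chain of stages (base -> with-plugin ->
# with-ganglia -> with-webterminal -> with-alluxio); if only part of the
# stack is needed, an intermediate image can be built with
# `docker build --target <stage>`.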
ARG CUDA_VERSION=11.5.2
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu20.04 as base

ARG CUDA_PKG_VERSION=11-5
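# Note: CUDA_PKG_VERSION is assumed to track the major.minor of CUDA_VERSION
# above (11.5.x -> 11-5) so that the cuda-* package names below resolve.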

#############
# Install all needed libs
#############

RUN set -ex && \
cd /etc/apt/sources.list.d && \
mv cuda.list cuda.list.disabled && \
apt-get -y update && \
apt-get -y install wget && \
wget -qO - https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/3bf863cc.pub | apt-key add - && \
cd /etc/apt/sources.list.d && \
mv cuda.list.disabled cuda.list && \
apt-get -y update && \
apt-get -y upgrade && \
apt-get install -y software-properties-common && \
add-apt-repository ppa:deadsnakes/ppa -y && \
apt-get -y install python3.8 virtualenv python3-filelock libcairo2 cuda-cupti-${CUDA_PKG_VERSION} \
cuda-toolkit-${CUDA_PKG_VERSION}-config-common cuda-toolkit-11-config-common cuda-toolkit-config-common \
openjdk-8-jdk-headless iproute2 bash sudo coreutils procps gpg fuse openssh-server && \
apt-get -y install cuda-cudart-dev-${CUDA_PKG_VERSION} cuda-cupti-dev-${CUDA_PKG_VERSION} cuda-driver-dev-${CUDA_PKG_VERSION} \
cuda-nvcc-${CUDA_PKG_VERSION} cuda-thrust-${CUDA_PKG_VERSION} cuda-toolkit-${CUDA_PKG_VERSION}-config-common cuda-toolkit-11-config-common \
cuda-toolkit-config-common python3.8-dev libpq-dev libcairo2-dev build-essential unattended-upgrades cmake ccache \
openmpi-bin linux-headers-5.4.0-117 linux-headers-5.4.0-117-generic linux-headers-generic libopenmpi-dev unixodbc-dev \
sysstat ssh tmux supervisor && \
apt-get install -y less vim && \
/var/lib/dpkg/info/ca-certificates-java.postinst configure && \
# Initialize the default environment that Spark and notebooks will use
virtualenv -p python3.8 --system-site-packages /databricks/python3 --no-download --no-setuptools \
&& /databricks/python3/bin/pip install --no-cache-dir --upgrade pip \
&& /databricks/python3/bin/pip install \
databricks-cli \
ipython \
&& /databricks/python3/bin/pip install --force-reinstall \
virtualenv \
&& /databricks/python3/bin/pip cache purge && \
apt-get -y purge --autoremove software-properties-common cuda-cudart-dev-${CUDA_PKG_VERSION} cuda-cupti-dev-${CUDA_PKG_VERSION} \
cuda-driver-dev-${CUDA_PKG_VERSION} cuda-nvcc-${CUDA_PKG_VERSION} cuda-thrust-${CUDA_PKG_VERSION} \
python3.8-dev libpq-dev libcairo2-dev build-essential unattended-upgrades cmake ccache openmpi-bin \
linux-headers-5.4.0-117 linux-headers-5.4.0-117-generic linux-headers-generic libopenmpi-dev unixodbc-dev \
virtualenv python3-virtualenv && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
mkdir -p /databricks/jars && \
mkdir -p /mnt/driver-daemon && \
#############
# Disable NVIDIA repos to prevent accidental upgrades.
#############
ln -s /databricks/jars /mnt/driver-daemon/jars && \
cd /etc/apt/sources.list.d && \
mv cuda.list cuda.list.disabled && \
# Create user "ubuntu"
useradd --create-home --shell /bin/bash --groups sudo ubuntu

#############
# Set all env variables
#############
ARG DATABRICKS_RUNTIME_VERSION=10.4
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
ENV DATABRICKS_RUNTIME_VERSION=${DATABRICKS_RUNTIME_VERSION}
ENV LANG=C.UTF-8
ENV USER=ubuntu
ENV PATH=/usr/local/nvidia/bin:/databricks/python3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

FROM base as with-plugin

#############
# Spark RAPIDS configuration
#############
ARG DRIVER_CONF_FILE=00-custom-spark-driver-defaults.conf
ARG JAR_FILE=rapids-4-spark_2.12-22.10.0.jar
ARG JAR_URL=https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.10.0/${JAR_FILE}
ARG INIT_SCRIPT=init.sh
COPY ${DRIVER_CONF_FILE} /databricks/driver/conf/00-custom-spark-driver-defaults.conf

WORKDIR /databricks/jars
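# JAR_URL may also be overridden with the path of a jar inside the build
# context (e.g. --build-arg JAR_URL=rapids-4-spark_2.12-22.10.0.jar), since
# ADD accepts local paths as well as URLs.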
ADD $JAR_URL /databricks/jars/${JAR_FILE}

ADD $INIT_SCRIPT /opt/spark-rapids/init.sh
RUN chmod 755 /opt/spark-rapids/init.sh

WORKDIR /databricks

#############
# Setup Ganglia
#############
FROM with-plugin as with-ganglia

WORKDIR /databricks
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -q -y --force-yes --fix-missing --ignore-missing \
ganglia-monitor \
ganglia-webfrontend \
ganglia-monitor-python \
python3-pip \
wget \
rsync \
cron \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Upgrade Ganglia to 3.7.2 to patch XSS bug, see CJ-15250
# Upgrade Ganglia to 3.7.4 and use private forked repo to patch several security bugs, see CJ-20114
# SC-17279: We run gmetad as user ganglia, so change the owner from nobody to ganglia for the rrd directory
RUN cd /tmp \
&& export GANGLIA_WEB=ganglia-web-3.7.4-db-4 \
&& wget https://s3-us-west-2.amazonaws.com/databricks-build-files/$GANGLIA_WEB.tar.gz \
&& tar xvzf $GANGLIA_WEB.tar.gz \
&& cd $GANGLIA_WEB \
&& make install \
&& chown ganglia:ganglia /var/lib/ganglia/rrds
# Install Phantom.JS
RUN cd /tmp \
&& export PHANTOM_JS="phantomjs-2.1.1-linux-x86_64" \
&& wget https://s3-us-west-2.amazonaws.com/databricks-build-files/$PHANTOM_JS.tar.bz2 \
&& tar xvjf $PHANTOM_JS.tar.bz2 \
&& mv $PHANTOM_JS /usr/local/share \
&& ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin
# Apache2 config. The `sites-enabled` config files are loaded into the container
# later.
RUN rm /etc/apache2/sites-enabled/* && a2enmod proxy && a2enmod proxy_http
RUN mkdir -p /etc/monit/conf.d
RUN echo '\
check process ganglia-monitor with pidfile /var/run/ganglia-monitor.pid\n\
start program = "/usr/sbin/service ganglia-monitor start"\n\
stop program = "/usr/sbin/service ganglia-monitor stop"\n\
if memory usage > 500 MB for 3 cycles then restart\n\
' > /etc/monit/conf.d/ganglia-monitor-not-active
RUN echo '\
check process gmetad with pidfile /var/run/gmetad.pid\n\
start program = "/usr/sbin/service gmetad start"\n\
stop program = "/usr/sbin/service gmetad stop"\n\
if memory usage > 500 MB for 3 cycles then restart\n\
\n\
check process apache2 with pidfile /var/run/apache2/apache2.pid\n\
start program = "/usr/sbin/service apache2 start"\n\
stop program = "/usr/sbin/service apache2 stop"\n\
if memory usage > 500 MB for 3 cycles then restart\n\
' > /etc/monit/conf.d/gmetad-not-active
RUN echo '\
check process spark-slave with pidfile /tmp/spark-root-org.apache.spark.deploy.worker.Worker-1.pid\n\
start program = "/databricks/spark/scripts/restart-workers"\n\
stop program = "/databricks/spark/scripts/kill_worker.sh"\n\
' > /etc/monit/conf.d/spark-slave-not-active
# Add the Ganglia configuration file indicating the DocumentRoot - Databricks checks this to enable Ganglia upon cluster startup
RUN mkdir -p /etc/apache2/sites-enabled
ADD ganglia/ganglia.conf /etc/apache2/sites-enabled
RUN chmod 775 /etc/apache2/sites-enabled/ganglia.conf
ADD ganglia/gconf/* /etc/ganglia/
RUN mkdir -p /databricks/spark/scripts/ganglia/
RUN mkdir -p /databricks/spark/scripts/
ADD ganglia/start_spark_slave.sh /databricks/spark/scripts/start_spark_slave.sh

# Add the local monit shell script in the right location
RUN mkdir -p /etc/init.d
ADD scripts/monit /etc/init.d
RUN chmod 775 /etc/init.d/monit

#############
# Set up webterminal ssh
#############
FROM with-ganglia as with-webterminal

RUN wget https://github.com/tsl0922/ttyd/releases/download/1.6.3/ttyd.x86_64 && \
mkdir -p /databricks/driver/logs && \
mkdir -p /databricks/spark/scripts/ttyd/ && \
mkdir -p /etc/monit/conf.d/ && \
mv ttyd.x86_64 /databricks/spark/scripts/ttyd/ttyd && \
export TTYD_BIN_FILE=/databricks/spark/scripts/ttyd/ttyd

ENV TTYD_DIR=/databricks/spark/scripts/ttyd
ENV TTYD_BIN_FILE=$TTYD_DIR/ttyd

COPY webterminal/setup_ttyd_daemon.sh $TTYD_DIR/setup_ttyd_daemon.sh
COPY webterminal/stop_ttyd_daemon.sh $TTYD_DIR/stop_ttyd_daemon.sh
COPY webterminal/start_ttyd_daemon.sh $TTYD_DIR/start_ttyd_daemon.sh
COPY webterminal/webTerminalBashrc $TTYD_DIR/webTerminalBashrc
RUN echo '\
check process ttyd with pidfile /var/run/ttyd-daemon.pid\n\
start program = "/databricks/spark/scripts/ttyd/start_ttyd_daemon.sh"\n\
stop program = "/databricks/spark/scripts/ttyd/stop_ttyd_daemon.sh"' > /etc/monit/conf.d/ttyd-daemon-not-active

FROM with-webterminal as with-alluxio
#############
# Setup Alluxio
#############
ARG ALLUXIO_VERSION=2.8.0
ARG ALLUXIO_HOME="/opt/alluxio-${ALLUXIO_VERSION}"
ARG ALLUXIO_TAR_FILE="alluxio-${ALLUXIO_VERSION}-bin.tar.gz"
ARG ALLUXIO_DOWNLOAD_URL="https://downloads.alluxio.io/downloads/files/${ALLUXIO_VERSION}/${ALLUXIO_TAR_FILE}"

RUN wget -O /tmp/$ALLUXIO_TAR_FILE ${ALLUXIO_DOWNLOAD_URL} \
&& tar zxf /tmp/${ALLUXIO_TAR_FILE} -C /opt/ \
&& rm -f /tmp/${ALLUXIO_TAR_FILE} \
&& cp ${ALLUXIO_HOME}/client/alluxio-${ALLUXIO_VERSION}-client.jar /databricks/jars/
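# The Alluxio client jar lands in /databricks/jars (symlinked to
# /mnt/driver-daemon/jars in the base stage), so Spark should pick it up on
# the classpath without further configuration.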

#############
# Allow ubuntu user to sudo without password
#############
RUN echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu \
&& chmod 555 /etc/sudoers.d/ubuntu
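As a usage sketch (the image tag is illustrative and the build-arg values simply restate the defaults above — none of this is mandated by the commit), the container could be built from the `Databricks` directory with the ARGs overridden on the command line:

```shell
docker build \
  --build-arg CUDA_VERSION=11.5.2 \
  --build-arg JAR_FILE=rapids-4-spark_2.12-22.10.0.jar \
  --build-arg INIT_SCRIPT=init.sh \
  -t spark-rapids-databricks:22.10.0 \
  Databricks/
```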
