Feature/137 service management (#143)
* Implementation done for basic commands using ConfigMap, tests needed and DrainAndTerminate needed

* k8s state propagation working, unit test needed and stateful management of queue sub systems remains to be implemented

* Added trigger for testing purposes

* cuda discovery changes during builds

* Improved CUDA results for test builds

* Support for k8s hosted builds and tests, a TODO might be to add something like https://github.com/GoogleContainerTools/kaniko

* Get rid of github testing token and retire it from the github test account

* Fix naming convention for k8s config maps

* Add update case for configMap

* Broadcast tests implemented

* RMQ no longer leaking auth data, k8s configmap states verified as stopping processing manually, prometheus now available for counters to confirm this when test validation is written

* k8s testing filter added

* RMQ incorporated into k8s test

* relocate rmq directories to prevent server panics, push expensive cache test to kubernetes only

* Allow testing code in one package to test the queue availability in the production code of another package

* Moved queue probe test into the runner package non test section so it can be used throughout the system

* Use state updates from within the cmd runner directory not inside the internal package

* Add support for dynamic namespaces supplied by the k8s downward API

* Expand env vars inside the amqp references during a new, added cluster role bindings for k8s side testing

* Work on the k8s logic for testing

* Port number needs the bump to be converted into the admin port

* Admin timing out a lot add a timeout of our own

* Make test queue name compliant with defaults that are used by studio

* Admin timing out a lot add a timeout of our own, increase it to 15 seconds

* Header timeouts now an issue, manually specify one

* Change queue checking intervals to allow the test to detect queues and run the processing service logic

* Syntax fix

* Set initial state to running

* Added auto configuration of the exchange when testing is being used

* Debug amqp bindings in the cluster

* Debugging and testing pass

* Add the step to catch ignored passes, do some logging cleanup

* Match up the auto delete setting between the amqp client and the rabbit hole client

* Give the prometheus dimension a value

* Fix metrics names

* change success criteria

* Add checked and ignored counter to verify the test case for the state management

* Add checked and ignored counter to verify the test case for the state management

* Improve k8s test builds documentation

* Doc changes and unused var removed

* End to end test for states within k8s config maps including global and host specific versions

* Work on getting the namespace auto detected

* Add a push of a state to flush out the updates and prevent the listener going silent on a state modification that results in no update because the state was already at that value

* Add handling for when a config map is not yet created

* Fix error handling

* Additional logging for test validation

* Allow the logger to use a stringer method

* Debugging

* Debugging

* Use the main logger when running tests in the server

* When logging in the main runner don't overload the logger

* Expunge the last of the assignments to logger

* Message logging

* Version bump

* Add change log entries for the 0.8.0 release

* Add change log entries for the 0.8.0 release
karlmutch authored Sep 20, 2018
1 parent c672b33 commit a220c77
Showing 84 changed files with 85,055 additions and 335 deletions.
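
One bullet in the commit message mentions expanding environment variables inside the AMQP references. The commit page does not show how the runner does this, so the following is only a minimal sketch of the idea using the standard library's `os.Expand`. The function name `expandRefs` and the variable names `RMQ_USER`, `RMQ_PASS` and `RMQ_HOST` are assumptions made for illustration and do not come from the commit.

```go
package main

import (
	"fmt"
	"os"
)

// expandRefs substitutes $NAME and ${NAME} references inside an AMQP URL
// using the supplied lookup function. The helper and its signature are
// hypothetical, they are not taken from the runner's source.
func expandRefs(raw string, lookup func(string) string) string {
	return os.Expand(raw, lookup)
}

func main() {
	// RMQ_USER, RMQ_PASS and RMQ_HOST are assumed variable names
	tmpl := "amqp://${RMQ_USER}:${RMQ_PASS}@${RMQ_HOST}:5672/"

	// os.Getenv returns "" for unset variables, so a missing value
	// silently yields an unusable URL and may deserve an explicit check
	fmt.Println(expandRefs(tmpl, os.Getenv))
}
```

Passing the lookup as a parameter keeps the helper easy to exercise in tests without touching the process environment.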
10 changes: 10 additions & 0 deletions .gitattributes
@@ -0,0 +1,10 @@
# linguist is a package used by github to characterise the files in your code base, see https://github.com/github/linguist

# For an explanation of using these flags with the vendor directory please read https://medium.com/@clarkbw/managing-generated-files-in-github-1f1989c09dfd

vendor/**/* -diff -merge
vendor/**/* linguist-vendored

docs/* linguist-documentation

internal/datatypes/*_enumer.go linguist-generated=true
1 change: 1 addition & 0 deletions .gitignore
@@ -18,3 +18,4 @@ pkg/
src/
certs/
Dockerfile.tmp
clusters
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -3,5 +3,19 @@
IMPROVEMENTS:

* Added github templates for PRs, Issues etc. ([#133](https://github.com/SentientTechnologies/studio-go-runner/issue/133)).

* Capture artifact downloading failures and insert them into the experiments output file. ([#133](https://github.com/SentientTechnologies/studio-go-runner/issue/133)).

* Faulty GPUs with bad ECC memory now caught and will only accept CPU jobs, in addition to errors being output

# 0.8.0

IMPROVEMENTS:

* Added support for testing and non-release builds using kubernetes hosted pods, please see docs/k8s.md. Releasing from k8s hosting is a future feature.

* RabbitMQ now supported within k8s testing as a separate service within the namespace the test uses. Please see docs/k8s.md for more information.

* Multiple test cases added, many more to go but a legitimate effort is now underway given we have k8s support and are not constrained by travis.

* Config map support within kubernetes to inform pods of desired state changes: Running, Abort, Drain and suspend, Drain and terminate. Enables rolling upgrade and maintenance use cases for k8s clusters.
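
The last changelog entry lists the desired states a pod can be told about through a config map. The sketch below shows one way such states could be modeled and parsed from a config map value. It is an assumption-laden illustration: the type `State`, the constant names, `ParseState` and `AcceptsWork` are all hypothetical and derived only from the wording of the entry, and the runner's generated enumer types may look quite different.

```go
package main

import (
	"fmt"
	"strings"
)

// State is a hypothetical enumeration of the desired states named in
// the changelog entry; the real type and constant names may differ.
type State int

const (
	Unknown State = iota
	Running
	Abort
	DrainAndSuspend
	DrainAndTerminate
)

// stateNames maps each known state to the text assumed to be stored
// in the config map
var stateNames = map[State]string{
	Running:           "Running",
	Abort:             "Abort",
	DrainAndSuspend:   "DrainAndSuspend",
	DrainAndTerminate: "DrainAndTerminate",
}

// String returns the textual name of the state, or "Unknown"
func (s State) String() string {
	if n, ok := stateNames[s]; ok {
		return n
	}
	return "Unknown"
}

// ParseState converts the text held in a config map value into a State,
// ignoring case and surrounding white space
func ParseState(v string) (State, error) {
	v = strings.TrimSpace(v)
	for s, n := range stateNames {
		if strings.EqualFold(v, n) {
			return s, nil
		}
	}
	return Unknown, fmt.Errorf("unrecognized state %q", v)
}

// AcceptsWork reports whether a runner in this state should take new work
func (s State) AcceptsWork() bool {
	return s == Running
}

func main() {
	for _, v := range []string{"Running", "drainandsuspend", "bogus"} {
		s, err := ParseState(v)
		fmt.Println(s, s.AcceptsWork(), err)
	}
}
```

Treating anything other than Running as a reason to stop pulling work is a conservative default, which suits the rolling upgrade and maintenance cases the entry mentions.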
14 changes: 7 additions & 7 deletions Dockerfile
@@ -51,7 +51,12 @@ RUN groupadd -f -g ${USER_GROUP_ID} ${USER} && \
USER ${USER}
WORKDIR /home/${USER}

ENV GO_VERSION 1.10.3
ENV GO_VERSION 1.11

ENV GOPATH=/project
ENV PATH=$GOPATH/bin:$PATH
ENV PATH=$PATH:/home/${USER}/.local/bin:/home/${USER}/go/bin
ENV GOROOT=/home/${USER}/go

RUN cd /home/${USER} && \
mkdir -p /home/${USER}/go && \
@@ -63,11 +68,6 @@ RUN mkdir -p /home/${USER}/.local/bin && \
wget -q -O /home/${USER}/.local/bin/minio https://dl.minio.io/server/minio/release/linux-amd64/minio && \
chmod +x /home/${USER}/.local/bin/minio

ENV GOPATH=/project
ENV PATH=$GOPATH/bin:$PATH
ENV PATH=$PATH:/home/${USER}/.local/bin:/home/${USER}/go/bin
ENV GOROOT=/home/${USER}/go

VOLUME /project
WORKDIR /project/src/github.com/SentientTechnologies/studio-go-runner

@@ -76,4 +76,4 @@ LABEL vendor="Sentient Technologies INC" \
ai.sentient.module.version={{.duat.version}} \
ai.sentient.module.name={{.duat.module}}

CMD /bin/bash -c 'go get github.com/karlmutch/duat && go run -tags NO_CUDA build.go -r -dirs=internal && go run -tags NO_CUDA build.go -r -dirs=cmd'
CMD /bin/bash -c 'go get github.com/karlmutch/duat && go get github.com/karlmutch/enumer && dep ensure && go generate ./internal/types && go run -tags NO_CUDA build.go -r -dirs=internal && go run -tags NO_CUDA build.go -r -dirs=cmd'
87 changes: 87 additions & 0 deletions Dockerfile_full
@@ -0,0 +1,87 @@
FROM ubuntu:16.04

MAINTAINER [email protected]

ENV LANG C.UTF-8

ENV CUDA_8_DEB "https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb"
ENV CUDA_9_DEB "https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb"
ENV CUDA_PACKAGE_VERSION 8-0
ENV CUDA_FILESYS_VERSION 8.0
ENV NVIDIA_VERSION 384

RUN apt-get -y update && apt-get -y upgrade

RUN \
apt-get -y install apt-transport-https software-properties-common wget openssl ssh curl jq apt-utils && \
apt-get -y install make git gcc && apt-get clean

RUN cd /tmp && \
wget -q -O /tmp/cuda_8.deb ${CUDA_8_DEB} && \
dpkg -i /tmp/cuda_8.deb && \
apt-get -y update && \
DEBIAN_FRONTEND=noninteractive apt-get -y install --no-install-recommends nvidia-cuda-dev cuda-nvml-dev-${CUDA_PACKAGE_VERSION} && \
rm /tmp/cuda*.deb && \
apt-get clean

#wget --quiet -O /tmp/cuda_9.deb ${CUDA_9_DEB} && \
#dpkg -i /tmp/cuda_9.deb && \
# apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub && \
#apt-get -y update && \
#DEBIAN_FRONTEND=noninteractive apt-get -y install --no-install-recommends cuda-runtime-9-2 && \
#rm /tmp/cuda*.deb

RUN \
ln -s /usr/local/cuda-${CUDA_FILESYS_VERSION} /usr/local/cuda && \
ln -s /usr/local/cuda/targets/x86_64-linux/include /usr/local/cuda/include && \
ln -s /usr/lib/nvidia-${NVIDIA_VERSION} /usr/lib/nvidia && \
apt-get clean && \
apt-get autoremove


ENV GO_VERSION 1.11

RUN mkdir -p /project/go && \
mkdir -p /project/src/github.com/SentientTechnologies && \
cd /project && \
wget -q -O /tmp/go.tgz https://storage.googleapis.com/golang/go${GO_VERSION}.linux-amd64.tar.gz && \
tar xzf /tmp/go.tgz && \
rm /tmp/go.tgz

RUN mkdir -p /project/.local/bin && \
wget -q -O /project/.local/bin/minio https://dl.minio.io/server/minio/release/linux-amd64/minio && \
chmod +x /project/.local/bin/minio

# Install RabbitMQ, originally from https://github.com/dockerfile/rabbitmq/blob/master/Dockerfile
RUN wget -q -O - 'https://dl.bintray.com/rabbitmq/Keys/rabbitmq-release-signing-key.asc' | apt-key add - && \
echo "deb https://dl.bintray.com/rabbitmq/debian xenial main erlang" | tee /etc/apt/sources.list.d/bintray.rabbitmq.list && \
apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y rabbitmq-server && \
rabbitmq-plugins enable rabbitmq_management && \
echo "[{rabbit, [{loopback_users, []}]}]." > /etc/rabbitmq/rabbitmq.config && \
mkdir -p /data

ENV RABBITMQ_LOG_BASE /data/log
ENV RABBITMQ_MNESIA_BASE /data/mnesia

ENV GOPATH=/project
ENV PATH=$GOPATH/bin:$PATH
ENV PATH=$PATH:/project/.local/bin:/project/go/bin
ENV GOROOT=/project/go

ENV LOGXI='*=INF'
ENV LOGXI_FORMAT='happy,maxcol=1024'

WORKDIR /project/src/github.com/SentientTechnologies

RUN mkdir $GOPATH/bin && \
(curl https://raw.githubusercontent.com/golang/dep/master/install.sh | sh) && \
git config --global url."git://github.com".insteadOf "https://github.com" && \
go get github.com/karlmutch/enumer

CMD /bin/bash -c 'git clone https://github.com/SentientTechnologies/studio-go-runner.git && cd studio-go-runner && ( [[ -n $GIT_BRANCH ]] && git checkout $GIT_BRANCH ) && git branch && dep ensure && go generate ./internal/types && go run build.go -r -dirs=internal && go run build.go -r -dirs=cmd'

# Done last to prevent lots of disruption when bumping versions
LABEL vendor="Sentient Technologies INC Open Source" \
ai.sentient.module.version={{.duat.version}} \
ai.sentient.module.name={{.duat.module}}
32 changes: 32 additions & 0 deletions Gopkg.lock

5 changes: 5 additions & 0 deletions Gopkg.toml
@@ -34,6 +34,11 @@ ignored = ["github.com/Sirupsen/logrus"]
name = "github.com/karlmutch/duat"
branch="master"

# K8s API Version 1.10
[[override]]
name = "github.com/ericchiang/k8s"
version="v1.1.0"

[prune]
go-tests = true
unused-packages = true
4 changes: 2 additions & 2 deletions README.md
@@ -1,8 +1,8 @@
# studio-go-runner

Version: <repo-version>0.7.1</repo-version>
Version: <repo-version>0.8.0</repo-version>

[![Build Status](https://travis-ci.org/SentientTechnologies/studio-go-runner.svg?branch=master)](https://travis-ci.org/SentientTechnologies/studio-go-runner) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/SentientTechnologies/studio-go-runner/blob/master/LICENSE) [![Go Report Card](https://goreportcard.com/badge/SentientTechnologies/studio-go-runner)](https://goreportcard.com/report/SentientTechnologies/studio-go-runner)
[![Build Status](https://travis-ci.org/SentientTechnologies/studio-go-runner.svg?branch=master)](https://travis-ci.org/SentientTechnologies/studio-go-runner) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/SentientTechnologies/studio-go-runner/blob/master/LICENSE) [![Go Report Card](https://goreportcard.com/badge/SentientTechnologies/studio-go-runner)](https://goreportcard.com/report/SentientTechnologies/studio-go-runner)[![DepShield Badge](https://depshield.sonatype.org/badges/SentientTechnologies/studio-go-runner/depshield.svg)](https://depshield.github.io)

studio-go-runner is an implementation of a studioml runner, in addition to any other Python derived workloads.

71 changes: 51 additions & 20 deletions build.go
@@ -108,16 +108,16 @@ func main() {
for _, dir := range dirs {
if dir != lastSeen {
deDup = append(deDup, dir)
lastSeen = dir
}
}
dirs = deDup
}

logger.Debug(fmt.Sprintf("dirs %v", dirs))

// Take the discovered directories and build them
//

// Take the discovered directories and build them from a deduped
// directory set
for _, dir := range dirs {
if _, err = runBuild(dir, "README.md"); err != nil {
logger.Warn(err.Error())
@@ -177,32 +177,35 @@ func runBuild(dir string, verFn string) (outputs []string, err errors.Error) {
// Are we running inside a container runtime such as docker
runtime, err := md.ContainerRuntime()
if err != nil {
return nil, err
return outputs, err
}

// If we are in a container then do a stock compile, if not then it is
// time to dockerize all the things
if len(runtime) != 0 {
logger.Info(fmt.Sprintf("building %s", dir))
outputs, err = build(md)
if err != nil {
return outputs, err
}
}

if err == nil && !*imageOnly {
logger.Info(fmt.Sprintf("testing %s", dir))
out, errs := test(md)
outputs = append(outputs, out...)
if len(errs) != 0 {
return nil, errs[0]
return outputs, errs[0]
}
outputs = append(outputs, out...)
}

if len(runtime) == 0 {
// Don't dockerize in the main root directory of a project. The root
// dir Dockerfile is typically for a project's build container.
if dir != "." {
logger.Info(fmt.Sprintf("dockerizing %s", dir))
if err := dockerize(md); err != nil {
return nil, err
if err = dockerize(md); err != nil {
return outputs, err
}
// Check for a bin directory and continue if none
if _, errGo := os.Stat("./bin"); errGo == nil {
@@ -211,10 +214,6 @@ func runBuild(dir string, verFn string) (outputs []string, err errors.Error) {
}
}

if err != nil {
return nil, err
}

return outputs, err
}

@@ -312,9 +311,16 @@ func build(md *duat.MetaData) (outputs []string, err errors.Error) {
}

func CudaPresent() bool {
// Get any default directories from the linux env var that is used for shared libraries
libPaths := strings.Split(os.Getenv("LD_LIBRARY_PATH"), ":")
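filepath.Walk("/usr/lib", func(path string, info os.FileInfo, err error) error {
// info can be nil when err is set, so skip entries that could not be read
if err != nil {
return nil
}
if info.IsDir() {
libPaths = append(libPaths, path)
}
return nil
})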
for _, aPath := range libPaths {
if _, errGo := os.Stat(filepath.Join(aPath, "libnvidia-ml.so.1")); errGo == nil {
if _, errGo := os.Stat(filepath.Join(aPath, "libcuda.so.1")); errGo == nil {
return true
}
}
@@ -333,6 +339,26 @@ func GPUPresent() bool {

}

func k8sPod() (isPod bool, err errors.Error) {

fn := "/proc/self/mountinfo"

contents, errGo := ioutil.ReadFile(fn)
if errGo != nil {
return false, errors.Wrap(errGo).With("stack", stack.Trace().TrimRuntime()).With("file", fn)
}
for _, aMount := range strings.Split(string(contents), "\n") {
fields := strings.Split(aMount, " ")
// For information about the individual fields c.f. https://www.kernel.org/doc/Documentation/filesystems/proc.txt
if len(fields) > 5 {
if fields[4] == "/run/secrets/kubernetes.io/serviceaccount" {
return true, nil
}
}
}
return false, nil
}
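
The `k8sPod` function above decides whether the build is running inside a Kubernetes pod by looking for the service account mount point in the fifth field of each `/proc/self/mountinfo` entry. A minimal sketch of the same check, separated from file access so it can be exercised with sample text, might look as follows. The helper name `hasMount` and the sample mountinfo lines are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceAccountMount is the mount point that k8sPod in the diff looks for
const serviceAccountMount = "/run/secrets/kubernetes.io/serviceaccount"

// hasMount reports whether mountinfo formatted text contains an entry
// whose mount point, the fifth space separated field, equals target.
// This is a hypothetical helper, not part of the commit.
func hasMount(mountinfo string, target string) bool {
	for _, line := range strings.Split(mountinfo, "\n") {
		fields := strings.Fields(line)
		if len(fields) > 4 && fields[4] == target {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative entries only, not captured from a real pod
	sample := "36 35 98:0 / /proc rw,relatime - proc proc rw\n" +
		"40 35 0:45 / /run/secrets/kubernetes.io/serviceaccount ro,relatime - tmpfs tmpfs rw"

	fmt.Println(hasMount(sample, serviceAccountMount)) // prints "true"
}
```

Keeping the parsing apart from the `ioutil.ReadFile` call means the detection logic can be unit tested on machines that are not pods.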

// test inspects directories within the project that contain test cases, implemented
// using the standard go build _test.go file names, and runs those tests that
// the hardware provides support for
@@ -343,17 +369,22 @@ func test(md *duat.MetaData) (outputs []string, errs []errors.Error) {
"-a",
"-v",
}
tags := []string{}

if !GPUPresent() {
opts = append(opts, "--no-gpu")
// Look for an indication that Kubernetes is present and, if it is
// not, restrict the run to the short tests
sPod, _ := k8sPod()
if !sPod {
opts = append(opts, "-test.short")
} else {
opts = append(opts, "--use-k8s")
}

tags := []string{}

// Look for CUDA Hardware and set the build flags for the tests based
// on its presence
if !CudaPresent() {
if !GPUPresent() {
// Look for GPU Hardware and set the build flags for the tests based
// on its presence
tags = append(tags, "NO_CUDA")
opts = append(opts, "--no-gpu")
}

// Go through the directories looking for test files
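
Because the page has lost the added and removed line markers, the old and new versions of the option handling in `test` are interleaved above. As far as the flattened hunk can be read, the new logic chooses `-test.short` outside a pod and `--use-k8s` inside one, then adds the `NO_CUDA` tag and `--no-gpu` when no GPU is found. The sketch below restates that reading as a pure function; `testFlags` and its boolean parameters are hypothetical stand-ins for the calls to `k8sPod` and `GPUPresent`.

```go
package main

import "fmt"

// testFlags returns the go test options and build tags for a given
// environment. It is an interpretation of the flattened diff, not the
// code from the commit.
func testFlags(inPod bool, gpu bool) (opts []string, tags []string) {
	opts = []string{"-a", "-v"}
	tags = []string{}

	// Only run the full, Kubernetes dependent tests inside a pod
	if inPod {
		opts = append(opts, "--use-k8s")
	} else {
		opts = append(opts, "-test.short")
	}

	// Without GPU hardware build the tests without CUDA support
	if !gpu {
		tags = append(tags, "NO_CUDA")
		opts = append(opts, "--no-gpu")
	}
	return opts, tags
}

func main() {
	opts, tags := testFlags(false, false)
	fmt.Println(opts, tags) // prints "[-a -v -test.short --no-gpu] [NO_CUDA]"
}
```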