Validate Deployment

Index:

1 Check Drivers
- 1.1 Check Drivers service's log
- 1.2 Check Drivers version
2 Data path check
3 Admin Account in Webportal
4 Troubleshooting OpenPAI services
5 Troubleshooting Kubernetes Clusters
6 Getting help

1 Check Drivers

1.1 Check Drivers service's log

Dashboard:

http://<master>:9090

Search for "driver" and check the driver status.

Then open the driver logs. The log should show that the driver is in a healthy state.

1.2 Check Drivers version

# (1) Find the driver container on the server
~$ sudo docker ps | grep driver

daeaa9a81d3f        aiplatform/drivers                                    "/bin/sh -c ./inst..."   8 days ago          Up 8 days                                    k8s_nvidia-drivers_drivers-one-shot-d7fr4_default_9d91059c-9078-11e8-8aea-000d3ab5296b_0
ccf53c260f6f        gcr.io/google_containers/pause-amd64:3.0              "/pause"                 8 days ago          Up 8 days                                    k8s_POD_drivers-one-shot-d7fr4_default_9d91059c-9078-11e8-8aea-000d3ab5296b_0

# (2) Log in to the driver container

~$ sudo docker exec -it daeaa9a81d3f /bin/bash

# (3) Check the driver version

root@~/drivers# nvidia-smi
Fri Aug  3 01:53:04 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000460D:00:00.0 Off |                    0 |
| N/A   31C    P8    31W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
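When several nodes need checking, reading the version out of the banner by hand gets tedious. The sketch below pulls the "Driver Version" field out of nvidia-smi output supplied on stdin. The function name parse_driver_version is made up for illustration, and the pattern assumes the banner layout shown above.

```shell
# Extract the driver version from nvidia-smi banner text read on stdin.
# Assumes the banner contains "Driver Version: <digits and dots>" as in the
# sample output above; prints only the first match.
parse_driver_version() {
  sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Example usage (the container ID is the one from the listing above):
#   sudo docker exec daeaa9a81d3f nvidia-smi | parse_driver_version
```

The parser reads stdin instead of calling nvidia-smi itself, so the same function works inside the container, through docker exec, or on saved output.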

2 Data path check

The data path is set by cluster.common.data-path in services-configuration.yaml. The default value is /datastorage.

# SSH to the master machine

~$ ls /datastorage

hadooptmp  hdfs  launcherlogs  prometheus  yarn  zoodata
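The same check can be scripted so that a missing directory is reported explicitly. This is a minimal sketch: the function name check_data_path is hypothetical, and the list of expected subdirectories is simply copied from the sample listing above, so adjust it to the services you actually deployed.

```shell
# Report any expected subdirectory that is missing under the data path.
# The directory names come from the sample listing above (an assumption,
# not an authoritative list). Returns 0 if all exist, 1 otherwise.
check_data_path() {
  data_path="${1:-/datastorage}"
  missing=0
  for d in hadooptmp hdfs launcherlogs prometheus yarn zoodata; do
    if [ ! -d "$data_path/$d" ]; then
      echo "missing: $data_path/$d"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage on the master machine:
#   check_data_path /datastorage && echo "data path looks complete"
```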

3 Admin Account in Webportal

Dashboard:

http://<master>:9286/virtual-clusters.html

Try to log in:

Note: The username and password are configured in the rest-server field of services-configuration.yaml.

4 Troubleshooting OpenPAI services

4.1 Diagnosing the problem

Monitor

From the Kubernetes web portal:

Dashboard:

http://<master>:9090

From OpenPAI watchdog:

OpenPAI watchdog

Log

From the Kubernetes web portal:

From the container and pod log files on each node:

View container logs under this folder:

ls /var/log/containers

View pod logs under this folder:

ls /var/log/pods
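On a busy node these folders hold many files, so it helps to narrow the listing to one service. The sketch below lists log files whose names contain a given service name. The function name find_service_logs is hypothetical, and it assumes that the log file names include the pod or container name, which is how Kubernetes usually names files under /var/log/containers.

```shell
# List log files in a log directory whose names contain a service name.
# Arguments: service name, then an optional directory (default is
# /var/log/containers). Assumes file names embed the pod/container name.
find_service_logs() {
  service="$1"
  log_dir="${2:-/var/log/containers}"
  find "$log_dir" -maxdepth 1 -name "*${service}*.log" | sort
}

# Example usage on a node:
#   find_service_logs rest-server
#   sudo tail -n 50 "$(find_service_logs rest-server | head -n 1)"
```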

Debug

Because OpenPAI services are deployed on Kubernetes, please refer to the Kubernetes guide on debugging pods.

4.2 Fix problem

Update OpenPAI Configuration

Check and refine the four YAML files:

    - layout.yaml
    - kubernetes-configuration.yaml
    - k8s-role-definition.yaml
    - services-configuration.yaml

Customize config for specific service

To customize a single service, you can find both its config file and its image Dockerfile under src.

Update Code & Image
- Customize the image Dockerfile or code

Each service's image Dockerfile is under src and can be customized there.

Rebuild image

Run the following commands:

Build docker image

    paictl.py image build -p /path/to/configuration/ [ -n image-x ]

Push docker image

    paictl.py image push -p /path/to/configuration/ [ -n image-x ]

If the -n parameter is specified, only the given image (e.g. rest-server, webportal, watchdog) will be built / pushed.
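The build and push commands above are usually run back to back, and the push only makes sense if the build succeeded. The sketch below chains them with that condition. The function name rebuild_image and the PAICTL override variable are hypothetical, added for illustration; the paictl.py arguments are the ones shown above.

```shell
# Build an image and push it only if the build succeeded.
# Arguments: configuration path, then an optional image name (-n).
# PAICTL is a hypothetical override for the command, e.g. PAICTL=echo for a
# dry run or PAICTL="python paictl.py" to invoke through the interpreter.
rebuild_image() {
  config_path="$1"
  image_name="${2:-}"
  paictl="${PAICTL:-paictl.py}"
  if [ -n "$image_name" ]; then
    $paictl image build -p "$config_path" -n "$image_name" &&
      $paictl image push -p "$config_path" -n "$image_name"
  else
    $paictl image build -p "$config_path" &&
      $paictl image push -p "$config_path"
  fi
}

# Example usage:
#   rebuild_image /path/to/configuration/ rest-server
```

Setting PAICTL=echo first prints the two commands without running them, which is a cheap way to confirm the arguments before a long build.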

4.3 Reboot service

Stop single or all services.

python paictl.py service stop \
  [ -c /path/to/kubeconfig ] \
  [ -n service-list ]

If the -n parameter is specified, only the given services, e.g. rest-server, webportal, watchdog, etc., will be stopped. If not, all PAI services will be stopped.

Boot up a single service or all OpenPAI services.

Please refer to this section for details.

5 Troubleshooting Kubernetes Clusters

Please refer to the Kubernetes Troubleshoot Clusters guide.

6 Getting help

Stack Overflow: If you have questions about OpenPAI, please ask them on Stack Overflow under the tag openpai.
Report an issue: If you have an issue, a bug report, or a feature request, please submit it on GitHub.
