Merge pull request microsoft#192 from microsoft/master

merge master
justrypython · Jul 23, 2019 · c76068f
2 parents 1bd2012 + a5fa235
Showing 14 changed files with 799 additions and 399 deletions.
diff --git a/docs/en_US/AdvancedFeature/MultiPhase.md b/docs/en_US/AdvancedFeature/MultiPhase.md
@@ -32,9 +32,31 @@ It is pretty simple to use multi-phase in trial code, an example is shown below:
     # ...
     ```
 
-__2. Modify experiment configuration__
+__2. Experiment configuration__
 
-To enable multi-phase, you should also add `multiPhase: true` in your experiment YAML configure file. If this line is not added, `nni.get_next_parameter()` would always return the same configuration. For all the built-in tuners/advisors, you can use multi-phase in your trial code without modification of tuner/advisor spec in the YAML configure file.
+To enable multi-phase, add `multiPhase: true` to your experiment YAML configuration file. If this line is missing, `nni.get_next_parameter()` always returns the same configuration.
+
+Multi-phase experiment configuration example:
+
+```yaml
+authorName: default
+experimentName: multiphase experiment
+trialConcurrency: 2
+maxExecDuration: 1h
+maxTrialNum: 8
+trainingServicePlatform: local
+searchSpacePath: search_space.json
+multiPhase: true
+useAnnotation: false
+tuner:
+  builtinTunerName: TPE
+  classArgs:
+    optimize_mode: maximize
+trial:
+  command: python3 mytrial.py
+  codeDir: .
+  gpuNum: 0
+```
 
 ### Write a tuner that leverages multi-phase:
 
@@ -48,6 +70,9 @@ trial_end
 ```
 With this information, the tuner knows which trial is requesting a configuration and which trial is reporting results. This gives your tuner enough flexibility to handle different trials and different phases. For example, you can use the `trial_job_id` parameter of the `generate_parameters` method to generate hyperparameters for a specific trial job.
 
-Of course, to use your multi-phase tuner, __you should add `multiPhase: true` in your experiment YAML configure file__.
+### Tuners that support multi-phase experiments:
+
+[TPE](../Tuner/HyperoptTuner.md), [Random](../Tuner/HyperoptTuner.md), [Anneal](../Tuner/HyperoptTuner.md), [Evolution](../Tuner/EvolutionTuner.md), [SMAC](../Tuner/SmacTuner.md), [NetworkMorphism](../Tuner/NetworkmorphismTuner.md), [MetisTuner](../Tuner/MetisTuner.md), [BOHB](../Tuner/BohbAdvisor.md), [Hyperband](../Tuner/HyperbandAdvisor.md), [ENAS tuner](https://github.com/countif/enas_nni/blob/master/nni/examples/tuners/enas/nni_controller_ptb.py).
 
-[ENAS tuner](https://github.com/countif/enas_nni/blob/master/nni/examples/tuners/enas/nni_controller_ptb.py) is an example of a multi-phase tuner.
+### Training services that support multi-phase experiments:
+[Local Machine](../TrainingService/LocalMode.md), [Remote Servers](../TrainingService/RemoteMachineMode.md), [OpenPAI](../TrainingService/PaiMode.md)
diff --git a/docs/en_US/Release.md b/docs/en_US/Release.md
@@ -1,5 +1,6 @@
 # ChangeLog
 
+
 ## Release 0.9 - 7/1/2019
 
 ### Major Features
@@ -95,18 +96,18 @@
 
 ### Major Features
 
-* [Version checking](https://github.com/Microsoft/nni/blob/master/docs/en_US/PaiMode.md#version-check)
+* [Version checking](TrainingService/PaiMode.md)
   * check whether the version is consistent between nniManager and trialKeeper
-* [Report final metrics for early stop job](https://github.com/Microsoft/nni/issues/776)
+* [Report final metrics for early stop job](https://github.com/microsoft/nni/issues/776)
   * If includeIntermediateResults is true, the last intermediate result of the trial that is early stopped by assessor is sent to tuner as final result. The default value of includeIntermediateResults is false.
-* [Separate Tuner/Assessor](https://github.com/Microsoft/nni/issues/841)
+* [Separate Tuner/Assessor](https://github.com/microsoft/nni/issues/841)
   * Adds two pipes to separate message receiving channels for tuner and assessor.
 * Make log collection feature configurable
 * Add intermediate result graph for all trials
 
 ### Bug fix
 
-* [Add shmMB config key for OpenPAI](https://github.com/Microsoft/nni/issues/842)
+* [Add shmMB config key for OpenPAI](https://github.com/microsoft/nni/issues/842)
 * Fix the bug that doesn't show any result if metrics is dict
 * Fix the number calculation issue for float types in hyperband
 * Fix a bug in the search space conversion in SMAC tuner
@@ -121,8 +122,8 @@
 
 ### Documentation
 * Chinese version document: https://nni.readthedocs.io/zh/latest/
-* Debuggability/serviceability document: https://nni.readthedocs.io/en/latest/HowToDebug.html
-* Tuner assessor reference: https://nni.readthedocs.io/en/latest/sdk_reference.html#tuner
+* Debuggability/serviceability document: https://nni.readthedocs.io/en/latest/Tutorial/HowToDebug.html
+* Tuner assessor reference: https://nni.readthedocs.io/en/latest/sdk_reference.html
 
 ### Bug Fixes and Other Changes
 * Fix a race condition bug that does not store trial job cancel status correctly.
@@ -134,8 +135,8 @@
 
 ## Release 0.5.1 - 1/31/2019
 ### Improvements
-* Making [log directory](https://github.com/Microsoft/nni/blob/v0.5.1/docs/en_US/ExperimentConfig.md) configurable
-* Support [different levels of logs](https://github.com/Microsoft/nni/blob/v0.5.1/docs/en_US/ExperimentConfig.md), making it easier for debugging
+* Making [log directory](https://github.com/microsoft/nni/blob/v0.5.1/docs/ExperimentConfig.md) configurable
+* Support [different levels of logs](https://github.com/microsoft/nni/blob/v0.5.1/docs/ExperimentConfig.md), making it easier for debugging
 
 ### Documentation
 * Reorganized documentation & New Homepage Released: https://nni.readthedocs.io/en/latest/
@@ -200,16 +201,16 @@
 
 ### New examples
 
-* [FashionMnist](https://github.com/Microsoft/nni/tree/master/examples/trials/network_morphism), work together with network morphism tuner
-* [Distributed MNIST example](https://github.com/Microsoft/nni/tree/master/examples/trials/mnist-distributed-pytorch) written in PyTorch
+* [FashionMnist](https://github.com/microsoft/nni/tree/master/examples/trials/network_morphism), work together with network morphism tuner
+* [Distributed MNIST example](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-distributed-pytorch) written in PyTorch
 
 ## Release 0.4 - 12/6/2018
 
 ### Major Features
 
 * [Kubeflow Training service](TrainingService/KubeflowMode.md)
   * Support tf-operator
-  * [Distributed trial example](https://github.com/Microsoft/nni/tree/master/examples/trials/mnist-distributed/dist_mnist.py) on Kubeflow
+  * [Distributed trial example](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-distributed/dist_mnist.py) on Kubeflow
 * [Grid search tuner](Tuner/GridsearchTuner.md)
 * [Hyperband tuner](Tuner/HyperbandAdvisor.md)
 * Support launching NNI experiments on macOS
@@ -256,7 +257,7 @@
   Each trial job is allocated a unique sequence number, which can be retrieved by nni.get_sequence_id() API.
 
   ```bash
-  git clone -b v0.3 https://github.com/Microsoft/nni.git
+  git clone -b v0.3 https://github.com/microsoft/nni.git
   ```
 
 * **nni.report_final_result(result)** API supports more data types for result parameter.
@@ -278,20 +279,19 @@
   docker pull msranni/nni:latest
   ```
 
-* New trial example: [NNI Sklearn Example](https://github.com/Microsoft/nni/tree/master/examples/trials/sklearn)
-* New competition example: [Kaggle Competition TGS Salt Example](https://github.com/Microsoft/nni/tree/master/examples/trials/kaggle-tgs-salt)
+* New trial example: [NNI Sklearn Example](https://github.com/microsoft/nni/tree/master/examples/trials/sklearn)
+* New competition example: [Kaggle Competition TGS Salt Example](https://github.com/microsoft/nni/tree/master/examples/trials/kaggle-tgs-salt)
 
 ### Others
 
 * UI refactoring, refer to [WebUI doc](Tutorial/WebUI.md) for how to work with the new UI.
 * Continuous Integration: NNI has switched to Azure Pipelines
-* [Known Issues in release 0.3.0](https://github.com/Microsoft/nni/labels/nni030knownissues).
 
 ## Release 0.2.0 - 9/29/2018
 
 ### Major Features
 
-* Support [OpenPAI](https://github.com/Microsoft/pai) Training Platform (See [here](TrainingService/PaiMode.md) for instructions about how to submit NNI job in pai mode)
+* Support [OpenPAI](https://github.com/microsoft/pai) Training Platform (See [here](TrainingService/PaiMode.md) for instructions about how to submit NNI job in pai mode)
   * Support training services on pai mode. NNI trials will be scheduled to run on OpenPAI cluster
   * NNI trial's output (including logs and model file) will be copied to OpenPAI HDFS for further debugging and checking
 * Support [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) tuner (See [here](Tuner/SmacTuner.md) for instructions about how to use SMAC tuner)
@@ -301,9 +301,6 @@
   * Update ga squad example and related documentation
   * WebUI UX small enhancement and bug fix
 
-### Known Issues
-
-[Known Issues in release 0.2.0](https://github.com/Microsoft/nni/labels/nni020knownissues).
 
 ## Release 0.1.0 - 9/10/2018 (initial release)
 
@@ -327,6 +324,3 @@ Initial release of Neural Network Intelligence (NNI).
 * Others
   * Support simple GPU job scheduling
 
-### Known Issues
-
-[Known Issues in release 0.1.0](https://github.com/Microsoft/nni/labels/nni010knownissues).
diff --git a/src/nni_manager/core/nniDataStore.ts b/src/nni_manager/core/nniDataStore.ts
@@ -141,6 +141,7 @@ class NNIDataStore implements DataStore {
 
     public async getTrialJob(trialJobId: string): Promise<TrialJobInfo> {
         const trialJobs: TrialJobInfo[] = await this.queryTrialJobs(undefined, trialJobId);
+        assert(trialJobs.length <= 1);
 
         return trialJobs[0];
     }
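
The assertion added to `getTrialJob` encodes the assumption that a trial job ID matches at most one stored record. A minimal sketch of that invariant follows. It is synchronous for brevity, `queryTrialJobs` is a hypothetical in-memory stand-in for the real data store query, and `TrialJobInfo` is trimmed to two fields, so this is an illustration rather than NNI's implementation:

```typescript
// Trimmed-down stand-in for NNI's TrialJobInfo; only the fields used here.
interface TrialJobInfo {
    id: string;
    status: string;
}

// Hypothetical in-memory records; the real data store queries a database.
const records: TrialJobInfo[] = [
    { id: 'a1', status: 'SUCCEEDED' },
    { id: 'b2', status: 'RUNNING' }
];

// Hypothetical query helper: returns every record whose id matches.
function queryTrialJobs(trialJobId: string): TrialJobInfo[] {
    return records.filter((job: TrialJobInfo) => job.id === trialJobId);
}

// Same shape as the patched method: fail loudly on duplicate records.
// An unknown id yields undefined, which the return type makes explicit.
function getTrialJob(trialJobId: string): TrialJobInfo | undefined {
    const trialJobs: TrialJobInfo[] = queryTrialJobs(trialJobId);
    if (trialJobs.length > 1) {
        throw new Error(`Expected at most one trial job for id ${trialJobId}, found ${trialJobs.length}`);
    }

    return trialJobs[0];
}
```

Note that `trialJobs.length <= 1` still lets an empty result through, so a caller of the patched method may receive `undefined` for an unknown ID even though the declared return type is `Promise<TrialJobInfo>`.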

diff --git a/src/nni_manager/core/nnimanager.ts b/src/nni_manager/core/nnimanager.ts
@@ -242,10 +242,8 @@ class NNIManager implements Manager {
         });
     }
 
-    public getTrialJob(trialJobId: string): Promise<TrialJobDetail> {
-        return Promise.resolve(
-            this.trainingService.getTrialJob(trialJobId)
-        );
+    public getTrialJob(trialJobId: string): Promise<TrialJobInfo> {
+        return this.dataStore.getTrialJob(trialJobId);
     }
 
     public async setClusterMetadata(key: string, value: string): Promise<void> {

diff --git a/src/nni_manager/core/test/mockedDatastore.ts b/src/nni_manager/core/test/mockedDatastore.ts
@@ -221,7 +221,12 @@ class MockedDataStore implements DataStore {
     }
 
     public getTrialJob(trialJobId: string): Promise<TrialJobInfo> {
-        throw new Error("Method not implemented.");
+        return Promise.resolve({
+            id: '1234',
+            status: 'SUCCEEDED',
+            startTime: Date.now(),
+            endTime: Date.now()
+        });
     }
 
     private async getFinalMetricData(trialJobId: string): Promise<any> {

diff --git a/src/nni_manager/package.json b/src/nni_manager/package.json
@@ -56,7 +56,11 @@
   },
   "resolutions": {
     "mem": "^4.0.0",
-    "handlebars": "^4.1.0"
+    "handlebars": "^4.1.0",
+    "lodash": "^4.17.13",
+    "lodash.merge": "^4.6.2",
+    "node.extend": "^1.1.7",
+    "hoek": "^4.2.1"
   },
   "engines": {
     "node": ">=10.0.0"

diff --git a/...ger/training_service/kubernetes/frameworkcontroller/frameworkcontrollerTrainingService.ts b/...ger/training_service/kubernetes/frameworkcontroller/frameworkcontrollerTrainingService.ts
@@ -201,19 +201,26 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
             throw new Error('Kubeflow Cluster config is not initialized');
         }
 
+        if (this.fcTrialConfig === undefined) {
+            throw new Error('frameworkcontroller trial config is not initialized');
+        }
+
         let trialJobOutputUrl: string = '';
 
         if (this.fcClusterConfig.storageType === 'azureStorage') {
             if (this.azureStorageClient === undefined) {
                 throw new Error('azureStorageClient is not initialized');
             }
             try {
-                //upload local files to azure storage
+                //upload local files, including scripts for running the trial and configuration (e.g., hyperparameters) for the trial, to azure storage
                 await AzureStorageClientUtility.uploadDirectory(
                     this.azureStorageClient, `nni/${getExperimentId()}/${trialJobId}`, this.azureStorageShare, `${trialLocalTempFolder}`);
+                //upload code files to azure storage
+                await AzureStorageClientUtility.uploadDirectory(
+                    this.azureStorageClient, `nni/${getExperimentId()}/${trialJobId}`, this.azureStorageShare, `${this.fcTrialConfig.codeDir}`);
 
-                trialJobOutputUrl = `https://${this.azureStorageAccountName}.file.core.windows.net/\
-                ${this.azureStorageShare}/${path.join('nni', getExperimentId(), trialJobId, 'output')}`;
+                trialJobOutputUrl = `https://${this.azureStorageAccountName}.file.core.windows.net/` + 
+                                    `${this.azureStorageShare}/${path.join('nni', getExperimentId(), trialJobId, 'output')}`;
             } catch (error) {
                 this.log.error(error);
 
@@ -226,7 +233,8 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
             await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}/nni/${getExperimentId()}/${trialJobId}`);
             // Copy code files from local dir to NFS mounted dir
             await cpp.exec(`cp -r ${trialLocalTempFolder}/* ${this.trialLocalNFSTempFolder}/nni/${getExperimentId()}/${trialJobId}/.`);
-
+            // Copy codeDir to NFS mounted dir
+            await cpp.exec(`cp -r ${this.fcTrialConfig.codeDir}/* ${this.trialLocalNFSTempFolder}/nni/${getExperimentId()}/${trialJobId}/.`);
             const nfsConfig: NFSConfig = nfsFrameworkControllerClusterConfig.nfs;
             trialJobOutputUrl = `nfs://${nfsConfig.server}:${path.join(nfsConfig.path, 'nni', getExperimentId(), trialJobId, 'output')}`;
         }
@@ -257,13 +265,12 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
             throw new Error('frameworkcontroller trial config is not initialized');
         }
 
-        await cpp.exec(`mkdir -p ${path.dirname(trialLocalTempFolder)}`);
-        await cpp.exec(`cp -r ${this.fcTrialConfig.codeDir} ${trialLocalTempFolder}`);
+        await cpp.exec(`mkdir -p ${trialLocalTempFolder}`);
+
         const installScriptContent : string = CONTAINER_INSTALL_NNI_SHELL_FORMAT;
         // Write NNI installation file to local tmp files
         await fs.promises.writeFile(path.join(trialLocalTempFolder, 'install_nni.sh'), installScriptContent, { encoding: 'utf8' });
         // Create tmp trial working folder locally.
-        await cpp.exec(`mkdir -p ${trialLocalTempFolder}`);
 
         for (const taskRole of this.fcTrialConfig.taskRoles) {
             const runScriptContent: string =
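
The `trialJobOutputUrl` change in this file replaces a backslash line continuation inside a template literal with concatenation of two literals. In a template literal, a backslash followed by a newline drops the line break but keeps the indentation of the next line, so the old form put spaces in the URL. A small sketch, using made-up account and share names:

```typescript
// Hypothetical values, for illustration only.
const accountName: string = 'myaccount';
const shareName: string = 'myshare';

// Backslash-newline removes the line break, but the indentation that
// follows is still part of the resulting string.
const withContinuation: string = `https://${accountName}.file.core.windows.net/\
    ${shareName}/output`;

// Concatenating two literals keeps source indentation out of the value.
const withConcat: string = `https://${accountName}.file.core.windows.net/` +
                           `${shareName}/output`;

console.log(JSON.stringify(withContinuation));
console.log(JSON.stringify(withConcat));
```

The first value contains the four indentation spaces between the host and the share name. The second is a clean URL.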

diff --git a/src/nni_manager/training_service/kubernetes/kubeflow/kubeflowTrainingService.ts b/src/nni_manager/training_service/kubernetes/kubeflow/kubeflowTrainingService.ts
@@ -201,6 +201,10 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
             throw new Error('Kubeflow Cluster config is not initialized');
         }
 
+        if (this.kubeflowTrialConfig === undefined) {
+            throw new Error('Kubeflow Trial config is not initialized');
+        }
+
         let trialJobOutputUrl: string = '';
 
         assert(this.kubeflowClusterConfig.storage === undefined
@@ -212,13 +216,17 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
                 throw new Error('azureStorageClient is not initialized');
             }
             try {
-                //upload local files to azure storage
+                //upload local files, including scripts for running the trial and configuration (e.g., hyperparameters) for the trial, to azure storage
                 await AzureStorageClientUtility.uploadDirectory(this.azureStorageClient,
                                                                 `nni/${getExperimentId()}/${trialJobId}`, this.azureStorageShare,
                                                                 `${trialLocalTempFolder}`);
+                //upload code files to azure storage
+                await AzureStorageClientUtility.uploadDirectory(this.azureStorageClient,
+                                                                `nni/${getExperimentId()}/${trialJobId}`, this.azureStorageShare,
+                                                                `${this.kubeflowTrialConfig.codeDir}`);
 
-                trialJobOutputUrl = `https://${this.azureStorageAccountName}.file.core.windows.net/${this.azureStorageShare}\
-                /${path.join('nni', getExperimentId(), trialJobId, 'output')}`;
+                trialJobOutputUrl = `https://${this.azureStorageAccountName}.file.core.windows.net/${this.azureStorageShare}` + 
+                                    `/${path.join('nni', getExperimentId(), trialJobId, 'output')}`;
             } catch (error) {
                 this.log.error(error);
 
@@ -228,9 +236,10 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
             const nfsKubeflowClusterConfig: KubeflowClusterConfigNFS = <KubeflowClusterConfigNFS>this.kubeflowClusterConfig;
             // Creat work dir for current trial in NFS directory
             await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}/nni/${getExperimentId()}/${trialJobId}`);
-            // Copy code files from local dir to NFS mounted dir
+            // Copy script files from local dir to NFS mounted dir
             await cpp.exec(`cp -r ${trialLocalTempFolder}/* ${this.trialLocalNFSTempFolder}/nni/${getExperimentId()}/${trialJobId}/.`);
-
+            // Copy codeDir to NFS mounted dir
+            await cpp.exec(`cp -r ${this.kubeflowTrialConfig.codeDir}/* ${this.trialLocalNFSTempFolder}/nni/${getExperimentId()}/${trialJobId}/.`);
             const nfsConfig: NFSConfig = nfsKubeflowClusterConfig.nfs;
             trialJobOutputUrl = `nfs://${nfsConfig.server}:${path.join(nfsConfig.path, 'nni', getExperimentId(), trialJobId, 'output')}`;
         }
@@ -255,13 +264,10 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
         }
 
         //create tmp trial working folder locally.
-        await cpp.exec(`mkdir -p ${path.dirname(trialLocalTempFolder)}`);
-        await cpp.exec(`cp -r ${kubeflowTrialConfig.codeDir} ${trialLocalTempFolder}`);
+        await cpp.exec(`mkdir -p ${trialLocalTempFolder}`);
         const runScriptContent : string = CONTAINER_INSTALL_NNI_SHELL_FORMAT;
         // Write NNI installation file to local tmp files
         await fs.promises.writeFile(path.join(trialLocalTempFolder, 'install_nni.sh'), runScriptContent, { encoding: 'utf8' });
-        // Create tmp trial working folder locally.
-        await cpp.exec(`mkdir -p ${trialLocalTempFolder}`);
 
         // Write worker file content run_worker.sh to local tmp folders
         if (kubeflowTrialConfig.worker !== undefined) {
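
Both Kubernetes training services now keep the local temp folder for generated scripts only and stage `codeDir` straight into the shared trial directory as a second step. The sketch below models that two-step staging with plain in-memory objects. `Dir` and `copyInto` are hypothetical names, the file contents are made up, and no real NFS or Azure storage call is involved:

```typescript
// In-memory stand-in for a directory: relative path -> file content.
type Dir = Record<string, string>;

// Hypothetical helper that mimics `cp -r src/* dest/.`: every entry of
// src is written into dest, replacing entries with the same path.
function copyInto(src: Dir, dest: Dir): void {
    for (const relPath of Object.keys(src)) {
        dest[relPath] = src[relPath];
    }
}

// Scripts generated per trial (file names from the diff, contents made up).
const trialLocalTempFolder: Dir = {
    'install_nni.sh': '# install script',
    'run_worker.sh': '# run script'
};

// The user's codeDir (made-up contents).
const codeDir: Dir = {
    'mytrial.py': '# trial code',
    'search_space.json': '{}'
};

// Shared trial directory on NFS or Azure storage.
const trialRemoteDir: Dir = {};

// Step 1: stage the generated scripts. Step 2: stage codeDir next to them.
copyInto(trialLocalTempFolder, trialRemoteDir);
copyInto(codeDir, trialRemoteDir);

console.log(Object.keys(trialRemoteDir).sort());
```

Because `codeDir` is copied second in this sketch, a file in `codeDir` that shares a name with a generated script would replace it. Whether the real upload paths behave the same way is not shown in the diff.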