SWDEV-335697- Add support for dynamic partitioning
Original updates:
    * Added .gitignore to help with future commits
    * Updated/added copyrights on modified or added files
    * Updated rocm_smi.h/.cc
      - Added 3 new SMI API functions:
          rsmi_dev_compute_partition_set &
          rsmi_dev_compute_partition_get
      - Added helpful maps/enums used in
        new get/set compute_partition API calls
    * Updated rocm_smi.py
      - Added --showcomputepartition
      - Added --setcomputepartition
      - Fixed a few typos
    * Updated rsmiBindings.py - added helpful class/dict/list
    * Updated rocm_smi_example.cc
      - Added helpful MACRO to detect if api is not supported.
      - Added current_compute_partition set/get rocm lib calls
      - Added helpful macro to discover future RSMI errors
      - Commented out test_set_freq, was having permission issues
        on a Navi21
    * Updated rocm_smi_main.cc
      - Added helpful map to debug API calls, left in for future use
      - Added comment to clarify what a non-class function returns
    * Added computepartition_read_write.cc/.h
      - Added get/set compute partition API test calls
      - Confirmed on devices that do not support the API calls, tests pass
    * Updated rocm_smi_test/main.cc
      - Calls new compute partition gtests

Added following updates from review feedback:
   * Updated rocm_smi.h/cc
       - Removed C++ API calls; supporting both C and C++
         API calls could cause confusion and adds extra work for us
       - rsmi_dev_compute_partition_get -> Fixed an edge case where
         the user provides a buffer length smaller than the data
         received but does not get the partial buffer back.
         Google Tests are updated to reflect this finding.
   * Updated rocm_smi_example.cc
       - Fixed test_set_freq; the issue was that the file was not
         writable. We now report this as a warning, so prior errors
         make sense.
       - General test code cleanup. Removed extra code
         by creating loops for tests.
   * Updated rocm_smi_main.cc
     - Moved the map used for debugging RSMI enums and removed the
       external reference to it; it is now a public const member.
   * Updated rocm_smi.py
     - Updated Python code to identify NOT_SUPPORTED, since
       (currently) only a few GPUs support the feature

Change-Id: I4a567acbb59d6771fb64df08d19175fe3604fd1b
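
The buffer edge case described above for rsmi_dev_compute_partition_get can be sketched as follows. This is only an illustration of the intended contract (a short buffer still receives the partial string, together with a distinct status), not the library's implementation. The names Status and copy_with_truncation are hypothetical, and the choice to always NUL-terminate within len bytes is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the relevant rsmi_status_t values.
enum class Status { kSuccess, kInvalidArgs, kInsufficientSize };

// Hypothetical helper (not part of ROCm SMI): copies `value` into the
// caller-provided buffer `buf` of `len` bytes, always NUL-terminating it.
Status copy_with_truncation(const std::string &value, char *buf,
                            uint32_t len) {
  if (buf == nullptr || len == 0) {
    return Status::kInvalidArgs;
  }
  // Copy at most len - 1 characters so the terminator always fits.
  const std::size_t copied = value.copy(buf, len - 1);
  buf[copied] = '\0';
  // A short buffer still receives the partial string, plus a distinct status.
  return copied < value.size() ? Status::kInsufficientSize : Status::kSuccess;
}

// Usage: a 2-byte buffer receives "S" and the caller learns it was truncated.
inline void demo_truncation() {
  char small[2];
  assert(copy_with_truncation("SPX", small, sizeof(small)) ==
         Status::kInsufficientSize);
  assert(std::string(small) == "S");
}
```

A caller that sees the insufficient-size status can retry with a larger buffer while still having a usable partial value in hand.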
charis-poag-amd committed Jan 13, 2023
1 parent 5c478e9 commit 4d7f3f2
Showing 14 changed files with 1,111 additions and 54 deletions.
124 changes: 124 additions & 0 deletions .gitignore
@@ -0,0 +1,124 @@
#
# NOTE! Don't add files that are generated in specific
# subdirectories here. Add them in the ".gitignore" file
# in that subdirectory instead.
#
# NOTE! Please use 'git ls-files -i --exclude-standard'
# command after changing this file, to see if there are
# any tracked files which get ignored after the change.
#
# Normal rules
#
.*
*.o
*.o.*
*.a
*.s
*.ko
*.so
*.so.dbg
*.mod.c
*.i
*.lst
*.symtypes
*.order
modules.builtin
*.elf
*.bin
*.gz
*.bz2
*.lzma
*.xz
*.lzo
#*.patch
*.gcno
*.pyc
*current_compute_partition

#
# Top-level generic files/folders
#
/[Bb][Ui][Ll][Dd]
*/[Bb][Ui][Ll][Dd]
/build
*/build
/[Gg][Tt][Ee][Ss][Tt][Ss]
*/[Gg][Tt][Ee][Ss][Tt][Ss]
/tags
/TAGS
/linux
/vmlinux
/vmlinuz
/System.map
/Module.markers
Module.symvers

#
# Debian directory (make deb-pkg)
#
/debian/

#
# git files that we don't want to ignore even if they are dot-files
#
!.gitignore
!.mailmap

### VisualStudioCode ###
!.vscode/settings.json

#
# Generated include files
#
include/config
include/linux/version.h
include/generated
arch/*/include/generated

# git generated dirs
patches-*

# quilt's files
patches
series

# cscope files
cscope.*
ncscope.*

# gnu global files
GPATH
GRTAGS
GSYMS
GTAGS

*.orig
*~
\#*#

#
# Leavings from module signing
#
extra_certificates
signing_key.priv
signing_key.x509
x509.genkey

#cmake files
CMakeLists.txt.user
CMakeCache.txt
CMakeFiles
CMakeScripts
Testing
Makefile
cmake_install.cmake
install_manifest.txt
compile_commands.json
CTestTestfile.cmake
_deps

#
# ROCm files
# Removes generated config headers like rocmsmi64Config.h & oamConfig.h
#
*Config.h
86 changes: 85 additions & 1 deletion include/rocm_smi/rocm_smi.h
@@ -3,7 +3,7 @@
* The University of Illinois/NCSA
* Open Source License (NCSA)
*
* Copyright (c) 2017, Advanced Micro Devices, Inc.
* Copyright (c) 2017-2023, Advanced Micro Devices, Inc.
* All rights reserved.
*
* Developed by:
@@ -352,6 +352,26 @@ typedef enum {
typedef rsmi_clk_type_t rsmi_clk_type;
/// \endcond

/**
* Compute Partition types
*/
typedef enum {
RSMI_COMPUTE_PARTITION_INVALID = 0,
RSMI_COMPUTE_PARTITION_CPX, //!< Core mode (CPX)- Per-chip XCC with
//!< shared memory
RSMI_COMPUTE_PARTITION_SPX, //!< Single GPU mode (SPX)- All XCCs work
//!< together with shared memory
RSMI_COMPUTE_PARTITION_DPX, //!< Dual GPU mode (DPX)- Half XCCs work
//!< together with shared memory
RSMI_COMPUTE_PARTITION_TPX, //!< Triple GPU mode (TPX)- One-third XCCs
//!< work together with shared memory
RSMI_COMPUTE_PARTITION_QPX, //!< Quad GPU mode (QPX)- Quarter XCCs
//!< work together with shared memory
} rsmi_compute_partition_type_t;
/// \cond Ignore in docs.
typedef rsmi_compute_partition_type_t rsmi_compute_partition_type;
/// \endcond

/**
* @brief Temperature Metrics. This enum is used to identify various
* temperature metrics. Corresponding values will be in millidegrees
@@ -3470,6 +3490,70 @@ rsmi_is_P2P_accessible(uint32_t dv_ind_src, uint32_t dv_ind_dst,

/** @} */ // end of HWTopo

/*****************************************************************************/
/** @defgroup ComputePartition Compute Partition Functions
* These functions are used to configure and query the device's
* compute partition setting.
* @{
*/

/**
* @brief Retrieves the current compute partitioning for a desired device
*
* @details
* Given a device index @p dv_ind and a string @p compute_partition ,
* and uint32 @p len , this function will attempt to obtain the device's
* current compute partition setting string. Upon successful retrieval,
* the obtained device's compute partition settings string shall be stored in
* the passed @p compute_partition char string variable.
*
* @param[in] dv_ind a device index
*
* @param[inout] compute_partition a pointer to a char string variable,
* which the device's current compute partition will be written to.
*
* @param[in] len the length of the caller provided buffer @p compute_partition
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
* @retval ::RSMI_STATUS_UNEXPECTED_DATA data provided to function is not valid
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function with the given arguments
* @retval ::RSMI_STATUS_INSUFFICIENT_SIZE is returned if @p len bytes is not
* large enough to hold the entire compute partition value. In this case,
* only @p len bytes will be written.
*
*/
rsmi_status_t
rsmi_dev_compute_partition_get(uint32_t dv_ind, char *compute_partition,
uint32_t len);

/**
* @brief Modifies a selected device's compute partition setting.
*
* @details Given a device index @p dv_ind and a type of compute partition
* @p compute_partition, this function will attempt to update the selected
* device's compute partition setting.
*
* @param[in] dv_ind a device index
*
* @param[in] compute_partition using enum ::rsmi_compute_partition_type_t,
* define what the selected device's compute partition setting should be
* updated to.
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_PERMISSION function requires root access
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function with the given arguments
*
*/
rsmi_status_t
rsmi_dev_compute_partition_set(uint32_t dv_ind,
rsmi_compute_partition_type_t compute_partition);

/** @} */ // end of ComputePartition

/*****************************************************************************/
/** @defgroup APISupport Supported Functions
* API function support varies by both GPU type and the version of the
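
The commit message mentions maps and enums that support the new get/set calls. A minimal sketch of such a mapping is shown below, assuming the partition strings are the short mode names (CPX, SPX, and so on). The table name kPartitionNames, both helper functions, and the "UNKNOWN" fallback are hypothetical; only the enum mirrors the header above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Mirrors the enum added to rocm_smi.h in this commit.
typedef enum {
  RSMI_COMPUTE_PARTITION_INVALID = 0,
  RSMI_COMPUTE_PARTITION_CPX,
  RSMI_COMPUTE_PARTITION_SPX,
  RSMI_COMPUTE_PARTITION_DPX,
  RSMI_COMPUTE_PARTITION_TPX,
  RSMI_COMPUTE_PARTITION_QPX,
} rsmi_compute_partition_type_t;

// Hypothetical lookup table; the name and string values are assumptions.
static const std::map<rsmi_compute_partition_type_t, std::string>
    kPartitionNames = {
        {RSMI_COMPUTE_PARTITION_CPX, "CPX"},
        {RSMI_COMPUTE_PARTITION_SPX, "SPX"},
        {RSMI_COMPUTE_PARTITION_DPX, "DPX"},
        {RSMI_COMPUTE_PARTITION_TPX, "TPX"},
        {RSMI_COMPUTE_PARTITION_QPX, "QPX"},
};

// Enum -> string; values without an entry map to "UNKNOWN".
std::string partition_to_string(rsmi_compute_partition_type_t type) {
  auto it = kPartitionNames.find(type);
  return it != kPartitionNames.end() ? it->second : "UNKNOWN";
}

// String -> enum; unrecognized strings map to RSMI_COMPUTE_PARTITION_INVALID.
rsmi_compute_partition_type_t partition_from_string(const std::string &name) {
  for (const auto &entry : kPartitionNames) {
    if (entry.second == name) {
      return entry.first;
    }
  }
  return RSMI_COMPUTE_PARTITION_INVALID;
}
```

Keeping RSMI_COMPUTE_PARTITION_INVALID out of the table lets it double as the "not recognized" result of the reverse lookup.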
5 changes: 3 additions & 2 deletions include/rocm_smi/rocm_smi_device.h
@@ -3,7 +3,7 @@
* The University of Illinois/NCSA
* Open Source License (NCSA)
*
* Copyright (c) 2017, Advanced Micro Devices, Inc.
* Copyright (c) 2017-2023, Advanced Micro Devices, Inc.
* All rights reserved.
*
* Developed by:
@@ -161,7 +161,8 @@ enum DevInfoTypes {
kDevMemPageBad,
kDevNumaNode,
kDevGpuMetrics,
kDevGpuReset
kDevGpuReset,
kDevComputePartition
};

typedef struct {
1 change: 1 addition & 0 deletions include/rocm_smi/rocm_smi_main.h
@@ -113,6 +113,7 @@ class RocmSMI {
uint64_t *weight);
int get_node_index(uint32_t dv_ind, uint32_t *node_ind);
const RocmSMI_env_vars& getEnv(void);
static const std::map<amd::smi::DevInfoTypes, std::string> devInfoTypesStrings;

private:
std::vector<std::shared_ptr<Device>> devices_;
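
The devInfoTypesStrings member declared in this hunk follows the usual C++ pattern of declaring a static const map in the class and defining it once in a source file. The sketch below shows that pattern under stated assumptions: the enum is a reduced subset, the string values are guesses, and dev_info_type_name with its fallback text is a hypothetical helper.

```cpp
#include <cassert>
#include <map>
#include <string>

// Reduced, illustrative subset of amd::smi::DevInfoTypes.
enum DevInfoTypes { kDevGpuMetrics, kDevGpuReset, kDevComputePartition };

class RocmSMI {
 public:
  // Declared in the class (header); read-only and shared by all instances.
  static const std::map<DevInfoTypes, std::string> devInfoTypesStrings;
};

// Defined exactly once outside the class (normally in the .cc file).
// kDevGpuMetrics is left out on purpose to exercise the fallback below.
const std::map<DevInfoTypes, std::string> RocmSMI::devInfoTypesStrings = {
    {kDevGpuReset, "kDevGpuReset"},
    {kDevComputePartition, "kDevComputePartition"},
};

// Hypothetical debug helper: returns the enum's name, or a fallback string.
std::string dev_info_type_name(DevInfoTypes type) {
  auto it = RocmSMI::devInfoTypesStrings.find(type);
  return it != RocmSMI::devInfoTypesStrings.end() ? it->second
                                                  : "<unknown DevInfoTypes>";
}
```

Because the member is const and public, debugging code elsewhere can read it directly without an extern declaration.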