[SWDEV-381630] Add reset partition functionality
Updates:
    * Added rsmi_dev_compute_partition_reset & rsmi_dev_nps_mode_reset
    * Added --resetcomputepartition and --resetnpsmode python smi calls
    * Added temp data files rocmsmi_boot_compute_partition_<device num>
      & rocmsmi_boot_nps_mode_partition_<device num>, writes UNKNOWN
      if data cannot be read or device does not support
    * Cleaned up NPS & compute API documentation
    * Added creation and reading of API temp files (used in reset
      functionality)
    * Cleaned up output of rocm_smi_example
    * Updated rocm_smi_example to check if running with sudo permission
      before executing write API calls (cleans up erroneous output)
    * Added template specialization for storing temp data, requires
      specific rsmi_type_t enums (restricts what data can be stored)
    * Added storage of temp data, if temp files do not exist
    * Updated google tests for NPS & compute to include reset API calls

Change-Id: I69895a466b97107617e6dbb355737b84499a76c9
Signed-off-by: Charis Poag <[email protected]>
charis-poag-amd committed Feb 17, 2023
1 parent 9ef376c commit 77c950a
Showing 12 changed files with 577 additions and 46 deletions.
45 changes: 41 additions & 4 deletions include/rocm_smi/rocm_smi.h
@@ -3540,12 +3540,13 @@ rsmi_is_P2P_accessible(uint32_t dv_ind_src, uint32_t dv_ind_dst,
* which the device's current compute partition will be written to.
*
* @param[in] len the length of the caller provided buffer @p compute_partition
* , suggested length is 4 or greater.
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
* @retval ::RSMI_STATUS_UNEXPECTED_DATA data provided to function is not valid
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function with the given arguments
* support this function
* @retval ::RSMI_STATUS_INSUFFICIENT_SIZE is returned if @p len bytes is not
* large enough to hold the entire compute partition value. In this case,
* only @p len bytes will be written.
@@ -3572,13 +3573,30 @@ rsmi_dev_compute_partition_get(uint32_t dv_ind, char *compute_partition,
* @retval ::RSMI_STATUS_PERMISSION function requires root access
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function with the given arguments
* support this function
*
*/
rsmi_status_t
rsmi_dev_compute_partition_set(uint32_t dv_ind,
rsmi_compute_partition_type_t compute_partition);

/**
* @brief Reverts a selected device's compute partition setting back to its
* boot state.
*
* @details Given a device index @p dv_ind , this function will attempt to
* revert its compute partition setting back to its boot state.
*
* @param[in] dv_ind a device index
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_PERMISSION function requires root access
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function
*
*/
rsmi_status_t rsmi_dev_compute_partition_reset(uint32_t dv_ind);

/** @} */ // end of ComputePartition

/*****************************************************************************/
@@ -3609,7 +3627,7 @@ rsmi_dev_compute_partition_set(uint32_t dv_ind,
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
* @retval ::RSMI_STATUS_UNEXPECTED_DATA data provided to function is not valid
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function with the given arguments
* support this function
* @retval ::RSMI_STATUS_INSUFFICIENT_SIZE is returned if @p len bytes is not
* large enough to hold the entire nps mode value. In this case,
* only @p len bytes will be written.
@@ -3634,14 +3652,33 @@ rsmi_dev_nps_mode_get(uint32_t dv_ind, char *nps_mode, uint32_t len);
* @retval ::RSMI_STATUS_PERMISSION function requires root access
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function with the given arguments
* support this function
* @retval ::RSMI_STATUS_AMDGPU_RESTART_ERR could not successfully restart
* the amdgpu driver
*
*/
rsmi_status_t
rsmi_dev_nps_mode_set(uint32_t dv_ind, rsmi_nps_mode_type_t nps_mode);

/**
* @brief Reverts a selected device's NPS mode setting back to its
* boot state.
*
* @details Given a device index @p dv_ind , this function will attempt to
* revert its NPS mode setting back to its boot state.
*
* @param[in] dv_ind a device index
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_PERMISSION function requires root access
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function
* @retval ::RSMI_STATUS_AMDGPU_RESTART_ERR could not successfully restart
* the amdgpu driver
*
*/
rsmi_status_t rsmi_dev_nps_mode_reset(uint32_t dv_ind);

/** @} */ // end of NPSMode

/*****************************************************************************/
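The two reset declarations above document a small set of return codes. The following is a minimal Python sketch of how a caller might act on them. It uses a stand-in object rather than the real library: the `FakeLib` class, the `reset_both` helper, and the numeric status values are all placeholders invented for illustration and are not taken from rocm_smi.h.

```python
from enum import IntEnum


class Status(IntEnum):
    # Placeholder values for illustration only; the real rsmi_status_t
    # numbering lives in rocm_smi.h and is not shown in this diff.
    SUCCESS = 0
    PERMISSION = 1
    NOT_SUPPORTED = 2
    AMDGPU_RESTART_ERR = 3


# One message per return code documented for the reset functions.
MESSAGES = {
    Status.SUCCESS: "reset to boot state",
    Status.PERMISSION: "requires root access",
    Status.NOT_SUPPORTED: "not supported on this system",
    Status.AMDGPU_RESTART_ERR: "could not restart the amdgpu driver",
}


class FakeLib:
    """Hypothetical stand-in for the loaded library (not the real binding)."""

    def __init__(self, is_root):
        self.is_root = is_root

    def rsmi_dev_compute_partition_reset(self, dv_ind):
        return Status.SUCCESS if self.is_root else Status.PERMISSION

    def rsmi_dev_nps_mode_reset(self, dv_ind):
        return Status.SUCCESS if self.is_root else Status.PERMISSION


def reset_both(lib, dv_ind):
    """Call both reset functions and describe each documented outcome."""
    results = {}
    for name in ("rsmi_dev_compute_partition_reset", "rsmi_dev_nps_mode_reset"):
        status = Status(getattr(lib, name)(dv_ind))
        results[name] = MESSAGES[status]
    return results
```

With the real library, the same dispatch would presumably run against the object loaded through ctypes, and the status constants would come from the bindings rather than from this placeholder enum.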
2 changes: 2 additions & 0 deletions include/rocm_smi/rocm_smi_device.h
@@ -217,6 +217,8 @@ class Device {
bool DeviceAPISupported(std::string name, uint64_t variant,
uint64_t sub_variant);
rsmi_status_t restartAMDGpuDriver(void);
rsmi_status_t storeDevicePartitions(uint32_t dv_ind);
template <typename T> std::string readBootPartitionState(uint32_t dv_ind);

private:
std::shared_ptr<Monitor> monitor_;
17 changes: 11 additions & 6 deletions include/rocm_smi/rocm_smi_utils.h
@@ -66,18 +66,23 @@ namespace amd {
namespace smi {

pthread_mutex_t *GetMutex(uint32_t dv_ind);

int SameFile(const std::string fileA, const std::string fileB);
bool FileExists(char const *filename);
int isRegularFile(std::string fname, bool *is_reg);

int ReadSysfsStr(std::string path, std::string *retStr);
int WriteSysfsStr(std::string path, std::string val);

bool IsInteger(const std::string & n_str);

std::pair<bool, std::string> executeCommand(std::string command, bool stdOut = true);

std::pair<bool, std::string> executeCommand(std::string command,
bool stdOut = true);
rsmi_status_t storeTmpFile(uint32_t dv_ind, std::string parameterName,
std::string stateName, std::string storageData);
std::vector<std::string> getListOfAppTmpFiles();
bool containsString(std::string originalString, std::string substring);
std::tuple<bool, std::string> readTmpFile(
uint32_t dv_ind,
std::string stateName,
std::string parameterName);
void displayAppTmpFilesContent(void);
rsmi_status_t handleException();
rsmi_status_t
GetDevValueVec(amd::smi::DevInfoTypes type,
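The `storeTmpFile` and `readTmpFile` declarations above, together with the commit message, suggest a store-once, read-later pattern for boot state. The following Python sketch shows that pattern under stated assumptions: the file name layout is inferred from the names given in the commit message (`rocmsmi_boot_compute_partition_<device num>`), the storage directory is passed in explicitly because the real location is not shown in this diff, and `"SPX"` is only an illustrative value. All function names here are hypothetical Python analogues, not the C++ implementation.

```python
import os
import tempfile

UNKNOWN = "UNKNOWN"


def tmp_file_path(directory, dv_ind, state_name, parameter_name):
    # Assumed layout: rocmsmi_<state>_<parameter>_<device num>
    return os.path.join(directory, f"rocmsmi_{state_name}_{parameter_name}_{dv_ind}")


def store_tmp_file(directory, dv_ind, parameter_name, state_name, data):
    """Write the value once; an existing file is left untouched."""
    path = tmp_file_path(directory, dv_ind, state_name, parameter_name)
    if os.path.exists(path):
        return False
    with open(path, "w") as handle:
        # Fall back to UNKNOWN when there is no data to record.
        handle.write(data if data else UNKNOWN)
    return True


def read_tmp_file(directory, dv_ind, state_name, parameter_name):
    """Return (found, contents), loosely mirroring readTmpFile's tuple."""
    path = tmp_file_path(directory, dv_ind, state_name, parameter_name)
    try:
        with open(path) as handle:
            return True, handle.read()
    except OSError:
        return False, ""


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store_tmp_file(tmp, 0, "compute_partition", "boot", "SPX")
        print(read_tmp_file(tmp, 0, "boot", "compute_partition"))
```

Refusing to overwrite an existing file is what lets the stored value keep meaning "state at boot" even after later set calls change the live setting.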
96 changes: 93 additions & 3 deletions python_smi_tools/rocm_smi.py
@@ -411,6 +411,30 @@ def getVersion(deviceList, component):
return None


def getComputePartition(device):
    """ Return the current compute partition of a given device
    @param device: DRM device identifier
    """
    currentComputePartition = create_string_buffer(256)
    ret = rocmsmi.rsmi_dev_compute_partition_get(device, currentComputePartition, 256)
    if rsmi_ret_ok(ret, device, silent=True) and currentComputePartition.value.decode():
        return str(currentComputePartition.value.decode())
    return "UNKNOWN"


def getMemoryPartition(device):
    """ Return the current memory partition of a given device
    @param device: DRM device identifier
    """
    currentNPSMode = create_string_buffer(256)
    ret = rocmsmi.rsmi_dev_nps_mode_get(device, currentNPSMode, 256)
    if rsmi_ret_ok(ret, device, silent=True) and currentNPSMode.value.decode():
        return str(currentNPSMode.value.decode())
    return "UNKNOWN"
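
Both getters above follow the same shape: allocate a ctypes buffer, let the library fill it, and fall back to "UNKNOWN" on an error or an empty string. The sketch below isolates that shape so it can run without a GPU. `fake_partition_get` is a hypothetical stand-in for `rsmi_dev_compute_partition_get`, the success value of 0 is an assumption, and `"SPX"` is only an illustrative partition name.

```python
from ctypes import create_string_buffer

RSMI_STATUS_SUCCESS = 0  # assumed value of the success status


def fake_partition_get(dv_ind, buf, length):
    """Hypothetical stand-in for rsmi_dev_compute_partition_get."""
    if dv_ind != 0:
        return 2  # arbitrary non-success code for this sketch
    # Leave room for the terminating NUL, as a C callee would.
    buf.value = b"SPX"[: length - 1]
    return RSMI_STATUS_SUCCESS


def get_partition(dv_ind, getter=fake_partition_get, length=256):
    """Read a partition string, returning "UNKNOWN" on failure or empty data."""
    buf = create_string_buffer(length)
    ret = getter(dv_ind, buf, length)
    text = buf.value.decode()
    if ret == RSMI_STATUS_SUCCESS and text:
        return text
    return "UNKNOWN"
```

Checking both the status and the decoded text matters because a call can succeed while leaving the buffer empty, and the CLI treats that case as "UNKNOWN" too.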


def print2DArray(dataArray):
""" Print 2D Array with uniform spacing """
global PRINT_JSON
@@ -773,6 +797,66 @@ def resetPerfDeterminism(deviceList):
printLogSpacer()


def resetComputePartition(deviceList):
    """ Reset Compute Partition to its boot state
    @param deviceList: List of DRM devices (can be a single-item list)
    """
    printLogSpacer(" Reset compute partition to its boot state ")
    for device in deviceList:
        originalPartition = getComputePartition(device)
        ret = rocmsmi.rsmi_dev_compute_partition_reset(device)
        if rsmi_ret_ok(ret, device, silent=True):
            resetBootState = getComputePartition(device)
            printLog(device, "Successfully reset compute partition (" +
                     originalPartition + ") to boot state (" + resetBootState +
                     ")", None)
        elif ret == rsmi_status_t.RSMI_STATUS_PERMISSION:
            printLog(device, 'Permission denied', None)
        elif ret == rsmi_status_t.RSMI_STATUS_NOT_SUPPORTED:
            printLog(device, 'Not supported on the given system', None)
        else:
            rsmi_ret_ok(ret, device)
            printErrLog(device, 'Failed to reset the compute partition to boot state')
    printLogSpacer()


def resetNpsMode(deviceList):
    """ Reset NPS mode to its boot state
    @param deviceList: List of DRM devices (can be a single-item list)
    """
    printLogSpacer(" Reset nps mode to its boot state ")
    for device in deviceList:
        originalPartition = getMemoryPartition(device)
        t1 = multiprocessing.Process(target=showProgressbar,
                                     args=("Resetting NPS mode",13,))
        t1.start()
        addExtraLine=True
        start=time.time()
        ret = rocmsmi.rsmi_dev_nps_mode_reset(device)
        stop=time.time()
        duration=stop-start
        if t1.is_alive():
            t1.terminate()
            t1.join()
        if duration < float(0.1): # For longer runs, add extra line before output
            addExtraLine=False # This is to prevent overriding progress bar
        if rsmi_ret_ok(ret, device, silent=True):
            resetBootState = getMemoryPartition(device)
            printLog(device, "Successfully reset nps mode (" +
                     originalPartition + ") to boot state (" +
                     resetBootState + ")", None, addExtraLine)
        elif ret == rsmi_status_t.RSMI_STATUS_PERMISSION:
            printLog(device, 'Permission denied', None, addExtraLine)
        elif ret == rsmi_status_t.RSMI_STATUS_NOT_SUPPORTED:
            printLog(device, 'Not supported on the given system', None, addExtraLine)
        else:
            rsmi_ret_ok(ret, device)
            printErrLog(device, 'Failed to reset nps mode to boot state')
    printLogSpacer()


def setClockRange(deviceList, clkType, minvalue, maxvalue, autoRespond):
""" Set the range for the specified clktype in the PowerPlay table for a list of devices.
@@ -3228,7 +3312,7 @@ def save(deviceList, savefilepath):
action='store_true')
groupDisplay.add_argument('--shownodesbw', help='Shows the numa nodes ', action='store_true')
groupDisplay.add_argument('--showcomputepartition', help='Shows current compute partitioning ', action='store_true')
groupDisplay.add_argument('--shownpsmode', help='Shows current nps mode ', action='store_true')
groupDisplay.add_argument('--shownpsmode', help='Shows current NPS mode ', action='store_true')

groupActionReset.add_argument('-r', '--resetclocks', help='Reset clocks and OverDrive to default',
action='store_true')
@@ -3238,7 +3322,9 @@ def save(deviceList, savefilepath):
help='Set the maximum GPU power back to the device default state',
action='store_true')
groupActionReset.add_argument('--resetxgmierr', help='Reset XGMI error count', action='store_true')
groupAction.add_argument('--resetperfdeterminism', help='Disable performance determinism', action='store_true')
groupActionReset.add_argument('--resetperfdeterminism', help='Disable performance determinism', action='store_true')
groupActionReset.add_argument('--resetcomputepartition', help='Resets to boot compute partition state', action='store_true')
groupActionReset.add_argument('--resetnpsmode', help='Resets to boot NPS mode state', action='store_true')
groupAction.add_argument('--setclock',
help='Set Clock Frequency Level(s) for specified clock (requires manual Perf level)',
metavar=('TYPE','LEVEL'), nargs=2)
@@ -3317,7 +3403,7 @@ def save(deviceList, savefilepath):
or args.setpoweroverdrive or args.resetpoweroverdrive or args.rasenable or args.rasdisable or \
args.rasinject or args.gpureset or args.setperfdeterminism or args.setslevel or args.setmlevel or \
args.setvc or args.setsrange or args.setmrange or args.setclock or \
args.setcomputepartition or args.setnpsmode:
args.setcomputepartition or args.setnpsmode or args.resetcomputepartition or args.resetnpsmode:
relaunchAsSudo()

# If there is one or more device specified, use that for all commands, otherwise use a
@@ -3561,6 +3647,10 @@ def save(deviceList, savefilepath):
resetXgmiErr(deviceList)
if args.resetperfdeterminism:
resetPerfDeterminism(deviceList)
if args.resetcomputepartition:
resetComputePartition(deviceList)
if args.resetnpsmode:
resetNpsMode(deviceList)
if args.rasenable:
setRas(deviceList, 'enable', args.rasenable[0], args.rasenable[1])
if args.rasdisable:
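
The argparse changes in this file add two reset flags and include them in the condition that triggers `relaunchAsSudo()`. A minimal, self-contained sketch of that wiring follows. The flag names and help strings are copied from the diff, while the program name, group titles, and the `needs_root` helper are hypothetical and exist only for this example.

```python
import argparse


def build_parser():
    """Build a tiny parser with one display flag and the two new reset flags."""
    parser = argparse.ArgumentParser(prog="rocm-smi-sketch")
    group_display = parser.add_argument_group("Display options")
    group_display.add_argument("--showcomputepartition", action="store_true",
                               help="Shows current compute partitioning")
    group_reset = parser.add_argument_group("Reset options")
    group_reset.add_argument("--resetcomputepartition", action="store_true",
                             help="Resets to boot compute partition state")
    group_reset.add_argument("--resetnpsmode", action="store_true",
                             help="Resets to boot NPS mode state")
    return parser


def needs_root(args):
    """True when a write-style flag is set, as in the relaunchAsSudo check."""
    return bool(args.resetcomputepartition or args.resetnpsmode)
```

A caller would parse the command line first and consult `needs_root(args)` before dispatching, so that read-only flags never prompt for elevated privileges.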
Binary file modified rocm_smi/docs/ROCm_SMI_Manual.pdf
Binary file not shown.