
Tags: hansman/torchrec

v0.8.0

Enable prefetch stage for StagedTrainPipeline (pytorch#2239)

Summary:
Pull Request resolved: pytorch#2239

Add ability to run prefetch as a stage in `StagedTrainPipeline`

Recommended usage for running a 3-stage pipeline with data copy, sparse dist, and prefetch steps (required changes shown with arrows):
```
sdd = SparseDataDistUtil(
    model=self._model,
    data_dist_stream=torch.cuda.Stream(),
    prefetch_stream=torch.cuda.Stream(), <--- define prefetch stream
)

pipeline = [
    PipelineStage(
        name="data_copy",
        runnable=lambda batch, context: batch.to(
            self._device, non_blocking=True
        ),
        stream=torch.cuda.Stream(),
    ),
    PipelineStage(
        name="start_sparse_data_dist",
        runnable=sdd.start_sparse_data_dist,
        stream=sdd.data_dist_stream,
        fill_callback=sdd.wait_sparse_data_dist,
    ),
    PipelineStage(
        name="prefetch",
        runnable=sdd.prefetch, <--- add stage with runnable=sdd.prefetch
        stream=sdd.prefetch_stream,
        fill_callback=sdd.load_prefetch, <--- set fill_callback to sdd.load_prefetch
    ),
]

return StagedTrainPipeline(pipeline_stages=pipeline)
```

Order of execution for the above pipeline:

Iteration #1:

_fill_pipeline():
batch 0: memcpy, start_sdd, wait_sdd (callback), prefetch, load_prefetch (callback)
batch 1: memcpy, start_sdd, wait_sdd (callback)
batch 2: memcpy

progress():
batch 3: memcpy
batch 2: start_sdd
batch 1: prefetch

after pipeline progress():
model(batch 0)
load_prefetch (prepares for model fwd on batch 1)
wait_sdd (prepares for batch 2 prefetch)

Iteration #2:
progress():
batch 4: memcpy
batch 3: start_sdd
batch 2: prefetch

after pipeline progress():
model(batch 1)
load_prefetch (prepares for model fwd on batch 2)
wait_sdd (prepares for batch 3 prefetch)
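
A minimal usage sketch for driving this pipeline from a training loop, assuming the usual `StagedTrainPipeline` contract that `progress()` returns the next ready batch (or `None` once the dataloader iterator is exhausted); `model`, `optimizer` and `train_dataloader` are placeholders:
```
# Hedged sketch; assumes the model returns (loss, output) as in typical
# torchrec train pipelines.
pipeline = build_pipeline()              # the StagedTrainPipeline from above
dataloader_iter = iter(train_dataloader)

while True:
    batch = pipeline.progress(dataloader_iter)
    if batch is None:
        break
    optimizer.zero_grad()
    loss, _ = model(batch)
    loss.backward()
    optimizer.step()
```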

Reviewed By: zzzwen, joshuadeng

Differential Revision: D59786807

fbshipit-source-id: 6261c07cd6823bc541463d24ff867ab0e43631ea

v2024.07.22.00

benchmark of fbgemm op - regroup_kts (pytorch#2159)

Summary:
Pull Request resolved: pytorch#2159

# context
* added a **fn-level** benchmark for `regroup_keyed_tensor`
* `keyed_tensor_regroup` further reduces the CPU runtime from 2.0 ms to 1.3 ms (a 35% improvement) without hurting GPU runtime or memory usage

# conclusion
* CPU runtime **reduced by 40%**, from 1.8 ms to 1.1 ms
* GPU runtime **reduced by 60%**, from 4.9 ms to 2.0 ms
* GPU memory **reduced by 33%**, from 1.5 K to 1.0 K
* **we should migrate to the new op** unless there is an unknown concern/blocker

# traces
* [files](https://drive.google.com/drive/folders/1iiEf30LeG_i0xobMZVhmMneOQ5slmX3U?usp=drive_link)
```
[[email protected] /data/sandcastle/boxes/fbsource (04ad34da3)]$ ll *.json
-rw-r--r-- 1 hhy hhy  552501 Jul 10 16:01 'trace-[1 Op] KT_regroup_dup.json'
-rw-r--r-- 1 hhy hhy  548847 Jul 10 16:01 'trace-[1 Op] KT_regroup.json'
-rw-r--r-- 1 hhy hhy  559006 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs_dup.json'
-rw-r--r-- 1 hhy hhy  553199 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs.json'
-rw-r--r-- 1 hhy hhy 5104239 Jul 10 16:01 'trace-[Module] KTRegroupAsDict_dup.json'
-rw-r--r-- 1 hhy hhy  346643 Jul 10 16:01 'trace-[Module] KTRegroupAsDict.json'
-rw-r--r-- 1 hhy hhy  895096 Jul 10 16:01 'trace-[Old Prod] permute_pooled_embs.json'
-rw-r--r-- 1 hhy hhy  561685 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup_dup.json'
-rw-r--r-- 1 hhy hhy  559147 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup.json'
-rw-r--r-- 1 hhy hhy 7958676 Jul 10 16:01 'trace-[pytorch generic] fallback_dup.json'
-rw-r--r-- 1 hhy hhy 7978141 Jul 10 16:01 'trace-[pytorch generic] fallback.json'
```
* pytorch generic {F1755208341}
* current prod {F1755209251}
* permute_multi_embedding (2 Ops) {F1755210682}
* KT.regroup (1 Op) {F1755210008}
* regroupAsDict (Module) {F1755210990}
* metrics
|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bound, allows duplicates|
|**[prod] _fbgemm_permute_pooled_embs**|1.9 ms|4.9 ms|1.5 K|GPU-bound, does **NOT** allow duplicates, PT2 non-compatible `pin_and_move`|
|**[hybrid python/cu] keyed_tensor_regroup**|1.5 ms|2.0 ms|1.0 K|both GPU runtime and memory improved, **ALLOWS** duplicates, PT2 friendly|
|**[pure c++/cu] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOWS** duplicates, PT2 friendly|
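
For reference, a minimal sketch of the fn-level call pattern being benchmarked, assuming the public `KeyedTensor.regroup` static API; the keys, lengths, and grouping below are made up for illustration:
```
import torch
from torchrec.sparse.jagged_tensor import KeyedTensor

# Toy pooled-embedding output: batch size 4, three features of dims 2/3/4.
kt = KeyedTensor(
    keys=["f1", "f2", "f3"],
    length_per_key=[2, 3, 4],
    values=torch.randn(4, 9),
)

# Regroup into two output tensors; the newer ops above also allow a key
# to appear in more than one group (the "_dup" variants in the traces).
grouped = KeyedTensor.regroup([kt], [["f1", "f3"], ["f2"]])
assert grouped[0].shape == (4, 6) and grouped[1].shape == (4, 3)
```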

Reviewed By: dstaay-fb

Differential Revision: D58907223

fbshipit-source-id: 108ce355b9191cba6fe6a79e54dc7291b8463f7b

v2024.07.15.00

correct VBE output merging logic to only apply to multiple TBE cases (pytorch#2225)

Summary:
Pull Request resolved: pytorch#2225

- fixes an issue that was breaking with empty rank embeddings
  - `RuntimeError: torch.cat(): expected a non-empty list of Tensors`
  - we prevent this by ensuring the merge logic only runs when dealing with multi-TBE outputs (see the sketch below)
- skips the redundant merging logic and splits calculation when dealing with a single embedding output, which is the most common case
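
A hedged sketch of the guard described above; the helper name and call shape are illustrative, not the actual torchrec code:
```
import torch
from typing import List

def _maybe_merge_vbe_outputs(tbe_outputs: List[torch.Tensor]) -> torch.Tensor:
    # Single-TBE output (the common case): return it directly, skipping the
    # redundant merge/splits computation and avoiding
    # "torch.cat(): expected a non-empty list of Tensors" when some ranks
    # contribute empty embeddings.
    if len(tbe_outputs) == 1:
        return tbe_outputs[0]
    # Multiple TBE outputs: apply the merge logic.
    return torch.cat(tbe_outputs, dim=0)
```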

Reviewed By: ge0405

Differential Revision: D59705585

fbshipit-source-id: 98cd37be62289060524dee3404c71d826e8b18e4

v2024.07.08.00

avoid reserved python word in kwargs (pytorch#2205)

Summary:
Pull Request resolved: pytorch#2205

as per title
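
A generic illustration of the pitfall named in the title (not the actual torchrec change): a Python reserved word cannot be used as a parameter name or as an explicit keyword argument, so such names have to be renamed or avoided when building kwargs:
```
# def pool(values, lambda=0.5): ...   # SyntaxError: "lambda" is reserved
# pool(values, lambda=0.5)            # SyntaxError as well

# Renaming the argument avoids the issue:
def pool(values, decay=0.5):
    return [v * decay for v in values]

print(pool([1.0, 2.0], decay=0.25))
```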

Reviewed By: gnahzg, iamzainhuda

Differential Revision: D59336088

fbshipit-source-id: 1614039ef2c8d7958c4e98e1b02588c18b932561

v2024.07.01.00

Overlap comms on backward pass (pytorch#2117)

Summary:
Pull Request resolved: pytorch#2117

Resolves issues around CUDA streams / NCCL deadlock with autograd.

Basically, creates separate streams per pipelined embedding arch (see the sketch below).
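
A hedged illustration of the idea; the module names and launch helper are placeholders, not the actual torchrec implementation. Each pipelined embedding module gets its own CUDA stream so its backward collective does not serialize with the others on a single stream:
```
import torch

# One dedicated communication stream per pipelined embedding module.
comm_streams = {name: torch.cuda.Stream() for name in ["ebc_0", "ebc_1"]}

def launch_comm(name, comm_fn):
    # Run the module's collective on its own stream; the default stream only
    # synchronizes with it where the result is actually consumed.
    with torch.cuda.stream(comm_streams[name]):
        comm_fn()
```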

Reviewed By: sarckk

Differential Revision: D58220332

fbshipit-source-id: e203acad4a92702b94a42e2106d6de4f5d89e112

v2024.06.24.00

Fwd-Bwd correctness tests for TBEs, kernels (pytorch#2152)

Summary:
Pull Request resolved: pytorch#2152

Adding more tests for kernel coverage, testing inductor compilation and forward-backward numerical correctness.
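
A minimal sketch of this style of check (not the actual test code): compare an inductor-compiled module's forward output and gradients against eager execution:
```
import torch

def check_fwd_bwd(module: torch.nn.Module, sample_input: torch.Tensor) -> None:
    # Eager reference pass.
    eager_out = module(sample_input)
    eager_out.sum().backward()
    eager_grads = [p.grad.clone() for p in module.parameters()]
    for p in module.parameters():
        p.grad = None

    # Inductor-compiled pass, compared numerically against eager.
    compiled = torch.compile(module, backend="inductor")
    compiled_out = compiled(sample_input)
    torch.testing.assert_close(compiled_out, eager_out)
    compiled_out.sum().backward()
    for p, eager_grad in zip(module.parameters(), eager_grads):
        torch.testing.assert_close(p.grad, eager_grad)
```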

Reviewed By: TroyGarden, gnahzg

Differential Revision: D58869080

fbshipit-source-id: 002a41d88b2435fbc97bb71509d3bf1afec89251

v2024.06.17.00

Bump version.txt for 0.8.0 release (pytorch#2121)

Summary:
Pull Request resolved: pytorch#2121

Bump version in main branch for 0.8.0 release

Reviewed By: IvanKobzarev, gnahzg

Differential Revision: D58671454

fbshipit-source-id: 361029726b06b9e580320b1ae3dcf6b86c853db1

v0.8.0-rc1

Update setup and version for release 0.8.0

v2024.06.10.00

Revert _regroup in jagged_tensor (pytorch#2089)

Summary:
Pull Request resolved: pytorch#2089

Fix S422574
Back out D57500720 and D58001114

Post: https://fb.workplace.com/groups/gpuinference/permalink/2814805982001385/
Example failed job: f567662663

Reviewed By: xush6528

Differential Revision: D58310586

fbshipit-source-id: 1deacc6318298bf5c18e024560b86250b64a8709

v2024.06.03.00

unify seq rw input_dist (pytorch#2051)

Summary:
Pull Request resolved: pytorch#2051

* unify unnecessary branching for the input_dist module
* fx-wrap some splits to honor non-optional points (see the sketch below)
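
A hedged sketch of the fx-wrapping technique mentioned above; the helper name and split logic are placeholders. Wrapping keeps `torch.fx` symbolic tracing from inlining the data-dependent split, so it remains a leaf call in the traced graph:
```
import torch
import torch.fx

@torch.fx.wrap
def _split_lengths(lengths: torch.Tensor, num_buckets: int):
    # Data-dependent split that should stay opaque to symbolic tracing.
    return list(torch.tensor_split(lengths, num_buckets))
```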

Reviewed By: jingsh, gnahzg, yumin829928

Differential Revision: D57876357

fbshipit-source-id: 1baeb35e0280f251cf451dc5d65e5a8cab378555