How many targets can Targets handle? #1329

myushen · 2024-09-03T04:58:43Z

Help

I understand and agree to https://books.ropensci.org/targets/help.html.

Description

I have approximately 24k files. Below is a simplified version of the pipeline: it reads the file names from an RDS file, loads the files dynamically, and then branches over them into other targets in subsequent steps. Initially, I processed the pipeline in batches of 2,500 samples each (for instance, the RDS file first lists 2,500 samples, then 5,000, and so on). However, during the second batch, dispatching the data_object branches took an unusually long time, even though this step typically takes just a few seconds. I'm not sure whether this is related to how crew.cluster manages large increments of targets, or whether there is a limit on the number of targets that targets can dispatch.

library(targets)
library(zellkonverter)
library(crew)
library(crew.cluster)

computing_resources <- crew.cluster::crew_controller_slurm(
  slurm_memory_gigabytes_per_cpu = 20,
  slurm_cpus_per_task = 1,
  workers = 50,
  tasks_max = 5,
  verbose = TRUE
)

tar_option_set(
  controller = computing_resources
  #cue = tar_cue(mode = "never")
)

read_data_container <- function(file,
                                container_type = "anndata"){

  switch(container_type,
         "anndata" = zellkonverter::readH5AD(file, reader = "R", use_hdf5 = TRUE, 
                                             obs = FALSE, raw = FALSE, layers = FALSE),
         "sce_rds" = readRDS(file),
         "seurat_rds" = readRDS(file),
         "sce_hdf5" = HDF5Array::loadHDF5SummarizedExperiment(file),
         "seurat_h5" = SeuratDisk::LoadH5Seurat(file)
  )
}


list(
  tar_target(
    file_path, 
    readRDS("input_file.rds") |> as.character() 
  ),
  tar_target(
    data_object,
    read_data_container(file_path, container_type = "anndata"),
    pattern = map(file_path),
    iteration = "list"
  )
  # Other steps that rely on the data_object targets go here.
)
#tar_make(reporter = "verbose_positives")

This screenshot shows that many branches have been dispatched, but none have completed yet.
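
On the question in the title: the performance chapter of the targets manual describes batching, where each target or branch handles many small tasks, as one way to cut per-target overhead. The sketch below shows how the 24k file paths could be grouped so that each branch reads a batch of files instead of one. It is an assumption about how batching might apply here, not something proposed in this thread. `make_batches()`, the batch size of 100, and the file names are hypothetical.

```r
# Hypothetical file list standing in for readRDS("input_file.rds").
file_paths <- sprintf("sample_%05d.h5ad", seq_len(24000))

# Group the paths into consecutive batches of at most `batch_size` files.
make_batches <- function(paths, batch_size) {
  batch_id <- ceiling(seq_along(paths) / batch_size)
  unname(split(paths, batch_id))
}

batches <- make_batches(file_paths, batch_size = 100)
length(batches) # number of branches after batching

# In _targets.R, the batches could then drive the branching (sketch):
# tar_target(file_batches, make_batches(file_paths, 100), iteration = "list"),
# tar_target(
#   data_object,
#   lapply(file_batches, read_data_container, container_type = "anndata"),
#   pattern = map(file_batches),
#   iteration = "list"
# )
```

The trade-off is coarser invalidation: if one file in a batch changes, the whole batch reruns.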

wlandau · 2024-09-04T16:20:42Z
Maintainer

Is the issue that messages like "dispatched branch data_object_dbbcba117fd6da13" and "dispatched branch data_object_7fff03fa2ec02af0" creep along and are slow to print to your R console, or that the underlying work is slow to complete? In the former case, profiling may give us a better idea of what exactly is slowing down execution: https://books.ropensci.org/targets/performance.html#profiling. In the latter case, it could be that your SLURM cluster is busy and all your jobs are waiting in a queue, and monitoring can tell you if your jobs are actually running: https://wlandau.github.io/crew.cluster/index.html#monitoring
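
A minimal sketch of the monitoring step, run from a separate R session on the cluster's login node while the pipeline is going. It assumes the installed version of crew.cluster exports `crew_monitor_slurm()` and that `squeue` is on the PATH. If either assumption fails, the code only prints a message.

```r
# Sketch only. Assumes the installed crew.cluster exports crew_monitor_slurm()
# and that this session runs on a SLURM login node with `squeue` on the PATH.
slurm_available <- requireNamespace("crew.cluster", quietly = TRUE) &&
  "crew_monitor_slurm" %in% getNamespaceExports("crew.cluster") &&
  nzchar(Sys.which("squeue"))

if (slurm_available) {
  monitor <- crew.cluster::crew_monitor_slurm()
  # List the SLURM jobs the monitor can see, to check whether the crew
  # workers are actually running or still waiting in the queue.
  jobs <- monitor$jobs()
  print(jobs)
} else {
  message("No SLURM monitor available in this session; nothing to list.")
}
```

If the workers show up as pending rather than running, the delay is more likely in the scheduler queue than in targets itself.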


myushen Oct 23, 2024
Author

Hi @wlandau, I tried tasks_max = 1, but the hanging issue persists. I recorded a video to better reproduce the problem. Please let me know if you can't access it (https://drive.google.com/file/d/1ssYeV0jGWf4H0MmH183GMk7K7EvZBlLb/view?usp=drive_link).

Also, the function mentioned in the video is below. Basically, we want to know if the tiering is causing the hang.

script

  <tar_pattern> 
  name: sce_transformed_tier_1 
  description:  
  command:
    transform_utility(input_read_RNA_assay = data_object, 
        transform_fx = transform, external_path = "~/scratch/Census_rerun/run1_rds_format//external", 
        data_container_type = "anndata") 
  pattern:
    map(slice(data_object, index = c(1L, 3L, 5L, 10L, 
    14L, 17L, 18L, 20L, 21L, 24L, 25L, 29L, 33L, 34L, 35L, 36L, 46L, 
    49L, 51L, 52L, 53L, 54L, 77L, 86L, 132L, 133L, 138L, 169L, 195L, 
    218L, 222L, 225L, 226L, 227L, 229L, 230L, 231L, 233L, 235L, 256L, 
    277L, 282L, 288L, 312L, 313L, 314L, 315L, 316L, 317L, 318L, 319L, 
    320L, 325L, 326L, 330L, 351L, 362L, 363L, 372L, 403L, 416L, 455L, 
    463L, 465L, 466L, 475L, 526L, 530L, 540L, 542L, 546L, 548L, 549L, 
    555L, 567L, 569L, 572L, 615L, 616L, 617L, 618L, 619L, 620L, 621L, 
    622L, 623L, 624L, 625L, 626L, 627L, 628L, 629L, 630L, 631L, 632L, 
    633L, 634L, 635L, 636L, 637L, 638L, 639L, 640L, 641L, 642L, 643L, 
    644L, 645L, 646L, 647L, 648L, 649L, 650L, 651L, 652L, 653L, 654L, 
    655L, 656L, 657L, 658L, 659L, 660L, 661L, 662L, 663L, 664L, 665L, 
    666L, 667L, 668L, 669L, 670L, 671L, 672L, 673L, 674L, 675L, 676L, 
    677L, 678L, 679L, 680L, 681L, 682L, 683L, 684L, 685L, 686L, 687L, 
    688L, 689L, 690L, 691L, 692L, 693L, 694L, 695L, 696L, 697L, 698L, 
    699L, 700L, 701L, 702L, 703L, 704L, 705L, 706L, 707L, 708L, 709L, 
    710L, 711L, 712L, 713L, 714L, 715L, 716L, 717L, 718L, 719L, 720L, 
    721L, 722L, 723L, 724L, 725L, 726L, 727L, 728L, 729L, 730L, 731L, 
    732L, 733L, 734L, 735L, 736L, 737L, 738L, 739L, 740L, 741L, 742L, 
    743L, 744L, 745L, 746L, 747L, 748L, 749L, 750L, 751L, 752L, 753L, 
    754L, 755L, 756L, 757L, 758L, 759L, 760L, 761L, 762L, 763L, 764L, 
    765L, 766L, 767L, 768L, 769L, 770L, 771L, 772L, 773L, 774L, 775L, 
    776L, 777L, 778L, 779L, 780L, 781L, 782L, 783L, 784L, 785L, 786L, 
    787L, 788L, 789L, 790L, 791L, 792L, 793L, 794L, 795L, 796L, 797L, 
    798L, 799L, 800L, 801L, 802L, 803L, 804L, 805L, 806L, 807L, 808L, 
    809L, 810L, 811L, 812L, 813L, 814L, 815L, 816L, 817L, 818L, 819L, 
    820L, 821L, 822L, 823L, 824L, 825L, 826L, 827L, 828L, 829L, 830L, 
    831L, 832L, 833L, 834L, 835L, 836L, 837L, 838L, 839L, 840L, 841L, 
    842L, 843L, 844L, 845L, 846L, 847L, 848L, 849L, 850L, 851L, 852L, 
    853L, 854L, 855L, 856L, 857L, 858L, 859L, 860L, 861L, 862L, 863L, 
    864L, 865L, 866L, 867L, 868L, 869L, 870L, 871L, 872L, 873L, 874L, 
    875L, 876L, 877L, 878L, 879L, 880L, 881L, 882L, 883L, 884L, 885L, 
    886L, 887L, 888L, 889L, 890L, 891L, 892L, 893L, 894L, 895L, 896L, 
    897L, 898L, 899L, 900L, 901L, 902L, 903L, 904L, 905L, 906L, 907L, 
    908L, 909L, 910L, 911L, 912L, 913L, 914L, 915L, 916L, 917L, 918L, 
    919L, 920L, 921L, 922L, 923L, 924L, 925L, 926L, 927L, 928L, 929L, 
    930L, 931L, 932L, 933L, 934L, 935L, 936L, 937L, 938L, 939L, 940L, 
    941L, 942L, 943L, 944L, 945L, 946L, 947L, 948L, 949L, 950L, 951L, 
    952L, 953L, 954L, 955L, 956L, 957L, 958L, 959L, 960L, 961L, 962L, 
    963L, 964L, 965L, 966L, 967L, 968L, 969L, 970L, 971L, 972L, 973L, 
    974L, 975L, 976L, 977L, 978L, 979L, 980L, 981L, 982L, 983L, 984L, 
    985L, 986L, 987L, 988L, 989L, 990L, 991L, 992L, 993L, 994L, 995L, 
    996L, 997L, 998L, 999L, 1000L)), slice(transform, index = c(1L, 
    3L, 5L, 10L, 14L, 17L, 18L, 20L, 21L, 24L, 25L, 29L, 33L, 34L, 
    35L, 36L, 46L, 49L, 51L, 52L, 53L, 54L, 77L, 86L, 132L, 133L, 
    138L, 169L, 195L, 218L, 222L, 225L, 226L, 227L, 229L, 230L, 231L, 
    233L, 235L, 256L, 277L, 282L, 288L, 312L, 313L, 314L, 315L, 316L, 
    317L, 318L, 319L, 320L, 325L, 326L, 330L, 351L, 362L, 363L, 372L, 
    403L, 416L, 455L, 463L, 465L, 466L, 475L, 526L, 530L, 540L, 542L, 
    546L, 548L, 549L, 555L, 567L, 569L, 572L, 615L, 616L, 617L, 618L, 
    619L, 620L, 621L, 622L, 623L, 624L, 625L, 626L, 627L, 628L, 629L, 
    630L, 631L, 632L, 633L, 634L, 635L, 636L, 637L, 638L, 639L, 640L, 
    641L, 642L, 643L, 644L, 645L, 646L, 647L, 648L, 649L, 650L, 651L, 
    652L, 653L, 654L, 655L, 656L, 657L, 658L, 659L, 660L, 661L, 662L, 
    663L, 664L, 665L, 666L, 667L, 668L, 669L, 670L, 671L, 672L, 673L, 
    674L, 675L, 676L, 677L, 678L, 679L, 680L, 681L, 682L, 683L, 684L, 
    685L, 686L, 687L, 688L, 689L, 690L, 691L, 692L, 693L, 694L, 695L, 
    696L, 697L, 698L, 699L, 700L, 701L, 702L, 703L, 704L, 705L, 706L, 
    707L, 708L, 709L, 710L, 711L, 712L, 713L, 714L, 715L, 716L, 717L, 
    718L, 719L, 720L, 721L, 722L, 723L, 724L, 725L, 726L, 727L, 728L, 
    729L, 730L, 731L, 732L, 733L, 734L, 735L, 736L, 737L, 738L, 739L, 
    740L, 741L, 742L, 743L, 744L, 745L, 746L, 747L, 748L, 749L, 750L, 
    751L, 752L, 753L, 754L, 755L, 756L, 757L, 758L, 759L, 760L, 761L, 
    762L, 763L, 764L, 765L, 766L, 767L, 768L, 769L, 770L, 771L, 772L, 
    773L, 774L, 775L, 776L, 777L, 778L, 779L, 780L, 781L, 782L, 783L, 
    784L, 785L, 786L, 787L, 788L, 789L, 790L, 791L, 792L, 793L, 794L, 
    795L, 796L, 797L, 798L, 799L, 800L, 801L, 802L, 803L, 804L, 805L, 
    806L, 807L, 808L, 809L, 810L, 811L, 812L, 813L, 814L, 815L, 816L, 
    817L, 818L, 819L, 820L, 821L, 822L, 823L, 824L, 825L, 826L, 827L, 
    828L, 829L, 830L, 831L, 832L, 833L, 834L, 835L, 836L, 837L, 838L, 
    839L, 840L, 841L, 842L, 843L, 844L, 845L, 846L, 847L, 848L, 849L, 
    850L, 851L, 852L, 853L, 854L, 855L, 856L, 857L, 858L, 859L, 860L, 
    861L, 862L, 863L, 864L, 865L, 866L, 867L, 868L, 869L, 870L, 871L, 
    872L, 873L, 874L, 875L, 876L, 877L, 878L, 879L, 880L, 881L, 882L, 
    883L, 884L, 885L, 886L, 887L, 888L, 889L, 890L, 891L, 892L, 893L, 
    894L, 895L, 896L, 897L, 898L, 899L, 900L, 901L, 902L, 903L, 904L, 
    905L, 906L, 907L, 908L, 909L, 910L, 911L, 912L, 913L, 914L, 915L, 
    916L, 917L, 918L, 919L, 920L, 921L, 922L, 923L, 924L, 925L, 926L, 
    927L, 928L, 929L, 930L, 931L, 932L, 933L, 934L, 935L, 936L, 937L, 
    938L, 939L, 940L, 941L, 942L, 943L, 944L, 945L, 946L, 947L, 948L, 
    949L, 950L, 951L, 952L, 953L, 954L, 955L, 956L, 957L, 958L, 959L, 
    960L, 961L, 962L, 963L, 964L, 965L, 966L, 967L, 968L, 969L, 970L, 
    971L, 972L, 973L, 974L, 975L, 976L, 977L, 978L, 979L, 980L, 981L, 
    982L, 983L, 984L, 985L, 986L, 987L, 988L, 989L, 990L, 991L, 992L, 
    993L, 994L, 995L, 996L, 997L, 998L, 999L, 1000L))) 
  format: rds 
  repository: local 
  iteration method: list 
  error mode: stop 
  memory mode: persistent 
  storage mode: main 
  retrieval mode: main 
  deployment mode: worker 
  priority: 0 
  resources:
    crew: <environment> 
  cue:
    mode: thorough
    command: TRUE
    depend: TRUE
    format: TRUE
    repository: TRUE
    iteration: TRUE
    file: TRUE
    seed: TRUE 
  packages:
    crew.cluster
    crew
    HPCell
    testthat
    shinyBS
    fs
    CuratedAtlasQueryR
    stringr
    targets
    plotly
    ggplot2
    tidyr
    zellkonverter
    purrr
    tibble
    SingleCellExperiment
    SummarizedExperiment
    Biobase
    GenomicRanges
    GenomeInfoDb
    IRanges
    S4Vectors
    BiocGenerics
    stats4
    MatrixGenerics
    matrixStats
    arrow
    dplyr
    glue
    stats
    graphics
    grDevices
    utils
    datasets
    methods
    base 
  library:
    NULL

wlandau Oct 23, 2024
Maintainer

Maybe tiering or slicing is related to the hanging, but it's hard to speculate because of how much https://github.com/MangiolaLaboratory/HPCell adds on top of targets and crew, not to mention the HPC system you are using. To untangle all the different reasons that might be causing problems, I would need to run your pipeline and investigate it empirically using the techniques at https://books.ropensci.org/targets/debugging.html and more: browser() statements, profiling, ad hoc experiments to test potential explanations in isolation, etc. I do not have access to all your code on your specific HPC environment, but a reprex could be used instead. A true reprex:

- reproduces the hanging you see, and
- is runnable on a different computer than yours.

I do realize how hard it might be to create a true reprex because of how much more https://github.com/MangiolaLaboratory/HPCell does than targets alone. So one easy alternative to try first is to profile the pipeline right up until it hangs. That way, we might be able to find out which function in targets is running when the hanging happens. For example, if it takes 1 minute to start hanging, then you might run:

proffer::pprof(
  R.utils::withTimeout(
    targets::tar_make(callr_function = NULL),
    timeout = 300,
    onTimeout = "silent"
  )
)

This should run the pipeline for 5 minutes (4 of which should hang). Then, the proffer flame graph should show a long time spent on whatever function targets is hanging at. If there really is a bottleneck in the local process of targets, the profiling data might show a useful clue. But it's entirely possible that I will need a reprex to investigate further. For details on proffer and how to use it, please see https://r-prof.github.io/proffer/. I recommend using the development version (remotes::install_github("r-prof/proffer")).
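
If proffer or pprof cannot be installed on the cluster, base R's `Rprof()` and `summaryRprof()` give a coarser, table-based view of the same kind of data. The sketch below profiles a stand-in workload. `slow_step()` is a hypothetical placeholder for `targets::tar_make(callr_function = NULL)`, which would need to run in the same R process for the profiler to see it.

```r
# Hypothetical stand-in for targets::tar_make(callr_function = NULL):
# keeps the CPU busy for roughly `seconds` seconds.
slow_step <- function(seconds = 0.5) {
  start <- proc.time()[["elapsed"]]
  while (proc.time()[["elapsed"]] - start < seconds) {
    sqrt(runif(1e4))
  }
  invisible(NULL)
}

profile_file <- tempfile(fileext = ".out")

Rprof(profile_file, interval = 0.01) # start sampling the call stack
slow_step(0.5)
Rprof(NULL)                          # stop sampling

# Summarize the samples: by.total ranks functions by time spent in them
# and in everything they call.
profile_summary <- summaryRprof(profile_file)
head(profile_summary$by.total)
```

In a real run, the functions at the top of `by.total` would be the first place to look for the bottleneck.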

wlandau Oct 23, 2024
Maintainer

> Maybe tiering or slicing is related to the hanging,

To test this, I created a concise true reprex that anyone can run on any modern computer and does not make assumptions about the existence of HPC resources or specific local files on disk.

library(targets)
library(tibble)
list(
  tar_target(data, tibble(x = seq_len(1e4))),
  tar_target(slice, data, pattern = slice(data, index = seq_len(1e4)))
)

The pipeline completed easily in about 2.3 minutes. So unless the problem requires more than 1e4 rows/branches to reproduce, I do not think this is a problem with slicing data frames or pattern = slice() in general.

stemangiola Oct 24, 2024

We are now investigating whether Mengyuan's environment is the cause. We are trying to replicate the hang on other machines around the lab. News soon.

wlandau Oct 30, 2024
Maintainer

Does the hanging happen when you use tar_make() without crew or crew.cluster? c.f. #1360 (reply in thread)
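
A sketch of how that comparison could be set up, reusing the reprex from earlier in the thread as a placeholder pipeline. The two variants are an assumption about a reasonable way to isolate the layers, and `workers = 2` is an arbitrary choice.

```r
# _targets.R
library(targets)
library(tibble)

# Variant 1: no controller. Leave out tar_option_set(controller = ...)
# entirely, so tar_make() runs every target in one local R process.

# Variant 2: crew without crew.cluster. Uncomment to use local workers
# instead of SLURM jobs.
# tar_option_set(controller = crew::crew_controller_local(workers = 2))

list(
  tar_target(data, tibble(x = seq_len(1e4))),
  tar_target(slice, data, pattern = slice(data, index = seq_len(1e4)))
)
```

If the hang disappears in variant 1 but returns in variant 2, crew is the more likely place to look. If it only appears with the SLURM controller, the cluster or crew.cluster configuration is the better suspect.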
