Update from grunwaldlab fork #88

Closed
wants to merge 609 commits into from
609 commits
e3e6494
fix bug that ignored groups using only refseq for references
zachary-foster Mar 6, 2024
5b29fcc
fix bug
zachary-foster Mar 6, 2024
3f04e70
update docker file for main report
zachary-foster Mar 6, 2024
caef715
Adds more detail to the status table.
cahuparo Mar 6, 2024
10ae874
add missing dependency for adgenet in report docker container
zachary-foster Mar 6, 2024
536cf0b
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Mar 6, 2024
f3b8f23
Fix color-by typo in sample data sheet
Mar 10, 2024
7539f83
Update report with MSN function
Mar 10, 2024
52e1507
Update MSN function so it is more generalizable
Mar 13, 2024
4b89bcf
Update MSN
Mar 13, 2024
1fe4fff
add abstract
zachary-foster Mar 14, 2024
9133415
fix fake warning
zachary-foster Mar 14, 2024
6639891
update report dockerfile
zachary-foster Mar 14, 2024
b03ceae
add test dataset
zachary-foster Mar 14, 2024
8cfd98a
specify resources for some processes
zachary-foster Mar 14, 2024
0f932a9
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Mar 14, 2024
7ccff10
add bakta --skip-crispr
zachary-foster Mar 14, 2024
14143c3
Update main report doc
Mar 15, 2024
90bfe88
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
Mar 15, 2024
b4b9bb8
update report
Mar 15, 2024
5fc2749
read2tree formater step 1
ricardoi Mar 15, 2024
adbfb75
faa and fna format
ricardoi Mar 15, 2024
f985faa
fix OOM error for vcf_concatenate:
masudermann Mar 15, 2024
80ca84a
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
masudermann Mar 18, 2024
f9b7ede
Update Kpneum test database-remove two bad samples
masudermann Mar 18, 2024
0063989
update test datasets
zachary-foster Mar 20, 2024
280115b
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Mar 20, 2024
95b811c
remove debug print statements
zachary-foster Mar 25, 2024
9f68129
remove whitespace around report group ids
zachary-foster Mar 25, 2024
e8e7b7e
fix bug with drop=FALSE
cahuparo Mar 26, 2024
b53dfb0
Fix typo in metadata-nursery not nusery, otherwise we get error in MSN
Mar 29, 2024
b485248
Fix typos in metadata
Mar 29, 2024
ae0077f
Update MSN portion of report to handle different scenarios, like not …
Mar 29, 2024
280e715
Fix tabs-MSN outputs
Mar 29, 2024
be21434
Update main report
Mar 29, 2024
b0f5fbe
debug outputs for msn fxn
Mar 30, 2024
11bb200
Revise MSN code so network headers and tabs render properly
Apr 1, 2024
59030bf
trying to fix report not being run if there are no status messages
zachary-foster Apr 1, 2024
81e7655
added new presenation
zachary-foster Apr 1, 2024
233d707
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Apr 1, 2024
9908e28
fix typo in test metadata
zachary-foster Apr 1, 2024
0c11d96
add temp_dir option to specify location of temporary files for PICARD…
zachary-foster Apr 1, 2024
007a8dd
Upate MSN code chunk-still bizarre formatting concerns
Apr 2, 2024
273b02a
Add header message back to MSN plots
Apr 2, 2024
b708819
beginning r2tdir
ricardoi Apr 2, 2024
e45f727
ammended r2tdir
ricardoi Apr 2, 2024
741673f
add warnings when samples are removed from the core gene phylogeny
zachary-foster Apr 3, 2024
cec3b07
removed install code for psminer from report
zachary-foster Apr 3, 2024
0eec81f
add a way to specify the temporary directory for some processes
zachary-foster Apr 3, 2024
122cdaf
merge
zachary-foster Apr 3, 2024
71183f0
update test datasets
zachary-foster Apr 4, 2024
97bbb6e
update citation
zachary-foster Apr 4, 2024
add6885
Add new mixed bacterial test dataset
Apr 9, 2024
2ae4c45
updated report to add dropdown menu to select MSN
zachary-foster Apr 9, 2024
243acc4
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Apr 9, 2024
547b7a2
fixed bug when running single sample
zachary-foster Apr 9, 2024
80aad57
Revised mixed dataset
Apr 10, 2024
2cc0a75
revised mixed dataset
Apr 10, 2024
17f46b5
inital support of long reads by cutting into 150mers
zachary-foster Apr 10, 2024
03a6267
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Apr 10, 2024
4d49a3f
fixed bug
zachary-foster Apr 10, 2024
d166fd5
Revise mixed dataset again
Apr 10, 2024
66b9cc2
minor
zachary-foster Apr 10, 2024
1ab224b
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Apr 10, 2024
16eb838
starting to add long read support
zachary-foster Apr 10, 2024
263e2c0
updated main report docker image
zachary-foster Apr 10, 2024
d84ed4b
update main report docker images
zachary-foster Apr 10, 2024
69fc78c
add header to MSN plots
zachary-foster Apr 12, 2024
16efac9
change id to fasta
ricardoi Apr 12, 2024
72aa5e2
change fasta to buscoid
ricardoi Apr 12, 2024
0f4d6b0
rtbinder.py to bin
ricardoi Apr 12, 2024
da37e9d
busco binder module
ricardoi Apr 12, 2024
a388aa8
Create boxwood.csv
cahuparo Apr 12, 2024
3834010
Create fungi_n81.config
cahuparo Apr 12, 2024
8f34112
Update fungi_n81.config
cahuparo Apr 12, 2024
cb9a3fb
r2tbinder update $markers
ricardoi Apr 12, 2024
7b74c02
Quick fix to MSN headers. Should be number of samples in snp alignmen…
Apr 12, 2024
d0d9475
Remove loading psminer package from local directory
Apr 12, 2024
1fc84e5
added flye for genome assembly of long reads
zachary-foster Apr 12, 2024
f6989a6
added option to change minimum ANI between samples and selected refer…
zachary-foster Apr 12, 2024
019aa6d
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Apr 12, 2024
9bd983c
rt2bin working
ricardoi Apr 15, 2024
4d1211f
make temp_dir option apply to SRATOOLS_FASTERQDUMP
zachary-foster Apr 16, 2024
2902063
Allows more flexibility in column names from user input data
Apr 19, 2024
07e6dd0
change input format for reads; add required reads_type column
zachary-foster Apr 20, 2024
8a99a86
add test dataset for nanopore
zachary-foster Apr 22, 2024
80b46f2
Update mixed_bacteria.config
masudermann Apr 24, 2024
f7fdd51
add channels to run read2tree
zachary-foster Apr 25, 2024
7fd912b
fix no messages causing report not to render
zachary-foster Apr 26, 2024
0befe77
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Apr 26, 2024
8b0519b
Update pathogensurveillance.nf
logankblair Apr 30, 2024
8ac1836
Update assign_group_reference.R
logankblair Apr 30, 2024
db07dff
Update subset_core_gene.R
logankblair Apr 30, 2024
8534fe3
added number of variants used in SNP alignments to report output
zachary-foster May 1, 2024
f84c768
split cluster work in progress
May 2, 2024
36fe7e2
test
May 2, 2024
dd34706
working on integrating core gene clustering into the rest of the pipl…
zachary-foster May 2, 2024
d77817b
downgraded bakta version to make it work on nanopore data
zachary-foster May 6, 2024
624498a
ignore lab meeting presentations
zachary-foster May 6, 2024
0559d54
new clustering method to select core genes
zachary-foster May 8, 2024
8061181
working on integrating clustering of core genes into the rest of the …
zachary-foster May 8, 2024
76813fc
subsetting of core gene phylgoeny seems to be working
zachary-foster May 8, 2024
1329adb
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster May 8, 2024
d7f2e88
merge
zachary-foster May 8, 2024
5d70c38
Update modules.config to include improved PIRATE args
cahuparo May 8, 2024
33b389e
Update modules.config
cahuparo May 8, 2024
4745a0a
added draft code for selecting a representative subset of downloaded …
zachary-foster May 9, 2024
5e0a81c
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster May 9, 2024
569f72c
}
zachary-foster May 9, 2024
92c6f31
fixes to selecting genera for each family
zachary-foster May 10, 2024
0829dd5
read2tree seems to be working
zachary-foster May 10, 2024
98e67e6
Update references.bib
logankblair May 14, 2024
7e925df
Update references.bib
logankblair May 14, 2024
c750cce
Update references.bib
logankblair May 14, 2024
8350fd7
?
zachary-foster May 15, 2024
982b39d
merge
zachary-foster May 15, 2024
46bc7e1
Main report citations
logankblair May 15, 2024
c78f09e
New citation instructions in readme
logankblair May 16, 2024
f01f53b
Changed MAFFT conda channel to conda-forge and version to 7.526
cahuparo May 16, 2024
867bccc
added subsetting for contextual references
zachary-foster May 17, 2024
5a34d82
Update mixed.csv
masudermann May 21, 2024
7d49470
fixed bwa mem failing when there are more than 2 read files for SRA d…
zachary-foster May 21, 2024
11c6f53
working on making subset references be used instead of all references
zachary-foster May 23, 2024
47f561e
added journal
zachary-foster May 23, 2024
8802aed
Storage improvements
logankblair May 24, 2024
cf2b6ff
journal update
zachary-foster May 28, 2024
56238d6
merge
zachary-foster May 28, 2024
7819eda
working on new metadata format
zachary-foster May 30, 2024
26ff2b0
working on adding reference metadata validation
zachary-foster May 31, 2024
911ad91
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster May 31, 2024
6c351cf
integrating NCBI queries into the input CSV processing
zachary-foster May 31, 2024
2806499
updated main report docker container
zachary-foster May 31, 2024
edfb350
midwork
zachary-foster May 31, 2024
147140d
add unzip to download_assemblies.nf conda environment
zachary-foster May 31, 2024
84f6d63
add journal
zachary-foster Jun 1, 2024
b164dda
add proof of concept docker file for aps workshop
zachary-foster Jun 1, 2024
1ac112e
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Jun 1, 2024
b37c4d5
allow reference data to be specified in the sample data CSV
zachary-foster Jun 1, 2024
1cc3ca5
added code to look up sequence type for accessions with one defined
zachary-foster Jun 1, 2024
ee2d4ff
added code to validate reference usage columns
zachary-foster Jun 1, 2024
2dad0a1
add code to validate the ploidy column
zachary-foster Jun 1, 2024
6048986
add defaults for some columns and validate color_by column
zachary-foster Jun 2, 2024
27a57cc
added limits to ncbi_query columns
zachary-foster Jun 2, 2024
e5e8b15
added description of new columns to the README
zachary-foster Jun 3, 2024
8f41eca
changed modules.config to publish PREPARE_REPORT_INPUT
logankblair Jun 3, 2024
fa3c6a4
Create aps.config
cahuparo Jun 3, 2024
dc5d422
starting to apply changes to input format to workflows
zachary-foster Jun 3, 2024
7873306
Update nextflow.config
cahuparo Jun 4, 2024
9f633ec
working on apply input format changes
zachary-foster Jun 4, 2024
fddc621
Fix conf file
masudermann Jun 4, 2024
2ecaba5
midwork
zachary-foster Jun 4, 2024
317e4be
trying to get gitpod working
zachary-foster Jun 4, 2024
9295599
fix bug from location of check_prior on new nextflow version
zachary-foster Jun 4, 2024
a317a28
midwork
zachary-foster Jun 6, 2024
ff159ae
midwork
zachary-foster Jun 11, 2024
ed89eb5
draft docker container to combine trim-low-abund and sourmash compare
zachary-foster Jun 11, 2024
3b9be31
midwork
zachary-foster Jun 12, 2024
a79d335
added nanoplot
zachary-foster Jun 13, 2024
6d66768
got assign_mapping_referenece working
zachary-foster Jun 14, 2024
f39b864
midowrk
zachary-foster Jun 18, 2024
eaeb0cc
make fasterqdump work with storeDir directive
zachary-foster Jun 20, 2024
92c8e7f
Change defualt baktadb to light
zachary-foster Jun 21, 2024
bfc912c
add command line option for which baktadb to use
zachary-foster Jun 21, 2024
180211e
variant analysis updated
zachary-foster Jun 23, 2024
9ca3a13
midwork:
zachary-foster Jun 24, 2024
73f5c17
use nf-core/gitpod for workshop
zachary-foster Jun 24, 2024
90fb4c9
replace depreciated docker option
zachary-foster Jun 25, 2024
87dfad2
core gene phylogeny subworkflow updated
zachary-foster Jun 25, 2024
d22cc95
working on updating busco analysis
zachary-foster Jun 25, 2024
5b3b777
busco analysis updated
zachary-foster Jun 25, 2024
b47a032
midwork
zachary-foster Jun 25, 2024
bbbe66d
midwork
zachary-foster Jun 25, 2024
5608b01
add quast back in
zachary-foster Jun 25, 2024
cf0ef70
preparing inputs for the main report
zachary-foster Jun 26, 2024
e42db5f
midwork
zachary-foster Jun 27, 2024
0c916dd
report input preparation working
zachary-foster Jun 27, 2024
91be535
make multiqc output named by report group
zachary-foster Jun 27, 2024
d2e50e6
midwork
zachary-foster Jun 28, 2024
a0d67bc
fix kingdom parsing and cache bug
zachary-foster Jul 1, 2024
6ad23ea
merge
zachary-foster Jul 1, 2024
324c37e
update out_dir name
zachary-foster Jul 1, 2024
a53f289
updated all test datasets to new input format
zachary-foster Jul 2, 2024
85a6ece
add xan small outputs to _test_data dir
masudermann Jul 3, 2024
f53c0ee
add ramorum output dirs to _test_data
masudermann Jul 3, 2024
b81c5a8
updated report outputs
masudermann Jul 3, 2024
5c0c84d
fix metadata file path in config file
masudermann Jul 4, 2024
95ae7cc
fix error when essages are empty
masudermann Jul 5, 2024
c357b69
add mycobacteroides test datase to _test_data
masudermann Jul 5, 2024
0e7e0cb
add xan dataset to _test_data dir
masudermann Jul 5, 2024
febc302
fix busco r2tf bug
zachary-foster Jul 5, 2024
726116f
Merge branch 'dev' of github.com:grunwaldlab/pathogensurveillance int…
zachary-foster Jul 5, 2024
8c046c8
add another test dataset output
masudermann Jul 6, 2024
8aa6281
fix mixed.csv file so basil powdery mildew is genomic dna
masudermann Jul 6, 2024
0d2ebce
revise metadata sheet
masudermann Jul 6, 2024
19278e1
revise mixed bacteria conf file
masudermann Jul 6, 2024
cf25bd7
incorporate mixed bacteria dataset
masudermann Jul 6, 2024
ee15b99
add mixed_bacteria outputs to _test_data
masudermann Jul 8, 2024
bbbf1b7
minimal update of report code
zachary-foster Jul 8, 2024
263c9f5
merge
zachary-foster Jul 8, 2024
0b10b54
make report ignore messages when there are none
zachary-foster Jul 8, 2024
40b245d
Update main_report.nf
cahuparo Jul 8, 2024
120feab
Update main_report.nf
cahuparo Jul 8, 2024
19cfee5
minor
zachary-foster Jul 8, 2024
3219db8
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Jul 8, 2024
5bb20bc
Update main_report.nf
cahuparo Jul 9, 2024
c5691bd
Update main_report.nf
cahuparo Jul 9, 2024
2a05435
Update index.qmd to include installation of packages not in conda.
cahuparo Jul 9, 2024
9a33a5b
Update busco_download.nf
cahuparo Jul 11, 2024
4665025
making changes to track changes in psminer
zachary-foster Jul 12, 2024
2909aec
making changes to track changes in psminer
zachary-foster Jul 12, 2024
cbfdd32
finshed replacing path handling code with psminer functions
zachary-foster Jul 12, 2024
979c208
update metadata sheets
masudermann Jul 15, 2024
e33d3c9
add new dir to _test_data
masudermann Jul 15, 2024
a698ae8
added cache_type option
zachary-foster Jul 16, 2024
0a9eda6
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Jul 16, 2024
d768f89
fix error when looking up sequence type
zachary-foster Jul 16, 2024
96bbefa
increase SRA download limt
zachary-foster Jul 17, 2024
1908574
Remove limit on number of genomes/reads downloaded
zachary-foster Jul 17, 2024
c59d153
dont download SRA reads with very low coverage
zachary-foster Jul 17, 2024
299ea3a
fix NCBI query error caused by 1 vs 0 based indexing
zachary-foster Jul 18, 2024
880ae26
Fix bug when ncbi_accession is the only column in input.csv
logankblair Jul 18, 2024
3655295
updates for new psminer plotting
zachary-foster Jul 19, 2024
5eb39cf
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Jul 19, 2024
e2b4595
add aps nanopore data
zachary-foster Jul 19, 2024
0f48504
add aps nanopore data
zachary-foster Jul 19, 2024
6316ed5
fix aps profile
zachary-foster Jul 19, 2024
478d374
updated SNP tree plotting
zachary-foster Jul 19, 2024
39741e5
MSN code more updated and mostly working
zachary-foster Jul 22, 2024
cf7ffd0
updates
zachary-foster Jul 22, 2024
6540ecc
improve scaling of zooming plots
zachary-foster Jul 22, 2024
3bb7f00
updated status messsage printing
zachary-foster Jul 23, 2024
669ad49
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Jul 23, 2024
2c38372
ignore rdata
zachary-foster Jul 23, 2024
8c48b75
ignore rdata
zachary-foster Jul 23, 2024
eff6191
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Jul 23, 2024
6ffb823
remove fungal sample
zachary-foster Jul 23, 2024
d3b182c
allow bgiseq input
zachary-foster Jul 23, 2024
d9cb77b
comment out test paths
zachary-foster Jul 24, 2024
0598f96
adding data to workshop docker image
zachary-foster Jul 24, 2024
8090ce4
update multiqc docker container
zachary-foster Jul 24, 2024
4311cf1
disable quarto cacheing to avoid permissions error on gitpod
zachary-foster Jul 24, 2024
b2f0bd3
trying to fix gitpod error
zachary-foster Jul 24, 2024
e94e7af
fixing conda erros on gitpod
zachary-foster Jul 24, 2024
ffd9254
fix mafft conda error on gitpod
zachary-foster Jul 24, 2024
d07cd8b
update conda dependencies for main report
zachary-foster Jul 24, 2024
df2c6b1
workshop release
zachary-foster Jul 25, 2024
1af7320
add nextflow to gc env
zachary-foster Jul 25, 2024
1716a07
typo fix
zachary-foster Jul 25, 2024
b1998ad
reduced dataset size
zachary-foster Jul 27, 2024
8d39d84
Merge branch 'master' of github.com:grunwaldlab/pathogensurveillance
zachary-foster Jul 27, 2024
7 changes: 6 additions & 1 deletion .gitignore
@@ -13,4 +13,9 @@ null/
*.Rproj
.Rhistory
assets/*.html
README.html
README.html
.~*
seqtk_sample/
docs/lab_meetings
path_surveil_data
.RData
72 changes: 65 additions & 7 deletions README.md
@@ -1,4 +1,9 @@
# ![nf-core/pathogensurveillance](docs/images/nf-core-pathogensurveillance_logo_light.png#gh-light-mode-only) ![nf-core/pathogensurveillance](docs/images/nf-core-pathogensurveillance_logo_dark.png#gh-dark-mode-only)
<picture>
<source media="(prefers-color-scheme: dark)" srcset="docs/images/nf-core-pathogensurveillance_logo_dark.png">
<source media="(prefers-color-scheme: light)" srcset="docs/images/nf-core-pathogensurveillance_logo_light.png">
<img alt="nf-core/pathogensurveillance" src="docs/images/nf-core-pathogensurveillance_logo_light.png">
</picture>


[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/pathogensurveillance/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)

@@ -17,14 +22,18 @@
<!-- TODO nf-core: Write a 1-2 sentence summary of what data the pipeline is for and what it does -->

**nf-core/pathogensurveillance** is a population genomic pipeline for pathogen diagnosis, variant detection, and biosurveillance.
The pipeline accepts the paths to raw reads for one or more organisms and creates reports in the form of interactive HTML reports or PDF documents.
Significant features include the ability to analyze unidentified eukaryotic and prokaryotic samples, creation of reports for multiple user-defined groupings of samples, automated discovery and downloading of reference assemblies from NCBI RefSeq, and rapid initial identification based on k-mer sketches followed by a more robust core genome phylogeny.
The pipeline accepts the paths to raw reads for one or more organisms (in the form of a CSV file) and creates reports in the form of interactive HTML reports or PDF documents.
Significant features include the ability to analyze unidentified eukaryotic and prokaryotic samples, creation of reports for multiple user-defined groupings of samples, automated discovery and downloading of reference assemblies from NCBI RefSeq, and rapid initial identification based on k-mer sketches followed by a more robust core genome phylogeny and SNP-based phylogeny.

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
It uses Docker/Singularity containers making installation trivial and results highly reproducible.
The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

<!-- TODO nf-core: Add full-sized test dataset and amend the paragraph below if applicable -->

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world data sets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/pathogensurveillance/results).
On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure.
This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world data sets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/pathogensurveillance/results).

## Pipeline summary

@@ -39,7 +48,7 @@ On release, automated continuous integration tests run the pipeline on a full-si
3. Download the pipeline and test it on a minimal dataset with a single command:

```bash
nextflow run nf-core/pathogensurveillance -profile test,YOURPROFILE --outdir <OUTDIR>
nextflow run nf-core/pathogensurveillance -profile test,YOURPROFILE --outdir <OUTDIR> -resume
```

Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (`YOURPROFILE` in the example command above). You can chain multiple config profiles in a comma-separated string.
@@ -54,13 +63,62 @@ On release, automated continuous integration tests run the pipeline on a full-si
<!-- TODO nf-core: Update the example "typical command" below used to run the pipeline -->

```bash
nextflow run nf-core/pathogensurveillance --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
nextflow run nf-core/pathogensurveillance --input samplesheet.csv --outdir <OUTDIR> -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> -resume
```

You can also try running a small example dataset hosted with the source code using the following command (no need to download anything):

```bash
nextflow run nf-core/pathogensurveillance --input https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata_small.csv --outdir test_out --download_bakta_db true -profile docker -resume
```



## Documentation

The nf-core/pathogensurveillance pipeline comes with documentation about the pipeline [usage](https://nf-co.re/pathogensurveillance/usage), [parameters](https://nf-co.re/pathogensurveillance/parameters) and [output](https://nf-co.re/pathogensurveillance/output).

### Input format

The primary input to the pipeline is a CSV (comma-separated value) file.
This can be made in a spreadsheet program like LibreOffice Calc or Microsoft Excel by exporting to CSV.
Columns can be in any order and unneeded columns can be left out or left blank.
Column names are case insensitive and spaces are equivalent to underscores.
Only one column is required: a column containing paths to raw sequence data, SRA (Sequence Read Archive) accessions, or NCBI queries to search the SRA. Different samples can have their values in different columns.
Any columns not recognized by `pathogensurveillance` will be ignored, allowing users to adapt an existing sample metadata table by adding new columns.
Below is a description of each column used by `pathogensurveillance`:

* **sample_id**: The unique identifier for each sample. This will be used in file names to distinguish samples in the output. Each sample ID must correspond to a single source of sequence data (e.g. the `path` and `ncbi_accession` columns), although the same sequence data can be used by different IDs. Any values supplied that correspond to different sources of sequence data or contain characters that cannot appear in file names (\/:*?"<>| .) will be modified automatically. If not supplied, it will be inferred from the `path`, `ncbi_accession`, or `name` columns.
* **name**: A human-readable label for the sample that is used in plots and tables. If not supplied, it will be inferred from `sample_id`.
* **path**: Path to input sequence data, typically gzipped FASTQ files. When paired-end sequencing is used, this holds the forward reads and `path_2` holds the reverse reads. This can be a local file path or a URL to an online location. The `sequence_type` column must have a value.
* **path_2**: Path to the FASTQ files for the reverse reads when paired-end sequencing is used. This can be a local file path or a URL to an online location. The `sequence_type` column must have a value.
* **ncbi_accession**: An SRA accession ID for reads to be downloaded and used as samples. Values in the `sequence_type` column will be looked up if not supplied.
* **ncbi_query**: A valid NCBI search query to search the SRA for reads to download and use as samples. This will result in an unknown number of samples being analyzed. The total number downloaded is limited by the `ncbi_query_max` column. Values in the `sample_id`, `name`, and `description` columns will be appended to those supplied by the user. Values in the `sequence_type` column will be looked up and do not need to be supplied by the user.
* **ncbi_query_max**: The maximum number or percentage of samples downloaded for the corresponding query in the `ncbi_query` column. Adding a `%` to the end of a number indicates a percentage of the total number of results instead of a count. A random subset of results will be downloaded if `ncbi_query_max` is less than "100%" or the total number of results.
* **sequence_type**: The type of sequencing used to produce reads for the `path` and `path_2` columns. Valid values include anything containing the words "illumina", "nanopore", or "pacbio". Will be looked up automatically for `ncbi_accession` and `ncbi_query` inputs but must be supplied by the user for `path` inputs.
* **report_group_ids**: How to group samples into reports. For every unique value in this column a report will be generated. Samples can be assigned to multiple reports by separating group IDs by ";". For example `all;subset` will put the sample in both `all` and `subset` report groups. Samples will be added to a default group if this is not supplied.
* **color_by**: The names of other columns that contain values used to color samples in plots and figures in the report. Multiple column names can be separated by ";". Specified columns can contain either categorical factors or specific colors, specified as a hex code. By default, samples will be one color and references another.
* **ploidy**: The ploidy of the sample. Should be a number. Defaults to "1".
* **enabled**: Either "TRUE" or "FALSE", indicating whether the sample should be included in the analysis or not. Defaults to "TRUE".
* **ref_group_ids**: One or more reference group IDs separated by ";". These are used to supply specific references to specific samples. These IDs correspond to IDs listed in the `ref_group_ids` or `ref_id` columns of the reference metadata CSV.
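
The column rules above can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function names, the `"default"` group name, and the choice to round percentages up are assumptions made for illustration only. It shows how case-insensitive headers with spaces could be normalized, how a ";"-separated ID field could be split, and how an `ncbi_query_max` value could be read as a count or a percentage.

```python
import csv
import io
import math


def normalize_column(name: str) -> str:
    """Lower-case a header and treat spaces as underscores."""
    return name.strip().lower().replace(" ", "_")


def split_ids(value: str, default: str = "default") -> list[str]:
    """Split a ';'-separated ID field, trimming whitespace around each ID.

    The fallback group name is hypothetical; the README only says that a
    default group is used when no value is supplied.
    """
    ids = [part.strip() for part in value.split(";") if part.strip()]
    return ids or [default]


def query_limit(value: str, total: int) -> int:
    """Read an ncbi_query_max value as a count or a percentage of `total`.

    Rounding percentages up is an assumption of this sketch.
    """
    value = value.strip()
    if value.endswith("%"):
        return min(total, math.ceil(total * float(value[:-1]) / 100))
    return min(total, int(value))


# A tiny in-memory sample sheet with "human-style" headers.
example = io.StringIO(
    "Sample ID,Report Group IDs,NCBI query max\n"
    "s1,all; subset,25%\n"
    "s2,,3\n"
)
rows = [
    {normalize_column(key): value for key, value in row.items()}
    for row in csv.DictReader(example)
]
for row in rows:
    print(
        row["sample_id"],
        split_ids(row["report_group_ids"]),
        query_limit(row["ncbi_query_max"], total=10),
    )
```

Normalizing headers once, up front, means later lookups can use a single canonical name such as `sample_id` regardless of how the spreadsheet was exported.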

Additionally, users can supply a reference metadata CSV that can be used to assign custom references to particular samples.
References are assigned to samples if they share a reference group ID in the `ref_group_ids` columns that can appear in both input CSVs.
The reference metadata CSV can have the following columns:

* **ref_group_ids**: One or more reference group IDs separated by ";". These are used to group references and supply an ID that can be used in the `ref_group_ids` column of the sample metadata CSV to assign references to particular samples.
* **ref_id**: The unique identifier for each user-defined reference genome. This will be used in file names to distinguish references in the output. Each reference ID must correspond to a single source of reference data (the `ref_path`, `ref_ncbi_accession`, and `ref_ncbi_query` columns), although the same reference data can be used by multiple IDs. Any values that correspond to different sources of reference data or contain characters that cannot appear in file names (\/:*?"<>| .) will be modified automatically. If not supplied, it will be inferred from the `ref_path` or `ref_name` columns, or supplied automatically when `ref_ncbi_accession` or `ref_ncbi_query` are used.
* **ref_name**: A human-readable label for user-defined reference genomes that is used in plots and tables. If not supplied, it will be inferred from `ref_id`. It will be supplied automatically when the `ref_ncbi_query` column is used.
* **ref_description**: A longer human-readable label for user-defined reference genomes that is used in plots and tables. If not supplied, it will be inferred from `ref_name`. It will be supplied automatically when the `ref_ncbi_query` column is used.
* **ref_path**: Path to user-defined reference genomes for each sample. This can be a local file path or a URL to an online location.
* **ref_ncbi_accession**: RefSeq accession ID for a user-defined reference genome. These will be automatically downloaded and used as input.
* **ref_ncbi_query**: A valid NCBI search query to search the assembly database for genomes to download and use as references. This will result in an unknown number of references being downloaded. The total number downloaded is limited by the `ref_ncbi_query_max` column. Values in the `ref_id`, `ref_name`, and `ref_description` columns will be appended to those supplied by the user.
* **ref_ncbi_query_max**: The maximum number or percentage of references downloaded for the corresponding query in the `ref_ncbi_query` column. Adding a `%` to the end of a number indicates a percentage of the total number of results instead of a count. A random subset of results will be downloaded if `ref_ncbi_query_max` is less than "100%" or the total number of results.
* **ref_primary_usage**: Controls how the reference is used in the analysis in cases where a single "best" reference is required, such as for variant calling. Can be one of "optional" (can be used if selected by the analysis), "required" (will always be used), "exclusive" (only those marked "exclusive" will be used), or "excluded" (will not be used).
* **ref_contextual_usage**: Controls how the reference is used in the analysis in cases where multiple references are required to provide context for the samples, such as for phylogeny. Can be one of "optional" (can be used if selected by the analysis), "required" (will always be used), "exclusive" (only those marked "exclusive" will be used), or "excluded" (will not be used).
* **ref_color_by**: The names of other columns that contain values used to color references in plots and figures in the report. Multiple column names can be separated by ";". Specified columns can contain either categorical factors or specific colors, specified as a hex code. By default, samples will be one color and references another.
* **ref_enabled**: Either "TRUE" or "FALSE", indicating whether the reference should be included in the analysis or not. Defaults to "TRUE".
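
One possible reading of the `ref_primary_usage` and `ref_contextual_usage` values is sketched below in Python. This is an illustration rather than the pipeline's implementation: the function name `resolve_usage` and the rule that any "exclusive" entry overrides "required" entries are assumptions, since the description above does not say how the two interact.

```python
VALID_USAGE = {"optional", "required", "exclusive", "excluded"}


def resolve_usage(usage_by_ref: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (forced, candidates) reference IDs for one usage column.

    forced     -- references that must always be used ("required")
    candidates -- references the analysis is allowed to choose from
    """
    unknown = {usage for usage in usage_by_ref.values() if usage not in VALID_USAGE}
    if unknown:
        raise ValueError(f"unrecognized usage value(s): {sorted(unknown)}")

    exclusive = [ref for ref, usage in usage_by_ref.items() if usage == "exclusive"]
    if exclusive:
        # Assumption: any "exclusive" entry restricts the pool to those entries.
        return [], exclusive

    forced = [ref for ref, usage in usage_by_ref.items() if usage == "required"]
    # "excluded" references are never candidates.
    candidates = [
        ref for ref, usage in usage_by_ref.items() if usage in ("optional", "required")
    ]
    return forced, candidates


refs = {"ref_a": "optional", "ref_b": "required", "ref_c": "excluded"}
forced, candidates = resolve_usage(refs)
print("forced:", forced)
print("candidates:", candidates)
```

Keeping the forced set separate from the candidate pool lets the same helper serve both columns: a primary-reference step could pick one entry from the pool, while a contextual step could take all forced entries plus a subset of the rest.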

## Credits

nf-core/pathogensurveillance was originally written by Zachary S.L. Foster, Martha Sudermann, Nicholas C. Cauldron, Fernanda I. Bocardo, Hung Phan, Jeff H. Chang, Niklaus J. Grünwald.
9 changes: 7 additions & 2 deletions assets/main_report/.gitignore
@@ -1,3 +1,8 @@
/.quarto/
_book
_bookdown_files
_main_files
_site
*_files
*.html
*.ipynb
_test_data/**/quast
_test_data/**/multiqc
84 changes: 0 additions & 84 deletions assets/main_report/01-identification.Rmd

This file was deleted.
