DATA

The folder "data" contains the smaller data files for the afternoon session of day 1. However, we advise you to use the data provided on SAGA, as this makes the transfer to your folder much faster. SAGA is also the only place where the larger data files are stored.

SCRIPTS

The folder "scripts" contains all the scripts to be run on SAGA. However, we advise you to copy them from SAGA as described in the exercise below.

LECTURES

The folder "Lectures" contains all the lectures from this session.

Dataset
Contamination

EXERCISE

To get started, copy all data from the folder "Day1Afternoon" to your folder in the project area on SAGA:

cd /cluster/projects/nn9458k/phylogenomics/
mkdir $YOURNAME # Replace the variable $YOURNAME with, for example, your first or last name. This is the folder you will be working in for the next two weeks.
cd $YOURNAME # Replace the variable $YOURNAME with the name you chose in the line above.
cp -r ../week1/Day1Afternoon .
cd Day1Afternoon
cp .ncbirc ~

Run the following script to find the contaminations in your dataset:
```
sbatch --get-user-env sbatch_ContaminationDetection.sh
```
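While the job is running, it can help to check the queue and the job log. The sketch below assumes that the script does not redirect its output, so that SLURM writes to its default log file `slurm-<jobid>.out` in the folder you submitted from; if `sbatch_ContaminationDetection.sh` sets its own `--output`, look for that file instead. `latest_slurm_log` is a hypothetical helper and not part of the course scripts.

```shell
# On SAGA, list your queued and running jobs with:
#   squeue -u "$USER"

# Hypothetical helper: print the newest default SLURM log in a folder
# (assumes the default log name slurm-<jobid>.out).
latest_slurm_log() {
    newest=""
    for f in "${1:-.}"/slurm-*.out; do
        [ -e "$f" ] || continue          # skip the unexpanded pattern if no log exists
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f                    # keep the most recently modified log
        fi
    done
    printf '%s\n' "$newest"
}

log=$(latest_slurm_log .)
if [ -n "$log" ]; then
    tail -n 20 "$log"                    # show the last lines of the newest log
else
    echo "No slurm-*.out log found yet"
fi
```

Once the job no longer shows up in the queue and the log reports no errors, the output files should be in the "Results" folder.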
Go to the "Results" folder, compare "Argentina_sp_contaminated.fasta_nr_RNA_BLAST_Matches.txt" with "Argentina_sp_contaminated.fasta_Taxa_found.txt", and find out which contaminations have been found. Which datasets would be needed, keeping in mind that we artificially added Protodrilus symbioticus as a contamination to the dataset?
```
cd Results
cat Argentina_sp_contaminated.fasta_nr_RNA_BLAST_Matches.txt
```
Take a look at the output before executing the next command. Which hits belong to which dataset? (Hint: Look at the different name structures for the query sequences.)
```
cat Argentina_sp_contaminated.fasta_Taxa_found.txt
cd ..
```
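If the match file is long, a tally of hits per query name can make the different name structures easier to spot. This is only a sketch: it assumes that the match file is tab-separated with the query ID in the first column, which you should confirm from the `cat` output above. `count_hits_per_query` is a hypothetical helper, not one of the course scripts.

```shell
# Hypothetical helper: count BLAST hits per query ID.
# Assumes a tab-separated file with the query ID in the first column.
count_hits_per_query() {
    awk -F '\t' '{ n[$1]++ } END { for (q in n) print n[q], q }' "$1" |
        sort -k1,1nr -k2,2               # most hits first, then by name
}

# Run from the Day1Afternoon folder once the job has finished.
matches=Results/Argentina_sp_contaminated.fasta_nr_RNA_BLAST_Matches.txt
if [ -f "$matches" ]; then
    count_hits_per_query "$matches"
fi
```

Each output line shows the number of hits followed by the query ID, so queries from the two datasets can be told apart by their naming pattern.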
Check which taxa were found for the added Protodrilus symbioticus dataset and which for the original Argentina sp. dataset. Which taxa are therefore relevant for the contamination screening, given that we already have the artificially added one at hand? Do we need an additional one?
The next step is to find suitable datasets that can serve as positive and negative references for the screening. For this course, we already did this for you. What could be good datasets for such a screening of entire transcriptomic or genomic datasets?
1. Please keep in mind that we only screen the genes that are in the alignments we selected. Usually, you would do this on the entire dataset before the orthology determination.
2. If you set up the libraries yourself, the file names of all positive reference databases have to start with "pos_" and end with ".fasta". For the negative references, use "neg_" instead of "pos_". The files have to be in FASTA format and contain nucleotide sequences.
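If you ever set up your own reference libraries, a quick check of the file names and contents can catch mistakes before submitting the job. The sketch below reads the naming rule above as a file name prefix ("pos_" or "neg_") plus the ending ".fasta", which is our interpretation. Both function names and the example file name are hypothetical, and the nucleotide test is only a heuristic.

```shell
# Hypothetical helper: classify a reference library by its file name.
check_reference_name() {
    case "$(basename "$1")" in
        pos_*.fasta) echo "positive" ;;
        neg_*.fasta) echo "negative" ;;
        *) echo "invalid"; return 1 ;;
    esac
}

# Heuristic: E, F, I, L, P and Q occur in protein sequences but are not
# IUPAC nucleotide codes, so finding them suggests a protein file.
looks_like_nucleotides() {
    ! grep -v '^>' "$1" | grep -qi '[EFILPQ]'
}

check_reference_name pos_example.fasta   # hypothetical file name
```

A file passes if `check_reference_name` reports "positive" or "negative" and `looks_like_nucleotides` returns success.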
Run the following script to generate your reference dataset:
```
sbatch sbatch_CleaningContamination.sh
```
Normally, you would now have a cleaned assembly (the name of the assembly extended by "_pruned.fas"), which would go into the next step, the orthology determination. Here, however, we ran the screening after the orthology determination, so you would need to remove the contaminated sequences manually from the affected alignments.

Luckily, we already did this for you. The pruned single genes are in the subfolder "SingleGenes".
You now need to concatenate the single genes into a supermatrix. For this, you can use the program FASconCAT-G, which is written in Perl.
```
cd SingleGenes
cp ../../week1/Programs/FASconCAT-G/FASconCAT-G_v1.05.pl .
sbatch sbatch_Concatenation.sh
```
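Once the concatenation job has finished, you may want to check that the supermatrix looks plausible before building the tree. The sketch below relies on the PHYLIP convention that the first line holds the number of taxa and the alignment length. `phylip_dims` is a hypothetical helper, and the path is the one used in the next step of this exercise.

```shell
# Hypothetical helper: report the dimensions stated in a PHYLIP header.
# The first line of a PHYLIP file holds two integers:
# the number of taxa and the number of alignment positions.
phylip_dims() {
    awk 'NR == 1 { print "taxa=" $1, "sites=" $2; exit }' "$1"
}

# Run from the SingleGenes folder after the job has finished.
matrix=Supermatrix/Matrix_Concatenated_Para_supermatrix.phy
if [ -f "$matrix" ]; then
    phylip_dims "$matrix"
fi
```

The number of taxa should match the number of species in the single-gene alignments, and the number of sites should equal the summed lengths of the single genes.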
Now you can assess what effect these contaminations had on the tree by running a phylogenetic tree reconstruction. You will use the same settings as before to ensure that the only difference is the exclusion of the contaminated sequences.
```
cp Supermatrix/Matrix_Concatenated_Para_supermatrix.phy  ../ConcatenatedData/
cd ../ConcatenatedData/
sbatch sbatch_Supermatrix_Para_tree.sh
```

Download the final tree (ending in .treefile) as well as the tree with the contaminated sequences to your local computer. You can find the latter in the folder "Day1Morning". This will allow you to look at both trees using FigTree. You can use WinSCP or a similar program to download the trees, or you can use the command scp. In both cases, first navigate to the folder on your local computer where you want the data to be. If you use the command line, execute the commands below from your local computer.

scp $USERID@saga.sigma2.no:/cluster/projects/nn9458k/phylogenomics/week1/Day1Morning/Concatenated_Para_Conta_SupermatrixTree/*.treefile . # This is the tree with the contaminated data. BE AWARE: Replace the variable $USERID with your user ID on SAGA.
scp $USERID@saga.sigma2.no:/cluster/projects/nn9458k/phylogenomics/$YOURNAME/Day1Afternoon/Results/*.treefile . # This is the tree with the uncontaminated data. BE AWARE: Replace the variable $USERID with your user ID on SAGA and the variable $YOURNAME with the name of the folder you created above.

You now have the final tree (ending in .treefile) and can compare it with the tree that includes the contaminated sequences. Are there any differences between the trees? Keep in mind that Argentina was the taxon with deliberate contamination in this example and that two loci were affected by it.

RESULTS

The folder "Results" contains the most important results from this session.
