The folder "data" contains the smaller data files for the morning session of day 4. However, we advise you to use the data provided on SAGA, as this makes the data transfer to your folder much faster.
The folder "scripts" contains all the scripts to be run on SAGA. However, we advise you to transfer them from SAGA as described in the exercise below.
The folder "Lecture" contains the lecture from this session.
-
To get started, copy all data from the folder "Day4Morning" to your folder in the project area on SAGA:
cd /cluster/projects/nn9458k/phylogenomics/$YOURNAME
cp -r ../week1/Day4Morning .
cd Day4Morning
-
To calculate the slope-based saturation indices, we need the treefile and the phylip alignment file of each orthologous locus included in the original dataset
- Copy the .treefile files and relaxed phylip files of the first 100 loci from the exercise of the morning of Day 2 to this folder
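The copy step can be scripted. The following is only a sketch: `copy_first_n` is a hypothetical helper, and the source directory `PATH_TO_DAY2_RESULTS` is a placeholder, since the location of the Day 2 output is not stated here. Note that plain `sort` orders names lexicographically (`locus_10` before `locus_2`), so check that the selected files are really the 100 loci you intend.

```shell
# Hypothetical helper: copy the first N files (in sorted name order) that match
# a glob pattern from SRC_DIR into DEST_DIR.
# Usage: copy_first_n SRC_DIR 'PATTERN' N DEST_DIR
copy_first_n() {
    local src=$1 pattern=$2 n=$3 dest=$4
    mkdir -p "$dest"
    # find prints one matching path per line; sort gives a reproducible order
    find "$src" -maxdepth 1 -type f -name "$pattern" | sort | head -n "$n" |
    while IFS= read -r f; do
        cp "$f" "$dest"/
    done
}

# Example calls (PATH_TO_DAY2_RESULTS is a placeholder, not a real path):
# copy_first_n PATH_TO_DAY2_RESULTS '*.treefile' 100 .
# copy_first_n PATH_TO_DAY2_RESULTS '*.phy' 100 .
```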
sbatch sbatch_TreSpEx_Saturation.sh
-
For the c values, we only need the alignment file of the supermatrix and its partition file. Both are already in the folder.
sbatch sbatch_BaCoCa.sh
-
Download the following files to your own computer using scp
- Correlation_Results/Correlation_Slope_Summary.txt
- BaCoCa_Results/summarized_frequencies.txt
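A possible way to fetch both files in one go is sketched below. It is meant to be run on your own computer, not on SAGA. The user name, the folder name and the login host `saga.sigma2.no` are assumptions to be replaced with your own details; the remote path follows the project folder used at the start of this exercise. `fetch_day4_results` is a hypothetical helper name.

```shell
# Placeholders (assumptions): replace with your SAGA user name and folder name.
SAGA_USER="username"
YOURNAME="yourname"
REMOTE="$SAGA_USER@saga.sigma2.no:/cluster/projects/nn9458k/phylogenomics/$YOURNAME/Day4Morning"

# Hypothetical helper: download both result files into the current directory.
fetch_day4_results() {
    scp "$REMOTE/Correlation_Results/Correlation_Slope_Summary.txt" \
        "$REMOTE/BaCoCa_Results/summarized_frequencies.txt" .
}

# Run the download with:
# fetch_day4_results
```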
On your own computer
* Open "summarized_frequencies.txt" in a text editor, delete the first line and save the file
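If you prefer the command line to a text editor, the first line can also be removed there. This is a minimal sketch assuming a Unix-like shell on your computer; `drop_first_line` is a hypothetical helper that overwrites the file in place, so the file name stays the same for the import into R.

```shell
# Hypothetical helper: remove the first line of a file in place.
# Usage: drop_first_line FILE
drop_first_line() {
    # tail -n +2 prints everything from line 2 onwards; the result is written
    # to a temporary file first and then moved over the original.
    tail -n +2 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example call:
# drop_first_line summarized_frequencies.txt
```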
-
Import these two txt files in RStudio using "Import Dataset/From Text (base)"; set Heading to "Yes" and Row names to "Use first column"
-
We now create density plots in R to explore the distribution of the data
- Create a new R script and type the following into it:
# Density of the slope values
x <- density(Correlation_Slope_Summary$Slope)
plot(x)
# Density of the R2 values
y <- density(Correlation_Slope_Summary$R2)
plot(y)
# Density of the log10-transformed c values
z <- density(log10(summarized_frequencies$c.value))
plot(z)
-
Execute the R script
-
Explore the plots. What could be a reasonable threshold?
Back on SAGA
-
We now extract all files whose value passes your chosen threshold and which should therefore be kept. Do the following step for all three values (c value, R2 and slope); one example, which keeps the loci with a c value below 100, is given:
awk -F"\t" '{if($26<100)print$1}' < BaCoCa_Results/summarized_frequencies.txt | sed "s/locus/FcC_locus/" | sed "s/$/.phy/" > summarized_frequencies_below100.txt
mkdir Cvalue_below100
while read LINE; do cp ../Day2Morning/SingleGenes/$LINE Cvalue_below100; done < summarized_frequencies_below100.txt
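For slope and R2 the column numbers in Correlation_Slope_Summary.txt are not given here, so one option is to look the column up by its header name. The sketch below assumes a tab-delimited file with a header row whose names match the `Slope` and `R2` columns used in the R script. `select_above` and the threshold 0.5 are hypothetical, and the locus names it prints may still need the same `sed` renaming as in the example above.

```shell
# Hypothetical helper: print the first column of every row whose value in the
# column named COLNAME is greater than THRESHOLD.
# Assumes a tab-delimited file whose first line is a header row.
# Usage: select_above FILE COLNAME THRESHOLD
select_above() {
    awk -F"\t" -v name="$2" -v thr="$3" '
        NR == 1 {
            # header row: remember the index of the requested column
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            next
        }
        # numeric comparison; prints nothing if the column was not found
        col && ($col + 0) > (thr + 0) { print $1 }
    ' "$1"
}

# Example call (0.5 is only a placeholder threshold):
# select_above Correlation_Results/Correlation_Slope_Summary.txt Slope 0.5 > slope_above0.5.txt
```

If nothing is printed, check the header names and the delimiter of the file first.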
-
Now we need to concatenate these again and run a tree reconstruction of the new supermatrix
cd Cvalue_below100
-
Copy FASconCAT-G to this folder as well as the sbatch_Concatenation.sh we used yesterday (see yesterday's exercise for this)
-
Now modify the .sh file to suit your needs
sbatch sbatch_Concatenation.sh
-
When it is done, run a tree reconstruction again on the supermatrix as done before
The folder "Results" contains the most important results from this session.