2  Species tree reconstruction from organellar genomes

2.1 On this page

Biological insights and take-home messages are at the bottom of the page at Lesson Learnt: Section 2.3.

  • Here we reconstruct the Ulva species tree based on the 70 chloroplast and 29 mitochondrial genes we have retrieved using two methods: a Maximum Likelihood method and a coalescence-based one;
  • we compare the species trees reconstructed by using only chloroplast genes or only mitochondrial genes against the Ulva species tree obtained with all 99 organellar genes.

2.2 Reconstruction of Ulva species tree

We will reconstruct the Ulva phylogenetic species tree using a Maximum Likelihood approach with IQ-TREE, and a coalescence-based one with ASTRAL. We will generate the species tree based on:

  • chloroplast only genes
  • mitochondrial only genes
  • both chloroplast and mitochondrial genes

and we will evaluate the similarities and differences between the reconstructed phylogenies.

2.2.1 Maximum Likelihodd species tree reconstruction

For the ML approach, the first step consist in concatenating all alignments into a single concatenated file. We will use catfasta2phyml to concatenate the nucleotidic sequences.

However, we want to apply different evolution models to different genes. Even if for each organelle all genes are inherited in linkage, the same is not true for the chloroplast and the mitochondrial genome. Moreover, this would allow to model different signals across genes that are under different evolutionary pressures. Therefore, first we run single-gene ML trees for each of the 99 genes, and for that gene we detect which evolution model result the Maximum Likelihood reconstruction. Then, we use that model to generate a nexus file for the concatenated alignment that we will provide to IQ-TREE to assign the best model for each partition (gene).

Let’s start by running the ML tree reconstruction for each single organellar gene.

#!/usr/bin/env bash

# CP gene wise Maximum Likelihood phylogeny
mkdir ./03_cp_singleGene_ML && cd ./03_cp_singleGene_ML
ls ../02_cp_CDS_align/*.fa \
  | xargs -P 8 -n 6 -I {} bash -c \
  ‘~/bin/iqtree-2.2.0-Linux/bin/iqtree2 -s {} \
    -st DNA \
    -pre $(basename {} .fa) 
    -nt 6 \
    -wbt -bb 1000 -alrt 1000 \
    -m MFP+MERGE’

# MT gene wise Maximum Likelihood phylogeny
cd ..
mkdir ./03_mt_singleGene_ML && cd ./03_mt_singleGene_ML
ls ../02_mt_CDS_align/*.fa \
  | xargs -P 8 -n 6 -I {} bash -c \
  ‘~/bin/iqtree-2.2.0-Linux/bin/iqtree2 -s {} \
    -st DNA \
    -pre $(basename {} .fa) 
    -nt 6 \
    -wbt -bb 1000 -alrt 1000 \
    -m MFP+MERGE’
cd ..

Now we can create a nexus file with the best model predicted for each gene. This provides the best model for each partition (i.e.: gene) for the concatenated ML tree.

#!/usr/bin/env bash

# create nexus file with best models CP and MT 
for ORGANEL in cp mt; do 

  k=1
  START=0
  
  echo "#nexus" > ./04_"${ORGANEL}"_concat_ML/head_noscaffold.nex
  echo "begin sets;" >> ./04_"${ORGANEL}"_concat_ML/head_noscaffold.nex
  echo -n "charpartition mine = " >> ./04_"${ORGANEL}"_concat_ML/tail_noscaffold.nex
  
  for file in ./02_"${ORGANEL}"_CDS_alin/*.fa; do
    LEN=$(awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' $file | sed -n 2,2p)
    echo "charset part$k = $(($START+1))-$(($START+$LEN));" >> ./04_"${ORGANEL}"_concat_ML/head_noscaffold.nex
    START=$(($START+$LEN))
    MODEL=$(grep "Best-fit model" ./02_"${ORGANEL}"_CDS_alin/$(basename $file .fa).log \
      | sed 's/Best-fit model: //g' \
      | sed 's/ chosen according to BIC//g')
    echo -n "$MODEL:part$k ," >> ./04_"${ORGANEL}"_concat_ML/tail_noscaffold.nex
    k=$((k+1))
  done
  
  echo ";" >> ./04_"${ORGANEL}"_concat_ML/tail_noscaffold.nex
  echo "end;" >> ./04_"${ORGANEL}"_concat_ML/tail_noscaffold.nex
  
  cat ./04_"${ORGANEL}"_concat_ML/head_noscaffold.nex \
    ./04_"${ORGANEL}"_concat_ML/tail_noscaffold.nex \
    > ./04_"${ORGANEL}"_concat_ML/"${ORGANEL}"_allgenes_concat.nex
    
  rm ./04_"${ORGANEL}"_concat_ML/head_noscaffold.nex ./04_"${ORGANEL}"_concat_ML/tail_noscaffold.nex
  sed -i 's/ ,;/;/g' ./04_"${ORGANEL}"_concat_ML/"${ORGANEL}"_allgenes_concat.nex

done

# create a nexus file for CP+MT concatenated alignment
k=1
START=0

echo "#nexus" > ./06_cp_mt_concat_ML/head_noscaffold.nex
echo "begin sets;" >> ./06_cp_mt_concat_ML/head_noscaffold.nex
echo -n "charpartition mine = " >> ./06_cp_mt_concat_ML/tail_noscaffold.nex

for file in ./02_cp_CDS_alin/*.fa ./02_mt_CDS_alin/*.fa; do
  LEN=$(awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' $file | sed -n 2,2p)
  echo "charset part$k = $(($START+1))-$(($START+$LEN));" >> ./06_cp_mt_concat_ML/head_noscaffold.nex
  START=$(($START+$LEN))
  MODEL=$(grep "Best-fit model" ./02_*_CDS_alin/$(basename $file .fa).log \
    | sed 's/Best-fit model: //g' \
    | sed 's/ chosen according to BIC//g')
  echo -n "$MODEL:part$k ," >> ./06_cp_mt_concat_ML/tail_noscaffold.nex
  k=$((k+1))
done

echo ";" >> ./06_cp_mt_concat_ML/tail_noscaffold.nex
echo "end;" >> ./06_cp_mt_concat_ML/tail_noscaffold.nex

cat ./06_cp_mt_concat_ML/head_noscaffold.nex \
  ./06_cp_mt_concat_ML/tail_noscaffold.nex \
  > ./06_cp_mt_concat_ML/cp_mt_allgenes_concat.nex
  
rm ./06_cp_mt_concat_ML/head_noscaffold.nex ./06_cp_mt_concat_ML/tail_noscaffold.nex
sed -i 's/ ,;/;/g' ./06_cp_mt_concat_ML/cp_mt_allgenes_concat.nex

Now, we will use catfasta2phyml to concatenate the nucleotidic sequences.

#!/usr/bin/env bash

# concatenate CP alignments
perl ~/bin/catfasta2phyml.pl \
  --fasta \
  --concatenate 02_cp_CDS_align/*.fa \
  > 08_concatenated_ML/clstr.all.concat.nt.align.fa
  
# concatenate MT alignments
perl ~/bin/catfasta2phyml.pl \
  --fasta \
  --concatenate 02_mt_CDS_align/*.fa \
  > 08_concatenated_ML/clstr.all.concat.nt.align.fa

# concatenate CP+MT alignments
perl ~/bin/catfasta2phyml.pl \
  --fasta \
  --concatenate 07_nt_aln_ready/*.fa \
  > 06_cp_mt_concat_ML/clstr.all.concat.nt.align.fa

# ML analysis on concatenated alignment
cd ./04_cp_concat_ML/
~/bin/iqtree-2.2.0-Linux/bin/iqtree2 \
  -s cp_allgenes_concat.fa \
  -st DNA \
  -pre cp_allgenes_concat \
  -p cp_allgenes_concat.nex \
  --sampling GENESITE \
  -nt 64 \
  -wbt -bb 1000 -alrt 1000 

cd ../04_mt_concat_ML/
~/bin/iqtree-2.2.0-Linux/bin/iqtree2 \
  -s mt_allgenes_concat.fa \
  -st DNA \
  -pre mt_allgenes_concat \
  -p mt_allgenes_concat.nex \
  --sampling GENESITE \
  -nt 64 \
  -wbt -bb 1000 -alrt 1000 

cd ../06_cp_mt_concat_ML/
~/bin/iqtree-2.2.0-Linux/bin/iqtree2 \
  -s cp_mt_allgenes_concat.fa \
  -st DNA \
  -pre cp_mt_allgenes_concat \
  -p cp_mt_allgenes_concat.nex \
  --sampling GENESITE \
  -nt 64 \
  -wbt -bb 1000 -alrt 1000 
  

2.2.2 Colescence-based species tree reconstruction

For the coalescence-based approach, we will reconciliate the single-gene ML tree. We will feed to the coalescence model also all the 1000 Bootstrap trees generate for each ML gene reconstruction, and we will allow for 100 gene resampling to improve the support value at the branch level for the reconstructed species tree.

#!/usr/bin/env bash

# prepare files
cat 03_cp_singleGene_ML/*.contree > 05_cp_ASTRAL/cp_genetrees.tre
cat 03_mt_singleGene_ML/*.contree > 05_mt_ASTRAL/mt_genetrees.tre
cat 03_cp_singleGene_ML/*.contree 03_mt_singleGene_ML/*.contree > 07_cp_mt_ASTRAL/cp_mt_genetrees.tre

for file in 03_cp_singleGene_ML/*.ufboot; do
  echo $file | sed ‘s#^#03_cp_singleGene_ML/#’;
done > 05_cp_ASTRAL/cp_bootstrap_trees.BS

for file in 03_mt_singleGene_ML/*.ufboot; do
  echo $file | sed ‘s#^#03_mt_singleGene_ML/#’;
done > 05_mt_ASTRAL/mt_bootstrap_trees.BS

for file in 03_cp_singleGene_ML/*.ufboot 03_mt_singleGene_ML/*.ufboot; do
  echo $file | sed ‘s#^#03_cp_singleGene_ML/#’ | sed ‘s#^#03_mt_singleGene_ML/#’;
done > 07_cp_mt_ASTRAL/cp_mt_bootstrap_trees.BS


# rum cp coalescence-based tree
java -Xmx24000M \
  -jar bin/Astral/astral.5.7.8.jar \
  --bootstraps 05_cp_ASTRAL/cp_bootstrap_trees.BS \
  --gene-resampling \
  -r 100 \
  --input 05_cp_ASTRAL/cp_genetrees.tre \
  --output 05_cp_ASTRAL/cp_genetrees.coalescence.tre \
  2> 05_cp_ASTRAL/cp_genetrees.coalescence.tre
  
tail -n 1 05_cp_ASTRAL/cp_genetrees.coalescence.tre \
  > 05_cp_ASTRAL/cp_genetrees.coalescence.tre


# run mt coalescence-based tree
java -Xmx24000M \
  -jar bin/Astral/astral.5.7.8.jar \
  --bootstraps 05_mt_ASTRAL/mt_bootstrap_trees.BS \
  --gene-resampling \
  -r 100 \
  --input 05_mt_ASTRAL/mt_genetrees.tre \
  --output 05_mt_ASTRAL/mt_genetrees.coalescence.tre \
  2> 05_mt_ASTRAL/mt_genetrees.coalescence.tre
  
tail -n 1 05_mt_ASTRAL/mt_genetrees.coalescence.tre \
  > 05_mt_ASTRAL/mt_genetrees.coalescence.tre


# rum cp + mt coalescence-based tree
java -Xmx24000M \
  -jar bin/Astral/astral.5.7.8.jar \
  --bootstraps 07_cp_mt_ASTRAL/cp_mt_bootstrap_trees.BS \
  --gene-resampling \
  -r 100 \
  --input 07_cp_mt_ASTRAL/cp_mt_genetrees.tre \
  --output 07_cp_mt_ASTRAL/cp_mt_genetrees.coalescence.tre \
  2> 07_cp_mt_ASTRAL/cp_mt_genetrees.coalescence.tre
  
tail -n 1 07_cp_mt_ASTRAL/cp_mt_genetrees.coalescence.tre \
  > 07_cp_mt_ASTRAL/cp_mt_genetrees.coalescence.tre

2.2.3 Species trees comparison and reconciliation

Let’s plot the Ulva species trees we have just reconstructed!

First we import the trees in R.

# phylogenetic trees placeholder
phylogenetic_trees = list(
  "cp_ML" = list("type" = "ML", "tree" = NA, "root" = NA, "dendro" = NA),
  "cp_ASTRAL" = list("type" = "coalescent", "tree" = NA, "root" = NA, "dendro" = NA),
  "mt_ML" = list("type" = "ML", "tree" = NA, "root" = NA, "dendro" = NA),
  "mt_ASTRAL" = list("type" = "coalescent", "tree" = NA, "root" = NA, "dendro" = NA),
  "cp_mt_ML" = list("type" = "ML", "tree" = NA, "root" = NA, "dendro" = NA),
  "cp_mt_ASTRAL" = list("type" = "coalescent", "tree" = NA, "root" = NA, "dendro" = NA)
)

# read ML concatenated trees
phylogenetic_trees[["cp_ML"]][["tree"]] = ape::read.tree(file = "./data/cp_allgenes_concat.contree")
phylogenetic_trees[["mt_ML"]][["tree"]] = ape::read.tree(file = "./data/mt_allgenes_concat.contree")
phylogenetic_trees[["cp_mt_ML"]][["tree"]] = ape::read.tree(file = "./data/cp_mt_allgenes_concat.contree")

# read coalescence-based trees
phylogenetic_trees[["cp_ASTRAL"]][["tree"]] = ape::read.tree(file = "./data/cp_genetrees.coalescence.tre")
phylogenetic_trees[["mt_ASTRAL"]][["tree"]] = ape::read.tree(file = "./data/mt_genetrees.coalescence.tre")
phylogenetic_trees[["cp_mt_ASTRAL"]][["tree"]] = ape::read.tree(file = "./data/cp_mt_genetrees.coalescence.tre")

# reroot trees
for(i in 1:length(phylogenetic_trees)){
  
  # get root position
  phylogenetic_trees[[i]][["root"]] = tidytree::MRCA(phylogenetic_trees[[i]][["tree"]], "Pakin", "Oviri")
  
  # reroot set branch length NaN to 0 in coalescent trees and root length to 0.01 in ML trees
  if(phylogenetic_trees[[i]][["type"]] == "coalescent"){
    phylogenetic_trees[[i]][["tree"]] = treeio::root(phylogenetic_trees[[i]][["tree"]], node = phylogenetic_trees[[i]][["root"]], resolve.root = TRUE)
    phylogenetic_trees[[i]][["tree"]]$edge.length[is.na(phylogenetic_trees[[i]][["tree"]]$edge.length)] = 0
    phylogenetic_trees[[i]][["tree"]]$node.label = ifelse(phylogenetic_trees[[i]][["tree"]]$node.label == "Root", "1", phylogenetic_trees[[i]][["tree"]]$node.label)
    phylogenetic_trees[[i]][["tree"]]$node.label = ifelse(phylogenetic_trees[[i]][["tree"]]$node.label == "", "1", phylogenetic_trees[[i]][["tree"]]$node.label)
  } else {
    phylogenetic_trees[[i]][["tree"]] = treeio::root(phylogenetic_trees[[i]][["tree"]], node = phylogenetic_trees[[i]][["root"]] + 1, resolve.root = TRUE)
    phylogenetic_trees[[i]][["tree"]]$edge.length[which(phylogenetic_trees[[i]][["tree"]]$edge.length == 0)] = 0.1
    phylogenetic_trees[[i]][["tree"]]$node.label = ifelse(phylogenetic_trees[[i]][["tree"]]$node.label == "Root", "100", phylogenetic_trees[[i]][["tree"]]$node.label)
    phylogenetic_trees[[i]][["tree"]]$node.label = ifelse(phylogenetic_trees[[i]][["tree"]]$node.label == "", "100", phylogenetic_trees[[i]][["tree"]]$node.label)
  }
  
  # sort trees in descending order
  phylogenetic_trees[[i]][["tree"]] = ape::ladderize(phylogenetic_trees[[i]][["tree"]], right = TRUE)
  
  # set node label to numeric (BS and coalescence support)
  phylogenetic_trees[[i]][["tree"]]$node.label = as.numeric(phylogenetic_trees[[i]][["tree"]]$node.label)
  
}

Lets now plot a tanglegram: we will have the ML tree and the ASTRAL tree facing each other, and lines connecting the corresponding position of the same alga isolate. This will allow us to compare the two three structure, as well as how much concordant are the phylogenetic reconstructions.

We will do this for the three gene sets:

  • chloroplast genes only (70 genes)
  • mitochondrial genes only (27 genes)
  • chloroplast and mitochondrial genes (99 genes).
ML_tree = phylogenetic_trees[["cp_mt_ML"]][["tree"]]
ML_tree = groupClade(ML_tree, 53)

# prep ML tree
p_ML = ggtree(ML_tree, aes(linetype = group)) +
  scale_linetype_discrete(guide = "none")

p_ML$data[p_ML$data$node %in% c(1, 2), "x"] = mean(p_ML$data$x)
p_ML$data[p_ML$data$node %in% c(53), "x"] = mean(p_ML$data$x)/2

p_ML = ggtree::rotate(p_ML, 52) 

# prep ASTRAL tree
ASTRAL_tree = phylogenetic_trees[["cp_mt_ASTRAL"]][["tree"]]
ASTRAL_tree = groupClade(ASTRAL_tree, 101)

p_ASTRAL = ggtree(ASTRAL_tree, aes(linetype = group)) +
  scale_linetype_discrete(guide = "none")

p_ASTRAL$data[p_ASTRAL$data$node %in% c(13, 12), "x"] = max(p_ASTRAL$data$x)
p_ASTRAL$data[p_ASTRAL$data$node %in% c(101), "x"] = max(p_ASTRAL$data$x)/2

p_ASTRAL = ggtree::rotate(p_ASTRAL, 52) +
  geom_tiplab(color = "black", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0.7) +
  geom_nodepoint(aes(fill = cut(as.numeric(label), c(0, 0.70, 0.90, 1.00))), shape = 21, size = 3) +
  scale_fill_manual(values = c("white", "grey", "black")) +
  theme(legend.position = "none")

# get tree data
d_ML = p_ML$data 
d_ASTRAL = p_ASTRAL$data

# rescale ASTRAL tree to ML tree space
scale_ratio_bl = max(d_ML$branch.length) / max(d_ASTRAL$branch.length)
scale_ratio_x = max(d_ML$x) / max(d_ASTRAL$x)
scale_ratio_b = max(d_ML$branch) / max(d_ASTRAL$branch) 
d_ASTRAL$branch.length = d_ASTRAL$branch.length * scale_ratio_bl
d_ASTRAL$x = d_ASTRAL$x * scale_ratio_x
d_ASTRAL$branch = d_ASTRAL$branch * scale_ratio_b

# reverse x-axis and set offset to make ASTRAL tree on the right-hand side of ML tree
d_ASTRAL$x = max(d_ASTRAL$x) - d_ASTRAL$x + max(d_ML$x) + 0.1


# combine trees
p_combined = p_ML +
  geom_nodepoint(aes(fill = cut(as.numeric(label),
                                c(0, 70, 90, 100))), shape = 21, size = 3) +
  scale_fill_manual(values = c("black", "grey", "white"),
                    guide = "legend",
                    name = "Bootstrap",
                    breaks = c("(90,100]", "(70,90]", "(0,70]"),
                    labels = expression(BP>=90,70 <= BP * " < 90", BP < 70)) +
  labs(title = "Chloroplast + mitochondria, 99 genes") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom") +
  geom_tree(data = d_ASTRAL) +
  #geom_tiplab(data = d_ASTRAL, color = "black", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0.7) +
  ggnewscale::new_scale_fill() +
  geom_nodepoint(data = d_ASTRAL, aes(fill = cut(as.numeric(label), c(0, 0.70, 0.90, 1.00))), shape = 21, size = 3) +
  scale_fill_manual(values = c("white", "grey", "black"), guide = "none")

# combine tip data
d_ML$x = max(d_ML$x) + 0.05
d_ASTRAL$x = d_ASTRAL$x - 0.0125
d_labels = dplyr::bind_rows(d_ML, d_ASTRAL) %>%
  dplyr::filter(isTip == TRUE)
d_labels$color_group = ifelse(
  d_labels$label %in% c("Ufene", "Usp4"), "red",
  ifelse(d_labels$label %in% c(
    "Uaust1", "Uaust2", "Uaust4", "Ufaust5", "Ufaust6",
    "Ucomp1", "Ucomp2", "Ucomp3", "Ucomp4", "Ucomp5", "Ucomp6", "Ucomp7", "Ucomp8", "Ucomp9", "Ucomp10",
    "Ulact2", "Ulact3",
    "Umeri1", "Umeri3", "Umeri4", "Umeri5"), "grey", "steelblue")
)

p_combined +
  geom_tiplab(color = "antiquewhite3", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0.7) +
  geom_tiplab(color = "black", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0) +
  geom_line(data = d_labels, aes(x, y, group = label, color = color_group)) +
  scale_color_manual(values = c("grey", "red", "steelblue"), guide = FALSE)

Figure 1: cp + mt species tree reconstruction
ML_tree = phylogenetic_trees[["cp_ML"]][["tree"]]
ML_tree = groupClade(ML_tree, 37)

# prep ML tree
p_ML = ggtree(ML_tree, aes(linetype = group)) +
  scale_linetype_discrete(guide = "none")

p_ML$data[p_ML$data$node %in% c(1, 2), "x"] = mean(p_ML$data$x)
p_ML$data[p_ML$data$node %in% c(38), "x"] = mean(p_ML$data$x)/2

p_ML = ggtree::rotate(p_ML, 37) 

# prep ASTRAL tree
ASTRAL_tree = phylogenetic_trees[["cp_ASTRAL"]][["tree"]]
ASTRAL_tree = groupClade(ASTRAL_tree, 37)

p_ASTRAL = ggtree(ASTRAL_tree, aes(linetype = group)) +
  scale_linetype_discrete(guide = "none")

p_ASTRAL$data[p_ASTRAL$data$node %in% c(20, 21), "x"] = max(p_ASTRAL$data$x)
p_ASTRAL$data[p_ASTRAL$data$node %in% c(71), "x"] = max(p_ASTRAL$data$x)/2

p_ASTRAL = ggtree::rotate(p_ASTRAL, 37) +
  geom_tiplab(color = "black", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0.7) +
  geom_nodepoint(aes(fill = cut(as.numeric(label), c(0, 0.70, 0.90, 1.00))), shape = 21, size = 3) +
  scale_fill_manual(values = c("white", "grey", "black")) +
  theme(legend.position = "none")

# get tree data
d_ML = p_ML$data 
d_ASTRAL = p_ASTRAL$data

# rescale ASTRAL tree to ML tree space
scale_ratio_bl = max(d_ML$branch.length) / max(d_ASTRAL$branch.length)
scale_ratio_x = max(d_ML$x) / max(d_ASTRAL$x)
scale_ratio_b = max(d_ML$branch) / max(d_ASTRAL$branch) 
d_ASTRAL$branch.length = d_ASTRAL$branch.length * scale_ratio_bl
d_ASTRAL$x = d_ASTRAL$x * scale_ratio_x
d_ASTRAL$branch = d_ASTRAL$branch * scale_ratio_b

# reverse x-axis and set offset to make ASTRAL tree on the right-hand side of ML tree
d_ASTRAL$x = max(d_ASTRAL$x) - d_ASTRAL$x + max(d_ML$x) + 0.1


# combine trees
p_combined = p_ML +
  geom_nodepoint(aes(fill = cut(as.numeric(label),
                                c(0, 70, 90, 100))), shape = 21, size = 3) +
  scale_fill_manual(values = c("black", "grey", "white"),
                    guide = "legend",
                    name = "Bootstrap",
                    breaks = c("(90,100]", "(70,90]", "(0,70]"),
                    labels = expression(BP>=90,70 <= BP * " < 90", BP < 70)) +
  labs(title = "Chloroplast, 70 genes") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom") +
  geom_tree(data = d_ASTRAL) +
  ggnewscale::new_scale_fill() +
  geom_nodepoint(data = d_ASTRAL, aes(fill = cut(as.numeric(label), c(0, 0.70, 0.90, 1.00))), shape = 21, size = 3) +
  scale_fill_manual(values = c("white", "grey", "black"), guide = "none")

# combine tip data
d_ML$x = max(d_ML$x) + 0.05
d_ASTRAL$x = d_ASTRAL$x - 0.0125
d_labels = dplyr::bind_rows(d_ML, d_ASTRAL) %>%
  dplyr::filter(isTip == TRUE)
d_labels$color_group = ifelse(
  d_labels$label %in% c(
    "Ucomp5", "Ucomp9", "Ucomp10", "Ucomp3", "Ucomp1", "Ucomp2", "Ucomp6", "Ucomp4",
    "Ulinz", "Uprol3", "Uprol2", "Utort", "Uflex1", "Ucali",
    "Umeri7", "Umeri6", "Usp2", "Uprol1",
    "Ulaci1", "Uaust3", "Urigi1", "Ulaci2"
  ), "grey", "steelblue"
)

p_combined +
  geom_tiplab(color = "antiquewhite3", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0.7) +
  geom_tiplab(color = "black", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0) +
  geom_line(data = d_labels, aes(x, y, group = label, color = color_group)) +
  scale_color_manual(values = c("grey", "steelblue"), guide = FALSE)

Figure 2: cp species tree reconstruction
ML_tree = phylogenetic_trees[["mt_ML"]][["tree"]]
ML_tree = groupClade(ML_tree, 40)

# prep ML tree
p_ML = ggtree(ML_tree, aes(linetype = group)) +
  scale_linetype_discrete(guide = "none")

p_ML$data[p_ML$data$node %in% c(1, 2), "x"] = mean(p_ML$data$x)
p_ML$data[p_ML$data$node %in% c(41), "x"] = mean(p_ML$data$x)/2

p_ML = ggtree::rotate(p_ML, 40) 

# prep ASTRAL tree
ASTRAL_tree = phylogenetic_trees[["mt_ASTRAL"]][["tree"]]
ASTRAL_tree = groupClade(ASTRAL_tree, 40)

p_ASTRAL = ggtree(ASTRAL_tree, aes(linetype = group)) +
  scale_linetype_discrete(guide = "none")

p_ASTRAL$data[p_ASTRAL$data$node %in% c(25, 24), "x"] = max(p_ASTRAL$data$x)
p_ASTRAL$data[p_ASTRAL$data$node %in% c(77), "x"] = max(p_ASTRAL$data$x)/2

p_ASTRAL = ggtree::rotate(p_ASTRAL, 40) +
  geom_tiplab(color = "black", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0.7) +
  geom_nodepoint(aes(fill = cut(as.numeric(label), c(0, 0.70, 0.90, 1.00))), shape = 21, size = 3) +
  scale_fill_manual(values = c("white", "grey", "black")) +
  theme(legend.position = "none")

# get tree data
d_ML = p_ML$data 
d_ASTRAL = p_ASTRAL$data

# rescale ASTRAL tree to ML tree space
scale_ratio_bl = max(d_ML$branch.length) / max(d_ASTRAL$branch.length)
scale_ratio_x = max(d_ML$x) / max(d_ASTRAL$x)
scale_ratio_b = max(d_ML$branch) / max(d_ASTRAL$branch) 
d_ASTRAL$branch.length = d_ASTRAL$branch.length * scale_ratio_bl
d_ASTRAL$x = d_ASTRAL$x * scale_ratio_x
d_ASTRAL$branch = d_ASTRAL$branch * scale_ratio_b

# reverse x-axis and set offset to make ASTRAL tree on the right-hand side of ML tree
d_ASTRAL$x = max(d_ASTRAL$x) - d_ASTRAL$x + max(d_ML$x) + 0.1


# combine trees
p_combined = p_ML +
  geom_nodepoint(aes(fill = cut(as.numeric(label),
                                c(0, 70, 90, 100))), shape = 21, size = 3) +
  scale_fill_manual(values = c("black", "grey", "white"),
                    guide = "legend",
                    name = "Bootstrap",
                    breaks = c("(90,100]", "(70,90]", "(0,70]"),
                    labels = expression(BP>=90,70 <= BP * " < 90", BP < 70)) +
  labs(title = "Mitochondria, 29 genes") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom") +
  geom_tree(data = d_ASTRAL) +
  ggnewscale::new_scale_fill() +
  geom_nodepoint(data = d_ASTRAL, aes(fill = cut(as.numeric(label), c(0, 0.70, 0.90, 1.00))), shape = 21, size = 3) +
  scale_fill_manual(values = c("white", "grey", "black"), guide = "none")

# combine tip data
d_ML$x = max(d_ML$x) + 0.05
d_ASTRAL$x = d_ASTRAL$x - 0.0125
d_labels = dplyr::bind_rows(d_ML, d_ASTRAL) %>%
  dplyr::filter(isTip == TRUE)
d_labels$color_group = ifelse(
  d_labels$label %in% c("Usp3", "Utort", "Umeri6", "Usp4", "Urigi3", "Ufene"), "red", "grey"
)

p_combined +
  geom_tiplab(color = "antiquewhite3", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0.7) +
  geom_tiplab(color = "black", size = 2.75, align = TRUE, linetype = "dotted", linesize = 0) +
  geom_line(data = d_labels, aes(x, y, group = label, color = color_group)) +
  scale_color_manual(values = c("grey", "red", "steelblue"), guide = FALSE)

Figure 3: mt species tree reconstruction

In the tanglegrams above, the blue lines show the identical topology between ML and coalescence trees; the grey lines indicate difference in topology within species, and red lines indicate difference in topology between species.

The set of 99 genes (chloroplast + mitochondria genes) manages to reconstruct correctly the Ulva species tree. The only deiscrepancy between ML and ASTRAL tree sits in the position of Ulva fenestrata (Ufene) and Ulva sp. 4 “Usp4” which clusters together in the ASTRAL tree, while they are assigned to different subclades in the ML tree.

In the chloroplast trees (70 genes), there are not major differences between the ML and the ASTRAL tree regarding the species positioning. Howvever, the Ulva species tree cannot be completely reconstructed because the phylogenetic relationships between the clade containing “Ucali”, “Uflex1”, “Ulinz”, “Utort”, “Uprol2”, “Uprol3” and the clade containing “Umeri6”, “Umeri7”, “Uprol1” and “Usp2” cannot be properly resolved.

The ML and ASTRAL mitochondrial trees (29 genes) are highly discordant, suggesting that the list of genomes and the genes available for Ulva isolates do not allow for proper reconstruction of the Ulva species tree.

2.2.4 Distances between reconstructed species trees

The tanglegrams provided a visual represenation of the concordance between the reconstructed Ulva species tree, but we can also calculate the similarities of the reconstructed trees. We will use four metrics:

The correlation between the common nodes is calculated as the pairwise cophenetic correlation coefficient between the distance matrix and the cophenetic matrix of the common nodes. The RF distance measures the number of unique splits between two trees by comparing their bipartitions. It ranges from 0 (identical trees) to a higher value indicating greater dissimilarity with more unique splits. The KF index quantifies the average genetic differences per site between populations, considering genetic variation within and between populations to provide a quantitative measure of genetic distance. The MAST % identifies the subset of nodes and branches that are common to both trees, disregarding any additional or missing branches. It aims to find the largest possible subtree that can be extracted from both trees while preserving their structural similarities, representing the maximum level of agreement in their branching patterns.

We can first calculate all the distances:

# get list of dendrograms to compare
dendro_list = dendextend::dendlist(
  "cp_ML" = phylogenetic_trees[["cp_ML"]][["tree"]] %>%
    ape::chronos() %>%
    stats::as.dendrogram(),
  "cp_ASTRAL" = phylogenetic_trees[["cp_ASTRAL"]][["tree"]] %>%
    ape::chronos() %>%
    stats::as.dendrogram(),
  "mt_ML" = phylogenetic_trees[["mt_ML"]][["tree"]] %>%
    ape::chronos() %>%
    stats::as.dendrogram(),
  "mt_ASTRAL" = phylogenetic_trees[["mt_ASTRAL"]][["tree"]] %>%
    ape::chronos() %>%
    stats::as.dendrogram(),
  "cp_mt_ML" = phylogenetic_trees[["cp_mt_ML"]][["tree"]] %>%
    ape::chronos() %>%
    stats::as.dendrogram(),
  "cp_mt_ASTRAL" = phylogenetic_trees[["cp_mt_ASTRAL"]][["tree"]] %>%
    ape::chronos() %>%
    stats::as.dendrogram()
)

Setting initial dates...
Fitting in progress... get a first set of estimates
         (Penalised) log-lik = -5.506629 
Optimising rates... dates... -5.506629 
Optimising rates... dates... -5.391076 

log-Lik = -5.745963 
PHIIC = 227.39 

Setting initial dates...
Fitting in progress... get a first set of estimates
         (Penalised) log-lik = -41.0089 
Optimising rates... dates... -41.0089 
Optimising rates... dates... -41.0089 

log-Lik = -40.14699 
PHIIC = 295.47 

Setting initial dates...
Fitting in progress... get a first set of estimates
         (Penalised) log-lik = -3.093321 
Optimising rates... dates... -3.093321 
Optimising rates... dates... -2.767643 

log-Lik = -11.12947 
PHIIC = 454.49 

Setting initial dates...
Fitting in progress... get a first set of estimates
         (Penalised) log-lik = -66.71373 
Optimising rates... dates... -66.71373 
Optimising rates... dates... -66.71269 

log-Lik = -65.14702 
PHIIC = 360.65 

Setting initial dates...
Fitting in progress... get a first set of estimates
         (Penalised) log-lik = -6.654174 
Optimising rates... dates... -6.654174 
Optimising rates... dates... -6.420461 

log-Lik = -7.69423 
PHIIC = 339.7 

Setting initial dates...
Fitting in progress... get a first set of estimates
         (Penalised) log-lik = -15019.6 
Optimising rates... dates... -15019.6 

log-Lik = -1e+100 
PHIIC = 2e+100 
# create empty lists for distances and correlations
empty_matrix = matrix(nrow = 6, ncol = 6)
colnames(empty_matrix) = c("cp_ML", "cp_ASTRAL", "mt_ML", "mt_ASTRAL", "cp_mt_ML", "cp_mt_ASTRAL")
rownames(empty_matrix) = c("cp_ML", "cp_ASTRAL", "mt_ML", "mt_ASTRAL", "cp_mt_ML", "cp_mt_ASTRAL")
dist_DB_pairwise = list(
  "common_nodes_corr" = empty_matrix,
  "RF" = empty_matrix,
  "KF" = empty_matrix,
  "MAST_%" = empty_matrix
)
dist_DB_min_set_species = list(
  "common_nodes_corr" = empty_matrix,
  "RF" = empty_matrix,
  "KF" = empty_matrix,
  "MAST_%" = empty_matrix
)

# common nodes correlation
common_nodes_corr = dendextend::cor.dendlist(dendro_list, method = "common_nodes")
dist_DB_pairwise[["common_nodes_corr"]] = common_nodes_corr
dist_DB_min_set_species[["common_nodes_corr"]] = common_nodes_corr

# get pairwise distances
for(i in 1:length(phylogenetic_trees)){
  for(k in 1:length(phylogenetic_trees)){
    
    # get tree
    tree1_tmp = phylogenetic_trees[[i]][["tree"]]
    tree2_tmp = phylogenetic_trees[[k]][["tree"]]
    
    # list of common species
    species_list = tree1_tmp$tip.label[which(tree1_tmp$tip.label %in% tree2_tmp$tip.label)]

    # get overlapping species
    tree1 = ape::keep.tip(tree1_tmp, species_list)
    tree2 = ape::keep.tip(tree2_tmp, species_list)
    
    # get distances
    distances = phangorn::treedist(tree1, tree2)
    dist_DB_pairwise[["RF"]][i, k] = distances[[1]]
    dist_DB_pairwise[["KF"]][i, k] = distances[[2]]
    #dist_DB_pairwise[["path_diff"]][i, k] = distances[[3]]
    dist_DB_pairwise[["MAST_%"]][i, k] = length(phangorn::mast(tree1, tree2, tree = FALSE)) / length(species_list)
    
    # clean
    rm(tree1_tmp, tree2_tmp, tree1, tree2, species_list, distances)
  }
}

# make list of minimum set of species
min_species_list = phylogenetic_trees[["cp_mt_ML"]][["tree"]]$tip.label
for(i in 1:length(phylogenetic_trees)){
  min_species_list = min_species_list[which(min_species_list %in% phylogenetic_trees[[i]][["tree"]]$tip.label)]
}

# get distances based on minimum set of species
for(i in 1:length(phylogenetic_trees)){
  for(k in 1:length(phylogenetic_trees)){

    # get overlapping species
    tree1 = ape::keep.tip(phylogenetic_trees[[i]][["tree"]], min_species_list)
    tree2 = ape::keep.tip(phylogenetic_trees[[k]][["tree"]], min_species_list)
    
    # get distances
    distances = phangorn::treedist(tree1, tree2)
    dist_DB_min_set_species[["RF"]][i, k] = distances[[1]]
    dist_DB_min_set_species[["KF"]][i, k] = distances[[2]]
    #dist_DB_min_set_species[["path_diff"]][i, k] = distances[[3]]
    dist_DB_min_set_species[["MAST_%"]][i, k] = length(phangorn::mast(tree1, tree2, tree = FALSE)) / length(min_species_list)
    
    # clean
    rm(tree1, tree2, distances)
  }
}

Then we plot them based on the pairwise label sets between trees

# prepare plot layout
layout(matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE))

for(i in 1:length(dist_DB_pairwise)){
  # parameters for correlation VS distances
  correlation = FALSE
  color_scale = list(
    rev(COL2("RdBu")),
    COL1("Blues"),
    COL1("Blues"),
    COL1("YlOrRd")
  )
  tile_shape = "color"
  if(i == 1){
    correlation = TRUE
    tile_shape = "pie"
  }
  # plot
  corrplot::corrplot(
    dist_DB_pairwise[[i]],
    title = names(dist_DB_pairwise)[[i]],
    tile_shape, 
    "upper",
    order = "original",
    is.corr = correlation,
    col = color_scale[[i]],
    tl.srt = 45,
    tl.col = "grey15",
    mar = c(0, 0, 2, 0),
    cl.pos = "r",
    cl.align.text = "l"
  )
}

Figure 4: species trees distance based on pairwise label sets between trees

We can also see what are the distances if we restrict ourselves only to the isolates for which we have both the cp and the mt genomes (i.e.: full occupancy matrix)

# prepare plot layout
layout(matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE))

for(i in 1:length(dist_DB_min_set_species)){
  # parameters for correlation VS distances
  correlation = FALSE
  color_scale = list(
    rev(COL2("RdBu")),
    COL1("Blues"),
    COL1("Blues"),
    COL1("YlOrRd")
  )
  tile_shape = "color"
  if(i == 1){
    correlation = TRUE
    tile_shape = "pie"
  }
  # plot
  corrplot::corrplot(
    dist_DB_min_set_species[[i]],
    title = names(dist_DB_min_set_species)[[i]],
    tile_shape, 
    "upper",
    order = "original", 
    is.corr = correlation,
    col = color_scale[[i]],
    tl.srt = 45,
    tl.col = "grey15",
    mar = c(0, 0, 2, 0),
    cl.pos = "r",
    cl.align.text = "l")
}

Figure 5: species trees distance based on overlapping cp+mt isolates

The results indicate that the topology of Ulva species trees constructed by the two methods (ML and ASTRAL) with the cp+mt dataset is highly concordant. We therefore decide to use the cp+mt ML tree as true Ulva species tree for downstream analyses.

2.3 Lessons Learnt

So far, we have learnt:

  • we have reconstructed a Ulva species tree for the 15 Ulva species present in this study that we can use in downstream analyses to identify the species phylognetic markers;
  • the Ulva species tree reconstructed by combining 70 chloroplast genes and 29 mitochondrial genes is superior when compared to the species tree reconstructed using only the chloroplast or only the mitochondrial genes.

2.4 Session Information

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed

Matrix products: default
BLAS/LAPACK: /home/andrea/miniforge3/envs/moai/lib/libmkl_rt.so.2;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Brussels
tzcode source: system (glibc)

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] treeio_1.26.0         TreeDist_2.9.2        stringr_1.5.1        
 [4] scales_1.3.0          RColorBrewer_1.1-3    reshape_0.8.9        
 [7] phytools_2.4-4        maps_3.4.2.1          phylogram_2.1.0      
[10] phangorn_2.12.1       gridExtra_2.3         ggtree_3.10.1        
[13] ggplot2_3.5.1         ggdist_3.3.2          doSNOW_1.0.20        
[16] snow_0.4-4            iterators_1.0.14      foreach_1.5.2        
[19] dendextend_1.19.0     DECIPHER_2.30.0       RSQLite_2.3.9        
[22] Biostrings_2.70.3     GenomeInfoDb_1.38.8   XVector_0.42.0       
[25] IRanges_2.36.0        S4Vectors_0.40.2      BiocGenerics_0.48.1  
[28] corrplot_0.95         ComplexHeatmap_2.18.0 circlize_0.4.16      
[31] ape_5.8-1            

loaded via a namespace (and not attached):
  [1] jsonlite_1.8.9          shape_1.4.6.1           magrittr_2.0.3         
  [4] farver_2.1.2            rmarkdown_2.29          GlobalOptions_0.1.2    
  [7] fs_1.6.5                zlibbioc_1.48.2         vctrs_0.6.5            
 [10] memoise_2.0.1           RCurl_1.98-1.16         htmltools_0.5.8.1      
 [13] distributional_0.5.0    DEoptim_2.2-8           gridGraphics_0.5-1     
 [16] htmlwidgets_1.6.4       plyr_1.8.9              cachem_1.1.0           
 [19] igraph_2.1.4            mime_0.12               lifecycle_1.0.4        
 [22] pkgconfig_2.0.3         Matrix_1.6-5            R6_2.5.1               
 [25] fastmap_1.2.0           GenomeInfoDbData_1.2.11 rbibutils_2.3          
 [28] shiny_1.10.0            clue_0.3-66             digest_0.6.37          
 [31] numDeriv_2016.8-1.1     aplot_0.2.4             ggnewscale_0.5.0       
 [34] colorspace_2.1-1        patchwork_1.3.0         labeling_0.4.3         
 [37] clusterGeneration_1.3.8 compiler_4.3.2          bit64_4.6.0-1          
 [40] withr_3.0.2             doParallel_1.0.17       optimParallel_1.0-2    
 [43] viridis_0.6.5           DBI_1.2.3               R.utils_2.12.3         
 [46] MASS_7.3-60.0.1         rjson_0.2.23            scatterplot3d_0.3-44   
 [49] tools_4.3.2             httpuv_1.6.15           TreeTools_1.13.0       
 [52] R.oo_1.27.0             glue_1.8.0              quadprog_1.5-8         
 [55] nlme_3.1-167            R.cache_0.16.0          promises_1.3.2         
 [58] cluster_2.1.8           PlotTools_0.3.1         generics_0.1.3         
 [61] gtable_0.3.6            R.methodsS3_1.8.2       tidyr_1.3.1            
 [64] pillar_1.10.1           yulab.utils_0.2.0       later_1.4.1            
 [67] dplyr_1.1.4             lattice_0.22-6          bit_4.5.0.1            
 [70] tidyselect_1.2.1        knitr_1.49              xfun_0.50              
 [73] expm_1.0-0              matrixStats_1.5.0       stringi_1.8.4          
 [76] lazyeval_0.2.2          ggfun_0.1.8             yaml_2.3.10            
 [79] evaluate_1.0.3          codetools_0.2-20        tibble_3.2.1           
 [82] ggplotify_0.1.2         cli_3.6.3               xtable_1.8-4           
 [85] Rdpack_2.6.2            munsell_0.5.1           Rcpp_1.0.14            
 [88] coda_0.19-4.1           png_0.1-8               blob_1.2.4             
 [91] bitops_1.0-9            viridisLite_0.4.2       tidytree_0.4.6         
 [94] purrr_1.0.2             crayon_1.5.3            combinat_0.0-8         
 [97] GetoptLong_1.0.5        rlang_1.1.5             fastmatch_1.1-6        
[100] mnormt_2.1.1            shinyjs_2.1.0