1  QC and Genome Composition assessment

1.1 On this page

Biological insights and take-home messages are at the bottom of the page at section Lesson Learnt: Section 6.3.

  • Here

1.2 Reads QC and Filtering

Just trim low quality bases with Trimmomatic and discard short reads. I keep only properly paired reads and I discard the unpaired, since the latter are just a tiny fraction of my total data, they complicate the analysis and there is not much added value in them (at least for the analysis we run).

The file sample.lst just contains the names of my kveiks samples.

# generate FastQC report
for file in *.fastq; do
  ~/bin/FastQC/fastqc $file &
done

# LowQ base trimming and filtering
while read line; do
    java -jar ~/bin/trimmomatic/trimmomatic.jar PE \
        -threads 72 \
        -phred33 \
        $line.R1.fq.gz $line.R2.fq.gz \
        $line.R1.tr.fq.gz $line.R1.tr.un.fq.gz $line.R2.tr.fq.gz $line.R2.tr.un.fq.gz \
        SLIDINGWINDOW:10:30 \
        TRAILING:30 \
        MINLEN:50
done < sample.lst

1.3 Genome Composition

We check the composition of the sequenced kveiks by aligning the reads to a set of Saccharomyces reference genomes.

First, we align the reads to the six Saccharomyces reference genomes. Then we create 10kb windows with bedtools and we calculate the average coverage for each window.

## index the reference genome
~/bin/bwa/bwa index Saccharomyces_RefGen.fa

## reads alignment
while read line ; do

    bwa mem -t 72 \
        -K 100000000 \
        Saccharomyces_RefGen.fa \
      "${line}".R1.tr.fq.gz \
      "${line}".R2.tr.fq.gz \
    > "${line}".align.sam
    
  samtools view -@ 72 -Sb \
    "${line}".align.sam > "${line}".align.bam
    
  samtools sort -@ 72 
    "${line}".align.bam "${line}".align.sort
    
done < ../sample.lst 

## bam2bed
# prepare reference genomes
samtools-1.9/samtools faidx Saccharomyces_RefGen.fa
gatk CreateSequenceDictionary --R Saccharomyces_RefGen.fa
cut -f 2,3 Saccharomyces_RefGen.dict \
  | tail -n +2 \
  | sed 's/LN://g' \
  | sed 's/SN://g' \
  > Saccharomyces_RefGen.bedchor
bedtools makewindows \
  -g Saccharomyces_RefGen.bedchor \
    -w 1000 \
    > Saccharomyces_RefGen.1kb_win.tab

# calculate mean coverage
while read line; do
    bedtools coverage \
      -a Saccharomyces_RefGen.1kb_win.tab \
      -b "${line}".align.sort.bam -mean \
      > "${line}".align.sort.1kb_cov.bed;
done

# format output
for file in *.1kb_cov.bed ; do
    sed -i 's/Scere_/Scere\t/g' $file;
    sed -i 's/Seuba_/Seuba\t/g' $file;
    sed -i 's/Skudr_/Skudr\t/g' $file;
    sed -i 's/Smika_/Smika\t/g' $file;
    sed -i 's/Spara_/Spara\t/g' $file;
    sed -i 's/Suvar_/Suvar\t/g' $file;
    python bed_remove0coverage.py \
      --input $file \
      | sed '/^[[:space:]]*$/d' \
      > $(basename $file .bed).no0.bed;
done

for file in *.no0.bed; do
    NAME=$(basename $file .align.sort.1kb_cov.no0.bed);
    sed -i "s/^/$NAME\t/" $file;
done

cat *.1kb_cov.no0.bed > Viking.species.cov.bed


# generate per chromosomes coverages
while read line; do
    OLD=$(echo $line | cut -d " " -f 1)
    NEW=$(echo $line | cut -d " " -f 2)
    sed -i "s/$OLD\t/$NEW\t/g" Viking.species.cov.chr.bed
done < ref_chr_names.lst 

# clean output
for file in *.no0.bed; do
    cp $file $file.1;
    while read line; do
        OLD=$(echo $line | cut -d " " -f 1);
        NEW=$(echo $line | cut -d " " -f 2);
        sed -i "s/$OLD\t/$NEW\t/g" $file.1;
    done < ../00_ref_genomes/ref_chr_names.lst;
    grep Chr $file.1 > $(basename $file .bed).chr.bed;
    rm $file.1 &
done

We then remove windows with 0 coverage to have a clearer figure, and we plot the coverages with transforming the y axis in log10 scale, since Krogerus kveiks have much higher coverage than the strains we sequenced.

data_cov = read.delim("data/p01-01/Viking.species.cov.chr.bed", header = FALSE)
data_cov$V7 = ifelse(data_cov$V6 >= 1, "Up", "Down")

# ylim = 1500
ggplot(data_cov, aes(x = V2, y = V6)) +
  geom_jitter(aes(color = V7),
              show.legend = FALSE,
              alpha = 0.035,
              shape = ".",
              position = position_jitter(0.475)) +
  geom_hline(yintercept = 1, linetype = 2, color = "red") +
  coord_cartesian(expand = FALSE) +
  scale_color_manual(values  = c("grey75", "cyan4")) + 
  scale_y_continuous(limits = c(-4, 9000),
                     trans = "log2",
                     breaks = c(1, 10, 100, 1000),
                     labels = scales::comma) +
  labs(title = "Coverage of Reference Genomes",
       x = "Reference Species",
       y = "Coverage") +
  facet_wrap(~V1, scales = "free_y") +
  theme(plot.title = element_text(hjust = 0.5, size = 20),
        panel.background = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(colour = "gray75", size = 0.5),
        strip.background = element_rect(colour = "black", fill = NA),
        strip.text = element_text(size = 8),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.65))

Figure 1: Coverage of Reference Genomes.

All sequenced farmhouse yeasts are pure S. cerevisiae strains, except for Muri and k7R25 that are S. cerevisiae, S. eubayanus and S. uvarum triple hybrids.

We can dig in into the coverage of the single chromosomes and visualize the extent of hybridization and eventual aneuploidies.

plot_lst[["14R30"]]

Figure 2: Subgenome coverages.
plot_lst[["14R6"]]

Figure 2: Subgenome coverages.
plot_lst[["16R23"]]

Figure 2: Subgenome coverages.
plot_lst[["16R37"]]

Figure 2: Subgenome coverages.
plot_lst[["17P5"]]

Figure 2: Subgenome coverages.
plot_lst[["19R18"]]

Figure 2: Subgenome coverages.
plot_lst[["1R16"]]

Figure 2: Subgenome coverages.
plot_lst[["21P1"]]

Figure 2: Subgenome coverages.
plot_lst[["21R38"]]

Figure 2: Subgenome coverages.
plot_lst[["27R17"]]

Figure 2: Subgenome coverages.
plot_lst[["28P1"]]

Figure 2: Subgenome coverages.
plot_lst[["28P6"]]

Figure 2: Subgenome coverages.
plot_lst[["28R21"]]

Figure 2: Subgenome coverages.
plot_lst[["28R33"]]

Figure 2: Subgenome coverages.
plot_lst[["28R8"]]

Figure 2: Subgenome coverages.
plot_lst[["2R23"]]

Figure 2: Subgenome coverages.
plot_lst[["38R16"]]

Figure 2: Subgenome coverages.
plot_lst[["39R20"]]

Figure 2: Subgenome coverages.
plot_lst[["3R11"]]

Figure 2: Subgenome coverages.
plot_lst[["40R14"]]

Figure 2: Subgenome coverages.
plot_lst[["40R1"]]

Figure 2: Subgenome coverages.
plot_lst[["40R20"]]

Figure 2: Subgenome coverages.
plot_lst[["41R10"]]

Figure 2: Subgenome coverages.
plot_lst[["41R15"]]

Figure 2: Subgenome coverages.
plot_lst[["42R20"]]

Figure 2: Subgenome coverages.
plot_lst[["42R31"]]

Figure 2: Subgenome coverages.
plot_lst[["44R32"]]

Figure 2: Subgenome coverages.
plot_lst[["44R7"]]

Figure 2: Subgenome coverages.
plot_lst[["45P5"]]

Figure 2: Subgenome coverages.
plot_lst[["45R11"]]

Figure 2: Subgenome coverages.
plot_lst[["46R12"]]

Figure 2: Subgenome coverages.
plot_lst[["46R37"]]

Figure 2: Subgenome coverages.
plot_lst[["6R15"]]

Figure 2: Subgenome coverages.
plot_lst[["7R7"]]

Figure 2: Subgenome coverages.
plot_lst[["8R19"]]

Figure 2: Subgenome coverages.
plot_lst[["9R40"]]

Figure 2: Subgenome coverages.
plot_lst[["Granvin1"]]

Figure 2: Subgenome coverages.
plot_lst[["Hornindal1"]]

Figure 2: Subgenome coverages.
plot_lst[["Hornindal2"]]

Figure 2: Subgenome coverages.
plot_lst[["k7R15"]]
NULL
plot_lst[["k7R25"]]

Figure 2: Subgenome coverages.
plot_lst[["Laerdal2"]]

Figure 2: Subgenome coverages.
plot_lst[["Muri"]]

Figure 2: Subgenome coverages.
plot_lst[["SortdalEbbe1"]]

Figure 2: Subgenome coverages.
plot_lst[["Voss1"]]

Figure 2: Subgenome coverages.

Muri and k7R25 are triple hybrids S. cerevisiae X S. eubayanus X S. uvarum. What is extremely interesting is that they share the same genomic composition in terms of origin of the chromosomes, but they have distinct CNVs patterns. I would suggest a common origin for such hybrid.

1.4 Ploydies estimation

We can estimates Kveiks ploidy by looking at the k-mer frequency distributions of the sequenced reads obtained from whole genome sequencing.


# calculate kmer freq
for file in 01_alignments/*bam; do
  ~/bin/ntCard/bin/ntcard \
    --threads=64 \
    --kmer=41 \
    --pref=$(basename $file .align.sort.md.r.bam).kmer41freq \
    $file;
done

# merge outputs
for file in *.hist; do
  tail -n +3 $file > $file.1;
  mv $file.1 $file;
  cat $file | sed "s/^/$(basename $file .kmer41freq_k41.hist)\t/g";
done > Viking.species.kmer41freq.txt

Kmer frequency distribution plots can suggest us the degree of heterozygosity of kveiks genomes. All Kveiks seems to have a high degree of heterozygosity, except for hybrid strains Muri and k7R25. This support the idea that the sequenced kveiks so far are tetraploid S. cerevisiae, while Muri and k7R25 are triploid hybrids S. cerevisiae X Seubayanus X S. uvarum.

data_k = read.delim("./data/p01-01/Viking.species.kmer41freq.mod.txt", header = FALSE)
data_k$V2 = as.numeric(data_k$V2)
data_k$V3 = as.numeric(data_k$V3)

# plot kmer distributions
ggplot(data_k, aes(x = V2, y = V3)) +
  geom_line(color = "steelblue", size = 1) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  facet_wrap(~V1, scales = "free") +
  labs(title = "K-mer frequencies distribution",
       x = "k-mer depth",
       y = "k-mer frequency") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        axis.title = element_text(size = 18),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "black", fill = NA),
        strip.background = element_rect(colour = "black", fill = NA),
        strip.text = element_text(size = 8))

Figure 2: K-mer frequencies distribution.

1.5 Lessons Learnt

Based on the we have learnt:

  • Fr

1.6 Session Information

R version 4.5.1 (2025-06-13)
Platform: x86_64-suse-linux-gnu
Running under: openSUSE Tumbleweed

Matrix products: default
BLAS:   /usr/lib64/R/lib/libRblas.so 
LAPACK: /usr/lib64/R/lib/libRlapack.so;  LAPACK version 3.12.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Brussels
tzcode source: system (glibc)

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] magrittr_2.0.3 gridExtra_2.3  ggplot2_3.5.2 

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5        cli_3.6.5          knitr_1.50         rlang_1.1.6       
 [5] xfun_0.52          generics_0.1.3     jsonlite_2.0.0     labeling_0.4.3    
 [9] glue_1.8.0         DT_0.33            htmltools_0.5.8.1  sass_0.4.10       
[13] scales_1.4.0       rmarkdown_2.29     jquerylib_0.1.4    crosstalk_1.2.1   
[17] evaluate_1.0.3     tibble_3.2.1       fastmap_1.2.0      yaml_2.3.10       
[21] lifecycle_1.0.4    compiler_4.5.1     dplyr_1.1.4        RColorBrewer_1.1-3
[25] htmlwidgets_1.6.4  pkgconfig_2.0.3    farver_2.1.2       digest_0.6.37     
[29] R6_2.6.1           tidyselect_1.2.1   pillar_1.10.2      bslib_0.9.0       
[33] withr_3.0.2        tools_4.5.1        gtable_0.3.6       cachem_1.1.0