Rising Junior (Walnut High School) Shares his Summer Research on Hybrid Genome Assembly

7/24/2017

The three tracks in the photo above are as follows: genome (scf180000022261), annotation from the Nematostella training set (first_try.gff), and annotation from the Mnemiopsis training set (whole_pipe.gff). The Mnemiopsis training set predicted one large gene with many introns and exons, but the Nematostella training set predicted many smaller genes.

For the past month, I have been assembling and annotating the genome of Renilla muelleri, commonly known as a sea pansy. We isolated DNA from organisms acquired through the aquarium trade and sequenced it on Illumina HiSeq and MiSeq instruments and on a Pacific Biosciences RS II. Our paired-end Illumina reads had an average length of 166 bp at 190x coverage, and our PacBio reads were at 10x coverage. I fed the Illumina reads and the PacBio subreads to MaSuRCA-3.2.2 for a hybrid de novo assembly; the resulting scaffolds were run through stats.sh (BBMap) and QUAST to calculate genome statistics. Both programs reported the genome to be ~185 Mb with a GC content of 36.3%. The assembly had 6,036 contigs with an N50 of 63.191 Kb. To compare different assemblies, we used SPAdes to assemble a genome with only the Illumina reads. According to the same statistics programs, the Illumina-only assembly had a similar GC content of 36.9%, but a genome size of ~255 Mb. I attribute this difference to the difficulty that short-read-only assemblers have in resolving repeat regions. Without longer reads, such as those from PacBio, that span a repeat, the assembler cannot determine how long the repetitive region is.
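For readers unfamiliar with the statistics mentioned above, here is a minimal Python sketch of how N50 and GC content can be computed from a set of contig sequences. It is not the code used by stats.sh or QUAST; the function names `n50` and `gc_content` and the toy contigs are made up for illustration, and real tools handle details (minimum contig length, ambiguous bases) in their own ways.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running
    total, taken from longest to shortest, reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def gc_content(seqs):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T).
    Ambiguous bases such as N are ignored (an assumption of this sketch)."""
    gc = 0
    acgt = 0
    for seq in seqs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0


# Toy example with hypothetical contigs (not Renilla data)
contigs = ["GGCCGGCCAA", "ATATATGC", "GCGCAT", "ATTA", "GC"]
print(n50([len(c) for c in contigs]))
print(round(gc_content(contigs), 3))
```

A larger N50 means that more of the assembly sits in long contigs, which is one reason to compare the hybrid and Illumina-only assemblies with the same statistics.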

I then used the draft hybrid assembly for annotation with Augustus, a gene prediction program. I ran two annotations with Augustus, one using training data from Nematostella vectensis (provided by Joe Ryan) and the other using training data from Mnemiopsis leidyi; both used RNA-seq data from Renilla provided by J. Ryan. Based on previous research, Renilla is expected to have anywhere from 15,000 to 25,000 genes. With the Nematostella training set, 20,464 genes were predicted, and with the Mnemiopsis training set, 8,588 genes were predicted. The Mnemiopsis training set was able to predict exons and introns, but the Nematostella training set seemed to have predicted each exon as a separate gene. Because of this, the Mnemiopsis annotation had fewer than half as many predicted genes as the Nematostella annotation. I tested these training sets, but I realize that Nematostella and Mnemiopsis diverged from Renilla more than 500 million years ago. In the future, I will use a training set specific to Renilla generated from RNA-seq data. I am also planning on assembling a final genome with environmental contaminants (e.g., bacteria, viruses) removed and then running a final annotation. Stay posted for the draft genome and annotation files!
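One quick way to check whether a training set is splitting genes into single exons is to count the feature types in the annotation file. The Python sketch below assumes a tab-separated, nine-column GFF in which the third column holds feature types such as "gene" and "CDS"; the helper names and the toy lines are hypothetical and are not taken from first_try.gff or whole_pipe.gff.

```python
from collections import Counter


def count_features(gff_lines):
    """Count rows per feature type (GFF column 3), skipping comments,
    blank lines, and rows that do not have nine tab-separated fields."""
    counts = Counter()
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


def mean_cds_per_gene(counts):
    """Average number of CDS rows per gene row; a value close to 1
    suggests that most predicted genes have a single coding exon."""
    genes = counts.get("gene", 0)
    return counts.get("CDS", 0) / genes if genes else 0.0


# Toy annotation (hypothetical scaffold name and coordinates)
toy_gff = [
    "# example annotation",
    "scaffold_1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tg1",
    "scaffold_1\tAUGUSTUS\tCDS\t100\t300\t.\t+\t0\tg1.t1",
    "scaffold_1\tAUGUSTUS\tCDS\t500\t900\t.\t+\t0\tg1.t1",
    "scaffold_1\tAUGUSTUS\tgene\t1200\t1500\t.\t-\t.\tg2",
    "scaffold_1\tAUGUSTUS\tCDS\t1200\t1500\t.\t-\t0\tg2.t1",
]
counts = count_features(toy_gff)
print(counts["gene"], counts["CDS"], mean_cds_per_gene(counts))
```

To use it on a real file, pass an open file handle, for example `count_features(open("annotation.gff"))`, where the file name is a placeholder. Comparing the ratio between two annotations gives a rough sense of which one is predicting multi-exon genes.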

Justin Jiang, Walnut High School ’19

Although Justin spends most of his time on bioinformatics, he also takes breaks from the command line and helps with lab work, which includes running gels and PCRs.
