An Integrated Proteogenomic Pipeline Helps Re-annotate the Genome of the Model Diatom Phaeodactylum tricornutum

Diatoms comprise a large group of photosynthetic eukaryotic phytoplankton that commonly inhabit freshwater and marine habitats. They are estimated to assimilate at least 20% of CO2 on Earth and to provide about 40% of annual marine primary productivity. Phaeodactylum tricornutum is commonly used as a model organism for studying diatom biology. Although its genome was sequenced in 2008, a high-quality genome annotation has still not been obtained for this diatom.

Proteogenomics has been applied to the identification of previously unidentified genes and to the correction and validation of predicted genes in various organisms. The research group led by Dr. GE Feng at the Institute of Hydrobiology (IHB) of the Chinese Academy of Sciences previously performed a systematic proteogenomic analysis of Synechococcus sp. PCC 7002 (PNAS, 2014, 111(52):E5633-E5642) and developed a one-stop open-source software tool, termed GAPP, for carrying out genome annotation in prokaryotes (Molecular & Cellular Proteomics. 2016;15(11):3529-3539).

Building on these analyses, the researchers further developed a systematic approach for conducting an integrated proteogenomic analysis of Phaeodactylum tricornutum using mass spectrometry (MS).

In their study, a minimally redundant proteogenomic database was constructed from a six-frame translation of genomic sequences and a three-frame translation of RNA sequences. To maximize coverage and achieve in-depth identification of peptides and proteins, the researchers combined two different sample prefractionation methods, digestion with two different enzymes, five different algorithms, and a more stringent false discovery rate (FDR) filtering strategy.
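To illustrate what a six-frame translation involves, the sketch below translates a DNA sequence in all three reading frames on both strands and collects the distinct stop-free segments as candidate database entries. It is only a minimal illustration, not the pipeline used in the study: the function names, the `min_length` cutoff, and the use of the standard genetic code are assumptions made for this example.

```python
from itertools import product

# Standard genetic code (assumed here), listed in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate a DNA sequence in frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def candidate_peptides(frames, min_length=7):
    """Split each frame at stop codons and keep distinct segments.

    Using a set removes duplicate sequences, a simple stand-in for the
    redundancy reduction mentioned in the text. min_length is a
    hypothetical cutoff chosen for illustration.
    """
    peptides = set()
    for protein in frames.values():
        for segment in protein.split("*"):
            if len(segment) >= min_length:
                peptides.add(segment)
    return peptides


if __name__ == "__main__":
    example = six_frame_translation("ATGGCCAAATAA")
    for name, protein in example.items():
        print(name, protein)
```

A three-frame translation of RNA sequences would follow the same idea but use only the forward strand, since a transcript's orientation is already known.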

A total of 6,628 Phaeodactylum tricornutum proteins were unambiguously identified on the basis of reliable peptide evidence. According to the protein-coding potential analysis, 1,895 genes predicted from the Phaeodactylum tricornutum genome may not code for any proteins. Notably, the proteogenomic analysis unambiguously revealed 606 possible new protein-coding annotations, 506 corrections to existing gene models, 94 splice variants, and 58 single amino acid variants. Among the 606 possible novel genes, 56 were confirmed by the proteomic data to encode bona fide proteins that had previously been mis-annotated as lincRNAs in Phaeodactylum tricornutum.

Importantly, using the same experimental MS data, they identified 24 different posttranslational modifications (PTMs), which may play important roles in cellular functions. These findings expand the genomic landscape of Phaeodactylum tricornutum and provide a rich resource for the study of diatom biology. The proteogenomic pipeline developed in this study is applicable to any sequenced eukaryote and thus represents a significant contribution to the toolset for eukaryotic proteogenomic analysis.

The results were published in Molecular Plant under the title “Genome annotation of a model diatom Phaeodactylum tricornutum using an integrated proteogenomic pipeline”. This study was supported by grants from the National Key Research and Development Program (2016YFA0501304), the National Natural Science Foundation of China (Grant No. 31570829), and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB14030202).

 

Overview of the workflow for the proteogenomic analysis of P. tricornutum (Image by IHB)