Development of a Big Data Toolset for Identifying Functional Genetic Variants
Yiqing Zhao, Max M. He
Center for Human Genetics
Research area: Genetics
Background: High-throughput next-generation sequencing technologies, such as whole-genome sequencing (WGS) and whole-exome sequencing (WES), are widely used to identify disease- and/or drug-associated genetic mutations in human genomic studies and clinical settings. Existing annotation programs generate meaningful biological information for every genetic variant in a single WGS/WES VCF (Variant Call Format) file. This project aimed to develop a web-based variant annotation program, cloudANNOVAR, on a Big Data infrastructure.
Methods: cloudANNOVAR was built on Apache Cassandra, a Big Data infrastructure that provides a scalable, centralized database, to extract meaningful biological insights from large-scale genetic variants (for example, identifying mutations that occur in exonic regions and annotating specific mutations with pathogenicity information). All annotated information is stored and managed in the centralized Cassandra database, so variants that have already been annotated in other samples do not need to be re-annotated. We compared ANNOVAR and cloudANNOVAR on a WES dataset. Accuracy was calculated using the ANNOVAR results as the gold standard.
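The annotate-once idea and the accuracy comparison described above can be sketched as follows. This is a minimal illustration, not the cloudANNOVAR implementation: a plain Python dictionary stands in for the centralized Cassandra table, and the names `annotate_with_cache`, `accuracy`, and the `(chrom, pos, ref, alt)` variant key are hypothetical. The abstract does not define how accuracy was computed; the sketch assumes it is the fraction of gold-standard variants whose annotation matches exactly.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def annotate_with_cache(
    variants: Iterable[Variant],
    store: Dict[Variant, str],
    annotate: Callable[[Variant], str],
) -> Dict[Variant, str]:
    """Annotate variants, reusing annotations already held in `store`.

    `store` is a stand-in for the centralized database (assumption: a dict
    here, a Cassandra table in the system described above).
    """
    results: Dict[Variant, str] = {}
    for variant in variants:
        if variant not in store:
            # Only variants never seen in any earlier sample are annotated.
            store[variant] = annotate(variant)
        results[variant] = store[variant]
    return results


def accuracy(predicted: Dict[Variant, str], gold: Dict[Variant, str]) -> float:
    """Fraction of gold-standard variants with an identical annotation
    (assumed definition; the abstract does not specify one)."""
    if not gold:
        raise ValueError("gold standard must contain at least one variant")
    matches = sum(1 for v, label in gold.items() if predicted.get(v) == label)
    return matches / len(gold)
```

With this design, a second sample that shares variants with the first triggers the annotation function only for its previously unseen variants, which is the reuse the centralized database is meant to provide.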
Results: cloudANNOVAR achieved an accuracy of 98.8% relative to the results generated by ANNOVAR. On a single-node server, cloudANNOVAR took about 50 seconds to annotate the variants from a WES dataset, whereas ANNOVAR took about 25 minutes to complete the same job, a roughly 30-fold speedup.
Conclusions: Incorporating Big Data infrastructure into an annotation pipeline can substantially reduce annotation time. The performance of cloudANNOVAR can be further improved by using a cluster with multiple data nodes. In addition, cloudANNOVAR can be incorporated into other statistical analysis tools, such as SeqHBase, for better performance in the future.