Machine learning (ML) holds promise for enabling computers to assist humans in the analysis of large, complex data sets. Here, we adapt and develop ML models for the analysis of genomics data sets, including variant calling, gene expression, epigenetic, proteomic, and metabolomic data.

Variant Calling Refinement

Variant calling refinement is the process of identifying “true variants” among the “raw variants” derived from sequence reads aligned against a reference genome. Refinement eliminates false positives from the raw variant list through heuristic filtering and manual review. Heuristic filtering sets project-specific thresholds on sequencing features such as read coverage depth, variant allele fraction (VAF), and base quality metrics. Manual review, in contrast, requires direct examination of aligned reads in a genomic viewer such as the Integrative Genomics Viewer (IGV) to identify false positives that automated variant callers consistently miss. Manual variant refinement is generally time-consuming, costly, poorly standardized, and non-reproducible. Here, we systematized and standardized variant refinement using a machine learning approach.
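
To make the heuristic filtering step concrete, the following is a minimal sketch in Python. It is an illustration only, not our refinement pipeline: the `RawVariant` record, the field names, and the threshold values (minimum depth 20, minimum VAF 0.05, minimum mean base quality 20) are all hypothetical, and real projects would choose thresholds specific to their sequencing design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawVariant:
    """Hypothetical record for one raw variant call (illustrative fields only)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int                 # total reads covering the site
    alt_reads: int             # reads supporting the alternate allele
    mean_base_quality: float   # mean Phred quality of alt-supporting bases

    @property
    def vaf(self) -> float:
        # Variant allele fraction: alt-supporting reads / total reads.
        return self.alt_reads / self.depth if self.depth > 0 else 0.0


# Assumed, project-specific thresholds; these values are placeholders.
MIN_DEPTH = 20
MIN_VAF = 0.05
MIN_BASE_QUALITY = 20.0


def passes_heuristic_filter(
    v: RawVariant,
    min_depth: int = MIN_DEPTH,
    min_vaf: float = MIN_VAF,
    min_base_quality: float = MIN_BASE_QUALITY,
) -> bool:
    """Return True only if the variant clears every threshold."""
    return (
        v.depth >= min_depth
        and v.vaf >= min_vaf
        and v.mean_base_quality >= min_base_quality
    )


def refine(variants: list[RawVariant]) -> list[RawVariant]:
    """Keep only the raw variants that pass the heuristic filter."""
    return [v for v in variants if passes_heuristic_filter(v)]


# Made-up example calls, one per failure mode plus one that passes.
raw_variants = [
    RawVariant("chr1", 12345, "A", "T", depth=100, alt_reads=30, mean_base_quality=35.0),
    RawVariant("chr1", 22345, "G", "C", depth=8, alt_reads=4, mean_base_quality=36.0),    # low depth
    RawVariant("chr2", 500, "C", "T", depth=200, alt_reads=2, mean_base_quality=34.0),    # low VAF
    RawVariant("chr3", 777, "T", "G", depth=60, alt_reads=20, mean_base_quality=12.0),    # low base quality
]

kept = refine(raw_variants)
for v in kept:
    print(v.chrom, v.pos, v.ref, v.alt, round(v.vaf, 3))
```

A filter of this kind is fully reproducible, but it only captures what fixed thresholds can express. The false positives that need manual review in IGV are the ones such rules miss, which is the gap the machine learning approach is meant to address.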

Image Processing

Advances in automated, high-throughput imaging technologies have produced a deluge of high-resolution images and sensor data of plants. Extracting patterns and features from this large corpus of data requires ML-based tools for data assimilation and feature identification in plant phenotyping. Different ML approaches can be deployed at four stages of the decision cycle in plant phenotyping and plant breeding: (i) identification, (ii) classification, (iii) quantification, and (iv) prediction (ICQP).

Image adapted from Nguyen et al. 2017