GenomicsDB stores variant data in a 2D array where:
- Each column corresponds to a genomic position (chromosome + position);
- Each row corresponds to a sample in a VCF (or CallSet in the GA4GH
terminology);
- Each cell contains data for a given sample/CallSet at a given position;
data is stored in the form of cell attributes;
- Cells are stored in column-major order, which makes accessing cells with
the same column index (i.e. data for a given genomic position over all
samples) fast;
- Interval data (variant intervals such as deletions, and gVCF REF blocks)
is stored in a cell at the start of the interval, with END stored as a
cell attribute. For these intervals, an additional cell is stored at the
END position. When a genomic position is queried, the query library
performs an efficient sweep to find all intervals that intersect the
queried position.
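
The layout above can be illustrated with a small Python sketch. This is not the GenomicsDB API or its actual query implementation; `Cell`, `build_cells`, and `query_position` are hypothetical names made up for illustration. It assumes that genomic positions are flattened to a single integer column, that the additional END cell also records the interval start, and that intervals belonging to one sample do not overlap. Under those assumptions, one plausible sweep starts at the queried column and moves forward; the first cell seen for each row decides whether that sample has an interval covering the position.

```python
from bisect import bisect_left
from typing import NamedTuple


class Cell(NamedTuple):
    column: int        # genomic position (flattened to one integer here)
    row: int           # sample / CallSet index
    start: int         # interval start (assumed to be available in every cell)
    end: int           # END attribute
    is_end_cell: bool  # True for the additional cell stored at END


def build_cells(calls):
    """Turn (row, start, end) calls into cells sorted in column-major order."""
    cells = []
    for row, start, end in calls:
        cells.append(Cell(start, row, start, end, False))
        if end > start:  # interval: store an additional cell at END
            cells.append(Cell(end, row, start, end, True))
    # Column-major order: sort by column first, then by row.
    return sorted(cells, key=lambda c: (c.column, c.row))


def query_position(cells, position, num_rows):
    """Return {row: (start, end)} for intervals that cover `position`."""
    columns = [c.column for c in cells]
    first = bisect_left(columns, position)  # first cell with column >= position
    hits, resolved = {}, set()
    for cell in cells[first:]:
        if len(resolved) == num_rows:
            break  # every sample has been decided
        if cell.row in resolved:
            continue
        resolved.add(cell.row)
        # The first cell seen for a row decides it: an END cell whose start
        # is at or before `position`, or a start cell exactly at `position`,
        # means the interval covers `position`.
        if cell.start <= position <= cell.end:
            hits[cell.row] = (cell.start, cell.end)
    return hits


# Hypothetical data: (row, start, end) for four samples.
calls = [(0, 100, 200), (1, 150, 150), (2, 160, 170), (3, 90, 120)]
cells = build_cells(calls)
result = query_position(cells, 150, num_rows=4)
```

Here `result` contains rows 0 and 1: sample 0 is found through its END cell at column 200, and sample 1 through its single-position cell at column 150. Samples 2 and 3 are excluded because their intervals start after, or end before, position 150.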