Lossless and reference-free compression of FASTQ/A files using GeneSqueeze

We live in a world driven by big data, especially in genomics research and next-generation sequencing (NGS). Large genomic datasets, including FASTQ and FASTA files, are commonly stored and processed in the cloud. With the rapidly declining cost of sequencing, the volume of sequencing data has skyrocketed. However, as NGS data grows exponentially, so do the costs associated with cloud storage, data transfer, and data processing.

One of the biggest challenges in managing genomic data is reducing storage and transmission costs without compromising data integrity. Lossy compression methods shrink files by discarding information, which is a critical issue in bioinformatics workflows that require exact genomic data, while general-purpose lossless compressors are not tailored to the structure of sequencing files.

Rajant Health, through its Trovomics software platform, has developed GeneSqueeze, a lossless compression algorithm specifically designed for genomic data formats like FASTQ/A files. Unlike traditional compression techniques, GeneSqueeze is reference-free and leverages the inherent patterns in sequencing files to achieve significant size reductions.

Key Benefits of GeneSqueeze:

  • Lossless data compression that preserves IUPAC nucleotides and read identifiers.

  • Auto-tuning compression protocol that adapts to each file's unique structure.

  • Compatibility with all FASTQ/A file attributes, including varying read lengths, number of reads, and read identifier formats.
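To make "lossless" concrete, here is a minimal Python sketch of a round-trip check on a tiny FASTQ sample. It is not GeneSqueeze's algorithm, which is not described in this post. It swaps in the general-purpose zlib compressor and a simple split of each four-line record into separate streams (identifiers, sequences, separator lines, quality strings), only to illustrate what preserving every byte means. The function names `compress_fastq` and `decompress_fastq` are hypothetical, and the sketch assumes ASCII input with `\n` line endings and a trailing newline.

```python
import zlib


def compress_fastq(text: str) -> list[bytes]:
    """Split FASTQ text into four per-line-type streams and compress each.

    Illustrative only: uses zlib, not GeneSqueeze. Assumes ASCII text,
    '\n' line endings, a trailing newline, and four lines per record.
    """
    lines = text.splitlines()
    if not lines or len(lines) % 4 != 0:
        raise ValueError("expected a non-empty FASTQ with four lines per record")
    # Stream 0: identifiers, 1: sequences, 2: '+' lines, 3: quality strings.
    streams = [lines[k::4] for k in range(4)]
    return [zlib.compress("\n".join(s).encode("ascii"), 9) for s in streams]


def decompress_fastq(blobs: list[bytes]) -> str:
    """Rebuild the original FASTQ text from the four compressed streams."""
    streams = [zlib.decompress(b).decode("ascii").split("\n") for b in blobs]
    # Interleave the streams back into four-line records.
    return "".join("\n".join(record) + "\n" for record in zip(*streams))


# Two reads with different lengths, IUPAC codes (N, R, Y), and
# different '+' line styles, all of which must survive the round trip.
sample = (
    "@read1 sample=A\n"
    "ACGTNRYACGT\n"
    "+\n"
    "IIIIIIIIIII\n"
    "@read2 sample=A\n"
    "ACGTN\n"
    "+read2 sample=A\n"
    "IIIII\n"
)

blobs = compress_fastq(sample)
restored = decompress_fastq(blobs)
if restored != sample:
    raise RuntimeError("round trip was not lossless")
```

Grouping similar lines together is a common trick in sequencing-file compressors because identifiers, bases, and quality scores each have their own repetitive patterns. Whatever the compressor, the final equality check is the property that matters: the restored file must match the original exactly.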

By implementing GeneSqueeze, research institutions and biotech companies can optimize their genomic data storage and data transfer processes, significantly reducing operational costs while maintaining the accuracy of their sequencing data.

For a deeper dive into the algorithm’s performance, see our Nature publication.
