Case Study

Data Storage and Analysis Solution for Computational Biology Group

Problem

A company focused on exon expression needed data storage and processing solutions to power its computational biology group. First, all of the data had to be ingested into a store that supported fast queries for exons near specific positions on the genome. Tools would then be built on top of this database to analyze exon expression in large genomic studies.

The first challenge the team faced was cleaning, aggregating, and storing hundreds of gigabytes of genomic data from GENCODE and Omicsoft in a database. For this analysis, a standard PostgreSQL database was identified as the best fit, offering fast query and join times across the multiple views of the data requested by the client. Loading the files the client had received was the main hurdle, as some of them were on the order of 100 GB.
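As an illustration of the kind of lookup described above (exons near a given genomic position), here is a minimal sketch. It is not the client's schema: the table name, column names, and sample rows are hypothetical, and Python's built-in sqlite3 stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3

# Assumption: in-memory sqlite3 stands in for PostgreSQL.
# Table and column names below are hypothetical, not the client's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exons ("
    "exon_id TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
)
# A composite index lets the database narrow the search by chromosome
# and start position instead of scanning every row.
conn.execute("CREATE INDEX idx_exons_chrom_start ON exons (chrom, start_pos)")
conn.executemany(
    "INSERT INTO exons VALUES (?, ?, ?, ?)",
    [
        ("exon_a", "chr1", 1000, 1200),
        ("exon_b", "chr1", 5000, 5300),
        ("exon_c", "chr2", 1100, 1250),
    ],
)


def exons_near(conn, chrom, position, window):
    """Return IDs of exons on `chrom` that overlap
    the interval [position - window, position + window]."""
    lo, hi = position - window, position + window
    rows = conn.execute(
        "SELECT exon_id FROM exons "
        "WHERE chrom = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (chrom, hi, lo),
    ).fetchall()
    return [row[0] for row in rows]


nearby = exons_near(conn, "chr1", 1500, 500)
```

The overlap test (start at or before the window's end, end at or after the window's start) is the standard interval-overlap condition; the same query shape carries over to PostgreSQL with its own parameter placeholder style.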

Solution

To handle files of this size, the team built a Dask pipeline that loads them in parts. We then wrote SQL scripts to perform the necessary table updates and to partition the data for the analysis pipelines. Mapping and annotation were done with in-house gene query services that consume APIs such as mygene.info and Ensembl.
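The idea of loading a large file in parts can be sketched as follows. This is not the Dask pipeline itself: plain Python batching stands in for Dask partitions and sqlite3 stands in for PostgreSQL, so that the example is self-contained. The column names, the cleaning rule, and the sample data are all hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice


def load_in_batches(lines, conn, batch_size):
    """Stream CSV rows into the database `batch_size` rows at a time,
    so the whole file never has to fit in memory.
    Returns the number of batches processed."""
    reader = csv.DictReader(lines)
    batches = 0
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        # Hypothetical cleaning step: drop rows with no expression value.
        rows = [
            (r["exon_id"], r["sample_id"], float(r["expression"]))
            for r in batch
            if r["expression"] != ""
        ]
        conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
        conn.commit()
        batches += 1
    return batches


# Tiny in-memory stand-in for a file that would really be ~100 GB.
sample = io.StringIO(
    "exon_id,sample_id,expression\n"
    "exon_a,s1,2.5\n"
    "exon_a,s2,\n"
    "exon_b,s1,0.75\n"
    "exon_b,s2,1.25\n"
    "exon_c,s1,4.0\n"
)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression (exon_id TEXT, sample_id TEXT, expression REAL)"
)
n_batches = load_in_batches(sample, conn, batch_size=2)
n_rows = conn.execute("SELECT COUNT(*) FROM expression").fetchone()[0]
```

Dask's `dask.dataframe.read_csv` splits a file into partitions in a similar spirit and can process them in parallel, which is what makes it a better fit than a single-threaded loop at this scale.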

This large custom back end allowed our team to start work on a Python library that gives the computational biology team the specific custom measurements they need without having to learn the details of the database or the PostgreSQL installation behind it. The library was later extended for use in a custom dashboard that compares these metrics as needed. With these pieces complete, we returned to the original CSVs and implemented additional Dask pipelines for more bespoke analyses at larger scale.
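A library of this kind might look roughly like the sketch below, in which callers ask for a metric by name and never write SQL. The class, its methods, the table layout, and the metric (mean expression per exon) are hypothetical illustrations, not the delivered library; sqlite3 again stands in for PostgreSQL.

```python
import sqlite3


class ExpressionClient:
    """Hypothetical facade: analysts call methods and never see the SQL."""

    def __init__(self, conn):
        self._conn = conn

    def mean_expression(self, exon_id):
        """Mean expression of one exon across samples, or None if unknown."""
        row = self._conn.execute(
            "SELECT AVG(expression) FROM expression WHERE exon_id = ?",
            (exon_id,),
        ).fetchone()
        return row[0]

    def compare(self, exon_ids):
        """Mean expression for several exons, e.g. to feed a dashboard."""
        return {exon_id: self.mean_expression(exon_id) for exon_id in exon_ids}


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression (exon_id TEXT, sample_id TEXT, expression REAL)"
)
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [("exon_a", "s1", 2.0), ("exon_a", "s2", 4.0), ("exon_b", "s1", 1.0)],
)

client = ExpressionClient(conn)
summary = client.compare(["exon_a", "exon_b", "exon_missing"])
```

Keeping the SQL behind methods like these means the schema or the database engine can change without touching the analysts' notebooks or the dashboard code.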

Outcome

Our solution allowed the client's science to scale to the level they wanted. The analysis was built into a dashboard that let them evaluate different possible targets for their therapeutics. The ability to load and reload the data made it easy to integrate new datasets and to combine the team's wet-lab results with vendor-bought data.