Problem
An early-stage clinical diagnostics company needed to implement a high-throughput, DNA-based liquid biopsy screening pipeline to detect early-stage cancers. With no existing computing or storage infrastructure, the company faced a wide-open choice of architectures for building a cost-effective, scalable system. The pipeline was expected to process millions of samples per year, and keeping time-to-analysis low was critical.
Solution
Arrayo worked with the client to migrate their legacy data into a Microsoft Azure data lake. Since the group was working toward clinical trials, traceability of every change made to the data was paramount, which led our team to adopt the Delta Lake storage layer offered by Databricks. This gave the system better metadata handling and greater reliability.
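To illustrate the kind of traceability Delta Lake provides, here is a minimal sketch using the Python API for Apache Spark. The table paths and schema are hypothetical, not the client's actual layout; the point is that every write is an ACID transaction recorded in the table's log, and any prior version of the data can be read back.

```python
# Minimal sketch of Delta Lake traceability on Databricks (PySpark).
# Paths and table contents below are illustrative, not the client's schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-traceability").getOrCreate()

# Land per-sample results as a versioned Delta table; each write is an
# ACID transaction appended to the table's transaction log.
results_df = spark.read.parquet("/mnt/datalake/raw/sample_results")
(results_df.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/datalake/delta/sample_results"))

# Audit the full change history of the table: one row per commit,
# with timestamp, operation, and operation parameters.
spark.sql(
    "DESCRIBE HISTORY delta.`/mnt/datalake/delta/sample_results`"
).show(truncate=False)

# "Time travel": reproduce the exact data a prior analysis ran against.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/mnt/datalake/delta/sample_results"))
```

For a group heading into clinical trials, this audit trail means an analysis can always be tied back to the exact version of the data it consumed.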
With data infrastructure in place that could handle their scale, we built out pipelines to perform the required computation. Using the R, Python, and Scala APIs for Apache Spark, the client worked closely with Arrayo's team to ensure the computation ran within acceptable time tolerances; the client's most burdensome pipeline saw a speedup of more than 50x. Alongside the performance work, we built a testing suite to verify that the new distributed pipeline produces the same results as the legacy pipeline when given the same data (a sketch of such a check follows below). Whenever non-concordance is identified, the accumulated test cases ensure that new changes do not have unintended consequences on the pipeline.
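As a sketch of what such a concordance check can look like in PySpark, the snippet below joins legacy and distributed outputs per sample and flags divergences beyond a small numeric tolerance. The paths, the `sample_id` and `score` columns, and the tolerance value are all hypothetical stand-ins for the client's actual outputs.

```python
# Minimal sketch of a concordance test between the legacy and the new
# distributed pipeline. Paths, columns, and tolerance are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("concordance-check").getOrCreate()

legacy = spark.read.parquet("/mnt/datalake/test/legacy_output")
distributed = spark.read.parquet("/mnt/datalake/delta/pipeline_output")

# Full outer join so samples missing from either side are also flagged.
# A small tolerance absorbs floating-point ordering effects that are
# expected when moving from single-node to distributed execution.
TOLERANCE = 1e-9
diffs = (legacy.alias("a")
         .join(distributed.alias("b"), on="sample_id", how="full_outer")
         .where(
             F.col("a.score").isNull()
             | F.col("b.score").isNull()
             | (F.abs(F.col("a.score") - F.col("b.score")) > TOLERANCE)
         ))

n_diffs = diffs.count()
assert n_diffs == 0, f"{n_diffs} non-concordant samples found"
```

Run against historical inputs, a check like this doubles as a regression suite: any future change to the pipeline must reproduce the previously validated results before it ships.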
Outcome
Arrayo delivered a 50x speedup of the existing academic pipeline in a scalable production environment, ensured results concordant with the legacy pipeline used in prior trials, and moved the client onto new data infrastructure that increases transparency.