UCSC has built the Cancer Genomics Hub (CGHub) for the US National Cancer Institute, designed to hold data for all major NCI projects. To date it has served more than 10 petabytes of data to more than 320 research labs. Cancer is exceedingly complex, with thousands of subtypes involving an immense number of different combinations of mutations. The only way we will understand it is to gather DNA data from many thousands of cancer genomes so that we have the statistical power to distinguish recurring combinations of mutations that drive cancer progression from "passenger" mutations that occur by random chance. Currently, with the exception of a few international research projects, most cancer genomics research takes place in research silos, with little opportunity for data sharing. If this trend continues, we will lose an incredible opportunity. Soon cancer genome sequencing will be widespread in clinical practice, making it possible in principle to study as many as a million cancer genomes. For these data to also have an impact on understanding cancer, we must soon begin to move data into a network of compatible global cloud storage and computing systems, and design mechanisms that allow genome and clinical data to be used in research with appropriate patient consent.
The Global Alliance for Genomics and Health was created to address this problem. Our Data Working Group is designing the future of large-scale genomics for cancer and other diseases. This is an opportunity we cannot turn away from, but it involves both social and technical challenges.