Data Resources
Data Consortia
Data consortia and related resources for ML training.
Consortium | Acronym / Affiliation | Website | URL | Description | Pricing / Membership Options | Comments |
---|---|---|---|---|---|---|
Genomics England (100,000 Genomes Project) | Genomics England | https://www.genomicsengland.co.uk | https://www.genomicsengland.co.uk | A UK initiative to sequence 100,000 genomes from NHS patients with rare diseases and cancers to advance research in genomics. Provides access to datasets for approved research purposes. | Membership is through approved research applications; no publicly listed pricing for general access. | |
The Cancer Genome Atlas (TCGA) | TCGA | https://www.cancer.gov/tcga | https://www.cancer.gov/tcga | A comprehensive project mapping genomic changes in over 30 types of cancer, used to improve understanding of cancer biology and treatment. Publicly accessible datasets for research. | Free access to datasets via public repositories (e.g., NCI GDC). | |
All of Us Research Program | All of Us | https://www.joinallofus.org | https://www.joinallofus.org | A U.S. NIH initiative aiming to build one of the largest, most diverse health databases to improve personalized medicine. Participants share health, genomic, and environmental data. | Free access, but requires registration and approval for data access. | |
ELIXIR | ELIXIR | https://www.elixir-europe.org | https://www.elixir-europe.org | A European intergovernmental organization that unites life science resources (tools, data, standards) for open access to drive research. Its network integrates national bioinformatics infrastructure. | Membership options for both academia and industry, with variable fees based on services. | |
Global Alliance for Genomics and Health (GA4GH) | GA4GH | https://www.ga4gh.org | https://www.ga4gh.org | An international effort developing frameworks and standards for sharing genomic and health-related data to accelerate research and medicine globally. | Free to participate; some projects or collaborations may involve fees. | |
Human Cell Atlas (HCA) | HCA | https://www.humancellatlas.org | https://www.humancellatlas.org | A global consortium creating a reference map of all human cells as a basis for understanding human health and disease. Provides access to cutting-edge datasets on cell types. | Free access to data; membership available for collaboration opportunities. | |
The Structural Genomics Consortium (SGC) | SGC | https://www.thesgc.org | https://www.thesgc.org | A public-private partnership supporting the discovery of proteins linked to human disease by generating open-access 3D structures of proteins. | Free to access data and tools; industry participation may involve funding contributions or collaborations. | |
UK Biobank | UK Biobank | https://www.ukbiobank.ac.uk | https://www.ukbiobank.ac.uk | A large-scale biomedical database with genetic, health, and lifestyle information from 500,000 UK participants to advance medical research. | Researchers pay a fee to access data. Costs depend on the data type and scope of the research. | |
ENCODE (Encyclopedia of DNA Elements) | ENCODE | https://www.encodeproject.org | https://www.encodeproject.org | A collaborative project aimed at identifying all functional elements in the human genome. Provides large-scale datasets and analysis tools for understanding gene regulation. | Free public access to data. | |
Genetic Alliance Registry | Genetic Alliance | https://www.geneticalliance.org | https://www.geneticalliance.org | A consortium linking patient data registries to enable health research and drug discovery through shared data from rare disease patient communities. | Membership for patient organizations and research institutions; participation fees vary. | |
Pistoia Alliance | Pistoia Alliance | https://www.pistoiaalliance.org | https://www.pistoiaalliance.org | A global, non-profit alliance focused on lowering barriers to innovation in life sciences R&D. Focus areas include digital transformation, AI, real-world data, and data interoperability. | Membership for organizations; fees vary by membership level and organization size. | May still have $250 individual membership. Offers many free virtual meetings on data topics. |
Foundation for the National Institutes of Health Biomarkers Consortium (FNIH) | Biomarkers Consortium | https://fnih.org/what-we-do/biomarkers-consortium | https://fnih.org/what-we-do/biomarkers-consortium | Public-private partnership aiming to identify and qualify biomarkers for drug development and precision medicine across multiple diseases, particularly cancer and neurology. | Funding provided by partners; no general membership or pricing available publicly. | |
Critical Path Institute | C-Path | https://c-path.org | https://c-path.org | A non-profit accelerating drug development by creating data standards and public databases. Focuses heavily on biomarkers, clinical trial modeling, and regulatory science. | Primarily funded by grants and partnerships with sponsors; no general membership. | |
Biomarker Enterprise to Advance Personalized Medicine (BEAM) | BEAM | https://www.personalizedmedicinecoalition.org | https://www.personalizedmedicinecoalition.org | A collaboration that fosters biomarker discovery, development, and integration into clinical trials and treatments. Focuses on biomarker validation and standardization across the life sciences industry. | Operates under the Personalized Medicine Coalition; membership options vary for industry and academic institutions. | |
ERP Biomarkers | ERP Biomarkers | https://erpbiomarkers.org | https://erpbiomarkers.org | Identifies and validates electrophysiological event-related potential (ERP) biomarkers, particularly for neuropsychiatric conditions, to improve biomarker-based approaches to drug development and diagnostics. | Research institutions and pharmaceutical companies can join as members; fees depend on organization type. | |
Accelerating Therapeutics for Opportunities in Medicine (ATOM) | ATOM | https://www.atomscience.org | https://www.atomscience.org | A public-private partnership using AI and high-performance computing to accelerate drug discovery and development, focusing on speeding the process from target identification to clinical candidate. | Collaborators are generally large institutions, and there is no public pricing; membership requires industry partnerships. | More of a government-funded AI drug discovery platform. |
MELLODDY Consortium | MELLODDY | https://www.melloddy.eu | https://www.melloddy.eu | A European project using AI and machine learning to leverage private datasets from pharmaceutical companies for drug discovery. The focus is on improving predictive modeling for preclinical and clinical success rates. | Funded by the European Union’s Horizon 2020 program and private industry collaborations; no public membership option. | |
AI4Health Consortium | AI4Health | https://www.ai4health.eu | https://www.ai4health.eu | A European initiative promoting the application of AI in healthcare and drug discovery. It integrates data, modeling, and AI solutions to address complex biomedical problems. | Funded by public and private partners; project-based participation, not an open membership consortium. | |
Open Reaction Database | ORD | https://docs.open-reaction-database.org/en/latest/ | https://docs.open-reaction-database.org/en/latest/ | Open-access chemical reaction database supporting machine learning and related efforts in reaction prediction, chemical synthesis planning, and experiment design. | Open source. | Led by Connor Coley (MIT); more details here: https://cen.acs.org/physical-chemistry/computational-chemistry/new-database-machine-learning-research/99/web/2021/11 |
Kaggle | Kaggle | https://www.kaggle.com | https://www.kaggle.com | Site to explore, analyze, and share quality datasets. Includes some synthetic datasets for training. | | |
Google BigQuery | Google Cloud | https://cloud.google.com/bigquery/public-data | https://cloud.google.com/bigquery/public-data | Public datasets made available through the Google Cloud Public Dataset Program. | Open to analyze or integrate into your tools. | |
Google Cloud Life Sciences Public Datasets | Google Cloud | https://cloud.google.com/life-sciences/docs/resources/public-datasets | https://cloud.google.com/life-sciences/docs/resources/public-datasets | A variety of public datasets you can access for free and integrate into your applications. Google hosts these datasets, providing public access (interactive or file access). | | |
NAR Databases | Nucleic Acids Research | https://www.oxfordjournals.org/nar/database/c/ | https://www.oxfordjournals.org/nar/database/c/ | NAR annual database issue listing molecular biology databases by category. Over 1,950 databases collected from 180 papers. | Freely available on the NAR website. | |
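
The TCGA entry in the table notes that its datasets are freely available through the NCI GDC. As a minimal sketch of what programmatic access might look like, the Python snippet below only builds a query URL for the GDC projects endpoint and does not send any request. The helper name `build_tcga_projects_url` is hypothetical, and the filter field (`program.name`) and the returned fields are assumptions that should be checked against the current GDC API documentation before use.

```python
import json
from urllib.parse import urlencode

# Public GDC REST endpoint for project metadata.
GDC_PROJECTS_ENDPOINT = "https://api.gdc.cancer.gov/projects"


def build_tcga_projects_url(size=50):
    """Build (but do not send) a GDC query URL listing TCGA projects.

    Hypothetical helper: the filter field and the requested fields are
    assumptions to verify against the GDC API documentation.
    """
    # GDC filters are JSON objects with an operator and a content block.
    filters = {
        "op": "in",
        "content": {"field": "program.name", "value": ["TCGA"]},
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "project_id,name,primary_site",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_PROJECTS_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL so it can be inspected or pasted into a browser.
    print(build_tcga_projects_url())
```

Keeping URL construction separate from the network call makes the query easy to inspect and log before any data is downloaded. Controlled-access resources in the table (for example UK Biobank or All of Us) require approval and cannot be queried this way.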