{"id": 1119131, "name": "Number of entries in biological sequence databases", "unit": "entries", "createdAt": "2025-10-28T17:01:41.000Z", "updatedAt": "2025-10-28T17:01:41.000Z", "coverage": "", "timespan": "1976-2024", "datasetId": 7255, "shortUnit": "", "columnOrder": 0, "shortName": "entries", "catalogPath": "grapher/biotech/2025-09-09/epoch_database_growth/epoch_database_growth#entries", "descriptionShort": "Biological sequence databases store data such as DNA, RNA, and amino acid sequences and 3D protein structures. This data includes entries from [GenBank](#dod:genbank), [RefSeq](#dod:refseq), [PDB](#dod:protein-data-bank), [UniProtKB/SwissProt](#dod:uniprotkb), as well as predicted protein structures in [AlphaFoldDB](#dod:alphafolddb) and [ESMAtlas](#dod:esmatlas).", "descriptionFromProducer": "#### GenBank\nGenBank \u00ae is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.\n\n#### RefSeq\nThe Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.\n\nRefSeq genomes are copies of selected assembled genomes available in GenBank. 
RefSeq transcript and protein records are generated by several processes including:\n\n- Computation\n  - Eukaryotic Genome Annotation Pipeline\n  - Prokaryotic Genome Annotation Pipeline\n- Manual curation\n- Propagation from annotated genomes that are submitted to members of the International Nucleotide Sequence Database Collaboration (INSDC)\n\n#### Protein Data Bank (PDB)\nRCSB PDB (RCSB.org) is the US data center for the global Protein Data Bank (PDB) archive of 3D structure data for large biological molecules (proteins, DNA, and RNA) essential for research and education in fundamental biology, health, energy, and biotechnology.\n\n#### UniProtKB/Swiss-Prot\nThe UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added.\n\nThe UniProt Knowledgebase consists of two sections: a section containing manually-annotated records with information extracted from literature and curator-evaluated computational analysis (UniProtKB/Swiss-Prot), and a section with computationally analyzed records that await full manual annotation (UniProtKB/TrEMBL).\n\n\n#### AlphaFoldDB\nAlphaFold DB provides open access to over 200 million protein structure predictions to accelerate scientific research.\n\nAlphaFold is an AI system developed by Google DeepMind that predicts a protein\u2019s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment.\n\nGoogle DeepMind and EMBL\u2019s European Bioinformatics Institute (EMBL-EBI) have partnered to create AlphaFold DB to make these predictions freely available to the scientific community. 
The latest database release contains over 200 million entries, providing broad coverage of UniProt (the standard repository of protein sequences and annotations). We provide individual downloads for the human proteome and for the proteomes of 47 other key organisms important in research and global health. We also provide a download for the manually curated subset of UniProt (Swiss-Prot).\n\n#### ESMAtlas\nA protein\u2019s structure, the three-dimensional coordinates of all the atoms in the chain of amino acids, can be a key to understanding its function. This Metagenomic Atlas is the first large-scale view of the structures of metagenomic proteins encompassing hundreds of millions of proteins. To make structure predictions at this scale, a breakthrough in the speed of protein folding was necessary. We developed a new protein structure prediction approach named ESMFold. ESMFold uses the representations from a large language model (ESM2) to generate an accurate structure prediction from the sequence of a protein.", "descriptionProcessing": "We use the data collected by Epoch AI on the growth of key biological sequence databases over time. 
We have added their extraction notes below for reference.\n\nWe show the maximum number of entries reported for each database in a given year.\n\n\n**Extraction notes from Epoch AI**\n- GenBank: Data extracted from [release notes](https://www.ncbi.nlm.nih.gov/genbank/release/) of GenBank.\n- RefSeq: Data extracted from [RefSeq release notes](https://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/archive/).\n- UniProt: Data extracted from UniProt release notes and supplementary data from [UniProt paper](https://academic.oup.com/nar/article/43/D1/D204/2439939#supplementary-data).\n- PDB: Data extracted from [RCSB PDB growth statistics](https://www.rcsb.org/stats/growth/growth-released-structures) webpage.\n- AlphaFoldDB: Data extracted from AlphaFoldDB release notes and [associated paper](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad1011/7337620).\n- ESMAtlas: Data extracted from [ESMAtlas database](https://esmatlas.com/about#download_dataset) information.", "type": "int", "datasetName": "Trends in Biological Sequence Data", "updatePeriodDays": 0, "datasetVersion": "2025-09-09", "nonRedistributable": false, "display": {"unit": "entries", "numDecimalPlaces": 0, "entityAnnotationsMap": "GenBank: DNA and RNA sequences\nRefSeq: Gene records (incl. genomic DNA, transcripts, and proteins)\nProtein Data Bank (PDB): 3D molecule structures (e.g. proteins)\nUniProtKB (Swiss-Prot): Protein sequences\nAlphaFoldDB: Predicted 3D protein structures\nESMAtlas: Predicted 3D protein structures"}, "schemaVersion": 2, "processingLevel": "major", "presentation": {"topicTagsLinks": ["Medicine & Biotechnology"]}, "descriptionKey": ["Biological sequence data includes data such as DNA and RNA sequences, amino acid sequences of proteins, and three-dimensional structures of proteins and other molecules found in living organisms. 
These organisms can be bacteria, viruses, plants, animals and humans.", "Researchers use this data to better understand the biology of organisms, including the functions of genes and proteins and how they interact. This knowledge can then be applied, for example, to develop new drugs and treatments for illnesses or to use biotechnology to improve agriculture and environmental science.", "This dataset provides an overview of the growth of key biological sequence databases over time.", "**GenBank** is a database of RNA and DNA sequences maintained by the National Center for Biotechnology Information (NCBI). Researchers can submit nucleotide sequences, which are then reviewed and annotated by NCBI staff. Researchers are responsible for the scientific accuracy of their submissions.", "**RefSeq** is a curated collection of DNA, RNA, and protein sequences maintained by the NCBI. It provides reference sequences for major research organisms, including humans, model organisms, and pathogens. RefSeq records integrate information on genomic DNA, transcripts, and proteins to provide a complete picture of each gene in each organism.", "**Protein Data Bank (PDB)** is a database of 3D structural data of large biological molecules, such as proteins, nucleic acids, lipids, carbohydrates, and complex assemblies of these molecules. It only includes experimentally validated structures and is managed by the Research Collaboratory for Structural Bioinformatics (RCSB), which reviews each submission for quality and accuracy before adding it to the database.", "**UniProtKB/Swiss-Prot** is a database of protein sequence and functional information. This includes information on both the protein sequence and structure, as well as its effects and interactions in an organism. 
The Swiss-Prot section is manually curated and contains only experimentally verified information.", "**AlphaFoldDB** is a database of predicted 3D structures of proteins generated by the AlphaFold AI model, which uses deep learning to predict protein structures based on their amino acid sequences. While structures are generally not experimentally validated, when they are, predictions have been shown to be highly accurate.", "**ESMAtlas** is a database of predicted 3D structures of proteins generated by the ESM AI model, which predicts protein structures based on their amino acid sequences. While structures are generally not experimentally validated, when they are, predictions have been shown to be highly accurate.", "All databases listed here are freely accessible to the public and are widely used by researchers, educators, and students worldwide for various purposes, including scientific research, drug discovery, and education."], "dimensions": {"years": {"values": [{"id": 1976}, {"id": 1977}, {"id": 1978}, {"id": 1979}, {"id": 1980}, {"id": 1981}, {"id": 1982}, {"id": 1983}, {"id": 1984}, {"id": 1985}, {"id": 1986}, {"id": 1987}, {"id": 1988}, {"id": 1989}, {"id": 1990}, {"id": 1991}, {"id": 1992}, {"id": 1993}, {"id": 1994}, {"id": 1995}, {"id": 1996}, {"id": 1997}, {"id": 1998}, {"id": 1999}, {"id": 2000}, {"id": 2001}, {"id": 2002}, {"id": 2003}, {"id": 2004}, {"id": 2005}, {"id": 2006}, {"id": 2007}, {"id": 2008}, {"id": 2009}, {"id": 2010}, {"id": 2011}, {"id": 2012}, {"id": 2013}, {"id": 2014}, {"id": 2015}, {"id": 2016}, {"id": 2017}, {"id": 2018}, {"id": 2019}, {"id": 2020}, {"id": 2021}, {"id": 2022}, {"id": 2023}, {"id": 2024}]}, "entities": {"values": [{"id": 372252, "name": "Protein Data Bank (PDB)", "code": null}, {"id": 372254, "name": "GenBank", "code": null}, {"id": 372255, "name": "RefSeq", "code": null}, {"id": 372250, "name": "UniProtKB (Swiss-Prot)", "code": null}, {"id": 372251, "name": "AlphaFoldDB", "code": null}, {"id": 372253, "name": 
"ESMAtlas", "code": null}]}}, "origins": [{"id": 8932, "title": "Trends in Biological Sequence Data", "description": "Growth of key biological sequence databases between January 1976 and January 2024.\n\nBiological sequence data used to train biological sequence models is provided by a vast array of public databases compiled by government, academic, and private institutions. Epoch delineates major sources into three primary categories:\n\n- DNA sequence databases. These have the highest growth rate of analyzed databases, with GenBank seeing a 31% increase in the number of sequences stored between 2022 and 2023. Whole genome shotgun sequencing studies have been the driving force of growth of DNA data, as the increase in number of entries in all other GenBank divisions, referred to as traditional entries, is greatly attenuated in comparison.\n\n- Protein sequence databases. The level of detail in protein sequence databases can vary. Databases with rich annotations such as UniProtKB have a much slower growth rate (6.7%), compared to metagenomic databases such as MGnify (20%), which provide protein sequences but lack detailed information about the protein\u2019s structure, function, and origin.\n\n- Protein structure databases. Gathering experimental data on protein structures is slow and painstaking. Thus, the Protein Data Bank grows by only 6.5% per year. Instead, databases publishing protein structures predicted by AI models can quickly generate large volumes of synthetic data. Databases of synthetic data such as AlphaFoldDB and ESMAtlas have dramatically boosted the supply of available data, though their growth could slow as opportunities for synthetic data are exhausted.\n\nThe majority of entries in large biological databases such as the International Nucleotide Sequence Database Collaboration (INSDC), MGnify, UniProtKB and PDB pertain to cellular organisms (humans, animals, plants, fungi, yeast, bacteria). 
For example, UniProtKB entries comprise 97% cellular and 2% viral protein sequences, a subset of which are known pathogens.", "producer": "Epoch AI", "citationFull": "Nicole Maug, Aidan O'Gara and Tamay Besiroglu (2024), \"Biological Sequence Models in the Context of the AI Directives\". Published online at epoch.ai. Retrieved from: 'https://epoch.ai/blog/biological-sequence-models-in-the-context-of-the-ai-directives' [online resource]", "attributionShort": "Epoch AI", "urlMain": "https://epoch.ai/blog/biological-sequence-models-in-the-context-of-the-ai-directives", "urlDownload": "https://docs.google.com/spreadsheets/d/10L0LF47eoWfYSdIfiDLEIDp3EqJt72WA_Wm2L1VjrqI/edit?gid=316233161#gid=316233161", "dateAccessed": "2025-09-09", "datePublished": "2024-01-17", "license": {"url": "https://epoch.ai/data", "name": "CC BY 4.0"}}]}