Data Sources

ProteoCosmos is built on open data from multiple sources, including STRING, the Human Protein Atlas, UniProtKB, PAN-GO, and AlphaFold DB. Details on these data sources, the licenses under which they are used, and any processing applied to the data are provided below.

Please note that ProteoCosmos and BK SciViz are not sponsored or endorsed by any of the following authors, organizations, or licensors.

The software for running the ProteoCosmos app and all imagery contained within it were developed by Brad Krajina at BKSciViz.

Data Source Summary

We summarize the relevant data sources here and provide more detail on publications and data processing in the sections that follow.

STRING
Website: string-db.org
Version: v12.0
Date Downloaded: 2025/11/19
Licensed under: CC BY 4.0

The Human Protein Atlas (Subcellular Location)
Website: proteinatlas.org
Version: Version 25.0
Date Downloaded: 2026/02/09
Licensed under: CC BY-SA 4.0

AlphaFold Database
Website: alphafold.ebi.ac.uk/
Version: v6
Date Downloaded: 2026/02/23
Licensed under: CC-BY-4.0

PAN-GO (Gene Ontology Consortium and PANTHER)
Website: functionome.geneontology.org/
Date Downloaded: 2026/02/27
Licensed under: CC-BY-4.0

Gene Ontology
Website: https://geneontology.org/
Date Downloaded: 2026/02/27
Licensed under: CC-BY-4.0

UniProt
Website: uniprot.org
Date Downloaded: 2026/02/20
Licensed under: CC-BY-4.0

Ensembl
Website: ensembl.org
Version: 115
Date Downloaded: 2026/02/27
Licensed under: Unrestricted

STRING Database

All human protein-protein interaction data was derived from data downloaded from STRING:

Source: STRING
Website: string-db.org
Version: v12.0
Date Downloaded: 2025/11/19
Licensed under: CC BY 4.0

Relevant Publications for STRING

Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, Gable AL, Fang T, Doncheva NT, Pyysalo S, Bork P‡, Jensen LJ‡, von Mering C‡. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023 Jan 6;51(D1):D638-D646. PubMed

Complete publications list on STRING website

Processing of STRING data

The complete collection of protein-protein interactions for the human proteome (NCBI taxonomy ID 9606) was downloaded directly from STRING. The interactions were filtered to retain only high-confidence interactions (confidence score >= 700). Confidence weights were then normalized to a maximum value of 1, and clustering was performed using the Leiden algorithm (resolution 0.75). The robustness of the clustering was assessed by computing the Adjusted Rand Index and the Variation of Information between repeated clustering runs with different random seeds.
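As a rough illustration of the filtering, normalization, and robustness steps described above, here is a minimal standard-library sketch. It is not the ProteoCosmos pipeline itself: the function names and the `(protein_a, protein_b, score)` edge layout are hypothetical, normalization is assumed to mean dividing by the largest retained score, and the Leiden clustering step is left out (it would normally come from a graph library). Only the Adjusted Rand Index is shown; Variation of Information would be computed from the same pair of label lists.

```python
from collections import Counter
from math import comb


def filter_and_normalize(edges, min_score=700):
    """Keep edges scoring >= min_score and rescale weights so the maximum is 1.

    `edges` is a list of (protein_a, protein_b, score) tuples (assumed layout).
    """
    kept = [(a, b, s) for a, b, s in edges if s >= min_score]
    if not kept:
        return []
    top = max(s for _, _, s in kept)  # assumption: normalize by the largest kept score
    return [(a, b, s / top) for a, b, s in kept]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items grouped together in both clusterings, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        return 1.0
    return (together_both - expected) / (max_index - expected)
```

In practice, `adjusted_rand_index` would be applied to the cluster labels produced by two Leiden runs with different seeds; values near 1 indicate that the two runs agree.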

In the ProteoCosmos visualization, at the initial “cluster-level” view, each dot represents one of the clusters identified using the above method. The edges between the dots that appear on hover reflect the total edge weight between clusters of the induced graph.
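The cluster-level edge weights can be thought of as a sum over all protein-level edges that cross between two clusters. The sketch below shows one way to compute this; the function name and the `membership` dictionary (protein to cluster label) are hypothetical and are not taken from the ProteoCosmos code.

```python
from collections import defaultdict


def cluster_edge_weights(edges, membership):
    """Sum protein-level edge weights between each pair of distinct clusters.

    `edges` is a list of (protein_a, protein_b, weight) tuples and
    `membership` maps each protein to its cluster label (assumed layout).
    """
    totals = defaultdict(float)
    for a, b, weight in edges:
        ca, cb = membership[a], membership[b]
        if ca == cb:
            continue  # within-cluster edges do not contribute to cluster-level edges
        key = (ca, cb) if ca < cb else (cb, ca)  # undirected: order the pair
        totals[key] += weight
    return dict(totals)
```

Within-cluster edges are skipped here because they are only shown after a cluster is expanded.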

When a cluster is expanded to show the individual protein view, only protein-protein interactions within that cluster are shown. Currently, between-cluster interactions are only shown at the cluster level. Work is ongoing to incorporate a protein-level view of both within-cluster and between-cluster interactions.

The Human Protein Atlas

Data Sources and Citations for The Human Protein Atlas

The subcellular localization visualization was generated using subcellular location annotation data from The Human Protein Atlas:

Source: The Human Protein Atlas (Subcellular Location)
Website: proteinatlas.org
Version: Version 25.0
Date Downloaded: 2026/02/09
Licensed under: CC BY-SA 4.0

Specifically, the tab-separated subcellular location annotations file was downloaded directly from the Human Protein Atlas website at: link

Relevant Publications for Human Protein Atlas Subcellular Location Data

Thul PJ et al., A subcellular map of the human proteome. Science. (2017) PubMed: 28495876 DOI: 10.1126/science.aal3321

Processing of Human Protein Atlas data

The annotations from the Human Protein Atlas were mapped onto a curated selection of summary terms by manual review of the original terms (summary terms: Nucleus, Cytoplasm, Mitochondria, Endoplasmic Reticulum, Golgi Apparatus, Plasma Membrane, Vesicles, Primary Cilium, Secretory, Sperm). For each protein identified by a STRING Ensembl peptide ID, the corresponding Human Protein Atlas terms were identified by mapping the Ensembl peptide ID to the Ensembl gene ID and gene name (Ensembl version 115, BioMart). The normalized enrichment of each term was then computed as the fraction of proteins in a cluster annotated with that term, divided by the fraction of proteins annotated with that term across the entire proteome.
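The normalized enrichment described above is a ratio of two fractions. A minimal sketch follows, assuming a hypothetical `terms_by_protein` dictionary that maps every protein in the proteome to its set of summary terms. Returning 0.0 when a term never occurs in the proteome is a choice made for this example only.

```python
def normalized_enrichment(cluster_proteins, terms_by_protein, term):
    """Fraction of cluster proteins with `term`, divided by the proteome-wide fraction.

    `terms_by_protein` maps each protein in the proteome to a set of summary
    terms (assumed layout); proteins missing from it count as unannotated.
    """
    if not cluster_proteins or not terms_by_protein:
        return 0.0
    background = sum(term in terms for terms in terms_by_protein.values()) / len(terms_by_protein)
    if background == 0:
        return 0.0  # term absent from the proteome: enrichment is undefined, report 0
    in_cluster = sum(term in terms_by_protein.get(p, set()) for p in cluster_proteins)
    return (in_cluster / len(cluster_proteins)) / background
```

A value above 1 means the term is over-represented in the cluster relative to the proteome, and a value below 1 means it is under-represented.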

AlphaFold Database

All protein structure visualizations in ProteoCosmos were generated using protein structure prediction data (.pdb and .cif) from AlphaFold DB:

Source: AlphaFold Database
Website: alphafold.ebi.ac.uk/
Version: v6
Date Downloaded: 2026/02/23
Licensed under: CC-BY-4.0

Relevant publications for AlphaFold DB structure

Fleming J. et al. AlphaFold Protein Structure Database and 3D-Beacons: New Data and Capabilities. Journal of Molecular Biology, (2025) link

Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021) link

Kim, RS et al. BFVD—a large repository of predicted viral protein structures. NAR (2024) link

Hunt M, Lima L, Anderson D, Bouras G, Hall M, Hawkey J, Schwengers O, Shen W, Lees JA, Iqbal Z. bioRxiv (2025) link

Wheeler RJ. A resource for improved predictions of Trypanosoma and Leishmania protein three-dimensional structure. PLoS One (2021) link

Litvin, U et al. Viro3D: a comprehensive database of virus protein structure predictions. Mol Syst Biol (2025) link

Processing of AlphaFold DB structure data

Protein structure predictions were downloaded as .cif and .pdb files directly from the AlphaFold DB website. The structures were centered on their centroid using Biotite (Python) and imported as 3D structures into Blender using the Molecular Nodes add-on (Brady Johnston) for rendering. Each structure was scaled based on its bounding box to fit within a preset camera view and rendered as a 2D image. The final rendered images were post-processed to adjust contrast and optimize for compression.

Each structure image shown in the app is the final 2D image produced by the above rendering pipeline.
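To make the centering and bounding-box scaling steps concrete, here is a small standard-library sketch that operates on a plain list of atom coordinates. It does not use Biotite or Blender and is not the rendering pipeline itself; the function name, the `target_size` parameter, and the choice to scale by the longest bounding-box side are assumptions made for illustration.

```python
def center_and_scale(coords, target_size=1.0):
    """Translate coordinates so their centroid is at the origin, then scale
    uniformly so the longest bounding-box side equals `target_size`.

    `coords` is a list of (x, y, z) tuples (assumed layout).
    """
    if not coords:
        return []
    n = len(coords)
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    centered = [tuple(c[i] - centroid[i] for i in range(3)) for c in coords]
    # Longest side of the axis-aligned bounding box.
    extent = max(
        max(c[i] for c in centered) - min(c[i] for c in centered) for i in range(3)
    )
    scale = target_size / extent if extent > 0 else 1.0
    return [tuple(v * scale for v in c) for c in centered]
```

Uniform scaling preserves the shape of the structure, so proteins of very different sizes fill a similar portion of the frame.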

PAN-GO and Gene Ontology

Gene-ontology terms were obtained from the Phylogenetic Annotation using Gene Ontology project (PAN-GO). The complete annotations were downloaded directly as a tsv file.

Source: PAN-GO (Gene Ontology Consortium and PANTHER)
Website: functionome.geneontology.org/
Date Downloaded: 2026/02/27
Licensed under: CC-BY-4.0

Additionally, the go-basic gene ontology terms were used to identify the namespace (cellular component, biological process, molecular function) of PAN-GO terms. The go-basic ontology was downloaded directly as a .obo file.

Source: Gene Ontology
Website: https://geneontology.org/
Date Downloaded: 2026/02/27
Licensed under: CC-BY-4.0
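The namespace lookup can be sketched as a small parser over the .obo file that records the `namespace:` tag of each `[Term]` stanza. This is a simplified illustration rather than the code used for ProteoCosmos: the function name is hypothetical, and it assumes each stanza lists its `id:` line before its `namespace:` line, as go-basic.obo does.

```python
def parse_obo_namespaces(lines):
    """Map each GO term ID to its namespace from the lines of an OBO file.

    Only [Term] stanzas are read; other stanzas such as [Typedef] are skipped.
    """
    namespaces = {}
    in_term = False
    term_id = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; remember whether it is a term.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
        elif in_term and term_id and line.startswith("namespace: "):
            namespaces[term_id] = line[len("namespace: "):]
    return namespaces
```

Given the resulting dictionary, the namespace of a PAN-GO annotation is a direct lookup by GO ID, for example `namespaces.get("GO:0005634")`.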

Relevant Publications for PAN-GO, Gene Ontology, and PANTHER

Ashburner et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000 May;25(1):25-9. DOI: 10.1038/75556

The Gene Ontology Consortium. The Gene Ontology knowledgebase in 2026. Nucleic Acids Res. 2025 Dec 18;gkaf1292. DOI: 10.1093/nar/gkaf1292

Feuermann et al., A compendium of human gene functions derived from evolutionary modelling, Nature 2025. link

Thomas PD, Ebert D, Muruganujan A, Mushayahama T, Albou LP, Mi H. PANTHER: Making genome-scale phylogenetics accessible to all. Protein Science. 2022;31(1):8-22. doi:10.1002/pro.421 link

UniProt Knowledgebase

The UniProt Knowledgebase was used to derive mappings between identifiers (gene symbol, protein name) that were needed to join data from different resources. The flat .txt file of reviewed entries was downloaded directly from the UniProtKB website:

Source: UniProt
Website: uniprot.org
Date Downloaded: 2026/02/20
Licensed under: CC-BY-4.0

Relevant publications for UniProtKB

The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Res. 53:D609–D617 (2025)

Ensembl

Ensembl data was used to derive mappings between peptide IDs and gene IDs/symbols, which were used to link STRING data to other data sources. These tables were obtained via BioMart:

Source: Ensembl
Website: ensembl.org
Version: 115
Date Downloaded: 2026/02/27
Licensed under: Unrestricted
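The identifier mapping described above amounts to a table lookup keyed on the Ensembl peptide ID. The sketch below assumes a tab-separated BioMart export with the column headers "Gene stable ID", "Protein stable ID", and "Gene name", and assumes STRING protein identifiers carry a taxon prefix such as "9606." in front of the peptide ID. The function names are hypothetical and the actual ProteoCosmos join may differ.

```python
import csv
import io


def load_peptide_map(tsv_text):
    """Build {peptide ID: (gene ID, gene name)} from a BioMart TSV export.

    The column headers used here are assumptions about the export layout.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["Protein stable ID"]: (row["Gene stable ID"], row["Gene name"])
        for row in reader
        if row["Protein stable ID"]  # skip transcripts without a peptide
    }


def string_id_to_gene(string_id, peptide_map):
    """Strip the taxon prefix from a STRING ID and look up its gene record.

    Returns None when the peptide ID is not present in the map.
    """
    peptide_id = string_id.split(".", 1)[-1]
    return peptide_map.get(peptide_id)
```

Peptide IDs that return `None` would need to be handled separately, for example by logging them or by falling back to another identifier source.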

Relevant Publications for Ensembl

Sarah C Dyer, Andrew D Yates, et al. Ensembl 2025. Nucleic Acids Res. 2025, 53(D1):D948–D957. PMID: 39656687. DOI: 10.1093/nar/gkae1071