Abstract
Cell lines are essential tools for studying biological mechanisms, advancing pre-clinical drug discovery and supporting biologics production. To further research in these fields, we introduce the Cell Lines CoCoPUTs (Codon and Codon Pair Usage Tables, https://dnahive.fda.gov/hivecuts/cell-lines/), a comprehensive resource of transcriptomic-weighted codon and codon-pair usages for 1866 unique cell lines derived from two cancer databases, Catalogue of Somatic Mutations in Cancer (COSMIC) and Cancer Cell Line Encyclopedia (CCLE), and the Human Protein Atlas (HPA) database. Despite differences in the number of cell lines in each database and in the platforms used for the analysis (microarray vs RNA-Seq), codon usage distributions were broadly similar for all overlapping cell lines across the three databases. Application of unsupervised machine learning approaches, including hierarchical and spectral clustering, to 1355 cell lines of non-metastatic origin yielded more distinct clusters when based on codon-pair usage than on codon usage. However, distance-based comparisons indicated that codon usage often yields equal or smaller within-group distances than codon-pair usage and that cell lines are, on average, closer to other cell lines from the same site of origin than to those with the same disease phenotype.
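To make the idea of transcriptomic-weighted usage concrete, the following is a minimal sketch only, not the authors' pipeline. It assumes that each transcript's codon (or adjacent codon-pair) counts are multiplied by that transcript's expression level and then normalized to relative frequencies; the abstract does not state the exact weighting scheme. The function names, the toy sequences and the expression values are all hypothetical.

```python
from collections import Counter


def codons(cds):
    """Split a coding sequence into in-frame codons, dropping any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def weighted_usage(transcripts, expression, pairs=False):
    """Expression-weighted relative frequencies of codons, or of adjacent codon pairs.

    Assumed weighting (not taken from the paper): count * expression level,
    summed over transcripts, then divided by the grand total.
    """
    totals = Counter()
    for transcript_id, cds in transcripts.items():
        weight = expression.get(transcript_id, 0.0)
        if weight <= 0:
            continue  # unexpressed transcripts contribute nothing
        units = codons(cds.upper())
        if pairs:
            # Overlapping pairs of neighbouring codons within one transcript
            units = [a + b for a, b in zip(units, units[1:])]
        for unit, count in Counter(units).items():
            totals[unit] += weight * count
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {unit: value / grand_total for unit, value in totals.items()}


# Toy, made-up data: two transcripts with arbitrary expression values
transcripts = {"tx1": "ATGGCCGCCTAA", "tx2": "ATGAAATAA"}
expression = {"tx1": 3.0, "tx2": 1.0}

codon_freq = weighted_usage(transcripts, expression)
pair_freq = weighted_usage(transcripts, expression, pairs=True)
```

With these toy inputs, the highly expressed `tx1` dominates the table, which is the intended effect of weighting by expression rather than counting each transcript once.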
| Original language | English |
|---|---|
| Article number | 169718 |
| Journal | Journal of Molecular Biology |
| DOIs | |
| State | Accepted/In press - Jan 1 2026 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Keywords
- Cell Lines CoCoPUTs
- codon usage
- codon-pair usage
- dinucleotide usage
- transcriptomic weighted cell line data