Data
- Phylogenomics
- Skimming and Microbiome
- Branch Length
- Alignment
- Simulators, Cladogenesis, HIV, etc.
- Obsolete
Phylogenomics
ASTRAL related
weighted ASTRAL
Zhang, C., & Mirarab, S. (2022). Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees. Molecular Biology and Evolution, 39(12), msac215. 10.1093/molbev/msac215.
ASTRAL-Pro
- Zhang, C., Scornavacca, C., Molloy, E. K., & Mirarab, S. (2020). ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Molecular Biology and Evolution, 37(11), 3292–3307. 10.1093/molbev/msaa139
- Zhang, C., & Mirarab, S. (2022). ASTRAL-Pro 2: Ultrafast species tree reconstruction from multi-copy gene family trees. Bioinformatics. 10.1093/bioinformatics/btac620
- Available on GitHub
SODA
Rabiee, Maryam, and Siavash Mirarab. “SODA: Multi-Locus Species Delimitation Using Quartet Frequencies.” Edited by Yann Ponty. Bioinformatics 36, no. 24 (2021): 5623–31. https://doi.org/10.1093/bioinformatics/btaa1010.
- Some of the data available on https://github.com/maryamrabiee/SODA/tree/master/data
ASTRAL-constrained
Rabiee, Maryam, and Siavash Mirarab. “Forcing External Constraints on Tree Inference Using ASTRAL.” BMC Genomics 21, no. S2 (2020): 218. doi:10.1186/s12864-020-6607-z.
ASTRAL-MP
Yin, J., Zhang, C., & Mirarab, S. (2019). ASTRAL-MP: Scaling ASTRAL to very large datasets using randomization and parallelization. Bioinformatics, 35(20). doi.org/10.1093/bioinformatics/btz211.
- SV 10-1000: These gene trees can be found in the ASTRAL-II subsection below.
- Avian: This is the dataset with 48 species and 14446 gene trees with varying contractions. The gene trees, the species trees generated by ASTRAL, and the ASTRAL output can be found on Dryad.
- Insects: This is the dataset with 144 species and 149278 gene trees. The gene trees can be found in the Dryad repository of a different project.
- Simulated 10K: This is the dataset with 10K species, simulated using SimPhy (20 replicates).
ASTRAL-Multiind
Rabiee, Maryam, Erfan Sayyari, and Siavash Mirarab. “Multi-Allele Species Reconstruction Using ASTRAL.” Molecular Phylogenetics and Evolution 130 (2019): 286–96. 10.1016/j.ympev.2018.10.033.
INSTRAL
Rabiee, Maryam, and Siavash Mirarab. “INSTRAL: Discordance-Aware Phylogenetic Placement Using Quartet Scores.” Systematic Biology, July 10, 2019. doi: 10.1093/sysbio/syz045.
- Data are available on Dryad
Polytomy test
Sayyari, E., & Mirarab, S. (2018). Testing for Polytomies in Phylogenetic Species Trees Using Quartet Frequencies. Genes, 9(3), 132. doi: 10.3390/genes9030132.
ASTRAL-III
Zhang, Chao, Maryam Rabiee, Erfan Sayyari, and Siavash Mirarab. “ASTRAL-III: Polynomial Time Species Tree Reconstruction from Partially Resolved Gene Trees.” BMC Bioinformatics 19, no. S6 (2018): 153. https://doi.org/10.1186/s12859-018-2129-y.
- Data are available on GitLab
LocalPP
Sayyari, E., & Mirarab, S. (2016). Fast Coalescent-Based Computation of Local Branch Support from Quartet Frequencies. Molecular Biology and Evolution, 33(7), 1654–1668. https://doi.org/10.1093/molbev/msw079
- Available on Dryad: https://doi.org/doi:10.5061/dryad.wstqjq2tk
ASTRAL-HGT study
Davidson, Ruth, Pranjal Vachaspati, Siavash Mirarab, and Tandy Warnow. “Phylogenomic Species Tree Estimation in the Presence of Incomplete Lineage Sorting and Horizontal Gene Transfer.” BMC Genomics 16, no. Suppl 10 (2015): S1. 10.1186/1471-2164-16-S10-S1.
- Data available from the Illinois Data Bank.
ASTRAL-II
Mirarab, S., & Warnow, T. (2015). ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics, 31(12), i44–i52. 10.1093/bioinformatics/btv234
- Simulation scripts are available on GitHub.
- All the simulation and real data files are available on Dryad.
ASTRAL-I
Mirarab, Siavash, Rezwana Reaz, Md. Shamsuzzoha Bayzid, Théo Zimmermann, M. S. Swenson, and Tandy Warnow. “ASTRAL: Genome-Scale Coalescent-Based Species Tree Estimation.” Bioinformatics 30, no. 17 (2014): i541–48. doi:10.1093/bioinformatics/btu462.
- Available on Dryad: https://doi.org/doi:10.5061/dryad.ht76hdrp0.
Binning related
Statistical Binning
Mirarab, Siavash, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow. “Statistical Binning Enables an Accurate Coalescent-Based Estimation of the Avian Tree.” Science 346, no. 6215 (2014): 1250463–1250463. doi: 10.1126/science.1250463
- The statistical binning dataset is permanently provided at UIUC under doi:10.13012/C5MW2F2P
- Some files are missing from that link and are provided on this GitHub page. The README.pdf file shown on that GitHub page gives the details.
Weighted Statistical Binning
Bayzid, Md. Shamsuzzoha, Siavash Mirarab, Bastien Boussau, and Tandy Warnow. “Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses.” PLoS ONE 10, no. 6 (January 18, 2015): e0129183. doi:10.1371/journal.pone.0129183.
- Most of the datasets used in this study are available through the prior publication (statistical binning).
- The new datasets generated for this study are available on figshare, with DOI: 10.6084/m9.figshare.1411146.
- A more detailed README file is available here
- The weighted statistical binning software is available on GitHub.
Response to comment on statistical binning
Mirarab, Siavash, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow. “Response to Comment on ‘Statistical Binning Enables an Accurate Coalescent-Based Estimation of the Avian Tree.’” Science 350, no. 6257 (2015): 171. doi:10.1126/science.aaa7719.
- Data are made available on GitHub; see the binning-response directory there.
BBCA
Zimmermann, Théo, Siavash Mirarab, and Tandy Warnow. “BBCA: Improving the Scalability of *BEAST Using Random Binning.” BMC Genomics 15, no. Suppl 6 (October 17, 2014): S11. doi:10.1186/1471-2164-15-S6-S11.
- Data are made available on GitHub; see the bbca directory there.
Misc methods
uDance
Metin Balaban et al., “Generation of Accurate, Expandable Phylogenomic Trees with uDance,” Nature Biotechnology Online (2023), https://doi.org/10.1038/s41587-023-01868-8.
- Data are available from the Harvard repository https://doi.org/10.7910/DVN/BCUM6P
- See also https://github.com/balabanmetin/uDANCE-data
QuCo
Maryam Rabiee and Siavash Mirarab, “QuCo: Quartet-Based Co-Estimation of Species Trees and Gene Trees,” Bioinformatics 38, no. Supplement_1 (2022): i413–21, https://doi.org/10.1093/bioinformatics/btac265.
- See https://gitlab.com/mrabiee/quo-data
- Most of the data are on Dryad: https://datadryad.org/stash/dataset/doi:10.6076/D1CP4R
TripVote
Mai, Uyen, and Siavash Mirarab. “Completing Gene Trees without Species Trees in Sub-Quadratic Time.” Edited by Russell Schwartz. Bioinformatics 38, no. 6 (2022): 1532–41. https://doi.org/10.1093/bioinformatics/btab875.
- See https://uym2.github.io/tripVote/
- The actual data are on Dryad
TAPER
Zhang, Chao, Yiming Zhao, Edward Louis Braun, and Siavash Mirarab. “TAPER: Pinpointing Errors in Multiple Sequence Alignments despite Varying Rates of Evolution.” Methods in Ecology and Evolution 12, no. 11 (2021), 2145-58. doi:10.1111/2041-210X.13696.
DISTIQUE
Sayyari, Erfan, and Siavash Mirarab. “Anchoring Quartet-Based Phylogenetic Distances and Applications to Species Tree Reconstruction.” BMC Genomics 17, no. S10 (2016): 101–13. doi: 10.1186/s12864-016-3098-z.
- http://esayyari.github.io/DISTIQUE
- Some of the files linked there may be lost, unfortunately.
Empirical datasets
1KP-pilot
Wickett, Norman J., Siavash Mirarab, Nam Nguyen, Tandy Warnow, Eric J Carpenter, Naim Matasci, Saravanaraj Ayyampalayam, Michael S. Barker, J. Gordon Burleigh, Matthew A. Gitzendanner, Brad R. Ruhfel, Eric Wafula, Joshua P. Der, Sean W. Graham, Sarah Mathews, Michael Melkonian, Douglas E. Soltis, Pamela S. Soltis, Nicholas W. Miles, Carl J. Rothfels, Lisa Pokorny, A. Jonathan Shaw, Lisa DeGironimo, Dennis W. Stevenson, Barbara Surek, Juan Carlos Villarreal, Béatrice Roure, Hervé Philippe, Claude W. DePamphilis, Tao Chen, Michael K. Deyholos, Regina S. Baucom, Toni M. Kutchan, Megan M. Augustin, Jian Jun Wang, Yong Zhang, Zhijian Tian, Zhixiang Yan, Xiaolei Wu, Xiao Sun, Gane Ka-Shu Wong, and James Jim Leebens-Mack. “Phylotranscriptomic Analysis of the Origin and Early Diversification of Land Plants.” Proceedings of the National Academy of Sciences 111, no. 45 (October 29, 2014): E4859-4868. https://doi.org/10.1073/pnas.1323926111.
1KP
One Thousand Plant Transcriptomes Initiative (a co-first author). “One Thousand Plant Transcriptomes and the Phylogenomics of Green Plants.” Nature, 2019. doi:10.1038/s41586-019-1693-2.
- This paper has many datasets, but the most relevant part to my work is on https://github.com/smirarab/1kp
More is needed!
Ekin Tilic, Erfan Sayyari, Josefin Stiller, Siavash Mirarab, and Greg W Rouse. “More Is Needed—Thousands of Loci Are Required to Elucidate the Relationships of the ‘Flowers of the Sea’ (Sabellida, Annelida).” Molecular Phylogenetics and Evolution 151 (2020): 106892. 10.1016/j.ympev.2020.106892.
10K Bacteria
Zhu, Q., Mai, U., Pfeiffer, W., Janssen, S., Asnicar, F., Sanders, J. G., Belda-Ferre, P., Al-Ghalith, G. A., Kopylova, E., McDonald, D., Kosciolek, T., Yin, J. B., Huang, S., Salam, N., Jiao, J., Wu, Z., Xu, Z. Z., Cantrell, K., Yang, Y., … Knight, R. (2019). Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nature Communications, 10(1), 5477. doi: 10.1038/s41467-019-13443-4
- The 10k bacteria dataset is available here
Avian Indel
Houde, Peter, Edward L Braun, Nitish Narula, Uriel Minjares, and Siavash Mirarab. “Phylogenetic Signal of Indels and the Neoavian Radiation.” Diversity 11, no. 7 (July 6, 2019): 108. https://doi.org/10.3390/d11070108.
Ruminants
Chen, Lei, Qiang Qiu, Yu Jiang, Kun Wang, Zeshan Lin, Zhipeng Li, Faysal Bibi, et al. “Large-Scale Ruminant Genome Sequencing Provides Insights into Their Evolution and Distinct Traits.” Science 364, no. 6446 (2019): eaav6202. 10.1126/science.aav6202.
Fragmentary data
Erfan Sayyari, James B Whitfield, and Siavash Mirarab, “Fragmentary Gene Sequences Negatively Impact Gene Tree and Species Tree Reconstruction,” Molecular Biology and Evolution 34, no. 12 (2017): 3279–91, https://doi.org/10.1093/molbev/msx261.
- See this page
- Most of the data are on Dryad: doi:10.6076/D14599
Avian
Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, et al. Whole genome analyses resolve the early branches in the tree of life of modern birds. Science. 2014;346(6215):1320–31. doi: http://doi.org/10.1126/science.1253451.
See also:
- Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho SY, Faircloth BC, Nabholz B, Howard JT, Suh A, Weber CC, da Fonseca RR, Alfaro-Núñez A, Narula N, Liu L, Burt D, Ellegren H, Edwards SV, Stamatakis A, Mindell DP, Cracraft J, Braun EL, Warnow T, Jun W, Gilbert MT, Zhang G; Avian Phylogenomics Consortium. Phylogenomic analyses data of the avian phylogenomics project. Gigascience. 2015 Feb 12;4:4. doi: 10.1186/s13742-014-0038-1. PMID: 25741440; PMCID: PMC4349222.
- Data available at http://dx.doi.org/10.5524/101041
Skimming and Microbiome
Phylogenetic placement
h-DEPP
Jiang, Yueyu, Puoya Tabaghi, and Siavash Mirarab. “Learning Hyperbolic Embedding for Phylogenetic Tree Placement and Updates.” Biology 11, no. 9 (2022): 1256. https://doi.org/10.3390/biology11091256.
DEPP
Jiang, Yueyu, Metin Balaban, Qiyun Zhu, and Siavash Mirarab. “DEPP: Deep Learning Enables Extending Species Trees Using Single Genes.” Edited by Claudia Solis-Lemus. Systematic Biology 72, no. 1 (2023): 17–34. https://doi.org/10.1093/sysbio/syac031.
APPLES bootstrapping
Hasan, Navid Bin, Metin Balaban, Avijit Biswas, Md. Shamsuzzoha Bayzid, and Siavash Mirarab. “Distance-Based Phylogenetic Placement with Statistical Support.” Biology 11, no. 8 (August 12, 2022): 1212. https://doi.org/10.3390/biology11081212.
APPLES-II
Balaban, Metin, Yueyu Jiang, Daniel Roush, Qiyun Zhu, and Siavash Mirarab. “Fast and Accurate Distance‐based Phylogenetic Placement Using Divide and Conquer.” Molecular Ecology Resources 22, no. 3 (2022): 1213–27. https://doi.org/10.1111/1755-0998.13527.
- Available on Zenodo
MISA
Balaban, Metin, and Siavash Mirarab. “Phylogenetic Double Placement of Mixed Samples.” Bioinformatics 36, no. Supplement_1 (July 1, 2020): i335–43. doi: 10.1093/bioinformatics/btaa489.
APPLES-I
Balaban, Metin, Shahab Sarmashghi, and Siavash Mirarab. “APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments.” Edited by David Posada. Systematic Biology 69, no. 3 (2020): 566–78. https://doi.org/10.1093/sysbio/syz063.
- Available on Dryad
SEPP
Mirarab, Siavash, Nam Nguyen, and Tandy Warnow. “SEPP: SATé-enabled phylogenetic placement.” Biocomputing 2012. 2012. 247-258.
- Available on https://doi.org/10.13012/B2IDB-9316702_V1
Skmer (distance calculation)
Skmer Support values
Rachtman, Eleonora, Shahab Sarmashghi, Vineet Bafna, and Siavash Mirarab. “Quantifying the Uncertainty of Assembly-Free Genome-Wide Distance Estimates and Phylogenetic Relationships Using Subsampling.” Cell Systems 13, no. 10 (2022): 817-829.e3. https://doi.org/10.1016/j.cels.2022.06.007.
- This paper analyzes existing, publicly available data. All original studies are referenced in the main text. Accession numbers for the datasets are available in this paper’s supplemental information.
Tk4 model
Balaban, Metin, Nishat A Bristy, Ahnaf Faisal, Md. Shamsuzzoha Bayzid, and Siavash Mirarab. “Genome-Wide Alignment-Free Phylogenetic Distance Estimation under a No Strand-Bias Model.” Edited by Thomas Lengauer. Bioinformatics Advances 2, no. 1 (2022): vbac055. https://doi.org/10.1093/bioadv/vbac055.
- Available on Zenodo
Skmer
Sarmashghi, Shahab, Kristine Bohmann, M. Thomas P. Gilbert, Vineet Bafna, and Siavash Mirarab. “Skmer: Assembly-Free and Alignment-Free Sample Identification Using Genome Skims.” Genome Biology 20, no. 1 (2019): 34. https://doi.org/10.1186/s13059-019-1632-4.
- See https://shahab-sarmashghi.github.io/Skmer/
- The actual data are on https://skmer.ucsd.edu/data/skmer
Taxonomic classification and Contamination removal
Impact of contamination
Rachtman, Eleonora, Metin Balaban, Vineet Bafna, and Siavash Mirarab. “The Impact of Contaminants on the Accuracy of Genome Skimming and the Effectiveness of Exclusion Read Filters.” Molecular Ecology Resources 20, no. 3 (May 4, 2020): 1755-0998.13135. https://doi.org/10.1111/1755-0998.13135.
- Available on Dryad https://datadryad.org/stash/dataset/doi:10.6076/D1FG62
KRANK
Şapcı, A.O.B. and Mirarab, S. (2024) “Memory-bound k-mer selection for large and evolutionary diverse reference libraries”, Genome Research, p. gr.279339.124. Available at: https://doi.org/10.1101/gr.279339.124.
- All data available on Dryad https://datadryad.org/stash/dataset/doi:10.5061/dryad.0000000c2.
- A catalog of reference libraries is available at https://ter-trees.ucsd.edu/data/krank.
- Simulated reads used for read classification experiments can be found here.
- Smaller files (including results, and misc data) are available on GitHub.
- All releases of the software and its documentation can be found on GitHub.
CONSULT-II
Şapcı, A.O.B., Rachtman, E. and Mirarab, S. (2024) “CONSULT-II: Accurate taxonomic identification and profiling using locality-sensitive hashing”, Bioinformatics, p. btae150. Available at: https://doi.org/10.1093/bioinformatics/btae150.
- CONSULT-II reference libraries (built using WoL-v1 archaeal and bacterial genomes): 140Gb (high-sensitivity), 32Gb (robust and lightweight), 18Gb (lightweight).
- Simulated reads used for classification experiments: reads from bacterial genomes and reads from archaeal genomes.
- Smaller files (including results, scripts and taxonomy files) are available on GitHub.
- All releases of the software and its documentation can be found on GitHub.
CONSULT
Rachtman, Eleonora, Vineet Bafna, and Siavash Mirarab. “CONSULT: Accurate Contamination Removal Using Locality-Sensitive Hashing.” NAR Genomics and Bioinformatics 3, no. 3 (June 23, 2021): 10.1101/2021.03.18.436035. https://doi.org/10.1093/nargab/lqab071.
- Data available on Dryad: https://doi.org/doi:10.6076/D19P44
- Some more on https://skmer.ucsd.edu/data/consult/
TIPP
Nguyen, Nam-phuong; Mirarab, Siavash; Liu, Bo; Pop, Mihai; Warnow, Tandy. “TIPP: taxonomic identification and phylogenetic profiling.” Bioinformatics 30.24 (2014): 3548-3555.
- Available at UIUC https://doi.org/10.13012/B2IDB-8783447_V1
Branch Length
Dating and unit conversion
wLogDate
Uyen Mai, Siavash Mirarab, Log Transformation Improves Dating of Phylogenies, Molecular Biology and Evolution, msaa222, https://doi.org/10.1093/molbev/msaa222.
CASTLES
Tabatabaee, Yasamin, Chao Zhang, Tandy Warnow, and Siavash Mirarab. “Phylogenomic Branch Length Estimation Using Quartets.” Bioinformatics 39, no. Supplement_1 (2023): i185–93. https://doi.org/10.1093/bioinformatics/btad221.
Rooting, Clustering, Error detection, etc.
MinVarRooting
Mai, Uyen, Erfan Sayyari, and Siavash Mirarab. “Minimum Variance Rooting of Phylogenetic Trees and Implications for Species Tree Reconstruction.” Edited by Gabriel Moreno-Hagelsieb. PLOS ONE 12, no. 8 (2017): e0182238. https://doi.org/10.1371/journal.pone.0182238.
- For data, see https://uym2.github.io/MinVar-Rooting/
TreeShrink
Mai, Uyen, and Siavash Mirarab. “TreeShrink: Fast and Accurate Detection of Outlier Long Branches in Collections of Phylogenetic Trees.” BMC Genomics 19, no. S5 (May 8, 2018): 272. doi: 10.1186/s12864-018-4620-2.
- For data, see https://uym2.github.io/TreeShrink/
- Many of the larger datasets are on Dryad
TreeCluster
Balaban, Metin, Niema Moshiri, Uyen Mai, Xingfan Jia, and Siavash Mirarab. “TreeCluster: Clustering Biological Sequences Using Phylogenetic Trees.” Edited by Serdar Bozdag. PLOS ONE 14, no. 8 (2019): e0221068. doi:10.1371/journal.pone.0221068.
- Data deposited on Zenodo
Alignment
FastSP
Mirarab, Siavash, and Tandy Warnow. “FastSP: Linear Time Calculation of Alignment Accuracy.” Bioinformatics 27, no. 23 (2011): 3250–58. https://doi.org/10.1093/bioinformatics/btr553.
- Data available on https://github.com/smirarab/fastsp-data
UPP
Nguyen, Nam-phuong, Siavash Mirarab, Keerthana Kumar, and Tandy Warnow. “Ultra-Large Alignments Using Phylogeny-Aware Profiles.” Genome Biology 16, no. 1 (2015): 124. https://doi.org/10.1186/s13059-015-0688-z.
- Available at UIUC: https://doi.org/10.13012/B2IDB-3174395_V1
- See also https://doi.org/10.13012/B2IDB-1614388_V1 and https://databank.illinois.edu/datasets/IDB-5139418
PASTA
Mirarab, Siavash, Nam Nguyen, Sheng Guo, Li-San Wang, Junhyong Kim, and Tandy Warnow. “PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences.” Journal of Computational Biology 22, no. 5 (2015): 377–86. https://doi.org/10.1089/cmb.2014.0156.
- Available at UIUC: https://doi.org/10.13012/B2IDB-3174395_V1
- Note that this is the same repo as UPP
- See also https://doi.org/10.13012/B2IDB-1614388_V1 and https://databank.illinois.edu/datasets/IDB-5139418
Simulators, Cladogenesis, HIV, etc.
DIMSIM
Zhang, Chao, Andrey V. Bzikadze, Yana Safonova, and Siavash Mirarab. “A Scalable Model for Simulating Multi-Round Antibody Evolution and Benchmarking of Clonal Tree Reconstruction Methods.” Frontiers in Immunology 13 (2022): 1014439. https://doi.org/10.3389/fimmu.2022.1014439.
DualBirth model
Moshiri and Mirarab, Dual-birth: A Two-State Model of Tree Evolution and Its Applications to Alu Retrotransposition. 10.1093/sysbio/syx088
- Data on Dryad at https://doi.org/10.5061/dryad.13n52
ProACT
Moshiri, Niema, Davey M Smith, and Siavash Mirarab. “HIV Care Prioritization Using Phylogenetic Branch Length.” JAIDS Journal of Acquired Immune Deficiency Syndromes (2020): 2019.12.20.885202. doi:10.1097/QAI.000000000000261.
Obsolete
This dataset webpage used to host our datasets. We have made an effort to make everything available elsewhere. I provide the link here just in case it becomes useful.
Here are some old datasets from Tandy’s group:
- SuperFine, DACTAL, BeeTLe: https://doi.org/10.13012/B2IDB-2952208_V1