Phylogenomics

weighted ASTRAL

Zhang, C., & Mirarab, S. (2022). Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees. Molecular Biology and Evolution, 39(12), msac215. 10.1093/molbev/msac215.

  • Small files, including scripts: GitHub
  • Larger files: Dryad

ASTRAL-Pro

  • Zhang, C., Scornavacca, C., Molloy, E. K., & Mirarab, S. (2020). ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Molecular Biology and Evolution, 37(11), 3292–3307. 10.1093/molbev/msaa139
    • Small files, including scripts: GitHub
    • Larger files: Dryad
  • Zhang, C., & Mirarab, S. (2022). ASTRAL-Pro 2: Ultrafast species tree reconstruction from multi-copy gene family trees. Bioinformatics. 10.1093/bioinformatics/btac620

SODA

Rabiee, Maryam, and Siavash Mirarab. “SODA: Multi-Locus Species Delimitation Using Quartet Frequencies.” Edited by Yann Ponty. Bioinformatics 36, no. 24 (2021): 5623–31. https://doi.org/10.1093/bioinformatics/btaa1010.

ASTRAL-constrained

Rabiee, Maryam, and Siavash Mirarab. “Forcing External Constraints on Tree Inference Using ASTRAL.” BMC Genomics 21, no. S2 (2020): 218. doi:10.1186/s12864-020-6607-z.

ASTRAL-MP

Yin, J., Zhang, C., & Mirarab, S. (2019). ASTRAL-MP: Scaling ASTRAL to very large datasets using randomization and parallelization. Bioinformatics, 35(20). doi.org/10.1093/bioinformatics/btz211.

  • SV 10-1000: These gene trees can be found in the ASTRAL-II subsection below.

  • Avian: This is the dataset with 48 species and 14446 gene trees with varying contractions. The gene trees, the species trees generated by ASTRAL, and the ASTRAL output can be found on Dryad.

  • Insects: This is the dataset with 144 species and 149278 gene trees. The gene trees can be found in the Dryad repository of a different project.

  • Simulated 10K: This dataset has 10K species and was simulated using SimPhy (20 replicates).

    • All the data can be found on Dryad

    • Scripts to generate the data and to analyze it are also found on GitHub

ASTRAL-Multiind

Rabiee, Maryam, Erfan Sayyari, and Siavash Mirarab. “Multi-Allele Species Reconstruction Using ASTRAL.” Molecular Phylogenetics and Evolution 130 (2019): 286–96. 10.1016/j.ympev.2018.10.033.

  • The smaller datasets, including scripts, are on GitHub
  • The larger datasets are on Dryad

INSTRAL

Rabiee, Maryam, and Siavash Mirarab. “INSTRAL: Discordance-Aware Phylogenetic Placement Using Quartet Scores.” Systematic Biology, July 10, 2019. doi: 10.1093/sysbio/syz045.

  • Data are available on Dryad

Polytomy test

Sayyari, E., & Mirarab, S. (2018). Testing for Polytomies in Phylogenetic Species Trees Using Quartet Frequencies. Genes, 9(3), 132. doi: 10.3390/genes9030132.

ASTRAL-III

Zhang, Chao, Maryam Rabiee, Erfan Sayyari, and Siavash Mirarab. “ASTRAL-III: Polynomial Time Species Tree Reconstruction from Partially Resolved Gene Trees.” BMC Bioinformatics 19, no. S6 (2018): 153. https://doi.org/10.1186/s12859-018-2129-y.

  • Data are available on GitLab

LocalPP

Sayyari, E., & Mirarab, S. (2016). Fast Coalescent-Based Computation of Local Branch Support from Quartet Frequencies. Molecular Biology and Evolution, 33(7), 1654–1668. https://doi.org/10.1093/molbev/msw079

ASTRAL-HGT study

Davidson, Ruth, Pranjal Vachaspati, Siavash Mirarab, and Tandy Warnow. “Phylogenomic Species Tree Estimation in the Presence of Incomplete Lineage Sorting and Horizontal Gene Transfer.” BMC Genomics 16, no. Suppl 10 (2015): S1. 10.1186/1471-2164-16-S10-S1.

ASTRAL-II

Mirarab, S., & Warnow, T. (2015). ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics, 31(12), i44–i52. 10.1093/bioinformatics/btv234

  • Simulation scripts are available on GitHub

  • All the simulation and real data files are available on Dryad.

ASTRAL-I

Mirarab, Siavash, Rezwana Reaz, Md. Shamsuzzoha Bayzid, Théo Zimmermann, M. S. Swenson, and Tandy Warnow. “ASTRAL: Genome-Scale Coalescent-Based Species Tree Estimation.” Bioinformatics 30, no. 17 (2014): i541–48. doi:10.1093/bioinformatics/btu462.

Statistical Binning

Mirarab, Siavash, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow. “Statistical Binning Enables an Accurate Coalescent-Based Estimation of the Avian Tree.” Science 346, no. 6215 (2014): 1250463–1250463. doi: 10.1126/science.1250463

  • The statistical binning dataset is permanently provided at UIUC under doi:10.13012/C5MW2F2P
  • Some files are missing from that link and are provided on this GitHub page. The README.pdf file shown on that GitHub page gives the details.

Weighted Statistical Binning

Bayzid, Md. Shamsuzzoha, Siavash Mirarab, Bastien Boussau, and Tandy Warnow. “Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses.” PLoS ONE 10, no. 6 (January 18, 2015): e0129183. doi:10.1371/journal.pone.0129183.

  • Most of the datasets used in this study are available through the prior publication (statistical binning).

  • The new datasets generated for this study are available on figshare, with DOI: 10.6084/m9.figshare.1411146.
    • A more detailed README file is available here
  • The weighted statistical binning software is available on GitHub.

Response to comment on statistical binning

Mirarab, Siavash, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow. “Response to Comment on ‘Statistical Binning Enables an Accurate Coalescent-Based Estimation of the Avian Tree.’” Science 350, no. 6257 (2015): 171. doi:10.1126/science.aaa7719.

  • Data are made available on GitHub; see the binning-response directory there.

BBCA

Zimmermann, Théo, Siavash Mirarab, and Tandy Warnow. “BBCA: Improving the Scalability of *BEAST Using Random Binning.” BMC Genomics 15, no. Suppl 6 (October 17, 2014): S11. doi:10.1186/1471-2164-15-S6-S11.

  • Data are made available on GitHub; see the bbca directory there.

Misc methods

uDance

Metin Balaban et al., “Generation of Accurate, Expandable Phylogenomic Trees with uDance,” Nature Biotechnology Online (2023), https://doi.org/10.1038/s41587-023-01868-8.

QuCo

Maryam Rabiee and Siavash Mirarab, “QuCo: Quartet-Based Co-Estimation of Species Trees and Gene Trees,” Bioinformatics 38, no. Supplement_1 (2022): i413–21, https://doi.org/10.1093/bioinformatics/btac265.

TripVote

Mai, Uyen, and Siavash Mirarab. “Completing Gene Trees without Species Trees in Sub-Quadratic Time.” Edited by Russell Schwartz. Bioinformatics 38, no. 6 (2022): 1532–41. https://doi.org/10.1093/bioinformatics/btab875.

TAPER

Zhang, Chao, Yiming Zhao, Edward Louis Braun, and Siavash Mirarab. “TAPER: Pinpointing Errors in Multiple Sequence Alignments despite Varying Rates of Evolution.” Methods in Ecology and Evolution 12, no. 11 (2021), 2145-58. doi:10.1111/2041-210X.13696.

DISTIQUE

Sayyari, Erfan, and Siavash Mirarab. “Anchoring Quartet-Based Phylogenetic Distances and Applications to Species Tree Reconstruction.” BMC Genomics 17, no. S10 (2016): 101–13. doi: 10.1186/s12864-016-3098-z.

Empirical datasets

1KP-pilot

Wickett, Norman J., Siavash Mirarab, Nam Nguyen, Tandy Warnow, Eric J Carpenter, Naim Matasci, Saravanaraj Ayyampalayam, Michael S. Barker, J. Gordon Burleigh, Matthew A. Gitzendanner, Brad R. Ruhfel, Eric Wafula, Joshua P. Der, Sean W. Graham, Sarah Mathews, Michael Melkonian, Douglas E. Soltis, Pamela S. Soltis, Nicholas W. Miles, Carl J. Rothfels, Lisa Pokorny, A. Jonathan Shaw, Lisa DeGironimo, Dennis W. Stevenson, Barbara Surek, Juan Carlos Villarreal, Béatrice Roure, Hervé Philippe, Claude W. DePamphilis, Tao Chen, Michael K. Deyholos, Regina S. Baucom, Toni M. Kutchan, Megan M. Augustin, Jian Jun Wang, Yong Zhang, Zhijian Tian, Zhixiang Yan, Xiaolei Wu, Xiao Sun, Gane Ka-Shu Wong, and James Jim Leebens-Mack. “Phylotranscriptomic Analysis of the Origin and Early Diversification of Land Plants.” Proceedings of the National Academy of Sciences 111, no. 45 (October 29, 2014): E4859-4868. https://doi.org/10.1073/pnas.1323926111.

1KP

One Thousand Plant Transcriptomes Initiative (a co-first author). “One Thousand Plant Transcriptomes and the Phylogenomics of Green Plants.” Nature, 2019. doi:10.1038/s41586-019-1693-2.

More is needed!

Ekin Tilic, Erfan Sayyari, Josefin Stiller, Siavash Mirarab, and Greg W Rouse. “More Is Needed—Thousands of Loci Are Required to Elucidate the Relationships of the ‘Flowers of the Sea’ (Sabellida, Annelida).” Molecular Phylogenetics and Evolution 151 (2020): 106892. 10.1016/j.ympev.2020.106892.

10K Bacteria

Zhu, Q., Mai, U., Pfeiffer, W., Janssen, S., Asnicar, F., Sanders, J. G., Belda-Ferre, P., Al-Ghalith, G. A., Kopylova, E., McDonald, D., Kosciolek, T., Yin, J. B., Huang, S., Salam, N., Jiao, J., Wu, Z., Xu, Z. Z., Cantrell, K., Yang, Y., … Knight, R. (2019). Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nature Communications, 10(1), 5477. doi: 10.1038/s41467-019-13443-4

  • The 10k bacteria dataset is available here

Avian Indel

Houde, Peter, Edward L Braun, Nitish Narula, Uriel Minjares, and Siavash Mirarab. “Phylogenetic Signal of Indels and the Neoavian Radiation.” Diversity 11, no. 7 (July 6, 2019): 108. https://doi.org/10.3390/d11070108.

Ruminants

Chen, Lei, Qiang Qiu, Yu Jiang, Kun Wang, Zeshan Lin, Zhipeng Li, Faysal Bibi, et al. “Large-Scale Ruminant Genome Sequencing Provides Insights into Their Evolution and Distinct Traits.” Science 364, no. 6446 (2019): eaav6202. 10.1126/science.aav6202.

Fragmentary data

Erfan Sayyari, James B Whitfield, and Siavash Mirarab, “Fragmentary Gene Sequences Negatively Impact Gene Tree and Species Tree Reconstruction,” Molecular Biology and Evolution 34, no. 12 (2017): 3279–91, https://doi.org/10.1093/molbev/msx261.

Avian

Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, et al. Whole genome analyses resolve the early branches in the tree of life of modern birds. Science. 2014;346(6215):1320–31. doi: http://doi.org/10.1126/science.1253451.

See also:

  • Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho SY, Faircloth BC, Nabholz B, Howard JT, Suh A, Weber CC, da Fonseca RR, Alfaro-Núñez A, Narula N, Liu L, Burt D, Ellegren H, Edwards SV, Stamatakis A, Mindell DP, Cracraft J, Braun EL, Warnow T, Jun W, Gilbert MT, Zhang G; Avian Phylogenomics Consortium. Phylogenomic analyses data of the avian phylogenomics project. Gigascience. 2015 Feb 12;4:4. doi: 10.1186/s13742-014-0038-1. PMID: 25741440; PMCID: PMC4349222.

  • Data available at http://dx.doi.org/10.5524/101041

Skimming and Microbiome

Phylogenetic placement

h-DEPP

Jiang, Yueyu, Puoya Tabaghi, and Siavash Mirarab. “Learning Hyperbolic Embedding for Phylogenetic Tree Placement and Updates.” Biology 11, no. 9 (2022): 1256. https://doi.org/10.3390/biology11091256.

DEPP

Jiang, Yueyu, Metin Balaban, Qiyun Zhu, and Siavash Mirarab. “DEPP: Deep Learning Enables Extending Species Trees Using Single Genes.” Edited by Claudia Solis-Lemus. Systematic Biology 72, no. 1 (2023): 17–34. https://doi.org/10.1093/sysbio/syac031.

APPLES bootstrapping

Hasan, Navid Bin, Metin Balaban, Avijit Biswas, Md. Shamsuzzoha Bayzid, and Siavash Mirarab. “Distance-Based Phylogenetic Placement with Statistical Support.” Biology 11, no. 8 (August 12, 2022): 1212. https://doi.org/10.3390/biology11081212.

APPLES-II

Balaban, Metin, Yueyu Jiang, Daniel Roush, Qiyun Zhu, and Siavash Mirarab. “Fast and Accurate Distance‐based Phylogenetic Placement Using Divide and Conquer.” Molecular Ecology Resources 22, no. 3 (2022): 1213–27. https://doi.org/10.1111/1755-0998.13527.

MISA

Balaban, Metin, and Siavash Mirarab. “Phylogenetic Double Placement of Mixed Samples.” Bioinformatics 36, no. Supplement_1 (July 1, 2020): i335–43. doi: 10.1093/bioinformatics/btaa489.

  • Smaller files on GitHub
  • Large files are given on Dryad

APPLES-I

Balaban, Metin, Shahab Sarmashghi, and Siavash Mirarab. “APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments.” Edited by David Posada. Systematic Biology 69, no. 3 (2020): 566–78. https://doi.org/10.1093/sysbio/syz063.

SEPP

Mirarab, Siavash, Nam Nguyen, and Tandy Warnow. “SEPP: SATé-enabled phylogenetic placement.” Biocomputing 2012. 2012. 247-258.

Skmer (distance calculation)

Skmer Support values

Rachtman, Eleonora, Shahab Sarmashghi, Vineet Bafna, and Siavash Mirarab. “Quantifying the Uncertainty of Assembly-Free Genome-Wide Distance Estimates and Phylogenetic Relationships Using Subsampling.” Cell Systems 13, no. 10 (2022): 817-829.e3. https://doi.org/10.1016/j.cels.2022.06.007.

  • This paper analyzes existing, publicly available data. All original studies are referenced in the main text. Accession numbers for the datasets are available in this paper’s supplemental information.

Tk4 model

Balaban, Metin, Nishat A Bristy, Ahnaf Faisal, Md. Shamsuzzoha Bayzid, and Siavash Mirarab. “Genome-Wide Alignment-Free Phylogenetic Distance Estimation under a No Strand-Bias Model.” Edited by Thomas Lengauer. Bioinformatics Advances 2, no. 1 (2022): vbac055. https://doi.org/10.1093/bioadv/vbac055.

Skmer

Sarmashghi, Shahab, Kristine Bohmann, M. Thomas P. Gilbert, Vineet Bafna, and Siavash Mirarab. “Skmer: Assembly-Free and Alignment-Free Sample Identification Using Genome Skims.” Genome Biology 20, no. 1 (2019): 34. https://doi.org/10.1186/s13059-019-1632-4.

Taxonomic classification and Contamination removal

Impact of contamination

Rachtman, Eleonora, Metin Balaban, Vineet Bafna, and Siavash Mirarab. “The Impact of Contaminants on the Accuracy of Genome Skimming and the Effectiveness of Exclusion Read Filters.” Molecular Ecology Resources 20, no. 3 (May 4, 2020): 1755-0998.13135. https://doi.org/10.1111/1755-0998.13135.

KRANK

Şapcı, A.O.B. and Mirarab, S. (2024) “Memory-bound k-mer selection for large and evolutionary diverse reference libraries”, Genome Research, p. gr.279339.124. Available at: https://doi.org/10.1101/gr.279339.124.

CONSULT-II

Şapcı, A.O.B., Rachtman, E. and Mirarab, S. (2024) “CONSULT-II: Accurate taxonomic identification and profiling using locality-sensitive hashing”, Bioinformatics, p. btae150. Available at: https://doi.org/10.1093/bioinformatics/btae150.

CONSULT

Rachtman, Eleonora, Vineet Bafna, and Siavash Mirarab. “CONSULT: Accurate Contamination Removal Using Locality-Sensitive Hashing.” NAR Genomics and Bioinformatics 3, no. 3 (June 23, 2021): 10.1101/2021.03.18.436035. https://doi.org/10.1093/nargab/lqab071.

TIPP

Nguyen, Nam-phuong; Mirarab, Siavash; Bo, Liu; Pop, Mihai; Warnow, Tandy (2014). “TIPP: taxonomic identification and phylogenetic profiling.” Bioinformatics 30.24 (2014): 3548-3555.

Branch Length

Dating and unit conversion

wLogDate

Uyen Mai, Siavash Mirarab, Log Transformation Improves Dating of Phylogenies, Molecular Biology and Evolution, msaa222, https://doi.org/10.1093/molbev/msaa222.

CASTLES

Tabatabaee, Yasamin, Chao Zhang, Tandy Warnow, and Siavash Mirarab. “Phylogenomic Branch Length Estimation Using Quartets.” Bioinformatics 39, no. Supplement_1 (2023): i185–93. https://doi.org/10.1093/bioinformatics/btad221.

Rooting, Clustering, Error detection, etc.

MinVarRooting

Mai, Uyen, Erfan Sayyari, and Siavash Mirarab. “Minimum Variance Rooting of Phylogenetic Trees and Implications for Species Tree Reconstruction.” Edited by Gabriel Moreno-Hagelsieb. PLOS ONE 12, no. 8 (2017): e0182238. https://doi.org/10.1371/journal.pone.0182238.

TreeShrink

Mai, Uyen, and Siavash Mirarab. “TreeShrink: Fast and Accurate Detection of Outlier Long Branches in Collections of Phylogenetic Trees.” BMC Genomics 19, no. S5 (May 8, 2018): 272. doi: 10.1186/s12864-018-4620-2.

TreeCluster

Balaban, Metin, Niema Moshiri, Uyen Mai, Xingfan Jia, and Siavash Mirarab. “TreeCluster: Clustering Biological Sequences Using Phylogenetic Trees.” Edited by Serdar Bozdag. PLOS ONE 14, no. 8 (2019): e0221068. doi:10.1371/journal.pone.0221068.

Alignment

FastSP

Mirarab, Siavash, and Tandy Warnow. “FastSP: Linear Time Calculation of Alignment Accuracy.” Bioinformatics 27, no. 23 (2011): 3250–58. https://doi.org/10.1093/bioinformatics/btr553.

UPP

Nguyen, Nam-phuong, Siavash Mirarab, Keerthana Kumar, and Tandy Warnow. “Ultra-Large Alignments Using Phylogeny-Aware Profiles.” Genome Biology 16, no. 1 (2015): 124. https://doi.org/10.1186/s13059-015-0688-z.

PASTA

Mirarab, Siavash, Nam Nguyen, Sheng Guo, Li-San Wang, Junhyong Kim, and Tandy Warnow. “PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences.” Journal of Computational Biology 22, no. 5 (2015): 377–86. https://doi.org/10.1089/cmb.2014.0156.

Simulators, Cladogenesis, HIV, etc.

DIMSIM

Zhang, Chao, Andrey V. Bzikadze, Yana Safonova, and Siavash Mirarab. “A Scalable Model for Simulating Multi-Round Antibody Evolution and Benchmarking of Clonal Tree Reconstruction Methods.” Frontiers in Immunology 13 (2022): 1014439. https://doi.org/10.3389/fimmu.2022.1014439.

DualBirth model

Moshiri and Mirarab, Dual-birth: A Two-State Model of Tree Evolution and Its Applications to Alu Retrotransposition. 10.1093/sysbio/syx088

ProACT

Moshiri, Niema, Davey M Smith, and Siavash Mirarab. “HIV Care Prioritization Using Phylogenetic Branch Length.” JAIDS Journal of Acquired Immune Deficiency Syndromes (2020): 2019.12.20.885202. doi:10.1097/QAI.000000000000261.

Obsolete

This dataset webpage used to host our datasets. We have made an effort to make everything available elsewhere. I provide the link here just in case it becomes useful.

Here are some old datasets from Tandy’s group: