Year 2020

From QiuLab
Revision as of 18:20, 28 February 2020 by imported>Weigang

Algorithm & tools for Bb plasmid nomenclature

  1. Reference: email exchanges with Sherwood on Feb 27-28, 2020
  2. To reduce manual curation and judgement calls, we likely need to automate plasmid calling with the following algorithm (which should work for the majority of cases):
    1. Identify PFam32 genes using BLAST or HMMER
    2. Build a NJ tree with sequences from a PFam32 database
    3. Calculate some kind of group consistency score at each clade level
    4. Identify the presence/absence of a cluster with the other 3 partition genes
    5. Assign plasmid names (single names for most, a few composite names)
  3. Modification (species-tree/gene-tree reconciliation): Your nicely illustrated tree reminds me that a more rational and formal way (than a %diff cutoff) to delineate orthologous (same name) and paralogous (different names) PFam32 groups would be the so-called “species tree / gene tree reconciliation” algorithm. This algorithm identifies each branch on the gene tree (i.e., your tree) as due to either “duplication” (creating paralogs, long branches) or “speciation” (creating orthologs, short branches). We would then assign a new plasmid name to each major/ancestral duplication branch (not counting recent duplications within a single species or strain). By this algorithm, the lp56 group (node indicated by green lines) is valid, since no genome appears more than once among its descendants. Likewise, all nodes indicated by blue lines are valid, regardless of the level of sequence difference (e.g., VA1, cp26). By the same criterion, we made an overcall on lp28-9 and lp28-1 (orange node), which should be a single paralogous group, since no genome appears more than once among its descendants. The somewhat deep divergence simply suggests fast evolution.
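
The "no genome appears more than once among its descendants" criterion can be sketched in a few lines of Python. This is a minimal illustration of a species-overlap heuristic, a common simplification that needs no species tree. It is not a full species-tree/gene-tree reconciliation and not necessarily the procedure we will end up using. The nested-tuple tree, the `genome|gene` leaf labels, and the gene identifiers (`x1`, `y1`, ...) are all made up for the example.

```python
from collections import Counter

# Assumption: a gene tree is a nested tuple and every leaf is a string
# "genome|gene". Gene identifiers below are hypothetical.


def leaves(node):
    """Return all leaf labels under a node."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node:
        out.extend(leaves(child))
    return out


def genome_of(leaf):
    """Extract the genome (strain) name from a 'genome|gene' label."""
    return leaf.split("|", 1)[0]


def repeated_genomes(node):
    """Genomes that occur more than once among the descendants of a node."""
    counts = Counter(genome_of(leaf) for leaf in leaves(node))
    return sorted(g for g, n in counts.items() if n > 1)


def is_single_copy_clade(node):
    """True if no genome appears more than once (candidate same-name group)."""
    return not repeated_genomes(node)


def label_node(node):
    """Label an internal node 'duplication' if any two child subtrees
    share a genome, otherwise 'speciation' (species-overlap heuristic)."""
    if isinstance(node, str):
        return "leaf"
    seen = set()
    for child in node:
        genomes = {genome_of(leaf) for leaf in leaves(child)}
        if seen & genomes:
            return "duplication"
        seen |= genomes
    return "speciation"


# Toy example: two clades, each with one copy per genome, joined at the root.
clade_a = (("B31|x1", "N40|x2"), "JD1|x3")
clade_b = (("B31|y1", "N40|y2"), "JD1|y3")
root = (clade_a, clade_b)

print(label_node(clade_a), is_single_copy_clade(clade_a))
print(label_node(root), repeated_genomes(root))
```

In this toy tree each clade passes the single-copy check and would keep one plasmid name, while the root is labeled a duplication and would separate two names. Note that the heuristic alone does not distinguish ancestral duplications from recent within-strain ones; that filter would need to be added separately.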