Documentation of SHOT - Shared Ortholog and Gene Order Tree Reconstruction Tool


SHOT is a web server for the reconstruction of genome phylogenies. Compared with phylogenies derived from single genes, genome phylogenies are less affected by horizontal gene transfer, unrecognized paralogy, highly variable rates of gene evolution, or misalignment (Snel et al., 1999; Huynen et al., 1999; Fitz-Gibbon and House, 1999). SHOT generates distance-based phylogenies using two independent strategies. Gene content phylogenies are based on the approach introduced by Snel et al. (1999). Alternatively, SHOT can generate phylogenies of prokaryotes based on the conservation of gene order across the whole genome.


SHOT

For SHOT gene content phylogenies, the similarity between two species is defined as the ratio of the number of shared orthologs to a normalization value that accounts for differences in genome size. For gene order phylogenies, distances are derived from the number of conserved orthologous gene pairs. Here, a conserved gene pair is a pair of genes that are adjacent in one genome and whose orthologs are adjacent in the other genome, with conserved transcriptional directions.
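The definition of a conserved gene pair can be illustrated with a minimal sketch. It is not SHOT's implementation. It assumes each genome is given as a list of (gene, strand) tuples in chromosomal order, treats the chromosome as linear (no wrap-around for circular genomes), and interprets "conserved transcriptional directions" as: both strands unchanged if the pair appears in the same order, both strands flipped if the pair appears inverted. The function name and data layout are hypothetical.

```python
def count_conserved_pairs(genome_a, genome_b, orthologs):
    """Count adjacent gene pairs of genome_a whose orthologs are adjacent
    in genome_b with consistent transcriptional directions.

    genome_a, genome_b: lists of (gene_id, strand), strand is +1 or -1,
                        in chromosomal order (assumed linear).
    orthologs: dict mapping a gene of genome_a to its ortholog in genome_b.
    """
    position_b = {gene: i for i, (gene, _) in enumerate(genome_b)}
    strand_b = dict(genome_b)
    count = 0
    # Walk over every pair of neighbouring genes in genome A.
    for (g1, s1), (g2, s2) in zip(genome_a, genome_a[1:]):
        h1, h2 = orthologs.get(g1), orthologs.get(g2)
        if h1 not in position_b or h2 not in position_b:
            continue  # at least one gene has no ortholog in genome B
        step = position_b[h2] - position_b[h1]
        if step == 1:
            # Same order in B: strands must be unchanged (assumption).
            conserved = strand_b[h1] == s1 and strand_b[h2] == s2
        elif step == -1:
            # Inverted order in B: both strands must be flipped (assumption).
            conserved = strand_b[h1] == -s1 and strand_b[h2] == -s2
        else:
            conserved = False  # orthologs are not adjacent in B
        count += conserved
    return count
```

In this sketch an inverted segment still counts as conserved, because the relative orientation of the two genes is preserved.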

SHOT uses the following operational definition of orthology: It applies Smith-Waterman sequence comparison (Smith & Waterman, 1981; Pearson, 1998) of protein coding genes, and pre-selects pairs of homologous sequences using a cutoff value of E=0.01. From this list, only pairs of genes that are each other's closest relatives (bi-directional best hits) in the respective genomes are considered as 'orthologs'. To include the possibility of fusion and splitting of genes, multiple genes from one genome are allowed to have the same closest relative - as long as the matches do not overlap. Phylogenetic trees are constructed using tools from the PHYLIP package (Felsenstein, 1989).
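The bi-directional best hit criterion can be sketched as follows. This is an illustration only: it assumes the Smith-Waterman results are already available as (query, subject, E-value, score) tuples, that hits with an E-value above the cutoff are discarded, and that the "closest relative" is the hit with the highest score. It also omits the allowance for gene fusion and splitting described above. All function names are hypothetical.

```python
def best_hits(hits, e_cutoff=0.01):
    """Return {query: best subject} for hits passing the E-value cutoff.

    hits: iterable of (query, subject, evalue, score) tuples.
    The best hit is taken to be the one with the highest score (assumption).
    """
    best = {}
    for query, subject, evalue, score in hits:
        if evalue > e_cutoff:
            continue  # not accepted as homologous
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(hits_ab, hits_ba, e_cutoff=0.01):
    """Return the set of (gene_a, gene_b) pairs that are each other's best hit."""
    a_to_b = best_hits(hits_ab, e_cutoff)
    b_to_a = best_hits(hits_ba, e_cutoff)
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}
```

A pair is reported only if the best hit of gene a in genome B is gene b and the best hit of gene b in genome A is gene a.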

How to get started with SHOT

Essential inputs of SHOT are (1) the desired reconstruction method (gene content phylogeny or gene order phylogeny), and (2) a selection of species to be included in the phylogeny. If 'apply selection below' is chosen, the user can select individual species (by checking the boxes) at the bottom of the page.

By default, the output is a PNG image of a graphical unrooted tree, with the option to download the tree as a PostScript file or in Newick format. If 'Newick format including bootstrap values' is selected, SHOT generates a Newick file including bootstrap values (based on 100 replicates) that can be read by various phylogeny software packages.


Bootstrapping


Bootstrapping is achieved by creating random subsets of the genes of each genome. These subsets are then used to construct gene content or gene order trees with the parameters selected by the user.

For gene content phylogenies, bootstrapping is performed by randomly selecting half of the ORFs annotated for each genome and then performing the steps described above.

For gene order trees, bootstrap values are calculated as follows. Three fourths of the relevant genes are randomly chosen and labelled as selected. Depending on the user's choice, these genes are drawn either from all ORFs annotated as genes or from all protein coding genes with an ortholog in the other genome. Genomes are then analysed for the presence of conserved gene pairs as described above, but a conserved pair is counted only if all 4 genes belonging to it (the two adjacent genes in each genome) are labelled as selected. In addition, a conserved orthologous gene pair is counted only if it is also present without bootstrapping, in other words if the pair was not artificially constructed by the bootstrap procedure.
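The resampling step for gene content phylogenies can be sketched as below. The sketch assumes that ortholog pairs have been determined beforehand and are simply filtered by the sampled subsets; whether SHOT recomputes orthology for each replicate is not stated here. The function names, the seed parameter, and the data layout are hypothetical.

```python
import random


def bootstrap_subset(orfs, fraction, rng):
    """Randomly select the given fraction of a genome's ORFs (without replacement)."""
    orfs = list(orfs)
    return set(rng.sample(orfs, int(len(orfs) * fraction)))


def bootstrap_shared_counts(genome_a, genome_b, orthologs, replicates=100, seed=0):
    """Return the number of shared orthologs for each bootstrap replicate.

    genome_a, genome_b: lists of ORF identifiers.
    orthologs: iterable of (gene_a, gene_b) pairs (assumed precomputed).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(replicates):
        # Half of the ORFs of each genome, as for gene content phylogenies.
        sub_a = bootstrap_subset(genome_a, 0.5, rng)
        sub_b = bootstrap_subset(genome_b, 0.5, rng)
        # An ortholog pair survives only if both genes were sampled.
        counts.append(sum(1 for a, b in orthologs if a in sub_a and b in sub_b))
    return counts
```

Each replicate count would then be normalized and converted to a distance in the same way as the full data set. For gene order trees, the same helper could be called with a fraction of 0.75.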




Advanced input parameters for gene content phylogenies


Normalization value:

1. Division by the size of the smaller of the two genomes compared (the theoretical maximum number of shared orthologs).
2. Division by the weighted average genome size [applied by default]. This value is obtained from an empirical formula that fits the number of orthologs shared between archaeal and eubacterial genomes as a function of the eubacterial genome size: sqrt(2) * a * b / sqrt(a^2 + b^2), where a and b are the sizes of the two genomes compared (see Fig. 1 in Snel et al. (1999)). This option represents a more reasonable evolutionary model than normalizing by the size of the smaller genome, since the number of genes shared with the Archaea keeps increasing with bacterial genome size even for very large genomes, albeit more slowly.
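The two normalization options can be written out as a short sketch. The formula is taken from the text above; the function names and the "method" argument are hypothetical and serve only as illustration.

```python
import math


def weighted_average_size(a, b):
    """Weighted average genome size: sqrt(2) * a * b / sqrt(a^2 + b^2)."""
    return math.sqrt(2) * a * b / math.sqrt(a ** 2 + b ** 2)


def gene_content_similarity(shared, a, b, method="weighted"):
    """Similarity = shared orthologs / normalization value.

    shared: number of shared orthologs between the two genomes.
    a, b:   sizes of the two genomes (by whichever definition is selected).
    method: "weighted" (default option) or "smaller" (smaller genome size).
    """
    if method == "smaller":
        norm = min(a, b)
    else:
        norm = weighted_average_size(a, b)
    return shared / norm
```

For two genomes of equal size the weighted average equals that size. As one genome becomes much larger than the other, the value approaches sqrt(2) times the size of the smaller genome, so it is never smaller than the smaller genome size.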


Genome size definition:

1. Genome size is defined as the number of open reading frames (ORFs) annotated as protein coding genes.

2. Genome size is the number of ORFs with at least one homolog in other genomes completed so far [default selection]. By disregarding orphan ORFs, this option eliminates considerable variation caused by differences in genome annotation quality. It may therefore be a better estimate of the theoretical maximum number of orthologs.
3. Genome size is the number of ORFs with at least one ortholog in other completed genomes. This even more stringent option particularly affects the size of genomes that have experienced a high number of recent duplications. We recommend using this option for investigating unexpected topologies rather than as a standard option.


Distance measure:
Evolutionary distances d are determined directly from the estimated similarities s (see Swofford & Olsen, 1990).
1. d=-ln(s) is applied by default.

2. Alternatively, d=1-s may be used. This function is less well supported by models of evolution (Swofford & Olsen, 1990) and therefore provides a poorer estimate of evolutionary distances for weak similarities. We tend to use this option mainly for testing the robustness of clusters.
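The difference between the two distance measures is easiest to see numerically. The sketch below is illustrative only; the function name and the "measure" argument are hypothetical, and it assumes similarities in the range 0 < s <= 1.

```python
import math


def distance(s, measure="log"):
    """Convert a similarity s into a distance d.

    measure: "log" for d = -ln(s) (default option), "linear" for d = 1 - s.
    """
    if not 0 < s <= 1:
        raise ValueError("similarity must satisfy 0 < s <= 1")
    if measure == "log":
        return -math.log(s)
    return 1 - s


# Compare both measures for strong and weak similarities.
for s in (0.9, 0.5, 0.1):
    print(f"s={s:.1f}  -ln(s)={distance(s):.3f}  1-s={distance(s, 'linear'):.3f}")
```

For strong similarities the two measures nearly agree. For weak similarities d=1-s saturates below 1, whereas d=-ln(s) keeps growing.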


Clustering algorithm:
1. Neighbor-joining (Saitou & Nei, 1987) is the default clustering algorithm.

2. The slower Fitch-Margoliash algorithm (Fitch & Margoliash, 1967) may be applied instead.
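Both clustering algorithms operate on a matrix of pairwise distances. As an illustration of how such a matrix could be handed to distance-based programs of the PHYLIP package (such as neighbor and fitch), the sketch below writes a square matrix in the classic PHYLIP layout: the number of species on the first line, then one row per species with the name padded to 10 characters. How SHOT itself passes data to PHYLIP is not described here; the function name and the example species labels are hypothetical.

```python
def phylip_distance_matrix(names, dist):
    """Format a square distance matrix in classic PHYLIP layout.

    names: list of species names (at most 10 characters each).
    dist:  list of rows; dist[i][j] is the distance between species i and j.
    """
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        # Name field is padded with blanks to exactly 10 characters.
        values = " ".join(f"{d:.5f}" for d in row)
        lines.append(name.ljust(10) + " " + values)
    return "\n".join(lines) + "\n"


# Example with three hypothetical species labels.
text = phylip_distance_matrix(
    ["E_coli", "B_subtilis", "M_jannasch"],
    [[0.0, 0.7, 1.5], [0.7, 0.0, 1.4], [1.5, 1.4, 0.0]],
)
print(text)
```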



Advanced input parameters for gene order phylogenies


Genes considered:
1. All ORFs annotated as genes are analysed for the presence of conserved gene pairs [default selection].

2. Only genes with an ortholog in the other genome are analysed for the presence of conserved gene pairs; genes without an ortholog in the other genome are treated as if they were not annotated. As a result, events that affect only the gene content of a genome are ignored.


Normalization:

Numbers of conserved gene pairs are normalized by the size of the smaller genome, which correlates with the maximum number of conserved gene pairs theoretically possible. Genome size may be defined as follows:

1. Number of ORFs annotated as genes.
2. Number of ORFs with at least one homolog in other genomes completed so far [default selection; we recommend this option if all ORFs annotated as genes are analysed for the presence of conserved gene pairs].
3. Number of ORFs with at least one ortholog in other genomes completed so far.
4. In addition, the number of orthologs shared between the two genomes compared may be used as the genome size. This option should be applied if only genes with an ortholog in the other genome are considered for the gene order analysis; in that case, this value represents the theoretical maximum number of conserved gene pairs.


Distance measure and clustering algorithm:
- Selectable parameters are identical for gene order and gene content phylogenies.




References


Felsenstein, J. (1989). PHYLIP - Phylogeny inference package (Version 3.2). Cladistics, 5, 164-166.

Fitch, W.M. & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155, 279-284.

Fitz-Gibbon, S.T. & House, C.H. (1999). Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res., 27, 4218-4222.

Huynen, M., Snel, B. & Bork, P. (1999). Lateral gene transfer, genome surveys, and the phylogeny of prokaryotes. Science, 286, 1443a.

Pearson, W. (1998). Empirical statistical estimates for sequence similarity searches. J. Mol. Biol., 276, 71-84.

Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406-425.

Smith, T. & Waterman, M.S. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.

Snel, B., Bork, P. & Huynen, M. (1999). Genome phylogeny based on gene content. Nat. Genet., 21, 108-110.

Swofford, D.L. & Olsen, G.J. (1990). Phylogeny construction. In Molecular Systematics (Hillis, D.M. & Moritz, C., eds), pp. 411-501. Sinauer Associates.