Documentation of SHOT - Shared Ortholog and Gene Order Tree Reconstruction Tool
SHOT is a web server for the reconstruction of genome phylogenies.
Compared with phylogenies derived from single genes, genome
phylogenies are less affected by horizontal gene transfer,
unrecognized paralogy, highly variable rates of gene evolution, or
misalignment (Snel et al., 1999; Huynen et al., 1999; Fitz-Gibbon
and House, 1999). Distance-based phylogenies can be generated
using two independent strategies. Gene content phylogenies are
based on an approach introduced by Snel et al. (1999).
Alternatively, SHOT can generate phylogenies of prokaryotes based
on gene order conservation across the whole genome.
SHOT
For SHOT gene content phylogenies, the similarity between two species is defined as the
ratio of the number of shared orthologs to a normalization value
that accounts for differing genome sizes. For gene order phylogenies,
distances are derived from the number of conserved orthologous gene
pairs. Here, a conserved gene pair is defined as two genes that are
adjacent in one genome and whose orthologs are adjacent in the other
genome, with conserved transcriptional directions.
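The definition of a conserved gene pair can be illustrated with a minimal sketch. It is not SHOT's implementation. The data layout (a genome as an ordered list of `(gene_id, strand)` tuples), the function name `count_conserved_pairs`, and the reading of "conserved transcriptional directions" as "same order with identical strands, or reversed order with both strands flipped" are assumptions made for illustration. Circular chromosomes are ignored.

```python
def count_conserved_pairs(genome_a, genome_b, orthologs):
    """Count adjacent gene pairs of genome_a whose orthologs are also
    adjacent in genome_b with matching transcriptional directions.

    genome_a, genome_b: lists of (gene_id, strand) in chromosomal order,
                        strand being +1 or -1 (hypothetical layout).
    orthologs:          dict mapping gene ids of genome_a to genome_b.
    """
    # Position and strand of every gene in genome_b, for O(1) lookup.
    pos_b = {gene: (i, strand) for i, (gene, strand) in enumerate(genome_b)}
    conserved = 0
    for (g1, s1), (g2, s2) in zip(genome_a, genome_a[1:]):
        h1, h2 = orthologs.get(g1), orthologs.get(g2)
        if h1 not in pos_b or h2 not in pos_b:
            continue  # at least one gene has no ortholog in genome_b
        (p1, t1), (p2, t2) = pos_b[h1], pos_b[h2]
        # Same order on the same strands ...
        same_order = p2 - p1 == 1 and (t1, t2) == (s1, s2)
        # ... or the whole pair inverted (order reversed, both strands flipped).
        inverted = p1 - p2 == 1 and (t1, t2) == (-s1, -s2)
        if same_order or inverted:
            conserved += 1
    return conserved


genome_a = [("a1", 1), ("a2", 1), ("a3", -1), ("a4", 1)]
genome_b = [("b2", -1), ("b1", -1), ("b4", 1), ("b3", -1)]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}
print(count_conserved_pairs(genome_a, genome_b, orthologs))
```

In this toy input, the pair a1/a2 is found inverted as a unit in the second genome and is counted, whereas a3/a4 has swapped order without flipped strands and is not.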
SHOT uses the following operational definition of orthology.
Protein-coding genes are compared using Smith-Waterman sequence
comparison (Smith & Waterman, 1981; Pearson, 1998), and pairs of
homologous sequences are pre-selected using a cutoff value of
E=0.01. From this list, only pairs of genes that are each other's
closest relatives (bi-directional best hits) in the respective
genomes are considered 'orthologs'. To allow for gene fusion and
splitting, multiple genes from one genome may have the same closest
relative, as long as the matches do not overlap. Phylogenetic trees
are constructed using tools from the PHYLIP package (Felsenstein, 1989).
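The bi-directional best hit criterion can be sketched as follows. This is an illustrative simplification, not SHOT's code. The hit tuples `(query, subject, evalue)`, the function names, and the inclusive treatment of the E=0.01 cutoff are assumptions. The exception for fused or split genes (several genes sharing one closest relative with non-overlapping matches) is deliberately left out.

```python
E_CUTOFF = 0.01  # pre-selection threshold taken from the text


def best_hits(hits):
    """Return {query: subject} keeping, per query, the hit with the
    lowest E-value among those passing the cutoff.

    hits: iterable of (query, subject, evalue) tuples (hypothetical layout).
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > E_CUTOFF:
            continue  # not considered homologous
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(hits_a_vs_b, hits_b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(hits_a_vs_b)
    backward = best_hits(hits_b_vs_a)
    return {a: b for a, b in forward.items() if backward.get(b) == a}


a_vs_b = [("a1", "b1", 1e-30), ("a1", "b2", 1e-5),
          ("a2", "b2", 1e-10), ("a3", "b3", 0.5)]
b_vs_a = [("b1", "a1", 1e-28), ("b2", "a1", 1e-12),
          ("b2", "a2", 1e-9), ("b3", "a3", 0.5)]
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here a2's best hit is b2, but b2's best hit is a1, so the pair a2/b2 is not reciprocal and is dropped. The pair a3/b3 fails the E-value cutoff.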
How to get started with SHOT
Essential inputs of SHOT are (1) the
desired reconstruction method (gene content phylogeny or gene order
phylogeny), and (2) a selection of species to be
included in the phylogeny. If 'apply selection below' is
chosen, the user can select individual species (by checking the boxes) at the bottom of the
page.
By default, the output format is a PNG image of a graphical unrooted tree, with the option to download the tree as a PostScript file or in Newick format. If 'Newick format including bootstrap values' is selected, a Newick file including bootstrap values (based on 100 replicates) is generated that can be read by various phylogeny programs.
Bootstrapping
Bootstrapping is achieved by creating random subsets of the genes of each genome.
The subsets are then used to construct gene content or gene order
trees with the parameters selected by the user. For gene content
phylogenies, bootstrapping is performed by randomly selecting half of
the ORFs annotated for each genome and then performing the steps
described above. For gene order trees, bootstrap values are
calculated as follows: three fourths of the relevant genes are
randomly chosen and labelled as selected (these genes are taken
either from all ORFs annotated as genes or from all protein-coding
genes with orthologs in the genome compared with, as selected by the user).
Genomes are then analysed for the presence of conserved gene pairs as
described above; however, a conserved pair is only counted if all four
genes belonging to it were labelled as selected. In addition, a
conserved orthologous gene pair is only counted if it is also present
without bootstrapping, in other words if the pair was not
artificially created by the bootstrap procedure.
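A single gene order bootstrap replicate might be sketched as below. This is an assumption-laden illustration rather than SHOT's procedure. The function name, the representation of a conserved pair as a tuple `(a1, a2, b1, b2)` of two adjacent genes and their two orthologs, and the choice to draw three fourths of the genes separately in each genome are hypothetical. Because the replicate only filters pairs already found in the full data, no pair created by the subsampling itself can be counted.

```python
import random


def bootstrap_pair_count(conserved_pairs, genes_a, genes_b,
                         fraction=0.75, rng=None):
    """Count conserved pairs surviving one bootstrap replicate.

    conserved_pairs: list of (a1, a2, b1, b2) found WITHOUT bootstrapping.
    genes_a, genes_b: lists of the relevant gene ids of the two genomes.
    fraction:         share of genes labelled as selected (3/4 in the text).
    """
    rng = rng or random.Random()
    chosen_a = set(rng.sample(genes_a, int(len(genes_a) * fraction)))
    chosen_b = set(rng.sample(genes_b, int(len(genes_b) * fraction)))
    # A pair counts only if all four of its genes were selected.
    return sum(
        1
        for a1, a2, b1, b2 in conserved_pairs
        if a1 in chosen_a and a2 in chosen_a
        and b1 in chosen_b and b2 in chosen_b
    )


pairs = [("a1", "a2", "b1", "b2"), ("a3", "a4", "b3", "b4")]
genes_a = ["a1", "a2", "a3", "a4"]
genes_b = ["b1", "b2", "b3", "b4"]
# 100 replicates, as used for the Newick output with bootstrap values.
counts = [bootstrap_pair_count(pairs, genes_a, genes_b, rng=random.Random(i))
          for i in range(100)]
print(min(counts), max(counts))
```

Each replicate count would then be normalized and converted to a distance in the same way as the full data set, and the resulting trees compared to obtain bootstrap values.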
Advanced input parameters for gene content phylogenies
Normalization value:
1. Division by the size of the smaller of the two genomes compared
(the theoretical maximum number of shared orthologs).
2. Division by the weighted average genome size [applied by
default]. This value is obtained from an empirical formula
representing a fit to the number of orthologs shared between
archaeal and eubacterial genomes as a function of the eubacterial
genome size (sqrt(2) * a * b / sqrt(a^2 + b^2), with a and b as
the sizes of the two genomes compared; see Fig. 1 in Snel et al.
(1999)). This option represents a more reasonable evolutionary
model than using the size of the smaller genome, since the number
of genes shared with the Archaea continues to increase with
bacterial genome size even for very large genomes, albeit more
slowly.
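The two normalization options can be compared numerically with a short sketch. The function names and the `method` argument are hypothetical. Only the formula sqrt(2) * a * b / sqrt(a^2 + b^2) is taken from the text.

```python
import math


def weighted_average_size(a, b):
    """Weighted average genome size: sqrt(2) * a * b / sqrt(a^2 + b^2)."""
    return math.sqrt(2) * a * b / math.sqrt(a ** 2 + b ** 2)


def gene_content_similarity(shared, a, b, method="weighted"):
    """Shared orthologs divided by the chosen normalization value.

    shared: number of orthologs shared by the two genomes.
    a, b:   genome sizes (under whichever size definition is selected).
    method: "weighted" (default in the text) or "smaller".
    """
    if method == "smaller":
        norm = min(a, b)
    else:
        norm = weighted_average_size(a, b)
    return shared / norm


# Equal genome sizes: both options give the same normalization value.
print(weighted_average_size(1000, 1000))
# Unequal sizes: the weighted value lies between the smaller genome size
# and sqrt(2) times the smaller genome size.
print(weighted_average_size(500, 4000), min(500, 4000))
```

For genomes of equal size the weighted average equals that size. As one genome grows much larger than the other, the value approaches sqrt(2) times the smaller genome size, so it keeps rising slowly instead of saturating at the smaller size.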
Genome size definition:
1. Genome size is the number of open reading frames (ORFs) annotated as protein-coding genes.
2. Genome size is the number of ORFs with at least one homolog in
other genomes completed so far [default selection]. By disregarding
orphan ORFs, this option removes much of the variation caused by
differences in genome annotation quality. It may therefore be a
better estimate of the theoretical maximum number of orthologs.
3. Genome size is the number of ORFs with at least one ortholog in
other completed genomes. This even more stringent option
particularly affects the size of genomes that have experienced a
high number of recent duplications. We recommend using this option
for investigating unexpected topologies rather than as a standard
option.
Distance measure:
Evolutionary distances d are determined directly from the
estimated similarities s (see Swofford & Olsen, 1990).
1. d=-ln(s) is applied by default.
2. Alternatively, d=1-s may be used. This function is less well supported by models of evolution (Swofford & Olsen, 1990) and therefore provides a poorer estimate of evolutionary distances for weak similarities. We thus tend to apply this option for testing the robustness of clusters.
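The difference between the two distance functions is easiest to see on a few sample similarities. The sketch below uses a hypothetical function name and `method` argument. Only the two formulas come from the text.

```python
import math


def similarity_to_distance(s, method="log"):
    """Convert a similarity s into a distance d.

    method="log":    d = -ln(s)  (default in the text), requires 0 < s <= 1
    method="linear": d = 1 - s,  requires 0 <= s <= 1
    """
    if method == "log":
        if not 0 < s <= 1:
            raise ValueError("s must be in (0, 1] for d = -ln(s)")
        return -math.log(s)
    if not 0 <= s <= 1:
        raise ValueError("s must be in [0, 1] for d = 1 - s")
    return 1.0 - s


for s in (0.9, 0.5, 0.1):
    print(s, similarity_to_distance(s), similarity_to_distance(s, "linear"))
```

For high similarities the two measures nearly coincide. For weak similarities d=-ln(s) grows without bound while d=1-s saturates at 1, which is why the linear form compresses the distances between distantly related genomes.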
Clustering algorithm:
1. Neighbor-joining (Saitou & Nei, 1987) is the default clustering algorithm.
2. The slower Fitch-Margoliash algorithm (Fitch & Margoliash, 1967) may be applied instead.
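Both clustering algorithms operate on a matrix of pairwise distances. As an illustration, the sketch below writes such a matrix in the square distance-matrix layout read by PHYLIP's distance programs (number of taxa on the first line, then one row per taxon with the name padded to ten characters). How SHOT itself passes distances to PHYLIP is not described in this document, so the function name and the workflow are assumptions.

```python
def format_phylip_matrix(names, matrix):
    """Return a square distance matrix as PHYLIP-style text.

    names:  list of taxon names (truncated or padded to 10 characters).
    matrix: list of rows, matrix[i][j] being the distance between
            names[i] and names[j].
    """
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # classic PHYLIP fixed-width name field
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


names = ["Ecoli", "Bsubtilis", "Mjannaschii"]
matrix = [
    [0.0, 0.8, 1.6],
    [0.8, 0.0, 1.5],
    [1.6, 1.5, 0.0],
]
print(format_phylip_matrix(names, matrix), end="")
```

The distance values here are invented for the example. Names longer than ten characters are truncated, so they should be unique within their first ten characters.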
Advanced input parameters for gene order phylogenies
Genes considered:
1. All ORFs annotated as genes are analysed for the presence of
conserved gene pairs [default selection].
2. Only genes that have an ortholog in the genome compared with are
analysed for the presence of conserved gene pairs (genes without an
ortholog in the other genome are treated as not annotated). As a
result, events that only affect the genomic gene content are ignored.
Normalization:
Numbers of conserved gene pairs are normalized by the size of the
smaller genome, which correlates with the theoretical maximum
number of conserved gene pairs. Genome size may be defined as follows:
1. Number of ORFs annotated as genes.
2. Number of ORFs with at least one homolog in other genomes
completed so far [default selection; we recommend this option if all
ORFs annotated as genes are analysed for the presence of conserved
gene pairs].
3. Number of ORFs with at least one ortholog in other genomes
completed so far.
4. In addition, the number of orthologs shared between the genomes
compared may be used as the genome size. This option should be
applied if only genes having an ortholog in the genome compared with
are considered for the gene order analysis (in that case, this value
represents the theoretical maximum number of conserved gene pairs).
Distance measure and clustering algorithm:
- Selectable parameters are identical for gene order and gene content phylogenies.
References
Felsenstein, J. (1989). PHYLIP-phylogeny inference package (Version 3.2). Cladistics, 5, 164-166.
Fitch, W.M. & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155, 279-284.
Fitz-Gibbon, S.T. & House, C.H. (1999). Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res., 27, 4218-4222.
Huynen, M., Snel, B. & Bork, P. (1999). Lateral gene transfer, genome surveys, and the phylogeny of prokaryotes. Science, 286, 1443a.
Pearson, W. (1998). Empirical statistical estimates for sequence similarity searches. J. Mol. Biol., 276, 71-84.
Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406-425.
Smith, T. & Waterman, M.S. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.
Snel, B., Bork, P. & Huynen, M. (1999). Genome phylogeny based on gene content. Nat. Genet., 21, 108-110.
Swofford, D.L. & Olsen, G.J. (1990). Phylogeny construction. In Molecular Systematics (Hillis, D.M. and Moritz, C., eds), pp. 411-501, Sinauer Associates.