FGP :: PlantTribes
Methods
The predicted proteomes of Arabidopsis and rice were downloaded from TIGR and comprised 27,117 and 80,975 putative protein-coding genes, respectively. Because TIGR's rice database contains inherent duplication due to overlapping BAC/PAC clones, it was screened to eliminate identical genes on the same chromosome, reducing the number of rice sequences to 63,673.
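The redundancy screen described above can be sketched as follows. This is a minimal illustration, not the pipeline that was actually run: the record layout (gene id, chromosome, protein sequence), the function name, and the toy records are all assumptions, and "identical" is taken to mean an exact sequence match.

```python
def drop_same_chromosome_duplicates(records):
    """Keep the first record for each (chromosome, sequence) pair.

    `records` is an iterable of (gene_id, chromosome, protein_sequence)
    tuples; this layout is an assumption made for illustration.
    """
    seen = set()
    kept = []
    for gene_id, chromosome, sequence in records:
        key = (chromosome, sequence.upper())
        if key in seen:
            continue  # identical sequence already kept on this chromosome
        seen.add(key)
        kept.append((gene_id, chromosome, sequence))
    return kept


# Hypothetical toy records, not real TIGR identifiers.
example = [
    ("geneA", "chr1", "MKTLLV"),
    ("geneB", "chr1", "MKTLLV"),  # duplicate of geneA on the same chromosome
    ("geneC", "chr2", "MKTLLV"),  # same sequence, different chromosome: kept
    ("geneD", "chr1", "MSTNPK"),
]
print([record[0] for record in drop_same_chromosome_duplicates(example)])
```

Identical sequences on different chromosomes are deliberately retained, matching the "same chromosome" condition in the text.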
Following Enright and colleagues (TribeMCL, Enright et al. 2002), the non-redundant set of 90,790 proteins was used in an all-against-all BLAST (Altschul et al. 1997) search, and a similarity matrix was constructed from the transformed BLAST scores. The matrix was then clustered with MCL (van Dongen 2000) at low, medium, and high stringencies (inflation values of 1.2, 3.0, and 5.0, respectively).
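To make the clustering step more concrete, here is a small pure-Python sketch of the MCL idea (alternating expansion and inflation on a column-stochastic matrix). It is not the MCL program used for PlantTribes. The text says only that BLAST scores were "transformed", so the -log10(E-value) transformation, the cap of 200, the self-loop weighting, the fixed iteration count, and the toy E-values are all assumptions for illustration.

```python
import math


def evalue_to_similarity(evalue, cap=200.0):
    """Transform a BLAST E-value into a similarity score: -log10(E), capped."""
    if evalue <= 0.0:
        return cap  # BLAST reports 0.0 for extremely strong hits
    return max(0.0, min(cap, -math.log10(evalue)))


def normalize_columns(m):
    """Scale each column in place so that it sums to 1."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0.0:
            for i in range(n):
                m[i][j] /= total
    return m


def mcl(similarity, inflation=3.0, iterations=50, threshold=1e-6):
    """Cluster a symmetric similarity matrix with a basic MCL loop."""
    n = len(similarity)
    m = [list(row) for row in similarity]
    for i in range(n):
        m[i][i] = max(max(m[i]), 1.0)  # add a self-loop to every node
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix so flow spreads along longer paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        m = normalize_columns([[x ** inflation for x in row] for row in m])

    # Read clusters as connected components of the remaining flow.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


# Hypothetical E-values: two tight groups of three joined by one weak hit.
hits = {(0, 1): 1e-100, (0, 2): 1e-100, (1, 2): 1e-100,
        (3, 4): 1e-100, (3, 5): 1e-100, (4, 5): 1e-100,
        (2, 3): 1e-1}
sim = [[0.0] * 6 for _ in range(6)]
for (a, b), evalue in hits.items():
    sim[a][b] = sim[b][a] = evalue_to_similarity(evalue)
print(mcl(sim, inflation=3.0))
```

A higher inflation value strengthens strong flows faster relative to weak ones, which is why larger inflation corresponds to higher stringency and generally smaller, tighter tribes.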
The PFAM database server was used to find all known domains in every sequence. A unified annotation was assigned to each putative gene family using Perl regular expressions to find common words and average start positions. The annotations, sequences, MCL output, and PFAM results were loaded into a MySQL database, and user-searchable CGI scripts, which you are now viewing, were written to query it.
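As a rough illustration of the "common words" part of the annotation step, the sketch below picks out words shared by most member descriptions of one tribe. The original work used Perl; this Python version, the 50% cutoff, the tokenizing pattern, and the sample descriptions are all assumptions, and the averaging of start positions is not covered.

```python
import re
from collections import Counter


def common_words(descriptions, min_fraction=0.5):
    """Return words present in at least `min_fraction` of the descriptions."""
    counts = Counter()
    for text in descriptions:
        # Count each word once per description, ignoring case.
        counts.update(set(re.findall(r"[A-Za-z0-9-]+", text.lower())))
    needed = min_fraction * len(descriptions)
    return sorted(word for word, n in counts.items() if n >= needed)


# Hypothetical member annotations for a single tribe.
members = [
    "F-box family protein",
    "F-box domain containing protein",
    "hypothetical protein",
]
print(common_words(members))
```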
Searching the Database
Keyword Search
You can search the PlantTribes database by keyword. You must choose to search either the "id" (e.g., 'Atg01010') or the "annotation" (e.g., 'NAC domain protein') field. Both Arabidopsis and rice are indexed, and the search automatically covers both species. Regular expressions may be used in the search.
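The behavior of a regular-expression keyword search over both species can be sketched as below. This is only an illustration in Python: the record fields, the case-insensitive matching, and the sample records are assumptions, and the regular-expression dialect accepted by the actual search form may differ from Python's.

```python
import re


def keyword_search(records, pattern, field):
    """Return records whose `field` matches the regular expression `pattern`."""
    if field not in ("id", "annotation"):
        raise ValueError("field must be 'id' or 'annotation'")
    regex = re.compile(pattern, re.IGNORECASE)  # case-insensitivity is assumed
    return [rec for rec in records if regex.search(rec[field])]


# Hypothetical records; the identifiers are placeholders, not real entries.
records = [
    {"id": "geneA", "species": "Arabidopsis",
     "annotation": "NAC domain protein"},
    {"id": "geneB", "species": "Rice",
     "annotation": "no apical meristem (NAM) protein"},
    {"id": "geneC", "species": "Rice",
     "annotation": "F-box domain protein"},
]
hits = keyword_search(records, r"NAC|NAM", "annotation")
print([rec["id"] for rec in hits])
```

The alternation `NAC|NAM` shows how a single pattern can match related annotations from either species in one query.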
PFAM Search
You can search the PlantTribes database using the Accession (e.g., 'PF00646'), Name (e.g., 'F-box'), or Description (e.g., 'F-box domain') of PFAM domains.
Tribe Size Search
You can search the PlantTribes database by gene family size. You can enter the minimum and maximum number of sequences for each species or for the total number of sequences in a tribe. All tribes that fit the size range are returned with hyperlinks to each tribe's info page. For example, to find all tribes with exactly 10 members from Arabidopsis and 10 members from rice, enter 10 in both the minimum and maximum text boxes for the rows labeled 'Arabidopsis' and 'Rice'. Alternatively, to find tribes with 25-30 sequences in total, enter 25 in the minimum text box and 30 in the maximum text box in the row named 'Total'. You must also choose a stringency of 1, 2, or 3 (low, medium, or high); medium is the default.
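The two examples above amount to inclusive range filters on per-species and total counts. A minimal sketch follows, assuming each tribe is represented as a dictionary of member counts at one stringency; the function name, field names, and sample tribes are hypothetical.

```python
def tribes_in_size_range(tribes, arabidopsis=None, rice=None, total=None):
    """Filter tribes by inclusive (minimum, maximum) member counts.

    Each bound is a (minimum, maximum) tuple, or None for no restriction.
    """
    def within(value, bounds):
        if bounds is None:
            return True
        low, high = bounds
        return low <= value <= high

    return [
        t for t in tribes
        if within(t["arabidopsis"], arabidopsis)
        and within(t["rice"], rice)
        and within(t["arabidopsis"] + t["rice"], total)
    ]


# Hypothetical tribes at a single stringency.
tribes = [
    {"tribe": 1, "arabidopsis": 10, "rice": 10},
    {"tribe": 2, "arabidopsis": 12, "rice": 15},
    {"tribe": 3, "arabidopsis": 3, "rice": 40},
]
exact = tribes_in_size_range(tribes, arabidopsis=(10, 10), rice=(10, 10))
by_total = tribes_in_size_range(tribes, total=(25, 30))
print([t["tribe"] for t in exact], [t["tribe"] for t in by_total])
```

Setting the minimum equal to the maximum, as in the first call, selects tribes with exactly that many members.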
BLAST
You can use any of the BLAST programs (BLASTN, BLASTX, BLASTP, TBLASTN, TBLASTX) to search for similar sequences in Arabidopsis and rice. Both the protein and CDS sequences are indexed in the database. Please be aware that no results will be returned if you choose a program that does not match your query type (e.g., BLASTP for a nucleotide query). All the usual BLAST options are available.
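Choosing the right program depends on the type of the query and of the database searched (protein or CDS). The sketch below encodes the standard pairing for four of the programs and guesses the query type from its letters. The 90% nucleotide-letter threshold and the function names are assumptions for illustration, and TBLASTX (translated nucleotide query against a translated nucleotide database) is left as a manual choice.

```python
BLAST_PROGRAMS = {
    # (query type, database type): program
    ("nucleotide", "nucleotide"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "protein"): "BLASTX",   # query is translated
    ("protein", "nucleotide"): "TBLASTN",  # database is translated
}


def guess_sequence_type(sequence):
    """Guess 'nucleotide' or 'protein' from the letters in a sequence."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    # Assumed heuristic: mostly A/C/G/T/U/N means nucleotide.
    return "nucleotide" if nucleotide_like / len(letters) >= 0.9 else "protein"


def choose_program(sequence, database_type):
    """Pick a BLAST program for a query against 'protein' or 'nucleotide'."""
    return BLAST_PROGRAMS[(guess_sequence_type(sequence), database_type)]


print(choose_program("ATGGCGTTAGC", "protein"))
print(choose_program("MKTLLVAGS", "nucleotide"))
```

Short or low-complexity protein sequences can fool a letter-based guess, so the result is best treated as a suggestion rather than a guarantee.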