Gclust Server

2009/09/14

WebSite: http://gclust.c.u-tokyo.ac.jp/
HTTPS Site: https://dbarchive.biosciencedbc.jp/data/gclust

Database of sequence clusters obtained as a result of all-against-all BLAST search of proteins in 95 organism species.

README Content

  1. Database Component
  2. Data Description
  3. License
  4. Update History
  5. Literature
  6. Contact address

1. Database Component

    DBCLS converted the downloadable files from the original site into the CSV files listed below.
  1. README
  2. Amino acid sequences of predicted proteins and their annotation for 95 organism species.
  3. Cluster based on sequence comparison of homologous proteins of 95 organism species
  4. Proteins in similarity relationship with the cluster
  5. Downloadable files in the original site

    The following data are downloadable files in the original site.
  6. Amino acid sequences used for clustering (Multi FASTA format)
  7. Sequence ID and annotation information
  8. Prefix list for each organism
  9. Designation of organism group
  10. Parameters for Organism Grouping
  11. Clustering results
  12. Table of Cluster and Organism Species Number

2. Data Description

2.1 README

Data name README
Description of data contents HTML file to describe "Gclust Server" data.
File README_e.html(English)

2.2 Amino acid sequences of predicted proteins and their annotation for 95 organism species.

Data name Amino acid sequences of predicted proteins and their annotation for 95 organism species.
Description of data contents

Amino acid sequences of predicted proteins and their annotation for 95 organism species. The data are given in a CSV format text file.

File gclust_seq.zip (152MB)
Data items are the following:
Data item Primary key Foreign key Description
Sequence ID * ID of a sequence
Cluster ID * ID of cluster. gclust_cluster is referenced.
Annotation in original database Annotation at the original website
Species Species name
Length Amino acid sequence length
Sequence Amino acid sequence
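
As a minimal sketch of how this CSV might be read, the following Python groups sequence IDs by cluster ID. It assumes that the six columns appear in the order listed above and that the first row is a header; neither is confirmed here, so check the extracted file first. The sample rows and IDs are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumed column order, taken from the item list above (not verified against the file).
COLUMNS = ["sequence_id", "cluster_id", "annotation", "species", "length", "sequence"]

def group_by_cluster(handle, has_header=True):
    """Return {cluster_id: [sequence_id, ...]} from a gclust_seq-style CSV."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row, if the file has one
    clusters = defaultdict(list)
    for row in reader:
        if len(row) < len(COLUMNS):
            continue  # ignore short or blank rows
        record = dict(zip(COLUMNS, row))
        clusters[record["cluster_id"]].append(record["sequence_id"])
    return dict(clusters)

# Hypothetical sample data; real IDs and annotations will differ.
sample = io.StringIO(
    "Sequence ID,Cluster ID,Annotation,Species,Length,Sequence\n"
    "ATH_example1,1,hypothetical protein,Arabidopsis thaliana,5,MKTAY\n"
    "OSA_example2,1,hypothetical protein,Oryza sativa,4,MKTA\n"
    "Eco_example3,2,hypothetical protein,Escherichia coli K-12,3,MKT\n"
)
result = group_by_cluster(sample)
print(result)
```

For the real data, pass an open file object (opened with newline="") instead of the in-memory sample.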

2.3 Cluster based on sequence comparison of homologous proteins of 95 organism species

Data name Cluster based on sequence comparison of homologous proteins of 95 organism species
Description of data contents

Clusters were obtained by an all-against-all (round-robin) BLASTP search of the above amino acid sequence data, using the E-value and the overlap score, followed by heuristic estimation of a similarity threshold for the homologs of each protein by the entropy-optimized organism count method (Bioinformatics 2009 Mar 1;25(5):599-605). The data are given in a CSV format text file.

File gclust_cluster.zip (8.72MB)
Data items are the following:
Data item Primary key Foreign key Description
Cluster ID * ID of cluster
Representative sequence ID * ID of a sequence that represents the cluster. gclust_seq is referenced.
Link to cluster sequences Link to the list of sequences belonging to the cluster (empty space)
Link to related sequences Link to the list of sequences that are similar to the cluster, but not clustered
Sequence length Amino acid sequence length
Representative annotation Representative annotation of the cluster
Number of Sequences Number of sequences contained in the cluster
Homologs Number of sequences contained in the cluster
Clustering threshold The threshold of E-value used for clustering
Plants (7species) (%) The appearance rate of this cluster in the plant and algal group (including 7 species)
Other bikonts (9 species) (%) The appearance rate of this cluster in other Bikonta (Chromalveolata, Excavata) group (including 9 species)
Cyano (25species) (%) The appearance rate of this cluster in the cyanobacteria group (including 25 species)
Photo Bact (15species) (%) The appearance rate of this cluster in the photosynthetic bacteria group (including 15 species)
Other Bact (31 species) (%) The appearance rate of this cluster in the non-photosynthetic bacteria group (including 31 species)
Opisthokonts (8species) (%) The appearance rate of this cluster in the opisthokont group (including 8 species)
Number of Sequences for each species The number of sequences by organism species contained in the cluster.
Species not appearing in this cluster Organism species not contained in the cluster.

2.4 Proteins in similarity relationship with the cluster

Data name Proteins in similarity relationship with the cluster
Description of data contents

Protein sequences that are similar to any clustered sequence of 95 organisms species, but not clustered. The data are given in a CSV format text file.

File gclust_related.zip (69MB)
Data items are the following:
Data item Description
Cluster ID ID of cluster
Sequence ID ID of a sequence

2.5 Amino acid sequences used for clustering (Multi FASTA format)

Data name Amino acid sequences used for clustering (Multi FASTA format)
Description of data contents

Amino acid sequences of predicted proteins and their annotation for 95 organism species. FASTA format file.

File all95.fa.zip (161MB)
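
A multi-FASTA file can be read with a few lines of Python. The sketch below is a generic FASTA reader, not code from the Gclust project; it assumes the usual layout of a ">" header line followed by one or more sequence lines. The sample records are invented.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a multi-FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:].strip(), []
        elif header is not None and line:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the last record

# Hypothetical sample; real headers carry the Gclust sequence ID and annotation.
sample = io.StringIO(">seqA example annotation\nMKT\nAYI\n>seqB\nMSE\n")
records = list(read_fasta(sample))
print(records)
```

Because the reader is a generator, it can stream the full 95-species file without loading it into memory.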

2.6 Sequence ID and annotation information

Data name Sequence ID and annotation information
Description of data contents

A tab-delimited text file specifying the ID, length and annotation information of the amino acid sequences of the predicted proteins for 95 organism species.

File all95.p.table.zip (7.28MB)
Data items are the following:
Data item Description
Field 1 ID of amino acid sequence (Sequence ID)
Field 2 Length of amino acid sequence
Field 3 Annotation of amino acid sequence
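
The three fields can be read as follows. This is a sketch that assumes the file has no header line and that every line carries at least the ID and the length; the sample lines are invented.

```python
import io

def read_annotation_table(handle):
    """Yield (sequence_id, length, annotation) from tab-delimited lines."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split at most twice so that any tab inside the annotation is preserved.
        fields = line.split("\t", 2)
        if len(fields) < 2:
            continue  # skip lines without both an ID and a length
        seq_id, length = fields[0], int(fields[1])
        annotation = fields[2] if len(fields) > 2 else ""
        yield seq_id, length, annotation

# Hypothetical sample lines.
sample = io.StringIO("seqA\t120\tphotosystem protein\nseqB\t45\t\n")
rows = list(read_annotation_table(sample))
print(rows)
```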

2.7 Prefix list for each organism

Data name Prefix list for each organism
Description of data contents

List of prefixes for organisms used in Gclust. Each prefix is prepended to the sequence IDs of the corresponding organism. The first line specifies the number of organism species (95). From the second line on, each line lists the prefix of one organism, and "//END" is entered on the last line.

File prefix_all95 (1KB)
Data items are the following:
Prefix Organism name
ATH Arabidopsis thaliana
CME Cyanidioschyzon merolae
CRE Chlamydomonas reinhardtii
OSA Oryza sativa
OTAU Ostreococcus tauri
PPT Physcomitrella patens
PoTR Populus trichocarpa
DPTM Paramecium tetraurelia
GTH Guillardia theta
NGR Naegleria gruberi
PFA Plasmodium falciparum
PHRA Phytophthora ramorum
PHSO Phytophthora sojae
PTR Phaeodactylum tricornutum
TET Tetrahymena thermophila SB210
TPS Thalassiosira pseudonana
Ana Anabaena sp. PCC 7120
Ava Anabaena variabilis ATCC 29413
Glv Gloeobacter violaceus
Npun Nostoc punctiforme sp. PCC73102
Pm1 Prochlorococcus marinus MED4
Pm2 Prochlorococcus marinus MIT9313
Pm3 Prochlorococcus marinus SS120
Pm4 Prochlorococcus marinus MIT9312
Pm5 Prochlorococcus marinus NATL2A
Pm6 Prochlorococcus marinus MIT9301
Pm7 Prochlorococcus marinus MIT9303
Pm8 Prochlorococcus marinus MIT9315
Pm9 Prochlorococcus marinus NATL1A
PmA Prochlorococcus marinus AS9601
S63 Synechococcus sp. PCC 6301
S79 Synechococcus sp. PCC 7942
S81 Synechococcus sp. WH8102
S93 Synechococcus sp. CC9311
S96 Synechococcus sp. CC9605
Syn Synechocystis sp. PCC 6803
Tel Thermosynechococcus elongatus
Ter Trichodesmium erythraeum 405 1
YelA Cyanobacterium Yellowstone A-prime
YelB Cyanobacterium Yellowstone B-prime
Caur Chloroflexus aurantiacus
Cch Chlorobium chlorochromatii CaD3
Clim Chlorobium limicola DSM 245
Cph Chlorobium phaeobacteroides DSM 266
Ctep Chlorobium tepidum
Pvi Prosthecochloris vibrioformis DSM 265
Rde Roseobacter denitrificans Och 114
Rpa1 Rhodopseudomonas palustris BisA53
Rpa2 Rhodopseudomonas palustris BisB4
Rpa3 Rhodopseudomonas palustris BisB18
Rpa4 Rhodopseudomonas palustris HaA2
Rpal Rhodopseudomonas palustris
Rrub Rhodospirillum rubrum ATCC 11170
Rsh Rhodobacter sphaeroides ATCC 17029
Rsp Rhodobacter sphaeroides 2.4.1
Afu Archaeoglobus fulgidus DSM 4304
Ape Aeropyrum pernix K1
Atu Agrobacterium tumefaciens str. C58
Bja Bradyrhizobium japonicum USDA 110
Bma Burkholderia mallei ATCC 23344
Bms Brucella suis 1330
Bpe Bordetella pertussis Tohama I
Bsu Bacillus subtilis Marburg 168
Ccr Caulobacter crescentus CB15
Cvi Chromobacterium violaceum ATCC 12472
Eba Azoarcus sp EbN1
Eco Escherichia coli K-12
Fal Frankia alni ACN14a
Fra Frankia sp. CcI3
Gox Gluconobacter_oxydans_621H
Hal Halobacterium sp. NRC-1
Mac Methanosarcina acetivorans str. C2A
Mes Mesorhizobium sp. BNC1
Mlo Mesorhizobium loti MAFF303099
Mtu Mycobacterium tuberculosis H37Rv
Neq Nanoarchaeum equitans Kin4-M
Pho Pyrococcus horikoshii OT3
Pst Pseudomonas syringae pv. tomato str. DC3000
Rhe Rhizobium_etli_CFN_42
Rle Rhizobium leguminosarum
Rso Ralstonia solanacearum GMI1000
Sco Streptomyces coelicolor A3(2)
Sep Staphylococcus epidermidis ATCC 12228
Sme Sinorhizobium meliloti 1021
Sto Sulfolobus tokodaii str. 7
Vvy Vibrio vulnificus YJ016
CEL Caenorhabditis elegans
DCGR Candida glabrata CBS138
DKLA Kluyveromyces lactis NRRL Y-1140
DME Drosophila melanogaster
HSA Homo sapiens
SPO Schizosaccharomyces pombe
S99 Synechococcus sp. CC9902
NCR Neurospora crassa 74-OR23-1A
SCE Saccharomyces cerevisiae
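
The file layout described above (a count line, one prefix per line, then "//END") can be parsed as sketched below. The helper prefix_of is hypothetical: it assumes that the organism of a sequence can be found by longest-prefix matching on the sequence ID, which this README does not confirm. The three-entry sample is a shortened stand-in for the real 95-entry file.

```python
def read_prefix_list(text):
    """Parse a prefix list: a count line, one prefix per line, then "//END"."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])  # first line: number of organism species
    prefixes = []
    for ln in lines[1:]:
        if ln == "//END":
            break
        prefixes.append(ln)
    if len(prefixes) != declared:
        raise ValueError(f"expected {declared} prefixes, found {len(prefixes)}")
    return prefixes

def prefix_of(sequence_id, prefixes):
    """Return the longest listed prefix that sequence_id starts with, or None.

    Assumption: matching is case-sensitive and the longest match wins.
    """
    matches = [p for p in prefixes if sequence_id.startswith(p)]
    return max(matches, key=len) if matches else None

# Shortened, hypothetical sample of the file contents.
sample = "3\nPTR\nPoTR\nRpa1\n//END\n"
prefixes = read_prefix_list(sample)
print(prefixes, prefix_of("Rpa1_example", prefixes))
```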

2.8 Designation of organism group

Data name Designation of organism group
Description of data contents

This file defines how the 95 organism species are grouped. The first line specifies the number of organism species, and "//END" is entered on the final line. Lines starting with "#" are comments. Data are provided in a tab-delimited text file format.

File grp_def1 (1KB)
Data items are the following:
Data item Description
Field 1 Prefix of the sequence ID of organism
Field 2 Group (Numbers from 1 to 6)
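
A possible reader for this format is sketched below. It treats the first line that is not a comment as the species count, which is an assumption, and the sample content (including the group numbers assigned to each prefix) is invented for illustration.

```python
def read_group_definitions(text):
    """Return (declared_count, {prefix: group_number}) from a grp_def-style file."""
    declared = None
    groups = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line == "//END":
            break
        if declared is None:
            declared = int(line.split()[0])  # species count line
            continue
        prefix, group = line.split("\t")[:2]
        groups[prefix] = int(group)
    return declared, groups

# Hypothetical sample; the real file lists all 95 prefixes.
sample = "3\n# prefix\tgroup\nATH\t1\nAna\t3\nHSA\t6\n//END\n"
declared, groups = read_group_definitions(sample)
print(declared, groups)
```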

2.9 Parameters for Organism Grouping

Data name Parameters for Organism Grouping
Description of data contents

This file specifies, for each organism group, the threshold ratio of species in the group that must have a homolog in a cluster for the cluster to be assigned to that group. For example, when the designated value is 0.5, a cluster is assigned to the "Plants" group if it contains sequences from four or more of the seven species in that group.

File pat_def1 (1KB)
Data items are the following:
Field 1 Group number
Field 2 Designated value for allocation to organism group
Field 3 Group name
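
The arithmetic behind the example above (designated value 0.5, seven species, so four or more are needed) can be written out as follows. This is a sketch, not the Gclust implementation; it assumes the comparison is inclusive (ratio >= threshold), which the example is consistent with but does not prove.

```python
import math

def min_species_required(group_size, threshold):
    """Smallest number of species whose share of the group reaches the threshold."""
    return math.ceil(group_size * threshold)

def belongs_to_group(species_present, group_size, threshold):
    """True if the share of group species present in the cluster meets the threshold.

    Assumption: the test is inclusive (>=).
    """
    return species_present / group_size >= threshold

required = min_species_required(7, 0.5)   # 7 * 0.5 = 3.5, rounded up
print(required, belongs_to_group(4, 7, 0.5), belongs_to_group(3, 7, 0.5))
```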

2.10 Clustering results

Data name Clustering results
Description of data contents

Results of running Gclust program. The data include such information as the requirements for running the program, the cluster ID, the threshold used for cluster grouping, the ID of the sequence belonging to the cluster and the sequence ID of the related group.

File all95m8.hom.1.zip (140MB)
File Composition
Lines 1 to 80: Requirements for running the gclust program.
From line 81 on: Information for each cluster.
Format for Each Cluster
Group [Cluster ID]: [Number of sequences belonging to cluster] sequences. Final thr =   [Threshold]
[ID of sequence belonging to cluster]    [Sequence length]    [Presence of homology between sequences within cluster (one entry for each sequence belonging to the cluster)]  [Part of annotation]
(Individual information of the sequences belonging to the cluster is given.)
…
(List of related groups)
  Related groups
    [Related cluster ID]([Number of sequences belonging to the cluster ID on left]): [Sequence ID 0]
  END Related groups
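
The "Group ..." header line shown in the format above can be picked out with a regular expression. The pattern below is derived only from that template and has not been checked against the actual file, so treat it as a starting point. The threshold is kept as a string because its exact numeric notation is not stated here.

```python
import re

# Pattern built from the template "Group [Cluster ID]: [N] sequences. Final thr = [Threshold]".
GROUP_HEADER = re.compile(
    r"^Group\s+(?P<cluster_id>\S+):\s+(?P<count>\d+)\s+sequences\.\s+"
    r"Final thr\s*=\s*(?P<threshold>\S+)"
)

def parse_group_header(line):
    """Return (cluster_id, sequence_count, threshold_text), or None if not a header."""
    match = GROUP_HEADER.match(line.strip())
    if match is None:
        return None
    return match.group("cluster_id"), int(match.group("count")), match.group("threshold")

# Hypothetical lines in the documented layout.
header = parse_group_header("Group 12: 3 sequences. Final thr =   1.0e-20")
other = parse_group_header("  Related groups")
print(header, other)
```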

2.11 Table of Cluster and Organism Species Number

Data name Table of Cluster and Organism Species Number
Description of data contents

A tab-delimited text table listing, for each cluster, the cluster ID, the representative sequence ID and its length, the number of sequences contained in the cluster, and the number of sequences belonging to the cluster for each of the 95 organism species.

File all95.tbl.zip (4.53MB)
Data items are the following:
Number Cluster ID
ID Sequence ID
Length Sequence length
seqs Number of sequences belonging to cluster
homologs Number of sequences belonging to cluster
ATH Number of sequences belonging to cluster in the Arabidopsis thaliana sequence
OSA Number of sequences belonging to cluster in the Oryza sativa sequence
PoTR Number of sequences belonging to cluster in the Populus trichocarpa sequence
PPT Number of sequences belonging to cluster in the Physcomitrella patens sequence
CRE Number of sequences belonging to cluster in the Chlamydomonas reinhardtii sequence
OTAU Number of sequences belonging to cluster in the Ostreococcus tauri sequence
CME Number of sequences belonging to cluster in the Cyanidioschyzon merolae sequence
GTH Number of sequences belonging to cluster in the Guillardia theta sequence
PFA Number of sequences belonging to cluster in the Plasmodium falciparum sequence
PTR Number of sequences belonging to cluster in the Phaeodactylum tricornutum sequence
TPS Number of sequences belonging to cluster in the Thalassiosira pseudonana sequence
Ter Number of sequences belonging to cluster in the Trichodesmium erythraeum 405 1 sequence
Ana Number of sequences belonging to cluster in the Anabaena sp. PCC 7120 sequence
Ava Number of sequences belonging to cluster in the Anabaena variabilis ATCC 29413 sequence
Npun Number of sequences belonging to cluster in the Nostoc punctiforme sp. PCC73102 sequence
Syn Number of sequences belonging to cluster in the Synechocystis sp. PCC 6803 sequence
Glv Number of sequences belonging to cluster in the Gloeobacter violaceus sequence
Tel Number of sequences belonging to cluster in the Thermosynechococcus elongatus sequence
YelA Number of sequences belonging to cluster in the Cyanobacterium Yellowstone A-prime sequence
YelB Number of sequences belonging to cluster in the Cyanobacterium Yellowstone B-prime sequence
S63 Number of sequences belonging to cluster in the Synechococcus sp. PCC 6301 sequence
S79 Number of sequences belonging to cluster in the Synechococcus sp. PCC 7942 sequence
S81 Number of sequences belonging to cluster in the Synechococcus sp. WH8102 sequence
S93 Number of sequences belonging to cluster in the Synechococcus sp. CC9311 sequence
S96 Number of sequences belonging to cluster in the Synechococcus sp. CC9605 sequence
S99 Number of sequences belonging to cluster in the Synechococcus sp. CC9902 sequence
Pm1 Number of sequences belonging to cluster in the Prochlorococcus marinus MED4 sequence
Pm2 Number of sequences belonging to cluster in the Prochlorococcus marinus MIT9313 sequence
Pm3 Number of sequences belonging to cluster in the Prochlorococcus marinus SS120 sequence
Pm4 Number of sequences belonging to cluster in the Prochlorococcus marinus MIT9312 sequence
Pm5 Number of sequences belonging to cluster in the Prochlorococcus marinus NATL2A sequence
Pm6 Number of sequences belonging to cluster in the Prochlorococcus marinus MIT9301 sequence
Pm7 Number of sequences belonging to cluster in the Prochlorococcus marinus MIT9303 sequence
Pm8 Number of sequences belonging to cluster in the Prochlorococcus marinus MIT9315 sequence
Pm9 Number of sequences belonging to cluster in the Prochlorococcus marinus NATL1A sequence
PmA Number of sequences belonging to cluster in the Prochlorococcus marinus AS9601 sequence
Atu Number of sequences belonging to cluster in the Agrobacterium tumefaciens str. C58 sequence
Bja Number of sequences belonging to cluster in the Bradyrhizobium japonicum USDA 110 sequence
Bms Number of sequences belonging to cluster in the Brucella suis 1330 sequence
Ccr Number of sequences belonging to cluster in the Caulobacter crescentus CB15 sequence
Gox Number of sequences belonging to cluster in the Gluconobacter_oxydans_621H sequence
Mes Number of sequences belonging to cluster in the Mesorhizobium sp. BNC1 sequence
Mlo Number of sequences belonging to cluster in the Mesorhizobium loti MAFF303099 sequence
Rhe Number of sequences belonging to cluster in the Rhizobium_etli_CFN_42 sequence
Rle Number of sequences belonging to cluster in the Rhizobium leguminosarum sequence
Sme Number of sequences belonging to cluster in the Sinorhizobium meliloti 1021 sequence
Rpa1 Number of sequences belonging to cluster in the Rhodopseudomonas palustris BisA53 sequence
Rpa2 Number of sequences belonging to cluster in the Rhodopseudomonas palustris BisB4 sequence
Rpa3 Number of sequences belonging to cluster in the Rhodopseudomonas palustris BisB18 sequence
Rpa4 Number of sequences belonging to cluster in the Rhodopseudomonas palustris HaA2 sequence
Rpal Number of sequences belonging to cluster in the Rhodopseudomonas palustris sequence
Rrub Number of sequences belonging to cluster in the Rhodospirillum rubrum ATCC 11170 sequence
Rde Number of sequences belonging to cluster in the Roseobacter denitrificans Och 114 sequence
Rsh Number of sequences belonging to cluster in the Rhodobacter sphaeroides ATCC 17029 sequence
Rsp Number of sequences belonging to cluster in the Rhodobacter sphaeroides 2.4.1 sequence
Eco Number of sequences belonging to cluster in the Escherichia coli K-12 sequence
Pst Number of sequences belonging to cluster in the Pseudomonas syringae pv. tomato str. DC3000 sequence
Vvy Number of sequences belonging to cluster in the Vibrio vulnificus YJ016 sequence
Bsu Number of sequences belonging to cluster in the Bacillus subtilis Marburg 168 sequence
Sep Number of sequences belonging to cluster in the Staphylococcus epidermidis ATCC 12228 sequence
Fal Number of sequences belonging to cluster in the Frankia alni ACN14a sequence
Fra Number of sequences belonging to cluster in the Frankia sp. CcI3 sequence
Mtu Number of sequences belonging to cluster in the Mycobacterium tuberculosis H37Rv sequence
Sco Number of sequences belonging to cluster in the Streptomyces coelicolor A3(2) sequence
Rso Number of sequences belonging to cluster in the Ralstonia solanacearum GMI1000 sequence
Cvi Number of sequences belonging to cluster in the Chromobacterium violaceum ATCC 12472 sequence
Bma Number of sequences belonging to cluster in the Burkholderia mallei ATCC 23344 sequence
Bpe Number of sequences belonging to cluster in the Bordetella pertussis Tohama I sequence
Eba Number of sequences belonging to cluster in the Azoarcus sp EbN1 sequence
Caur Number of sequences belonging to cluster in the Chloroflexus aurantiacus sequence
Cch Number of sequences belonging to cluster in the Chlorobium chlorochromatii CaD3 sequence
Clim Number of sequences belonging to cluster in the Chlorobium limicola DSM 245 sequence
Cph Number of sequences belonging to cluster in the Chlorobium phaeobacteroides DSM 266 sequence
Ctep Number of sequences belonging to cluster in the Chlorobium tepidum sequence
Pvi Number of sequences belonging to cluster in the Prosthecochloris vibrioformis DSM 265 sequence
Afu Number of sequences belonging to cluster in the Archaeoglobus fulgidus DSM 4304 sequence
Hal Number of sequences belonging to cluster in the Halobacterium sp. NRC-1 sequence
Mac Number of sequences belonging to cluster in the Methanosarcina acetivorans str. C2A sequence
Pho Number of sequences belonging to cluster in the Pyrococcus horikoshii OT3 sequence
Ape Number of sequences belonging to cluster in the Aeropyrum pernix K1 sequence
Sto Number of sequences belonging to cluster in the Sulfolobus tokodaii str. 7 sequence
Neq Number of sequences belonging to cluster in the Nanoarchaeum equitans Kin4-M sequence
SCE Number of sequences belonging to cluster in the Saccharomyces cerevisiae sequence
SPO Number of sequences belonging to cluster in the Schizosaccharomyces pombe sequence
PHRA Number of sequences belonging to cluster in the Phytophthora ramorum sequence
PHSO Number of sequences belonging to cluster in the Phytophthora sojae sequence
DCGR Number of sequences belonging to cluster in the Candida glabrata CBS138 sequence
DKLA Number of sequences belonging to cluster in the Kluyveromyces lactis NRRL Y-1140 sequence
NCR Number of sequences belonging to cluster in the Neurospora crassa 74-OR23-1A sequence
DPTM Number of sequences belonging to cluster in the Paramecium tetraurelia sequence
TET Number of sequences belonging to cluster in the Tetrahymena thermophila SB210 sequence
NGR Number of sequences belonging to cluster in the Naegleria gruberi sequence
HSA Number of sequences belonging to cluster in the Homo sapiens sequence
DME Number of sequences belonging to cluster in the Drosophila melanogaster sequence
CEL Number of sequences belonging to cluster in the Caenorhabditis elegans sequence
Annotations Annotation
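
The per-species counts in this table can be combined into a per-group percentage like the appearance-rate columns of gclust_cluster. The sketch below takes one possible reading of "appearance rate": the share of species in a group that have at least one sequence in the cluster. That definition, the group membership passed in, and the sample counts are all assumptions for illustration.

```python
def appearance_rate(counts, group_prefixes):
    """Percentage of species in group_prefixes with at least one sequence in the cluster.

    counts maps an organism prefix to the number of sequences in the cluster.
    """
    if not group_prefixes:
        raise ValueError("group_prefixes must not be empty")
    present = sum(1 for prefix in group_prefixes if counts.get(prefix, 0) > 0)
    return 100.0 * present / len(group_prefixes)

# Hypothetical counts for one cluster and a made-up three-species group.
row = {"ATH": 2, "OSA": 1, "CME": 0, "Eco": 4}
rate = appearance_rate(row, ["ATH", "OSA", "CME"])
print(round(rate, 1))
```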

3. License

The Standard License specifies the license terms regarding the use of this database and the requirements you must follow in using this database.
The Additional License specifies those items that are exceptionally permitted even though they are generally prohibited in the Standard License.

3.1 Standard License

The Standard License for this database is the license specified in the Creative Commons Attribution-Share Alike 2.1 Japan.
If you use data from this database, please be sure to attribute this database as follows: "Gclust Server, Copyright© 2008-2009 Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo licensed under CC Attribution-Share Alike 2.1 Japan".

The summary of the Creative Commons Attribution-Share Alike 2.1 Japan is found here.

With regard to this database, you are licensed to:

  1. freely access part or whole of this database, and acquire data;
  2. freely redistribute part or whole of the data from this database; and
  3. freely create and distribute database and other derivative works based on part or whole of the data from this database,

under the Standard License, as long as you comply with the following conditions:

  1. You must attribute this database in the manner specified by the author or licensor when distributing part or whole of this database or any derivative work.
  2. You must distribute any derivative work based on part or whole of the data from this database under this License.

3.2 Additional License

1. You must display this Additional License along with the Standard License when distributing any derivative work based on part or whole of the data from this database.

2. When you conduct research by using this database, and describe the research results in an article or paper, you always need to cite this database, and specify the name and URL of this database in the article or paper.

3. You need to contact the Licensor shown below to request a license for use of this database or any part thereof not licensed under the Standard License and the above Additional License.

Naoki Sato
Laboratory of Plant Functional Genomics, Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo
E-Mail: naokisat[at]bio[dot]c[dot]u-tokyo[dot]ac[dot]jp

3.3 About Providing Links to This Database

You can freely provide links to all contents in this database. However, the contents may change without notice.


4. Update History

Date Update contents
2010/03/29 Gclust Server English archive site is opened.
2009/8 Data is updated.
2006/6 Gclust Server(http://gclust.c.u-tokyo.ac.jp/) is released.

5. Literature

Naoki Sato
Gclust: trans-kingdom classification of proteins using automatic individual threshold setting.
Bioinformatics 2009 Mar 1;25(5):599-605.
PMID: 19158159

6. Contact address

If you have any questions about "Gclust Server", contact the following:

Naoki Sato
Laboratory of Plant Functional Genomics, Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo
E-Mail: naokisat[at]bio[dot]c[dot]u-tokyo[dot]ac[dot]jp
