Proteomics

PROTEOMICS IS THE STUDY OF THE PROTEINS ENCODED BY GENES

All proteins are formed from 20 different common amino acids that are joined by the hundreds and even thousands in various combinations to form long beaded chains that are twisted into thousands of unique shapes. The amino acid sequence of a protein is specified by the nucleotide sequence of the gene that encodes it. The amino acid sequence of a protein ultimately dictates its three-dimensional (3D) structure, which influences its biomolecular interactions and specialized functions. As a consequence of decades of careful research, the appearance of certain common patterns of short amino acid sequence motifs can now be used to predict the general functions of many proteins. Moreover, the 3D structures of thousands of proteins have now been deduced by X-ray crystallographic studies and other methods.
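How a nucleotide sequence specifies an amino acid sequence can be illustrated with a minimal Python sketch. It assumes the standard genetic code, a DNA coding strand read in frame from its first base, and one-letter amino acid codes; the function name translate and the example sequence are illustrative only and do not come from any particular tool.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding-strand DNA sequence, stopping at the first stop codon.

    Hypothetical helper: ambiguous bases such as N are not handled and
    will raise a KeyError.
    """
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))  # Met-Ala-Lys, then a stop codon
```

Real genes add complications this sketch ignores, such as introns, alternative reading frames and non-standard genetic codes.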

Knowledge of the primary structure of a given protein predicted from its gene sequence can provide some clues as to its specific function. However, protein function is dictated not only by the protein's shape, but also by the context in which the protein must work. The same protein may have multiple functions that depend upon which other interacting proteins are also present. These other proteins may serve as regulators or targets for functional expression of the protein. Thus, it is of great importance to track the wide-scale distribution of proteins to home in on their cell-specific functions. This underlies the importance of systems biology approaches to the investigation of proteins.

With only a few exceptions (e.g., germ-line cells, red blood cells and tumour cells), almost all of the two hundred or so different specialized types of cells in the human body share the same genes. However, they differ profoundly with respect to which of these genes are actively turned on to produce proteins. The term "proteome" has been adopted to specifically describe the unique complement of proteins that reside in a cell.

Mapping the proteomes of humans and other organisms will be several orders of magnitude more difficult than sequencing their genomes. More than a third of all the genes may be actively expressed in a typical cell. Many genes can specify the synthesis of multiple proteins through alternative splicing to generate slightly different mRNA copies during gene transcription. Furthermore, most proteins subsequently undergo extensive post-translational modifications. Several hundred different types of covalent modifications of proteins have been discovered, including phosphorylation, glycosylation, sulphation, methylation, acetylation, myristoylation, palmitoylation, and isoprenylation. These covalent modifications can have profound effects on the activities, functional interactions and locations of proteins within cells. It is likely that, on average, each gene may specify a hundred or more protein variants that arise from alternative splicing and covalent modification. Therefore, the number of potentially distinct protein entities in the human proteome is probably several million. Apart from the staggering multitude of different potential protein species within any cell, another major issue is the very dynamic nature of the proteome. A cell's protein composition varies markedly with cell type, gender, age, health and environmental conditions. Consequently, the goal of identifying specific biomarkers of human disease is extremely challenging and requires very broad-based screening approaches and large amounts of clinical data.
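The estimate of several million protein entities follows from simple multiplication, sketched below in Python. The gene count is an assumption and is not stated above: roughly 20,000 human protein-coding genes is a commonly quoted figure. The variants-per-gene value is the "hundred or more" given in the text, so the result is best read as a lower bound.

```python
# Assumption (not from the text above): approximate number of human
# protein-coding genes.
ASSUMED_GENE_COUNT = 20_000

# From the text: each gene may specify a hundred or more protein variants.
VARIANTS_PER_GENE = 100


def estimate_protein_entities(gene_count, variants_per_gene):
    """Rough count of distinct protein entities in a proteome."""
    return gene_count * variants_per_gene


total = estimate_protein_entities(ASSUMED_GENE_COUNT, VARIANTS_PER_GENE)
print(f"Roughly {total:,} potentially distinct protein entities")
```

Under these assumptions the product is two million, which is consistent with an order of magnitude of several million once the "or more" is taken into account.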

Antibodies have proven to be the most specific and reliable probes available to track the expression and covalent modifications of proteins. Kinexus has screened over 5000 of the world’s best commercial antibodies and incorporated them into its proteomics services to permit the broad analysis of signal transduction protein levels and phosphorylation. In addition, Kinexus has developed over 1600 of its own antibodies to support these research efforts.

Learn More About Proteomics

Basic Local Alignment Search Tool

Host: National Center for Biotechnology Information, Bethesda, MD, USA

Features: BLAST finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

ClustalW

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: ClustalW is a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be viewed as cladograms or phylograms.

The Dali Server

Host: EMBL EBI and Institute of Biotechnology, University of Helsinki, Finland

Features: The Dali server is a network service for comparing protein structures in 3D.

NCBI Entrez Conserved Domain Database

Host: National Center for Biotechnology Information, Bethesda, MD, USA

Features: The Entrez CDD can be searched to identify proteins that share a conserved interaction domain. It features a collection of multiple sequence alignments for ancient domains and full-length proteins.

NCBI Entrez Gene Database

Host: National Center for Biotechnology Information, Bethesda, MD, USA

Features: Entrez Gene is a searchable database of genes from RefSeq genomes that are defined by sequence and/or located in the NCBI Map Viewer. It does not include all known or predicted genes; instead, Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis. The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is updated as new information becomes available.

NCBI Entrez Protein Database

Host: National Center for Biotechnology Information, Bethesda, MD, USA

Features: The protein entries in the Entrez search and retrieval system have been compiled from a variety of sources, including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.

Exact Antigen Antibody Resource

Features: A curated database of 22,000 monoclonal antibody products, hundreds of thousands of product information pages submitted by reagent providers, millions of webpages selected from all 700 reagent suppliers and over 200,000 bioscience-related websites. Antibodies are organized according to genes, species, reagent types (antibodies, phospho-specific antibodies, recombinant proteins, ELISA, siRNA, etc.), patents, and researchers.

Gene Ontology

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: The Gene Ontology (GO) Consortium is an international collaboration among scientists at various biological databases, with an Editorial Office based at the EBI. GOA is a project that aims to provide assignments of gene products to the Gene Ontology (GO) resource. The objective of GO is to provide controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products. These terms are to be used as attributes of gene products by collaborating databases, facilitating uniform queries across them.

Gene Ontology Annotation (GOA) Database

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: The GOA project aims to provide high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and International Protein Index (IPI) and is a central dataset for other major multi-species databases, such as Ensembl and NCBI.

NCBI Homologene

Host: National Center for Biotechnology Information, Bethesda, MD, USA

Features: HomoloGene is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes.

HPR Atlas

Host: Albanova University Center at the Royal Institute of Technology (KTH, Stockholm) and the Rudbeck Laboratories (Uppsala University), Sweden

Features: The HPR atlas has been created to show the expression and localization of proteins in a large variety of normal human tissues and cancer cells. The data is presented as high resolution images representing immunohistochemically stained tissue sections. Available proteins (genes) can be reached through a specific search (by gene/protein name/id or classification, such as kinase or protease) or by browsing the individual chromosomes.

Human Protein Reference Database

Host: Pandey Lab and Institute of Bioinformatics at Johns Hopkins University, Baltimore, MD, USA

Features: HPRD represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome.

HUPO Human Protein Atlas

Host: Human Proteome Organization Initiative Based in Stockholm and Uppsala, Sweden

Features: The Human Protein Atlas displays expression and localization of proteins in a large variety of normal human tissues and cancer cells. The data is publicly available and presented as high resolution images of immunohistochemically stained tissues and cell lines with over 1500 antibodies and over 1.2 million images.

InterAction

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: IntAct is a protein interaction database and analysis system. It provides a query interface and modules to analyse interaction data.

Integrated relational Enzyme database

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK and Swiss Institute of Bioinformatics (SIB)

Features: The Integrated relational Enzyme database (IntEnz) contains enzyme data approved by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) on the nomenclature and classification of enzyme-catalysed reactions.

Integrated Resources of Proteins Domains and Functional Sites

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.

Integrated Protein Classification Database

Host: Protein Information Resource located at Georgetown University Medical Center

Features: iProClass is an integrated resource that provides comprehensive family relationships and structural/functional features of proteins.

Ontology Lookup Service

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: The Ontology Lookup Service (OLS) is a spin-off of the PRIDE project, which required a centralized query interface for ontology and controlled vocabulary lookup. While many of the ontologies queryable by the OLS are available online, each has its own query interface and output format.

Protein and Associated Nucleotide Domains with Inferred Trees

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: PANDIT - Protein and Associated Nucleotide Domains with Inferred Trees. PANDIT is a collection of multiple sequence alignments and phylogenetic trees covering many common protein domains.

Protein Data Bank

Host: Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Piscataway, NJ, USA

Features: PDB is the single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.

PDBsum

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: PDBsum provides an at-a-glance overview of every macromolecular structure deposited in the Protein Data Bank (PDB), giving schematic diagrams of the molecules in each structure and of the interactions between them.

Protein families database of alignments and HMMs

Host: Wellcome Trust Sanger Institute, Hinxton, UK

Features: Pfam is a large collection of over 8000 multiple sequence alignments and hidden Markov models covering many common protein domains and families. Each family in Pfam can be examined for multiple alignments, protein domain architectures and structures and species distribution.

Phosphorylation Site Database

Host: Dept. of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany

Features: PHOSIDA (PHOsphorylation SIte Database) allows retrieval of phosphorylation data of any protein of interest. It lists phosphorylation sites associated with particular projects and proteomes or, alternatively, displays phosphorylation sites found for any protein or protein group of interest.

Phospho.ELM

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK, Cellzome and others

Features: The Phospho.ELM database contains a collection of experimentally verified serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins. The entries, manually annotated and based on scientific literature, provide information about the phosphorylated proteins and the exact position of known phosphorylated instances.

PhosphoSite

Host: Cell Signaling Technology Company, Beverly, MA, USA

Features: PhosphoSite contains a comprehensive list of known human and mouse protein phosphorylation sites with extensive supporting information.

Protein Information Resource

Host: Georgetown University Medical Center, Washington, DC, USA

Features: The Protein Information Resource (PIR) is an integrated public bioinformatics resource to support genomic and proteomic research, and scientific studies. PIR has made many protein databases and analysis tools freely accessible to the scientific community.

Proteomics IDEntifications Database

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK and Ghent University in Belgium

Features: The PRIDE (PRoteomics IDEntifications) database is a centralized, standards-compliant, public data repository for proteomics data. It has been developed to provide the proteomics community with a public repository for protein and peptide identifications together with the evidence supporting these identifications. PRIDE is able to capture details of post-translational modifications coordinated relative to the peptides in which they have been found.

Database of Protein Families and Domains

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: Prosite is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. Users can enter a protein sequence or find out the characteristic motifs of domains.
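A pattern-based search of this kind can be sketched in Python. The converter below handles only the core elements of the PROSITE pattern syntax (x for any residue, square brackets for allowed residues, curly braces for excluded residues, parentheses for repeat counts, and the angle-bracket terminal anchors). The function names are hypothetical, and the sketch does not cover PROSITE profiles or the database's own scanning software. The example pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        element = element.replace("x", ".")                      # any residue
        element = element.replace("{", "[^").replace("}", "]")   # excluded residues
        element = element.replace("(", "{").replace(")", "}")    # repeat counts
        parts.append(element)
    return "".join(parts)


def find_sites(sequence, pattern):
    """Return (1-based position, matched text) for every match, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("MNGTANPSQ", "N-{P}-[ST]-{P}"))
```

The lookahead wrapper is a design choice that lets overlapping sites be reported; a plain search would skip any site that begins inside an earlier match.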

ScanSite

Host: Massachusetts Institute of Technology and Beth Israel Deaconess Medical Center, MA, USA, and St. Jude Children's Research Hospital, Memphis, TN, USA

Features: Scansite searches for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains.

Simple Modular Architecture Research Tool

Host: European Molecular Biology Lab - Heidelberg, Germany

Features: SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

SWISS-2DPAGE Two-dimensional polyacrylamide gel electrophoresis database

Host: Swiss Institute of Bioinformatics, Geneva, Switzerland

Features: SWISS-2DPAGE contains data on proteins identified on various 2-D PAGE and SDS-PAGE reference maps. Users can locate these proteins on the 2-D PAGE maps or display the region of a 2-D PAGE map where one might expect to find a protein from Swiss-Prot.

Swiss Protein Database

Host: Swiss Institute of Bioinformatics, Geneva, Switzerland

Features: SwissProt is a curated protein sequence database provided by the ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB). It strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.

TIGR Protein Families

Host: The Institute for Genomic Research, Rockville, MD, USA

Features: TIGRFAMs are protein families based on Hidden Markov Models or HMMs. Use this page to see the curated seed alignment for each TIGRFAM, the full alignment of all family members and the cutoff scores for inclusion in each of the TIGRFAMs. Also use this page to search through the TIGRFAMs and HMMs for text in the TIGRFAMs Text Search or search for specific sequences in the TIGRFAMs Sequence Search.

Transcriptions - The Music of Protein Sequences

Host: Texas Wesleyan University, Fort Worth, TX, USA

Features: “A Protein Primer” makes music from protein sequences by assigning increasing pitch to amino acids in order of increasing hydrophobicity; the duration of each note is set by the number of codons that code for the amino acid.
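The general idea of such a mapping can be sketched in Python. This is not the scheme used by "A Protein Primer": the hydrophobicity scale (Kyte-Doolittle), the use of MIDI note numbers starting at 60, and the choice to make duration equal to the codon count are all assumptions made for illustration, and the function name protein_to_notes is hypothetical.

```python
# Assumption: Kyte-Doolittle hydropathy values; the scale used by the
# original site is not stated.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Number of codons encoding each amino acid in the standard genetic code.
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

BASE_MIDI_NOTE = 60  # assumption: the least hydrophobic residue maps to middle C

# Rank each distinct hydrophobicity value; tied residues share a pitch.
_RANK = {
    value: rank
    for rank, value in enumerate(sorted(set(KYTE_DOOLITTLE.values())))
}


def protein_to_notes(sequence):
    """Return (residue, MIDI pitch, duration in beats) for each residue."""
    notes = []
    for residue in sequence.upper():
        pitch = BASE_MIDI_NOTE + _RANK[KYTE_DOOLITTLE[residue]]
        notes.append((residue, pitch, CODON_COUNT[residue]))
    return notes


for note in protein_to_notes("MKI"):
    print(note)
```

Under these assumptions, more hydrophobic residues such as isoleucine sound higher, and residues with many codons such as leucine are held longer.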

Universal Protein Resource

Host: European Molecular Biology Lab - European Bioinformatics Institute, Hinxton, UK

Features: UniProt (Universal Protein Resource) is the central hub for the collection of protein sequences and functional information on proteins, with accurate, consistent, and rich annotation covering the amino acid sequence, protein name or description, taxonomic data and citation information. It is a central repository of protein sequence and function created by joining the information contained in UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, and PIR.