(The FASEB Journal. 2004;18:8-30.)
© 2004 FASEB
A genomic perspective on protein tyrosine phosphatases: gene structure, pseudogenes, and genetic disease linkage
JANNIK N. ANDERSEN, PETER G. JANSEN*, SØREN M. ECHWALD, OLE H. MORTENSEN, TOSHIYUKI FUKADA, ROBERT DEL VECCHIO, NICHOLAS K. TONKS1, and NIELS PETER H. MØLLER1
Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA;
* Scientific Computing, Novo Nordisk, DK-2760 Måløv, Denmark;
Exiqon, DK-2950 Vedbæk, Denmark; and
Signal transduction, Novo Nordisk, DK-2880 Bagsværd, Denmark
1Correspondence: N.P.H.M., Novo Nordisk, Bldg. 6A1.086, Signal Transduction, DK-2880 Bagsværd, Denmark; E-mail: nphm{at}novonordisk.com and N.K.T., Cold Spring Harbor Laboratory, Demerec Bldg., 1 Bungtown Rd., Cold Spring Harbor, NY 11724-2208, USA; E-mail: tonks{at}cshl.edu
ABSTRACT
The protein tyrosine phosphatases (PTPs) are now recognized as critical regulators of signal transduction under normal and pathophysiological conditions. In this analysis we have explored the sequence of the human genome to define the composition of the PTP family. Using public and proprietary sequence databases, we discovered one novel human PTP gene and defined chromosomal loci and exon structure of the additional 37 genes encoding known PTP transcripts. Direct orthologs were present in the mouse genome for all 38 human PTP genes. In addition, we identified 12 PTP pseudogenes unique to humans that have probably contaminated previous bioinformatics analysis of this gene family. PCR amplification and transcript sequencing indicate that some PTP pseudogenes are expressed, but their function (if any) is unknown. Furthermore, we analyzed the enhanced diversity generated by alternative splicing and provide predicted amino acid sequences for four human PTPs that are currently defined by fragments only. Finally, we correlated each PTP locus with genetic disease markers and identified 4 PTPs that map to known susceptibility loci for type 2 diabetes and 19 PTPs that map to regions frequently deleted in human cancers. We have made our analysis available at http://ptp.cshl.edu or http://science.novonordisk.com/ptp and we hope this resource will facilitate the functional characterization of these key enzymes.
Andersen, J. N., Jansen, P. G., Echwald, S. M., Mortensen, O. H., Fukada, T., Del Vecchio, R., Tonks, N. K., Møller, N. P. H. A genomic perspective on protein tyrosine phosphatases: gene structure, pseudogenes, and genetic disease linkage.
Key Words: genome analysis; protein classification; PTP exon structure; PTP gene family; processed pseudogenes; PTPs and cancer; alternative splicing
INTRODUCTION
PROTEIN TYROSINE PHOSPHATASES (PTPs) are critical regulators of signal transduction. In conjunction with the protein tyrosine kinases (PTKs), they regulate the reversible phosphorylation of tyrosine residues in proteins and thereby control such fundamental physiological processes as cell growth and differentiation, cell cycle, metabolism, and cytoskeletal function (reviewed in ref 1). Furthermore, interference with the delicate balance between counteracting PTKs and PTPs has been shown to be involved in the development of human diseases such as autoimmunity, diabetes, and cancer (reviewed in refs 1-4). Defined by the signature motif C(x)5R, PTPs can be divided into two major categories: the tyrosine-specific or classical PTPs, typified by the prototypic member PTP1B in which the signature motif is "(I/V)HCSxGxGR(S/T)G"; and the dual specificity phosphatases (DSPs), which can accommodate the dephosphorylation of tyrosine, serine, and threonine residues, in addition to inositol phospholipids, in their active site. In this review we have focused on the classical, tyrosine-specific PTPs. An analysis of the DSPs, which show greater sequence diversity, is being conducted separately.
The total number of genes in the human genome has been debated extensively. Based on extrapolations from the number of expressed sequence tags (ESTs) (5), the original expectation was much higher than the current estimate of 24,500-37,000 human genes (6-9). Initially, cDNA sequence information had revealed transcripts corresponding to 37 human PTP genes (10, 11), but it was expected that the total number of genes encoding classical PTPs would be considerably higher. The bioinformatics-based identification of 90 sequences with the PTP signature motif in the genome of C. elegans (12) lent support to the view that additional PTPs would be present in the more complex human genome. The first bioinformatics analysis of the draft human genome from the International Human Genome Sequencing Consortium reported the presence of 112 genes classified as tyrosine-specific and dual specificity PTPs (6). In contrast, analysis from Celera reported the presence of 56 tyrosine-specific human PTPs (7). In both cases, the sequence and the nature of these PTPs remain elusive. Therefore, there is a need for a comprehensive analysis and expert gene annotation (manual review) of these key signal-transducing enzymes using the essentially complete version of the human genome sequence (Build 33).
Here, for the first time, we have catalogued the classical PTPs of the human genome and conducted a comparative exon structure analysis of this gene family. Our study provides the foundation for disease association studies and for studies of the genetic elements that control PTP expression in various cells (e.g., analysis of promoter elements and alternative splice sites). The present definition of the PTP gene family is reviewed in the broader context of their amino acid sequences, 3-dimensional structures, chromosomal location, and disease loci. The analysis also provides insight into the evolutionary history of these enzymes as well as the current state of human genome sequence analysis. We have made all results and databases available at our web sites (http://ptp.cshl.edu or http://science.novonordisk.com/ptp) and hope this resource may serve as a platform for future studies of this important protein family.
IDENTIFICATION AND CHROMOSOMAL MAPPING OF PTP GENES
To define the composition of the PTP family of proteins, we mapped all PTP-like sequences in the human genome by analyzing raw genomic, cDNA, and EST sequences deposited in GenBank. The hits from this search were confirmed in both the public and private genome assemblies using the UCSC and Celera Genome browsers, respectively. We identified 38 PTP-encoding genes, including one putative novel human PTP, and have extended the protein sequences of four PTPs that are currently defined by fragmentary sequences. In addition, we refined the exon structure of 14 PTP domains for which automated gene annotation programs have encountered difficulty.
SEARCHING THE HUMAN GENOME FOR PTP SEQUENCES
To identify the genomic complement of the PTP family, we first generated a list of unique human PTP domains from our nonredundant database of vertebrate PTP transcripts published elsewhere (11). These protein sequences [37 PTP catalytic domains and 12 membrane distal domains from tandem domain receptor-like PTPs (RPTPs)] were searched against the six translated reading frames of the public human genome (draft-quality and finished sequences) using the BLAST heuristic algorithm and software developed at Novo Nordisk, as described in detail elsewhere (13). This homology search retrieved 295 unique accession numbers. Each genomic sequence was then compared with our local database of human PTP domains, and alignments were generated to identify perfect matches and novel PTP-like sequences. These alignments, which revealed the nature and extent of PTP homology (including exon-intron boundaries of known PTPs), allowed us to classify the 295 genomic clones as containing either 1) known PTPs, 2) novel sequences with exon structure similar to known PTPs, 3) PTP pseudogenes (based on the presence of frameshift mutations, in-frame stop codons, or lack of apparent exon structure), 4) DSPs, or 5) false positives (i.e., other genomic sequence).
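The idea of searching "the six translated reading frames" can be illustrated with a minimal Python sketch. It is not the BLAST-based pipeline used in this study: it simply translates a DNA string in all six frames and scans each translation for the classical PTP signature motif quoted in the Introduction, (I/V)HCSxGxGR(S/T)G. The function names and the regular expression encoding of the motif are assumptions made for illustration only.

```python
import re

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Signature motif from the text, (I/V)HCSxGxGR(S/T)G, written as a regex
# (illustrative encoding; "x" is taken to mean any single residue).
SIGNATURE = re.compile(r"[IV]HCS.G.GR[ST]G")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate one reading frame; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_hits(dna):
    """Yield (strand, frame, protein_offset, peptide) for every motif match."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            for match in SIGNATURE.finditer(protein):
                yield strand, frame, match.start(), match.group()
```

A motif scan of this kind only flags candidate loci; distinguishing genes from pseudogenes still requires the alignment and exon-structure review described above.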
All previously catalogued PTP cDNAs (11) could be mapped onto the genome (Fig. 1 and Table 1), consistent with the essentially complete coverage of the human genome sequence (Build 33). We also mapped 1 novel PTP and 12 PTP-like sequences (Fig. 1 and Table 1). The sequence of the novel human PTP, termed PTP-OST, was assigned to chromosome 1q32.1, a region syntenic to the locus for rat osteotesticular PTP (OST-PTP) (14) and mouse embryonic stem cell phosphatase (PTP-ESP) (15). All mapping results were correlated with published in situ hybridization data and a consensus chromosomal location defined (Table 1). Furthermore, we provide cross references for all PTP loci to their protein, transcript, and genomic sequences and to various annotated gene records (GeneCard, LocusLink, Unigene, euGene, Ensembl, GDB, and OMIM) because these data sources tend to contain mutually complementary information (Table 2). The complete set of PTP sequences, including their genomic annotation, is available as hyperlinked databases (Table S1 and Table S2) at our web sites. Finally, we identified orthologs in the mouse genome (Build 30) (16) for all 38 human PTP genes, supporting the use of the laboratory mouse as an animal model of human biology and disease (Table S2, web sites only). A dendrogram of these PTP sequences, including 34 rat PTP transcripts, documents the ortholog relationships and provides an overview of PTP gene symbols [assigned by the Human Gene Nomenclature Committee (17)] and PTP protein names commonly used in the PubMed literature (Fig. 2).

Figure 2. Dendrogram of PTP domains showing ortholog relationships and PTP nomenclature. The 38 human PTP genes were analyzed by aligning their PTP "catalytic" domains (residue 1 to 279, PTP1B numbering) with the 38 mouse ortholog sequences and 34 rat transcripts identified in this study, and an unrooted tree was drawn by the neighbor-joining method. Human PTP gene symbols (blue) and protein names are detailed in Table 1, and accession numbers for the rodent sequences are available on our web sites (http://ptp.cshl.edu and http://science.novonordisk.com/ptp). The horizontal distance in the dendrogram indicates degree of sequence divergence (the greater the distance, the greater the divergence) and the scale at the top corner is the distance equivalent to 10 substitutions per 100 amino acids. The 17 PTP domain subtypes are 9 nontransmembrane subtypes (NT1-NT9), 5 tandem receptor-like subtypes (R1/R6, R2A, R2B, R4, R5), and 3 single domain receptor-like PTP subtypes (R3, R7, and R8). As a statistical test of the significance of sequence similarity within PTP subtypes, bootstrap values were calculated (values indicated at the dendrogram node, the maximal value being 1000) and support the classification. A nonredundant set of 234 vertebrate PTP domain sequences can be retrieved from our web site, including multiple sequence alignments and dendrograms comprising D2 domains.
The present chromosomal mapping considerably refines the physical location of six human PTP genes (LAR, PTP
, PTPD2, PTP
, PTP
, and PTP
) and allows assignment of the chromosomal location for five PTPs (MEG1, BDP1, PTPTyp, PTPD1, and MEG2) that have not been mapped experimentally (Table 1). For the remaining PTPs, in silico mapping matches the published data, thus documenting that the current cytogenetic annotation of the genome is accurate (18) and can be used to link the position of PTPs with specific disease markers.
NONRANDOM DISTRIBUTION OF PTPS
Distribution of human PTP genes is nonrandom, with the largest clusters of loci found on chromosomes 1 and 12 (Fig. 1). PTP genes are absent from the X and Y chromosomes and from chromosomes 16, 17, 21, and 22. Chromosomes 5, 8, and 13 contain PTP pseudogenes only (see below). In general, closely related PTPs, such as TCPTP/PTP1B, PTPD1/PTPD2, and PTP
/PTP
do not colocalize to the same chromosome, although their similarity in exon structure reveals they arose by gene duplication of a common ancestral gene. Chromosome 12 is the only exception, since it harbors the two SH2 domain-containing PTPs (SHP1 and SHP2) and three members of the RPTPβ subtype (RPTPβ, GLEPP1, and PTPS31). Many PTP genes, although phylogenetically divergent, are positioned within 0.9-2.0 centimorgans (cM) of each other on a particular chromosome. Such positioning in chromosomal domains (19, 20), also observed for the PTKs (21), may imply their coregulation or indicate functional relationships between the respective PTP genes. Since PTP and PTK gene families cooperate in regulating tyrosine phosphorylation in multicellular organisms, we compared the chromosomal positions of these two gene families; however, we did not find examples of PTP and PTK genes in close proximity to each other in the human genome (see analysis at our web sites).
PTP GENE ORGANIZATION
Mapping of exon-intron boundaries
For the 37 PTP genes with known transcripts (11), the exon-intron boundaries were mapped by aligning their cDNA sequences with the genomic clones listed in Table 1. A few of these PTP genes were incomplete or their exons were present in the databases on opposite DNA strands, indicating incorrect orientation of assembled sequence fragments. Although these errors did not interfere with our search strategy (based on PTP domain sequence homology), this type of error is problematic for the automated annotation of genes, which relies on finding consecutive open reading frames. Nevertheless, because we reviewed all exon-intron boundaries manually, we were able to generate an accurate alignment of the genomic organization for all the PTP domains (Fig. 3). A more detailed version of this genomic alignment can be retrieved from our web sites.
Within the PTP gene family, the number of exons ranges from 9 encoding HePTP, which at 339 amino acid residues is the smallest PTP, to 47 encoding PTPBAS, which at 2466 amino acid residues is the largest member. Likewise, there are considerable differences with respect to the sizes of PTP genes. At one end of the spectrum, SHP1 (2161 bp mRNA) is encoded in the most compact gene structure, with 16 exons spanning only 9960 bp giving rise to a 595 residue protein. At the other end, the genomic sequence encoding RPTP
(12680 bp mRNA) is >100-fold longer (32 exons spanning 1117166 bp), making it the largest PTP gene, although only the ninth largest protein in the PTP family.
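The size comparison above can be checked with a few lines of arithmetic. The Python sketch below uses only the figures quoted in the preceding paragraph; the dictionary keys, the function names, and the use of mRNA length as a rough proxy for total exon length are assumptions introduced for illustration.

```python
# Figures quoted in the text (all lengths in base pairs).
shp1 = {"mrna_bp": 2161, "gene_span_bp": 9960, "exons": 16}
largest_gene = {"mrna_bp": 12680, "gene_span_bp": 1117166, "exons": 32}


def fold_difference(small, large):
    """Ratio of the genomic spans of two genes."""
    return large["gene_span_bp"] / small["gene_span_bp"]


def transcribed_fraction(gene):
    """mRNA length divided by genomic span.

    Assumption: the mRNA length approximates the summed exon length,
    so this is a rough measure of how compact the gene is.
    """
    return gene["mrna_bp"] / gene["gene_span_bp"]


if __name__ == "__main__":
    print(f"fold difference in genomic span: {fold_difference(shp1, largest_gene):.1f}")
    print(f"SHP1 transcribed fraction: {transcribed_fraction(shp1):.3f}")
    print(f"largest gene transcribed fraction: {transcribed_fraction(largest_gene):.4f}")
```

With these inputs the span ratio comes out at roughly 112, consistent with the ">100-fold" statement, and introns account for the great majority of the largest gene.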
Splice sites follow the consensus
The exon-intron boundaries of the PTP domains follow the consensus of splicing donor and acceptor sites of most eukaryotic genes (AG/GT rule) (22). There are three possible junctional phases between exons and introns: phase 0 introns lie between two intact codons, whereas phase 1 and phase 2 introns interrupt a codon after its first and second nucleotide, respectively. Phase 0 introns dominate the gene structure of all PTP domains but are present only infrequently in the noncatalytic domains and extracellular fragments. In fact, only two amino acids in the PTP domain are encoded by a split codon: the invariant serine within the active site signature motif (see later) and a nonconserved residue at the junction of the exon encoding the KNRY and the DxxRVxL motifs (Fig. 3). This knowledge of PTP exon structure facilitated the analysis and classification of novel PTP-like sequences retrieved by our search.
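The definition of intron phase given above translates directly into a short calculation: the phase of an intron is the number of coding nucleotides that precede it, modulo 3. The following Python sketch assumes that each exon length counts coding nucleotides only (untranslated sequence excluded); the function name and the example lengths are hypothetical and are not taken from any PTP gene.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase (0, 1, or 2) of the intron after each exon but the last.

    Phase 0: the intron falls between two codons.
    Phase 1: the intron interrupts a codon after its first nucleotide.
    Phase 2: the intron interrupts a codon after its second nucleotide.
    """
    phases = []
    coding_so_far = 0
    for length in coding_exon_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


if __name__ == "__main__":
    # Hypothetical coding exon lengths, chosen only to show all cases.
    print(intron_phases([90, 61, 80, 72]))  # [0, 1, 0]
```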
The PTP domain is a compact cassette of 6-9 exons
Structurally, members of the PTP protein family fall into two broad categories consisting of nontransmembrane or transmembrane receptor-like molecules (23), which may be further divided into 17 principal subtypes based on amino acid sequence homology of their conserved PTP domains (11) (Fig. 2). A prominent feature of the PTP genes is the presence of very short introns between the 6-9 exons that encode the PTP domain (approximately 280 amino acids) compared with the size of introns found in the noncatalytic or extracellular segments. The genomic sequence of the conserved PTP domain spans on average 30,000 base pairs (bp), which is considerably smaller than the typical large introns found in the 5' region of these genes.
It has been shown that 10 sequence motifs define the PTP family of proteins (11). The nucleotide sequences encoding these motifs are rarely interrupted by introns, which makes it unlikely that our genome-wide search has missed novel PTPs hiding in the genome as currently sequenced (Build 33). One striking exception is the presence of an intron in most phosphatase domains that interrupts the conserved signature motif "(V/I)HCSxGxGR(S/T)G." In fact, 30 PTP catalytic domains and all RPTP D2 domains have their signature motif split between two exons (Fig. 3 and Fig. 4). The observation that this exon-intron junction is not present in several nontransmembrane PTPs, including PTP1B, TCPTP, PTPD1, PTPD2, and PTPBAS, may indicate an early evolutionary divergence of the latter enzymes (24).
The transmembrane segment of receptor-type PTPs is encoded in one exon
Similar to other transmembrane proteins, the membrane-spanning segment of RPTPs is encoded by a single exon (see gray segment in Fig. 3), supporting the idea that evolution created the earliest genes by exon shuffling of small pieces of DNA (25). In contrast, with the exception of closely related PTPs within a subtype, the intervening sequences between transmembrane regions and the PTP domains do not share a common exon structure. The only conserved feature of the intervening sequences among RPTPs is the prominent patch of basic residues carboxyl-terminal to the transmembrane segment, which is consistent with the "positive inside" rule for transmembrane helices (26) (Fig. 3 and exon alignments at our web sites).
Diversity of PTP transcripts
The present mapping of PTP genes allows for analysis of the genetic elements that control PTP expression and alternative splicing. However, no methods have yet been developed to predict the complete set of alternatively spliced proteins for a given gene. Here, in an attempt to estimate the total number of unique PTP proteins, we have compiled a database of reported PTP splice forms and other variants (Table S3; see web sites). Although this database does not consider functional specificity afforded by post-translational modifications (e.g., glycosylation, phosphorylation, and proteolytic processing), >85 unique PTP proteins are currently known. Analysis of these sequences in the context of the human genome reveals four principal methods for generating diversity within the PTP protein family at the transcriptional and translational levels: 1) a combination of promoter usage (e.g., RPTP) (27); 2) usage of alternative splice sites (e.g., TCPTP) (28); 3) exon skipping (e.g., CD45) (29); and 4) intron retention (e.g., PTP1B) (30, 31). The last three principles are broadly referred to as alternative splicing. For several PTPs, it has been shown that protein isoforms, derived from alternatively spliced mRNA, have distinct physiological functions (32-34). Knowledge of such isoforms is critical when targeting PTPs by antisense oligonucleotides, RNA interference, or other probes to study the cellular function of these enzymes or perhaps even as a way of treating diseases.
The 3'-untranslated regions (3'UTRs) of PTPs are among the longest in the genome
As currently sequenced, the 3'-UTRs of PTP mRNAs are among the longest 10% in the genome. Since 5'- and 3'-UTR sequences of eukaryotic mRNAs are known to play crucial roles in post-transcriptional regulation of gene expression, modulating nucleocytoplasmic mRNA transport (35), translation efficiency (36, 37), and mRNA stability (38), future analysis of these sequences for conserved motifs and structural elements may reveal new insights into the regulation of PTP expression. Indeed, the recently reported association between insulin resistance and a variation in the 3'-UTR of PTP1B, which apparently increases mRNA stability (39), emphasizes that mapping of possible disease-associated mutations should not be restricted to analysis of the protein coding regions.
Mapping of exons onto the tertiary structure of proteins
In the debate about the origin of introns and their role in evolution of early genes (i.e., the exon shuffling process), it has been suggested that exons delineate elements of protein modules (40, 41). In a recent structural genomic analysis of intron distribution in 665 proteins with known 3-dimensional structures, it was concluded that phase 0 introns correlated with the boundary regions of compact polypeptide modules in ancient conserved proteins (25). Since the conserved PTP domain is dominated by phase 0 introns, we next examined whether their exons, as visualized on protein tertiary structures from the Protein Data Bank, correlated with any protein structural elements (Fig. 4). When the position of exons encoding PTP1B, SHP2, PTP, LAR, and PTPµ were mapped onto their respective structures, we discovered that introns were positioned primarily within the loop regions of the PTP fold and not within secondary structure elements (i.e., α-helices and β-sheets). With the exception of the active site signature motif, these loop regions are also the segments in which PTP protein sequences are highly diverse (Fig. 3). In RPTPs, the helix-turn-helix element (shown in red in Fig. 4), referred to as the inhibitory wedge and which may serve a regulatory function (42), is encoded by a single exon. In contrast, in nontransmembrane PTPs (represented by PTP1B, TCPTP, SHP1, and SHP2) in which this motif has not been implicated in regulation, the structures have an intron insertion within the second α-helix (α2'); this is the only case of an intron position not confined to a PTP loop region.
Prediction of full-length human PTP sequences
Our BLAST search identified one novel human PTP (defined by genomic sequence data AL356953 and AL592300), which maps to 1q32.1, a region syntenic to the locus for rat osteotesticular PTP (OST-PTP) (14) and mouse embryonic stem cell phosphatase (PTP-ESP) (15). Consistent with this synteny, we have predicted a human mRNA, termed PTP-OST, that has 75% identity to the mouse (AF300701) and rat (L36884) nucleotide sequences and is defined by 35 exons (data available at our web sites). Our predicted sequence is based on the public genome assembly (Build 33); however, the human PTP-OST locus is sparsely covered by fragmentary sequences in both the public and private genome assemblies. Discrepancies between the two current assemblies, including the presence of an additional PTP-OST-like fragment (AL354751) on chromosome 9 (see analysis at our web sites), indicate that new sequence data are needed to close gaps and reduce ambiguity in order to define this human PTP accurately. Only two short 3' EST sequences match the human gene, suggesting that, similar to its mouse and rat counterparts, it has a highly regulated and restricted expression pattern [i.e., the mouse and rat mRNAs are bone-specific and their expression is detectable only in osteoblasts during differentiation (43, 44)]. Human PTP-OST is predicted to be a receptor-type PTP that possesses 10 fibronectin type III repeats, a membrane-spanning segment, and an intracellular segment consisting of one catalytic PTP domain and a second atypical PTP-like domain. Notably, the human ortholog has not yet been cloned, and this first report of a possible human sequence will facilitate its characterization.
In addition to PTP-OST, full-length sequences are not available for four human PTPs (STEP, HDPTP, PTPTyp, and PTPS31). Partial cDNA sequences currently define these human PTPs, although full-length ortholog sequences have been cloned and characterized in rodents. To illustrate the analytical power of current genomic databases and search tools, we have predicted their possible full-length sequences. First, we investigated the human/mouse and human/rat homology map to confirm synteny between rodent loci and the identified human genomic sequences. We then aligned the mouse and/or rat cDNAs to the human genome assembly. This allowed us to identify missing exons and compose a likely full-length human sequence for each PTP. While these predicted sequences are available at our web sites, we have detailed our analysis of the PTPS31 gene below, which also serves to illustrate the protein diversity generated via alternative splicing of PTPs.
PTPS31, a receptor-type PTP with alternatively spliced cytoplasmic isoforms
In the early 1990s, when only a few full-length PTP cDNAs had been published, the research community was actively engaged in identifying novel PTPs using PCR and different sets of degenerate primers. At that time, PCR fragments corresponding to a putative novel human PTP termed PTPS31 (clone number 31 from a skeletal muscle cDNA library) had been isolated. To identify a full-length clone, these PCR fragments were used to screen cDNA libraries, and two clones (S31C and S31D) were initially isolated that seemed to code for nontransmembrane PTPs with the sequence MRMR as the apparent amino terminus (Fig. 5a). However, since there was no in-frame stop codon upstream of the proposed initiation site, additional clones were isolated: S31F(1), S31F(2), and S31F(3). Surprisingly, these new clones did not contain the previously identified amino-terminal sequence MRMR, but instead continued upstream with a sequence predicted to encode a transmembrane region and a number of fibronectin III-like repeats. Apparently, PTPS31 could exist as both a cytoplasmic and a receptor-like PTP. At that time continued cloning efforts did not result in identification of the 5' end of the receptor-like PTPS31, and only the longest cDNA, S31F, was deposited in GenBank as AR073855.

Figure 5. Genomic analysis of PTPS31 cDNA clones and prediction of the human extracellular domain sequence based on homology to rat PTPGMC1. a) Schematic representation of exons encoding the 3' end of human PTPS31. The exon structure was deduced by aligning isolated cDNA clones [S31C, S31D, S31F(1), S31F(2), and S31F(3)] to the genome sequences (AC074031 and AC074031). The identified exon-intron boundaries follow the consensus for splice donor and acceptor sites. The promoter sequences identified upstream of exons 1A and 1B were predicted using the Promoter 2.0 Prediction Server (www.cbs.dtu.dk). Exon numbering is according to the predicted full-length sequence of PTPS31F (available at http://science.novonordisk.com/ptp or http://ptp.cshl.edu). b) Genomic context of human PTPS31 as viewed in the UCSC Genome browser (http://genome.ucsc.edu). The exon-intron structures in black represent (from top to bottom) the predicted full-length human sequence of PTPS31 (including the 3 PTPS31 exons present on the opposite DNA strand due to a sequence inversion in the assembly process) and the five PTPS31 clones. The exon structures shown in color represent known proteins from Swiss-Prot, TrEMBL, or the RefSeq sequence database (light blue) and predicted genes based on Ensembl, Twinscan, and Genscan results. Below the Genscan predictions are human mRNAs, ESTs, and rat PTPGMC1 aligned to the human genome sequence. The bottom graph shows the degree of human/mouse evolutionary conservation.
With access to the human genome sequence and EST databases, we have now revisited PTPS31 with the aim of demonstrating the power of modern analytical tools and databases. First, we retrieved the genomic sequence for PTPS31 from our database (Table 1, accession number AC074031) and aligned it with the five S31 clones to identify their exon structure (Fig. 5). The deduced exon structure revealed that these variants could be the result of alternative splicing. The genomic organization of the conserved PTP domain was identical to other members of the R3 subtype (PTPβ, DEP1, SAP1, GLEPP1, and PTP-OST) with the predicted transmembrane segment encoded by a single exon. To identify the 5' end of the putative human receptor-like enzyme, we analyzed the rat ortholog sequence PTPGMC1 (45) in the context of the human genome and compared it to human S31 clones. This analysis identified a short 411 bp mRNA (AF169351) and a spliced EST sequence that corresponded to the human gene (Fig. 5b). Alignment of the rat sequence to the human genome predicted exons that were also supported by the human/mouse homology map (Fig. 5b). As a result, we were able to predict the first 26 exons of the human PTPS31 gene. We encountered difficulties only in one region of the genomic clone, where three predicted exons were found on the opposite strand of DNA due to misassembly of sequence fragments in the public draft-quality clone (Fig. 5b). The deduced extracellular domain of human PTPS31 encodes 18 fibronectin type III repeats, and the alignment between the rat PTPGMC1 sequence and the predicted human sequence can be viewed at our web sites.
Next we analyzed whether it was indeed likely that PTPS31 could exist as both nontransmembrane (clones S31C and S31D) and transmembrane proteins (the S31F clones). Inspection of the 5' end of S31D identified an in-frame stop codon 80 bp upstream of the proposed initiation codon and consensus promoter elements. Likewise, for the predicted transmembrane isoform, an in-frame stop codon and consensus promoter region were found upstream of the first exon (Fig. 5a). Thus, this case seems to correspond to the otherwise distantly related PTP, which exists both as a receptor type and a cytoplasmic form (46). It is of particular interest that different promoters control the expression of the two PTP isoforms and that functional promoter elements have been identified immediately upstream of the initiation codons (27). Although additional experiments are required to demonstrate unequivocally the existence of the above PTPS31 isoforms, the present analysis is another demonstration of how access to genome sequences can improve the process of identifying and characterizing novel genes.
PTP PSEUDOGENES
Pseudogenes are disabled copies of genes (or decay remnants of genes) that do not produce a full-length protein (47). Operationally, they are most readily defined as fragments of sequence that appear similar to known protein domains but have stop codons or frameshifts mid-domain (47, 48). Pseudogenes are often classified as either 1) "processed," which arise when an mRNA transcript is reverse-transcribed and reintegrated into the genome, or 2) "nonprocessed," which arise from duplication of genomic DNA that, over evolutionary time, gradually accumulated disabling mutations of their reading frame (49).
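The operational definition above (a domain-like sequence with stop codons or frameshifts mid-domain) can be expressed as a simple check on a pairwise nucleotide alignment. The Python sketch below is a simplified illustration, not the procedure used in this study: it assumes the candidate has already been aligned to a reference coding sequence with "-" as the gap character and that both start in frame 0. All function names are hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_inframe_stop(seq, frame=0):
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    seq = seq.upper()
    for codon_index, i in enumerate(range(frame, len(seq) - 2, 3)):
        if seq[i:i + 3] in STOP_CODONS:
            return codon_index
    return None


def frameshift_gaps(aligned_ref, aligned_candidate):
    """Return the lengths of gap runs (in either sequence) not divisible by 3."""
    shifts = []
    for aligned in (aligned_ref, aligned_candidate):
        for run in re.findall(r"-+", aligned):
            if len(run) % 3:
                shifts.append(len(run))
    return shifts


def looks_disabled(aligned_ref, aligned_candidate):
    """True if the candidate has a premature stop codon or a frameshifting indel."""
    candidate = aligned_candidate.replace("-", "")
    stop = first_inframe_stop(candidate)
    # A stop counts as premature when it is not the final complete codon.
    premature = stop is not None and stop < len(candidate) // 3 - 1
    return premature or bool(frameshift_gaps(aligned_ref, aligned_candidate))
```

A check like this captures only two of the criteria mentioned in the text; the absence of exon structure and the presence of a poly(A) tail would need to be assessed separately.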
Several PTP pseudogenes arose by retrotransposition
We identified nine PTP-like sequences (five closely related to SHP2, two to TCPTP, and one each to MEG1 or PTP; Table 1), which we classified as processed pseudogenes because they had no apparent exon structure and harbored frameshift mutations and multiple stop codons. Consistent with this classification, most of these pseudogenes contained polyadenylated tails characteristic of retrotransposition (Fig. 6 and Fig. 7), and all were absent from the mouse genome (Build 30) (16), suggesting that they originated recently. The increased occurrence of retrotransposition of TCPTP and SHP2 may reflect a high transcriptional activity of these genes in humans (50).

Figure 6. Comparison of TCPTP (gene structure and cDNA) with the genomic sequence of the two TCPTP pseudogenes on chromosome 1 (TCPTP-P1) and chromosome 13 (TCPTP-P13). Exons in the TCPTP gene (PTPN2) are visualized as rectangles. Conserved PTP amino acids within exons are color coded. Introns and flanking genomic sequence are shown as lines (not to scale). White segments correspond to the untranslated regions (UTRs) of the TCPTP gene. The exon structures for the two TCPTP isoforms, TC45 (NM_002929) and TC48 (NM_080422), are shown; numbers above the exons refer to the residue position (amino acid) in the two TCPTP proteins. Numbers in parentheses beneath the exons indicate their lengths (nucleotides). The polyadenylation tail (AAAAAA) is indicated for the cDNA and the genomic retrotransposed pseudogenes. The degree of conservation (percent nucleotide identity) between TC45 and the pseudogenes TCPTP-P1 and TCPTP-P13 is 95% and 94%, respectively. Symbols within the apparent PTP reading frame of the pseudogenes indicate the positions of in-frame stop codons (red star), nucleotide deletions or insertions (blue triangle), and other point mutations (black dot). The nucleotide sequence alignment used for this diagram is available at our web sites.

Figure 7. Comparison of SHP2 (protein, gene, and cDNA) with the genomic structure of five SHP2 pseudogenes on chromosomes 3, 4, 5, 6, and 8 (SHP2-P3, -P4, -P5, -P6, and -P8). Exons and introns in the SHP2 gene (PTPN11) are shown as rectangles and lines, respectively. The degree of conservation (nucleotide identity) between the SHP2 cDNA sequence and each intronless pseudogene is shown. The inverted triangles in the SHP2 cDNA represent nucleotide positions in which SHP2 differs from the consensus nucleotide found in the ancient retrotransposed SHP2 cDNAs. Nine of these recent mutations in modern SHP2 were silent (green triangles). Red stars indicate the first stop codon within the apparent PTP reading frame of the pseudogenes. A detailed nucleotide sequence alignment of SHP2 (cDNA) with its pseudogenes (genomic sequences) can be retrieved from our web sites.
For TCPTP, integration of reverse-transcribed mRNA into the genome was evident on chromosomes 1 and 13. These genomic sequences, which we termed TCPTP-P1 and TCPTP-P13, share 94–95% nucleotide identity with the cDNA of the 45 kDa isoform of TCPTP (TC45), including homology to the 5'- and 3'-UTRs (Fig. 6 and sequence alignment at our web sites). If transcribed, the TCPTP pseudogenes would generate short nonfunctional polypeptides of 41 and 149 amino acids, respectively, due to frameshift mutations and premature stop codons. TCPTP-P1 arose by retrotransposition of an alternatively spliced mRNA missing the second exon.
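For readers who wish to recompute nucleotide identities from the alignments at our web sites, a percent-identity calculation over a pairwise alignment might be sketched as follows. The convention used here (gap-versus-base columns count as mismatches, gap-versus-gap columns are skipped) is an assumption; other conventions exist and give slightly different values, and the sketch is not claimed to reproduce the figures quoted above.

```python
# Illustrative sketch: percent identity between two pre-aligned sequences.
# The gap-handling convention is an assumption, not the paper's stated method.
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of compared alignment columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue  # column carries no information for this pair
        compared += 1
        if x == y:
            matches += 1  # a gap opposite a base never matches
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment (hypothetical sequences): three of four columns agree.
print(percent_identity("ACGT", "ACGA"))
```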
For SHP2, we found five retrotransposed sequences on chromosomes 3, 4, 5, 6, and 8 (SHP2-P3, -P4, -P5, -P6, and -P8), which all share >92% nucleotide identity with the SHP2 cDNA, including homology to the 5'- and 3'-UTRs (Fig. 7 and sequence alignments at our web sites). Like the TCPTP pseudogenes, the SHP2-derived sequences harbor frameshift mutations and premature stop codons in their apparent reading frame. Again, one pseudogene (SHP2-P5) arose by retrotransposition of an alternatively spliced mRNA. The authentic ATG initiation site is conserved in three of the five SHP2 pseudogenes; if transcribed, SHP2-P3 would encode a protein containing two SH2 domains that hypothetically could act as a dominant negative inhibitor of the SHP2 enzyme in vivo.
The two TCPTP and five SHP2 pseudogenes described above were previously detected by in situ hybridization (51, 52). In fact, two groups have determined the genomic localization of SHP2. Using a 14.2 kb genomic library clone that contained both exon and intron sequence, this PTP was assigned to chromosome 12q24.1 by fluorescence in situ hybridization (53). When a SHP2 cDNA probe was used, however, additional hybridization signals were observed over 4q21 and 5p14 and, to a lesser degree, over chromosomes 3q1-3q13.2, 6q23-q24, and 8q12 (52). In 1992, it was proposed that these signals could represent new SH2 domain-containing PTPs. In light of today's genomic sequence, we conclude that these signals correspond to the exact localization of the five intronless SHP2 pseudogenes.
Some PTP pseudogenes are likely to be expressed
Intriguingly, some of the PTP pseudogenes identified in this study were represented by EST sequences. Since at least one pseudogene, SHP2-P3, has the potential to encode a functional protein fragment, we assessed the possible expression of the SHP2- and TCPTP-derived pseudogenes using PCR and cDNA templates from eight different human tissues (Fig. 8 and Table 3). For each pseudogene, primer sets were designed to anneal to regions where the sequence of the pseudogene was unique compared with the parent gene. Sequencing of the PCR products confirmed that the two TCPTP pseudogenes (TCPTP-P1 and -P13) and three of the five SHP2 pseudogenes (SHP2-P4, -P6, and -P8) could be amplified from reverse-transcribed mRNA and thus are likely to be expressed, although as yet with unknown function (Table 3). The tissue distribution and expression level of these processed pseudogenes differed markedly from the parent functional transcript (Fig. 8). This is consistent with the notion that retrotransposed genes do not carry the transcriptional control elements present in the parental gene, but instead employ a nearby promoter present in an unrelated sequence (48).
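The idea of placing primers where a pseudogene differs from its parent gene can be sketched as a simple scan over an ungapped pairwise alignment. The function name, window length, and mismatch threshold below are hypothetical choices for illustration; they are not the primer design parameters used in this study, and real primer design would also consider melting temperature, gaps, and the other pseudogenes.

```python
# Illustrative sketch: find windows of a pseudogene that carry enough
# mismatches against the parent cDNA to discriminate between the two.
# Assumes the two sequences are aligned without gaps (hypothetical setup).
def discriminating_windows(parent, pseudogene, length=20, min_mismatches=3):
    """Return (start, mismatch_count) for each window meeting the threshold."""
    if len(parent) != len(pseudogene):
        raise ValueError("sequences must be aligned to equal length")
    diffs = [int(a != b) for a, b in zip(parent.upper(), pseudogene.upper())]
    windows = []
    for start in range(len(diffs) - length + 1):
        count = sum(diffs[start:start + length])
        if count >= min_mismatches:
            windows.append((start, count))
    return windows


# Toy example (made-up sequences): two adjacent differences at positions 3 and 4.
print(discriminating_windows("AAAAAAAAAA", "AAATTAAAAA", length=4, min_mismatches=2))
```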
PTP pseudogenes provide insights into evolution
The nucleotide sequence of pseudogenes provides important insight into the mutation rate and evolutionary history of the human genome (48). For example, alignment of SHP2 cDNA with its five processed (retrotransposed) pseudogenes reveals the most recent mutations that have occurred in the modern SHP2 enzyme. Specifically, we found 10 nucleotide positions in modern SHP2 that harbor a different nucleotide base from the consensus found in the SHP2 pseudogenes (see sequence alignment at our web sites). Of these 10 mutations, only one has changed the amino acid of the SHP2 protein (Met411Thr), diverging it further from SHP1. Since Thr411 is a surface-exposed residue and is found in a consensus kinase recognition sequence (protein kinase C), it is tempting to speculate that post-translational modification via threonine phosphorylation has provided selection pressure for the observed mutation.
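The comparison described above, in which positions where the parent cDNA departs from the pseudogene consensus are taken as recent changes in the parent, can be sketched in Python as follows. The majority rule, the function name, and the toy sequences are assumptions for illustration and do not reproduce the SHP2 alignment.

```python
# Illustrative sketch: report columns where the pseudogenes share a majority
# base that differs from the parent sequence. All sequences are assumed to be
# pre-aligned to equal length; the toy data below are made up.
from collections import Counter


def divergent_positions(parent, pseudogenes, gap="-"):
    """Return (position, consensus_base, parent_base) for divergent columns."""
    hits = []
    for i, base in enumerate(parent):
        column = [p[i] for p in pseudogenes if p[i] != gap]
        if not column or base == gap:
            continue
        consensus, count = Counter(column).most_common(1)[0]
        # Require a strict majority so that ties are not called as consensus.
        if count > len(column) / 2 and consensus != base:
            hits.append((i, consensus, base))
    return hits


parent = "ACGTC"
pseudogenes = ["ACGTT", "ACGTT", "ACATT", "ACGTT", "TCGTT"]
print(divergent_positions(parent, pseudogenes))
```

Whether a reported position is silent or changes an amino acid would then be decided by translating the codon that contains it, a step omitted here.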
Analysis of novel PTP-like sequences with apparent exon structure
In addition to human PTP-OST at 1q32.1 and the processed pseudogenes described above, our search for novel PTP genes identified three genomic sequences with a PTP-like exon structure and a fourth clone of poor sequence quality not present in the public or private genome assembly (Table 1). Subsequent analysis of their apparent PTP reading frames readily identified the sequence mapping to 5q23.1 as a nonprocessed pseudogene, most likely derived from duplication and degradation of genomic DNA from PTP. Likewise, the PTP-OST-like fragment at 9q22.31 (which complicates the definition of the human PTP-OST locus; see analysis of PTP-OST) harbors several stop codons and thus is classified as a nonprocessed pseudogene (if not an artifact of the genome assembly process). However, the third genomic clone (AL390719) displayed a striking homology to SHP1 and SHP2 and was not a clear-cut case of PTP gene duplication and subsequent degradation. As a result, we combined a bioinformatics analysis of this sequence with PCR experiments and cloning of transcribed and genomic DNA from this region.
SHP3: a unique pseudogene with exon structure
Using the incomplete contig AL390719 (from Build 28), we were able to map nine exons giving rise to an apparent open reading frame homologous to SHP1 and SHP2, hence termed SHP3 (Fig. 9). Four EST sequences derived from pancreas (BM141900, BM142081), hypothalamus (BI601978), and an adenocarcinoma (BF035622) matched the amino-terminal SH2 domain of SHP3, although the overlap was limited to 125 nucleotides within a single predicted exon (i.e., exon 2 of SHP3). Consistent with these EST sequences, we could amplify exon 2 of SHP3 from cDNA libraries from several different human tissues including hypothalamus, pancreas, and ovary (data not shown). This result created much excitement, since we were also able to amplify a transcript containing part of exon 4 of SHP3, consistent with the existence of three EST sequences (BF210831, BM129687, and BM129400) that overlapped exon 4 by 75 bp. However, to our disappointment, we were never able to amplify a SHP3-derived transcript encompassing exon 2 together with any of the other predicted exons despite using different sense primers annealing to exon 2 and a combination of different antisense primers annealing to exons 4, 8, 10, or 12, respectively. Yet using the corresponding set of SHP2 control primers, we were able to amplify and clone the paralogous SHP2 transcript from almost all tissues tested. Subsequent cloning of the SHP3 genomic sequence and concomitant release of a new version of the sequence AL390719 (version 31) without gaps revealed that the active site sequence of SHP3 has three critical mutations, which would make this an inactive enzyme (Fig. 9b). In addition, the new version of the genomic sequence AL390719 introduced stop codons in the putative SHP3 reading frame. Thus, we conclude that SHP3 is a disabled gene; consistent with this, there is no evidence of a SHP3 sequence in the mouse genome (16) as currently sequenced (Build 30).

Figure 9. Genomic organization of SHP3: a unique pseudogene with apparent exon structure at chromosome 1p36.32. a) Diagram showing the level of conservation between the genomic sequence of SHP3 (accession number: AL390719) and the exon structure of the SHP1 and SHP2 genes (PTPN6 and PTPN11). The nucleotide identities between various exons are indicated. The region of PTP homology spans 8300 bp and covers the two SH2 domains and the PTP domain. b) Amino acid sequence alignment of SHP2 with the apparent PTP reading frame of SHP3. Critical residues that are invariant in functional SH2 or PTP domains, but mutated in the SHP3 pseudogene, are shown in blue. cDNA libraries prepared from 16 different tissues (MTC panel 1, MTC panel 2) and human hypothalamus brain cDNA (Marathon-Ready, Clontech) were used in an attempt to clone transcripts for SHP3. The Advantage-GC cDNA polymerase mix (Clontech) was used for these PCR experiments due to the high GC content of the SHP3 sequence. The genomic SHP3 sequence of the putative PTP domain was amplified using a human genomic DNA library from Clontech (catalog number 6550-1).
A retroviral long terminal repeat (LTR) is present between exons 8 and 9 of SHP3, which lends added support to the classification of SHP3 as a disabled pseudogene. We were intrigued by the subsequent identification of a 1368 bp polyadenylated mRNA (AF148950) containing genomic sequence of SHP3. This mRNA had been cloned in an effort to show that insertion into the genome of LTRs from endogenous retroviruses may modulate transcription of neighboring genomic DNA (54). The authors noted the homology within the nonretroviral part of this sequence to exon 8 of SHP2 and suggested it could represent a novel SHP2-like gene or a solitary duplicated exon. Notably, transcription occurs in the antisense direction relative to its PTP reading frame.
 |
DISEASE ASSOCIATION
|
|---|
An important outcome of the human genome assembly is that it offers the possibility of identifying genes underlying human diseases at a much higher pace than before, having circumvented the need for labor-intensive positional cloning. Knowledge of disease loci, generated from family segregation and genetic epidemiological studies, can now be explored in the context of detailed maps of the human genome sequence. Somatic mutations, which are the cause of most sporadic cancers (4), can be mapped more readily as a result of the human genome project and the advent of new technologies for high-throughput DNA sequencing and large-scale analysis.
The combined availability of defined disease loci and the current precise mapping of 38 human PTP genes allow for an initial systematic evaluation of these enzymes as candidate genes in genetically determined diseases. Caution is warranted when attempting to associate diseases with a specific gene or chromosomal region (55). Due to errors related to population sampling, stratification, or just statistical coincidence, experience has shown there is considerable risk that reported genetic linkages or associations may turn out to be false positive results (55). With special relevance to susceptibility loci, a linkage region can easily contain up to 200 different genes; mapping a gene within a linkage region may therefore serve as an indication, but is far from sufficient to provide proof, of any functional connection.
In an attempt to provide an overview of the possible etiological/pathogenic roles of PTPs in human diseases, we have applied three different approaches to uncover disease-related biological information. First, bearing the foregoing reservations in mind, we searched the area surrounding each PTP gene for disease susceptibility loci (using the Online Mendelian Inheritance in Man (OMIM) catalog of 14,255 genetic disorders and disease loci). Second, given the pivotal role of tyrosine phosphorylation in malignant cell transformation (exemplified by mutations in the human epidermal growth factor receptor, the proto-oncogene c-Src, and the suggested tumor suppressor role of PTPs; reviewed in refs 4, 56), we examined whether PTP genes were "frequently" deleted or amplified in human cancers. For this search, we used the Mitelman database of recurrent chromosome aberrations in human cancers (which currently holds 44,177 clinical records) and defined "frequent" as five or more recorded cases. Third, we searched animal models and mouse knockout studies for disease-like phenotypes associated with each PTP. The results of this study are summarized in an online electronic database in which each PTP locus is hyperlinked to disease information in OMIM, Mitelman, and PubMed (Table S4; see http://science.novonordisk.com/ptp or http://ptp.cshl.edu). Since the functional annotation of the human genome constantly evolves, this database format has the advantage of providing users with up-to-date disease linkage information and the most current genome maps (e.g., Morbid and SNP Maps) for the PTP loci of interest. Again, it should be stressed that this information should serve only as a starting point for additional studies of the role of PTPs in human diseases rather than as unequivocal evidence of association.
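The "five or more recorded cases" criterion amounts to a simple count-and-threshold step, sketched below. The record format and the locus labels are placeholders invented for illustration; they are not entries from the Mitelman database, and the sketch does not reflect how that database is actually queried.

```python
# Illustrative sketch: tally recorded aberrations per locus and keep loci that
# reach the threshold. Locus labels and counts below are made-up placeholders.
from collections import Counter


def frequent_loci(records, threshold=5):
    """Return {locus: count} for loci with at least `threshold` recorded cases."""
    counts = Counter(records)
    return {locus: n for locus, n in counts.items() if n >= threshold}


# One list entry per recorded case (hypothetical data).
cases = ["locus_A"] * 5 + ["locus_B"] * 4 + ["locus_C"] * 7
print(frequent_loci(cases))
```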
Involvement of PTPs in type 2 diabetes and obesity
Perhaps the most spectacular example of a link between the PTP family and human disease is in diabetes and obesity. Type 2 diabetes and obesity are multifactorial diseases strongly influenced by genetic background. Through population and family studies, 10 confirmed susceptibility loci have been described for type 2 diabetes and/or obesity-related traits (Table 4 and web sites). Insulin resistance is a key feature of type 2 diabetes and obesity. Several different molecular defects may underlie the impaired response to insulin. Since PTPs are involved in negative regulation of insulin signaling, it has been speculated that increased activity of members of this enzyme family could contribute to insulin resistance, at least in a subset of patients with diabetes or obesity. Four PTPs (PTP1B, PTP, SHP2, and PCPTP1) localize within the above linkage regions and are candidate disease genes. Of these enzymes, only PTP1B has been implicated in diabetes and obesity.
One of the regions showing the strongest evidence for genetic linkage is chromosome 20q13.1-q13.2, which has been associated with quantitative trait loci for obesity and high fasting serum insulin levels (57, 58) as well as type 2 diabetes (59, 60). Two PTPs map to this region: PTP1B at 20q13.1-q13.2 and RPTP at 20q12-q13. Although little is known about the physiological role of RPTP, two independent studies generating PTP1B knockout mice have demonstrated that ablation of PTP1B not only increases insulin sensitivity (61, 62), but also produces resistance to diet-induced obesity due to the removal of an inhibitory constraint on insulin and leptin signaling (63, 64). In obese and diabetic rodents, expression levels and activity of PTP1B both appear elevated in skeletal muscle and adipose tissue, supporting a role for PTP1B in the etiology of insulin resistance (65).
Mutations in the human PTP1B locus have also been identified. A recent genetic screen of the human PTP1B gene identified a proline to leucine variant in the noncatalytic, carboxyl-terminal segment of the enzyme that conferred an increased risk of diabetes in the Danish Caucasian population (66). In vitro studies showed that this variant reduced cdc2 kinase-mediated phosphorylation of a neighboring serine residue (Ser386), which may lead to perturbed function of PTP1B. Two other variants of the PTP1B gene have been identified: 1) a 3'-UTR variant, which apparently increases the stability of PTP1B mRNA and is associated with increased insulin resistance (39), and 2) a silent variant (Pro303) that confers a degree of resistance to type 2 diabetes on carriers (67). Although more studies are needed to establish the functional consequences of these variants, the fact that three independent studies have shown associations with type 2 diabetes supports the notion that the PTP1B locus is involved in the genetics of this disease in humans.
Chromosome 2q37 is another region associated with type 2 diabetes, identified in a Mexican-American population with high prevalence of obesity and diabetes (68) (Table 4). Although no PTPs have been identified in this region, positional cloning studies have implicated the calpain-10 gene as a candidate disease gene (reviewed in ref 69). This is of interest because early studies in human platelets demonstrated that thrombin induces calpain-mediated cleavage of PTP1B, removing its ER-targeting motif (30). This generates a delocalized 42 kDa cytoplasmic protein with enhanced enzyme activity and leads to dephosphorylation of a set of cellular substrates different from those encountered by the ER-targeted enzyme (30). We speculate that calpain-10 variants with abnormal expression levels or proteolytic activity may influence the subcellular localization of PTP1B in insulin-sensitive tissues and thereby lead to perturbed regulation of insulin signaling. Although this hypothesis remains to be tested, it suggests there may be a functional relationship between the two type 2 diabetes linkage regions.
CD45 and immune function
The leukocyte common antigen CD45 is an abundant transmembrane receptor-like PTP that is expressed exclusively on hematopoietic cells (29) and plays a positive role in promoting signaling through T and B cells (70, 71, 72). Transgenic mice bearing a potential activating mutation in CD45 display lymphoproliferation, autoantibody production, and severe autoimmune nephritis (73), whereas CD45 knockout mice are severely immunodeficient and display compromised thymocyte development and reduced B cell response (74). These observations are consistent with an important role for CD45 in mediating antigen receptor signaling.
The importance of CD45 in human health was recently demonstrated by the identification of two patients with severe combined immunodeficiency (SCID) and concomitant genetic lesions in CD45. In one patient, a complete lack of CD45 surface expression was observed due to a large deletion at one allele and a point mutation at the other (75). In the second patient, a homozygous 6 bp deletion in the coding region of the CD45 gene results in very low surface expression of the protein (76). A silent single nucleotide polymorphism (C77G) in exon 4 of CD45, which correlates with aberrantly high expression levels of exon 4-encoded CD45, was reported to be associated with the development of multiple sclerosis (MS) (77). MS is believed to be caused by an abnormal immune response to myelin antigen(s), and it was hypothesized that the C77G polymorphism disrupted a strong exonic silencer element, which normally serves to inhibit the inclusion of exon 4 (78). Conflicting results have emerged regarding this polymorphism. Whereas one study did not provide any evidence for an association of CD45 with the development of MS in U.S. patients (79), another study identified the C77G mutant in 5 of 196 Italian MS patients, but in none of 222 healthy controls (80). Although these observations illustrate the problems inherent in identifying links between genetic lesions and human diseases, the present data suggest that genotyping of CD45 in patients with unexplained disorders in immune activation may reveal important insight into the physiological function of CD45 and provide an opportunity to design drugs that modulate antigen and cytokine receptor signaling in autoimmunity and cancer.
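To give a sense of how small carrier counts such as 5 of 196 patients versus 0 of 222 controls can be weighed, the following sketch computes a one-sided Fisher exact probability from a 2x2 table using only the Python standard library. This calculation is not part of the cited studies or of our analysis; the choice of test and the function name are assumptions made for illustration, and a single nominal probability does not address the sampling and stratification caveats raised above.

```python
# Illustrative sketch: one-sided Fisher exact test for a 2x2 table
# [[a, b], [c, d]] via the hypergeometric distribution (margins held fixed).
from math import comb


def fisher_one_sided(a, b, c, d):
    """Probability of observing >= a carriers in row 1 under no association."""
    row1 = a + b
    row2 = c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)
    upper = min(row1, col1)
    return sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1)) / denom


# Counts taken from the text: 5 carriers among 196 patients, 0 among 222 controls.
p = fisher_one_sided(5, 196 - 5, 0, 222)
print(p)
```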
SHP2, Noonan syndrome, and cancer
SHP2 is another PTP that has been shown to function positively in signal transduction, for example, in its activation of Erk MAP kinase in response to growth factor receptor PTKs and cytokines (81). Missense mutations in the SHP2 gene have recently been identified as the underlying cause of Noonan syndrome (82), an autosomal dominant disorder characterized by multiple developmental abnormalities including facial dysmorphia, short stature, cardiac defects, and skeletal malformations diagnosed in 1:1000–2500 newborns. The striking aspect of these mutations is that they are classified as "gain of function" and are predicted to activate SHP2 by relieving the intramolecular autoinhibition of the PTP domain by its amino-terminal SH2 domain (83). This is important since it is the first example of a putative gain of function mutation in a PTP that is the underlying cause of a human disease. This discovery is a prime example of the candidate gene approach afforded by the human genome project. Thus, early genetic studies had mapped Noonan syndrome to a 5 cM region at 12q24.1 (84, 85), but it was access to the human genome sequences that made Tartaglia and co-workers investigate SHP2 as the candidate gene, as it mapped to the above region and was known to play a critical role in signal transduction pathways associated with diverse developmental processes (82). The same group recently identified activating mutations in SHP2 in five unrelated children with Noonan syndrome and familial juvenile myelomonocytic leukemia (JMML) (86). Furthermore, they observed mutations in 21 of 62 individuals with JMML but not Noonan syndrome. Similar mutations have been shown to increase the activity of SHP2, as measured with bacterially expressed recombinant protein in assays in vitro (H. Keilhack and B. Neel, personal communication). It appears that JMML is associated with aberrant up-regulation of the Ras-MAP kinase pathway, resulting from mutually exclusive mutations that either activate Ras or SHP2 or inactivate neurofibromin (NF1). Mutations in SHP2 were also noted in some patients with myelodysplastic syndrome and acute myeloid leukemia (86), and it will be of interest to ascertain whether such mutations drive the progression of other leukemias in addition to JMML.
PTPs and cancer
Although characterization of the PTP family has revealed important insights into function and uncovered links between PTPs and human diseases, most of these enzymes remain uncharacterized. In this study we have concentrated on identifying potential links between PTP genes and cancer. In the early days of PTP research, a simplified concept developed: the main function of this group of enzymes was to act as off-switches to counteract the PTKs (reviewed in ref 1). As a result, PTPs were considered putative tumor suppressors (87, 88, 89, 90). For example, a copy of the short arm of chromosome 3 is often missing in various carcinomas (91), and when the gene for PTP was localized to 3p21 it was hypothesized that this enzyme functioned as a tumor suppressor whose functional loss could be involved in the pathogenesis of renal and lung tumors (92). Today, other candidate non-PTP tumor suppressor genes have been identified in the same chromosomal area (93). As a result, PTP is no longer considered a likely candidate