Expression Signature to Estimate Survival in Prostate Cancer
Expression Signature to Estimate Survival in Prostate Cancer
Written approval from the local ethics committee was obtained for the molecular analysis of biological samples from PCa patients. This study was conducted in a stepwise manner. The procedure for selecting and verifying genes is outlined in Figure 1 and described in detail as follows.
(Enlarge Image)
Figure 1.
Outline of a stepwise gene selection process. (a) Identification of 641 embryonic stem cell gene predictors (ESCGPs) by bioinformatic analysis. Previously published data sets of whole-genome complementary DNA microarrays derived from five human ESC lines and 115 human normal tissues from various organs were retrieved from the Stanford Microarray Database (SMD). After a data-centering process, a sub-data set with expression profile of 24 361 genes in the ESC lines was isolated from the combined whole data set. A single-class significance analysis of microarray (SAM) was performed and a SAM plot was generated. The 328 genes with the highest expression levels and 313 genes with the lowest expression levels were identified, in total 641 ESCGPs. (b) Identification of 258 ESCGPs in prostate cancer (PCa). PCa ESCGPs were identified by matching the list of the 641 ESCGPs and the list of 5513 genes published by Lapointe et al. When clustering the 112 PCa tissue samples and comparing the cluster results when using all 5513 genes and when using only the 258 ESCGPs present in the data set, nearly identical results were obtained. Sample labeling: PL, lymph node metastasis; PN, normal prostate tissue; PT, prostate tumor. Three cases (marked green) were placed in different classification positions and two cases (purple) were consistently misclassified. (c) Selection of important candidate ESCGPs for clinical survival correlation. Of 258 PCa ESCGPs, 34 genes were selected by their high-ranking order in the SAM analysis identifying significant genes for the subtype classification or for the discriminating between tumor and normal samples. Of these 34 ESCGPs, 19 were selected based on their markedly different expression patterns and robust performances in RT-PCR reactions (Supplementary Figure S1). The 19 ESCGPs and the 5 reported genes were included in the optimization of the 4-plex qPCR method using RNAs from PCa cell lines. (d) Identification of the ESCGP signature in Subset 1. After the 4-plex qPCR optimization, the method was used to analyze 36 fresh–frozen fine-needle aspiration (FNA) biopsies taken from PCa patients (Subset 1). RNAs could be extracted in 28 biopsies. A series of cluster analyses using different gene combinations revealed that the ESCGP signature VGLL3, IGFBP3 and F3 classified Subset 1 samples into three subtypes with strong survival correlations. The level of gene expression increases from blue to red, whereas the delta Ct value decreases from blue to red. Gray areas represent missing qPCR data.
Previously published data sets of whole-genome cDNA microarrays derived from five human ESC lines and 115 human normal tissues from various organs were retrieved from the SMD (Stanford Microarray Database) (Figure 1a). Initially we combined these two retrieved data sets, the data set of normal tissues was used to normalize the subset of ESC lines by a data-centering process, afterwards a sub-data set with whole-genome expression profile of the 24 361 genes in the ESC lines was isolated from the combined whole data set. A single-class significance analysis of microarrays (SAM) was performed using the subset of the ESC lines only, whereby all genes were ranked according to the consistency (without significant variations) of their expression levels across the ESC lines, assuming that genes with significant expression variation between ESC lines would not be critical for maintaining stem cell-like status. Significant genes were selected at delta 0.23 and q-value ≤0.05.
An independent data set with 112 prostate tissue samples was used to verify the ESCGP findings and to select ESCGPs for PCa (Figure 1b). The list of genes in the published data set was matched to the list of the candidate ESCGPs identified in Step 1. This resulted in a shorter list of genes and the expression data of these genes were used to repeat the cluster analysis. The result was compared with the original cluster.
A SAM analysis was performed using the list of PCa ESCGPs identified in Step 2 (Figure 1c). The high-ranking ESCGPs and an additional five reported genes were selected for further analysis. A 4-plex qPCR method was optimized for the quantification of these genes by using RNAs from three PCa cell lines (LNCaP, DU145, PC-3). The procedures used for the isolation of total RNA from cell cultures and FNA cytology smears, cDNA synthesis, RT-PCR and multiplex qPCR analysis are described in the Supplementary Information file, Supplementary Tables S1–S3 and Supplementary Figures S1 and S2. For the qPCR analysis, the expression level of each gene in a sample was normalized to that of GAPDH (glyceraldehyde 3-phosphate dehydrogenase) and was presented as the delta Ct value, which is inversely correlated to the gene expression level. This delta Ct value was centered by the median value across all samples, and the centered delta Ct value was then used for the statistical and k-nearest neighbor (kNN) analyses.
FNA Samples: The FNA samples were collected according to the routine procedure used at the Clinical Pathology/Cytology unit at Karolinska University Hospital in Stockholm, Sweden (Figure 1d). Multiple cytology smears were obtained by prostate FNA procedure in each patient at the time of diagnosis. The representative smear was identified by examination of Giemsa-stained slides and was used for the clinical cytological diagnosis. The remaining fresh smears on glass slides that were duplicates of the Giemsa-stained slide used for diagnosis were freshly frozen and kept at –70 °C. We found that at least 80% of all cells collected from most of the PCa FNA samples were cancer cells. Of the 241 FNA samples that were collected from the patients, we obtained good-quality total RNAs from 193 samples; 189 of these samples were from patients with a diagnosis of PCa. The researcher who performed the 4-plex qPCR analyses of the FNA samples was not informed of the relevant clinical data until the complete data set was constructed from both the qPCR results and the clinical data.
Clinical Characteristics of the Cohort: The 189 PCa patients were diagnosed between 1986 and 2001. During this time period in Sweden, PCa diagnoses were mainly confirmed by prostate FNA cytology rather than by performing a multiple-core biopsy of the prostate. Elderly men without lower urinary tract symptoms were seldom tested for their serum PSA levels. The average PSA level at the time of PCa diagnosis was therefore higher during this study than the level currently observed. In this cohort, very few patients had an indolent cancer, over 50% of patients had high-grade, advanced cancer, and hormone therapy was the primary treatment for 77.9% of the patients (Table 1). With regard to comorbidity, 40% of the patients had cardiovascular disease and 9% diabetes. An internship doctor who was not informed of the results of the molecular analyses collected the relevant clinical data under the supervision of an oncologist. Information with regard to the date of diagnosis, the date of death and the cause of death for all patients was first obtained from regional or national registries and was then verified by examining the medical journals. Diagnosis and cause of death were coded according to the International Classification of Diseases (ICD9 and ICD10) recommended by the World Health Organization. PCa-specific mortality was assigned to cases where PCa or metastases were the primary or secondary cause of death. Death causes of patients are described in Supplementary Table S8. By 31 December 2008, 22 of the original 189 patients were still alive, 163 were deceased and 4 could not be found in the registries (Table 1).
Statistical Analysis. Sample Size and Design of the Subsets: The details for these procedures are provided in the Supplementary Information file. The patient cohort was divided into three subsets, according to the diagnoses and the experimental time order (Table 1). For Subset 1, we evaluated the strongest candidate genes from the ESCGP list. For Subset 2, we evaluated the most significant genes from Subset 1 and selected genes (reported genes) from the literature. For Subset 3, we tested the genes that demonstrated significance in Subset 2 and a limited number of genes that showed significance in Subset 1 but were not tested in Subset 2. A summary of the genes tested in the different subsets and complete set is shown in Table 2. Two major factors determined that not all candidate genes could be tested in every sample, that is, insufficient amount of total RNAs isolated from most FNA samples and fixed gene combinations by 4-plex qPCR.
Definition of important parameters The details and references for the parameters are provided in the Supplementary Information file.
Survival analysis The univariate and multivariate hazard ratios were calculated according to the Cox proportional hazard model using Stata statistics software (version 10.1; StataCorp LP, College Station, TX, USA). The Kaplan–Meier plots and statistical box plots were made using JMP statistics software (version 8.0.1; SAS Institute, Cary, NC, USA).
Cluster analysis Gene expression data were evaluated using the unsupervised hierarchical clustering method and the gene median-centered delta Ct values, results were visualized using Treeview software (Eisen Lab, University of California at Berkeley, CA, USA). Unsupervised hierarchical clustering is based on similarity measures and identifies clusters as groups of patterns.
Parametric model design A first-order polynomial model using the selected genes was designed based on the assumption of the Weibull distribution in Stata statistics software (StataCorp LP). Of the 95 patients for whom there was data with regard to expression pattern of the ESCGP signature, 87 had expression data and clinical information that could be used for Weibull regression survival prediction. Two models were made, one using clinical parameters alone and one combining clinical parameters and ESCGP signature. The clinical parameters included PSA level (>50 vs ≤50 ng ml), clinical disease stage (advanced vs localized), tumor grade (poorly vs well/moderately differentiated) and age at the time of diagnosis. A first-order polynomial model using the selected genes was designed in Stata statistics software (StataCorp LP).
kNN modeling The data set was randomly divided into a training set (70% of the data set, n=139) for model development and a test set (30% of the data set, n=50) for model verification. Four different kNN models estimating overall survival were designed and optimized on the training set data (Table 5). One of the models had only clinical parameters, one had only the ESCGP signature and two had combinations of clinical parameters and ESCGP signature. In all cases, models were applied only to cases without missing data, Euclidian distance measures were used and the average survival time for the three nearest neighbors was calculated as output. The scaling of the parameters of each model was determined through an exhaustive search of all combinations of the scaling factors 1, 3 and 9. A random number generator with a similar distribution as the overall survival in the data set was used as the reference for random guess of overall survival. For all models and the random number generator, the prediction performance of the kNN models was evaluated by comparing the average and variance of the absolute prediction error. kNN is a pattern based classification tool that assigns an unknown case to the same group as the most similar reference cases, meaning that kNN can be capable of classifying data sets where there is no simple univariate relationship between gene expression and patient outcome.
Materials and Methods
Written approval from the local ethics committee was obtained for the molecular analysis of biological samples from PCa patients. This study was conducted in a stepwise manner. The procedure for selecting and verifying genes is outlined in Figure 1 and described in detail as follows.
(Enlarge Image)
Figure 1.
Outline of a stepwise gene selection process. (a) Identification of 641 embryonic stem cell gene predictors (ESCGPs) by bioinformatic analysis. Previously published data sets of whole-genome complementary DNA microarrays derived from five human ESC lines and 115 human normal tissues from various organs were retrieved from the Stanford Microarray Database (SMD). After a data-centering process, a sub-data set with expression profile of 24 361 genes in the ESC lines was isolated from the combined whole data set. A single-class significance analysis of microarray (SAM) was performed and a SAM plot was generated. The 328 genes with the highest expression levels and 313 genes with the lowest expression levels were identified, in total 641 ESCGPs. (b) Identification of 258 ESCGPs in prostate cancer (PCa). PCa ESCGPs were identified by matching the list of the 641 ESCGPs and the list of 5513 genes published by Lapointe et al. When clustering the 112 PCa tissue samples and comparing the cluster results when using all 5513 genes and when using only the 258 ESCGPs present in the data set, nearly identical results were obtained. Sample labeling: PL, lymph node metastasis; PN, normal prostate tissue; PT, prostate tumor. Three cases (marked green) were placed in different classification positions and two cases (purple) were consistently misclassified. (c) Selection of important candidate ESCGPs for clinical survival correlation. Of 258 PCa ESCGPs, 34 genes were selected by their high-ranking order in the SAM analysis identifying significant genes for the subtype classification or for the discriminating between tumor and normal samples. Of these 34 ESCGPs, 19 were selected based on their markedly different expression patterns and robust performances in RT-PCR reactions (Supplementary Figure S1). The 19 ESCGPs and the 5 reported genes were included in the optimization of the 4-plex qPCR method using RNAs from PCa cell lines. (d) Identification of the ESCGP signature in Subset 1. After the 4-plex qPCR optimization, the method was used to analyze 36 fresh–frozen fine-needle aspiration (FNA) biopsies taken from PCa patients (Subset 1). RNAs could be extracted in 28 biopsies. A series of cluster analyses using different gene combinations revealed that the ESCGP signature VGLL3, IGFBP3 and F3 classified Subset 1 samples into three subtypes with strong survival correlations. The level of gene expression increases from blue to red, whereas the delta Ct value decreases from blue to red. Gray areas represent missing qPCR data.
Identification of Candidate ESCGPs
Previously published data sets of whole-genome cDNA microarrays derived from five human ESC lines and 115 human normal tissues from various organs were retrieved from the SMD (Stanford Microarray Database) (Figure 1a). Initially we combined these two retrieved data sets, the data set of normal tissues was used to normalize the subset of ESC lines by a data-centering process, afterwards a sub-data set with whole-genome expression profile of the 24 361 genes in the ESC lines was isolated from the combined whole data set. A single-class significance analysis of microarrays (SAM) was performed using the subset of the ESC lines only, whereby all genes were ranked according to the consistency (without significant variations) of their expression levels across the ESC lines, assuming that genes with significant expression variation between ESC lines would not be critical for maintaining stem cell-like status. Significant genes were selected at delta 0.23 and q-value ≤0.05.
Selection of Candidate ESCGPs in PCa
An independent data set with 112 prostate tissue samples was used to verify the ESCGP findings and to select ESCGPs for PCa (Figure 1b). The list of genes in the published data set was matched to the list of the candidate ESCGPs identified in Step 1. This resulted in a shorter list of genes and the expression data of these genes were used to repeat the cluster analysis. The result was compared with the original cluster.
Refining ESCGPs Selection Using RT-PCR and Multiplex qPCR Analyses of three PCa Cell Lines
A SAM analysis was performed using the list of PCa ESCGPs identified in Step 2 (Figure 1c). The high-ranking ESCGPs and an additional five reported genes were selected for further analysis. A 4-plex qPCR method was optimized for the quantification of these genes by using RNAs from three PCa cell lines (LNCaP, DU145, PC-3). The procedures used for the isolation of total RNA from cell cultures and FNA cytology smears, cDNA synthesis, RT-PCR and multiplex qPCR analysis are described in the Supplementary Information file, Supplementary Tables S1–S3 and Supplementary Figures S1 and S2. For the qPCR analysis, the expression level of each gene in a sample was normalized to that of GAPDH (glyceraldehyde 3-phosphate dehydrogenase) and was presented as the delta Ct value, which is inversely correlated to the gene expression level. This delta Ct value was centered by the median value across all samples, and the centered delta Ct value was then used for the statistical and k-nearest neighbor (kNN) analyses.
Establishing of the Clinical Importance
FNA Samples: The FNA samples were collected according to the routine procedure used at the Clinical Pathology/Cytology unit at Karolinska University Hospital in Stockholm, Sweden (Figure 1d). Multiple cytology smears were obtained by prostate FNA procedure in each patient at the time of diagnosis. The representative smear was identified by examination of Giemsa-stained slides and was used for the clinical cytological diagnosis. The remaining fresh smears on glass slides that were duplicates of the Giemsa-stained slide used for diagnosis were freshly frozen and kept at –70 °C. We found that at least 80% of all cells collected from most of the PCa FNA samples were cancer cells. Of the 241 FNA samples that were collected from the patients, we obtained good-quality total RNAs from 193 samples; 189 of these samples were from patients with a diagnosis of PCa. The researcher who performed the 4-plex qPCR analyses of the FNA samples was not informed of the relevant clinical data until the complete data set was constructed from both the qPCR results and the clinical data.
Clinical Characteristics of the Cohort: The 189 PCa patients were diagnosed between 1986 and 2001. During this time period in Sweden, PCa diagnoses were mainly confirmed by prostate FNA cytology rather than by performing a multiple-core biopsy of the prostate. Elderly men without lower urinary tract symptoms were seldom tested for their serum PSA levels. The average PSA level at the time of PCa diagnosis was therefore higher during this study than the level currently observed. In this cohort, very few patients had an indolent cancer, over 50% of patients had high-grade, advanced cancer, and hormone therapy was the primary treatment for 77.9% of the patients (Table 1). With regard to comorbidity, 40% of the patients had cardiovascular disease and 9% diabetes. An internship doctor who was not informed of the results of the molecular analyses collected the relevant clinical data under the supervision of an oncologist. Information with regard to the date of diagnosis, the date of death and the cause of death for all patients was first obtained from regional or national registries and was then verified by examining the medical journals. Diagnosis and cause of death were coded according to the International Classification of Diseases (ICD9 and ICD10) recommended by the World Health Organization. PCa-specific mortality was assigned to cases where PCa or metastases were the primary or secondary cause of death. Death causes of patients are described in Supplementary Table S8. By 31 December 2008, 22 of the original 189 patients were still alive, 163 were deceased and 4 could not be found in the registries (Table 1).
Statistical Analysis. Sample Size and Design of the Subsets: The details for these procedures are provided in the Supplementary Information file. The patient cohort was divided into three subsets, according to the diagnoses and the experimental time order (Table 1). For Subset 1, we evaluated the strongest candidate genes from the ESCGP list. For Subset 2, we evaluated the most significant genes from Subset 1 and selected genes (reported genes) from the literature. For Subset 3, we tested the genes that demonstrated significance in Subset 2 and a limited number of genes that showed significance in Subset 1 but were not tested in Subset 2. A summary of the genes tested in the different subsets and complete set is shown in Table 2. Two major factors determined that not all candidate genes could be tested in every sample, that is, insufficient amount of total RNAs isolated from most FNA samples and fixed gene combinations by 4-plex qPCR.
Definition of important parameters The details and references for the parameters are provided in the Supplementary Information file.
Survival analysis The univariate and multivariate hazard ratios were calculated according to the Cox proportional hazard model using Stata statistics software (version 10.1; StataCorp LP, College Station, TX, USA). The Kaplan–Meier plots and statistical box plots were made using JMP statistics software (version 8.0.1; SAS Institute, Cary, NC, USA).
Cluster analysis Gene expression data were evaluated using the unsupervised hierarchical clustering method and the gene median-centered delta Ct values, results were visualized using Treeview software (Eisen Lab, University of California at Berkeley, CA, USA). Unsupervised hierarchical clustering is based on similarity measures and identifies clusters as groups of patterns.
Parametric model design A first-order polynomial model using the selected genes was designed based on the assumption of the Weibull distribution in Stata statistics software (StataCorp LP). Of the 95 patients for whom there was data with regard to expression pattern of the ESCGP signature, 87 had expression data and clinical information that could be used for Weibull regression survival prediction. Two models were made, one using clinical parameters alone and one combining clinical parameters and ESCGP signature. The clinical parameters included PSA level (>50 vs ≤50 ng ml), clinical disease stage (advanced vs localized), tumor grade (poorly vs well/moderately differentiated) and age at the time of diagnosis. A first-order polynomial model using the selected genes was designed in Stata statistics software (StataCorp LP).
kNN modeling The data set was randomly divided into a training set (70% of the data set, n=139) for model development and a test set (30% of the data set, n=50) for model verification. Four different kNN models estimating overall survival were designed and optimized on the training set data (Table 5). One of the models had only clinical parameters, one had only the ESCGP signature and two had combinations of clinical parameters and ESCGP signature. In all cases, models were applied only to cases without missing data, Euclidian distance measures were used and the average survival time for the three nearest neighbors was calculated as output. The scaling of the parameters of each model was determined through an exhaustive search of all combinations of the scaling factors 1, 3 and 9. A random number generator with a similar distribution as the overall survival in the data set was used as the reference for random guess of overall survival. For all models and the random number generator, the prediction performance of the kNN models was evaluated by comparing the average and variance of the absolute prediction error. kNN is a pattern based classification tool that assigns an unknown case to the same group as the most similar reference cases, meaning that kNN can be capable of classifying data sets where there is no simple univariate relationship between gene expression and patient outcome.