Statistics
Power calculation
Assuming a moderate correlation of 0.5 between the biomarkers of interest, a sample of 333 patients from the positive group and 666 from the negative group achieves 80% power to detect a difference of 0.0422 between an area under the ROC curve (AUC) of 0.8500 under the null hypothesis and an AUC of 0.8922 under the alternative hypothesis, using a two-sided z-test at a significance level of 0.05. The data are discrete (rating scale) responses. The AUC is computed between false positive rates of 0.000 and 1.000. The ratio of the standard deviation of the responses in the negative group to the standard deviation of the responses in the positive group is 1.000.
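The arithmetic behind a power statement of this kind can be sketched as follows. This is an illustration only: it assumes a binormal variance approximation for rating-scale data and a one-sample z-test of the AUC, which may not be the exact method used by the software that produced the figures above. The function names binormal_auc_variance and power_one_auc are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def binormal_auc_variance(auc, neg_per_pos, sd_ratio=1.0):
    """Per-positive-patient variance of the AUC under a binormal model.

    Assumption: rating-scale (discrete) responses, full AUC
    (false positive rates 0 to 1).
    neg_per_pos is the number of negative patients per positive patient.
    """
    b = sd_ratio
    a = STD_NORMAL.inv_cdf(auc) * sqrt(1 + b * b)  # binormal separation
    e = exp(-a * a / (2 * (1 + b * b)))
    f = e / sqrt(2 * pi * (1 + b * b))
    g = -a * b * e / sqrt(2 * pi * (1 + b * b) ** 3)
    return (f * f * (1 + b * b / neg_per_pos + a * a / 2)
            + g * g * b * b * (1 + neg_per_pos) / (2 * neg_per_pos))


def power_one_auc(auc_null, auc_alt, n_pos, n_neg, alpha=0.05):
    """Approximate power of a two-sided z-test of auc_null against auc_alt."""
    ratio = n_neg / n_pos
    v_null = binormal_auc_variance(auc_null, ratio)
    v_alt = binormal_auc_variance(auc_alt, ratio)
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z = (abs(auc_alt - auc_null) * sqrt(n_pos) - z_alpha * sqrt(v_null)) / sqrt(v_alt)
    return STD_NORMAL.cdf(z)


power = power_one_auc(0.85, 0.8922, n_pos=333, n_neg=666)
print(f"approximate power: {power:.3f}")
```

With the figures quoted above (333 and 666 patients, AUC 0.8500 versus 0.8922) this approximation gives a power close to 80%. The correlation of 0.5 between biomarkers does not enter this one-sample sketch.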
Methods of data collection
Patients with pathologically proven bladder cancer, newly diagnosed or recurrent, will be recruited as bladder cancer patients prior to TransUrethral Resection of the Bladder (TURB), at pre-assessment clinics, as in-patients on urology wards or at Planned Cystoscopy sessions. Control patients will be recruited from haematuria clinics. Following written informed consent, the Research Nurse will complete a Recruitment Form that collects information including demographics, habits, occupations, current medications, haematuria (macro or micro), history of other cancer, renal stones and recurrent urinary tract infections. This information will be keyed into a dedicated database. We will obtain urine and blood samples from each patient. A urologist will determine the final diagnosis for each patient based on the results of all investigations.
Initial exploration of demographic, lifestyle and clinicopathological variables:
Data from the Recruitment Form will be grouped using univariate analyses prior to multivariate modelling. Where cells contain < 30 items, logical combination of groupings and re-coding will be considered. We will then investigate mean biomarker levels across groupings and explore homogeneity of variances. These preliminary analyses will inform subsequent modelling to define which factors affect individual biomarker levels and, ultimately, the diagnostic algorithm.
Creation of benchmark Proven Predicted Probability (PPP) diagnostic algorithm:
We have previously described the concept of proven predicted probability (PPP) in relation to bladder cancer (Abogunrin et al, Cancer 2012). Demographic, lifestyle and clinicopathological variables will be entered into a Forward Wald binary logistic regression analysis (cut-off probability for case classification = 0.5) to create a diagnostic algorithm based on PPP using data from 500 haematuria patients. This test algorithm will be evaluated using the data from the remaining ~499 patients. The ROC curve of PPP will act as the benchmark against which diagnostic algorithms will be compared.
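The evaluation step can be illustrated with a minimal sketch. It assumes a random split into development and evaluation sets (the protocol does not state how the 500 patients are chosen), computes the AUC as the Mann-Whitney probability that a case receives a higher predicted probability than a control, and treats a predicted probability of exactly 0.5 as a case. The logistic regression itself is not shown, and all function names are hypothetical.

```python
import random


def split_development_evaluation(patient_ids, n_development=500, seed=0):
    """Randomly split patients into a development set and an evaluation set."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_development], ids[n_development:]


def auc_from_probabilities(probabilities, labels):
    """AUC as the proportion of case/control pairs ranked correctly.

    labels: 1 for cases, 0 for controls. Ties count as half a correct pair.
    """
    cases = [p for p, y in zip(probabilities, labels) if y == 1]
    controls = [p for p, y in zip(probabilities, labels) if y == 0]
    correct = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return correct / (len(cases) * len(controls))


def classify(probabilities, cutoff=0.5):
    """Label a patient as a case when the predicted probability reaches the cut-off."""
    return [1 if p >= cutoff else 0 for p in probabilities]


development, evaluation = split_development_evaluation(range(999))
print(len(development), len(evaluation))
```

The quadratic pair count is adequate for roughly 500 evaluation patients; a rank-based computation would be preferable for much larger sets.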
Preliminary analyses to reduce dimensionality of the large biomarker panel
We will undertake a preliminary analysis using biomarker data from all 999 patients to reduce the dimensionality using t-tests and stepwise Forward Wald binary logistic regression analyses. Because the maximum number of cases will be employed, this approach can detect smaller effect sizes when identifying the biomarkers for inclusion in the subsequent analyses that create the algorithms. We will enter all biomarkers into a Forward Wald binary logistic regression analysis (cut-off probability for case classification = 0.5) to create a diagnostic algorithm. Biomarkers which do not achieve 0.1 significance following univariate analyses and do not contribute to the 999-patient equation will not be entered into subsequent regression analyses.
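The exclusion rule in the last sentence can be sketched as below. This is an illustration under stated assumptions: the univariate test is taken to be an unequal-variance t-test, its p-value is approximated with the standard normal distribution (reasonable only for large groups such as those planned here), and the set of biomarkers retained by the 999-patient regression is supplied by the caller. The function names and the in_full_model argument are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def univariate_p_value(case_levels, control_levels):
    """Two-sided p-value for a difference in mean biomarker level.

    Assumption: unequal-variance t statistic with a normal approximation
    to its distribution. Requires some variation in the levels.
    """
    se = sqrt(variance(case_levels) / len(case_levels)
              + variance(control_levels) / len(control_levels))
    t = (mean(case_levels) - mean(control_levels)) / se
    return 2 * (1 - NormalDist().cdf(abs(t)))


def screen_biomarkers(levels_by_marker, is_case, in_full_model, threshold=0.1):
    """Keep a biomarker if it reaches the univariate threshold
    or contributes to the model fitted on all patients."""
    kept = []
    for name, levels in levels_by_marker.items():
        cases = [x for x, c in zip(levels, is_case) if c]
        controls = [x for x, c in zip(levels, is_case) if not c]
        significant = univariate_p_value(cases, controls) < threshold
        if significant or name in in_full_model:
            kept.append(name)
    return kept


# Toy data with made-up marker names, for illustration only.
is_case = [True] * 5 + [False] * 5
levels = {
    "marker_a": [10, 11, 12, 13, 14, 1, 2, 3, 4, 5],
    "marker_b": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "marker_c": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
}
print(screen_biomarkers(levels, is_case, in_full_model={"marker_c"}))
```

A biomarker is dropped only when it fails both criteria, which is how the sentence above reads.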
Creation of diagnostic algorithms using binary logistic regression analyses
All the data will be divided into ten groups of 99 patients with 1:2 cases to controls plus one case in each group. We will run 10 models. Algorithms will be created as described above using cohorts of 900 patients and tested on the remaining 99 patients. We will average the AUCs and obtain a combined standard error = $0.1 \times (SE_1^2 + SE_2^2 + SE_3^2 + SE_4^2 + SE_5^2 + SE_6^2 + SE_7^2 + SE_8^2 + SE_9^2 + SE_{10}^2)^{0.1}$. For the ten algorithms, ROC curves will be generated by inserting the predicted probabilities, generated from the Forward Wald analyses, as the test variable. We will use the coordinate points of the ROC curves to determine the cut-off point. As it is expected that the key biomarkers identified from these groups may differ slightly, we will undertake a final Forward Wald binary logistic regression analysis to identify the components of the final diagnostic algorithm(s) for testing in the biochip format.
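The partitioning and averaging steps can be sketched as follows. The sketch assumes that cases and controls are shuffled and dealt out in turn so that every group keeps roughly the 1:2 ratio, and it uses Youden's index (sensitivity + specificity - 1) to pick a cut-off from the ROC coordinates; the protocol does not name a rule for the cut-off, so that choice is an assumption. Model fitting is not shown and the function names are hypothetical.

```python
import random


def stratified_groups(case_ids, control_ids, n_groups=10, seed=0):
    """Deal shuffled cases and controls into groups with a near-constant ratio."""
    rng = random.Random(seed)
    groups = [[] for _ in range(n_groups)]
    for ids in (list(case_ids), list(control_ids)):
        rng.shuffle(ids)
        for i, patient in enumerate(ids):
            groups[i % n_groups].append(patient)
    return groups


def training_and_test_sets(groups):
    """Yield (training, test) pairs, holding out one group at a time."""
    for k, test in enumerate(groups):
        training = [p for j, g in enumerate(groups) if j != k for p in g]
        yield training, test


def mean_auc(group_aucs):
    """Average the AUCs obtained on the held-out groups."""
    return sum(group_aucs) / len(group_aucs)


def youden_cutoff(probabilities, labels):
    """Cut-off maximising sensitivity + specificity - 1 (assumed rule)."""
    cases = [p for p, y in zip(probabilities, labels) if y == 1]
    controls = [p for p, y in zip(probabilities, labels) if y == 0]

    def youden(cutoff):
        sensitivity = sum(p >= cutoff for p in cases) / len(cases)
        specificity = sum(p < cutoff for p in controls) / len(controls)
        return sensitivity + specificity - 1

    return max(sorted(set(probabilities)), key=youden)


cases = [f"case{i}" for i in range(333)]
controls = [f"control{i}" for i in range(666)]
groups = stratified_groups(cases, controls)
print([len(g) for g in groups])
```

With 333 cases and 666 controls the groups produced this way contain between 99 and 101 patients each, and each training set contains the remaining patients.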
Advanced statistics
Haematuria is associated with a wide range of final diagnoses, each of which is composed of complex biological components with crosstalk at multiple levels through interacting pathways. These pathways do not operate in isolation; an alteration in one leads to changes in protein-protein interactions, transcription and/or translation in others. Each cell’s phenotype and underlying functionality depend on its internal interactions between genes, gene products, bound and unbound proteins and its extracellular microenvironment. Unravelling these complexities to identify the causal pathway interactions of the disease is beyond human intuitive reasoning. There is a need to enhance our understanding of basic biological mechanisms, including the pathogenesis of complex diseases which, in contrast to monogenic diseases like cystic fibrosis, are not caused by the malfunctioning of individual genes.
In order to gain insight into the heterogeneity of the population of haematuria patients, captured by the measured biomarkers, we will study the classification ability of a set of biomarkers with respect to different clinically relevant patient subpopulations. Further, we will study estimates of the dimension of the biomarker feature space for clinically relevant patient subpopulations. Based on this, we will estimate patient subpopulations from the dataset itself, instead of using known clinical information about the patients to obtain such a grouping.
For our analysis, we will develop novel statistical and computational methods to extract reliable and robust information from the experimental data. Specifically, our procedures will combine hierarchical clustering with a random forest classifier.
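As an illustration of the clustering component only, the sketch below groups patients by agglomerative (bottom-up) hierarchical clustering of their biomarker vectors. The choice of Euclidean distance and average linkage is an assumption, as the methods to be developed are not specified here, and the random forest step is not shown. The function name is hypothetical.

```python
from math import dist


def average_linkage_clusters(biomarker_vectors, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Cluster distance is the mean Euclidean distance between members
    (average linkage). Returns lists of patient indices.
    """
    clusters = [[i] for i in range(len(biomarker_vectors))]

    def linkage(a, b):
        total = sum(dist(biomarker_vectors[i], biomarker_vectors[j])
                    for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters among all pairs.
        _, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy example: four patients, two biomarkers each (made-up values).
patients = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(average_linkage_clusters(patients, n_clusters=2))
```

This naive version recomputes all pairwise linkages at every merge, which is acceptable for a small example but slow for a cohort of 999 patients.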