Descriptive statistics is the term given to the analysis of data that helps describe, show, or summarize data in a meaningful way so that, for example, patterns might emerge from the data. Descriptive statistics do not, however, allow us to draw conclusions beyond the data we have analysed or to reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.
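As a minimal sketch of what such a summary might look like, the following uses Python's standard `statistics` module. The function name `describe` and the sample values are hypothetical and invented purely for illustration.

```python
import statistics

def describe(values):
    """Return a few common descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }

# Hypothetical measurements, invented for illustration only.
sample = [4, 8, 15, 16, 23, 42]
summary = describe(sample)
print(summary)
```

These numbers only describe the six values at hand; on their own they say nothing about any larger population.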
Inferential statistics is concerned with making predictions or inferences about a population from observations and analyses of a sample. That is, we can take the results of an analysis of a sample and generalize them to the larger population that the sample represents. In order to do this, however, it is imperative that the sample is representative of the group to which the results are being generalized.
Exploratory Data Analysis
For a pilot study, a collection of data from a database, or a review of prior literature, the emphasis in analysis is on estimating means (or medians) and standard deviations (or ranges) or proportions to describe each variable, and on looking at relationships between variables with two-by-two tables and scatterplots, rather than on testing hypotheses. Estimates of expected means or mean changes, and of the standard deviations of the outcome variables or changes in outcomes, are needed as a starting point for computing the sample size required to detect differences of interest in the planned study.
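To illustrate how pilot estimates feed into a sample size computation, here is a rough sketch for comparing the means of two equal-sized groups. It assumes a two-sided test, a common standard deviation in both groups, and the normal approximation n = 2(z_{1-alpha/2} + z_{power})^2 sd^2 / delta^2. An exact calculation based on the t distribution gives a slightly larger number. The function name `n_per_group` and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison of means.

    delta: smallest difference in means worth detecting (assumed, from pilot data)
    sd:    common standard deviation of the outcome (assumed, from pilot data)
    Uses the normal approximation, so it slightly underestimates the t-based answer.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / delta) ** 2
    return math.ceil(n)

# Hypothetical pilot estimates: SD of 10 units, difference of interest of 5 units.
required = n_per_group(delta=5, sd=10)
print(required)
```

Halving the difference of interest roughly quadruples the required sample size, which is why a realistic estimate of the standard deviation matters so much at the planning stage.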
Univariate analyses are focused on looking at one variable at a time (even if the variable is a change or a ratio). These analyses will often be done using t-tests or non-parametric rank tests.
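As one hedged example, the sketch below computes a one-sample t statistic for a single variable that is itself a change score. The function name `one_sample_t` and the data are hypothetical. The Python standard library has no t distribution, so the statistic would still need to be compared against a t table or a statistics package to obtain a p-value.

```python
import math
import statistics

def one_sample_t(values, mu0=0.0):
    """Return the t statistic and degrees of freedom for H0: mean == mu0."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return (mean - mu0) / se, n - 1

# Hypothetical post-minus-pre changes for four patients.
changes = [2.0, 4.0, 6.0, 8.0]
t_stat, df = one_sample_t(changes)
print(t_stat, df)
```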
Regression analyses are used to predict one outcome measure from one or more other predictor variables. They might be used to assess inter-rater agreement, to examine the relationship between pre and post scores, or to see whether treatment outcome is related to patient demographics or pre-conditions. Usually, the relationship is assumed to be linear, but non-linear models can also be fit. Generally, the outcome will be a continuous variable, but regression methods are available for categorical or yes/no outcomes. A rule of thumb is that there should be 10-20 cases per predictor variable.
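The simplest case, a straight line fitted by ordinary least squares with one predictor, can be sketched as follows. The function name `fit_line` and the pre and post scores are hypothetical. A real analysis would also examine residuals and report standard errors, which this sketch omits.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical pre and post scores, invented for illustration only.
pre = [1, 2, 3, 4, 5]
post = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(pre, post)
print(slope, intercept)
```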
Categorical Data Analysis
When the outcome variable is dichotomous (yes/no, case/control), or categorical (an ordered scale, or categories with no natural order such as genotype, or disease, or region of origin), a coding system will need to be established, and modifications of statistical methods for continuous variables will be needed.
Survival analysis methods are commonly used for variables like time to recurrence or time to death. These methods can take into account “censoring”: for example, we might know that a patient was still alive six months after the start of treatment but, because of loss to follow-up, not know the patient’s later status. Survival analysis methods include the log-rank test and Cox proportional hazards models, which allow predictor variables to be included in the analysis.
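To show how censored observations can still contribute information, here is a small sketch of the Kaplan-Meier product-limit estimator. It is a different technique from the log-rank test and Cox model named above, and is used here only because it is the simplest illustration of censoring. The function name `kaplan_meier` and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates.

    times:  follow-up time for each patient
    events: True if the event (e.g. death) was observed at that time,
            False if the patient was censored (e.g. lost to follow-up)
    Returns a list of (event_time, estimated_survival_probability).
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Censored patients count as at risk up to their censoring time.
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up in months; the last patient was alive at 6 months
# and then lost to follow-up.
months = [2, 3, 3, 5, 6]
observed = [True, True, False, True, False]
curve = kaplan_meier(months, observed)
print(curve)
```

The censored patients never trigger a drop in the curve, but they remain in the at-risk count until the time they were last seen.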
Longitudinal and Repeated Measures Analysis
Longitudinal studies involve long-term follow-up of a cohort or panel to determine trends over time or to compare them among groups. A repeated measures study will typically involve comparison of various treatments in the same patients or repeated observations of clinical parameters over time. The simplest repeated measures study is the crossover study, in which each patient is studied under placebo as well as under treatment conditions. Balancing the order of treatments across time periods using Latin square or other balanced designs is an important part of study planning and analysis. Analyses must take into account the correlation structure of the responses.
Multi-level and Hierarchical Modeling
Multi-level and hierarchical modeling approaches are used when subjects are clustered at several levels such as patient, hospital, and region.
Multiple Imputation/Missing Data Methods
Most data sets have some missing data due to failure to record items in the patient record, missed visits, or drop-outs. If the amount of missing data is very small, it may be reasonable to analyze complete cases. Generally, though, some missing-data method or imputation (filling in an estimated value) will need to be used. The choice of approach needs careful consideration because values may be missing for a variety of reasons, and a particular method can lead to misleading results in some situations.
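The following sketch contrasts complete-case analysis with single mean imputation, a deliberately naive method chosen only to show how an imputation choice can mislead. It is not multiple imputation. The function names and the data are hypothetical.

```python
import statistics

def complete_cases(values):
    """Drop missing entries (represented here as None)."""
    return [v for v in values if v is not None]

def mean_impute(values):
    """Replace each missing entry with the mean of the observed entries."""
    fill = statistics.mean(complete_cases(values))
    return [fill if v is None else v for v in values]

# Hypothetical measurements with two missing visits.
raw = [10.0, None, 14.0, 18.0, None]
observed = complete_cases(raw)
imputed = mean_impute(raw)

print(statistics.mean(observed), statistics.stdev(observed))
print(statistics.mean(imputed), statistics.stdev(imputed))
```

The mean is unchanged, but the standard deviation shrinks after imputation because the filled-in values carry no variability. Treating them as real observations would overstate the precision of any estimate.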
Genomic, Proteomic, and Biomarker Analyses
We have experience with many types of high-throughput molecular data, including RNA-seq, ChIP-seq, microarray, protein mass-spec, microbiome sequencing, and metabolomics. These studies are commonly exploratory studies to determine whether specific genes, proteins, or biomarkers, or combinations thereof, can be used to help with disease diagnosis or outcome prediction. Typically the biggest problems are caused by small sample sizes and large numbers of potential predictors or combinations of predictors, so that results are not robust to replication. Biomarker discovery and validation studies require careful assessment of molecular data in the context of well-defined clinical diagnostic or predictive scenarios.
Count Data Models
When the outcome variable is a count, such as the number of comorbidities or the number of doctor visits during a year, statistical analysis methods using binomial or Poisson models may be required.
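As a minimal sketch of the Poisson case, the code below estimates the rate as the sample mean (the maximum likelihood estimate for a Poisson model without covariates) and evaluates the probability mass function P(K = k) = exp(-lambda) * lambda^k / k!. The function name `poisson_pmf` and the visit counts are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical numbers of doctor visits in a year for five patients.
visits = [0, 1, 2, 3, 4]
lam_hat = sum(visits) / len(visits)  # estimated mean number of visits

p_zero = poisson_pmf(0, lam_hat)     # model-based probability of no visits
print(lam_hat, p_zero)
```

A Poisson model assumes the variance equals the mean, so in practice it is worth checking the data for overdispersion before relying on it.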
Applications of Bayesian Statistics
Bayesian statistical methods make use of prior knowledge by including prior distributions in the analyses. Such modeling is of special use for sequential analysis methods and Markov chain modeling.
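One of the simplest illustrations of combining a prior with data is the conjugate Beta-binomial update for a response proportion: a Beta(a, b) prior combined with s successes and f failures gives a Beta(a + s, b + f) posterior. The function names, the prior, and the counts below are hypothetical and chosen only for illustration.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical prior: Beta(2, 2), weakly centred on a 50% response rate.
# Hypothetical data: 7 responders out of 10 patients.
post_a, post_b = beta_binomial_update(2, 2, successes=7, failures=3)

print(beta_mean(2, 2), beta_mean(post_a, post_b))
```

Because the posterior has the same form as the prior, it can serve as the prior for the next batch of patients, which is part of what makes this approach convenient for sequential analyses.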
Monte Carlo Simulations
For some study designs or types of outcome variables, standard statistical methods are not applicable and random simulations can be run to assess the likelihood of seeing results as extreme or more extreme than those observed. Such approaches usually require special programming by the statistical analyst.
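A Monte Carlo permutation test is one example of this idea. The sketch below repeatedly shuffles group labels at random and counts how often the simulated difference in means is at least as extreme as the observed one. The function name `permutation_p_value`, the number of simulations, the seed, and the data are all hypothetical choices.

```python
import random

def permutation_p_value(group_a, group_b, n_sims=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)  # random reassignment of group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_sims + 1)

# Hypothetical outcome values for two clearly separated groups.
p = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(p)
```

The estimate carries simulation error of its own, so more simulations are needed when a p-value close to a decision threshold must be pinned down precisely.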
Data mining methods are used to try to discover patterns in large data sets. Commonly used approaches are based on artificial intelligence and machine learning, or on statistical methods such as cluster analysis, decision trees, classification methods, and discriminant analysis.