FASEB J.
(The FASEB Journal. 2007;21:305-308.)
© 2007 FASEB

POINT: Statistical analysis in NIH peer review—identifying innovation

David Kaplan1

Case School of Medicine, Case Western Reserve University, Cleveland, Ohio, USA

1 Correspondence: Case School of Medicine, Case Western Reserve University, 2103 Cornell Rd., Cleveland, OH 44106-7288, USA. E-mail: david.kaplan{at}case.edu

IN THIS ESSAY/HYPOTHESIS, I assess the statistical technology used in the NIH peer review of research proposals. In the NIH system of proposal review, sample sizes are inappropriately small; samples are highly selective or nonrandom, which makes the review particularly susceptible to bias; the samples are not independently derived; the scoring precision is unrealistically high; and the arithmetic mean is the only statistical value considered. I propose a complementary peer review system that addresses these issues. Moreover, I hypothesize that innovative grant applications can be assessed in a system with robust statistical procedures by measures of dispersion such as variance and/or kurtosis. The identification of innovation has been a long-term deficiency at NIH.

The peer review of grant proposals is an essential part of the management of biomedical research. Consequently, it is reasonable to consider how statistics are used in this peer review process.

First, NIH and most funding agencies assess grant applications by requesting reviews from just two or three qualified persons. The number of samples is constrained by the large size of the grant applications and by the practice of writing proposals narrowly for extreme experts in narrowly defined fields. Second, peer review does not use random sampling of the relevant population. Instead, experts are chosen based on their perceived expertise and qualifications. Unfortunately, this sampling scheme is particularly subject to bias because the extreme experts have a vested interest in the current paradigms. Third, the review process used by NIH and others involves discussions among reviewers. These discussions are meant to provide consensus, not to assess the distribution of sensibilities among the relevant population of scientists. The scores provided in this system are not independently derived, but independence of the observations is a valuable attribute for statistical analysis. Fourth, NIH uses a 41-grade scoring system, which suggests an unrealistic degree of precision. A high-precision score is needed to distinguish among the many applications in a situation with only two or three dependent samples being collected. Finally, peer review of grant applications at NIH and most funding agencies uses a single statistic, the arithmetic mean. The use of this single statistic may be reasonable for a normal distribution of parametrically related observations; yet, scores given in peer review are not necessarily normally distributed and are definitely not parametrically related.
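The tension between a fine-grained scale and a handful of reviewers can be illustrated with the standard error of the mean. The following is a minimal Python sketch, not a reconstruction of any NIH calculation. It assumes that the 41 grades run from 1.0 to 5.0 in steps of 0.1, and it uses a hypothetical reviewer-to-reviewer standard deviation of 0.5; both values are assumptions chosen only for illustration.

```python
import math

GRADE_STEP = 0.1   # assumption: 41 grades spanning 1.0 to 5.0 in steps of 0.1
ASSUMED_SD = 0.5   # hypothetical spread of individual reviewer scores


def standard_error(score_sd, n_reviewers):
    """Standard error of the mean for n independent scores."""
    return score_sd / math.sqrt(n_reviewers)


for n in (3, 30):
    se = standard_error(ASSUMED_SD, n)
    # Express the uncertainty of the mean in units of one grade step.
    print(f"n={n:2d}: standard error = {se:.3f} ({se / GRADE_STEP:.1f} grade steps)")
```

Under these assumptions the mean of three scores is uncertain by several grade steps, whereas thirty scores bring the uncertainty to about one step. The formula presumes independent observations; scores that have been harmonized in discussion carry less information than the formula suggests.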

Although funding agencies use statistical techniques that are constrained and although they largely ignore the distribution of reviews among the relevant peer group, their system has been successful in identifying excellent grant proposals. What NIH and other funding agencies have specifically struggled with is the identification of innovative ideas.

Innovation and excellence are easily conflated; nevertheless, their distinction is essential in establishing the most potent funding policies. Excellence is the property that indicates a high degree of certainty that the proposed work will enhance our scientific understanding or capabilities. The trade-off is the inverse relationship between the degree of certainty and the degree of the enhancement. Excellence usually provides us with small steps forward. In contrast, innovation is the property that indicates novelty, which carries with it a low degree of certainty of success. Although innovative projects lack certainty, the potential for huge enhancements in our scientific understanding or capabilities makes them an attractive bet. Undoubtedly, the reason innovative and excellent proposals are readily confused is that there is some overlap in these characteristics. Innovation requires a level of excellence and excellence requires a level of innovation.

NIH has struggled with the identification of innovative grant proposals. In 1996, NIH Director Harold Varmus convened a special panel to address his concern that NIH was not succeeding in focusing on "projects that will lead to changes in how we think about science and that will encourage investigators to take more risks" (1). The panel recommended that innovation be explicitly included as a criterion for the ranking of grant proposals (1). Nevertheless, this intervention has proven to be less than successful. A mechanism to engender innovation is still on the agenda for NIH. The NIH Roadmap, a recent initiative of the current NIH Director Elias Zerhouni, was conceived in part as a mechanism to foster more innovative investigations (2). The means to identify innovative ideas continue to be a major concern of the NIH Peer Review Advisory Committee (3).

I hypothesize that statistical measures of disagreement among reviewers could be used to identify innovative proposals assuming an appropriate system of sampling and scoring is established. I propose that innovation is reflected in the controversy that an application engenders as opposed to the consensus that a proposal is excellent.

There are many statistical measures that reflect divergent opinions among reviewers. A formal measure of inter-reviewer agreement, such as Cohen’s kappa or Fleiss’ kappa, could be calculated. Nevertheless, for the purpose of this essay, I shall limit my consideration to variance and/or kurtosis. Variance is a measure of the scatter or dispersion of observations, and kurtosis assesses the peakedness of the distribution or the degree to which observations occur in the tails of the distribution.
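For readers who want the two measures in concrete form, here is a minimal Python sketch. It assumes the population (divide-by-n) definitions and reports kurtosis as excess kurtosis, that is, the fourth standardized moment minus 3, so that a normal distribution scores zero. The essay does not state which estimator is intended, so these choices are assumptions.

```python
from statistics import fmean


def variance(scores):
    """Population variance: the mean squared deviation from the mean."""
    m = fmean(scores)
    return fmean([(x - m) ** 2 for x in scores])


def excess_kurtosis(scores):
    """Fourth standardized moment minus 3 (zero for a normal distribution)."""
    m = fmean(scores)
    var = variance(scores)
    if var == 0:
        raise ValueError("kurtosis is undefined when all scores are identical")
    fourth_moment = fmean([(x - m) ** 4 for x in scores])
    return fourth_moment / var ** 2 - 3.0


# Hypothetical panel of 30 reviewers split evenly between the lowest and highest grade.
split_panel = [1] * 15 + [5] * 15
print(variance(split_panel), excess_kurtosis(split_panel))
```

An evenly split panel gives the largest variance the scale allows and the most negative excess kurtosis possible, which is the signature the hypothesis associates with controversy.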

Powerful new ideas do not elicit consensus. General concurrence about a grant application indicates that the ideas presented have already been established or that they are obvious based on our current state of understanding. Instead, innovative proposals generate considerable enthusiasm by some and equally considerable disapprobation by others. Powerful new ideas always threaten the current understanding. Some scientists are attracted to the potential for enhancement and others are more fearful that we shall subvert the current situation. Innovation elicits controversy because it challenges existing paradigms. The inability to reach consensus indicates that we have no information that represents a definitive guide; consequently, the characteristic of failing to reach consensus is reasonable to associate with novel or innovative ideas.

The history of science is replete with examples of innovative ideas that were initially treated with disdain by some and hailed by others. Rosalyn Yalow’s first manuscript describing the radioimmunoassay was vigorously rejected by one set of reviewers but others saw the promise of this innovation (4). Mario Capecchi’s idea for knock-out mice was not funded by NIH’s "dismissive" reviewers but the idea was attractive enough to find avid supporters elsewhere (5). The original proposition of speciation by natural selection put forth by Charles Darwin was met with vociferous objection as well as tenacious support (6). It is inherent for innovative ideas to engender controversy.

There is no currently accepted measure for innovation. Many scientists believe that they can reliably identify innovative proposals by themselves. However, subjective evaluations come with a significant degree of fallibility, perhaps mostly because innovation is easily confused with excellence. Excellence can be determined by independent opinions about the probability that the realization of an idea would advance our understanding. My proposal defines innovation in terms of statistical measures involving opinions about the degree of excellence. My proposal benefits from the greater clarity of opinions associated with judging the degree of excellence, from the involvement of many instead of few, and from the objectivity of statistical calculations.

With the structure of review that is currently used at NIH and other funding agencies, variance, kurtosis, and other measures of data dispersion are meaningless. NIH does not use sampling or scoring that would make statistical measures beyond the arithmetic mean useful. For variance or kurtosis to be used by a funding agency, a different system of review should be instituted. First, there should be many reviewers, all of whom have actually read the grant and formed independent opinions. Using reasonable assumptions, I have calculated a needed sample size of 30 reviewers, but this number would vary with different assumptions. Second, the characteristics of reviewers in terms of seniority and self-judged closeness to the expertise of the proposal would be defined. Third, it is not feasible to ask 30 reviewers to read 25 single-spaced pages of arcane description. The size of applications could be reduced to 1 page. I believe both excellent and innovative ideas can be explained simply, clearly, and directly within that limitation. Fourth, meetings to discuss the applications would no longer be necessary. I propose that the independence of the opinions solicited is paramount for the variance and kurtosis to be accurate reflections of the innovation of the proposal. Fifth, a simplified low-precision grading system is needed so that reviewers can accurately grade the relative quality of the proposals. Currently, NIH asks reviewers to assess applications on a scale of 41 grades. I believe this scale promotes a myth of precision that is not helpful. Instead, I propose a 5-grade scale.

Examples of potential results for 30 reviewers scoring 4 different applications on a 5-grade scale are shown in Fig. 1. The scores have been provided to illustrate the types of distributions that would be obtained with the sampling and scoring structure I have proposed. In the upper left panel, the kurtosis value is near zero, which is characteristic of a normal distribution. The upper right panel shows a distribution that is known as leptokurtic (highly peaked). This consensus distribution indicates a broadly held sensibility about a particular idea. The lower left panel shows a diffuse or flat distribution. Applications that demonstrate this type of distribution are likely to possess similar characteristics in terms of their excellence and innovation. In the lower right panel is the distribution for a controversial proposal. I hypothesize that applications that elicit a platykurtic distribution (here, with the scores accumulated at the extremes of the scale) are innovative proposals.


Figure 1. Possible distributions and associated statistical measures for the evaluation of applications with an n of 30 reviewers. Hypothetical scores of 30 reviewers with calculated means, variances, and kurtosis are shown. The upper left panel shows a normal distribution. The upper right panel shows a consensus distribution. The lower left panel shows a flat or diffuse distribution. The lower right panel shows a controversy distribution.
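The four kinds of distribution can be reproduced numerically. The Python sketch below uses invented score counts for 30 reviewers on a 5-grade scale; they are hypothetical stand-ins for the four shapes and are not the data plotted in Fig. 1. Variance is the population variance and kurtosis is excess kurtosis (normal = 0), both of which are assumptions about the intended definitions.

```python
from statistics import fmean

# Hypothetical counts of reviewers per grade (each panel sums to 30).
PANELS = {
    "bell-shaped": {1: 2, 2: 6, 3: 14, 4: 6, 5: 2},
    "consensus":   {2: 2, 3: 26, 4: 2},
    "flat":        {1: 6, 2: 6, 3: 6, 4: 6, 5: 6},
    "controversy": {1: 13, 2: 2, 4: 2, 5: 13},
}


def expand(counts):
    """Turn {grade: number_of_reviewers} into a flat list of scores."""
    return [grade for grade, n in sorted(counts.items()) for _ in range(n)]


def summarize(scores):
    """Return mean, population variance, and excess kurtosis of the scores."""
    m = fmean(scores)
    var = fmean([(x - m) ** 2 for x in scores])
    fourth_moment = fmean([(x - m) ** 4 for x in scores])
    return {
        "mean": m,
        "variance": var,
        "excess_kurtosis": fourth_moment / var ** 2 - 3.0,
    }


RESULTS = {name: summarize(expand(counts)) for name, counts in PANELS.items()}

for name, stats in RESULTS.items():
    print(f"{name:12s} mean={stats['mean']:.2f} "
          f"variance={stats['variance']:.2f} "
          f"excess kurtosis={stats['excess_kurtosis']:.2f}")
```

All four hypothetical panels share the same mean of 3, so a ranking by the arithmetic mean alone could not tell them apart; only the variance and kurtosis separate the consensus panel from the controversy panel.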

The assessment of grant applications by a variety of statistical measures would provide administrators with powerful tools for the analysis of the grant procedure itself. NIH and other funding agencies use the arithmetic mean alone in order to provide a one-dimensional, linear tool for rank ordering. With the system of sampling and scoring that I have proposed, other statistical measures could be included in the analysis. In Fig. 2, a two-dimensional grid is shown. This grid comprises axes of the mean and of variance or negative kurtosis. The mean is the excellence measure and variance or negative kurtosis is the innovation statistic according to my hypothesis. In other words, innovation may be difficult to discern with the linear tool, but readily revealed by considering additional dimensions.


Figure 2. Selection Grid. Mean and variance or negative kurtosis are used to select applications for funding.
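One way such a grid might be applied is sketched below in Python. The cutoffs, the function name `place_on_grid`, and the convention that a higher grade means a better application are all hypothetical; the essay specifies neither thresholds nor the direction of the 5-grade scale.

```python
from statistics import fmean


def place_on_grid(scores, mean_cutoff=3.5, variance_cutoff=2.0):
    """Locate one application on a mean-by-variance grid.

    Assumes higher grades are better; both cutoffs are illustrative only.
    """
    m = fmean(scores)
    var = fmean([(x - m) ** 2 for x in scores])  # population variance
    excellence = "high mean" if m >= mean_cutoff else "low mean"
    innovation = "high variance" if var >= variance_cutoff else "low variance"
    return excellence, innovation


# Hypothetical panels of 30 independent reviewers.
agreed_strong = [4] * 10 + [5] * 20   # reviewers broadly agree the idea is strong
contested = [1] * 10 + [5] * 20       # reviewers are sharply divided

print(place_on_grid(agreed_strong))
print(place_on_grid(contested))
```

With these invented scores, both applications clear the mean cutoff, but only the contested one lands in the high-variance cell that the hypothesis associates with innovation.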

As currently practiced, peer review of grant applications involves a few reviewers who actually consider each proposal and who then influence one another in a meeting in order to provide guidance to the many who did not even look at the application. This process is particularly prone to groupthink, which is a dysfunctional mode of decision-making characterized by the conforming of individual opinions to the perceived consensus of the group. Variance, kurtosis, or other measures of data dispersion in the system of peer review I have proposed would not be susceptible to groupthink or other distortions associated with reaching consensus. In this sense, it is a more objective way to assess proposals.

For scientists who have been preparing 15–25 page grant applications, a 1-page limitation may seem ridiculously short. Nevertheless, it is clearly adequate to make an argument since the currently required half-page abstract is expected to include the argument of the application. A single page will allow applicants to succinctly state the logic of the proposed research without a profusion of supplementary information that can serve to obfuscate as much as to clarify. In this way, reviewers will be able to focus on the argument presented; consequently, their assessments will be more clearly related to the logic and less influenced by extraneous information that is exceedingly difficult for reviewers to wade through and understand.

A peer review system that vitiates the bias that is related to sample selection and that obtains a more broadly representative sample of the peer group is likely to engender greater satisfaction and less cynicism among scientists. Besides the value of selecting applications for funding using a complementary structure of peer review, the sense of fairness and objectivity that the use of various statistical measures is likely to produce among investigators represents another significant subsidiary benefit.

A peer review selection system that uses a variety of statistical measures for deciding funding will be able to titrate the optimal number of innovative applications versus excellent applications to fund. The proportion of applications judged excellent by consensus as opposed to applications that engender controversy can be chosen for funding based on the specific goals of the program. If the goal of the program involves the development of an idea that has already been accepted, excellent applications, as determined by the mean, will be preferred. On the other hand, if the goal of the program focuses on new ways of thinking about a specific problem, then innovative grants, as indicated by a variability measure, will be chosen. Using a variety of statistical measures, administrators of funding agencies will have unprecedented capability in assessing applications.
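The titration described here could take many forms. The Python sketch below shows one simple possibility under stated assumptions: a program sets the share of awards reserved for high-variance applications, the remaining awards go to the highest means, and higher means are taken to be better. The function `select_portfolio`, the parameter `innovative_share`, and the application statistics are all hypothetical.

```python
def select_portfolio(stats, n_awards, innovative_share):
    """Split n_awards between top-mean and top-variance applications.

    stats maps an application id to a (mean, variance) pair.
    innovative_share is the fraction of awards reserved for high variance.
    Returns (excellent_ids, innovative_ids).
    """
    n_innovative = round(n_awards * innovative_share)
    n_excellent = n_awards - n_innovative

    # Fill the excellence slots first, ranked by mean score.
    by_mean = sorted(stats, key=lambda app: stats[app][0], reverse=True)
    excellent = by_mean[:n_excellent]

    # Fill the innovation slots from what is left, ranked by variance.
    remaining = [app for app in stats if app not in excellent]
    by_variance = sorted(remaining, key=lambda app: stats[app][1], reverse=True)
    innovative = by_variance[:n_innovative]
    return excellent, innovative


# Hypothetical (mean, variance) pairs for five applications.
APPLICATIONS = {
    "A": (4.6, 0.3),
    "B": (4.4, 0.4),
    "C": (3.2, 3.5),
    "D": (3.0, 3.8),
    "E": (2.5, 0.5),
}

print(select_portfolio(APPLICATIONS, n_awards=2, innovative_share=0.5))
```

Setting `innovative_share` to 0 reproduces a pure rank ordering by the mean, while raising it shifts awards toward applications that divided the reviewers.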

The use of objective, statistical measures to identify innovation and more generally to influence proposal selection for the allocation of research funds represents a new potential for scientific investigation. At this point, there are no data to assess my hypothesis that variance or negative kurtosis is a measure of innovation. A trial of these ideas will be valuable in assessing their validity.

The author discloses that he has no conflict of interest in the publication of this manuscript.


Figure 3.


   FOOTNOTES
 
Note: This essay is based on a presentation made on May 22, 2006, to the Peer Review Advisory Committee of the National Institutes of Health. Image: Giambattista della Porta—Fantastic Figures 1589. Courtesy Accademia Nazionale dei Lincei

The opinions expressed in editorials, essays, letters to the editor, and other articles comprising the Up Front section are those of the authors and do not necessarily reflect the opinions of FASEB or its constituent societies. The FASEB Journal welcomes all points of view and many voices. We look forward to hearing these in the form of op-ed pieces and/or letters from its readers addressed to journals{at}faseb.org


   REFERENCES

  1. NIH Office of Extramural Research (May 5, 1997) Minutes of the Meeting of the Peer Review Oversight Group (PROG). http://grants2.nih.gov/grants/peer/prog/minutes_970505.htm
  2. Zerhouni, E. (2003) The NIH Roadmap. Science 302, 63-72
  3. Armstrong, D. (September 26, 2005) Innovation Review Criterion. Peer Review Advisory Committee Meetings (PRAC), Agenda, Minutes and Presentation Materials (NIH, Office of Extramural Research). http://grants.nih.gov/grants/peer/prac/prac_sep_2005/prac_20050926_meeting.htm
  4. Yalow, R. (1978) Radioimmunoassay: a probe for fine structure of biologic systems. Science 200, 1236-1245
  5. Dennis, C. (2004) From rags to riches. Nature 430, 10-11
  6. Mayr, E. (1991) One Long Argument: Charles Darwin and the Genesis of Modern Evolutionary Thought. Harvard University Press, Cambridge, Massachusetts.

Related Articles

COUNTERPOINT: Statistical analysis in NIH peer review—identifying innovation
Thoru Pederson
FASEB J 2007 21: 309-310.

RIPOSTE: Statistical analysis of NIH peer review—identifying innovation
David Kaplan
FASEB J 2007 21: 311.



