
What’s wrong with multicenter trials?

Assessing the assessments in psychopharmacology

Vol. 5, No. 3 / March 2006

Did you know that many clinical studies of drugs with known efficacy fail to show outcomes better than placebo? These “failed trials” are an expensive way of saying “no information,” and they have become a big nuisance for psychopharmacology research.1 Failed trials seem to be escalating in frequency, thereby increasing the expense and controversy of testing new agents (Box 1).

Box 1

Failed trials: How they differ from positive and negative trials

Positive trial. If a new drug is being tested against a standard drug with known efficacy and against placebo, a good outcome for the new drug is to match the efficacy of the known standard and for both drugs to beat placebo. This is a positive trial for the standard drug and new drug; they both work.

Negative trial. If the standard drug beats the placebo but the new drug does not, this is a negative trial for the new drug; it does not work, at least in this trial. Perhaps it lacks efficacy or the dosage is wrong, but some problem must exist. A negative trial is the method used to prove a new drug’s lack of efficacy.

Failed trial. If the standard drug, new drug, and placebo have the same outcomes, that is a failed trial—the bane of clinical psychopharmacology. Another study of the new drug is required because only positive or negative trials are useful in evaluating whether a drug works or whether the FDA will approve it. Thus, a failed trial is an expensive way of saying “no information.”


Many failed trials are not published because sponsors, authors, and journals are not interested in publishing such information. The published literature, therefore, leaves the mistaken impression that new and standard drugs work more consistently in clinical trials than they really do. Now that the pharmaceutical industry and FDA have agreed to post all clinical study results on various Web sites, the true outcomes of new drug testing may become more transparent.

How could an effective drug not beat placebo? Some naysayers contend that standard drugs, such as antidepressants, do not work much better than placebo anyway. Others argue that:

  • new investigators and clinical raters entering pharmaceutical research can inflate ratings
  • pressure from sponsors for rapid enrollments may lead to symptomatic volunteers instead of real patients being enrolled in clinical trials2
  • after enrollment, patients who may not have had access to psychiatric care before the clinical trial receive a lot of free attention, which may enhance the placebo effect.

Shifting patients and raters. Patient heterogeneity may also be a factor in failed trials. Psychopharmacology trials have shifted from a few small, regionally specific studies to many large, multinational clinical trials. Once held predominantly in North America and Europe, trials now take place in more diverse areas, such as South America, Russia, Eastern Europe, and India.

Because clinical trials are being outsourced, like many other services in the United States, they are no longer rated predominantly by research-trained American psychiatrists. The new trial raters’ clinical and research experience is more heterogeneous; some work for an experienced investigator but have no clinical training themselves.

Whatever the reasons, failed trials increase the cost and delay the development of new drugs. The risk of generating ambiguous results from failed trials is discouraging sponsors from studying drugs with novel mechanisms of action. Many new trials are testing relatively non-innovative active metabolites of known therapeutic agents, active enantiomers, and controlled-release formulations. There is less risk that these drugs do not work, even if failed trials fall short of confirming their efficacy.


New uses of traditional rating scales can also challenge reliable patient assessment. For example, the Positive and Negative Syndrome Scale (PANSS) has been used for decades to assess the effectiveness of antipsychotics in improving positive symptoms in acutely psychotic patients. However, some recent study designs use the PANSS to study only negative symptoms or the effectiveness of adjunctive, nonantipsychotic agents in stable patients. And although some raters may have experience rating patients with the PANSS, they do not necessarily have experience using the scale in these unorthodox and experimental ways.

As ingenuity informs study design and therapeutic focus sharpens, “gold standard” scales may need to be retooled.3 However, consensus in developing and accepting new patient-rating instruments remains difficult to achieve. Scales that accommodate issues of growing clinical interest are greatly desired in depression research, for example. These would include tools with increased sensitivity for:

  • onset of action
  • the sequence of therapeutic effects on discrete symptom domains
  • other behavioral characteristics.

Potential tools being tested include a newly standardized Hamilton Rating Scale for Depression (GRID-HAM-D)3 and the Video Interview Behavior Evaluation Scale (VIBES).4


Blaming scientists or clinical rating scales for imprecise outcome measurements is unproductive. Perhaps we should concentrate on factors we can change, such as reducing inconsistencies among clinical raters. This can be done through training programs that enhance adherence to expert scale calibrations or that foster consensus on how to reduce variance in how a scale is applied.

A trial’s power to differentiate a drug from placebo is related to the number of patients studied, the variance of the measurement made, and the effect size of the drug on the measurement. A drug’s effect size is relatively fixed, related to its inherent properties, and therefore not amenable to being modified in a clinical trial.

As variance in clinical measurements increases, statistical power decreases (unless the number of patients studied increases), and the number of failed studies grows. Conversely, if rating variance can be reduced, the ability to distinguish differences among treatment groups increases for a given patient population size.
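The trade-off described above can be sketched numerically. The following is a minimal illustration using the standard normal approximation for the power of a two-arm comparison; the effect sizes, sample size, and significance level are illustrative assumptions, not values from any particular trial:

```python
import math

def approximate_power(effect_size: float, n_per_arm: int, alpha_z: float = 1.96) -> float:
    """Normal-approximation power for a two-arm trial.

    effect_size: standardized drug-placebo difference (mean difference / SD).
    n_per_arm:   patients per treatment arm.
    alpha_z:     critical z for a two-sided test at alpha = 0.05 (approx. 1.96).
    """
    # Standard normal CDF via the error function
    def phi(x: float) -> float:
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    # z-statistic expected under the alternative hypothesis
    noncentrality = effect_size * math.sqrt(n_per_arm / 2.0)
    return phi(noncentrality - alpha_z)

# Doubling rater variance halves the standardized effect size
# (same drug, same patients), and power collapses:
print(round(approximate_power(0.4, 100), 2))  # low rating variance: ~0.81
print(round(approximate_power(0.2, 100), 2))  # doubled rating SD:   ~0.29
```

With 100 patients per arm, an effect size of 0.4 gives roughly 80% power, but if inconsistent raters double the measurement noise, the same drug yields under 30% power and the trial is likely to fail.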

Rating the rater. If non-uniform patient assessments in CNS clinical trials lead to unreliable data and inconclusive results, what can be done? Is it possible to use adult education principles to change clinical raters’ behaviors and diminish the variance from rater to rater? Educational techniques that target rater proficiency and lead to certification have begun to address problems with reliability in patient assessment5 (Box 2).

Box 2

Three take-home points

  • Measuring drugs’ effects in psychopharmacologic trials is not the same as measuring them in clinical practice.
  • Patient and rater variability, imperfect measuring instruments, fluctuating symptoms, and effects of participating in a study contribute to difficulties in proving that an effective drug is more efficacious than placebo.
  • Achieving consensus among clinician raters through educational programs on how to apply the clinical rating scales can reduce variability and enhance the power of multicenter trials.

Related resources

  • Kobak KA, Lipsitz JD, Williams JB, et al. A new approach to rater training and certification in a multicenter trial. J Clin Psychopharmacol 2005;25(5):407-12.
  • Engelhardt N, Feiger AD, Cogger KO, et al. Rating the raters: assessing the quality of Hamilton Rating Scale for Depression clinical interviews in two industry-sponsored clinical drug trials. J Clin Psychopharmacol 2006;26(1):71-4.


Dr. Stahl receives grant/research support or serves as a consultant to Asahi, AstraZeneca Pharmaceuticals, Avanir, Boehringer Ingelheim, Cephalon, Bristol-Myers Squibb Co., Cyberonics, Cypress Bioscience, Pierre Fabre, Forest Laboratories, GlaxoSmithKline, Janssen Pharmaceutica, Otsuka, Eli Lilly & Co., Nova Del Pharma, Pfizer, Sanofi Synthelabo, Sepracor, Shire Pharmaceuticals, Solvay Pharmaceuticals, and Wyeth.


Arbor Scientia staff writer Darius Shayegan co-authored this article.

Adapted and reprinted with permission from PsychEd Up: Psychopharmacology Educational Update 2005;1(7):6-7. Copyright 2005, NEI Press.


1. Klein DF, Thase ME, Endicott J, et al. Improving clinical trials: American Society of Clinical Psychopharmacology recommendations. Arch Gen Psychiatry 2002;59:272-8.

2. Kobak KA, Engelhardt N, Williams JB, Lipsitz JD. Rater training in multicenter clinical trials: issues and recommendations. J Clin Psychopharmacol 2004;24(2):113-7.

3. Bagby RM, Ryder AG, Schuller DR, Marshall MB. The Hamilton Depression Rating Scale: has the gold standard become a lead weight? Am J Psychiatry 2004;161(12):2163-77.

4. Katz MM, Houston JP, Brannan S, et al. A multivantaged behavioural method for measuring onset and sequence of the clinical actions of antidepressants. Int J Neuropsychopharmacol 2004;7(4):471-9.

5. Shayegan DK, Stahl SM. Enhancing inter-rater reliability utilizing multimedia and distance learning. Paper presented at: XXIII Congress, Collegium Internationale Neuro-psychopharmacologicum (CINP); June 23-27, 2002; Montreal, Canada.
