Introduction to parallelism testing in potency assays
Posted by Dan in Parallelism
In a previous blog post I stated that a potency value doesn’t mean anything unless the curves of the reference standard and the unknown have exactly the same shape.
This condition is known as parallelism (or more correctly mathematical similarity). While that may sound completely logical and simple, the real world (as always) is more complicated. When we estimate potency we have to rely on a statistical model of the underlying data. The four parameter logistic model is a common choice in potency assays. This model is fitted to our data using a regression algorithm of some sort.
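For readers who want to see the model written out, here is a minimal sketch of one common parameterization of the four parameter logistic curve. The function and argument names are my own illustration, and other parameterizations (for example in log-dose) are also in use.

```python
def four_pl(dose, lower, upper, slope, ec50):
    """One common form of the four parameter logistic (4PL) curve.

    lower, upper: the two asymptotes
    slope:        steepness of the curve (Hill slope)
    ec50:         dose that gives a response halfway between the asymptotes
    dose must be greater than zero.
    """
    return lower + (upper - lower) / (1.0 + (ec50 / dose) ** slope)


# At dose == ec50 the response sits exactly halfway between the asymptotes
midpoint = four_pl(dose=10.0, lower=0.0, upper=100.0, slope=1.5, ec50=10.0)
```

With a positive slope the response rises from the lower asymptote at low doses to the upper asymptote at high doses.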
The raw data gathered in an assay and then used in the regression is a sampling of all of the possible data points at each dilution. Because of this limitation, the model itself is only an estimate of the "true" underlying curves of our experimental system or assay.
The consequence of this reality is that we have two estimated curves and we are trying to use them to tell if the underlying "true" curves (which we don’t know) are really parallel. That’s not an easy thing to do with certainty. For example, is either of these pairs of curves parallel? How do we know for sure?
There are many different approaches available to answer this question. The simplest method is just to look at the curves. If you are doing investigational work and you’re fairly familiar with your assay, this may be all you need. However, in a more regulated environment you will probably need something a little less subjective. In general, there are two philosophies on how to measure parallelism using statistical methods: difference testing and equivalence testing. Let’s discuss each of these in more detail.
Difference testing
Difference testing relies on the creation of a metric for the measure of parallelism. In theory, such a metric should scale with the degree of non-parallelism. In other words, the less parallel the curves are, the larger the metric.
Let’s walk through how these metrics are derived. In potency testing we fit two models to derive parallelism data, the full model and the reduced model.
In the full model, we fit independent parameters for our reference and sample curves. If we use a four parameter logistic model and estimate the upper and lower asymptotes, the slope, and the EC50 for each curve independently, we have a total of eight parameters. This model is illustrated in the first graph below. Notice how the pair of curves have different shapes. In this graph the two curves clearly have different upper asymptotes and are therefore not parallel. Also notice how the two curves fit their own underlying data fairly well. This is the "full" model:
In the reduced model, we only allow there to be one common set of upper and lower asymptotes and slope. Only the EC50 parameter is allowed to be unique to each curve. This situation is illustrated in the graph below. It is this model we use to estimate potency. Notice how the two curves have the same shape, but they don’t fit the data as well:
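To make the two models concrete, here is a rough sketch of how the full and reduced fits could be set up with NumPy and SciPy. Everything specific in it is an assumption for illustration: the data are simulated, the 4PL is written in log-dose, and the starting values are hand-picked. It is not the software or the data behind the graphs in this post.

```python
import numpy as np
from scipy.optimize import least_squares


def four_pl(log_dose, lower, upper, slope, log_ec50):
    # 4PL in log-dose, so a potency change is a horizontal shift of the curve
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (log_dose - log_ec50)))


def full_residuals(p, x_ref, y_ref, x_smp, y_smp):
    # Full model: eight parameters, four for each curve
    return np.concatenate([
        y_ref - four_pl(x_ref, p[0], p[1], p[2], p[3]),
        y_smp - four_pl(x_smp, p[4], p[5], p[6], p[7]),
    ])


def reduced_residuals(p, x_ref, y_ref, x_smp, y_smp):
    # Reduced model: shared asymptotes and slope, one log EC50 per curve
    return np.concatenate([
        y_ref - four_pl(x_ref, p[0], p[1], p[2], p[3]),
        y_smp - four_pl(x_smp, p[0], p[1], p[2], p[4]),
    ])


# Simulated data: the sample has a lower upper asymptote than the reference
rng = np.random.default_rng(0)
x = np.tile(np.log(np.array([0.1, 0.3, 1, 3, 10, 30, 100, 300])), 3)
y_ref = four_pl(x, 0.1, 2.0, 1.2, np.log(5.0)) + rng.normal(0, 0.03, x.size)
y_smp = four_pl(x, 0.1, 1.6, 1.2, np.log(8.0)) + rng.normal(0, 0.03, x.size)

data = (x, y_ref, x, y_smp)
full = least_squares(full_residuals, [0, 2, 1, 1.5, 0, 2, 1, 1.5], args=data)
reduced = least_squares(reduced_residuals, [0, 2, 1, 1.5, 1.5], args=data)

# Sum of squared residuals at each solution
sse_full = float(np.sum(full.fun ** 2))
sse_reduced = float(np.sum(reduced.fun ** 2))
print(sse_full, sse_reduced)
```

Because the reduced model is a constrained version of the full model, its sum of squared residuals should come out at least as large as that of the full model.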
We can use these two graphs to generate a metric for parallelism. The first thing we can do is to calculate what’s called the residual or error for each data-point. The residual is simply the distance from the data point to the curve:
If we compare the two graphs of the full and reduced model above, it becomes clear that the full model, with its extra parameters, will always fit the data at least as well as the reduced model. Individual residuals can go either way, but taken together the squared residuals of the full model can never add up to more than those of the reduced model. We can use this information to generate a metric.
First, let’s square all of the residuals (so that positive and negative residuals don’t cancel out) and sum those squares. This number is known as the sum of squared errors (SSE). If we do this for both models, we have two SSE values, one for the full model and one for the reduced model. As I mentioned above, the SSE for the full model is always smaller than or equal to the SSE for the reduced model.
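As a small illustration of the arithmetic, the SSE can be computed in a few lines. The observed and fitted values below are made up for the example.

```python
def sum_squared_errors(observed, predicted):
    """Sum of squared residuals between the data points and the fitted curve."""
    return sum((obs - pred) ** 2 for obs, pred in zip(observed, predicted))


# Hypothetical responses and the fitted values from each model
observed = [0.12, 0.48, 1.10, 1.75, 1.96]
fitted_full = [0.10, 0.50, 1.12, 1.72, 1.97]
fitted_reduced = [0.15, 0.42, 1.18, 1.68, 2.02]

sse_full = sum_squared_errors(observed, fitted_full)        # about 0.0022
sse_reduced = sum_squared_errors(observed, fitted_reduced)  # about 0.0194
```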
One common use of these metrics is to use an F-test for parallelism. The following formula is used to calculate an F-statistic:
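The formula image from the original post did not survive here. The standard extra-sum-of-squares form of the statistic, which matches the description that follows (the SSE of the full model sits in the denominator), can be written as below. The degrees-of-freedom symbols are my own notation.

```latex
F = \frac{\left(SSE_{reduced} - SSE_{full}\right) / \left(df_{reduced} - df_{full}\right)}
         {SSE_{full} / df_{full}}
```

Here each df is the number of data points minus the number of fitted parameters in that model. For two four parameter logistic curves the full model has eight parameters and the reduced model has five, so the numerator has three degrees of freedom.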
As its name indicates, this statistic is distributed according to the F-distribution. We can therefore use it to set up a hypothesis test of parallelism. The null hypothesis is that the curves are not different (notice that I didn’t say that they are the same), and the alternative hypothesis is that they are different. We then generate a p-value and compare it to a cutoff (usually 0.05) that helps us decide whether to reject the null hypothesis. If the p-value falls below the cutoff, we reject the null hypothesis and say that the curves are different and therefore not parallel.
However, as many different authors before me have noted, there’s a weakness to this approach that’s hard to overcome. In the formula for the F-statistic, the SSE for the full model appears in the denominator. So what happens if you have a very precise assay and the full model has a very low SSE? You are then dividing by a very small number and the F-statistic gets very large. This situation can lead to false positives for lack of parallelism. In effect you are punished for having a very precise assay that follows the model very closely. The differences between the curves may be small, but your good assay was able to detect them. In a highly variable assay the opposite occurs: you will accept many more assays just because you don’t have the precision to tell if they are parallel or not.
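A quick numerical sketch shows the effect. It assumes the standard extra-sum-of-squares F-statistic for two four parameter logistic curves (eight parameters in the full model, five in the reduced model), and the SSE values are invented for the example.

```python
def f_statistic(sse_full, sse_reduced, n_points, p_full=8, p_reduced=5):
    """Extra-sum-of-squares F-statistic.

    F = ((SSE_reduced - SSE_full) / (p_full - p_reduced))
        / (SSE_full / (n_points - p_full))
    """
    df_full = n_points - p_full
    df_extra = p_full - p_reduced
    return ((sse_reduced - sse_full) / df_extra) / (sse_full / df_full)


# Same absolute lack of fit (an SSE difference of 0.015) at two precisions
noisy = f_statistic(sse_full=0.400, sse_reduced=0.415, n_points=48)    # about 0.5
precise = f_statistic(sse_full=0.004, sse_reduced=0.019, n_points=48)  # about 50
```

The difference between the curves is identical in both cases, yet the precise assay produces an F-statistic roughly a hundred times larger.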
This situation has been remedied by the use of a chi-square statistic. The formula for calculating it is as follows:
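The formula image from the original post did not survive here. My understanding of the usual form of this statistic, consistent with the description that follows (no division by the SSE of the full model, and a dependence on correct weighting), is the difference between the weighted SSE of the two models. Treat the notation below as an assumption rather than a reproduction of the original.

```latex
\chi^2 = SSE^{w}_{reduced} - SSE^{w}_{full},
\qquad
SSE^{w} = \sum_i \frac{\left(y_i - \hat{y}_i\right)^2}{\sigma_i^2}
```

Here \sigma_i^2 is the variance of the response at data point i, and the degrees of freedom equal the difference in the number of fitted parameters (three for two four parameter logistic curves).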
Again, this statistic follows the distribution it’s named after. The same strategy we employed above can be used to set up a hypothesis test for parallelism using the chi-square metric. Since this metric doesn’t rely on dividing by the SSE of the full model, it doesn’t suffer from the same issues with assay precision that the F-statistic does.
Unfortunately, there are still some potential problems with using this approach. First, the regression has to be perfectly weighted in order for this statistic to be perfectly chi-square distributed. Perfect weighting is difficult to achieve.
But beyond the weighting issue, there is also a philosophical problem with this approach. These types of tests are measuring whether the shapes of the curves are different, but what we need to know for potency is whether they are actually the same. Not being different is not the same as saying they are equivalent. We may simply not have good enough information to tell that they are different.
Equivalence testing
Parallelism testing for potency assays has recently switched to focus on testing for curve equivalence rather than difference. How does this work? This approach requires us to set a limit on a specific assay parameter that we are willing to accept.
For example, we can say that as long as the ratio of the slopes of the two curves is between 0.8 and 1.25 we will accept the assay. We can then fit the two curves independently (full model) and calculate a confidence interval on the slope ratio. If the confidence interval on this metric is contained within the two limits, we say that the curves are equivalent based on our criteria.
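Here is a minimal sketch of that decision rule. The slope estimates and standard errors are hypothetical, and the interval uses a simple delta-method approximation on the log scale with a normal quantile. That is one convenient choice for illustration, not a recommendation for any particular assay.

```python
import math
from statistics import NormalDist


def slope_ratio_interval(slope_ref, se_ref, slope_smp, se_smp, level=0.90):
    """Approximate confidence interval for slope_smp / slope_ref."""
    log_ratio = math.log(slope_smp / slope_ref)
    # Delta-method standard error of the log ratio (independent estimates)
    se_log = math.sqrt((se_ref / slope_ref) ** 2 + (se_smp / slope_smp) ** 2)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.exp(log_ratio - z * se_log), math.exp(log_ratio + z * se_log)


def is_equivalent(interval, lower_limit=0.8, upper_limit=1.25):
    # Equivalent only if the whole interval sits inside the limits
    low, high = interval
    return lower_limit <= low and high <= upper_limit


# Same slope estimates, two different levels of precision
precise = slope_ratio_interval(1.20, 0.05, 1.26, 0.05)
noisy = slope_ratio_interval(1.20, 0.15, 1.26, 0.15)
print(is_equivalent(precise), is_equivalent(noisy))
```

With these made-up numbers the precise run passes and the noisy run fails, even though the estimated slope ratio is the same in both.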
This type of test has two consequences. First, we can say that the curves are actually equivalent instead of "not different". Second, we are no longer punished if the assay is “too” precise, since all that will do is make our confidence interval shorter. Let’s see what this looks like graphically:
As you can see, this type of test makes intuitive sense since we can set our limits based on our knowledge of the assay system without a large data set for determining statistically derived limits. It also helps with false positives. In difference testing you have to accept that, based on chance alone, you will reject some runs that were truly parallel. This is less likely in equivalence testing since the decision no longer rests on a test for difference.
So why not use equivalence testing for everything? In a simple, linear assay, I would encourage this approach since it’s easy to calculate the confidence intervals for each parameter in the regression. However, in a non-linear regression the confidence intervals for the model parameters cannot be solved independently. The joint confidence regions have very complex shapes and in some cases extend to infinity.
So for now, we are stuck with difference testing for more complex models. I’ve recently heard about some interesting work being done that may solve this problem, but I’m sworn to secrecy… As soon as this work is completed and published in a public forum, I will discuss it here on the blog.
I hope this has been an easy-to-understand introduction to parallelism testing. If you want to read a little more about these topics, here are two journal articles I recommend to get you started:
http://www.ncbi.nlm.nih.gov/pubmed/15971545
PDA J Pharm Sci Technol. 2005 Mar-Apr;59(2):127-37.
Assessing parallelism prior to determining relative potency.
Hauck WW, Capen RC, Callahan JD, De Muth JE, Hsu H, Lansky D, Sajjadi NC, Seaver SS, Singer RR, Weisman D.
http://www.ncbi.nlm.nih.gov/pubmed/15920890
J Biopharm Stat. 2005;15(3):437-63.
Measuring parallelism, linearity, and relative potency in bioassay and immunoassay data.
Gottschalk PG, Dunn JR.
As always, thanks for reading!
Dan
Bootstrap method for setting parallelism metric cutoff
Posted by Dan in Parallelism
Assessing parallelism in potency assays is a topic that is currently hotly debated. In most cases, this assessment is performed using some sort of metric that is calculated for each assay. A cutoff is then established on this metric to determine if two samples are parallel or non-parallel. Establishing that cutoff can be a very difficult exercise.
This is a slide deck of a presentation I gave on a novel method we developed for setting the cutoff: Joelsson – Bootstrap parallelism method
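The slides are not reproduced here, so the following is not the method from the presentation. It is only a generic sketch of how resampling can be used to put a cutoff on a metric: take metric values from historical runs considered acceptable, resample them with replacement, and estimate an upper quantile. The data, the quantile, and the function name are all hypothetical.

```python
import random


def bootstrap_cutoff(metric_values, quantile=0.95, n_resamples=2000, seed=1):
    """Bootstrap estimate of an upper quantile of a parallelism metric."""
    rng = random.Random(seed)
    n = len(metric_values)
    index = min(n - 1, int(quantile * n))
    estimates = []
    for _ in range(n_resamples):
        # Resample the historical values with replacement
        resample = sorted(rng.choices(metric_values, k=n))
        estimates.append(resample[index])
    estimates.sort()
    # Use the median of the resampled quantiles as the cutoff
    return estimates[n_resamples // 2]


# Hypothetical metric values from historical runs judged acceptable
historical = [0.8, 1.1, 0.4, 2.3, 1.7, 0.9, 3.1, 1.2, 0.6, 2.0,
              1.5, 0.7, 2.8, 1.0, 1.9, 0.5, 1.3, 2.5, 1.4, 0.3]
cutoff = bootstrap_cutoff(historical)
```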
Thanks for reading!
Dan
Before I wrote any of the other posts on this blog, I should have written a post on what exactly a potency assay is. To make up for this oversight, here is that post now.
In the context of biologics development, potency can be defined as the ability of a drug or treatment to elicit a particular response at a certain dose. In other words, potency is a measure of a drug’s activity in a biological system.
If possible to perform, the purest readout of potency is a direct measure of clinical efficacy. However, this method makes for a very inconvenient way to assay the material. That’s why we need a potency assay. The goal of the potency assay is to measure the material in a more convenient system. The potency assay should also be a predictor of the activity of the drug in the final recipient. In order to do that, the potency assay has to be based on a biologically relevant measure. You need to be able to draw on scientific knowledge and on clinical efficacy data to make that link. Potency is always a relative measure. We calculate potency based on the activity of a reference standard (often with a link to a clinical result).
Ok, that’s a clear definition, but what does that mean in practice? It means that at each dose of an assay the shift in the relative response between the sample and reference should be the same. Let’s look at a couple of pictures to make this clearer:
As you can see in these two images, whether the dose response curve is linear or not, potency is nothing more than a shift in the activity between the reference and the sample. If the sample has more activity than the reference at a given dose, it is said to be more potent, and if it has less activity it is said to be less potent.
The measure of potency is calculated as a ratio of activity between the two curves, usually expressed as a percentage. In the example above, the potency of the sample is 200% since the horizontal distance between the two curves corresponds to a two-fold shift in dose.
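As a small worked example, relative potency can be computed from the EC50 values of two parallel curves. The function name and the numbers are mine, and the calculation assumes both EC50 values are in the same dose units.

```python
def relative_potency(ec50_reference, ec50_sample):
    """Relative potency in percent from the horizontal shift between
    two parallel dose response curves."""
    return 100.0 * ec50_reference / ec50_sample


# The sample reaches the half-maximal response at half the reference dose
potency = relative_potency(ec50_reference=10.0, ec50_sample=5.0)  # 200.0
```

A sample that needs only half the dose to give the same response is twice as potent as the reference, which is where the 200% comes from.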
There you have it, this is what we talk about when we discuss potency assays.
As you might have noticed from the images above, the calculation of potency is only meaningful if the two curves have exactly the same shape; otherwise the distance between them won’t be constant. This property is called parallelism or similarity. This restriction causes problems in the real world since potency assays can often be quite variable and we therefore only have an estimate of the true dose response curve. We therefore have to make sure that we can say with a particular level of certainty that the true curves are actually parallel. But that discussion deserves a post of its own at a later date.
Thanks for reading,
Dan
