The math works out fine for skipping dithers between frames, within reason, and it certainly does no harm to leave one frame undithered. This is because most sigma-clipping algorithms clip about the median, not the mean.
The intent of dithering is to move hot pixels around so that they land on enough different pixels that, when the frames are stacked, the value of the hot pixel falls outside the sigma-clipping threshold. The threshold is calculated by starting with the median for the stack, calculating the standard deviation (sigma) for the stack, and then adding some multiple (kappa) of sigma to the median.
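As a minimal sketch of the threshold described above (upper side only, single pass), the following takes the median and sigma as inputs rather than committing to a particular way of estimating them. The function names are made up for illustration and do not come from any stacking program; the sample values borrow the 450 ADU median and sigma of 31 used in the example that follows.

```python
def clip_threshold(median, sigma, kappa=3.0):
    """Upper rejection threshold: median + kappa * sigma."""
    return median + kappa * sigma


def split_by_threshold(values, median, sigma, kappa=3.0):
    """Separate one pixel's stack values into kept and rejected lists."""
    limit = clip_threshold(median, sigma, kappa)
    kept = [v for v in values if v <= limit]
    rejected = [v for v in values if v > limit]
    return kept, rejected


# Hypothetical pixel stack: three ordinary values and one hot pixel.
kept, rejected = split_by_threshold([440, 450, 460, 10_000], median=450, sigma=31)
print(kept, rejected)
```

Real stacking software typically clips on both sides of the median and may iterate; this sketch only mirrors the one-sided description given here.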
Consider a simple example of 20 subs that are dithered every frame. For a given pixel in the stack, the median is 450 ADU, the mean (with no hot pixels in the pixel stack) is 500 ADU, and in one frame of the stack a 10,000 ADU hot pixel appears. The mean used to calculate the sigma-clipping threshold would be (19 x 500 + 1 x 10,000)/20 = 975. Sigma is simply the square root of the mean, or about 31. If we choose the commonly used kappa of 3, the rejection threshold for that pixel is 450 + 3*31 = 543. So, clearly our 10,000 ADU pixel would be rejected and replaced with a median pixel value of 450. In fact, any pixel with an ADU greater than 543 would be rejected. This example would therefore reject any pixel 21% or more brighter than the median.
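The arithmetic above can be checked with a few lines of Python. This is only a sketch of the example as stated: it uses the post's estimate of sigma as the square root of the mean, not a sample standard deviation, and the constant names are chosen here for readability.

```python
import math

N_FRAMES = 20           # subs in the stack
BACKGROUND_MEAN = 500   # ADU, mean of the frames without a hot pixel
HOT_VALUE = 10_000      # ADU, hot pixel value
MEDIAN = 450            # ADU, median for this pixel's stack
KAPPA = 3

hot_frames = 1  # dithered every frame: the hot pixel lands here only once
mean = ((N_FRAMES - hot_frames) * BACKGROUND_MEAN + hot_frames * HOT_VALUE) / N_FRAMES
sigma = math.sqrt(mean)  # the post's simplification: sigma = sqrt(mean)
threshold = MEDIAN + KAPPA * sigma
excess_pct = 100 * (threshold / MEDIAN - 1)

print(f"mean={mean:.0f} sigma={sigma:.1f} threshold={threshold:.1f} (+{excess_pct:.0f}%)")
```

Because sigma is not rounded to 31 before multiplying, the computed threshold lands a fraction of an ADU above the 543 quoted in the text.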
Now let’s look at the same example, but dithered every 4 frames. For each set of 4 frames that are not dithered, the hot pixel will be in the same place, but for the remaining 16 frames, it will be in a different place. So for the final stack, for each pixel in which the hot pixel has appeared, the mean will be (16 x 500 + 4 x 10,000)/20 = 2400. However, for median sigma clipping, it is the median that is most important, and it is far less affected by additional outliers than the mean. In the first example, most of the pixel values were scattered about the mean with one outlier. The median is the middle ADU value when the values are sorted from lowest to highest. Since 19 of the 20 pixel values in the first example were near the mean, the median was also close to the mean. In the second example, we added three more large outliers, which changed the mean dramatically. But since 16 of the 20 pixel values in the stack are still scattered about the original mean, the middle or median pixel value will remain statistically unchanged. However, the mean is changed, which means that the sigma value will be different. In this example, the sigma is about 49, so a 3-sigma clipping threshold for this stack would be 450 + 3*49 = 597. So, the 10,000 ADU pixel clearly still gets clipped. However, it is important to note that the threshold is higher, so any pixel below 597 ADU would not get clipped. We lose the ability to clip the much dimmer bright pixels, since our threshold rises to 33% greater than the median. This is the downside to not dithering every frame.
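To make the median-versus-mean point concrete, here is a small sketch comparing a stack with one hot frame against a stack with four. It assumes an idealized stack in which every non-hot frame sits at exactly 500 ADU, so the median of the synthetic list is 500; the threshold still uses the 450 ADU median and the sqrt(mean) sigma from the example. Function names are illustrative only.

```python
import math
import statistics


def make_stack(n_frames, n_hot, background=500, hot=10_000):
    """Idealized pixel stack: flat background plus n_hot hot-pixel frames."""
    return [background] * (n_frames - n_hot) + [hot] * n_hot


def example_threshold(values, median=450, kappa=3):
    """Threshold as computed in the example: median + kappa * sqrt(mean)."""
    return median + kappa * math.sqrt(statistics.mean(values))


for n_hot in (1, 4):
    values = make_stack(20, n_hot)
    print(
        f"hot frames={n_hot} "
        f"median={statistics.median(values)} "
        f"mean={statistics.mean(values)} "
        f"threshold={example_threshold(values):.0f}"
    )
```

The median of the synthetic stack is the same in both cases while the mean more than doubles, which is why only the sigma term of the threshold moves.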
But dithering every few frames has the most value when a lot of subs are taken, because the time overhead for dithering can become significant. The more frames that are taken, the less the non-dithered frames affect the mean in the stacking calculation. Using the same example as above, but this time stacking 40 frames instead of 20, the stack mean for a pixel with a hot pixel in 4 frames of the stack would be (36 x 500 + 4 x 10,000)/40 = 1450. The sigma would be about 38, and the 3-sigma threshold would be 450 + 3*38 = 564. So our clipping threshold now drops to 25% greater than the median. It might seem that this is still marginally worse than dithering every frame, and it is, but it is important to understand that there is a fair amount of uncertainty associated with the median. Since the outliers don’t really affect the calculation of the median (within reason), the actual median selected for any given pixel stack will be subject to Poisson noise and is essentially randomly selected with a standard deviation roughly equal to the square root of the median. So, while we have been using 450 as the median in all of our examples, in reality, it could be 450 +/- 20 or so. As I hope you can see, this uncertainty is as large as the roughly 20 ADU difference in rejection threshold (543 versus 564). The bottom line is that for a sufficiently large stack, there is essentially no difference in the performance of sigma clipping between dithering every frame and skipping dithers, provided that the number of frames between dithers is reasonable.
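The effect of stack size can be sketched by repeating the calculation for several frame counts while keeping the hot pixel in 4 frames, as in the example. The stack sizes beyond 40 are extra illustrative values, not figures from the discussion, and sigma is again taken as the square root of the mean.

```python
import math


def example_threshold(n_frames, n_hot, median=450, background_mean=500,
                      hot=10_000, kappa=3):
    """median + kappa * sqrt(mean) for a stack with n_hot hot-pixel frames."""
    mean = ((n_frames - n_hot) * background_mean + n_hot * hot) / n_frames
    return median + kappa * math.sqrt(mean)


MEDIAN = 450
median_noise = math.sqrt(MEDIAN)  # rough Poisson uncertainty of the median

every_frame = example_threshold(20, n_hot=1)  # reference: dithered every frame
for n_frames in (20, 40, 80, 160):
    t = example_threshold(n_frames, n_hot=4)
    print(
        f"frames={n_frames:3d} threshold={t:6.1f} "
        f"({100 * (t / MEDIAN - 1):.0f}% above median, "
        f"{t - every_frame:+.1f} ADU vs. every-frame dithering)"
    )
print(f"median uncertainty ~ +/-{median_noise:.0f} ADU")
```

Printing the gap to the every-frame threshold next to the median uncertainty lets the two be compared directly for each stack size.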
But that’s not the end of the story. If a reasonable dithering frequency is selected (such as the four used in this example), the time saved can significantly increase the number of subs collected, particularly if the subs are short and the dithering overhead is large in comparison. In this case (which is the base case for the request to be able to dither every “X” frames) we have the potential to dramatically increase SNR in the same imaging time period with absolutely no other penalty. Additionally, the subs are likely to be of higher quality, since most of them will not follow a dither and its settling period, a time when guiding can often be poorer even after the settling threshold has been reached.
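A rough way to see the time saving is to count how many subs fit in a fixed session. Every number below (a one-hour session, 30-second subs, 15 seconds of dither-plus-settle overhead) is a made-up illustration, not a measurement, and the SNR ratio assumes the usual square-root-of-frame-count scaling for stacked subs.

```python
import math


def subs_in_session(session_s, exposure_s, dither_s, every_n):
    """Count subs that fit in a session when dithering after every Nth sub."""
    elapsed = 0.0
    subs = 0
    while elapsed + exposure_s <= session_s:
        elapsed += exposure_s
        subs += 1
        if subs % every_n == 0:
            elapsed += dither_s  # dither and settle before the next sub
    return subs


# Hypothetical session: 3600 s total, 30 s subs, 15 s per dither.
every_frame = subs_in_session(3600, 30, 15, every_n=1)
every_fourth = subs_in_session(3600, 30, 15, every_n=4)
snr_ratio = math.sqrt(every_fourth / every_frame)  # assumes SNR ~ sqrt(N)

print(every_frame, every_fourth, f"{snr_ratio:.2f}")
```

With longer subs or a shorter settle time the gap shrinks, which matches the point that the benefit is largest when subs are short relative to the dithering overhead.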
Of course, there are limitations. If one is doing long-exposure narrowband imaging where only a handful of subs are collected in a session, dithering every frame is probably essential. But in that case, the total dithering time overhead is insignificant compared to imaging time. However, for short-exposure sessions where a lot of subs are collected, skipping dithers can result in substantially more, and better-quality, subs collected with no penalty.
So, I challenge your assertion that dithering every frame is somehow the “right” way and that the math doesn’t work out for skipping dithered frames. It simply isn’t true.