Wednesday, July 7, 2010

K-S testing.

I tried the K-S test on the data to see whether it agrees with an $r^{-1/2}$ density profile. I had a fair bit of trouble with this K-S test stuff. At first I blamed the Python implementation, but eventually I tested it by generating normally distributed random numbers and passing it the cumulative distribution function cdf(x)=$\frac{1}{2}\left[1+\mathrm{erf}(x/\sqrt{2})\right]$, and got reasonable agreement ($D$, $p$ = (0.0583, 0.885) for a sample of 100 numbers). So I am reasonably sure now that it works.
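The sanity check looked roughly like this (a minimal sketch, assuming SciPy's scipy.stats.kstest rather than my exact script):

```python
# Sanity check of the K-S machinery: draw normally distributed numbers and
# test them against the analytic normal CDF given above.
import numpy as np
from scipy.special import erf
from scipy.stats import kstest

def normal_cdf(x):
    # cdf(x) = (1/2) * [1 + erf(x / sqrt(2))]
    return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

sample = np.random.normal(size=100)
D, p = kstest(sample, normal_cdf)
print(D, p)  # a well-behaved sample gives something like D ~ 0.06, p ~ 0.9
```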

The bottom line, I guess, is that the K-S test is concluding that the data are probably not drawn from an $x^{-1/2}$ distribution. I searched around the vicinity of $-0.5$ and found the local maximum of the $p$ value coming out of the K-S test. Here is a plot of the cumulative distribution functions, and the $D$/$p$ values that come out (for the best scaling). I plotted a scaling of $-0.5$ for comparison. N.B. the labeling for these models is $\rho(x)\propto x^{-expt}$.
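For concreteness, the exponent scan was along these lines (a sketch, not my actual code; the array name radii and the data file are stand-ins for the sample):

```python
# Scan trial exponents near 0.5 and K-S test the sample against the
# power-law CDF (x/x_max)^(1-expt), keeping the p value for each trial.
import numpy as np
from scipy.stats import kstest

radii = np.loadtxt("radii.dat")  # hypothetical file holding the sample
x_max = radii.max()

def powerlaw_cdf(x, expt):
    return (x / x_max) ** (1.0 - expt)

expts = np.linspace(0.3, 0.7, 81)
pvals = np.array([kstest(radii, powerlaw_cdf, args=(e,))[1] for e in expts])
best = expts[pvals.argmax()]
print(best, pvals.max())
```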



This apparently means that even for the local maximum value of $p$, there is only about a 2 percent chance that these data came from an $x^{-expt}$ distribution.

I should mention at this point that I had trouble deciding what to use for the CDF; there were two logical options, and I tried them both. The plot above was made using cdf(x)=$(x/x_{\max})^{1-expt}$, which has the advantage that the curves meet again at 1. The other strategy I tried was to turn the density $\rho(x)$ into a number density, and then divide by the number of points I was using in the K-S test. In other words: cdf(x)=$\frac{A}{m_p N_{ks}} \frac{x^{1-expt}}{1-expt}$. This has the advantage that it makes more physical sense. Here is the plot I got:
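In code, the two candidate CDFs are something like the following (a sketch; A, m_p, expt and N_ks stand in for the fitted amplitude, the particle mass, the exponent and the K-S sample size):

```python
# Option 1: normalize by the outermost point, so the CDF reaches 1 at x_max.
def cdf_xmax(x, expt, x_max):
    return (x / x_max) ** (1.0 - expt)

# Option 2: integrate the mass density A*x^(-expt), convert mass to a number
# of particles with the particle mass m_p, and divide by the K-S sample size.
def cdf_number(x, expt, A, m_p, N_ks):
    return (A / (m_p * N_ks)) * x ** (1.0 - expt) / (1.0 - expt)
```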

I was surprised this was so much worse, given that at the maximum point the value of the theory CDF was 0.994, but I will explain the difference. In order to get the number density, I needed the amplitude $A$ of the mass density. I got it by performing a least-squares fit to the density data points (with error bars) that I showed in previous posts. Every time I adjust the value of the exponent, the leastsq fit lets the amplitude $A$ compensate to best match the density data (which has been binned, and using the data twice in this way might be a cheat, statistically speaking...). That is why the curves jump around a bit more in these plots. It confuses me, because I thought these theory CDFs differed from one another only by a normalization, but apparently there is more to it.
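The amplitude fit itself was roughly this (a sketch assuming scipy.optimize.leastsq; the binned arrays x_bin, rho_bin, rho_err and the file name are stand-ins for the data from the earlier posts):

```python
# Weighted least-squares fit of the amplitude A at a fixed trial exponent,
# so that A re-adjusts every time the exponent changes.
import numpy as np
from scipy.optimize import leastsq

x_bin, rho_bin, rho_err = np.loadtxt("density_profile.dat", unpack=True)

def residuals(params, x, rho, err, expt):
    (A,) = params
    return (rho - A * x ** (-expt)) / err

expt = 0.5
params, ier = leastsq(residuals, [1.0], args=(x_bin, rho_bin, rho_err, expt))
A_fit = params[0]
```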

In any event, in the process of doing the least-squares fit, I also got a reduced chi-square value. Here is a plot of the measured density and the best-fit model with the exponent fixed to $-1/2$.
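The reduced chi-square is just the weighted sum of squared residuals over the degrees of freedom; a small sketch, reusing the stand-in names from the fit above:

```python
# Reduced chi-square of the fixed-exponent fit: one fitted parameter (A),
# so the degrees of freedom are (number of bins - 1).
def reduced_chisq(x, rho, err, A, expt, n_params=1):
    resid = (rho - A * x ** (-expt)) / err
    return np.sum(resid ** 2) / (len(x) - n_params)

print(reduced_chisq(x_bin, rho_bin, rho_err, A_fit, 0.5))
```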

I took this one step further and fit both the amplitude and the exponent together. The plot is visually quite similar, and the parameters were $A = 0.2422$, $expt = 0.4988$, with a reduced chi-square of 0.6912. Taking the circularity of my analysis to the point of the absurd, I then tried a K-S test using both the amplitude and the exponent I got out of the data (although I vaguely recall that NRC++ warned there was a rule against such behavior)... still it very much distresses me, take a look:
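The two-parameter fit and the subsequent circular K-S test were along these lines (again a sketch with stand-in names and files; the numbers quoted above are from my actual run, not this snippet):

```python
# Fit both the amplitude and the exponent together, then (circularly) feed
# the resulting best-fit CDF straight back into the K-S test on the same data.
import numpy as np
from scipy.optimize import leastsq
from scipy.stats import kstest

x_bin, rho_bin, rho_err = np.loadtxt("density_profile.dat", unpack=True)  # stand-in binned profile
radii = np.loadtxt("radii.dat")                                           # stand-in K-S sample
m_p, N_ks = 1.0, len(radii)                                               # stand-in particle mass, sample size

def residuals2(params, x, rho, err):
    A, expt = params
    return (rho - A * x ** (-expt)) / err

(A_fit, expt_fit), ier = leastsq(residuals2, [1.0, 0.5],
                                 args=(x_bin, rho_bin, rho_err))

def fitted_cdf(x):
    # number-density CDF with both fitted parameters plugged in
    return (A_fit / (m_p * N_ks)) * x ** (1.0 - expt_fit) / (1.0 - expt_fit)

D, p = kstest(radii, fitted_cdf)
print(A_fit, expt_fit, D, p)
```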

That $p$ is ridiculous! I guess the only conclusion I can draw here is that my error bars are just way too big, so that the chi-square statistic is basically just telling me about the error bars (?). Otherwise how can these two statistical tests differ so extremely? I am beginning to lose my faith in statistics... not that I had all that much to begin with.
