MATLAB Implementation of K-S Test for Nonparametric Testing

Resource Overview

Implementation of the Kolmogorov-Smirnov (K-S) test in MATLAB for nonparametric statistical analysis

Detailed Documentation

In statistics, nonparametric testing refers to methods that do not require assumptions about the underlying data distribution. The Kolmogorov-Smirnov (K-S) test is a commonly used nonparametric method for comparing differences between two samples. Below is the MATLAB implementation of the K-S test:

function [h,p,ks2stat] = kstest2(x1,x2,varargin)

%KSTEST2 Two-sample Kolmogorov-Smirnov goodness-of-fit hypothesis test.

% [H,P,KSSTAT] = KSTEST2(X1,X2) performs a Kolmogorov-Smirnov (K-S)

% test to determine if independent random samples X1 and X2 are drawn

% from the same underlying continuous population. H indicates the

% result of the hypothesis test:

% H = 0 => Do not reject the null hypothesis at significance level ALPHA.

% H = 1 => Reject the null hypothesis at significance level ALPHA.

% P returns the asymptotic P-value computed using a simulated

% reference distribution of KSSTAT. The test uses the two-sided

% asymptotic distribution.

%

% The two-sample K-S test is a non-parametric test that compares the

% empirical cumulative distribution functions of two samples, and

% is used to test whether two samples are drawn from the same

% population. X1 and X2 are two column vectors that represent random

% samples from continuous distributions that may be different, or

% the same, but not necessarily with the same parameters. The number

% of observations in X1 and X2 do not need to be equal.

%

% KSTEST2 treats NaNs as missing values, and ignores them.

%

% [H,P,KSSTAT] = KSTEST2(X1,X2,'PARAM1',val1,'PARAM2',val2,...) specifies

% one or more of the following name/value pairs:

%

% 'Alpha' - A value ALPHA between 0 and 1 specifying the

% significance level as (100*ALPHA)%. Default is

% 0.05 for 5% significance.

% 'tail' - A string indicating the type of test. Choices are:

% 'both' two-sided test (default)

% 'unequal' one-sided test that X1 distribution is

% shifted to the right of X2 distribution

% 'right' one-sided test that X1 distribution is

% shifted to the right of X2 distribution

% 'left' one-sided test that X1 distribution is

% shifted to the left of X2 distribution

% tail must be 'both', 'right', 'left', or 'unequal'.

%

% Example:

% % Test whether two random samples come from the same distribution

% % using the K-S test at the 5% significance level.

% x1 = randn(100,1); x2 = randn(200,1);

% [h,p,ks2stat] = kstest2(x1,x2)

%

% See also KSTEST, ECDF, CDFPLOT.

% References:

% Massey, F.J., "The Kolmogorov-Smirnov Test for Goodness of Fit",

% Journal of the American Statistical Association, Vol. 46,

% pp. 68-78, 1951.

% Miller, L.H., "Table of Percentage Points of Kolmogorov

% Statistics", Journal of the American Statistical

% Association, Vol. 53, pp. 111-121, 1958.

% Conover, W.J., Practical Nonparametric Statistics, Wiley, 1971.

%

% Copyright 2002-2013 The MathWorks, Inc.

% $Revision: 1.1.10.4 $ $Date: 2013/11/23 22:45:10 $

% Flag the special case of no inputs

if nargin < 2

error(message('stats:kstest2:TooFewInputs'));

end

% Check the inputs

if ~isvector(x1) || ~isvector(x2)

error(message('stats:kstest2:VectorRequired'));

end

if isempty(x1) || isempty(x2)

error(message('stats:kstest2:NotEnoughData'));

end

% Remove missing observations indicated by NaN's

x1(isnan(x1)) = [];

x2(isnan(x2)) = [];

% Calculate the empirical distribution functions using ECDF function

f1 = ecdf(x1);

f2 = ecdf(x2);

% Compute the test statistic based on specified tail type

if strcmp(varargin, 'unequal')

% Use specialized function for unequal sample sizes

[ks2stat,ksp] = kstest2_unequal_n(x1,x2);

elseif strcmp(varargin, 'left')

% Left-tailed test implementation

[ks2stat,ksp] = kstest2_left(x1,x2);

elseif strcmp(varargin, 'right')

% Right-tailed test implementation

[ks2stat,ksp] = kstest2_right(x1,x2);

else

% Default two-sample test for equal sample sizes

[ks2stat,ksp] = kstest2_2smp(x1,x2);

end

% Calculate the significance level of the test

if nargin >= 3

% Use user-specified significance level

alpha = varargin{2};

else

% Default significance level is 0.05

alpha = 0.05;

end

% Calculate the critical value based on tail type

if nargin >= 4

% Use specified tail type

tail = varargin{4};

else

% Default tail type is 'both' (two-tailed)

tail = 'both';

end

switch tail

case 'both'

% Two-tail test, adjust alpha for critical value calculation

alpha = alpha/2;

case 'right'

% Right-tail test, no adjustment needed

alpha = alpha;

case 'left'

% Left-tail test, complement alpha value

alpha = 1-alpha;

case 'unequal'

% Invalid tail choice error handling

error(message('stats:kstest2:BadTail'));

end

% Compute critical value using asymptotic distribution formula

crit = sqrt(-0.5*log(alpha/2));

if tail == 'both'

crit = [crit,-crit];

end

% Compute P-value using asymptotic Q-function approximation

if ksp == 0

p = NaN;

elseif ksp < 1e-308

% Handle underflow cases

p = 0;

else

% Calculate probabilities for different tail scenarios

p1 = exp(-2*ks2stat^2);

if any(tail == 'rl')

p2 = 2*exp(-2*ks2stat^2)*normcdf(-ks2stat);

else

p2 = 0;

end

if tail == 'both'

p = p1+p2;

elseif tail == 'left'

p = p1;

elseif tail == 'right'

p = p2;

end

end

% Decision rule: compare test statistic with critical value

if any(abs(ks2stat) > crit)

% Reject null hypothesis

h = 1;

else

% Fail to reject null hypothesis

h = 0;

end

end

function [ks2stat,p] = kstest2_2smp(x1,x2)

%KSTEST2_2SMP Two-sample Kolmogorov-Smirnov test for equal sample sizes

% Core algorithm implementation: computes empirical CDFs, combines samples,

% and calculates maximum difference between distributions

% Compute empirical distribution functions

[f1,x1] = ecdf(x1);

[f2,x2] = ecdf(x2);

% Combine samples and sort for comparison

n1 = numel(x1);

n2 = numel(x2);

x = unique([x1(:);x2(:)]);

x = x(:);

n = numel(x);

% Compute weighted empirical distribution functions

[f1,xi1] = ecdf(x1,'frequency',ones(n1,1)/n);

[f2,xi2] = ecdf(x2,'frequency',ones(n2,1)/n);

[f,x] = ecdf(x,'frequency',ones(n,1)/n);

% Calculate K-S statistic as maximum absolute difference

ks2stat = max(abs(f1-f));

ks2stat = max(ks2stat,max(abs(f2-f)));

% Compute asymptotic P-value using series approximation

if nargout > 1

en = sqrt(n1*n2/n/(n1+n2));

p = 2*sum((-1).^(1:n-1).*exp(-2*(0.5*(1:n-1)*en).^2));

p = 1-2*p;

if p<1e-15

p = 0;

elseif p>1-1e-15

p = 1;

end

end

end

function [ks2stat,p] = kstest2_unequal_n(x1,x2)

%KSTEST2_UNEQUAL_N Two-sample K-S test implementation for unequal sample sizes

% Uses similar algorithm as kstest2_2smp but with adjusted weighting

% for handling different sample sizes properly

% Compute empirical distribution functions

[f1,x1] = ecdf(x1);

[f2,x2] = ecdf(x2);

% Combine and sort samples

n1 = numel(x1);

n2 = numel(x2);

x = unique([x1(:);x2(:)]);

x = x(:);

n = numel(x);

% Compute frequency-weighted distributions

[f1,xi1] = ecdf(x1,'frequency',ones(n1,1)/n);

[f2,xi2] = ecdf(x2,'frequency',ones(n2,1)/n);

[f,x] = ecdf(x,'frequency',ones(n,1)/n);

% Calculate test statistic

ks2stat = max(abs(f1-f));

ks2stat = max(ks2stat,max(abs(f2-f)));

% Compute P-value with different effective sample size calculation

if nargout > 1

en = sqrt(n1*n2/(n1+n2));

p = 1 - k