A fundamental assumption often made in supervised classification is that
the problem is static, i.e., the description of the classes does not
change with time. However, many practical classification tasks involve
changing environments. Thus, designing and testing classifiers for changing
environments are of increasing interest and importance. A number of
benchmark data sets are available for static classification tasks.
For example, the UCI machine
learning repository is extensively used by researchers to compare
algorithms across various domains. Only a few benchmark data sets are
available for changing environments. Also, while generating data for static
environments is relatively straightforward, this is not so for changing
environments. The reason is that infinitely many changes can be
simulated, and it is difficult to define which ones are realistic
and hence useful. We proposed a general framework for generating data to
simulate changing environments in:
Narasimhamurthy, A., Kuncheva, L.I. A framework for generating data to
simulate changing environments, Proc. IASTED, Artificial Intelligence
and Applications, Innsbruck, Austria, 2007, 384-389.
The function simulation_changing_environment returns a data set A with N
points and a label vector LABA. V is an N-by-K array of probabilities,
where K is the number of sources to sample from. The rows of V, V(i,:),
are the mixing proportions of the K sources (each row of V sums to 1).
P is a user-defined function called as [X,LABX] = P(i), which samples once from
data source i and returns point X with its label. The programme implements
the following algorithm:
I. Initialise the data set A = empty and the set of labels LABA = empty.
II. For i = 1,...,N
(i). k = Sample once from the set {1,2,...,K} with distribution defined by
V(i,1),...,V(i,K) to determine which of the K sources will provide
the next point.
(ii). Sample a new data point from source k, [x,labx] = P(k) and add x to the
current set A. Add the class label labx to LABA.
III. End i
IV. Return data set A and labels LABA.
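Steps I to IV can be sketched as follows. This is a minimal Python illustration, not the MATLAB code distributed with this page: the function name, the use of nested lists for V, and the optional rng argument are assumptions made for the example.

```python
import random


def simulate_changing_environment(V, P, rng=None):
    """Draw one labelled point per row of V.

    V   : list of N rows, each holding K mixing proportions that sum to 1.
    P   : callable, P(k) -> (x, label), samples once from source k (1-based).
    rng : optional random.Random instance (hypothetical convenience argument).
    """
    rng = rng or random.Random()
    A, LABA = [], []                                   # Step I
    for row in V:                                      # Step II
        sources = range(1, len(row) + 1)
        k = rng.choices(sources, weights=row, k=1)[0]  # Step II(i)
        x, labx = P(k)                                 # Step II(ii)
        A.append(x)
        LABA.append(labx)
    return A, LABA                                     # Step IV
```

Because row i of V is used only when drawing point i, any schedule of change (abrupt, gradual, recurring) can be expressed by choosing the rows of V accordingly.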
This function samples one point (= instance, = example) from the STAGGER
data problem, widely used as a benchmark artificial data set for changing
environments. The feature space is described by three features
(attributes): size {small, medium, large}, colour {red, green,
blue} and shape {square, circular, triangular}. There are three
data sources (called target concepts):
Target Concept 1 : size = small AND colour = red
Target Concept 2 : colour = green OR shape = circular
Target Concept 3 : size = medium OR size = large
In their experiment, Widmer and Kubat [1] randomly generate
120 training instances and label each instance according to the
current target concept. After processing each instance, the predictive
accuracy is tested on an independent test set of 100 instances,
also generated randomly. One target concept is active at a time.
Concept 1 is active from instance 1 to instance 40, Concept 2 is active
from instance 41 to 80 and Concept 3 from instance 81 to 120.
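In terms of the mixing matrix V described earlier, this schedule is a 120-by-3 matrix whose rows are indicator vectors of the active concept. A minimal Python sketch (not the original MATLAB code; the variable names are illustrative):

```python
N, K = 120, 3          # 120 training instances, 3 target concepts
BLOCK = 40             # each concept is active for 40 consecutive instances

V = [[0.0] * K for _ in range(N)]
for i in range(N):
    active = i // BLOCK        # 0 for instances 1-40, 1 for 41-80, 2 for 81-120
    V[i][active] = 1.0         # only the active concept can be sampled
```

Replacing the 0/1 entries with intermediate proportions would turn the abrupt concept switches into gradual ones.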
All you need to do in order to generate a training set A with labels
LABA sampled from the STAGGER distribution is the following couple
of lines:
The values of each of the 3 features are coded with the integers 1, 2 and 3.
For example, [2,3,2] denotes (medium size, blue colour, triangular shape).
The code for generating a training set of 120 STAGGER data points is
also available as STAGGER_training_data.m.
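A STAGGER data source can be sketched in Python as below. This is an illustration, not the MATLAB code from this page. Three things are assumptions: the function names, the label convention (1 if the instance satisfies the target concept, 2 otherwise), and the numbering of each feature's values in the order in which they are listed in the feature description (the coding of shape in the original code may differ).

```python
import random

# Assumed integer coding, following the order of the feature description:
# size: 1 = small, 2 = medium, 3 = large
# colour: 1 = red, 2 = green, 3 = blue
# shape: 1 = square, 2 = circular, 3 = triangular


def stagger_label(x, k):
    """Return 1 if x satisfies target concept k, otherwise 2 (assumed labels)."""
    size, colour, shape = x
    if k == 1:
        positive = size == 1 and colour == 1     # small AND red
    elif k == 2:
        positive = colour == 2 or shape == 2     # green OR circular
    elif k == 3:
        positive = size in (2, 3)                # medium OR large
    else:
        raise ValueError("k must be 1, 2 or 3")
    return 1 if positive else 2


def stagger_sample(k, rng=random):
    """Sample one instance uniformly at random and label it by concept k."""
    x = [rng.randint(1, 3) for _ in range(3)]
    return x, stagger_label(x, k)
```

A function with this call pattern (source index in, point and label out) plays the role of P in the sampling algorithm described earlier.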
This function can be used as input to simulation_changing_environment
in order to generate the "moving hyperplane data" in 2D, often used as a benchmark.
The point x returned by the function lies in the square [-1,+1]-by-[-1,+1].
There are two classes with equal prior probabilities. The data on one side
of the hyperplane are labelled as class 1 and on the other side as class 2.
If static data are required, the programme should be called in a loop with
a fixed angle k of the separating line. For changing environments k may
be varied as desired. For this example we assume that there are 360 data
sources determined by the degree of rotation of the separating line. The code
Moving_hyperplane_2d_demo.m
shows an animation of the generated data as the line rotates through a full circle.
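One way to realise such a data source is sketched below in Python (not the original MATLAB function). The sketch assumes that the point is drawn uniformly from the square and that the separating line passes through the origin, which is consistent with the equal priors stated above. Which side is called class 1 is an arbitrary choice made here, and the function name is illustrative.

```python
import math
import random


def moving_hyperplane_sample(k, rng=random):
    """Sample one point from source k, where k is the rotation angle in degrees."""
    x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    theta = math.radians(k)
    # The line has direction (cos theta, sin theta); this is its normal vector.
    normal = (-math.sin(theta), math.cos(theta))
    side = normal[0] * x[0] + normal[1] * x[1]
    label = 1 if side >= 0 else 2      # assumed convention for class 1
    return x, label
```

With k = 0 the line coincides with the horizontal axis, so points with a non-negative second coordinate fall in class 1. Stepping k from 0 to 359 gives the 360 sources mentioned above.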
This function can be used in simulation_changing_environment.
It samples a point from a 2-dimensional mixture of 4 equiprobable Gaussian clusters
and assigns the point to one of 2 classes. There are two concepts, so k = 1 or 2.
"s" is an optional parameter specifying the variance of all clusters. Suppose that
point x has coordinates (x1,x2). According to the first concept, if x2 is negative,
the class label of x is 1, otherwise the class label is 2. According to the second
concept, x is labelled as class 1 if x1 is negative, and as class 2 otherwise.
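The two labelling rules can be illustrated with the Python sketch below (not the original MATLAB function). The cluster centres and the default value of s are not given in the text, so the centres at (+/-1, +/-1) and the default s = 0.1 are placeholders chosen for the example. The clusters are also assumed to be spherical.

```python
import math
import random

# Hypothetical cluster centres; the original centres are not stated.
CENTRES = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]


def gaussian_2d_sample(k, s=0.1, rng=random):
    """Sample one point from the 4-cluster mixture and label it by concept k."""
    cx, cy = rng.choice(CENTRES)       # clusters are equiprobable
    sd = math.sqrt(s)                  # s is a variance, gauss() takes a std dev
    x1 = rng.gauss(cx, sd)
    x2 = rng.gauss(cy, sd)
    if k == 1:
        label = 1 if x2 < 0 else 2     # concept 1: split on the sign of x2
    elif k == 2:
        label = 1 if x1 < 0 else 2     # concept 2: split on the sign of x1
    else:
        raise ValueError("k must be 1 or 2")
    return (x1, x2), label
```

The distribution of the points is the same under both concepts; only the labelling changes.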
Gaussian_2d_demo.m
shows an animation of the generated data as one of the concepts fades away and
is replaced by the other concept.
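A gradual replacement of one concept by another can be encoded in the mixing matrix V. The Python sketch below assumes a linear fade over N points. The schedule used in the demo is not stated here, so the linear form and the function name are illustrative only.

```python
def fading_proportions(N):
    """Return an N-by-2 list of mixing proportions fading from concept 1 to 2."""
    V = []
    for i in range(N):
        w = i / (N - 1) if N > 1 else 1.0   # weight of concept 2 grows from 0 to 1
        V.append([1.0 - w, w])
    return V
```

For example, fading_proportions(5) gives rows with concept-2 weights 0, 0.25, 0.5, 0.75 and 1.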
This function can be used in simulation_changing_environment.
It samples a point from 3 Gaussian classes with static means and changing variances.
The first two components of the three class means are [-2 -1], [2 -1] and [0 2],
respectively. The remaining n-2 components of each mean are 0. (If n is not
specified, the default value n = 2 is used.) The covariance matrices are diagonal.
The variances of the first two features are varied but are kept equal within each
cluster. The classes are equiprobable. There are 5 concepts corresponding to the
following sets of standard deviations for the three classes:
Concept    s1     s2     s3
   1       1.4    1.1    0.1
   2       0.6    0.4    1.7
   3       1.3    1.3    0.9
   4       0.3    0.3    0.1
   5       1.6    0.3    1.4
Gaussian_3class_demo.m
shows an animation of the generated data as one of the concepts fades away and
is replaced by the next one in the list.
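The 3-class source can be sketched in Python as follows (not the original MATLAB function). The means and standard deviations are those listed above. The text does not state the variance of features 3 to n, so the sketch assumes unit standard deviation for them. The function name is illustrative.

```python
import random

# First two components of the three class means (from the description above).
MEANS = [(-2.0, -1.0), (2.0, -1.0), (0.0, 2.0)]

# Standard deviations (s1, s2, s3) of the three classes for concepts 1 to 5.
SD = {
    1: (1.4, 1.1, 0.1),
    2: (0.6, 0.4, 1.7),
    3: (1.3, 1.3, 0.9),
    4: (0.3, 0.3, 0.1),
    5: (1.6, 0.3, 1.4),
}


def gaussian_3class_sample(k, n=2, rng=random):
    """Sample one n-dimensional point and its class label (1, 2 or 3)."""
    c = rng.randrange(3)                       # classes are equiprobable
    sd = SD[k][c]
    m1, m2 = MEANS[c]
    x = [rng.gauss(m1, sd), rng.gauss(m2, sd)]
    # Features 3..n have mean 0; unit standard deviation is an assumption.
    x += [rng.gauss(0.0, 1.0) for _ in range(n - 2)]
    return x, c + 1
```

Under concept 4, for instance, all three classes are tight and well separated, whereas under concept 5 classes 1 and 3 overlap heavily.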
[1] Widmer, G., Kubat, M. Learning in the presence of concept drift
and hidden contexts, Machine Learning, 1996, 23, 69-101.