PRIM.example.data {rfcdmin} | R Documentation |
PRIM-2Dimensional Dataset:
The simulated data comprises of three actual normally distributed clusters, denoted by 1, 2, and 3, which are true category statuses denoted by the vector 'true.Y.status.PRIM'. In reality, the truth is unknown, and thus, the information about the true category status is hidden. The rules obtained from PRIM will be compared to the true cluster categories, from which the data was generated.
The stochastic structure of the data is a mixture model of normals. In our test dataset there were 1000 observations.
Each of the 1000 'Y.PRIM' response comes from independent bernoulli distributions with a probability of 0.5.
The 'X.PRIM' observations with 2 column variables and 1000 row observations was made up of the following distributions conditional on the response 'Y.PRIM':
X.PRIM | Y.PRIM=1 ~ N([500,500],[[625, 0],[0, 625]])
X.PRIM | Y.PRIM=1 ~ N([400,100],[[625, 0],[0, 625]])
X.PRIM | Y.PRIM=1 ~ N([200,900],[[625, 0],[0, 625]])
X1|Y.PRIM=1,Y.PRIM=0 ~ U(0,1023)
X2|Y.PRIM=1,Y.PRIM=0 ~ U(0,1023)
PRIM-4Dimensional Dataset:
The simulated data comprises of three actual clusters, denoted by 1, 2, and 3, that are normally distributed. These values for true category status are denoted by the vector true.Y.status.PRIM4D. In reality, the truth is unknown, and thus, the information about the true category status is hidden. The rules obtained will be compared to the true cluster categories, from which the data was generated.
The stochastic structure of the data is a mixture model of normals. In our test dataset there were 2000 observations.
Each of the 2000 Y.PRIM4D response comes from independent bernoulli distributions with a probability of 0.5.
The X.PRIM4D matrix observations with 4 column variables and 2000 row observations was made up of the following distributions conditional on the response Y.PRIM4D:
X.PRIM4D | Y.PRIM4D=1 ~ N([500,500,500,500], [[625, 0,0,0],[0,625,0,0], [0,0,625,0],[0,0,0,625]])
X.PRIM4D | Y.PRIM4D=1 ~ N([200,1000,200,800], [[700, 0,0,0],[0,200,0,0], [0,0,100,0],[0,0,0,700]])
X.PRIM4D | Y.PRIM4D=1 ~ N([1000,100,900,100], [[400, 0,0,0],[0,500,0,0], [0,0,600,0],[0,0,0,300]])
X1|Y.PRIM4D=1,Y.PRIM4D=0 ~ U(0,1023)
X2|Y.PRIM4D=1,Y.PRIM4D=0 ~ U(0,1023)
X3|Y.PRIM4D=1,Y.PRIM4D=0 ~ U(0,1023)
X4|Y.PRIM4D=1,Y.PRIM4D=0 ~ U(0,1023)
data(PRIM.example.data)
A dataset generated with the following code in the "example" section.
See also rfcprim for examples using this simulated data.
data(PRIM.example.data) ## The following is the source R code ## setting the seed will generate the same dataset if (FALSE){ if (require(MASS)){ set.seed(20) ## Y.PRIM response binary data Y.PRIM <- rbinom(1000, 1, 0.5) ## X.PRIM matrix, rows corresponding to Y.PRIM ## columns are the variables ## when Y.PRIM==1 there is a MVN distribution of the X.PRIM ## There is simulation for 3 clusters Sigma <- matrix(c(25,0,0,25),2,2)^2 X1.y1 <- mvrnorm(n=ceiling(length(Y.PRIM[Y.PRIM==1])/4), c(500, 500), Sigma, empirical = FALSE) X2.y1 <- mvrnorm(n=ceiling(length(Y.PRIM[Y.PRIM==1])/4), c(400, 100), Sigma, empirical = FALSE) X3.y1 <- mvrnorm(n=ceiling(length(Y.PRIM[Y.PRIM==1])/4), c(200, 900), Sigma, empirical = FALSE) X4.1.y1 <- runif(ceiling(length(Y.PRIM[Y.PRIM==1])/4), 0, 1023) X4.2.y1 <- runif(ceiling(length(Y.PRIM[Y.PRIM==1])/4), 0, 1023) X4.y1 <- cbind(X4.1.y1, X4.2.y1) X.y1 <- rbind(X1.y1, X2.y1, X3.y1, X4.y1)[1:length(Y.PRIM[Y.PRIM==1]),] ## when Y.PRIM==0 there is only a uniform distribution of X's X1.y0 <- runif(length(Y.PRIM[Y.PRIM==0]), 0, 1023) X2.y0 <- runif(length(Y.PRIM[Y.PRIM==0]), 0, 1023) X.y0 <- cbind(X1.y0, X2.y0) ## true Y.PRIM cluster status: ## 0, 4 is noise ## 1, 2, 3, are the true clusters true.Y.status.PRIM <- c(rep(2, dim(X1.y1)[1]), rep(1, dim(X2.y1)[1]), rep(3, dim(X3.y1)[1]), rep(4, dim(X4.y1)[1]))[1:length(Y.PRIM[Y.PRIM==1])] true.Y.status.PRIM <- c(true.Y.status.PRIM, rep(0,dim(X.y0)[1])) ## sort the Y.PRIM to correspond with the X.PRIM matrix Y.PRIM<- sort(Y.PRIM, decreasing=TRUE) X.PRIM <- rbind(X.y1, X.y0) colnames(X.PRIM) <- c("X1", "X2") } ## 4D Dataset if (require(MASS)==TRUE){ Y.PRIM4D <- rbinom(2000, 1, 0.5) ## when Y.PRIM4D==1 there is a MVN distribution of the X Sigma1 <- rbind(c(625,0,0,0), c(0, 625, 0, 0), c(0, 0, 625, 0), c(0, 0, 0, 625)) Sigma2 <- rbind(c(700, 0, 0, 0), c(0, 200, 0, 0), c(0, 0, 100,0), c(0, 0, 0, 700)) Sigma3 <- rbind(c(400,0,0,0), c(0,500,0,0), c(0,0,600,0), c(0, 0,0,300)) X1.y1 <- mvrnorm(n=ceiling(length(Y.PRIM4D[Y.PRIM4D==1])/4), c(500, 500, 500, 500), Sigma1, empirical = FALSE) X2.y1 <- mvrnorm(n=ceiling(length(Y.PRIM4D[Y.PRIM4D==1])/4), c(200, 1000, 200, 800), Sigma2, empirical = FALSE) X3.y1 <- mvrnorm(n=ceiling(length(Y.PRIM4D[Y.PRIM4D==1])/4), c(1000, 100, 900, 100), Sigma3, empirical = FALSE) X4.1.y1 <- runif(ceiling(length(Y.PRIM4D[Y.PRIM4D==1])/4), 0, 1023) X4.2.y1 <- runif(ceiling(length(Y.PRIM4D[Y.PRIM4D==1])/4), 0, 1023) X4.3.y1 <- runif(ceiling(length(Y.PRIM4D[Y.PRIM4D==1])/4), 0, 1023) X4.4.y1 <- runif(ceiling(length(Y.PRIM4D[Y.PRIM4D==1])/4), 0, 1023) X4.y1 <- cbind(X4.1.y1, X4.2.y1, X4.3.y1, X4.4.y1) X.y1 <- rbind(X1.y1, X2.y1, X3.y1, X4.y1)[1:length(Y.PRIM4D[Y.PRIM4D==1]),] ## when Y.PRIM4D==0 there is only a uniform distribution of X's X1.y0 <- runif(length(Y.PRIM4D[Y.PRIM4D==0]), 0, 1023) X2.y0 <- runif(length(Y.PRIM4D[Y.PRIM4D==0]), 0, 1023) X3.y0 <- runif(length(Y.PRIM4D[Y.PRIM4D==0]), 0, 1023) X4.y0 <- runif(length(Y.PRIM4D[Y.PRIM4D==0]), 0, 1023) X.y0 <- cbind(X1.y0, X2.y0, X3.y0, X4.y0) ## true if in a cluster otherwise FALSE true.Y.status.PRIM4D <- c(rep(2, dim(X1.y1)[1]), rep(1, dim(X2.y1)[1]), rep(3, dim(X3.y1)[1]), rep(4, dim(X4.y1)[1]))[1:length(Y.PRIM4D[Y.PRIM4D==1])] ## NOTE: true.Y.status.PRIM4D=0,4 is uniform, random noise; true.Y.status.PRIM4D=1,2,3 ## denotes the real cluster category true.Y.status.PRIM4D <- c(true.Y.status.PRIM4D, rep(0,dim(X.y0)[1])) Y.PRIM4D <- sort(Y.PRIM4D, decreasing=TRUE) X.PRIM4D <- rbind(X.y1, X.y0) colnames(X.PRIM4D) <- c("X1", "X2", "X3", "X4") } }