Introduction
Standard control options
Line 1: Describe data
Line 2: Set run options
Line 3: Set importance options
Line 4: Set proximity computations
Line 5: Set options based on proximities
Line 6: Replace missing values
Line 7: Visualization
Line 8: Saving a forest
Line 9: Running new data down a saved forest
Customized options
Input data files
Categorical input variables
Class weights
Using a prespecified subset of input variables
Using the most important input variables
Using a saved forest
Output control options
File names
Example settings - satimage data
Probably the best way to learn how to use the random forests code is to study the satimage example. The options are documented below.
Unless otherwise noted, setting an option to 0 turns the feature "off".
noutlier = 1 computes the outlier measure for each case in the training set and noutlier = 2 also computes the outlier measure for each case in the test set. (Outlier detection requires nprox >0).
nscale = k>0 computes the first k canonical coordinates used in metric scaling.
nprot = k>0 computes k prototypes for each class.
For a J-class problem, random forests expects the classes to be numbered 1,2, ...,J. The code to read in the training and/or test data is:
c -------------------------------------------------------
c READ IN DATA--SEE MANUAL FOR FORMAT
c
open(16, file='data.train', status='old')
do n=1,nsample0
read(16,*) (x(m,n),m=1,mdim), cl(n)
enddo
close(16)
if(ntest.gt.1) then
open(17, file='data.test', status='old')
do n=1,ntest0
read(17,*) (xts(m,n),m=1,mdim),clts(n)
end do
close(17)
end if
To change the dataset names or to manipulate the data, edit the code as required. For the training data, always use the notation x(m,n) for the value of the mth variable in the nth case, and cl(n) for the class number (integer). For the test data, always use the notation xts(m,n) for the value of the mth variable in the nth case, and clts(n) for the class number (integer).
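As an illustration (in Python, not part of the Fortran program), the whitespace-delimited layout the read statements above expect is mdim variable values per row, followed by an integer class label numbered 1,...,J. The file name and values here are made up for the example:

```python
import os
import tempfile

# Each row: mdim variable values, then an integer class label in 1..J.
rows = [
    ([5.1, 3.5, 1.4], 1),   # mdim=3 variables, then class 1
    ([6.2, 2.9, 4.3], 2),
    ([4.9, 3.0, 1.3], 1),
]
path = os.path.join(tempfile.mkdtemp(), "data.train")
with open(path, "w") as f:
    for xrow, label in rows:
        f.write(" ".join(str(v) for v in xrow) + " " + str(label) + "\n")

# Reading it back mirrors: read(16,*) (x(m,n),m=1,mdim), cl(n)
x, cl = [], []
with open(path) as f:
    for line in f:
        fields = line.split()
        x.append([float(v) for v in fields[:-1]])
        cl.append(int(fields[-1]))
```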
If there are categorical variables, set maxcat equal to the largest number of categories and then specify the number of categories for each categorical variable in the integer vector cat, which is filled in the following lines of code:
c -------------------------------------------------------
c SET CATEGORICAL VALUES
c
do m=1,mdim
cat(m)=1
enddo
c fill in cat(m) for all variables m for which cat(m)>1
cat(1)= FILL IN THE VALUE
...
cat(mdim)= FILL IN THE VALUE
For example, setting cat(5)=7 implies that the 5th variable is a categorical with 7 values. Any variables with cat=1 will be assumed to be continuous. Note: for an L valued categorical input variable, random forests expects the values to be numbered 1,2, ... ,L.
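A minimal sketch (in Python, describing preprocessing done outside the Fortran code) of recoding an L-valued categorical variable so that its values are the integers 1,...,L, as random forests expects. The level ordering here (sorted) is an arbitrary choice for the example:

```python
def recode(values):
    """Map an L-valued categorical variable onto the integers 1..L."""
    levels = sorted(set(values))                    # the L distinct levels
    lookup = {v: i + 1 for i, v in enumerate(levels)}
    return [lookup[v] for v in values], len(levels)

# Levels sorted alphabetically: blue->1, green->2, red->3
codes, L = recode(["red", "blue", "red", "green"])
```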
If the classes are to be assigned different weights, set jclasswt=1 and fill in the desired weights in the following lines of code:
c SET CLASS WEIGHTS
do j=1,nclass
classwt(j)=1
end do
if(jclasswt.eq.1) then
classwt(1)= FILL IN THE VALUE
...
classwt(nclass)= FILL IN THE VALUE
end if
Look for this text early in the program:
if(mselect.eq.0) then
mdimt=mdim
do k=1,mdim
msm(k)=k
end do
end if
if (mselect.eq.1) then
mdimt= FILL IN THE VALUE
msm(1)= FILL IN THE VALUE
...
msm(mdimt)= FILL IN THE VALUE
end if
If mselect = 1, mdimt is the number of variables to be used and the values of msm(1),...,msm(mdimt) specify which variables should be used.
If imp = 1 and mdim2nd = k>0, the program performs a second run using only the k most important variables found in the first run. If there were missing values, the initial run on all mdim variables determines the fill-ins.
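Conceptually, the second run keeps the k variables with the largest importance scores from the first run. A sketch in Python (the helper name and toy scores are made up; variable numbers are 1-based, as in the Fortran code):

```python
def top_k_variables(importance, k):
    """Return the 1-based numbers of the k most important variables."""
    order = sorted(range(len(importance)),
                   key=lambda m: importance[m], reverse=True)
    return sorted(m + 1 for m in order[:k])   # msm-style 1-based list

imp = [0.10, 0.45, 0.02, 0.30]     # toy importance scores for mdim=4
selected = top_k_variables(imp, 2)  # variables 2 and 4
```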
To save the forest, set isaverf = 1. To run the saved forest on new data, set irunrf = 1 and read in the new data as a test set, using the notation clts for the class labels and xts for the variables. The basic parameters must agree with those of the original run, i.e. nclass, mdim and jbt.
Missing values: If the data to be run down the saved trees has missing values that need to be replaced, then in the initial run set missfill = 1 and isavefill = 1. To run the forest on the new data, set missfill = 2 and ireadfill = 1.
Outliers: Computing the outlier measure on runs down the saved trees is independent of whether missing values are replaced. To enable the outlier measure on new data, set isaveprox = 1 in the initial run. To run the forest on the new data, set noutlier = 2 and ireadprox = 1.
The options to control output are specified in the following lines of code:
c
c -------------------------------------------------------
c OUTPUT CONTROLS
c
parameter(
& isumout = 1, !0/1 1=summary to screen
& idataout= 1, !0/1/2 1=train,2=adds test (7)
& impfastout= 1, !0/1 1=gini fastimp (8)
& impout= 1, !0/1/2 1=imp,2=to screen (9)
& impnout= 1, !0/1 1=impn (10)
& interout= 1, !0/1/2 1=interaction,2=screen (11)
& iprotout= 1, !0/1/2 1=prototypes,2=screen (12)
& iproxout= 1, !0/1/2 1=prox,2=adds test (13)
& iscaleout= 1, !0/1 1=scaling coors (14)
& ioutlierout= 1) !0/1/2 1=train,2=adds test (15)
The values are all set to 1 in the code, but can be altered as necessary. The comments after the exclamation points give a quick idea of what the settings are. More details are given below:
isumout can take the values 0 or 1. If it takes the value 1, a classification summary is sent to the screen.
idataout can take the values 0, 1, or 2. If it has the value 1, the training set results are written to a file, with one row per case. The columns are:
n,cl(n),jest(n),(q(j,n), j=1,nclass),(x(m,n),m=1,mdim)
where n=case number, cl(n)=class label, jest(n)=predicted class for case n, q(j,n)=the proportion of votes for the jth class out of the total of all the votes for the nth case, and x(m,n) are the values of the input variables for the nth case.
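A small Python sketch (an illustration, not part of the Fortran program) of splitting one such row into its named pieces, given nclass and mdim; the example row is made up:

```python
def parse_row(line, nclass, mdim):
    """Split one idataout row: n, cl(n), jest(n), q(.,n), x(.,n)."""
    f = line.split()
    n, cl, jest = int(f[0]), int(f[1]), int(f[2])
    q = [float(v) for v in f[3:3 + nclass]]               # vote proportions
    x = [float(v) for v in f[3 + nclass:3 + nclass + mdim]]
    return n, cl, jest, q, x

n, cl, jest, q, x = parse_row("7 2 2 0.1 0.8 0.1 1.5 2.5",
                              nclass=3, mdim=2)
```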
If it has the value 2, then both the training set (in the above format) and the test set are written to the same file, the training set first. If the test set has labels, the format is:
n,clts(n),jests(n), (qts(j,n), j=1,nclass),(xts(m,n),m=1,mdim).
This is the same as the format for the training set except that it holds the corresponding values for the test set. If the test set has no labels, the format is the same except that clts(n) is omitted.
impfastout takes the values 0 or 1. If it takes the value 1 a two-column array is printed out with as many rows as there are input variables. The first column is the variable number. The second is the total gini contribution for the corresponding variable summed over all trees and normalized to make the average of the column equal to one.
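The normalization described above can be sketched in Python (an illustration with toy numbers): divide each variable's summed gini contribution by the column mean so the average of the second column equals one.

```python
def normalize_to_mean_one(gini_totals):
    """Scale the gini totals so their average is exactly 1."""
    mean = sum(gini_totals) / len(gini_totals)
    return [g / mean for g in gini_totals]

normalized = normalize_to_mean_one([2.0, 4.0, 6.0])
```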
impout = 0, 1, or 2. If it takes the value 1, the output is written to a file; if it takes the value 2, the output is sent to the screen. The output consists of four columns with a row for each variable. The first column is the variable number. The second column is the raw importance score (discussed later). The third column is the raw score divided by its standard error, i.e. its z-score. The fourth column assigns a significance level to the z-score, assuming it is normally distributed.
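A sketch in Python of the last two columns; the one-sided upper tail of the normal distribution is an assumption about the exact significance level used, and the input numbers are made up:

```python
import math

def z_and_pvalue(raw_score, std_error):
    """z-score = raw score / standard error; one-sided normal p-value."""
    z = raw_score / std_error
    p = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))   # upper-tail normal
    return z, p

z, p = z_and_pvalue(raw_score=0.04, std_error=0.02)   # z = 2.0
```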
interout = 0, 1, or 2. If it takes the value 1, the output is sent to a file. If it takes the value 2, the output is sent to the screen. The first output is a two-column list headed by the word "CODE". The first column consists of the successive integers from 1 to mdimt, the number of variables being used in the current run. The second column is the original variable number corresponding to the number in the first column. Then a square matrix of size mdimt on a side is printed out. The entry in the (k,m) place is the interaction between variables k and m, rounded to the nearest digit for readability. A bordering top row and first column contain the variable numbers.
iprotout = 0, 1, or 2. If it takes the value 1, it writes to a file; if it takes the value 2, it writes to the screen. If nprot is set equal to k>0, the program attempts to compute k prototypes for each class. If all prototypes can be computed, the results form a matrix with 3*nclass*nprot+1 columns and mdimt+1 rows. However, it may not be possible to compute all nprot prototypes for each class, in which case there will be fewer columns. The first column contains the variable number. The next 3 columns contain the first class-1 prototype and its lower and upper "quartiles" (as described in the manual). The next 3 columns contain the second class-1 prototype and its lower and upper "quartiles", and so on. Once the class-1 prototypes are done, the class-2 prototypes are given, and so on.
There are 3 extra rows at the top of the output. The first row gives the number of cases of that class that are closest to the prototype. The second row gives the prototype number (from 1 to nprot), and the third row gives the class.
iproxout = 0,1, or 2. If it takes the value 1, the output sent to file is a rectangular matrix with nsample rows and nrnn+1 columns:
n, (loz(n,k), prox(n,k), k=1,nrnn)
The first entry is the case number. This is followed by nrnn couples. Each couple consists of, first, the number of a case that has one of the nrnn largest proximities to case n and, second, the value of that proximity.
If there is a test set present the training set output is followed by a rectangular matrix of depth ntest and nrnn+1 columns:
n, (lozts(n,k), proxts(n,k), k=1,nrnn)
The first entry is the case number in the test set. The nrnn couples consist of, first, the number of a case in the training set that has one of the nrnn largest proximities to test case n and, second, the value of that proximity.
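A Python sketch (an illustration, with a made-up row) of unpacking one line of this proximity output into the case number and its nrnn (neighbour, proximity) couples:

```python
def parse_prox_row(line, nrnn):
    """Split one iproxout row: case number, then nrnn couples."""
    f = line.split()
    n = int(f[0])
    couples = [(int(f[1 + 2 * k]), float(f[2 + 2 * k]))
               for k in range(nrnn)]
    return n, couples

n, couples = parse_prox_row("3 17 0.92 4 0.85 29 0.61", nrnn=3)
```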
iscaleout = 0 or 1. If it takes the value 1, the nscale+1 scaling coordinates are output to a file. The format is a rectangular matrix of depth nsample. The columns are:
n, cl(n), jest(n), (xsc(n,k),k=1,nscale+1).
The first is the case number, the second is the labeled class, the third is the predicted class, and the next nscale+1 are the values of the nscale+1 scaling coordinates.
ioutlierout = 0,1, or 2. If it takes the value 1, the output is a rectangular matrix of depth nsample with columns:
n, cl(n), amax1(outtr(n),0.0)
The first is the case number. The second is the labeled class. The third is the normalized outlier measure truncated below at zero. If it takes the value 2, then additional output about outliers in the test set is added. This consists of a matrix of depth ntest and columns:
n, jests(n), amax1(outts(n),0.0)
The difference from the above is that the predicted class (jests) is output instead of the class label, which may or may not exist for the test set.
The following code specifies all file names.
c
c -------------------------------------------------------
c READ OLD TREE STRUCTURE AND/OR PARAMETERS
c
if (irunrf.eq.1)
& open(1,file='savedforest',status='old')
if (ireadpar.eq.1)
& open(2,file='savedparams',status='old')
if (ireadfill.eq.1)
& open(3, file='savedmissfill',status='old')
if (ireadprox.eq.1)
& open(4, file='savedprox', status='old')
c
c -------------------------------------------------------
c NAME OUTPUT FILES FOR SAVING THE FOREST STRUCTURE
c
if (isaverf.eq.1)
& open(1, file='savedforest',status='new')
if (isavepar.eq.1)
& open(2, file='savedparams',status='new')
if (isavefill.eq.1)
& open(3, file='savedmissfill',status='new')
if (isaveprox.eq.1)
& open(4, file='savedprox', status='new')
c
c -------------------------------------------------------
c NAME OUTPUT FILES TO SAVE DATA FROM CURRENT RUN
c
if (idataout.ge.1)
& open(7, file='save-data-from-run',status='new')
if (impfastout.eq.1)
& open(8,file='save-impfast',status='new')
if (impout.eq.1)
& open(9,file='save-importance-data',status='new')
if (impnout.eq.1)
& open(10,file='save-caseimp-data',status='new')
if (interout.eq.1)
& open(11,file='save-pairwise-effects',status='new')
if (iprotout.eq.1)
& open(12,file='save-protos',status='new')
if (iproxout.ge.1)
& open(13, file='save-run-proximities',status='new')
if (iscaleout.eq.1)
& open(14, file='save-scale',status='new')
if (ioutlierout.ge.1)
& open(15, file='save-outliers',status='new')
c
c -------------------------------------------------------
c READ IN DATA--SEE MANUAL FOR FORMAT
c
open(16, file='satimage.tra', status='old')
do n=1,nsample0
read(16,*) (x(m,n),m=1,36), cl(n)
end do
close(16)
if(ntest.gt.1) then
open(17, file='satimage.tes', status='old')
do n=1,ntest0
read(17,*) (xts(m,n),m=1,36),clts(n)
end do
close(17)
end if
This is a fast summary of the settings and options for a run of random forests version 5. The code below is what the user sees near the top of the program. It is set up for a run on the satimage training data, which has 4435 cases, 36 input variables, and six classes. The satimage test set has 2000 cases with class labels. All options are turned off.
parameter(
c DESCRIBE DATA
1 mdim=36, nsample0=4435, nclass=6, maxcat=1,
1 ntest=2000, labelts=1, labeltr=1,
c
c SET RUN PARAMETERS
2 mtry0=6, ndsize=1, jbt=50, look=10, lookcls=1,
2 jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c SET IMPORTANCE OPTIONS
3 imp=0, interact=0, impn=0, impfast=0,
c
c SET PROXIMITY COMPUTATIONS
4 nprox=0, nrnn=5,
c
c SET OPTIONS BASED ON PROXIMITIES
5 noutlier=0, nscale=0, nprot=0,
c
c REPLACE MISSING VALUES
6 code=-999, missfill=0, mfixrep=0,
c
c GRAPHICS
7 iviz=0,
c
c SAVING A FOREST
8 isaverf=0, isavepar=0, isavefill=0, isaveprox=0,
c
c RUNNING A SAVED FOREST
9 irunrf=0, ireadpar=0, ireadfill=0, ireadprox=0)
There are two data files corresponding to the data description. The training data is satimage.tra, the test data is satimage.tes. The training data is read in with the lines:
open(16, file='satimage.tra', status='old')
do n=1,nsample0
read(16,*) (x(m,n),m=1,mdim),cl(n)
enddo
close(16)
For the training data, always use the notation x(m,n) for the value of the mth variable in the nth case, and cl(n) for the class number (integer). The test data is read in with the lines:
if(ntest.gt.1) then
open(17, file='satimage.tes', status='old')
do n=1,ntest0
read(17,*) (xts(m,n),m=1,mdim),clts(n)
enddo
close(17)
endif
For test data, always use the notation xts(m,n) for the value of the mth variable in the nth case, and clts(n) for the class number (integer). Compile and run the code to get the following output on the screen:
=============================================================
class counts-training data
1072 479 961 415 470 1038
class counts-test data
461 224 397 211 237 470
10 13.39 3.45 3.76 6.97 44.58 19.79 18.69
10 11.05 1.95 3.12 5.79 37.91 15.19 14.04
20 10.73 2.52 2.71 5.20 42.65 15.32 13.20
20 9.70 1.08 2.23 5.54 37.91 11.81 11.49
30 9.81 2.05 2.30 4.47 40.24 14.89 11.75
30 9.50 0.87 3.12 5.79 37.44 11.39 10.64
40 9.31 1.96 2.51 4.16 39.76 14.04 10.50
40 9.15 1.30 2.68 4.28 36.02 11.81 10.64
50 9.00 2.05 2.30 3.95 39.76 13.40 9.63
50 9.20 0.87 2.68 5.04 36.49 10.97 10.85
final error rate % 8.99662
final error test % 9.20000
Training set confusion matrix (OOB):
true class
1 2 3 4 5 6
1 1050 0 6 4 25 1
2 1 468 0 1 2 2
3 14 0 923 82 1 19
4 1 4 20 250 6 56
5 6 5 1 4 407 22
6 0 2 11 74 29 938
Test set confusion matrix:
true class
1 2 3 4 5 6
1 457 0 4 0 8 0
2 0 218 0 3 3 0
3 2 1 377 33 0 12
4 0 1 10 134 1 28
5 2 2 0 1 211 11
6 0 2 6 40 14 419
The pairs of lines in the output give:
the oob estimates of the overall error rate and class error rates,
the test set estimates of the same quantities.
The output may vary slightly with different compilers and settings of the random number seed.
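As a check (an illustrative Python computation, not part of the Fortran program), the error rates printed above can be recovered from the confusion matrix. Columns are the true class and rows the predicted class, so class j's errors are the off-diagonal entries of column j:

```python
# OOB training set confusion matrix from the run above
# (rows = predicted class, columns = true class).
oob = [
    [1050,   0,   6,   4,  25,   1],
    [   1, 468,   0,   1,   2,   2],
    [  14,   0, 923,  82,   1,  19],
    [   1,   4,  20, 250,   6,  56],
    [   6,   5,   1,   4, 407,  22],
    [   0,   2,  11,  74,  29, 938],
]
total = sum(sum(row) for row in oob)                   # 4435 cases
correct = sum(oob[j][j] for j in range(6))
overall = 100.0 * (total - correct) / total            # overall error %

# Per-class error: off-diagonal fraction of each column.
col_totals = [sum(oob[i][j] for i in range(6)) for j in range(6)]
class_err = [100.0 * (col_totals[j] - oob[j][j]) / col_totals[j]
             for j in range(6)]
```

This reproduces the final line of the run: overall error 8.99662% and class errors 2.05, 2.30, 3.95, 39.76, 13.40, 9.63.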
Suppose you want to save the forest. Then the options read as below:
parameter(
c DESCRIBE DATA
1 mdim=36, nsample0=4435, nclass=6, maxcat=1,
1 ntest=0, labelts=0, labeltr=1,
c
c SET RUN PARAMETERS
2 mtry0=6, ndsize=1, jbt=50, look=10, lookcls=1,
2 jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c SET IMPORTANCE OPTIONS
3 imp=0, interact=0, impn=0, impfast=0,
c
c SET PROXIMITY COMPUTATIONS
4 nprox=0, nrnn=5,
c
c SET OPTIONS BASED ON PROXIMITIES
5 noutlier=0, nscale=0, nprot=0,
c
c REPLACE MISSING VALUES
6 code=-999, missfill=0, mfixrep=0,
c
c GRAPHICS
7 iviz=0,
c
c SAVING A FOREST
8 isaverf=1, isavepar=0, isavefill=0, isaveprox=0,
c
c RUNNING A SAVED FOREST
9 irunrf=0, ireadpar=0, ireadfill=0, ireadprox=0)
Compile and run the code to get the following output on the screen:
=============================================================
class counts-training data
1072 479 961 415 470 1038
10 13.39 3.45 3.76 6.97 44.58 19.79 18.69
20 10.73 2.52 2.71 5.20 42.65 15.32 13.20
30 9.81 2.05 2.30 4.47 40.24 14.89 11.75
40 9.31 1.96 2.51 4.16 39.76 14.04 10.50
50 9.00 2.05 2.30 3.95 39.76 13.40 9.63
final error rate % 8.99662
Training set confusion matrix (OOB):
true class
1 2 3 4 5 6
1 1050 0 6 4 25 1
2 1 468 0 1 2 2
3 14 0 923 82 1 19
4 1 4 20 250 6 56
5 6 5 1 4 407 22
6 0 2 11 74 29 938
To run the 2000-case satimage test set down the forest saved from the training-set run, the options look like this:
parameter(
c DESCRIBE DATA
1 mdim=36, nsample0=1, nclass=6, maxcat=1,
1 ntest=2000, labelts=1, labeltr=1,
c
c SET RUN PARAMETERS
2 mtry0=6, ndsize=1, jbt=50, look=10, lookcls=1,
2 jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c SET IMPORTANCE OPTIONS
3 imp=0, interact=0, impn=0, impfast=0,
c
c SET PROXIMITY COMPUTATIONS
4 nprox=0, nrnn=5,
c
c SET OPTIONS BASED ON PROXIMITIES
5 noutlier=0, nscale=0, nprot=0,
c
c REPLACE MISSING VALUES
6 code=-999, missfill=0, mfixrep=0,
c
c GRAPHICS
7 iviz=0,
c
c SAVING A FOREST
8 isaverf=0, isavepar=0, isavefill=0, isaveprox=0,
c
c RUNNING A SAVED FOREST
9 irunrf=1, ireadpar=0, ireadfill=0, ireadprox=0)
Compile and run the code to get the following output on the screen:
=============================================================
10 11.05 1.95 3.12 5.79 37.91 15.19 14.04
20 9.70 1.08 2.23 5.54 37.91 11.81 11.49
30 9.50 0.87 3.12 5.79 37.44 11.39 10.64
40 9.15 1.30 2.68 4.28 36.02 11.81 10.64
50 9.20 0.87 2.68 5.04 36.49 10.97 10.85
final error test % 9.20000
Test set confusion matrix:
true class
1 2 3 4 5 6
1 457 0 4 0 8 0
2 0 218 0 3 3 0
3 2 1 377 33 0 12
4 0 1 10 134 1 28
5 2 2 0 1 211 11
6 0 2 6 40 14 419
Reference: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm