uvepls


SYNOPSIS

uvepls  [type={LOO | LTO | LMO; defaults to LOO}]  \
    [runs=<number of runs; defaults to 20>]  \
    [groups=<number of groups; defaults to 5>]  \
    [pc=<number of PCs; defaults to the number of PCs of the current PLS model>]  \
    [dummy_range_coefficient=<1.0 - 5.0; defaults to 1.0>]  \
    [dummy_value_coefficient=<1.0e-20 - 1.0; defaults to 1.0e-10>]  \
    [use_srd_groups={YES | NO; defaults to NO}]  \
    [save_ram={YES | NO; defaults to NO}]  \
    [uve_m={YES | NO; defaults to NO}]  \
    [uve_alpha=<0.0 - 100.0; defaults to 0.0>]  \
    [ive={YES | NO; defaults to NO}]  \
    [ive_percent_limit=<0 - 100; defaults to 100>]  \
    [ive_external_pred={YES | NO; defaults to NO}]


DESCRIPTION

The uvepls keyword is used to carry out a variable selection according to both the UVE-PLS methodology as originally described by Centner et al. [1], and the modified iterative IVE-PLS procedure developed by Polanski et al. [2]. The original UVE-PLS method is based on the calculation, for each of the active X variables included in the PLS model, of the following ratio:

c(j) = b(j) / s[b(j)]

where b(j) is the average of the PLS pseudo-coefficients of variable j obtained from n leave-one-out cross-validation runs over n objects, and s[b(j)] is the standard deviation of those coefficients. Each real variable is kept or rejected by comparing its c(j) with the largest c(j) among the dummy variables, which are assigned small, random values. In the same paper [1], two robust variants of the algorithm were proposed. In the first, UVE-M, the average of the pseudo-coefficients is replaced by the median, and the standard deviation by the interquartile range. In the second, UVE-α, each c(j) is compared not with the largest c(j) among the dummy variables, but with a user-defined percentile (between 0 and 100) of the sorted dummy c(j) vector. Both variants are implemented in Open3DQSAR; additionally, following the suggestions by Grohmann and Schindler [3], leave-two-out and leave-many-out cross-validation can also be used to calculate b(j) and s[b(j)].
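The selection rule can be illustrated with a minimal Python sketch. It is not Open3DQSAR's code: the function names are hypothetical, the input is assumed to be one list of cross-validated pseudo-coefficients per variable, and several details are assumptions made only for the illustration (sample standard deviation, inclusive quartiles, comparison of absolute values, nearest-rank percentile). The percentile is taken literally, so 100 reproduces the classic "largest dummy c(j)" rule; how Open3DQSAR maps uve_alpha onto this is not covered here.

```python
import math
import statistics


def reliability(coeffs, robust=False):
    """Return c(j) from the pseudo-coefficients of one variable across CV runs."""
    if robust:
        # UVE-M: median divided by the interquartile range
        # (inclusive quartiles are an assumption of this sketch)
        q1, _, q3 = statistics.quantiles(coeffs, n=4, method="inclusive")
        return statistics.median(coeffs) / (q3 - q1)
    # classic UVE: mean divided by the sample standard deviation (assumption)
    return statistics.mean(coeffs) / statistics.stdev(coeffs)


def dummy_cutoff(dummy_c, percentile=100.0):
    """Pick the cutoff from the sorted |c(j)| values of the dummy variables."""
    ranked = sorted(abs(c) for c in dummy_c)
    # nearest-rank percentile; 100 selects the largest dummy value
    k = max(math.ceil(percentile / 100.0 * len(ranked)) - 1, 0)
    return ranked[k]


def select_variables(real, dummy, robust=False, percentile=100.0):
    """Return indices of real variables whose |c(j)| exceeds the dummy cutoff."""
    cutoff = dummy_cutoff([reliability(d, robust) for d in dummy], percentile)
    return [j for j, coeffs in enumerate(real)
            if abs(reliability(coeffs, robust)) > cutoff]
```

Lowering the percentile lowers the cutoff, so more real variables survive; the robust option changes only how each c(j) is computed, not how the cutoff is applied.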
The IVE-PLS methodology still relies on the magnitude of the PLS pseudo-coefficients to rule out unimportant variables, but it is iterative: instead of computing c(j) for every variable in one pass and keeping or rejecting each variable by comparison with the c(j) values of the dummy variables, it eliminates the variable with the smallest c(j) after each pass. Once all X variables have been sequentially eliminated, the model which yields the highest q2 is chosen. In Open3DQSAR's implementation, as in its FFD variable selection, the SDEP evaluated on an external test set can be used instead of q2 as the criterion to choose the best model.
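The iterative scheme can be sketched in Python as a plain backward elimination. This is an illustration under stated assumptions, not Open3DQSAR's implementation: `c_values` and `score` are hypothetical callables standing in for the cross-validated c(j) calculation and for the model quality (q2, or a negated SDEP so that higher is always better), `percent_limit` is assumed to cap the fraction of variables that may be removed, and the loop stops at one remaining variable because an empty model cannot be scored.

```python
def ive_select(variables, c_values, score, percent_limit=100.0):
    """Drop the smallest-|c(j)| variable each pass; return the best-scoring subset."""
    active = list(variables)
    # maximum number of variables that may be eliminated (assumed semantics)
    max_removed = int(len(active) * percent_limit / 100.0)
    best_subset, best_score = list(active), score(active)
    for _ in range(max_removed):
        if len(active) <= 1:
            break  # keep at least one variable in the model
        c = c_values(active)  # mapping variable -> c(j), recomputed on the current subset
        weakest = min(active, key=lambda v: abs(c[v]))
        active.remove(weakest)
        current = score(active)
        if current > best_score:
            best_subset, best_score = list(active), current
    return best_subset, best_score
```

Because c(j) is recomputed after every removal, the elimination order can differ from a single-pass ranking, which is the point of the iterative variant.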
In both UVE-PLS and IVE-PLS methodologies, the variable grouping accomplished by the Smart Region Definition procedure [4] can be taken into account; namely, c(j) values can be computed as the average values over all the variables belonging to each SRD group, rather than being computed for single variables. While reference to the original literature is recommended for the details of UVE-PLS/IVE-PLS methodologies, in the following all parameters controlling the outcome of these procedures in Open3DQSAR are reviewed.
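As a rough illustration of the grouped calculation, the Python sketch below replaces each variable's c(j) with the mean c(j) of its group. The function name and the `group_of` mapping (variable to SRD group label) are hypothetical; how Open3DQSAR stores SRD groups internally is not described here.

```python
from collections import defaultdict
from statistics import mean


def group_c_values(c_values, group_of):
    """Replace each variable's c(j) with the mean c(j) of its SRD group."""
    members = defaultdict(list)
    for var, c in c_values.items():
        members[group_of[var]].append(c)
    group_mean = {group: mean(cs) for group, cs in members.items()}
    # every variable inherits the average of the group it belongs to
    return {var: group_mean[group_of[var]] for var in c_values}
```

With this averaging, all variables of a group share one c(j) and are therefore kept or rejected together.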

By default, the uvepls module operates in parallel fashion on multiprocessor machines, using all the CPUs available in the system; if one wishes to run the computation on a smaller number of CPUs, this may be specified with the env n_cpus keyword before calling uvepls.

EXAMPLES

# this command invokes UVE-PLS selection using LOO cross-validation with 3 principal components. Dummy values are generated according to default settings, and SRD groups are not used. The number of CPUs previously set by env n_cpus is used
uvepls  pc=3  type=LOO  use_srd_groups=no

# this command invokes UVE-PLS selection using LMO cross-validation (5 groups, 50 runs) with 5 principal components. Dummy values are generated according to default settings, SRD groups are used, and the robust UVE-M and UVE-α (75th percentile) variants are enabled. 2 CPUs are used
env  n_cpus=2
uvepls  pc=5  type=LMO  groups=5  runs=50  \
    use_srd_groups=yes  uve_m=yes  uve_alpha=75

# this command invokes IVE-PLS selection using LMO cross-validation (5 groups, 100 runs) with 5 principal components. SRD groups are not used, and the robust UVE-M statistics are requested (uve_m=yes). At most 50% of the variables can be eliminated. 4 CPUs are used
env  n_cpus=4
uvepls  ive=yes  pc=5  type=LMO  groups=5  runs=100  \
    use_srd_groups=no  uve_m=yes  ive_percent_limit=50


REFERENCES

  1. Centner, V.; Massart, D. L.; de Noord, O. E.; de Jong, S.; Vandeginste, B. M.; Sterna, C. Anal. Chem. 1996, 68, 3851-3858.
  2. Gieleciak, R.; Polanski, J. J. Chem. Inf. Model. 2007, 47, 547-556.
  3. Grohmann, R.; Schindler, T. J. Comput. Chem. 2008, 29, 847-860.
  4. Pastor, M.; Cruciani, G.; Clementi, S. J. Med. Chem. 1997, 40, 1455-1464.



Last update:
May 31. 2015 20:39:42
