### uvepls

#### SYNOPSIS

`uvepls [type={LOO | LTO | LMO}; defaults to LOO] \`

`[runs=<number of runs; defaults to 20>] \`

`[groups=<number of groups; defaults to 5>] \`

`[pc=<number of PCs; defaults to the number of PCs of the current PLS model>] \`

`[dummy_range_coefficient=<1.0 - 5.0; defaults to 1.0>] \`

`[dummy_value_coefficient=<1.0e-20 - 1.0; defaults to 1.0e-10>] \`

`[use_srd_groups={YES | NO; defaults to NO}] \`

`[save_ram={YES | NO; defaults to NO}] \`

`[uve_m={YES | NO; defaults to NO}] \`

`[uve_alpha=<0.0 - 100.0; defaults to 0.0>] \`

`[ive={YES | NO; defaults to NO}] \`

`[ive_percent_limit=<0 - 100; defaults to 100>] \`

`[ive_external_pred={YES | NO; defaults to NO}]`

#### DESCRIPTION

The `uvepls` keyword is used to carry out variable selection according to both the UVE-PLS methodology as originally described by Centner et al. [1] and the modified iterative IVE-PLS procedure developed by Polanski et al. [2]. The original UVE-PLS method is based on the calculation, for each of the active X variables included in the PLS model, of the following ratio:

`c(j) = b(j) / s[b(j)]`

where `b(j)` is the average of the PLS pseudo-coefficients obtained from *n* leave-one-out cross-validation runs over *n* objects, and `s[b(j)]` is the standard deviation of those coefficients. Variables are kept or rejected by comparing the `c(j)` of each real variable with the largest `c(j)` among dummy variables, which are assigned small, random values. In the same paper [1], two robust variants of the algorithm were proposed. In the first, UVE-M, the pseudo-coefficient average is replaced by the median, and the standard deviation by the interquartile range. In the second, UVE-*α*, instead of comparing each `c(j)` with the largest `c(j)` among dummy variables, a user-defined percentile between 0 and 100 is chosen in the sorted dummy `c(j)` vector. Both variants have been implemented in **Open3DQSAR**; additionally, following the suggestions by Grohmann and Schindler [3], leave-two-out and leave-many-out cross-validation can also be used to calculate `b(j)` and `s[b(j)]`.

The IVE-PLS methodology, while still relying on the estimated magnitude of the PLS pseudo-coefficients to rule out unimportant variables, is based on an iterative procedure: instead of computing `c(j)` values for each variable in one pass and choosing whether to keep or reject the variable by comparison with the corresponding `c(j)` values for dummy variables, after each pass the variable with the smallest `c(j)` value is eliminated. After the iterative procedure is completed by sequential elimination of all X variables, the model which yields the highest *q*² is chosen. In **Open3DQSAR**'s implementation, similarly to **Open3DQSAR**'s FFD variable selection, the SDEP evaluated on an external test set can also be used as the criterion to choose the best model instead of *q*².

In both UVE-PLS and IVE-PLS methodologies, the variable grouping accomplished by the Smart Region Definition (SRD) procedure [4] can be taken into account; namely, `c(j)` values can be computed as averages over all the variables belonging to each SRD group, rather than for single variables. While the original literature should be consulted for the details of the UVE-PLS/IVE-PLS methodologies, all parameters controlling the outcome of these procedures in **Open3DQSAR** are reviewed below.
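The standard UVE-PLS criterion described above can be illustrated with a minimal Python sketch. This is not Open3DQSAR's code: the function names and the toy coefficients are hypothetical, the pseudo-coefficients are simply passed in as lists rather than computed from cross-validated PLS models, and comparing absolute `c(j)` values is an assumption about how the comparison with the dummy variables is made.

```python
import statistics

def reliability(coeffs):
    """c(j) = mean(b_j) / stdev(b_j) over the CV runs of one variable."""
    return statistics.mean(coeffs) / statistics.stdev(coeffs)

def uve_select(b_real, b_dummy):
    """Return the indices of the real variables to keep.

    b_real, b_dummy: one entry per variable; each entry lists the
    pseudo-coefficients that variable obtained in the individual CV runs.
    Assumption: a real variable is kept when its |c(j)| exceeds the
    largest |c(j)| found among the dummy variables.
    """
    threshold = max(abs(reliability(b)) for b in b_dummy)
    return [j for j, b in enumerate(b_real) if abs(reliability(b)) > threshold]

# Toy data (hypothetical): 4 CV runs, 3 real and 2 dummy variables
b_real = [
    [0.90, 1.10, 1.00, 1.00],      # stable, clearly non-zero coefficient
    [0.50, -0.40, 0.30, -0.60],    # sign flips between runs: unreliable
    [-2.00, -2.10, -1.90, -2.00],  # stable, negative coefficient
]
b_dummy = [
    [1.0e-3, -2.0e-3, 1.5e-3, -1.0e-3],
    [2.0e-3, 1.0e-3, -1.0e-3, 1.0e-3],
]

kept = uve_select(b_real, b_dummy)
print(kept)  # [0, 2]
```

The second variable is rejected because its coefficient changes sign from run to run, so its mean is small compared with its spread. Swapping `statistics.mean` and `statistics.stdev` for a median and an interquartile range would give a UVE-M-style variant of the same sketch.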

`type`: selects the kind of cross-validation (`LOO`, `LTO`, `LMO`) used to compute `c(j)` values. For `type=LMO`, the number of CV `runs` and the number of `groups` into which the dataset is split can also be chosen

`pc`: the number of PCs used to build the CV PLS models; defaults to the number of PCs extracted in the current PLS model

`dummy_range_coefficient`: this parameter, together with `dummy_value_coefficient`, determines the range of values which dummy variables can assume; in particular, each dummy variable is assigned a random value according to the following equation:

`dummy_value = (dummy_range_coefficient * (largest_x_value_in_current_field - smallest_x_value_in_current_field) * (rand() - 0.5) + 0.5 * (largest_x_value_in_current_field + smallest_x_value_in_current_field)) * dummy_value_coefficient`

`dummy_value_coefficient`: this parameter, together with the previous one, determines the range of values which dummy variables can assume (see above)
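As an illustration of the dummy value equation, the following Python sketch evaluates it for a hypothetical field whose X values span -30.0 to 5.0. It assumes that `rand()` stands for a uniform random number in [0, 1), which the equation suggests but does not state; the function name and the field limits are made up for the example.

```python
import random

def dummy_value(x_min, x_max, range_coeff=1.0, value_coeff=1.0e-10, rng=random):
    """One dummy value following the equation above.

    rng.random() plays the role of rand(), assumed uniform in [0, 1).
    The defaults mirror the documented defaults of dummy_range_coefficient
    and dummy_value_coefficient.
    """
    return (range_coeff * (x_max - x_min) * (rng.random() - 0.5)
            + 0.5 * (x_max + x_min)) * value_coeff

# Hypothetical field limits; a seeded generator keeps the run reproducible
rng = random.Random(0)
values = [dummy_value(-30.0, 5.0, rng=rng) for _ in range(1000)]
print(min(values), max(values))
```

With `dummy_range_coefficient=1.0` the dummy values span the same interval as the real field values, scaled down by `dummy_value_coefficient`; a larger range coefficient widens that interval around the field midpoint.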

`use_srd_groups`: this parameter determines whether `c(j)` values are computed for single variables or as averages over the variables belonging to the groups identified by the SRD algorithm

`save_ram`: this parameter, if set to `YES`, saves physical RAM at the cost of some performance, since the pseudo-coefficient matrices calculated in the individual CV runs are stored in a temporary file rather than in memory

`uve_m`: this parameter, if set to `YES`, enables the use of [median, interquartile range] instead of [average, standard deviation] to calculate `c(j)` values (see the discussion of the UVE-PLS methodology above)

`uve_alpha`: this parameter, which may range from 0 to 100, selects as the threshold value the corresponding percentile in the sorted dummy `c(j)` vector instead of the maximum value. If `uve_alpha` is set to 0, standard UVE-PLS is carried out

`ive`: this parameter, if set to `YES`, selects IVE-PLS in place of UVE-PLS

`ive_percent_limit`: this parameter, if set to a value lower than 100 (the default), stops the IVE-PLS iterative procedure after a percentage of variables equal to `ive_percent_limit` has been marked for deletion, even if *q*² (or the SDEP, if external validation has been chosen) could still improve by removing further variables

`ive_external_pred`: this parameter, if set to `YES`, uses the SDEP on an external test set as the criterion to select the variable set with the highest predictivity
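The interplay of `ive` and `ive_percent_limit` can be pictured with a short Python sketch of the iterative elimination loop. It is a simplified illustration under stated assumptions, not Open3DQSAR's implementation: `reliability_fn` and `score_fn` are hypothetical callbacks standing in for the cross-validated `c(j)` calculation and for the model quality, the loop keeps at least one variable, and the toy values below are invented.

```python
def ive_select(n_vars, reliability_fn, score_fn, percent_limit=100):
    """Iteratively drop the variable with the smallest |c(j)|.

    reliability_fn(active) -> {j: c(j)} for the currently active variables
    score_fn(active)       -> model quality, higher is better (e.g. q2)
    Returns the best-scoring active set met during elimination and its score.
    """
    active = list(range(n_vars))
    best_score, best_set = score_fn(active), list(active)
    max_removed = n_vars * percent_limit // 100
    removed = 0
    # Assumption: keep at least one variable so a model can still be built
    while removed < max_removed and len(active) > 1:
        c = reliability_fn(active)          # recomputed at every pass
        worst = min(active, key=lambda j: abs(c[j]))
        active.remove(worst)
        removed += 1
        score = score_fn(active)
        if score > best_score:
            best_score, best_set = score, list(active)
    return best_set, best_score

# Toy problem (hypothetical): variables 0, 2 and 4 carry signal
C_VALUES = [5.0, 0.2, 3.0, 0.1, 4.0, 0.3]
SIGNAL = {0, 2, 4}

def toy_reliability(active):
    return {j: C_VALUES[j] for j in active}

def toy_score(active):
    n_signal = sum(1 for j in active if j in SIGNAL)
    n_noise = len(active) - n_signal
    return n_signal / len(SIGNAL) - 0.1 * n_noise

full_run = ive_select(6, toy_reliability, toy_score)
limited_run = ive_select(6, toy_reliability, toy_score, percent_limit=33)
print(full_run[0])     # [0, 2, 4]
print(limited_run[0])  # [0, 1, 2, 4, 5]
```

In the limited run only one of the six variables may be removed, so the search stops before the two remaining noise variables are eliminated. To mimic `ive_external_pred=yes` in this sketch, `score_fn` could return the negative of an external-set SDEP, since a lower SDEP is better.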

The `uvepls` module runs in parallel on multiprocessor machines, using all the CPUs available in the system; to run the computation on a smaller number of CPUs, specify it with the `env n_cpus` keyword before calling `uvepls`.

#### EXAMPLES

```
# This command invokes UVE-PLS selection using LOO cross-validation
# with 3 principal components. Dummy values are generated according to
# default settings, and SRD groups are not used. The number of CPUs
# previously set by env n_cpus is used
uvepls pc=3 type=LOO use_srd_groups=no

# This command invokes UVE-PLS selection using LMO cross-validation
# (5 groups, 50 runs) with 5 principal components. Dummy values are
# generated according to default settings, SRD groups are used and the
# robust UVE-M and UVE-α (75th percentile) variants have been chosen.
# 2 CPUs are used
env n_cpus=2
uvepls pc=5 type=LMO groups=5 runs=50 \
    use_srd_groups=yes uve_m=yes uve_alpha=75

# This command invokes IVE-PLS selection using LMO cross-validation
# (5 groups, 100 runs) with 5 principal components. SRD groups are
# not used. The maximum number of eliminated variables is limited
# to 50%. 4 CPUs are used
env n_cpus=4
uvepls ive=yes pc=5 type=LMO groups=5 runs=100 \
    use_srd_groups=no uve_m=yes ive_percent_limit=50
```

#### REFERENCES

1. Centner, V.; Massart, D. L.; de Noord, O. E.; de Jong, S.; Vandeginste, B. M.; Sterna, C. *Anal. Chem.* **1996**, *68*, 3851-3858.
2. Gieleciak, R.; Polanski, J. *J. Chem. Inf. Model.* **2007**, *47*, 547-556.
3. Grohmann, R.; Schindler, T. *J. Comput. Chem.* **2008**, *29*, 847-860.
4. Pastor, M.; Cruciani, G.; Clementi, S. *J. Med. Chem.* **1997**, *40*, 1455-1464.