SVM-RFE software

Now that we have a ranking of features for each of the 10 training sets, the final step is to estimate the generalization error we can expect if we were to train a final classifier on these features and apply it to a new test set. Here, a radial basis function (RBF) kernel SVM is tuned on each training set independently. This consists of a grid search over combinations of the SVM hyperparameters cost and gamma, with the error at each combination estimated by internal cross-validation.
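
A minimal sketch of this tuning step, assuming the e1071 package; the names train.x and train.y are placeholders for the features and labels of one external training fold:

```r
# Grid-search tuning sketch, assuming e1071; train.x/train.y are placeholders.
library(e1071)

tuned <- tune.svm(train.x, train.y,
                  kernel = "radial",
                  gamma  = 10^(-5:0),   # candidate gamma values
                  cost   = 10^(-2:4),   # candidate cost values
                  tunecontrol = tune.control(cross = 10))  # internal 10-fold CV

tuned$best.parameters  # cost/gamma pair with the lowest internal CV error
```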

The optimal parameters are then used to train the SVM on the entire training set. Finally, generalization error is estimated by predicting on the corresponding held-out test set. This is done for each fold of the external CV, and all 10 of these generalization error estimates are averaged for stability. The process is repeated while varying the number of top features used as input; there is typically a "sweet spot" where there are neither too many nor too few features.
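
Continuing the sketch under the same assumptions, the tuned parameters are used to fit the final SVM for that fold and score the held-out test set (test.x and test.y are again placeholders):

```r
# Generalization error for one external fold, using the tuned parameters above.
fit  <- svm(train.x, train.y, kernel = "radial",
            gamma = tuned$best.parameters$gamma,
            cost  = tuned$best.parameters$cost)
pred <- predict(fit, test.x)
err  <- mean(pred != test.y)   # misclassification rate on the held-out fold
```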

Outlined, this process works as follows. Each featsweep list element i corresponds to using the top i features as input. Within each element, svm holds the generalization error estimates from the external CV folds, and these are averaged as error. To show the results visually, we can plot the average generalization error vs. the number of top features used. For reference, it is useful to also show the chance error rate. Typically, this is equal to the "no information" rate we would get if we simply always picked the class label with the greater prevalence in the data set.
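
A minimal plotting sketch; the structure of featsweep (an error field per element) and the label vector y are assumptions here:

```r
# Assumes each featsweep[[i]] has an 'error' field holding the averaged
# generalization error for the top-i-feature model, and that 'y' is the
# vector of class labels for the full data set.
errors  <- sapply(featsweep, function(x) x$error)
no.info <- 1 - max(prop.table(table(y)))  # chance ("no information") error rate

plot(errors, type = "o",
     xlab = "Number of top features used",
     ylab = "Average generalization error")
abline(h = no.info, lty = 2)  # dashed reference line at chance error
```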

As you can probably see, the main limitation of this type of exploration is processing time. For example, in this demonstration, just considering the number of times we have to fit an SVM, the total is the product of the number of external folds, the number of feature-set sizes in the sweep, the number of points in the hyperparameter grid, and the number of internal CV folds.
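
As a purely illustrative calculation (the counts here are assumptions, not taken from the package): with 10 external folds, a sweep over 30 feature-set sizes, a 6 × 7 grid of cost/gamma values, and 10 internal CV folds, the tuning step alone would require 10 × 30 × 42 × 10 = 126,000 SVM fits, before even counting the fits performed during feature ranking.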

We have already shortened this somewhat by (1) eliminating more than one feature at a time in the feature ranking step, and (2) only estimating generalization accuracies across a sweep of the top features. This code is already set up to use lapply calls for these two main tasks, so fortunately they can be parallelized relatively easily using any of the R packages that provide parallel versions of lapply, such as mclapply and parLapply from the base parallel package; see the sketch below.
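
As a minimal sketch, assuming the base parallel package, the feature sweep could be switched from lapply to mclapply like this (run.sweep is a hypothetical per-feature-count worker function):

```r
# Parallelizing the feature sweep with the base 'parallel' package;
# 'run.sweep' is a placeholder for the per-feature-count worker function.
library(parallel)

# serial version:
# featsweep <- lapply(1:30, run.sweep)

# parallel drop-in replacement (forking; Unix-alikes only, use parLapply on Windows):
featsweep <- mclapply(1:30, run.sweep, mc.cores = detectCores())
```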

Included in the sge directory is an example custom parallel implementation using the SGE cluster interface.

Using simulation studies based on time-to-event outcomes and three real datasets, we evaluate three proposed methods, based on pseudo-samples and kernel principal component analysis, and compare them with the original SVM-RFE algorithm for non-linear kernels.

In the simulation studies, the three proposed algorithms generally performed better than the gold-standard RFE for non-linear kernels when the truly most relevant variables were compared with the variable ranks produced by each algorithm.

Generally, RFE-pseudo-samples outperformed the other three methods in all tested scenarios, even when variables were assumed to be correlated. Conclusions: The proposed approaches can be used to accurately select variables and to assess the direction and strength of associations when analyzing biomedical data with SVMs for categorical or time-to-event responses.

This script uses the Orange machine learning library to implement an algorithm for extracting and ranking the features that carry the most discriminative or predictive power for an observation's class membership.

It can be used to improve the performance of classifiers, as well as to aid the discovery of biomarkers. To come: plotting of the number of features vs. accuracy, a permutation test to assess significance, and further documentation.



