Bayesian Penalized Method for Streaming Feature Selection
XIAO-TING WANG AND XIN-ZE LUAN
Shenyang 110179, China
Corresponding author: Xin-Ze Luan
ABSTRACT Online feature selection with streaming features has become more and more important in recent years. In contrast to standard feature selection methods, a streaming feature selection method can select features dynamically without exploring the full feature space. In applications where the information of all features is unknown in advance, standard feature selection methods cannot perform well. Moreover, in ultrahigh-dimensional data analysis, especially when considering interaction effects between features, streaming feature selection methods enjoy computational feasibility. In this paper, we propose a Bayesian penalized method for the streaming feature selection problem. The proposed approach adopts Bayesian regularization to adaptively estimate the regularization parameter from the coefficients of the current model. Compared with many existing streaming feature selection methods, our method can work for a more general class of predictive models. The proposed method is evaluated extensively on various high-dimensional datasets. The experimental results show that the algorithm is competitive with many existing streaming and traditional feature selection algorithms.
INDEX TERMS Streaming feature selection, Bayesian penalized model.
I. INTRODUCTION
In many real-world applications, such as bioinformatics, computer vision, and social media, the data normally include a large number of features; such data are referred to as high-dimensional data. For example, in miRNA gene expression analysis, the data normally contain a large number of genes but only a small number of samples. Thus, one of the critical challenges of high-dimensional data is how to obtain valuable information from a large number of features.

Feature selection is an effective way to address this issue and plays an important role in machine learning and data mining. The goal of feature selection is to select a subset of relevant features from a large number of features. This approach is able to reduce the complexity of the model and improve its interpretability. A variety of feature selection methods have been proposed, and most of them are standard feature selection methods, which assume that all features are available in advance. However, in many real applications, such as email spam detection and commercial statistical analysis, features may arrive one by one, and it is infeasible to perform standard feature selection in this setting. Different from standard feature selection algorithms, streaming feature selection does not assume that all features are present; instead, features are dynamically considered for inclusion in a predictive model. Moreover, in ultrahigh-dimensional data analysis, especially when interaction effects within the features are considered, many feature selection methods would suffer from a heavy computational burden. To overcome the aforementioned challenges, many streaming feature selection methods have been developed, such as Grafting [1], Alpha-investing [2], and OSFS [3], [4].
The OSFS algorithm uses an online relevance and redundancy analysis framework to select an optimal subset from streaming features. The OSFS approach selects features independently of the predictive model; hence, after selecting the optimal features, we still need to train a predictive model on them. The Alpha-investing method selects a new feature based on a p-value evaluated by a regression model. Specifically, if the p-value of the new feature attains a certain threshold, the feature will be selected. In this algorithm, a feature that has been selected will never be discarded. The drawback of Alpha-investing is that it only evaluates incoming features but never tests the redundancy of the selected features after new features have been included.

Different from the Alpha-investing method, Grafting is a stage-wise approach: it is able to select a new feature and drop currently selected features. The Grafting method selects features based on a penalized model, so the algorithm relies on some regularization parameters. However, setting the regularization parameter requires global information about the feature space; hence, Grafting is not suitable when the feature stream is infinite. Motivated by the Grafting method, we develop a Bayesian penalized method for the streaming feature selection problem. In contrast to Grafting, the Bayesian penalized method is able to adaptively estimate the regularization parameter when a new feature is included in the current model. Hence, our method can select features when the size of the feature stream is unknown. Moreover, compared with many existing streaming feature selection methods, our method works for a more general class of predictive models and is not restricted to classification problems. In this paper, we will focus on the classification model and will discuss the regression model in a later section.
The rest of this paper is organized as follows. In Section II, we review related existing work on traditional and streaming feature selection. In Section III, we introduce the Bayesian penalized model for streaming feature selection. We provide experimental studies in Section IV. Finally, the conclusion is given in Section V.
II. RELATED WORKS
In this section, we give a brief review of conventional feature selection approaches and several streaming feature selection algorithms.
A. TRADITIONAL FEATURE SELECTION
The basic assumption of traditional feature selection methods is that all the features are available in advance. In general, traditional feature selection can be divided into three classes: filter, wrapper, and embedded approaches [5], [6].
• Filter feature selection approaches [7]–[11] evaluate the importance of features based on the properties of the data or some statistical rule. The filter method is independent of the machine learning algorithm and can be considered a preprocessing step. Thus, this type of feature selection method is computationally fast. On the other hand, filter feature selection ignores the classification model, which may lead to worse prediction performance.
• Wrapper feature selection approaches [12]–[14] train a specific classification model and use the classification performance as the rule to evaluate the selected variable subset. Compared with filter methods, wrapper methods favor good prediction performance. However, wrapper methods usually search over all possible feature subsets; thus, this type of feature selection method is very computationally intensive.
• Embedded methods [15]–[17] aim to incorporate feature selection into the model training process. The most popular embedded method is the Lasso [18]. The embedded approach enjoys good prediction performance and fast computational performance.
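As a small illustration of the embedded idea (ours, not from the paper): an L1 penalty drives irrelevant coefficients to exactly zero while the model is being fit, so selection and training happen together. The sketch below solves a least-squares Lasso with proximal gradient descent (ISTA); all names and the synthetic data are our own assumptions.

```python
import numpy as np

# Embedded selection sketch: Lasso via proximal gradient (ISTA) on
# (1/2n)||Xw - y||^2 + lam * ||w||_1. The L1 penalty zeroes out the
# coefficients of irrelevant features, which is the selection step.

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=500):
    n, p = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2   # 1/L for the smooth part
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.5, 1.0]            # only the first 3 features matter
y = X @ true_w + 0.01 * rng.normal(size=100)

w = lasso_ista(X, y, lam=0.1)
selected = np.flatnonzero(np.abs(w) > 1e-3)
print(selected)                           # selected features: [0 1 2]
```

The selected set is read off directly from the non-zero coefficients, which is what distinguishes embedded methods from filter and wrapper approaches.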
B. STREAMING FEATURE SELECTION
Standard feature selection techniques assume that all features are present before the model is trained. However, in many real-world applications, features arrive sequentially and are considered for the predictive model one by one. Moreover, in ultrahigh-dimensional data applications, many traditional feature selection methods would suffer from a heavy computational burden. To overcome these challenges, many streaming feature selection methods have been developed [20].
The Alpha-investing method [2] selects an arriving feature according to a p-value returned by a regression model. Specifically, the new feature will be selected if its p-value attains a certain threshold. However, a feature that has been selected will never be removed in this approach. Dhillon et al. [21] extended the Alpha-investing method and proposed a multiple stream-wise feature selection approach to address multiple feature classes. Similar to [2], the main drawback of this approach is that it only considers including new features and never removes redundancy among the features that have already been selected. The OSFS algorithm [3] uses a relevance and redundancy analysis framework to identify the optimal feature subset in the streaming feature selection setting. However, the OSFS approach selects features independently of the classification model; thus, a classification model still needs to be trained on the selected features. In addition, when the number of selected features is large, this method suffers a heavy computational burden. Reference [4] further proposed Fast-OSFS to improve the computational efficiency. Reference [22] proposed the Scalable and Accurate OnLine Approach (SAOLA) for the streaming feature selection problem. This work mainly focuses on addressing the challenges due to extremely high dimensionality. The method adopts pairwise comparison techniques and maintains a sparse model during the feature selection process.
Perkins and Theiler [1] proposed the Grafting algorithm for online feature selection. In particular, Grafting is a stage-wise approach based on a regularized learning framework. It iteratively generates a feature subset using gradient descent: in each iteration step, a gradient descent test is used to identify the feature that is most likely to improve the current model. However, this method selects features by relying on some regularization parameters, and setting the parameter requires global information about the feature space. Hence, it is not suitable when the feature stream is infinite. Motivated by the Grafting approach, we develop a penalized model with Bayesian regularization for the streaming feature selection problem. Compared with Grafting, our method can adaptively update the regularization parameter when a new feature is included in the current model and is able to select features when the size of the full feature set is unknown.
III. PROPOSED METHODOLOGY
A. THE LASSO PENALIZED MODEL
Regularization with an L1 penalty term [18] has been widely used in feature selection problems. Before applying the penalized model to streaming feature selection, we first introduce the Lasso penalized model in traditional feature selection, which is based on the full feature space. Suppose we have the input dataset D = {(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)}, where X ∈ R^{n×p}, n is the number of input samples, p is the number of features, and y ∈ R^{n×1} is the corresponding response variable. The penalized model can be expressed as:

min_w L(D, w) + λ Σ_{i=1}^p |w_i|,    (1)

where L(D, w) is the loss function, the coefficients w ∈ R^{p×1} are estimated by optimizing the objective function (1), and λ > 0 is a control parameter. In this work, we focus on the binary classification problem (y_i ∈ {−1, 1}) and use the logistic loss as the loss function in the model:

L(D, w) = Σ_{i=1}^n log(1 + exp(−y_i w^T X_i)).    (2)
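The objective (1) with the logistic loss (2) can be written out directly. The following is a minimal numpy sketch (variable names and the synthetic data are our own assumptions, not the authors' code):

```python
import numpy as np

# Sketch of the penalized objective (1) with logistic loss (2):
#   min_w  L(D, w) + lam * ||w||_1,
#   L(D, w) = sum_i log(1 + exp(-y_i * w^T x_i)),  with y_i in {-1, +1}.

def logistic_loss(w, X, y):
    margins = y * (X @ w)                        # y_i * w^T x_i
    return np.sum(np.logaddexp(0.0, -margins))   # stable log(1 + exp(-m))

def penalized_objective(w, X, y, lam):
    return logistic_loss(w, X, y) + lam * np.sum(np.abs(w))

def loss_gradient(w, X, y):
    # dL/dw = -sum_i y_i * sigmoid(-m_i) * x_i
    margins = y * (X @ w)
    s = 1.0 / (1.0 + np.exp(margins))            # sigmoid(-m_i)
    return -(X.T @ (y * s))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=50))
w = np.zeros(5)
print(penalized_objective(w, X, y, lam=1.0))     # = 50*log(2) at w = 0
```

The gradient of the smooth part, computed by `loss_gradient`, is exactly the quantity used later in the gradient test of the Grafting-style selection rules.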
B. THE BAYESIAN GRAFTING APPROACH FOR STREAMING FEATURE SELECTION
In this subsection, we extend the L1 regularized method to the streaming feature setting. In the streaming feature setting, we should select the best subset of the features seen so far based on the predictive model. Hence, we need to develop criteria to discard irrelevant features and activate important features based on the predictive model. We first formally define a screening rule that rejects irrelevant features and a gradient test that activates features seen so far.
Definition 1 (Correlation Filter): Assume a feature X_j and a given response variable y. If the correlation ρ(y, X_j) ≈ 0, then the feature X_j is irrelevant to y.
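The correlation filter of Definition 1 can be sketched as follows (a minimal sketch of our own; the threshold value c and all names are assumptions, not the paper's code):

```python
import numpy as np

# Definition 1 sketch: a streaming feature is irrelevant when its sample
# correlation with the response is (approximately) zero, tested in
# practice as |rho(y, X_j)| < c for a small predefined threshold c.

def correlation_filter(x_j, y, c=0.1):
    """Return True if feature x_j survives the filter (|rho| >= c)."""
    rho = np.corrcoef(x_j, y)[0, 1]
    return abs(rho) >= c

rng = np.random.default_rng(2)
y = rng.normal(size=5000)
relevant = y + 0.5 * rng.normal(size=5000)   # correlated with the response
irrelevant = rng.normal(size=5000)           # independent noise feature
print(correlation_filter(relevant, y))       # True
print(correlation_filter(irrelevant, y))     # False
```

Features that fail this screen never reach the more expensive gradient test, which is what makes the rule useful in a streaming setting.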
Theorem 1: If a feature X_j is irrelevant to the response variable y, then we can reject the feature X_j from the penalized model.
Proof: If a feature X_j is irrelevant, then the correlation between the feature and the response variable satisfies ρ(y, X_j) ≈ 0, and hence X_j^T y ≈ 0. So, for every given regularization parameter λ > 0, we have |X_j^T y| < λ. According to the KKT conditions, we get the corresponding weight w_j = 0 in the predictive model. Hence, we can safely reject the feature X_j.

According to Theorem 1, we can formally define rule 1 (R1): if the feature is irrelevant to the response variable (the correlation between the feature and the response variable satisfies ρ(y, X_j) ≈ 0), then the feature is rejected. In our experimental studies, we use |ρ(y, X_j)| < c (where c is a predefined value) in place of ρ(y, X_j) ≈ 0. This rule is very similar to the irrelevant-feature test of the OSFS method, which is based on the chi-square test.

Moreover, we want to select the important features (or drop some irrelevant features) in the streaming feature selection application. Grafting is a stage-wise approach: it is able to select a new feature and drop currently selected features. The basic idea of Grafting for streaming feature selection is to include a feature in the optimized model if the reduction of the loss function value outweighs the L1 regularization term. Specifically, the feature X_j will be added to the model if:

|∂L(D, w)/∂w_j| > λ.

In the streaming feature selection setting, the Grafting approach performs this gradient test for the most recently seen feature. If the feature fails the gradient test, it is discarded; otherwise, it is included in the model, and the model is optimized with respect to the parameter of this feature. In order to drop some currently selected features, this gradient test and the model optimization are then repeated for all the selected features. However, the Grafting algorithm relies on the regularization parameter λ, and its value requires global information about the feature space. Hence, it is not suitable when the feature stream is infinite.

To address this issue, we can adopt a Bayesian L1 regularization term [23] and modify the Grafting algorithm to adaptively estimate the regularization parameter λ based on the values of the selected features. Minimization of (1) has a straightforward Bayesian interpretation. The posterior distribution for w, the parameters of the model, can be written as:

p(w | D) ∝ p(D | w) p(w | λ),    (3)

where the prior over the model parameters is given by a separable Laplace distribution:

p(w | λ) = (λ/2)^p exp(−λ Σ_i |w_i|),    (4)

where p is the number of active (non-zero) model parameters. A good value for the regularization parameter can be integrated out analytically. The prior distribution over the model parameters is given by marginalizing over λ:

p(w) = ∫ p(w | λ) p(λ) dλ.    (5)

As λ is a scale parameter, an appropriate prior is the improper Jeffreys prior p(λ) ∝ 1/λ, corresponding to a uniform prior over log λ. Using the Gamma integral, we obtain:

−log p(w) ∝ p log Σ_i |w_i|.    (6)

TABLE 1. Summary of high-dimensional microarray datasets.
Dataset    | number of features | number of samples
Leukemia   | 7129  | 71
Colon      | 2000  | 62
Prostate   | 6033  | 102
DLBCL      | 7130  | 77
Lung       | 12533 | 181

TABLE 2. Summary of three UCI datasets.
Dataset    | number of features | number of samples
IONOSPHERE | 34 | 351
SPAMBASE   | 57 | 4601
WDBC       | 30 | 569

TABLE 3. The prediction accuracy of streaming feature selection on high-dimensional microarray datasets.
Dataset  | BP-SFS     | OSFS       | Fast-OSFS  | Alpha-investing | SAOLA      | Grafting
Leukemia | 0.931±0.05 | 0.881±0.07 | 0.907±0.07 | 0.901±0.03      | 0.871±0.04 | 0.922±0.05
Colon    | 0.824±0.09 | 0.827±0.11 | 0.819±0.09 | 0.781±0.13      | 0.788±0.09 | 0.811±0.08
Prostate | 0.906±0.04 | 0.876±0.07 | 0.889±0.05 | 0.872±0.03      | 0.854±0.04 | 0.894±0.05
DLBCL    | 0.859±0.03 | 0.821±0.06 | 0.838±0.06 | 0.804±0.07      | 0.799±0.06 | 0.841±0.04
Lung     | 0.915±0.03 | 0.903±0.05 | 0.911±0.06 | 0.905±0.04      | 0.901±0.06 | 0.913±0.08

TABLE 4. The prediction accuracy of streaming feature selection on UCI datasets.
Dataset    | BP-SFS     | OSFS       | Fast-OSFS  | Alpha-investing | SAOLA      | Grafting
IONOSPHERE | 0.878±0.07 | 0.851      | 0.856±0.04 | 0.902           | 0.862±0.04 | -
SPAMBASE   | 0.923±0.02 | 0.834±0.05 | 0.833±0.05 | 0.900           | 0.801±0.06 | 0.897
WDBC       | 0.966±0.04 | 0.880      | 0.879±0.04 | 0.947±0.03      | 0.860      | 0.952±0.03
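The role of the adaptive regularizer can be checked numerically: the gradient of the Bayesian log-prior penalty p·log Σ_i |w_i| is (p / Σ_i |w_i|)·sign(w_j), i.e. an L1 penalty whose weight γ adapts to the current coefficients. A small sanity check of our own (names and values are illustrative):

```python
import numpy as np

# Numerical check (ours): the Bayesian penalty  p * log(sum_i |w_i|)
# has partial derivative  (p / sum_i |w_i|) * sign(w_j),  so it acts
# like an L1 penalty with adaptive weight  gamma = p / sum_i |w_i|.

def bayes_penalty(w):
    p = np.count_nonzero(w)              # number of active parameters
    return p * np.log(np.sum(np.abs(w)))

w = np.array([0.8, -0.3, 1.4, -2.1])
p = np.count_nonzero(w)
gamma = p / np.sum(np.abs(w))            # the pseudo regularization parameter

# central finite differences vs. gamma * sign(w_j)
eps = 1e-7
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    num = (bayes_penalty(w + e) - bayes_penalty(w - e)) / (2 * eps)
    assert abs(num - gamma * np.sign(w[j])) < 1e-5

print(gamma)                             # 4 / 4.6, roughly 0.87
```

This is why the streaming method can refresh γ from the currently selected coefficients instead of tuning a global λ over the whole (possibly infinite) feature space.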
Equation (1) can then be revised as:

min_w L(D, w) + p log Σ_i |w_i|.    (7)

However, this requires an estimate of the number p of non-zeros in Equation (7) (p is the number of selected features). Differentiating the original and modified training criteria, we have:

γ = p / Σ_i |w_i|.    (8)

From a gradient descent perspective, minimizing (1) becomes equivalent to minimizing (7), and the pseudo regularization parameter γ can be continuously updated based on Equation (8) during learning. In the streaming feature selection setting, we adopt this method, modify the Grafting algorithm to estimate the regularization parameter γ based on the values of the selected features, and evaluate a new feature according to:

|∂L(D, w)/∂w_j| < γ.    (9)

Thus, we can formally define rule 2 (R2): if the feature passes the gradient test (9), then this feature is rejected.

Besides defining rule 1 and rule 2, we also need to define two important sets: the Possible set and the Active set. In particular, if a feature is rejected by rule 1, it is discarded. If the feature meets rule 1 but passes the gradient test (9), it is included in the Possible set. Otherwise, it is included in the Active set. In order to make the optimization result of online feature selection consistent with that of batch learning, we must ensure that any time we add a feature to the model (Active set), we also go back and reapply the gradient test (9) to all features in the Possible set. The complete procedure is given as Algorithm 1.

Algorithm 1 Bayesian Penalized Model for Streaming Feature Selection
0: Initialize: define the Possible set P = {} and the Active set A = {}.
0: Repeat:
0:   Get the new feature X_j from the feature stream.
     ▪ Test this feature with rule 1 (R1): if the feature X_j is irrelevant to the response variable (|ρ(y, X_j)| < c), then X_j is discarded.
     ▪ If the new feature X_j satisfies |ρ(y, X_j)| ≥ c, test it with rule 2 (R2): if it passes the gradient test (9), this feature is included in the Possible set P.
     ▪ If the new feature meets rule 1 (R1) and is not rejected by rule 2 (R2), then it is included in the Active set A, and the model is optimized with respect to the parameter of this feature. In addition, we go back, reapply the gradient test (9) to all features in the Possible set, and update the pseudo regularization parameter γ.
0: Until the feature stream ends.

TABLE 5. Number of selected features of streaming feature selection on high-dimensional microarray datasets.
Dataset  | BP-SFS | OSFS | Fast-OSFS | Alpha-investing | SAOLA | Grafting
Leukemia | 10.1   | 1.6  | 2.2       | 1.1             | 12.2  | 16.9
Colon    | 10.3   | 1.3  | 2.1       | 1.5             | 10.2  | 11.5
Prostate | 13.5   | 1.1  | 1.7       | 1.7             | 6.6   | 10.9
DLBCL    | 13.2   | 1.4  | 2.8       | 4.7             | 10.2  | 15.5
Lung     | 24.1   | 3.8  | 4.5       | 11.3            | 21.6  | 33.2

TABLE 6. Number of selected features of streaming feature selection on UCI datasets.
Dataset    | BP-SFS | OSFS | Fast-OSFS | Alpha-investing | SAOLA | Grafting
IONOSPHERE | 18.7   | 2.9  | 3.2       | 4.9             | 3.1   | 22.1
SPAMBASE   | 33.1   | 8.6  | 8.5       | 3.7             | 6.8   | 27.9
WDBC       | 11.1   | 2.8  | 2.7       | 10.9            | 2.9   | 16.9

TABLE 7. The comparison of our method and the standard Bayesian penalized model (sBP).
Dataset    | BP-SFS     | sBP
IONOSPHERE | 0.878±0.07 | 0.866±0.07
SPAMBASE   | 0.923±0.02 | 0.931±0.03
WDBC       | 0.966±0.04 | 0.966±0.04
Leukemia   | 0.931±0.05 | 0.933±0.04
Colon      | 0.824±0.09 | 0.815±0.09
Prostate   | 0.906±0.04 | 0.910±0.04
DLBCL      | 0.859±0.03 | 0.859±0.03
Lung       | 0.915±0.03 | 0.916±0.03

IV. EXPERIMENTS
A. EXPERIMENTAL SETUP
In this section, we compare the performance of the proposed Bayesian Penalized Streaming Feature Selection (BP-SFS) method and several state-of-the-art streaming feature selection methods:
• Alpha-investing [2]: this approach includes a new feature based on a p-value.
• OSFS [3]: this method uses the relevance and redundancy analysis framework to select the optimal subset from streaming features.
• Fast-OSFS [4]: Fast-OSFS improves the computational efficiency of OSFS.
• SAOLA [22]: SAOLA applies an online pairwise comparison technique and maintains a sparse model in an online manner. In addition, this method is able to handle extremely high-dimensional data.
• Grafting [1]: this model performs streaming feature selection based on an L1 regularized framework.

We evaluate the performance of the streaming feature selection methods on high-dimensional microarray datasets. Table 1 shows the details of the high-dimensional microarray datasets [24] used in our experimental studies. In addition, we also evaluate the performance of the streaming feature selection methods on three UCI datasets: IONOSPHERE, SPAMBASE, and WDBC (the details of the datasets can be found in Table 2).

In these experiments, we randomly divide each dataset into a training set with 60% of the whole data and use the remaining samples as a test set. The experiments on each dataset are repeated 20 times. As suggested in [4], for the Alpha-investing, OSFS, Fast-OSFS, and SAOLA methods, the selected features are used to construct a K-NN classifier, and we report the average prediction accuracy. Note that the parameter of Grafting was selected by cross-validation, and Alpha-investing uses its default settings [4].

TABLE 8. The runtime of our method and the standard Bayesian penalized model (sBP).
Dataset    | BP-SFS | sBP
IONOSPHERE | 4.1s   | 5.5s
SPAMBASE   | 10.9s  | 11.1s
WDBC       | 4.3s   | 5.2s
Leukemia   | 29.7s  | 35.6s
Colon      | 13.5s  | 22.7s
Prostate   | 43.1s  | 44.5s
DLBCL      | 50.2s  | 53.9s
Lung       | 147.1s | 221.3s

B. EXPERIMENTAL RESULTS
The prediction accuracy results on the high-dimensional microarray and UCI datasets can be found in Table 3 and Table 4. From the tables, we can see that the proposed BP-SFS outperforms the other five streaming feature selection methods in terms of prediction accuracy.
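For concreteness, the streaming procedure of Section III can be sketched end to end. This is our illustrative toy implementation, not the authors' code: rule R1 is a correlation screen, rule R2 is the gradient test against the adaptive γ = p / Σ_i |w_i|, the loss is logistic, and the active set is refit by plain gradient descent; all names, thresholds, and the synthetic stream are assumptions.

```python
import numpy as np

def fit_logistic(Xa, y, iters=300, lr=0.5):
    """Gradient descent on the mean logistic loss over the active features."""
    w = np.zeros(Xa.shape[1])
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(y * (Xa @ w)))   # sigmoid(-margin)
        w += lr * (Xa.T @ (y * s)) / len(y)      # descent step on the loss
    return w

def bp_sfs(feature_stream, y, c=0.1, gamma0=1.0):
    """Toy streaming selection loop; returns indices of the Active set."""
    active, possible, cols = [], [], {}
    w = np.zeros(0)
    gamma = gamma0
    for j, xj in feature_stream:
        cols[j] = xj
        if abs(np.corrcoef(xj, y)[0, 1]) < c:    # R1: discard irrelevant
            continue
        if active:
            Xa = np.column_stack([cols[k] for k in active])
            s = 1.0 / (1.0 + np.exp(y * (Xa @ w)))
        else:
            s = np.full(len(y), 0.5)
        if abs(xj @ (y * s)) < gamma:            # R2: park in the Possible set
            possible.append(j)
            continue
        active.append(j)                         # admit, refit, update gamma
        Xa = np.column_stack([cols[k] for k in active])
        w = fit_logistic(Xa, y)
        gamma = len(active) / max(np.sum(np.abs(w)), 1e-12)
        s = 1.0 / (1.0 + np.exp(y * (Xa @ w)))   # re-test parked features
        promoted = [k for k in possible if abs(cols[k] @ (y * s)) >= gamma]
        if promoted:
            possible = [k for k in possible if k not in promoted]
            active += promoted
            Xa = np.column_stack([cols[k] for k in active])
            w = fit_logistic(Xa, y)
            gamma = len(active) / max(np.sum(np.abs(w)), 1e-12)
    return active

# toy stream: features 0 and 1 carry the signal, the rest are noise
rng = np.random.default_rng(3)
n, p = 400, 8
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n))
selected = bp_sfs(((j, X[:, j]) for j in range(p)), y)
print(sorted(selected))
```

Note how γ is recomputed from the current coefficients after every admission, so no global regularization parameter is ever needed, which is the point of the Bayesian modification to Grafting.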