Scalable and Accurate Online Feature Selection for Big Data
KUI YU, Simon Fraser University
XINDONG WU, Hefei University of Technology and University of Vermont
WEI DING, University of Massachusetts Boston
JIAN PEI, Simon Fraser University
Feature selection is important in many big data applications. Two critical challenges are closely associated with big data. First, in many big data applications, the dimensionality is extremely high, in millions, and keeps growing. Second, big data applications call for highly scalable feature selection algorithms in an online manner such that each feature can be processed in a sequential scan. We present SAOLA, a Scalable and Accurate OnLine Approach for feature selection, in this paper. With a theoretical analysis on bounds of the pairwise correlations between features, SAOLA employs novel pairwise comparison techniques and maintains a parsimonious model over time in an online manner. Furthermore, to deal with incoming features that arrive by groups, we extend the SAOLA algorithm and propose a new group-SAOLA algorithm for online group feature selection. The group-SAOLA algorithm can online maintain a set of feature groups that is sparse at the levels of both groups and individual features simultaneously. An empirical study using a series of benchmark real data sets shows that our two algorithms, SAOLA and group-SAOLA, are scalable on data sets of extremely high dimensionality and have superior performance over the state-of-the-art feature selection methods.
Additional Key Words and Phrases: Online feature selection, extremely high dimensionality, group features, big data
1. INTRODUCTION
In data mining and machine learning, the task of feature selection is to choose a subset of relevant features and remove irrelevant and redundant features from high-dimensional data towards maintaining a parsimonious model [Guyon and Elisseeff 2003; Liu and Yu 2005; Xiao et al. 2015]. Nowadays, many big data applications, such as social media services, high-resolution images, genomic data analysis, and document data analysis, consume data of extremely high dimensionality, in the order of millions or more [Wu et al. 2014; Zhai et al. 2014; Chen et al. 2014; Yu et al. 2015a; Yu et al. 2015b]. For example, the Web Spam Corpus 2011 [Wang et al. 2012] collected approximately 16 million features (attributes) for web spam page detection, and the data set from KDD CUP 2010 about using educational data mining to accurately predict student performance includes more than 29 million features. The scalability of feature selection methods becomes critical to tackle millions of features [Zhai et al. 2014].
Moreover, in many applications, feature selection has to be conducted in an online manner. For example, in SINA Weibo, hot topics in Weibo keep changing daily. When a novel hot topic appears, it may come with a set of new keywords (a.k.a. a set of features), and then some of the new keywords may serve as key features to identify new hot topics. Another example is feature selection in bioinformatics, where acquiring a complete set of features for every training instance is expensive due to the high cost of conducting wet lab experiments [Wang et al. 2013]. When it is impossible to wait for a complete set of features, it is practical to conduct feature selection from the features available so far and consume new features in an online manner as they become available.
To search for a minimal subset of features that leads to the most accurate prediction model, two types of feature selection approaches were proposed in the literature, namely batch methods [Brown et al. 2012; Woznica et al. 2012; Javed et al. 2014] and online methods [Wu et al. 2013; Wang et al. 2013]. A batch method requires loading the entire training data set into memory. This is obviously not scalable when handling large-scale data sets that exceed memory capacity. Moreover, a batch method has to access the full feature set prior to the learning task [Wu et al. 2013; Wang et al. 2013].
Online feature selection has two major approaches. One assumes that the number of features on training data is fixed while the number of data points changes over time, such as the OFS algorithm [Hoi et al. 2012; Wang et al. 2013] that performs feature selection upon each data instance. Different from OFS, the other online method assumes that the number of data instances is fixed while the number of features changes over time, such as the Fast-OSFS [Wu et al. 2013] and alpha-investing algorithms [Zhou et al. 2006]. This approach maintains a best feature subset from the features seen so far by processing each feature upon its arrival. Wang et al. [Wang et al. 2013] further proposed the OGFS (Online Group Feature Selection) algorithm by assuming that feature groups are processed in a sequential scan. It is still an open research problem to efficiently reduce computational cost when the dimensionality is in the scale of millions or more [Wu et al. 2013; Zhai et al. 2014].
In this paper, we propose online feature selection to tackle extremely high-dimensional data for big data analytics. Our contributions are as follows.
First, we conduct a theoretical analysis to derive a lower bound on pairwise correlations between features to effectively and efficiently filter out redundant features. With this theoretical analysis, we develop SAOLA, a Scalable and Accurate OnLine Approach for feature selection. The SAOLA algorithm employs novel online pairwise comparisons to maintain a parsimonious model over time, and we analyze the upper bound of the gap of information gain between selected features and optimal features. Second, to deal with new features that arrive by groups, we extend the SAOLA algorithm and propose the group-SAOLA algorithm. The group-SAOLA algorithm can online yield a set of feature groups that is sparse between groups as well as within each group for maximizing its predictive performance for classification.
Third, an extensive empirical study using a series of benchmark data sets illustrates that our two methods, SAOLA and group-SAOLA, are scalable on data sets of extremely high dimensionality and have superior performance over the state-of-the-art online feature selection methods.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 proposes our SAOLA algorithm, and Section 4 presents the group-SAOLA algorithm. Section 5 reports our experimental results. Finally, Section 6 concludes the paper and discusses our future work.
2. RELATED WORK
Given a set of input features on a training data set, the problem of feature selection is to choose a subset of relevant features for the construction of accurate prediction models. There are two types of feature selection approaches proposed in the literature, namely the batch methods and the online methods.
Standard batch methods can be broadly classified into three categories: filter, wrapper, and embedded methods. A wrapper method performs a forward or backward strategy in the space of all possible feature subsets, using a classifier of choice to evaluate each subset. Although this method has high accuracy, the exponential number of possible subsets makes the method computationally expensive in general [Kohavi and John 1997]. The embedded methods attempt to simultaneously maximize classification performance and minimize the number of features used, based on a classification or regression model with specific penalties on coefficients of features [Tibshirani 1996; Weston et al. 2000; Zhou et al. 2011].
A filter method is independent of any classifiers and applies evaluation measures such as distance, information, dependency, or consistency to select features [Dash and Liu 2003; Forman 2003; Peng et al. 2005; Song et al. 2012; Liu et al. 2014]. The filter methods then build a classifier using the selected features. Due to their simplicity and low computational cost, many filter methods have been proposed to solve the feature selection problem, such as the well-established mRMR (minimal-Redundancy-Maximal-Relevance) algorithm [Peng et al. 2005] and the FCBF (Fast Correlation-Based Filter) algorithm [Yu and Liu 2004]. Recently, Zhao et al. [Zhao et al. 2013] proposed a novel framework to consolidate different criteria to handle feature redundancies. To bring almost two decades of research on heuristic filter criteria under a single theoretical interpretation, Brown et al. [Brown et al. 2012] presented a unifying framework for information theoretic feature selection using an optimized loss function of the conditional likelihood of the training labels.
To deal with high dimensionality, Tan et al. [Tan et al. 2010; Tan et al. 2014] proposed the efficient embedded FGM (Feature Generating Machine) algorithm, and Zhai et al. [Zhai et al. 2012] further presented the GDM (Group Discovery Machine) algorithm that outperforms the FGM algorithm.
Group feature selection is also an interesting topic in batch methods, and it selects predictive feature groups rather than individual features. For instance, in image processing, each image can be represented by multiple kinds of groups, such as SIFT for shape information and Color Moment for color information. The lasso was proposed by Tibshirani [Tibshirani 1996] for shrinkage and variable selection, which minimizes the sum of squared errors with the L1 penalty on the sum of the absolute values of the coefficients of features. Based on the lasso method, Yuan and Lin [Yuan and Lin 2006] proposed a group lasso method to select grouped variables for accurate prediction in regression. Later, the sparse group lasso criterion, an extension of the group lasso method, was proposed by Friedman et al. [Friedman et al. 2010], which encourages sparsity at the levels of both features and groups simultaneously.
It is difficult for the standard batch method to operate on high-dimensional data analytics that calls for dynamic feature selection, because the method has to access all features before feature selection can start.
Online feature selection has two research lines. One assumes that the number of features on training data is fixed while the number of data points changes over time [Hoi et al. 2012]. Recently, Wang et al. [Wang et al. 2013] proposed an online feature selection method, OFS, which assumes data instances are sequentially added.
Different from OFS, the other online approach assumes that the number of data instances is fixed while the number of features changes over time. Perkins and Theiler [Perkins and Theiler 2003] proposed the Grafting algorithm based on a stagewise gradient descent approach. Grafting treats the selection of suitable features as an integral part of learning a predictor in a regularized learning framework, and it operates in an incremental iterative fashion, gradually building up a feature set while training a predictor model using gradient descent. Zhou et al. [Zhou et al. 2006] presented Alpha-investing, which sequentially considers new features as additions to a predictive model by modeling the candidate feature set as a dynamically generated stream. However, Alpha-investing requires prior information about the original feature set and never evaluates the redundancy among the selected features.
Table I. Summary of mathematical notations
Notation        | Mathematical meaning
D               | the training data set
F               | the input feature set on D
C               | the class attribute
numP            | the number of features
N               | the number of data instances
L               | the conditional likelihood
l               | the conditional log-likelihood
G               | the set of feature groups
G_t             | the set of all feature groups available till time t
G_i             | a group of features
Ψ_t             | the set of selected groups at time t
S, M, S' ⊆ F    | feature subsets within F
Y, Z, X         | single features (Y, Z, X ∈ F)
y, z, x, c      | an assignment of values of Y, Z, X, and C
d_i             | a numP-dimensional data instance
s               | an assignment of a set of values of S
f_i             | an N-dimensional feature vector
S_t             | the feature subset selected at time t
|S_t|           | the size of S_t
P(C|S)          | the posterior probability of C conditioned on S
δ               | relevance threshold
k               | the limit on the maximum subset size
α               | the p-value (significance level) for Fisher's Z-test
To tackle these drawbacks, Wu et al. [Wu et al. 2010; Wu et al. 2013] presented the OSFS (Online Streaming Feature Selection) algorithm and its faster version, the Fast-OSFS algorithm. To handle online feature selection with grouped features, Wang et al. [Wang et al. 2013] proposed the OGFS (Online Group Feature Selection) algorithm. However, the computational cost inherent in those three algorithms may still be very expensive or prohibitive when the dimensionality is extremely high, in the scale of millions or more.
The big data challenges on efficient online processing and scalability motivate us to develop a scalable and online processing method to deal with data of extremely high dimensionality.
3. THE SAOLA ALGORITHM FOR ONLINE FEATURE SELECTION
3.1. Problem Definition
In general, a training data set D is defined by D = {(d_i, c_i), 1 ≤ i ≤ N}, where N is the number of data instances, d_i is a multidimensional vector that contains numP features, and c_i is a class label within the vector of the class attribute C. The feature set F on D is defined by F = {F_1, F_2, ..., F_numP}. The problem of feature selection on D is to select a subset of relevant features from F to maximize the performance of prediction models. The features in F are categorized into four disjoint groups, namely strongly relevant, redundant, non-redundant, and irrelevant features [Kohavi and John 1997; Yu and Liu 2004], and the goal of feature selection is to remove redundant or irrelevant features from F while keeping strongly relevant or non-redundant features. The mathematical notations used in this paper are summarized in Table I.
Definition 3.1 (Irrelevant Feature) [Kohavi and John 1997]. F_i is an irrelevant feature to C if and only if, for every S ⊆ F \ {F_i} and for every f_i, c, and s for which P(S = s, F_i = f_i) > 0, P(C = c | S = s, F_i = f_i) = P(C = c | S = s).
Definition 3.2 (Markov Blanket [Koller and Sahami 1995]). A Markov blanket of feature F_i, denoted as M ⊆ F \ {F_i}, makes all other features independent of F_i given M, that is, for every Y ∈ F \ (M ∪ {F_i}), P(F_i | M, Y) = P(F_i | M).
By Definition 3.2 a redundant feature is defined by [Yu and Liu 2004] as follows.
Definition 3.3 (Redundant Feature). A feature F_i ∈ F is a redundant feature, and hence should be removed from F, if it has a Markov blanket within F.
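To make Definitions 3.2 and 3.3 concrete, the following is a small, self-contained sketch (not taken from the original paper, and anticipating the mutual information measure introduced in Section 3.2). It constructs a feature F2 that is an exact copy of F1 and checks empirically that F2 carries no extra information about the class C once F1 is known, so {F1} acts as a Markov blanket of F2 and F2 is redundant. The variable names, sample size, and noise level are arbitrary choices for this illustration.

```python
import random
from collections import Counter
from math import log

random.seed(0)
n = 2000
f1 = [random.randint(0, 1) for _ in range(n)]
f2 = list(f1)                                              # F2 is an exact copy of F1
c = [x if random.random() < 0.9 else 1 - x for x in f1]    # class C is a noisy copy of F1

def H(v):
    """Plug-in entropy estimate of a discrete sequence, in bits."""
    m = len(v)
    return -sum((k / m) * log(k / m, 2) for k in Counter(v).values())

def mi(a, b):
    """Empirical mutual information I(A; B) = H(A) + H(B) - H(A, B)."""
    return H(a) + H(b) - H(list(zip(a, b)))

def cmi(a, b, z):
    """Empirical conditional mutual information I(A; B | Z)."""
    return H(list(zip(a, z))) + H(list(zip(b, z))) - H(list(zip(a, b, z))) - H(z)

print(f"I(C; F2)      = {mi(c, f2):.3f} bits   (F2 alone is predictive of C)")
print(f"I(C; F2 | F1) = {max(cmi(c, f2, f1), 0.0):.3f} bits   (F2 adds nothing once F1 is known)")
```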
We also denote D by D = {F_i, C}, 1 ≤ i ≤ numP, which is a sequence of features presented in a sequential order, where F_i = {f_1, f_2, ..., f_N} denotes the i-th feature vector containing N data values. If D is processed in a sequential scan, that is, one dimension at a time, we can process high-dimensional data not only with limited memory but also without requiring its complete set of features to be available. The challenge is, as we process one dimension at a time, how to online maintain a minimum-size feature subset S_t at any time t that maximizes its predictive performance for classification. Assuming F_{t-1} ⊆ F is the feature set containing all features available till time t-1, S_{t-1} ⊆ F_{t-1} represents the selected feature set at time t-1, and F_i is a new incoming feature at time t, our problem can be formulated as follows:
S_t = arg min_{S ⊆ S_{t-1} ∪ {F_i}} { |S| : P(C | S) = max_{S' ⊆ S_{t-1} ∪ {F_i}} P(C | S') }.    (1)
We can further decompose it into the following key steps:
Determine the relevance of F_i to C. First, we determine whether Eq.(2) holds or not.
P(C | F_i) = P(C).    (2)
If so, F_i cannot add any additional discriminative power with respect to C, and thus F_i should be discarded. Hence, Eq.(2) helps us identify a new feature that either is completely useless by itself with respect to C or needs to be combined with other features. If not, we further evaluate whether F_i carries additional predictive information to C given S_{t-1}, that is, whether Eq.(3) holds. If Eq.(3) holds, F_i has a Markov blanket in S_{t-1}, and thus F_i should be discarded.
∃ M ⊆ S_{t-1} such that P(C | M, F_i) = P(C | M).    (3)
Calculate S_t with the inclusion of F_i. Once F_i is added to S_{t-1} at time t, i.e., S_t = S_{t-1} ∪ {F_i}, we then solve Eq.(4) to prune S_t to satisfy Eq.(1).
S_t^* = arg min_{S ⊆ S_t} { |S| : P(C | S) = P(C | S_t) }.    (4)
Accordingly, solving Eq.(1) is decomposed into sequentially solving Eq.(2) to Eq.(4) at each time point. Essentially, Eq.(3) and Eq.(4) deal with the problem of feature redundancy.
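To make this decomposition concrete, here is a minimal, self-contained Python sketch of the online decision flow. It is an illustration rather than the authors' implementation: it anticipates the mutual information measure of Section 3.2, approximates the conditional checks of Eq.(3) and Eq.(4) with pairwise comparisons, and the threshold delta, the function names, and the exact pruning rule are assumptions of this sketch, not the precise SAOLA procedure.

```python
from collections import Counter
from math import log

def entropy(v):
    """Plug-in entropy estimate of a discrete sequence, in bits (cf. Eq.(6))."""
    n = len(v)
    return -sum((k / n) * log(k / n, 2) for k in Counter(v).values())

def mutual_info(a, b):
    """Empirical mutual information I(A; B) (cf. Eq.(5))."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def online_step(selected, name, f_new, c, delta=0.0):
    """Process one incoming feature f_new against the currently selected set.

    `selected` maps a feature name to (values, relevance to the class c).
    Returns the updated selected set after Eq.(2)- to Eq.(4)-style checks."""
    rel_new = mutual_info(f_new, c)
    if rel_new <= delta:                          # Eq.(2)-style check: useless by itself
        return selected

    survivors = {}
    for old_name, (f_old, rel_old) in selected.items():
        redundancy = mutual_info(f_new, f_old)
        # Eq.(3)-style check: f_new is made redundant by an already selected feature
        if redundancy >= rel_new and rel_old >= rel_new:
            return selected                       # discard f_new, keep the current set
        # Eq.(4)-style pruning: an old feature is made redundant by f_new
        if redundancy >= rel_old and rel_new > rel_old:
            continue                              # drop the old feature
        survivors[old_name] = (f_old, rel_old)

    survivors[name] = (f_new, rel_new)
    return survivors

# Toy usage: stream three hypothetical features one at a time.
c = [0, 0, 1, 1, 1, 0, 1, 0]
stream = {"F1": [0, 0, 1, 1, 1, 0, 1, 0],   # identical to the class
          "F2": [0, 0, 1, 1, 1, 0, 1, 1],   # noisy copy of F1
          "F3": [0, 1, 0, 1, 0, 0, 1, 1]}   # empirically independent of the class
selected = {}
for name, values in stream.items():
    selected = online_step(selected, name, values, c)
print(sorted(selected))   # ['F1']: F2 is redundant given F1, F3 is irrelevant
```

Because only pairwise quantities are kept, each incoming feature is compared against the currently selected features rather than against all subsets, which is what makes this style of processing feasible when features arrive one at a time.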
3.2. Using Mutual Information to Solve Eq.(1)
To solve Eq.(1), we employ mutual information to calculate correlations between features. Given two features Y and Z, the mutual information between Y and Z is defined as follows.
I(Y; Z) = Σ_{y ∈ Y} Σ_{z ∈ Z} P(y, z) log ( P(y, z) / (P(y) P(z)) ).    (5)
The entropy of feature Y is defined as
H(Y) = - Σ_{y ∈ Y} P(y) log P(y).    (6)
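As a quick numeric illustration of Eq.(5) and Eq.(6), the following self-contained snippet computes plug-in (empirical frequency) estimates of H(Y) and I(Y; Z) for two small, made-up discrete vectors; the data and variable names are only for illustration.

```python
from collections import Counter
from math import log

y = [0, 0, 1, 1, 1, 0, 1, 0]   # discrete feature Y
z = [0, 0, 1, 1, 0, 0, 1, 1]   # discrete feature Z

def H(v):
    """Entropy of Eq.(6), estimated from empirical frequencies, in bits."""
    n = len(v)
    return -sum((k / n) * log(k / n, 2) for k in Counter(v).values())

# Eq.(5) computed via the identity I(Y; Z) = H(Y) + H(Z) - H(Y, Z)
i_yz = H(y) + H(z) - H(list(zip(y, z)))
print(f"H(Y) = {H(y):.3f} bits, I(Y; Z) = {i_yz:.3f} bits")
```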