I Introduction
Research on artificial feedforward neural networks (FNNs) has become increasingly active and popular, as FNNs are among the most powerful tools in the field of artificial intelligence. Moreover, they have been widely studied and applied in many engineering fields, such as genetic research [20], rainfall prediction [23], biological emotion analysis [18] and humanoid robot control [42]. Because an FNN has a formidable capability to approximate complex and nonlinear functions directly through training on samples, a computational model capturing the system characteristics can be built conveniently without exact knowledge of the target system. Driven by this capability, many researchers have explored the abilities of multilayer FNNs [37, 27, 17]. Leshno proved that multilayer FNNs with suitable activation functions are able to approximate a very broad class of functions, not only continuous ones [19]. In the past few years, the single-hidden-layer feedforward neural network (SLFN) model has been proposed by Huang et al. [14, 13, 11], which can learn distinct samples with zero error; to do so, an SLFN needs at most as many hidden neurons as training samples, together with the hidden-layer biases.
For traditional SLFNs, the parameters need to be tuned iteratively to ensure the optimality of the network, and frequent parameter tuning leads to a complicated relationship between the parameters of different layers. As a solution, the famous backpropagation (BP) algorithm based on gradient descent has become one of the most widely used iterative learning algorithms for SLFNs [25]. However, BP-type training algorithms have some inherent weaknesses that result in inevitably slow operation and poor performance, such as slow convergence [1] and local minima [38].
To address this problem, Huang et al. proposed and investigated a simple but effective learning algorithm for SLFNs, called the extreme learning machine (ELM). It has been observed that ELM has unparalleled advantages, such as extremely fast training speed and excellent generalization performance [14, 13, 11]. Unlike the traditional BP algorithm, ELM does not tune the input weights and the neuron biases; the input weights and hidden-layer biases can be arbitrarily assigned [12, 9].
Although ELM enjoys promising merits in application prospects, it still has several shortcomings that it cannot easily overcome. The first is that, due to the mechanism of randomly assigning weights, the output of ELM fluctuates and its performance can be unstable. The second is that the hidden-layer matrix may not have full column rank, or may even be ill-conditioned, when the input weights and biases are generated randomly, which can incur serious computational robustness problems in obtaining the optimal output weights [33].
Therefore, in order to overcome the aforementioned weaknesses of BP-type FNNs and ELM, in this paper we focus on utilizing Gegenbauer orthogonal polynomials as activation functions and constructing an orthogonal-polynomial FNN with simplified settings, such as fixed input weights and biases. Such settings have been proved effective in [41, 43]. Further, to avoid lengthy training and determine the optimal output weights directly, a method called weights direct determination (WDD), based on the pseudoinverse and used in previous works [41, 43], is first reviewed. However, WDD only focuses on empirical risk minimization, which can easily lead to an overly complicated model and a higher risk of overfitting [32, 3]. Also, the networks proposed in previous works are single-output networks, which is inconsistent with the fact that FNNs with multiple outputs are the most effective way to tackle multi-classification problems.
In this paper, a novel FNN termed the Gegenbauer neural network (GNN) is constructed and investigated based on Gegenbauer polynomial series and polynomial function approximation. Then, based on equality-constrained optimization, a regularized WDD (RWDD) is proposed with a regularization term to prevent the model from becoming too complicated. Also, based on the number of network output nodes, a unified RWDD learning mechanism for binary and multi-classification problems is given. Besides, with regard to computational scalability and efficiency, two different solutions to RWDD are given, depending on the scale of the problem and the dimensionality of the hypothesis feature space. Specifically, the main contributions of this paper are as follows.

• The main merit of the ELM algorithm is its extremely fast training speed. In this paper, the proposed GNN has a calculation complexity similar to ELM, mainly involving an inversion of a square matrix. Thus, it achieves an excellent training speed comparable to ELM.

• Compared with ELM, the proposed algorithm is stable and has lower computational risks, since the input weights are fixed and tensor products of Gegenbauer orthogonal polynomials are utilized as activation functions.

• Experiments on several real-world data sets illustrate that GNN has comparable or even better generalization performance on binary and multi-classification problems, compared with the least squares support vector machine (LSSVM) with a Gaussian kernel and ELM with a Gaussian kernel.

• Experiments show that GNN has great computational scalability and efficiency, and it still performs well even when there are only very few neurons in the network.

• Experiments on data sets contaminated with random noises demonstrate the superior classification robustness of GNN.
The remainder of the article is organized as follows. Section II presents and analyzes the theoretical basis of the GNN, and the GNN with a single output node (SOGNN) and the GNN with multiple output nodes (MOGNN) are constructed and discussed respectively. In Section III, the proposed RWDD is discussed and analyzed based on equality-constrained optimization. Section IV presents several experiments comparing LSSVM, ELM and GNN with RWDD. Final remarks are given in Section V.
II Theoretical Basis for GNN
In this section, the fundamentals of Gegenbauer Polynomials are first presented. Then, the theoretical basis of Gegenbauer polynomials series (GPS) for multivariate function approximation is briefly analyzed. Furthermore, based on the analysis, the GNN model is constructed and investigated.
II-A Fundamentals of Gegenbauer Polynomials
Definition 1. The general expression of the Gegenbauer polynomials can be defined as below [39, 36]:
(1) 
where λ is a parameter of the polynomials that satisfies λ > −1/2, n is the degree of the Gegenbauer polynomial, and Γ(·) is the Gamma function. Equivalently, Gegenbauer polynomials can also be defined by the recurrence relations below [36]:
(2) 
where C_0^{(λ)}(x) = 1 and C_1^{(λ)}(x) = 2λx.
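For concreteness, the standard three-term recurrence with these initial conditions can be evaluated numerically. The sketch below is illustrative only; the function name is our own, not from the paper:

```python
def gegenbauer(n, lam, x):
    """Evaluate the Gegenbauer polynomial C_n^{(lam)}(x) via the standard
    three-term recurrence:
        C_0 = 1,  C_1 = 2*lam*x,
        n*C_n = 2*x*(n + lam - 1)*C_{n-1} - (n + 2*lam - 2)*C_{n-2}.
    """
    if n == 0:
        return 1.0
    c_prev, c_curr = 1.0, 2.0 * lam * x
    for k in range(2, n + 1):
        c_prev, c_curr = c_curr, (2.0 * x * (k + lam - 1.0) * c_curr
                                  - (k + 2.0 * lam - 2.0) * c_prev) / k
    return c_curr
```

For sanity checking, lam = 0.5 recovers the Legendre polynomials and lam = 1 the Chebyshev polynomials of the second kind, both special cases of the Gegenbauer family.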
II-B GPS for Multivariate Function Approximation
First, a function approximation theorem, i.e., the Weierstrass approximation theorem, is presented for the further discussion.
Theorem 1.
(Weierstrass Approximation Theorem)
Let f be a continuous function on [a, b]. Then there exists a sequence of polynomials converging uniformly to f on [a, b]:
(4) 
This result can also be generalized to the multivariate case: for a real-valued function continuous on a compact hypercube, there is a sequence of polynomials converging uniformly to it:
(5) 
with uniform convergence.
Proof:
Based on the Gegenbauer polynomials, theories of function approximation (i.e., Theorem 1) and polynomial interpolation [16, 43, 40], a univariate continuous function defined on [−1, 1] can be approximated by a GPS of a given order (i.e., the total number of polynomials used):
(6)  
where each coefficient corresponds to its Gegenbauer polynomial in the order-limited GPS, and the error function depends on the order. According to approximation theories [41] and Theorem 1, the error function converges to zero when the order is large enough, which can be expressed as:
(7) 
Lemma 1.
Let f be a continuous real-valued function of several variables on the unit hypercube. Then the generalized multivariate Bernstein polynomials of f converge uniformly to f and can be structured as follows:
(8) 
where with , and is the binomial coefficient (i.e., ).
According to Lemma 1, for a target function , we have the following result:
(9) 
Then, on the basis of polynomial interpolation theory [15], we give the following theorem.
Theorem 2.
For a real-valued function of several variables continuous on its domain, it can be uniformly approximated by a GPS.
Proof:
Without loss of generality, in light of Eq. (6) and (7), the function can be approximated by a GPS as follows:
(10) 
where denotes the coefficient corresponding to .
Then, in light of the lemma given above, we can substitute Eq. (10) into Eq. (8):
(11)  
where the total number of Gegenbauer polynomials utilized in the approximation appears together with the residual error function related to the target function and the corresponding coefficients. Each basis function, termed the Gegenbauer elementary function (GEF) in this paper, is a tensor product of one-dimensional Gegenbauer polynomials, which can be expressed as:
Here, the former quantities denote the weights for the GEFs, and the latter is the total number of Gegenbauer polynomial tensor products (i.e., GEFs) utilized to approximate the target function.
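As an illustration of how GEFs are formed, the following sketch builds tensor-product Gegenbauer features for a multivariate input. The function names and the total-degree truncation rule are assumptions for illustration, not details from the paper:

```python
import itertools

def gegenbauer(n, lam, x):
    # Three-term recurrence for the Gegenbauer polynomial C_n^{(lam)}(x).
    if n == 0:
        return 1.0
    c_prev, c_curr = 1.0, 2.0 * lam * x
    for k in range(2, n + 1):
        c_prev, c_curr = c_curr, (2.0 * x * (k + lam - 1.0) * c_curr
                                  - (k + 2.0 * lam - 2.0) * c_prev) / k
    return c_curr

def gef_features(x, lam, max_degree):
    """Map an input vector x (components assumed normalized to [-1, 1]) to
    tensor-product Gegenbauer features: one feature per multi-index whose
    total degree does not exceed max_degree, each feature being a product
    of one-dimensional Gegenbauer polynomials, one factor per component."""
    feats = []
    for idx in itertools.product(range(max_degree + 1), repeat=len(x)):
        if sum(idx) <= max_degree:
            val = 1.0
            for xi, n in zip(x, idx):
                val *= gegenbauer(n, lam, xi)
            feats.append(val)
    return feats
```

Each returned feature corresponds to one hidden neuron's GEF activation in the network described next.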
II-C Network Architecture
Based on the analysis in Section II-B, the SOGNN and MOGNN are hereby constructed and investigated. The output patterns of both forms are then illustrated.
II-C1 SOGNN Model
According to the theoretical analysis above, the architecture of SOGNN is constructed. As shown in Fig. 1, the network consists of three layers: an input layer, a hidden layer and an output layer. The input is a multi-dimensional vector, and all input data should be normalized to [−1, 1]. In this model, the neurons in the input layer and the single neuron in the output layer are both chosen to be linear. Besides, the activation functions in the hidden layer are a series of GEFs, whose total number equals the number of hidden neurons. Thus, the output of SOGNN can be expressed as follows:
(13) 
where the Gegenbauer activation vector maps the input space to the hidden-layer feature space, and the connecting weights vector collects the output weights.
If the number of samples is , their outputs can be written in matrix form, i.e.,
(14) 
where and
Based on Theorem 2 (specifically, Eq. (11) and (12)), with a sufficiently large number of neurons and optimal weights, the network can best approximate the target function. Also, in this model, all the input weights are fixed, as are the threshold values (biases) of all neurons.
Notably, the aforementioned parametric settings can effectively simplify the network architecture, lower the computational complexity, make it more convenient for future implementation on hardware, and have been proved effective in previous works [41, 43].
Moreover, such parametric simplification, on the one hand, maintains the approximation capability and the optimality of the network, since the network (shown in Fig. 1) with the aforementioned parametric settings well matches the theoretical analysis in Eq. (14) and (12). On the other hand, such fixed parametric settings also improve the robustness of the model when faced with disturbances in the input. According to [21, 22], when the input weights and hidden-layer biases are randomly assigned, the robustness can be poor, since the changes in the output weights matrix can sometimes be very large when an input disturbance is encountered, which results in an increased risk for the network.
Further, with the activation functions being orthogonal elementary functions, our model prevents the hidden activation matrix from being rank-deficient or ill-conditioned, according to orthogonal polynomial theory [5]. Also, with the relevant matrix being invertible, its inverse can be computed directly. Notably, this sets the basis of the computational robustness of the proposed method in Section III-B [8].
II-C2 MOGNN Model
As shown in Fig. 2, the MOGNN model is constructed based on the SOGNN. To be more specific, an MOGNN with multiple outputs consists of a corresponding number of SOGNNs: the output of MOGNN is a vector whose th component is the output of the th SOGNN. The parametric settings are the same as those of SOGNN. The output of the th output node is given below:
(15) 
III Proposed Regularized Weights Direct Determination
In this section, a detailed explanation of the proposed RWDD for GNN is presented. First, the original WDD method is reviewed, after which its main drawbacks are discussed. Then, based on equality-constrained optimization, a method that minimizes both the empirical risk and the structural risk with better generalization ability is provided, and it is analyzed for the two forms of GNN respectively. Finally, the unified RWDD for GNN weights determination under the different cases is given.
III-A Review of the Original Weights Direct Determination
According to the previous works in [43, 41]
, for a given GNN model, the objective loss function of WDD is:
(16) 
where the target values are the labels of the samples.
With WDD, the solution of the optimal output weights can be calculated by the following equation:
(17) 
where the Moore-Penrose generalized inverse of the activation matrix, also termed the pseudoinverse, is used.
According to [2], there are several methods to calculate the Moore-Penrose generalized inverse of a matrix: the orthogonal projection method, iterative methods and singular value decomposition (SVD).
Since WDD directly calculates the least-squares solution, involving only an inversion of a matrix, the speed of calculation is guaranteed. Further, according to the previous works [43, 41], networks equipped with WDD have demonstrated outstanding performance, such as excellent approximation ability and easy implementation.
However, although the WDD method has the aforementioned merits, it still has a main drawback. As shown in Eq. (16), WDD only focuses on empirical risk minimization, and no measure is taken to prevent the model hypothesis from becoming too complicated, which can easily lead to an overfitting model [32, 3].
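A minimal sketch of the WDD computation, assuming the activation matrix H stores one sample per row (the function name is illustrative, not from the paper):

```python
import numpy as np

def wdd_weights(H, Y):
    """Original WDD in one step: the least-squares output weights are
    obtained from the Moore-Penrose pseudoinverse of the hidden-layer
    activation matrix H (rows: samples, columns: hidden neurons)."""
    return np.linalg.pinv(H) @ Y
```

Note that `np.linalg.pinv` computes the pseudoinverse via SVD, one of the methods listed above.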
III-B Equality-constrained Optimization Problems for GNN
Due to the aforementioned drawback of the original WDD method, in this paper our model seeks not only the smallest training error but also the smallest norm of the output weights. According to the famous Bartlett's theory [3], a model with good generalization ability is one that reaches the best tradeoff between the empirical risk and the structural risk. Generally, the empirical risk is the training error of the neural network, represented by the norm of the error; the structural risk is the norm of the output weights, which reflects the complexity of the model [29]. Thus, the optimization objective and the mathematical model can be formulated as below:
(18) 
According to ridge regression theory [7], adding a positive value to the diagonal of the matrix to be inverted makes the solution much more stable and tends to give better generalization ability. Also, according to statistical learning theory [22, 28], the real risk of a model is the sum of the empirical risk and the structural risk, and these two kinds of risk should be considered together. Since the proposed GNN has two forms (i.e., SOGNN and MOGNN), the solution for each form is discussed respectively in the following. It can then be seen that the solution for SOGNN is actually a specific case of that for MOGNN; thus, the two solutions are finally unified, which leads to the final solution of RWDD.
III-B1 SOGNN for Classification
Based on the analysis in Section II, the SOGNN can approximate any continuous target function on the variable space. For classification problems, the expected output of SOGNN can be made close enough to the target class labels in the corresponding regions. Thus, the objective function of the proposed RWDD for SOGNN can be described as below:
(19) 
where the weight factor serves as the regularizer. By tuning its value, the proportion of the empirical risk and the structural risk can be adjusted to reach the best tradeoff. Based on the KKT theorem [30], solving the RWDD optimization problem for SOGNN is equivalent to solving the following Lagrangian equation:
(20) 
where the Lagrangian multiplier vector corresponds to the equality constraints of Eq. (19). By setting the gradients of the Lagrangian with respect to the output weights, the errors and the multipliers equal to zero respectively, we can obtain the following KKT optimality conditions of Eq. (20):
(21a)  
(21b)  
(21c) 
III-B2 MOGNN for Classification
According to Section II-C, the MOGNN is actually a generalized form of the SOGNN. For classification problems, the number of output nodes of MOGNN should equal the number of classes, and when the original class label of a sample is the th class, the expected output of MOGNN should be a vector whose th component is close to +1 while all others are close to −1. Thus, the objective function of the proposed RWDD for MOGNN can be described as below:
(22) 
where the target vector corresponds to each sample, the error vector collects the per-output errors, the output weights form a matrix whose th column is the weight vector of the th output node, and the Frobenius norm is used.
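This one-of-K target scheme can be sketched as follows; the function name is illustrative, and the +1/−1 coding follows the labeling convention stated above:

```python
import numpy as np

def encode_targets(labels, n_classes):
    """Map integer class labels (0 .. n_classes-1) to MOGNN target vectors:
    the component of the correct class is +1, all other components are -1."""
    T = -np.ones((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T
```

Each row of the resulting matrix is the expected MOGNN output for one training sample.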
Similar to SOGNN, the Lagrangian equation for RWDD optimization for MOGNN is:
(23) 
where the Lagrangian multiplier matrix corresponds to the equality constraints. By setting the partial derivatives of Eq. (23) equal to zero, we can also obtain the KKT optimality conditions of Eq. (23) as follows:
(24a)  
(24b)  
(24c) 
It can be seen from Eq. (21) and Eq. (24) that solving the RWDD optimization for SOGNN is actually a particular case of solving that for MOGNN when the number of output nodes is fixed to one. To be more specific, some of the matrices in MOGNN reduce to vectors in SOGNN. Thus, the weight solutions for SOGNN and MOGNN can be analyzed together. It is worth mentioning that the size of the activation matrix remains the same in both cases, as it depends only on the number of training samples and the number of neurons.
III-C Proposed Regularized Weights Direct Determination
Based on the aforementioned KKT optimality conditions for SOGNN and MOGNN, a unified solution for the output weights of SOGNN and MOGNN can be obtained. However, due to efficiency concerns, different solutions to the above KKT optimality conditions can be derived based on the size of the training data set.
III-C1 Small Training Data Sets
When the number of training samples is not huge (more specifically, less than the number of hidden neurons), by substituting Eq. (21a) and (24a), and Eq. (21b) and (24b) into Eq. (21c) and (24c) respectively, we can get:
(25) 
where the identity matrix has the same size as the matrix being inverted. By substituting Eq. (25) into Eq. (21a) and (24a), and eliminating the Lagrangian multipliers, the following equation can be obtained:
(26) 
When the number of training samples is not large, the calculation of RWDD mainly involves an inversion of a matrix whose size is determined by the number of samples. It is worth mentioning that for SOGNN the targets and weights are column vectors, while for MOGNN they become matrices.
III-C2 Huge Training Data Sets
When the number of training samples is huge, i.e., greater than the number of hidden neurons, by substituting Eq. (21c) and (24c) into Eq. (21b) and (24b) respectively, an explicit expression can be obtained as follows:
(27) 
Based on this, we can obtain the following solution for the weights by substituting Eq. (27) into Eq. (24a):
(28) 
where the identity matrix has the same size as the matrix being inverted. Thus, the calculation mainly involves an inversion of a matrix whose size is determined by the number of hidden neurons, and the speed hinges only on the number of neurons when the number of samples grows large.
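The two branches can be sketched together as follows. The matrix identities are the standard regularized least-squares forms and are assumed to match the paper's two cases; the function name and the branch threshold are illustrative:

```python
import numpy as np

def rwdd_weights(H, T, C):
    """Sketch of the two assumed RWDD solution branches (standard ridge
    identities), where H is the N-by-L activation matrix, T the targets
    and C the regularization factor:
      - few samples (N <= L):  W = H^T (I_N / C + H H^T)^{-1} T
      - many samples (N > L):  W = (I_L / C + H^T H)^{-1} H^T T
    Both yield the same solution; the branch is chosen so that the
    inverted matrix is the smaller of the two."""
    N, L = H.shape
    if N <= L:
        return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
```

Since the two forms are algebraically identical, picking the branch by the smaller dimension changes only the cost of the inversion, not the result.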
Based on the aforementioned discussion under two circumstances and Eq. (14), the final network output equation is:
(29) 
As shown in Eq. (29), the calculation of RWDD always involves only an inversion of a matrix of relatively small size; thus the rapidity of the RWDD calculation is guaranteed.
TABLE I: Specifications and hyperparameter selections for the binary classification data sets
Data Sets  Training  Testing  Classes  Features  LSSVM  ELM  GNN  

Australian  484  206  2  6  1500  1500  0.05  
Banana  1591  3709  2  2  1500  1500  0.05  
Diabetes  537  231  2  8  1500  1500  0.05  
Liver  241  104  2  6  1500  1500  0.05  
Ionosphere  245  106  2  33  1500  1500  0.05 
TABLE II: Specifications and hyperparameter selections for the multi-classification data sets
Data Sets  Training  Testing  Classes  Features  LSSVM  ELM  GNN  
Iris  105  45  3  3  1500  1000  0.05  
Glass  158  56  6  9  1500  1000  0.05  
Wine  126  52  3  13  1500  1000  0.05  
Ecoli  243  93  8  7  1500  1000  0.05  
Vehicle  594  252  4  18  1500  1000  0.05  
IV Experiments and Analysis
In this section, to verify the performance of GNN with the proposed RWDD, we first compare the generalization performance of GNN with different algorithms (i.e., LSSVM with a Gaussian kernel and ELM with a Gaussian kernel, two other famous algorithms based on equality-constrained optimization) on real-world benchmark binary and multi-class classification cases. To make the comparison more intuitive, the decision boundaries of the different algorithms are presented to demonstrate the classification capability of the proposed method. Then, we also compare the computational scalability of the three algorithms, and the computational efficiency of GNN and ELM. Since a model's sensitivity to hyperparameter selection is also a key characteristic of an algorithm, experiments on GNN and ELM are conducted to test the models' tolerance to hyperparameter changes. At last, we conduct an experiment in which noises of different amplitudes are added to test the robustness of the different algorithms to random noises.
The simulations in Subsections IV-B and IV-F are carried out in the MATLAB R2018b environment running on an Intel(R) Core(TM) i5-8520U 1.60 GHz CPU with 8 GB RAM, and the simulations in Subsections IV-D and IV-E are carried out in the MATLAB R2018a environment running on an Intel(R) Core(TM) i7-8700T 2.40 GHz CPU with 16 GB RAM. The codes for LSSVM and ELM are downloaded from [24] and [10] respectively.
IV-A Benchmark Data Sets and Hyperparameter Selection
To verify the classification performance of the different algorithms, 5 binary classification cases and 5 multi-classification cases have been tested in our experiments. All data sets are downloaded from the UCI Machine Learning Repository [31]. According to [9], the selection of hyperparameters greatly influences the performance of the algorithms. LSSVM with the popular Gaussian kernel mainly has two hyperparameters: the cost parameter (preventing the model from being too complex) and the kernel parameter. ELM with the Gaussian kernel has three hyperparameters: the cost parameter, the kernel parameter and the number of nodes. The proposed GNN with RWDD also has three hyperparameters: the regularizer, the parameter of the GPS and the number of neurons. In this paper, the GPS parameter is fixed, and the number of nodes for ELM and for the proposed GNN with RWDD on binary and multi-classification problems is fixed to 1000.
Specifically, for a binary problem, only one LSSVM, one ELM with a single output node and one SOGNN are trained; for a multi-classification problem with several classes, multiple binary LSSVMs need to be trained, while for ELM and GNN only one network with multiple output nodes is required. Thus, ELM and GNN are more flexible and applicable mechanisms for multi-classification.
In the experiments, for LSSVM and ELM, the hyperparameters (cost and kernel parameters) are each searched over 31 different values spanning a wide range, in a 4-fold cross-validation grid search, which results in 961 different combinations. The regularizer of GNN is also searched over 31 different values. At last, the specifications of each experiment (including the divisions of the data sets, the number of features and classes of each data set, and the hyperparameter selections for each algorithm) are reported in TABLE I and II.
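The grid construction and fold splitting can be sketched as follows. The exact parameter range is not preserved in this text, so a base-2 log-spaced grid from 2^-15 to 2^15 is an assumption for illustration, as are the function names:

```python
import itertools
import numpy as np

# Hypothetical (cost, kernel) search grid: 31 candidate values per
# hyperparameter, giving 31 * 31 = 961 combinations, each scored by
# 4-fold cross-validation.
candidates = np.logspace(-15, 15, num=31, base=2.0)
combos = list(itertools.product(candidates, candidates))

def four_fold_indices(n_samples, seed=0):
    """Shuffle sample indices and split them into 4 folds for CV."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), 4)
```

Each of the 961 combinations would be evaluated by averaging validation accuracy over the 4 folds, and the best combination retained.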
Note that for each experiment, fifty trials are conducted to ensure that the results are not accidental, and the average result is reported. Besides, all the data are reshuffled at each trial and normalized before the experiments.
IV-B Performance Comparison on Real-World Benchmark Data Sets
TABLE III and IV present the performance comparisons of LSSVM with the Gaussian kernel, ELM with the Gaussian kernel and GNN with RWDD. Note that the simulation results include the mean testing classification accuracies (Acc), the corresponding standard deviations (Std) and the average training time; the best results are highlighted in boldface.
For binary classification problems, GNN achieves generalization performance comparable to or even better than LSSVM and ELM, with comparably fast learning speed. Take the results on the Banana data set as an example.

• For the Banana data set, the best Acc is achieved by LSSVM, while the Acc of GNN is only 0.11% lower. Also, GNN demonstrates an extremely fast training speed comparable to ELM, with a difference within 0.01 s.
For multi-classification problems, the solutions achieved by GNN are better than those of LSSVM and ELM, with higher Acc, and more stable, with lower Std. To be illustrative, take the results on the Glass and Ecoli data sets as examples.

• For the Glass data set, the Acc of GNN is much higher, being 5.46% and 4.47% above the results achieved by LSSVM and ELM, respectively.

• For the Ecoli data set, the Acc of GNN is at least 2% higher than those of LSSVM and ELM, while the Std is much lower. Although the training time is a little longer, the training speed of GNN with RWDD is still very fast and acceptable.
TABLE III: Performance comparison on the binary classification data sets
Data Sets  LSSVM  ELM  GNN  
Testing  Testing  Training  Testing  Testing  Training  Testing  Testing  Training  
Acc (%)  Std (%)  Time (s)  Acc (%)  Std (%)  Time (s)  Acc (%)  Std (%)  Time (s)  
Australian  72.08  3.44  0.2462  74.69  1.42  0.2267  75.91  3.11  0.2639 
Banana  89.63  0.33  0.1241  89.04  0.46  0.1124  89.52  0.61  0.1206 
Diabetes  77.01  2.41  0.1550  77.52  2.46  0.1328  77.29  2.05  0.1421 
Liver  71.17  3.92  0.2541  74.87  1.44  0.2817  72.68  4.04  0.2370 
Ionosphere  91.58  2.74  0.2151  92.12  4.73  0.1761  91.03  2.66  0.2240 
TABLE IV: Performance comparison on the multi-classification data sets
Data Sets  LSSVM  ELM  GNN  
Testing  Testing  Training  Testing  Testing  Training  Testing  Testing  Training  
Acc (%)  Std (%)  Time (s)  Acc (%)  Std (%)  Time (s)  Acc (%)  Std (%)  Time (s)  
Iris  96.22  2.36  0.0231  96.04  2.37  0.0222  96.58  2.34  0.0257 
Glass  67.32  5.04  0.0417  68.41  4.27  0.0326  72.68  4.94  0.0421 
Wine  97.63  1.82  0.0343  98.48  4.46  0.0326  98.54  1.63  0.0457 
Ecoli  85.93  3.52  0.0344  87.48  2.91  0.0382  89.55  2.23  0.0481 
Vehicle  83.19  1.93  0.1029  83.16  1.89  0.0931  84.05  1.75  0.1266 
IV-C Comparison of Decision Boundaries
Fig. 3 shows the decision boundaries of each algorithm. It can be observed that all three algorithms classify the nonlinearly separable data set well.
As a more intuitive demonstration of GNN's classification capability, we also conduct a simple experiment showing the decision boundaries of the different algorithms on nonlinearly separable data. The experiment is conducted on a synthetic two-dimensional, two-class data set called double moon, with a pattern of two intertwining moons. Since it is a binary classification problem, the pattern vectors of Class 1 are tagged with "+1" labels and those of Class 2 with "−1" labels.
For this experiment, all the classifiers are trained on a data set with 3000 samples. The decision boundary of each algorithm is shown in Fig. 3 (a)-(c), respectively. Note that this simple experiment is set up to demonstrate the classification ability of each classifier intuitively; thus all the samples were used for training and testing accuracy rates are not given.
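Since the exact double-moon generator is not given in the text, the following is a hypothetical generator producing two interleaving half-circles with the ±1 labeling described above:

```python
import numpy as np

def double_moon(n, radius=1.0, noise=0.05, seed=0):
    """Hypothetical two-moon generator: n points on an upper half-circle
    (Class 1, label +1) and n points on a shifted, flipped lower
    half-circle (Class 2, label -1), with small Gaussian jitter."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, n)
    upper = np.c_[radius * np.cos(t), radius * np.sin(t)]
    lower = np.c_[radius * np.cos(t) + radius, -radius * np.sin(t) + 0.5]
    X = np.vstack([upper, lower]) + rng.normal(0.0, noise, (2 * n, 2))
    y = np.r_[np.ones(n), -np.ones(n)]
    return X, y
```

The jitter amplitude and the half-moon offset control how strongly the two classes intertwine.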
IV-D Computational Scalability and Efficiency
One of the biggest merits of ELM originates from its excellent computational scalability and efficiency, which means that the training speed of ELM is maintained as the scale of the problem grows and the number of hidden neurons increases. In this subsection, experiments on the Liver data set are conducted to verify the computational scalability and efficiency of GNN. Specifically, since the experiments here are designed to evaluate computational scalability and efficiency, we simply amplify the size of the Liver data set by randomly copying samples. To avoid accidental results, each data point plotted is the average value of 100 trials.
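The amplification step can be sketched as follows; the function name and the integer amplification factor are assumptions for illustration:

```python
import numpy as np

def amplify(X, y, factor, seed=0):
    """Enlarge a data set by randomly copying existing samples. This is
    used only to probe computational scalability: duplicated samples add
    no information, but they do grow the matrices to be inverted."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=factor * len(X))
    return X[idx], y[idx]
```

Timing training on the amplified sets then isolates how cost scales with the number of samples.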
IV-D1 Computational Scalability
In the simple example shown in Fig. 4, we can see that GNN has comparable (or even better) computational scalability to ELM. Specifically, according to the analysis in [9], for LSSVM the main computational cost comes from calculating the Lagrange multipliers, which is highly associated with the size of the data set. However, since ELM and GNN with RWDD use a different equation for calculating the output weights when the number of samples is large (refer to Eq. (29)), the computational cost of ELM and GNN reduces dramatically. Thus, ELM and GNN have better computational scalability than LSSVM.
IV-D2 Computational Efficiency
The results of the comparison on computational efficiency are shown in Fig. 5, where we can see that the training times consumed by both ELM and GNN do not increase remarkably as the number of neurons increases. Moreover, the training time consumed by GNN grows a bit more slowly than that of ELM, with a smoother slope.
Based on the above two experiments and analysis, it can be seen that GNN still trains extremely fast when the number of samples and the number of neurons become very large. Thus, it can be concluded that GNN has great computational scalability and efficiency with regard to the size of the training data set and the number of neurons.
IV-E Sensitivity to Hyperparameters
Axiomatically, the selection of hyperparameters is vital to the performance of classification algorithms. However, if a model is not very sensitive to changes of the hyperparameters, i.e., the best generalization performance can be achieved over a wide range of hyperparameter combinations, the model has better utility value, since it requires less user intervention in real-world implementations. Thus, in this subsection, experiments are conducted to test the sensitivity of ELM and GNN to the hyperparameters. Moreover, the generalization performance of ELM and GNN when the number of neurons is very small is also explored. To avoid accidental results, each data point is the average value of 100 trials, and ELM's Gaussian parameter and GNN's GPS parameter follow the optimal settings listed in Subsection IV-A.
IV-E1 Sensitivity to Hyperparameters
The experiments are conducted on a binary problem (Liver data set) and a multi-class problem (Vehicle data set) to show GNN's sensitivity to the combinations of its two hyperparameters (i.e., the regularizer and the number of neurons). Similarly, as a comparison, the same experiments are conducted on ELM to test its sensitivity to the combinations of two similar hyperparameters (i.e., the cost parameter and the number of nodes).
According to the results shown in Fig. 6 and Fig. 7, the best performance of GNN can be achieved over a wide range of hyperparameter combinations. In particular, the performance of GNN is not sensitive to the number of neurons, and GNN can still perform well even when there are only very few neurons in the network; thus, only the regularizer needs to be specified to achieve the best performance. However, ELM is relatively more sensitive to the combinations of its hyperparameters, and good performance is achieved only when the number of nodes is large enough.
IV-E2 Performance When the Number of Neurons Is Small
From Fig. 6 and Fig. 7, we can also see that when the neurons are few, the generalization performance of GNN remains good, while the generalization performance of ELM improves only slowly as the number of neurons grows. Thus, experiments on the same two data sets are conducted to compare the generalization performance of the two algorithms when the number of neurons is small. The basic experimental settings also follow those in Subsection IV-A, except for the hyperparameter combinations of ELM and GNN.
From the results shown in Fig. 8 and Fig. 9, it is clear that GNN still works very well, with high Accs, when the hypothesis feature space is not complicated, while for ELM the performance deteriorates greatly when it lacks enough neurons. Thus, with appropriate values of the regularizer, the approximation capability of GNN is much better than that of ELM, which needs more neurons to ensure its approximation capability. In conclusion, GNN has much better generalization performance than ELM when the network hypothesis feature space is simple (i.e., the number of neurons is small).
Through the two aforementioned experiments, we can conclude that GNN is less sensitive to changes in its hyper-parameters than ELM. What is more, it still performs excellently when the dimensionality of the hypothesis feature space is small. Thus, the proposed model has great utility value and is easy to implement.
TABLE V. Testing Accs (%) of LSSVM, ELM and GNN on the contaminated data sets, and the changes (%) in Accs relative to the noise-free results (smaller noise amplitude).

Data Sets   LSSVM Acc (%)  LSSVM Change (%)  ELM Acc (%)  ELM Change (%)  GNN Acc (%)  GNN Change (%)
Iris        94.31          1.91              95.21        0.83            96.02        0.56
Glass       63.21          5.11              63.76        4.65            68.55        4.13
Wine        93.81          2.82              96.57        1.91            97.77        0.77
Ecoli       80.42          5.51              82.75        4.73            85.41        4.14
Vehicle     79.21          3.98              79.51        3.65            81.89        2.16
TABLE VI. Testing Accs (%) of LSSVM, ELM and GNN on the contaminated data sets, and the changes (%) in Accs relative to the noise-free results (10% noise amplitude).

Data Sets   LSSVM Acc (%)  LSSVM Change (%)  ELM Acc (%)  ELM Change (%)  GNN Acc (%)  GNN Change (%)
Iris        91.87          4.35              92.46        3.58            94.59        1.99
Glass       60.19          7.13              61.72        6.69            66.77        6.23
Wine        92.73          4.90              95.94        2.54            98.21        0.33
Ecoli       78.80          7.13              80.13        7.35            84.51        5.04
Vehicle     75.27          7.92              77.37        5.79            79.25        4.80
IV-F Comparison of Classification Robustness
To investigate the classification robustness of the proposed GNN with RWDD, we conduct the following experiments on the same real-world benchmark multi-classification data sets as in Subsection IV-B. We contaminate the data sets by adding random noises of different amplitudes to the training sets. The test sets are kept noise-free, since this experiment is meant to evaluate the classification robustness of the algorithms when noises exist in the training data. All experimental settings follow those in Subsection IV-A.
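The contamination step can be sketched as follows; uniform noise scaled by each feature's range is our assumption, since the paper only specifies the noise amplitudes:

```python
import numpy as np

def contaminate(X, amplitude, seed=None):
    """Return a copy of X with additive random noise whose magnitude is
    at most `amplitude` times each feature's range (assumed noise model)."""
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0)              # per-feature range
    noise = rng.uniform(-1.0, 1.0, size=X.shape) * amplitude * span
    return X + noise

# Only the training set is contaminated; the test set stays noise-free:
# X_train_noisy = contaminate(X_train, amplitude=0.10, seed=0)
```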
TABLEs V and VI show the Accs of the three algorithms on different data sets under the two noise amplitudes, respectively. We also compare the results in TABLE IV with those in TABLEs V and VI, and calculate the changes in Accs before and after the noises are added, to show the influence of random noises on the performance of the algorithms.
The performances of all algorithms deteriorate as the random noises are added. However, the decreases in the Accs of GNN are much smaller than those of LSSVM and ELM. To be illustrative, we take the Wine data set as an example.

For the Wine data set, under the smaller noise amplitude, the decrease of GNN in Acc is only 0.77%, while those of LSSVM and ELM are 2.82% and 1.91%, respectively; when the amplitude grows to 10%, the decreases of LSSVM and ELM in Acc are 4.90% and 2.54%, while that of GNN is only 0.33%, about 1/15 and 1/8 of the decreases of LSSVM and ELM, respectively. Besides, the Acc of GNN is at least 2% higher than those of LSSVM and ELM when the amplitude of noises is 10%.
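The ratios quoted above can be checked directly from the tabulated decreases:

```python
# Decreases in Acc (%) on the Wine data set at 10% noise amplitude
# (computed from TABLE IV and TABLE VI).
drop_lssvm, drop_elm, drop_gnn = 4.90, 2.54, 0.33

# GNN's decrease is roughly 1/15 of LSSVM's and 1/8 of ELM's.
assert round(drop_lssvm / drop_gnn) == 15
assert round(drop_elm / drop_gnn) == 8
```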
Based on the above analysis, it can be concluded that the classification robustness of GNN with RWDD is higher than that of LSSVM and ELM.
V Conclusion
In this paper, to tackle the slow operation of BP-type algorithms and the problem of computational robustness, a novel GNN model with structural simplicity has been proposed and investigated, based on GPS and polynomial function approximation. Then, in order to overcome the disadvantage of easy overfitting and improve the model's generalization performance, a RWDD method, based on equality-constrained optimization, has been proposed to minimize the network's empirical risk and structural risk. For better computational scalability and efficiency, two solutions to the RWDD optimization are given, depending on the scale of the problem and the complexity of the model.
As verification, comparison experiments on various real-world benchmark data sets have shown that GNN has comparable (or better) generalization performance, especially on multi-classification problems, compared with LSSVM and ELM. Besides, the excellent computational scalability and efficiency of GNN have been confirmed in the comparison experiments. As for the hyper-parameter experiments, GNN has been observed to be less sensitive to the combinations of hyper-parameters. In particular, the generalization ability of GNN hardly relies on the number of hidden neurons, and good performances can be achieved even when the dimensionality of the hypothesis feature space is very small. Finally, the experiments on contaminated data sets also verify the classification robustness of the proposed method.
References
 [1] (1992) Improving convergence of backpropagation by handling flat-spots in the output layer. Artificial Neural Networks, pp. 1003–1009. Cited by: §I.
 [2] (1971) Generalized inverse of matrices and its applications. Cited by: §IIIA.
 [3] (1998) The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory 44 (2), pp. 525–536. Cited by: §I, §IIIA, §IIIB.
 [4] (2006) Extension of Bernstein polynomials to infinite dimensional case. Journal of Approximation Theory 140 (2), pp. 191–202. Cited by: §IIB.
 [5] (2004) Orthogonal polynomials: computation and approximation. In Numerical Mathematics & Scientific Computation. Cited by: §IIC1.
 [6] (2005) Proof of Weierstrass approximation theorem using band-limited functions. Proceedings of the IEEE 61 (4), pp. 512–512. Cited by: §IIB.

 [7] (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67. Cited by: §IIIB.
 [8] (2013) Robust extreme learning machine. Neurocomputing 102 (2), pp. 31–44. Cited by: §IIC1.
 [9] (2012) Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems Man & Cybernetics Part B 42 (2), pp. 513–529. Cited by: §I, §IVA, §IVD1.
 [10] Note: School of Electric. and Electro. Engi., Nanyang Tech. Univ., Singapore [Online].http://www.ntu.edu.sg/home/egbhuang/elm_codes.html Cited by: §IV.
 [11] (1998) Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural Networks 9 (1), pp. 224–229. Cited by: §I, §I.
 [12] (2003) Universal approximation using incremental feedforward networks with arbitrary input weights. Revised and resubmitted to IEEE Transactions on Neural Networks. Cited by: §I.
 [13] (2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, Vol. 2, pp. 985–990. Cited by: §I, §I.
 [14] (2006) Extreme learning machine: theory and applications. Neurocomputing 70 (1), pp. 489–501. Cited by: §I, §I.
 [15] (2008) Direct and converse results for multivariate generalized bernstein polynomials. Cited by: §IIB, §IIB.
 [16] (2015) Dynamic load identification for stochastic structures based on Gegenbauer polynomial approximation and regularization method. Mechanical Systems & Signal Processing 56–57, pp. 35–54. Cited by: §IIB.
 [17] (2010) Analysis of feature extraction criterion function maximum in nonlinear multilayer feedforward neural networks for pattern recognition. In Intelligent Computation Technology and Automation (ICICTA), 2010 International Conference on, Vol. 1, pp. 655–658. Cited by: §I.
 [18] (2014) ERNN: a biologically inspired feedforward neural network to discriminate emotion from EEG signal. IEEE Transactions on Neural Networks and Learning Systems 25 (3), pp. 609–620. Cited by: §I.
 [19] (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks 6 (6), pp. 861–867. Cited by: §I.
 [20] (2014) Construction of linear dynamic gene regulatory network based on feedforward neural network. In Natural Computation (ICNC), 2014 10th International Conference on, pp. 99–107. Cited by: §I.
 [21] (2011) A new robust training algorithm for a class of single-hidden layer feedforward neural networks. Neurocomputing 74 (16), pp. 2491–2501. Cited by: §IIC1.
 [22] (1999) Neural network learning: theoretical foundations. AI Magazine 22 (2), pp. 99–100. Cited by: §IIC1, §IIIB.
 [23] (2014) Rainfall prediction in Kemayoran Jakarta using hybrid genetic algorithm (GA) and partially connected feedforward neural network (PCFNN). In Information and Communication Technology (ICoICT), 2014 2nd International Conference on, pp. 166–171. Cited by: §I.
 [24] LS-SVMlab toolbox. Note: Dept. Elect. Eng., ESAT-SCD-SISTA, Leuven, Belgium, 2002 [Online]. http://www.esat.kuleuven.be/sista/lssvmlab/ Cited by: §IV.
 [25] (1986) Learning representations by backpropagating errors. Nature 323 (3), pp. 533–536. Cited by: §I.
 [26] (2014) Orthogonal polynomials of several variables. Journal of Approximation Theory 112 (3), pp. 318–319. Cited by: §IIB.
 [27] (2012) Fast evolutionary programming-based hybrid multilayer feedforward neural network for predicting grid-connected photovoltaic system output. In Signal Processing and its Applications (CSPA), 2012 IEEE 8th International Colloquium on, pp. 44–47. Cited by: §I.
 [28] (2002) Weighted least squares support vector machines : robustness and sparse approximation. Neurocomputing 48 (1), pp. 85–105. Cited by: §IIIB.
 [29] (1999) Least squares support vector machine classifiers. Neural Processing Letters 9 (3), pp. 293–300. Cited by: §IIIB.
 [30] (1980) Practical methods of optimization, volume 2: constrained optimization (R. Fletcher). Cited by: §IIIB1.
 [31] UCI repository of machine learning databases. Note: Dept. Inf. Comput. Sci., Univ. California, Irvine, CA, 1998 [Online].http://www.ics.uci.edu/~mlearn/MLRepository.html Cited by: §IVA.
 [32] (1995) The nature of statistical learning theory. Cited by: §I, §IIIA.
 [33] (2011) A study on effectiveness of extreme learning machine. Neurocomputing 74 (16), pp. 2483–2490. Cited by: §I.
 [34] (2013) Interval uncertain method for multibody mechanical systems using Chebyshev inclusion functions. International Journal for Numerical Methods in Engineering 95 (7), pp. 608–630. Cited by: §IIA.
 [35] (2014) Uncertainty propagation in SEA for structural-acoustic coupled systems with nondeterministic parameters. Journal of Sound & Vibration 333 (17), pp. 3949–3965. Cited by: §IIA.
 [36] (2017) A unified method for the response analysis of interval/random variable models of acoustic fields with uncertain-but-bounded parameters: a unified polynomial method for interval/random variable models. International Journal for Numerical Methods in Engineering 111. Cited by: §IIA.
 [37] (2006) A new fast learning algorithm for multilayer feedforward neural networks. In Machine Learning and Cybernetics, 2006 International Conference on, pp. 2928–2934. Cited by: §I.
 [38] (2016) A survey of randomized algorithms for training neural networks. Information Sciences 364, pp. 146–155. Cited by: §I.
 [39] (2009) Gegenbauer orthogonal basis neural network and its weights-direct-determination method. Electronics Letters 45 (23), pp. 1184. Cited by: §IIA.
 [40] (2008) A weights-directly-determined simple neural network for nonlinear system identification. In IEEE International Conference on Fuzzy Systems. Cited by: §IIB.
 [41] (2014) Weights and structure determination of multiple-input feedforward neural network activated by Chebyshev polynomials of class 2 via cross-validation. Neural Computing & Applications 25 (7–8), pp. 1761–1770. Cited by: §I, §IIB, §IIB, §IIC1, §IIIA, §IIIA.
 [42] (2015) Neural-dynamic-method-based dual-arm CMG scheme with time-varying constraints applied to humanoid robots. IEEE Transactions on Neural Networks & Learning Systems 26 (12), pp. 3251–3262. Cited by: §I.
 [43] (2018) Two-input power-activation neural network weights-direct-determination and structure optimized by particle swarm optimization. In the Ninth International Conference on Intelligent Control and Information Processing (ICICIP), pp. 191–198. Cited by: §I, §IIB, §IIC1, §IIIA, §IIIA.