FPGA Implementation for the Sigmoid with Piecewise Linear Fitting Method Based on Curvature Analysis

Zerun Li, Yang Zhang *, Bingcai Sui *, Zuocheng Xing * and Qinglin Wang *

College of Computer, National University of Defense Technology, Changsha 410073, China; lizerun16@nudt.edu.cn
* Correspondence: zhangyang@nudt.edu.cn (Y.Z.); bingcaisui@nudt.edu.cn (B.S.); zcxing@nudt.edu.cn (Z.X.); wangqinglin@nudt.edu.cn (Q.W.)

Citation: Li, Z.; Zhang, Y.; Sui, B.; Xing, Z.; Wang, Q. FPGA Implementation for the Sigmoid with Piecewise Linear Fitting Method Based on Curvature Analysis. Electronics 2022, 11, 1365. https://doi.org/10.3390/electronics11091365

Academic Editor: Alexander Barkalov
Received: 23 March 2022; Accepted: 21 April 2022; Published: 25 April 2022

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: The sigmoid activation function is popular in neural networks, but its complexity limits hardware implementation and speed. In this paper, we use curvature values to divide the sigmoid function into segments and employ the least squares method to solve for the piecewise linear fitting function in each segment. We then select an appropriate function expression with a specified number of segments by comparing maximum absolute errors and average absolute errors. Finally, we implement the sigmoid function on a field-programmable gate array (FPGA) development platform, performing the arithmetic operations (multiplication and addition) and the range selection in parallel. The FPGA implementation results show that the clock frequency of our design reaches 208.3 MHz, while the end-to-end latency is only 9.6 ns. Our piecewise linear fitting method based on curvature analysis (PWLC) achieves recognition accuracies on the MNIST dataset of 97.51% with a deep neural network (DNN) and 98.65% with a convolutional neural network (CNN). Experimental results demonstrate that our FPGA design of the sigmoid function obtains the lowest latency, reduces absolute errors, and achieves high recognition accuracies, while its hardware cost remains acceptable for practical applications.

Keywords: sigmoid; neural networks; piecewise linear fitting; approximation methods; FPGA; high speed; hardware acceleration

1. Introduction

The term artificial neural network (ANN) refers to a family of mathematical models inspired by biology and neuroscience. These models simulate biological neural networks by abstracting the neural network of the human brain, constructing artificial neurons, and connecting these neurons according to a certain topology. In the field of artificial intelligence, ANNs are usually referred to as neural networks or neural models. The basic unit of a neural network is the artificial neuron, which simulates the structure and characteristics of a biological neuron: it receives a group of input signals and produces an output. The output of a neuron is computed by an activation function such as sigmoid, ReLU, or Softplus [1]. Among these, the sigmoid function is widely used in ANN models because of its simple expression and bounded output range. However, the sigmoid function is nonlinear and involves exponentiation and division, which consume a large amount of hardware resources. It is therefore necessary to simplify and accelerate the sigmoid function when deploying neural networks on hardware platforms [2,3].
These hardware platforms include embedded devices, Internet of Things (IoT) applications, and field-programmable gate array (FPGA) boards [4,5]. A feasible neural network design must also strike a balance between performance and functional flexibility [6,7]. To simplify and deploy the sigmoid function, various fitting methods have been proposed for hardware implementation, including the look-up table, the coordinate rotation digital computer (CORDIC) algorithm, Taylor series expansion, polynomial approximation, the piecewise method, and the hybrid method. The look-up table [8] approximates the sigmoid function directly from preset values: all input and output values are saved in memory, and outputs are read according to the inputs. Although the output accuracy can be extremely high, storing high-precision values consumes too much memory. The CORDIC method [9] converts the sigmoid function into simple operations, such as addition and shifting, performed over multiple iterations. Although this method requires no multiplication, it relies on multiple look-up tables and additions, so its hardware resource consumption is still too high for many applications. The Taylor series expansion method [10] and the polynomial method [11] fit the sigmoid function with high-order expressions, which also consume a large amount of hardware resources. The hybrid method [12] combines different fitting methods, which requires substantial storage space and many complex operations. Compared with these methods, the piecewise fitting method has clear function expressions, consumes few hardware resources, and achieves high fitting accuracy [13].
It has thus become a mainstream method for fitting the sigmoid function. The basic principle of piecewise fitting is to divide the sigmoid function into several regions and to replace the original function in each region with a different, simpler expression [14]. Piecewise fitting methods can be divided into piecewise linear and piecewise nonlinear methods. The latter achieve higher fitting accuracy for the same number of segments but consume more hardware resources, whereas the former can reach the same accuracy by using more segments without any high-order operations. The piecewise linear fitting method is therefore better suited to achieving high speed with low resource usage on FPGAs. Savich divides the sigmoid function into five segments in the range of [−8, 8] and uses a linear fitting method with both adders and multipliers [15]. Armato uses the area conservation method to divide the sigmoid function into 16 non-uniform segments for linear fitting [16]. Ngah proposes a fitting method that combines a piecewise second-order function with a look-up table: it first performs an initial fit with the piecewise second-order function and the look-up table, and then adds or subtracts correction values to the output to improve the fitting accuracy [12]. Campo divides the sigmoid function into 12 segments and fits each segment with a second-order Taylor expansion [10]. Gomar fits the sigmoid function using an approximate computation of the exponential [17]. Pandit applies Chebyshev polynomial approximation for efficient hardware realization [18]. Mitra proposes a 16-segment linear fitting algorithm based on adders, multipliers, and logic blocks [14].
Zamanlooy uses a continuous valued number system to fit the sigmoid function linearly and applies continuous modular compression to reduce the width of the numbers [19]. Nguyen divides the sigmoid function into 12 segments and chooses the parameters of each linear function based on the probability distribution of input values [20]. Pan combines piecewise linear (PWL) approximation, Taylor series approximation, and Newton–Raphson-based approximation to implement the sigmoid function efficiently [21]. The piecewise fitting methods described above use hardware resources economically; however, their recognition accuracy and processing speed on hardware leave room for improvement [22,23]. This paper therefore chooses the abscissas of candidate piecewise points based on the curvature of the sigmoid function. We solve the function expression in every segment between piecewise points using the least squares method. We then compare the absolute errors of the fitting functions obtained with different sets of candidate piecewise points and choose a single function expression as the hardware implementation scheme, achieving high fitting accuracy while reducing hardware resource consumption. Finally, we implement the piecewise linear fitting function on the specified FPGA platform to simulate the inference stage of neural networks. The circuits for different segments work in parallel to calculate their outputs, and a multiplexer selects the result of one segment as the final output of the fitting function based on the range of the input value. The clock frequency of this circuit and the recognition accuracies on the MNIST dataset show that our piecewise linear fitting method based on curvature analysis (PWLC) has lower latency and achieves higher accuracies in the specified neural networks.
The contributions of this paper lie in the following three aspects:

• This paper proposes a new method to select candidate piecewise points based on the curvature values of the sigmoid function. Piecewise points are selected within the specified range according to the curvature values.
• This paper develops an error-based comparison scheme for PWLC to determine the proper expression of the piecewise linear function. The comparison is based on the maximum absolute errors and average absolute errors.
• This paper presents a high-speed hardware design for PWLC. The circuit implemented on the FPGA development platform achieves the lowest end-to-end latency at a higher clock frequency, at the cost of additional hardware resources.

The remainder of this paper is organized as follows. Section 2 presents the principle of solving the expressions of the piecewise linear fitting function based on curvature analysis. Section 3 outlines the comparison scheme for the candidate piecewise function expressions. Section 4 describes the module design of the piecewise linear function. Section 5 presents the experimental results and compares them with those of other papers. Finally, Section 6 concludes the paper.

2. Piecewise Linear Fitting Method Based on Curvature Values

The sigmoid function is continuous and monotonically increasing over its entire domain. Unlike the piecewise linear ReLU function, the sigmoid function has a derivative that changes continuously over its domain, so it exhibits nonlinear characteristics. At both ends, the sigmoid function approaches its saturation values (0 and 1) and barely changes. Between the two saturation regions, the curve is clearly nonlinear and can be fitted by several linear functions with different slopes. In the middle of this range, the sigmoid function changes more rapidly and needs to be fitted with more linear functions of different slopes.
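The saturation, monotonicity, and derivative behavior described above can be checked numerically. The following Python sketch is an illustration only, not part of the authors' design; it uses the standard identity y′ = y(1 − y), which is equivalent to the first derivative given in Equation (3), and a 0.01 sampling step chosen here for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Equation (2): y(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x: float) -> float:
    """First derivative via y'(x) = y(x) * (1 - y(x)), equivalent to Equation (3)."""
    y = sigmoid(x)
    return y * (1.0 - y)

# Uniform grid over [-8, 8] with a step of 0.01 (1601 points, both ends included).
xs = [-8.0 + 0.01 * i for i in range(1601)]
ys = [sigmoid(x) for x in xs]
ds = [sigmoid_prime(x) for x in xs]

# Monotonicity: each sampled value exceeds the previous one.
monotonic = all(b > a for a, b in zip(ys, ys[1:]))

# Location and height of the derivative peak on the grid.
i_peak = max(range(len(ds)), key=ds.__getitem__)
x_peak, d_peak = xs[i_peak], ds[i_peak]

# Central symmetry about (0, 0.5): y(-x) = 1 - y(x).
sym_gap = max(abs(sigmoid(-x) - (1.0 - sigmoid(x))) for x in xs)

print(monotonic, x_peak, d_peak, sigmoid(-8.0), sigmoid(8.0), sym_gap)
```

On this grid the derivative peaks at x = 0 with the value 0.25, and the function values at ±8 lie within about 3.4 × 10⁻⁴ of the saturation values 0 and 1.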
The derivative describes how fast a function changes. As can be seen from Figure 1, the derivative of the sigmoid function reaches its maximum at x = 0. Although the value of the sigmoid function changes rapidly near 0, the curve there is relatively straight and can be approximated by a linear function. At x = 3, the derivative of the sigmoid function is small, yet the function bends noticeably more near x = 3. Moreover, the derivatives of f_1(x) = x^2 and f_2(x) = x^4 are the same at x = 0 (both equal 0), but their curvatures at 0 are clearly different (2 and 0, respectively). Thus, the derivative cannot directly describe how far a curve deviates from a straight line, especially the degree of bending at a single point. In contrast, the curvature is defined as the rate of change of the tangent angle at a point on the curve with respect to arc length. It measures how much a curve deviates from a straight line: the larger the curvature, the more sharply the curve bends. The curvature is defined as follows.

$$\kappa = \lim_{\Delta \ell \to 0} \left| \frac{\Delta \theta}{\Delta \ell} \right| = \frac{\left| y''(x_0) \right|}{\left(1 + \left(y'(x_0)\right)^2\right)^{3/2}} \tag{1}$$

Figure 1. The original, derivative, and curvature graphs of the sigmoid function in the range of [−8, 8].

In the equation above, θ is the tangent angle and ℓ is the arc length. The original expression of the sigmoid function is given below.

$$y(x) = \frac{1}{1 + \exp(-x)} \tag{2}$$

The first derivative of the sigmoid function is given below.

$$y'(x) = \frac{\exp(-x)}{\left(1 + \exp(-x)\right)^2} \tag{3}$$

The second derivative of the sigmoid function is given below.

$$y''(x) = \frac{\exp(-2x) - \exp(-x)}{\left(1 + \exp(-x)\right)^3} \tag{4}$$

Thus, the curvature of the sigmoid function is as follows.
$$\kappa = \frac{\left| \exp(-2x) - \exp(-x) \right| \left(1 + \exp(-x)\right)^3}{\left(1 + 4\exp(-x) + 7\exp(-2x) + 4\exp(-3x) + \exp(-4x)\right)^{3/2}} \tag{5}$$

2.1. Selection of Piecewise Points Based on Curvature Analysis

As shown in Figure 1, the curvature graph is symmetric about the y-axis. At x = 0, the curvature reaches its minimum value of 0; here the sigmoid function bends least and is closest to a straight line. For x ∈ [−5, −0.5] ∪ [0.5, 5], the curvature is relatively large, and it peaks near x ≈ ±1.36 at a maximum value of about 0.092. For x ∈ (−∞, −5] ∪ [5, +∞), the curvature is close to 0, and the sigmoid function can be regarded approximately as a linear function. Figure 1 also shows that the derivative of the sigmoid function changes markedly within [−8, 8] and peaks at x = 0 with a maximum value of 0.25. If [−8, 8] is subdivided into many small segments, the derivative varies little within each segment, and the sigmoid function in each segment closely resembles a straight line. The nonlinear sigmoid function can therefore be fitted by many linear functions, and using more segments decreases the fitting error. In practice, the sigmoid function is fitted over the specified range, with the segments delimited by the abscissas of the chosen piecewise points and an independent linear expression in each segment. In the saturation regions at both ends, the original function can be approximated by 0 or 1 with small fitting errors. Because Equation (5) is complicated, finding its maximum analytically is cumbersome. This paper therefore samples the abscissas systematically to examine the curvature values over different candidate segment intervals.
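Equation (5) can be cross-checked against the generic definition in Equation (1) evaluated with Equations (3) and (4). The minimal Python sketch below does this, locates the curvature peak on a fine grid, and reproduces the x^2 versus x^4 example from the text. The grid step (0.01) and the helper names are illustrative assumptions, not taken from the paper.

```python
import math

def d1(x: float) -> float:
    """First derivative of the sigmoid, Equation (3)."""
    e = math.exp(-x)
    return e / (1.0 + e) ** 2

def d2(x: float) -> float:
    """Second derivative of the sigmoid, Equation (4)."""
    e = math.exp(-x)
    return (e * e - e) / (1.0 + e) ** 3

def curvature_generic(f2: float, f1: float) -> float:
    """Equation (1): |y''| / (1 + y'^2)^(3/2), given y'' and y' at one point."""
    return abs(f2) / (1.0 + f1 * f1) ** 1.5

def curvature_closed_form(x: float) -> float:
    """Equation (5), written with u = exp(-x)."""
    u = math.exp(-x)
    num = abs(u * u - u) * (1.0 + u) ** 3
    den = (1.0 + 4.0 * u + 7.0 * u**2 + 4.0 * u**3 + u**4) ** 1.5
    return num / den

# Largest absolute disagreement between the two forms on a few test points.
points = [-3.0, -1.0, 0.0, 0.5, 1.5, 4.0]
max_gap = max(abs(curvature_generic(d2(x), d1(x)) - curvature_closed_form(x)) for x in points)

# Locate the curvature peak on [0, 5] with a 0.01 grid.
grid = [0.01 * i for i in range(501)]
x_peak = max(grid, key=curvature_closed_form)
kappa_peak = curvature_closed_form(x_peak)

# x^2 and x^4 both have zero slope at 0, but their second derivatives are 2 and 0.
kappa_x2_at_0 = curvature_generic(2.0, 0.0)
kappa_x4_at_0 = curvature_generic(0.0, 0.0)

print(max_gap, round(x_peak, 2), round(kappa_peak, 4), kappa_x2_at_0, kappa_x4_at_0)
```

Because the squared slope is small near the peak, the curvature maximum lies close to the maximum of |y″|, shifted slightly toward larger |x| by the denominator of Equation (1).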
Within [−5, 5], the abscissas are equally spaced with a step of 0.5. Systematic sampling produces a set of samples at evenly spaced abscissas, which reveals the changes in curvature intuitively. In the range of [0, 5], the curvature has a single peak and shows neither periodic variation nor monotonic change. The positive abscissas and their curvature values are listed in Table 1. Segment intervals with larger curvature require more piecewise points for linear fitting in order to reduce fitting errors and improve the numerical accuracy of the fitting function.

Table 1. Curvatures with different abscissas of the sigmoid function.

Abscissa:   0      0.5    1      1.5    2      2.5    3      3.5    4      4.5    5      5.5
Curvature:  0      0.053  0.086  0.092  0.079  0.059  0.041  0.027  0.017  0.011  0.007  0.004

2.2. Solution for Function Fitting Based on Sample Points and Selected Piecewise Points

We select n points on the sigmoid curve whose abscissas x_1, x_2, x_3, …, x_n are arranged in ascending order. The corresponding ordinates are y_1, y_2, y_3, …, y_n; because the sigmoid function is monotonically increasing, the ordinates are in the same order. To treat all data pairs equally, we sample the data with uniform spacing. The data pairs are as follows.

$$\begin{bmatrix} x_1 & y_1 \\ x_2 & y_2 \\ x_3 & y_3 \\ \vdots & \vdots \\ x_n & y_n \end{bmatrix} = \begin{bmatrix} x_1 & \frac{1}{1+\exp(-x_1)} \\ x_2 & \frac{1}{1+\exp(-x_2)} \\ x_3 & \frac{1}{1+\exp(-x_3)} \\ \vdots & \vdots \\ x_n & \frac{1}{1+\exp(-x_n)} \end{bmatrix} \tag{6}$$

Here, n is the number of sample points of the sigmoid function. The discretized sample points are evenly distributed along the x-axis, and the subscripts increase with the value of x. This dense, systematic sampling covers the whole range and accurately captures the shape of the original function.
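Table 1 and the data pairs of Equation (6) can be regenerated in a few lines of Python. This sketch is illustrative: the helper names are hypothetical, and it includes both endpoints of [−8, 8], which gives 1601 pairs for a 0.01 interval (Section 3 reports n = 1600 for that interval, so the paper's exact endpoint convention is an assumption here).

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def curvature(x: float) -> float:
    """Equation (1) with y' = y(1 - y) and y'' = y'(1 - 2y), equivalent to Equations (3)-(4)."""
    y = sigmoid(x)
    d1 = y * (1.0 - y)
    d2 = d1 * (1.0 - 2.0 * y)
    return abs(d2) / (1.0 + d1 * d1) ** 1.5

# Table 1: systematic sampling of the positive abscissas 0, 0.5, ..., 5.5.
table1 = {0.5 * i: round(curvature(0.5 * i), 3) for i in range(12)}

def data_pairs(interval: float = 0.01, lo: float = -8.0, hi: float = 8.0):
    """Equation (6): uniformly spaced (x_i, y_i) pairs, both endpoints included."""
    n = int(round((hi - lo) / interval)) + 1
    return [(lo + i * interval, sigmoid(lo + i * interval)) for i in range(n)]

pairs = data_pairs()
for a, k in table1.items():
    print(f"{a:>4}: {k}")
```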
Within each interval bounded by adjacent piecewise points, the sigmoid function can be fitted with a linear function. All m piecewise points are selected from the values listed in Table 1, and the piecewise linear fitting function for the sigmoid function is as follows.

$$p(x) = \begin{cases} 0, & x \le b_1 \\ \eta_1 + k_1 (x - b_1), & b_1 < x \le b_2 \\ \eta_2 + k_2 (x - b_2), & b_2 < x \le b_3 \\ \quad\vdots & \quad\vdots \\ \eta_{m-1} + k_{m-1} (x - b_{m-1}), & b_{m-1} < x \le b_m \\ 1, & x > b_m \end{cases} \tag{7}$$

where b_j is the abscissa of the j-th piecewise point chosen from Table 1, η_j is the value of the fitting function at b_j (the ordinate of the j-th piecewise point), k_j is the slope of the j-th linear segment, and m is the total number of piecewise points. The formula assumes that the piecewise points are ordered as b_1 < b_2 < ⋯ < b_m, like the abscissas of the sample points.

Equation (7) is the general form of the piecewise function. Because the sigmoid function is continuous, we require the linear pieces to join continuously at each interior piecewise point b_2, …, b_{m−1}. Under this constraint, the slope and intercept of each segment depend on those of the preceding segment: each segment equals the expression of the previous segment, β_1 + β_2(x − b_1) + ⋯ + β_{m−1}(x − b_{m−2}), plus the increment of the current interval, β_m(x − b_{m−1}). The piecewise function is therefore expressed as follows.

$$p(x) = \begin{cases} 0, & x \le b_1 \\ \beta_1 + \beta_2 (x - b_1), & b_1 < x \le b_2 \\ \beta_1 + \beta_2 (x - b_1) + \beta_3 (x - b_2), & b_2 < x \le b_3 \\ \quad\vdots & \quad\vdots \\ \beta_1 + \beta_2 (x - b_1) + \beta_3 (x - b_2) + \cdots + \beta_m (x - b_{m-1}), & b_{m-1} < x \le b_m \\ 1, & x > b_m \end{cases} \tag{8}$$

By exploiting the continuity of the linear fitting function, the number of unknown parameters decreases from 2(m − 1) (one η_j and one k_j for each of the m − 1 linear segments in Equation (7)) to m (β_1, …, β_m in Equation (8)). This paper uses a custom step function δ_{x_i > b_j} to express the piecewise function in matrix form. It takes the values 0 and 1 as follows.

$$\delta_{x_i > b_j} = \begin{cases} 0, & x_i \le b_j \\ 1, & x_i > b_j \end{cases} \tag{9}$$

Here, i ∈ {1, 2, …, n}, j ∈ {1, 2, …, m}, and b_j is the abscissa of the step point. With this step function, the relationship between the data pairs can be written in matrix form instead of as a piecewise expression. Substituting the n samples and m piecewise points into Equation (8) yields a system of linear equations in the unknowns β_i (i = 1, 2, …, m), given in Equation (10).

$$\begin{bmatrix} 1 & x_1 - b_1 & (x_1 - b_2)\,\delta_{x_1 > b_2} & \cdots & (x_1 - b_{m-1})\,\delta_{x_1 > b_{m-1}} \\ 1 & x_2 - b_1 & (x_2 - b_2)\,\delta_{x_2 > b_2} & \cdots & (x_2 - b_{m-1})\,\delta_{x_2 > b_{m-1}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n - b_1 & (x_n - b_2)\,\delta_{x_n > b_2} & \cdots & (x_n - b_{m-1})\,\delta_{x_n > b_{m-1}} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{10}$$

In practice, the number of sample points exceeds the number of piecewise points (n > m), so the system is overdetermined and generally has no exact solution. The parameters β_i (i = 1, 2, …, m) are therefore chosen to minimize the fitting errors between the fitted values and the corresponding sigmoid function values. For brevity, the matrix equation is written as follows.

$$A\beta = Y \tag{11}$$

where A is the n × m regression matrix, β is the vector of m parameters, and Y is the vector of n function values. We use the least squares method, which minimizes the sum of squared residuals, to solve for the unknown vector β.
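Equations (9) to (11), together with the least squares solution given next in Equation (12), map directly onto NumPy. The sketch below is one possible implementation under stated assumptions, not the authors' code: it builds A column by column, solves with `numpy.linalg.lstsq` (numerically preferable to forming (AᵀA)⁻¹ explicitly and equivalent when A has full column rank), and recovers the slope of each segment as a running sum of β_2, …, β_{j+1}. The 10 piecewise points from Table 2 and a 0.01 sample interval (endpoints included) serve as an example; the function names are hypothetical.

```python
import numpy as np

def step(x: np.ndarray, b: float) -> np.ndarray:
    """Equation (9): 1.0 where x > b, else 0.0."""
    return (x > b).astype(float)

def regression_matrix(x: np.ndarray, breaks: np.ndarray) -> np.ndarray:
    """Regression matrix A of Equation (10), shape (n, m)."""
    cols = [np.ones_like(x), x - breaks[0]]        # columns for beta_1 and beta_2
    for b in breaks[1:-1]:                          # b_2, ..., b_{m-1}
        cols.append((x - b) * step(x, b))           # hinge column (x - b_j) * delta
    return np.column_stack(cols)

def fit_pwl(x: np.ndarray, y: np.ndarray, breaks: np.ndarray):
    """Least squares fit of Equation (11); returns beta, slopes, intercepts, error vector."""
    A = regression_matrix(x, breaks)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    slopes = np.cumsum(beta[1:])                    # slope of segment (b_j, b_{j+1}]
    # Fit value at the left end of each segment; intercept = value - slope * b_j.
    left_values = regression_matrix(breaks[:-1], breaks) @ beta
    intercepts = left_values - slopes * breaks[:-1]
    errors = A @ beta - y                           # error vector E of Equation (13)
    return beta, slopes, intercepts, errors

breaks = np.array([-8, -4.5, -3, -2, -1, 1, 2, 3, 4.5, 8], dtype=float)
x = np.linspace(-8.0, 8.0, 1601)
y = 1.0 / (1.0 + np.exp(-x))
beta, slopes, intercepts, errors = fit_pwl(x, y, breaks)
e_max, e_avg = np.max(np.abs(errors)), np.mean(np.abs(errors))
print(np.round(slopes, 5), np.round(intercepts, 5), round(e_max, 5), round(e_avg, 5))
```

Because both the sample grid and the piecewise points are symmetric about x = 0, the least squares solution inherits the central symmetry discussed in Section 3: mirrored segments get equal slopes, and the middle segment passes through (0, 0.5).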
The least squares solution β* is given by

$$\beta^{*} = \left(A^{T} A\right)^{-1} A^{T} Y \tag{12}$$

Once β* is known, the expressions of the piecewise function are fully determined. To accelerate processing while retaining recognition accuracy, the candidate fitting schemes must be compared before hardware implementation. This paper uses the error vector E to describe the differences between the values of the piecewise linear function and those of the original function at the given data pairs. E is expressed as follows.

$$E = A\beta^{*} - Y = [e_1, e_2, \ldots, e_n]^{T} \tag{13}$$

E has n elements, with e_i = p(x_i) − y_i, i = 1, 2, …, n. The maximum absolute error e_max = max(|e_1|, |e_2|, …, |e_n|) is the largest deviation between the fitting function and the original function over the data pairs, while the average absolute error e_avg = (1/n) Σ_{i=1}^{n} |e_i| is the mean deviation of the fitting model. Here, |·| denotes the absolute value of an element of E. These two metrics are used to compare the candidate fitting models.

3. Realization Scheme of PWLC for the Sigmoid Function

Having described how to solve the linear fitting function from sample data points, we next present the detailed process for solving the function expression with the PWLC method. To determine the abscissas of the piecewise points, we need to analyze the characteristics of the sigmoid function. The sigmoid function is centrally symmetric about the point (0, 0.5). It has clear saturation regions at both ends of the x-axis. At x = 8, the sigmoid function equals 0.9997, which is very close to its upper bound of 1. For x > 8, the value of the fitting function can therefore be set to 1.
Correspondingly, for x ≤ −8, the value of the fitting function can be set to 0. The curvature graph shows that the curvature is 0 at x = 0: around x = 0, the sigmoid function is almost straight. In addition, the ideal linear fitting function is also centrally symmetric about the point (0, 0.5). If x = 0 were set as a piecewise point, the two segments on either side of it would have the same slope and intercept and thus the same expression, so x = 0 is not used as a piecewise point. Because of the central symmetry of the fitting function, the piecewise points on the positive and negative halves of the x-axis appear in pairs, so the number of piecewise points is even. The numbers of piecewise points considered here are 4, 6, 8, 10, 12, and 14, and the corresponding total numbers of segment intervals, including the two ranges x ≤ −8 and x > 8, are 5, 7, 9, 11, 13, and 15, respectively. The abscissas of the piecewise points must be symmetric about x = 0 to satisfy the central symmetry condition of the fitting function; subject to this condition, they can be chosen freely. The piecewise points should be placed relatively densely in intervals with larger curvature, so that more linear functions are used where the sigmoid function bends more. Based on this curvature analysis, this paper selects several representative sets of abscissas in the range of [−8, 8]. The abscissas of the piecewise points for each number of points are shown in Table 2 in ascending order.
Table 2. Different abscissas with various numbers of piecewise points.

Number of points   Abscissas of piecewise points
4                  −8, −2, 2, 8
6                  −8, −3, −1, 1, 3, 8
8                  −8, −4, −2, −1, 1, 2, 4, 8
10                 −8, −4.5, −3, −2, −1, 1, 2, 3, 4.5, 8
12                 −8, −4.5, −3, −2.5, −2, −1, 1, 2, 2.5, 3, 4.5, 8
14                 −8, −4.5, −3, −2.5, −2, −1.5, −1, 1, 1.5, 2, 2.5, 3, 4.5, 8

Using the abscissas in Table 2 and the method described above, we solve the expressions of the piecewise linear fitting functions for the different numbers of piecewise points. The slopes and intercepts are given as decimals to five decimal places. Further decimal digits are redundant: on the FPGA platform, the coefficients are stored as fixed-point numbers with a limited number of fractional bits, so only the leading significant digits matter. The piecewise linear fitting function with four piecewise points, p_4(x), is as follows.

$$p_4(x) = \begin{cases} 0, & x \le -8 \\ 0.01511x + 0.09783, & -8 < x \le -2 \\ 0.21619x + 0.5, & -2 < x \le 2 \\ 0.01511x + 0.90217, & 2 < x \le 8 \\ 1, & x > 8 \end{cases} \tag{14}$$

The piecewise linear fitting function with six piecewise points, p_6(x), is as follows.

$$p_6(x) = \begin{cases} 0, & x \le -8 \\ 0.00634x + 0.04392, & -8 < x \le -3 \\ 0.11127x + 0.35872, & -3 < x \le -1 \\ 0.25255x + 0.5, & -1 < x \le 1 \\ 0.11127x + 0.64128, & 1 < x \le 3 \\ 0.00634x + 0.95608, & 3 < x \le 8 \\ 1, & x > 8 \end{cases} \tag{15}$$

The piecewise linear fitting function with eight piecewise points, p_8(x), is as follows.
$$p_8(x) = \begin{cases} 0, & x \le -8 \\ 0.00261x + 0.01947, & -8 < x \le -4 \\ 0.04767x + 0.19971, & -4 < x \le -2 \\ 0.15881x + 0.42199, & -2 < x \le -1 \\ 0.23682x + 0.5, & -1 < x \le 1 \\ 0.15881x + 0.57801, & 1 < x \le 2 \\ 0.04767x + 0.80029, & 2 < x \le 4 \\ 0.00261x + 0.98053, & 4 < x \le 8 \\ 1, & x > 8 \end{cases} \tag{16}$$

The piecewise linear fitting function with 10 piecewise points, p_10(x), is as follows.

$$p_{10}(x) = \begin{cases} 0, & x \le -8 \\ 0.00252x + 0.01875, & -8 < x \le -4.5 \\ 0.02367x + 0.11397, & -4.5 < x \le -3 \\ 0.06975x + 0.25219, & -3 < x \le -2 \\ 0.14841x + 0.40951, & -2 < x \le -1 \\ 0.2389x + 0.5, & -1 < x \le 1 \\ 0.14841x + 0.59049, & 1 < x \le 2 \\ 0.06975x + 0.74781, & 2 < x \le 3 \\ 0.02367x + 0.88603, & 3 < x \le 4.5 \\ 0.00252x + 0.98125, & 4.5 < x \le 8 \\ 1, & x > 8 \end{cases} \tag{17}$$

The piecewise linear fitting function with 12 piecewise points, p_12(x), is as follows.

$$p_{12}(x) = \begin{cases} 0, & x \le -8 \\ 0.00248x + 0.0185, & -8 < x \le -4.5 \\ 0.02405x + 0.11556, & -4.5 < x \le -3 \\ 0.06608x + 0.24165, & -3 < x \le -2.5 \\ 0.07375x + 0.26084, & -2.5 < x \le -2 \\ 0.14761x + 0.40855, & -2 < x \le -1 \\ 0.23906x + 0.5, & -1 < x \le 1 \\ 0.14761x + 0.59145, & 1 < x \le 2 \\ 0.07375x + 0.73916, & 2 < x \le 2.5 \\ 0.06608x + 0.75835, & 2.5 < x \le 3 \\ 0.02405x + 0.88444, & 3 < x \le 4.5 \\ 0.00248x + 0.9815, & 4.5 < x \le 8 \\ 1, & x > 8 \end{cases} \tag{18}$$

The piecewise linear fitting function with 14 piecewise points, p_14(x), is as follows.
$$p_{14}(x) = \begin{cases} 0, & x \le -8 \\ 0.00247x + 0.01843, & -8 < x \le -4.5 \\ 0.02415x + 0.11599, & -4.5 < x \le -3 \\ 0.0639x + 0.23525, & -3 < x \le -2.5 \\ 0.0831x + 0.28325, & -2.5 < x \le -2 \\ 0.12891x + 0.37487, & -2 < x \le -1.5 \\ 0.16351x + 0.42677, & -1.5 < x \le -1 \\ 0.23674x + 0.5, & -1 < x \le 1 \\ 0.16351x + 0.57323, & 1 < x \le 1.5 \\ 0.12891x + 0.62513, & 1.5 < x \le 2 \\ 0.0831x + 0.71675, & 2 < x \le 2.5 \\ 0.0639x + 0.76475, & 2.5 < x \le 3 \\ 0.02415x + 0.88401, & 3 < x \le 4.5 \\ 0.00247x + 0.98157, & 4.5 < x \le 8 \\ 1, & x > 8 \end{cases} \tag{19}$$

Based on the fitting functions above (with a sample-point interval of 0.01), we obtain the maximum absolute errors e_max and the average absolute errors e_avg shown in Figure 2. As the figure shows, the absolute errors decrease as the number of piecewise points increases. When the number of piecewise points reaches 10, the maximum error levels off at about 0.008 and the average error at about 0.0016 (see also Table 3). Ten piecewise points is therefore the elbow point in Figure 2. With fewer than 10 piecewise points, the absolute errors of the piecewise function are relatively large and the fit to the original sigmoid function is not ideal. With more than 10 piecewise points, the absolute errors remain small and change little. A larger number of piecewise points, however, increases the complexity of the piecewise function, raising power consumption and consuming additional hardware resources with little improvement in the absolute fitting errors. Considering both the absolute error and the complexity of the function, this paper therefore uses 10 piecewise points.

[Figure 2 plot: maximum absolute error and average absolute error (y-axis: absolute errors, 0 to 0.06) versus number of piecewise points (x-axis: 4 to 14).]
Figure 2. Absolute errors between the piecewise linear fitting function and the original function for different numbers of piecewise points based on the analysis of curvature values.

In addition to the choice of piecewise points, the interval length between sample points may affect the maximum and average absolute errors. This paper sets the sample-point interval length to 0.5, 0.1, 0.05, 0.01, and 0.005; the corresponding numbers of sample points n in the range of [−8, 8] are 32, 160, 320, 1600, and 3200, respectively. The maximum and average absolute errors between the fitting function and the original sigmoid function for the different interval lengths are presented in Table 3.

Table 3. Absolute errors with different sample-point interval lengths.

Interval length   0.5       0.1       0.05      0.01      0.005
e_max             0.00653   0.00773   0.00781   0.00784   0.00784
e_avg             0.00251   0.00166   0.00164   0.00163   0.00163

The sample-point interval length clearly affects the absolute errors of the fitting function when the interval is coarse. As the interval length is reduced and the sample points become denser, the maximum error e_max increases while the average error e_avg decreases. At an interval length of 0.01, both errors converge to fixed values. Note that e_max is evaluated only at the sample points and cannot include non-sample points; with sparse samples in particular, it may underestimate the true maximum deviation of the fitting function from the original function. The average error therefore describes the overall deviation between the fitting function and the original function more reliably.
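To see how e_max and e_avg behave for one fixed function, the rounded coefficients of p_10 in Equation (17) can be evaluated on grids of different spacing. In this minimal NumPy sketch, the evaluation mirrors the evaluate-all-segments-then-select idea of Section 4: every linear piece is computed, comparator bits c_j = [x > b_j] are formed, and their count picks one of the 11 candidate values D0 to D10. Using the count of ones as the multiplexer select signal is an assumption about Table 4, and the helper names are hypothetical. Unlike Table 3, the coefficients are not refitted for each interval here.

```python
import numpy as np

BREAKS = np.array([-8, -4.5, -3, -2, -1, 1, 2, 3, 4.5, 8], dtype=float)
# Slopes and intercepts of the nine linear pieces of Equation (17), left to right.
SLOPES = np.array([0.00252, 0.02367, 0.06975, 0.14841, 0.2389,
                   0.14841, 0.06975, 0.02367, 0.00252])
INTERCEPTS = np.array([0.01875, 0.11397, 0.25219, 0.40951, 0.5,
                       0.59049, 0.74781, 0.88603, 0.98125])

def p10(x) -> np.ndarray:
    """Evaluate Equation (17) by computing every piece and then selecting one."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    c = x[:, None] > BREAKS[None, :]                   # comparator bits, shape (n, 10)
    sel = c.sum(axis=1)                                # number of ones: 0 ... 10
    pieces = x[:, None] * SLOPES[None, :] + INTERCEPTS[None, :]   # all 9 pieces at once
    n = x.shape[0]
    candidates = np.hstack([np.zeros((n, 1)), pieces, np.ones((n, 1))])  # D0 ... D10
    return candidates[np.arange(n), sel]

def abs_errors(interval: float):
    """e_max and e_avg of p10 against the sigmoid on a uniform grid over [-8, 8]."""
    n = int(round(16.0 / interval)) + 1
    x = np.linspace(-8.0, 8.0, n)
    e = p10(x) - 1.0 / (1.0 + np.exp(-x))
    return float(np.max(np.abs(e))), float(np.mean(np.abs(e)))

for interval in (0.5, 0.1, 0.05, 0.01, 0.005):
    e_max, e_avg = abs_errors(interval)
    print(f"interval={interval:<6} e_max={e_max:.5f} e_avg={e_avg:.5f}")
```

For this fixed function, the largest error on the grid, about 0.00784, occurs at x = ±1, where the middle piece meets its neighbors; this agrees with the converged e_max in Table 3.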
When the sample point interval length is sufficiently small and the sample points are sufficiently dense, both the maximum error and the average error reflect the degree to which the linear fitting function deviates from the sigmoid function. When the sampling interval length is reduced below 0.01 (to 0.005 in Table 3), the maximum and average errors remain unchanged. More sampling points therefore cannot improve the fitting effect; instead, they enlarge the least-squares system solved for the unknown parameter matrix β, increase the amount of calculation required to solve for the unknown parameters, raise the complexity of the fitting function expression, and cost a large amount of additional computation time. Setting the sampling interval to 0.01 is therefore both appropriate and reliable.

4. Hardware Design for the Circuit of PWLC Method

We implement our PWLC method on a Xilinx FPGA (XC7V2000) with the Vivado Design Suite [24]. In the specified range [−8, 8], all values associated with the sigmoid function, including input values, slopes, intercepts, and output values, lie within [−8, 8]. The circuit stores all values as 16-bit fixed-point numbers consisting of a 1-bit sign part, a 3-bit integer part, and a 12-bit fractional part. Input values outside [−8, 8] are stored as the saturation values of the 16-bit fixed-point format; the output function value at −8 is 0, while that at 8 is 1. All input values therefore have corresponding output values expressed in this 16-bit fixed-point format, and the original decimal numbers are converted to 16-bit fixed-point numbers with minimal accuracy loss. It is time consuming to calculate the different segments of the optimized function in series and then output the result for the specified segment.
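The 16-bit format described above can be illustrated with a small software model. This is a minimal sketch under assumptions the text does not fix: two's-complement encoding (Q3.12, so the representable range is [−8, 8 − 2^−12]), round-to-nearest conversion, saturation at the 16-bit limits, and a multiplier that shifts the 32-bit product right by 12 bits before saturating. The helper names `to_q312`, `from_q312`, `q_mul`, and `q_add` are hypothetical.

```python
FRAC_BITS = 12                             # 12 fractional bits
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1   # 16-bit two's-complement limits (assumed encoding)


def saturate(v: int) -> int:
    """Clamp an integer to the 16-bit range (assumed saturation behaviour)."""
    return max(Q_MIN, min(Q_MAX, v))


def to_q312(x: float) -> int:
    """Encode a real number as a Q3.12 integer, rounding to nearest and saturating."""
    return saturate(round(x * (1 << FRAC_BITS)))


def from_q312(q: int) -> float:
    """Decode a Q3.12 integer back to a real number."""
    return q / (1 << FRAC_BITS)


def q_mul(a: int, b: int) -> int:
    """Multiply two Q3.12 values.

    The exact product carries 24 fractional bits, so shift right by 12
    (Python's >> on ints is an arithmetic shift, i.e. it rounds toward -inf)
    and saturate back to 16 bits.
    """
    return saturate((a * b) >> FRAC_BITS)


def q_add(a: int, b: int) -> int:
    """Add two Q3.12 values with saturation."""
    return saturate(a + b)


# One linear segment, using the central slope and intercept of Eq. (19).
slope = to_q312(0.23674)
x = to_q312(0.75)
y = q_add(q_mul(slope, x), to_q312(0.5))
print(slope, from_q312(y))  # prints: 970 0.677490234375
```

Under these assumptions the resolution is 2^−12 ≈ 0.000244, so encoding the slope 0.23674 as 970 introduces an error below 0.0001, and an input of 8 saturates to the largest code, 32767 (about 7.99976).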
To avoid computing the segments in series, this paper designs a structure in which all 11 segments (nine linear segments and the two constant saturation segments) are processed in parallel, after which one result is chosen as the output according to the range of the input value. The hardware computes the arithmetic results of the input value for the nine linear segments and selects one result based on comparisons between the input value and all piecewise points. The arithmetic (multiplication and addition) and range selection operations are performed in parallel to decrease the end-to-end latency: the input value is compared with the 10 piecewise points to determine the range in which it lies. Because some slopes of the piecewise function are equal, we reuse multipliers and connect two different adders after each shared multiplier. This reduces the number of multipliers required by four and therefore reduces hardware resource usage. The hardware structure of PWLC is shown in Figure 3. The comparators compare the input value with all piecewise points combinationally, without waiting for a clock edge. When the input value is larger than a piecewise point, the comparator output is set to 1; otherwise, the comparator outputs 0:

$$
c_j = \delta_{x > b_j} =
\begin{cases}
0, & x \le b_j \\
1, & x > b_j
\end{cases}
\tag{20}
$$

The relationship between the 10 comparator outputs and the data ports of the multiplexer is summarized in Table 4. The multiplexer selects a data port according to the range of the input value, as indicated by the comparison results, and outputs the corresponding calculation result.

Figure 3. The overall hardware design of a piecewise linear fitting function with 10 piecewise points. (Diagram labels: input[15:0]; a buffer for piecewise points b1 to b10 feeding comparators that produce c[9:0]; a buffer for slopes k1 to k5 and intercepts t1 to t9 feeding multipliers and adders; a multiplexer over data ports D0 to D10 producing output[15:0].)

Table 4. Selection method for the 11 data inputs of the multiplexer.
| c[0] | c[1] | c[2] | c[3] | c[4] | c[5] | c[6] | c[7] | c[8] | c[9] | MUX output |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | D0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | D1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | D3 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | D5 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | D7 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | D9 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | D8 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | D6 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | D4 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | D2 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | D10 |

According to the linear function expression with 10 piecewise points, the operations in the different ranges can be realized by combinations of multipliers and adders. Each multiplier takes two 16-bit inputs and outputs the sign bit and the high-order 15 bits of the product. For the multipliers and adders of the different ranges, the slopes k_i and the intercepts t_i are the inputs of the multipliers and adders, respectively. Ranges that are symmetric about the y-axis are placed next to each other so that they reuse the multiplier with the same slope value. Five multipliers and nine adders work in parallel and output nine results corresponding to the different ranges. Based on the encoding in Table 4, the multiplexer selects the specified value as the final output. The circuit for the PWLC method has input and output registers, so it produces an output after two clock cycles; the multipliers, adders, and comparators between these registers are combinational circuits. Processing in parallel eliminates the read-after-write dependency between the selection signal and the outputs of the nine adders. Moreover, the parallel operation across the two data paths makes full use of hardware resources and increases computational efficiency, so the computation modules work together at high speed and take little computation time.

5. Results and Comparisons

We list the timing characteristics and hardware resources of our design and of the designs in the referenced papers in Table 5. All FPGA devices in Table 5 belong to the Xilinx Virtex family.
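Before the comparison, the selection scheme of Section 4 can be checked with a short behavioural model. This is a floating-point sketch, not the paper's RTL: the 10 piecewise points in `BREAKPOINTS` and the chord-based coefficients from `stand_in_coefficients` are hypothetical stand-ins (the selected function's coefficients are not listed in this section). The port numbering follows Table 4, and the pairing of symmetric ranges on shared slopes k1 to k5 with intercepts t1 to t9 follows the labels in Figure 3. The model confirms that thermometer-coded comparators, five shared multipliers, and nine adders select the same value as a direct piecewise evaluation.

```python
import math

# Hypothetical symmetric piecewise points b_1..b_10 (illustrative only, not the paper's).
BREAKPOINTS = [-8.0, -4.5, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.5, 8.0]

# Table 4: data port selected when exactly the first n comparator outputs are 1.
PORT_FOR_COUNT = [0, 1, 3, 5, 7, 9, 8, 6, 4, 2, 10]
TABLE4 = {tuple([1] * n + [0] * (10 - n)): port for n, port in enumerate(PORT_FOR_COUNT)}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_slope_index(r):
    """Region r (1..9) and its mirror region 10 - r share multiplier k_min(r, 10 - r)."""
    return min(r, 10 - r)


def stand_in_coefficients(b):
    """Hypothetical coefficients: k_i is the sigmoid chord over region i (i = 1..5);
    t_r makes region r's line pass through (b_{r-1}, sigmoid(b_{r-1}))."""
    k = {i: (sigmoid(b[i]) - sigmoid(b[i - 1])) / (b[i] - b[i - 1]) for i in range(1, 6)}
    t = {r: sigmoid(b[r - 1]) - k[shared_slope_index(r)] * b[r - 1] for r in range(1, 10)}
    return k, t


def datapath(x, b, k, t):
    """One evaluation of the parallel datapath, modelled in floating point."""
    c = tuple(1 if x > bj else 0 for bj in b)                    # comparators, Eq. (20)
    prod = {i: k[i] * x for i in range(1, 6)}                     # five shared multipliers
    adders = {r: prod[shared_slope_index(r)] + t[r] for r in range(1, 10)}  # nine adders
    ports = {0: 0.0, 10: 1.0}                                     # constant saturation ports
    for r in range(1, 10):
        ports[PORT_FOR_COUNT[r]] = adders[r]                      # wire region r to its port
    return ports[TABLE4[c]]                                       # multiplexer


def direct(x, b, k, t):
    """Reference: evaluate the same piecewise function region by region."""
    if x <= b[0]:
        return 0.0
    if x > b[-1]:
        return 1.0
    r = next(i for i in range(1, 10) if x <= b[i])
    return k[shared_slope_index(r)] * x + t[r]


k, t = stand_in_coefficients(BREAKPOINTS)
for x in (-9.0, -3.7, 0.25, 2.0, 8.5):
    print(x, datapath(x, BREAKPOINTS, k, t))
```

Note that the two register stages alone account for the latency reported below: at 208.3 MHz the clock period is about 4.80 ns, and 2 × 4.80 ns ≈ 9.6 ns.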
We also report the minimum input arrival time before clock and the maximum output required time after clock [25,26]: for the proposed circuit, these are 8.559 ns and 8.860 ns, respectively. Because of the high processing-speed requirement, the timing characteristics include clock frequency and circuit latency. The clock frequency of our circuit design is 208.3 MHz, while the end-to-end latency is 9.6 ns. The comparisons of hardware resources cover flip-flops (FFs), look-up tables (LUTs), and digital signal processor (DSP) blocks. Although our clock frequency is lower than those of [10] and [17], our design achieves the lowest latency; the primary reason is our circuit design with two parallel data paths. By devoting more LUTs to parallel processing, our design achieves the lowest latency without using any DSPs. Given this latency advantage, the LUT overhead is acceptable in practical scenarios, whereas the design of Campo [10] requires DSPs, which may exceed the available DSP resources, and that of Gomar [17] uses more FFs.

Table 5. Timing characteristics and hardware resources of different methods.

| Method | Freq./MHz | Lat./ns | Platform | LUT | FF | DSP |
|---|---|---|---|---|---|---|
| Campo [10] | 373.5 | 18.7 | XC6V2000 | 232 | 16 | 6 |
| Gomar [17] | 383.8 | 13.0 | XC4VFX12 | 123 | 71 | 0 |
| Proposed | 208.3 | 9.6 | XC7V2000 | 493 | 32 | 0 |

This paper uses the maximum absolute error and the average absolute error to describe the deviations between the sigmoid function and the piecewise linear fitting function. The maximum error gives the largest deviation, while the average error gives the overall deviation across all samples in the domain. The maximum and average absolute errors of different methods [20] are compared in Table 6.
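For reference, the two metrics can be written out explicitly. This is a sketch of the definitions as they are used here, assuming n sample points x_1 to x_n spaced evenly over the fitting range, with p(x) the piecewise linear fitting function and σ(x) = 1/(1 + e^{−x}) the sigmoid function:

$$
e_{\max} = \max_{1 \le i \le n} \left| p(x_i) - \sigma(x_i) \right|,
\qquad
e_{\mathrm{avg}} = \frac{1}{n} \sum_{i=1}^{n} \left| p(x_i) - \sigma(x_i) \right|
$$

For instance, at x_i = 1 the 14-point function of Eq. (19) gives p_14(1) = 0.73674, while σ(1) ≈ 0.73106, so that sample contributes an absolute error of about 0.0057.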
As Table 6 shows, the proposed method has the smallest maximum absolute error among all methods, and its average absolute error is the third smallest, behind only Armato [16] and Mitra [14]. Moreover, compared with these two methods, our method uses fewer segments in the specified range (9 versus 16 and 14, respectively). This lower number of segments decreases the hardware complexity and reduces hardware resource consumption. In short, our hardware implementation achieves high fitting accuracy with few FFs and no DSPs.

Table 6. Absolute errors of different fitting methods for the sigmoid function.

| Method | Range | Segments | Maximum absolute error | Average absolute error |
|---|---|---|---|---|
| Armato [16] | [−8, 8] | 16 | 0.00788 | 0.00107 |
| Ngah [12] | [−4, 4] | Null | 0.022 | 0.0077 |
| Gomar [17] | [−4, 4] | Null | 0.0087 | 0.0058 |
| Mitra [14] | [−9.35, 9.35] | 14 | 0.0127 | 0.0015 |
| Zamanlooy [19] | [−8, 8] | 6 | 0.0189 | 0.0059 |
| Campo [10] | [−4.59, 4.59] | 12 | 0.028 | 0.0043 |
| Savich [15] | [−8, 8] | 5 | 0.0679 | 0.0263 |
| Pan [21] | [−5, 5] | 7 | 0.0189 | 0.00587 |
| Nguyen [20] | [−5, 5] | 12 | 0.0125 | 0.0042 |
| Proposed | [−8, 8] | 9 | 0.00784 | 0.0016 |

This paper further applies the piecewise linear fitting function to recognize handwritten digits in the MNIST dataset with a specified deep neural network (DNN) and a convolutional neural network (CNN). The structures of the DNN (five fully connected hidden layers) and the CNN (two convolutional layers, two pooling layers, and two fully connected layers) are given in Table 7. Based on these DNN and CNN structures, this paper compares the recognition accuracies of different fitting methods on the MNIST dataset. Recognition accuracy directly reflects the practical effect of a linear fitting method; a higher recognition accuracy indicates that the design is more reliable in practical use.

Table 7. Layer names and sizes of DNN and CNN.
| DNN layer name | DNN layer size | CNN layer name | CNN layer size |
|---|---|---|---|
| Input | 784 | Input | 1 × 32 × 32 |
| Hidden1 | 576 | Conv1 | 6 × 28 × 28 |
| Hidden2 | 450 | Pool1 | 6 × 14 × 14 |
| Hidden3 | 300 | Conv2 | 12 × 10 × 10 |
| Hidden4 | 120 | Pool2 | 12 × 5 × 5 |
| Hidden5 | 80 | FullyCon1 | 300 |
| Output | 10 | FullyCon2 | 120 |
| | | Output | 10 |

According to Table 8, the hardware implementation of the linear fitting function proposed in this paper achieves higher recognition accuracy than the other fitting methods. Compared with the second-highest accuracy, obtained by Nguyen [20], our PWLC method increases the accuracy by 0.06 percentage points with the DNN and by 0.23 percentage points with the CNN. Moreover, with the DNN, the linear fitting function circuit achieves an even higher recognition accuracy than the original sigmoid function (97.51% versus 97.37%); with the CNN, the original sigmoid function remains the most accurate (98.96%). This may be because all the intermediate layers of the DNN are fully connected and the linear fitting function is expressed in a piecewise hierarchical form, which may facilitate precise feature extraction with discrete values and result in high recognition rates. The original nonlinear sigmoid function, for its part, may aggravate error propagation during inference. This would explain why the DNN recognition rate with the linear fitting function exceeds that with the original nonlinear sigmoid function.

Table 8. Accuracies of DNN and CNN with different fitting methods.

| Method | Range | Segments | DNN accuracy/% | CNN accuracy/% |
|---|---|---|---|---|
| Sigmoid | (−∞, +∞) | Null | 97.37 | 98.96 |
| Armato [16] | [−8, 8] | 16 | 97.38 | 98.26 |
| Ngah [12] | [−4, 4] | Null | 97.37 | 98.35 |
| Gomar [17] | [−4, 4] | Null | 97.3 | 98.24 |
| Mitra [14] | [−9.35, 9.35] | 14 | 97.35 | 98.29 |
| Zamanlooy [19] | [−8, 8] | 6 | 97.36 | 98.21 |
| Campo [10] | [−4.59, 4.59] | 12 | 97.34 | 98.27 |
| Savich [15] | [−8, 8] | 5 | 96.61 | 97.85 |
| Nguyen [20] | [−5, 5] | 12 | 97.45 | 98.42 |
| Proposed | [−8, 8] | 9 | 97.51 | 98.65 |

6. Conclusions

This paper proposes PWLC to calculate the expression of the piecewise linear fitting function for the sigmoid function, compare the absolute errors of different fitting-function expressions, and realize a hardware acceleration scheme with high fitting accuracy. Based on the characteristics of the curvature graph and a systematic sampling method, abscissas of the sigmoid function graph are dynamically selected as candidate piecewise points. After comparing the maximum and average absolute errors of the linear fitting function with different numbers of piecewise points, we choose the elbow point of the absolute error curve for our hardware implementation. Moreover, because hardware resource consumption is mainly related to the number of segments of the piecewise function, the circuit design in this paper achieves low absolute errors while keeping hardware resource consumption at a moderate level. The design is built from LUTs and FFs and does not require any DSPs, so this implementation of the sigmoid function does not consume DSP resources. We further perform the arithmetic operations and data comparisons for the different ranges in parallel, which markedly accelerates processing. These parallel paths, consisting of multipliers, adders, and comparators, achieve low latency and high processing speed. With this operation-level parallelism, the circuit latency is very low while the fitting error remains low, at the cost of a slight increase in hardware resource consumption. In the future, it will be valuable to apply the piecewise linear fitting algorithm and hardware module design proposed in this paper to other activation functions in other neural networks.

Author Contributions: Conceptualization, Z.L.; funding acquisition, Z.X.; investigation, Z.L.; methodology, Z.L.
and Y.Z.; resources, Y.Z. and Q.W.; software, B.S. and Q.W.; supervision, Y.Z. and B.S.; writing—original draft, Z.L.; writing—review and editing, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the National Natural Science Foundation of China under Grants 61874140 and 62002365.

Data Availability Statement: The MNIST dataset is available at http://yann.lecun.com/exdb/mnist/ (accessed on 22 March 2022).

Acknowledgments: We thank our reviewers and editors for their valuable time.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Dubey, S.R.; Singh, S.K.; Chaudhuri, B.B. A Comprehensive Survey and Performance Analysis of Activation Functions in Deep Learning. arXiv 2021, arXiv:2109.14545.
2. Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.N.; Samsi, S.; Kepner, J. AI Accelerator Survey and Trends. In Proceedings of the 2021 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 20–24 September 2021; pp. 1–9.
3. Wang, E.; Davis, J.J.; Zhao, R.; Ng, H.C.; Niu, X.; Luk, W.; Cheung, P.Y.K.; Constantinides, G.A. Deep Neural Network Approximation for Custom Hardware. ACM Comput. Surv. 2019, 52, 1–40.
4. Ghimire, D.; Kil, D.; Kim, S.h. A Survey on Efficient Convolutional Neural Networks and Hardware Acceleration. Electronics 2022, 11, 945.
5. Papaphilippou, P.; Luk, W. Accelerating Database Systems Using FPGAs: A Survey. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), Dublin, Ireland, 27–31 August 2018; pp. 125–1255.
6. Chiluveru, S.R.; Tripathy, M.; Mohapatra, B. Accuracy controlled iterative method for efficient sigmoid function approximation. IET Electron. Lett. 2020, 56, 914–916.
7. Lin, Z.; Sinha, S.; Liang, H.; Feng, L.; Zhang, W. Scalable Light-Weight Integration of FPGA Based Accelerators with Chip Multi-Processors. IEEE Trans. Multi-Scale Comput. Syst. 2018, 4, 152–162.
8. Namin, A.H.; Leboeuf, K.; Muscedere, R.; Wu, H.; Ahmadi, M. Efficient hardware implementation of the hyperbolic tangent sigmoid function. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Taipei, China, 24–27 May 2009; pp. 2117–2120.
9. Chen, H.; Jiang, L.; Yang, H.; Lu, Z.; Fu, Y.; Li, L.; Yu, Z. An Efficient Hardware Architecture with Adjustable Precision and Extensible Range to Implement Sigmoid and Tanh Functions. Electronics 2020, 9, 1739.
10. Campo, I.D.; Finker, R.; Echanobe, J.; Basterretxea, K. Controlled accuracy approximation of sigmoid function for efficient FPGA-based implementation of artificial neurons. Electron. Lett. 2013, 49, 1598–1600.
11. Nascimento, I.; Jardim, R.; Dias, F.M. A new solution to the hyperbolic tangent implementation in hardware: Polynomial modeling of the fractional exponential part. Neural Comput. Appl. 2012, 23, 363–369.
12. Ngah, S.; Bakar, R.B.A. Sigmoid Function Implementation Using the Unequal Segmentation of Differential Lookup Table and Second Order Nonlinear Function. J. Telecommun. Electron. Comput. Eng. 2017, 9, 103–108.
13. Jin, R.; Jiang, J.; Dou, Y. Accuracy Evaluation of Long Short Term Memory Network Based Language Model with Fixed-Point Arithmetic; Springer International Publishing: Cham, Switzerland, 2017; pp. 281–288.
14. Mitra, S.; Chattopadhyay, P. Challenges in implementation of ANN in embedded system. In Proceedings of the International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), Chennai, India, 3–5 March 2016; pp. 1794–1798.
15. Savich, A.W.; Moussa, M.A.; Areibi, S. The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study. IEEE Trans. Neural Netw. 2007, 18, 240–252.
16. Armato, A.; Fanucci, L.; Scilingo, E.P.; Rossi, D.D. Low-error digital hardware implementation of artificial neuron activation functions and their derivative. Microprocess. Microsyst. 2011, 35, 557–567.
17. Gomar, S.; Mirhassani, M.; Ahmadi, M. Precise digital implementations of hyperbolic tanh and sigmoid function. In Proceedings of the Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Pacific Grove, CA, USA, 6–9 November 2016; pp. 1586–1589.
18. Pandit, B.K.; Banerjee, A. VLSI Architecture of Sigmoid Activation Function for Rapid Prototyping of Machine Learning Applications. In Proceedings of the 2021 IEEE International Symposium on Smart Electronic Systems (iSES), Jaipur, India, 18–22 December 2021; pp. 117–122.
19. Zamanlooy, B.; Mirhassani, M. An Analog CVNS-Based Sigmoid Neuron for Precise Neurochips. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 894–906.
20. Nguyen, V.T.; Jueping, C.; Linyu, W.; Jie, C. Low complexity probability-based piecewise linear approximation of the sigmoid function. J. Xidian Univ. 2020, 47, 58–65.
21. Pan, Z.; Gu, Z.; Jiang, X.; Zhu, G.; Ma, D. A Modular Approximation Methodology for Efficient Fixed-Point Hardware Implementation of the Sigmoid Function. IEEE Trans. Ind. Electron. 2022.
22. Liang, Y.; Lu, L.; Xiao, Q.; Yan, S. Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 857–870.
23. Wei, X.; Liang, Y.; Li, X.; Yu, C.H.; Zhang, P.; Cong, J. TGPA: Tile-Grained Pipeline Architecture for Low Latency CNN Inference. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Marrakech, Morocco, 19–21 March 2018; pp. 1–8.
24. Xilinx. Vivado Design Suite Tutorial: Design Flows Overview (UG888). 2021. Available online: https://www.xilinx.com/content/dam/xilinx/support/documents/sw_manuals/xilinx2021_1/ug888-vivado-design-flows-overview-tutorial.pdf (accessed on 22 March 2022).
25. Kumar, A.; Sharma, P.; Gupta, M.K.; Kumar, R. Machine Learning Based Resource Utilization and Pre-estimation for Network on Chip (NoC) Communication. Wirel. Pers. Commun. 2018, 102, 2211–2231.
26. Kumar, A.; Verma, G.; Gupta, M.K.; Salauddin, M.; Rehman, B.K.; Kumar, D. 3D Multilayer Mesh NoC Communication and FPGA Synthesis. Wirel. Pers. Commun. 2019, 106, 1855–1873.