FUELS Concerns about diminishing fossil fuel reserves, climate change, and national security have prompted considerable research into alternative, environmentally friendly processes for producing liquid transportation fuels. Over the past decades, industry and academia have sought biofuels with desirable combustion properties and strong competitiveness to partially or completely replace traditional petroleum-derived gasoline and diesel. The cetane number (CN) is one of the most important indicators for evaluating how a fuel combusts in an engine; it is correlated with the ignition delay measured from the start of fuel injection. In general, the higher a fuel’s cetane number, the shorter its ignition delay period. The Cooperative Fuel Research (CFR) engine and the Ignition Quality Tester (IQT) are the two most widely used methods for determining CN. The American Society for Testing and Materials (ASTM) Standard D613 (ASTM D613, 2015) uses the single-cylinder CFR engine for determination of CN, while ASTM Standard D7170 (ASTM D7170-16, 2016) and ASTM Standard D6890 (ASTM D6890, 2015) utilize the IQT. Both methods provide accurate CN measurements: the CFR engine most closely reflects a fuel’s actual combustion behaviour in an engine, whereas the IQT offers a faster measurement process with lower volumetric requirements. Clearly, synthesizing new biofuel molecules one by one to test their CN would consume a huge amount of time and money. A predictive model driven by experimental data that approximately represents the structure-CN relationship of molecules is therefore desirable, since it can help chemists screen molecules and speed up the search for new fuels with good performance.
With the rapid advance of machine learning, predicting the properties of molecules from their structures has become a natural approach and has been extensively investigated. Yang used backpropagation neural networks to predict the CN of iso-paraffins and diesel fuels based on quantitative structure-property relationships (QSPR) (Yang H, 2001). Taylor used the QSARIS software to calculate more than 100 molecular descriptors for 275 compounds as candidate inputs to a predictive model of CN. The software then used a genetic algorithm to determine which descriptors were most relevant to CN and built a predictive model (RMSE = 9.1 CN units) (Taylor J, 2004). Although this model is not accurate enough, Taylor's work provided QSPR inputs for later research on CN prediction and indicated which molecular descriptors are crucial. For example, Kessler retained 15 molecular descriptors to build a backpropagation neural network model for predicting the cetane number of furanic biofuel additives, with a total RMSE of 5.97 for the core data set of 284 molecules (Kessler T, 2017). Other approaches, such as an inverse function method (Smolenskii E, 2008) and genetic function approximation (GFA) (Creton B, 2010), have also been used to predict CN. These methods make it possible to describe the relationship between molecular descriptors and properties accurately within a single chemical family of similar chemical properties, over the tested range. Although the aforementioned methods are reasonably robust and can fit the relationship between CN and molecular descriptors quite accurately, the question of which molecular descriptors to include in a predictive model of CN is easily overlooked. To some extent, human choices may limit the performance of the predictive model, since these choices are bounded by human knowledge.
AlphaGo Zero, which beat AlphaGo once the constraints of human knowledge were removed, is an instructive example. Under these circumstances, it is both important and interesting to formulate a surrogate-model-building method for predicting molecular CN that selects the molecular descriptors automatically while guaranteeing prediction accuracy. Building on the strengths of symbolic regression and mathematical programming, the objective of this paper is to propose a novel machine learning method that explicitly correlates CN with molecular descriptors.
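The RMSE values quoted for the earlier models (9.1 and 5.97 CN units) measure the typical deviation between measured and predicted CN. As a point of reference, a minimal sketch of that metric follows; the function name `rmse` and the four sample values are illustrative assumptions and are not taken from any of the cited studies.

```python
import math


def rmse(measured, predicted):
    """Root-mean-square error between measured and predicted values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each residual, average the squares, then take the square root.
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical CN values, for illustration only (not experimental data).
measured_cn = [50.0, 40.0, 60.0, 30.0]
predicted_cn = [53.0, 36.0, 60.0, 35.0]
error = rmse(measured_cn, predicted_cn)
print(f"RMSE = {error:.2f} CN units")
```

Because the residuals are squared before averaging, a few badly predicted molecules raise the RMSE more than many small errors do, which is worth keeping in mind when comparing models trained on different data sets.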