>> Wolfram Schulte: Good afternoon, everyone. I'm Wolfram Schulte, and it's my pleasure to host Victor Kuliamin for today's talk on the practice of standard formalization. I met Victor at various testing conferences, and I was genuinely impressed by the breadth of work they are doing on testing open standards, which is what we will hear about today. He is part of the Institute for System Programming, which has probably more than 20 people working on testing various aspects, from hardware to mobile devices to browsers, and he will mainly report on testing in the Linux context. So I look forward to your talk. Thanks for joining us, Victor. >> Victor Kuliamin: Thank you for the invitation. I was actually in doubt about how to focus my presentation, because we do a lot of work in these directions and it is very broad. So I will try to provide a brief overview of this work and then speak in more detail about one small part of it, concerning testing of mathematical libraries in Linux, or in Unix systems as a whole. First I will talk about the background of our institute concerning, of course, testing. One of the first big projects performed in our institute was the development of a regression test suite for a real-time operating system, an operating system for telecommunication networks, and since then we have continued to perform projects related to testing of operating systems from different sources. We conducted such testing for a real-time operating system developed in Russia, which is also POSIX-compatible and used in some defense applications, as far as I know. We conduct test development for the Linux Standard Base. And we also developed a conformance test suite for the ARINC 653 standard, which is a standard on process management, time management, and partitioning in real-time operating systems in avionics. We also have some background in compiler testing. Most notable are the projects related to the description of static semantics and the development of conformance tests for the C and Java languages, performed in 2002-2004, and a project on testing the optimizing units of Intel compilers, where the main problem was that Intel could generate a lot of tests, gigabytes of tests, but they were very poorly targeted at the optimizing units. So they needed tests that execute the code of the optimizing units, but are designed as general programs and provided as inputs to the compiler as a whole. We succeeded in doing this, and we reported to them about a dozen bugs in Intel compilers as a result of this project. Then we also have a background in protocol testing, starting from testing the Microsoft Research implementation of the IPv6 protocol, then development of a test suite for IPv6 also for Microsoft Windows CE, and then we also developed a test suite for the IPsec protocol; this project is still in progress, it is not finished yet. And one of the latest additions to our profile is hardware testing: we participate in the development of test suites for MIPS-based general-purpose microprocessors with some DSP extensions. So this is the general background of the kinds of projects we perform. Now about the technologies we use. We mostly focus on model-based testing and static analysis technologies. In model-based testing, we developed the so-called KVEST technology, which was used for testing that network operating system; it was later rebranded as the UniTESK technology, which is continuously developed by adding extensions for different application domains.
The general scheme of model-based testing, as I may recall for the audience, is the following. We have some system under test, and we have some model of its behavior, on the basis of which we construct test coverage criteria for further testing. From the behavior model we can automatically produce test oracles that evaluate the results of our testing, and maybe a model of the state of the system under test for more precise evaluation and also for tracking coverage more precisely. From the coverage criteria, we automatically produce a coverage metric to measure the adequacy of our tests. Then we also construct a test generator, which can use all these modules to perform testing and to perform immediate evaluation of its adequacy. This is the most general scheme, and it can be refined and simplified in different ways. For example, we can skip the oracle and provide just the test results for further evaluation by people. Or we can skip on-the-fly test execution and just generate tests for later execution. Or we can leave out the state model and generate tests only on the basis of the coverage criteria used. Actually, all these variants are used in different contexts and in different projects in our institute. Then, one of the main activities of our institute in the last five years is participation in Linux standardization. It is all connected with the large number of different distributions and versions of Linux. As far as I know, there are more than 500 official different distributions of Linux. And the main problem, of course, is how to provide applications for all these distributions. Do we need different applications, different development, for each of the distributions? Of course, if we needed to do so, it would be very hard. So one of the proposed solutions to this problem is the Linux Standard Base, which is a binary interface standard; it supposes that we can simply move applications between different distributions conforming to the standard without any recompilation, just in binary form, if we use the same hardware platform as a base. It supports, to my mind, eight or nine different hardware platforms, and for each hardware platform it defines a set of binary interfaces which provide such interoperability between different distributions. This standard includes a lot of more specific standards like POSIX or X/Open, and it also includes a lot of de facto standard libraries used in many Linux applications, like libxml or Qt and so on; and not only C-language libraries, but also the JDK and libraries for other widely used programming languages. For the C language, it now covers about 40,000 functions, which are not all specified in a uniform way. About 2,000 functions, which mostly come from the POSIX or X/Open standards, are described more accurately, more precisely. Other functions are not so well defined, and some of them have no documentation at all; some of them have, for example, one line of description -- this function does something -- and that's all. So if we need to check conformance to such a standard, of course it is a problem and it is a hard task. And what do we perform in relation to this Linux Standard Base activity? The institute is actually responsible for the development of the LSB infrastructure.
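As a brief aside, the general model-based testing scheme described above can be sketched in code. The following is a minimal illustration, not part of the KVEST or UniTESK tooling: a trivial bounded counter plays the system under test, and the model state, the postcondition oracle, and the coverage bookkeeping are hypothetical names introduced only for this example.

```c
/* Minimal model-based testing sketch: a bounded counter as the system
   under test, a model state, a postcondition-based oracle, and a
   generator loop that tracks coverage of the postcondition branches.
   All names here are hypothetical illustrations. */
#include <stdio.h>

#define LIMIT 10

/* --- system under test --- */
static int counter = 0;
static int counter_inc(void) {          /* returns 1 on success, 0 if full */
    if (counter >= LIMIT) return 0;
    counter++;
    return 1;
}

/* --- model state --- */
static int model = 0;

/* --- coverage: one goal per branch of the postcondition --- */
static int covered[2] = {0, 0};

/* --- oracle derived from the postcondition --- */
static int oracle_inc(int result) {
    if (model < LIMIT) {                /* branch 1: increment succeeds */
        covered[0] = 1;
        model++;
        return result == 1 && counter == model;
    } else {                            /* branch 2: counter is saturated */
        covered[1] = 1;
        return result == 0 && counter == LIMIT;
    }
}

int main(void) {
    /* --- generator: apply stimuli until all coverage goals are reached --- */
    for (int step = 0; step < 100 && !(covered[0] && covered[1]); step++) {
        int r = counter_inc();
        if (!oracle_inc(r)) {
            printf("FAIL at step %d\n", step);
            return 1;
        }
    }
    printf("coverage: %d/2 goals\n", covered[0] + covered[1]);
    return 0;
}
```

In the real technologies, the oracle and the coverage goals are derived automatically from formal specifications rather than written by hand as in this toy example.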
The LSB infrastructure includes the standard database, which has data on all the standardized library operations, on their use in distributions, and on the tests already developed for them. It also includes conformance checking tools, both for distributions and for applications; for distributions these tools are based on static analysis and testing, and for applications on testing again and on monitoring, or dynamic verification. We also develop an information system for LSB evolution, which supports analytical activities concerning further decisions on the development of the standard -- for example, evaluating which libraries are widely used in different applications and which are not. We also perform a project on Linux driver verification, which is based on a technology very similar to the one developed at Microsoft Research and now used, to my mind, in the Static Driver Verifier at Microsoft. We use the BLAST tool, which is very similar to SLAM, but we also have some additional instrumentation based on aspect-oriented techniques, so that we can define a model of the system and a model of the driver in a non-invasive way, without changing the code of the driver. Now, if we have such a huge number of interfaces, how can we test them? Actually, we cannot do it all with the same accuracy. For the 2,000 functions which are described rather accurately, we provide conformance testing on the basis of the UniTESK technology. For less strictly but nevertheless well-defined functions, we perform manual test development augmented with some support for requirements traceability and for parameterized testing, for example; so some things are automated, but the main decisions are made by people in this test development. And for the other functions, we use a specialized test construction method which is based on the information already stored in the standard database. This database has information on function signatures, that is, on the data types of the parameters and results, and the method is based on the idea of refining the information on data types: we put into the database some additional information on how a value of such a data type can be correctly initialized and finalized, and then a generation tool simply constructs tests on the basis of this information. It just puts some initialization procedures to construct the parameters, then a call to the tested function, and then some finalization. Of course, these tests are not accurate and not good, but they provide at least some testing that we can do for such a great number of functions; before that, they had no tests at all. Maybe a few words, very briefly, about the UniTESK test development technology. >>: Are you presenting the findings of your testing efforts as well? How many faults? How great is the variety between the different Linux distributions? What are the results of your findings? >> Victor Kuliamin: Actually, there are different kinds of findings. The first kind is inaccuracies and issues in the standards themselves. For example, we reported about two or three dozen issues to the POSIX team that were accepted by them, and we have, to my mind, about the same number of other reported issues on the POSIX standard. Also, there are, to my mind, hundreds of various issues found in different distributions and libraries: about two or three dozen problems in the glibc library, which were also accepted, and about 50 or 60 defects in other widely used libraries on Linux. And concerning differences between distributions, there are differences, but they are not reported as specific issues. Usually they are connected with the use of different versions of libraries, or with the fact that in some distribution a patch may be applied and in another it may not be, and things like that. So usually, if we find such an issue, we transform it into a report on a bug in a library or an issue related to the usage of patches for some library. Most of the issues found are published on our site, linuxtesting.org.
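Returning for a moment to the specialized test construction mentioned above for the poorly documented interfaces: the generator only knows, from the standard database, how each parameter type can be initialized and finalized. A generated sanity test might look roughly like the sketch below; the function under test and the init/finalize helpers are hypothetical, not taken from the actual LSB database.

```c
/* Sketch of a shallow, automatically generated "sanity" test: initialize
   the arguments from type information, call the function under test,
   check only that the call does not crash, then finalize.  The function
   xmlExampleOp and the init/fini helpers are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical function under test: takes a string and a buffer size */
static int xmlExampleOp(const char *name, size_t len) {
    return (name != NULL && len > 0) ? 0 : -1;
}

/* init/finalize procedures stored per data type in the database */
static char  *init_string(void)    { return strdup("test"); }
static void   fini_string(char *s) { free(s); }
static size_t init_size(void)      { return 16; }

int main(void) {
    char  *arg0 = init_string();    /* initialize parameter of type char*  */
    size_t arg1 = init_size();      /* initialize parameter of type size_t */

    int rc = xmlExampleOp(arg0, arg1);   /* the call itself is the test */
    printf("xmlExampleOp returned %d (no crash)\n", rc);

    fini_string(arg0);              /* finalize */
    return 0;
}
```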
Okay. Several words about the UniTESK technology. It is based on the idea of so-called stateful contracts, where we define the behavior of a closely related group of operations in terms of a model state and the pre- and postconditions of its operations. Then we can use the structure of the postcondition as a source of coverage criteria, because in most practical cases the postcondition is not a single predicate but consists of a lot of different branches, and each of these branches can be considered a coverage goal. Then we perform a coverage-targeted abstraction of the state to construct a finite state machine of the group, and perform on-the-fly exploration of this finite state machine to obtain the tests. Now I wish to be more specific about the testing of mathematical libraries, because it also illustrates some aspects of the formalization of standards and some issues you can find along the way. If you ask how one can test implementations of mathematical functions, first of all you need to look at the existing standards that describe requirements for these implementations. There are four such standards with slightly different coverage; for example, IEEE 754 provides requirements only on some basic mathematical operations. These standards also provide slightly different, not fully consistent sets of requirements. So, to recall some details of floating point arithmetic: a floating point number type is specified by two integers, the number of bits used for the number as a whole and the number of bits used for the exponent. If the exponent consists neither of all zeros nor of all ones, the value is a so-called normal number and is calculated by the formula on the slide. If the exponent consists of all zeros, it is a denormal (subnormal) number and is calculated by a slightly different formula, and you may note that there is such an entity as minus zero among the floating point numbers. The IEEE standard tries not to distinguish zero and minus zero, but in some specific cases there are differences, and they are directly specified in the standard. And if the exponent consists of all ones, the corresponding value is an exceptional one: it may represent infinity, or a so-called NaN, not-a-number, which is used to represent results that cannot be correctly stated as finite or infinite. So the main floating point types have the parameters shown on the slide. In addition, the IEEE 754 standard provides the following requirements on computations. It describes the behavior of the basic arithmetic operations, the square root function, fused multiply-add, and some type conversions. In many cases the result of a computation cannot be represented exactly as a floating point number; in this case you should somehow round the exact result to a representable number, and for this purpose the standard defines four different rounding modes, which prescribe the direction of rounding in each case.
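Stepping back for a moment, the number format just described can be illustrated concretely. The following sketch (not part of the test suites themselves) decodes an IEEE 754 double into its sign, exponent, and mantissa fields and recomputes its value using the formulas for normal and subnormal numbers.

```c
/* Decode an IEEE 754 double (1 sign bit, 11 exponent bits, 52 mantissa
   bits) and recompute its value from the fields:
     normal    (0 < e < 2047): (-1)^s * 1.m * 2^(e-1023)
     subnormal (e == 0)      : (-1)^s * 0.m * 2^(-1022)
     e == 2047               : infinity (m == 0) or NaN (m != 0)
   Compile with -lm for ldexp. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

static void decode(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);          /* reinterpret the bits safely */

    uint64_t sign = bits >> 63;
    uint64_t exp  = (bits >> 52) & 0x7FF;
    uint64_t mant = bits & 0xFFFFFFFFFFFFFULL;

    printf("%.17g: sign=%llu exp=%llu mant=0x%013llx -> ",
           d, (unsigned long long)sign, (unsigned long long)exp,
           (unsigned long long)mant);

    if (exp == 0x7FF) {
        printf(mant ? "NaN\n" : "infinity\n");
        return;
    }
    double frac = (double)mant / 4503599627370496.0;   /* mant / 2^52 */
    double value;
    if (exp == 0)                      /* subnormal; note +0.0 and -0.0 */
        value = ldexp(frac, -1022);
    else                               /* normal: implicit leading 1 */
        value = ldexp(1.0 + frac, (int)exp - 1023);
    if (sign) value = -value;
    printf("reconstructed %.17g\n", value);
}

int main(void) {
    decode(1.0); decode(-0.0); decode(5e-324); decode(INFINITY); decode(NAN);
    return 0;
}
```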
The standard also provides restrictions on -- >>: To the nearest -- in the two cases, when the result is exactly in between, does it go towards zero or towards even? I thought there were five modes. >> Victor Kuliamin: What? I don't understand. >>: You're saying four rounding modes. I'm familiar with five; maybe the fifth one is not in IEEE 754. The fifth one, I believe, splits "to the nearest" into two cases, depending on what happens when the result is precisely between the two. >> Victor Kuliamin: I see. Yes. Sometimes a fifth mode is mentioned, but in the standard there is only one to-the-nearest mode, and it says that if the exact result is just in the middle between two floating point numbers, then we should round it to the number with a zero in the last bit, that is, ties go to the even neighbor. So, in addition, this standard also provides requirements concerning the cases where operations should return NaN or infinite results, mostly natural ones, and it also specifies five flags that can be set in the case of exceptions, to signal the specific situations that occurred during the computation. If you consider the C language standard and the POSIX standard, they provide requirements for the library functions implementing various mathematical functions, and they mostly try to extend the IEEE requirements to these functions. For example, they specify exact values of these functions at some points; they specify the points where these functions should return infinite results and should set the divide-by-zero flag; and they also specify the situations when NaN results should be returned and the invalid flag should be set. Besides that, POSIX also provides some specific requirements. For example, it says that if a function is asymptotically equal to its argument in the neighborhood of 0, then for denormal arguments it should return the argument itself. This is in tension with the rounding modes, because the requirement applies in every rounding mode. It is not a formal contradiction, because the IEEE standard does not say anything about such functions, but it does contradict the natural extension of the IEEE requirements you might imagine. Another example of such an inconsistency between the POSIX approach and the IEEE approach concerns the values that should be returned in case of overflow. POSIX, again, requires that one and the same value, the HUGE_VAL constant, should be returned independently of the rounding mode, and it says nothing about the specific value of this constant; actually, different libraries use different values. This is an obvious source of non-interoperability between implementations of mathematical functions: if you perform some calculations on one platform and get such results, they can be incorrectly interpreted on another platform with a different value of this constant. One more example of an inconsistency between POSIX and IEEE is the POSIX requirement that some functions should return non-NaN results for NaN arguments; for example, the maximum of NaN and any number should be equal to this number, not NaN, as it would be if we tried to extend the IEEE policy on NaNs. So these standards are not in good correspondence with each other. You can also look at the full requirements of POSIX for the sine function: there are some general descriptions and some requirements on return values in certain situations, and you can note that there are no words at all about inaccuracy or precision of calculations. So any function that satisfies these few requirements, whatever other values it returns, can be considered a correct implementation of the sine function.
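The tensions just described can be observed directly from a small program. The following is an exploratory sketch, not a conformance test: what it prints depends entirely on the libm in use, and the chosen argument values are arbitrary.

```c
/* Exploratory sketch of the POSIX-vs-IEEE tensions described above.
   Compile with -lm; the output depends on the libm being used. */
#include <stdio.h>
#include <math.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void) {
    double tiny = 0x1p-1060;            /* a subnormal argument */

    /* POSIX: for subnormal x, sin(x) should return x itself, in every
       rounding mode.  Correct rounding toward -infinity would instead
       give the predecessor of x, since sin(x) < x for x > 0. */
    fesetround(FE_DOWNWARD);
    printf("sin(tiny) == tiny ? %d\n", sin(tiny) == tiny);
    fesetround(FE_TONEAREST);

    /* POSIX overflow policy: one and the same value, HUGE_VAL, is
       returned regardless of the rounding mode. */
    printf("exp(1000.0) = %g, HUGE_VAL = %g\n", exp(1000.0), HUGE_VAL);

    /* POSIX/C99 NaN policy for fmax: the non-NaN operand wins. */
    printf("fmax(NAN, 1.0) = %g\n", fmax(NAN, 1.0));
    return 0;
}
```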
There is also the so-called standard on language-independent arithmetic, LIA, which is the most strict standard on mathematical libraries. It provides a rather accurate set of requirements, starting from preservation of the sign of the mathematical function by its implementation, preservation of monotonicity, and boundaries on accuracy. It also specifies that these boundaries and restrictions can be applied to trigonometric functions only for small arguments, where the period is much larger than the distance between neighboring floating point numbers, because outside of this interval the function values become just a scattered set of points. This standard also provides restrictions on symmetries of the implementation, on exact values, and on asymptotics, usually near 0, because near 0 the density of floating point numbers is much higher; and there are some restrictions on inequality relations between such functions as the exponential and the exponential minus 1, or the hyperbolic sine and cosine functions, and so on. So if we look at the whole list of requirements from the different standards, we can note that the single requirement to perform correct rounding of the exact result according to the current rounding mode is sufficient to infer most of them. Actually, this requirement is very good for standardization, because it provides the best computational accuracy you can achieve: all the accuracy issues in applications using mathematical libraries should then be investigated as properties of the algorithms used, not as related to inaccuracies of the libraries themselves. It also provides perfect interoperability between different libraries and applications: you can get the same results from different implementations if all of them satisfy this requirement. It is sometimes considered rather hard to implement, but there are examples where it is achieved; for example, the crlibm library developed at INRIA tries to implement correct rounding for all elementary functions. The Intel implementation of the glibc mathematical library for the Itanium platform does not achieve this, but comes very close: for most functions it shows only one or two ulp errors. So when we construct tests, we base them on this requirement of correct rounding. We also try to extend the ideas of the IEEE standard in a natural way, because it is basic for floating point computations, and to extract the corresponding requirements on NaN results, infinite results, and exception flags for all the mathematical functions. Then, after we can say what we should check with our tests, the next thing to define is what data we should use. For testing mathematical functions, we use three sources of data. First, there is the bit structure of floating point numbers, which defines some natural boundaries like 0, infinity, the largest positive number and so on, and which can also be used to construct numbers corresponding to specific patterns of mantissa bits; these often become a source of subtle errors in implementations. For example, the very well known Pentium division bug was related to a specific pattern of bits in the arguments, and we found, for example, such a bug in the glibc implementation of an integer rounding function, also related to a specific bit pattern. The two other sources of test data are intervals of uniform behavior of a function, and points where it is rather hard to calculate the correctly rounded result, because it requires more precision than on average.
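As an aside, one way to check the correct-rounding requirement discussed above is to compare libm results against a high-precision reference. Here is a minimal sketch using the MPFR library (mentioned later in the talk as one of the reference tools); the 200-bit working precision is an arbitrary choice, and strictly speaking rounding twice, first to 200 bits and then to 53, can itself disagree with true correct rounding on rare hard cases.

```c
/* Compare libm's sin() with a high-precision MPFR reference rounded to
   double, to nearest.  Compile with: -lmpfr -lgmp -lm
   The 200-bit working precision is an arbitrary choice; in rare "hard"
   cases double rounding through 200 bits may itself be wrong. */
#include <stdio.h>
#include <math.h>
#include <mpfr.h>

static double sin_reference(double x) {
    mpfr_t t;
    mpfr_init2(t, 200);                  /* 200-bit working precision   */
    mpfr_set_d(t, x, MPFR_RNDN);         /* x is exact in 200 bits      */
    mpfr_sin(t, t, MPFR_RNDN);
    double r = mpfr_get_d(t, MPFR_RNDN); /* round to double, to nearest */
    mpfr_clear(t);
    return r;
}

int main(void) {
    double xs[] = { 0.5, 1.0, 1e22, 3.14159265358979, 1e300 };
    for (int i = 0; i < 5; i++) {
        double got = sin(xs[i]);
        double ref = sin_reference(xs[i]);
        printf("sin(%-22.17g) libm=% .17g ref=% .17g %s\n",
               xs[i], got, ref, got == ref ? "OK" : "DIFF");
    }
    return 0;
}
```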
>>: What do you mean by uniform behavior? >> Victor Kuliamin: I will explain on this slide. By uniform behavior I mean intervals where the function preserves its sign or some kind of monotonicity, or has some nice asymptotics on that interval. So actually the whole axis can be partitioned into a set of such intervals according to these ideas, specific to each function, of course. And even for functions of several arguments the same idea works: you partition not the axis but the whole plane or space into a set of areas where the function has slightly different behaviors. >>: And what is the number of such intervals, in general? >> Victor Kuliamin: It depends. For some functions like square root or exponential, there are a few dozen intervals. For functions like the trigonometric sine and cosine, there are millions of intervals, but not all of them can be chosen for testing. For such complex functions we perform some additional filtering of these intervals, trying to look only at intervals where the behavior is most specific. For example, we try to look at the periods of sine where it comes closest to the value 1 or minus 1, or to 0. I will say more about this later, but from all the periods of the sine function we choose several thousand periods which contain some specific behavior and use them as such intervals. Then the third source is so-called hard points, where the exact result of the function lies, for example, very close to the midpoint between two consecutive floating point numbers, so that we need a very precise computation to round the result correctly. A similar picture can be observed when we use the so-called directed rounding modes, towards 0 or towards the infinities, but the pattern of the result is slightly different: for round-to-nearest the exact result is close to the midpoint of two floating point numbers, while for the directed rounding modes the exact result should be close to some floating point number itself. There are some examples of these hard points; for example, at this point the correct rounding of the function value can be performed only if we calculate 80 additional bits beyond the mantissa. So maybe such points are very rare? To answer this question, we can estimate their number based on a very simple probabilistic model, which turns out to be very close to reality. It says that if we look for points where the number of additional bits needed for correct rounding is less than the number of bits in the mantissa, then we can usually find such points, and of course their number grows as the required number of additional bits diminishes. In the general situation you can see that the numbers predicted by this model are in very good correspondence with the real numbers of such hard points, for the sine function, for example. But sometimes this model does not work, because, for example, the function can have a very simple asymptotic: it can come close to some floating point number, and then every point in such an interval becomes hard from this point of view. But we are not actually interested in such cases, because in such intervals the function can be calculated easily with the help of the asymptotic. So we are mostly interested in isolated hard points which do not lie in such intervals. There are several methods to calculate such hard points.
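Before turning to the search methods, the notion of a hard point can be made more tangible. The following sketch estimates how many extra bits beyond the 53-bit mantissa are needed to round sin(x) to nearest correctly: it computes a high-precision value with MPFR, finds the midpoint between the two neighboring doubles, and measures how close the exact value comes to that midpoint. The 300-bit precision and the sample arguments are arbitrary assumptions, and the formula is only a rough estimate of hardness, not the exact count used in the talk.

```c
/* Estimate the "hardness" of an argument x for correctly rounding sin(x)
   to nearest: compute sin(x) to high precision, take the midpoint of the
   two neighboring doubles, and see how far below one ulp the exact value
   sits from that midpoint.  Compile with: -lmpfr -lgmp -lm */
#include <stdio.h>
#include <math.h>
#include <mpfr.h>

static double extra_bits_needed(double x) {
    mpfr_t s, mid, d;
    mpfr_inits2(300, s, mid, d, (mpfr_ptr)0);

    mpfr_set_d(s, x, MPFR_RNDN);
    mpfr_sin(s, s, MPFR_RNDN);                /* high-precision sin(x)    */

    double lo = mpfr_get_d(s, MPFR_RNDD);     /* the neighboring doubles  */
    double hi = nextafter(lo, INFINITY);
    double ulp = hi - lo;

    mpfr_set_d(mid, lo, MPFR_RNDN);
    mpfr_add_d(mid, mid, ulp / 2.0, MPFR_RNDN);   /* midpoint lo + ulp/2  */

    mpfr_sub(d, s, mid, MPFR_RNDN);           /* distance to the midpoint */
    mpfr_abs(d, d, MPFR_RNDN);
    mpfr_div_d(d, d, ulp, MPFR_RNDN);         /* ... in units of one ulp  */

    double frac = mpfr_get_d(d, MPFR_RNDN);
    mpfr_clears(s, mid, d, (mpfr_ptr)0);

    if (frac == 0.0) return INFINITY;         /* exactly on the midpoint  */
    return -log2(frac) - 1.0;                 /* bits beyond the mantissa */
}

int main(void) {
    double xs[] = { 0.5, 1.0, 1.5, 3.0 };     /* arbitrary sample points  */
    for (int i = 0; i < 4; i++)
        printf("x=%-6g needs about %.1f extra bits\n",
               xs[i], extra_bits_needed(xs[i]));
    return 0;
}
```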
Of course, we can simply try to search for them by brute force, and this method works, but only for single precision, where we have just 2 to the power 32 numbers. There are additional methods, such as methods based on continued fractions, which allow us to calculate some hard points for trigonometric functions starting from the continued fraction expansion of pi; it gives rational approximations to pi and also floating point numbers which are very close to integer multiples of pi. Another method, based on computing square roots modulo increasing powers of two, can be used to calculate hard points for the square root function: this sequence provides a 2-adic square root of some parameter, and for each such parameter, which can be taken as a small odd number, we can usually calculate several hard points. Other methods are based on the idea that hard points correspond to nodes of the grid of floating point numbers which lie very close to the graph of the function. The reduced search method uses a linear approximation of the function and then an optimized search for the grid nodes close to the line segment obtained. Lattice reduction is more complex: it uses a polynomial approximation and a construction of a lattice of multivariate polynomials, which is then reduced to a basis that yields such close points; some details of this method are rather intricate. Then I proposed the so-called integer secant method, based on the idea that if we have a point where the slope of the tangent is rational, then the intersections of the graph with secants parallel to this tangent correspond to such hard points, and these intersections can be effectively calculated based on the series expansion of the difference between the graph and the tangent. With the help of all these methods we computed sets of hard points; actually, for simple cases like square root, we have calculated all the hard points requiring more than, for example, 48 additional bits to determine the correctly rounded value. Using these results and the other sources of test data, we composed test suites for almost all the real-variable POSIX functions. The test suites are constructed rather simply: there is a framework which performs test execution and the comparison between the expected and the actual result, and there is a set of test data consisting of arguments and expected results. The expected results are calculated with the help of the multi-precision Maple tool or the MPFR library; we use different sources to avoid systematic bugs present in both implementations. Here are some statistics of the test suite development. For example, these points are boundary points such as 0, infinity, the maximum positive floating point number and so on; these points correspond to the different intervals and the different mantissa patterns; and these are hard points, where we could calculate them -- for the logarithmic gamma function, for example, it is hard, and I do not know an effective way to calculate hard points for such a function just now. Sometimes we use other sources as well. For example, for the square root function there are a lot of floating point numbers whose square root is an exact floating point number, and they also provide useful test cases. And for the sine, logarithmic gamma, and Bessel functions, the points which are closest to their zeros and extremes are also used as test cases. So we composed a huge set of tests for mathematical functions and executed it on several platforms.
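The structure of the test suites just described, a framework plus a table of arguments and expected results, can be sketched roughly as follows. The expected values below are placeholders chosen to be either exact or well-known results, not entries from the real suites, and the ulp-distance comparison is one plausible way to report mismatches.

```c
/* Sketch of a table-driven test framework: each test case is an argument
   and an expected result (in the real suites the expected values come
   from Maple or MPFR; the values below are placeholders).  The framework
   compares actual and expected results and reports the difference in ulps. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

struct testcase {
    double (*fn)(double);
    const char *name;
    double arg;
    double expected;
};

/* Map a double's bit pattern to a monotonically ordered unsigned key,
   so that the ulp distance is just the difference of keys. */
static uint64_t key(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return (u & 0x8000000000000000ULL) ? ~u : (u | 0x8000000000000000ULL);
}

static uint64_t ulp_diff(double a, double b) {
    uint64_t ka = key(a), kb = key(b);
    return ka > kb ? ka - kb : kb - ka;
}

int main(void) {
    /* placeholder test data; the real suites hold thousands of such rows */
    struct testcase cases[] = {
        { sin,  "sin",  0.0, 0.0 },                 /* exact by POSIX      */
        { exp,  "exp",  0.0, 1.0 },                 /* exact by POSIX      */
        { sqrt, "sqrt", 2.0, 1.4142135623730951 },  /* correctly rounded   */
    };
    int failed = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        double actual = cases[i].fn(cases[i].arg);
        uint64_t d = ulp_diff(actual, cases[i].expected);
        printf("%-5s(%g): %s (%llu ulp)\n", cases[i].name, cases[i].arg,
               d == 0 ? "OK" : "FAIL", (unsigned long long)d);
        if (d != 0) failed++;
    }
    return failed != 0;
}
```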
Here is a summary of the results. Rows correspond to different functions and columns correspond to different platforms, and each cell for each function is divided into four parts corresponding to the four rounding modes. You can see that sometimes a function behaves almost uniformly across the rounding modes, and sometimes it is exact in round-to-nearest and very buggy in the other rounding modes; it is very implementation-dependent. Here are also some examples of the bugs found in this testing. These are bugs related to the exponential and hyperbolic functions: you can see, for example, an exponential of a negative number which is greater than 1, and so on. The same picture can be seen for the trigonometric functions. And for the error function the same: the error function represents a probability, and here you can see that this probability is very high. The two most notable things here, to my mind, are that on these four platforms the implementation is almost exact, or exact, in round-to-nearest and very buggy in the other modes -- I think it is because of some optimization performed just for the to-nearest mode -- and also, as I have already noted, the implementation of the glibc library for the Itanium platform seems to be the least buggy implementation under consideration. The other picture shows which implementations provide the same or different results on our test set: implementations in cells of the same color provide exactly the same results on the tests, and implementations with white cells provide results that differ from all the others. >>: White cells? >> Victor Kuliamin: Yes, white cells are unique implementations which are not similar to any other. So you can see that if we consider, for example, the development of mathematical modeling applications based on grid systems or cloud computing, and we want to use different platforms as the base of such a system, we can have problems if we provide automatic scheduling of calculations onto different platforms, because we can get different results from such applications. And actually, as far as I know, such scenarios are already implemented: in one institute such a grid was implemented with automatic scheduling of computations, and they were suddenly faced with different results caused just by the scheduling onto different implementations and platforms. It was an independent team, they did not know about our results, but you can see that this has practical significance. So, a maybe obvious conclusion: formalization of standards can uncover numerous issues even in rather mature industrial standards like POSIX, which has a long history of development and is considered very good in the industry. And in some cases formalization is not only ineffective in terms of the cost and effort required, but even impossible, if the standard has nothing to require from some interface. So that's all. Thank you for your attention. Of course, if you want to discuss issues concerning not the mathematical testing but the other things I have sketched before, you are welcome. [applause] >> Wolfram Schulte: Victor is here for the remaining week, so if you want to catch him to get some of the issues worked on, feel free; he's sitting next to me in my hallway. We can take questions. >>: These were defined in the Linux library, right? >> Victor Kuliamin: Not only in Linux. We also tested implementations of the C runtime library of Visual Studio, because it provides almost the same set of functions, except for some.
>>: Okay. I just wanted to ask: what about Intel? Intel provides a library like this, right? >> Victor Kuliamin: Yes. For example -- >>: [indiscernible]. >> Victor Kuliamin: This implementation of glibc on Itanium was developed entirely by Intel. And there are also Intel libraries for Windows, which I also performed tests on, but I have not yet put the results in the table. It seems that the implementation on Windows is also very good; it demonstrates almost the same results as the one on Itanium. >> Wolfram Schulte: Any other questions? Thanks again, Victor. [applause]