>> Wolfram Schulte: Good afternoon, everyone. I'm Wolfram Schulte. I'm currently hosting Victor Kuliamin, and it's my pleasure to host him for today's talk on the practice of standard formalization. I met Victor at various testing conferences, and I was generally impressed by the breadth of the work they're doing on testing open standards, and that's what we will hear about today. He's part of the Institute for Systems Programming, which has probably more than 20 people working on testing various aspects, from hardware to mobile devices to browsers, and he will mainly report on testing in the Linux context. So I look forward to your talk. Thanks for joining us, Victor.
>> Victor Kuliamin: Thank you for the invitation. I was actually in doubt about how to focus my presentation, because we do a lot of work in these directions, and it is very wide. So I will try to provide a brief overview of this work and then to speak in more detail about one small part of it, concerning the testing of mathematical libraries in Linux, or in Unix systems as a whole.
So first I will talk about the background of our institute concerning, of course, testing. One of the first big projects performed in our institute was the development of regression tests for a real-time operating system for network equipment, and since then we have continued to perform projects related to the testing of operating systems from different sources. We conduct such testing for a real-time operating system developed in Russia, which is also POSIX-compatible and used in some defense applications, as far as I know.
We conduct test development for the Linux Standard Base. And we also developed a conformance test suite for the ARINC 653 standard, which is a standard on process management, time management and partitioning in real-time operating systems in avionics.
We also have some background in compiler testing. Most notable are the projects related to the description of static semantics and the development of conformance tests for the C and Java languages, performed in 2002-2004. And there was a project on testing the optimizing units of Intel compilers, where the main problem was that Intel could generate a lot of tests, gigabytes of tests, but they were very poorly targeted at the optimizing units. So they needed tests that exercise the code of the optimizing units but are designed as general programs and provided as inputs to the compiler as a whole. We succeeded in doing this, and we reported about a dozen bugs in the Intel compilers as a result of this project.
Then we also have a background in protocol testing, starting from testing the Microsoft Research implementation of the IPv6 protocol, then the development of an IPv6 test suite for Microsoft Windows CE, and then a test suite for the IPsec protocol. That project is still in progress; it is not finished yet. And one of the latest additions to our profile is hardware testing: we participate in the development of test suites for MIPS-based general-purpose microprocessors with some DSP extensions. So this is the general background of the kinds of projects we perform.
Now, about the technologies we use. We mostly focus on model-based testing and static analysis technologies. In model-based testing we developed the so-called KVEST technology, which was used for testing a network operating system; it was later reworked into the UniTESK technology, which is continuously being developed by adding extensions for different application domains.
The general scheme of model-based testing, as I may recall for the audience, is the following. We have some system under test, and we have a model of its behavior, on the basis of which we construct test coverage criteria for further testing. From the behavior model we can automatically produce test oracles that evaluate the results of our testing, and maybe also a model of the state of the system under test, for more precise evaluation and for tracking coverage more specifically. From the coverage criteria we automatically produce a coverage metric to measure the adequacy of our tests. Then we also construct a test generator which can use all these modules to perform testing and to perform immediate evaluation of its adequacy.
This is the most general scheme; it can be refined and simplified in different ways. For example, we can skip the oracle and just provide the testing results for later evaluation by people. Or we can skip on-the-fly test execution and just generate tests for later execution. Or we can leave out the state model and generate tests only on the basis of the coverage criteria used. Actually, all these variants are used in different contexts and in different projects in our institute.
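As a rough illustration only (not the institute's actual tooling; all names here are invented), the closed loop of on-the-fly model-based testing could look like this in C: a generator produces stimuli, an oracle derived from the behavior model checks each result against the model state, and a coverage metric tracks adequacy.

    #include <stdio.h>

    /* Hypothetical pieces, normally generated from the behavior model. */
    static int model_counter;                 /* model state                     */
    static int goal_hit[4];                   /* one flag per coverage goal      */

    /* System under test: here just a counter we pretend is external code. */
    static int sut_add(int delta) { static int c; c += delta; return c; }

    /* Oracle derived from the model: predicts the result and gives a verdict. */
    static int oracle_add(int delta, int actual) {
        model_counter += delta;               /* update the model state          */
        goal_hit[delta % 4] = 1;              /* record the coverage goal hit    */
        return actual == model_counter;       /* pass/fail verdict               */
    }

    int main(void) {
        for (int delta = 1; delta <= 8; delta++) {      /* test generator loop   */
            int actual = sut_add(delta);
            if (!oracle_add(delta, actual))
                printf("FAIL at stimulus %d\n", delta);
        }
        int covered = 0;
        for (int i = 0; i < 4; i++) covered += goal_hit[i];
        printf("coverage metric: %d of 4 goals\n", covered);   /* adequacy check */
        return 0;
    }

In the real technology the model, the oracle and the coverage tracker are generated from specifications rather than written by hand.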
Then, one of the main activities of our institute over the last five years has been participation in Linux standardization. It is all concerned with the large number of different distributions and versions of Linux; as far as I know, there are more than 500 official distributions of Linux. And the main problem, of course, is how to provide applications for all these distributions. Do we need different applications, with separate development for each of the distributions?
Of course, if we needed to do so, it would be very hard. So one of the proposed solutions to this problem is the Linux Standard Base, which is a binary interface standard: it supposes that we can simply move applications between different distributions conforming to the standard without any recompilation, just in binary form, if we use the same hardware platform as a base. To my mind it supports eight or nine different hardware platforms, and for each hardware platform it defines a set of binary interfaces which provide such interoperability between different distributions.
This standard includes a lot of more specific standards, such as POSIX or X/Open, and it also includes a lot of de facto standard libraries used in many Linux applications, like libxml or Qt and so on. And not only C-language libraries, but also libraries for other widely used programming languages, such as Java and Perl. For the C language, it now covers about 40,000 functions, which are not specified in a uniform way. About 2,000 functions, which mostly come from the POSIX or X/Open standards, are described more accurately, more precisely.
The other functions are not so well defined, and some of them have no documentation at all. Some of them have, for example, a one-line description: this function does something, and that's all. So if we need to check conformance to such a standard, it is of course a problem and a hard task.
Now, what do we perform in relation to this Linux Standard Base activity? The institute is actually responsible for the LSB infrastructure development. This infrastructure includes the standard database, which holds data on all the standardized library operations, on their use in distributions, and on the tests already developed for them. The infrastructure also includes conformance checking tools, both for distributions and for applications: for distributions these tools are based on static analysis and testing, and for applications on testing again and on monitoring, or dynamic verification.
We also develop an information system for LSB evolution, which supports the analytical activities behind further decisions on the development of the standard, for example, evaluating which libraries are widely used in different applications and which are not. And we also perform a project on Linux driver verification, which is based on a technology very similar to the one developed at Microsoft Research and now used, to my mind, in the Static Driver Verifier at Microsoft.
We use the BLAST tool, which is very similar to SLAM. But we also have some additional instrumentation based on aspect-oriented techniques, so that we can define a model of the system and a model of the driver in a non-invasive way, without changing the code of the driver.
Now, if we have such a huge number of interfaces, how can we test them? Actually, we cannot do it with the same accuracy for all of them. For the 2,000 functions that are described rather accurately, we provide conformance testing on the basis of the UniTESK technology. For less strictly, but nevertheless well-defined functions, we use manual test development augmented with some support for requirements traceability and for parameterized testing, for example. So some things are automated, but the main decisions are made by people in this test development.
And for the other functions, we use a specialized test construction technology which is based on the information already stored in the standard database. This database has information on function signatures, that is, the data types of the parameters and of the result, and the technology is based on the idea of refining the information on data types: putting into the database some additional information on how a value of each data type can be correctly initialized and finalized.
Then an exploration tool simply constructs tests on the basis of this information: it puts in some initialization procedures to construct the parameters, then a call to the tested function, and then some finalization. Of course these tests are not accurate and not very good, but they provide at least something, some testing that we can do for such a great number of functions, which previously had no tests at all.
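As a rough, purely hypothetical illustration (the function name and the initialization rules are invented; the real tests are generated from the type information in the LSB database), such a shallow generated test amounts to something like this:

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical library function under test, known only by its signature:
     *   char *some_lib_fn(const char *s, size_t n);                          */
    extern char *some_lib_fn(const char *s, size_t n);

    int main(void) {
        const char *s = "sample";     /* initializer registered for const char *  */
        size_t n = strlen(s);         /* initializer registered for size_t        */
        char *r = some_lib_fn(s, n);  /* the call itself: it must not crash       */
        free(r);                      /* finalizer registered for char *
                                         (assuming the result was allocated)      */
        return 0;                     /* verdict: the process terminated normally */
    }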
Now maybe a few words, very briefly, about the UniTESK test development technology.
>>: Are you presenting the findings of your testing efforts as well? How many faults did you find? How great is the variety between the different Linux distributions? Are those among your findings?
>> Victor Kuliamin: Actually, there are different kinds of findings. The first kind is inaccuracies and issues in the standards themselves. For example, we reported about two or three dozen issues to the POSIX team, and these issues were accepted by them, and we have, to my mind, just about the same number of further reported issues against the POSIX standard. Also, there are, to my mind, hundreds of various issues in various distributions and libraries, including about two or three dozen problems in the glibc library, which were also accepted.
And there are about 50 or 60, to my mind, defects in other widely used libraries on Linux. Concerning differences between distributions: there are differences, but they are not reported as specific issues. Usually they are connected with the use of different versions of libraries, or with the fact that in some distributions a certain patch is applied and in others it is not, and something like that. So usually, if we find such issues, we transform them into a report on a bug in a library, or an issue related to the use of patches for some library.
Actually, most of the issues found are published on our site, linuxtesting.org. Okay, several words about the UniTESK technology. It is based on the idea of so-called stateful contracts, where we define the behavior of a closely related group of operations in terms of its model state and the pre- and postconditions of its operations.
Then we can use the structure of the postconditions as a source of coverage criteria, because in most practical cases a postcondition is not a single predicate but consists of a lot of different branches, and each of these branches can be considered a coverage goal. Then we perform a coverage-targeted abstraction of the state to construct a finite state machine for the group, and perform an on-the-fly exploration of this finite state machine to obtain the tests.
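As a minimal sketch of this idea in plain C (UniTESK itself uses a specification extension of C; this example and its names are mine), here is a postcondition for an "increment" operation of a bounded counter, written so that each branch is also a coverage goal:

    #include <stdio.h>

    /* Postcondition for inc() on a counter that saturates at MAX.
     * Returns the verdict; *goal reports which branch (coverage goal) was taken. */
    enum { MAX = 100 };

    static int post_inc(int pre_state, int result, int *goal) {
        if (pre_state < MAX) {        /* branch 1: ordinary increment            */
            *goal = 1;
            return result == pre_state + 1;
        } else {                      /* branch 2: saturation at the upper bound */
            *goal = 2;
            return result == MAX;
        }
    }

    int main(void) {
        int goal;
        int ok = post_inc(99, 100, &goal);     /* exercises branch 1 */
        printf("branch %d, verdict %s\n", goal, ok ? "pass" : "fail");
        return 0;
    }

Traversing the finite state machine obtained by abstracting the model state then drives the implementation until every such goal has been covered.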
Now I wish to be more specific about the testing of math libraries, because it also illustrates some aspects of the formalization of standards and some issues you can find along the way. So if you pose the question of how one can test implementations of mathematical functions, first of all you need to look at the existing standards that state requirements on these implementations. There are four such standards with slightly different coverage, because, for example, IEEE 754 provides requirements only on performing some basic arithmetic operations.
Actually, these standards also provide slightly different, not entirely consistent sets of requirements. So, to recall some details of floating point arithmetic: a floating point type is specified by two integers, the number of bits used for the number as a whole and the number of bits used for the exponent. If the exponent consists neither of all zeros nor of all ones, the value is a so-called normal floating point number and is calculated by this formula. If the exponent consists of all zeros, the value is a denormal number and is calculated by a slightly different formula, and you may note that there is such an artifact as minus zero among the floating point numbers.
Actually, the IEEE standard tries not to distinguish zero and minus zero, but in some specific cases there are differences, and they are directly specified in the standard. And if the exponent consists of all ones, the corresponding value is an exceptional one: it may represent an infinity, or a so-called NaN, "not a number", which is used to represent results that cannot be correctly stated as finite or infinite.
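As a concrete reminder of this encoding (a sketch for the 64-bit double format: 1 sign bit, 11 exponent bits, 52 mantissa bits), here is how the classes just mentioned can be distinguished by their bit fields in C:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    static void classify(double x) {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);               /* reinterpret the bit pattern */
        uint64_t sign = bits >> 63;
        uint64_t exp  = (bits >> 52) & 0x7FF;         /* 11 exponent bits            */
        uint64_t frac = bits & ((1ULL << 52) - 1);    /* 52 mantissa bits            */

        if (exp == 0 && frac == 0)
            printf("%szero\n", sign ? "minus " : "");
        else if (exp == 0)                            /* value = (-1)^s * 0.frac * 2^(-1022)    */
            printf("denormal number\n");
        else if (exp == 0x7FF)
            printf(frac ? "NaN\n" : "infinity\n");
        else                                          /* value = (-1)^s * 1.frac * 2^(exp-1023) */
            printf("normal number\n");
    }

    int main(void) {
        classify(-0.0); classify(1.5); classify(5e-324);
        classify(INFINITY); classify(NAN);
        return 0;
    }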
So the main floating point types have the following parameters. In addition, the IEEE 754 standard provides the following requirements on computations: it describes the behavior of the basic arithmetic operations, the square root function, fused multiply-add, and some type conversions.
In many cases the result of a computation cannot be represented as an exact floating point number. In that case you should somehow round the exact result to a representable number, and for this purpose the standard defines four different rounding modes, which prescribe the direction of rounding in each case.
It also provides restrictions on --
>>: For "to the nearest", aren't there two cases, depending on whether ties go toward zero or toward the even value? So that there would be five modes?
>> Victor Kuliamin: What, I don't understand.
>>: You're saying four rounding modes. I'm familiar with five; maybe the fifth one is not in IEEE 754. The fifth one, I believe, splits "to the nearest" into two cases for results lying precisely between two values.
>> Victor Kuliamin: I see. Yes. Sometimes a fifth mode is mentioned, but in the standard there is only one "to the nearest" mode, and it says that if the exact result is exactly in the middle between two floating point numbers, then we should round it to the number with more zeros at the end, that is, with an even last bit.
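For reference, the four rounding modes of the 1985 standard are accessible from C99 through <fenv.h>; a small sketch (the exact printed digits depend on the platform and may require strict floating point compiler settings):

    #include <stdio.h>
    #include <fenv.h>
    #include <math.h>

    int main(void) {
        const int   modes[] = { FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD };
        const char *names[] = { "to nearest", "toward zero", "upward",  "downward"  };

        for (int i = 0; i < 4; i++) {
            fesetround(modes[i]);
            volatile double x = 2.0, y = 3.0;     /* 2/3 is not exactly representable,   */
            printf("%-12s 2/3 = %.20g\n",         /* so the last bit depends on the mode */
                   names[i], x / y);
        }
        fesetround(FE_TONEAREST);
        /* Tie-breaking in the "to nearest" mode goes to the even last bit:
           rint(2.5) == 2.0 and rint(3.5) == 4.0.                            */
        printf("rint(2.5) = %g, rint(3.5) = %g\n", rint(2.5), rint(3.5));
        return 0;
    }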
So, in addition, this standard also provides requirements concerning the cases where these operations should return NaNs or infinite results, mostly in a natural way. And it also specifies five flags that can be set in the case of exceptions, to signal some incorrectness in the computations or some specific situations that occurred.
If you consider the C language standard and the POSIX standard, they provide requirements on the library functions implementing various mathematical functions, and they mostly try to extend the IEEE requirements to these functions. For example, they specify exact values of these functions at certain points; they specify the points where these functions should be infinite and should raise the divide-by-zero flag; and they also specify the situations where NaN results should be returned and the invalid flag should be raised.
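A sketch of how such flag requirements can be checked from C99 (this is only the flavor of check the conformance tests perform; the argument values are just examples):

    #include <stdio.h>
    #include <fenv.h>
    #include <math.h>

    int main(void) {
        feclearexcept(FE_ALL_EXCEPT);
        volatile double zero = 0.0;
        double a = log(zero);                 /* expected: -infinity, divide-by-zero flag */
        int pole_ok = isinf(a) && a < 0 && fetestexcept(FE_DIVBYZERO);

        feclearexcept(FE_ALL_EXCEPT);
        volatile double neg = -1.0;
        double b = log(neg);                  /* expected: NaN, invalid flag              */
        int dom_ok = isnan(b) && fetestexcept(FE_INVALID);

        printf("log(0):  %s\n", pole_ok ? "conforms" : "does not conform");
        printf("log(-1): %s\n", dom_ok  ? "conforms" : "does not conform");
        return 0;
    }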
Besides that, POSIX also provides some specific requirements. For example, it says that if a function is asymptotically equal to its argument in the neighborhood of 0, then for denormal arguments it should return the argument itself. This is in contradiction with the rounding modes, because this requirement applies under every rounding mode. Well, it is not a formal contradiction, because the IEEE standard does not say anything about such functions, but it does contradict the natural extension of the IEEE requirements you might imagine.
Another example of such an inconsistency, not quite a contradiction, between the POSIX approach and the IEEE approach concerns the values that should be returned in case of overflow. POSIX again requires that one and the same value, the HUGE_VAL constant, be returned independently of the rounding mode, and it also says nothing about the specific value of this constant. Different libraries actually use different values, and this is an obvious source of non-interoperability between implementations of mathematical functions. So if you perform some calculations on one platform and get such results, they can be incorrectly interpreted on another platform where this constant has another value.
One more example of inconsistency between POSIX and IEEE is the POSIX requirement that some functions should return non-NaN results for NaN arguments. For example, the maximum of a NaN and any number should be equal to that number, not NaN, as it would be if we tried to extend the IEEE policy on NaNs.
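The C99/POSIX fmax function is a concrete case of this policy; a quick sketch of the contrast with plain IEEE NaN propagation:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        printf("fmax(NaN, 2.0) = %g\n", fmax(NAN, 2.0));  /* prints 2: the NaN is ignored */
        printf("NaN + 2.0      = %g\n", NAN + 2.0);       /* prints nan: the NaN propagates */
        return 0;
    }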
So these standards are not in good correspondence with each other. You can also look at the full POSIX requirements for the sine function: there is a general description and some requirements on the values returned in certain special situations, and you can note that there is not a single word about the inaccuracy or precision of the calculations. So any function that satisfies these special cases but returns arbitrary values elsewhere can be considered a correct implementation of the sine function. There is also the so-called standard on language-independent arithmetic, which is the most strict standard concerning mathematical libraries. It provides a rather accurate set of requirements, starting from preservation of the sign of a mathematical function by its implementation, preservation of monotonicity, and boundaries on accuracy.
It is also specified that these boundaries and restrictions apply to trigonometric functions only for small arguments, where the period is much larger than the distance between neighboring floating point numbers, because outside that interval the functions become just a certain set of points.
This standard also provides restrictions on the symmetries of an implementation, on exact values, and on asymptotics, usually near 0, because near 0 the density of floating point numbers is much higher. And there are restrictions on inequality relations between such functions as the exponential and exponential minus 1, or the hyperbolic sine and cosine, and so on.
So if we look at the whole list of requirements from the different standards, we can note that the single requirement to perform correct rounding of the exact result according to the current rounding mode is sufficient to infer most of them. Actually, this requirement is very good for standardization, because it provides the best computational accuracy you can achieve. All the issues in applications using mathematical libraries that are related to inaccuracies should then be investigated as properties of the algorithms used, not as related to inaccuracies in the libraries themselves.
It also provides perfect interoperability between different libraries and applications: you get the same results from different implementations if all of them satisfy this requirement. It is sometimes considered rather hard to implement, but there are examples where it is achieved; for example, the crlibm library developed at INRIA tries to implement correct rounding for all elementary functions. Intel's implementations of the glibc math library for the Itanium platform do not achieve this, but come very close: for most functions they show only one-ulp or two-ulp errors.
So when we constructed the tests, we based them on this requirement of correct rounding. We also tried to extend the ideas of the IEEE standard naturally, because it is basic for floating point computations, and to extract the corresponding requirements on NaN results, infinite results and exception flags for all the exceptional cases of the mathematical functions.
Then, once we can say what we should check with our tests, the next thing to define is what data we should use. For testing mathematical functions we use three sources of data. First, the bit structure of floating point numbers, which defines some natural boundaries, such as 0, infinity, the largest positive number and so on, and which can also be used to construct numbers corresponding to particular patterns of mantissa bits; such patterns often become a source of subtle errors in different implementations.
For example, the very well known Pentium division bug was related to a specific pattern of bits in the arguments. We also found, for example, a bug of this kind in the glibc implementation of an integer rounding function; it is also related to such a pattern in the bit structure.
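A sketch of how such bit-pattern arguments can be produced directly from the fields of the representation (the particular patterns below are just illustrative):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Assemble a double from sign, biased exponent and mantissa bit fields. */
    static double make_double(uint64_t sign, uint64_t exp, uint64_t frac) {
        uint64_t bits = (sign << 63) | (exp << 52) | (frac & ((1ULL << 52) - 1));
        double x;
        memcpy(&x, &bits, sizeof x);
        return x;
    }

    int main(void) {
        /* A few mantissa patterns that tend to expose subtle errors:
           all ones, a single trailing one, alternating blocks.        */
        uint64_t patterns[] = {
            (1ULL << 52) - 1,          /* 1111...1 */
            1ULL,                      /* 0000...1 */
            0x000FF00FF00FF00FULL,     /* blocks of ones and zeros */
        };
        for (int i = 0; i < 3; i++) {
            double x = make_double(0, 1023, patterns[i]);   /* exponent of 1.0 */
            printf("test argument %d: %.17g\n", i, x);
        }
        return 0;
    }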
The two other sources of test data are intervals of uniform behavior of a function, and points where it is rather hard to calculate the correct result, because doing so requires more precision than on average. So --
>>: What do you mean by uniform behavior?
>> Victor Kuliamin: I will explain on this slide. By uniform behavior I mean intervals where the function preserves its sign or some kind of monotonicity, or has some nice asymptotics that holds just on this interval. So actually, all the numbers can be partitioned into a set of such intervals according to these ideas, specific to each function, of course. And even for functions of several arguments the same idea works: you partition not an axis but the whole plane or space into a set of areas where the function has slightly different behaviors.
And --
>>: And what is the number of these intervals, in general?
>> Victor Kuliamin: It depends. For some functions, like square root or the exponential, there are a few dozen intervals. For functions like the trigonometric sine and cosine there are millions of intervals, but not all of them can be chosen for testing.
Actually, for such complex functions we perform some additional filtering of these intervals, trying to look only at the intervals where the behavior is most specific. For example, we try to look at the periods of sine where it comes closest to the value of 1 or minus 1, or to 0. I will say more about this later, but from all the periods of the sine function, for example, we choose several thousand periods which contain some specific behavior and use them as such intervals.
Then the third source is so-called hard points, where the exact result of a function lies, for example, very close to the midpoint between two consecutive floating point numbers, so that we need a very precise computation to make the correct rounding of this result. A similar situation arises when we use the so-called directed rounding modes, toward 0 or toward the infinities, but the pattern of the result is slightly different. For rounding to nearest, you can see that the result lies close to the midpoint of two floating point numbers; for the directed rounding modes, the result should instead be close to an exact floating point number. And there are some examples of these hard points: for example, you can see that at this point the correct rounding of the function value can be performed only if we compute about 80 additional bits beyond the ordinary working precision.
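Stated slightly more formally (my notation: $y$ is the floating point number just below the exact value $f(x)$, $\mathrm{ulp}(y)$ is the distance from $y$ to the next floating point number, and $k$ is the number of extra bits of precision available), a point $x$ is hard when

\[
\left| f(x) - \Bigl(y + \tfrac{1}{2}\,\mathrm{ulp}(y)\Bigr) \right| < 2^{-k}\,\mathrm{ulp}(y)
\quad \text{(round to nearest)},
\]
\[
\left| f(x) - y \right| < 2^{-k}\,\mathrm{ulp}(y)
\quad \text{or} \quad
\left| f(x) - \bigl(y + \mathrm{ulp}(y)\bigr) \right| < 2^{-k}\,\mathrm{ulp}(y)
\quad \text{(directed modes)},
\]

so that deciding the rounding requires computing $f(x)$ to roughly $k$ bits beyond the target precision.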
So, maybe such points are very rare? To answer this question, we can estimate their number based on a very simple probabilistic model, which turns out to be very close to reality. It says that if we look for points where the number of additional bits needed for correct rounding is less than the number of bits in the mantissa, then we can usually find such points, and their number grows as the required number of additional bits decreases.
And in the general situation you can see that the numbers predicted by this model are in very good correspondence with the real numbers of such hard points, for the sine function, for example.
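The model itself is roughly the following sketch: assume the bits of the exact value $f(x)$ beyond the target precision behave like independent fair coin flips. Then for a single argument

\[
\Pr\bigl[\text{at least } k \text{ extra bits are needed}\bigr] \approx 2^{-k},
\qquad\text{so}\qquad
\mathbb{E}\bigl[\#\text{hard points}\bigr] \approx N \cdot 2^{-k}
\]

for an interval containing $N$ floating point arguments: each additional required bit halves the expected number of hard points, which is consistent with the counts observed for sine.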
But sometimes this model does not work, because, for example, a function can have very good asymptotics: it can come close to some floating point number, and then every point in that interval becomes hard from this point of view. But we are actually not interested in such cases, because on such intervals the function can be calculated easily with the help of these asymptotics. We are interested mostly in isolated hard points which do not lie in intervals with such asymptotics.
So there are several methods to calculate such hard points. Of course, we can just try to search for them by brute force. This method works, but only in single precision, where we have just 2 to the power of 32 numbers.
There are additional methods, such as the method based on continued fractions, which allows us to calculate some hard points for trigonometric functions starting from the continued fraction expansion of pi, which gives rational approximations to pi and also floating point numbers that are very close to integer multiples of pi.
Another method, the so-called [indiscernible] method, can be used to calculate hard points for the square root function. It is based on calculating square roots modulo increasing powers of two. This sequence provides square roots of some parameter, and for each such parameter, which can be taken as a small odd number, we can usually calculate several hard points with the help of this method.
Other methods are based on the idea that hard points correspond to the nodes of the grid of floating point numbers that lie very close to the graph of the function. The reduced search method uses a linear approximation of the function and then an optimized search for the grid nodes close to the line segment obtained. Lattice reduction is more complex: it uses a polynomial approximation and constructs a lattice of multivariate polynomials, which can then be reduced to a basis that yields such close points. Some details of this method are rather intricate.
Then I proposed the so-called integer secants method, based on the idea that if we have a point where the slope of the tangent is rational, then the intersections of the graph with secants parallel to this tangent correspond to such hard points. These intersections can be effectively calculated based on the series expansion of the inverse of the difference between the graph and the tangent.
So with the help of all these methods we computed sets of hard points. For simple cases like square root we have actually calculated all the hard points requiring more than, for example, 48 additional bits to determine the exact value.
And using these results and the other sources of test data, we composed test suites for almost all the real-variable POSIX functions. The test suites are constructed rather simply: there is a framework which performs test execution and the comparison between the expected result and the actual result, and there is a set of test data consisting of arguments and expected results. The expected results are calculated with the help of the multiple-precision Maple tool or the MPFR library; we use different sources to avoid systematic bugs in either implementation.
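As a minimal sketch of such a check, with MPFR as the high-precision oracle (the real test data are precomputed and stored; the 200-bit working precision and the sample arguments here are arbitrary choices of mine, and truly hard points may need even more precision):

    #include <stdio.h>
    #include <math.h>
    #include <mpfr.h>

    /* Compare the libm result for sin(x) with a reference value computed by MPFR
       at a much higher precision and then rounded to double.                     */
    static int check_sin(double x) {
        mpfr_t t;
        mpfr_init2(t, 200);                         /* 200-bit working precision     */
        mpfr_set_d(t, x, MPFR_RNDN);
        mpfr_sin(t, t, MPFR_RNDN);
        double expected = mpfr_get_d(t, MPFR_RNDN); /* round the reference to double */
        mpfr_clear(t);

        double actual = sin(x);
        return actual == expected || (isnan(actual) && isnan(expected));
    }

    int main(void) {
        const double args[] = { 0.5, 1e22, -0.0 };  /* sample arguments only */
        for (int i = 0; i < 3; i++)
            printf("sin(%.17g): %s\n", args[i],
                   check_sin(args[i]) ? "correctly rounded" : "MISROUNDED");
        return 0;
    }

Such a program links against MPFR and GMP, for example with -lmpfr -lgmp -lm.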
And here are some statistics of the test suite development. So, for example, these points are boundary points such as 0, infinity, the maximum positive floating point number and so on. These points correspond to the different intervals and to different mantissa patterns. These are hard points, where we were able to calculate them; for the logarithmic gamma function, for example, it is hard, and I do not know an effective way to calculate hard points for such a function just now. And sometimes we use other sources.
For example, for the square root function there are a lot of floating point numbers whose square root is an exact floating point number, and they also provide useful test cases. And for the sine, logarithmic gamma and Bessel functions, the points which are closest to their zeros and extremes are also used as test cases.
So we composed a huge set of tests for mathematical functions and executed it on several platforms. This is the summary of the results: here rows correspond to different functions and columns correspond to different platforms. Each cell for each function is divided into four parts corresponding to the four rounding modes. You can see that sometimes a function behaves almost uniformly across the rounding modes, and sometimes it is exact in rounding to nearest and very buggy in the other rounding modes; it is very implementation-dependent. There are also some examples of bugs found in this testing. Here are bugs related to the exponential and hyperbolic functions: you can see negative values of the exponential, or an exponential of a negative number that is greater than 1, and so on.
The same picture can be seen for the trigonometric functions. And for the error function it is the same: the error function represents a probability, and here you can see returned values that are far too high for a probability.
So the most notable things here, to my mind, are two. On these four platforms the implementation is almost exact, or exact, in rounding to nearest and very buggy in the other modes; I think this is because of some optimization performed just for the to-nearest mode. And also, as I have already noted, Intel's glibc implementation for the Itanium platform seems to be the least buggy implementation under consideration.
The other picture shows which implementations provide different or the same results on our test set. Implementations whose cells are in the same color provide exactly the same results on the tests, and implementations with white cells provide results different from each other.
>>: White cells?
>> Victor Kuliamin: Yes, white cells are unique implementations which are not similar to any other. So you can see that if we consider, for example, the development of mathematical modeling applications based on grid systems or cloud computing, and we want to use different platforms as the base of such a grid or cloud system, we can have problems if we provide some automatic scheduling of calculations onto different platforms, because we can simply get different results from such applications.
And actually, as far as I know, such scenarios are already implemented: in one institute a [indiscernible] grid was implemented with automatic scheduling of computations, and they suddenly were faced with different results caused just by different scheduling of the computations onto different implementations on different platforms. They were independent teams, so they did not know about our results, but you can see that this has practical significance. So then a maybe obvious conclusion: formalization of standards can uncover numerous issues even in rather mature industrial standards like POSIX, which has a long history of development and is considered very good in the industry.
And also, in some cases formalization is not only ineffective in terms of the cost and effort required, but even impossible, if the standard has nothing to require from some interface.
So that's all. Thank you for your attention. Of course, if you want to discuss issues concerning not only mathematical testing but also the other things I have sketched before, you are welcome.
[applause]
>> Wolfram Schulte: Victor is here for the remaining week, so if you want to catch him to get
some of the issues worked on, feel free. He's sitting next to me in my hallway. We can take
questions.
>>: These were defined in the Linux library, right?
>> Victor Kuliamin: Not only on Linux. We also tested the implementations in the C runtime library of Visual Studio, because it provides almost the same set of functions, except for a few.
>>: Okay. I'm just going to ask: what about Intel? Intel provides some library like this, right?
>> Victor Kuliamin: Yes. For example --
>>: [indiscernible].
>> Victor Kuliamin: This implementation of glibc on [indiscernible] was developed entirely by Intel. And there are also Intel libraries for Windows, on which I also ran the tests, but I have not yet put those results into the table. It seems that the implementation on Windows is also very good; it demonstrates almost the same results as the one on Itanium.
>> Wolfram Schulte: Any other questions? Thanks again, Victor.
[applause]