>> Juan Chen: Thanks for coming. It's my great pleasure to introduce Professor Jan Vitek from Purdue University. He's also chief scientist at Fiji Systems, a virtual machine startup. Jan has worked on many interesting things, including systems, [inaudible] systems, real-time systems, program analysis, and programming languages. Today he's going to talk about scripts and programs. With that, Jan.

>> Jan Vitek: Okay. Thank you for having me here. It's a pleasure to be here. The title of this talk refers to a desire many people have had for many years: to evolve scripts, which are typically thought of as hacked-together, one-shot computations, into more robust programs. The talk will have three parts. I will tell you about work that we really would have liked to do but didn't, then work that we only half did but that I think is interesting enough to tell you about, and then work that we've done but don't know if it's any good, so it will be good to get feedback. I should say this has been done in collaboration with quite a large crowd, both at Purdue and outside of Purdue, together with IBM, Stockholm University, [inaudible], Texas Arlington, and so on.

Okay, so let's start. The first part really comes from interaction with Tobias Wrigstad, who was a postdoc at Purdue and was involved with this program called Pluto, which has a long Swedish name that I list here and am not going to garble for you. It's supposed to be open source by law, but it's been three years, we've been trying to get the code, and we haven't succeeded. So what I'm going to tell you here is based on secondhand accounts, both from Tobias, who knows the developers, and from slides and talks by the developers.

So what is Pluto? Well, in the beginning of this story there was a Perl program that was put together to perform a simple data migration task. It was also a prototype of a larger system that was going to be built by a big American company. And, as these things happen, the larger system had three interesting attributes: it was late, over budget, and unusable. So what do you do? You take the script and you make it the program. It became a program that is currently managing the retirement savings of 5.5 million people — the whole of a small country — 23 billion euros. And it's been developed by about 30 people over seven years.

Okay, so this is all background. From a computing standpoint, here are statistics that are more interesting for us. Pluto is 320,000 lines of Perl, SQL, shell, and HTML; its databases hold 750 gigabytes of data; it has to run 24/7, because people want to invest and see how their retirement savings are faring. And bugs are directly correlated with potential losses of large amounts of money — euros, in this case. So, putting your software engineering hat on, there is one sane reaction: there's no way this makes any sense. The Pluto guys use this slide in their own talk, so I feel I'm not badmouthing them. But it turns out that the system works and has worked for seven years. So it's interesting to reflect on how they could pull off something that, looking at it as a software engineer, is as unlikely to succeed as a 320,000-line Perl program can be. How did they make it?
So a number of factors were highlighted in their talks and in discussions with them. One was that they found Perl more productive than Java. Every time they ran internal contests — pitting a team working in Perl against a team working in Java — Perl would win. As a side note, in that system all of the critical parts are done in Perl and all of the noncritical parts are done in Java. They had some discipline: they decided to throw away many features of Perl, and the result ends up looking more like SQL. There's no floating point, no threading, no OO. So it's a subset. Then they had the luxury of a database backing up the results. Whenever they detect an error, they just abort the day's run, undo all the changes, revert to the morning state, and then figure out what to do about it. They fail fast. Last — and I think this is interesting for some of you here — they use contracts. Perl is not typed, so they came up with their own notation for contracts in Perl, which looks like this: a contract in Perl, precondition, postcondition, all that, checking whether the variables have certain types. This is weaved into the Perl program and is actually tested; if a contract fails, you abort and undo.

Okay. Looking back at their experience, here are the lessons the developers took out of this project. Perl syntax is ugly; they would really like a better language. Again, it was productive, right? They didn't like the fact that it was weakly typed; they would like more checks. And they didn't like the speed. One of the issues is that they have to run the computation once a day, and over the years the computation time has slowly been creeping up on the available 24 hours. Perl doesn't parallelize particularly well, and they definitely didn't want to try that. And they complained that encapsulation and modularity were not well supported.

Okay. So this raises some questions from the research angle, and the questions we took away from this experience are these. We believe there are many scripts that start as a one-shot exercise and then stay around for a long time. So is it possible to have scripts and programs — I'm not going to define either of those terms, but we all have some idea what we mean by them — coexist in the same language? And can we evolve a script into a program? What does that mean? Well, add the level of assurance and safety we would like to have, so that we feel safer that this thing is actually going to keep our pensions safe forever, and also maybe get some speed, because the extra information will help the compiler generate more efficient code. Another interesting question this raises is: if it's true that Perl plus contracts is more productive than Java plus static types, are type systems useful at all? Or, what are the static assertions about the dynamic behavior of a program that will actually help us be more productive? Good question. I don't know. So these were some questions, and there are some links here to the talks and slides about this. So this got us started on our project.
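The Perl contract notation itself is not reproduced in the transcript. As a rough illustration of the idea — pre- and postconditions that check at runtime that values have the expected types, and abort the run on a violation — here is a minimal sketch in JavaScript rather than Perl; the contract helper and the example function are hypothetical, not Pluto's actual notation.

    // Hypothetical illustration of runtime contracts in an untyped language.
    // 'contract' wraps a function with a precondition and a postcondition;
    // a violation throws, which in Pluto's style would abort the day's run
    // and revert to the morning database state.
    function contract(pre, post, fn) {
      return function (...args) {
        if (!pre(...args)) throw new Error("precondition violated: abort and roll back");
        const result = fn(...args);
        if (!post(result)) throw new Error("postcondition violated: abort and roll back");
        return result;
      };
    }

    // Example: a transfer that insists its amount looks like a positive number.
    const transfer = contract(
      (from, to, amount) => Number.isFinite(amount) && amount > 0,
      (newBalance) => Number.isFinite(newBalance),
      (from, to, amount) => from.balance - amount   // simplified body
    );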
And sort of the second step -- so we started thinking about scripting languages and, you know, evolving scripts into program and we quickly realized that we had very little clue about what a scripting language is. Well, you can take a book, of course you can learn, but, you know, more importantly, what do people do with scripting languages, what do actual scripts look like. So the second part of this talk is going to try to explore -- I mean, we have all heard that, you know, if you program in Perl, programmers can do anything, or if you take JavaScript, you know, they can redefine anything. But do they? You know, what is actually the behavior of real-world programs in these languages. So, you know, we decided to investigate this. And we set out two questions. So assuming that you want to invent a new language or extend an existing language, how much dynamism do you actually need. You know, how many dynamic features do you need to do, you know, most of the tasks that people do with scripting languages. One of the possibilities is maybe that people use the dynamic features just because they don't know better, you know, because they're programmers, not real programmers like us. So we don't know. We have no clue. Another question which we set out to try to answer is if you had a dynamic language that is out there that has legacy code, can I add a type system onto it, on top of it and get some benefits, get more assurance, get faster runtimes. And here the proviso is without rewriting all the legacy programs and libraries. Right? I mean, if you have to change existing code, it's useless, right? You may as well start with the new language. Yeah. >>: When they reported that they felt more productive in Perl -- or maybe that was based on measurements, I don't know, did they give any more details? Because, you know, the second slide about they wanted more type safety and better performance sounds like they want Java. >> Jan Vitek: Right. So, no, I don't have more information. I mean, we -- what we wanted to do was get their whole code history and analyze what kind of errors they made over the years. But that never happened, unfortunately. So I don't know. Yes. >>: And I guess correlated with that, if they were doing these little one-off contests, were they doing maintenance of code or start from scratch [inaudible] ->> Jan Vitek: They were -- so my understanding is they were small little project, you know, one feature, one little thing. So clearly this is not meant as an indictment of anything or -- it's just a data point, and an interesting one because they cared. They wanted to -- they are under pressure to go to Java actually internally and every year they have to face the question, well, will we go, because it turns out they have a hard time finding Perl programming. And there are many problems, but... So, anyway, we sort of set out these questions and we followed, you know, what methodology. Well, we decided let's pick one language that would be dynamic, that would be pretty much as dynamic as it gets, so you sort of -- you look in your hat and, poof, oh, here's JavaScript. Okay. Why JavaScript? Well, it's very dynamic. And it has another very interesting property: You can get running code very easily. And that's really hard with any other languages. You know, if I want to know the dynamic behavior of programs, I have to have programs, have to have inputs. With JavaScript just open my browser and poof. You know, click to any link and you have a program and you have its input. 
So what we did was instrument a browser, and we went to most of the popular Web sites — they pretty much all have JavaScript on them. We recorded a large number of traces at runtime, about 7 gigabytes of traces. And then we had an offline analyzer that analyzed the traces for a number of properties of interest. Yes.

>>: Sorry, I'm curious. In your methodology you picked JavaScript, and then you kind of saw, hey, there's all this code sitting in Web pages. I'm just curious if you thought about — and I'm going to be a little snobbish — that these aren't whole programs architected to run end to end, something you might see in [inaudible] or Python. You're seeing these little nuggets of [inaudible] code, perhaps even written by people with no formal training --

>> Jan Vitek: Even better.

>>: Okay.

>> Jan Vitek: That is exactly what we would like.

>>: Okay.

>> Jan Vitek: We want to see the bad code out there. So in some sense that was a plus for us. And then we also had some static analysis that looked at the source code. That's another good thing: you get source code with JavaScript. So, in case you doubt it, JavaScript is a programming language, and here's a bit of JavaScript code. What does it do? Well, the first thing here is function List, a function that can be used as a constructor, like in C#. And what does it do? It takes the this object, adds a field named value and sets its value to v, adds a field named next and sets its value to n. Then this bit here says: whenever you create an instance of this list, make sure it has the map method. And the map method is this function that does whatever we expect of it. The interesting thing is that you can have higher-order functions in JavaScript: here's a little bit of code that creates a linked list — one, two, three — and here's a little bit of code that passes a closure into the map function. So it's a programming language like we're used to. Nothing scary about it. Right? Well, yeah, right. One of the things you have probably observed: there are no type declarations. So, no types. All right, that's fine. You can grab any string and eval it. You can load code dynamically. You can add fields to an existing object or to an existing class — in an object-oriented language we don't usually do that. You can change methods, add methods, and remove methods. You can change the hierarchy. So you can do anything you want. These features are really unfriendly to any kind of static analysis. You can't even get a control flow graph for a JavaScript program without doing a lot of work, and if I throw in eval, you probably just can't.

All right. The corpus we got for our analysis is about a hundred JavaScript programs, and in the following slides I'm just going to give you data on a bunch of them. There's Bing, there's Google mail — Gmail — there's YouTube. LivelyKernel is a programming environment written in JavaScript by Sun. And other random Web pages. Just to give you a little bit of an idea how big all of this is: here is my corpus — Bing, .NET, all of these guys above — and going along this axis is the size of the source code in bytes. So we're looking at about 100K; it goes to 400K for Apple MobileMe, above 500K for Flickr, and Facebook was more than a megabyte. That just gives you the idea that these are actually big — this is not five lines of JavaScript code.
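The slide code itself is not included in the transcript. The following is a rough reconstruction, in JavaScript, of the example being described; the exact names on the slide may differ.

    // A constructor function: called with 'new', it fills in the fields of 'this'.
    function List(v, n) {
      this.value = v;   // add a field named value and set it to v
      this.next = n;    // add a field named next and set it to n
    }

    // Every instance of List gets a 'map' method through the prototype.
    List.prototype.map = function (f) {
      const rest = this.next === null ? null : this.next.map(f);
      return new List(f(this.value), rest);
    };

    // Build the linked list 1, 2, 3 and pass a closure to map.
    const l = new List(1, new List(2, new List(3, null)));
    const doubled = l.map(x => x * 2);   // higher-order functions, no type declarations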
And going down here is the size of our traces, which gives you a little bit of a feeling for how long we interacted with each Web site. This line here is about 5 megabytes. This is a little hard to interpret without knowing the format in which we record our traces, but the point is that some interactions are shorter and some are longer.

Okay. So the first thing we wanted to find out — a standard metric when looking at programming languages — is the instruction mix: how do your programs behave? Here's a graph. Each bar is a program trace, or actually the average over all the traces from one Web site. The red part is stores, the green part roughly corresponds to different kinds of reads, and the topmost blue part is calls. This is pretty much what one would expect. So these are writes, like I said, reads, and calls; the last bar is the average. Some Web sites are a little bit different, like ImageShack and WordPress, which have almost no writes — whatever it is they're doing, it's unclear. One observation we made: there is a handful of throws and catches, almost none overall, but it does happen. The other metric, also a standard one, is how long the functions we're executing are, in number of instructions; this gives the average for each Web site. The average is about 30 instructions, with some variation. One thing to observe is that they are not that short — most of them are actually doing quite a lot of work.

>>: When you say instruction, how did you [inaudible]?

>> Jan Vitek: This is --

>>: [inaudible].

>> Jan Vitek: So we've instrumented the JavaScript interpreter, and we've recorded the number of events in the trace. There may be a few events — maybe a local variable move — that we did not record, but all of the arithmetic operations, calls, you know, mostly all of the bytecodes are recorded. So this is at the bytecode level. Yes.

Okay. The next, slightly more interesting metric is: what kind of data do these programs manipulate? Here we look at the allocated data types over all traces. You see there's a large chunk of arrays; there are dates, regular expressions, functions, objects. Instances — that means objects. Almost no errors. Prototypes — a big chunk of prototypes. What are prototypes? They're like your classes. Why are there so many? Because every function gets one prototype for free, so there will always be as many as there are functions. And anonymous objects — those are objects created with curly braces without going through a constructor. Breaking this down by Web site, we get these pictures. What should we get out of this? Is there a weird site, like WordPress, where almost the only thing being created is date objects? Same thing for Digg. Lively is mostly doing arrays, and so on. So there are some things we can get out of this. Now, moving more towards the questions we set out with originally — could you provide a type system for this, how dynamic are these programs — the next thing we measured is the prototype chain length. So what is the prototype chain length?
Well, it's really the closest thing to inheritance depth in JavaScript. So what are we showing here? The orange gives you the max — the longest chain in the trace — and the brown part is the average. What can we get out of this? Well, if you have a prototype chain of one on average, that means you have no inheritance, because everybody gets a prototype for free when created. So you see that a lot of them use absolutely no inheritance, or very little. In the case of Gmail it goes up to 12 levels of inheritance. That is already a reasonable hierarchy with some code reuse, and you'd better understand where you get the code from, and so on. Now, I should mention that Gmail is compiled from Java code, right? So there may be a correlation here. And of course you can change your prototype — x.__proto__ changes your parent class. So, you know, life is fun.

Evals. We would like to believe that these programs don't use eval, because that would make static analysis easier. Facebook uses over 7,000 evals in our traces on average. So that's not so good. And everybody uses some. Right? This line is a thousand, so you're in the hundreds pretty much everywhere. What else would we like to believe? Well, we would like to believe that most of the evals out there are really doing deserialization using the JSON format. There is a format where you can encode an object as a string, and then you give it to eval and, poof, you get the object out. So we looked at this, and the orange part of the bar is the JSON strings. And yes, they account for a lot, but that still leaves quite a few things. You see code like this: eval of "x.f = 3". It's really weird. This is seen in actual real code: I'm evaling a constant string. I could have written that code. There is no good reason why I should use eval just to slow down the virtual machine. So there is code like that. This part is still ongoing: we're trying to split these brown bars into what we call trivial evals — evals that are really not doing anything interesting — and the more dangerous ones. The results are going to come.

Okay. The next bit we found out is that in JavaScript a two-argument function can be called with one argument or four arguments. If you call it with one, the second argument becomes undefined. If you call it with four, you get an array-like object through which you can access the extras. This little bit of code calls a function with either too few or too many arguments. So how often does that occur in real programs? This is the percentage of functions that have been called with different numbers of arguments. So 10 means 10 percent of the functions defined in the program are called variadically. And you look: the average is way above 10, and in GMaps over 50 percent of the functions are called with varying numbers of arguments. So if you expect to get any information out of the signature, you should be careful. Right?

>>: Can you break those down between too many and too few?

>> Jan Vitek: Not yet. That would be interesting, yes. And another question we probably should ask is whether they are always called with the same number of arguments or not. There's fine tuning here that should happen.
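A minimal sketch of the arity mismatches being measured; the function is made up for illustration.

    // A two-parameter function in JavaScript accepts any number of arguments.
    function add(a, b) {
      // Missing arguments arrive as undefined; extras are still reachable
      // through the array-like 'arguments' object.
      console.log(a, b, arguments.length);
      return a + b;
    }

    add(1);           // b is undefined, result is NaN, no error is raised
    add(1, 2);        // the "intended" call
    add(1, 2, 3, 4);  // extra arguments are silently accepted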
But okay.

>>: Can JavaScript [inaudible] — like, can you set default values [inaudible]?

>> Jan Vitek: No. No, it's undefined. So the next metric, which is really interesting in an object-oriented setting, is the degree of dynamism in the dispatch. If I have a call x.f(), how many different methods do I actually invoke at runtime from this point in the source code? The larger this number, the more dynamic the program. So we measured this, and the way to read this graph is: this point here means there are about 100,000 call sites that are monomorphic — from that call site I always call the same method. This is very nice. If we knew which ones these were, we could inline them, we could optimize; life is good. And there is one over here — let's see — one call site that dispatches to 100,000 functions. Now you start to scratch your head and say there aren't even a thousand functions in the source code, so how could you dispatch to that many? So we thought: what if this was really dispatching to the same function, but our analysis said they were different — how could that occur? Well, if I wrote code like this: I have a constructor, List, which within its body sets the c field to some function. What is this doing? Every time I call this constructor, it constructs a new function object which has the same body but is a different function. In our first cut of the analysis we said, well, if it's a different function object we'll record it as a different function. But this next graph shows how many bodies are shared: there are 100,000 bodies that are unique, and then there are a few bodies here — there is one body that is shared by 50,000 functions, which means there is one thing that has been instantiated 50,000 times. Now, if we account for this and go back to the dynamic dispatch figure, things shift quite a lot. You end up around — I think the number here is 2,000. So the most dynamic call site in our traces dispatches to 2,000 functions that are actually different. 2,000 is still a big number, right?

Right. So another thing we found surprising and funny: when you think of adding types to a language like JavaScript, you ask, how am I going to define a type? What's my notion of a type? One naive way would be to say: I saw this constructor called List; we'll call that a type. Now, it turns out a constructor is just a function. So say I take a slightly different definition of a type: a type is a set of method and field names. And now say I look at all the uses of a constructor and check whether I always get the same type — the same set of method and field names — out of it. And the answer is no. There are about 2,000 constructors that are monomorphic — whenever you call List you get a list out — and then there is one constructor that returns 300 different types: 300 different sets of field names and method names coming out of this guy. How can that happen? Well, here's an example of code. We create a Person and, depending on the sex, we add a field. You know, that would be my son — he likes guns. So that's not too good.
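Two small hypothetical fragments in the spirit of the code being described: the first shows why per-instance function objects inflate the apparent polymorphism of a call site, the second a constructor whose instances do not all have the same set of fields.

    // Each call to this constructor allocates a fresh function object for 'c':
    // same body, but a distinct function identity, so a trace-based analysis
    // that keys on function objects sees a call site x.c() as hugely polymorphic.
    function Node(v) {
      this.value = v;
      this.c = function () { return this.value; };
    }

    // Assigning the method to the prototype creates it once and shares it instead.
    Node.prototype.shared = function () { return this.value; };

    // A constructor whose instances don't all end up with the same fields.
    function Person(name, sex) {
      this.name = name;
      if (sex === "male") {
        this.likesGuns = true;   // only some Persons get this field
      }
    }

In the analysis, collapsing functions that share a body is exactly what brings the most dynamic call site down to around 2,000.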
So I don't have a slide for this, but we started drilling down into this and figured out that these points here seem to correspond to uses of certain libraries which implement inheritance in JavaScript. So we believe we can get this down to about 50. If we get rid of the special cases it's still bad, but not as horrible as this. Now, the last thing we looked at was: assuming I knew a type, is it going to change? In JavaScript I can always add or delete fields. So do people do that? What is this? This is, per object, the number of additions (in orange) and deletions (in blue) of fields. So what this means is that in Bing, per object, I'm adding 0.95 fields and deleting a tiny, tiny fraction — on average over all the objects. This is maybe not the best way to represent the data, but what it's telling us is that there are a lot of adds and a lot of deletes. And that means it's really hard to imagine a static type system. You could perhaps imagine it if you had only adds, but the deletes really screw you up, because it's not monotonic anymore. But we thought this is way too much. This is way too much — yeah?

>>: Did you analyze the deletes to see if they were perceivable?

>> Jan Vitek: I'm sorry, they were...?

>>: So maybe they're deleted, but no code actually tests for them ever being there?

>> Jan Vitek: That would be — we couldn't do it with our framework, because we only have traces. But yes. Yeah?

>>: So you obviously didn't treat indices that are numeric as fields in this --

>> Jan Vitek: Oh, yes, we did. I'm going to come to --

>>: Oh, okay.

>> Jan Vitek: Hold your horses. You're ahead of me. So the first thing we figured out is: what if we only count changes after construction? But the question is what construction means. So we took a heuristic and changed the way we record things. We say construction lasts until the first read of a field of the object: I'm constructing an object until I read a field of it. I just disregard all of the writes that happen until the first read and call that the construction phase. If I had a class-based language, I would have defined a class there; since I don't have a class-based language, I'm obliged to add fields one at a time. So what happens? The faded numbers are what we had before, and if we take construction into account, the numbers go down quite a lot. But it's still there. The next thing we figured out is that in JavaScript, arrays and hash tables are objects, and some of these adds and deletes just correspond to adding things to an array or deleting things from a hash table. So maybe we can factor those out and get a feeling for how the programs would look if you had proper arrays and proper hash tables as data types. How do we know we have a hash table access? Well, it looks like x["name"] with a string index; an array access is x[i] with an integer index. If we remove those, the numbers go down further. So now, in many cases, you have fairly low numbers of adds, but in all cases it still happens, and in pretty much all cases we still have deletes. So we put in all the smarts we could to bring down the dynamism, and at the end it looks like, yes, these programs are truly dynamic and there's not much we can do.
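A small illustrative sketch of the kinds of accesses the heuristics try to separate; the object names are made up.

    const o = {};            // "construction" here is a sequence of field writes
    o.x = 1;
    o.y = 2;
    const sum = o.x + o.y;   // first read: under the heuristic, construction ends here
    o.z = 3;                 // a genuine post-construction field addition
    delete o.y;              // a deletion, which defeats any monotonic typing

    const a = [10, 20, 30];
    a[1] = 99;               // integer index: counted as an array update, not a field add

    const table = {};
    table["alice"] = 42;     // string index: counted as a hash-table insert
    delete table["alice"];   // and a hash-table delete, not an object mutation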
So the conclusion to this part is: yeah, the code is really dynamic. Not all of the features are used by all the applications, but there's always somebody who does something weird. And it's not clear to us that there is much hope of imposing a static type discipline on JavaScript without considerably rewriting the legacy code. And then the question is, you know, what's the benefit? Yes.

>>: So it looks like Gmail has the most dynamic [inaudible].

>> Jan Vitek: Yeah.

>>: And it says [inaudible] Java --

>> Jan Vitek: So it could be that our heuristics are not strong enough. It could be that they are doing something that is perfectly reasonable, but we just don't know how to interpret it. So, yes, clearly, these are heuristics. Yeah.

>>: Did you look at people actually changing the prototype chain at runtime, or doing other really crazy dynamic things like that?

>> Jan Vitek: Yes. We have numbers, and I think we found cases where that happens, but I can't recall how much. There were some. Well, any time it happens it's significant, because it means you can't assume it doesn't happen. Yes.

>>: I was curious — you're looking at all these Web sites, but in this disciplined Pluto project, did you analyze that code?

>> Jan Vitek: No, we couldn't get the code. That's the standard — this was the part of the work we didn't do, which we wish we had. We really wanted to get access to that source code, and it may still happen; we may have to sue. At least in Sweden they're supposed to hand it over, but they are raising some issues with privacy.

Background. There's work on adding types and doing type inference for JavaScript, on evaluating the behavior of Python applications, and on staged information flow for JavaScript — trying to split the problem by doing static analysis of the parts you can and [inaudible] propagation at runtime. There is a lot of good work on static type inference for Ruby, but looking at Ruby, it seems that Ruby is better behaved than JavaScript; I don't think their results would carry over to JavaScript. And more work: static analysis for JavaScript and so on.

Okay. So now the last part: what do we do with this? We have this idea that we would like to evolve scripts into programs. We've looked at dynamic languages and found that they're really dynamic. So how do we proceed? Here's our methodology. We said, well, let's just start with a new language. It's always fun to invent a programming language. It has two benefits [inaudible]: you can correct design mistakes and you don't have a user base. Both things are good — not really, but they have their advantages. So we decided to experiment by asking: if we design a language from scratch where the goal is to be able to go from scripts to programs, what design decisions would we take? So, together with people at IBM, we designed a language called Thorn. In the context of this talk, what is interesting about Thorn is that it lets scripts grow by gradually adding encapsulation and modularization features, which are not available in JavaScript and the like, by adding concurrency, and by adding types.
And I think I'll mention all three, but I'll spend a little more time on the types, because that's maybe the most novel part of the language. Okay. So yes, it is a programming language, and this is a simple Thorn program. It's a class-based language. We're defining a list with two fields, head and tail; there's a map function that does the apply; and here we are creating a list. So nothing too fancy. Yes.

>>: You're not scripting at all, right, because you have to have types to begin with.

>> Jan Vitek: Well — is it not scripting at all? Is Python scripting? So, defining scripting: my personal definition of scripting is lightweight and unobtrusive, probably untyped. What it is not is JavaScript. Thorn does force you to put in some structure. This was a design decision: we decided that a modicum of structure was actually a good thing. And the modicum of structure is pretty much what Smalltalk would force you to have: a class is declared, it has attributes, and you're going to tell me what those attributes are before I instantiate an object of this class. Yes.

>>: Are your class statements executable statements? Or are they compile time?

>> Jan Vitek: There is an interpreter, so you can execute them in an interpreter, I guess. But for what you have in mind, they are more compile time.

>>: So I slightly missed — do I have to house all my code in a class? Is that what you're saying? Or can I have top-level functions?

>> Jan Vitek: You can have top-level functions. You can have statements. Okay, I'm just showing this to relate to the JavaScript example earlier: this is the same bit of code done in Thorn, and it's actually shorter than the JavaScript code, for whatever that's worth. The other thing is that most of the people who write nontrivial JavaScript will use something to get the power of classes, so we might as well build that into the language. This is what version 4 of JavaScript wanted to do, and then it got bogged down in politics and didn't happen.

So, very quickly, the first set of mechanisms we want to provide is modularity. I really don't want to spend too much time on this; I'll just zip past. This is a Thorn script which is very easy to read — it does some queries and some fancy operations. And this is just a script; this is what you would write in Perl, and it's actually not much prettier, I would say. No encapsulation, no modularity. So the first thing you may want to do to this thing is wrap it in a class and create an abstraction that has a name. So you say, all right, we can create a class, it has a name, this name can be used [inaudible] and does stuff. And this is exactly the same program, but I've added a class, which adds a little bit of structure. It's still an untyped language, so we don't have types, and we don't have access control. That's sort of hard to do — or, we haven't found a nice way to do access control in an untyped language. What do I mean by access control? I mean all the fields are really visible and accessible.
We considered making some fields private — strongly private, as Smalltalk would do, meaning you can only use your own copy of the field, not even those of another object of the same class — but in the end, for simplicity, we decided not to go there. So that's the first step. Then, the second step: if you want to package code and distribute it, you need some bigger unit than just classes, and you also need to support visibility and some access control. For that we bring in modules. A module is a collection of definitions. What they give you that is a little bit interesting, and that you don't have, say, in Java, is a keyword called own, which gives a little bit of control over linking. What it says is: import your own copy of that other module inside this module. Which means you can have several versions of, say, a parser. One of the examples that I [inaudible] Java is: you have a big application that uses two different versions of an XML parser in different parts, and the only way to do this in Java is to use class loader tricks. Here you can say, import my own version of this, and it's local, it's different from all the other versions, and the state is not shared. It is a little bit nicer. And finally, the last encapsulation mechanism, which is still different from modules — what modules give you is name control, control of a namespace, and a little bit of control over linking — is what we call a component. A component is really an isolated unit, isolated in terms of no sharing of references. Components can only talk by message passing — so this is a synchronous method of the component — and there will be no sharing between one component instance and another. You are guaranteed by the language that these are completely isolated little islands of objects. So these are three different mechanisms for modularity and encapsulation. And components are also a unit of concurrency. Yes — sorry?

>>: I was going to ask how components are different from a class, but then you answered.

>> Jan Vitek: Right. So here we're spawning a component and we're sending messages to it. There's more to say about concurrency, obviously, but I'll be very, very short. We started from the observation that most scripts are written sequentially nowadays, and we didn't want to introduce shared memory, because then you lose that sequential thinking, which matters for a lot of the audience of scripting languages — they're not ready for concurrency, period. So what we said was: sure, you can think sequentially, and then you wrap it up as a component and you get message passing — essentially some form of an actor model. So components — you can think of them as Erlang processes, actors: they have a mailbox, there are some interesting properties, but it's not particularly different. Yes.

>>: Does this mean one of these invocations of a method on a component [inaudible] copy [inaudible]?

>> Jan Vitek: So the question is whether we do deep copying, and it depends.
If the components are co-located in the same address space — we have support for what we call pure data types, which are immutable — those can be passed by value. Otherwise, yes, you would have to do a deep copy, and definitely if they're on distributed nodes you will end up copying. But the idea here is: what is a simple concurrency model you can slap on top of the sequential model without completely breaking your sequential programs? Yes.

>>: So [inaudible] --

>> Jan Vitek: Oh, you have synch and not synch. So there is a synchronous call and there is an asynchronous call.

>>: [inaudible].

>> Jan Vitek: Oh, you're thinking implementation now. That's a choice. Let's see what our current implementation does — I think we've tried both: one thread per component, and sometimes thread hijacking. You can play around with what works best. Yeah. So that's about all I want to say on concurrency.

And the last bit of the talk is how to add types. To recap: dynamic type checking is great. Anything goes; we can do anything until it stops working. One of the benefits of dynamic typing is also that you can run a program even when bits are missing. And that is good — that's actually good. You can run broken programs as long as you don't need the parts that are broken. And static typing we know and like; it has the good properties that you get earlier warnings and you can generate faster code. So the question was: can we have both of these in the same design? How shall we do this? Here's an example. Here's a bit of code in Thorn, say: there's a class Foo that has a method bar which takes an x. Until now I've shown you completely untyped code; let's assume we have a syntax for writing type annotations. Colon int, let's say, means x is expected to be an int. And, for instance, I can create a variable a of type Foo. All right, so let's say we do that. What we really want is to be able to call typed code from an untyped context — more specifically, to call a.bar with an x where x is an untyped value. We would like to do that because it lets us interface typed and untyped code. So how can we do this? Well, what we could do — it's easy enough — is say that the runtime will check that x is actually compatible with the type you declared. But then the question becomes: when do you perform the check, and how long does it take? If it's really fast and you can do it at the call, then life is good. It turns out that's not the case. For int it probably is, yes. But let's take a slightly more interesting example. Same kind of thing: you have an interface Ordered that defines just one operation, compare, which takes an Ordered and returns an int. Then you have a function sort that takes an array of Ordered and returns an array of Ordered. And then you have a call to sort with an x where x is unknown. Same question: when do I do the check, and how long does it take? Say I do the check when I call it. How long does it take me? Well, I have to check that x is an array — that's probably fast. And then I have to check that all the entries of x are Ordered. Best case, that's linear in the length of the array. A million entries, a million operations.
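A sketch of what an eager check at the call site would have to do, under the assumption that "ordered" simply means "has a compare method"; the helper name is hypothetical.

    // Checking that x really is an array of ordered values is at best linear:
    // one test per element, so a million entries means a million operations.
    function checkArrayOfOrdered(x) {
      if (!Array.isArray(x)) throw new TypeError("not an array");
      for (const e of x) {
        if (e == null || typeof e.compare !== "function") {
          throw new TypeError("element does not implement compare");
        }
      }
    }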
But that's not the worst of it. Let's assume the array is mutable. Then, right after the check, somebody else could update one of those entries and invalidate what I've just checked. And then I can't do anything, and the check becomes nonsense. So here is an idea other people have had — this is not one of ours: what if we add a wrapper around the value, and that wrapper checks that all the interactions between sort and the unknown value obey the contract that this is an array of Ordered values? So we're putting a dynamic contract around the variable, or the object pointed to by that variable, and the contract simply makes sure that when you interact with this thing, it behaves like an array of Ordered. If at any point in time it stops behaving like an array of Ordered, what do we do? Well, you throw an exception, say. So in some sense what the compiled code would look like is: when you call sort, you add this little operation here which creates a contract, a wrapper, around x to make sure it's an array of Ordered. And that's sensible. You can then feed a dynamically typed value into statically typed code, and if at some point in the computation the dynamically typed value behaves badly, you get an error, which is good. The downside is that you can't really optimize the code. If you remember our previous example: if this were a statically typed context, I could unbox the int and perform the addition on native ints — I wouldn't have to have objects here. But if there's even the possibility that anything labeled int is a wrapper around something else, then my compiler can't optimize anything. It has to keep residual code all over the place to check: is this a wrapper, and if it is, go through the wrapper.

Okay. So that's what was proposed in the past, and we decided to do something different. This is the design we have in Thorn. The basic principles we were following: we wanted to make sure that we don't reject correct dynamically typed programs. The idea is that you should be able to add these annotations and the programs that used to work should still work. There are subtle issues there, because if you're not careful you can add something slightly too constraining, and then suddenly you get a type error — but the program would have worked. You have to be careful about that. We want to be modular: nothing whole-program about our approach. And lastly, we want to design something that rewards good behavior. We want to tell the programmer: look, you give information to the system, and as a benefit you will get something out of it — either a guarantee of correctness or faster running times. Now, the problem with the previous design, with the wrappers, is that there are [inaudible] cases where, by adding type information, you can slow down the program arbitrarily, because you're creating these wrappers at runtime: there's data you're allocating and have to manipulate. So that's the motivation. What did we do? We decided to have a type system that is sort of in between static and dynamic. We call these like types. And the idea is that the dynamic parts of the system will be just as dynamic as if this was Smalltalk.
The static part can be compiled just as if this was C#. And the middle — well, the middle will give you a little bit of error checking, probably not much performance improvement, but a little bit of assurance. That's what we're shooting for. So in our setting, every class named C gives rise to a type C and also a type like C. And the compiler will check that if you have declared a variable to be like C, you actually use it as a C. Example. So here's a class Point which has two fields, x and y, and two methods. And it has a move method that takes a p and moves the point by grabbing the coordinates of the argument and storing them in the receiver's fields. Here the fields are declared int, and move, as it stands, is untyped. When we say nothing, think of the type as dyn, for dynamic. So what that means is that this method will work as long as you feed it something that has getX and getY methods that return ints. Okay. So what we do in our language, if you want to add a little bit of safety, is say: well, it should really be a point. So I'll write that: I'll write that p should be like Point. What does that mean? It means that now the body of the method — the scope of p — will be type checked as if p were a Point. If I have a type error, it will be caught. One of the advantages of this design is flexibility: we don't prevent code from calling this method with something completely different. All we're doing is type checking the body. So, for instance, assume I had a class Coordinate which also has getX and getY but has nothing in common with Point — they are not subclasses, they have nothing in common, they just happen to agree on those two methods. Well, if I have a Point and I call move on it, I can perfectly well pass a Coordinate. The type system will be happy, because the argument is untyped, so the type system can't do anything there. And when I call the method, we only said the parameter was like a Point, so there are no runtime checks: we just feed it through and it runs just fine, because the Coordinate happens to have the two methods.

>>: That is essentially, then, like a shadow interface that basically replicates — I mean, so basically [inaudible].

>> Jan Vitek: It's just as if you had an interface for Point, and it's just going to check the body of the method, nothing more. Yeah.

>>: [inaudible] in this case the Coordinate has to have integer [inaudible], otherwise it won't work [inaudible].

>> Jan Vitek: Right. Yes. Yes. We said it's like a Point, so we look at Point, and it says, all right, if I use getX, the argument had better have a getX, and the type is --

>>: Does that mean [inaudible] check more than is necessary, right? Because maybe you don't depend [inaudible] the return type is actually int. And so if a Coordinate were to return something else, it would not type check.

>> Jan Vitek: It would not type check. It --

>>: [inaudible] couldn't call it.

>> Jan Vitek: No, you could call it, because we're not doing any type checking at the call point. If the argument is dynamic, we're saying, oh, whatever. Right? So it would fail the type checking here. And then — okay. So you --

>>: [inaudible] you might have more constraints there, say, than --

>> Jan Vitek: Yes. It may be too constraining.

>>: [inaudible].

>> Jan Vitek: But object would be fine. Yes.
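Thorn's like-type syntax is only described, not shown, in the transcript. As a loose analogy in JavaScript: editor-checked JSDoc annotations give a similar flavor — the annotation constrains how the body may use the parameter and feeds the IDE, while nothing changes at runtime. The analogy is imperfect (such editors also check call sites structurally, whereas Thorn deliberately does not), but it conveys the "type-checked comment" idea.

    // @ts-check  -- in an editor such as VS Code this enables static checking of JSDoc.

    /** @typedef {{ getX: () => number, getY: () => number }} PointLike */

    class Point {
      constructor(x, y) { this.x = x; this.y = y; }
      getX() { return this.x; }
      getY() { return this.y; }
      /** @param {PointLike} p */
      move(p) {
        this.x = p.getX();   // checked against the annotation: fine
        this.y = p.getY();
        // p.hog();          // would be flagged in the editor, like the talk's example
      }
    }

    // A Coordinate that merely happens to have getX/getY still works at runtime:
    class Coordinate {
      constructor(x, y) { this.x = x; this.y = y; }
      getX() { return this.x; }
      getY() { return this.y; }
    }
    new Point(0, 0).move(new Coordinate(3, 4));  // no runtime check, no wrapper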
>>: So what if the body does not depend on the return types --

>> Jan Vitek: Well, then --

>>: -- and coordinates, say, do not have them [inaudible], then you wouldn't do any optimizations even though --

>> Jan Vitek: No, you are not going to do any optimizations in this code.

>>: So what am I getting?

>> Jan Vitek: Ah, what are you getting. Well, you're getting this: imagine that now you add one line to your function move, and that line says p.hog(). Well, hog is not a method or a field of Point, so my IDE can now flag this as a compile-time error. Another advantage is that I can get name completion, because I've given a hint to my IDE about what this thing behaves like. So I'm getting local correctness. It's a very shallow property, but it's useful, especially if you plug this into your IDE, because now you can have intelligent assist. So that's one thing you get: compile-time errors here. I'm running long, so I'll skip this slide. As a summary: a like type is a promise about how a value will be treated locally. We still have concrete types — if you don't say like, you have a real type, and that real type can be compiled efficiently. When you have something that is concrete, it will never be anything but the type you declared. So the concrete code can be checked normally, the dynamic code stays dynamic, and like types let you move gradually from dynamic to static. Yeah.

>>: [inaudible] function with a concrete parameter type --

>> Jan Vitek: Yeah.

>>: Can I still call it on anything?

>> Jan Vitek: So, if I have a function with a concrete parameter type, can I call it from a dynamic context? Yes. What you get in that case is a concrete type test: it will just reject the value if it's not exactly the type you declared. If you have a like type, you can pass anything. Yes.

>>: So [inaudible] you're saying, if you have this like-typed parameter, you're saying this method guarantees it will use it according to that --

>> Jan Vitek: Protocol, yes.

>>: But is that really true? Because suppose I have an upcast, right — that is, I'm passing it to some dynamic context. From there on, who knows what that context does.

>> Jan Vitek: Yes, then who knows what that — okay. So what it really says is that this variable will be treated as if it belonged to that type. Now, if you cast this to another variable that has --

>>: The object.

>> Jan Vitek: Not the object. It's a very syntactic property.

>>: So at the call site I really have no idea [inaudible].

>> Jan Vitek: Yes. Think of it as a type-checked comment. This is also what has been advocated for code contracts. We're taking a comment that the Smalltalk guys would have embedded in the name of the variable and saying: just write like this, and the IDE will take care of it, and you get this checking.

All right. So we said we want to reward programmers. What we did was implement a compiler that uses these like types, and then we took a program, annotated it, and compared the performance of the two versions. So here are the results. Let's see: this is dynamic Thorn. These are some shootout benchmarks, and the numbers are normalized to the performance of Python.
So dynamic Thorn is about 1.5 times slower than Python, sometimes 2 times slower. And with a few type annotations — in the tens — we get to 25 or 50 percent faster than Python. So what does this say? Well, it says that for a naive implementation, this is what you can get. I'm not making any stronger claim than: look, with really very little work we went from something that was slower than Python but faster than Ruby to something that's faster than Python. Yes.

>>: Does the speedup come from adding concrete types or like types?

>> Jan Vitek: Ah, good question. It was mostly from concrete types, and mostly because we could unbox some operations on integers. So take this with as much salt as you want. The point I'm making is that for a very naive implementation of a programming language — yes, we spent a long time on it, but not much on optimizations — we got a big speedup on two programs. Or three. Yes.

>>: Following up on your last comment: what if you took the like types but then gave them the option of saying, hey, this also has to be an integer and will always be an integer, and then they can exploit that sort of --

>> Jan Vitek: You can imagine that you would get similar speedups. Right. So there's a lot of related work. Yes.

>>: A follow-up on that. You said you found things were used at different types [inaudible].

>> Jan Vitek: In JavaScript, yes.

>>: In JavaScript. Did you break the data down at all in terms of, like, how often — was it that certain objects and other objects were freely mixed, but integers were always integers?

>> Jan Vitek: I really can't answer that one. I don't know.

>>: [inaudible] people who write programs are pretty flexible, but, you know, if they're indexing into an array, it's always going to be [inaudible].

>> Jan Vitek: My feeling is that there weren't that many numbers in those programs; it was mostly strings and hash maps. But yeah.

>>: So the annotations that you added — were those annotations that could have been inferred [inaudible]?

>> Jan Vitek: Perhaps. I can't say. We didn't try to infer them; one would have to look. In some cases, yes. But the point is you only have so much time, and if you have to spend a lot of time putting in heavy-duty type inference — whole program, control flow — it costs you. So the only point I'm making here is that with very little you can speed up these programs. I don't want to claim more than we've actually achieved. Yes.

>>: So the like types, do they help you --

>> Jan Vitek: For speed, no.

>>: -- [inaudible] at all?

>> Jan Vitek: Not in our current implementation.

>>: Only for the IDE [inaudible].

>> Jan Vitek: The IDE, the checking. And they are this sort of midway point between the two. In the first step of a refactoring you put like everywhere, and then when you're more comfortable you remove the likes.

>>: I think you said this, and I just want to make sure: you weren't doing any inferencing at call sites or through [inaudible]?

>> Jan Vitek: No. No, no, no.

>>: So essentially what you did is you went into a function, you checked the type, and then for the lifetime of that variable your calls were optimized?

>> Jan Vitek: So you mean these numbers — how did we get them? Is that the question?
>>: Was your basic strategy that you would enter the function, check the type, and then you would just [inaudible]?

>> Jan Vitek: Right. Yes. Roughly. Yeah.

>>: So then another part of this, which I'm very curious about and which I guess you haven't done yet, is how programmers will react. Because, I mean, I think they both --

>> Jan Vitek: That's the downside of no user base.

>>: But you might — I mean, with some of these things, like like types, you might catch way more bugs at compile time --

>> Jan Vitek: Yes.

>>: -- and that might save a lot of time, but they might react allergically to having types at all, because they're so used to dynamic environments.

>> Jan Vitek: But then the selling point is: look, you can make it faster. There's always this hope that, give me more information and it will be faster. Yeah.

>>: [inaudible].

>> Jan Vitek: Well, like types are sort of the gateway drug to speed. You start by putting in some like types because they don't break your program: as long as it compiles, it will still run.

>>: It's documentation that helps the IDE.

>> Jan Vitek: Yes.

>>: That's a great thing.

>> Jan Vitek: Yeah.

>>: Well, there's some possibility, right, that you could take a program with like types and infer [inaudible] where those --

>> Jan Vitek: So that's what I said: up to now we haven't used them for speed, but one could imagine doing more. Sorry?

>>: [inaudible].

>> Jan Vitek: So there is a lot of background work; I can't cover it all in detail. It starts with Findler and Felleisen on contracts, and Bracha's work on Strongtalk, the Strongtalk type system for Smalltalk. Gradual typing. There was an effort to do something fairly similar: ValleyScript was never published, but it's a draft that Flanagan wrote for ECMAScript 4, and it had some similar features, we found out recently. And a lot of these good things. This work was done, as I said, in the context of the collaboration with IBM.

So, concluding. We have a language design that lets us gradually evolve scripts into programs by adding annotations, concurrency, and modularity. We believe it addresses some of the issues that the people on the Pluto system from the beginning of the talk raised, which were concurrency, modularity, and typing. And one of the things we try to do is make sure there is a reward for programmers — you get IDE support, you get potential speedups — so that they will actually use it. And there's a Web site where you can go and play around with the language. Yes.

>>: I'm just curious, when you cited the related work, why didn't you — did you look at Common Lisp at all?

>> Jan Vitek: At Common Lisp? No. No. Yeah.

>>: Just wondered if it was --

>> Jan Vitek: I suspect that it has either influenced or been influenced by the work by Felleisen, because they have been in the Scheme and Lisp community. But, yeah, I haven't looked at Common Lisp itself. Yeah.

>>: In a similar vein, how about the runtime typing [inaudible] that Chambers and Ungar did in Self?

>> Jan Vitek: So the Scheme community — Felleisen and his students — have been looking at soft typing, which is similar to that, for 20 years. And I'll just relate his conclusion to you: it doesn't work. And what was his conclusion based on?
He claimed that it's very brittle: small changes in the program cause changes in your results. You never know — you don't have a performance model; you don't know what the compiler has inferred and what it hasn't. And when you get errors, they are incomprehensible. Your mileage may vary, and all that. I'm just saying that they've really looked at this, and the Scheme community is at least moving towards declared types in their programs; they're giving up on inference. The main problem is that it is not local: as soon as you go interprocedural, you tickle it here and it laughs over there. And it's just --

>>: [inaudible] wasn't — it wasn't static, right? It was looking at runtime, looking at particular call sites, asking what the common type of a particular argument is, and then optimizing that path.

>> Jan Vitek: Ah — I was thinking about another piece of Self work. There was some work that [inaudible] did.

>>: This is Chambers and Ungar and [inaudible].

>> Jan Vitek: Yeah. So, yeah, I was thinking about another one.

>>: Has anyone thought about this problem [inaudible] making it purely an IDE issue? Like, have a strongly typed language and let someone simulate the process of writing in a dynamically typed language, and instantly infer the most general strongly typed program that fits what you've written?

>> Jan Vitek: There has been some work on that, and the problem is you get a proliferation of types; they're not all meaningful, and you get type mismatches. Sometimes the programmer has to change the name of things. Inference just doesn't seem to work very well.

>> Juan Chen: Let's thank the speaker, then.

[applause]