Dear Mr. JeeHyun Hwang:

I am sorry to inform you that the following submission was not selected by the program committee to appear at DSN-DCCS 2009:

    Fault Localization for Firewall Policies

The paper selection process was very competitive. Only 37 papers out of 177 submissions were selected for this year's program (21% acceptance rate). However, I hope the review comments are constructive and helpful to your future work. I also hope that you can attend DSN 2009 in Estoril, Portugal at the end of June this year. Note that the many workshops at DSN could give you an opportunity to present your work in this year's DSN (the deadline is March 16); see www.dsn.org for details.

I have enclosed the reviewer comments for your perusal. Note that the reviewers may have augmented their reviews and changed their scores after the author response period, based on on-line discussions before the PC meeting and face-to-face discussions at the PC meeting.

Best regards and good luck with your future work!

Matti Hiltunen
DSN-DCCS 2009 PC chair

====================================================================
DSN-DCCS 2009 Reviews for Submission #185
====================================================================

Title: Fault Localization for Firewall Policies
Authors: JeeHyun Hwang, Tao Xie, Fei Chen and Alex Liu

====================================================================
REVIEWER #1
--------------------------------------------------------------------
Reviewer's Scores

    Presentation and English: 3
    Relevance for DCCS: 4
    Novelty: 3
    Contribution: 3
    Technical correctness: 2
    Confidence on the review: 2
    Overall recommendation: 4

--------------------------------------------------------------------
Comments

This paper describes a method for isolating flaws in firewall specifications. The idea is that one uses test packets to identify a flaw, and then uses their reduction techniques to determine automatically which rules produced the flaw. The test packets themselves are generated automatically in order to guarantee coverage. The authors also include results of an experiment involving mutated firewall policies, and report good results. This looks like interesting work that could be useful. A few comments:

The description of the technique, which is the meat of this paper, begins with some examples and then continues with a high-level overview. This makes it difficult to get a firm grasp on how their technique works. If the authors included, say, a pseudo-code description of their algorithm, that would be helpful. I would also recommend that they put the high-level overview *before* the examples.

Throughout this paper I kept asking the question: how do you know whether a firewall policy is correct? The paper doesn't attempt to answer this question. It assumes that there is correct, expected behavior and that it is possible to tell whether acceptance or rejection of a packet conforms to this expected behavior. It is beyond the scope of this paper to attempt to come up with a way of specifying correct behavior, but according to the related work section of the paper, others have addressed this problem. It would be helpful if the authors devoted more space to the discussion of this work and its relevance to their own. If the authors could show that their experimental approach addresses the kinds of correctness criteria and the kinds of mistakes that arise in real life, this would be very useful.
Even if it doesn't, a discussion of how their experimental approach could be extended and/or modified to address this would be very informative.

====================================================================
REVIEWER #2
--------------------------------------------------------------------
Reviewer's Scores

    Presentation and English: 3
    Relevance for DCCS: 4
    Novelty: 3
    Contribution: 2
    Technical correctness: 1
    Confidence on the review: 3
    Overall recommendation: 2

--------------------------------------------------------------------
Comments

The paper presents techniques for localizing faults that cause failed acceptance tests during firewall testing. The localization can occur either as an identification of a single faulty rule (for faults that have incorrect decisions), or as a ranked list of a reduced set of rules indicating the order in which rules are likely to have the fault.

The topic is an interesting one, and although the paper would be strengthened if the authors had found data to show the prevalence of firewall problems and security incidents related to incorrect rules in the field, I am willing to accept that this is a real problem. Although I would like to encourage the authors to continue exploring ideas on this topic, I feel that in its current state, this work is still too preliminary and lacking sufficient depth for a full technical paper. While the proposed methods make some smart observations, there are problems with them -- the method to identify rule decision changes seems to be incomplete, and an important fault class is not considered at all (see comments below). None of the experimental results are overwhelming (e.g., based on a large study of real firewall faults). Finally, neither is the reduction in the number of rules that have to be considered (about 30%) large enough to merit inclusion of the paper on that ground alone. My comments below also highlight areas in which the work could be improved to make a nice paper.

Detailed comments:

* The organization of the paper could be improved by simply merging Sections 3 and 5 (since there isn't that much depth in Section 5), and putting the new section after Section 4. In its current form, Section 5 doesn't have that much to add that isn't covered in Section 3.

* Section 2.4: test packets are generated based on solving the constraints imposed by each rule in isolation. Does this affect test coverage? E.g., a rule that is deep down, such as the default deny rule, is hard to reach because of previous rules. Do you have any insight from the policies you have studied as to how prevalent the problem is? Would a combined constraint system be a better solution for this?

* On the technique to detect RDCs in Sections 3.1 and 5.1: it seems you have missed the case where all the failed tests belong to a single rule r, but r also has instances of successful tests. In that case, it is not possible to assume that the fault is necessarily an RDC.

* An important firewall fault class, "incorrect rule order", is completely omitted by the paper. This fault class can show symptoms that resemble both those of an RDC (e.g., if a redundant rule is moved up the chain so that it no longer becomes redundant) and those of an RFC. Therefore, the techniques, as they currently stand, cannot differentiate between the faults they were designed for and incorrect rule order faults, making the applicability of the proposed techniques to real problems suspect.

* The description in Sections 5.2 and 5.3 is sloppy -- the notation (e.g., rs, c, etc.) needs to be defined. There are also several typos (in Sec. 5.3 especially), making equation 3 incorrect (I assume FF(c) means FF(r)?). The description in the text for the equation does not match the equation itself.
* When computing the ranking-based reduction percentage, shouldn't you use r' instead of r? That is, further reductions due to ranking should be compared against the reduced rule set (and not the original).

* I liked the results section, and the authors have done a fairly good job at generating lots of scenarios to test their techniques. However, how realistic are the fault models used? Couldn't you use the previous literature on "common firewall mistakes" (e.g., [2, 17]) to produce more realistic firewall faults for your experiments?

* I notice that none of the techniques proposed by the authors use information from successful tests for further localization. It seems to me like there could be a lot of useful information and scope for better localization if both successes and failures are used in a smart manner. Perhaps this is an opportunity for the authors to strengthen the work.

* The comparison to Marmorstein et al. in Sec. 7 didn't make sense to me. In one sentence, you say that they propose a technique to identify 2 or 3 faulty rules in a firewall policy. However, in the next sentence, you say that they do not provide a methodology to identify faulty rules. Don't these sentences contradict each other?

====================================================================
REVIEWER #3
--------------------------------------------------------------------
Reviewer's Scores

    Presentation and English: 4
    Relevance for DCCS: 4
    Novelty: 3
    Contribution: 2
    Technical correctness: 3
    Confidence on the review: 3
    Overall recommendation: 3

--------------------------------------------------------------------
Comments

This paper describes an approach to find faulty firewall rules. Single faults are assumed. Debugging firewall rules is currently a manual/tedious process, and the contribution is thus relevant. The proposed approach is based on the testing procedure published by the same authors in SRDS 2008, reference [8] in the submission. In that paper the focus is on determining the packet set for testing a firewall, while in this paper the focus is on finding the faulty rules.

The proposed approach is based on the following. (1) If a faulty firewall rule exists, then this rule is covered by a failed test; with this approach one can locate decision faults, in which a rule incorrectly specifies "accept" for a packet to which it should have specified "reject" (or vice versa). (2) For interval faults, in which a firewall rule specifies the incorrect interval for acceptance/rejection of a packet, an approach for reducing the number of rules to be inspected is proposed. First of all, the rules for inspection are located above the first rule that results in a failed test. Then (3) the faulty rule's decision must be different from the decision of rules covered by failed tests. (4) Finally, an approach is given to rank rules based on test coverage.

These strategies are presented in an ad hoc way, i.e. this reviewer would like to have found a proof of correctness for each of the proposed procedures. Instead of presenting a section with short examples (Sec. 3) and then only specifying the strategy in Section 5, this reviewer suggests that the authors write one section describing (actually, specifying) each strategy, followed by examples. This will avoid the redundancy in explaining concepts.
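For concreteness, a minimal sketch of how strategies (2) and (3) above could be specified follows (this is only this reviewer's reading, in Python; first-match semantics, the rule representation, and the helper names are illustrative assumptions, not the paper's notation or implementation):

# Minimal sketch of strategies (2) and (3); not the paper's notation.
# A rule is a (name, predicate, decision) triple, where predicate is a
# function from a packet to True/False and decision is "accept"/"discard".

def first_match(rules, packet):
    # Return the index of the first rule whose predicate matches the packet
    # (first-match semantics), or None if no rule matches.
    for i, (_name, predicate, _decision) in enumerate(rules):
        if predicate(packet):
            return i
    return None

def candidate_rules(rules, failed_packets):
    # failed_packets: packets from failed tests, i.e., packets whose observed
    # decision disagreed with the expected one.
    candidates = set()
    for packet in failed_packets:
        i = first_match(rules, packet)
        if i is None:
            continue
        observed_decision = rules[i][2]
        # Strategy (2): keep only rules above the first rule matched by the
        # failed packet. Strategy (3): keep only rules whose decision differs
        # from the (incorrect) decision observed for that packet.
        for name, _predicate, decision in rules[:i]:
            if decision != observed_decision:
                candidates.add(name)
    return candidates

# Example: with rules R1..R4 and a failed packet first matched by R4, only
# those rules among R1..R3 whose decision differs from R4's remain candidates.

A specification at roughly this level of precision, followed by the examples, would make the strategies much easier to check.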
In order to evaluate the proposed approach, the authors executed a large enough number of experiments using real firewall policies and fault injection. The results show that about a third of the rules on average are reduced for inspection (lowest value 7%, highest 51%), while about 50% of the rules are reduced after the ranking approach is employed (lowest 33%, highest 68%).

The paper is very well written and organized; a few suggestions for improvement follow:

* [Sec. 1] "help reduce effort to locate" -> "help reduce THE effort to locate"
* [Sec. 1] "including a sing fault" -> "including a single fault"
* [Sec. 1] The Introduction becomes redundant when you list the "four main contributions"; they were made clear in earlier paragraphs.
* [Sec. 1] "Section 8 concludes." -> "Section 8 concludes the paper."
* [Sec. 2] Please clarify what you mean by "When firewall policies do not include many conflicts, this technique can effectively generate packets..."
* [Sec. 3] This section might be called "Description with Examples" instead of just "Example"; see comment above on structure.
* [Sec. 3] "This ranking technique.... and compute suspicions" -> "computeD suspicions"
* [Sec. 4] Why do you use the word "change" in "Therefore, a change (fault) in p can introduce a faulty behavior"?
* [Sec. 5] "In our approach, we first the first technique"
* [Sec. 5] "Using two preceding" -> "using THE two preceding"
* [SubSec. 5.3] "a technique to rank based" -> "a technique to rank rules based"
* [SubSec. 5.3] "For each rule, Suspicions value" -> "For each rule, the Suspicions index"
* [SubSec. 6.1] Please avoid repeating the expression "Our tool" in each sentence; for instance, you might use "The tool then" to improve the text.
* [Sec. 6] "We defin define the ranking"
* [Sec. 7] Please explain why "it is difficult to use their techniques to locate faults..."; in what sense is it difficult?

====================================================================
REVIEWER #4
--------------------------------------------------------------------
Reviewer's Scores

    Presentation and English: 3
    Relevance for DCCS: 4
    Novelty: 3
    Contribution: 3
    Technical correctness: 2
    Confidence on the review: 2
    Overall recommendation: 4

--------------------------------------------------------------------
Comments

Firewall policy testing is used to detect conflicts or inconsistencies in a firewall policy by emulating firewall filtering against test packets and observing their decision by various rules in the policy. "When a conflict/inconsistency is detected during such testing, how does one narrow down and locate the individual rules responsible for the conflict/inconsistency?" This practically important question (currently answered manually) is addressed by the techniques presented in the paper. The paper builds upon the previous work of the authors for structural testing of firewall policies published in SRDS 2008. The paper models two fault types, Rule Decision Change (RDC) and Rule Field Interval Change (RFC), based on whether the fault is in the decision part of a firewall rule or in the predicate part. Three techniques for fault localization (one for RDC and two for RFC) are presented and evaluated using instrumented versions of real-life firewall policies.
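For readers unfamiliar with the fault models, a small sketch of RDC and RFC as mutation operators, as this reviewer understands them, is given below; the rule layout (per-field integer intervals plus a decision string) is assumed for illustration only and is not the paper's format:

# Illustration of the two fault models as mutation operators (assumed rule
# layout: a policy is a list of {"fields": {field_id: (low, high)}, "decision": ...}).
import copy

def inject_rdc(policy, i):
    # Rule Decision Change: flip the decision of rule i.
    mutant = copy.deepcopy(policy)
    mutant[i]["decision"] = "discard" if mutant[i]["decision"] == "accept" else "accept"
    return mutant

def inject_rfc(policy, i, field, delta):
    # Rule Field Interval Change: shift one field's interval of rule i.
    mutant = copy.deepcopy(policy)
    low, high = mutant[i]["fields"][field]
    mutant[i]["fields"][field] = (low + delta, high + delta)
    return mutant

# A toy two-rule policy; field 0 could stand for a source-IP range.
policy = [
    {"fields": {0: (1, 5)}, "decision": "accept"},
    {"fields": {0: (0, 10)}, "decision": "discard"},
]
rdc_mutant = inject_rdc(policy, 0)        # rule 0 now wrongly discards
rfc_mutant = inject_rfc(policy, 0, 0, 2)  # rule 0 now covers (3, 7) instead of (1, 5)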
The paper is a difficult read in certain parts, and the presentation could be improved:

* Given that the numbers of tests in Figures 2 and 3 are small (only 10 and 12, respectively), the description can be made clearer by including the actual test packet-decision pairs used in those examples.
* What is the rationale behind the metric for ranking-based rule reduction percentage presented in Section 6.2?
* Run a spell and grammar checker.
* The suspicion metric for Rule R3 in Figure 3 is given as 0.44, but in the description it is incorrectly given as 0.33.
* Change PT(r) in Section 5.3 to FT(r).
* Change "sing fault" in Section 1 to "single fault". Run a spelling & grammar checker to correct such errors found throughout the paper.

====================================================================
REVIEWER #5
--------------------------------------------------------------------
Reviewer's Scores

    Presentation and English: 4
    Relevance for DCCS: 3
    Novelty: 3
    Contribution: 2
    Technical correctness: 3
    Confidence on the review: 2
    Overall recommendation: 3

--------------------------------------------------------------------
Comments

This paper builds on a previous one (ref 6). The previous paper describes the production of reduced test sets to test firewall policies. This new paper uses exceptions to that testing to identify possibly erroneous firewall policies.

This is almost entirely a theoretical approach. Two plausible error types are assumed, created, and detected without any reference to actual real-world errors. This approach detects what it detects, and provides shortened lists of rules to check. This is a worthwhile goal. But I can't tell if it would actually be useful, or if the approach has much of a future. Presumably, it didn't detect anything in the eleven available firewall rulesets. The lengths of these rulesets are typical for a small installation, and fairly easy to audit by eyeball in most cases. Reducing this number is not a big deal.

I would be much more excited about this paper if the techniques were run on one of the rulesets reported by Wool that had 5,000 entries. Discovery of problems, and excitement on the part of that firewall administrator, would make this a much stronger paper. (I realize the difficulty in obtaining access to such rulesets. But the promise of a cleaner ruleset ought to be enough to find a collaborator.) The results in ref 13 are a good example. I like the idea of applying the lessons learned from software development and dependability to firewall rulesets, but those fields wouldn't have much to offer if all the programs were a couple dozen lines long.

Some parts of the explanation are murky. In Figure 2, why do we know, or even suspect, that R4 is erroneous? It would be appropriate if we wanted to accept [2,2,5]. Your explanation of why a rule test failure implies a possible error could use some clarification.

Page 2, para 4: s/sing/single

In Section 3.2, para 4, sentence 3 is cast awkwardly; the series of commas in "ri, r1, r2", etc. confuses the sentence.

--------author response by JeeHyun Hwang---------------------------------

We greatly appreciate the reviewers' detailed and constructive reviews.

To Reviewer 1

(1) The correctness of the RDC fault localization technique, the first rule reduction technique, and the second rule reduction technique can be formally proved; the proofs will be included in the final version.
To Reviewer 2

(1) The rationale behind the metric for the rule ranking technique is that the number of clauses in a predicate that are specified wrongly is typically small. Therefore, we first examine the potential faulty rule that has the smallest number of false clauses over all failed packets (a small illustrative sketch of this ranking appears at the end of this response).

To Reviewer 3

(1) We are conducting our experiments over firewall policies with hundreds of rules; in fact, the larger the firewall policies that our approach is applied on, the more benefits are achieved by our approach.

To Reviewer 4

(1) The two types of firewall faults, RDC and RFC, that we proposed are realistic. The typical scenario for the RDC fault is that a firewall administrator forgot to change the decision of a legacy rule that is no longer valid. The typical scenario for the RFC fault is that the firewall administrator adds a new rule to a firewall policy, but the new rule overrides the rules below it and causes errors.

(2) In Figure 2, R4 is a faulty rule because we injected this fault by changing its decision to be incorrect.

To Reviewer 5

(1) Firewall errors are mostly caused by incorrect rules. In the paper "A quantitative study of firewall configuration errors", by studying a large number of real-life firewalls, Wool found that more than 90% of them have errors in their rules.

(2) On the choice of test generation based on constraint solving, our previous SRDS'08 paper observed that tests generated by solving each individual rule's constraints can achieve coverage comparable to that of tests generated by solving combined constraints, but at a much lower analysis cost.

(3) For the case mentioned by Reviewer 5, the technique in Section 3.1 is not applicable, and the technique in Section 3.2 is then applied.

(4) We cannot use the anomaly models used in the papers [2, 17] for two reasons. First, there are often many anomalies in a firewall based on their definition. Second, many such anomalies are not errors. For example, they define a conflict between any two rules as an anomaly; however, it often is not one.

(5) Marmorstein's work does propose a method for identifying faulty rules; however, their method is not systematic. Our proposed techniques have three main advantages over Marmorstein's work. First, Marmorstein's work misses all the potential faulty rules that we cover under the RFC fault, because they consider only the rules that match a failed packet, whereas we also examine the rules above the rule that the failed packets match. Second, we reduce the number of possible faulty rules, while Marmorstein's work can only find all the potential faulty rules that cover the failed packets. Third, we rank the potential faulty rules.
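A small sketch of the ranking rationale described in the response to Reviewer 2 above (illustrative only; the data layout and function names here are assumptions for this sketch, and the exact Suspicion formula is the one defined in the paper):

# Candidate rules with the fewest clauses falsified by the failed packets
# are examined first.

def false_clause_count(fields, failed_packets):
    # fields: {field_id: (low, high)} for one candidate rule;
    # failed_packets: list of {field_id: value} dictionaries from failed tests.
    count = 0
    for packet in failed_packets:
        for field_id, (low, high) in fields.items():
            if not (low <= packet[field_id] <= high):
                count += 1
    return count

def rank_candidates(candidates, failed_packets):
    # candidates: {rule_name: fields}; returns rule names ordered so that the
    # rule with the fewest false clauses (most suspicious) comes first.
    return sorted(candidates,
                  key=lambda name: false_clause_count(candidates[name], failed_packets))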