Cultivating Research Taste
(illustrated via a journey in Program Synthesis research)

Programming Languages Mentoring Workshop 2015
Sumit Gulwani
Microsoft Research, Redmond

Dimensions in Research
• Problem Definition
  – Advisor’s interest and funding, Internship, Course project
  – Intersection with your collaborator’s interest
  – Next logical advance in your current portfolio
  – Talk to potential customers, market surveys
• Solution Strategy
  – Develop new techniques vs. Apply existing techniques
  – Cross-disciplinary
• Impact
  – Paper, Tool, Awards, Media
  – Personal happiness
Cultivating research taste is a journey! Once you develop it, you start on another journey!

Program Synthesis
Goal: Synthesize a program in the underlying domain-specific language (DSL) from user intent using some search algorithm.
An old problem, but more significant today.
• Diverse computational platforms & programming languages.
• Enabling technology: Better algorithms & faster machines.
Synthesis can revolutionize end-user programming if we:
• target the right set of application domains
  – such as Data manipulation
• allow the right intent specification mechanism
  – Examples, Natural Language
• can tame the huge search space for real-time interaction
  – Domain-specific search algorithms
PPDP 2010 [Invited talk paper]: “Dimensions in Program Synthesis”

Graduation Advice (2005)
You will have too many problems to solve; you can’t pursue them all. Make thoughtful choices.
George Necula, UC Berkeley

From Program Verification to Program Synthesis
Precondition P, Statement s, Postcondition Q
• Forward dataflow analysis: From s, P, compute Q
• Backward dataflow analysis: From s, Q, compute P
• Program Synthesis: From P, Q, compute s
Nebojsa Jojic, MSR Redmond (2005)

Synthesis using SAT/SMT Constraint Solvers
Program synthesis is an extremely hard combinatorial search task!
Try using SAT solvers, which have been engineered to solve huge instances.
Venkie, MSR Bangalore (2006)

Initial results in program synthesis
Results: Managed to synthesize a wide variety of programs from logic specs.
Approach: Reduce synthesis to solving SAT/SMT constraints.
• Bit-vector algorithms (e.g., turn off the rightmost 1-bit)
  – [PLDI 2011, ICSE 2010]
• SIMD algorithms (e.g., vectorization of CountIf)
  – [PPoPP 2013]
• Undergraduate book algorithms (e.g., sorting, dynamic programming)
  – [POPL 2010]
• Program inverses (e.g., deserializers from serializers)
  – [PLDI 2011]
• Graph algorithms (e.g., bipartiteness check)
  – [OOPSLA 2010]

Mid-life Awakening (2010)
Software developers → End users: two orders of magnitude more users

Dimensions in Research: Problem Definition

Problem Definition: Inspired by Excel help forums
Typical help-forum interaction

Input                 Desired output
300_w30_aniSh_c1_b    w30
300_w5_aniSh_c1_b     w5

Formulas:
=MID(B1,5,2)
=MID(B1,FIND("_",$B:$B)+1, FIND("_",REPLACE($B:$B,1,FIND("_",$B:$B),""))-1)

Flash Fill (Excel 2013 feature)

Dimensions in Research: Solution Strategy
Flash Fill: Domain Specific Language
Guarded Expression   G := Switch((b1,e1), …, (bn,en))
Boolean Expression   b := c1 ∧ … ∧ cn
Atomic Predicate     c := Match(vi, r, k)
Trace Expression     e := Concatenate(f1, …, fn)
Atomic Expression    f := s                    // Constant String
                        | SubStr(vi, p1, p2)
                        | Loop(w: e)
Index Expression     p := k                    // Constant Integer
                        | Pos(r1, r2, k)       // kth position in string whose left/right side matches with r1/r2
Regular Expression   r := TokenSequence(T1, …, Tn)
POPL 2011: “Automating String Processing in Spreadsheets using Input-Output Examples”; Sumit Gulwani.

Substring Operator
Let w = SubStr(s, p, p′) where p = Pos(r1, r2, k) and p′ = Pos(r1′, r2′, k′).
[Figure: string s with w lying between positions p and p′; r1 matches w1 and r2 matches w2 on the left and right of p; r1′ matches w1′ and r2′ matches w2′ on the left and right of p′.]
Two special cases:
• r1 = r2′ = ε: This describes the substring.
• r2 = r1′ = ε: This describes the boundaries around the substring.
The general case allows for the combination of the two and is thus a powerful operator!

Syntactic String Transformations: Example
Format phone numbers

Input v1          Output
(425)-706-7709    425-706-7709
510.220.5586      510-220-5586
235 7654          425-235-7654
745-8139          425-745-8139

Switch((b1, e1), (b2, e2)), where
b1 ≡ Match(v1, NumTok, 3)
b2 ≡ ¬Match(v1, NumTok, 3)
e1 ≡ Concatenate(SubStr2(v1,NumTok,1), ConstStr(“-”), SubStr2(v1,NumTok,2), ConstStr(“-”), SubStr2(v1,NumTok,3))
e2 ≡ Concatenate(ConstStr(“425-”), SubStr2(v1,NumTok,1), ConstStr(“-”), SubStr2(v1,NumTok,2))

Flash Fill: Search Algorithm
Goal: Given input-output pairs (i1,o1), (i2,o2), (i3,o3), (i4,o4), find P such that P(i1)=o1, P(i2)=o2, P(i3)=o3, P(i4)=o4.
Algorithm:
1. Learn set S1 of trace expressions s.t. ∀e in S1, [[e]] i1 = o1. Similarly compute S2, S3, S4. Let S = S1 ∩ S2 ∩ S3 ∩ S4.
2(a). If S ≠ ∅ then result is S.
Challenge: Each Sj may have a huge number of expressions.
Key Idea: We have a DAG-based data structure that allows for succinct representation and manipulation of Sj.
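The phone-number program from the example slide can be sketched in Python and checked against the goal P(ij) = oj. This is a minimal illustration, not Flash Fill's implementation: the helper names, the reading of NumTok as a maximal run of digits, the "at least k matches" reading of Match, and the "c-th occurrence" reading of SubStr2 are all assumptions made for this sketch.

```python
import re

# Assumption: NumTok denotes a maximal run of digits.
NUM_TOK = r"[0-9]+"


def match(v, regex, k):
    """Match(v, r, k), read here as: v contains at least k matches of r."""
    return len(re.findall(regex, v)) >= k


def substr2(v, regex, c):
    """SubStr2(v, r, c), read here as: the c-th (1-based) occurrence of r in v."""
    return re.findall(regex, v)[c - 1]


def concatenate(*parts):
    """Concatenate(f1, ..., fn): join the atomic expression results."""
    return "".join(parts)


def b1(v1):
    return match(v1, NUM_TOK, 3)


def b2(v1):
    return not match(v1, NUM_TOK, 3)


def e1(v1):
    # Three numeric tokens: rejoin them with dashes.
    return concatenate(substr2(v1, NUM_TOK, 1), "-",
                       substr2(v1, NUM_TOK, 2), "-",
                       substr2(v1, NUM_TOK, 3))


def e2(v1):
    # Area code missing: prepend the constant "425-".
    return concatenate("425-", substr2(v1, NUM_TOK, 1), "-",
                       substr2(v1, NUM_TOK, 2))


def switch(v1, *cases):
    """Switch((b1,e1), ..., (bn,en)): evaluate the first branch whose guard holds."""
    for guard, expr in cases:
        if guard(v1):
            return expr(v1)
    raise ValueError(f"no guard matched input {v1!r}")


def program(v1):
    return switch(v1, (b1, e1), (b2, e2))


# The four input-output pairs from the example slide.
EXAMPLES = [
    ("(425)-706-7709", "425-706-7709"),
    ("510.220.5586", "510-220-5586"),
    ("235 7654", "425-235-7654"),
    ("745-8139", "425-745-8139"),
]

if __name__ == "__main__":
    for i, o in EXAMPLES:
        print(f"{i!r} -> {program(i)!r} (expected {o!r})")
```

Each branch alone is consistent with only two of the four examples, which is why the synthesizer needs a Switch with learned guards rather than a single trace expression.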
Flash Fill: Search Algorithm (continued)
2(b). Else find a smallest partition, say {S1,S2}, {S3,S4}, s.t. S1 ∩ S2 ≠ ∅ and S3 ∩ S4 ≠ ∅.
3. Learn boolean formulas b1, b2 s.t. b1 maps i1, i2 to true, and b2 maps i3, i4 to true.
4. Result is: Switch((b1, S1 ∩ S2), (b2, S3 ∩ S4))
Search Methodology: Reduce learning of an expression to learning of sub-expressions (Divide-and-Conquer!)

Ranking
General Principles
• Prefer shorter programs.
  – Fewer conditionals.
  – Shorter string expressions and regular expressions.
• Prefer programs with fewer constants.
Strategies
• Baseline: Pick any minimal-sized program using a minimal number of constants.
• Machine Learning: Programs are scored using a weighted combination of program features.
  – Weights are learned using training data.
Rishabh Singh

Experimental Comparison of various Ranking Strategies

Strategy    Average # of examples required
Baseline    4.17
Learning    1.48

Technical Report: “Predicting a correct program in Programming by Example”; Singh, Gulwani

User Interaction Model
Current Flash Fill Model
• Auto-prediction avoids the discoverability issue.
• User inspects output and may provide additional examples.
Show programs
• In any desired language (after conversion from DSL).
• Paraphrased in English.
Computer-initiated interactivity
• Highlight less confident entries in the output.
• Ask directed questions based on distinguishing inputs.

Dimensions in Research: Impact

Initial Success: Media articles & Blogposts

Broader Impact
Defined a new research trajectory, which keeps me busy with a passionate sense of purpose.
• End-user Programming using Examples and Natural Language
• Intelligent Tutoring systems

Conclusion
Dimensions in Research
Problem definition, Solution strategy, Impact
Cultivating research taste is a journey
Mine involved: “Program analysis” -> “Program synthesis” -> “Program synthesis for end-users using examples”
Once you develop it, you start a new journey
Mine involves: having fun with cross-disciplinary research in
• “Frameworks for end-user programming using examples & NL”
• “Intelligent Tutoring systems”

Backup Slides for Flash Fill Demo