Learning a Better Compiler:
Predicting Unroll Factors using Supervised Classification, and
Integrating CPU and L2 Cache Voltage Scaling using Machine Learning

Predicting Unroll Factors
• Loop unrolling is sensitive to the unroll factor
• Current solution: expert design
– Difficult: hand-tuned heuristics
– Must be rewritten frequently
• Predict parameters with machine learning
– Easy: data collection takes ~1 week, with no human time
– Algorithm does not change with the compiler

Loop Unrolling
• Combines multiple iterations into one loop body
• Fewer iterations → less branching
• Allows other transformations:
– Exposes adjacent memory locations
– Allows instruction reordering across iterations

Unroll Factors
• How many iterations to combine?
• Too few?
– Provides little benefit
• Too many?
– Increased cache pressure
– Increased live ranges → register pressure

Optimal Unroll Factors
[figure]

Classification Problems
• Input: a vector of features
– E.g. nest depth, # of branches, # of ops
• Output: a class
– E.g. unroll factor, 1-8
• No prior knowledge required about:
– Meaning of features/classes
– Relevance of features
– Relationships between features

Nearest Neighbors
• Paper describes a kernel density estimator
• All dimensions normalized to [0, 1]
• Given a test point p:
– Consider training points "close" to p (within a fixed distance, e.g. 0.3)
– Majority vote among qualifying training points
[figure]

Support Vector Machine
• Assume two classes (easily generalized)
• Transform the data
– Make the classes linearly separable
• Find the line that maximizes the separating margin
• For a test point:
– Perform the transformation
– Classify based on the learned line

Maximal Margin
[figure]

Non-Linear SVM
[figure]
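To make the unrolling transformation above concrete, here is a hand-unrolled sketch (unroll factor 4) of a simple reduction loop. This is an illustration only, not code from the paper; a compiler performs this rewrite automatically on its intermediate representation.

```python
def dot_rolled(a, b):
    # Baseline: one loop test and branch per element.
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def dot_unrolled4(a, b):
    # Unroll factor 4: four iterations combined into one loop body,
    # so there is one loop test/branch per four elements, and the four
    # independent products are exposed for reordering and scheduling.
    n = len(a)
    total = 0.0
    i = 0
    while i + 4 <= n:
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
        i += 4
    # Epilogue: handle the remaining n mod 4 elements.
    while i < n:
        total += a[i] * b[i]
        i += 1
    return total
```

The epilogue loop is part of why a larger unroll factor is not free: the duplicated body and leftover-iteration handling grow code size and live ranges, which is the cache- and register-pressure cost listed above.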
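The fixed-radius nearest-neighbor scheme described above (features normalized to [0, 1], majority vote among training points within a fixed distance such as 0.3) can be sketched in a few lines. The fallback to an unroll factor of 1 when no neighbor qualifies is an assumption for illustration, not a detail from the paper.

```python
import math
from collections import Counter

def predict_unroll_factor(train, test_point, radius=0.3):
    """Majority vote among training points within `radius` of `test_point`.
    `train` is a list of (feature_vector, unroll_factor) pairs, with all
    feature dimensions assumed already normalized to [0, 1]."""
    votes = Counter()
    for features, label in train:
        if math.dist(features, test_point) <= radius:
            votes[label] += 1
    if not votes:
        return 1  # assumed fallback: no unrolling when no neighbor qualifies
    return votes.most_common(1)[0][0]
```

With a made-up training set such as `[((0.1, 0.2), 4), ((0.15, 0.25), 4), ((0.9, 0.8), 1)]`, a query near the first two points is outvoted to factor 4, while a query near the last point gets factor 1.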
Some Features
• # operands
• Live range size
• Critical path length
• # operations
• Known trip count
• # floating-point ops
• Loop nest level
• # branches
• # memory ops
• Instruction fan-in in DAG
• # instructions
• Language: C, Fortran
• # implicit instructions
• & more (38 total)

Results: No Software Parallelism
[figure]

Results: With Software Parallelism
[figure]

Big Idea: Easy Maintenance
• Performance improvements are modest
– Sometimes worse, sometimes much better
– Usually little change
• Requires no re-tuning when the compiler changes
– Gathering data takes ~1 week, with no human time
• General mechanism
– Can be applied to all parameters
– No model of the system needed
• Can be applied to new transformations where expert knowledge is unavailable

Integrated CPU and L2 Cache Voltage Scaling using Machine Learning

Dynamic Voltage Control
• Monitor the system
• When activity is low, reduce power
– Also reduces computational capacity
– May need more energy if the work takes longer

Multiple Clock Domains
• Adjust separate components independently
• Better performance/power trade-off
– E.g. a CPU-bound application may be able to decrease power to memory and cache without affecting performance
• More complex DVM policy

Motivation
• Applications go through phases
• Frequencies/voltages should change too
• Focus on the core and L2 cache
– They consume a large fraction of total power
• The best policy may change over time
– On battery: conserve power
– Plugged in: maximize performance

Learning a DVM Policy
• Compiler automatically instruments code
– Inserts sampling code to record performance counters
– Instruments code only to gather data
• Use machine learning to create a policy
• Implement the policy in a microcontroller

ML Parameters
• Features:
– Clock cycles per instruction (CPI)
– L2 accesses per instruction (L2PI)
– Memory accesses per instruction
• Select voltages to minimize:
– Total energy
– Energy*delay

Machine Learning Algorithm
• Automatically learns a set of if-then rules
– E.g.: if (L2PI >= 1) and (CPI <= 0) then f_cache = 1 GHz
• Compact, expressive
• Can be implemented in hardware
[figure]

Results
• Compared to independently managing the core and L2:
– Saves 22% on average, 46% maximum
• Learns effective rules from few features
• Compiler modifications instrument the code
• Policy learned offline
• Policy implemented in a microcontroller

Conclusion
• Machine learning derives models from data automatically
• Allows easy maintenance of heuristics
• Creates models that are more effective than hand-tuned ones
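Returning to the if-then rule representation from the "Machine Learning Algorithm" slide: a learned policy of this form is just an ordered rule table, which is what makes it compact enough for a microcontroller. The sketch below uses the slide's two counters (CPI and L2PI), but the thresholds, frequencies, and first-match evaluation order are hypothetical placeholders, not the rules learned in the paper.

```python
# Each rule: (predicate over the sampled counters, L2 frequency in GHz).
# CPI = clock cycles per instruction, L2PI = L2 accesses per instruction.
# All thresholds and frequency values below are hypothetical.
RULES = [
    (lambda cpi, l2pi: l2pi >= 1.0, 1.0),                 # cache-hungry phase: keep L2 fast
    (lambda cpi, l2pi: cpi <= 1.0 and l2pi < 0.1, 0.5),   # CPU-bound phase: slow the L2
]
DEFAULT_F_CACHE = 0.75  # assumed fallback when no rule fires

def select_cache_frequency(cpi, l2pi):
    """First-match evaluation of the rule list, as fixed-function
    hardware or a microcontroller might implement it."""
    for predicate, f_cache in RULES:
        if predicate(cpi, l2pi):
            return f_cache
    return DEFAULT_F_CACHE
```

Because each rule is a comparison of a counter against a constant, the whole policy reduces to a handful of comparators and a priority encoder in hardware, which is the "compact, expressive" property the slide claims.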