Example: Rumor Performance Evaluation
Andy Wang
CIS 5930
Computer Systems Performance Analysis

Motivation
• Optimistic peer replication is popular
  – Intermittent connectivity
  – Availability of replicas for concurrent updates
  – Convergence and correctness for updates
• Example: Rumor, Coda, Ficus, Lotus Notes, Outlook Calendar, CVS

Background
• Replication provides high availability
• Optimistic replication allows immediate access to any replicated item, at the risk of permitting concurrent updates
• Reconciliation process makes replicas consistent (i.e., two replicas for peer-to-peer)

Background Continued
• Conflicts occur when different replicas of the same file are updated subsequent to the previous reconciliation

Optimistic Replication Example
• Connected
  – Log on Desktop: 10:00 Update, 10:25 Update
  – Log on Portable: 10:00 Update, 10:25 Update
• Disconnected
  – Log on Desktop: 10:00 Update, 10:25 Update, 10:40 Update
  – Log on Portable: 10:00 Update, 10:25 Update, 10:51 Update

Example Continued
• Disconnected
  – Log on Desktop: 10:00 Update, 10:25 Update, 10:40 Update
  – Log on Portable: 10:00 Update, 10:25 Update, 10:51 Update
• Connected
  – Run reconciliation
  – Detect a conflict
  – Propagate updates
  – Log on Desktop: 10:00 Update, 10:25 Update, 10:40 Update, 10:51 Update
  – Log on Portable: 10:00 Update, 10:25 Update, 10:40 Update, 10:51 Update

Goal
• Understand the cost characteristics of the reconciliation process for Rumor

Services
• Reconciliation
  – Exchange file system states
  – Detect new and conflicting versions
    • If possible, automatically resolve conflicts
    • Else, prompt user to resolve conflicts
  – Propagate updates

Outcomes
• Two reconciled replicas become consistent for all files and directories
• Some files remain inconsistent and require user to resolve conflicts

Metrics
• Time
  – Elapsed time
    • From the beginning to the completion of a reconciliation request
  – User time (time spent using CPU)
  – System time (time spent in the kernel)
• Failure rate
  – Number of incomplete reconciliations
    and infinite loops (none observed)

Metrics not Measured
• Disk access time
  – Requires complex instrumentation
    • E.g., buffering, logging, etc.
• Network and memory resources
  – Not heavily used
• Correctness
  – Difficult to evaluate

Monitor Implementation
[Diagram: reconciliation process; component labels: Perl library, Spool-to-dump, C++, Scanner, Recon, Rfindstored, Spool-to-dump, Rrecon, Server]
• Top-level Perl time command

Parameters
• System parameters
  – CPU (speed of local and remote servers)
  – Disk (bandwidth, fragmentation level)
  – Network (type, bandwidth, reliability)
  – Memory (size, caching effects, speed)
  – Operating system (type, version, VM management, etc.)

Parameters (Continued)
• Workload parameters
  – Number of replicas
  – Number of files and directories
  – Number of conflicts and updates
  – Size of volumes (file size)

Workloads
• Update characteristics extracted from Geoff Kuenning’s traces
[Diagram: tree of file access categories: read-only access and read-write access; nonshared access and shared access (2-way sharing, 3+-way sharing), each divided into read access and write access]

Experimental Settings
• Machine model: Dell Latitude XP
• CPU: x486 100 MHz
• RAM: 36 MB
• Ethernet: 10 Mb
• Operating system: Linux 2.0.x
• File system: ext3

Experimental Settings
• Should have documented the following as well
  – CPU: L1 and L2 cache sizes
  – RAM: brand and type
  – Disk: brand, model, capacity, RPM, and the size of on-disk cache
  – File system version

Experimental Design
• 2^5 full factorial design
• Linear regression or multivariate linear regression to model major factors
• Target: 95% confidence interval

2^5 Full Factorial Design
• Number of replicas: 2 and 6
• Number of files: 10 and 1,000
• File size: 100 and 22,000 bytes
• Number of directories: 10 and 100
• Number of updates: 10 and 450
  – Capped at 10 updates for 10 files
• Number of conflicts: 0 /* typical */

2^5 Full Factorial Analysis
• Experiment errors < 3%
[Figure: Elapsed time; y-axis: Time (seconds)]
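The allocation of variation behind a 2^k full factorial analysis can be sketched as below. This is a minimal illustration, not the code used for the Rumor study: the function name `factorial_effects`, the three-factor setup, and the response values are all hypothetical, and the responses are synthetic numbers chosen only to make the arithmetic easy to follow. It assumes an unreplicated design, so the total variation equals the sum over all effects.

```python
from itertools import combinations, product
from math import prod


def factorial_effects(factors, responses):
    """Percent of variation explained by each effect in an unreplicated 2^k design.

    factors:   list of factor names (hypothetical labels)
    responses: dict mapping a tuple of -1/+1 levels (one per factor) to a response
    """
    k = len(factors)
    runs = list(product((-1, 1), repeat=k))  # all 2^k level combinations
    n = len(runs)
    sum_squares = {}
    for order in range(1, k + 1):
        for idx in combinations(range(k), order):
            # The sign column of an interaction is the product of its factors' columns.
            q = sum(prod(run[i] for i in idx) * responses[run] for run in runs) / n
            sum_squares[" x ".join(factors[i] for i in idx)] = n * q * q
    total = sum(sum_squares.values())  # equals SST when there are no replications
    return {name: 100.0 * ss / total for name, ss in sum_squares.items()}


# SYNTHETIC responses (not measurements): y = 10 + 4*files + 1*updates + 0.5*files*updates
names = ["#files", "#updates", "fileSize"]
synthetic = {
    run: 10 + 4 * run[0] + 1 * run[1] + 0.5 * run[0] * run[1]
    for run in product((-1, 1), repeat=len(names))
}
percent = factorial_effects(names, synthetic)
for name, pct in sorted(percent.items(), key=lambda item: -item[1]):
    print(f"{name:30s} {pct:6.2f}%")
```

With five factors the same function would enumerate all 32 runs; only the `names` list and the response table would change.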
[Figures (2^5 full factorial analysis): measured time and predicted time vs. experimental number, for elapsed time, system time, and user time; y-axis: Time (seconds)]

Variation of Effects
• All major effects are significant at the 95% confidence level
[Figures: top 5 effects for elapsed time, user time, and system time; y-axis: % Variation (0 to 100); x-axis: Factor. Factor labels shown: # files, # updates, fileSize, # dirs, # replicas, #files x #updates, fileSize x #files]

Residuals vs. Predicted Time
• Clusters are caused by the dominating effect of the number of files
[Figures: residuals (seconds) vs. predicted time (seconds), for elapsed time, system time, and user time]

Residuals vs.
Experiment Numbers
• Residuals are almost homoscedastic
[Figures: residuals vs. experimental number, for elapsed time, system time, and user time]

Quantile-Quantile Plot
• Residuals are almost normally distributed
[Figures: residual quantiles vs. normal quantiles, for elapsed time, system time, and user time. Fitted line for elapsed time: y = 5.6125x + 2E-15, R² = 0.9757. Fitted lines on the system time and user time panels: y = 0.1125x - 2E-18, R² = 0.9863; y = 0.1242x - 3E-16, R² = 0.9524]

Multivariate Regression
• Number of replicas: 2
• Number of files: 4 levels, 10-600
• File size: 22,000 bytes
• Number of directories: 4 levels, 10-60
• Number of updates: 0
• Number of conflicts: 0 /* typical */
• Number of repetitions: 5 per data point

Multivariate Regression
• Experiment errors < 7%
• All coefficients are significant
[Figures: measured time and predicted time vs. experiment number, for elapsed time, user time, and system time; y-axis: Time (seconds)]

Residuals vs. Predicted Time
• Elapsed time shows a bi-modal trend
• User time shows an exponential trend
[Figures: residuals (seconds) vs. predicted time (seconds), for elapsed time, system time, and user time]

Residuals vs.
Experiment Numbers
• Not so good for elapsed time and user time
[Figures: residuals vs. experiment number, for elapsed time, user time, and system time]

Quantile-Quantile Plot
• Residuals are not normally distributed for elapsed time and user time
[Figures: residual quantiles vs. normal quantiles. Elapsed time: y = 5.6775x - 4E-14, R² = 0.8407. System time: y = 0.1321x - 2E-15, R² = 0.9789. User time: y = 0.4811x - 2E-15, R² = 0.9243]

Log Transform (User Time)
• ANOVA tests failed miserably
[Figures: user time residuals (seconds) vs. predicted time (seconds); user time residuals vs. experiment number; residual quantiles vs. normal quantiles, with fitted line y = 0.0222x - 1E-15, R² = 0.8709]

Residual Analyses (User Time)
• No indications that transforms can help…
[Figures: standard deviation of residuals vs. mean user time; variance of residuals vs. mean user time; standard deviation of residuals vs. mean user time squared]

Possible Explanations
• i-node related factors
  – Number of files per directory block
  – Crossing block boundary may cause anomalies
• Caching effects
  – Reboot needed across experiments

Linear Regression
• Number of files: 100, 150, 200, 250, 252, 253, 300, 350, 400, 450
  – Test for the boundary-crossing condition as the number of files exceeds one block
  – Note that Rumor has hidden files
• Number of repetitions: 5 per data point
• Flush cache (reboot)
  before each run

Linear Regression
• R² > 80%
• All coefficients are significant
[Figures: measured time, predicted time, and 95% confidence interval vs. number of files, for elapsed time, user time, and system time; y-axis: Time (seconds)]

Residuals vs. Predicted Time
• Elapsed time shows a bi-modal trend
• User time shows an exponential trend
[Figures: residuals (seconds) vs. predicted time (seconds), for elapsed time, system time, and user time]

Residuals vs. Experiment Numbers
• Elapsed time shows a rising bi-modal trend
  – Randomization of experiments may help
[Figures: residuals vs. experiment number, for elapsed time, system time, and user time]

Quantile-Quantile Plot
• Residuals for elapsed time are not normally distributed
  – Perhaps piece-wise normal
[Figures: residual quantiles vs. normal quantiles. Elapsed time: y = 5.8218x + 5E-15, R² = 0.878. Fitted lines on the system time and user time panels: y = 0.0976x - 4E-16, R² = 0.9693; y = 0.2134x + 2E-15, R² = 0.9709]

Possible Explanations
• i-node related factors: No
• Caching effects: No
• Hidden factors: Maybe
• Bugs: Maybe

Conclusion
• Identified the number of files as the dominating factor for Rumor running time
• Observed the existence of an unknown factor in the Rumor performance model
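The regression-and-residual workflow used in these slides (fit a line, check R², then inspect the residuals with a quantile-quantile fit) can be sketched as below. This is a minimal illustration under stated assumptions, not the analysis code behind the slides: the helper names `fit_line` and `qq_points` are hypothetical, the file counts are the levels listed for the linear regression experiment, and the elapsed times are synthetic values with an artificial step inserted at 253 files so that the residuals are visibly non-normal.

```python
from statistics import NormalDist, mean


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1, r_squared)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return b0, b1, 1.0 - sse / sst


def qq_points(res):
    """Pair each sorted residual with the standard normal quantile at (i - 0.5)/n."""
    n = len(res)
    normal = NormalDist()
    return [(normal.inv_cdf((i - 0.5) / n), r)
            for i, r in enumerate(sorted(res), start=1)]


# File counts from the slide; elapsed times are SYNTHETIC, not measurements.
num_files = [100, 150, 200, 250, 252, 253, 300, 350, 400, 450]
elapsed = [0.2 * n + 5.0 + (3.0 if n >= 253 else -3.0) for n in num_files]

b0, b1, r2 = fit_line(num_files, elapsed)
res = [y - (b0 + b1 * x) for x, y in zip(num_files, elapsed)]

# Fitting a line through the Q-Q points gives the "y = ax + b, R²" style
# summary reported on the quantile-quantile plots.
quantiles, sorted_res = zip(*qq_points(res))
_, qq_slope, qq_r2 = fit_line(quantiles, sorted_res)

print(f"slope={b1:.4f} intercept={b0:.4f} R^2={r2:.4f}")
print(f"Q-Q slope={qq_slope:.4f} Q-Q R^2={qq_r2:.4f}")
```

A high R² for the regression alongside a lower Q-Q R² is the pattern the slides describe for elapsed time: the line explains most of the variation, yet the residuals still carry structure that the model does not capture.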