Copyright © 2016 Splunk Inc.

A Mathematical Perspective on Data Science
Dr. Tom LaGatta
Staff Sales Engineer (previously Staff Data Scientist)

About Me
• Math PhD from University of Arizona
  – "Geodesics of Random Riemannian Metrics" w/ Janek Wehr
  – Probability + Differential Geometry + Functional Analysis
• Postdoc at Courant Institute @ NYU
  – Finished Geodesics paper, published in Communications in Math. Physics
  – Collaborated with Political Scientists on heterogeneous voting behavior
• Was: Staff Data Scientist at Splunk
  – Helped customers with advanced use cases in Business Analytics, Internet of Things, Machine Learning, Data Science
• Now: Staff Sales Engineer at Splunk
  – Helping big big customers solve big big business problems

Abstract
● As with all things, the process of analyzing data admits a mathematical description. As a mathematician-turned-data-scientist, I will describe my approach to problem solving, and attempt to loosely formalize "stakeholders", "use cases", "data" and "deliverables" in mathematical language for the enjoyment of this mostly academic audience. In particular, I will describe how query languages are inherently functional, acting as functional transformations of Data into Data, which obey the usual functional composition law. The process of analyzing data results in an iterative sequence of queries, converging to a final query which is satisfactory to the use case. These queries are then organized into deliverables, which can be "dashboards" (web pages with visualizations) or "data products" (with scheduled jobs & analyses running in the background). When this process is done right, it results in the extraction of "value" for stakeholders, which can be measured tangibly in terms of revenue, costs or risk metrics. Sometimes this has a fancy name like "data science", but more often than not, is just the normal operational work of a good data-savvy IT, Security, Tech or Business department in modern enterprises and governmental agencies. There will be no proofs, but I would be very interested to discuss rigorous approaches to social organization & problem solving after the talk.
Agenda
• Basic Definitions:
  – Data, Stakeholders, Use Cases, Deliverables
• Query languages as functional programming
  – Every query is a map f: Data -> Data
  – Example problem solving
• Putting It All Together: Doing Data Science
  – Emphasize actionable insights
  – Tie it together to deliver "value" to stakeholders

Basic Definitions

What is Data?
• "Data" is any informational artifact of real-world phenomena
• A "metric" or "KPI" is any aggregate function of low-level data
• Examples:
  – Semi-structured timestamped events/metrics
  – Structured relational data (rows & columns)
  – Graph data (nodes & edges)
  – "Unstructured" data (images, video, text)
• How to model data:
  – Events: marked point processes & time series (Skorokhod space)
  – Relational schema: categories (see David Spivak's work)
  – Other data: depends on the use case, might need new data structures to represent it (incl. vectors, graphs, etc.)

What is a Stakeholder?
• A "stakeholder" is a person, group or organization who is invested in the outcome of an initiative.
• Example: A company buys software
  – Stakeholder orgs are IT & the Business
  – Individual stakeholders include Individual Contributors, Managers, Directors & Executives.
• IT stakeholders have performance metrics (num. outages, mean time to resolution, etc.)
• Business stakeholders have different metrics (revenue, cost, risk)
• Customer stakeholders downstream also have value/impact metrics

What is a Stakeholder? (cont.)
• How to model stakeholders?
  – Follow Game Theory for inspiration (but don't worry about "equilibrium")
  – Create an index set I with all stakeholders. Various actions, outcomes & objectives will have subscripts i based on stakeholders
  – E.g., person i chooses action a_{i,t} at time t
  – I can be hierarchical (Person i contained in org A, so parent(i) = A)
  – "Value" modeled by objective functions (U_{i,n} = objective #n for person i)
  – Is the action "pivotal" for the outcome? (i.e., 𝔼[U | do(a)] > 𝔼[U | do(not a)]?)
• Keep track of stakeholders' data:
  – Might be high-level (email, PowerPoints): context is key
  – Might be in databases (transactions, customer records, tickets data)
  – Might be granular events data (web clickstream, logs, mobile, wire data)

What is a Use Case?
• A "use case" consists of a business problem, a strategy to alleviate the problem, metrics to evaluate the outcome, data to power a solution, and stakeholders who are involved in its development.
• Use Case: Problem Forecasting.
  – Company has costly network/system outages
  – IT hires a Data Scientist to help build a solution.
  – Data includes Infrastructure (CPU, Memory), Operations (Outage Reports), App logs, etc.
  – Metric: cost of outages ≈ num. outages * time to resolution * cost of labor
  – Stakeholders include IT & Business, and impacted customers

What is a Deliverable?
• A "deliverable" is a thing produced to solve a use case
  – Can include "dashboards": informational web pages built from data queries
  – Or "workflows": notable events delivered to operations analysts
  – Or a full-fledged "data product": an application which *does* stuff automatically
• Deliverable: Problem Forecasting System
  – Goal: forecast problems before they start, deliver "proximate root cause" to IT to investigate
  – Data: CPU, Memory, Latency, Service Tickets
  – Build machine learning model to correlate Infrastructure data with Service impact
  – Apply model to incoming events, create "predicted(Risk_Score)"
  – Surface high-risk events to IT Operations

Queries as Functional Programming

Query Languages
• Query languages provide a formulaic approach to working with data
• A "query" is a string which tells where to get the data, what to do with the data, and where to put the data (incl. visualization or DB)
• Mathematically, every query is a FUNCTION f: Data -> Data
• Queries can be composed (with | symbol), and analysis is iterative
• Example 1: Successful Purchase Actions from Web Logs
    sourcetype=access_combined action=purchase status=200
• Example 2: Plot Purchase Value as Metric Timeseries
    sourcetype=access_combined action=purchase status=200 | timechart partial=f span=5m sum(price) as value

1: Successful Purchase Actions from Web Logs
2: Plot Purchase Value as Metric Timeseries
Wait! Problem identified! Why are purchases decreasing?
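
The idea that every query is a function f: Data -> Data, and that the | symbol is function composition, can be sketched in Python. This is only an illustration under stated assumptions, not how Splunk evaluates a search: the helpers `search`, `timechart_sum` and `pipe`, and the sample events, are hypothetical names invented here. The sourcetype filter is left out, and unlike a real timechart, empty time buckets are simply omitted.

```python
from collections import defaultdict
from functools import reduce
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Data = List[Event]
Query = Callable[[Data], Data]  # every query is a map f: Data -> Data


def search(**conditions: Any) -> Query:
    """Hypothetical filter stage: keep events matching every field=value pair."""
    def run(data: Data) -> Data:
        return [e for e in data
                if all(e.get(k) == v for k, v in conditions.items())]
    return run


def timechart_sum(field: str, span: int, out: str) -> Query:
    """Hypothetical aggregation stage: sum `field` per bucket of `span` seconds."""
    def run(data: Data) -> Data:
        totals: Dict[int, float] = defaultdict(float)
        for e in data:
            bucket = (e["_time"] // span) * span  # start of the time bucket
            totals[bucket] += e[field]
        return [{"_time": t, out: totals[t]} for t in sorted(totals)]
    return run


def pipe(*stages: Query) -> Query:
    """Compose stages left to right, playing the role of the | symbol."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


# Made-up events, for illustration only (_time is in seconds).
events: Data = [
    {"_time": 10, "action": "purchase", "status": 200, "price": 20.0},
    {"_time": 200, "action": "purchase", "status": 200, "price": 5.0},
    {"_time": 250, "action": "purchase", "status": 503, "price": 9.0},
    {"_time": 320, "action": "view", "status": 200, "price": 0.0},
    {"_time": 400, "action": "purchase", "status": 200, "price": 7.5},
]

# Analogue of Example 1, then Example 2 built by composing onto it.
purchases = search(action="purchase", status=200)
value_over_time = pipe(purchases, timechart_sum("price", span=300, out="value"))

result = value_over_time(events)
print(result)
```

Because each stage has the same type, Data -> Data, any prefix of the pipeline is itself a valid query. That is what makes the iterative style of analysis work: run the first stage, look at the output, then append the next stage.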
3: Investigate Database Errors
4: Plot Database Errors
5: Correlate DB problems with purchase value
6: Save as Deliverable, Send to Stakeholders
7: Move toward proactive monitoring stance

Putting It All Together: Doing Data Science

Emphasize Actionable Insights
• Avoid eye-candy visualizations
  – "Laser beam" threat dashboards look cool but are useless
• "How does this help me solve my problem?"
• Guide the viewer to drill down & act quickly
• Confusing viz: not actionable
• Good viz: actionable

Doing Data Science
• Data Scientists resist easy characterization. A bit of:
  – Statistician
  – Software Engineer
  – Business Analyst
  – Spock on the bridge
• Many scales of action:
  – Get your hands dirty when needed
  – But step back and see the big picture
  – Emphasize "actionable insights" throughout the whole org
  – Also have political credibility to say, "This is a bad decision, don't do it"
• Data Scientists guide organizations through building data products to solve big problems and deliver value
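
To make "deliver value" concrete, the metric from the Problem Forecasting use case (cost of outages ≈ num. outages * time to resolution * cost of labor) can be computed directly. The sketch below is a rough illustration only: the function name `outage_cost`, the use of mean time to resolution as the "time to resolution" term, and the sample figures are all assumptions made here, not numbers from the talk.

```python
from typing import Sequence


def outage_cost(resolution_hours: Sequence[float],
                labor_cost_per_hour: float) -> float:
    """Approximate outage cost as
    num. outages * mean time to resolution * cost of labor.

    Assumption: "time to resolution" is taken as the mean over the
    recorded outages, measured in hours.
    """
    num_outages = len(resolution_hours)
    if num_outages == 0:
        return 0.0  # no outages, no cost
    mean_time_to_resolution = sum(resolution_hours) / num_outages
    return num_outages * mean_time_to_resolution * labor_cost_per_hour


# Made-up example: three outages, labor at 100 per hour.
estimated = outage_cost([2.0, 0.5, 3.5], labor_cost_per_hour=100.0)
print(estimated)
```

Tracking a number like this before and after a deliverable ships is one way to express its value in terms a Business stakeholder already uses.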