A Mathematical Perspective on Data Science Dr. Tom LaGatta Staff Sales Engineer

Copyright © 2016 Splunk Inc.
A Mathematical Perspective on Data Science
Dr. Tom LaGatta
Staff Sales Engineer
(previously Staff Data Scientist)
About Me
• Math PhD from University of Arizona
– "Geodesics of Random Riemannian Metrics" w/ Janek Wehr
– Probability + Differential Geometry + Functional Analysis
• [email protected]
– Finished Geodesics paper, published in Communications in Math. Physics
– Collaborated with political scientists on heterogeneous voting behavior
• Was: Staff Data Scientist at Splunk
– Helped customers with advanced use cases in Business Analytics, Internet of Things, Machine Learning, Data Science
• Now: Staff Sales Engineer at Splunk
– Helping big big customers solve big big business problems
Abstract
As with all things, the process of analyzing data admits a mathematical description. As a mathematician-turned-data-scientist, I will describe my approach to problem solving, and attempt to loosely formalize "stakeholders", "use cases", "data" and "deliverables" in mathematical language for the enjoyment of this mostly academic audience. In particular, I will describe how query languages are inherently functional, acting as functional transformations of Data into Data, which obey the usual functional composition law. The process of analyzing data results in an iterative sequence of queries, converging to a final query which is satisfactory to the use case. These queries are then organized into deliverables, which can be "dashboards" (web pages with visualizations) or "data products" (with scheduled jobs & analyses running in the background). When this process is done right, it results in the extraction of "value" for stakeholders, which can be measured tangibly in terms of revenue, costs or risk metrics. Sometimes this has a fancy name like "data science", but more often than not it is just the normal operational work of a good data-savvy IT, Security, Tech or Business department in modern enterprises and governmental agencies. There will be no proofs, but I would be very interested to discuss rigorous approaches to social organization & problem solving after the talk.
Agenda
• Basic Definitions:
– Data, Stakeholders, Use Cases, Deliverables
• Query languages as functional programming
– Every query is a map f: Data -> Data
– Example problem solving
• Putting It All Together: Doing Data Science
– Emphasize actionable insights
– Tie it together to deliver "value" to stakeholders
Basic Definitions
What is Data?
• "Data" is any informational artifact of real-world phenomena
• A "metric" or "KPI" is any aggregate function of low-level data
• Examples:
– Semi-structured timestamped events/metrics
– Structured relational data (rows & columns)
– Graph data (nodes & edges)
– "Unstructured" data (images, video, text)
• How to model data:
– Events: marked point processes & time series (Skorokhod space)
– Relational schema: categories (see David Spivak's work)
– Other data: depends on the use case, might need new data structures to represent it (incl. vectors, graphs, etc.)
What is a Stakeholder?
• A "stakeholder" is a person, group or organization who is invested in the outcome of an initiative.
• Example: A company buys software
– Stakeholder orgs are IT & the Business
– Individual stakeholders include Individual Contributors, Managers, Directors & Executives.
• IT stakeholders have performance metrics (num. outages, mean time to resolution, etc.)
• Business stakeholders have different metrics (revenue, cost, risk)
• Customer stakeholders downstream also have value/impact metrics
What is a Stakeholder? (cont.)
• How to model stakeholders?
– Follow Game Theory for inspiration (but don't worry about "equilibrium")
– Create an index set I with all stakeholders. Various actions, outcomes & objectives will have subscripts i based on stakeholders
– E.g., person i chooses action a_{i,t} at time t
– The set I can be hierarchical (person i contained in org A, so parent(i) = A)
– "Value" modeled by objective functions (U_{i,n} = objective #n for person i)
– Is the action "pivotal" for the outcome? (i.e., is 𝔼[U | do(a)] > 𝔼[U | do(not a)]?)
• Keep track of stakeholders' data:
– Might be high-level (email, PowerPoints); context is key
– Might be in databases (transactions, customer records, tickets data)
– Might be granular events data (web clickstream, logs, mobile, wire data)
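The stakeholder model above can be sketched in a few lines of Python. This is only an illustration, not something from the talk: the names `parent`, `ancestors` and `is_pivotal` are hypothetical, the stakeholders and utility samples are made up, and sample means stand in for 𝔼[U | do(a)] and 𝔼[U | do(not a)], which is only reasonable if the samples come from actual interventions (e.g. an experiment).

```python
from statistics import mean

# Hypothetical hierarchical index set I: each stakeholder maps to its parent org.
parent = {"alice": "IT", "bob": "Business", "IT": "Company", "Business": "Company"}

def ancestors(i):
    """Walk up the hierarchy from stakeholder i to the root."""
    chain = []
    while i in parent:
        i = parent[i]
        chain.append(i)
    return chain

def is_pivotal(utility_if_action, utility_if_no_action):
    """Compare sample means as stand-ins for E[U | do(a)] and E[U | do(not a)]."""
    return mean(utility_if_action) > mean(utility_if_no_action)

# Made-up utility samples for one objective U_{i,n}, for illustration only.
u_do_a = [3.0, 4.0, 5.0]
u_do_not_a = [2.0, 3.0, 2.5]

chain = ancestors("alice")
pivotal = is_pivotal(u_do_a, u_do_not_a)
```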
What is a Use Case?
• A "use case" consists of a business problem, a strategy to alleviate the problem, metrics to evaluate the outcome, data to power a solution, and stakeholders who are involved in its development.
• Use Case: Problem Forecasting.
– Company has costly network/system outages
– IT hires a Data Scientist to help build solution.
– Data includes Infrastructure (CPU, Memory), Operations (Outage Reports), App logs, etc.
– Metric: cost of outages ≈ num outages × time to resolution × cost of labor
– Stakeholders include IT & Business, and impacted customers
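The outage-cost metric is a simple product, and a short sketch makes the units explicit. The function name `outage_cost` and all figures below are hypothetical, chosen only to show the arithmetic; the slide does not specify units, so hours and cost per hour are assumptions.

```python
def outage_cost(num_outages, mean_hours_to_resolution, labor_cost_per_hour):
    """Approximate outage cost as count x time to resolution x cost of labor."""
    return num_outages * mean_hours_to_resolution * labor_cost_per_hour

# Hypothetical monthly figures: 4 outages, 2.5 hours each, 200 per labor hour.
monthly_cost = outage_cost(
    num_outages=4,
    mean_hours_to_resolution=2.5,
    labor_cost_per_hour=200.0,
)
```

With these inputs the estimate is 4 × 2.5 × 200 = 2000 in whatever currency the labor cost is quoted.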
What is a Deliverable?
• A "deliverable" is a thing produced to solve a use case
– Can include "dashboards": informational web pages built from data queries
– Or "workflows": notable events delivered to operations analysts
– Or full-fledged "data product": application which *does* stuff automatically
• Deliverable: Problem Forecasting System
– Goal: forecast problems before they start, deliver "proximate root cause" to IT to investigate
– Data: CPU, Memory, Latency, Service Tickets
– Build machine learning model to correlate Infrastructure data with Service impact
– Apply model to incoming events, create "predicted(Risk_Score)"
– Surface high-risk events to IT Operations
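The last two steps (apply a model to incoming events, then surface the high-risk ones) might look roughly like the sketch below. It is not the system from the talk: the logistic form, the hand-picked `WEIGHTS` and `BIAS`, the field names and the 0.5 threshold are all assumptions standing in for a model that would really be fitted on historical infrastructure data and service tickets.

```python
import math

# Hypothetical weights standing in for a trained model (assumed, not fitted).
WEIGHTS = {"cpu_pct": 0.04, "memory_pct": 0.03, "latency_ms": 0.005}
BIAS = -6.0

def risk_score(event):
    """Logistic score in (0, 1) computed from infrastructure measurements."""
    z = BIAS + sum(WEIGHTS[name] * event[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def surface_high_risk(events, threshold=0.5):
    """Attach a predicted risk score and keep events at or above the threshold."""
    scored = [{**e, "predicted_risk_score": risk_score(e)} for e in events]
    return [e for e in scored if e["predicted_risk_score"] >= threshold]

# Made-up incoming events for illustration.
incoming = [
    {"host": "web-1", "cpu_pct": 35.0, "memory_pct": 40.0, "latency_ms": 120.0},
    {"host": "db-1", "cpu_pct": 95.0, "memory_pct": 90.0, "latency_ms": 800.0},
]
alerts = surface_high_risk(incoming)
```

Here only the heavily loaded `db-1` event crosses the threshold, so it is the one that would be surfaced to IT Operations.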
Queries as Functional Programming
Query Languages
• Query languages provide a formulaic approach to working with data
• A "query" is a string which tells where to get the data, what to do with the data, and where to put the data (incl. visualization or DB)
• Mathematically, every query is a FUNCTION f: Data -> Data
• Queries can be composed (with | symbol), and analysis is iterative
• Example 1: Successful Purchase Actions from Web Logs
sourcetype=access_combined action=purchase status=200
• Example 2: Plot Purchase Value as Metric Timeseries
sourcetype=access_combined action=purchase status=200
| timechart partial=f span=5m sum(price) as value
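To make the "every query is a function f: Data -> Data" point concrete, here is a rough Python analogy of the two example queries. It is a sketch, not an implementation of the query language: `pipe`, `search` and `timechart_sum` are hypothetical helpers, the events are made up, and the `partial=f` option of the real command is not modeled.

```python
from functools import reduce

def pipe(*stages):
    """Compose stages left to right, like the | symbol: pipe(f, g)(x) == g(f(x))."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

def search(**criteria):
    """Stage that keeps events whose fields equal all the given values."""
    return lambda events: [
        e for e in events if all(e.get(k) == v for k, v in criteria.items())
    ]

def timechart_sum(field, span):
    """Stage that sums a field into time buckets of width `span` seconds."""
    def stage(events):
        totals = {}
        for e in events:
            bucket = e["_time"] - e["_time"] % span  # floor to bucket start
            totals[bucket] = totals.get(bucket, 0) + e[field]
        return [{"_time": t, "value": v} for t, v in sorted(totals.items())]
    return stage

# Made-up web-log events; field names mirror the example queries.
events = [
    {"_time": 10, "action": "purchase", "status": 200, "price": 20},
    {"_time": 50, "action": "purchase", "status": 503, "price": 15},
    {"_time": 200, "action": "view", "status": 200, "price": 0},
    {"_time": 320, "action": "purchase", "status": 200, "price": 5},
]

example_1 = search(action="purchase", status=200)
example_2 = pipe(example_1, timechart_sum("price", span=300))
result = example_2(events)
```

Each stage takes a list of events and returns a list of events, so Example 2 is literally Example 1 composed with one more function, which is the composition law the slide refers to.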
1: Successful Purchase Actions from Web Logs
2: Plot Purchase Value as Metric Timeseries
Wait! Problem Identified! Why are purchases decreasing?
3: Investigate Database Errors
4: Plot Database Errors
5: Correlate DB problems with purchase value
6: Save as Deliverable, Send to Stakeholders
7: Move toward proactive monitoring stance
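Step 5 above (correlating database problems with purchase value) could be approximated with a plain Pearson correlation over matching time buckets. This is an assumed technique for illustration; the slides do not say how the correlation was computed, and the two series below are made up. A strong correlation is a prompt to investigate, not proof of cause.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up per-bucket series: DB error counts and purchase value.
db_errors = [0, 1, 2, 3, 4]
purchase_value = [100, 90, 80, 70, 60]

r = pearson(db_errors, purchase_value)
```

In this contrived example purchase value falls linearly as errors rise, so `r` is at (or within rounding of) -1.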
Putting It All Together: Doing Data Science
Emphasize Actionable Insights
• Avoid eye-candy visualizations
– "Laser beam" threat dashboards look cool but are useless
• "How does this help me solve my problem?"
• Guide the viewer to drill down & act quickly
Confusing viz: not actionable
Good viz: actionable
Doing Data Science
• Data Scientists resist easy characterization. A bit of:
– Statistician
– Software Engineer
– Business Analyst
– Spock on the bridge
• Many scales of action:
– Get your hands dirty when needed
– But step back and see the big picture
– Emphasize "actionable insights" throughout the whole org
– Also have political credibility to say, "This is a bad decision, don't do it"
• Data Scientists guide organizations through building data products to solve big problems and deliver value