Designing Data-Intensive Applications

Storage needs in distributed systems
● Store data so that they, or another application, can find it again later (databases)
● Remember the result of an expensive operation, to speed up reads (caches)
● Allow users to search data by keyword or filter it in various ways (search indexes)
● Send a message to another process, to be handled asynchronously (stream processing)
● Periodically crunch a large amount of accumulated data (batch processing)

● Amazon has also observed that a 100 ms increase in response time reduces sales by 1% [20]
● Total cost of ownership
  ○ It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance: fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
● Along with reliability and scalability, other important properties are operability, simplicity, and plasticity/evolvability.
● Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
● One of the best tools we have for removing accidental complexity is abstraction.

Chapter 3
● Indexes speed up reads but slow down writes. That is the trade-off.

LSM Storage Engine

Introduction to LSM Trees:
● The LSM (Log-Structured Merge) Tree is a data structure employed by various NoSQL databases like DynamoDB, Cassandra, and ScyllaDB.
● These databases are designed to handle large volumes of write operations efficiently, which traditional relational databases struggle with.
● LSM Trees achieve this by optimizing write performance while maintaining reasonable read performance.

Comparing Storage Engines:
● The article contrasts LSM Trees with B+ Trees, which are commonly used in relational databases.
● Unlike B+ Trees, which perform in-place updates, LSM Trees are append-only. This avoids random write I/O and so enhances write performance.

Architecture of LSM Trees:
● LSM Trees leverage multiple data structures to exploit different storage device characteristics.
● They consist of two main components: Memtables and SSTables.
● Memtables temporarily store incoming writes in memory, organizing them as key-object pairs sorted by key.
● When a Memtable reaches a certain size, it is flushed to disk as an immutable SSTable, ensuring sequential I/O operations. These files are called sorted runs. There is also typically a small sparse index for the range of keys that the file holds. This makes searching within a large disk file very fast. The secret sauce of the SSTable :)
● The new SSTable becomes the most recent segment of the LSM Tree, and this process continues as more data arrives.

Operations on LSM Trees:
● Delete: LSM Trees handle deletions by adding a tombstone, which lands in the most recent SSTable and indicates that an object has been deleted.
● Read: Reads search the Memtables first and then the SSTables, from newest to oldest. Since SSTables are sorted, each lookup can be efficient.
● Write: Incoming writes are buffered in memory and periodically flushed to disk as SSTables.
● Compaction: As SSTables accumulate, a compaction process merges them and discards outdated or deleted values, reclaiming disk space.

Compaction Strategies:
● The article discusses different compaction strategies, such as size-tiered compaction and leveled compaction.
● Compaction aims to manage the number of SSTables efficiently while minimizing read and write amplification.

LSM Tree Enhancements:
● Various optimizations, like summary tables and Bloom filters, are employed to improve lookup performance and reduce I/O operations.
● Summary tables store metadata about disk blocks, so that unnecessary searches can be skipped.
● Bloom filters can tell that a key is definitely absent from a level (or only possibly present) without an exhaustive search, reducing I/O.

Drawbacks of LSM Trees:
● Despite their advantages, LSM Trees have drawbacks, particularly related to the resource-intensive nature of compaction.
● Compaction involves compression/decompression of data, which can impact read and write performance.
● Additionally, reads can be slow in the worst case, because the append-only design means a key may have to be looked up in several SSTables.

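The write, delete, read, and compaction steps above can be sketched in a few lines. This is a toy, in-memory illustration only, not how any of the databases named above implement it: the class name `TinyLSM`, the `memtable_limit` threshold, and the use of Python lists as "SSTables" are all assumptions made for the example, and there is no sparse index, Bloom filter, or real disk I/O.

```python
import bisect

TOMBSTONE = object()  # sentinel value that marks a key as deleted


class TinyLSM:
    """Toy LSM tree: one memtable plus a list of immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (entries)
        self.memtable = {}                    # in-memory buffer of recent writes
        self.sstables = []                    # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just a write of a tombstone; nothing is removed in place.
        self.put(key, TOMBSTONE)

    def flush(self):
        # Write the memtable out as a new immutable sorted run.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for run in reversed(self.sstables):  # newest run first
            i = bisect.bisect_left(run, (key,))  # binary search on the sorted keys
            if i < len(run) and run[i][0] == key:
                value = run[i][1]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Merge all runs; newer values overwrite older ones, tombstones are dropped.
        merged = {}
        for run in self.sstables:
            merged.update(run)
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.sstables = [live] if live else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)      # memtable is full, so it is flushed as the first run
db.put("a", 10)
db.delete("b")      # tombstone is flushed with the second run
db.put("c", 3)      # still only in the memtable
```

Note that `get("b")` has to stop at the tombstone in the newest run rather than fall through to the older run that still holds the old value; compaction is what eventually removes both.
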
In summary, LSM Trees play a crucial role in enabling NoSQL databases to handle high write rates efficiently. They achieve this through a combination of append-only storage, efficient flushing mechanisms, and compaction strategies. However, mitigating the drawbacks associated with compaction remains a challenge for optimizing LSM Tree performance.

Resources
● https://www.youtube.com/watch?v=I6jB0nM9SKU&ab_channel=ByteByteGo

B-Tree Storage Engine

In a B-tree index structure, a root page serves as the starting point for key lookups. Each page contains keys and references to child pages, with each child responsible for a specific key range. To find a key, you follow the reference corresponding to the key's range until you reach a leaf page containing individual keys and their values.

The number of references to child pages in a page is called the branching factor, typically several hundred. To update a value for an existing key, you locate the leaf page containing the key, update the value, and write the page back to disk. Adding a new key involves finding the appropriate page and adding it there. If there is insufficient space, the page is split into two, and the parent page is updated to reflect the new key ranges.

This process ensures efficient key lookups and updates in the B-tree structure, even as the dataset grows.

Notes
● During a page split, write amplification can be high.
● A four-level tree of 4 KB pages with a branching factor of 500 can store up to 250 TB (500^4 pages of 4 KB each).
● In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log).
● Comparison with LSM Trees
  ○ LSM compaction can be heavy and hurt tail latencies, similar to Java garbage collection.
  ○ The disk bandwidth is shared between writes (i.e. appends) and compaction.

OLAP Databases

Some of the columns in the fact table are attributes, such as the price at which the product was sold and the cost of buying it from the supplier (allowing the profit margin to be calculated). Other columns in the fact table are foreign key references to other tables, called dimension tables. As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event.

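A fact table with attribute columns and foreign keys into dimension tables might look like the following sketch, using Python's built-in `sqlite3` module. All table names, column names, and rows here (`fact_sales`, `dim_product`, `dim_store`, and so on) are made up for illustration; they are not taken from the book or from any real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: the "what" and "where" of each sale (hypothetical schema).
CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_sk   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one row per event, with attributes plus foreign keys to dimensions.
CREATE TABLE fact_sales (
    product_sk INTEGER REFERENCES dim_product(product_sk),
    store_sk   INTEGER REFERENCES dim_store(store_sk),
    net_price  REAL,   -- price at which the product was sold
    cost       REAL    -- cost of buying it from the supplier
);

INSERT INTO dim_product VALUES (1, 'banana'), (2, 'apple');
INSERT INTO dim_store   VALUES (1, 'Lisbon');
INSERT INTO fact_sales  VALUES (1, 1, 2.0, 1.5), (1, 1, 2.0, 1.5), (2, 1, 3.0, 1.0);
""")

# Profit margin per product: join the fact table to one of its dimensions.
margins = conn.execute("""
    SELECT p.name, SUM(f.net_price - f.cost) AS margin
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_sk = f.product_sk
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```

The query touches only the attribute columns it needs and resolves the product name through the dimension table, which is the typical shape of an analytic query over this kind of schema.
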
The name "star schema" comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.

A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions.

Columnar storage

Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at one time ("SELECT *" queries are rarely needed for analytics).
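
The idea behind column-oriented storage can be sketched as follows: instead of keeping all the values of one row together, keep all the values of one column together, so a query reads only the columns it needs. This is a minimal in-memory illustration; the helper names `to_columns` and `total_quantity` and the sample rows are assumptions for the example, and a real engine would also store each column in its own file, usually compressed.

```python
# Row-oriented layout: each record holds every column (made-up sample data).
rows = [
    {"date_key": 140102, "product_sk": 69, "store_sk": 4, "quantity": 1},
    {"date_key": 140102, "product_sk": 74, "store_sk": 3, "quantity": 3},
    {"date_key": 140103, "product_sk": 69, "store_sk": 2, "quantity": 5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column, preserving row order."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_quantity(columns, product_sk):
    """Sum quantity for one product, reading only the two columns involved."""
    return sum(
        quantity
        for product, quantity in zip(columns["product_sk"], columns["quantity"])
        if product == product_sk
    )


columns = to_columns(rows)
total = total_quantity(columns, product_sk=69)
```

The k-th entry of every column list belongs to the same row, which is what lets the query line up `product_sk` and `quantity` without ever loading `date_key` or `store_sk`.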