A View Based Security Framework for XML
Wenfei Fan, Irini Fundulaki, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis
University of Edinburgh
Digital Curation Center

Introduction
XML data management: its importance is clearly demonstrated by the wide adoption of XML-related technologies in eScience projects.
Selective exposure of information in XML is a primary concern for data providers, curators and consumers.
It safeguards data confidentiality, privacy and intellectual property.

Introduction --- Security View
Security view: multiple user groups may wish to query the same XML document. Different access policies may be imposed on them, specifying the portions of the document that the users are granted or denied access to.
Security views are necessarily virtual: it is prohibitively expensive to materialize and maintain a large number of views.

Example: a medical records XML database
[Figure: tree of the hospital document. Node labels: Hospital; Psychiatry; Genetics; Record (three), each with Date, Doctor, Bill and Patient; leaves Diagnosis, Name, Sex, Name; text values 'David', 'Mark', 'David', 'Mary', 'Angela', 'Mary'.]
The security admin could access the whole db.
Doctor “David” can only see the records of his patients.
Patient “Mary” can access her own medical records.

Insurer’s view
[Figure: insurer's view tree. Node labels: Hospital; Record (two), each with Date, Bill and Patient; Patient with Name and Sex; text values 'Mary', 'Mary'.]
An insurer can only read his customers' billing info.

Researcher’s View
[Figure: researcher's view tree. Node labels: Hospital; Record (three), each with Date, Doctor and Patient; leaves Diagnosis and Sex.]
A medical researcher could retrieve the diagnosis data for research purposes, but not the information on doctors or patients.

System Architecture
[Figure: system architecture. Labels: researchers; security admins; Security Spec. Editor; Security Specification S; Query Editor; View Derivation; Security View VD for Role UD with XSD DD; ...]
[Figure: system architecture, continued. Labels: Security View VP for Role UP with XSD DP; Security View VR for Role UR with XSD DR; Query QR on VR; Result Viewer; Query Rewriting; Query QT on T; XSD D for document T; XML document T; Indexer; Query Optimization; Query Evaluation. Legend: input module; output module; core module; optional module; virtual view; XML schema; XML database; XML data flow; other data flow.]
security spec. lang. LS: used by admins.
view spec. lang. LV: transparent to users.
view query lang. LQV: used by users.
doc query lang. LQR: transparent to users.

Security Specification
[Figure: DTD graph with nodes hospital, patient, pname, visit, parent, treatment, date, test, medication; starred edges mark repetition.]
Document DTD:
  hospital -> patient*
  patient -> pname, visit*, parent*
  parent -> patient
  visit -> treatment, date
  treatment -> test + medication
Access annotations:
  (hospital, patient) = [visit/treatment/medication = ‘autism’]
  (patient, pname) = N
  (patient, visit) = N
  (visit, treatment) = [medication]
  (treatment, test) = N

Security Specification
Classify the nodes in the XML document:
  accessible nodes
  inaccessible nodes
  conditional accessible nodes
Support:
  inheritance
  overriding
  content-based access privilege
  context-dependency

View derivation module
Schema availability: the availability of an XML schema that specifies the structure of accessible data is critical to the users, who can then formulate queries only over this schema.

View Specification
[Figure: view DTD graph with nodes hospital, patient, treatment, parent, medication; starred edges mark repetition.]
View DTD:
  hospital -> patient*
  patient -> treatment*, parent*
  parent -> patient
  treatment -> medication
View annotations:
  (hospital, patient) = patient[visit/treatment/medication = ‘autism’]
  (patient, treatment) = visit/treatment[medication]
  (patient, parent) = parent
  (parent, patient) = patient
  (treatment, medication) = medication

Query Over the View
Regular XPath Query: a mild extension of XPath that supports the general Kleene closure (.)* instead of the limited recursion “//”.
Why: XPath is not closed under query rewriting, i.e.
for an XPath query on a recursively defined view there may not exist an equivalent XPath query on the underlying document.

Query Over the Document
Regular XPath Query
However, the size of the rewritten query QT, if directly represented in Regular XPath, may be exponential in the size of the input query QV.
We overcome this challenge by employing an automaton characterization of QT, denoted by MFA (mixed finite state automata), which is linear in the size of QV.

Query Rewriting Module
MFA: Internal Query Representation
NFA: captures selecting paths
AFA: captures filters
[Figure: MFA for the query below, drawn as numbered states. Edge and state labels: hospital, patient, pname, parent, visit, treatment, test, medication, and, TEXT_EQUAL 'headache'.]
hospital/patient[(parent/patient)*/visit/treatment/test and visit/treatment[medication/text()=“headache”]]/pname

Query Evaluation: HyPE
We propose a novel algorithm, HyPE (Hybrid Pass Evaluation), for processing Regular XPath queries represented by MFAs.
A unique feature of HyPE is that it needs only a single top-down depth-first traversal of the XML tree, during which HyPE both evaluates predicates of the input query (equivalently, AFAs of the MFA) and identifies potential answer nodes (by evaluating the NFA of the MFA).
Previous systems need to traverse the XML document at least twice to evaluate XPath queries.

HyPE: Cans (candidate answers)
The potential answer nodes are collected and stored in an auxiliary structure, referred to as Cans (candidate answers), which is often much smaller than the XML document tree.
A pass over Cans is needed to retrieve the real result nodes.

HyPE
[Figure: HyPE run on the hospital document tree, with nodes annotated by NFA and AFA state numbers. Node labels: hospital, patient, parent, visit, pname, treatment, test, medication; text value “cold”. Legend: Cans: candidate answer; real answer; states in AFA; pruned subtree.]
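The single-traversal idea with a Cans structure can be illustrated with a much simplified sketch. It is not the HyPE algorithm itself: it handles only child-axis selecting paths with at most one existential filter path per step, with no Kleene closure and no boolean connectives, and the names `Node`, `evaluate`, `steps`, `cans` and `truth` are all hypothetical. What it does show is how a candidate such as pname can be recorded before the filter on its patient ancestor has been decided, and how a final pass over the candidates settles the answer.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Node:
    label: str
    text: str = ""
    children: tuple = ()


def evaluate(root, steps):
    """steps: list of (label, filter); filter is None or (path_labels, text)."""
    cans = []    # candidate answers: (node, ids of the filters they depend on)
    truth = {}   # filter instance id -> True once the filter path is found
    ids = count()

    def visit(node, i, conds, active):
        # Advance every filter path that is waiting for this node's label.
        still_active = []
        for cid, (path, text), k in active:
            if node.label != path[k]:
                continue
            if k < len(path) - 1:
                still_active.append((cid, (path, text), k + 1))
            elif text is None or node.text == text:
                truth[cid] = True
        # Advance the selecting path.
        next_i = None
        if i is not None and node.label == steps[i][0]:
            filt = steps[i][1]
            if filt is not None:
                cid = next(ids)
                truth[cid] = False          # unknown until the subtree is seen
                conds = conds + (cid,)
                still_active.append((cid, filt, 0))
            if i == len(steps) - 1:
                cans.append((node, conds))  # candidate, not yet a real answer
            else:
                next_i = i + 1
        if next_i is None and not still_active:
            return                          # nothing can match below: prune
        for child in node.children:
            visit(child, next_i, conds, still_active)

    visit(root, 0, (), [])
    # Pass over the candidates: keep those whose filters all turned out true.
    return [n for n, conds in cans if all(truth[c] for c in conds)]


doc = Node("hospital", children=(
    Node("patient", children=(
        Node("pname", "Mary"),
        Node("visit", children=(
            Node("treatment", children=(Node("medication", "headache"),)),)),
    )),
    Node("patient", children=(
        Node("pname", "Bob"),
        Node("visit", children=(
            Node("treatment", children=(Node("medication", "cold"),)),)),
    )),
))
query = [("hospital", None),
         ("patient", (["visit", "treatment", "medication"], "headache")),
         ("pname", None)]
print([n.text for n in evaluate(doc, query)])  # ['Mary']
```

Each tree node is visited at most once, and the list of candidates is usually far shorter than the document, which is the property the Cans slide relies on.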
[Figure: HyPE run, continued. Node labels: patient, parent, visit, pname, treatment, test, medication; text value “headache”. Legend: a state in NFA; a state in NFA annotated by a false AFA.]

SMOQE: A Reference Implementation
We have developed a reference implementation, called SMOQE (Secure MOdular Query Engine), for the security framework proposed in this paper.
It is implemented in Java.
It was demonstrated at VLDB 2006.

Conclusion
A generic, flexible view-based access control framework for protecting XML data, and its implementation, SMOQE:
  able to enforce fine-grained access policies according to the structure and values of the protected XML data
  schema availability: view derivation
  efficient enforcement of security constraints during XML query evaluation:
    query rewriting
    automaton-based representation
    evaluation using HyPE and optimization

Thank you!
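
Backup: classifying nodes from the access annotations
The annotations on the Security Specification slide can be read as rules that classify each document node. The sketch below is one assumed reading, not the framework's definition: N marks a node inaccessible, a bracketed qualifier makes it conditionally accessible, a node without an annotation inherits from its parent, an explicit annotation overrides the inherited value, and a failed qualifier hides the whole subtree. `Node`, `has_path`, `ANNOTATIONS` and `accessible_nodes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    label: str
    text: str = ""
    children: tuple = ()


def has_path(node, path, text=None):
    """True if the child path exists below node (optionally ending in text)."""
    if not path:
        return text is None or node.text == text
    return any(c.label == path[0] and has_path(c, path[1:], text)
               for c in node.children)


# (parent label, child label) -> "N" or a qualifier evaluated at the child.
ANNOTATIONS = {
    ("hospital", "patient"):
        lambda n: has_path(n, ["visit", "treatment", "medication"], "autism"),
    ("patient", "pname"): "N",
    ("patient", "visit"): "N",
    ("visit", "treatment"): lambda n: has_path(n, ["medication"]),
    ("treatment", "test"): "N",
}


def accessible_nodes(node, parent_label=None, inherited=True, out=None):
    """Collect accessible nodes in document order (root assumed accessible)."""
    out = [] if out is None else out
    ann = ANNOTATIONS.get((parent_label, node.label))
    if ann is None:
        acc = inherited              # inheritance: no explicit annotation
    elif callable(ann):
        if not ann(node):
            return out               # assumed: failed qualifier hides subtree
        acc = True                   # conditional access granted
    else:
        acc = (ann == "Y")           # explicit Y or N overrides inheritance
    if acc:
        out.append(node)
    for child in node.children:
        accessible_nodes(child, node.label, acc, out)
    return out


doc = Node("hospital", children=(
    Node("patient", children=(
        Node("pname", "Mary"),
        Node("visit", children=(
            Node("treatment", children=(Node("medication", "autism"),)),
            Node("date", "2006"),
        )),
    )),
    Node("patient", children=(
        Node("pname", "Bob"),
        Node("visit", children=(
            Node("treatment", children=(Node("medication", "cold"),)),
            Node("date", "2006"),
        )),
    )),
))
print([n.label for n in accessible_nodes(doc)])
# ['hospital', 'patient', 'treatment', 'medication']
```

The surviving labels match the element types of the view DTD on the View Specification slide (hospital, patient, treatment, medication), with visit, pname and date hidden.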