11 Billion Document Benchmark Overview
DynamoDB, S3, Elasticsearch & AWS
July 2019

© 2019 Technology Services Group, Inc. All Rights Reserved.

CONTENTS

Executive Summary
DynamoDB ECM Background
    NoSQL Database
    Searching (DynamoDB Key)
    Searching (Elasticsearch & Key)
    AWS S3 or Glacier FileStore
Benchmark Solution
    DynamoDB AWS Benchmark Environment
    Benchmark Data
    Benchmark Use Cases
Benchmark Lessons Learned
    Phase 1 – Migration
    Phase 2 – Search Indices
    Phase 3 – Adding Documents
    Phase 4 – Load Testing
Summary
Appendix 1 – ECM Database Approach
    DynamoDB – Database Model for Document Management
    DynamoDB – Big Data versus traditional Big Database
    DynamoDB – Schema-on-Read versus Schema-on-Write
    DynamoDB – what it means for Big Data (and big Document issues)
    DynamoDB for Document Management – It's not about Search
    DynamoDB – Big Data for Document Management – Schema-on-Read Example
    DynamoDB – Building a Document Model
    Summary
Appendix 2 – Additional Source Links

EXECUTIVE SUMMARY

In May of 2019, Technology Services Group initiated an 11 Billion Document DynamoDB benchmark, which was completed in June 2019. With the success of the benchmark, TSG demonstrated that AWS, DynamoDB, Elasticsearch and our OpenContent, OpenAnnotate and OpenMigrate products can scale to an unprecedented level and represent the next evolution of enterprise content management: a Big Data, NoSQL approach for the multi-billion object repository. This white paper details TSG's approach to next-generation, large-scale ECM solutions, the benchmark activities and the lessons learned.

TSG initially announced the development effort to create an ECM offering for DynamoDB back in October of 2018. Based on the success of our Hadoop offering (also a NoSQL approach), developing a solution for DynamoDB was greatly simplified, with a majority of the development and testing completed in a couple of months. Having succeeded with multiple on-premise Hadoop clients, the TSG team thought an internal benchmark partnering with Amazon would show off the true power and massive scale potential of DynamoDB and AWS. The goal was to simulate all the components of a massively large repository to verify that AWS, along with TSG's products and approaches, could scale for large-volume, case management clients. The benchmark focused on typical requirements for large-volume clients like health and auto insurance claim repositories, but also included accounts payable and human resources scenarios.

DYNAMODB ECM BACKGROUND

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. The main advantage of DynamoDB is that it lets customers offload the administrative work of operating and scaling a distributed database to AWS. DynamoDB customers do not need to worry about hardware provisioning, replication, software patching or cluster scaling, as AWS handles all of these functions. Amazon believes in the power and scalability of DynamoDB so much that it has become the main database for Amazon's daily transactions. Amazon's trust in DynamoDB is due to Dynamo's high availability and durability design.
DynamoDB leverages global tables that replicate data across AWS regions, so if one region's cluster goes down, the data remains available from another region's table. DynamoDB also leverages the AWS cloud to provide point-in-time recovery from any second over the last 35 days, along with the ability to create on-demand backups and restores as required.

DynamoDB compares very favorably in these regards to Hadoop's HBase, which is also a non-relational database that provides redundancy across clusters. One of the main differences between Hadoop and DynamoDB is the infrastructure each solution requires. Hadoop requires Linux servers for installation. Customers can implement Hadoop either on-premise or in the cloud on provisioned servers, but they need to manage the database and clusters manually or engage a vendor like Hortonworks to manage the Hadoop cluster. DynamoDB automatically handles the administrative tasks associated with provisioning servers but is only available in the AWS Cloud; an enterprise with a strict on-prem policy would not be able to use DynamoDB.

Another major difference is that Hadoop is an Apache open-source project while the DynamoDB APIs are controlled by Amazon. Hadoop allows users to see the source and even contribute enhancements if needed, and generally the libraries will not change in substantial ways after deployment. The DynamoDB APIs, on the other hand, are controlled by Amazon and are subject to change at any point, which could impact deployments against the database. TSG would expect major DynamoDB API changes to be rare, but it is an important consideration when choosing a database solution.

NoSQL Database

One of the major benefits for DynamoDB customers is a "not only SQL" approach, often referred to as NoSQL. DynamoDB is a distributed, versioned, non-relational database whose design descends from Amazon's Dynamo work, much as Hadoop's HBase is modeled after Google's Bigtable: A Distributed Storage System for Structured Data. Since the 1990s, all ECM vendors have leveraged some type of database under the covers of their architecture to manage the metadata, relationships and other data components of documents. Metadata and attributes include title, file location, author, security, relationships, annotations, folders and all other data associated with the document. Traditionally, legacy ECM solutions would require Oracle or Microsoft SQL Server, while newer solutions, like Alfresco, would leverage MySQL. In considering "what's next" for our ECM customers, TSG realized that NoSQL approaches could provide tremendous benefits. See our initial post from 2015 on leveraging a Big Data approach for Document Management as well as our whitepaper from analyst Alan Pelz-Sharpe.

Some of the unique and modern features of DynamoDB or Hadoop versus traditional relational databases include:

• Limited Database/Docbase Administration – Built for a "big data" approach, DynamoDB allows the database to adapt as new data is presented rather than following the traditional "call the DBA to add a column or an index" approach. Users should think of it as a "tagging" structure rather than a traditional relational database model with a strict schema.
Tagging is something that inherently fits into a content management framework, as we are always tagging documents with metadata. For an ECM example, when storing an invoice document, DynamoDB can receive all of the attributes as consistent column families and will take care of storing all the descriptors and values. If at a later time the next invoice arrives with a new value not associated with the old invoices, DynamoDB can easily append the value to just those documents.

• Limited Back-up/Recovery – Also as part of the "big data" approach, DynamoDB provides a scalable, redundant model that can be leveraged across multiple AWS Regions with automatic redundancy/clustering. One of the big issues with the typical ECM relational database approach has always been coordinating the back-up of the relational database with the filestore. DynamoDB removes that requirement as well as simplifying the setup of a clustered environment.

• Scaling and Ingestion – As a schema-less approach, DynamoDB or Hadoop can quickly ingest objects much like a SAN, rather than a database that needs to place different data in different columns and make sure they all tie back together before committing the data. This was a critical component of the benchmark, as the goals were to ingest more than 1 billion documents a day and scale processing up to 20,000 documents per second.

Searching (DynamoDB Key)

DynamoDB's redundancy/clustering approach requires a different approach when it comes to searching. Retrieving metadata from DynamoDB is available but slightly different from a relational database, as DynamoDB will farm the request out (again, a big-data approach) to multiple servers and compile the results. While this approach is acceptable for most big-data analysis, it isn't always acceptable for large repositories and complex searches. Search performance is always a key requirement of any successful ECM implementation.

DynamoDB does provide one indexed key for quickly retrieving objects. Innovative clients can leverage this key if searching requirements consistently fall into one-key retrieval. For example, for many of our insurance claims clients, the key to the case folder is typically the claim number.

For many high-volume environments, Case Management represents the vast majority of how users access a document or a case of multiple documents. Typically, a Case ID or similar ID (for example, a claim number or vendor number) can be used to uniquely identify the folder/case. In these applications, it is important to leverage the proper architecture to allow for fast access to the case without requiring the use of the search/index server. By leveraging a smart key design, many of our case management clients can access the case without the Solr/Elasticsearch infrastructure at all.

When a document can have a Case ID in the object model, NoSQL can make use of a design pattern that prepends the Case ID to the beginning of the document ID. In this manner, when a user requests the documents for a case, both HBase and DynamoDB design best practices dictate a pattern like this to allow a lightning-fast range scan to quickly bring back all of the documents for a particular case. This operation is significantly faster in a large repository, with the added benefit of not needing to leverage the search index at all. Below is an example of how such a key design can efficiently get to a case's documents when the case number is known.
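As a minimal boto3 sketch of the pattern, assume a hypothetical documents table keyed on caseId (partition key) with a case-ID-prefixed documentId (sort key); the table, key and attribute names here are ours for illustration, not the benchmark's actual layout:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: partition key = caseId, sort key = documentId
# (case-ID-prefixed). Keying on the case ID lets DynamoDB answer
# "all documents in this case" with a single indexed Query, the
# DynamoDB equivalent of an HBase range scan, with no search index.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("documents")  # assumed table name

def documents_for_case(case_id):
    """Return every document item filed under one case folder."""
    items, start_key = [], None
    while True:
        kwargs = {"KeyConditionExpression": Key("caseId").eq(case_id)}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.query(**kwargs)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")  # page through large cases
        if not start_key:
            return items

# e.g., fetch all documents for a (made-up) claim number
docs = documents_for_case("CLM-0042781")
```

Because the request is served entirely from the key, response time stays essentially flat as the repository grows.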
We have found that scanning the database directly via these patterns offers predictably fast access to view all documents in a particular claim. Clients with a case management use case can apply this design pattern to avoid entirely having to create a Solr or Elasticsearch index, which in our experience can be difficult and expensive to maintain for multi-billion document repositories. Many of our case management clients run in production without an index server at all. As we found in our indexing benchmark, the infrastructure costs alone for an index of this size can be multiple times more expensive than the NoSQL database infrastructure, so we recommend that case management clients take this approach.

Searching (Elasticsearch & Key)

TSG recommends leveraging Elasticsearch/Solr/Lucene as both the metadata and full-text search engine when required. Similar to how legacy ECM systems use a relational database to store attributes, DynamoDB retrieval could be used for system-of-record requests (e.g., "What are the attributes of this document?"). Anytime a search against metadata is needed, Solr/Lucene would be used for searching for documents. Typical scenarios for searching fall into two basic patterns:

Search Pattern 1: Retrieval

Typically, search users will know a few of the attributes of the document (title, date, author) and will run various searches against the Solr/Elasticsearch index to find the documents. Solr and Elasticsearch are the perfect tools for quickly searching through the index and returning the IDs and metadata for each document that fits the criteria. Once the user has found the document they would like to work with, the ID of that document is passed along to HBase/DynamoDB to retrieve the content for the user to view, edit, annotate or perform another document action. For a typical ECM deployment with high search requirements, the Solr/Elasticsearch infrastructure can be scaled to meet the needs of the system.

Search Pattern 2: Analytics

If the ability to perform deep analytics on the attributes and full-text content of documents is a requirement, we typically recommend separate indexes in Solr/Elasticsearch targeted for the specific use cases that the data scientists are requesting. TSG no longer recommends one massive index of all the attributes and full-text content for all documents in the entire repository, as it can be problematic, especially if it is the same index being used by end users. See our thoughts on creating separate indexes with Solr in this post.

AWS also has various index services integrated into its platform. While TSG used Elasticsearch for the DynamoDB search index, this implementation could be moved over to the Amazon CloudSearch service in the future, the benefit being AWS managing the index in much the same way as it manages DynamoDB.

AWS S3 or Glacier FileStore

DynamoDB is unique in that it is not priced solely on storage size, but also on read and write "units" into the DB. This pushes the solution away from storing the physical contents of
documents within DynamoDB and toward leveraging AWS storage solutions (S3 and Glacier). These services have the ability to replace another traditionally costly component of the ECM architecture stack: the Storage Area Network, or SAN.

DynamoDB has the ability to set a Time-To-Live (TTL) on everything stored within the DB. If data storage and pricing were becoming a concern, customers could configure a TTL for all main content and have it linked back to a normal S3 bucket. Once the TTL in the DB is hit, the content could be archived in Glacier and the DB metadata could be offloaded, if needed, to a lower-cost DB store.

For our benchmark, we relied on all the content of the documents being stored in S3 with links in DynamoDB. As the benchmark focused on the metadata repository, the test focuses on the ability of DynamoDB to store a high volume of data while maintaining performance. The benchmark reuses S3 documents to simulate a real repository, and it assumed that the content files had been migrated to S3 in preparation for the migration activities with AWS Snowball or other file migration approaches.

BENCHMARK SOLUTION

The goal of the 11 Billion Document Benchmark was to simulate all the components of a massively large repository to verify that TSG's tools and approaches could scale. One truly powerful component of DynamoDB and other NoSQL approaches is how quickly TSG can build a massive repository leveraging AWS hardware scaling. When compared with other ECM repositories based on legacy relational database technologies, DynamoDB's "schema on read" approach can perform ingestion at unbelievable rates of speed.

Rather than pursue a benchmark built on a "do everything" approach, the benchmark focused on capabilities required for large-volume clients like health and auto insurance claim repositories. With large-volume clients, "do everything" document management solutions can add significant overhead and performance issues for massive repositories. Some examples of the large-volume approach include:

• Case Security rather than repository security – The case management approach assumes that an external system is handling the security of which case is available for access by which user. A majority of insurance claim clients prefer this approach to improve performance and reduce the bloat and added system requirements of the ECM system. While TSG's DynamoDB solution does have ACL and document security, it was not enabled for the benchmark.

• Case Search rather than document search – Similar to the security, typical case or claim security allows for searching for a case or claim and then, once in the claim, searching through the documents. The benchmark's initial phase included an all-repository case search based on AWS Elasticsearch and did not initially include an "all repository document" search. Document searching leveraging Lambda Elasticsearch index population from DynamoDB was included for certain scenarios.

• DynamoDB scaling but not S3 scaling – The benchmark targeted 11 billion unique document objects in DynamoDB pointing to content in S3. The benchmark shared S3 links to actual files in S3 and assumed that the content files had been migrated to S3 in preparation for the migration activities with AWS Snowball or other file migration approaches.
• TSG’s Product Set – The benchmark included migration activities leveraging TSG’s OpenMigrate as well as case viewing and updating with TSG’s OpenContent Management Suite including OpenAnnotate consistent with current client access needs. • High volume input - To complete the benchmark, TSG was estimating 12,000 documents migrated per second with 1 billion documents migrated per day. In preliminary tests, TSG was able to increase our throughput to 20,000 documents per second. • Quick viewing of documents - For viewing, TSG targeted sub-second viewing of case folder contents as well as document viewing. TSG supported both viewing and annotating documents as well as combining document capabilities. DynamoDB AWS Benchmark Environment Components of the AWS testing environment included: • DynamoDB AWS Managed Service – 26,000 write units, Minimal read units • Elastic AWS Managed Service – 8 data nodes – r5.4xlarge.elasticsearch (100 GIB EBS gp2), 3 master nodes – r5.xlarge.elasticsearch • OpenMigrate – 2 EC2 m5.24xlarge instances (380 GB java heap – 96 CPUs) • OpenContent Management Suite and OpenAnnotate – EC2 t2.medium instance (4GB RAM – 2 CPUs) © 2019 Technology Services Group, Inc. All Rights Reserved. 9 11 Billion Document Benchmark: ECM Best Practices Benchmark Solution Post large scale migration, TSG significantly reduced (or fully retired) the OpenMigrate servers. Additionally, DynamoDB was reconfigured for write and read units to be more in line with daily usage. Benchmark Data One of the biggest issues of creating a representative benchmark is providing realistic data at volume. As with most benchmarks, TSG relied on sample/text data. Any successful test needs to make sure that the created documents and folders in the large repository are sufficiently different enough to be a real-world example of how the repository would function at scale. For sample data, TSG leveraged on Open Addresses to create example case folders for each address. TSG targeted example case folders for Accounts Payable, Auto Claim, Health Care Claim and Human Resources for example cases with different types of documents for each type of case. Each address had the four case folders attached with the folder and document name were modified with part of the address to keep the uniqueness of each document and folder. OpenMigrate’s multithreaded capabilities allowed TSG to set up a massive number of threads to read through the addresses and create the folders and documents in DynamoDB. Benchmark Use Cases One thing that has guided the benchmark efforts from the beginning was getting input from outside council including Alan Pelz-Sharpe with Deep Analysis as well as Rich Medina from Doculabs. Early Alan stated that, while multiple companies had done billion document benchmarks in the past, his concern as an analyst was “Great – you can hold a billion documents in the repository, what can you do with those documents?” Once the migration activities are complete, here are the different scenarios we plan on supporting. • Search for Case Folder - Users are able to search across the entire repository for any folder • View Case Documents – Users are able to view a listing of all documents or videos in the folder • Document Actions – Users are able to view and update properties and annotate both documents and video • Folder Actions – Users are able to view all documents with user preferences for attribute order and column sorting, combine PDF and combine into Zip. 
• Related Documents and Folders – Users are able to see related folders.
• Search for Documents – Leveraging TSG's Elasticsearch index for all documents, users are able to search for documents based on attributes across the repository.
• Add Documents – Users are able to add documents in a variety of supported methods (bulk upload, drag and drop, scanner, traditional file browser).
• Load Testing – Concurrent user test of 11,000 threads performing standard document management including search, annotate and adding documents.

BENCHMARK LESSONS LEARNED

Phase 1 – Migration

Over the course of seven days, TSG was able to successfully ingest 11 billion documents and create almost 1 billion folder objects. Before starting the benchmark, TSG set a realistic goal for ingestion: be able to move 1 billion documents a day into DynamoDB.

Lesson – Iterate, iterate, iterate

The migrations started small and over multiple iterations worked towards that goal. In order to hit a billion documents in 24 hours, OpenMigrate (OM) would need to move about 12,000 documents a second. Our iterations looked something like:

• 1 OM on a t2.medium with default thread settings – 510 docs/sec – 8,000 docs moved
• 1 OM on a t2.medium with OM thread performance tweaks – 553 docs/sec – 8,000 docs moved
• 1 OM on an m5.24xlarge (96 CPUs) with OM thread performance tweaks – 3,447 docs/sec – 8,000 docs moved
• 1 OM on an m5.24xlarge (96 CPUs) with OM thread performance tweaks – 3,389 docs/sec – 70,000,000 docs moved
• (Continue to steadily iterate and increase documents moved until we hit the final test run before kicking off the benchmark)
• 2 OM on m5.24xlarge with OM thread performance tweaks and Elasticsearch indexing performance updates – 22,367 docs/sec – 530,000,000 docs moved

Lesson – Reduce Bloat in the Metadata

Metadata is always important for document management solutions, and while TSG wanted enough metadata that the benchmark was realistic, it was important to reduce the bloat of the metadata and take advantage of the admin capabilities to map metadata to the folder level. TSG would recommend that clients critically look at their metadata models and see if they can identify any metadata bloat that could negatively impact performance.

Lesson – Challenge Assumptions in Regards to Performance

The benchmark focused on using TSG's product, OpenMigrate, for the ingestion/migration of the 11 billion documents. While TSG has done plenty of large migrations with OpenMigrate before, OM had always been constrained by the performance of the ECM repository and underlying database. In initial planning, TSG calculated that 10 OpenMigrate instances running concurrently would be needed to hit 12,000 documents per second. However, once deployed against DynamoDB rather than a traditional SQL repository, 2 OpenMigrate instances (EC2 m5.24xlarge, 380 GB Java heap, 96 CPUs each) were able to exceed the goal, reaching 20,000 documents per second. Minimal tweaks to out-of-the-box OpenMigrate enabled TSG to complete this benchmark with only two instances of OpenMigrate instead of ten. Sometimes less really is more.
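OpenMigrate itself is a Java product and its internals are not shown here, but a minimal Python sketch of the kind of threaded batch-write loop that makes these rates possible against DynamoDB might look like the following (table, key and attribute names are illustrative assumptions):

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("documents")  # assumed table name

def ingest_chunk(rows):
    # batch_writer buffers puts into 25-item BatchWriteItem calls and
    # automatically resends unprocessed items, so each thread keeps the
    # provisioned write units busy with minimal request overhead.
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item={
                "caseId": row["case_id"],        # partition key
                "documentId": row["doc_id"],     # case-ID-prefixed sort key
                "title": row["title"],
                "content": row["s3_link"],       # pointer to the S3 file
            })

def ingest(rows, chunk_size=1000, workers=64):
    """Fan row chunks out across threads, mirroring OM's multithreading."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(ingest_chunk, chunks))
```

Because DynamoDB never blocks writers on schema validation or cross-table consistency, throughput here scales with threads and provisioned write units rather than with database tuning.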
Phase 2 – Search Indices

One of the key focus areas of the benchmark regarding the document index was departmental search versus a complete repository search. TSG recommends pushing for multiple, efficient search indices rather than one large index, to allow content services to perform quicker with less complexity. With the success, capabilities and cost of robust search tools like Solr and Elasticsearch, TSG has been recommending for years that clients create focused indices rather than one large "do all" index to serve every possible purpose. For the benchmark, TSG focused on two indices:

• Large Folder Index – All folders would have an index for basic search. Typically, in a case management scenario, large clients are looking to navigate to the folder to look at documents in the case. Getting access to the case folder and documents quickly is a key requirement.

• Focused Document Index – A more focused index for a subset of documents. These indices could be created when needed for specific scenarios or requirements.

While TSG felt very confident that an Elasticsearch index for all the documents in the repository would be possible, especially with sharding capabilities, the index services from AWS come at a higher price than just DynamoDB storage. The main focus was to prove out the indexing capabilities on a smaller scale while providing a realistic approach for anyone to consider when building and maintaining Elasticsearch indices.

Lesson – Create Indices Upon Ingestion

As part of the 11 billion document ingestion, TSG leveraged OpenMigrate to create an Elasticsearch index for every folder in the repository, finishing with 925,837,980 folders in the Elasticsearch index. TSG created this index as part of the OpenMigrate ingestion process that was storing 21,000 objects per second. Statistics from the folder indexing benchmark include:

• 925,837,980 folders stored in DynamoDB and indexed in Elasticsearch
• 6 data nodes and 3 master Elasticsearch servers to maintain the index
• Objects (documents and folders) ingested per second – 21,000
• Indexing time for 925,837,980 folders – 159 hours
• Elasticsearch index size – 372 gigabytes; DynamoDB repository size – 5.32 terabytes

Lesson – Create Indices for Specific Business Needs

For the document index, TSG wanted to show a scenario where an index would be created for a specific purpose. The decision was made to index 1 million documents from the perspective of an accounts payable scenario as a feasible test. This is consistent with client experience, where clients have said "I only need to index the last X months for X." Statistics from the document indexing benchmark are different from the DynamoDB effort, as the content was already in DynamoDB (and needed to be retrieved to be indexed):

• 1 million documents already existing in DynamoDB
• Documents indexed per second – 501.39
• Indexing time – 33 minutes 14 seconds
• Elasticsearch index increase – 4 gigabytes – total size 376 gigabytes

This effort scanned through the existing DynamoDB content and, if the object type of the content was an AP invoice document, indexed the content until 1 million documents were found.
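In the benchmark this scan-and-index pass was performed by OpenMigrate; purely as a hedged illustration of the pattern, a Python version using boto3 and the elasticsearch client could look like this (the objectType attribute, index name and cluster endpoint are all assumptions):

```python
import boto3
from elasticsearch import Elasticsearch, helpers

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("documents")              # assumed table name
es = Elasticsearch("https://es-endpoint:9200")   # assumed cluster URL

def build_ap_index(limit=1_000_000):
    """Scan existing items, index only AP invoices, stop at `limit`."""
    indexed, start_key = 0, None
    while indexed < limit:
        kwargs = {
            # A filtered Scan still reads every item it examines (and
            # consumes read units for them), which is why this pass is
            # costlier than indexing at ingestion time.
            "FilterExpression": "objectType = :t",   # assumed attribute
            "ExpressionAttributeValues": {":t": "ap_invoice"},
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.scan(**kwargs)
        actions = [{
            "_index": "ap-invoices",
            "_id": item["documentId"],
            "_source": {"title": str(item.get("title", "")),
                        "caseId": str(item.get("caseId", ""))},
        } for item in page["Items"][: limit - indexed]]
        if actions:
            helpers.bulk(es, actions)                # bulk-load the batch
        indexed += len(actions)
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break                                    # table exhausted
    return indexed
```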
OpenMigrate was able to create the 1 million index entries after the DynamoDB ingestion process was complete in just under 35 minutes, or roughly 500 documents per second. Unlike the main ingestion process, where there were only 7 servers, we had added more processing power to the Elasticsearch cluster to get up to 9 servers. Additional servers could have been added to further reduce the indexing time and improve throughput.

Lesson – AWS Lambda

TSG initially hoped to use AWS Lambda for indexing from DynamoDB to Elasticsearch. Unfortunately, initial attempts revealed:

• AWS Lambda currently has a 5-minute execution timeout, making the large volume difficult to index.
• TSG considered indexing from DynamoDB Streams, but those records are only available for 24 hours, so they didn't fit the scenario of building a purpose-built index.

TSG decided to leverage OpenMigrate to both read and index documents into Elasticsearch, to be consistent and provide the same infrastructure for both initial ingestion and the creation of new indices in the future.

Lesson – Scaling Elasticsearch versus DynamoDB Is Drastically Different

TSG found the current pricing of Elasticsearch versus DynamoDB to be very different, with Elasticsearch more expensive due to the size of the cores needed to support large ingestion and indices. In TSG's benchmark, DynamoDB stored around 13 times the amount of data that Elasticsearch did, yet Elasticsearch cost about 1.3x more than DynamoDB over the course of the benchmark. Unlike DynamoDB, where the infrastructure could scale up for ingestion and then drop read/write units once the large migration was complete, Elasticsearch requires servers to be maintained and operational for both ingestion and later access. DynamoDB read/write units are priced and maintained very differently than Elasticsearch EC2 instances.

Phase 3 – Adding Documents

When setting up the benchmark, TSG specifically chose to separate the first large-scale ingestion phase (11 billion documents – almost 1 billion folders) from this third phase of users adding documents to folders. This approach is consistent with how many clients roll out TSG's interfaces when replacing a legacy solution. Many of TSG's clients have chosen to expose the OpenContent interfaces on their existing content after a large or rolling migration before allowing users to add documents to the new repository.

One of the key discussions for this phase focused on how to add/retrieve content from a folder. A key requirement was viewing case documents and allowing users to "view a listing of all documents or videos in the folder".

Lesson – Don't Assume One Solution Will Work for Every Scenario

As part of the benchmark, the team tested two approaches for displaying the contents of a folder:

• JSON storage of document objects – The folder DynamoDB object contains all of the document IDs in a repeating field (see the sketch after this list). This allows for fast viewing of the objects in the folder, a typical requirement for case management/folder viewing. Benefits included fast, scalable access to folder objects without a large Elasticsearch index.

• Elasticsearch for documents – During the first ingestion phase, Elasticsearch was only being used for access to folder objects. Phase 2 of the benchmark tested indexing part of the 11 billion documents to test leveraging Elasticsearch for displaying objects contained in a folder.
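A minimal sketch of the JSON-storage approach from the first bullet, assuming a hypothetical folders table and attribute names of our own choosing:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
folders = dynamodb.Table("folders")  # assumed table name

# The folder item carries its children inline in a repeating field, so
# rendering the case folder is one indexed read, no search index involved.
folders.put_item(Item={
    "folderId": "CLM-0042781",                  # made-up claim number
    "folderName": "Auto Claim 0042781",
    "documentIds": ["CLM-0042781#doc-001",      # repeating field of
                    "CLM-0042781#doc-002",      # child document ids
                    "CLM-0042781#doc-003"],
})

def list_folder(folder_id):
    """One get_item returns everything needed to display the folder."""
    item = folders.get_item(Key={"folderId": folder_id})["Item"]
    return item["documentIds"]
```

Note that a DynamoDB item is capped at 400 KB, one practical reason the repeating-field approach stops scaling for very large folders and the Elasticsearch alternative becomes necessary.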
After testing, the team determined that the JSON object store made the most sense given the size of the sample set, but that both alternatives make sense for customers depending on the number of documents in the folder. TSG has one client with 65,000 documents in a folder, so the ability to leverage Elasticsearch for certain client scenarios is required.

Phase 4 – Load Testing

The goals of the concurrent user test were to replicate some of the different issues we have seen from clients in production when a large number of document management users access the system. Components that we specifically wanted to test include:

• Application servers, to reveal bottlenecks or choke points
• DynamoDB performance, to detect and remediate any scanning of the table
• Document searching performance with the Elasticsearch service
• Document retrieval and viewing performance under stress
• Document annotations

The benchmark test was patterned after the most common use case for our insurance clients: Claim Viewing. The claim viewing scenario was scripted as a user opening a claim folder with 25 documents, followed by viewing 3 to 5 of the documents within our OpenAnnotate viewer. This simulates a user accessing the document management system directly from an insurance claim system. For the test we ran a batch of 11,000 users performing the claim viewing scenario across a selection of 20,000 different medical and auto claims out of the total repository of 11 billion documents. The Claim Viewing scenario tested the application servers and the DynamoDB table under stress.

Searching

Not all users executed a search in a Claim Viewing scenario. The number of users who require the ability to search the repository to locate claim information is approximately 25% of the entire universe of users; often these are supervisors, management, or administrators. During the performance test we kept the script simple and executed a search on Name and Address for each user, which taxed the Elasticsearch cluster beyond what we originally targeted for testing.

Annotations

The final use case we implemented was to annotate one of the documents viewed during the Claim Viewing test. Approximately 10% of documents are annotated, and for the test we planned on annotating a document for 1 of every 10 users. One issue we discovered is that while it was simple to add an annotation, resetting the data and deleting the annotations between each test run was out of scope for what we wanted to accomplish in this benchmark. We elected to postpone the performance testing of the annotations until an update was made to allow overwrites and deletes for the annotations.

Testing 11 Thousand Concurrent Users – Benchmark Testing with AWS

Leveraging Amazon Web Services, we started with a moderate sizing of the application servers, DynamoDB read units, and Elasticsearch cluster. Before we scaled the environment up or down, we ran initial tests to set a baseline.

Test Runs

We started with simple baseline test runs and, as we encountered and resolved issues, expanded the architecture to use two JMeter instances and two OCMS instances. When troubleshooting several test runs we simplified the process and used only one JMeter and one OCMS instance.
Run 0,1 – 100 users | JMeter: t2.micro (1) (1 vCPU / 1 GB) | OCMS: t2.medium (1) (2 / 4) | DynamoDB units (min read / write): 100 / 50 | Elasticsearch data nodes (2 AZs): r5.4xlarge (6)
Run 2 – 2,000 users | JMeter: m5a.4xlarge (1) (16 / 64) | OCMS: t2.medium (1) (2 / 4) | DynamoDB: 5000 / 50 and 7000 / 50 | Elasticsearch: r5.large (6)
Run 3 – 2,000 users | JMeter: m5a.4xlarge (2) (16 / 64) | OCMS: r5a.12xlarge (2) (48 / 384) | DynamoDB: 2000 / 50 (auto scale) | Elasticsearch: r5.12xlarge (6)
Run 4 – 5,500 users | JMeter: m5a.4xlarge (1) (16 / 64) | OCMS: r5a.12xlarge (1) (48 / 384) | DynamoDB: 2000 / 50 (auto scale) | Elasticsearch: r5.12xlarge (6)
Run 5 – 5,500 users | JMeter: m5a.12xlarge (1) (48 / 192) | OCMS: r5a.12xlarge (1) (48 / 384) | DynamoDB: 2000 / 50 (auto scale) | Elasticsearch: c5.18xlarge (6)
Run 6 – 11,000 users | JMeter: m5a.12xlarge (2) (48 / 192) | OCMS: r5a.4xlarge (2) (16 / 128) | DynamoDB: 2000 / 50 (auto scale) | Elasticsearch: c5.18xlarge (6)
Run 7 – 11,000 users | JMeter: m5a.2xlarge (2) (8 / 32) | OCMS: r5a.4xlarge (2) (16 / 128) | DynamoDB: 2000 / 50 (auto scale) | Elasticsearch: r5.4xlarge (6)

When running the initial small baseline tests and the subsequent 2,000, 5,500, and 11,000 user tests, we identified several issues, bottlenecks and choke points in each tier of the environment.

JMeter instance issues included:

• Unviewable characters in the test dataset csv files caused bad searches and logins. Resolved by changing file encoding settings in JMeter.
• Encountered timeout issues and maxing of CPU. Tested ramp-up period and task timers to determine impact on the OCMS instance. Increased CPU and adjusted limits in JMeter to resolve the issue.
• A generic JVM out-of-memory message was resolved by increasing the Linux process limit.
• The script did not properly handle mime types. Updated the script to handle mime types correctly.

OpenContent Management Suite issues included:

• Encountered thread timeouts. Modified the JVM memory settings and increased the Tomcat threads.
• Out of memory caused by an issue with garbage collection. Switched to the G1 garbage collector and updated the heap settings.
• Linux OS error – too many open files. Resolved by increasing the limit.
• Encountered HTTP non-response messages and timeouts; increasing the Apache httpd threads resolved the issue.
• A generic JVM out-of-memory message was resolved by increasing the Linux process limit.
• The transformation queue for document viewing maxed out the server CPU. Updated the script to include fewer transformations; we would recommend leveraging external transformation servers.

DynamoDB issues included:

• Searching with a bad value for an object ID caused a table scan and then blocked the remaining threads. Updated the code to validate the ID and prevent the scan (see the sketch after this section).

Elasticsearch issues included:

• Searches with unviewable/hidden characters caused non-terminating search errors and maxed out CPU and memory. The test was updated to exclude unviewable characters.
• Executing searches concurrently for each of the 11,000 users across the 5 shards in the cluster maxed out the Elasticsearch cluster. Modified the test to include a more reasonable percentage of users performing search versus direct access to a folder (the typical use case).

Each of the points above was resolved and retested in subsequent test runs. One bottleneck we experienced that did represent a real-world scenario was maxing out the transformation queue on the OCMS server. For clients with a large volume of users, the transformation queue and process are moved off the OCMS server and scaled out separately. TSG can provide reference architectures for clients implementing OCMS with a large number of users.
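As a hedged illustration of the DynamoDB fix noted above, a validation guard of the following shape would prevent a malformed ID from degrading into a table scan (the ID format, key names and helper are our assumptions, not the benchmark code):

```python
import re

# Assumed id format: case id, '#', document suffix (e.g. CLM-0042781#doc-001).
OBJECT_ID = re.compile(r"^[A-Z]{3}-\d{7}#doc-\d{3,}$")

def fetch_document(table, object_id):
    """Validate the id before touching DynamoDB.

    Rejecting malformed ids up front avoids falling back to a "find it
    anyway" path; in the load test, one bad id triggered a full table
    scan that then blocked the remaining threads.
    """
    if not OBJECT_ID.match(object_id):
        raise ValueError(f"malformed object id: {object_id!r}")
    case_id = object_id.split("#", 1)[0]
    # A keyed get_item costs a single read; never degrade to table.scan().
    return table.get_item(
        Key={"caseId": case_id, "documentId": object_id}
    ).get("Item")
```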
Creating the Test Data

The test data for each of the runs was pulled directly from the Elasticsearch cluster using a Python script. Details for ten thousand auto claim and ten thousand medical claim folders were selected and split across the 11,000 users who were generated in the system for testing. The JMeter test scripts read two csv files, one for users and one for claim folders and properties.

Creating the Test Scripts

Defining realistic test scenarios seemed fairly simple to do at the outset but required more iteration than expected. We started by using BlazeMeter to record a sample set of actions we wanted to take in OCMS. The recording was exported to JMX and then edited to better mimic entry to OCMS from a claim system and to accept dynamic parameters from the csv test data files. In order to make the test plan realistic and dynamic, we added HTTP response extractors and logical conditional statements to link together the task steps for searching, viewing, and annotating.

Lessons Learned – Concurrent User Testing

While the Claim Viewing scenario on its surface is a well-defined set of REST endpoints, adding in the automation to process responses and conditionally execute statements took a few days longer than we expected and required more iterative baseline testing loops than originally planned.

The more interesting troubleshooting issues occurred when we scaled up to two thousand users. At that point we observed DynamoDB scans where we had not seen them at a smaller scale. We cautiously disabled pieces of the test script and added several debuggers. It took several iterations to reduce the noise in the testing logs, by increasing thread limits and open file limits, until we reached the issue of unviewable data in the csv test files. Even after adjusting JMeter to UTF-8 encoding, the first line of the csv file might start with an unviewable character. We added a conditional statement in JMeter to avoid executing any statements containing the character. Alternatively, a more sophisticated means of stripping the character from the file would have been a pre-processor or other script. This issue was due more to our address data file, and we would not anticipate it in a normal client environment.

Once we ensured only good data was being fed into the OCMS test script and had resolved the thread, process, and open file limits, we scaled up to 5,500 and 11,000 users. While other smaller issues surfaced with large numbers of users, the team was confident that they could be resolved for a production client, and ended the testing.

SUMMARY

In TSG's "Cranking it up to Eleven" Billion Document Benchmark, TSG was able to prove the scalability and benefits of both Amazon and a NoSQL approach with DynamoDB over traditional document management solutions based on relational database approaches. During the benchmark, we received some feedback asking "why 11 billion documents and 11 thousand concurrent users?" In deciding on the size of the benchmark, we wanted to exceed the numbers we have seen at clients (7 billion for one prospect) by a large margin.
Compared to some of the other, older billion-document benchmarks conducted by ECM software vendors in the past, this benchmark tested all of the scenarios required by our large-volume clients. Combined with our rolling migration approach, TSG now has an extensive amount of experience and solutions to move large clients to alternative solutions with our products and people. Some examples of ways the benchmark is currently influencing our clients include:

• As a test harness, TSG is proposing leveraging the migration approach and test data to quickly scale up clients' repositories to production volume before moving real data, to test infrastructure and performance.
• In our designs, we are able to combine lessons learned on creating indexes with alternatives leveraging NoSQL to provide multiple options for typical search scenarios.
• We are saving the repository in S3 to allow clients to quickly restore and test scenarios against the repository.

Thanks again to everyone who helped us in the benchmark, particularly Amazon Web Services, Deep Analysis, and Doculabs.

APPENDIX 1 – ECM DATABASE APPROACH

DynamoDB – Database Model for Document Management

One of the ongoing myths about DynamoDB for Document Management we hear too often is "but isn't that just for big data?" The following details the benefits of DynamoDB's big data capabilities and data model in a Document Management context compared to traditional database systems. Examples include how we built our own DynamoDB offering.

DynamoDB – Big Data versus traditional Big Database

At its core, DynamoDB provides a very robust, distributed data store that allows for powerful parallel processing with unique data storage capabilities. Understanding the difference between DynamoDB and traditional databases requires an understanding of the processes (and timing) of when they were created.

Relational databases first emerged in the 1980s, when disk speed was slow and disk space and CPU usage were at a premium. Built for critical business systems, relational databases focused on storing known data in a static data model. DBAs were extensively employed to update the model and add indexing and other performance improvements. Performance tuning reached down to the hardware level of where individual fields were stored within the disk array, a very expensive component back in the day. Most modern ECM solutions, emerging in the 1990s and 2000s, rely on the critical document management fields being stored in a relational database.

DynamoDB, like Hadoop and other "not only SQL" (NoSQL) repositories, follows the more modern approach based on the huge gains in the economics of disk space and hardware cost, and on new requirements around unstructured big data. DynamoDB allows for very quick and distributed/parallel retrieval of a specific data file that can contain data in a variety of different formats.

DynamoDB – Schema-on-Read versus Schema-on-Write

One of the big differences between DynamoDB and a traditional RDBMS is how data is organized in a schema. Traditional databases require Schema-on-Write, where the DB schema is very static and needs to be well defined before the data is loaded. The process for Schema-on-Write requires:
• Analysis of data processes and requirements
• Data modeling
• Loading/testing
• If any of the requirements change, the process has to be repeated.

Schema-on-Read focuses on a less restrictive approach that allows raw, unprocessed data to be stored immediately. How the data is used is determined when the data is read. The table below summarizes the differences:

Traditional Database (RDBMS)                                     | DynamoDB
Create static DB schema                                          | Copy data in native format
Transform data into RDBMS format                                 | Create schema and parser
Query data in RDBMS format                                       | Query data in native format
New columns must be added by a DBA before new data can be added  | New data can start flowing in at any time

Schema-on-Read provides a unique benefit for Big Data in that data can be written to DynamoDB without having to know exactly how it will be retrieved.

DynamoDB – what it means for Big Data (and big Document issues)

In a Big Data world, data needs to be captured without the requirement of knowing the structure that will hold the data. As is often mentioned, stores like DynamoDB can be used for consumer/social sites that need to store a huge amount of unstructured data quickly, with the consumption of that data coming at a later time. As a typical Big Data example, a social site stores all of the different links clicked on by a user. The storing application might store date, time and other data in a single record for the user that is updated each time the user returns. Given the user, a retrieval application can quickly access a large, semi-structured record of that particular user's activity over time.

Also, using Amazon allows leverage of S3 containers within the solution. While content could technically be stored as bytes within DynamoDB, it makes more sense to use Amazon's S3 content storage and store a simple S3 link within the DynamoDB metadata to connect the two. This allows DynamoDB to function in its intended design as a pure DB and not rely on it as both a metadata and content store.

DynamoDB for Document Management – It's not about Search

Schema-on-Write works very well for the "known unknown", or what we would typically call a document search in document management, something that Schema-on-Read does not handle as well. To illustrate, let's take a typical document management requirement: search for all documents requiring review by this date.

In an RDBMS, the date column would be joined in a query with the "required review" column to quickly provide a list of all the documents that need to be reviewed. If the performance is not acceptable, indexes could be added to the database. In the DynamoDB example, ALL of the documents in the repository would need to first be retrieved and opened; once opened, the required review and date data would be retrieved to build the list. There are no indices that could speed up performance.

After this example, a typical document management architect might conclude that DynamoDB doesn't really fit this basic requirement. What this opinion doesn't take into account is the emergence of the "search appliance", and particularly Lucene/Solr/Elasticsearch, as the default indexing/searching engine for document management.
For a better performing search in BOTH the RDBMS and DynamoDB implementations, we would recommend leveraging Lucene/Solr/Elasticsearch to index the document's full text AND metadata for superior search performance. All modern ECM/Document Management vendors (Documentum and Alfresco as examples) now leverage some type of Lucene/Solr/Elasticsearch for all search. Amazon provides both AWS-hosted SolrCloud and Elasticsearch services as an easy way to deploy a DynamoDB combination architecture. By having AWS manage and maintain the physical architecture, organizations get easier integration for the combined environments.

DynamoDB – Big Data for Document Management – Schema-on-Read Example

The major advantage of Schema-on-Read is the ability to easily store all of the document's metadata without having to define columns. Some quick examples would include:

• Versioning – One difficulty with most document management tools is that, given a structured Schema-on-Write DB model, the ability to store different attributes on different versions is not always available or requires a work-around. With DynamoDB, each version would have its own data, and different attributes could be stored with different documents.

• Audit Trail – This is one we often see being difficult to implement with one "audit" database table that ends up getting huge (and is itself a big data application). With DynamoDB, the audit trail could be maintained as a single entry on the given document row and quickly be retrieved and parsed.

• Model Updates – Many times, metadata needs to be added to a content model after a system has gone live and matures, because the content it stores matures with it. With DynamoDB, new metadata can always be written in as a new column.

DynamoDB – Building a Document Model

Object Model:

{
    objectId (key)
    objectName
    title
    modifyDate
    creationDate
    creator
    contains
    audits
    ...
    content (S3 link)
    rendition (S3 link)
}

The objects table contains a single row for each document, including:

• metadata
• content (reference to an S3 object)
• renditions (references to S3 objects)

This is a rough schema of common data that could exist when a document is added, but since DynamoDB is schema-less, the system is not bound by the model above and can evolve and change. Every document can add columns "on the fly", with no database schema updates when new metadata is to be captured. One relevant example of this approach is the case of a corporate acquisition or merging of repositories, when new documents need to be entered into the system quickly from the old company. The metadata can be dumped into the DynamoDB repository as-is, without having to worry about mapping individual fields to the existing schema.

Summary

DynamoDB has an advantage over traditional RDBMS systems in that data can be stored and retrieved quickly in an unstructured form without requiring extensive analysis of how that data will typically be retrieved. The drawback of a DynamoDB-only approach is that searches on any attributes other than the "documentId" do not perform well at scale.
We would advise that DynamoDB implementations utilize Lucene/Solr/Elasticsearch as the query component of a Document Management solution to allow highly performant searches against the repository.

APPENDIX 2 – ADDITIONAL SOURCE LINKS

• DynamoDB 11 Billion Document Benchmark – Summary of Postings
• DynamoDB – Repository Walkthrough
• DynamoDB Document & Folder Details
• DynamoDB AWS Walkthrough
• DynamoDB – Ingestion Success!!! – Lessons Learned
• Rolling Migration Approach
• OpenMigrate
• OCMS Product Overview
• TSG Hadoop Platform
• TSG DynamoDB Platform
• TSG AWS Platform
• DynamoDB Video Library

Technology Services Group, Inc
22 West Washington Street, 5th Floor
Chicago, IL 60602
inquiry@tsgrp.com
www.tsgrp.com

Readers are free to distribute this report within their own organizations, provided the Technology Services Group footer at the bottom of every page is also present.

© 2019 Technology Services Group, Inc. All Rights Reserved.