11 Billion Document
Benchmark Overview
DynamoDB, S3, Elasticsearch & AWS
July 2019
© 2019 Technology Services Group, Inc. All Rights Reserved.
CONTENTS
Executive Summary
DynamoDB ECM Background
   NoSQL Database
   Searching (DynamoDB Key)
   Searching (Elasticsearch & Key)
   AWS S3 or Glacier FileStore
Benchmark Solution
   DynamoDB AWS Benchmark Environment
   Benchmark Data
   Benchmark Use Cases
Benchmark Lessons Learned
   Phase 1 – Migration
   Phase 2 – Search Indices
   Phase 3 – Adding Documents
   Phase 4 – Load Testing
Summary
Appendix 1 – ECM Database Approach
   DynamoDB – Database Model for Document Management
   DynamoDB – Big Data versus traditional Big Database
   DynamoDB – Schema-on-Read versus Schema-on-Write
   DynamoDB – what it means for Big Data (and big Document issues)
   DynamoDB for Document Management – It’s not about Search
   DynamoDB – Big Data for Document Management – Schema-on-Read Example
   Summary
Appendix 2 – Additional Source Links
EXECUTIVE SUMMARY
In May of 2019, Technology Services Group initiated an 11 Billion Document DynamoDB benchmark,
which was completed in June 2019. With the benchmark, TSG was able to successfully
demonstrate that AWS, DynamoDB, Elasticsearch and our OpenContent, OpenAnnotate and
OpenMigrate products could scale to an unprecedented level, representing the next evolution of
enterprise content management: a Big Data, NoSQL approach for the multi-billion object repository.
This white paper details TSG's approach to next generation, large scale ECM solutions, the
benchmark activities and lessons learned.
TSG initially announced the development efforts for creating an ECM offering for DynamoDB back in
October of 2018. Based on the success of our Hadoop offering (also a NoSQL approach), developing
a solution for DynamoDB was greatly simplified, with a majority of the development and testing
efforts completed in a couple of months. Having had success with multiple on-premise Hadoop clients,
the TSG team thought an internal benchmark partnering with Amazon would show off the true
power and massive scale potential of DynamoDB and AWS. The goal was to simulate all the
components of a massively large repository to verify that AWS along with the TSG’s products and
approaches could scale for large volume, case management clients. The benchmark focused on
typical requirements for large volume clients like health and auto insurance claim repositories, but
also included accounts payable and human resource scenarios.
DYNAMODB ECM BACKGROUND
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable
performance with seamless scalability. The main advantage of DynamoDB is that it lets customers
offload any administration work of operating and scaling a distributed database to AWS. DynamoDB
customers do not need to worry about hardware provisioning, replication, software patching or
cluster scaling as AWS handles all of these functions. Amazon believes in the power and scalability of
DynamoDB so much that it has become the main database for Amazon’s daily transactions.
Amazon’s trust in DynamoDB is due to Dynamo’s high availability and durability design. DynamoDB
leverages global tables that replicate data across AWS regions so if one region’s cluster goes down,
access to the data is available from another region's table. DynamoDB also leverages the AWS cloud
to provide point-in-time recovery to any second within the last 35 days, as well as the ability to
create on-demand backups and restores as required.
Dynamo compares very favorably in these regards to Hadoop's HBase, which is also a non-relational
database that provides redundancy across clusters. One of the main differences between Hadoop
and Dynamo is the infrastructure that is required for each solution. Hadoop requires Linux servers
for installation. Customers can implement Hadoop either on-premise or move Hadoop to the cloud
on provisioned servers but need to manage the database and clusters manually or engage a vendor
like Hortonworks to manage the Hadoop cluster. DynamoDB automatically handles the
administrative tasks associated with provisioning servers but is only available in the AWS Cloud. An
enterprise that has a strict on-prem policy would not be able to use DynamoDB.
Another major difference between DynamoDB and Hadoop is that Hadoop is an Apache open-source
project while the DynamoDB APIs are controlled by Amazon. Hadoop allows users to see the source and
even contribute enhancements to it if needed, and generally the libraries will not change in
substantial ways after deployment. DynamoDB APIs on the other hand are controlled by Amazon,
and are subject to change at any point, which could impact deployments against the DB. TSG would
expect major DynamoDB API changes to be rare, but it is an important consideration when choosing
a database solution.
NoSQL Database
One of the major benefits for DynamoDB customers is a "not only SQL" approach, often referred to
as NoSQL. DynamoDB is a distributed, versioned, non-relational database in the same family as
systems modeled after Google's Bigtable: A Distributed Storage System for Structured Data.
Since the 1990s, all ECM vendors have leveraged some type of database under the covers of their
architecture to manage the metadata, relationships and other data components of documents.
Metadata and attributes include title, file location, author, security, relationships, annotations, folders
and all other data associated with the document. Traditionally, legacy ECM solutions would require
Oracle or MS SQL Server, while newer solutions, like Alfresco, would leverage MySQL. In considering
“What’s next” for our ECM customers, TSG realized that NoSQL approaches could provide
tremendous benefits. See our initial post from 2015 on leveraging a Big Data approach for
Document Management as well as our Whitepaper from analyst Alan Pelz-Sharpe.
Some of the unique and modern features of DynamoDB or Hadoop versus traditional relational
databases include:
• Limited Database/Docbase Administration – Built for a "big data" approach, DynamoDB allows the database to adapt as new data is presented, rather than the traditional approach of "call the DBA to add a column or an index". Users should think of it as a "tagging" structure rather than a traditional relational database model with a strict schema. Tagging inherently fits into a content management framework, as we are always tagging documents with metadata. For an ECM example, if storing an invoice document, DynamoDB can receive all of the attributes as consistent column families, and DynamoDB will take care of storing all the descriptors and values. If at a later time the next invoice has a new value not associated with the old invoices, DynamoDB can easily append the value to just those documents (see the sketch following this list).
• Limited Back-up/Recovery – Also as a "big data" approach, DynamoDB provides a scalable, redundant model that can be leveraged across multiple AWS Regions with automatic redundancy/clustering. One of the big issues with the typical ECM relational database approach has always been coordinating the back-up of the relational database with the file store. DynamoDB removes that requirement as well as simplifies setting up a clustered environment.
• Scaling and Ingestion – As a schema-less approach, DynamoDB or Hadoop provide the ability to quickly ingest objects, much like a SAN, rather than a database that needs to place different data in different columns and make sure they all tie back together before committing the data. This was a critical component of the benchmark, as the goals were to ingest more than 1 billion documents a day, scaling processing up to 20,000 documents per second.
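To make the invoice example above concrete, here is a minimal sketch using the AWS boto3 SDK for Python. The table and attribute names are our own illustrations, not a prescribed model: two invoices with different attribute sets land in the same DynamoDB table with no schema change.

import boto3

# Hypothetical table name; any existing DynamoDB table with an
# "objectId" partition key would behave the same way.
table = boto3.resource("dynamodb").Table("documents")

# First invoice: a baseline set of attributes.
table.put_item(Item={"objectId": "inv-001", "vendor": "Acme", "amount": "1200.00"})

# Second invoice: carries a brand-new attribute -- no DBA, no ALTER TABLE.
table.put_item(Item={"objectId": "inv-002", "vendor": "Acme", "amount": "980.00",
                     "discountCode": "EARLY-PAY"})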
Searching (DynamoDB Key)
DynamoDB’s redundancy/clustering approach requires a different approach when it comes to
searching. Retrieving metadata from DynamoDB is available but works differently than in a relational
database, as DynamoDB will farm the request out (again, a big-data approach) to multiple servers
and compile the results. While this approach is acceptable for most big-data analysis, it isn't always
acceptable for large repositories and complex searches. Search performance is always a key
requirement of any successful ECM implementation.
DynamoDB does provide one indexed key for quickly retrieving objects. Innovative clients can
leverage this key if searching requirements consistently fall into single-key retrieval. For example, for
many of our insurance claims clients, the key to the case folder is typically the claim number.
For many high-volume environments, Case Management represents the vast majority of how users
access a document or a case of multiple documents. Typically, a Case ID or similar ID (example: claim
number, vendor number…) can be used to uniquely identify the folder/case. In these applications, it
is important to leverage the proper architecture to allow for fast access to the case without requiring
the use of the search/index server. By leveraging a smart key design, many of our case management
clients can access the case without the Solr/Elasticsearch infrastructure at all.
When a document can have a Case ID in the object model, NoSQL can make use of a design pattern
that prepends the Case ID to the beginning of the document ID. When a user then requests the
documents in a case, both HBase and DynamoDB design best practices dictate this key pattern,
which allows a lightning-fast range scan to quickly bring back all of the documents for a particular
case. This operation is significantly faster in a large repository, with the added benefit of not needing
to leverage the search index at all. We have found that scanning the database directly via these
patterns offers predictably fast access to view all documents in a particular claim.
Clients with a case management use case can make use of this design pattern to avoid having to
create a Solr or Elasticsearch index at all, which in our experience can be difficult and expensive to
maintain for multi-billion document repositories. Many of our case management clients run in
production without an index server at all. As we found in our indexing benchmark, the infrastructure
costs alone for an index of this size can be multiple times more expensive than the NoSQL database
infrastructure, so we recommend that clients take this approach for case management.
Below is an example of how an HBase/DynamoDB table can efficiently get to a case's documents
when the case number is known:
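As a minimal sketch (assuming an illustrative table name and claim-number format, not TSG's actual schema), the Case ID serves as the partition key and the document ID as the sort key, so a single keyed Query returns every document in a case without touching a search index:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("documents")  # hypothetical table name

# Stored items are keyed by case, e.g.:
#   {"caseId": "CLAIM-1000042", "docId": "0001", "objectName": "claim_form.pdf"}
#   {"caseId": "CLAIM-1000042", "docId": "0002", "objectName": "police_report.pdf"}
#   {"caseId": "CLAIM-1000043", "docId": "0001", ...}   <- next claim
resp = table.query(KeyConditionExpression=Key("caseId").eq("CLAIM-1000042"))
for item in resp["Items"]:
    print(item["docId"], item["objectName"])

In HBase, the equivalent is a single row key of the form "<caseId>_<docId>" scanned with a start/stop row prefix; DynamoDB's partition/sort key split achieves the same contiguous range read.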
Searching (Elasticsearch & Key)
TSG recommends leveraging Elasticsearch/Solr/Lucene as both the metadata and full-text search
engine when required. Similar to how legacy ECM systems use a relational database to store
attributes, DynamoDB retrieval would be used for system-of-record requests (e.g., what are the
attributes of this document?). Anytime a search against metadata is needed, Solr/Lucene would be
used to find the documents.
Typical scenarios for searching fall into two basic patterns:
Retrieval Pattern 1: Search
Typically, search users will know a few of the attributes of the document (title, date, author) and will
run various searches against the Solr/Elasticsearch index to find the documents. Solr and
Elasticsearch are the perfect tools for quickly searching through the index and returning the IDs and
metadata for each document that fits the criteria. Once the user has found the document they would
like to work with, the ID of that document is passed along to HBase/DynamoDB in order to retrieve
the content for the user to view, edit, annotate or perform another document action.
For a typical ECM deployment with high search requirements, the Solr/Elasticsearch infrastructure
can be scaled to meet the needs of the system.
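As a hedged sketch of this pattern (endpoint, index and attribute names are assumptions, not the benchmark's actual configuration), Elasticsearch returns the matching IDs and DynamoDB then serves the system-of-record item:

import boto3
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")             # assumed endpoint
table = boto3.resource("dynamodb").Table("documents")   # assumed table name

# Step 1: search the index on known attributes (title, date, author...).
hits = es.search(index="documents",
                 body={"query": {"match": {"title": "roof damage estimate"}}})

# Step 2: pass each hit's ID to DynamoDB for the authoritative record.
for hit in hits["hits"]["hits"]:
    item = table.get_item(Key={"objectId": hit["_id"]}).get("Item")
    if item:
        print(item["objectId"], item.get("content"))  # "content" holds the S3 link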
Search Pattern 2: Analytics
If the ability to perform deep analytics on the attributes and full-text content of documents is a
requirement, we typically recommend separate indexes in Solr/Elasticsearch targeted for the
specific use cases that the data scientists are requesting. TSG no longer recommends one massive
index for all of the attributes and full-text content for all documents in the entire repository as it can
be problematic, especially if this is the same index that is being used by end users. See our thoughts
on creating separate indexes with Solr this post.
AWS also has various index services integrated into its platform. While TSG used Elasticsearch for
the DynamoDB search index, this implementation could be moved over to the Amazon CloudSearch
service in the future – the benefit being AWS managing the index in much the same way as it
manages DynamoDB.
AWS S3 or Glacier FileStore
DynamoDB is unique in that it is not priced solely on storage size, but also on read and write "units"
into the DB. This pushes the solution away from storing the physical contents of
documents within DynamoDB and toward leveraging AWS storage solutions (S3 and Glacier) instead.
These services can replace another traditionally costly component of the ECM architecture
stack: the Storage Area Network, or SAN.
DynamoDB has the ability to set a Time-To-Live (TTL) on items stored within the DB. If data
storage pricing becomes a concern, customers can configure a TTL for all main content and have it
linked back to a normal S3 bucket. Once the TTL in the DB is hit, the content could be archived in
Glacier and the DB metadata offloaded, if needed, to a lower-cost DB store.
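Configuring TTL is a one-time table setting plus an expiry attribute on each item. A minimal sketch, assuming a table named "documents" and an epoch-seconds attribute named "expireAt" (both names are ours):

import time
import boto3

client = boto3.client("dynamodb")

# Tell DynamoDB which numeric attribute holds the expiry timestamp.
client.update_time_to_live(
    TableName="documents",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expireAt"},
)

# Items written with expireAt are deleted automatically after that time;
# an archival job (or a DynamoDB Streams consumer) could move content to Glacier.
one_year = 365 * 24 * 60 * 60
boto3.resource("dynamodb").Table("documents").put_item(Item={
    "objectId": "inv-001",
    "content": "s3://bucket/inv-001.pdf",   # illustrative S3 link
    "expireAt": int(time.time()) + one_year,
})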
For our benchmark, we relied on all document content being stored in S3 with links in
DynamoDB. As the benchmark focused on the metadata repository, the test measures the ability of
Dynamo to store a high volume of data while maintaining performance. The benchmark reuses S3
documents to simulate a real repository and assumes that the content files had been migrated to S3
ahead of the migration activities with AWS Snowball or other file migration approaches.
BENCHMARK SOLUTION
The goal of the 11 Billion Document Benchmark was to simulate all the components of a massively
large repository to verify that TSG’s tools and approaches could scale. One truly powerful
component of DynamoDB and other NoSQL approaches is how quickly TSG can build a massive
repository leveraging AWS hardware scaling. When compared with other ECM repositories based on
legacy relational database technologies, DynamoDB’s “schema on read” approach can perform
ingestion at unbelievable rates of speed.
Rather than pursue a benchmark built on a “do everything” approach, the benchmark focused on
capabilities required for large volume clients like health and insurance claim repositories. With large
volume clients, “do everything” document management solutions can add significant overhead and
performance issues for massive repositories.
Some examples of the large volume approach include:
• Case Security rather than repository security – The case management approach assumes that an external system is handling the security of which case is available for access by which user. A majority of insurance claim clients prefer this approach to improve performance and reduce bloat and the added system requirements of the ECM system. While TSG's DynamoDB solution does have ACL and document security, it was not enabled for the benchmark.
• Case Search rather than document search – Similar to the security, typical case or claim scenarios allow for searching for a case or claim and then, once in the claim, searching through the documents. The benchmark's initial phase included an all-repository case search based on AWS Elasticsearch and did not initially include an "all repository document" search. Document searching leveraging Lambda-based Elasticsearch index population from DynamoDB was included for certain scenarios.
• DynamoDB scaling but not S3 scaling – The benchmark targeted 11 billion unique document objects in DynamoDB pointing to content in S3. The benchmark shared S3 links to actual files in S3 and assumed that the content files had been migrated to S3 ahead of the migration activities with AWS Snowball or other file migration approaches.
• TSG's Product Set – The benchmark included migration activities leveraging TSG's OpenMigrate as well as case viewing and updating with TSG's OpenContent Management Suite, including OpenAnnotate, consistent with current client access needs.
• High volume input – To complete the benchmark, TSG estimated 12,000 documents migrated per second, with 1 billion documents migrated per day. In preliminary tests, TSG was able to increase throughput to 20,000 documents per second.
• Quick viewing of documents – For viewing, TSG targeted sub-second viewing of case folder contents as well as document viewing. TSG supported both viewing and annotating documents as well as document combining capabilities.
DynamoDB AWS Benchmark Environment
Components of the AWS testing environment included:
• DynamoDB AWS Managed Service – 26,000 write units, minimal read units
• Elasticsearch AWS Managed Service – 8 data nodes – r5.4xlarge.elasticsearch (100 GiB EBS gp2), 3 master nodes – r5.xlarge.elasticsearch
• OpenMigrate – 2 EC2 m5.24xlarge instances (380 GB Java heap – 96 CPUs)
• OpenContent Management Suite and OpenAnnotate – EC2 t2.medium instance (4 GB RAM – 2 CPUs)
Post large-scale migration, TSG significantly reduced (or fully retired) the OpenMigrate servers.
Additionally, DynamoDB write and read units were reconfigured to be more in line with daily
usage.
Benchmark Data
One of the biggest issues in creating a representative benchmark is providing realistic data at
volume. As with most benchmarks, TSG relied on sample/test data. Any successful test needs to
make sure that the documents and folders created in the large repository are sufficiently different
to be a real-world example of how the repository would function at scale.
For sample data, TSG leveraged Open Addresses to create example case folders for each
address. TSG targeted example case folders for Accounts Payable, Auto Claim, Health Care Claim
and Human Resources, with different types of documents for each type of case.
Each address had the four case folders attached, with the folder and document names modified
with part of the address to keep each document and folder unique. OpenMigrate's multi-threaded
capabilities allowed TSG to set up a massive number of threads to read through the
addresses and create the folders and documents in DynamoDB.
Benchmark Use Cases
One thing that guided the benchmark efforts from the beginning was getting input from outside
counsel, including Alan Pelz-Sharpe of Deep Analysis as well as Rich Medina of Doculabs. Early on,
Alan stated that, while multiple companies had done billion document benchmarks in the past, his
concern as an analyst was "Great – you can hold a billion documents in the repository, but what can
you do with those documents?" Once the migration activities were complete, these are the different
scenarios we planned on supporting:
• Search for Case Folder – Users are able to search across the entire repository for any folder.
• View Case Documents – Users are able to view a listing of all documents or videos in the folder.
• Document Actions – Users are able to view and update properties and annotate both documents and video.
• Folder Actions – Users are able to view all documents with user preferences for attribute order and column sorting, combine to PDF and combine into ZIP.
• Related Documents and Folders – Users are able to see related folders.
• Search for Documents – Leveraging TSG's Elasticsearch index for all documents, users are able to search for documents based on attributes across the repository.
• Add Documents – Users are able to add documents via a variety of supported methods (bulk upload, drag and drop, scanner, traditional file browser).
• Load Testing – Concurrent user test of 11,000 threads performing standard document management actions including search, annotate and adding documents.
BENCHMARK LESSONS LEARNED
Phase 1 - Migration
Over the course of seven days, TSG was able to successfully ingest 11 billion documents and create
almost 1 billion different folder objects. Before starting the benchmark, TSG set a realistic goal for
ingestion – to be able to move 1 billion documents a day into DynamoDB.
Lesson – Iterate, iterate, iterate
The migrations started small and over multiple iterations worked towards that goal. In order to hit a
billion documents in 24 hours, OpenMigrate would need to move about 12,000 documents a
second.
Our iterations looked something like:
• 1 OM on a t2.medium with default thread settings
  o 510 docs/sec – 8,000 docs moved
• 1 OM on a t2.medium with OM thread performance tweaks
  o 553 docs/sec – 8,000 docs moved
• 1 OM on an m5.24xlarge (96 CPUs) with OM thread performance tweaks
  o 3,447 docs/sec – 8,000 docs moved
• 1 OM on an m5.24xlarge (96 CPUs) with OM thread performance tweaks
  o 3,389 docs/sec – 70,000,000 docs moved
• (Continue to steadily iterate and increase documents moved until we hit the final test run before kicking off the benchmark)
• 2 OM on m5.24xlarge with OM thread performance tweaks and Elasticsearch indexing performance updates
  o 22,367 docs/sec – 530,000,000 docs moved
Lesson – Reduce Bloat in the Metadata
Metadata is always important for document management solutions, and while TSG wanted enough
metadata that the benchmark was realistic, it was important to reduce the bloat of the
metadata and take advantage of the admin capabilities to map metadata to the folder level. TSG
would recommend clients critically look at their metadata models and see if they can identify any
metadata bloat that could negatively impact performance.
Lesson – Challenge Assumptions Regarding Performance
The benchmark focused on using TSG's product, OpenMigrate, for the ingestion/migration of the 11
billion documents. While TSG has done plenty of large migrations with OpenMigrate before, OM had
always been constrained by the performance of the ECM repository and underlying database. In
initial planning, TSG calculated that 10 OpenMigrate instances running concurrently would be needed
to hit 12,000 documents per second. However, once deployed against DynamoDB rather than a
traditional SQL repository, just 2 OpenMigrate instances running on AWS (EC2 m5.24xlarge – 380 GB
Java heap, 96 CPUs) were able to exceed that goal at 20,000 documents/second. Minimal tweaks to
out-of-the-box OpenMigrate enabled TSG to complete this benchmark with only two instances of
OpenMigrate instead of ten. Sometimes less really is more.
Phase 2 – Search Indices
One of the key focus areas of the benchmark for the document index was departmental search
versus a complete repository search. TSG recommends pushing for multiple,
efficient search indices rather than one large index, allowing content services to perform quicker with
less complexity. With the success, capabilities and cost of robust search tools like Solr and
Elasticsearch, TSG has been recommending for years that clients create focused indices rather than
one large "do all" index meant to serve every possible purpose.
For the benchmark, TSG focused on two indices:
• Large Folder Index – All folders would have an index for basic search. Typically, in a case management scenario, large clients are looking to navigate to the folder to look at the documents in the case. Getting access to the case folder and documents quickly is a key requirement.
• Focused Document Index – A more focused index for a subset of documents. These indices could be created when needed for specific scenarios or requirements.
While TSG felt very confident that an Elasticsearch index for all the documents in the repository
would be possible, especially with sharding capabilities, the index services from AWS come at a
higher price than just DynamoDB storage. The main focus was to prove out the indexing capabilities
on a smaller scale while providing a realistic approach for anyone to consider when building and
maintaining Elasticsearch indices.
Lesson – Create Indices Upon Ingestion
As part of the 11 billion document ingestion, TSG leveraged OpenMigrate to create an Elasticsearch
index for every folder in the repository, finishing with 925,837,980 folders in the Elasticsearch index.
TSG created this index as part of the OpenMigrate ingestion process that was storing 21,000 objects
per second. Statistics from the folder indexing benchmark include:
• 925,837,980 folders stored in DynamoDB and indexed in Elasticsearch
• 6 data nodes and 3 master Elasticsearch servers to maintain the index
• Objects (documents and folders) ingested per second – 21,000
• Indexing time for 925,837,980 folders – 159 hours
• Elasticsearch Index Size – 372 Gigabytes – DynamoDB Repository Size – 5.32 Terabytes
Lesson – Create Indices for Specific Business Needs
For the document index, TSG wanted to show a scenario where an index would be created for a
specific purpose. The decision was made to index 1 million documents from the perspective of an
accounts payable scenario as a feasible test. This is consistent with client experience, where clients
have said "I only need to index the last X months for X." Statistics from the document indexing
benchmark are different from the DynamoDB effort, as the content was already in Dynamo (and
needed to be retrieved to be indexed):
• 1 million documents already existing in DynamoDB
• Documents indexed per second – 501
• Indexing time – 33 minutes 14 seconds
• Elasticsearch Index Increase – 4 Gigabytes – Total Size 376 Gigabytes
This effort scanned through existing DynamoDB content, and if the object type of the content was
an AP invoice document, it indexed the content until 1 million documents were found.
OpenMigrate was able to index the 1 million documents after the DynamoDB ingestion process was
complete in just under 35 minutes, or roughly 500 documents per second. Unlike the main ingestion
process, where there were only 7 servers, we added more processing power to the Elasticsearch
cluster to get up to 9 servers. Additional servers could have been added to further reduce the
indexing time and improve throughput.
Lesson – AWS Lambda
TSG initially hoped to use AWS Lambda for indexing from DynamoDB to Elasticsearch.
Unfortunately, initial attempts revealed:
• AWS Lambda had a 5 minute timeout for execution, making the large volume difficult to index.
• TSG considered indexing from DynamoDB Streams, but stream records are only available for 24 hours, so they didn't fit the scenario of building a purpose-built index.
TSG decided to leverage OpenMigrate to both read and index documents into Elasticsearch, to be
consistent and provide the same infrastructure for both initial ingestion and the creation of
new indices in the future.
Lesson – Scaling Elasticsearch versus DynamoDB Is Drastically Different
TSG found the current pricing of Elasticsearch versus DynamoDB to be very different, with
Elasticsearch costing more due to the size of the cores needed to support large ingestion and indices.
In TSG's benchmark, DynamoDB stored around 13 times the data that Elasticsearch did, yet
Elasticsearch cost about 1.3x more than DynamoDB over the course of the benchmark.
Unlike DynamoDB where the infrastructure could scale up for ingestion and then drop read/write
units once the large migration was complete, Elasticsearch requires servers to be maintained and
operational for both ingestion and later access. DynamoDB read/write units are priced and
maintained very differently than Elasticsearch EC2 instances.
Phase 3 – Adding Documents
When setting up the benchmark, TSG specifically chose to separate the first large-scale ingestion
phase (11 billion documents – almost 1 billion folders) from the third phase of users adding
documents to folders. This approach is consistent with how many clients roll out TSG's interfaces
when replacing a legacy solution. Many of TSG's clients have chosen to expose the OpenContent
interfaces on their existing content after a large or rolling migration before allowing users to add
documents to the new repository.
One of the key discussions for this phase focused on how to add/retrieve content from a folder. A
key requirement was viewing case documents and allowing users to be able to “view a listing of all
documents or videos in the folder”.
Lesson – Don’t Assume One Solution Will Work for Every Scenario
As part of the benchmark, the team tested two approaches for displaying the contents of a folder.
• JSON storage of document objects – The folder DynamoDB object contains all of the document IDs in a repeating field. This allows for fast viewing of the objects in the folder, a typical requirement for case management/folder viewing. Benefits included fast, scalable access to folder objects without a large Elasticsearch index.
• Elasticsearch for documents – During the first ingestion phase, Elasticsearch was only being used for access to folder objects. Phase 2 of the benchmark indexed part of the 11 billion documents to test leveraging Elasticsearch for displaying objects contained in a folder.
After testing, the team determined that the JSON object store made the most sense given the size of
the sample set, but that both alternatives make sense for customers depending on the number of
documents in a folder. TSG has one client with 65,000 documents in a folder, so the ability to
leverage Elasticsearch for certain client scenarios is required.
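As an illustration of the JSON-storage approach (table and attribute names are assumptions, not TSG's actual model), the folder item carries its children's IDs in a repeating attribute, so a single keyed read lists the folder without consulting Elasticsearch:

import boto3

table = boto3.resource("dynamodb").Table("folders")  # hypothetical table name

# Illustrative folder object; "contains" is the repeating field of
# document IDs that drives the folder listing.
table.put_item(Item={
    "objectId": "case-CLAIM-1000042",
    "objectName": "CLAIM-1000042",
    "contains": ["doc-0001", "doc-0002", "doc-0003"],
})

# One GetItem returns the full listing, regardless of index infrastructure.
folder = table.get_item(Key={"objectId": "case-CLAIM-1000042"})["Item"]
for doc_id in folder["contains"]:
    print(doc_id)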
Phase 4 – Load Testing
Goals of the concurrent user test were to replicate some of the different issues we have seen from
clients in production when a large number of document management users are accessing the
system. Components that we specifically wanted to test include:
• Application servers, to reveal bottlenecks or choke points
• DynamoDB performance, to detect and remediate any scanning of the table
• Document searching performance with the Elasticsearch service
• Document retrieval and viewing performance under stress
• Document annotations
The benchmark test was patterned after the most common use case for our insurance clients –
Claim Viewing. The claim viewing scenario was scripted as a user opening a claim folder with 25
documents followed by viewing 3 to 5 of the documents within our OpenAnnotate viewer. This
simulates a user accessing the document management system directly from an insurance claim
system. For the test we ran a batch of 11,000 users performing the claim viewing scenario across a
selection of 20,000 different medical and auto claims out of the total repository of 11 billion
documents. The Claim Viewing scenario tested the Application servers and the DynamoDB table
under stress.
Searching
Not all users executed a search in the Claim Viewing scenario. The users who require the
ability to search the repository to locate claim information are approximately 25% of the entire
universe of users; often these are supervisors, management, or administrators. During the
performance test we kept the script simple and executed a search on name and address for each
user, which taxed the Elasticsearch cluster beyond what we originally targeted for testing.
Annotations
The final use case we implemented was to annotate one of the documents viewed during the Claim
Viewing test. Approximately 10% of documents are annotated and for the test we planned on
annotating a document for 1 of every 10 users. One issue we discovered is that while it was simple
to add an annotation, resetting the data and deleting the annotations between each test run was
out of scope for what we wanted to accomplish in this benchmark. We elected to postpone the
performance testing of the annotations until an update was made to allow overwrites and deletes
for the annotations.
Testing 11 Thousand Concurrent Users – Benchmark Testing with AWS
Leveraging Amazon Web Services, we started with a moderate sizing of the application servers,
DynamoDB read units, and Elasticsearch cluster. Before we scaled the environment up or down, we
ran initial tests to set a baseline.
Test Runs
We started with simple baseline test runs and, as we encountered and resolved issues, expanded
the architecture to use two JMeter instances and two OCMS instances. When troubleshooting
several test runs we simplified the process and used only one JMeter and one OCMS instance.
Run | Users  | JMeter Instance (#) (vCPU / memory) | OCMS Instance (#) (vCPU / memory) | DynamoDB Units (min read / write) | Elasticsearch Data Nodes (#) – 2 AZs
0,1 | 100    | t2.micro (1) (1 / 1)                | t2.medium (1) (2 / 4)             | 100 / 50                          | r5.4xlarge (6)
2   | 2,000  | m5a.4xlarge (1) (16 / 64)           | t2.medium (1) (2 / 4)             | 5000 / 50 and 7000 / 50           | r5.large (6)
3   | 2,000  | m5a.4xlarge (2) (16 / 64)           | r5a.12xlarge (2) (48 / 384)       | 2000 / 50 (auto scale)            | r5.12xlarge (6)
4   | 5,500  | m5a.4xlarge (1) (16 / 64)           | r5a.12xlarge (1) (48 / 384)       | 2000 / 50 (auto scale)            | r5.12xlarge (6)
5   | 5,500  | m5a.12xlarge (1) (48 / 192)         | r5a.12xlarge (1) (48 / 384)       | 2000 / 50 (auto scale)            | c5.18xlarge (6)
6   | 11,000 | m5a.12xlarge (2) (48 / 192)         | r5a.4xlarge (2) (16 / 128)        | 2000 / 50 (auto scale)            | c5.18xlarge (6)
7   | 11,000 | m5a.2xlarge (2) (8 / 32)            | r5a.4xlarge (2) (16 / 128)        | 2000 / 50 (auto scale)            | r5.4xlarge (6)
When running the initial small baseline tests and the subsequent 2,000, 5,500, and 11,000 user tests,
we identified several issues, bottlenecks and choke points in each tier of the environment.
JMeter Instance relevant issues included:
• Unviewable characters in the test dataset CSV files caused bad searches and logins. Resolved by changing file encoding settings in JMeter.
• Encountered timeout issues and maxed-out CPU. Tested ramp-up periods and task timers to determine the impact on the OCMS instance. Increased the JMeter instance's CPU and raised limits in JMeter to resolve the issue.
• A generic JVM out-of-memory message was resolved by increasing the Linux process limit.
• The script did not properly handle MIME types. Updated the script to handle MIME types correctly.
OpenContent Management Suite Issues included:
• Encountered thread timeouts. Modified the JVM memory settings and increased the Tomcat threads.
• Out of memory caused by an issue with garbage collection. Modified to use the G1 GC and updated the heap settings.
• Linux OS error – too many open files. Resolved by increasing the limit.
• Encountered HTTP non-response messages and timeouts; increasing the Apache httpd threads resolved the issue.
• A generic JVM out-of-memory message was resolved by increasing the Linux process limit.
• The transformation queue for document viewing maxed out the server CPU. Updated the script to include fewer transformations; we would recommend leveraging external transformation servers.
DynamoDB issues included:
• Searching with a bad value for an object ID caused a table scan and then blocked the remaining threads. Updated code to validate the ID and prevent the scan.
Elasticsearch issues included:
• Searches with unviewable/hidden characters caused non-terminating search errors and maxed out CPU and memory. The test was updated to exclude unviewable characters.
• Executing searches concurrently for each of the 11,000 users across the 5 shards in the cluster maxed out the Elasticsearch cluster. Modified the test to use a more reasonable percentage of users performing search versus direct access to a folder (the typical use case).
Each of the points above was resolved and retested in subsequent test runs.
One bottleneck we experienced that does represent a real-world scenario was maxing out the
transformation queue on the OCMS server. For clients with a large volume of users, the
transformation queue and process are moved off the OCMS server and scaled out separately. TSG
can provide reference architectures for clients implementing OCMS with a large number of users.
Creating the Test Data
The test data for each of the runs was pulled directly from the Elasticsearch cluster using a Python
script. Details for ten thousand auto claim and ten thousand medical claim folders were selected
and split across the 11,000 users who were generated in the system for testing. The JMeter test
scripts read two CSV files, one for users and one for claim folders and properties.
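A hedged sketch of that kind of pull (endpoint, index, field and file names are assumptions, not the actual benchmark script): query claim folders in Elasticsearch and write the CSV that the JMeter test plan reads.

import csv
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

with open("claim_folders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    resp = es.search(index="folders",
                     body={"size": 10000,
                           "query": {"term": {"caseType": "auto_claim"}}})
    for hit in resp["hits"]["hits"]:
        src = hit["_source"]
        writer.writerow([hit["_id"], src.get("objectName"), src.get("address")])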
Creating the Test Scripts
Defining realistic test scenarios seemed fairly simple at the outset but required more iteration
than expected. We started by using BlazeMeter to record a sample set of actions we wanted to take
in OCMS. The recording was exported to JMX and then edited to better mimic entry to OCMS from a
claim system and to accept dynamic parameters from the CSV test data files. In order to make the test
plan realistic and dynamic, we added HTTP response extractors and logical conditional statements
to link the task steps for searching, viewing, and annotating together.
Lessons Learned – Concurrent User Testing
While the Claim Viewing scenario on its surface is a well-defined set of REST endpoints, adding in the
automation to process responses and conditionally execute statements took a few days longer than
we expected and required more iterative baseline testing loops than originally planned.
The more interesting troubleshooting issues occurred when we scaled up to two thousand users. At
that point we observed DynamoDB scans where we had not seen them at smaller scale. We
cautiously disabled pieces of the test script and added several debuggers. It took several
iterations to reduce the noise in the testing logs, by increasing thread limits and open file limits, until
we reached the issue of unviewable data in the CSV test files. Even after setting JMeter to UTF-8
encoding, the first line of the CSV file might start with an unviewable character (likely a byte-order
mark). We added a conditional statement in JMeter to avoid executing any statements containing the
character. Alternatively, a more sophisticated means of stripping the character from the file could
have been implemented with a pre-processor or other script. This issue was due to our address data
file, and we would not anticipate it in a normal client environment.
Once we ensured only good data was being fed into the OCMS test script and we had resolved
the thread, process, and open file limits, we scaled up to 5,500 and 11,000 users. While other smaller
issues with large numbers of users surfaced, the team was confident that they could be resolved for
a production client and ended the testing.
SUMMARY
In TSG's "Cranking it up to Eleven" Billion Document Benchmark, TSG was able to prove the
scalability and benefits of both Amazon and a NoSQL approach with DynamoDB over traditional
document management solutions based on relational database approaches.
During the benchmark, we received some feedback asking "why 11 billion documents and 11
thousand concurrent users?" In deciding on the size of the benchmark, we wanted to exceed the
numbers we have seen at clients (7 billion for one prospect) by a large margin. Compared to some of
the older billion document benchmarks conducted by ECM software vendors in the past, this
benchmark tested all of the scenarios required by our large volume clients.
Combined with our rolling migration approach, TSG now has an extensive amount of experience and
solutions to move large clients to alternative solutions with our products and people. Some
examples of ways the benchmark is currently influencing our current clients include:
• As a test harness, TSG is proposing leveraging the migration approach and test data to quickly scale clients' repositories up to production volume before moving real data, in order to test infrastructure and performance.
• In our designs, we are able to combine lessons learned on creating indexes with alternatives leveraging NoSQL to provide multiple options for typical search scenarios.
• We are saving the repository in S3 to allow clients to quickly restore it and test scenarios against the repository.
Thanks again to everyone who helped us with the benchmark, particularly Amazon Web Services,
Deep Analysis, and Doculabs.
APPENDIX 1 – ECM DATABASE APPROACH
DynamoDB – Database Model for Document
Management
One of the ongoing myths about DynamoDB for Document Management that we hear too often is "but
isn't that just for big data?" The following details the benefits of DynamoDB's big data capabilities and
data model in a Document Management context compared to traditional database
systems. Examples include how we are currently building our own DynamoDB offering.
DynamoDB – Big Data versus traditional Big Database
At its core, DynamoDB provides a very robust, distributed data store that allows for powerful
parallel processing with unique data storage capabilities. Understanding the difference
between DynamoDB and traditional databases requires an understanding of the context (and
timing) in which each was created.
Relational databases first emerged in the 1980s, when disk speed was slow and disk space and CPU
usage were at a premium. Built for critical business systems, relational databases focused on
storing known data in a static data model. DBAs were extensively employed to update
the model and add indexing and other performance improvements. Performance tuning went down to
the hardware level of where individual fields were stored within the disk array, a very expensive
component back in the day. Emerging in the 1990s and 2000s, most modern ECM solutions rely on
the critical document management fields being stored in a relational database.
DynamoDB, like Hadoop and other "not only SQL" (NoSQL) repositories, follows a more modern
approach based on the huge gains in the economics of disk space and hardware cost, and on new
requirements for unstructured big data. DynamoDB allows for very quick, distributed/parallel
retrieval of a specific data record that can contain data in a variety of different formats.
DynamoDB – Schema-on-Read versus Schema-on-Write
One of the big differences between DynamoDB and a traditional RDBMS is how data is organized in a
schema. Traditional databases require Schema-on-Write, where the DB schema is very static and
needs to be well-defined before the data is loaded. The process for Schema-on-Write requires:
• Analysis of data processes and requirements
• Data modeling
• Loading/testing
• If any of the requirements change, repeating the process.
Schema-on-Read takes a less restrictive approach that allows raw, unprocessed data to be stored
immediately. How the data is used is determined when the data is read. The table below summarizes
the differences:
Traditional Database (RDBMS)                                    | DynamoDB
Create static DB schema                                         | Copy data in native format
Transform data into RDBMS                                       | Create schema and parser
Query data in RDBMS format                                      | Query data in native format
New columns must be added by a DBA before new data can be added | New data can start flowing in at any time
Schema-on-Read provides a unique benefit for Big Data in that data can be written to DynamoDB
without having to know exactly how it will be retrieved.
DynamoDB – what it means for Big Data (and big
Document issues)
In a Big Data world, data needs to be captured without requiring knowledge of the structure that will
hold it. As is often mentioned, tools like DynamoDB can be used for consumer/social sites
that need to store a huge amount of unstructured data quickly, with the consumption of that data
coming at a later time.
As a typical Big Data example, a social site stores all of the different links clicked by a user. The
storing application might store the date, time and other data in a single record for the user that is
updated each time the user returns. Given the user, a retrieval application can quickly access a
large, semi-structured record of that particular user's activity over time.
Also, using Amazon allows the solution to leverage S3 buckets. While content could
technically be stored as bytes within DynamoDB, it makes more sense to use Amazon's S3
content storage and store a simple S3 link within the DynamoDB metadata to connect the two. This
allows DynamoDB to function in its intended design as a pure DB rather than serving as both a
metadata and content store.
DynamoDB for Document Management – It’s not about
Search
Schema-on-Write works very well for "known unknowns", or what we would typically call a document
search in document management, something that Schema-on-Read does not handle as well. To
illustrate, let's take a typical document management requirement: search for all
documents requiring review by a given date.
In the RDBMS example, the date column would be joined in a query with the "required review" column
to quickly provide a list of all the documents that need to be reviewed. If the
performance is not acceptable, indexes could be added to the database.
In the DynamoDB example, ALL of the documents in the repository would first need to be retrieved
and opened; once opened, the required review and date data would be retrieved to build the list. There
are no indices that could speed up performance.
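The sketch below illustrates the problem (attribute names are ours): even with a filter expression, a DynamoDB Scan reads every item in the table and only trims the results afterwards, so no index can rescue it.

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("documents")  # hypothetical table name

# Reads the ENTIRE table (paginated), then filters -- cost grows with
# repository size, not with result size.
resp = table.scan(
    FilterExpression=Attr("requiredReview").eq(True) &
                     Attr("reviewDate").lte("2019-07-01")
)
documents_to_review = resp["Items"]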
After this example, a typical document management architect might conclude that DynamoDB
doesn't really fit this basic requirement. What this opinion doesn't take into account is the
emergence of the "search appliance", and particularly Lucene/Solr/Elasticsearch, as the default
indexing/searching engine for document management.
For a better performing search in BOTH the RDBMS and DynamoDB implementations, we would
recommend leveraging Lucene/Solr/Elasticsearch to index the document's full text AND metadata
for superior search performance. All modern ECM/Document Management vendors (Documentum
and Alfresco as examples) now leverage some type of Lucene/Solr/Elasticsearch for search.
Amazon provides both AWS-hosted SolrCloud and Elasticsearch services as an easy way to deploy
a DynamoDB combination architecture. By having AWS manage and maintain the physical
architecture, organizations get easier integration across the combined environments.
DynamoDB – Big Data for Document Management –
Schema-on-Read Example
The major advantage of Schema-on-Read is the ability to easily store all of a document's metadata
without having to define columns. Some quick examples would include:
• Versioning – One difficulty with most document management tools is that, given a structured Schema-on-Write DB model, the ability to store different attributes on different versions is not always available or requires a work-around. With DynamoDB, each version has its own data, and different attributes can be stored with different documents.
• Audit Trail – This is one we often see being difficult to implement with a single "audit" database table that ends up getting huge (and is itself a big data application). With DynamoDB, the audit trail can be maintained within the given document's row and quickly retrieved and parsed; see the sketch after this list.
• Model Updates – Many times, metadata needs to be added to a content model after a system has gone live and matures, because the content it stores matures with it. With DynamoDB, new metadata can always be written in as a new column.
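A minimal sketch of the audit-trail idea (table and attribute names are assumed): each audit entry is appended to the document's own row rather than to a separate, ever-growing audit table.

import boto3

table = boto3.resource("dynamodb").Table("documents")  # hypothetical table name

table.update_item(
    Key={"objectId": "inv-001"},
    # Append to the "audits" list, creating it on first use.
    UpdateExpression="SET audits = list_append(if_not_exists(audits, :empty), :entry)",
    ExpressionAttributeValues={
        ":empty": [],
        ":entry": [{"user": "jsmith", "action": "viewed", "date": "2019-06-14"}],
    },
)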
DynamoDB – Building a Document Model
Object Model:

{
   objectId (key)
   objectName
   title
   modifyDate
   creationDate
   creator
   contains
   audits
   ...
   content (S3 link)
   rendition (S3 link)
}
The Objects table contains a single row for each document, including:
• metadata
• content (reference to an S3 object)
• renditions (references to S3 objects)
This is a rough schema of common data that could exist when a document is added, but since
DynamoDB is schema-less, the system is not bound by the model above and can evolve and
change. Every document can add columns "on the fly", with no database schema updates when new
metadata needs to be captured. One relevant example of this approach is the case of a corporate
acquisition or merging of repositories, when new documents need to be quickly entered into the
system from the old company. The metadata can be dumped into the DynamoDB repository as-is,
without having to worry about mapping individual fields to the existing schema.
Summary
DynamoDB has an advantage over traditional RDBMS systems in that data can be stored and
retrieved quickly in an unstructured form without requiring extensive analysis of how that data will
typically be retrieved. The drawback of a DynamoDB-only approach is that searches on any
attribute other than the key ("documentId") do not perform well at scale. We would advise that
DynamoDB implementations utilize Lucene/Solr/Elasticsearch as the query component of a
Document Management solution to allow highly performant searches against the repository.
APPENDIX 2 – ADDITIONAL SOURCE LINKS
• DynamoDB 11 Billion Document Benchmark – Summary of Postings
• DynamoDB – Repository Walkthrough
• DynamoDB Document & Folder Details
• DynamoDB AWS Walkthrough
• DynamoDB – Ingestion Success!!! – Lessons Learned
• Rolling Migration Approach
• OpenMigrate
• OCMS Product Overview
• TSG Hadoop Platform
• TSG DynamoDB Platform
• TSG AWS Platform
• DynamoDB Video Library
Technology Services Group, Inc
22 West Washington Street, 5th Floor
Chicago, IL 60602
inquiry@tsgrp.com
www.tsgrp.com
Readers are free to distribute this report within their own organizations, provided the
Technology Services Group footer at the bottom of every page is also present.
© 2019 Technology Services Group, Inc. All Rights Reserved.