Wix Media Service Fortifies Its Disaster Recovery Cluster on Google Cloud Platform
Editor’s Note: This article is written by guest author Eugene Olshenbaum. Eugene is the Head of Media Services at
Wix, a cloud-based web development platform that makes it easy for everyone to create beautiful websites.
About Wix
“Web Creation Made Simple”
Wix provides technology that makes it simple for everyone to create a stunning, professional, and functional web
presence. It is a cloud-based web development platform that allows users to create HTML5 websites and mobile
sites through the use of a powerful online editor. Wix was founded in 2006, has 36 million registered users from all
around the world, and is growing rapidly.
Overview
This technical case study describes how Wix replicated the services running in their managed data
centers to Google Cloud Platform to increase availability and improve disaster recovery. Taking advantage of the
features provided by App Engine, Compute Engine, and Cloud Storage, Wix was able to complete the migration in a
very short time. This case study also includes key lessons learned during the migration process.
The Challenge
As a website building and serving platform, we want to provide close to 100% uptime for data serving and protect
our user data against loss. We originally ran our service in a single managed hosting environment. To improve
disaster recovery, we added a second one and ran both in active/active mode. Later, we added a third
data center to run our services in 3x active/active mode.
However, we learned that maintaining three cross-data center replicas was much more complex than managing
two, especially with the data centers hosted by different ISPs for redundancy. One of the challenges in 3x
active/active mode was database replication. To replicate across three data centers, we had to configure our
MySQL in a ring topology. The ring would break whenever one data center went down for a long time or failed
completely.
To address this, instead of implementing 3x active/active mode with our current infrastructure, we decided to run
in 2x active/active mode, with the third replica running on an entirely different technology platform. The third
replica also added protection against data poisoning—when a faulty piece of code unintentionally corrupts data
and remains undetected for some time.
Solving the Challenge
We decided to build a full replica of our service on Google Cloud Platform and port all system components to work
efficiently on Google Cloud. This replica would be capable of acting as the primary, with no dependencies on
the Wix servers in our managed data centers.
We chose Google Cloud Platform for the following reasons:
Ease of Management
- App Engine eliminates the need for system management.
- App Engine charges only for actual usage, while we have to estimate the required computing resources in our managed hosting environment.
Scalability
- App Engine automatically scales according to the volume of requests.
- Datastore scales with our growth.
- Cloud Storage scales with our storage demands.
Speed of Development
- App Engine provides all the technology building blocks for application development. One example is Task Queues, which we use heavily.
- The SPDY protocol is built in. App Engine applications automatically use SPDY when accessed over SSL by a browser that supports it.
Our system uses the following Cloud Platform components:
App Engine
We built the following two applications on App Engine:
- A replication supervisor that controls application-level replication between the Wix data centers and Google Cloud.
- An application server that manages the metadata and provides client API access.
Google Cloud Storage
We use Cloud Storage to store our static media files.
Google Compute Engine
We do all content manipulations, such as image cropping and resizing, on the fly. Although App Engine provides an
Images API for image manipulation, we require image processing capabilities such as noise reduction, sharpening,
and image filters, which the API does not provide. We have highly optimized code written in C for this purpose.
We developed a set of servers on Compute Engine that handle image manipulation using this code. To optimize
performance, we use a high-memory instance type and perform all the I/O on a RAM disk. This is the only
component whose scaling and health we have to manage ourselves.
Figure 1 shows the high-level architecture of our serving system on Google Cloud Platform:
Figure 1: High-level architecture of the Wix media serving system
Next, we had to choose a database. Our existing application was built on top of MySQL. MySQL has scaling issues
and lacks features we require, such as full-text search, support for arbitrary attributes (image files have a
completely different set of attributes than audio or PDF files), and paging based on multiple criteria. Instead of
continuing development with MySQL, we decided to use the App Engine Datastore, since its schemaless design fits
our loosely structured data better and it could scale with our growth.
Since we wanted to keep the Wix servers in our managed data center and Google Cloud running at the same time,
we had to develop bidirectional synchronization between the systems at the application level.
Replication Process
To synchronize data between the Wix managed hosting and Google Cloud Platform systems, we had to build a
replication supervisor. In our first attempt, we replicated the metadata and static files synchronously as they were
uploaded to our current system (Figure 2).
Figure 2: First version of the replication supervisor
However, replication often failed with a DeadlineExceeded error. App Engine Frontend Instances are designed to
respond to a request within 60 seconds. Our initial design included too many operations in a single request, and
their execution depended on various external factors such as file size and network jitter.
We learned that, in general, App Engine Frontend handlers should be as small as possible and involve at most one
or two Datastore requests. If a handler is more complex, it should be split and chained into smaller asynchronous
blocks.
Task Queue was the perfect match for our revised design. Tasks are executed asynchronously and, instead of 60
seconds, have a 10-minute deadline. Since the time it takes to copy media files varies greatly from request to
request, we used task queues to make file replication an asynchronous process. After the metadata is replicated
and saved in the App Engine Datastore, our application adds a task to the task queue. The task is then pushed to
another handler that handles the copying. If copying is successful, a new task is added to the task queue to notify
the initiator that the transaction is complete. If copying fails, the system retries until the task completes
successfully. Figure 3 shows the revised process.
Figure 3: Final design of the replication supervisor
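As a rough sketch of this chaining pattern, here is what the two handlers might look like with the Python App Engine APIs. The entity model, handler URLs, and the copy_to_cloud_storage helper are illustrative stand-ins, not Wix's actual code:

import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

def copy_to_cloud_storage(source_url, file_id):
    # Stand-in for the app-specific copy logic (e.g. fetch the source
    # file from the Wix data center and write it to Cloud Storage).
    raise NotImplementedError

class MediaMeta(ndb.Model):
    # Hypothetical metadata entity; the real schema is more involved.
    file_id = ndb.StringProperty()
    source_url = ndb.StringProperty()
    replicated = ndb.BooleanProperty(default=False)

class ReplicateHandler(webapp2.RequestHandler):
    """Frontend handler: do the bare minimum, then return quickly."""
    def post(self):
        meta = MediaMeta(file_id=self.request.get('file_id'),
                         source_url=self.request.get('source_url'))
        meta.put()
        # Hand the slow file copy to a push queue. Tasks get a
        # 10-minute deadline and are retried automatically on failure.
        taskqueue.add(url='/tasks/copy', params={'key': meta.key.urlsafe()})

class CopyWorker(webapp2.RequestHandler):
    """Task handler: copy one file, then chain the notification task."""
    def post(self):
        meta = ndb.Key(urlsafe=self.request.get('key')).get()
        # An exception here produces a non-200 response, which makes
        # the task queue retry the task until it succeeds.
        copy_to_cloud_storage(meta.source_url, meta.file_id)
        meta.replicated = True
        meta.put()
        taskqueue.add(url='/tasks/notify', params={'key': meta.key.urlsafe()})

app = webapp2.WSGIApplication([
    ('/replicate', ReplicateHandler),
    ('/tasks/copy', CopyWorker),
])

Keeping each handler down to one or two Datastore operations plus a taskqueue.add call is exactly what keeps every step well inside its deadline.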
Datastore
Consistency
The App Engine Datastore is designed for scalable web applications. It manages scaling, availability, and replication
across multiple data centers automatically. When data is replicated, there is a delay from the time a Datastore
write is committed until the change becomes visible in all data centers. If an application has to wait for replication
to complete (and data to be consistent across all data centers) before reading the data, its throughput can be
affected.

- If a process does not require the latest value of a Datastore entry, eventual consistency suffices, and the process can read whichever value is available without waiting. This improves throughput, and the latest data will eventually be available.
- If a process requires the latest value of a Datastore entry, strong consistency is required, and the process must wait until the latest change is applied across the board. This ensures consistency but may affect throughput.
A good example is a blog application. A reader's own comment on a blog post should be strongly consistent; that is,
the reader should see their comment immediately after posting it. On the other hand, comments posted by other
users can be eventually consistent; that is, the reader just needs to see them eventually.
To balance data throughput and consistency, developers can choose when eventual data consistency is sufficient,
and when they require strong consistency for their application.
Coming from MySQL, this was a new concept we had to grasp.
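A minimal sketch of the distinction with the Python ndb API (the Comment model is illustrative): a read by key is strongly consistent, while a global query is only eventually consistent.

from google.appengine.ext import ndb

class Comment(ndb.Model):
    # Illustrative model for the blog example above.
    text = ndb.StringProperty()
    posted = ndb.DateTimeProperty(auto_now_add=True)

# Strongly consistent: a get by key always returns the latest
# committed value, so the reader sees their own comment.
own_key = Comment(text='Great post!').put()
own = own_key.get()

# Eventually consistent: a global query may lag behind recent
# writes, which is fine for other readers' comments.
recent = Comment.query().order(-Comment.posted).fetch(20)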
Entity Groups
Data objects in the App Engine Datastore are called entities. Entities can be arranged in parent-child relationships
to form entity groups. An entity group is a set of tightly linked entities over which the Datastore can provide strong
consistency and transactionality. Strong consistency is achieved by limiting a query to a single entity group.
For our application, we need to keep track of users and the media they upload for their websites. Querying the
user data has to be strongly consistent, so we created an entity group for each user and their media information.
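A sketch of that arrangement in ndb (User and MediaItem are hypothetical stand-ins for our actual entities):

from google.appengine.ext import ndb

class User(ndb.Model):
    name = ndb.StringProperty()

class MediaItem(ndb.Model):
    file_name = ndb.StringProperty()
    # Datastore is schemaless, so images, audio, and PDFs can carry
    # completely different attribute sets.
    attributes = ndb.JsonProperty()

user_key = ndb.Key(User, 'user-123')

# Creating the media item with the user as its parent places both
# entities in the same entity group.
MediaItem(parent=user_key, file_name='logo.png',
          attributes={'width': 640, 'height': 480}).put()

# An ancestor query is confined to one entity group and is
# therefore strongly consistent.
media = MediaItem.query(ancestor=user_key).fetch()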
Memcache
Datastore limits the frequency of writes to an entity group in order to provide strong consistency. Since our
application has a high write rate, we ran into concurrent modification exceptions. To reduce write contention, we
decided to use memcache to manage user data updates. Changes are stored in memcache and pushed to the
Datastore once a minute using deferred tasks. We understood that we might lose data if the memcache was
flushed, but our application is designed specifically with that in mind, and no user data or files are actually lost.
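A simplified sketch of that buffering pattern follows. It ignores concurrent access to the memcache entry, which a real implementation would handle with memcache's compare-and-set support; record_update, flush_updates, and apply_to_datastore are hypothetical helpers:

from google.appengine.api import memcache
from google.appengine.ext import deferred

def record_update(user_id, update):
    # Buffer the change in memcache instead of writing to the user's
    # entity group directly, avoiding write contention.
    key = 'pending:%s' % user_id
    pending = memcache.get(key) or []
    pending.append(update)
    memcache.set(key, pending)

def flush_updates(user_id):
    # Runs as a deferred task roughly once a minute.
    key = 'pending:%s' % user_id
    pending = memcache.get(key) or []
    if pending:
        memcache.delete(key)
        apply_to_datastore(user_id, pending)
    # If memcache was flushed in between, pending is simply empty;
    # the application is designed to tolerate that.
    deferred.defer(flush_updates, user_id, _countdown=60)

def apply_to_datastore(user_id, updates):
    # Stand-in for the single batch write to the user's entity group.
    raise NotImplementedError

Batching a minute's worth of changes into one Datastore write keeps the application well under the per-entity-group write limit.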
Conclusion
By using Google Cloud Platform, we were able to rewrite our Java-MySQL application as a Python App Engine
application that was ready for integration tests in just two weeks.
App Engine provides a rich set of APIs and managed services that freed us from the chore of writing scalability and
fault-tolerance code. We were able to solve our business problem quickly and seamlessly.
We have started serving our production media traffic from Compute Engine as a primary server, and performance
has been impressively good and stable.