MediaCloud™ VM
Latest Revision: August 9, 2012
Contents
Overview
Requirements
Supported Input Formats
Quick Start
Operation Modes
Instance Parameters
    For all operation modes
    For the localFiles mode
    For all cloudModes
        For cloudMode=folder
        For cloudMode=queue
        For cloudMode=workflow
Run Modes
Queue and Workflow Support
    Queue
    Queue and Workflow Common Message Format
        Inbound Message Format
        Reply Message Format
    Heartbeat
    EC2 Instance Type Guidance
Logs
Sample Code
    Simple Workflow Service
    Simple Queue Service
Application Patterns
MPlayer2 GPL Compliance
Overview
The MediaCloud™ VM provides RAMP’s speech-to-text (STX) capabilities within the Amazon Web
Services (AWS) framework, allowing you to process input audio/video files and receive XML recognition
output. Single server and distributed workload modes of operation are provided, along with the ability
to process either a folder of files or requests received via AWS Simple Workflow Service or via AWS
Simple Queue Service. Scenarios such as prioritizing content for processing, having certain instances
only process high priority content, etc. are fully supported.
MediaCloud™ VM is available as an Amazon Machine Image (AMI) through the AWS Marketplace,
allowing you to charge usage to your AWS account. There is one AMI for telephony audio (audio with
actual signal up to 4kHz).
Requirements
The RAMP STX XML output format is not documented here. You should refer to the separate
MediaCloud™ Integration Guide for that information.
You will need an Amazon Web Services account and should be familiar with at least the Elastic Compute
Cloud (EC2) and Simple Storage Service (S3).
Supported Input Formats
mplayer2 is used to convert various audio/video file formats. If a generally available build of mplayer or
mplayer2 as of March 2012 works with your file, there is a high likelihood that stxInstance will work as
well.
For a list of mplayer’s supported input formats, see: http://www.mplayerhq.hu/design7/info.html.
Quick Start
This QuickStart example demonstrates starting an instance in cloudQueue mode, which takes
recognition requests via Amazon Simple Queue Service (SQS), along with an example Amazon Simple
Storage Service (S3) folder watcher that submits the requests to SQS. The standard US English
Conversational Telephony model will be used.
1. Have or create an S3 bucket for testing. Note your S3 access key and secret.
2. Create a folder.
3. Sign up for the ___<name>__ DevPay product by going to this URL:
https://portal.aws.amazon.com/gp/aws/user/subscription/index.html?ie=UTF8&offeringCode=B6566F4D
4. Go to the AWS Marketplace page for the current ___<name>___ AMI: ____<url of page to be
created after I get description info from Joe>______
5. Launch the AMI on a c1.medium instance.
6. Establish two SSH connections to the running AMI.
7. In the first SSH connection, run the following command to start up an stxInstance in cloudQueue
mode:
    cd ~/stxInstance; ./stxInstance -cloudCredentials <your S3 key>:<your S3 secret> -cloudMode queue
8. In the second SSH connection, run the following command to start up the sample S3 folder
watcher example:
    cd ~/stxInstance; ./stxInstance -runQueueExample -watch <your S3 key>:<your S3 secret> en-us/telephony/lm/conversational <your S3 bucket> <S3 input folder>/ <S3 output folder>/ ""
Note: The final empty "" argument is required. It normally can be a queue name prefix, but a blank
prefix is acceptable.
9. Place several test audio .wav files into the input folder. (You can also use most common audio
or video file formats supported by mplayer2.) The screenshot below shows a 3rd party S3 GUI
tool (CyberDuck).
10. You will see the watch folder window scan the input folder periodically, pick up the new files,
and queue them for recognition. Status and completed messages will also be printed out.
11. Finally, the output XML files will be placed into the output folder.
Operation Modes
The QuickStart example demonstrates running MediaCloud™ VM in the cloudQueue mode of operation.
stxInstance supports three other modes of operation for a total of four different ways you can integrate
stxInstance into your processing:
1. localFiles
   • Recommended primarily for testing purposes and small jobs only.
   • One model and recognition run mode per input file.
   • Local folder paths are specified for inputs. Outputs are written to the same location as
     inputs with .xml appended. You must copy your input and output files via scp, etc.
2. cloudFolder using S3
   • Recommended primarily for testing purposes and small jobs only.
   • One model and recognition run mode per run.
   • S3 paths are specified for inputs and outputs. You must copy your input and output files
     to/from S3, etc. stxInstance will scan the specified S3 path upon startup and will process
     all files found.
3. cloudWorkflow using SimpleWorkflowService and S3
   • Recommended for distributed workflows involving a single model.
   • Multiple models and recognition run modes are supported.
     o Does not support “model affinity” to reduce the overhead of switching models per
       recognition request.
   • S3 paths are specified for inputs and outputs. You must copy your input and output files
     to/from S3, etc.
   • No support for priorities.
   • Benefit of simpler coding if you are using SimpleWorkflowService to manage your
     processing, as compared to cloudQueue mode, which requires coding specific to
     stxInstance.
   Note: Coding to SimpleWorkflowService is not trivial. If you are focused on a one-step
   cloud workflow involving just stxInstance, you can probably get started more quickly
   using cloudQueue mode.
4. cloudQueue using SimpleQueueService and S3
   • Recommended for distributed workflows involving multiple models and/or multiple
     priorities.
   • Multiple models and recognition run modes are supported.
     o Supports “model affinity” per worker: a worker will first favor recognition
       requests using the same model and will switch models only if no request is
       available using the same model (at the same or higher priority).
   • S3 paths are specified for inputs and outputs. You must copy your input and output files
     to/from S3, etc.
   • Support for 000 through 999 priority levels, with lower numbers always being favored.
   • Requires more stxInstance-specific coding as compared to cloudWorkflow
     mode, which requires mostly standard AWS SimpleWorkflowService coding.
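The priority and model affinity rules described for cloudQueue mode can be sketched as a small selection function. This is only an illustration of the stated rules, not stxInstance's actual code: the PendingQueue type and the pick_next_queue function are hypothetical names, and the sketch assumes the worker already knows which queues currently hold requests.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PendingQueue:
    """A request queue that currently holds at least one request (hypothetical type)."""
    priority: int  # 0..999, lower number = higher priority
    model: str     # e.g. "en-us/telephony/lm/conversational"


def pick_next_queue(pending: Sequence[PendingQueue],
                    loaded_model: Optional[str]) -> Optional[PendingQueue]:
    """Pick the queue a worker should read from next.

    Priority always wins. Among queues at the best available priority, the
    queue for the model the worker already has loaded is preferred.
    """
    if not pending:
        return None
    # Tuple ordering: lowest priority number first, then "same model"
    # (False sorts before True, so a matching model wins the tie).
    return min(pending, key=lambda q: (q.priority, q.model != loaded_model))
```

With queues at priorities 200 (model a), 200 (model b) and 100 (model c), a worker that has model b loaded would still take the priority 100 request first, and would stay on model b only when choosing between the two priority 200 queues.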
Instance Parameters
Instance parameters may be specified on the command line passed to stxInstance or, if the first
argument is “-ec2AutoStart”, EC2 launch user data will be used. To set your arguments as user data,
use a CGI style string with &param=<value if any>. In the event of multiple values, use
&param.1=<value 1>&param.2=<value 2>. For example:
localFiles.1=en-us/broadcast/lm/general&localFiles.2=foo.mp3
is equivalent to:
-localFiles en-us/broadcast/lm/general foo.mp3
The system refers to this as SimpleArgs format.
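As an illustration of the SimpleArgs format, the following sketch builds a user data string from parameter names and values. The function name to_simple_args is hypothetical. Two points are assumptions rather than documented behavior: a parameter with no value is written as "name=", and values are emitted without percent-encoding (the example above leaves "/" as-is).

```python
from typing import Mapping, Sequence


def to_simple_args(params: Mapping[str, Sequence[str]]) -> str:
    """Encode {name: [values]} as a SimpleArgs-style string."""
    parts = []
    for name, values in params.items():
        if len(values) == 0:
            # Assumption: a flag without a value is written as "name=".
            parts.append(f"{name}=")
        elif len(values) == 1:
            parts.append(f"{name}={values[0]}")
        else:
            # Multiple values are numbered from 1: name.1, name.2, ...
            parts.extend(f"{name}.{i}={v}" for i, v in enumerate(values, start=1))
    return "&".join(parts)


# Reproduces the example from the text above.
print(to_simple_args({"localFiles": ["en-us/broadcast/lm/general", "foo.mp3"]}))
```

The resulting string is what you would paste into the EC2 launch user data when starting stxInstance with -ec2AutoStart.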
For all operation modes:
Minimally, either -localFiles or -cloudMode <mode> must be given to specify the primary operations
mode.
NAME            TYPE     DEFAULT      REQUIRED?  NOTES
listModelsOnly  string                No         Lists the installed models and exits.
numWorkers      integer  1            No         The number of processing workers/threads.
instanceAlias   string   stxInstance  No         A friendly string that appears in status screens,
                                                 heartbeat messages, etc. The system always adds
                                                 @<hostname> when using this string.
disableRamDisk  integer  0            No         Disables using a RAM disk for temp files. Normally,
                                                 stxInstance will use a RAM disk only if /dev/shm
                                                 exists, has enough free space for numWorkers*500MB,
                                                 and there is enough system memory for
                                                 numWorkers*2.5GB + space for the RAM disk. This is a
                                                 conservative assumption if the files being
                                                 recognized at any given point in time average to
                                                 less than an hour and if you do not send HD
                                                 resolution video through.
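The RAM disk rule in the disableRamDisk notes can be worked through as a small check. This is a sketch of the stated arithmetic only, not stxInstance's actual logic. The function name is hypothetical, decimal units (1 MB = 1,000,000 bytes) are an assumption, and "space for RAM disk" is taken to mean the same numWorkers*500MB figure.

```python
MB = 1000 ** 2  # assumption: decimal units; the guide does not say MB vs. MiB
GB = 1000 ** 3


def ram_disk_would_be_used(num_workers: int,
                           shm_exists: bool,
                           shm_free_bytes: int,
                           system_memory_bytes: int,
                           disable_ram_disk: bool = False) -> bool:
    """Apply the documented RAM disk conditions to the given measurements."""
    if disable_ram_disk or not shm_exists:
        return False
    ram_disk_bytes = num_workers * 500 * MB
    # 2.5 GB per worker, plus the space taken by the RAM disk itself.
    needed_memory = num_workers * 2500 * MB + ram_disk_bytes
    return shm_free_bytes >= ram_disk_bytes and system_memory_bytes >= needed_memory
```

For example, two workers need 1 GB free in /dev/shm and 6 GB of system memory under these assumptions.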
For the localFiles mode:
NAME        TYPE    DEFAULT  REQUIRED?  NOTES
localFiles  string           Yes        <model> <file path to run in local files mode>
                                        <model>[:<run mode>] <next filepath> <next
                                        model>[:<next run mode>]…
                                        Output will be written to the same place + .xml
For all cloudModes:
NAME              TYPE    DEFAULT  REQUIRED?  NOTES
cloudCredentials  string           Yes        <key>:<secret>
cloudMode         string           Yes        queue|workflow|folder
For cloudMode=folder:
NAME                      TYPE    DEFAULT  REQUIRED?  NOTES
cloudModel                string           Yes        <model>[:<run mode>]
cloudStorageLocation      string           Yes        S3 bucket
cloudStorageInputPrefix   string           Yes        All input files must be here.
cloudStorageOutputPrefix  string           Yes        Output files will be this + input tail (i.e.,
                                                      less input prefix) + .xml. For example,
                                                      /inputs/foo.wav produces /outputs/foo.wav.xml.
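The output naming rule for cloudMode=folder (output prefix + input tail + .xml) can be expressed as a short function. This is a minimal sketch of the rule as documented; the function name output_key is hypothetical.

```python
def output_key(input_key: str, input_prefix: str, output_prefix: str) -> str:
    """Map an input path to its output path: output prefix + input tail + .xml."""
    if not input_key.startswith(input_prefix):
        raise ValueError(f"{input_key!r} is not under {input_prefix!r}")
    tail = input_key[len(input_prefix):]  # the input path less the input prefix
    return f"{output_prefix}{tail}.xml"


# The example from the table above.
print(output_key("/inputs/foo.wav", "/inputs/", "/outputs/"))
```

Note that the original extension is kept, so foo.wav becomes foo.wav.xml rather than foo.xml.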
For cloudMode=queue:
NAME                      TYPE     DEFAULT  REQUIRED?  NOTES
inboundQueuePrefix        string   empty    No
idleQueuePollingInterval  integer  60       No         The polling interval for the queue when the
                                                       queue is empty (in seconds).
priorityMinValue          integer  999      No         Lower numbers = higher priority.
                                                       Cannot be >999.
priorityMaxValue          integer  0        No         Cannot be <0.
modelAllowed              string            No         Can have multiple entries. If no entries,
                                                       all models are allowed.
For cloudMode=workflow:
NAME                   TYPE    DEFAULT                       REQUIRED?  NOTES
cloudWorkflowEndpoint  string  https://swf.amazonaws.com     No*        * As of the date of this documentation,
                                                                        because SWF is still in beta, this value
                                                                        must be specified as
                                                                        https://swf.us-east-1.amazonaws.com
cloudWorkflowDomain    string  RAMPSTX                       No         The registered domain for the workflow
                                                                        to connect to.
cloudWorkflowTaskList  string  stxInstanceRecognizeTaskList  No         The task queue from which to process
                                                                        STX requests.
Note that the system only cares about the workflowTaskList when retrieving jobs to process. The
particular value of the activity type and version will be passed through to ActivityCompleted,
ActivityHeartbeat, etc., but the system does not do anything special with that value. Instead, the input
string specifies the model, run mode, etc.
Supporting multiple workers per instance (numWorkers) is useful primarily to ease the overhead of
management (fewer instances, less need to clone workers per AMI). Strictly speaking, there is not a
great CPU efficiency gain over running more single-worker instances.
It is also useful when running a single instance on a cloud folder or on a set of local files if you are using
multiple models (as it will attempt to optimize model reallocations).
Run Modes
Model names follow the form:
<language>/<channel, e.g. broadcast>/lm/<domain>[/rscr]
For example:
en-us/telephony/lm/conversational
Note: Currently only the telephony model is available. Additional models may be introduced in the future.
Adding /rscr activates rescoring with acoustic cross-word models. Rescoring is computationally
relatively light (~ 10-20% overhead), but is memory intensive.
In addition, there is a run mode associated with each recognition request. The “standard” run mode
(the default, if omitted) uses the native model parameters. Presently the only additional run mode
defined is “fast”, which uses very aggressive pruning parameters, resulting in a 2X-4X speedup. Use of
“fast” is recommended when a large volume of data needs to be processed for data analytics purposes,
when a rough recognition is needed for text-based timeline alignment, and for similar tasks. It
is not recommended for normal recognition.
Combining run mode “fast” with rescoring models will likely result in a system somewhere between
“fast” and “standard” in terms of speed and accuracy but will require the full memory footprint
dominated by rescoring models.
Though rescoring models could in theory be shared among workers running the same model, the current
release does not attempt to do so.
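To make the <model>[:<run mode>] notation used in the parameter tables concrete, here is a minimal parsing sketch. The names ModelSpec and parse_model_spec are hypothetical, and the sketch assumes that rescoring is detected purely by a trailing /rscr on the model name, as described above.

```python
from typing import NamedTuple

RUN_MODES = ("standard", "fast")


class ModelSpec(NamedTuple):
    model: str       # full model name, including /rscr if present
    run_mode: str    # "standard" or "fast"
    rescoring: bool  # True if the model name ends with /rscr


def parse_model_spec(spec: str) -> ModelSpec:
    """Split '<model>[:<run mode>]' and apply the documented default run mode."""
    model, sep, run_mode = spec.partition(":")
    if not sep:
        run_mode = "standard"  # default when the run mode is omitted
    if run_mode not in RUN_MODES:
        raise ValueError(f"unknown run mode: {run_mode!r}")
    return ModelSpec(model, run_mode, model.endswith("/rscr"))


print(parse_model_spec("en-us/telephony/lm/conversational/rscr:fast"))
```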
Queue and Workflow Support
Queue
The system will poll from SQS queues whose names match the following convention:
<prefix>RAMPSTX-<3 digit priority>-<model name encoded>
The model name is encoded by substituting “__” for “/”. (“/” is an illegal character for AWS SQS queue
names.) Use of <prefix> allows you to set up independent pools for stxInstances.
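The naming convention can be illustrated with a pair of helper functions. This is a sketch of the convention as documented; the function names are hypothetical, and the sketch assumes the priority is always zero-padded to three digits.

```python
from typing import Tuple


def queue_name(prefix: str, priority: int, model: str) -> str:
    """Build '<prefix>RAMPSTX-<3 digit priority>-<model name encoded>'."""
    if not 0 <= priority <= 999:
        raise ValueError("priority must be between 0 and 999")
    encoded = model.replace("/", "__")  # "/" is not allowed in SQS queue names
    return f"{prefix}RAMPSTX-{priority:03d}-{encoded}"


def parse_queue_name(name: str, prefix: str = "") -> Tuple[int, str]:
    """Recover (priority, model name) from a queue name built as above."""
    head = f"{prefix}RAMPSTX-"
    if not name.startswith(head):
        raise ValueError(f"{name!r} does not match prefix {prefix!r}")
    priority, _, encoded = name[len(head):].partition("-")
    return int(priority), encoded.replace("__", "/")


print(queue_name("", 200, "en-us/telephony/lm/conversational"))
```

With an empty prefix and priority 200, the conversational telephony model maps to RAMPSTX-200-en-us__telephony__lm__conversational.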
Note: AWS SQS ListQueues, which stxInstance uses, will list at most 1000 queues per call, so having more than 999
combinations of priority+model name will result in higher queue listing overhead as the system will make
multiple calls. (For simplicity, the system makes 10 calls, one for each tranche of 100 priority levels. If any
tranche holds more than 999 queues, it is broken down into 10 smaller tranches.)
When a message is pulled from a queue, the visibility timeout is set to 5 minutes beyond the heartbeat
interval (which defaults to 5 minutes). This is reset approximately every 5 minutes if there is a live
heartbeat. The default queue visibility timeout should be something more than a few seconds.
Queue and Workflow Common Message Format
Arguments are CGI-encoded in SimpleArgs format, in the message body for Queue and in the input string
for Workflow.
Inbound Message Format
NAME                  TYPE     DEFAULT   REQUIRED?  NOTES
guid                  integer            Yes
cloudStorageLocation  string             Yes        S3 bucket for both input and output
inputFile             string             Yes        S3 path to input file
outputFile            string             Yes        S3 path to output file
model                 string             Yes        Should match queue when using cloudMode=queue
runMode               string   standard  No         standard | fast
sendHeartbeat         string   false     No         false | true
                                                    Assumed true for Workflow
heartbeatTimeout      integer            Yes        In seconds. If sendHeartbeat=true (assumed for
                                                    Workflow), a heartbeat is sent every
                                                    maximum(5, heartbeatTimeout-5) minutes. This
                                                    allows you to set a less frequent heartbeat
                                                    interval. Because of potentially long model
                                                    loading times, setting a heartbeat timeout under
                                                    10 minutes is not recommended (the system will
                                                    never send heartbeats more than once every
                                                    5 minutes).
replyQueue            string             Yes        Queue URL.
                                                    Not relevant for Workflow
startNotBefore        integer            No         Not relevant for Workflow. Reserved for future
                                                    implementation of delay queues
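As an illustration, a request message for cloudMode=queue could be assembled as below. This is a sketch under stated assumptions: the function name and all example values are hypothetical, the field order is arbitrary, and values are written without percent-encoding, which this guide does not specify either way.

```python
def build_request_message(guid: int,
                          bucket: str,
                          input_file: str,
                          output_file: str,
                          model: str,
                          reply_queue_url: str,
                          heartbeat_timeout_s: int,
                          run_mode: str = "standard",
                          send_heartbeat: bool = False) -> str:
    """Build a SimpleArgs-style request body from the inbound message fields."""
    fields = [
        ("guid", str(guid)),
        ("cloudStorageLocation", bucket),
        ("inputFile", input_file),
        ("outputFile", output_file),
        ("model", model),  # should match the queue the message is sent on
        ("runMode", run_mode),
        ("sendHeartbeat", "true" if send_heartbeat else "false"),
        ("heartbeatTimeout", str(heartbeat_timeout_s)),
        ("replyQueue", reply_queue_url),
    ]
    # Assumption: values are joined as-is, with no percent-encoding.
    return "&".join(f"{name}={value}" for name, value in fields)


body = build_request_message(
    guid=1,
    bucket="my-bucket",                      # hypothetical bucket name
    input_file="inputs/foo.wav",
    output_file="outputs/foo.wav.xml",
    model="en-us/telephony/lm/conversational",
    reply_queue_url="https://example.invalid/my-reply-queue",  # placeholder URL
    heartbeat_timeout_s=600,
)
print(body)
```

The resulting string would be sent as the SQS message body on the request queue for the chosen model and priority.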
Reply Message Format
For Queues, a reply message is sent. For Workflow, RecordActivityTaskHeartbeat,
RespondActivityTaskCompleted, or RespondActivityTaskFailed is called with an explanatory message, if
appropriate.
NAME                  TYPE     DEFAULT  REQUIRED?  NOTES
guid                  integer           Yes
cloudStorageLocation  string            Yes        S3 bucket for both input and output
inputFile             string            Yes        S3 path to input file
outputFile            string            Yes        S3 path to output file
status                string            Yes        One of:
                                                   Started (not relevant for Workflow)
                                                   Retrying: non-final failure
                                                   Failed (not relevant for Workflow)
                                                   Heartbeat (if requested)
                                                   Completed (not relevant for Workflow)
instanceAlias         string            Yes        The instanceAlias that was set at stxInstance
                                                   startup.
startedTime           integer           Yes        Seconds since Java epoch
messageTime           integer           Yes        Seconds since Java epoch
completedTime         integer           Yes        Seconds since Java epoch (if Completed)
infoMessage           string            Yes        Any useful info, such as failure reason.
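A consumer of the reply queue needs to decode each message and decide whether the request is finished. The sketch below shows one way to do that. It is illustrative only: the function names are hypothetical, the exact status strings are assumed to be the words listed in the table, and no percent-decoding is applied because this guide does not say whether values are escaped.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches indexed names such as "localFiles.2".
_INDEXED = re.compile(r"^(?P<name>.+)\.(?P<index>\d+)$")

# Assumption: Completed and Failed end a request; Retrying is a non-final failure.
FINAL_STATUSES = {"Completed", "Failed"}


def parse_simple_args(body: str) -> Dict[str, List[str]]:
    """Decode 'a=1&b.1=x&b.2=y' into {'a': ['1'], 'b': ['x', 'y']}."""
    values: Dict[str, List[str]] = defaultdict(list)
    for pair in body.split("&"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        match = _INDEXED.match(key)
        name = match.group("name") if match else key
        values[name].append(value)  # indexed values are assumed to arrive in order
    return dict(values)


def is_final(reply: Dict[str, List[str]]) -> bool:
    """True if the reply reports a terminal status for the request."""
    return reply["status"][0] in FINAL_STATUSES


reply = parse_simple_args("guid=42&status=Retrying&infoMessage=hypothetical reason")
print(reply["guid"][0], reply["status"][0], is_final(reply))
```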
Heartbeat
A heartbeat is sent at periodic intervals (if requested) if recognition is still ongoing.
For Queues, the heartbeat is implemented by updating the visibility timeout (even if a heartbeat is not
requested) in addition to sending a Status: Heartbeat message. For Workflow, a
RecordActivityTaskHeartbeat is executed.
This allows for a durable design. For queues, the message will be returned to the queue after the
visibility timeout if stxInstance fails to send a heartbeat (likely because it is hung, crashed, etc.). For
workflow, the standard timeout support in workflow offers the same functionality should stxInstance
fail to execute RecordActivityTaskHeartbeat.
Please note the earlier caution: stxInstance will not honor a requested heartbeat timeout under 10
minutes. Any shorter request is raised to the 10-minute minimum.
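The timing rules can be worked through numerically. The sketch below is an interpretation, not stxInstance's code: it assumes heartbeatTimeout is given in seconds, that requests under 10 minutes are raised to 10 minutes, and that a heartbeat is sent 5 minutes before the timeout would expire. Because the inbound message notes also state that heartbeats are never sent more than once every 5 minutes, the interval is floored at 5 minutes. The function names are hypothetical.

```python
MIN_TIMEOUT_S = 10 * 60   # requested timeouts below this are not honored
MIN_INTERVAL_S = 5 * 60   # heartbeats are never sent more often than this
SAFETY_MARGIN_S = 5 * 60  # assumption: heartbeat goes out 5 minutes before timeout


def effective_heartbeat_timeout_s(requested_s: int) -> int:
    """Raise a too-short requested timeout to the 10-minute minimum."""
    return max(requested_s, MIN_TIMEOUT_S)


def heartbeat_interval_s(requested_s: int) -> int:
    """Interval between heartbeats under the assumptions stated above."""
    timeout = effective_heartbeat_timeout_s(requested_s)
    return max(MIN_INTERVAL_S, timeout - SAFETY_MARGIN_S)


for requested in (120, 600, 1800):
    print(requested, effective_heartbeat_timeout_s(requested), heartbeat_interval_s(requested))
```

Under this reading, a 30-minute timeout yields a heartbeat every 25 minutes, while anything at or below 10 minutes yields the default 5-minute interval.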
EC2 Instance Type Guidance
You will generally need 1.5-2 GB of memory per worker without cross-word rescoring and 2.5-3 GB per
worker with cross-word rescoring. You should conduct performance testing to determine the best mix of
instance types for your use case. The absolute minimum requirement is c1.medium, which is enough for
one STX worker but does not have enough memory for cross-word rescoring.
Logs
Log files are in /home/ec2-user/stxInstance/logs. They are automatically rotated and up to 50 are kept.
This is configurable by editing /home/ec2-user/stxInstance/logger.xml.
Sample Code
There is an examples directory with two basic demos for the advanced content processing support made
possible via the use of SimpleWorkflowService and SimpleQueueService.
For convenience, these samples are precompiled into stxInstance. Simply start with
-runWorkflowExample <remaining args for the example> or
-runQueueExample <remaining args for the example> as your arguments.
Simple Workflow Service
The Workflow example registers a domain and activity type. (Caution: Currently AWS does not let you
delete registered domains or activity types without contacting Premium Support.) It then implements a
workflow as follows:
1. It scans an S3 directory.
2. It starts a new workflow for each item.
3. For each started workflow, it schedules an activity in the stxInstanceRecognizeTaskList task
list.
4. stxInstance performs activities off a configurable task list.
5. For each completed activity, it closes the workflow.
6. The example loops forever waiting for more decider events. Be sure to terminate it when
you are done to avoid running up SWF actions indefinitely.
Note: This example does not use the Java Workflow framework, which is recommended for true multi-step
workflows because writing a Decider state machine by hand can be tedious. However, the goal is to demonstrate
the basic integration, and using the direct APIs makes for a cleaner exposition that is also easier to translate
to languages other than Java, for which Amazon does not yet (as of this writing) provide a Workflow framework.
Simple Queue Service
The Queue example creates a request queue for the requested model using a fixed priority of 200. It
also creates a reply queue. It then implements an example content processing flow as follows:
• It scans an S3 directory.
• For each item, it sends a message on the request queue.
• stxInstance scans queues with the matching prefix and processes the requests, paying
  attention to priority (lower numbers are higher priority). When priorities are equivalent,
  a given worker prefers not to swap models (a.k.a. model affinity). A reply message is
  sent on the reply queue. (Started and heartbeat messages are also sent.)
• The example loops continuously, waiting for reply messages and printing them out as they
  arrive. Be sure to terminate it when you are done to avoid running up SQS requests
  indefinitely.
• There is also a -watch option, which implements a looping watch folder on the input folder.
  Input files are not deleted after they are done, but any file whose output already exists in
  the output folder is skipped. There is no automatic deletion of files in the S3
  input bucket, but you may write your own code to perform this function.
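The skip rule of the -watch option can be sketched as a pure function over two key listings. This is an illustration, not the example's actual source: the function name is hypothetical, and it assumes outputs are named as in cloudFolder mode (output prefix + input tail + .xml).

```python
from typing import Iterable, List


def keys_to_submit(input_keys: Iterable[str],
                   output_keys: Iterable[str],
                   input_prefix: str,
                   output_prefix: str) -> List[str]:
    """Return the input keys whose expected output does not exist yet."""
    existing = set(output_keys)
    pending = []
    for key in input_keys:
        if not key.startswith(input_prefix) or key == input_prefix:
            continue  # outside the watched folder, or the folder entry itself
        # Assumption: output name = output prefix + input tail + ".xml".
        expected = f"{output_prefix}{key[len(input_prefix):]}.xml"
        if expected not in existing:
            pending.append(key)
    return pending


print(keys_to_submit(["in/a.wav", "in/b.wav", "in/"], ["out/a.wav.xml"], "in/", "out/"))
```

A real watcher built this way would probably also need to remember which keys it has already queued, since a file that is still being recognized has no output yet and would otherwise be submitted again on the next scan.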
Application Patterns
If you have a large archive of files to run through STX using one or a small number of models, you can
make use of these examples to get the job done in a distributed manner:
1. Place all files on S3 with files using the same model in the same folder.
2. Fire up one of the above examples. (The SQS example is probably preferred because the
Workflow example creates indelible registrations, due to a limitation in AWS SWF which is still in
beta).
3. Configure instance user data to start up stxInstance in the appropriate mode (cloudQueue or
cloudWorkflow) with appropriate parameters.
4. As each instance starts up, it will start processing work from the queue/workflow. It is also very
easy to extend either example to upload files from local disk to S3 first, and then get going (vs
scanning an S3 folder).
5. Reply messages are also returned via the reply queue if using the cloudQueue operation mode.
The provided queue example prints them to screen.
[Figure: STXInstanceQueueExample.java (or your variant thereof); S3 Input Folder (input files); SQS queues, per model and per priority; replies and heartbeats; stxInstance EC2 instances, each of which can have a configurable number of workers; S3 Output Folder (output .xml transcripts). Elements carry numbered callouts 1 through 5.]
You can then use CloudWatch to monitor the number of visible queue messages (if using cloudQueue)
and use Auto Scaling to launch more instances. Note that Auto Scaling does not appear to provide a
mechanism (e.g., via a shutdown hook) to perform a soft suspend before termination as of this writing.
You can also implement your own framework that watches queue sizes and performs
launches and soft or hard terminations.
If you have high vs. low priority content, you can use priorities with cloudQueue mode.
You can easily build a content processing pipeline with cloudWorkflow mode by having successor
activities process the resulting transcript .xml. (This can also be custom built with cloudQueue mode.)
MPlayer2 GPL Compliance
A copy of the build tree for mplayer2 and its various libraries, as built, is included in a .tgz in the
stxInstance directory. Because our developer environment file paths are built into some of the files, you
will need to do some work to get it to build in your environment.