MediaCloud™ VM

Latest Revision: August 9, 2012

Contents

Overview
Requirements
Supported Input Formats
Quick Start
Operation Modes
Instance Parameters
    For all operation modes
    For the localFiles mode
    For all cloudModes
        For cloudMode=folder
        For cloudMode=queue
        For cloudMode=workflow
Run Modes
Queue and Workflow Support
    Queue
    Queue and Workflow Common Message Format
        Inbound Message Format
        Reply Message Format
    Heartbeat
    EC2 Instance Type Guidance
Logs
Sample Code
    Simple Workflow Service
    Simple Queue Service
Application Patterns
MPlayer2 GPL Compliance

Overview

The MediaCloud™ VM provides RAMP’s speech-to-text (STX) capabilities within the Amazon Web Services (AWS) framework, allowing you to process input audio/video files and receive XML recognition output. Single server and distributed workload modes of operation are provided, along with the ability to process either a folder of files or requests received via AWS Simple Workflow Service or via AWS Simple Queue Service. Scenarios such as prioritizing content for processing, having certain instances only process high priority content, etc. are fully supported.

MediaCloud™ VM is available as an Amazon Machine Image (AMI) through the AWS Marketplace, allowing you to charge usage to your AWS account. There is one AMI for telephony audio (audio with actual signal up to 4kHz).

Requirements

The RAMP STX XML output format is not documented here. Refer to the separate MediaCloud™ Integration Guide for that information.

You will need an Amazon Web Services account and should be familiar with at least the Elastic Compute Cloud (EC2) and Simple Storage Service (S3).

Supported Input Formats

mplayer2 is used to convert various audio/video file formats. If a generally available build of mplayer or mplayer2 as of March 2012 works with your file, there is a high likelihood that stxInstance will work as well. For a list of mplayer’s supported input formats, see: http://www.mplayerhq.hu/design7/info.html.

Quick Start

This QuickStart example demonstrates instantiating an instance in cloudQueue mode, which takes recognition requests via Amazon Simple Queue Service (SQS), along with an example Amazon Simple Storage Service (S3) folder watcher to submit the requests to SQS. The standard US English Conversational Telephony model will be used.

1. Have or create an S3 bucket for testing.
   Note your S3 access key and secret.
2. Create a folder.
3. Sign up for the ___<name>__ DevPay product by going to this URL:
   https://portal.aws.amazon.com/gp/aws/user/subscription/index.html?ie=UTF8&offeringCode=B6566F4D
4. Go to the AWS Marketplace page for the current ___<name>___ AMI:
   ____<url of page to be created after I get description info from Joe>______
5. Launch the AMI on a c1.medium instance.
6. Establish two SSH connections to the running AMI.
7. In the first SSH connection, run the following command to start up an stxInstance in cloudQueue mode:

   cd ~/stxInstance; ./stxInstance -cloudCredentials <your S3 key>:<your S3 secret> -cloudMode queue

8. In the second SSH connection, run the following command to start up the sample S3 folder watcher example:

   cd ~/stxInstance; ./stxInstance -runQueueExample -watch <your S3 key>:<your S3 secret> en-us/telephony/lm/conversational <your S3 bucket> <S3 input folder>/ <S3 output folder>/ ""

   Note: The final empty "" argument is required. It is normally a queue name prefix, but a blank prefix is acceptable.
9. Place several test audio .wav files into the input folder. (You can also use most common audio or video file formats supported by mplayer2.) The screenshot below shows a 3rd party S3 GUI tool (CyberDuck).
10. You will see the watch folder window scan the input folder periodically, pick up the new files, and queue them for recognition. Status and completed messages will also be printed out.
11. Finally, the output XML files will be placed into the output folder.

Operation Modes

The QuickStart example demonstrates running MediaCloud™ VM in the cloudQueue mode of operation. stxInstance supports three other modes of operation, for a total of four different ways you can integrate stxInstance into your processing:

1. localFiles
   • Recommended primarily for testing purposes and small jobs only
   • One model and recognition run mode per input file
   • Local folder paths are specified for inputs. Outputs are written to the same location as inputs with .xml appended.
   • You must copy your input and output files via scp, etc.
2. cloudFolder using S3
   • Recommended primarily for testing purposes and small jobs only
   • One model and recognition run mode per run
   • S3 paths are specified for inputs and outputs. You must copy your input and output files to/from S3, etc.
   • stxInstance will scan the specified S3 path upon startup and will process all files found.
3. cloudWorkflow using SimpleWorkflowService and S3
   • Recommended for distributed workflows involving a single model
   • Multiple models and recognition run modes are supported
     o Does not support “model affinity” to reduce the overhead of switching models per recognition request
   • S3 paths are specified for inputs and outputs. You must copy your input and output files to/from S3, etc.
   • No support for priorities
   • Benefit of simpler coding if you are using SimpleWorkflowService to manage your processing, as compared to cloudQueue mode, which requires specific coding for stxInstance.
   Note: Coding to SimpleWorkflowService is not trivial. If you are focused on a one-step cloud workflow involving just stxInstance, you can probably get started more quickly using cloudQueue mode.
4. cloudQueue using SimpleQueueService and S3
   • Recommended for distributed workflows involving multiple models and/or multiple priorities
   • Multiple models and recognition run modes are supported
     o Supports “model affinity” per worker: a worker will first favor recognition requests using the same model and will switch models only if no request is available using the same model (at the same or higher priority)
   • S3 paths are specified for inputs and outputs. You must copy your input and output files to/from S3, etc.
   • Supports priority levels 000 through 999, with lower numbers always being favored
   • Requires more stxInstance-specific coding as compared to cloudWorkflow mode, which requires mostly standard AWS SimpleWorkflowService coding.
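
As an illustration of the priority and model affinity behavior described for cloudQueue mode, the following Python sketch shows one way such a selection rule could be written. It is not the stxInstance implementation. The function name pick_next_request, the (priority, model) tuple representation, and the alphabetical tie-break when the worker has to switch models are all assumptions made for this example.

```python
from typing import Optional


def pick_next_request(
    pending: list[tuple[int, str]],
    current_model: Optional[str],
) -> Optional[tuple[int, str]]:
    """Choose the (priority, model) pair a worker should serve next.

    pending holds one (priority, model) pair for every queue that currently
    has a request waiting. Lower priority numbers are always favored.
    Hypothetical helper: the real selection logic is internal to stxInstance.
    """
    if not pending:
        return None

    # Lower numbers win, so the smallest priority value is served first.
    best_priority = min(priority for priority, _ in pending)
    candidates = sorted(
        model for priority, model in pending if priority == best_priority
    )

    # Model affinity: stay on the loaded model if it has work at this priority.
    if current_model in candidates:
        return (best_priority, current_model)

    # Otherwise switch models (assumed tie-break: alphabetical order).
    return (best_priority, candidates[0])
```

With this rule, a worker that has en-us/telephony/lm/conversational loaded keeps serving that model while it has requests at the best available priority, and switches only when a different model has work at a strictly better (lower-numbered) priority or its own model has nothing waiting at that level.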

Instance Parameters

Instance parameters may be specified on the command line passed to stxInstance or, if the first argument is “-ec2AutoStart”, EC2 launch user data will be used. To set your arguments as user data, use a CGI style string with &param=<value if any>. In the event of multiple values, use &param.1=<value 1>&param.2=<value 2>. For example:

localFiles.1=en-us/broadcast/lm/general&localFiles.2=foo.mp3

is equivalent to:

-localFiles en-us/broadcast/lm/general foo.mp3

The system refers to this as SimpleArgs format.

For all operation modes:

Minimally, either -localFiles or -cloudMode <mode> must be given to specify the primary operation mode.

NAME            TYPE     DEFAULT      REQUIRED?  NOTES
listModelsOnly  string                No         Lists the installed models and exits
numWorkers      integer  1            No         The number of processing workers/threads
instanceAlias   string   stxInstance  No         A friendly string that appears in status screens, heartbeat messages, etc. The system always adds @<hostname> when using this string.
disableRamDisk  integer  0            No         Disables using a RAM disk for temp files. Normally, stxInstance will use a RAM disk only if /dev/shm exists, has enough free space for numWorkers*500MB, and there is enough system memory for numWorkers*2.5GB + space for RAM disk. This is a conservative assumption if the files being recognized at any given point in time average to less than an hour and if you do not send HD resolution video through.

For the localFiles mode:

NAME        TYPE    DEFAULT  REQUIRED?  NOTES
localFiles  string           Yes        <model> <file path to run in local files mode> <model>[:<run mode>] <next filepath> <next model>[:<next run mode>]… Output will be written to the same place + .xml

For all cloudModes:

NAME              TYPE    DEFAULT  REQUIRED?  NOTES
cloudCredentials  string           Yes        <key>:<secret>
cloudMode         string           Yes        queue|workflow|folder

For cloudMode=folder:

NAME                      TYPE    DEFAULT  REQUIRED?  NOTES
cloudModel                string           Yes        <model>[:<run mode>]
cloudStorageLocation      string           Yes        S3 bucket
cloudStorageInputPrefix   string           Yes        All input files must be here
cloudStorageOutputPrefix  string           Yes        Output files will be this + input tail (i.e. less input prefix) + .xml, e.g. /inputs/foo.wav -> /outputs/foo.wav.xml

For cloudMode=queue:

NAME                      TYPE     DEFAULT  REQUIRED?  NOTES
inboundQueuePrefix        string   empty    No
idleQueuePollingInterval  integer  60       No         The polling interval for the queue when the queue is empty (in seconds)
priorityMinValue          integer  999      No         Lower numbers = higher priority. Cannot be >999.
priorityMaxValue          integer  0        No         Cannot be <0.
modelAllowed              string            No         Can have multiple entries. If no entries, all models are allowed.

For cloudMode=workflow:

NAME                   TYPE    DEFAULT                       REQUIRED?  NOTES
cloudWorkflowEndpoint  string  https://swf.amazonaws.com     No*        * As of the date of this documentation, because SWF is still in beta, this value must be specified as https://swf.us-east-1.amazonaws.com
cloudWorkflowDomain    string  RAMPSTX                       No         The registered domain for the workflow to connect to.
cloudWorkflowTaskList  string  stxInstanceRecognizeTaskList  No         The task queue from which to process STX requests.

Note that the system only cares about the workflowTaskList when retrieving jobs to process. The particular value of the activity type and version will be passed through to ActivityCompleted, ActivityHeartbeat, etc., but the system does not do anything special with that value. Instead, the input string specifies the model, run mode, etc.

Supporting multiple workers, for example, is useful primarily to ease the overhead of management (fewer instances, less need to clone workers per AMI). Strictly speaking, there is not a great CPU efficiency gain over running more single worker instances.

It is also useful when running a single instance on a cloud folder or on a set of local files if you are using multiple models (as it will attempt to optimize model reallocations).
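
The SimpleArgs user data format described at the start of this section can be sketched in Python as follows. This is a minimal illustration, not code shipped with stxInstance. The function name to_simple_args and the dictionary input are hypothetical, and the sketch assumes that values are written as-is without percent-encoding (the documented example contains unescaped "/" characters) and that a parameter with no value is written as name= with an empty value.

```python
def to_simple_args(params: dict[str, list[str]]) -> str:
    """Encode {parameter name: [values...]} as a SimpleArgs-style string.

    Hypothetical helper. Assumes no percent-encoding is required and that
    entries are joined with "&" without a leading "&".
    """
    parts = []
    for name, values in params.items():
        if len(values) <= 1:
            # Zero or one value: name=<value if any>
            parts.append(f"{name}={values[0] if values else ''}")
        else:
            # Multiple values: name.1=<value 1>&name.2=<value 2>, numbered from 1
            parts.extend(
                f"{name}.{index}={value}"
                for index, value in enumerate(values, start=1)
            )
    return "&".join(parts)


# Mirrors the documented equivalence for -localFiles.
user_data = to_simple_args(
    {"localFiles": ["en-us/broadcast/lm/general", "foo.mp3"]}
)
print(user_data)
```

If your credentials or file names contain characters such as "&" or "=", verify how stxInstance expects them to be escaped before relying on a helper like this.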

Run Modes

Model names follow the form:

<language>/<channel:broadcast>/lm/<domain> </rscr>

For example:

en-us/telephony/lm/conversational

Note: Currently only the telephony model is available. Additional models may be introduced in the future.

Adding /rscr activates rescoring with acoustic cross-word models. Rescoring is computationally relatively light (~10-20% overhead), but is memory intensive.

In addition, there is a run mode associated with each recognition request. The “standard” run mode (default, if omitted) indicates the native model parameters. Presently the only additional run mode defined is “fast”, which uses very aggressive pruning parameters resulting in a 2X-4X speedup. Use of “fast” is recommended when a large volume of data needs to be analyzed for data analytics purposes, when a rough recognition is needed for text-based timeline alignment, and for similar tasks. It is not recommended for normal recognition.

Combining run mode “fast” with rescoring models will likely result in a system somewhere between “fast” and “standard” in terms of speed and accuracy, but will require the full memory footprint dominated by rescoring models. Though rescoring can in theory be shared amongst workers running the same model, the present implementation does not attempt this.

Queue and Workflow Support

Queue

The system will poll from SQS queues whose names match the following convention:

<prefix>RAMPSTX-<3 digit priority>-<model name encoded>

The model name is encoded by substituting “__” for “/”. (“/” is an illegal character for AWS SQS queue names.) Use of <prefix> allows you to set up independent pools of stxInstances.

Note: AWS SQS ListQueues, which stxInstance uses, will list at most 1000 queues, so having more than 999 combinations of priority+model name will result in higher queue listing overhead as the system will make multiple calls. (For simplicity, the system makes 10 calls, one for each possible tranche of 100.
If any tranche is > 999 queues, it breaks down into 10 tranches.)

When a message is pulled from a queue, the visibility timeout is set to 5 minutes beyond the heartbeat interval (which defaults to 5 minutes). This is reset approximately every 5 minutes if there is a live heartbeat. The default queue visibility timeout should be something more than a few seconds.

Queue and Workflow Common Message Format

Arguments are CGI encoded via SimpleArgs format in the message for Queue and in the input for Workflow.

Inbound Message Format

NAME                  TYPE     DEFAULT   REQUIRED?  NOTES
guid                  integer            Yes
cloudStorageLocation  string             Yes        S3 bucket for both input and output
inputFile             string             Yes        S3 path to input file
outputFile            string             Yes        S3 path to output file
model                 string             Yes        Should match queue when using cloudMode=queue
runMode               string   standard  No         standard | fast
sendHeartbeat         string   false     No         false | true. Assumed true for Workflow.
heartbeatTimeout      integer            Yes        In seconds. If sendHeartbeat=true (assumed for Workflow), a heartbeat is sent every min(5, heartbeatTimeout-5) minutes. This allows you to set a less frequent heartbeat interval. Because of potentially long model loading times, setting a heartbeat timeout under 10 minutes is not recommended (the system will never send heartbeats more than once every 5 minutes).
replyQueue            string             Yes        Queue URL. Not relevant for Workflow.
startNotBefore        integer            No         Not relevant for Workflow. Reserved for future implementation of delay queues.

Reply Message Format

For Queues, a reply message is sent. For Workflow, RecordActivityTaskHeartbeat, RespondActivityTaskCompleted or RespondActivityTaskFailed is called with an explanatory message, if appropriate.

NAME                  TYPE     DEFAULT  REQUIRED?  NOTES
guid                  integer           Yes
cloudStorageLocation  string            Yes        S3 bucket for both input and output
inputFile             string            Yes        S3 path to input file
outputFile            string            Yes        S3 path to output file
status                string            Yes        One of: Started (not relevant for Workflow); Retrying (non-final failure); Failed (not relevant for Workflow); Heartbeat (if requested); Completed (not relevant for Workflow)
instanceAlias         string            Yes        The instanceAlias that was set at stxInstance startup.
startedTime           integer           Yes        Seconds since Java epoch
messageTime           integer           Yes        Seconds since Java epoch
completedTime         integer           Yes        Seconds since Java epoch (if Completed)
infoMessage           string            Yes        Any useful info, such as failure reason.

Heartbeat

A heartbeat is sent at periodic intervals (if requested) while recognition is still ongoing. For Queues, the heartbeat is effected by updating the visibility timeout (even if a heartbeat is not requested) in addition to sending a Status: Heartbeat message. For Workflow, a RecordActivityTaskHeartbeat is executed.

This allows for a durable design. For queues, the message will be returned to the queue after the visibility timeout if stxInstance fails to send a heartbeat (likely because it is hung, crashed, etc.). For workflow, the standard timeout support in workflow offers the same functionality should stxInstance fail to execute RecordActivityTaskHeartbeat.

Please note the earlier caution that stxInstance will not honor a requested heartbeat timeout under 10 minutes. It will be set to the minimum of 10 minutes.

EC2 Instance Type Guidance

You will generally need 1.5-2 GB of memory per worker without cross-word rescoring and 2.5-3 GB per worker with cross-word rescoring. You should conduct performance testing to determine the best mix of instance types for your use case. The absolute minimum requirement is c1.medium, which is enough for one STX worker but does not have enough memory for cross-word rescoring.

Logs

Log files are in /home/ec2-user/stxInstance/logs. They are automatically rotated and up to 50 are kept. This is configurable by editing /home/ec2-user/stxInstance/logger.xml.
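
The queue naming convention described under Queue and Workflow Support (<prefix>RAMPSTX-<3 digit priority>-<model name encoded>, with “__” substituted for “/”) can be sketched in Python as follows. This is an illustrative sketch only. The function names queue_name and parse_queue_name are hypothetical and are not part of stxInstance, and the parser assumes that model names never contain “__” themselves.

```python
import re

# <prefix>RAMPSTX-<3 digit priority>-<model name encoded>
QUEUE_NAME_RE = re.compile(
    r"^(?P<prefix>.*?)RAMPSTX-(?P<priority>\d{3})-(?P<model>.+)$"
)


def queue_name(model: str, priority: int, prefix: str = "") -> str:
    """Build a request queue name for a model and priority (hypothetical helper)."""
    if not 0 <= priority <= 999:
        raise ValueError("priority must be between 0 and 999")
    # "/" is not allowed in SQS queue names, so it is replaced with "__".
    encoded_model = model.replace("/", "__")
    return f"{prefix}RAMPSTX-{priority:03d}-{encoded_model}"


def parse_queue_name(name: str) -> tuple[str, int, str]:
    """Split a queue name back into (prefix, priority, model).

    Assumes the model name itself contains no "__" sequence.
    """
    match = QUEUE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an stxInstance-style queue name: {name!r}")
    model = match["model"].replace("__", "/")
    return (match["prefix"], int(match["priority"]), model)


name = queue_name("en-us/telephony/lm/conversational", 200)
print(name)
print(parse_queue_name(name))
```

A helper like this is useful on the submitting side, where your own code must create or look up the request queue that matches the model and priority of each message.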

Sample Code

There is an examples directory with two basic demos of the advanced content processing support made possible via the use of SimpleWorkflowService and SimpleQueueService. For convenience, these samples are precompiled into stxInstance. Simply start with -runWorkflowExample <remaining args for the example> or -runQueueExample <remaining args for the example> as your arguments.

Simple Workflow Service

The Workflow example registers a domain and activity type. (Caution: Currently AWS does not let you delete registered domain/activity types without contacting Premium Support.) It then implements a workflow via:

1. Scanning an S3 directory
2. Starting a new workflow for each item
3. For each started workflow, scheduling an activity in the stxInstanceRecognizeTaskList task list
4. stxInstance will perform activities off a configurable task list
5. For each completed activity, closing the workflow
6. The example loops forever waiting for more decider events. Be sure to terminate it when you are done to avoid running up SWF actions indefinitely.

Note: This example does not use the Java Workflow framework, which is recommended for true multi-step workflows, as writing a Decider state machine by hand can be tedious. However, the goal is to demonstrate the basic integration interaction, and using the direct APIs is a cleaner exposition, as well as being easier to translate to languages other than Java for which Amazon does not yet (as of this writing) provide a Workflow framework.

Simple Queue Service

The Queue example creates a request queue for the model requested using a fixed priority of 200. It also creates a reply queue. It then implements an example content processing process via:

• Scanning an S3 directory
• For each item, sending a message on the request queue
• stxInstance will scan queues with the matching prefix and process the requests. It pays attention to priority (lower numbers are higher priority) and, when priorities are equivalent, a given worker will prefer not to swap models (a.k.a. model affinity).
• A reply message will be sent on the reply queue. (Started and heartbeat messages will also be sent.)
• The example loops continuously, waiting for reply messages and printing them out as they arrive. Be sure to terminate it when you are done to avoid running up SQS actions indefinitely.

There is also a -watch option which implements a looping watch folder on the input folder. Input files are not deleted from the S3 input folder after they are processed, but any file that already exists in the output folder will be skipped. You may write your own code to delete processed input files.

Application Patterns

If you have a large archive of files to run through STX using one or a small number of models, you can make use of these examples to get the job done in a distributed manner:

1. Place all files on S3, with files using the same model in the same folder.
2. Fire up one of the above examples. (The SQS example is probably preferred because the Workflow example creates indelible registrations, due to a limitation in AWS SWF, which is still in beta.)
3. Configure instance user data to start up stxInstance in the appropriate mode (cloudQueue or cloudWorkflow) with appropriate parameters.
4. As each instance starts up, it will start processing work from the queue/workflow. It is also very easy to extend either example to upload files from local disk to S3 first, and then get going (vs. scanning an S3 folder).
5. Reply messages are also returned via the reply queue if using the cloudQueue operation mode.
   The provided queue example prints them to screen.

[Figure: distributed processing with STXInstanceQueueExample.java (or your variant thereof). Labels: S3 Input Folder (input files); SQS queues, per model and per priority; replies and heartbeats; stxInstance EC2 instances, each with a configurable number of workers; S3 Output Folder (output .xml transcripts). Arrows are numbered 1 through 5 to match the steps above.]

You can then use CloudWatch to monitor the number of visible queue messages (if using cloudQueue) and use Auto Scaling to launch more instances. Note that Auto Scaling does not appear to provide a mechanism (e.g., via a shutdown hook) to perform a soft suspend before termination as of this writing. You can also implement your own framework that watches queue sizes and performs launches and soft/hard terminations.

If you have high vs. low priority content, you can use priorities with cloudQueue mode.

You can easily build a content processing pipeline with cloudWorkflow mode by having successor activities process the resulting transcript .xml. (This can also be custom built with cloudQueue mode.)

MPlayer2 GPL Compliance

A copy of the build tree for mplayer2 and its various libraries, as built, is included in a .tgz in the stxInstance directory. Because our developer environment file paths are built into some of the files, you will need to do some work to get it to build in your environment.