INDUSTRIAL AUDIO FINGERPRINTING DISTRIBUTED SYSTEM WITH CORBA AND WEB SERVICES

advertisement
INDUSTRIAL AUDIO FINGERPRINTING DISTRIBUTED SYSTEM WITH
CORBA AND WEB SERVICES
Jose P. G. Mahedero, Vadim Tarasov, Eloi Batlle, Enric Guaus, Jaume Masip
Audiovisual Institute
Universitat Pompeu Fabra
Ocata 1, Barcelona, Spain
{jpgarcia, vtarasov, eloi, eguaus, jmasip }@iua.upf.es
ABSTRACT
With digital technologies, music content providers face
serious challenges to protect their rights. Due to the widespread nature of music sources, it is very difficult to centralize the audio management. Audio fingerprinting allows the identification of audio content regardless of the
audio format and without the need of additional metadata.
Monitoring the audio being broadcasted by the TV and
radio stations of a country requires the design and implementation of a scalable, robust and modular framework.
We have chosen CORBA as distributed environment. The
whole functionality needs to be decoupled from clients.
To do so, Web services have been deployed. The audio
identification core uses a Hidden Markov Model-based
audio fingerprinting technology. The paper discusses the
design and implementation issues of a complete distributing system that automatically monitors audio content, specifically music and commercials.Today, a working prototype of such a system already exists, and is dedicated to
monitoring several radio and tv stations in Spain.
1. INTRODUCTION
Due to the outgrowing field of digital rights management
(DRM), the need to automate the tracking of different audio sources arose. Traditionally this task was accomplished
by sampling the sources (or just asking them) and storing
what had been broadcasted in a certain period of time. At
the time the only sources of audio which companies kept
track of were radio stations.
Time went by and some factors showed the need for
companies to automate the tracking process. Among these
factors we could mention the growth of the music rights
management companies and their significance, the increasing number of radio stations, and one of the most important (if not the most) the arrival of the Internet and the
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page.
c 2004 Universitat Pompeu Fabra.
resulting exponential growth of music transfer.
As the number of sources and the quantity of audio itself was really huge, it became clear that a distributed system supporting many pluggable sources was an optimal
solution.
The system needed to be powerful enough to serve different clients 1 each of them having its own requirements.
It also had to provide efficient load balancing and fault
tolerance. The experience in distributed systems pointed
the use of CORBA (http://www.CORBA.org) as a robust
platform to integrate and cooperate many heterogeneous
systems [5].
2. MAIN DESCRIPTION
We have implemented a system composed by several nodes
in charge of the audio recognition. They communicate in
a hierarchical structure in wich each node tries to recognize audio. In case a node is unable to identify a pattern,
it passes the request to the next node of the hierarchy. The
identification procedure continues until it reaches the top
level of the hierarchy.
Nodes act as caches of the top level node. Each of
them has a list of patterns representing a portion of all
the models present in the system, so a node will identify
a portion of what the whole the system is able to. If the
audio fragment is not identified by the top level node, the
audio passes to manual (human) processing. See figure 1
Each identification transaction (either positive or negative) is logged in order to provide feedback information to
the clients. They can also monitorize the behavior of the
system and adjust global parameters depending on their
needs.
1
We will use the term client to refer to a program that will communicate with the system as well as the end user of the system. This non
distinction is made because from the point of view of the system these
terms (program client and user client) are similar and the system makes
no difference.
User interface : the use of web services lets up to client
the implementation of the end interface, so this part
is not covered in this paper.
3. IDENTIFICATION LAYER
System overview The system presented here goes beyond the template matching paradigm to a statistical
pattern matching paradigm. The goals are to incorporate robustness to the system by statistically modeling the audio evolution while reducing the size of
the fingerprint by considering local and global perceptual redundancies in a corpus of music data.
Figure 1. main description
From a logical point of view, our system is composed
of different layers each of them with its own purpose:
Identification : the core of the system. This layer is in
charge of the identification of the audio. It takes audio streams as input and tries to identify them using
the audio patterns datase. It consists of C++ audio
recognition libraries.
Distribution : groups physical units into virtual nodes
that cooperate for the same purpose, for example
commercials, music or TV virtual node. It manages the load of each virtual node and is in charge
of distributing jobs to the nodes depending on their
respective load. This layer sends jobs to the identification layer and loggs the results received from it.
This layer also manages each virtual node’s pattern
list to avoid overloading the top node by providing
lower level nodes with feedback information from
higher level nodes. This process results in an updated pattern database for each node. Thus, a lower
level node will not fail to identify any given pattern
more than once. At the same time with load balancing, this layer is responsible for error recovery and
handle system failures. CORBA is used to provide
seamless integration between remote components.
Intelligence : it provides mechanisms for prioritizing jobs
either on demand by client request or to satisfy the
distribution layer needs. It also executes jobs periodically or can even estimate processing times to
give clients a better feedback. As well as the distribution layer, this one uses CORBA as a platform
for process distribution and coordination.
Integration : a layer in charge of making the system
accessible to many different clients each of them
having its own operating system and requirements.
This is accomplished by using web services, which
provide a functional API of the system.
Feature extraction A filter-bank based analysis approximates the human inner ear behavior and the Melcepstrum coefficients (MFCC) [2] analise the music
timbers [4].
Channel estimation If the distorting channel H(ω) is slowly
varying we can design a filter that, applied to the
time sequence of parameters, is able to minimize
the effects of the channel. The filter we designed
1−z −1
for our system is H(z) = 0.99 1−0.98z
−1 .
Hidden Markov Model Observers
Polyphonic music can be seen as a sequences of simultaneous acoustic events (notes). In [1], we defined a set of abstract events that allow a mathematical description of complex music as sequences
of events. These events are captured by Hidden
Markov Models(HMM)[6] that model abstract audio generators. (trained with the Expectation-Maximization algorithm [3])
Identification Algorithm
Each HMM fingerprint in the song database uniquely
identifies each song among the others. The fingerprint database is generated using the Viterbi algorithm [8]. The identification algorithm matches an
input streaming audio against all the fingerprints to
determine whenever a song section has been detected. The Viterbi algorithm is used again with
the purpose of exploiting the observation capabilities of the HMM models contained in the fingerprint
sequences.
4. DISTRIBUTION LAYER
This layer has many tasks to accomplish: group nodes:
virtual nodes represent a hardware abstraction layer that
increases system scalability. Assure a good load balancing: in a distributed system it is essential to make a good
load balancing in order to have an almost constant throughput. This is also a task of this layer. Fault tolerance: a
node may be disconnected during processing without any
consequences. Feedback clients and nodes: clients need
to receive the results of the recognition processes as well
as to monitorize the global behaviour of the system. As
we mentioned in the Main description section (2), nodes
are fed back from the top node to prevent future identification failures.
This layer is composed of three different levels (see
figure2):
- Physical level: each node is seen as a single processing unit.
- Logical unit level: hardware resources are grouped
together into virtual nodes. This level divides identification tasks depending on which kind of audio it
will identify. For example we can have three units
for commercial audio identification grouped together
into one virtual node. Other example could be several logical units hosted on the same physical machine. The number of physical units that form each
virtual node can be configured to fit the system needs.
- Process level: this is composed of virtual nodes and
a logger. Each virtual node is built of three elements:Data providers: provide data from heterogeneous sources in a unified manner. Recognizer:
identifies the audio supplied by the data providers.
Controller: handles communication and synchronization issues between data providers and recognizer. It also communicates with other controllers
to perform load balancing and error recovery. In
case of identification failure it is charge of passing
the identification task to the top level node.
Apart from virtual nodes there is a special component
named logger who is in charge of logging all the events received from the controllers of the virtual nodes. It writes
logs to the databases. They can be later consulted by
clients by calling the services provided by the Integration
layer. Logs reflect information about the recognition process such as timing, which node processed the audio, kind
of audio processed etc. This information help clients to
track the whole identification process.
5. INTELLIGENCE LAYER
The Intelligence layer serves for the process optimization.It is separated into several intelligent agents that perform automatic stream tagging based on metadata. The
system preprocesses the stream description, separates the
fragments with voice from those with music. It also performs source signal cache management. The source signal is cached and arranged in manageable fragments allowing to minimize the number of queries made to external data storage system, thus minimizing the traffic. This
layer also provides identification result optimization, automatic job management, automatic pattern list generation
and pattern reindexation.
Figure 2. Distribution layer
6. INTEGRATION LAYER
This layer integrates modules provided by the distribution (see 4 Distribution) and the Intelligence (see 5 Intelligence) layer with the clients. At the same time is lets
server-client interaction as much open as possible by providing a functional API.
SOAP (http://www.w3.org/TR/soap/) is a lightweight protocol based on XML-RPC calls. It is designed to work in
a decentralized environment. It sits on top of the protocol stack and can use many different transport protocols
such as SMTP, FTP or other. However its most widely
used over HTTP because of its facilities for firewall/proxy
filtering.[7]
From a general point of view, working with SOAP consists on making requests to a Web Service (from now on
WS or simply service) in a specific SOAP format, and
receive responses from the WS. The Service interface is
described in a WSDL (http://www.w3.org/TR/wsdl, a superset of XML) file indicating methods calls, objects, and
exceptions that will be sent across the net. A Web Service is a set of related methods exposed to clients via
SOAP protocol. As all the exchange of information is
made with XML (both data and control messages), SOAP
makes possible the interaction between different platforms
or languages such as COM, CORBA, Perl, Tcl, the Javalanguage, C, C++, Python, or PHP programs running anywhere on the Internet.
Features of Web Services: Applications can communicate across the Internet, they are language and platform
independent, they can run over many protocols, and they
are human readable.
Due to their features, WS suit perfectly our needs. The
aims of the layer are accomplished by using WS because
they act as a bridge between clients and modules in the
underlying layers which use a CORBA infrastructure. By
providing a fine grained functional API, WS define the
way clients communicate with the system, but don’t impose any restriction on their implementation. In a MVC
pattern this would mean a complete separation of the View
from Model and Controller and force clients only to follow SOAP encoding rules but not to use a specific language or end user interface.
6.1. implementation
Figure 3. Integration layer organization
This layer is built on different modules (see figure 3):
- Axis: Axis (http://ws.apache.org/axis/) is the name
of the implementation of the SOAP specification we
use. Its is a servlet that must be plugged in to a
servlet container. Its responsibility is to route SOAP
messages between services and clients.
- SOAP Server: we use Tomcat Server as a servlet
container to route SOAP requests and response to
and from AXIS servlet.
- SQL Server: mysql is used as a persistence layer
(audio model storage, system information, etc.)
- Controllers: there are two different controllers. Service controllers that manage database access modules and CORBA controllers that handle interactions
with the underlying layers using CORBA.
- CORBA Components: components in charge of distribution, load balancing and audio processing. Apart
from distribution tasks, they accept audio recognition jobs from their controllers and invoke Identification layer components.
7. CONCLUSIONS
The combination of Hidden Markov Models applied to
fingerprinting, with CORBA and Web Services results in
a robust, scalable and performant distributed framework.
Hidden Markov Models provide an accurate fingerprinting
technique which is made robust and scalable with the use
of CORBA. The integration provided by Web Services lets
as much open as possible the interaction between clients
and system functionalities.
8. REFERENCES
[1] E. Batlle. Automatic Song Identification in Broadcast Audio. Proc. IASTED Signal and Image Processing Conference, 2002.
[2] J. A. F. E. Batlle, C. Nadeu. Feature Decorrelation Methods in Speech Recognition. Int Conf on Spoken Language
Processing, 1998.
[3] A. D. et Altri. Maximum Likelihood from Incomplete Data
via the EM Algorithm. Journal of the Royal Statistical Society, 39(1):1–38, January 1977.
[4] B. Logan. Mel Frequency Cepstral Coefficients for Music
Modeling. Proc. ISMIR, 2000.
[5] S. V. Michi Henning. Advanced CORBA programming with
C++. Addison Wesley, 1999.
[6] L. R. Rabiner. A Tutorial on HMM and Selected Applications in Speech Recognition. Proceedings of the IEEE,
77(2):257–286, 1989.
[7] E. D. Simon St Laurent, Joe Johnston. Programming Web
Services with XML-RPC. O Reilly, 2001.
[8] A. J. Viterbi. Error Bounds for Convolutional Codes and
an Asymptotically Optimum Decoding Identification. IEEE
Trans. Info. Theory, 1967.
Download