An Object-Based Knowledge Representation
Approach
for Multimedia Presentations
Abdelmadjid Ketfi, Jérôme Gensel, Hervé Martin
Grenoble, France
Abstract - This paper deals with the coupling of V-STORM, which is both a video manager and a multimedia presentation system, with AROM, an object-based knowledge representation system. We present here an AROM knowledge base, called the AVS model, which constitutes a generic model for multimedia presentations. This model encompasses any multimedia presentation described using the SMIL standard. By instantiating this knowledge base, the author describes his/her multimedia presentation and the way media objects interact in it. Then, the corresponding SMIL file is generated and sent to V-STORM in order to be played. This coupling proves relevant for two reasons: first, through its UML-like formalism, AROM eases the task of a multimedia presentation author; second, AROM is put in charge of checking the spatial and temporal consistency of the presentation during its description. This way, a consistent presentation is sent to V-STORM.
Keywords – Multimedia presentations,
Videos, Knowledge Representation, SMIL
In the last decade, multimedia and, more particularly, video systems have benefited from tremendous research interest. The main reason for this is the increasing ability of computers to support video data, notably thanks to unceasing improvements in data compression formats (such as MPEG-4 and MPEG-7), in network transfer rates and operating systems [1], and in disk storage capacity. Unsurprisingly, new applications have arisen, such as video on demand, video conferencing and home video editing, which directly benefit from this evolution. Following this trend, research efforts ([2], [3]) have been made to extend DataBase Management Systems (DBMS) so that they support video data types, and not simply through Binary Large Objects (BLOB). Indeed, DBMS seem to be well-suited systems for tackling the problems posed by video, namely storage, modeling, querying and presentation. Video data types must be physically managed apart from other conventional data types in order to fulfill their performance requirements. Video modeling must take into account the hierarchical structure of a video (shots, scenes and sequences) and allow overlapping and disjoint segment clustering [4]. The video query language must allow one to query video content using textual annotations or computed signatures (color, shape, texture, etc.), deal with the dynamics (movements) of objects in the scenes as well as with the semi-structural aspects of videos and, finally, offer the possibility of creating new videos.
We have designed and implemented V-STORM [5], a video system which captures video data in an object DBMS. The V-STORM model considers video data from different perspectives (represented by class hierarchies): physical (as a BLOB), structural (a video is made up of shots, which are themselves composed of scenes, which can be split into sequences), compositional (for editing new videos using data already stored in the database), and semantic (through an annotation, a video segment is linked to a database object or a keyword). V-STORM uses and extends the O2 object DBMS and comes with tools for formulating queries on videos, composing a video using the results of queries, and generating video abstracts. V-STORM can play videos (or segments thereof) from its database, but also virtual videos (or segments thereof) composed through an O2 interface. Moreover, it is possible to use V-STORM as a multimedia player for presentations described using the SMIL [6] standard. This way, V-STORM can be classified in the family of multimedia presentation software like GriNS [7] or RealNetworks G2 [8].
We show here how AROM [9], an object-based knowledge representation system, can be used to help a V-STORM user build, in a more declarative way, a multimedia presentation by instantiating a knowledge base rather than by writing a SMIL file. Then, we show how both the spatial and temporal consistency of a multimedia presentation can be maintained by AROM.
The paper is organized as follows: sections 2 and 3 present respectively the V-STORM and AROM systems; section 4 describes the AVS model, an AROM knowledge base which corresponds to a general multimedia presentation structure; section 5 discusses related work before we conclude in section 6.
V-STORM differentiates between the raw video stored in the database and the video which is watched and manipulated by end-users. From a user's point of view, a video is a continuous medium which can be played, stopped, paused, etc. From a DBMS point of view, a video is a complex object composed of an ordered sequence of frames, each having a fixed display time. This way, new virtual videos can be created using frames from different segments of videos.
Figure 1. The V-STORM architecture. Through the video composer interface, a video is described by the user and translated into OQL so that its component video segments can be sought in the O2 video database. Then, the video is played by V-STORM.
In V-STORM, the Object Query Language (OQL) [10] is used (see Figure 1) to extract video segments and compose virtual videos. Video query expressions are stored in the database and the final video is generated at presentation time. This approach avoids data replication. A video query returns either a video interval, which is a continuous sequence of frames belonging to the same video, or a whole video, or an excerpt of a raw video (by combination of the two previous cases), or a logical extract of a video stemming from various raw videos.
Video composition in V-STORM is achieved using a set of algebraic operators. This way, a virtual video can be the result of the concatenation, the concatenation without duplication (union), the intersection, or the difference of two videos, or, as well, the reduction (by elimination of duplicate segments) or the finite repetition of a single video. Annotations in V-STORM are used to describe salient objects or events appearing in the video. They can be declared at each level of the video hierarchy. Annotations are manually created by the users through an annotation tool. V-STORM also integrates an algorithm to automatically generate video abstracts. Video abstracts aim at optimizing the time spent watching a video in search of a particular segment. The user has to provide some information concerning the expected abstract: its source (one or more videos), its duration, its structure (which reflects the structure of the video), and its granularity (since some video segments might be more relevant than others). Finally, in order to open V-STORM to multimedia presentation standardization, we have developed a SMIL parser (see Figure 2) so that V-STORM can read a SMIL document and play the corresponding presentation. Also, interactivity is possible, since V-STORM handles the presence of anchors for hypermedia links during presentations.
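The algebraic composition operators described above can be sketched over a toy representation in which a virtual video is simply an ordered list of frame identifiers. This is a hypothetical illustration, not the V-STORM implementation (whose virtual videos are stored as OQL query expressions, and whose reduction works on segments rather than individual frames):

```python
# Toy sketch of V-STORM-style video algebra: a virtual video is modeled
# here as an ordered list of frame identifiers (an assumption made for
# illustration only).

def concatenation(a, b):
    """Append video b after video a."""
    return a + b

def union(a, b):
    """Concatenation without duplication: keep frames of b not already in a."""
    return a + [f for f in b if f not in a]

def intersection(a, b):
    """Frames of a that also appear in b, in a's order."""
    return [f for f in a if f in b]

def difference(a, b):
    """Frames of a that do not appear in b."""
    return [f for f in a if f not in b]

def reduction(a):
    """Eliminate duplicates, keeping the first occurrence of each frame."""
    seen, out = set(), []
    for f in a:
        if f not in seen:
            seen.add(f)
            out.append(f)
    return out

def repetition(a, n):
    """Finite repetition of a video."""
    return a * n

v1 = ["f1", "f2", "f3"]
v2 = ["f2", "f4"]
print(union(v1, v2))         # ['f1', 'f2', 'f3', 'f4']
print(intersection(v1, v2))  # ['f2']
```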
Figure 2. V-STORM can also be used as a multimedia presentation player. The presentation is described in SMIL, sent to a parser and played by the V-STORM video player.
The parser checks the validity of the SMIL document against the SMIL DTD (extended to support the new temporal operations carried out by V-STORM). Then the different SMIL elements are translated into V-STORM commands and the video is displayed. Currently, this parser is limited and does not exploit all the V-STORM functionalities concerning operations on videos. The work presented here extends the description of SMIL-like multimedia presentations in order to better exploit V-STORM capabilities.
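To illustrate the seq/par timing semantics such a parser has to implement, here is a minimal, hypothetical sketch (not the V-STORM parser): children of a seq element start one after another, children of a par element start together. Real SMIL clock values such as "10s" are simplified here to bare numbers:

```python
# Toy translation of SMIL <seq>/<par> containers into timed start commands.
# Assumption: "dur" holds a bare number of seconds (real SMIL uses clock
# values like "10s"); this is a sketch, not the actual V-STORM parser.
import xml.etree.ElementTree as ET

SMIL = """
<smil><body>
  <seq>
    <video src="intro.mpg" dur="10"/>
    <par>
      <video src="talk.mpg" dur="20"/>
      <text src="slides.txt" dur="20"/>
    </par>
  </seq>
</body></smil>
"""

def schedule(node, t0=0.0):
    """Return (start commands, end time) for a SMIL timing subtree."""
    if node.tag == "seq":
        cmds, t = [], t0
        for child in node:          # children play one after another
            c, t = schedule(child, t)
            cmds += c
        return cmds, t
    if node.tag == "par":
        cmds, end = [], t0
        for child in node:          # children all start at t0
            c, t = schedule(child, t0)
            cmds += c
            end = max(end, t)       # the par ends with its last child
        return cmds, end
    dur = float(node.get("dur", 0))
    return [(t0, node.get("src"))], t0 + dur

body = ET.fromstring(SMIL).find("body")
cmds, total = schedule(body.find("seq"))
print(cmds, total)
# [(0.0, 'intro.mpg'), (10.0, 'talk.mpg'), (10.0, 'slides.txt')] 30.0
```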
Object-Based Knowledge Representation Systems (OBKRS) are known to be declarative systems for describing, organizing and processing large amounts of knowledge. In these systems [11], once built, a knowledge base (KB) can be exploited through various and powerful inference mechanisms such as classification, method calls, default values, filters, etc. AROM (which stands for Associating Relations and Objects for Modeling) is a new OBKRS which departs from others in two ways. First, in addition to classes (and objects), which often constitute the sole and central representation entities in OBKRS, AROM uses associations (and tuples), similar to those found in UML [12], to describe and organize links between objects having a common structure and semantics. Second, in addition to the classical OBKRS inference mechanisms, AROM integrates an algebraic modeling language (AML) for expressing operational knowledge in a declarative way. The AML is used to write constraints, queries, and numerical and symbolic equations involving the various elements of a KB.
A class in AROM describes a set of objects sharing common properties and constraints. Each class is characterized by a set of properties called variables and by a set of constraints. A variable denotes a property whose basic type is not a class of the KB. Each variable is characterized by a set of facets (domain restriction facets, inference facets, and documentation facets). Expressed in the AML, constraints are necessary conditions for an object to belong to the class. Constraints bind together variables of – or reachable from – the class. The generalization/specialization relation is a partial order which organizes classes in a hierarchy supported by a single inheritance mechanism. An AROM object represents a distinguishable entity of the modeled domain. Each object is attached to exactly one class.
In AROM, as in UML, an association represents a set of similar links between n (n ≥ 2) classes, distinct or not. A link contains objects of the classes (one for each class) connected by the association. An association is described by means of roles, variables and constraints. A role corresponds to the connection between an association and one of the classes it connects. Each role has a multiplicity, whose meaning is the same as in UML. A variable of an association denotes a property associated with a link and has the same set of available facets as a class variable. A tuple of an n-ary association having m variables v_i (1 ≤ i ≤ m) is the (n+m)-tuple made up of the n objects of the link and of the m values of the variables of the association. A tuple is an "instance" of an association. Association constraints involve variables or roles, are written in the AML, and must be satisfied by every tuple of the association. Associations are organized in specialization hierarchies. See Figures 3 and 4 for a textual and a graphical sketch of an AROM KB dedicated to multimedia presentations.
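The notions of links and role multiplicities can be sketched as follows. This is hypothetical code, not the AROM Java API; it uses a deliberately simplified reading of multiplicity (bounding how many links a single object may take part in) and only checks objects that appear in at least one link:

```python
# Toy miniature of an AROM-style association (an illustrative assumption,
# not the real AROM system): a link connects one object per role, and each
# role's multiplicity bounds how many links one object may participate in.

class Association:
    def __init__(self, roles):
        # roles: {role_name: (min, max)}; max=None stands for '*'
        self.roles = roles
        self.links = []              # each link: {role_name: object}

    def add_link(self, **objects):
        self.links.append(objects)

    def check_multiplicity(self):
        """Report (role, object, count) triples that violate the bounds."""
        errors = []
        for role, (lo, hi) in self.roles.items():
            counts = {}
            for link in self.links:
                obj = link[role]
                counts[obj] = counts.get(obj, 0) + 1
            for obj, n in counts.items():
                if n < lo or (hi is not None and n > hi):
                    errors.append((role, obj, n))
        return errors

# A CBE-like association between blocks and elements, both roles 0..*
cbe = Association({"block": (0, None), "element": (0, None)})
cbe.add_link(block="b1", element="e1")
cbe.add_link(block="b1", element="e2")
print(cbe.check_multiplicity())  # [] : no multiplicity violation
```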
First introduced in Operations Research, algebraic modeling languages (AMLs) make it possible to write systems of equations and/or constraints in a formalism close to mathematical notation. They support the use of indexed variables and expressions, quantifiers, and iterated operators like Σ (sum) and Π (product), in order to build compact algebraic expressions. AMLs have been used for linear and non-linear programming, for discrete-time simulation, and recently for constraint programming [13]. In AROM, the AML is used for writing equations, constraints, and queries. AML expressions are built from the following elements: constants, indices and indexed expressions, operators and functions, iterated operators, quantified expressions, variables belonging to classes and associations, and expressions for accessing the tuples of an association. An AML interpreter solves systems of (non-simultaneous) equations and processes queries. Written in Java 1.2, AROM is available as a platform for knowledge representation and exploitation. It comprises an interactive modeling environment, which allows one to create, consult, and modify an AROM KB; a Java API; an interpreter for processing queries and solving sets of (non-simultaneous) equations written in AML; and WebAROM, a tool for consulting and editing a KB through a Web browser.
As mentioned above, multimedia scenarios played by V-STORM can be described using SMIL. The starting point of this study is twofold. We aim first at providing a UML-like model in order to ease the description of a multimedia presentation and, second, at reinforcing consistency regarding spatial and especially temporal constraints between the components of a multimedia presentation. It is our conviction that SMIL, like XML [14], is not an intuitive knowledge representation language, and that one needs to be familiar with its syntax before being able to read or write a document and understand its structure. So, we propose the AVS (AROM/V-STORM) model (see Figures 3 and 4), which consists of an AROM knowledge base whose structure incorporates any SMIL element used in the description of a multimedia presentation. This way, we provide a V-STORM user with an operational UML-like model for describing her multimedia presentation. Using an entity/relation (or class/association) approach for modeling is now widely accepted, and UML has become a standard here. Through the AROM Interactive Modeling Environment, the graphical representation of the classes and associations which constitute the AVS model gives the user a more intuitive idea of the structure of her presentation. Moreover, taking advantage of AROM's AML and type checking, the user can be informed about the spatial and temporal consistency of her presentation.
Since V-STORM can play any presentation described with SMIL, our AROM model for multimedia presentations is SMIL compliant. This means that it incorporates classes and associations corresponding to every element that can be found in the structure of a SMIL document. However, the main objective of the AVS model is to give the user the opportunity to invoke any kind of operation V-STORM can perform on a video.
class: RootLayout
  variables:
    variable: b_color
      type: string
    variable: title
      type: string
    variable: height
      type: integer
    variable: width
      type: integer

class: Region
  variables:
    variable: b_color
      type: string
    variable: fit
      type: string
    variable: title
      type: string
    variable: top
      type: integer
    variable: height
      type: integer
    variable: width
      type: integer

class: CommonAttributes
  variables:
    variable: abstract
      type: string
    variable: author
      type: string
    variable: begin
      type: float
      default: 0
    variable: end
      type: float
      definition: end=begin+dur
    variable: dur
      type: float
      default: 0
      definition: dur=end-begin
    variable: region
      type: string
    variable: repeat
      type: integer
    variable: s_bitrate
      type: integer
    variable: s_caption
      type: boolean
    variable: s_language
      type: list-of string
      cardinality: min: 0 max: *

class: Block
  super-class: CommonAttributes
  variables:
    variable: sync
      type: string
      default: "seq"
    variable: endsync
      type: string

class: Element
  super-class: CommonAttributes
  variables:
    variable: media
      type: string
    variable: src
      type: string
    variable: alt
      documentation: "specifies an alternate text, if the media can not be displayed"
      type: string
    variable: fill
      documentation: "if fill=true then freeze else remove"
      type: boolean

association: CBE
  roles:
    role: block
      type: Block
      multiplicity: min: 0 max: *
    role: element
      type: Element
      multiplicity: min: 0 max: *
Figure
3. An excerpt of the AROM textual description showing 7 classes and 1
association of the AVS model for multimedia presentation. In the CommonAttributes abstract class, a
definition is given for the end
and dur variables using the
AML.
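As an illustration of what such AML definitions provide, the following sketch (hypothetical code, not AROM's AML interpreter) derives whichever of begin, end and dur is missing from the other two, using end = begin + dur and dur = end - begin:

```python
# Toy resolution of the CommonAttributes temporal definitions
# (an illustrative sketch; AROM's actual AML interpreter is more general).

def complete_timing(begin=None, end=None, dur=None):
    """Fill in whichever of begin/end/dur can be derived from the others."""
    if end is None and begin is not None and dur is not None:
        end = begin + dur        # end = begin + dur
    if dur is None and begin is not None and end is not None:
        dur = end - begin        # dur = end - begin
    if begin is None and end is not None and dur is not None:
        begin = end - dur        # by rearranging end = begin + dur
    return begin, end, dur

print(complete_timing(begin=2.0, dur=8.0))   # (2.0, 10.0, 8.0)
print(complete_timing(begin=2.0, end=10.0))  # (2.0, 10.0, 8.0)
```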
Figure 4. A view of the Interactive Modeling Environment through which the AVS model can be instantiated. On the left, a view of the class and association hierarchies. On the right, the UML-like graphical description of the model. A textual description is automatically generated. For instance, one can find the corresponding textual description of the class Element in Figure 3.
In the AVS model, the various features of a multimedia presentation are modeled using classes and associations. The class Presentation gives the most general information about the multimedia presentation. The spatial formatting, which describes the way displayable objects are placed in the presentation window, is described by objects of the Layout class, in accordance with the SMIL recommendation. When a presentation gathers more than one layout, V-STORM chooses the first layout that matches the user preferences. This way, V-STORM permits some adaptability concerning the characteristics of the machine on which the presentation is played. A layout can be associated with a root-layout and several regions (described respectively by the classes RootLayout and Region and the associations HasRootLayout and HasRegion) where the media objects appear. Concerning the time model, a V-STORM presentation is made up of blocks. Each block can contain other blocks and/or media objects. Basic media objects supported by V-STORM are continuous media with an intrinsic duration (video, audio…) or discrete media without an intrinsic duration (text, image…). The variable sync in the Block class determines the temporal behavior (namely parallel or sequential presentation) of the elements in the block, depending on its value, seq or par. Three temporal attributes can be associated with a media object or a block: its duration (variable dur) and its begin and end times (variables begin and end). When no value is specified for these variables, the duration of a discrete object is null and the duration of a continuous object is its natural duration. The semantics concerning the effective begin of objects linked to a parallel or sequential block is the same as the one defined in the SMIL recommendation. Also, every date associated with an object must be defined as a float value. This is not a limitation, since the model allows one to associate with a media object a set of reaction methods (start, end, load…) in response to events (click, begin, end…) triggered by other objects. Compared with an authoring language like GRiNS, this event-reaction mechanism offers more synchronization possibilities between objects and, through the AVS model, it is easier and more intuitive to express a temporal scenario.
The knowledge base contains a Switch class in charge of adapting the presentation to the system capabilities and settings. The variables found in this class (s_bitrate, s_caption, s_language…) are equivalent to the attributes of the switch element in SMIL. The player will play the first element of the switch acceptable for presentation. Finally, the two kinds of navigational links proposed by SMIL (a and anchor), which allow interactivity during a presentation, are represented in the knowledge base by the A_Link and Anchor_Link classes. The power of the event-reaction mechanism implemented in V-STORM allows an author to define more powerful and intuitive user interaction possibilities than in SMIL. For instance, a media object can start some time after the click on another object, and it can end just after the load of a new object.
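The event-reaction idea can be sketched as a simple observer pattern. This is hypothetical code for illustration; the paper does not detail the actual V-STORM mechanism:

```python
# Toy event-reaction sketch (an assumption made for illustration, not the
# V-STORM implementation): media objects register reactions to events
# raised by other objects.
from collections import defaultdict

class Media:
    """A media object that can raise events and react to events of others."""
    def __init__(self, name):
        self.name = name
        self.reactions = defaultdict(list)   # event name -> callbacks
        self.log = []                        # events this object has raised

    def on(self, event, reaction):
        """Register a reaction to an event raised by this object."""
        self.reactions[event].append(reaction)

    def fire(self, event):
        """Raise an event and trigger every registered reaction."""
        self.log.append(event)
        for react in self.reactions[event]:
            react()

video = Media("video")
caption = Media("caption")
# start the caption when the video is clicked, end it when the video ends
video.on("click", lambda: caption.fire("start"))
video.on("end", lambda: caption.fire("end"))
video.fire("click")
video.fire("end")
print(caption.log)  # ['start', 'end']
```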
To build a multimedia presentation, a V-STORM user simply has to instantiate the AROM KB. For a local KB, this can be done either by using the AROM Interactive Modelling Environment (see Figures 4 and 5), by completing the ASCII document describing the KB (as in Figure 3), or by using the Java API of AROM in a program. For a distant KB, this can be done through a web browser using WebAROM. Since this instantiation is made under the control of AROM (type checking, multiplicity constraint satisfaction, …), both the spatial and temporal consistency of the described presentation are guaranteed.
Figure 5. The editor for instantiating the AVS model
Once this instantiation is performed, an AROM-SMIL parser we have written is launched and the resulting SMIL file is sent to the SMIL parser of V-STORM (see Figure 6).
Figure 6. The
architecture of the AROM/V-STORM coupling
The coupling between V-STORM and AROM combines the video management richness of the former with the expressive and modelling power of the latter. Compared to the classical specification and presentation of multimedia documents, this coupling offers several advantages.
UML-like description: The AVS model is described in a graphical notation close to UML. Object-oriented analysis and design methods have shown the relevance of using graphical notations to improve communication between all the actors of a design process (for instance, collaborating authors).
Modularity and reuse: The author can edit parts of the presentation independently and group them to compose her documents, just by manipulating AROM objects. This object approach allows the reuse of existing blocks to compose new presentations, saving a large amount of work in the design phase.
Object identity: In an AROM KB, each object has a unique identifier. This property has been exploited to prevent inconsistencies due to the assignment of the same identifier to two different media objects. For the names given to regions, for instance, the existence of such names is checked.
Consistency maintenance: When a presentation contains inconsistencies (for instance, when it states that an object B starts at the end of an object A, an object C starts at the end of B, and C starts at the same time as A), classical multimedia systems ignore these inconsistencies or merely warn about them at play time. Here, temporal checking is performed by AROM during the construction of the presentation and the author is warned about such inconsistencies. This ensures that a consistent document is sent to the presentation system. This static checking also allows us to obtain a global trace of the presentation or a timeline view, which aligns all events on a single time axis.
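This kind of static check can be sketched as constraint propagation over equalities of the form start(X) = start(Y) + offset. The code below is hypothetical (not AROM's constraint engine) and encodes the A/B/C scenario above, assuming each object lasts 10 seconds:

```python
# Toy static temporal consistency check (illustrative sketch only):
# each constraint (x, y, off) means time[x] = time[y] + off.  Times are
# propagated through the constraint graph, then every constraint is
# re-verified; a violated constraint reveals an inconsistent scenario.
from collections import defaultdict, deque

def check_temporal_consistency(constraints):
    """Return [] if consistent, else the list of violated constraints."""
    graph = defaultdict(list)
    for x, y, off in constraints:
        graph[y].append((x, off))    # time[x] = time[y] + off
        graph[x].append((y, -off))
    times = {}
    for start in graph:              # one BFS per connected component
        if start in times:
            continue
        times[start] = 0.0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, off in graph[u]:
                if v not in times:
                    times[v] = times[u] + off
                    queue.append(v)
    return [(x, y, off) for x, y, off in constraints
            if abs(times[x] - times[y] - off) > 1e-9]

# B starts when A ends, C starts when B ends, yet C starts with A:
bad = check_temporal_consistency([
    ("startB", "startA", 10),    # startB = endA = startA + 10
    ("startC", "startB", 10),    # startC = endB = startB + 10
    ("startC", "startA", 0),     # startC = startA -> contradiction
])
print(bool(bad))  # True: the scenario is inconsistent
```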
Virtual videos: In addition to raw videos, the author can include virtual videos in the presentation. They correspond to video objects having no value for their src variable. Associations (Extraction, Reduction, Repetition, BinaryOperation) corresponding to the V-STORM operations for creating virtual videos have been introduced in the KB. Once these associations are instantiated, their tuples link a virtual video to the video(s) (raw or virtual) it is derived from.
Keywords and video abstracts: It is possible to use the keywords variable to annotate a video and to formulate queries on its content. Moreover, the model includes an AbstractOf association in order to link an abstract (an object of the VAbstract class), possibly having a given duration, to a video. Thus, the video can be replaced by its abstract during the presentation. A VAbstract object can be created manually or automatically using the AROM API and the V-STORM video abstract generator.
For a complete comparison of V-STORM with other multimedia projects, one can refer to [15]. Among numerous research works on authoring and presentation environments for interactive multimedia documents, Madeus [16] is a very complete environment with a graphical authoring interface and a spatial formatting editor. Madeus is based on a constraint-based approach. It offers flexibility for the frequent scenario modifications carried out by the author before reaching the desired scenario, a coupling between the editing and presentation processes, and an incremental editing process which consists in readjusting the solution each time the author adds or deletes a constraint. Constraint propagation maintains the consistency of the new scenario: at each editing step, the author is sure of having a consistent scenario. Our AVS model also relies on a similar approach, since AROM integrates a constraint solver. The AML allows the expression of constraints involving classes, associations, objects or tuples. But for authoring, we put the emphasis on a yet more declarative approach through the use of a UML-like model in which constraints are implicitly embedded into temporal and spatial operators. Also, unlike V-STORM, other presentation tools pay little attention to video data type management. Finally, to our knowledge, this study is the first attempt to benefit from the expressive power of an object-based knowledge representation system to describe and check the consistency of a multimedia presentation.
This paper presents a first attempt to couple an object-based knowledge representation system (OBKRS) called AROM with a multimedia presentation authoring tool named V-STORM. This coupling has three main results. First, the multimedia presentation scenario can be modelled using the UML class diagram-like description of AROM, which has shown to be more intuitive than a SMIL file. Second, the inference and consistency engines of AROM check the validity of the presentation. Third, the richness of the V-STORM video operators is better exploited. The AROM KB proposed here, called AVS, is a generic model for multimedia presentations. The classes and associations of the AVS model just have to be instantiated to create an effective multimedia presentation. Notably, this model incorporates every characteristic of the SMIL elements for describing how to arrange media objects in a scenario. A parser has been written to translate such an AROM KB into a SMIL document. In turn, this SMIL document is parsed by V-STORM and the presentation is played. This work is only at its beginning, but three main directions are already privileged. The first one concerns the integration of the V-STORM video query language into the AROM model. The idea here is to replace the OQL query language with the algebraic modelling language of AROM. Eventually, a parser will directly connect AROM to V-STORM, without having recourse to the existing AROM/SMIL parser. Second, a graphical timeline interface could help the user during the authoring process to interactively control the changes made to her multimedia document through real-time support. Third, parallel work we are carrying out [17] for a better use of database capabilities in the context of Web presentations could be integrated within the AVS model.
[1] A.
Laursen, J. Olkin and M. Porter, Oracle Media Server: providing consumer
interactive access to Multimedia data, SIGMOD, 1994.
[2] K.
Nwosu, B. Thuraisingham and B. Berra, Multimedia Database Systems: design and
implementation strategies, Kluwer Academic Publishers, 1996.
[3] B.
Ozden, R. Rastogori and A. Silberschatz, Multimedia Database Systems, Issues
and Research Directions, Springer-Verlag, 1996.
[4] R.
Weiss, A. Duda and D. Gifford, Composition and Search with a Video Algebra,
IEEE multimedia, pp 12-25, Springer Ed., 1995.
[5] R.
Lozano, M. Adiba, F. Mocellin and H. Martin, An Object DBMS for Multimedia
Presentations including Video Data, Proc. of ECOOP'98 Workshop Reader, Springer
Verlag, Lecture Notes in Computer Science, 1543, 1998.
[6] W3C
Recommendation: Synchronized Multimedia Integration Language (SMIL) 1.0
Specification http://www.w3.org/TR/REC-smil
[7] GriNS
Authoring Software, http://www.oratrix.com/GRiNS/index.html
[8] RealNetworks
G2, http://www.realnetworks.com
[9] M. Page, J. Gensel, C. Capponi, C. Bruley, P. Genoud, D. Ziébelin, D. Bardou and V. Dupierris, A New Approach in Object-Based Knowledge Representation: the AROM System, IEA/AIE-2001, June 4-7, Budapest, Hungary, 2001, http://www.inrialpes.fr/romans/arom
[10] R.G.G. Cattell and D. Barry, The Object Database Standard: ODMG 2.0, Morgan Kaufmann, 1997.
[11] R. J.
Brachman and J. G. Schmolze, An Overview of the KL-ONE Knowledge Representation
System, Communications of the ACM, 31 (4), pp. 382-401, 1988.
[12] J.
Rumbaugh, I. Jacobson and G. Booch, The Unified Modeling Language Reference
Manual., Addison-Wesley, 1999.
[13] P.
Van Hentenryck, The OPL Optimization Programming Language, MIT Press, 1999.
[14] W3C Recommendation: Extensible Markup Language (XML) 1.0 (Second Edition) http://www.w3.org/TR/REC-xml
[15] R. Lozano, Intégration de données video dans un SGBD à objets, PhD Thesis (in French), Joseph Fourier University, Grenoble, France, 2000.
[16] M.
Jourdan, N. Layaïda, C. Roisin, L. Sabry-Ismaïl and L. Tardif, Madeus, an
Authoring Environment for Interactive Multimedia Documents, in ACM Multimedia,
pp 267-272, Bristol, UK, 1998.
[17] Mulhem
and H. Martin, From Database to Web multimedia Documents, in Journal of
Multimedia Tools and Applications (to appear), 2001.