An Object-Based Knowledge Representation
Approach
for Multimedia Presentations
Abdelmadjid Ketfi, Jérôme Gensel, Hervé Martin
Grenoble, France
Abstract - This paper deals with the coupling of V-STORM, which is both a video manager and a multimedia presentation system, with AROM, an object-based knowledge representation system. We present here an AROM knowledge base, called the AVS model, which constitutes a generic model for multimedia presentations. This model encompasses any multimedia presentation described using the SMIL standard. By instantiating this knowledge base, the author describes his/her multimedia presentation and the way media objects interact in it. Then, the corresponding SMIL file is generated and sent to V-STORM in order to be played. This coupling proves relevant for two reasons: first, through its UML-like formalism, AROM eases the task of a multimedia presentation author; second, AROM is put in charge of checking the spatial and temporal consistency of the presentation during its description. This way, a consistent presentation is sent to V-STORM.
Keywords – Multimedia presentations,
Videos, Knowledge Representation, SMIL
In the last decade, multimedia and, more particularly, video systems have benefited from tremendous research interest. The main reason for this is the increasing ability of computers to support video data, notably thanks to unceasing improvements in data compression formats (such as MPEG-4 and MPEG-7), in network transfer rates and operating systems [1], and in disk storage capacity. Unsurprisingly, new applications have arisen, such as video on demand, video conferencing and home video editing, which directly benefit from this evolution. Following this trend, research efforts ([2], [3]) have been made to extend DataBase Management Systems (DBMS) so that they support video data types, and not simply through Binary Large Objects (BLOB). Indeed, DBMS seem to be well-suited systems for tackling the problems posed by video, namely storage, modeling, querying and presentation. Video data types must be physically managed apart from other conventional data types in order to fulfill their performance requirements. Video modeling must take into account the hierarchical structure of a video (shots, scenes and sequences) and allow overlapping and disjoint segment clustering [4]. The video query language must allow one to query video content using textual annotations or computed signatures (color, shape, texture, etc.), deal with the dynamics (movements) of objects in the scenes as well as with the semi-structural aspects of videos and, finally, offer the possibility of creating new videos.
We have designed and implemented V-STORM [5], a video system which captures video data in an object DBMS. The V-STORM model considers video data from different perspectives (represented by class hierarchies): physical (as a BLOB), structural (a video is made up of shots, which are themselves composed of scenes, which can be split into sequences), compositional (for editing new videos using data already stored in the database), and semantic (through an annotation, a video segment is linked to a database object or a keyword). V-STORM uses and extends the O2 object DBMS and comes with tools for formulating queries on videos, composing a video using the results of queries, and generating video abstracts. V-STORM can play videos (or segments thereof) from its database, but also virtual videos (or segments thereof) composed through an O2 interface. Moreover, it is possible to use V-STORM as a multimedia player for presentations described using the SMIL [6] standard. This way, V-STORM can be classified in the family of multimedia presentation software like GriNS [7] or RealNetworks G2 [8].
We show here how AROM [9], an object-based knowledge representation system, can be used to help a V-STORM user build, in a more declarative way, a multimedia presentation by instantiating a knowledge base rather than by writing a SMIL file. Then, we show how both the spatial and temporal consistency of a multimedia presentation can be maintained by AROM.
The paper is organized as follows: sections 2 and 3 present respectively the V-STORM and AROM systems; section 4 describes the AVS model, an AROM knowledge base which corresponds to a general multimedia presentation structure; section 5 discusses related work before we conclude in section 6.
V-STORM differentiates between the raw video stored in the database and the video which is watched and manipulated by end-users. From a user's point of view, a video is a continuous medium which can be played, stopped, paused, etc. From a DBMS point of view, a video is a complex object composed of an ordered sequence of frames, each having a fixed display time. This way, new virtual videos can be created using frames from different segments of videos.
Figure 1. The V-STORM architecture. Through the video composer interface, a video is described by the user and translated into OQL so that its component video segments can be sought in the O2 video database. Then, the video is played by V-STORM.
In V-STORM, the Object Query Language (OQL) [10] is used (see Figure 1) to extract video segments and compose virtual videos. Video query expressions are stored in the database and the final video is generated at presentation time. This approach avoids data replication. A video query returns either a video interval, which is a continuous sequence of frames belonging to the same video, or a whole video, or an excerpt of a raw video (by combination of the two previous cases), or a logical extract of a video stemming from various raw videos.
Video composition in V-STORM is achieved using a set of algebraic operators. This way, a virtual video can be the result of the concatenation, the concatenation without duplication (union), the intersection, or the difference of two videos, or, as well, the reduction (by elimination of duplicate segments) or the finite repetition of a single video. Annotations in V-STORM are used to describe salient objects or events appearing in the video. They can be declared at each level of the video hierarchy. Annotations are manually created by the users through an annotation tool. V-STORM also integrates an algorithm to automatically generate video abstracts. Video abstracts aim at optimizing the time spent watching a video in search of a particular segment. The user has to provide some information concerning the expected abstract: its source (one or more videos), its duration, its structure (which reflects the structure of the video), and its granularity (since some video segments might be more relevant than others). Finally, in order to open V-STORM to multimedia presentation standardization, we have developed a SMIL parser (see Figure 2) so that V-STORM can read a SMIL document and play the corresponding presentation. Also, interactivity is possible, since V-STORM handles the presence of anchors for hypermedia links during presentations.
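The algebraic composition operators described above can be sketched over a toy representation in which a virtual video is simply an ordered list of frame identifiers. This is a hypothetical illustration, not the V-STORM implementation (whose virtual videos are stored as OQL query expressions, and whose reduction works on segments rather than individual frames):

```python
# Toy sketch of V-STORM-style video algebra: a virtual video is modeled
# here as an ordered list of frame identifiers (an assumption made for
# illustration only).

def concatenation(a, b):
    """Append video b after video a."""
    return a + b

def union(a, b):
    """Concatenation without duplication: keep frames of b not already in a."""
    return a + [f for f in b if f not in a]

def intersection(a, b):
    """Frames of a that also appear in b, in a's order."""
    return [f for f in a if f in b]

def difference(a, b):
    """Frames of a that do not appear in b."""
    return [f for f in a if f not in b]

def reduction(a):
    """Eliminate duplicates, keeping the first occurrence of each frame."""
    seen, out = set(), []
    for f in a:
        if f not in seen:
            seen.add(f)
            out.append(f)
    return out

def repetition(a, n):
    """Finite repetition of a video."""
    return a * n

v1 = ["f1", "f2", "f3"]
v2 = ["f2", "f4"]
print(union(v1, v2))         # ['f1', 'f2', 'f3', 'f4']
print(intersection(v1, v2))  # ['f2']
```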
Figure 2. V-STORM can also be used as a multimedia presentation player. The presentation is described in SMIL, sent to a parser and played by the V-STORM video player.
The parser checks the validity of the SMIL document against the SMIL DTD (extended to support the new temporal operations carried out by V-STORM). Then the different SMIL elements are translated into V-STORM commands and the video is displayed. Currently, this parser is limited and does not exploit all the V-STORM functionalities concerning operations on videos. The work presented here extends the description of SMIL-like multimedia presentations in order to better exploit V-STORM capabilities.
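To illustrate the seq/par timing semantics such a parser has to implement, here is a minimal, hypothetical sketch (not the V-STORM parser): children of a seq element start one after another, children of a par element start together. Real SMIL clock values such as "10s" are simplified here to bare numbers:

```python
# Toy translation of SMIL <seq>/<par> containers into timed start commands.
# Assumption: "dur" holds a bare number of seconds (real SMIL uses clock
# values like "10s"); this is a sketch, not the actual V-STORM parser.
import xml.etree.ElementTree as ET

SMIL = """
<smil><body>
  <seq>
    <video src="intro.mpg" dur="10"/>
    <par>
      <video src="talk.mpg" dur="20"/>
      <text src="slides.txt" dur="20"/>
    </par>
  </seq>
</body></smil>
"""

def schedule(node, t0=0.0):
    """Return (start commands, end time) for a SMIL timing subtree."""
    if node.tag == "seq":
        cmds, t = [], t0
        for child in node:          # children play one after another
            c, t = schedule(child, t)
            cmds += c
        return cmds, t
    if node.tag == "par":
        cmds, end = [], t0
        for child in node:          # children all start at t0
            c, t = schedule(child, t0)
            cmds += c
            end = max(end, t)       # the par ends with its last child
        return cmds, end
    dur = float(node.get("dur", 0))
    return [(t0, node.get("src"))], t0 + dur

body = ET.fromstring(SMIL).find("body")
cmds, total = schedule(body.find("seq"))
print(cmds, total)
# [(0.0, 'intro.mpg'), (10.0, 'talk.mpg'), (10.0, 'slides.txt')] 30.0
```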
Object-Based Knowledge Representation Systems (OBKRS) are known to be declarative systems for describing, organizing and processing large amounts of knowledge. In these systems [11], once built, a knowledge base (KB) can be exploited through various and powerful inference mechanisms such as classification, method calls, default values, filters, etc. AROM (which stands for Associating Relations and Objects for Modeling) is a new OBKRS which departs from others in two ways. First, in addition to classes (and objects), which often constitute the sole and central representation entities in OBKRS, AROM uses associations (and tuples), similar to those found in UML [12], to describe and organize links between objects having a common structure and semantics. Second, in addition to the classical OBKRS inference mechanisms, AROM integrates an algebraic modeling language (AML) for expressing operational knowledge in a declarative way. The AML is used to write constraints, queries, and numerical and symbolic equations involving the various elements of a KB.
A class in AROM describes a set of objects sharing common properties and constraints. Each class is characterized by a set of properties called variables and by a set of constraints. A variable denotes a property whose basic type is not a class of the KB. Each variable is characterized by a set of facets (domain restriction facets, inference facets, and documentation facets). Expressed in the AML, constraints are necessary conditions for an object to belong to the class. Constraints bind together variables of – or reachable from – the class. The generalization/specialization relation is a partial order which organizes classes in a hierarchy supported by a single inheritance mechanism. An AROM object represents a distinguishable entity of the modeled domain. Each object is attached to exactly one class.
In AROM, as in UML, an association represents a set of similar links between n (n ≥ 2) classes, distinct or not. A link contains objects of the classes (one for each class) connected by the association. An association is described by means of roles, variables and constraints. A role corresponds to the connection between an association and one of the classes it connects. Each role has a multiplicity, whose meaning is the same as in UML. A variable of an association denotes a property associated with a link and has the same set of available facets as a class variable. A tuple of an n-ary association having m variables v_i (1 ≤ i ≤ m) is the (n+m)-tuple made up of the n objects of the link and of the m values of the variables of the association. A tuple is an "instance" of an association. Association constraints involve variables or roles, are written in the AML, and must be satisfied by every tuple of the association. Associations are organized in specialization hierarchies. See Figures 3 and 4 for a textual and a graphical sketch of an AROM KB dedicated to multimedia presentations.
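The notions of links and role multiplicities can be sketched as follows. This is hypothetical code, not the AROM Java API; it uses a deliberately simplified reading of multiplicity (bounding how many links a single object may take part in) and only checks objects that appear in at least one link:

```python
# Toy miniature of an AROM-style association (an illustrative assumption,
# not the real AROM system): a link connects one object per role, and each
# role's multiplicity bounds how many links one object may participate in.

class Association:
    def __init__(self, roles):
        # roles: {role_name: (min, max)}; max=None stands for '*'
        self.roles = roles
        self.links = []              # each link: {role_name: object}

    def add_link(self, **objects):
        self.links.append(objects)

    def check_multiplicity(self):
        """Report (role, object, count) triples that violate the bounds."""
        errors = []
        for role, (lo, hi) in self.roles.items():
            counts = {}
            for link in self.links:
                obj = link[role]
                counts[obj] = counts.get(obj, 0) + 1
            for obj, n in counts.items():
                if n < lo or (hi is not None and n > hi):
                    errors.append((role, obj, n))
        return errors

# A CBE-like association between blocks and elements, both roles 0..*
cbe = Association({"block": (0, None), "element": (0, None)})
cbe.add_link(block="b1", element="e1")
cbe.add_link(block="b1", element="e2")
print(cbe.check_multiplicity())  # [] : no multiplicity violation
```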
First introduced in Operations Research, algebraic modeling languages (AMLs) make it possible to write systems of equations and/or constraints in a formalism close to mathematical notation. They support the use of indexed variables and expressions, quantifiers, and iterated operators like Σ (sum) and Π (product), in order to build compact algebraic expressions. AMLs have been used for linear and non-linear programming, for discrete-time simulation, and recently for constraint programming [13]. In AROM, the AML is used for writing equations, constraints, and queries. AML expressions are built from the following elements: constants, indices and indexed expressions, operators and functions, iterated operators, quantified expressions, variables belonging to classes and associations, and expressions for accessing the tuples of an association. An AML interpreter solves systems of (non-simultaneous) equations and processes queries. Written in Java 1.2, AROM is available as a platform for knowledge representation and exploitation. It comprises an interactive modeling environment, which allows one to create, consult, and modify an AROM KB; a Java API; an interpreter for processing queries and solving sets of (non-simultaneous) equations written in AML; and WebAROM, a tool for consulting and editing a KB through a Web browser.
As mentioned above, multimedia scenarios played by V-STORM can be described using SMIL. The starting point of this study is twofold. We aim first at providing a UML-like model in order to ease the description of a multimedia presentation and, second, at reinforcing consistency regarding spatial and especially temporal constraints between the components of a multimedia presentation. It is our conviction that SMIL, like XML [14], is not an intuitive knowledge representation language, and that one needs to be familiar with its syntax before being able to read or write a document and understand its structure. So, we propose the AVS (AROM/V-STORM) model (see Figures 3 and 4), which consists of an AROM knowledge base whose structure incorporates any SMIL element used in the description of a multimedia presentation. This way, we provide a V-STORM user with an operational UML-like model for describing her multimedia presentation. Using an entity/relation (or class/association) approach for modeling is now widely accepted, and UML has become a standard here. Through the AROM Interactive Modeling Environment, the graphical representation of the classes and associations which constitute the AVS model gives the user a more intuitive idea of the structure of her presentation. Moreover, taking advantage of AROM's AML and type checking, the user can be informed about the spatial and temporal consistency of her presentation.
Since V-STORM can play any presentation described with SMIL, our AROM model for multimedia presentations is SMIL compliant. This means that it incorporates classes and associations corresponding to every element that can be found in the structure of a SMIL document. However, the main objective of the AVS model is to give the user the opportunity to invoke any kind of operation V-STORM can perform on a video.
class: RootLayout
  variables:
    variable: b_color
      type: string
    variable: title
      type: string
    variable: height
      type: integer
    variable: width
      type: integer

class: Region
  variables:
    variable: b_color
      type: string
    variable: fit
      type: string
    variable: title
      type: string
    variable: top
      type: integer
    variable: height
      type: integer
    variable: width
      type: integer

class: CommonAttributes
  variables:
    variable: abstract
      type: string
    variable: author
      type: string
    variable: begin
      type: float
      default: 0
    variable: end
      type: float
      definition: end=begin+dur
    variable: dur
      type: float
      default: 0
      definition: dur=end-begin
    variable: region
      type: string
    variable: repeat
      type: integer
    variable: s_bitrate
      type: integer
    variable: s_caption
      type: boolean
    variable: s_language
      type: list-of string
      cardinality: min: 0 max: *

class: Block
  super-class: CommonAttributes
  variables:
    variable: sync
      type: string
      default: "seq"
    variable: endsync
      type: string

class: Element
  super-class: CommonAttributes
  variables:
    variable: media
      type: string
    variable: src
      type: string
    variable: alt
      documentation: "specifies an alternate text, if the media can not be displayed"
      type: string
    variable: fill
      documentation: "if fill=true then freeze else remove"
      type: boolean

association: CBE
  roles:
    role: block
      type: Block
      multiplicity: min: 0 max: *
    role: element
      type: Element
      multiplicity: min: 0 max: *
Figure
3. An excerpt of the AROM textual description showing 7 classes and 1
association of the AVS model for multimedia presentation. In the CommonAttributes abstract class, a
definition is given for the end
and dur variables using the
AML.
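As an illustration of what such AML definitions provide, the following sketch (hypothetical code, not AROM's AML interpreter) derives whichever of begin, end and dur is missing from the other two, using end = begin + dur and dur = end - begin:

```python
# Toy resolution of the CommonAttributes temporal definitions
# (an illustrative sketch; AROM's actual AML interpreter is more general).

def complete_timing(begin=None, end=None, dur=None):
    """Fill in whichever of begin/end/dur can be derived from the others."""
    if end is None and begin is not None and dur is not None:
        end = begin + dur        # end = begin + dur
    if dur is None and begin is not None and end is not None:
        dur = end - begin        # dur = end - begin
    if begin is None and end is not None and dur is not None:
        begin = end - dur        # by rearranging end = begin + dur
    return begin, end, dur

print(complete_timing(begin=2.0, dur=8.0))   # (2.0, 10.0, 8.0)
print(complete_timing(begin=2.0, end=10.0))  # (2.0, 10.0, 8.0)
```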
Figure 4. A view of the Interactive Modeling Environment through which the AVS model can be instantiated. On the left, a view of the class and association hierarchies. On the right, the UML-like graphical description of the model. A textual description is automatically generated. For instance, one can find the corresponding textual description of the class Element in Figure 3.
In the AVS model, the various features of a multimedia presentation are modeled using classes and associations. The class Presentation gives the most general information about the multimedia presentation. The spatial formatting, which describes the way displayable objects are placed in the presentation window, is described by objects of the Layout class, in accordance with the SMIL recommendation. When a presentation gathers more than one layout, V-STORM chooses the first layout that matches the user preferences. This way, V-STORM permits some adaptability concerning the characteristics of the machine on which the presentation is played. A layout can be associated with a root-layout and several regions (described respectively by the classes RootLayout and Region and the associations HasRootLayout and HasRegion) where the media objects appear. Concerning the time model, a V-STORM presentation is made up of blocks. Each block can contain other blocks and/or media objects. Basic media objects supported by V-STORM are continuous media with an intrinsic duration (video, audio…) or discrete media without an intrinsic duration (text, image…). The variable sync in the Block class determines the temporal behavior (namely parallel or sequential presentation) of the elements in the block, depending on its value, seq or par. Three temporal attributes can be associated with a media object or a block: its duration (variable dur) and its begin and end times (variables begin and end). When no value is specified for these variables, the duration of a discrete object is null and the duration of a continuous object is its natural duration. The semantics concerning the effective begin of objects linked to a parallel or sequential block is the same as the one defined in the SMIL recommendation. Also, every date associated with an object must be defined as a float value. This is not a limitation, since the model allows one to associate with a media object a set of reaction methods (start, end, load…) in response to events (click, begin, end…) triggered by other objects. Compared with an authoring language like GRiNS, this event-reaction mechanism offers more synchronization possibilities between objects and, through the AVS model, it is easier and more intuitive to express a temporal scenario.
The knowledge base contains a Switch class in charge of adapting the presentation to the system capabilities and settings. The variables found in this class (s_bitrate, s_caption, s_language…) are equivalent to the attributes of the switch element in SMIL. The player will play the first element of the switch acceptable for presentation. Finally, the two kinds of navigational links proposed by SMIL (a and anchor), which allow interactivity during a presentation, are represented in the knowledge base by the A_Link and Anchor_Link classes. The power of the event-reaction mechanism implemented in V-STORM allows an author to define more powerful and intuitive user interaction possibilities than in SMIL. For instance, a media object can start some time after the click on another object, and it can end just after the load of a new object.
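The event-reaction idea can be sketched as a simple observer pattern. This is hypothetical code for illustration; the paper does not detail the actual V-STORM mechanism:

```python
# Toy event-reaction sketch (an assumption made for illustration, not the
# V-STORM implementation): media objects register reactions to events
# raised by other objects.
from collections import defaultdict

class Media:
    """A media object that can raise events and react to events of others."""
    def __init__(self, name):
        self.name = name
        self.reactions = defaultdict(list)   # event name -> callbacks
        self.log = []                        # events this object has raised

    def on(self, event, reaction):
        """Register a reaction to an event raised by this object."""
        self.reactions[event].append(reaction)

    def fire(self, event):
        """Raise an event and trigger every registered reaction."""
        self.log.append(event)
        for react in self.reactions[event]:
            react()

video = Media("video")
caption = Media("caption")
# start the caption when the video is clicked, end it when the video ends
video.on("click", lambda: caption.fire("start"))
video.on("end", lambda: caption.fire("end"))
video.fire("click")
video.fire("end")
print(caption.log)  # ['start', 'end']
```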
To build a multimedia presentation, a V-STORM user simply has to instantiate the AROM KB. For a local KB, this can be done either by using the AROM Interactive Modelling Environment (see Figures 4 and 5), by completing the ASCII document describing the KB (as in Figure 3), or by using the Java API of AROM in a program. For a distant KB, this can be done through a web browser using WebAROM. Since this instantiation is made under the control of AROM (type checking, multiplicity constraint satisfaction, …), both the spatial and temporal consistency of the described presentation are guaranteed.
Figure 5. The editor for instantiating the AVS model
Once this instantiation is performed, an AROM-SMIL parser we have written is launched and the resulting SMIL file is sent to the SMIL parser of V-STORM (see Figure 6).
Figure 6. The
architecture of the AROM/V-STORM coupling
The coupling between V-STORM and AROM combines the video management richness of the former with the expressive and modelling power of the latter. Compared to the classical specification and presentation of multimedia documents, this coupling offers several advantages.
UML-like description: The AVS model is described in a graphical notation close to UML. Object-oriented analysis and design methods have shown the relevance of using graphical notations to improve communication between all the actors of a design process (for instance, collaborating authors).
Modularity and reuse: The author can edit parts of the presentation independently and group them to compose her documents, just by manipulating AROM objects. This object approach allows the reuse of existing blocks to compose new presentations, saving a large amount of work in the design phase.
Object identity: In an AROM KB, each object has a unique identifier. This property has been exploited to prevent inconsistencies due to the assignment of the same identifier to two different media objects. For the names given to regions, for instance, the existence of such names is checked.
Consistency maintenance: When a presentation contains inconsistencies (for instance, when it states that an object B starts at the end of an object A, an object C starts at the end of B, and C starts at the same time as A), classical multimedia systems ignore these inconsistencies or merely warn about them at play time. Here, temporal checking is performed by AROM during the construction of the presentation and the author is warned about such inconsistencies. This ensures that a consistent document is sent to the presentation system. This static checking also allows us to obtain a global trace of the presentation or a timeline view, which aligns all events on a single time axis.
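This kind of static check can be sketched as constraint propagation over equalities of the form start(X) = start(Y) + offset. The code below is hypothetical (not AROM's constraint engine) and encodes the A/B/C scenario above, assuming each object lasts 10 seconds:

```python
# Toy static temporal consistency check (illustrative sketch only):
# each constraint (x, y, off) means time[x] = time[y] + off.  Times are
# propagated through the constraint graph, then every constraint is
# re-verified; a violated constraint reveals an inconsistent scenario.
from collections import defaultdict, deque

def check_temporal_consistency(constraints):
    """Return [] if consistent, else the list of violated constraints."""
    graph = defaultdict(list)
    for x, y, off in constraints:
        graph[y].append((x, off))    # time[x] = time[y] + off
        graph[x].append((y, -off))
    times = {}
    for start in graph:              # one BFS per connected component
        if start in times:
            continue
        times[start] = 0.0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, off in graph[u]:
                if v not in times:
                    times[v] = times[u] + off
                    queue.append(v)
    return [(x, y, off) for x, y, off in constraints
            if abs(times[x] - times[y] - off) > 1e-9]

# B starts when A ends, C starts when B ends, yet C starts with A:
bad = check_temporal_consistency([
    ("startB", "startA", 10),    # startB = endA = startA + 10
    ("startC", "startB", 10),    # startC = endB = startB + 10
    ("startC", "startA", 0),     # startC = startA -> contradiction
])
print(bool(bad))  # True: the scenario is inconsistent
```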
Virtual videos: In addition to raw videos, the author can include virtual videos in the presentation. They correspond to video objects having no value for their src variable. Associations (Extraction, Reduction, Repetition, BinaryOperation) corresponding to the V-STORM operations for creating virtual videos have been introduced in the KB. Once these associations are instantiated, their tuples link a virtual video to the video(s) (raw or virtual) it is derived from.
Keywords and video abstracts: It is possible to use the keywords variable to annotate a video and to formulate queries on its content. Moreover, the model includes an AbstractOf association in order to link an abstract (an object of the VAbstract class), possibly having a given duration, to a video. Thus, the video can be replaced by its abstract during the presentation. A VAbstract object can be created manually or automatically using the AROM API and the V-STORM video abstract generator.
For a complete comparison of V-STORM with other multimedia projects, one can refer to [15]. Among numerous research works on authoring and presentation environments for interactive multimedia documents, Madeus [16] is a very complete environment with a graphical authoring interface and a spatial formatting editor. Madeus is based on a constraint-based approach. It offers flexibility for the frequent scenario modifications carried out by the author before reaching the desired scenario, a coupling between the editing and presentation processes, and an incremental editing process which consists in readjusting the solution each time the author adds or deletes a constraint. Constraint propagation maintains the consistency of the new scenario: at each editing step, the author is sure of having a consistent scenario. Our AVS model also relies on a similar approach, since AROM integrates a constraint solver. The AML allows the expression of constraints involving classes, associations, objects or tuples. But for authoring, we put the emphasis on a yet more declarative approach through the use of a UML-like model in which constraints are implicitly embedded into temporal and spatial operators. Also, unlike V-STORM, other presentation tools pay little attention to video data type management. Finally, to our knowledge, this study is the first attempt to benefit from the expressive power of an object-based knowledge representation system to describe and check the consistency of a multimedia presentation.
This paper presents a first attempt to couple an object-based knowledge representation system (OBKRS) called AROM with a multimedia presentation authoring tool named V-STORM. This coupling has three main results. First, the multimedia presentation scenario can be modelled using the UML class diagram-like description of AROM, which has shown to be more intuitive than a SMIL file. Second, the inference and consistency engines of AROM check the validity of the presentation. Third, the richness of the V-STORM video operators is better exploited. The AROM KB proposed here, called AVS, is a generic model for multimedia presentations. The classes and associations of the AVS model just have to be instantiated to create an effective multimedia presentation. Notably, this model incorporates every characteristic of the SMIL elements for describing how to arrange media objects in a scenario. A parser has been written to translate such an AROM KB into a SMIL document. In turn, this SMIL document is parsed by V-STORM and the presentation is played. This work is only at its beginning, but three main directions are already privileged. The first one concerns the integration of the V-STORM video query language into the AROM model. The idea here is to replace the OQL query language with the algebraic modelling language of AROM. Eventually, a parser will directly connect AROM to V-STORM, without having recourse to the existing AROM/SMIL parser. Second, a graphical timeline interface could help the user during the authoring process to interactively control the changes made to her multimedia document through real-time support. Third, parallel work we are carrying out [17] for a better use of database capabilities in the context of Web presentations could be integrated within the AVS model.
[1] A.
Laursen, J. Olkin and M. Porter, Oracle Media Server: providing consumer
interactive access to Multimedia data, SIGMOD, 1994.
[2] K.
Nwosu, B. Thuraisingham and B. Berra, Multimedia Database Systems: design and
implementation strategies, Kluwer Academic Publishers, 1996.
[3] B.
Ozden, R. Rastogori and A. Silberschatz, Multimedia Database Systems, Issues
and Research Directions, Springer-Verlag, 1996.
[4] R.
Weiss, A. Duda and D. Gifford, Composition and Search with a Video Algebra,
IEEE multimedia, pp 12-25, Springer Ed., 1995.
[5] R.
Lozano, M. Adiba, F. Mocellin and H. Martin, An Object DBMS for Multimedia
Presentations including Video Data, Proc. of ECOOP'98 Workshop Reader, Springer
Verlag, Lecture Notes in Computer Science, 1543, 1998.
[6] W3C
Recommendation: Synchronized Multimedia Integration Language (SMIL) 1.0
Specification http://www.w3.org/TR/REC-smil
[7] GriNS
Authoring Software, http://www.oratrix.com/GRiNS/index.html
[8] RealNetworks
G2, http://www.realnetworks.com
[9] M. Page, J. Gensel, C. Capponi, C. Bruley, P. Genoud, D. Ziébelin, D. Bardou and V. Dupierris, A New Approach in Object-Based Knowledge Representation: the AROM System, IEA/AIE-2001, June 4-7, Budapest, Hungary, 2001, http://www.inrialpes.fr/romans/arom
[10] R.G.G. Cattell and D. Barry, The Object Database Standard: ODMG 2.0, Morgan Kaufmann, 1997.
[11] R. J.
Brachman and J. G. Schmolze, An Overview of the KL-ONE Knowledge Representation
System, Communications of the ACM, 31 (4), pp. 382-401, 1988.
[12] J.
Rumbaugh, I. Jacobson and G. Booch, The Unified Modeling Language Reference
Manual., Addison-Wesley, 1999.
[13] P.
Van Hentenryck, The OPL Optimization Programming Language, MIT Press, 1999.
[14] W3C Recommendation: Extensible Markup Language (XML) 1.0 (Second Edition) http://www.w3.org/TR/REC-xml
[15] R. Lozano, Intégration de données video dans un SGBD à objets, PhD Thesis (in French), Joseph Fourier University, Grenoble, France, 2000.
[16] M.
Jourdan, N. Layaïda, C. Roisin, L. Sabry-Ismaïl and L. Tardif, Madeus, an
Authoring Environment for Interactive Multimedia Documents, in ACM Multimedia,
pp 267-272, Bristol, UK, 1998.
[17] Mulhem
and H. Martin, From Database to Web multimedia Documents, in Journal of
Multimedia Tools and Applications (to appear), 2001.