Tuesday, December 22, 2009

O Universal Multimedia Access, Where Art Thou? (Part II)

-by Christian Timmerer, Klagenfurt University, Austria

Preface: At first I thought about writing this article for a journal or something equivalent, but then I decided to make it available through my blog. The aim is to perform an experiment in order to determine whether it is possible (a) to get direct feedback through comments and (b) to be referenced from elsewhere. As it is a quite comprehensive article, it is split up into separate parts. If someone (e.g., a journal editor) is interested in publishing this article, yes, I can still do that! :-)

Part I gave an introduction to the topic and an overview of multimedia content adaptation techniques. This part focuses on the adaptation-by-transformation approach, which utilizes scalable coding formats such as JPEG2000, MPEG-4 BSAC, and MPEG-4 SVC, and is mainly based on [1].

Part II – Adaptation by Transformation

Scalable coding techniques have been recognized as an appropriate tool for realizing the concepts of UMA. Furthermore, if widely adopted across industries, scalable coding would provide a generalized solution to the interoperability problem.

In [2], a scalable bitstream is defined as a coded multimedia resource (i.e., an audio-visual multimedia resource) consisting of a structured sequence of binary symbols, organized in such a way that it is possible to first render a degraded version of the bitstream and then progressively improve it by loading additional data. This definition implies a bitstream structure that can be logically divided into several layers, i.e., a base layer and one or more enhancement layers. The base layer offers a minimal-quality version of the bitstream whereas each enhancement layer successively provides improvements with respect to quality in various dimensions. These dimensions include improvements in the temporal, spatial, signal-to-noise ratio (SNR), color, region-of-interest (ROI), and complexity domains, among others. Recently, abstract models describing scalable bitstreams have been proposed [3][4]; they are briefly reviewed in the following.

In general, a scalable bitstream can be organized in a logical hypercube model where each axis represents a scalability dimension (e.g., temporal, spatial, quality) and every data block within this model corresponds to a certain bitstream segment (cf. Figure 1). Adapting a bitstream that corresponds to such a model comprises the removal of one or more data blocks, sometimes followed by minor updates of the remaining data blocks. See [4] for a more detailed overview of the adaptation possibilities.

Figure 1. Scalability Model using the Hypercube Model according to [3][4].
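As a toy illustration of the hypercube model, the following Python sketch (a hypothetical data model of my own for this article, not any standard API) addresses each data block by its coordinates along the scalability axes; adaptation then amounts to removing the blocks that lie beyond the target limits along any axis:

```python
# A minimal sketch of the hypercube model: each data block is addressed
# by its coordinates along the scalability axes, e.g.,
# (temporal_level, spatial_level, quality_level).

def adapt(blocks, limits):
    """Keep only the blocks whose coordinates do not exceed the target
    limits along any scalability axis; this models the removal of data
    blocks from the hypercube."""
    return {coord: data for coord, data in blocks.items()
            if all(c <= limit for c, limit in zip(coord, limits))}

# A toy 2x2x2 hypercube: temporal x spatial x quality.
bitstream = {(t, s, q): f"segment-{t}{s}{q}"
             for t in range(2) for s in range(2) for q in range(2)}

# Adapt to the base spatial layer while keeping the full temporal
# and quality range.
adapted = adapt(bitstream, limits=(1, 0, 1))
print(sorted(adapted))  # only blocks with spatial level 0 remain
```

The "minor updates of the remaining data blocks" mentioned above (e.g., header fields that encode the number of layers) are omitted here for brevity.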

In the following I'd like to introduce some coding formats and their scalability features in terms of the hypercube model introduced above, namely:

  • JPEG2000 which introduces spatial, color, SNR, and ROI scalability for still images; 
  • MPEG-4 Visual Elementary Stream (VES) with temporal and semantic scalability;
  • MPEG-4 BSAC with fine-grained SNR scalability;
  • MPEG-4 SVC with native support for temporal, spatial, and SNR scalability.
Please note that each scalable coding format is introduced with a special focus on its scalability aspects. For details regarding the basic coding techniques, the reader is referred to the appropriate literature, e.g., [5] or [6].


JPEG2000

The JPEG2000 standard [7][8][9] is known as the successor of the world-famous and widely adopted JPEG standard [10]. It was developed in order to accommodate the increasing demands and additional requirements of multimedia and Internet applications. In particular, some of the most important features (with respect to scalability) the JPEG2000 standard offers are progressive transmission by pixel accuracy and resolution, Region of Interest (ROI) coding, and random code-stream access and processing. Progressive transmission enables the rendering of images with different resolutions and pixel accuracies, starting from a base version up to a high-resolution/high-quality version in an incremental manner, i.e., more and more data is added to the base layer by transmitting only the additional data required for increasing quality and/or resolution.
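The incremental nature of progressive transmission can be sketched as follows (a hypothetical layer model for illustration, not a JPEG2000 codec): the client re-renders after every received chunk, refining the same image rather than downloading a new one from scratch.

```python
# Illustrative sketch of progressive transmission: each received layer
# is appended to what is already there, and the image can be rendered
# at every step from the cumulative data.

def progressive_receive(layers):
    """Yield the cumulative data available after each layer arrives."""
    received = b""
    for layer in layers:
        received += layer
        yield received

# Hypothetical layers: a base version plus resolution/quality refinements.
layers = [b"base", b"+res1", b"+quality1"]
for i, data in enumerate(progressive_receive(layers)):
    print(f"render pass {i}: {len(data)} bytes available")
```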

The hypercube model for JPEG2000, with dimensions representing color, spatial, and SNR scalability, is depicted in Figure 2.

Figure 2. Hypercube for JPEG2000 scalability and a possible bitstream layout.

In particular, the figure shows the hypercube for JPEG2000 with its scalability dimensions and a possible bitstream layout with quality-spatial-color progression order. The gray cube represents the base layer with QCIF dimensions, the Y color component only, and a quality of 29 dB PSNR. In contrast, the blue cubes represent another version of the tile including more quality layers, i.e., a PSNR of 31 dB, at CIF resolution but still with only one color component, i.e., the resulting image is still a grayscale version of the original.

MPEG-4 Visual Elementary Streams

MPEG-4 [11][12][13] also provides support for scalability in the spatial, temporal, and SNR dimensions, but only a small subset of these scalability features has been adopted by industry, namely temporal scalability. The spatial and SNR scalability features introduced too much coding overhead, which was the main reason for not adopting them at that time.

Temporal scalability is often also referred to as frame dropping: frames or video object planes (VOPs) that are not used as a reference for other frames are removed. Bi-directionally coded VOPs (B-VOPs) are never used as a reference for other frames, i.e., B-VOPs can be dropped arbitrarily. In case a predictively coded VOP (P-VOP) needs to be dropped, all B-VOPs which use this P-VOP as a reference frame need to be dropped as well. The same holds for intra-coded VOPs (I-VOPs), although these are usually not dropped in traditional temporal scalability scenarios.
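The dependency rule above can be sketched in a few lines of Python (an illustrative data model, not an MPEG-4 parser): each VOP records the indices of the reference VOPs it depends on, and dropping a reference VOP transitively forces out everything predicted from it.

```python
# Each VOP maps to (type, list of indices of its reference VOPs):
# B-VOPs depend on the surrounding I-/P-VOPs, P-VOPs on the previous
# I-/P-VOP, and I-VOPs on nothing.

def close_under_dependents(vops, to_drop):
    """Dropping a VOP forces dropping every VOP that references it,
    applied transitively until the drop set is closed."""
    dropped = set(to_drop)
    changed = True
    while changed:
        changed = False
        for idx, (_typ, refs) in vops.items():
            if idx not in dropped and dropped & set(refs):
                dropped.add(idx)
                changed = True
    return dropped

# Display order: I0  B1 B2  P3  B4 B5  P6
vops = {
    0: ("I", []),
    1: ("B", [0, 3]), 2: ("B", [0, 3]),
    3: ("P", [0]),
    4: ("B", [3, 6]), 5: ("B", [3, 6]),
    6: ("P", [3]),
}

# B-VOPs can be dropped arbitrarily, nothing else is affected:
print(sorted(close_under_dependents(vops, {1, 2})))
# Dropping P3 also forces out the B-VOPs referencing it, as well as
# P6 (predicted from P3) and P6's dependent B-VOPs:
print(sorted(close_under_dependents(vops, {3})))
```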

Another dimension of scalability is introduced here, known as semantic scalability. This additional dimension associates properties with groups of VOPs (GoVs), providing means for the summarization or personalization of MPEG-4 visual resources. With respect to the scalability model, GoVs can be compared to parcels and VOPs can be seen as the data blocks. Semantic scalability is, of course, also applicable to other audio/visual coding formats, including those introduced in this article.

Figure 3. Hypercube for MPEG-4 VES scalability and a possible bitstream layout.

A possible configuration of an MPEG-4 Visual Elementary Stream hypercube model is depicted in Figure 3 with two levels of scalability, namely temporal and semantic. The former is characterized by different frame rates and the latter uses terms from the Internet Content Rating Association (ICRA) for rating the violence level of the actual content; for example, level 0 indicates content with no violence or only sports-related violence. The gray block represents a base layer (e.g., a scene or even only one I-VOP) with a frame rate of 15 Hz and violence level 0, whereas the blue block indicates a scene with violence level 2 and 20 frames per second (fps).

MPEG-4 Bit-Sliced Arithmetic Coding

The concept of bit-sliced arithmetic coding for audio was introduced in [14] and is also excerpted in [11][15]. It is very similar to the well-known Advanced Audio Coding (AAC) [16] scheme, except that the quantized values are not Huffman coded but arithmetically coded in bit-slices. Thus, MPEG-4 Bit-Sliced Arithmetic Coding (BSAC) provides fine-grained scalability of approximately 1 kbit/s per audio channel per enhancement layer. The base layer comprises side information, scaling factors, and the actual audio data according to the bit rate of the base layer. Each enhancement layer incrementally adds more information with respect to the bit rate, and a maximum of 48 enhancement layers is allowed. Since the small size of the enhancement layers, i.e., typically 20 to 60 bits per AAC frame representing 20 to 30 ms of audio, may result in undesired packetization overhead, data packets of consecutive frames can be grouped together.
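A back-of-the-envelope sketch of the resulting bit-rate granularity, using the ~1 kbit/s-per-layer figure quoted above (the 16 kbit/s base-layer rate here is a hypothetical value chosen purely for illustration):

```python
# Approximate total bit rate of a BSAC stream after truncating it to a
# given number of enhancement layers (at most 48, per the standard),
# assuming ~1 kbit/s per enhancement layer per channel.

def bsac_bitrate(base_kbps_per_channel, enh_layers, channels=2):
    assert 0 <= enh_layers <= 48, "BSAC allows at most 48 enhancement layers"
    return channels * (base_kbps_per_channel + enh_layers * 1.0)

# A stereo stream with a hypothetical 16 kbit/s/channel base layer and
# 8 enhancement layers kept per channel:
print(bsac_bitrate(16, 8))   # 2 * (16 + 8) = 48.0 kbit/s
```

Truncating one more or one fewer layer per channel changes the stereo rate by only 2 kbit/s, which is what makes the format "fine-grained".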

Figure 4. Hypercube for MPEG-4 BSAC scalability and a possible bitstream layout.

A hypercube model of a stereo MPEG-4 BSAC bitstream, including a possible bitstream layout, is illustrated in Figure 4. The base layer is encoded at 48 kbit/s per channel, and a possible adapted stereo version of the bitstream at 50 kbit/s is indicated as well.

MPEG-4 Scalable Video Coding

MPEG-4 Scalable Video Coding (SVC) [17] has been introduced as an extension of MPEG-4 Advanced Video Coding (AVC) [18], which is part 10 of the MPEG-4 family of audio/visual coding standards. MPEG-4 SVC natively supports three scalability dimensions, namely temporal, spatial, and quality (SNR).

Figure 5.  Hypercube for MPEG-4 SVC scalability and a possible bitstream layout.

In Figure 5, a hypercube model with the three scalability dimensions of MPEG-4 SVC, including a possible bitstream layout, is shown. In this example, the base layer provides a QCIF version at 20 Hz with a PSNR of 28 dB. Additionally, an improved version with higher temporal, spatial, and SNR resolution is indicated.

Adaptation of Scalable Bitstreams

The adaptation of scalable bitstreams can basically be organized into two categories:
  • The first category is the coding-format-specific approach which, in general, is applicable to one coding format only, such as the Bitstream Extractor that is part of the Joint Scalable Video Model (JSVM). The disadvantage here is that for each coding format a separate "bitstream extractor" is needed, which becomes an issue as the number of coding formats grows.
  • The second category is referred to as the coding-format-independent or generic approach, which is applicable to all scalable coding formats but requires additional metadata [19]. As this approach is rather new and not commonly known, I will give a brief overview in the following.
Please note that a comparison between the generic and specific approaches in the context of SVC is reported in [20].

Generic Multimedia Content Adaptation

This section discusses means to process (i.e., adapt, customize, manipulate, etc.) multimedia content independently of the actual coding format by utilizing XML-based metadata describing the high-level structure (i.e., syntax) of a bitstream. That is, the resulting XML document describes how the bitstream is organized at different syntactical and even semantic levels, e.g., in terms of packets, headers, layers, units, segments, shots, scenes, etc., depending on the actual application requirements. It is important to note that the XML description does not describe the bitstream on a bit-by-bit basis, i.e., it does not replace the actual bitstream but provides metadata regarding the bit/byte positions of meaningful segments for the given application. Therefore, the XML description does not necessarily provide any information about the actual coding format used, as only the positions and – in some cases – the meanings of the segments are required for processing.

High-level Architecture of Generic Content Adaptation

Figure 6 depicts the high-level architecture of generic multimedia content adaptation which can be logically divided into two processes, namely the Description Transformation and the Bitstream Generation.

Figure 6. High-level architecture of Generic Multimedia Content Adaptation (adopted from [21]).

The description transformation process receives as input the XML description of the source bitstream and a so-called style sheet that transforms the XML document according to context information, e.g., the device capabilities. The output of this process is a transformed description which already reflects the bitstream segments of the target (i.e., adapted) bitstream. However, the transformed description still refers to the bit/byte positions of the source bitstream, which must be parsed in order to generate the target bitstream within the second step of the adaptation process, i.e., the bitstream generation.
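The two-step process can be sketched as follows. This is a deliberately simplified model (tuples standing in for an XML description, a filter standing in for the style sheet); real systems use MPEG-21 BSDL/gBSD documents and XSLT-style transformations, as described below.

```python
# The description lists segments as (label, offset, length) pointing
# into the source bitstream; it never duplicates the payload itself.

def transform_description(desc, keep):
    """Description transformation: drop the segments whose label is
    filtered out by the usage context (e.g., enhancement layers a
    target device cannot make use of)."""
    return [seg for seg in desc if seg[0] in keep]

def generate_bitstream(source, desc):
    """Bitstream generation: copy the byte ranges that the transformed
    description still references from the source bitstream."""
    return b"".join(source[off:off + length] for _label, off, length in desc)

# A toy source bitstream and its (hypothetical) structural description.
source = b"HDRBASEENH1ENH2"
desc = [("header", 0, 3), ("base", 3, 4), ("enh1", 7, 4), ("enh2", 11, 4)]

# Step 1: transform the description (drop the top enhancement layer).
adapted_desc = transform_description(desc, keep={"header", "base", "enh1"})
# Step 2: generate the adapted bitstream from the source.
print(generate_bitstream(source, adapted_desc))  # b'HDRBASEENH1'
```

Note how step 2 needs only byte positions, never any knowledge of the coding format — which is exactly what makes the approach generic.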

Please note that the description transformation and bitstream generation processes should be combined by applying appropriate implementation techniques in order to achieve the required performance. However, implementation and optimization techniques for this kind of approach are beyond the scope of this article; the interested reader is referred to [22]-[25].

Technical Solution Approaches

The literature offers several technical solution approaches for generic multimedia content adaptation which are briefly highlighted in the following:
  • (X)Flavor [26]: A Formal Language for Audio-Visual Object Representation which has been extended with XML features.
  • Bitstream Syntax Description Language (BSDL) [27]: An XML Schema-based language for constructing a Bitstream Syntax Schema (BS Schema) for a given coding format [28]. It enables the generation of a Bitstream Syntax Description (BSD) based on a given bitstream and vice versa. The generic counterpart of the coding format-specific BS Schema is referred to as gBS Schema which is fully coding format-agnostic. An XML document conforming to the gBS Schema is referred to as a generic Bitstream Syntax Description (gBSD) [29].
  • BFlavor [30]: A method that combines BSDL and XFlavor and basically uses XFlavor techniques – enhanced with BSDL concepts – to generate Java code which is used for automatic generation of BSDs.


Figure 7 gives a summary of the various multimedia content adaptation techniques presented in Part I and Part II. The summary has been adopted and extended from [31].

Figure 7. Summary of Multimedia Content Adaptation (adopted from [31]).

This is the end of Part II. Part III will continue with adaptation decision-taking, also known as the brain of multimedia content adaptation. Thus, stay tuned!

[1] C. Timmerer, Generic Adaptation of Scalable Multimedia Resources, VDM Verlag Dr. Müller, 2008.
[2] ISO/IEC 21000-7, Information technology — Multimedia framework (MPEG-21) — Part 7: Digital Item Adaptation, October 2004.
[3] S. Lerouge, R. De Sutter, P. Lambert, and R. Van de Walle, "Fully Scalable Video Coding in Multicast Applications", Proceedings of SPIE/Electronic Imaging 2004, vol. 5308, San Jose, CA, US, 2004, pp. 555-564.
[4] D. Mukherjee, A. Said, and S. Liu, "A framework for fully format-independent adaptation of scalable bit-streams," IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Video Adaptation, vol. 15, no. 10, October 2005, pp. 1280-1290.
[5] R. Steinmetz, Multimedia-Technologie. Grundlagen, Komponenten und Systeme, Springer, Berlin, July 2000.
[6] F. Halsall, Multimedia Communications. Applications, Networks, Protocols and Standards, Addison Wesley, November 2000.
[7] ISO/IEC 15444-1:2004, Information technology — JPEG 2000 image coding system: Core coding system, 2nd edition, September 2004.
[8] D. Taubman and M. Marcellin (eds.), JPEG2000: Image Compression Fundamentals, Standards and Practice, Springer, November 2001.
[9] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 Still Image Coding System: An Overview", IEEE Transactions on Consumer Electronics, vol. 46, no. 4, November 2000, pp. 1103-1127.
[10] G. K. Wallace, "The JPEG still picture compression standard", Communications of the ACM, vol. 34, no. 4, April 1991, pp. 30-44.
[11] F. Pereira and T. Ebrahimi (eds.), The MPEG-4 Book, Prentice Hall PTR, August 2002.
[12] S. Battista, F. Casalino, and C. Lande, "MPEG-4: A Multimedia Standard for the Third Millennium, Part 1", IEEE MultiMedia Magazine, vol. 6, no. 4, October-December 1999, pp. 74-83.
[13] S. Battista, F. Casalino, and C. Lande, "MPEG-4: A Multimedia Standard for the Third Millennium, Part 2", IEEE MultiMedia Magazine, vol. 7, no. 1, January-March 2000, pp. 76-84.
[14] S. Park, Y. Kim, S. Kim, and Y. Seo, "Multi-Layer Bit-Sliced Bit-Rate Scalable Audio Coding", in 103rd AES Convention, preprint 4520, New York, September 1997.
[15] H. Purnhagen, "An Overview of MPEG-4 Audio Version 2", Proceedings of AES 17th International Conference on High-Quality Audio Coding, Florence, Italy, September 1999, pp. 157-168.
[16] ISO/IEC 13818-7:2006, Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC), 4th edition, January 2006.
[17] H. Schwarz, D. Marpe, T. Wiegand, "Overview of the Scalable Video Coding Extensions of the H.264/AVC Standard", IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 9, Sep. 2007, pp. 1103-1120.
[18] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, A. Luthra, "Overview of the H.264/AVC Video Coding Standard", IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, July 2003, pp. 560-576.
[19] C. Timmerer, M. Ransburg, and H. Hellwagner, "Generic Multimedia Content Adaptation", in: Borko Furht (ed.), Encyclopedia of Multimedia, 2nd edition, Springer, pp. 263-271, October 2008.
[20] M. Eberhard, L. Celetto, C. Timmerer, E. Quacchio and H. Hellwagner, "Performance Analysis of Scalable Video Adaptation: Generic versus Specific Approach", Proceedings of WIAMIS 2008, Klagenfurt, Austria, May 2008.
[21] C. Timmerer and H. Hellwagner, “Interoperable Adaptive Multimedia Communication”, IEEE Multimedia Magazine, vol. 12, no. 1, pp. 74-79, January-March 2005.
[22] C. Timmerer, G. Panis, and E. Delfosse, “Piece-wise Multimedia Content Adaptation in Streaming and Constrained Environments”, Proceedings of the 6th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2005), Montreux, Switzerland, April 2005.
[23] C. Timmerer, T. Frank, and H. Hellwagner, “Efficient processing of MPEG-21 metadata in the binary domain”, Proceedings of SPIE International Symposium ITCom 2005 on Multimedia Systems and Applications VIII, Boston, Massachusetts, USA, October 2005.
[24] M. Ransburg, C. Timmerer, H. Hellwagner, and S. Devillers, “Processing and Delivery of Multimedia Metadata for Multimedia Content Streaming”, Proceedings of the Workshop Multimedia Semantics - The Role of Metadata, RWTH Aachen, March 2007.
[25] M. Ransburg, H. Gressl, and H. Hellwagner, “Efficient Transformation of MPEG-21 Metadata for Codec-agnostic Adaptation in Real-time Streaming Scenarios”, Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2008), Klagenfurt, Austria, May 2008.
[26] D. Hong and A. Eleftheriadis, “XFlavor: Bridging Bits and Objects in Media Representation”, Proceedings IEEE International Conference on Multimedia and Expo (ICME), Lausanne, Switzerland, pp. 773- 776, August 2002.
[27] M. Amielh and S. Devillers, “Bitstream Syntax Description Language: Application of XML-Schema to Multimedia Content”, 11th International World Wide Web Conference (WWW 2002), Honolulu, May, 2002.
[28] G. Panis, A. Hutter, J. Heuer, H. Hellwagner, H. Kosch, C. Timmerer, S. Devillers and M. Amielh, “Bitstream Syntax Description: A Tool for Multimedia Resource Adaptation within MPEG-21”, Signal Processing: Image Communication, vol. 18, no. 8, pp. 721-747, September 2003.
[29] C. Timmerer, G. Panis, H. Kosch, J. Heuer, H. Hellwagner, and A. Hutter, “Coding format independent multimedia content adaptation using XML”, Proceedings of SPIE International Symposium ITCom 2003 on Internet Multimedia Management Systems IV, Orlando, Florida, USA, pp. 92-103, September 2003.
[30] W. De Neve, D. Van Deursen, D. De Schrijver, S. Lerouge, K. De Wolf, and R. Van de Walle, “BFlavor: A harmonized approach to media resource adaptation, inspired by MPEG-21 BSDL and XFlavor”, Signal Processing: Image Communication, vol. 21, no. 10, pp. 862-889, November 2006.
[31] B. Shen, W-T. Tan, F. Huve, “Dynamic Video Transcoding in Mobile Environments“, IEEE Multimedia, vol. 15, no. 1, Jan.-Mar. 2008, pp. 42-51.
