Making 3D Content More Accessible on the Web: “Semantic XR” Proof of Concept

Introduction

This blog post demonstrates an approach for making 3D scenes in interactive 3D environments, videos and images more accessible by using a semantic description of these 3D scenes. For example: using a list of objects’ names and locations to generate spatial audio cues. A Proof of Concept of using a concrete semantic 3D scene description is presented, including demo videos and a discussion of personalization and other opportunities.

This work could be extended to other contexts, but this post focuses on accessibility for people who are blind or have low vision, on 3D content in the browser, and on integration with existing web standards.

Background

The idea of using a semantic scene description to support web accessibility is discussed in several places, including the following non-comprehensive list of resources: in the context of accessibility for WebXR, in the XRA Semantics Module document, in a summary of the W3C Workshop on Web Games, in a previous post on this blog and in an Immersive Web Proposal Issue, which also references potential usage of ARIA and AOM in this context.

This blog post and its videos describe a Proof of Concept of this idea. It is follow-up work to my presentation at the W3C Workshop on Inclusive Design for Immersive Web Standards (more details on the workshop are provided below).

Demo Videos

Demo 1: Interactive 3D Environment

Improving accessibility of an interactive 3D environment with a screen reader and spatial audio cues.

The 3D environment is a small area with blocks. The user’s goal is to reach a destination block while avoiding other blocks and obstacles.

Technically, the 3D rich internet application updates the web page with its semantic scene description. Then, JavaScript code that is part of the web page processes this semantic scene description and turns it into meaningful text for the user or into spatial audio.
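To make the flow concrete, here is a minimal sketch of that JavaScript step. It assumes the application exposes its scene description as JSON in a hidden element and that the full Howler.js build (with its spatial audio plugin, the library used in the demos as noted under Technical Details) is loaded. The element id, JSON shape and property names are hypothetical and are not the actual Semantic-XR format.

```js
// Minimal sketch, not the demo's actual code. Assumes a scene description like:
// { "objects": [{ "name": "destination block", "position": { "x": 2, "y": 0, "z": -4 } }] }
// written into a hidden element (element id and property names are hypothetical).
const sceneElement = document.getElementById('semantic-xr-scene');
const cueSound = new Howl({ src: ['cue.mp3'] }); // a short, neutral audio cue

function onSemanticSceneUpdated() {
  const scene = JSON.parse(sceneElement.textContent);

  for (const obj of scene.objects) {
    // Play one cue per object, positioned at the object's location
    // relative to the listener (the user) at the origin.
    const id = cueSound.play();
    cueSound.pos(obj.position.x, obj.position.y, obj.position.z, id);
  }
}
```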

Semantic Interactive 3D Environment

Spatial audio that is part of competitive games could be implemented with code inside the game itself. This code could still use the semantic data, but use it internally and not make it available to any external entities.

In addition to what was demonstrated, the semantic scene description also includes a high-level description of the 3D scene. This description is made available as regular text on the web page.

Demo 2: A Video of an Interactive 3D Environment

Improving the accessibility of a video of an interactive 3D environment (the same environment from Demo 1 above), using spatial audio cues for the objects that appear in the video.

Semantic Video of an Interactive 3D Environment

Demo 3: Real World Video and Photo

Visual augmentation using a semantic scene description of a video and a photo of the real world. The demo shows how a user interacts with specific objects “inside” the video. The video shows two coffee mugs on a wide plate that is being rotated around its center.

Semantic Real World Video and Photo

Personalization Opportunities

In many XR accessibility discussions, a need for personalization arises (for example, during the W3C workshop mentioned below, in XR Access discussions or in the game accessibility guidelines). This was also my own experience with feedback on Accessible Realities.

Once 3D applications, XR experiences, videos and images provide access to their semantic data, this could open up fine-tuning and personalization opportunities: the availability of the entire raw semantic data of the 3D scene could enable accessibility solutions to make the data accessible in many different ways (for example, scanning with spatial audio from left to right, from closest to most distant, or presenting only a high-level description).
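As a rough illustration of what such fine-tuning could look like in code, the sketch below reorders the same raw list of semantic objects according to the user’s preferred scan order. The object shape and property names are hypothetical.

```js
// Minimal sketch, assuming each semantic object has a name and a position
// relative to the user (property names are hypothetical).
const scanOrders = {
  // Left to right: sort by the horizontal coordinate.
  leftToRight: (objects) => [...objects].sort((a, b) => a.position.x - b.position.x),

  // Closest to most distant: sort by distance from the user (at the origin).
  closestFirst: (objects) => [...objects].sort((a, b) => distance(a) - distance(b)),
};

function distance(obj) {
  const { x, y, z } = obj.position;
  return Math.hypot(x, y, z);
}

// The same raw data, presented according to the user's preference:
function describeScene(objects, preference) {
  return scanOrders[preference](objects)
    .map((obj) => `${obj.name}, ${distance(obj).toFixed(1)} meters away`)
    .join('. ');
}
```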

This would enable software developers who are blind, as well as accessibility-related companies and NGOs, to develop solutions that make the raw semantic data accessible in the way that is best for them or their users. Let accessibility solution providers use the data and do their own magic…

More Opportunities

  • Integration of accessibility solutions with different kinds of media types and immersive experiences, as long as these support the semantic scene description format (Virtual Reality, Augmented Reality, Mixed Reality, 3D applications, videos, images etc.)
  • Accessibility for the Real World – in the future, AI-based methods might be fast, accurate and safe enough to generate the semantic data in real time for the real world (in a similar manner to a previous video that used an older version of the semantic data model)
  • Support for different kinds of disabilities as part of the semantic format
  • Platform- and language-independent format – enabling exchange of semantic data between systems, with a single authoritative defining standard
  • Extensibility – via versions or hooks for custom extensions, such as ones for specific media types
  • Developer control and ease of use – authoring tools could include extensions that enable developers to control which specific parts of the semantic data should be generated. These tools would greatly assist developers in generating the semantic data in a valid form in real time

Challenges

  • Privacy – enable users to control the sharing of their own semantic data (such as their name and location in a virtual world) with other users and entities
  • Security – similar considerations to those mentioned in the WebVTT documentation, if using the Timed Text Track approach
  • Competitive Games – naturally, semantic data shared externally in competitive environments would be extremely limited. Yet, there are XR applications which are not competitive, such as ones for education, training, social VR communities etc. Non-interactive media like images and videos could benefit from a semantic metadata format as well. Note that the developer controls which parts of the semantic data are made available to their users. In addition, even if the semantic data is not made available to external entities, it could still be made available to third-party accessibility libraries which are part of the code base. These libraries could enable accessibility without providing a distinct competitive advantage
  • Performance – this potential issue might be mitigated using different levels of detail for the semantic data: temporal resolution (frequency of updates), spatial resolution and semantic level of detail (for example: how many objects are being described? only high-level objects? all objects? etc.); a rough sketch of such settings appears after this list
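As a rough illustration of the last point, a semantic data generator could expose level-of-detail settings along those three axes. The setting names and values below are hypothetical.

```js
// Hypothetical level-of-detail settings for semantic data generation.
const semanticLod = {
  updateIntervalMs: 250,        // temporal resolution: how often the semantic data is refreshed
  positionPrecisionMeters: 0.5, // spatial resolution: positions rounded to the nearest 0.5 m
  detailLevel: 'high-level',    // semantic detail: 'high-level' | 'interactive-objects' | 'all-objects'
};
```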

Future Opportunities

Authoring Tools Integration for Generation of Semantic Data

Authoring tools (such as game engines, animation tools and 3D computer graphics software) could include plugins for generating the semantic data in real time. This data would be added as metadata to 3D interactive and immersive XR content and to images and videos. Semantic data generation plugins could also support the rising trend of Virtual Production. Introducing semantic data into virtual production tools could be a powerful way to make videos and movies more accessible. AI-based methods to generate semantic data could also be used, either in real time or for existing media.

It could make a lot of sense to entirely decouple the generation of the semantic data from its usage. Plugins for game engines and other authoring tools that focus only on generating the semantic data could be easier to develop, test and maintain. In turn, accessibility solution developers (such as research labs, NGOs, tool providers and, better yet, people with disabilities who are also software developers) could write accessibility solutions that build upon the semantic data. This would enable integration with any content that supports the semantic data format.

Uses of Semantic Data (Web and Non-web)

Clients that could leverage the semantic data for accessibility include: browsers and their extensions, screen readers and their extensions, and open source software libraries (for example, JavaScript libraries for the web, and C++, C# and other libraries for game engines).

The semantic data could be made accessible in many different ways in addition to the ways demonstrated above.

One example could be a chat app that answers questions like: “how many people are in front of me?”, “where is the closest couch?”, “please briefly describe the area to me”, etc.
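Here is a rough sketch of how such questions could be answered directly from the raw semantic data. The object shape is hypothetical, and a real chat app would add natural language handling on top.

```js
// Minimal sketch over hypothetical semantic data: objects with a type, a name and a
// position relative to the user, where a positive z value means "in front of the user".
function countPeopleInFront(objects) {
  return objects.filter((obj) => obj.type === 'person' && obj.position.z > 0).length;
}

function closestOfType(objects, type) {
  return objects
    .filter((obj) => obj.type === type)
    .reduce((closest, obj) => (!closest || distance(obj) < distance(closest) ? obj : closest), null);
}

function distance(obj) {
  return Math.hypot(obj.position.x, obj.position.y, obj.position.z);
}

// "Where is the closest couch?"
// const couch = closestOfType(scene.objects, 'couch');
```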

Other non-accessibility uses could include: creative semantic editing of images and videos and training Machine Learning models in 3D scenes (in virtual worlds and in the real world for robotics for example).

Technical Details

How It Was Done

  • The demos were implemented using a semantic data format (Semantic-XR). This format includes properties such as: objects’ names and locations, collision prediction details and more. It was defined based on reviewing a variety of existing accessibility solutions as well as my experience with Accessible Realities
  • Semantic data was added as metadata to the videos using a Text Track format, specifically WebVTT (kind=”metadata”). Note that semantic data could be added retroactively to old videos in this way to make them more accessible (a sketch of this approach appears after this list)
  • Screen reader support was implemented using JavaScript-based updates of text in an aria-live region (using “polite” updates). The text was recreated each time the Semantic-XR data was updated (see the sketch after this list)
  • Spatial audio augmentation was done using a JavaScript library with spatial audio support
  • Visual augmentation was done by leveraging a Semantic-XR metadata track that included depth map data. It was implemented using the Canvas API (see the sketch after this list)
  • Tools used: Unreal Engine 4.22 game engine, Visual Studio IDE, Howler.js audio library with spatial audio support, NVDA screen reader, Firefox web browser. The depth map data in both the real world video and the real world photo demos was created manually (with a lot of patience). This manual process could be replaced with AI-based methods
  • Semantic-XR in this context is defined as: a description of an XR experience as meaningful and well-defined data in order to make it more accessible for humans and machines. Note: the “XR” part of the name “Semantic-XR” originally had a narrower meaning (its commonly used meaning of Virtual Reality, Augmented Reality and the like). Currently, it is used in the broadest sense of the word “Reality”, including 3D scenes in images and videos, as the format was found to be useful for these as well
  • The Accessible Realities Unreal Engine 4 library was used to dynamically generate the Semantic-XR data in the 3D interactive environment demos, using a new library feature for Semantic-XR export. In the future, if a semantic data format is standardized, a variety of tools could be developed or extended to generate the semantic data (for different authoring tools, using AI-based methods etc.)
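To illustrate the WebVTT approach referenced in the list above: a hidden metadata track can carry Semantic-XR data as JSON cue payloads, which the page reads through the standard text track API. The file names and the cue payload shape below are hypothetical.

```html
<!-- Minimal sketch; file names and the cue payload shape are hypothetical. -->
<video id="demo-video" src="demo.webm" controls>
  <track kind="metadata" src="semantic-xr.vtt" default>
</video>

<script>
  // Each cue in semantic-xr.vtt carries Semantic-XR data as JSON, for example:
  //   00:00:01.000 --> 00:00:02.000
  //   {"objects":[{"name":"destination block","position":{"x":2,"y":0,"z":-4}}]}
  const track = document.getElementById('demo-video').textTracks[0];
  track.mode = 'hidden'; // keep cues firing without rendering them on screen

  track.addEventListener('cuechange', () => {
    for (const cue of track.activeCues) {
      const sceneUpdate = JSON.parse(cue.text); // Semantic-XR data for this time range
      // ...turn the update into spatial audio cues or screen reader text, as in Demo 1
    }
  });
</script>
```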
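The screen reader path referenced above is essentially a text rendering of the same data into a polite live region. A minimal sketch (the element id and property names are hypothetical):

```js
// Minimal sketch: an aria-live region whose text is recreated whenever the
// Semantic-XR data is updated. Assumes this element exists in the page:
//   <div id="semantic-xr-text" aria-live="polite"></div>
const liveRegion = document.getElementById('semantic-xr-text');

function renderSceneText(scene) {
  // Hypothetical property names; the phrasing could be adapted per user,
  // which is where the personalization opportunities come in.
  liveRegion.textContent = scene.objects
    .map((obj) => `${obj.name}, ${Math.hypot(obj.position.x, obj.position.z).toFixed(1)} meters away`)
    .join('. ');
}
```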
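And for the visual augmentation sketch referenced above: one simple way to use the depth map from the metadata track is to paint a semi-transparent overlay over the pixels whose depth falls within the selected object’s depth range. The depth map layout and thresholds are hypothetical.

```js
// Minimal sketch: highlight the selected object on a transparent canvas that sits
// on top of the video or photo. depthMap is assumed to be a flat array with one
// depth value per pixel, taken from the Semantic-XR metadata track.
function highlightDepthRange(canvas, depthMap, minDepth, maxDepth) {
  const ctx = canvas.getContext('2d');
  const overlay = ctx.createImageData(canvas.width, canvas.height);

  for (let i = 0; i < depthMap.length; i++) {
    if (depthMap[i] >= minDepth && depthMap[i] <= maxDepth) {
      overlay.data[i * 4] = 255;     // red channel
      overlay.data[i * 4 + 3] = 96;  // semi-transparent alpha
    }
  }
  ctx.putImageData(overlay, 0, 0); // transparent elsewhere, so the media shows through
}
```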

Technical Suggestions For Consideration

  • The idea of a standard semantic scene description is a very powerful one for accessibility and other uses, as described in many of the places linked above. Doing the Proof of Concept has strengthened my own belief that this could be a useful approach which is worth further evaluation
  • Integration of a semantic format with existing standards
    • Consider adding kind=”semantic” to Timed Text Track
    • Consider allowing Canvas (or any Rich Internet Application container) to have Timed Text Track children. This could be a powerful way to deliver captions, subtitles, audio descriptions and metadata for Rich Internet Applications
    • Consider adding semantic metadata as a standard metadata section in different media files (like different image formats)
  • When defining the format, consider keeping it semantic and not tied, for example, only to 3D visual content formats. This would enable having semantic data which is not only visual, in a similar manner to what was requested in this WebXR issue. Non-visual semantic content could include:
    • Cognitive-related data – for example, up-to-date instructions for an interactive experience could be made available in-game on demand, as suggested in the Game Accessibility Guidelines
    • Audio-related data supporting people who are deaf or hard of hearing. For example: who said what to whom, when and where in the 3D scene. This data could then be used to populate caption and subtitle tracks, including spatial visual cues showing where the sound came from

W3C Workshop on Inclusive Design for Immersive Web Standards

This blog post is follow-up work to a presentation I gave at the W3C Workshop on Inclusive Design for Immersive Web Standards, which I was very fortunate to attend. The workshop included many highly interesting and useful presentations, and the atmosphere felt very inclusive and welcoming. It was definitely one of the best workshops I have ever attended. Links to the presentations were made available online. Also available online is a great summary follow-up talk with many insights by Thomas Logan. I would like to thank the organizers of this W3C workshop and hope that this follow-up work will be in some way useful in advancing the important goal of inclusive and accessible XR on the web and beyond.

Feedback

If there is interest, I would be happy to make the Semantic-XR format details available online.

Your feedback is very welcome; I’m always eager to improve (available here).