
The World Service archive prototype

Post categories: Archives, IRFS

Yves Raimond | 10:32 UK time, Thursday, 29 November 2012

Over this past summer we built a prototype that puts the World Service radio archive on the web. The prototype lets you explore and listen to around 70,000 radio programmes covering 60 years of the World Service. Because it is such a large and diverse archive with sparse descriptive data, we have had to categorise and tag all these programmes with machines running speech-to-text, topic extraction and speaker identification algorithms. Now we want people to help us validate and correct this automatically generated data and improve the archive for everyone. Please try it out and let us know what you think.

We previously wrote on this blog about our work on automated tagging of large archives, done within our ABC-IP project. Since then, we have been deriving more and more data automatically: topic tags, segmentations and speaker identifications. However, automated tools will never be perfect, especially for something as subjective as tags. The World Service Archive prototype aims to test a new approach to publishing large archives online. First, automated processes annotate the archive with tags, bootstrapping search and navigation for users. Then user feedback on these tags makes them better, which improves search and navigation and also feeds back into our automated tools.

This approach is significantly different from the way BBC archives are currently published online, which focuses on archive segments around particular brands (e.g. Desert Island Discs, or more recently Letters from America) or particular topics (e.g. World War II), manually annotating that segment of the archive and building segment-specific navigation using those annotations. However, there are a number of questions we need to answer when testing our novel approach of combining automated metadata with crowdsourcing techniques. Is it acceptable to publish an archive where the metadata hasn't been comprehensively checked? What are the minimal features required to make such an archive proposition work? Is variable quality metadata acceptable to users? Does user feedback actually lead to increased accuracy? What are the best mechanisms to engage our users in helping us improve that data?

The prototype was first tested by the World Service listener's panel, and registration .

Features of the prototype

After signing in, users are redirected to the homepage of the prototype. This page contains a set of manually curated programmes from the archive, a list of programmes recently listened to, and links to aggregations of topical content in the archive, generated from "on this day" and "in the news" information from .

On individual programmes in the archive, users are presented with data coming directly from the World Service archive database if it is present (e.g. synopsis, title, duration, broadcast date), an image, and a set of tags. Each tag can come from one of three different sources: it can be derived automatically from text associated with the programme in the World Service archive database, it can be derived automatically from the audio content itself using the framework described previously on this blog, or it can come directly from users. When logged in, users can upvote or downvote each individual tag. They can also add new tags through an auto-completed list, using and as target vocabularies. When generating the page, the aggregate of all those edits along with the initial weights assigned to each tag by our automated tools will be used in ranking the different tags. Only tags that have been upvoted more than they have been downvoted will be pushed to the search engine. An image also gets automatically assigned to each programme, using images associated with the top tags from . Users can manually override this image with images associated with other tags, which gives us implicit information on what tag describes the programme best. Users can navigate from those programmes to aggregations of programmes around particular topics, and on to other programmes.
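The vote handling described above can be illustrated with a minimal sketch. It is not the prototype's actual code: the `Tag` fields, the linear `vote_weight` bonus and the function names are assumptions made for illustration. The only rule taken from the description is that a tag reaches the search engine when it has more upvotes than downvotes.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    label: str
    initial_weight: float  # assumed score from the automated tools (0.0 for user-added tags)
    upvotes: int = 0
    downvotes: int = 0

    @property
    def net_votes(self) -> int:
        return self.upvotes - self.downvotes


def rank_tags(tags, vote_weight=0.1):
    """Order tags for display, highest score first.

    The score is the initial weight plus a linear bonus per net vote.
    This formula is a hypothetical choice; the post does not say how
    votes and initial weights are combined.
    """
    return sorted(
        tags,
        key=lambda t: t.initial_weight + vote_weight * t.net_votes,
        reverse=True,
    )


def tags_for_search(tags):
    """Keep only the labels of tags upvoted more often than downvoted."""
    return [t.label for t in tags if t.upvotes > t.downvotes]


example = [
    Tag("Smallpox", 0.6, upvotes=3),
    Tag("Vaccination", 0.8, downvotes=4),
    Tag("Medicine", 0.5),
]
print([t.label for t in rank_tags(example)])
print(tags_for_search(example))
```

In this sketch a tag with no votes stays on the programme page, ranked by its initial weight, but is not indexed for search until users confirm it.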

A programme in the World Service archive prototype

Users can also search through the archive using the search box at the top. This search indexes all textual content associated with programmes as well as tags that have emerged as 'good' tags from aggregated user interactions. Facets are displayed on the left-hand side to refine the results by e.g. year of broadcast.
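As a rough illustration of that behaviour, the sketch below matches a query against a programme's text and its accepted tags, then counts hits per broadcast year to build a facet. The dictionary keys (`title`, `synopsis`, `good_tags`, `year`) and the plain substring match are assumptions for this example; the prototype uses a real search engine whose configuration is not described here.

```python
from collections import Counter


def matches(programme, query):
    """True if the query occurs in the title, synopsis or accepted tags."""
    searchable = " ".join(
        [programme["title"], programme["synopsis"], *programme["good_tags"]]
    ).lower()
    return query.lower() in searchable


def search(programmes, query, year=None):
    """Return (hits, year_facets) for a query.

    Facet counts are computed over all hits before the optional year
    filter is applied, so the user can still see the other years.
    """
    hits = [p for p in programmes if matches(p, query)]
    year_facets = Counter(p["year"] for p in hits)
    if year is not None:
        hits = [p for p in hits if p["year"] == year]
    return hits, year_facets
```

Calling `search(programmes, "smallpox", year=1978)` would then return only the 1978 hits, alongside counts for every year in which the query matched.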

Search results for 'smallpox' in the World Service archive prototype

For most episodes, we have also generated automated speaker segmentations, so you can see the different voices in an episode and jump between these segments in the programme (we'll write a bit more about this in another blog post). Our tools can identify the distinct voices in a programme, but can't identify who the speakers are. We therefore let users name those speakers, and the name picked by the most users is displayed to everyone. Our tools also enable us to recognise speakers across programmes, so clicking on a speaker name takes users to an aggregation of all programmes featuring the same speaker. These user-contributed speaker names are automatically propagated to all the other programmes, where we ask users to manually approve or correct them. We can then use the resulting data to evaluate our speaker identification algorithm. We are in the process of improving this interface and at some point we will roll this feature out to other programmes.
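A small sketch of the naming logic described above, under stated assumptions: each recognised voice is taken to have a cross-programme identifier (`speaker_id` here), and the function names and the `"pending"` status are hypothetical. It shows the most-suggested name being picked, and that name being offered on other programmes for approval rather than applied outright.

```python
from collections import Counter


def displayed_name(suggestions):
    """Return the name suggested by the most users, or None if there are none.

    Ties are resolved in favour of the name suggested first, which is
    how Counter.most_common orders equal counts.
    """
    if not suggestions:
        return None
    return Counter(suggestions).most_common(1)[0][0]


def propagate_name(speaker_id, name, programmes_by_speaker, source_programme):
    """Suggest a name on every other programme featuring the same speaker.

    Each suggestion is marked 'pending' so that users can approve or
    correct it before it is treated as confirmed.
    """
    return {
        programme_id: {"name": name, "status": "pending"}
        for programme_id in programmes_by_speaker.get(speaker_id, [])
        if programme_id != source_programme
    }
```

The approvals and corrections collected on the pending suggestions are the kind of data that could then serve as ground truth when evaluating speaker identification.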

Conclusion

As we said at the beginning, the aim of this prototype is to test a new approach to publishing large archives with sparse or incorrect metadata. Our guiding principle has been to use algorithms and people, feeding off each other to make this metadata better. We use automated techniques to create the initial metadata and bootstrap a prototype, users to correct and improve this data, and then feed this information back to the algorithms to make them better. Hopefully this creates a useful feedback cycle that results in a better and better archive experience.

We are still gathering more data for this archive, and aim to use it to improve the prototype, both by improving the overall quality of the archive metadata and by better understanding user needs and behaviours. The questions we asked in the introduction of this post still need to be answered, but we feel this prototype and the community we are trying to grow around it give us a good mechanism to try and answer them.

We will soon publish two other posts focusing on this World Service archive prototype, one focusing on development, and one focusing on user experience.

About this blog

This is the Research & Development blog, where researchers, scientists and engineers from BBC R&D share their work in developing the media technologies of the future.
