
Audio codec, priming samples and padding, without container or information from the encoder #626

Open
padenot opened this issue Jan 13, 2023 · 13 comments
Labels: CR Blocking (needs to be resolved for Candidate Recommendation), need-definition (an issue where something needs to be specified normatively)

@padenot
Collaborator

padenot commented Jan 13, 2023

How are implementations supposed to handle encoder delay / priming samples / packet padding?

If PCM is encoded with a particular codec, and then decoded, does it roundtrip properly? I don't think it can, because the number of priming samples is not returned by the encoder, so there's no way to mux the file properly to signal this (e.g., using an edit list in an mp4, or any other codec/container pair specific way).

It's possible to make a guess for some codecs (e.g. AAC is frequently 2112 frames), but this isn't ideal (and sometimes it depends on the implementation of a codec).

This matters for accurate synchronization in general, and it is a hard requirement for musical applications, because missing this information breaks sample-accurate looping.
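To make this concrete, here's a minimal sketch using the current API (the codec and bitrate are just examples, and the exact amount of priming/padding depends on the encoder implementation): a PCM buffer that is encoded and then decoded does not come back with the same number of frames, and nothing in the API says how much to trim.

```ts
// Minimal sketch of the round-trip problem: encode 1 s of silence, decode it
// back, and compare frame counts. Codec and bitrate are illustrative only.
const sampleRate = 48000;
const inputFrames = 48000;

const chunks: EncodedAudioChunk[] = [];
let decoderConfig: AudioDecoderConfig | undefined;

const encoder = new AudioEncoder({
  output: (chunk, metadata) => {
    chunks.push(chunk);
    decoderConfig ??= metadata?.decoderConfig;
  },
  error: (e) => console.error(e),
});
encoder.configure({ codec: "opus", sampleRate, numberOfChannels: 1, bitrate: 96000 });

encoder.encode(new AudioData({
  format: "f32",
  sampleRate,
  numberOfFrames: inputFrames,
  numberOfChannels: 1,
  timestamp: 0,
  data: new Float32Array(inputFrames),
}));
await encoder.flush();

let decodedFrames = 0;
const decoder = new AudioDecoder({
  output: (audioData) => { decodedFrames += audioData.numberOfFrames; audioData.close(); },
  error: (e) => console.error(e),
});
decoder.configure(decoderConfig ?? { codec: "opus", sampleRate, numberOfChannels: 1 });
for (const chunk of chunks) decoder.decode(chunk);
await decoder.flush();

// decodedFrames typically differs from inputFrames (pre-skip at the start,
// silent padding at the end), and the API gives no way to know how much to trim.
console.log({ inputFrames, decodedFrames });
```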

@dalecurtis
Contributor

Agreed! Chrome doesn't do anything today for this. @tguilbert-google

@tguilbert-google
Member

You're right.

Chrome pads the end of Opus encodes with silence.

Would the idea be to spec how much priming/padding should be used, or just to expose the information and let users trim accordingly?

@tguilbert-google
Member

As briefly discussed with Dale today, this could probably be an addition to EncodedAudioChunkMetadata.

@padenot
Collaborator Author

padenot commented Feb 7, 2023

A design that would be fairly clean would be to have this information on the EncodedAudioChunks themselves.

On the encode side, the encoder would output chunks whose EncodedAudioChunk.timestamp can be negative, and whose EncodedAudioChunk.duration can differ from the number of audio frames actually obtained when decoding (internally).

Then, when decoding, the rule is clear: if the pts is negative, the output frames are trimmed up to a start time of zero, and if the duration is different from the number of decoded frames, the frames are also trimmed so that their number matches the duration. This can be done by the Web Codecs implementation, so there's no confusion.

This means that encoding/decoding round-trips properly without having to do anything, and that there is no extradata to carry at any point: each packet carries the information. Additionally, with this design it's now impossible to ignore the fact that some audio isn't to be rendered, which generally yields better output. Moreover, comparing timestamps for audio and video makes sense, because they're on the same timeline and not offset by the encoder delay. This also makes looping of audio assets seamless by default when looping the PCM output (keeping in mind that decoders need to be reset, and that all packets, including the ones with negative timestamps, need to be decoded again for the output to be correct on the second loop iteration).

On the other side of this issue, authors who write demuxers for use with Web Codecs can create EncodedAudioChunks with negative timestamps / adjusted durations, if they know the encoder delay / padding. Depending on the container, some subtraction will be needed.

Authors who write muxers will be able to look at the timestamps of EncodedAudioChunks and map them to whatever the container prefers (e.g. the same negative timestamps, or sometimes an explicit encoder delay plus a total media time from which the padding is derived, etc.).
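As a sketch of the trimming a UA would conceptually perform under this proposal (the function, its names, and the example numbers are illustrative, not spec text):

```ts
// Sketch of the proposed semantics: given a chunk's (possibly negative)
// timestamp and its duration, compute how many decoded frames to drop.
function framesToTrim(
  timestampUs: number,   // EncodedAudioChunk.timestamp (µs), may be negative
  durationUs: number,    // EncodedAudioChunk.duration (µs)
  decodedFrames: number, // frames actually produced for this chunk
  sampleRate: number,
): { fromStart: number; fromEnd: number } {
  // Negative timestamp: everything before t = 0 is priming and is dropped.
  const fromStart = timestampUs < 0
    ? Math.min(decodedFrames, Math.round(-timestampUs * sampleRate / 1e6))
    : 0;
  // Duration shorter than the decoded output: the excess at the end is padding.
  const announcedFrames = Math.round(durationUs * sampleRate / 1e6);
  const fromEnd = Math.max(0, decodedFrames - fromStart - announcedFrames);
  return { fromStart, fromEnd };
}

// Illustrative first AAC packet: timestamp = -44000 µs (2112 frames of priming
// at 48 kHz), 1024 decoded frames -> all 1024 frames are trimmed.
framesToTrim(-44000, 21333, 1024, 48000);
```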

@dalecurtis
Contributor

This is roughly the scheme ffmpeg uses, and while it works, it does have some warts:

  • You can't assume start time is zero; e.g., first packet timestamp might be 0 < timestamp < padding.
  • High sample rates will affect accuracy of a timestamp/duration relative to frames trimmed. E.g., 1/768000 falls below microsecond precision.
  • Users may not always want negative timestamps trimmed.

Having an explicit sample count to trim resolves these issues. In general I haven't loved this scheme over the years, but it is mostly tried and true. So long as there is a precise set of WPT tests for each codec, whatever scheme we choose is fine though.

@padenot
Collaborator Author

padenot commented Feb 8, 2023

  • You can't assume start time is zero; e.g., first packet timestamp might be 0 < timestamp < padding.

In general, we set the first packet timestamp to be negative (when that indicates the preroll), not something between 0 and the padding (maybe you meant preroll/encoder delay?).

It's also frequently 0, or it can be anything greater than zero, but that doesn't have a relation to the encoder delay either: it can be any file that's been demuxed/remuxed without re-encoding or any other processing (e.g. part of a live stream). That's a bit broken, because quite a few codecs need this delay, so rendering might be wrong, unless the author splits the file a bit before the intended start point and then correctly discards the first few decoded output buffers.

But here Web Codecs will just decode the packets, and it is assumed folks know what they are doing. A demuxer or post-demuxing/pre-decoding pass can also rebase the clock if it knows the intended start point, and offset everything, so it's back to signaling the encoder delay, discarding, and everything is fine again.

If we find EncodedAudioChunks with weird/negative timestamps because the demuxer is passing timestamps through without looking at them, anything might happen indeed.

  • High sample rates will affect accuracy of a timestamp/duration relative to frames trimmed. E.g., 1/768000 falls below microsecond precision.

This is technically true, but I'm not sure that it matters, because 768kHz audio won't be in AAC (max 192kHz) or MP3 (max 48kHz) or Opus (always 48kHz), or Vorbis (looks like 192kHz) or whatever else but PCM, not even flac (max (2 << 15) * 10 Hz, which is 655360Hz). And WAV doesn't have encoder delay / padding.

  • Users may not always want negative timestamps trimmed.

If negative timestamps are considered to be pre-roll / padding, then I'm not sure when they'd like to have them, but maybe I'm missing something.

Having pre-roll/padding separately and in sample-frame does indeed resolve some issues at the expense of requiring all authors to write the same code for most codecs, which I don't find satisfactory because that won't happen, and web apps will generally be worse.

We can also stick two more properties on EncodedAudioChunk and have a facility on AudioData to be able to automatically get the PCM data correctly trimmed (probably with two more attributes). We could make those sample-frame-based and not microsecond based. During the initial design, we originally kind of said that for any precise audio work, authors would have to count sample-frames anyway. This would also allow authors to get the audio data in PCM that's not to be rendered, although I’m not sure why anybody would want to do that.
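For reference, the per-codec trimming that authors have to write by hand today looks roughly like this, using the existing AudioData.copyTo() options; the trim counts have to come from out-of-band knowledge, which is exactly the problem:

```ts
// Sketch of the manual trimming every author has to write today. The trim
// counts (e.g. 2112 priming frames for AAC) must be known out of band.
function copyTrimmed(
  audioData: AudioData,
  trimStartFrames: number,
  trimEndFrames: number,
): Float32Array[] {
  const frames = audioData.numberOfFrames - trimStartFrames - trimEndFrames;
  if (frames <= 0) return [];
  const planes: Float32Array[] = [];
  for (let ch = 0; ch < audioData.numberOfChannels; ch++) {
    const plane = new Float32Array(frames);
    // Conversion to "f32-planar" is always supported; frameOffset/frameCount
    // skip the priming and padding regions.
    audioData.copyTo(plane, {
      planeIndex: ch,
      frameOffset: trimStartFrames,
      frameCount: frames,
      format: "f32-planar",
    });
    planes.push(plane);
  }
  return planes;
}
```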

Also, to answer your last point, I've got a rather extensive test suite for this precise problem (to test Firefox's AudioContext.decodeAudioData, HTMLMediaElement.duration and sample-accurate seamless HTMLMediaElement.loop), and it's easily portable to WPT in the sense that it doesn't use any internal testing facility/API.

It's a collection of WAV files generated exactly for this purpose: their length is known (sometimes exactly 0.5 s, sometimes exactly 1.0 s), they cover various sample rates and channel counts, and they loop seamlessly without discontinuity. They were then converted with all the encoders I have access to, for all the codecs we care about, and muxed in all the containers that make sense for those codecs, sometimes with various ways of indicating preroll/padding (depending on the container). It's fairly probable that I'll end up extending those tests and porting them to WPT while implementing the audio side of Web Codecs.

@dalecurtis
Contributor

In general, we set the first packet timestamp to be negative (when that indicates the preroll), not something between 0 and the padding (maybe you meant preroll/encoder delay?).

I meant we've seen streams where the first packet has discard padding and the pts > 0. If negative timestamps are the only way to express discard, that will require client-side manipulation.

If we find EncodedAudioChunks with weird/negative timestamps because the demuxer is passing timestamps through without looking at them, anything might happen indeed.

This will definitely happen :) We still get a couple weird timestamp bugs each quarter.

This is technically true, but I'm not sure that it matters, because 768kHz audio won't be in AAC (max 192kHz) or MP3 (max 48kHz) or Opus (always 48kHz), or Vorbis (looks like 192kHz) or whatever else but PCM, not even flac (max (2 << 15) * 10 Hz, which is 655360Hz). And WAV doesn't have encoder delay / padding.

Correct, I mention it only as future proofing.

  • Users may not always want negative timestamps trimmed.

If negative timestamps are considered to be pre-roll / padding, then I'm not sure when they'd like to have them, but maybe I'm missing something.

Maybe this is what you meant by padding, but mp4 edit lists may shift arbitrary amounts before zero for discard -- so it's not always just decoder delay.

Having pre-roll/padding separately and in sample-frame does indeed resolve some issues at the expense of requiring all authors to write the same code for most codecs, which I don't find satisfactory because that won't happen, and web apps will generally be worse.

We can also stick two more properties on EncodedAudioChunk and have a facility on AudioData to be able to automatically get the PCM data correctly trimmed (probably with two more attributes).

Sorry, I wasn't arguing for both separately and in-frame, but rather for either in-frame or metadata. My preference would be EncodedAudioChunkMetadata if we want clients to handle it manually, and AudioDecoderConfig if we want the UA to handle things automatically. I pictured this as discardFrames = {fromStart: #, fromEnd: #} on one of those structures.

I like this approach since container discard metadata is often specified as a duration / frame count / time range without regard to how many frames each individual packet has. By putting it on the chunk metadata / decoder config, we avoid clients manually mapping this data onto input chunks (which may require codec-level understanding) or worrying about shifting timestamps across the entire stream.

I think the AudioDecoderConfig approach requires less consideration (one shot versus per chunk) and less code than negative timestamps -- which AFAIK aren't common outside of Ogg containers. E.g., mp3 and ADTS use Xing headers and mp4 uses iTunSMPB or elst.
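A rough sketch of the two placements, with purely hypothetical shapes echoing the discardFrames idea above (none of this exists in the spec):

```ts
// Hypothetical shapes only; the names mirror the discardFrames idea from this thread.
interface DiscardFrames {
  fromStart: number; // sample-frames to drop from the start of the decoded output
  fromEnd: number;   // sample-frames to drop from the end of the decoded output
}

// Option 1: clients handle the trimming themselves.
interface EncodedAudioChunkMetadataWithDiscard extends EncodedAudioChunkMetadata {
  discardFrames?: DiscardFrames;
}

// Option 2: the UA trims automatically, driven by the decoder configuration.
interface AudioDecoderConfigWithDiscard extends AudioDecoderConfig {
  discardFrames?: DiscardFrames;
}
```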

elst can cut from anywhere, but I feel we already provide the necessary AudioData.copyTo() options to handle such cuts, so I don't see a lot of value in providing per-frame discard annotations.

Also, to answer your last point, I've got a rather extensive test suite for this precise problem (to test Firefox's AudioContext.decodeAudioData, HTMLMediaElement.duration and sample-accurate seamless HTMLMediaElement.loop), and it's easily portable to WPT in the sense that it doesn't use any internal testing facility/API.

Great! That will be really helpful.

@dalecurtis
Contributor

Sorry, I just realized fromEnd won't work, since we don't know EOS ahead of time, and unless it's Vorbis, most codecs generate output for each input. So I'd amend my suggested solution to just discardFrontFrames or discardFrontDuration, while acknowledging that this can't round-trip a signal cleanly without the client manually trimming.

I'm also worried about reusing timestamp and duration, since containers are infamous for these fields being imprecise. I think we'd apply trimming accidentally more often than intentionally.

So my vote is still for explicit fields: either both on the chunk (accepting that discardFrontFrames may be larger than a single packet), or discardFrontFrames only on the config and discardEndFrames on the chunk, to ensure values are always limited to a single packet (and no cumulative/cancellation effects need to be considered).

We can also stick two more properties on EncodedAudioChunk and have a facility on AudioData to be able to automatically get the PCM data correctly trimmed (probably with two more attributes).

I didn't follow this point. Why would we need something on AudioData? I assumed we're providing this mechanism on the chunk to automatically remove padding frames; so AudioData would only have the trimmed signal in it.

@dalecurtis
Contributor

One new proposal:

  • Only do things on the config, keep last N end-trimming samples and require flush() to get them.

@dalecurtis added the need-definition (an issue where something needs to be specified normatively) label on Mar 16, 2023
@padenot added the CR Blocking (needs to be resolved for Candidate Recommendation) label on Aug 26, 2024
@padenot
Collaborator Author

padenot commented Aug 26, 2024

I'd like this fixed before going to CR, because it makes it impossible to render audio in a way that's 100% correct. I'll propose something based on the previous discussion.

@padenot
Collaborator Author

padenot commented Aug 27, 2024

Only do things on the config, keep last N end-trimming samples and require flush() to get them.

What does this mean? How does your decoder know how to trim the last packet?

I've implemented the following (see the sketch after the list):

  • New attribute on the config: primingFrames
  • The first primingFrames audio frames are discarded for each decoding session (i.e. after a successful configure(...)), starting from the first packet (or sometimes the second packet, which is typical when dealing with AAC). Empty packets aren't output (also typical with AAC).
  • All timestamps except the timestamp of the first packet are moved back by primingFrames frames, so that PCM time matches what was compressed. The duration member is adjusted accordingly.
  • A question that arose when implementing this: what happens if primingFrames disagrees with something found in description? This only affects Opus, I think. I think the primingFrames member should be overridden, with a console.warn(...) potentially emitted.
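A sketch of what using the hypothetical primingFrames attribute could look like, with an illustrative output timeline (the numbers assume 1024-frame AAC packets and 2112 priming frames; none of this is spec text):

```ts
// Hypothetical usage of the primingFrames proposal; the attribute does not
// exist in the spec today, and all numbers are illustrative.
const decoder = new AudioDecoder({
  output: (audioData) => console.log(audioData.timestamp, audioData.numberOfFrames),
  error: (e) => console.error(e),
});

decoder.configure({
  codec: "mp4a.40.2",
  sampleRate: 48000,
  numberOfChannels: 1,
  primingFrames: 2112, // hypothetical: frames discarded after configure()
} as AudioDecoderConfig);

// Expected output with 1024-frame AAC packets:
//   packet 0 -> no output (1024 frames, all priming)
//   packet 1 -> no output (1024 frames, all priming)
//   packet 2 -> AudioData { timestamp: 0,     numberOfFrames: 960 }  (64 frames dropped)
//   packet 3 -> AudioData { timestamp: 20000, numberOfFrames: 1024 }
```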

@dalecurtis
Contributor

One way was to have discardFrontFrames, which works the same as the primingFrames you've proposed, and discardEndFrames which works by holding that many samples of output until a flush() occurs.

If instead we put those same fields on the chunks, the semantics would be similar but flush wouldn't be required: we'd discard discardFrontFrames frames from the front of the next decoded buffer (and following ones, if needed), and discardEndFrames frames from the end of the next decoded buffer (and following ones, if needed).

The second idea solves your description issue by just stating that these are always additional frames discarded. If we allow carryover between chunks, we need to specify how these fields interact with flush().
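To make the two variants concrete (every discard* field below is hypothetical, and packetBytes stands in for real encoded data):

```ts
declare const packetBytes: Uint8Array; // placeholder for a real encoded packet
const decoder = new AudioDecoder({ output: (d) => d.close(), error: console.error });

// Variant A: both counts on the config; end trimming is withheld until flush(),
// so the decoder knows which output is last. Field names are hypothetical.
decoder.configure({
  codec: "opus",
  sampleRate: 48000,
  numberOfChannels: 2,
  discardFrontFrames: 312, // hypothetical
  discardEndFrames: 192,   // hypothetical; only applied at flush()
} as AudioDecoderConfig);

// Variant B: counts on the chunks; always "additional frames to discard", so no
// carryover/cancellation bookkeeping and no flush() interaction are needed.
decoder.decode(new EncodedAudioChunk({
  type: "key",
  timestamp: 0,
  duration: 20000,
  data: packetBytes,
  discardFrontFrames: 312, // hypothetical; applies to this chunk's decoded output
} as EncodedAudioChunkInit));
```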

@padenot
Collaborator Author

padenot commented Aug 28, 2024

I now understand, thanks. While it certainly works, I find this re-queueing a bit heavy-handed: it requires copies, and potentially adds latency or requires calling flush() a lot.

I think I'd prefer to add something to trim the end padding on the packet itself, leaving most packets untouched except the first and the last, as usually happens.
