Audio and video synchronization is one of the most important pieces of logic in a player, and it has a direct impact on the user's experience. As explained in the previous article, audio and video are decoded and rendered separately, on two separate threads. Without timestamp-based synchronization, the audio and video of the played content drift out of sync. Video decoding and rendering is performed in MediaCodecVideoRenderer; audio decoding and playback is performed in MediaCodecAudioRenderer.

Video synchronization

The entry point for video decoding is the render function of MediaCodecRenderer, which takes two parameters:

public void render(long positionUs, long elapsedRealtimeUs)

elapsedRealtimeUs is the current timestamp. render calls drainOutputBuffer to read the decoded output buffer and prepare it for display, which eventually reaches MediaCodecVideoRenderer.processOutputBuffer:

  protected boolean processOutputBuffer(
      long positionUs,
      long elapsedRealtimeUs,
      MediaCodec codec,
      ByteBuffer buffer,
      int bufferIndex,
      int bufferFlags,
      long bufferPresentationTimeUs,
      boolean isDecodeOnlyBuffer,
      boolean isLastBuffer,
      Format format)
  • bufferPresentationTimeUs is the PTS of the frame
  • elapsedRealtimeUs is the timestamp at which this render pass started
    long earlyUs = bufferPresentationTimeUs - positionUs;

    // Fine-grained adjustment of earlyUs based on the elapsed time since the start of the current
    // iteration of the rendering loop.
    long elapsedSinceStartOfLoopUs = elapsedRealtimeNowUs - elapsedRealtimeUs;
    earlyUs -= elapsedSinceStartOfLoopUs;

    // Compute the buffer's desired release time in nanoseconds.
    long systemTimeNs = System.nanoTime();
    long unadjustedFrameReleaseTimeNs = systemTimeNs + (earlyUs * 1000);

    // Apply a timestamp adjustment, if there is one.
    long adjustedReleaseTimeNs = frameReleaseTimeHelper.adjustReleaseTime(
        bufferPresentationTimeUs, unadjustedFrameReleaseTimeNs);
    earlyUs = (adjustedReleaseTimeNs - systemTimeNs) / 1000;
  • positionUs can be thought of as the audio playback position (the master clock)
  • bufferPresentationTimeUs can be thought of as the PTS of the video frame
  • elapsedSinceStartOfLoopUs is the current timestamp minus the timestamp at which render was invoked; it compensates for the time already spent in this iteration of the rendering loop
  • earlyUs -= elapsedSinceStartOfLoopUs refines earlyUs, which is the difference between the video frame's PTS and the audio position (see the worked example after this list)
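
To make the sign convention concrete, here is a small worked example with made-up numbers:

    // Suppose the audio clock (positionUs) is at 1_000_000 us and the next video
    // frame has PTS (bufferPresentationTimeUs) 1_020_000 us.
    long earlyUs = 1_020_000L - 1_000_000L;    // = 20_000 us: the frame is 20 ms early.
    // If 5 ms have already elapsed in this render() iteration, compensate for it:
    long elapsedSinceStartOfLoopUs = 5_000L;
    earlyUs -= elapsedSinceStartOfLoopUs;      // = 15_000 us still to go before display.
    // A negative earlyUs would mean the frame is already late relative to the audio clock.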

The next step is to calibrate the display time, which is the job of frameReleaseTimeHelper.adjustReleaseTime (implemented in VideoFrameReleaseTimeHelper). Its closestVsync function finds the vsync closest to the desired release time:

    // Find the timestamp of the closest vsync. This is the vsync that we're targeting.
    long snappedTimeNs = closestVsync(adjustedReleaseTimeNs, sampledVsyncTimeNs, vsyncDurationNs);
  private static long closestVsync(long releaseTime, long sampledVsyncTime, long vsyncDuration) {
    long vsyncCount = (releaseTime - sampledVsyncTime) / vsyncDuration;
    long snappedTimeNs = sampledVsyncTime + (vsyncDuration * vsyncCount);
    long snappedBeforeNs;
    long snappedAfterNs;
    if (releaseTime <= snappedTimeNs) {
      snappedBeforeNs = snappedTimeNs - vsyncDuration;
      snappedAfterNs = snappedTimeNs;
    } else {
      snappedBeforeNs = snappedTimeNs;
      snappedAfterNs = snappedTimeNs + vsyncDuration;
    }
    long snappedAfterDiff = snappedAfterNs - releaseTime;
    long snappedBeforeDiff = releaseTime - snappedBeforeNs;
    return snappedAfterDiff < snappedBeforeDiff ? snappedAfterNs : snappedBeforeNs;
  }
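To see how the snapping works, plug in some round, purely illustrative numbers (a real 60 Hz display has a vsync duration of roughly 16.7 ms, not 16 ms):

    // Illustrative values: sampledVsyncTime = 0, vsyncDuration = 16_000_000 ns,
    // desired releaseTime = 50_000_000 ns.
    // vsyncCount        = 50_000_000 / 16_000_000 = 3
    // snappedTimeNs     = 0 + 3 * 16_000_000      = 48_000_000
    // releaseTime > snappedTimeNs, so snappedBefore = 48_000_000, snappedAfter = 64_000_000
    // snappedAfterDiff  = 14_000_000, snappedBeforeDiff = 2_000_000
    // The earlier vsync is closer, so the frame is released at 48_000_000 ns.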

VideoFrameReleaseTimeHelper.java defines an inner class:

private static final class VSyncSampler implements FrameCallback, Handler.Callback {

  @Override
  public void doFrame(long vsyncTimeNs) {
    sampledVsyncTimeNs = vsyncTimeNs;
    choreographer.postFrameCallbackDelayed(this, CHOREOGRAPHER_SAMPLE_DELAY_MILLIS);
  }
}

The doFrame callback provides the exact vsync timestamp, since it is invoked at the moment the Choreographer signals a vsync.
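
For readers who want to experiment outside of ExoPlayer, the same sampling idea can be sketched directly against the Android Choreographer API. This is only a minimal illustration, not ExoPlayer's VSyncSampler: it assumes it is started on a thread with a Looper (such as the main thread), and the 500 ms re-sampling delay is an arbitrary choice:

  import android.view.Choreographer;

  final class VsyncSamplerSketch implements Choreographer.FrameCallback {

    private volatile long lastVsyncTimeNs;

    void start() {
      // Choreographer is per-thread; this must run on a Looper thread.
      Choreographer.getInstance().postFrameCallback(this);
    }

    @Override
    public void doFrame(long frameTimeNanos) {
      // frameTimeNanos is the vsync timestamp of the frame being drawn.
      lastVsyncTimeNs = frameTimeNanos;
      // Re-register with a delay so vsync is only sampled occasionally.
      Choreographer.getInstance().postFrameCallbackDelayed(this, /* delayMillis= */ 500);
    }

    long getLastVsyncTimeNs() {
      return lastVsyncTimeNs;
    }
  }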

After the above processing, the calibrated release time is much more accurate.

boolean treatDroppedBuffersAsSkipped = joiningDeadlineMs != C.TIME_UNSET;
if (shouldDropBuffersToKeyframe(earlyUs, elapsedRealtimeUs, isLastBuffer)
    && maybeDropBuffersToKeyframe(
        codec, bufferIndex, presentationTimeUs, positionUs, treatDroppedBuffersAsSkipped)) {
  return false;
} else if (shouldDropOutputBuffer(earlyUs, elapsedRealtimeUs, isLastBuffer)) {
  if (treatDroppedBuffersAsSkipped) {
    skipOutputBuffer(codec, bufferIndex, presentationTimeUs);
  } else {
    dropOutputBuffer(codec, bufferIndex, presentationTimeUs);
  }
  return true;
}

if (Util.SDK_INT >= 21) {
  // Let the underlying framework time the release.
  if (earlyUs < 50000) {
    notifyFrameMetadataListener(
        presentationTimeUs, adjustedReleaseTimeNs, format, currentMediaFormat);
    renderOutputBufferV21(codec, bufferIndex, presentationTimeUs, adjustedReleaseTimeNs);
    return true;
  }
} else {
  // We need to time the release ourselves.
  if (earlyUs < 30000) {
    if (earlyUs > 11000) {
      // We're a little too early to render the frame. Sleep until the frame can be rendered.
      // Note: The 11ms threshold was chosen fairly arbitrarily.
      try {
        // Subtracting 10000 rather than 11000 ensures the sleep time will be at least 1ms.
        Thread.sleep((earlyUs - 10000) / 1000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    notifyFrameMetadataListener(
        presentationTimeUs, adjustedReleaseTimeNs, format, currentMediaFormat);
    renderOutputBuffer(codec, bufferIndex, presentationTimeUs);
    return true;
  }
}

The section above uses the calibrated earlyUs to decide whether to drop the frame, skip it, or wait. If earlyUs is positive, the video frame should be displayed after the current system time; in other words, it has arrived early. If earlyUs is negative, the frame should have been displayed before the current system time, i.e. it has arrived late. If the frame is late by more than a certain threshold, it is dropped and never displayed. If the frame is early by more than the preset threshold (about 50 ms ahead of its target time), the renderer simply waits for the next iteration of the render loop (roughly 10 ms later) and re-evaluates; otherwise the frame is rendered.
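
A condensed sketch of that decision flow might look like the following; the class, the method names and the exact late-drop threshold are illustrative, not ExoPlayer's, and only mirror the structure of the snippet above:

  final class FrameDecisionSketch {

    enum Action { DROP, RENDER, WAIT }

    static Action decide(long earlyUs) {
      if (earlyUs < -30_000) {
        // Frame is clearly late relative to the audio clock: drop it.
        return Action.DROP;
      } else if (earlyUs < 50_000) {
        // Frame is on time or less than 50 ms early: release it for display.
        return Action.RENDER;
      } else {
        // Frame is far too early: wait for the next render() pass (~10 ms) and re-check.
        return Action.WAIT;
      }
    }
  }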

Audio synchronization

MediaCodecAudioRenderer.getPositionUs returns the audio timestamp:

  @Override
  public long getPositionUs() {
    if (getState() == STATE_STARTED) {
      updateCurrentPosition();
    }
    return currentPositionUs;
  }
  private void updateCurrentPosition() {
    long newCurrentPositionUs = audioSink.getCurrentPositionUs(isEnded());
    if (newCurrentPositionUs != AudioSink.CURRENT_POSITION_NOT_SET) {
      currentPositionUs =
          allowPositionDiscontinuity
              ? newCurrentPositionUs
              : Math.max(currentPositionUs, newCurrentPositionUs);
      allowPositionDiscontinuity = false;
    }
  }

This in turn calls DefaultAudioSink.getCurrentPositionUs:

@Override
public long getCurrentPositionUs(boolean sourceEnded) {
  if (!isInitialized() || startMediaTimeState == START_NOT_SET) {
    return CURRENT_POSITION_NOT_SET;
  }
  long positionUs = audioTrackPositionTracker.getCurrentPositionUs(sourceEnded);
  positionUs = Math.min(positionUs, configuration.framesToDurationUs(getWrittenFrames()));
  return startMediaTimeUs + applySkipping(applySpeedup(positionUs));
}

AudioTrackPositionTracker.getCurrentPositionUs, in AudioTrackPositionTracker.java, tracks the playback position of the AudioTrack:

  private long getPlaybackHeadPositionUs() {
    return framesToDurationUs(getPlaybackHeadPosition());
  }

  private long framesToDurationUs(long frameCount) {
    return (frameCount * C.MICROS_PER_SECOND) / outputSampleRate;
  }

The getPlaybackHeadPosition function returns the number of audio frames played so far. Since audio plays at a fixed sample rate, the current playback position can be derived from the frame count and the output sample rate.
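
As a small worked example of that conversion (assuming an output sample rate of 48 kHz):

  // Same conversion as framesToDurationUs above (C.MICROS_PER_SECOND == 1_000_000).
  static long framesToDurationUs(long frameCount, int outputSampleRate) {
    return (frameCount * 1_000_000L) / outputSampleRate;
  }

  // Example: framesToDurationUs(24_000, 48_000) == 500_000 us,
  // i.e. the playback head is half a second into the stream.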

Summary

There are three common approaches to audio and video synchronization: 1. Use audio as the baseline and have video follow the audio. 2. Use video as the baseline and have audio follow the video. 3. Use an external clock as a common baseline to which both audio and video converge.

ExoPlayer uses the first approach: audio is the master clock, and video frames are released relative to it.