Options include having a global clock provided by the network, or synchronising the clocks of the computers involved. Alternatively, we could simply carry a clock value in every packet and compute the clock offset at the receiver, much as NTP or DTS does.
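To make the NTP-style calculation concrete, here is a minimal sketch of the standard four-timestamp offset and delay computation. The function name and the assumption that all timestamps are in seconds are illustrative, not part of any particular implementation:

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """NTP-style clock comparison from four timestamps (all in seconds).

    t1: client's clock when the request is sent
    t2: server's clock when the request arrives
    t3: server's clock when the reply is sent
    t4: client's clock when the reply arrives

    Returns (offset, delay): the estimated offset of the server's
    clock relative to the client's, and the round-trip delay.
    """
    # The offset estimate averages the apparent offset on each leg,
    # cancelling the network delay if it is symmetric.
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    # Round-trip delay excludes the time the server spent processing.
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay
```

For example, with t1=0, t2=15, t3=16, t4=2 (a server clock running well ahead), the estimated offset is 14.5 seconds and the round-trip delay is 1 second.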
More generally, we could encapsulate the media in the same transmission stream. This is very effective, but it may entail computationally expensive work at the recipient to unravel the streams. For example, H.221 works like this; because it is designed to introduce only minimal delay, it is a bit-level framing protocol and is very hard to decode rapidly. Moreover, some recipients may not want, or may not be able, to display all of the media, or may lack the capacity to receive all of them without disruption to one stream or another.
Alternatively, we could use much the same scheme as is used to synchronise different sources at different sites. However, since media from the same source are timestamped by the same clock, the offset calculation is much simpler and can be done entirely within the receiver: the audio decoder and the video decoder can exchange messages inside the receiver and use them to align their playout points.
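A sketch of the intra-receiver calculation: because both decoders share the source's timestamp clock, each can report the local playout time of a given media timestamp, and the difference tells us which stream to delay. The function and its parameter names are hypothetical, assuming all values are in seconds:

```python
def playout_adjustment(audio_ts, audio_play, video_ts, video_play):
    """Compare the playout points of two streams timestamped by the
    same source clock.

    audio_ts, video_ts:     media timestamps (source clock, seconds)
    audio_play, video_play: local receiver times at which those
                            timestamps are (or would be) played out

    Returns the adjustment in seconds: positive means delay the
    audio playout by that amount; negative means delay the video.
    """
    # Latency of each stream: local playout time minus source timestamp.
    # The absolute clock offset between source and receiver cancels
    # when the two latencies are subtracted.
    audio_latency = audio_play - audio_ts
    video_latency = video_play - video_ts
    return video_latency - audio_latency
```

For instance, if the video pipeline runs 150 ms behind the audio pipeline, the function reports that audio playout should be delayed by 0.15 s to restore lip sync.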
For this latter approach to be accurate, the media must be timestamped at the "real" source, i.e. at the point of sampling rather than at the point of transmission.