Wow. I love that you just get in there mate, awesome.
It does not fully work as you intended, but it's getting a lot of interest, so you may be on to something worth committing some more time to.
If the vision is a watch party where people can watch the same thing together at the same time, the first disconnect is that you're using a VOD-style video (MP4). What's missing with VOD is any concept of "current time", so the client has no way of knowing where other sessions are in playback.
You would need to use a live HLS-style link (m3u8), where the server keeps track of "when is now" and, when a new viewer shows up, only presents the current chunk, so everyone ends up watching the same thing at the same time (this is how TV stations work).
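To make the "when is now" idea concrete, here's a rough Python sketch of how a server could work out which chunks to list from nothing but the clock. The function names, the 6-second chunk length, the 3-chunk window and the `seg{n}.ts` file names are all assumptions for illustration, not how any particular server does it.

```python
def current_sequence(now, started_at, segment_seconds):
    """Number of the chunk that is 'on air' at wall-clock time `now`."""
    elapsed = max(0.0, now - started_at)
    return int(elapsed // segment_seconds)


def live_playlist(now, started_at, segment_seconds=6, window=3):
    """Build a sliding-window playlist listing only the most recent chunks."""
    newest = current_sequence(now, started_at, segment_seconds)
    oldest = max(0, newest - window + 1)
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{segment_seconds}",
        f"#EXT-X-MEDIA-SEQUENCE:{oldest}",
    ]
    for n in range(oldest, newest + 1):
        lines.append(f"#EXTINF:{segment_seconds:.1f},")
        lines.append(f"seg{n}.ts")  # hypothetical chunk file name
    # No #EXT-X-ENDLIST tag: leaving it out is what marks the playlist as live.
    return "\n".join(lines)


# A viewer arriving 65 seconds after the channel started is only offered
# the chunks around "now", never the start of the video.
print(live_playlist(now=1065, started_at=1000))
```

Every viewer asking at the same moment gets the same short list, which is what keeps them in step. A plain MP4 has nothing like this, so each player just starts from zero.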
The same goes for presenting multiple "streaming" tag links. Clients only know what to do with one of them at a time, and can't tell what other viewers are watching.
What you're on to here is very cool.
If you truly want to have a crack, the missing piece is to take your MP4s, convert them into a chunked format, e.g. HLS (m3u8), and then broadcast that to a streaming server host (self-hosted or e.g. Zap Stream) to publish the Nostr event with only the one m3u8 streaming tag.
There is some off-the-shelf software that would let you do exactly that, e.g.
https://github.com/Eyevinn/channel-engine
Or
https://github.com/tin-robot/RTMP-Playlist
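For the "only the one m3u8 streaming tag" part, this is roughly the shape of the event I mean, sketched in Python. I'm assuming the NIP-53 live event kind (30311) here. The URL, title and identifier are placeholders, signing and relay publishing are left out, and a host may well publish this for you.

```python
import json
import time


def live_event_template(stream_url, title, identifier):
    """Unsigned live event carrying exactly one streaming tag."""
    return {
        "kind": 30311,  # live event kind, assuming NIP-53
        "created_at": int(time.time()),
        "content": "",
        "tags": [
            ["d", identifier],           # stable identifier for this stream
            ["title", title],
            ["streaming", stream_url],   # the single m3u8 link clients play
            ["status", "live"],
        ],
    }


event = live_event_template(
    "https://example.com/live/index.m3u8",  # placeholder URL
    "Watch party",
    "watch-party-1",
)
print(json.dumps(event, indent=2))
```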
Good luck and have fun!