Because that's how nostr works. It just hosts text, but that text can link to a video, gif, or image. This is to make the servers (relays) as robust as possible. Moving the data load off the actual relays and onto other servers makes it possible for more to be hosted.
So now we have relays that just handle the network traffic, and servers that host the image, gif, and video portion of the network. You can host all of it on your own machine if you don't wish to use others' infrastructure.
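To make that split concrete, here's a rough sketch in Python of what a relay actually stores versus what lives elsewhere. The media URL, the placeholder pubkey, and the `media_links` helper are all made up for illustration, and the note is left unsigned (a real event also carries an `id` and `sig`). Treat it as a toy, not as how any particular client does it.

```python
import json
import re

# Hypothetical media host; could be someone else's server or your own box.
MEDIA_URL = "https://media.example.com/cat.gif"

# A kind-1 (text note) event, left unsigned for brevity.
# The pubkey is a placeholder, not a real key.
note = {
    "pubkey": "0" * 64,
    "created_at": 1700000000,
    "kind": 1,
    "tags": [],
    "content": f"look at this cat {MEDIA_URL}",
}

# Assumed list of file extensions a client might treat as media.
MEDIA_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".webp", ".mp4", ".webm")


def media_links(content: str) -> list[str]:
    """Return the URLs in a note's text that look like media files."""
    urls = re.findall(r"https?://\S+", content)
    return [u for u in urls if u.lower().endswith(MEDIA_EXTENSIONS)]


# The relay only ever sees this small JSON text; the gif itself is never sent to it.
relay_payload = json.dumps(note)
print(len(relay_payload.encode("utf-8")), "bytes go to the relay")

# A client would pull these from the media server, not from the relay.
print(media_links(note["content"]))
```

The point is just that the relay's copy stays tiny no matter how big the gif or video is, since the client fetches the file from whatever host the link points at.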