I'd use readability.
Use it in my toolchain, works well.

Pandoc's pretty capable. I wrote a python library for HTML -> Markdown, long ago, when I was a lesser engineer. It worked against a pretty constrained set of HTML. Not sure how well it would work against The Internet. https://github.com/crossway/antimarkdown
whoa
how did I not know about this?
#[2] in 1 week