Pandoc's pretty capable. I wrote a Python library for HTML -> Markdown conversion long ago, when I was a lesser engineer. It worked against a pretty constrained set of HTML. I'm not sure how well it would work against The Internet. https://github.com/crossway/antimarkdown
