Today's ask-Twitter for prior art: What's a performant + reasonable approach to determine whether, for a given http URL, the "same" HTML content is available over https?
Assume that the page will have stuff in the body like absolute URLs with different schemes and CSRF tokens, and that two pages differing only in those should be treated as the "same" content for this use case.
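One way to picture that assumption is a normalization pass run on both bodies before any comparison. This is only a sketch: the `normalize` helper is hypothetical, it assumes CSRF tokens live in `<input>` tags whose markup mentions "csrf", and it uses regular expressions rather than a real HTML parser, so it will miss tokens carried in meta tags, scripts, or oddly quoted attributes.

```python
import re

# Matches the scheme part of absolute http/https URLs.
_SCHEME = re.compile(r"\bhttps?://", re.IGNORECASE)
# Matches a whole <input ...> tag.
_INPUT = re.compile(r"<input\b[^>]*>", re.IGNORECASE)
# Matches a quoted value="..." attribute inside a tag.
_VALUE = re.compile(r"""\bvalue\s*=\s*(["']).*?\1""", re.IGNORECASE)


def _mask_token(match):
    """Blank the value of an <input> tag that looks like a CSRF field."""
    tag = match.group(0)
    # Assumption: CSRF fields mention "csrf" somewhere in the tag.
    if "csrf" in tag.lower():
        return _VALUE.sub('value=""', tag)
    return tag


def normalize(html):
    """Erase scheme and CSRF-token differences so they compare as equal."""
    html = _SCHEME.sub("//", html)  # make absolute URLs scheme-relative
    return _INPUT.sub(_mask_token, html)
```

Two bodies that differ only in URL scheme and token value should then normalize to the same string; anything else that varies per request would still show up as a difference.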
My first straw man is a breadth-first compare of the first N tag names, where N might be as small as 10 or 20, to give a ballpark of what kind of accuracy / speed tradeoff I'd like to make.
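A rough sketch of what that straw man could look like, using only Python's standard library. The names `first_tags_bfs` and `looks_same` are made up for illustration, and the tree builder is deliberately naive (it does not apply the HTML spec's implied-end-tag rules), so treat it as a ballpark rather than a real parser.

```python
from collections import deque
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _TreeBuilder(HTMLParser):
    """Builds a minimal tree of (tag_name, children) tuples."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self._stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: record as a leaf.
        self._stack[-1][1].append((tag, []))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def first_tags_bfs(html, n=20):
    """Return the first n tag names in breadth-first order."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    tags = []
    queue = deque(builder.root[1])
    while queue and len(tags) < n:
        tag, children = queue.popleft()
        tags.append(tag)
        queue.extend(children)
    return tags


def looks_same(http_html, https_html, n=20):
    """Ballpark check: do both documents start with the same n tags?"""
    return first_tags_bfs(http_html, n) == first_tags_bfs(https_html, n)
```

Because only tag names are compared, differing URL schemes and token values in attributes or text are ignored for free; the cost is that two different pages built from the same template would also match.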
Replying to @hillbrad
If they use a common template, they might have the same HTML structure with different content. Can’t you diff the two raw HTML documents and allow a % variance? That allows for stock quotes that update, or timestamps, while still treating the pages as essentially the same content.

Aug 6, 2018 · 5:15 PM UTC
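The percent-variance idea from the reply could be sketched with `difflib` from the Python standard library. The function name `similar_enough` and the 5% default are assumptions chosen for illustration, not values from the thread; a real threshold would need tuning against actual pages.

```python
from difflib import SequenceMatcher


def similar_enough(http_html, https_html, max_variance=0.05):
    """True if the two raw HTML strings differ by at most max_variance.

    ratio() is 1.0 for identical strings and 0.0 for nothing in common,
    so the variance here is simply 1 - ratio.
    """
    matcher = SequenceMatcher(None, http_html, https_html, autojunk=False)
    return (1.0 - matcher.ratio()) <= max_variance
```

`SequenceMatcher` can take quadratic time on large inputs, so for the "performant" half of the question it may make sense to compare only a prefix of each document.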
