Clipping to MD does not clip all `a` elements

chrillek · January 13, 2023, 1:31pm

In another thread, I was led to a web document whose “cluttered” MD clip lacks (at least) the final three links.
The URL: A Practical Introduction to Web Scraping in Python – Real Python

The last three links:

The preview of the MD document as clipped to DT (there’s nothing after the last line shown here!):

The (end of the) source of the MD document as clipped to DT:

## Additional Resources

For more information on web scraping with Python, check out the following resources:

[1]: https://realpython.com/learning-paths/data-science-python-core-skills/
[2]: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
[3]: https://docs.python.org/3/library/
[4]: https://realpython.com/urllib-request/
[5]: https://realpython.com/python-encodings-guide/#unicode-vs-utf-8
[6]: https://realpython.com/python-print/
[7]: https://realpython.com/html-css-python/
[8]: https://files.realpython.com/media/website_aphrodite.10b67047ebc2.png
[9]: https://docs.python.org/3/library/re.html
[10]: https://realpython.com/python-lists-tuples/
[11]: https://realpython.com/replace-string-python/#leverage-resub-to-make-complex-rules
[12]: https://realpython.com/replace-string-python/
[13]: https://realpython.com/python-for-loop/
[14]: https://beautiful-soup-4.readthedocs.io/en/latest/
[15]: http://olympus.realpython.org/profiles/dionysus
[16]: https://files.realpython.com/media/website_dionysos_page.8d7be251d9a0.png
[17]: https://realpython.com/python-xml-parser/#lxml-use-elementtree-on-steroids
[18]: http://olympus.realpython.org/profiles
[19]: https://pypi.org/
[20]: https://mechanicalsoup.readthedocs.io/en/stable/
[21]: https://realpython.com/what-is-pip/
[22]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
[23]: http://olympus.realpython.org/login
[24]: https://files.realpython.com/media/website_login.739f488fbe74.png
[25]: https://files.realpython.com/media/website_dice.3cdd09061f55.png
[26]: http://olympus.realpython.org/dice
[27]: https://developer.mozilla.org/en-US/docs/Web/CSS/ID_selectors
[28]: https://realpython.com/python-sleep/
[29]: https://realpython.com/python-time-module/
[30]: https://realpython.com/python-conditional-statements/

As you can see, the three links themselves are missing; only the reference definitions for links from the rest of the document are there. This is, let’s say, unfortunate. One would expect the whole document to be clipped, particularly in cluttered mode, and all the more so since the HTML does nothing weird:

<ul>
<li><a href="https://realpython.com/beautiful-soup-web-scraper-python/">Beautiful Soup: Build a Web Scraper With Python</a></li>
<li><a href="https://realpython.com/api-integration-in-python/">API Integration in Python</a></li>
<li><a href="https://realpython.com/python-api/">Python &amp; APIs: A Winning Combo for Reading Public Data</a></li>
</ul>

eboehnisch · January 16, 2023, 4:20pm

This looks like over-aggressive decluttering. In the HTML code, do the missing three links differ in any way from the others?

chrillek · January 16, 2023, 4:55pm

I didn’t turn de-cluttering on (in fact, it doesn’t even seem available for MD).

Not that I can see. Here’s a sample HTML excerpt with a link that gets clipped:

<p>The number <code>200</code> represents the 
<a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status">status code</a> 
returned by the request. A status code of <code>200</code> means that the request was successful. An unsuccessful request might show a status code of <code>404</code> if the URL doesn’t exist or <code>500</code> if there’s a server error when making the request.</p>

The only difference I can see is that the clipped links appear within running text, whereas the omitted ones are the only content of their `li` elements.

Also, why are the clipped links all written as reference-style links, with their URLs collected at the end of the document, rather than inline in the MD text?
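
For anyone who prefers inline links, the reference definitions can be folded back into the text after clipping. The following is only a minimal sketch of that idea, not anything DEVONthink does itself: the function name `inline_references` is made up for illustration, it assumes the clipped body uses the `[text][22]` form matching the `[22]: URL` definitions shown above, and it ignores finer points of Markdown such as case-insensitive labels or link titles.

```python
import re

# Matches a reference definition line such as "[22]: https://example.com/"
DEFINITION = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*$", re.MULTILINE)
# Matches a reference-style link such as "[status code][22]"
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def inline_references(markdown):
    """Rewrite [text][label] links as [text](url) and drop the definitions."""
    # Collect label -> URL from the definition lines.
    urls = {label: url for label, url in DEFINITION.findall(markdown)}
    # Remove the definition lines from the document body.
    body = DEFINITION.sub("", markdown)

    def replace(match):
        text, label = match.group(1), match.group(2)
        url = urls.get(label)
        # Leave the link untouched if its label has no definition.
        return f"[{text}]({url})" if url else match.group(0)

    return REFERENCE.sub(replace, body).rstrip() + "\n"


sample = (
    "The number `200` represents the [status code][22] returned.\n"
    "\n"
    "[22]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status\n"
)
print(inline_references(sample))
```

Both forms render identically, so this only matters if you want to read or edit the Markdown source directly.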

eboehnisch · January 16, 2023, 5:10pm

Interesting questions. I’d need to dig into the code, which is based on third-party code and modified by us.

chrillek · January 16, 2023, 5:21pm

It does not happen with all HTML documents, but it does with this one.

eboehnisch · June 2, 2023, 10:04am

The declutterer removed the last paragraph because it contains only a list of links. We are now explicitly allowing <ul> and <ol> elements containing only links.
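
To illustrate the kind of rule described here, the sketch below shows a generic link-density heuristic with an exemption for link-only lists. It is not the actual DEVONthink code: the names `LinkTextCounter` and `keep_block`, the 0.5 threshold, and the idea that the decision is made per block are all assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count visible characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.root_tag = None   # first tag seen in the fragment
        self.link_depth = 0    # > 0 while inside an <a> element
        self.link_chars = 0    # visible characters inside links
        self.total_chars = 0   # all visible characters

    def handle_starttag(self, tag, attrs):
        if self.root_tag is None:
            self.root_tag = tag
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def keep_block(fragment, max_link_density=0.5, allow_link_lists=True):
    """Return True if a hypothetical declutterer should keep this block."""
    counter = LinkTextCounter()
    counter.feed(fragment)
    counter.close()
    if counter.total_chars == 0:
        return False  # nothing visible to keep
    density = counter.link_chars / counter.total_chars
    # The exemption: a <ul> or <ol> whose visible text is entirely links.
    is_link_only_list = counter.root_tag in ("ul", "ol") and density == 1.0
    if allow_link_lists and is_link_only_list:
        return True
    # Otherwise treat link-heavy blocks as navigation clutter.
    return density <= max_link_density


link_list = (
    "<ul>\n"
    '<li><a href="https://realpython.com/python-api/">'
    "Python &amp; APIs: A Winning Combo for Reading Public Data</a></li>\n"
    "</ul>"
)
print(keep_block(link_list, allow_link_lists=False))
print(keep_block(link_list))
```

Under these assumptions, a link inside a paragraph of running text stays well below the threshold and survives either way, while a list consisting solely of links is dropped unless the exemption is enabled. That matches the difference observed earlier in the thread between the clipped and the omitted links.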