Recently, I was looking for HTML parsers for use in a .NET project, and I came across these:
- HTML Tidy – seems very popular, with ports to the Java platform as well. You can create a .NET wrapper around this C library, and a few people have already done this for you 🙂 A couple of GUI tools are also available, like Tidy UI. The documentation seems a little complex, so I will try Tidy last!
- ACRUX HTML Parser. I installed the trial version, but it is not a fully functional, time-limited trial.
- Html Agility Pack – “This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don’t HAVE to understand XPATH nor XSLT to use it, don’t worry…). It is a .NET code library that allows you to parse “out of the web” HTML files. The parser is very tolerant with “real world” malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)”. This seems easy to use, and it is written directly in .NET!
- Html DOM – “A class library that implements HTML DOM (Document object Model) for .Net platform.”
- WebLexicon – “Open-Source Markup Language Parser Library for .NET (XHTML/HTML/SGML/XML/MATHML)”
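
To give a feel for the "read/write DOM plus XPath" style that the Html Agility Pack description mentions, here is a minimal sketch. It assumes the HtmlAgilityPack package is referenced in the project; the sample markup and the URL in it are made up for illustration, and I have not compared this against the other parsers in the list.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

class Program
{
    static void Main()
    {
        // Hypothetical sample input, deliberately malformed:
        // the <p> and <b> tags are never closed.
        string html = "<html><body><p>Hello <b>world" +
                      "<a href='https://example.com'>link</a></body></html>";

        // Build the DOM from a string instead of a file or stream.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath query for every anchor that has an href attribute.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                // GetAttributeValue falls back to the second argument if the attribute is missing.
                string href = link.GetAttributeValue("href", "");
                Console.WriteLine(href + " -> " + link.InnerText);
            }
        }
    }
}
```

The null check matters in practice: iterating the result of `SelectNodes` directly throws when a page has no matching nodes.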