I’ve been working with HTML in low-level languages like C/C++ and I must say that it’s been a rather frustrating experience. One would think that the task I’m doing should have been done ages ago, since the conception of the project. Alas, that’s not quite right. What you fail to realize is that you should NEVER underestimate the dirtiness of HTML outputs all around the world. This whole week has been about cleaning/scrubbing/sanitizing HTML. I ended up grabbing to rather awesome third party libraries, one called
pugixml and the other
libtidy (also known as tidy-html5).
libtidy, being a pure C library took me a while to get the hang of as I’m not that experienced in C. Even though I’m writing C++ it doesn’t mean I’m working with pointers the majority of time (mostly on the stack). It was more of a game of “copy this chunk of memory and put it as an argument in this C++-powered method so I have more control over it with the classes I’ve built. But I needed it to work, regardless on how long it took to get right because without tidy it would be really, really hard to do any scrubbing. pugixml and other XML parsers can get quite cranky while parsing misplaced tags.
Which brings me to pugixml!
Apparently QtXml module is no longer maintained as it has reached a matured state. The array of classes QDomDocument have in general feels suiting to do the job, sadly the documentation stresses that it uses too much memory, and apparently “pugixml is faster than QXmlStreamReader”
In the end. I chose pugixml because it’s just simple, it doesn’t have the annoyances QXmlStreamReader brings.
Another subject I wanted to bring is the interest of the C language.
I’ve been thinking of learning more about C. At least the basics to defend myself from cases like libtidy. I also wanted to learn more about C due to Gtk3 and due to getting closer to system programming (not that C++ isn’t capable). Mix this interest with me wanting to work with microcomputers or embedded systems and I might make something out of it.
Lastly there’s that thought of me wanting to get a job related to my field. I might try to get a job remotely, well, time will tell. For now I’m too tired, mentally spent, yet as I finish this post I have a class to attend to in college. 🙁