textract – extract text from any document. Currently supports .doc, .docx, .eml, .json, .html, .pptx, .pdf, and .txt.
HTML from Microsoft Word
You just gotta love the HTML that comes out of Microsoft Word. Particularly useful are the HTML comments that are never closed, which break the rest of the webpage below the pasted content. Yes, exactly where, for example, the JavaScript is loaded in the footer of the page.
This is a good old case of “Don’t trust any user input”, reinforced with “especially if they are using Microsoft tools”.
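For what it's worth, here is a minimal sketch of how pasted HTML could be defused before it is stored, assuming the paste arrives as a plain string. The function name `strip_html_comments` is made up for illustration, and the approach (drop closed comments, then cut everything from a dangling `<!--` onwards) is just one possible policy, not a replacement for a proper HTML sanitizer.

```python
import re

# Matches a well-formed comment, including Word's conditional ones
# such as <!--[if gte mso 9]> ... <![endif]-->.
_CLOSED_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(fragment: str) -> str:
    """Remove HTML comments from a pasted fragment (hypothetical helper).

    Closed comments are removed outright. If an opening "<!--" is left
    without a matching "-->", it and everything after it in the fragment
    are dropped, since a browser would treat that tail as a comment anyway.
    """
    cleaned = _CLOSED_COMMENT.sub("", fragment)
    dangling = cleaned.find("<!--")
    if dangling != -1:
        cleaned = cleaned[:dangling]
    return cleaned


if __name__ == "__main__":
    pasted = "<p>Hello</p><!--[if gte mso 9]><xml>Word junk"
    print(strip_html_comments(pasted))
```

The point is that the unclosed comment gets contained inside the pasted fragment, so it can no longer swallow the footer and the scripts loaded there.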