{"id":16511,"date":"2012-07-18T00:38:09","date_gmt":"2012-07-17T22:38:09","guid":{"rendered":"https:\/\/mamchenkov.net\/wordpress\/?p=16511"},"modified":"2012-07-18T00:38:09","modified_gmt":"2012-07-17T22:38:09","slug":"boilerpipe-boilerplate-removal-and-fulltext-extraction-from-html-pages","status":"publish","type":"post","link":"https:\/\/mamchenkov.net\/wordpress\/2012\/07\/18\/boilerpipe-boilerplate-removal-and-fulltext-extraction-from-html-pages\/","title":{"rendered":"boilerpipe &#8211; Boilerplate Removal and Fulltext Extraction from HTML pages"},"content":{"rendered":"<!-- google_ad_section_start -->\n<p><a href=\"http:\/\/code.google.com\/p\/boilerpipe\/\">boilerpipe &#8211; Boilerplate Removal and Fulltext Extraction from HTML pages<\/a><\/p>\n<blockquote><p>The boilerpipe library provides algorithms to detect and remove the surplus &#8220;clutter&#8221; (boilerplate, templates) around the main textual content of a web page.<\/p>\n<p>The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.<\/p>\n<p>Extracting content is very fast (milliseconds), just needs the input document (no global or site-level information required) and is usually quite accurate.<\/p><\/blockquote>\n<!-- google_ad_section_end -->\n","protected":false},"excerpt":{"rendered":"<!-- google_ad_section_start -->\n<p>boilerpipe &#8211; Boilerplate Removal and Fulltext Extraction from HTML pages The boilerpipe library provides algorithms to detect and remove the surplus &#8220;clutter&#8221; (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual &hellip; <a href=\"https:\/\/mamchenkov.net\/wordpress\/2012\/07\/18\/boilerpipe-boilerplate-removal-and-fulltext-extraction-from-html-pages\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">boilerpipe &#8211; Boilerplate Removal and Fulltext Extraction from HTML pages<\/span><\/a><\/p>\n<!-- google_ad_section_end -->\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"link","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_links_to":"","_links_to_target":""},"categories":[1,18,62,1334],"tags":[1190,1330],"keyring_services":[],"class_list":["post-16511","post","type-post","status-publish","format-link","hentry","category-general","category-programming","category-technology","category-web-work","tag-html","tag-web-development","post_format-post-format-link"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":16540,"url":"https:\/\/mamchenkov.net\/wordpress\/2012\/07\/22\/weekly-digest-2012-07-22\/","url_meta":{"origin":16511,"position":0},"title":"Weekly digest &#8211; 2012-07-22","author":"Leonid Mamchenkov","date":"July 22, 2012","format":false,"excerpt":"Tribute to MS Outlook http:\/\/t.co\/XjIhglhJ # Stash - privately hosted Git repositories http:\/\/t.co\/Ms8Y0ZVR # Requiem For A Digg http:\/\/t.co\/SBiUFv8w # The Most Important Tech Company You've Never Heard Of http:\/\/t.co\/KEvwjBJP # This is What Snake Venom Does to Blood! http:\/\/t.co\/Z6HRcelU # New note : \u041a\u0430\u043a \u0432\u044b\u0433\u043b\u044f\u0434\u044f\u0442 \u043a\u0440\u044f\u043a\u043e\u0437\u044f\u0431\u0440\u044b? http:\/\/t.co\/TYM79cYA # New\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":23814,"url":"https:\/\/mamchenkov.net\/wordpress\/2015\/04\/06\/styleguide-boilerplate-patterns\/","url_meta":{"origin":16511,"position":1},"title":"Styleguide &#038; Boilerplate Patterns","author":"Leonid Mamchenkov","date":"April 6, 2015","format":"link","excerpt":"Styleguide & Boilerplate Patterns - feature comparison of many CSS templates and frameworks.","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":22048,"url":"https:\/\/mamchenkov.net\/wordpress\/2014\/06\/23\/google-web-starter-kit-boilerplate-tooling-for-multi-device-development\/","url_meta":{"origin":16511,"position":2},"title":"Google Web Starter Kit &#8211; Boilerplate &#038; Tooling for Multi-Device Development","author":"Leonid Mamchenkov","date":"June 23, 2014","format":"link","excerpt":"Google Web Starter Kit - Boilerplate & Tooling for Multi-Device Development","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":18020,"url":"https:\/\/mamchenkov.net\/wordpress\/2013\/04\/15\/wordpress-themes-roots\/","url_meta":{"origin":16511,"position":3},"title":"WordPress themes : Roots","author":"Leonid Mamchenkov","date":"April 15, 2013","format":"link","excerpt":"WordPress themes : Roots Roots is a WordPress starter theme based on HTML5 Boilerplate & Bootstrap from Twitter.","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":28490,"url":"https:\/\/mamchenkov.net\/wordpress\/2018\/04\/18\/wordpress-plugin-boilerplate-a-standardized-organized-object-oriented-foundation-for-building-high-quality-wordpress-plugins\/","url_meta":{"origin":16511,"position":4},"title":"WordPress Plugin Boilerplate &#8211; a standardized, organized, object-oriented foundation for building high-quality WordPress Plugins","author":"Leonid Mamchenkov","date":"April 18, 2018","format":false,"excerpt":"WordPress is an excellent system for a whole lot of different projects and needs.\u00a0 It's widely used, fast, and flexible.\u00a0 However it does show its age in many ways.\u00a0 One of the areas where things could be a lot better and simpler is the WordPress plugin development. WordPress plugins are\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":28179,"url":"https:\/\/mamchenkov.net\/wordpress\/2017\/11\/06\/php-ml-machine-learning-library-for-php\/","url_meta":{"origin":16511,"position":5},"title":"PHP-ML &#8211; Machine Learning library for PHP","author":"Leonid Mamchenkov","date":"November 6, 2017","format":false,"excerpt":"PHP-ML is a machine learning library for PHP.\u00a0 Given, PHP is probably not the best choice when it comes to machine learning, but sometimes one is limited in technology stack choices, so it's good have options like this one. Fresh approach to Machine Learning in PHP. Algorithms, Cross Validation, Neural\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/posts\/16511","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/comments?post=16511"}],"version-history":[{"count":0,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/posts\/16511\/revisions"}],"wp:attachment":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/media?parent=16511"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/categories?post=16511"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/tags?post=16511"},{"taxonomy":"keyring_services","embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/keyring_services?post=16511"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}