Leonid Mamchenkov

Life, universe, and everything else

boilerpipe – Boilerplate Removal and Fulltext Extraction from HTML pages

The boilerpipe library provides algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings.

Extraction is very fast (milliseconds), needs only the input document (no global or site-level information), and is usually quite accurate.
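To make the idea of single-document boilerplate removal more concrete, here is a minimal Python sketch. It is not boilerpipe's algorithm or API (boilerpipe itself is a Java library); it is a toy heuristic that splits a page into text blocks and keeps those that are reasonably long and not dominated by link text. The names `BlockCollector` and `extract_main_text`, the set of block tags, and both thresholds are assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Tags whose text is never part of the readable content.
SKIP_TAGS = {"script", "style"}
# Tags treated as block boundaries (a simplified, hypothetical set).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "ul", "ol",
              "nav", "header", "footer", "article", "section"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks, counting words and linked words."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished blocks: (text, words, linked_words)
        self._words = []     # words of the block currently being built
        self._linked = 0     # how many of those words sit inside <a>
        self._in_link = 0    # nesting depth of <a> tags
        self._skip = 0       # nesting depth of script/style tags

    def flush(self):
        """Close the current block, if it contains any words."""
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._linked)
            )
        self._words, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked += len(words)


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not mostly link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()   # process any buffered trailing text
    parser.flush()   # close the last open block
    kept = [
        text
        for text, words, linked in parser.blocks
        if words >= min_words and linked / words <= max_link_density
    ]
    return "\n\n".join(kept)


if __name__ == "__main__":
    sample = """
    <ul><li><a href="/">Home</a></li><li><a href="/archives">Archives</a></li></ul>
    <p>The boilerpipe library provides algorithms to detect and remove the
    surplus clutter around the main textual content of a web page.</p>
    <div><a href="#">Click to share on Twitter</a></div>
    """
    print(extract_main_text(sample))
```

Like the library described above, the sketch looks only at the input document: menu items and share links are dropped because they are short or consist entirely of link text, while the article paragraph survives. A real extractor uses far more careful features and tuning than these two hand-picked thresholds.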

Format: Link. Posted on July 18, 2012. Author: Leonid Mamchenkov. Categories: All, Programming, Technology, Web work. Tags: HTML, web development.
