boilerpipe – Boilerplate Removal and Fulltext Extraction from HTML pages

The boilerpipe library provides algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

Extraction is very fast (milliseconds), needs only the input document (no global or site-level information required), and is usually quite accurate.

Stash – privately hosted Git repositories

As far as I am concerned, GitHub is the king and queen of applications in the git world.  But it has a downside that is not easy to work around: GitHub Enterprise is expensive.  Keeping code on GitHub's own infrastructure is not always allowed (by regulators, company policy, and the like), and once you need an in-house installation, things get really expensive.  That’s where, I think, Stash can come in.

Stash is a product of Atlassian, the same company that owns Jira, BitBucket, and a few other well-known developer tools.  Given that Stash has only been launched this year, and judging by the screenshots, GitHub probably provides more functionality.  But as I said earlier, GitHub’s price might be simply too high for some companies.

It’s also worth noting that both companies have recently received large investments (Atlassian got $60 million and GitHub got $100 million).  Since private repositories and in-house installations seem to be the primary source of income for both of them, I expect the competition between the two to heat up in the near future.

jQuery 2.0 will drop support for MSIE 6, 7, and 8

Slashdot reports:

The developers of jQuery recently announced in a blog entry that jQuery 2.0 will drop support for legacy versions of Internet Explorer. The release will come in parallel with version 1.9, however, which will include support for older versions of IE. The versions will offer full API compatibility, but 2.0 will ‘benefit from a faster implementation that doesn’t have to rely on legacy compatibility hacks.’

A few comments mentioned that dropping support for MSIE 6 and 7 is fine, but MSIE 8 is still widely used by people with Windows XP.  The solution to the problem seems to be conditional comments.  Since jQuery 2.0 will have an API fully compatible with jQuery 1.9, something along the lines of:


<!--[if lt IE 9]>
<script src="jquery-1.9.0.js"></script>
<![endif]-->
<!--[if gte IE 9]><!-->
<script src="jquery-2.0.0.js"></script>
<!--<![endif]-->

should solve the problem.

Huge, huge thanks to git bisect! With its help, I …

Huge, huge thanks to git bisect! With its help, I just settled a huge argument about who removed a piece of code and when.  In an actively developed project with several developers and branches, it’s not trivial to say when a change was introduced.  Unless, of course, you are using git bisect.  Every developer should know how to use it.
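
For those who have not tried it yet: you mark one commit as good and one as bad (`git bisect start`, `git bisect good`, `git bisect bad`), and git keeps checking out the commit halfway between them until only one suspect is left.  Below is a minimal Python sketch of that idea, not git's actual implementation.  The function name, the linear commit list, and the `is_bad` check are all made up for illustration, and it assumes that every commit after the first bad one is bad too.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the first one is known
    good, the last one is known bad, and once a commit is bad all later
    commits are bad as well.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the change happened at mid or earlier
        else:
            good = mid  # the change happened after mid
    return commits[bad]


# Hypothetical history of ten commits; the code disappeared in "c6".
history = [f"c{i}" for i in range(1, 11)]
culprit = first_bad_commit(history, lambda c: int(c[1:]) >= 6)
print(culprit)
```

With ten commits, this needs only three checks instead of up to eight, which is why bisecting stays practical even on a long history.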

CakePHP 2.1.4, 2.2, and a peek into 3.0

There’s been a stream of good news from the CakePHP headquarters recently.  If you are as slow as me on catching up with these things, here is a quick summary.

  • CakePHP 2.1.4 has been released, and that’ll be the last release for the 2.1 branch.  It’s time to move on.
  • CakePHP 2.2 stable has been released, and that’s what you should be using for your projects.
  • CakePHP 3.0 has been mentioned, so if you are interested in contributing early, here is your chance.

CakePHP 3.0 will take a few months to develop.  The work is mainly focused on the following:

  • Drop support for PHP 5.2.
  • Add and improve support of PHP 5.4+.
  • Reorganize CakePHP classes to use namespaces to avoid collisions with other libraries and classes.
  • Improve bootstrapping for better control by developers.
  • Rewrite the model layer to support more drivers, object mapping, richer API, etc.
  • Rewrite the routing to work faster and be more flexible.

Overall, it looks like there is some really healthy activity in the CakePHP project.