An English-only Planet Iron Man

Seeing all the non-English Perl blogs aggregated on Planet Iron Man, I’m very happy to know that Perl has global appeal. But since I’m a (typical American) monoglot, I’d prefer an Iron Man feed with only English articles. So I made one.

It’s available at http://feeds.dagolden.com/ironman-english.xml. It updates hourly from the master feed.

And for the curious, or for anyone who wants to adapt this for other languages, here’s the Perl program that I whipped up to create the feed:

Update: I’ve also put the code up on GitHub: ironman-feedfilter

# feedfilter.pl - downloads and filters the Perl Ironman feed for English
# entries. Results sent to STDOUT.
#
# The heuristic filters out entries unless the content is mostly Latin
# characters and English is close to the best guess of a language.  Short
# entries with code seem to confuse Lingua::Identify, so we take entries that
# seem "close-enough".  Tuned via trial-and-error.
#
# Copyright (c) 2010 by David Golden - This may be used or copied under the
# same terms as Perl itself.

use 5.008001;
use strict;
use warnings;
use utf8;
use autodie;

use IO::File;
use Lingua::Identify qw(:language_identification);
use Time::Piece;
use URI;

use XML::Atom::Feed;
$XML::Atom::ForceUnicode = 1;
$XML::Atom::DefaultVersion = "1.0";

# Global heuristic tuning
my $latin_target = 0.95;  # 95% Latin chars
my $lang_fuzz = 0.02;     # English within 2% probability of best language

run();

#--------------------------------------------------------------------------#

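# Returns the fraction of alphabetic characters that are in the Latin script
sub latin_ratio {
  my $string = shift;
  # "=()=" forces list context, so each assignment counts the matches
  my $alpha =()= $string =~ /(\p{Alphabetic})/g;
  my $latin =()= $string =~ /(\p{Latin})/g;

  return 0 if ! $latin || !$alpha; # !$alpha probably redundant
  return $latin / $alpha;
}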

sub run {
  my $in_feed = XML::Atom::Feed->new(URI->new("http://ironman.enlightenedperl.org"));

  my $out_feed = XML::Atom::Feed->new;
  $out_feed->title("Planet Iron Man: English Edition");
  $out_feed->subtitle( $in_feed->subtitle );
  $out_feed->id("tag:feeds.dagolden.com,".gmtime->year().":ironman:english");
  $out_feed->generator("XML::Atom/" . XML::Atom->VERSION);
  $out_feed->updated( gmtime->datetime . "Z" );
  for my $l ( $in_feed->link ) {
    $out_feed->link($l);
  }

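  for my $e ( $in_feed->entries ) {
    # Skip entries that have no content element or an empty body
    my $content = $e->content ? $e->content->body : undef;
    next unless defined $content && length $content;
    my $latin = latin_ratio($content);
    my %lang = langof($content);  # language code => confidence
    # Language code with the highest confidence
    my $best = [sort { $lang{$b} <=> $lang{$a} } keys %lang]->[0];
    $lang{en} ||= 0;
    # Keep mostly-Latin entries where English is within $lang_fuzz of the best
    $out_feed->add_entry($e)
      if $latin > $latin_target && ($lang{$best} - $lang{en} < $lang_fuzz);
  }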

  binmode(STDOUT, ":utf8");
  print $out_feed->as_xml;
}
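
For anyone adapting this to another language, here is a minimal sketch of the two checks in isolation, using only core Perl. The latin_ratio sub mirrors the one above. The language_close_enough helper and the score hashes are hypothetical: the numbers are made up for illustration and are not real Lingua::Identify output. Treat it as a way to experiment with the thresholds, not as part of the actual filter.

```perl
use strict;
use warnings;
use utf8;

# Same counting idea as latin_ratio() in the program above:
# fraction of alphabetic characters that belong to the Latin script.
sub latin_ratio {
    my $string = shift;
    my $alpha = () = $string =~ /\p{Alphabetic}/g;
    my $latin = () = $string =~ /\p{Latin}/g;
    return 0 if !$latin || !$alpha;
    return $latin / $alpha;
}

# Hypothetical helper (not in the original program): true if the target
# language's score is within $fuzz of the best-scoring language.
sub language_close_enough {
    my ($scores, $target, $fuzz) = @_;
    my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
    return 0 unless defined $best;
    my $target_score = $scores->{$target} || 0;
    return ( $scores->{$best} - $target_score ) < $fuzz ? 1 : 0;
}

binmode( STDOUT, ":utf8" );

printf "%.2f\n", latin_ratio("Hello, Perl world!");    # prints 1.00
printf "%.2f\n", latin_ratio("Привет, Perl!");         # prints 0.40

# Made-up scores, only to exercise the comparison
my %near = ( pt => 0.21, en => 0.20 );
my %far  = ( ru => 0.50, en => 0.10 );

print language_close_enough( \%near, 'en', 0.02 ), "\n";    # prints 1
print language_close_enough( \%far,  'en', 0.02 ), "\n";    # prints 0
```

Passing a different language code as the second argument (for example 'pt') is the only change the comparison needs; the Latin-ratio check would need rethinking for languages written in other scripts.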
