Explaining Metabase
At YAPC::NA, Ricardo and I finally released some early alpha Metabase modules to CPAN. Metabase is a project we started in 2008 at the Oslo QA Hackathon to provide a new transport and storage infrastructure for CPAN Testers and other CPAN distribution metadata. We’ve been working out of GitHub until now, but we think it’s time to encourage others in the Perl programming community to take a look.
One of the challenges we’ve had is just explaining Metabase to people. Here’s what I’ve come up with: Metabase is a database framework and web API to store and search opinions from anyone about anything. In the terminology of Metabase, Users store Facts about Resources. Facts are really opinions or claims; there is no inherent truth in them. Resources are logical identifiers, generally in the form of a URI, since the things we’re most interested in describing are online.
Since CPAN Testers is our first use case, I’ll use that to illustrate the concept. Currently, CPAN Testers email test reports about distributions on CPAN. In the Metabase world, each CPAN tester is a User. The Resource is a CPAN distribution file: a ‘distfile’ URI that describes a distribution tarball on CPAN, such as cpan://distfile/DAGOLDEN/Capture-Tiny-0.06.tar.gz. The Fact is the test report. Today that’s just the text of the email message, but in the future it will be structured data. (Technically, a CPAN Testers report will be a collection of more granular facts like prerequisites, test output, environment and so on.)
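To make the User / Fact / Resource vocabulary concrete, here is a minimal sketch in Python rather than in the actual Perl modules. Everything in it is an assumption made for illustration: the `Fact` record, its field names and the `parse_distfile` helper are hypothetical and are not part of the Metabase code. Only the shape of the distfile URI comes from the example above.

```python
from dataclasses import dataclass

# Prefix taken from the example URI in the text.
DISTFILE_PREFIX = "cpan://distfile/"


@dataclass(frozen=True)
class Fact:
    """One opinion or claim: a User says something about a Resource."""
    creator: str    # hypothetical identifier for the submitting User
    resource: str   # URI of the thing being described
    fact_type: str  # hypothetical label for the kind of claim
    content: str    # today, just the text of the report email


def parse_distfile(uri):
    """Split a cpan://distfile/AUTHOR/FILENAME URI into author and filename."""
    if not uri.startswith(DISTFILE_PREFIX):
        raise ValueError(f"not a distfile URI: {uri}")
    author, _, filename = uri[len(DISTFILE_PREFIX):].partition("/")
    if not author or not filename:
        raise ValueError(f"missing author or filename: {uri}")
    return {"author": author, "filename": filename}


report = Fact(
    creator="some-tester",
    resource="cpan://distfile/DAGOLDEN/Capture-Tiny-0.06.tar.gz",
    fact_type="test-report",
    content="(text of the emailed report)",
)
print(parse_distfile(report.resource))
```

Note that the record carries no notion of whether the claim is true. It only records who said what about which resource, which is the point of calling Facts opinions.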
It’s easy to see how this can be generalized to an entire ecosystem of facts and opinions about CPAN distributions, such as CPANTS, ratings, annotations, tags, etc. Today, every CPAN website has its own database and its own API, so mashups and remixes require building an interface to each. Metabase offers the possibility of a unified way for all these tools to store, share and cross-reference information about CPAN.
Design considerations
Metabase has some similarities to other, general ontology frameworks, but we decided to focus our efforts around some pragmatic tradeoffs using CPAN Testers to guide the design. We think these decisions generalize well to other uses, but we’re trying to follow a YAGNI strategy and only implement what CPAN Testers will need before worrying about more general cases.
(1) High fact volume
There are currently about 18,000 “latest” distributions on CPAN, and the entire history of CPAN is a bit over 100,000 distributions. By contrast, there are over 4 million CPAN Testers reports already and the total is growing by about 250,000 per month. Reports are usually consumed as a batch process to create summary statistics on the CPAN Testers website and relatively few reports are ever reviewed in detail. Therefore, the Metabase design to date prioritizes Fact submission over search and retrieval.
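A quick back-of-the-envelope calculation shows why submission matters more than retrieval. The sketch below uses only the rounded figures quoted above (the real totals are "over" and "a bit over" these numbers) plus one assumption of mine, a 30-day month, so treat the results as orders of magnitude rather than measurements.

```python
# Rounded figures quoted in the text above.
TOTAL_REPORTS = 4_000_000
REPORTS_PER_MONTH = 250_000
ALL_DISTRIBUTIONS = 100_000
LATEST_DISTRIBUTIONS = 18_000

# Assumption: a 30-day month, used only to get an order of magnitude.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Average number of stored reports per distribution ever released.
reports_per_distribution = TOTAL_REPORTS / ALL_DISTRIBUTIONS

# Average gap between incoming reports if they arrived evenly.
seconds_between_reports = SECONDS_PER_MONTH / REPORTS_PER_MONTH

# New reports each month per "latest" distribution.
monthly_reports_per_latest = REPORTS_PER_MONTH / LATEST_DISTRIBUTIONS

print(f"{reports_per_distribution:.0f} reports per distribution ever released")
print(f"one new report every {seconds_between_reports:.1f} seconds")
print(f"{monthly_reports_per_latest:.1f} new reports per latest distribution per month")
```

Under those assumptions this works out to roughly 40 reports per distribution, a new report about every 10.4 seconds, and about 13.9 new reports per latest distribution each month. Real traffic is burstier than an even average suggests.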
(2) Minimalist Perl for client libraries
Many CPAN Testers want to run smoke tests using a Perl installation that has as few extra modules installed as possible. Therefore, any Metabase components used by clients such as Metabase::Fact or Metabase::Client::Simple need minimal non-core dependencies. On the other hand, anything that will run on the server side can take full advantage of Modern Perl tools like Moose, DBIC and Catalyst.
(3) No barriers to contribution
Today, new CPAN Testers don’t have to register or get authorized to start contributing; they just start sending emails to a mailing list that collects reports. For Metabase, users generate a user profile fact and submit it as a credential with their facts. Metabase manages user identities within the Metabase itself, adding new users automatically, so new contributors can join without any external pre-authorization hurdles that might discourage participation. (Verification or authorization is optional.)
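The "register on first submission" idea can be sketched in a few lines. This is a toy model in Python, not the Metabase implementation: the `OpenRepository` class, the `id` and `secret` profile keys and the mismatch check are all assumptions I made up to illustrate the flow described above.

```python
class OpenRepository:
    """Toy repository that registers unknown users on first submission."""

    def __init__(self):
        self.users = {}  # user id -> profile dict
        self.facts = []  # (user id, resource, content) tuples

    def submit(self, profile, resource, content):
        """Store a fact, adding the submitter if they are new.

        `profile` stands in for the user profile fact sent as a credential;
        its "id" and "secret" keys are hypothetical.
        """
        user_id = profile["id"]
        known = self.users.get(user_id)
        if known is None:
            # First contact: accept the profile as-is, no pre-authorization.
            self.users[user_id] = dict(profile)
        elif known["secret"] != profile["secret"]:
            # Same id with a different credential: refuse rather than guess.
            raise PermissionError(f"credential mismatch for {user_id}")
        self.facts.append((user_id, resource, content))
        return len(self.facts)  # running count of stored facts


repo = OpenRepository()
newcomer = {"id": "new-tester", "secret": "s3kr1t"}
repo.submit(newcomer, "cpan://distfile/DAGOLDEN/Capture-Tiny-0.06.tar.gz", "PASS")
print(sorted(repo.users))
```

The newcomer never asked anyone for permission: the first submission created the identity, and later submissions only need to present the same profile again.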
(4) Open to extension and evolution
We want to make CPAN Testers tools available for corporations or perl5-porters or anyone with customized testing needs. We want people to come up with new facts and new resources. That means defining interfaces well and leaving implementation details open to change. Metabase allows lots of choices about things like underlying data storage technologies or permissible fact types.
Metabase specifies data storage capabilities, but the actual database storage is pluggable, from flat files to relational databases to cloud services. Metabase defines how Fact classes can provide validation and custom index data on top of a very simple data model. Moreover, Metabase is a fairly dumb repository; any intelligent analysis must be done by third parties extracting and transforming data.
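Here is one way the split between "what storage must do" and "how a backend does it" could look, along with a fact type that validates itself and supplies index data. It is a sketch in Python under my own assumptions: `Storage`, `MemoryStorage`, `FlatFileStorage`, `GradeFact` and `store_fact` are hypothetical names, not Metabase classes, and the flat-file backend assumes keys are safe to use as file names.

```python
import json
import os
from abc import ABC, abstractmethod


class Storage(ABC):
    """The capability the repository needs; backends decide how to provide it."""

    @abstractmethod
    def put(self, key, record):
        """Save a record under a key."""

    @abstractmethod
    def get(self, key):
        """Return the record saved under a key."""


class MemoryStorage(Storage):
    """Keeps records in a dict; handy for tests."""

    def __init__(self):
        self._records = {}

    def put(self, key, record):
        self._records[key] = record

    def get(self, key):
        return self._records[key]


class FlatFileStorage(Storage):
    """Writes one JSON file per record into an existing directory."""

    def __init__(self, directory):
        self._directory = directory

    def _path(self, key):
        return os.path.join(self._directory, f"{key}.json")

    def put(self, key, record):
        with open(self._path(key), "w", encoding="utf-8") as handle:
            json.dump(record, handle)

    def get(self, key):
        with open(self._path(key), encoding="utf-8") as handle:
            return json.load(handle)


class GradeFact:
    """Hypothetical fact type that validates itself and supplies index data."""

    ALLOWED = {"pass", "fail", "na", "unknown"}

    def __init__(self, resource, grade):
        self.resource = resource
        self.grade = grade

    def validate(self):
        if self.grade not in self.ALLOWED:
            raise ValueError(f"unknown grade: {self.grade}")

    def index_data(self):
        # Extra fields a backend could index; the repository never interprets them.
        return {"grade": self.grade}


def store_fact(storage, key, fact):
    """Validate a fact, then hand a plain record to whichever backend is plugged in."""
    fact.validate()
    record = {"resource": fact.resource, "content": fact.grade, "index": fact.index_data()}
    storage.put(key, record)
    return record


backend = MemoryStorage()
fact = GradeFact("cpan://distfile/DAGOLDEN/Capture-Tiny-0.06.tar.gz", "pass")
store_fact(backend, "fact-0001", fact)
print(backend.get("fact-0001")["index"])
```

Swapping `MemoryStorage()` for `FlatFileStorage(some_directory)` changes nothing in `store_fact`, which is the property the pluggable design is after. The repository itself stays dumb: it validates, stores and indexes, and leaves any analysis to whoever reads the data back out.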
Where do we go from here?
I’ve successfully submitted test reports to a local Metabase, but for Metabase to become what we need for CPAN Testers (and more), there are still quite a number of things to do:
- Improve documentation
- Test and refine Metabase components
- Design and implement search capabilities
- Establish one or more ‘development server’ Metabases for testing robustness and throughput
- Migrate the 4 million CPAN Testers reports from NNTP to Metabase
- Migrate CPAN Testers downstream analytics to use Metabase as a source
- Have some CPAN Testers pilot submitting reports to Metabase
- Develop and release new CPAN Testers clients that can submit structured Metabase reports
If you are interested in following or participating in Metabase, there is a (still low-volume) mailing list you can subscribe to, and several of the key developers can be found on #toolchain on irc.perl.org.