Earn a Bowl Full of Money via the National Data Science Bowl

Would you like to earn $175,000?

Got your attention!

Now, don’t get worried that I am going to suddenly offer you untold riches by doing no work — this is not one of those untoward scams.

Instead, this is a chance to show the world that you have what it takes to advance our understanding of data science, plus aid the environment, and gleefully win $175,000 for doing so.

In a first-time-ever contest, the Association for Computing Machinery (ACM) has teamed up with other partners to provide an open competition for showcasing your data science skills and knowledge.

The competition is known as the National Data Science Bowl.

It is free to enter as a contestant.

This is an open-book type of exam, so to speak: you have several months to tackle a data-intensive problem and provide a “solution” that is better than the other solutions submitted, thus potentially offering a breakthrough in data science and getting you the dough.

Imagine yourself like Oliver Twist, bowl in hand, asking Mr. Bumble not for more gruel (“Please, sir, I want some more”) but instead for more data, and more money.

I realize that Charles Dickens might shudder if he saw my above reference to his great work, but, all kidding aside, the National Data Science Bowl is a great opportunity for bringing attention to data science and getting young and old alike to see what is going on.

Reasons for industry to pay attention include:

• Might provide ideas for solving data-intensive problems in your own business or industry.

• Might be a source of highly promising future data scientists that you should consider courting and hiring.

• Might offer a sense of what is possible today versus the future in terms of data-intensive processing.

    CONTEST DETAILS

So, what is the contest, you ask?

In conjunction with the Oregon State University (OSU) Hatfield Marine Science Center, about 50 million oceanic images were collected over an 18-day period, amounting to about 80 terabytes of imagery data. You can freely download the data from the contest web site (more on this at the end of this piece).

Your task is to help analyze this data.

As the contest rules say, you are to predict ocean health, one plankton at a time.

Not joking, by the way, since plankton are an essential part of the oceanic ecosystem. Estimates indicate that plankton account for at least half of the primary productivity on earth and generate an enormous amount of the total carbon in the global carbon cycle.

They are integral to the aquatic food web.

Knowing their population levels, along with other metrics and relationships, provides an important underlying backbone for understanding our oceans and the environment.

Within the 80 terabytes of images are lots and lots of plankton, plus numerous other kinds of sea life.

If you were to try to review each image manually, the task would be incredibly onerous, taking a tremendous amount of time and being subject to errors and other maladies.

Even if we were to parcel out the task to thousands of people through a crowdsourcing effort, it would still have inherent issues, plus, this would not be a particularly repeatable approach to solving the problem.

And so, we turn toward the use of computing as a means to analyze the images.

At first glance, you might think: well, haven’t we already solved the whole image processing problem? After all, we have lots of examples today of systems that do image processing, including even facial image analysis such as that widely known on Facebook.

Yes, there has been a great deal of advancement in image processing, but there is still lots of room left to go.

This contest helps to highlight some of the aspects that we still need more work on.

Image processing can also be enriched by a helping hand from allied areas of computer science. For example, the use of machine learning techniques and tools is a big part of the image processing toolkit.

And, we are on the cusp of the Internet of Things, which means that soon, tons of other everyday products will be Internet connected, generating tons and tons of data, coming from our watches, our eyeglasses, our kitchen appliances, and the rest.

Analyzing and classifying the contents of the images for this contest will be tough, for several reasons. A wide variety of sea life is pictured, ranging from tiny single-celled creatures to large fish. The images capture the creatures in varying 3D orientations. Lots of fecal matter is floating around in the pictures (yucky, but there for a reason). And the images themselves are noisy, meaning that they are fuzzy or have other difficulties (imagine the pictures you take and the kinds of “noise” found in them).
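To make the orientation issue a bit more concrete, here is a minimal Python sketch of one commonly used trick: showing a classifier rotated copies of each training image so that it is less thrown off by how a creature happens to be facing. This is purely an illustration and not something the contest prescribes. It assumes an image is just a grid (a list of rows) of pixel values, and the names `rotate90` and `four_orientations` are hypothetical helpers made up for this example.

```python
def rotate90(grid):
    """Return a copy of a 2D grid rotated 90 degrees clockwise."""
    # Reversing the row order and then transposing gives a clockwise turn.
    return [list(row) for row in zip(*grid[::-1])]


def four_orientations(grid):
    """Return the grid at 0, 90, 180, and 270 degrees (clockwise)."""
    orientations = [[list(row) for row in grid]]  # start with a copy
    for _ in range(3):
        orientations.append(rotate90(orientations[-1]))
    return orientations


if __name__ == "__main__":
    # A toy 2x3 "image"; real images would be far larger.
    toy_image = [
        [1, 2, 3],
        [4, 5, 6],
    ]
    for turned in four_orientations(toy_image):
        print(turned)
```

Rotations in 90-degree steps only cover turning within the image plane, so a sketch like this does not address the full 3D orientation problem or the noise.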

    HOW TO APPROACH THE BOWL

Rather than trying to analyze all of the data en masse, it makes more sense to use a divide-and-conquer kind of strategy, first analyzing smaller chunks. In fact, the contest provides a so-called training data set, allowing you to more easily gauge the effectiveness of your image classification programs.

That being said, in discussing this contest with some potential entrants, I tried to emphasize that they should not take a mindless “hacking” approach to this problem. In other words, rather than just coding up some image analysis program, or grabbing one off the shelf, I am guessing that the better approach will first involve carefully thinking about how to approach the problem, and designing a solution.

Of course, the solution, once you start to construct and test it, will likely need to be adjusted, or in today’s parlance be “pivoted” – but it is likely better to have an overall plan, rather than just use the jump-in-the-middle strategy and hope that you find your way out of the puzzle.
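As a rough illustration of that divide-and-conquer idea, here is a minimal Python sketch. It is not anything supplied by the contest. It assumes you already have a list of labeled examples (pairs of features and a label) and some classifier function to try out; the names `split_holdout`, `chunks`, and `accuracy` are hypothetical helpers invented for this example. The idea is to set aside part of the labeled data for checking your work, and to process the rest in small batches rather than all at once.

```python
import random


def split_holdout(items, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, holdout)."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_holdout = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - n_holdout
    return shuffled[:cut], shuffled[cut:]


def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def accuracy(classifier, labeled_examples):
    """Fraction of (features, label) pairs the classifier gets right."""
    if not labeled_examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for features, label in labeled_examples
                  if classifier(features) == label)
    return correct / len(labeled_examples)


if __name__ == "__main__":
    # Toy stand-in data: the "features" are numbers, the labels are made up.
    examples = [(n, "odd" if n % 2 else "even") for n in range(10)]
    train, holdout = split_holdout(examples, holdout_fraction=0.2)

    def toy_classifier(n):
        return "odd" if n % 2 else "even"

    for batch in chunks(train, 4):
        print("batch of", len(batch))
    print("holdout accuracy:", accuracy(toy_classifier, holdout))
```

Scoring against examples the program has never seen is what makes the number meaningful; a program scored only on the data it was tuned on can look far better than it really is.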

I’ve heard some complaints from some computer scientists and data scientists that the contest is stacked toward the imagery specialists.

That admittedly is a potential qualm. Their complaint is that a big data problem could have been more well-rounded, such as by using tweets or some other kind of large data set.

But, it is the first such contest, and the imagery data does not necessarily mean that only the top-notch image processing gurus are going to take the prize.

They might fall into the trap of using one of their run-of-the-mill imagery analysis routines, and could potentially be caught unawares by someone from another discipline who brings into play an algorithm that the image pros had not previously considered.

Admittedly, it is a bit like a sporting contest where the problem deals with, say, baseball skills, and so the football and basketball players are inherently at a disadvantage relative to the baseball wizards.

Rather than fighting them, you might consider joining them.

Reach out to your imagery expert colleagues, extend a hand of cooperation, and maybe the winning team will arise from such a combination.

Either way, please consider taking the plunge and diving into the competition (did you catch those puns?).

    WRAP-UP

Here’s the web site:

www.datasciencebowl.com

The contest ends on March 16, 2015.

Please don’t wait until, say, March 15 to get started, though I suppose a last-minute Hail Mary notion might be tempting to some.

I will update you along the way about how the contestants are doing (there is a leaderboard being displayed at the site), and I will provide an assessment of the final winners at the end, sharing with you some insights gleaned from their efforts.

Get the data, save the oceans, make some bucks.

Enough said.