Sample datasets for Big Data experimentation

Another week, another gem from the Data Science Association. If you’re trying to prototype a data analysis algorithm, benchmark performance on a new platform like Spark, or just play around with a new tool, you’re going to need reliable sample data.

As anyone familiar with testing knows, good data can be tough to find. Although there’s plenty of data in the public domain, most of it is not ready to use. A few months ago, I downloaded some data sets from a US government site and it took a few hours of cleaning before I had the data in shape for analysis.

Behold: https://github.com/datasets. Here the Frictionless Data Project has compiled a set of easily accessible, well-documented data sets. The specific data may not be of much interest in itself, but these sets are great for trials and experimentation. For example, if you want to analyze financial time series, there’s a regularly updated CSV file of S&P 500 data.
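To give a feel for how little setup a clean CSV needs, here is a minimal Python sketch that parses a date/value file and computes period-over-period returns. Treat the details as assumptions rather than facts about the repository: the raw-file URL, the branch name, and the column names `Date` and `SP500` are my guesses, so check the repo's README before relying on them. The sample rows are made up so the script runs without a network connection.

```python
import csv
import io
from datetime import date
from urllib.request import urlopen

# ASSUMPTION: hypothetical raw-file location; verify the real path and branch in the repo.
SP500_CSV_URL = "https://raw.githubusercontent.com/datasets/s-and-p-500/main/data/data.csv"


def fetch_csv(url):
    """Download a CSV file and return its contents as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_series(csv_text, date_col="Date", value_col="SP500"):
    """Parse CSV text into (date, float) pairs, skipping rows with a blank value.

    ASSUMPTION: the column names are illustrative defaults, and dates are ISO formatted.
    """
    series = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        raw = (record.get(value_col) or "").strip()
        if not raw:
            continue  # tolerate gaps instead of failing on float("")
        series.append((date.fromisoformat(record[date_col].strip()), float(raw)))
    return series


def period_returns(series):
    """Simple returns between consecutive observations: v1 / v0 - 1."""
    return [(d1, v1 / v0 - 1.0) for (_, v0), (d1, v1) in zip(series, series[1:])]


# Made-up rows for an offline trial; not real index values.
SAMPLE = """Date,SP500
2020-01-01,100.0
2020-02-01,110.0
2020-03-01,
2020-04-01,99.0
"""

if __name__ == "__main__":
    # Swap SAMPLE for fetch_csv(SP500_CSV_URL) to try the downloaded file.
    for day, change in period_returns(parse_series(SAMPLE)):
        print(day, f"{change:+.2%}")
```

Parsing is kept separate from downloading so the same function works on a local copy, a pasted snippet, or a fetched file, which is handy when you are benchmarking a tool rather than studying the data.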

Well worth a look!
