Data mining – the research community at the coalface

The huge increase in data in biology and medicine creates new challenges for researchers. Extracting information from databases is the research community’s new day job.


Dag Ahrén teaches database mining to biologists. He is part of a national network of bioinformaticians, BILS (Bioinformatics Infrastructure for Life Sciences).

Research engineer Dag Ahrén observes that there have been rapid developments over the past decade in genome research, i.e. research on the entire DNA of an organism. He believes that the quantities of data are just going to keep on increasing. More and more research groups in biology are seeing the possibilities of big data. There are new questions to answer and more organisms to map.

“The bottleneck in many research projects is that it takes longer to analyse the results than to generate them”, he says.

At the start of November, Dag Ahrén held a week-long database course for doctoral students in biology. The course taught participants how to create their own databases to store analysed genetic data and how to extract data from public databases to analyse alongside their own results. The term ‘data mining’ is frequently used in this context: it refers to the art of searching databases for patterns and correlations in large collections of data. Seen this way, using databases in research resembles the hunt for valuable minerals in a mine; in both cases, unexploited resources are located and dug out.

“Yes, the databases are a tremendous resource and you have to know how to extract relevant information in the most effective way”, says Dag Ahrén.
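As a rough illustration of the kind of task the course covered (storing analysed sequence data in a database of one's own and then querying it for patterns), a minimal sketch in Python might look like the following. The table layout, the sample sequences and the function names are all hypothetical and are not taken from the course material; GC content is used here only as a simple stand-in for an analysis result.

```python
import sqlite3

# Hypothetical example data: (sequence id, organism, DNA sequence).
# These records are invented for illustration only.
OWN_RESULTS = [
    ("seq1", "fungus", "ATGCGC"),
    ("seq2", "alga", "ATATAT"),
    ("seq3", "fungus", "GGGCCC"),
]


def gc_content(sequence):
    """Return the fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


def build_database(records):
    """Create an in-memory SQLite database and load the records into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences ("
        "id TEXT PRIMARY KEY, organism TEXT, dna TEXT, gc REAL)"
    )
    # Store each sequence together with its computed GC content.
    conn.executemany(
        "INSERT INTO sequences VALUES (?, ?, ?, ?)",
        [(sid, org, dna, gc_content(dna)) for sid, org, dna in records],
    )
    conn.commit()
    return conn


def mean_gc_by_organism(conn):
    """Return {organism: mean GC content}, a simple pattern-style query."""
    rows = conn.execute(
        "SELECT organism, AVG(gc) FROM sequences GROUP BY organism"
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    connection = build_database(OWN_RESULTS)
    print(mean_gc_by_organism(connection))
```

In practice, data extracted from a public database could be loaded into the same table, so that a single query compares it with one's own results.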

He has a background in genetic research and long experience of data mining and of managing the large quantities of data generated within his field.

He now works as a research engineer at the Department of Biology, helping researchers with the bioinformatics needed to analyse DNA sequences. The organisms involved range from algae and fungi to bacteria and parasitic worms. As an illustration, he describes a recent meeting with a colleague about an ongoing project that has generated 1.5 billion DNA sequences. They discussed which database to choose and how to publish the data together with the research article.

Dag Ahrén explains that most scientific journals require researchers to make their data available when they publish their research. This means that researchers need a better understanding of how to handle raw data, how to analyse it and, finally, how to publish it so that other researchers can use it.

“It is important that this knowledge is in place at the university departments. We need to have staff who can deal with the data”, says Dag Ahrén.

Text: Lena Björk Blixt

Photo: Gunnar Menander

FOOTNOTE: The database course was organised by the Geneco graduate school and the CAnMove and PlantLink research programmes.
