Love Your Data Week, Day 1: Defining Data Quality

Today’s LYD post features the thoughts of Dylan Shields, the Graduate Assistant for the Chemistry-Biology Library and Chemistry Graduate Student in Anna Gudmundsdottir’s Lab.

Welcome back to another edition of Love Your Data Week!!

The first topic for this week is DEFINING DATA QUALITY!

So what IS data quality? Well, first off, it is important to note that definitions and practices of data quality can differ widely depending on the field of study. However, there are a few markers of data quality that apply broadly to most research: accuracy, consistency, completeness, and accessibility.

So what are these markers and why are they important?

Accuracy of the data is paramount for assuring data quality. If the data is not accurate, can it truly be trusted? Data accuracy can be difficult to determine because researchers are often trying to discover something new, where an accepted notion or value may not be known. Data accuracy also includes whether the data is still current. Older data may no longer be accurate if the data is time sensitive! If you have doubts about the accuracy of your data, it is always better to take steps to improve it. This may require tasks such as more data collection, updating older data, different experimentation, or continued analysis of the existing data!

Inconsistent data is one problem sure to ruin any data set. However, unlike accuracy, consistency is often easier to check because your data needs to agree with itself! Consistency problems are often the result of experimental errors. Be sure to think critically about where errors in consistency may arise in your experimental process. More experimental runs, different experimentation, or selective exclusion of data points are often the best ways to address issues with consistency. Excluding data points should be your last resort and you must have an excellent reason for doing so!
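As a rough illustration of data "agreeing with itself," the sketch below flags replicate measurements that sit far from the rest of their group. It assumes a simple list of repeated runs and uses a modified z-score based on the median; the function name, the sample numbers, and the 3.5 cutoff (a common rule of thumb) are all assumptions made for this example, not a standard from any particular field. Note that it only flags points for a closer look. Whether to exclude anything is still your call, and still a last resort.

```python
from statistics import median


def flag_inconsistent(values, threshold=3.5):
    """Return the indices of values that look inconsistent with the rest.

    Uses a modified z-score built on the median absolute deviation (MAD).
    The threshold of 3.5 is an assumed rule-of-thumb cutoff.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical: flag anything that differs.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical replicate runs of the same measurement.
runs = [10.1, 10.3, 9.9, 10.2, 14.8]
print(flag_inconsistent(runs))  # [4]
```

A flagged index is a prompt to go back to your notebook and ask what happened during that run, not permission to delete the point.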

The completeness of your data is extremely important. This category is fairly straightforward: you need to have a full set of data for your conclusions to be fully accepted! To elaborate, there are a few different criteria for completeness. First, completeness can simply mean doing all the necessary experiments or collecting all of the necessary pieces of information so that conclusions can be drawn accurately and reliably. Another criterion for completeness could be transforming raw data into a form that is more easily understood. Finally, and most likely the most difficult, completeness means drawing all of the conclusions that the data supports.
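For the first criterion, a quick tally of missing entries can show where a data set has gaps before you start drawing conclusions. The sketch below assumes your records are stored as a list of dictionaries; the field names and sample values are hypothetical and only there to make the example runnable.

```python
def missing_by_field(records, required_fields):
    """Count how many records are missing each required field.

    A field counts as missing if it is absent, None, or an empty string.
    """
    counts = {field: 0 for field in required_fields}
    for record in records:
        for field in required_fields:
            value = record.get(field)
            if value is None or value == "":
                counts[field] += 1
    return counts


# Hypothetical sample records; A2 and A3 each have one gap.
samples = [
    {"sample_id": "A1", "temperature_c": 25.0, "yield_pct": 81.2},
    {"sample_id": "A2", "temperature_c": None, "yield_pct": 79.5},
    {"sample_id": "A3", "temperature_c": 30.0},
]
print(missing_by_field(samples, ["sample_id", "temperature_c", "yield_pct"]))
# {'sample_id': 0, 'temperature_c': 1, 'yield_pct': 1}
```

A count like this only covers whether the information was collected. Whether the experiments themselves are sufficient, and whether all the conclusions have been drawn, still takes a researcher's judgment.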

Accessibility is a multifaceted area of data quality. In the most basic meaning of the term, people have to be able to obtain your data! Many publication outlets have an area specifically designated for including relevant data. However, there are additional places to share your data. Having your data openly available allows for easier collaboration and increases the likelihood of someone using your data for a separate study. A list of Open Data Repositories for many research fields can be found at the following link: Before adding to these repositories, it is important to request permission from all collaborators who may have had a hand in creating the data. In addition to making your data visible, it is also important for your data to be easily understood when viewed. The best way to achieve this is to label your data as clearly as possible. Esoteric labels can readily detract from the quality of the data because other researchers aren’t going to expend additional brain power to understand poor labeling! Graphical representation is another important aspect to consider. Some graph types fit certain data sets more effectively than others, so be sure to give careful thought to which one you choose.