Metropolia LibGuides: Data management for thesis: Data Description and Documentation

Data description for the Thesis

In a thesis, the documentation and the description of research data usually mean the same thing: recording how the research data was produced, what it contains, and how it has been processed.

Metadata, or descriptive information, is data about the research data, usually recorded in a standardized format. Documentation, description, and metadata are closely related terms and are sometimes used interchangeably. For a thesis, however, the differences between these concepts matter less than the purpose that documentation and metadata serve.

Why is data described or documented?

Without sufficiently detailed information about its context, research data is often useless. Consider, for example, measurement results stored in an Excel spreadsheet. If there is no record of what was measured, with which instruments, and on what scale, the data is reduced to numbers in a table that cannot be understood or interpreted. Description supplies this contextual information.

Therefore, describing data is an essential part of responsible data management because without it, the data can be difficult or even impossible to interpret.

Think in advance about what information needs to be recorded for each dataset (e.g., interview data or a measurement dataset) to make the data understandable. The data should remain comprehensible after a long period of time and when examined by someone other than yourself. Data that can be misinterpreted is unreliable, because misinterpretation affects the results derived from it.

The thesis supervisor should also be able to understand the research data.

Description and documentation are also important for verifying results and for research reproducibility. They help confirm that the results of the thesis are reliable and could be replicated if necessary.

Sufficient descriptive information and documentation are particularly important if you intend to publish the data as open access or to reuse it after completing the thesis.

How should research data be described?

Description is always specific to the dataset, as its purpose is to make the data understandable and to ensure that it is interpreted correctly. The following levels of description are examples of the information that should be recorded.

Thesis-level description

Thesis-level description refers to basic information about the research conducted in the thesis, such as:

The purpose for which the data was collected.
The method used to collect the data.
How the data collection was carried out (by whom, from where or from whom, when, and with what tools).
Access and usage conditions for the data (if provided, such as through a Creative Commons license).

File-level description

File-level description covers the characteristics of individual files. Its purpose is to make the right information easy to find and to help maintain file integrity. If the thesis involves only a few separate files, file-level description may not be necessary. Examples of file-level information include:

File format
File size
File name
Relationship between files, such as different versions
How the files are organized into different folders
How the folders are organized and named
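
As an illustration, most of the file-level information listed above can be collected automatically. The following Python sketch walks a data folder and records the folder, name, format, and size of each file. The function names `describe_files` and `write_inventory` are hypothetical, and the SHA-256 checksum is an added assumption that is not part of the list above; it is one common way to check later that a file has not changed.

```python
import csv
import hashlib
from pathlib import Path


def describe_files(data_dir):
    """Return one description row per file found under data_dir."""
    root = Path(data_dir)
    rows = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip folders; they are captured in the "folder" column
        rows.append({
            "folder": path.parent.relative_to(root).as_posix(),
            "file_name": path.name,
            "file_format": path.suffix.lstrip(".").lower() or "unknown",
            "size_bytes": path.stat().st_size,
            # Assumption: a checksum is recorded to support integrity checks.
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return rows


def write_inventory(rows, out_path):
    """Save the description rows as a CSV file."""
    fields = ["folder", "file_name", "file_format", "size_bytes", "sha256"]
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

A call such as `write_inventory(describe_files("thesis_data"), "file_inventory.csv")` would produce the inventory, where both names are placeholders. Relationships between files, such as versions, still need to be described by hand.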

Variable-level description

Variable-level description covers the variables in the dataset. In addition to a list of the variables and their measurement scales, it is good practice to record any notations, abbreviations, and codes used.
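
A variable-level description is often kept as a codebook. The sketch below is one possible way to write such a codebook in Python and to check a dataset against it. All variable names, labels, codes, and sample values are invented for illustration, and the helper functions are hypothetical rather than part of any standard tool.

```python
import csv
import io

# Hypothetical codebook: one entry per variable (column) in the dataset.
CODEBOOK = {
    "resp_id": {"label": "Respondent identifier", "scale": "nominal", "codes": {}},
    "age_grp": {
        "label": "Age group",
        "scale": "ordinal",
        "codes": {"1": "18-29", "2": "30-49", "3": "50 or older"},
    },
    "satisf": {
        "label": "Satisfaction with the service",
        "scale": "ordinal",
        "codes": {"1": "low", "2": "medium", "3": "high", "9": "no answer"},
    },
}


def undocumented_variables(header, codebook):
    """Return column names that have no entry in the codebook."""
    return [name for name in header if name not in codebook]


def undocumented_codes(rows, codebook):
    """Return (variable, value) pairs that occur in the data but not in the codebook."""
    missing = set()
    for row in rows:
        for name, value in row.items():
            codes = codebook.get(name, {}).get("codes", {})
            if codes and value not in codes:
                missing.add((name, value))
    return sorted(missing)


# Invented sample data standing in for a real data file.
SAMPLE = "resp_id,age_grp,satisf,temp_c\nR01,2,3,21.5\nR02,4,9,20.9\n"
reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
print(undocumented_variables(reader.fieldnames, CODEBOOK))  # ['temp_c']
print(undocumented_codes(rows, CODEBOOK))  # [('age_grp', '4')]
```

In this invented example the check reports a column that the codebook does not describe and a code value that has no documented meaning, which are the kinds of gaps that make data hard to interpret later.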

To ensure the reliability and integrity of the data, it is also important to document how the data has been processed and modified.

Contextual information and paradata

In addition to the above, contextual information or paradata can also be recorded if they are relevant to the dataset.

Contextual information refers to data about external conditions that prevailed during data collection and could potentially affect the data. These may include societal events, natural disasters, or accidents.

Paradata, on the other hand, is empirical data about the data collection process itself. For example, it could include the duration of different parts of an interview, response delays, or visual observations made by the interviewer during the interview situation.

Where are the description data and documentation stored?

There are several options for storing description data. For example:

Separate text file: Create a separate text file named "Readme" and store it alongside the data. Multiple Readme files can be created if needed.
Within the actual data file: For example, in an Excel spreadsheet you can place variable descriptions on a separate worksheet, and in an interview transcript you can place the participant's background information and the interview date at the beginning.
Excel spreadsheet or database: These options are often more suitable than Readme files for larger datasets.
Digital research diary: You can use a digital research diary to record and store description data along with other research notes and documentation.
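
For the separate text file option, a Readme can be started from a fixed list of headings so that no item is forgotten. The sketch below is a minimal example under stated assumptions: the headings are adapted from the thesis-level description in this guide, while the field list `README_FIELDS`, the function names, the placeholder text, and the file name `Readme.txt` are hypothetical choices.

```python
from pathlib import Path

# Headings adapted from the thesis-level description; adjust them per dataset.
README_FIELDS = [
    "Title",
    "Purpose of data collection",
    "Collection method",
    "Collected by",
    "Collected from",
    "Collected when",
    "Tools used",
    "Processing and modifications",
    "Access and usage conditions",
]


def build_readme(info):
    """Return Readme text; fields without a value are marked for completion."""
    lines = []
    for field in README_FIELDS:
        value = str(info.get(field, "")).strip() or "TO BE COMPLETED"
        lines.append(f"{field}: {value}")
    return "\n".join(lines) + "\n"


def write_readme(info, folder):
    """Write Readme.txt into the folder that holds the data and return its path."""
    target = Path(folder) / "Readme.txt"
    target.write_text(build_readme(info), encoding="utf-8")
    return target
```

Calling `build_readme({"Title": "Pilot interviews"})` returns a text in which every other heading is marked "TO BE COMPLETED", which makes missing description easy to spot before the thesis is submitted.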