Metropolia LibGuides: Open RDI and Data management: Identifiable Data and Anonymisation

Identifiable data and anonymisation

The processing of identifiable data requires particular care. Data is identifiable if it can be used to identify an individual person or a cluster of persons, such as a family.

Identifiable data can be used in research and development activities when the use is purposeful, planned and justified, and there is a legal basis for the processing, such as the participant's consent or research carried out in the public interest.

Personal data must be removed from the dataset, or the dataset must be anonymised, as soon as the personal data are no longer needed.

The content of this page is based on the Data Management Guidelines of the Finnish Social Science Data Archive (FSD). You can find Metropolia's privacy guidelines and templates on the OMA intranet.

Personal or identifiable data

Identifiable data includes any information that can directly or indirectly identify a person. Research or development data may also contain identifying information about the research subject's close associates or other individuals. Information that identifies them is also considered identifiable data.

Direct identifiers: a person's full name, social security number, email address containing the personal name, and biometric identifiers (fingerprints, facial image, voice patterns, iris scan, hand geometry or manual signature).

Strong indirect identifiers: e.g. a postal address, phone number, vehicle registration number, unusual job title, very rare disease or various unique identifiable codes, such as a student ID number.

Indirect identifiers: information that on its own is not enough to identify someone but, when linked with other available information, could be used to deduce the identity of a person. These include, for example, gender, age, education, occupational status, household composition, income, marital status, mother tongue, nationality, ethnic background, or place of work or study. When the target group of a study is relatively small, combining several pieces of indirect background information can make an individual reasonably easy to identify.

Special categories of personal data

Sensitive personal data refers to the special categories of personal data defined by the General Data Protection Regulation (GDPR), which include information revealing:

Racial or ethnic origin
Political opinions
Religious or philosophical beliefs
Trade union membership
Data concerning health
Sexual orientation or activity
Genetic and biometric data for identifying the person

Sensitive data must be protected with particular care, as their processing can pose risks to the fundamental rights of individuals. For this reason, processing them is generally prohibited. There are exceptions to this prohibition, one of which is the individual's explicit consent to the processing of such sensitive personal data.

Please note that storing special categories of personal data in cloud services is prohibited at Metropolia.

Processing of special categories of personal data (tietosuoja.fi)

Minimisation of identifiable data

The principle of minimisation is to avoid the collection of unnecessary identifiable data. This principle should be followed when planning research.

Collect only those identifiable data that are essential to answer the research questions.
Do not collect identifiable data "just in case."
Avoid collecting sensitive information.
Avoid open-ended response options in surveys, as you cannot control what respondents write.
During interviews, ask the interviewee to avoid providing specific details such as names or workplaces.
Consider how detailed the information needs to be. Would categories or generalisations be sufficient instead of precise data? For example, use age ranges such as "20-29 years old" instead of exact age, or refer to "a university of applied sciences" instead of naming Metropolia University of Applied Sciences.

Processing data containing identifiers

The processing of research data containing identifiers must be planned thoroughly and executed carefully. Data protection must not be jeopardised, for example, by careless preservation or insecure digital transfers.

General protective measures in processing personal data include pseudonymisation, anonymisation and storage limitation.

Pseudonymisation

Pseudonymisation refers to the removal of identifiers or their replacement with pseudonyms or codes. The information needed to reverse the replacement is kept separate from the data and protected by technical and organisational measures. Organisational measures refer to the protection of the physical environment and documented access control. Technical measures refer to secure data storage solutions. Pseudonymous data become anonymous when the separately kept identifying information (decryption key, personal data and information on the techniques used to pseudonymise the data) is destroyed.
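
The following is a minimal sketch of the idea in Python, not a Metropolia-approved procedure. It replaces a direct identifier with a random code and returns the key table separately, so that the table can be stored apart from the data and later destroyed. The function name pseudonymise, the field names name and participant_code, and the "P-" code format are all hypothetical choices made for this illustration.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Replace the identifier in each record with a random code.

    Returns (pseudonymised_records, key_table). The key table maps each
    code back to the original identifier and must be stored separately
    from the data, under access control.
    """
    key_table = {}  # code -> original identifier
    codes = {}      # original identifier -> code
    pseudonymised = []
    for record in records:
        identifier = record[id_field]
        if identifier not in codes:
            code = "P-" + secrets.token_hex(4)
            while code in key_table:  # guard against a rare collision
                code = "P-" + secrets.token_hex(4)
            codes[identifier] = code
            key_table[code] = identifier
        # Copy every field except the identifier, then add the code.
        new_record = {k: v for k, v in record.items() if k != id_field}
        new_record["participant_code"] = codes[identifier]
        pseudonymised.append(new_record)
    return pseudonymised, key_table


# Hypothetical example data: the same person answers twice.
answers = [
    {"name": "Anna Example", "answer": "yes"},
    {"name": "Ben Example", "answer": "no"},
    {"name": "Anna Example", "answer": "maybe"},
]
data, key_table = pseudonymise(answers)
```

In this sketch the same person always receives the same code, so their responses can still be linked during the research. Replacing names does not deal with indirect identifiers elsewhere in the records; those need separate attention.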

Anonymisation

Data anonymisation refers to processing data in such a way that they no longer contain any identifiable information. In the case of personal data, this means that individuals cannot reasonably be identified from the dataset. Organisational information or other confidential data can be removed from the dataset in the same way.

Even if you do not collect personal data directly from the research subjects, it may still be possible to identify them from the dataset. For example, an anonymous survey may not be truly anonymous if the research subjects can reveal information about themselves in open-ended responses or if the survey form records the respondent's IP address. Such data is not anonymous and is subject to data protection laws.
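
As a rough illustration of the two risks mentioned above, the Python sketch below drops a technical field such as an IP address from a survey response and flags open-ended answers that contain an email-like string. The field names ip_address and comments, the function names and the regular expression are assumptions made for this example. The pattern only catches email-like strings, so names, workplaces and other details in free text still need to be reviewed by a person.

```python
import re

# Simple pattern for email-like strings; it is not a full address validator.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def strip_technical_identifiers(response, fields=("ip_address",)):
    """Return a copy of the response without the listed technical fields."""
    return {k: v for k, v in response.items() if k not in fields}


def needs_manual_review(response, text_field="comments"):
    """Return True if the open-ended answer contains an email-like string."""
    return bool(EMAIL_PATTERN.search(response.get(text_field, "")))


# Hypothetical survey response.
response = {
    "ip_address": "192.0.2.10",
    "comments": "You can reach me at anna.example@example.com",
}
cleaned = strip_technical_identifiers(response)
flagged = needs_manual_review(cleaned)
```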

There is no single anonymisation technique suitable for all types of data. Anonymisation should always be planned case by case.

You can get a clear picture of the anonymisation process in both qualitative and quantitative research with the help of the following questions:

What kinds of direct and indirect identifiers do the data contain?
Do the data contain exceptional or unique observations?
What combinations of information in the data could be used to identify a person?
Can information from external sources be linked to your data and thus identify the observations/individuals?
What will the data be used for, and which features need to be preserved and which can be “sacrificed” in the anonymisation process?
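
For quantitative data, the questions about unique observations and identifying combinations can be partly explored with a simple count. The Python sketch below lists every combination of chosen indirect identifiers that is shared by fewer than k records. This is the basic idea behind k-anonymity, a technique the guidance above does not name. The function name rare_combinations, the column names and the threshold k are illustrative assumptions, and a suitable threshold depends on the dataset and its intended use.

```python
from collections import Counter


def rare_combinations(records, indirect_identifiers, k=3):
    """Return combinations of identifier values shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in indirect_identifiers)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical example data.
respondents = [
    {"gender": "female", "age_group": "20-29", "field": "nursing"},
    {"gender": "female", "age_group": "20-29", "field": "nursing"},
    {"gender": "female", "age_group": "20-29", "field": "nursing"},
    {"gender": "male", "age_group": "60-69", "field": "nursing"},
]
risky = rare_combinations(respondents, ["gender", "age_group", "field"], k=2)
# risky == {("male", "60-69", "nursing"): 1}
```

A combination that occurs only once or twice is a candidate for reclassification, generalisation or deletion. The count cannot tell whether external sources could be linked to the data, so that question still has to be assessed separately.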

Techniques for anonymisation include, for example:

Deleting individual pieces of data. The deletion can be marked in the dataset with square brackets, for example "[data removed]".
Reclassification of data. For example, if you have collected specific ages or occupations, you can replace them with age groups or occupational categories.
Using pseudonyms. If names appear in the dataset, you can replace them with pseudonyms instead of deleting them.
Generalization. You can make specific data more general, for example, replacing "AIDS" with the term "disease" and "Metropolia" with the term "university of applied sciences."
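
Three of the techniques listed above (reclassification, generalisation and marked deletion) can be sketched in Python as follows. The function names, the ten-year age bands and the replacement table are illustrative assumptions. In a real project the categories and replacements should come from the case-by-case anonymisation plan.

```python
import re


def age_group(age, width=10):
    """Reclassify an exact age into a band such as '20-29'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Illustrative replacement table: specific term -> more general term.
GENERALISATIONS = {
    "AIDS": "a disease",
    "Metropolia": "a university of applied sciences",
}


def generalise(text, mapping=GENERALISATIONS):
    """Replace specific terms in the text with more general ones (whole words only)."""
    for specific, general in mapping.items():
        text = re.sub(rf"\b{re.escape(specific)}\b", general, text)
    return text


def mark_removed(text, fragment):
    """Delete a fragment and leave a visible marker in its place."""
    return text.replace(fragment, "[data removed]")


band = age_group(27)                                 # "20-29"
sentence = generalise("She studies at Metropolia.")  # "She studies at a university of applied sciences."
redacted = mark_removed("I live at Example Street 1.", "Example Street 1")
# redacted == "I live at [data removed]."
```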

Storage limitation

Personal data that are no longer needed to conduct the research should be erased as soon as possible. For example, names, addresses and other similar identifiers needed at the data collection stage should be removed as soon as they are no longer necessary for carrying out the research. If personal identity codes were used to link data, they should also be deleted when they are no longer needed.

Data Protection Checklist for RDI projects

Create a data management plan. Identify whether you collect and process personal data. Collect personal data only when needed for the purposes of the research or development work.
If the dataset poses significant privacy risks for the research subject, conduct a Data Protection Impact Assessment (DPIA). This is necessary, for example, when the dataset contains sensitive personal data or involves research subjects who are children. An ethical review may also be necessary. You can find the DPIA form on the OMA intranet.
Define the data controller.
Prepare a privacy notice. The collection, retention, processing, and destruction of personal data must be planned in advance and described clearly and comprehensibly in the privacy notice. The privacy notice is included in Metropolia's template form for informing research subjects.
You need a legal basis for collecting and processing personal data. It can be, for example, the research subject's consent or the public interest.
Before collecting data, inform the research subject about the research and the processing of personal data in an understandable manner. Metropolia has its own template form for this purpose. After receiving the information, the research subject can give consent to participate in the research and the processing of their personal data. There is also a separate form for this.
Process personal data with care and only as communicated to the research subjects. Use Metropolia-approved tools for data collection, transfer, and storage. Data breaches and negligence can lead to sanctions.