DSC-2025-20 | No Messy Data – Introduction to Data Cleaning and Pre-processing in R

When?

4 November 2025
09:00 - 16:30

Where?

Campus (room to be announced shortly)

Trainer

Dr. Susanne de Vogel
Data Science Center, Universität Bremen

Number of participants: max. 20
Language: English

Why is the topic important?

Survey and other research data are often messy: variable names may be unclear, values missing, or the structure not suited for the analysis you have in mind. Before you can run meaningful analyses, you usually need to tidy up — and that’s exactly where data cleaning and pre-processing come in.

Learning how to handle these early steps well is essential for understanding your data and producing reliable results. It’s also part of good scientific practice: preparing your data in a way that is transparent, understandable, and easy to reproduce later on.

Understanding common pitfalls, best practices, and standard conventions for handling data not only helps avoid errors but also increases the credibility and trustworthiness of your research. R is a free, open-source tool that is widely adopted in the research community. Using it, you can develop efficient, script-based workflows that are adaptable, transparent, reproducible, and shareable. The tidyverse, a collection of R packages, provides a consistent syntax for data manipulation, which makes it comparatively easy to learn and apply.

Workshop Goal

By the end of the workshop, participants will understand why data cleaning is a crucial step for transparent and reproducible research, and be able to recognize typical issues in raw research data. They will gain hands-on experience transforming and structuring data into a format suitable for analysis. In addition, participants will understand how to document the data cleaning process clearly and reproducibly using R scripts.

Workshop Content

The workshop combines basic theoretical input with plenty of hands-on exercises in R and the tidyverse, so that you not only understand key concepts but also gain confidence in applying them in practice.

  • Best practices in data cleaning for quality, transparency, and reproducibility
  • Exploring and reducing data
  • Creating and manipulating variables
  • Working with variable and value labels
  • Detecting and handling missing values
  • Documenting data cleaning

Target Audience & Prior Knowledge

This workshop is a beginner-level training. It’s aimed at researchers in the social sciences, health sciences, and humanities who are working with – or planning to work with – survey-based or other quantitative data, but have little or no prior experience in handling such data or using statistical software.

A little programming experience in R or another language is an advantage, but not a requirement. All that's needed is a willingness to engage with R and take the first steps into scripting and coding.

Technical Requirements

Your own laptop and a stable Wi-Fi connection (e.g. via eduroam).

Please install R version 4.5.0 or higher and RStudio version 2025.05.1+513 or higher before the course. Both programs are free and open source.


About the Trainer

Dr. Susanne de Vogel is a data scientist for training and consulting at the DSC. She holds a diploma in Social Sciences from the University of Cologne (2013) and a PhD in Sociology from the Martin Luther University Halle-Wittenberg (2019). Susanne has worked for over 10 years on the development and implementation of various panel studies at the German Center for Higher Education Research and Science Studies (DZHW) in Hanover. Her expertise lies in survey design, instrument development, and the collection, preparation, analysis, and management of (survey) data.