KUB Datalab: Cleaning

Tools

At KUB Datalab we use and support a large array of software.

We have organised our main tools into the categories below. Bear in mind that many types of software can be used for several purposes, so each tool is listed under its main purpose.

Tools for cleaning data

Excel

Microsoft Excel lets users organize, format and calculate data with formulas in a spreadsheet. It can perform basic calculations, offers graphing tools and pivot tables, and includes a macro programming language called Visual Basic for Applications (VBA), among other useful features.
Spreadsheet applications such as MS Excel use a grid of cells arranged in numbered rows and letter-named columns to organize and manipulate data. They can also display data as charts, histograms and line graphs.
MS Excel lets users arrange data so that it can be viewed from different perspectives. Visual Basic for Applications is the programming language built into Excel, and users can write macros in it to implement a variety of complex numerical methods.

OpenRefine

OpenRefine is a free tool that helps you clean messy data. A typical workflow is to import a data file, work with the many data cleaning options in OpenRefine, and export the file once it is clean. OpenRefine has a range of import and export options. Users can work through OpenRefine’s graphical user interface or write expressions using GREL and regular expressions. OpenRefine is not designed for collecting, analysing or visualising data.
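To give a feel for the import, clean and export workflow described above, here is a minimal sketch in Python. It is not OpenRefine and does not use GREL. It only mirrors the three steps with Python's standard csv module. The sample data, the column names and the clean_rows helper are made up for illustration.

```python
import csv
import io


def clean_rows(rows):
    """Trim and collapse whitespace in every cell, and drop fully empty rows."""
    cleaned = []
    for row in rows:
        new_row = {key: " ".join((value or "").split()) for key, value in row.items()}
        if any(new_row.values()):
            cleaned.append(new_row)
    return cleaned


# Hypothetical messy input; in practice this would be a file on disk.
raw = "name,city\n  Anna  ,Copenhagen \nBo,  Aarhus\n,\n"

# 1. Import the data
rows = list(csv.DictReader(io.StringIO(raw)))

# 2. Clean the data
rows = clean_rows(rows)

# 3. Export the cleaned data
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "city"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
exported = out.getvalue()
print(exported)
```

In OpenRefine the same kind of clean-up is done interactively through menus and expressions, so each step can be inspected and undone.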

Orange

Orange is a component-based visual programming software package for data visualization, machine learning, data mining, and data analysis.
Orange components are called widgets. They range from simple data visualization, subset selection, and preprocessing to empirical evaluation of learning algorithms and predictive modeling.
Visual programming is implemented through an interface in which workflows are created by linking predefined or user-designed widgets, while advanced users can use Orange as a Python library for data manipulation and widget alteration.

Python

Python is a programming language available under an open source license. It is worth knowing a bit of Python programming, partly because the language is becoming more and more widespread in research, and partly because more and more analyses in the humanities, social sciences and natural sciences depend on algorithms and calculations. KUB Datalab’s Python courses cover, for example, analysis of text data and web scraping.

R

R is a programming language specifically designed for statistical data analysis. It is widely regarded as a standard tool for exploratory data analysis, data cleaning and visualization.
KUB Datalab offers R courses, both general and tailored to specific needs. In our open workshops, we advise on how to solve specific problems in and with R. You must work out for yourself which statistical test you need to apply and which visualization best suits your data. Once that is decided, we will do our utmost to get you to your goal.
Our approach is based on the tidyverse. We find that base R solutions are, in general, more difficult for beginners to grasp. Close collaboration with our resident Python experts ensures that we are ready to switch gears if necessary.

RegEx

Regular expressions are a structured way of describing patterns in text. A solid grasp of regular expressions enables us, for example, to find every word in a text that begins with "th", is followed by 3 or 4 characters and ends with either e or r. Regular expressions are useful in many situations and are available in several of the software packages supported by KUB Datalab. We aim to provide training in regular expressions and to incorporate the method wherever it is useful.
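The "th" example above can be sketched in Python with the standard re module. The sketch assumes one particular reading of the description: "th", then 3 or 4 word characters, then a final e or r, with the whole match being a complete word. The function name and the sample sentence are made up for illustration.

```python
import re

# Assumed reading of the pattern: "th", then 3 or 4 word characters,
# then a final "e" or "r", with word boundaries on both sides.
PATTERN = re.compile(r"\bth\w{3,4}[er]\b")


def find_th_words(text):
    """Return every word in text that fits the assumed pattern."""
    return PATTERN.findall(text)


sample = "The thunder over the theatre made their thinker thrive there."
print(find_th_words(sample))
# prints ['thunder', 'theatre', 'thinker', 'thrive']
```

Words such as "their" and "there" are skipped because they have too few characters between "th" and the final letter. The match is case-sensitive, so a capitalised "Th" would need the re.IGNORECASE flag.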