16  Lab: Exploring a Dataset

In this lab we will work with a real-world dataset of resting-state EEG signals: https://github.com/OpenNeuroDatasets/ds005420.

16.1 Download data

16.1.1 Clone the repository

Inside the /pycourse/data folder:

git clone https://github.com/OpenNeuroDatasets/ds005420

16.1.2 Install git-annex

The repository does not contain the actual data files, only references to them, so that we can pull them down on demand. For this we need the git-annex tool.

uv tool install git-annex

16.1.3 Pull the data

Open a terminal inside ds005420 and run:

git-annex get .

You should see download progress messages showing ...from s3-PUBLIC...
Once the download finishes, you’re ready to go with the exercises.

Tip

Jupyter notebooks are ideal for this kind of exploratory task. Make a directory called /pycourse/notebooks and open a JupyterLab instance there. Keeping the notebooks in one place will help us keep things tidy for later reproducibility of our workflow.

16.2 Explore files

  1. List only the sub-directories in the dataset path.
  2. List only the sub-directories with subject data.
  3. Write a function that lists sub-directories with subject data.
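The steps above can be sketched with pathlib. The "sub-" prefix is an assumption based on the BIDS naming convention that OpenNeuro datasets follow; check it against your own listing first.

```python
from pathlib import Path

def list_subject_dirs(root):
    """Return the sub-directories that hold subject data.

    Assumes subject folders follow the BIDS naming convention
    ("sub-<label>", e.g. "sub-001") -- verify this in your clone.
    """
    root = Path(root)
    return sorted(p for p in root.iterdir()
                  if p.is_dir() and p.name.startswith("sub-"))

# All sub-directories (exercise 1) vs. subject sub-directories (exercises 2-3):
# all_dirs = [p for p in Path("ds005420").iterdir() if p.is_dir()]
# subjects = list_subject_dirs("ds005420")
```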

16.3 Validating the data

We will start by making sure our data/metadata contains the information we expect at a high level.

  1. Write a unit test (inside /pycourse/tests/test_data.py) to make sure the number of subject sub-directories corresponds to the actual number of subjects. Hint: Look at the metadata.
  2. Verify that all subject directories have an eeg sub-directory.
  3. Assert that the EEG data for all subjects was recorded using 20 channels and a sampling frequency of 500 Hz.

16.4 Exploratory data analysis

Now we want to look at the data itself. The data comes in the European Data Format (.edf), and we need to install a third-party library, mne, to read it. You can check out the library documentation here

Tip

It’s a very good idea to take a look at a tool’s documentation before installing it. Executing someone else’s code is a potential risk, so you should try to find out whether you can actually trust the source.

Hints:

  • Look at function mne.io.read_raw_edf to load data.
  • Look at the method .to_data_frame of the loaded data.
  1. Plot one time series.
  2. Clean the column names by removing the “EEG” prefix, e.g. “EEG C4-A1A2” -> “C4-A1A2”.
  3. Plot all time series with labels according to channel name. Hint: Look at the melt method of dataframes.
  4. Plot the channels that start with “P”, “T” or “O”.
  5. Plot a correlation plot of all-vs-all the “P”, “T” and “O” channels as a heatmap. Hint: Look up seaborn’s documentation on heatmaps.
  6. Save the correlation plot in svg format.
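As a sketch of steps 2 and 3, here is the column cleaning and the wide-to-long reshape, run on a tiny synthetic frame standing in for the output of .to_data_frame(). The actual loading (mne.io.read_raw_edf) and the plotting calls are left out; the “EEG ” prefix format is taken from the example in step 2.

```python
import pandas as pd

def clean_channel_name(name: str) -> str:
    # "EEG C4-A1A2" -> "C4-A1A2"; leaves non-EEG columns (e.g. "time") alone
    return name.removeprefix("EEG ").strip()

# Tiny synthetic frame standing in for raw.to_data_frame()
df = pd.DataFrame({
    "time": [0.0, 0.002, 0.004],
    "EEG C4-A1A2": [1.0, 2.0, 3.0],
    "EEG O1-A1A2": [4.0, 5.0, 6.0],
})
df = df.rename(columns=clean_channel_name)

# Long format: one row per (time, channel) pair, ready for seaborn
long = df.melt(id_vars="time", var_name="channel", value_name="amplitude")
print(sorted(long["channel"].unique()))  # ['C4-A1A2', 'O1-A1A2']
```

With the data in this long format, a single seaborn lineplot call can draw all channels with a legend.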

16.5 Single-subject data

After this quick look at the data, we want to start processing it. For now we are working with data from a single subject.

  1. Subtract the mean from each channel.
  2. Plot the mean-subtracted time series for all channels.
  3. Standardize and plot all time series again. Standardization means: \[ y = (x - \mu) / \sigma \] where \(\mu\) is the channel mean and \(\sigma\) its standard deviation.
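A minimal pandas sketch of steps 1 and 3 on a made-up two-channel frame (note that pandas’ .std() uses ddof=1 by default, which is fine here):

```python
import pandas as pd

# Hypothetical two-channel frame; in the lab this comes from the EDF file
df = pd.DataFrame({"C4": [1.0, 2.0, 3.0], "O1": [10.0, 20.0, 30.0]})

centered = df - df.mean()            # step 1: subtract each channel's mean
standardized = centered / df.std()   # step 3: divide by the standard deviation

print(standardized["C4"].tolist())   # [-1.0, 0.0, 1.0]
```

After standardization every channel has mean 0 and standard deviation 1, which puts them on a common scale for plotting.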

16.6 Multi-subject data

Here we are going to work with data from more than one subject at a time.

  1. Plot a histogram of RecordingDuration across all subjects. Hint: assume we want data in “oc_eeg”
  2. Pick 3 EEG channels and plot the time series (aggregated across all subjects) in one plot. Differentiate the lines by channel. Hint: Use seaborn and look up the hue parameter.
  3. Plot a grid of subplots with each plot representing 1 channel (aggregated across subjects). Hint: Adapt this example
  4. Pick 5 channels and only 3 time points. Simulate that the subjects belong to 3 groups.
  5. Adapt this example to plot a comparison between channels/subjects/time.
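For exercise 2, seaborn’s hue parameter wants the data in long format with one channel column. Here is a sketch with simulated data (the subject labels, channel names, and frame layout are all assumptions); the plotting call itself is shown as a comment since it needs a display:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rows = []
for subject in ["sub-001", "sub-002"]:      # stand-ins for real subjects
    for channel in ["C4", "P3", "O1"]:      # hypothetical channel pick
        for t in np.arange(0.0, 1.0, 0.1):
            rows.append({"subject": subject, "channel": channel,
                         "time": t, "amplitude": rng.normal()})
long = pd.DataFrame(rows)

# One aggregated line per channel, across subjects:
# import seaborn as sns
# sns.lineplot(data=long, x="time", y="amplitude", hue="channel")
print(long.shape)  # (60, 4)
```

The same long frame works for exercise 3: seaborn’s relplot with col="channel" produces the grid of per-channel subplots.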

16.7 Consolidate pipeline

Let’s consolidate our workflow into a pipeline to:

  • Read and assert subfolders
  • Clean column names
  • Standardize values
  • Plot and save correlations
  • Run tests
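The steps above can be consolidated into one script. Here is a skeleton under the same assumptions as the earlier exercises (BIDS “sub-” folders, “EEG ” column prefix); the loading, plotting, and test invocation are left as comments since they depend on your environment:

```python
from pathlib import Path

def list_subject_dirs(root: Path):
    # Read and assert subfolders (the BIDS "sub-" prefix is an assumption)
    subjects = sorted(p for p in root.iterdir()
                      if p.is_dir() and p.name.startswith("sub-"))
    assert subjects, f"no subject folders found under {root}"
    return subjects

def clean_channel_name(name: str) -> str:
    # "EEG C4-A1A2" -> "C4-A1A2"
    return name.removeprefix("EEG ").strip()

def standardize(df):
    # (x - mean) / std, column-wise on a pandas DataFrame
    return (df - df.mean()) / df.std()

def run_pipeline(root):
    subjects = list_subject_dirs(Path(root))
    # For each subject: load the .edf with mne, rename columns with
    # clean_channel_name, standardize, compute correlations, and save
    # the heatmap as .svg. Finally run the test suite, e.g.:
    #   pytest /pycourse/tests/
    return subjects
```

Keeping each step as a small named function makes the pipeline easy to test and to re-run when the data changes.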