16 Lab: Exploring a Dataset
We will work with a real-world dataset of resting-state EEG signals: https://github.com/OpenNeuroDatasets/ds005420.
16.1 Download data
16.1.1 Clone the repository
Inside the /pycourse/data folder:
git clone https://github.com/OpenNeuroDatasets/ds005420
16.1.2 Install git-annex
The files with the actual data are not there, but we have references to them so that we can pull them down. We will need the git-annex tool.
uv tool install git-annex
16.1.3 Pull the data
Open a terminal inside ds005420 and run:
git-annex get .
You should see a progress dialog showing ...from s3-PUBLIC...
After that, you’re ready to go with the exercises.
Jupyter Notebooks are ideal for this kind of exploratory task. Make a directory called /pycourse/notebooks and open a Jupyter Lab instance there. Keeping the notebooks there will help us keep things tidy for later reproducibility of our workflow.
16.2 Explore files
- List only the sub-directories in path.
- List only the sub-directories with subject data.
- Write a function that lists sub-directories with subject data (see the sketch below).
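A minimal sketch for the last exercise, assuming the repository was cloned into /pycourse/data/ds005420 and that the subject folders are named sub-XXX (adjust the pattern if the folders are named differently):

```python
from pathlib import Path

DATA_PATH = Path("/pycourse/data/ds005420")  # assumed location of the cloned dataset


def list_subject_dirs(path: Path) -> list[Path]:
    """Return only the sub-directories that contain subject data.

    Assumes subject folders are named 'sub-*'.
    """
    return sorted(p for p in path.iterdir() if p.is_dir() and p.name.startswith("sub-"))


print(list_subject_dirs(DATA_PATH))
```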
16.3 Validating the data
We will start by making sure our data/metadata contains the information we expect at a high level.
- Write a unit test (inside /pycourse/tests/test_data.py) to make sure the number of subject sub-directories corresponds to the actual number of subjects. Hint: Look at the metadata.
- Verify that all subject directories have an eeg sub-directory.
- Assert that EEG data for all subjects was taken using 20 channels and a sampling frequency of 500 Hz (see the test sketch below).
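A possible starting point for /pycourse/tests/test_data.py, to be run with pytest. It assumes the dataset follows the usual BIDS layout: a participants.tsv file with one row per subject, and one *_eeg.json sidecar per recording containing SamplingFrequency and EEGChannelCount. Adjust the paths and field names to what you actually find in the metadata:

```python
import json
from pathlib import Path

import pandas as pd

DATA_PATH = Path("/pycourse/data/ds005420")  # assumed dataset location


def test_number_of_subject_dirs_matches_metadata():
    # participants.tsv is part of the BIDS layout; one row per subject
    participants = pd.read_csv(DATA_PATH / "participants.tsv", sep="\t")
    subject_dirs = [p for p in DATA_PATH.iterdir() if p.is_dir() and p.name.startswith("sub-")]
    assert len(subject_dirs) == len(participants)


def test_every_subject_has_an_eeg_dir():
    for sub in DATA_PATH.glob("sub-*"):
        assert (sub / "eeg").is_dir(), f"{sub.name} has no eeg sub-directory"


def test_channels_and_sampling_frequency():
    # The *_eeg.json sidecars describe each recording; field names follow the BIDS spec
    sidecars = list(DATA_PATH.glob("sub-*/eeg/*_eeg.json"))
    assert sidecars, "no EEG sidecar files found (check the path or session layout)"
    for sidecar in sidecars:
        meta = json.loads(sidecar.read_text())
        assert meta["EEGChannelCount"] == 20
        assert meta["SamplingFrequency"] == 500
```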
16.4 Exploratory data analysis
Now we want to look at the data. We find that it is stored in the European Data Format (.edf), and we need to install a third-party library, mne, to read it. You can check out the library documentation here.
It’s a very good idea to first take a look at the documentation of a tool before installing it. Executing someone else’s code is a potential risk so you should try to find out if you can actually trust the source.
Hints:
- Look at the function mne.io.read_raw_edf to load data.
- Look at the method .to_data_frame of the loaded data.
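A minimal loading sketch based on those hints; the .edf file name below is hypothetical, so point it at one of the files you find under a subject's eeg folder:

```python
import matplotlib.pyplot as plt
import mne

# Hypothetical path: pick any .edf file from a subject's eeg/ folder
edf_file = "/pycourse/data/ds005420/sub-001/eeg/sub-001_task-rest_eeg.edf"

raw = mne.io.read_raw_edf(edf_file, preload=True)  # load the recording into memory
df = raw.to_data_frame()                           # one column per channel, plus a time column

print(df.head())
df.plot(x="time", y=df.columns[1])  # quick look at the first channel
plt.show()
```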
- Plot one time series.
- Clean the column names by removing “EEG”, e.g. “EEG C4-A1A2” -> “C4-A1A2”.
- Plot all time series with labels according to channel name. Hint: Look at the melt method of dataframes.
- Plot the channels that start with “P”, “T” or “O”.
- Plot the all-vs-all correlation of the “P”, “T” and “O” channels as a heatmap. Hint: Look up seaborn’s documentation on heatmaps.
- Save the correlation plot in SVG format (a sketch covering these plotting steps follows below).
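One way to approach the plotting exercises, assuming df is the data frame produced by the loading sketch above and that the channel names carry an "EEG " prefix as in the example:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Clean column names: "EEG C4-A1A2" -> "C4-A1A2"
df = df.rename(columns=lambda c: c.replace("EEG ", ""))

# Long format: one row per (time, channel, value) triple
long_df = df.melt(id_vars="time", var_name="channel", value_name="amplitude")

# All channels in one plot, coloured by channel name
sns.lineplot(data=long_df, x="time", y="amplitude", hue="channel")
plt.show()

# Only the channels starting with "P", "T" or "O"
pto = [c for c in df.columns if c.startswith(("P", "T", "O"))]

# All-vs-all correlation as a heatmap, saved as SVG
corr = df[pto].corr()
ax = sns.heatmap(corr, cmap="coolwarm")
ax.figure.savefig("correlation_pto.svg", format="svg")
```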
16.5 Single-subject data
After taking this quick look at the data, we want to start processing it. So far we are working with data from a single subject.
- Subtract the mean from each channel.
- Plot the mean-subtracted time series for all channels.
- Standardize and plot all time series again (see the sketch below). Standardization means: \[ y = \frac{x - \text{mean}(x)}{\text{std}(x)} \]
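A sketch of both transformations with pandas, again assuming df holds one subject's channels plus a time column:

```python
import matplotlib.pyplot as plt

channels = df.columns.drop("time")  # keep the time axis untouched

# Subtract the per-channel mean
demeaned = df.copy()
demeaned[channels] = df[channels] - df[channels].mean()
demeaned.plot(x="time", legend=False, title="Mean-subtracted channels")
plt.show()

# Standardize: y = (x - mean) / standard deviation
standardized = df.copy()
standardized[channels] = (df[channels] - df[channels].mean()) / df[channels].std()
standardized.plot(x="time", legend=False, title="Standardized channels")
plt.show()
```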
16.6 Multi-subject data
Here we are going to work with data from more than one subject at a time.
- Plot a histogram of RecordingDuration across all subjects (see the sketch after this list). Hint: assume we want data in “oc_eeg”.
- Pick 3 EEG channels and plot the time series (aggregated across all subjects) in one plot. Differentiate the lines by channel. Hint: Use seaborn and look up the hue parameter.
- Plot a grid of subplots with each plot representing 1 channel (aggregated across subjects). Hint: Adapt this example.
- Pick 5 channels and only 3 time points. Simulate that subjects belong to 3 groups.
- Adapt this example to plot a comparison between channels/subjects/time.
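A rough sketch for the first exercises in this list. The RecordingDuration field, the file layout and the channel names are assumptions; adapt them to what the dataset actually contains:

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt
import mne
import pandas as pd
import seaborn as sns

DATA_PATH = Path("/pycourse/data/ds005420")  # assumed dataset location

# Histogram of RecordingDuration across subjects, read from the *_eeg.json sidecars
durations = [
    json.loads(p.read_text())["RecordingDuration"]
    for p in DATA_PATH.glob("sub-*/eeg/*_eeg.json")
]
sns.histplot(durations)
plt.show()

# Three channels, aggregated across all subjects, differentiated by hue
CHANNELS = ["C3-A1A2", "C4-A1A2", "O1-A1A2"]  # hypothetical channel names
frames = []
for edf in DATA_PATH.glob("sub-*/eeg/*_eeg.edf"):
    raw = mne.io.read_raw_edf(edf, preload=True, verbose="error")
    df = raw.to_data_frame().rename(columns=lambda c: c.replace("EEG ", ""))
    df["subject"] = edf.parts[-3]  # the sub-XXX folder name
    frames.append(df)
all_subjects = pd.concat(frames, ignore_index=True)

long_df = all_subjects.melt(
    id_vars=["time", "subject"], value_vars=CHANNELS,
    var_name="channel", value_name="amplitude",
)
sns.lineplot(data=long_df, x="time", y="amplitude", hue="channel")
plt.show()

# One subplot per channel, aggregated across subjects
sns.relplot(data=long_df, x="time", y="amplitude", col="channel", kind="line", col_wrap=3)
plt.show()
```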
16.7 Consolidate pipeline
Let’s consolidate our workflow into a pipeline (sketched below) to:
- Read and assert subfolders
- Clean column names
- Standardize values
- Plot and save correlations
- Run tests
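One possible way to wire these steps into a single script. The paths are hypothetical and the helper functions simply repeat the earlier sketches; a real pipeline would grow from whatever you ended up with in your notebooks:

```python
import subprocess
from pathlib import Path

import mne
import seaborn as sns

DATA_PATH = Path("/pycourse/data/ds005420")  # assumed dataset location


def clean_columns(df):
    """Remove the 'EEG ' prefix from channel names."""
    return df.rename(columns=lambda c: c.replace("EEG ", ""))


def standardize(df):
    """Standardize every channel column: (x - mean) / std."""
    channels = df.columns.drop("time")
    out = df.copy()
    out[channels] = (df[channels] - df[channels].mean()) / df[channels].std()
    return out


def run_pipeline(edf_file: Path, output: Path) -> None:
    # Read and assert subfolders
    subjects = [p for p in DATA_PATH.glob("sub-*") if p.is_dir()]
    assert subjects, "no subject sub-directories found"
    assert all((s / "eeg").is_dir() for s in subjects)

    # Clean column names, standardize values, plot and save correlations
    raw = mne.io.read_raw_edf(edf_file, preload=True, verbose="error")
    df = standardize(clean_columns(raw.to_data_frame()))
    ax = sns.heatmap(df.drop(columns="time").corr())
    ax.figure.savefig(output, format="svg")

    # Run the test suite as the last pipeline step
    subprocess.run(["pytest", "/pycourse/tests/test_data.py"], check=True)


if __name__ == "__main__":
    # Hypothetical input file and output name
    run_pipeline(DATA_PATH / "sub-001/eeg/sub-001_task-rest_eeg.edf",
                 Path("correlations.svg"))
```

Keeping each step as a small function makes it easy to test the pieces independently and to re-run the whole workflow from a clean checkout.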