11 Common Data Formats

11.1 CSV


data = """\
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
"""
print(data)

date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28


from pathlib import Path
filepath = "/tmp/data.csv"
Path(filepath).write_text(data)

Loading the data into a dataframe is not the only way to work with it, but it is one of the most common. Here we will use pandas, a very popular Python library for data wrangling.

Install pandas:

uv add pandas

import pandas as pd

df = pd.read_csv(filepath)
df

	date	id	age
0	2020-01-01	x12	19
1	2020-01-02	x11	23
2	2020-01-02	x3	22
3	2020-01-03	x19	28
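Since a dataframe is not the only option, here is a minimal sketch of reading the same data with the standard library `csv` module instead. The sample data is repeated inline (an assumption made only to keep the sketch self-contained), and the `rows` and `ages` names are illustrative.

```python
import csv
import io

# Same sample data as above, kept inline so the sketch is self-contained.
data = """\
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
"""

# DictReader uses the header row as keys; every value is read as a string.
rows = list(csv.DictReader(io.StringIO(data)))

# Convert types by hand, since the csv module does no type inference.
ages = [int(row["age"]) for row in rows]

print(rows[0])
print(sum(ages) / len(ages))
```

Unlike pandas, the `csv` module leaves all type conversion to you, which is fine for small files but quickly becomes tedious for real data wrangling.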

We can also save a dataframe as CSV:

df.to_csv("/tmp/data.csv", index=False)

!cat /tmp/data.csv

date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28


df.to_csv("/tmp/data.tsv", sep="\t", index=False)

11.2 TSV

Similar to CSV, the Tab-Separated Values (TSV) format uses tabs instead of commas to separate the values on each line.

!cat /tmp/data.tsv

date    id  age
2020-01-01  x12 19
2020-01-02  x11 23
2020-01-02  x3  22
2020-01-03  x19 28

Luckily pandas can handle that too.

df = pd.read_csv("/tmp/data.tsv", sep="\t")
df

	date	id	age
0	2020-01-01	x12	19
1	2020-01-02	x11	23
2	2020-01-02	x3	22
3	2020-01-03	x19	28
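The only thing that distinguishes TSV from CSV is the delimiter, so the standard library can handle it as well. The sketch below writes a few rows to an in-memory buffer and reads them back; the buffer and the shortened sample rows are assumptions made to keep it self-contained.

```python
import csv
import io

rows = [
    ["date", "id", "age"],
    ["2020-01-01", "x12", "19"],
    ["2020-01-02", "x11", "23"],
]

# Write tab-separated lines into an in-memory buffer instead of a file.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()

# Read the lines back using the same delimiter.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(parsed == rows)
```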

11.3 Excel

We can also write data to an Excel spreadsheet and read it back in. Note that pandas needs an additional package to handle .xlsx files (typically openpyxl).

df.to_excel("/tmp/data.xlsx", index=False)

pd.read_excel("/tmp/data.xlsx")

	date	id	age
0	2020-01-01	x12	19
1	2020-01-02	x11	23
2	2020-01-02	x3	22
3	2020-01-03	x19	28

11.4 JSON

JSON (JavaScript Object Notation) is one of the most widely used data formats and is nowadays the default format for transferring data over the internet. It is also very commonly used for configuration files and logging.

flat = {
  "name": "John Doe",
  "age": 25
}

nested = {
  "name": {
    "firstName": "John",
    "lastName": "Doe",
    "middleName": "Smith"
  },
  "age": 25,
  "hobbies": ["reading", "writing"]
}

11.4.1 Encoding

Also called “serialization”.

import json

flat_str = json.dumps(flat)
flat_str

'{"name": "John Doe", "age": 25}'

nested_str = json.dumps(nested)
nested_str

'{"name": {"firstName": "John", "lastName": "Doe", "middleName": "Smith"}, "age": 25, "hobbies": ["reading", "writing"]}'

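# Same fields as `flat`, plus a "hobbies" array
arr = {"name": "John Doe", "age": 25, "hobbies": ["reading", "writing"]}
arr_str = json.dumps(arr)
arr_str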

'{"name": "John Doe", "age": 25, "hobbies": ["reading", "writing"]}'

Path("/tmp/flat.json").write_text(flat_str)
Path("/tmp/nested.json").write_text(nested_str)
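For files that people will read, such as configuration files, it can help to pretty-print the output. The sketch below uses the `indent` and `sort_keys` arguments of `json.dumps`, and `json.dump` to write directly to a file handle. The temporary directory and the shortened `nested` dictionary are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

nested = {
    "name": {"firstName": "John", "lastName": "Doe"},
    "age": 25,
    "hobbies": ["reading", "writing"],
}

# indent=2 pretty-prints; sort_keys=True gives a stable key order for diffs.
pretty = json.dumps(nested, indent=2, sort_keys=True)
print(pretty)

# json.dump writes straight to an open file handle.
out_path = Path(tempfile.mkdtemp()) / "nested.json"  # hypothetical location
with out_path.open("w") as f:
    json.dump(nested, f, indent=2)
```

Using a `with` block also ensures the file is closed once the data has been written.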

11.4.2 Decoding

Also called “deserialization”.

json.load(
    Path("/tmp/flat.json").open()
)

{'name': 'John Doe', 'age': 25}

json.loads(
    Path("/tmp/flat.json").read_text()
)

{'name': 'John Doe', 'age': 25}

json.load(
    Path("/tmp/nested.json").open()
)

{'name': {'firstName': 'John', 'lastName': 'Doe', 'middleName': 'Smith'},
 'age': 25,
 'hobbies': ['reading', 'writing']}

json.loads(
    Path("/tmp/nested.json").read_text()
)

{'name': {'firstName': 'John', 'lastName': 'Doe', 'middleName': 'Smith'},
 'age': 25,
 'hobbies': ['reading', 'writing']}

Notice that the data is loaded into a Python dictionary:

out = json.loads(
    Path("/tmp/nested.json").read_text()
)
type(out)

dict

We can also store a list as a JSON array:

Path("/tmp/arr.json").write_text(
    json.dumps([
        {"a": 1},
        {"b": 2},
    ])
)

out = json.load(Path("/tmp/arr.json").open())
print(type(out))
out

<class 'list'>

[{'a': 1}, {'b': 2}]
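Encoding followed by decoding does not always give back exactly the same Python object, because JSON has fewer types than Python. A small sketch of what to expect (the `original` dictionary is made up for illustration):

```python
import json

# A dictionary using Python features that JSON cannot represent directly.
original = {"point": (1, 2), 3: "three", "missing": None, "ok": True}

# Round trip: encode to a JSON string, then decode again.
decoded = json.loads(json.dumps(original))
print(decoded)
# {'point': [1, 2], '3': 'three', 'missing': None, 'ok': True}
```

The tuple comes back as a list and the integer key comes back as a string, while `None` and `True` survive the round trip via JSON's `null` and `true`.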

11.5 Parquet

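Parquet is a column-oriented binary format. Because the values of each column are stored together, along with their types, Parquet files are typically smaller and faster to read than CSV, especially when only some of the columns are needed.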

df

	date	id	age
0	2020-01-01	x12	19
1	2020-01-02	x11	23
2	2020-01-02	x3	22
3	2020-01-03	x19	28

With pandas we can save data to a parquet file:

df.to_parquet("/tmp/data.parquet")

And read it back in:

pd.read_parquet("/tmp/data.parquet")

	date	id	age
0	2020-01-01	x12	19
1	2020-01-02	x11	23
2	2020-01-02	x3	22
3	2020-01-03	x19	28

Tip

Prefer the Parquet format when possible. It is typically faster to read, and it stores metadata (such as column types and statistics) that libraries can use for optimization, for example to skip data when applying filters.

11.6 Exercises

Download the files dataset_description.json and participants.tsv from this public repo.

Read in the file dataset_description.json and print the field Description.
Read in the file participants.tsv and print the mean of all the values of the column age.