data = """\
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
"""
print(data)
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
Loading data into a dataframe is not the only way to load this data, but it is one of the most common. Here we will use pandas, a very popular library for data wrangling in Python.
Install pandas (for example with pip install pandas) and load the data into a dataframe:
| | date | id | age |
|---|---|---|---|
| 0 | 2020-01-01 | x12 | 19 |
| 1 | 2020-01-02 | x11 | 23 |
| 2 | 2020-01-02 | x3 | 22 |
| 3 | 2020-01-03 | x19 | 28 |
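A minimal sketch of loading the CSV text into a dataframe, assuming the text lives in a string like `data` above and that pandas is installed. `io.StringIO` is used here only so that `read_csv` can read from a string instead of a file on disk; the name `df` is illustrative.

```python
import io

import pandas as pd

# Same CSV text as in the example above.
data = """\
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
"""

# read_csv accepts a path or any file-like object; StringIO wraps the string.
df = pd.read_csv(io.StringIO(data))
print(df)
```

With a file on disk, the same call would take the path directly, for instance `pd.read_csv("data.csv")` (hypothetical file name).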
We can also save a dataframe as CSV:
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
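Output like the one above could be produced along these lines. This is a sketch under the assumption that the row labels are left out with `index=False`; the dataframe is rebuilt by hand so the block runs on its own, and `data.csv` is a hypothetical file name.

```python
import pandas as pd

# Rebuild the small example dataframe so this block is self-contained.
df = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
        "id": ["x12", "x11", "x3", "x19"],
        "age": [19, 23, 22, 28],
    }
)

# With no path given, to_csv returns the CSV text as a string.
# index=False leaves out the 0, 1, 2, 3 row labels.
csv_text = df.to_csv(index=False)
print(csv_text)

# Passing a path writes a file instead, e.g.:
# df.to_csv("data.csv", index=False)
```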
Similar to CSV, the Tab-Separated Values (TSV) format uses tabs instead of commas to separate the values on each line.
date id age
2020-01-01 x12 19
2020-01-02 x11 23
2020-01-02 x3 22
2020-01-03 x19 28
Luckily pandas can handle that too.
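One way this might look: the same `read_csv` function is used, with the separator set to a tab through `sep="\t"`. The variable names are illustrative, and the TSV text is written inline with explicit `\t` characters so the example does not depend on a file.

```python
import io

import pandas as pd

# The same records as before, with tab characters between the values.
tsv_data = (
    "date\tid\tage\n"
    "2020-01-01\tx12\t19\n"
    "2020-01-02\tx11\t23\n"
    "2020-01-02\tx3\t22\n"
    "2020-01-03\tx19\t28\n"
)

# sep="\t" tells read_csv to split each line on tabs instead of commas.
df_tsv = pd.read_csv(io.StringIO(tsv_data), sep="\t")
print(df_tsv)

# Writing works the same way: to_csv with sep="\t" produces TSV text.
tsv_text = df_tsv.to_csv(sep="\t", index=False)
```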
We can also read in data coming from an Excel spreadsheet.
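A hedged sketch of reading a spreadsheet: `pandas.read_excel` takes a path and a `sheet_name`, and for `.xlsx` files it relies on an extra engine such as `openpyxl` being installed. The helper name `load_sheet` and the path `data.xlsx` are hypothetical and only serve as illustration.

```python
import pandas as pd


def load_sheet(path, sheet_name=0):
    """Read one worksheet of an Excel file into a dataframe.

    sheet_name may be a zero-based position (0 is the first sheet)
    or the name of the sheet as a string.
    """
    return pd.read_excel(path, sheet_name=sheet_name)


# Example call, assuming a file named data.xlsx exists (hypothetical path):
# df = load_sheet("data.xlsx")
```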
JSON (JavaScript Object Notation) is one of the most widely used data formats and is nowadays the de facto default for transferring data over the internet. It is also very commonly used for configuration files and logging.
Converting a Python object into a JSON string is also called “serialization”.
'{"name": {"firstName": "John", "lastName": "Doe", "middleName": "Smith"}, "age": 25, "hobbies": ["reading", "writing"]}'
'{"name": "John Doe", "age": 25, "hobbies": ["reading", "writing"]}'
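A string like the first one above can be produced with `json.dumps` from the standard library. This is a minimal sketch; the variable names `person` and `json_string` are illustrative.

```python
import json

# A nested Python dictionary with a string, a number and a list.
person = {
    "name": {"firstName": "John", "lastName": "Doe", "middleName": "Smith"},
    "age": 25,
    "hobbies": ["reading", "writing"],
}

# dumps ("dump string") serializes the object to a JSON-formatted str.
json_string = json.dumps(person)
print(json_string)
```

To write to a file instead of a string, `json.dump(person, file_object)` takes an open file as its second argument.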
Converting a JSON string back into a Python object is also called “deserialization”.
{'name': {'firstName': 'John', 'lastName': 'Doe', 'middleName': 'Smith'},
'age': 25,
'hobbies': ['reading', 'writing']}
Notice we load the data into a Python dictionary:
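A minimal sketch of the deserialization step, using `json.loads` on a JSON string like the one shown earlier and checking the type of the result. The variable names are illustrative.

```python
import json

json_string = (
    '{"name": {"firstName": "John", "lastName": "Doe", "middleName": "Smith"}, '
    '"age": 25, "hobbies": ["reading", "writing"]}'
)

# loads ("load string") parses JSON text into Python objects:
# JSON objects become dicts, arrays become lists, numbers become int/float.
person = json.loads(json_string)

print(type(person))
print(person["name"]["lastName"])
```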
We can also store a list as JSON array:
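As a short sketch, a Python list passed to `json.dumps` becomes a JSON array, and `json.loads` turns it back into a list. The names `hobbies`, `json_array` and `restored` are illustrative.

```python
import json

hobbies = ["reading", "writing"]

# A Python list serializes to a JSON array.
json_array = json.dumps(hobbies)
print(json_array)

# Parsing the array gives back an equal Python list.
restored = json.loads(json_array)
print(restored == hobbies)
```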
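Parquet is a column-oriented binary format. Because the values of each column are stored together, it compresses well and lets a reader load only the columns it needs, which generally makes it much more efficient than CSV and other row-oriented text formats.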
With pandas we can save data to a parquet file:
And read in:
| | date | id | age |
|---|---|---|---|
| 0 | 2020-01-01 | x12 | 19 |
| 1 | 2020-01-02 | x11 | 23 |
| 2 | 2020-01-02 | x3 | 22 |
| 3 | 2020-01-03 | x19 | 28 |
Prefer the Parquet format when possible. It is faster to read, and it stores metadata (such as column types and per-chunk statistics) that libraries can use for optimization, for example to skip data that a filter would exclude.
Download the files dataset_description.json and participants.tsv from this public repo.
Load dataset_description.json and print the field Description.
Load participants.tsv and print the mean of all the values of the column age.