Dataset#

Datasets in Pharmpy are represented by a pd.DataFrame together with a separate pharmpy.model.DataInfo object that holds additional information about the dataset, for example how each column is used in the model or the unit of its data.

Retrieving the dataset from a model#

The dataset connected to a model can be retrieved from the dataset attribute.

from pharmpy.modeling import read_model

model = read_model(path / "pheno_real.mod")
df = model.dataset
df
ID TIME AMT WGT APGR DV FA1 FA2
0 1 0.0 25.0 1.4 7.0 0.0 1.0 1.0
1 1 2.0 0.0 1.4 7.0 17.3 0.0 0.0
2 1 12.5 3.5 1.4 7.0 0.0 1.0 1.0
3 1 24.5 3.5 1.4 7.0 0.0 1.0 1.0
4 1 37.0 3.5 1.4 7.0 0.0 1.0 1.0
... ... ... ... ... ... ... ... ...
739 59 108.3 3.0 1.1 6.0 0.0 1.0 1.0
740 59 120.5 3.0 1.1 6.0 0.0 1.0 1.0
741 59 132.3 3.0 1.1 6.0 0.0 1.0 1.0
742 59 144.8 3.0 1.1 6.0 0.0 1.0 1.0
743 59 146.8 0.0 1.1 6.0 40.2 0.0 0.0

744 rows × 8 columns

This is the dataset after applying any model-specific filtering and handling of special values.

The raw dataset, without this processing, can also be accessed:

raw = model.read_raw_dataset()
raw
ID TIME AMT WGT APGR DV FA1 FA2
0 1 0. 25.0 1.4 7 0 1 1
1 1 2.0 0 1.4 7 17.3 0 0
2 1 12.5 3.5 1.4 7 0 1 1
3 1 24.5 3.5 1.4 7 0 1 1
4 1 37.0 3.5 1.4 7 0 1 1
... ... ... ... ... ... ... ... ...
739 59 108.3 3.0 1.1 6 0 1 1
740 59 120.5 3.0 1.1 6 0 1 1
741 59 132.3 3.0 1.1 6 0 1 1
742 59 144.8 3.0 1.1 6 0 1 1
743 59 146.8 0 1.1 6 40.2 0 0

744 rows × 8 columns

Note that all values in the raw dataset are strings:

raw.dtypes
ID      object
TIME    object
AMT     object
WGT     object
APGR    object
DV      object
FA1     object
FA2     object
dtype: object
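
Because the raw values are strings, they have to be converted before any numerical work. The following is a minimal sketch of such a conversion using plain pandas on a toy DataFrame; the toy data and the choice of pd.to_numeric are assumptions for illustration and do not show how Pharmpy itself processes the raw dataset.

```python
import pandas as pd

# Toy stand-in for a raw dataset: every value is a string.
raw_toy = pd.DataFrame({
    "ID": ["1", "1", "1"],
    "TIME": ["0.", "2.0", "12.5"],
    "DV": ["0", "17.3", "0"],
})

# Convert each column to a numeric dtype; errors="coerce" turns
# unparsable entries into NaN instead of raising an exception.
numeric = raw_toy.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
```

For most purposes model.dataset is the more convenient starting point, since it is already numeric.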

Update the dataset of a model#

Since the Pharmpy dataset is a pandas dataframe, it can be manipulated as such. A new or updated dataset can be set to a model like this:

import numpy as np

# Log-transform the nonzero observations; rows with DV == 0 are left unchanged
nonzero = df['DV'] != 0
df.loc[nonzero, 'DV'] = np.log(df.loc[nonzero, 'DV'])
model = model.replace(dataset=df)

The pharmpy.modeling module has several functions to examine and modify the dataset, see the user guide for dataset modeling.

DataInfo#

Every model has a DataInfo object that describes the dataset.

Note

A datainfo file can be created for .csv-files here.

di = model.datainfo
di
name      type   scale  continuous                         categories            unit  drop datatype           descriptor
  ID        id nominal       False                               None               1 False    int32   subject identifier
TIME       idv   ratio        True                               None            hour False  float64                 None
 AMT      dose   ratio        True                               None       milligram False  float64                 None
 WGT covariate   ratio        True                               None        kilogram False  float64          body weight
APGR covariate ordinal       False (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)               1 False  float64                 None
  DV        dv   ratio        True                               None milligram/liter False  float64 plasma concentration
 FA1   unknown nominal       False                             (0, 1)               1 False  float64                 None
 FA2   unknown nominal       False                             (0, 1)               1 False  float64                 None

The path to the dataset file if one exists.

di.path

Separator character for the dataset file.

di.separator
'\\s+'
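
The separator shown above is a regular expression matching one or more whitespace characters (the doubled backslash is only how Python displays the string). As a rough illustration of what such a separator does, the sketch below splits a made-up data line with the standard re module; the line content is hypothetical and this is not how Pharmpy reads the file internally.

```python
import re

# The separator written as a raw string: backslash, "s", "+".
separator = r"\s+"

# Hypothetical line from a whitespace-separated dataset file.
line = "1   2.0  0    1.4"

# Split on runs of whitespace after trimming the ends of the line.
fields = re.split(separator, line.strip())
print(fields)
```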

ColumnInfo#

Each column of the dataset can be given additional information, which is stored per column in the DataInfo object.

model.datainfo['AMT']
type               dose
scale             ratio
continuous         True
categories         None
unit          milligram
drop              False
datatype        float64
descriptor         None
Name: AMT

type#

The column type is the role a data column has in the model. Some basic examples of types are id for the subject identification column, idv for the independent variable (mostly time), dv for the dependent variable and dose for the dose amount column. Columns that have not been given any particular type get the type value unknown. See pharmpy.ColumnInfo.type for a list of all supported types.

scale#

The scale of a column is the statistical scale of measurement of its data using “Stevens’ typology” (see https://en.wikipedia.org/wiki/Level_of_measurement). The scale can be one of nominal for non-ordered categorical data, ordinal for ordered categorical data, interval for numeric data where ratios cannot be taken and ratio for general numeric data. Note that nominal and ordinal data is always discrete, but interval and ratio data can be either discrete or continuous.

continuous#

If this is True the data is continuous, and if it is False the data is discrete. Note that ratio data can be discrete, for example if it has been rounded to whole numbers and so cannot take on arbitrary real values.
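
The relationship between scale and continuous described above can be written down as a small check. This is only a sketch of the rule as stated in the text; the function and constant names are hypothetical and not part of Pharmpy.

```python
# Scales whose data is always discrete.
DISCRETE_ONLY_SCALES = {"nominal", "ordinal"}
ALL_SCALES = DISCRETE_ONLY_SCALES | {"interval", "ratio"}


def is_consistent(scale: str, continuous: bool) -> bool:
    """Return True if the scale/continuous combination follows the rule above."""
    if scale not in ALL_SCALES:
        raise ValueError(f"unknown scale: {scale!r}")
    # Nominal and ordinal data can never be continuous.
    if scale in DISCRETE_ONLY_SCALES:
        return not continuous
    # Interval and ratio data may be either discrete or continuous.
    return True


print(is_consistent("ratio", True))
print(is_consistent("ordinal", True))
```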

categories#

A list of all values that the data column could have. Not all values have to be present in the dataset; categories instead makes it possible to annotate every possible value. The categories can also be named by using a dict that maps each name to its numerical encoding.
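
The two forms of categories can be illustrated with plain Python data structures. In this sketch the category names "no" and "yes" and both helper functions are hypothetical, chosen only to show the idea of a name-to-code mapping.

```python
# Plain list form: every value the column may take (0 through 10).
score_categories = list(range(11))

# Named form: a dict from category name to its numerical encoding.
flag_categories = {"no": 0, "yes": 1}


def decode(values, categories):
    """Translate numerical codes back to category names."""
    names = {code: name for name, code in categories.items()}
    return [names[v] for v in values]


def unexpected(values, categories):
    """Return the sorted values that are not among the annotated categories."""
    allowed = set(categories.values()) if isinstance(categories, dict) else set(categories)
    return sorted(set(values) - allowed)


print(decode([0, 1, 1], flag_categories))
print(unexpected([6, 7, 12], score_categories))
```

Values that are annotated but never observed are fine; only values outside the annotated set are reported by the second helper.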

unit#

The physical unit of the column data. Units can be input as a string, e.g. "kg" or "mg/L".

drop#

A boolean that is set to True if the column is not going to be used by the model or False otherwise.

datatype#

The datatype of the column data. This describes the low level encoding of the data. See pharmpy.ColumnInfo.datatype for a list of all supported datatypes.

descriptor#

The descriptor can provide a high level understanding of the data in a machine readable way. See pharmpy.ColumnInfo.descriptor for a list of all supported descriptors.

datainfo file#

If a dataset file has an accompanying file with the same name and the extension .datainfo, this file will be read when the dataset is handled in Pharmpy. The file is a serialization of a DataInfo object, and its content can be created manually, with an external tool or by Pharmpy. Here is an example of the content:

di.to_json()
'{"columns": [{"name": "ID", "type": "id", "scale": "nominal", "continuous": false, "categories": null, "unit": "1", "datatype": "int32", "drop": false, "descriptor": "subject identifier"}, {"name": "TIME", "type": "idv", "scale": "ratio", "continuous": true, "categories": null, "unit": "hour", "datatype": "float64", "drop": false, "descriptor": null}, {"name": "AMT", "type": "dose", "scale": "ratio", "continuous": true, "categories": null, "unit": "milligram", "datatype": "float64", "drop": false, "descriptor": null}, {"name": "WGT", "type": "covariate", "scale": "ratio", "continuous": true, "categories": null, "unit": "kilogram", "datatype": "float64", "drop": false, "descriptor": "body weight"}, {"name": "APGR", "type": "covariate", "scale": "ordinal", "continuous": false, "categories": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "unit": "1", "datatype": "float64", "drop": false, "descriptor": null}, {"name": "DV", "type": "dv", "scale": "ratio", "continuous": true, "categories": null, "unit": "milligram/liter", "datatype": "float64", "drop": false, "descriptor": "plasma concentration"}, {"name": "FA1", "type": "unknown", "scale": "nominal", "continuous": false, "categories": [0, 1], "unit": "1", "datatype": "float64", "drop": false, "descriptor": null}, {"name": "FA2", "type": "unknown", "scale": "nominal", "continuous": false, "categories": [0, 1], "unit": "1", "datatype": "float64", "drop": false, "descriptor": null}], "path": null, "separator": "\\\\s+", "missing_data_token": "-99"}'

It is a JSON file with the following top level structure:

Name       Type
columns    array of columns
path       string
separator  string

And the columns structure:

Name        Type
type        string
scale       string
continuous  boolean
categories  array of numbers or string-number map
unit        string
drop        boolean
datatype    string
descriptor  string
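
Since the file is plain JSON, it can be inspected with the standard library alone. The sketch below parses a shortened version of the example content (two columns only) and derives a datainfo file name from a dataset file name. The dataset file name is hypothetical, and replacing the extension with .datainfo is an assumption about how "same name" is interpreted.

```python
import json
from pathlib import Path

# Shortened datainfo content with two columns. A raw string is used so
# that the JSON escape sequence for the backslash is kept as written.
content = r'''
{"columns": [
   {"name": "ID", "type": "id", "scale": "nominal", "continuous": false,
    "categories": null, "unit": "1", "datatype": "int32", "drop": false,
    "descriptor": "subject identifier"},
   {"name": "TIME", "type": "idv", "scale": "ratio", "continuous": true,
    "categories": null, "unit": "hour", "datatype": "float64", "drop": false,
    "descriptor": null}],
 "path": null,
 "separator": "\\s+"}
'''

datainfo = json.loads(content)

# List the name, type and unit of each column.
for column in datainfo["columns"]:
    print(column["name"], column["type"], column["unit"])

# Assumed naming convention: same file name with the extension replaced.
dataset_path = Path("pheno.dta")  # hypothetical dataset file name
datainfo_path = dataset_path.with_suffix(".datainfo")
print(datainfo_path)
```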