Dataset#
Datasets in Pharmpy are represented using the pd.DataFrame class together with a separate pharmpy.model.DataInfo class that provides additional information about the dataset. This could be, for example, a description of how the columns are used in the model or the units used for the data.
Retrieving the dataset from a model#
The dataset connected to a model can be retrieved from the dataset attribute.
from pharmpy.modeling import read_model
model = read_model(path / "pheno_real.mod")
df = model.dataset
df
model <- read_model(path / "pheno_real.mod")
df <- model$dataset
df
| | ID | TIME | AMT | WGT | APGR | DV | FA1 | FA2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 25.0 | 1.4 | 7.0 | 0.0 | 1.0 | 1.0 |
| 1 | 1 | 2.0 | 0.0 | 1.4 | 7.0 | 17.3 | 0.0 | 0.0 |
| 2 | 1 | 12.5 | 3.5 | 1.4 | 7.0 | 0.0 | 1.0 | 1.0 |
| 3 | 1 | 24.5 | 3.5 | 1.4 | 7.0 | 0.0 | 1.0 | 1.0 |
| 4 | 1 | 37.0 | 3.5 | 1.4 | 7.0 | 0.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 739 | 59 | 108.3 | 3.0 | 1.1 | 6.0 | 0.0 | 1.0 | 1.0 |
| 740 | 59 | 120.5 | 3.0 | 1.1 | 6.0 | 0.0 | 1.0 | 1.0 |
| 741 | 59 | 132.3 | 3.0 | 1.1 | 6.0 | 0.0 | 1.0 | 1.0 |
| 742 | 59 | 144.8 | 3.0 | 1.1 | 6.0 | 0.0 | 1.0 | 1.0 |
| 743 | 59 | 146.8 | 0.0 | 1.1 | 6.0 | 40.2 | 0.0 | 0.0 |
744 rows × 8 columns
This is the dataset after any model-specific filtering and handling of special values have been applied.
The raw dataset can also be accessed:
raw = model.read_raw_dataset()
raw
raw <- model$read_raw_dataset()
raw
| | ID | TIME | AMT | WGT | APGR | DV | FA1 | FA2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0. | 25.0 | 1.4 | 7 | 0 | 1 | 1 |
| 1 | 1 | 2.0 | 0 | 1.4 | 7 | 17.3 | 0 | 0 |
| 2 | 1 | 12.5 | 3.5 | 1.4 | 7 | 0 | 1 | 1 |
| 3 | 1 | 24.5 | 3.5 | 1.4 | 7 | 0 | 1 | 1 |
| 4 | 1 | 37.0 | 3.5 | 1.4 | 7 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 739 | 59 | 108.3 | 3.0 | 1.1 | 6 | 0 | 1 | 1 |
| 740 | 59 | 120.5 | 3.0 | 1.1 | 6 | 0 | 1 | 1 |
| 741 | 59 | 132.3 | 3.0 | 1.1 | 6 | 0 | 1 | 1 |
| 742 | 59 | 144.8 | 3.0 | 1.1 | 6 | 0 | 1 | 1 |
| 743 | 59 | 146.8 | 0 | 1.1 | 6 | 40.2 | 0 | 0 |
744 rows × 8 columns
Note that all values in the raw dataset are strings:
raw.dtypes
raw$dtypes
ID object
TIME object
AMT object
WGT object
APGR object
DV object
FA1 object
FA2 object
dtype: object
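To illustrate what it means that the raw values are strings, here is a minimal sketch using plain pandas on a small hand-made frame. The data and the variable names are made up for illustration, and the conversion with pd.to_numeric is only an assumption about how such strings could be turned into numbers; it is not a description of how Pharmpy itself parses the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a raw dataset: every value is stored as a string
raw_like = pd.DataFrame(
    {
        "ID": ["1", "1", "1"],
        "TIME": ["0", "2.0", "12.5"],
        "DV": ["0", "17.3", "0"],
    },
    dtype=str,
)

# Convert each column to a numeric dtype; pandas picks int or float per column
converted = raw_like.apply(pd.to_numeric)

print(converted.dtypes)
```

String values cannot be used in arithmetic directly, which is one reason to work with model.dataset rather than the raw dataset when doing calculations.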
Update the dataset of a model#
Since the Pharmpy dataset is a pandas dataframe, it can be manipulated as such. A new or updated dataset can be set to a model like this:
import numpy as np
# Log-transform only the non-zero DV values; zeros are left unchanged
nonzero = df['DV'] != 0
df.loc[nonzero, 'DV'] = np.log(df.loc[nonzero, 'DV'])
model = model.replace(dataset=df)
df['DV'] <- np$log(df['DV'], where=(df['DV'] != 0).values)
model <- model$replace(dataset=df)
The pharmpy.modeling module has several functions to examine and modify the dataset; see the user guide for dataset modeling.
DataInfo#
Every model has a DataInfo object that describes the dataset.
Note
A datainfo file can be created for .csv-files here.
di = model.datainfo
di
di <- model$datainfo
di
name type scale continuous categories unit drop datatype descriptor
ID id nominal False None 1 False int32 subject identifier
TIME idv ratio True None hour False float64 None
AMT dose ratio True None milligram False float64 None
WGT covariate ratio True None kilogram False float64 body weight
APGR covariate ordinal False (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) 1 False float64 None
DV dv ratio True None milligram/liter False float64 plasma concentration
FA1 unknown nominal False (0, 1) 1 False float64 None
FA2 unknown nominal False (0, 1) 1 False float64 None
The path to the dataset file if one exists.
di.path
di$path
Separator character for the dataset file.
di.separator
di$separator
'\\s+'
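The separator shown above is the repr of the string `\s+`, which reads as a regular expression matching one or more whitespace characters. The following sketch shows what such a pattern does to a single line of text. It uses only the standard library; the data line is made up, and it is an assumption here that the separator is interpreted as a regular expression in the same way as the sep argument of pandas.read_csv.

```python
import re

separator = r"\s+"  # the same string that is displayed as '\\s+'

# Hypothetical data line with a mix of spaces and a tab between the fields
line = "1   2.0  0 1.4\t7"

# Split on any run of whitespace
fields = re.split(separator, line.strip())
print(fields)
```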
ColumnInfo#
Each column of the dataset can be given some additional information, which is available by indexing the DataInfo object with the column name.
model.datainfo['AMT']
model$datainfo['AMT']
type dose
scale ratio
continuous True
categories None
unit milligram
drop False
datatype float64
descriptor None
Name: AMT
type#
The column type is the role a data column has in the model. Some basic examples of types are id for the subject identification column, idv for the independent variable (mostly time), dv for the dependent variable and dose for the dose amount column. Columns that have not been given any particular type get the type value unknown. See pharmpy.ColumnInfo.type for a list of all supported types.
scale#
The scale of a column is the statistical scale of measurement of its data according to Stevens' typology (see https://en.wikipedia.org/wiki/Level_of_measurement). The scale can be one of nominal for non-ordered categorical data, ordinal for ordered categorical data, interval for numeric data where ratios cannot be taken, and ratio for general numeric data. Note that nominal and ordinal data are always discrete, while interval and ratio data can be either discrete or continuous.
continuous#
If this is True the data is continuous, and if it is False the data is discrete. Note that ratio data can be discrete, for example if it has been rounded to whole numbers and therefore cannot take on arbitrary real values.
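The relationship between scale and continuous described above can be written down as a small consistency check. This is only a sketch of the rule as stated in the text; the function and constant names are hypothetical and are not part of the Pharmpy API.

```python
# Scales whose data is always discrete according to the text above
DISCRETE_ONLY_SCALES = {"nominal", "ordinal"}
ALL_SCALES = DISCRETE_ONLY_SCALES | {"interval", "ratio"}


def is_consistent(scale: str, continuous: bool) -> bool:
    """Return True if the scale/continuous combination is allowed.

    Hypothetical helper: nominal and ordinal data must be discrete,
    while interval and ratio data may be discrete or continuous.
    """
    if scale not in ALL_SCALES:
        raise ValueError(f"Unknown scale: {scale}")
    return not (continuous and scale in DISCRETE_ONLY_SCALES)


print(is_consistent("ratio", True))     # continuous ratio data, e.g. WGT
print(is_consistent("ordinal", False))  # discrete ordinal data, e.g. APGR
print(is_consistent("nominal", True))   # not allowed by the rule above
```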
categories#
A list of all values that the data column could have. Not all values have to be present in the dataset; categories instead makes it possible to annotate all possible values. It is also possible to name the categories by using a dict from each name to its numerical encoding.
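The two forms of categories can be illustrated with plain Python objects. The category names "no" and "yes" and the observed values below are made up for illustration; the sketch only shows the idea of a list of allowed values versus a dict from name to numerical encoding, not how Pharmpy stores them internally.

```python
# Form 1: a plain list of all values the column could take
allowed_values = [0, 1]

# Form 2: a dict from category name to its numerical encoding (hypothetical names)
named_categories = {"no": 0, "yes": 1}

# Hypothetical column data; not every allowed value has to occur
observed = [1, 1, 1]

# Every observed value must be one of the allowed values
all_allowed = all(value in allowed_values for value in observed)

# With named categories the encoding can be reversed to get readable labels
code_to_name = {code: name for name, code in named_categories.items()}
labels = [code_to_name[value] for value in observed]

print(all_allowed, labels)
```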
unit#
The physical unit of the column data. Units can be input as a string, e.g. “kg” or “mg/L”.
drop#
A boolean that is set to True if the column is not going to be used by the model or False otherwise.
datatype#
The datatype of the column data. This describes the low level encoding of the data. See pharmpy.ColumnInfo.datatype
for a list of all supported datatypes.
descriptor#
The descriptor can provide a high level understanding of the data in a machine readable way. See pharmpy.ColumnInfo.descriptor
for a list of all supported descriptors.
datainfo file#
If a dataset file has an accompanying file with the same name and the extension .datainfo, that file will be read when the dataset is handled in Pharmpy. The file is a representation (a serialization) of a DataInfo object, and its content can be created manually, with an external tool or by Pharmpy. Here is an example of the content:
di.to_json()
di$to_json()
'{"columns": [{"name": "ID", "type": "id", "scale": "nominal", "continuous": false, "categories": null, "unit": "1", "datatype": "int32", "drop": false, "descriptor": "subject identifier"}, {"name": "TIME", "type": "idv", "scale": "ratio", "continuous": true, "categories": null, "unit": "hour", "datatype": "float64", "drop": false, "descriptor": null}, {"name": "AMT", "type": "dose", "scale": "ratio", "continuous": true, "categories": null, "unit": "milligram", "datatype": "float64", "drop": false, "descriptor": null}, {"name": "WGT", "type": "covariate", "scale": "ratio", "continuous": true, "categories": null, "unit": "kilogram", "datatype": "float64", "drop": false, "descriptor": "body weight"}, {"name": "APGR", "type": "covariate", "scale": "ordinal", "continuous": false, "categories": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "unit": "1", "datatype": "float64", "drop": false, "descriptor": null}, {"name": "DV", "type": "dv", "scale": "ratio", "continuous": true, "categories": null, "unit": "milligram/liter", "datatype": "float64", "drop": false, "descriptor": "plasma concentration"}, {"name": "FA1", "type": "unknown", "scale": "nominal", "continuous": false, "categories": [0, 1], "unit": "1", "datatype": "float64", "drop": false, "descriptor": null}, {"name": "FA2", "type": "unknown", "scale": "nominal", "continuous": false, "categories": [0, 1], "unit": "1", "datatype": "float64", "drop": false, "descriptor": null}], "path": null, "separator": "\\\\s+", "missing_data_token": "-99"}'
It is a JSON file with the following top-level structure:
Name | Type |
---|---|
columns | array of columns |
path | string |
separator | string |
And the columns structure:
Name | Type |
---|---|
name | string |
type | string |
continuous | boolean |
categories | array of numbers or string-number map |
unit | string |
drop | boolean |
datatype | string |
descriptor | string |
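Since the datainfo content is ordinary JSON, it can be inspected with any JSON parser. The sketch below uses the standard library json module on a shortened, hand-written document that follows the same layout as the example output above. It does not read an actual .datainfo file and does not use Pharmpy, and the variable names are chosen for illustration only.

```python
import json

# Shortened datainfo content with two columns (assumption: same layout as above)
content = r'''
{"columns": [
    {"name": "ID", "type": "id", "scale": "nominal", "continuous": false,
     "categories": null, "unit": "1", "datatype": "int32", "drop": false,
     "descriptor": "subject identifier"},
    {"name": "TIME", "type": "idv", "scale": "ratio", "continuous": true,
     "categories": null, "unit": "hour", "datatype": "float64", "drop": false,
     "descriptor": null}],
 "path": null,
 "separator": "\\s+"}
'''

info = json.loads(content)

# Map each column name to its type, and list the continuous columns
column_types = {col["name"]: col["type"] for col in info["columns"]}
continuous_columns = [col["name"] for col in info["columns"] if col["continuous"]]

print(column_types)
print(continuous_columns)
```

JSON null values become None in Python, so a missing path or descriptor can be tested with `is None`.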