Inspecting Data Objects

Author

NumCosmo developers

Introduction

In this example, we show how to inspect observational data objects available in NumCosmo. This is useful before building a likelihood or adapting an existing fitting example to a different dataset.

A pure-Python companion script, inspect_data_objects.py, is provided with this example.

We first inspect the NumCosmo source tree to identify the available data-object files. Then we inspect one selected data source file and finally create a concrete DataHubble object from Python to examine its runtime properties.

The source-tree inspection gives a broad map of available data types. The runtime inspection shows what a concrete object exposes through methods such as list_properties() and get_property().

Inspecting the Data Source Files

NumCosmo data objects are implemented in the source tree under:

numcosmo/data/

Files named nc_data_*.h and nc_data_*.c define the main data-object types used by NumCosmo. For example, files such as nc_data_hubble.h, nc_data_bao.h, nc_data_snia.h, and nc_data_cmb_dist_priors.h correspond to different families of observational data.

In this first step, we inspect the source tree and create a compact table of available data-type files.

Code

from pathlib import Path
import os
import re

import pandas as pd
from IPython.display import HTML, display

from numcosmo_py import Nc, Ncm

# Initialize the library.
Ncm.cfg_init()


def show_pandas(df: pd.DataFrame) -> HTML:
    """
    Return a Pandas DataFrame rendered as an HTML table (at most 20 rows).
    """

    return HTML(df.to_html(index=False, max_rows=20))


def show_numeric_pandas(df: pd.DataFrame) -> HTML:
    """
    Return a numeric Pandas DataFrame rendered as an HTML table, with
    floating-point values formatted to four decimal places.
    """

    return HTML(df.to_html(index=False, max_rows=20, float_format="%.4f"))


def find_numcosmo_data_dir():
    """
    Try to find the NumCosmo source data directory.

    The preferred method is to set the environment variable:

        NUMCOSMO_SOURCE_DIR=/path/to/NumCosmo

    If this variable is not set, we try to find the repository root
    by walking upward from the current working directory.
    """

    candidates = []

    env_source_dir = os.environ.get("NUMCOSMO_SOURCE_DIR")

    if env_source_dir is not None:
        candidates.append(Path(env_source_dir).expanduser().resolve())

    cwd = Path.cwd().resolve()

    for parent in [cwd] + list(cwd.parents):
        candidates.append(parent)

    for candidate in candidates:
        data_dir = candidate / "numcosmo" / "data"

        if data_dir.is_dir() and any(data_dir.glob("nc_data*.h")):
            return data_dir

    return None


def read_text(path):
    """
    Read a source file as UTF-8, replacing undecodable bytes instead of failing.
    """

    return path.read_text(encoding="utf-8", errors="replace")


def family_from_header_name(header_name):
    """
    Infer a broad data family from the header filename.

    Examples
    --------
    nc_data_bao.h              -> bao
    nc_data_bao_a.h            -> bao
    nc_data_hubble.h           -> hubble
    nc_data_hubble_bao.h       -> hubble
    nc_data_cluster_ncount.h   -> cluster
    nc_data_cmb_dist_priors.h  -> cmb
    nc_data_snia_cov.h         -> snia
    """

    stem = header_name.removesuffix(".h")

    if not stem.startswith("nc_data_"):
        return "other"

    rest = stem.removeprefix("nc_data_")
    return rest.split("_")[0]


def extract_concrete_data_type(header_text):
    """
    Extract the public C/GObject data type from declarations such as:

        G_DECLARE_FINAL_TYPE (NcDataHubble, ...)
        G_DECLARE_DERIVABLE_TYPE (NcDataBaoA, ...)

    Some base or factory-style files may not declare a concrete type.
    In those cases, this function returns None.
    """

    pattern = re.compile(
        r"G_DECLARE_(?:FINAL|DERIVABLE)_TYPE\s*\(\s*(NcData[A-Za-z0-9_]*)\s*,",
        re.MULTILINE,
    )

    match = pattern.search(header_text)

    if match:
        return match.group(1)

    return None


def analyze_data_header(header_path):
    """
    Analyze one nc_data_*.h file.
    """

    source_path = header_path.with_suffix(".c")
    header_text = read_text(header_path)

    return {
        "family": family_from_header_name(header_path.name),
        "header": header_path.name,
        "source": source_path.name if source_path.exists() else None,
        "concrete_data_type": extract_concrete_data_type(header_text),
    }


def data_type_overview_dataframe(data_dir):
    """
    Return a compact overview of the NumCosmo data-type source files as a DataFrame.
    """

    headers = sorted(data_dir.glob("nc_data*.h"))
    items = [analyze_data_header(header) for header in headers]

    items.sort(key=lambda item: (item["family"], item["header"]))

    rows = []

    for item in items:
        rows.append(
            {
                "Family": item["family"],
                "Header": item["header"],
                "Source": item["source"] if item["source"] is not None else "-",
                "Concrete data type": (
                    item["concrete_data_type"]
                    if item["concrete_data_type"] is not None
                    else "-"
                ),
            }
        )

    return pd.DataFrame(rows)


data_dir = find_numcosmo_data_dir()

The helper code above locates the NumCosmo source directory and defines the functions that build a compact summary of the available data-object files. It is folded so that the rendered page stays focused on the output.

if data_dir is None:
    print("Could not find the NumCosmo source data directory.")
    print()
    print("If you are running this example from outside the NumCosmo repository,")
    print("set NUMCOSMO_SOURCE_DIR to the NumCosmo source path.")
    print()
    print("For example:")
    print("  export NUMCOSMO_SOURCE_DIR=$HOME/dev/NumCosmo")
else:
    print("Using NumCosmo data directory:")
    print(f"  {data_dir}")

Could not find the NumCosmo source data directory.

If you are running this example from outside the NumCosmo repository,
set NUMCOSMO_SOURCE_DIR to the NumCosmo source path.

For example:
  export NUMCOSMO_SOURCE_DIR=$HOME/dev/NumCosmo
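
To illustrate what a successful search looks like, the following sketch repeats the same upward walk against a throwaway directory that mimics the numcosmo/data layout. The temporary tree, the empty header it contains, and the function name find_data_dir_from are stand-ins invented for this illustration; nothing here touches a real NumCosmo checkout.

```python
import tempfile
from pathlib import Path


def find_data_dir_from(start):
    """Walk upward from `start` looking for numcosmo/data with nc_data*.h files."""
    start = Path(start).resolve()

    for candidate in [start, *start.parents]:
        data_dir = candidate / "numcosmo" / "data"

        if data_dir.is_dir() and any(data_dir.glob("nc_data*.h")):
            return data_dir

    return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical repository root containing a single, empty header.
    repo = Path(tmp) / "NumCosmo"
    (repo / "numcosmo" / "data").mkdir(parents=True)
    (repo / "numcosmo" / "data" / "nc_data_hubble.h").write_text("")

    # Start the search from a nested directory, as if running an example.
    nested = repo / "docs" / "examples"
    nested.mkdir(parents=True)

    found = find_data_dir_from(nested)
    found_relative = found.relative_to(repo.resolve()).as_posix()

print(found_relative)
```

The search succeeds from any directory below the repository root, which is why running the example from inside a checkout needs no environment variable.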

Code

if data_dir is not None:
    data_type_overview = data_type_overview_dataframe(data_dir)
    display(show_pandas(data_type_overview))

Table 1: NumCosmo data-type source files found in the source tree.

The table gives a first map of the available data-object families and their corresponding source files. For example, the hubble family contains the Hubble expansion-rate data object, while the bao family contains several BAO-related data formats.
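
As a rough illustration of how the Family column groups files, the sketch below applies the same first-token rule to a handful of header names mentioned on this page. The function name family_of is made up for this example, and the list is not a complete inventory of the source tree.

```python
from collections import defaultdict


def family_of(header_name):
    """Return the token that follows 'nc_data_' in a header filename."""
    stem = header_name.removesuffix(".h")

    if not stem.startswith("nc_data_"):
        return "other"

    return stem.removeprefix("nc_data_").split("_")[0]


# Header names taken from the examples mentioned on this page.
headers = [
    "nc_data_bao.h",
    "nc_data_bao_a.h",
    "nc_data_bao_dv.h",
    "nc_data_hubble.h",
    "nc_data_hubble_bao.h",
    "nc_data_snia_cov.h",
]

families = defaultdict(list)

for header in headers:
    families[family_of(header)].append(header)

for family, members in sorted(families.items()):
    print(f"{family}: {', '.join(members)}")
```

The rule is a naming heuristic only: nc_data_hubble_bao.h lands in the hubble family because of its first token, regardless of what the file actually contains.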

A dash in the Concrete data type column does not mean that the file is incomplete or unused. It only means that this simple inspection code did not detect a concrete GObject data class declared directly in that header.

Some files, such as nc_data_bao.h, can still be important because they may define shared dataset identifiers, factory functions, or common infrastructure for a whole data family. More specific files, such as nc_data_bao_a.h or nc_data_bao_dv.h, may then define concrete data classes for particular formats.
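
The difference between the two kinds of header can be sketched with two invented fragments. NcDataExample and NcDataExampleId are hypothetical names written only for this illustration, and the regular expression is the one used by the inspection code on this page.

```python
import re

DECLARE_RE = re.compile(
    r"G_DECLARE_(?:FINAL|DERIVABLE)_TYPE\s*\(\s*(NcData[A-Za-z0-9_]*)\s*,"
)

# Hypothetical header fragment that declares a GObject class.
header_with_class = """
G_DECLARE_FINAL_TYPE (NcDataExample, nc_data_example, NC, DATA_EXAMPLE, NcmDataGaussDiag)
"""

# Hypothetical header fragment that only declares a dataset identifier enum.
header_with_enum_only = """
typedef enum _NcDataExampleId
{
  NC_DATA_EXAMPLE_FIRST = 0,
  NC_DATA_EXAMPLE_SECOND,
} NcDataExampleId;
"""


def concrete_type(header_text):
    """Return the declared type name, or None when no declaration matches."""
    match = DECLARE_RE.search(header_text)
    return match.group(1) if match else None


def table_cell(header_text):
    """Mirror the overview table: a dash when nothing was detected."""
    return concrete_type(header_text) or "-"


print(table_cell(header_with_class))
print(table_cell(header_with_enum_only))
```

A header that declares its type through hand-written macros rather than G_DECLARE_* would also produce a dash with this pattern, which is another reason to read the dash as "not detected" rather than "absent".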

Inspecting a Selected Data Source File

After listing the available data-type files, we can choose one of them for a more focused inspection. Here we choose:

nc_data_hubble.h

This file corresponds to the Hubble expansion-rate data object.

The next code block extracts the information most relevant to a Python user: the data family, the corresponding source file, the concrete data type (when detected), any dataset identifier enum types together with their members, and the GObject property names found in the source file.

It intentionally avoids lower-level C details such as public C functions or internal GObject declarations.

Code

def remove_c_comments(text):
    """
    Remove C and C++-style comments before parsing enums.
    """

    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    text = re.sub(r"//.*?$", "", text, flags=re.MULTILINE)

    return text


def extract_dataset_enums(header_text):
    """
    Extract dataset identifier enums such as:

        NcDataHubbleId
        NcDataBaoId
        NcDataSNIAId

    Returns a list of dictionaries with enum names and members.
    """

    text = remove_c_comments(header_text)

    pattern = re.compile(
        r"typedef\s+enum(?:\s+[A-Za-z0-9_]+)?\s*\{"
        r"(?P<body>.*?)"
        r"\}\s*(?P<name>NcData[A-Za-z0-9_]*Id)\s*;",
        re.DOTALL,
    )

    enums = []

    for match in pattern.finditer(text):
        body = match.group("body")
        enum_name = match.group("name")

        members = []

        for raw_item in body.split(","):
            item = raw_item.strip()

            if not item:
                continue

            item = item.split("=")[0].strip()

            if item.startswith("NC_DATA_"):
                members.append(item)

        enums.append(
            {
                "name": enum_name,
                "members": members,
            }
        )

    return enums


def common_prefix_ending_in_underscore(names):
    """
    Find a readable common prefix among enum members.

    Example
    -------
    NC_DATA_HUBBLE_GOMEZ_VALENT_COMP2018
    NC_DATA_HUBBLE_RIESS2018

    gives:

    NC_DATA_HUBBLE_
    """

    if not names:
        return ""

    prefix = names[0]

    for name in names[1:]:
        while not name.startswith(prefix):
            prefix = prefix[:-1]

            if not prefix:
                return ""

    if "_" in prefix:
        return prefix[: prefix.rfind("_") + 1]

    return ""


def extract_gobject_properties(source_text):
    """
    Extract property names from g_param_spec_* calls.

    These names often correspond to properties accessible in Python through:

        data.get_property("property-name")
    """

    pattern = re.compile(
        r"g_param_spec_[A-Za-z0-9_]+\s*\(\s*\"([^\"]+)\"",
        re.MULTILINE,
    )

    return sorted(set(pattern.findall(source_text)))


def analyze_selected_data_header(data_dir, header_name):
    """
    Analyze one selected nc_data_*.h file and its corresponding .c file.
    """

    header_path = data_dir / header_name

    if not header_path.exists():
        raise FileNotFoundError(f"Header not found: {header_path}")

    source_path = header_path.with_suffix(".c")

    header_text = read_text(header_path)
    source_text = read_text(source_path) if source_path.exists() else ""

    return {
        "family": family_from_header_name(header_path.name),
        "header": header_path.name,
        "source": source_path.name if source_path.exists() else None,
        "concrete_data_type": extract_concrete_data_type(header_text),
        "dataset_enums": extract_dataset_enums(header_text),
        "properties": extract_gobject_properties(source_text),
    }


def selected_header_summary_dataframe(item):
    """
    Return a one-row summary table for a selected data header.
    """

    return pd.DataFrame(
        [
            {
                "Family": item["family"],
                "Header": item["header"],
                "Source": item["source"] if item["source"] is not None else "-",
                "Concrete data type": (
                    item["concrete_data_type"]
                    if item["concrete_data_type"] is not None
                    else "-"
                ),
            }
        ]
    )


def dataset_enum_dataframe(item):
    """
    Return a table with dataset identifier enum members.
    """

    rows = []

    for enum in item["dataset_enums"]:
        prefix = common_prefix_ending_in_underscore(enum["members"])

        for member in enum["members"]:
            short_name = member.removeprefix(prefix) if prefix else member

            rows.append(
                {
                    "Enum type": enum["name"],
                    "Python member": short_name,
                    "C enum member": member,
                }
            )

    return pd.DataFrame(rows)


def gobject_properties_dataframe(item):
    """
    Return a table with GObject property names found in the source file.
    """

    return pd.DataFrame(
        [
            {
                "Property": prop,
            }
            for prop in item["properties"]
        ]
    )

Now we inspect the selected Hubble data source file.

selected_header = "nc_data_hubble.h"

if data_dir is None:
    selected_item = None
    print("Skipping source-file inspection because the NumCosmo source directory was not found.")
else:
    selected_item = analyze_selected_data_header(data_dir, selected_header)

Skipping source-file inspection because the NumCosmo source directory was not found.

Code

if selected_item is not None:
    display(show_pandas(selected_header_summary_dataframe(selected_item)))

Table 2: Summary of the selected NumCosmo data source file.

Code

if selected_item is not None:
    enum_df = dataset_enum_dataframe(selected_item)

    if len(enum_df) > 0:
        display(show_pandas(enum_df))
    else:
        print("No dataset identifier enum members were detected.")

Table 3: Dataset identifier enum members found in the selected data source header.

Code

if selected_item is not None:
    prop_df = gobject_properties_dataframe(selected_item)

    if len(prop_df) > 0:
        display(show_pandas(prop_df))
    else:
        print("No GObject property names were detected.")

Table 4: GObject property names found in the selected data source implementation.
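
To show what the property scan matches without needing a NumCosmo checkout, the sketch below runs the same pattern over an invented class_init fragment. The class name NcDataExample and the property scale-factor are hypothetical and exist only for this illustration.

```python
import re

PARAM_SPEC_RE = re.compile(r'g_param_spec_[A-Za-z0-9_]+\s*\(\s*"([^"]+)"')

# Hypothetical C fragment written for this example only.
source_text = '''
static void
nc_data_example_class_init (NcDataExampleClass *klass)
{
  GObjectClass *object_class = G_OBJECT_CLASS (klass);

  g_object_class_install_property (object_class,
                                   PROP_Z,
                                   g_param_spec_object ("z",
                                                        NULL,
                                                        "Redshift vector",
                                                        NCM_TYPE_VECTOR,
                                                        G_PARAM_READWRITE));
  g_object_class_install_property (object_class,
                                   PROP_SCALE,
                                   g_param_spec_double ("scale-factor",
                                                        NULL,
                                                        "Hypothetical scale",
                                                        0.0, 10.0, 1.0,
                                                        G_PARAM_READWRITE));
}
'''

# The first string literal of each g_param_spec_* call is the property name.
property_names = sorted(set(PARAM_SPEC_RE.findall(source_text)))

print(property_names)
```

The scan only sees properties installed in the file it reads. Properties inherited from parent classes are defined in other source files, so a single-file scan will generally list fewer names than list_properties() reports at runtime.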

When the selected file does not show a concrete data type, this does not necessarily mean that the file is unused. It may define common infrastructure for a data family rather than a directly instantiable data object.

The most important item for a Python user is the dataset identifier enum. In this case, the enum tells us which built-in Hubble datasets can be passed to the Nc.DataHubble.new_from_id() constructor. For example, if GOMEZ_VALENT_COMP2018 appears among the identifier members, it corresponds to the Python enum member:

Nc.DataHubbleId.GOMEZ_VALENT_COMP2018
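
The naming correspondence can be sketched as a small string transformation. This assumes the convention seen on this page, where the Nc namespace prefix becomes the module name and the prefix shared by all members is dropped; the function name python_enum_path is hypothetical.

```python
import os


def python_enum_path(enum_type, c_member, all_members):
    """Build the Python spelling of a C enum member (naming-convention sketch)."""
    # Longest prefix shared by every member, cut back to a whole token.
    prefix = os.path.commonprefix(all_members)
    prefix = prefix[: prefix.rfind("_") + 1]

    return f"Nc.{enum_type.removeprefix('Nc')}.{c_member.removeprefix(prefix)}"


# C member names quoted in the helper docstrings on this page.
members = [
    "NC_DATA_HUBBLE_GOMEZ_VALENT_COMP2018",
    "NC_DATA_HUBBLE_RIESS2018",
]

path = python_enum_path("NcDataHubbleId", members[0], members)

print(path)
```

The result is only a string that shows the expected spelling. Whether a given member actually exists should be checked against the enum exposed at runtime.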

Runtime Inspection

The source-tree inspection is useful to understand how data objects are organized in the NumCosmo source code. However, once a concrete data object has been created, the most direct way to inspect it from Python is to use its runtime properties.

We first list the available members of the DataHubbleId enum as exposed in Python.

Code

def list_python_enum_members(enum_cls):
    """
    List enum-like members exposed through Python/GObject introspection.
    """

    members = []

    for name in sorted(dir(enum_cls)):
        if name.isupper() and not name.startswith("_"):
            members.append(name)

    return members


hubble_id_members = pd.DataFrame(
    [
        {
            "Python member": name,
        }
        for name in list_python_enum_members(Nc.DataHubbleId)
    ]
)

display(show_pandas(hubble_id_members))

Python member
BORGHI2022
BUSCA2013_BAO_WMAP
CABRE
GOMEZ_VALENT_COMP2018
JIAO2023
JIMENEZ2023
MORESCO2012_BC03
MORESCO2012_MASTRO
MORESCO2015
MORESCO2016_DR9_BC03
MORESCO2016_DR9_MASTRO
RATSIMBAZAFY2017
RIESS2008_HST
RIESS2016_HST_WFC3
RIESS2018
SIMON2005
STERN2009
TOMASETTI2023
ZHANG2012

Table 5: DataHubbleId members exposed through Python.

We now instantiate one concrete Hubble expansion-rate dataset.

data = Nc.DataHubble.new_from_id(Nc.DataHubbleId.GOMEZ_VALENT_COMP2018)

The resulting object, data, holds the observations and metadata of the selected Hubble-rate compilation.

A first useful check is to inspect the Python type of the object.

print(type(data))

<class 'gi.repository.NumCosmo.DataHubble'>

The method list_properties() lists the GObject properties exposed by the object. These are the properties that can usually be accessed from Python with get_property().

Code

runtime_properties = pd.DataFrame(
    [
        {
            "Property": prop.name,
            "Value type": prop.value_type.name,
        }
        for prop in data.list_properties()
    ]
)

display(show_pandas(runtime_properties))

Property	Value type
name	gchararray
desc	gchararray
long-desc	gchararray
init	gboolean
bootstrap	NcmBootstrap
n-points	guint
w-mean	gboolean
mean	NcmVector
sigma	NcmVector
z	NcmVector

Table 6: Runtime properties exposed by the selected Hubble data object.
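
A quick way to read Table 6 is to separate scalar metadata from array-valued properties by value type. The sketch below does this for the pairs listed in the table; the set SCALAR_TYPES is an assumption that covers only the types appearing here, not every GObject type.

```python
# (property, value type) pairs copied from Table 6.
runtime_properties = [
    ("name", "gchararray"),
    ("desc", "gchararray"),
    ("long-desc", "gchararray"),
    ("init", "gboolean"),
    ("bootstrap", "NcmBootstrap"),
    ("n-points", "guint"),
    ("w-mean", "gboolean"),
    ("mean", "NcmVector"),
    ("sigma", "NcmVector"),
    ("z", "NcmVector"),
]

# Assumed scalar types; extend this set for other data classes as needed.
SCALAR_TYPES = {"gchararray", "gboolean", "guint"}

scalars = [name for name, vtype in runtime_properties if vtype in SCALAR_TYPES]
vectors = [name for name, vtype in runtime_properties if vtype == "NcmVector"]
others = [
    name
    for name, vtype in runtime_properties
    if vtype not in SCALAR_TYPES and vtype != "NcmVector"
]

print("Scalar metadata:", scalars)
print("Vector data:    ", vectors)
print("Other objects:  ", others)
```

The vector-valued properties are the ones that carry the measurements themselves.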

Many NumCosmo data objects provide a short textual description through the desc property.

print(data.get_property("desc"))

Gomez-Valent 2018 -- arXiv:1802.01505

The number of data points is available through the n-points property.

n_points = data.get_property("n-points")

print(f"Number of points: {n_points}")

Number of points: 31

Accessing Numerical Arrays

For this Hubble-rate dataset, the redshifts, measured H(z) values, and uncertainties are exposed through the properties z, mean, and sigma, respectively.

z = data.get_property("z")
mean = data.get_property("mean")
sigma = data.get_property("sigma")

Each of these properties is an NcmVector object (Ncm.Vector in Python). Individual entries are read with .get(i), where i is a zero-based index.

print("First redshift:", z.get(0))
print("First H(z) value:", mean.get(0))
print("First uncertainty:", sigma.get(0))

First redshift: 0.07
First H(z) value: 69.0
First uncertainty: 19.6
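
When several entries are needed at once, the element-wise accessor can be wrapped in a small helper. The sketch below relies only on the .get(i) call shown above. ListBackedVector is a stand-in class invented so that the example runs without NumCosmo installed, and vector_to_list is a hypothetical helper name.

```python
class ListBackedVector:
    """Stand-in for a NumCosmo vector; it only mimics the .get(i) accessor."""

    def __init__(self, values):
        self._values = list(values)

    def get(self, i):
        return self._values[i]


def vector_to_list(vec, n):
    """Copy the first n entries of a vector-like object into a Python list."""
    return [vec.get(i) for i in range(n)]


# First three redshifts of this dataset, as displayed in Table 7.
z_stub = ListBackedVector([0.07, 0.09, 0.12])
z_values = vector_to_list(z_stub, 3)

print(z_values)
```

With a real data object, the same helper would presumably be called as vector_to_list(data.get_property("z"), data.get_property("n-points")).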

We can display a small table with the first few data points.

Code

hubble_preview = pd.DataFrame(
    [
        {
            "i": i,
            "z": z.get(i),
            "H(z)": mean.get(i),
            "sigma": sigma.get(i),
        }
        for i in range(min(5, n_points))
    ]
)

display(show_numeric_pandas(hubble_preview))

i	z	H(z)	sigma
0	0.0700	69.0000	19.6000
1	0.0900	69.0000	12.0000
2	0.1200	68.6000	26.2000
3	0.1700	83.0000	8.0000
4	0.1791	75.0000	4.0000

Table 7: First rows of the selected Hubble-rate dataset.

This gives a quick preview of the dataset before using it in a fit.
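
A preview can also be checked programmatically. The following sketch runs a few basic sanity checks on the five rows of Table 7, copied here as plain Python lists. The function name check_preview and the choice of checks are illustrative assumptions, not part of NumCosmo.

```python
import math

# Rows copied from Table 7 (values as displayed there).
z = [0.07, 0.09, 0.12, 0.17, 0.1791]
hz = [69.0, 69.0, 68.6, 83.0, 75.0]
sigma = [19.6, 12.0, 26.2, 8.0, 4.0]


def check_preview(z, hz, sigma):
    """Return simple consistency indicators for a previewed dataset."""
    return {
        "same_length": len(z) == len(hz) == len(sigma),
        "all_finite": all(math.isfinite(v) for v in (*z, *hz, *sigma)),
        "sigma_positive": all(s > 0.0 for s in sigma),
        "z_sorted": all(a <= b for a, b in zip(z, z[1:])),
        "max_relative_error": max(s / h for s, h in zip(sigma, hz)),
    }


report = check_preview(z, hz, sigma)

for key, value in report.items():
    print(f"{key}: {value}")
```

Checks of this kind catch mismatched array lengths or non-positive uncertainties before they surface as confusing errors inside a likelihood.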

Helper Function

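When exploring data objects, it is convenient to bundle the runtime checks shown above (object type, description, number of points, and property list) into a small helper function.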

Code

def inspect_data_object(data):
    """
    Display basic runtime information about a NumCosmo data object.
    """

    # Query the property specifications once and reuse them below.
    props = data.list_properties()
    property_names = {prop.name for prop in props}

    print("=" * 80)
    print("Runtime data object inspection")
    print("=" * 80)
    print()

    print("Object type:")
    print(type(data))
    print()

    # Not every data class is guaranteed to expose these properties.
    if "desc" in property_names:
        print("Description:")
        print(data.get_property("desc"))
        print()

    if "n-points" in property_names:
        print("Number of points:")
        print(data.get_property("n-points"))
        print()

    runtime_properties = pd.DataFrame(
        [
            {
                "Property": prop.name,
                "Value type": prop.value_type.name,
            }
            for prop in props
        ]
    )

    display(show_pandas(runtime_properties))

We can then call the helper on the Hubble data object.

inspect_data_object(data)

================================================================================
Runtime data object inspection
================================================================================

Object type:
<class 'gi.repository.NumCosmo.DataHubble'>

Description:
Gomez-Valent 2018 -- arXiv:1802.01505

Number of points:
31

Property	Value type
name	gchararray
desc	gchararray
long-desc	gchararray
init	gboolean
bootstrap	NcmBootstrap
n-points	guint
w-mean	gboolean
mean	NcmVector
sigma	NcmVector
z	NcmVector

This helper does not replace the documentation for each data class, but it provides a useful first overview.

General Strategy

When working with NumCosmo data objects, a useful workflow is:

1. Look at the available data-type source files under numcosmo/data/.
2. Choose the data family relevant to the example, such as Hubble, BAO, CMB, or supernova data.
3. Inspect the corresponding source file to find the dataset identifier enum, if available.
4. Instantiate one concrete data object from Python.
5. Use list_properties() to see what the object exposes at runtime.
6. Use get_property() to access metadata and numerical arrays.
7. Print a small preview before using the data in a fit.

For example, the Hubble case follows this pattern.

data = Nc.DataHubble.new_from_id(Nc.DataHubbleId.GOMEZ_VALENT_COMP2018)

print(type(data))
print(data.get_property("desc"))
print(data.get_property("n-points"))

for prop in data.list_properties():
    print(prop.name)

<class 'gi.repository.NumCosmo.DataHubble'>
Gomez-Valent 2018 -- arXiv:1802.01505
31
name
desc
long-desc
init
bootstrap
n-points
w-mean
mean
sigma
z

In fitting examples, it is usually better to keep the main text focused on the cosmological model, likelihood, sampler, and results. Instead of explaining the internal structure of the data object in every fitting example, those examples can briefly mention the dataset being used and point to this page for details.