# [Volute] r3641 - trunk/projects/dm/provenance/description

Volute commit messages volutecommits at g-vo.org
Tue Oct 18 21:45:54 CEST 2016

Author: kriebe
Date: Tue Oct 18 21:45:54 2016
New Revision: 3641

Log:
Moved "use cases" A-E from requirements to "Goals" section. Added minimum requirements to Requirements sections. Some minor edits.

Modified:
trunk/projects/dm/provenance/description/Makefile
trunk/projects/dm/provenance/description/ProvenanceDM.pdf
trunk/projects/dm/provenance/description/ProvenanceDM.tex
trunk/projects/dm/provenance/description/intro-general.tex
trunk/projects/dm/provenance/description/intro-previousefforts.tex
trunk/projects/dm/provenance/description/intro-requirements.tex
trunk/projects/dm/provenance/description/prov-refs.bib

Modified: trunk/projects/dm/provenance/description/Makefile
==============================================================================
--- trunk/projects/dm/provenance/description/Makefile	Tue Oct 18 19:15:06 2016	(r3640)
+++ trunk/projects/dm/provenance/description/Makefile	Tue Oct 18 21:45:54 2016	(r3641)
@@ -7,7 +7,8 @@
DOCVERSION = 0.3

# Publication date, ISO format; update manually for "releases"
-DOCDATE = $(DATE)
+date :=$(shell date +"%Y-%m-%d")
+DOCDATE = \$(date)

# What is it you're writing: NOTE, WD, PR, or REC
DOCTYPE = WD

Modified: trunk/projects/dm/provenance/description/ProvenanceDM.pdf
==============================================================================
Binary file (source and/or target). No diff available.

Modified: trunk/projects/dm/provenance/description/ProvenanceDM.tex
==============================================================================
--- trunk/projects/dm/provenance/description/ProvenanceDM.tex	Tue Oct 18 19:15:06 2016	(r3640)
+++ trunk/projects/dm/provenance/description/ProvenanceDM.tex	Tue Oct 18 21:45:54 2016	(r3641)
@@ -9,11 +9,15 @@
\ivoagroup{DM}

\author{Kristin Riebe}
-\author{Michèle Sanguillon}
\author{Mathieu Servillat}
-\author{Florian Rothmaier}
-\author{Mireille Louys}
\author{François Bonnarel}
+\author{Mireille Louys}
+\author{Florian Rothmaier}
+\author{Michèle Sanguillon}
+\author{IVOA Data Model Working Group}
+
+\editor{Kristin Riebe}
+\editor{Mathieu Servillat}

% \previousversion[????URL????]{????Funny Label????}
\previousversion[http://volute.g-vo.org/svn/trunk/projects/dm/provenance/description/ProvDM-0.2-20160428.pdf]{ProvDM-0.2-20160428.pdf}
@@ -117,8 +121,8 @@

\section{Introduction}
\input{intro-general}
-\input{intro-VOarchitecture}
\input{intro-requirements}
+\input{intro-VOarchitecture}
\input{intro-previousefforts}

\section{The provenance data model}

Modified: trunk/projects/dm/provenance/description/intro-general.tex
==============================================================================
--- trunk/projects/dm/provenance/description/intro-general.tex	Tue Oct 18 19:15:06 2016	(r3640)
+++ trunk/projects/dm/provenance/description/intro-general.tex	Tue Oct 18 21:45:54 2016	(r3641)
@@ -30,3 +30,81 @@
Provenance information may be recorded in minute detail or by using coarser
elements, depending on the intended usage and the desired level of detail
for a specific project that records provenance.
+
+The following list is a collection of tasks which the Provenance Data Model should help to solve. They are flagged with [S] for problems which are more interesting for the end user of datasets and with [P] for tasks that are probably more important for data producers and publishers.
+
+\paragraphlb{A: Tracking the production history [S]}
+        Find out which steps were taken to produce a dataset and list the methods/tools/software that was involved.
+        Track the history back to the raw data files/raw images, show the workflow (backwards search) or return a list of progenitor datasets.
+
+        \noindent Examples:
+        \begin{itemize}
+            \item Is an image from catalogue xxx already calibrated?
+What about dark field subtraction? Were foreground stars removed? Which technique
+was used?
+
+            \item Is the background noise of atmospheric muons still present in my neutrino data sample?
+        \end{itemize}
+
+        We do not go so far as to consider easy reproducibility as a use case -- this would be too ambitious. But at least the
+        major steps undertaken to create a piece of data should be recoverable.
+
+
+\paragraphlb{B: Attribution and contact information [S]}
+        Find the people involved in the production of a dataset,
+
+        \noindent Examples:
+        \begin{itemize}
+            \item I want to use an image for my own work -- who was involved in
+creating it? Who do I need to cite or who can I contact to get this information?
+            \item I have a question about column xxx in a data
+        \end{itemize}
+
+
+\paragraphlb{C: Aid in debugging [S, P]}
+        Find possible error sources.
+
+        \noindent Examples:
+        \begin{itemize}
+            \item I found something strange in an image. Where does
+the image come from? Which instrument was used, with which characteristics
+etc.? Was there anything strange noted when the image was taken?
+            \item Which pipeline version was used -- the old one
+with a known bug for treating bright objects or a newer version?
+            \item This light curve doesn't look quite right. How was
+the photometry determined for each data point?
+        \end{itemize}
+
+
+\paragraphlb{D: Quality assessment [P]}
+        Judge the quality of an observation, production step or dataset.
+
+        \noindent Examples:
+        \begin{itemize}
+            \item Since wrong calibration images may increase the
+number of artifacts on an image rather than removing them, knowledge about
+the calibration image set will help to assess the quality of the calibrated
+image.
+        \end{itemize}
+
+
+\paragraphlb{E: Search in structured provenance metadata [P, S]}
+        This would allow one to also do a ``forward search'', i.e. locate derived datasets or outputs, e.g. finding all images produced by a certain processing step or derived from data which were taken by a given facility.
+
+        \noindent Examples:
+        \begin{itemize}
+            \item Give me more images that were produced using the
+same pipeline.
+            \item Give me an overview on all images reduced with the same calibration dataset.
+            \item Are there any more images attributed to this observer?
+            \item Which images of the crab nebula are of good quality and were produced within the last 10 years by someone not from ESO or NASA?
+          % add another specific use case for tracking scientific productivity?
+        \end{itemize}
+
+        This task is probably the most challenging. It also includes tracking the history of data items as in A, but we still have listed this task separately, since we may decide that we can't keep this one, but we definitely want A.
+
+
+More specific use cases in the astronomy domain for different types of datasets and workflows along with example implementations are given in Section \ref{sec:usecases-implementations}.
+

Modified: trunk/projects/dm/provenance/description/intro-previousefforts.tex
==============================================================================
--- trunk/projects/dm/provenance/description/intro-previousefforts.tex	Tue Oct 18 19:15:06 2016	(r3640)
+++ trunk/projects/dm/provenance/description/intro-previousefforts.tex	Tue Oct 18 21:45:54 2016	(r3641)
@@ -1,5 +1,5 @@
\subsection{Previous efforts}
-The provenance concept was early introduced by the IVOA within the scope of the Observation Data Model \citep[see IVOA note][]{note:observationdm} as a class describing where the data are coming from. A full observation data model dedicated to the specific spectral data was then designed \citep[Spectral Data Model,][]{std:SpectralDM} as well as a fully generic characterisation data model of the measurement axes of the data \citep[Characterisation Data Model,][]{std:CharacterisationDM} while the progress on the provenance data model was slowing down.
+The provenance concept was early introduced by the IVOA within the scope of the Observation Data Model \citep[see IVOA note][]{note:observationdm} as a class describing where the data are coming from. A full observation data model dedicated to the specific spectral data was then designed \citep[Spectral Data Model,][]{std:SpectralDM} as well as a fully generic characterisation data model of the measurement axes of the data \citep[Characterisation Data Model][]{std:CharacterisationDM} while the progress on the provenance data model was slowing down.

The IVOA Data Model Working Group first gathered various use cases coming from different communities of observational astronomy (optical, radio, Xray, interferometry). Common motivations for a provenance tracing of the history included: quality assessment, discovery of dataset progenitors and access to metadata necessary for reprocessing. The provenance data model was then designed as the combination of \emph{Data processing}, \emph{Observing configuration} and \emph{Observation ambient conditions} data model classes.
The \emph{Processing class} was embedding a sequence of processing stages which were hooking specific ad hoc details and links to input and output datasets, as well as processing step description.
@@ -7,7 +7,7 @@

Outside of the astronomical community, the Provenance Challenge series (2006 -- 2010), a community effort to achieve inter-operability between different representations of provenance in scientific workflows, resulted in the Open Provenance Model (\cite{moreau2010}).
Later, the W3C Provenance Working Group was founded and released the W3C Provenance Data Model as Recommendation in 2013 (\cite{std:W3CProvDM}).
-OPM was designed to be applicable to anything, scientific data as well as cars or immaterial things like decisions. With the W3C model, this becomes more focused on the web.  Nevertheless, the core concepts are still in principle the same in both models and very general, so they can be applied to astronomical datasets and workflows as well.
+OPM was designed to be applicable to anything, scientific data as well as cars or immaterial things like decisions. With the W3C model, this becomes more focused on the web. Nevertheless, the core concepts are still in principle the same in both models and very general, so they can be applied to astronomical datasets and workflows as well.
The W3C model was taken up by a larger number of applications and tools than OPM, we are therefore basing our modeling efforts on the W3C Provenance data model, making it less abstract and more specific, or extending it where necessary.

Modified: trunk/projects/dm/provenance/description/intro-requirements.tex
==============================================================================
--- trunk/projects/dm/provenance/description/intro-requirements.tex	Tue Oct 18 19:15:06 2016	(r3640)
+++ trunk/projects/dm/provenance/description/intro-requirements.tex	Tue Oct 18 21:45:54 2016	(r3641)
@@ -1,79 +1,72 @@
-\subsection{Requirements for provenance and use cases}
-\subsubsection{Requirements}\label{sec:requirements}
+\subsection{Minimum requirements for provenance}\label{sec:requirements}

-An IVOA provenance data model should provide solutions to the following tasks:
+We derived from our goals and use cases the following minimum requirements for the Provenance Data Model:

-\paragraphlb{A: Tracking the production history}
-        Find out which steps were taken to produce a dataset and list the methods/tools/software that was involved.
-        Track the history back to the raw data files/raw images, show the workflow.
-
-        \noindent Examples:
-        \begin{itemize}
-            \item Is an image from catalogue xxx already calibrated?
-What about dark field subtraction? Were foreground stars removed? Which technique
-was used?
-
-            \item Is the background noise of atmospheric muons still present in my neutrino data sample?
-        \end{itemize}
-
-        We do not go so far as to consider easy reproducibility as a use case -- this would be too ambitious. But at least the
-        major steps undertaken to create a piece of data should be recoverable.
-
-
-        Find the people involved in the production of a dataset,
-
-        \noindent Examples:
-        \begin{itemize}
-            \item I want to use an image for my own work -- who was involved in
-creating it? Who do I need to cite or who can I contact to get this information?
-            \item I have a question about column xxx in a data
-        \end{itemize}
-
-
-\paragraphlb{C: Aid in debugging}
-        Find possible error sources.
-
-        \noindent Examples:
-        \begin{itemize}
-            \item I found something strange in an image. Where does
-the image come from? Which instrument was used, with which characteristics
-etc.? Was there anything strange noted when the image was taken?
-            \item Which pipeline version was used -- the old one
-with a known bug for treating bright objects or a newer version?
-            \item This light curve doesn't look quite right. How was
-the photometry determined for each data point?
-        \end{itemize}
-
-
-\paragraphlb{D: Quality assessment}
-        Judge the quality of an observation, production step or dataset.
-
-        \noindent Examples:
-        \begin{itemize}
-            \item Since wrong calibration images may increase the
-number of artifacts on an image rather than removing them, knowledge about
-the calibration image set will help to assess the quality of the calibrated
-image.
-        \end{itemize}
-
-
-\paragraphlb{E: Search in structured provenance metadata}
-        Find all images produced by a certain processing step and similar tasks.
-
-        \noindent Examples:
-        \begin{itemize}
-            \item Give me more images that were produced using the
-same pipeline.
-            \item Give me an overview on all images reduced with the same calibration dataset.
-            \item Are there any more images attributed to this observer?
-            \item Which images of the crab nebula are of good quality and were produced within the last 10 years by someone not from ESO or NASA?
-        \end{itemize}
+\TODO{Readers do not know about the terms \emph{activity} and \emph{entity}, yet at this point!!}

-        This task is probably the most challenging. It also includes tracking the history of data items as in A, but we still have listed this task separately, since we may decide that we can't keep this one, but we definitely want A.
+\begin{itemize}

+% == other models / serialisation
+
+\item Provenance information must be stored in a standard model, with standard serialization formats.
+
+\item Provenance information must be machine readable.
+
+\item Provenance data model classes and attributes should be linked to other IVOA concepts when relevant (ObsCoreDM, SimDM, VOTable, UCDs...).
+
+\item Provenance information should be serializable into the W3C PROV standard formats with minimum information loss.
+
+
+
+
+\item Provenance metadata must contain information to find immediate progenitor(s) (if existing) for a given entity, i.e. a dataset.
+%All produced entities must contain information to find its immediate progenitor(s).
+
+
+\item An entity must point to the activity that generated it (if the activity is recorded).
+%Provenance metadata must contain information to find the activity that generated a given entity.
+%* All produced entities must contain information to find the activity that generated it
+\item Activities may point to output entities.
+
+\item Activities must point to input entities (if applicable).
+
+\item Provenance information should make it possible to derive the chronological sequence of activities.
+%The order of the activities should be available.
+
+
+%\item Provenance information should contain the list of activities and progenitor entities.
+% too vague .... must be an ordered list ... One step should also be allowed.
+\end{itemize}
+
+% ==== Comment:
+%These links can be used to trace back the sequence of processing steps (activities) and possibly the interim results.
+
+
+
+\begin{itemize}
+
+% Released entities must have a unique, persistent identifier (DOI, obs_publisher_did, ...), at least in their domain.
+\item Provenance information can only be given for uniquely identifiable entities, at least inside their domain.
+% comment: (DOI, obs_publisher_did, ...)
+% Thus entities have to have a unique, persistent identifier.
+% (to avoid ambiguities).
+
+\item Released entities should have a main contact.
+
+\item It is recommended that all activities and entities have contact information and contain a (short) description or link to a description.
+% could also be the documentation.
+
+\end{itemize}
+
+
+% Should this go into the requirements or the model?
+%\item Activities should be defined by following keywords (attributes):
+%    \begin{itemize}
+%    \item unique ID
+%    \item status (COMPLETED/ERROR/...)
+%
+%... (see working draft and data model)
+%* Entities should be defined by... (see working draft and data model)

-\subsubsection{More specific use cases}
-More specific use cases with example serialisations for different types of astronomical datasets are given in Section \ref{sec:usecases-implementations}.

Modified: trunk/projects/dm/provenance/description/prov-refs.bib
==============================================================================
--- trunk/projects/dm/provenance/description/prov-refs.bib	Tue Oct 18 19:15:06 2016	(r3640)
+++ trunk/projects/dm/provenance/description/prov-refs.bib	Tue Oct 18 21:45:54 2016	(r3641)
@@ -91,12 +91,19 @@
@misc{std:CharacterisationDM,
author = {IVOA Data Model Working Group},
title = {Data Model for Astronomical DataSet Characterisation, Version 1.13},
-    howpublished = {{IVOA} Note},
+    howpublished = {{IVOA} Recommendation},
month = mar,
year = 2008,
url = {http://www.ivoa.net/documents/latest/CharacterisationDM.html}
}
-
+@misc{std:CharacterisationDM2,
+    author = {Francois Bonnarel and Igor Chilingarian and Mireille Louys and Juande Santander Vela and Jesus Salgado},
+    title = {Characterisation DM: Complements and new features. Observation quality and variability - complex datasets, Version 1.0},
+    howpublished = {{IVOA} Draft},
+    month = oct,
+    year = 2012,
+    url = {http://www.ivoa.net/documents/Characterisation2/index.html}
+}
@misc{std:previousefforts,
author = {Francois Bonnarel, IVOA Data Model Working Group},
title = {Provenance Data Model Legacy},