In Documenting Data with Metadata we discussed how Jenkins lacks a built-in framework for relating arbitrary Jenkins projects, builds, and artifacts. This creates a challenge for linking data and metadata generated in independent builds.
Jenkins job and build configuration, parameters, and artifacts are persisted as separate files on the server file system. When Jenkins starts, it builds an in-memory Jenkins object model from the XML configuration and build files of every project, as well as from the file structure of the ‘jobs’ folder. However, there is no dedicated RDBMS (relational database management system) backing this Jenkins model, and no attempt is made to formally relate builds to each other. Once the server is shut down, the object model is lost and needs to be rebuilt from scratch on the next restart.
In this post, we will examine strategies for overcoming these limitations and establishing the build relationships that are important for data reuse, comprehension, and provenance in research and data-science applications.
Let’s summarize the build and artifact types we introduced in the previous post and recall the typical ways that metadata is generated:
| Build Type | Build Artifacts | Comment |
|---|---|---|
| Data-source | Data, Metadata | Build generates primary data and metadata |
| Analysis | Data, Metadata | Build generates derived data and potentially additional metadata |
| Metadata-only | Metadata | Build associates user-provided metadata with a data-source |
Metadata is generated by each of these build types, as summarized in the table above.
We will focus on how to establish and use relationships (links) between the different build types and their artifacts. The format of the artifacts is irrelevant (it can be binary, CSV, Java properties, XML, JSON, HTML, etc.).
Note that our discussion involves arbitrary, asynchronous builds, not those chained into a job pipeline, which Jenkins typically links with an upstream-downstream relationship. We are mostly concerned with freestyle, parametrized Jenkins projects, as they provide the interactive build forms suitable for research and data-science applications.
Nonetheless, the strategies we discuss can also be used in pipelined builds to generate the required build links.
As is the case for an RDBMS, we need a common key to establish a relationship between any two builds (entities) in the Jenkins model.
For parametrized builds, Jenkins provides a run parameter type, which can be used to reference (link) the current build to a previous build of another project. The value of a run parameter is the absolute URL of the job build.
Luckily, Jenkins also provides a unique sequential key for each build: the BUILD_NUMBER. A composite key (composed of the JOB_NAME and the BUILD_NUMBER) is even more useful, as it creates a unique reference to a job build on the server.
We will call this composite, unique key a BUILD_KEY and conveniently format it as JOB_NAME#BUILD_NUMBER, a format that is easily parsed to identify the referenced Jenkins job and build.
When build keys are used as build parameters, they relate the current build to another (from the same or another project). In a future blog we will discuss how we can implement and enhance BUILD_KEYs using Active Choices parameters.
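As a quick sketch of the idea, a BUILD_KEY can be split on the '#' character and resolved back to a build through the Jenkins Java API; the job name below is an assumption for illustration:

```groovy
// Sketch (e.g., for the Jenkins script console): resolve a BUILD_KEY of
// the form JOB_NAME#BUILD_NUMBER back to the referenced build.
import jenkins.model.Jenkins
import hudson.model.Job

def resolveBuildKey(String buildKey) {
    def (jobName, buildNumber) = buildKey.tokenize('#')
    def job = Jenkins.instance.getItemByFullName(jobName, Job)
    job?.getBuildByNumber(buildNumber as int)
}

println resolveBuildKey('my-data-source-job#42')?.fullDisplayName
```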
Note: when one uses pipeline-compatible steps, the Run Selector Plugin provides references to previous builds, similar to a 'run parameter'.
Links between data-sources and their metadata are modeled by a one-to-many relation, as the same data-source is reused in one or more metadata-generating builds. Also, such links typically exist across different Jenkins projects.
We will demonstrate linking a data-source build with a metadata build using an example.
Let's assume that we have two data-source Jenkins projects (jobs), DSR_A and DSR_B, and an ANL analysis parametrized project that uses the artifacts generated by DSR_A and DSR_B as inputs.
A user configures an ANL build form by selecting the INPUT to be analyzed. The INPUT parameter is a selectable option referencing the available builds in the DSR_A and DSR_B projects.
Figure 1 (above) displays DSR_A and DSR_B builds on the left and a schematic of the ANL build form on the right.
Assigning a DSR project BUILD_KEY as the value of the ANL INPUT parameter introduces a unique DSR build reference in the ANL build configuration. This DSR BUILD_KEY is stored in the build object model and can be used later to relate the ANL builds to the corresponding data-source build.
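As a sketch of how such INPUT options could be produced (for example, with an Active Choices Groovy script, as we will detail in a future post), where the job name DSR_A and the filter on successful builds are assumptions:

```groovy
// Sketch of an Active Choices script: offer the BUILD_KEYs of a
// data-source project as the selectable INPUT options.
import jenkins.model.Jenkins
import hudson.model.Job
import hudson.model.Result

def job = Jenkins.instance.getItemByFullName('DSR_A', Job)
return job.builds
          .findAll { it.result == Result.SUCCESS }
          .collect { "${job.name}#${it.number}".toString() }
```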
Note that each of the ANL builds includes additional parameters (INPUT_DATA, PARAM01, PARAM02) and generates a new ARTIFACT as output.
Figure 2 (above): The Jenkins build model after several ANL builds. We have color-coded DSR builds and their corresponding ANL builds to highlight the one-to-many relation that exists between them.
We have successfully established a relational model between the DSR project builds (used as input) and the ANL analysis builds that reference them. The DSR (data-source) and ANL (metadata) projects are now linked via the ANL INPUT parameter = DSR BUILD_KEY relation.
Using BUILD_KEYs, this relational model can be extended to more than one project, similar to how one would build a relational model for an RDBMS using unique and foreign entity keys.
The value of any relational model is demonstrated in its ability to retrieve related entity data. We are now ready to use the Jenkins relational model of Figure 2 to reference any of the metadata builds (and artifacts) generated from a primary data-source. We will demonstrate this with an example.
Let's now assume that a second analysis project, ANL_X, uses as INPUT the same data-sources as ANL. Can the ANL artifacts be referenced from the ANL_X project before or after a build starts?
The 'before' option is of particular interest to research and data-science applications: if ANL results and data-source metadata can be referenced, they can provide useful information and guide the selection of the ANL_X analysis parameters.
The relational model we established between DS and ANL projects supports the discovery and reuse of ANL metadata artifacts from ANL_X.
Figure 3: The ANL builds and artifacts can be retrieved from ANL_X through a lookup with the DS BUILD_KEY used as INPUT for both the ANL and ANL_X builds.
As a result, during the configuration of an ANL_X build a user can retrieve and dynamically display the ART01 or ART02 artifacts of the corresponding ANL project.
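A minimal Groovy sketch of such a lookup, using the job and parameter names from this example (the helper below is illustrative, not an existing API):

```groovy
// Sketch: find the ANL builds (and their artifacts) whose INPUT
// parameter equals a given DSR BUILD_KEY.
import jenkins.model.Jenkins
import hudson.model.Job
import hudson.model.ParametersAction

def findMetadataBuilds(String buildKey) {
    def anl = Jenkins.instance.getItemByFullName('ANL', Job)
    anl.builds.findAll { run ->
        run.getAction(ParametersAction)?.getParameter('INPUT')?.value == buildKey
    }
}

findMetadataBuilds('DSR_A#3').each { run ->
    println "${run.fullDisplayName}: ${run.artifacts*.fileName}"
}
```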
Figure 4: As the ANL_X build is configured, ANL artifacts relevant to the selected ANL_X INPUT can be retrieved and dynamically rendered in the ANL_X build form. The user gets additional insight and information on the INPUT source, which in turn can assist in selecting options for additional parameters of the ANL_X project.
In a follow-up blog entry I will give implementation details on the generation of BUILD_KEYs and their use in retrieving and displaying artifacts across Jenkins project builds. If you would like a head start, become familiar with the Active Choices Jenkins Plugin. The power of Groovy, the Jenkins Java API, JavaScript, and dynamic HTML come together when we use this plugin to form a framework for a dynamic, fully interactive UI that links and displays Jenkins artifacts across builds.

---

However, managing hundreds of Groovy scripts and several different job parameters may be quite a challenge.
You could use the Scriptler parameter, externalise the configuration to a configuration management tool such as Puppet, Ansible, or SaltStack, or simply build your own automation in a language such as Python, Perl, or shell script, accessing Jenkins' API via its Groovy console or remotely via REST services.
In today's post I will show a way of achieving this with the Job DSL Plugin. With this plugin, you are able to use a domain-specific language (or DSL) to programmatically create Jenkins projects.
Our example project is quite simple. We will use one of our examples from the Wiki; more specifically, the example with some Brazilian states and their respective cities, so that when you choose a state, its cities are displayed as the options of another parameter.
Normally you would create a project manually, but when you use the Job DSL you start by creating a seed project. When this seed project is built, it creates new projects, hence the name. So it is only built when you need a new project, and it can be triggered manually or automatically.
Let's start with the new project. Create a FreeStyle project with any name you prefer. For this example I will be using "job-dsl-active-choices-states-seed".
You can also parameterise your seed project, allowing you to use parameters to further customise the projects it creates. I will create a String Parameter named "NEW_PROJECT_NAME" and use this variable as the name of the new project.
Next we will add a build step. Click Add build step, and choose "Process Job DSLs". Here you will be asked whether your DSL sits somewhere in your workspace, or whether you would like to create one inline.
This means that you can even store your DSLs in a repository somewhere like GitHub, GitLab, BitBucket, etc.
Here’s what the example from our Wiki looks like in the Job DSL syntax.
job ("$NEW_PROJECT_NAME") {
parameters {
activeChoiceParam('States') {
description('Select a state option')
filterable()
choiceType('SINGLE_SELECT')
groovyScript {
script('["Sao Paulo", "Rio de Janeiro", "Parana:selected", "Acre"]')
fallbackScript('return ["ERROR"]')
}
}
activeChoiceReactiveParam('Cities') {
description('Active Choices Reactive parameter')
filterable()
choiceType('CHECKBOX')
groovyScript {
script('''
if (States.equals('Sao Paulo')) {
return ['Barretos', 'Sao Paulo', 'Itu'];
} else if (States.equals('Rio de Janeiro')) {
return ['Rio de Janeiro', 'Mangaratiba']
} else if (States.equals('Parana')) {
return ['Curitiba', 'Ponta Grossa']
} else if (States.equals('Acre')) {
return ['Rio Branco', 'Acrelandia']
} else {
return ['Unknown state']
}
''')
fallbackScript('return ["Script error!"]')
}
referencedParameter('States')
}
}
}
When you build your seed project, it will ask you for any parameters you may have configured your project with.
And then, once executed, you will have your new project created! What's even better, you are able to track the projects created from the seed project's page in Jenkins, as in the following figure.
And likewise, you can also find the seed project from your created project’s page.
Finally, here's the result: the same as if you had manually created the project. But now you can create as many projects like this as you would like.
Depending on how complex you design your projects, you may need to spend a long time reading the Job DSL plugin API - which is great and well up to date.
You will probably use parameters in your seed projects to further customise the projects they create. You can even use Active Choices parameters for that :-) And you can version-control your parameters within your project configuration.
It might still be hard to keep track of several scripts that you want to reuse in different projects. In that case, you may want to look into using the Scriptler plugin as well.
As per the Perl motto TIMTOWTDI (there is more than one way to do it), this is just one way of achieving it. I hope you find it interesting and useful!
Happy hacking!

---

Before anything, here is a TL;DR on transient fields in Java. When you have a field such as
public class SomeClass {
    private transient String someVariable;
}
you are specifying that you do not want it to be persisted if/when the class is serialized.
We made some fields transient in the Active Choices Plug-in, which caused users to lose the Groovy scripts used in their job parameters.
So sit back and relax while I tell you what happened to release 1.5 of our Active Choices Plug-in (which was dropped and never made it to the update center), why we had to remove release 1.5.0 from the Jenkins update center, and finally how we fixed it in release 1.5.1.
We released 1.5-alpha to the Jenkins experimental update center on 20th March this year. We cut that release due to the script-security-plugin integration.
This alpha release was not available to all users, only to those who choose to use alpha versions. It was announced on our mailing list, and I tried giving it some testing.
But our post mortem has - almost - nothing to do with the script-security-plugin.
The problem with release 1.5, which followed our previous releases 1.3 and 1.4, was that since 1.4 the Jenkins plug-in API had changed, and plug-ins are now required to update how they define the Jenkins version against which they are built. You can read more about it in INFRA-588.
The release process worked almost completely flawlessly, except for the upload stage. Before the upload stage, there is a task where a tag is created. The tag (uno-choice-1.5) is in GitHub, in the active-choices-plugin repository under the Jenkins organization.
After fixing the issue by following the instructions in INFRA-588, we couldn't release 1.5: the task to create the tag would fail, since the tag already existed, before the plug-in binary could be uploaded to the update center.
It is possible to request that someone with karma manually delete the tag from GitHub, or to just skip that version. That is what was done, and that is why release 1.5 was skipped.
This is important because it plays a part in the issue. After the plug-in was ready to be released as 1.5.0, it was just a matter of running mvn clean release:prepare release:perform.
However, during the execution of the release, besides running tests, Jenkins now also runs FindBugs. This means that every plug-in released recently has been scanned by FindBugs, which is great.
One of the changes in the Active Choices Plug-in was that Groovy scripts, instead of being Strings, are now instances of SecureGroovyScript from the script-security-plugin.
SecureGroovyScript does not implement Serializable. That is not really a problem, since Jenkins does not use traditional Java serialization to persist objects; it uses XStream to store them as XML. But since FindBugs was complaining about it, and it was Friday night, I silenced the warning by marking that field transient… and XStream, just like Java serialization, skips transient fields.
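To see the effect, here is a minimal Groovy sketch (an illustration, not the plug-in's code) using the XStream library, which Jenkins uses under the hood to persist objects as XML; note how the transient field is silently dropped:

```groovy
// Sketch: XStream, like Java serialization, skips transient fields.
@Grab('com.thoughtworks.xstream:xstream:1.4.20')
import com.thoughtworks.xstream.XStream

class Param {
    String name = 'param001'
    transient String script = 'return [1,2,3,4]'   // lost on save
}

println new XStream().toXML(new Param())
// prints something like <Param><name>param001</name></Param>
// -- no script element
```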
So Monday morning I started seeing users reporting the issue. Actually, one user reported before Monday that his scripts had disappeared, but since he mentioned the pipeline-plugin, I didn't consider that it could be such a blocking issue (the active-choices-plugin does not support the pipeline-plugin).
It took Ioannis reporting the same issue, and some coffee, before I realized what had happened. For anyone who installed the 1.5.0 release, saving the job (whether by a user, by a plug-in, or by Jenkins itself deciding to save it) would remove all the GroovyScript instances from the job XML, meaning that the next time someone tried to execute the job, the parameters would probably be empty.
Another user reported that the same had happened to him and suggested removing that release so that other users would not lose their scripts. It took just a few minutes (OSS is great, the Jenkins project is great, Daniel Beck is great! Hooray!). I logged in to the #jenkins IRC channel, briefly explained the issue, and asked how to remove a release from the update center.
You just need to submit a pull request to the backend-update-center2, like this one.
So now users won’t be affected by the issue. Great, let’s fix the issue and release a new version.
Here's how we fixed it: revert the change, removing the transient keywords from the code, and add an ignore filter for FindBugs. It took just a few minutes to write the fix, but even then we still had to test the change before releasing 1.5.1.
So I arrived earlier from work today, and decided to thoroughly test that the fix would work, before releasing 1.5.1.
First, here's the job XML as saved with version 1.4.
<?xml version='1.0' encoding='UTF-8'?>
<project>
<actions/>
<description></description>
<keepDependencies>false</keepDependencies>
<properties>
<jenkins.model.BuildDiscarderProperty>
<strategy class="hudson.tasks.LogRotator">
<daysToKeep>-1</daysToKeep>
<numToKeep>-1</numToKeep>
<artifactDaysToKeep>-1</artifactDaysToKeep>
<artifactNumToKeep>-1</artifactNumToKeep>
</strategy>
</jenkins.model.BuildDiscarderProperty>
<hudson.model.ParametersDefinitionProperty>
<parameterDefinitions>
<org.biouno.unochoice.ChoiceParameter plugin="uno-choice@1.4">
<name>param001</name>
<description></description>
<randomName>choice-parameter-6543037871533</randomName>
<visibleItemCount>1</visibleItemCount>
<script class="org.biouno.unochoice.model.GroovyScript">
<script>return [1,2,3,4]</script>
<fallbackScript>return []</fallbackScript>
</script>
<projectName>test-001</projectName>
<choiceType>PT_MULTI_SELECT</choiceType>
<filterable>true</filterable>
</org.biouno.unochoice.ChoiceParameter>
</parameterDefinitions>
</hudson.model.ParametersDefinitionProperty>
</properties>
<scm class="hudson.scm.NullSCM"/>
<canRoam>true</canRoam>
<disabled>false</disabled>
<blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
<blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
<triggers/>
<concurrentBuild>false</concurrentBuild>
<builders/>
<publishers/>
<buildWrappers/>
</project>
Notice the <script class="org.biouno.unochoice.model.GroovyScript"> element.
And here's the same job XML after it was saved with 1.5.0.
<?xml version='1.0' encoding='UTF-8'?>
<project>
<actions/>
<description></description>
<keepDependencies>false</keepDependencies>
<properties>
<jenkins.model.BuildDiscarderProperty>
<strategy class="hudson.tasks.LogRotator">
<daysToKeep>-1</daysToKeep>
<numToKeep>-1</numToKeep>
<artifactDaysToKeep>-1</artifactDaysToKeep>
<artifactNumToKeep>-1</artifactNumToKeep>
</strategy>
</jenkins.model.BuildDiscarderProperty>
<hudson.model.ParametersDefinitionProperty>
<parameterDefinitions>
<org.biouno.unochoice.ChoiceParameter plugin="uno-choice@1.5.0">
<name>param001</name>
<description></description>
<randomName>choice-parameter-6543037871533</randomName>
<visibleItemCount>1</visibleItemCount>
<script class="org.biouno.unochoice.model.GroovyScript"/>
<projectName>test-001</projectName>
<choiceType>PT_MULTI_SELECT</choiceType>
<filterable>true</filterable>
</org.biouno.unochoice.ChoiceParameter>
</parameterDefinitions>
</hudson.model.ParametersDefinitionProperty>
</properties>
<scm class="hudson.scm.NullSCM"/>
<canRoam>true</canRoam>
<disabled>false</disabled>
<blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
<blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
<triggers/>
<concurrentBuild>false</concurrentBuild>
<builders/>
<publishers/>
<buildWrappers/>
</project>
Noticed anything? The <script> element is now empty: the Groovy scripts are gone. And here is the job XML once more, this time saved with the fixed 1.5.1-SNAPSHOT.
<?xml version='1.0' encoding='UTF-8'?>
<project>
<actions/>
<description></description>
<keepDependencies>false</keepDependencies>
<properties>
<jenkins.model.BuildDiscarderProperty>
<strategy class="hudson.tasks.LogRotator">
<daysToKeep>-1</daysToKeep>
<numToKeep>-1</numToKeep>
<artifactDaysToKeep>-1</artifactDaysToKeep>
<artifactNumToKeep>-1</artifactNumToKeep>
</strategy>
</jenkins.model.BuildDiscarderProperty>
<hudson.model.ParametersDefinitionProperty>
<parameterDefinitions>
<org.biouno.unochoice.ChoiceParameter plugin="uno-choice@1.5.1-SNAPSHOT">
<name>param001</name>
<description></description>
<randomName>choice-parameter-6543037871533</randomName>
<visibleItemCount>1</visibleItemCount>
<script class="org.biouno.unochoice.model.GroovyScript">
<secureScript plugin="script-security@1.24">
<script>return [1,2,3,4]</script>
<sandbox>false</sandbox>
</secureScript>
<secureFallbackScript plugin="script-security@1.24">
<script>return []</script>
<sandbox>false</sandbox>
</secureFallbackScript>
</script>
<projectName>test-001</projectName>
<choiceType>PT_MULTI_SELECT</choiceType>
<filterable>true</filterable>
</org.biouno.unochoice.ChoiceParameter>
</parameterDefinitions>
</hudson.model.ParametersDefinitionProperty>
</properties>
<scm class="hudson.scm.NullSCM"/>
<canRoam>true</canRoam>
<disabled>false</disabled>
<blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
<blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
<triggers/>
<concurrentBuild>false</concurrentBuild>
<builders/>
<publishers/>
<buildWrappers/>
</project>
Again, not quite the same as 1.4, but our script is persisted in the job configuration. So what changed? The script is now wrapped in a <secureScript> element from the script-security-plugin.
Users that installed 1.5.0 have lost their scripts. Unless they used a test bed server to install and test the plug-in against their jobs, or they have some backup process in place, I am afraid there is not much that can be done.
I also tested a very similar scenario, going from 1.4 to 1.5.1, which is what will happen to users that have not upgraded to 1.5.0. Again, the script was not correctly rendered.
But this time it is not exactly an issue in the plug-in code; maybe a usability (UX) issue. What I got in the logs after upgrading from 1.4 to 1.5.1 was the following exception.
SEVERE: Error executing script for dynamic parameter
java.lang.RuntimeException: Failed to evaluate fallback script: script not yet approved for use
at org.biouno.unochoice.model.GroovyScript.eval(GroovyScript.java:178)
at org.biouno.unochoice.util.ScriptCallback.call(ScriptCallback.java:96)
at org.biouno.unochoice.AbstractScriptableParameter.eval(AbstractScriptableParameter.java:233)
at org.biouno.unochoice.AbstractScriptableParameter.getChoices(AbstractScriptableParameter.java:196)
at org.biouno.unochoice.AbstractScriptableParameter.getChoices(AbstractScriptableParameter.java:184)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.commons.jexl.util.introspection.UberspectImpl$VelMethodImpl.invoke(UberspectImpl.java:258)
at org.apache.commons.jexl.parser.ASTMethod.execute(ASTMethod.java:104)
at org.apache.commons.jexl.parser.ASTReference.execute(ASTReference.java:83)
at org.apache.commons.jexl.parser.ASTReference.value(ASTReference.java:57)
at org.apache.commons.jexl.parser.ASTReferenceExpression.value(ASTReferenceExpression.java:51)
at org.apache.commons.jexl.ExpressionImpl.evaluate(ExpressionImpl.java:80)
at hudson.ExpressionFactory2$JexlExpression.evaluate(ExpressionFactory2.java:74)
at org.apache.commons.jelly.expression.ExpressionSupport.evaluateRecurse(ExpressionSupport.java:61)
at org.apache.commons.jelly.expression.ExpressionSupport.evaluateAsIterator(ExpressionSupport.java:94)
at org.apache.commons.jelly.tags.core.ForEachTag.doTag(ForEachTag.java:89)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.apache.commons.jelly.tags.core.CoreTagLibrary$2.run(CoreTagLibrary.java:105)
at org.kohsuke.stapler.jelly.JellyViewScript.run(JellyViewScript.java:95)
at org.kohsuke.stapler.jelly.IncludeTag.doTag(IncludeTag.java:147)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.TagSupport.invokeBody(TagSupport.java:161)
at org.apache.commons.jelly.tags.core.WhenTag.doTag(WhenTag.java:46)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.TagSupport.invokeBody(TagSupport.java:161)
at org.apache.commons.jelly.tags.core.ChooseTag.doTag(ChooseTag.java:38)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.kohsuke.stapler.jelly.CallTagLibScript$1.run(CallTagLibScript.java:99)
at org.apache.commons.jelly.tags.define.InvokeBodyTag.doTag(InvokeBodyTag.java:91)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.tags.core.CoreTagLibrary$2.run(CoreTagLibrary.java:105)
at org.kohsuke.stapler.jelly.CallTagLibScript.run(CallTagLibScript.java:120)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.tags.core.CoreTagLibrary$2.run(CoreTagLibrary.java:105)
at org.kohsuke.stapler.jelly.JellyViewScript.run(JellyViewScript.java:95)
at org.kohsuke.stapler.jelly.IncludeTag.doTag(IncludeTag.java:147)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.tags.core.CoreTagLibrary$2.run(CoreTagLibrary.java:105)
at org.kohsuke.stapler.jelly.JellyViewScript.run(JellyViewScript.java:95)
at org.kohsuke.stapler.jelly.IncludeTag.doTag(IncludeTag.java:147)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.apache.commons.jelly.TagSupport.invokeBody(TagSupport.java:161)
at org.apache.commons.jelly.tags.core.ForEachTag.doTag(ForEachTag.java:150)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.kohsuke.stapler.jelly.CallTagLibScript$1.run(CallTagLibScript.java:99)
at org.apache.commons.jelly.tags.define.InvokeBodyTag.doTag(InvokeBodyTag.java:91)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.tags.core.CoreTagLibrary$2.run(CoreTagLibrary.java:105)
at org.kohsuke.stapler.jelly.CallTagLibScript.run(CallTagLibScript.java:120)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.kohsuke.stapler.jelly.CallTagLibScript$1.run(CallTagLibScript.java:99)
at org.apache.commons.jelly.tags.define.InvokeBodyTag.doTag(InvokeBodyTag.java:91)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.tags.core.CoreTagLibrary$1.run(CoreTagLibrary.java:98)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.tags.core.CoreTagLibrary$2.run(CoreTagLibrary.java:105)
at org.kohsuke.stapler.jelly.CallTagLibScript.run(CallTagLibScript.java:120)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.kohsuke.stapler.jelly.CallTagLibScript$1.run(CallTagLibScript.java:99)
at org.apache.commons.jelly.tags.define.InvokeBodyTag.doTag(InvokeBodyTag.java:91)
at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:269)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.kohsuke.stapler.jelly.ReallyStaticTagLibrary$1.run(ReallyStaticTagLibrary.java:99)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.tags.core.CoreTagLibrary$2.run(CoreTagLibrary.java:105)
at org.kohsuke.stapler.jelly.CallTagLibScript.run(CallTagLibScript.java:120)
at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
at org.apache.commons.jelly.tags.core.CoreTagLibrary$2.run(CoreTagLibrary.java:105)
at org.kohsuke.stapler.jelly.JellyViewScript.run(JellyViewScript.java:95)
at org.kohsuke.stapler.jelly.DefaultScriptInvoker.invokeScript(DefaultScriptInvoker.java:63)
at org.kohsuke.stapler.jelly.DefaultScriptInvoker.invokeScript(DefaultScriptInvoker.java:53)
at org.kohsuke.stapler.jelly.JellyRequestDispatcher.forward(JellyRequestDispatcher.java:55)
at jenkins.model.ParameterizedJobMixIn.doBuild(ParameterizedJobMixIn.java:188)
at hudson.model.AbstractProject.doBuild(AbstractProject.java:1759)
at hudson.model.AbstractProject.doBuild(AbstractProject.java:1765)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:324)
at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:167)
at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:100)
at org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:124)
at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:746)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:876)
at org.kohsuke.stapler.MetaClass$5.doDispatch(MetaClass.java:233)
at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:746)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:876)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:649)
at org.kohsuke.stapler.Stapler.service(Stapler.java:238)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1669)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:135)
at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:126)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:86)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249)
at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:67)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:76)
at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:171)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:553)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
at winstone.BoundedExecutorService$1.run(BoundedExecutorService.java:77)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.jenkinsci.plugins.scriptsecurity.scripts.UnapprovedUsageException: script not yet approved for use
at org.jenkinsci.plugins.scriptsecurity.scripts.ScriptApproval.using(ScriptApproval.java:459)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript.evaluate(SecureGroovyScript.java:168)
at org.biouno.unochoice.model.GroovyScript.eval(GroovyScript.java:175)
... 167 more
This happens to scripts that are pending approval. Users now have to go to the Manage Jenkins page and open the In-process Script Approval option, where they can approve or reject scripts and manage their script security. This behaviour comes not from the active-choices-plugin, but from the script-security-plugin.
Version 1.5.1 has just been released. If you intend to upgrade, please remember to always use a test bed server, as well as to have a good backup process. Also remember that you may need some time to approve your parameter scripts.
On the bright side:
Apologies for the inconvenience.

---

A friend of mine has called well-annotated data 'civilized data'; others have called it 'tidy data' [2].
Here I establish some metadata vocabulary for Jenkins data-science applications, so that future blogs can continue with a common vocabulary.
Note: in the context of this discussion I will use the terms annotation and metadata almost interchangeably.
Research application metadata is either automatically or manually created and associated with primary data. Automatic metadata can be added by instruments or software systems that acquire scientific measurements and observations. However, much critical metadata can only be provided by a human familiar with the design, samples, data, and methods involved in the generation of the experimental data.
A familiar example of these two classes of metadata can be found in stored digital images. A digital camera records, along with each digital image, a wealth of acquisition metadata such as an auto-generated file name, the camera model, a time stamp, exposure, f-stop, and even geocoding information. However, only a human can annotate the image with the metadata that is really useful for other humans to search and find a picture in a collection, such as people's names, the occasion, and so on.
Measurements from a well-conducted scientific study should generate data and associated metadata that are relevant and, if possible, standardized [1].
Figure 1: Scientific datasets should be annotated with relevant metadata that documents and describes the data.
In the life sciences there are many cases where data annotation is left to an expert data curator (and, more recently, to machines with artificial intelligence), but that is typically a last-ditch effort to salvage a dataset that was improperly annotated in the first place. I will not discuss these issues here, but I will introduce you to some of the strategies that we can use with Jenkins and R to generate 'civilized data'.
The BioUno project is championing the use of Jenkins as a platform for data-science and research computing. In this context this blog series will focus on how we deal with datasets and their metadata when they are processed by Jenkins projects and stored as Jenkins artifacts.
When a Jenkins build generates one or more dataset artifacts, we call it a data source build. Data source build artifacts are typically used as inputs by downstream analysis and processing builds. Data source builds themselves can vary, but in general they act to import, process, or transform data.
The files produced by a data source build are primary datasets that we can use as input for additional processing algorithms.
Data source builds are also responsible for generating the primary metadata associated with the data source artifacts.
Each data source artifact is automatically associated with a rich set of Jenkins build metadata, similar to a digital camera recording automated metadata. Jenkins records build metadata in the corresponding build.xml file, including useful details such as a build status flag (success/failure), a unique build identifier, the user, the version of the processing script, and the options and parameter values that were used in the build.
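This recorded metadata can be read back through the same Jenkins object model; a minimal Groovy sketch for the script console, where the job name 'data-import' and build number 3 are placeholders:

```groovy
// Sketch: read the automated metadata Jenkins records for a build;
// the job name 'data-import' and build number 3 are placeholders.
import jenkins.model.Jenkins
import hudson.model.Job
import hudson.model.ParametersAction

def build = Jenkins.instance.getItemByFullName('data-import', Job).getBuildByNumber(3)
println "result:     ${build.result}"
println "started:    ${build.time}"
println "parameters: ${build.getAction(ParametersAction)?.parameters}"
```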
In fact, build metadata can easily be extended, unlike the limited set of automated metadata that a camera can record. During a freestyle build, users can not only select analysis parameters but also enter useful annotations that extend and supplement the auto-generated metadata. Anything captured on the job build form as a build parameter is recorded in the build.xml and remains associated with the build artifacts (datasets, results).
Finally, a data source build can dynamically generate new metadata about the structure and format of the data source artifact. For tabular data this can include, for example, the number of rows and columns in the dataset and the type of the values in each column (numeric, character, date, etc.). Useful summary statistics can also be generated and stored as data source metadata.
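As a sketch of what such a build step could look like, the following Groovy derives simple structural metadata from a CSV artifact and stores it as a properties file (both file names are made up for illustration):

```groovy
// Sketch: derive structural metadata from a CSV artifact and keep it
// as a Java properties file alongside the dataset.
def lines  = new File('dataset.csv').readLines()
def header = lines.first().split(',')

def props = new Properties()
props['rows']    = (lines.size() - 1).toString()   // data rows, minus header
props['columns'] = header.size().toString()
props['names']   = header.join(';')

new File('dataset-metadata.properties').withOutputStream { out ->
    props.store(out, 'auto-generated data source metadata')
}
```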
Not every build generates a data source artifact. Some builds process the input data with the goal of deriving new information from it. I'll refer to these as metadata-only builds.
To illustrate, suppose you are analyzing trends in the price of a company stock [3]. The data source artifact may consist of a dataset of a company's stock performance (daily open, close, high, and low stock prices) over the past year. The first step is to identify the overall trend. We perform this using a trend-analysis Jenkins job; then we use a second high-low analysis Jenkins job to analyze the 52-week high/low market stock price. The trend and the calculated 52-week high/low prices from these builds are just additional metadata to be associated with the input data source.

Figure 2: An example illustrating data source and metadata-only builds. Data source builds generate dataset(s) and metadata. Metadata-only builds generate metadata related to their input dataset(s).
Metadata-only builds have to address a Jenkins 'operational' challenge: how to relate the input data source to the generated metadata artifacts. While a single Jenkins build has a well-established relationship between its input and output artifacts and metadata, there is no built-in Jenkins mechanism for maintaining these relationships across jobs and builds. The issue gets further complicated when one needs to analyze the same data source iteratively and in a non-linear manner.
Metadata-only builds create artifacts that are not stored in the same build (or even project) as the input data source, and this raises some important questions.
These questions do not frequently arise in the context of DevOps Jenkins pipelines, where artifacts are related by 'upstream and downstream' job relationships established in pipeline designs that are typically serial and immutable in nature. In contrast, for research and data-science Jenkins applications, we must establish artifact data and metadata relationships and maintain them as users analyze data in a non-linear and unpredictable fashion.
We need to address these issues using custom designs that still take advantage of the Jenkins application model and capabilities. We will address this in a follow-up blog.

---

I started using Jenkins-CI almost 5 years ago, and I was ecstatic with the functionality I could get out of this little gem of a workflow engine for life-science data management and analysis. Soon I had it cranking through thousands of images generated in our lab from High Content Image analysis, generating useful numerical data. Most recently, coupled with the power of the R statistical language, much of this numerical data can be processed into a variety of statistical graphs and plots, and into biologically relevant information for the discovery of new medicines.

It was not the first time I had hijacked a build system for life/data-science service. In the past, I had built biological data management workflows based on 'Ant', 'CruiseControl', and 'Gradle', and I was just happy to find a new build system that provided a way to easily enter build parameters (the free-style project) to support the needs of life-science data flows.
Well… not exactly! But still great! Having built workflows with Ant, I was already traumatized from writing and editing XML build files by hand. Jenkins build configuration is still stored as XML, but it's edited through the Jenkins configuration page. The job configuration page is now a major focus of the Jenkins 2.x UI improvements. The form has transitioned from one long, unwieldy configuration page to one (still long) form with improved navigation to the various project sections via a set of easily accessible tab hyperlinks. The new project configuration page is certainly an improvement for initially configuring a project.

However, what still remains a challenge is the lack of any significant tools (graphical or otherwise) for reviewing the project configuration later on. The only way to do this is to return to the configuration page and access each of the job sections to review settings and actions. There is no easy way to visually inspect the project steps, actions, and parameter configuration; not even a simple tabular summary that I'm aware of. What about the various project dependencies in terms of plugins, Groovy scriptlets, and external executables? In my experience (and opinion), the limited options for graphically reviewing Jenkins jobs and drilling down to their components and parameters (a common practice in workflow systems) are a major headache, especially if you are maintaining multiple jobs and several job versions.
Visualization of build pipelines (job instances) seems to have received more attention. Several Jenkins plugins provide graphical visualization of in-progress and completed build pipelines from the build history. One of my favorites (and, with over 18K installations, apparently a favorite of many others) is the Build Pipeline Plugin. The Build Pipeline plugin provides a Build Pipeline View of upstream- and downstream-connected jobs that typically form a build pipeline.
Figure 1: Build Pipeline visualization using the Build Pipeline plugin
The Delivery Pipeline Plugin provides a similar visual representation but jobs can be further grouped to form stages.
Figure 2: Visualization of a pipeline with several stages using the Delivery Pipeline Plugin
In both the Build Pipeline and Delivery Pipeline visual representations, the individual project details displayed are rather limited. The Build Pipeline Plugin allows listing the parameters of the first job, but no other job details are available. Importantly, the visualization capabilities of this plugin are available only after a specialized 'Build Pipeline View' is added. This makes it impractical for visualizing large numbers of jobs and jobs that are not linked into pipelines. It's clear that the main focus of these plugins is the visualization of the status (success/failure) of the pipeline as a whole, not the review of the parameters/actions/steps of a job. A good introduction to pipeline orchestration and visualization can be found in the Orchestrating your Delivery Pipelines with Jenkins blog post.
One of the problems encountered with chained job pipelines is the fact that the entire pipeline is not captured in a single place (file) but is distributed across the individual contributing job configurations. In addition, although the concept of a 'downstream build' is well established and captured as such in the build log, an 'upstream build' is not recorded in the build log. Both of these issues create challenges when the pipeline and its versions need to be documented and maintained. As a result, the Jenkins community and CloudBees (the company providing commercial support for Jenkins) have come up with the concept of scripted pipelines, a.k.a. pipeline as code. Now the entire pipeline can be scripted using a Groovy-based DSL and the Job-DSL plugin, or the Pipeline plugin. These plugins do solve their intended issue, but they further complicate job comprehension, as multiple job configurations are captured in a single script. Unfortunately, these plugins do not offer any additional graphical tools for the detailed comprehension of a pipeline project's structure. Similarly to the Build Pipeline plugin, they provide graphical views of a running pipeline and the build history, but no visibility into the detailed configuration of each job.
Figure 3: Visualization for a scripted Pipeline
The observations I just described make me a little more pragmatic and reserved about the new Blue Ocean UI of Jenkins introduced this past week. Twitter was on fire with comments and kudos for the project! Obviously, the perceived functionality/usability of software is largely attributed to a modern and responsive user interface. A couple of tweets that made me smile said 'Wow, the new Jenkins UI finally looks usable!' and 'Oh wow, a Jenkins UI that not only doesn't suck but actually looks really awesome.'
Figure 4: The new Blue Ocean Jenkins scripted pipeline view
I'm looking forward to testing the new Blue Ocean plugin to see if it improves on the job comprehension issues that I have described in this post. But somehow, I think that it will just provide a more visually pleasing backdrop for the existing graphical visualizations. In promoting Jenkins for life- and data-science duty, I'm obliged to compare it with existing tools in the field such as Knime, the Galaxy Project, and the commercial Pipeline Pilot tool.
Figure 5: The Pipeline Pilot Workflow Editor
Figure 6: Workflow editor in Knime and Galaxy
All of these tools provide interactive configuration and exploration of scientific workflows, with well-organized libraries of components that can be used to construct and enhance data pipelines. I think that Jenkins would benefit from a similar approach to the exploration and construction of jobs and job pipelines. In the meantime, I welcome the Blue Ocean Jenkins project and hope that it will truly form the basis of a more interactive, modern, and comprehensive framework for building future life- and data-science applications.

---

I first met Kohsuke Kawaguchi in the summer of 2014, when I had an opportunity to present the BioUno work using Jenkins at the JUC 2014 meeting in Boston. It was a very busy couple of days, and I did not have much time to connect with him or with the CloudBees group, who were busy answering DevOps questions from the more than 400 people that attended that meeting. Now Kohsuke tweeted that he was looking to meet with people in the Boston area to strengthen his connection with OSS Jenkins practitioners.

As Kohsuke put it: 'in the early days of Jenkins before it was called Jenkins, I and my colleagues were using it to deliver our software. That gave me opportunities to deeply understand the problems developers are facing, and that drove much of feature development. Lately I've been feeling like I lost that touch to practitioners. Nowadays I only hear about specific tactical problems that some specific users have. That's good, but those are limited & skewed data points for OSS Jenkins project. Since I was going to (be in) the area, I thought it'd be a good opportunity to meet real-world developers outside (the) Silicon Valley echo chamber. I'm hoping to meet maybe 3 companies and spend 2 hours each. I'd love you to walk me through how a change committed by a developer gets to your customers, as if I'm a newly hired engineer. Once I understand how your org works, I'd like to discuss your pain points, what things go wrong, and what you are thinking about addressing'. All of these were admirable points that Kohsuke was trying to address; the only problem was that I work with Jenkins in a way totally different from that of the 99% of Jenkins practitioners out there.
I was thrilled when Kohsuke tweeted back his plans for the visit to Boston, and even more so when he replied 'I'm still interested' to my warning that I don't use Jenkins for its intended purpose but for creating life/data-science application platforms. He would come to my company for a two-hour meeting at the end of the day. To share my excitement I emailed my BioUno colleague Bruno Kinoshita in New Zealand, and invited my boss, named (no kidding) Jeremy Jenkins, to attend. Finally, things got even better when Kohsuke mentioned that he would be accompanied by Jesse Glick, a senior developer from CloudBees.
The most important planning notes for this meeting came from Bruno in a few well-thought-out words: 'Wow, that's great news! Kohsuke is an amazing person. Send my regards to him, and if you drink with him, pour some beer in the Japanese style, kind of like this (see picture). Kohsuke taught us this in a bar, and I have friends in Sao Paulo that are still pouring beer to each other like that'. The problem was that I'm not a beer-drinker, and we would not be going out with Kohsuke, as he was already booked with other plans after our meeting.
So, in my 'dry, cold science' way, I prepared an outline with links to some live Jenkins projects to demonstrate the ways we have been using Jenkins in the spirit of the BioUno project, and prepared to discuss future Jenkins improvements that could ease my current 'pain points' and facilitate further adoption of Jenkins in the life- and data-science fields.
It was late afternoon when Kohsuke and Jesse arrived at the research facility where I work in Cambridge, MA. We started with a visit to the high-throughput biology lab, where industrial robots set up screening assays for the discovery of new medicines. I knew that this would be interesting for a pair of techno-geeks, but more importantly, I wanted to give them an idea of the type of problems we try to solve in life and data sciences using Jenkins. I think it was effective and caught their attention, although Kohsuke still took quick glimpses into his phone messenger app.

For the remaining two hours of the visit, we huddled in a small conference room discussing the various ways we had adapted Jenkins as an image processing and data analysis platform for the life sciences. We also discussed the BioUno project and the Active Choices plugin (which was renamed, mostly on Jesse's recommendation, from the original UnoChoice name). The interactive interface parameters afforded by the Active Choices plugin, so critical for data analytics, were a major discussion point. Jesse noted that what we seem to be building is a new application framework based on Jenkins. Finally, we discussed what Jenkins improvements are needed to strengthen its position among other scientific data pipelining and workflow platforms (such as Knime and Galaxy), as I will expand on below.
Jenkins is a great task integration and workflow execution platform and has proved its value in thousands of continuous integration and continuous delivery projects. However, in the life sciences space Jenkins is practically unknown (notwithstanding the efforts of the BioUno project to spread the word about it), and it has some rather strongly entrenched rivals in Knime and Galaxy, two data integration/pipelining and workflow orchestration projects. Naturally, applications that are developed with a focus on particular fields face challenges when used outside their main domain. Just as Galaxy or Knime would not fit very naturally in development operations, Jenkins does not immediately stand out as a natural fit in life/data-science applications. Jenkins draws much of its strength from a rich ecosystem of plugins, and currently only a handful of life-science and data-science oriented plugins exist (all contributed by BioUno). This makes an 'out of the box' Jenkins installation impractical for the life sciences. But beyond that, there are some cross-cutting improvements that would benefit both the development operations and the life/data-science communities, and I was glad to hear both Kohsuke and Jesse agree with this. Such improvements revolve around a better global search engine for Jenkins, build metadata, artifact re-use, and (what I call) relational builds, or perhaps bi-directional build dependencies. It is likely that these improvements will find their way into the OSS version of Jenkins as a result of the current Jenkins community efforts. I also think that the introduction of the Pipeline Plugin, a.k.a. 'pipeline as code', is a good example of a cross-cutting improvement, as it further increases build reproducibility and artifact traceability, both critical for scientific data flows.
Finally, Jenkins needs better graphical project and pipeline configuration and presentation tools. These features are already present in Knime and Galaxy but mostly missing from Jenkins. Such graphical tools would make Jenkins project and pipeline comprehension a lot easier than it currently is. I hope that others in the community feel similarly about this issue, so some solutions will start to emerge.
I'm thankful that the Jenkins community is actively debating the future of the project (with the current release of Jenkins 2.0) and is supportive of new ideas. This will ensure that Jenkins remains relevant both in the main domain for which it was developed and in new fields such as the life and data sciences. I was impressed with Kohsuke's dedication to the project, which sends him around the country meeting with people like me to hear it from the 'trenches'. I hope it was useful for him, but I can certainly say that it has renewed my conviction that the powerful ideas embedded in the Jenkins project, and the support of the community, will continue to spread Jenkins adoption and use in domains for which it was not originally envisioned. I know that this is already the case for many of my projects!

---

In daily practice, lab and data scientists require tools that allow them to integrate a variety of data management and data analysis tasks. Scientific data flows from a variety of instruments and in a variety of formats. Frequently, data needs to be transformed and visualized for quality control purposes before an analysis can even begin. Experimental annotation and analysis metadata need to be overlaid with the instrument measurements and presented in context for improved comprehension of the results. Finally, the data needs to be processed with domain-specific tools (such as image processing, sequence analysis, database searching, statistical learning, etc.) to generate high-quality results from the raw data [1].
In all of these tasks, having easy to use data pipelining tools that scientists can use to progressively manage, visualize, and analyze their data is critical for the efficient, accurate and reproducible processing of scientific data. Various data tasks and tools need to be used and integrated while maintaining an accurate record of their use and intermediate results. Furthermore, in the spirit of team collaboration (another cornerstone of science), raw, intermediate, and final data need to be shared in a transparent and timely way with others.
Having worked with Jenkins-CI in the life sciences context for the last few years, I am finding that it satisfies many of these requirements, and so it can serve in its own right as a robust framework for efficient data pipelining, integration and analysis.
Here I will briefly describe some of the attributes that make Jenkins-CI a suitable platform for data scientists.
A variety of data can be maintained in the Jenkins artifact archives. Builds can upload, access and efficiently process data from a variety of sources to generate and annotate file artifacts. A variety of plugins support typical data management tasks such as copying, archiving, deleting and moving files across file systems and cloud services. Furthermore, Jenkins provides excellent provenance information and tools for file artifacts (such as logs, date-stamps, and fingerprints).
Jenkins is not currently backed by a database system, but this has the advantage of simplicity and flexibility. In addition, Jenkins provides an extensive API for querying build and artifact information across the entire build system, thus providing some of the advantages of structured database storage.
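For example, the Jenkins remote access API exposes a read-only JSON (or XML) view of every job and build under its URL. A minimal Java sketch, assuming a server at http://localhost:8080 and a job named data-qc (both hypothetical):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class JenkinsApiQuery {
    public static void main(String[] args) throws Exception {
        // Every Jenkins job and build exposes a JSON view under .../api/json;
        // the server URL and job name here are hypothetical
        URL url = new URL("http://localhost:8080/job/data-qc/lastSuccessfulBuild/api/json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // build number, result, parameters, artifacts...
            }
        }
    }
}
```

The returned JSON can then be post-processed with any JSON library to extract build numbers, parameters or artifact paths.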
This is one of the areas where Jenkins really shines. Any external process or tool can be easily integrated as a build step. Data ‘wrangling and munging’ are important data-science competencies that can be carried out with ease on Jenkins. Support for the Groovy dynamic scripting language and for Python, which is popular with data scientists, provides virtually unlimited ways of integrating custom scripts and external programs into Jenkins workflows without the need to write custom plugins. Support for ssh allows remote execution of commands and can be easily adapted for high-performance parallel computing tasks. Scientific software packages may be available only on certain compute servers; using ssh build steps you can easily execute these packages on the remote servers and then manage the output using Jenkins. The open source BioUno project has pioneered the implementation of scientific software packages as Jenkins plugins [2].
In the same way that Jenkins can integrate general data ‘wrangling and munging’ tasks, it can also pipeline data through domain-specific software. We have already demonstrated Jenkins applications in areas such as phylogenetics, genetic analysis and image analysis [3,4]. The BioUno project has pioneered the development of domain-specific plugins for Jenkins, but in general any software package that can be executed and parametrized from the command line can be easily executed, monitored and pipelined as a Jenkins project. In fact, we have found that these domain-specific applications benefit from being ‘wrapped’ as Jenkins projects, in that they become easier to access and use by laboratory scientists, without the need for dedicated, computer-savvy data scientists.
At this time Jenkins has a limited number of data analysis plugins, but the ones that are available provide crucial support for data science. For example, the Jenkins R plugin [5] is simple yet powerful: it allows you to take full advantage of the R statistical language, the ‘lingua franca’ of scientific data analysis and visualization. Python, another popular scripting language for analytics, visualization and machine learning, is fully supported in Jenkins through the Python plugin [6] and can also be used for data analysis tasks. Jenkins support for both R and Python makes it a compelling tool for data scientists working with modern statistical and machine-learning algorithms. Interestingly, using Jenkins to maintain data-science artifacts creates a supportive environment for reproducible research. The analytical components and results from multiple analyses can be managed, evaluated, compared and validated using the testing and verification tools popularized by the software engineering community (such as SCM, testing, etc.).
Using Jenkins for data science assumes that we can easily generate and support graphical visualizations and compelling data reports. R and Python packages can generate a variety of visualizations and graphics. Luckily, most of these visualizations are in web-standard formats (PNG, JPEG, PDF, etc.) that can be compiled into useful reports using existing Jenkins plugins. The Image Gallery [7], Summary Report [8] and HTML Publisher [9] plugins are some of the Jenkins plugins that I have successfully used to create comprehensive graphical reports and visualizations.
Finally, results can be easily communicated between data and laboratory scientists through Jenkins and the built-in access and authorization tools that the platform provides.
In summary, I feel that Jenkins makes an excellent platform for data-science experimentation and for providing practical and easy access to data-science algorithms and visualizations to lab scientists. Although several challenges remain for making Jenkins a mainstream platform for data scientists (many are currently experimenting with the concept of analysis ‘notebooks’, such as IPython, Apache Jupyter and Beaker [10-12]), I contend that an even more significant challenge stems from the historical user base of Jenkins. I have found that many DevOps engineers are ambivalent about the life/data-science uses of Jenkins that I and my colleagues at the BioUno project are proposing. Although many appreciate the demonstrable power that Jenkins can inject into computational life/data sciences, many worry about the complexities required to support Jenkins in these challenging, demanding and less understood (from a DevOps engineer’s viewpoint) research areas. Many DevOps shops want to have control over the Jenkins configuration, plugins and environment, and, rightly so, do not want a data scientist to dictate additional complexity. Even so, I have found that the ease with which one can deploy a working instance of Jenkins, and effectively use it outside the DevOps environment, makes Jenkins a relatively easy addition to the toolset that data scientists can deploy and maintain on their own. Of course, it would be great if the DevOps community were to accept and contribute to these new and open research areas, and I hope that the BioUno project will become one of many connecting nodes between DevOps engineers and life/data scientists.
What was suggested in JENKINS-23772 was that, instead of accepting only integers for the width, the plug-in start accepting text values as well. This way `10`, `10px` or `10%` would all be valid values. The challenge in user requests like this is how to maintain backward compatibility in your plug-in while releasing a new version that changes objects and attributes.
An ImageGallery implements the Descriptor/Describable pattern for Jenkins, and users can choose an implementation in the job configuration. The ImageGallery abstract class contains an `imageWidth` Integer attribute, which is persisted to disk by Jenkins using XStream.
You can read more about retaining backward compatibility in this Jenkins Wiki page.
Our task is to change that attribute to String, make sure the behaviour is consistent across the image gallery implementations, and guarantee that Jenkins will not crash when trying to load jobs with the old `imageWidth` Integer attribute.
So first you have to make sure that your Serializable classes bump the `serialVersionUID` value, and that your unit tests still pass after your changes.
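For instance, if the gallery class declares a serial version, the bump is a one-line change (the values are illustrative):

```java
import java.io.Serializable;

public abstract class ImageGallery implements Serializable {
    // Bumped from 1L because the persisted fields of this class are changing
    private static final long serialVersionUID = 2L;
}
```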
If we have data already persisted on disk and read by XStream, changing attributes may result in strange errors. In our case, we would like to change an Integer attribute to a String and persist it again.
The solution in this case is to add the `@Deprecated` annotation to the existing Integer field, add another String field with a different name, and implement the `readResolve` method to load the String value from the Integer value when necessary.
Remember to also move the `@DataBoundConstructor` annotation to your new constructors, and to add `@Deprecated` to the right fields, methods and classes, and so on.
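Putting these pieces together, here is a minimal sketch of the whole pattern; the class name and the new field name are hypothetical, and the real plug-in code differs in detail:

```java
import java.io.Serializable;

import org.kohsuke.stapler.DataBoundConstructor;

public class SimpleImageGallery implements Serializable {

    private static final long serialVersionUID = 2L; // bumped after the change

    /** Old persisted field, kept (deprecated) so XStream can still read old jobs. */
    @Deprecated
    private Integer imageWidth;

    /** New field, accepting values such as "10", "10px" or "10%". */
    private String imageWidthText;

    @DataBoundConstructor
    public SimpleImageGallery(String imageWidthText) {
        this.imageWidthText = imageWidthText;
    }

    /** Old constructor, kept deprecated for compatibility with existing callers. */
    @Deprecated
    public SimpleImageGallery(Integer imageWidth) {
        this.imageWidthText = imageWidth != null ? imageWidth.toString() : null;
    }

    /**
     * Called by XStream after deserialization. When an old job is loaded,
     * only the deprecated Integer field is populated, so its value is
     * migrated into the new String field here.
     */
    protected Object readResolve() {
        if (imageWidthText == null && imageWidth != null) {
            imageWidthText = imageWidth.toString();
            imageWidth = null; // drop the old value so it is not persisted again
        }
        return this;
    }

    public String getImageWidthText() {
        return imageWidthText;
    }
}
```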
Good. So now our code already supports our changes.
There are at least two places where the integer image width was being saved in our previous jobs: the ImageGallery implementation object, and the Action being saved for each build.
Now that we have made our changes in the code, and left the old fields deprecated, we have to tell XStream to use the new field when reading old entries like these.
What this does, basically, is tell our program to use the value of the Integer fields to create a new object with the String fields that we just created. This way, old instances serialized to disk will be deserialized and filled with the old values.
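To see the migration in action, here is a small standalone sketch using XStream directly, reusing the hypothetical SimpleImageGallery class from above (Jenkins does the equivalent internally when loading jobs):

```java
import com.thoughtworks.xstream.XStream;

public class MigrationDemo {
    public static void main(String[] args) {
        XStream xstream = new XStream();
        xstream.allowTypes(new Class[]{SimpleImageGallery.class});
        xstream.alias("gallery", SimpleImageGallery.class);
        // XML as written by the old plug-in version: only the Integer field is present
        String oldXml = "<gallery><imageWidth>10</imageWidth></gallery>";
        SimpleImageGallery g = (SimpleImageGallery) xstream.fromXML(oldXml);
        // readResolve() ran during deserialization and migrated the value
        System.out.println(g.getImageWidthText()); // prints "10"
    }
}
```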
In other words, it will be transparent to users, with no errors on the screen or in the logs, and we will have kept backward compatibility.
Just remember to review your code, make sure your Jelly views pass the right field names, that you are not using the old value anywhere, and that everything works as expected.
Happy hacking!