
We have heard a lot about connected labs, or at least that was the revolution a few years ago. But is the data actually connected?
Connected labs
In a connected lab, you can control instruments remotely, and the data measured by an instrument is accessible from your sofa at home if you wish, as long as you have access to the internet.

What if your colleague looks at the data from the instrument without your guidance? Will they know what it is about?
Let’s examine a simple example: a connected pH meter sends a reading of 6.7. What do you make of this? As an isolated measurement, there is not much you can deduce from it, only that the solution measured is slightly acidic. But what is in the solution? Why is the pH being measured? Is it an initial measurement or is it taken after x amount of time? Is something else being measured as well? So many questions left unanswered by this data point alone!
The isolated value of the pH is useless unless you connect it to the context of the measurement.
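To make this concrete, here is a minimal sketch of the difference between the bare reading and the same reading with its context attached. The field names and identifiers below are hypothetical, chosen only to illustrate the idea:

```python
# A bare reading: a number with no context, hard to interpret.
bare_reading = 6.7

# The same reading with the context of the measurement attached.
# All field names and identifiers are hypothetical examples.
contextualised_reading = {
    "quantity": "pH",
    "value": 6.7,
    "timestamp": "2024-03-12T09:41:00Z",
    "instrument_id": "PHM-0042",      # which pH meter took the reading
    "sample_id": "SAMPLE-0387",       # what solution was measured
    "experiment_id": "EXP-1129",      # why it was measured
    "time_point": "t0",               # initial measurement or after x amount of time?
}
```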
Data silos

The pH meter may have the ability to store the measurements taken along with some metadata, such as the date and time, the type of probe used and the instrument serial number. It may also ask the user to manually type an experiment reference.
Now the single data point carries a bit more information with its collection of metadata. But there are a few major issues with this:
- Because the experiment reference is manually entered, there is a risk of errors.
- Is there a defined convention for the experiment reference that everybody uses? One user may be using a lab notebook number while another may be using the experiment title.
- Is the data from the pH meter isolated from the rest of the data i.e. in a silo?
Having all the measurements and their metadata can be useful: you can use them to visualise the evolution of the pH over a period of time, for example. But without the context of why you are measuring the pH, and of the other events surrounding the measurement, the data will not provide any insights. This is the definition of a data silo: an isolated set of data not linked to other data.
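A small, hypothetical sketch of the problem: two systems in the same lab hold data about the same experiment, but because the experiment reference is typed manually and no convention is enforced, the records cannot be joined and each data set stays in its silo.

```python
# Hypothetical records from two isolated systems in the same lab.
# The pH meter log and the ELN both refer to the same experiment,
# but the manually typed references do not match, so a naive join fails.
ph_meter_log = [
    {"experiment_ref": "NB12-p45", "pH": 6.7},  # lab notebook style reference
    {"experiment_ref": "NB12-p46", "pH": 7.1},
]
eln_entries = [
    {"experiment_ref": "Buffer stability study", "scientist": "A. Dupont"},  # title style
]

linked = [
    (measurement, entry)
    for measurement in ph_meter_log
    for entry in eln_entries
    if measurement["experiment_ref"] == entry["experiment_ref"]
]
print(linked)  # [] -- no link found, each data set remains isolated
```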
Data silos are very common in R&D. A lab typically has many instruments recording very specific measurements, each with its own metadata. In many cases, each measurement type and its metadata are stored independently of the others, with little or no link between them. This is not only the case for instruments; any other system may be working in complete isolation. Making sense of the data is not easy and cannot be done in isolation.
Connecting data
Connecting data shouldn’t be an afterthought. Delaying the addition of links increases the risk of errors, and therefore of a cascade of inaccuracies leading to doubtful decisions. Connecting existing systems can be done, but it needs to be thoroughly thought through, looking both at future use and at the existing data, how it will be used and how it will be mapped.

Connecting data doesn’t mean copying data from its originating system to another, manually or programmatically. That is a recipe for errors: beyond the mistakes that manual copying can introduce, the biggest issue with copies is identifying which version is the source of truth. How do you know which one is correct when you have discrepancies?
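As a rough sketch of the alternative, a downstream system can hold a reference back to the originating system and resolve it when needed, so the source system remains the single source of truth. The systems and identifiers below are invented purely for illustration.

```python
# The originating system remains the single source of truth.
instrument_store = {
    ("PHM-0042", "MEAS-001"): {"quantity": "pH", "value": 6.7},
}

# A downstream system (e.g. an ELN entry) holds only a reference, never a copy.
eln_entry = {
    "experiment_id": "EXP-1129",
    "measurement_ref": ("PHM-0042", "MEAS-001"),
}

def resolve(ref):
    """Fetch the value from the source system instead of trusting a copy."""
    return instrument_store[ref]

print(resolve(eln_entry["measurement_ref"]))  # always the value held by the source
```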
Most, if not all, of the R&D domain uses either Electronic Lab Notebooks (ELNs) or Laboratory Information Management Systems (LIMS), or both. ELNs are focused on experiments whilst LIMS are usually focused on samples. Both approaches have their pros and cons. However, using these systems alone is not sufficient to ensure full connection of the data.
To ensure full data connection, also known as traceability, it is important that all the variables and conditions are recorded as part of an experiment or assay: what the study/project is, what the target is, what strain was used, what batch of substance was tested, what the details of the assay are, what procedure was used, and what the conditions of the assay were, i.e. what buffer was used, what batch of that buffer, etc.
As you can see, this very quickly amounts to a huge set of data. Rather than recording all of it within an experiment, it is easier to reference the various items and artefacts and allow seamless access to their associated data. In this scenario, an “inventory” system is crucial for success, recording not only the materials and the instruments used but also the recipes or procedures.
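As a minimal sketch of this referencing approach (all identifiers and fields here are hypothetical), an experiment record can stay small by pointing at inventory entries rather than restating their details:

```python
# Hypothetical inventory: materials, instruments and procedures each have an ID.
inventory = {
    "STRAIN-007":  {"type": "strain",     "name": "E. coli K-12"},
    "BUF-PBS-B12": {"type": "buffer",     "name": "PBS", "batch": "B12"},
    "PHM-0042":    {"type": "instrument", "name": "pH meter", "probe": "glass electrode"},
    "SOP-PH-003":  {"type": "procedure",  "name": "pH measurement SOP v3"},
}

# The experiment references items by ID; their details resolve from the inventory.
experiment = {
    "experiment_id": "EXP-1129",
    "project": "Buffer stability study",
    "uses": ["STRAIN-007", "BUF-PBS-B12", "PHM-0042", "SOP-PH-003"],
}

for item_id in experiment["uses"]:
    print(item_id, "->", inventory[item_id])
```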
It is important that any variation is recorded accurately. The saying “the devil is in the details” applies very much to R&D data. Often a little variation can be the source of a breakthrough.
Seamless connection

The connection of the data needs to be seamless, both from the user’s perspective and from the machines’ perspective. This is especially true when leveraging the recorded data for Machine Learning (ML) and Artificial Intelligence (AI). It is common to hear that data scientists spend more time wrangling the data than analysing it.
It is important that the various systems “speak” the same language. Adopting the right, consistent terminology in all the systems from the start is a very important part of the equation. It is even better if the language is understood outside of the organisation, to enable comparison with public data for example. To identify the data, you need to associate a “label” or “tag” with it: 6.7 needs to be associated with the label/tag pH to be understood. Typos and varied nomenclature can dramatically reduce the usefulness of data.
Using agreed terms and ontologies limits this problem. There are quite a few ontology initiatives, e.g. the Pistoia Alliance ontologies projects, and ontology management products such as SciBite’s CENtree, that can help with this. Adopting these ontologies and practices can really make a difference to a data set, moving it towards FAIR data.
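As a toy illustration of the principle (a real deployment would draw its terms from a managed ontology such as those mentioned above, not from a hard-coded dictionary), free-text labels can be normalised to one agreed term before the data is stored:

```python
# A toy controlled vocabulary: map free-text labels (including typos and
# local variants) to one agreed, canonical term. A real setup would source
# these terms from a managed ontology rather than a hard-coded dict.
CANONICAL_TERMS = {
    "ph": "pH",
    "ph value": "pH",
    "acidity": "pH",
    "temp": "temperature",
    "temp.": "temperature",
}

def normalise_label(raw_label: str) -> str:
    """Return the agreed term for a raw label, or flag it for curation."""
    key = raw_label.strip().lower()
    return CANONICAL_TERMS.get(key, f"UNMAPPED({raw_label})")

print(normalise_label("Ph Value"))      # "pH"
print(normalise_label("pH"))            # "pH"
print(normalise_label("conductivity"))  # "UNMAPPED(conductivity)" -> needs curation
```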
A user with the correct set of permissions for accessing the data should be able to navigate the graph of connected data. A problem often encountered by R&D scientists is the creation of a report when the data is not connected: it has been reported many times that this usually takes a lot of effort and time, keeping the scientist away from the bench. With connected and correctly labelled data, and a system that can interrogate all parts of the data set, reporting should be much faster than manually compiling all the information, for example when filing with a regulatory agency or a patent office.
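A minimal sketch of what navigating that graph could look like; the records, identifiers and links are all hypothetical, and a real system would sit on a proper data store rather than a Python dictionary:

```python
# A toy graph of connected records: each node points at related node IDs.
# Walking the links from an experiment gathers everything a report needs.
records = {
    "EXP-1129":    {"label": "Buffer stability study", "links": ["SAMPLE-0387", "SOP-PH-003"]},
    "SAMPLE-0387": {"label": "PBS batch B12 aliquot",  "links": ["MEAS-001"]},
    "SOP-PH-003":  {"label": "pH measurement SOP v3",  "links": []},
    "MEAS-001":    {"label": "pH = 6.7 at t0",         "links": []},
}

def collect(node_id, seen=None):
    """Depth-first walk over the connected records starting from one node."""
    seen = set() if seen is None else seen
    if node_id in seen:
        return []
    seen.add(node_id)
    gathered = [records[node_id]["label"]]
    for linked_id in records[node_id]["links"]:
        gathered.extend(collect(linked_id, seen))
    return gathered

print(collect("EXP-1129"))  # all the context needed for a report, in one traversal
```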
Beyond reporting, AI and ML systems that can access all the connected, relevant and tagged data may identify patterns that the human eye would miss, simply because of the vast amount of data to trawl through.
Connected Labs? Yes, of course! But the real deal is Connected Data!
Having the traceability/connection and tagging of the data is, in my opinion, a must for future breakthroughs.
I would love to hear your thoughts on the subject. Please leave a comment, or better, contact us.