Amid the COVID-19 crisis we face a need to rapidly mount a scientific response to the pandemic. We have important questions to answer as fast as possible about which medical approaches work best and when to use each option. As patients overwhelm hospitals, physicians and scientists are looking for regulatory pathways for both obvious and still undiscovered treatment options. Many drugs are being considered for repurposing as COVID-19 treatments, and net new options in vaccines and creative medical device approaches are under development. It will take a shift in the process for acquiring clinical data to achieve the speed and quality needed to establish evidence that can prevent new infections, lower the impact on infected patients, and reduce continued loss of life.

For most diseases, Real World Data has been acquired in a purposeful but often very slow manner. Large scale registries that have had major impacts in multiple sclerosis, multiple myeloma, Parkinson’s, and IBD took 3-5 years to plan and execute. COVID-19 does not provide the luxury of waiting on those time frames. But with COVID-19 we do have a global consciousness of the magnitude of the problem, and we are seeing funding and unity towards a common outcome. The absence of those two things is among the primary challenges that limit data availability in Real World Data.

At Graticule we focus on providing clients with advanced Real World Data solutions. This has meant working to access and aggregate harder-to-use unstructured data such as radiology images, free text notes, pathology, IoT, patient reported data, and genomics. Over the past 60 days, like many groups in our industry including Datavant, Health Verity, AWS Data Exchange, TriNetX, and IQVIA, we have turned our attention towards making advanced RWD sets available for critical research in COVID-19. Before looking at the solutions we are working on, let’s begin with the problems we are hoping to solve.

Here is a short and incomplete list of problems to solve with RWD for COVID-19:

  • Lack of visibility into historical disease pathology and treatments for similar infections
    We do not have detailed information on the historical course of disease from patients with non-COVID-19 infections or conditions that led to respiratory and cardiac distress that are similar enough to COVID-19. If we had this information we could gain insight into the biology occurring in patients. Understanding how related diseases have been treated historically with many variations could lead to focusing on treatments that work better. Influenza, pneumonia, and tuberculosis infections, while different from the current novel coronavirus, often lead to related patterns and outcomes of respiratory and cardiovascular system dysfunction. Even environmentally induced situations such as HAPE (High Altitude Pulmonary Edema) and chemical lung injury from smoke in fires may provide useful insights. We have seen high variation in severity among patients and it is possible that the variation relates to a common biological trait that can be tested for and may already be seen in existing variations in other diseases. Having access to laboratory results, medications, treatment regimens, lung imaging, clinical notes, genetics, and outcomes could lead to new findings about drug safety or identification of biomarkers that allow us to better understand treatment outcomes and expectations to have as patients progress through their illness. At a minimum we can learn what to expect regarding response to treatment as complications arise such as secondary infections and novel presentations.
  • Lack of visibility into the comparative effectiveness of treatments being provided to COVID-19 patients over the past few months and the natural course of disease
    COVID-19 treatments are being tested in real time through open label prescribing, compassionate use programs, Emergency Use Authorizations, and phase III clinical trials. A number of novel drugs have been tested, including Gilead’s anti-viral remdesivir, IL-6 inhibitors, and fluoroquinolones, and sponsored trials such as Phase III basket trials and vaccine trials are being structured now, adding to the more than 800 trials already initiated. All of these studies can benefit from an integrated view of the longitudinal history of COVID-19 patients including demographics, medical history, treatment, and outcomes. In particular, the research depends on testing interventions at appropriate points, defining meaningful comparisons, and establishing relevant end-points, since they will guide future medical decisions. To design studies, life sciences clinical development teams need to understand the current treatment patterns and outcomes. Specific data sets from the rapidly evolving standard of care for these patients are needed with sufficient detail to compare clinical trials with the care being provided outside of research protocols. Without controls from standard of care, the new studies will take longer and become more complex, including the challenge of recruiting high mortality patients into potentially parallel placebo controlled studies. Having shared controls through RWD will be more efficient and improve the ethics of research by limiting redundant control arms receiving placebo or standard of care.
  • Understanding of related diseases and biological effects outside of lung infection for drug repurposing
    Drugs that treat a variety of targets may be strong candidates for repurposing. The challenge is to select which ones would have the highest likelihood of utility and the fewest safety issues in the at-risk populations they are intended to treat. Multiple drugs that treat malaria and rheumatoid arthritis are being considered for COVID-19, and drugs for other indications involving the immune system, anti-viral activity, or ACE2 biology may be applicable to reducing viral replication or inflammation. To identify candidate drugs that may demonstrate promise, it is helpful to have detailed medical histories of the patients treated with them, including lab values, imaging, and clinical notes. These may provide important evidence that a drug is acting in a pathway beneficial to preventing or treating COVID-19. Rheumatoid arthritis is a particularly interesting disorder where multiple drugs have been seen to have potential effects because of their cytokine and anti-viral activity. But other disorders, like CAR-T treated cancer or diseases where monoclonal antibodies are used (anti-TNF, anti-ILx), may emerge as critical areas for deep dives into patient care, variance in treatment response, and results. As with early HIV treatments, combination approaches may be needed, and understanding their impact is important to maintaining safety in a highly vulnerable population. Now that the population with information encoded in medical records is growing, we can also look directly at patients on therapy who are COVID-19 positive, alongside propensity matched COVID-19 controls, to determine if treatments are demonstrating protective effects that qualify them for further investigation.
  • Molecular biology information that can support pathway driven drug repurposing or discovery
    Biological level data sets provide insights into how our cells and organs operate in different conditions. Molecular biology is generally where pharmacological interventions have effects, so clinical data alone often will not provide the insights needed to find new options, stratify patients for risk, or predict where to focus for repurposing drugs. We already know a lot about the mechanism that COVID-19 employs to replicate. The virus engages ACE2 and then replicates within the cell. Some of our most detailed molecular biology information comes from genetic studies and cancer research, but it is not easily available today. If we find a series of SNPs or variants that relate to ACE2 and viral related pathways, we might be able to find a way to block the path of the infection at a biological level. Knowing how this works in humans vs. animal models will be essential, as animal data often doesn’t replicate in human studies. Antivirals often use alternate base pairs, but many drugs change how specific proteins operate and cascade signals. Scientists will likely have many findings from COVID-19 patients, or from their research in producing anti-viral compounds, that need to be validated. Giving scientists rapid access to relevant repositories of human data, and to how variants or genetic changes in cancer cells impact health, could save months or years of research. A simple example I looked into was: “Are there any genetic studies relating to pneumonia or tuberculosis susceptibility?” Such queries generate results, but it would be far more valuable to be able to integrate or federate research across multiple repositories on these topics to determine if those findings extend into other areas or fit into broader pathways.
    It is possible that genetic markers, which some patients may already have on file, could allow us to stratify which patients are at risk for respiratory complications and should be followed with different medication approaches.
  • Collecting data in a pandemic is difficult because you can’t engage patients easily
    We want to know more about how infected patients experience the disease. But many of them with mild symptoms need to be ‘at home’ to avoid infecting others. Furthermore, even if they are in a hospital setting, interactions with healthcare professionals need to be limited in order to prevent spreading the virus to care providers. Some of the best tools for ‘seeing’ the patient, such as CT exams, present logistical challenges because of the risk of spreading the infection through equipment that is not dedicated to COVID-19 patients. Any alternative mechanism for understanding the disease in affected patients at a distance could help determine things like progression and effectiveness of treatment. Given that at-home monitoring solutions are coming of age and being used for asymptomatic and mild COVID-19 cases, we can learn from these systems how the disease progresses in cases that do and don’t become severe, in order to intervene earlier on signals we cannot identify today. Understanding the signals for early infections, timing of progression, and predictions for asymptomatic patients can help to plan and execute vaccination clinical trials faster and with better safety standards, given that in target populations for severe symptoms a failure of the vaccine could be fatal.
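
To make the idea of propensity matched controls from the drug repurposing problem above more concrete, here is a minimal sketch of greedy 1:1 nearest-neighbor matching with a caliper. It assumes a propensity score (the estimated probability of being on the drug of interest) has already been computed for every patient by some separate model; the function name, patient ids, scores, and caliper value are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def match_controls(
    treated: Dict[str, float],
    controls: Dict[str, float],
    caliper: float = 0.05,
) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbor matching on propensity score, without replacement."""
    available = dict(controls)  # copy so the caller's dict is not modified
    pairs: List[Tuple[str, str]] = []
    # Process treated patients in a fixed order so the result is reproducible.
    for t_id in sorted(treated):
        if not available:
            break
        t_score = treated[t_id]
        # Closest remaining control by absolute score difference; ties break on id.
        c_id = min(available, key=lambda c: (abs(available[c] - t_score), c))
        # Accept the match only if it falls within the caliper.
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Hypothetical, precomputed propensity scores (assumption: not real data).
on_drug = {"p1": 0.62, "p2": 0.30, "p3": 0.91}
not_on_drug = {"c1": 0.60, "c2": 0.33, "c3": 0.10, "c4": 0.58}
pairs = match_controls(on_drug, not_on_drug)
```

In this toy data, one patient on the drug has no control within the caliper and is left unmatched, which is the usual trade-off between match quality and sample size. A real analysis would also need to check covariate balance after matching.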

PART II – Some solution concepts we are pursuing towards COVID-19 real world data challenges

Many of these problems have the common theme that the data needed to do research isn’t available to research teams yet. The challenge is finding a pathway to obtain that data. Many of them can be solved by integrated approaches.

Lung infections from pathogens have been around for a long time, so this data exists within electronic health records and ancillary systems such as radiology and labs. We can obtain this data by building a ‘lung’ treatment and response data set. Structured information such as medications and diagnoses is available in EHR and claims data sets. These take a variety of forms, and we are looking to existing aggregators such as IBM and IQVIA, and to health systems with research databases, as a resource for this underlying data. We likely need a deeper data set than structured information alone, because much of the variation and detail in care, such as the timing of events, radiological results or morphology, and clinicians’ rationales, is obscured in the structured data. We can link the deeper data sets to traditional sources through data linkage frameworks such as Datavant and Health Verity in order to bring together broad administrative views and deeper unstructured data.

We are working with health information technology partners who integrate and make digitized free text notes and radiology data more useful for clinical teams. They offer the capability to scale access to relevant cases quickly because they are already installed within networks of health systems. Some health systems and academic medical centers have already built repositories that are ready to execute research by building Real World Data Enterprise departments so we are working closely with them to support controlled access.

Many health systems still need specific additional tools such as identification of relevant historical cases, extraction of data from PACS systems, deidentification processes for unstructured data (imaging and notes), and honest broker mechanisms for sharing advanced data types to enable access for research purposes. Our partnership with Medexprim is working to support these types of capabilities across multiple health systems in Europe so that we can help access emergent information in locations such as Italy, where there is a significant treatment effort, and provide a global view of COVID-19 treatment.

In parallel we are working with Health Information Exchange (HIE) networks and data aggregators as a way to rapidly aggregate data across systems for monitoring COVID-19. Exchanges have the potential to acquire information at a larger scale and to provide rapid response to emergent data. Our partner Life Image is the leading national radiology HIE infrastructure. They are working to identify health systems interested in aggregating relevant data. Integrating radiology data alone is insufficient for broader use cases because radiology data needs to be combined with other sources to provide context. Where possible, and in collaboration with health systems interested in participating in research, we can use the Life Image HIE capabilities to share additional information that provides context to the radiology studies, or combine radiology aggregation with existing research tools. We are also working towards using identity protection tools for sharing data across data providers, such as Health Verity and Datavant, to tokenize information in order to link radiology records to other aggregated data such as claims and lab results.
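
As a rough illustration of what tokenized linkage means, the sketch below derives a token from normalized patient identifiers with a keyed hash (HMAC-SHA256) and joins radiology records to claims on that token. This is a generic simplification, not the actual method used by Datavant, Health Verity, or Life Image; the key, field names, patient details, and diagnosis codes are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, List

# Hypothetical shared secret; a real deployment would manage keys securely.
SECRET_KEY = b"demo-key-not-for-production"


def make_token(first: str, last: str, dob: str, key: bytes = SECRET_KEY) -> str:
    """Derive a de-identified token from normalized identifiers using a keyed hash."""
    # Normalize so that spacing and capitalization differences do not break linkage.
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link(radiology: List[dict], claims: List[dict]) -> List[dict]:
    """Join two tokenized data sets on the shared token field."""
    claims_by_token: Dict[str, dict] = {row["token"]: row for row in claims}
    linked = []
    for row in radiology:
        match = claims_by_token.get(row["token"])
        if match is not None:
            linked.append({**match, **row})
    return linked


# Each data holder tokenizes locally and shares only the token plus clinical fields.
radiology = [{"token": make_token("Ana", "Diaz", "1970-01-31"), "study": "chest CT"}]
claims = [
    {"token": make_token(" ana ", "DIAZ", "1970-01-31"), "dx": "J12.89"},
    {"token": make_token("Bo", "Lee", "1985-06-02"), "dx": "J18.9"},
]
linked = link(radiology, claims)
```

The point of the design is that names and dates of birth never leave the source system; only the token travels with the clinical fields. Exact-match tokens like this miss records with typos or name changes, which is one reason production linkage tools are considerably more involved.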

Large scale repositories already in place, such as the UK Biobank and the VA Million Veteran Program, offer broad views of genetics in patient populations where samples and background information are already available. Countless biobanks also hold biological samples or biological information that may be helpful in identifying genes of interest. We have worked with UK Biobank data in the past for machine learning efforts on phenotypes and genetics, which could be adapted to COVID-19. Most pharma groups already have a relationship with the data, but other groups can work either on their own or with teams like Graticule to analyze relevant COVID-19 pathways through approved applications to the UK Biobank or in collaboration with VA researchers.

We have seen a wealth of at-home tools arise as a result of COVID-19. There is a large block of tools for self-diagnosis. These are collecting early data on symptoms, and some have a full integration into medical records that offers a unique view of the early diagnostic pathway and is enriched for mild symptoms. Other groups offer at-home monitoring through remote devices such as Bluetooth spirometers, digital thermometers, and digital pulse ox devices. They are increasingly used to keep patients who are COVID-19 positive, but would be safer isolated in quarantine, out of hospitals. Physician visits such as ED visits are being transitioned to telehealth, which, if the visits are recorded or transcribed, can leave a digital footprint of the patient encounter that was previously unavailable from in-person visits. Patient support networks like Inspire and Patients Like Me have large numbers of patients with specific diseases, such as psoriasis, rheumatoid arthritis, or lupus, who have been treated with drugs of interest for repurposing. These networks enable outreach to patients to collect self-reported data or authorizations to obtain medical records for research initiatives. The data being collected from all these applications can be helpful in identifying early markers. We are working with the teams who develop and support these types of applications to source data for analyses in a variety of use cases.


What we have found to date in trying to move fast is that the biggest hurdle we are facing is clarity of purpose. Often the teams in life sciences companies have an early idea of what study they would like to perform but rather than writing a detailed protocol they ask ‘what data is available?’. This then is challenging to translate in advanced Real World Data because the groups with access to novel information would like details to answer questions like ‘how will you be using the data?’, or ‘What is the study protocol?’. The real rate limiting step for RWE at the moment is structuring clear questions that can translate between these groups looking to answer questions and the teams that have access to data to answer them.

COVID-19 offers organizations in life sciences, healthcare services, and consumer applications that have been working in the RWD space a short window to creatively solve urgent problems. This crisis is a burning platform, and we are seeing groups rise to the occasion and jump to new approaches and sources for RWD to address it.