Trying out Scribus for scientific graphs

This post describes my first attempt at creating a graph for scientific publications using Scribus, a free and open-source (FOSS) desktop publishing (DTP) application. In addition to a WYSIWYG interface, Scribus also allows programmatic editing of page contents using Python. A few lines of code are enough to automate repetitive tasks that require high precision (e.g. placing points inside graphs or markers in scales). I found that this combination makes Scribus an excellent option for creating graphs for scientific publications.

Scribus vs. the competition

Scribus is more precise than run-of-the-mill office software, cheaper and easier to use than commercial DTP software, and more flexible than the graphics libraries in statistical packages such as R. Available export formats include PDF and the most popular image formats; therefore, the output is appropriate for most academic publications.

Creating a simple graph

To test the above, I created the graph below, which depicts a comparison between two groups of patients. Creating this graph was one of the assignments in the Designing Figures and Tables class at UCSD. The design is generally based on recommendations from Tom Lang’s excellent book How to Report Statistics in Medicine.

Graph-UCSD class

Automating the nitty-gritty

Calculating the page coordinates of each point of the curve and of the scales in the axes by hand would be tedious and error-prone, but this task can be automated using a simple Python script.

First, a function to create the x and y axes…

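from scribus import createLine  # Scribus Scripter API, available only inside Scribus

def DrawAxes (base, x_length, y_length):
	# Page y coordinates grow downward, so the y axis runs up to base[1] - y_length.
	createLine (base[0], base[1], base[0] + x_length, base[1])
	createLine (base[0], base[1], base[0], base[1] - y_length)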

…then, a second function to draw the scale markers on the x and y scales…

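def DrawScale (base, length, marker_spacing, marker_length, axis, offset):
	# range() works in both Python 2 and 3; xrange() exists only in Python 2.
	if axis == "x":
		# Markers hang below the x axis, shifted 'offset' units right of the origin.
		for i in range (base[0], base[0] + length, marker_spacing):
			createLine(i + offset, base[1], i + offset, base[1] + marker_length)
	elif axis == "y":
		# Markers extend left of the y axis, shifted 'offset' units above the origin.
		for i in range (base[1], base[1] - length, - marker_spacing):
			createLine(base[0] - marker_length, i - offset, base[0], i - offset)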

…and finally, a function to draw the curves:

def DrawCurve (base, points, offset):
	# Connect each pair of consecutive points with a straight line segment.
	for start, end in zip (points, points[1:]):
		createLine(base[0] + offset[0] + start[0],
			base[1] - offset[1] - start[1],
			base[0] + offset[0] + end[0],
			base[1] - offset[1] - end[1])

Putting it all together:

#Data for the graph.
#'healthy' lists the ages and heights of healthy children (one age-height pair per tuple).
#'ckd' lists the heights and ages of children with CKD.
#'plot_area_base' indicates the origin of the graph.
healthy = [(5,75), (6, 73), (7, 79), (8, 80), (9, 90), (10, 110), (11, 140), (12, 150)]
ckd     = [(5,50), (6, 62), (7, 65), (8, 75), (9, 80), (10, 85),  (11, 110), (12, 120)]
plot_area_base = (250, 500)

#Actual drawing of the graph
DrawScale (plot_area_base, 200, 25, 5, "x", 10)
DrawScale (plot_area_base, 150, 25, 5, "y", 10)
DrawCurve (plot_area_base, [(25 *(i[0] - 5), i[1] - 50) for i in healthy], (10, 10))
DrawCurve (plot_area_base, [(25 *(i[0] - 5), i[1] - 50) for i in ckd], (10, 10))
DrawAxes  (plot_area_base, 200, 150)
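
The list comprehensions passed to DrawCurve convert each age-height pair into plot units inline: 25 units per year of age, counted from age 5, and one unit per unit of height, counted from 50. A minimal sketch of the same conversion as a named helper follows. The name to_plot_units and its parameters are my own illustration, not part of the Scribus API.

```python
def to_plot_units(data, x_min=5, y_min=50, x_scale=25, y_scale=1):
    """Convert (age, height) pairs to offsets from the plot origin."""
    return [(x_scale * (x - x_min), y_scale * (y - y_min)) for x, y in data]

healthy = [(5, 75), (6, 73), (7, 79), (8, 80), (9, 90), (10, 110), (11, 140), (12, 150)]

# Same values as [(25 * (i[0] - 5), i[1] - 50) for i in healthy]
healthy_units = to_plot_units(healthy)
```

The result can be passed to DrawCurve in place of the inline list comprehension.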

The functions above will draw a skeleton graph (shown below), which can be edited to prepare the final version.

Skeleton Graph-UCSD class

After a few more minutes of work (adding a few text fields, inserting scale numbers, and tweaking line patterns), here is the final version:
Graph-UCSD class

Comparison with Excel and PowerPoint

There’s no comparison, really. When I first did this assignment for the Designing Figures and Tables class at UCSD, I used Excel and PowerPoint, which are not really designed for this application. Both programs impose severe limitations on the formatting of the graph. In particular, I could not prevent the lines from touching the axes, choose specific values or intervals for the scales, measure the exact height of data points, or position captions exactly. After much struggling, I had spent more time and produced an inferior graph.

Next steps
The code above is just a quick sketch to demonstrate the practicality of drawing graphs with Scribus. Evidently there are countless other possibilities, such as creating any other kind of graph, reading data files, and customization to suit requirements from publishers, printers, presenters, or other stakeholders in the communications process. I will polish the script and publish an improved version in an upcoming post.
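
As one example of reading data files, here is a minimal sketch that loads age-height pairs from CSV text with Python's standard csv module. It assumes Python 3, and the column names (age, height) and the function name read_points are hypothetical choices for illustration.

```python
import csv
import io

def read_points(csv_file):
    """Return (age, height) tuples from a CSV file object with 'age' and 'height' columns."""
    reader = csv.DictReader(csv_file)
    return [(int(row["age"]), int(row["height"])) for row in reader]

# Stand-in for a real file, e.g. open("healthy.csv", newline="")
sample = io.StringIO("age,height\n5,75\n6,73\n7,79\n")
points = read_points(sample)
```

The resulting list has the same shape as the hard-coded 'healthy' and 'ckd' lists, so it could be fed to the drawing functions unchanged.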

Conclusion
Scribus is an excellent option for creating graphs for scientific publications. It is easier to use than commercial DTP software, more powerful than run-of-the-mill office packages (I created this graph more quickly than when I completed the original assignment using Excel and PowerPoint), and, perhaps most interesting, more flexible than the graphics libraries in statistical packages. If you create graphs for scientific publications and know a little Python, you may want to try it out.


Some thoughts on Coursera’s Personalized Medicine MOOC

I completed Case Studies in Personalized Medicine, a massive open online course (MOOC) offered by Vanderbilt University through Coursera and taught by Dan Roden, MD.

The course

Dr. Roden presented several clinical cases, including both rare, single-gene syndromes (e.g. hypercholesterolemia secondary to PCSK9 deficiency, long-QT syndrome, cystic fibrosis, and some rare forms of diabetes and heart failure) and more common cancers (e.g. BRCA-related tumors and Lynch syndrome), as well as clinically important drug interactions (e.g. those involving the infamous CYP2D6 [1]) and side effects (e.g. of carbamazepine, statins, venlafaxine and lithium).

Much of the course revolved around genome-wide association studies (GWAS), a powerful technique that can identify potentially abnormal points in the genome where relevant pathophysiological mechanisms may be present. Knowing where to look is a good start, even if figuring out what is happening at the point in question is definitely nontrivial.

The figure below shows an example of a Manhattan plot illustrating an interesting if somewhat trivial finding. Each dot is a single-nucleotide polymorphism (SNP, pronounced “snip”).

Figure: Single-nucleotide polymorphisms associated with human eye color in Cape Verde. [2]

In contrast, chronic diseases such as hypertension, diabetes or degenerative disorders will usually generate more ambiguous findings such as those shown in the plots below, where the SNPs point to a surrogate marker that may be related to cardiovascular health. The underlying mechanisms and their relevance remain unclear, and no specific interventions are available as of yet.

Figure: Single-nucleotide polymorphisms vs. retinal venular caliber (a) and retinal arteriolar caliber (b) in a sample of Caucasian patients. [3]
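
For readers unfamiliar with Manhattan plots, the height of each dot is conventionally the negative base-10 logarithm of the SNP's p-value, so stronger associations rise higher. The sketch below illustrates that arithmetic. The p-values are made up, and the 5e-8 genome-wide significance cutoff is a commonly used convention that I am assuming here; it was not taken from the course or from the studies cited.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # commonly used cutoff, assumed for illustration

def manhattan_height(p_value):
    """Height of a SNP's dot on a Manhattan plot: -log10 of its p-value."""
    return -math.log10(p_value)

# Hypothetical p-values, for illustration only
snps = {"snp_a": 1e-12, "snp_b": 3e-4, "snp_c": 0.2}
significant = [name for name, p in snps.items() if p < GENOME_WIDE_THRESHOLD]
```

Only dots that rise above the threshold line would count as genome-wide significant, which is why a clean finding shows a few tall "skyscrapers" over a low skyline.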

Slow but steady progress in the clinic

One notable trend is that gene sequencing, including whole-genome sequencing, is becoming cheaper and more widely available. This means that clinically relevant genetic defects are being discovered, their pathophysiology is being elucidated, and this knowledge is being put into practice in the clinic. A particularly useful development is that patients susceptible to rare but potentially severe drug side effects can now be identified and this information can be used to guide therapy (e.g. abacavir and HLA-B*5701). [4]

Unfortunately, progress has been slow and often involves relatively rare diseases where genetic defects are discrete. The chronic, presumably polygenic diseases that cause most morbidity and mortality worldwide will seldom if ever be traced to a localized, treatable defect in a single gene.

Despite these caveats, clinically relevant findings are accumulating and yielding useful results. Perhaps the most important advance is targeted therapy for cancer. Even with these new drugs, tumors often develop resistance, and the disease’s sheer complexity has led to only modest overall results for some indications. Still, these advances have spawned a new paradigm in which cancer is considered a disease of the genome. [5]

Another remarkable result was obtained with ivacaftor for cystic fibrosis. Ivacaftor can partially restore the function of the CFTR protein in patients with certain mutations. Since low levels of CFTR function are enough to avoid the more severe complications of cystic fibrosis, partial restoration of CFTR’s channel activity can have dramatic clinical effects. In the first randomized trials of ivacaftor, the treatment assignment was evident to all because patients assigned to receive the active drug showed obvious clinical improvement. [6]

Ivacaftor illustrates a paradigm that should become more common: treatment tailored to individual patients according to their genetics. Unfortunately, there will be few diseases in which highly specific treatment yields spectacular results. Sometimes it will yield no results. In one trial, tailoring warfarin dosages according to pharmacogenomics did not improve clinical outcomes. [7]

I sometimes wonder whether it would be possible to identify, e.g., “diabetogenic” or “oncogenic” profiles in genomes and use this information to prevent and/or guide treatment for major diseases. It is even possible that in such cases Manhattan plots will reveal no statistically significant SNPs in the usual sense, but instead consist of patterns of genes that do not yield statistically significant peaks. Maybe the whole field will become much more analytical and mathematical in the future. Whatever the case, maybe there is a radically new, devastatingly effective approach out there waiting to be found.

Copious data is available, but is it enough?

Despite all these new discoveries, maybe we actually have too little information given the sheer complexity of genomics (and transcriptomics and metabolomics). How do 20,000 genes encoded in some 6 billion base-pairs interact with the environment—all possible environments—in hundreds of diseases? And what can be done with this information? Which morsels will be clinically useful?

One way to get more data is through electronic medical records (EMRs) and biobanks, which are repositories of biological samples collected from patients in clinical trials. Systematic collection of clinical data and biological samples from large cohorts of patients would provide the massive samples needed for relevant GWAS and for tracing uncertain inheritance patterns spanning several generations.

One concern is patient privacy. Complex issues involving tamper-proof anonymization of large datasets, compliance with HIPAA and foreign equivalents, data security, informed consent, and interaction with Institutional Review Boards will have to be addressed. I am optimistic that this issue will be solved and even that security will improve. It should be possible to adopt preemptive controls that prevent leakage of data rather than maintaining multiple files and enforcing access restrictions across study sites.

Conclusions

Genomics in medicine has the potential to transform the entirety of healthcare. In all of history, few developments ever spawned such colossal changes. One could imagine a world of automated trawling of massive datasets for information on how to prevent, diagnose, treat and assess the risks of all diseases known to mankind. This is a sobering prospect, which could change the very nature of medicine and healthcare.

Yet another change is the need to disseminate information. Healthcare professionals must obviously keep abreast of these hugely important developments, and the lay public must learn about the science, what it can and cannot do, how it can benefit them and even how it can be abused by peddlers of false hopes and other disreputable parties.

References

  1. National Center for Biotechnology Information. CYP2D6 cytochrome P450 family 2 subfamily D member 6 [ Homo sapiens (human) ] https://www.ncbi.nlm.nih.gov/gene/1565
  2. Beleza S, Johnson NA, Candille SI, Absher DM, Coram MA, Lopes J, et al. (2013) Genetic Architecture of Skin and Eye Color in an African-European Admixed Population. PLoS Genet 9(3): e1003372. https://doi.org/10.1371/journal.pgen.1003372
  3. Ikram MK, Xueling S, Jensen RA, Cotch MF, Hewitt AW, Ikram MA, et al. (2010) Four Novel Loci (19q13, 6q24, 12q24, and 5q14) Influence the Microcirculation In Vivo. PLoS Genet 6(10): e1001184. https://doi.org/10.1371/journal.pgen.1001184
  4. Mallal S, Phillips E, Carosi G, Molina JM, Workman C, Tomažič J, et al. PREDICT-1 Study Team. HLA-B*5701 Screening for Hypersensitivity to Abacavir. N Engl J Med 2008; 358:568-579. https://doi.org/10.1056/NEJMoa0706135
  5. MacConaill LE, Garraway LA. Clinical Implications of the Cancer Genome. Journal of Clinical Oncology 2010 28:35, 5219-5228. https://dx.doi.org/10.1200/JCO.2009.27.4944
  6. Ramsey BW, Davies J, McElvaney NG, Tullis E, Bell SC, Dřevínek P, Griese M, McKone EF, Wainwright CE, et al. VX08-770-102 Study Group. A CFTR Potentiator in Patients with Cystic Fibrosis and the G551D Mutation. N Engl J Med 2011; 365:1663-1672 https://dx.doi.org/10.1056/NEJMoa1105185
  7. Chong K. Warfarin Dosing and VKORC1/CYP2C9. Medscape https://emedicine.medscape.com/article/1733331-overview

Current Medical Diagnosis and Therapy 2018 Edition is out!

I just learned that Current Medical Diagnosis and Therapy (CMDT) 2018 is out.

CMDT is one of my favorite medical texts. Although it is not always as deep as Harrison’s, Cecil-Goldman or Oxford, it is much more than a simplified précis and does a solid job of presenting an overview of the most important medical conditions. It is also updated more frequently and has a more consistent writing style than the standard textbooks.

The big textbooks sometimes wander off into obscure diseases and less relevant minutiae, and CMDT can be a better choice if you just need to quickly review the basics of a subject or prepare for exams.
CMDT 2018

Is there such a thing as US Portuguese?

Last week I got a request to translate some patient information material into Portuguese. The client wanted a Portuguese-speaking translator living in the United States; in other words, they were after some sort of “US Portuguese.” I guess they had heard of “Spanish (United States)” as a locale and extrapolated this to Portuguese, which is a somewhat similar language after all.

But is there such a thing as US Portuguese? Short answer: nope.

The differences between Brazilian and European Portuguese are more marked than between variants of English, to the extent that quite a few common words have acquired different meanings in Portugal and in Brazil. Some curious examples include “comboio” (train in Portugal and convoy in Brazil), “bicha” (queue in Portugal and effeminate gay man in Brazil), and “cu” (informal for buttocks in Portugal, but vulgar for anus in Brazil).

In speech the differences are even more marked. Brazilian Portuguese was heavily influenced by Native Brazilian languages and, to a lesser extent, by Yoruba and other African languages brought along by the slave trade. In the meantime, European Portuguese shifted to the Estremenho dialect, which is different from the language that was “exported” to Brazil in the first place.

Due to these historical differences, some Portuguese television programs and interviews are subtitled when shown in Brazil. However, the Portuguese press regularly interviews Brazilians in Portuguese, and Brazilian telenovelas also find some audience in Portugal.

In summary, both variants are mutually understandable, but you must pick one—unfortunately there is no such thing as “neutral” Portuguese either—and choosing the right variant is important. A message in the “wrong” dialect will be understood most of the time, but it can also be unpersuasive, ineffective, useless or even offensive. It pays to pay attention.

Old but Gold

This is a link to my favorite blog post:

The Craft so Long to Lerne.

Written in 2011 by John E. McIntyre, copyeditor of the Baltimore Sun, it’s old by internet standards. In keeping with the oldie theme, its title is a centuries-old quote and it ends with timeless advice.

Unfortunately, as the post’s title suggests, reducing Mr. McIntyre’s advice to practice requires nothing less than the labor of a lifetime. But this laboriousness is entirely expected: since language is at the very core of our intellect, mastering writing is arguably one of the highest intellectual mountains one could climb.

An overview of Udacity’s Localization Essentials course

I completed Udacity’s Localization Essentials free course a few weeks ago. Here’s their pitch on YouTube:

https://www.youtube.com/watch?v=vUGBbnRT8p8

The course is brief and can be comfortably finished in half a day. It offers a broad if not very deep overview of software localization. In particular, translation memories and glossaries are discussed only briefly and the concept of tagged file formats isn’t discussed at all.

As with other Udacity offerings, this course was developed in partnership with a company, namely Google. The most interesting part of the course was the interviews with localization leads and language managers in charge of localizing Google products. They presented a bird’s eye view of the overall workflow for the localization of Google products, including both apps and cloud-based products.

I suspect they watered down the tech content because the relevant web-development material is readily available elsewhere (both on Udacity itself and on the wider web) and to make the course more accessible to people with a less technical background. Indeed, the responses in the forums suggest that most students are linguists, translators, translation students, and translator wannabes rather than techies. That said, I think a brief overview of the tagged file formats most commonly used in localization or even just a few pointers to the technology on which the products are built and localized would have been helpful to students without a tech background.

Another interesting omission is crowdsourced translation. I wonder whether this is because they don’t put much faith in this approach (and indeed it produces inconsistent or downright awful results) or because they don’t want to tell newbies that at least some companies in their supply chain expect people to work for a pittance or even for free. As The Economist bluntly put it, they work as “the coffee-bean pickers of the future.” Silicon Valley has its ugly side too.

In summary this course provides an overall idea of what localization is about and can help you decide whether you want to plunge in. Notwithstanding the ugliness, there is a job to be (well) done and maybe even an industry to be disrupted.

Hello, World!

I’m Paulo Mendes, medical writer and translator. After taking a deep breath, I decided to start out with this blogging thingy.

Some wit once said that “we write not to say what we think, but to figure out what we think.” In this blog, I’ll try to figure out what I think about medical writing, medical translation and language in general.

I will throw in some other random subjects and stray thoughts, mostly on the geeky side as this post’s title suggests. However, as the antique look suggests, I will stick mostly to literate rather than numerate topics.

Hopefully I will reach interesting conclusions, or at least this little journey will be worth it.