Big data and development in India. The hype and the reality

Many around the world celebrated the agreement of the Sustainable Development Goals (SDGs) and a new agenda for transformative development by 2030. But practitioners and policy makers were left scratching their heads as to how they would monitor the 169 detailed targets and the even more numerous indicators, never mind understand and achieve the goals themselves.

It is in this context that we’re seeing a growth of interest in using data to help solve development problems. Indeed, we can say that the infrastructures now being built to support data are likely to become central to how we make development decisions in the future.

How will such data infrastructures shape our thinking about development over the next decade? What types of limitations and biases might they embed? How should they best be designed and implemented? It is these questions that we looked to explore in a recent paper [1] analysing big data use for development in India.

In this paper we dug into two cases where big data was being used to support wider development goals rather than commercial ones: the Bengaluru Metropolitan Transport Corporation (BMTC), which is using big data in transport upgrades in Bengaluru, India; and Stelcorp (name changed), a state initiative using big data to improve electricity systems.

Digging into big data

Digging into these cases, we found that both initiatives were connected to longer, often decades-old histories of data collection and decision making. This meant that new data innovations were being introduced in an attempt to understand long-running development problems. For example, the main focus of BMTC was on using vehicle tracking and big data innovations to improve the notoriously unreliable city bus services.

We found that big data innovation allowed improved integration of rich information flows, and led to centralisation of decision making. In Stelcorp, meter data that was previously collected manually was now digitally collected and aggregated (see images below). The supporting infrastructure allowed near real-time analysis of the status of the electricity network, and was more effective at monitoring failures and blackouts. A new central data centre played a growing role in processing and analysing this data. In BMTC, new bus transportation data was aggregated and fed in real time to large screens in a “control centre” where activity was monitored by administrators.

Digitalisation in Stelcorp: Meters such as those on the left supply real time data about network usage. Even manual meter reading data is now often transferred through automated reading devices (right) to later be input into the system.

Beyond day-to-day monitoring, we also saw signs that the new data was feeding into more strategic decisions. In the electricity sector, for example, upgrades have been plagued by poor and politicised decision making, but the state-wide data from Stelcorp is now being used in upgrading decisions.

More conceptually, there is evidence that these initiatives are playing a role in supporting new forms of state commitments, or citizen interaction. BMTC has been associated with a ‘Smart City’ initiative and citizens interacting with a set of efficient urban services. Indeed, BMTC introduced a citizen mobile app for tracking bus routes which has had over 50,000 downloads. In the Stelcorp initiative, state political visions about “24/7 electricity” have in part emerged from the better data that allows improved management of the electricity system.

Limitations

Whilst big data has led to these operational, strategic and visionary advances, there were a number of concerns in these projects. One key concern raised was the quality of data being used in these projects, which was often incomplete, short-term, or skewed.

Most problematic was that data from marginal groups was difficult to obtain. In Stelcorp, automated electricity data came mainly from cities, whereas rural data was still manually collected. In both cases there was often a need for “data wrangling” before the data had value.

These data limitations pose questions of how representative the data being used is of the population. If certain measures are skewed towards those more affluent, data coming from those more marginal might then be seen as “nonconforming” or even deviant. Moreover, the way that the data is selected, measured and transformed in such systems will be important in determining what processes are made visible by data and what might remain in the shadows.

The Smart Cities Challenge: Such visions can be seen to be made viable by the growth of big data. However in reality big data projects often tend to have a narrower focus. Source: http://www.smartcitieschallenge.in/

There were also more general questions about the focus of big data projects. These projects were marketed and discussed under lofty development goals, but in implementation they were often quite narrow projects. BMTC, for all its discussion of smart cities and citizens, was far more focussed on stamping out corruption among bus employees than making the city’s public transport smart.

Further, in all these projects there is scant sharing of the new data produced. These projects have not been about the public shining a light on opaque mechanisms of decision making. In fact, with a growing number of public and private actors involved, mechanisms of decision making are becoming even less transparent.

Big data for development

Big data projects are in their infancy in countries like India, but as these cases show they are becoming important to support decision making on key development issues, not only at an operational level, but in strategic decision making and in supporting new visions of developmental partnerships between citizens, private sector and the state.

However, these initiatives rarely follow the vision of big data driving transformative changes. So far they tend to use problematic data to enhance decision making. In implementation they also tend to focus on quite narrow aspects of problems rather than on the bigger development problems where impact might be greater.

We also need to make sure that big data does not solely lead to technocratic solutions, or underplay the importance of integrating with a wider set of social and political activities for development – data showing electricity pilferage will have limited impact without solving the complexities of local politics of electricity in rural and slum areas, and data on public vehicle movements cannot replace the underfunding of urban transport.

[1] Heeks, R., Rakesh, V., Sengupta, R., Chattapadhyay, S. & Foster, C. (In press) Datafication, Value and Power in Developing Countries: Big Data in Two Indian Public Service Organisations. Development Policy Review.


This is an adapted version of a blog originally posted on the Sheffield Institute of International Development (SIID) blog.

With thanks to Vanya Rakesh & Ritam Sengupta for their research in India and SIID and the University of Manchester for the small grant support for this work.

Positive Deviance: A Data-Powered Approach to the Covid-19 Response

Nations around the world are struggling with their response to the Covid-19 pandemic.  In particular, they seek guidance on what works best in terms of preventive measures, treatments, and public health, economic and other policies.  Can we use the novel approach of data-powered positive deviance to improve the guidance being offered?

Positive Deviance and Covid-19

Positive deviants are those in a population that significantly outperform their peers.  While the terminology of positive deviance is absent from public discourse on Covid-19, the concept is implicitly present at least at the level of nations.  In an evolving list, countries like New Zealand, Australia, Taiwan, South Korea and Germany regularly appear among those seen as most “successful” in terms of their relative infection or death rates so far.

Here we argue first that the ideas and techniques of positive deviance could usefully be called on more directly; second that application of PD is probably more useful at levels other than the nation-state.  In the table below, we summarise four levels at which PD could be applied, giving potential examples and also potential explanators: the factors that underpin the outperformance of positive deviants.

Level: Nation[i]
Potential positive deviants: Countries with very low relative infection or death rates
Potential PD explanators:
  • Early lockdown
  • Extensive testing
  • Use of contact-tracing incl. apps
  • Cultural acceptance of mask-wearing
  • Prior mandatory TB vaccination
  • Quality of leadership

Level: Locality (Regions, Cities)[ii]
Potential positive deviants: Cities and regions with significantly slower spread of Covid-19 infection than peers
Potential PD explanators:
  • Extensive or innovative community education campaigns
  • Testing well in excess of national levels
  • Earlier-than-national lockdown
  • Extensive sanitisation of public transport
  • Quality and breadth of local healthcare
  • Quality of leadership

Level: Facility (Hospitals, Health Centres)[iii]
Potential positive deviants: Health facilities with significantly higher recovery rates than peers
Potential PD explanators:
  • Innovative use of existing (scarce) healthcare technologies / materials
  • Innovative use of new healthcare technologies: AI, new treatments
  • Level of medical staff expertise and Covid-19-specific training
Potential positive deviants: Health facilities with significantly lower staff infection rates than peers
Potential PD explanators:
  • Provision of high-quality personal protective equipment in sufficient quantity
  • Strict adherence to infection monitoring and control measures
  • Strict adherence to high-quality disinfection procedures
  • Innovative use of contact-free healthcare technologies: chat bots, robots, interactive voice response, etc

Level: Individual[iv]
Potential positive deviants: Individuals in vulnerable groups who contract full-blown Covid-19 and survive
Potential PD explanators:
  • Psychological resilience
  • Physical fitness
  • Absence of underlying health conditions
  • Effective therapies
  • Genetics

At present, items in the table are hypothetical and/or illustrative but they show the significant value that could be derived from identification of positive deviants and their explanators.  Those explanators that are under social control – such as use of technological solutions or policy/managerial measures – can be rapidly scaled across populations.  Those explanators such as genetics or pre-existing levels of healthcare capacity which are not under social control can be built into policy responses; for example in customising responses to particular groups or locations.

Evidence from positive deviance analysis can help currently in designing policies and specific interventions to help stem infection and death rates.  Soon it will be able to help design more-effective lockdown exit strategies as these start to show differential results, and as post-lockdown positive deviants start to appear.

However, positive deviance consists of two elements; not just outperformance but outperformance of peers.  It is the “peers” element that confounds the value of positive deviance at the nation-state level.

Public discourse has focused mainly on supposedly outperforming nations [v]; yet countries are complex systems that make meaningful comparisons very difficult[vi]: dataset definitions are different (e.g. how countries count deaths); dataset accuracy is different (with some countries suspected of artificially suppressing death rates from Covid-19); population profiles and densities are different (countries with young, rural populations differing from those with old, urban populations); climates are different (which may or may not have an impact); health service capacities are different; pre-existing health condition profiles are different; testing methods are different; and so on.  Within all this, there is a great danger of apophenia: the mistaken identification of “patterns” in the data that are either not actually present or which are just random.

More valid and hence more useful will be application of positive deviance at lower levels.  Indeed, the lower the level, the more feasible it becomes to identify and control for dimensions of difference and to then cluster data into true peer groups within which positive deviants – and perhaps also some of their explanators – can then be identified.

Data-Powered Positive Deviance and Covid-19

The traditional approach to identifying positive deviants has been the field survey: going out into human populations (positive deviants have historically been understood only as individuals or families) and asking questions of hundreds or thousands of respondents.  Not only was this time-consuming and costly but it also becomes more risky or more difficult or even impractical during a pandemic.

Much better, then, is to look at analysis of large-scale datasets which may be big data[vii] and/or open data, since this offers many potential benefits compared to the traditional approach[viii].  Many such datasets already exist online[ix], while others may be accessed as they are created by national statistical or public health authorities.

Analytical techniques, such as those being developed by the Data-Powered Positive Deviance project, can then be applied: clustering the data into peer groups, defining the level of outperformance needed to be classified as a positive deviant, identifying the positive deviants, then interrogating the dataset further to see if any PD explanators can be extracted from it.
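The first three of these steps can be illustrated with a minimal sketch. It is not the project's actual method: the peer groups are supplied as labels rather than produced by a clustering algorithm, the field names (`name`, `peer_group`, `rate`) and the district values are hypothetical, and the outperformance threshold of 1.5 standard deviations is an assumption chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_positive_deviants(units, threshold_sd=1.5):
    """Return names of units that outperform their own peer group.

    units: list of dicts with hypothetical keys 'name', 'peer_group'
           and 'rate' (a lower rate means a better outcome).
    threshold_sd: how many standard deviations below the peer-group
           mean a unit must be to count as a positive deviant.
    """
    # Step 1: group units into peer groups (labels stand in for clustering).
    groups = defaultdict(list)
    for unit in units:
        groups[unit["peer_group"]].append(unit)

    deviants = []
    for members in groups.values():
        rates = [m["rate"] for m in members]
        mu, sd = mean(rates), pstdev(rates)
        if sd == 0:
            continue  # no variation, so nobody outperforms
        # Steps 2 and 3: apply the threshold within the peer group only.
        for m in members:
            if (mu - m["rate"]) / sd > threshold_sd:
                deviants.append(m["name"])
    return sorted(deviants)


# Made-up data: infection rates per 100,000 in eight districts.
districts = [
    {"name": "A", "peer_group": "urban", "rate": 120.0},
    {"name": "B", "peer_group": "urban", "rate": 115.0},
    {"name": "C", "peer_group": "urban", "rate": 125.0},
    {"name": "D", "peer_group": "urban", "rate": 60.0},
    {"name": "E", "peer_group": "rural", "rate": 40.0},
    {"name": "F", "peer_group": "rural", "rate": 41.0},
    {"name": "G", "peer_group": "rural", "rate": 39.0},
    {"name": "H", "peer_group": "rural", "rate": 40.0},
]

print(find_positive_deviants(districts))  # ['D']
```

In this made-up data, district D is flagged even though every rural district has a lower absolute rate, because outperformance is judged only against peers.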

An example already underway is clustering the 368 districts in Germany based on data from the country’s Landatlas dataset and identifying those which are outperforming in terms of spread of the virus.  Retrospective regression analysis is already suggesting structural factors that may be of importance in positive deviant districts: extent and nature of health infrastructure including family doctors and pharmacies, population density, and levels of higher education and of unemployment.

This can then be complemented in two directions – diving deeper into the data via machine learning to try to predict future spread of the disease; and complementing this large-scale open data with “thick data” using online survey and other methods to identify the non-structural factors that may underlie outperformance.  The latter particularly will look for factors under socio-political control such as policies on lockdown, testing, etc.

Of course, great care must be taken here.  Even setting aside deliberate under-reporting, accuracy of the most basic measures – cases of, and deaths from Covid-19 – has some inherent uncertainties[x].  Beyond accuracy are the broader issues of “data justice”[xi] as it applies to Covid-19-related analysis[xii], including:

  • Representation: the issue of who is and is not represented on datasets. Poorer countries, poorer populations, ethnic minority populations are often under-represented.  If not accounted for, data analysis may not only be inaccurate but also unjust.
  • Privacy: arguments about the benefits of analysing data are being used to push out the boundaries of what is seen as acceptable data privacy; opening the possibility of greater state surveillance of populations. As Privacy International notes, any boundary-pushing “must be temporary, necessary, and proportionate”[xiii].
  • Access and Ownership: best practice would seem to be datasets that are publicly-owned and open-access with analysis that is transparently explained. The danger is that private interests seek to sequester the value of Covid-19-related data or its analysis.
  • Inequality: the key systems of relevance to any Covid-19 response are the economic and public health systems. These contain structural inequalities that benefit some more than others.  Unless data-driven responses take this into account, those responses may further exacerbate existing social fracture lines.

However, if these challenges can be navigated, then the potential of data-powered positive deviance can be effectively harnessed in the fight against Covid-19.  By identifying Covid-19 positive deviants, we can spotlight the places, institutions and people who are dealing best with the pandemic.  By identifying PD explanators, we can understand what constitutes best practice in terms of prevention and treatment; from public health to direct healthcare.  By scaling out those PD explanators within peer groups, we can ensure a much-broader application of best practice which should reduce infections and save lives.  And using the power of digital datasets and data analytics, we can do this in a cost- and time-effective manner.

The “Data-Powered Positive Deviance” project will be working on this over coming months.  We welcome collaborations with colleagues around the world on this exciting initiative and encourage you to contact the GIZ Data Lab or the Centre for Digital Development (University of Manchester).

This blogpost was co-authored by Richard Heeks and Basma Albanna and was originally published on the Data-Powered Positive Deviance blog.

[i] https://interestingengineering.com/7-countries-keeping-covid-19-cases-in-check-so-far; https://www.forbes.com/sites/avivahwittenbergcox/2020/04/13/what-do-countries-with-the-best-coronavirus-reponses-have-in-common-women-leaders; https://www.maskssavelives.org/; https://www.bloomberg.com/news/articles/2020-04-02/fewer-coronavirus-deaths-seen-in-countries-that-mandate-tb-vaccine

[ii] https://www.weforum.org/agenda/2020/03/how-should-cities-prepare-for-coronavirus-pandemics/; https://www.wri.org/blog/2020/03/covid-19-could-affect-cities-years-here-are-4-ways-theyre-coping-now; https://www.fox9.com/news/experts-explain-why-minnesota-has-the-nations-lowest-per-capita-covid-19-infection-rate; https://www.bbc.co.uk/news/world-asia-52269607

[iii] https://hbr.org/2020/04/how-hospitals-are-using-ai-to-battle-covid-19; https://www.cuimc.columbia.edu/news/columbia-develops-ventilator-sharing-protocol-covid-19-patients; https://www.esht.nhs.uk/2020/04/02/innovation-and-change-to-manage-covid-19-at-esht/; https://www.med-technews.com/topics/covid-19/; https://www.innovationsinhealthcare.org/covid-19-innovations-in-healthcare-responds/; https://www.cnbc.com/2020/03/23/video-hospital-in-china-where-covid-19-patients-treated-by-robots.html; https://www.researchprofessionalnews.com/rr-news-new-zealand-2020-4-high-quality-ppe-crucial-for-at-risk-healthcare-workers/; https://www.ecdc.europa.eu/sites/default/files/documents/Environmental-persistence-of-SARS_CoV_2-virus-Options-for-cleaning2020-03-26_0.pdf

[iv] https://www.sacbee.com/news/coronavirus/article241687336.html; https://www.thelocal.it/20200327/italian-101-year-old-leaves-hospital-after-recovering-from-coronavirus; https://www.vox.com/science-and-health/2020/4/8/21207269/covid-19-coronavirus-risk-factors; https://www.medrxiv.org/content/10.1101/2020.04.22.20072124v2; https://www.bloomberg.com/news/articles/2020-04-16/your-risk-of-getting-sick-from-covid-19-may-lie-in-your-genes

[v] Specifically, this refers to the positive discourse.  There is a significant “negative deviant” discourse (albeit, again, not using this specific terminology) that looks especially at countries and individuals which are under-performing the norm.

[vi] https://www.bbc.co.uk/news/52311014; https://www.theguardian.com/world/2020/apr/24/is-comparing-covid-19-death-rates-across-europe-helpful-

[vii] https://www.forbes.com/sites/ciocentral/2020/03/30/big-data-in-the-time-of-coronavirus-covid-19; https://healthitanalytics.com/news/understanding-the-covid-19-pandemic-as-a-big-data-analytics-issue

[viii] https://doi.org/10.1002/isd2.12063

[ix] E.g. via https://datasetsearch.research.google.com/search?query=coronavirus%20covid-19

[x] https://www.medicalnewstoday.com/articles/why-are-covid-19-death-rates-so-hard-to-calculate-experts-weigh-in; https://www.newsletter.co.uk/health/coronavirus/coronavirus-world-health-organisation-accepts-difficulties-teasing-out-true-death-rates-covid-19-2527689

[xi] https://doi.org/10.1080/1369118X.2019.1599039

[xii] https://www.opendemocracy.net/en/openmovements/widening-data-divide-covid-19-and-global-south/; https://www.wired.com/story/big-data-could-undermine-the-covid-19-response/; https://www.thenewhumanitarian.org/opinion/2020/03/30/coronavirus-apps-technology; https://botpopuli.net/covid19-coronavirus-technology-rights

[xiii] https://privacyinternational.org/examples/tracking-global-response-covid-19; see also https://globalprivacyassembly.org/covid19/

Measuring the Broadband Speed Divide using Crowdsourced Data

Digital applications and services increasingly require high-speed Internet connectivity. Yet a strong “broadband divide” exists between nations [1,2]. We try to understand how big data can be used to measure this divide. In particular, what new measurement opportunities can crowdsourced data offer?

The broadband divide has been widely measured using subscription rates. However, the broadband speed divide measured using observed speeds has been less explored due to the lack of data in the hands of regulators and statistical offices. This article focuses on measuring the fixed-network broadband speed divide between developed and developing countries, exploring the benefits and limitations of using new crowdsourced data.

To this aim we used measurements from the Speedtest Global Index, generated by Ookla using data volunteered by Internet users verifying the speed of their Internet connections [3]. These crowdsourced tests allow this firm to estimate monthly measurements of the average upload and download speeds at the country level.

The dataset used for this analysis comprised monthly data, from January to December 2018, for a total of 120 countries. Using the income and regional categorisations set by the World Bank we identified 64 developing countries and 54 developed countries in seven regions. Complete data for only two of the least developed countries were available so these were not included in the analysis.

The following table presents the download and upload speed averages on the fixed network, aggregated by region and level of development, and the totals for all the countries in our final sample (n=118), while the figure below shows the download and upload speeds aggregated by level of development.

Table 1. Average upload and download speed by region and development level, fixed network. January – December 2018 (Mbps)

Note: Unweighted averages
Source: Author calculations using data from Ookla’s Speedtest Global Index [3]

Figure 1. Average upload and download speed by level of development, fixed network. January – December 2018 (Mbps)

-Download speeds. We observe that the divide between developed and developing countries is pronounced, with average download speeds for the latter being around one-third of the former. However, the divide is also evident within regions: in the developed world, countries in North America have speeds three times higher than those in the Middle East. Among the developing countries, those in Europe & Central Asia have the highest download speeds and those in the Middle East & North Africa have the lowest. Overall, download speeds are much lower in the developing world, thus creating an important impediment to the use of data-intensive digital applications and services.

-Upload speeds. Overall, there is a divide between developed and developing countries similar in magnitude to the one observed in download speeds. However, when looking at the group of developed countries we see that regional rankings differ from those identified using download speeds: the East Asia & Pacific region ranks first and North America ranks third – the latter with upload speeds that are two-thirds of its download speeds. Across regions, upload speeds are always slower in the developing world, and again the Middle East & North Africa region ranks at the bottom; but the gap between download and upload speeds is smaller in the developing world. Considering that faster upload speeds are also required in a data-intensive era, the majority of countries are far from the ideal of having faster networks with symmetrical speeds.
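The unweighted aggregation behind these comparisons can be sketched as follows. This is only an illustration of the arithmetic: the function name, field names and speed values are hypothetical and are not taken from the Speedtest Global Index.

```python
from collections import defaultdict
from statistics import mean


def unweighted_group_means(rows, key, value="download_mbps"):
    """Average country-level speeds per group.

    Each country counts equally, regardless of its population or the
    number of speed tests behind its figure (an unweighted average).
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[value])
    return {group: mean(values) for group, values in grouped.items()}


# Illustrative values only (Mbps), for four hypothetical countries.
rows = [
    {"country": "X1", "level": "developed", "download_mbps": 60.0},
    {"country": "X2", "level": "developed", "download_mbps": 90.0},
    {"country": "Y1", "level": "developing", "download_mbps": 20.0},
    {"country": "Y2", "level": "developing", "download_mbps": 30.0},
]

means = unweighted_group_means(rows, "level")
# Size of the divide: developing-country average as a share of the
# developed-country average.
ratio = means["developing"] / means["developed"]
print(means, round(ratio, 2))
```

The same function can be called with a regional key instead of the development level to reproduce the within-region comparisons, provided each row carries a region field.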

Some benefits and limitations are identified when measuring the broadband speed divide using this type of crowdsourced data.

-Benefits. First, the availability of these types of data allows us to measure the broadband speed divide between developed and developing countries using observed instead of theoretical speeds. Second, these measurements are openly available on a website that can be accessed by the general public at no cost. Third, the divide can be measured and tracked over time more frequently than when using survey or administrative data. Finally, this site reports both download and upload speeds which are important to measure in a data-intensive era.

-Limitations. Even if there are data available for a good number of countries there are no complete data about the least developed countries, leaving behind this group. Also, there might be some bias in the production of data as crowdsourced measurements might be coming from ICT-literate individuals in certain countries [4]. Finally, from this source it is not possible to access complete datasets with additional data points such as the number of observations, medians, and latencies for each country.

These findings derive from a broader research project that, overall, is researching use of big data for measurement of the digital divide.  Readers are welcome to contact the author for details of that broader project: luis.riveraillingworth@manchester.ac.uk

References

[1] ITU (2018). Measuring the Information Society Report 2018. Geneva, Switzerland: International Telecommunication Union.

[2] Broadband Commission (2018). The State of the Broadband: Broadband catalyzing sustainable development. Geneva, Switzerland: Broadband Commission for Sustainable Development.

[3] Ookla. (2018). Speed Test Global Index [Online]. Available: http://www.speedtest.net/global-index/about [Accessed 01/03/2019]

[4] Bauer, S., Clark, D. D. & Lehr, W. (2010). Understanding broadband speed measurements. In TPRC 2010. Available at SSRN: https://ssrn.com/abstract=1988332

Using Big Data to Learn from Positive Outliers

Why do a few individuals, communities or organisations achieve significantly better results than their peers?  The positive deviance approach tries to answer this question.

The story began in 1990, when the Vietnamese government invited Save the Children (SCF) to help overcome the problem of child malnutrition.  Jerry Sternin, the SCF Programme Director, was asked to demonstrate impact within six months and decided to try the idea of positive deviance.  Building on past work[1], he undertook a village survey of child height and weight, looking for positive deviants: children from poor families, living among high malnutrition rates, who were nonetheless well-nourished.

In the pilot survey, he found six such families and began to study them intensively (see Figure 1).  By observing the food preparation, cooking and serving behaviours of these families, he found three consistent yet rare behaviours. Mothers of positive deviants:

  1. washed their children’s hands every time they came in contact with anything unclean;
  2. added to their children’s diet tiny shrimps from the rice paddies, and the greens from sweet potato tops; and
  3. fed their children less per meal but more often: four to five times per day compared to two times in non-positive deviant families.

Sternin and his team then scaled out those simple, affordable, community-inspired practices and, within two years, this had reduced malnutrition by 80% in 250 communities, rehabilitating an estimated 50,000 malnourished children[2].

Figure 1: Jerry Sternin speaking to mothers in a village in Vietnam

The simple power of the positive deviance (PD) approach has led to its successful application in more than 60 countries across the globe[3].  Yet PD still faces a number of challenges to its diffusion and implementation.  As a result, we decided to investigate whether big data might help address those challenges, via a systematic review, published in the Electronic Journal of Information Systems in Developing Countries.

A priori, big data provides opportunities in relation to two main PD challenges.

1. Time, Cost and Sample Size. Relying on in-depth primary data collection, the PD approach is time- and labour-intensive with costs proportional to sample size[4]. As a result, PD sample sizes are traditionally small.  Statistically and practically, this can make it hard to identify positive deviants, given their relative rarity (see Figure 2)[5].  By contrast, cost of gathering big data tends to be very low since it often makes use of already existing “data exhaust” from digital processes.  With big data thus covering large – often very large – sample sizes, greater numbers of PDs can be identified, and generalisation to even-larger populations is easier.

Figure 2: Positive deviants in a normal distribution

2. Domain and Geographic Scope. To date, most applications of PD have been highly concentrated. In a recent systematic literature review[6], 89% of applications in developing countries were in public health, 83% were in rural communities, and just four countries had hosted roughly half of all PD implementations.  A simultaneous review of big data in developing countries, on the other hand, showed datasets and demonstrated value across a much wider set of domains and locations.  As a result, big data could help positive deviance to break from its current path dependency.
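To make the sample-size point in the first challenge concrete, the sketch below estimates how many positive deviants one would expect to find at different sample sizes. It assumes outcomes follow a normal distribution, as in Figure 2, and it assumes a cut-off of two standard deviations above the mean; that cut-off is an illustrative choice, not one stated in the review.

```python
from statistics import NormalDist


def expected_positive_deviants(sample_size, cutoff_sd=2.0):
    """Expected number of units lying more than cutoff_sd standard
    deviations above the mean, assuming normally distributed outcomes."""
    tail_share = 1.0 - NormalDist().cdf(cutoff_sd)  # about 2.3% at 2 SD
    return sample_size * tail_share


# A small field survey versus a large existing dataset.
for n in (100, 100_000):
    print(n, round(expected_positive_deviants(n), 1))
```

Under these assumptions, a survey of 100 households would be expected to contain only two or three positive deviants, while a dataset of 100,000 records would contain a couple of thousand.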

To assess these and other benefits that big data may bring to the PD approach – relating to behaviour identification, methodological risk, and scalability – a “big data-based positive deviance” research project has been designed and is underway.  The project is currently identifying positive deviants from large-scale datasets in the education and agriculture domains, with results planned to emerge in 2019.

For further details on the challenges of positive deviance and the opportunities offered by big data, please refer to the review article.

REFERENCES

[1] Wishik, S. M. & Van Der Vynckt, S. (1976) The use of nutritional “positive deviants” to identify approaches for modification of dietary practices, American Journal of Public Health, 66(1), 38–42. Zeitlin, M. F. et al. (1990) Positive Deviance in Child Nutrition: With Emphasis on Psychosocial and Behavioural Aspects and Implications for Development. Tokyo: United Nations University.
[2] Sternin, J. (2002) Positive deviance: a new paradigm for addressing today’s problems today, The Journal of Corporate Citizenship, 57–63.
[3] Felt, L. J. (2011) Present Promise, Future Potential: Positive Deviance and Complementary Theory.  Lapping, K. et al. (2002) The positive deviance approach: challenges and opportunities for the future, Food and Nutrition Bulletin, 23(4 Suppl), 130–7.  Marsh, D. R., Schroeder, D. G., Dearden, K. A., Sternin, J. & Sternin, M. (2004) The power of positive deviance, BMJ, 329(7475), 1177–1179.
[4] Marsh et al. (ibid.).
[5] Springer, A., Nielsen, C. & Johansen, I. (2016) Positive Deviance by the Numbers. Positive Deviance Initiative. Available at: https://positivedeviance.org/background/.
[6] Albanna, B. & Heeks, R. (2018) Positive deviance, big data and development: a systematic literature review, Electronic Journal of Information Systems in Developing Countries.

Big Data and Healthcare in the Global South

The global healthcare landscape is changing. Healthcare services are becoming ever more digitised with the adoption of new technologies and electronic health records. This development typically generates enormous amounts of data which, if utilised effectively, have the potential to improve healthcare services and reduce costs.

The potential of big data in healthcare

Decision making in medicine relies heavily on data from different sources, such as research and clinical data, rather than solely on individuals’ training and professional knowledge. Historically, healthcare organisations have often based their decisions on an incomplete picture of the reality on the ground, which could lead to poor health outcomes. This issue has recently become more manageable with the advent of big data technologies.

Big data comprises unstructured and structured data from clinical, financial and operational systems, together with data from public health records and social media that extend beyond the health organisations’ walls. Big data can therefore support more insightful analysis and enable evidence-based medicine by making data transparent and usable in much broader varieties, in much larger volumes and at higher velocities than were ever previously available to healthcare organisations [1].

Using big data, healthcare providers can, for example, manage population health by identifying patients at high risk during disease outbreaks and then taking preventive action. In one case, Google used data from user search histories to track the spread of influenza around the world in near real time (see figure below).

Google Flu Trends correlated with influenza outbreak [2]

Big data can also be used to identify procedures and treatments that are costly or deliver insignificant benefits. For example, one healthcare centre in the USA has been using clinical data to bring to light costly procedures and other treatments. This helped it to identify and reduce unnecessary procedures and duplicate tests. In essence, big data helped not only to improve standards of patient care but also to reduce the costs of healthcare [3].

Medical big data in the global south

The potential healthcare benefits of big data are exciting, and they may offer the most significant rewards in developing countries. While healthcare systems globally face challenges in improving health outcomes and reducing costs, these issues can be particularly severe in developing countries.

Lack of sufficient resources, poor use of existing funds, poverty, and lack of managerial and related capabilities are the main differences between developing and developed countries. This means health inequality is more pronounced in the global south. Equally, mortality and birth rates are relatively high in developing countries as compared to developed countries, which have better-resourced facilities [4].

Given improvements in the quality and quantity of clinical data, the quality of care can be improved. In the global south in particular, where health is more a question of access to primary healthcare than a question of individual lifestyle, big data can play a prominent role in improving the use of scarce resources.

How is medical big data utilised in the global south?

To investigate this key question, I analysed the introduction of Electronic Health Records (EHR), known as SEPAS, in Iranian hospitals. SEPAS is a large-scale project which aims to build a nationally integrated system of EHR for Iranian citizens. Over the last decade, Iran has progressed from having no EHR to 82% EHR coverage for its citizens [5].

EHR is one of the most widespread applications of medical big data in healthcare. In effect, SEPAS is built to harness data and extract value from it, and to make real-time, patient-centred information available to authorised users.

However, the analysis of SEPAS revealed that medical big data is not utilised to its full potential in the Iranian healthcare industry. If the big data system is to be successful, the harnessed data should inform decision-making processes and drive actionable results.

Currently, data is gathered effectively in Iranian public hospitals, meaning that the raw and unstructured data is mined and classified to create a clean set of data ready for analysis. This data is also transformed into summarised and digestible information and reports, confirming that real potential value can be extracted from the data.

In spite of this, the benefit of big data is not yet realised in guiding clinical decisions and actions in Iranian healthcare. SEPAS is only being used in hospitals by IT staff and health information managers who work with data and see the reports from the system. However, the reports and insights are not often sent to clinicians and little effort is made by management to extract lessons from some potentially important streams of big data.

Limited utilisation of medical big data in developing countries has also been reported in other studies. For example, a recent study in Saudi Arabia [6] reported a low number of e-health initiatives. This suggests that the utilisation of big data faces more challenges in these countries.

Although this study cannot claim to have given a complete picture of the utilisation of medical big data in the global south, some light has been shed on the topic. While there is no doubt that medical big data could have a significant impact on the improvement of healthcare in the global south, there is still much work to be done. Healthcare policymakers in developing countries, and in Iran in particular, need to reinforce the importance of medical big data in hospitals and ensure that it is embedded in practice. To do this, the barriers to effective datafication should first be investigated in this context.

References

[1] Kuo, M.H., Sahama, T., Kushniruk, A.W., Borycki, E.M. and Grunwell, D.K. (2014). Health big data analytics: current perspectives, challenges and potential solutions. International Journal of Big Data Intelligence, 1(1-2), 114-126.

[2] Dugas, A.F., Hsieh, Y.H., Levin, S.R., Pines, J.M., Mareiniss, D.P., Mohareb, A., Gaydos, C.A., Perl, T.M. and Rothman, R.E. (2012). Google Flu Trends: correlation with emergency department influenza rates and crowding metrics. Clinical Infectious Diseases, 54(4), 463-469.

[3] Allouche G. (2013). Can Big Data Save Health Care? Available at: https://www.techopedia.com/2/29792/trends/big-data/can-big-data-save-health-care (Accessed: August 2018).

[4] Shah A. (2011). Healthcare around the World. Global Issues. Available at: http://www.globalissues.org/article/774/health-care-around-the-world (Accessed: August 2018).

[5] Financial Tribune (2017). E-Health File for 66m Iranians. Available at: https://financialtribune.com/articles/people/64502/e-health-files-for-66m-iranians (Accessed: August 2018).

[6] Alsulame K, Khalifa M, Househ M. (2016). E-Health Status in Saudi Arabia: A Review of Current Literature. Health Policy and Technology, 5(2), 204-210.

Measuring the Big Data Knowledge Divide Using Wikipedia

Big data is of increasing importance; yet – like all digital technologies – it is affected by a digital divide of multiple dimensions. We set out to understand one dimension: the big data ‘knowledge divide’, meaning the way in which different groups have different levels of knowledge about big data [1,2].

To do this, we analysed Wikipedia – as a global repository of knowledge – and asked: how does people’s knowledge of big data differ by language?

Our exploratory analysis looked at differences across ten languages in the production and consumption of the specific Wikipedia article entitled ‘Big Data’ in each language. The figure below shows initial results, using two measures:

  • The Knowledge-Awareness Indicator (KAI) measures the total number of views of the ‘Big Data’ article divided by total number of views of all articles for each language (multiplied by 100,000 to produce an easier-to-grasp number). This relates specifically to the time period 1 February – 30 April 2018.
  • ‘Total Articles’ measures the overall number of articles on all topics that were available for each language at the end of April 2018, to give a sense of the volume of language-specific material available on Wikipedia.
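
The KAI calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function name and the view counts in the usage example are hypothetical placeholders, and only the formula itself (article views divided by total views, multiplied by 100,000) comes from the definition above.

```python
def knowledge_awareness_indicator(article_views, total_views, scale=100_000):
    """Return the KAI for one language edition of Wikipedia.

    article_views: views of the 'Big Data' article in the chosen period.
    total_views:   views of all articles in that language in the same period.
    scale:         multiplier used to give an easier-to-grasp number.
    """
    if total_views <= 0:
        raise ValueError("total_views must be positive")
    if article_views < 0:
        raise ValueError("article_views must not be negative")
    # Multiply before dividing so that integer inputs lose less precision.
    return scale * article_views / total_views


if __name__ == "__main__":
    # Hypothetical view counts for two unnamed language editions;
    # these are NOT the figures from the study.
    example_counts = {
        "language_a": (5_000, 1_000_000_000),
        "language_b": (12_000, 800_000_000),
    }
    for language, (views, total) in example_counts.items():
        kai = knowledge_awareness_indicator(views, total)
        print(f"{language}: KAI = {kai:.2f}")
```

Because the indicator is a ratio, it allows languages with very different overall Wikipedia traffic to be compared on relative interest rather than on raw view counts.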

‘Big Data’ article knowledge-awareness, top-ten languages*

ko=Korean; zh=Chinese; fr=French; pt=Portuguese; es=Spanish; de=German; it=Italian; ru=Russian; en=English; ja=Japanese.
Note: Data analysed for 46 languages, 1 February to 30 April 2018.
* Figure shows the top-ten languages with the most views of the ‘Big Data’ article in this period.
Source: Author using data from the Wikimedia Toolforge team [3]

Production. Considering that Wikipedia is built as a collaborative project, the production of content and its evolution can be used as a proxy for knowledge. A divide relating to the creation of content for the ‘Big Data’ article can be measured using two indicators. First, article size in bytes: longer articles would tend to represent the curation of more knowledge. Second, number of edits: seen as representing the pace at which knowledge is changing. Larger article size and higher number of edits may allow readers to have greater and more current knowledge about big data. On this basis, we see English far ahead of other languages: the English article is significantly longer and significantly more edited.

Consumption. The KAI provides a measure of the level of relative interest in accessing the ‘Big Data’ article, which will also relate to the level of awareness of big data. Where English was the production outlier, Korean and to a lesser extent Chinese are the consumption outliers: the ‘Big Data’ article appears to be accessed significantly more often, in relative terms, in those languages than in others. This suggests a greater interest in and awareness of big data among readers using those languages. Assuming that accessed articles are read and understood, the KAI might also be a proxy for the readers’ level of knowledge about big data.

We can draw two types of conclusion from this work.

First, and addressing the specific research question, we see important differences between language groups, reflecting an important knowledge divide around big data. On the production side, much more is being written and updated in English about big data than in other languages, potentially hampering non-English speakers from engaging with big data, at least in relative terms. This suggests value in encouraging not just more non-English Wikipedia writing on big data, but also non-English research (and/or translation of English research), given that research feeds Wikipedia writing. This value may be especially notable in relation to East Asian languages given that, on the consumption side, we found much greater relative interest in and awareness of big data among Wikipedia readers in those languages.

Second, and methodologically, we can see the value of using Wikipedia to analyse knowledge divide questions. It provides a reliable source of openly-accessible, large-scale data that can be used to generate indicators that are replicable and stable over time.

This research project will continue exploring the use of Wikipedia at the country level to measure and understand the digital divide in the production and consumption of knowledge, focusing specifically on materials in Spanish.

References

[1] Andrejevic, M. (2014). ‘Big Data, Big Questions | The Big Data Divide.’ International Journal of Communication, 8.

[2] Michael, M., & Lupton, D. (2015). ‘Toward a Manifesto for the “Public Understanding of Big Data”.’ Public Understanding of Science, 25(1), 104–116. doi: 10.1177/0963662515609005

[3] Wikimedia Toolforge (2018). Available at: https://tools.wmflabs.org/