What Problems Fit?
Julia Koschinsky, Ph.D.
GeoDa Center for Geospatial Analysis and Computation
Arizona State University
P.O. Box 875302, Tempe, AZ 85287
(480) 965-7533
Bloomberg Data for Good Exchange Conference. 28-Sep-2015, New York City, NY, USA.

1. ABSTRACT
Making sense of emerging sources of big, open, and administrative data has become paramount. This analysis assesses key characteristics of projects that are widely assumed to generate new and actionable insights and have social impacts. I review 72 use cases by prominent organizations in the “data science for good” community to determine the types of problems where data science techniques add value. The four main categories I identify are 1) improving data infrastructure by combining data with higher temporal and spatial resolution and automating data analysis to enable more rapid and locally specific responses, 2) predicting risk to help target prevention services, 3) matching supply and demand more efficiently through near-real-time predictions for optimized resource allocation, and 4) using administrative data to assess causes, effectiveness and impact. In almost all cases, the insights that are generated are based on an automated process, are localized, in near real time and disaggregated.

Keywords
Data science, problem, actionable insight, impact

2. THE CHALLENGE
Making sense of big data, open data, administrative data, social media data and combinations of these data has become paramount [1]. We are not only looking for insights but for actionable insights that can augment existing government and nonprofit practices. Governments and nonprofits are not alone in figuring out this puzzle: According to a 2011 survey of 3,000 companies in 30 industries and 100 countries, for almost four of 10 respondents, “the leading obstacle to widespread analytics adoption is lack of understanding of how to use analytics to improve the business” [2]. Even if one can figure out how to gain not only new but actionable insights from data, another challenge is the translation of these insights into impacts. Insights need to be “closely linked to business strategy, easy for end-users to understand and embedded into organizational processes so that action can be taken at the right time” [2].

A key reason why it is non-trivial to translate data analytics results into insights and impacts for social good is that this effort requires collaboration across traditionally siloed disciplines, skillsets and departments with their own jargons, cultures and ways of thinking. In order to inform a social or public problem with data analytics, this problem needs to be defined from the perspective of the people making decisions that influence the problem resolution, so that technological and statistical solutions can ultimately be embedded within these decision workflows. Not all techniques in the toolboxes of computer and data scientists meaningfully inform public and nonprofit problems. Therefore the question arises which types of public and nonprofit questions and problems can be informed by data science.

Like technology, data science does not improve social outcomes by itself. At its best, it augments existing implementation processes [3]. Figuring out which organizational processes are particularly amenable to such augmentation and which agencies are ready to adopt or expand data-driven cultures is important to translating insights into impacts. And even when data analytics proves to make existing operational processes more efficient, the question still remains whether the outcomes are socially and politically desirable or not [4]. This skepticism is reflected in the growing criticism of Minority Report-style surveillance (informed by predictive modeling of crime such as in Chicago) and of the “governance of algorithms,” which are often black-boxed and outside of the realm of accountability to residents [5, 6].

Another prominent critique of technological and data-driven data-for-good projects is that they promote a perspective that assumes that “there’s an app” for every problem [7]. There is a tendency to offer band-aid solutions that, for instance, might help manage the process of serving a few homeless persons a little better but ignore long-term structural problems such as inequality, racial discrimination or shifts to lower paying service sector jobs that cannot be easily fixed with a civic tech tool or predictive model developed during a hackathon weekend.

As more cities are displaying the results of quantitative indicators on dashboards, critics point out that the choices about what data are collected, how indicators are measured, what goals they represent, and how they are displayed are innately political rather than merely technical [8, 9]. Arguments to “just let the data speak” as if they were objectively representing an independent truth are misleading.

3. QUESTION AND METHODOLOGY
This analysis assesses key characteristics of projects that are widely assumed to generate new and actionable insights and have social impacts. To do so, I conduct a preliminary review of 72 use cases by prominent organizations in the “data science for good” community to determine the types of problems where data science techniques add different kinds of value. I chose organizations that focus on data science methods such as machine learning or predictive modeling and impact measurement. Efforts that primarily specialize in the visualization of raw data or basic statistical analysis are not included. Related projects in civic tech (such as Code for America’s) are also excluded since most of these projects focus more on technological advances to government problems than on data analytics.

Four organizations were chosen that represent well-known efforts in the data-for-good community: DataKind, Bayes Impact, the Data Science for Social Good (DSSG) Fellowship (University of Chicago), and New York’s Mayor’s Office of Data Analytics (MODA). This is a convenience sample that is not designed to be representative. It will be extended over time as more projects are documented online. However, it is noteworthy that there is already substantial overlap in the questions and problems addressed by the four organizations, suggesting that the sample does effectively capture some common trends.

To identify the universe of use cases for this paper, I chose the projects listed on the organizations’ websites in mid-July 2015 (specifically the seven winning projects of the 2014 Bayes Impact 24-hour Hackathon¹, 23 projects on DataKind’s project page², 38 fellowship projects of DSSG³, and four projects from MODA’s 2013 Annual Report⁴). Table 1 contains the categorization, name, agency, sponsor, problem, short description, data, and method used in each of these projects.

4. FINDINGS

4.1 Types of Problems
The four main categories that I identify to classify the 72 use cases in terms of the problems that data science helps to address are 1) improving data infrastructure by combining data with higher temporal and spatial resolution and automating data analysis to enable more rapid responses, 2) targeting limited resources to highest risks for prevention efforts, 3) matching supply and demand more efficiently through near-real-time predictions for optimized resource allocation, and 4) using existing data to assess performance and impact. In almost all of the cases, the insights that are generated are based on an automated process, are localized, in near real time and disaggregated. This section discusses these findings in more detail.

The range of problem areas of the use cases is so broad that the choice of problem area does not seem to be a constraining factor. How the problem is defined and what data are available within a given problem area seem to be more relevant. Common problem areas in the sample include health problems, non-completion of school or service programs, government corruption, human rights violations, neighborhood blight (abandoned properties), access to funding for nonprofits, poverty and homelessness, as well as government operations (such as fire, building codes, and policing).

One of the key differences between traditional quantitative analysis in the social sciences (e.g. using multivariate regression models) and the use cases analyzed here is that their reliance on machine learning methods comes with a shift in focus from descriptive to predictive and prescriptive insights [10]. Descriptive insights address the question of what has happened and why; predictive insights focus on what could happen; and prescriptive insights inform choices about how to respond to what has happened or could happen. The vast majority of use cases in this sample produce prescriptive or predictive insights, i.e. insights that are (or at least appear to be) actionable without requiring additional analysis. Projects often include the full pathway from identifying patterns in past and current data (e.g. why students dropped out of school in the past) to predicting future behavior (e.g. risk scores of who might soon drop out) to informing intervention strategies (e.g. to help prevent drop-outs).

The use cases generally fit Santos’ [3] criteria for actionable results: They a) inform a better-than-usual selection of response that b) can be implemented in a feasible and efficient way and that c) is related to an improved outcome. One of the reasons why the results are actionable is the reliance on disaggregated units of analysis rather than aggregates, which is driven by the increasing availability of electronically generated data at this scale. This disaggregation makes insights actionable at the individual level since it focuses the analysis on a unit that matches that of decision makers. An example is the disaggregation of smart meter readings for the total household to estimate the energy that could potentially be saved by individual appliances.

These are the four categories I identify to characterize the types of problems and value added by data science in the sample (letters also used in Table 1):

A. Improving data infrastructure to enable faster and local responses
   by combining data with higher temporal and spatial resolution and automating data analysis to enable more rapid and locally specific responses

B. Predicting risk to help target prevention services
   assisting a service provider with targeting of limited prevention resources based on prediction of elevated risk

C. Detecting space-time clusters to help match supply and demand
   predicting optimized allocation of resources to better match supply and demand across a complex system

D. Using administrative data to assess causes, effectiveness and impact
   improving services or systems through an assessment of effectiveness based on analysis of existing administrative data or combining past data on process and outcome

These categories are not mutually exclusive: The same use case can be part of multiple categories at different stages. For instance, a case could start with building a data infrastructure, then proceed with estimating elevated risks in sub-samples and conclude with assessing the effectiveness of service delivery. To reflect this, some use cases in Table 1 are classified in more than one category. The following sections illustrate each of these categories with examples (see Table 1 for more details).

Improving Data Infrastructure to Enable Faster and Local Responses

Several use cases pertain to automating and centralizing data access, usually for data that are more location-specific and timely than traditional censuses (although often not as complete). Examples include the creation of a central atlas of businesses in New York after Hurricane Sandy as part of post-disaster aid and the merging of data for different medications to identify negative side effects of interactions between them. There are also examples of improving automated measurement, e.g. testing the measurement of poverty with proxies from satellite images such
1 http://bayeshack.devpost.com/submissions
2 http://www.datakind.org/projects/
3 http://dssg.io/projects/
4 http://www.nyc.gov/html/analytics/downloads/pdf/annual_report_2013.pdf