Author: Jake Porway

  • The Chart Wars Have Begun

    A few years ago, data scientist Alex Lundry gave a fantastic presentation describing the ways data visualization is being used by political parties to push their own agendas. He showed a Republican visualization of the House Democrats’ health plan — an infographic full of sinuous pipes and literal red tape, smattered with ugly unreadable fonts and unwelcoming 8-bit color palettes — next to a Democratic visualization of the same plan, which instead looked like an Easter basket, a perfectly designed and welcoming bundle of pastel circles.

    The difference between the two visualizations, which present the same information, was striking. It’s clear that the spin we’re accustomed to hearing from politicians is now something we’re going to be seeing from them as well.

    The most troubling part of all this is that “we the people” rarely have the skills to see how data is being twisted into each of these visualizations. We tend to treat data as “truth,” as if it were immutable and had only one perspective to present. If someone uses data in a visualization, we are inclined to believe it. This myopia is not unlike believing that the red velvet cake in front of us is the only thing that could have been made from the eggs and milk we mixed together. We don’t see in the finished product the many transformations and manipulations of the data that were involved, along with their inherent social, political, and technological biases.

    This is why I love this New York Times interactive, which Times graphics editor Amanda Cox discussed on this site a few weeks ago:

    [Image: demrep.png, the New York Times interactive. Source: New York Times]

    The Times begins with the “raw” data*, from which you can compare the Republican interpretation (red-colored glasses) and the Democratic interpretation (blue-colored glasses). As the window slides to either set of glasses, we’re greeted by that party’s familiar talking points, accompanied by the interpretation of the data that supports that view. The beauty of this visualization lies not in either the Democratic or the Republican end product, but in the concise way it draws our attention to the process of visualizing data to suit our own ends.

    Taken in isolation, either final visualization gives us an answer. Taken together, the opposing visualizations force us to ask questions.

    Beyond raising awareness about political bias in data visualization, this piece employs a technique that many visualizations can benefit from: comparison. For example, while I can see how well I kept to my own budget by visualizing my monthly expenses, I see a different picture when I visualize my expenses against those of others in my demographic. Similarly, visualizing my spending by type vs. by time of day gives me entirely different views of the same data. The power in each of these examples comes from seeing the data from many perspectives, which together form a more informed view than any single perspective on its own.
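    To make the comparison idea concrete, here is a minimal sketch in Python with pandas, using invented expense records (the column names and amounts are placeholders, not real data). The same rows are aggregated once by spending type and once by time of day, and each view tells a different story:

    ```python
    import pandas as pd

    # Invented expense records; columns and numbers are placeholders.
    expenses = pd.DataFrame({
        "category": ["food", "transit", "food", "rent", "food", "transit"],
        "hour": [8, 9, 13, 1, 19, 18],
        "amount": [4.50, 2.75, 12.00, 1500.00, 23.40, 2.75],
    })

    # Perspective 1: where the money goes.
    by_type = expenses.groupby("category")["amount"].sum()

    # Perspective 2: when the money goes.
    time_of_day = pd.cut(
        expenses["hour"],
        bins=[0, 6, 12, 18, 24],
        labels=["night", "morning", "afternoon", "evening"],
    )
    by_time = expenses.groupby(time_of_day, observed=True)["amount"].sum()

    print(by_type)  # rent dominates this view
    print(by_time)  # "night" dominates this view (the 1 a.m. rent payment)
    ```

    Neither summary is wrong; each simply answers a different question about the same rows.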

    I hope that, as we move into a world where people are increasingly exposed to shiny infographics and visually stunning data interactives, more pieces will remind us of the process and motives behind them. Maybe as data visualization becomes more democratized we’ll learn this lesson through doing, or perhaps The New York Times and others will still have to remind us. Either way, I hope others will help bring to light what’s going on behind the scenes so that we can take on the task of visualizing data with our eyes open.

    *Let’s leave aside the biases and assumptions in the data itself, e.g. how do we define unemployment?

  • You Can’t Just Hack Your Way to Social Change

    “We have a lot of data, but we have no idea what we should do with it.” The director of the foundation looked plaintively across the table at me. “We were thinking of having a hackathon, or maybe running an app competition,” he smiled. His co-workers nodded eagerly. I shuddered.

    I have this conversation about once a week. Awash in data, an organization — be it a healthcare nonprofit, a government agency, or a tech company — desperately wants to capitalize on the insights that the “Big Data” hype has promised them. Increasingly, they are turning to hackathons — weekend events where coders, data geeks, and designers conspire to build software solutions in just 48 hours — to get new ideas and fill their capacity gap. There’s a lot to be said for hackathons: they give the technology community great social opportunities and reward participants with money and fame for their solutions, and companies get free access to a community of diligent experts they otherwise wouldn’t know how to reach. For all of these upsides, however, hackathons are not ideal for solving big problems like reducing poverty, reforming politics, or improving education, and when they’re used to interpret data for social impact, they can be downright dangerous.

    At DataKind we run “DataDives”, weekend events that team nonprofits with pro bono data scientists to solve tough social problems. They are not easy to get right. Data events like these have requirements beyond those of your average hackathon: you need a clear problem definition, people who understand the data and not just data analysis, and deep sensitivity about the data you’re analyzing.

    Any data scientist worth their salary will tell you that you should start with a question, NOT the data. Unfortunately, data hackathons often lack clear problem definitions. Most companies think that if you can just get hackers, pizza, and data together in a room, magic will happen. This is the same as if Habitat for Humanity gathered its volunteers around a pile of wood and said, “Have at it!” By the end of the day you’d be left with half of a sunroom with 14 outlets in it.

    Without subject matter experts available to articulate problems in advance, you get results like those from the Reinvent Green Hackathon. Reinvent Green was a New York City initiative aimed at having technologists improve the city’s sustainability. Winners of this hackathon included an app to help cyclists “bikepool” together and a farmers’ market inventory app. These apps are great on their own, but they don’t solve the city’s sustainability problems. They solve the participants’ problems, because, as a young, affluent hacker, my problem isn’t improving the city’s recycling programs; it’s finding kale on Saturdays.

    To avoid this problem, organizations have to be willing to put time and effort into scoping problems with the technologists ahead of time. Reinvent Green could have invited recycling managers, urban planners, or other experts to converse with the hackers before the event. Organizations also need to be willing to get down-and-dirty with the data geeks during the weekend. It’s not enough to just throw the data over the wall and hope for the best.

    Subject matter experts are doubly needed to assess the results of the work, especially when you’re dealing with sensitive data about human behavior. As data scientists, we are well equipped to explain the “what” of data, but rarely should we touch the question of “why” on matters where we are not experts. Take, for example, a finding from the data team at Uber, based on Oakland crime data, that prostitution arrests increased on Wednesdays. One hypothesis for the uptick was that welfare checks are distributed on Wednesdays, meaning more welfare recipients had money to spend on prostitution. Wild, right? However, one commenter on Uber’s site who had worked with the Oakland Police Department pointed out that prostitution arrests occur on quieter nights, so maybe there weren’t more prostitution incidents on Wednesdays, just more prostitution arrests. If experts in the data, like the arresting police officers, had been involved, this would have been apparent.

    Statisticians have long known that data analysis helps us understand our world but never fully explains it. George Box famously said, “All models are wrong, but some are useful.” What this means is that we must be vigilant in communicating that, while all of this new big data will give us new and wonderful insights into our world, no single result should stand as the ultimate truth.

    Take, for example, a project the Grameen Foundation brought to a DataKind event. The Community Knowledge Worker program employs Ugandan workers to provide rural farmers with timely agricultural information via cellphone. Grameen wanted to use the mobile data to evaluate which of its workers in Uganda were “good” and which were “bad”. If you look only at the number of times a worker delivers information, one set of workers is identified as the good performers. If you instead look at the number of distinct farmers a worker reaches, a very different set appears effective. Which metric is right? Well, both of them. And neither of them. They are merely different perspectives on the same data. Together they form a richer picture of the world for the Grameen Foundation, but neither should be considered “right”.
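    The two metrics are easy to contrast in code. Here is a minimal sketch in Python with pandas, computing both from the same invented interaction log (the worker and farmer field names are assumptions for illustration, not Grameen’s actual schema):

    ```python
    import pandas as pd

    # An invented interaction log; fields are illustrative assumptions.
    log = pd.DataFrame({
        "worker": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
        "farmer": ["f1", "f1", "f1", "f1", "f2", "f3", "f4", "f5", "f5"],
    })

    # Metric 1: how many times each worker delivered information.
    interactions = log.groupby("worker").size()

    # Metric 2: how many distinct farmers each worker reached.
    reach = log.groupby("worker")["farmer"].nunique()

    print(interactions)  # A: 4, B: 3, C: 2 -> A looks best by volume
    print(reach)         # A: 1, B: 3, C: 1 -> B looks best by reach
    ```

    As the comments show, the “best” worker changes with the metric, and the data alone cannot say which ranking matters more.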

    We live in exciting and promising times. The flood of data we are collecting will yield new and earth-changing insights, some of which will be made by enthusiastic volunteers at hackathons. Let’s lay the foundation for their success by bringing together world-class teams to ask the right questions, collaborating on the best interpretations of the data, and striving, always, to be sensitive. Data isn’t just a spreadsheet or a database: It’s us. It’s the people we care about. It’s our world. Let’s not just hack it.
