In this episode, we chat with Sandeep Junnarkar, Director of Interactive Journalism at the Craig Newmark Graduate School of Journalism. He tells us about his journey into code, launching The New York Times on the web, what data journalism is and how to do it, and why it’s important to tell stories through code.
[00:00:00] SY: Last year, more than a hundred thousand developers participated in Call for Code 2018, a virtual hackathon with the goal of finding ways to reduce the impact of natural disasters through technology. Submit your idea by midnight Pacific Time, July 29 for your chance to win $200,000 and support from IBM and other partners. Last year’s winner was Project OWL, a deployable mesh network that brings connectivity to survivors of natural disasters which you can learn more about in Episode 1 of this season. Start building your life-saving app today. Just a heads up that after this episode’s end credits, we’ll be playing a trailer of Command Line Heroes, the other podcast I host that’s all about open source and produced by Red Hat. So stay tuned.
[00:00:51] (Music) Welcome to the CodeNewbie Podcast where we talk to people on their coding journey in hopes of helping you on yours. I’m your host, Saron, and today, we’re talking about data journalism and telling stories through code with Sandeep Junnarkar, Director of Interactive Journalism at the Craig Newmark Graduate School of Journalism.
[00:01:08] SJ: What can I do in a story and in an interactive to let my audience see how this event or these numbers affect them personally?
[00:01:20] SY: Sandeep talks about his journey into code, launching The New York Times on the web, what data journalism is, and how to do it, and why it’s important to tell stories through code after this.
[00:01:38] You might have already heard us talking about bootcamps on this show and how important they’ve been to changing people’s careers and their lives. Well, Flatiron School is one of the best. The education you receive and the skills you gain and the community you’ll have will prepare you for the rapidly growing tech field. Go to flatironschool.com/codenewbie to learn more. That’s flatironschool.com/codenewbie.
[00:02:00] Actualize Online Live is an online bootcamp created and taught by expert educators. It’s 100% live and can be taken from the comfort of your home. They use video conferencing, so you get to actually see and talk to your instructors and classmates in real time. That means you have live interaction and feedback, not just during instruction but during all your exercises and projects as well. Learn more at actualize.co/codenewbie. That’s actualize.co/codenewbie.
[00:02:29] If you’re designing a website or building a mobile app, you’re probably going to want to take payments at some point. Square APIs allow you to easily implement their payment form on your website. And with their In-App Payments SDK, you can add it to your mobile app. You don’t have to worry about dealing with PCI compliance because they’ll take care of that for you. Don’t let anything come between you and your money. Start building with Square over at squareup.com/go/codenewbie.
[00:03:02] SY: Thanks so much for being here.
[00:03:03] SJ: Thanks so much for having me.
[00:03:05] SY: So there’s this idea of computer-assisted reporting which is now called Data Journalism. How would you define computer-assisted reporting?
[00:06:17] SY: Tell us about your backstory and how you got into coding.
[00:06:20] SJ: It was definitely a roundabout way, definitely had some interest as a teenager in coding, wound up applying for a job after graduate school. I took no coding in college or in graduate school. But at that time, the job was for The New York Times. They were taking the newspaper onto the internet. Mind you, this was even before the World Wide Web at NYTimes.com. This was on AOL and they were looking for people to help them bring it onto the internet. And I applied for the job and I got it and it was a rather crazy story getting that job because when I applied for it, at that time on my resume, I put my email down, my email address. Mind you, this was 1994. So I talked to my editor once and said, “What were you looking for on a resume?” And he said, “Well, you are the only person who put an email address. So we figured you knew about the future and were ready for it.” It was a stroke of luck for sure, but while I was there, the real coding did start. We had to start learning how to put content onto the internet and then onto the World Wide Web. And from there, they were all incremental steps in learning how to code on the job. It was challenging because you had to continue to make editorial decisions, learn to code, and put this stuff on the web itself. So that’s how I kind of got into it and kept developing, kept getting more complex forms and stories. In the end, it’s been and continues to be a learning experience.
[00:08:06] SY: So you started in journalism, then you have this really big job of putting The New York Times on the web. At any point did you start to change how you view yourself? Did you ever call yourself a coder?
[00:08:19] SJ: I do not call myself a coder. I call myself a coding journalist. My primary loyalty is to accuracy, fairness, and a really good story, like you have to convey information that’s useful to people and can change their lives in some way or help guide them. And if I can use code to explain information or let people immerse themselves in that information, then I’m all for using code. So I’ve always called myself a coding journalist.
[00:08:54] SY: So I want to hear more about this huge undertaking of translating The New York Times to the web. It sounds like such a big job. What was it like? What were the first steps to doing that?
[00:09:05] SJ: It was a lot of taking what was on paper and putting it onto the internet, which, as you can imagine, was challenging. You didn’t wind up putting the entire newspaper. You selected which items, so there was some editorial decision making involved right there, much like Apple News today where they have live editors. It’s not an algorithm deciding the news stories. We would do the same and then it was learning AOL’s proprietary language to transfer information, then it was learning HTML and CSS. But in the end, it really was right when it would start to feel somewhat rote, you had an opportunity to use your skills to do something more. For example, we had an opportunity to try out new things. And one of the things that I was most excited about at that time, and now it sounds so common, but back in 1994, ’95, this was really exciting, was that we actually got to interview people via the internet on chat and some of these people were not even on Earth.
[00:10:16] SY: Oh!
[00:10:17] SJ: For example, we wound up interviewing astronauts on the space shuttle.
[00:10:22] SY: Wow!
[00:10:23] SJ: We had a link with NASA and we had our audience members actually texting messages to us for questions they wanted to ask. It was an interview that both we did and we let our audience engage with the astronauts as well. So now, obviously, you see a lot of this on Twitter and Instagram and WhatsApp, but this was way before all of those things and it was just so exciting because almost every day we would try to come up with some new way of telling a story.
[00:10:54] SY: So I’m wondering how The New York Times felt about this job because on the one hand, clearly they were on board because they hired you, but on the other hand, I imagine there were some people who maybe were resistant or maybe a little scared of what it was like to go online. Were there kind of mixed feelings about the project you’re working on?
[00:11:13] SJ: Yeah. Absolutely. There were mixed feelings about it for sure and emblematic of that was that there were four of us who were initially hired to take The New York Times onto AOL and we were on a floor that I remember in the old headquarters in Times Square that had no carpeting. We had furniture from I think it was like the 1920s or 1930s. It was not ergonomic at all. The desk looked like a tank. Across the hall was a garbage chute. So there was always this feeling like, “Hey, you know, if this internet stuff does not work out, I think they’re just going to toss us down the chute.” So there was that feeling because we were so isolated and indeed they moved us around a lot. Like as we expanded from a four-person operation, they didn’t have us in the Times’ building anymore. They moved us across near Grand Central and then on to 6th Avenue and it wasn’t until recently really that they actually integrated what I did as a web producer into different desks. So there was that aspect, but there was also a mistrust from the editorial side. They were very protective of the news stories, especially the investigative ones or some of the ones where they were scooping The Washington Post or The Wall Street Journal and they weren’t sure if we were journalists. They thought of us more as coders who were just taking content and putting it on the internet. They didn’t realize that we were actually coming up with new ways of telling stories and we understood the importance of not leaking stories to the web by accident.
[00:12:59] SY: Yeah, because I was wondering. I get the mistrust part. I get the, you know, “Are they real journalists or are they coders?” That question makes sense to me, but I’m also wondering what was their worst-case scenario? What were they afraid was going to happen?
[00:13:14] SJ: They were afraid that we would publish a story at 11 P.M. and that The Washington Post would see it on our site, that reporters of The Washington Post would call a few people, their sources. In those days, you needed to get the story to the printing press. You had till about 2:00 or 3:00 in the morning. So they were afraid that The Washington Post would get it into print and then it would look like it wasn’t a scoop for The New York Times. So as we know, journalists and media organizations are very protective and competitive about scoops and that was definitely playing out there.
[00:13:53] SY: And was that valid? Did that ever happen?
[00:13:55] SJ: No, it never happened. It never happened. In fact, shortly after those days, we came up with what was called “The Continuous News Desk” which allowed us to publish all day long and not have a publish time of midnight. So we were literally publishing all day long as news was breaking. That became one of my roles as well to actually talk to people and to write up short stories as well to add to the website.
[00:14:25] SY: So what were you hoping to accomplish with this new web frontier?
[00:14:30] SJ: I think that the beauty of what we were doing was at that time having the paper available to a lot more people than were subscribing to it. And as we know, that became a problem later in terms of subscriptions and people got used to a free New York Times on the web. But in those days, that was really exciting to know that you were moving beyond a small subscriber base to being viewed around the world. Even now with a paywall, it has grown and has really become an international paper. It’s no longer even a national paper. It is truly an international paper. And I think that we played a big part in accomplishing that.
[00:15:21] SY: I’m wondering if something also happened in the context of data. Given that you were coders and you were a data journalist, not just a web journalist, not just a journalist but a data journalist, did that have an effect on the way The New York Times wrote articles?
[00:15:37] SJ: Yeah. I mean, it has definitely impacted the way the newspaper writes stories. Like I said, I did wind up writing articles for them and sometimes you would pitch ideas based on conversations you’ve had at conferences or at parties even. And in those days, I think a lot of journalists were guilty of, you know, if you’ve got three people saying the same thing, you’ve got a trend, right?
[00:16:05] SY: Yeah.
[00:16:06] SJ: And then of course you reach out and find a few more people, but that’s not a data story. I randomly happened to speak to three people and they fit a pattern, but what has happened now is that data is being collected on everything and it’s much easier to come up with a story that is really airtight because you’re not just talking to three people, you’re talking to three people and then you have all this data that says that whatever has happened doesn’t just apply to three people, it applies to a whole group of people. A really fun article I wrote was on telepsychiatry. And that basically is this concept that there are places in America, like rural Kansas, rural Nebraska where psychiatrists do not want to practice. They want to practice in Chicago, in Los Angeles, in New York City, but not in the middle of nowhere. And the problem was that in the middle of nowhere is where you were having methamphetamine epidemics, people hooked on the stuff, and there just weren’t enough psychologists and psychiatrists to help deal with that situation. And I wound up writing an article about that and already there was data to back me up, like the growing number of these centers. You could talk to the Center of the Cleveland Clinic and they would give you data on all the people they’ve reached. And so already we’re starting to add more and more of the data into our stories. So it was no longer a couple of people that you talk to and that now has become so pervasive that just a couple weeks ago there was an article in The New York Times about how they are now offering data training to all journalists. It’s not just data journalists. It’s all journalists, because I truly believe you cannot be a journalist today without understanding numbers. These numbers back up your assertions, make them airtight and no one can really doubt that you have a true trend if you’ve got the numbers that back you up.
[00:19:11] Actualize Online Live is not only a super convenient way to receive top-notch bootcamp instruction from the comfort of your home, they also have nifty tools to help you learn everything from new coding concepts to syntax. They even produce a free weekly video called “Think Like A Software Engineer”, which teaches you things like how to debug code, how to research problems, and how to teach yourself new languages. Learning the mindset of a software engineer is the key to getting past the hurdles that can bog you down as you code. Check out the series at actualize.co/codenewbie. That’s actualize.co/codenewbie.
[00:19:51] SY: So when you think about what data journalism looked like when you first started to what it looks like now, what are some of the big differences?
[00:19:58] SJ: I think the biggest difference is that initially you wanted to just have charts and graphs that people could, at a glance, see what the trends were and there’s no doubt that that can be very powerful. Just at a glance when you see the number of incarcerated black men compared to any other demographic out of the tiniest county in Indiana, you say, “What the heck is happening there?” Right? This does not seem right. That’s very powerful. There’s no doubt about it. But I also think that what I have been exploring is this concept of personalization. Where do I fit into these numbers? There’s definitely been a deluge of numbers and it’s sometimes hard to connect with what’s happening. I really have been exploring: What can I do in a story and in an interactive to let my audience see how this event or these numbers affect them personally? When you do provide a personalized analysis, you do see where you fit in the larger context as well, how others are affected as well. So that’s really been my push over the last couple of years. And in fact, I developed a web tool called “PathChartr”, and basically to build some of these interactives takes a lot of time, it takes coding skills, and it takes money, right? You know, server space and paying coders to build it out, et cetera. And certainly there are large news organizations who can afford all of that, but there are smaller news organizations, community organizations that don’t have those resources. So PathChartr is basically a web tool that allows these news outlets to really deliver personalized and relevant insight and information to their users without having to write a line of code themselves. It is a plug-and-play tool, but I just feel that this form of storytelling where someone sees how they fit into the numbers, into the trends is so vital that it shouldn’t be reserved only for news organizations with resources.
[00:22:20] SY: So I want to talk about some of the tools and resources a bit more. You mentioned Python, you mentioned Pandas. What else is there that people might come across or they might use as a data journalist?
[00:22:31] SJ: You know, there’s Beautiful Soup for scraping websites. I’ve been taught not to categorize people, right? That’s a good thing not to like stereotype or categorize people, but Python does have libraries that allow you to figure out gender based on first names and sometimes that’s useful when you’re trying to analyze pay gaps for exactly those, right? So you have all these names. You have thousands of names. Usually, they won’t tell you if the person is male or female, for all these companies across the US. So you would have to sit there manually figuring out. It’s just the numbers. There’s so many of them. So in that case, all of a sudden having a library that allows you to figure out gender is great. And I know that there are issues with that as we become less of a binary society. And I do think it’s important. I think that this library, it’s actually called Gender Guesser, will probably incorporate some of those aspects. But until then, it’s a great way to analyze whether someone’s male or female. The other one that comes to mind is NLTK, which is natural language processing, right? So that’s a great way. You can get a document and you can process which are the words that come up most often. Do they have a negative sentiment to them? Do they have a positive sentiment to them? So these are all for analysis. Some other ones that I use personally, there’s FuzzyWuzzy for string matching. Just as an example, Melania Trump gives a speech back in 2016. You could easily check whether the words she used and how they’re strung together, how closely they correlate to or match Michelle Obama’s speeches.
[00:24:29] SY: Oh, interesting. Yeah. Yeah.
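Two of the analyses Sandeep describes, finding the most frequent words in a document and scoring how closely two passages match, can be sketched with the Python standard library alone. This is a minimal illustration, not how NLTK or FuzzyWuzzy work internally: `difflib.SequenceMatcher` is swapped in for FuzzyWuzzy here, and the function names and sample sentences are hypothetical, invented for this example.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def top_words(text: str, n: int = 3) -> list:
    """Return the n most frequent lowercase words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample sentences, invented for illustration only.
passage_a = "We will rebuild the bridge before the winter storms arrive."
passage_b = "We will rebuild the old bridge before winter storms arrive."
passage_c = "The library closes early on public holidays."

print("a vs b:", round(similarity(passage_a, passage_b), 2))
print("a vs c:", round(similarity(passage_a, passage_c), 2))
print("top words in a:", top_words(passage_a))
```

A high ratio only flags passages worth reading side by side; deciding whether the overlap means anything is still a reporting job.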
[00:25:44] SY: If we kind of take a step back from the coding itself and think about the high-level concepts, what are some of the concepts that you need to understand to be a good data journalist?
[00:25:54] SJ: The idea that data is always dirty. There’s missing information. There are labels that you have no idea what they mean. There are empty spaces. The same name spelt in multiple ways that would result in a miscalculation, right? If I’m looking for Elizabeth Warren and that’s spelled in three different ways, code would tell you that that’s three different people. So that’s definitely something that has remained. I definitely train people to be skeptical about the data, where it came from, who collected it. Just the act of saying what are the categories that we are looking for can lead to bias. Look at data and think about it from your own experience. Does this ring true? For example, if New York City puts out data saying, “Eighty-five percent of subway trains are now running on time,” we might say, “Hmm, how is that possible? This can’t be right. I’m late to work every other day. How is this possible?” Right? So look at your own experience and the data and see if it smells right and often you’re able to tell that.
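The Elizabeth Warren example, one person counted as three because of inconsistent spelling, can be sketched as a small cleaning step in Python. This is a minimal illustration under stated assumptions: the function names and sample rows are hypothetical, and the code only reconciles formatting variants (case, spacing, punctuation, "Last, First" order). True misspellings would need fuzzy matching on top of it.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(raw: str) -> str:
    """Reduce a name to a comparable key: lowercase, 'first last' order, no punctuation."""
    # Strip accents so that accented and unaccented spellings compare equal.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Turn "Last, First" into "First Last".
    if "," in text:
        last, first = text.split(",", 1)
        text = f"{first} {last}"
    # Drop everything except letters and whitespace, then collapse the whitespace.
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return " ".join(text.split())


def group_variants(names):
    """Map each normalized key to the raw spellings that produced it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)


# Hypothetical rows, as they might appear in a messy spreadsheet column.
rows = ["Elizabeth Warren", "WARREN, ELIZABETH", "  elizabeth   warren ", "Jane Doe"]

for key, variants in group_variants(rows).items():
    print(key, "<-", variants)
```

Printing the groups, rather than silently merging them, keeps a human in the loop: two different people can share a name, so the merged list still deserves a skeptical read.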
[00:27:15] SY: You mentioned having so much data that Excel can’t even open all the datasets that we have and all the millions of rows of data that we’re working with, where does all this data come from?
[00:27:25] SJ: Sensors, from biometrics, from street cameras. It’s just everywhere. It’s pervasive at this point. It comes from our watches. There’s data being collected on everything. And this is just over a couple of years of collection using sensors. Imagine what it’s going to be like if we try to compare the year 2030 with the year 2020. There’s going to be so much more information. And perhaps at some point Python is not going to be able to handle it either.
[00:27:59] SY: But how do journalists, even everyday people, get access to it? Because I get that it’s been collected with all these different things, but to have it in our hands to then code with, that’s kind of a different thing, right? How do we get access to it ourselves?
[00:28:12] SJ: Yeah. It really depends. I mean, you might have a government agency that’s using an algorithm, and when you ask them for it, they will say, “Oh, we’d love to give it to you, but a private company wrote the algorithm.” So you reach out to the private company and they tell you, “Well, this is our intellectual property. We cannot share it with you,” right? So you don’t have access to it. So now all of a sudden, you’re going back to the government agency and saying, “Okay, I understand that I can’t get a hold of the actual algorithm. Can you tell us what went into making that algorithm? Can you tell us what factors you suggested to this company?” So now all of a sudden, you’re using the freedom of information, perhaps. Not to get a hold of the actual code, but the planning for the code and sometimes you’ll get a hold of it, but they’ll give it to you in a format that is so difficult to work with. So for example, I can’t provide too many details on this, but I’ve gotten a hold of lawsuits involving a particular professional field and it’s definitely a field that requires licensing and so you can get a hold of it through the state. However, they give you these lawsuits in the form of PDFs that are images. So you wind up with 10,000 PDFs that have about 20 pages each, but they’re all images. So you can’t do searches for words. You have to first run optical character recognition on it before you can even start to analyze it. So it’s tough for a reason. They don’t want to share it with you. They don’t want to be held accountable.
[00:30:08] SY: Coming up next, Sandeep talks about one of his favorite data journalism stories created by one of his students and what the future of data journalism looks like after this.
[00:30:28] We’ve talked about open source a bunch of times on this podcast, but frankly, open source is so big and complex, and fascinating that it needs its own show, and it has one called Command Line Heroes. It’s produced by Red Hat and it’s hosted by me. That’s right. I’ve got another podcast talking to incredible people about all things open source. We talk about the history of open source, the introduction of DevOps and then DevSecOps, and we even do an interview with the CTO of NASA. And that’s just the beginning. We also dig into cloud and serverless and big data, and all those important tech terms you’ve heard of, and we get to explore. If you’re looking for more tech stories to listen to, check out redhat.com/commandlineheroes. That’s redhat.com/commandlineheroes.
[00:31:15] Square has APIs and SDKs to make taking payments easy whether you’re building a mobile app in iOS, Android, Flutter, or React Native. If you want to embed a checkout experience into your website, it’s super easy. Square has processed billions of transactions so you know it’s tried and true. Start building with Square over at squareup.com/go/codenewbie.
[00:31:42] SY: So do you have any favorite or impactful stories you’ve worked on involving data?
[00:31:47] SJ: Oh, you know what? Can I tell you one about my students?
[00:31:50] SY: Sure. Go for it.
[00:31:51] SJ: Sometimes you have to compile data yourself and it’s not by scraping websites, but it’s literally by coming up with a methodology and collecting it yourself. So I have a student, she looked through the income levels in different neighborhoods in the city and she took the outlier, the neighborhood that has the biggest gap between the richest people and the poorest people. She wondered if people who are the poorest people in that neighborhood can actually afford to shop there for groceries. And it turned out that they don’t shop in that neighborhood, even if there is something like a Key Food, which is a very middle-class grocery store. It’s not like Whole Foods. So what she wound up doing was she went to the Key Food in the financial district and she looked up some staple foods. She drew up a list. One-pound bag of potatoes, apples, a gallon of milk, a loaf of bread, et cetera, and she found the prices for them. And then she went to where some of these people shop, which was in The Bronx, same store. She collected the prices in those stores. In her interactive, you go shopping, you add things to your cart and you automatically see the price difference between these two places and it keeps counting the difference and then it extrapolates it for the whole year. The price difference for the year comes out to about 600 or 700 dollars and for someone living at minimum wage, that’s a huge amount. So very impactful story. Presentation obviously is important because if you collect all this data and no one has a way to interact with it, it may not be as useful. No matter how brilliant you are at coding, you need to be able to convey it visually as well. So that’s definitely a big part of what we teach at CUNY and other places obviously. It’s just important.
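The arithmetic behind an interactive like the one described, totaling the same cart at two stores and extrapolating the gap over a year, can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the store labels, the prices, the cart, and the one-trip-per-week default are all hypothetical and are not the student's data or method.

```python
from decimal import Decimal

# Hypothetical prices in dollars, invented for illustration; not the student's data.
PRICES = {
    "store_a": {"potatoes": Decimal("4.99"), "milk": Decimal("5.49"), "bread": Decimal("3.99")},
    "store_b": {"potatoes": Decimal("3.49"), "milk": Decimal("4.29"), "bread": Decimal("2.99")},
}


def basket_total(store: str, cart: dict) -> Decimal:
    """Sum price times quantity for every item in the cart at one store."""
    return sum(
        (PRICES[store][item] * qty for item, qty in cart.items()),
        Decimal("0"),
    )


def yearly_gap(cart: dict, trips_per_year: int = 52) -> Decimal:
    """Extrapolate the per-trip price difference, assuming the same cart on every trip."""
    per_trip = basket_total("store_a", cart) - basket_total("store_b", cart)
    return per_trip * trips_per_year


# A hypothetical weekly cart: item name mapped to quantity.
cart = {"potatoes": 1, "milk": 1, "bread": 2}

print("store_a:", basket_total("store_a", cart))
print("store_b:", basket_total("store_b", cart))
print("yearly gap:", yearly_gap(cart))
```

`Decimal` is used instead of floats so that repeated additions of prices do not accumulate rounding error, which matters once a small per-trip gap is multiplied across a year.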
[00:33:59] SY: So what do you think is the future of data journalism?
[00:34:02] SJ: The reality is that in a couple of years, you know, there’s increasing use of artificial intelligence and machine learning and algorithms to make decisions on social policy issues, on who gets Medicaid service, when would a social worker visit someone’s house. In the next frontier, we need to be able to interrogate these systems, right? So that’s kind of the next level of data journalism. How do we check out how, and I’m putting this in air quotes, “malevolent players” are using algorithms and artificial intelligence and machine learning to make decisions? And the complementary side to that is, how can we as journalists use machine learning, artificial intelligence, et cetera, to analyze vast amounts of data? Because at some point, as I said, the datasets can get so massive that we may not even know where to look, but we could certainly use next generation data journalism techniques that involve machine learning where the code tells you, “Hey, Sandeep, I see a cluster of incidents here toward the left top side of this curve and I see some incidents here on the bottom right side. You may want to look into that,” right? So when we see those clusters, we have to say, “Let me now put on my reporter’s hat and my data journalist’s hat to examine that smaller section of data.”
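The idea of code pointing a reporter toward clusters of incidents can be illustrated with something far simpler than machine learning: counting points per grid cell and flagging the crowded cells. This is a rough sketch, not a real clustering algorithm, and the function name, cell size, threshold, and coordinates are all hypothetical.

```python
from collections import Counter


def dense_cells(points, cell_size=1.0, min_count=3):
    """Count points per square grid cell and return cells holding at least min_count points."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Hypothetical incident coordinates: one group near the top left,
# one near the bottom right, and one isolated point in the middle.
incidents = [
    (0.1, 9.2), (0.4, 9.5), (0.7, 9.9),
    (9.1, 0.2), (9.3, 0.6), (9.8, 0.4),
    (5.0, 5.0),
]

for cell, n in dense_cells(incidents).items():
    print(f"cell {cell}: {n} incidents, worth a closer look")
```

The output is a list of places to start reporting, not a finding. Changing `cell_size` or `min_count` changes which cells are flagged, so those choices belong in the methodology notes.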
[00:35:50] SY: Now at the end of every episode, we ask our guests to fill in the blanks of three very important questions. Sandeep, are you ready to fill in the blanks?
[00:35:57] SJ: Okay.
[00:35:58] SY: Number one, worst advice I’ve ever received is?
[00:36:01] SJ: Go back to medical school.
[00:36:03] SY: Oh! Was it your parents?
[00:36:06] SJ: A very close family member.
[00:36:08] SY: And what was the story behind that?
[00:36:10] SJ: I did study all the pre-med requirements in undergrad but became very interested in storytelling. Once I started working at The New York Times, I mentioned the telepsychiatry story, I wrote other medical stories. At some point, someone very close to me, my mother-in-law, offered to put me through medical school.
[00:36:31] SY: Oh, wow! That’s nice.
[00:36:33] SJ: That was very nice. But I stuck it out as a journalist.
[00:36:37] SY: Nice. Well, I’m glad you did because you got some wonderful stories to share with us. Thank you. Number two, my first coding project was about?
[00:36:45] SJ: Oh gosh. My first coding project was just to put The New York Times onto the internet.
[00:36:52] SY: I think that is the largest first coding project we’ve had on the show. That’s pretty big. Wow.
[00:36:58] SJ: It really was and I have to tell you that’s what we did at that time. You just have to keep learning.
[00:37:04] SY: Number three, one thing I wish I knew when I first started to code is?
[00:37:09] SJ: You don’t have to be perfect.
[00:37:11] SY: Tell me about that.
[00:37:12] SJ: There are two types of coders. They’re the people who’ve taken computer science and taken massive bootcamps on coding and so forth and to them coding is like poetry. The more elegant and concise it can be, the better. And that’s true in many ways, right? It runs more efficiently and so forth. But sometimes when you’re on deadline as a journalist and you just need something to happen, as long as it works and it’s accurate, it’s okay.
[00:37:44] SY: I like that. Well, thank you so much Sandeep for being on the show and sharing all your wonderful data journalism stories with us.
[00:37:51] SJ: Thank you so much for having me. I really appreciate it.
[00:38:00] SY: This episode was edited and mixed by Levi Sharpe. You can reach out to us on Twitter at CodeNewbies or send me an email, firstname.lastname@example.org. Join us for our weekly Twitter chats. We’ve got our Wednesday chats at 9 P.M. Eastern Time and our weekly coding check-in every Sunday at 2 P.M. Eastern Time. For more info on the podcast, check out www.codenewbie.org/podcast. Thanks for listening. See you next week.
Command Line Heroes trailer:
[00:38:38] GC: At the end of the meeting, they were in agreement. They wanted one data-processing language. The language which came to be known as COBOL.
[00:38:45] SY: That’s programming pioneer, Grace Hopper. We told her story last season and there was so much love for the tale of Hopper and the early days of programming languages that we decided to follow up with a whole season of amazing language stories. This is Season 3 of Command Line Heroes, an original podcast from Red Hat. And I’m your host, Saron Yitbarek. In Season 1, we tracked the emergence of open source.
[00:39:12] MAN: I think a world without open source is almost bound to be evil.
[00:39:17] SY: In Season 2, we pushed the limit of what developers can shoot for.
[00:39:21] MAN: One day we’re going to put humans on Mars. We’re going to explore even further to find Earth 2.0.
[00:39:28] SY: But we cannot wait to share Season 3 stories with you. Each episode takes you further into the world of programming languages. We’ve been out on the road, listening to hundreds of developers and sysadmins, and your excitement for languages, your curiosity has inspired us to devote a whole season to exploring their secret histories and amazing potential.
[00:39:50] MAN: The language I love the most right now is Python.
[00:39:56] WOMAN: Okay. I know this sounds weird, but a language that I love is VAX Assembler.
[00:40:30] WOMAN: We now see all of these collaborative projects that are interwoven. So it’s quite an evolution.
[00:40:37] WOMAN: Most programming languages, you can just learn a bit and you can really make it do whatever you want.
[00:40:43] SY: It’s a meeting of the minds between humans and our technologies, a journey that extends the possibilities of programming past anything that’s come before. Command Line Heroes Season 3 drops this summer. You can subscribe today wherever you get your podcasts so you don’t miss an episode. Check redhat.com/commandlineheroes for all the details.