Math in Data Science (feat. Bren Cavallo)
Alumni Aloud Episode 23
Bren Cavallo earned his PhD in Mathematics from the Graduate Center. He’s a data scientist at M Science, a data-driven research and analytics firm that uses unconventional data sets to uncover strategic insights on trends for leading financial institutions and corporations.
In this episode of Alumni Aloud, Bren talks about the similarities and differences between data science and academic research; how agile design sprints can enhance your productivity; and the benefits of taking a bootcamp to sharpen your skills and work portfolio for emerging tech jobs where credentials are always evolving.
This episode’s interview was conducted by Anders Wallace. The music is “Corporate (Success)” by Scott Holmes.
Listen
Listen to the episode below, download it, or stream it in Apple Podcasts (or your preferred podcast player).
Transcript
(Music)
VOICE OVER: This is Alumni Aloud, a podcast by Graduate Center students for Graduate Center students. In each episode, we talk with a GC graduate about their career path, the ins and outs of their current position, and the career advice they have for students. This series is sponsored by the Graduate Center’s Office of Career Planning & Professional Development.
(Music)
ANDERS WALLACE, HOST: I’m Anders Wallace, a PhD candidate in the Anthropology program at the Graduate Center. In this episode, I sit down with Bren Cavallo, who’s a data scientist at M Science—a data-driven research and analytics firm that uses unconventional datasets to uncover strategic insights on trends for leading financial institutions and corporations. Bren earned his PhD in Mathematics at the Graduate Center in 2015.
In this episode, Bren talks about the similarities and differences between data science and academic research, how agile design sprints can enhance your productivity, and the benefits of taking a bootcamp to sharpen your skills and work portfolio for emerging tech jobs where credentials are always evolving.
BREN CAVALLO, GUEST: My name is Bren Cavallo and I’m a data scientist at M Science, which is an alternative data equity research firm. Alternative data is essentially nontraditional data sources in the finance world. Often it might be something you have to buy from an app—like maybe an app reports people’s location data and you can buy the location data from the app and use that to do financial research. That’s just an example of a type of data set. Another might be web scraping results, such as how many seats are left on flights over a long period of time for different airlines. Stuff of that nature. So using that type of data to do investment research.
WALLACE: Can you tell me a bit more about how you came into this work as a data scientist?
CAVALLO: Yeah, so in the summer before my final year at the Grad Center I was definitely spending a lot of time thinking about what I wanted to do. Mathematics is a really great field to enter—a lot of these general tech or finance areas. So I’d been toying with maybe a very traditional financial quantitative analyst role or maybe being a developer. And then I met someone who was a data scientist, and it sounded like that was a great job. He said that people are generally very happy. It would leverage my math background heavily, but also I’d get to learn more tech skills, learn some engineering skills. And, you know, it has a good work-life balance. So that’s why I thought that I’d pursue that direction.
WALLACE: You said that, coming from a math background, you could’ve gone into other routes more quantitative—finance or coding.
CAVALLO: Yeah, I feel like quantitative finance is maybe the classic example for math PhDs. I might be wrong about this. But I feel my first year at the Grad Center, when data science was maybe not so much a field with a name, a lot of people were… the main non-academic job was more of a quant role. I’m not totally sure as to when it blew up, but certainly it seems like it was more of a thing when I was graduating than it was when I started.
WALLACE: Take us back a bit more to talk about your academic background and what your passion is in Mathematics.
CAVALLO: So I started studying Mathematics as an undergrad at Vassar College. I thought it was a very good department there and I liked a lot of the courses I took. It seemed like a really good area and I was very excited at the time by the higher mathematics that was much more proof-based and logic-based rather than just sort of “solve this calculus problem.” So it really seemed like there was a lot of interesting stuff to learn. And as I was graduating I really felt like I wanted to learn a lot more. So it made sense to do a PhD where you can support yourself while actually really diving much deeper into the field. I came here and I took a lot of really great classes. I always thought the professors here were really really great. I studied computational problems and group theory. I always had a lot of fun doing mathematics and really wanted to learn more, so that’s why I went here.
WALLACE: What’s a typical day in the office like for you?
CAVALLO: So, as a data scientist, I manage a lot of processes that run daily—it might output results for a product that I manage. So I will often spend the beginning of the day going over automatized QA processes just to make sure they all work.
WALLACE: QA processes—like coding that’s extracting certain data?
CAVALLO: Yeah, like for instance, I might have a product that says “Okay, we’re gonna tell clients that this metric should be this number.” So I want to make sure that, for all of these, the things look reasonable. That things updated properly—like we’re not getting two answers when we should be getting one answer—stuff of that nature. So I’ll spend a bit of time going through that stuff. There are people who work on different projects that I manage, different people that I manage, so I’ll often chat with them about their projects at the beginning of the day. Additionally, I have my own projects that I work on, that I develop.
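[Editor's illustration: the kinds of daily checks described above (one answer per metric, data that updated on time, numbers that look reasonable) could be sketched roughly as follows. This is a minimal sketch and not M Science's actual process; the function name `run_daily_qa`, the row keys `metric`, `date`, and `value`, the metric names, and the range limits are all hypothetical.]

```python
from collections import Counter
from datetime import date


def run_daily_qa(rows, as_of, lower, upper):
    """Return a list of human-readable problems found in one day's output.

    rows: list of dicts with hypothetical keys "metric", "date", "value".
    """
    problems = []

    # Check 1: each (metric, date) pair should yield exactly one answer.
    counts = Counter((r["metric"], r["date"]) for r in rows)
    for (metric, day), n in sorted(counts.items()):
        if n > 1:
            problems.append(f"{metric} has {n} values for {day}")

    # Check 2: every metric should have been updated for the as-of date.
    latest = {}
    for r in rows:
        latest[r["metric"]] = max(latest.get(r["metric"], r["date"]), r["date"])
    for metric, day in sorted(latest.items()):
        if day < as_of:
            problems.append(f"{metric} last updated {day}, expected {as_of}")

    # Check 3: values should fall inside a range someone judged reasonable.
    for r in rows:
        if not lower <= r["value"] <= upper:
            problems.append(
                f"{r['metric']} value {r['value']} on {r['date']} is out of range"
            )

    return problems


# Made-up rows: one duplicated metric, one stale and out-of-range metric.
rows = [
    {"metric": "daily_sales_index", "date": date(2024, 1, 2), "value": 101.5},
    {"metric": "daily_sales_index", "date": date(2024, 1, 2), "value": 99.0},
    {"metric": "store_visits", "date": date(2024, 1, 1), "value": 5000.0},
]
for problem in run_daily_qa(rows, as_of=date(2024, 1, 2), lower=0.0, upper=1000.0):
    print(problem)
```

[In this sketch an empty list means the day's output passed every check; a real setup would likely use per-metric ranges rather than one global pair of limits.]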
One thing that’s very common in tech is that you sort of follow what’s called agile methodology. So my company, we do weekly sprints. So essentially at the beginning of every week, I say what I want to accomplish by the end of the week in terms of developing and answering certain questions about a product I’m currently developing. So I’ll try to check boxes off that list.
WALLACE: Because that’s a big buzzword too—agile methodology—and listeners may not know really what that entails. In a nutshell, beyond the checklist function you describe, what does it mean really?
CAVALLO: So we’ve used it at both places that I’ve worked at. So at my current job and then at my last job, which was in programmatic advertising. So basically you have someone—a sprint master—who determines a sprint calendar that might be once every one week or two weeks or a month. And people, at the end of this sprint calendar or the beginning, will tell everyone what they accomplished in the previous sprint and what they’ll do in the next sprint. So everyone has a very good idea of what everyone else is working on, where everyone is in their projects. Potentially, if there’s something you might want to use that’s related to what someone else is doing, you kind of get an opportunity to hear what they’re doing and sync up on that.
I guess the entire purpose of it is that it’s a very sort of regimented way to make sure that everyone knows what they should be doing, and to make sure the business is aware of what they should be doing and that whatever they’re going to be doing is aligning with business goals. I think these are sort of the main highlights for me. This is the kind of thing that certain people take very seriously, and it can get very complicated in terms of how it works, but I’d say those are some of the main features that have been true across both companies that I’ve worked at.
WALLACE: Wow, that’s an interesting technique. It sounds like it balances accountability and then enhances collaboration.
CAVALLO: Yeah, exactly. And I’d say especially as someone coming from grad school, you definitely see a lot of people who have a problem and they don’t seem to work on it very often. And they seem to kind of languish and take a while and wait maybe until something strikes. But in a business, where things have to move a lot more quickly, you can’t really rely on someone to just eventually work. So you kind of need to meter everything out and make sure that everyone is aware of what everyone else’s goals are.
WALLACE: Yeah, that makes a lot of sense. And then you were talking about your day and how you have projects of your own. Is that in addition to doing something that’s directed to you?
CAVALLO: Yeah, this is something that is directed to me. And I guess I’m lucky that I’m currently getting a certain amount of autonomy in terms of developing this project. A lot of our projects are based around individual datasets we get. So we acquired a dataset and we want to find a way to monetize that for our clients. This is a sell side research firm. So essentially we take the datasets and we perform various analyses on them. Most of the teams actually produce largely written reports in addition to a variety of products. But we, largely, on the data science team do general data science projects. So a lot of them would be monetizing an individual dataset by cleaning it, processing it, performing certain predictive analytics on it so that you can actually use the dataset to measure something in the real world—use the dataset to measure how a given company’s revenue might be changing on a daily or monthly basis. And with this dataset there’s a very high barrier to entry I’d say.
WALLACE: Meaning?
CAVALLO: Because these datasets are often very very messy. So it’s a huge investment, I think, for any of our clients to create a team that would be able to do the same kind of work that we do. So ultimately I think that’s why it’s useful for many clients. Rather than hiring a large number of people to go through a dataset that have certain skills, you can just get our research and perform much of what you would get if you were to have the dataset in-house.
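[Editor's illustration: the pipeline described above (clean a messy dataset, process it, then use it to measure how revenue changes month to month) might look something like this in miniature. It is a sketch under stated assumptions, not the firm's method: the helper names `monthly_totals` and `month_over_month`, the record layout, and the rule that negative amounts are errors are all hypothetical, and the sample records are made up.]

```python
from collections import defaultdict


def monthly_totals(records):
    """Sum cleaned transaction amounts by month ("YYYY-MM").

    records: iterable of (iso_date_string, amount) pairs; amount may be
    missing or malformed, as in a messy vendor feed.
    """
    totals = defaultdict(float)
    for iso_date, amount in records:
        try:
            value = float(amount)
        except (TypeError, ValueError):
            continue  # cleaning step: drop rows whose amount cannot be parsed
        if value < 0:
            continue  # cleaning step: drop negative amounts (assumed to be errors)
        totals[iso_date[:7]] += value  # "YYYY-MM" prefix identifies the month
    return dict(sorted(totals.items()))


def month_over_month(totals):
    """Return {month: fractional change versus the previous month in the data}."""
    months = list(totals)
    return {
        cur: (totals[cur] - totals[prev]) / totals[prev]
        for prev, cur in zip(months, months[1:])
        if totals[prev] != 0
    }


# Made-up records, including unparseable, missing, and negative amounts.
records = [
    ("2024-01-05", "100.0"),
    ("2024-01-20", "n/a"),
    ("2024-01-28", 100),
    ("2024-02-03", "250"),
    ("2024-02-10", None),
    ("2024-02-15", "-40"),
]
totals = monthly_totals(records)
print(totals)
print(month_over_month(totals))
```

[Most of the work in this sketch is in deciding which rows to keep, which fits the point made above about messy data being the barrier to entry. Whether negative amounts are errors or legitimate refunds is exactly the kind of question that would have to be settled per dataset.]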
WALLACE: So can you tell me more about the atmosphere in your workplace?
CAVALLO: You know, we work in a large office building. It’s a fairly casual office, very small—kind of open office plan with some cubicles. Not exactly cubicles, but they’re sort of pods—like groups of four—maybe the size of an office at the GC that four people will each have a desk within rather. I think the sort of atmosphere you have varies team by team, but I’d say especially in my team, because it is a data science team, it’s a lot of people coming from academia—they try to make it a little more casual. Most of us don’t really interact with clients, so I don’t think there’s quite as much of a necessity for us to be formal every single day.
WALLACE: That’s interesting. And one thing you mentioned before was that unlike academia there’s a faster pace to produce.
CAVALLO: Oh, yeah. So my company is in a way a start-up. We’ve essentially been sort of adopting this strategy I think for around two years, maybe a little bit more. And also this area—alternative data—has really really blown up. This is an extremely hot area, to the extent that maybe if you were to become a data scientist and have the title data scientist in any financial role right now, there’s a very good chance that you would be working with alternative data. You know, considering that the market is so huge right now and that it’s so early, I think that we’re trying to really push our products very quickly. So one thing I’d say is very true of my company—with any start-up maybe—is pretty ambitious deadlines and definitely trying to help people make them.
WALLACE: Was that a hard transition from academia where, as you said, it’s so much more unstructured?
CAVALLO: Well, my advisor was actually very focused on writing papers and kind of pushing me out of here. So in a way she definitely—and I think this was one of the good things about working with her—she definitely made it clear to me that it’s very important that, if you spend time working on something, maybe you’re not going to solve the problem immediately, but you at least need to be able to articulate what you’ve done and what you’ve learned from it. And in a job like mine, that’s very important.
And one of the reasons, as a data scientist or on a data science team, that you hire so many researchers is that a lot of it is very similar to academic research. It’s just much smaller problems often. I mean for the kind of stuff that I do that’s more sort of corporate data science rather than actually doing machine learning research as a data science role. But this stuff—these kind of nontrivial problems that take a certain amount of time—when you’re presented them you really have no idea how you’re going to do them, but you do kind of need to really make it clear to people that you’re making progress. And then after you finish your problem, you need to make sure that people are actually using it and using it as you intend them to use it, so that you can kind of get some adoption and move up in your organization and what not.
WALLACE: And it sounds like—if this is fair to say—you’re describing it not really as a contrast to academia but as a continuation of things you experienced there.
CAVALLO: Yeah, and I think that a lot of very successful people in academia might take sort of a similar process. I mean, definitely there are plenty of people in academia—plenty of professors I had here—who write dozens of papers a year it seems like. Which in mathematics is especially challenging—I have no idea how some of them manage to do that. I’d like to think that they work very hard, and they definitely have sort of these highly effective and efficient habits, and I think that’s also very useful in the corporate setting.
WALLACE: So students could even take on some of this agile methodology for themselves to be more productive.
CAVALLO: Yeah, perhaps. My advisor had a grant that I was getting paid through a little bit. And to justify having me on the grant, she made me write down everything that I did every single day. In fact, I forget if it was just a daily summary or if it was even more than that—an hourly summary of every day. So I had to say for every day how many hours I spent working on any one thing that was related to the research on the grant. I’m not so effective now. I mean, I’m not doing this at my current job. But maybe if I did—it would probably be less fun for me—but I’d maybe be more effective if I did something like that.
WALLACE: One thing you mentioned at the start of our interview is that your company is hiring. One question that may fit in that ballpark is, what do you wish people knew about your field of Applied Data Science that they may not know? A misperception people may have?
CAVALLO: I can think of one—and this one might be disappointing for some people who are entering the field. I’d say the main thing that’s important for someone as a data scientist—the best skill for someone at any company I’ve been at (and I might be biased because I’ve been at very small companies only rather than a larger company like a Google)—the most important skill as a data scientist is programming. So the better you can program and build things, the better you can iterate on any of your research ideas and test more models out, do analytics more quickly and in a more thorough way, in a more replicable way.
So good engineering practices and good coding skills is really the most important thing. And in that vein, data science is not all machine learning. Even though at every role I’ve had I’ve done machine learning or I’ve spent a lot of time working on machine learning problems, I’d say that especially in terms of my day to day, really what’s most important is engineering skills and general problem-solving skills.
And also for a lot of these corporate problems, machine learning is not always the best way to solve them. And maybe this is just sort of my opinion based on what I’ve seen. There may be better sort of machine learning ways to solve things than I’ve previously come up with or done on my own, but I’d say that in terms of good business-oriented solutions or just practical solutions for having a product that you can deploy or you want to behave in a very regimented way, often it’s better to not even use a machine learning solution—to use something that’s a bit more intuitive. It’s clear that there are certain things that only really AI can do well or that deep learning can do well in a scalable way. But unless you’re sort of a deep learning researcher that’s going to be entering a deep learning job, then probably or very likely you’re not working with those kinds of problems.
WALLACE: What do you enjoy most about your work?
CAVALLO: I definitely like the fact that I get to spend a lot of time solving these difficult non-intuitive problems. I guess it’s always intimidating at first when you see something and really have no idea how you’re going to do it. And especially, unlike the Grad Center, I sort of have to put a deadline on this thing. Even before I know how I’m going to do something, I do have to say, “Look, this should be done in some form or another in four months.” And it can kind of be intimidating at first. But then you sort of go and look at the data, you answer some small questions for yourself, you kind of put things into a much more tractable scenario and give yourself increasingly more tractable problems, and eventually you can kind of come up with an idea of something you can solve and you can build and then you build it.
WALLACE: That’s a great model for tackling these complex challenges.
CAVALLO: Yeah, I think it’s definitely part of the reason I like mathematics. And it’s also fun, too, as a data scientist because you can spend a while doing problem-solving, but then you have to build and that’s much more of a concrete thing. So maybe if I have a problem that’s six months long, I’ll spend one month thinking, one month building, one month thinking. And it’s kind of two different types of problems, but generally the building problems are a lot more concrete. They’re a lot easier to solve, and that part is much more focused on coding and good coding practices and optimizing your code.
WALLACE: Interesting. So what do you find the most frustrating about your work or your job?
CAVALLO: Well, I think with my job, specifically this current job, alternative data is a very difficult area, and the fact that essentially your product is very reliant on a dataset that you have no control over at all… But then your career is reliant on being able to use that dataset to provide a product to people. And all kinds of things can happen that just would not be good for me. Like maybe a dataset just completely breaks. You know, these vendors—their technological infrastructure is very important to whatever process I run. So I’d say those kind of unknowns, and the fact that things can just break and it has nothing to do with you…
WALLACE: And then the value of your work has evaporated?
CAVALLO: I wouldn’t necessarily say it’s totally evaporated. For instance, if a data vendor says, “We can’t give you this data anymore,” then yeah, that project has evaporated. But I’d say, for instance, let’s say that a vendor messes up and it makes something look bad for a client of ours, or we give a bad result to a client, or potentially a client notices that our number is bad or we can’t deliver something to our client on time… So our client is ultimately holding us responsible for that. Which is not to say that happens frequently by any means, but the fact that it can happen is very scary. And we have a lot of infrastructure checks to make sure that’s not going to happen and that there are lots of layers that would need to break for it to happen. But for instance, if one of our vendors that fuels one of our products just says, “We’re not going to give you any data for a month,” I mean, I’m sure that would be a major breach of their contract and I don’t know exactly what route we’d go down that way, but I can’t really think of how I alone as a data scientist could solve that problem.
WALLACE: Did you ever see yourself becoming a professor? Or you always knew you would leave academia?
CAVALLO: Not really. It was never something that I wanted to do much in the first place.
WALLACE: Being a professor?
CAVALLO: Yeah… I think, and I don’t know if you relate to this, but I feel like—I don’t know why this is the case, but I feel like when a lot of people start in grad school, they don’t think of getting tenure as nearly as competitive as it is. So it’s the kind of thing that maybe I thought at the time that if I was really good at this and I just saw the jobs coming my way, then I’d be a professor. But I think my first week of grad school I learned how unrealistic that would be and it made much more sense to go this route. Because it was never something that I wanted to do. I mean, I never really wanted to be a professor.
WALLACE: Okay, so that was a natural thing. You said, well, that’s fine, because this was a path that would open a lot of different doors.
CAVALLO: Exactly. I mean, my goal at the time was really just to learn more math. So it was never really to do anything specifically with it. But I think if you had even asked me my senior year of college or my summer before I started at the Grad Center if I thought I was ever going to be a professor or even if I wanted to be a professor, I probably would’ve said no to both of those. And I think a lot of math professors or a lot of fellow students I was with at the Grad Center, they really liked or they really wanted to become professors—like this was their dream and they loved math and that’s all they really wanted to do in the future. But I guess because of the fact that, especially when you live in New York, the amount of money you make really starts to wear on you. And when you’re a grad student and you can only afford to live in some remote neighborhood, and all your friends are other grad students living in other remote neighborhoods, and everyone’s in a three or four bedroom, you really like the idea of “what if I had a one bedroom in a central neighborhood.” And that’s the kind of thing that you can just afford immediately after grad school if you enter in one of these more quantitative roles.
WALLACE: Is there anything you miss about academia?
CAVALLO: Oh yeah, I mean I had a lot of fun here often. I loved a lot of the classes I took. I had a lot of really good friends here. They’re still friends, but it was nice when I got to see them every day. Stuff of that nature. I really miss the—this is again coming as a mathematics student—when you can just kind of do math anywhere.
I liked the overall freedom I had with my schedule. To be able to—if I heard about an interesting place to eat in the city—I didn’t have to go there after work when it would be very crowded. I could go on a Monday lunch or something like that. It’s very easy to kind of enjoy the city. I never had to take a rush hour train back then. It wasn’t until I worked that I honestly ever—that can’t be totally true—but I sort of remember right around when I started working, being surprised at what the typical NYC rush hour commute was like. I’d never had anything like that at the Grad Center where my earliest class may have been like 10AM or something.
WALLACE: So were there any mentors that helped you make the transition into your role as a data scientist?
CAVALLO: Sure. My advisor and a lot of my professors I had here were instrumental in shaping me throughout grad school. I took a really great, just general big data class, looking at the tools surrounding big data. Starting to read some very simple sort of machine learning, almost data science papers. I started seeing what people who do machine learning research—seeing what kind of papers they write that might be approachable to someone from my background. And I took another class on data visualizations, and that was a really interesting class. It had all these really amazing speakers. The professor brought in all these speakers, some of whom were really big in the world of data visualization, to give these great lectures. And a lot of it was stuff that I never would’ve even considered thinking about, but just getting to see it out there like that was incredible. So for anyone who’s at the Grad Center currently who wants to make a transition into data science, there are a lot of great online resources and I’m sure some of you are aware of those, but I also would say, don’t discount some of the really good classes you have here.
WALLACE: Are there any credentials that someone looking to move into this field… that it would be worth them pursuing? Or is it still young enough of a field that there’s not?
CAVALLO: I wouldn’t say any specific credentials. What I would say though is definitely learning programming. So I guess by credentials… I don’t know what certification would help… but learning Python or R maybe. But Python is the one I’ve used at both my jobs, and I feel that’s maybe the best language right now to learn as a data scientist. People might disagree with me on this, but learning Python as well as you can would be very useful. Especially any Python resources that focus on data science or data analysis are very important. Learning machine learning and statistics is important, and again for the machine learning—I ask a lot of machine learning questions in my interviews because people I’m interviewing say that they know machine learning, and that’s one of the few shared topics that we can talk about, which is why I like to ask those questions. So learning machine learning can be very useful on the job, but it’s also definitely very useful for interviewing.
WALLACE: Have there been other ways that getting a PhD has benefitted you in your career, besides the hard programming skills?
CAVALLO: Well, I’ll say I didn’t really do any programming as part of my PhD. That was all stuff I had to learn on my own. Or I took a data science boot camp actually. So I did this one called Data Incubator that I think I saw a notice for on the wall of the Grad Center, so I applied. I had already started learning some programming. They didn’t really want to take people who already know some programming. They give you sort of a test and I did the test and got into the program. It was one of those programs where it’s free or may or may not be free, but they hook you up with a job afterwards, so that’s how they get paid. So that’s how I got my first job—through them. So I guess in a way, you asked about credentials, thinking about it now, doing a boot camp is a way of showing you have more data science oriented skills going on than you would’ve just gotten through your PhD.
WALLACE: How long was the boot camp?
CAVALLO: I think eight weeks.
WALLACE: Eight weeks and then they hook you up with a job?
CAVALLO: Eight weeks, and during that time, you’re applying to jobs sort of through them. So you’re applying to jobs that have an affiliation with them.
WALLACE: So this boot camp model, I’ve also seen this in other tech-inspired places. It seems like it’s a bigger thing now.
CAVALLO: Yeah, I think the people who are doing it are doing pretty well.
WALLACE: Are there any other resources at the GC or experiences you had here that helped you?
CAVALLO: Yeah, I think teaching helped with my communication. For instance, at my job we’re hiring a lot and we’ve had a lot of new employees lately. So I hope that my teaching experience helped teach them sort of how our processes work at my job. Of course, just focusing on a research problem helps. I think having a research background is really important as a data scientist just because you are going to have this sort of very vague problem and you have to try to tackle it. That’s something I don’t think I would’ve been ready for coming out of undergrad.
I don’t think it’s necessary that every data scientist you hire has a PhD or a Masters in a quantitative field, but it is rare to have someone coming from undergrad—or if you’re an undergrad with research experience that would be fine, too—but it’s rare to see someone coming from undergrad without research experience that would have some of this research mentality.
WALLACE: So just being able to dig deeply into a problem?
CAVALLO: Yeah, and feeling okay with failure and stuff of that nature.
WALLACE: Managing variables, right, and being okay with failure, uncertainty, being self-starting, confronting those questions.
CAVALLO: But I think math is a really good skill for going into data science. So definitely being around some really great professors and seeing how they approach problems and how they deconstruct topics—especially thinking about if you learn something you have to ask the question to yourself: “Why is this important? What’s important about this? Why is someone emphasizing this topic over another?” You know, if you’re saying something about… giving me a fact about penalized regression and I have to say “Why is this a fact you’re bringing up to me?” That’s something you have to answer for yourself often.
WALLACE: Knowing what the tools in the toolbox are and where and when to use them.
CAVALLO: Yeah, that helps you answer that, I’d say.
WALLACE: Is there anything else you can think of that you want to add or communicate to students or pitch?
CAVALLO: Yeah, well, like I said, my company is hiring.
WALLACE: Your company is…
CAVALLO: M Science. So you can always connect with me on LinkedIn and ask me questions. Or if you have any questions about more online resources that I’d recommend for data science, I could send some of those along, too.
(Music)
WALLACE: That’s a wrap for this episode of Alumni Aloud. I want to thank Bren for coming on the show to share his experiences in data science for our listeners. Remember to stay tuned for more episodes of Alumni Aloud, published every other week during the Fall and Spring semesters. Subscribe on iTunes and you’ll automatically be notified when new episodes are released. Also, check out our Facebook, Twitter, and career planning website at cuny.is/careerplan for more updates from our office or to make appointments with our career counselors. Thanks for listening and see you next time.
(Music)
This entry is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.