Hi! My name is Susan, and I am currently a Principal Data Scientist at Clearco.
Clearco, formerly known as Clearbanc, is an ecommerce investor providing non-dilutive growth capital to founders. Clearbanc's co-founder and president, Michele Romanow, has been on the Dragons' Den TV show (think Shark Tank in Canada), where she noticed that for early-stage companies, raising funding via equity can be way too expensive for founders.
So Clearco uses machine learning and automation to provide fast, non-dilutive funding, with less friction from human biases than traditional funding (banks, venture capital).
Naturally, the machine learning components are what my team works on. There are the models themselves, as well as the infrastructure that allows inference from multiple models to be served on demand. Lately I have been leading the ML infrastructure improvements with a focus on scale, as we recently raised a Series C, putting Clearco at unicorn status.
To sum up: My team works on augmenting Clearco's funding products with machine learning and data science. This includes ML model training, data quality, infrastructure, and automation.
My educational background was in Economics, which heavily focused on inferential or predictive modelling, using data from financial markets, product pricing, household earnings, and so on. Sounds quite a lot like data science in industry, doesn’t it?
During my studies, Econometrics gave me a solid understanding of statistics, and upper year courses required a lot of calculus and matrix algebra, which were invaluable when I started self-learning about machine learning algorithms.
But, there was a catch - I only learned statistical programming through a proprietary software called Stata, not the languages common in industry, such as Python or R.
However, I had been using Python for a couple of years due to programming video games for fun. One day a friend mentioned to me, “you have these two skills, in statistics and programming - have you heard of a field called data science?” I had not, and googled it that day. It was a perfect, almost accidental combination of my knowledge of statistics and Python. I write about this entire process in this two-part article series.
As a Principal Data Scientist, I find that most of my work is around figuring out unknowns for larger scale data science initiatives. My responsibility is being good at mapping untrodden ground from 30% resolution (fog of war) to 70% resolution or more (scoped out task chunks). Then, I pass on the knowledge to the team, so they can complete the task chunks in finer detail.
In other words, I prototype, design and establish the paths to production, and then delegate.
To answer the question of day-to-day, I'd say for ~3 days of the week, it hovers more on problem solving and technical decision making at the mid to high level. Then for the other 1~2 days' worth of the week, I am deep in the weeds coding when the project is at the initial prototyping stage, since I focus on getting the end-to-end flow to work.
Once the system design and pipelines have been further solidified during this stage, I can then take a step back, and let others complete the details in the ML models, such as improvements, logic, tests, logging, and so on.
This was probably the biggest transition from a new grad to someone more experienced - learning to not do the minute things that don't make sense, but instead let the team learn what I did, to scale up the team. Being a multiplier, not an IC who contributes linearly to the hours I spend.
This approach makes the best use of my skills and of my broader context about the company's products and how ML fits into them, a perspective unique to the DS team rather than the broader tech team. It has the added benefit that the solutions are tailored to what Clearco needs, keeping us nimble enough to mix and match the functionalities that scale our machine learning operations.
A major part of my role is also knowledge-sharing via lunch and learns, and capturing the content of those talks in documentation on best practices. I often help out team members through Slack calls and chats to unblock them, working as a multiplier instead of trying to be the “tank” of the team, to use a video game analogy.
Code reviews are also essential, and I have been experimenting with blocking out a regular time to do them. Since I am familiar with a broad part of the stack, not just the cozy corner of data science, I provide feedback and leadership on code quality as well as design.
This is a loose framing, but in my experience it is important to communicate through the lens of the following: How does it lead to more customers, retention, or a better user experience? Pick something that the specific business cares about.
Recently on the systems side, I needed to make the case that a new framework or piece of infrastructure could help us serve more customers. In the end this still leads to a better user experience.
Some initiatives fall under "improving developer experience" or "reducing technical debt". These two types of initiatives basically lead to faster iteration time, meaning faster reaction of the company and our product to the market and user needs.
Even if they might sound abstract, especially without further explanation, to folks outside of the engineering team, they still contribute to the user experience, which can contribute to users staying with us, and so on. In the end, I think there is no secret here apart from practicing communication skills, and being able to find a middle ground. I personally use what I call "data science storytelling", which I've written about in this article.
One awesome thing that I experienced at a previous employer, Canada's largest telecommunications firm, was a 3-week training period at a call centre.
We listened to phone calls, and also had to take support calls on the phones. It was there that I learned so much about the end user, and the empathy that I could bring to my org, which was quite far removed from the end user.
After that experience I had another call centre shadowing experience, which was only for one day, but we shadowed the top performers and saw how they interacted with internal tools which were related to the ML project we were building.
If answering support calls or tickets personally isn't possible in your org, speak to those who are more customer facing, such as customer success. Or join their Slack channels, and see what issues are being reported (my approach at Clearco). Clearco also shares customer stories on the weekly company-wide calls, which helps those in non-customer-facing roles keep those real users in mind.
Due to the nature of my role, I am at the front of dealing with unknowns in a DS project. To use an analogy of a coloring book: I am responsible for leading and drawing the black outlines to resemble a distinct object, and then letting others fill in the colors, add in shading, and help improve the outlines based on what they find in the trenches.
Faced with problems that haven't been solved before at Clearco, these are my loose steps:
First of all, I need to see what this ML approach should achieve in the context of the business. This can be how and why it's integrated into a product, and how it's integrated into the broader tech team's production processes.
This step is to establish a loose list of approaches that are feasible, and others that will be more difficult to implement due to, for example, how easy it will be to integrate with the web stack. In this situation, my (random) experience as a video game developer, as well as a contract full-stack web developer, helps a lot. Perhaps counterintuitively, my strength here is to provide perspectives that aren't just in the cozy corner of data science.
Secondly, I spend time comparing what has been done in industry, for similar use cases. This is where my experience hosting ML livestreams with leading researchers at Aggregate Intellect (14k+ YouTube subscribers) comes in handy - I have a habit of keeping up with engineering blogs and research papers from large tech companies.
Eugene Yan's curated list of applied ML papers is also a resource I have recommended to my team, as well as to my LinkedIn connections.
Finally, I put my hands to the keyboard and prototype, sometimes trying out 2~3 or more frameworks. There is a lot of iteration between researching and prototyping - more often than not, there are tons of mistakes that are made in this step, which is the intent.
One cannot be afraid of mistakes - which I think is good advice for beginners and experienced folks alike. No amount of online courses or sanitized datasets can really compare to the leveling up we gain from simply doing things, even if there is no playbook at all. (And it's my current job to figure out the playbook.)
Our DS stack integrates right into the main tech stack. Our tech team ships to production twice a day, thanks to CI/CD and git integrations.
One downside for new data scientists joining is that they might not expect to need so much additional knowledge. Back when I worked at a large enterprise, this was abstracted away from me.
In my current role as a Principal Data Scientist, I take on a lot of this knowledge from the external tech team to pass it to the DS team. My previous, somewhat unusual mix of experience in software development gives me an advantage in terms of communication.
Data scientists need to know git, and know it well (I'm super serious about this), which is something I've tried to bring to my teams. It is easy to neglect, especially for new data scientists or those who have had no reason to really understand the software development process, but I'd argue it improves their work, and its reproducibility, many times over.
In a previous role we had a person dedicated to helping with the ongoing reporting of ML model performance by setting up an automated dashboard.
The model performance was then presented to executives on a regular cadence as part of a larger initiative update. So there was transparency on how the models were actually performing as real users interacted with them - as the person who built the model, it was exciting but also nerve-wracking, in the sense that there would be nowhere to hide, and it was outside of the sanitized training data. For that project in particular the ML model achieved a ~2x lift over non-ML approaches on the metrics the business had previously decided on, so that was a massive win for me as a DS, as well as for the product.
On the maintenance side, there were changes in the business logic we had to deal with, for example, updating labels of the recommended items. Thanks to the way I designed and implemented the training pipeline, the new labels in the new dataset could simply be added to the pipeline, which was then re-run to populate the recommendations again. The training was automated to re-run nightly.
The manual effort for the most part was the manual data input of new labels, in the format the model intakes, and also analyzing to see if the new labels would break anything (not likely, but possible). Thankfully, after doing this manual label addition once, the rest of the model pipeline portion was automated.
We had considered automating this manual portion many times, but due to the format that we received it from the business (Excel, sometimes with column names and order changed randomly), it was placed at a very low priority.
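To make the Excel problem concrete, here is a minimal sketch of what tolerating renamed and reordered columns could look like. It is not the approach from that project (the automation was never built), and the column names `item_id` and `label`, along with their aliases, are hypothetical. It works on rows that have already been loaded from the spreadsheet, so it needs nothing beyond the standard library.

```python
# Hypothetical canonical columns and the header spellings we are willing to accept.
# The real sheet's columns are not described in the text.
CANONICAL_ALIASES = {
    "item_id": {"item_id", "item id", "itemid", "id"},
    "label": {"label", "new label", "label name"},
}


def _clean(header):
    """Lowercase a header, treat underscores as spaces, and collapse whitespace."""
    return " ".join(str(header).strip().lower().replace("_", " ").split())


def normalize_rows(rows):
    """Map a header row plus data rows onto canonical column names.

    Raises ValueError when a required column cannot be found, so a layout
    change fails loudly instead of silently feeding bad labels to training.
    """
    header, *data = rows
    alias_to_canonical = {
        _clean(alias): canonical
        for canonical, aliases in CANONICAL_ALIASES.items()
        for alias in aliases
    }
    positions = {}
    for index, raw in enumerate(header):
        canonical = alias_to_canonical.get(_clean(raw))
        if canonical is not None and canonical not in positions:
            positions[canonical] = index
    missing = [name for name in CANONICAL_ALIASES if name not in positions]
    if missing:
        raise ValueError(f"Unrecognized sheet layout, missing columns: {missing}")
    return [
        {name: row[positions[name]] for name in CANONICAL_ALIASES}
        for row in data
    ]
```

Even a sketch like this shows why the work stayed low priority: the alias list has to be maintained by hand every time the business invents a new header.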
In cases like these, there is always going to be a backlog of "good to have" features. Can't have them all! The main decision-making rationale is to weigh the effort, the frequency of use, and the benefit.
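One possible way to put numbers on that trade-off is a rough payback estimate. This is only an illustrative sketch, not a formula from the project: the `BacklogItem` fields, the 8-hour working day, and the idea of measuring benefit in hours saved are all assumptions.

```python
from dataclasses import dataclass

HOURS_PER_DAY = 8  # assumed working day, used to convert effort into hours


@dataclass(frozen=True)
class BacklogItem:
    name: str
    effort_days: float          # estimated cost to build
    uses_per_month: float       # how often the manual step happens
    hours_saved_per_use: float  # benefit each time it is used

    def payback_months(self) -> float:
        """Months until the time saved equals the time spent building."""
        hours_per_month = self.uses_per_month * self.hours_saved_per_use
        if hours_per_month <= 0:
            return float("inf")
        return (self.effort_days * HOURS_PER_DAY) / hours_per_month


def rank(items):
    """Order backlog items from fastest to slowest payback."""
    return sorted(items, key=lambda item: item.payback_months())
```

Under these assumptions, a task that takes many days to build but runs only occasionally lands at the bottom of the list, which matches the low priority given to the Excel automation.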
For folks that build ML in production that fits into a product, it's important to have a "minimum viable product (MVP)" mindset - envisioning how things work end to end.
It's not just me having fun on my little keyboard tinkering away, doing my research for ages (if I were in a research branch of an org, like DeepMind, then of course this is fine) - there is an actual product and an actual customer.
In terms of the dev work, I like to do an end to end implementation first, e.g. if it's meant to use an API endpoint in production, use a .pkl first, then sub it out as soon as feasible so I can show the stakeholder. This exercise means that I unearth a lot of potential problems faster, and don't traverse down too far into a dead end.
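A minimal sketch of that ".pkl first, API later" idea is below. Everything in it is illustrative: the `Scorer` interface, the linear scoring, the `approve`/`review` outcomes, and the placeholder URL are hypothetical. The artifact is a plain dict of parameters rather than a real trained model, to keep the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Protocol, Sequence


class Scorer(Protocol):
    """The only thing downstream code is allowed to depend on."""

    def score(self, features: Sequence[float]) -> float: ...


class PickleScorer:
    """Prototype backend: loads model parameters from a local .pkl file."""

    def __init__(self, path):
        with open(path, "rb") as handle:
            params = pickle.load(handle)
        self._weights = params["weights"]
        self._bias = params["bias"]

    def score(self, features):
        return self._bias + sum(w * x for w, x in zip(self._weights, features))


class ApiScorer:
    """Placeholder for the production backend; the HTTP call is left unimplemented."""

    def __init__(self, url):
        self._url = url

    def score(self, features):
        raise NotImplementedError("swap in the real request once the endpoint exists")


def decide(scorer: Scorer, features, threshold=0.5):
    """End-to-end step that stays unchanged whichever backend is plugged in."""
    return "approve" if scorer.score(features) >= threshold else "review"


def demo():
    """Write a toy artifact to a temp dir and run the full path through it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "model.pkl"
        path.write_bytes(pickle.dumps({"weights": [0.5, 0.25], "bias": 0.0}))
        return decide(PickleScorer(path), [1.0, 2.0])
```

Because `decide` only sees the `Scorer` interface, replacing `PickleScorer` with a working `ApiScorer` later should not touch the rest of the flow, which is the point of getting end to end working early.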
I think this makes a huge difference for data scientists gaining experience with production work - I've seen many data scientists whose work is rotting away in some Jupyter notebook, or in unused modules and scripts in the team repo.
I think it is a meta-skill to be able to
The importance of understanding the product, influencing stakeholders (not only interacting passively), and scaling by doing things that are uniquely my strengths.
Joining any research journal discussion or meet-up which actually forces you to 1. read the paper 2. discuss the paper (much more involved than passively watching/consuming) 3. question parts of the paper (even more involvement, because this requires critical thinking, which requires a more in-depth understanding of ML concepts and practical uses).
To keep learning at work, when I've done something repeatedly, I try to expand a bit more to something adjacent, but not irrelevant to my role. For example, if I've been building too many of the same type of model (let's say someone has built 3 churn models in a row) then maybe it's time to try another type of ML as a learning exercise. Or expand a bit into understanding how to optimize queries (work with your Data Eng).
Eugene Yan's paper curation work is a role model for me :)