I’m a Data Scientist at Bowery Farming, an indoor farming company that is growing fresh produce at vertical farms near major urban centers. I'm part of the data team that is devoted to ensuring that our crops get the experience they need to reach their full potential.
At Bowery we have installed cameras which continuously record images of every crop and we measure lots of environmental factors around each crop (temperature, humidity, irrigation). This allows us to use ML (often CV) and data to measure and infer properties of our crops, and help determine how to grow better crops. We work closely with our farmers, agriculture scientists, and commercial colleagues to build data and ML products that help them do their jobs more effectively.
My role has shifted over time at Bowery. When I was first hired we needed more data infrastructure and data automation work to help us get to the place where we could support more advanced projects (for example, ML) in a sustainable, maintainable way.
For my first few years I built internal APIs, helped define the data team’s development and deployment workflow, migrated us to Snowflake, and set up our data pipelines using Airflow. As we’ve matured I have focused more on data projects that help our farms. These range from analytics work to ML and integer programming.
I'm now part of the “Seed to Harvest” team within Data/AI, which focuses on how to measure and optimize the experience our crops get over the course of their growth, as well as the processes surrounding that.
I spent my 20s in academia, eventually earning a PhD from the University of Toronto in planetary physics. I studied how planets generate magnetic fields, using numerical simulations of the fluids in planetary cores.
I eventually realised that what I liked about physics was using computers as a tool to solve problems, and found I could apply that outside of physics. This led me to move to New York to do an Insight Data Science fellowship, which helped me re-orient the skills I had accumulated during my PhD to be applicable to business problems. I owe Insight Data Science a tremendous debt of gratitude for helping me break into the industry.
I learned machine learning mainly through online courses (e.g. Andrew Ng’s) and by building small, silly side projects.
It’s a happy coincidence that doing a PhD which centered around computational fluid dynamics taught me a lot of skills which naturally transferred into machine learning and data science. That being said, if you want to do machine learning and data science, spending a lot of time doing a PhD in physics is a terribly inefficient way of going about it.
Our ideas for ML projects at Bowery tend to originate from both inside and outside the data team. We have examples of really high impact, successful projects which have originated from both streams.
We hire a lot of very smart, entrepreneurial people who bring great ideas to the data team all the time. If it’s an inbound idea we typically go through a process to determine whether we can and _should_ solve this problem, whether ML is appropriate for the problem, whether we think we can achieve the model performance needed to make it useful, and whether the benefit is large enough.
Ideas that are generated by the data team need to go through stakeholders to make sure that they actually care about the information we can provide. If we are able to make a prediction, but can only provide it too late to be actionable, or have an error rate that’s too high, it is probably not worth pursuing.
What we try hard to avoid is the type of problem which our stakeholders would like to see an ML (most commonly CV) solution for, and would be fun for the data team to work on, but is unlikely to be able to perform to an acceptable level for production work. These types of problems are dangerous because all of the incentives are aligned to put more time, and more resources into them with little prospect of any value to the business.
Our stakeholders are typically internal, so to us “end users” usually means our farmers, our agricultural scientists and technicians as well as our farm maintenance technicians.
I’ve found it most useful to go to the farm and talk to the people who have to use our products every day in person. Fortunately, because our farms are close to cities, this is really easy.
For example, another data scientist and I recently went to our second production farm near Baltimore to talk to an irrigation engineer and Bowery’s maintenance lead for that farm. We found out within 15 minutes that because of the physical design of the irrigation spouts, a system we were designing to help detect irrigation issues was much more difficult than we initially thought.
We totally missed this subtlety on Slack, and without going to the farm, would have built the wrong product and wasted everyone's time.
My immediate reaction is to question whether ML is really necessary. As soon as you add ML to a problem, you have so many more things to worry about (monitoring, retraining, model deployment), all of which can quietly allow your system to produce really bad output if they aren’t done correctly.
At a small company with big plans like Bowery, the data team is so oversubscribed that an ML solution has a serious opportunity cost over a dumb, simple solution that we can ship in a quarter of the time. If we can build four simple things in the time it takes to build one ML project, the benefit from the ML project needs to be substantial to be worth the time and ongoing maintenance.
If we have decided that the potential upside is large enough, and we have the right kind of data to solve this problem with ML, our next step is to really understand how our stakeholders are going to use this system.
If our predictions are great but can’t be surfaced in the right place or at the right time, we have built the wrong product and wasted time doing it.
Finally, if we’ve figured out that a project is worth doing, and ML is the way we are going to solve it, our bias is to use techniques which have been commoditized. This means there are likely high quality, open source solutions with pre-trained models available.
For example, there’s evidence that transformers can beat CNNs on image classification tasks. We have tons of straightforward, clearly posed image classification tasks at Bowery:
Getting a model to detect these issues using a CNN is only a couple of lines in Keras, and using a pretrained TFHub module means you don’t need much training data. Also, these issues are obvious enough in our images that a CNN will do a great job of detecting them.
If we were to use a transformer for these tasks we might have to implement it ourselves, or rely on bleeding edge open source implementations.
This isn’t to say reading papers is bad, or that new code is inherently untrustworthy. It’s more that the marginal improvement we’d get from the shiniest, newest model architecture probably doesn’t justify the time it would take us to write a production implementation, particularly if the simpler method performs great.
If we find that the project is getting a lot of use and we would get a big benefit if it was a bit better, we can always take a second pass at it with a fancier method.
Data at Bowery has a centralized reporting structure, but we’ve recently adopted the strategy of “Objectives” on the data team, which I’m a big fan of.
As I mentioned earlier I'm part of the “Seed to Harvest” objective which means we have a handful of metrics describing our crops while they are growing, and it’s our job to push on those. This means we can build expertise on the data team around a few systems, and really own those systems end to end.
I feel fairly strongly that the minimum team size for a non-trivial project is two, that is, nobody should ever work totally alone on a project. On a small data team it’s very tempting to put everybody on a different project, so that the team can be as responsive to stakeholders as possible.
I’ve always found that when I work alone on projects I don’t have anyone to call out my bad ideas before I waste a week on them.
There are a few decisions I think that data at Bowery made early on that have really paid dividends.
The first is to formalize our world in dbt as the single source of truth for all things that have ever happened in our farms. We put a lot of effort into thinking through how our choice of first class concepts in dbt would scale (in terms of developer usability).
This makes interrogating our long and growing history of crops really easy. If we want to build new models or iterate on old ones, we can identify our biggest opportunities and find potential training examples quickly.
The second is that we put a bunch of time into formalizing standard, automated ways of writing, deploying, and serving models so that we can get our cycle time as short as possible. We aren’t there yet, but I dream of a day where our data scientists only write business logic.
This meant we consciously delayed applying ML even in places where it was obviously the correct solution. But it has meant we can move much faster now.
I think that the best decision the Bowery data team made was to use dbt to manage our warehouse. It is central not only to how we manage our data, but to our machine learning models.
Since dbt is our central source of truth for all things that have ever happened, we can use it both to determine which observations need predictions and to generate features. Sam Swift, the VP of Data at Bowery, gave a talk a few years ago (now somewhat outdated) which outlines this.
I’m constantly astounded by the number of dbt models (database tables), and the amount of data that a small data team can keep tabs on with dbt.
The math of vertical farms is shockingly simple. We care about two things: yield (the pounds of produce we grow per farm area per time), and the money we had to spend to get that yield.
Taking yield as an example: it would seem that we have a really easy metric “Did our intervention increase yield?”
In reality it’s much more complicated than that.
There are many things we can’t observe which affect yield, and many more that we are continually improving (many groups are working on yield). Finally, crops take a long time to grow, so if you’re trying to see a direct yield impact you might be waiting a very long time.
Instead we often try to phrase our metrics in terms of factors we know affect yield but are easier to tie to interventions.
The project I worked on that had the biggest impact had no ML component at all. It is an API I built in collaboration with Katie Yoshida, another data scientist at Bowery, to serve images and encode time lapse videos of our crops.
This API means farmers and agricultural technicians can do “virtual” crop audits. It also means that when we detect something amiss in the farm we can pair it with a recent image of the problematic crop.
This API was intentionally designed to be developer friendly. It’s great to see the Slack bots, and reports that people have built independently of the data team.
Our machine learning models are typically divided into two types:
In the first example, we will eventually harvest that crop and learn its mass. In the second, we don’t have any independent evidence of when the light broke.
Monitoring and retraining for the former is pretty straightforward: we use MLflow to handle model tracking and Datadog to handle ongoing performance monitoring. We typically retrain these models on a timescale of at most a week, and the retraining is scheduled through Airflow.
Since Bowery makes important operational decisions based on the output of our models we are serious about testing the quality of our automatically retrained models before we promote them to production. In order to be promoted, the model must perform well on metrics run against our validation sets.
Additionally we often use “smoke tests”, as a last line of defense against promoting a subtly wrong model into production. These consist of curated, unambiguous validation examples that the model should do well on.
For example, another data scientist at Bowery Farming, Lauren McCarthy, and I built a model which uses images and crop experience data to predict the mass we will harvest from a crop while it is still growing.
Bowery uses this model to prioritize attention towards underperforming crops by sorting on the expected harvest mass. We also use it to forecast how much product we will deliver to our customers each day over the coming weeks by adding up the predicted mass of each crop due to be harvested on each day. Many people at Bowery make important decisions based on this model, so we need to have confidence in its predictions.
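As a rough illustration of those two uses (not Bowery's actual code), the sketch below sorts crops by predicted mass and sums predictions per harvest day. The `CropPrediction` record, its field names, and the sample values are all hypothetical, and the mass units are left unspecified.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CropPrediction:
    crop_id: str           # hypothetical identifier
    harvest_date: date     # day the crop is due to be harvested
    predicted_mass: float  # model output; units are an assumption


def rank_by_expected_mass(predictions):
    """Return crops ordered from lowest to highest predicted mass."""
    return sorted(predictions, key=lambda p: p.predicted_mass)


def daily_forecast(predictions):
    """Sum the predicted mass of every crop due on each harvest day."""
    totals = defaultdict(float)
    for p in predictions:
        totals[p.harvest_date] += p.predicted_mass
    return dict(sorted(totals.items()))


# Made-up example data, for illustration only.
predictions = [
    CropPrediction("crop-a", date(2021, 6, 1), 4.0),
    CropPrediction("crop-b", date(2021, 6, 1), 1.5),
    CropPrediction("crop-c", date(2021, 6, 2), 3.0),
]

needs_attention_first = rank_by_expected_mass(predictions)
forecast = daily_forecast(predictions)
```

The crops at the front of `needs_attention_first` are the ones expected to underperform, and `forecast` maps each harvest day to the total mass expected that day.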
We have curated a selection of crops that failed (for example, due to a mechanical failure in the farm) and some crops which performed very well.
We would never expect a newly retrained model to infer the crop mass exactly, but if the model doesn’t give the failed crops in our smoke tests quite low expected masses and the great crops high expected masses, we should not be promoting that model. Any failure to promote a model will raise an alert that we can investigate further.
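A promotion gate along these lines could look roughly like the sketch below. It is an assumption-laden illustration rather than our real pipeline: the function names, the thresholds `low_max` and `high_min`, the single higher-is-better validation score, and the toy models are all hypothetical.

```python
def smoke_test_failures(predict, failed_crops, great_crops, low_max, high_min):
    """List curated examples whose prediction lands on the wrong side."""
    failures = []
    for features in failed_crops:
        mass = predict(features)
        if mass > low_max:  # a known-failed crop should get a low mass
            failures.append(("failed crop predicted too high", features, mass))
    for features in great_crops:
        mass = predict(features)
        if mass < high_min:  # a known-great crop should get a high mass
            failures.append(("great crop predicted too low", features, mass))
    return failures


def should_promote(predict, validation_score, min_validation_score, **smoke):
    """Promote only if the validation metric and every smoke test pass."""
    if validation_score < min_validation_score:
        return False
    return not smoke_test_failures(predict, **smoke)


# Toy stand-ins for a retrained model (hypothetical feature name).
def good_model(features):
    return 2.0 * features["leaf_area"]


def broken_model(features):
    return 3.0  # ignores its input, so it is subtly wrong everywhere


smoke = dict(
    failed_crops=[{"leaf_area": 0.1}],
    great_crops=[{"leaf_area": 2.5}],
    low_max=1.0,
    high_min=4.0,
)

promote_good = should_promote(good_model, 0.9, 0.8, **smoke)
promote_broken = should_promote(broken_model, 0.9, 0.8, **smoke)
```

Returning the list of failing examples, rather than a bare boolean, gives whoever receives the alert something concrete to investigate.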
For cases where we don’t get incoming labelled data, we’ve hand curated datasets (I’ve spent many hours labelling images myself). Since we don’t get labels automatically, we can’t know our models’ real-time performance, and there is a real danger that our input data can drift somehow and our predictions can degrade in quality.
We have a few ways of dealing with this, but most often it’s easiest to just hand label some more examples and look at the model performance on those. Labelling a few hundred images only takes an hour or so and can provide confidence that our models are still working well.
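One way to turn a few hundred fresh labels into a confidence statement is a binomial interval on the measured accuracy. The sketch below uses the Wilson score interval, which is my choice for illustration and not necessarily what we run. The function names and the example counts are hypothetical.

```python
import math


def count_correct(predictions, labels):
    """Return (number correct, number labelled) for a fresh hand-labelled batch."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct, len(labels)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical batch: 300 newly labelled images, 285 predicted correctly.
low, high = wilson_interval(285, 300)
```

If the lower bound `low` falls under the accuracy the downstream use needs, that is a signal to look for drift or to retrain.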
At a small company it’s valuable to have a broad set of skills that center around being able to write good production code. Not having to wait for another group to productionize or deploy your project reduces your cycle time massively.
Broad competency also allows you to assume ownership over a problem in a way that multiple handoffs between data/software/infra teams don’t.
It also means that you can hop on to projects that aren’t strictly within your job description. While I’m not the best frontend engineer in the world, I wrote a ton of JS/React in order to build the frontend for the API which serves images and timelapse videos at Bowery. This was enormously impactful to the business, because anyone at Bowery could suddenly see, in seconds, every crop in the farm and every crop we’ve ever grown.
The second trait is to have a product-oriented mindset. I’m of the opinion that your job as a Data/ML person isn’t to write code, or to train models, it’s to deliver business value. This means constantly fighting against the impulse to work on fun, but low value projects. It means using the slightly older, reliable ML rather than implementing the latest method from an arXiv paper you found.
I think the hardest parts about building ML systems are
These are the hard, messy, organizational problems which don’t have a well defined loss function. Getting these right requires you to have a product-oriented mindset which I’ve found has been very rewarding to consciously cultivate.
I think Twitter is actually one of the best resources I’ve found to be constantly exposed to new ideas around DS/ML.
On tech Twitter, I think it’s important to be a “weak learner” and to form your opinions with many inputs.
There are so many hot, new technologies it’s easy to think that you’re doing everything wrong. In reality, every piece of new, exciting tech comes with new, exciting problems which are rarely fully explored in a tweet.
Gather information from multiple sources and understand that many solutions might be optimizing for different things than you are.
Many of the people I follow have already been interviewed on ApplyingML! So the list of other interviews is a great place to start.