Learn more about Pranam Janney on his LinkedIn.
I currently work as a Principal Data Scientist at Scentre Group within the Strategic Analytics, Insights & Research team. Scentre Group owns and manages Westfield shopping centres in Australia and New Zealand, with the sole purpose of creating extraordinary places, connecting and enriching communities. Our team focuses on enabling our Westfields to better serve our customers and retailers.
To be honest, I never set out to become a data scientist or a machine learning practitioner. I was fortunate enough to have circumstances that allowed me to do what I wanted, and I did. I took up telecommunications engineering for my bachelor's degree and picked up digital signal processing. The concepts within signal processing piqued my interest, which led to me coding a lot in MATLAB. Back then MATLAB was the tool of choice and one could do a lot using it. That led me to a Master's degree in electrical engineering, where I realised one-dimensional signal processing could lead to two-dimensional signal processing, aka image processing. So, my Master's project was in image processing. I then added another dimension to image processing, i.e. the time dimension, and applied for a PhD at The University of New South Wales and National ICT Australia with a proposal that was around video understanding/summarisation. All throughout, I was using data, advanced statistics/calculus and machine learning algorithms as tools to achieve my goal. When I look back and think about it, the factors that helped the most were my eagerness and curiosity to solve problems.
My time is now split between exploring new greenfield projects, developing & productionising analytics products and managing our deliverables & production environment.
Machine learning, in my opinion, is a tool to solve a problem. We do not seek out problems for machine learning specifically. We operate very much like a consulting firm, e.g. Bain & Company or McKinsey. When we speak to the business or a stakeholder, we try to understand their goals or pain points and, using a top-down approach, come up with a set of deliverables that could enable them to achieve their goals or alleviate their pain points. The bottom-most layer is the raw data layer and the top-most layer is the deliverable. All this happens consultatively and conceptually, and most likely we will not have touched any data or analytics by then. Of course, this needs a thorough understanding of our existing data and algorithms, and of their limits and capabilities. In classical academic research, this would be called a proposal. We then come up with a plan to execute this proposal. Sometimes, our proposal might become infeasible due to limits of the data, algorithms, etc. that were previously unknown to us. We then create assumptions (thereby approximating the solution) or find proxy data that could fill the gap. Every solution to a business problem can be enabled by data science, but not every data science solution will necessarily solve a business problem.
In my field of work, the stakeholders have problems that need simplification or a solution. Taking a step back, machine learning is a sub-field of data science. Keeping that in context, if I am asked the same question with 'machine learning' replaced by 'data science', then I usually follow the steps below, in order:
This is a classical top-down approach, starting with the highest-level user story and distilling layer by layer from the top. I believe this is applicable in every aspect of data science.
We have data engineering capability within the team. However, there is a separate data engineering team within the wider organisation with whom we collaborate to integrate complex ML systems with the data warehouse and other legacy systems, and to scale up.
This is a very difficult question and the generic answer is "it depends". Data science teams are, in my view, mini start-ups in their own right.
From the organisation's perspective, when building a data science team, one must take into consideration the functions that the team will need or must be skilled in, such as science/mathematics, behavioural science, machine learning, data engineering & devops, stakeholder management, business knowledge/analysis, visualisation/user experience, etc. A big organisation that is introducing data science into its capability matrix will most likely have most of these functions already, excluding science/mathematics, behavioural science and machine learning, spread across different business units. So, more likely than not, it will embark on building a data science team with only scientists and machine learning engineers, and call upon the pre-existing functions from other business units as and when required. However, I have not yet seen a data science team built with this thinking be successful. Especially in a big and/or legacy organisation, it is quite important to seed and build a data science team as a start-up that encompasses all the required functions within itself, until such point where data science is a proven and widely used capability within the organisation. Once data science capability reaches a certain maturity, going into a hub-spoke model or a centre-of-excellence model would be ideal, depending on the needs of the organisation.
From an employee/individual perspective, an all-rounder data scientist is much more valuable and has superior career prospects. What I mean by an all-rounder data scientist is a data scientist who is skilled in all aspects of data science and in various business domains/fields. Therefore, rotation of data scientists across all necessary aspects of a data science business function, and also across all business domains within the organisation, should be an important charter for a data science team. This not only inspires data scientists to join your team but is also a strong motivator for data scientists to remain in the team/organisation. Implementing this charter will not be a problem in the start-up phase of the team but will become difficult when the team progresses into a hub-spoke or centre-of-excellence model. It is the responsibility of the Director 1 and/or Chief/Principal Data Scientist to ensure rotation of data scientists across functions and domains by lobbying the cause with business stakeholders.
Rotation is also helpful from a business-continuity perspective, as you will eliminate the single-person risk in any data science function. Good data scientists are scarce, so single-person risk has an exponential impact on data science functions when compared to other business functions.
During the early part of my career, I happened to walk out of the office at the same time as my then CFO to pick up lunch. I started small talk and, in my efforts to sound smart, I mentioned to him that it was difficult to measure the impact of data science. His answer was quite blunt. He said, "Find the end metric your stakeholders are trying to improve; you have your measurement right there". I stopped trying to sound smart after that.
I think the KISS principle is applicable here. Aligning your goals to the goals of the stakeholder is the easiest way to quantify the impact of one's work. Usually, stakeholders are one of three types. They:
Irrespective of which bucket they fall into, they have an end metric that they are all trying to improve upon (e.g. sales, volume, cost reduction, etc.). For example, the above three buckets would have end metrics such as:
As part of your role as a data scientist, you should understand the end metric, manage expectations for measurements and deliver against that. Therefore, measuring the impact of one's work is a product of a good understanding of your stakeholder's expectations and a well-thought-out problem statement.
One has to be very mindful that a data scientist's job does not finish after the hand-over of a product. You have to do health-checks with the business stakeholders periodically. This enables you to better understand and chart the impact of your work.
In terms of the greatest impacts, I can only list my failures ('failure' might be too strong and negative a word) because that is what I track personally. It might sound counterintuitive. In research and in data science, keep track of failures and of things that do not help or work, because that gets you to your next goal faster. If you are working towards a solution then that solution, by definition, will have a significant impact on the business; otherwise you would not, and should not, be working on it.
Monitoring performance is one of the key pillars of an ML DevOps framework. An ML pipeline cannot be considered to be in production if it does not have a performance monitoring framework around it. There are a few concepts here, and elaborating on them might shed more light.
We live in a world that is made up of random variables, and many random variables are explicitly or implicitly related to one another, thus making many of them non-stationary. A variable deemed stationary at one point in time might turn out to be non-stationary at a future point in time, and vice versa. ML pipelines are built using variables that either exist in this world as is or are directly/indirectly correlated with a variable that does. With the passage of time, variables tend to deviate, and so the machine-learnt model tends to deviate as well. You should not be serving deviated insights to users from in-production models, therefore a constant performance monitoring regime is warranted. By constantly measuring the performance of the ML pipeline, one can identify when the model is drifting/deviating and take appropriate action.
If the model is drifting, the appropriate action could depend upon the diagnosis. It could be as straightforward as retraining using updated data or as involved as starting from scratch i.e. rethinking the problem itself.
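As an illustration only (this is not a description of any particular production framework), the idea of constantly measuring performance and flagging drift can be sketched in a few lines of Python. The function name `drift_status`, the use of mean absolute error as the performance measure and the 25% tolerance are all assumptions made for this example; in practice the measure should be the end metric agreed with the stakeholder.

```python
from statistics import mean


def drift_status(baseline_errors, recent_errors, tolerance=0.25):
    """Compare recent prediction errors with those recorded at deployment.

    baseline_errors: prediction errors measured when the model went live.
    recent_errors:   prediction errors from the most recent monitoring window.
    tolerance:       hypothetical threshold; relative worsening of the mean
                     absolute error above this value is flagged as drift.
    """
    if not baseline_errors or not recent_errors:
        raise ValueError("both error windows must be non-empty")

    baseline = mean(abs(e) for e in baseline_errors)
    recent = mean(abs(e) for e in recent_errors)

    # A perfect baseline cannot be used as a denominator, so any error
    # at all in the recent window counts as a deviation.
    if baseline == 0:
        return "ok" if recent == 0 else "drifting"

    relative_change = (recent - baseline) / baseline
    return "drifting" if relative_change > tolerance else "ok"


if __name__ == "__main__":
    at_deployment = [1.0, -1.0, 1.0, -1.0]
    this_week = [2.0, -2.0, 2.0, -2.0]
    print(drift_status(at_deployment, this_week))
```

A check like this only says that something has moved. Whether the remedy is retraining on updated data or rethinking the problem still depends on the diagnosis.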
In my career, I have noticed that most people who are effective data scientists and/or ML practitioners display the qualities below, in no particular order:
Pick a problem and try to solve it; along the way you will pick up the tools that are required to solve the problem. It sounds very simple, but I have seen lots of people chase the tools and forget the concept of solving a problem.
Therefore, when we are hiring we look for the above qualities.
When I was doing my Master's and my PhD, the main goal was to build solutions that 'do better' than the existing state of the art. For example, if the existing state of the art produced 90% accuracy, then the goal was to beat 90% accuracy (depth). I was focused on metrics such as accuracy, performance, faster, quicker, better, etc. As I moved into the commercial setting, I found myself working on defining appropriate problem statements, building solutions to problems using the 80-20 rule, stakeholder management and relationship management (breadth).
Schools are great places to get into the depth aspect of a subject/domain. You can do hundreds of courses/projects related to breadth but you will most likely be viewed as second to someone who has achieved breadth in a commercial setting. On the other hand, commercial environments are great places to get into breadth and they rarely offer opportunities to get into depth. So, my advice to anyone is to choose wisely, depending on what you want to achieve in your professional life.
As I have noted earlier, we look for builders & problem-solvers. If someone builds solutions to problems in their free time, that is the person everybody wants. Think about how you can showcase your entrepreneurial side when applying for jobs or roles.
Your website/newsletter is a great source of information to keep me updated :-). MOOCs such as Coursera are a great source for learning new things. Also, I read a variety of blogs ranging from mathematics/algorithms to science & technology to policy/regulatory to econometrics. As a data scientist, I believe it is quite important to be well rounded in understanding how these fields are progressing and how they interact with and influence each other.
1 The company title hierarchy in Australia is a bit different from that in the US. A Director in an Australian company would be similar to an EVP/SVP in the US, I believe.