ApplyingML

Chirasmita Mallick - Principal Data Scientist @ G2

Learn more about Chirasmita on her site, Twitter, and LinkedIn.

Please share a bit about yourself: your current role, where you work, and what you do?

I’m a Principal Data Scientist at G2 and a Data Science Mentor with Springboard India. I am part of the Product team focused on the Software Buyer Experience around Search and Personalization, and I oversee the deployment of NLP applications at G2. On the whole, we deliver thousands of data points on Buyer Intent during software purchasing. Other products that track software spend and software stack add value on top of what we provide for a software buyer.

I enjoy solving broad problems in systematically enabling conversations and understanding between humans and machines. That’s one of the main reasons I enjoy decoding languages and playing with textual data.

I actively speak on topics around Language Models and their scalability, and I give career-level talks, including one for ODSC, California on Mental Models for Data Science.

I also teach coding to women in data science through the Women Who Code and Women in Data Science initiatives.

What was your path towards working with machine learning? What factors helped along the way?

I have a Computer Science degree, and my first encounter with the world of AI happened when I came across the term “Big Data” while flipping through a magazine in my third year at college. I had no idea what it meant, but I started digging into it all over the Internet. At that time I also had no idea what Data Science looked like as a career choice, and I had no mentors.

That’s when the Grace Hopper Celebration of Women in Computing happened, and there was no looking back. I met some badass women handling data architecture at Yahoo Inc and was fascinated by how they charted their careers with absolute courage and curiosity to learn, two values that are very close to my heart to this day.

At that point, I was just out of college, had no job, and didn't want to take a traditional software engineering job. All I wanted was to build something new. I did a lot of self-learning. After my repeated requests, one of the women from Yahoo graciously agreed to mentor me and taught me to build my very first recommender system. I was exhilarated.

That was the beginning of my Data Science journey, as I landed my first job as a Data Science Analyst in Healthcare. Since then there has been no looking back. Year on year I got to work in very different domains, from healthcare to semiconductors and from People Analytics to a SaaS marketplace. I got to test my skills in challenging domains, which proved to be a huge learning point in my career. From leading technical data teams to being an SME, it has been a fulfilling journey so far.

The factors that helped me the most along the way were Compounding Learning (learning something new daily; reference: Atomic Habits) and having an active learning approach. I create Mental Models for learning Data Science, which have helped me learn fast and build swiftly while applying my skills. You can find more on mental models here.

Here is a talk I gave on “How to become a Data Scientist”.

How do you spend your time day-to-day?

My daily work looks very basic: List -> Action -> Articulate.

A lot of my time is spent on data discovery, rapidly prototyping the first MVP, and writing design docs (thanks to Eugene for structuring content on writing a great design doc). I use this design doc for alignment. We work in a very Agile structure with the Product team. Most of my time goes into aligning with Product Managers, Developers, DevOps, and Design on scoping and defining success metrics for a given project. It’s an amazing way to collaborate end to end on what the final product would look like.

This process enables all the stakeholders to align on a single vision and gives us the ability to experiment without going outside the scope, which gets projects live in less time.

I enjoy writing and ideating as much as I enjoy writing code. That helps in turning complex ideas into simpler action points for all team members to understand. Here is a blog I write, and I also teach an NLP Masterclass every month to folks looking to transition into NLP.

The past two years have been pretty entrepreneurial. Finding time to write more, consult with startups, give technical talks, and deliver lectures, apart from my daily duties as Principal Data Scientist at G2, has helped me have a meaningful career where I get to connect with amazing peers and orgs and help scale their Data Science journey too.

How do you work with business to identify and define problems suited for machine learning? How do you align ML projects with business objectives?

I refrain from taking an ML-first approach to any idea. The first solution should solve the problem in the simplest way possible. The problems suitable for ML are the ones that cannot be solved by traditional programming, for instance, building a search system that autocorrects queries typed by a user.

Ideally I take a product-first approach. What business problem is my product facing, and can it be solved using ML?

Deciding whether a problem can be solved with a rule-based system or whether it involves predictions helps me filter down how I align a product problem with an ML problem.

The best way is to take the 5-step approach:

Set a business Goal
Create a hypothesis
Collect all data points
Test your hypothesis
Analyze your results and repeat

This approach helps reveal whether an ML-based system can drive meaningful decision-making insights for the specific problem at hand.

Machine learning systems can be several steps removed from users, relative to product and UI. How do you maintain empathy with your end-users?

Great question.

Across all my previous roles, we saw a real challenge in connecting the dots from customers to their usage of ML systems. I believe understanding cause and effect is crucial to building empathy with end-users.

Let’s dive in further. If I see a sudden drop in traffic on our website, rather than trying to fix SEO or blame an algorithm, it's a better idea to understand whether any new feature affects the usage of current features on the site. Almost everything we interact with or collect data for works on causation.

Causal statements such as “X caused Y” or “Y occurred because of X” explain events, allow predictions about the future, and make it possible to take actions to affect the future. Knowing more about causality can help researchers better validate what is “causing” and what is triggering the “effect”.

Read more about Causality and Inference here.

You can build a working but not "perfect" solution for your end-user and run an A/B test. Keep collecting data points from user interaction, and use this data as a feedback loop to drive meaningful improvements. This helps with creating empathy with end-users. You are providing features a customer needs which might not be perfect now, but with customer interaction you get enough data points to make them just right for what your customer wants.
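As a rough illustration of the A/B test step, here is a minimal sketch of a two-proportion z-test that compares conversion rates between a control and a variant. The function name and the counts are hypothetical and chosen only for illustration; they are not from G2. The sketch assumes a simple two-sided test on independent samples, and a real experiment would also need a pre-planned sample size.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and users in the control group.
    conv_b / n_b: conversions and users in the variant group.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("Standard error is zero; rates are all 0% or 100%.")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the variant's conversion rate really differs from the control's, which is the kind of signal that can feed the next iteration of the feature.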

Imagine you're given a new, unfamiliar problem to solve with machine learning. How would you approach it?

One thing open source has done for all of us is make code accessible to everyone. This has made it easier to find new ways of solving problems.

When I encounter an unfamiliar problem, I always look for “How others have solved it” on platforms like GitHub or Papers with Code. These are great places for code inspiration on SOTA methodologies.

I also refer to popular blogging sites like Towards Data Science, as well as tech blogs published by companies, for data architecture use cases.

I enjoy staying in touch with the academic world; Graham Neubig is one of my favourite professors in Language Tech and Multilingual Systems.

The one thing that has helped me the most is relying on global and local ML/NLP communities. Staying in touch with peers in my field has helped me build meaningful relationships where we can quickly tell each other what has worked for us and what hasn't. I have lost count of how many friends in the Data Science & NLP community have helped me ideate and build fantastic solutions, which has reduced my experimentation time.

Designing, building, and operating ML systems is a big effort. Who do you collaborate with? How do you scale yourself?

All big team efforts should start with a “Single Core Idea”. No partial/several ideas. Just a single core idea for all stakeholders to align on.

Everything else falls into place when an idea is singular.

The best way to grab attention and build interest is to present a single core idea, fully fledged. This allows the user to make a binary decision about it: “Am I interested or not?” Introducing a feature in a way that people can instantly map it to a desired outcome will help them prioritize and be confident about their next step.

This is a big Design Principle that I abide by.

Over time I learnt that the ability to communicate clearly across the org has helped me scale as a Data Scientist as well.

There are many ways to structure DS/ML teams—what have you seen work, or not work?

Having an end-to-end Data Science team has helped a lot in scaling and productionizing data science pipelines much faster than relying on Data Engineering full time. Especially in a high-growth startup like G2, working independently to experiment, track, demonstrate, and deploy models has helped speed up data science development.

We do not have much hierarchy, and it's pretty much a flat org. All Data Scientists and Analysts report to the Senior Data Science & Analytics Manager, who comes under the Engineering (Product R&D) function. This enables quick ideation-to-deployment cycles without much time being wasted on unnecessary process approvals.

How does your organization or team enable rapid iteration on machine learning experiments and systems?

We are still improving our ML experimentation setup so that we can iterate fast.

Here are a few tools that help us with experiment tracking and comparing model workflows:

MLflow - Monitor/Experiment/Reproduce solutions
AWS SageMaker - For fast deployments and ready-to-use use case templates
Papermill - Parameterizing, executing, and analyzing Jupyter notebooks
DVC - Version Control

What processes, tools, or artifacts have you found helpful in the machine learning lifecycle? What would you introduce if you joined a new team?

Having an Agile ML approach helps in delivering Data Science projects faster. Keeping tasks in sprints might not always work, but it helps things move because a due date is assigned to each subtask.

Here are a few tools that help if you are starting afresh:

Confluence/Notion - Great documentation tools
GitHub Actions - Absolutely a blessing to automate workflows
MLflow - ML reproducibility and lifecycle tracking
Slack - For org-level communications
Jira - Tracking your task tickets; helps align with Dev + PM
Weights & Biases - Dev tools to visualize/monitor ML models
Looker/Amplitude - Dashboards for non-data team members

How do you quantify the impact of your work? What was the greatest impact you made?

I equate “impact” with “usability”. Seeing thousands of users interact with every feature I have built is extremely satisfying to me. Most of the impactful features I have created are around Personalization & Search for Marketplace.

After shipping your ML project, how do you monitor performance in production? Did you have to update pipelines or retrain models—how manual or automatic was this?

Updating pipelines and handling ML model drift are real challenges when tracking ML systems in production. Since we use Elastic for Search, we leverage their observability solution for continuous model improvements.

We also run A/B tests for ML models, update and retrain them from time to time, and analyze log data in real time to detect anomalies. This helps the most in detecting unusual patterns in the dataset.
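To make the log-based anomaly detection idea concrete, here is a minimal sketch that flags values deviating strongly from a rolling window of recent observations. This is a generic rolling z-score check, not the Elastic observability setup described above; the function name, window size, threshold, and sample counts are all assumptions chosen for illustration.

```python
from collections import deque
import statistics


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of values far from the mean of the preceding window."""
    history = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        # Only score a point once a full window of history is available.
        if len(history) == window:
            mean = statistics.fmean(history)
            std = statistics.pstdev(history)
            # Skip flat windows (std == 0) to avoid dividing by zero.
            if std > 0 and abs(v - mean) / std > threshold:
                anomalies.append(i)
        history.append(v)
    return anomalies


# Hypothetical per-minute error counts parsed from logs.
errors_per_minute = [4, 5, 3, 4, 6, 5, 4, 40, 5, 4]
print(rolling_zscore_anomalies(errors_per_minute))  # prints [7]
```

Note that the spike itself enters the window after being flagged, which inflates the standard deviation for the next few points; a production setup might exclude flagged points from the history or use a more robust statistic such as the median.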

Think of people who are able to apply ML effectively–what skills or traits do you think contributed to that?

Most people who are able to apply ML effectively are great system-level thinkers and problem solvers. The ability to apply statistical analysis to problems and distill complex ideas into a single core idea is a critical skill for a great Data Scientist today.

The Innovator’s Dilemma starts with a brilliant question: “How can great firms fail?” The same applies to a Data Scientist’s ability to make effective decisions on ML. Knowledge of “What will not work” is extremely critical. Clarity drives effective decision making, helps align stakeholders, and gets features deployed.

Do you have any lessons or advice about applying ML that's especially helpful? Anything that you didn't learn at school or via a book (i.e., only at work)?

“Do it Live” - There is no better way to learn than actually trying to build new things. It’s okay if it does not work initially. The depth of applied ML comes from problem solving. Going all the way from data discovery and rapid prototyping to deploying and monitoring simple models teaches the ins and outs of “How to create an end-to-end functioning system that works well”.

School enables theoretical learning, and there is no pressure to deliver results that fall outside the syllabus.

At work, it’s all about building meaningful products/features/processes that add to the organization's vision and growth. You are expected to ship projects fast and collaborate across teams in an efficient manner to get a common goal LIVE. The ability to be creative, think better "fast" and develop solutions quickly is something I learnt at the workplace. Learning never stops!

How do you learn continuously? What are some resources or role models that you've learned from?

I am a firm believer in compound applied learning. I think of learning in terms of Systems (thanks to Daniel Kahneman’s Thinking, Fast and Slow).

Simply put, the Machine Learning community is growing aggressively, and I have benefited from the following people in a big way.

Andrew Ng - I owe him everything for making Data Science so accessible to all.
Chip Huyen - For being phenomenal, specifically her work on System Design for ML systems
Devi Parikh - Research Scientist at FAIR (Facebook), Georgia Tech professor
Ilya Sutskever - Co-Founder and Chief Scientist of OpenAI
Balaji Srinivasan - Not exactly in the AI space, but a huge investor in the AI and crypto economy. I admire his work in decentralizing blockchain and his advocacy of Computer Science fundamentals in scaling decentralized finance apps.
