All Projects


End-to-End Project Custom Semantic Search on Question and Answer repository
Developed a high-precision semantic search engine over the text corpus of a Question-Answer repository, combining Elasticsearch inverted indexing with BERT vectors and the Universal Sentence Encoder transformer model (TensorFlow), and deployed it using Docker.

Passion Project Deep learning MusicGen
Generating new music from an input sequence of musical events using character-level recurrent neural networks.

Passion Project Autonomous car model simulation
Building self-driving simulations using NVIDIA end-to-end CNN model.

Passion Project Sensor-based activity recognition
Designed a data processing pipeline to collect, compile, sort and transform tri-axial linear acceleration and angular velocity signals from sensors to predict human activity/posture such as Walking (Upstairs/Downstairs), Sitting, Standing or Laying.

Passion Project The Notorious/Classic Netflix Challenge
Developing a collaborative filtering-based recommendation engine that improves prediction accuracy by 10% over Cinematch on the same training data set.

Passion Project Amazon Fashion discovery engine (Content Based recommendation)
Recommending similar apparel items/products in e-commerce to boost revenue.

Academic Project Ad Click-Through Rate Prediction
This project walks you through the predictive modeling process to accurately predict the likelihood that a given ad will be clicked, also known as Click-Through Rate (CTR)

Competition Multi-class classification of Malware families
Identifying whether a given file or piece of software is malware, and classifying large volumes of samples into groups to identify their respective malware families.

Competition Question tagging in StackOverflow
In this project, I seek to develop a machine learning model that automatically infers and tags the topic of a question on StackOverflow.

Passion Project Predicting Yellow Taxi service demand in New York City

Competition Link prediction in directed social network graph
Predicting missing links in a directed social graph to recommend friends/connections/followers
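One common baseline for this kind of task, shown here only as a minimal sketch, scores a candidate connection by the Jaccard overlap of the accounts two users already follow. The toy edge list and the helper names (`followees`, `jaccard`, `rank_candidates`) are hypothetical and not taken from the project itself.

```python
# Toy directed graph: an edge (u, v) means "u follows v". All data is made up.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "b"), ("d", "c"), ("d", "e")]


def followees(edge_list):
    """Map every node to the set of nodes it follows."""
    out = {}
    for u, v in edge_list:
        out.setdefault(u, set()).add(v)
        out.setdefault(v, set())
    return out


def jaccard(out, u, v):
    """Jaccard similarity of the followee sets of u and v."""
    union = out[u] | out[v]
    if not union:
        return 0.0
    return len(out[u] & out[v]) / len(union)


def rank_candidates(edge_list, source):
    """Rank nodes that `source` does not follow yet, best score first."""
    out = followees(edge_list)
    candidates = [n for n in out if n != source and n not in out[source]]
    scored = [(jaccard(out, source, n), n) for n in candidates]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


ranking = rank_candidates(edges, "a")
print(ranking)
```

In a real setting a score like this would typically be one feature among several fed to a classifier rather than the final prediction.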

Competition Classifying the effect of Genetic mutations to enable Personalized Medicine
Once sequenced, a cancer tumor can have thousands of genetic mutations. But the challenge is distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers). Using this knowledge base as a baseline, I seek to develop a machine learning model that automatically classifies genetic variations.

Passion Project Quora Question duplication detection
In this project, I aim to identify redundant/duplicate questions asked on Quora using natural language processing. This could help provide instant answers to questions that have already been answered.

Academic Project The Filmmaker's Guide
Analysis of trends in film and television industries worldwide using R.

Passion Project Naïve Bees - 3 : Deep Learning with Images
Can a machine distinguish between a honey bee and a bumble bee? Being able to identify bee species from images, while challenging, would allow researchers to more quickly and effectively collect field data. In this project, I wish to build a simple deep learning model that can automatically detect honey bees and bumble bees, then load a pre-trained model for evaluation. I use Keras, scikit-learn, scikit-image, and NumPy, among other popular Python libraries.
This project is the third part of a series of projects that walk through working with image data, building classifiers using traditional techniques, and leveraging the power of deep learning for computer vision.

Passion Project Naïve Bees - 2 : Predict Species from images
Can a machine distinguish between a honey bee and a bumble bee? Being able to identify bee species from images, while challenging, would allow researchers to more quickly and effectively collect field data.
In this Project, I wish to use the Python image library Pillow to load and manipulate image data. I aim to apply common transformations of images and build them into a pipeline.
This project is the second part of a series of projects that walk through working with image data, building classifiers using traditional techniques, and leveraging the power of deep learning for computer vision.

Passion Project Naïve Bees - 1 : Image Loading and Processing
Can a machine distinguish between a honey bee and a bumble bee? Being able to identify bee species from images, while challenging, would allow researchers to more quickly and effectively collect field data.
In this Project, I wish to use the Python image library Pillow to load and manipulate image data. I aim to apply common transformations of images and build them into a pipeline.
This project is the first part of a series of projects that walk through working with image data, building classifiers using traditional techniques, and leveraging the power of deep learning for computer vision.

Passion Project Player retention A/B testing with Cookie Cats
Cookie Cats is a hugely popular mobile puzzle game developed by Tactile Entertainment. It's a classic 'connect three' style puzzle game where the player must connect tiles of the same color in order to clear the board and win the level.
As players progress through the game they will encounter gates that force them to wait some time before they can progress or make an in-app purchase. In this project, I wish to analyze the result of an A/B test where the first gate in Cookie Cats was moved from level 30 to level 40. In particular, I plan to study the impact on player retention.
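A bootstrap comparison is one way such an A/B result can be examined. The sketch below uses synthetic retention flags (1 = player came back, 0 = player did not); the group sizes, the rates, and the names `retention_rate` and `bootstrap_diff` are assumptions made for illustration, not values from the actual Cookie Cats data.

```python
import random


def retention_rate(flags):
    """Share of players who returned (flags are 0 or 1)."""
    return sum(flags) / len(flags)


def bootstrap_diff(group_a, group_b, n_boot=2000, seed=0):
    """Resample both groups with replacement and collect the rate differences."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample_a = rng.choices(group_a, k=len(group_a))
        sample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(retention_rate(sample_a) - retention_rate(sample_b))
    return diffs


# Synthetic data, invented for this sketch: 45% vs 40% retention.
gate_30 = [1] * 45 + [0] * 55
gate_40 = [1] * 40 + [0] * 60

diffs = bootstrap_diff(gate_30, gate_40)
share_positive = sum(d > 0 for d in diffs) / len(diffs)
print(f"Share of resamples where gate 30 retains better: {share_positive:.2f}")
```

The share of positive differences gives a rough sense of how often the level-30 gate would come out ahead if the experiment were repeated.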

Passion Project American Sign Language recognition with Deep Learning
American Sign Language (ASL) is the primary language used by many deaf individuals in North America, and it is also used by hard-of-hearing and hearing individuals. The language is as rich as spoken languages and employs signs made with the hand, along with facial gestures and bodily postures.
In this project, I aim to train a convolutional neural network to classify images of ASL letters.

Passion Project Charles Darwin's book similarities and recommendations
Recommendation systems are at the heart of many products such as Netflix or Amazon. They generally rely on metadata (e.g., the actors or director of a movie) or on user tastes (e.g., the movies you liked before) to determine which items you are most likely to enjoy. But when you are working with text-heavy datasets, you have access to a much richer resource: the whole text!
In this project, I propose to build a basic book recommendation system based on the content of the books. I plan to use Charles Darwin's bibliography to find out which books might interest you!

Passion Project Trending Machine Learning Topics
Neural Information Processing Systems (NIPS) is one of the top machine learning conferences in the world where groundbreaking work is published.
In this Project, I seek to analyze a large collection of NIPS research papers from the past decade to discover the latest trends in machine learning. The techniques I use here to handle large amounts of data can be applied to other text datasets as well.

Passion Project Filter passwords using NIST guidelines
Almost every web service you join will require you to come up with a password. But what makes a good password? In June 2017 the National Institute of Standards and Technology (NIST) released Special Publication 800-63B, titled Digital Identity Guidelines: Authentication and Lifecycle Management. This publication doesn't tell you what a good password is, but it does have specific rules for what a bad password is.
In this project, I have taken a list of user passwords and, using publication 800-63B, I wish to write code that automatically detects and flags the bad passwords.
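A rough sketch of such a filter follows. It covers four checks in the spirit of 800-63B: minimum length of 8, membership in a list of common passwords, repetitive characters, and context-specific words such as the user name. The tiny `COMMON_PASSWORDS` set, the repeat threshold of 4, and the function names are assumptions for illustration; a real check would load a full list of common or breached passwords.

```python
# Illustrative stand-in for a real list of common or breached passwords.
COMMON_PASSWORDS = {"password", "123456789", "qwertyuiop", "letmein123"}


def has_long_repeat(password, run=4):
    """True if any character occurs `run` or more times in a row.

    The threshold of 4 is an assumption made for this sketch.
    """
    count = 1
    for prev, cur in zip(password, password[1:]):
        count = count + 1 if cur == prev else 1
        if count >= run:
            return True
    return False


def flag_bad_password(password, user_name=""):
    """Return the list of reasons a password is flagged (empty if none)."""
    reasons = []
    if len(password) < 8:
        reasons.append("too short")
    if password.lower() in COMMON_PASSWORDS:
        reasons.append("common password")
    if has_long_repeat(password):
        reasons.append("repetitive characters")
    if user_name and user_name.lower() in password.lower():
        reasons.append("contains user name")
    return reasons


for candidate in ["abc", "password", "aaaa1234xyz", "correct horse battery"]:
    print(candidate, "->", flag_bad_password(candidate))
```

Returning the reasons rather than a single boolean makes it easier to see which rule caught each password.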

Passion Project Finding Movie Similarity from Plot Summaries
Natural Language Processing (NLP) is an exciting field of study for data scientists where they develop algorithms that can make sense out of conversational language used by humans.
In this Project, I aim to use NLP to find the degree of similarity between movies based on their plots available on IMDb and Wikipedia.
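As a simplified illustration of text similarity, the sketch below compares plot summaries by the cosine similarity of their raw word counts. This is a stand-in technique chosen for brevity (a fuller pipeline might use TF-IDF weighting, stemming, and clustering), and the three one-line plots are invented for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up plot summaries, for illustration only.
plot_1 = bag_of_words("A detective hunts a killer")
plot_2 = bag_of_words("A detective hunts a thief")
plot_3 = bag_of_words("Robots explore space")

print(cosine_similarity(plot_1, plot_2))
print(cosine_similarity(plot_1, plot_3))
```

The two detective plots share most of their words and score high, while the unrelated plot shares none and scores zero.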

Passion Project Moby Dick Word Frequency
Using web scraping and NLP to find the most frequent words in Herman Melville's novel, Moby Dick.
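The counting step can be sketched with the standard library alone. The sample sentence and the short stop-word set below are made up for illustration; the scraping step and a full stop-word list are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; a real analysis would use a fuller list.
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "i", "me", "my", "or", "on"}


def top_words(text, n=3):
    """Return the n most frequent non-stop words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Invented sample text, not a quotation from the novel.
sample = "The whale, the whale! The white whale and the sea; the sea and the ship."
print(top_words(sample, n=2))
```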

Passion Project Minimizing Traffic Mortality in the USA
While the rate of fatal road accidents has been decreasing steadily since the 80s, the past ten years have seen a stagnation in this reduction. Coupled with the increase in the number of miles driven in the nation, the total number of traffic-related fatalities has now reached a ten-year high and is rapidly increasing.
By looking at the demographics of traffic accident victims for each US state, we find that there is a lot of variation between states. Now we want to understand if there are patterns in this variation in order to derive suggestions for a policy action plan. In particular, instead of implementing a costly nation-wide plan we want to focus on groups of states with similar profiles. How can we find such groups in a statistically sound way and communicate the result effectively?

Passion Project Kardashians Vs Jenners Analysis
While I'm not a fan nor a hater of the Kardashians and Jenners, the polarizing family intrigues me. Why? Their marketing prowess. Say what you will about them and what they stand for, they are great at the hype game. Everything they touch turns to content.
In this Project, I wish to explore the data underneath the hype in the form of search interest data from Google Trends. I also intend to recreate the Google Trends plot to visualize their ups and downs over time, then make a few custom plots of my own. And finally answer the big question: is Kim even the most famous sister anymore?

Passion Project Analyze scraped Google Play Store data to understand the Android app market
Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed.
In this Project, I aim to perform a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. I also plan to look for insights in the data to devise strategies to drive growth and retention.

Passion Project Visualizing and Analyzing Nobel laureates
The Nobel Prize is perhaps the world's most well known scientific award. Every year it is given to scientists and scholars in chemistry, literature, physics, medicine, economics, and peace. The first Nobel Prize was handed out in 1901, and at that time the prize was Eurocentric and male-focused, but nowadays it's not biased in any way. Surely, right?
In this project, I explore patterns and trends in over 100 years' worth of Nobel Prize winners. What characteristics do the prize winners have? Which country gets it most often? And has anybody gotten it twice?

Passion Project Game of Thrones analysis with Networks
Jon Snow, Daenerys Targaryen, or Tyrion Lannister? Who is the most important character in Game of Thrones? Let's see what mathematics can tell us about this!
In this project, I look at the character co-occurrence network and its evolution over the five books in George R. R. Martin's hugely popular book series A Song of Ice and Fire (perhaps better known as the TV show Game of Thrones). I also look at how the importance of the characters changes over the books using different centrality measures.
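Degree centrality is the simplest of these measures: a character's number of distinct co-occurrence partners divided by the number of other characters in the network. The sketch below computes it on a toy edge list; the edges are invented for illustration and do not come from the books' data.

```python
def degree_centrality(edges):
    """Degree centrality for an undirected graph given as (u, v) pairs."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Toy co-occurrence edges, made up for this sketch.
toy_edges = [
    ("Jon", "Tyrion"),
    ("Jon", "Daenerys"),
    ("Tyrion", "Daenerys"),
    ("Tyrion", "Cersei"),
]

centrality = degree_centrality(toy_edges)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

Other measures such as betweenness centrality or PageRank can rank the same characters differently, which is why comparing several is informative.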

Passion Project Song Genres classification from Audio data
Using a dataset composed of songs from two music genres (Hip-Hop and Rock), I wish to train a classifier to distinguish between the two genres based only on track information derived from Echonest (now part of Spotify). I first make use of the pandas and seaborn packages in Python for subsetting the data, aggregating information, and creating plots when exploring the data for obvious trends or factors to be aware of when doing machine learning. Next, I use the scikit-learn package to test whether a song's genre can be correctly classified based on features such as danceability, energy, acousticness, tempo, etc. I seek to make use of implementations of common algorithms such as PCA, logistic regression, decision trees, and so forth.

Passion Project Using Regression Discontinuity to see which debts are worth collecting
Playing bank data scientist and using regression discontinuity to see which debts are worth collecting.

Passion Project Credit Card Approval Prediction
Commercial banks receive a lot of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays.
In this project, I intend to build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.

Passion Project Super Bowl Analysis
Whether or not you like football, the Super Bowl is a spectacle. There's drama in the form of blowouts, comebacks, and controversy in the games themselves. There are the ridiculously expensive ads, some hilarious, others gut-wrenching, thought-provoking, and weird. There are the half-time shows with the biggest musicians in the world, sometimes riding giant mechanical tigers or leaping from the roof of the stadium. In this project, I intend to find out how some of the elements of this show interact with each other.

Passion Project You know Everything John Snow
In 1854, Dr. John Snow (no, not the Game of Thrones character) used a pre-computer method of spatial analysis by mapping patterns and occurrences of cholera outbreaks in Soho, London. He mapped the deaths in the neighbourhood and determined that a vast majority occurred around one particular water well and that those who died had used that well. It is not only one of the earliest uses of data visualization, but by solving this problem, he also founded spatial analysis and modern epidemiology.
In this project, I set out to recreate John Snow's famous map of the 1854 cholera outbreak in London.

Passion Project Analyzing the GitHub History of the Scala Language
Open source projects contain entire development histories - who made changes, the changes themselves, and code reviews.
In this project, I aim to read in, clean up, and visualize the real-world project repository of Scala that spans data from a version control system (Git) as well as a project hosting site (GitHub). With almost 30,000 commits and a history spanning over ten years, Scala is a mature language. I plan to find out who has had the most influence on its development and who are the experts.

Passion Project Drunk Analysis
Flexing my pandas muscles on breath alcohol test data from Ames, Iowa, USA. I aim to group, summarize, and visualize data on breath alcohol tests in Ames, Iowa, (home of Iowa State University) from 2013-2017. Some questions that could be answered include, "What is the highest recorded value?" and "When do breath alcohol tests occur most?"

Passion Project Keyword Generation for Google Ads
You work for a digital marketing agency, which is approached by a massive online retailer of furniture. You are tasked with creating a prototype set of keywords for search campaigns for their sofas section.
The most important task in structuring a search engine marketing account is mapping the right keywords to the right ads and making sure they send users to the right landing pages. Having figured that out is a big part of the account setup and makes the life of the account manager much easier.

Passion Project Predicting gender using Sound data
The same name can be spelled in many ways (for example, Marc and Mark, or Elizabeth and Elisabeth). Sound can, therefore, be a better way to match names than spelling.
In this project, I plan to use the Python package Fuzzy to find out the genders of authors that have appeared in the New York Times Best Seller list for Children's Picture books.
First, using fuzzy (sound) name matching, I seek to search for author names in a dataset provided by the US Social Security Administration that contains names and genders of all individuals who have applied for Social Security Cards. Next, I aggregate the author dataset by including gender. Finally, I aim to use the new dataset to plot the gender distribution of children's picture book authors over time.

Passion Project Data Analysis in Baseball using MLB's Statcast data
There's a new era of data analysis in baseball. Using a new technology called Statcast, Major League Baseball is now collecting the precise location and movements of its baseballs and players.
In this project, I aim to use Statcast data to compare the home runs of two of baseball's brightest (and largest) stars, Aaron Judge (6'7") and Giancarlo Stanton (6'6"), both of whom now play for the New York Yankees.

Passion Project Exploring the evolution of Linux
Version control repositories like CVS, Subversion or Git store rich evolution information about a software project.
In this project, I seek to read in, clean up and visualize a real-world Git repository dataset of the Linux kernel. With almost 700k commits and thousands of contributors, I expect to encounter some data cleaning and wrangling challenges while gaining insights into the development activity over the last 13 years.

Passion Project Dr. Semmelweis and the Discovery of Handwashing
In 1847, the Hungarian physician Ignaz Semmelweis made a breakthrough discovery: handwashing. Contaminated hands were a major cause of childbed fever, and by enforcing handwashing at his hospital he saved hundreds of lives.
In this Python project, I wish to reanalyze the medical data Semmelweis collected.

Passion Project Exploring Lego database
The Rebrickable database includes data on every LEGO set that has ever been sold: the names of the sets, what bricks they contain, what color the bricks are, etc. They might be small bricks, but this is big data!
In this project, I aim to explore the Rebrickable database.

Passion Project Exploring the Bitcoin Cryptocurrency Market
To better understand the growth and impact of Bitcoin and other cryptocurrencies I have, in this project, explored the market capitalization of different cryptocurrencies.

Industry Project Election Analysis
While interning at a governmental organization, I used Tableau to analyze the 2018 state election results. I derived insights and correlations between literacy and growth rates across the regions of the state and produced visualizations to assess the reasons for political party victories in specific regions, to identify the distinct reasons behind outliers in the election database, and to report on the status of political parties, vote distribution, composition/allocation of seats in the assembly, women's participation, and election expenditure.

Academic Project KweriME
The dissertation addresses issues found in community-based Q&A websites through KweriME, a reputation-based QA system that employs a category- and theme-based reputation management system to evaluate users' willingness and capability to answer various kinds of questions, while also improving response latency and answer quality.

Industry Project Health Datathon
I participated, in a team of three, in the HEALTH DATATHON project funded by the State Government, which focused on the prevalent issue of low haemoglobin levels (<10% mg/dl), estimated to affect about 14.3% of the target population. By running statistical estimation procedures on the lab data, revamping status review and reporting for stakeholders, facilitating a continuum of digital tracking through managed digital health records, and remodeling strategies to cluster healthcare facilities and associate them with prone areas, we met national and international standards (12.5 mg/dl) and reduced the prevalence of low haemoglobin levels to <5% of the target population.