I would like to share the YouTube link to my talk for Women Who Code: Decision Tree and Random Forest. If you are interested, please click the following link.


Data is the fuel of the big data era, and we can extract insightful information from it. However, large volumes of high-dimensional data can obscure valuable knowledge. Data mining and feature engineering have therefore become essential skills for uncovering the valuable information hidden in the data.

For dimensionality reduction, Principal Component Analysis (PCA) is one of the most popular algorithms. PCA encodes the original features into a compact representation, so we can drop the “unimportant” components while still retaining most of the useful information. …
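As a rough illustration of the idea, here is a minimal PCA sketch built on NumPy's SVD. The helper name `pca` and the toy data are assumptions made for this example, not code from the original post.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # PCA operates on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # The rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Share of total variance captured by each direction.
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained_ratio[:n_components]

# Toy data (hypothetical): the second feature is almost a multiple of the
# first, and the third is low-variance noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])

Z, ratio = pca(X, n_components=1)
print(Z.shape)  # one compact feature per sample
print(ratio)    # fraction of variance kept by that feature
```

On data like this, a single component should retain nearly all of the variance, which is what makes dropping the remaining components reasonable.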

The CDC confirmed the first suspected case of local COVID-19 transmission in California on 2-26-2020. The patient had no travel history to an outbreak area and no contact with anyone diagnosed with the virus, so it is suspected to be the first local transmission case in the United States. Since that first case was confirmed in CA, most confirmed cases in the state have been attributed to community transmission.

I live in CA and am eager to understand COVID-19 transmission and related facts. Two time series trends, the number of confirmed COVID-19 cases and the death toll, are presented in the following figures. In the last…

Predicting the trends of COVID-19 is too complicated, because the number of confirmed cases is driven by too many factors, both natural and man-made. This post simply describes the COVID-19 trends in each US state over the past several weeks.

After completing two recommendation projects using Amazon Personalize, I have a deep understanding of the mechanism of collaborative filtering recommendation, especially the recommendation method based on co-occurrence. This post presents the detailed theory and Python code for the co-occurrence recommendation algorithm.

Co-occurrence recommendation belongs to the collaborative filtering approach. Technically, there are two approaches to building recommender systems: content-based and collaborative filtering. These are intrinsically different methods. The content-based approach needs metadata about the items so that items with similar properties are recommended. For houses, such metadata includes area, year built, number of bedrooms, number of bathrooms, etc. …
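To make the co-occurrence idea concrete, here is a minimal sketch that assumes a small binary user-item interaction matrix. The data, the `recommend` helper, and the scoring rule (summing co-occurrence counts over the items a user has already seen) are illustrative assumptions, not the algorithm used by Amazon Personalize or necessarily the one in the full post.

```python
import numpy as np

# Hypothetical interactions: rows are users, columns are items,
# 1 means the user interacted with the item.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
])

# Item-item co-occurrence: entry (i, j) counts the users who
# interacted with both item i and item j.
cooccurrence = interactions.T @ interactions
np.fill_diagonal(cooccurrence, 0)  # an item does not recommend itself

def recommend(user_index, top_n=2):
    """Rank unseen items by their co-occurrence with the user's seen items."""
    seen = interactions[user_index]
    scores = cooccurrence @ seen
    scores = np.where(seen == 1, -1, scores)  # never recommend seen items
    ranked = np.argsort(-scores, kind="stable")
    return [int(i) for i in ranked[:top_n] if scores[i] > 0]

print(recommend(0))
print(recommend(3))
```

No item metadata is needed here; the recommendations come purely from which items appear together in users' histories, which is what separates this from the content-based approach.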

Understand the data at a high level

Before diving into the data, it is very helpful to understand its attributes, structure, data types, and missing values. The following are some basic facts about the data that we should explore.

  1. Data dimensions

2. Data columns


3. Data types of each column


Select columns of a specific data type (for example, datetime64) from a dataframe

df_cols = [i for i in df.select_dtypes(include=['datetime64'])]

4. Percentage of missing values of each column

df.isnull().sum()/df.shape[0] * 100

5. For a categorical column, check the count of each category


6. For a categorical column, check the total number of categories


7. For a categorical column, check the categories
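The checks listed above can be sketched with pandas as follows. The toy dataframe and its column names (`city`, `price`, `listed`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data with one categorical, one numeric,
# and one datetime column.
df = pd.DataFrame({
    "city": ["LA", "SF", "LA", None],
    "price": [1.0, 2.5, None, 4.0],
    "listed": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]
    ),
})

print(df.shape)             # 1. data dimensions (rows, columns)
print(df.columns.tolist())  # 2. data columns
print(df.dtypes)            # 3. data type of each column

# Columns of a specific data type, here datetime64.
datetime_cols = df.select_dtypes(include=["datetime64"]).columns.tolist()

# 4. Percentage of missing values in each column.
missing_pct = df.isnull().sum() / df.shape[0] * 100

# 5. Count of each category (missing values are excluded by default).
category_counts = df["city"].value_counts()

# 6. Total number of distinct categories.
n_categories = df["city"].nunique()

# 7. The categories themselves.
categories = df["city"].dropna().unique()
```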


Machine learning and feature engineering of text data

The major objective of this project is to use XGBoost, a neural network, and an LSTM neural network to develop predictive models that automatically predict the helpfulness of specific reviews to customers.

Many e-commerce companies, Amazon for example, depend heavily on consumer reviews to give potential purchasers an evaluation of a product. Online merchants differ from local markets, where consumers can choose items directly based on on-site evaluation: online customers must rely on other information to make their purchase decisions. Developing online communication channels among customers is therefore critically important for commercial websites. To this end, Amazon allows…

Among machine learning algorithms, XGBoost is one of the top performers, providing strong solutions to many different problems, whether prediction or classification. The major objective of this post is to explore how categorical encoding methods affect XGBoost model performance.

  1. XGBoost model + target encoding for categorical variables
  2. XGBoost model + label encoding for categorical variables
  3. XGBoost model + one-hot encoding for categorical variables
  4. Neural network + entity embedding for categorical variables (its primary task is to provide the entity embedding matrix of the categorical variables for the XGBoost model)
  5. XGBoost model + entity embedding for categorical variables
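For readers unfamiliar with the first three encodings in this list, here is a minimal pandas sketch on a hypothetical `color` column. The target encoding shown is the naive per-category mean with no smoothing, and in practice it should be fitted on training data only to avoid leakage. Entity embedding is omitted because it needs a trained neural network.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Label encoding: each category becomes an integer code
# (pandas orders the categories alphabetically here).
label_encoded = df["color"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target encoding: each category is replaced by the mean target
# observed for that category.
target_means = df.groupby("color")["target"].mean()
target_encoded = df["color"].map(target_means)

print(label_encoded.tolist())
print(one_hot.columns.tolist())
print(target_encoded.tolist())
```

Any of these encoded columns could then be passed to a tree-based model; which one performs best depends on the data.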

Personally, XGBoost is always the first…

For this project, the basic idea is that words that tend to appear in similar contexts are likely to be related: “You shall know a word by the company it keeps” (Firth, 1957). Building on this concept, this project investigates an embedding of words based on co-occurrence statistics.
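As a small illustration of what co-occurrence statistics look like, the sketch below counts how often pairs of words appear within a fixed window of each other. The `cooccurrence_counts` helper, the window size, and the toy sentence are assumptions for this example; the project's actual context definition may differ.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (word, context_word) pairs within `window` positions of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts

# Toy sentence standing in for a real corpus.
tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)

print(counts[("the", "cat")])  # adjacent words co-occur
print(counts[("cat", "mat")])  # distant words do not, at window=1
```

Each word's row of counts can then serve as a raw context vector from which an embedding is derived.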

Data source: the Brown corpus is a collection of text samples from a wide range of sources, with a total of over a million words. The analysis in this project is mainly based on the Brown corpus.

Download the Brown corpus data from nltk.corpus (brown.words).

import nltk
import matplotlib.pyplot as plt
import pandas as…

  1. Example of shallow copy: list copy

colours1 = ['red', 'blue']
colours2 = colours1

Shallow Copy

The example above simply assigns the colours1 list to colours2. Assignment alone does not create a new list: both names refer to the same list object. The list is a so-called “shallow list” because it has no nested structure.
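The difference between plain assignment, a shallow copy, and a deep copy can be sketched as follows. The names `colours3`, `nested1`, `nested2`, and `nested3` are introduced here for illustration only.

```python
import copy

colours1 = ["red", "blue"]
colours2 = colours1         # assignment: a second name for the same list
colours3 = colours1.copy()  # shallow copy: a new list object

colours1.append("green")
print(colours2)  # sees the change, because it is the same object
print(colours3)  # unaffected by the append

# With a nested list, a shallow copy still shares the inner list.
nested1 = [["red", "blue"], "green"]
nested2 = copy.copy(nested1)      # shallow: inner list is shared
nested3 = copy.deepcopy(nested1)  # deep: inner list is duplicated

nested1[0].append("yellow")
print(nested2[0])  # the shared inner list has changed
print(nested3[0])  # the deep copy has not
```

For a flat list a shallow copy is enough; a deep copy only matters once the list contains mutable elements.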

Xia Song

Data Scientist Costar
