Data is the fuel of the big data era, and we can extract insightful information from it. However, massive datasets with a high number of dimensions can bury valuable knowledge. Data mining and feature engineering are therefore essential skills for uncovering the information hidden underneath the data.
For dimensionality reduction, Principal Component Analysis (PCA) is the most popular algorithm. PCA encodes the original features into a compact representation, so we can drop the “unimportant” features while still retaining most of the useful information. …
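To make the idea concrete, here is a minimal PCA sketch using only NumPy; the toy 2-feature dataset and the choice to keep one component are illustrative assumptions, not part of the original project.

```python
import numpy as np

# Toy data: 100 samples, 2 correlated features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# 1. Center the data so each feature has zero mean.
Xc = X - X.mean(axis=0)

# 2. Eigendecompose the covariance matrix of the centered data.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Keep the component with the largest eigenvalue and project onto it.
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:1]]
X_reduced = Xc @ components            # shape (100, 1): the compact representation

# Fraction of total variance retained by the kept component.
explained = eigvals[order[0]] / eigvals.sum()
print(X_reduced.shape)
```

Because the two toy features are strongly correlated, a single component retains almost all of the variance, which is exactly the "compact representation" PCA promises.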
The CDC confirmed the first suspected locally transmitted COVID-19 case in California on 2-26-2020. The patient had no travel history to an outbreak area and no contact with anyone diagnosed with the virus, and it is suspected to be the first local transmission case in the United States. Since that first local transmission case was confirmed in CA, most confirmed cases in the state have come from community transmission.
I live in CA and am eager to understand COVID-19 transmission and related facts. Two time series, showing the number of COVID-19 confirmed cases and the death toll, are presented in the following figures. In the last…
Predicting the trends of COVID-19 is too complicated, because the number of confirmed cases is controlled by too many factors, natural and man-made. This post simply describes the COVID-19 trends of each US state during the past several weeks.
After completing two recommendation projects using Amazon Personalize, I have a deep understanding of the mechanism of collaborative filtering recommendation, especially recommendation based on co-occurrence. This post presents the detailed algorithm theory and Python code for the co-occurrence recommendation algorithm.
Co-occurrence recommendation belongs to the collaborative filtering family. Technically, there are two approaches to building recommender systems: content-based and collaborative filtering. These are intrinsically different methods: the content-based approach needs metadata about the items, so that items with similar properties can be recommended. Typical metadata for houses include area, year built, number of bedrooms, number of bathrooms, etc. …
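As a minimal sketch of the co-occurrence idea, the snippet below counts how often pairs of items appear in the same user's interaction set and recommends the most frequent co-occurring items. The user baskets and item names are made up for illustration; they are not from the Amazon Personalize projects.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical user interaction sets (illustrative only).
baskets = [
    {"houseA", "houseB", "houseC"},
    {"houseA", "houseB"},
    {"houseB", "houseC"},
    {"houseA", "houseD"},
]

# Count how often each pair of items appears together for the same user.
co_counts = defaultdict(Counter)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def recommend(item, k=2):
    """Return the k items that most often co-occur with `item`."""
    return [other for other, _ in co_counts[item].most_common(k)]

print(recommend("houseA"))
```

Note that no item metadata is needed, which is precisely what separates this collaborative filtering approach from the content-based one described above.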
Before diving into data, it is very helpful to understand its attributes, structure, data types, and missing values. The following are some basic aspects of a dataset we should explore.
2. Data columns
3. Data types of each column
Select columns of a specific data type (for example, datetime64) from a dataframe
df_cols = [col for col in df.select_dtypes(include=['datetime64'])]
4. Percentage of missing values of each column
df.isnull().sum() / df.shape[0] * 100
5. For a categorical column, check the counts of each category
6. For a categorical column, check how many categories there are in total
7. For a categorical column, check what the categories are
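The checks above can be pulled together in one short pandas sketch; the dataframe and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical dataframe standing in for the real dataset.
df = pd.DataFrame({
    "city": ["LA", "SF", "LA", None],
    "date": pd.to_datetime(["2020-02-26", "2020-03-01",
                            "2020-03-02", "2020-03-03"]),
    "cases": [1, 3, 5, 8],
})

print(list(df.columns))                          # 2. column names
print(df.dtypes)                                 # 3. data type of each column
dt_cols = [col for col in df.select_dtypes(include=["datetime64"])]
print(dt_cols)                                   #    datetime64 columns only
print(df.isnull().sum() / df.shape[0] * 100)     # 4. % missing per column
print(df["city"].value_counts())                 # 5. counts of each category
print(df["city"].nunique())                      # 6. number of categories
print(df["city"].unique())                       # 7. the categories themselves
```

Note that `value_counts()` and `nunique()` skip missing values by default, so the missing-value percentage in step 4 is worth checking first.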
The major objective of this project is to use XGBoost, a neural network, and an LSTM neural network to develop predictive models that automatically predict how helpful specific reviews are to customers.
Many e-commerce companies, for example Amazon, depend heavily on consumer reviews to give potential purchasers an evaluation of a product. Online merchants are distinct from local markets, where consumers can choose items based on on-site evaluation; online customers must rely on other information to help them make their purchase decisions. Developing online communication channels among customers is therefore critically important for commercial websites. To this end, Amazon allows…
Among machine learning algorithms, XGBoost is one of the top performers, providing strong solutions to many different prediction and classification problems. The major objective of this post is to explore how categorical encoding methods affect XGBoost model performance.
Personally, XGBoost is always the first…
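For reference, here is a pandas-only sketch of two common categorical encodings that such an experiment would compare; fitting the actual XGBoost model on the encoded columns is omitted, and the `color` column is a made-up example.

```python
import pandas as pd

# Hypothetical categorical feature (illustrative only).
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Ordinal / label encoding: one integer column; implies an order
# among categories (pandas assigns codes alphabetically here).
ordinal = df["color"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category; no implied order,
# but the feature count grows with the number of categories.
onehot = pd.get_dummies(df["color"], prefix="color")

print(ordinal.tolist())
print(onehot.columns.tolist())
```

Tree-based models like XGBoost can often work with ordinal codes despite the artificial ordering, while one-hot encoding avoids that ordering at the cost of a wider feature matrix; comparing the two on the same model is the heart of the experiment.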
The basic idea of this project is that words which tend to appear in similar contexts are likely to be related: “You shall know a word by the company it keeps” (Firth, 1957). Building on this concept, this project mainly investigates a word embedding based on co-occurrence statistics.
Data source: the Brown Corpus is a collection of text samples from a wide range of sources, totaling over a million words. The analysis in this project is based mainly on the Brown Corpus.
Download the Brown Corpus data from nltk.corpus (brown.words).
import nltk
import matplotlib.pyplot as plt
import…
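As a self-contained sketch of co-occurrence-based word vectors, the snippet below builds each word's vector from its co-occurrence counts within a fixed window; a tiny two-sentence corpus stands in for the Brown Corpus so the example runs without downloading nltk data.

```python
# Toy corpus standing in for brown.words() (illustrative only).
corpus = ["the cat sat on the mat".split(),
          "the dog sat on the log".split()]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}
window = 2  # count neighbors up to 2 positions away

# Each word's vector is its row of co-occurrence counts over the vocabulary.
counts = [[0] * len(vocab) for _ in vocab]
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[index[w]][index[sent[j]]] += 1

print(vocab)
print(counts[index["cat"]])
print(counts[index["dog"]])
```

Because "cat" and "dog" appear in identical contexts in this toy corpus, their count vectors come out identical, which is exactly the Firth intuition the project builds on; on the real Brown Corpus the counts would typically be reweighted (e.g. with PPMI) and reduced with something like PCA.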