A question of binary classification in machine learning

Requirements:
Currently, 10 to 20 high-quality articles are manually selected every day from the 1,000 to 2,000 articles we crawl, and these are pushed to customers. We would like to replace this manual selection with automatic screening by a machine.

How can machine learning be used to implement this requirement?

A sample looks like this:

platform | title | content | content length | whether pushed (label)

Mar.24,2021

How big is your sample? Do you have on the order of 1,000,000 examples? If the sample is very large, you can go straight to deep learning. If it is not that large, you can apply logistic regression directly to the samples you described. However, if you extract your own features, too few features may make the screening inaccurate, while too many may cause overfitting. You will have to experiment yourself.
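For the small-sample case, logistic regression on the article text is enough to get started. A minimal sketch with scikit-learn, assuming you have a labeled history of pushed / not-pushed articles (the toy texts and labels below are illustrative, not real data):

```python
# Logistic regression on TF-IDF features of article text.
# Label convention (assumed): 1 = manually selected (pushed), 0 = not pushed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "in-depth analysis of market trends with original reporting",
    "click here to win a free prize now",
    "exclusive interview with the lead engineer of the project",
    "buy cheap watches discount sale",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Score a new article; push it when the predicted probability is high enough.
prob_push = model.predict_proba(["new original reporting on market trends"])[0][1]
print(prob_push)
```

In practice you would rank each day's 1,000 to 2,000 crawled articles by `prob_push` and take the top 10 to 20, which matches the current manual workflow better than a hard 0.5 threshold.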


Think carefully about feature engineering first, and grab more dimensions while crawling the news. The key question is then how to judge "high quality".

  • For example, the number of views, the number of comments, the number of reposts, the article length, and whether it contains advertising may all be factors that affect the "quality" of an article.
  • Then pass these features into a model (LR / decision tree / SVM) as input, and take the push / no-push decision as output.

In addition, if you do not want to do feature engineering, you can consider deep learning. Turn each news article into a sequence of word embeddings (treating the long text as a token sequence), feed that into a neural network, and output a binary classification: high quality or not. The word embeddings can either be pre-trained in advance or trained jointly with the model.
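A hedged PyTorch sketch of that route, with embeddings trained jointly with the classifier; the vocabulary size, embedding dimension, and mean-pooling choice are toy assumptions (a real model would tokenize the text and likely use an RNN or Transformer instead of pooling):

```python
# Word embeddings + binary classifier, trained end to end.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 1000, 32  # illustrative sizes

class ArticleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Embedding trained together with the classifier; to use pre-trained
        # vectors instead, initialize with nn.Embedding.from_pretrained(...).
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.fc = nn.Linear(EMBED_DIM, 2)  # two classes: push / skip

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        vecs = self.embed(token_ids)   # (batch, seq_len, EMBED_DIM)
        pooled = vecs.mean(dim=1)      # average over the sequence
        return self.fc(pooled)         # (batch, 2) class logits

model = ArticleClassifier()
batch = torch.randint(0, VOCAB_SIZE, (4, 50))  # 4 fake articles, 50 tokens each
logits = model(batch)
print(logits.shape)
```

Training would then minimize `nn.CrossEntropyLoss()` over the logits against the push/skip labels in the usual way.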
