TextAnalyzer
A text analyzer. So far it can extract hot words from a text segment using the tf-idf algorithm, applying a score factor to adjust the final score.
It also provides machine learning for text classification.
Features
extracting hot words from a text.
- gather statistics by term frequency.
- gather statistics by the tf-idf algorithm.
- additionally weight the statistics by a score factor.
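One way the three statistics above could combine is sketched below. This is not TextAnalyzer's own implementation: the class `TfIdfSketch`, the smoothed idf formula, and the per-term `factors` map (defaulting to 1.0) are all assumptions made for illustration, and the input is assumed to be already tokenized.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {

    /** Counts how often each term occurs in one tokenized document. */
    static Map<String, Integer> termFrequency(List<String> doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    /** Smoothed idf, log((1 + N) / (1 + df)) + 1; one common variant, assumed here. */
    static double idf(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
    }

    /** score = (count / docLength) * idf * factor, where factor is a hypothetical per-term boost. */
    static Map<String, Double> score(List<String> doc,
                                     List<List<String>> corpus,
                                     Map<String, Double> factors) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequency(doc).entrySet()) {
            double tf = e.getValue() / (double) doc.size();
            double factor = factors.getOrDefault(e.getKey(), 1.0);
            scores.put(e.getKey(), tf * idf(e.getKey(), corpus) * factor);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("apple", "banana", "apple"),
                Arrays.asList("banana", "cherry"),
                Arrays.asList("banana", "date"));
        Map<String, Double> scores = score(corpus.get(0), corpus, new HashMap<>());
        scores.forEach((term, s) -> System.out.println(term + " : " + s));
    }
}
```

A term that is frequent in one document but rare across the corpus ("apple" here) scores higher than a term that appears everywhere ("banana"), unless a factor boosts the latter.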
synonyms can be recognized.
SVM Classifier
The analyzer can classify text with an SVM. The text is first converted to a vector; the samples are then used to train a model, and the model classifies new text.
For convenience, the model, the tf-idf data and the vectors are stored.
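The vectorizing step can be pictured as follows. This is a minimal sketch under stated assumptions, not the analyzer's code: `WordVectorSketch`, `fit` and `toVector` are hypothetical names, the text is assumed to be tokenized already, and plain term counts stand in for whatever weighting the analyzer really uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WordVectorSketch {

    // Maps each known term to its position in the vector.
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    /** Assigns every distinct training term a fixed vector position. */
    void fit(List<List<String>> samples) {
        for (List<String> sample : samples) {
            for (String term : sample) {
                vocabulary.putIfAbsent(term, vocabulary.size());
            }
        }
    }

    /** Converts tokens to term counts; terms outside the vocabulary are ignored. */
    double[] toVector(List<String> tokens) {
        double[] vector = new double[vocabulary.size()];
        for (String term : tokens) {
            Integer index = vocabulary.get(term);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        WordVectorSketch sketch = new WordVectorSketch();
        sketch.fit(Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("b", "c")));
        System.out.println(Arrays.toString(sketch.toVector(Arrays.asList("b", "b", "z", "c"))));
    }
}
```

Because the vocabulary is fixed at training time, every text maps to a vector of the same length, which is what an SVM needs as input.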
kmeans clustering && xmeans clustering
The analyzer can cluster text with kmeans and xmeans.
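For readers unfamiliar with k-means, here is a minimal sketch of the algorithm on numeric vectors. It is an illustration only and makes several assumptions: `KMeansSketch` is a hypothetical class, the texts are assumed to be vectorized already, and the centroids start at the first k points rather than being chosen randomly. X-means, which also chooses the number of clusters, is not covered.

```java
import java.util.Arrays;

final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Returns one cluster label per point; requires 1 <= k <= points.length. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // simplification: first k points as seeds
        }
        int[] labels = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = nearest(points[p], centroids);
                if (best != labels[p]) {
                    labels[p] = best;
                    changed = true;
                }
            }
            if (iter > 0 && !changed) {
                break; // assignments are stable
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[labels[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}, {1, 0}, {11, 10}};
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```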
vsm clustering
The analyzer can cluster text with a vector space model (VSM).
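In a vector space model, the similarity of two texts is commonly measured by the cosine of the angle between their vectors. The sketch below shows only that measure; `CosineSketch` is a hypothetical class, and how the analyzer groups texts once similarities are known is not described here, so it is left out.

```java
final class CosineSketch {

    /** Cosine similarity of two equal-length vectors; 0.0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 1}));
    }
}
```

Vectors pointing in the same direction score close to 1 regardless of length, so a long and a short text on the same topic still count as similar.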
Dependencies
TODO
- other machine learning algorithms.
- sentiment analysis.
How to use
Usage is as simple as this:
extracting hot words
- index a document to get a docId.
long docId = TextIndexer.index(text);
- extract hot words by docId.
HotWordExtractor extractor = new HotWordExtractor();
List<Result> list = extractor.extract(0, 20, false);
if (list != null) {
    for (Result s : list) {
        System.out.println(s.getTerm() + " : " + s.getFrequency() + " : " + s.getScore());
    }
}
Each result contains a term, its frequency and its score:
失业证 : 1 : 0.31436604
户口 : 1 : 0.30099702
单位 : 1 : 0.29152703
提取 : 1 : 0.27927202
领取 : 1 : 0.27581802
职工 : 1 : 0.27381304
劳动 : 1 : 0.27370203
关系 : 1 : 0.27080503
本市 : 1 : 0.27080503
终止 : 1 : 0.27080503
SVM classifier
- train on the samples.
SVMTrainer trainer = new SVMTrainer();
trainer.train();
- predict the class of a text.
double[] data = trainer.getWordVector(text);
trainer.predict(data);
kmeans clustering && xmeans clustering
List<String> list = DataReader.readContent(KMeansCluster.DATA_FILE);
int[] labels = new KMeansCluster().learn(list);
vsm clustering
List<String> list = DataReader.readContent(VSMCluster.DATA_FILE);
List labels = new VSMCluster().learn(list);