Open-sourcing a text analysis project
Published: 2019-06-19


GitHub

TextAnalyzer

A text analyzer. So far, it can extract hot words from a text segment using the tf-idf algorithm, with an additional score factor applied to optimize the final score.

It also provides machine learning for text classification.

Features

Extracting hot words from a text:

  1. gathering statistics by term frequency.
  2. gathering statistics by the tf-idf algorithm.
  3. gathering statistics with an additional score factor.
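
As a rough illustration of the second step, here is a minimal tf-idf sketch in Java. It is not the project's code: the class name `TfIdfSketch`, the pre-tokenized input, and the smoothed idf formula `log((1 + N) / (1 + df)) + 1` are all assumptions chosen for the example, and the project's score factor is not modeled because the post does not say how it is computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of TextAnalyzer.
class TfIdfSketch {

    /** Term frequency: occurrences of each term divided by the document length. */
    static Map<String, Double> termFrequency(List<String> doc) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            e.setValue(e.getValue() / doc.size());
        }
        return tf;
    }

    /** Smoothed inverse document frequency (one common variant, assumed here). */
    static double idf(String term, List<List<String>> corpus) {
        int df = 0; // number of documents containing the term
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
    }

    /** tf-idf score of every term in doc, relative to the corpus. */
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> e : termFrequency(doc).entrySet()) {
            scores.put(e.getKey(), e.getValue() * idf(e.getKey(), corpus));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = new ArrayList<>();
        corpus.add(Arrays.asList("text", "analysis"));
        corpus.add(Arrays.asList("text", "mining"));
        // "analysis" appears in fewer documents than "text", so it scores higher.
        System.out.println(tfIdf(corpus.get(0), corpus));
    }
}
```

Terms that occur in every document get the lowest idf, which is why a plain frequency count (step 1) and a tf-idf ranking (step 2) can order the same words differently.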

Synonyms can be recognized.

SVM classifier

This analyzer supports classifying text with an SVM. This involves vectorizing the text: we train on the samples and then classify new text with the resulting model.

For convenience, the model, the tf-idf data, and the vectors are stored.
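
To make the vectorizing step concrete, here is a minimal bag-of-words sketch in Java. It is an assumption-laden illustration rather than the project's implementation: `TextVectorizerSketch` is a hypothetical class, the input is assumed to be already tokenized, and raw term counts are used where the project may use tf-idf weights.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of TextAnalyzer.
class TextVectorizerSketch {

    // Maps each known term to its position in the output vector.
    private final Map<String, Integer> index = new LinkedHashMap<>();

    /** Builds the vocabulary from the training documents, in first-seen order. */
    TextVectorizerSketch(List<List<String>> trainingDocs) {
        for (List<String> doc : trainingDocs) {
            for (String term : doc) {
                index.putIfAbsent(term, index.size());
            }
        }
    }

    /** Turns a tokenized document into a fixed-length vector of term counts. */
    double[] vectorize(List<String> doc) {
        double[] vector = new double[index.size()];
        for (String term : doc) {
            Integer position = index.get(term);
            if (position != null) { // terms unseen during training are ignored
                vector[position] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        TextVectorizerSketch vectorizer = new TextVectorizerSketch(Arrays.asList(
                Arrays.asList("refund", "order"),
                Arrays.asList("order", "delivery")));
        System.out.println(Arrays.toString(
                vectorizer.vectorize(Arrays.asList("order", "order", "invoice"))));
    }
}
```

Every document maps to a vector of the same length, which is what lets a classifier such as an SVM compare training samples with new text.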

kmeans clustering && xmeans clustering

This analyzer supports clustering text with k-means and x-means.
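
For readers unfamiliar with k-means, the following Java sketch shows the basic assign-then-update loop on numeric vectors. It is only an illustration under stated assumptions: `KMeansSketch` is a hypothetical class, the documents are assumed to be vectorized already, and the centroids are seeded naively with the first k points. It does not cover x-means, which additionally estimates the number of clusters.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of TextAnalyzer.
class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns a cluster label in [0, k) for each point; requires points.length >= k. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // naive seeding: first k points
        }
        int[] labels = new int[n];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDistance = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDistance) {
                        bestDistance = d;
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[labels[i]]++;
                for (int j = 0; j < dim; j++) {
                    sums[labels[i]][j] += points[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster keeps its previous centroid
                }
                for (int j = 0; j < dim; j++) {
                    centroids[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}};
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```

The result depends on the seeding, so practical implementations usually choose the starting centroids more carefully than this sketch does.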

vsm clustering

This analyzer supports clustering text with a vector space model (VSM).
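
In a vector space model, documents are compared by the angle between their term vectors. The post does not describe how the project groups documents, so the Java sketch below shows only the similarity measure such approaches are typically built on. `CosineSimilaritySketch` is a hypothetical name and the input vectors are assumed to be term counts or weights of equal length.

```java
// Hypothetical helper class, not part of TextAnalyzer.
class CosineSimilaritySketch {

    /** Cosine of the angle between two vectors; 0.0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty document is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Same direction, different length: similarity close to 1.
        System.out.println(cosine(new double[] {1, 2, 0}, new double[] {2, 4, 0}));
        // No shared terms: similarity 0.
        System.out.println(cosine(new double[] {1, 0, 0}, new double[] {0, 0, 3}));
    }
}
```

Because the measure ignores vector length, a long and a short document on the same topic can still be judged similar.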

Dependencies

TODO

  • other ML algorithms.
  • sentiment analysis.

How to use

It is as simple as this:

extracting hot words

  1. Index a document and get a docId.

long docId = TextIndexer.index(text);

  2. Extract hot words by docId.

HotWordExtractor extractor = new HotWordExtractor();
List<Result> list = extractor.extract(0, 20, false);
if (list != null)
    for (Result s : list)
        System.out.println(s.getTerm() + " : " + s.getFrequency() + " : " + s.getScore());

A result contains the term, its frequency, and its score.

失业证 : 1 : 0.31436604
户口 : 1 : 0.30099702
单位 : 1 : 0.29152703
提取 : 1 : 0.27927202
领取 : 1 : 0.27581802
职工 : 1 : 0.27381304
劳动 : 1 : 0.27370203
关系 : 1 : 0.27080503
本市 : 1 : 0.27080503
终止 : 1 : 0.27080503

SVM classifier

  1. Train on the samples.

SVMTrainer trainer = new SVMTrainer();
trainer.train();

  2. Predict the class of a text.

double[] data = trainer.getWordVector(text);
trainer.predict(data);

kmeans clustering && xmeans clustering

List list = DataReader.readContent(KMeansCluster.DATA_FILE);
int[] labels = new KMeansCluster().learn(list);

vsm clustering

List list = DataReader.readContent(VSMCluster.DATA_FILE);
List labels = new VSMCluster().learn(list);


Reposted from: http://zyrxx.baihongyu.com/
