Simple helper functions for machine learning in Python
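Merging lists column-wise (for a row-wise merge, just use +)
This seems like a very common operation, yet a long search on Baidu turned up no reliable answer. The function below also handles lists of unequal length.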
Reading a CSV file. A very concise idiom, which I like.
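Splitting the data into a training set and a test set. random_state selects which random sample is drawn; with the same seed, the split is identical.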
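Removing non-Chinese characters from a string. I threw this together rather carelessly; in my tests it did the job (to my own puzzlement), but I am not sure it is correct in general.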
Dimensionality reduction of TF-IDF features (chi-squared feature selection)
Random forest
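# Merge lists column-wise
def mergeList(*lsts):
    # Each argument is a list of rows, and each row is itself a list.
    maxlength = max(map(len, lsts))
    # Pad the shorter lists with placeholder rows. The padding must be a list
    # of rows, otherwise sum() below fails; two columns per row are assumed.
    data = [x + (maxlength - len(x)) * [['NULL', 'NULL']] for x in lsts]
    # zip(*data) groups the i-th rows; sum(r, []) concatenates them into one row.
    return [sum(r, []) for r in map(list, zip(*data))]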
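To make the unequal-length case concrete, here is a minimal sketch of the same idea as mergeList. The name `merge_columns` and the `fill` parameter are my own, not part of the original function, and the padding width is taken from each table's first row instead of being fixed at two columns.

```python
def merge_columns(*tables, fill="NULL"):
    """Merge several lists of rows side by side, padding short ones with `fill`."""
    height = max(len(t) for t in tables)
    padded = []
    for t in tables:
        # Width of this table's rows; an empty table contributes no columns.
        width = len(t[0]) if t else 0
        padded.append(t + [[fill] * width for _ in range(height - len(t))])
    # zip(*padded) yields the i-th row of every table; flatten them into one row.
    return [[cell for row in rows for cell in row] for rows in zip(*padded)]


a = [[1, 2], [3, 4], [5, 6]]
b = [["x"], ["y"]]
merged = merge_columns(a, b)
print(merged)  # [[1, 2, 'x'], [3, 4, 'y'], [5, 6, 'NULL']]
```

The third row of `b` does not exist, so it is filled with one placeholder cell, matching the width of the rows in `b`.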
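# Read a CSV file into a list of rows
with open('consno.csv', 'r') as final:
    data = [line.strip().split(',') for line in final]
# From the first 9956 rows, take the six columns before the last one, as integers
feature = [[int(x) for x in row[-7:-1]] for row in data[0:9956]]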
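Splitting on ',' works as long as no field contains a comma. As a hedged alternative, the standard csv module also handles quoted fields. The sketch below uses an in-memory sample whose layout (an id column, six integer columns, a trailing label) is invented for illustration and is not the real consno.csv.

```python
import csv
import io

# Hypothetical stand-in for a file such as consno.csv:
# an id column, six integer feature columns, and a trailing label column.
sample = io.StringIO(
    "u1,3,1,4,1,5,9,yes\n"
    "u2,2,7,1,8,2,8,no\n"
)

rows = list(csv.reader(sample))
# row[-7:-1] selects the six columns just before the last one.
feature = [[int(x) for x in row[-7:-1]] for row in rows]
print(feature)  # [[3, 1, 4, 1, 5, 9], [2, 7, 1, 8, 2, 8]]
```

With a real file, pass the object returned by `open(path, newline='')` to `csv.reader` in place of the StringIO.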
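# Split into a training set and a local test set (10% held out)
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split
feature_train, feature_local, target_train, target_local = train_test_split(
    feature, target, test_size=0.1, random_state=0)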
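The claim that a fixed seed reproduces the split is easy to check. Here is a small sketch on toy arrays (invented for illustration) that calls train_test_split twice with the same random_state and compares the results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 20 samples with two features each.
feature = np.arange(40).reshape(20, 2)
target = np.arange(20) % 2

split_a = train_test_split(feature, target, test_size=0.1, random_state=0)
split_b = train_test_split(feature, target, test_size=0.1, random_state=0)

# Each call returns [X_train, X_test, y_train, y_test]; compare them pairwise.
same = all(np.array_equal(x, y) for x, y in zip(split_a, split_b))
print(same)                              # True: same seed, same partition
print(len(split_a[0]), len(split_a[1]))  # 18 2
```

Changing random_state to another integer will generally produce a different partition of the same sizes.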
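# Remove non-Chinese characters from a string
import re

def is_chinese(uchar):
    # Keep the character if it lies in the range 一-龥 (U+4E00 to U+9FA5);
    # otherwise replace it with a space, which chinese_trim strips afterwards.
    return uchar if re.match(r"[一-龥]", uchar) else ' '

def chinese_trim(string):
    string = "".join(map(is_chinese, string))
    return string.replace(" ", "")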
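The same filtering can be done with a single regular-expression substitution. This is a minimal sketch, not the original approach; `keep_chinese` and `NON_CHINESE` are names I made up, and it assumes "Chinese" means the range U+4E00 to U+9FA5 (一 to 龥), so full-width punctuation is removed as well.

```python
import re

# Matches any run of characters outside the range U+4E00-U+9FA5 (一 to 龥).
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")


def keep_chinese(text):
    """Return `text` with every character outside 一-龥 removed."""
    return NON_CHINESE.sub("", text)


print(keep_chinese("Hello, 世界! 123 机器学习"))  # 世界机器学习
```

Compiling the pattern once avoids one function call per character, which matters when cleaning a large corpus.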
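# Reduce TF-IDF features with chi-squared feature selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
# messages: list of text documents; Y: their class labels
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)
# Keep the 1000 features with the highest chi-squared scores
ch2 = SelectKBest(chi2, k=1000)
X_ch2 = ch2.fit_transform(X, Y)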
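To see what the selection step keeps, here is a small end-to-end sketch. The four messages and their labels are invented for illustration, and k is set to 4 only because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels, invented for illustration (1 = spam, 0 = not spam).
messages = [
    "free prize win now",
    "win cash prize free",
    "meeting at noon today",
    "lunch meeting tomorrow noon",
]
Y = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)   # sparse matrix, one column per term

selector = SelectKBest(chi2, k=4)        # k should not exceed the vocabulary size
X_ch2 = selector.fit_transform(X, Y)

# Map the kept column indices back to terms through the fitted vocabulary.
index_to_term = {i: t for t, i in vectorizer.vocabulary_.items()}
kept = [index_to_term[i] for i in selector.get_support(indices=True)]
print(X.shape, X_ch2.shape, kept)
```

Note that TfidfVectorizer's default tokenizer splits on word boundaries, so Chinese text would presumably need to be segmented into space-separated words first.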
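# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
# treenum: number of trees; n_jobs=4 runs four jobs in parallel
clf = RandomForestClassifier(n_estimators=treenum, n_jobs=4)
# Train the model
s = clf.fit(feature_train, target_train)
# Evaluate accuracy on the held-out local set
r = clf.score(feature_local, target_local)
# Predict class probabilities for the test set
prob = clf.predict_proba(feature_test)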
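For anyone who wants to run the random forest steps without the original data, here is a self-contained sketch on synthetic data. The dataset, the tree count of 50, and the variable names are my own choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data, invented for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_local, y_train, y_local = train_test_split(
    X, y, test_size=0.1, random_state=0)

# random_state makes the forest itself reproducible as well.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_local, y_local)   # fraction of correct predictions
prob = clf.predict_proba(X_local)        # one row per sample, one column per class
print(accuracy, prob.shape)
```

Each row of `prob` sums to 1, and the column order follows `clf.classes_`.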