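The Cora dataset is a dataset for classifying machine learning papers. It contains the following files:
-- README: a description of the dataset.
-- cora.cites: the citation graph between papers. Each line contains two paper IDs: the first is the ID of the cited paper, and the second is the ID of the citing paper.
-- cora.content: information on 2708 papers, one paper per line, in the format <paper_id> <word_attributes>+ <class_label>.
paper_id is the unique identifier of the paper.
word_attributes is a 1433-dimensional binary word vector. Each element corresponds to one word: 0 means the word does not appear in the paper, and 1 means it does.
class_label is the category of the paper. Each paper is assigned to one of 7 classes: Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, Theory.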
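The line format described above can be sketched on a toy input. This is only an illustration: the three rows below are made up, use 4 word attributes instead of 1433, and are read from an in-memory string rather than the real cora/cora.content file.

```python
import io
import pandas as pd

# Toy stand-in for cora.content (assumption: made-up IDs and values,
# 4 word attributes instead of the real 1433). Fields are tab-separated:
# <paper_id> <word_attributes>+ <class_label>
toy_content = (
    "101\t0\t1\t0\t1\tNeural_Networks\n"
    "205\t1\t0\t0\t0\tTheory\n"
    "309\t0\t0\t1\t1\tNeural_Networks\n"
)

df = pd.read_csv(io.StringIO(toy_content), sep="\t", header=None)

paper_ids = df.iloc[:, 0].to_numpy()                    # first column: paper ID
features = df.iloc[:, 1:-1].to_numpy(dtype="float32")   # middle columns: 0/1 word vector
labels = df.iloc[:, -1]                                 # last column: class label

print(paper_ids.shape, features.shape)
print(sorted(labels.unique()))
```

On the real file the same slicing should yield a 2708 x 1433 feature matrix, since the 1435 columns include one ID column and one label column.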
import pandas as pd
import numpy as np
import scipy.sparse as sp
# Load the data; fields are tab-separated
raw_data_content = pd.read_csv('cora/cora.content', sep='\t', header=None)
# Expected shape: [2708 x 1435]
(row, col) = raw_data_content.shape
print("Cora Content's Row: {}, Col: {}".format(row, col))
print("=============================================")
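# Each row is a 1435-dimensional vector: the first column is the paper ID, the last is the paper's label
raw_data_sample = raw_data_content.head(3)  # first 3 rows
features_sample = raw_data_sample.iloc[:, 1:-1]  # iloc selects by position; drop the ID and label columns
labels_sample = raw_data_sample.iloc[:, -1]  # last column only: the class label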
labels_onehot_sample = pd.get_dummies(labels_sample)
print("features:{}".format(features_sample))
print("=============================================")
print("labels:{}".format(labels_sample))
print("=============================================")
print("labels one hot:{}".format(labels_onehot_sample))
raw_data_cites = pd.read_csv('cora/cora.cites', sep='\t', header=None)
# Expected shape: [5429 x 2]
(row, col) = raw_data_cites.shape
print("Cora Cites' Row: {}, Col: {}".format(row, col))
print("=============================================")
raw_data_cites_sample = raw_data_cites.head(10)
print(raw_data_cites_sample)
print("=============================================")
# Convert the citation list to an adjacency matrix
idx = np.array(raw_data_content.iloc[:, 0], dtype=np.int32)
idx_map = {j: i for i, j in enumerate(idx)}  # map paper ID -> row index in cora.content
edge_indexs = np.array(list(map(idx_map.get, raw_data_cites.values.flatten())), dtype=np.int32)
edge_indexs = edge_indexs.reshape(raw_data_cites.shape)
# The matrix is N x N, where N is the number of papers (2708), not the number of edges (5429)
adjacency = sp.coo_matrix((np.ones(edge_indexs.shape[0]),
                           (edge_indexs[:, 0], edge_indexs[:, 1])),
                          shape=(len(idx), len(idx)), dtype="float32")
print(adjacency)
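To make the ID-to-index mapping and the matrix shape concrete, here is a minimal sketch on a toy graph. The paper IDs and citation pairs are made up. The final symmetrization step is not part of the code above; it is shown only as an assumption about a common preprocessing choice when the graph is treated as undirected.

```python
import numpy as np
import scipy.sparse as sp

# Made-up paper IDs, in the order they would appear in cora.content
paper_ids = np.array([101, 205, 309])
# Each row is (cited paper ID, citing paper ID), as in cora.cites
cites = np.array([[101, 205],   # paper 205 cites paper 101
                  [101, 309]])  # paper 309 cites paper 101

# Map each paper ID to its row index
idx_map = {pid: i for i, pid in enumerate(paper_ids.tolist())}
edges = np.array([[idx_map[cited], idx_map[citing]]
                  for cited, citing in cites.tolist()], dtype=np.int32)

# One entry per edge, placed in an N x N matrix (N = number of papers)
n = len(paper_ids)
adj = sp.coo_matrix((np.ones(len(edges), dtype="float32"),
                     (edges[:, 0], edges[:, 1])),
                    shape=(n, n)).tocsr()

# Optional (assumption, not in the original code): make the graph undirected
adj_sym = adj.maximum(adj.T)

print(adj.toarray())
print(adj_sym.toarray())
```

With this layout, row i has a nonzero entry in column j when paper j cites paper i.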
Tensorflow-GCN-Cora Dataset in Practice: Notes to My Forgetful Future Self
Original article: https://www.cnblogs.com/aluckystone/p/14161584.html