Monday, November 24, 2014

Large Scale Machine Learning and Other Animals: Matrix Market Format

Matrix Market is a very simple format devised by NIST to store different types of matrices. For GraphLab matrix libraries: linear solvers...

I am working in Python and I have a matrix stored in a text file. The file is laid out like this:
row_id, col_id
row_id, col_id
...
row_id, col_id
row_id and col_id are integers ranging from 0 to n. To find n for each of them, I have to scan the entire file first.

There is no header. Individual row_ids and col_ids appear many times in the file, but each (row_id, col_id) pair appears only once. No explicit value is stored for a pair: every listed cell has the value 1. The file is almost 1 GB in size.

Unfortunately the file is hard to handle in memory: it contains 2257205 row_ids and 122905 col_ids, for 26622704 elements. I am therefore looking for a better way to handle it, and the Matrix Market format could be one.
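
To put those figures in perspective, here is a rough back-of-the-envelope estimate. It assumes the quoted counts can be treated as the matrix dimensions, one byte per cell for a dense array, and two 32-bit indices per stored entry for a coordinate-style sparse layout. None of these choices come from the file itself, so treat the numbers as orders of magnitude only.

```python
# Figures quoted in the question.
n_rows, n_cols, nnz = 2_257_205, 122_905, 26_622_704

# Assumption: a dense array with one byte per cell.
dense_bytes = n_rows * n_cols

# Assumption: a coordinate layout storing two 32-bit indices per entry
# (no value array is needed, since every stored cell is 1).
sparse_bytes = nnz * 2 * 4

print(f"dense:  about {dense_bytes / 1e9:.0f} GB")
print(f"sparse: about {sparse_bytes / 1e6:.0f} MB")
```

Under these assumptions the dense form is out of reach on an ordinary machine, while the index pairs alone are a few hundred megabytes.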

Is there a fast, memory-efficient way to convert this file into Matrix Market format using Python?
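
One possible approach, as a minimal sketch rather than a definitive answer, is to stream the file twice: a first pass to find the dimensions and the number of entries, and a second pass to write the entries. It assumes every non-empty line has the form "row_id, col_id" with 0-based integers, and it uses the Matrix Market "coordinate pattern" variant, which stores only positions (1-based) and no values. The function name edge_list_to_matrix_market and the file paths are hypothetical.

```python
def edge_list_to_matrix_market(src_path, dst_path):
    """Stream a 0-based 'row_id, col_id' file into a Matrix Market pattern file.

    Only one line is held in memory at a time. Returns (n_rows, n_cols, nnz).
    """
    max_row = max_col = -1
    nnz = 0

    # Pass 1: find the largest indices and count the entries.
    with open(src_path) as src:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            r, c = line.split(",")
            r, c = int(r), int(c)
            if r > max_row:
                max_row = r
            if c > max_col:
                max_col = c
            nnz += 1

    n_rows, n_cols = max_row + 1, max_col + 1

    # Pass 2: write the header, the size line, then one entry per line.
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write("%%MatrixMarket matrix coordinate pattern general\n")
        dst.write(f"{n_rows} {n_cols} {nnz}\n")
        for line in src:
            if not line.strip():
                continue
            r, c = line.split(",")
            # Matrix Market indices are 1-based, so shift by one.
            dst.write(f"{int(r) + 1} {int(c) + 1}\n")

    return n_rows, n_cols, nnz
```

Because the dimensions are derived from the largest index seen, trailing empty rows or columns would not be represented. If SciPy is available, scipy.io.mmread should be able to load the resulting file as a sparse matrix, with each pattern entry read as 1.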