
Reading tabular data in a non-standard format

This post was going to be called ‘faffing around with data’, which might summarise many people’s feelings towards dealing with data that isn’t in the format they need it to be in. An earlier post described adding an ENVI header to binary files so they could be read into GDAL. This post details using Python to read tabular data stored in a text file, although many of the techniques would apply to extracting any data from a text file.

Built-in functions

Before setting out, it’s a good idea to confirm the data can’t already be read with any of the functions available in Python. If the file opens in Excel, it’s likely these will work, and they may well be the easier option (no point reinventing the wheel!).

The csv module in the standard Python library is easy to use and can handle various delimiters:

import csv

# Open file (newline='' is recommended when using the csv module)
f = open('soil_moisture.csv', 'r', newline='')
csvfile = csv.reader(f, delimiter=',')

# Skip first line (header)
headerline = next(csvfile)

# Iterate through file
for line in csvfile:
    print('First col = ', line[0])

f.close()
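
If the file has a header row, the csv module’s DictReader class lets you refer to columns by name rather than position, which can make the code easier to follow. A minimal sketch, assuming the file contains a column called ‘soil_moisture’:

import csv

# DictReader takes the field names from the first row by default
with open('soil_moisture.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # The column name 'soil_moisture' is an assumption about the file
        print('Soil moisture = ', row['soil_moisture'])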

In NumPy, the genfromtxt function can be used to read a file directly into a NumPy array:

import numpy as np

smdata = np.genfromtxt('soil_moisture.csv', delimiter=',', skip_header=1)
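
genfromtxt can also pick up column names from the header row, so individual columns can be accessed by name afterwards. A short sketch, assuming the file has a header row and contains a ‘soil_moisture’ column:

import numpy as np

# Use the header row for column names and let genfromtxt infer the types
smdata = np.genfromtxt('soil_moisture.csv', delimiter=',', names=True,
                       dtype=None, encoding='utf-8')

# Columns can then be accessed by name ('soil_moisture' is assumed here)
print(smdata['soil_moisture'])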

Depending on how many files you have to process, a bit of clever use of find and replace in a text editor, or the sed command, might allow you to get your data into a format that can be read by the built-in functions. If this ends up being too messy, it will at least provide some insight into the steps needed to write a script to read the data.
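
If there are a lot of files, the same find-and-replace idea can be scripted in Python. A rough sketch, assuming a set of semicolon-delimited ‘.txt’ files that need converting to comma-separated:

import glob

# Replace a non-standard delimiter (';' is assumed here) with commas
# across every matching file and save a cleaned copy
for path in glob.glob('*.txt'):
    with open(path, 'r') as f:
        text = f.read()
    with open(path.replace('.txt', '_fixed.csv'), 'w') as f:
        f.write(text.replace(';', ','))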

Writing a Custom Reader

The basic structure of a custom reader is to iterate through each line in the file and split it into columns.


# Open file
infile = open('soil_moisture.csv', 'r')

# Read header to separate variable (repeat for multiple lines)
header1 = next(infile)

# Iterate through file
for line in infile:
    # Count delimiters (:)
    ncolon = line.count(':')
    # Check we have 3 (4 columns)
    if ncolon == 3:
        # Separate into columns
        elements = line.split(':')
        print('First col = ', elements[0])
infile.close()
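
Rather than just printing each column, the values can be collected into lists as the file is read and converted to a NumPy array afterwards. A sketch along those lines, assuming the value of interest is in the last of the four columns:

import numpy as np

soil_moisture = []

# Open the file and skip the header, as in the example above
with open('soil_moisture.csv', 'r') as infile:
    header1 = next(infile)
    for line in infile:
        if line.count(':') == 3:
            elements = line.split(':')
            # Assume the last of the four columns holds the soil moisture value
            soil_moisture.append(float(elements[3]))

# Convert the list to a NumPy array for further processing
soil_moisture = np.array(soil_moisture)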

If the data is fixed width and separated by multiple spaces, I use the re module in Python to replace runs of whitespace with another character:

import re

line = 'tonzi  2013   12  18 1  1 9  1   20   15.69'
linesub = re.sub(r'\s+', ':', line)
elements = linesub.split(':')

Here ‘\s+’ is a regular expression matching one or more whitespace characters. When choosing the character to use as a replacement, make sure it’s not used anywhere else in the file.
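
An alternative is re.split, which splits on the whitespace pattern directly and avoids having to choose a replacement character at all:

import re

line = 'tonzi  2013   12  18 1  1 9  1   20   15.69'
# Split on one or more whitespace characters in a single step
elements = re.split(r'\s+', line.strip())

For whitespace in particular, the built-in str.split with no arguments does the same job without needing re at all.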

Time Data

Time can be stored in a range of different formats. Python offers ways of storing and manipulating time and date data as part of the time module. One of the easiest ways of getting time from a text file into the Python representation is to build a ‘struct_time’ object. If each component of the date (year, month, etc.) is stored as a separate column you can use the following:

import time

# 'elements' is the list of columns from splitting the line, as above
# (a two-digit year is assumed to be stored in the second column)
year = int('20' + elements[1])
month = int(elements[2])
day = int(elements[3])
hour = int(elements[4])
minute = int(elements[5])
second = int(elements[6])

# Remaining fields (weekday, day of year, DST flag) are set to zero
mTimeTS = time.struct_time((year, month, day, hour, minute, second, 0, 0, 0))
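
Once the time is stored as a struct_time it can be converted to seconds since the epoch with time.mktime, or formatted back to a string with time.strftime. A short sketch using values from the example line above:

import time

# struct_time built as in the example above (weekday, day of year and DST flag set to 0)
mTimeTS = time.struct_time((2013, 12, 18, 1, 1, 9, 0, 0, 0))

# Seconds since the epoch (interpreted as local time)
epochSecs = time.mktime(mTimeTS)

# Or format back to a string
print(time.strftime('%Y-%m-%d %H:%M:%S', mTimeTS))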

If the date is stored in a single column, the ‘strptime’ function can be used to convert it to a Python time structure. For example:

import time
timeStr = '2014-07-10 08:14:01'
mTimeTS = time.strptime(timeStr,'%Y-%m-%d %H:%M:%S')
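
The fields of the resulting struct_time can then be accessed by name to pull out individual components:

import time

timeStr = '2014-07-10 08:14:01'
mTimeTS = time.strptime(timeStr, '%Y-%m-%d %H:%M:%S')

# Individual components are available as named attributes
print(mTimeTS.tm_year, mTimeTS.tm_mon, mTimeTS.tm_mday)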

Writing data out

The csv module, mentioned earlier, can also be used to write formatted data out to a CSV file:

import csv

# Input file opened and header skipped as described earlier
infile = open('soil_moisture.csv', 'r')
header1 = next(infile)

outfile = open('outdata.csv', 'w', newline='')
outcsv = csv.writer(outfile)

# Write header row
outcsv.writerow(['x', 'y'])

# Iterate through input file
for line in infile:
    # Split data into 'elements' as described earlier
    elements = line.split(':')

    # Write first two columns out to CSV
    outcsv.writerow([elements[0], elements[1]])

infile.close()
outfile.close()
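
To tie the pieces together, here is a rough end-to-end sketch that reads a whitespace-delimited file, parses the date columns and writes a formatted time plus a data column out to CSV. The file name, column layout and output columns are all assumptions for illustration:

import csv
import re
import time

# Input file name and column layout are assumptions for illustration
infile = open('soil_moisture.txt', 'r')
outfile = open('outdata.csv', 'w', newline='')
outcsv = csv.writer(outfile)

# Write header row
outcsv.writerow(['time', 'soil_moisture'])

# Skip the input header line
header1 = next(infile)

for line in infile:
    # Split on runs of whitespace
    elements = re.split(r'\s+', line.strip())
    # Expect ten columns, as in the example line earlier
    if len(elements) == 10:
        # Columns 1-6 assumed to hold year, month, day, hour, minute, second
        mTimeTS = time.struct_time((int(elements[1]), int(elements[2]),
                                    int(elements[3]), int(elements[4]),
                                    int(elements[5]), int(elements[6]),
                                    0, 0, 0))
        # Write the formatted time and the last column (assumed soil moisture)
        outcsv.writerow([time.strftime('%Y-%m-%d %H:%M:%S', mTimeTS),
                         elements[9]])

infile.close()
outfile.close()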