Tutorial on Geographical Spatial Temperature Prediction

Introduction

In this competition, you are given the weather information of several nearby places at a timestamp and asked to predict the temperature of a target place at the same timestamp.

We provide several kinds of observation data: temperature, precipitation (in the last hour), sunshine duration (in the last hour), and location information. Using all of them may improve prediction accuracy, but as a baseline this tutorial uses only the temperatures of the nearby places.

This tutorial uses Python and requires the scikit-learn and NumPy packages.

Load features and labels

First, we load the feature data and the target data from the TSV data files.

temperaturetrainfeaturefile = './Temperature_Train_Feature.tsv'
temperaturetraintargetfile = './Temperature_Train_Target.tsv'
temperaturetestfeaturefile = './Temperature_Test_Feature.tsv'

trainfeature = readfeature(temperaturetrainfeaturefile, True)
traintarget = readtarget(temperaturetraintargetfile, False)
testfeature = readfeature(temperaturetestfeaturefile, True)

The functions used for loading the data are as follows.

def readfeature(featurefile, has_header):
    feature = read_tsv_data(featurefile, has_header)
    featurefilled = fillmissingbyplacebyhour(feature)

    # keep only the per-place columns (columns 3 .. NUM_PLACE+2) as features
    data = []
    for row in featurefilled:
        data.append(row[3:NUM_PLACE+3])
    return data

def readtarget(targetfilename, has_header):
    with open(targetfilename) as targetfile:
        if has_header: targetfile.readline()
        data = []
        for line in targetfile:
            value = line.strip()
            data.append(float(value))
    return data

def read_tsv_data(file_path, has_header):
    with open(file_path) as f:
        if has_header: f.readline()
        data = []
        for line in f:
            values = line.strip().split("\t")
            rowdata = []
            for x in values:
                rowdata.append(float(x))
            data.append(rowdata)
    return data
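
As an aside, the same kind of file can also be read with NumPy directly. The sketch below is not part of the original tutorial: it uses a small made-up file (the column names and values are invented for illustration) and numpy.genfromtxt, which skips the header and turns "nan" entries into NaN values.

```python
import os
import tempfile

import numpy as np

# Hypothetical miniature file: a header row, then tab-separated values.
# The column names and numbers are made up for this example only.
content = "c0\tc1\tc2\n1.0\t2.5\tnan\n4.0\tnan\t6.0\n"

with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write(content)
    path = tmp.name

# skip_header=1 drops the header line; "nan" entries are parsed as numpy.nan.
table = np.genfromtxt(path, delimiter="\t", skip_header=1)
os.remove(path)

print(table.shape)                 # (2, 3)
print(int(np.isnan(table).sum()))  # 2
```

The result is a NumPy array rather than a list of lists, so it would need table.tolist() before being passed to the list-based functions in this tutorial.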

Fill the missing values

Some values are missing and are marked as nan in the data file, so we need to fill them when we load the data. Missing values can be filled in various ways. In this tutorial, we fill the value of a place at a given hour of a day with the median value of this place at the same hour on the other days. If a place has no value at all for some hour, we fall back to the median of that place's hourly medians. The function fillmissingbyplacebyhour(), which is called by the readfeature() function above, is as follows.

import math
from numpy import nanmedian

NUM_PLACE = 10
NUM_HOUR = 24

def fillmissingbyplacebyhour(databyhour):
    databyhourfilled = databyhour  # same list object: the input rows are filled in place
    for j in range(3,NUM_PLACE+3): # for each place (column in data)
        valuebyhour = []
        medianbyhour = []
        for i in range(0,NUM_HOUR): # for each hour (row in data, 24 rows per day)
            valuebyhour.append([])

        # store the values of each hour for a place respectively, real value or NaN
        hourid = 0
        for row in databyhour:  # for each hour (row in data, 24 rows per day)
            valuebyhour[hourid].append(row[j])
            hourid += 1
            if hourid % NUM_HOUR == 0:
                hourid = 0
        # compute the median value of each hour (except NaN)
        for hourid in range(0,NUM_HOUR):
            median = nanmedian(valuebyhour[hourid])
            medianbyhour.append(median)
        medianofplace = nanmedian(medianbyhour)
        for hourid in range(0,NUM_HOUR):
            if math.isnan(medianbyhour[hourid]):
                medianbyhour[hourid] = medianofplace

        # assign the NaN with median value of that hour
        num = len(databyhour)
        for k in range(0,num):
            if math.isnan(databyhour[k][j]):
                databyhourfilled[k][j] = medianbyhour[k % NUM_HOUR]

    return databyhourfilled
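
To make the filling rule concrete, here is a minimal vectorized sketch for a single place. It is an illustration, not the tutorial's implementation: the function name fill_by_hour_median is hypothetical, the data are synthetic, and it assumes the series starts at hour 0 and its length is a multiple of NUM_HOUR.

```python
import numpy as np

NUM_HOUR = 24  # rows per day, as in the tutorial


def fill_by_hour_median(values):
    """Fill NaNs in the hourly series of one place (hypothetical helper)."""
    # One row per day, one column per hour of the day.
    by_day = np.asarray(values, dtype=float).reshape(-1, NUM_HOUR)
    # Median of each hour over all days, ignoring NaNs.
    hour_median = np.nanmedian(by_day, axis=0)
    # Hours with no data at all fall back to the median of the hourly medians.
    place_median = np.nanmedian(hour_median)
    hour_median = np.where(np.isnan(hour_median), place_median, hour_median)
    # Replace each NaN with the median of its hour.
    return np.where(np.isnan(by_day), hour_median, by_day).reshape(-1)


# Three synthetic days: the value at hour h is h, h + 2 and h + 4.
series = np.concatenate([np.arange(24.0) + offset for offset in (0.0, 2.0, 4.0)])
series[NUM_HOUR + 5] = np.nan  # remove hour 5 of the second day

filled = fill_by_hour_median(series)
print(filled[NUM_HOUR + 5])  # 7.0, the median of 5.0 and 9.0
```

Unlike fillmissingbyplacebyhour(), this sketch returns a new array and leaves its input unchanged.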

Training

The features and targets are now prepared. The model we use for training and prediction is linear regression. Because this is a quick tutorial, we do not split the raw training data into training and validation sets to tune the model parameters.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model = model.fit(trainfeature, traintarget)
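
If you do want a rough estimate of the error before submitting, a hold-out split is one option. The sketch below is an assumption-laden illustration: it uses synthetic data in place of trainfeature and traintarget, and it reports RMSE, which may not be the metric used by the competition.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for trainfeature / traintarget (10 nearby places).
rng = np.random.default_rng(0)
X = rng.normal(15.0, 5.0, size=(500, 10))
y = X.mean(axis=1) + rng.normal(0.0, 0.5, size=500)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

val_model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_val, val_model.predict(X_val)))
print("validation RMSE: %.3f" % rmse)
```

Note that a random split mixes hours of the same day across the two sets, so the estimate can be optimistic for hourly weather data; splitting by whole days would be a more conservative choice.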

Prediction and submission

The prediction results are the same as those submitted by the user named “University of Big Data”. You can save the prediction results to a file and submit that file.

def writetarget(targetdata, targetfilename):
    # write one predicted value per line
    with open(targetfilename, 'w') as targetfile:
        for value in targetdata:
            targetfile.write('%f\n' % value)

testpred = model.predict(testfeature)
temperaturetesttargetsamplefile = './Temperature_Test_Target_Sample.tsv'
writetarget(testpred, temperaturetesttargetsamplefile)
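
For reference, NumPy can produce the same one-value-per-line layout with numpy.savetxt. The sketch below uses made-up predictions and a temporary file, then reads the file back to show its contents.

```python
import os
import tempfile

import numpy as np

pred = np.array([12.5, 13.25, 11.0])  # made-up predictions for illustration

path = os.path.join(tempfile.mkdtemp(), "pred.tsv")
# A 1-D array is written one value per line; '%f' matches writetarget().
np.savetxt(path, pred, fmt="%f")

with open(path) as f:
    lines = f.read().splitlines()
print(lines)  # ['12.500000', '13.250000', '11.000000']
```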