### Introduction

In this competition, you are given weather information from several nearby places at a given timestamp and asked to predict the temperature of a target place at the same timestamp.

We provide several kinds of observation data: temperature, precipitation (in the last hour), sunshine duration (in the last hour), and location information. Using all of them may improve prediction performance, but as a baseline this tutorial uses only the temperature of the nearby places.

This tutorial uses Python and requires the Scikit-learn and NumPy packages.

### Load features and labels

First, we load the feature data and target data from the TSV data files.

    temperaturetrainfeaturefile = './Temperature_Train_Feature.tsv'
    temperaturetraintargetfile = './Temperature_Train_Target.tsv'
    temperaturetestfeaturefile = './Temperature_Test_Feature.tsv'

    trainfeature = readfeature(temperaturetrainfeaturefile, True)
    traintarget = readtarget(temperaturetraintargetfile, False)
    testfeature = readfeature(temperaturetestfeaturefile, True)

The functions used for loading the data are as follows.

    def readfeature(featurefile, has_header):
        # Load the raw rows, fill missing values, then keep only the
        # temperature columns of the nearby places (columns 3 .. NUM_PLACE+2).
        feature = read_tsv_data(featurefile, has_header)
        featurefilled = fillmissingbyplacebyhour(feature)
        totalnum = len(featurefilled)
        data = []
        for i in range(0, totalnum):
            data.append(featurefilled[i][3:NUM_PLACE+3])
        return data

    def readtarget(targetfilename, has_header):
        # Read one target temperature per line.
        with open(targetfilename) as targetfile:
            if has_header:
                targetfile.readline()
            data = []
            for line in targetfile:
                value = line.strip()
                data.append(float(value))
        return data

    def read_tsv_data(file_path, has_header):
        # Parse every tab-separated field as a float; "nan" becomes float NaN.
        with open(file_path) as f:
            if has_header:
                f.readline()
            data = []
            for line in f:
                values = line.strip().split("\t")
                rowdata = []
                for x in values:
                    rowdata.append(float(x))
                data.append(rowdata)
        return data
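To see how this parsing treats a missing value, the following minimal sketch runs the same per-field `float()` conversion on a small in-memory table. The helper name `parse_tsv`, the header names, and the toy values are made up for illustration and are not part of the competition data.

```python
import io
import math

def parse_tsv(handle, has_header):
    """Parse tab-separated floats from an open text handle (illustrative helper)."""
    if has_header:
        handle.readline()  # skip the header row
    rows = []
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        rows.append([float(x) for x in line.strip().split("\t")])
    return rows

# Hypothetical toy table: one header row, two data rows, two missing values.
toy_tsv = "a\tb\tc\n1.0\tnan\t3.5\n2.0\t4.0\tnan\n"
rows = parse_tsv(io.StringIO(toy_tsv), has_header=True)

# float("nan") yields a NaN, which math.isnan() can detect later.
missing = [[math.isnan(x) for x in row] for row in rows]
```

Python's built-in `float()` accepts the string `nan`, so no special-case parsing is needed; the missing entries simply become NaN values that have to be filled before training.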

### Fill the missing values

The data files contain missing values, marked as `nan`, so we need to fill them when we load the data. Missing values can be filled in various ways. In this tutorial, we fill a missing value for a place at a given hour of the day with the median of that place's values at the same hour on other days. If a place has no valid value at all for some hour, we fall back to the median of its hourly medians. The function that fills the missing values, `fillmissingbyplacebyhour()`, which is called by the above-mentioned function `readfeature()`, is as follows.

    import math
    from numpy import nanmedian

    NUM_PLACE = 10
    NUM_HOUR = 24

    def fillmissingbyplacebyhour(databyhour):
        databyhourfilled = databyhour  # same list object: rows are filled in place
        for j in range(3, NUM_PLACE + 3):  # for each place (column in data)
            valuebyhour = []
            medianbyhour = []
            # store the values of each hour for a place respectively, real value or NaN
            for i in range(0, NUM_HOUR):  # for each hour (row in data, 24 rows per day)
                valuebyhour.append([])
            hourid = 0
            for row in databyhour:  # for each hour (row in data, 24 rows per day)
                valuebyhour[hourid].append(row[j])
                hourid += 1
                if hourid % NUM_HOUR == 0:
                    hourid = 0
            # compute the median value of each hour (except NaN)
            for hourid in range(0, NUM_HOUR):
                median = nanmedian(valuebyhour[hourid])
                medianbyhour.append(median)
            # if an hour has no valid value, use the median of the hourly medians
            medianofplace = nanmedian(medianbyhour)
            for hourid in range(0, NUM_HOUR):
                if math.isnan(medianbyhour[hourid]):
                    medianbyhour[hourid] = medianofplace
            # assign the NaN with median value of that hour
            num = len(databyhour)
            for k in range(0, num):
                if math.isnan(databyhour[k][j]):
                    databyhourfilled[k][j] = medianbyhour[k % NUM_HOUR]
        return databyhourfilled
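The same idea can be written more compactly with NumPy for a single place. The sketch below is an illustration rather than the tutorial's implementation: the function name `fill_by_hour_median` is hypothetical, and it assumes the series starts at hour 0 and its length is an exact multiple of the number of hours per day. The toy example uses a 3-hour "day" to keep the numbers readable.

```python
import numpy as np

def fill_by_hour_median(values, num_hour=24):
    """Fill NaNs in one place's series with the median of the same hour across days.

    Assumes rows are consecutive hours starting at hour 0 and that
    len(values) is a multiple of num_hour.
    """
    values = np.asarray(values, dtype=float)
    by_day = values.reshape(-1, num_hour)        # shape: (days, hours)
    hour_median = np.nanmedian(by_day, axis=0)   # per-hour median, ignoring NaN
    # Hours with no valid value fall back to the median of the hourly medians.
    hour_median = np.where(np.isnan(hour_median),
                           np.nanmedian(hour_median),
                           hour_median)
    # Replace each NaN with the median of its hour (broadcast across days).
    filled = np.where(np.isnan(by_day), hour_median, by_day)
    return filled.reshape(-1)

# Toy series: 3 "days" of 3 "hours" each, with two missing values.
nan = float("nan")
toy = [1.0, 2.0, 3.0,
       nan, 4.0, 5.0,
       5.0, 6.0, nan]
filled_toy = fill_by_hour_median(toy, num_hour=3)
print(filled_toy.tolist())  # [1.0, 2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0, 4.0]
```

The missing value at hour 0 becomes 3.0, the median of 1.0 and 5.0, and the missing value at hour 2 becomes 4.0, the median of 3.0 and 5.0.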

### Training

The features and targets are now prepared. The model we use for training and prediction is linear regression. Because this is a quick tutorial, we do not split the raw training data into training and validation sets to tune the model parameters.

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model = model.fit(trainfeature, traintarget)
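If you later want an offline estimate of prediction error before submitting, one option is a simple hold-out split. The sketch below is only an illustration under stated assumptions: it uses synthetic data in place of `trainfeature` and `traintarget`, an 80/20 split, and RMSE as the error measure, none of which is prescribed by the competition.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 200 timestamps, 10 nearby places.
rng = np.random.default_rng(0)
X = rng.normal(loc=15.0, scale=5.0, size=(200, 10))
y = X.mean(axis=1) + rng.normal(scale=0.5, size=200)  # target close to the neighbours' mean

# Hold out 20% of the rows for validation (assumed ratio).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

val_model = LinearRegression().fit(X_train, y_train)
val_pred = val_model.predict(X_val)

# Root mean squared error on the held-out rows.
rmse = float(np.sqrt(mean_squared_error(y_val, val_pred)))
print("validation RMSE: %.3f" % rmse)
```

With the real data you would pass `trainfeature` and `traintarget` instead of `X` and `y`. Because consecutive rows are consecutive hours, a random split may be optimistic; splitting by day is a reasonable alternative.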

### Prediction and submission

The prediction results are the same as those submitted by the user named “University of Big Data”. You can save the prediction results to a file and submit that file.

    def writetarget(targetdata, targetfilename):
        # Write one predicted value per line.
        with open(targetfilename, 'w') as targetfile:
            for value in targetdata:
                targetfile.write('%f\n' % value)

    testpred = model.predict(testfeature)
    temperaturetesttargetsamplefile = './Temperature_Test_Target_Sample.tsv'
    writetarget(testpred, temperaturetesttargetsamplefile)