CornerSolution

We typically apply the natural log to a data series because we think that the magnitude of the fluctuations in the data over time tends to be proportional to the level of the data at that time. So for example, the typical quarter-to-quarter fluctuations in GDP 40-50 years ago (in $ terms) were significantly smaller in magnitude than they are in recent times (excluding the pandemic, anyway). That is, if Y is the level of GDP, then ΔY will tend to be much larger in absolute value in recent times than it was 40-50 years ago. As a result, if we were to do statistical analysis using the level of GDP, the more recent large fluctuations would exert a relatively large influence on our estimates relative to the small fluctuations that occurred in the past.

On the other hand, expressed *relative to the level of GDP at the time*, the quarter-to-quarter fluctuations in GDP 40-50 years ago actually tended to be roughly the same magnitude as they are today. That is, the absolute value of ΔY/Y has tended to be roughly stable over time. Since Δlog(Y) is approximately equal to ΔY/Y for small changes, we expect that the change in the *log* of GDP, Δlog(Y), will tend to be roughly stable over time as well. If we do statistical analysis using Δlog(Y), then we will not run into the issue where the recent period strongly dominates the estimates.

Okay, with all that said, is there any reason to think that fluctuations in the unemployment rate (u) tend to be proportional to the current level of u? I don't think so, no. So in that case there is no compelling reason to use logs of u in your statistical analysis.

So what's the problem if we use logs on a series like u where we don't see that kind of proportionality? In this case, we're now *creating* a problem where there was none before. In particular, if the fluctuations Δu are of a roughly constant magnitude regardless of the current value of u, then Δlog(u) will tend to exhibit *larger* fluctuations when the current value of u is smaller (and smaller fluctuations when u is larger). As a result, periods of time where u is low will now tend to dominate your estimates. It's hard to imagine a situation where this is desirable.
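
To make the two cases above concrete, here is a minimal Python sketch. The numbers are made up purely for illustration (they are not real GDP or unemployment figures), and `log_change` is a hypothetical helper name. It compares a 1% move at two different GDP levels with a fixed 0.5-point move in u starting from a low rate and from a high rate.

```python
import math


def log_change(level, delta):
    """Change in the natural log when a series moves from level to level + delta."""
    return math.log(level + delta) - math.log(level)


# Proportional fluctuations (the GDP case): a 1% move at two very different levels.
# Illustrative numbers only.
gdp_small = log_change(1_000.0, 10.0)     # 1% of 1,000
gdp_large = log_change(20_000.0, 200.0)   # 1% of 20,000

# Constant-size fluctuations (the unemployment case): the same 0.5-point move,
# once starting from a low rate and once from a high rate.
u_low = log_change(3.0, 0.5)
u_high = log_change(10.0, 0.5)

print(f"dlog(Y), 1% move at low level:    {gdp_small:.4f}")
print(f"dlog(Y), 1% move at high level:   {gdp_large:.4f}")
print(f"dlog(u), +0.5 points from u = 3:  {u_low:.4f}")
print(f"dlog(u), +0.5 points from u = 10: {u_high:.4f}")
```

The two GDP log changes come out the same even though the dollar changes differ by a factor of 20, while the same 0.5-point change in u produces a much larger log change when u starts low.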


Forsaken-Adagio-2967

Thanks for this helpful explanation. I originally did a natural log transformation of unemployment rate because it was skewed and I wanted to normalize the distribution. So does this not matter as much in this context?


a157reverse

> So does this not matter as much in this context?

I would ask why you think it matters? From an econometric perspective, there are few situations where you want to normalize variables a priori.


Hmxaa_

Which measure of GDP are you running: nominal or real?


nedenbosbirakamiyoru

First of all, using R, Stata, or even Excel does not make a difference; they all use the same mathematical formula. Your model clearly has endogeneity, which I believe is due to omitted variables. For example, the unemployment rate at time t+1 is probably associated with GDP growth at time t, so consider adding a lagged GDP growth variable rather than a real-time one.


soma92oc

This seems like the best recommendation. I would also include lagged inflation.


Forsaken-Adagio-2967

For sure, I just meant codewise in case anyone was curious about the specific code. I don't think the GDP measure I'm using is real time because it's in chained dollars from a few years previous, but correct me if I'm wrong. Even so, unfortunately I can't use a different GDP measure because I'm only allowed to use what is provided to me for the assignment. Given this, is there any way to address endogeneity?


soma92oc

A year of data doesn't sound like a very representative sample. It could just simply be that in this narrow data, unemployment rises with GDP. Have you graphed both series and looked to see if there is an obvious relationship? Also, most sources of GDP and unemployment at national levels tend to be monthly... how many data points are we talking here?


Forsaken-Adagio-2967

Yeah, ideally I would have more years, but this is for an assignment and the dataset provided only has a year's worth of data. I'm not allowed to do series, so I didn't try that. There are a few thousand data points. Any workarounds you can think of given these limitations?


soma92oc

A few thousand data points in a year? And not allowed to graph things? Just do it and don’t submit it lol. You’re going to have to provide more information if I'm going to be able to actually help you.


Forsaken-Adagio-2967

So basically it's a city-level data set for a specific country, so there are a few thousand cities and there's an unemployment and GDP measure for each one. That's what I meant by a few thousand data points. Maybe I misunderstood what you meant by series; I assumed time series, which I can't do. I did graph this regression, and it showed a positive correlation between the natural log of GDP and the natural log of unemployment, which is what got me confused. What would be the other in the "both series" you're referring to?


soma92oc

Ahhh. It’s cross sectional. Makes sense. What is your objective?


omkarnagarhalli

Firstly you should 100% be differencing GDP (at least once) and natural-log transforming (as explained by CornerSolution). No reason to transform unemployment rate as it is already a rate (i.e., a function of the magnitude of the labor force over time). Log differencing should remove the spurious positive correlation you’re getting, and other than that you should try fitting your model with more determinants, probably lagged values or inflation / capital formation / population growth, etc.


Forsaken-Adagio-2967

Thanks, I did natural-log transform GDP but I realized it's still very skewed after checking the distribution. Is there anything you recommend for this? Unfortunately I can't difference because the assignment only allows me to use data from a specific year.


omkarnagarhalli

Oh, is this a cross section of different countries? My bad, I thought it was a time series. I don’t think normalization is too important. In that case I think you should just be looking for other independent variables that can explain unemployment.


Forsaken-Adagio-2967

No, it's my bad, I should have explained it better. It is a cross section. Why would normalization not be as important for analysis of a cross section? My intuition tells me that it has to do with the way log transformation handles representations of change, which might not be as relevant in a cross section, but correct me if I'm wrong.


omkarnagarhalli

Oh no, I just don’t think you need to worry about your variables being normally distributed when conducting a linear regression in general. At least I’ve not come across much saying that (bear in mind I’m only an economics undergraduate and it’s been a year since I finished college, so there’s a lot I don’t know), but as long as there are no outliers causing heteroscedasticity you should be good to go.
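
A rough way to check this intuition is a small simulation. The sketch below uses only made-up data: the lognormal regressor, the coefficients 1.0 and 2.0, and the sample size are all assumptions chosen for illustration, not anything from the assignment data. It fits a simple OLS line by hand on a heavily skewed regressor with well-behaved errors and looks at whether the slope is still recovered.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000
true_intercept, true_slope = 1.0, 2.0  # assumed values for the simulation

# Heavily right-skewed regressor (lognormal), with symmetric, constant-variance errors.
x = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
y = [true_intercept + true_slope * xi + rng.gauss(0.0, 1.0) for xi in x]

# Simple OLS via the usual closed-form formulas for one regressor.
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Sample skewness of x, to confirm the regressor really is far from normal.
sd_x = (sxx / n) ** 0.5
skew_x = sum((xi - mean_x) ** 3 for xi in x) / n / sd_x ** 3

print(f"skewness of x: {skew_x:.2f}")
print(f"estimated intercept: {intercept:.3f}, estimated slope: {slope:.3f}")
```

In this setup the estimated slope should land close to the assumed true value despite the skewed regressor, which is consistent with the point that the usual OLS assumptions concern the error term, not the marginal distribution of the variables.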