Image Missing

Here I want to show you how to visualize municipal budgets in Germany with the help of Offener Haushalt.

This is the final result of the script above, imported to Excel. As you can see, there are only minor errors in the last row. The sum-line is still available, and we can use it to check the OCR for plausibility. It would be great to have the files right away instead of going through so much work. As mentioned already, I also did it because I wanted to use my new Python skills on something, but not everybody wants to dive into this topic like me.

After fixing the errors by hand and also doing a spellchecking for the second row, the description of the spending, the file is basically ready to get transformed into the right format to get uploaded to open spending, an international equivalent of Offener Haushalt, and the place where Offener Haushalt gets its data from.

The last work is to change the table in the long-format. That’s also done by hand but doesn’t take longer than 15min. I have a couple of slides that show the outcome.

Just before Christmas, I read an article on about the visualization of public budget data, such as state’s and municipality’s budgets. The service is provided by Offener Haushalt a project by the Open Knowledge Foundation Germany, which – among other things – is lobbying for more transparency in public institutions. Other interesting projects that get supported by them are, a platform that provides help in issuing questions based on German laws to Ministries or other public institutions. I used to ask my former municipality Hemsbach, where I grew up, for their budget data between 2013 and 2018. If you’re German speaking or trust Google’s translation, can you find the inquiry here.

In the following text I’ll explain how the process of getting the information actually worked and how I uploaded it to Offener Haushalt.

If you clicked on the inquiry before, you already get an idea of the procedure. For me it was the second time I used FragDenStaat, but the first time I informed myself a bit about my rights in the inquiry. FragDenStaat makes it very easy for everyone to ask the public authority by asking the user for simple information like to who it’s directed and what the topic is. In my case it was Stadt Hemsbach. The authority is obliged to answer within a certain time frame. As far as I know, is this timeframe different for the different authorities (town, state or national level), but FragDenStaat knows that better.

The only actual work for me was to write a letter with the specific questions that I have, the referring rules and laws get automatically appended to the email and make it look much more important. So, my letter was:

As you can see above, I got a file with the budget for the years 2013 – 2018. Some discussion was necessary and they were a bit reluctant to give me the whole budget and only gave me a scan of the short version. It might be interesting to know, that most municipalities have their budged on their homepage as download. That’s transparency, even though it’s mostly a PDF, it’s immediately accessible. A short conclusion of my case:

They answered after I had to complain about not answering within the legal deadline, FragDenStaat triggers that automatically.

Their reply was that there is a pdf file for the budget, it has apparently 266 pages, a typical size for the budget of one year. Even though a pdf of that size with only text wouldn’t be too big to send, they didn’t do that right away. Why? I don’t know. The fact that they wouldn’t send me an Excel or CSV-File as asked for, was already a bit annoying, but as far as I’m informed, are they not obliged to give me a certain format, only the content matters. Sending me a scanned version of their summary is definitely not forthcoming. Anyway, I was happy to get the file about 5 weeks after I asked for them. To be fair, it was just before Christmas when I asked first.

The PDF can be reviewed here.

I didn’t want to give up now and googled for ways to convert a PDF to a CSV-File. At least it was a clean table, no unnecessary information or changing formats. As I’m currently studying Data Research with Python on, I was interested in testing my skills in the real world.

"",,Einnahmen,Ausgaben,Verpfl .-Erm.,Einnahmen,Ausgaben,Einnahmen Ausgaben
Nr.,Bezeichnung,2013,2013,2013,2012,2012,2011 2011
1,2,3,4,5,6,7,8 9
0010,Gemeindeorgane,61.400,443.371,0,65.450,400.126,59.496 400.553
0200,Hauptverwaltung,31 .500,171 .287,0,29.050,156.823,26.850 176.135
0240,�ffenthchkeitsarbe1t,0,50.583,0,0,49.764,580 46.622
0300,Finanzverwaltung,136.600,251 .320,0,136.300,215.484,129.566 285.592
0310,Stadtkasse,14.700,60.832,0,13.900,95.818,13.776 97.670
0500,Standesamt,10.000,72.661,0,12.000,70.015,10.631 78.910
0520,Wahlen,10.000,35.200,0,0,500,8.048 40.565

In case you want to do it yourself, maybe your municipality has a similar policy, I have some simple steps here, which worked for me.

I used an OCR reader to read the scanned document and transfer every word or number to an actual character that is mark able and digitally readable. OCR stands for Optical Character Recognition. There is freeware as well as expensive commercial software on the market. The commercial software is usually better, but depending on the quality of your scan, the freeware might do the job as well.

A lot of publications are in PDF, it’s a nice format with a consistent formatting. Unfortunately, it’s not a data format and can’t get read by data analysis programs or databases. (That’s what I know at least).

Tabula is an open source project that does just that. The developers use the nice phrase “liberating data tables trapped inside PDF files”. I didn’t use the tabula Windows applications, but the Python version: tabula-py and will shortly explain how I used it in this project. I don’t know if the JAVA-based version which runs in the browser has the same functions.

I used a tutorial by Aki Ariga as red thread.

I use Anaconda 3 as Python compiler. The conda-shell command is:

conda install -c auto tabula-py

DONE! Extremely easy! The code for pip is:

pip install tabula-py

The use is just as simple. Here you can also get some more information from the linked tutorial above. In my case I imported tabula with

from tabula import convert_into

There are other options, but I wanted to have the file first to edit it afterwards. My PDF was 60 pages long, with all years in one file. I wanted to have the years in separate files in case there are differences, which there are! To export the files I used this script:

i = 1
export_file = "haushalt" + str(i) + ".csv"
convert_into("haushalt.pdf", "export_file", output_format="csv", pages="1-10")
for i in range(1,6):
  export_file = "haushalt" + str(i) + ".csv"
  page_range = str(i * 10 + 1) + "-" + str(i * 10 + 10)
  convert_into("haushalt.pdf", export_file, output_format="csv", pages=page_range)

As my OCR as well as tabula didn’t give me the cleanest files and I had to get rid of some common errors throughout all the files, I wrote a script for that as well, this one is longer. You can download or contribute on GitHub. What it essentially does, is to check for the header of each table, which is a text and therefore only important once and for the combination “1,2,3,4…” which is also in the header and not relevant.

A bit trickier is the function luckenschliessen, which looks at the last row of the table, where tabula-py interpreted the last two rows of the table as one (I can only assume that the distance was too small). My function splits this element in two. Works only sometimes, but I would say most of the times. The biggest problem is when there is more than one space character. Then I call the function more-spaces, which is not really working so far, but it was easier to fix the 5-6 errors per file by hand than rewriting the script. If you know what the problem is, please tell me!

# open the relevant files
# !!!!! CHANGE FILE NAME !!!!!!
f = open("haushalt6_clean.csv", "w+")
t = open("haushalt6.csv", "r")

# Creates one line of text
def textbau(l):
    l = l.strip("\"")               # delete "-symbols
    l = l.split(",")                # create list l from ","
    s = l[-1].split(" ")            # create sublist s from last list l item
    if len(s) == 2:                 # if sublist s has 2 items
        more_spaces(s, l)               # start function more_spaces(s) if more items than 2
    l[-1] = s[0]
    #print("Davor ", l)
    l = luckenschliesser(l)
    #print("Fertig ", l)
    if l[0] == "":
        del l[0]
    tmp = ""
    for item in l:
        tmp = tmp + "," + item
    return tmp

# Deletes spaces in cells
def luckenschliesser(zellen):
    zellen = zellen
    ind = 0
    for zelle in zellen:
        if zelle[:3].isalpha():
            z_new = ""
            for z in zelle:
                if z == " ":
                    z_new = z_new + z
            zellen[ind] = z_new
        ind += 1
    l = zellen
    #print("Inzwischen ", l)
    return l

# In case there are more spaces in the last row
def more_spaces(s, l):
    count = len(s)
    l[-1] = s[0]
    for index in range(1, count):       # ERROR! Cullumn gets copied twice for some reason
        l.append(s[index])              # or appended twice...
    return l

text = ""
for l in t:
    if l[:2] == "\"\"":
        if l[:8] == "\"\",Summe":
            text = text + textbau(l)
    elif l[:2] == "Nr":
    elif l[:3] == "1,2":
        text = text + textbau(l)

# write in "haushalt1_clean.csv"

# close files

After changing everything into the “Long-Format”, I had to change the Amount (Betrag) to text, so there are no decimal points or such, makes it easier to adjust later and prevents errors while uploading to Open Spending.

The following part, how to upload it to Offener Haushalt, is well explained in a couple of youtube videos here. My final CSV-File can be viewed, analysed and downloaded on Open Spending and Offener Haushalt.