Python: Download ZIP and parse URL from text file

Hey there. It’s been a while. I’m back on Home Assistant, and I’ve been trying some stuff.

Along the way, I tested the Swiss Meteo integration and it was not quite working, so I looked into how I could fix it or write one myself. I’m nowhere near done, but I thought the script below could help someone.

To get the 10-minute updates from Swiss Meteo, I need to download a ZIP file. Inside that ZIP file is a text file, which contains the URL of the CSV holding the 10-minute update data.

Ok… Bear with me:

Download a ZIP file
Uncompress the text file we need
Parse the URL to the CSV file

Let’s start by importing the required modules:

from io import BytesIO
from zipfile import ZipFile
import requests
import re
import sys

BytesIO lets us load a binary file in memory rather than writing it to disk
zipfile has an obvious name… It handles…. Yeeaaahhh, ZIP files…
requests lets us make an HTTPS request, hence download the ZIP file
re lets us use a regex to extract the URL from the text file
and sys to handle exit codes, even if this example only ever exits with 0.
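
To see what BytesIO buys us, here is a minimal sketch that builds a tiny ZIP archive entirely in memory and reads it back without touching the disk. The member name hello.txt and its content are made up for the demo; in the real script the bytes would come from the download instead.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a small ZIP archive in memory (hypothetical member name and content).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hello from memory\n")

# These raw bytes stand in for what a download would hand us.
zip_bytes = buffer.getvalue()

# Wrap the bytes in BytesIO again so ZipFile can treat them like a file.
with ZipFile(BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
    text = archive.read("hello.txt").decode("utf-8")

print(names)
print(text)
```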

Next, we define the variables. Quite self-explanatory:

url = "https://whateverurlyouneed"
# Define string to be found in the file name to be extracted
filestr = "anystring"
# Define string to be found in URL to parse
urlstr = "anyotherstring"
# Define regex to extract URL
regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

Now, let’s download the ZIP file and put the response in a variable called “content”:

content = requests.get(url)

We then open the file in memory:

zipfile = ZipFile(BytesIO(content.content))

From that archive, we open the first file whose name contains the filestr value, and we keep the opened file in the variable called data:

data = [zipfile.open(file_name) for file_name in zipfile.namelist() if filestr in file_name][0]
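
One thing to keep in mind: the [0] above raises an IndexError when no file name matches. A possible variant, sketched here under the assumption that you would rather get None back than an exception, uses next() with a default. The helper name find_member and the demo archive are made up for illustration.

```python
from io import BytesIO
from zipfile import ZipFile


def find_member(archive, filestr):
    """Return the first member name containing filestr, or None if none match."""
    return next((name for name in archive.namelist() if filestr in name), None)


# Demo archive built in memory (hypothetical file names and content).
buffer = BytesIO()
with ZipFile(buffer, "w") as demo:
    demo.writestr("readme.txt", "nothing here\n")
    demo.writestr("data_links.txt", "see https://example.com/data.csv\n")

with ZipFile(BytesIO(buffer.getvalue())) as archive:
    found = find_member(archive, "links")
    missing = find_member(archive, "nope")

print(found)
print(missing)
```

With a None result you can print a friendly message and exit with a non-zero code instead of crashing on a traceback.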

Finally, we read the file line by line and stop at the first line containing the urlstr value. From that line, we extract every URL matching the regex stored in the regularex variable and print them.

for line in (line for line in data.readlines() if urlstr in line.decode("latin-1")):
    urls = re.findall(regularex,line.decode("latin-1"))
    print([url[0] for url in urls])
    break
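
In case the url[0] looks odd: when a pattern contains several capture groups, re.findall returns one tuple per match, with one item per group, and the first group here is the whole URL. The sketch below shows that behaviour with a deliberately simplified stand-in pattern and a made-up sample line, not the full regex from above.

```python
import re

# Simplified stand-in pattern with two capture groups:
# group 1 is the whole URL, group 2 is just the host name.
pattern = r"(https?://([^\s/]+)\S*)"

# Made-up sample line, as bytes, like the lines read from the ZIP member.
line = b"latest: https://example.com/data/latest.csv (10 min)\n"

matches = re.findall(pattern, line.decode("latin-1"))

# Each match is a tuple; index 0 is the first group, i.e. the whole URL.
full_urls = [m[0] for m in matches]

print(matches)
print(full_urls)
```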

And we exit the script with exit code 0.

sys.exit(0)

So the full script looks like:

#!/usr/bin/env python3
from io import BytesIO
from zipfile import ZipFile
import requests
import re
import sys
# define url value
url = "https://whateverurlyouneed"
# Define string to be found in the file name to be extracted
filestr = "anystring"
# Define string to be found in URL
urlstr = "anyotherstring"
# Define regex to extract URL
regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
# download zip file
content = requests.get(url)
# Open stream
zipfile = ZipFile(BytesIO(content.content))
# Open first file from the ZIP archive containing 
# the filestr string in the name
data = [zipfile.open(file_name) for file_name in zipfile.namelist() if filestr in file_name][0]
# Read lines from the file. On the first line containing urlstr,
# print the URLs found on that line and stop
for line in (line for line in data.readlines() if urlstr in line.decode("latin-1")):
    urls = re.findall(regularex,line.decode("latin-1"))
    print([url[0] for url in urls])
    break
sys.exit(0)
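
If you want to reuse this from other code (an integration, for instance), one option is to wrap the unzip-and-parse part in a function that takes the downloaded bytes. This is only a sketch under a few assumptions: the function name first_url_in_zip is made up, the URL pattern is a much simpler stand-in for the regex above, and it returns the first URL that itself contains urlstr rather than printing every URL on the first matching line. The demo runs on an archive built in memory, so no download is needed to try it.

```python
import re
from io import BytesIO
from zipfile import ZipFile

# Simplified URL pattern for the demo (an assumption, not the regex above).
URL_PATTERN = re.compile(r"https?://[^\s()<>]+")


def first_url_in_zip(zip_bytes, filestr, urlstr, encoding="latin-1"):
    """Return the first URL containing urlstr, read from the first member
    whose name contains filestr. Return None if nothing matches."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        name = next((n for n in archive.namelist() if filestr in n), None)
        if name is None:
            return None
        with archive.open(name) as member:
            # Iterating over the opened member yields one bytes line at a time.
            for raw_line in member:
                for url in URL_PATTERN.findall(raw_line.decode(encoding)):
                    if urlstr in url:
                        return url
    return None


# Demo archive built in memory (hypothetical file name and URLs).
buffer = BytesIO()
with ZipFile(buffer, "w") as demo:
    demo.writestr(
        "info.txt",
        "docs: https://example.com/help\ndata: https://example.com/now.csv\n",
    )

result = first_url_in_zip(buffer.getvalue(), "info", ".csv")
print(result)
```

In the real script, the first argument would be content.content from the requests call.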