Karn Wong's Blog
https://blog.karnwong.me/

Moved!
https://blog.karnwong.me/moved/ (Sat, 28 Aug 2021)

Moved to https://karnwong.me/posts

Python venv management
https://blog.karnwong.me/python-venv-management/ (Fri, 02 Jul 2021)

When you create a project in python, you should create requirements.txt to specify dependencies, so other people can have the same environment when using your project.

However, if you don't specify module versions in requirements.txt, people could end up using the wrong module versions, where some APIs may be deprecated or behave differently from older versions.

Another issue is that you might be working on a few python projects, each using a different python version (e.g. projectA uses python3.6, projectB uses python3.9, etc.).

Enter pyenv and pipenv (I will discuss poetry later): they let you easily switch python versions and keep a separate environment (with python version locking) for each project you're working on.

Installing pyenv

Follow instructions here. For windows, use this.

Useful commands

# list available python versions
pyenv install --list

# install specific version
pyenv install 3.8.0

# list installed versions
pyenv versions

# activate new env
pyenv shell 3.8.0 # supports multiple versions

# create a virtualenv (requires the pyenv-virtualenv plugin)
pyenv virtualenv 3.8.0 my-data-project

# set env per folder/project
pyenv local my-data-project

Installing pipenv

Note: make sure pyenv is installed, and remove anaconda / miniconda & any python3 installed via the official installer from your system. Then run:

$ pip install pipenv

# run this command every time pip installs a .exe
$ pyenv rehash

pipenv workflow

pipenv --python 3.7

# install a specific module
pipenv install jupyterlab==2.2.9

# install from existing requirements.txt or from Pipfile definition
pipenv install

# remove venv
pipenv --rm

# running inside venv
pipenv run jupyter lab
pipenv run python main.py # is equivalent to `pipenv shell && python3 main.py`

Windows only

$ pyenv install 3.7.7 # see Pipfile for required python version
$ pyenv local 3.7.7 # IMPORTANT. global / shell doesn't work with pipenv
$ pyenv rehash
$ pip install pipenv # done once per pyenv python version
$ pyenv rehash
$ pipenv --python 3.7
$ pipenv install
$ pipenv run python tokenization_sandbox.py

Notes

  • On linux/mac, do not use the system python. An OS update can mean a python version upgrade, which in turn wipes out all your installed modules. Use a python installed via pyenv instead.
  • On windows, start fresh with pyenv.
  • Do not use the anaconda distribution. It does too much background magic, which can make it harder to manage environments properly. In addition, venv definitions from anaconda often don't work cross-platform (e.g. a venv definition from windows won't work on mac due to different wheel binary versions).
  • Always create a venv via pipenv for each project. You can still keep a playground venv via pyenv, so you can shell into it for quick analysis / scripting on an ad-hoc basis.
  • I've heard good things about poetry, but it doesn't integrate with pyenv natively. It works well if you use it to publish python modules, since it simplifies a lot of the process.
    • poetry also picks up the wrong python version from pyenv. And if you sync the python version via pyenv, it has to be the same version across all OSes, including the minor version. pipenv doesn't have this restriction, and it picks up the correct python version from pyenv by default (via pipenv --python 3.8).

Don't write large table to postgres with pandas
https://blog.karnwong.me/dont-write-to-warehouse-with-pandas/ (Sun, 27 Jun 2021)

We have a few tables where the data size is > 3GB (in parquet, so around 10 GB uncompressed). Loading it into postgres takes an hour. (Most of our tables are pretty small, hence the reason we don't use a columnar database.)

I wanted to explore whether there's a faster way. The conclusion is that writing to postgres with spark seems to be the fastest, given that we can't use COPY, since our data contains free text, which makes CSV parsing impossible.
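
For reference, here's a minimal sketch of what the spark-to-postgres write can look like via JDBC. The post doesn't include code, so the paths, credentials, and table name below are placeholders, and the postgres JDBC driver is assumed to be on spark's classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-postgres").getOrCreate()

df = spark.read.parquet("s3a://datalake/my_table/")  # placeholder source path

(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/warehouse")  # placeholder host/db
    .option("dbtable", "public.my_table")
    .option("user", "postgres")  # placeholder credentials
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save()
)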

I also found out that the write performance from pandas to postgres is excruciatingly slow because:

  • It first decompresses the data in-memory. For a 30MB parquet file (around 100MB uncompressed) it used more than 20GB of RAM (I killed the task before it finished, since by that point the RAM usage was still climbing).
  • Even when reading plain JSON lines in pandas with chunksize and using to_sql with the multi option, it's still very slow.

In contrast, writing that same 30MB parquet file to postgres via spark takes only 1 minute.

Big data is fun, said data scientists 🧪 (until they run out of RAM 😆)

Data engineering toolset (that I use) glossary
https://blog.karnwong.me/data-engineering-toolset-glossary/ (Fri, 04 Jun 2021)

Big data

  • Spark: Map-reduce framework for dealing with big data, especially for data that doesn't fit into memory. Utilizes parallelization.

Cloud

  • AWS: Cloud platform for many tools used in software engineering.
  • AWS Fargate: A task launch mode for ECS task, where it automatically shuts down once a container exits. With EC2 launch mode, you'll have to turn off the machine yourself.
  • AWS Lambda: Serverless function, can be used with docker image too. Can also hook this with API gateway to make it act as API endpoint.
  • AWS RDS: Managed databases from AWS.
  • ECS Task: Cron-like schedule for a task. Essentially, at a specified time, it runs a predefined docker image (you should configure your entrypoint.sh accordingly).

Data

  • Parquet: Columnar data blob format, very efficient due to column-based compression with schema definition baked in.

Data engineering

  • Dagster: Task orchestration framework with built-in pipeline validation.
  • ETL: Stands for extract-transform-load. Essentially it means "moving data from A to B, with optional data wrangling in the middle."

Data science

  • NLP: Using machine (computer) to work on human languages. For instance, analyze whether a message is positive or negative.

Data wrangling

  • Pandas: Dataframe wrangler, think of programmable Excel.

Database

  • Postgres: RDBMS with good performance.

DataOps

  • Great expectations: A framework for data validation.

DevOps

  • Docker: Virtualization via containers.
  • Git: Version control.
  • Kubernetes: Container orchestration system.
  • Terraform: Infrastructure-as-code tool; essentially you use it to store a blueprint of your infra setup. If you were to move to another account, you could re-conjure the existing infra with one command. This makes editing infra config easier too, since it cleans up / updates config automatically.

GIS

  • PostGIS: GIS extension for Postgres.

MLOps

  • MLflow: A framework to track model parameters and output. Can also store model artifact as well.

Notebook

  • Jupyter: Python notebook, used for exploring solutions before converting them to .py.

Automatic scrapy deployment with GitHub actions
https://blog.karnwong.me/automatic-scrapy-deployment-with-github-actions/ (Wed, 02 Jun 2021)

Repo here

Scrapy is a nice framework for web scraping. But as with most local development setups, some settings / configs are disabled during development.

This wouldn't pose an issue, but to deploy a scrapy project to zyte (a hosted scrapy platform) you need to run shub deploy, and if you run it and forget to reset the config back to prod settings, a Titan may devour your home.

You can set up auto deployment via the UI in zyte, but it works with github only. Plus, if you want to run some extra tests during CI/CD, you're out of luck. So here's how to set up CI/CD to deploy automatically:

Note: I would assume that you have your scrapy project set up already.

Create scrapinghub.yml + add repo secrets

project: ${PROJECT_ID}

requirements:
  file: requirements.txt

stack: scrapy:${YOUR_SCRAPY_VERSION_IN_PIPFILE}
apikey: null

Notice that apikey is left blank. This is because it's considered good practice not to check sensitive information & credentials into version control. Instead, apikey will be added to github secrets, so it can be read as an environment variable.

Create github workflow file

name: Deploy

on:
  push:
    branches: [ master, main ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.9
      uses: actions/setup-python@v2
      with:
        python-version: 3.9
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pyyaml shub
    - name: Deploy to zyte
      if: github.ref == 'refs/heads/master'
      run: python3 utils/edit_deploy_config.py && shub deploy
      env:
        APIKEY: ${{ secrets.APIKEY }}

Translation:

  • On push to this repo (this doesn't work for PRs)
  • Download this repo
  • Setup python3.9
  • Install some pip modules
  • Run a script to overwrite scrapinghub.yml's apikey value, where the value is obtained from github secrets (a sketch of such a script is shown below)
  • Execute deploy command
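
The script itself isn't included in the post. Here is a minimal sketch of what utils/edit_deploy_config.py could look like, assuming it simply injects the APIKEY environment variable into scrapinghub.yml (pyyaml is already installed by the workflow):

import os
import yaml

CONFIG_PATH = "scrapinghub.yml"

def main():
    # load the checked-in config, where apikey is null
    with open(CONFIG_PATH) as f:
        config = yaml.safe_load(f)
    # inject the api key passed in from github secrets
    config["apikey"] = os.environ["APIKEY"]
    with open(CONFIG_PATH, "w") as f:
        yaml.safe_dump(config, f)

if __name__ == "__main__":
    main()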

Elasticsearch with custom dictionary
https://blog.karnwong.me/elasticsearch-with-custom-dictionary/ (Mon, 03 May 2021)

Elasticsearch is a search engine with built-in analyzers (combinations of a tokenizer and filters), which makes it easy to set up and get running, since you don't have to implement NLP logic from scratch. However, for some languages such as Thai, the built-in Thai analyzer may not work quite as expected.

For instance, for region name search autocomplete, it doesn't recommend anything when I type เชียง, even though it should be showing เชียงใหม่ or เชียงราย. This is because each of these two region names is recognized as a single token, so a query for the prefix เชียง matches nothing.

But if I create a custom dictionary for tokenizers with เชียง as one of the tokens, it manages to recommend the two regions when querying with the prefix.

Below is an index_config for using a custom dictionary with the tokenizer:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "thai_dictionary": {
                    "tokenizer": "standard",
                    "filter": [
                        "char_limit_dictionary"
                    ]
                }
            },
            "filter": {
                "char_limit_dictionary": {
                    "type": "dictionary_decompounder",
                    "word_list": tokens, # <-- word list array here
                    "max_subword_size": 22
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": { # <-- search key
                "type": "text",
                "analyzer": "thai_dictionary"
            }
        }
    }
}

See elasticsearch documentation for more details: https://www.elastic.co/guide/en/elasticsearch/reference/7.12/analysis-dict-decomp-tokenfilter.html
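
As a rough sketch (not from the original post), the index can be created with the elasticsearch Python client (7.x-style body argument); the host, index name, and word list below are placeholders:

from elasticsearch import Elasticsearch

tokens = ["เชียง", "ใหม่", "ราย"]  # placeholder custom word list

index_config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "thai_dictionary": {
                    "tokenizer": "standard",
                    "filter": ["char_limit_dictionary"],
                }
            },
            "filter": {
                "char_limit_dictionary": {
                    "type": "dictionary_decompounder",
                    "word_list": tokens,
                    "max_subword_size": 22,
                }
            },
        }
    },
    "mappings": {
        "properties": {"title": {"type": "text", "analyzer": "thai_dictionary"}}
    },
}

es = Elasticsearch("http://localhost:9200")  # placeholder host
es.indices.create(index="regions", body=index_config)  # placeholder index name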

Shapefile to data lake
https://blog.karnwong.me/shapefile-to-data-lake/ (Fri, 23 Apr 2021)

Background: we use spark to read/write to our data lake. For spatial data & analysis, we use sedona. Shapefiles are converted to TSV, then read by spark for further processing & archival.

Recently I had to archive shapefiles in our data lake. It wasn't rosy for the following reasons:

Invalid geometries

Sedona (and geopandas too) whines if it encounters an invalid geometry during geometry casting. Invalid geometries can arise for many reasons, one of them being unclean polygon clipping.

Solution: use gdal to filter out invalid geometries.
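
A sketch of one way to do this filtering with gdal's ogr2ogr, driven from python. This assumes GDAL is built with SpatiaLite support (for ST_IsValid) and that the shapefile's layer is named regions; the paths, layer name, and geometry column name may need adjusting:

import subprocess

subprocess.run(
    [
        "ogr2ogr",
        "-f", "ESRI Shapefile",
        "valid.shp",     # placeholder output
        "regions.shp",   # placeholder input
        "-dialect", "SQLite",
        "-sql", "SELECT * FROM regions WHERE ST_IsValid(geometry)",
    ],
    check=True,
)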

Spatial projection

Geometries require a projection, otherwise you could end up on the wrong side of the globe. This matters because the default worldwide-coverage projection is EPSG:4326, whose unit is degrees, so for analysis the data is sometimes converted to a local projection, which covers a smaller geographical region but uses meters as the unit.

This means that if the source projection is A and you didn't cast it to EPSG:4326, spark would mistakenly assume it's EPSG:4326 by default. Something like seeing the entirety of the UK in Africa.

Solution: verify the source projection and cast to EPSG:4326 before writing to data lake.
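
A sketch of the check-and-cast step, using geopandas for illustration (fine for files that fit in memory); the paths are placeholders:

import geopandas as gpd

gdf = gpd.read_file("input.shp")  # placeholder path
print(gdf.crs)  # verify the source projection first

# cast to EPSG:4326 before writing to the data lake
gdf = gdf.to_crs(epsg=4326)
gdf.to_file("reprojected.shp")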

Extra new line character

Sometimes when editing shapefile data by hand with applications like ArcGIS or QGIS, you might copy text which contains a "new line" character and set it as a cell value. Spark doesn't play nice with "new line" characters in the middle of a record.

Solution: strip new line characters by hand.

Yes, I really did that 😶. Thankfully it was a very small shapefile that has the issue.

Takeaways: count yourself lucky if you never have to deal with spatial data.

Spark join OOM fix
https://blog.karnwong.me/spark-join-oom-fix/ (Sun, 11 Apr 2021)

I have a big pipeline where one step performs a crossjoin on 130K x 7K records. It fails quite often, and I have to pray to the Rice God for it to pass. Today I found the solution: repartition before the crossjoin.

The root cause is that the dataframe with 130K records has 6 partitions, so when I perform the crossjoin (one-to-many) it's working against those 6 partitions. The total output in parquet is around 350MB, which means my computer (8 cores, 10GB RAM provisioned for spark) needs to be able to hold all of the uncompressed data in memory. It couldn't, hence the frequent OOMs.

So by increasing the partition count from 6 to 24, each working chunk of the dataframe is smaller, which means things can flow through faster without filling up my machine's RAM.
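
A minimal sketch of the fix (not the original pipeline code); the input paths stand in for the 130K- and 7K-record dataframes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("crossjoin-oom-fix").getOrCreate()

df_large = spark.read.parquet("large.parquet")  # placeholder inputs
df_small = spark.read.parquet("small.parquet")

# repartition the larger dataframe so each task holds less data in memory
result = df_large.repartition(24).crossJoin(df_small)
result.write.mode("overwrite").parquet("output.parquet")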

Decoding the answers you get when you ask about gluten
https://blog.karnwong.me/th-drhaskhamt-bewlaathaameruue-ngkluuetn/ (Wed, 31 Mar 2021)

Read this part, don't skip it!!!!!

Gluten is not the villain. Bread is springy and bouncy precisely because of gluten; anything made from flour with a chewy bite, you can bet it's gluten's doing. Wheat flour also has more nutrients than gluten-free flours. But some people can't eat gluten, and some clever marketers promote gluten-free as healthy, so everyone has rushed to put out gluten-free food and snacks, yet almost none of it can actually be eaten by the people who genuinely can't have gluten!

Getting to the point

This section covers the answers a manufacturer's representative (i.e. whoever picks up the phone) will give you, with notes on how each answer should actually be interpreted.


"Let me get back to you; I need to look into it first."

Almost everyone who answers like this never gets back to you. ☁️☁️☁️

"Our products contain no gluten at all. The only ingredients are A, B and C, all of which are naturally gluten-free."

The thing about cross-contamination is that it's like 💩: whatever it touches becomes 🤮 as well, even if that thing was actually clean.

Which means that no matter how gluten-free the ingredients[1] are, if the product is made on the same production line as other gluten-containing products, it still doesn't make the cut[2].

As for all the homemade products out there, no matter how gluten-free they claim to be, if they say they use the same oven as their regular (gluten-containing) products, run for the hills. Same reason as above.

"We don't recommend that customers eat our products, just to be safe."

  1. The peak absurdity is that the label doesn't say it (may) contain gluten -> then just print on the label that it may contain gluten, instead of making the person eating it gamble as if they're drawing lots.
  2. The person answering doesn't dare promise anything because they don't have the information, so they try to brush it off.

"Our label is approved by the Thai FDA, in accordance with Thai law."

If you get this answer, run for your life, because the Thai FDA doesn't care at all. The market is full of products that contain soy sauce but only declare "contains soybeans" (soy sauce is brewed with wheat flour), and the Thai FDA specifies that products whose main ingredients contain no gluten cannot be labeled gluten free[3].

"Our label complies with the law, because there's a customer protection law; otherwise we'd have to pay damages."

Run away quickly. In the US there was a case where someone ate Cheerios[4] (people who can't eat gluten know that this brand's gluten free claim is more fake than a Shenzhen knockoff) and got so sick they ended up in the emergency room. What the brand did was compensate them for the product, just the product; the emergency room bill was theirs to pay.

"Everyone's sensitivity level is different, dear customer."

Well, for people who can't eat gluten, they just can't; some can tolerate a bit more or less[5]. But if you produce it to a standard that even highly sensitive people can eat, problem solved[6], no?

(Doesn't answer the question, just talks in circles)

Hang up immediately, because they're dodging making any promise at all, afraid the customer might record the call and use it as evidence in court if they buy it, eat it, and react despite the assurance[7].


  1. But even if the entire factory doesn't use any gluten-containing ingredients at all, you still have to follow up and check whether the ingredients they receive were contaminated with gluten during transport from the supplier. ↩︎

  2. It might be fine if the production line is cleaned, but you should also ask for residual gluten test results to confirm it's below 20 PPM (mg/L -> 1 part per million). ↩︎

  3. Malt-flavored milk, cocoa-malt drinks, and many more. ↩︎

  4. https://topclassactions.com/lawsuit-settlements/consumer-products/329574-general-mills-faces-new-class-action-over-gluten-free-cheerios/ ↩︎

  5. People who can't eat gluten fall into three groups: celiac disease, gluten intolerance, and wheat allergy. For people with celiac disease, no amount of gluten is acceptable (well, at most 20 PPM per day is tolerated); the other two groups may be able to tolerate a small amount. ↩︎

  6. The EU specifies that any product with a gluten content above 20 PPM cannot be labeled gluten free. ↩︎

  7. They really did answer me like this. ↩︎

Add Ghost content to Hugo
https://blog.karnwong.me/create-static-site-from-ghost-blog/ (Wed, 31 Mar 2021)

Ghost CMS is very easy to use, but the deployment overhead (maintaining the db, ghost versions, updates, etc.) might be too much for some. Luckily, there's a way to convert a Ghost site to static pages, which you can then host on Github Pages or something similar.

Setup:

  • static site engine: Hugo
  • a Ghost instance

Usage

  1. Install https://github.com/Fried-Chicken/ghost-static-site-generator
  2. cd to static directory in your Hugo folder
  3. run gssg --domain ${YOUR_GHOST_INSTANCE_URL} --dest posts --url ${YOUR_STATIC_SITE_DOMAIN_WITHOUT_TRAILING_SLASH} --subDir posts
  4. Update your hugo config to link to the above folder:
[[menu.main]]
    identifier = "posts"
    name       = "Posts"
    url        = "/posts"

All done! 🎉🎉🎉

Hello Caddy
https://blog.karnwong.me/hello-caddy/ (Sun, 07 Mar 2021)

Since I started self-hosting back in 2017, I've always used apache2, since it's the first webserver I came across. Over time, adding more services and managing separate vhost configs became a bit tiresome.

Enter Caddy. It's very simple to set up and configure. Some services I had trouble setting up in apache2 need no extra config at all; even TLS is set up by default. Starting from Caddy 2, it works with CNAMEs by default without extra setup.

You can set it up using a Caddy docker container, but some containers I use also expose port 443, so I have to install Caddy natively instead.

For multiple sites config setup:

# /etc/caddy/Caddyfile

SUBDOMAIN1.DOMAIN.com {
    reverse_proxy 127.0.0.1:${PORT}
}
SUBDOMAIN2.DOMAIN.com {
    reverse_proxy 127.0.0.1:${PORT}
}

For basic authentication, it's very, very simple (to the point that I regret the time I spent researching it for apache2):

# generate password hash
caddy hash-password --algorithm bcrypt

# add basicauth to Caddyfile
SUBDOMAIN1.DOMAIN.com {
    basicauth * {
        ${USERNAME} ${CADDY_PASSWORD_HASH}
    }
    reverse_proxy 127.0.0.1:${PORT}
}

And run systemctl reload caddy. You're all set!

Password auth with apache2 reverse-proxy
https://blog.karnwong.me/setting-up-password-auth-with-apache2-reverse-proxy/ (Mon, 22 Feb 2021)

EDIT: see https://blog.karnwong.me/hello-caddy/ for Caddy, which is also easier to set up.

Sometimes you find an interesting project to self-host, but it doesn't have password authentication built in. Luckily, we need to reverse-proxy it anyway, and apache2 / nginx / httpd happen to provide password auth with reverse-proxy out of the box.

To set up password auth with apache2 via reverse-proxy:

  1. echo "${PASSWORD}" | htpasswd -c -i /etc/apache2/.htpasswd ${USER} on your host machine which has apache2 installed.
  2. create a vhost config:
<VirtualHost *:80>
    ProxyPreserveHost On

    ProxyPass / http://localhost:${EXPOSED_CONTAINER_PORT}/
    ProxyPassReverse / http://localhost:${EXPOSED_CONTAINER_PORT}/

    ServerName ${YOUR_DOMAIN}

    <Proxy *>
        Order deny,allow
        Allow from all
        Authtype Basic
        Authname "Password Required"
        AuthUserFile /etc/apache2/.htpasswd
        Require valid-user
    </Proxy>
</VirtualHost>

That's it!

Buying tea when you have Celiac
https://blog.karnwong.me/buying-tea-when-you-have-celiac/ (Sun, 14 Feb 2021)

It might come as a surprise to some of you, but tea can contain gluten from additives & cross-contamination, where barley or malt is added for flavoring. Teavana is known for adding such additives to their tea (and I got glutened by it one time).

Say I'm interested in some tea. I first need to look up its country of origin, because that tells me how good the food labeling laws are. If it's from Australia / New Zealand, the label will always state gluten. If it's from the EU (not pan-Europe) it's also good, except they allow products with < 20 PPM to be labeled gluten-free, but that's still fine. If it's from the USA, a lot more scrutiny is called for, because the US FDA does not enforce gluten labeling; only wheat is required. But not all gluten-y stuff is wheat :/

In this case, I'm interested in Turkish tea. So I asked a Turkish friend to do some digging for me on the manufacturer's website, since I couldn't find the info in English. He reported that he didn't find any info about gluten on the website, but he sent me a photo of the package where the ingredients are listed. I spotted German on the label, which is a good sign because it means this product is also sold in Germany, and food labels have to comply with the laws of the country it's sold in too*.

That's pretty much the end of the story. Ordered the tea and brewed it. Tastes very good!

*Thanks to a local Thai QA who told me about this 🙏

Workarounds for archiving large shapefile in data lake
https://blog.karnwong.me/workarounds-for-archiving-large-shapefile-in-data-lake/ (Sun, 31 Jan 2021)

If you work with spatial data, chances are you are familiar with shapefile, a file format for viewing / editing spatial data.

Essentially, a shapefile is just tabular data like csv, but it handles the geometry data type in a way that gis tools like qgis or arcgis understand right away. If you have a csv file with a geometry column in wkt format (something like POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))), you'll have to specify which column is to be used as the geometry.
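
For example, with geopandas (0.9+) you have to build the geometry column yourself from the WKT text; the file and column names below are placeholders:

import geopandas as gpd
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path, with a "wkt" column
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkt(df["wkt"]), crs="EPSG:4326")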

If you want to store a shapefile in a data lake, it's best to store it as parquet or whatever format you normally use, since it's faster to read and filter. For comparison, parsing a 5GB+ shapefile and filtering it takes longer than reading a gzipped json, filtering, and exporting to shapefile.

Normally I would use geopandas to read spatial data and convert it to a pandas dataframe, then send it to spark. But since the shapefile is very large, it takes forever to read in geopandas. This tells me that there is a parsing bottleneck going on. And geopandas can't read a shapefile with multiple geometry types (this shouldn't happen, but sometimes during editing, clipping here and there can cause invalid geometries).

Qgis has a tool to fix invalid geometries, so I tried exporting the shapefile to csv, but qgis went OOM. Both qgis and geopandas use gdal as the backend, and gdal has a CLI interface, so I looked up how to export a shapefile to tsv (a tab separator makes it faster to parse, since tabs rarely occur in the data).
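
A sketch of that conversion, driven from python; the CSV driver's GEOMETRY=AS_WKT and SEPARATOR=TAB layer creation options do the work, and the paths are placeholders:

import subprocess

subprocess.run(
    [
        "ogr2ogr",
        "-f", "CSV",
        "output.tsv",   # placeholder output
        "input.shp",    # placeholder input
        "-lco", "GEOMETRY=AS_WKT",
        "-lco", "SEPARATOR=TAB",
        "-skipfailures",  # keep going past features gdal cannot translate
    ],
    check=True,
)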

Now things work perfectly. As a bonus, gdal also skips invalid geometries by default (unlike geopandas, which throws an error with no way to ignore it and tell the parser to keep going).

At this point I have a nice tsv file, and reading & archiving via spark is now a breeze. Yay!
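
From there the spark side is just a plain delimited read and a parquet write; a minimal sketch with placeholder paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tsv-to-datalake").getOrCreate()

df = spark.read.csv("output.tsv", sep="\t", header=True)
df.write.mode("overwrite").parquet("s3a://datalake/regions/")  # placeholder destination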

Takeaway

  • If it takes too long to read, maybe it's a parsing bottleneck. Find a way to convert it to another format so it's easier to parse.
  • Sometimes your initial tools of choice might have some quirks. In most cases there will be similar tools out there that can work around the issue. (In this case, use gdal to convert to csv in lieu of geopandas, because gpd can't handle invalid geometries & takes longer to read compared to feeding spark a straight csv/tsv.)

Mongodb export woes
https://blog.karnwong.me/mongodb-export-woes/ (Wed, 27 Jan 2021)

There's a task where I need to export 4M+ records out of mongodb, total uncompressed size 17GB+ 26GB.

export methods

mongoexport

The recommended way to export is the mongoexport utility, but you have to specify the output attributes, which doesn't work for me because the older set of records has fewer schema attributes than the newer set.

DIY python script

the vanilla way

But you can interact with mongodb from python, and reads return dicts, which is perfect here because you don't have to specify the required attributes beforehand. So what I do is:

import json
from bson import json_util  # ships with pymongo; handles ObjectId / datetime
from pymongo import MongoClient
from tqdm import tqdm

collection = MongoClient()["mydb"]["mycollection"]  # assumed db / collection names
filename = "export.jsonl"  # assumed output path
myconverter = json_util.default  # converts ObjectId, datetime, etc.
cursor = collection.find({})
total_records = collection.estimated_document_count()
with open(filename, 'w') as f:
    for i in tqdm(cursor, total=total_records):
        f.write(json.dumps(i, default=myconverter, ensure_ascii=False))
        f.write('\n')

The con of this solution is that it needs a lot of hdd space, since the output is uncompressed. But it works best if you need to export a collection with mismatched schemas.

the incremental export way

You can also incrementally export your collection from mongodb using .skip($START_INDEX).limit($INCREMENT_SIZE), but it performs worse than the vanilla way, since mongodb just iterates through everything all over again to get to your specified start:end index.
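
A rough sketch of what the incremental approach looks like (kept here only for comparison, since it's slower); the db, collection, batch size, and output path are placeholders:

import json
from bson import json_util
from pymongo import MongoClient

collection = MongoClient()["mydb"]["mycollection"]  # placeholder names
batch_size = 100_000

with open("export_incremental.jsonl", "w") as f:
    start = 0
    while True:
        batch = list(collection.find({}).skip(start).limit(batch_size))
        if not batch:
            break
        for doc in batch:
            f.write(json.dumps(doc, default=json_util.default, ensure_ascii=False) + "\n")
        start += batch_size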

Performance comparison

On my local machine (<10 MB/s transfer speed) I could export a collection with around 4.5M records within 1 hour, but on a VPS with incremental export it takes 9 hours and counting.

Takeaway

Please do not store a large dataset in mongodb if you need to dump everything out, especially if you use it as a raw data source. It's fine if you store prepped output for an API to query via _id (the primary key).
