Parsing large log files
#1
Hi everyone,

Recently I needed to analyze a fairly large (500 MB) mod_security log file, and in "real" time, i.e. to register every write to the file immediately and to see the changes in a form that is easy to use for analyzing attacks on a given web site.
The file also changes and rotates very quickly.

The solution I went with was to write a parser in Python that works much like "tail -f logfile", and I'm curious whether there is a better approach:

Code:
import time

import psutil

with open('/var/log/apache2/modsec_audit.log') as f:
    while True:  # This will wait for the next line in the file
        if psutil.cpu_percent() > 50:
            time.sleep(0.2)  # CPU usage protection
        line = f.readline()
        if line:
            pass  # Do something with the line from the file:
                  # put it in the DB, etc.
        else:
            time.sleep(0.1)  # no new data yet; don't spin at EOF

This works great because it reads line by line and doesn't use much memory. The problem of the while loop pegging the CPU I solved with sleep (is there a better suggestion?). I had one more problem, the script getting interrupted mid-run or at any point, and I solved that by moving the file pointer back to the position of the last record written to the database.
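The resume-after-interruption idea above can be sketched roughly like this. This is a minimal sketch under my own assumptions: the offset is persisted to a plain state file rather than to the database, and the `handle` callback stands in for the DB insert.

```python
def load_offset(state_file):
    """Return the last safely processed byte offset, or 0 on first run."""
    try:
        with open(state_file) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0

def save_offset(state_file, offset):
    """Persist the offset only after the line has been committed (DB, etc.)."""
    with open(state_file, 'w') as f:
        f.write(str(offset))

def process_new_lines(log_path, state_file, handle):
    """Process lines appended since the last run; safe to interrupt anywhere."""
    with open(log_path) as f:
        f.seek(load_offset(state_file))        # skip what was already handled
        for line in iter(f.readline, ''):
            handle(line)                       # e.g. insert into the DB
            save_offset(state_file, f.tell())  # crash-safe resume point
```

Since the offset is saved only after each line is handled, a crash at worst reprocesses the one line that was in flight.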

The analysis part is in HTML/PHP and it isn't a concern, since it reads the data from the DB.

@mikikg, if I remember right, you did something with some large logs?


Hvala,
Ivan
“If you think you are too small to make a difference, try sleeping with a mosquito.” - Dalai Lama XIV
#2
I worked with a large amount of files, around 4 TB that I had to "chew through", but those were fixed (Apache) log files, not "live" in the sense of the tail -f trick. Individual files were about 200-600 MB.
There I applied reading in chunks: I moved the file pointer and read, say, 0-10 MB and did the parsing, then 10-20 MB, and so on. Naturally, you can't load the whole file into memory, and even if you could, everything would be terribly slow.
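The chunked reading described above can be sketched like this. A sketch under my assumptions: the 10 MB default and the way a partial line at the chunk boundary is carried over to the next chunk are illustrative, not how mikikg actually did it.

```python
def parse_in_chunks(path, chunk_size=10 * 1024 * 1024):
    """Read a large log in fixed-size chunks, yielding complete lines.

    A chunk boundary usually falls mid-line, so the trailing partial
    line is carried over and prepended to the next chunk.
    """
    leftover = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)  # advances the file pointer itself
            if not chunk:
                if leftover:
                    yield leftover      # file did not end with a newline
                return
            chunk = leftover + chunk
            lines = chunk.split(b'\n')
            leftover = lines.pop()      # possibly incomplete last line
            for line in lines:
                yield line
```

Memory stays bounded by roughly one chunk plus one line, regardless of the total file size.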

You should think this problem through in a bit more detail, because the "polling" variant (constantly checking whether there is anything new) isn't the happiest solution; you need to get to an event-driven variant somehow, so that some function fires exactly when the change happens.

I never had a need for that myself, so I didn't look into it much, but there is certainly some system-level mechanism for something like that …

See whether NodeJS has something on offer for that …

BTW: I don't know what your parsing result is and where you store it, but in my case I had about 10 billion rows, and there was no way that could fit into MySQL! I had to chase down other engines: Cassandra, CouchDB, InfiniDB, Hadoop/HBase, i.e. specialized analytical databases ...
#3
Yes, I've been looking at how to switch to an event-driven variant, using this: http://pyinotify.sourceforge.net/. I'll have to test both approaches ... For now MySQL will be OK, since I don't need to keep that much data. This is more or less applied only in certain situations, during an attack.
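For reference, the kernel facility pyinotify wraps is Linux's inotify, and it can be exercised directly; a minimal, Linux-only sketch via ctypes (the `IN_MODIFY` constant and syscall names come from `sys/inotify.h` and `inotify(7)`; the single-event read and 4096-byte buffer are my simplifications):

```python
import ctypes
import os
import struct

IN_MODIFY = 0x00000002  # from <sys/inotify.h>

def wait_for_modification(path):
    """Block until `path` is modified, using the raw inotify syscalls."""
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    try:
        wd = libc.inotify_add_watch(fd, path.encode(), IN_MODIFY)
        if wd < 0:
            raise OSError(ctypes.get_errno(), "inotify_add_watch failed")
        data = os.read(fd, 4096)  # blocks until an event arrives
        _wd, mask, _cookie, _name_len = struct.unpack_from("iIII", data)
        return bool(mask & IN_MODIFY)
    finally:
        os.close(fd)
```

In a real tail-follow loop you would call this instead of sleeping, then `readline()` until EOF; pyinotify gives you the same thing with a nicer callback API.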
#4
An interesting solution:

Quote:InfluxDB is a time series database built from the ground up to handle high write and query loads. It is the second piece of the TICK stack. InfluxDB is meant to be used as a backing store for any use case involving large amounts of timestamped data, including DevOps monitoring, application metrics, IoT sensor data, and real-time analytics.

Quote:Key Features

Here are some of the features that InfluxDB currently supports that make it a great choice for working with time series data.
  • Custom high performance datastore written specifically for time series data. The TSM engine allows for high ingest speed and data compression.
  • Written entirely in Go. It compiles into a single binary with no external dependencies.
  • Simple, high performing write and query HTTP(S) APIs.
  • Plugins support for other data ingestion protocols such as Graphite, collectd, and OpenTSDB.
  • Expressive SQL-like query language tailored to easily query aggregated data.
  • Tags allow series to be indexed for fast and efficient queries.
  • Retention policies efficiently auto-expire stale data.
  • Continuous queries automatically compute aggregate data to make frequent queries more efficient.
  • Built in web admin interface.
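Points go into that HTTP write API in InfluxDB's documented line protocol (`measurement,tag=value field=value timestamp`); a small formatting sketch — the `modsec` measurement and tag/field names are made up for this use case, and tag-value escaping is omitted:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one point as an InfluxDB line-protocol string."""
    tag_part = ','.join(f'{k}={v}' for k, v in sorted(tags.items()))

    def fmt(v):
        # strings are double-quoted, booleans literal, integers get an 'i' suffix
        if isinstance(v, str):
            return '"%s"' % v.replace('"', '\\"')
        if isinstance(v, bool):
            return 'true' if v else 'false'
        if isinstance(v, int):
            return '%di' % v
        return repr(v)

    field_part = ','.join(f'{k}={fmt(v)}' for k, v in sorted(fields.items()))
    line = f'{measurement},{tag_part} {field_part}'
    if timestamp_ns is not None:
        line += f' {timestamp_ns}'  # nanosecond precision by default
    return line
```

The resulting lines are POSTed in batches to the `/write` endpoint.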

It's free only for a single node.

Quote:However, the open source edition of InfluxDB runs on a single node. If your requirements dictate a high-availability setup to eliminate a single point of failure, you should explore InfluxDB Enterprise Edition.

Links:

https://github.com/influxdata/influxdb
https://docs.influxdata.com/influxdb/v1.3/

An interesting presentation ("Challenges of monitoring distributed systems") that I listened to:

https://www.slideshare.net/NenadBozic2/c...s-76090175